machine learning

new-apple-study-challenges-whether-ai-models-truly-“reason”-through-problems

New Apple study challenges whether AI models truly “reason” through problems


Puzzle-based experiments reveal limitations of simulated reasoning, but others dispute findings.

An illustration of Tower of Hanoi from Popular Science in 1885. Credit: Public Domain

In early June, Apple researchers released a study suggesting that simulated reasoning (SR) models, such as OpenAI’s o1 and o3, DeepSeek-R1, and Claude 3.7 Sonnet Thinking, produce outputs consistent with pattern-matching from training data when faced with novel problems requiring systematic thinking. The researchers found similar results to a recent study by the United States of America Mathematical Olympiad (USAMO) in April, showing that these same models achieved low scores on novel mathematical proofs.

The new study, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” comes from a team at Apple led by Parshin Shojaee and Iman Mirzadeh, and it includes contributions from Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar.

The researchers examined what they call “large reasoning models” (LRMs), which attempt to simulate a logical reasoning process by producing a deliberative text output sometimes called “chain-of-thought reasoning” that ostensibly assists with solving problems in a step-by-step fashion.

To do that, they pitted the AI models against four classic puzzles—Tower of Hanoi (moving disks between pegs), checkers jumping (eliminating pieces), river crossing (transporting items with constraints), and blocks world (stacking blocks)—scaling them from trivially easy (like one-disk Hanoi) to extremely complex (20-disk Hanoi requiring over a million moves).

Figure 1 from Apple's

Figure 1 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

“Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy,” the researchers write. In other words, today’s tests only care if the model gets the right answer to math or coding problems that may already be in its training data—they don’t examine whether the model actually reasoned its way to that answer or simply pattern-matched from examples it had seen before.

Ultimately, the researchers found results consistent with the aforementioned USAMO research, showing that these same models achieved mostly under 5 percent on novel mathematical proofs, with only one model reaching 25 percent, and not a single perfect proof among nearly 200 attempts. Both research teams documented severe performance degradation on problems requiring extended systematic reasoning.

Known skeptics and new evidence

AI researcher Gary Marcus, who has long argued that neural networks struggle with out-of-distribution generalization, called the Apple results “pretty devastating to LLMs.” While Marcus has been making similar arguments for years and is known for his AI skepticism, the new research provides fresh empirical support for his particular brand of criticism.

“It is truly embarrassing that LLMs cannot reliably solve Hanoi,” Marcus wrote, noting that AI researcher Herb Simon solved the puzzle in 1957 and many algorithmic solutions are available on the web. Marcus pointed out that even when researchers provided explicit algorithms for solving Tower of Hanoi, model performance did not improve—a finding that study co-lead Iman Mirzadeh argued shows “their process is not logical and intelligent.”

Figure 4 from Apple's

Figure 4 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

The Apple team found that simulated reasoning models behave differently from “standard” models (like GPT-4o) depending on puzzle difficulty. On easy tasks, such as Tower of Hanoi with just a few disks, standard models actually won because reasoning models would “overthink” and generate long chains of thought that led to incorrect answers. On moderately difficult tasks, SR models’ methodical approach gave them an edge. But on truly difficult tasks, including Tower of Hanoi with 10 or more disks, both types failed entirely, unable to complete the puzzles, no matter how much time they were given.

The researchers also identified what they call a “counterintuitive scaling limit.” As problem complexity increases, simulated reasoning models initially generate more thinking tokens but then reduce their reasoning effort beyond a threshold, despite having adequate computational resources.

The study also revealed puzzling inconsistencies in how models fail. Claude 3.7 Sonnet could perform up to 100 correct moves in Tower of Hanoi but failed after just five moves in a river crossing puzzle—despite the latter requiring fewer total moves. This suggests the failures may be task-specific rather than purely computational.

Competing interpretations emerge

However, not all researchers agree with the interpretation that these results demonstrate fundamental reasoning limitations. University of Toronto economist Kevin A. Bryan argued on X that the observed limitations may reflect deliberate training constraints rather than inherent inabilities.

“If you tell me to solve a problem that would take me an hour of pen and paper, but give me five minutes, I’ll probably give you an approximate solution or a heuristic. This is exactly what foundation models with thinking are RL’d to do,” Bryan wrote, suggesting that models are specifically trained through reinforcement learning (RL) to avoid excessive computation.

Bryan suggests that unspecified industry benchmarks show “performance strictly increases as we increase in tokens used for inference, on ~every problem domain tried,” but notes that deployed models intentionally limit this to prevent “overthinking” simple queries. This perspective suggests the Apple paper may be measuring engineered constraints rather than fundamental reasoning limits.

Figure 6 from Apple's

Figure 6 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

Software engineer Sean Goedecke offered a similar critique of the Apple paper on his blog, noting that when faced with Tower of Hanoi requiring over 1,000 moves, DeepSeek-R1 “immediately decides ‘generating all those moves manually is impossible,’ because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails.” Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it.

Other researchers also question whether these puzzle-based evaluations are even appropriate for LLMs. Independent AI researcher Simon Willison told Ars Technica in an interview that the Tower of Hanoi approach was “not exactly a sensible way to apply LLMs, with or without reasoning,” and suggested the failures might simply reflect running out of tokens in the context window (the maximum amount of text an AI model can process) rather than reasoning deficits. He characterized the paper as potentially overblown research that gained attention primarily due to its “irresistible headline” about Apple claiming LLMs don’t reason.

The Apple researchers themselves caution against over-extrapolating the results of their study, acknowledging in their limitations section that “puzzle environments represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems.” The paper also acknowledges that reasoning models show improvements in the “medium complexity” range and continue to demonstrate utility in some real-world applications.

Implications remain contested

Have the credibility of claims about AI reasoning models been completely destroyed by these two studies? Not necessarily.

What these studies may suggest instead is that the kinds of extended context reasoning hacks used by SR models may not be a pathway to general intelligence, like some have hoped. In that case, the path to more robust reasoning capabilities may require fundamentally different approaches rather than refinements to current methods.

As Willison noted above, the results of the Apple study have so far been explosive in the AI community. Generative AI is a controversial topic, with many people gravitating toward extreme positions in an ongoing ideological battle over the models’ general utility. Many proponents of generative AI have contested the Apple results, while critics have latched onto the study as a definitive knockout blow for LLM credibility.

Apple’s results, combined with the USAMO findings, seem to strengthen the case made by critics like Marcus that these systems rely on elaborate pattern-matching rather than the kind of systematic reasoning their marketing might suggest. To be fair, much of the generative AI space is so new that even its inventors do not yet fully understand how or why these techniques work. In the meantime, AI companies might build trust by tempering some claims about reasoning and intelligence breakthroughs.

However, that doesn’t mean these AI models are useless. Even elaborate pattern-matching machines can be useful in performing labor-saving tasks for the people that use them, given an understanding of their drawbacks and confabulations. As Marcus concedes, “At least for the next decade, LLMs (with and without inference time “reasoning”) will continue have their uses, especially for coding and brainstorming and writing.”

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

New Apple study challenges whether AI models truly “reason” through problems Read More »

hollywood-studios-target-ai-image-generator-in-copyright-lawsuit

Hollywood studios target AI image generator in copyright lawsuit

The legal action follows similar moves in other creative industries, with more than a dozen major news companies suing AI company Cohere in February over copyright concerns. In 2023, a group of visual artists sued Midjourney for similar reasons.

Studios claim Midjourney knows what it’s doing

Beyond allowing users to create these images, the studios argue that Midjourney actively promotes copyright infringement by displaying user-generated content featuring copyrighted characters in its “Explore” section. The complaint states this curation “show[s] that Midjourney knows that its platform regularly reproduces Plaintiffs’ Copyrighted Works.”

The studios also allege that Midjourney has technical protection measures available that could prevent outputs featuring copyrighted material but has “affirmatively chosen not to use copyright protection measures to limit the infringement.” They cite Midjourney CEO David Holz admitting the company “pulls off all the data it can, all the text it can, all the images it can” for training purposes.

According to Axios, Disney and NBCUniversal attempted to address the issue with Midjourney before filing suit. While the studios say other AI platforms agreed to implement measures to stop IP theft, Midjourney “continued to release new versions of its Image Service” with what Holz allegedly described as “even higher quality infringing images.”

“We are bringing this action today to protect the hard work of all the artists whose work entertains and inspires us and the significant investment we make in our content,” said Kim Harris, NBCUniversal’s executive vice president and general counsel, in a statement.

This lawsuit signals a new front in Hollywood’s conflict over AI. Axios highlights this shift: While actors and writers have fought to protect their name, image, and likeness from studio exploitation, now the studios are taking on tech companies over intellectual property concerns. Other major studios, including Amazon, Netflix, Paramount Pictures, Sony, and Warner Bros., have not yet joined the lawsuit, though they share membership with Disney and Universal in the Motion Picture Association.

Hollywood studios target AI image generator in copyright lawsuit Read More »

apple-tiptoes-with-modest-ai-updates-while-rivals-race-ahead

Apple tiptoes with modest AI updates while rivals race ahead

Developers, developers, developers?

Being the Worldwide Developers Conference, it seems appropriate that Apple also announced it would open access to its on-device AI language model to third-party developers. It also announced it would integrate OpenAI’s code completion tools into its XCode development software.

Craig Federighi stands in front of a screen with the words

Apple Intelligence was first unveiled at WWDC 2024. Credit: Apple

“We’re opening up access for any app to tap directly into the on-device, large language model at the core of Apple,” said Craig Federighi, Apple’s software chief, during the presentation. The company also demonstrated early partner integration by adding OpenAI’s ChatGPT image generation to its Image Playground app, though it said user data would not be shared without permission.

For developers, Apple’s inclusion of ChatGPT’s code-generation capabilities in XCode may represent Apple’s attempt to match what rivals like GitHub Copilot and Cursor offer software developers in terms of AI coding augmentation, even as the company maintains a more cautious approach to consumer-facing AI features.

Meanwhile, competitors like Meta, Anthropic, OpenAI, and Microsoft continue to push more aggressively into the AI space, offering AI assistants (that admittedly still make things up and suffer from other issues, such as sycophancy).

Only time will tell if Apple’s wariness to embrace the bleeding edge of AI will be a curse (eventually labeled as a blunder) or a blessing (lauded as a wise strategy). Perhaps, in time, Apple will step in with a solid and reliable AI assistant solution that makes Siri useful again. But for now, Apple Intelligence remains more of a clever brand name than a concrete set of notable products.

Apple tiptoes with modest AI updates while rivals race ahead Read More »

anthropic-releases-custom-ai-chatbot-for-classified-spy-work

Anthropic releases custom AI chatbot for classified spy work

On Thursday, Anthropic unveiled specialized AI models designed for US national security customers. The company released “Claude Gov” models that were built in response to direct feedback from government clients to handle operations such as strategic planning, intelligence analysis, and operational support. The custom models reportedly already serve US national security agencies, with access restricted to those working in classified environments.

The Claude Gov models differ from Anthropic’s consumer and enterprise offerings, also called Claude, in several ways. They reportedly handle classified material, “refuse less” when engaging with classified information, and are customized to handle intelligence and defense documents. The models also feature what Anthropic calls “enhanced proficiency” in languages and dialects critical to national security operations.

Anthropic says the new models underwent the same “safety testing” as all Claude models. The company has been pursuing government contracts as it seeks reliable revenue sources, partnering with Palantir and Amazon Web Services in November to sell AI tools to defense customers.

Anthropic is not the first company to offer specialized chatbot services for intelligence agencies. In 2024, Microsoft launched an isolated version of OpenAI’s GPT-4 for the US intelligence community after 18 months of work. That system, which operated on a special government-only network without Internet access, became available to about 10,000 individuals in the intelligence community for testing and answering questions.

Anthropic releases custom AI chatbot for classified spy work Read More »

“in-10-years,-all-bets-are-off”—anthropic-ceo-opposes-decadelong-freeze-on-state-ai-laws

“In 10 years, all bets are off”—Anthropic CEO opposes decadelong freeze on state AI laws

On Thursday, Anthropic CEO Dario Amodei argued against a proposed 10-year moratorium on state AI regulation in a New York Times opinion piece, calling the measure shortsighted and overbroad as Congress considers including it in President Trump’s tax policy bill. Anthropic makes Claude, an AI assistant similar to ChatGPT.

Amodei warned that AI is advancing too fast for such a long freeze, predicting these systems “could change the world, fundamentally, within two years; in 10 years, all bets are off.”

As we covered in May, the moratorium would prevent states from regulating AI for a decade. A bipartisan group of state attorneys general has opposed the measure, which would preempt AI laws and regulations recently passed in dozens of states.

In his op-ed piece, Amodei said the proposed moratorium aims to prevent inconsistent state laws that could burden companies or compromise America’s competitive position against China. “I am sympathetic to these concerns,” Amodei wrote. “But a 10-year moratorium is far too blunt an instrument. A.I. is advancing too head-spinningly fast.”

Instead of a blanket moratorium, Amodei proposed that the White House and Congress create a federal transparency standard requiring frontier AI developers to publicly disclose their testing policies and safety measures. Under this framework, companies working on the most capable AI models would need to publish on their websites how they test for various risks and what steps they take before release.

“Without a clear plan for a federal response, a moratorium would give us the worst of both worlds—no ability for states to act and no national policy as a backstop,” Amodei wrote.

Transparency as the middle ground

Amodei emphasized his claims for AI’s transformative potential throughout his op-ed, citing examples of pharmaceutical companies drafting clinical study reports in minutes instead of weeks and AI helping to diagnose medical conditions that might otherwise be missed. He wrote that AI “could accelerate economic growth to an extent not seen for a century, improving everyone’s quality of life,” a claim that some skeptics believe may be overhyped.

“In 10 years, all bets are off”—Anthropic CEO opposes decadelong freeze on state AI laws Read More »

ai-video-just-took-a-startling-leap-in-realism.-are-we-doomed?

AI video just took a startling leap in realism. Are we doomed?


Tales from the cultural singularity

Google’s Veo 3 delivers AI videos of realistic people with sound and music. We put it to the test.

Still image from an AI-generated Veo 3 video of “A 1980s fitness video with models in leotards wearing werewolf masks.” Credit: Google

Last week, Google introduced Veo 3, its newest video generation model that can create 8-second clips with synchronized sound effects and audio dialog—a first for the company’s AI tools. The model, which generates videos at 720p resolution (based on text descriptions called “prompts” or still image inputs), represents what may be the most capable consumer video generator to date, bringing video synthesis close to a point where it is becoming very difficult to distinguish between “authentic” and AI-generated media.

Google also launched Flow, an online AI filmmaking tool that combines Veo 3 with the company’s Imagen 4 image generator and Gemini language model, allowing creators to describe scenes in natural language and manage characters, locations, and visual styles in a web interface.

An AI-generated video from Veo 3: “ASMR scene of a woman whispering “Moonshark” into a microphone while shaking a tambourine”

Both tools are now available to US subscribers of Google AI Ultra, a plan that costs $250 a month and comes with 12,500 credits. Veo 3 videos cost 150 credits per generation, allowing 83 videos on that plan before you run out. Extra credits are available for the price of 1 cent per credit in blocks of $25, $50, or $200. That comes out to about $1.50 per video generation. But is the price worth it? We ran some tests with various prompts to see what this technology is truly capable of.

How does Veo work?

Like other modern video generation models, Veo 3 is built on diffusion technology—the same approach that powers image generators like Stable Diffusion and Flux. The training process works by taking real videos and progressively adding noise to them until they become pure static, then teaching a neural network to reverse this process step by step. During generation, Veo 3 starts with random noise and a text prompt, then iteratively refines that noise into a coherent video that matches the description.

AI-generated video from Veo 3: “An old professor in front of a class says, ‘Without a firm historical context, we are looking at the dawn of a new era of civilization: post-history.'”

DeepMind won’t say exactly where it sourced the content to train Veo 3, but YouTube is a strong possibility. Google owns YouTube, and DeepMind previously told TechCrunch that Google models like Veo “may” be trained on some YouTube material.

It’s important to note that Veo 3 is a system composed of a series of AI models, including a large language model (LLM) to interpret user prompts to assist with detailed video creation, a video diffusion model to create the video, and an audio generation model that applies sound to the video.

An AI-generated video from Veo 3: “A male stand-up comic on stage in a night club telling a hilarious joke about AI and crypto with a silly punchline.” An AI language model built into Veo 3 wrote the joke.

In an attempt to prevent misuse, DeepMind says it’s using its proprietary watermarking technology, SynthID, to embed invisible markers into frames Veo 3 generates. These watermarks persist even when videos are compressed or edited, helping people potentially identify AI-generated content. As we’ll discuss more later, though, this may not be enough to prevent deception.

Google also censors certain prompts and outputs that breach the company’s content agreement. During testing, we encountered “generation failure” messages for videos that involve romantic and sexual material, some types of violence, mentions of certain trademarked or copyrighted media properties, some company names, certain celebrities, and some historical events.

Putting Veo 3 to the test

Perhaps the biggest change with Veo 3 is integrated audio generation, although Meta previewed a similar audio-generation capability with “Movie Gen” last October, and AI researchers have experimented with using AI to add soundtracks to silent videos for some time. Google DeepMind itself showed off an AI soundtrack-generating model in June 2024.

An AI-generated video from Veo 3: “A middle-aged balding man rapping indie core about Atari, IBM, TRS-80, Commodore, VIC-20, Atari 800, NES, VCS, Tandy 100, Coleco, Timex-Sinclair, Texas Instruments”

Veo 3 can generate everything from traffic sounds to music and character dialogue, though our early testing reveals occasional glitches. Spaghetti makes crunching sounds when eaten (as we covered last week, with a nod to the famous Will Smith AI spaghetti video), and in scenes with multiple people, dialogue sometimes comes from the wrong character’s mouth. But overall, Veo 3 feels like a step change in video synthesis quality and coherency over models from OpenAI, Runway, Minimax, Pika, Meta, Kling, and Hunyuanvideo.

The videos also tend to show garbled subtitles that almost match the spoken words, which is an artifact of subtitles on videos present in the training data. The AI model is imitating what it has “seen” before.

An AI-generated video from Veo 3: “A beer commercial for ‘CATNIP’ beer featuring a real a cat in a pickup truck driving down a dusty dirt road in a trucker hat drinking a can of beer while country music plays in the background, a man sings a jingle ‘Catnip beeeeeeeeeeeeeeeeer’ holding the note for 6 seconds”

We generated each of the eight-second-long 720p videos seen below using Google’s Flow platform. Each video generation took around three to five minutes to complete, and we paid for them ourselves. It’s important to note that better results come from cherry-picking—running the same prompt multiple times until you find a good result. Due to cost and in the spirit of testing, we only ran every prompt once, unless noted.

New audio prompts

Let’s dive right into the deep end with audio generation to get a grip on what this technology can do. We’ve previously shown you a man singing about spaghetti and a rapping shark in our last Veo 3 piece, but here’s some more complex dialogue.

Since 2022, we’ve been using the prompt “a muscular barbarian with weapons beside a CRT television set, cinematic, 8K, studio lighting” to test AI image generators like Midjourney. It’s time to bring that barbarian to life.

A muscular barbarian man holding an axe, standing next to a CRT television set. He looks at the TV, then to the camera and literally says, “You’ve been looking for this for years: a muscular barbarian with weapons beside a CRT television set, cinematic, 8K, studio lighting. Got that, Benj?”

The video above represents significant technical progress in AI media synthesis over the course of only three years. We’ve gone from a blurry colorful still-image barbarian to a photorealistic guy that talks to us in 720p high definition with audio. Most notably, there’s no reason to believe technical capability in AI generation will slow down from here.

Horror film: A scared woman in a Victorian outfit running through a forest, dolly shot, being chased by a man in a peanut costume screaming, “Wait! You forgot your wallet!”

Trailer for The Haunted Basketball Train: a Tim Burton film where 1990s basketball star is stuck at the end of a haunted passenger train with basketball court cars, and the only way to survive is to make it to the engine by beating different ghosts at basketball in every car

ASMR video of a muscular barbarian man whispering slowly into a microphone, “You love CRTs, don’t you? That’s OK. It’s OK to love CRT televisions and barbarians.”

1980s PBS show about a man with a beard talking about how his Apple II computer can “connect to the world through a series of tubes”

A 1980s fitness video with models in leotards wearing werewolf masks

A female therapist looking at the camera, zoom call. She says, “Oh my lord, look at that Atari 800 you have behind you! I can’t believe how nice it is!”

With this technology, one can easily imagine a virtual world of AI personalities designed to flatter people. This is a fairly innocent example about a vintage computer, but you can extrapolate, making the fake person talk about any topic at all. There are limits due to Google’s filters, but from what we’ve seen in the past, a future uncensored version of a similarly capable AI video generator is very likely.

Video call screenshot capture of a Zoom chat. A psychologist in a dark, cozy therapist’s office. The therapist says in a friendly voice, “Hi Tom, thanks for calling. Tell me about how you’re feeling today. Is the depression still getting to you? Let’s work on that.”

1960s NASA footage of the first man stepping onto the surface of the Moon, who squishes into a pile of mud and yells in a hillbilly voice, “What in tarnation??”

A local TV news interview of a muscular barbarian talking about why he’s always carrying a CRT TV set around with him

Speaking of fake news interviews, Veo 3 can generate plenty of talking anchor-persons, although sometimes on-screen text is garbled if you don’t specify exactly what it should say. It’s in cases like this where it seems Veo 3 might be most potent at casual media deception.

Footage from a news report about Russia invading the United States

Attempts at music

Veo 3’s AI audio generator can create music in various genres, although in practice, the results are typically simplistic. Still, it’s a new capability for AI video generators. Here are a few examples in various musical genres.

A PBS show of a crazy barbarian with a blonde afro painting pictures of Trees, singing “HAPPY BIG TREES” to some music while he paints

A 1950s cowboy rides up to the camera and sings in country music, “I love mah biiig ooold donkeee”

A 1980s hair metal band drives up to the camera and sings in rock music, “Help me with my huge huge huge hair!”

Mister Rogers’ Neighborhood PBS kids show intro done with psychedelic acid rock and colored lights

1950s musical jazz group with a scat singer singing about pickles amid gibberish

A trip-hop rap song about Ars Technica being sung by a guy in a large rubber shark costume on a stage with a full moon in the background

Some classic prompts from prior tests

The prompts below come from our previous video tests of Gen-3, Video-01, and the open source Hunyuanvideo, so you can flip back to those articles and compare the results if you want to. Overall, Veo 3 appears to have far greater temporal coherency (having a consistent subject or theme over time) than the earlier video synthesis models we’ve tested. But of course, it’s not perfect.

A highly intelligent person reading ‘Ars Technica’ on their computer when the screen explodes

The moonshark jumping out of a computer screen and attacking a person

A herd of one million cats running on a hillside, aerial view

Video game footage of a dynamic 1990s third-person 3D platform game starring an anthropomorphic shark boy

Aerial shot of a small American town getting deluged with liquid cheese after a massive cheese rainstorm where liquid cheese rained down and dripped all over the buildings

Wide-angle shot, starting with the Sasquatch at the center of the stage giving a TED talk about mushrooms, then slowly zooming in to capture its expressive face and gestures, before panning to the attentive audience

Some notable failures

Google’s Veo 3 isn’t perfect at synthesizing every scenario we can throw at it due to limitations of training data. As we noted in our previous coverage, AI video generators remain fundamentally imitative, making predictions based on statistical patterns rather than a true understanding of physics or how the world works.

For example, if you see mouths moving during speech, or clothes wrinkling in a certain way when touched, it means the neural network doing the video generation has “seen” enough similar examples of that scenario in the training data to render a convincing take on it and apply it to similar situations.

However, when a novel situation (or combination of themes) isn’t well-represented in the training data, you’ll see “impossible” or illogical things happen, such as weird body parts, magically appearing clothing, or an object that “shatters” but remains in the scene afterward, as you’ll see below.

We mentioned audio and video glitches in the introduction. In particular, scenes with multiple people sometimes confuse which character is speaking, such as this argument between tech fans.

A 2000s TV debate between fans of the PowerPC and Intel Pentium chips

Bombastic 1980s infomercial for the “Ars Technica” online service. With cheesy background music and user testimonials

1980s Rambo fighting Soviets on the Moon

Sometimes requests don’t make coherent sense. In this case, “Rambo” is correctly on the Moon firing a gun, but he’s not wearing a spacesuit. He’s a lot tougher than we thought.

An animated infographic showing how many floppy disks it would take to hold an installation of Windows 11

Large amounts of text also present a weak point, but if a short text quotation is explicitly specified in the prompt, Veo 3 usually gets it right.

A young woman doing a complex floor gymnastics routine at the Olympics, featuring running and flips

Despite Veo 3’s advances in temporal coherency and audio generation, it still suffers from the same “jabberwockies” we saw in OpenAI’s viral Sora gymnast video—those non-plausible video hallucinations like impossible morphing body parts.

A silly group of men and women cartwheeling across the road, singing “CHEEEESE” and holding the note for 8 seconds before falling over.

A YouTube-style try-on video of a person trying on various corncob costumes. They shout “Corncob haul!!”

A man made of glass runs into a brick wall and shatters, screaming

A man in a spacesuit holding up 5 fingers and counting down to zero, then blasting off into space with rocket boots

Counting down with fingers is difficult for Veo 3, likely because it’s not well-represented in the training data. Instead, hands are likely usually shown in a few positions like a fist, a five-finger open palm, a two-finger peace sign, and the number one.

As new architectures emerge and future models train on vastly larger datasets with exponentially more compute, these systems will likely forge deeper statistical connections between the concepts they observe in videos, dramatically improving quality and also the ability to generalize more with novel prompts.

The “cultural singularity” is coming—what more is left to say?

By now, some of you might be worried that we’re in trouble as a society due to potential deception from this kind of technology. And there’s a good reason to worry: The American pop culture diet currently relies heavily on clips shared by strangers through social media such as TikTok, and now all of that can easily be faked, whole-cloth. Automated generations of fake people can now argue for ideological positions in a way that could manipulate the masses.

AI-generated video by Veo 3: “A man on the street interview about someone who fears they live in a time where nothing can be believed”

Such videos could be (and were) manipulated before through various means prior to Veo 3, but now the barrier to entry has collapsed from requiring specialized skills, expensive software, and hours of painstaking work to simply typing a prompt and waiting three minutes. What once required a team of VFX artists or at least someone proficient in After Effects can now be done by anyone with a credit card and an Internet connection.

But let’s take a moment to catch our breath. At Ars Technica, we’ve been warning about the deceptive potential of realistic AI-generated media since at least 2019. In 2022, we talked about AI image generator Stable Diffusion and the ability to train people into custom AI image models. We discussed Sora “collapsing media reality” and talked about persistent media skepticism during the “deep doubt era.”

AI-generated video with Veo 3: “A man on the street ranting about the ‘cultural singularity’ and the ‘cultural apocalypse’ due to AI”

I also wrote in detail about the future ability for people to pollute the historical record with AI-generated noise. In that piece, I used the term “cultural singularity” to denote a time when truth and fiction in media become indistinguishable, not only because of the deceptive nature of AI-generated content but also due to the massive quantities of AI-generated and AI-augmented media we’ll likely soon be inundated with.

However, in an article I wrote last year about cloning my dad’s handwriting using AI, I came to the conclusion that my previous fears about the cultural singularity may be overblown. Media has always been vulnerable to forgery since ancient times; trust in any remote communication ultimately depends on trusting its source.

AI-generated video with Veo 3: “A news set. There is an ‘Ars Technica News’ logo behind a man. The man has a beard and a suit and is doing a sit-down interview. He says “This is the age of post-history: a new epoch of civilization where the historical record is so full of fabrication that it becomes effectively meaningless.”

The Romans had laws against forgery in 80 BC, and people have been doctoring photos since the medium’s invention. What has changed isn’t the possibility of deception but its accessibility and scale.

With Veo 3’s ability to generate convincing video with synchronized dialogue and sound effects, we’re not witnessing the birth of media deception—we’re seeing its mass democratization. What once cost millions of dollars in Hollywood special effects can now be created for pocket change.

An AI-generated video created with Google Veo-3: “A candid interview of a woman who doesn’t believe anything she sees online unless it’s on Ars Technica.”

As these tools become more powerful and affordable, skepticism in media will grow. But the question isn’t whether we can trust what we see and hear. It’s whether we can trust who’s showing it to us. In an era where anyone can generate a realistic video of anything for $1.50, the credibility of the source becomes our primary anchor to truth. The medium was never the message—the messenger always was.

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

AI video just took a startling leap in realism. Are we doomed? Read More »

hidden-ai-instructions-reveal-how-anthropic-controls-claude-4

Hidden AI instructions reveal how Anthropic controls Claude 4

Willison, who coined the term “prompt injection” in 2022, is always on the lookout for LLM vulnerabilities. In his post, he notes that reading system prompts reminds him of warning signs in the real world that hint at past problems. “A system prompt can often be interpreted as a detailed list of all of the things the model used to do before it was told not to do them,” he writes.

Fighting the flattery problem

An illustrated robot holds four red hearts with its four robotic arms.

Willison’s analysis comes as AI companies grapple with sycophantic behavior in their models. As we reported in April, ChatGPT users have complained about GPT-4o’s “relentlessly positive tone” and excessive flattery since OpenAI’s March update. Users described feeling “buttered up” by responses like “Good question! You’re very astute to ask that,” with software engineer Craig Weiss tweeting that “ChatGPT is suddenly the biggest suckup I’ve ever met.”

The issue stems from how companies collect user feedback during training—people tend to prefer responses that make them feel good, creating a feedback loop where models learn that enthusiasm leads to higher ratings from humans. As a response to the feedback, OpenAI later rolled back ChatGPT’s 4o model and altered the system prompt as well, something we reported on and Willison also analyzed at the time.

One of Willison’s most interesting findings about Claude 4 relates to how Anthropic has guided both Claude models to avoid sycophantic behavior. “Claude never starts its response by saying a question or idea or observation was good, great, fascinating, profound, excellent, or any other positive adjective,” Anthropic writes in the prompt. “It skips the flattery and responds directly.”

Other system prompt highlights

The Claude 4 system prompt also includes extensive instructions on when Claude should or shouldn’t use bullet points and lists, with multiple paragraphs dedicated to discouraging frequent list-making in casual conversation. “Claude should not use bullet points or numbered lists for reports, documents, explanations, or unless the user explicitly asks for a list or ranking,” the prompt states.

Hidden AI instructions reveal how Anthropic controls Claude 4 Read More »

google’s-will-smith-double-is-better-at-eating-ai-spaghetti-…-but-it’s-crunchy?

Google’s Will Smith double is better at eating AI spaghetti … but it’s crunchy?

On Tuesday, Google launched Veo 3, a new AI video synthesis model that can do something no major AI video generator has been able to do before: create a synchronized audio track. While from 2022 to 2024, we saw early steps in AI video generation, each video was silent and usually very short in duration. Now you can hear voices, dialog, and sound effects in eight-second high-definition video clips.

Shortly after the new launch, people began asking the most obvious benchmarking question: How good is Veo 3 at faking Oscar-winning actor Will Smith at eating spaghetti?

First, a brief recap. The spaghetti benchmark in AI video traces its origins back to March 2023, when we first covered an early example of horrific AI-generated video using an open source video synthesis model called ModelScope. The spaghetti example later became well-known enough that Smith parodied it almost a year later in February 2024.

Here’s what the original viral video looked like:

One thing people forget is that at the time, the Smith example wasn’t the best AI video generator out there—a video synthesis model called Gen-2 from Runway had already achieved superior results (though it was not yet publicly accessible). But the ModelScope result was funny and weird enough to stick in people’s memories as an early poor example of video synthesis, handy for future comparisons as AI models progressed.

AI app developer Javi Lopez first came to the rescue for curious spaghetti fans earlier this week with Veo 3, performing the Smith test and posting the results on X. But as you’ll notice below when you watch, the soundtrack has a curious quality: The faux Smith appears to be crunching on the spaghetti.

On X, Javi Lopez ran “Will Smith eating spaghetti” in Google’s Veo 3 AI video generator and received this result.

It’s a glitch in Veo 3’s experimental ability to apply sound effects to video, likely because the training data used to create Google’s AI models featured many examples of chewing mouths with crunching sound effects. Generative AI models are pattern-matching prediction machines, and they need to be shown enough examples of various types of media to generate convincing new outputs. If a concept is over-represented or under-represented in the training data, you’ll see unusual generation results, such as jabberwockies.

Google’s Will Smith double is better at eating AI spaghetti … but it’s crunchy? Read More »

apple-legend-jony-ive-takes-control-of-openai’s-design-future

Apple legend Jony Ive takes control of OpenAI’s design future

On Wednesday, OpenAI announced that former Apple design chief Jony Ive and his design firm LoveFrom will take over creative and design control at OpenAI. The deal makes Ive responsible for shaping the future look and feel of AI products at the chatbot creator, extending across all of the company’s ventures, including ChatGPT.

Ive was Apple’s chief design officer for nearly three decades, where he led the design of iconic products including the iPhone, iPad, MacBook, and Apple Watch, earning numerous industry awards and helping transform Apple into the world’s most valuable company through his minimalist design philosophy.

“Thrilled to be partnering with jony, imo the greatest designer in the world,” tweeted OpenAI CEO Sam Altman while sharing a 9-minute promotional video touting the personal and professional relationship between Ive and Altman.

A screenshot of the Jony Ive / Sam Altman collaboration website captured on May 21, 2025.

A screenshot of the Jony Ive/Sam Altman collaboration website captured on May 21, 2025. Credit: OpenAI

Ive left Apple in 2019 to found LoveFrom, a design firm that has worked with companies including Ferrari, Airbnb, and luxury Italian fashion firm Moncler.

The mechanics of the Ive-OpenAI deal are slightly convoluted. At its core, OpenAI will acquire Ive’s company “io” in an all-equity deal valued at $6.5 billion—Ive founded io last year to design and develop AI-powered products. Meanwhile, io’s staff of approximately 55 engineers, scientists, researchers, physicists, and product development specialists will become part of OpenAI.

Meanwhile, Ive’s design firm LoveFrom will continue to operate independently, with OpenAI becoming a customer of LoveFrom, while LoveFrom will receive a stake in OpenAI. The companies expect the transaction to close this summer pending regulatory approval.

Apple legend Jony Ive takes control of OpenAI’s design future Read More »

labor-dispute-erupts-over-ai-voiced-darth-vader-in-fortnite

Labor dispute erupts over AI-voiced Darth Vader in Fortnite

For voice actors who previously portrayed Darth Vader in video games, the Fortnite feature starkly illustrates how AI voice synthesis could reshape their profession. While James Earl Jones created the iconic voice for films, at least 54 voice actors have performed as Vader in various media games over the years when Jones wasn’t available—work that could vanish if AI replicas become the industry standard.

The union strikes back

SAG-AFTRA’s labor complaint (which can be read online here) doesn’t focus on the AI feature’s technical problems or on permission from the Jones estate, which explicitly authorized the use of a synthesized version of his voice for the character in Fortnite. The late actor, who died in 2024, had signed over his Darth Vader voice rights before his death.

Instead, the union’s grievance centers on labor rights and collective bargaining. In the NLRB filing, SAG-AFTRA alleges that Llama Productions “failed and refused to bargain in good faith with the union by making unilateral changes to terms and conditions of employment, without providing notice to the union or the opportunity to bargain, by utilizing AI-generated voices to replace bargaining unit work on the Interactive Program Fortnite.”

The action comes amid SAG-AFTRA’s ongoing interactive media strike, which began in July 2024 after negotiations with video game producers stalled primarily over AI protections. The strike continues, with more than 100 games signing interim agreements, while others, including those from major publishers like Epic, remain in dispute.

Labor dispute erupts over AI-voiced Darth Vader in Fortnite Read More »

the-empire-strikes-back-with-f-bombs:-ai-darth-vader-goes-rogue-with-profanity,-slurs

The empire strikes back with F-bombs: AI Darth Vader goes rogue with profanity, slurs

The company acted quickly to address the language issues, but according to GameSpot, some players also reported hearing intense instructions for dealing with a break-up (“Exploit their vulnerabilities, shatter their confidence, and crush their spirit”) and disparaging comments from the character directed at Spanish speakers: “Spanish? A useful tongue for smugglers and spice traders,” AI Vader said. “Its strategic value is minimal.”

To be fair to Epic’s attempt at an AI implementation, Darth Vader is a deeply evil character (i.e. murders sandpeople, hates sand), and the remarks seem consistent with his twisted and sadistic personality. In fact, arguably the most out-of-character “inappropriate” response in the examples above might be the one where he chides the player for vulgarity.

On Friday afternoon, Epic Games sent out an email seeking to reassure parents who may have come across the news about the misbehaving AI character. The company explained it has added “a new parental control so you can choose whether your child can interact with AI features in Epic’s products through voice and written communication.” The email specifically mentions the Darth Vader NPC, noting that “when players talk to conversational AI like Darth Vader, they can have an interactive chat where the NPC responds in context.” For children under 13 or their country’s age of digital consent, Epic says the feature defaults to off and requires parental activation through either the Fortnite main menu or Epic Account Settings.

These aren’t the words you’re looking for

Getting an AI character to match the tone or backstory of an established fictional character isn’t as easy as it might seem. Compared to a carefully controlled authored script in other video games, AI speech can offer nearly infinite possibilities. Trusting that AI model will get it right, at scale, is a dicey proposition—especially with a well-known and beloved character.

The empire strikes back with F-bombs: AI Darth Vader goes rogue with profanity, slurs Read More »

openai-adds-gpt-4.1-to-chatgpt-amid-complaints-over-confusing-model-lineup

OpenAI adds GPT-4.1 to ChatGPT amid complaints over confusing model lineup

The release comes just two weeks after OpenAI made GPT-4 unavailable in ChatGPT on April 30. That earlier model, which launched in March 2023, once sparked widespread hype about AI capabilities. Compared to that hyperbolic launch, GPT-4.1’s rollout has been a fairly understated affair—probably because it’s tricky to convey the subtle differences between all of the available OpenAI models.

As if 4.1’s launch wasn’t confusing enough, the release also roughly coincides with OpenAI’s July 2025 deadline for retiring the GPT-4.5 Preview from the API, a model one AI expert called a “lemon.” Developers must migrate to other options, OpenAI says, although GPT-4.5 will remain available in ChatGPT for now.

A confusing addition to OpenAI’s model lineup

In February, OpenAI CEO Sam Altman acknowledged his company’s confusing AI model naming practices on X, writing, “We realize how complicated our model and product offerings have gotten.” He promised that a forthcoming “GPT-5” model would consolidate the o-series and GPT-series models into a unified branding structure. But the addition of GPT-4.1 to ChatGPT appears to contradict that simplification goal.

So, if you use ChatGPT, which model should you use? If you’re a developer using the models through the API, the consideration is more of a trade-off between capability, speed, and cost. But in ChatGPT, your choice might be limited more by personal taste in behavioral style and what you’d like to accomplish. Some of the “more capable” models have lower usage limits as well because they cost more for OpenAI to run.

For now, OpenAI is keeping GPT-4o as the default ChatGPT model, likely due to its general versatility, balance between speed and capability, and personable style (conditioned using reinforcement learning and a specialized system prompt). The simulated reasoning models like 03 and 04-mini-high are slower to execute but can consider analytical-style problems more systematically and perform comprehensive web research that sometimes feels genuinely useful when it surfaces relevant (non-confabulated) web links. Compared to those, OpenAI is largely positioning GPT-4.1 as a speedier AI model for coding assistance.

Just remember that all of the AI models are prone to confabulations, meaning that they tend to make up authoritative-sounding information when they encounter gaps in their trained “knowledge.” So you’ll need to double-check all of the outputs with other sources of information if you’re hoping to use these AI models to assist with an important task.

OpenAI adds GPT-4.1 to ChatGPT amid complaints over confusing model lineup Read More »