AI content has proliferated across the Internet over the past few years, but those early confabulations with mutated hands have evolved into synthetic images and videos that can be hard to differentiate from reality. Having helped to create this problem, Google has some responsibility to keep AI video in check on YouTube. To that end, the company has started rolling out its promised likeness detection system for creators.
Google’s powerful and freely available AI models have helped fuel the rise of AI content, some of which is aimed at spreading misinformation and harassing individuals. Creators and influencers fear their brands could be tainted by a flood of AI videos that show them saying and doing things that never happened—even lawmakers are fretting about this. Google has placed a large bet on the value of AI content, so banning AI from YouTube, as many want, simply isn’t happening.
Earlier this year, YouTube promised tools that would flag face-stealing AI content on the platform. The likeness detection tool, which is similar to the site’s copyright detection system, has now expanded beyond the initial small group of testers. YouTube says the first batch of eligible creators have been notified that they can use likeness detection, but interested parties will need to hand Google even more personal information to get protection from AI fakes.
Sneak Peek: Likeness Detection on YouTube.
Currently, likeness detection is a beta feature in limited testing, so not all creators will see it as an option in YouTube Studio. When it does appear, it will be tucked into the existing “Content detection” menu. In YouTube’s demo video, the setup flow appears to assume the channel has only a single host whose likeness needs protection. That person must verify their identity, which requires a photo of a government ID and a video of their face. It’s unclear why YouTube needs this data in addition to the videos people have already posted with their oh-so stealable faces, but rules are rules.
It’s getting harder to know what’s real on the Internet, and Google is not helping one bit with the announcement of Veo 3.1. The company’s new video model supposedly offers better audio and realism, along with greater prompt accuracy. The updated video AI will be available throughout the Google ecosystem, including the Flow filmmaking tool, where the new model will unlock additional features. And if you’re worried about the cost of conjuring all these AI videos, Google is also adding a “Fast” variant of Veo.
Veo made waves when it debuted earlier this year, demonstrating a staggering improvement in AI video quality just a few months after Veo 2’s release. It turns out that having all that video on YouTube is very useful for training AI models, so Google is already moving on to Veo 3.1 with a raft of new features.
Google says Veo 3.1 offers stronger prompt adherence, which results in better video outputs and fewer wasted compute cycles. Audio, which was a hallmark feature of the Veo 3 release, has reportedly improved, too. Veo 3’s text-to-video was limited to 720p landscape output, but there’s an ever-increasing volume of vertical video on the Internet. So Veo 3.1 can produce both landscape and portrait 16:9 video.
Google previously said it would bring Veo video tools to YouTube Shorts, which use a vertical video format like TikTok. The release of Veo 3.1 probably opens the door to fulfilling that promise. You can bet Veo videos will show up more frequently on TikTok as well now that it fits the format. This release also keeps Google in its race with OpenAI, which recently released a Sora iPhone app with an impressive new version of its video-generating AI.
Voyager builds on Tencent’s earlier HunyuanWorld 1.0, released in July. Voyager is also part of Tencent’s broader “Hunyuan” ecosystem, which includes the Hunyuan3D-2 model for text-to-3D generation and the previously covered HunyuanVideo for video synthesis.
To train Voyager, researchers developed software that automatically analyzes existing videos to process camera movements and calculate depth for every frame—eliminating the need for humans to manually label thousands of hours of footage. The system processed over 100,000 video clips from both real-world recordings and the aforementioned Unreal Engine renders.
A diagram of the Voyager world creation pipeline. Credit: Tencent
The model demands serious computing power to run, requiring at least 60GB of GPU memory for 540p resolution, though Tencent recommends 80GB for better results. Tencent published the model weights on Hugging Face and included code that works with both single and multi-GPU setups.
The model comes with notable licensing restrictions. Like other Hunyuan models from Tencent, the license prohibits usage in the European Union, the United Kingdom, and South Korea. Additionally, commercial deployments serving over 100 million monthly active users require separate licensing from Tencent.
On the WorldScore benchmark developed by Stanford University researchers, Voyager reportedly achieved the highest overall score of 77.62, compared to 72.69 for WonderWorld and 62.15 for CogVideoX-I2V. The model reportedly excelled in object control (66.92), style consistency (84.89), and subjective quality (71.09), though it placed second in camera control (85.95) behind WonderWorld’s 92.98. WorldScore evaluates world generation approaches across multiple criteria, including 3D consistency and content alignment.
While these self-reported benchmark results seem promising, wider deployment still faces challenges due to the computational muscle involved. For developers needing faster processing, the system supports parallel inference across multiple GPUs using the xDiT framework. Running on eight GPUs delivers processing speeds 6.69 times faster than single-GPU setups.
Given the processing power required and the limitations in generating long, coherent “worlds,” it may be a while before we see real-time interactive experiences using a similar technique. But as we’ve seen so far with experiments like Google’s Genie, we’re potentially witnessing very early steps into a new interactive, generative art form.
Google’s Veo 3 videos have propagated across the Internet since the model’s debut in May, blurring the line between truth and fiction. Now, it’s getting even easier to create these AI videos. The Gemini app is gaining photo-to-video generation, allowing you to upload a photo and turn it into a video. You don’t have to pay anything extra for these Veo 3 videos, but the feature is only available to subscribers of Google’s Pro and Ultra AI plans.
When Veo 3 launched, it could conjure up a video based only on your description, complete with speech, music, and background audio. This has made Google’s new AI videos staggeringly realistic—it’s actually getting hard to identify AI videos at a glance. Using a reference photo makes it easier to get the look you want without tediously describing every aspect. This was an option in Google’s Flow AI tool for filmmakers, but now it’s in the Gemini app and web interface.
To create a video from a photo, you have to select “Video” from the Gemini toolbar. Once this feature is available, you can then add your image and prompt, including audio and dialogue. Generating the video takes several minutes—this process takes a lot of computation, which is why video output is still quite limited.
The release of Google’s Veo 3 video generator in May represented a disconcerting leap in AI video quality. While many of the viral AI videos we’ve seen are harmless fun, the model’s pixel-perfect output can also be used for nefarious purposes. On TikTok, which may or may not be banned in the coming months, users have noticed a surplus of racist AI videos, courtesy of Google’s Veo 3.
According to a report from MediaMatters, numerous TikTok accounts have started posting AI-generated videos that use racist and antisemitic tropes in recent weeks. Most of the AI vitriol is aimed at Black people, depicting them as “the usual suspects” in crimes, absent parents, and monkeys with an affinity for watermelon. The content also targets immigrants and Jewish people. The videos top out at eight seconds and bear the “Veo” watermark, confirming they came from Google’s leading AI model.
The compilation video below has examples pulled from TikTok since the release of Veo 3, but be warned, it contains racist and antisemitic content. Some of the videos are shocking, which is likely the point—nothing drives engagement on social media like anger and drama. MediaMatters reports that the original posts have numerous comments echoing the stereotypes used in the video.
Hateful AI videos generated by Veo 3 spreading on TikTok.
Google has stressed security when announcing new AI models—we’ve all seen an AI refuse to complete a task that runs afoul of its guardrails. And it’s never fun when you have genuinely harmless intentions, but the system throws a false positive and blocks your output. Google has mostly struck the right balance previously, but it appears that Veo 3 is more compliant. We’ve tested a few simple prompts with Veo 3 and found it easy to reproduce elements of these videos.
Clear but unenforced policies
TikTok’s terms of service ban this kind of content. “We do not allow any hate speech, hateful behavior, or promotion of hateful ideologies. This includes explicit or implicit content that attacks a protected group,” the community guidelines read. Despite this blanket ban on racist caricatures, the hateful Veo 3 videos appear to be spreading unchecked.
We saw 10 AI films and interviewed Runway’s CEO as well as Hollywood pros.
A still from Total Pixel Space, the Grand Prix winner at AIFF 2025.
A still from Total Pixel Space, the Grand Prix winner at AIFF 2025.
Last week, I attended a film festival dedicated to shorts made using generative AI. Dubbed AIFF 2025, it was an event precariously balancing between two different worlds.
The festival was hosted by Runway, a company that produces models and tools for generating images and videos. In panels and press briefings, a curated list of industry professionals made the case for Hollywood to embrace AI tools. In private meetings with industry professionals, I gained a strong sense that there is already a widening philosophical divide within the film and television business.
I also interviewed Runway CEO Cristóbal Valenzuela about the tightrope he walks as he pitches his products to an industry that has deeply divided feelings about what role AI will have in its future.
To unpack all this, it makes sense to start with the films, partly because the film that was chosen as the festival’s top prize winner says a lot about the issues at hand.
A festival of oddities and profundities
Since this was the first time the festival has been open to the public, the crowd was a diverse mix: AI tech enthusiasts, working industry creatives, and folks who enjoy movies and who were curious about what they’d see—as well as quite a few people who fit into all three groups.
The scene at the entrance to the theater at AIFF 2025 in Santa Monica, California.
The films shown were all short, and most would be more at home at an art film fest than something more mainstream. Some shorts featured an animated aesthetic (including one inspired by anime) and some presented as live action. There was even a documentary of sorts. The films could be made entirely with Runway or other AI tools, or those tools could simply be a key part of a stack that also includes more traditional filmmaking methods.
Many of these shorts were quite weird. Most of us have seen by now that AI video-generation tools excel at producing surreal and distorted imagery—sometimes whether the person prompting the tool wants that or not. Several of these films leaned into that limitation, treating it as a strength.
Representing that camp was Vallée Duhamel’s Fragments of Nowhere, which visually explored the notion of multiple dimensions bleeding into one another. Cars morphed into the sides of houses, and humanoid figures, purported to be inter-dimensional travelers, moved in ways that defied anatomy. While I found this film visually compelling at times, I wasn’t seeing much in it that I hadn’t already seen from dreamcore or horror AI video TikTok creators like GLUMLOT or SinRostroz in recent years.
More compelling were shorts that used this propensity for oddity to generate imagery that was curated and thematically tied to some aspect of human experience or identity. For example, More Tears than Harm by Herinarivo Rakotomanana was a rotoscope animation-style “sensory collage of childhood memories” of growing up in Madagascar. Its specificity and consistent styling lent it a credibility that Fragments of Nowhere didn’t achieve. I also enjoyed Riccardo Fusetti’s Editorial on this front.
More Tears Than Harm, an unusual animated film at AIFF 2025.
Among the 10 films in the festival, two clearly stood above the others in my impressions—and they ended up being the Grand Prix and Gold prize winners. (The judging panel included filmmakers Gaspar Noé and Harmony Korine, Tribeca Enterprises CEO Jane Rosenthal, IMAX head of post and image capture Bruce Markoe, Lionsgate VFX SVP Brianna Domont, Nvidia developer relations lead Richard Kerris, and Runway CEO Cristóbal Valenzuela, among others).
Runner-up Jailbird was the aforementioned quasi-documentary. Directed by Andrew Salter, it was a brief piece that introduced viewers to a program in the UK that places chickens in human prisons as companion animals, to positive effect. Why make that film with AI, you might ask? Well, AI was used to achieve shots that wouldn’t otherwise be doable for a small-budget film to depict the experience from the chicken’s point of view. The crowd loved it.
Jailbird, the runner-up at AIFF 2025.
Then there was the Grand Prix winner, Jacob Adler’s Total Pixel Space, which was, among other things, a philosophical defense of the very idea of AI art. You can watch Total Pixel Space on YouTube right now, unlike some of the other films. I found it strangely moving, even as I saw its selection as the festival’s top winner with some cynicism. Of course they’d pick that one, I thought, although I agreed it was the most interesting of the lot.
Total Pixel Space, the Grand Prix winner at AIFF 2025.
Total Pixel Space
Even though it risked navel-gazing and self-congratulation in this venue, Total Pixel Space was filled with compelling imagery that matched the themes, and it touched on some genuinely interesting ideas—at times, it seemed almost profound, didactic as it was.
“How many images can possibly exist?” the film’s narrator asked. To answer that, it explains the concept of total pixel space, which actually reflects how image generation tools work:
Pixels are the building blocks of digital images—tiny tiles forming a mosaic. Each pixel is defined by numbers representing color and position. Therefore, any digital image can be represented as a sequence of numbers…
Just as we don’t need to write down every number between zero and one to prove they exist, we don’t need to generate every possible image to prove they exist. Their existence is guaranteed by the mathematics that defines them… Every frame of every possible film exists as coordinates… To deny this would be to deny the existence of numbers themselves.
The nine-minute film demonstrates that the number of possible images or films is greater than the number of atoms in the universe and argues that photographers and filmmakers may be seen as discovering images that already exist in the possibility space rather than creating something new.
Within that framework, it’s easy to argue that generative AI is just another way for artists to “discover” images.
The balancing act
“We are all—and I include myself in that group as well—obsessed with technology, and we keep chatting about models and data sets and training and capabilities,” Runway CEO Cristóbal Valenzuela said to me when we spoke the next morning. “But if you look back and take a minute, the festival was celebrating filmmakers and artists.”
I admitted that I found myself moved by Total Pixel Space‘s articulations. “The winner would never have thought of himself as a filmmaker, and he made a film that made you feel something,” Valenzuela responded. “I feel that’s very powerful. And the reason he could do it was because he had access to something that just wasn’t possible a couple of months ago.”
First-time and outsider filmmakers were the focus of AIFF 2025, but Runway works with established studios, too—and those relationships have an inherent tension.
The company has signed deals with companies like Lionsgate and AMC. In some cases, it trains on data provided by those companies; in others, it embeds within them to try to develop tools that fit how they already work. That’s not something competitors like OpenAI are doing yet, so that, combined with a head start in video generation, has allowed Runway to grow and stay competitive so far.
“We go directly into the companies, and we have teams of creatives that are working alongside them. We basically embed ourselves within the organizations that we’re working with very deeply,” Valenzuela explained. “We do versions of our film festival internally for teams as well so they can go through the process of making something and seeing the potential.”
Founded in 2018 at New York University’s Tisch School of the Arts by two Chileans and one Greek co-founder, Runway has a very different story than its Silicon Valley competitors. It was one of the first to bring an actually usable video-generation tool to the masses. Runway also contributed in foundational ways to the popular Stable Diffusion model.
Though it is vastly outspent by competitors like OpenAI, it has taken a hands-on approach to working with existing industries. You won’t hear Valenzuela or other Runway leaders talking about the imminence of AGI or anything so lofty; instead, it’s all about selling the product as something that can solve existing problems in creatives’ workflows.
Still, an artist’s mindset and relationships within the industry don’t negate some fundamental conflicts. There are multiple intellectual property cases involving Runway and its peers, and though the company hasn’t admitted it, there is evidence that it trained its models on copyrighted YouTube videos, among other things.
Cristóbal Valenzuela speaking on the AIFF 2025 stage. Credit: Samuel Axon
Valenzuela suggested that studios are worried about liability, not underlying principles, though, saying:
Most of the concerns on copyright are on the output side, which is like, how do you make sure that the model doesn’t create something that already exists or infringes on something. And I think for that, we’ve made sure our models don’t and are supportive of the creative direction you want to take without being too limiting. We work with every major studio, and we offer them indemnification.
In the past, he has also defended Runway by saying that what it’s producing is not a re-creation of what has come before. He sees the tool’s generative process as distinct—legally, creatively, and ethically—from simply pulling up assets or references from a database.
“People believe AI is sort of like a system that creates and conjures things magically with no input from users,” he said. “And it’s not. You have to do that work. You still are involved, and you’re still responsible as a user in terms of how you use it.”
He seemed to share this defense of AI as a legitimate tool for artists with conviction, but given that he’s been pitching these products directly to working filmmakers, he was also clearly aware that not everyone agrees with him. There is not even a consensus among those in the industry.
An industry divided
While in LA for the event, I visited separately with two of my oldest friends. Both of them work in the film and television industry in similar disciplines. They each asked what I was in town for, and I told them I was there to cover an AI film festival.
One immediately responded with a grimace of disgust, “Oh, yikes, I’m sorry.” The other responded with bright eyes and intense interest and began telling me how he already uses AI in his day-to-day to do things like extend shots by a second or two for a better edit, and expressed frustration at his company for not adopting the tools faster.
Neither is alone in their attitudes. Hollywood is divided—and not for the first time.
There have been seismic technological changes in the film industry before. There was the transition from silent films to talkies, obviously; moviemaking transformed into an entirely different art. Numerous old jobs were lost, and numerous new jobs were created.
Later, there was the transition from film to digital projection, which may be an even tighter parallel. It was a major disruption, with some companies and careers collapsing while others rose. There were people saying, “Why do we even need this?” while others believed it was the only sane way forward. Some audiences declared the quality worse, and others said it was better. There were analysts arguing it could be stopped, while others insisted it was inevitable.
IMAX’s head of post production, Bruce Markoe, spoke briefly about that history at a press mixer before the festival. “It was a little scary,” he recalled. “It was a big, fundamental change that we were going through.”
People ultimately embraced it, though. “The motion picture and television industry has always been very technology-forward, and they’ve always used new technologies to advance the state of the art and improve the efficiencies,” Markoe said.
When asked whether he thinks the same thing will happen with generative AI tools, he said, “I think some filmmakers are going to embrace it faster than others.” He pointed to AI tools’ usefulness for pre-visualization as particularly valuable and noted some people are already using it that way, but it will take time for people to get comfortable with.
And indeed, many, many filmmakers are still loudly skeptical. “The concept of AI is great,” The Mitchells vs. the Machines director Mike Rianda said in a Wired interview. “But in the hands of a corporation, it is like a buzzsaw that will destroy us all.”
Others are interested in the technology but are concerned that it’s being brought into the industry too quickly, with insufficient planning and protections. That includes Crafty Apes Senior VFX Supervisor Luke DiTomasso. “How fast do we roll out AI technologies without really having an understanding of them?” he asked in an interview with Production Designers Collective. “There’s a potential for AI to accelerate beyond what we might be comfortable with, so I do have some trepidation and am maybe not gung-ho about all aspects of it.“
Others remain skeptical that the tools will be as useful as some optimists believe. “AI never passed on anything. It loved everything it read. It wants you to win. But storytelling requires nuance—subtext, emotion, what’s left unsaid. That’s something AI simply can’t replicate,” said Alegre Rodriquez, a member of the Emerging Technology committee at the Motion Picture Editors Guild.
The mirror
Flying back from Los Angeles, I considered two key differences between this generative AI inflection point for Hollywood and the silent/talkie or film/digital transitions.
First, neither of those transitions involved an existential threat to the technology on the basis of intellectual property and copyright. Valenzuela talked about what matters to studio heads—protection from liability over the outputs. But the countless creatives who are critical of these tools also believe they should be consulted and even compensated for their work’s use in the training data for Runway’s models. In other words, it’s not just about the outputs, it’s also about the sourcing. As noted before, there are several cases underway. We don’t know where they’ll land yet.
Second, there’s a more cultural and philosophical issue at play, which Valenzuela himself touched on in our conversation.
“I think AI has become this sort of mirror where anyone can project all their fears and anxieties, but also their optimism and ideas of the future,” he told me.
You don’t have to scroll for long to come across techno-utopians declaring with no evidence that AGI is right around the corner and that it will cure cancer and save our society. You also don’t have to scroll long to encounter visceral anger at every generative AI company from people declaring the technology—which is essentially just a new methodology for programming a computer—fundamentally unethical and harmful, with apocalyptic societal and economic ramifications.
Amid all those bold declarations, this film festival put the focus on the on-the-ground reality. First-time filmmakers who might never have previously cleared Hollywood’s gatekeepers are getting screened at festivals because they can create competitive-looking work with a fraction of the crew and hours. Studios and the people who work there are saying they’re saving time, resources, and headaches in pre-viz, editing, visual effects, and other work that’s usually done under immense time and resource pressure.
“People are not paying attention to the very huge amount of positive outcomes of this technology,” Valenzuela told me, pointing to those examples.
In this online discussion ecosystem that elevates outrage above everything else, that’s likely true. Still, there is a sincere and rigorous conviction among many creatives that their work is contributing to this technology’s capabilities without credit or compensation and that the structural and legal frameworks to ensure minimal human harm in this evolving period of disruption are still inadequate. That’s why we’ve seen groups like the Writers Guild of America West support the Generative AI Copyright Disclosure Act and other similar legislation meant to increase transparency about how these models are trained.
The philosophical question with a legal answer
The winning film argued that “total pixel space represents both the ultimate determinism and the ultimate freedom—every possibility existing simultaneously, waiting for consciousness to give it meaning through the act of choice.”
In making this statement, the film suggested that creativity, above all else, is an act of curation. It’s a claim that nothing, truly, is original. It’s a distillation of human expression into the language of mathematics.
To many, that philosophy rings undeniably true: Every possibility already exists, and artists are just collapsing the waveform to the frame they want to reveal. To others, there is more personal truth to the romantic ideal that artwork is valued precisely because it did not exist until the artist produced it.
All this is to say that the debate about creativity and AI in Hollywood is ultimately a philosophical one. But it won’t be resolved that way.
The industry may succumb to litigation fatigue and a hollowed-out workforce—or it may instead find its way to fair deals, new opportunities for fresh voices, and transparent training sets.
For all this lofty talk about creativity and ideas, the outcome will come down to the contracts, court decisions, and compensation structures—all things that have always been at least as big a part of Hollywood as the creative work itself.
Samuel Axon is the editorial lead for tech and gaming coverage at Ars Technica. He covers AI, software development, gaming, entertainment, and mixed reality. He has been writing about gaming and technology for nearly two decades at Engadget, PC World, Mashable, Vice, Polygon, Wired, and others. He previously ran a marketing and PR agency in the gaming industry, led editorial for the TV network CBS, and worked on social media marketing strategy for Samsung Mobile at the creative agency SPCSHP. He also is an independent software and game developer for iOS, Windows, and other platforms, and he is a graduate of DePaul University, where he studied interactive media and software development.
Even in the age of TikTok, YouTube viewership continues to climb. While Google’s iconic video streaming platform has traditionally pushed creators to produce longer videos that can accommodate more ads, the site’s Shorts format is growing fast. That growth may explode in the coming months, as YouTube CEO Neal Mohan has announced that the Google Veo 3 AI video generator will be integrated with YouTube Shorts later this summer.
According to Mohan, YouTube Shorts has seen a rise in popularity even compared to YouTube as a whole. The streaming platform is now the most watched source of video in the world, but Shorts specifically have seen a massive 186 percent increase in viewership over the past year. Mohan says Shorts now average 200 billion daily views.
YouTube has already equipped creators with a few AI tools, including Dream Screen, which can produce AI video backgrounds with a text prompt. Veo 3 support will be a significant upgrade, though. At the Cannes festival, Mohan revealed that the streaming site will begin offering integration with Google’s leading video model later this summer. “I believe these tools will open new creative lanes for everyone to explore,” said Mohan.
YouTube heavily promotes Shorts on the homepage.
Credit: Google
YouTube heavily promotes Shorts on the homepage. Credit: Google
This move will require a few tweaks to Veo 3 outputs, but it seems like a perfect match. As the name implies, YouTube Shorts is intended for short video content. The format initially launched with a 30-second ceiling, but that has since been increased to 60 seconds. Because of the astronomical cost of generative AI, each generated Veo clip is quite short, a mere eight seconds in the current version of the tool. Slap a few of those together, and you’ve got a YouTube Short.
Google’s Veo 3 delivers AI videos of realistic people with sound and music. We put it to the test.
Still image from an AI-generated Veo 3 video of “A 1980s fitness video with models in leotards wearing werewolf masks.” Credit: Google
Last week, Google introduced Veo 3, its newest video generation model that can create 8-second clips with synchronized sound effects and audio dialog—a first for the company’s AI tools. The model, which generates videos at 720p resolution (based on text descriptions called “prompts” or still image inputs), represents what may be the most capable consumer video generator to date, bringing video synthesis close to a point where it is becoming very difficult to distinguish between “authentic” and AI-generated media.
Google also launched Flow, an online AI filmmaking tool that combines Veo 3 with the company’s Imagen 4 image generator and Gemini language model, allowing creators to describe scenes in natural language and manage characters, locations, and visual styles in a web interface.
An AI-generated video from Veo 3: “ASMR scene of a woman whispering “Moonshark” into a microphone while shaking a tambourine”
Both tools are now available to US subscribers of Google AI Ultra, a plan that costs $250 a month and comes with 12,500 credits. Veo 3 videos cost 150 credits per generation, allowing 83 videos on that plan before you run out. Extra credits are available for the price of 1 cent per credit in blocks of $25, $50, or $200. That comes out to about $1.50 per video generation. But is the price worth it? We ran some tests with various prompts to see what this technology is truly capable of.
How does Veo work?
Like other modern video generation models, Veo 3 is built on diffusion technology—the same approach that powers image generators like Stable Diffusion and Flux. The training process works by taking real videos and progressively adding noise to them until they become pure static, then teaching a neural network to reverse this process step by step. During generation, Veo 3 starts with random noise and a text prompt, then iteratively refines that noise into a coherent video that matches the description.
AI-generated video from Veo 3: “An old professor in front of a class says, ‘Without a firm historical context, we are looking at the dawn of a new era of civilization: post-history.'”
DeepMind won’t say exactly where it sourced the content to train Veo 3, but YouTube is a strong possibility. Google owns YouTube, and DeepMind previously told TechCrunch that Google models like Veo “may” be trained on some YouTube material.
It’s important to note that Veo 3 is a system composed of a series of AI models, including a large language model (LLM) to interpret user prompts to assist with detailed video creation, a video diffusion model to create the video, and an audio generation model that applies sound to the video.
An AI-generated video from Veo 3: “A male stand-up comic on stage in a night club telling a hilarious joke about AI and crypto with a silly punchline.” An AI language model built into Veo 3 wrote the joke.
In an attempt to prevent misuse, DeepMind says it’s using its proprietary watermarking technology, SynthID, to embed invisible markers into frames Veo 3 generates. These watermarks persist even when videos are compressed or edited, helping people potentially identify AI-generated content. As we’ll discuss more later, though, this may not be enough to prevent deception.
Google also censors certain prompts and outputs that breach the company’s content agreement. During testing, we encountered “generation failure” messages for videos that involve romantic and sexual material, some types of violence, mentions of certain trademarked or copyrighted media properties, some company names, certain celebrities, and some historical events.
Putting Veo 3 to the test
Perhaps the biggest change with Veo 3 is integrated audio generation, although Meta previewed a similar audio-generation capability with “Movie Gen” last October, and AI researchers have experimented with using AI to add soundtracks to silent videos for some time. Google DeepMind itself showed off an AI soundtrack-generating model in June 2024.
An AI-generated video from Veo 3: “A middle-aged balding man rapping indie core about Atari, IBM, TRS-80, Commodore, VIC-20, Atari 800, NES, VCS, Tandy 100, Coleco, Timex-Sinclair, Texas Instruments”
Veo 3 can generate everything from traffic sounds to music and character dialogue, though our early testing reveals occasional glitches. Spaghetti makes crunching sounds when eaten (as we covered last week, with a nod to the famous Will Smith AI spaghetti video), and in scenes with multiple people, dialogue sometimes comes from the wrong character’s mouth. But overall, Veo 3 feels like a step change in video synthesis quality and coherency over models from OpenAI, Runway, Minimax, Pika, Meta, Kling, and Hunyuanvideo.
The videos also tend to show garbled subtitles that almost match the spoken words, which is an artifact of subtitles on videos present in the training data. The AI model is imitating what it has “seen” before.
An AI-generated video from Veo 3: “A beer commercial for ‘CATNIP’ beer featuring a real a cat in a pickup truck driving down a dusty dirt road in a trucker hat drinking a can of beer while country music plays in the background, a man sings a jingle ‘Catnip beeeeeeeeeeeeeeeeer’ holding the note for 6 seconds”
We generated each of the eight-second-long 720p videos seen below using Google’s Flow platform. Each video generation took around three to five minutes to complete, and we paid for them ourselves. It’s important to note that better results come from cherry-picking—running the same prompt multiple times until you find a good result. Due to cost and in the spirit of testing, we only ran every prompt once, unless noted.
New audio prompts
Let’s dive right into the deep end with audio generation to get a grip on what this technology can do. We’ve previously shown you a man singing about spaghetti and a rapping shark in our last Veo 3 piece, but here’s some more complex dialogue.
Since 2022, we’ve been using the prompt “a muscular barbarian with weapons beside a CRT television set, cinematic, 8K, studio lighting” to test AI image generators like Midjourney. It’s time to bring that barbarian to life.
A muscular barbarian man holding an axe, standing next to a CRT television set. He looks at the TV, then to the camera and literally says, “You’ve been looking for this for years: a muscular barbarian with weapons beside a CRT television set, cinematic, 8K, studio lighting. Got that, Benj?”
The video above represents significant technical progress in AI media synthesis over the course of only three years. We’ve gone from a blurry colorful still-image barbarian to a photorealistic guy that talks to us in 720p high definition with audio. Most notably, there’s no reason to believe technical capability in AI generation will slow down from here.
Horror film: A scared woman in a Victorian outfit running through a forest, dolly shot, being chased by a man in a peanut costume screaming, “Wait! You forgot your wallet!”
Trailer for The Haunted Basketball Train: a Tim Burton film where 1990s basketball star is stuck at the end of a haunted passenger train with basketball court cars, and the only way to survive is to make it to the engine by beating different ghosts at basketball in every car
ASMR video of a muscular barbarian man whispering slowly into a microphone, “You love CRTs, don’t you? That’s OK. It’s OK to love CRT televisions and barbarians.”
1980s PBS show about a man with a beard talking about how his Apple II computer can “connect to the world through a series of tubes”
A 1980s fitness video with models in leotards wearing werewolf masks
A female therapist looking at the camera, zoom call. She says, “Oh my lord, look at that Atari 800 you have behind you! I can’t believe how nice it is!”
With this technology, one can easily imagine a virtual world of AI personalities designed to flatter people. This is a fairly innocent example about a vintage computer, but you can extrapolate, making the fake person talk about any topic at all. There are limits due to Google’s filters, but from what we’ve seen in the past, a future uncensored version of a similarly capable AI video generator is very likely.
Video call screenshot capture of a Zoom chat. A psychologist in a dark, cozy therapist’s office. The therapist says in a friendly voice, “Hi Tom, thanks for calling. Tell me about how you’re feeling today. Is the depression still getting to you? Let’s work on that.”
1960s NASA footage of the first man stepping onto the surface of the Moon, who squishes into a pile of mud and yells in a hillbilly voice, “What in tarnation??”
A local TV news interview of a muscular barbarian talking about why he’s always carrying a CRT TV set around with him
Speaking of fake news interviews, Veo 3 can generate plenty of talking anchor-persons, although sometimes on-screen text is garbled if you don’t specify exactly what it should say. It’s in cases like this where it seems Veo 3 might be most potent at casual media deception.
Footage from a news report about Russia invading the United States
Attempts at music
Veo 3’s AI audio generator can create music in various genres, although in practice, the results are typically simplistic. Still, it’s a new capability for AI video generators. Here are a few examples in various musical genres.
A PBS show of a crazy barbarian with a blonde afro painting pictures of Trees, singing “HAPPY BIG TREES” to some music while he paints
A 1950s cowboy rides up to the camera and sings in country music, “I love mah biiig ooold donkeee”
A 1980s hair metal band drives up to the camera and sings in rock music, “Help me with my huge huge huge hair!”
Mister Rogers’ Neighborhood PBS kids show intro done with psychedelic acid rock and colored lights
1950s musical jazz group with a scat singer singing about pickles amid gibberish
A trip-hop rap song about Ars Technica being sung by a guy in a large rubber shark costume on a stage with a full moon in the background
Some classic prompts from prior tests
The prompts below come from our previous video tests of Gen-3, Video-01, and the open source Hunyuanvideo, so you can flip back to those articles and compare the results if you want to. Overall, Veo 3 appears to have far greater temporal coherency (having a consistent subject or theme over time) than the earlier video synthesis models we’ve tested. But of course, it’s not perfect.
A highly intelligent person reading ‘Ars Technica’ on their computer when the screen explodes
The moonshark jumping out of a computer screen and attacking a person
A herd of one million cats running on a hillside, aerial view
Video game footage of a dynamic 1990s third-person 3D platform game starring an anthropomorphic shark boy
Aerial shot of a small American town getting deluged with liquid cheese after a massive cheese rainstorm where liquid cheese rained down and dripped all over the buildings
Wide-angle shot, starting with the Sasquatch at the center of the stage giving a TED talk about mushrooms, then slowly zooming in to capture its expressive face and gestures, before panning to the attentive audience
Some notable failures
Google’s Veo 3 isn’t perfect at synthesizing every scenario we can throw at it due to limitations of training data. As we noted in our previous coverage, AI video generators remain fundamentally imitative, making predictions based on statistical patterns rather than a true understanding of physics or how the world works.
For example, if you see mouths moving during speech, or clothes wrinkling in a certain way when touched, it means the neural network doing the video generation has “seen” enough similar examples of that scenario in the training data to render a convincing take on it and apply it to similar situations.
However, when a novel situation (or combination of themes) isn’t well-represented in the training data, you’ll see “impossible” or illogical things happen, such as weird body parts, magically appearing clothing, or an object that “shatters” but remains in the scene afterward, as you’ll see below.
We mentioned audio and video glitches in the introduction. In particular, scenes with multiple people sometimes confuse which character is speaking, such as this argument between tech fans.
A 2000s TV debate between fans of the PowerPC and Intel Pentium chips
Bombastic 1980s infomercial for the “Ars Technica” online service. With cheesy background music and user testimonials
1980s Rambo fighting Soviets on the Moon
Sometimes requests don’t make coherent sense. In this case, “Rambo” is correctly on the Moon firing a gun, but he’s not wearing a spacesuit. He’s a lot tougher than we thought.
An animated infographic showing how many floppy disks it would take to hold an installation of Windows 11
Large amounts of text also present a weak point, but if a short text quotation is explicitly specified in the prompt, Veo 3 usually gets it right.
A young woman doing a complex floor gymnastics routine at the Olympics, featuring running and flips
Despite Veo 3’s advances in temporal coherency and audio generation, it still suffers from the same “jabberwockies” we saw in OpenAI’s viral Sora gymnast video—those non-plausible video hallucinations like impossible morphing body parts.
A silly group of men and women cartwheeling across the road, singing “CHEEEESE” and holding the note for 8 seconds before falling over.
A YouTube-style try-on video of a person trying on various corncob costumes. They shout “Corncob haul!!”
A man made of glass runs into a brick wall and shatters, screaming
A man in a spacesuit holding up 5 fingers and counting down to zero, then blasting off into space with rocket boots
Counting down with fingers is difficult for Veo 3, likely because it’s not well-represented in the training data. Instead, hands are likely usually shown in a few positions like a fist, a five-finger open palm, a two-finger peace sign, and the number one.
As new architectures emerge and future models train on vastly larger datasets with exponentially more compute, these systems will likely forge deeper statistical connections between the concepts they observe in videos, dramatically improving quality and also the ability to generalize more with novel prompts.
The “cultural singularity” is coming—what more is left to say?
By now, some of you might be worried that we’re in trouble as a society due to potential deception from this kind of technology. And there’s a good reason to worry: The American pop culture diet currently relies heavily on clips shared by strangers through social media such as TikTok, and now all of that can easily be faked, whole-cloth. Automated generations of fake people can now argue for ideological positions in a way that could manipulate the masses.
AI-generated video by Veo 3: “A man on the street interview about someone who fears they live in a time where nothing can be believed”
Such videos could be (and were) manipulated before through various means prior to Veo 3, but now the barrier to entry has collapsed from requiring specialized skills, expensive software, and hours of painstaking work to simply typing a prompt and waiting three minutes. What once required a team of VFX artists or at least someone proficient in After Effects can now be done by anyone with a credit card and an Internet connection.
But let’s take a moment to catch our breath. At Ars Technica, we’ve been warning about the deceptive potential of realistic AI-generated media since at least 2019. In 2022, we talked about AI image generator Stable Diffusion and the ability to train people into custom AI image models. We discussed Sora “collapsing media reality” and talked about persistent media skepticism during the “deep doubt era.”
AI-generated video with Veo 3: “A man on the street ranting about the ‘cultural singularity’ and the ‘cultural apocalypse’ due to AI”
I also wrote in detail about the future ability for people to pollute the historical record with AI-generated noise. In that piece, I used the term “cultural singularity” to denote a time when truth and fiction in media become indistinguishable, not only because of the deceptive nature of AI-generated content but also due to the massive quantities of AI-generated and AI-augmented media we’ll likely soon be inundated with.
However, in an article I wrote last year about cloning my dad’s handwriting using AI, I came to the conclusion that my previous fears about the cultural singularity may be overblown. Media has always been vulnerable to forgery since ancient times; trust in any remote communication ultimately depends on trusting its source.
AI-generated video with Veo 3: “A news set. There is an ‘Ars Technica News’ logo behind a man. The man has a beard and a suit and is doing a sit-down interview. He says “This is the age of post-history: a new epoch of civilization where the historical record is so full of fabrication that it becomes effectively meaningless.”
The Romans had laws against forgery in 80 BC, and people have been doctoring photos since the medium’s invention. What has changed isn’t the possibility of deception but its accessibility and scale.
With Veo 3’s ability to generate convincing video with synchronized dialogue and sound effects, we’re not witnessing the birth of media deception—we’re seeing its mass democratization. What once cost millions of dollars in Hollywood special effects can now be created for pocket change.
An AI-generated video created with Google Veo-3: “A candid interview of a woman who doesn’t believe anything she sees online unless it’s on Ars Technica.”
As these tools become more powerful and affordable, skepticism in media will grow. But the question isn’t whether we can trust what we see and hear. It’s whether we can trust who’s showing it to us. In an era where anyone can generate a realistic video of anything for $1.50, the credibility of the source becomes our primary anchor to truth. The medium was never the message—the messenger always was.
Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
For example, it was used in producing the sequence in the film Everything Everywhere All At Once, where two rocks with googly eyes had a conversation on a cliff, and it has also been used to make visual gags for The Late Show with Stephen Colbert.
Whereas many competing startups were started by AI researchers or Silicon Valley entrepreneurs, Runway was founded in 2018 by art students at New York University’s Tisch School of the Arts—Cristóbal Valenzuela and Alejandro Matamala from Chilé, and Anastasis Germanidis from Greece.
It was one of the first companies to release a usable video-generation tool to the public, and its team also contributed in foundational ways to the Stable Diffusion model.
It is vastly outspent by competitors like OpenAI, but while most of its competitors have released general-purpose video creation tools, Runway has sought an Adobe-like place in the industry. It has focused on marketing to creative professionals like designers and filmmakers, and has implemented tools meant to make Runway a support tool into existing creative workflows.
The support tool argument (as opposed to a standalone creative product) helped Runway secure a deal with motion picture company Lionsgate, wherein Lionsgate allowed Runway to legally train its models on its library of films, and Runway provided bespoke tools for Lionsgate for use in production or post-production.
That said, Runway is, along with Midjourney and others, one of the subjects of a widely publicized intellectual property case brought by artists who claim the companies illegally trained their models on their work, so not all creatives are on board.
Apart from the announcement about the partnership with Lionsgate, Runway has never publicly shared what data is used to train its models. However, a report in 404 Media seemed to reveal that at least some of the training data included video scraped from the YouTube channels of popular influencers, film studios, and more.
Real-time deepfakes are no longer limited to billionaires, public figures, or those who have extensive online presences. Mittal’s research at NYU, with professors Chinmay Hegde and Nasir Memon, proposes a potential challenge-based approach to blocking AI bots from video calls, where participants would have to pass a kind of video CAPTCHA test before joining.
As Reality Defender works to improve the detection accuracy of its models, Colman says that access to more data is a critical challenge to overcome—a common refrain from the current batch of AI-focused startups. He’s hopeful more partnerships will fill in these gaps, and without specifics, hints at multiple new deals likely coming next year. After ElevenLabs was tied to a deepfake voice call of US president Joe Biden, the AI-audio startup struck a deal with Reality Defender to mitigate potential misuse.
What can you do right now to protect yourself from video call scams? Just like WIRED’s core advice about avoiding fraud from AI voice calls, not getting cocky about whether you can spot video deepfakes is critical to avoid being scammed. The technology in this space continues to evolve rapidly, and any telltale signs you rely on now to spot AI deepfakes may not be as dependable with the next upgrades to underlying models.
“We don’t ask my 80-year-old mother to flag ransomware in an email,” says Colman. “Because she’s not a computer science expert.” In the future, it’s possible real-time video authentication, if AI detection continues to improve and shows to be reliably accurate, will be as taken for granted as that malware scanner quietly humming along in the background of your email inbox.
On Wednesday, AI video synthesis firm Runway and entertainment company Lionsgate announced a partnership to create a new AI model trained on Lionsgate’s vast film and TV library. The deal will feed Runway legally clear training data and will also reportedly provide Lionsgate with tools to enhance content creation while potentially reducing production costs.
Lionsgate, known for franchises like John Wick and The Hunger Games, sees AI as a way to boost efficiency in content production. Michael Burns, Lionsgate’s vice chair, stated in a press release that AI could help develop “cutting edge, capital efficient content creation opportunities.” He added that some filmmakers have shown enthusiasm about potential applications in pre- and post-production processes.
Runway plans to develop a custom AI model using Lionsgate’s proprietary content portfolio. The model will be exclusive to Lionsgate Studios, allowing filmmakers, directors, and creative staff to augment their work. While specifics remain unclear, the partnership marks the first major collaboration between Runway and a Hollywood studio.
“We’re committed to giving artists, creators and studios the best and most powerful tools to augment their workflows and enable new ways of bringing their stories to life,” said Runway co-founder and CEO Cristóbal Valenzuela in a press release. “The history of art is the history of technology and these new models are part of our continuous efforts to build transformative mediums for artistic and creative expression; the best stories are yet to be told.”
The quest for legal training data
Generative AI models are master imitators, and video synthesis models like Runway’s latest Gen-3 Alpha are no exception. The companies that create them must amass a great deal of existing video (and still image) samples to analyze, allowing the resulting AI models to re-synthesize that information into new video generations, guided by text descriptions called prompts. And wherever that training data is lacking, it can result in unusual generations, as we saw in our hands-on evaluation of Gen-3 Alpha in July.
However, in the past, AI companies have gotten into legal trouble for scraping vast quantities of media without permission. In fact, Runway is currently the defendant in a class-action lawsuit that alleges copyright infringement for using video data obtained without permission to train its video synthesis models. While companies like OpenAI have claimed this scraping process is “fair use,” US courts have not yet definitively ruled on the practice. With other potential legal challenges ahead, it makes sense from Runway’s perspective to reach out and sign deals for training data that is completely in the clear.
Even if the training data becomes fully legal and licensed, different elements of the entertainment industry view generative AI on a spectrum that seems to range between fascination and horror. The technology’s ability to rapidly create images and video based on prompts may attract studios looking to streamline production. However, it raises polarizing concerns among unions about job security, actors and musicians about likeness misuse and ethics, and studios about legal implications.
So far, news of the deal has not been received kindly among vocal AI critics found on social media. On X, filmmaker and AI critic Joe Russo wrote, “I don’t think I’ve ever seen a grosser string of words than: ‘to develop cutting-edge, capital-efficient content creation opportunities.'”
Film concept artist Reid Southen shared a similar negative take on X: “I wonder how the directors and actors of their films feel about having their work fed into the AI to make a proprietary model. As an artist on The Hunger Games? I’m pissed. This is the first step in trying to replace artists and filmmakers.”
It’s a fear that we will likely hear more about in the future as AI video synthesis technology grows more capable—and potentially becomes adopted as a standard filmmaking tool. As studios explore AI applications despite legal uncertainties and labor concerns, partnerships like the Lionsgate-Runway deal may shape the future of content creation in Hollywood.
Enlarge/ Snapshots from three videos generated using OpenAI’s Sora.
On Thursday, OpenAI announced Sora, a text-to-video AI model that can generate 60-second-long photorealistic HD video from written descriptions. While it’s only a research preview that we have not tested, it reportedly creates synthetic video (but not audio yet) at a fidelity and consistency greater than any text-to-video model available at the moment. It’s also freaking people out.
“It was nice knowing you all. Please tell your grandchildren about my videos and the lengths we went to to actually record them,” wrote Wall Street Journal tech reporter Joanna Stern on X.
“This could be the ‘holy shit’ moment of AI,” wrote Tom Warren of The Verge.
“Every single one of these videos is AI-generated, and if this doesn’t concern you at least a little bit, nothing will,” tweeted YouTube tech journalist Marques Brownlee.
For future reference—since this type of panic will some day appear ridiculous—there’s a generation of people who grew up believing that photorealistic video must be created by cameras. When video was faked (say, for Hollywood films), it took a lot of time, money, and effort to do so, and the results weren’t perfect. That gave people a baseline level of comfort that what they were seeing remotely was likely to be true, or at least representative of some kind of underlying truth. Even when the kid jumped over the lava, there was at least a kid and a room.
The prompt that generated the video above: “A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.“
Technology like Sora pulls the rug out from under that kind of media frame of reference. Very soon, every photorealistic video you see online could be 100 percent false in every way. Moreover, every historical video you see could also be false. How we confront that as a society and work around it while maintaining trust in remote communications is far beyond the scope of this article, but I tried my hand at offering some solutions back in 2020, when all of the tech we’re seeing now seemed like a distant fantasy to most people.
In that piece, I called the moment that truth and fiction in media become indistinguishable the “cultural singularity.” It appears that OpenAI is on track to bring that prediction to pass a bit sooner than we expected.
Prompt: Reflections in the window of a train traveling through the Tokyo suburbs.
OpenAI has found that, like other AI models that use the transformer architecture, Sora scales with available compute. Given far more powerful computers behind the scenes, AI video fidelity could improve considerably over time. In other words, this is the “worst” AI-generated video is ever going to look. There’s no synchronized sound yet, but that might be solved in future models.
How (we think) they pulled it off
AI video synthesis has progressed by leaps and bounds over the past two years. We first covered text-to-video models in September 2022 with Meta’s Make-A-Video. A month later, Google showed off Imagen Video. And just 11 months ago, an AI-generated version of Will Smith eating spaghetti went viral. In May of last year, what was previously considered to be the front-runner in the text-to-video space, Runway Gen-2, helped craft a fake beer commercial full of twisted monstrosities, generated in two-second increments. In earlier video-generation models, people pop in and out of reality with ease, limbs flow together like pasta, and physics doesn’t seem to matter.
Sora (which means “sky” in Japanese) appears to be something altogether different. It’s high-resolution (1920×1080), can generate video with temporal consistency (maintaining the same subject over time) that lasts up to 60 seconds, and appears to follow text prompts with a great deal of fidelity. So, how did OpenAI pull it off?
OpenAI doesn’t usually share insider technical details with the press, so we’re left to speculate based on theories from experts and information given to the public.
OpenAI says that Sora is a diffusion model, much like DALL-E 3 and Stable Diffusion. It generates a video by starting off with noise and “gradually transforms it by removing the noise over many steps,” the company explains. It “recognizes” objects and concepts listed in the written prompt and pulls them out of the noise, so to speak, until a coherent series of video frames emerge.
Sora is capable of generating videos all at once from a text prompt, extending existing videos, or generating videos from still images. It achieves temporal consistency by giving the model “foresight” of many frames at once, as OpenAI calls it, solving the problem of ensuring a generated subject remains the same even if it falls out of view temporarily.
OpenAI represents video as collections of smaller groups of data called “patches,” which the company says are similar to tokens (fragments of a word) in GPT-4. “By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions, and aspect ratios,” the company writes.
An important tool in OpenAI’s bag of tricks is that its use of AI models is compounding. Earlier models are helping to create more complex ones. Sora follows prompts well because, like DALL-E 3, it utilizes synthetic captions that describe scenes in the training data generated by another AI model like GPT-4V. And the company is not stopping here. “Sora serves as a foundation for models that can understand and simulate the real world,” OpenAI writes, “a capability we believe will be an important milestone for achieving AGI.”
One question on many people’s minds is what data OpenAI used to train Sora. OpenAI has not revealed its dataset, but based on what people are seeing in the results, it’s possible OpenAI is using synthetic video data generated in a video game engine in addition to sources of real video (say, scraped from YouTube or licensed from stock video libraries). Nvidia’s Dr. Jim Fan, who is a specialist in training AI with synthetic data, wrote on X, “I won’t be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be!” Until confirmed by OpenAI, however, that’s just speculation.