Olympics

twirling-body-horror-in-gymnastics-video-exposes-ai’s-flaws

Twirling body horror in gymnastics video exposes AI’s flaws


The slithy toves did gyre and gimble in the wabe

Nonsensical jabberwocky movements created by OpenAI’s Sora are typical for current AI-generated video, and here’s why.

A still image from an AI-generated video of an ever-morphing synthetic gymnast. Credit: OpenAI / Deedy

On Wednesday, a video from OpenAI’s newly launched Sora AI video generator went viral on social media, featuring a gymnast who sprouts extra limbs and briefly loses her head during what appears to be an Olympic-style floor routine.

As it turns out, the nonsensical synthesis errors in the video—what we like to call “jabberwockies”—hint at technical details about how AI video generators work and how they might get better in the future.

But before we dig into the details, let’s take a look at the video.

An AI-generated video of an impossible gymnast, created with OpenAI Sora.

In the video, we see a view of what looks like a floor gymnastics routine. The subject of the video flips and flails as new legs and arms rapidly and fluidly emerge and morph out of her twirling and transforming body. At one point, about 9 seconds in, she loses her head, and it reattaches to her body spontaneously.

“As cool as the new Sora is, gymnastics is still very much the Turing test for AI video,” wrote venture capitalist Deedy Das when he originally shared the video on X. The video inspired plenty of reaction jokes, such as this reply to a similar post on Bluesky: “hi, gymnastics expert here! this is not funny, gymnasts only do this when they’re in extreme distress.”

We reached out to Das, and he confirmed that he generated the video using Sora. He also provided the prompt, which was very long and split into four parts, generated by Anthropic’s Claude, using complex instructions like “The gymnast initiates from the back right corner, taking position with her right foot pointed behind in B-plus stance.”

“I’ve known for the last 6 months having played with text to video models that they struggle with complex physics movements like gymnastics,” Das told us in a conversation. “I had to try it [in Sora] because the character consistency seemed improved. Overall, it was an improvement because previously… the gymnast would just teleport away or change their outfit mid flip, but overall it still looks downright horrifying. We hoped AI video would learn physics by default, but that hasn’t happened yet!”

So what went wrong?

When examining how the video fails, you must first consider how Sora “knows” how to create anything that resembles a gymnastics routine. During the training phase, when the Sora model was created, OpenAI fed example videos of gymnastics routines (among many other types of videos) into a specialized neural network that associates the progression of images with text-based descriptions of them.

That type of training is a distinct phase that happens once before the model’s release. Later, when the finished model is running and you give a video-synthesis model like Sora a written prompt, it draws upon statistical associations between words and images to produce a predictive output. It’s continuously making next-frame predictions based on the last frame of the video. But Sora has another trick for attempting to preserve coherency over time. “By giving the model foresight of many frames at a time,” reads OpenAI’s Sora System Card, we’ve solved a challenging problem of making sure a subject stays the same even when it goes out of view temporarily.”

A still image from a moment where the AI-generated gymnast loses her head. It soon re-attaches to her body.

A still image from a moment where the AI-generated gymnast loses her head. It soon reattaches to her body. Credit: OpenAI / Deedy

Maybe not quite solved yet. In this case, rapidly moving limbs prove a particular challenge when attempting to predict the next frame properly. The result is an incoherent amalgam of gymnastics footage that shows the same gymnast performing running flips and spins, but Sora doesn’t know the correct order in which to assemble them because it’s pulling on statistical averages of wildly different body movements in its relatively limited training data of gymnastics videos, which also likely did not include limb-level precision in its descriptive metadata.

Sora doesn’t know anything about physics or how the human body should work, either. It’s drawing upon statistical associations between pixels in the videos in its training dataset to predict the next frame, with a little bit of look-ahead to keep things more consistent.

This problem is not unique to Sora. All AI video generators can produce wildly nonsensical results when your prompts reach too far past their training data, as we saw earlier this year when testing Runway’s Gen-3. In fact, we ran some gymnast prompts through the latest open source AI video model that may rival Sora in some ways, Hunyuan Video, and it produced similar twirling, morphing results, seen below. And we used a much simpler prompt than Das did with Sora.

An example from open source Chinese AI model Hunyuan Video with the prompt, “A young woman doing a complex floor gymnastics routine at the olympics, featuring running and flips.”

AI models based on transformer technology are fundamentally imitative in nature. They’re great at transforming one type of data into another type or morphing one style into another. What they’re not great at (yet) is producing coherent generations that are truly original. So if you happen to provide a prompt that closely matches a training video, you might get a good result. Otherwise, you may get madness.

As we wrote about image-synthesis model Stable Diffusion 3’s body horror generations earlier this year, “Basically, any time a user prompt homes in on a concept that isn’t represented well in the AI model’s training dataset, the image-synthesis model will confabulate its best interpretation of what the user is asking for. And sometimes that can be completely terrifying.”

For the engineers who make these models, success in AI video generation quickly becomes a question of how many examples (and how much training) you need before the model can generalize enough to produce convincing and coherent results. It’s also a question of metadata quality—how accurately the videos are labeled. In this case, OpenAI used an AI vision model to describe its training videos, which helped improve quality, but apparently not enough—yet.

We’re looking at an AI jabberwocky in action

In a way, the type of generation failure in the gymnast video is a form of confabulation (or hallucination, as some call it), but it’s even worse because it’s not coherent. So instead of calling it a confabulation, which is a plausible-sounding fabrication, we’re going to lean on a new term, “jabberwocky,” which Dictionary.com defines as “a playful imitation of language consisting of invented, meaningless words; nonsense; gibberish,” taken from Lewis Carroll’s nonsense poem of the same name. Imitation and nonsense, you say? Check and check.

We’ve covered jabberwockies in AI video before with people mocking Chinese video-synthesis models, a monstrously weird AI beer commercial, and even Will Smith eating spaghetti. They’re a form of misconfabulation where an AI model completely fails to produce a plausible output. This will not be the last time we see them, either.

How could AI video models get better and avoid jabberwockies?

In our coverage of Gen-3 Alpha, we called the threshold where you get a level of useful generalization in an AI model the “illusion of understanding,” where training data and training time reach a critical mass that produces good enough results to generalize across enough novel prompts.

One of the key reasons language models like OpenAI’s GPT-4 impressed users was that they finally reached a size where they had absorbed enough information to give the appearance of genuinely understanding the world. With video synthesis, achieving this same apparent level of “understanding” will require not just massive amounts of well-labeled training data but also the computational power to process it effectively.

AI boosters hope that these current models represent one of the key steps on the way to something like truly general intelligence (often called AGI) in text, or in AI video, what OpenAI and Runway researchers call “world simulators” or “world models” that somehow encode enough physics rules about the world to produce any realistic result.

Judging by the morphing alien shoggoth gymnast, that may still be a ways off. Still, it’s early days in AI video generation, and judging by how quickly AI image-synthesis models like Midjourney progressed from crude abstract shapes into coherent imagery, it’s likely video synthesis will have a similar trajectory over time. Until then, enjoy the AI-generated jabberwocky madness.

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

Twirling body horror in gymnastics video exposes AI’s flaws Read More »

ai-generated-al-michaels-to-provide-daily-recaps-during-2024-summer-olympics

AI-generated Al Michaels to provide daily recaps during 2024 Summer Olympics

forever young —

AI voice clone will narrate daily Olympics video recaps; critics call it a “code-generated ghoul.”

Al Michaels looks on prior to the game between the Minnesota Vikings and Philadelphia Eagles at Lincoln Financial Field on September 14, 2023 in Philadelphia, Pennsylvania.

Enlarge / Al Michaels looks on prior to the game between the Minnesota Vikings and Philadelphia Eagles at Lincoln Financial Field on September 14, 2023, in Philadelphia, Pennsylvania.

On Wednesday, NBC announced plans to use an AI-generated clone of famous sports commentator Al Michaels‘ voice to narrate daily streaming video recaps of the 2024 Summer Olympics in Paris, which start on July 26. The AI-powered narration will feature in “Your Daily Olympic Recap on Peacock,” NBC’s streaming service. But this new, high-profile use of voice cloning worries critics, who say the technology may muscle out upcoming sports commentators by keeping old personas around forever.

NBC says it has created a “high-quality AI re-creation” of Michaels’ voice, trained on Michaels’ past NBC appearances to capture his distinctive delivery style.

The veteran broadcaster, revered in the sports commentator world for his iconic “Do you believe in miracles? Yes!” call during the 1980 Winter Olympics, has been covering sports on TV since 1971, including a high-profile run of play-by-play coverage of NFL football games for both ABC and NBC since the 1980s. NBC dropped him from NFL coverage in 2023, however, possibly due to his age.

Michaels, who is 79 years old, shared his initial skepticism about the project in an interview with Vanity Fair, as NBC News notes. After hearing the AI version of his voice, which can greet viewers by name, he described the experience as “astonishing” and “a little bit frightening.” He said the AI recreation was “almost 2% off perfect” in mimicking his style.

The Vanity Fair article provides some insight into how NBC’s new AI system works. It first uses a large language model (similar technology to what powers ChatGPT) to analyze subtitles and metadata from NBC’s Olympics video coverage, summarizing events and writing custom output to imitate Michaels’ style. This text is then fed into an unspecified voice AI model trained on Michaels’ previous NBC appearances, reportedly replicating his unique pronunciations and intonations.

NBC estimates that the system could generate nearly 7 million personalized variants of the recaps across the US during the games, pulled from the network’s 5,000 hours of live coverage. Using the system, each Peacock user will receive about 10 minutes of personalized highlights.

A diminished role for humans in the future?

Al Michaels reports on the Sweden vs. USA men's ice hockey game at the 1980 Olympic Winter Games on February 12, 1980.

Enlarge / Al Michaels reports on the Sweden vs. USA men’s ice hockey game at the 1980 Olympic Winter Games on February 12, 1980.

It’s no secret that while AI is wildly hyped right now, it’s also controversial among some. Upon hearing the NBC announcement, critics of AI technology reacted strongly. “@NBCSports, this is gross,” tweeted actress and filmmaker Justine Bateman, who frequently uses X to criticize technologies that might replace human writers or performers in the future.

A thread of similar responses from X users reacting to the sample video provided above included criticisms such as, “Sounds pretty off when it’s just the same tone for every single word.” Another user wrote, “It just sounds so unnatural. No one talks like that.”

The technology will not replace NBC’s regular human sports commentators during this year’s Olympics coverage, and like other forms of AI, it leans heavily on existing human work by analyzing and regurgitating human-created content in the form of captions pulled from NBC footage.

Looking down the line, due to AI media cloning technologies like voice, video, and image synthesis, today’s celebrities may be able to attain a form of media immortality that allows new iterations of their likenesses to persist through the generations, potentially earning licensing fees for whoever holds the rights.

We’ve already seen it with James Earl Jones playing Darth Vader’s voice, and the trend will likely continue with other celebrity voices, provided the money is right. Eventually, it may extend to famous musicians through music synthesis and famous actors in video-synthesis applications as well.

The possibility of being muscled out by AI replicas factored heavily into a Hollywood actors’ strike last year, with SAG-AFTRA union President Fran Drescher saying, “If we don’t stand tall right now, we are all going to be in trouble. We are all going to be in jeopardy of being replaced by machines.”

For companies that like to monetize media properties for as long as possible, AI may provide a way to maintain a media legacy through automation. But future human performers may have to compete against all of the greatest performers of the past, rendered through AI, to break out and forge a new career—provided there will be room for human performers at all.

Al Michaels became Al Michaels because he was brought in to replace people who died, or retired, or moved on,” tweeted a writer named Geonn Cannon on X. “If he can’t do the job anymore, it’s time to let the next Al Michaels have a shot at it instead of just planting a code-generated ghoul in an empty chair.

AI-generated Al Michaels to provide daily recaps during 2024 Summer Olympics Read More »