AI confabulation

Twirling body horror in gymnastics video exposes AI’s flaws


The slithy toves did gyre and gimble in the wabe

Nonsensical jabberwocky movements created by OpenAI’s Sora are typical for current AI-generated video, and here’s why.

A still image from an AI-generated video of an ever-morphing synthetic gymnast. Credit: OpenAI / Deedy

On Wednesday, a video from OpenAI’s newly launched Sora AI video generator went viral on social media, featuring a gymnast who sprouts extra limbs and briefly loses her head during what appears to be an Olympic-style floor routine.

As it turns out, the nonsensical synthesis errors in the video—what we like to call “jabberwockies”—hint at technical details about how AI video generators work and how they might get better in the future.

But before we dig into the details, let’s take a look at the video.

An AI-generated video of an impossible gymnast, created with OpenAI Sora.

In the video, we see what looks like a floor gymnastics routine. The subject flips and flails as new legs and arms rapidly and fluidly emerge and morph out of her twirling, transforming body. At one point, about 9 seconds in, she loses her head, and it spontaneously reattaches to her body.

“As cool as the new Sora is, gymnastics is still very much the Turing test for AI video,” wrote venture capitalist Deedy Das when he originally shared the video on X. The video inspired plenty of reaction jokes, such as this reply to a similar post on Bluesky: “hi, gymnastics expert here! this is not funny, gymnasts only do this when they’re in extreme distress.”

We reached out to Das, and he confirmed that he generated the video using Sora. He also provided the prompt, which was very long, split into four parts, and generated by Anthropic’s Claude, with complex instructions like “The gymnast initiates from the back right corner, taking position with her right foot pointed behind in B-plus stance.”

“I’ve known for the last 6 months having played with text to video models that they struggle with complex physics movements like gymnastics,” Das told us in a conversation. “I had to try it [in Sora] because the character consistency seemed improved. Overall, it was an improvement because previously… the gymnast would just teleport away or change their outfit mid flip, but overall it still looks downright horrifying. We hoped AI video would learn physics by default, but that hasn’t happened yet!”

So what went wrong?

To understand how the video fails, you must first consider how Sora “knows” how to create anything that resembles a gymnastics routine. During the training phase, when the Sora model was created, OpenAI fed example videos of gymnastics routines (among many other types of videos) into a specialized neural network that associates the progression of images with text-based descriptions of them.

That type of training is a distinct phase that happens once before the model’s release. Later, when the finished model is running and you give a video-synthesis model like Sora a written prompt, it draws upon statistical associations between words and images to produce a predictive output. It’s continuously making next-frame predictions based on the last frame of the video. But Sora has another trick for attempting to preserve coherency over time. “By giving the model foresight of many frames at a time,” reads OpenAI’s Sora System Card, “we’ve solved a challenging problem of making sure a subject stays the same even when it goes out of view temporarily.”
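
To make that abstract description concrete, here is a minimal conceptual sketch of the kind of prediction loop described above: each new frame is conditioned on the text prompt plus a rolling window of recent frames. It is illustrative only; the ToyVideoModel, its method, and the window size are hypothetical placeholders, not OpenAI’s actual (and largely undisclosed) architecture.

```python
# Minimal conceptual sketch (not OpenAI's actual architecture): each new frame
# is predicted from the text prompt plus a rolling window of recent frames.
# ToyVideoModel is a hypothetical stand-in so the loop runs end to end.

from collections import deque

class ToyVideoModel:
    def predict_frame(self, prompt, context_frames):
        # A real model would return the statistically most likely next image
        # given the prompt and the recent frames; here we just return a label.
        return f"frame_{len(context_frames)}_for_{prompt!r}"

def generate_video(model, prompt, first_frame, num_frames, context_size=16):
    # "Foresight of many frames at a time" amounts to conditioning on a
    # window of recent frames rather than only the single last frame.
    context = deque([first_frame], maxlen=context_size)
    frames = [first_frame]
    for _ in range(num_frames - 1):
        # Each frame is a prediction from pixel statistics, not from any
        # internal model of anatomy or physics, which is why limbs can
        # appear, vanish, or reattach between frames.
        next_frame = model.predict_frame(prompt, list(context))
        frames.append(next_frame)
        context.append(next_frame)
    return frames

print(generate_video(ToyVideoModel(), "gymnast floor routine", "frame_0", 5))
```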

A still image from a moment where the AI-generated gymnast loses her head. It soon reattaches to her body. Credit: OpenAI / Deedy

Maybe not quite solved yet. In this case, rapidly moving limbs prove particularly challenging for next-frame prediction. The result is an incoherent amalgam of gymnastics footage: the same gymnast performs running flips and spins, but Sora doesn’t know the correct order in which to assemble them, because it’s drawing on statistical averages of wildly different body movements in its relatively limited training data of gymnastics videos, which also likely did not include limb-level precision in its descriptive metadata.

Sora doesn’t know anything about physics or how the human body should work, either. It’s drawing upon statistical associations between pixels in the videos in its training dataset to predict the next frame, with a little bit of look-ahead to keep things more consistent.

This problem is not unique to Sora. All AI video generators can produce wildly nonsensical results when your prompts reach too far past their training data, as we saw earlier this year when testing Runway’s Gen-3. In fact, we ran some gymnast prompts through the latest open source AI video model that may rival Sora in some ways, Hunyuan Video, and it produced similar twirling, morphing results, seen below. And we used a much simpler prompt than Das did with Sora.

An example from open source Chinese AI model Hunyuan Video with the prompt, “A young woman doing a complex floor gymnastics routine at the olympics, featuring running and flips.”

AI models based on transformer technology are fundamentally imitative in nature. They’re great at transforming one type of data into another type or morphing one style into another. What they’re not great at (yet) is producing coherent generations that are truly original. So if you happen to provide a prompt that closely matches a training video, you might get a good result. Otherwise, you may get madness.

As we wrote about image-synthesis model Stable Diffusion 3’s body horror generations earlier this year, “Basically, any time a user prompt homes in on a concept that isn’t represented well in the AI model’s training dataset, the image-synthesis model will confabulate its best interpretation of what the user is asking for. And sometimes that can be completely terrifying.”

For the engineers who make these models, success in AI video generation quickly becomes a question of how many examples (and how much training) you need before the model can generalize enough to produce convincing and coherent results. It’s also a question of metadata quality—how accurately the videos are labeled. In this case, OpenAI used an AI vision model to describe its training videos, which helped improve quality, but apparently not enough—yet.
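
As a rough illustration of what “using an AI vision model to describe training videos” can look like in practice, here is a hedged sketch that samples frames from a clip and captions them with an off-the-shelf image-captioning model. OpenAI’s actual captioning pipeline is not public; the BLIP model, the Hugging Face pipeline, the sampling rate, and the filename below are stand-ins chosen for illustration.

```python
# A rough sketch of automated video captioning: sample frames from a training
# clip and describe each one with an off-the-shelf image-captioning model.
# OpenAI's captioning pipeline is not public; the BLIP model, Hugging Face
# pipeline, sampling rate, and filename below are illustrative stand-ins.

import cv2
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def caption_video(path, every_n_frames=30):
    """Return coarse text descriptions for frames sampled from a video."""
    cap = cv2.VideoCapture(path)
    captions, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            # OpenCV reads BGR arrays; the captioner expects an RGB image.
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            captions.append(captioner(image)[0]["generated_text"])
        index += 1
    cap.release()
    return captions

print(caption_video("gymnastics_routine.mp4"))  # hypothetical training clip
```

Frame-level captions like “a woman doing a flip in a gym” capture the gist of a clip but not which limb is where, which is one reason limb-level metadata is hard to come by.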

We’re looking at an AI jabberwocky in action

In a way, the type of generation failure in the gymnast video is a form of confabulation (or hallucination, as some call it), but it’s even worse because it’s not coherent. So instead of calling it a confabulation, which is a plausible-sounding fabrication, we’re going to lean on a new term, “jabberwocky,” which Dictionary.com defines as “a playful imitation of language consisting of invented, meaningless words; nonsense; gibberish,” taken from Lewis Carroll’s nonsense poem of the same name. Imitation and nonsense, you say? Check and check.

We’ve covered jabberwockies in AI video before with people mocking Chinese video-synthesis models, a monstrously weird AI beer commercial, and even Will Smith eating spaghetti. They’re a form of misconfabulation where an AI model completely fails to produce a plausible output. This will not be the last time we see them, either.

How could AI video models get better and avoid jabberwockies?

In our coverage of Gen-3 Alpha, we called the threshold at which an AI model achieves a useful level of generalization the “illusion of understanding”: the point where training data and training time reach a critical mass that produces good enough results across enough novel prompts.

One of the key reasons language models like OpenAI’s GPT-4 impressed users was that they finally reached a size where they had absorbed enough information to give the appearance of genuinely understanding the world. With video synthesis, achieving this same apparent level of “understanding” will require not just massive amounts of well-labeled training data but also the computational power to process it effectively.

AI boosters hope that these current models represent one of the key steps on the way to something like truly general intelligence (often called AGI) in text or, in AI video, to what OpenAI and Runway researchers call “world simulators” or “world models” that somehow encode enough physics rules about the world to produce any realistic result.

Judging by the morphing alien shoggoth gymnast, that may still be a ways off. Still, it’s early days in AI video generation, and judging by how quickly AI image-synthesis models like Midjourney progressed from crude abstract shapes into coherent imagery, it’s likely video synthesis will have a similar trajectory over time. Until then, enjoy the AI-generated jabberwocky madness.

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

Hospitals adopt error-prone AI transcription tools despite warnings

In one case from the study cited by AP, when a speaker described “two other girls and one lady,” Whisper added fictional text specifying that they “were Black.” In another, the audio said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.” Whisper transcribed it to, “He took a big piece of a cross, a teeny, small piece … I’m sure he didn’t have a terror knife so he killed a number of people.”

An OpenAI spokesperson told the AP that the company appreciates the researchers’ findings and that it actively studies how to reduce fabrications and incorporates feedback in updates to the model.

Why Whisper confabulates

Whisper’s unsuitability for high-risk domains stems from its propensity to sometimes confabulate, or plausibly make up, inaccurate outputs. The AP report says, “Researchers aren’t certain why Whisper and similar tools hallucinate,” but that isn’t true. We know exactly why Transformer-based AI models like Whisper behave this way.

Whisper is based on technology that is designed to predict the next most likely token (chunk of data) that should appear after a sequence of tokens provided by a user. In the case of ChatGPT, the input tokens come in the form of a text prompt. In the case of Whisper, the input is tokenized audio data.

The transcription output from Whisper is a prediction of what is most likely, not what is most accurate. Accuracy in Transformer-based outputs is typically proportional to the presence of relevant, accurate data in the training dataset, but it is never guaranteed. When there isn’t enough contextual information in its neural network for Whisper to make an accurate prediction about how to transcribe a particular segment of audio, the model falls back on what it “knows” about the relationships between sounds and words it has learned from its training data.
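
Those predictions come with the model’s own uncertainty signals, even though they never appear in a finished transcript. Below is a minimal sketch using the open source openai-whisper package that flags low-confidence segments, the stretches where confabulated text is most likely; the audio filename and the threshold values are illustrative assumptions, not official guidance.

```python
# A minimal sketch using the open source `openai-whisper` package: transcribe
# a file and flag segments where the model itself reports low confidence.
# The filename and thresholds are illustrative assumptions, not official guidance.

import whisper

model = whisper.load_model("base")
result = model.transcribe("clinic_visit.wav")  # hypothetical audio file

for segment in result["segments"]:
    # avg_logprob (average token log-probability) and no_speech_prob are the
    # model's own confidence signals; a low former value or a high latter value
    # marks stretches where fabricated text is more likely.
    suspicious = segment["avg_logprob"] < -1.0 or segment["no_speech_prob"] > 0.5
    flag = "  <-- verify against the audio" if suspicious else ""
    print(f"[{segment['start']:6.1f}s - {segment['end']:6.1f}s] {segment['text']}{flag}")
```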

OpenAI releases ChatGPT app for Windows

On Thursday, OpenAI released an early version of its first ChatGPT app for Windows, following a Mac version that launched in May. Currently, it’s only available to subscribers of the Plus, Team, Enterprise, and Edu versions of ChatGPT, and users can download it for free from the Microsoft Store.

OpenAI is positioning the release as a beta test. “This is an early version, and we plan to bring the full experience to all users later this year,” OpenAI writes on the Microsoft Store entry for the app. (Interestingly, ChatGPT shows up as being rated “T for Teen” by the ESRB in the Windows store, despite not being a video game.)

A screenshot of the new Windows ChatGPT app captured on October 18, 2024. Credit: Benj Edwards

Upon opening the app, users must log into a paying ChatGPT account, and from there, the app is basically identical to the web browser version of ChatGPT. You can currently use it to access several models: GPT-4o, GPT-4o with Canvas, o1-preview, o1-mini, GPT-4o mini, and GPT-4. It can also generate images using DALL-E 3 or analyze uploaded files and images.

If you’re running Windows 11, you can instantly call up a small ChatGPT window when the app is open using an Alt+Space shortcut (it did not work in Windows 10 when we tried). That could be handy for asking ChatGPT a quick question at any time.

A screenshot of the new Windows ChatGPT app listing in the Microsoft Store captured on October 18, 2024. Credit: Benj Edwards

And just like the web version, all the AI processing takes place in the cloud on OpenAI’s servers, which means an Internet connection is required.

So as usual, chat like somebody’s watching, and don’t rely on ChatGPT as a factual reference for important decisions—GPT-4o in particular is great at telling you what you want to hear, whether it’s correct or not. As OpenAI says in a small disclaimer at the bottom of the app window: “ChatGPT can make mistakes.”

DuckDuckGo offers “anonymous” access to AI chatbots through new service

anonymous confabulations

DDG offers LLMs from OpenAI, Anthropic, Meta, and Mistral for factually iffy conversations.

DuckDuckGo’s AI Chat promotional image. Credit: DuckDuckGo

On Thursday, DuckDuckGo unveiled a new “AI Chat” service that allows users to converse with four mid-range large language models (LLMs) from OpenAI, Anthropic, Meta, and Mistral in an interface similar to ChatGPT while attempting to preserve privacy and anonymity. While the AI models involved can readily output inaccurate information, the site allows users to test different mid-range LLMs without having to install anything or sign up for an account.

DuckDuckGo’s AI Chat currently features access to OpenAI’s GPT-3.5 Turbo, Anthropic’s Claude 3 Haiku, and two open source models, Meta’s Llama 3 and Mistral’s Mixtral 8x7B. The service is currently free to use within daily limits. Users can access AI Chat through the DuckDuckGo search engine, direct links to the site, or by using “!ai” or “!chat” shortcuts in the search field. AI Chat can also be disabled in the site’s settings for users with accounts.

According to DuckDuckGo, chats on the service are anonymized, with metadata and IP address removed to prevent tracing back to individuals. The company states that chats are not used for AI model training, citing its privacy policy and terms of use.

“We have agreements in place with all model providers to ensure that any saved chats are completely deleted by the providers within 30 days,” says DuckDuckGo, “and that none of the chats made on our platform can be used to train or improve the models.”

An example of DuckDuckGo AI Chat with GPT-3.5 answering a silly question in an inaccurate way. Credit: Benj Edwards

However, the privacy experience is not bulletproof because, in the case of GPT-3.5 and Claude Haiku, DuckDuckGo is required to send a user’s inputs to remote servers for processing over the Internet. Given certain inputs (e.g., “Hey, GPT, my name is Bob, and I live on Main Street, and I just murdered Bill”), a user could still potentially be identified if such an extreme need arose.

While the service appears to work well for us, there’s a question about its utility. For example, while GPT-3.5 initially wowed people when it launched with ChatGPT in 2022, it also confabulated a lot—and it still does. GPT-4 was the first major LLM to get confabulations under control to a point where the bot became more reasonably useful for some tasks (though this itself is a controversial point), but that more capable model isn’t present in DuckDuckGo’s AI Chat. Also missing are similar GPT-4-level models like Claude Opus or Google’s Gemini Ultra, likely because they are far more expensive to run. DuckDuckGo says it may roll out paid plans in the future, and those may include higher daily usage limits or access to “more advanced models.”

It’s true that the other three models generally (and subjectively) surpass GPT-3.5 in coding capability and hallucinate less, but they can still make things up. As it stands, DuckDuckGo AI Chat is a chatbot novelty with a decent interface and the promise that your conversations with it will remain private. But what use are fully private AI conversations if they are full of errors?

Mixtral 8x7B on DuckDuckGo AI Chat when asked about the author. Everything in red boxes is sadly incorrect, but it provides an interesting fantasy scenario. It’s a good example of an LLM plausibly filling gaps between concepts that are underrepresented in its training data, called confabulation. For the record, Llama 3 gives a more accurate answer. Credit: Benj Edwards

As DuckDuckGo itself states in its privacy policy, “By its very nature, AI Chat generates text with limited information. As such, Outputs that appear complete or accurate because of their detail or specificity may not be. For example, AI Chat cannot dynamically retrieve information and so Outputs may be outdated. You should not rely on any Output without verifying its contents using other sources, especially for professional advice (like medical, financial, or legal advice).”

So, have fun talking to bots, but tread carefully. They’ll easily “lie” to your face because they don’t understand what they are saying and are tuned to output statistically plausible information, not factual references.
