In one case from the study cited by AP, when a speaker described “two other girls and one lady,” Whisper added fictional text specifying that they “were Black.” In another, the audio said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.” Whisper transcribed it to, “He took a big piece of a cross, a teeny, small piece … I’m sure he didn’t have a terror knife so he killed a number of people.”
An OpenAI spokesperson told the AP that the company appreciates the researchers’ findings and that it actively studies how to reduce fabrications and incorporates feedback in updates to the model.
Why Whisper confabulates
The key to Whisper’s unsuitability in high-risk domains comes from its propensity to sometimes confabulate, or plausibly make up, inaccurate outputs. The AP report says, “Researchers aren’t certain why Whisper and similar tools hallucinate,” but that isn’t true. We know exactly why Transformer-based AI models like Whisper behave this way.
Whisper is based on technology that is designed to predict the next most likely token (chunk of data) that should appear after a sequence of tokens provided by a user. In the case of ChatGPT, the input tokens come in the form of a text prompt. In the case of Whisper, the input is tokenized audio data.
The transcription output from Whisper is a prediction of what is most likely, not what is most accurate. Accuracy in Transformer-based outputs is typically proportional to the presence of relevant accurate data in the training dataset, but it is never guaranteed. If there is ever a case where there isn’t enough contextual information in its neural network for Whisper to make an accurate prediction about how to transcribe a particular segment of audio, the model will fall back on what it “knows” about the relationships between sounds and words it has learned from its training data.
On Thursday, DuckDuckGo unveiled a new “AI Chat” service that allows users to converse with four mid-range large language models (LLMs) from OpenAI, Anthropic, Meta, and Mistral in an interface similar to ChatGPT while attempting to preserve privacy and anonymity. While the AI models involved can output inaccurate information readily, the site allows users to test different mid-range LLMs without having to install anything or sign up for an account.
DuckDuckGo’s AI Chat currently features access to OpenAI’s GPT-3.5 Turbo, Anthropic’s Claude 3 Haiku, and two open source models, Meta’s Llama 3 and Mistral’s Mixtral 8x7B. The service is currently free to use within daily limits. Users can access AI Chat through the DuckDuckGo search engine, direct links to the site, or by using “!ai” or “!chat” shortcuts in the search field. AI Chat can also be disabled in the site’s settings for users with accounts.
According to DuckDuckGo, chats on the service are anonymized, with metadata and IP address removed to prevent tracing back to individuals. The company states that chats are not used for AI model training, citing its privacy policy and terms of use.
“We have agreements in place with all model providers to ensure that any saved chats are completely deleted by the providers within 30 days,” says DuckDuckGo, “and that none of the chats made on our platform can be used to train or improve the models.”
However, the privacy experience is not bulletproof because, in the case of GPT-3.5 and Claude Haiku, DuckDuckGo is required to send a user’s inputs to remote servers for processing over the Internet. Given certain inputs (i.e., “Hey, GPT, my name is Bob, and I live on Main Street, and I just murdered Bill”), a user could still potentially be identified if such an extreme need arose.
While the service appears to work well for us, there’s a question about its utility. For example, while GPT-3.5 initially wowed people when it launched with ChatGPT in 2022, it also confabulated a lot—and it still does. GPT-4 was the first major LLM to get confabulations under control to a point where the bot became more reasonably useful for some tasks (though this itself is a controversial point), but that more capable model isn’t present in DuckDuckGo’s AI Chat. Also missing are similar GPT-4-level models like Claude Opus or Google’s Gemini Ultra, likely because they are far more expensive to run. DuckDuckGo says it may roll out paid plans in the future, and those may include higher daily usage limits or access to “more advanced models.”)
It’s true that the other three models generally (and subjectively) pass GPT-3.5 in capability for coding with lower hallucinations, but they can still make things up, too. With DuckDuckGo AI Chat as it stands, the company is left with a chatbot novelty with a decent interface and the promise that your conversations with it will remain private. But what use are fully private AI conversations if they are full of errors?
As DuckDuckGo itself states in its privacy policy, “By its very nature, AI Chat generates text with limited information. As such, Outputs that appear complete or accurate because of their detail or specificity may not be. For example, AI Chat cannot dynamically retrieve information and so Outputs may be outdated. You should not rely on any Output without verifying its contents using other sources, especially for professional advice (like medical, financial, or legal advice).”
So, have fun talking to bots, but tread carefully. They’ll easily “lie” to your face because they don’t understand what they are saying and are tuned to output statistically plausible information, not factual references.
On Tuesday, product developer Matt Webb launched a Kickstarter funding project for a whimsical e-paper clock called the “Poem/1” that tells the current time using AI and rhyming poetry. It’s powered by the ChatGPT API, and Webb says that sometimes ChatGPT will lie about the time or make up words to make the rhymes work.
“Hey so I made a clock. It tells the time with a brand new poem every minute, composed by ChatGPT. It’s sometimes profound, and sometimes weird, and occasionally it fibs about what the actual time is to make a rhyme work,” Webb writes on his Kickstarter page.
The $126 clock is the product of Webb’s Acts Not Facts, which he bills as “a new studio for technology and product invention via exploring and making.” Despite the net-connected service aspect of the clock, Webb says it will not require a subscription to function.
There are 1,440 minutes in a day, so Poem/1 needs to display 1,440 unique poems to work. The clock features a monochrome e-paper screen and pulls its poetry rhymes via Wi-Fi from a central server run by Webb’s company. To save money, that server pulls poems from ChatGPT’s API and will share them out to many Poem/1 clocks at once. This prevents costly API fees that would add up if your clock were querying OpenAI’s servers 1,440 times a day, non-stop, forever. “I’m reserving a % of the retail price from each clock in a bank account to cover AI and server costs for 5 years,” Webb writes.
For hackers, Webb says that you’ll be able to change the back-end server URL of the Poem/1 from the default to whatever you want, so it can display custom text every minute of the day. Webb says he will document and publish the API when Poem/1 ships.
Hallucination time
Given the Poem/1’s large language model pedigree, it’s perhaps not surprising that Poem/1 may sometimes make up things (also called “hallucination” or “confabulation” in the AI field) to fulfill its task. The LLM that powers ChatGPT is always searching for the most likely next word in a sequence, and sometimes factuality comes second to fulfilling that mission.
Further down on the Kickstarter page, Webb provides a photo of his prototype Poem/1 where the screen reads, “As the clock strikes eleven forty two, / I rhyme the time, as I always do.” Just below, Webb warns, “Poem/1 fibs occasionally. I don’t believe it was actually 11.42 when this photo was taken. The AI hallucinated the time in order to make the poem work. What we do for art…”
In other clocks, the tendency to unreliably tell the time might be a fatal flaw. But judging by his humorous angle on the Kickstarter page, Webb apparently sees the clock as more of a fun art project than a precision timekeeping instrument. “Don’t rely on this clock in situations where timekeeping is vital,” Webb writes, “such as if you work in air traffic control or rocket launches or the finish line of athletics competitions.”
Poem/1 also sometimes takes poetic license with vocabulary to tell the time. During a humorous moment in the Kickstarter promotional video, Webb looks at his clock prototype and reads the rhyme, “A clock that defies all rhyme and reason / 4: 30 PM, a temporal teason.” Then he says, “I had to look ‘teason’ up. It doesn’t mean anything, so it’s a made-up word.”
OpenAI may finally have to answer for ChatGPT’s “hallucinations” in court after a Georgia judge recently ruled against the tech company’s motion to dismiss a radio host’s defamation suit.
OpenAI had argued that ChatGPT’s output cannot be considered libel, partly because the chatbot output cannot be considered a “publication,” which is a key element of a defamation claim. In its motion to dismiss, OpenAI also argued that Georgia radio host Mark Walters could not prove that the company acted with actual malice or that anyone believed the allegedly libelous statements were true or that he was harmed by the alleged publication.
It’s too early to say whether Judge Tracie Cason found OpenAI’s arguments persuasive. In her order denying OpenAI’s motion to dismiss, which MediaPost shared here, Cason did not specify how she arrived at her decision, saying only that she had “carefully” considered arguments and applicable laws.
There may be some clues as to how Cason reached her decision in a court filing from John Monroe, attorney for Walters, when opposing the motion to dismiss last year.
Monroe had argued that OpenAI improperly moved to dismiss the lawsuit by arguing facts that have yet to be proven in court. If OpenAI intended the court to rule on those arguments, Monroe suggested that a motion for summary judgment would have been the proper step at this stage in the proceedings, not a motion to dismiss.
Had OpenAI gone that route, though, Walters would have had an opportunity to present additional evidence. To survive a motion to dismiss, all Walters had to do was show that his complaint was reasonably supported by facts, Monroe argued.
Failing to convince the court that Walters had no case, OpenAI’s legal theories regarding its liability for ChatGPT’s “hallucinations” will now likely face their first test in court.
“We are pleased the court denied the motion to dismiss so that the parties will have an opportunity to explore, and obtain a decision on, the merits of the case,” Monroe told Ars.
What’s the libel case against OpenAI?
Walters sued OpenAI after a journalist, Fred Riehl, warned him that in response to a query, ChatGPT had fabricated an entire lawsuit. Generating an entire complaint with an erroneous case number, ChatGPT falsely claimed that Walters had been accused of defrauding and embezzling funds from the Second Amendment Foundation.
Walters is the host of Armed America Radio and has a reputation as the “Loudest Voice in America Fighting For Gun Rights.” He claimed that OpenAI “recklessly” disregarded whether ChatGPT’s outputs were false, alleging that OpenAI knew that “ChatGPT’s hallucinations were pervasive and severe” and did not work to prevent allegedly libelous outputs. As Walters saw it, the false statements were serious enough to be potentially career-damaging, “tending to injure Walter’s reputation and exposing him to public hatred, contempt, or ridicule.”
Monroe argued that Walters had “adequately stated a claim” of libel, per se, as a private citizen, “for which relief may be granted under Georgia law” where “malice is inferred” in “all actions for defamation” but “may be rebutted” by OpenAI.
Pushing back, OpenAI argued that Walters was a public figure who must prove that OpenAI acted with “actual malice” when allowing ChatGPT to produce allegedly harmful outputs. But Monroe told the court that OpenAI “has not shown sufficient facts to establish that Walters is a general public figure.”
Whether or not Walters is a public figure could be another key question leading Cason to rule against OpenAI’s motion to dismiss.
Perhaps also frustrating the court, OpenAI introduced “a large amount of material” in its motion to dismiss that fell outside the scope of the complaint, Monroe argued. That included pointing to a disclaimer in ChatGPT’s terms of use that warns users that ChatGPT’s responses may not be accurate and should be verified before publishing. According to OpenAI, this disclaimer makes Riehl the “owner” of any libelous ChatGPT responses to his queries.
“A disclaimer does not make an otherwise libelous statement non-libelous,” Monroe argued. And even if the disclaimer made Riehl liable for publishing the ChatGPT output—an argument that may give some ChatGPT users pause before querying—”that responsibility does not have the effect of negating the responsibility of the original publisher of the material,” Monroe argued.
Additionally, OpenAI referenced a conversation between Walters and OpenAI, even though Monroe said that the complaint “does not allege that Walters ever had a chat” with OpenAI. And OpenAI also somewhat oddly argued that ChatGPT outputs could be considered “intra-corporate communications” rather than publications, suggesting that ChatGPT users could be considered private contractors when querying the chatbot.