In one case from the study cited by AP, when a speaker described “two other girls and one lady,” Whisper added fictional text specifying that they “were Black.” In another, the audio said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.” Whisper transcribed it to, “He took a big piece of a cross, a teeny, small piece … I’m sure he didn’t have a terror knife so he killed a number of people.”
An OpenAI spokesperson told the AP that the company appreciates the researchers’ findings and that it actively studies how to reduce fabrications and incorporates feedback in updates to the model.
Why Whisper confabulates
The key to Whisper’s unsuitability in high-risk domains comes from its propensity to sometimes confabulate, or plausibly make up, inaccurate outputs. The AP report says, “Researchers aren’t certain why Whisper and similar tools hallucinate,” but that isn’t true. We know exactly why Transformer-based AI models like Whisper behave this way.
Whisper is based on technology that is designed to predict the next most likely token (chunk of data) that should appear after a sequence of tokens provided by a user. In the case of ChatGPT, the input tokens come in the form of a text prompt. In the case of Whisper, the input is tokenized audio data.
The transcription output from Whisper is a prediction of what is most likely, not what is most accurate. Accuracy in Transformer-based outputs is typically proportional to the presence of relevant accurate data in the training dataset, but it is never guaranteed. If there is ever a case where there isn’t enough contextual information in its neural network for Whisper to make an accurate prediction about how to transcribe a particular segment of audio, the model will fall back on what it “knows” about the relationships between sounds and words it has learned from its training data.
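To make the "most likely, not most accurate" point concrete, here is a minimal sketch of greedy next-token selection. It is a toy illustration, not Whisper's actual decoder: the candidate words and their scores are invented for the example, and real systems choose among tens of thousands of tokens and often use more elaborate decoding strategies.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores.values())
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def pick_next_token(scores):
    """Greedy decoding: return the most probable token and its probability."""
    probs = softmax(scores)
    best = max(probs, key=probs.get)
    return best, probs[best]

# Hypothetical scores when the audio clearly supports one word.
clear_audio = {"umbrella": 6.0, "knife": 1.0, "cross": 0.5}
# Hypothetical scores for a pause or noise: no candidate stands out.
unclear_audio = {"umbrella": 1.1, "knife": 1.0, "cross": 0.9}

clear_choice = pick_next_token(clear_audio)
unclear_choice = pick_next_token(unclear_audio)

print(clear_choice)    # a confident pick
print(unclear_choice)  # still a pick, but barely better than a guess
```

In both cases the function returns a word. Nothing in this selection step says "I don't know," so weak evidence still yields fluent output, which is the fallback behavior described above.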
Nevada will soon become the first state to use AI to help speed up the decision-making process when ruling on appeals that impact people’s unemployment benefits.
The state’s Department of Employment, Training, and Rehabilitation (DETR) agreed to pay Google $1,383,838 for the AI technology, a 2024 budget document shows, and it will be launched within the “next several months,” Nevada officials told Gizmodo.
Nevada’s first-of-its-kind AI will rely on a Google cloud service called Vertex AI Studio. Connecting to Google’s servers, the state will fine-tune the AI system to only reference information from DETR’s database, which officials think will ensure its decisions are “more tailored” and the system provides “more accurate results,” Gizmodo reported.
Under the contract, DETR will essentially transfer data from transcripts of unemployment appeals hearings and rulings, after which Google’s AI system will process that data, upload it to the cloud, and then compare the information to previous cases.
In as little as five minutes, the AI will issue a ruling that would’ve taken a state employee about three hours to reach without using AI, DETR’s information technology administrator, Carl Stanfield, told The Nevada Independent. That’s highly valuable to Nevada, which has a backlog of more than 40,000 appeals stemming from a pandemic-related spike in unemployment claims while dealing with “unforeseen staffing shortages” that DETR reported in July.
“The time saving is pretty phenomenal,” Stanfield said.
As a safeguard, a state employee then reviews the AI’s determination to catch any mistakes, biases, or, perhaps worse, hallucinations in which the AI makes up facts that could affect the outcome of a claimant’s case.
Google’s spokesperson Ashley Simms told Gizmodo that the tech giant will work with the state to “identify and address any potential bias” and to “help them comply with federal and state requirements.” According to the state’s AI guidelines, the agency must prioritize ethical use of the AI system, “avoiding biases and ensuring fairness and transparency in decision-making processes.”
If the reviewer accepts the AI ruling, they’ll sign off on it and issue the decision. Otherwise, the reviewer will edit the decision and submit feedback so that DETR can investigate what went wrong.
Gizmodo noted that this novel use of AI “represents a significant experiment by state officials and Google in allowing generative AI to influence a high-stakes government decision—one that could put thousands of dollars in unemployed Nevadans’ pockets or take it away.”
Google declined to comment on whether more states are considering using AI to weigh jobless claims.
On Thursday, DuckDuckGo unveiled a new “AI Chat” service that allows users to converse with four mid-range large language models (LLMs) from OpenAI, Anthropic, Meta, and Mistral in an interface similar to ChatGPT while attempting to preserve privacy and anonymity. While the AI models involved can output inaccurate information readily, the site allows users to test different mid-range LLMs without having to install anything or sign up for an account.
DuckDuckGo’s AI Chat currently features access to OpenAI’s GPT-3.5 Turbo, Anthropic’s Claude 3 Haiku, and two open source models, Meta’s Llama 3 and Mistral’s Mixtral 8x7B. The service is currently free to use within daily limits. Users can access AI Chat through the DuckDuckGo search engine, direct links to the site, or by using “!ai” or “!chat” shortcuts in the search field. AI Chat can also be disabled in the site’s settings for users with accounts.
According to DuckDuckGo, chats on the service are anonymized, with metadata and IP address removed to prevent tracing back to individuals. The company states that chats are not used for AI model training, citing its privacy policy and terms of use.
“We have agreements in place with all model providers to ensure that any saved chats are completely deleted by the providers within 30 days,” says DuckDuckGo, “and that none of the chats made on our platform can be used to train or improve the models.”
However, the privacy experience is not bulletproof because, in the case of GPT-3.5 and Claude Haiku, DuckDuckGo is required to send a user’s inputs to remote servers for processing over the Internet. Given certain inputs (e.g., “Hey, GPT, my name is Bob, and I live on Main Street, and I just murdered Bill”), a user could still potentially be identified if such an extreme need arose.
While the service appears to work well for us, there’s a question about its utility. For example, while GPT-3.5 initially wowed people when it launched with ChatGPT in 2022, it also confabulated a lot, and it still does. GPT-4 was the first major LLM to get confabulations under control to a point where the bot became more reasonably useful for some tasks (though this itself is a controversial point), but that more capable model isn’t present in DuckDuckGo’s AI Chat. Also missing are similar GPT-4-level models like Claude Opus or Google’s Gemini Ultra, likely because they are far more expensive to run. DuckDuckGo says it may roll out paid plans in the future, and those may include higher daily usage limits or access to “more advanced models.”
It’s true that the other three models generally (and subjectively) surpass GPT-3.5 in coding capability and hallucinate less, but they can still make things up, too. With DuckDuckGo AI Chat as it stands, the company is left with a chatbot novelty with a decent interface and the promise that your conversations with it will remain private. But what use are fully private AI conversations if they are full of errors?
As DuckDuckGo itself states in its privacy policy, “By its very nature, AI Chat generates text with limited information. As such, Outputs that appear complete or accurate because of their detail or specificity may not be. For example, AI Chat cannot dynamically retrieve information and so Outputs may be outdated. You should not rely on any Output without verifying its contents using other sources, especially for professional advice (like medical, financial, or legal advice).”
So, have fun talking to bots, but tread carefully. They’ll easily “lie” to your face because they don’t understand what they are saying and are tuned to output statistically plausible information, not factual references.
On Thursday, Google capped off a rough week of providing inaccurate and sometimes dangerous answers through its experimental AI Overview feature by authoring a follow-up blog post titled, “AI Overviews: About last week.” In the post, attributed to Google VP Liz Reid, head of Google Search, the firm formally acknowledged issues with the feature and outlined steps taken to improve a system that appears flawed by design, even if the company doesn’t realize it is admitting as much.
To recap, the AI Overview feature—which the company showed off at Google I/O a few weeks ago—aims to provide search users with summarized answers to questions by using an AI model integrated with Google’s web ranking systems. Right now, it’s an experimental feature that is not active for everyone, but when a participating user searches for a topic, they might see an AI-generated answer at the top of the results, pulled from highly ranked web content and summarized by an AI model.
While Google claims this approach is “highly effective” and on par with its Featured Snippets in terms of accuracy, the past week has seen numerous examples of the AI system generating bizarre, incorrect, or even potentially harmful responses, as we detailed in a recent feature where Ars reporter Kyle Orland replicated many of the unusual outputs.
Drawing inaccurate conclusions from the web
Given the circulating AI Overview examples, Google almost apologizes in the post and says, “We hold ourselves to a high standard, as do our users, so we expect and appreciate the feedback, and take it seriously.” But Reid, in an attempt to justify the errors, then goes into some very revealing detail about why AI Overviews provides erroneous information:
AI Overviews work very differently than chatbots and other LLM products that people may have tried out. They’re not simply generating an output based on training data. While AI Overviews are powered by a customized language model, the model is integrated with our core web ranking systems and designed to carry out traditional “search” tasks, like identifying relevant, high-quality results from our index. That’s why AI Overviews don’t just provide text output, but include relevant links so people can explore further. Because accuracy is paramount in Search, AI Overviews are built to only show information that is backed up by top web results.
This means that AI Overviews generally don’t “hallucinate” or make things up in the ways that other LLM products might.
Here we see the fundamental flaw of the system: “AI Overviews are built to only show information that is backed up by top web results.” The design is based on the false assumption that Google’s page-ranking algorithm favors accurate results and not SEO-gamed garbage. Google Search has been broken for some time, and now the company is relying on those gamed and spam-filled results to feed its new AI model.
Even if the AI model draws from a more accurate source, as with the 1993 game console search seen above, Google’s AI language model can still make inaccurate conclusions about the “accurate” data, confabulating erroneous information in a flawed summary of the information available.
Generally ignoring the folly of basing its AI results on a broken page-ranking algorithm, Google’s blog post instead attributes the commonly circulated errors to several other factors, including users making nonsensical searches “aimed at producing erroneous results.” Google does admit faults with the AI model, like misinterpreting queries, misinterpreting “a nuance of language on the web,” and lacking sufficient high-quality information on certain topics. It also suggests that some of the more egregious examples circulating on social media are fake screenshots.
“Some of these faked results have been obvious and silly,” Reid writes. “Others have implied that we returned dangerous results for topics like leaving dogs in cars, smoking while pregnant, and depression. Those AI Overviews never appeared. So we’d encourage anyone encountering these screenshots to do a search themselves to check.”
(No doubt some of the social media examples are fake, but it’s worth noting that any attempts to replicate those early examples now will likely fail because Google will have manually blocked the results. And it is potentially a testament to how broken Google Search is if people believed extreme fake examples in the first place.)
While addressing the “nonsensical searches” angle in the post, Reid uses the example search, “How many rocks should I eat each day,” which went viral in a tweet on May 23. Reid says, “Prior to these screenshots going viral, practically no one asked Google that question.” And since there isn’t much data on the web that answers it, she says there is a “data void” or “information gap” that was filled by satirical content found on the web, and the AI model found it and pushed it as an answer, much like Featured Snippets might. So basically, it was working exactly as designed.
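The design criticized above can be sketched as a toy pipeline: pick the highest-ranked page, then present its claim as the answer. Everything here is hypothetical, including the `Page` fields, the example URLs, and the ranking scores. It is not Google's system, only an illustration of why "backed up by top web results" does not by itself imply "accurate."

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    rank_score: float  # hypothetical ranking signal, unrelated to truth
    claim: str
    accurate: bool     # ground truth that the ranker never sees

def top_result(pages):
    """Return the page with the highest ranking score."""
    return max(pages, key=lambda p: p.rank_score)

def grounded_answer(pages):
    """Answer only from the top-ranked page, citing it as the source."""
    source = top_result(pages)
    return f"{source.claim} (source: {source.url})"

# A "data void": the only highly ranked page on the topic is satire.
pages = [
    Page("satire.example", 0.92, "a satirical answer", accurate=False),
    Page("reference.example", 0.41, "a factual answer", accurate=True),
]

answer = grounded_answer(pages)
print(answer)
```

The output is faithfully "grounded" in a source and still wrong, because the selection step consults only `rank_score` and never `accurate`.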
OpenAI may finally have to answer for ChatGPT’s “hallucinations” in court after a Georgia judge recently ruled against the tech company’s motion to dismiss a radio host’s defamation suit.
OpenAI had argued that ChatGPT’s output cannot be considered libel, partly because the chatbot output cannot be considered a “publication,” which is a key element of a defamation claim. In its motion to dismiss, OpenAI also argued that Georgia radio host Mark Walters could not prove that the company acted with actual malice or that anyone believed the allegedly libelous statements were true or that he was harmed by the alleged publication.
It’s too early to say whether Judge Tracie Cason found OpenAI’s arguments persuasive. In her order denying OpenAI’s motion to dismiss, which MediaPost shared, Cason did not specify how she arrived at her decision, saying only that she had “carefully” considered arguments and applicable laws.
There may be some clues as to how Cason reached her decision in a court filing from John Monroe, attorney for Walters, when opposing the motion to dismiss last year.
Monroe had argued that OpenAI improperly moved to dismiss the lawsuit by arguing facts that have yet to be proven in court. If OpenAI intended the court to rule on those arguments, Monroe suggested that a motion for summary judgment would have been the proper step at this stage in the proceedings, not a motion to dismiss.
Had OpenAI gone that route, though, Walters would have had an opportunity to present additional evidence. To survive a motion to dismiss, all Walters had to do was show that his complaint was reasonably supported by facts, Monroe argued.
Because OpenAI failed to convince the court that Walters had no case, its legal theories regarding its liability for ChatGPT’s “hallucinations” will now likely face their first test in court.
“We are pleased the court denied the motion to dismiss so that the parties will have an opportunity to explore, and obtain a decision on, the merits of the case,” Monroe told Ars.
What’s the libel case against OpenAI?
Walters sued OpenAI after a journalist, Fred Riehl, warned him that in response to a query, ChatGPT had fabricated an entire lawsuit. Generating an entire complaint with an erroneous case number, ChatGPT falsely claimed that Walters had been accused of defrauding and embezzling funds from the Second Amendment Foundation.
Walters is the host of Armed America Radio and has a reputation as the “Loudest Voice in America Fighting For Gun Rights.” He claimed that OpenAI “recklessly” disregarded whether ChatGPT’s outputs were false, alleging that OpenAI knew that “ChatGPT’s hallucinations were pervasive and severe” and did not work to prevent allegedly libelous outputs. As Walters saw it, the false statements were serious enough to be potentially career-damaging, “tending to injure Walter’s reputation and exposing him to public hatred, contempt, or ridicule.”
Monroe argued that Walters had “adequately stated a claim” of libel, per se, as a private citizen, “for which relief may be granted under Georgia law” where “malice is inferred” in “all actions for defamation” but “may be rebutted” by OpenAI.
Pushing back, OpenAI argued that Walters was a public figure who must prove that OpenAI acted with “actual malice” when allowing ChatGPT to produce allegedly harmful outputs. But Monroe told the court that OpenAI “has not shown sufficient facts to establish that Walters is a general public figure.”
Whether or not Walters is a public figure could be another key question leading Cason to rule against OpenAI’s motion to dismiss.
Perhaps also frustrating the court, OpenAI introduced “a large amount of material” in its motion to dismiss that fell outside the scope of the complaint, Monroe argued. That included pointing to a disclaimer in ChatGPT’s terms of use that warns users that ChatGPT’s responses may not be accurate and should be verified before publishing. According to OpenAI, this disclaimer makes Riehl the “owner” of any libelous ChatGPT responses to his queries.
“A disclaimer does not make an otherwise libelous statement non-libelous,” Monroe argued. And even if the disclaimer made Riehl liable for publishing the ChatGPT output (an argument that may give some ChatGPT users pause before querying), “that responsibility does not have the effect of negating the responsibility of the original publisher of the material,” Monroe argued.
Additionally, OpenAI referenced a conversation between Walters and OpenAI, even though Monroe said that the complaint “does not allege that Walters ever had a chat” with OpenAI. And OpenAI also somewhat oddly argued that ChatGPT outputs could be considered “intra-corporate communications” rather than publications, suggesting that ChatGPT users could be considered private contractors when querying the chatbot.