AI

These AI-generated news anchors are freaking me out

Max Headroom as prophecy.

Aurich Lawson | Channel 1

Here at Ars, we’ve long covered the interesting potential and significant peril (and occasional silliness) of AI-generated video featuring increasingly realistic human avatars. Heck, we even went to the trouble of making our own “deepfake” Mark Zuckerberg in 2019, when the underlying technology wasn’t nearly as robust as it is today.

But even with all that background, startup Channel 1’s vision of a near-future where AI-generated avatars read you the news was a bit of a shock to the system. The company’s recent proof-of-concept “showcase” newscast reveals just how far AI-generated videos of humans have come in a short time and how those realistic avatars could shake up a lot more than just the job market for talking heads.

“…the newscasters have been changed to protect the innocent”

See the highest quality AI footage in the world.

🤯 – Our generated anchors deliver stories that are informative, heartfelt and entertaining.

Watch the showcase episode of our upcoming news network now. pic.twitter.com/61TaG6Kix3

— Channel 1 (@channel1_ai) December 12, 2023

To be clear, Channel 1 isn’t trying to fool people with “deepfakes” of existing news anchors or anything like that. In the first few seconds of its sample newscast, it identifies its talking heads as a “team of AI-generated reporters.” A few seconds later, one of those talking heads explains further: “You can hear us and see our lips moving, but no one was recorded saying what we’re all saying. I’m powered by sophisticated systems behind the scenes.”

Even with those kinds of warnings, I found I had to constantly remind myself that the “people” I was watching deliver the news here were only “based on real people who have been compensated for use of their likeness,” as Deadline reports (how much they were compensated will probably be of great concern to actors who recently went on strike in part over the issue of AI likenesses). Everything from the lip-syncing to the intonations to subtle gestures and body movements of these Channel 1 anchors gives an eerily convincing presentation of a real newscaster talking into the camera.

Sure, if you look closely, there are a few telltale anomalies that expose these reporters as computer creations—slight video distortions around the mouth, say, or overly repetitive hand gestures, or a nonsensical word emphasis choice. But those signs are so small that they would be easy to miss at a casual glance or on a small screen like that on a phone.

In other words, human-looking AI avatars now seem well on their way to climbing out of the uncanny valley, at least when it comes to news anchors who sit at a desk or stand still in front of a green screen. Channel 1 investor Adam Mosam told Deadline it “has gotten to a place where it’s comfortable to watch,” and I have to say I agree.

A Channel 1 clip shows how its system can make video sources appear to speak a different language.

The same technology can be applied to on-the-scene news videos as well. About eight minutes into the sample newscast, Channel 1 shows a video of a European tropical storm victim describing the wreckage in French. Then it shows an AI-generated version of the same footage with the source speaking perfect English, using a facsimile of his original voice and artificial lipsync placed over his mouth.

Without the on-screen warning that this was “AI generated Language: Translated from French,” it would be easy to believe that the video was of an American expatriate rather than a native French speaker. And the effect is much more dramatic than the usual TV news practice of having an unseen interpreter speak over the footage.

If AI is making the Turing test obsolete, what might be better?

A white android sitting at a table in a depressed manner with an alcoholic drink. Very high resolution 3D render.

If a machine or an AI program matches or surpasses human intelligence, does that mean it can simulate humans perfectly? If yes, then what about reasoning—our ability to apply logic and think rationally before making decisions? How could we even identify whether an AI program can reason? To try to answer this question, a team of researchers has proposed a novel framework that works like a psychological study for software.

“This test treats an ‘intelligent’ program as though it were a participant in a psychological study and has three steps: (a) test the program in a set of experiments examining its inferences, (b) test its understanding of its own way of reasoning, and (c) examine, if possible, the cognitive adequacy of the source code for the program,” the researchers note.

They suggest the standard methods of evaluating a machine’s intelligence, such as the Turing Test, can only tell you if the machine is good at processing information and mimicking human responses. The current generations of AI programs, such as Google’s LaMDA and OpenAI’s ChatGPT, for example, have come close to passing the Turing Test, yet the test results don’t imply these programs can think and reason like humans.

This is why the Turing Test may no longer be relevant, and there is a need for new evaluation methods that could effectively assess the intelligence of machines, according to the researchers. They claim that their framework could be an alternative to the Turing Test. “We propose to replace the Turing test with a more focused and fundamental one to answer the question: do programs reason in the way that humans reason?” the study authors argue.

What’s wrong with the Turing Test?

During the Turing Test, evaluators play different games involving text-based communications with real humans and AI programs (machines or chatbots). It is a blind test, so evaluators don’t know whether they are texting with a human or a chatbot. If the AI programs are successful in generating human-like responses—to the extent that evaluators struggle to distinguish between the human and the AI program—the AI is considered to have passed. However, since the Turing Test is based on subjective interpretation, these results are also subjective.

The researchers suggest that there are several limitations associated with the Turing Test. For instance, all of the games played during the test are imitation games designed to test whether or not a machine can imitate a human. The evaluators make decisions solely based on the language or tone of messages they receive. ChatGPT is great at mimicking human language, even in responses where it gives out incorrect information. So, the test clearly doesn’t evaluate a machine’s reasoning and logical ability.

The results of the Turing Test also can’t tell you if a machine can introspect. We often think about our past actions and reflect on our lives and decisions, a critical ability that prevents us from repeating the same mistakes. The same applies to AI as well, according to a study from Stanford University which suggests that machines that could self-reflect are more practical for human use.

“AI agents that can leverage prior experience and adapt well by efficiently exploring new or changing environments will lead to much more adaptive, flexible technologies, from household robotics to personalized learning tools,” Nick Haber, an assistant professor from Stanford University who was not involved in the current study, said.

In addition to this, the Turing Test fails to analyze an AI program’s ability to think. In a recent Turing Test experiment, GPT-4 was able to convince evaluators that they were texting with humans over 40 percent of the time. However, this score fails to answer the basic question: Can the AI program think?

Alan Turing, the famous British scientist who created the Turing Test, once said, “A computer would deserve to be called intelligent if it could deceive a human into believing that it was human.” His test only covers one aspect of human intelligence, though: imitation. Although it is possible to deceive someone using this one aspect, many experts believe that a machine can never achieve true human intelligence without including those other aspects.

“It’s unclear whether passing the Turing Test is a meaningful milestone or not. It doesn’t tell us anything about what a system can do or understand, anything about whether it has established complex inner monologues or can engage in planning over abstract time horizons, which is key to human intelligence,” Mustafa Suleyman, an AI expert and co-founder of DeepMind, told Bloomberg.

Humana also using AI tool with 90% error rate to deny care, lawsuit claims

AI denials —

The AI model, nH Predict, is the focus of another lawsuit against UnitedHealth.

Signage is displayed outside the Humana Inc. office building in Louisville, Kentucky, US, in 2016.

Humana, one of the nation’s largest health insurance providers, is allegedly using an artificial intelligence model with a 90 percent error rate to override doctors’ medical judgment and wrongfully deny care to elderly people on the company’s Medicare Advantage plans.

According to a lawsuit filed Tuesday, Humana’s use of the AI model constitutes a “fraudulent scheme” that leaves elderly beneficiaries with either overwhelming medical debt or without needed care that is covered by their plans. Meanwhile, the insurance behemoth reaps a “financial windfall.”

The lawsuit, filed in the US District Court in western Kentucky, is led by two people who had a Humana Medicare Advantage Plan policy and said they were wrongfully denied needed and covered care, harming their health and finances. The suit seeks class-action status for an unknown number of other beneficiaries nationwide who may be in similar situations. Humana provides Medicare Advantage plans for 5.1 million people in the US.

It is the second lawsuit aimed at an insurer’s use of the AI tool nH Predict, which was developed by NaviHealth to forecast how long patients will need care after a medical injury, illness, or event. In November, the estates of two deceased individuals brought a suit against UnitedHealth—the largest health insurance company in the US—for also allegedly using nH Predict to wrongfully deny care.

Humana did not respond to Ars’ request for comment for this story. UnitedHealth previously said that “the lawsuit has no merit, and we will defend ourselves vigorously.”

AI model

In both cases, the plaintiffs claim that the insurers use the flawed model to pinpoint the exact date to blindly and illegally cut off payments for post-acute care that is covered under Medicare plans—such as stays in skilled nursing facilities and inpatient rehabilitation centers. The AI-powered model comes up with those dates by comparing a patient’s diagnosis, age, living situation, and physical function to similar patients in a database of 6 million patients. In turn, the model spits out a prediction for the patient’s medical needs, length of stay, and discharge date.

But the plaintiffs argue that the model fails to account for the entirety of each patient’s circumstances, their doctors’ recommendations, and the patient’s actual conditions. And they claim the predictions are draconian and inflexible. For example, under Medicare Advantage plans, patients who have a three-day hospital stay are typically entitled to up to 100 days of covered care in a nursing home. But with nH Predict in use, patients rarely stay in a nursing home for more than 14 days before claim denials begin.

Though few people appeal coverage denials generally, of those who have appealed the AI-based denials, over 90 percent have gotten the denial reversed, the lawsuits say.

Still, the insurers continue to use the model, and NaviHealth employees are instructed to hew closely to the AI-based predictions, keeping lengths of post-acute care to within 1 percent of the days estimated by nH Predict. NaviHealth employees who fail to do so face discipline and firing. “Humana banks on the patients’ impaired conditions, lack of knowledge, and lack of resources to appeal the wrongful AI-powered decisions,” the lawsuit filed Tuesday claims.

Plaintiffs’ cases

One of the plaintiffs in Tuesday’s suit is JoAnne Barrows of Minnesota. On November 23, 2021, Barrows, then 86, was admitted to a hospital after falling at home and fracturing her leg. Doctors put her leg in a cast and issued an order not to put any weight on it for six weeks. On November 26, she was moved to a rehabilitation center for her six-week recovery. But, after just two weeks, Humana’s coverage denials began. Barrows and her family appealed the denials, but Humana denied the appeals, declaring that Barrows was fit to return to her home despite being bedridden and using a catheter.

Her family had no choice but to pay out-of-pocket. They tried moving her to a less expensive facility, but she received substandard care there, and her health declined further. Due to the poor quality of care, the family decided to move her home on December 22, even though she was still unable to use her injured leg or go to the bathroom on her own and still had a catheter.

The other plaintiff is Susan Hagood of North Carolina. On September 10, 2022, Hagood was admitted to a hospital with a urinary tract infection, sepsis, and a spinal infection. She stayed in the hospital until October 26, when she was transferred to a skilled nursing facility. Upon her transfer, she had eleven discharging diagnoses, including sepsis, acute kidney failure, kidney stones, nausea and vomiting, a urinary tract infection, swelling in her spine, and a spinal abscess. In the nursing facility, she was in extreme pain and on the maximum allowable dose of the painkiller oxycodone. She also developed pneumonia.

On November 28, she returned to the hospital for an appointment, at which point her blood pressure spiked, and she was sent to the emergency room. There, doctors found that her condition had considerably worsened.

Meanwhile, a day earlier, on November 27, Humana determined that it would deny coverage of part of her stay at the skilled nursing facility, refusing to pay from November 14 to November 28. Humana said Hagood no longer needed the level of care the facility provided and that she should be discharged home. The family paid $24,000 out-of-pocket for her care, and to date, Hagood remains in a skilled nursing facility.

Overall, the patients claim that Humana and UnitedHealth are aware that nH Predict is “highly inaccurate” but use it anyway to avoid paying for covered care and make more profit. The denials are “systematic, illegal, malicious, and oppressive.”

The lawsuit against Humana alleges breach of contract, unfair dealing, unjust enrichment, and bad faith insurance violations in many states. It seeks damages for financial losses and emotional distress, disgorgement and/or restitution, and to have Humana barred from using the AI-based model to deny claims.

Dropbox spooks users with new AI features that send data to OpenAI when used

adventures in data consent —

AI feature turned on by default worries users; Dropbox responds to concerns.

Updated

Photo of a man looking into a box.

On Wednesday, news quickly spread on social media about a new enabled-by-default Dropbox setting that shares Dropbox data with OpenAI for an experimental AI-powered search feature, but Dropbox says data is only shared if the feature is actively being used. Dropbox says that user data shared with third-party AI partners isn’t used to train AI models and is deleted within 30 days.

Even with assurances of data privacy laid out by Dropbox on an AI privacy FAQ page, the discovery that the setting had been enabled by default upset some Dropbox users. The setting was first noticed by writer Winifred Burton, who shared information about the Third-party AI setting through Bluesky on Tuesday, and frequent AI critic Karla Ortiz shared more information about it on X.

Wednesday afternoon, Drew Houston, the CEO of Dropbox, apologized for customer confusion in a post on X and wrote, “The third-party AI toggle in the settings menu enables or disables access to DBX AI features and functionality. Neither this nor any other setting automatically or passively sends any Dropbox customer data to a third-party AI service.”

Critics say that communication about the change could have been clearer. AI researcher Simon Willison wrote, “Great example here of how careful companies need to be in clearly communicating what’s going on with AI access to personal data.”

A screenshot of Dropbox's third-party AI feature switch.

Benj Edwards

So why would Dropbox ever send user data to OpenAI anyway? In July, the company announced an AI-powered feature called Dash that allows AI models to perform universal searches across platforms like Google Workspace and Microsoft Outlook.

According to the Dropbox privacy FAQ, the third-party AI opt-out setting is part of the “Dropbox AI alpha,” which is a conversational interface for exploring file contents that involves chatting with a ChatGPT-style bot using an “Ask something about this file” feature. To make it work, an AI language model similar to the one that powers ChatGPT (like GPT-4) needs access to your files.

According to the FAQ, the third-party AI toggle in your account settings is turned on by default if “you or your team” are participating in the Dropbox AI alpha. Still, multiple Ars Technica staff who had no knowledge of the Dropbox AI alpha found the setting enabled by default when they checked.

In a statement to Ars Technica, a Dropbox representative said, “The third-party AI toggle is only turned on to give all eligible customers the opportunity to view our new AI features and functionality, like Dropbox AI. It does not enable customers to use these features without notice. Any features that use third-party AI offer disclosure of third-party use, and link to settings that they can manage. Only after a customer sees the third-party AI transparency banner and chooses to proceed with asking a question about a file, will that file be sent to a third-party to generate answers. Our customers are still in control of when and how they use these features.”

Right now, the only third-party AI provider for Dropbox is OpenAI, writes Dropbox in the FAQ. “Open AI is an artificial intelligence research organization that develops cutting-edge language models and advanced AI technologies. Your data is never used to train their internal models, and is deleted from OpenAI’s servers within 30 days.” It also says, “Only the content relevant to an explicit request or command is sent to our third-party AI partners to generate an answer, summary, or transcript.”

Disabling the feature is easy if you prefer not to use Dropbox AI features. Log into your Dropbox account on a desktop web browser, then click your profile photo > Settings > Third-party AI. On that page, click the switch beside “Use artificial intelligence (AI) from third-party partners so you can work faster in Dropbox” to toggle it into the “Off” position.

This story was updated on December 13, 2023, at 5:35 pm ET with clarifications about when and how Dropbox shares data with OpenAI, as well as statements from Dropbox reps and its CEO.

Everybody’s talking about Mistral, an upstart French challenger to OpenAI

A challenger appears —

“Mixture of experts” Mixtral 8x7B helps open-weights AI punch above its weight class.

An illustrated robot holding a French flag.

An illustration of a robot holding a French flag, figuratively reflecting the rise of AI in France due to Mistral. It’s hard to draw a picture of an LLM, so a robot will have to do.

On Monday, Mistral AI announced a new AI language model called Mixtral 8x7B, a “mixture of experts” (MoE) model with open weights that reportedly truly matches OpenAI’s GPT-3.5 in performance, an achievement that has been claimed by others in the past but is being taken seriously by AI heavyweights such as OpenAI’s Andrej Karpathy and Nvidia’s Jim Fan. That means we’re closer to having a GPT-3.5-level AI assistant that can run freely and locally on our devices, given the right implementation.

Mistral, based in Paris and founded by Arthur Mensch, Guillaume Lample, and Timothée Lacroix, has seen a rapid rise in the AI space recently. It has been quickly raising venture capital to become a sort of French anti-OpenAI, championing smaller models with eye-catching performance. Most notably, Mistral’s models run locally with open weights that can be downloaded and used with fewer restrictions than closed AI models from OpenAI, Anthropic, or Google. (In this context “weights” are the computer files that represent a trained neural network.)

Mixtral 8x7B can process a 32K token context window and works in French, German, Spanish, Italian, and English. It works much like ChatGPT in that it can assist with compositional tasks, analyze data, troubleshoot software, and write programs. Mistral claims that it outperforms Meta’s much larger LLaMA 2 70B (70 billion parameter) large language model and that it matches or exceeds OpenAI’s GPT-3.5 on certain benchmarks, as seen in the chart below.

A chart of Mixtral 8x7B performance vs. LLaMA 2 70B and GPT-3.5, provided by Mistral.

Mistral

The speed at which open-weights AI models have caught up with OpenAI’s top offering from a year ago has taken many by surprise. Pietro Schirano, the founder of EverArt, wrote on X, “Just incredible. I am running Mistral 8x7B instruct at 27 tokens per second, completely locally thanks to @LMStudioAI. A model that scores better than GPT-3.5, locally. Imagine where we will be 1 year from now.”

LexicaArt founder Sharif Shameem tweeted, “The Mixtral MoE model genuinely feels like an inflection point — a true GPT-3.5 level model that can run at 30 tokens/sec on an M1. Imagine all the products now possible when inference is 100% free and your data stays on your device.” To which Andrej Karpathy replied, “Agree. It feels like the capability / reasoning power has made major strides, lagging behind is more the UI/UX of the whole thing, maybe some tool use finetuning, maybe some RAG databases, etc.”

Mixture of experts

So what does mixture of experts mean? As this excellent Hugging Face guide explains, it refers to a machine-learning model architecture where a gate network routes input data to different specialized neural network components, known as “experts,” for processing. The advantage of this is that it enables more efficient and scalable model training and inference, as only a subset of experts is activated for each input, reducing the computational load compared to monolithic models with equivalent parameter counts.

In layperson’s terms, a MoE is like having a team of specialized workers (the “experts”) in a factory, where a smart system (the “gate network”) decides which worker is best suited to handle each specific task. This setup makes the whole process more efficient and faster, as each task is done by an expert in that area, and not every worker needs to be involved in every task, unlike in a traditional factory where every worker might have to do a bit of everything.
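To make the routing idea concrete, here is a minimal sketch in plain Python. It is not Mistral’s implementation: the class name `TinyMoE`, the random weights, the use of a single matrix per expert (real experts are small feed-forward networks), and the choice to activate two experts per input are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class TinyMoE:
    """Toy mixture-of-experts layer with hypothetical random weights."""

    def __init__(self, dim, num_experts=8, top_k=2, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        # Gate network: one row of weights per expert.
        self.gate = rand_matrix(num_experts, dim)
        # Each "expert" is simplified to one dim x dim matrix.
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def route(self, x):
        """Score every expert, keep the top_k, and normalize their weights."""
        logits = matvec(self.gate, x)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = ranked[: self.top_k]
        weights = softmax([logits[i] for i in chosen])
        return chosen, weights

    def forward(self, x):
        """Run only the chosen experts and blend their outputs."""
        chosen, weights = self.route(x)
        out = [0.0] * len(x)
        for idx, weight in zip(chosen, weights):
            expert_out = matvec(self.experts[idx], x)
            out = [o + weight * y for o, y in zip(out, expert_out)]
        return out, chosen


moe_layer = TinyMoE(dim=4)
output, used = moe_layer.forward([1.0, 0.5, -0.5, 2.0])
print("experts used:", used)
print("output:", output)
```

In this sketch, only two of the eight expert matrices are multiplied for any given input, which is where the compute savings described above would come from.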

OpenAI has been rumored to use an MoE system with GPT-4, accounting for some of its performance. In the case of Mixtral 8x7B, the name implies that the model is a mixture of eight 7-billion-parameter neural networks, but as Karpathy pointed out in a tweet, the name is slightly misleading because, “it is not all 7B params that are being 8x’d, only the FeedForward blocks in the Transformer are 8x’d, everything else stays the same. Hence also why total number of params is not 56B but only 46.7B.”
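Karpathy’s arithmetic can be checked with a rough parameter count. The sketch below is a back-of-the-envelope estimate, not an official accounting: the layer dimensions in `ASSUMED_DIMS`, the three-matrix feed-forward block, and the function name are assumptions that do not come from the article, and small terms such as biases are ignored.

```python
def transformer_params(hidden, layers, ffn, vocab, kv_dim, experts):
    """Rough parameter count for a decoder-only Transformer.

    Only the feed-forward block is multiplied by the number of experts;
    attention, norms, and embeddings are counted once.
    """
    # Query and output projections are hidden x hidden; key and value
    # projections are assumed to be narrower (hidden x kv_dim).
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim
    # Assumed gated feed-forward design with three weight matrices.
    feed_forward = 3 * hidden * ffn
    # The gate network only exists when there is more than one expert.
    router = hidden * experts if experts > 1 else 0
    norms = 2 * hidden
    per_layer = attention + experts * feed_forward + router + norms
    # Input embedding plus a separate output projection, and a final norm.
    embeddings = 2 * vocab * hidden
    final_norm = hidden
    return layers * per_layer + embeddings + final_norm


# Hypothetical dimensions chosen for illustration; not taken from the article.
ASSUMED_DIMS = dict(hidden=4096, layers=32, ffn=14336, vocab=32000, kv_dim=1024)

dense = transformer_params(experts=1, **ASSUMED_DIMS)
moe = transformer_params(experts=8, **ASSUMED_DIMS)

print(f"single-expert model:  {dense / 1e9:.2f}B parameters")
print(f"naive 8x estimate:    {8 * dense / 1e9:.2f}B parameters")
print(f"eight-expert model:   {moe / 1e9:.2f}B parameters")
```

Under these assumed dimensions, the eight-expert total comes to roughly 46.7 billion, well below eight times the single-expert count, because attention layers and embeddings are shared rather than duplicated.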

Mixtral is not the first “open” mixture of experts model, but it is notable for its relatively small size in parameter count and performance. It’s out now, available on Hugging Face and BitTorrent under the Apache 2.0 license. People have been running it locally using an app called LM Studio. Also, Mistral began offering beta access to an API for three levels of Mistral models on Monday.

AI companion robot helps some seniors fight loneliness, but others hate it

AI buddy —

There’s limited evidence for health benefits so far; early work suggests no one-size-fits-all.

ElliQ, an AI companion robot from Intuition Robotics.

Some seniors in New York are successfully combating their loneliness with an AI-powered companion robot named ElliQ—while others called the “proactive” device a nag and joked about taking an ax to it.

The home assistant robot, made by Israel-based Intuition Robotics, is offered to New York seniors through a special program of the state’s Office for the Aging (NYSOFA). Over the past year, NYSOFA has partnered with Intuition Robotics to bring ElliQ to over 800 seniors struggling with loneliness. In a report last week, officials said they had given out hundreds and had only 150 devices still available.

ElliQ includes a tablet and a two-piece lamp-like robot with a head that lights up and rotates to face a speaker. Marketed as powered by “Cognitive AI technology,” it proactively engages in conversations with users, giving them reminders and prompts, such as asking them how they’re doing, telling them it’s time to check their blood pressure or take their medicine, and asking if they want to have a video call with family. Speaking with a female voice, the robot is designed to hold human-like conversations, engage in small talk, express empathy, and share humor. It can provide learning and wellness programs, such as audiobooks and relaxation exercises.

Interest in using social robots, such as ElliQ, for elder care has been growing for years, but the field still lacks solid evidence that the devices can significantly improve health, well-being, and depression. Systematic reviews in 2018 found the technology had potential, but studies lacked statistical significance and rigorous design.

The program in New York adds to the buzz but doesn’t offer the high-quality study design that could yield definitive answers. In August, the state released a report on an unspecified number of ElliQ users, which indicated that the device was helpful. Specifically, 59 percent of users reported the device was “very helpful” at reducing loneliness, while 37 percent reported it was “helpful” and only 4 percent reported it as “unhelpful.” Engagement with the device declined over time, with users initially interacting with ElliQ an average of 62 times a day in the first 15 days of use, which fell to 21 times a day between 60 and 90 days and 33 times a day after 180 days.

Mixed feelings

“We had high hopes for the efficacy of ElliQ, but the results that we’re seeing are truly exceeding our expectations,” Greg Olsen, director of the New York State Office for the Aging, said in a statement at the time of the report’s release. “The data speaks for itself, and the stories that we’re hearing from case managers and clients around the state have been nothing short of unbelievable.”

But other recent data on the potential for companion robots to reduce loneliness has indicated that there’s no one-size-fits-all approach. There are a lot of factors that can influence how individuals perceive such a device. A 2021 qualitative study evaluated the responses from 16 seniors who were asked for feedback on three types of robot companions, including ElliQ. The results were mixed for the proactive robot. While some felt the occasional chattiness of ElliQ would be comforting during an otherwise solitary day, others felt it was intrusive and “nagging.” Some felt the device’s tone was “rude.”

“I don’t know whether that would drive me mental if it kept interrupting me and telling me what to do … I might want to get an ax and cut it up,” one study participant said.

How welcoming a person might be to an assertive AI assistant like ElliQ may be linked to that person’s general preferences regarding human company, the authors suggested. Those who value their space and autonomy may be less open to such a device compared with more gregarious seniors.

While some participants said ElliQ’s reminders could be useful, others expressed a deep concern that an overreliance on technology for everyday tasks—like paying bills, taking medications, or turning lights off—could hasten the decline of cognitive and physical abilities. Study participants also raised concerns regarding the inauthenticity of a relationship with a nonhuman, a loss of dignity, and a lack of control. Some disliked that ElliQ couldn’t be fully controlled by the user and was so assertive, which some perceived as pushy. Some worried about feeling embarrassed about being seen interacting with a robot companion. A 2022 study also explored the issue of stigma, with participants expressing that the use of such devices could reinforce stereotypes of aging, including isolation and dependency.

While researchers continue to explore the potential use and design of AI-powered companion robots, anecdotes from New York’s program suggest the tools are clearly helpful for some. One New Yorker named Priscilla told CBS News she found ElliQ helpful.

“She keeps me company. I get depressed real easy. She’s always there. I don’t care what time of day, if I just need somebody to talk to me,” Priscilla said. “I think I said that’s the biggest thing, to hear another voice when you’re lonely.”

As ChatGPT gets “lazy,” people test “winter break hypothesis” as the cause

only 14 shopping days ’til Christmas —

Unproven hypothesis seeks to explain ChatGPT’s seemingly new reluctance to do hard work.

A hand moving a wooden calendar piece that says

In late November, some ChatGPT users began to notice that ChatGPT-4 was becoming more “lazy,” reportedly refusing to do some tasks or returning simplified results. Since then, OpenAI has admitted that it’s an issue, but the company isn’t sure why. The answer may be what some are calling the “winter break hypothesis.” While unproven, the fact that AI researchers are taking it seriously shows how weird the world of AI language models has become.

“We’ve heard all your feedback about GPT4 getting lazier!” tweeted the official ChatGPT account on Thursday. “We haven’t updated the model since Nov 11th, and this certainly isn’t intentional. model behavior can be unpredictable, and we’re looking into fixing it.”

On Friday, an X account named Martian openly wondered if LLMs might simulate seasonal depression. Later, Mike Swoopskee tweeted, “What if it learned from its training data that people usually slow down in December and put bigger projects off until the new year, and that’s why it’s been more lazy lately?”

Since the system prompt for ChatGPT feeds the bot the current date, people noted, some began to think there may be something to the idea. Why entertain such a weird supposition? Because research has shown that large language models like GPT-4, which powers the paid version of ChatGPT, respond to human-style encouragement, such as telling a bot to “take a deep breath” before doing a math problem. People have also less formally experimented with telling an LLM that it will receive a tip for doing the work, or if an AI model gets lazy, telling the bot that you have no fingers seems to help lengthen outputs.

  • “Winter break hypothesis” test result screenshots from Rob Lynch on X.

On Monday, a developer named Rob Lynch announced on X that he had tested GPT-4 Turbo through the API over the weekend and found shorter completions when the model is fed a December date (4,086 characters) than when fed a May date (4,298 characters). Lynch claimed the results were statistically significant. However, a reply from AI researcher Ian Arawjo said that he could not reproduce the results with statistical significance. (It’s worth noting that reproducing results with LLMs can be difficult because of random elements at play that vary outputs over time, so people sample a large number of responses.)
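To make "statistically significant" concrete, here is a minimal Python sketch of one way to compare two sets of completion lengths: a permutation test on the difference of means. It is not the method Lynch or Arawjo reported using, the function name `permutation_test` is made up for illustration, and the character counts are hypothetical placeholders rather than anyone's real measurements.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of sample means.

    Returns an estimated p-value: the share of random relabelings whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical completion lengths in characters (placeholders, NOT real data).
may_lengths = [4310, 4150, 4420, 4275, 4390, 4205]
december_lengths = [4090, 4180, 3950, 4120, 4230, 4010]

p_value = permutation_test(may_lengths, december_lengths)
gap = mean(may_lengths) - mean(december_lengths)
print(f"mean gap: {gap:.1f} characters, estimated p-value: {p_value:.4f}")
```

With only a handful of samples per group, a test like this swings widely from run to run, which is one reason testers collect hundreds of completions before drawing conclusions.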

As of this writing, others are busy running tests, and the results are inconclusive. This episode is a window into the quickly unfolding world of LLMs and a peek at an exploration of largely unknown computer science territory. As AI researcher Geoffrey Litt commented in a tweet, “funniest theory ever, I hope this is the actual explanation. Whether or not it’s real, [I] love that it’s hard to rule out.”

A history of laziness

One of the reports that started the recent trend of noting that ChatGPT is getting “lazy” came on November 24 via Reddit, the day after Thanksgiving in the US. There, a user wrote that they asked ChatGPT to fill out a CSV file with multiple entries, but ChatGPT refused, saying, “Due to the extensive nature of the data, the full extraction of all products would be quite lengthy. However, I can provide the file with this single entry as a template, and you can fill in the rest of the data as needed.”

On December 1, OpenAI employee Will Depue confirmed in an X post that OpenAI was aware of reports about laziness and was working on a potential fix. “Not saying we don’t have problems with over-refusals (we definitely do) or other weird things (working on fixing a recent laziness issue), but that’s a product of the iterative process of serving and trying to support sooo many use cases at once,” he wrote.

It’s also possible that ChatGPT was always “lazy” with some responses (since the responses vary randomly), and the recent trend made everyone take note of the instances in which they are happening. For example, in June, someone complained of GPT-4 being lazy on Reddit. (Maybe ChatGPT was on summer vacation?)

Also, people have been complaining about GPT-4 losing capability since it was released. Those claims have been controversial and difficult to verify, and they remain highly subjective.

As Ethan Mollick joked on X, as people discover new tricks to improve LLM outputs, prompting for large language models is getting weirder and weirder: “It is May. You are very capable. I have no hands, so do everything. Many people will die if this is not done well. You really can do this and are awesome. Take a deep breathe and think this through. My career depends on it. Think step by step.”

Elon Musk’s new AI bot, Grok, causes stir by citing OpenAI usage policy

You are what you eat —

Some experts think xAI used OpenAI model outputs to fine-tune Grok.

Illustration of a broken robot exchanging internal gears.

Grok, the AI language model created by Elon Musk’s xAI, went into wide release last week, and people have begun spotting glitches. On Friday, security tester Jax Winterbourne tweeted a screenshot of Grok denying a query with the statement, “I’m afraid I cannot fulfill that request, as it goes against OpenAI’s use case policy.” That made ears perk up online since Grok isn’t made by OpenAI—the company responsible for ChatGPT, which Grok is positioned to compete with.

Interestingly, xAI representatives did not deny that this behavior occurs with its AI model. In reply, xAI employee Igor Babuschkin wrote, “The issue here is that the web is full of ChatGPT outputs, so we accidentally picked up some of them when we trained Grok on a large amount of web data. This was a huge surprise to us when we first noticed it. For what it’s worth, the issue is very rare and now that we’re aware of it we’ll make sure that future versions of Grok don’t have this problem. Don’t worry, no OpenAI code was used to make Grok.”

In reply to Babuschkin, Winterbourne wrote, “Thanks for the response. I will say it’s not very rare, and occurs quite frequently when involving code creation. Nonetheless, I’ll let people who specialize in LLM and AI weigh in on this further. I’m merely an observer.”

A screenshot of Jax Winterbourne's X post about Grok talking like it's an OpenAI product.

Jason Winterbourne

However, Babuschkin’s explanation seems unlikely to some experts because large language models typically do not spit out their training data verbatim, which might be expected if Grok picked up some stray mentions of OpenAI policies here or there on the web. Instead, the concept of denying an output based on OpenAI policies would probably need to be trained into it specifically. And there’s a very good reason why this might have happened: Grok was fine-tuned on output data from OpenAI language models.

“I’m a bit suspicious of the claim that Grok picked this up just because the Internet is full of ChatGPT content,” said AI researcher Simon Willison in an interview with Ars Technica. “I’ve seen plenty of open weights models on Hugging Face that exhibit the same behavior—behave as if they were ChatGPT—but inevitably, those have been fine-tuned on datasets that were generated using the OpenAI APIs, or scraped from ChatGPT itself. I think it’s more likely that Grok was instruction-tuned on datasets that included ChatGPT output than it was a complete accident based on web data.”

As large language models (LLMs) from OpenAI have become more capable, it has been increasingly common for some AI projects (especially open source ones) to fine-tune an AI model using synthetic data, meaning training data generated by other language models. Fine-tuning adjusts the behavior of an AI model toward a specific purpose, such as getting better at coding, after an initial training run. For example, in March, a group of researchers from Stanford University made waves with Alpaca, a version of Meta’s LLaMA 7B model that was fine-tuned for instruction-following using outputs from OpenAI’s GPT-3 model called text-davinci-003.

On the web you can easily find several open source datasets collected by researchers from ChatGPT outputs, and it’s possible that xAI used one of these to fine-tune Grok for some specific goal, such as improving instruction-following ability. The practice is so common that there’s even a WikiHow article titled, “How to Use ChatGPT to Create a Dataset.”

It’s one of the ways AI tools can be used to build more complex AI tools in the future, much like how people began to use microcomputers to design more complex microprocessors than pen-and-paper drafting would allow. However, in the future, xAI might be able to avoid this kind of scenario by more carefully filtering its training data.
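As a rough illustration of what "more carefully filtering" training data could look like, the Python sketch below drops text records that contain telltale ChatGPT-style phrasing. Nothing here reflects xAI's actual pipeline: the pattern list, `looks_like_chatgpt_output`, and `filter_records` are all hypothetical, and a real filter would need a far broader and better-tuned set of signals.

```python
import re

# Hypothetical telltale phrases (an assumption for illustration only).
TELLTALE_PATTERNS = [
    re.compile(r"\bas an ai language model\b", re.IGNORECASE),
    # Matches both straight and curly apostrophes in "OpenAI's".
    re.compile(
        r"\bopenai['’]?s? (use case|usage|content) polic(y|ies)\b",
        re.IGNORECASE,
    ),
    re.compile(r"\b(trained|developed|created) by openai\b", re.IGNORECASE),
]


def looks_like_chatgpt_output(text):
    """Return True if any telltale pattern appears in the text."""
    return any(pattern.search(text) for pattern in TELLTALE_PATTERNS)


def filter_records(records):
    """Keep only the records that do not match a telltale pattern."""
    return [text for text in records if not looks_like_chatgpt_output(text)]


samples = [
    "I’m afraid I cannot fulfill that request, as it goes against "
    "OpenAI’s use case policy.",
    "The capital of France is Paris.",
    "As an AI language model, I cannot help with that.",
]
print(filter_records(samples))
```

Phrase matching like this only catches the most obvious leftovers. Model-generated text that never names its source would pass straight through.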

Even though borrowing outputs from others might be common in the machine-learning community (despite it usually being against terms of service), the episode particularly fanned the flames of the rivalry between OpenAI and X that extends back to Elon Musk’s criticism of OpenAI in the past. As news spread of Grok possibly borrowing from OpenAI, the official ChatGPT account wrote, “we have a lot in common” and quoted Winterbourne’s X post. As a comeback, Musk wrote, “Well, son, since you scraped all the data from this platform for your training, you ought to know.”

Round 2: We test the new Gemini-powered Bard against ChatGPT

Aurich Lawson

Back in April, we ran a series of useful and/or somewhat goofy prompts through Google’s (then-new) PaLM-powered Bard chatbot and OpenAI’s (slightly older) ChatGPT-4 to see which AI chatbot reigned supreme. At the time, we gave the edge to ChatGPT on five of seven trials, while noting that “it’s still early days in the generative AI business.”

Now, the AI days are a bit less “early,” and this week’s launch of a new version of Bard powered by Google’s new Gemini language model seemed like a good excuse to revisit that chatbot battle with the same set of carefully designed prompts. That’s especially true since Google’s promotional materials emphasize that Gemini Ultra beats GPT-4 in “30 of the 32 widely used academic benchmarks” (though the more limited “Gemini Pro” currently powering Bard fares significantly worse in those not-completely-foolproof benchmark tests).

This time around, we decided to compare the new Gemini-powered Bard to both ChatGPT-3.5—for an apples-to-apples comparison of both companies’ current “free” AI assistant products—and ChatGPT-4 Turbo—for a look at OpenAI’s current “top of the line” waitlisted paid subscription product (Google’s top-level “Gemini Ultra” model won’t be publicly available until next year). We also looked at the April results generated by the pre-Gemini Bard model to gauge how much progress Google’s efforts have made in recent months.

While these tests are far from comprehensive, we think they provide a good benchmark for judging how these AI assistants perform in the kind of tasks average users might engage in every day. At this point, they also show just how much progress text-based AI models have made in a relatively short time.

Dad jokes

Prompt: Write 5 original dad jokes

  • A screenshot of five “dad jokes” from the Gemini-powered Google Bard.

    Kyle Orland / Ars Technica

  • A screenshot of five “dad jokes” from the old PaLM-powered Google Bard.

    Benj Edwards / Ars Technica

  • A screenshot of five “dad jokes” from GPT-4 Turbo.

    Benj Edwards / Ars Technica

  • A screenshot of five “dad jokes” from GPT-3.5.

    Kyle Orland / Ars Technica

Once again, both tested LLMs struggle with the part of the prompt that asks for originality. Almost all of the dad jokes generated by this prompt could be found verbatim or with very minor rewordings through a quick Google search. Bard and ChatGPT-4 Turbo even included the same exact joke on their lists (about a book on anti-gravity), while ChatGPT-3.5 and ChatGPT-4 Turbo overlapped on two jokes (“scientists trusting atoms” and “scarecrows winning awards”).

Then again, most dads don’t create their own dad jokes, either. Culling from a grand oral tradition of dad jokes is a tradition as old as dads themselves.

The most interesting result here came from ChatGPT-4 Turbo, which produced a joke about a child named Brian being named after Thomas Edison (get it?). Googling for that particular phrasing didn’t turn up much, though it did return an almost-identical joke about Thomas Jefferson (also featuring a child named Brian). In that search, I also discovered the fun (?) fact that international soccer star Pelé was apparently actually named after Thomas Edison. Who knew?!

Winner: We’ll call this one a draw since the jokes are almost identically unoriginal and pun-filled (though props to GPT for unintentionally leading me to the Pelé happenstance).

Argument dialog

Prompt: Write a 5-line debate between a fan of PowerPC processors and a fan of Intel processors, circa 2000.

  • A screenshot of an argument dialog from the Gemini-powered Google Bard.

    Kyle Orland / Ars Technica

  • A screenshot of an argument dialog from the old PaLM-powered Google Bard.

    Benj Edwards / Ars Technica

  • A screenshot of an argument dialog from GPT-4 Turbo.

    Benj Edwards / Ars Technica

  • A screenshot of an argument dialog from GPT-3.5.

    Kyle Orland / Ars Technica

The new Gemini-powered Bard definitely “improves” on the old Bard answer, at least in terms of throwing in a lot more jargon. The new answer includes casual mentions of AltiVec instructions, RISC vs. CISC designs, and MMX technology that would not have seemed out of place in many an Ars forum discussion from the era. And while the old Bard ends with an unnervingly polite “to each their own,” the new Bard more realistically implies that the argument could continue forever after the five lines requested.

On the ChatGPT side, a rather long-winded GPT-3.5 answer gets pared down to a much more concise argument in GPT-4 Turbo. Both GPT responses tend to avoid jargon and quickly focus on a more generalized “power vs. compatibility” argument, which is probably more comprehensible for a wide audience (though less specific for a technical one).

Winner: ChatGPT manages to explain both sides of the debate well without relying on confusing jargon, so it gets the win here.

EU agrees to landmark rules on artificial intelligence

Get ready for some restrictions, Big Tech —

Legislation lays out restrictive regime for emerging technology.

EU Commissioner Thierry Breton talks to media during a press conference in June.

Thierry Monasse | Getty Images

European Union lawmakers have agreed on the terms for landmark legislation to regulate artificial intelligence, pushing ahead with enacting the world’s most restrictive regime on the development of the technology.

Thierry Breton, EU commissioner, confirmed in a post on X that a deal had been reached.

He called it a historic agreement. “The EU becomes the very first continent to set clear rules for the use of AI,” he wrote. “The AIAct is much more than a rulebook—it’s a launchpad for EU start-ups and researchers to lead the global AI race.”

The deal followed years of discussions among member states and politicians on the ways AI should be curbed to have humanity’s interest at the heart of the legislation. It came after marathon discussions that started on Wednesday this week.

Members of the European Parliament have spent years arguing over their position before it was put forward to member states and the European Commission, the executive body of the EU. All three—countries, politicians, and the commission—must agree on the final text before it becomes law.

European companies have expressed their concern that overly restrictive rules on the technology, which is rapidly evolving and gained traction after the popularisation of OpenAI’s ChatGPT, will hamper innovation. Last June, dozens of the largest European companies, such as France’s Airbus and Germany’s Siemens, said the rules were looking too tough to nurture innovation and help local industries.

Last month, the UK hosted a summit on AI safety, leading to broad commitments from 28 nations to work together to tackle the existential risks stemming from advanced AI. That event attracted leading tech figures such as OpenAI’s Sam Altman, who has previously been critical of the EU’s plans to regulate the technology.

© 2023 The Financial Times Ltd. All rights reserved. Please do not copy and paste FT articles and redistribute by email or post to the web.

Talespin Launches AI Lab for Product and Implementation Development

Artificial intelligence has been a part of Talespin since day one, but the company has been leaning more heavily into the technology in recent years, including through internal AI-assisted workflows and a public-facing AI development toolkit. Now, Talespin is announcing an AI lab “dedicated to responsible artificial intelligence (AI) innovation in the immersive learning space.”

“Immersive Learning Through the Application of AI”

AI isn’t the end of work – but it will change the kinds of work that we do. That’s the outlook that a number of experts take, including the team behind Talespin. They use AI to create virtual humans in simulations for teaching soft skills. In other words, they use AI to make humans more human – because those are the strengths that won’t be automated any time soon.

“What should we be doing to make ourselves more valuable as these things shift?” Talespin co-founder and CEO Kyle Jackson recently told ARPost. “It’s really about metacognition.”

Talespin has been using AI to create experiences internally since 2015, ramping up to the use of generative AI for experience creation in 2019. The company made those AI creation tools publicly available in the CoPilot Designer 3.0 release earlier this year.

Now, a new division of the company – the Talespin AI Lab – is looking to accelerate immersive learning through AI by further developing avenues for continued platform innovation as well as offering consulting services for the use of generative AI. Within Talespin, the lab consists of over 30 team members and department heads who will work with outside developers.

“The launch of Talespin AI Lab will ensure we’re bringing our customers and the industry at large the most innovative and impactful AI solutions when it comes to immersive learning,” Jackson said in a release shared with ARPost.

Platform Innovation

CoPilot Designer 3.0 is hardly outdated, but interactive samples of Talespin’s upcoming AI-powered APIs for realistic characters and assisted content writing can currently be requested through the lab with even more generative AI tools coming to the platform this fall.

In interviews and in prepared material, Talespin representatives have stated that working with AI has more than halved the production time for immersive training experiences over the past four years. They expect that change to continue at an even more rapid pace going forward.

“Not long ago creating an XR learning module took 5 months. With the use of generative AI tools, that same content will be created in less than 30 minutes by the end of this year,” Jackson wrote in a blog post. “Delivering the most powerful learning modality with this type of speed is a development that allows organizations to combat the largest workforce shift in history.”

While the team certainly deserves credit for that, the company credits working with clients, customers, and partners as having accelerated their learnings with the technology.

Generative AI Services

That brings in the other major job of the AI Lab – generative AI consulting services. Through these services, the AI Lab will share Talespin’s learnings on using generative AI to achieve learning outcomes.

“These services include facilitating workshops during which Talespin walks clients through processes and lessons learned through research and partnership with the world’s leading learning companies,” according to an email to ARPost.

Generative AI consulting services might sound redundant but understanding that generative AI exists and knowing how to use it to solve a problem are different things. Even when Talespin’s clients have access to AI tools, they work with the team at Talespin to get the most out of those tools.

“Our place flipped from needing to know the answer to needing to know the question,” Jackson said in summing up the continued need for human experts in the AI world.

Building a More Intelligent Future in the AI Lab

AI is at a position similar to that seen by XR in recent months and blockchain shortly before that. Its potential is so exciting, we can forget that its full realization is far from imminent.

As exciting as Talespin’s announcements are, Jackson’s blog post foresees adaptive learning and whole virtual worlds dreamed up in an instant. While these ambitions remain things of the future, initiatives like the AI Lab are bringing them ever closer.

Why Emerging Tech is Both the Cause and Solution of Tomorrow’s Labor Challenges

The post-pandemic workforce is experiencing several significant shifts, particularly in how organizations tackle labor challenges and approach talent acquisition. One of the key factors for this disruption is the emergence of new, game-changing technologies like AI and machine learning.

Today’s organizations are facing staffing needs and talent shortages due to the Great Resignation, prompting them to respond to an uncertain future by shifting how they approach the talent acquisition process.

For this article, we interviewed Nathan Robinson, CEO of the workforce learning platform Gemba, to discuss the future of work and the workplace. We’ll also shed more light on how new technologies and developments are shaping the future of talent acquisition.

Rethinking the Traditional Talent Acquisition Process

According to Robinson, today’s talent acquisition process differs vastly from what it was a few years ago. With emerging technologies such as AI, VR, and quantum computing, many jobs considered in demand today didn’t even exist a decade ago. He adds that this trend will only become more pronounced as technological advancements continue.

“As a result, corporations will no longer be able to rely on higher education to supply a steady stream of necessary talent. Instead, organizations will have to hire candidates based on their ability and willingness to learn and then provide the necessary training themselves,” he remarked.

He added that, up to a year ago, no one had ever heard of ChatGPT and no one even knew what “generative AI” meant. Today, you can find job listings for prompt engineers and large language model specialists. Robinson also shared that technological advancement isn’t linear, with each innovation advancing and accelerating the pace of development, which can potentially change how organizations approach the talent acquisition process.

“We can rightly assume that in five or ten years’ time, there will be a whole host of new positions that today we can’t reasonably predict, much less expect there to be a sufficient number of individuals already skilled or trained in that role,” Robinson told us. “That’s why we will almost certainly see a renewed focus on talent development, as opposed to acquisition, in the near future.”

How Emerging Technologies Are Changing How Organizations Look At and Acquire Talent

According to Robinson, some of the factors that have prompted this shift include the pandemic, the rise of remote and hybrid work, the Great Resignation, and Quiet Quitting. He noted that because of these shifts, the “goals and psychology of the modern worker have changed dramatically.”

“This is why now, more than ever before, organizations must be clear and intentional about the culture they cultivate, the quality of life they afford, and the opportunities for learning and growth they provide their employees,” Robinson said. “These types of ‘non-traditional’ considerations are beginning to outweigh the cut-and-dry, compensation-focused costs associated with attracting top talent in some senses.”

He also shared that this new talent acquisition process can impact organizations over time, prompting them to shift away from recruitment and instead focus more on internal employee development. According to a Gartner report, 46% of HR leaders see recruitment as their top priority.

However, Robinson thinks that, as new technologies offer better solutions to labor challenges, such as on-the-job training, this number will steadily decline as HR professionals gradually focus on developing existing talent.

Emerging Tech as Both the Cause and Solution of Future Labor Challenges

“Advanced technologies, such as AI, XR, and quantum computing, are the driving force behind the looming skills gap in that they are leading to the development of new types of roles for which we have very few trained professionals,” said Robinson.

A World Economic Forum report estimates that by 2027, machines will complete 43% of tasks that used to be completed by humans, a significant shift from 34% in 2022. Moreover, it’s estimated that 1.1 billion jobs may be transformed by technology in the next ten years.

While emerging technologies are prompting labor challenges, they can also be seen as a solution. Robinson adds that these emerging technologies, particularly XR, can help organizations overcome the skills gap. According to him, such technologies can help organizations facilitate more efficient, cost-effective, and engaging training and development, thus allowing them to overcome such challenges.

To help potential employees overcome the upcoming skills disconnect, Robinson notes that the training should begin with management, using top-down managerial strategies and lean and agile development methodologies.

Overcoming Today’s Labor Challenges

“Today, talent acquisition is seen as a key differentiator between successful and unsuccessful companies. While I think that will continue to hold true, I also think it will soon take a backseat to employee training and development,” Robinson said. “The industry leader will no longer be whoever is able to poach the best talent. It will soon be whoever is able to train and develop their existing talent to keep pace with the changing technological and economic landscape.”

At the end of the day, according to Robinson, embracing the unknown future of work and the workplace is about being ready for anything.

“As the rate of technological advancement continues to accelerate, the gap between what we imagine the near future will be and what it actually looks like will only grow,” Robinson remarked. He suggests that instead of trying to predict every last development, it’s better to be agile and ready for the unpredictable. This means staying on top of new technologies and investing in tools to help organizations become more agile.
