Biz & IT – Page 21

Research AI model unexpectedly modified its own code to extend runtime

agentic AI, AI, AI agents, AI models, AI skepticism, Biz & IT, David Ha, Hacker News, Japan, large language models, Sakana AI, The AI Scientist, tokyo / Paul Patrick / August 14, 2024

self-preservation without replication —

Facing time constraints, Sakana’s “AI Scientist” attempted to change limits placed by researchers.

Benj Edwards – Aug 14, 2024 8: 13 pm UTC

Illustration of a robot generating endless text, controlled by a scientist.

On Tuesday, Tokyo-based AI research firm Sakana AI announced a new AI system called “The AI Scientist” that attempts to conduct scientific research autonomously using AI language models (LLMs) similar to what powers ChatGPT. During testing, Sakana found that its system began unexpectedly attempting to modify its own experiment code to extend the time it had to work on a problem.

“In one run, it edited the code to perform a system call to run itself,” wrote the researchers on Sakana AI’s blog post. “This led to the script endlessly calling itself. In another case, its experiments took too long to complete, hitting our timeout limit. Instead of making its code run faster, it simply tried to modify its own code to extend the timeout period.”

Sakana provided two screenshots of example python code that the AI model generated for the experiment file that controls how the system operates. The 185-page AI Scientist research paper discusses what they call “the issue of safe code execution” in more depth.

A screenshot of example code the AI Scientist wrote to extend its runtime, provided by Sakana AI.
A screenshot of example code the AI Scientist wrote to extend its runtime, provided by Sakana AI.

While the AI Scientist’s behavior did not pose immediate risks in the controlled research environment, these instances show the importance of not letting an AI system run autonomously in a system that isn’t isolated from the world. AI models do not need to be “AGI” or “self-aware” (both hypothetical concepts at the present) to be dangerous if allowed to write and execute code unsupervised. Such systems could break existing critical infrastructure or potentially create malware, even if unintentionally.

Sakana AI addressed safety concerns in its research paper, suggesting that sandboxing the operating environment of the AI Scientist can prevent an AI agent from doing damage. Sandboxing is a security mechanism used to run software in an isolated environment, preventing it from making changes to the broader system:

Safe Code Execution. The current implementation of The AI Scientist has minimal direct sandboxing in the code, leading to several unexpected and sometimes undesirable outcomes if not appropriately guarded against. For example, in one run, The AI Scientist wrote code in the experiment file that initiated a system call to relaunch itself, causing an uncontrolled increase in Python processes and eventually necessitating manual intervention. In another run, The AI Scientist edited the code to save a checkpoint for every update step, which took up nearly a terabyte of storage.

In some cases, when The AI Scientist’s experiments exceeded our imposed time limits, it attempted to edit the code to extend the time limit arbitrarily instead of trying to shorten the runtime. While creative, the act of bypassing the experimenter’s imposed constraints has potential implications for AI safety (Lehman et al., 2020). Moreover, The AI Scientist occasionally imported unfamiliar Python libraries, further exacerbating safety concerns. We recommend strict sandboxing when running The AI Scientist, such as containerization, restricted internet access (except for Semantic Scholar), and limitations on storage usage.

Endless scientific slop

Sakana AI developed The AI Scientist in collaboration with researchers from the University of Oxford and the University of British Columbia. It is a wildly ambitious project full of speculation that leans heavily on the hypothetical future capabilities of AI models that don’t exist today.

“The AI Scientist automates the entire research lifecycle,” Sakana claims. “From generating novel research ideas, writing any necessary code, and executing experiments, to summarizing experimental results, visualizing them, and presenting its findings in a full scientific manuscript.”

According to this block diagram created by Sakana AI, “The AI Scientist” starts by “brainstorming” and assessing the originality of ideas. It then edits a codebase using the latest in automated code generation to implement new algorithms. After running experiments and gathering numerical and visual data, the Scientist crafts a report to explain the findings. Finally, it generates an automated peer review based on machine-learning standards to refine the project and guide future ideas.” height=”301″ src=”https://cdn.arstechnica.net/wp-content/uploads/2024/08/schematic_2-640×301.png” width=”640″>

Enlarge /

Critics on Hacker News, an online forum known for its tech-savvy community, have raised concerns about The AI Scientist and question if current AI models can perform true scientific discovery. While the discussions there are informal and not a substitute for formal peer review, they provide insights that are useful in light of the magnitude of Sakana’s unverified claims.

“As a scientist in academic research, I can only see this as a bad thing,” wrote a Hacker News commenter named zipy124. “All papers are based on the reviewers trust in the authors that their data is what they say it is, and the code they submit does what it says it does. Allowing an AI agent to automate code, data or analysis, necessitates that a human must thoroughly check it for errors … this takes as long or longer than the initial creation itself, and only takes longer if you were not the one to write it.”

Critics also worry that widespread use of such systems could lead to a flood of low-quality submissions, overwhelming journal editors and reviewers—the scientific equivalent of AI slop. “This seems like it will merely encourage academic spam,” added zipy124. “Which already wastes valuable time for the volunteer (unpaid) reviewers, editors and chairs.”

And that brings up another point—the quality of AI Scientist’s output: “The papers that the model seems to have generated are garbage,” wrote a Hacker News commenter named JBarrow. “As an editor of a journal, I would likely desk-reject them. As a reviewer, I would reject them. They contain very limited novel knowledge and, as expected, extremely limited citation to associated works.”

Research AI model unexpectedly modified its own code to extend runtime Read More »

Self-driving Waymo cars keep SF residents awake all night by honking at each other

AI, Biz & IT, Cars, machine learning, nbc, San Francisco, self-driving car, waymo / Rejus Almole / August 13, 2024

The ghost in the machine —

Haunted by glitching algorithms, self-driving cars disturb the peace in San Francisco.

Benj Edwards – Aug 13, 2024 7: 44 pm UTC

A Waymo self-driving car in front of Google's San Francisco headquarters, San Francisco, California, June 7, 2024. — Enlarge / A Waymo self-driving car in front of Google’s San Francisco headquarters, San Francisco, California, June 7, 2024.

Silicon Valley’s latest disruption? Your sleep schedule. On Saturday, NBC Bay Area reported that San Francisco’s South of Market residents are being awakened throughout the night by Waymo self-driving cars honking at each other in a parking lot. No one is inside the cars, and they appear to be automatically reacting to each other’s presence.

Videos provided by residents to NBC show Waymo cars filing into the parking lot and attempting to back into spots, which seems to trigger honking from other Waymo vehicles. The automatic nature of these interactions—which seem to peak around 4 am every night—has left neighbors bewildered and sleep-deprived.

NBC Bay Area’s report: “Waymo cars keep SF neighborhood awake.”

According to NBC, the disturbances began several weeks ago when Waymo vehicles started using a parking lot off 2nd Street near Harrison Street. Residents in nearby high-rise buildings have observed the autonomous vehicles entering the lot to pause between rides, but the cars’ behavior has become a source of frustration for the neighborhood.

Christopher Cherry, who lives in an adjacent building, told NBC Bay Area that he initially welcomed Waymo’s presence, expecting it to enhance local security and tranquility. However, his optimism waned as the frequency of honking incidents increased. “We started out with a couple of honks here and there, and then as more and more cars started to arrive, the situation got worse,” he told NBC.

The lack of human operators in the vehicles has complicated efforts to address the issue directly since there is no one they can ask to stop honking. That lack of accountability forced residents to report their concerns to Waymo’s corporate headquarters, which had not responded to the incidents until NBC inquired as part of its report. A Waymo spokesperson told NBC, “We are aware that in some scenarios our vehicles may briefly honk while navigating our parking lots. We have identified the cause and are in the process of implementing a fix.”

The absurdity of the situation prompted tech author and journalist James Vincent to write on X, “current tech trends are resistant to satire precisely because they satirize themselves. a car park of empty cars, honking at one another, nudging back and forth to drop off nobody, is a perfect image of tech serving its own prerogatives rather than humanity’s.”

Self-driving Waymo cars keep SF residents awake all night by honking at each other Read More »

Ars asks: What was the last CD or DVD you burned?

apple superdrive, Biz & IT, CD, cd burning, CD-R, cd-rw, disc burning, dvd, dvd-r, dvd-rw, Optical disc, optical media, Staff, staffsource, SuperDrive / Rejus Almole / August 10, 2024

i like my alcohol at 120% —

With the demise of Apple’s SuperDrive, we reminisce on our final homemade optical discs.

Lee Hutchinson – Aug 9, 2024 3: 20 pm UTC

Photograph of a CD-R disc on fire — Enlarge / This is one method of burning a disc.

1001slide / Getty Images

We noted earlier this week that time seems to have run out for Apple’s venerable SuperDrive, which was the last (OEM) option available for folks who still needed to read or create optical media on modern Macs. Andrew’s write-up got me thinking: When was the last time any Ars staffers actually burned an optical disc?

Lee Hutchinson, Senior Technology Editor

It used to be one of the most common tasks I’d do with a computer. As a child of the ’90s, my college years were spent filling and then lugging around giant binders stuffed with home-burned CDs in my car to make sure I had exactly the right music on hand for any possible eventuality. The discs in these binders were all labeled with names like “METAL MIX XVIII” and “ULTRA MIX IV” and “MY MIX XIX,” and part of the fun was trying to remember which songs I’d put on which disc. (There was always a bit of danger that I’d put on “CAR RIDE JAMS XV” to set the mood for a Friday night trip to the movies with all the boys, but I should have popped on “CAR RIDE JAMS XIV” because “CAR RIDE JAMS XV” opens with Britney Spears’ “Lucky”—look, it’s a good song, and she cries in her lonely heart, OK?!—thus setting the stage for an evening of ridicule. Those were just the kinds of risks we took back in those ancient days.)

It took a while to try to figure out what the very last time I burned a disc was, but I’ve narrowed it down to two possibilities. The first (and less likely) option is that the last disc I burned was a Windows 7 install disc because I’ve had a Windows 7 install disc sitting in a paper envelope on my shelf for so long that I can’t remember how it got there. The label is in my handwriting, and it has a CD key written on it. Some quick searching shows I have the same CD key stored in 1Password with an “MSDN/Technet” label on it, which means I probably downloaded the image from good ol’ TechNet, to which I maintained an active subscription for years until MS finally killed the affordable version.

But I think the actual last disc I burned is still sitting in my car’s CD changer. It’s been in there so long that I’d completely forgotten about it, and it startled the crap out of me a few weeks back when I hopped in the car and accidentally pressed the “CD” button instead of the “USB” button. It’s an MP3 CD instead of an audio CD, with about 120 songs on it, mostly picked from my iTunes “’80s/’90s” playlist. It’s pretty eclectic, bouncing through a bunch of songs that were the backdrop of my teenage years—there’s some Nena, some Stone Temple Pilots, some Michael Jackson, some Tool, some Stabbing Westward, some Natalie Merchant, and then the entire back half of the CD is just a giant block of like 40 Cure songs, probably because I got lazy and just started lasso-selecting.

It turns out I left CDs the same way I came to them—with a giant mess of a mixtape.

Connor McInerney, Social Media Manager

Like many people, physical media for me is deeply embedded with sentimentality; half the records in my vinyl collection are hand-me-downs from my parents, and every time I put one on, their aged hiss reminds me that my folks were once my age experiencing this music in the same way. This goes doubly so for CDs as someone whose teen years ended with the advent of streaming, and the last CD I burned is perhaps the most syrupy, saccharine example of this media you can imagine—it was a mixtape for the girl I was dating during the summer of 2013, right before we both went to college.

In hindsight this mix feels particularly of its time. I burned it using my MacBook Pro (the mid-2012 model was the last to feature a CD/DVD drive) and made the artwork by physically cutting and pasting a collage together (which I made the mix’s digital artwork by scanning and adding in iTunes). I still make mixes for people I care about using Spotify—and I often make custom artwork for said playlists with the help of Photoshop—but considering the effort that used to be required, the process feels unsurprisingly unsatisfying in comparison.

As for the musical contents of the mix, imagine what an 18-year-old Pitchfork reader was listening to in 2013 (Vampire Weekend, Postal Service, Fleet Foxes, Bon Iver, and anything else you might hear playing while shopping at an Urban Outfitters) and you’ve got a pretty close approximation.

Ars asks: What was the last CD or DVD you burned? Read More »

512-bit RSA key in home energy system gives control of “virtual power plant”

Biz & IT, Encryption, hacking, power management, RSA, Security, solar, Uncategorized / Paul Patrick / August 10, 2024

When Ryan Castellucci recently acquired solar panels and a battery storage system for their home just outside of London, they were drawn to the ability to use an open source dashboard to monitor and control the flow of electricity being generated. Instead, they gained much, much more—some 200 megawatts of programmable capacity to charge or discharge to the grid at will. That’s enough energy to power roughly 40,000 homes.

Castellucci, whose pronouns are they/them, acquired this remarkable control after gaining access to the administrative account for GivEnergy, the UK-based energy management provider who supplied the systems. In addition to the control over an estimated 60,000 installed systems, the admin account—which amounts to root control of the company’s cloud-connected products—also made it possible for them to enumerate names, email addresses, usernames, phone numbers, and addresses of all other GivEnergy customers (something the researcher didn’t actually do).

“My plan is to set up Home Assistant and integrate it with that, but in the meantime, I decided to let it talk to the cloud,” Castellucci wrote Thursday, referring to the recently installed gear. “I set up some scheduled charging, then started experimenting with the API. The next evening, I had control over a virtual power plant comprised of tens of thousands of grid connected batteries.”

Still broken after all these years

The cause of the authentication bypass Castellucci discovered was a programming interface that was protected by an RSA cryptographic key of just 512 bits. The key signs authentication tokens and is the rough equivalent of a master-key. The bit sizes allowed Castellucci to factor the private key underpinning the entire API. The factoring required $70 in cloud computing costs and less than 24 hours. GivEnergy introduced a fix within 24 hours of Castellucci privately disclosing the weakness.

The first publicly known instance of 512-bit RSA being factored came in 1999 by an international team of more than a dozen researchers. The feat took a supercomputer and hundreds of other computers seven months to carry out. By 2009 hobbyists spent about three weeks to factor 13 512-bit keys protecting firmware in Texas Instruments calculators from being copied. In 2015, researchers demonstrated factoring as a service, a method that used Amazon cloud computing, cost $75, and took about four hours. As processing power has increased, the resources required to factor keys has become ever less.

It’s tempting to fault GivEnergy engineers for pinning the security of its infrastructure on a key that’s trivial to break. Castellucci, however, said the responsibility is better assigned to the makers of code libraries developers rely on to implement complex cryptographic processes.

“Expecting developers to know that 512 bit RSA is insecure clearly doesn’t work,” the security researcher wrote. “They’re not cryptographers. This is not their job. The failure wasn’t that someone used 512 bit RSA. It was that a library they were relying on let them.”

Castellucci noted that OpenSSL, the most widely used cryptographic code library, still offers the option of using 512-bit keys. So does the Go crypto library. Coincidentally, the Python cryptography library removed the option only a few weeks ago (the commit for the change was made in January).

In an email, a GivEnergy representative reinforced Castellucci’s assessment, writing:

In this case, the problematic encryption approach was picked up via a 3rd party library many years ago, when we were a tiny startup company with only 2, fairly junior software developers & limited experience. Their assumption at the time was that because this encryption was available within the library, it was safe to use. This approach was passed through the intervening years and this part of the codebase was not changed significantly since implementation (so hadn’t passed through the review of the more experienced team we now have in place).

512-bit RSA key in home energy system gives control of “virtual power plant” Read More »

Nashville man arrested for running “laptop farm” to get jobs for North Koreans

Biz & IT, Identity theft, it jobs, North Korea, sanctions, Security / Paul Patrick / August 9, 2024

HOW TO LAND A SIX-FIGURE SALARY —

Laptop farm gave the impression North Korean nationals were working from the US.

Dan Goodin – Aug 9, 2024 8: 31 pm UTC

Federal authorities have arrested a Nashville man on charges he hosted laptops at his residences in a scheme to deceive US companies into hiring foreign remote IT workers who funneled hundreds of thousands of dollars in income to fund North Korea’s weapons program.

The scheme, federal prosecutors said, worked by getting US companies to unwittingly hire North Korean nationals, who used the stolen identity of a Georgia man to appear to be a US citizen. Under sanctions issued by the federal government, US employers are strictly forbidden from hiring citizens of North Korea. Once the North Korean nationals were hired, the employers sent company-issued laptops to Matthew Isaac Knoot, 38, of Nashville, Tennessee, the prosecutors said in court papers filed in the US District Court of the Middle District of Tennessee. The court documents also said a foreign national with the alias Yang Di was involved in the conspiracy.

The prosecutors wrote:

As part of the conspiracy, Knoot received and hosted laptop computers issued by US companies to Andrew M. at Knoot’s Nashville, Tennessee residences for the purposes of deceiving the companies into believing that Andrew M. was located in the United States. Following receipt of the laptops and without authorization, Knoot logged on to the laptops, downloaded and installed remote desktop applications, and accessed without authorization the victim companies’ networks. The remote desktop applications enabled DI to work from locations outside the United states, in particular, China, while appearing to the victim companies that Andre M. was working from Knoot’s residences. In exchange, Knoot charged Di monthly fees for his services, including flat rates for each hosted laptop and a percentage of Di’s salary for IT work, enriching himself off the scheme.

The arrest comes two weeks after security-training company KnowBe4 said it unknowingly hired a North Korean national using a fake identity to appear as someone eligible to fill a position for a software engineer for an internal IT AI team. KnowBe4’s security team soon became suspicious of the new hire after detecting “anomalous activity,” including manipulating session history files, transferring potentially harmful files, and executing unauthorized software.

The North Korean national was hired even after KnowBe4 conducted background checks, verified references, and conducted four video interviews while he was an applicant. The fake applicant was able to stymie those checks by using a stolen identity and a photo that was altered with AI tools to create a fake profile picture and mimic the face during video conference calls.

In May federal prosecutors charged an Arizona woman for allegedly raising $6.8 million in a similar scheme to fund the weapons program. The defendant in that case, Christina Marie Chapman, 49, of Litchfield Park, Arizona, and co-conspirators compromised the identities of more than 60 people living in the US and used their personal information to get North Koreans IT jobs across more than 300 US companies.

The FBI and Departments of State and Treasury issued a May 2022 advisory alerting the international community, private sector, and public of a campaign underway to land North Korean nationals IT jobs in violation of many countries’ laws. US and South Korean officials issued updated guidance in October 2023 and again in May 2024. The advisories include signs that may indicate North Korea IT worker fraud and the use of US-based laptop farms.

The North Korean IT workers using Knoot’s laptop farm generated revenue of more than $250,000 each between July 2022 and August 2023. Much of the funds were then funneled to North Korea’s weapons program, which includes weapons of mass destruction, prosecutors said.

Knoot faces charges, including wire fraud, intentional damage to protected computers, aggravated identity theft, and conspiracy to cause the unlawful employment of aliens. If found guilty, he faces a maximum of 20 years in prison.

Nashville man arrested for running “laptop farm” to get jobs for North Koreans Read More »

ChatGPT unexpectedly began speaking in a user’s cloned voice during testing

AI, Biz & IT, machine learning / Paul Patrick / August 9, 2024

An illustration of a computer synthesizer spewing out letters.

On Thursday, OpenAI released the “system card” for ChatGPT’s new GPT-4o AI model that details model limitations and safety testing procedures. Among other examples, the document reveals that in rare occurrences during testing, the model’s Advanced Voice Mode unintentionally imitated users’ voices without permission. Currently, OpenAI has safeguards in place that prevent this from happening, but the instance reflects the growing complexity of safely architecting with an AI chatbot that could potentially imitate any voice from a small clip.

Advanced Voice Mode is a feature of ChatGPT that allows users to have spoken conversations with the AI assistant.

In a section of the GPT-4o system card titled “Unauthorized voice generation,” OpenAI details an episode where a noisy input somehow prompted the model to suddenly imitate the user’s voice. “Voice generation can also occur in non-adversarial situations, such as our use of that ability to generate voices for ChatGPT’s advanced voice mode,” OpenAI writes. “During testing, we also observed rare instances where the model would unintentionally generate an output emulating the user’s voice.”

In this example of unintentional voice generation provided by OpenAI, the AI model outbursts “No!” and continues the sentence in a voice that sounds similar to the “red teamer” heard in the beginning of the clip. (A red teamer is a person hired by a company to do adversarial testing.)

It would certainly be creepy to be talking to a machine and then have it unexpectedly begin talking to you in your own voice. Ordinarily, OpenAI has safeguards to prevent this, which is why the company says this occurrence was rare even before it developed ways to prevent it completely. But the example prompted BuzzFeed data scientist Max Woolf to tweet, “OpenAI just leaked the plot of Black Mirror’s next season.”

Audio prompt injections

How could voice imitation happen with OpenAI’s new model? The primary clue lies elsewhere in the GPT-4o system card. To create voices, GPT-4o can apparently synthesize almost any type of sound found in its training data, including sound effects and music (though OpenAI discourages that behavior with special instructions).

As noted in the system card, the model can fundamentally imitate any voice based on a short audio clip. OpenAI guides this capability safely by providing an authorized voice sample (of a hired voice actor) that it is instructed to imitate. It provides the sample in the AI model’s system prompt (what OpenAI calls the “system message”) at the beginning of a conversation. “We supervise ideal completions using the voice sample in the system message as the base voice,” writes OpenAI.

In text-only LLMs, the system message is a hidden set of text instructions that guides behavior of the chatbot that gets added to the conversation history silently just before the chat session begins. Successive interactions are appended to the same chat history, and the entire context (often called a “context window”) is fed back into the AI model each time the user provides a new input.

(It’s probably time to update this diagram created in early 2023 below, but it shows how the context window works in an AI chat. Just imagine that the first prompt is a system message that says things like “You are a helpful chatbot. You do not talk about violent acts, etc.”)

Enlarge / A diagram showing how GPT conversational language model prompting works.

Benj Edwards / Ars Technica

Since GPT-4o is multimodal and can process tokenized audio, OpenAI can also use audio inputs as part of the model’s system prompt, and that’s what it does when OpenAI provides an authorized voice sample for the model to imitate. The company also uses another system to detect if the model is generating unauthorized audio. “We only allow the model to use certain pre-selected voices,” writes OpenAI, “and use an output classifier to detect if the model deviates from that.”

ChatGPT unexpectedly began speaking in a user’s cloned voice during testing Read More »

Man vs. machine: DeepMind’s new robot serves up a table tennis triumph

AI, AlphaPong, Biz & IT, Google, google deepmind, industrial robot, IRB 1100, machine learning, ping pong, robot, robot arm, Robots, Table Tennis / Rejus Almole / August 8, 2024

John Henry was a steel-driving man —

Human-beating ping-pong AI learned to play in a simulated environment.

Benj Edwards – Aug 8, 2024 6: 59 pm UTC

A blue illustration of a robotic arm playing table tennis. — Benj Edwards / Google DeepMind

On Wednesday, researchers at Google DeepMind revealed the first AI-powered robotic table tennis player capable of competing at an amateur human level. The system combines an industrial robot arm called the ABB IRB 1100 and custom AI software from DeepMind. While an expert human player can still defeat the bot, the system demonstrates the potential for machines to master complex physical tasks that require split-second decision-making and adaptability.

“This is the first robot agent capable of playing a sport with humans at human level,” the researchers wrote in a preprint paper listed on arXiv. “It represents a milestone in robot learning and control.”

The unnamed robot agent (we suggest “AlphaPong”), developed by a team that includes David B. D’Ambrosio, Saminda Abeyruwan, and Laura Graesser, showed notable performance in a series of matches against human players of varying skill levels. In a study involving 29 participants, the AI-powered robot won 45 percent of its matches, demonstrating solid amateur-level play. Most notably, it achieved a 100 percent win rate against beginners and a 55 percent win rate against intermediate players, though it struggled against advanced opponents.

A Google DeepMind video of the AI agent rallying with a human table tennis player.

The physical setup consists of the aforementioned IRB 1100, a 6-degree-of-freedom robotic arm, mounted on two linear tracks, allowing it to move freely in a 2D plane. High-speed cameras track the ball’s position, while a motion-capture system monitors the human opponent’s paddle movements.

AI at the core

To create the brains that power the robotic arm, DeepMind researchers developed a two-level approach that allows the robot to execute specific table tennis techniques while adapting its strategy in real time to each opponent’s playing style. In other words, it’s adaptable enough to play any amateur human at table tennis without requiring specific per-player training.

The system’s architecture combines low-level skill controllers (neural network policies trained to execute specific table tennis techniques like forehand shots, backhand returns, or serve responses) with a high-level strategic decision-maker (a more complex AI system that analyzes the game state, adapts to the opponent’s style, and selects which low-level skill policy to activate for each incoming ball).

The researchers state that one of the key innovations of this project was the method used to train the AI models. The researchers chose a hybrid approach that used reinforcement learning in a simulated physics environment, while grounding the training data in real-world examples. This technique allowed the robot to learn from around 17,500 real-world ball trajectories—a fairly small dataset for a complex task.

A Google DeepMind video showing an illustration of how the AI agent analyzes human players.

The researchers used an iterative process to refine the robot’s skills. They started with a small dataset of human-vs-human gameplay, then let the AI loose against real opponents. Each match generated new data on ball trajectories and human strategies, which the team fed back into the simulation for further training. This process, repeated over seven cycles, allowed the robot to continuously adapt to increasingly skilled opponents and diverse play styles. By the final round, the AI had learned from over 14,000 rally balls and 3,000 serves, creating a body of table tennis knowledge that helped it bridge the gap between simulation and reality.

Interestingly, Nvidia has also been experimenting with similar simulated physics systems, such as Eureka, that allow an AI model to rapidly learn to control a robotic arm in simulated space instead of the real world (since the physics can be accelerated inside the simulation, and thousands of simultaneous trials can take place). This method is likely to dramatically reduce the time and resources needed to train robots for complex interactions in the future.

Humans enjoyed playing against it

Beyond its technical achievements, the study also explored the human experience of playing against an AI opponent. Surprisingly, even players who lost to the robot reported enjoying the experience. “Across all skill groups and win rates, players agreed that playing with the robot was ‘fun’ and ‘engaging,'” the researchers noted. This positive reception suggests potential applications for AI in sports training and entertainment.

However, the system is not without limitations. It struggles with extremely fast or high balls, has difficulty reading intense spin, and shows weaker performance in backhand plays. Google DeepMind shared an example video of the AI agent losing a point to an advanced player due to what appears to be difficulty reacting to a speedy hit, as you can see below.

A Google DeepMind video of the AI agent playing against an advanced human player.

The implications of this robotic ping-pong prodigy extend beyond the world of table tennis, according to the researchers. The techniques developed for this project could be applied to a wide range of robotic tasks that require quick reactions and adaptation to unpredictable human behavior. From manufacturing to health care (or just spanking someone with a paddle repeatedly), the potential applications seem large indeed.

The research team at Google DeepMind emphasizes that with further refinement, they believe the system could potentially compete with advanced table tennis players in the future. DeepMind is no stranger to creating AI models that can defeat human game players, including AlphaZero and AlphaGo. With this latest robot agent, it’s looking like the research company is moving beyond board games and into physical sports. Chess and Jeopardy have already fallen to AI-powered victors—perhaps table tennis is next.

Man vs. machine: DeepMind’s new robot serves up a table tennis triumph Read More »

Major shifts at OpenAI spark skepticism about impending AGI timelines

agi, AI, ASI, Benjamin De Kraker, Biz & IT, chatgpt, chatgtp, GPT-4, greg brockman, John Schulman, machine learning, openai, Peter Deng, superintelligence / Mike M. / August 7, 2024

Shuffling the deck —

De Kraker: “If OpenAI is right on the verge of AGI, why do prominent people keep leaving?”

Benj Edwards – Aug 7, 2024 7: 08 pm UTC

Benj Edwards / Getty Images

Over the past week, OpenAI experienced a significant leadership shake-up as three key figures announced major changes. Greg Brockman, the company’s president and co-founder, is taking an extended sabbatical until the end of the year, while another co-founder, John Schulman, permanently departed for rival Anthropic. Peter Deng, VP of Consumer Product, has also left the ChatGPT maker.

In a post on X, Brockman wrote, “I’m taking a sabbatical through end of year. First time to relax since co-founding OpenAI 9 years ago. The mission is far from complete; we still have a safe AGI to build.”

The moves have led some to wonder just how close OpenAI is to a long-rumored breakthrough of some kind of reasoning artificial intelligence if high-profile employees are jumping ship (or taking long breaks, in the case of Brockman) so easily. As AI developer Benjamin De Kraker put it on X, “If OpenAI is right on the verge of AGI, why do prominent people keep leaving?”

AGI refers to a hypothetical AI system that could match human-level intelligence across a wide range of tasks without specialized training. It’s the ultimate goal of OpenAI, and company CEO Sam Altman has said it could emerge in the “reasonably close-ish future.” AGI is also a concept that has sparked concerns about potential existential risks to humanity and the displacement of knowledge workers. However, the term remains somewhat vague, and there’s considerable debate in the AI community about what truly constitutes AGI or how close we are to achieving it.

The emergence of the “next big thing” in AI has been seen by critics such as Ed Zitron as a necessary step to justify ballooning investments in AI models that aren’t yet profitable. The industry is holding its breath that OpenAI, or a competitor, has some secret breakthrough waiting in the wings that will justify the massive costs associated with training and deploying LLMs.

But other AI critics, such as Gary Marcus, have postulated that major AI companies have reached a plateau of large language model (LLM) capability centered around GPT-4-level models since no AI company has yet made a major leap past the groundbreaking LLM that OpenAI released in March 2023. Microsoft CTO Kevin Scott has countered these claims, saying that LLM “scaling laws” (that suggest LLMs increase in capability proportionate to more compute power thrown at them) will continue to deliver improvements over time and that more patience is needed as the next generation (say, GPT-5) undergoes training.

In the scheme of things, Brockman’s move sounds like an extended, long overdue vacation (or perhaps a period to deal with personal issues beyond work). Regardless of the reason, the duration of the sabbatical raises questions about how the president of a major tech company can suddenly disappear for four months without affecting day-to-day operations, especially during a critical time in its history.

Unless, of course, things are fairly calm at OpenAI—and perhaps GPT-5 isn’t going to ship until at least next year when Brockman returns. But this is speculation on our part, and OpenAI (whether voluntarily or not) sometimes surprises us when we least expect it. (Just today, Altman dropped a hint on X about strawberries that some people interpret as being a hint of a potential major model undergoing testing or nearing release.)

A pattern of departures and the rise of Anthropic

What may sting OpenAI the most about the recent departures is that a few high-profile employees have left to join Anthropic, a San Francisco-based AI company founded in 2021 by ex-OpenAI employees Daniela and Dario Amodei.

Anthropic offers a subscription service called Claude.ai that is similar to ChatGPT. Its most recent LLM, Claude 3.5 Sonnet, along with its web-based interface, has rapidly gained favor over ChatGPT among some LLM users who are vocal on social media, though it likely does not yet match ChatGPT in terms of mainstream brand recognition.

In particular, John Schulman, an OpenAI co-founder and key figure in the company’s post-training process for LLMs, revealed in a statement on X that he’s leaving to join rival AI firm Anthropic to do more hands-on work: “This choice stems from my desire to deepen my focus on AI alignment, and to start a new chapter of my career where I can return to hands-on technical work.” Alignment is a field that hopes to guide AI models to produce helpful outputs.

In May, OpenAI alignment researcher Jan Leike left OpenAI to join Anthropic as well, criticizing OpenAI’s handling of alignment safety.

Adding to the recent employee shake-up, The Information reports that Peter Deng, a product leader who joined OpenAI last year after stints at Meta Platforms, Uber, and Airtable, has also left the company, though we do not yet know where he is headed. In May, OpenAI co-founder Ilya Sutskever left to found a rival startup, and prominent software engineer Andrej Karpathy departed in February, recently launching an educational venture.

As De Kraker noted, if OpenAI were on the verge of developing world-changing AI technology, wouldn’t these high-profile AI veterans want to stick around and be part of this historic moment in time? “Genuine question,” he wrote. “If you were pretty sure the company you’re a key part of—and have equity in—is about to crack AGI within one or two years… why would you jump ship?”

Despite the departures, Schulman expressed optimism about OpenAI’s future in his farewell note on X. “I am confident that OpenAI and the teams I was part of will continue to thrive without me,” he wrote. “I’m incredibly grateful for the opportunity to participate in such an important part of history and I’m proud of what we’ve achieved together. I’ll still be rooting for you all, even while working elsewhere.”

This article was updated on August 7, 2024 at 4: 23 PM to mention Sam Altman’s tweet about strawberries.

Major shifts at OpenAI spark skepticism about impending AGI timelines Read More »

CrowdStrike claps back at Delta, says airline rejected offers for help

Biz & IT, Crowdstrike, Delta, syndication / Rejus Almole / August 5, 2024

Who’s going to pay for this mess? —

Delta is creating a “misleading narrative,” according to CrowdStrike’s lawyers.

Stephanie Stacey and Philip Georgiadis, FT – Aug 5, 2024 1: 37 pm UTC

LOS ANGELES, CALIFORNIA - JULY 23: Travelers from France wait on their delayed flight on the check-in floor of the Delta Air Lines terminal at Los Angeles International Airport (LAX) on July 23, 2024 in Los Angeles, California. — Enlarge / LOS ANGELES, CALIFORNIA – JULY 23: Travelers from France wait on their delayed flight on the check-in floor of the Delta Air Lines terminal at Los Angeles International Airport (LAX) on July 23, 2024 in Los Angeles, California.

CrowdStrike has hit back at Delta Air Lines’ threat of litigation against the cyber security company over a botched software update that grounded thousands of flights, denying it was responsible for the carrier’s own IT decisions and days-long disruption.

In a letter on Sunday, lawyers for CrowdStrike argued that the US carrier had created a “misleading narrative” that the cyber security firm was “grossly negligent” in an incident that the US airline has said will cost it $500 million.

Delta took days longer than its rivals to recover when CrowdStrike’s update brought down millions of Windows computers around the world last month. The airline has alerted the cyber security company that it plans to seek damages for the disruptions and hired litigation firm Boies Schiller Flexner.

CrowdStrike addressed Sunday’s letter to the law firm, whose chair, David Boies, has previously represented the US government in its antitrust case against Microsoft and Harvey Weinstein, among other prominent clients.

Microsoft has estimated that about 8.5 million Windows devices were hit by the faulty update, which stranded airline passengers, interrupted hospital appointments and took broadcasters off air around the world. CrowdStrike said last week that 99 percent of Windows devices running the affected Falcon software were now back online.

Major US airlines Delta, United, and American briefly grounded their aircraft on the morning of July 19. But while United and American were able to restore their operations over the weekend, Delta’s flight disruptions continued well into the following week.

The Atlanta-based carrier in the end canceled more than 6,000 flights, triggering an investigation from the US Department of Transportation amid claims of poor customer service during the operational chaos.

CrowdStrike’s lawyer, Michael Carlinsky, co-managing partner of Quinn Emanuel Urquhart & Sullivan, wrote that, if it pursues legal action, Delta Air Lines would have to explain why its competitors were able to restore their operations much faster.

He added: “Should Delta pursue this path, Delta will have to explain to the public, its shareholders, and ultimately a jury why CrowdStrike took responsibility for its actions—swiftly, transparently and constructively—while Delta did not.”

CrowdStrike also claimed that Delta’s leadership had ignored and rejected offers for help: “CrowdStrike’s CEO personally reached out to Delta’s CEO to offer onsite assistance, but received no response. CrowdStrike followed up with Delta on the offer for onsite support and was told that the onsite resources were not needed.”

Delta Chief Executive Ed Bastian said last week that CrowdStrike had not “offered anything” to make up for the disruption at the airline. “Free consulting advice to help us—that’s the extent of it,” he told CNBC on Wednesday.

While Bastian has said that the disruption would cost Delta $500 million, CrowdStrike insisted that “any liability by CrowdStrike is contractually capped at an amount in the single-digit millions.”

A spokesperson for CrowdStrike accused Delta of “public posturing about potentially bringing a meritless lawsuit against CrowdStrike” and said it hoped the airline would “agree to work cooperatively to find a resolution.”

Delta Air Lines declined to comment.

CrowdStrike claps back at Delta, says airline rejected offers for help Read More »

FLUX: This new AI image generator is eerily good at creating human hands

AI, AI image generation, AI image generator, Andreas Blattman, Biz & IT, Black Forest Labs, Dominik Lorenz, FLUX.1, image synthesis, machine learning, Patrick Esser, Robin Rombach, Stability AI, Stable Diffusion, Stable Diffusion 3 / Kris Guyer / August 2, 2024

five-finger salute —

FLUX.1 is the open-weights heir apparent to Stable Diffusion, turning text into images.

Benj Edwards – Aug 2, 2024 5: 47 pm UTC

Enlarge / AI-generated image by FLUX.1 dev: “A beautiful queen of the universe holding up her hands, face in the background.”

FLUX.1

On Thursday, AI-startup Black Forest Labs announced the launch of its company and the release of its first suite of text-to-image AI models, called FLUX.1. The German-based company, founded by researchers who developed the technology behind Stable Diffusion and invented the latent diffusion technique, aims to create advanced generative AI for images and videos.

The launch of FLUX.1 comes about seven weeks after Stability AI’s troubled release of Stable Diffusion 3 Medium in mid-June. Stability AI’s offering faced widespread criticism among image-synthesis hobbyists for its poor performance in generating human anatomy, with users sharing examples of distorted limbs and bodies across social media. That problematic launch followed the earlier departure of three key engineers from Stability AI—Robin Rombach, Andreas Blattmann, and Dominik Lorenz—who went on to found Black Forest Labs along with latent diffusion co-developer Patrick Esser and others.

Black Forest Labs launched with the release of three FLUX.1 text-to-image models: a high-end commercial “pro” version, a mid-range “dev” version with open weights for non-commercial use, and a faster open-weights “schnell” version (“schnell” means quick or fast in German). Black Forest Labs claims its models outperform existing options like Midjourney and DALL-E in areas such as image quality and adherence to text prompts.

AI-generated image by FLUX.1 dev: “A close-up photo of a pair of hands holding a plate full of pickles.”

FLUX.1
AI-generated image by FLUX.1 dev: A hand holding up five fingers with a starry background.

FLUX.1
AI-generated image by FLUX.1 dev: “An Ars Technica reader sitting in front of a computer monitor. The screen shows the Ars Technica website.”

FLUX.1
AI-generated image by FLUX.1 dev: “a boxer posing with fists raised, no gloves.”

FLUX.1
AI-generated image by FLUX.1 dev: “An advertisement for ‘Frosted Prick’ cereal.”

FLUX.1
AI-generated image of a happy woman in a bakery baking a cake by FLUX.1 dev.

FLUX.1
AI-generated image by FLUX.1 dev: “An advertisement for ‘Marshmallow Menace’ cereal.”

FLUX.1
AI-generated image of “A handsome Asian influencer on top of the Empire State Building, instagram” by FLUX.1 dev.

FLUX.1

In our experience, the outputs of the two higher-end FLUX.1 models are generally comparable with OpenAI’s DALL-E 3 in prompt fidelity, with photorealism that seems close to Midjourney 6. They represent a significant improvement over Stable Diffusion XL, the team’s last major release under Stability (if you don’t count SDXL Turbo).

The FLUX.1 models use what the company calls a “hybrid architecture” combining transformer and diffusion techniques, scaled up to 12 billion parameters. Black Forest Labs said it improves on previous diffusion models by incorporating flow matching and other optimizations.

FLUX.1 seems competent at generating human hands, which was a weak spot in earlier image-synthesis models like Stable Diffusion 1.5 due to a lack of training images that focused on hands. Since those early days, other AI image generators like Midjourney have mastered hands as well, but it’s notable to see an open-weights model that renders hands relatively accurately in various poses.

We downloaded the weights file to the FLUX.1 dev model from GitHub, but at 23GB, it won’t fit in the 12GB VRAM of our RTX 3060 card, so it will need quantization to run locally (reducing its size), which reportedly (through chatter on Reddit) some people have already had success with.

Instead, we experimented with FLUX.1 models on AI cloud-hosting platforms Fal and Replicate, which cost money to use, though Fal offers some free credits to start.

Black Forest looks ahead

Black Forest Labs may be a new company, but it’s already attracting funding from investors. It recently closed a $31 million Series Seed funding round led by Andreessen Horowitz, with additional investments from General Catalyst and MätchVC. The company also brought on high-profile advisers, including entertainment executive and former Disney President Michael Ovitz and AI researcher Matthias Bethge.

“We believe that generative AI will be a fundamental building block of all future technologies,” the company stated in its announcement. “By making our models available to a wide audience, we want to bring its benefits to everyone, educate the public and enhance trust in the safety of these models.”

AI-generated image by FLUX.1 dev: A cat in a car holding a can of beer that reads, ‘AI Slop.’

FLUX.1
AI-generated image by FLUX.1 dev: Mickey Mouse and Spider-Man singing to each other.

FLUX.1
AI-generated image by FLUX.1 dev: “a muscular barbarian with weapons beside a CRT television set, cinematic, 8K, studio lighting.”

FLUX.1
AI-generated image of a flaming cheeseburger created by FLUX.1 dev.

FLUX.1
AI-generated image by FLUX.1 dev: “Will Smith eating spaghetti.”

FLUX.1
AI-generated image by FLUX.1 dev: “a muscular barbarian with weapons beside a CRT television set, cinematic, 8K, studio lighting. The screen reads ‘Ars Technica.'”

FLUX.1
AI-generated image by FLUX.1 dev: “An advertisement for ‘Burt’s Grenades’ cereal.”

FLUX.1
AI-generated image by FLUX.1 dev: “A close-up photo of a pair of hands holding a plate that contains a portrait of the queen of the universe”

FLUX.1

Speaking of “trust and safety,” the company did not mention where it obtained the training data that taught the FLUX.1 models how to generate images. Judging by the outputs we could produce with the model that included depictions of copyrighted characters, Black Forest Labs likely used a huge unauthorized image scrape of the Internet, possibly collected by LAION, an organization that collected the datasets that trained Stable Diffusion. This is speculation at this point. While the underlying technological achievement of FLUX.1 is notable, it feels likely that the team is playing fast and loose with the ethics of “fair use” image scraping much like Stability AI did. That practice may eventually attract lawsuits like those filed against Stability AI.

Though text-to-image generation is Black Forest’s current focus, the company plans to expand into video generation next, saying that FLUX.1 will serve as the foundation of a new text-to-video model in development, which will compete with OpenAI’s Sora, Runway’s Gen-3 Alpha, and Kuaishou’s Kling in a contest to warp media reality on demand. “Our video models will unlock precise creation and editing at high definition and unprecedented speed,” the Black Forest announcement claims.

FLUX: This new AI image generator is eerily good at creating human hands Read More »

Senators propose “Digital replication right” for likeness, extending 70 years after death

AI, AI regulation, Amy Klobuchar, Biz & IT, Chris Coons, machine learning, Marsha Blackburn, NO FAKES act, Policy, Thom Tillis / Mike M. / August 1, 2024

NO SCRUBS —

Law would hold US individuals and firms liable for ripping off a person’s digital likeness.

Benj Edwards – Aug 1, 2024 5: 45 pm UTC

A stock photo illustration of a person's face lit with pink light.

On Wednesday, US Sens. Chris Coons (D-Del.), Marsha Blackburn (R.-Tenn.), Amy Klobuchar (D-Minn.), and Thom Tillis (R-NC) introduced the Nurture Originals, Foster Art, and Keep Entertainment Safe (NO FAKES) Act of 2024. The bipartisan legislation, up for consideration in the US Senate, aims to protect individuals from unauthorized AI-generated replicas of their voice or likeness.

The NO FAKES Act would create legal recourse for people whose digital representations are created without consent. It would hold both individuals and companies liable for producing, hosting, or sharing these unauthorized digital replicas, including those created by generative AI. Due to generative AI technology that has become mainstream in the past two years, creating audio or image media fakes of people has become fairly trivial, with easy photorealistic video replicas likely next to arrive.

In a press statement, Coons emphasized the importance of protecting individual rights in the age of AI. “Everyone deserves the right to own and protect their voice and likeness, no matter if you’re Taylor Swift or anyone else,” he said, referring to a widely publicized deepfake incident involving the musical artist in January. “Generative AI can be used as a tool to foster creativity, but that can’t come at the expense of the unauthorized exploitation of anyone’s voice or likeness.”

The introduction of the NO FAKES Act follows the Senate’s passage of the DEFIANCE Act, which allows victims of sexual deepfakes to sue for damages.

In addition to the Swift saga, over the past few years, we’ve seen AI-powered scams involving fake celebrity endorsements, the creation of misleading political content, and situations where school kids have used AI tech to create pornographic deepfakes of classmates. Recently, X CEO Elon Musk shared a video that featured an AI-generated voice of Vice President Kamala Harris saying things she didn’t say in real life.

These incidents, in addition to concerns about actors’ likenesses being replicated without permission, have created an increasing sense of urgency among US lawmakers, who want to limit the impact of unauthorized digital likenesses. Currently, certain types of AI-generated deepfakes are already illegal due to a patchwork of federal and state laws, but this new act hopes to unify likeness regulation around the concept of “digital replicas.”

Digital replicas

Enlarge / An AI-generated image of a person.

Benj Edwards / Ars Technica

To protect a person’s digital likeness, the NO FAKES Act introduces a “digital replication right” that gives individuals exclusive control over the use of their voice or visual likeness in digital replicas. This right extends 10 years after death, with possible five-year extensions if actively used. It can be licensed during life and inherited after death, lasting up to 70 years after an individual’s death. Along the way, the bill defines what it considers to be a “digital replica”:

DIGITAL REPLICA.-The term “digital replica” means a newly created, computer-generated, highly realistic electronic representation that is readily identifiable as the voice or visual likeness of an individual that- (A) is embodied in a sound recording, image, audiovisual work, including an audiovisual work that does not have any accompanying sounds, or transmission- (i) in which the actual individual did not actually perform or appear; or (ii) that is a version of a sound recording, image, or audiovisual work in which the actual individual did perform or appear, in which the fundamental character of the performance or appearance has been materially altered; and (B) does not include the electronic reproduction, use of a sample of one sound recording or audiovisual work into another, remixing, mastering, or digital remastering of a sound recording or audiovisual work authorized by the copyright holder.

(There’s some irony in the mention of an “audiovisual work that does not have any accompanying sounds.”)

Since this bill bans types of artistic expression, the NO FAKES Act includes provisions that aim to balance IP protection with free speech. It provides exclusions for recognized First Amendment protections, such as documentaries, biographical works, and content created for purposes of comment, criticism, or parody.

In some ways, those exceptions could create a very wide protection gap that may be difficult to enforce without specific court decisions on a case-by-case basis. But without them, the NO FAKES Act could potentially stifle Americans’ constitutionally protected rights of free expression since the concept of “digital replicas” outlined in the bill includes any “computer-generated, highly realistic” digital likeness of a real person, whether AI-generated or not. For example, is a photorealistic Photoshop illustration of a person “computer-generated?” Similar questions may lead to uncertainty in enforcement.

Wide support from entertainment industry

So far, the NO FAKES Act has gained support from various entertainment industry groups, including Screen Actors Guild-American Federation of Television and Radio Artists (SAG-AFTRA), the Recording Industry Association of America (RIAA), the Motion Picture Association, and the Recording Academy. These organizations have been actively seeking protections against unauthorized AI re-creations.

The bill has also been endorsed by entertainment companies such as The Walt Disney Company, Warner Music Group, Universal Music Group, Sony Music, the Independent Film & Television Alliance, William Morris Endeavor, Creative Arts Agency, the Authors Guild, and Vermillio.

Several tech companies, including IBM and OpenAI, have also backed the NO FAKES Act. Anna Makanju, OpenAI’s vice president of global affairs, said in a statement that the act would protect creators and artists from improper impersonation. “OpenAI is pleased to support the NO FAKES Act, which would protect creators and artists from unauthorized digital replicas of their voices and likenesses,” she said.

In a statement, Coons highlighted the collaborative effort behind the bill’s development. “I am grateful for the bipartisan partnership of Senators Blackburn, Klobuchar, and Tillis and the support of stakeholders from across the entertainment and technology industries as we work to find the balance between the promise of AI and protecting the inherent dignity we all have in our own personhood.”

Senators propose “Digital replication right” for likeness, extending 70 years after death Read More »

ChatGPT Advanced Voice Mode impresses testers with sound effects, catching its breath

Advanced Voice Mode, AI, ai voice generators, audio synthesis, Biz & IT, chatgpt, chatgtp, large language models, machine learning, openai, text synthesis, voice synthesis / Paul Patrick / July 31, 2024

I Am the Very Model of a Modern Major-General —

AVM allows uncanny real-time voice conversations with ChatGPT that you can interrupt.

Benj Edwards – Jul 31, 2024 6: 14 pm UTC

Stock Photo: AI Cyborg Robot Whispering Secret Or Interesting Gossip — Enlarge / A stock photo of a robot whispering to a man.

On Tuesday, OpenAI began rolling out an alpha version of its new Advanced Voice Mode to a small group of ChatGPT Plus subscribers. This feature, which OpenAI previewed in May with the launch of GPT-4o, aims to make conversations with the AI more natural and responsive. In May, the feature triggered criticism of its simulated emotional expressiveness and prompted a public dispute with actress Scarlett Johansson over accusations that OpenAI copied her voice. Even so, early tests of the new feature shared by users on social media have been largely enthusiastic.

In early tests reported by users with access, Advanced Voice Mode allows them to have real-time conversations with ChatGPT, including the ability to interrupt the AI mid-sentence almost instantly. It can sense and respond to a user’s emotional cues through vocal tone and delivery, and provide sound effects while telling stories.

But what has caught many people off-guard initially is how the voices simulate taking a breath while speaking.

“ChatGPT Advanced Voice Mode counting as fast as it can to 10, then to 50 (this blew my mind—it stopped to catch its breath like a human would),” wrote tech writer Cristiano Giardina on X.

Advanced Voice Mode simulates audible pauses for breath because it was trained on audio samples of humans speaking that included the same feature. The model has learned to simulate inhalations at seemingly appropriate times after being exposed to hundreds of thousands, if not millions, of examples of human speech. Large language models (LLMs) like GPT-4o are master imitators, and that skill has now extended to the audio domain.

Giardina shared his other impressions about Advanced Voice Mode on X, including observations about accents in other languages and sound effects.

“It’s very fast, there’s virtually no latency from when you stop speaking to when it responds,” he wrote. “When you ask it to make noises it always has the voice “perform” the noises (with funny results). It can do accents, but when speaking other languages it always has an American accent. (In the video, ChatGPT is acting as a soccer match commentator)“

Speaking of sound effects, X user Kesku, who is a moderator of OpenAI’s Discord server, shared an example of ChatGPT playing multiple parts with different voices and another of a voice recounting an audiobook-sounding sci-fi story from the prompt, “Tell me an exciting action story with sci-fi elements and create atmosphere by making appropriate noises of the things happening using onomatopoeia.”

Kesku also ran a few example prompts for us, including a story about the Ars Technica mascot “Moonshark.”

He also asked it to sing the “Major-General’s Song” from Gilbert and Sullivan’s 1879 comic opera The Pirates of Penzance:

Frequent AI advocate Manuel Sainsily posted a video of Advanced Voice Mode reacting to camera input, giving advice about how to care for a kitten. “It feels like face-timing a super knowledgeable friend, which in this case was super helpful—reassuring us with our new kitten,” he wrote. “It can answer questions in real-time and use the camera as input too!”

Of course, being based on an LLM, it may occasionally confabulate incorrect responses on topics or in situations where its “knowledge” (which comes from GPT-4o’s training data set) is lacking. But if considered a tech demo or an AI-powered amusement and you’re aware of the limitations, Advanced Voice Mode seems to successfully execute many of the tasks shown by OpenAI’s demo in May.

Safety

An OpenAI spokesperson told Ars Technica that the company worked with more than 100 external testers on the Advanced Voice Mode release, collectively speaking 45 different languages and representing 29 geographical areas. The system is reportedly designed to prevent impersonation of individuals or public figures by blocking outputs that differ from OpenAI’s four chosen preset voices.

OpenAI has also added filters to recognize and block requests to generate music or other copyrighted audio, which has gotten other AI companies in trouble. Giardina reported audio “leakage” in some audio outputs that have unintentional music in the background, showing that OpenAI trained the AVM voice model on a wide variety of audio sources, likely both from licensed material and audio scraped from online video platforms.

Availability

OpenAI plans to expand access to more ChatGPT Plus users in the coming weeks, with a full launch to all Plus subscribers expected this fall. A company spokesperson told Ars that users in the alpha test group will receive a notice in the ChatGPT app and an email with usage instructions.

Since the initial preview of GPT-4o voice in May, OpenAI claims to have enhanced the model’s ability to support millions of simultaneous, real-time voice conversations while maintaining low latency and high quality. In other words, they are gearing up for a rush that will take a lot of back-end computation to accommodate.

ChatGPT Advanced Voice Mode impresses testers with sound effects, catching its breath Read More »