ai voice generators


ChatGPT Advanced Voice Mode impresses testers with sound effects, catching its breath

I Am the Very Model of a Modern Major-General —

AVM allows uncanny real-time voice conversations with ChatGPT that you can interrupt.

A stock photo of a robot whispering to a man.

On Tuesday, OpenAI began rolling out an alpha version of its new Advanced Voice Mode to a small group of ChatGPT Plus subscribers. This feature, which OpenAI previewed in May with the launch of GPT-4o, aims to make conversations with the AI more natural and responsive. In May, the feature triggered criticism of its simulated emotional expressiveness and prompted a public dispute with actress Scarlett Johansson over accusations that OpenAI copied her voice. Even so, early tests of the new feature shared by users on social media have been largely enthusiastic.

In early tests reported by users with access, Advanced Voice Mode allows them to have real-time conversations with ChatGPT, including the ability to interrupt the AI mid-sentence almost instantly. It can sense and respond to a user’s emotional cues through vocal tone and delivery, and provide sound effects while telling stories.

But what initially caught many people off guard is how the voices simulate taking a breath while speaking.

“ChatGPT Advanced Voice Mode counting as fast as it can to 10, then to 50 (this blew my mind—it stopped to catch its breath like a human would),” wrote tech writer Cristiano Giardina on X.

Advanced Voice Mode simulates audible pauses for breath because it was trained on audio samples of humans speaking that included the same feature. The model has learned to simulate inhalations at seemingly appropriate times after being exposed to hundreds of thousands, if not millions, of examples of human speech. Large language models (LLMs) like GPT-4o are master imitators, and that skill has now extended to the audio domain.

Giardina shared his other impressions about Advanced Voice Mode on X, including observations about accents in other languages and sound effects.

“It’s very fast, there’s virtually no latency from when you stop speaking to when it responds,” he wrote. “When you ask it to make noises it always has the voice ‘perform’ the noises (with funny results). It can do accents, but when speaking other languages it always has an American accent. (In the video, ChatGPT is acting as a soccer match commentator.)”

Speaking of sound effects, X user Kesku, who is a moderator of OpenAI’s Discord server, shared an example of ChatGPT playing multiple parts with different voices and another of a voice recounting an audiobook-sounding sci-fi story from the prompt, “Tell me an exciting action story with sci-fi elements and create atmosphere by making appropriate noises of the things happening using onomatopoeia.”

Kesku also ran a few example prompts for us, including a story about the Ars Technica mascot “Moonshark.”

He also asked it to sing the “Major-General’s Song” from Gilbert and Sullivan’s 1879 comic opera The Pirates of Penzance.

Frequent AI advocate Manuel Sainsily posted a video of Advanced Voice Mode reacting to camera input, giving advice about how to care for a kitten. “It feels like face-timing a super knowledgeable friend, which in this case was super helpful—reassuring us with our new kitten,” he wrote. “It can answer questions in real-time and use the camera as input too!”

Of course, because it is based on an LLM, it may occasionally confabulate incorrect responses on topics or in situations where its “knowledge” (which comes from GPT-4o’s training data set) is lacking. But if you treat it as a tech demo or an AI-powered amusement and are aware of its limitations, Advanced Voice Mode seems to successfully execute many of the tasks shown in OpenAI’s demo in May.

Safety

An OpenAI spokesperson told Ars Technica that the company worked with more than 100 external testers on the Advanced Voice Mode release, collectively speaking 45 different languages and representing 29 geographical areas. The system is reportedly designed to prevent impersonation of individuals or public figures by blocking outputs that differ from OpenAI’s four chosen preset voices.

OpenAI has also added filters to recognize and block requests to generate music or other copyrighted audio, the kind of output that has gotten other AI companies in trouble. Giardina reported audio “leakage” in some outputs, which contained unintentional music in the background, suggesting that OpenAI trained the AVM voice model on a wide variety of audio sources, likely including both licensed material and audio scraped from online video platforms.

Availability

OpenAI plans to expand access to more ChatGPT Plus users in the coming weeks, with a full launch to all Plus subscribers expected this fall. A company spokesperson told Ars that users in the alpha test group will receive a notice in the ChatGPT app and an email with usage instructions.

Since the initial preview of GPT-4o voice in May, OpenAI claims to have enhanced the model’s ability to support millions of simultaneous, real-time voice conversations while maintaining low latency and high quality. In other words, they are gearing up for a rush that will take a lot of back-end computation to accommodate.


OpenAI pauses ChatGPT-4o voice that fans said ripped off Scarlett Johansson

“Her” —

“Sky’s voice is not an imitation of Scarlett Johansson,” OpenAI insists.

Scarlett Johansson and Joaquin Phoenix attend Her premiere during the 8th Rome Film Festival at Auditorium Parco Della Musica on November 10, 2013, in Rome, Italy.

OpenAI has paused a voice mode option for ChatGPT-4o, Sky, after backlash accusing the AI company of intentionally ripping off Scarlett Johansson’s critically acclaimed voice-acting performance in the 2013 sci-fi film Her.

In a blog post defending its casting decision for Sky, OpenAI went into great detail explaining its process for choosing the individual voice options for its chatbot. But ultimately, the company seemed pressed to admit that Sky’s voice was just too similar to Johansson’s to keep using it, at least for now.

“We believe that AI voices should not deliberately mimic a celebrity’s distinctive voice—Sky’s voice is not an imitation of Scarlett Johansson but belongs to a different professional actress using her own natural speaking voice,” OpenAI’s blog said.

OpenAI is not naming the actress, or any of the ChatGPT-4o voice actors, to protect their privacy.

A week ago, OpenAI CEO Sam Altman seemed to invite this controversy by posting “her” on X (formerly Twitter) after announcing the ChatGPT audio-video features that he said made it more “natural” for users to interact with the chatbot.

Altman has said that Her, a movie about a man who falls in love with his virtual assistant, is among his favorite movies. He told conference attendees at Dreamforce last year that the movie “was incredibly prophetic” when depicting “interaction models of how people use AI,” The San Francisco Standard reported. And just last week, Altman touted GPT-4o’s new voice mode by promising, “it feels like AI from the movies.”

But OpenAI’s chief technology officer, Mira Murati, has said that GPT-4o’s voice modes were less inspired by Her than by studying the “really natural, rich, and interactive” aspects of human conversation, The Wall Street Journal reported.

In 2013, of course, critics praised Johansson’s Her performance as expressively capturing a wide range of emotions, which is exactly what Murati described as OpenAI’s goals for its chatbot voices. Rolling Stone noted how effectively Johansson naturally navigated between “tones sweet, sexy, caring, manipulative, and scary.” Johansson achieved this, the Hollywood Reporter said, by using a “vivacious female voice that breaks attractively but also has an inviting deeper register.”

Spike Jonze, who wrote and directed Her, was so intent on finding the right voice for his film’s virtual assistant that he replaced British actor Samantha Morton late in the film’s production. According to Vulture, Jonze realized that Morton’s “maternal, loving, vaguely British, and almost ghostly” voice didn’t fit his film as well as Johansson’s “younger,” “more impassioned” voice, which he said brought “more yearning.”

Late-night shows had fun mocking OpenAI’s demo featuring the Sky voice, which showed the chatbot seemingly flirting with engineers, giggling through responses like “oh, stop it. You’re making me blush.” Whereas The New York Times described these demo interactions as Sky being “deferential and wholly focused on the user,” The Daily Show’s Desi Lydic joked that Sky was “clearly programmed to feed dudes’ egos.”

OpenAI is likely hoping to avoid further controversy as it plans to roll out more voices soon, which its blog post said will “better match the diverse interests and preferences of users.”

OpenAI did not immediately respond to Ars’ request for comment.

Voice actors versus AI

The OpenAI controversy arrives at a moment when many are questioning AI’s impact on creative communities, triggering early lawsuits from artists and book authors. Just this month, Sony opted all of its artists out of AI training to stop voice clones from ripping off top talents like Adele and Beyoncé.

Voice actors, too, have been monitoring increasingly sophisticated AI voice generators, waiting to see what threat AI might pose to future work opportunities. Recently, two actors sued an AI start-up called Lovo that they claimed “illegally used recordings of their voices to create technology that can compete with their voice work,” The New York Times reported. The lawsuit alleges that Lovo used the actors’ actual voice clips to clone their voices.

“We don’t know how many other people have been affected,” the actors’ lawyer, Steve Cohen, told The Times.

OpenAI’s blog post said that, rather than replacing voice actors, the company is striving to support the voice industry while creating chatbots that will laugh at your jokes or mimic your mood. On top of paying voice actors “compensation above top-of-market rates,” OpenAI said it “worked with industry-leading casting and directing professionals to narrow down over 400 submissions” to the five voice options in the initial roll-out of audio-video features.

In hiring voice actors, OpenAI aimed to find talent “from diverse backgrounds or who could speak multiple languages,” casting actors whose voices feel “timeless” and “inspire trust.” To OpenAI, that meant finding actors who have a “warm, engaging, confidence-inspiring, charismatic voice with rich tone” that sounds “natural and easy to listen to.”

For ChatGPT-4o’s first five voice actors, the gig lasted about five months before leading to more work, OpenAI said.

“We are continuing to collaborate with the actors, who have contributed additional work for audio research and new voice capabilities in GPT-4o,” OpenAI said.

Arguably, though, these actors are helping to train AI tools that could one day replace them. At the same time, the backlash defending Johansson, one of the world’s highest-paid actors, perhaps shows that fans won’t take direct mimicry of any of Hollywood’s biggest stars lightly.

While criticism of the Sky voice seemed widespread, some fans thought that OpenAI overreacted by pausing it.

NYT critic Alissa Wilkinson wrote that it was only “a tad jarring” to hear Sky’s voice because “she sounded a whole lot” like Johansson. And replying to OpenAI’s X post announcing its decision to pull the voice feature for now, a clump of fans protested the AI company’s “bad decision,” with some complaining that Sky was the “best” and “hottest” voice.

At least one fan noted that OpenAI’s decision seemed to hurt the voice actor behind Sky most.

“Super unfair for the Sky voice actress,” a user called Ate-a-Pi wrote. “Just because she sounds like ScarJo, now she can never make money again. Insane.”
