Google Gemini

google-and-meta-update-their-ai-models-amid-the-rise-of-“alphachip”

Google and Meta update their AI models amid the rise of “AlphaChip”

Running the AI News Gauntlet —

News about Gemini updates, Llama 3.2, and Google’s new AI-powered chip designer.

Cyberpunk concept showing a man running along a futuristic path full of monitors.

Enlarge / There’s been a lot of AI news this week, and covering it sometimes feels like running through a hall full of danging CRTs, just like this Getty Images illustration.

It’s been a wildly busy week in AI news thanks to OpenAI, including a controversial blog post from CEO Sam Altman, the wide rollout of Advanced Voice Mode, 5GW data center rumors, major staff shake-ups, and dramatic restructuring plans.

But the rest of the AI world doesn’t march to the same beat, doing its own thing and churning out new AI models and research by the minute. Here’s a roundup of some other notable AI news from the past week.

Google Gemini updates

On Tuesday, Google announced updates to its Gemini model lineup, including the release of two new production-ready models that iterate on past releases: Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002. The company reported improvements in overall quality, with notable gains in math, long context handling, and vision tasks. Google claims a 7 percent increase in performance on the MMLU-Pro benchmark and a 20 percent improvement in math-related tasks. But as you know, if you’ve been reading Ars Technica for a while, AI typically benchmarks aren’t as useful as we would like them to be.

Along with model upgrades, Google introduced substantial price reductions for Gemini 1.5 Pro, cutting input token costs by 64 percent and output token costs by 52 percent for prompts under 128,000 tokens. As AI researcher Simon Willison noted on his blog, “For comparison, GPT-4o is currently $5/[million tokens] input and $15/m output and Claude 3.5 Sonnet is $3/m input and $15/m output. Gemini 1.5 Pro was already the cheapest of the frontier models and now it’s even cheaper.”

Google also increased rate limits, with Gemini 1.5 Flash now supporting 2,000 requests per minute and Gemini 1.5 Pro handling 1,000 requests per minute. Google reports that the latest models offer twice the output speed and three times lower latency compared to previous versions. These changes may make it easier and more cost-effective for developers to build applications with Gemini than before.

Meta launches Llama 3.2

On Wednesday, Meta announced the release of Llama 3.2, a significant update to its open-weights AI model lineup that we have covered extensively in the past. The new release includes vision-capable large language models (LLMs) in 11 billion and 90B parameter sizes, as well as lightweight text-only models of 1B and 3B parameters designed for edge and mobile devices. Meta claims the vision models are competitive with leading closed-source models on image recognition and visual understanding tasks, while the smaller models reportedly outperform similar-sized competitors on various text-based tasks.

Willison did some experiments with some of the smaller 3.2 models and reported impressive results for the models’ size. AI researcher Ethan Mollick showed off running Llama 3.2 on his iPhone using an app called PocketPal.

Meta also introduced the first official “Llama Stack” distributions, created to simplify development and deployment across different environments. As with previous releases, Meta is making the models available for free download, with license restrictions. The new models support long context windows of up to 128,000 tokens.

Google’s AlphaChip AI speeds up chip design

On Thursday, Google DeepMind announced what appears to be a significant advancement in AI-driven electronic chip design, AlphaChip. It began as a research project in 2020 and is now a reinforcement learning method for designing chip layouts. Google has reportedly used AlphaChip to create “superhuman chip layouts” in the last three generations of its Tensor Processing Units (TPUs), which are chips similar to GPUs designed to accelerate AI operations. Google claims AlphaChip can generate high-quality chip layouts in hours, compared to weeks or months of human effort. (Reportedly, Nvidia has also been using AI to help design its chips.)

Notably, Google also released a pre-trained checkpoint of AlphaChip on GitHub, sharing the model weights with the public. The company reported that AlphaChip’s impact has already extended beyond Google, with chip design companies like MediaTek adopting and building on the technology for their chips. According to Google, AlphaChip has sparked a new line of research in AI for chip design, potentially optimizing every stage of the chip design cycle from computer architecture to manufacturing.

That wasn’t everything that happened, but those are some major highlights. With the AI industry showing no signs of slowing down at the moment, we’ll see how next week goes.

Google and Meta update their AI models amid the rise of “AlphaChip” Read More »

google-rolls-out-voice-powered-ai-chat-to-the-android-masses

Google rolls out voice-powered AI chat to the Android masses

Chitchat Wars —

Gemini Live allows back-and-forth conversation, now free to all Android users.

The Google Gemini logo.

Enlarge / The Google Gemini logo.

Google

On Thursday, Google made Gemini Live, its voice-based AI chatbot feature, available for free to all Android users. The feature allows users to interact with Gemini through voice commands on their Android devices. That’s notable because competitor OpenAI’s Advanced Voice Mode feature of ChatGPT, which is similar to Gemini Live, has not yet fully shipped.

Google unveiled Gemini Live during its Pixel 9 launch event last month. Initially, the feature was exclusive to Gemini Advanced subscribers, but now it’s accessible to anyone using the Gemini app or its overlay on Android.

Gemini Live enables users to ask questions aloud and even interrupt the AI’s responses mid-sentence. Users can choose from several voice options for Gemini’s responses, adding a level of customization to the interaction.

Gemini suggests the following uses of the voice mode in its official help documents:

Talk back and forth: Talk to Gemini without typing, and Gemini will respond back verbally.

Brainstorm ideas out loud: Ask for a gift idea, to plan an event, or to make a business plan.

Explore: Uncover more details about topics that interest you.

Practice aloud: Rehearse for important moments in a more natural and conversational way.

Interestingly, while OpenAI originally demoed its Advanced Voice Mode in May with the launch of GPT-4o, it has only shipped the feature to a limited number of users starting in late July. Some AI experts speculate that a wider rollout has been hampered by a lack of available computer power since the voice feature is presumably very compute-intensive.

To access Gemini Live, users can reportedly tap a new waveform icon in the bottom-right corner of the app or overlay. This action activates the microphone, allowing users to pose questions verbally. The interface includes options to “hold” Gemini’s answer or “end” the conversation, giving users control over the flow of the interaction.

Currently, Gemini Live supports only English, but Google has announced plans to expand language support in the future. The company also intends to bring the feature to iOS devices, though no specific timeline has been provided for this expansion.

Google rolls out voice-powered AI chat to the Android masses Read More »

google-pulls-its-terrible-pro-ai-“dear-sydney”-ad-after-backlash

Google pulls its terrible pro-AI “Dear Sydney” ad after backlash

Gemini, write me a fan letter! —

Taking the “human” out of “human communication.”

A picture of the Gemini prompt box from the

Enlarge / The Gemini prompt box in the “Dear Sydney” ad.

Google

Have you seen Google’s “Dear Sydney” ad? The one where a young girl wants to write a fan letter to Olympic hurdler Sydney McLaughlin-Levrone? To which the girl’s dad responds that he is “pretty good with words but this has to be just right”? And so, to be just right, he suggests that the daughter get Google’s Gemini AI to write a first draft of the letter?

If you’re watching the Olympics, you have undoubtedly seen it—because the ad has been everywhere. Until today. After a string of negative commentary about the ad’s dystopian implications, Google has pulled the “Dear Sydney” ad from TV. In a statement to The Hollywood Reporter, the company said, “While the ad tested well before airing, given the feedback, we have decided to phase the ad out of our Olympics rotation.”

The backlash was similar to that against Apple’s recent ad in which an enormous hydraulic press crushed TVs, musical instruments, record players, paint cans, sculptures, and even emoji into… the newest model of the iPad. Apple apparently wanted to show just how much creative and entertainment potential the iPad held; critics read the ad as a warning image about the destruction of human creativity in a technological age. Apple apologized soon after.

Now Google has stepped on the same land mine. Not only is AI coming for human creativity, the “Dear Sydney” ad suggests—but it won’t even leave space for the charming imperfections of a child’s fan letter to an athlete. Instead, AI will provide the template, just as it will likely provide the template for the athlete’s response, leading to a nightmare scenario in which huge swathes of human communication have the “human” part stripped right out.

“Very bad”

The generally hostile tone of the commentary to the new ad was captured by Alexandra Petri’s Washington Post column on the ad, which Petri labeled “very bad.”

This ad makes me want to throw a sledgehammer into the television every time I see it. Given the choice between watching this ad and watching the ad about how I need to be giving money NOW to make certain that dogs do not perish in the snow, I would have to think long and hard. It’s one of those ads that makes you think, perhaps evolution was a mistake and our ancestor should never have left the sea. This could be slight hyperbole but only slight!

If you haven’t seen this ad, you are leading a blessed existence and I wish to trade places with you.

A TechCrunch piece said that it was “hard to think of anything that communicates heartfelt inspiration less than instructing an AI to tell someone how inspiring they are.”

Shelly Palmer, a Syracuse University professor and marketing consultant, wrote that the ad’s basic mistake was overestimating “AI’s ability to understand and convey the nuances of human emotions and thoughts.” Palmer would rather have a “heartfelt message over a grammatically correct, AI-generated message any day,” he said. He then added:

I received just such a heartfelt message from a reader years ago. It was a single line email about a blog post I had just written: “Shelly, you’re to [sic] stupid to own a smart phone.” I love this painfully ironic email so much, I have it framed on the wall in my office. It was honest, direct, and probably accurate.

But his conclusion was far more serious. “I flatly reject the future that Google is advertising,” Palmer wrote. “I want to live in a culturally diverse world where billions of individuals use AI to amplify their human skills, not in a world where we are used by AI pretending to be human.”

Things got saltier from there. NPR host Linda Holmes wrote on social media:

This commercial showing somebody having a child use AI to write a fan letter to her hero SUCKS. Obviously there are special circumstances and people who need help, but as a general “look how cool, she didn’t even have to write anything herself!” story, it SUCKS. Who wants an AI-written fan letter?? I promise you, if they’re able, the words your kid can put together will be more meaningful than anything a prompt can spit out. And finally: A fan letter is a great way for a kid to learn to write! If you encourage kids to run to AI to spit out words because their writing isn’t great yet, how are they supposed to learn? Sit down with your kid and write the letter with them! I’m just so grossed out by the entire thing.

The Atlantic was more succinct with its headline: “Google Wins the Gold Medal for Worst Olympic Ad.”

All of this largely tracks with our own take on the ad, which Ars Technica’s Kyle Orland called a “grim” vision of the future. “I want AI-powered tools to automate the most boring, mundane tasks in my life, giving me more time to spend on creative, life-affirming moments with my family,” he wrote. “Google’s ad seems to imply that these life-affirming moments are also something to be avoided—or at least made pleasingly more efficient—through the use of AI.”

Getting people excited about their own obsolescence and addiction is a tough sell, so I don’t envy the marketers who have to hawk Big Tech’s biggest products in a climate of suspicion and hostility toward everything from AI to screen time to social media to data collection. I’m sure the marketers will find a way—but clearly “Dear Sydney” isn’t it.

Google pulls its terrible pro-AI “Dear Sydney” ad after backlash Read More »

outsourcing-emotion:-the-horror-of-google’s-“dear-sydney”-ai-ad

Outsourcing emotion: The horror of Google’s “Dear Sydney” AI ad

Here's an idea: Don't be a deadbeat and do it yourself!

Enlarge / Here’s an idea: Don’t be a deadbeat and do it yourself!

If you’ve watched any Olympics coverage this week, you’ve likely been confronted with an ad for Google’s Gemini AI called “Dear Sydney.” In it, a proud father seeks help writing a letter on behalf of his daughter, who is an aspiring runner and superfan of world-record-holding hurdler Sydney McLaughlin-Levrone.

“I’m pretty good with words, but this has to be just right,” the father intones before asking Gemini to “Help my daughter write a letter telling Sydney how inspiring she is…” Gemini dutifully responds with a draft letter in which the LLM tells the runner, on behalf of the daughter, that she wants to be “just like you.”

Every time I see this ad, it puts me on edge in a way I’ve had trouble putting into words (though Gemini itself has some helpful thoughts). As someone who writes words for a living, the idea of outsourcing a writing task to a machine brings up some vocational anxiety. And the idea of someone who’s “pretty good with words” doubting his abilities when the writing “has to be just right” sets off alarm bells regarding the superhuman framing of AI capabilities.

But I think the most offensive thing about the ad is what it implies about the kinds of human tasks Google sees AI replacing. Rather than using LLMs to automate tedious busywork or difficult research questions, “Dear Sydney” presents a world where Gemini can help us offload a heartwarming shared moment of connection with our children.

The “Dear Sydney” ad.

It’s a distressing answer to what’s still an incredibly common question in the AI space: What do you actually use these things for?

Yes, I can help

Marketers have a difficult task when selling the public on their shiny new AI tools. An effective ad for an LLM has to make it seem like a superhuman do-anything machine but also an approachable, friendly helper. An LLM has to be shown as good enough to reliably do things you can’t (or don’t want to) do yourself, but not so good that it will totally replace you.

Microsoft’s 2024 Super Bowl ad for Copilot is a good example of an attempt to thread this needle, featuring a handful of examples of people struggling to follow their dreams in the face of unseen doubters. “Can you help me?” those dreamers ask Copilot with various prompts. “Yes, I can help” is the message Microsoft delivers back, whether through storyboard images, an impromptu organic chemistry quiz, or “code for a 3D open world game.”

Microsoft’s Copilot marketing sells it as a helper for achieving your dreams.

The “Dear Sydney” ad tries to fit itself into this same box, technically. The prompt in the ad starts with “Help my daughter…” and the tagline at the end offers “A little help from Gemini.” If you look closely near the end, you’ll also see Gemini’s response starts with “Here’s a draft to get you started.” And to be clear, there’s nothing inherently wrong with using an LLM as a writing assistant in this way, especially if you have a disability or are writing in a non-native language.

But the subtle shift from Microsoft’s “Help me” to Google’s “Help my daughter” changes the tone of things. Inserting Gemini into a child’s heartfelt request for parental help makes it seem like the parent in question is offloading their responsibilities to a computer in the coldest, most sterile way possible. More than that, it comes across as an attempt to avoid an opportunity to bond with a child over a shared interest in a creative way.

It’s one thing to use AI to help you with the most tedious parts of your job, as people do in recent ads for Salesforce’s Einstein AI. It’s another to tell your daughter to go ask the computer for help pouring their heart out to their idol.

Outsourcing emotion: The horror of Google’s “Dear Sydney” AI ad Read More »

google,-its-cat-fully-escaped-from-bag,-shows-off-the-pixel-9-pro-weeks-early

Google, its cat fully escaped from bag, shows off the Pixel 9 Pro weeks early

Google Pixel 9 Series —

Upcoming phone is teased with an AI breakup letter to “the same old thing.”

Top part of rear of Pixel 9 Pro, with

Enlarge / You can have confirmation of one of our upcoming four phones, but you have to hear us talk about AI again. Deal?

Google

After every one of its house-brand phones, and even its new wall charger, have been meticulously photographed, sized, and rated for battery capacity, what should Google do to keep the anticipation up for the Pixel 9 series’ August 13 debut?

Lean into it, it seems, and Google is doing so with an eye toward further promoting its Gemini-based AI aims. In a video post on X (formerly Twitter), Google describes a “phone built for the Gemini era,” one that can, through the power of Gemini, “even let your old phone down easy” with a breakup letter. The camera pans out, and the shape of the Pixel 9 Pro appears and turns around to show off the now-standard Pixel camera bar across the upper back.

There’s also a disclaimer to this tongue-in-cheek request for a send-off to a phone that is “just the same old thing”: “Screen simulated. Limitations apply. Check responses for accuracy.”

Over at the Google Store, you can see a static image of the Pixel 9 Pro and sign up for alerts about its availability. The image confirms that the photos taken by Taiwanese regulatory authority NCC were legitimate, right down to the coloring on the back of the Pixel 9 Pro and the camera and flash placement.

Those NCC photos confirmed that Google intends to launch four different phone-ish devices at its August 13 “Made by Google” event. The Pixel 9 and Pixel 9 Pro are both roughly 6.1-inch devices, but the Pro will likely offer more robust Gemini AI integration due to increased RAM and other spec bumps. The Pixel 9 Pro XL should have similarly AI-ready specs, just in a larger size. And the Pixel 9 Pro Fold is an iteration on Google’s first Pixel Fold model, with seemingly taller dimensions and a daringly smaller battery.

Google, its cat fully escaped from bag, shows off the Pixel 9 Pro weeks early Read More »

anthropic-introduces-claude-3.5-sonnet,-matching-gpt-4o-on-benchmarks

Anthropic introduces Claude 3.5 Sonnet, matching GPT-4o on benchmarks

The Anthropic Claude 3 logo, jazzed up by Benj Edwards.

Anthropic / Benj Edwards

On Thursday, Anthropic announced Claude 3.5 Sonnet, its latest AI language model and the first in a new series of “3.5” models that build upon Claude 3, launched in March. Claude 3.5 can compose text, analyze data, and write code. It features a 200,000 token context window and is available now on the Claude website and through an API. Anthropic also introduced Artifacts, a new feature in the Claude interface that shows related work documents in a dedicated window.

So far, people outside of Anthropic seem impressed. “This model is really, really good,” wrote independent AI researcher Simon Willison on X. “I think this is the new best overall model (and both faster and half the price of Opus, similar to the GPT-4 Turbo to GPT-4o jump).”

As we’ve written before, benchmarks for large language models (LLMs) are troublesome because they can be cherry-picked and often do not capture the feel and nuance of using a machine to generate outputs on almost any conceivable topic. But according to Anthropic, Claude 3.5 Sonnet matches or outperforms competitor models like GPT-4o and Gemini 1.5 Pro on certain benchmarks like MMLU (undergraduate level knowledge), GSM8K (grade school math), and HumanEval (coding).

Claude 3.5 Sonnet benchmarks provided by Anthropic.

Enlarge / Claude 3.5 Sonnet benchmarks provided by Anthropic.

If all that makes your eyes glaze over, that’s OK; it’s meaningful to researchers but mostly marketing to everyone else. A more useful performance metric comes from what we might call “vibemarks” (coined here first!) which are subjective, non-rigorous aggregate feelings measured by competitive usage on sites like LMSYS’s Chatbot Arena. The Claude 3.5 Sonnet model is currently under evaluation there, and it’s too soon to say how well it will fare.

Claude 3.5 Sonnet also outperforms Anthropic’s previous-best model (Claude 3 Opus) on benchmarks measuring “reasoning,” math skills, general knowledge, and coding abilities. For example, the model demonstrated strong performance in an internal coding evaluation, solving 64 percent of problems compared to 38 percent for Claude 3 Opus.

Claude 3.5 Sonnet is also a multimodal AI model that accepts visual input in the form of images, and the new model is reportedly excellent at a battery of visual comprehension tests.

Claude 3.5 Sonnet benchmarks provided by Anthropic.

Enlarge / Claude 3.5 Sonnet benchmarks provided by Anthropic.

Roughly speaking, the visual benchmarks mean that 3.5 Sonnet is better at pulling information from images than previous models. For example, you can show it a picture of a rabbit wearing a football helmet, and the model knows it’s a rabbit wearing a football helmet and can talk about it. That’s fun for tech demos, but the tech is still not accurate enough for applications of the tech where reliability is mission critical.

Anthropic introduces Claude 3.5 Sonnet, matching GPT-4o on benchmarks Read More »

google’s-“ai-overview”-can-give-false,-misleading,-and-dangerous-answers

Google’s “AI Overview” can give false, misleading, and dangerous answers

This is fine.

Enlarge / This is fine.

Getty Images

If you use Google regularly, you may have noticed the company’s new AI Overviews providing summarized answers to some of your questions in recent days. If you use social media regularly, you may have come across many examples of those AI Overviews being hilariously or even dangerously wrong.

Factual errors can pop up in existing LLM chatbots as well, of course. But the potential damage that can be caused by AI inaccuracy gets multiplied when those errors appear atop the ultra-valuable web real estate of the Google search results page.

“The examples we’ve seen are generally very uncommon queries and aren’t representative of most people’s experiences,” a Google spokesperson told Ars. “The vast majority of AI Overviews provide high quality information, with links to dig deeper on the web.”

After looking through dozens of examples of Google AI Overview mistakes (and replicating many ourselves for the galleries below), we’ve noticed a few broad categories of errors that seemed to show up again and again. Consider this a crash course in some of the current weak points of Google’s AI Overviews and a look at areas of concern for the company to improve as the system continues to roll out.

Treating jokes as facts

  • The bit about using glue on pizza can be traced back to an 11-year-old troll post on Reddit. (via)

    Kyle Orland / Google

  • This wasn’t funny when the guys at Pep Boys said it, either. (via)

    Kyle Orland / Google

  • Weird Al recommends “running with scissors” as well! (via)

    Kyle Orland / Google

Some of the funniest example of Google’s AI Overview failing come, ironically enough, when the system doesn’t realize a source online was trying to be funny. An AI answer that suggested using “1/8 cup of non-toxic glue” to stop cheese from sliding off pizza can be traced back to someone who was obviously trying to troll an ongoing thread. A response recommending “blinker fluid” for a turn signal that doesn’t make noise can similarly be traced back to a troll on the Good Sam advice forums, which Google’s AI Overview apparently trusts as a reliable source.

In regular Google searches, these jokey posts from random Internet users probably wouldn’t be among the first answers someone saw when clicking through a list of web links. But with AI Overviews, those trolls were integrated into the authoritative-sounding data summary presented right at the top of the results page.

What’s more, there’s nothing in the tiny “source link” boxes below Google’s AI summary to suggest either of these forum trolls are anything other than good sources of information. Sometimes, though, glancing at the source can save you some grief, such as when you see a response calling running with scissors “cardio exercise that some say is effective” (that came from a 2022 post from Little Old Lady Comedy).

Bad sourcing

  • Washington University in St. Louis says this ratio is accurate, but others disagree. (via)

    Kyle Orland / Google

  • Man, we wish this fantasy remake was real. (via)

    Kyle Orland / Google

Sometimes Google’s AI Overview offers an accurate summary of a non-joke source that happens to be wrong. When asking about how many Declaration of Independence signers owned slaves, for instance, Google’s AI Overview accurately summarizes a Washington University of St. Louis library page saying that one-third “were personally enslavers.” But the response ignores contradictory sources like a Chicago Sun-Times article saying the real answer is closer to three-quarters. I’m not enough of a history expert to judge which authoritative-seeming source is right, but at least one historian online took issue with the Google AI’s answer sourcing.

Other times, a source that Google trusts as authoritative is really just fan fiction. That’s the case for a response that imagined a 2022 remake of 2001: A Space Odyssey, directed by Steven Spielberg and produced by George Lucas. A savvy web user would probably do a double-take before citing citing Fandom’s “Idea Wiki” as a reliable source, but a careless AI Overview user might not notice where the AI got its information.

Google’s “AI Overview” can give false, misleading, and dangerous answers Read More »

llms-keep-leaping-with-llama-3,-meta’s-newest-open-weights-ai-model

LLMs keep leaping with Llama 3, Meta’s newest open-weights AI model

computer-powered word generator —

Zuckerberg says new AI model “was still learning” when Meta stopped training.

A group of pink llamas on a pixelated background.

On Thursday, Meta unveiled early versions of its Llama 3 open-weights AI model that can be used to power text composition, code generation, or chatbots. It also announced that its Meta AI Assistant is now available on a website and is going to be integrated into its major social media apps, intensifying the company’s efforts to position its products against other AI assistants like OpenAI’s ChatGPT, Microsoft’s Copilot, and Google’s Gemini.

Like its predecessor, Llama 2, Llama 3 is notable for being a freely available, open-weights large language model (LLM) provided by a major AI company. Llama 3 technically does not quality as “open source” because that term has a specific meaning in software (as we have mentioned in other coverage), and the industry has not yet settled on terminology for AI model releases that ship either code or weights with restrictions (you can read Llama 3’s license here) or that ship without providing training data. We typically call these releases “open weights” instead.

At the moment, Llama 3 is available in two parameter sizes: 8 billion (8B) and 70 billion (70B), both of which are available as free downloads through Meta’s website with a sign-up. Llama 3 comes in two versions: pre-trained (basically the raw, next-token-prediction model) and instruction-tuned (fine-tuned to follow user instructions). Each has a 8,192 token context limit.

A screenshot of the Meta AI Assistant website on April 18, 2024.

Enlarge / A screenshot of the Meta AI Assistant website on April 18, 2024.

Benj Edwards

Meta trained both models on two custom-built, 24,000-GPU clusters. In a podcast interview with Dwarkesh Patel, Meta CEO Mark Zuckerberg said that the company trained the 70B model with around 15 trillion tokens of data. Throughout the process, the model never reached “saturation” (that is, it never hit a wall in terms of capability increases). Eventually, Meta pulled the plug and moved on to training other models.

“I guess our prediction going in was that it was going to asymptote more, but even by the end it was still leaning. We probably could have fed it more tokens, and it would have gotten somewhat better,” Zuckerberg said on the podcast.

Meta also announced that it is currently training a 400B parameter version of Llama 3, which some experts like Nvidia’s Jim Fan think may perform in the same league as GPT-4 Turbo, Claude 3 Opus, and Gemini Ultra on benchmarks like MMLU, GPQA, HumanEval, and MATH.

Speaking of benchmarks, we have devoted many words in the past to explaining how frustratingly imprecise benchmarks can be when applied to large language models due to issues like training contamination (that is, including benchmark test questions in the training dataset), cherry-picking on the part of vendors, and an inability to capture AI’s general usefulness in an interactive session with chat-tuned models.

But, as expected, Meta provided some benchmarks for Llama 3 that list results from MMLU (undergraduate level knowledge), GSM-8K (grade-school math), HumanEval (coding), GPQA (graduate-level questions), and MATH (math word problems). These show the 8B model performing well compared to open-weights models like Google’s Gemma 7B and Mistral 7B Instruct, and the 70B model also held its own against Gemini Pro 1.5 and Claude 3 Sonnet.

A chart of instruction-tuned Llama 3 8B and 70B benchmarks provided by Meta.

Enlarge / A chart of instruction-tuned Llama 3 8B and 70B benchmarks provided by Meta.

Meta says that the Llama 3 model has been enhanced with capabilities to understand coding (like Llama 2) and, for the first time, has been trained with both images and text—though it currently outputs only text. According to Reuters, Meta Chief Product Officer Chris Cox noted in an interview that more complex processing abilities (like executing multi-step plans) are expected in future updates to Llama 3, which will also support multimodal outputs—that is, both text and images.

Meta plans to host the Llama 3 models on a range of cloud platforms, making them accessible through AWS, Databricks, Google Cloud, and other major providers.

Also on Thursday, Meta announced that Llama 3 will become the new basis of the Meta AI virtual assistant, which the company first announced in September. The assistant will appear prominently in search features for Facebook, Instagram, WhatsApp, Messenger, and the aforementioned dedicated website that features a design similar to ChatGPT, including the ability to generate images in the same interface. The company also announced a partnership with Google to integrate real-time search results into the Meta AI assistant, adding to an existing partnership with Microsoft’s Bing.

LLMs keep leaping with Llama 3, Meta’s newest open-weights AI model Read More »

words-are-flowing-out-like-endless-rain:-recapping-a-busy-week-of-llm-news

Words are flowing out like endless rain: Recapping a busy week of LLM news

many things frequently —

Gemini 1.5 Pro launch, new version of GPT-4 Turbo, new Mistral model, and more.

An image of a boy amazed by flying letters.

Enlarge / An image of a boy amazed by flying letters.

Some weeks in AI news are eerily quiet, but during others, getting a grip on the week’s events feels like trying to hold back the tide. This week has seen three notable large language model (LLM) releases: Google Gemini Pro 1.5 hit general availability with a free tier, OpenAI shipped a new version of GPT-4 Turbo, and Mistral released a new openly licensed LLM, Mixtral 8x22B. All three of those launches happened within 24 hours starting on Tuesday.

With the help of software engineer and independent AI researcher Simon Willison (who also wrote about this week’s hectic LLM launches on his own blog), we’ll briefly cover each of the three major events in roughly chronological order, then dig into some additional AI happenings this week.

Gemini Pro 1.5 general release

On Tuesday morning Pacific time, Google announced that its Gemini 1.5 Pro model (which we first covered in February) is now available in 180-plus countries, excluding Europe, via the Gemini API in a public preview. This is Google’s most powerful public LLM so far, and it’s available in a free tier that permits up to 50 requests a day.

It supports up to 1 million tokens of input context. As Willison notes in his blog, Gemini 1.5 Pro’s API price at $7/million input tokens and $21/million output tokens costs a little less than GPT-4 Turbo (priced at $10/million in and $30/million out) and more than Claude 3 Sonnet (Anthropic’s mid-tier LLM, priced at $3/million in and $15/million out).

Notably, Gemini 1.5 Pro includes native audio (speech) input processing that allows users to upload audio or video prompts, a new File API for handling files, the ability to add custom system instructions (system prompts) for guiding model responses, and a JSON mode for structured data extraction.

“Majorly Improved” GPT-4 Turbo launch

A GPT-4 Turbo performance chart provided by OpenAI.

Enlarge / A GPT-4 Turbo performance chart provided by OpenAI.

Just a bit later than Google’s 1.5 Pro launch on Tuesday, OpenAI announced that it was rolling out a “majorly improved” version of GPT-4 Turbo (a model family originally launched in November) called “gpt-4-turbo-2024-04-09.” It integrates multimodal GPT-4 Vision processing (recognizing the contents of images) directly into the model, and it initially launched through API access only.

Then on Thursday, OpenAI announced that the new GPT-4 Turbo model had just become available for paid ChatGPT users. OpenAI said that the new model improves “capabilities in writing, math, logical reasoning, and coding” and shared a chart that is not particularly useful in judging capabilities (that they later updated). The company also provided an example of an alleged improvement, saying that when writing with ChatGPT, the AI assistant will use “more direct, less verbose, and use more conversational language.”

The vague nature of OpenAI’s GPT-4 Turbo announcements attracted some confusion and criticism online. On X, Willison wrote, “Who will be the first LLM provider to publish genuinely useful release notes?” In some ways, this is a case of “AI vibes” again, as we discussed in our lament about the poor state of LLM benchmarks during the debut of Claude 3. “I’ve not actually spotted any definite differences in quality [related to GPT-4 Turbo],” Willison told us directly in an interview.

The update also expanded GPT-4’s knowledge cutoff to April 2024, although some people are reporting it achieves this through stealth web searches in the background, and others on social media have reported issues with date-related confabulations.

Mistral’s mysterious Mixtral 8x22B release

An illustration of a robot holding a French flag, figuratively reflecting the rise of AI in France due to Mistral. It's hard to draw a picture of an LLM, so a robot will have to do.

Enlarge / An illustration of a robot holding a French flag, figuratively reflecting the rise of AI in France due to Mistral. It’s hard to draw a picture of an LLM, so a robot will have to do.

Not to be outdone, on Tuesday night, French AI company Mistral launched its latest openly licensed model, Mixtral 8x22B, by tweeting a torrent link devoid of any documentation or commentary, much like it has done with previous releases.

The new mixture-of-experts (MoE) release weighs in with a larger parameter count than its previously most-capable open model, Mixtral 8x7B, which we covered in December. It’s rumored to potentially be as capable as GPT-4 (In what way, you ask? Vibes). But that has yet to be seen.

“The evals are still rolling in, but the biggest open question right now is how well Mixtral 8x22B shapes up,” Willison told Ars. “If it’s in the same quality class as GPT-4 and Claude 3 Opus, then we will finally have an openly licensed model that’s not significantly behind the best proprietary ones.”

This release has Willison most excited, saying, “If that thing really is GPT-4 class, it’s wild, because you can run that on a (very expensive) laptop. I think you need 128GB of MacBook RAM for it, twice what I have.”

The new Mixtral is not listed on Chatbot Arena yet, Willison noted, because Mistral has not released a fine-tuned model for chatting yet. It’s still a raw, predict-the-next token LLM. “There’s at least one community instruction tuned version floating around now though,” says Willison.

Chatbot Arena Leaderboard shake-ups

A Chatbot Arena Leaderboard screenshot taken on April 12, 2024.

Enlarge / A Chatbot Arena Leaderboard screenshot taken on April 12, 2024.

Benj Edwards

This week’s LLM news isn’t limited to just the big names in the field. There have also been rumblings on social media about the rising performance of open source models like Cohere’s Command R+, which reached position 6 on the LMSYS Chatbot Arena Leaderboard—the highest-ever ranking for an open-weights model.

And for even more Chatbot Arena action, apparently the new version of GPT-4 Turbo is proving competitive with Claude 3 Opus. The two are still in a statistical tie, but GPT-4 Turbo recently pulled ahead numerically. (In March, we reported when Claude 3 first numerically pulled ahead of GPT-4 Turbo, which was then the first time another AI model had surpassed a GPT-4 family model member on the leaderboard.)

Regarding this fierce competition among LLMs—of which most of the muggle world is unaware and will likely never be—Willison told Ars, “The past two months have been a whirlwind—we finally have not just one but several models that are competitive with GPT-4.” We’ll see if OpenAI’s rumored release of GPT-5 later this year will restore the company’s technological lead, we note, which once seemed insurmountable. But for now, Willison says, “OpenAI are no longer the undisputed leaders in LLMs.”

Words are flowing out like endless rain: Recapping a busy week of LLM news Read More »

google-might-make-users-pay-for-ai-features-in-search-results

Google might make users pay for AI features in search results

Pay-eye for the AI —

Plan would represent a first for what has been a completely ad-funded search engine.

You think this cute little search robot is going to work for free?

Enlarge / You think this cute little search robot is going to work for free?

Google might start charging for access to search results that use generative artificial intelligence tools. That’s according to a new Financial Times report citing “three people with knowledge of [Google’s] plans.”

Charging for any part of the search engine at the core of its business would be a first for Google, which has funded its search product solely with ads since 2000. But it’s far from the first time Google would charge for AI enhancements in general; the “AI Premium” tier of a Google One subscription costs $10 more per month than a standard “Premium” plan, for instance, while “Gemini Business” adds $20 a month to a standard Google Workspace subscription.

While those paid products offer access to Google’s high-end “Gemini Advanced” AI model, Google also offers free access to its less performant, plain “Gemini” model without any kind of paid subscription.

When ads aren’t enough?

Under the proposed plan, Google’s standard search (without AI) would remain free, and subscribers to a paid AI search tier would still see ads alongside their Gemini-powered search results, according to the FT report. But search ads—which brought in a reported $175 billion for Google last year—might not be enough to fully cover the increased costs involved with AI-powered search. A Reuters report from last year suggested that running a search query through an advanced neural network like Gemini “likely costs 10 times more than a standard keyword search,” potentially representing “several billion dollars of extra costs” across Google’s network.

Cost aside, it remains to be seen if there’s a critical mass of market demand for this kind of AI-enhanced search. Microsoft’s massive investment in generative AI features for its Bing search engine has failed to make much of a dent in Google’s market share over the last year or so. And there has reportedly been limited uptake for Google’s experimental opt-in “Search Generative Experience” (SGE), which adds chatbot responses above the usual set of links in response to a search query.

“SGE never feels like a useful addition to Google Search,” Ars’ Ron Amadeo wrote last month. “Google Search is a tool, and just as a screwdriver is not a hammer, I don’t want a chatbot in a search engine.”

Regardless, the current tech industry mania surrounding anything and everything related to generative AI may make Google feel it has to integrate the technology into some sort of “premium” search product sooner rather than later. For now, FT reports that Google hasn’t made a final decision on whether to implement the paid AI search plan, even as Google engineers work on the backend technology necessary to launch such a service

Google also faces AI-related difficulties on the other side of the search divide. Last month, the company announced it was redoubling its efforts to limit the appearance of “spammy, low-quality content”—much of it generated by AI chatbots—in its search results.

In February, Google shut down the image generation features of its Gemini AI model after the service was found inserting historically inaccurate examples of racial diversity into some of its prompt responses.

Google might make users pay for AI features in search results Read More »

openai-drops-login-requirements-for-chatgpt’s-free-version

OpenAI drops login requirements for ChatGPT’s free version

free as in beer? —

ChatGPT 3.5 still falls far short of GPT-4, and other models surpassed it long ago.

A glowing OpenAI logo on a blue background.

Benj Edwards

On Monday, OpenAI announced that visitors to the ChatGPT website in some regions can now use the AI assistant without signing in. Previously, the company required that users create an account to use it, even with the free version of ChatGPT that is currently powered by the GPT-3.5 AI language model. But as we have noted in the past, GPT-3.5 is widely known to provide more inaccurate information compared to GPT-4 Turbo, available in paid versions of ChatGPT.

Since its launch in November 2022, ChatGPT has transformed over time from a tech demo to a comprehensive AI assistant, and it’s always had a free version available. The cost is free because “you’re the product,” as the old saying goes. Using ChatGPT helps OpenAI gather data that will help the company train future AI models, although free users and ChatGPT Plus subscription members can both opt out of allowing the data they input into ChatGPT to be used for AI training. (OpenAI says it never trains on inputs from ChatGPT Team and Enterprise members at all).

Opening ChatGPT to everyone could provide a frictionless on-ramp for people who might use it as a substitute for Google Search or potentially gain new customers by providing an easy way for people to use ChatGPT quickly, then offering an upsell to paid versions of the service.

“It’s core to our mission to make tools like ChatGPT broadly available so that people can experience the benefits of AI,” OpenAI says on its blog page. “For anyone that has been curious about AI’s potential but didn’t want to go through the steps to set up an account, start using ChatGPT today.”

When you visit the ChatGPT website, you're immediately presented with a chat box like this (in some regions). Screenshot captured April 1, 2024.

Enlarge / When you visit the ChatGPT website, you’re immediately presented with a chat box like this (in some regions). Screenshot captured April 1, 2024.

Benj Edwards

Since kids will also be able to use ChatGPT without an account—despite it being against the terms of service—OpenAI also says it’s introducing “additional content safeguards,” such as blocking more prompts and “generations in a wider range of categories.” What exactly that entails has not been elaborated upon by OpenAI, but we reached out to the company for comment.

There might be a few other downsides to the fully open approach. On X, AI researcher Simon Willison wrote about the potential for automated abuse as a way to get around paying for OpenAI’s services: “I wonder how their scraping prevention works? I imagine the temptation for people to abuse this as a free 3.5 API will be pretty strong.”

With fierce competition, more GPT-3.5 access may backfire

Willison also mentioned a common criticism of OpenAI (as voiced in this case by Wharton professor Ethan Mollick) that people’s ideas about what AI models can do have so far largely been influenced by GPT-3.5, which, as we mentioned, is far less capable and far more prone to making things up than the paid version of ChatGPT that uses GPT-4 Turbo.

“In every group I speak to, from business executives to scientists, including a group of very accomplished people in Silicon Valley last night, much less than 20% of the crowd has even tried a GPT-4 class model,” wrote Mollick in a tweet from early March.

With models like Google Gemini Pro 1.5 and Anthropic Claude 3 potentially surpassing OpenAI’s best proprietary model at the moment —and open weights AI models eclipsing the free version of ChatGPT—allowing people to use GPT-3.5 might not be putting OpenAI’s best foot forward. Microsoft Copilot, powered by OpenAI models, also supports a frictionless, no-login experience, but it allows access to a model based on GPT-4. But Gemini currently requires a sign-in, and Anthropic sends a login code through email.

For now, OpenAI says the login-free version of ChatGPT is not yet available to everyone, but it will be coming soon: “We’re rolling this out gradually, with the aim to make AI accessible to anyone curious about its capabilities.”

OpenAI drops login requirements for ChatGPT’s free version Read More »

apple-may-hire-google-to-power-new-iphone-ai-features-using-gemini—report

Apple may hire Google to power new iPhone AI features using Gemini—report

Bake a cake as fast as you can —

With Apple’s own AI tech lagging behind, the firm looks for a fallback solution.

A Google

Benj Edwards

On Monday, Bloomberg reported that Apple is in talks to license Google’s Gemini model to power AI features like Siri in a future iPhone software update coming later in 2024, according to people familiar with the situation. Apple has also reportedly conducted similar talks with ChatGPT maker OpenAI.

The potential integration of Google Gemini into iOS 18 could bring a range of new cloud-based (off-device) AI-powered features to Apple’s smartphone, including image creation or essay writing based on simple prompts. However, the terms and branding of the agreement have not yet been finalized, and the implementation details remain unclear. The companies are unlikely to announce any deal until Apple’s annual Worldwide Developers Conference in June.

Gemini could also bring new capabilities to Apple’s widely criticized voice assistant, Siri, which trails newer AI assistants powered by large language models (LLMs) in understanding and responding to complex questions. Rumors of Apple’s own internal frustration with Siri—and potential remedies—have been kicking around for some time. In January, 9to5Mac revealed that Apple had been conducting tests with a beta version of iOS 17.4 that used OpenAI’s ChatGPT API to power Siri.

As we have previously reported, Apple has also been developing its own AI models, including a large language model codenamed Ajax and a basic chatbot called Apple GPT. However, the company’s LLM technology is said to lag behind that of its competitors, making a partnership with Google or another AI provider a more attractive option.

Google launched Gemini, a language-based AI assistant similar to ChatGPT, in December and has updated it several times since. Many industry experts consider the larger Gemini models to be roughly as capable as OpenAI’s GPT-4 Turbo, which powers the subscription versions of ChatGPT. Until just recently, with the emergence of Gemini Ultra and Claude 3, OpenAI’s top model held a fairly wide lead in perceived LLM capability.

The potential partnership between Apple and Google could significantly impact the AI industry, as Apple’s platform represents more than 2 billion active devices worldwide. If the agreement gets finalized, it would build upon the existing search partnership between the two companies, which has seen Google pay Apple billions of dollars annually to make its search engine the default option on iPhones and other Apple devices.

However, Bloomberg reports that the potential partnership between Apple and Google is likely to draw scrutiny from regulators, as the companies’ current search deal is already the subject of a lawsuit by the US Department of Justice. The European Union is also pressuring Apple to make it easier for consumers to change their default search engine away from Google.

With so much potential money on the line, selecting Google for Apple’s cloud AI job could potentially be a major loss for OpenAI in terms of bringing its technology widely into the mainstream—with a market representing billions of users. Even so, any deal with Google or OpenAI may be a temporary fix until Apple can get its own LLM-based AI technology up to speed.

Apple may hire Google to power new iPhone AI features using Gemini—report Read More »