Google Gemini


Samsung’s Galaxy S25 event was an AI presentation with occasional phone hardware

Samsung announced the Galaxy S25, S25+, and S25 Ultra at its Unpacked event today. What is different from last year’s models? With the phones themselves, not much, other than a new chipset and an upgraded ultra-wide camera. But pure AI optimism? Samsung managed to pack a whole lot more of that into its launch event and promotional materials.

The corners on the S25 Ultra are a bit more rounded, the edges are flatter, and the bezels seem to be slightly thinner. The S25 and S25+ models have the same screen size as the S24 models, at 6.2 and 6.7 inches, respectively, while the Ultra notches up slightly from 6.8 to 6.9 inches.

Samsung’s S25 Ultra, in titanium builds colored silver blue, black, gray, and white silver. Credit: Samsung

The S25 Ultra, starting at $1,300, touts a Snapdragon 8 Elite processor, a new 50-megapixel ultra-wide lens, and what Samsung claims is improved detail in software-derived zoom images. It comes with the S Pen, a vestige of the departed Note line, but as The Verge notes, the stylus no longer includes Bluetooth, so you can’t pull off hand gestures with the pen off the screen or use it as a quirky remote camera trigger.

Samsung’s S25 Plus phones, in silver blue, navy, and icy blue. Credit: Samsung

It’s much the same with the S25 and S25 Plus, starting at $800. The base models got an upgrade to a default of 12GB of RAM. The displays, cameras, and general shape and build are the same. All the Galaxy devices released in 2025 support Qi2 wireless charging—but without magnets built into the phones themselves. You’ll need a “Qi2 Ready” magnetic case to get a sturdy attachment and the 15 W top charging speed.

One thing that hasn’t changed, thankfully, is Samsung’s recent bump in longevity. Each Galaxy S25 model gets seven years of security updates and seven years of OS upgrades, which matches Google’s Pixel line.

Side view of the Galaxy S25 Edge, which is looking rather thin. Credit: Samsung

At the very end of Samsung’s event, for less than 30 seconds, a “Galaxy S25 Edge” was teased. In a mostly black field with some shiny metal components, Samsung seemed to be showing off the notably slimmer variant of the S25 that had been rumored. The same kinds of leaks about an “iPhone Air” have been circulating. No details were provided beyond its name and a brief video suggesting its svelte nature.


The AI war between Google and OpenAI has never been more heated

Over the past month, we’ve seen a rapid cadence of notable AI-related announcements and releases from both Google and OpenAI, and it’s been making the AI community’s collective head spin. It has also poured fuel on the fire of the OpenAI-Google rivalry, an accelerating game of one-upmanship taking place unusually close to the Christmas holiday.

“How are people surviving with the firehose of AI updates that are coming out,” wrote one user on X last Friday, which is still a hotbed of AI-related conversation. “in the last <24 hours we got gemini flash 2.0 and chatGPT with screenshare, deep research, pika 2, sora, chatGPT projects, anthropic clio, wtf it never ends.”

Rumors travel quickly in the AI world, and people in the AI industry had been expecting OpenAI to ship some major products in December. Once OpenAI announced “12 days of OpenAI” earlier this month, Google jumped into gear and seemingly decided to try to one-up its rival on several counts. So far, the strategy appears to be working, but it comes at the cost of the rest of the world’s ability to absorb the implications of each new release.

“12 Days of OpenAI has turned into like 50 new @GoogleAI releases,” wrote another X user on Monday. “This past week, OpenAI & Google have been releasing at the speed of a new born startup,” wrote a third X user on Tuesday. “Even their own users can’t keep up. Crazy time we’re living in.”

“Somebody told Google that they could just do things,” wrote a16z partner and AI influencer Justine Moore on X, referring to a common motivational meme telling people they “can just do stuff.”

The Google AI rush

OpenAI’s “12 Days of OpenAI” campaign has included releases of its full o1 model, an upgrade from o1-preview, alongside o1-pro for advanced “reasoning” tasks. The company also publicly launched Sora for video generation, added Projects functionality to ChatGPT, introduced Advanced Voice features with video streaming capabilities, and more.


Not to be outdone by OpenAI, Google releases its own “reasoning” AI model

Google DeepMind’s chief scientist, Jeff Dean, says that the model is given extra computing power at inference time, writing on X, “we see promising results when we increase inference time computation!” The model works by pausing to consider multiple related prompts before providing what it determines to be the most accurate answer.
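Google hasn’t said exactly how that pause-and-consider step is implemented, but a toy sketch of the broader inference-time-compute idea (plain self-consistency voting over several sampled answers) illustrates why spending extra computation at answer time can improve accuracy. This is a generic illustration, not Google’s actual method; the sample_answer() stub stands in for a real model call.

```python
# A toy sketch of one inference-time-compute idea (self-consistency voting):
# sample several candidate answers and return the most common one. Not
# Google's actual Gemini "reasoning" method; sample_answer() is a stub.
import random
from collections import Counter

def sample_answer(prompt: str) -> str:
    # Stand-in for a stochastic LLM call; a real system would query a model here.
    return random.choice(["42", "42", "41"])  # toy distribution of answers

def answer_with_voting(prompt: str, n_samples: int = 8) -> str:
    # Spend more compute by sampling n_samples times, then take a majority vote.
    votes = Counter(sample_answer(prompt) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(answer_with_voting("What is 6 * 7?"))
```

Real reasoning models do considerably more than vote over samples, but the principle of trading extra inference compute for accuracy is the same one Dean is pointing at.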

Since OpenAI’s jump into the “reasoning” field in September with o1-preview and o1-mini, several companies have been rushing to achieve feature parity with their own models. For example, DeepSeek launched DeepSeek-R1 in early November, while Alibaba’s Qwen team released its own “reasoning” model, QwQ, earlier this month.

Some claim that reasoning models can help solve complex mathematical or academic problems, but these models might not be for everybody. While they perform well on some benchmarks, questions remain about their actual usefulness and accuracy. Also, the high computing costs needed to run reasoning models have created some rumblings about their long-term viability. That high cost is why OpenAI’s ChatGPT Pro costs $200 a month, for example.

Still, it appears Google is serious about pursuing this particular AI technique. Logan Kilpatrick, a Google employee who works on AI Studio, called it “the first step in our reasoning journey” in a post on X.


Google goes “agentic” with Gemini 2.0’s ambitious AI agent features

On Wednesday, Google unveiled Gemini 2.0, the next generation of its AI-model family, starting with an experimental release called Gemini 2.0 Flash. The model family can generate text, images, and speech while processing multiple types of input including text, images, audio, and video. It’s similar to multimodal AI models like GPT-4o, which powers OpenAI’s ChatGPT.

“Gemini 2.0 Flash builds on the success of 1.5 Flash, our most popular model yet for developers, with enhanced performance at similarly fast response times,” said Google in a statement. “Notably, 2.0 Flash even outperforms 1.5 Pro on key benchmarks, at twice the speed.”

Gemini 2.0 Flash—which is the smallest model of the 2.0 family in terms of parameter count—launches today through Google’s developer platforms like Gemini API, AI Studio, and Vertex AI. However, its image generation and text-to-speech features remain limited to early access partners until January 2025. Google plans to integrate the tech into products like Android Studio, Chrome DevTools, and Firebase.
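For developers who want to try it, calling the new model through the Gemini API looks roughly like the sketch below, using Google’s google-generativeai Python SDK. The experimental model ID string is an assumption here, and you’d supply your own AI Studio API key.

```python
# A minimal sketch of calling Gemini 2.0 Flash via the Gemini API with the
# google-generativeai Python SDK. The model ID is assumed from the experimental
# release naming; replace the placeholder key with a real AI Studio API key.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder, not a real key
model = genai.GenerativeModel("gemini-2.0-flash-exp")  # assumed experimental model ID
response = model.generate_content("In two sentences, what does 'agentic' mean for an AI model?")
print(response.text)
```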

The company addressed potential misuse of generated content by implementing SynthID watermarking technology on all audio and images created by Gemini 2.0 Flash. This watermark appears in supported Google products to identify AI-generated content.

Google’s newest announcements lean heavily into the concept of agentic AI systems that can take action for you. “Over the last year, we have been investing in developing more agentic models, meaning they can understand more about the world around you, think multiple steps ahead, and take action on your behalf, with your supervision,” said Google CEO Sundar Pichai in a statement. “Today we’re excited to launch our next era of models built for this new agentic era.”


Google and Meta update their AI models amid the rise of “AlphaChip”

Running the AI News Gauntlet —

News about Gemini updates, Llama 3.2, and Google’s new AI-powered chip designer.

There’s been a lot of AI news this week, and covering it sometimes feels like running through a hall full of dangling CRTs, just like this Getty Images illustration.

It’s been a wildly busy week in AI news thanks to OpenAI, including a controversial blog post from CEO Sam Altman, the wide rollout of Advanced Voice Mode, 5GW data center rumors, major staff shake-ups, and dramatic restructuring plans.

But the rest of the AI world doesn’t march to the same beat, doing its own thing and churning out new AI models and research by the minute. Here’s a roundup of some other notable AI news from the past week.

Google Gemini updates

On Tuesday, Google announced updates to its Gemini model lineup, including the release of two new production-ready models that iterate on past releases: Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002. The company reported improvements in overall quality, with notable gains in math, long context handling, and vision tasks. Google claims a 7 percent increase in performance on the MMLU-Pro benchmark and a 20 percent improvement in math-related tasks. But as you may know if you’ve been reading Ars Technica for a while, AI benchmarks typically aren’t as useful as we would like them to be.

Along with model upgrades, Google introduced substantial price reductions for Gemini 1.5 Pro, cutting input token costs by 64 percent and output token costs by 52 percent for prompts under 128,000 tokens. As AI researcher Simon Willison noted on his blog, “For comparison, GPT-4o is currently $5/[million tokens] input and $15/m output and Claude 3.5 Sonnet is $3/m input and $15/m output. Gemini 1.5 Pro was already the cheapest of the frontier models and now it’s even cheaper.”
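As a back-of-the-envelope illustration of what those per-million-token prices mean for a single request, here is a small sketch using the GPT-4o and Claude 3.5 Sonnet figures Willison cites; the post-cut Gemini 1.5 Pro numbers are an assumption included only for comparison.

```python
# Rough per-request cost comparison from per-million-token prices.
# GPT-4o and Claude 3.5 Sonnet rates are the ones quoted above; the Gemini
# 1.5 Pro rates are assumed post-cut figures for illustration only.
PRICES = {  # (input $/M tokens, output $/M tokens)
    "gpt-4o": (5.00, 15.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "gemini-1.5-pro (assumed)": (1.25, 5.00),
}

def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one request given token counts and per-million rates."""
    rate_in, rate_out = PRICES[model]
    return tokens_in / 1e6 * rate_in + tokens_out / 1e6 * rate_out

for name in PRICES:
    print(f"{name}: ${request_cost(name, 100_000, 2_000):.3f} for a 100k-in / 2k-out request")
```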

Google also increased rate limits, with Gemini 1.5 Flash now supporting 2,000 requests per minute and Gemini 1.5 Pro handling 1,000 requests per minute. Google reports that the latest models offer twice the output speed and three times lower latency compared to previous versions. These changes may make it easier and more cost-effective for developers to build applications with Gemini than before.

Meta launches Llama 3.2

On Wednesday, Meta announced the release of Llama 3.2, a significant update to its open-weights AI model lineup that we have covered extensively in the past. The new release includes vision-capable large language models (LLMs) in 11B and 90B parameter sizes, as well as lightweight text-only models of 1B and 3B parameters designed for edge and mobile devices. Meta claims the vision models are competitive with leading closed-source models on image recognition and visual understanding tasks, while the smaller models reportedly outperform similar-sized competitors on various text-based tasks.

Willison did some experiments with some of the smaller 3.2 models and reported impressive results for the models’ size. AI researcher Ethan Mollick showed off running Llama 3.2 on his iPhone using an app called PocketPal.
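For a sense of how easy those small models are to poke at locally, here is a minimal sketch of running one of the lightweight text-only Llama 3.2 models with Hugging Face Transformers. The Hub model ID and generation settings are assumptions for illustration, and downloading the weights requires accepting Meta’s license.

```python
# A minimal sketch of local inference with a small Llama 3.2 text model using
# Hugging Face Transformers. Model ID is assumed; requires accepting Meta's
# license on the Hugging Face Hub and an authenticated huggingface-cli session.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "In one paragraph, why are small on-device language models useful?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```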

Meta also introduced the first official “Llama Stack” distributions, created to simplify development and deployment across different environments. As with previous releases, Meta is making the models available for free download, with license restrictions. The new models support long context windows of up to 128,000 tokens.

Google’s AlphaChip AI speeds up chip design

On Thursday, Google DeepMind announced what appears to be a significant advancement in AI-driven electronic chip design, AlphaChip. It began as a research project in 2020 and is now a reinforcement learning method for designing chip layouts. Google has reportedly used AlphaChip to create “superhuman chip layouts” in the last three generations of its Tensor Processing Units (TPUs), which are chips similar to GPUs designed to accelerate AI operations. Google claims AlphaChip can generate high-quality chip layouts in hours, compared to weeks or months of human effort. (Reportedly, Nvidia has also been using AI to help design its chips.)

Notably, Google also released a pre-trained checkpoint of AlphaChip on GitHub, sharing the model weights with the public. The company reported that AlphaChip’s impact has already extended beyond Google, with chip design companies like MediaTek adopting and building on the technology for their chips. According to Google, AlphaChip has sparked a new line of research in AI for chip design, potentially optimizing every stage of the chip design cycle from computer architecture to manufacturing.

That wasn’t everything that happened, but those are some major highlights. With the AI industry showing no signs of slowing down at the moment, we’ll see how next week goes.


Google rolls out voice-powered AI chat to the Android masses

Chitchat Wars —

Gemini Live allows back-and-forth conversation, now free to all Android users.

The Google Gemini logo. Credit: Google

On Thursday, Google made Gemini Live, its voice-based AI chatbot feature, available for free to all Android users. The feature allows users to interact with Gemini through voice commands on their Android devices. That’s notable because competitor OpenAI’s Advanced Voice Mode feature of ChatGPT, which is similar to Gemini Live, has not yet fully shipped.

Google unveiled Gemini Live during its Pixel 9 launch event last month. Initially, the feature was exclusive to Gemini Advanced subscribers, but now it’s accessible to anyone using the Gemini app or its overlay on Android.

Gemini Live enables users to ask questions aloud and even interrupt the AI’s responses mid-sentence. Users can choose from several voice options for Gemini’s responses, adding a level of customization to the interaction.

Gemini suggests the following uses of the voice mode in its official help documents:

Talk back and forth: Talk to Gemini without typing, and Gemini will respond back verbally.

Brainstorm ideas out loud: Ask for a gift idea, to plan an event, or to make a business plan.

Explore: Uncover more details about topics that interest you.

Practice aloud: Rehearse for important moments in a more natural and conversational way.

Interestingly, while OpenAI originally demoed its Advanced Voice Mode in May with the launch of GPT-4o, it only began shipping the feature to a limited number of users in late July. Some AI experts speculate that a wider rollout has been hampered by a lack of available computing power since the voice feature is presumably very compute-intensive.

To access Gemini Live, users can reportedly tap a new waveform icon in the bottom-right corner of the app or overlay. This action activates the microphone, allowing users to pose questions verbally. The interface includes options to “hold” Gemini’s answer or “end” the conversation, giving users control over the flow of the interaction.

Currently, Gemini Live supports only English, but Google has announced plans to expand language support in the future. The company also intends to bring the feature to iOS devices, though no specific timeline has been provided for this expansion.


Google pulls its terrible pro-AI “Dear Sydney” ad after backlash

Gemini, write me a fan letter! —

Taking the “human” out of “human communication.”

The Gemini prompt box in the “Dear Sydney” ad. Credit: Google

Have you seen Google’s “Dear Sydney” ad? The one where a young girl wants to write a fan letter to Olympic hurdler Sydney McLaughlin-Levrone? To which the girl’s dad responds that he is “pretty good with words but this has to be just right”? And so, to be just right, he suggests that the daughter get Google’s Gemini AI to write a first draft of the letter?

If you’re watching the Olympics, you have undoubtedly seen it—because the ad has been everywhere. Until today. After a string of negative commentary about the ad’s dystopian implications, Google has pulled the “Dear Sydney” ad from TV. In a statement to The Hollywood Reporter, the company said, “While the ad tested well before airing, given the feedback, we have decided to phase the ad out of our Olympics rotation.”

The backlash was similar to that against Apple’s recent ad in which an enormous hydraulic press crushed TVs, musical instruments, record players, paint cans, sculptures, and even emoji into… the newest model of the iPad. Apple apparently wanted to show just how much creative and entertainment potential the iPad held; critics read the ad as a warning image about the destruction of human creativity in a technological age. Apple apologized soon after.

Now Google has stepped on the same land mine. Not only is AI coming for human creativity, the “Dear Sydney” ad suggests—but it won’t even leave space for the charming imperfections of a child’s fan letter to an athlete. Instead, AI will provide the template, just as it will likely provide the template for the athlete’s response, leading to a nightmare scenario in which huge swathes of human communication have the “human” part stripped right out.

“Very bad”

The generally hostile tone of the commentary was captured by Alexandra Petri’s Washington Post column on the ad, which Petri labeled “very bad.”

This ad makes me want to throw a sledgehammer into the television every time I see it. Given the choice between watching this ad and watching the ad about how I need to be giving money NOW to make certain that dogs do not perish in the snow, I would have to think long and hard. It’s one of those ads that makes you think, perhaps evolution was a mistake and our ancestor should never have left the sea. This could be slight hyperbole but only slight!

If you haven’t seen this ad, you are leading a blessed existence and I wish to trade places with you.

A TechCrunch piece said that it was “hard to think of anything that communicates heartfelt inspiration less than instructing an AI to tell someone how inspiring they are.”

Shelly Palmer, a Syracuse University professor and marketing consultant, wrote that the ad’s basic mistake was overestimating “AI’s ability to understand and convey the nuances of human emotions and thoughts.” Palmer would rather have a “heartfelt message over a grammatically correct, AI-generated message any day,” he said. He then added:

I received just such a heartfelt message from a reader years ago. It was a single line email about a blog post I had just written: “Shelly, you’re to [sic] stupid to own a smart phone.” I love this painfully ironic email so much, I have it framed on the wall in my office. It was honest, direct, and probably accurate.

But his conclusion was far more serious. “I flatly reject the future that Google is advertising,” Palmer wrote. “I want to live in a culturally diverse world where billions of individuals use AI to amplify their human skills, not in a world where we are used by AI pretending to be human.”

Things got saltier from there. NPR host Linda Holmes wrote on social media:

This commercial showing somebody having a child use AI to write a fan letter to her hero SUCKS. Obviously there are special circumstances and people who need help, but as a general “look how cool, she didn’t even have to write anything herself!” story, it SUCKS. Who wants an AI-written fan letter?? I promise you, if they’re able, the words your kid can put together will be more meaningful than anything a prompt can spit out. And finally: A fan letter is a great way for a kid to learn to write! If you encourage kids to run to AI to spit out words because their writing isn’t great yet, how are they supposed to learn? Sit down with your kid and write the letter with them! I’m just so grossed out by the entire thing.

The Atlantic was more succinct with its headline: “Google Wins the Gold Medal for Worst Olympic Ad.”

All of this largely tracks with our own take on the ad, which Ars Technica’s Kyle Orland called a “grim” vision of the future. “I want AI-powered tools to automate the most boring, mundane tasks in my life, giving me more time to spend on creative, life-affirming moments with my family,” he wrote. “Google’s ad seems to imply that these life-affirming moments are also something to be avoided—or at least made pleasingly more efficient—through the use of AI.”

Getting people excited about their own obsolescence and addiction is a tough sell, so I don’t envy the marketers who have to hawk Big Tech’s biggest products in a climate of suspicion and hostility toward everything from AI to screen time to social media to data collection. I’m sure the marketers will find a way—but clearly “Dear Sydney” isn’t it.


Outsourcing emotion: The horror of Google’s “Dear Sydney” AI ad

Here’s an idea: Don’t be a deadbeat and do it yourself!

If you’ve watched any Olympics coverage this week, you’ve likely been confronted with an ad for Google’s Gemini AI called “Dear Sydney.” In it, a proud father seeks help writing a letter on behalf of his daughter, who is an aspiring runner and superfan of world-record-holding hurdler Sydney McLaughlin-Levrone.

“I’m pretty good with words, but this has to be just right,” the father intones before asking Gemini to “Help my daughter write a letter telling Sydney how inspiring she is…” Gemini dutifully responds with a draft letter in which the LLM tells the runner, on behalf of the daughter, that she wants to be “just like you.”

Every time I see this ad, it puts me on edge in a way I’ve had trouble putting into words (though Gemini itself has some helpful thoughts). As someone who writes words for a living, the idea of outsourcing a writing task to a machine brings up some vocational anxiety. And the idea of someone who’s “pretty good with words” doubting his abilities when the writing “has to be just right” sets off alarm bells regarding the superhuman framing of AI capabilities.

But I think the most offensive thing about the ad is what it implies about the kinds of human tasks Google sees AI replacing. Rather than using LLMs to automate tedious busywork or difficult research questions, “Dear Sydney” presents a world where Gemini can help us offload a heartwarming shared moment of connection with our children.

The “Dear Sydney” ad.

It’s a distressing answer to what’s still an incredibly common question in the AI space: What do you actually use these things for?

Yes, I can help

Marketers have a difficult task when selling the public on their shiny new AI tools. An effective ad for an LLM has to make it seem like a superhuman do-anything machine but also an approachable, friendly helper. An LLM has to be shown as good enough to reliably do things you can’t (or don’t want to) do yourself, but not so good that it will totally replace you.

Microsoft’s 2024 Super Bowl ad for Copilot is a good example of an attempt to thread this needle, featuring a handful of examples of people struggling to follow their dreams in the face of unseen doubters. “Can you help me?” those dreamers ask Copilot with various prompts. “Yes, I can help” is the message Microsoft delivers back, whether through storyboard images, an impromptu organic chemistry quiz, or “code for a 3D open world game.”

Microsoft’s Copilot marketing sells it as a helper for achieving your dreams.

The “Dear Sydney” ad tries to fit itself into this same box, technically. The prompt in the ad starts with “Help my daughter…” and the tagline at the end offers “A little help from Gemini.” If you look closely near the end, you’ll also see Gemini’s response starts with “Here’s a draft to get you started.” And to be clear, there’s nothing inherently wrong with using an LLM as a writing assistant in this way, especially if you have a disability or are writing in a non-native language.

But the subtle shift from Microsoft’s “Help me” to Google’s “Help my daughter” changes the tone of things. Inserting Gemini into a child’s heartfelt request for parental help makes it seem like the parent in question is offloading their responsibilities to a computer in the coldest, most sterile way possible. More than that, it comes across as an attempt to avoid an opportunity to bond with a child over a shared interest in a creative way.

It’s one thing to use AI to help you with the most tedious parts of your job, as people do in recent ads for Salesforce’s Einstein AI. It’s another to tell your daughter to go ask the computer for help pouring her heart out to her idol.


Google, its cat fully escaped from bag, shows off the Pixel 9 Pro weeks early

Google Pixel 9 Series —

Upcoming phone is teased with an AI breakup letter to “the same old thing.”

You can have confirmation of one of our upcoming four phones, but you have to hear us talk about AI again. Deal? Credit: Google

After every one of its house-brand phones, and even its new wall charger, has been meticulously photographed, sized, and rated for battery capacity, what should Google do to keep the anticipation up for the Pixel 9 series’ August 13 debut?

Lean into it, it seems. Google is doing so with an eye toward further promoting its Gemini-based AI ambitions. In a video post on X (formerly Twitter), Google describes a “phone built for the Gemini era,” one that can, through the power of Gemini, “even let your old phone down easy” with a breakup letter. The camera pans out, and the shape of the Pixel 9 Pro appears and turns around to show off the now-standard Pixel camera bar across the upper back.

There’s also a disclaimer to this tongue-in-cheek request for a send-off to a phone that is “just the same old thing”: “Screen simulated. Limitations apply. Check responses for accuracy.”

Over at the Google Store, you can see a static image of the Pixel 9 Pro and sign up for alerts about its availability. The image confirms that the photos taken by Taiwanese regulatory authority NCC were legitimate, right down to the coloring on the back of the Pixel 9 Pro and the camera and flash placement.

Those NCC photos confirmed that Google intends to launch four different phone-ish devices at its August 13 “Made by Google” event. The Pixel 9 and Pixel 9 Pro are both roughly 6.1-inch devices, but the Pro will likely offer more robust Gemini AI integration due to increased RAM and other spec bumps. The Pixel 9 Pro XL should have similarly AI-ready specs, just in a larger size. And the Pixel 9 Pro Fold is an iteration on Google’s first Pixel Fold model, with seemingly taller dimensions and a daringly smaller battery.


Anthropic introduces Claude 3.5 Sonnet, matching GPT-4o on benchmarks

The Anthropic Claude 3 logo, jazzed up by Benj Edwards. Credit: Anthropic / Benj Edwards

On Thursday, Anthropic announced Claude 3.5 Sonnet, its latest AI language model and the first in a new series of “3.5” models that build upon Claude 3, launched in March. Claude 3.5 Sonnet can compose text, analyze data, and write code. It features a 200,000-token context window and is available now on the Claude website and through an API. Anthropic also introduced Artifacts, a new feature in the Claude interface that displays generated documents and code in a dedicated side window.
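For developers, access “through an API” means Anthropic’s Messages API. Here is a minimal sketch using the anthropic Python SDK; the dated model ID is an assumption based on Anthropic’s naming convention, and the client reads your key from the ANTHROPIC_API_KEY environment variable.

```python
# A minimal sketch of calling Claude 3.5 Sonnet via Anthropic's Messages API.
# The dated model ID is assumed; the SDK picks up ANTHROPIC_API_KEY from the
# environment.
import anthropic

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumed release ID
    max_tokens=512,
    messages=[
        {"role": "user", "content": "Explain what a 200,000-token context window lets you do."}
    ],
)
print(message.content[0].text)
```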

So far, people outside of Anthropic seem impressed. “This model is really, really good,” wrote independent AI researcher Simon Willison on X. “I think this is the new best overall model (and both faster and half the price of Opus, similar to the GPT-4 Turbo to GPT-4o jump).”

As we’ve written before, benchmarks for large language models (LLMs) are troublesome because they can be cherry-picked and often do not capture the feel and nuance of using a machine to generate outputs on almost any conceivable topic. But according to Anthropic, Claude 3.5 Sonnet matches or outperforms competitor models like GPT-4o and Gemini 1.5 Pro on certain benchmarks like MMLU (undergraduate level knowledge), GSM8K (grade school math), and HumanEval (coding).

Claude 3.5 Sonnet benchmarks provided by Anthropic.

If all that makes your eyes glaze over, that’s OK; it’s meaningful to researchers but mostly marketing to everyone else. A more useful performance metric comes from what we might call “vibemarks” (coined here first!), which are subjective, non-rigorous aggregate feelings measured by competitive usage on sites like LMSYS’s Chatbot Arena. The Claude 3.5 Sonnet model is currently under evaluation there, and it’s too soon to say how well it will fare.

Claude 3.5 Sonnet also outperforms Anthropic’s previous-best model (Claude 3 Opus) on benchmarks measuring “reasoning,” math skills, general knowledge, and coding abilities. For example, the model demonstrated strong performance in an internal coding evaluation, solving 64 percent of problems compared to 38 percent for Claude 3 Opus.

Claude 3.5 Sonnet is also a multimodal AI model that accepts visual input in the form of images, and the new model is reportedly excellent at a battery of visual comprehension tests.

Claude 3.5 Sonnet benchmarks provided by Anthropic.

Roughly speaking, the visual benchmarks mean that 3.5 Sonnet is better at pulling information from images than previous models. For example, you can show it a picture of a rabbit wearing a football helmet, and the model knows it’s a rabbit wearing a football helmet and can talk about it. That’s fun for tech demos, but the technology is still not accurate enough for applications where reliability is mission critical.


Google’s “AI Overview” can give false, misleading, and dangerous answers

This is fine. Credit: Getty Images

If you use Google regularly, you may have noticed the company’s new AI Overviews providing summarized answers to some of your questions in recent days. If you use social media regularly, you may have come across many examples of those AI Overviews being hilariously or even dangerously wrong.

Factual errors can pop up in existing LLM chatbots as well, of course. But the potential damage that can be caused by AI inaccuracy gets multiplied when those errors appear atop the ultra-valuable web real estate of the Google search results page.

“The examples we’ve seen are generally very uncommon queries and aren’t representative of most people’s experiences,” a Google spokesperson told Ars. “The vast majority of AI Overviews provide high quality information, with links to dig deeper on the web.”

After looking through dozens of examples of Google AI Overview mistakes (and replicating many ourselves for the galleries below), we’ve noticed a few broad categories of errors that seemed to show up again and again. Consider this a crash course in some of the current weak points of Google’s AI Overviews and a look at areas of concern for the company to improve as the system continues to roll out.

Treating jokes as facts

  • The bit about using glue on pizza can be traced back to an 11-year-old troll post on Reddit. Credit: Kyle Orland / Google

  • This wasn’t funny when the guys at Pep Boys said it, either. Credit: Kyle Orland / Google

  • Weird Al recommends “running with scissors” as well! Credit: Kyle Orland / Google

Some of the funniest examples of Google’s AI Overview failing come, ironically enough, when the system doesn’t realize a source online was trying to be funny. An AI answer that suggested using “1/8 cup of non-toxic glue” to stop cheese from sliding off pizza can be traced back to someone who was obviously trying to troll an ongoing thread. A response recommending “blinker fluid” for a turn signal that doesn’t make noise can similarly be traced back to a troll on the Good Sam advice forums, which Google’s AI Overview apparently trusts as a reliable source.

In regular Google searches, these jokey posts from random Internet users probably wouldn’t be among the first answers someone saw when clicking through a list of web links. But with AI Overviews, those trolls were integrated into the authoritative-sounding data summary presented right at the top of the results page.

What’s more, there’s nothing in the tiny “source link” boxes below Google’s AI summary to suggest either of these forum trolls are anything other than good sources of information. Sometimes, though, glancing at the source can save you some grief, such as when you see a response calling running with scissors “cardio exercise that some say is effective” (that came from a 2022 post from Little Old Lady Comedy).

Bad sourcing

  • Washington University in St. Louis says this ratio is accurate, but others disagree. Credit: Kyle Orland / Google

  • Man, we wish this fantasy remake was real. Credit: Kyle Orland / Google

Sometimes Google’s AI Overview offers an accurate summary of a non-joke source that happens to be wrong. When asking about how many Declaration of Independence signers owned slaves, for instance, Google’s AI Overview accurately summarizes a Washington University in St. Louis library page saying that one-third “were personally enslavers.” But the response ignores contradictory sources like a Chicago Sun-Times article saying the real answer is closer to three-quarters. I’m not enough of a history expert to judge which authoritative-seeming source is right, but at least one historian online took issue with the Google AI’s answer sourcing.

Other times, a source that Google trusts as authoritative is really just fan fiction. That’s the case for a response that imagined a 2022 remake of 2001: A Space Odyssey, directed by Steven Spielberg and produced by George Lucas. A savvy web user would probably do a double-take before citing Fandom’s “Idea Wiki” as a reliable source, but a careless AI Overview user might not notice where the AI got its information.


LLMs keep leaping with Llama 3, Meta’s newest open-weights AI model

computer-powered word generator —

Zuckerberg says new AI model “was still learning” when Meta stopped training.

A group of pink llamas on a pixelated background.

On Thursday, Meta unveiled early versions of its Llama 3 open-weights AI model that can be used to power text composition, code generation, or chatbots. It also announced that its Meta AI Assistant is now available on a website and is going to be integrated into its major social media apps, intensifying the company’s efforts to position its products against other AI assistants like OpenAI’s ChatGPT, Microsoft’s Copilot, and Google’s Gemini.

Like its predecessor, Llama 2, Llama 3 is notable for being a freely available, open-weights large language model (LLM) provided by a major AI company. Llama 3 technically does not qualify as “open source” because that term has a specific meaning in software (as we have mentioned in other coverage), and the industry has not yet settled on terminology for AI model releases that ship either code or weights with restrictions (you can read Llama 3’s license here) or that ship without providing training data. We typically call these releases “open weights” instead.

At the moment, Llama 3 is available in two parameter sizes: 8 billion (8B) and 70 billion (70B), both of which are available as free downloads through Meta’s website with a sign-up. Llama 3 comes in two versions: pre-trained (basically the raw, next-token-prediction model) and instruction-tuned (fine-tuned to follow user instructions). Each has an 8,192-token context limit.
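In practice, the main difference when using the instruction-tuned release is that prompts get wrapped in Llama 3’s chat template before generation. Here is a minimal sketch with Hugging Face Transformers; the Hub model ID is an assumption, and access requires accepting Meta’s license.

```python
# A minimal sketch showing how the instruction-tuned Llama 3 model expects its
# chat template, applied via Hugging Face Transformers. Model ID is assumed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # assumed Hub ID
chat = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is an open-weights model?"},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(prompt)  # shows the special tokens the instruct variant was fine-tuned on
```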

A screenshot of the Meta AI Assistant website on April 18, 2024. Credit: Benj Edwards

Meta trained both models on two custom-built, 24,000-GPU clusters. In a podcast interview with Dwarkesh Patel, Meta CEO Mark Zuckerberg said that the company trained the 70B model with around 15 trillion tokens of data. Throughout the process, the model never reached “saturation” (that is, it never hit a wall in terms of capability increases). Eventually, Meta pulled the plug and moved on to training other models.

“I guess our prediction going in was that it was going to asymptote more, but even by the end it was still learning. We probably could have fed it more tokens, and it would have gotten somewhat better,” Zuckerberg said on the podcast.

Meta also announced that it is currently training a 400B parameter version of Llama 3, which some experts like Nvidia’s Jim Fan think may perform in the same league as GPT-4 Turbo, Claude 3 Opus, and Gemini Ultra on benchmarks like MMLU, GPQA, HumanEval, and MATH.

Speaking of benchmarks, we have devoted many words in the past to explaining how frustratingly imprecise benchmarks can be when applied to large language models due to issues like training contamination (that is, including benchmark test questions in the training dataset), cherry-picking on the part of vendors, and an inability to capture AI’s general usefulness in an interactive session with chat-tuned models.

But, as expected, Meta provided some benchmarks for Llama 3 that list results from MMLU (undergraduate level knowledge), GSM-8K (grade-school math), HumanEval (coding), GPQA (graduate-level questions), and MATH (math word problems). These show the 8B model performing well compared to open-weights models like Google’s Gemma 7B and Mistral 7B Instruct, with the 70B model also holding its own against Gemini Pro 1.5 and Claude 3 Sonnet.

A chart of instruction-tuned Llama 3 8B and 70B benchmarks provided by Meta.

Meta says that the Llama 3 model has been enhanced with capabilities to understand coding (like Llama 2) and, for the first time, has been trained with both images and text—though it currently outputs only text. According to Reuters, Meta Chief Product Officer Chris Cox noted in an interview that more complex processing abilities (like executing multi-step plans) are expected in future updates to Llama 3, which will also support multimodal outputs—that is, both text and images.

Meta plans to host the Llama 3 models on a range of cloud platforms, making them accessible through AWS, Databricks, Google Cloud, and other major providers.

Also on Thursday, Meta announced that Llama 3 will become the new basis of the Meta AI virtual assistant, which the company first announced in September. The assistant will appear prominently in search features for Facebook, Instagram, WhatsApp, Messenger, and the aforementioned dedicated website that features a design similar to ChatGPT, including the ability to generate images in the same interface. The company also announced a partnership with Google to integrate real-time search results into the Meta AI assistant, adding to an existing partnership with Microsoft’s Bing.
