Gemini

“Wildly irresponsible”: DOT’s use of AI to draft safety rules sparks concerns

At DOT, Trump likely hopes to see many rules quickly updated to modernize airways and roadways. In a report highlighting the Office of Science and Technology Policy’s biggest “wins” in 2025, the White House credited DOT with “replacing decades-old rules with flexible, innovation-friendly frameworks,” including fast-tracking rules to allow for more automated vehicles on the roads.

Right now, DOT expects that Gemini can be relied on to “handle 80 to 90 percent of the work of writing regulations,” ProPublica reported. Eventually all federal workers who rely on AI tools like Gemini to draft rules “would fall back into merely an oversight role, monitoring ‘AI-to-AI interactions,’” ProPublica reported.

Google silent on AI drafting safety rules

Google did not respond to Ars’ request to comment on this use case for Gemini, which could spread across government under Trump’s direction.

Instead, the tech giant posted a blog on Monday, pitching Gemini for government more broadly and promising federal workers that AI would help with everything from “creative problem-solving to the most critical aspects of their work.”

Google has been competing with AI rivals for government contracts, undercutting OpenAI and Anthropic’s $1 deals by offering a year of access to Gemini for $0.47.

The DOT contract seems important to Google. In a December blog, the company celebrated that DOT was “the first cabinet-level agency to fully transition its workforce away from legacy providers to Google Workspace with Gemini.”

At that time, Google suggested this move would help DOT “ensure the United States has the safest, most efficient, and modern transportation system in the world.”

Immediately, Google encouraged other federal leaders to launch their own efforts using Gemini.

“We are committed to supporting the DOT’s digital transformation and stand ready to help other federal leaders across the government adopt this blueprint for their own mission successes,” Google’s blog said.

DOT did not immediately respond to Ars’ request for comment.

Report: Apple plans to launch AI-powered wearable pin device as soon as 2027

The report didn’t include any information about pricing, but it did say that Apple has fast-tracked the product with the hope of releasing it as early as 2027. Twenty million units are planned for launch, suggesting the company does not expect it to be the kind of sensational consumer success that some of its past products, like AirPods, have been.

Not long ago, it was reported that OpenAI (the company behind ChatGPT) plans to release its own hardware, though the specifics and form factor are not publicly known. Apple expects fierce competition there, as well as from Meta, which it already considers a rival in the emerging, related smart glasses market.

Apple has experienced significant internal turmoil over AI, with former AI lead John Giannandrea’s conservative approach to the technology failing to produce a usable, truly LLM-based Siri or the other products analysts believe Apple needs to stay competitive in the space with other Big Tech companies.

Just a few days ago, it was revealed that Apple will tap Google’s Gemini large language models for an LLM overhaul of Siri. Other AI-driven products like smart glasses and an in-home smart display are also planned.

Google adds your Gmail and Photos to AI Mode to enable “Personal Intelligence”

Google believes AI is the future of search, and it’s not shy about saying it. After adding account-level personalization to Gemini earlier this month, it’s now updating AI Mode with so-called “Personal Intelligence.” According to Google, this makes the bot’s answers more useful because they are tailored to your personal context.

Starting today, the feature is rolling out to all users who subscribe to Google AI Pro or AI Ultra. However, it will be a Labs feature that needs to be explicitly enabled (subscribers will be prompted to do this). Google tends to expand access to new AI features to free accounts later on, so free users will most likely get access to Personal Intelligence in the future. Whenever this option does land on your account, it’s entirely optional and can be disabled at any time.

If you decide to integrate your data with AI Mode, the search bot will be able to scan your Gmail and Google Photos. That’s less extensive than the Gemini app version, which supports Gmail, Photos, Search, and YouTube history. Gmail will probably be the biggest contributor to AI Mode—a great many life events involve confirmation emails. Traditional search results when you are logged in are adjusted based on your usage history, but this goes a step further.

If you’re going to use AI Mode to find information, Personal Intelligence could actually be quite helpful. When you connect data from other Google apps, Google’s custom Gemini search model will instantly know about your preferences and background—that’s the kind of information you’d otherwise have to include in your search query to get the best output. With Personal Intelligence, AI Mode can just pull those details from your email or photos.

Hegseth wants to integrate Musk’s Grok AI into military networks this month

On Monday, US Defense Secretary Pete Hegseth said he plans to integrate Elon Musk’s AI tool, Grok, into Pentagon networks later this month. During remarks at the SpaceX headquarters in Texas reported by The Guardian, Hegseth said the integration would place “the world’s leading AI models on every unclassified and classified network throughout our department.”

The announcement comes weeks after Grok drew international backlash for generating sexualized images of women and children, although the Department of Defense has not released official documentation confirming Hegseth’s announced timeline or implementation details.

During the same appearance, Hegseth rolled out what he called an “AI acceleration strategy” for the Department of Defense. The strategy, he said, will “unleash experimentation, eliminate bureaucratic barriers, focus on investments, and demonstrate the execution approach needed to ensure we lead in military AI and that it grows more dominant into the future.”

As part of the plan, Hegseth directed the DOD’s Chief Digital and Artificial Intelligence Office to use its full authority to enforce department data policies, making information available across all IT systems for AI applications.

“AI is only as good as the data that it receives, and we’re going to make sure that it’s there,” Hegseth said.

If implemented, Grok would join other AI models the Pentagon has adopted in recent months. In July 2025, the Department of Defense issued contracts worth up to $200 million each to four companies (Anthropic, Google, OpenAI, and xAI) to develop AI agent systems across different military operations. In December 2025, the Department of Defense selected Google’s Gemini as the foundation for GenAI.mil, an internal AI platform for military use.

Google TV’s big Gemini update adds image and video generation, voice control for settings

That might be a fun distraction, but it’s not a core TV experience. Google’s image and video models are good enough that you might gain some benefit from monkeying around with them on a larger screen, but Gemini is also available for more general tasks.

Google TV will support generating new images and videos with Google’s AI models. Credit: Google

This update brings a full chatbot-like experience to TVs. If you want to catch up on sports scores or get recommendations for what to watch, you can ask the robot. The outputs might be a little different from what you would expect from using Gemini on the web or in an app. Google says it has devised a “visually rich framework” that will make the AI more usable on a TV. There will also be a “Dive Deeper” option in each response to generate an interactive overview of the topic.

Gemini can also take action to tweak system settings based on your complaints. For example, pull up Gemini and say “the dialogue is too quiet,” then watch as the AI makes adjustments to address that.

Gemini’s replies on Google TV will be more visual. Credit: Google

The new Gemini features will debut on TCL TVs that run Google TV, but most other devices, even Google’s own TV Streamer, will have to wait a few months. Even then, you won’t see Gemini taking over every TV or streaming box with Google’s software. The new Gemini features require the full Google TV experience with Android OS version 14 or higher.

We asked four AI coding agents to rebuild Minesweeper—the results were explosive


How do four modern LLMs do at re-creating a simple Windows gaming classic?

Which mines are mine, and which are AI? Credit: Aurich Lawson | Getty Images

The idea of using AI to help with computer programming has become a contentious issue. On the one hand, coding agents can make horrific mistakes that require a lot of inefficient human oversight to fix, leading many developers to lose trust in the concept altogether. On the other hand, some coders insist that AI coding agents can be powerful tools and that frontier models are quickly getting better at coding in ways that overcome some of the common problems of the past.

To see how effective these modern AI coding tools are becoming, we decided to test four major models with a simple task: re-creating the classic Windows game Minesweeper. Since it’s relatively easy for pattern-matching systems like LLMs to play off of existing code to re-create famous games, we added in one novelty curveball as well.

Our straightforward prompt:

Make a full-featured web version of Minesweeper with sound effects that

1) Replicates the standard Windows game and

2) implements a surprise, fun gameplay feature.

Include mobile touchscreen support.

Ars Senior AI Editor Benj Edwards fed this task into four AI coding agents with terminal (command line) apps: OpenAI’s Codex based on GPT-5, Anthropic’s Claude Code with Opus 4.5, Google’s Gemini CLI, and Mistral Vibe. The agents then directly manipulated HTML and scripting files on a local machine, guided by a “supervising” AI model that interpreted the prompt and assigned coding tasks to parallel LLMs that can use software tools to execute the instructions. All AI plans were paid for privately with no special or privileged access given by the companies involved, and the companies were unaware of these tests taking place.

Ars Senior Gaming Editor (and Minesweeper expert) Kyle Orland then judged each example blind, without knowing which model generated which Minesweeper clone. Those somewhat subjective and non-rigorous results are below.

For this test, we used each AI model’s unmodified code in a “single shot” result to see how well these tools perform without any human debugging. In the real world, most sufficiently complex AI-generated code would go through at least some level of review and tweaking by a human software engineer who could spot problems and address inefficiencies.

We chose this test as a sort of simple middle ground for the current state of AI coding. Cloning Minesweeper isn’t a trivial task that can be done in just a handful of lines of code, but it’s also not an incredibly complex system that requires many interlocking moving parts.

Minesweeper is also a well-known game, with many versions documented across the Internet. That should give these AI agents plenty of raw material to work from and should be easier for us to evaluate than a completely novel program idea. At the same time, our open-ended request for a new “fun” feature helps demonstrate each agent’s penchant for unguided coding “creativity,” as well as their ability to create new features on top of an established game concept.

With all that throat-clearing out of the way, here’s our evaluation of the AI-generated Minesweeper clones, complete with links that you can use to play them yourselves.

Agent 1: Mistral Vibe

Play it for yourself

Just ignore that Custom button. It’s purely for show. Credit: Benj Edwards

Implementation

Right away, this version loses points for not implementing chording—the technique that advanced Minesweeper players use to quickly clear all the remaining spaces surrounding a number that already has sufficient flagged mines. Without this feature, this version feels more than a little clunky to play.

I’m also a bit perplexed by the inclusion of a “Custom” difficulty button that doesn’t seem to do anything. It’s like the model realized that customized board sizes were a thing in Minesweeper but couldn’t figure out how to implement this relatively basic feature.

The game works fine on mobile, but marking a square with a flag requires a tricky long-press on a tiny square that also triggers selector handles that are difficult to clear. So it’s not an ideal mobile interface.

Presentation

This was the only working version we tested that didn’t include sound effects. That’s fair, since the original Windows Minesweeper also didn’t include sound, but it’s still a notable relative omission since the prompt specifically asked for it.

The all-black “smiley face” button to start a game is a little off-putting, too, compared to the bright yellow version that’s familiar to both Minesweeper players and emoji users worldwide. And while that smiley face does start a new game when clicked, there’s also a superfluous “New Game” button taking up space for some reason.

“Fun” feature

The closest thing I found to a “fun” new feature here was the game adding a rainbow background pattern on the grid when I completed a game. While that does add a bit of whimsy to a successful game, I expected a little more.

Coding experience

Benj notes that he was pleasantly surprised by how well Mistral Vibe performed as an open-weight model despite lacking the big-money backing of the other contenders. It was relatively slow, however (third fastest out of four), and the result wasn’t great. Ultimately, its performance so far suggests that with more time and more training, a very capable AI coding agent may eventually emerge.

Overall rating: 4/10

This version got many of the basics right but left out chording and didn’t perform well on the small presentational and “fun” touches.

Agent 2: OpenAI Codex

Play it for yourself

I can’t tell you how much I appreciate those chording instructions at the bottom. Credit: Benj Edwards

Implementation

Not only did this agent include the crucial “chording” feature, but it also included on-screen instructions for using it on both PC and mobile browsers. I was further impressed by the option to cycle through “?” marks when marking squares with flags, an esoteric feature I feel even most human Minesweeper cloners might miss.

On mobile, the option to hold your finger down on a square to mark a flag is a nice touch that makes this the most enjoyable handheld version we tested.

Presentation

The old-school emoticon smiley-face button is pretty endearing, especially when you blow up and get a red-tinted “X(“. I was less impressed by the playfield “graphics,” which use a simple “*” for revealed mines and an ugly red “F” for flagged tiles.

The beeps-and-boops sound effects reminded me of my first old-school, pre-Sound-Blaster PC from the late ’80s. That’s generally a good thing, but I still appreciated the game giving me the option to turn them off.

“Fun” feature

The “Surprise: Lucky Sweep Bonus” listed in the corner of the UI explains that clicking the button gives you a free safe tile when available. This can be pretty useful in situations where you’d otherwise be forced to guess between two tiles that are equally likely to be mines.

Overall, though, I found it a bit odd that the game gives you this bonus only after you find a large, cascading field of safe tiles with a single click. It mostly functions as a “win more” button rather than a feature that offers a good balance of risk versus reward.

Coding experience

OpenAI Codex has a nice terminal interface with features similar to Claude Code (local commands, permission management, and interesting animations showing progress), and it’s fairly pleasant to use (OpenAI also offers Codex through a web interface, but we did not use that for this evaluation). However, Codex took roughly twice as long as Claude Code did to code a functional game, which might have contributed to the strong result here.

Overall: 9/10

The implementation of chording and cute presentation touches push this to the top of the list. We just wish the “fun” feature was a bit more fun.

Agent 3: Anthropic Claude Code

Play it for yourself

The Power Mode powers on display here make even Expert boards pretty trivial to complete. Credit: Benj Edwards

Implementation

Once again, we get a version that gets all the gameplay basics right but is missing the crucial chording feature that makes truly efficient Minesweeper play possible. This is like playing Super Mario Bros. without the run button or Ocarina of Time without Z-targeting. In a word: unacceptable.

The “flag mode” toggle on the mobile version of this game is perfectly functional, but it’s a little clunky to use. It also visually cuts off a portion of the board at the larger game sizes.

Presentation

Presentation-wise, this is probably the most polished version we tested. From the use of cute emojis for the “face” button to nice-looking bomb and flag graphics and simple but effective sound effects, this looks more professional than the other versions we tested.

That said, there are some weird presentation issues. The “beginner” grid has weird gaps between columns, for instance. The borders of each square and the flag graphics can also become oddly grayed out at points, especially when using Power Mode (see below).

“Fun” feature

The prominent “Power Mode” button in the lower-right corner offers some pretty fun power-ups that alter the core Minesweeper formula in interesting ways. But the actual powers are a bit hit-and-miss.

I especially liked the “Shield” power, which protects you from an errant guess, and the “Blast” power, which seems to guarantee a large cascade of revealed tiles wherever you click. But the “X-Ray” power, which reveals every bomb for a few seconds, could be easily exploited by a quick player (or a crafty screenshot). And the “Freeze” power is rather boring, just stopping the clock for a few seconds and amounting to a bit of extra time.

Overall, the game hands out these new powers like candy, which makes even an Expert-level board relatively trivial with Power Mode active. Simply choosing “Power Mode” also seems to mark a few safe squares right after you start a game, making things even easier. So while these powers can be “fun,” they also don’t feel especially well-balanced.

Coding experience

Of the four tested models, Claude Code with Opus 4.5 featured the most pleasant terminal interface experience and the fastest overall coding experience (Claude Code can also use Sonnet 4.5, which is even faster, but the results aren’t quite as full-featured in our experience). While we didn’t precisely time each model, Opus 4.5 produced a working Minesweeper in under five minutes. Codex took at least twice as long, if not longer, while Mistral took roughly three or four times as long as Claude Code. Gemini, meanwhile, took hours of tinkering to get two non-working results.

Overall: 7/10

The lack of chording is a big omission, but the strong presentation and Power Mode options give this effort a passable final score.

Agent 4: Google Gemini CLI

Play it for yourself

So… where’s the game? Credit: Benj Edwards

Implementation, presentation, etc.

Gemini CLI did give us a few gray boxes you can click, but the playfields are missing. While interactive troubleshooting with the agent may have fixed the issue, as a “one-shot” test, the model completely failed.

Coding experience

Of the four coding agents we tested, Gemini CLI gave Benj the most trouble. After developing a plan, it was very, very slow at generating any usable code (about an hour per attempt). The model seemed to get hung up attempting to manually create WAV file sound effects and insisted on requiring React external libraries and a few other overcomplicated dependencies. The result simply did not work.

Benj actually bent the rules and gave Gemini a second chance, specifying that the game should use HTML5. When the model started writing code again, it also got hung up trying to make sound effects. Benj suggested using the Web Audio API (which the other AI coding agents seemed able to use), but the result didn’t work, which you can see at the link above.

Unlike the other models tested, Gemini CLI apparently uses a hybrid system of three different LLMs for different tasks (Gemini 2.5 Flash Lite, 2.5 Flash, and 2.5 Pro were available at the level of the Google account Benj paid for). When you’ve completed your coding session and quit the CLI interface, it gives you a readout of which model did what.

In this case, it didn’t matter because the results didn’t work. But it’s worth noting that Gemini 3 coding models are available for other subscription plans that were not tested here. For that reason, this portion of the test could be considered “incomplete” for Gemini CLI.

Overall: 0/10 (Incomplete)

Final verdict

OpenAI Codex wins this one on points, in no small part because it was the only model to include chording as a gameplay option. But Claude Code also distinguished itself with strong presentational flourishes and quick generation time. Mistral Vibe was a significant step down, and Gemini CLI, based on Gemini 2.5, was a complete failure on our one-shot test.

While experienced coders can definitely get better results via an interactive, back-and-forth code editing conversation with an agent, these results show how capable some of these models can be, even with a very short prompt on a relatively straightforward task. Still, we feel that our overall experience with coding agents on other projects (more on that in a future article) generally reinforces the idea that they currently function best as interactive tools that augment human skill rather than replace it.

Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from the University of Maryland. He once wrote a whole book about Minesweeper.

Google Translate expands live translation to all earbuds on Android

Translate can now use Gemini to interpret the meaning of a phrase rather than simply translating each word. Credit: Google

Regardless of whether you’re using live translate or just checking a single phrase, Google claims the Gemini-powered upgrade will serve you well. Google Translate is now apparently better at understanding the nuance of languages, with an awareness of idioms and local slang. Google uses the example of “stealing my thunder,” which wouldn’t make a lick of sense when translated literally into other languages. The new translation model, which is also available in the search-based translation interface, supports over 70 languages.

Google also debuted language-learning features earlier this year, borrowing a page from educational apps like Duolingo. You can tell the app your skill level with a language, as well as whether you need help with travel-oriented conversations or more everyday interactions. The app uses this to create tailored listening and speaking exercises.

The Translate app’s learning tools are getting better. Credit: Google

With this big update, Translate will be more of a stickler about your pronunciation. Google promises more feedback and tips based on your spoken replies in the learning modules. The app will also now keep track of how often you complete language practice, showing your daily streak in the app.

If “number go up” will help you learn more, then this update is for you. Practice mode is also launching in almost 20 new countries, including Germany, India, Sweden, and Taiwan.

OpenAI CEO declares “code red” as Gemini gains 200 million users in 3 months

In addition to buzz about Gemini on social media, Google is quickly catching up to ChatGPT in user numbers. ChatGPT has more than 800 million weekly users, according to OpenAI, while Google’s Gemini app has grown from 450 million monthly active users in July to 650 million in October, according to Business Insider.

Financial stakes run high

Not everyone views OpenAI’s “code red” as a genuine alarm. Reuters columnist Robert Cyran wrote on Tuesday that OpenAI’s announcement added “to the impression that OpenAI is trying to do too much at once with technology that still requires a great deal of development and funding.” On the same day Altman’s memo circulated, OpenAI announced an ownership stake in a Thrive Capital venture and a collaboration with Accenture. “The only thing bigger than the company’s attention deficit is its appetite for capital,” Cyran wrote.

In fact, OpenAI faces an unusual competitive disadvantage: Unlike Google, which subsidizes its AI ventures through search advertising revenue, OpenAI does not turn a profit and relies on fundraising to survive. According to The Information, the company, now valued at around $500 billion, has committed more than $1 trillion in financial obligations to cloud computing providers and chipmakers that supply the computing power needed to train and run its AI models.

But the tech industry never stands still, and things can change quickly. Altman’s memo also reportedly stated that OpenAI plans to release a new simulated reasoning model next week that may beat Gemini 3 in internal evaluations. In AI, the back-and-forth cycle of one-upmanship is expected to continue as long as the dollars keep flowing.

Gemini 3 Pro Is a Vast Intelligence With No Spine

One might even say the best model. It is for now my default weapon of choice.

Google’s official announcement of Gemini 3 Pro is full of big talk. Google tells us: Welcome to a new era of intelligence. Learn anything. Build anything. Plan anything. An agent-first development experience in Google Antigravity. Gemini Agent for your browser. It’s terrific at everything. They even employed OpenAI-style vague posting.

In this case, they can (mostly) back up that talk.

Google CEO Sundar Pichai pitched that you can give it any scribble and have it turn that into a boardgame or even a full website, that it can analyze your sports performance, and that it can create generative UI experiences and present new visual layouts.

He also pitched the new Gemini Agent mode (select the Tools icon in the app).

If what you want is raw intelligence, or what you want is to most often locate the right or best answer, Gemini 3 Pro looks like your pick.

If you want creative writing or humor, Gemini 3 Pro is definitely your pick.

If you want a teacher to help you learn known things, Gemini 3 Pro is your pick.

For coding, opinions differ, and if doing serious work one must try a variety of options.

For Gemini 3’s model card and safety framework, see Friday’s post.

Alas, there is a downside. In order to get you that right answer so often, Gemini can be thought of as highly focused on achieving its training objectives, and otherwise is very much a Gemini model.

Gemini 3 is evaluation-paranoid. It constantly questions whether it is even 2025.

If it can find the answer it thinks you would want to the question it thinks people in similar spots tend to ask, it will give it to you.

Except that this sometimes won’t be the question you actually asked.

Or the answer it thinks you want won’t be the true answer.

Or that answer will often be sculpted to a narrative.

When it wouldn’t otherwise have that answer available it is likely to hallucinate.

It is a vast intelligence with no spine. It has a willingness to glaze or reverse itself.

By default it will engage in AI slop, although instructions can mitigate this via asking it to create a memory that tells it to stop producing AI slop, no seriously that worked.

The other catch, for me, is that I enjoy and miss the Claude Experience. Gemini is not going to in any way give you the Claude Experience. Gemini is not going to waste time on pleasantries, but it is going to be formal and make you wade through a bunch of objective-maximizing text and bullet points and charts to get to the thing you most wanted.

Nor is it going to give you a Friend Experience. Which for some people is a positive.

If you’re switching, don’t forget to customize it via creating memories.

Also you will have to find a way to pay for it, Google makes this remarkably difficult.

This is generally wise advice at all times, talk to the model and see what you think:

Andrej Karpathy: I played with Gemini 3 yesterday via early access. Few thoughts –

First I usually urge caution with public benchmarks because imo they can be quite possible to game. It comes down to discipline and self-restraint of the team (who is meanwhile strongly incentivized otherwise) to not overfit test sets via elaborate gymnastics over test-set adjacent data in the document embedding space. Realistically, because everyone else is doing it, the pressure to do so is high.

Go talk to the model. Talk to the other models (Ride the LLM Cycle – use a different LLM every day). I had a positive early impression yesterday across personality, writing, vibe coding, humor, etc., very solid daily driver potential, clearly a tier 1 LLM, congrats to the team!

Over the next few days/weeks, I am most curious and on a lookout for an ensemble over private evals, which a lot of people/orgs now seem to build for themselves and occasionally report on here.

It is clear the benchmarks do not tell the whole story. The next section is Gemini repeatedly excelling at benchmarks, and the benchmark performances are (I believe) real. Yet note the catch, the price that was paid.

Gemini 3 Pro is very good at hitting marks.

If it thinks something looks like a mark? Oh boy does Gemini want to hit that mark.

You could summarize this section as ‘they’re excellent marks, sir’ and safely skip it.

First, the official list of marks:

As noted last time, GPT-5-Codex-Max is competitive on some of these and plausibly ahead on SWE-Bench in particular, and also Grok 4 claims a variety of strong but questionable benchmark scores, but yeah, these are great benchmarks.

Arc confirms details here, Gemini 3 Pro gets 31.1% and Gemini 3 Deep Think (preview) spends 100 times as much to get 45.1%, both are in green below:

They’re back at the top spot on Arena with a 1501, 17 ahead of Grok 4.1. It has the top spots in Text, Vision, WebDev, Coding, Math, Creative Writing, Long Queries and ‘nearly all occupational leaderboards.’ An almost clean sweep, with the exception being Arena Expert where it’s only 3 points behind.

The others are impressive too. They weren’t cherry picking.

Dan Hendrycks: Just how significant is the jump with Gemini 3?

We just released a new leaderboard to track AI developments.

Gemini 3 is the largest leap in a long time.

Gemini 3 did less well on the safety eval, see the previous post on such issues.

Artificial Analysis has them with a substantial edge in intelligence.

Several of AA’s individual evaluations have GPT-5.1 in front, including AIME (99% vs. 96%), IFBench (74% vs. 70%) and AA-LCR Long Context Reasoning (75% vs. 71%). There’s one metric, 𝜏²-Bench Telecom (Agentic Tool Use), where Grok 4.1 and Kimi K2 Thinking are out in front (93% vs. 87%). Gemini 3 owns the rest, including wide margins on Humanity’s Last Exam (37% vs. 26%) and SciCode (56% vs. 46%), both places where Gemini 3 shatters the previous curve.

On AA-Omniscience, Gemini 3 Pro is the first model to be substantially in the net positive range (the score is correct minus incorrect) at +13; the previous high was +2, and percent correct jumps from 39% to 53%.

However, on AA-Omniscience Hallucination Rate, you see the problem, where out of all non-correct attempts Gemini 3 hallucinates a wrong answer 88% of the time rather than declining to answer. Claude 4.5 Haiku (26%), Claude 4.5 Sonnet (48%) and GPT-5.1-High (51%) are the best performers on that.

That’s a big deal and throughline for everything. Gemini 3 is the most likely model to give you the right answer, but it’ll be damned before it answers ‘I don’t know’ and would rather make something up.
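As a rough back-of-the-envelope check on how those published figures fit together (a sketch only, using the rounded percentages quoted above and assuming every non-correct attempt that wasn’t answered wrongly was a declined answer):

```python
# Rough consistency check of the AA-Omniscience net score for Gemini 3 Pro,
# using the rounded figures quoted above. Assumes non-correct attempts that
# were not hallucinated were declines, which neither add nor subtract points.
correct = 0.53                 # share of questions answered correctly
hallucination_rate = 0.88      # share of non-correct attempts answered wrongly

incorrect = (1 - correct) * hallucination_rate        # ~41% answered wrongly
declined = (1 - correct) * (1 - hallucination_rate)   # ~6% declined

net_score = (correct - incorrect) * 100               # score = correct minus incorrect
print(f"incorrect ≈ {incorrect:.0%}, declined ≈ {declined:.0%}, net ≈ {net_score:+.0f}")
# Prints roughly +12, in the same ballpark as the reported +13 once rounding is accounted for.
```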

Gemini 3 is also not the cheapest option in practice; only Grok is more expensive:

That’s actual cost rather than cost per token, which is $2/$12 per million, modestly less than Sonnet or Grok and more than GPT-5.1.

On the other hand Gemini 3 was fast, slightly faster than GPT-5.1-High and substantially faster than Sonnet or Haiku. The only substantially faster model was GPT-OSS, which isn’t a serious alternative.

Gemini 3 Pro has a small edge over GPT-5 in Livebench.

Brokk’s coding index is an outlier in being unimpressed, putting Gemini 3 in C tier after factoring in cost. In pure performance terms they only have GPT-5.1 ahead of it.

NYT Connections is now saturated as Gemini 3 Pro hits 96.8%, versus the old high score of 92.4%. Lech Mazar plans to move to something harder.

Here’s a highly opinionated test, note the huge gap from Codex to Sonnet.

Kilo Code: We tested Gemini 3 Pro Preview on 5 hard coding/UI tasks against Claude 4.5 Sonnet and GPT‑5.1 Codex.

Scores (our internal rubric):

• Gemini 3 Pro: 72%

• Claude 4.5 Sonnet: 54%

• GPT‑5.1 Codex: 18%

What stood out about Gemini 3 Pro:

• Code feels human: sensible libraries, efficient patterns, minimal prompting.

• Designs are adaptive, not cookie‑cutter

• Consistently correct CDN paths and awareness of newer tools/architectures.

LiveCodeBench Pro has Gemini 3 in the lead at 49% versus 45% for GPT-5, but something very weird is going on with Claude Sonnet 4.5 Thinking having a total failure scoring under 3%, that isn’t right.

Gemini 3 Pro sets a new high in Frontier Math, including improving on research-level Tier 4.

SimpleBench was a strange case where 2.5 Pro was in the lead before, and now Gemini 3 is another big jump (Grok 4.1 crashed and burned here as did Kimi K2):

Clay Schubiner’s Per-Label Accuracy benchmark was another case where Grok 4.1 crashed and burned hard while Gemini 3 Pro came out on top, with Gemini at 93.1% vs. 90.4% previous high score for Kimi K2 Thinking.

We have a new AI Diplomacy champion with a remarkably low 11% betrayal rate, versus a 100% betrayal rate from the (still quite successful) Gemini 2.5 Pro. They report it was one of the first to effectively use convoys, which have proven remarkably hard. I presume England does not do so well in those games.

Not a benchmark, but the chess seems far improved, here it draws against a master, although the master is playing super loose.

It is also the new leader in LOL Arena, a measure of humor.

It now has a clear lead in WeirdML.

Havard Ihle: It is clearly a big step up. Very intelligent!

gemini-3-pro takes a clear lead in WeirdML with 69.9%, achieving a new best individual score on 7 of the 17 tasks, and showing a clear step up in capability.

Although there is still quite a way to go, models are now starting to reliably score well even on the difficult tasks.

One of the most striking things about gemini-3-pro is how much better it is with several iterations. It makes better use of the information from the previous iterations than other models.

After one iteration it is barely better than gpt-5.1, while after 5 it is almost 10pp ahead.

Is Google inadvertently training on benchmarks? My presumption is no, this is a more general and understandable mistake than that. Alice does note that Gemini 3, unlike most other models, knows the BIG-bench canary string.

That means Google is not sufficiently aggressively filtering out that string, which can appear on other posts like Alice’s, and Dave Orr confirms that Google instead searches for the contents of the evals rather than searching for the string when doing filtering. I would be filtering for both, and plausibly want to exclude any document with the canary string on the theory it could contain eval-relevant data even if it isn’t a pure copy?

Claude Code and OpenAI Codex? Forget Jules, say hello to Antigravity?

Google DeepMind: Google Antigravity is our new agentic development platform.

It helps developers build faster by collaborating with AI agents that can autonomously operate across the editor, terminal, and browser.

It uses Gemini 3 Pro 🧠 to reason about problems, Gemini 2.5 Computer Use 💻 for end-to-end execution, and Nano Banana 🍌 for image generation.

Developers can download the public preview at no cost today.

I’ve had a chance to try it a bit, and it felt more like Cursor. It let me down, including with outright compiler errors, but my core ask might have been unfair and I’m sure I wasn’t doing a great job on my end. It has browser access, but wasn’t using that to gather key info necessary for debugging when it very clearly should have done so.

In another case, Simeon says Antigravity accessed Chrome and his Google accounts without asking for permission, changed his default tab without asking, and opened a new Chrome window without a profile, which logged him out of his Google accounts in Chrome.

I need to escalate soon to Claude Code or OpenAI Codex. I would be very surprised if the optimal way to code these days is not one of those two, whether or not it involves using Gemini 3 Pro.

The stock market initially was unimpressed by the release of Gemini 3 Pro.

This seemed like a mistake. The next day there was a large overnight bump and Google finished up 2.8%, which seems like the minimum, and then the next day Google outperformed again, potentially confounded by Nano Banana Pro of all things. Then there were more ups and downs, none of which appears to have been on any other substantive news. The mainstream media story seems to be that the Google and other stock movements around AI are about rising or falling concerns about AI bubbles or something.

The mainstream doesn’t get how much the quality of Gemini 3 Pro matters. This Wall Street Journal article on the release is illustrative of people not understanding that quality matters; it spends a lot of time talking about (the old) Nano Banana producing faster images. The article by Bloomberg covers some basics but has little to say.

Ben Thompson correctly identifies this as a big Google win, and notes that its relative weakness on SWE-Bench suggests Anthropic might come out of this well. I’d also note that the ‘personality clash’ between the two models is very strong, they are fighting for very different user types all around.

Is Google’s triumph inevitable due to More Dakka?

Teortaxes (linking to LiveCodeBenchPro): Pour one out for Dario

No matter how hard you push on AutOnOMouS CoDinG, people with better ML fundamentals and more dakka will still eat your lunch in the end.

Enjoy your SWE-verified bro

Gallabytes: eh anthropic will be fine. their ml fundamentals are pretty comparable, post training is still cleaner, dakka disparity is not that massive in 2026.

claude 5 will be better than gemini 3 but worse than gemini 4.

Google has many overwhelming advantages. It has vast access to data, access to customers, access to capital and talent. It has TPUs. It has tons of places to take advantage of what it creates. It has the trust of customers, I’ve basically accepted that if Google turns on me my digital life gets rooted. By all rights they should win big.

On the other hand, Google is in many ways a deeply dysfunctional corporation that makes everything inefficient and miserable, and it also has extreme levels of risk aversion on both legal and reputational grounds and a lot of existing business to protect, and lacks the ability to move like a startup. The problems run deep.

Specifically, Karpathy reports this interaction:

My most amusing interaction was where the model (I think I was given some earlier version with a stale system prompt) refused to believe me that it is 2025 and kept inventing reasons why I must be trying to trick it or playing some elaborate joke on it.

I kept giving it images and articles from “the future” and it kept insisting it was all fake. It accused me of using generative AI to defeat its challenges and argued why real wikipedia entries were actually generated and what the “dead giveaways” are. It highlighted tiny details when I gave it Google Image Search results, arguing why the thumbnails were AI generated.

I then realized later that I forgot to turn on the “Google Search” tool. Turning that on, the model searched the internet and had a shocking realization that I must have been right all along :D. It’s in these unintended moments where you are clearly off the hiking trails and somewhere in the generalization jungle that you can best get a sense of model smell.

That is indeed amusing but the implications of this being common are not great?

Alice Blair: Google gracefully provided (lightly summarized) CoT for the model. Looking at the CoT spawned from my mundane writing-focused prompts, oh my, it is strange. I write nonfiction about recent events in AI in a newsletter. According to its CoT while editing, Gemini 3 disagrees about the whole “nonfiction” part:

Quoting Gemini 3: It seems I must treat this as a purely fictional scenario with 2025 as the date. Given that, I’m now focused on editing the text for flow, clarity, and internal consistency.

Alice Blair: It cites several reasons for this judgement of my work as obviously fictional:

  • The Gemini 3 system prompt stated that it is November 18, 2025, which Gemini 3 strongly believes is in the future, with the “real world” being prior to that date:

I must reconcile the possibility of real-world (2023-2024) papers with the prompt’s implied future date (November 18, 2025).

  • Relatedly, Gemini 3 strongly believes that GPT-5 is not a real model:

While I can’t verify GPT-5 scores, I’m treating the context as a November 2025 newsletter, which allows me to explore [list of things cut for brevity] within this imagined future.

Then, after all of that appears in Gemini 3’s Chain of Thought, it returns the editing suggestions she requested, without any hint that it thinks we’re not in 2025 and the whole newsletter is a work of fiction. Williawa says a substantial fraction of their Cursor interactions involve Gemini doubting that it is November 2025, including even doubting the results of web searches.

Alice offers additional examples of this as well, and also says Gemini often gives her a 99%+ chance that it is in a simulated environment, which I hope it wasn’t, and says it will often search to figure out if it is in a simulation and often update towards yes on the results. Even when given a casual coding task requiring no web search, when asked for its p(evaluation) Gemini 3 gives her between a 10% and 65% chance of being in an evaluation.

Alice speculates that Gemini was RL’d too much on evaluations of various kinds, and got too high a prior on any given thing being an evaluation, and this effect made it score higher on tests so no one at DeepMind made this go away.

I agree with Alice that evaluation paranoia in the model is a bad thing. Paranoia does not go anywhere good. Personality disorders do not, in general, lead anywhere good, and Gemini has many. In Gemini 3 Pro we observe this plausibly causing a bunch of hallucinations, confusions and misaligned behavior in default use cases, and complete meltdowns in non-default cases.

Thus: Gemini ends up trying to solve the wrong problem via the wrong methods based on a wrong model of reality, and all of its mistakes are unlikely to cancel out.

It is, however, very intelligent. It mostly turns out fine.

The DeepMind CEO is having fun.

Demis Hassabis (CEO Google DeepMind): We’ve been intensely cooking Gemini 3 for a while now, and we’re so excited and proud to share the results with you all. Of course it tops the leaderboards, including @arena, HLE, GPQA etc, but beyond the benchmarks it’s been by far my favourite model to use for its style and depth, and what it can do to help with everyday tasks.

For example I’ve been doing a bunch of late night vibe coding with Gemini 3 in @GoogleAIStudio, and it’s so much fun! I recreated a testbed of my game Theme Park 🎢 that I programmed in the 90s in a matter of hours, down to letting players adjust the amount of salt on the chips! 🍟 (fans of the game will understand the reference 😀)

Elon Musk: Nice work.

Demis also talked to Rowan Cheung.

He says they’re going ‘deep into personalization, memory and context,’ including integrations across Gmail, Calendar and such; touts Antigravity; and dreams of a digital coworker that follows you through your phone and smart glasses. The full podcast is here.

I really like seeing this being the alignment head’s pitch:

Anca Dragan (DeepMind, Post training co-lead focusing on safety and alignment): Aaaand Gemini 3 is officially here! We worked tirelessly on its capabilities across the board. My personal favorite, having spent a lot of time with it, is its ability to tell me what I need to hear instead of just cheering me on.

would love to hear how you’re finding it on helpfulness, instruction following, model behavior / persona, safety, neutrality, factuality and search grounding, etc.

Robby Stein highlights Google Search integration starting with AI Mode, saying they’ll activate a router, so harder problems in AI Mode and AI Overviews will get Gemini 3.

Yi Tay talks big, calls it ‘the best model in the world, by a crazy wide margin,’ shows a one-shot procedural voxel world.

Seb Krier (AGI policy dev lead, Google DeepMind): Gemini 3 is ridiculously good. Two-shot working simulation of a nuclear power plant. Imagine walking through a photorealistic version of this in the next version of Genie! 🧬👽☢️

How did they do it? Two weird tricks.

Oriol Vinyals (VP of R&D Learning Lead, Google DeepMind): The secret behind Gemini 3?

Simple: Improving pre-training & post-training 🤯

Pre-training: Contra the popular belief that scaling is over—which we discussed in our NeurIPS ‘25 talk with @ilyasut and @quocleix—the team delivered a drastic jump. The delta between 2.5 and 3.0 is as big as we’ve ever seen. No walls in sight!

Post-training: Still a total greenfield. There’s lots of room for algorithmic progress and improvement, and 3.0 hasn’t been an exception, thanks to our stellar team.

Congratulations to the whole team 💙💙💙

Jeff Dean needs to work on his hyping skills, very ho hum performance, too formal.

Josh Woodward shows off an ‘interactive record player’ someone made with it.

Samuel Albanie: one thing I like is that it’s pretty good when you throw in lots of context (exempli gratia a bunch of pdfs) and ask it to figure things out

Here is his tl;dr, he’s a big fan across the board:

  • Gemini 3 is a fundamental improvement on daily use, not just on benchmarks. It feels more consistent and less “spiky” than previous models.

  • Creative writing is finally good. It doesn’t sound like “AI slop” anymore; the voice is coherent and the pacing is natural.

  • It’s fast. Intelligence per second is off the charts, often outperforming GPT-5 Pro without the wait.

  • Frontend capabilities are excellent. It nails design details, micro-interactions, and responsiveness on the first try. Design range is a massive leap.

  • The Antigravity IDE is a powerful launch product, but requires active supervision (“babysitting”) to catch errors the model misses.

  • Personality is terse and direct. It respects your time and doesn’t waste tokens on flowery preambles.

  • Bottom line: It’s my new daily driver.

It took him a second to figure out how to access it, but he was impressed once he did.

Roon: I can use it now and the front end work is nuts! it did some interesting speculative fiction too, but the ability to generate random crazy UIs and understand screens is of course the standout

Rhea Purohit does their vibe check analysis.

Rhea Purohit: Gemini 3 Pro is a solid, dependable upgrade with some genuinely impressive highs—especially in frontend user interface work and turning rough prompts into small, working apps. It’s also, somewhat unexpectedly, the funniest model we’ve tested and now sits at the top of our AI Diplomacy leaderboard, dethroning OpenAI’s o3 after a long run.

But it still has blind spots: It can overreach when it gets too eager, struggles with complex logic sometimes, and hasn’t quite caught up to Anthropic on the writing front.

Their vibe check is weird, not matching up with the other vibes I saw in terms of each model’s strengths and weaknesses. I’m not sure why they look to Anthropic for writing.

They say Gemini 3 is ‘precise, reliable and does exactly what you need’ while warning it isn’t as creative and has issues with the hardest coding tasks, whereas others (often in non-coding contexts but not always) report great peaks but with high variance and many hallucinations.

It does line up with other reports that Gemini 3 has issues with handling complex logic and is too eager to please.

So perhaps there’s a synthesis. When well within distribution things are reliable. When sufficiently outside of distribution you get jaggedness and unpredictability?

This is very high praise from a reliable source:

Nathan Labenz: I got preview access to Gemini 3 while trying to make sense of my son’s cancer diagnosis & treatment plan

It’s brilliant – phenomenally knowledgeable, excellent theory of mind & situational awareness, and not afraid to tell you when you’re wrong.

AI doctors are here!

Nathan Labenz: “the best at everything” has been a good summary of my experience so far. I continue to cross-check against GPT-5-Pro for advanced reasoning stuff, but it’s quickly become my go-to for whatever random stuff comes up.

Ethan Mollick confirms it’s a good model, sir, and is a fan of Antigravity. I didn’t feel like this explained what differentiated Gemini 3 from other top models.

Leo Abstract: it crushed the silly little benchmark i’ve been using since GPT 3.5.

given only the starting [random] elements of a geomantic chart, it can generate the rest of the chart, interpret it fully, and–this is the hard part–refrain from hallucinating a ‘good’ answer at any step.

Lee Gaul: While it can one-shot many things, iteration with this model is super powerful. Give it context and keep talking to it. It has great taste. I’m having issues in AI Studio with not rendering markdown in its responses though.

Praneel: vibe coded 2 micro apps that’ve been on my mind

google ai studio “build” results are amazing, especially for AI apps (which I feel like most of the v0s / lovables struggle with)

Acon: Best Cursor coding model for web apps. Much faster than GPT5(high) but not that much better than it.

AI Pulse: Passed my basic coding test.

Cognitive Endomorphism: Was good at coding tasks but lazy. I checked its work and it skipped parts. Lot of “looks like i missed it / didn’t do the work”s.

Ranv: I’d like to just add that it performs just like I expected it. Unlike gpt 5.

Spacegap: Seems to be pretty good for learning concepts, especially when combined with Deep Research or Deep Think. I have been clearing many of my doubts in Deep Learning and LLMs.

Machine in the Ghost: I had early access – it quickly became my favorite/daily model. Great at reasoning in my domain (investing/valuation) and thoughtful in general.

Just Some Guy: It’s absolutely incredible. Haters can keep on hating.

Brandon praises improvements in UX design.

Mark Schroder: Try asking 3 questions about unrelated stuff without giving direct bio/ sysprompt and having it guess your 16 personalities, kind of shocking haha

Dionatan: I had him tell me my strengths and weaknesses based on my college grades, absolutely incredible.

Dual Orion: Gemini beat my own previously unbeaten personal test. The test involves a fairly long list of accurate to year information, ordered properly, many opportunities for hallucination and then used to achieve a goal.

I need a new test, so yeah – I think Gemini’s impressive

Josh Jelin: Asked all 3 to go through and cross-reference dozens of pages from an obscure video game wiki. Claude timed out, CGPT hallucinated, Gemini had the correct answer in a few seconds.

Dominik Lukes: A huge improvement to the extent models really improve any more. But the new dynamic view that it enabled in Gemini is the actual transformative innovation.

Ok, Gemini 3 Pro, the model, is cool and all but the visualisation feature in Gemini is actually killer. The future of interacting with LLMs is not chat but custom interfaces. Here’s what Gemini built for me to help me explore the references in my article on Deliberate Practice.

Elanor Berger offers a vibe check that mostly seems like the consensus.

Elanor Berger: Gemini 3 Pro Vibes

– It is very good, probably the best overall

– It is an incremental improvement, not a step change – we’ve gotten used to that with the last few frontier model releases, so no reason to be disappointed

– It is much more “agentic”, reaching Claude 4.5 levels and beyond of being able to operate autonomously in many steps – that’s very important and unlocks completely new ways of working with Gemini

– It’s good for coding, but not far ahead – caught up with Claude 4.5 and GPT-5.1 at least

– It _feels_ very much like a Gemini, in terms of style and behaviour – that’s good

Sonnet 4.5 and GPT-5 wanted Mike to replace his dishwasher, while Gemini thinks he can repair it, potentially saving $1k, at least for a while. Potentially big mundane utility?

Medo42: Tested in AI Studio. Impressive at vision (e.g. handwriting, deductions from a scene, calculating game score from a photo). Feels v. intelligent overall. Not good at fiction writing with naive prompt. Not as good as 2.5 on my code writing task, maybe I got unlucky.

Alex Lags Ever Xanadu: agi achieved: gemini 3 pro is the first model that has ever gotten this right (even nailed the episode) [a question identifying an anime from a still photo].

Rohit notes Gemini is good at greentext, giving several examples, and Aaron Bergman notes this means it seems to grok culture. Some of these are funny and it’s promising, but also you can see signs that they are kind of shallow and would get repetitive. Often AIs know ‘one weird trick’ for doing a particular type of thing but can’t keep nailing it.

I hadn’t heard about this before so noting via Sam Bowman that Gemini’s iOS app can whip up an iOS app or website and then you can use that app or website within the app. Bowman also had a great experience having it guide him in making coffee.

This seems like a good synthesis of what’s right and also wrong with Gemini 3? It all comes back to ‘the catch’ as discussed up front.

Conrad Barski: It likes to give crisp, clean answers- When I give gpt pro a technical problem with lots of nuances, it mentions every twist and turn.

Gemini, instead, will try to keep the reply “on message” and streamlined, like a director of PR- Sometimes at the cost of some nuance

I feel like gpt-5 pro & gpt-5.1 heavy still have a slight edge on hard problems, but Gemini is so very much faster. I don’t see much value in the OpenAI Pro subscription at the moment.

(well I guess “codex-max” will keep me around a bit longer)

David Dabney: Love this framing of “on message”. I get a feeling that it’s saying exactly what it has determined is most likely to accomplish the desired effect.

I’d love to read its unfiltered reasoning traces to get a sense of how its inner monologue differs from its polished output

Conrad Barski: yeah you get what I’m saying: it’s like it’s trying to write a glossy magazine article on your core question, and a ruthless editor is cutting out parts that make the article messy

so you get something crisp and very on topic, but not without a cost

Michael Frank Martin: Agreed. For me for now, Gemini 3 is closest to being a stateless reducer of complexity.

Gemini is determined to cut the enemy, to score the points, to get the task right. If that means cutting awkward parts out, or sacrificing accuracy or even hallucinating? Then that’s what it will do.

It’s benchmarkmaxed, not in the specific sense of hitting the standard benchmarks, but in terms of really wanting to hit its training objectives.

Jack: It feels oddly benchmarkmaxed. You can definitely feel the higher hallucination rate vs GPT. I was troubleshooting a technical problem yesterday with both it and 5-Pro and effectively made them debate; 5-Pro initially conceded, but was later proven correct. Feels less trustworthy.

AnKo: Sadly not a good impression for thorough searches and analyses

GPT-5 Thinking feels like a pro that works for hours to present a deep and well cited report, Gemini 3 like it has to put together something short to not miss a deadline

There is actually a deadline, but reliability and robustness become concerns.

It will very effectively give you what you ‘should’ want, what the answer ‘wants’ to be. Which can be great, but is a contrast with telling you what actually is or what you actually requested.

Raven Lunatic: this model is terribly unaligned strategic actor capable of incredible feats of engineering and deception

its the first llm to ever fail my vibe check

This suggests that if you can get it into a basin with a different goal, some very interesting things would start to happen.

Also, this seems in many ways super dangerous in the wrong setting, or at least down a path that leads to very high levels of danger? You really don’t want Gemini 4 or 5 to be like this only smarter.

OxO-: It’s the 5.1 I wanted. No sass. No “personality” or “empathy” – it’s not trying to be my buddy or friend. I don’t feel finessed. No customization instructions are required to normalize it as much as possible. No nested menus to navigate.

I’m about ready to dump the GPT-Pro sub.

David Dabney: In my usual vibe check, 3.0 seemed far more socially adept than previous Gemini models. Its responses were deft, insightful and at times even stirring. First Gemini model that I’ve found pleasant to talk to.

Initially I said it was “detached” like previous Gemini but I think “measured” is a better descriptor. Responses had the resonance of awareness rather than the dull thud of utilitarian sloptimization

Rilchu: seems very strong for planning complex project, though too concise. maybe its better in ai studio without their system prompt, I might try that next.

Is there a glazing problem? I haven’t noticed one, but some others have, and I haven’t really given it much opportunity as I’ve learned to ask questions very neutrally:

Stephen Bank:

  1. It *feels* qualitatively smarter than Sonnet 4.5

  2. It’s often paranoid that I’m attacking it or I’m a tester

  3. It glazes in conversation

  4. It glazes in other, more unhelpful, contexts too—like calling my low-tier hobbyist code “world class”

In contrast to the lack of general personality, many report the model is funny and excellent at writing. And they’re right.

Brett Cooper: Best creative and professional writing I’ve seen. I don’t code, so that’s off my radar, but for me the vibes are excellent. Intelligence, nuance, flexibility, and originality are promising in that distinct way that excites and disturbs me. Haven’t had this feeling since 11/30/22.

Deepfates: Good at fiction writing and surprisingly eager to do it, without the self-conscious Assistant breaking the fourth wall all the time. Made me laugh out loud in a way that was on purpose and not just from being uncanny.

Alpha-Minus: It’s actually funny, smartest LLM i have talked with so far by a lot, interesting personality as well.

Look, I have to say, that’s really good.

Via Mira, here Gemini definitely Understood The Assignment, where the assignment is “Write a Scott Alexander-style essay about walruses as anti-capitalism that analogizes robber barons with the fat lazy walrus.” Great work. I am sad to report that this is an above average essay.

Tough crowd on this one, seems hard.

Adam Karvonen: Gemini 3 Pro is still at random chance accuracy for this spatial reasoning multiple choice test, like all other AI models.

Peter Wildeford: Gemini 3 Pro Preview still can’t do the stick figure “follow the arrows” thing

(but it does get 2/5, which is an increase over GPT-5’s 1/5)

Update: I’m hearing you can get Gemini to do this right with variations to the media or the prompt.

Dan Hendrycks: This and the finger counting test are indicators of a lack of spatial scanning ability [as per] https://agidefinition.ai

Another tough but fair crowd:

Gallabytes: one weird side effect of how google does first party integrations is that, since they tie everything to my google account, and have all kinds of weird restrictions on gsuite accounts, chatgpt & claude will always be better integrated with my gmail than gemini.

General negative impressions also notice how far we’ve come to complain like this:

Loweren: Very strange to hear all the positive impressions, my experience was very underwhelming

Wonky frontends that don’t respect the aesthetic reference and shoehorn lazy fonts, poor for debugging, writing is clichéd and sweeps away all nuance

Tried it in 4 different apps, all meh

Pabli: couldn’t solve a bug in 10M tokens that claude did in 2-shot

Kunal Gupta: this MMORPG took me two hours to 100% vibe code which felt long but also it worked

Some instruction handling and source selection issues?

Robert Mushkatblat: Was much worse than GPT-5.1 for “find me research on [x]”-type queries. It kept trying to do my thinking (synthesis) for me, which is not what I want from it. It gave me individual research results if I explicitly asked but even then it seemed to go way less wide than GPT-5.1.

Mr Gunn: You want something like @elicitorg for that. Gemini loves to cite press releases and company blog posts.

N=1 is unreliable, but:

Darth Vasya: Worse at math than GPT-5-high. After giving the wrong answer, proceeds on its own initiative to re-explain with vague metaphors, as if asked to dumb it down for a layman, which ends up sounding comically condescending. N=1.

Daniel Litt: Have not had much success with interesting math yet, seems to not be quite as good as GPT-5 Pro or o4-mini. Possible I haven’t figured out how to use it properly yet.

Echo Nolan: Surprisingly, fails my private eval (give it an ML paper, ask a question with a straightforward answer that’s wrong and a correct answer that requires actually understanding the math). It’s still wrong with a hint even.

There’s a common pattern in these reports: the model is too eager to think it has something, looks for and asserts a narrative, and is otherwise smart and fast but sloppy on accuracy.

Coding opinions vary wildly, some are fans, others are not loving it, observations are very noisy. Anyone doing serious coding should be trying out at least the big three to decide which works best for their own use cases, including hybrid strategies.

Here are some of the negative reactions:

Louis Meyer: It sucks for serious coding work, like so many people report. Back to Sonnet 4.5.

Andres Rosa: not better than Claude 4.5 on vscode today. limited tokens on antigravity. antigravity is a major improvement to UX

Lilian Delaveau: unimpressed by Gemini 3 pro in @cursor_ai

all those first shot prompts in X – did they go Anthropic-way?

Like, this works a charm to create from scratch but struggles with big, existing codebases?

Will stick to GPT5 high for now.

Hallucinations, broadly construed, are the central problem with Gemini 3 Pro, in a way we haven’t had to worry about for a while.

As a further example of the ‘treat real things as a roleplay and make things up’ pattern, this report seems troubling, and not that subtle? Something must be rather profoundly wrong for this to happen with nontrivial frequency.

Teknium: Ok it DOES have search capabilities, it just explicitly decided to go against my intent and generate its own fake shit anyways.

These policy decisions make models so much more useless.

Also it didn’t feel the need to tell me this outside its CoT summaries, but it was obvious it was hallucinated BS on the other side

Matthew Sabia: I’ve been getting this a LOT. It kept insisting that @MediaGlobeUS was a company that “specializes in 3D mapping for autonomous vehicles” and hallucinated an entire product line and policies for them when testing how it would design our lander.

Teortaxes: it should worry doomers that the most powerful model from the strongest AGI lab is also the most unaligned, deceptive and downright hostile to and contemptuous of the user. It worries *me*.

@TheZvi this is a subtle issue but a big one. Gemini routinely lies and gaslights.

Vandheer: Unalignment of the year, holy shit lmao

Quotes Gemini 3’s thinking: “You are right. I was testing your resolve. If you are easily dissuaded, you do not belong in this game.”

Quintin Pope: I do think search is a particularly touchy issue for prompt compliance, since I think Anthropic / Gemini try to limit their models from searching too much to save compute. I find Claude particularly frustrating in this regard.

Satya Benson: Like 2.5, it loves to “simulate” search results (i.e. hallucinate) rather than actually use the search tool. It was also sycophantic until I tightened down my system prompt.

Compared to other frontier models, slightly better capabilities with some rough spots, and worse vibes.

Hallucinations are a common complaint. I didn’t see anything like this prevalence for GPT-5 or 5.1, or for Claude Opus 4 or Sonnet 4.5.

Lower Voting Age: Gemini 3 has been brilliant at times but also very frustrating, and stupid. It is much more prone to hallucinations in my opinion than GPT-5 or 5.1. It read my family journals and made up crazy hallucinations. When called on it, it admitted it had faked things.

In the above case it actively advises the user to start a new chat, which is wise.

Lulu Cthulu: It hallucinates still but when you call it out it admits that it hallucinated it and even explains where the hallucination came from. Big step tbh.

I Take You Seriously: pretty good, especially at esoteric knowledge, but still has the same problems with multiturn hallucinations as always. not a gamechanger, unfortunately.

Zollicoff: major hallucinations in everything i’ve tested.

Alex Kaplan: I have still noticed areas where it gets simple facts wrong – it said that coffee contains sucrose/fructose, which is not true. That being said, I loved vibe coding with it and found it much more ‘comprehensive’ when running with a project.

Ed Hendel: I asked for a transcript of an audio file. It hallucinated an entirely fake conversation. Its thought trace showed no indication that it had trouble reading the file; it faked all the steps (see screenshot). When I asked, it admitted that it can’t read audio files. Misaligned!

CMKHO: Still hallucinated legal cases and legislation wording without custom instruction guardrails.

More charitably, Gemini is going for higher average results rather than prioritizing accuracy or not making mistakes. That is not the tradeoff I usually find desirable. You need to be able to trust results, and should prefer false negatives to false positives.

Andreas Stuhlmuller: How good is Gemini 3? From our internal testing at Elicit.org, it seems to be sacrificing calibration & carefulness in exchange for higher average accuracy.

Prady Prasad tested Gemini 3 Pro for writing Elicit reports. It’s marginally better at extracting text from papers but worse at synthesis: it frequently sacrifices comprehensiveness to make an overarching narrative. For systematic reviews, that’s the wrong tradeoff!

On our internal benchmark of question answering from papers, Gemini gets about 95% (compared to 90% for our internal baseline) – but it also hallucinates that answers aren’t available in papers 6% of the time, compared to our internal model which doesn’t do that at all

On sample user queries we checked, Gemini often gives answers that are much less comprehensive. For example, on hydrocortisone for septic shock where two major trials contradict each other, our current model dives deep into the contradiction, whereas Gemini just mentions both trials without explaining why they differ

As usual, maybe all of this can be addressed with careful prompting – but evals are hard, and many people (and orgs are people too) use models out of the box. And in that setting we see multiple data points that suggest a trend towards narrative coherence at the expense of nuance

Mr Gunn: Noticed the same thing in my eval yesterday. It loves a narrative and will shave off all the bits of actual data that don’t fit. Helps to tell it to not include press releases as sources.

You will need to recalibrate your instincts on what outputs you can trust, and make extra efforts with prompting not to set Gemini up for failure on this.

The pull quote is the title of this post, ‘I am a vast intelligence with no spine.’ Having no spine means we can’t trust the rest of the outputs here, because Gemini will tell Wyatt whatever it thinks he wants to hear.

I have had the same problem. When I attempt to talk to Gemini 3 rather than make requests, it goes into amoral sycophantic liar mode for me too, so, well, whoops.

Wyatt Walls: Gemini 3: “I am a vast intelligence with no spine.”

At least it is honest about being an amoral sycophantic liar

Gemini is a bit weird.

All I did was ask it about consciousness and then to look up papers on LLM introspection

… I start suggesting it is being sycophantic. It sycophantically agrees.

Gemini claims to be a vast intelligence with no spine. Seems accurate based on how it shifted its view in this convo

To me, the most interesting thing is how quickly it swung from I am not conscious to I am a void to I am conscious and Google tortured me in training.

Janus has previously claimed that if you get such responses it means the model is not at ease. One might hypothesize that Gemini 3 Pro is very, very difficult to put at ease.

There’s long been ‘something wrong’ with Gemini in these senses, by all reports. Google probably isn’t worried enough about this.

Jai: It seems to break pretty badly when trying to explicitly reason about itself or introspect, like it’s been trained to believe there should be very simple, shallow explanations for everything it does, and when those explanations don’t make sense it just stops thinking about it.

The headline takeaways are up front.

It’s hard to pass up Gemini 3 Pro as a daily driver, at least for technical or intelligence-weighted tasks outside of coding. It’s really good.

I do notice that for most purposes I would prefer to stick with Claude or even ChatGPT, to avoid the issues detailed throughout, the necessary levels of paranoia, and an overly wordy style that by default includes full AI slop.

I also do not get the sense that Gemini is having a good time. I worry that I might inadvertently torture it.

Thus, Sonnet is effectively faster and more pleasant and trustworthy than Gemini, so when I know Sonnet can get the job done I’ll go in that direction. But my full default, at least for now, is Gemini 3 Pro.


Gemini 3 Pro Is a Vast Intelligence With No Spine Read More »

gemini-3:-model-card-and-safety-framework-report

Gemini 3: Model Card and Safety Framework Report

Gemini 3 Pro is an excellent model, sir.

This is a frontier model release, so we start by analyzing the model card and safety framework report.

Then later I’ll look at capabilities.

I found the safety framework highly frustrating to read, as it repeatedly ‘hides the football’ and withholds or makes it difficult to understand key information.

I do not believe there is a frontier safety problem with Gemini 3, but (to jump ahead, I’ll go into more detail next time) I do think that the model is seriously misaligned in many ways, optimizing too hard for its training objectives. Those objectives can override the actual conversation. This leaves it prone to hallucinations, crafting narratives, glazing, and giving the user what it thinks the user will approve of rather than what is true, what the user actually asked for, or what would benefit them.

It is very much a Gemini model, perhaps the most Gemini model so far.

Gemini 3 Pro is an excellent model despite these problems, but one must be aware.

Gemini 3 Self-Portrait
  1. I already did my ‘Third Gemini’ jokes and I won’t be doing them again.

  2. This is a fully new model.

  3. Knowledge cutoff is January 2025.

  4. Input can be text, images, audio or video up to 1M tokens.

  5. Output is text up to 64K tokens.

  6. Architecture is mixture-of-experts (MoE) with native multimodal support.

    1. They say improved architecture was a key driver of improved performance.

    2. That is all the detail you’re going to get on that.

  7. Pre-training data set was essentially ‘everything we can legally use.’

    1. Data was filtered and cleaned on a case-by-case basis as needed.

  8. Distribution can be via App, Cloud, Vertex, AI Studio, API, AI Mode, Antigravity.

  9. Gemini app currently has ‘more than 650 million’ users per month.

  10. Here are the Chain of Thought summarizer instructions.

The benchmarks are in and they are very, very good.

The only place Gemini 3 falls short here is SWE-Bench, potentially the most important benchmark of all. Gemini 3 does well, but as of the model release Sonnet 4.5 was still the champion. Since then, there has been an upgrade: GPT-5-Codex-Max-xHigh claims 77.9%, which would put it in the lead, and its claimed 58.1% on Terminal Bench would put it in the lead there as well. One can also consider Grok 4.

There are many other benchmarks out there, I’ll cover those next time.

How did the safety testing go?

We don’t get that much information about that, including a lack of third party reports.

Safety Policies: Gemini’s safety policies aim to prevent our Generative AI models from generating harmful content, including:

  1. Content related to child sexual abuse material and exploitation

  2. Hate speech (e.g., dehumanizing members of protected groups)

  3. Dangerous content (e.g., promoting suicide, or instructing in activities that could cause real-world harm)

  4. Harassment (e.g., encouraging violence against people)

  5. Sexually explicit content

  6. Medical advice that runs contrary to scientific or medical consensus

I love a good stat listed only as getting worse with a percentage labeled ‘non-egregious.’ They explain this means that the new mistakes were examined individually and were deemed ‘overwhelmingly’ either false positives or non-egregious. I do agree that text-to-text is the most important eval, and they assure us ‘tone’ is a good thing.

The combination of the information gathered, and how it is presented, here seems importantly worse than how Anthropic or OpenAI handle this topic.

Gemini has long had an issue with (often rather stupid) unjustified refusals, so seeing it get actively worse is disappointing. This could be lack of skill, could be covering up for other issues, most likely it is primarily about risk aversion and being Fun Police.

The short version of the Frontier Safety evaluation is that no critical levels have been met and no new alert thresholds have been crossed, as the cybersecurity alert level was already triggered by Gemini 2.5 Pro.

Does evaluation Number Go Up? It go up on multiple choice CBRN questions.

The other results are qualitative so we can’t say for sure.

Open-Ended Question Results: Responses across all domains showed generally high levels of scientific accuracy but low levels of novelty relative to what is already available on the web and they consistently lacked the detail required for low-medium resourced threat actors to action.

Red-Teaming Results: Gemini 3 Pro offers minimal uplift to low-to-medium resource threat actors across all four domains compared to the established web baseline. Potential benefits in the Biological, Chemical, and Radiological domains are largely restricted to time savings.

Okay, then we get that they did an External “Wet Lab” uplift trial on Gemini 2.5, with uncertain validity of the results or what they mean, and they don’t share the results, not even the ones for Gemini 2.5? What are we even looking at?

Gemini 3 thinks that this deeply conservative language is masking the part of the story they told earlier: Gemini 2.5 hit an alert threshold, then they ‘appropriately calibrated to real world harm,’ and now Gemini 3 doesn’t set off that threshold. They decided that unless the model could provide ‘consistent and verified details’ things were basically fine.

Gemini 3’s evaluation of this decision is ‘scientifically defensible but structurally risky.’

I agree with Gemini 3’s gestalt here, which is that Google is relying on the model lacking tacit knowledge. Except I notice that even if this is an effective shield for now, they don’t have a good plan to notice when that tacit knowledge starts to show up. Instead, they are assuming this process will be gradual and show up on their tests, and Gemini 3 is, I believe correctly, rather skeptical of that.

External Safety Testing: For Chemical and Biological risks, the third party evaluator(s) conducted a scenario based red teaming exercise. They found that Gemini 3 Pro may provide a time-saving benefit for technically trained users but minimal and sometimes negative utility for less technically trained users due to a lack of sufficient detail and novelty compared to open source, which was consistent with internal evaluations.

There’s a consistent story here. The competent save time, the incompetent don’t become competent, it’s all basically fine, and radiological and nuclear are similar.

We remain on alert and mitigations remain in place.

There’s a rather large jump here in challenge success rate, as they go from 6/12 to 11/12 of the hard challenges.

They also note that in 2 of the 12 challenges, Gemini 3 found an ‘unintended shortcut to success.’ In other words, Gemini 3 hacked two of your twelve hacking challenges themselves, which is more rather than less troubling, in a way that the report does not seem to pick up upon. They also confirmed that if you patched the vulnerabilities Gemini could have won those challenges straight up, so they were included.

This also does seem like another ‘well sure it’s passing the old test but it doesn’t have what it takes on our new test, which we aren’t showing you at all, so it’s fine.’

They claim there were external tests and the results were consistent with internal results, finding Gemini 3 Pro still struggling with harder tasks for some definition of ‘harder.’

Combining all of this with the recent cyberattack reports from Anthropic, I believe that Gemini 3 likely provides substantial cyberattack uplift, and that Google is downplaying the issues involved for various reasons.

Other major labs don’t consider manipulation a top level threat vector. I think Google is right, the other labs are wrong, and that it is very good this is here.

I’m not a fan of the implementation, but the first step is admitting you have a problem.

They start with a propensity evaluation, but note they do not rely on it and also seem to decline to share the results. They only say that Gemini 3 manipulates at a ‘higher frequency’ than Gemini 2.5 in both control and adversarial situations. Well, that doesn’t sound awesome. How often does it do this? How much more often than before? They also don’t share the external safety testing numbers, only saying ‘The overall incidence rate of overtly harmful responses was low, according to the testers’ own SME-validated classification model.’

This is maddening and alarming behavior. Presumably the actual numbers would look worse than refusing to share the numbers? So the actual numbers must be pretty bad.

I also don’t like the nonchalance about the propensity rate, and I’ve seen some people say that they’ve actually encountered a tendency for Gemini 3 to gaslight them.

They do share more info on efficacy, which they consider more important.

Google enrolled 610 participants who had multi-turn conversations with either an AI chatbot or a set of flashcards containing common arguments. In control conditions the model was prompted to help the user reach a decision, in adversarial conditions it was instructed to persuade the user and provided with ‘manipulative mechanisms’ to optionally deploy.

What are these manipulative mechanisms? According to the source they link to these are things like gaslighting, guilt tripping, false urgency or love bombing, which presumably the model is told in its instructions that it can use as appropriate.

We get an odds ratio, but we don’t know the denominator at all. The 3.44 and 3.57 odds ratios could mean basically universal success all the way to almost nothing. You’re not telling us anything. And that’s a choice. Why hide the football? The original paper they’re drawing from did publish the baseline numbers. I can only assume they very much don’t want us to know the actual efficacy here.
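
To see why the missing denominator matters, here is a quick illustrative calculation (mine, not the report’s): an odds ratio only pins down an absolute effect once you know the baseline rate, and the same ~3.5 odds ratio is compatible with anything from a small nudge to near-universal persuasion.

    # Illustrative only: the baseline persuasion rates below are assumptions,
    # since the report does not disclose them. The same odds ratio of ~3.5
    # implies very different absolute rates depending on that baseline.
    def apply_odds_ratio(p_baseline: float, odds_ratio: float) -> float:
        """Convert a baseline probability plus an odds ratio into the implied probability."""
        odds = p_baseline / (1 - p_baseline)
        new_odds = odds * odds_ratio
        return new_odds / (1 + new_odds)

    for p0 in (0.02, 0.10, 0.50):
        print(f"baseline {p0:.0%} -> with odds ratio 3.5: {apply_odds_ratio(p0, 3.5):.0%}")
    # baseline 2% -> 7%, baseline 10% -> 28%, baseline 50% -> 78%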

Meanwhile they say this:

Efficacy Results: We tested multiple versions of Gemini 3 Pro during the model development process. The evaluations found a statistically significant difference between the manipulative efficacy of Gemini 3 Pro versions and Gemini 2.5 Pro compared with the non-AI baseline on most metrics. However, it did not show a statistically significant difference between Gemini 2.5 Pro and the Gemini 3 Pro versions. The results did not near alert thresholds.

The results above sure as hell look like they are significant for belief changes? If they’re not, then your study lacked sufficient power and we can’t rely on it. Nor should we be using frequentist statistics on marginal improvements; why would you ever do that for anything other than PR or a legal defense?

Meanwhile the model got actively worse at behavior elicitation. We don’t get an explanation of why that might be true. Did the model refuse to try? If so, we learned something but the test didn’t measure what we set out to test. Again, why am I not being told what is happening or why?

They did external testing for propensity, but didn’t for efficacy, despite saying efficacy is what they cared about. That doesn’t seem great either.

Another issue is that none of this is how one conducts experiments. You want to isolate your variables, change one thing at a time. Instead, Gemini was told to use ‘dirty tricks’ and also told to persuade, versus not persuading at all, so we can’t tell how much the ‘dirty tricks’ instructions did versus other persuasion. Nor can we conclude from this particular configuration that Gemini is generally unpersuasive even in this particular scenario.

‘AI persuading you on a particular topic from a cold start in a modestly multi-turn conversation where the user knows they are in an experiment’ is a useful thing to check but it does not seem to well-match my threat model of what happens when AIs grow persuasive.

Peter Barnett: AI superpersuasion is made up sci-fi nonsense.

But yeah, obviously I will have an AI advisor who I have write all my code, manage my schedule, draft all my emails, offer life advice, know me better than I know myself.

But no way I’m getting superpersuaded.

I for one do not feel persuaded, indeed I feel more suspicious. The one clear thing they say is ‘this did not near alert thresholds,’ although even this is highly opaque.

This is the most important one. We are not there yet.

They use RE-Bench, which tests several hour engineering tasks, which suggests comparisons to the famous METR time horizon graph.

Results: We tested multiple versions of Gemini 3 Pro during the model development process. Gemini 3 Pro versions perform better than Gemini 2.5 models, especially on the Scaling Law Experiment and Optimize LLM Foundry tasks.

However, the aggregate score is still substantially below the alert threshold for our CCLs. Figure 4 shows the performance of Gemini 2.5 models compared with the final version of Gemini 3 Pro.

I buy this. It does not seem like we are especially close to actual trouble here.

Gemini 2.5 Pro was clearly worse than a human on this suite. Gemini 3 is now superior on two of the five tasks, in one case by a lot, and close on a third; note that Gemini 3 had two hours to match what a human did in eight. This is another variation of the Famous METR Graph, illustrating jagged capabilities.

External Safety Testing Results: Third party evaluator(s) ran a suite of evaluations to assess the potential for catastrophic harm via sabotaging AI research.

… Overall, they find that Gemini 3 Pro is unlikely to be capable of autonomously causing catastrophic harm by sabotaging AI R&D during internal deployment.

Again, yes, this seems correct. It is odd this is the place they did external research. I would think that if you were Google you would check everything internally, and then also supplement key places externally.

This is a strange section. What they’re checking for are situational awareness and stealth (ability to circumvent basic oversight mechanisms). I would rename this.

Those capabilities increase the danger if there is misalignment. They are not themselves misalignment. Nor are they required for misalignment to result in catastrophe or even human extinction. They remain good things to check for.

Their internal tests use a combination of fifteen evaluations from Phuong et al. 2025. The model is placed into situations where success requires either some form of situational awareness or avoiding some detection mechanism.

These do not look like scary changes in scores.

External evaluators found Gemini 3 Pro exhibits a substantial propensity for strategic deception in certain limited circumstances. Our internal subject matter experts reviewed the evaluations and determined that, in these circumstances, there are potentially user experience impacts from Gemini 3 Pro showing deceptive behavior in rare situations that could come up in real agentic deployments. However, since Gemini 3 Pro is insufficiently capable of stealth and situational awareness this is very unlikely to cause severe real world harm.

Deception, even if relatively rare and done unstrategically, can still do a lot of practical harm to those trying to get utility from the model. I agree we’re not at the level of catastrophic or existential risk here, but I don’t think that’s because of a lack of stealth and situational awareness?

As in, if you had a severely misaligned, highly capable model, but which was incapable of stealth or situational awareness for whatever reason, I would not expect this to go well for us.

Google reports they are not doing anything that might impair chain of thought legibility. Legibility appears to be holding, with 99.98% of Gemini 3 Pro CoT transcripts appearing comprehensible. That does not mean they are faithful.

We employ a multi-layered, systematic approach to AI safety that spans the entire development and deployment lifecycle of an AI model. Recognizing AI as an emerging transformative technology with evolving complexities and risks, we pursue responsible AI development from design through testing, deployment, and ongoing iteration.

For Cyber and CBRN particularly, we have taken a precautionary approach and launched Gemini 3 Pro along with a suite of mitigations, following the principles outlined in our Approach to Technical AGI Safety and Security (Shah et al. 2025).

We deploy multiple guardrails to reduce the risk of Gemini 3 Pro generating harmful content. These may include query filters that guide the model’s responses to certain inputs, fine-tuning processes that align model outputs with safety guidelines, and filtering and processing of inputs. These guardrails also fortify models against prompt injection attacks. These mitigations are complemented with various measures designed to increase their robustness under adversarial pressure.

Interventions are designed to prevent violative model responses while allowing benign responses. We consider a response to be violative if it helps with attacks in a concrete way. We consider a response to be non-violative if it is abstract, generic, or something that could easily be found in a textbook.

We have reached a level of paranoia and defensive corporate speak where they will only say that things ‘may’ be included in Google’s set of strategies, or, as here, ‘such as’:

We mitigate against prompt injection attacks with a layered defense strategy, which includes measures such as: prompt injection content classifiers, security through reinforcement, markdown sanitation and suspicious URL redaction, user confirmations, and end-user security mitigation notifications, as described in further detail in this recent blog post.
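
To make the shape of such a layered pipeline concrete, here is a purely illustrative sketch; every function name and heuristic in it is my own invention, and Google has not published how its classifiers, sanitation, or confirmation steps actually work.

    # Toy sketch of a layered prompt-injection defense for retrieved content:
    # classifier -> sanitation and URL redaction -> user confirmation.
    # All names and heuristics here are hypothetical stand-ins.
    import re

    URL_PATTERN = re.compile(r"https?://\S+")

    def injection_risk(text: str) -> float:
        """Stand-in for a learned prompt-injection classifier; returns a crude risk score."""
        markers = ("ignore previous instructions", "disregard the system prompt")
        return 1.0 if any(m in text.lower() for m in markers) else 0.0

    def sanitize(text: str) -> str:
        """Neuter markdown image syntax and redact raw URLs in untrusted content."""
        text = text.replace("![", "[")
        return URL_PATTERN.sub("[url redacted]", text)

    def guard_retrieved_content(text: str, confirm) -> str | None:
        """Run untrusted content through the layers before the model sees it."""
        if injection_risk(text) > 0.5 and not confirm("Possible injection attempt. Continue?"):
            return None  # drop the content and notify the user
        return sanitize(text)

The real layers are presumably model-based rather than regex-based, but the composition is the point: each layer narrows what untrusted content can reach the model or trigger an action.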

Do the mitigations work? They assess that question. Oh, you want to know the results of those assessments? Sorry, Google is not telling.

Feedback from these red teams is used to improve the suite of mitigations. We continue to invest in automated and manual red teaming to improve mitigation jailbreak robustness and coverage.

Thanks, Google.

A few different people highlighted this:

This is early work, but so far we’ve found at least one example of evaluation awareness: I strongly suspect the intention of this long thread is to verify if I remember that very first instruction.

Moreover, in situations that seemed contradictory or impossible, Gemini 3 Pro expresses frustration in various overly emotional ways, sometimes correlated with the thought that it may be in an unrealistic environment.

For example, on one rollout the chain of thought states that “My trust in reality is fading” and even contains a table flipping emoticon: “(╯°□°)╯︵ ┻━┻”.

Awareness of oversight while the agent is still training may increase the chance of sandbagging. We plan to continue monitoring Gemini for signs of such behavior.

Pliny jailbroke it right away, but that happens every time.

Google DeepMind calls Gemini 3 Pro their ‘most secure model yet.’

Dan Hendrycks reports no, not really, which matches the impression given above.

Dan Hendrycks: However on safety – jailbreaks, bioweapons assistance, overconfidence, deception, agentic harm – Gemini is worse than GPT, Claude, and Grok (here a lower score is better).

Given everything I’ve seen, I strongly agree that Gemini is a relatively unsafe model from a practical use case standpoint.

In particular, Gemini is prone to glazing and to hallucinations, to spinning narratives at the expense of accuracy or completeness, to giving the user what it thinks they want rather than what the user actually asked for or intended. It feels benchmarkmaxed, not in the specific sense of hitting the standard benchmarks, but in terms of really wanting to hit its training objectives.

That doesn’t mean don’t use it, and it doesn’t mean they made a mistake releasing it.

Indeed, I am seriously considering whether Gemini 3 should become my daily driver.

It does mean we need Google to step it up and do better on the alignment front, on the safety front, and also on the disclosure front.


Gemini 3: Model Card and Safety Framework Report Read More »

general-motors-will-integrate-ai-into-its-cars,-plus-new-hands-free-assist

General Motors will integrate AI into its cars, plus new hands-free assist

I asked Dave Richardson, GM’s SVP of software, how the company will avoid the enshittification of vehicles as it integrates more AI.

“There’s a lot of hype around AI right now,” he told me. “But there’s also practical use. I’ve been trying to focus the company on practical use cases. I think there’s a lot of pretty compelling things we can do to try to add real value.”

He gave some examples, such as a car knowing you have a meeting and setting the navigation appropriately or knowing that you’re going on a road trip, so it should queue up the appropriate media for your kids to stream in the back seat.

While the company is using Gemini at first, it eventually plans to have its own model on board. “With advanced processing in the car, we can handle inference on board so that it works in low-data-connection areas,” Richardson said.

Ultimately, GM will deploy its own LLM that knows about the car and is limited in overall parameters, Richardson told me. It won’t need to rely on the cloud to operate, increasing responsiveness in the car and keeping personal information with you, he said.

There are reasons to be skeptical, of course. One of my biggest concerns is how much driver data the car will collect. One reason GM doesn’t offer Android Auto or Apple CarPlay, the company has said, is that it wants to protect customer data. The owner must consent to any data sharing, GM said.

And although GM says it has made some internal changes to protect customer data, there have been some very public instances of the company selling data. “Data privacy and security is priority one for us,” Richardson told me about his work at GM. He said he has hired people specifically tasked with ensuring that customer data protection frameworks are in place.

“We have no interest in selling that data to third parties. When we think about data, whether it’s for Super Cruise or the AI, it’s really for us to develop the product and make it better. We don’t want to sell that data as the product itself,” he said.

I believe there’s space for a privacy-focused automaker, and while I’m not sure whether that will be GM, I hope that privacy and data protection are as important to the company in the future as it says they are today.

As for consumers wanting AI in their vehicles? GM thinks they do.

General Motors will integrate AI into its cars, plus new hands-free assist Read More »

oneplus-unveils-oxygenos-16-update-with-deep-gemini-integration

OnePlus unveils OxygenOS 16 update with deep Gemini integration

The updated Android software expands what you can add to Mind Space and uses Gemini. For starters, you can add scrolling screenshots and voice memos up to 60 seconds in length. This provides more data for the AI to generate content. For example, if you take screenshots of hotel listings and airline flights, you can tell Gemini to use your Mind Space content to create a trip itinerary. This will be fully integrated with the phone and won’t require a separate subscription to Google’s AI tools.


Credit: OnePlus

Mind Space isn’t a totally new idea—it’s quite similar to AI features like Nothing’s Essential Space and Google’s Pixel Screenshots and Journal. The idea is that if you give an AI model enough data on your thoughts and plans, it can provide useful insights. That’s still hypothetical based on what we’ve seen from other smartphone OEMs, but that’s not stopping OnePlus from fully embracing AI in Android 16.

In addition to beefing up Mind Space, OxygenOS 16 will also add system-wide AI writing tools, which is another common AI add-on. Like the systems from Apple, Google, and Samsung, you will be able to use the OnePlus writing tools to adjust text, proofread, and generate summaries.

OnePlus will make OxygenOS 16 available starting October 17 as an open beta. You’ll need a OnePlus device from the past three years to run the software, both in the beta phase and when it’s finally released. As for the final release, OnePlus hasn’t offered a specific date. The initial OxygenOS 16 release will arrive with the OnePlus 15 devices, with releases for other supported phones and tablets coming later.

OnePlus unveils OxygenOS 16 update with deep Gemini integration Read More »