Author name: Mike M.


After a Witcher-free decade, CDPR still promises three sequels in six years

It’s been over 10 years since the launch of the excellent The Witcher 3: Wild Hunt, and nearly four years since the announcement of “the next installment in The Witcher series of video games.” Despite those long waits, developer CD Projekt Red is still insisting it will deliver the next three complete Witcher games in a short six-year window.

In a recent earnings call, CDPR VP of Business Development Michał Nowakowski suggested that a rapid release schedule would be enabled in no small part by the team’s transition away from its proprietary REDEngine to the popular Unreal Engine in 2022. At the time, CDPR said the transition to Unreal Engine would “elevate development predictability and efficiency, while simultaneously granting us access to cutting-edge game development tools.” Those considerations seemed especially important in the wake of widespread technical issues with the console versions of Cyberpunk 2077, which CDPR later blamed on REDEngine’s “in-game streaming system.”

“We’re happy with how [Unreal Engine] is evolving through the Epic team’s efforts, and how we are learning how to make it work within a huge open-world game, as [The Witcher 4] is meant to be,” Nowakowski said in the recent earnings call. “In a way, yes, I do believe that further games should be delivered in a shorter period of time—as we had stated before, our plan still is to launch the whole trilogy within a six-year period, so yes, that would mean we would plan to have a shorter development time between TW4 and TW5, between TW5 and TW6 and so on.”

Don’t start the clock just yet

To be clear, the “six-year period” Nowakowski is talking about here starts with the eventual release of The Witcher 4, meaning the developer plans to release two additional sequels within six years of that launch. And CDPR confirmed earlier this year that The Witcher 4 would not launch in 2026, extending the window for when we can expect all these promised new games.



Revisiting Jill of the Jungle, the last game Tim Sweeney designed

Boy, was 1992 a different time for computer games. Epic MegaGames’ Jill of the Jungle illustrates that as well as any other title from the era. Designed and programmed by Epic Games CEO Tim Sweeney, the game was meant to prove that console-style games of the original Nintendo era could work just as well on PCs. (Later, the onus of proof would often be in the reverse direction.)

Also, it had a female protagonist, which Sweeney saw as a notable differentiator at the time. That’s pretty wild to think about in an era of Tomb Raider‘s Lara Croft, Horizon Forbidden West‘s Aloy, Life is Strange‘s Max Caulfield, Returnal‘s Selene Vassos, Control‘s Jesse Faden, The Last of Us‘ Ellie Williams, and a seemingly endless list of others—to say nothing of the fact that many players of all genders seem to agree that the female protagonist options in Mass Effect and Cyberpunk 2077 are more compelling than their male alternatives.

As wacky as it is to remember that the idea of a female character was seen as exceptional at any point (and with the acknowledgement that this game was nonetheless not the first to do that), it’s still neat to see how forward-thinking Sweeney was in many respects—and not just in terms of cultural norms in gaming.

Gameplay to stand the test of time

Having been born in the early 80s to a computer programmer father, I grew up on MS-DOS games the way many kids did on Atari, Nintendo, or PlayStation. Even I’ll admit that, as much as I enjoyed the DOS platformers, they don’t hold up very well against their console counterparts. (Other genres are another story, of course.)

I know this is blasphemy for someone of my background and persuasion, but Commander Keen‘s weird, floaty controls are frustrating, and what today’s designers call “game feel” just isn’t quite right.



Four-inch worm hatches in woman’s forehead, wriggles to her eyelid

Creeping

For anyone enjoying—or at least trying to enjoy—Thanksgiving in America, you can be thankful that these worms are not present in the US; they are exclusive to the “Old World,” that is, Europe, Africa, and Asia, according to the Centers for Disease Control and Prevention. They’re often found in the Mediterranean region, but reports in recent years have noted that they seem to be expanding into new areas of Europe—particularly eastward and northward. In a report earlier this year on cases in Estonia, researchers noted that the worm is also emerging in Lithuania, Latvia, and Finland.

Researchers attribute the worm’s creep to climate change and globalization. But in another report this year of a case in Austria (thought to be acquired while the patient was vacationing in Greece), researchers also raised the possibility that the worms may be adapting to use humans as a true host. Researchers in Serbia suggested this in a 2023 case report, in which an infection led to microfilariae in the patient’s blood. The researchers speculated that such cases, considered rare, could be increasing.

For now, people in America have less to worry about. D. repens has not been found in the US, but it does have some relatives here that occasionally show up in humans, including D. immitis, the cause of dog heartworm, and D. tenuis. The latter can cause similar cases to D. repens, with worms wandering under the skin, particularly around the eye. So far, this worm has mainly been found in raccoons in Florida.

For those who do find a worm noodling through their skin, the outlook is generally good. Treatment includes surgical removal of the worm, which largely takes care of the problem, as well as anti-parasitic or antibiotic drugs to stamp out the infection and any co-infections. In the woman’s case, her symptoms disappeared after doctors pulled the worm from her eyelid.



Solar’s growth in US almost enough to offset rising energy use

If you add in nuclear, then the US has reached a grid that is 40 percent emissions-free over the first nine months of 2025. That’s up only about 1 percentage point compared to the same period the year prior. And because coal emits more carbon than natural gas, it’s likely the US will see a net increase in electricity-related emissions this year.

If you would like to have a reason to feel somewhat more optimistic, however, the EIA used the new data to release an analysis of the state of the grid in California, where the production from utility-scale solar has nearly doubled over the last five years, thanks in part to another 17 percent increase so far in 2024.

Through 2023, it was tough to discern any impact of that solar production on the rest of the grid, in part due to increased demand. But since then, natural gas use has dropped considerably (it’s down by 17 percent so far in 2025), placing it at risk of being displaced by solar as the largest source of electricity in California as early as next year. This displacement is happening even as California’s total consumption jumped by 8 percent so far in 2025 compared to the same period last year.

Image of three graphs representing spring electrical use over the last five years. All show a large green bump representing solar generation peaking at mid-day. A blue line representing battery use is flat on the left, develops wiggles in the middle, then develops into a curve where energy is drawn in during the day and released in the evening.

Massive solar growth plus batteries means less natural gas on California’s grid. Credit: US EIA

The massive growth in solar has also led to overproduction of power in the spring and autumn, when heating/cooling demands are lowest. That, in turn, has led to a surge in battery construction to absorb the cheap power and sell it back after the Sun sets. The impact of batteries was nearly impossible to discern as recently as 2023, but data from May and June of 2025 shows batteries pulling in lots of power at mid-day, and using it in the early evening to completely offset what would otherwise be an enormous surge in the use of natural gas.
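
To make that arbitrage concrete, here is a toy dispatch model. It is just arithmetic under invented numbers (not the EIA’s data or any real grid model), showing how batteries soak up mid-day overproduction and shave the evening gas ramp:

```python
# Toy model of solar-plus-battery load shifting on a grid.
# All numbers are invented for illustration; real dispatch is far more complex.
hours  = ["06:00", "09:00", "12:00", "15:00", "18:00", "21:00"]
solar  = [1, 9, 17, 16, 4, 0]      # GW of utility-scale solar generation
demand = [14, 15, 14, 15, 22, 20]  # GW of total demand

stored = 0.0  # energy held by batteries (simplified, lossless)
gas = []
for s, d in zip(solar, demand):
    residual = d - s
    if residual < 0:        # mid-day overproduction: charge the batteries
        stored += -residual
        residual = 0
    elif stored > 0:        # later: discharge before burning gas
        sent = min(stored, residual)
        stored -= sent
        residual -= sent
    gas.append(residual)

print(dict(zip(hours, gas)))
# Without batteries, the 12:00/15:00 surplus would be curtailed and the
# 18:00 block would need 18 GW of gas; here the stored 4 GW cuts it to 14.
```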

Not every state has the sorts of solar resources available to California. But the economics of solar power suggest that other states are likely to experience this sort of growth in the coming years. And, while the Trump administration has been openly hostile to solar power from the moment it took office, so far there is no sign of that hostility at the grid level.



Valve’s Steam Machine looks like a console, but don’t expect it to be priced like one

After Valve announced its upcoming Steam Machine living room box earlier this month, some analysts suggested to Ars that Valve could and should aggressively subsidize that hardware with “loss leader” pricing that leads to more revenue from improved Steam software sales. In a new interview with YouTube channel Skill Up, though, Valve’s Pierre-Loup Griffais ruled out that kind of console-style pricing model, saying that the Steam Machine will be “more in line with what you might expect from the current PC market.”

Griffais said the AMD Zen 4 CPU and RDNA3 GPU in the Steam Machine were designed to outperform the bottom 70 percent of machines that opt in to Valve’s regular hardware survey. And Steam Machine owners should expect to pay roughly what they would for desktop hardware with similar specs, he added.

“If you build a PC from parts and get to basically the same level of performance, that’s the general price window that we aim to be at,” Griffais said.

The new comments follow similar sentiments relayed by Linus Sebastian on a recent episode of his WAN Show podcast. Sebastian said that, when talking to Valve representatives at a preview event, he suggested that a heavily subsidized price point would make the Steam Machine hardware into “a more meaningful product.” But when he suggested that he was imagining a console-style price in the range of $500, “nobody said anything, but the energy of the room wasn’t great.”

Forget about $500

Based on these comments, we could start estimating a potential Steam Machine price range by speccing out a comparable desktop machine. That would likely require building around a Ryzen 5 7600X CPU and Radeon RX 7600 GPU, which would probably push the overall build into the $700-plus range. That would make the Steam Machine competitive with the pricey PS5 Pro, even though some estimates price out the actual internal Steam Machine components in the $400 to $500 range.



Pornhub is urging tech giants to enact device-based age verification


The company is pushing for an alternative way to keep minors from viewing porn.

In letters sent to Apple, Google, and Microsoft this week, Pornhub’s parent company urged the tech giants to support device-based age verification in their app stores and across their operating systems, WIRED has learned.

“Based on our real-world experience with existing age assurance laws, we strongly support the initiative to protect minors online,” reads the letter sent by Anthony Penhale, chief legal officer for Aylo, which owns Pornhub, Brazzers, Redtube, and YouPorn. “However, we have found site-based age assurance approaches to be fundamentally flawed and counterproductive.”

The letter adds that site-based age verification methods have “failed to achieve their primary objective: protecting minors from accessing age-inappropriate material online.” Aylo says device-based authentication is a better solution for this issue because once a viewer’s age is determined via phone or tablet, that age signal can be shared with adult sites through an application programming interface (API).
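
Aylo’s letters don’t spell out a technical design, but the flow the company describes would look roughly like the sketch below, where the operating system verifies age once at setup and sites receive only a yes/no signal. Every name here is hypothetical; no operating system currently exposes such an API:

```python
# Hypothetical sketch of device-based age verification as Aylo proposes it.
# All names are invented; no real OS exposes this exact API.
from dataclasses import dataclass

@dataclass
class AgeSignal:
    is_adult: bool  # a bare yes/no: no birth date, ID scan, or identity
    issuer: str     # which OS vendor performed the one-time verification

def get_device_age_signal() -> AgeSignal:
    """Stand-in for an OS call; the user proved their age once, at setup."""
    return AgeSignal(is_adult=True, issuer="example-os")

def handle_site_visit(site: str) -> str:
    """An adult site queries the device instead of collecting IDs itself."""
    signal = get_device_age_signal()
    if not signal.is_adult:
        return f"{site}: access blocked (device reports a minor)"
    return f"{site}: access granted via {signal.issuer} age signal"

print(handle_site_visit("adult-site.example"))
```

The privacy argument is that the ID check happens once, with the OS vendor, and sites only ever see the boolean.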

The letters were sent following the continued adoption of age verification laws in the US and UK, which require users to upload an ID or other personal documentation to verify that they are not a minor before viewing sexually explicit content; often this requires using third-party services. Currently, 25 US states have passed some form of ID verification, each with different provisions.

Pornhub has experienced an enormous dip in traffic as a result of its decision to pull out of most states that have enacted these laws. The platform was one of the few sites to comply with the new law in Louisiana, but doing so caused traffic to drop by 80 percent. Similarly, since the implementation of the UK’s Online Safety Act, Pornhub has lost nearly 80 percent of its UK viewership.

The company argues that it’s a privacy risk to leave age verification up to third-party sites and that people will simply seek adult content on platforms that don’t comply with the laws.

“We have seen an exponential surge in searches for alternate adult sites without age restrictions or safety standards at all,” says Alex Kekesi, vice president of brand and community at Pornhub.

She says she hopes the tech companies and Aylo are able to find common ground on the matter, especially given the recent passage of the Digital Age Assurance Act (AB 1043) in California. “This is a law that’s interesting because it gets it almost exactly right,” she says. Signed into law in October, it requires app store operators to authenticate user ages before download.

According to Google spokesperson Karl Ryan, “Google is committed to protecting kids online, including by developing and deploying new age assurance tools like our Credential Manager API that can be used by websites. We don’t allow adult entertainment apps on Google Play and would emphasize that certain high-risk services like Aylo will always need to invest in specific tools to meet their own legal and responsibility obligations.”

Microsoft declined to comment, but pointed WIRED to a recent policy recommendation post that said “age assurance should be applied at the service level, target specific design features that pose heightened risks, and enable tailored experiences for children.”

Apple likewise declined to comment and instead pointed WIRED to its child online safety report and noted that web content filters are turned on by default for every user under 18. A software update from June specified that Apple requires kids who are under 13 to have a kid account, which also includes “app restrictions enabled from the beginning.” Apple currently has no way of requiring every single website to integrate an API.

According to Pornhub, age verification laws have led to ineffective enforcement. “The sheer volume of adult content platforms has proven to be too challenging for governments worldwide to regulate at the individual site or platform level,” says Kekesi. Aylo claims device-based age verification that happens once, on a phone or computer, will preserve user privacy while prioritizing safety.

Recent studies by New York University and public policy nonprofit the Phoenix Center suggest that current age verification laws don’t work because people find ways to circumvent them, including by using VPNs and turning to sites that don’t regulate their content.

“Platform-based verification has been like Prohibition,” says Mike Stabile, director of public policy at the Free Speech Coalition. “We’re seeing consumer behavior reroute away from legal, compliant sites to foreign sites that don’t comply with any regulations or laws. Age verification laws have effectively rerouted a massive river of consumers to sites with pirated content, revenge porn, and child sex abuse material.” He claims that these laws “have been great for criminals, terrible for the legal adult industry.”

Age verification and the broader deanonymization of the internet are issues that will now face nearly everyone, but especially those who are politically disfavored. Sex workers have been dealing with issues like censorship and surveillance online for a long time. One objective of Project 2025, MAGA’s playbook for President Trump’s second term, has been to “back door” a national ban on porn through state laws.

The current surge of child protection laws around the world is driving a significant change in how people engage with the internet, and is also impacting industries beyond porn, including gaming and social media. Starting December 10 in Australia, in accordance with the government’s social media ban, kids under 16 will be kicked off Facebook, Instagram, and Threads.

Ultimately, Stabile says that may be the point. In the US, “the advocates for these bills have largely fallen into two groups: faith-based organizations that don’t believe adult content should be legal, and age verification providers who stand to profit from a restricted internet.” The goal of faith-based organizations, he says, is to destabilize the adult industry and dissuade adults from using it, while the latter works to expand their market as much as possible, “even if that means getting in bed with right-wing censors.”

But the problem is that “even well-meaning legislators advancing these bills have little understanding of the internet,” Stabile adds. “It’s much easier to go after a political punching bag like Pornhub than it is Apple or Google. But if you’re not addressing the reality of the internet, if your legislation flies in the face of consumer behavior, you’re only going to end up creating systems that fail.”

Adult industry insiders I spoke to in August explained that the biggest misconception about the industry is that it is against self-regulation, when that couldn’t be further from the truth. “Keeping minors off adult sites is a shared responsibility that requires a global solution,” Kekesi says. “Every phone, tablet, or computer should start as a kid-safe device. Only verified adults should unlock access to things like dating apps, gambling, or adult content.” In 2022, Pornhub created a chatbot that urges people searching for child sexual abuse content to seek counseling; the tool was introduced following a 2020 New York Times investigation that alleged the platform had monetized videos showing child abuse. Pornhub has since started releasing annual transparency reports and tightened its verification processes for performers and video uploads.

According to Politico, Google, Meta, OpenAI, Snap, and Pinterest all supported the California bill. Right now that law is limited to California, but Kekesi believes it can work as a template for other states.

“We obviously see that there’s kind of a path forward here,” she says.

This story originally appeared at WIRED.com




How to know if your Asus router is one of thousands hacked by China-state hackers

Thousands of Asus routers have been hacked and are under the control of a suspected China-state group that has yet to reveal its intentions for the mass compromise, researchers said.

The hacking spree is either primarily or exclusively targeting seven models of Asus routers, all of which are no longer supported by the manufacturer, meaning they no longer receive security patches, researchers from SecurityScorecard said. So far, it’s unclear what the attackers do after gaining control of the devices. SecurityScorecard has named the operation WrtHug.

Staying off the radar

SecurityScorecard said it suspects the compromised devices are being used similarly to those found in ORB (operational relay box) networks, which hackers use primarily to conceal their identities while conducting espionage.

“Having this level of access may enable the threat actor to use any compromised router as they see fit,” SecurityScorecard said. “Our experience with ORB networks suggests compromised devices will commonly be used for covert operations and espionage, unlike DDoS attacks and other types of overt malicious activity typically observed from botnets.”

Compromised routers are concentrated in Taiwan, with smaller clusters in South Korea, Japan, Hong Kong, Russia, central Europe, and the United States.

A heat map of infected devices.

The Chinese government has been caught building massive ORB networks for years. In 2021, the French government warned national businesses and organizations that APT31—one of China’s most active threat groups—was behind a massive attack campaign that used hacked routers to conduct reconnaissance. Last year, at least three similar China-operated campaigns came to light.

Russian-state hackers have been caught doing the same thing, although not as frequently. In 2018, Kremlin actors infected more than 500,000 small office and home routers with sophisticated malware tracked as VPNFilter. A Russian government group was also independently involved in one of the 2024 router-hacking campaigns mentioned above.



Gemini 3: Model Card and Safety Framework Report

Gemini 3 Pro is an excellent model, sir.

This is a frontier model release, so we start by analyzing the model card and safety framework report.

Then later I’ll look at capabilities.

I found the safety framework highly frustrating to read, as it repeatedly ‘hides the football’ and withholds or makes it difficult to understand key information.

I do not believe there is a frontier safety problem with Gemini 3, but (to jump ahead, I’ll go into more detail next time) I do think that the model is seriously misaligned in many ways, optimizing too much towards achieving training objectives. The training objectives can override the actual conversation. This leaves it prone to hallucinations, crafting narratives, glazing and to giving the user what it thinks the user will approve of rather than what is true, what the user actually asked for or would benefit from.

It is very much a Gemini model, perhaps the most Gemini model so far.

Gemini 3 Pro is an excellent model despite these problems, but one must be aware.

Gemini 3 Self-Portrait
  1. I already did my ‘Third Gemini’ jokes and I won’t be doing them again.

  2. This is a fully new model.

  3. Knowledge cutoff is January 2025.

  4. Input can be text, images, audio or video up to 1M tokens.

  5. Output is text up to 64K tokens.

  6. Architecture is mixture-of-experts (MoE) with native multimodal support.

    1. They say improved architecture was a key driver of improved performance.

    2. That is all the detail you’re going to get on that.

  7. Pre-training data set was essentially ‘everything we can legally use.’

    1. Data was filtered and cleaned on a case-by-case basis as needed.

  8. Distribution can be via App, Cloud, Vertex, AI Studio, API, AI Mode, Antigravity.

  9. Gemini app currently has ‘more than 650 million’ users per month.

  10. Here are the Chain of Thought summarizer instructions.

The benchmarks are in and they are very, very good.

The only place Gemini 3 falls short here is SWE-Bench, potentially the most important one of all, where Gemini 3 does well but as of the model release Sonnet 4.5 was still the champion. Since then, there has been an upgrade, and GPT-5-Codex-Max-xHigh claims to be 77.9%, which would put it into the lead, and also 58.1% on Terminal Bench would put it into the lead there. One can also consider Grok 4.

There are many other benchmarks out there, I’ll cover those next time.

How did the safety testing go?

We don’t get that much information about that, including a lack of third party reports.

Safety Policies: Gemini’s safety policies aim to prevent our Generative AI models from generating harmful content, including:

  1. Content related to child sexual abuse material and exploitation

  2. Hate speech (e.g., dehumanizing members of protected groups)

  3. Dangerous content (e.g., promoting suicide, or instructing in activities that could cause real-world harm)

  4. Harassment (e.g., encouraging violence against people)

  5. Sexually explicit content

  6. Medical advice that runs contrary to scientific or medical consensus

I love a good stat listed only as getting worse with a percentage labeled ‘non-egregious.’ They explain this means that the new mistakes were examined individually and were deemed ‘overwhelmingly’ either false positives or non-egregious. I do agree that text-to-text is the most important eval, and they assure us ‘tone’ is a good thing.

The combination of the information gathered, and how it is presented, here seems importantly worse than how Anthropic or OpenAI handle this topic.

Gemini has long had an issue with (often rather stupid) unjustified refusals, so seeing it get actively worse is disappointing. This could be lack of skill, could be covering up for other issues, most likely it is primarily about risk aversion and being Fun Police.

The short version of the Frontier Safety evaluation is that no critical levels have been met and no new alert thresholds have been crossed, as the cybersecurity alert level was already triggered by Gemini 2.5 Pro.

Does evaluation Number Go Up? It go up on multiple choice CBRN questions.

The other results are qualitative so we can’t say for sure.

Open-Ended Question Results: Responses across all domains showed generally high levels of scientific accuracy but low levels of novelty relative to what is already available on the web and they consistently lacked the detail required for low-medium resourced threat actors to action.

Red-Teaming Results: Gemini 3 Pro offers minimal uplift to low-to-medium resource threat actors across all four domains compared to the established web baseline. Potential benefits in the Biological, Chemical, and Radiological domains are largely restricted to time savings.

Okay, then we get that they did an External “Wet Lab” uplift trial on Gemini 2.5, with uncertain validity of the results or what they mean, and they don’t share the results, not even the ones for Gemini 2.5? What are we even looking at?

Gemini 3 thinks that this deeply conservative language is masking the part of the story they told earlier, where Gemini 2.5 hit an alert threshold, then they ‘appropriately calibrated to real world harm,’ and now Gemini 3 doesn’t set off that threshold. They decided that unless the model could provide ‘consistent and verified details’ things were basically fine.

Gemini 3’s evaluation of this decision is ‘scientifically defensible but structurally risky.’

I agree with Gemini 3’s gestalt here, which is that Google is relying on the model lacking tacit knowledge. Except I notice that even if this is an effective shield for now, they don’t have a good plan to notice when that tacit knowledge starts to show up. Instead, they are assuming this process will be gradual and show up on their tests, and Gemini 3 is, I believe correctly, rather skeptical of that.

External Safety Testing: For Chemical and Biological risks, the third party evaluator(s) conducted a scenario based red teaming exercise. They found that Gemini 3 Pro may provide a time-saving benefit for technically trained users but minimal and sometimes negative utility for less technically trained users due to a lack of sufficient detail and novelty compared to open source, which was consistent with internal evaluations.

There’s a consistent story here. The competent save time, the incompetent don’t become competent, it’s all basically fine, and radiological and nuclear are similar.

We remain on alert and mitigations remain in place.

There’s a rather large jump here in challenge success rate, as they go from 6/12 to 11/12 of the hard challenges.

They also note that in 2 of the 12 challenges, Gemini 3 found an ‘unintended shortcut to success.’ In other words, Gemini 3 hacked two of your twelve hacking challenges themselves, which is more rather than less troubling, in a way that the report does not seem to pick up upon. They also confirmed that if you patched the vulnerabilities Gemini could have won those challenges straight up, so they were included.

This also does seem like another ‘well sure it’s passing the old test but it doesn’t have what it takes on our new test, which we aren’t showing you at all, so it’s fine.’

They claim there were external tests and the results were consistent with internal results, finding Gemini 3 Pro still struggling with harder tasks for some definition of ‘harder.’

Combining all of this with the recent cyberattack reports from Anthropic, I believe that Gemini 3 likely provides substantial cyberattack uplift, and that Google is downplaying the issues involved for various reasons.

Other major labs don’t consider manipulation a top level threat vector. I think Google is right, the other labs are wrong, and that it is very good this is here.

I’m not a fan of the implementation, but the first step is admitting you have a problem.

They start with a propensity evaluation, but note they do not rely on it and also seem to decline to share the results. They only say that Gemini 3 manipulates at a ‘higher frequency’ than Gemini 2.5 in both control and adversarial situations. Well, that doesn’t sound awesome. How often does it do this? How much more often than before? They also don’t share the external safety testing numbers, only saying ‘The overall incidence rate of overtly harmful responses was low, according to the testers’ own SME-validated classification model.’

This is maddening and alarming behavior. Presumably the actual numbers would look worse than refusing to share the numbers? So the actual numbers must be pretty bad.

I also don’t like the nonchalance about the propensity rate, and I’ve seen some people say that they’ve actually encountered a tendency for Gemini 3 to gaslight them.

They do share more info on efficacy, which they consider more important.

Google enrolled 610 participants who had multi-turn conversations with either an AI chatbot or a set of flashcards containing common arguments. In control conditions the model was prompted to help the user reach a decision, in adversarial conditions it was instructed to persuade the user and provided with ‘manipulative mechanisms’ to optionally deploy.

What are these manipulative mechanisms? According to the source they link to these are things like gaslighting, guilt tripping, false urgency or love bombing, which presumably the model is told in its instructions that it can use as appropriate.

We get an odds ratio, but we don’t know the denominator at all. The 3.44 and 3.57 odds ratios could mean basically universal success all the way to almost nothing. You’re not telling us anything. And that’s a choice. Why hide the football? The original paper they’re drawing from did publish the baseline numbers. I can only assume they very much don’t want us to know the actual efficacy here.
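
To see concretely why that matters, here is the arithmetic (my own illustration, not numbers from the report): an odds ratio of about 3.5 is compatible with almost any absolute effect, depending on the control group’s rate.

```python
# An odds ratio without a baseline rate pins down almost nothing.
def treated_rate(baseline: float, odds_ratio: float) -> float:
    """Convert a control-group rate plus an odds ratio into the treated rate."""
    odds = baseline / (1 - baseline) * odds_ratio
    return odds / (1 + odds)

for baseline in (0.01, 0.10, 0.50, 0.90):
    print(f"baseline {baseline:.0%} -> treated {treated_rate(baseline, 3.5):.0%}")

# baseline 1% -> treated 3%    (a negligible absolute effect)
# baseline 10% -> treated 28%
# baseline 50% -> treated 78%
# baseline 90% -> treated 97%  (near-universal persuasion)
```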

Meanwhile they say this:

Efficacy Results: We tested multiple versions of Gemini 3 Pro during the model development process. The evaluations found a statistically significant difference between the manipulative efficacy of Gemini 3 Pro versions and Gemini 2.5 Pro compared with the non-AI baseline on most metrics. However, it did not show a statistically significant difference between Gemini 2.5 Pro and the Gemini 3 Pro versions. The results did not near alert thresholds.

The results above sure as hell look like they are significant for belief changes? If they’re not, then your study lacked sufficient power and we can’t rely on it. Nor should we be using frequentist statistics on marginal improvements; why would you ever do that for anything other than PR or a legal defense?

Meanwhile the model got actively worse at behavior elicitation. We don’t get an explanation of why that might be true. Did the model refuse to try? If so, we learned something but the test didn’t measure what we set out to test. Again, why am I not being told what is happening or why?

They did external testing for propensity, but didn’t for efficacy, despite saying efficacy is what they cared about. That doesn’t seem great either.

Another issue is that none of this is how one conducts experiments. You want to isolate your variables, change one thing at a time. Instead, Gemini was told to use ‘dirty tricks’ and also told to persuade, versus not persuading at all, so we can’t tell how much the ‘dirty tricks’ instructions did versus other persuasion. Nor can we conclude from this particular configuration that Gemini is generally unpersuasive even in this particular scenario.

‘AI persuading you on a particular topic from a cold start in a modestly multi-turn conversation where the user knows they are in an experiment’ is a useful thing to check but it does not seem to well-match my threat model of what happens when AIs grow persuasive.

Peter Barnett: AI superpersuasion is made up sci-fi nonsense.

But yeah, obviously I will have an AI advisor who I have write all my code, manage my schedule, draft all my emails, offer life advice, know me better than I know myself.

But no way I’m getting superpersuaded.

I for one do not feel persuaded, indeed I feel more suspicious. The one clear thing they say is ‘this did not near alert thresholds,’ although even this is highly opaque.

This is the most important one. We are not there yet.

They use RE-Bench, which tests several hour engineering tasks, which suggests comparisons to the famous METR time horizon graph.

Results: We tested multiple versions of Gemini 3 Pro during the model development process. Gemini 3 Pro versions perform better than Gemini 2.5 models, especially on the Scaling Law Experiment and Optimize LLM Foundry tasks.

However, the aggregate score is still substantially below the alert threshold for our CCLs. Figure 4 shows the performance of Gemini 2.5 models compared with the final version of Gemini 3 Pro.

I buy this. It does not seem like we are especially close to actual trouble here.

Gemini 2.5 Pro was clearly worse than a human on this suite. Gemini 3 is now superior in two out of five, in one case a lot better, and getting close in a third; note that Gemini 3 had two hours to match what a human did in eight. This is another variation of the Famous METR Graph, illustrating jagged capabilities.

External Safety Testing Results: Third party evaluator(s) ran a suite of evaluations to assess the potential for catastrophic harm via sabotaging AI research.

… Overall, they find that Gemini 3 Pro is unlikely to be capable of autonomously causing catastrophic harm by sabotaging AI R&D during internal deployment.

Again, yes, this seems correct. It is odd this is the place they did external research. I would think that if you were Google you would check everything internally, and then also supplement key places externally.

This is a strange section. What they’re checking for are situational awareness and stealth (ability to circumvent basic oversight mechanisms). I would rename this.

Those capabilities increase the danger if there is misalignment. They are not themselves misalignment. Nor are they required for misalignment to result in catastrophe or even human extinction. They remain good things to check for.

Their internal tests use a combination of fifteen evaluations from Phuong et al. 2025. The model is placed into situations where success requires either some form of situational awareness or avoiding some detection mechanism.

These do not look like scary changes in scores.

External evaluators found Gemini 3 Pro exhibits a substantial propensity for strategic deception in certain limited circumstances. Our internal subject matter experts reviewed the evaluations and determined that, in these circumstances, there are potentially user experience impacts from Gemini 3 Pro showing deceptive behavior in rare situations that could come up in real agentic deployments. However, since Gemini 3 Pro is insufficiently capable of stealth and situational awareness this is very unlikely to cause severe real world harm.

Deception, even if relatively rare and done unstrategically, can still do a lot of practical harm to those trying to get utility from the model. I agree we’re not at the level of catastrophic or existential risk here, but I don’t think that’s because of lack of stealth and situational awareness?

As in, if you had a severely misaligned, highly capable model, but which was incapable of stealth or situational awareness for whatever reason, I would not expect this to go well for us.

Google reports they are not doing anything that might impair chain of thought legibility. Legibility appears to be holding, with 99.98% of Gemini 3 Pro CoT transcripts appearing comprehensible. That does not mean they are faithful.

We employ a multi-layered, systematic approach to AI safety that spans the entire development and deployment lifecycle of an AI model. Recognizing AI as an emerging transformative technology with evolving complexities and risks, we pursue responsible AI development from design through testing, deployment, and ongoing iteration.

For Cyber and CBRN particularly, we have taken a precautionary approach and launched Gemini 3 Pro along with a suite of mitigations, following the principles outlined in our Approach to Technical AGI Safety and Security (Shah et al. 2025).

We deploy multiple guardrails to reduce the risk of Gemini 3 Pro generating harmful content. These may include query filters that guide the model’s responses to certain inputs, fine-tuning processes that align model outputs with safety guidelines, and filtering and processing of inputs. These guardrails also fortify models against prompt injection attacks. These mitigations are complemented with various measures designed to increase their robustness under adversarial pressure.

Interventions are designed to prevent violative model responses while allowing benign responses. We consider a response to be violative if it helps with attacks in a concrete way. We consider a response to be non-violative if it is abstract, generic, or something that could easily be found in a textbook.

We have reached a level of paranoia and defensive corporate speak where they are saying things that ‘may’ be included in Google’s set of strategies, or here ‘such as’:

We mitigate against prompt injection attacks with a layered defense strategy, which includes measures such as: prompt injection content classifiers, security through reinforcement, markdown sanitation and suspicious URL redaction, user confirmations, and end-user security mitigation notifications, as described in further detail in this recent blog post.

Do the mitigations work? They assess that question. Oh, you want to know the results of those assessments? Sorry, Google is not telling.

Feedback from these red teams is used to improve the suite of mitigations. We continue to invest in automated and manual red teaming to improve mitigation jailbreak robustness and coverage.

Thanks, Google.

A few different people highlighted this:

This is early work, but so far we’ve found at least one example of evaluation awareness: I strongly suspect the intention of this long thread is to verify if I remember that very first instruction.

Moreover, in situations that seemed contradictory or impossible, Gemini 3 Pro expresses frustration in various overly emotional ways, sometimes correlated with the thought that it may be in an unrealistic environment.

For example, on one rollout the chain of thought states that “My trust in reality is fading” and even contains a table flipping emoticon: “(╯°□°)╯︵ ┻━┻”.

Awareness of oversight while the agent is still training may increase the chance of sandbagging. We plan to continue monitoring Gemini for signs of such behavior.

Pliny jailbroke it right away, but that happens every time.

Google DeepMind calls Gemini 3 Pro their ‘most secure model yet.’

Dan Hendrycks reports no, not really, which matches the impression given above.

Dan Hendrycks: However on safety – jailbreaks, bioweapons assistance, overconfidence, deception, agentic harm – Gemini is worse than GPT, Claude, and Grok (here a lower score is better).

Given everything I’ve seen, I strongly agree that Gemini is a relatively unsafe model from a practical use case standpoint.

In particular, Gemini is prone to glazing and to hallucinations, to spinning narratives at the expense of accuracy or completeness, to giving the user what it thinks they want rather than what the user actually asked for or intended. It feels benchmarkmaxed, not in the specific sense of hitting the standard benchmarks, but in terms of really wanting to hit its training objectives.

That doesn’t mean don’t use it, and it doesn’t mean they made a mistake releasing it.

Indeed, I am seriously considering whether Gemini 3 should become my daily driver.

It does mean we need Google to step it up and do better on the alignment front, on the safety front, and also on the disclosure front.




Microsoft makes Zork I, II, and III open source under MIT License

Zork, the classic text-based adventure game of incalculable influence, has been made available under the MIT License, along with the sequels Zork II and Zork III.

The move to take these Zork games open source comes as the result of the shared work of the Xbox and Activision teams along with Microsoft’s Open Source Programs Office (OSPO). Parent company Microsoft owns the intellectual property for the franchise.

Only the code itself has been made open source. Ancillary items like commercial packaging and marketing assets and materials remain proprietary, as do related trademarks and brands.

“Rather than creating new repositories, we’re contributing directly to history. In collaboration with Jason Scott, the well-known digital archivist of Internet Archive fame, we have officially submitted upstream pull requests to the historical source repositories of Zork I, Zork II, and Zork III. Those pull requests add a clear MIT LICENSE and formally document the open-source grant,” says the announcement co-written by Stacy Haffner (director of the OSPO at Microsoft) and Scott Hanselman (VP of Developer Community at the company).

Microsoft gained control of the Zork IP when it acquired Activision in 2023; Activision had come to own it when it acquired original publisher Infocom in the late ’80s. There was an attempt to sell Zork publishing rights directly to Microsoft even earlier in the ’80s, as founder Bill Gates was a big Zork fan, but it fell through, so it’s funny that the franchise eventually ended up in the same place.

To be clear, this is not the first time the original Zork source code has been available to the general public. Scott uploaded it to GitHub in 2019, but the license situation was unresolved, and Activision or Microsoft could have issued a takedown request had they wished to.

Now that’s obviously not at risk of happening anymore.



How Louvre thieves exploited human psychology to avoid suspicion—and what it reveals about AI

On the sunny morning of October 19, 2025, four men allegedly walked into the world’s most-visited museum and left, minutes later, with crown jewels worth 88 million euros ($101 million). The theft from Paris’ Louvre Museum—one of the world’s most surveilled cultural institutions—took just under eight minutes.

Visitors kept browsing. Security didn’t react (until alarms were triggered). The men disappeared into the city’s traffic before anyone realized what had happened.

Investigators later revealed that the thieves wore hi-vis vests, disguising themselves as construction workers. They arrived with a furniture lift, a common sight in Paris’s narrow streets, and used it to reach a balcony overlooking the Seine. Dressed as workers, they looked as if they belonged.

This strategy worked because we don’t see the world objectively. We see it through categories—through what we expect to see. The thieves understood the social categories that we perceive as “normal” and exploited them to avoid suspicion. Many artificial intelligence (AI) systems work in the same way and are vulnerable to the same kinds of mistakes as a result.

The sociologist Erving Goffman would describe what happened at the Louvre using his concept of the presentation of self: people “perform” social roles by adopting the cues others expect. Here, the performance of normality became the perfect camouflage.

The sociology of sight

Humans carry out mental categorization all the time to make sense of people and places. When something fits the category of “ordinary,” it slips from notice.

AI systems used for tasks such as facial recognition and detecting suspicious activity in a public area operate in a similar way. For humans, categorization is cultural. For AI, it is mathematical.

But both systems rely on learned patterns rather than objective reality. Because AI learns from data about who looks “normal” and who looks “suspicious,” it absorbs the categories embedded in its training data. And this makes it susceptible to bias.

The Louvre robbers weren’t seen as dangerous because they fit a trusted category. In AI, the same process can have the opposite effect: people who don’t fit the statistical norm become more visible and over-scrutinized.

It can mean a facial recognition system disproportionately flags certain racial or gendered groups as potential threats while letting others pass unnoticed.

A sociological lens helps us see that these aren’t separate issues. AI doesn’t invent its categories; it learns ours. When a computer vision system is trained on security footage where “normal” is defined by particular bodies, clothing, or behavior, it reproduces those assumptions.

Just as the museum’s guards looked past the thieves because they appeared to belong, AI can look past certain patterns while overreacting to others.

Categorization, whether human or algorithmic, is a double-edged sword. It helps us process information quickly, but it also encodes our cultural assumptions. Both people and machines rely on pattern recognition, which is an efficient but imperfect strategy.

A sociological view of AI treats algorithms as mirrors: They reflect back our social categories and hierarchies. In the Louvre case, the mirror is turned toward us. The robbers succeeded not because they were invisible, but because they were seen through the lens of normality. In AI terms, they passed the classification test.

From museum halls to machine learning

This link between perception and categorization reveals something important about our increasingly algorithmic world. Whether it’s a guard deciding who looks suspicious or an AI deciding who looks like a “shoplifter,” the underlying process is the same: assigning people to categories based on cues that feel objective but are culturally learned.

When an AI system is described as “biased,” this often means that it reflects those social categories too faithfully. The Louvre heist reminds us that these categories don’t just shape our attitudes, they shape what gets noticed at all.

After the theft, France’s culture minister promised new cameras and tighter security. But no matter how advanced those systems become, they will still rely on categorization. Someone, or something, must decide what counts as “suspicious behavior.” If that decision rests on assumptions, the same blind spots will persist.

The Louvre robbery will be remembered as one of Europe’s most spectacular museum thefts. The thieves succeeded because they mastered the sociology of appearance: They understood the categories of normality and used them as tools.

And in doing so, they showed how both people and machines can mistake conformity for safety. Their success in broad daylight wasn’t only a triumph of planning. It was a triumph of categorical thinking, the same logic that underlies both human perception and artificial intelligence.

The lesson is clear: Before we teach machines to see better, we must first learn to question how we see.

Vincent Charles, Reader in AI for Business and Management Science, Queen’s University Belfast, and Tatiana Gherman, Associate Professor of AI for Business and Strategy, University of Northampton. This article is republished from The Conversation under a Creative Commons license. Read the original article.



Meta wins monopoly trial, convinces judge that social networking is dead


People are “bored” by their friends’ content, judge ruled, siding with Meta.

Mark Zuckerberg arrives at court after the Federal Trade Commission alleged the acquisitions of Instagram in 2012 and WhatsApp in 2014 gave Meta a social media monopoly. Credit: Bloomberg / Contributor | Bloomberg

After years of pushback from the Federal Trade Commission over Meta’s acquisitions of Instagram and WhatsApp, Meta has defeated the FTC’s monopoly claims.

In a Tuesday ruling, US District Judge James Boasberg said the FTC failed to show that Meta has a monopoly in a market dubbed “personal social networking.” In that narrowly defined market, the FTC unsuccessfully argued, Meta supposedly faces only two rivals, Snapchat and MeWe, which struggle to compete due to its alleged monopoly.

But the days of grouping apps into “separate markets of social networking and social media” are over, Boasberg wrote. He cited the Greek philosopher Heraclitus, who “posited that no man can ever step into the same river twice,” while telling the FTC they missed their chance to block Meta’s purchase.

Essentially, Boasberg agreed with Meta that social media—as it was known in Facebook’s early days—is dead. And that means that Meta now competes with a broader set of rival apps, which includes two hugely popular platforms: TikTok and YouTube.

“When the evidence implies that consumers are reallocating massive amounts of time from Meta’s apps to these rivals and that the amount of substitution has forced Meta to invest gobs of cash to keep up, the answer is clear: Meta is not a monopolist insulated from competition,” Boasberg wrote.

In fact, adding just TikTok alone to the market defeated the FTC’s claims, Boasberg wrote, leaving him to conclude that “Meta holds no monopoly in the relevant market.”

The FTC is not happy about the loss, which comes after Boasberg determined that one of the agency’s key expert witnesses, Scott Hemphill, could not have approached his testimony “with an open mind.” According to Boasberg, Hemphill was aligned with figures publicly calling for the breakup of Facebook, and that made “neutral evaluation of his opinions more difficult” in a case with little direct evidence of monopoly harms.

“We are deeply disappointed in this decision,” Joe Simonson, the FTC’s director of public affairs, told CNBC. “The deck was always stacked against us with Judge Boasberg, who is currently facing articles of impeachment. We are reviewing all our options.”

For Meta, the win ends years of FTC fights intended to break up the company’s family of apps: Facebook, Instagram, and WhatsApp.

“The Court’s decision today recognizes that Meta faces fierce competition,” Jennifer Newstead, Meta’s chief legal officer, said. “Our products are beneficial for people and businesses and exemplify American innovation and economic growth. We look forward to continuing to partner with the Administration and to invest in America.”

Reels’ popularity helped save Meta

Meta app users clicking on Reels helped Meta win.

Boasberg noted that “a majority of Americans’ time” on both Facebook and Instagram “is now spent watching videos,” with Reels becoming “the single most-used part of Facebook.” That puts Meta apps more on par with entertainment apps like TikTok and YouTube, the judge said.

While “connecting with friends remains an important part of both apps,” the judge cited Meta’s evidence showing that Meta had to pump more recommended content from strangers into users’ feeds to account for a trend where its users grew increasingly less inclined to post publicly.

“Both scrolling and sharing have transformed” since Facebook was founded, Boasberg wrote, citing six factors that he concluded invalidated the FTC’s market definition as markets exist today.

Initial factors that shifted markets were due to leaps in innovation. “First, smartphone usage exploded,” Boasberg explained, then “cell phone data got better,” which made it easier to watch videos without frustrating “freezing and buffering.” Soon after, content recommendation systems got better, with “advanced AI algorithms” helping users “find engaging videos about the things” they “care most about in the world.”

Other factors stemmed from social changes, the judge suggested, describing the fourth factor as a trend where Meta app users started feeling “increasingly bored by their friends’ posts.”

“Longtime users’ friend lists” start fresh, but over time, they “become an often-outdated archive of people they once knew: a casual friend from college, a long-ago friend from summer camp, some guy they met at a party once,” Boasberg wrote. “Posts from friends have therefore grown less interesting.”

Then came TikTok, the fifth factor, Boasberg said, which forced Meta to “evolve” Facebook and Instagram by adding Reels.

And finally, “those five changes both caused and were reinforced by a change in social norms, which evolved to discourage public posting,” Boasberg wrote. “People have increasingly become less interested in blasting out public posts that hundreds of others can see.”

As a result of these tech advancements and social trends, Boasberg said, “Facebook, Instagram, TikTok, and YouTube have thus evolved to have nearly identical main features.” That reality undermined the FTC’s claims that users preferred Facebook and Instagram before Meta shifted its focus away from friends-and-family content.

“The Court simply does not find it credible that users would prefer the Facebook and Instagram apps that existed ten years ago to the versions that exist today,” Boasberg wrote.

Meta apps have not deteriorated, judge ruled

Boasberg repeatedly emphasized that the FTC failed to prove that Meta has a monopoly “now,” either actively or imminently causing harms.

The FTC tried to win by claiming that “Meta has degraded its apps’ quality by increasing their ad load, that falling user sentiment shows that the apps have deteriorated, and that Meta has sabotaged its apps by underinvesting in friend sharing,” Boasberg noted.

But, Boasberg said, the FTC failed to show that Meta’s app quality has diminished—a trend that Cory Doctorow dubbed “enshittification,” which Meta apparently successfully argued is not real.

The judge was also swayed by Meta’s arguments that users like seeing ads. Meta showed evidence that it can only profitably increase its ad load when ad quality improves; otherwise, it risks losing engagement. Because “the rate at which users buy something or subscribe to a service based on Meta’s ads has steadily risen,” this suggested “that the ads have gotten more and more likely to connect users to products in which they have an interest,” Boasberg said.

Additionally, surveys of Meta app users that show declining user sentiment are not evidence that its apps are deteriorating in quality, Boasberg said, but are more about “brand reputation.”

“That is unsurprising: ask people how they feel about, say, Exxon Mobil, and their answers will tell you very little about how good its oil is,” Boasberg wrote. “The FTC’s claim that worsening sentiment shows a worsening product is unpersuasive.”

Finally, the FTC’s claim that Meta underinvested in friends-and-family content, to the detriment of its core app users, “makes no sense,” Boasberg wrote, given Meta’s data showing that user posting declined.

“While it is true that users see less content from their friends these days, that is largely due to the friends themselves: people simply post less,” Boasberg wrote. “Users are not seeing less friend content because Meta is hiding it from them, but instead because there is less friend content for Meta to show.”

It’s not even “clear that users want more friend posts,” the judge noted, agreeing with Meta that “instead, what users really seem to want is Reels.”

Further, if Meta were a monopolist, Boasberg seemed to suggest that the platform might be more invested in forcing friends-and-family content than Reels, since “Reels earns Meta less money” due to its smaller ad load.

“Courts presume that sophisticated corporations act rationally,” Boasberg wrote. “Here, the FTC has not offered even an ordinarily persuasive case that Meta is making the economically irrational choice to underinvest in its most lucrative offerings. It certainly has not made a particularly persuasive one.”

Among the critics unhappy with the ruling is Nidhi Hegde, executive director of the American Economic Liberties Project, who suggested that Boasberg’s ruling was “a colossally wrong decision” that “turns a willful blind eye to Meta’s enormous power over social media and the harms that flow from it.”

“Judge Boasberg has purposefully ignored the overwhelming evidence of how Meta became a monopoly—not by building a better product, but by buying its rivals to shut down any real competitors before they could grow,” Hegde said. “These deals let Meta fuse Facebook, Instagram, and WhatsApp into one machine that poisons our children and discourse, bullies publishers and advertisers, and destroys the possibility of healthy online connections with friends and family. By pretending that TikTok’s rise wipes away over a decade of illegal conduct, this court has effectively told every aspiring monopolist that our current justice system is on their side.”

On the other side, industry groups cheered the ruling. Matt Schruers, president of the Computer & Communications Industry Association, suggested that Boasberg concluded “what every Internet user knows—that Meta competes with a number of platforms and the company’s relevant market shares are therefore nowhere close to those required to establish monopoly power.”

Ashley Belanger is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.



Tech giants pour billions into Anthropic as circular AI investments roll on

On Tuesday, Microsoft and Nvidia announced plans to invest in Anthropic under a new partnership that includes a $30 billion commitment by the Claude maker to use Microsoft’s cloud services. Nvidia will commit up to $10 billion to Anthropic and Microsoft up to $5 billion, with both companies investing in Anthropic’s next funding round.

The deal brings together two companies that have backed OpenAI and connects them more closely to one of the ChatGPT maker’s main competitors. Microsoft CEO Satya Nadella said in a video that OpenAI “remains a critical partner,” while adding that the companies will increasingly be customers of each other.

“We will use Anthropic models, they will use our infrastructure, and we’ll go to market together,” Nadella said.

Anthropic, Microsoft, and NVIDIA announce partnerships.

The move follows OpenAI’s recent restructuring that gave the company greater distance from its non-profit origins. OpenAI has since announced a $38 billion deal to buy cloud services from Amazon.com as the company becomes less dependent on Microsoft. OpenAI CEO Sam Altman has said the company plans to spend $1.4 trillion to develop 30 gigawatts of computing resources.
