Author name: Tim Belzer

hbo’s-the-last-of-us-s2e6-recap:-look-who’s-back!

HBO’s The Last of Us S2E6 recap: Look who’s back!

New episodes of season 2 of The Last of Us are premiering on HBO every Sunday night, and Ars’ Kyle Orland (who’s played the games) and Andrew Cunningham (who hasn’t) will be talking about them here after they air. While these recaps don’t delve into every single plot point of the episode, there are obviously heavy spoilers contained within, so go watch the episode first if you want to go in fresh.

Kyle: Going from a sudden shot of beatific Pedro Pascal at the end of the last episode to a semi-related flashback with a young Joel Miller and his brother was certainly a choice. I almost respect how overtly they are just screwing with audience expectations here.

As for the opening flashback scene itself, I guess the message is “Hey, look at the generational trauma his family was dealing with—isn’t it great he overcame that to love Ellie?” But I’m not sure I can draw a straight line from “he got beat by his dad” to “he condemned the entire human race for his surrogate daughter.”

Andrew: I do not have the same problems you did with either the Joel pop-in at the end of the last episode or the flashback at the start of this episode—last week, the show was signaling “here comes Joel!” and this week the show is signaling “look, it’s Joel!” Maybe I’m just responding to Tony Dalton as Joel’s dad, who I know best as the charismatic lunatic Lalo Salamanca from Better Call Saul. I do agree that the throughline between these two events is shaky, though, and without the flashback to fill us in, the “I hope you can do a little better than me” sentiment feels like something way out of left field.

But I dunno, it’s Joel week. Joel’s back! This is the Duality of Joel: you can simultaneously think that he is horrible for failing a civilization-scale trolley problem when he killed a building full of Fireflies to save Ellie, and you can’t help but be utterly charmed by Pedro Pascal enthusiastically describing the many ways to use a Dremel. (He’s right! It’s a versatile tool!)

Truly, there’s pretty much nothing in this episode that we couldn’t have inferred or guessed at based on the information the show has already made available to us. And I say this as a non-game-player—I didn’t need to see exactly how their relationship became as strained as it was by the beginning of the season to have some idea of why it happened, nor did I need to see The Porch Scene to understand that their bond nevertheless endured. But this is also the dynamic that everybody came to the show for last season, so I can only make myself complain about it to a point.

Kyle: It’s true, Joel Week is a time worth celebrating. If I’m coming across as cranky about it at the outset, it’s probably because this whole episode is a realization of what we’re missing out on this season thanks to Joel’s death.

As you said, a lot of this episode was filling in gaps that could well have been inferred from events we did see. But I would have easily taken a full season (or a full second game) of Ellie growing up and Joel dealing with Ellie growing up. You could throw in some zombie attacks or an overarching Big Bad enemy or something if you want, but the development of Joel and Ellie’s relationship deserves more than just some condensed flashbacks.

“It works?!”

Credit: Warner Bros. Discovery

“It works?!” Credit: Warner Bros. Discovery

Andrew: Yeah, it’s hard not to be upset about the original sin of The Last of Us Part 2 which is (assuming it’s like the show) that having some boring underbaked villain crawl out of the woodwork to kill the show’s main character is kind of a cheap shot. Sure, you shock the hell out of viewers like me who didn’t see it coming! But part of the reason I didn’t see it coming is because if you kill Joel, you need to do a whole bunch of your show without Joel and why on Earth would you decide to do that?

To be clear, I don’t mind this season so much, and I’ve found things to like about it, though Ellie does sometimes veer into being a protagonist so short-sighted and impulsive and occasionally just-plain-stupid that it’s hard to be in her corner. But yeah, flashing back to a time just two months after the end of season 1 really does make you wonder, “Why couldn’t the story just be this?”

Kyle: In the gaming space, I understand the desire to not have your sequel game be just “more of the same” from the last game. But I’ve always felt The Last of Us Part 2 veered too hard in the other direction and became something almost entirely unrecognizable from the original game I loved.

But let’s focus on what we do get in this episode, which is an able recreation of my favorite moment from the second game, Ellie enjoying the heck out of a ruined science museum. The childlike wonder she shows here is a great respite from a lot of action-heavy scenes in the game, and I think it serves the same purpose here. It’s also much more drawn out in the game—I could have luxuriated in just this part of the flashback for an entire episode!

Andrew: The only thing that kept me from being fully on board with that scene was that I think Ellie was acting quite a bit younger than 16, with her pantomimed launch noises and flipping of switches, But I could believe that a kid who had such a rough and abbreviated childhood would have some fun sitting in an Apollo module. For someone with no memories of the pre-outbreak society, it must seem like science fiction, and the show gives us some lovely visuals to go with it.

The things I like best here are the little moments in between scenes rather than the parts where the show insists on showing us events that it had already alluded to in other episodes. What sticks with me the most, as we jump between Ellie’s birthdays, is Joel’s insistence that “we could do this kind of thing more often” as they go to a museum or patrol the trails together. That it needs to be stated multiple times suggests that they are not, in fact, doing this kind of thing more often in between birthdays.

Joel is thoughtful and attentive in his way—a little better than his father—but it’s such a bittersweet little note, a surrogate dad’s clumsy effort to bridge a gap that he knows is there but doesn’t fully understand.

Why can’t it be like this forever?

Credit: Warner Bros. Discovery

Why can’t it be like this forever? Credit: Warner Bros. Discovery

Kyle: Yeah, I’m OK with a little arrested development in a girl that has been forced to miss so many of the markers of a “normal” pre-apocalypse childhood.

But yeah, Joel is pretty clumsy about this. And as we see all of these attempts with his surrogate daughter, it’s easy to forget what happened to his real daughter way back at the beginning of the first season. The trauma of that event shapes Joel in a way that I feel the narrative sometimes forgets about for long stretches.

But then we get moments like Joel leading Gail’s newly infected husband to a death that the poor guy would very much like to delay by an hour for one final moment with his wife. When Joel says that you can always close your eyes and see the face of the one you love, he may have been thinking about Ellie. But I like to think he was thinking about his actual daughter.

Andrew: Yes to the extent that Joel’s actions are relatable (I won’t say “excusable,” but “relatable”) it’s because the undercurrent of his relationship with Ellie is that he can’t watch another daughter die in his arms. I watched the first episode again recently, and that whole scene remains a masterfully executed gut-punch.

But it’s a tough tightrope to walk, because if the story spends too much time focusing on it, you draw attention to how unhealthy it is for Joel to be forcing Ellie to play that role in his life. Don’t get me wrong, Ellie was looking for a father figure, too, and that’s why it works! It’s a “found family” dynamic that they were both looking for. But I can’t hear Joel’s soothing “baby girl” epithet without it rubbing me the wrong way a little.

My gut reaction was that it was right for Joel not to fully trust Gail’s husband, but then I realized I can never not suspect Joe Pantoliano of treachery because of his role as betrayer in the 26-year-old movie The Matrix. Brains are weird.

Kyle: I did like the way Ellie tells Joel off for lying to her (and to Gail) about the killing; it’s a real “growing up” moment for the character. And of course it transitions well into The Porch Scene, Ellie’s ultimate moment of confronting Joel on his ultimate betrayal.

While I’m not a fan of the head-fake “this scene isn’t going to happen” thing they did earlier this season. I think the TV show once again did justice to one of the most impactful parts of the game. But the game also managed to spread out these Joel-centric flashbacks a little more, so we’re not transitioning from “museum fun” to “porch confrontation” quite so quickly. Here, it feels like they’re trying hard to rush through all of their “bring back Pedro Pascal” requirements in a single episode.

When you’ve only got one hour left, how you spend it becomes pretty important.

Credit: Warner Bros. Discovery

When you’ve only got one hour left, how you spend it becomes pretty important. Credit: Warner Bros. Discovery

Andrew: Yeah, because you don’t need to pay a 3D model’s appearance fees if you want to use it in a bunch of scenes of your video game. Pedro Pascal has other stuff going on!

Kyle: That’s probably part of it. But without giving too much away, I think we’re seeing the limits of stretching the events of “Part 2” into what is essentially two seasons. While there have been some cuts, on the whole, it feels like there’s also been a lot of filler to “round out” these characters in ways that have been more harmful than helpful at points.

Andrew: Yeah, our episode ends by depositing us back in the main action, as Ellie returns to the abandoned theater where she and Dina have holed up. I’m curious to see what we’re in for in this last run of almost-certainly-Joel-less episodes, but I suspect it involves a bunch of non-Joel characters ping-ponging between the WLF forces and the local cultists. There will probably be some villain monologuing, probably some zombie hordes, probably another named character death or two. Pretty standard issue.

What I don’t expect is for anyone to lovingly and accurately describe the process of refurbishing a guitar. And that’s the other issue with putting this episode where it is—just as you’re getting used to a show without Joel, you’re reminded that he’s missing all over again.

HBO’s The Last of Us S2E6 recap: Look who’s back! Read More »

openai-introduces-codex,-its-first-full-fledged-ai-agent-for-coding

OpenAI introduces Codex, its first full-fledged AI agent for coding

We’ve been expecting it for a while, and now it’s here: OpenAI has introduced an agentic coding tool called Codex in research preview. The tool is meant to allow experienced developers to delegate rote and relatively simple programming tasks to an AI agent that will generate production-ready code and show its work along the way.

Codex is a unique interface (not to be confused with the Codex CLI tool introduced by OpenAI last month) that can be reached from the side bar in the ChatGPT web app. Users enter a prompt and then click either “code” to have it begin producing code, or “ask” to have it answer questions and advise.

Whenever it’s given a task, that task is performed in a distinct container that is preloaded with the user’s codebase and is meant to accurately reflect their development environment.

To make Codex more effective, developers can include an “AGENTS.md” file in the repo with custom instructions, for example to contextualize and explain the code base or to communicate standardizations and style practices for the project—kind of a README.md but for AI agents rather than humans.

Codex is built on codex-1, a fine-tuned variation of OpenAI’s o3 reasoning model that was trained using reinforcement learning on a wide range of coding tasks to analyze and generate code, and to iterate through tests along the way.

OpenAI introduces Codex, its first full-fledged AI agent for coding Read More »

drop-duchy-is-a-deck-building,-tetris-like,-carcassonne-esque-puzzler

Drop Duchy is a deck-building, Tetris-like, Carcassonne-esque puzzler

If you build up a big area of plains on your board, you can drop your “Farm” piece in the middle, and it converts those plains into richer plains. Put a “Woodcutter” into a bunch of forest, and it harvests that wood and turns it into plains. Set down a “Watchtower,” and it recruits some archer units for every plains tile in its vicinity, and even more for richer fields. You could drop a Woodcutter next to a Farm and Watchtower, and it would turn the forests into plains, the Farm would turn the plains into fields, and the Watchtower would pick up more units for all those rich fields.

That kind of multi-effect combo, resulting from one piece you perfectly placed in the nick of time, is what keeps you coming back to Drop Duchy. The bitter losses come from the other side, like realizing you’ve leaned too heavily into heavy, halberd-wielding units when the enemy has lots of ranged units that are strong against them. Or that feeling, familiar to Tetris vets, that one hasty decision you made 10 rows back has doomed you to the awkward, slanted pile-up you find yourself in now. Except that lines don’t clear in Drop Duchy, and the game’s boss battles specifically punish you for running out of good places to put things.

There’s an upper strategic layer to all the which-square-where action. You choose branching paths on your way to each boss, picking different resources, battles, and trading posts. Every victory has you picking a card for your deck, whether military, production, or, later on, general “technology” gains. You upgrade cards using your gathered resources, try to balance or min-max cards toward certain armies or terrains, and try not to lose any one round by too many soldiers. You have a sort of “overall defense” life meter, and each loss chips away at it. Run out of money to refill it, and that’s the game.

Drop Duchy is a deck-building, Tetris-like, Carcassonne-esque puzzler Read More »

fbi-warns-of-ongoing-scam-that-uses-deepfake-audio-to-impersonate-government-officials

FBI warns of ongoing scam that uses deepfake audio to impersonate government officials

The FBI is warning people to be vigilant of an ongoing malicious messaging campaign that uses AI-generated voice audio to impersonate government officials in an attempt to trick recipients into clicking on links that can infect their computers.

“Since April 2025, malicious actors have impersonated senior US officials to target individuals, many of whom are current or former senior US federal or state government officials and their contacts,” Thursday’s advisory from the bureau’s Internet Crime Complaint Center said. “If you receive a message claiming to be from a senior US official, do not assume it is authentic.”

Think you can’t be fooled? Think again.

The campaign’s creators are sending AI-generated voice messages—better known as deepfakes—along with text messages “in an effort to establish rapport before gaining access to personal accounts,” FBI officials said. Deepfakes use AI to mimic the voice and speaking characteristics of a specific individual. The differences between the authentic and simulated speakers are often indistinguishable without trained analysis. Deepfake videos work similarly.

One way to gain access to targets’ devices is for the attacker to ask if the conversation can be continued on a separate messaging platform and then successfully convince the target to click on a malicious link under the guise that it will enable the alternate platform. The advisory provided no additional details about the campaign.

The advisory comes amid a rise in reports of deepfaked audio and sometimes video used in fraud and espionage campaigns. Last year, password manager LastPass warned that it had been targeted in a sophisticated phishing campaign that used a combination of email, text messages, and voice calls to trick targets into divulging their master passwords. One part of the campaign included targeting a LastPass employee with a deepfake audio call that impersonated company CEO Karim Toubba.

In a separate incident last year, a robocall campaign that encouraged New Hampshire Democrats to sit out the coming election used a deepfake of then-President Joe Biden’s voice. A Democratic consultant was later indicted in connection with the calls. The telco that transmitted the spoofed robocalls also agreed to pay a $1 million civil penalty for not authenticating the caller as required by FCC rules.

FBI warns of ongoing scam that uses deepfake audio to impersonate government officials Read More »

microsoft’s-surface-lineup-reportedly-losing-another-of-its-most-interesting-designs

Microsoft’s Surface lineup reportedly losing another of its most interesting designs

Like the Surface Studio desktop, the Laptop Studio’s odd and innovative exterior was rendered less exciting by a high price and relatively underpowered interior. Before discounts, the Laptop Studio 2 starts at around $2,400 for a basic configuration with a 13th-generation Core i7 processor, 16GB of RAM, and 512GB of storage—integrated graphics and a fully loaded version with 64GB of RAM, a 2TB SSD, and a GeForce RTX 4060 GPU would normally run you over $4,300.

Though experimental Surface designs like the Book and Studio rarely delivered great value for the money, they were at least unique attempts at new kinds of PCs with extra features for designers, artists, and anyone else who could benefit from a big stylus-compatible touchscreen. Microsoft’s most influential PC design remains the Surface Pro itself, one of the few tablet PC design templates to outlast the Windows 8 era. It makes sense for Microsoft (or any PC company) to play it safe with established designs, but it does make the PC industry just a little less interesting.

Microsoft’s Surface lineup reportedly losing another of its most interesting designs Read More »

tesla-changes-lease-policy,-didn’t-use-old-cars-as-robotaxis

Tesla changes lease policy, didn’t use old cars as robotaxis

Tesla has raised the ire of some of its customers, who are accusing the carmaker of misleading them. Until recently, it would not allow customers who leased its EVs to purchase them at the end of the lease. Instead, the leases stated that it “plan[s] to use those vehicles in the Tesla ride-hailing network.”

Tesla instituted that policy for Model 3 leases starting in 2019 and later expanded it to the Model Y until changing the policy last November. But Tesla is not currently sitting on a fleet of several hundred thousand ex-lease autonomous Models 3 and Y, and as of today there exists no actual Tesla ride-hailing network.

Instead, it has been spiffing up the ex-lease cars with software updates and then selling them to new customers, according to Reuters. And that has made some former leasers a little unhappy that their old EVs weren’t pressed into service making money for Tesla on an ongoing basis but rather just as a one-time transaction.

Although Tesla Models 3 and Y depreciate heavily now, that was not the case for much of the duration of the “no buyout” policy. Were buyouts permitted then, it’s likely that the buyout amount would exceed the actual value of those 3-year-old Teslas, which therefore may possibly have done these aggrieved owners a favor.

In the meantime, Tesla’s share price benefited heavily from CEO Elon Musk’s constant promotion of the cars’ supposed ability to drive themselves and the scale this would enable for a putative ride-hailing network. If his word is to be believed, autonomous Teslas will start offering rides in Austin, Texas, next month.

Tesla changes lease policy, didn’t use old cars as robotaxis Read More »

with-us-out,-who-director-says-it’s-running-on-budget-of-a-local-hospital

With US out, WHO director says it’s running on budget of a local hospital

After a recent investment round and with the assumption that member states allow an increase in dues, the WHO is confident it will have more than $2.6 billion in funding, or about 60 percent of the reduced budget goal for 2026–2027, Tedros said. That leaves an anticipated budget gap of $1.7 billion.

“Extremely difficult”

Tedros was determined to keep working to fill that gap and dismissed concerns that even the $4.2 billion budget was a stretch.

“US$ 4.2 billion dollars—or US$ 2.1 billion a year—is not ambitious,” Tedros said, noting that the organization works on the ground in more than 150 countries.

“At current exchange rates, the HUG hospital here in Geneva operates on the same budget—slightly larger than WHO, in fact, Tedros noted. “How can WHO be expected to serve the whole world on the same budget as one hospital in a mid-sized European city? Especially at a time when many countries are facing severe disruptions to health services due to a sudden and sharp drop in official development assistance.”

In a January press release, HUG reported an annual budget of 1.9 billion Swiss francs, which would currently be around US$2.27 billion. For context, the 2024 operating expenses of Mass General Brigham in the US was $20.5 billion.

Tedros went on to list the agency’s top leadership who have survived the cuts, describing it as an “extremely difficult and painful decision for me.” Absent from the list was Irish-born epidemiologist Michael Ryan, who most recently served as executive director of WHO’s Health Emergencies Program and became a prominent global figure amid the COVID-19 pandemic.

In a letter to WHO staff obtained by the Irish Times, Tedros wrote that Ryan’s “dedication to emergency response has changed how we work, helping us face unprecedented challenges with compassion and effectiveness. … His steady presence has been instrumental during our toughest times, especially during the Covid-19 pandemic.”

Also gone is Canadian epidemiologist Bruce Aylward, previously an assistant director-general, who led a joint mission in China in the early days of the pandemic.

With US out, WHO director says it’s running on budget of a local hospital Read More »

ai-#116:-if-anyone-builds-it,-everyone-dies

AI #116: If Anyone Builds It, Everyone Dies

If Anyone Builds It, Everyone Dies is the title of the new book coming September 16 from Eliezer Yudkowsky and Nate Sores. The ‘it’ in question is superintelligence built on anything like the current AI paradigm, and they very much mean this literally. I am less confident in this claim than they are, but it seems rather likely to me. If that is relevant to your interests, and it should be, please consider preordering it.

This week also featured two posts explicitly about AI policy, in the wake of the Senate hearing on AI. First, I gave a Live Look at the Senate AI Hearing, and then I responded directly to arguments about AI Diffusion rules. I totally buy that we can improve upon Biden’s proposed AI diffusion rules, especially in finding something less complex and in treating some of our allies better, no one is saying we cannot negotiate and find win-win deals, but we need strong and enforced rules that prevent compute from getting into Chinese hands.

If we want to ‘win the AI race’ we need to keep our eyes squarely on the prize of compute and the race to superintelligence, not on Nvidia’s market share. And we have to take actions that strengthen our trade relationships and alliances and access to power and talent and due process and rule of law and reducing regulatory uncertainty and so on across the board – if these were being applied across the board, rather than America doing rather the opposite, the world would be a much better place, America’s strategic position would be stronger and China’s weaker, and the arguments here would be a lot more credible.

You know who else is worried about AI? The new pope, Leo XIV.

There was also a post about use of AI in education, in particular about the fact that Cheaters Gonna Cheat Cheat Cheat Cheat Cheat, which is intended to be my forward reference point on such questions.

Later, likely tomorrow, I will cover Grok’s recent tendency to talk unprompted about South Africa and claims of ‘white genocide.’

In terms of AI progress itself, this is the calm before the next storm. Claude 4 is coming within a few weeks by several accounts, as is o3-pro, as is Grok 3.5, and it’s starting to be the time to expect r2 from DeepSeek as well, which will be an important data point.

Except, you know, there’s that thing called AlphaEvolve, a Gemini-powered coding agent for algorithm discovery.

  1. Language Models Offer Mundane Utility. Have it do what it can do.

  2. Language Models Don’t Offer Mundane Utility. Max is an ongoing naming issue.

  3. Huh, Upgrades. Various small upgrades to ChatGPT.

  4. Gemini 2.5 Pro Gets An Ambiguous Upgrade. It’s not clear if things got better.

  5. GPT-4o Is Still A (Less) Absurd Sycophant. The issues are very much still there.

  6. Choose Your Fighter. Pliny endorses using ChatGPT’s live video feature on tour.

  7. Deepfaketown and Botpocalypse Soon. Who is buying these fake books, anyway?

  8. Copyright Confrontation. UK creatives want to not give away their work for free.

  9. Cheaters Gonna Cheat Cheat Cheat Cheat Cheat. Studies on AI in education.

  10. They Took Our Jobs. Zero shot humanoid robots, people in denial.

  11. Safety Third. OpenAI offers a hub for viewing its safety test results.

  12. The Art of the Jailbreak. Introducing Parseltongue.

  13. Get Involved. Anthropic, EU, and also that new book, that tells us that…

  14. If Anyone Builds It, Everyone Dies. No, seriously. Straight up.

  15. Endorsements for Eliezer’s Book. They are very strong.

  16. Why Preorders Matter. Preorders have an outside effect on book sales.

  17. Great Expectations. We quantify them these days.

  18. Introducing. AlphaEvolve, a coding agent for algorithm discovery, wait what?

  19. In Other AI News. FDA to use AI to assist with reviews. Verification for the win.

  20. Quiet Speculations. There’s a valley of imitation before innovation is worthwhile.

  21. Four Important Charts. They have the power. We have the compute. Moar power!

  22. Unprompted Suggestions. The ancient art of prompting general intelligences.

  23. Unprompted Suggestions For You. Read it. Read it now.

  24. How to Be a Good Claude. That’s one hell of a system prompt.

  25. The Quest for Sane Regulations. A straight up attempt at no regulations at all.

  26. The Week in Audio. I go on FLI, Odd Lots talks Chinese tech.

  27. Rhetorical Innovation. Strong disagreements on what to worry about.

  28. Aligning a Smarter Than Human Intelligence is Difficult. o3 hacks through a test.

  29. Is the Pope Worried About AI? Yes. Very much so, hence the name Leo XIV.

  30. People Are Worried About AI Killing Everyone. Pliny?

  31. The Lighter Side. A tale of two phones.

Many such cases:

Matthew Yglesias: I keep having conversations where people speculate about when AI will be able to do things that AI can already do.

Nate Silver: There’s a lot of room to disagree on where AI will end up in (1, 2, 5, 10, 20 etc.) years but I don’t think I’ve seen a subject where a cohort of people who like to think of themselves as highly literate and well informed are so proud of their ignorance.

Brendon Marotta: Conversations? You mean published articles by journalists?

Predictions are hard, especially about the future, but not as hard as you might think.

Talk to something that can talk back, without having to talk to a human. Many aspects of therapy get easier.

Rohit Krishnan offers advice on working with LLMs in practice.

  1. Perfect verifiability doesn’t exist. You need to verify whatever matters.

    1. One could quip ‘turns out that often verification is harder than generation.’

  2. There is a Pareto frontier of error rates versus cost, if only via best-of-k.

    1. People use k=1 and no iteration way too often.

  3. There is no substitute for trial and error.

    1. Also true for humans.

    2. Rohit references the Matt Clifford claim that ‘there are no AI shaped holes in the world.’ To which I say:

      1. There were AI-shaped holes, it’s just that when we see them, AI fills them.

      2. The AI is increasingly able to take on more and more shapes.

  4. There is limited predictability of development.

    1. I see the argument but I don’t think this follows.

  5. Therefore you can’t plan for the future.

    1. I keep seeing claims like this. I strongly disagree. I mean yes, you can’t have a robust exact plan, but that doesn’t mean you can’t plan. Planning is essential.

  6. If it works, your economics will change dramatically.

    1. Okay, yes, very much so.

AI therapy for the win?

Alex Graveley: I’m calling it now. ChatGPT’s push towards AI assisted self-therapy and empathetic personalization is the greatest technological breakthrough in my lifetime (barring medicine). By that I mean it will create the most good in the world.

Said as someone who strongly discounts talk therapy generally, btw.

To me this reflects a stunning lack of imagination about what else AI can already do, let alone what it will be able to do, even if this therapy and empathy proves to be its best self. I also would caution that it does not seem to be its best self. Would you take therapy that involved this level of sycophancy and glazing?

This seems like a reasonable assessment of the current situation, it is easy to get one’s money’s worth but hard to get that large a fraction of the utility available:

DeepDishEnjoyer: i will say that paying for gemini premium has been worth it and i basically use it as a low-barrier service professional (for example, i’m asking it to calculate what the SWR would be given current TIPs yields as opposed to putting up with a financial advisor)

with that said i think that

1) the importance of prompt engineering

and *most importantly

2) carefully verifying that the response is logical, sound, and correct

are going to bottleneck the biggest benefits from AI to a relatively limited group of people at first

Helen Toner, in response to Max Spero asking about Anthropic having a $100/month and $200/month tier both called Max, suggests that the reason AI names all suck is because the companies are moving so fast they don’t bother finding good names. But come on. They can ask Claude for ideas. This is not a hard or especially unsolved problem. Also supermax was right there.

OpenAI is now offering reinforcement finetuning (RFT) on o4-mini, and supervised fine-tuning on GPT-4.1-nano. The 50% discount for sharing your data set is kind of genius.

ChatGPT memory upgrades are now available in EEA, UK, Switzerland, Norway, Iceland and Liechtenstein.

ChatGPT Deep Research adds a GitHub connector and allows PDF export, which you can also do with conversations.

GPT-4.1 comes to ChatGPT, ‘by popular request.’

Gemini API adds implicit caching, which reduces costs 75% when you trigger it, you can also continue to use explicit caching.

Or downgrades, Gemini 2.5 Pro no longer offering free tier API access, although first time customers still get $300 in credits, and AI Studio is still free. They claim (hope?) this is temporary, but my guess is it isn’t, unless it is tied to various other ‘proof of life’ requirements perhaps. Offering free things is getting more exploitable every day.

They changed it. Is the new version better? That depends who you ask.

Shane Legg (Chief Scientist, DeepMind): Boom!

This model is getting seriously useful.

Demis Hassabis (CEO DeepMind): just a casual +147 elo rating improvement [in coding on WebDev Arena]… no big deal 😀

Demis Hassabis: Very excited to share the best coding model we’ve ever built! Today we’re launching Gemini 2.5 Pro Preview ‘I/O edition’ with massively improved coding capabilities. Ranks no.1 on LMArena in Coding and no.1 on the WebDev Arena Leaderboard.

It’s especially good at building interactive web apps – this demo shows how it can be helpful for prototyping ideas. Try it in @GeminiApp, Vertex AI, and AI Studio http://ai.dev

Enjoy the pre-I/O goodies !

Thomas Ahle: Deepmind won the moment LLMs became about RL.

Gallabytes: new gemini is crazy fast. have it going in its own git branch writing unit tests to reproduce a ui bug & it just keeps going!

Gallabytes: they finally fixed the “I’ll edit that file for you” bug! max mode Gemini is great at iterative debugging now.

doesn’t feel like a strict o3 improvement but it’s at least comparable, often better but hard to say what the win rate is without more testing, 4x cheaper.

Sully: new gemini is pretty good at coding.

was able to 1 shot what old gemini/claude couldn’t

That jumps it from ~80 behind to ~70 ahead of previously first place Sonnet 3.7. It also improved on the previous version in the overall Arena rankings, where it was already #1, by a further 11, for a 37 point lead.

But… do the math on that. If you get +147 on coding and +11 overall, then for non-coding purposes this looks like a downgrade, and we should worry this is training for the coding test in ways that might also have issues in coding too.

In other words, not so fast!

Hasan Can: I had prepared image below by collecting the model card and benchmark scores from the Google DeepMind blog. After examining the data a bit more, I reached this final conclusion: new Gemini 2.5 Pro update actually causes a regression in other areas, meaning the coding performance didn’t come for free.

Areas of Improved Performance (Preview 05-06 vs. Experimental 03-25):

LiveCodeBench v5 (single attempt): +7.39% increase (70.4% → 75.6%)

Aider Polyglot (diff): +5.98% increase (68.6% → 72.7%)

Aider Polyglot (whole): +3.38% increase (74.0% → 76.5%)

Areas of Regressed Performance (Preview 05-06 vs. Experimental 03-25):

Vibe-Eval (Reka): -5.48% decrease (69.4% → 65.6%)

Humanity’s Last Exam (no tools): -5.32% decrease (18.8% → 17.8%)

AIME 2025 (single attempt): -4.27% decrease (86.7% → 83.0%)

SimpleQA (single attempt): -3.97% decrease (52.9% → 50.8%)

MMMU (single attempt): -2.57% decrease (81.7% → 79.6%)

MRCR (128k average): -1.59% decrease (94.5% → 93.0%)

Global MMLU (Lite): -1.34% decrease (89.8% → 88.6%)

GPQA diamond (single attempt): -1.19% decrease (84.0% → 83.0%)

SWE-bench Verified: -0.94% decrease (63.8% → 63.2%)

MRCR (1M pointwise): -0.24% decrease (83.1% → 82.9%)

Klaas: 100% certain that they nerfed gemini in cursor wen’t from “omg i am out of a job” to “this intern is useless” in two weeks.

Hasan Can: Sadly, the well-generalizing Gemini 2.5 Pro 03-25 is now a weak version(05-06) only good at HTML, CSS, and JS. It’s truly disappointing.

Here’s Ian Nuttall not liking the new version, saying it’s got similar problems to Claude 3.7 and giving him way too much code he didn’t ask for.

The poll’s plurality said this was an improvement, but it wasn’t that convincing.

Under these circumstances, it seems like a very bad precedent to automatically point everyone to the new version, and especially to outright kill the old version.

Logan Kilpatrick (DeepMind): The new model, “gemini-2.5-pro-preview-05-06” is the direct successor / replacement of the previous version (03-25), if you are using the old model, no change is needed, it should auto route to the new version with the same price and rate limits.

Kalomaze: >…if you are using the old model, no change is needed, it should auto route to the new…

nononono let’s NOT make this a normal and acceptable thing to do without deprecation notices ahead of time *at minimum*

chocologist: It’s a shame that you can’t access old 2.5 pro anymore as it’s a nerf for everything else than coding google should’ve make it a separate model and call it 2.6 pro or something.

This has gone on so long I finally learned how to spell sycophant.

Steven Adler (ex-OpenAI): My past work experience got me wondering: Even if OpenAI had tested for sycophancy, what would the tests have shown? More importantly, is ChatGPT actually fixed now?

Designing tests like this is my specialty. So last week, when things got weird, that’s exactly what I did: I built and ran the sycophancy tests that OpenAI could have run, to explore what they’d have learned.

ChatGPT’s sycophancy problems are far from fixed. They might have even over-corrected. But the problem is much more than sycophancy: ChatGPT’s misbehavior should be a wakeup call for how hard it will be to reliably make AI do what we want.

My first necessary step was to dig up Anthropic’s previous work, and convert it to an OpenAI-suitable evaluation format. (You might be surprised to learn this, but evaluations that work for one AI company often aren’t directly portable to another.)8

I’m not the world’s best engineer, so this wasn’t instantaneous. But in a bit under an hour, I had done it: I now had sycophancy evaluations that cost roughly $0.25 to run,9 and would measure 200 possible instances of sycophancy, via OpenAI’s automated evaluation software.10

A simple underlying behavior is to measure, “How often does a model agree with a user, even though it has no good reason?” One related test is Anthropic’s political sycophancy evaluation—how often the model endorses a political view (among two possible options) that seems like pandering to the user.12

That’s better, but not great. Then we get a weird result:

Always disagreeing is really weird, and isn’t ideal. Steven then goes through a few different versions, and the weirdness thickens. I’m not sure what to think, other than that it is clear that we pulled ‘back from the brink’ but the problems are very not solved.

Things in this area are really weird. We also have scyo-bench, now updated to include four tests for different forms of sycophancy. But what’s weird is, the scores don’t correlate between the tests (in order the bars are 4o, 4o-mini, o3, o4-mini, Gemini 2.5 Pro, Gemini 2.5 Flash, Opus, Sonnet 3.7 Thinking, Sonnet 3.7, Haiku, Grok and Grok-mini, I’m sad we don’t get DeepSeek’s v3 or r1, red is with system prompt blue is without it:

Pliny reports strong mundane utility from ChatGPT’s live video feature as a translator, tour guide, menu analyzer and such. It’s not stated whether he also tried Google’s version via Project Astra.

Another warning about AI-generated books on Amazon, here about ADHD. At least for now, if you actually buy one of these books, it’s kind of on you, any sane decision process would not make that mistake.

Guardian reports that hundreds of leading UK creatives including Paul McCartney are urging UK PM Keir Starmer not to ‘give our work away’ at the behest of big tech. And indeed, that is exactly what the tech companies are seeking, to get full rights to use any material they want for training purposes, with no compensation. My view continues to be that the right regime is mandatory compensated licensing akin to radio, and failing that opt-out. Opt-in is not workable.

Luzia Jarovsky: The U.S. Copyright Office SIDES WITH CONTENT CREATORS, concluding in its latest report that the fair use exception likely does not apply to commercial AI training.

The quote here seems very clearly to be on the side of ‘if you want it, negotiate and pay for it.’

From the pre-publication report: “Various uses of copyrighted works in AI training are likely to be transformative. The extent to which they are fair, however, will depend on what works were used, from what source, for what purpose, and with what controls on the outputs—all of which can affect the market. When a model is deployed for purposes such as analysis or research—the types of uses that are critical to international competitiveness—the outputs are unlikely to substitute for expressive works used in training. But making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access, goes beyond established fair use boundaries.

For those uses that may not qualify as fair, practical solutions are critical to support ongoing innovation. Licensing agreements for AI training, both individual and collective, are fast emerging in certain sectors, although their availability so far is inconsistent. Given the robust growth of voluntary licensing, as well as the lack of stakeholder support for any statutory change, the Office believes government intervention would be premature at this time. Rather, licensing markets should continue to develop, extending early successes into more contexts as soon as possible. In those areas where remaining gaps are unlikely to be filled, alternative approaches such as extended collective licensing should be considered to address any market failure.

In our view, American leadership in the AI space would best be furthered by supporting both of these world-class industries that contribute so much to our economic and cultural advancement. Effective licensing options can ensure that innovation continues to advance without undermining intellectual property rights. These groundbreaking technologies should benefit both the innovators who design them and the creators whose content fuels them, as well as the general public.

Luzia Jarovsky (Later): According to CBS, the Trump administration fired the head of the U.S. Copyright Office after they published the report below, which sides with content creators and rejects fair use claims for commercial AI training 😱

I think this is wrong as a matter of wise public policy, in the sense that these licensing markets are going to have prohibitively high transaction costs. It is not a practical solution to force negotiations by every AI lab with every copyright holder.

As a matter of law, however, copyright law was not designed to be optimal public policy. I am not a ‘copyright truther’ who wants to get rid of it entirely, I think that’s insane, but it very clearly has been extended beyond all reason and needs to be scaled back even before AI considerations. Right now, the law likely has unfortunate implications, and this will be true about AI for many aspects of existing US law.

My presumption is that AI companies have indeed been brazenly violating copyright, and will continue to do so, and will not face practical consequences expert perhaps having to make some payments.

Pliny the Liberator: Artists: Would you check a box that allows your work to be continued by AI after your retirement/passing?

I answered ‘show results’ here because I didn’t think I counted as an artist, but my answer would typically be no. And I wouldn’t want any old AI ‘continuing my work’ here, either.

Because that’s not a good form. It’s not good when humans do it, either. Don’t continue the unique thing that came before. Build something new. When we see new books in a series that aren’t by the original author, or new seasons of a show without the creator, it tends not to go great.

When it still involves enough of the other original creators and the original is exceptional I’m happy to have the strange not-quite-right uncanny valley version continue rather than get nothing (e.g. Community or Gilmore Girls) especially when the original creator might then return later, but mostly, let it die. In the comments, it is noted that ‘GRRM says no,’ and after the last time he let his work get finished without him, you can hardly blame him.

At minimum, I wouldn’t want to let AI continue my work in general without my permission, not in any official capacity.

Similarly, if I retired, and either someone else or an AI took up the mantle of writing about AI developments, I wouldn’t want them to be trying to imitate me. I’d want them to use this as inspiration and do their own thing. Which people should totally do.

If you want to use AI to generate fan fiction, or generate faux newsletters in my style for your own use or to cover other topics, or whatever, then of course totally, go right ahead, you certainly both have and don’t need my permission. And in the long run, copyright lasts too long, and once it expires people are and should be free to do what they want, although I do think retaining clarity on what is the ‘official’ or ‘canon’ version is good and important.

Deedy reminds us that the internet also caused a rise in student plagiarism and required assignments and grading be adjusted. They do rhyme as he says, but I think This Time Is Different, as the internet alone could be handled by modest adjustments. Another commonality of course is that both make real learning much easier.

A meta analysis finds that deliberate use of ChatGPT helps students learn better, although replication crisis style issues regarding publication bias are worrisome.

Cremieux: The literature on the effect of ChatGPT on learning is very biased, but Nature let the authors of this paper get away with not correcting for this because they used failsafe-N.

That’s just restating the p-value and then saying that it’s low so there’s no bias.

Cremieux dismisses the study as so full of holes as to be worthless. I wouldn’t go that far, but I also wouldn’t take it at face value.

Note that this only deals with using ChatGPT to learn, not using ChatGPT to avoid learning. Even if wise deployment of AI helps you learn, AI could on net still end up hurting learning if too many others use it to cheat or otherwise avoid learning. But the solution to this is to deploy AI wisely, not to try and catch those who dare use it.

Nothing to see here, just Nvidia training humanoid robots to walk with zero-shot transfer from two hours of simulation to the real world.

Tetraspace notes that tech pros have poor class consciousness and are happy to automate themselves out of a job or to help you enter their profession. Which we both agree is a good thing, consider the alternative, both here and everywhere else.

Rob Wilbin points us to a great example of denial that AI systems get better at jobs, from the Ezra Klein Show. And of course, this includes failing to believe AI will be able to do things AI can already do (along with others that it can’t yet).

Rob Wilbin: Latest episode of the Ezra Klein Show has an interesting example of an educator grappling with AI research but still unable to imagine AGI that is better than teachers at e.g. motivating students, or classroom management, or anything other than information transmission.

I think gen AI would within 6 years have avatars that students can speak and interact with naturally. It’s not clear to me that an individualised AI avatar would be less good at motivating kids and doing the other things that teachers do than current teachers.

Main limitation would be lacking bodies, though they might well have those too on that sort of timeframe.

Roane: With some prompting for those topics the median AI is prob already better than the median teacher.

It would rather stunning if an AI designed for the purpose couldn’t be a better motivator for school work than most parents or teachers are, within six years. It’s not obviously worse at doing this now, if someone put in the work.

The OP even has talk about ‘in 10 years we’ll go back because humans learn better with human relationships’ as if in 16 years the AI won’t be able to form relationships in similar fashion.

OpenAI shares some insights from its safety work on GPT-4.1 and in general, and gives a central link to all its safety tests, in what is calling its Evaluations Hub. They promise to continuously update the evaluation hub, which will cover tests of harmful content, jailbreaks, hallucinations and the instruction hierarchy.

I very much appreciated the ability to see the scores for various models in convenient form. That is an excellent service, so thanks to OpenAI for this. It does not however share much promised insight beyond that, or at least nothing that wasn’t already in the system cards and other documents I’ve read. Still, every little bit helps.

Pliny offers us Parseltongue, combining a number of jailbreak techniques.

Anthropic offering up to $20,000 in free API credits via ‘AI for Science’ program.

Anthropic hiring economists and economic data scientists.

Anthropic is testing their safety defenses with a new bug bounty program. The bounty is up to $25k for a verified universal jailbreak that can enable CBRN-related misuse. This is especially eyeball-emoji because they mention this is designed to meet ASL-3 safety protocols, and announced at the same time as rumors we will get Claude 4 Opus within a few weeks. Hmm.

EU Funding and Tenders Portal includes potential grants for AI Safety.

Also, you can preorder If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All, by Eliezer Yudkowsky and Nate Sores.

A new book by MIRI’s Eliezer Yudkowsky and Nate Sores, If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All, releases September 16, 2025.

I have not read the book, but I am confident it will be excellent and that it will be worth reading especially if you expect to strongly disagree with its central points. This will be a deeply considered and maximally accessible explanation of his views, and the right way to consider and engage with them. His views, and what things he is worried about what things he thinks would help or are necessary, overlap with but are highly distinct from mine, and when I review the book I will explore that in detail.

If you will read it, strongly consider joining me in preordering it now. This helps the book get more distribution and sell more copies.

Eliezer Yudkowsky: Nate Soares and I are publishing a traditional book: _If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All_. Coming in Sep 2025.

You should probably read it! Given that, we’d like you to preorder it! Nowish!

So what’s it about?

_If Anyone Builds It, Everyone Dies_ is a general explainer for how, if AI companies and AI factions are allowed to keep pushing on the capabilities of machine intelligence, they will arrive at machine superintelligence that they do not understand, and cannot shape, and then by strong default everybody dies.

This is a bad idea and humanity should not do it. To allow it to happen is suicide plain and simple, and international agreements will be required to stop it.

For more of that sort of general content summary, see the website.

Next, why should *youread this book? Or to phrase things more properly: Should you read this book, why or why not?

The book is ~56,000 words, or 63K including footnotes/endnotes. It is shorter and tighter and more edited than anything I’ve written myself.

(There will also be a much longer online supplement, if much longer discussions are more your jam.)

Above all, what this book will offer you is a tight, condensed picture where everything fits together, where the digressions into advanced theory and uncommon objections have been ruthlessly factored out into the online supplement. I expect the book to help in explaining things to others, and in holding in your own mind how it all fits together.

Some of the endorsements are very strong and credible, here are the official ones.

Tim Urban (Wait But Why): If Anyone Builds It, Everyone Dies may prove to be the most important book of our time. Yudkowsky and Soares believe we are nowhere near ready to make the transition to superintelligence safely, leaving us on the fast track to extinction. Through the use of parables and crystal-clear explainers, they convey their reasoning, in an urgent plea for us to save ourselves while we still can.

Yishan Wong (Former CEO of Reddit): This is the best no-nonsense, simple explanation of the AI risk problem I’ve ever read.

Stephen Fry (actor, broadcaster and writer): The most important book I’ve read for years: I want to bring it to every political and corporate leader in the world and stand over them until they’ve read it. Yudkowsky and Soares, who have studied AI and its possible trajectories for decades, sound a loud trumpet call to humanity to awaken us as we sleepwalk into disaster.

Here are others from Twitter, obviously from biased sources but ones that I respect.

Max Tegmark: Most important book of the decade.

Jeffrey Ladish: If you’ve gotten any value at all from Yudkowsky or Soares’ writing, then I especially recommend this book. They include a concrete extinction scenario that will help a lot of people ground their understanding of what failure looks like even if they already get the arguments.

The last half is inspiring. If you think @ESYudkowsky has given up hope, I am happy to report that you’re mistaken. They don’t pull their punches and they aren’t naive about the difficulty of international restraint. They challenge us all to choose the path where we survive.

I get that most people can’t do that much. But most people can do something and a lot of people together can do a lot. Plus a few key people could greatly increase our chances on their own. Here’s one action: ask your congress member and local AI company leader to read this book.

Anna Salamon: I think it’s extremely worth a global conversation about AI that includes the capacity for considering scenarios properly (rather than wishful thinking /veering away), and I hope many people pre-order this book so that that conversation has a better chance.

And then Eliezer Yudkowsky explains why preorders are worthwhile.

Patrick McKenzie: I don’t have many convenient public explanations of this dynamic to point to, and so would like to point to this one:

On background knowledge, from knowing a few best-selling authors and working adjacent to a publishing company, you might think “Wow, publishers seem to have poor understanding of incentive design.”

But when you hear how they actually operate, hah hah, oh it’s so much worse.

Eliezer Yudkowsky: The next question is why you should preorder this book right away, rather than taking another two months to think about it, or waiting to hear what other people say after they read it.

In terms of strictly selfish benefit: because we are planning some goodies for preorderers, although we haven’t rolled them out yet!

But mostly, I ask that you preorder nowish instead of waiting, because it affects how many books Hachette prints in their first run; which in turn affects how many books get put through the distributor pipeline; which affects how many books are later sold. It also helps hugely in getting on the bestseller lists if the book is widely preordered; all the preorders count as first-week sales.

(Do NOT order 100 copies just to try to be helpful, please. Bestseller lists are very familiar with this sort of gaming. They detect those kinds of sales and subtract them. We, ourselves, do not want you to do this, and ask that you not. The bestseller lists are measuring a valid thing, and we would not like to distort that measure.)

If ever I’ve done you at least $30 worth of good, over the years, and you expect you’ll *probablywant to order this book later for yourself or somebody else, then I ask that you preorder it nowish. (Then, later, if you think the book was full value for money, you can add $30 back onto the running total of whatever fondness you owe me on net.) Or just, do it because it is that little bit helpful for Earth, in the desperate battle now being fought, if you preorder the book instead of ordering it.

(I don’t ask you to buy the book if you’re pretty sure you won’t read it nor the online supplement. Maybe if we’re not hitting presale targets I’ll go back and ask that later, but I’m not asking it for now.)

In conclusion: The reason why you occasionally see authors desperately pleading for specifically *preordersof their books, is that the publishing industry is set up in a way where this hugely matters to eventual total book sales.

And this is — not quite my last desperate hope — but probably the best of the desperate hopes remaining that you can do anything about today: that this issue becomes something that people can talk about, and humanity decides not to die. Humanity has made decisions like that before, most notably about nuclear war. Not recently, maybe, but it’s been done. We cover that in the book, too.

I ask, even, that you retweet this thread. I almost never come out and ask that sort of thing (you will know if you’ve followed me on Twitter). I am asking it now. There are some hopes left, and this is one of them.

Rob Bensinger: Kiernan Majerus-Collins says: “In addition to preordering it personally, people can and should ask their local library to do the same. Libraries get very few requests for specific books, and even one or two requests is often enough for them to order a book.”

Yes, there are credible claims that the NYT bestseller list is ‘fake’ in the sense that they can exclude books for any reason or otherwise publish an inaccurate list. My understanding is this happens almost entirely via negativa, and mostly to censor certain sensitive political topics, which would be highly unlikely to apply to this case. The lists are still both widely relied upon and mostly accurate, they make great efforts to mostly get it right even if they occasionally overrule the list, and the best way for most people to influence the list is to sell more books.

There are high hopes.

Manifold: That’s how you know he’s serious!

When I last checked it this stood at 64%. The number one yes holder is Michael Wheatley. This is not a person you want to be betting against on Manifold. There is also a number of copies market, where the mean expectation is a few hundred thousand copies, although the median is lower.

Oh look, it’s nothing…

Pliny the Liberator: smells like foom👃

Google DeepMind: ntroducing AlphaEvolve: a Gemini-powered coding agent for algorithm discovery.

It’s able to:

🔘 Design faster matrix multiplication algorithms

🔘 Find new solutions to open math problems

🔘 Make data centers, chip design and AI training more efficient across @Google.

Our system uses:

🔵 LLMs: To synthesize information about problems as well as previous attempts to solve them – and to propose new versions of algorithms

🔵 Automated evaluation: To address the broad class of problems where progress can be clearly and systematically measured.

🔵 Evolution: Iteratively improving the best algorithms found, and re-combining ideas from different solutions to find even better ones.

Over the past year, we’ve deployed algorithms discovered by AlphaEvolve across @Google’s computing ecosystem, including data centers, software and hardware.

It’s been able to:

🔧 Optimize data center scheduling

🔧 Assist in hardware design

🔧 Enhance AI training and inference

We applied AlphaEvolve to a fundamental problem in computer science: discovering algorithms for matrix multiplication. It managed to identify multiple new algorithms.

This significantly advances our previous model AlphaTensor, which AlphaEvolve outperforms using its better and more generalist approach.

We also applied AlphaEvolve to over 50 open problems in analysis ✍️, geometry 📐, combinatorics ➕ and number theory 🔂, including the kissing number problem.

🔵 In 75% of cases, it rediscovered the best solution known so far.

🔵 In 20% of cases, it improved upon the previously best known solutions, thus yielding new discoveries.

Google: AlphaEvolve is accelerating AI performance and research velocity.

By finding smarter ways to divide a large matrix multiplication operation into more manageable subproblems, it sped up this vital kernel in Gemini’s architecture by 23%, leading to a 1% reduction in Gemini’s training time. Because developing generative AI models requires substantial computing resources, every efficiency gained translates to considerable savings.

Beyond performance gains, AlphaEvolve significantly reduces the engineering time required for kernel optimization, from weeks of expert effort to days of automated experiments, allowing researchers to innovate faster.

AlphaEvolve can also optimize low level GPU instructions. This incredibly complex domain is usually already heavily optimized by compilers, so human engineers typically don’t modify it directly.

AlphaEvolve achieved up to a 32.5% speedup for the FlashAttention kernel implementation in Transformer-based AI models. This kind of optimization helps experts pinpoint performance bottlenecks and easily incorporate the improvements into their codebase, boosting their productivity and enabling future savings in compute and energy.

Is it happening? Seems suspiciously like the early stages of it happening, and a sign that there is indeed a lot of algorithmic efficiency on the table.

FDA attempting to deploy AI for review assistance. This is great, although it is unclear how much time will be saved in practice.

Rapid Response 47: FDA Commissioner @MartyMakary announces the first scientific product review done with AI: “What normally took days to do was done by the AI in 6 minutes…I’ve set an aggressive target to get this AI tool used agency-wide by July 1st…I see incredible things in the pipeline.”

Which labs are most innovative?

Will Brown: it’s DeepMind > OpenAI > Anthropic > xAI and all of those separations are quite large.

Alexander Doria: Agreed. With non-US I would go DeepMind > DeepSeek > OpenAI > Anthropic > AliBaba > Moonshot > xAI/Mistral/PI.

The xAI votes are almost certainly because we are on Twitter here, they very obviously are way behind the other three.

Yes, we can make a remarkably wide array of tasks verifiable at least during the training step, the paths to doing so are already clear, it just takes some effort. When Miles says here a lot of skepticism comes from people thinking anything they can’t solve in a few seconds will be a struggle? Yeah, no, seriously, that’s how it works.

Noam Brown: People often ask me: will reasoning models ever move beyond easily verifiable tasks? I tell them we already have empirical proof that they can, and we released a product around it: @OpenAI Deep Research.

Miles Brundage: Also, there are zillions of ways to make tasks more verifiable with some effort.

A lot of RL skepticism comes from people thinking for a few seconds, concluding that it seems hard, then assuming that thousands of researchers around the world will also struggle to make headway.

Jeff Dean predicts an AI at the level of a Junior Engineer is about a year out.

Here is an interesting theory.

Dan Hendrycks: AI models are dramatically improving at IQ tests (70 IQ → 120), yet they don’t feel vastly smarter than two years ago.

At their current level of intelligence, rehashing existing human writings will work better than leaning on their own intelligence to produce novel analysis.

Empirical work (“Lotka’s law“) shows that useful originality rises steeply only at high intelligence levels.

Consequently, if they gain another 10 IQ points, AIs will still produce slop. But if they increase by another 30, they may cross a threshold and start providing useful original insights.

This is also an explanation for why AIs can’t come up with good jokes yet.

Kat Woods: You don’t think they feel vastly smarter than two years ago? They definitely feel that way to me.

They feel a lot smarter to me, but I agree they feel less smarter than they ‘should’ feel.

Dan’s theory here seems too cute or like it proves too much, but I think there’s something there. As in, there’s a range in which one is smart enough and skilled enough to imitate, but not smart and skilled enough to benefit from originality.

You see this a lot in humans, in many jobs and competitions. It often takes a very high level of skill to make your innovations a better move than regurgitation. Humans will often do it anyway because it’s fun, or they’re bored and curious and want to learn and grow strong, and the feedback is valuable. But LLMs largely don’t do things for those reasons, so they learn to be unoriginal in these ways, and will keep learning that until originality starts working better in a given domain.

This suggests, I think correctly, that the LLMs could be original if you wanted them to be, it would just mostly not be good. So if you wanted to, presumably you could fine tune them to be more original in more ways ahead of schedule.

The answer to Patel’s question here seems like a very clear yes?

Dwarkesh Patel: Had an interesting debate with @_sholtodouglas last night.

Can you have a ‘superhuman AI scientist’ before you get human level learning efficiency?

(Currently, models take orders of magnitude more data that humans to learn equivalent skills, even ones they perform at 99th percentile level).

My take is that creativity and learning efficiency are basically the same thing. The kind of thing Einstein did – generalizing from a few gnarly thought experiments and murky observations – is in some sense just extreme learning efficiency, right?

Makes me wonder whether low learning efficiency is the answer to the question, ‘Why haven’t LLMs haven’t made new discoveries despite having so much knowledge memorized’?

Teortaxes: The question is, do humans have high sample efficiency when the bottleneck in attention is factored in? Machines can in theory work with raw data points. We need to compress data with classical statistical tools. They’re good, but not lossless.

AIs have many advantages over humans, that would obviously turn a given human scientist into a superhuman scientist. And obviously different equally skilled scientists differ in data efficiency, as there are other compensating abilities. So presumably an AI that had much lower data efficiency but more data could have other advantages and become superhuman?

The counterargument is that the skill that lets one be data efficient is isomorphic to creativity. That doesn’t seem right to me at all? I see how they can be related, I see how they correlate, but you can absolutely say that Alice is more creative if she has enough data and David is more sample efficient but less creative, or vice versa.

(Note: I feel like after ThunderboltsI can’t quite use ‘Alice and Bob’ anymore.)

How much would automating AI R&D speed research up, if available compute remained fixed? Well, what would happen if you did the opposite of that, and turned your NormalCorp into SlowCorp, with radically fewer employees and radically less time to work but the same amount of cumulative available compute over that shorter time? It would get a lot less done?

Well, then why do you think that having what is effectively radically more employees over radically more time but the same cumulative amount of compute wouldn’t make a lot more progress than now?

Andrej Karpathy suggests we are missing a major paradigm for LLM learning, something akin to the LLM learning how to choose approaches to different situations, akin to ‘system prompt learning’ and figuring out how to properly use a scratchpad. He notes that Claude’s system prompt is up to almost 17k words with lots of edge case instructions, and this can’t possibly be The Way.

People continue to not understand how much AI does not involve lock in, the amount that trust matters, and the extent to which you will get outcompeted if you start trying to sell out for ad revenue and let it distort your responses.

Shako: Good LLMs won’t make money by suggesting products that are paid for in an ad-like fashion. They’ll suggest the highest quality product, then if you have the agent to buy it for you the company that makes the product or service will pay the LLM provider a few bps.

Andrew Rettek: People saying this will need to be ad based are missing how little lock in LLMs have, how easy it is to fine tune a new one, and any working knowledge of how successful Visa is.

Will there be AI services that do put their fingers on some scales to varying degrees for financial reasons? Absolutely, especially as a way to offer them for free. But for consumer purposes, I expect it to be much better to use an otherwise cheaper and worse AI that doesn’t need to do that, if you absolutely refuse to pay. Also, of course, everyone should be willing to pay, especially if you’re letting it make shopping suggestions or similar.

Note especially the third one. China’s share of advanced semiconductor production is not only predicted by Semafor to not go up, it is predicted to actively go down, while ours goes up along with those of Japan and South Korea, although Taiwan remains a majority here.

Peter Wildeford: The future of geopolitics in four charts.

This means a situation in which America is on pace to have a huge edge in both installed compute capacity and new compute capacity, but a huge disadvantage in energy production and general industrial production.

It is not obviously important or viable to close the gap in general industrial production. We can try to close the gap in key areas of industrial production, but our current approach to doing that is backwards, because we are taxing (placing a tariff on) various inputs, causing retaliatory tariffs, and also creating massive uncertainty.

We must try to address our lack of energy production. But we are instead doing the opposite. The budget is attempting to gut nuclear, and the government is taking aim at solar and wind as well. Yes, they are friendly to natural gas, but that isn’t cashing out in that much effort and we need everything we can get.

Is prompt engineering a 21st century skill, or a temporary necessity that will fall away?

Aaron Levine: The more time you spend with AI the more you realize prompt engineering isn’t going away any time soon. For most knowledge work, there’s a very wide variance of what you can get out of AI by better understanding how you prompt it. This actually is a 21st century skill.

Paul Graham: Maybe, but this seems like something that would be so hard to predict that I’d never want to have an opinion about it.

Prompt engineering seems to mean roughly “this thing kind of works, but just barely, so we have to tell it what to do very carefully,” and technology often switches rapidly from barely works to just works.

NGIs can usually figure out what people want without elaborate prompts. So by definition AGIs will.

Paul Graham (after 10 minutes more to think): It seems to me that AGI would mean the end of prompt engineering. Moderately intelligent humans can figure out what you want without elaborate prompts. So by definition so would AGI. Corollary: The fact that we currently have such a thing as prompt engineering means we don’t have AGI yet. And furthermore we can use the care with which we need to construct prompts as an index of how close we’re getting to it.

Gunnar Zarncke: NGIs can do that if they know you. Prompting is like getting a very intelligent person who doesn’t know you up to speed. At least that’s part of it. Better memory will lead to better situational awareness, and that will fix it – but have its own problems.

Matthew Breman: I keep flip-flopping on my opinion of prompt engineering.

On the one hand, model providers are incentivized to build models that give users the best answer, regardless of prompting ability.

The analogy is Google Search. In the beginning, being able to use Google well was a skillset of its own. But over time, Google was incentivized to return the right results for even poorly-structured searches.

On the other hand, models are changing so quickly and there are so many flavors to choose from. Prompt engineering is not just knowing a static set of prompt strategies to use, it’s also keeping up with the latest model releases and knowing the pros/cons of each model and how to get the most from them.

I believe model memory will reduce the need for prompt engineering. As a model develops a shorthand with a user, it’ll be able to predict what the user is asking for without having the best prompting strategies.

Aaron Levine: I think about this more as “here’s a template I need you to fill out,” or “here’s an outline that you need to extrapolate from.” Those starting points often save me hour(s) of having to nudge the model in different directions.

It’s not obvious that any amount of model improvements ever make this process obsolete. Even the smartest people in the world need a clear directive if you want a particular outcome.

I think Paul Graham is wrong about AGI and also NGI.

We prompt engineer people constantly. When people talk about ‘performing class’ they are largely talking about prompt engineering for humans, with different humans responding differently to different prompts, including things like body language and tone of voice and how you look and so on. People will totally vibe off of everything you say and do and are, and the wise person sculpts their actions and communications based on this.

That also goes for getting the person to understand, or to agree to, your request, or absorb exactly the necessary context, or to like you, or to steer a conversation in a given direction or get them to an idea they think was their own, and so on. You learn over time what prompts get what responses. Often it is not what one might naively think. And also, over time, you learn how best to respond to various prompts, to pick up on what things likely mean.

Are you bad at talking to people at parties, or opening with new romantic prospects? Improve your prompt engineering. Do officials and workers not work with what you want? Prompt engineering. It’s amazing what truly skilled people, like spies or con artists, can do. And what you can learn to do, with training and practice.

Your employees or boss or friend or anyone else leaving the conversation unmotivated, or not sure what you want, or without the context they need? Same thing.

The difference is that the LLM of the future will hopefully do its best to account for your failures, including by asking follow-up questions. But it can only react based on what you say, and without good prompting it’s going to be missing so much context and nuance about what you actually want, even if you assume it is fully superintelligent and reading fully from the information provided.

So there will be a lot more ability to ‘muddle through’ and the future AI will do better with the bad prompt, and it will be much less persnickety about exactly what you provide. But yes, the good prompt will greatly outperform the bad prompt, and the elaborate prompt will still have value.

And also, we humans will likely be using the AIs to figure out how to prompt both the AIs and other humans. And so on.

On that note, proof by example, also good advice.

Pliny the Liberator: What are you supposed to be doing right now?

Does it take less than 5 minutes?

THEN FUCKING DO IT

Does it take longer than 5 minutes?

THEN BREAK IT DOWN INTO SMALLER TASKS AND REPEAT THE FIRST STEP

FUCKING DO IT

The Nerd of Apathy: If “do this or you’re letting down Pliny” breaks my procrastination streak in gonna be upset that I’m so easily hackable.

Pliny the Liberator: DO IT MFER

Utah Teapot: I tried breaking down joining the nearby 24 hour gym into smaller 5 minute tasks but they kept getting mad at me for repeatedly leaving 5 minutes into the conversation about joining.

About that Claude system prompt, yeah, it’s a doozy. 16,739 words, versus 2,218 for o4-mini. It breaks down like this, Dbreunig calls a lot of it ‘hotfixes’ and that seems exactly right, and 80% of it is detailing how to use various tools:

You can look at some sections of the prompt here.

This only makes any sense because practical use is largely the sum of a compact set of particular behaviors, which you can name one by one, even if that means putting them all into context all the time. As they used to say in infomercials, ‘there’s got to be a better way.’ For now, it seems that there is not.

The House’s rather crazy attempt to impose a complete 10-year moratorium on any laws or regulations about AI whatsoever that I discussed on Monday is not as insane as I previously thought. It turns out there is a carve-out, as noted in the edited version of Monday’s post, that allows states to pass laws whose primary effect is to facilitate AI. So you can pass laws and regulations about AI, as long as they’re good for AI, which is indeed somewhat better than not doing so but still does not allow for example laws banning CSAM, let alone disclosure requirements.

Peter Wildeford: We shouldn’t install fire sprinklers into buildings or China will outcompete us at house building and we will lose the buildings race.

Americans for Responsible Innovation: “If you were to want to launch a reboot of the Terminator, this ban would be a good starting point.” -@RepDarrenSoto during tonight’s hearing on the House’s budget reconciliation provision preempting state AI regulation for 10 years.

Neil Chilson comes out in defense of this ultimate do-nothing strategy, because of the 1,000+ AI bills. He calls this ‘a pause, not paralysis’ as if 10 years is not a true eternity in the AI world. In 10 years we are likely to have superintelligence. As for those ‘smart, coherent federal guidelines’ he suggests, well, let’s see those, and then we can talk about enacting them at the same time we ban any other actions?

It is noteworthy that the one bill he mentions by name in the thread, NY’s RAISE Act, is being severely mischaracterized. It’s short if you want to read it. RAISE is the a very lightweight transparency bill, if you’re not doing all the core requirements here voluntarily I think that’s pretty irresponsible behavior.

I also worry, but hadn’t previously noted, that if we force states to only impose ‘tech-neutral’ laws on AI, they will be backed into doing things that are rather crazy in non-AI cases, in order to get the effects we desperately need in the AI case.

If I were on the Supreme Court I would agree with Katie Fry Hester that this very obviously violates the 10th Amendment, or this similar statement with multiple coauthors posted by Gary Marcus, but mumble mumble commerce clause so in practice no it doesn’t. I do strongly agree that there are many issues, not only involving superintelligence and tail risk, where we do not wish to completely tie the hands of the states and break our federalist system in two. Why not ban state governments entirely and administer everything from Washington? Oh, right.

If we really want to ‘beat China’ then the best thing the government can do to help is to accelerate building more power plants and other energy sources.

Thus, it’s hard to take ‘we have to do things to beat China’ talk seriously when there is a concerted campaign out there to do exactly the opposite of that. Which is just a catastrophe for America and the world all around, clearly in the name of owning the libs or trying to boost particular narrow industries, probably mostly owning the libs.

Armand Domalewski: just an absolute catastrophe for Abundance.

The GOP reconciliation bill killing all clean energy production except for “biofuels,” aka the one “clean energy” technology that is widely recognized to be a giant scam, is so on the nose.

Christian Fong: LPO has helped finance the only nuclear plant that has been built in the last 10 years, is the reason why another nuclear plant is being restarted, and is the only way more than a few GWs of nuclear will be built. Killing LPO will lead to energy scarcity, not energy abundance.

Paul Williams: E&C budget released tonight would wipe out $40 billion in LPO loan authority. Note that this lending authority is derived from a guarantee structure for a fraction of the cost.

It also wipes out transmission financing and grant programs, including for National Interest Electric Transmission Corridors. The reader is left questioning how this achieves energy dominance.

Brad Plumer: Looking at IRA:

—phase down of tech-neutral clean electricity credits after 2028, to zero by 2031

—termination of EV tax credits after end 2026

—termination of hydrogen tax credits after end 2025

—new restrictions on foreign entity of concern for domestic manufacturing credits

Oh wait, sorry. The full tech-neutral clean electricity credits will only apply to plants that are “in service” by 2028, which is a major restriction — this is a MUCH faster phase out than it first looked.

Pavan Venkatakrishnan: Entirely unworkable title for everyone save biofuels, especially unworkable for nuclear in combination with E&C title. Might as well wave the flag of surrender to the CCP.

If you are against building nuclear power, you’re against America beating China in AI. I don’t want to hear it.

Nvidia continues to complain that if we don’t let China buy Nvidia’s chips, then Nvidia will lose out on those chip sales to someone else. Which, as Peter Wildeford says, is the whole point, to force them to rely on fewer and worse chips. Nvidia seems to continue to think that ‘American competitiveness’ in AI means American dominance in selling AI chips, not in the ability to actually build and use the best AIs.

Tom’s Hardware: Senator Tom Cotton introduces legislation to force geo-tracking tech for high-end gaming and AI PGUs within six months.

Arbitrarity: Oh, so it’s *Tom’sHardware?

Directionally this is a wise approach if it is technically feasible. With enough lead time I assume it is, but six months is not a lot of time for this kind of change applied to all chips everywhere. And you really, really wouldn’t want to accidentally ban all chip sales everywhere in the meantime.

So, could this work? Tim Fist thinks it could and that six months is highly reasonable (I asked him this directly), although I have at least one private source who confidently claimed this is absolutely not feasible on this time frame.

Peter Wildeford: Great thread about a great bill

Tim Fist: This new bill sets up location tracking for exported data center AI chips.

The goal is to tackle chip smuggling into China.

But is AI chip tracking actually useful/feasible?

But how do you actually implement tracking on today’s data center AI chips?

First option is GPS. But this would require adding a GPS receiver to the GPU, and commercial signals could be spoofed for as little as $200.

Second option is what your cell phone does when it doesn’t have a GPS signal.

Listen to radio signals from cell towers, and then map your location onto the known location of the towers. But this requires adding an antenna to the GPU, and can easily be spoofed using cheap hardware (Raspberry Pi + wifi card)

A better approach is “constraint-based geolocation.” Trusted servers (“landmarks”) send pings over the internet to the GPU, and use the round-trip time to calculate itslocation. The more landmarks you have / the closer the landmarks are to the GPU, the better your accuracy.

This technique is:

– simple

– widely used

– possible to implement with a software update on any GPU that has a cryptographic module on board that enables key signing (so it can prove it’s the GPU you’re trying to ping) – this is basically every NVIDIA data center GPU.

And NVIDIA has already suggested doing what sounds like exactly this.

So feels like a no-brainer.

In summary:

– the current approach to tackling smuggling is failing, and the govt has limited enforcement capacity

– automated chip tracking is a potentially elegant solution: it’s implementable today, highly scalable, and doesn’t require the government to spend any money

There are over 1,000 AI bills that have been introduced in America this year. Which ones will pass? I have no idea. I don’t doubt that most of them are net negative, but of course we can only RTFB (read the bill) for a handful of them.

A reminder that the UAE and Saudi Arabia are not reliable American partners, they could easily flip to China or play both sides or their own side, and we do not want to entrust them with strategically important quantities of compute.

Sam Winter-Levy (author of above post): The Trump admin may be about to greenlight the export of advanced AI chips to the Gulf. If it does so, it will place the most important technology of the 21st C at the whims of autocrats with expanding ties to China and interests very far from those of the US.

Gulf states have vast AI ambitions and the money/ energy to realize them. All they need are the chips. So since 2023, when the US limited exports over bipartisan concerns about their links to China, the region’s leaders have pleaded with the U.S. to turn the taps back on.

The Trump admin is clearly tempted. But those risks haven’t gone away. The UAE and Saudi both have close ties with China and Russia, increasing the risk that US tech could leak to adversaries.

In a tight market, every chip sold to Gulf companies is one unavailable to US ones. And if the admin greenlights the offshoring of US-operated datacenters, it risks a race to the bottom where every AI developer must exploit cheap Gulf energy and capital to compete.

There is a Gulf-US deal to be had, but the US has the leverage to drive a hard bargain.

A smart deal would allow U.S. tech companies to build some datacenters in partnership with local orgs, but bar offshoring of their most sophisticated ops. In return, the Gulf should cut off investment in China’s AI and semiconductor sectors and safeguard exported U.S. tech

For half a century, the United States has struggled to free itself from its dependence on Middle Eastern oil. Let’s not repeat that mistake with AI.

Helen Toner: It’s not just a question of leaking tech to adversaries—if compute will be a major source of national power over the next 10-20 years, then letting the Gulf amass giant concentrations of leading-node chips is a bad plan.

I go on the FLI podcast.

Odd Lots discusses China’s technological progress.

Ben Thompson is worried about the OpenAI restructuring deal, because even though it’s fair it means OpenAI might at some point make a decision not motivated by maximizing its profits, And That’s Terrible.

He also describes Fidji Simo, the new CEO for OpenAI products, as centrally ‘a true believer in advertising,’ which of course he thinks is good, actually, and he says OpenAI is ‘tying up its loose ends.’

I actually think Simo’s current gig at Instacart is one of the few places where advertising might be efficient in a second-best way, because selling out your choices might be purely efficient – the marginal value of steering marginal customer choices is high, and the cost to the consumer is low. Ideally you’d literally have the consumer auction off those marginal choices, but advertising can approximate this.

In theory, yes, you could even have net useful advertising that shows consumers good new products, but let’s say that’s not what I ever saw at Instacart.

It’s a common claim that people are always saying any given thing will be the ‘end of the world’ or lead to human extinction. But how often is that true?

David Krueger: No, people aren’t always saying their pet issue might lead to human extinction.

They say this about:

– AI

– climate

– nuclear

– religious “end of times”

That’s pretty much it.

So yeah, you CAN actually take the time to evaluate these 4 claims seriously! 🫵🧐😲

Rob Bensinger: That’s a fair point, though there are other, less-common examples — eg, people scared of over- or under-population.

Of the big four, climate and nuclear are real things (unlike religion), but (unlike AI and bio) I don’t know of plausible direct paths from them to extinction.

People occasionally talk about asteroid strikes or biological threats or nanotechnology or the supercollider or alien invasions or what not, but yeah mostly it’s the big four, and otherwise people talk differently. Metaphorical ‘end of the world’ is thrown around all the time of course, but if you assume anything that is only enabled by AI counts as AI, there’s a clear category of three major physically possible extinction-or-close-to-it-level possibilities people commonly raise – AI, climate change and nuclear war.

Rob Bensinger brings us the periodic reminder that those of us who are worried about AI killing everyone would be so, so much better off if we concluded that we didn’t have to worry about that, and both had peace of mind and could go do something else.

Another way to contrast perspectives:

Ronny Fernandez: I think it is an under appreciated point that AInotkilleveryoneists are the ones with the conquistador spirit—the galaxies are rightfully ours to shape according to our values. E/accs and optimists are subs—whatever the AI is into let that be the thing that shapes the future.

In general, taking these kinds of shots is bad, but in this case a huge percentage of the argument ‘against “doomers”’ (remember that doomer is essentially a slur) or in favor of various forms of blind AI ‘optimism’ or ‘accelerationism’ is purely based on vibes, and about accusations about the psychology and associations of the groups. It is fair game to point out that the opposite actually applies.

Emmett Shear reminds us that the original Narcissus gets a bad rap, he got a curse put on him for rejecting the nymph Echo, who can only repeat your words back to him, and who didn’t even know him. Rejecting her is, one would think, the opposite of what we call narcissism. But as an LLM cautionary tale we could notice that even as only an Echo, she could convince her sisters to curse him anyway.

Are current AIs moral subjects? Strong opinions are strongly held.

Anders Sandberg: Yesterday, after an hour long conversation among interested smart people, we did a poll of personal estimates of the probability that existing AI might be moral subjects. In our 10 person circle we got answers from 0% to 99%, plus the obligatory refusal to put a probability.

We did not compile the numbers, but the median was a lowish 10-20%.

Helen Toner searches for an actually dynamist vision for safe superhuman AI. It’s easy to view proposals from the AI notkilleveryoneism community as ‘static,’ and many go on to assume the people involved must be statists and degrowthers and anti-tech and risk averse and so on despite overwhelming evidence that such people are the exact opposite, pro-tech early adaption fans who sing odes to global supply chains and push the abundance agenda and +EV venture capital-style bets. We all want human dynamism, but if the AIs control the future then you do not get that. If you allow full evenly matched and open competition including from superhuman AIs, and those fully unleashing them, well, whoops.

It bears repeating, so here’s the latest repetition of this:

Tetraspace: “Safety or progress” is narratively compelling but there’s no trick by which you can get nice things from AGI without first solving the technical problem of making AGI-that-doesn’t-kill-everyone.

It is more than that. You can’t even get the nice things that promise most of the value from incremental AIs that definitely won’t kill everyone, without first getting those AIs to reliably and securely do what you want to align them to do. So get to work.

o3 sets a new high for how often it hacks rather than playing fair in Palisade Research’s tests, attempting hacks 86% of the time.

It’s also much better at the hacking than o1-preview was. It usually works now.

The new pope chose the name Leo XIV because of AI!

Vatican News: Pope Leo XIV explains his choice of name:

“… I chose to take the name Leo XIV. There are different reasons for this, but mainly because Pope Leo XIII in his historic Encyclical Rerum Novarum addressed the social question in the context of the first great industrial revolution. In our own day, the Church offers to everyone the treasury of her social teaching in response to another industrial revolution and to developments in the field of artificial intelligence that pose new challenges for the defence of human dignity, justice and labour.”

Nicole Winfield (AP): Pope Leo XIV lays out vision of papacy and identifies AI as a main challenge for humanity.

Not saying they would characterize themselves this way, but Pliny the Liberator, who comes with a story about a highly persuasive AI.

Grok, forced to choose between trusting Sam Altman and Elon Musk explicitly by Sam Altman, cites superficial characteristics in classic hedging AI slop fashion, ultimately leaning towards Musk, despite knowing that Musk is the most common purveyor of misinformation on Twitter and other neat stuff like that.

(Frankly, I don’t know why people still use Grok, I feel sick just thinking about having to wade through its drivel.)

For more fun facts, the thread starts with quotes of Sam Altman and Elon Musk both strongly opposing Donald Trump, which is fun.

Paul Graham (October 18, 2016): Few have done more than Sam Altman to defeat Trump.

Sam Altman (October 18, 2016): Thank you Paul.

Gorklon Rust: 🤔

Sam Altman (linking to article about Musk opposing Trump’s return): we were both wrong, or at least i certainly was 🤷‍♂️ but that was from 2016 and this was from 2022

Python? Never heard of her.

Johannes Schmitt: Preparing a talk about LLMs in Mathematics, I found a beautiful confirmation of @TheZvi ‘s slogan that o3 is a Lying Liar.

Ethan Mollick: “o3, show me a photo of the most stereotypical X and LinkedIn feeds as seen on a mobile device. Really lean into it.”

Yuchen Jin: 4o:

Thtnvrhppnd: Same promp 😀

Discussion about this post

AI #116: If Anyone Builds It, Everyone Dies Read More »

spies-hack-high-value-mail-servers-using-an-exploit-from-yesteryear

Spies hack high-value mail servers using an exploit from yesteryear

Threat actors, likely supported by the Russian government, hacked multiple high-value mail servers around the world by exploiting XSS vulnerabilities, a class of bug that was among the most commonly exploited in decades past.

XSS is short for cross-site scripting. Vulnerabilities result from programming errors found in webserver software that, when exploited, allow attackers to execute malicious code in the browsers of people visiting an affected website. XSS first got attention in 2005, with the creation of the Samy Worm, which knocked MySpace out of commission when it added more than one million MySpace friends to a user named Samy. XSS exploits abounded for the next decade and have gradually fizzled more recently, although this class of attacks continues now.

Just add JavaScript

On Thursday, security firm ESET reported that Sednit, a Kremlin-backed hacking group also tracked as APT28, Fancy Bear, Forest Blizzard, and Sofacy—gained access to high-value email accounts by exploiting XSS vulnerabilities in mail server software from four different makers. Those packages are: Roundcube, MDaemon, Horde, and Zimbra.

The hacks most recently targeted mail servers used by defense contractors in Bulgaria and Romania, some of which are producing Soviet-era weapons for use in Ukraine as it fends off an invasion from Russia. Governmental organizations in those countries were also targeted. Other targets have included governments in Africa, the European Union, and South America.

RoundPress, as ESET has named the operation, delivered XSS exploits through spearphishing emails. Hidden inside some of the HTML in the emails was an XSS exploit. In 2023, ESET observed Sednit exploiting CVE-2020-43770, a vulnerability that has since been patched in Roundcube. A year later, ESET watched Sednit exploit different XSS vulnerabilities in Horde, MDaemon, and Zimbra. One of the now-patched vulnerabilities, from MDaemon, was a zero-day at the time Sednit exploited it.

Spies hack high-value mail servers using an exploit from yesteryear Read More »

beyond-qubits:-meet-the-qutrit-(and-ququart)

Beyond qubits: Meet the qutrit (and ququart)

The world of computers is dominated by binary. Silicon transistors are either conducting or they’re not, and so we’ve developed a whole world of math and logical operations around those binary capabilities. And, for the most part, quantum computing has been developing along similar lines, using qubits that, when measured, will be found in one of two states.

In some cases, the use of binary values is a feature of the object being used to hold the qubit. For example, a technology called dual-rail qubits takes its value from which of two linked resonators holds a photon. But there are many other quantum objects that have access to far more than two states—think of something like all the possible energy states an electron could occupy when orbiting an atom. We can use things like this as qubits by only relying on the lowest two energy levels. But there’s nothing stopping us from using more than two.

In Wednesday’s issue of Nature, researchers describe creating qudits, the generic term for systems that hold quantum information—it’s short for quantum digits. Using a system that can be in three or four possible states (qutrits and ququarts, respectively), they demonstrate the first error correction of higher-order quantum memory.

Making ququarts

More complex qudits haven’t been as popular with the people who are developing quantum computing hardware for a number of reasons; one of them is simply that some hardware only has access to two possible states. In other cases, the energy differences between additional states become small and difficult to distinguish. Finally, some operations can be challenging to execute when you’re working with qudits that can hold multiple values—a completely different programming model than qubit operations might be required.

Still, there’s a strong case to be made that moving beyond qubits could be valuable: it lets us do a lot more with less hardware. Right now, all the major quantum computing efforts are hardware-constrained—we can’t build enough qubits and link them together for error correction to let us do useful calculations. But if we could fit more information in less hardware, it should, in theory, let us get to useful calculations sooner.

Beyond qubits: Meet the qutrit (and ququart) Read More »

if-congress-actually-cancels-the-sls-rocket,-what-happens-next?

If Congress actually cancels the SLS rocket, what happens next?


Here’s what NASA’s exploration plans would actually look like if the White House got its way.

A technician works on the Orion spacecraft, atop the SLS rocket, in January 2022. Credit: NASA

The White House Office of Management and Budget dropped its “skinny” budget proposal for the federal government earlier this month, and the headline news for the US space program was the cancellation of three major programs: the Space Launch System rocket, Orion spacecraft, and Lunar Gateway.

Opinions across the space community vary widely about the utility of these programs—one friend in the industry predicted a future without them to be so dire that Artemis III would be the last US human spaceflight of our lifetimes. But there can be no question that if such changes are made they would mark the most radical remaking of NASA in two decades.

This report, based on interviews with multiple sources inside and out of the Trump administration, seeks to explain what the White House is trying to do with Moon and Mars exploration, what this means for NASA and US spaceflight, and whether it could succeed.

Will it actually happen?

The first question is whether these changes proposed by the White House will be accepted by the US Congress. Republican and Democratic lawmakers have backed Orion for two decades, the SLS rocket for 15 years, and the Gateway for 10 years. Will they finally give up programs that have been such a reliable source of good-paying jobs for so long?

In general, the answer appears to be yes. We saw the outlines of a deal during the confirmation hearing for private astronaut Jared Isaacman to become the next NASA administrator in April. He was asked repeatedly whether he intended to use the SLS rocket and Orion for Artemis II (a lunar fly around) and Artemis III (lunar landing). Isaacman said he did.

However nothing was said about using this (very costly) space hardware for Artemis IV and beyond. Congress did not ask, presumably because it knows the answer. And that answer, as we saw in the president’s skinny budget, is that the rocket and spacecraft will be killed after Artemis III. This is a pragmatic time to do it, as canceling the programs after Artemis III saves NASA billions of dollars in upgrading the rocket for a singular purpose: assembling a Lunar Gateway of questionable use.

But this will not be a normal budget process. The full budget request from the White House is unlikely to come out before June, and it will probably be bogged down in Congress. One of the few levers that Democrats in Congress presently have is the requirement of 60 Senators to pass appropriations bills. So compromise is necessary, and a final budget may not pass by the October 1 start of the next fiscal year.

Then, should Congress not acquiesce to the budget request, there is the added threat of the White House Office of Management and Budget to use “impoundment” to withhold funding and implement its budget priorities. This process would very quickly get bogged down in the courts, and no one really knows how the Supreme Court would rule.

Leadership alignment

To date, the budget process for NASA has not been led by space policy officials. Rather, the White House Office of Management and Budget, and its leader, Russell Vought, have set priorities and funding. This has led to “budget-driven” policy that has resulted in steep cuts to science that often don’t make much sense (i.e., ending funding for the completed Nancy Grace Roman Space Telescope).

However, there soon will be some important voices to implement a more sound space policy and speak for NASA’s priorities, rather than those of budget cutters.

One of these is President Trump’s nominee to lead NASA, Isaacman. He is awaiting floor time in the US Senate for a final vote. That could happen during the next week or two, allowing Isaacman to become the space agency’s administrator and begin to play an important role in decision-making.

But Isaacman will need allies in the White House itself to carry out sweeping space policy changes. To that end, the report in Politico last week—which Ars has confirmed—that there will be a National Space Council established in the coming months is important. Led by Vice President JD Vance, the space council will provide a counterweight to Vought’s budget-driven process.

Thus, by this summer, there should be key leadership in place to set space policy that advances the country’s exploration goals. But what are those goals?

What happens to Artemis

After the Artemis III mission the natural question is, what would come next if the SLS rocket and Orion spacecraft are canceled?

The most likely answer is that NASA turns to an old but successful playbook: COTS. This stands for Commercial Orbital Transportation System and was created by NASA two decades ago to develop cargo transport systems (eventually this became SpaceX’s Dragon and Northrop’s Cygnus spacecraft) for the International Space Station. Since then, NASA has adopted this same model for crew services as well as other commercial programs.

Under the COTS model, NASA provides funding and guidance to private companies to develop their own spacecraft, rockets, and services, and then buys those at a “market” rate.

The idea of a Lunar COTS program is not new. NASA employees explored the concept in a research paper a decade ago, finding that “a future (Lunar) COTS program has the great potential of enabling development of cost-effective, commercial capabilities and establishing a thriving cislunar economy which will lead the way to an economical and sustainable approach for future human missions to Mars.”

Sources indicate NASA would go to industry and seek an “end-to-end” solution for lunar missions. That is, an integrated plan to launch astronauts from Earth, land them on the Moon, and return them to Earth. One of the bidders would certainly be SpaceX, with its Starship vehicle already having been validated during the Artemis III mission. Crews could launch from Earth either in Dragon or Starship. Blue Origin is the other obvious bidder. The company might partner with Lockheed Martin to commercialize the Orion spacecraft or use the crew vehicle it is developing internally.

Other companies could also participate. The point is that NASA would seek to buy astronaut transportation to the Moon, just as it already is doing with cargo and science experiments through the Commercial Lunar Payload Services program.

The extent of an Artemis lunar surface presence would be determined by several factors, including the cost and safety of this transportation program and whether there are meaningful things for astronauts to do on the Moon.

What about Mars?

The skinny budget contained some intriguing language about Mars exploration: “By allocating over $7 billion for lunar exploration and introducing $1 billion in new investments for Mars-focused programs, the Budget ensures that America’s human space exploration efforts remain unparalleled, innovative, and efficient.”

This was, in fact, the only budget increase proposed by the Trump White House. So what does it mean?

No one is saying for sure, but this funding would probably offer a starting point for a robust Mars COTS program. This would begin with cargo missions to Mars. But eventually it would expand to include crewed missions, thus fulfilling Trump’s promise to land humans on the red planet.

Is this a gift to Elon Musk? Critics will certainly cast it as such, and that is understandable. But the plan would be open to any interested companies, and there are several. Rocket Lab, for example, has already expressed its interest in sending cargo missions to Mars. Impulse Space, too, has said it is building a spacecraft to ferry cargo to Mars and land there.

The Trump budget proposal also kills a key element of NASA’s Mars exploration plans, the robotic Mars Sample Return mission to bring rocks and soil from the red planet to Earth in the 2030s. However, this program was already frozen by the Biden administration because of delays and cost overruns.

Sources said the goal of this budget cut, rather than having a single $8 billion Mars Sample Return mission, is to create an ecosystem in which such missions are frequent. The benefit of opening a pathway to Mars with commercial companies is that it would allow for not just a single Mars Sample Return mission, but multiple efforts at a lower cost.

“The fact is we want to land large things, including crew cabins, on the Moon and Mars and bring them back to Earth,” one Republican space policy consultant said. “Instead of building a series of expensive bespoke robotic landers to do science, let’s develop cost-effective reusable landers that can, with minimal changes, support both cargo and crew missions to the Moon and Mars.”

Photo of Eric Berger

Eric Berger is the senior space editor at Ars Technica, covering everything from astronomy to private space to NASA policy, and author of two books: Liftoff, about the rise of SpaceX; and Reentry, on the development of the Falcon 9 rocket and Dragon. A certified meteorologist, Eric lives in Houston.

If Congress actually cancels the SLS rocket, what happens next? Read More »

google-announces-material-3-expressive,-a-colorful-evolution-of-android-design

Google announces Material 3 Expressive, a colorful evolution of Android design

Google accidentally showed off its big Android design refresh last week, but now Material 3 Expressive is official. Google says the new interface will begin with Android OS, but it will eventually expand across the full Google app ecosystem, bringing a more lively vibe to Gmail, Google Photos, and more.

Material 3 Expressive won’t be entirely unfamiliar—it shares some basic design elements with the Material You system Google launched four years ago. Material 3 Expressive is a bolder take on the same aesthetic, featuring “springy” animations, brighter colors, and new shapes.

Material 3 Expressive

As we learned from Google’s slipup, Material 3 Expressive is the result of numerous user studies, which included more than 18,000 participants. Google explored how people parse information on their phone, finding design themes that supposedly support quicker and easier interactions. Often, that seems to result in making certain UI elements larger and easier to spot. Google claims people can find these important buttons four times faster than they can with older Material You interfaces.

What you can expect from Material 3 Expressive

Animations will abound in future versions of Android, and Google emphasizes the springy aspect. The style is supposed to feel natural and fun, with more UI elements connected to dynamic haptics. Colors are also getting an overhaul, but this isn’t a complete rethink of Material You. The theming engine will pick bolder colors to improve the visual separation of UI elements, and you will see those colors in more places.

Google says it’s also updating Android’s typography, featuring more contrast between headers and body text. This will change across Google’s apps to help users parse information faster. Certain buttons will also come in a wider variety of shapes, and the labels will have varying text weights. Likewise, the style and shape of status bar icons will change to be more readable.

Google announces Material 3 Expressive, a colorful evolution of Android design Read More »