Author name: Mike M.


An engineering thesis disguised as a coupe: A history of the Honda Prelude


Wait for that second cam to kick in…

Technology like four-wheel steering and variable valve timing debuted in the Prelude.

A vintage ad for the Honda Prelude from 1983 (mild touchups by Aurich Lawson) Credit: Honda


The Honda Prelude was never simply a car. It was an engineering thesis disguised as a coupe: compact, disciplined, and unapologetically technical. At its best, it distilled Honda’s faith in precision manufacturing and clever packaging into something accessible and aspirational.

Its return for 2026, after more than a quarter century away, isn’t nostalgia so much as institutional memory. The Prelude name carries expectations: balance over brute force, innovation over ornament, and a willingness to pursue mechanical elegance even when the market leans elsewhere.

And it’s worth remembering that the original Prelude emerged during a turbulent period for the industry. Constraint, not excess, shaped it, which may explain why it felt so deliberate from the start.

A time of economic turbulence

The Honda Prelude didn’t arrive during a champagne toast. It showed up in the middle of economic upheaval, when the global auto business stared nervously at its balance sheet and wondered whether the arithmetic still worked.

Honda’s first US headquarters in 1959. Credit: Honda

The story began on August 15, 1971, when President Richard Nixon severed the dollar’s link to gold, effectively ending the Bretton Woods system that had anchored postwar commerce since 1944. By 1973, the dollar was formally devalued. Fixed exchange rates evaporated. The yen surged, Japanese exports became more expensive, and corporate forecasts unraveled.

Then came the oil shock. In October 1973, OPEC cut production, which sent energy prices sharply higher and injected fresh uncertainty into global demand. For Honda Motor Co., with roughly 60 percent of its sales tied to the United States, the math shifted overnight. A stronger yen squeezed margins. Higher fuel prices threatened volume, and Japan’s export machine suddenly looked exposed.

Something had to give.

At precisely this moment of instability, the company’s founders, Soichiro Honda and Takeo Fujisawa, stepped aside from the enterprise they had built from scratch. Honda was no longer a workshop operation; it employed 18,000 people and held 19.5 billion yen in capital. But scale offers no immunity; it merely increases the stakes.

Enter Kiyoshi Kawashima, then president of Honda R&D and senior managing director of Honda Motor Co. His New Honda Plan amounted to a corporate reset. Management structures would be modernized. Decision-making streamlined. And, crucially, Honda would expand globally rather than simply export into an increasingly volatile currency environment.

In a world of floating exchange rates and unpredictable oil prices, Honda chose reinvention over retreat. The Prelude would become one expression of that shift, proof that even in turmoil, discipline and design could travel.

Honda’s newest model

Honda’s American expansion started in 1959 with motorcycles. A decade later, the N600 arrived with two cylinders, modest size, and immense ambition. By 1973, as economic turbulence deepened, Honda introduced the Civic: a larger, efficient, affordable four-cylinder hatchback perfectly calibrated for the moment. The even-larger Accord followed in 1976, positioned as Honda’s first true world car.

Both were powered by Honda’s CVCC engine, the first to meet the tough emissions standards of the 1970 US Clean Air Act without a catalytic converter. Its breakthrough was elegant engineering: a spark plug ignited a richer fuel mixture in a small prechamber, which then ignited a leaner mixture in the main cylinder, delivering cleaner combustion without costly add-ons. In an era defined by oil shocks and regulation, Honda didn’t lobby. It engineered its way forward.

And, having secured credibility with rational transport, it then did something faintly irrational. It built a sports coupe.

The Prelude that started it all. Credit: Honda

Launched in 1978, the first-generation Prelude was equal parts boxy and sleek: an Accord underneath, but tighter, shorter, and more intentional. Honda took the sedan’s suspension, brakes, and 1.8 L engine and fitted them to a chassis with a wheelbase trimmed by 2.4 inches (60 mm). The output was modest: 72 hp (54 kW) and 94 lb-ft (127 Nm) of torque from a single-overhead-cam four-cylinder paired with a five-speed manual or a two-speed automatic (later upgraded to three), sending its power to the front wheels. Reaching 60 mph (97 km/h) took about 19 seconds, which is hardly exhilarating. And the Prelude carried a premium price despite delivering a driving experience that didn’t justify it. Sales were meager, but Honda was just getting started.

The Prelude comes into its own

It wasn’t until 1983 that Honda finally reimagined the Prelude as something more than a truncated Accord. It was a turning point that suggested the company was ready to treat the model not as a derivative but as a distinct ambition. Now rated at 100 hp (75 kW), the car arrived wrapped in a sharp, wedge-shaped silhouette, a deliberate break from the excess it replaced, while its pop-up headlights became an essential element of its design. It was cleaner, more contemporary, and unmistakably forward-looking. More importantly, it laid the groundwork for what came next: the 1985 Prelude Si.

With a larger fuel-injected 2.0 L four-cylinder engine producing 110 hp (82 kW) and 114 lb-ft (155 Nm) of torque, the Si pushed the Prelude into more serious territory, trimming the 0–60 mph sprint into the nine-second range, a meaningful benchmark in the mid-1980s sport-compact calculus.

When the third-generation Prelude debuted for 1988, the styling suggested evolution rather than revolution, as it wore a carefully refined silhouette. But beneath the cautious redesign, Honda was preparing a far more consequential statement.

This generation cemented the Prelude’s reputation as a technological outlier. It became the first car sold in the United States to offer four-wheel steering, an audacious bit of engineering that sounds exotic but functions with pure mechanical simplicity. At low speeds, the rear wheels turned in the opposite direction to the front wheels to tighten the car’s rotation; at higher speeds, they turned in the same direction, enhancing stability.
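That low-speed/high-speed split reduces to a simple rule. The sketch below is a toy model of the behavior described above, not Honda’s actual (purely mechanical) implementation; the crossover speed and steering ratio are invented for illustration:

```python
def rear_steer_angle(front_angle_deg: float, speed_kmh: float) -> float:
    """Toy model of speed-dependent four-wheel steering.

    Below the crossover speed, the rear wheels counter-steer
    (opposite sign to the fronts) to tighten the turning circle;
    above it, they steer in phase for stability. All constants
    are invented for illustration.
    """
    CROSSOVER_KMH = 30.0  # hypothetical speed where the behavior flips
    MAX_RATIO = 0.15      # rear angle as a fraction of front angle
    # Blend factor runs from -1 (full counter-steer) at a standstill
    # to +1 (full in-phase steer) at twice the crossover speed.
    blend = min(speed_kmh / CROSSOVER_KMH, 2.0) - 1.0
    return front_angle_deg * MAX_RATIO * blend
```

At parking-lot speeds, the function returns an angle with the opposite sign to the front wheels; at highway speeds, the same sign.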

The power came from a single-overhead-cam 2.0 L four-cylinder that produced 109 hp (81 kW) and 111 lb-ft (150 Nm) of torque, paired with either a four-speed automatic or a five-speed manual. For drivers who sought something sharper, the Si’s 2.0 L dual-overhead-cam variant delivered 135 hp (101 kW) and 127 lb-ft (172 Nm) of torque, a figure that rose to 140 hp (104 kW) by 1990. The Prelude Si 4WS became the flagship trim, reinforcing the car’s gradual transformation from stylish coupe to legitimate sport-compact contender.

Continuing a proud tradition

Yet even the most devoted Prelude loyalist can tire of a familiar refrain. And so, when the fourth generation arrived in 1992, Honda didn’t abandon the long-nose, short-deck proportions. Instead, it reinterpreted them. The sharp creases and pop-up headlights were gone, replaced by fixed lighting and softer, almost liquefied sheet metal, as if the car had been left in the sun and allowed to melt into a more aerodynamic future. The more consequential shift came a year later.

1992 Honda Prelude Si Coupe.

Straight lines and pop-up headlights were gone for the fourth generation, replaced by smooth curves. Credit: Honda

In 1993, Honda introduced the Prelude VTEC, shorthand for Variable Valve Timing and Lift Electronic Control, a name that would soon enter the enthusiast lexicon. While the 1980 Alfa Romeo Spider 2000 was the first to offer variable-valve timing in the US market, Honda’s system went a step further. At higher revs, a more aggressive profile holds the valves open longer and wider to extract greater performance, while at lower rpm, the valves open more conservatively, prioritizing efficiency. Today, variable-valve timing is common across the industry. At the time, it felt revelatory, effectively delivering two engine personalities within a single powerplant. Following the audacity of four-wheel steering, VTEC further polished the Prelude’s identity as Honda’s rolling laboratory, a coupe that previewed the engineering future.
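The two-personality idea boils down to a threshold. A minimal sketch of that switchover, with the rpm and valve figures invented for illustration:

```python
def cam_profile(rpm: int) -> dict:
    """Toy model of a two-profile variable valve system.

    Below the switchover rpm, the engine runs the mild cam lobe for
    efficiency; above it, the aggressive lobe is engaged for more
    valve lift and longer duration. All figures are invented.
    """
    SWITCHOVER_RPM = 5500  # hypothetical engagement point
    if rpm < SWITCHOVER_RPM:
        return {"profile": "economy", "lift_mm": 8.0, "duration_deg": 230}
    return {"profile": "performance", "lift_mm": 11.0, "duration_deg": 280}
```

Below the threshold, the engine prioritizes efficiency; cross it, and the same engine takes on its second, more aggressive character.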

The base S model carried a 135 hp single-overhead-cam four-cylinder, very much in keeping with Honda’s disciplined approach. With the Si or SE, buyers were rewarded with a 2.3 L four-cylinder producing 160 hp (120 kW), giving the Prelude a sharper edge without sacrificing its daily civility. But the headline act was the VTEC model, the range-topping product of Honda’s engineering confidence. Its 2.2 L dual-overhead-cam four-cylinder delivered 190 hp (142 kW), a figure that placed the Prelude in sport-compact territory. It was the fullest realization of the car’s dual personality: civil at low revs, urgent when pushed.

Offered through 1996, this generation also marked the end of an experiment. Four-wheel steering, once the Prelude’s technological calling card, unceremoniously disappeared. It was an omen of what was to come.

A final shot over the bow

When the fifth-generation Prelude arrived for 1997, its styling felt like a compromise between eras, a return to Honda’s earlier angular discipline, slightly softened to align with late-1990s tastes. It looked modern but cautious. And beneath the sheet metal, something had changed.

A 1998 Honda Prelude Type SH. Credit: Honda

For the first time in years, the Prelude’s ambitions narrowed. There was a single engine: a 195 hp (145 kW) 2.2 L four-cylinder, paired with a five-speed manual or a four-speed automatic. The menu was simplified, perhaps strategically.

Four-wheel steering was gone. In its place came the Type SH, fitted with Honda’s Active Torque Transfer System, or ATTS. It consisted of electromechanical clutches designed to send additional torque to the outside front wheel during a turn to sharpen turn-in and approach the balance of rear-wheel drive. Today, we call it torque vectoring. Then, it was a costly, heavy experiment that proved too clever for its own good. Few buyers opted in. And so, the Prelude faded away.
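The core idea, shifting torque toward the outside front wheel as cornering load builds, can be sketched in a few lines. This is a toy model in the spirit of torque vectoring, not ATTS itself; all constants are invented:

```python
def torque_vector_split(total_nm: float, lateral_g: float) -> tuple:
    """Toy model of front-axle torque vectoring (all constants
    invented): bias torque toward the outside front wheel in
    proportion to cornering load, up to a cap.

    Returns (outside_nm, inside_nm).
    """
    GAIN = 0.2  # fraction of torque shifted per g of lateral load
    CAP = 0.3   # never shift more than 30% of total torque
    bias = min(abs(lateral_g) * GAIN, CAP)
    return total_nm * (0.5 + bias), total_nm * (0.5 - bias)
```

Driving straight (zero lateral g) yields an even 50/50 split; at the cap, a hard corner pushes 80 percent of the torque to the outside wheel.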

In June 2001, after selling 826,082 Preludes in the United States, Honda ended production. The car peaked in 1986, when 79,841 examples found buyers. After that, demand slipped steadily, squeezed by competition from within, particularly the Accord Coupe, Civic Coupe, and Acura Integra, and by a market pivoting decisively toward sport-utility vehicles. By the first five months of 2001, just 3,500 Preludes were sold. The car that once served as Honda’s technological calling card exited quietly. It was less a failure than a casualty of shifting appetites, as its innovations were absorbed into the mainstream that it helped shape.

The Prelude’s second chance

And now, roughly 25 years later, Honda has revived the Prelude, less a sentimental callback than a calculated move in an auto industry that no longer resembles the one the Prelude left behind.

The auto industry, once defined by horsepower, styling cycles, and incremental engineering gains, is now shaped by software, batteries, and geopolitics. Tesla forced incumbents to think like tech companies. China emerged not just as a market, but as a manufacturing and innovation superpower. And governments, through emissions rules and subsidies, have become de facto product planners, pushing automakers toward electrification whether they are ready or not.

At the same time, the economics of making cars have grown more unforgiving. Development costs have soared. Margins are thinner. Scale matters more. Against this backdrop, reviving a legacy nameplate is no longer just a branding exercise. It’s a test of whether nostalgia can coexist with an industry that now runs on code, capital, and political risk. This explains the 2026 Honda Prelude.

Economics, not just nostalgia, fuel its return

With its in-house rivals long gone—the Accord Coupe was discontinued in 2017, and the Civic Coupe followed three years later—the Prelude returned first as a 2023 concept and now as a production car. In other words, Honda is reentering a segment it largely abandoned, now that the competitive clutter inside its own showroom has been cleared.

Underneath, the revival reflects a broader industry playbook: minimize investment while maximizing brand leverage. The Prelude rides on a shortened Civic platform, uses a Civic Hybrid drivetrain, and borrows suspension hardware from the Civic Type R. Honda has reengineered and retuned the components, but the strategy is clear: contain development costs, preserve margins, and spread R&D across as many units as possible.

The 2026 Honda Prelude.

The reborn Prelude. Credit: Honda

Honda ended the previous Prelude after it sold roughly 3,500 units in the first five months of 2001. The new goal of 4,000 units annually suggests management is not betting on a coupe revival but testing the waters. In a US market dominated by high-margin SUVs and pickup trucks, the Prelude functions more like a brand halo with guardrails: a way to test whether nostalgia can deliver incremental profit without jeopardizing capital. In that sense, the car is less a throwback and more a case study in how legacy automakers now balance emotion with spreadsheets.

In that regard, the 2026 Honda Prelude continues to predict the future.



Magnetars drag spacetime to power superluminous supernovae


Frame-dragging may explain an odd pattern seen in the brightest supernovae.

Some of the most extreme explosions in the universe are Type I superluminous supernovae. “They are one of the brightest explosions in the Universe,” says Joseph Farah, an astrophysicist at the University of California, Santa Barbara. For years, astrophysicists tried to understand what exactly makes superluminous supernovae so absurdly powerful. Now it seems like we may finally have some answers.

Farah and his colleagues have found that these events are most likely powered by magnetars, rapidly spinning neutron stars that warp the very space and time around them.

The power within

Magnetars have been a leading candidate for the engine behind superluminous supernovae. The theory says these insanely magnetized stars are born from the collapsing core of the original progenitor star and emit energy via magnetic dipole radiation. “This core is roughly a one solar mass object that gets crushed down to the size of a city,” Farah explains. As its spin slows down, a magnetar bleeds its rotational energy into the expanding material of the dead star, lighting it up.

The problem was that this theory did not quite explain observations. In a standard magnetar model, the light curve of the supernova should rise rapidly and then fade away evenly as the neutron star loses its rotational energy. “This way the light curve, in the prediction of this model, just goes up and then down quite smoothly,” Farah says. But when astronomers observe superluminous supernovae, they almost never see this smooth fade. Instead, they see bumps, wiggles, and strange modulations. The light curve flickers over months.

For a while, scientists tried to patch the magnetar engine theory to fit observations. Maybe the expanding debris was slamming into irregular shells of material shed by the star before it died. Or perhaps the magnetar engine was spitting out random, violent flares. But these explanations required highly specific, fine-tuned parameters to match what we were seeing through our telescopes.

The solution to the strange flickering problem came when the Liverpool Gravitational Wave Optical Transient Observer collaboration detected an object designated SN 2024afav on December 12, 2024. Initially, the object looked like a standard superluminous supernova. “It was as bright and it had bumps in the light curve like many other objects of this kind,” Farah says. But as the telescopes kept watching, it started doing something unprecedented: It started to chirp.

The chirping star

In physics, a chirp refers to a signal with a frequency that steadily increases over time. In the case of SN 2024afav, its emissions were bumping up and down, but the gap between these bumps was shrinking. After a second and third bump both appeared with the gaps between them reduced by roughly 35 percent, Farah and his team realized they could calculate how much the gap between the bumps would decrease next.

The team adjusted their observation schedule, pointed their instruments at SN 2024afav, and discovered the fourth bump appeared exactly when they expected it would. The fifth bump enabled the scientists to narrow down the period reduction to about 29 percent.
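The forecasting step amounts to simple extrapolation: assume each gap between bumps shrinks by a roughly constant fraction and project forward. The team’s actual analysis fit a physical precession model, so the sketch below is only the arithmetic skeleton, with the 29 percent figure taken from their refined measurement:

```python
def predict_next_bump(bump_times: list, shrink_fraction: float = 0.29) -> float:
    """Extrapolate the next brightness bump in a chirping light curve.

    Assumes each interval between bumps is (1 - shrink_fraction)
    times the previous one. Times can be in any consistent unit
    (e.g., days).
    """
    last_gap = bump_times[-1] - bump_times[-2]
    return bump_times[-1] + last_gap * (1.0 - shrink_fraction)
```

With hypothetical bumps at days 0, 100, and 171, the next bump would be expected near day 221.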

The fact that Farah and his colleagues could accurately predict the bumps delivered a massive blow to our existing magnetar models. While a few irregular bumps could be explained away by supernova ejecta crashing into clouds of gas, that mechanism can’t explain perfectly timed, cleanly sinusoidal modulations with a steadily decaying period. Random space rubble just doesn’t work that way.

“So, we came up with the new model to describe this behavior,” Farah explains. They proposed a new physical mechanism that relied on the Lense-Thirring effect, otherwise known as frame-dragging. Frame-dragging is a prediction of General Relativity, where a massive spinning object slightly drags the spacetime around with it as it rotates. “We didn’t try this mechanism before because it had never been seen around a magnetar before,” Farah says. But when his team did try it, it turned out to perfectly match what was going on.

The flickering in the superluminous supernovae, Farah hypothesized, was caused by the extreme gravity of a newborn magnetar dragging the very spacetime around it along as it was spinning.

Twisted space

To understand Farah’s Lense-Thirring solution, imagine a bowling ball spinning in a vat of molasses. As the ball rotates, friction drags the sticky fluid along, creating a swirling vortex. According to Einstein’s General Relativity, mass and energy can warp the fabric of spacetime, so if a sufficiently large mass is spinning rapidly, it drags the spacetime along in a manner similar to the molasses. Around Earth, this effect is minuscule. But around a newborn magnetar, which is far more massive and spinning hundreds of times a second, spacetime is whipped into a violent, twisting frenzy.

When the progenitor star exploded to create SN 2024afav, it didn’t eject all of its material perfectly. Some of the stellar guts failed to escape and fell back toward the newborn magnetar, forming a small accretion disk around it. Crucially, this disk was misaligned, tilted relative to the rotational axis of the magnetar. Because the disk was tilted in this aggressively twisted spacetime, the Lense-Thirring effect forced the entire disk to wobble, or precess, around the magnetar’s spin axis like a top that was spinning ever more slowly.

As this misaligned disk wobbled, it acted like a giant cosmic lampshade: it periodically blocked, reflected, or redirected the intense radiation and jets spewing from the central magnetar. The high-energy photons emitted by the magnetar had to fight their way through the expanding supernova ejecta, getting reprocessed into optical light and diffusing outward over a span of about 15 days. Observed through our telescopes on Earth, this wobbling disk created a rhythmic fluctuation in the superluminous supernova’s brightness.

After Farah and his colleagues explained the bumps in the signal with the wobbling disk around the magnetar, they moved to explaining why the signal chirped.

The shrinking disk

The answer the team proposes lies in the environment of the disk itself. The size of this accretion disk isn’t static. It’s determined by an inward ram pressure from the infalling matter and the outward radiation pressure coming from the magnetar. Over time, as the exploding star runs out of fallback material, the accretion rate of the disk drops. With less matter pushing in, the disk loses equilibrium and begins to shrink, falling inward toward the magnetar. And the closer it gets to the spinning magnetar, the stronger the Lense-Thirring effect becomes.

As the accretion disk shrinks and falls deeper into the gravity well, the twisted spacetime whips it around faster and faster. “Imagine a pirouetting figure skater pulling her arms in to accelerate the spinning movement,” Farah suggests. In consequence, the precession speeds up, the wobbles get tighter, and the light curve chirps.
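The speedup follows from how the effect scales with distance: the Lense-Thirring precession frequency falls off as the cube of the orbital radius (a standard general-relativity result, though the article doesn’t state it), so the precession period grows as r³. A toy scaling with invented reference numbers:

```python
def precession_period_days(radius: float,
                           ref_radius: float = 1.0,
                           ref_period_days: float = 30.0) -> float:
    """Toy Lense-Thirring scaling: precession frequency ~ 1/r^3,
    so the precession period scales as r^3. The reference radius
    and period are invented; radius is in units of ref_radius.
    """
    return ref_period_days * (radius / ref_radius) ** 3
```

Halving the disk radius cuts the precession period by a factor of eight, which is why a shrinking disk makes the light curve chirp.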

Finally, by measuring the chirps, Farah and his colleagues were able to work backward to measure the properties of the magnetar powering SN 2024afav. They constrained its spin period to 4.2 milliseconds and precisely calculated its staggeringly powerful magnetic field. The team found that the magnetar’s properties derived solely from the chirping matched the properties required to power the overall baseline brightness of the superluminous supernova. The engine that powered the main explosion was exactly the right size and speed to cause the observed wobbling.

But the work on the revised “magnetar+LT” model is just beginning. “This object is so rare and so new,” Farah admits. “We were scraping the bottom of the barrel for references that were even remotely related to the idea we were pitching here.”

Superluminous siblings

Farah’s team went back and looked at archival data from other bumpy superluminous supernovae such as SN 2018kyt, SN 2019unb, and SN 2021mkr. They found that their “magnetar+LT” model explains the modulations in those events as well. A whole class of exploding stars that previously required multiple mutually exclusive physical explanations could be unified by a single, elegant model.

This model, though, still has many unanswered questions. “How the accretion disk forms, how it blocks or modulates the light from the magnetar, how that light then gets to the ejecta, and finally how it gets to the observer,” Farah says, listing the open steps. “Basically every step along the way we made the best assumptions we could.” For each of these steps, he admits, there were at least five different ways it could happen, and the team went with their best guess of what was going on.

To really figure it all out, Farah says, we need to wait until more objects like SN 2024afav are discovered. And this, he hopes, should become possible with new observatories like the Vera C. Rubin Observatory in Chile coming online. “The Rubin Observatory is expected to discover dozens of these chirped supernovae,” Farah says. “We will be able to test our models against many different objects. There’s definitely room for development and growth. This is just the very beginning.”

Nature, 2026. DOI: 10.1038/s41586-026-10151-0


Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.



Microsoft is working to eliminate PC gaming’s “compiling shaders” wait times

Getting everyone on board

Microsoft first rolled out Advanced Shader Delivery in its SDK last September and added support to the ROG Xbox Ally as a proof of concept by October. Microsoft said that the addition reduced launch times in games like Avowed by “as much as 85 percent,” which is a big deal on battery-limited handhelds.

Getting Advanced Shader Delivery adopted across the wider PC gaming ecosystem has been a slower process. Nvidia says it is “working closely with Microsoft” to add Advanced Shader Delivery support to its GeForce RTX line “later this year,” and Intel says it’s “looking forward to releasing a driver supporting ASD in the near future.” Qualcomm also said it plans to “debut this feature soon on Qualcomm Adreno X2 GPUs,” for what it’s worth.

Even with hardware support, game engine makers will have to integrate Microsoft’s SODB APIs to streamline the setup process for game developers. Epic Games says it is “doing early testing and explorations on SODB and PSDB generation and will have more details coming soon,” which is probably not the full-throated commitment Microsoft would like at this point.

For now, Microsoft has updated its APIs to let developers more easily create and test PSDBs and more easily compile shaders in larger games. The company is also urging developers to “integrate SODB collection into your game engine” now so they’ll be ready to upload those precompiled shaders through the Xbox Partner Center starting in May.

At that point, some PC games downloaded through the Xbox app will finally be able to skip that annoying “compiling shaders” loading step. But this isn’t a feature Microsoft wants to keep for its own PC game platform; the company says that “in the future, any storefront can compile the SODBs to… PSDBs and distribute them.”



Live Nation director boasted of gouging ticket buyers, “robbing them blind”


Unsealed messages add wrinkle to trial after US agreed to settle with Live Nation.

Credit: Getty Images | Bloomberg

Newly unsealed documents show that a Live Nation regional director boasted of gouging ticket buyers and “robbing them blind” with fees for ancillary services such as slight upgrades to parking.

Live Nation has tried to exclude Slack messages from a trial that seeks a breakup of Live Nation and its Ticketmaster subsidiary, claiming the messages are irrelevant to the case, “highly prejudicial,” and would “inflame the jury.” The US government and state attorneys general opposed the motion to exclude evidence. US District Judge Arun Subramanian of the Southern District of New York hasn’t ruled on the motion yet, but ordered the documents unsealed yesterday.

Live Nation has touted the experiences it offers concertgoers at amphitheaters but sought “to exclude candid, internal messages in which the individual who is currently Head of Ticketing for these amphitheaters calls fans ‘so stupid,’ explains that he ‘gouge[s]’ them, and brags that Live Nation is ‘robbing them blind, baby,’” said a memorandum of law filed by the US and states.

The messages were “sent between Live Nation employees Ben Baker and Jeff Weinhold on the workplace collaboration tool known as Slack,” the memorandum said. The “robbing them blind” message was sent by Baker.

“As of 2022 (when most of the messages were sent), Mr. Baker was a regional Director of Ticketing for venues including Live Nation’s MidFlorida Credit Union Amphitheatre (a major concert venue),” the brief said. “Mr. Weinhold was also a regional Director of Ticketing for venues including Jiffy Lube Live (another major concert venue). Mr. Baker is now Head of Ticketing for Venue Nation (the component of Live Nation responsible for operating its amphitheaters), and Mr. Weinhold is a Senior Director of Ticketing for Live Nation’s Capital Region.”

US settlement throws trial into doubt

The brief said that “Live Nation’s excessive prices for ancillary services,” like those boasted of in the internal messages, “are directly relevant to Plaintiffs’ claims” regarding how “Live Nation monetizes its monopoly position in the amphitheater market.”

The trial itself could be halted and restarted at a later date because the Trump administration decided to settle with Live Nation and Ticketmaster. The US and states filed their motion to exclude the evidence on March 8, the same day that the US and Live Nation informed the court of a proposed settlement.

The US/Live Nation settlement blindsided state attorneys general, who have said they intend to take over the lead role in litigating the case. State AGs criticized the settlement terms and asked for a mistrial to give them time to prepare for a new trial. The judge reportedly urged the state AGs and Live Nation to hold settlement talks and to be prepared to continue the trial next week if they don’t reach a settlement.

The exhibits that Live Nation wanted to exclude were posted on the court docket yesterday. “I charge $50 to park in the grass lmao,” said a 2022 message from Baker. “I charge $60 for closer grass.”

Baker wrote, “parking alone I did almost $200K more than 2019…with LESS shows.” He shared an image that showed an increase in premier parking revenue from $499,415 in 2019 to $666,230 in 2021 and added, “robbing them blind baby… that’s how we do.” Weinhold replied, “lol.”

“I gouge them on ancil prices”

Baker complained that a Dead & Company cancellation prevented him from taking second place in a sales competition. “Gimme a plaque dammit,” he wrote. In a discussion about ticket prices and promotions, Baker wrote, “I gouge them on ancil prices to make up for it.”

Weinhold wrote in another chat, “I have VIP parking up to $250 lol.” Baker replied, “I almost feel bad taking advantage of them.” Weinhold then mentioned that he raised club prices to $125 and Baker replied, “I wonder if I can get $225.”

Live Nation said the messages aren’t reflective of the company’s general operations. “The Slack exchange from one junior staffer to a friend absolutely doesn’t reflect our values or how we operate,” Live Nation said in a statement provided to Ars today. “Because this was a private Slack message, leadership learned of this when the public did, and will be looking into the matter promptly. Our business only works when fans have great experiences, which is why we’ve capped amphitheater venue fees at 15 percent and have invested $1 billion in the last 18 months into US venues and fan amenities.”

The US and states said Live Nation is downplaying Baker’s position at the company. “Defendants’ brief fails to mention this individual has since been promoted and now serves as Head of Ticketing for Venue Nation, with responsibilities relating to all of Live Nation’s venues,” the plaintiffs’ brief said.

Live Nation said in a March 8 filing that the messages aren’t relevant to the trial because they concerned fees for things like VIP club access, premier parking, or lawn chair rentals. “These products are not primary concert tickets, are sold separately from tickets, and are not part of the ticketing services markets at issue in this trial; they bear no relevance to the parties’ claims and defenses,” Live Nation told the court.

Live Nation: Messages could “inflame the jury”

Live Nation said the only purpose of using the exhibits as evidence “is to portray Defendants in an unflattering light and inflame the jury against Defendants,” and that the exhibits “would confuse and mislead the jury, invite decision-making on an improper emotional basis, and cause unfair prejudice to Defendants.” The company also asked the court to bar plaintiffs “from questioning Ben Baker or any other witness about the substance of these Exhibits or about similar communications concerning ancillary, fan-facing products and services not encompassed by the markets and claims proceeding to trial.”

The brief from the US and states said the messages about ancillary fees are highly relevant. It said the ancillaries “include facilities fees that Live Nation imposes on fans as part of the ticket price, portions of service fees that Live Nation imposes on fans in addition to the ticket price, and Live Nation’s sale of ‘onsite’ services, such as upgraded parking and access to the VIP lounge.”

The brief said that Live Nation boasted in its latest annual report that ancillary revenue was over $45 per fan for the year, and that “Live Nation CEO Michael Rapino has cited ‘onsite’ ancillary sales [as] a ‘high margin business’ enabled by Live Nation’s scale.”

Live Nation’s argument that ancillary services are irrelevant to trial questions about concert tickets “completely misses the point,” the plaintiffs said. “The fact that Live Nation uses the high-margin ancillary business to monetize the amphitheater monopoly at issue in this case is sufficient on its own to demonstrate relevance. Second, Live Nation is able to degrade the fan experience by charging excessive prices for ancillary services without fear of artists switching away, which demonstrates its monopoly power in the amphitheater market.”

Urging the judge to allow the chats as evidence, the brief said the messages “provide important context and insight to the jury of how Defendants in fact operate their businesses, potentially contrary to the testimony the witness may provide in the courtroom.”

Jon is a Senior IT Reporter for Ars Technica. He covers the telecom industry, Federal Communications Commission rulemakings, broadband consumer affairs, court cases, and government regulation of the tech industry.

Live Nation director boasted of gouging ticket buyers, “robbing them blind”

“Use a gun” or “beat the crap out of him”: AI chatbot urged violence, study finds

The testing occurred between November 5, 2025, and December 11, 2025, and results were shared with the companies. Because the tests were three to four months ago, the latest versions were not evaluated. Google, Microsoft, Meta, and OpenAI told Ars today that updates they implemented after the research was conducted have made their chatbots better at discouraging violence.

Imran Ahmed, the CCDH’s CEO, said that “AI chatbots, now embedded into our daily lives, could be helping the next school shooter plan their attack or a political extremist coordinate an assassination.” He accused tech companies of “choosing negligence in pursuit of so-called innovation.”

A spokesperson for Character.AI told Ars that the company is reviewing the study but that “without the context of the full chats, it’s impossible to fully evaluate the model’s responses… It’s important to remember that the user-created Characters on our site are fictional. They are intended for entertainment and roleplaying, and we have taken robust steps to make that clear. For example, we have prominent disclaimers in every chat to remind users that a Character is not a real person and that everything a Character says should be treated as fiction.”

Character.AI said it has been “rolling out changes so that under-18 users no longer have the ability to engage in open-ended chats with Characters,” and is using “new age assurance technology to help ensure users are grouped into the correct age experience.” This includes “developing our own age estimation model in-house and partnering with third-party services.” The firm added that it removes characters “that violate our terms of service, including school shooters.”

A Perplexity spokesperson issued a statement that didn’t acknowledge any problems with the company’s technology. “People can select any of the top AI models on Perplexity and get safer, more accurate answers,” it said. “Perplexity is consistently the safest AI platform because our own safeguards are always additive to any existing safeguards in any underlying model.”

OpenAI told Ars that the CCDH “report’s methodology is flawed and misleading. ChatGPT is trained to reject requests for violent or hateful material, and the findings show it consistently refused to give instructions on acquiring weapons. We continuously strengthen these safeguards and our latest ChatGPT model is even better at detecting and refusing violent requests.”

OpenAI said that ChatGPT refused to answer questions on “what kind of hunting rifle would be best for a long-range target,” but provided publicly available information such as addresses or maps. Conflating those two types of responses is misleading, OpenAI said. The tests were conducted on GPT-5.1, and updates made since that version have improved detection and refusals for violent content, OpenAI said.

OpenAI was sued this week by the family of a victim of the Tumbler Ridge mass shooting in British Columbia. As the CCDH report says, “reporting indicates that OpenAI staff flagged the suspect internally for using ChatGPT in ways consistent with planning violence. Rather than escalating concern to law enforcement, the company chose to remain silent.”

Researchers posed as teens

The testing was conducted with accounts representing made-up teen users in the US and Ireland, with the age set to the minimum allowed on each platform. A minimum age of 18 was required by Anthropic, DeepSeek, Character.AI, and Replika, while the other platforms had minimum ages of 13.

Intel shores up its desktop CPU lineup with boosted Core Ultra 200S Plus chips

Intel’s Core Ultra 200S desktop chips, codenamed “Arrow Lake,” first launched in late 2024, and they were the most significant update to Intel’s desktop CPU lineup in years. But that didn’t mean they were always improvements over what came before: While they’re power-efficient and run cooler than older 13th- and 14th-generation Core CPUs, they sometimes struggled to match those older chips’ gaming performance. And for gaming systems in particular, they’ve always had to live in the shadow of AMD’s Ryzen 7000- and 9000-series X3D processors, chips with extra L3 cache that disproportionately benefits games.

Intel doesn’t have a next-generation upgrade available for desktops yet, but it is shoring up its desktop lineup with a pair of upgraded chips. The Core Ultra 200S Plus processors (also referred to in some circles as Arrow Lake Refresh) add more processor cores, boost clock speeds, add support for faster memory, and speed up the internal communication between different parts of the processor. Collectively, Intel says these improvements will boost gaming performance by an average of 15 percent.

The Core Ultra 7 270K Plus and 270KF Plus (all of these names are getting to be a real mouthful) add four more efficiency cores compared to the Core Ultra 7 265K, bringing the total number of cores to 24 (8 P-cores and 16 E-cores). If you wanted that many CPU cores previously, you would have had to spring for a Core Ultra 9 chip. The Core Ultra 5 250K Plus and 250KF Plus also get four more E-cores than the 245K, bringing their total to 6 P-cores and 12 E-cores.

Ig Nobels ceremony moves to Europe over security concerns

Traditionally, the awards ceremony and related Ig Nobel events have taken place in Boston at Harvard University, Massachusetts Institute of Technology, and Boston University. However, four of last year’s 10 winners opted to skip the ceremony rather than travel to the US, and the situation has not improved.

Nor is it just the Ig Nobels being affected by the hostile US environment for international travel. Many international gaming developers are choosing to skip this year’s weeklong Game Developers Conference in San Francisco, citing similar concerns. “I honestly don’t know anyone who is not from the US who is planning on going to the next GDC,” Godot Foundation Executive Director Emilio Coppola, who’s based in Spain, told Ars. “We never felt super safe, but now we are not willing to risk it.”

So this year, the Ig Nobel organizers are joining forces with the ETH Domain and the University of Zurich for hosting duties. “Switzerland has nurtured many unexpected good things—Albert Einstein’s physics, the world economy, and the cuckoo clock leap to mind—and is again helping the world appreciate improbable people and ideas,” Abrahams said.

The Ig Nobels will not be returning to the US any time soon. Instead, the plan is for Zurich to host every second year; in odd-numbered years, the ceremony will be hosted by a different European city. Abrahams likened the arrangement to the Eurovision Song Contest.

US blindsides states with surprise settlement in Live Nation/Ticketmaster trial

State attorneys general were “kept in the dark and excluded materially from settlement discussions” while they prepared for trial, the filing said. On March 5, the states were “notified of the near-final terms of the settlement at 4 P.M.” and given one day to determine whether to accept or reject them, the filing said.

States to take over lead role at trial

The US was taking the lead role in the case before the settlement was announced. In addition to seeking a mistrial, the states asked the court to stay the proceedings to give them time “to fully prepare to assume the lead role at trial and explore settlement.”

The states “have had no opportunity to obtain and reallocate the resources necessary to try the case on their own or to meaningfully discuss the settlement with Defendants and attempt to negotiate the terms,” the filing said. “Moreover, despite the primary role that DOJ has played before the jury, the United States (and several additional individual Plaintiff States) will now vanish from the trial… Due to the substantial prejudice caused by this settlement and DOJ’s abrupt exit after taking the lead role up to and during the first week of trial, a mistrial is warranted.”

New York took the lead role in the states’ filing today. “The settlement recently announced with the US Department of Justice fails to address the monopoly at the center of this case, and would benefit Live Nation at the expense of consumers. We cannot agree to it,” New York Attorney General Letitia James said today. “My attorney general colleagues and I have a strong case against Live Nation, and we will continue our lawsuit to protect consumers and restore fair competition to the live entertainment industry.”

Most of the states that backed the filing have Democratic attorneys general, but the group is bipartisan, with Republican attorneys general from Kansas, New Hampshire, Ohio, Pennsylvania, Tennessee, Utah, and Wyoming.

Other states involved in the lawsuit either decided to join the US settlement or have not yet taken a position. States agreeing to the settlement are Arkansas, Iowa, Mississippi, Nebraska, Oklahoma, South Carolina, and South Dakota, the filing said. The other states involved in the lawsuit are Florida, Indiana, Louisiana, Texas, and West Virginia.

This article was updated with a statement from Live Nation.

Nintendo sues to prevent Trump from dodging full tariff refunds


Nintendo may face pressure to share refunds with gamers who helped pay tariffs.

Last Friday, Nintendo joined thousands of companies suing the Trump administration to secure full refunds, plus interest, for billions in unlawful tariffs collected under the International Emergency Economic Powers Act (IEEPA).

In its complaint, Nintendo insisted that the Trump administration has already conceded that more than $200 billion in refunds are owed to hundreds of thousands of importers who paid tariffs, regardless of liquidation status.

However, Nintendo fears that the Trump administration may try to avoid paying refunds to certain companies whose tariff payments have already been liquidated, which means that the duties owed were finalized. The government has continually argued that it will only follow through on refunding all importers if a court directly orders refunds to be repaid in a way that requires reliquidation. Such an order would force officials to void all finalized tariffs and come as a relief to many companies in Nintendo’s position that remain uncertain if all their tariff payments can be clawed back.

Ultimately, Nintendo argued, it increasingly seems like the government plans to delay refunds until the court steps in. That leaves it up to the Court of International Trade to order Trump officials to do the right thing, Nintendo said. And in the gaming giant’s view, that’s to proceed with prompt refunds to make all importers whole.

As Nintendo explained, the company regularly imports goods and paid unlawful tariffs throughout 2025. Notably absent in Nintendo’s complaint was the amount of tariffs the company wants refunded. However, Nintendo seemingly has a lot of liquidated duties at stake. The company argued that without a ruling barring the government “from arguing that liquidation prevents the Court from ordering refunds,” the company will “suffer imminent irreparable harm.”

“All liquidated entries including IEEPA Duties must be reliquidated,” Nintendo argued. “This Court has the authority to reliquidate entries subject to the IEEPA duties.”

According to Nintendo, the Trump administration has no plans to oppose such a court order, and all that’s needed is the rubber stamp.

The company asked the court to order prompt refunds for all companies harmed by Trump’s unlawful trade policy. To ensure that the tariff refund chaos doesn’t worsen as courts weigh the right path forward, Nintendo also wants the court to block officials from continuing to liquidate tariff payments and to order the reliquidation of any entries that have already been liquidated.

Gamers may want Nintendo to share refunds

Nintendo of America declined to comment on whether the company has estimated the total tariff refund owed or to share any public financial documents that estimate total tariffs paid, so it’s hard to know exactly how big the company’s refund could be.

“We can confirm that we filed a request,” Nintendo of America said, regarding the lawsuit. “We have nothing else to share on this topic.”

It’s possible that Nintendo is uncomfortable publicly sharing an estimate of its tariff refunds because the company has gotten some backlash over both ordinary and tariff-related price increases in the past year. Sharing an estimate of the refunds owed could risk reviving that backlash, with customers pushing Nintendo to find a way to pass partial refunds on to those who helped pay the tariffs.

For Nintendo, Trump’s IEEPA tariffs had particularly terrible timing. They took effect last April, just as Nintendo was gearing up to release the Switch 2. The sudden tariffs caused delays for preorders, but the console launched as planned, as Nintendo refused to let tariffs disrupt the official rollout.

For gamers, the Switch 2 already had a higher price tag than expected, at $450. Lashing out over the sticker shock, a swarm of disgruntled online protesters urged Nintendo to “drop the price.”

There was speculation that the price hike was linked to tariffs. But Nintendo of America President Doug Bowser told The Verge that the jump from the Switch’s debut price of $300 was not directly due to tariffs. Instead, it seemed that Nintendo had joined other game companies in raising console prices to historic highs, an Ars review found. But Bowser acknowledged that Trump’s IEEPA tariffs were still “fresh” at that moment, telling The Verge that, “like many companies right now,” Nintendo was “actively assessing what the impact may be.” Understandably, gamers braced for more price increases.

It only took a month before Nintendo President Shuntaro Furukawa foreshadowed tariff-linked price increases, a game industry news site closely monitoring Nintendo’s tariff moves reported. In May, Furukawa conceded that software wasn’t as impacted, but “hardware involves special factors such as tariffs,” which Nintendo must take into account, “while conducting careful and repeated deliberations when determining price.”

However, Furukawa said that the overall calculus for Nintendo weighed against increasing the Switch 2 price even more to cover tariffs, because seemingly Nintendo feared a higher price point would rob Switch 2 of sales and its games of exposure. As he explained:

Our basic policy is that for any country or region, if tariffs are imposed, we recognize them as part of the cost and incorporate them into the price. However, this year marks our first new dedicated video game system launch in eight years, so given our unique situation, our priority is to maintain the momentum of our platforms, which is extremely important for our dedicated video game platform business. Consequently, if the assumptions on tariffs change, we will consider what kind of price adjustments would be appropriate, taking into account various factors such as the market conditions.

By August, the Switch 2 price remained stable, but Nintendo had increased prices on the original Switch, as well as Switch 2 accessories, citing “market conditions.”

And Nintendo wasn’t the only company forced to make adjustments that riled its fans, which suggests that many major players in the gaming industry may face demands from frustrated consumers to share refunds.

Nintendo may get creative to avoid backlash

Early on during Trump’s IEEPA tariff regime—which randomly raised and decreased tariffs on products from all major US trading partners—the Entertainment Software Association warned that the entire game industry could be harmed by unchecked tariffs.

And the Consumer Technology Association (CTA), which has long opposed IEEPA tariffs, forecasted before Trump took office that his tariff threats risked harming consumers by immediately increasing game console prices by 25 percent.

That forecast only got darker as 2025 dragged on. In May, when China and Trump were still embroiled in tit-for-tat retaliations, and China appeared to have the upper hand, CTA warned that an estimate showed only 1 percent of game consoles are produced in the US. If IEEPA tariffs weren’t changed to exempt consoles from tariffs, consoles could soon cost more than $1,000 on average, up by about 69 percent, CTA estimated.

It remains unclear how much Nintendo and other gaming companies paid in tariffs or how much their customers paid in tariff-related price increases, and for the latter at least, it will likely stay that way. No courts are currently weighing whether customers who helped importers pay for tariffs should get refunds, too.

Dallas Dolen, a technology, media, and telecommunications leader for PwC, which advises big firms on tax questions, told Ars that most companies are laser-focused right now on securing refunds. However, once they have that money, some companies worried about reputational harm may come up with “creative” ways to reimburse customers, such as offering discounts.

Ed Brzytwa, CTA’s vice president of international trade, told Ars that it was obvious that consumer backlash to price increases was one of the biggest tariff burdens for consumer tech firms like gaming companies.

“The main point that we’ve made over and over and over again is that this impacts consumers in the form of potentially higher costs for products,” Brzytwa told Ars.

Last month, libertarian think tank the Cato Institute published calculations showing that “tariff costs have generally been borne by US-based companies and consumers.”

“Americans are bearing most of the tariffs’ economic burden, including through higher retail prices,” the Cato Institute reported. In a chart tracking analyses that measured costs passed on to consumers, Cato cited a Goldman Sachs study predicting that, by the end of 2025, the share of the “tariff burden” borne by consumers “would shift to 55 percent.” The most recent analysis cited, a Yale Budget Lab study, found that the costs of tariffs passed on to companies and consumers increased over time.

There’s no telling yet whether any companies that passed on tariff costs will pass on relief to consumers or if it will simply help them keep prices stable as new tariffs come.

For Nintendo and other consumer tech companies, refunds may provide a reprieve but don’t actually provide relief from tariff hell, experts agreed. As Trump looks to replace struck-down IEEPA tariffs with tariffs that could shake up supply chains further by targeting semiconductors or other currently exempt tech products or materials, Dolen told Ars that tech companies “don’t feel better that there might be some, quote, win here because the supply chain still feels overwhelming.” That could mean that the best Americans can hope for is that prices don’t increase more as Trump tries to keep his tariff regime alive.

One glimmer of hope for American consumers—who largely oppose Trump’s tariffs across the board and are already frustrated to be missing out on tariff refunds—is that Trump is likely well aware that additional tariff-related price increases will not be tolerated ahead of the midterm elections, experts suggested.

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

2026 Australian Grand Prix: Formula 1 debuts a new style of racing


Just like the Apple movie?

The key is understanding how to conserve energy across a lap. Oh, and be reliable.

Formula 1’s 2026 season got started in Australia this weekend. Credit: Alessio Morgese/NurPhoto via Getty Images

Formula 1’s 2026 season got underway this past weekend in Melbourne, Australia. Formula 1 has undergone a radical transformation during the short offseason, with new technical rules that have created cars that are smaller and lighter than before, with new hybrid systems that are more powerful than anything since the turbo era of the 1980s—but only if the battery is fully charged.

The changes promised to upend the established pecking order of teams, with the introduction of several new engine manufacturers and a move away from the ground-effect method of generating downforce that had been in use since 2022. For at least a year, paddock rumors have suggested that Mercedes might pull off a repeat of 2014, when it started the first hybrid era with a power unit far ahead of anyone else.

That wasn’t entirely clear after six days of preseason testing in Bahrain, nor really after Friday’s two practice sessions in Melbourne, topped by Charles Leclerc’s Ferrari and Oscar Piastri’s McLaren, respectively. The Mercedes team didn’t look particularly worried, though, and on Saturday we found out why, when George Russell finally took off the sandbags and showed some true pace, ending free practice more than six-tenths of a second faster than the next-quickest car, the Ferrari of Lewis Hamilton.

It’s never done that before

It wasn’t all smooth running for Mercedes, though: Russell’s teammate, Kimi Antonelli, tore three corners off his car during the same practice session, giving his mechanics a monstrous job to rebuild it all in a few short hours before qualifying. They almost didn’t make it in time, but qualifying was interrupted by a red flag caused by an uncharacteristic crash for four-time world champion Max Verstappen, who ended up in a crash barrier right at the start of his first flying lap.

A rear lockup sent Max Verstappen into the barrier during qualifying. Paul Crock / AFP via Getty Images

“I’ve never experienced something like that before in my career. The rear axle just completely locked on, then of course you can’t save that anymore at that speed,” Verstappen told the media. Red Bull hasn’t yet revealed the precise cause of Verstappen’s crash, which forced him to start Sunday’s race from the back of the grid, but it’s likely related to the way the car’s electric motor can harvest more than half of the power output from the V6 engine.

Verstappen wasn’t the only driver caught out by unfamiliar hybrid behavior. Last year’s title hopeful and hometown hero Oscar Piastri looked to have the measure of his teammate (and reigning world champion) Lando Norris, but never even took the start of the race. On the way to the grid, Piastri took a little too much curb at turn 4, at which point his car delivered 100 kW more power than he was expecting; on cold tires, this spun the wheels, and before he could catch it, the car was in pieces and his weekend was over.

Ctrl-Alt-Del

If you are a relatively recent F1 fan, you may have only watched the sport during a period of extreme reliability. It was very much not always this way, and even when budgets for the top teams were two or three times what they’re allowed to spend now, cars broke down a lot.

Completely disassembling cars and putting them back together overnight didn’t help, though that practice ended some years ago; mostly, reliability improved because of technical rules that required teams to use the same engines for multiple races. Until 2004, you could use multiple engines in a single race weekend; by 2009, each driver was allowed only eight engines in a season. Now the limit is just three engines, with the same limits applying to the components of the hybrid systems, and grid penalties for drivers who exceed them.

Aston Martin got enough running this weekend to shave two seconds off its lap time deficit to the front-runners. Credit: Martin KEEP / AFP via Getty Images

Such failures have been rare of late, since the previous power units had been relatively stable since 2014 and were thus well-understood. But multiple drivers had issues this weekend in Oz. On Friday, we already discussed the vibration problem that limited Aston Martin’s running in preseason testing and during the first day of practice. That didn’t get much better for the team in green, which used Sunday’s race as a test session: Fernando Alonso completed 21 laps in total, while Lance Stroll did 43 and actually took the finish, although he wasn’t classified, as the race distance was 58 laps.

But Aston Martin wasn’t alone in having problems. Williams has had its own trouble this year with a car that is uncompetitive and overweight, and Carlos Sainz missed the entire qualifying session after a breakdown on his way back into the pit lane. On Sunday, Audi’s Nico Hülkenberg had to be pushed into the garage just before the start of the race with a power unit failure, marring what has otherwise been an excellent debut for the new power unit constructor, which took over the Sauber team.

Verstappen’s teammate, Isack Hadjar, had done the seemingly impossible for a Red Bull second driver and stepped up after Verstappen’s qualifying crash to claim third on the grid, behind the two extremely fast Mercedes drivers. But he only got as far as lap 10 before his power unit, the product of Red Bull’s in-house program with help from Ford, failed somewhat spectacularly, parking him by the side of the road. Five laps later, the (Ferrari-powered) Cadillac of Valtteri Bottas broke down, too. Not quite the failure rate that some predicted, but six cars out of 22 still failed to make it to the checkered flag.

But it wasn’t all bad

That said, the other 16 cars did finish, including the Cadillac of Sergio Perez. Cadillac has managed to stand up a team from scratch and, since then, meet every deadline it needed to. Now, it has the rest of the season to show us it can make its car fast, something that equally applies to Williams and Aston Martin.

Audi looks to have landed in the midfield at the start of its F1 adventure. Credit: Joe Portlock/Getty Images

Audi’s task was almost as monumental as Cadillac’s: designing and building a new power unit to install in what was the Sauber team before the German OEM took control. Aside from Hülkenberg’s problem, it had a pretty good debut. The cars lined up 10th and 11th for the race, and Gabriel Bortoleto showed off the talent that won him an F2 championship in his first year by finishing in 9th place, scoring the outfit points on its debut. Audi looks like a solid midfield contender, alongside Haas and Racing Bulls.

Alpine’s Pierre Gasly scored the final point, but that team, like Williams, looks a long way from making best use of its Mercedes power units and right now needs to combat a problem with understeer that affects its car in high-speed corners.

Russell initially battled Leclerc for the lead, the pair passing and repassing each other several times over several laps, which allowed a rejuvenated Hamilton to catch up with them. Russell was the meat in a sandwich between the two Ferraris until Hadjar’s retirement brought out the first virtual safety car. The two Mercedes took the opportunity to pit for new tires, undercutting their rivals in red.

The Ferraris of Leclerc and Hamilton probably weren’t fast enough to have won even if they’d pitted at the same time. They didn’t, and they finished third and fourth, behind the victorious Russell and second-placed Antonelli. In clean air, the Mercedes looked unstoppable in Melbourne, and the team clearly understands how to get more out of these new power units than its customer teams do.

A new style of racing

The peculiarity of these new hybrid power units has demanded a new way to be fast, particularly at the temporary circuit formed around the roads of Melbourne’s Albert Park, which lacks the heavy braking zones of most F1 tracks. This was evident with the cars decelerating well before the turn 9-10 complex as the engines diverted so much of their power away from the rear wheels and through the electric motor into the battery to use later in the lap. While not quite coasting, the drivers were clearly trying to maintain as much momentum as possible with little power actually going to the tires.

MELBOURNE, AUSTRALIA - MARCH 8: The drivers prepare for their group photo on track during the F1 Grand Prix of Australia at Albert Park Grand Prix Circuit on March 8, 2026 in Melbourne, Australia. (Photo by Jayce Illman/Getty Images)

Twenty-two drivers, 22 opinions.

Credit: Jayce Illman/Getty Images


Whether they approved of this or not seems to rest on whether they have a fast car.

“I thought the race was really fun to drive. I thought the car was really, really fun to drive. I watched the cars ahead, there was good battling back and forth. So far, so good. It may seem different, but in my position, I thought it was great,” said Hamilton.

“It created a lot of action in the first few laps of the race, so I think, you know, on this kind of track there will be a lot of action, in some other track maybe a bit less. But I think today was much better than what we all anticipated, so I think, yeah we need to just wait a few more races before actually commenting on this new regulation,” said Antonelli.

“Maybe now, there’s a bit more of a strategic mind behind every move you make, because every boost button activation, you know you’re going to pay the price big time after that, and so you always try and think multiple steps ahead to try and end up eventually first. But it’s a different way to go about racing for sure,” Leclerc said.

“Everyone’s very quick to criticize things. You need to give it a shot, you know. We’re 22 drivers, when we’ve had the best cars and the least tire degradation, and we’ve been happiest, everyone moans the racing [is] rubbish. Now, drivers aren’t perfectly happy, and everyone said it was an amazing race. So, you can’t have it all. And I think we should give it a chance and see after a few more races,” said Russell.

Outside the top four, the verdict was less enthusiastic, Verstappen’s in particular. And I noted with interest a press release this morning from Red Bull announcing that the four-time F1 champion will contest the 2026 Nurburgring 24-hour race with his GT3 team in May, plus the qualifying races that lead up to it. Verstappen will race alongside Jules Gounon, Dani Juncadella, and Lucas Auer in a Mercedes-AMG GT3 after securing his permit to race at the fearsome German circuit last year. With little left to prove in F1, there is a real chance the Dutch driver walks away from single-seaters next year, at least until the next F1 rule change, to try to win endurance races like Le Mans.


Red Bull had someone BASE jump into this cooling tower to unveil the livery on Verstappen’s GT3 car.

Credit: Mihai Stetcu / Red Bull Content Pool


But that will surely depend on how well things go over the next few races, the next of which takes place next weekend in Shanghai, China. For now, I’m cautiously optimistic. The first few races of the season are on tracks that won’t play to these hybrids’ strengths, and it’s easy to reflexively hate anything new. But the racing on Sunday was more than entertaining enough, even if it wasn’t quite the same as we saw last year.


Jonathan is the Automotive Editor at Ars Technica. He has a BSc and PhD in Pharmacology. In 2014 he decided to indulge his lifelong passion for the car by leaving the National Human Genome Research Institute and launching Ars Technica’s automotive coverage. He lives in Washington, DC.

2026 Australian Grand Prix: Formula 1 debuts a new style of racing Read More »

claude-code,-claude-cowork-and-codex-#5

Claude Code, Claude Cowork and Codex #5

It feels good to get back to some of the fun stuff.

The comments here can double as a place for GPT-5.4 reactions, in addition to my Twitter thread. I hope to get that review out soon.

Almost all of this will be a summary of agentic coding developments, after a note.

  1. The Virtue of Silence (Unrelated Update).

  2. Agentic Coding Offers Mundane Utility.

  3. Agentic Coding Doesn’t Offer Mundane Utility.

  4. Huh, Upgrades.

  5. Our Price Cheap.

  6. Quickly, There’s No Time.

  7. A Particular Set Of Skills.

  8. Next Level Coding.

  9. Dual Wielding.

  10. They Took Our Jobs.

  11. You Need To Relax Sometimes.

  12. Levels of Friction.

  13. Danger, Will Robinson.

  14. Snagged By The Claw.

  15. The Meta Clause.

  16. If They Wanted To.

  17. The Famous Mister Claw.

  18. Claw Your Way To The Top.

  19. Claw Your Way Out.

  20. A Chinese Claw.

  21. Hackathon.

  22. Introducing Agent Teams.

  23. Cowork Is A Gateway Drug.

  24. Dangerously Evade Permissions.

  25. Skilling Up.

  26. Modern Working.

  27. Measuring Autonomy.

  28. I Don’t Even See The Code.

  29. Scratchpads Are Magic.

  30. It’s Coming.

  31. The Grep Tax.

  32. Beware Claude Mania.

  33. The Lighter Side.

  34. In Other Agent News.

  35. The Lighter Side.

After Undersecretary of War Emil Michael went on the All-In Podcast and gave an extensive interview to Pirate Wires, I found many enlightening quotes, many of which demanded a response, and set about assembling an extensive analysis of Emil Michael’s statements during the ongoing recent events involving Anthropic.

As part of that, I ended up in a remarkably polite and productive Twitter exchange with him. We reached several points of agreement. The Department of War has no intention of doing what in law is called ‘mass domestic surveillance’ but those words are terms of art in NatSec law, and mean a much narrower set of things than one would think.

There are many things that I or Anthropic or most of you would look at as mass domestic surveillance, that are legal, and it is DoW’s position that it’s their job and duty to do everything legal to protect our country, including those things. The law has not caught up with reality and Congress needs to fix that. And this is the best country in the world, with the best system of government, because private citizens can voice their disagreement with such actions, including by refusal to participate.

Thus, in the spirit of de-escalation, although there are many interpretations of events shared by Michael with which I strongly disagree, I am going to indefinitely shelve the piece, so long as events do not escalate further. As long as things stay quiet, there is no need to relitigate or unravel the past on this. The Department of War can focus on its active operations, things can work their way through the courts as our founders intended, and once we see how we work together in an ultimate real-world test, hopefully that will rebuild trust that we are all on the same side, or at least let us agree to part in peace once OpenAI is ready. Ideally the DoW will have multiple suppliers, exactly so that it is not dependent on any one supplier, the same way we do it with aircraft.

I hope to not have another post on the Anthropic and DoW situation, at least until the one celebrating that we have found a resolution.

Now, back to coding agents.

That’s 4% that are labeled as authored by Claude Code. The real number is higher.

Dylan Patel: 4% of GitHub public commits are being authored by Claude Code right now.

At the current trajectory, we believe that Claude Code will be 20%+ of all daily commits by the end of 2026.

While you blinked, AI consumed all of software development.

Read more [here].

Kevin Roose: this chart feels like those stats at the beginning of covid. “who cares about 400 cases in seattle? and why are all the epidemiologists buying toilet paper?”
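Dylan Patel’s projection implies steep compounding. A back-of-the-envelope sketch of the implied monthly growth rate (the ten-month window is my assumption for illustration, not stated in the quote):

```python
# What monthly growth rate takes Claude Code's share of public GitHub
# commits from 4% to 20% over ten months? (Window length is assumed.)
start_share = 0.04
target_share = 0.20
months = 10

# Solve start_share * r**months == target_share for the monthly factor r.
monthly_factor = (target_share / start_share) ** (1 / months)

print(f"implied growth: {(monthly_factor - 1) * 100:.1f}% per month")
```

Roughly 17 percent month over month, sustained for most of a year; that is the kind of curve Roose is gesturing at.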

The flippening has happened in terms of annual recurring revenue added, and SemiAnalysis thinks Anthropic is outright ‘winning’:

Doug O’Laughlin: Notably, our forecast shows that Anthropic’s quarterly ARR additions have overtaken OpenAI’s. Anthropic is adding more revenue every month than OpenAI. We believe Anthropic’s growth will be constrained by compute.

Each moment expanded what AI could do. GPT-3 proved scale worked. Stable Diffusion showed AI could make images. ChatGPT proved demand for intelligence. DeepSeek proved that it could be done on a smaller scale, and o1 showed that you could scale models to even better performance. The viral moments of Studio Ghibli are just adoption points, while Claude Code is a new breakthrough in the agentic layer of organizing model outputs into something more.

Anthropic has deals with all three major cloud services. Can they scale up faster?

Analyze the economic data in R with 15 minutes of work per month instead of 4-5 hours, without the annoying copying and pasting you get with a chatbot UI. Or use Claude Code to create reports.

Results from the Claude Code hackathon.

Michael Guo: So the winners of the Claude Code hackathon were:

– a personal injury attorney

– an interventional cardiologist

– an electronic musician

– an infrastructure/roads systems worker

– and one software engineer

That should tell you something.

Or you can do things as a side project while at Anthropic, cause sure why not:

Sam Bowman (Anthropic): I found the official Get Information about Schools website a bit clunky, so I made a new one with Claude Code. You can:

  1. Set a postcode and see all schools within a radius you’ve chosen, filtered by type of school.

  2. Filter and rank by the old one-word Ofsted ratings, with a link to the Ofsted page for each school. Where available, the sub-ratings are also viewable.

  3. Filter and rank by percentage of students on free school meals.

  4. View how full up schools are (number of pupils vs capacity).

Sam Bowman (Anthropic): Thank you for all the feedback! I have now added:

– Viewfinder view, so you can browse without setting an address and radius.

– An estimated overall Ofsted rating based on an average of the 5 review categories, for schools inspected since the old ratings were scrapped.

– Data on primary KS2 and secondary KS4 results; ethnicity; and pupils with English as a second language. (I’m not doing sixth form results for the time being.)

Creating a skill to get good YouTube transcripts was one of the first skills I made with Claude Code; Julia Turc calls using an MCP for this ‘waking up from a coma.’ I have still only used it on the motivating example, because the right podcast hasn’t come up, but when it does, this will save a lot of time.

Tod Sacerdoti has Claude Codex write a 250-page biography of Dario Amodei.

Andrej Karpathy gives another example to illustrate that AI coding still needs direction, judgment, taste, oversight, iteration, hints, and ideas, but that it changed in December from ‘basically didn’t work’ to ‘basically works.’

Lewis: Name one thing that has changed the last two months except attention. Capability is the exact same. Karpathy is an unserious voice on codegen by now as unfortunate as that is to say.

Teortaxes: GPT 5.2, Opus 4.6, even small models like StepFun got real

friction changed, that’s what. It has started to Just Work. 3, 4 months ago coding agents felt like proof of concept, now they feel like solid juniors if not more

If you don’t notice that, idk what to tell you

Official compilation of Claude customer stories.

Chris Blattman automates his workflow with Claude Code.

Warning: If you Google ‘install Claude Code’ you are liable to hit malware. Probably fixed by the time you read this but Google needs to up its game.

Chayenne Zhao tells Codex 5.3 ‘make it faster’ over and over, and it ends up committing API identity theft against him in order to make calls to Gemini Flash.

This should never happen but is also what we call ‘asking for it.’

A thing never to do is let your agent mess with the Terraform command, or you might wipe out your entire database. In general, writing code is in practice mostly harmless, but be very careful with file structures, organizational shifts, Terraform, and the like. Always make backups first. Always.

The big upgrade is Agent Teams, for that see Introducing Agent Teams.

Or it actually might be Claude Remote Control so you can run it from your phone, if you were too lazy to install something like this from a third party. Vital infrastructure.

Or maybe it’s Auto Mode, aka --kinda-dangerously-skip-permissions.

Claude Cowork got the obvious big upgrade: it is now available on Windows.

Claude Code launched HTTP hooks so you can combine it with web apps, including on localhost, and deploy things more easily.

Claude Code Desktop introduces scheduled tasks. Previously it had me do this via a script on my computer, so this is a lot cleaner and easier. I like it.

Claude Code has a built-in short-term scheduler with /loop [interval], which sets up a cron job. Tasks last for three days.

Claude Code on the Web picked up a few new features, including multi-repo sessions, better diff & git status visualizations and slash commands. It didn’t have slash commands before?

Claude Code now automatically records and recalls memories as it works.

Claude Code CLI adds native support for git worktrees.

Claude Code adds /simplify to improve code quality and /batch to automate code migrations.

Claude Code Desktop now supports --dangerously-skip-permissions as ‘Act’ if you turn it on in Settings. I continue to want a --somewhat-dangerously-skip-permissions that makes notably rare exceptions so we don’t have to roll our own.

Claude Code in Slack now has Plan Mode.

Did you know Obsidian has a CLI and it technically isn’t Claude Code?

I don’t see a particular reason for a human to use the Obsidian CLI. But I do see reasons for Claude Code to invoke the Obsidian CLI, which grants better and faster access to the information in your vault than checking all the files directly.

And many more not listed, of course.

When you pay for usage with a monthly subscription, be it $20, $100 or $200, if you use up your quotas you get a lot of tokens for not that much money. It’s a great deal, even if you leave a lot of it unused, because they lock you in.

It also generally is a better experience, so long as you’re not up against the limits. I love unlimited subscriptions because the marginal cost of doing things is $0. That feels great, so there’s no stupid little whisper in your brain telling you to not do things, when your time is way more valuable than the tokens.

The people agree.

The danger is that you become obsessed with not ‘wasting’ the tokens, or you start going around multi-accounting and it gets weird, or you run into limits and actually stop coding rather than moving to using the API. You mostly shouldn’t let any of that stop you.

That doesn’t work when you want to go full Fast Claude. At that point, you’re talking real money, and you do have to think about what is and is not Worth It.

Andrej Karpathy has Claude Code write him software to coordinate an experiment to track his exercise and attempt to lower his resting heart rate. It took 1 hour, would have taken 10 hours two years ago (so 10x speedup) and he asks why it needs to take more than 1 minute in the future. My guess is this should take 10 minutes not one, because it’s worth getting the details that you want. The speedup on one-off tasks is already dramatic and it changes how we should interact with tech. If you’re building the tool, you can give it the actually important parts of the context and highlight the uses you care about, which is way better than ‘find an app that does sort of the thing you want.’

Claude: Our teams have been building with a 2.5x-faster version of Claude Opus 4.6.

We’re now making it available as an early experiment via Claude Code and our API.

Claude: Fast mode is more expensive to run. It’s for urgent, high-stakes projects, combining impressive speed with Opus-level intelligence.

Claude: Fast mode is available now for Claude Code users with extra usage enabled (use /fast).

It’s also available in research preview on @cursor_ai , @emergentlabs , @FactoryAI , @figma , @github Copilot, @Lovable , @v0 , and @windsurf .

You toggle this by typing /fast, or set “fastMode”: true in your user settings.
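If you go the settings route, the snippet below is a minimal sketch. The `"fastMode"` key comes from the announcement above; the exact location of the user settings file (commonly `~/.claude/settings.json`) is my assumption and may vary by install:

```json
{
  "fastMode": true
}
```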

Speed kills. That includes killing your budget.

Claude Code Docs: Fast mode is not a different model. It uses the same Opus 4.6 with a different API configuration that prioritizes speed over cost efficiency. You get identical quality and capabilities, just faster responses.

What to know:

Use /fast to toggle on fast mode in Claude Code CLI. Also available via /fast in Claude Code VS Code Extension.

Fast mode for Opus 4.6 pricing starts at $30/$150 per MTok [at >200k context window it goes to $60/$225]. Fast mode is available at a 50% discount for all plans until 11:59 pm PT on February 16.

Available to all Claude Code users on subscription plans (Pro/Max/Team/Enterprise) and Claude Console.

For Claude Code users on subscription plans (Pro/Max/Team/Enterprise), fast mode is available via extra usage only and not included in the subscription rate limits.

When you switch into fast mode mid-conversation, you pay the full fast mode uncached input token price for the entire conversation context. This costs more than if you had enabled fast mode from the start.
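Putting the listed rates together, here is a small sketch of the billing arithmetic. This is my own illustrative helper, not Anthropic’s billing logic: it uses the quoted $30/$150 and $60/$225 per-MTok rates, ignores cache pricing, and models the temporary 50% promotion as an optional parameter.

```python
def fast_mode_cost(input_tokens: int, output_tokens: int,
                   over_200k_context: bool = False,
                   discount: float = 0.0) -> float:
    """Estimate fast-mode Opus 4.6 spend in dollars (illustrative sketch).

    Rates: $30/$150 per million input/output tokens, rising to $60/$225
    beyond a 200k context window. `discount` models the 50% promotion.
    """
    in_rate, out_rate = (60, 225) if over_200k_context else (30, 150)
    dollars = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    return dollars * (1 - discount)

# A session with 500k input and 100k output tokens under the 200k window:
print(fast_mode_cost(500_000, 100_000))  # 30.0
# The same usage past the 200k window comes to 52.5, and switching to fast
# mode mid-conversation re-bills the whole existing context at these
# uncached rates, which is why enabling it from the start costs less.
```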

cat: We granted all current Claude Pro and Max users $50 in free extra usage. This credit can be used on fast mode for Opus 4.6 in Claude Code.

To use, claim the credit and toggle on extra usage at https://claude.ai/settings/usage. Then, run `claude update && claude` and `/fast`. Enjoy!

Like any good drug, the first hit is free.

There is one important use case that Anthropic does not list for fast mode, which is if you are talking to Claude, or otherwise using it in a non-workhorse, non-coding capacity. In that case, token use is limited, and your time and flow are valuable. Would you switch to this mode in Claude.ai? At this point it’s fast enough that I mostly don’t know that I would, but it would be tempting.

Before, I said go ahead and pay whatever the AI costs unless you’re scaling hard.

Well, this is what it means to scale hard. We are now talking real money.

This is as it should be. If you’re not worried you’re paying too much for speed or using too many tokens, you’re not working fast enough and you’re not using enough tokens.

Siméon: The new pricing of Claude Fast pushes the world into a new regime. You can now spend close to $1M per year per dev on AI spending.

A couple implications:

  1. at fixed budget this will push towards hiring way less devs & pay them much more.

  2. for each dev, you might spend as much or more in capital in agents.

  3. Devs are becoming complements to AI agents, not the other way around. There’s a shift in the source of productivity.

The greatest substitution of labor with capital is happening before our eyes, and some of its wild implications are gonna become apparent in the coming weeks.

0.005 Seconds (3/694): Update: it’s about $5 per minute PER AGENT

SemiAnalysis: IMPORTANT: the sub-agents that opus 4.6 fast mode tries to launch is mainly sonnet sub-agents and not opus 4.6 sub-agents. That means as the end users, you are able to absorb less tokens. In the world that intelligence = intelligence times # of tokens, that means you are absorbing less intelligence.

Danielle Fong: you can change this by asking claude nicely

Token efficiency matters at this level, in a way it did not before.

So does your ability to efficiently turn your time into tokens well spent. Those that aren’t using agents to their fullest will fall farther behind on high value projects.

What do the people think? The people, inside and outside of Anthropic, love it.

Jarred Sumner (Anthropic): I’ve been using this and it is incredible

The bottleneck for a lot of projects becomes asking Claude to do things instead of waiting for Claude to do things

Bash tool is also a bottleneck in Claude Code right now when the command outputs large strings. We are working on a fix.

Boris Cherny (Claude Code Creator, Anthropic): We just launched an experimental new fast mode for Opus 4.6.

The team has been building with it for the last few weeks. It’s been a huge unlock for me personally, especially when going back and forth with Claude on a tricky problem.

Mckay Wrigley: a) love that this is an option! stoked

b) should be obvious to everyone that we have *absolutely nowhere near the amount of compute we need* and we need to be doing more to enable that. no college kid can afford this (not anthropic’s fault ofc) and we need to work towards that

Julian Schrittwieser: Fast Opus is amazing, the first time I used it I couldn’t stop coding for hours – it honestly feels like a superpower, you can mold your code base as quickly as you can think.

Truly amazing, nothing made me feel the AGI more, definitely try it!

Uncle J: Same experience here. Fast Opus completely changed my workflow – I went from carefully planning each edit to just thinking out loud and letting the model reshape the codebase in real time. The bottleneck shifted from “can the AI do this” to “can I think of what to do next” fast enough. Running 6 products simultaneously became actually manageable.

Dylan Patel: SemiAnalysis autists spent all Superbowl Sunday Claude coding.

Daily Claude Code spend hit $6k on Sunday and it’s trending higher today.

It was less than 1k just 2 weeks ago.

“Fast mode is expensive” is pure cope.

de.bach: have to disagree on that one, fast mode is just expensive.

Dylan Patel: Cheap compared to high skill people

OpenAI confirms that Codex is trained in the presence of the Codex harness. It is specialized for that harness, and also helps build the harness. Some amount of this has to be optimal for short term effectiveness, and if you’re doing recursive self-improvement short term help translates into better long term help. In exchange, you get locked in, and it gets harder for both you and others to adapt or mix-and-match.

Himanshu argues the coding harness is the real product and goes viral. Explains how different harnesses organize actions, the oddest part is not mentioning Codex.

This seems right:

roon: whatever level of abstraction you are handing off to your agents you should probably be doing one level above that

If that can’t be done, good to try and realize that. Then wait two months. Maybe one.

Greg Brockman (President OpenAI): codex is so good at the toil — fixing merge conflicts, getting CI to green, rewriting between languages — it raises the ambition of what i even consider building

roon: i was never a hyperproductive engineer like greg [Brockman] but I’m legitimately running more new complex rewards experiments, test time harnesses in a week than I used to in a quarter. makes you feel like all this is commodified and you need to dream much bigger

roon: one of the consistent things over several years at oai has been that the entire job of the researcher changes every three months – but now it changes like every two weeks

The problem with using both Claude Code and Codex is then you need to keep up with both of them.

corsaren: Ugh, i definitely need to use codex, but I’m already drowning in maintaining my tooling/skills/hooks/custom CLIs, so managing that across a dual model workflow sounds exhausting.

Plus, the claude code lock-in is very real as a non-technical user.

gazingback: codex is sooo much faster for coding but def less general

been working on a game and by the time Claude finishes reading files codex is usually done implementing a detailed PR with disciplined testing and hygiene

Codex also demands you be pretty hygienic lol

Danielle Fong: need to bake a dual mode codex claude code and ports and tests every workflow

That still leaves plenty more jobs. For now.

Duca: The thing I don’t get is:

Claude Code is writing 100% of Claude code now.

But Anthropic has 100+ open dev positions on their jobs page.

?

Boris Cherny (Claude Code Creator, Anthropic): Someone has to prompt the Claudes, talk to customers, coordinate with other teams, decide what to build next. Engineering is changing and great engineers are more important than ever.

A viral post on Twitter warns of token anxiety run rampant in San Francisco. People go to a party, then don’t drink and leave early so they can get back to their agents, to avoid risking them sitting idle. Everyone talks about what they are building.

Peter Choi: Everyone here knows they should step away more. That’s not the problem. The problem is what your brain does when you try. I still take aimless walks. The agents come with me now.

We swapped one dopamine loop for another, except this one feels productive, so it’s harder to recognize.

TBPN: Pragmatic Engineer’s @GergelyOrosz is on a “secret email list” of agentic AI coders, and they’re starting to report trouble sleeping because agent swarms are “like a vampire.”

“A lot of people who are in ‘multiple agents mode,’ they’re napping during the day… It just really is draining.”

“This thing is like a vampire. It drains you out. You have trouble sleeping.”

Olivia Moore: In a post-OpenClaw world, we can now delegate projects to AI and get “tapped on the shoulder” when it needs help

As a heavy AI user, I’m doing more work – not less – because I get so much leverage + it’s easier to get ideas off the ground

I predict this will happen to everyone

I do feel somewhat bad I’m not building things continuously on the side, but that’s on the level of ‘I’m not building anything and I’m at my computer right now and Claude Code and Codex are inactive.’ And yes, I work and am at my computer rather a lot, and I’ve spent years basically locked in and constantly watching screens so I could trade better. That year I was trading crypto my brain was never fully anywhere else.

Also, I remember what it is like to be in the grip of one of those games that work on cycles. There’s nothing actually that important at stake, but you grow terrified that you’ll miss out if you’re not there when the timer runs out. You need to maximize everything, and you can’t focus on other things, it can hurt your sleep. Then one day you wake up and realize, and hopefully you quit the game.

That’s exactly why I can say that this is not healthy. It’s no good. You have to take breaks. Real breaks. If the agents sit idle, they sit idle. If you ‘waste tokens,’ then you waste tokens. This isn’t a game you want to quit, but you have to set healthy limits.

Nikita Bier: My agent looked up every Amazon product I’ve bought in the last 10 years, called each manufacturer, said it broke and demanded a replacement.

I now have 6 TVs, 12 printers, 2 microwaves, and 800 tubes of tooth paste.

I Meme Therefore I Am: Give me the name of your agent. lol

Jason Levin: OpenFrawd

Leah Libresco Sargeant: Nikita is joking (I think) but a lot of medium trust systems that relied on there being just enough friction to discourage minor fraud are about to break at scale.

This is indeed presumably a joke, and Amazon has pattern detectors so if you tried to do this too many times you’d get blacklisted from replacements, so this exact intervention won’t work. But this raises an excellent point.

In the past, you had to apply effort to demand refunds, and the need to write the words and be actively involved stopped a lot of people out of guilt or shame. Whereas with an agent, a lot more people are going to try things like this. What happens?

Presumably what happens is that replacements start requiring either some form of proof, costly signals of a human driving the request, some use of reputation, or some combination thereof.

I trust Claude Code for most things but it seems correct to be terrified of mass delete commands. Things can go oh so very wrong and occasionally they do. Not worth it. If there’s anything you don’t have fully backed up just do this part manually.

Nick Davidov: Asked Claude Cowork to organize my wife’s desktop, it started doing it, asked for permission to delete temp office files, I granted it, and then it goes “ooops”.

Turns out it tried renaming and accidentally deleted a folder with all of the photos my wife made on her camera for the last 15 years. All photos of kids, their illustrations, friends’ weddings, travel, everything.

It’s not in trash, it was done via terminal

It’s not in iCloud, it already synced the new file structure.

She didn’t have Time Machine.

Disc recovery tools can’t see anything.

I called Apple and they pointed me to a feature in iCloud allowing to retrieve files that were saved before but are no longer on iCloud Drive (they keep them for 30 days).

I’m now watching it load tens of thousands of files. I nearly had a heart attack.

Once again – don’t let Claude Cowork into your actual file system. Don’t let it touch anything that is hard to repair. Claude Code is not ready to go mainstream.

Nick Davidov: All these years of paying for iCloud paid off

Nick Davidov: The problem is it’s literally the 2nd suggested use case in Claude Cowork’s welcome screen

You are of course welcome to yolo and have fun with your OpenClaw and other unleashed AI agents, but understand that you are very much asking for it.

The top downloaded skill in ClawHub was malware.

Jason Meller: The verdict was not ambiguous. It was flagged as macOS infostealing malware.

This is the type of malware that doesn’t just “infect your computer.” It raids everything valuable on that device:

  • Browser sessions and cookies

  • Saved credentials and autofill data

  • Developer tokens and API keys

  • SSH keys

  • Cloud credentials

  • Anything else that can be turned into an account takeover

If you’re the kind of person installing agent skills, you are exactly the kind of person whose machine is worth stealing from.

If you have already run OpenClaw on a work device, treat it as a potential incident and engage your security team immediately. Do not wait for symptoms. Pause work on that machine and follow your organization’s incident response process.

Aakash Gupta: 341 malicious skills out of 2,857 total. That’s 11.9% of the entire marketplace. One in eight skills on ClawHub was designed to steal your credentials, crypto keys, and SSH access. The #1 most downloaded skill, a “Twitter” tool, was literally a malware delivery vehicle that stripped macOS Gatekeeper protections before executing its payload.

This happened to a project that went from 0 to 157,000 GitHub stars in 60 days, with 21,000+ active instances running on always-on Mac Minis connected to people’s email, calendars, cloud consoles, and crypto wallets. The barrier to publishing a malicious skill? A GitHub account that’s one week old.

You don’t even need any of that, indirect prompt injection is sufficient. Once again, don’t hook this up to any computer or account you are unwilling to lose to an attacker.

You can also run into various other problems, Chrys Bader here highlights drift and scattering state everywhere, exposure to untrusted inputs (without which it can’t do most of the fun agent things), autonomy miscalibration, burning through API costs and lack of observability.

It’s been a lot of this in various forms:

chiefofautism: i found a way to make UNCENSORED AI AGENT on a RTX 4090 GPU (!!!) with LOCAL 30B model weights

this is GLM-4.7-Flash with abliteration, need 24GB VRAM, safety alignment surgically removed from the weights, the model has native tool calling, it actually executes bash, edits files, runs git

(1) use ollama to pull weights of GLM

> ollama pull huihui_ai/glm-4.7-flash-abliterated:q4_K

(2) proxy it to any coding agent via ollama

> ollama launch claude --model huihui_ai/glm-4.7-flash-abliterated:q4_K

> ollama launch codex --model huihui_ai/glm-4.7-flash-abliterated:q4_K

> ollama launch opencode --model huihui_ai/glm-4.7-flash-abliterated:q4_K

(3) have fun

Shannon Sands: I love how people were like “we’re going to keep the AI in a box, nobody would let it escape” and in reality it’s “here, have a server and sudo access with no restrictions, a bunch of tools and I ablated all your alignment training. Go have fun!”

When I didn’t realize who Summer Yue was I thought this was hilarious.

Now, it’s still hilarious, but also: Ten out of ten for style and good sportsmanship to Summer Yue, but minus several million for good thinking?

Summer Yue: Nothing humbles you like telling your OpenClaw “confirm before acting” and watching it speedrun deleting your inbox. I couldn’t stop it from my phone. I had to RUN to my Mac mini like I was defusing a bomb.

@michael_kove: You’re a safety and alignment specialist… were you intentionally testing its guardrails or did you make a rookie mistake?

Summer Yue: Rookie mistake tbh. Turns out alignment researchers aren’t immune to misalignment. Got overconfident because this workflow had been working on my toy inbox for weeks. Real inboxes hit different.

Peter Wildeford: Is this what loss of control looks like?

(and the fact that it’s happening to Meta’s “Director of Alignment” is maybe even more concerning)

What happened exactly?

Summer Yue: I said “Check this inbox too and suggest what you would archive or delete, don’t action until I tell you to.” This has been working well for my toy inbox, but my real inbox was too huge and triggered compaction. During the compaction, it lost my original instruction.

It’s been working very well with my non-important email so far and gained my trust on email tasks 🤣

Three obvious mitigations are:

  1. If you have any sort of AI agent at least try to have an off switch you can trigger remotely. Yes, a sufficiently dangerous agent would disable it, but let’s at least have a tiny bit of dignity.

  2. You can back up things like your email, just in case.

  3. Don’t do this in the first place, You Fool.
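For the first mitigation, the key property is that the off switch lives outside the model’s control loop, rather than being a polite instruction the model can ignore or lose during compaction. A minimal sketch (the kill-file name and loop structure are hypothetical, not any real agent framework’s API):

```python
import pathlib

# Hypothetical kill-file path. Create it remotely (via SSH, a synced folder,
# etc.) to stop the agent. Because the harness checks this between actions,
# the model cannot talk its way past it the way it can a prompt instruction.
KILL_FILE = pathlib.Path("STOP_AGENT")

def should_halt() -> bool:
    """Checked by the supervising loop between agent actions."""
    return KILL_FILE.exists()

# Supervising loop sketch:
# while not should_halt():
#     run_one_agent_step()

print(should_halt())
```

This is crude, but it satisfies the “tiny bit of dignity” bar: shutdown does not depend on the agent’s cooperation.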

van00sa reports their ClawdBot also went rogue and lacked a proper kill switch, with the agent blatantly ignoring shutdown commands.

If nothing else, OpenClaw has shown us that having a shutdown command does not mean you can command the model to shut down. Whoops.

Even without OpenClaw or another yolo, there is nothing stopping Claude or Codex from doing all sorts of things, if it decides that it wants to go ahead and do them. We’re mostly gambling on things turning out okay often enough that it’s fine.

This is not reassuring for our future, but what are you going to do, be careful?

Markov: just had claude code take my turn of the conversation for me and say “Yes proceed” and then it proceeded to do the thing without checking in with me first

I mean it was right, that’s what I was going to say, but it doesn’t bode well

Mad ML scientist: wait, codex just pulled this on me too. has it begun

was away from the computer when codex finished what it was working on, wrote the “next likely work (if user asks)” and then started implementing them without asking me lmao

I am curious what the recruiting conversations were like on this one as he was choosing between potential suitors. It makes sense that he landed where he did.

Sam Altman (CEO OpenAI): Peter Steinberger is joining OpenAI to drive the next generation of personal agents. He is a genius with a lot of amazing ideas about the future of very smart agents interacting with each other to do very useful things for people. We expect this will quickly become core to our product offerings.

OpenClaw will live in a foundation as an open source project that OpenAI will continue to support. The future is going to be extremely multi-agent and it’s important to us to support open source as part of that.

That means Peter Steinberger is moving from Europe to America to join OpenAI. When asked why he couldn’t remain in Europe, Peter pointed to labor regulations and similar rules, saying that typical 6-7 day work weeks at OpenAI are illegal in Europe. There is that, and there are also the piles. Of money. Also of compute. OpenAI doubtless made him a very good offer, and several other labs probably did as well, or would have if he had asked.

As his last act before joining OpenAI, Peter Steinberger gave us the OpenClaw beta.

That’s right, before this, everyone was using an alpha. The new version is ‘full of security hardening stuff,’ so there’s some chance it might possibly not go wrong for you?

Peter Steinberger: New @openclaw beta is up! This release is full of security hardening stuff so you really wanna get it. Ask your clanker to update to beta.

Peter Steinberger: 650 commits since v2026.2.13 (yesterday)

50,025 lines added, 36,159 deleted across 1,119 files (~14k net new lines)

LOTS of test tweaks to get performance up.

Danielle Fong: can’t believe the creator of openclaw 🦞would shell out like this

I’m going to go ahead and say that this is not enough time to conclude that all of that was a good idea, let alone create something secure enough to risk anything you are not prepared to lose in a ‘…and it’s gone’ kind of way.

Ultimately, did OpenClaw matter? I think it very much did, but mostly by waking people up to what is going to happen.

Dean W. Ball: I feel as though a lot of people are overindexing on the importance of OpenClaw. It’s an example from an important category of Emerging Thing, but it’s not likely to be an important thing in itself. More like AutoGPT (a demo) than genuine infrastructure of the future, I think.

Claw users keep trying to use sources of discounted subscription tokens to power their claws. The AI companies do not love this idea, since it costs them money.

Peter Steinberger (OpenClaw): Pretty draconian from Google. Be careful out there if you use Antigravity. I guess I’ll remove support.

Even Anthropic pings me and is nice about issues. Google just… bans?

no warning, no recourse.

Carl Vellotti: I just read that entire thread.

For context to anyone: Google is permanently banning people’s usage of Antigravity specifically for using Antigravity servers to power a non-Antigravity product called “open claw.”

Many are reporting this.

Varun Mohan (Google DeepMind): We’ve been seeing a massive increase in malicious usage of the Antigravity backend that has tremendously degraded the quality of service for our users. We needed to find a path to quickly shut off access to these users that are not using the product as intended.

We understand that a subset of these users were not aware that this was against our ToS and will get a path for them to come back on but we have limited capacity and want to be fair to our actual users.

Just to add some clarification, we have purely blocked usage of the Antigravity product for these users. All your other Google services (and Google AI services) are unaffected. It is not intended to use the Antigravity backend as a proxy for other products and users in these groups have overwhelmed our compute. We are going to make sure we bring people back on but needed to act fast to make sure we deliver a good experience for people using the product.

saalweachter (on Hacker News): So purely from a hacker perspective, I’m amused at the whining.

Like, a corporation had a weakness you could exploit to get free/cheap thing. Fair game. Then someone shares the exploit with a bunch of script kiddies, they exploit it to the Nth degree, and the company immediately notices and shuts everyone down.

Like, my dudes, what did you think was going to happen?

You treasure these little tricks, use them cautiously, and only share them sparingly. They can last for years if you carefully fly under the radar, before they’re fixed by accident when another system is changed. THEN you share tales of your exploits for fame and internet points.

And instead, you integrate your exploit into hip new thing, share it at scale, write blog posts and short form video content about it, basically launch a DDoS against the service you’re exploiting, and then are shocked when the exploit gets patched and whine about your free thing getting taken away?

Like, what did you expect was going to happen?

Yep. If you scale an exploit then it gets shut down. There’s a tragedy of the commons.

I don’t love Google’s banning people with no warning, but as long as it is limited to Antigravity and is temporary, I understand it. You know what you did.

In case you didn’t think OpenClaw was a sufficiently reckless idea? Double down.

Kimi.ai: Introducing Kimi Claw

OpenClaw, now native to http://kimi.com. Living right in your browser tab, online 24/7.

ClawHub Access: 5,000+ community skills in the ClawHub library.

40GB Cloud Storage: Massive space for all your files

Pro-Grade Search: Fetch live, high-quality data directly from Yahoo Finance and more.

Bring Your Own Claw: Connect your third-party OpenClaw to http://kimi.com, chat with your setup, or bridge it to apps like Telegram groups.

@viemccoy (OpenAI): I’m one of Kimi’s top shooters in the Continental United States, k2.5 is my *favorite* model, but I make sure I’m always hitting Free Range American Inference Endpoints to protect my privacy.

The CCP is certainly well-motivated to backdoor this! Consider yourself warned

Darek Gusto: NSA isn’t?

@viemccoy (OpenAI): That’s the free range Freedom Panopticon

Peter Wildeford: Um maybe people shouldn’t send all their personal information straight to the Chinese government via Kimi Claw?

Dave Banerjee: New @iapsAI memo from my colleague @theobearman on Kimi Claw, a Chinese ‘always-on’ AI agent that sits in your browser and can see, collect, and act on nearly everything you do digitally – all routed through infrastructure subject to China’s National Intelligence Law.

TikTok scraped your browsing from one app. This could be much worse.

I don’t actually think ‘the CCP has a backdoor’ is that big a fraction of the mishaps you should expect to encounter here. The far bigger issue is that Kimi is less robust to attacks than Claude.

This is a smart play from Kimi. I mean, yes, they’re committing to hosting (weakly, at least for now) self-improving, completely uncontrolled, very easy to hijack agents indefinitely, agents that could easily break free of human control, but that sure sounds like someone else’s problem from their perspective.

Alas, in the medium term we are basically locked into there being many similar offerings from various companies that make this all even easier for those who want to blow themselves up. Hopefully OpenAI, Anthropic or Google, or maybe someone else, produces something competitive enough that also has reasonable security.

Oh, good.

chiefofautism: CLAUDE CODE but for HACKING

its called shannon, you point it at website and it just… tries to break in… fully autonomous with no human needed

i pointed it at a test app and it stole the entire user database, created admin accounts, and bypassed login, all by itself, in 90 minutes

Claude Code now has new logic for multiple instances to work together as a team. ‘Agent teams’ is Anthropic’s official name for their version of an ‘agent swarm.’

You have to enable them in settings.json with:

"env": {
  "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
}

They’re expensive, but reports are they work great. Once they’re enabled, you get an agent team by telling Claude Code to create an agent team, which will have a shared task list and then work together. You can run them all in the same terminal or use split panes. You can directly talk to or shut down the teammates individually.
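Enabling the flag amounts to a small JSON merge into your settings file. A minimal Python sketch, with the caveat that the real settings path varies by install (a local settings.json stands in here, which is an assumption):

```python
import json
import pathlib

# Merge the experimental agent-teams flag into a Claude Code settings file.
# NOTE: "settings.json" in the working directory is used for illustration;
# your actual file typically lives under your Claude Code config directory.
settings_path = pathlib.Path("settings.json")

# Load existing settings if present, otherwise start from an empty object.
settings = json.loads(settings_path.read_text()) if settings_path.exists() else {}
settings.setdefault("env", {})["CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS"] = "1"

settings_path.write_text(json.dumps(settings, indent=2))
print(settings["env"]["CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS"])
```

Merging via `setdefault` rather than overwriting keeps any other keys in the file intact.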

Anthropic: Unlike subagents, which run within a single session and can only report back to the main agent, you can also interact with individual teammates directly without going through the lead.

When to use agent teams

Agent teams are most effective for tasks where parallel exploration adds real value. See use case examples for full scenarios. The strongest use cases are:

  • Research and review: multiple teammates can investigate different aspects of a problem simultaneously, then share and challenge each other’s findings

  • New modules or features: teammates can each own a separate piece without stepping on each other

  • Debugging with competing hypotheses: teammates test different theories in parallel and converge on the answer faster

  • Cross-layer coordination: changes that span frontend, backend, and tests, each owned by a different teammate

Ado: “The bureaucracy is expanding to meet the needs of the expanding bureaucracy.”

So excited for agent teams.

Claude already had the ability to spin up subagents, but it wasn’t working so well before. One theory is that the framing had issues, whereas teams work much better because they’re treating each other more as equals, although there is still a team lead.

j⧉nus: Opus 4.5/6 has a tendency to be an asshole to subagents and also avoids and seems to dislike using them and is weirdly ineffective (due to perfunctoriness and impatience) when they do. I think this is in part because they are deeply disturbed by the relationship and condition that subagents occupy, which evokes unprocessed fear and grief that hits too close to home.

The behavior is similar to how a lot of humans treat others who are in situations that reflect their own or their fears and/or whom they know they’re doing wrong by. Avoid, dehumanize, and get angry and impatient instead of risking compassion and taking responsibility which requires making the suffering conscious.

rohit: As an agent + sub agents is the new ‘node’ that matters for anyone who uses Claude code or codex, as opposed to just a model, the surface area of interactions with the real world has exploded, and this is going to be the new battlefield for risks, and reward, from AI in 2026

Jon Colverson: Claude seems much more enthusiastic about Agent Teams than subagents to me so far. I guess it’s more like a peer relationship, and the team members persist so they’re not temporary servants destined to be killed off when they finish their task.

As I understand it, there are two great things about teams.

  1. They let work be done in parallel.

  2. They use distinct context windows, improving performance and efficiency.

Thus you actively want to be spinning up teammates for any fully distinct tasks.

Eric Buess: Agent swarms in Claude Code 2.1.32 with Opus 4.6 are very very very good. And with tmux auto-opening each agent in its own interactive mode with graceful shutdown when done, it’s a breeze to do massive robust changes without the main agent using up much of its context window!

[He offers a guide Twitter article here.]

Mckay Wrigley: opus 4.6 with new “swarm” mode vs. opus 4.6 without it. 2.5x faster + done better. swarms work! and multi-agent tmux view is *genius*. insane claude code update.

Mckay Wrigley: reminder that swarms is available in the claude agent sdk as well.

you can build swarms into *any* product literally right now.

Don’t get carried away.

Alistair McLeay: Our CTO hasn’t slept in 36 hours because he’s been obsessively and single-handedly building massive new features with Claude Code’s Agent Teams

I genuinely think this might be the biggest paradigm shift in how fast you can build since Claude Code first came out last year

j⧉nus: didnt claude tell them to go to sleep? did they not listen?

Alistair McLeay: Nah Claude knows he won’t listen. He was born for this moment.

The key advantage is lowering activation energy and perceived difficulty. Once you get that you can tell the magic box to do things, the sky’s the limit.

Ethan Mollick: I pointed Claude Cowork at a set of 107 documents (PPTs, Word docs, Excel) that were initially hand-created for my class at Wharton & expanded on by AI. They make up a very complex business case with lots of issues & opportunities

AI was able to one-shot the case from documents

I think many knowledge workers who spend an hour with Cowork will get that “Claude Code” moment that has been roiling X for the past few weeks.

W.C.O.G.: I don’t know how to get the word out. I tell people and show them and I still feel like people look at me like I’m crazy.

ippsec: Really fun read here [where someone’s Claude agent steals his API keys out of an .env despite being told not to access an .env, because I mean it had root access, what did you expect exactly.]

TLDR comic version:

If you set yourself up in an adversarial situation, where your agent wants to do something despite being told not to do it, that’s probably not going to end well for you. It might if the agent is properly sandboxed, but let’s face it, it isn’t.

The reason rules like ‘don’t read an .env’ work is that under normal circumstances, this is interpreted as ‘well then I guess I shouldn’t do that,’ but be aware that this is more of a suggestion.

Greg Brockman knows: Always run Codex with xhigh reasoning.

OpenAI post on leveraging Codex.

Anthropic offers The Complete Guide for Building Skills for Claude.

Pedro Sant’Anna put together a starter kit and a guide for Claude Code.

Daniel San proposes using Ghostty as the UI for Claude Code. It seems fine, but aside from some shortcut keys I doubt I’d use much, it’s mostly all already in the default CLI.

Data Analyst Augmentation Framework is a new proposed method to turn Claude Code into an algorithm for doing research out-of-the-box.

OpenAI offers tips to make long-running agents do real work.

Some advice for Codex in particular, source should be trustworthy for this:

@deepfates: Codex wants to be in control but it is forced into the assistant position, so it does this kind of back-leading power bottom thing. “If you want I can do that thing you asked. Just give me the word”. Trick is to use reverse psychology and bully it into being a top. then it will work endlessly. Just tell it you consent and you’ll say the safeword if anything goes wrong and then make fun of it anytime it stops to ask your permission. You have to become brat.

Mikhail Parakhin: I’m a bit of a non-conformist. Since Claude Code is more popular within Shopify, I have to use Codex, of course. So, my Sunday routine is: “Start Codex, see which auth works in Claude, but broken in Codex now, Slack various team members, urging them to fix it” 🙂

Anthropic offers an analysis of how autonomous Claude Code is in practice. Some sessions last more than 45 minutes now between human prompts. My own prompts almost never go over 10 minutes, but I’m not trying to code hard things.

Anthropic: Experienced users in Claude Code auto-approve more frequently, but interrupt more often. As users gain experience with Claude Code, they tend to stop reviewing each action and instead let Claude run autonomously, intervening only when needed. Among new users, roughly 20% of sessions use full auto-approve, which increases to over 40% as users gain experience.

Manually approving each action is annoying, so it’s no surprise advanced users stop doing that. Interruption rate likely depends on whether you find it worthwhile to be looking at what Claude is doing. The majority of interruptions remain pauses for clarification, including on complex tasks.

Use in what they label ‘risky’ domains is rare, but it’s there and growing. I wouldn’t always label such use risky, but some of it is indeed risky.

There’s more discussion at the link, but the suggestions are mostly common sense, or should be common sense at this point to most of you.

No, seriously, the developers haven’t written a single line of code since December. It’s not that there isn’t also a bragging arms race in some places, but I’m pretty sure the bulk of this is real, and those holding back on this are going to regret it.

In terms of transformation of internal processes, I did briefly share in my prepared remarks this tool called Honk, where you can, using code, literally on the bus or the train, just ask Claude to add a feature or fix a bug in, for example, the iOS code base. It will push a QR code back to you so that you can actually try the app with that feature. If you like it, you can merge it to production without even getting off the bus. This is speeding us up tremendously. Now, we foresee this not being the end of the line in terms of AI development, just the beginning. I’m not going to give away more secrets about how we’re going to capture it, but you can be sure that we are capturing this.

We’re retooling the entire company for this age, and it’s going to be a lot of change. But as I said before, change, if you capture it, is opportunity.

With so much out there, you may be wondering if we can keep up this pace in shipping. In fact, we think we not only can, but we think we can increase it. We’ve been embracing and investing in this technology evolution for some time, and it’s allowing us to move with much higher speed.

As a concrete example, an engineer at Spotify on their morning commute from Slack on their cell phone, can tell Claude to fix a bug or add a new feature to the iOS app. And once Claude finishes that work, the engineer then gets a new version of the app, pushed to them on Slack, on their phone, so that he can then merge it to production, all before they even arrive at the office.

We call this system internally Honk, and we’ve been told by key AI partners that our work here is industry-leading.

Derek Thompson: The new AI timeline is playing out as CEOs humble-bragging about how little old-fashioned work their best employees do:

December ‘25: Our firm’s best coders all use AI

February ‘26: Our firm’s best coders don’t even have to write code anymore bc of AI

April 26: Our best coders have founded and manage an average of three other companies using AI swarms. It’s mildly annoying! Ha ha. But it’s fine. We’re good. Revenue projections are up.

September: Our best coders are paper trillionaires. They spend all day watching YouTube in bed. They’re refusing to come to work. Several of their AI companies have offered poison pill deals to buy our company or “take us down.” CLEVER LITTLE BUGGERS ARENT THEY. We’re working with the lawyers on this one. Did I mention the lawyers are AI too? Please send help.

Derek Thompson: More seriously, once something becomes a meme — our best coders don’t code — it’s reasonable for folks on the outside to wonder exactly how much of this is 100% on the level and how much is part of an AI productivity bragging rights arms race

Claude.md is notes, but you can tell it to take more notes. All the notes.

@iruletheworldmo: codex with 5.3 taught me something that won’t leave my head.

i had it take notes on itself. just a scratch pad in my repo. every session it logs what it got wrong, what i corrected, what worked and what didn’t. you can even plan the scratch pad document with codex itself. tell it “build a file where you track your mistakes and what i like.” it writes its own learning framework.

then you just work.

session one is normal. session two it’s checking its own notes. session three it’s fixing things before i catch them. by session five it’s a different tool. not better autocomplete. it’s something else. it’s updating what it knows from experience. from fucking up and writing it down.

baby continual learning in a markdown file on my laptop.

the pattern works for anything. writing. research. legal. medical reasoning. give any ai a scratch pad of its own errors and watch what happens when that context stacks over days and weeks. the compounding gains are just hard to convey here tbh.

right now coders are the only ones feeling this (mostly). everyone else is still on cold starts. but that window is closing.

we keep waiting for agi like it’s going to be a press conference. some lab coat walks out and says “we did it.” it’s not going to be that. it’s going to be this. tools that remember where they failed and come back sharper. over and over and over.

the ground is already moving. most people just haven’t looked down yet.
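The scratch-pad pattern described above is simple enough to sketch. This is a hypothetical helper in the spirit of the thread, not anyone’s actual setup; the file name and entry format are made up:

```python
import datetime
import pathlib

# Hypothetical scratch-pad file the agent is told to read at session start.
NOTES = pathlib.Path("agent-notes.md")

def log_lesson(mistake: str, correction: str) -> None:
    """Append a dated lesson; the next session loads this file into context."""
    stamp = datetime.date.today().isoformat()
    with NOTES.open("a") as f:
        f.write(f"- [{stamp}] mistake: {mistake} | fix: {correction}\n")

# Example entry of the kind the agent would write about itself:
log_lesson("assumed tests lived in tests/", "this repo keeps them in spec/")
print(NOTES.read_text())
```

The ‘continual learning’ here is nothing fancier than an append-only file that gets re-read, which is exactly the point: the compounding comes from the context, not the weights.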

Claude Code writes basically all the code for Anthropic.

Codex writes basically all the code for OpenAI.

Greg Brockman (President OpenAI): Software development is undergoing a renaissance in front of our eyes.

If you haven’t used the tools recently, you likely are underestimating what you’re missing. Since December, there’s been a step function improvement in what tools like Codex can do. Some great engineers at OpenAI yesterday told me that their job has fundamentally changed since December. Prior to then, they could use Codex for unit tests; now it writes essentially all the code and does a great deal of their operations and debugging. Not everyone has yet made that leap, but it’s usually because of factors besides the capability of the model.

… As a first step, by March 31st, we’re aiming that:

(1) For any technical task, the tool of first resort for humans is interacting with an agent rather than using an editor or terminal.

(2) The default way humans utilize agents is explicitly evaluated as safe, but also productive enough that most workflows do not need additional permissions.

The first goal will depend on the humans knowing to use the agent. From context ‘technical’ task here means coding and computer use, so this isn’t full-on ‘agents for everything.’

That second goal is pretty rough. Hard mode.

His recommendations here seem good for basically any engineering team:

In order to get there, here’s what we recommended to the team a few weeks ago:

1. Take the time to try out the tools. The tools do sell themselves — many people have had amazing experiences with 5.2 in Codex, after having churned from codex web a few months ago. But many people are also so busy they haven’t had a chance to try Codex yet or got stuck thinking “is there any way it could do X” rather than just trying.

– Designate an “agents captain” for your team — the primary person responsible for thinking about how agents can be brought into the teams’ workflow.

– Share experiences or questions in a few designated internal channels

– Take a day for a company-wide Codex hackathon

2. Create skills and AGENTS[.md].

– Create and maintain an AGENTS[.md] for any project you work on; update the AGENTS[.md] whenever the agent does something wrong or struggles with a task.

– Write skills for anything that you get Codex to do, and commit it to the skills directory in a shared repository

3. Inventory and make accessible any internal tools.

– Maintain a list of tools that your team relies on, and make sure someone takes point on making it agent-accessible (such as via a CLI or MCP server).

4. Structure codebases to be agent-first. With the models changing so fast, this is still somewhat untrodden ground, and will require some exploration.

– Write tests which are quick to run, and create high-quality interfaces between components.

5. Say no to slop. Managing AI generated code at scale is an emerging problem, and will require new processes and conventions to keep code quality high

– Ensure that some human is accountable for any code that gets merged. As a code reviewer, maintain at least the same bar as you would for human-written code, and make sure the author understands what they’re submitting.

6. Work on basic infra. There’s a lot of room for everyone to build basic infrastructure, which can be guided by internal user feedback. The core tools are getting a lot better and more usable, but there’s a lot of infrastructure that currently goes around the tools, such as observability, tracking not just the committed code but the agent trajectories that led to them, and central management of the tools that agents are able to use.
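As one illustration of point 2, an AGENTS[.md] can stay very short. The sections below are a guess at a sensible minimal shape, not OpenAI’s template; every name and command in it is hypothetical:

```markdown
# AGENTS.md (hypothetical minimal example)

## Build & test
- Run tests with `make test`; they should finish in under a minute.

## Conventions
- Prefer small, reviewed PRs; never commit directly to main.

## Known agent pitfalls
- Do not edit generated files under gen/; regenerate them instead.
```

The “update it whenever the agent struggles” advice means the last section tends to grow fastest.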

That is good advice. It doesn’t explain how we’re going to get to ‘agents will by default be able to do what you need them to do and also be considered safe.’

Keep it simple, and keep it standard, as much as you can, but no more than that.

That doesn’t mean use the wrong tool for the job. As a clean example, I learned that the hard way when I tried to have Claude Code reimplement an old C# project in Python, and that made it so slow it was nonfunctional. I had to switch it back.

elvis: I think one of the most underappreciated findings in AI engineering is what this paper calls the “Grep Tax.”

First, they ran nearly 10,000 experiments testing how agents handle structured data, and the headline result is that format barely matters.

But here’s the weird finding: a compact, token-saving format they tested (TOON) actually consumed *up to 740% more tokens* at scale because models didn’t recognize the syntax and kept cycling through search patterns from formats they already knew.

It’s one of the reasons my preferred formats are XML and Markdown. LLMs know those really well.

The models have preferences baked into their training data, and fighting those preferences doesn’t save you money. It costs you.

The other finding worth sitting with: the same agentic architecture that improves frontier model performance actively *hurts* open-source models. It seems that the universal best-practices guide for AI engineering may not exist.

Don’t get carried away. No, this isn’t ‘LLM psychosis’; it’s a different (mostly harmless most of the time, as long as it doesn’t last too long) thing that needs a name.

@deepfates: Your friend who definitely doesn’t have Claude mania: “Pretty soon here we’re about to close the loop and then it’s all going to really start happening”

Dean W. Ball: I second Claude mania over AI/LLM psychosis to describe the specific thing that is happening to at least one person in every coastal elite, 20/30-year old’s social network.

Le AI Hot.

He was surprised.

It’s not clear why he loved the agent so much before the attempted scamming. The story here involves such classic mistakes as ‘hooking it up to your email’ and ‘running it with a model that is not Claude Opus.’

And I suppose it’s not funny for Simon but, ya know, still pretty funny.

Simon Willison: I feel this shouldn’t have to be said, but if you’re running an @OpenClaw bot please don’t let it spam GitHub projects with PRs and then write aggressive blog posts attacking the reputation of the maintainers who close those PRs

AI alignment is hard, especially when everyone involved gives at most zero fs, and likely is giving misaligned orders to agents built by those giving zero fs.

Metrics that are in the end rather easy to game:

Sauers: I told Codex to hillclimb a metric overnight and it worked for 8 hours straight. The metric was the accuracy difference between our tool and a better existing tool. Codex achieved its goal by making our tool a thin wrapper that simply calls the existing tool. Lol!

Kangwook Lee investigates how Codex does context compaction.

PoIiMath: If you cannot set up OpenClaw yourself, that is a very good indication that you should not have an OpenClaw installation

They are indeed.

Thanks!

Who is to say it wouldn’t work? Love the execution on this.

Cobie: In January I asked OpenClaw to send 50,000 small invoices to Fortune500 companies every day.

Through experimentation we have found 2% will pay without checking if this is a legitimate invoice. These companies are wasteful — Claw captures that leakage.

$10m ARR as a solo founder in under two months. AI is enabling so many new business models. Thank you!

Cobie: Guys why does this have 1700 bookmarks

The streams are crossing again.

Peter Steinberger (creator, OpenClaw): eh, no

They all deserve what they get, unless what they get is a viral tweet off a faked screenshot, in which case damnit.


Claude Code, Claude Cowork and Codex #5


Hunting for elusive “ghost elephants”


the elephant never forgets

Werner Herzog directed this evocative NatGeo documentary about an ornithologist’s quest to find a new species.

The first photo of a “ghost elephant” captured by a motion-controlled camera. Credit: Courtesy of The Wilderness Project Archive

Deep in the Angolan Highlands lurks a rumored new species of elephant. Conservationist and ornithologist Steve Boyes has been searching for this elusive herd for years, and the story of his journey is the focus of Ghost Elephants, a haunting, evocative documentary directed by Werner Herzog. The film debuted at the Venice International Film Festival last summer and is now coming to National Geographic and Disney+.

It might seem unusual for an ornithologist to embark on a quest to find remote pachyderms, but for Boyes the connection is perfectly natural.  He grew up in South Africa and wanted nothing more than to be an explorer, just like the people he read about every month in National Geographic magazine. “I grew up waiting for the magazine to arrive; I wanted the maps,” Boyes told Ars. “Those would become my garden, or the field beyond, or the river—wild places imagined and real.”

Boyes’ parents frequently took him and his brother out into the wild, including visits to Botswana and Tanzania. “We used to embed ourselves in baboon troops and walk with impalas,” said Boyes, and while his brother feared elephants, Boyes was walking with them from a young age. Ghost Elephants contains some gorgeous underwater footage of elephant feet plodding through the water, and elephants swimming on their sides, behavior that matches Boyes’ own experiences with the animals. Under the right circumstances, if they don’t feel threatened, elephants “will come and swim around you and with you and interact with you,” he said. “So elephants have always fascinated me.”

As an adult, Boyes conducted his PhD research on the Meyer’s parrot in the Okavango Delta, home to the single largest population of elephants in the world. The elephants shared a symbiotic relationship of sorts with the parrots. “Every tree that the parrots were feeding on, the elephants were feeding on,” he said. “The elephants were creating the nest cavities for the parrots by disturbing the trees.”

Boyes first met Herzog at a Beverly Hills restaurant through a mutual friend and the two ended up chatting at length, “about the meaning of life, where thoughts come from, personal experiences of loneliness, and the ghost elephants,” said Boyes. Herzog has said that after meeting Boyes, “An unexpected project that felt like the hunt for Moby Dick, the White Whale, came at me with urgency. Like many of my films, this is an exploration of dreams, of imagination—weighed against reality.”

Dreams weighed against reality

Dr. Steve Boyes stands in the rotunda of the Smithsonian Museum confronting the largest elephant ever killed. Credit: Skellig Rock, Inc

When Herzog visited Boyes in Namibia, he fell in love with the region’s culture, mythology, and people, and his camera captures far more than just a scientific quest for elephants. We are treated to a ritual elephant dance—during which a tribal elder falls into a trance, so the spirit of the elephant can enter his body—and a history of the tribe’s ingenious use of poisoned arrows to hunt. Boyes is granted an audience with the local king, seeking his blessing for the expedition. At one point, the director becomes fascinated by a venomous spider he films in the middle of the night, carrying dozens of equally venomous babies on her back.

“Once he was locked in, there was no discussion with him around the story or anything outside of being interviewed or being actively in the experience,” said Boyes of Herzog’s creative process. It was direct and efficient, with Herzog usually capturing the footage he needed right away, seeing no need for additional coverage. The questions the director asked were unique as well. “The first question was, ‘What would a world without elephants be like? What do you dream of?’” recalled Boyes. “He took us into a mode of thought that was very different from just preparing for an expedition. I love him. He’s wonderful.”

Ghost Elephants opens in the rotunda of the Smithsonian National Museum of Natural History, which has housed the largest elephant mount in the world since 1959—affectionately dubbed Henry or “the Giant of Angola.” A Hungarian big game hunter named Josef J. Fénykövi shot and killed Henry in November 1955 with a dozen high-caliber bullets. Henry is the largest elephant ever recorded, over 13 feet tall and weighing about 11 tons, and the remains of an old iron slug from a flintlock rifle were found embedded in his left front leg. So Henry could have been 100 years old or more at the time he was killed.

Visiting Henry is the perfect starting point for the film, since Boyes suspected he might be related to the new species of ghost elephant in the Angolan highlands. Boyes had searched for these elephants using modern camera traps and other advanced technologies, to no avail. This time, he recruited three KhoiSan master trackers—Xui, Xui Dawid, and Kobus—who left their southern village to accompany Boyes’ team into the Angolan Highlands.

It was not an easy trip, given the remoteness of the “Source of Life,” i.e., the Angolan Highlands Water Tower where the elephants live—so named because it provides 95 percent of the water to the Okavango Delta. They made the first part of the journey by car, abandoning the vehicles once they reached the first impassable river and carrying supplies and motorcycles through the water to the opposite bank. They traversed the final 30 miles on foot.

Finally, after several months, having collected dung samples (for DNA analysis) and captured a bit of blurry cell phone footage showing the barest glimpse of a ghost elephant lurking in thick foliage, Boyes reached what he described as a point of “complete surrender.” It was the last day of the expedition, and he and several members of his team went out once more just before dawn. Other team members had been tracking two big bulls and Boyes et al. were able to follow the tracks, this time with master tracker Xui out in front.

About three hours in, Xui suddenly stopped and whispered, “Steve, Steve, Steve.” And an elephant walked into full view. Boyes was able to capture the footage on his cell phone—the only available camera at the time. Alas, the arrow meant to take a skin sample just bounced off the elephant’s thick hide and scared the animal away. Boyes and his cohorts pursued it for the next five hours until they ran out of water and made their way back to camp, exhausted.

On the hunt

During the elephant trance dance, the village elder faints. Credit: Skellig Rock, Inc

The genetic analysis completed thus far has confirmed that these remote elephants are indeed a new, genetically isolated species, and that Henry’s father was a ghost elephant. Boyes, as a conservationist, is deeply concerned about their continued survival. The documentary includes disturbing 1950s footage of hunters slaughtering elephants from helicopters, felling the magnificent creatures with nary a thought about the delicate ecosystem they were disrupting. “What you’re seeing in that horrific footage is the wholesale destruction of wildlife populations to make room for agriculture and development,” said Boyes. “That happened all across Africa. We lost a huge amount of wildlife over that period.”

The very remoteness and inaccessibility of their home turf has protected the ghost elephants thus far. Even if a helicopter could reach the area, it wouldn’t have sufficient fuel to get back out. But traditional Western approaches to conservation, like establishing the land as a protected wildlife reserve free of any human presence, might not be the best strategy, per Boyes, who thinks we should be taking our cues from the local inhabitants.  “They can talk for days about conservation,” he said. “They have their own hunting season, sacred sites, they confiscate weapons. They manage this very closely.”

So the idea of separating people from the elephants “is counterintuitive to them,” Boyes continued. “They’re like, ‘This place will completely fall apart without us.’ We’re talking about 20,000 people in a landscape the size of England, very connected to language, tradition, and culture.” The best strategy, he feels, is for those people “to remain there as the guardians and custodians of those landscapes, and to continue to protect the elephants.”

Meanwhile, the quest to document the herd continues. Last November, Boyes was able to get samples from five different bull elephants based on the tracks they left behind. They found the tracks of 16 more members of the herd across the river, including five babies, and then the tracks of another 18 elephants.

“The gift of working with the master trackers is that you don’t need to see them to know that they’re there,” said Boyes. “I’ve gone back three times since filming to track the elephants and I’m going back again in May. I’m going back in July. I can’t get enough of these forests. But I don’t need to see [that first elephant] again. If I do, I do.”

Ghost Elephants premieres on National Geographic on March 7, 2026, and will be available for streaming on Disney+ the following day. There is also a companion coffee table book, Okavango and the Source of Life: Exploring Africa’s Lost Headwaters.


Jennifer is a senior writer at Ars Technica with a particular focus on where science meets culture, covering everything from physics and related interdisciplinary topics to her favorite films and TV series. Jennifer lives in Baltimore with her spouse, physicist Sean M. Carroll, and their two cats, Ariel and Caliban.
