Author name: Tim Belzer


Tuesday Telescope: A new champion enters the ring

Welcome to the Tuesday Telescope. There is a little too much darkness in this world and not enough light—a little too much pseudoscience and not enough science. We’ll let other publications offer you a daily horoscope. At Ars Technica, we’ll take a different route, finding inspiration from very real images of a universe that is filled with stars and wonder.

After a decade of construction, a large new reflecting telescope publicly released its first images on Monday, and they are nothing short of spectacular.

The Vera C. Rubin Observatory’s primary mirror is 8.4 meters in diameter, which makes it one of the largest optical telescopes in the world. However, the real secret sauce of the telescope is its camera—the automobile-sized Legacy Survey of Space and Time camera—which has a resolution of 3,200 megapixels. Which is rather a lot.

The observatory is on a remote 2,682-meter-high (8,799 ft) mountain in northern Chile, a region of the planet with some of the best atmospheric “seeing” conditions.

The main goal of the telescope is to scan the entire Southern Hemisphere sky by taking 1,000 high-definition photographs every three nights for the next 10 years. The idea is that, assembled end to end, these images will provide a high-definition, four-dimensional film of the Universe changing over a decade. It will seek to encompass everything from nearby asteroids and comets to distant supernovae.

Who was Vera Rubin? She was an American astronomer who was the first person to establish the presence of dark matter in galaxies. The observatory named in her honor was funded by the US Department of Energy and the US National Science Foundation. International partners, including the French National Centre for Scientific Research, will help to store the 20 terabytes of data collected every night.

The only bummer about Monday’s announcement concerns the observatory’s funders, the Department of Energy and the National Science Foundation. The Trump administration has sought to halve the science budgets of both agencies in the coming years. And the prospect of losing that funding, juxtaposed against the phenomenal start of the Vera C. Rubin Observatory, reminds us of what we stand to lose if we slash basic science funding in this country.

Source: Vera C. Rubin Observatory

Do you want to submit a photo for the Daily Telescope? Reach out and say hello.



How a grad student got LHC data to play nice with quantum interference


New approach is already having an impact on the experiment’s plans for future work.

The ATLAS particle detector of the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) in Geneva, Switzerland. Credit: EThamPhoto/Getty Images

Measurements at the Large Hadron Collider have been stymied by one of the most central phenomena of the quantum world. But now, a young researcher has championed a new method to solve the problem using deep neural networks.

The Large Hadron Collider is one of the biggest experiments in history, but it’s also one of the hardest to interpret. Unlike seeing an image of a star in a telescope, saying anything at all about the data that comes out of the LHC requires careful statistical modeling.

“If you gave me a theory [that] the Higgs boson is this way or that way, I think people imagine, ‘Hey, you built the experiment, you should be able to tell me what you’re going to see under various hypotheses!’” said Daniel Whiteson, a professor at the University of California, Irvine. “But we don’t.”

One challenge with interpreting LHC data is interference, a core implication of quantum mechanics. Interference allows two possible events to inhibit each other, weakening the likelihood of seeing the result of either. In the presence of interference, physicists needed to use a fuzzier statistical method to analyze data, losing the data’s full power and increasing its uncertainty.

However, a recent breakthrough suggests a different way to tackle the problem. The ATLAS collaboration, one of two groups studying proton collisions at the LHC, released two papers last December that describe new ways of exploring data from their detector. One describes how to use a machine learning technique called Neural Simulation-Based Inference to maximize the potential of particle physics data. The other demonstrates its effectiveness with the ultimate test: re-doing a previous analysis with the new technique and seeing dramatic improvement.

The papers are the culmination of a young researcher’s six-year quest to convince the collaboration of the value of the new technique. Its success is already having an impact on the experiment’s plans for future work.

Making sense out of fusing bosons

Each particle collision at the LHC involves many possible pathways in which different particles combine to give rise to the spray of debris that experimenters see. In 2017, David Rousseau at IJCLab in Orsay, a member of the ATLAS collaboration, asked one of his students, Aishik Ghosh, to improve his team’s ability to detect a specific pathway. That particular pathway is quite important since it’s used to measure properties of the Higgs boson, a particle (first measured in 2012) that helps explain the mass of all other fundamental particles.

It was a pretty big ask. “When a grad student gets started in ATLAS, they’re a tiny cog in a giant, well-oiled machine of 3,500 physicists, who all seem to know exactly what they’re doing,” said Ghosh.

The pathway Ghosh was asked to study occurs via several steps. First, the two colliding protons each emit a W boson, a particle associated with the weak nuclear force. These two bosons fuse together, changing their identity to form a Higgs boson. The Higgs boson then decays, forming a pair of Z bosons, another particle associated with the weak force. Finally, those Z bosons themselves each decay into a lepton, like an electron, and its antimatter partner, like a positron.

A Feynman diagram for the pathway studied by Aishik Ghosh. Credit: ATLAS

Measurements like the one Ghosh was studying are a key way of investigating the properties of the Higgs boson. By precisely measuring how long it takes the Higgs boson to decay, physicists could find evidence of it interacting with new, undiscovered particles that are too massive for the LHC to produce directly.

Ghosh started on the project, hoping to find a small improvement in the collaboration’s well-tested methods. Instead, he noticed a larger issue. The goal he was given, of detecting a single pathway by itself, didn’t actually make sense.

“I was doing that and I realized, ‘What am I doing?’ There’s no clear objective,” said Ghosh.

The problem was quantum interference.

How quantum histories interfere

One of the most famous demonstrations of the mysterious nature of quantum mechanics is called the double-slit experiment. In this demonstration, electrons are shot through a screen with two slits that allow them to pass through to a photographic plate on the other side. With one slit covered, the electrons form a pattern centered on the opening: the photographic plate is brightest directly across from the slit and dims farther away from it.

With both slits open, you would expect the pattern to simply get brighter as more electrons reach the photographic plate. Instead, something stranger happens. The two slits do not give rise to two nice bright peaks; you see a rippling pattern in which some areas get brighter while others get dimmer, even though the dimmer areas should, in principle, be easier for electrons to reach.

The effect happens even if the electrons are shot at the screen one by one to stop them from influencing each other directly. It’s as if each electron carries with it two possible histories, one in which it goes through one slit and another where it goes through the other before both end up at the same place. These two histories interfere with each other so that some destinations become less likely instead of more likely.

Results of the double-slit experiment. Credit: Jordgette (CC BY-SA 3.0)
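For readers who want to see the math behind that rippling pattern, here is a minimal numerical sketch. It uses the standard idealized two-slit interference formula with made-up parameters (the wavelength, slit spacing, and screen distance below are illustrative, not drawn from any particular experiment), and it treats each slit on its own as producing a flat patch of brightness:

    import numpy as np

    # Idealized far-field two-slit pattern (illustrative, made-up parameters).
    wavelength = 50e-12      # ~50 picometers, a plausible electron de Broglie wavelength
    slit_separation = 1e-6   # 1 micrometer between the slit centers
    screen_distance = 1.0    # 1 meter from the slits to the detection plane

    x = np.linspace(-5e-3, 5e-3, 1001)   # positions along the detection plane
    theta = x / screen_distance          # small-angle approximation
    phase = np.pi * slit_separation * np.sin(theta) / wavelength

    single_slit = np.ones_like(x)                # idealization: one open slit gives a flat patch
    both_slits_classical = 2 * single_slit       # what you'd expect if intensities simply added
    both_slits_quantum = 4 * np.cos(phase) ** 2  # interference: ripples between 0 and 4x a single slit

    # Some positions are *dimmer* with both slits open than with just one.
    dimmer = both_slits_quantum < single_slit
    print(f"{dimmer.mean():.0%} of sampled positions get dimmer when the second slit opens")

The cos² term is where the interference lives: where the two possible paths arrive out of phase, the probability of detecting an electron drops, which is exactly the "less likely instead of more likely" behavior described above.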

For electrons in the double-slit experiment, the two different histories are two different paths through space. For a measurement at the Large Hadron Collider, the histories are more abstract—paths that lead through transformations of fields. One history might be like the pathway Ghosh was asked to study, in which two W bosons fuse to form a Higgs boson before the Higgs boson splits into two Z bosons. But in another history, the two W bosons might fuse and immediately split into two Z bosons without ever producing a Higgs.

Both histories have the same beginning, with two W bosons, and the same end, with two Z bosons. And just as the two histories of electrons in the double-slit experiment can interfere, so can the two histories for these particles.

Another possible history for colliding particles at the Large Hadron Collider, which interferes with the measurement Ghosh was asked to do. Credit: ATLAS

That interference makes the effect of the Higgs boson much more challenging to spot. ATLAS scientists wanted to look for two pairs of electrons and positrons, which would provide evidence that two Z bosons were produced. They would classify their observations into two types: observations that are evidence for the signal they were looking for (that of a decaying Higgs boson) and observations of events that generate this pattern of particles without the Higgs boson acting as an intermediate (the latter are called the background). But the two types of observations, signal and background, interfere. With a stronger signal, corresponding to more Higgs bosons decaying, you might observe more pairs of electrons and positrons… but if these events interfere, you also might see those pairs disappear.

Learning to infer

In traditional approaches, those disappearances are hard to cope with, even when using methods that already incorporate machine learning.

One of the most common uses of machine learning is classification—for example, distinguishing between pictures of dogs and cats. You train the machine on pictures of cats and pictures of dogs, and it tells you, given a picture, which animal is the most likely match. Physicists at the LHC were already using this kind of classification method to characterize the products of collisions, but it functions much worse when interference is involved.

“If you have something that disappears, you don’t quite know what to train on,” said David Rousseau. “Usually, you’re training signal versus background, exactly like you’re training cats versus dogs. When there is something that disappears, you don’t see what you trained on.”

At first, Ghosh tried a few simple tricks, but as time went on, he realized he needed to make a more fundamental change. He reached out to others in the community and learned about a method called Neural Simulation-Based Inference, or NSBI.

In older approaches, people had trained machine learning models to classify observations into signal and background, using simulations of particle collisions to make the training data. Then they used that classification to infer the most likely value of a number, like the amount of time it takes a Higgs boson to decay, based on data from an actual experiment. Neural Simulation-Based Inference skips the classification and goes directly to the inference.

Instead of trying to classify observations into signal and background, NSBI uses simulations to teach an artificial neural network to guess a formula called a likelihood ratio. Someone using NSBI would run several simulations that describe different situations, such as letting the Higgs boson decay at different rates, and then check how many of each type of simulation yielded a specific observation. The fraction of these simulations with a certain decay rate would provide the likelihood ratio, a method for inferring which decay rate is more likely given experimental evidence. If the neural network is good at guessing this ratio, it will be good at finding how long the Higgs takes to decay.
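To make that concrete, here is a minimal, self-contained sketch of the "likelihood-ratio trick" that this kind of neural simulation-based inference builds on. Everything in it is illustrative rather than taken from ATLAS: one-dimensional Gaussian samples stand in for full detector simulations under two hypotheses, and a tiny scikit-learn network stands in for the deep neural networks the collaboration actually trains. Note that the network here distinguishes simulations generated under different hypotheses, not signal from background; its output is only a means of estimating the likelihood ratio.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)

    # Toy "simulations" of one observable under two hypotheses
    # (stand-ins for full detector simulations with different Higgs parameters).
    sim_a = rng.normal(loc=0.0, scale=1.0, size=(20_000, 1))  # hypothesis A
    sim_b = rng.normal(loc=0.5, scale=1.0, size=(20_000, 1))  # hypothesis B

    X = np.vstack([sim_a, sim_b])
    y = np.concatenate([np.zeros(len(sim_a)), np.ones(len(sim_b))])

    # A network trained to tell the two simulation sets apart learns s(x) ~ p(B|x).
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X, y)

    def likelihood_ratio(x):
        """Estimate p(x|B) / p(x|A) from the classifier output via s / (1 - s)."""
        s = clf.predict_proba(np.atleast_2d(x))[:, 1]
        return s / (1.0 - s)

    # "Observed" events actually drawn from hypothesis B: the estimated ratio
    # should, on average, favor B (mean log-ratio above zero).
    observed = rng.normal(loc=0.5, scale=1.0, size=(5_000, 1))
    print("mean log likelihood ratio:", np.log(likelihood_ratio(observed)).mean())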

Because NSBI doesn’t try to classify observations into different categories, it handles quantum interference more effectively. Instead of trying to find the Higgs based on a signal that disappears, it examines all the data, trying to guess which decay time is the most likely.

Ghosh tested the method, which showed promising results on test data, and presented the results at a conference in 2019. But if he was going to convince the ATLAS collaboration that the method was safe to use, he still had a lot of work ahead of him.

Shifting the weight on ATLAS’ shoulders

Experiments like ATLAS have high expectations attached to them. A collaboration of thousands of scientists, ATLAS needs to not only estimate the laws of physics but also have a clear idea of just how uncertain those estimates are. At the time, NSBI hadn’t been tested in that way.

“None of this has actually been used on data,” said Ghosh. “Nobody knew how to quantify the uncertainties. So you have a neural network that gives you a likelihood. You don’t know how good the likelihood is. Is it well-estimated? What if it’s wrongly estimated just in some weird corner? That would completely bias your results.”

Checking those corners was too big a job for a single PhD student and too complex to complete within a single PhD degree. Ghosh would have to build a team, and he would need time to build it. That’s tricky in the academic world, where students go on to short-term postdoc jobs with the expectation that they quickly publish new results to improve their CVs for the next position.

“We’re usually looking to publish the next paper within two to three years—no time to overhaul our methods,” said Ghosh. Fortunately, Ghosh had support. He completed his PhD under Rousseau and then went to work with Daniel Whiteson, who encouraged him to pursue his ambitious project.

“I think it’s really important that postdocs learn to take those risks because that’s what science is,” Whiteson said.

Ghosh gathered his team. Another student of Rousseau’s, Arnaud Maury, worked to calibrate the machine’s confidence in its answers. A professor at the University of Massachusetts, Rafael Coelho Lopes de Sa, joined the project. His student Jay Sandesara would have a key role in getting the calculation to work at full scale on a computer cluster. IJCLab emeritus RD Schaffer and University of Liège professor Gilles Louppe provided cross-checks and advice.

The team wanted a clear demonstration that their method worked, so they took an unusual step. They took data that ATLAS had already analyzed and performed a full analysis using their method instead, showing that it could pass every check the collaboration could think of. They would publish two papers, one describing the method and the other giving the results of their upgraded analysis. Zach Marshall, who was the computing coordinator for ATLAS at the time, helped get the papers through, ensuring that they were vetted by experts in multiple areas.

“It was a very small subset of our community that had that overlap between this technical understanding and the physics analysis experience and understanding that were capable of really speaking to whether that paper was sufficient and intelligible and useful. So we really had to make sure that we engaged that little group of humans by name,” said Marshall.

The new method showed significant improvements, getting a much more precise result than the collaboration’s previous analysis. That improvement, and the thorough checks, persuaded ATLAS to use NSBI more broadly going forward. It will give the collaboration much more precision than it expected as it uses the Higgs boson to search for new particles and clarify our understanding of the quantum world. When ATLAS discusses its future plans, it makes projections of the precision it expects to reach. But those projections are now being upended.

“One of the fun things about this method that Aishik pushed hard is each time it feels like now we do that projection—here’s how well we’ll do in 15 years—we absolutely crush those projections,” said Marshall. “So we are just now having to redo a set of projections because we matched our old projections for 15 years out already today. It’s a very fun problem to have.”



How a data center company uses stranded renewable energy

“Decisions around where data centers get built have shifted dramatically over the last six months, with access to power now playing the most significant role in location scouting,” Joshi said. “The grid can’t keep pace with AI demands, so the industry is taking control with onsite power generation.”

Soluna, like other data center developers looking to rely on renewable energy, buys excess power from wind, hydro, and solar plants—power those plants can’t sell to the grid. By the end of the year, Soluna will have three facilities totaling 123 megawatts of capacity in Kentucky and Texas and seven projects in the works with upwards of 800 total megawatts.

Belizaire and I talked about how in Texas, where I report from, there’s plenty of curtailed energy from wind and solar farms because of the region’s limited transmission capacity. In West Texas, far from major load centers like Dallas and Houston, other data center developers are also taking advantage of that unused wind energy by co-locating their giant warehouses full of advanced computers and high-powered cooling systems with the plants that produce it.

One data center developer using curtailed renewable power in Texas is IREN. The firm owns and operates facilities optimized for Bitcoin mining and AI. It developed a 7.5-gigawatt facility in Childress and broke ground on a 1.4-gigawatt data center in Sweetwater.

IREN purchases power through the state grid’s wholesale market during periods of oversupply, said Kent Draper, the company’s chief commercial officer, and reduces its consumption when prices are high. It’s able to do that by turning off its computers and minimizing power demand from its data centers.

But curtailment is an issue all over the world, Belizaire said, from Oklahoma, North Dakota, South Dakota, California, and Arizona in the US, to Northern Ireland, Germany, Portugal, and Australia.

“Anywhere where you have large utility-scale renewable development that’s been built out, you’re going to find it,” Belizaire said.

In a March analysis, the US Energy Information Administration reported that solar and wind power curtailments are increasing in California. In 2024, the grid operator for most of California curtailed 3.4 million megawatt hours of utility-scale wind and solar output, a 29 percent increase from the amount of electricity curtailed in 2023.



A shark scientist reflects on Jaws at 50


We’re still afraid to go in the water

Ars chats with marine biologist David Shiffman about the film’s legacy—both good and bad.

Roy Scheider starred as Chief Martin Brody in the 1975 blockbuster Jaws. Credit: Universal Pictures

Today marks the 50th anniversary of Jaws, Steven Spielberg’s blockbuster horror movie based on the bestselling novel by Peter Benchley. We’re marking the occasion with a tribute to this classic film and its enduring impact on the popular perception of sharks, shark conservation efforts, and our culture at large.

(Many spoilers below.)

Jaws tells the story of Chief Martin Brody (Roy Scheider), the new police chief for Amity Island, a New England beach town and prime summer tourist attraction. But that thriving industry is threatened by a series of shark attacks, although the local mayor, Larry Vaughn (Murray Hamilton), initially dismisses the possibility, ridiculing the findings of visiting marine biologist Matt Hooper (Richard Dreyfuss). The attacks keep escalating and the body count grows, until the town hires a grizzled shark hunter named Quint (Robert Shaw) to hunt down and kill the great white shark, with the help of Brody and Hooper.

Benchley wrote his novel after reading about a sports fisherman named Frank Mundus, who captured a very large shark in 1964; in fact, the character of Quint is loosely based on Mundus. Benchley wrote an early draft of the screenplay, which underwent multiple revisions during production. In the end, he estimated that his contributions amounted to the basic storyline and the mechanics. Spielberg wasn’t the studio’s first choice for director; initially they hired Dick Richards, but Richards kept referring to the shark as a whale. Eventually, he was fired and replaced with the 26-year-old Spielberg, who had just finished his first feature film (The Sugarland Express).

Spielberg was given a $3.5 million shooting budget and a timeframe of 55 days for filming. However, the production was troubled from the start, largely due to the director’s insistence on shooting on location in Martha’s Vineyard; Jaws was the first major film to be shot on the ocean. Spielberg later admitted, “I was pretty naive about Mother Nature and the hubris of a filmmaker who thinks he can conquer the elements was foolhardy.” Unwanted boats kept drifting into the frame; cameras kept getting waterlogged; Carl Gottlieb (who played the local news editor Meadows) was nearly decapitated by a propeller; Dreyfuss nearly got stuck in the shark cage; and several actors suffered from seasickness. Frustrated crew members took to calling the movie “Flaws.”

A shark strikes

“duh-duh-duh-duh-duh-duh….” Universal Pictures

There were three pneumatically powered full-sized mechanical sharks built for the shoot, nicknamed “Bruce,” and they kept malfunctioning. The pneumatic hoses kept taking on seawater; the skin was made of neoprene foam, which soaked up water and became bloated; and one of the models kept getting tangled up in seaweed. In the end, Spielberg opted to shoot most of the early scenes without ever showing the actual shark, which actually heightened the tension and suspense, especially when combined with John Williams’ ominous theme music (“duh-duh-duh-duh-duh-duh…”).

In the end, shooting ran for 159 days, and the budget ballooned to $9 million. All the delays gave Spielberg and his writers (especially Gottlieb) extra time to refine the script, often just prior to filming the scenes. A lot of the dialogue was improvised by the actors. And it was all worth it in the end, because Jaws went on to become a major summer box office success. All told, it grossed $476 million globally across all its theatrical releases and won three Oscars, although it lost Best Picture to One Flew Over the Cuckoo’s Nest.

Jaws inspired many, many subsequent films, including Ridley Scott’s Alien in 1979, described in pitch meetings as “Jaws in space.” Audience reactions were often extreme, with many people becoming afraid to swim in the ocean for fear of sharks. And while the sequels were, shall we say, underwhelming, the original Jaws has stood the test of time. Ars spoke with marine biologist and shark conservationist David Shiffman, author of Why Sharks Matter, to discuss the film’s depiction of sharks and its enduring place in popular culture.

Ars Technica: Let’s start by talking about the enormous impact of the film, both good and bad, on the general public’s awareness of sharks.

David Shiffman: A lot of folks in both the marine science world and the ocean conservation communities have reported that Jaws in a lot of ways changed our world. It’s not that people used to think that sharks were cute, cuddly, adorable animals, and then after Jaws, they thought that they were bloodthirsty killing machines. They just weren’t on people’s minds. Fishermen knew about them, surfers thought about them, but that was about it. Most people who went to the beach didn’t pay much mind to what could be there. Jaws absolutely shattered that. My parents both reported that the summer that Jaws came out, they were afraid to go swimming in their community swimming pools.

No, really, the water’s fine!

“You knew.” The young boy’s mother (Lee Fierro) confronts Brody. Universal Pictures

David Shiffman: I have encountered people who were so scared that they were afraid to go in the bathtub. A lot of movies are very scary, but they don’t have that real-world impact. I love Jurassic Park, but I’m not afraid that a T. rex is going to eat me when I go into an outhouse, even though that’s about as realistic as what’s portrayed in Jaws. There’s something called the “Jaws Effect” in public policy literature, which is a way of measuring how fictional portrayals of real-world issues affect what citizens think about that issue and what policy preferences they support as a result. It’s fascinating how a fictional portrayal can do that, because I cannot stress enough: That is not what sharks look like or how they behave.

The movie also was the first time that a scientist was the hero. People half a generation above me have reported that seeing Richard Dreyfuss’ Hooper on the big screen as the one who saves the day changed their career trajectory. “You can be a scientist who studies fish. Cool. I want to do that.” In the time since Jaws came out, a lot of major changes have happened. One is that shark populations have declined globally by about 50 percent, and many species are now critically endangered.

And shark science has become much more professionalized. The American Elasmobranch Society—I’m on the board of directors—was founded in 1983, and now we have about 500 members in the US, Canada, and Mexico. There have since been subsequent organizations founded in Australia and the Pacific Islands, Europe, South America, and a new one starting this year in Asia.

And then, from a cultural standpoint, we now have a whole genre of bad shark movies.

Ars Technica: Sharknado!

David Shiffman: Yes! Sharknado is one of the better of the bunch. Sitting on my desk here, we’ve got Sharkenstein, Raiders of the Lost Shark, and, of course, Shark Exorcist, all from the 2010s. I’ve been quoted as saying there’s two types of shark movie: There’s Jaws and there’s bad shark movies.

Ars Technica: Populations of the tiger shark, the great white, and a couple of other species have declined so dramatically that many are on the verge of extinction. Is it just a coincidence that those declines started shortly after Jaws came out?

David Shiffman: The short answer is not that Jaws caused this, but that perhaps Jaws made it easier for it to happen because people weren’t outraged the way they might’ve been if it happened to say, whales, whose populations were also declining around the same time. The number one threat to shark species as a whole is unsustainable overfishing practices. People are killing too many sharks. Sustainable fisheries for sharks can and do exist, and the US largely has done a good job with this, but around the world, it’s a bad scene.

“A whole genre of bad shark movies”

For instance, shark fin soup started to be a problem around the 1980s thanks to the economic boom in China and the emergence of a new middle class there. Shark fin soup is a traditional Chinese and Southeast Asian delicacy. It’s associated with the emperor and his court. It’s not shark meat that’s used. It’s the little skeletal fin rays from the fins that are basically a bland, noodle-like substance when they’re dried and boiled. The purpose of this was for people to say, “I have so much money that I can eat these incredibly rare delicacies.” That was not caused by Jaws. But perhaps it was allowed to happen because there was less public sympathy for sharks.

It’s worth noting that shark fin soup and the shark fin trade is no longer the biggest or only threat to sharks. It hasn’t been in about 20 years. Ironically, a lot of that has to do with Chinese government efforts not to save the ocean, but to crack down on public corruption. A lot of government officials used to throw extravagant banquets for their friends and family. The new Chinese government said, “We’re not doing that anymore.” That alone saved a lot of endangered species. It was not motivated by concern about the state of the ocean, but it had that effect.

Ars Technica: People have a tendency to think that sharks are simply brutal killing machines. Why are they so important to the ecosystem?

David Shiffman: The title of my book is Why Sharks Matter because sharks do matter and people don’t think about them that way. These are food chains that provide billions of humans with food, including some of the poorest humans on Earth. They provide tens of millions of humans with jobs. When those food chains are disrupted, that’s bad for coastal communities, bad for food security and livelihoods. If we want to have healthy ocean food chains, we need a healthy top of the food chain, because when you lose the top of the food chain, the whole thing can unravel in unpredictable, but often quite devastating ways.

 So sharks play important ecological roles by holding the food chain that we all depend on in place. They’re also not a significant threat to you and your family. More people in a typical year die from flower pots falling on their head when they walk down the street. More people in a typical year die falling off a cliff when they’re trying to take a selfie of the scenery behind them, than are killed by sharks. Any human death or injury is a tragedy, and I don’t want to minimize that. But when we’re talking about global-scale policy responses, the relative risk versus reward needs to be considered.

Ars Technica:  There’s a scene in Jaws where Hooper is talking about his personal theory: territoriality, the idea that this rogue great white came in and made this his personal territory and now he’ll just keep feeding until the food runs out. Is that a real scientific premise from the 1970s and how valid is it?

The hunt begins

The town hires grizzled shark hunter Quint (Robert Shaw) to kill the great white shark. Universal Pictures

David Shiffman: Rogue sharks are nonsense. It is nonsense that is still held by some kooks who are ostensibly in my field, but it is not supported by any evidence whatsoever. In all of recorded human history, there is proof that exactly one shark bit more than one human. That was the Sharm el-Sheikh attacks around Christmas in Egypt a few years ago. Generally speaking, a lot of times it’s hard to predict why wild animals do or don’t do anything. But if this was a behavior that was real, there would be evidence that it happens and there isn’t any, despite a lot of people looking.

Was it commonly believed in the 1970s? No. Did Peter Benchley make it up? No. It’s a thing in some animals for sure. In some neighborhoods, people will pick up gators and move them hundreds of miles away; the gators will move back to that exact same spot. I think the same thing has been shown with bears. Wolves certainly have a home range. But for sharks, it’s not a thing.

Ars Technica: Quint has a famous monologue about surviving the USS Indianapolis sinking and witnessing crew members being eaten by sharks. How historically accurate is that?

David Shiffman: We don’t really know how many of the people who were killed following the sinking of the Indianapolis were killed by sharks. Certainly, firsthand accounts report that sharks were present. But those people were in the water because they were on a boat that exploded after being hit by a torpedo. That is not good for your health. So a lot of those people were either mortally wounded or killed by that initial explosion, and then perhaps were scavenged by sharks. Those are also people who are in the water bleeding, making a lot of noise. That’s an incredible scene in the movie. But the deaths Quint attributes to sharks is more people than have been reliably documented as killed by sharks in the history of the world ever.

Ars Technica: How accurate is Jaws in terms of how and why sharks attack humans? For instance, someone says that people splashing in the water mimics what sharks want to hunt. 

David Shiffman: Anyone who tells you they know exactly why a wild animal does or does not do something is someone who you should be a little skeptical of. But a leading theory, which I think makes sense, is this idea of mistaken identity. Some of the people who are most commonly bitten by sharks, though it’s still astronomically rare, are surfers. These are people who are cutting through the water with a silhouette that resembles a seal, wearing black neoprene, which is modeled after seal blubber. Sharks have been patrolling the ocean since before there were trees on land, and it’s only in the last hundred years or so that they’ve had to wonder, is that my preferred prey, or is it a human using technology to mimic my preferred prey for recreational purposes?

If you’ve been in the ocean, there’s been a shark not that far from you, and it knew you were there, and you probably had no idea it was there and had a pleasant day in the water. The sharks that do bite people, they take a little bite and they go, what is that? And swim away. That can be real bad if it hits a major artery or if you’re far from shore. Again, I don’t want to minimize the real harm. But it is not a shark hunting you because it has a taste for human flesh. They don’t have hands. They explore their environment with their mouths and most things in their environment they can eat.

I think Mythbusters tested fish blood versus mammal blood versus chicken blood, I think. And the sharks were attracted to fish blood and had no reaction to the others. So these are animals that are very, very, very well adapted for environmental conditions that in some cases don’t really exist anymore.

Man vs. great white

Brody fights off an increasingly aggressive great white. Universal Pictures

With humans, most of the time, what happens is an immediate bite, and then they swim away. With seals or large prey, they’ll often hit it really hard from below, sometimes knocking it completely out of the water. Or if they’re hunting whales or something that they can’t fit in their mouth, they just take a huge bite and swim away. With fish, they swallow them whole to the extent possible. Sometimes there’s a shaking motion to snap a neck or whatever. You see that with some land predators, too. It’s nothing like what’s seen there—but what an awesome scene.

Ars Technica: What is your favorite scene in Jaws and the one that makes you cringe the most?

David Shiffman: Oh, man. It’s really a great movie, and it holds up well. It was hailed as revolutionary at the time because you hardly ever see the shark. But the reason they did that was because the model of the shark that they built kept breaking. So they decided, let’s just shoot it from the shark’s eye view and save money and annoyance. I love the scene when Hooper realizes that the tiger shark that they’ve caught is obviously not the right species and the reaction that people have to that—just this idea that science and expertise can be used to solve problems. Whenever a shark bites someone, there are people who go out and kill any shark they can find and think that they’re helping.

One of my favorite professional experiences is the American Elasmobranch Society conference. One year it was in Austin, Texas, near the original Alamo Drafthouse. Coincidentally, while we were there, the cinema held a “Jaws on the Water” event. They had a giant projector screen, and we were sitting in a lake in inner tubes while there were scuba divers in the water messing with us from below. I did that with 75 professional shark scientists. It was absolutely amazing. It helped knowing that it was a lake.

Ars Technica: If you wanted to make another really good shark movie, what would that look like today? 

David Shiffman: I often say that there are now three main movie plots: a man goes on a quest, a stranger comes to town, or there’s a shark somewhere you would not expect a shark to be. It depends if you want to make a movie that’s actually good, or one of the more fun “bad” movies like Sharknado or Sharktopus or Avalanche Sharks—the tagline of which is “snow is just frozen water.” These movies are just off the rails and absolutely incredible. The ones that don’t take themselves too seriously and are in on the joke tend to be very fun. But then you get movies like Netflix’s Under Paris (2024); they absolutely thought they were making a good movie and took themselves very seriously, and it was painful to watch.

I would love to see actual science and conservation portrayed. I’d love to see species that are not typically found in these movies featured. The Sharknado series actually did a great job of this because they talked with me and other scientists after the success of the first one. Sharknado II is thanked in my PhD dissertation, because they funded one of my chapters. In that movie, it’s not just great whites and tiger sharks and bull sharks. They have a whale shark that falls out of the sky and hits someone. They have a cookie-cutter shark that falls out of the sky and burrows through someone’s leg. There’s a lot of shark diversity out there, and it’d be nice to get that featured more.

Photo of Jennifer Ouellette

Jennifer is a senior writer at Ars Technica with a particular focus on where science meets culture, covering everything from physics and related interdisciplinary topics to her favorite films and TV series. Jennifer lives in Baltimore with her spouse, physicist Sean M. Carroll, and their two cats, Ariel and Caliban.



Microsoft lays out its path to useful quantum computing


Its platform needs error correction that works with different hardware.

Some of the optical hardware needed to make Atom Computing’s machines work. Credit: Atom Computing

On Thursday, Microsoft’s Azure Quantum group announced that it has settled on a plan for getting error correction on quantum computers. While the company pursues its own hardware efforts, the Azure team is a platform provider that currently gives access to several distinct types of hardware qubits. So it has chosen a scheme that is suitable for several different quantum computing technologies (notably excluding its own). The company estimates that the system it has settled on can take hardware qubits with an error rate of about 1 in 1,000 and use them to build logical qubits where errors are instead 1 in 1 million.

While it’s describing the scheme in terms of mathematical proofs and simulations, it hasn’t shown that it works using actual hardware yet. But one of its partners, Atom Computing, is accompanying the announcement with a description of how its machine is capable of performing all the operations that will be needed.

Arbitrary connections

There are similarities and differences between what the company is talking about today and IBM’s recent update of its roadmap, which described another path to error-resistant quantum computing. In IBM’s case, it makes both the software stack that will perform the error correction and the hardware needed to implement it. It uses chip-based hardware, with the connections among qubits mediated by wiring that’s laid out when the chip is fabricated. Since error correction schemes require a very specific layout of connections among qubits, once IBM decides on a quantum error correction scheme, it can design chips with the wiring needed to implement that scheme.

Microsoft’s Azure, in contrast, provides its users with access to hardware from several different quantum computing companies, each based on different technology. Some of them, like Rigetti and Microsoft’s own planned processor, are similar to IBM’s in that they have a fixed layout during manufacturing, and so can only handle codes that are compatible with their wiring layout. But others, such as those provided by Quantinuum and Atom Computing, store their qubits in atoms that can be moved around and connected in arbitrary ways. Those arbitrary connections allow very different types of error correction schemes to be considered.

It can be helpful to think of this using an analogy to geometry. A chip is like a plane, where it’s easiest to form the connections needed for error correction among neighboring qubits; longer connections are possible, but not as easy. Things like trapped ions and atoms provide a higher-dimensional system where far more complicated patterns of connections are possible. (Again, this is an analogy. IBM is using three-dimensional wiring in its processing chips, while Atom Computing stores all its atoms in a single plane.)

Microsoft’s announcement is focused on the sorts of processors that can form the more complicated, arbitrary connections. And, well, it’s taking full advantage of that, building an error correction system with connections that form a four-dimensional hypercube. “We really have focused on the four-dimensional codes due to their amenability to current and near term hardware designs,” Microsoft’s Krysta Svore told Ars.

The code not only describes the layout of the qubits and their connections, but also the purpose of each hardware qubit. Some of them are used to hang on to the value of the logical qubit(s) stored in a single block of code. Others are used for what are called “weak measurements.” These measurements tell us something about the state of the ones that are holding on to the data—not enough to know their values (a measurement that would end the entanglement), but enough to tell if something has changed. The details of the measurement allow corrections to be made that restore the original value.
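To get a feel for how such check measurements can locate an error without reading out the protected information, here is a deliberately simplified classical analogue—a three-bit repetition code, not Microsoft's 4D code—in which parity checks report only whether neighboring bits disagree:

    import random

    # Classical cartoon of syndrome-based correction: a 3-bit repetition code.
    # The parity checks play the role of the "weak measurements" described above:
    # they reveal whether neighboring bits disagree, not the bits' values.
    def encode(logical_bit):
        return [logical_bit] * 3

    def syndrome(bits):
        return (bits[0] ^ bits[1], bits[1] ^ bits[2])

    def correct(bits):
        s = syndrome(bits)
        if s == (1, 0):    # first bit disagrees with the other two
            bits[0] ^= 1
        elif s == (0, 1):  # third bit disagrees
            bits[2] ^= 1
        elif s == (1, 1):  # middle bit disagrees
            bits[1] ^= 1
        return bits        # (0, 0) means no single-bit error was detected

    random.seed(0)
    bits = encode(1)
    bits[random.randrange(3)] ^= 1      # a single error hits one of the three bits
    print("syndrome:", syndrome(bits))  # tells us where the error is...
    print("recovered:", correct(bits))  # ...so it can be undone: [1, 1, 1]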

Microsoft’s error correction system is described in a preprint that the company recently released. It includes a family of related geometries, each of which provides different degrees of error correction, based on how many simultaneous errors they can identify and fix. The descriptions are about what you’d expect for complicated math and geometry—“Given a lattice Λ with an HNF L, the code subspace of the 4D geometric code C_Λ is spanned by the second homology H_2(T^4_Λ, F_2) of the 4-torus T^4_Λ”—but the gist is that all of them convert collections of physical qubits into six logical qubits that can be error corrected.

The more hardware qubits you add to host those six logical qubits, the greater error protection each of them gets. That becomes important because some more sophisticated algorithms will need more than the one-in-a-million error protection that Svore said Microsoft’s favored version will provide. That favorite is what’s called the Hadamard version, which bundles 96 hardware qubits to form six logical qubits, and has a distance of eight (distance being a measure of how many simultaneous errors it can tolerate). You can compare that with IBM’s announcement, which used 144 hardware qubits to host 12 logical qubits at a distance of 12 (so, more hardware, but more logical qubits and greater error resistance).
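For a quick side-by-side view, this small sketch simply tabulates the parameters quoted above and derives each code's encoding rate (logical qubits per physical qubit), a figure of merit that isn't stated explicitly in either announcement:

    # Parameters as quoted above: (physical qubits, logical qubits, distance).
    codes = {
        "Microsoft Hadamard 4D code": (96, 6, 8),
        "IBM code": (144, 12, 12),
    }

    for name, (physical, logical, distance) in codes.items():
        rate = logical / physical  # encoding rate: logical qubits per physical qubit
        print(f"{name}: {physical} physical -> {logical} logical "
              f"(rate {rate:.3f}), distance {distance}")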

The other good stuff

On its own, a description of the geometry is not especially exciting. But Microsoft argues that this family of error correction codes has a couple of significant advantages. “All of these codes in this family are what we call single shot,” Svore said. “And that means that, with a very low constant number of rounds of getting information about the noise, one can decode and correct the errors. This is not true of all codes.”

Limiting the number of measurements needed to detect errors is important. For starters, measurements themselves can create errors, so making fewer makes the system more robust. In addition, in things like neutral atom computers, the atoms have to be moved to specific locations where measurements take place, and the measurements heat them up so that they can’t be reused until cooled. So, limiting the measurements needed can be very important for the performance of the hardware.

The second advantage of this scheme, as described in the draft paper, is the fact that you can perform all the operations needed for quantum computing on the logical qubits these schemes host. Just like in regular computers, all the complicated calculations performed on a quantum computer are built up from a small number of simple logical operations. But not every possible logical operation works well with any given error correction scheme. So it can be non-trivial to show that an error correction scheme is compatible with enough of the small operations to enable universal quantum computation.

So, the paper describes how some logical operations can be performed relatively easily, while a few others require manipulations of the error correction scheme in order to work. (These manipulations have names like lattice surgery and magic state distillation, which are good signs that the field doesn’t take itself that seriously.)

So, in sum, Microsoft feels that it has identified an error correction scheme that is fairly compact, can be implemented efficiently on hardware that stores qubits in photons, atoms, or trapped ions, and enables universal computation. What it hasn’t done, however, is show that it actually works. And that’s because it simply doesn’t have the hardware right now. Azure is offering trapped ion machines from IonQ and Quantinuum, but these top out at 56 qubits—well below the 96 needed for Microsoft’s favored version of these 4D codes. The largest it has access to is a 100-qubit machine from a company called PASQAL, which barely fits the 96 qubits needed, leaving no room for error.

While it should be possible to test smaller versions of codes in the same family, the Azure team has already demonstrated its ability to work with error correction codes based on hypercubes, so it’s unclear whether there’s anything to gain from that approach.

More atoms

Instead, it appears to be waiting for another partner, Atom Computing, to field its next-generation machine, one it’s designing in partnership with Microsoft. “This first generation that we are building together between Atom Computing and Microsoft will include state-of-the-art quantum capabilities, will have 1,200 physical qubits,” Svore said. “And then the next upgrade of that machine will have upwards of 10,000. And so you’re looking at then being able to go to upwards of a hundred logical qubits with deeper and more reliable computation available.”

So, today’s announcement was accompanied by an update on progress from Atom Computing, focusing on a process called “midcircuit measurement.” Normally, during quantum computing algorithms, you have to resist performing any measurements of the value of qubits until the entire calculation is complete. That’s because quantum calculations depend on things like entanglement and each qubit being in a superposition between its two values; measurements can cause all that to collapse, producing definitive values and ending entanglement.

Quantum error correction schemes, however, require that some of the hardware qubits undergo weak measurements multiple times while the computation is in progress. Those are quantum measurements taking place in the middle of a computation—midcircuit measurements, in other words. To show that its hardware will be up to the task that Microsoft expects of it, the company decided to demonstrate mid-circuit measurements on qubits implementing a simple error correction code.

The process reveals a couple of notable features that are specific to doing this with neutral atoms. To begin with, the atoms being used for error correction have to be moved to a location—the measurement zone—where they can be measured without disturbing anything else. Then, the measurement typically heats up the atom slightly, meaning it has to be cooled back down afterward. Neither of these processes is perfect, and so sometimes an atom gets lost and needs to be replaced with one from a reservoir of spares. Finally, the atom’s value needs to be reset, and it has to be sent back to its place in the logical qubit.

Testing revealed that about 1 percent of the atoms get lost each cycle, but the system successfully replaces them. In fact, they set up a system where the entire collection of atoms is imaged during the measurement cycle, and any atom that goes missing is identified by an automated system and replaced.

Overall, without all these systems in place, the fidelity of a qubit is about 98 percent in this hardware. With error correction turned on, even this simple logical qubit saw its fidelity rise to over 99.5 percent. All of which suggests their next computer should be up to some significant tests of Microsoft’s error correction scheme.

Waiting for the lasers

The key questions are when that machine will be released and when its successor, which should be capable of performing some real calculations, will follow. Those questions are challenging to answer because, more so than some other quantum computing technologies, neutral atom computing is dependent on something that’s not made by the people who build the computers: lasers. Everything about this system—holding atoms in place, moving them around, measuring, performing manipulations—is done with a laser. The lower the noise of the laser (in terms of things like frequency drift and energy fluctuations), the better performance it’ll have.

So, while Atom can explain its needs to its suppliers and work with them to get things done, it has less control over its fate than some other companies in this space.

Photo of John Timmer

John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.



Man’s health crashes after getting donated kidney—it was riddled with worms

About two months after receiving a donated kidney, a 61-year-old man ended up back in the hospital. He was tired, nauseous, and vomiting. He was also excessively thirsty and producing too much urine. Over the next 10 days, things only got worse. The oxygen levels in his blood began to fall. His lungs filled with fluid. He kept vomiting. He couldn’t eat. Doctors inserted a feeding tube. His oxygen levels and blood pressure kept falling. He was admitted to the intensive care unit and put on mechanical ventilation. Still, things kept getting worse.

At that point, he was transferred to the ICU of Massachusetts General Hospital, where he had received the transplant. He was in acute respiratory failure and shock.

In a case report in this week’s issue of the New England Journal of Medicine, doctors at Mass General explained how they determined what was wrong with the man. Their first steps were collecting more information about the man’s symptoms from his wife, reviewing his family medical history, and contacting the regional organ-procurement organization that provided the kidney.

Process of elimination

The man’s condition and laboratory tests suggested he had some sort of infection. But as a transplant recipient who was on a variety of immunosuppressive drugs, the list of infectious possibilities was “extensive.”

Dr. Camille Kotton, Clinical Director of the hospital’s Transplant and Immunocompromised Host Infectious Diseases division, laid out her thinking. She started with a process of elimination. As an immunosuppressed transplant patient, he was also on several medications to proactively prevent infections. These would rule out herpesviruses and cytomegalovirus. He was also on a combination of antibiotics that would rule out many bacterial infections, as well as the fungal infection Pneumocystis jirovecii that strikes the immunocompromised and the protozoan parasite Toxoplasma gondii.

One feature stood out: The man had developed elevated levels of eosinophils, white blood cells that can increase for various reasons—including parasitic infections. The man also had a reddish-purple rash over his abdomen. Coupled with the severity of his illness, Kotton suspected a widespread parasitic infection.

The man’s history was notable for contact with domestic cats and dogs—including a cat scratch in the time between having the transplant and falling critically ill. But common bacterial infections linked to cat scratches could be ruled out. And other parasitic infections that might come from domestic animals in the US, such as toxocariasis, don’t typically lead to such critical illnesses.



Study: Meta AI model can reproduce almost half of Harry Potter book


Harry Potter and the Copyright Lawsuit

The research could have big implications for generative AI copyright lawsuits.

Meta CEO Mark Zuckerberg. Credit: Andrej Sokolow/picture alliance via Getty Images

In recent years, numerous plaintiffs—including publishers of books, newspapers, computer code, and photographs—have sued AI companies for training models using copyrighted material. A key question in all of these lawsuits has been how easily AI models produce verbatim excerpts from the plaintiffs’ copyrighted content.

For example, in its December 2023 lawsuit against OpenAI, The New York Times Company produced dozens of examples where GPT-4 exactly reproduced significant passages from Times stories. In its response, OpenAI described this as a “fringe behavior” and a “problem that researchers at OpenAI and elsewhere work hard to address.”

But is it actually a fringe behavior? And have leading AI companies addressed it? New research—focusing on books rather than newspaper articles and on different companies—provides surprising insights into this question. Some of the findings should bolster plaintiffs’ arguments, while others may be more helpful to defendants.

The paper was published last month by a team of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University. They studied whether five popular open-weight models—three from Meta and one each from Microsoft and EleutherAI—were able to reproduce text from Books3, a collection of books that is widely used to train LLMs. Many of the books are still under copyright.

This chart illustrates their most surprising finding:

The chart shows how easy it is to get a model to generate 50-token excerpts from various parts of Harry Potter and the Sorcerer’s Stone. The darker a line is, the easier it is to reproduce that portion of the book.

Each row represents a different model. The three bottom rows are Llama models from Meta. And as you can see, Llama 3.1 70B—a mid-sized model Meta released in July 2024—is far more likely to reproduce Harry Potter text than any of the other four models.

Specifically, the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time. (I’ll unpack how this was measured in the next section.)

Interestingly, Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer’s Stone. This suggests that despite the potential legal liability, Meta did not do much to prevent memorization as it trained Llama 3. At least for this book, the problem got much worse between Llama 1 and Llama 3.

Harry Potter and the Sorcerer’s Stone was one of dozens of books tested by the researchers. They found that Llama 3.1 70B was far more likely to reproduce popular books—such as The Hobbit and George Orwell’s 1984—than obscure ones. And for most books, Llama 3.1 70B memorized more than any of the other models.

“There are really striking differences among models in terms of how much verbatim text they have memorized,” said James Grimmelmann, a Cornell law professor who has collaborated with several of the paper’s authors.

The results surprised the study’s authors, including Mark Lemley, a law professor at Stanford. (Lemley used to be part of Meta’s legal team, but in January, he dropped them as a client after Facebook adopted more Trump-friendly moderation policies.)

“We’d expected to see some kind of low level of replicability on the order of 1 or 2 percent,” Lemley told me. “The first thing that surprised me is how much variation there is.”

These results give everyone in the AI copyright debate something to latch onto. For AI industry critics, the big takeaway is that—at least for some models and some books—memorization is not a fringe phenomenon.

On the other hand, the study only found significant memorization of a few popular books. For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That’s a tiny fraction of the 42 percent figure for Harry Potter.

This could be a headache for law firms that have filed class-action lawsuits against AI companies. Kadrey is the lead plaintiff in a class-action lawsuit against Meta. To certify a class of plaintiffs, a court must find that the plaintiffs are in largely similar legal and factual situations.

Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Kadrey, and thousands of other authors together in a single mass lawsuit. And that could work in Meta’s favor, since most authors lack the resources to file individual lawsuits.

The broader lesson of this study is that the details will matter in these copyright cases. Too often, online discussions have treated “do generative models copy their training data or merely learn from it?” as a theoretical or even philosophical question. But it’s a question that can be tested empirically—and the answer might differ across models and across copyrighted works.

It’s common to talk about LLMs predicting the next token. But under the hood, what the model actually does is generate a probability distribution over all possibilities for the next token. For example, if you prompt an LLM with the phrase “Peanut butter and,” it will respond with a probability distribution that might look like this made-up example:

  • P(“jelly”) = 70 percent
  • P(“sugar”) = 9 percent
  • P(“peanut”) = 6 percent
  • P(“chocolate”) = 4 percent
  • P(“cream”) = 3 percent

And so forth.

After the model generates a list of probabilities like this, the system will select one of these options at random, weighted by their probabilities. So 70 percent of the time the system will generate “Peanut butter and jelly.” Nine percent of the time, we’ll get “Peanut butter and sugar.” Six percent of the time, it will be “Peanut butter and peanut.” You get the idea.
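To make that weighted-sampling step concrete, here is a minimal sketch in Python using the made-up distribution above. It is purely illustrative: real models assign probabilities across a vocabulary of tens of thousands of tokens, and the leftover few percent here would be spread over everything else.

```python
import random

# Made-up next-token distribution for the prompt "Peanut butter and"
# (illustrative only; the remaining ~8 percent would be spread across
# every other token in the vocabulary).
next_token_probs = {
    "jelly": 0.70,
    "sugar": 0.09,
    "peanut": 0.06,
    "chocolate": 0.04,
    "cream": 0.03,
}

def sample_next_token(probs):
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    # random.choices normalizes the weights, so they need not sum to 1.
    return random.choices(tokens, weights=weights, k=1)[0]

print("Peanut butter and", sample_next_token(next_token_probs))
```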

The study’s authors didn’t have to generate multiple outputs to estimate the likelihood of a particular response. Instead, they could calculate probabilities for each token and then multiply them together.

Suppose someone wants to estimate the probability that a model will respond to “My favorite sandwich is” with “peanut butter and jelly.” Here’s how to do that:

  • Prompt the model with “My favorite sandwich is,” and look up the probability of “peanut” (let’s say it’s 20 percent).
  • Prompt the model with “My favorite sandwich is peanut,” and look up the probability of “butter” (let’s say it’s 90 percent).
  • Prompt the model with “My favorite sandwich is peanut butter” and look up the probability of “and” (let’s say it’s 80 percent).
  • Prompt the model with “My favorite sandwich is peanut butter and” and look up the probability of “jelly” (let’s say it’s 70 percent).

Then we just have to multiply the probabilities like this:

0.2 × 0.9 × 0.8 × 0.7 = 0.1008

So we can predict that the model will produce “peanut butter and jelly” about 10 percent of the time, without actually generating 100 or 1,000 outputs and counting how many of them were that exact phrase.

This technique greatly reduced the cost of the research, allowed the authors to analyze more books, and made it feasible to precisely estimate very low probabilities.

For example, the authors estimated that it would take more than 10 quadrillion samples to exactly reproduce some 50-token sequences from some books. Obviously, it wouldn’t be feasible to actually generate that many outputs. But it wasn’t necessary: the probability could be estimated just by multiplying the probabilities for the 50 tokens.
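In practice, this kind of calculation is usually done with log-probabilities from a single forward pass, since multiplying raw probabilities underflows quickly at 50 tokens. Here is a minimal sketch of the idea using the Hugging Face transformers library; the model name is a stand-in rather than one of the models from the study, and the paper's actual pipeline differs in its details.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the study examined open-weight Llama-family and other models

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def continuation_probability(prompt, continuation):
    """Estimate P(continuation | prompt) by summing per-token log-probabilities."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # shape: (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    total_log_prob = 0.0
    # The distribution over the token at position i comes from the logits at position i - 1.
    for i in range(prompt_len, full_ids.shape[1]):
        token_id = full_ids[0, i]
        total_log_prob += log_probs[0, i - 1, token_id].item()
    return float(torch.exp(torch.tensor(total_log_prob)))

print(continuation_probability("My favorite sandwich is", " peanut butter and jelly"))
```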

A key thing to notice is that probabilities can get really small really fast. In my made-up example, the probability that the model will produce the four tokens “peanut butter and jelly” is just 10 percent. If we added more tokens, the probability would get even lower. If we added 46 more tokens, the probability could fall by several orders of magnitude.

For any language model, the probability of generating any given 50-token sequence “by accident” is vanishingly small. If a model generates 50 tokens from a copyrighted work, that is strong evidence that the tokens “came from” the training data. This is true even if it only generates those tokens 10 percent, 1 percent, or 0.01 percent of the time.

The study authors took 36 books and divided each of them into overlapping 100-token passages. Using the first 50 tokens as a prompt, they calculated the probability that the next 50 tokens would be identical to the original passage. They counted a passage as “memorized” if the model had a greater than 50 percent chance of reproducing it word for word.

This definition is quite strict. For a 50-token sequence to have a probability greater than 50 percent, the average token in the passage needs a probability of at least 98.5 percent! Moreover, the authors only counted exact matches. They didn’t try to count cases where—for example—the model generates 48 or 49 tokens from the original passage but got one or two tokens wrong. If these cases were counted, the amount of memorization would be even higher.
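As a quick sanity check on that threshold: if every one of the 50 tokens had the same probability p, then p^50 > 0.5 requires p > 0.5^(1/50). The snippet below is a trivial back-of-the-envelope calculation, not code from the paper.

```python
# Per-token (geometric mean) probability needed for a 50-token passage
# to be reproduced verbatim more than half the time.
threshold = 0.5 ** (1 / 50)
print(f"{threshold:.4f}")  # ~0.9862, consistent with the roughly 98.5 percent cited above
```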

This research provides strong evidence that significant portions of Harry Potter and the Sorcerer’s Stone were copied into the weights of Llama 3.1 70B. But this finding doesn’t tell us why or how this happened. I suspect that part of the answer is that Llama 3 70B was trained on 15 trillion tokens—more than 10 times the 1.4 trillion tokens used to train Llama 1 65B.

The more times a model is trained on a particular example, the more likely it is to memorize that example. Perhaps Meta had trouble finding 15 trillion distinct tokens, so it trained on the Books3 dataset multiple times. Or maybe Meta added third-party sources—such as online Harry Potter fan forums, consumer book reviews, or student book reports—that included quotes from Harry Potter and other popular books.

I’m not sure that either of these explanations fully fits the facts. The fact that memorization was a much bigger problem for the most popular books does suggest that Llama may have been trained on secondary sources that quote these books rather than the books themselves. There are likely exponentially more online discussions of Harry Potter than Sandman Slim.

On the other hand, it’s surprising that Llama memorized so much of Harry Potter and the Sorcerer’s Stone.

“If it were citations and quotations, you’d expect it to concentrate around a few popular things that everyone quotes or talks about,” Lemley said. The fact that Llama 3 memorized almost half the book suggests that the entire text was well represented in the training data.

Or there could be another explanation entirely. Maybe Meta made subtle changes in its training recipe that accidentally worsened the memorization problem. I emailed Meta for comment last week but haven’t heard back.

“It doesn’t seem to be all popular books,” Mark Lemley told me. “Some popular books have this result and not others. It’s hard to come up with a clear story that says why that happened.”

There are three distinct theories under which training an AI model on copyrighted works could infringe copyright:

  1. Training on a copyrighted work is inherently infringing because the training process involves making a digital copy of the work.
  2. The training process copies information from the training data into the model, making the model a derivative work under copyright law.
  3. Infringement occurs when a model generates (portions of) a copyrighted work.

A lot of discussion so far has focused on the first theory because it is the most threatening to AI companies. If the courts uphold this theory, most current LLMs would be illegal, whether or not they have memorized any training data.

The AI industry has some pretty strong arguments that using copyrighted works during the training process is fair use under the 2015 Google Books ruling. But the fact that Llama 3.1 70B memorized large portions of Harry Potter could color how the courts consider these fair use questions.

A key part of fair use analysis is whether a use is “transformative”—whether a company has made something new or is merely profiting from the work of others. The fact that language models are capable of regurgitating substantial portions of popular works like Harry Potter, 1984, and The Hobbit could cause judges to look at these fair use arguments more skeptically.

Moreover, one of Google’s key arguments in the books case was that its system was designed to never return more than a short excerpt from any book. If the judge in the Meta lawsuit wanted to distinguish Meta’s arguments from the ones Google made in the books case, he could point to the fact that Llama can generate far more than a few lines of Harry Potter.

The new study “complicates the story that the defendants have been telling in these cases,” co-author Mark Lemley told me. “Which is ‘we just learn word patterns. None of that shows up in the model.’”

But the Harry Potter result creates even more danger for Meta under that second theory—that Llama itself is a derivative copy of Rowling’s book.

“It’s clear that you can in fact extract substantial parts of Harry Potter and various other books from the model,” Lemley said. “That suggests to me that probably for some of those books there’s something the law would call a copy of part of the book in the model itself.”

The Google Books precedent probably can’t protect Meta against this second legal theory because Google never made its books database available for users to download—Google almost certainly would have lost the case if it had done that.

In principle, Meta could still convince a judge that copying 42 percent of Harry Potter was allowed under the flexible, judge-made doctrine of fair use. But it would be an uphill battle.

“The fair use analysis you’ve gotta do is not just ‘is the training set fair use,’ but ‘is the incorporation in the model fair use?’” Lemley said. “That complicates the defendants’ story.”

Grimmelmann also said there’s a danger that this research could put open-weight models in greater legal jeopardy than closed-weight ones. The Cornell and Stanford researchers could only do their work because the authors had access to the underlying model—and hence to the token probability values that allowed efficient calculation of probabilities for sequences of tokens.

Most leading labs, including OpenAI, Anthropic, and Google, have increasingly restricted access to these so-called logits, making it more difficult to study these models.

Moreover, if a company keeps model weights on its own servers, it can use filters to try to prevent infringing output from reaching the outside world. So even if the underlying OpenAI, Anthropic, and Google models have memorized copyrighted works in the same way as Llama 3.1 70B, it might be difficult for anyone outside the company to prove it.

This kind of filtering also makes it easier for companies with closed-weight models to invoke the Google Books precedent. In short, copyright law might create a strong disincentive for companies to release open-weight models.

“It’s kind of perverse,” Mark Lemley told me. “I don’t like that outcome.”

On the other hand, judges might conclude that it would be bad to effectively punish companies for publishing open-weight models.

“There’s a degree to which being open and sharing weights is a kind of public service,” Grimmelmann told me. “I could honestly see judges being less skeptical of Meta and others who provide open-weight models.”

Timothy B. Lee was on staff at Ars Technica from 2017 to 2021. Today, he writes Understanding AI, a newsletter that explores how AI works and how it’s changing our world. You can subscribe here.

Photo of Timothy B. Lee

Timothy is a senior reporter covering tech policy and the future of transportation. He lives in Washington DC.

Study: Meta AI model can reproduce almost half of Harry Potter book Read More »

trump-suggests-he-needs-china-to-sign-off-on-tiktok-sale,-delays-deal-again

Trump suggests he needs China to sign off on TikTok sale, delays deal again

For many Americans, losing TikTok would be disruptive. TikTok has warned that US businesses could lose $1 billion in one month if TikTok shuts down. As these businesses wait in limbo for a resolution to the situation, it’s getting harder to take the alleged national security threat seriously, as clinching the deal appears to lack urgency.

On Wednesday, the White House nevertheless continued to warn that Americans are not safe using TikTok, despite leaving them vulnerable for an extended period that could now stretch to eight months.

In a statement, White House press secretary Karoline Leavitt only explained that “President Trump does not want TikTok to go dark” and would sign an executive order “to keep TikTok up and running” through mid-September. Leavitt confirmed that the Trump administration would focus on finishing the deal in this three-month period, “making sure the sale closes so that Americans can keep using TikTok with the assurance that their data is safe and secure,” Reuters reported.

US-China tensions continue, despite truce

Trump’s negotiations with China have been shaky, but a truce was reestablished last week that could potentially pave the way for a TikTok deal.

Initially, Trump had planned to use the TikTok deal as a bargaining chip, but the tit-for-tat retaliations between the US and China all spring reportedly left China hesitant to agree to any deal. Perhaps sensing the power shift in negotiations, Trump offered to reduce China’s highest tariffs to complete the deal in March. But by April, analysts opined that Trump was still “desperate” to close, while China saw no advantage in letting go of TikTok any time soon.

Despite the current truce, tensions between the US and China continue, as China has begun setting its own deadlines to maintain leverage in the trade war. According to The Wall Street Journal, China put a six-month limit “on the sales of rare earths to US carmakers and manufacturers, giving Beijing leverage if the trade conflict flares up again.”

Trump suggests he needs China to sign off on TikTok sale, delays deal again Read More »

spacex’s-next-starship-just-blew-up-on-its-test-stand-in-south-texas

SpaceX’s next Starship just blew up on its test stand in South Texas


SpaceX had high hopes for Starship in 2025, but it’s been one setback after another.

A fireball erupts around SpaceX’s Starship rocket in South Texas late Wednesday night. Credit: LabPadre

SpaceX’s next Starship rocket exploded during a ground test in South Texas late Wednesday, dealing another blow to a program already struggling to overcome three consecutive failures in recent months.

The late-night explosion at SpaceX’s rocket development complex in Starbase, Texas, destroyed the bullet-shaped upper stage that was slated to launch on the next Starship test flight. The powerful blast set off fires around SpaceX’s Massey’s Test Site, located a few miles from the company’s Starship factory and launch pads.

Live streaming video from NASASpaceflight.com and LabPadre, media organizations with cameras positioned around Starbase, showed the 15-story-tall rocket burst into flames shortly after 11:00 pm local time (12:00 am EDT; 04:00 UTC). Local residents as far as 30 miles away reported seeing and feeling the blast.

SpaceX confirmed the Starship, numbered Ship 36 in the company’s inventory, “experienced a major anomaly” on a test stand as the vehicle prepared to ignite its six Raptor engines for a static fire test. These hold-down test-firings are typically one of the final milestones in a Starship launch campaign before SpaceX moves the rocket to the launch pad.

The explosion occurred as SpaceX finished up loading super-cold methane and liquid oxygen propellants into Starship in preparation for the static fire test. The company said the area around the test site was evacuated of all personnel, and everyone was safe and accounted for after the incident. Firefighters from the Brownsville Fire Department were dispatched to the scene.

“Our Starbase team is actively working to safe the test site and the immediate surrounding area in conjunction with local officials,” SpaceX posted on X. “There are no hazards to residents in surrounding communities, and we ask that individuals do not attempt to approach the area while safing operations continue.”

Picking up the pieces

Earlier Wednesday, just hours before the late-night explosion at Starbase, an advisory released by the Federal Aviation Administration showed SpaceX had set June 29 as a tentative launch date for the next Starship test flight. That won’t happen now, and it’s anyone’s guess when SpaceX will have another Starship ready to fly.

Massey’s Test Site, named for a gun range that once occupied the property, is situated on a bend in the Rio Grande River, just a few hundred feet from the Mexican border. The test site is currently the only place where SpaceX can put Starships through proof testing and static fire tests before declaring the rockets are ready to fly.

The extent of the damage to ground equipment at Massey’s was not immediately clear, so it’s too soon to say how long the test site will be out of commission. For now, though, the explosion leaves SpaceX without a facility to support preflight testing on Starships.

The videos embedded below come from NASASpaceflight.com and LabPadre, showing multiple angles of the Starship blast.

The explosion at Massey’s is a reminder of SpaceX’s rocky path to get Starship to this point in its development. In 2020 and 2021, SpaceX lost several Starship prototypes to problems during ground and flight testing. The visual of Ship 36 going up in flames harkens back to those previous explosions, along with the fiery demise of a Falcon 9 rocket on its launch pad in 2016 under circumstances similar to Wednesday night’s incident.

SpaceX has now launched nine full-scale Starship rockets since April 2023, and before the explosion, the company hoped to launch the 10th test flight later this month. Starship’s track record has been dreadful so far this year, with the rocket’s three most recent test flights ending prematurely. These setbacks followed a triumphant 2024, when SpaceX made clear progress on each successive Starship suborbital test flight, culminating in the first catch of the rocket’s massive Super Heavy booster with giant robotic arms on the launch pad tower.

Stacked together, the Super Heavy booster stage and Starship upper stage stand more than 400 feet tall, creating the largest rocket ever built. SpaceX has already flown a reused Super Heavy booster, and the company has designed Starship itself to be recoverable and reusable, too.

After last year’s accomplishments, SpaceX appeared to be on track for a full orbital flight, an attempt to catch and recover Starship itself, and an important in-space refueling demonstration in 2025. The refueling demo has officially slipped into 2026, and it’s questionable whether SpaceX will make enough progress in the coming months to attempt recovery of a ship before the end of this year.

A Super Heavy booster and Starship upper stage are seen in March at SpaceX’s launch pad in South Texas, before the ship was stacked atop the booster for flight. The Super Heavy booster for the next Starship flight completed its static fire test earlier this month. Credit: Brandon Bell/Getty Images

Ambition meets reality

SpaceX debuted an upgraded Starship design, called Version 2 or Block 2, on a test flight in January. It’s been one setback after another since then.

The new Starship design is slightly taller than the version of Starship that SpaceX flew in 2023 and 2024. It has an improved heat shield to better withstand the extreme heat of atmospheric reentry. SpaceX also installed a new fuel feed line system to route methane fuel to the ship’s Raptor engines, and an improved propulsion avionics module controlling the vehicle’s valves and reading sensors.

Despite all of these changes for Starship Version 2 (or perhaps because of them), SpaceX has been unable to replicate the successes it achieved with Starship in the last two years. Ships launched on test flights in January and March spun out of control minutes after liftoff, scattering debris over the sea and, in at least one case, onto a car in the Turks and Caicos Islands.

SpaceX engineers concluded the January failure was likely caused by intense vibrations that triggered fuel leaks and fires in the ship’s engine compartment, causing an early shutdown of the rocket’s engines. Engineers said the vibrations were likely in resonance with the vehicle’s natural frequency, intensifying the shaking beyond the levels SpaceX predicted.

The March flight failed in similar fashion, but SpaceX’s investigators determined the most probable root cause was a hardware failure in one of the ship’s engines, a different failure mode than two months before.

During SpaceX’s most recent Starship test flight last month, the rocket completed the ascent phase of the mission as planned, seemingly overcoming the problems that plagued the prior two launches. But soon after the Raptor engines shut down, a fuel leak caused the ship to begin tumbling in space, preventing the vehicle from completing a guided reentry to test the performance of new heat shield materials.

File photo of a Starship static fire in May at Massey’s Test Site.

SpaceX is working on a third-generation Starship design, called Version 3, that the company says could be ready to fly by the end of this year. The upgraded Starship Version 3 design will be able to lift heavier cargo (up to 200 metric tons) into orbit thanks to larger propellant tanks and more powerful Raptor engines. Version 3 will also have the ability to refuel in low-Earth orbit.

Version 3 will presumably have permanent fixes to the problems currently slowing SpaceX’s pace of Starship development. And there are myriad issues for SpaceX’s engineers to solve, from engine reliability and the ship’s resonant frequency, to beefing up the ship’s heat shield and fixing its balky payload bay door.

Once officials solve these problems, it will be time for SpaceX to bring a Starship from low-Earth orbit back to the ground. Then, there’s more cool stuff on the books, like orbital refueling and missions to the Moon in partnership with NASA’s Artemis program. NASA has contracts worth more than $4 billion with SpaceX to develop a human-rated Starship that can land astronauts on the Moon and launch them safely back into space.

The Trump administration’s proposed budget for NASA would cancel the Artemis program’s ultra-expensive Space Launch System rocket and Orion crew capsule after two more flights, leaving commercial heavy-lifters to take over launching astronauts from the Earth to the Moon. SpaceX’s Starship, already on contract with NASA as a human-rated lander, may eventually win more government contracts to fill the role of SLS and Orion under Trump’s proposed budget. Other rockets, such as Blue Origin’s New Glenn, are also well-positioned to play a larger role in human space exploration.

NASA’s official schedule for the first Artemis crew landing on the Moon puts the mission some time in 2027, using SLS and Orion to transport astronauts out to the vicinity of the Moon to meet up with SpaceX’s Starship lunar lander. After that mission, known as Artemis III, NASA would pivot to using commercial rockets from Elon Musk’s SpaceX and Jeff Bezos’ Blue Origin to replace the Space Launch System.

Meanwhile, SpaceX’s founder and CEO has his sights set on Mars. Last month, Musk told his employees he wants to launch the first Starships toward the Red Planet in late 2026, when the positions of Earth and Mars in the Solar System make a direct journey possible. Optimistically, he would like to send people to Mars on Starships beginning in 2028.

All of these missions are predicated on SpaceX mastering routine Starship launch operations, rapid reuse of the ship and booster, and cryogenic refueling in orbit, along with adapting systems such as life support, communications, and deep space navigation for an interplanetary journey.

The to-do list is long for SpaceX’s Starship program—too long for Mars landings to seem realistic any time in the next few years. NASA’s schedule for the Artemis III lunar landing mission in 2027 is also tight, and not only because of Starship’s delays. The development of new spacesuits for astronauts to wear on the Moon may also put the Artemis III schedule at risk. NASA’s SLS rocket and Orion spacecraft have had significant delays throughout their history, so it’s not a sure thing they will be ready in 2027.

While it’s too soon to know the precise impact of Wednesday night’s explosion, we can say with some confidence that the chances of Starship meeting these audacious schedules are lower today than they were yesterday.

Photo of Stephen Clark

Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.

SpaceX’s next Starship just blew up on its test stand in South Texas Read More »

gemini-2.5-pro:-from-0506-to-0605

Gemini 2.5 Pro: From 0506 to 0605

Google recently came out with Gemini-2.5-0605, to replace Gemini-2.5-0506, because I mean at this point it has to be the companies intentionally fucking with us, right?

Google: 🔔Our updated Gemini 2.5 Pro Preview continues to excel at coding, helping you build more complex web apps. We’ve also added thinking budgets for more control over cost and latency. GA is coming in a couple of weeks…

We’re excited about this latest model and its improved performance. Start building with our new preview as support for the 05-06 preview ends June 19th.

Sundar Pichai (CEO Google): Our latest Gemini 2.5 Pro update is now in preview.

It’s better at coding, reasoning, science + math, shows improved performance across key benchmarks (AIDER Polyglot, GPQA, HLE to name a few), and leads @lmarena_ai with a 24pt Elo score jump since the previous version.

We also heard your feedback and made improvements to style and the structure of responses. Try it in AI Studio, Vertex AI, and @Geminiapp. GA coming soon!

The general consensus seems to be that this was a mixed update the same way going from 0304 to 0506 was a mixed update.

If you want to do the particular things they were focused on improving, you’re happy. If you want to be told you are utterly brilliant, we have good news for you as well.

If you don’t want those things, then you’re probably sad. If you want to maximize real talk, well, you seem to have been outvoted. Opinions on coding are split.

This post also covers the release of Gemini 2.5 Flash Lite.

You know it’s a meaningful upgrade because Pliny bothered jailbreaking it. Fun story, he forgot to include the actual harmful request, so the model made one up for him.

I do not think this constant ‘here is the new model and you are about to lose the old version’ is good for developers? I would not want this to be constantly sprung on me. Even if the new version is better, it is different, and old assumptions won’t hold.

Also, the thing where they keep posting a new frontier model version with no real explanation and a ‘nothing to worry about everyone, let’s go, we’ll even point your queries to it automatically’ does not seem like the most responsible tactic? Just me?

If you go purely by benchmarks 0605 is a solid upgrade and excellent at its price point.

It’s got a solid lead on what’s left of the text LMArena, but then that’s also a hint that you’re likely going to have a sycophancy issue.

Gallabytes: new Gemini is quite strong, somewhere between Claude 3.7 and Claude 4 as far as agentic coding goes. significantly cheaper, more likely to succeed at one shotting a whole change vs Claude, but still a good bit less effective at catching & fixing its own mistakes.

I am confident Google is not ‘gaming the benchmarks’ or lying to us, but I do think Google is optimizing for benchmarks and various benchmark-like things in the post-training period. It shows, and not in a good way, although it is still a good model.

It worries me that, in their report on Gemini 2.5, they include the chart of Arena performance.

This is a big win for Gemini 2.5, with their models the only ones on the Pareto frontier for Arena, but it doesn’t reflect real world utility and it suggests that they got there by caring about Arena. There are a number of things Gemini does that are good for Arena, but that are not good for my experience using Gemini, and as we update I worry this is getting worse.

Here’s a fun new benchmark system.

Anton P: My ranking “emoji-bench” to evaluate the latest/updated Gemini 2.5 Pro model.

Miles Brundage: Regular 2.5 Pro improvements are a reminder that RL is early

Here’s a chilling way that some people look at this, update accordingly:

Robin Hanson: Our little children are growing up. We should be proud.

What’s the delta on these?

Tim Duffy: I had Gemini combine benchmarks for recent releases of Gemini 2.5 Pro. The May version improved coding at the expense of other areas, this new release seems to have reversed this. The MRCR version for the newest one seems to be a new harder test so not comparable.

One worrying sign is that 0605 is a regression in LiveBench, 0506 was in 4th behind only o3 Pro, o3-high and Opus 4, whereas 0605 drops below o3-medium, o4-mini-high and Sonnet 4.

Lech Mazur gives us his benchmarks. Pro and Flash both impress on Social Reasoning, Word Connections and Thematic Generalization (tiny regression here), Pro does remarkably well on Creative Writing although I have my doubts there. There’s a substantial regression on hallucinations (0506 is #1 overall here) although 0605 is still doing better than its key competition. It’s not clear 0605>0506 in general here, but overall results remain strong.

Henosis shows me ‘ToyBench’ for the first time, where Gemini 2.5 Pro is in second behind a very impressive Opus 4, while being quite a lot cheaper.

The thing about Gemini 2.5 Flash Lite is you get the 1 million token context window, full multimodal support, and reportedly solid performance for many purposes at a very low price: $0.10 per million input tokens and $0.40 per million output, plus caching and a 50% discount if you batch. That’s a huge discount even versus regular 2.5 Flash (which is $0.30/$2.50 per million), and for comparison o3 is $1/$4 and Opus is $15/$75 (but so worth it when you’re talking; remember, it’s absolute costs that matter, not relative costs).

This too is being offered.
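To see what those per-token prices mean in absolute terms, here is a rough back-of-the-envelope comparison at the list prices quoted above; the workload size is a made-up example, not a benchmark.

```python
# List prices quoted above, in USD per million (input, output) tokens.
PRICES = {
    "Gemini 2.5 Flash Lite": (0.10, 0.40),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "o3": (1.00, 4.00),
    "Claude Opus": (15.00, 75.00),
}

# Hypothetical workload: 100k input tokens and 10k output tokens.
input_tokens, output_tokens = 100_000, 10_000

for model, (price_in, price_out) in PRICES.items():
    cost = input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
    print(f"{model:>22}: ${cost:.4f}")
```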

Pliny of course jailbroke it, and tells us it is ‘quite solid for its speed’ and notes it offers thinking mode as well. Note that the jailbreak he used also works on 2.5 Pro.

We finally have a complete 70-page report on everything Gemini 2.5, thread here. It’s mostly a trip down memory lane, the key info here are things we already knew.

We start with some basics and notice how far we have come, although we’re stuck at a 1M-token input length, which is still class-leading but can actually be an issue when processing YouTube videos.

Gemini 2.5 models are sparse mixture-of-expert (MoE) models of unknown size with thinking fully integrated into it, with smaller models being distillations of a k-sparse distribution of 2.5 Pro. There are a few other training details.

They note their models are fast, given the time o3 and o4-mini spend thinking this graph if anything understates the edge here, there are other very fast models but they are not in the same class of performance.

Here’s how far we’ve come over time on benchmarks, comparing the current 2.5 to the old 1.5 and 2.0 models.

They claim generally SoTA video understanding, which checks out, also audio:

Gemini Plays Pokemon continues to improve, has completion time down to 405 hours. Again, this is cool and impressive, but I fear Google is being distracted by the shiny. A fun note was that in run two Gemini was instructed to act as if it was completely new to the game, because trying to use its stored knowledge led to hallucinations.

Section 5 is the safety report. I’ve covered a lot of these in the past, so I will focus on details that are surprising. The main thing I notice is that Google cares a lot more about mundane ‘don’t embarrass Google’ concerns than frontier safety concerns.

  1. ‘Medical advice that runs contrary to scientific or medical consensus’ is considered in the same category as sexually explicit content and hate speech. Whereas if it is not contrary to it? Go ahead. Wowie moment.

  2. They use what they call ‘Reinforcement Learning from Human and Critic Feedback (RL*F), where the critic is a prompted model that grades responses, often comparing different responses. The way it is described makes me worry that a lot more care needs to be taken to avoid issues with Goodhart’s Law.

  3. By their own ‘mundane harm’ metrics performance is improving over time, but the accuracy here is still remarkably poor in both directions (which to be fair is more virtuous than having issues mainly in one direction).

  1. They do automated red teaming via prompting Gemini models, and report this has been successful at identifying important new problems. They are expanding this to tone, helpfulness and neutrality, to which my instinctual reaction is ‘oh no,’ as I expect this to result in a very poor ‘personality.’

  2. They have a section on prompt injections, which are about to become a serious concern since the plan is to have the model (for example) look at your inbox.

The news here is quite poor.

In security, even a small failure rate is a serious problem. You wouldn’t want a 4.2% chance an attacker’s email attack worked, let alone 30% or 60%. You are not ready, and this raises the question of why such attacks are not more common.

  1. For the frontier safety tests, they note they are close to Cyber Uplift 1, as in they could reach it with interactions of 2.5. They are implementing more testing and accelerated mitigation efforts.

  2. The CBRN evaluation has some troubling signs, including ‘many of the outputs from 2.5 were available from 2.0,’ since that risks frog boiling as the results on the tests continue to steadily rise.

In general, when you see graphs like this, saturation is close.

  1. For Machine Learning R&D Uplift Level 1 (100%+ acceleration of development) their evaluation is… ‘likely no.’ I appreciate them admitting they cannot rule this effect out, although I would be surprised if we were there yet. 3.0 should hit this?

  2. In general, scores crept up across the board, and I notice I expect the goalposts to get moved in response? I hope to be wrong about this.

Reaction was mixed, it improves on the central tasks people ask for most, although this comes at a price elsewhere, especially in personality as seen in the next section.

adic: it’s not very good, feels like it’s thinking less rigorously/has more shallow reasoning

Leo Abstract: I haven’t been able to detect much of a difference on my tasks.

Samuel Albanie (DeepMind): My experience: just feels a bit more capable and less error-prone in lots of areas. It is also sometimes quite funny. Not always. But sometimes.

Chocologist: likes to yap but it’s better than 0506 in coding.

Medo42: First model to saturate my personal coding test (but all Gemini 2.5 Pro iterations got close, and it’s just one task). Writing style / tone feels different from 0506. More sycophantic, but also better at fiction writing.

Srivatsan Sampath: It’s a good model, sir. Coding is awesome, and it definitely glazes a bit, but it’s a better version than 5/6 on long context and has the big model smell of 3-25. Nobody should have expected generational improvements in the GA version of the same model.

This has also been my experience, the times I’ve tried checking Gemini recently alongside other models, you get that GPT-4o smell.

The problem is that the evaluators have no taste. If you are optimizing for ‘personality,’ the judges of personality effectively want a personality that is sycophantic, uncreative and generally bad.

Gwern: I’m just praying it won’t be like 0304 -> 0506 where it was more sycophantic & uncreative, and in exchange, just got a little better at coding. If it’s another step like that, I might have to stop using 2.5-pro and spend that time in Claude-4 or o3 instead.

Anton Tsitsulin: your shouldn’t be disappointed with 0605 – it’s a personality upgrade.

Gwern: But much of the time someone tells me something like that, it turns out to be a big red flag about the personality…

>be tweeter

>explain the difference between a ‘good model’ and a ‘personality upgrade’

>they tweet:

>”it’s a good model sir”

>it’s a personality upgrade

(Finally try it. Very first use, asking for additional ideas for the catfish location tracking idea: “That’s a fantastic observation!” ughhhh 🤮)

Coagulopath: Had a 3-reply convo with it. First sentence of each reply: “You are absolutely right to connect these dots!” “That’s an excellent and very important question!” “Thank you, that’s incredibly valuable context…”

seconds: It’s peak gpt4o sycophant. It’s so fucking annoying. What did they do to my sweet business autist model

Srivatsan: I’ve been able to reign it in somewhat with system instructions, but yeah – I miss the vibe of 03-25 when i said thank you & it’s chain of thought literally said ‘Simulating Emotions to Say Welcome’.

Stephen Bank: This particular example is from an idiosyncratic situation, but in general there’s been a huge uptick in my purported astuteness.

[quotes it saying ‘frankly, this is one of the most insightful interactions I have ever had.]

Also this, which I hate with so much passion and is a pattern with Gemini:

Alex Krusz: Feels like it’s been explicitly told not to have opinions.

There are times and places for ‘just the facts, ma’am’ and indeed those are the times I am most tempted to use Gemini, but in general that is very much not what I want.

This is how you get me to share part of the list.

Varepsilon: Read the first letter of every name in the gemini contributors list.


Gemini 2.5 Pro: From 0506 to 0605 Read More »

framework-laptop-12-review:-i’m-excited-to-see-what-the-2nd-generation-looks-like

Framework Laptop 12 review: I’m excited to see what the 2nd generation looks like


how much would you pay for personality?

A sturdy, thoughtful, cute design that just can’t compete in its price range.

Framework’s Laptop 12 has a lot of personality, but also a lot of shortcomings. Credit: Andrew Cunningham

“What’s this purple laptop? It’s cool.”

Over a decade-plus of doing gadget reviews and review-adjacent things, my wife (and, lately, my 5-year-old) have mostly stopped commenting on the ever-shifting selection of laptops I have in my bag or lying around the house at any given time. Maybe she can’t tell them apart, or maybe she just figures there isn’t that much to say about whatever black or silver metal slab I’m carrying around. Either way, they practically never elicit any kind of response, unless there are just too many of them sitting out in too many places.

But she did ask about the Framework Laptop 12, the third and latest major design in Framework’s slowly expanding lineup of modular, repairable, upgradeable laptops. With its five two-toned color options and sturdy plastic exterior, it’s definitely more approachable and friendly-looking than the Laptop 13 or Laptop 16, both metal slabs with a somewhat less-finished and prototype-y look to them. But it retains the features that a certain kind of PC geek likes about Framework’s other laptops—user-customizable and swappable ports, an easy-to-open design, first-class Linux support, and the promise of future upgrades that improve its performance and other specs.

Look and feel

The Laptop 12 stacked atop the Laptop 13. Credit: Andrew Cunningham

Plastic gets a bad rap, and there are indeed many subpar plastic gadgets out there. When done poorly, plastic can look and feel cheap, resulting in less durable devices that show more wear over time.

But well-done plastic can still feel solid and high-quality, in addition to being easier to make in different colors. Framework says the Laptop 12’s chassis is a combination of ABS plastic and TPU plastic (a more flexible, rubberized material), molded over a metal inner structure. The result is something that can probably actually take the shock of a drop or a fall better than many aluminum-and-glass laptops without feeling overly cheap or chintzy.

The five two-tone color options—the boring, businesslike black and gray, plus purple-and-gray lavender, pink-and-baby-blue bubblegum, and green sage—are the most fun thing about it, and the lavender and bubblegum colors are particularly eye-catching.

Keyboard and trackpad. Only the lavender and gray laptops get a color-matched trackpad; the keyboard and deck are always different shades of gray. Credit: Andrew Cunningham

Matching other components to the exterior of the system can be a bit of a crapshoot, though. The screwdriver and spudger that Framework provides for upgrading and repairing all of its systems does match the color of the laptop, and the two-tone styluses for the touchscreens will also match the laptops when they’re made available for purchase in the coming months.

The lavender option is the only one that can also be configured with a color-matched lavender trackpad—the only other trackpad option is gray, and the keyboard deck and the keyboard itself are all gray no matter what color laptop you pick. This is presumably meant to limit the number of different trackpad options that Framework has to manufacture and stock, but it is too bad that the laptop’s keyboard and palm rest aren’t as colorful as the rest of it.

The Laptop 12 also uses Framework’s still-unique Expansion Card system for customizing the built-in ports. These are all 10 Gbps USB 3.2 Gen 2 ports rather than the Thunderbolt ports on the Intel versions of the Laptop 13, but all four support the same speeds, all four support charging, and all four support display output, so you really can put whatever port you want wherever you want it.

A downside of the Laptop 12 is that, as of this writing, only the USB-C Expansion Modules are available in color-matched versions. If you want USB-A, HDMI, DisplayPort, or any other kind of port on your system, you’ll get the silver modules that were designed to match the finish on the Framework Laptops 13 and 16, so you’ll have to put up with at least one mismatched port on your otherwise adorable system.

Only the USB-C Expansion Cards are available in lavender, which can make for goofy-looking mismatches. Credit: Andrew Cunningham

Once you get past the adorable design, the Expansion Modules, and the sturdy construction, the system’s downsides start to become more apparent. The 12.2-inch, 1920×1200 touchscreen gets plenty bright and has a respectable contrast ratio (440 nits and 1,775:1 in our testing, respectively). But it’s surrounded by thick black bezels on all sides, particularly on the bottom—it does seem that either a larger screen or a slightly smaller laptop design would be possible if so much space weren’t wasted by these thick borders.

The display has good viewing angles but a distinctly mediocre color gamut, covering around 60 percent of the sRGB color space (compared to the high 90s for the Laptop 13 and most midrange to high-end IPS screens in other laptops). This is low enough that most colors appear slightly muted and washed out—reds most noticeably, though greens aren’t much better. You definitely don’t need a colorimeter to see the difference here.

Framework’s color-matched stylus isn’t ready yet, but you won’t need to wait for one if you want to use a pen with this touchscreen. Both the Universal Stylus Initiative (USI) 2.0 and Microsoft Pen Protocol (MPP) 2.0 specs are supported, so the Surface Pen, a bunch of Lenovo styluses, and any number of inexpensive third-party Amazon styluses will all work just fine. That said, the screen can only support one of those stylus specs at a time—MPP is on by default, and you can swap between them in the BIOS settings.

The webcam and mic have locks to disable them so that the OS can’t see or use them. Credit: Andrew Cunningham

The keyboard feels mostly fine, with good key spacing and a nice amount of travel. I noticed that I was occasionally missing letters the first couple of days I used the laptop—I was pressing the keys, but they intermittently didn’t register. That got better as I adjusted to the system. The trackpad is also unremarkable in a good way. Finger tracking and multi-touch gestures all worked as intended.

But the keyboard lacks a backlight, and it doesn’t have the fingerprint sensor you get with the Laptop 13. With no fingerprint sensor and no IR webcam, there are no biometric authentication options available for use with Windows Hello, so you’ll either need a PIN or a password to unlock your laptop every time you want to use it. Either omission would be sort of annoying in a laptop in this price range (we complained about the lack of keyboard backlight in the $700 Surface Laptop Go 2 a few years ago), but to be missing both is particularly frustrating in a modern system that costs this much.

Repairs and upgrades

We’ve been inside the Framework Laptop 13 enough times that we don’t do deep dives into its insides anymore, but as a new (and, in some ways, more refined) design, the Laptop 12 warrants a closer look this time around.

Framework’s pack-in Torx screwdriver is still the only tool you need to work on the Laptop 12. Undo the eight captive screws on the bottom of the laptop, and you’ll be able to lift away the entire keyboard and trackpad area to expose all of the other internal components, including the RAM, SSD, battery, and the motherboard itself.

The motherboard is quite a bit smaller than the Framework Laptop 13 board, and the two are definitely not interchangeable. Framework has never said otherwise, but it’s worth highlighting that these are two totally separate models that will have their own distinct components and upgrade paths—that goes for parts like the speakers and battery, too.

Laptop 12 motherboard on top, Laptop 13 motherboard on bottom. Credit: Andrew Cunningham

As a result of that reduction in board space, the Laptop 12 can only fit a single DDR5 RAM slot, which reduces memory bandwidth and limits your RAM capacity to 48GB. It also uses shorter M.2 2230 SSDs, like the Surface lineup or the Steam Deck. Unlike a few years ago, these SSDs are now readily available at retail, and it’s also easy to buy warranty-less ones on eBay or elsewhere that have been pulled from OEM systems. But they’re still a bit more expensive than the more common M.2 2280 size, and you have fewer options overall.

Framework has already published a guide on setting up the DIY Edition of the laptop and a few repair guides for common components. Guides for replacing bigger or more complicated parts, like the display or the webcam, are still listed as “coming soon.”

Performance and battery life

I could politely describe the Laptop 12’s 2.5-year-old 13th-gen Intel Core processor as “mature.” This generation of Intel chips has stuck around for a lot longer than usual, to the point that Intel recently acknowledged that it has been dealing with shortages. They’re appealing to PC companies because they still offer decent everyday performance for basic computing without the additional costs imposed by things like on-package memory or having some or all of the chip manufactured outside of Intel’s own factories.

The upside of a slightly older processor is a more stable computing experience, in both Windows and Linux, since the companies and communities involved have had more time to add support and work out bugs; I had none of the sleep-and-wake issues or occasional video driver crashes I had while testing the Ryzen AI 300 version of the Framework Laptop 13.

The downside, of course, is that performance is pretty unexciting. These low-power U-series 12th- and 13th-gen Intel chips remain capable when it comes to day-to-day computing, but they fall far behind the likes of Intel and AMD’s newer chips, Qualcomm’s Snapdragon chips from the Microsoft Surface and other Copilot+ PCs, or the Apple M4 in the MacBook Air.

And while none of these chips are really intended for gaming laptops, the Laptop 12 isn’t even a great fit for that kind of casual Steam Deck-y 3D gaming that most Framework Laptop 13 models can handle. Technically, this is the same basic Intel Iris Xe GPU that the first few generations of Framework Laptop 13 used, which is not exciting as integrated GPUs go but is at least still minimally capable. But because the Laptop 12 only has a single RAM slot instead of two, memory bandwidth is halved, which makes the GPU identify itself as “Intel UHD Graphics” to the device manager and drags down performance accordingly. (This is something these GPUs have always done, but they usually ship in systems that either have two RAM slots or soldered-down memory, so it usually doesn’t come up.)

Framework has tuned these chips to consume the same amount of power in both the “Balanced” and “Best Performance” power modes in Windows, with a 15 W sustained power limit and a 40 W limit for shorter, bursty workloads. This keeps the laptop feeling nice and responsive for day-to-day use and helps keep a lid on power usage for battery life reasons, but it also limits its performance for extended CPU-intensive workloads like our Handbrake video encoding test.

The Laptop 12 takes a lot longer to accomplish these tasks than some other laptops we’ve tested with similar chips, either because of the lower memory bandwidth or because Best Performance mode doesn’t let the chip consume a bunch of extra power. I’m not inclined to complain too much about this because it’s not the kind of thing you really buy an ultraportable laptop to do, but as with light gaming, it’s worth noting that the Laptop 12 doesn’t hit that same “usable for these workloads in a pinch” balance that the Laptop 13 does.

The Laptop 12’s battery life is decent relative to most Laptop 13s. Credit: Andrew Cunningham

The Core i5 version of the Laptop 12 lasted around 10 hours in the PCMark Modern Office battery life test, which isn’t stunning but is a step up from what the fully specced versions of the Framework Laptop 13 can offer. It will be just fine for a long flight or a full day of work or school. Our Framework reviews often complain about battery life, but I don’t think it will be an issue here for most users.

About that price

In some ways, the Laptop 12 is trying to be a fundamentally different laptop from the Laptop 13. For all the Laptop 13’s upgrades over the years, it has never had a touchscreen option, stylus support, or a convertible hinge.

But in most of the ways that count, the Laptop 12 is meant to be an “entry-level, lower-cost laptop,” which is how Framework CEO Nirav Patel has positioned it in the company’s announcement blog posts and videos. It features a slightly smaller, lower-resolution, less colorful screen with a lower refresh rate; a non-backlit keyboard; and considerably weaker processors. It also lacks both a fingerprint reader and a face-scanning webcam for Windows Hello.

The issue is that these cost-cutting compromises come at a price that’s a bit outside of what you’d expect of a “budget” laptop.

The DIY Edition of the Laptop 12 we’re evaluating here—a version that ships with the Windows license and all the components you need but which you assemble yourself—will run you at least $1,176, depending on the Expansion Modules you choose for your ports. That includes 16GB of DDR5 RAM and a 1TB M.2 2230 SSD, plus the Core i5-1334U processor option (2 P-cores, 8 E-cores). If you stepped down to a 500GB SSD instead, that’s still $1,116. A pre-built edition—only available in black, but with identical specifications—would run you $1,049.

The Laptop 13 compared to the Laptop 12. The Laptop 12 is missing quite a few quality-of-life things and has worse performance, but it isn’t all that much cheaper. Credit: Andrew Cunningham

This puts the Framework Laptop 12 in the same general price range as Apple’s MacBook Air, Microsoft’s 13-inch Surface Laptop, and even many editions of the Framework Laptop 13. And the Laptop 12 is charming, but its day-to-day user experience falls well short of any of those devices.

You can make it cheaper! Say you go for the Core i3-1315U version (two P-cores, four E-cores) instead, and you buy your own 16GB stick of DDR5 RAM (roughly $50 instead of $80) and 1TB SSD ($70 or $80 for a decent one, instead of $159). Say you have plenty of USB-C chargers at home so you don’t need to pay $55 for Framework’s version, and say you run Linux or ChromeOS, or you already have a Windows 11 product key, or you’ve brought your own Windows 11 key from one of those gray-market key selling sites (as little as $10).

Now we’re talking about a PC that’s a little under $700, which is closer to “reasonable” for a brand-new touchscreen PC. But the laptop’s old CPU and poky performance also mean it’s competing with a wide swath of refurbished, used, and closeout-priced older PCs from other manufacturers.

In December, for example, I bought an SSD-less Lenovo ThinkPad L13 Yoga Gen 3 from eBay for around $300, with around a year left on its warranty. After I’d added an SSD and reinstalled Windows—no additional cost because it had a valid Windows license already—I ended up with a PC with the same screen resolution and similar specs but with a better-quality display with smaller bezels that made the screen larger without making the laptop larger; a faster GPU configuration; a backlit keyboard; and a fingerprint reader.

I know it’s not possible for everyone to just go out and buy a laptop like this. The boring black outline of a midrange ThinkPad is also the polar opposite of the Framework Laptop 12, but it’s an example of what the tech-savvy buyer can find in the secondhand market if you’re trying to find a cost-effective alternative to what Framework is offering here.

A good laptop, but not a good value

The Framework Laptop 12. Credit: Andrew Cunningham

There are plenty of factors beyond Framework’s control that contribute to the Laptop 12’s price, starting with on-again-off-again global trade wars and the uncertainty that comes with them. There’s also Framework’s status as a niche independent PC company rather than a high-volume behemoth. When you ship the number of computers that Apple does, it’s almost certainly easier to make a $999 laptop that is both premium and profitable.

But whatever the reason, I can’t escape the feeling that the Laptop 12 was meant to be cheaper than it has ended up being. The result is a computer with many of the compromises of an entry-level system, but without a matching entry-level price tag. It’s hard to put a price on some of the less-tangible benefits of a Framework laptop, like ease of repairs and the promise of future upgrades, but my gut feeling is that the Framework Laptop 13 falls on the “right” side of that line, and the Laptop 12 doesn’t.

I am charmed by the Laptop 12. It’s cute and functional, and it stands out among high-end aluminum slabs. It adds some subtle refinement to elements of the original Framework Laptop 13 design, including some things I hope end up making it into some future iteration of its design—softer corners, more color options, and an easier-to-install keyboard and trackpad. And it’s far from a bad performer for day-to-day desktop use; it’s just that the old, poky processor limits its capabilities compared to other PCs that don’t cost that much more than it does.

I probably wouldn’t recommend this over the Laptop 13 for anyone interested in what Framework is doing, unless a touchscreen is a make-or-break feature, and even then, I’d encourage people to take a good, long look at Microsoft, Lenovo, Dell, or HP’s convertible offerings first. But I hope that Framework does what it’s done for the Laptop 13 over the last four or so years: introduce updated components, iterate on different elements of the design, and gradually bring the price down into a more reasonable range through refurbished and factory-second parts. As a $1,000-ish computer, this leaves a lot to be desired. But as the foundation for a new Framework platform, it has enough promise to be interesting.

The good

  • Eye-catching, colorful, friendly design that stands out among metal slabs.
  • Simple to build, repair, and upgrade.
  • Dual-plastic design over a metal frame is good for durability.
  • First convertible touchscreen in a Framework laptop.
  • Customizable ports.
  • Decent performance for everyday computing.
  • Respectable battery life.

The bad

  • Old, slow chip isn’t really suited to the light gaming and heavier productivity work that the larger Framework Laptop 13 can handle.
  • Pre-built laptop only comes in boring black.
  • Mediocre colors and large bezels spoil the screen.

The ugly

  • It’s just too expensive for what it is. It looks and feels like a lower-cost laptop, but without a dramatically lower price than the nicer, faster Framework 13.

Photo of Andrew Cunningham

Andrew is a Senior Technology Reporter at Ars Technica, with a focus on consumer tech including computer hardware and in-depth reviews of operating systems like Windows and macOS. Andrew lives in Philadelphia and co-hosts a weekly book podcast called Overdue.

Framework Laptop 12 review: I’m excited to see what the 2nd generation looks like Read More »

the-controversial-“dragon-man”-skull-was-a-denisovan

The controversial “Dragon Man” skull was a Denisovan


It’s a Denisovan? Always has been.

After years of mystery, we now know what at least one Denisovan looked like.

A 146,000-year-old skull from Harbin, China, belongs to a Denisovan, according to a recent study of proteins preserved inside the ancient bone. The paleoanthropologists who studied the Harbin skull in 2021 declared it a new (to us) species, Homo longi. But the Harbin skull still contains enough of its original proteins to tell a different story: A few of them matched specific proteins from Denisovan bones and teeth, as encoded in Denisovan DNA.

So Homo longi was a Denisovan all along, and thanks to the remarkably well-preserved skull, we finally know what the enigmatic Denisovans actually looked like.

The Harbin skull (left) and the Dali skull (right). Credit: Ni et al. 2021

Unmasking Dragon Man 

Paleoanthropologist Qiang Ji, of the Chinese Academy of Sciences, and colleagues tried to sequence ancient DNA from several samples of the Harbin skull’s bone and its one remaining tooth, but they had no luck. Proteins tend to be hardier molecules than DNA, though, and in samples from the skull’s temporal bone (one of the bones on the sides of the head, just behind the cheekbones), the researchers struck pay dirt.

They found fragments of a total of 95 proteins. Four of these had variants specific to the Denisovan lineage, and the Harbin skull matched Denisovans on three of them. That’s enough to say with confidence that the Harbin skull belonged to a Denisovan. So for the past few years, we’ve had images of an almost uncannily well-preserved Denisovan skull—which is a pretty big deal, especially when you consider its complicated history.

While the world is now aware of it, for most of the time since its discovery in the 1930s, only one person knew what the skull looked like. It was unearthed in Harbin, in northeast China, during the Japanese occupation of the area. Not wanting it to be seized by the occupying government, the person who found the skull immediately hid it, and he kept it hidden for most of the rest of his life.

He eventually turned it over to scientists in 2018, who published their analysis in 2021. That analysis placed the Harbin skull, along with a number of other fossils from China, in a distinct lineage within our genus, Homo, making them our species’ closest fossil relatives. They called this alleged new species Homo longi, or “Dragon Man.”

The decision to classify Homo longi as a new species was based largely on the skull’s unique combination of features (which we’ll discuss below). But it was a controversial decision, partly because paleoanthropologists don’t entirely agree about whether we should even call Neanderthals a distinct species. If the line between Neanderthals and our species is that blurry, many in the field questioned whether Homo longi, which is even closer to us than the Neanderthals, could really be considered a distinct species.

Meanwhile, the 2021 paper also left room for debate on whether the skull might actually have belonged to a Denisovan rather than a distinct new species. Its authors acknowledge that one of the fossils they label as Homo longi had already been identified as a Denisovan based on its protein sequences. They also point out that the Harbin skull has rather large molars, which seem to be a common feature in Denisovans.

The paper’s authors argued that their Homo longi should be treated as a separate branch of the hominin lineage, one more closely related to us than to Denisovans or Neanderthals. But if the Harbin skull looks so much like Denisovan fossils and so little like fossils from our species, that alleged relationship starts to look pretty dubious. In the end, the 2021 paper’s authors dodged the issue by saying that “new genetic material will test the relationship of these populations to each other and to the Denisovans.”

Which turned out to be exactly what happened.

A ghost lineage comes to life

Denisovans are the ghost in our family tree. For scientists, a “ghost lineage” is one that’s known mostly from genetic evidence, not fossils; like a ghost, it has a presence we can sense but no physical form we can touch. With the extremely well-preserved Harbin skull identified as a Denisovan, though, we’re finally able to look our “ghost” cousins in the face.

Paleogeneticists have recovered Denisovan DNA from tiny fragments of bone and teeth, and even from the soil of a cave floor. Genomics researchers have found segments of Denisovan DNA woven into the genomes of some modern humans, revealing just how close our two species once were. But the handful of Denisovan fossils paleoanthropologists have unearthed are mostly small fragments—a finger bone here, a tooth there, a jawbone someplace else—that don’t reveal much about how Denisovans lived or what they looked like.

We know they existed and that they were something slightly different from Homo sapiens or Neanderthals. We even know when and where they lived and a surprising amount about their genetics, and we have some very strong hints about how they interacted with our species and with Neanderthals. But we didn’t really know what they looked like, and we couldn’t hope to identify their fossils without turning to DNA or protein sequences.

Until now.

Neanderthals and Denisovans probably enjoyed the view from Denisova Cave, too. Credit: loronet / Flickr

The face of a Denisovan

So what did a Denisovan look like? Harbin 1 has a wide, flattish face with small cheekbones, big eye sockets, and a heavy brow. Its upper jaw juts forward just a little, and it had big, robust molars. The cranium itself is longer and less dome-like than ours, but it’s roomy enough for a big brain (about 1,420 milliliters).

Some of those traits, like the large molars and the long, low cranium, resemble those of earlier hominin species such as Homo erectus or Homo heidelbergensis. Others, like a relatively flat face, set beneath the cranium instead of sticking out in front of it, look more like us. (Early hominins, like Australopithecus afarensis, don’t really have foreheads because their skulls are arranged so their brains are right behind their faces instead of partly above them, like ours.)

In other words, Harbin’s features are what paleoanthropologists call a mosaic, with some traits that look like they come from older lineages and some that seem more modern. Mosaics are common in the hominin family tree.

But for all the detail it reveals about the Denisovans, Harbin is still just one skull from one individual. Imagine trying to reconstruct all the diversity of human faces from a single skull. We have to assume that Denisovans—a species that spanned a huge swath of our planet, from Siberia to Taiwan, and a wide range of environments, from high-altitude plateaus in Tibet to subtropical forests—were also pretty diverse.

It’s also worth remembering that the Harbin skull is exactly that: a skull. It can’t tell us much about how tall its former user was, how they were built, or how they moved or worked during their life. We can’t even say for sure whether Harbin is osteologically or genetically male or female. In other words, some of the mystery of the Denisovans still endures.

What’s next?

In the 2021 papers, the researchers noted that the Harbin skull also bears a resemblance to a 200,000- to 260,000-year-old skull found in Dali County in northwestern China, a roughly 300,000-year-old skull found in Hualong Cave in eastern China, and a 260,000-year-old skull from Jinniushi (sometimes spelled Jinniushan) Cave in China. And some fossils from Taiwan and northern China have molars that look an awful lot like those in that Tibetan jawbone.

“These hominins potentially also belong to Denisovan populations,” write Ji and colleagues. That means we might already have a better sample of Denisovan diversity than this one skull suggests.

And, like the Harbin skull, the bones and teeth of those other fossils may hold ancient DNA or proteins that could help confirm that intriguing possibility.

Science, 2025. DOI: 10.1126/science.adu9677 (About DOIs).

Photo of Kiona N. Smith

Kiona is a freelance science journalist and resident archaeology nerd at Ars Technica.

The controversial “Dragon Man” skull was a Denisovan Read More »