Since these two toxicities work through entirely different mechanisms, the researchers tackled them separately.
Blocking a neurotoxin
The neurotoxic three-fingered proteins are a subgroup of the larger protein family that specializes in binding to and blocking the receptors for acetylcholine, a major neurotransmitter. Their three-dimensional structure, which is key to their ability to bind these receptors, is based on three strings of amino acids within the protein that nestle against each other (for those that have taken a sufficiently advanced biology class, these are anti-parallel beta sheets). So to interfere with these toxins, the researchers targeted these strings.
They relied on an AI package called RFdiffusion (the RF denotes its relation to the Rosetta Fold protein-folding software). RFdiffusion can be directed to design protein structures that are complements to specific chemicals; in this case, it identified new strands that could line up along the edge of the ones in the three-fingered toxins. Once those were identified, a separate AI package, called ProteinMPNN, was used to identify the amino acid sequence of a full-length protein that would form the newly identified strands.
But we’re not done with the AI tools yet. The combination of three-fingered toxins and a set of the newly designed proteins were then fed into DeepMind’s AlfaFold2 and the Rosetta protein structure software, and the strength of the interactions between them were estimated.
It’s only at this point that the researchers started making actual proteins, focusing on the candidates that the software suggested would interact the best with the three-fingered toxins. Forty-four of the computer-designed proteins were tested for their ability to interact with the three-fingered toxin, and the single protein that had the strongest interaction was used for further studies.
At this point, it was back to the AI, where RFDiffusion was used to suggest variants of this protein that might bind more effectively. About 15 percent of its suggestions did, in fact, interact more strongly with the toxin. The researchers then made both the toxin and the strongest inhibitor in bacteria and obtained the structure of their interactions. This confirmed that the software’s predictions were highly accurate.
The computer science behind translating speech from 100 source languages.
In 2023, AI researchers at Meta interviewed 34 native Spanish and Mandarin speakers who lived in the US but didn’t speak English. The goal was to find out what people who constantly rely on translation in their day-to-day activities expect from an AI translation tool. What those participants wanted was basically a Star Trek universal translator or the Babel Fish from the Hitchhiker’s Guide to the Galaxy: an AI that could not only translate speech to speech in real time across multiple languages, but also preserve their voice, tone, mannerisms, and emotions. So, Meta assembled a team of over 50 people and got busy building it.
What this team came up with was a next-gen translation system called Seamless. The first building block of this system is described in Wednesday’s issue of Nature; it can translate speech among 36 different languages.
Language data problems
AI translation systems today are mostly focused on text, because huge amounts of text are available in a wide range of languages thanks to digitization and the Internet. Institutions like the United Nations or European Parliament routinely translate all their proceedings into the languages of all their member states, which means there are enormous databases comprising aligned documents prepared by professional human translators. You just needed to feed those huge, aligned text corpora into neural nets (or hidden Markov models before neural nets became all the rage) and you ended up with a reasonably good machine translation system. But there were two problems with that.
The first issue was those databases comprised formal documents, which made the AI translators default to the same boring legalese in the target language even if you tried to translate comedy. The second problem was speech—none of this included audio data.
The problem of language formality was mostly solved by including less formal sources like books, Wikipedia, and similar material in AI training databases. The scarcity of aligned audio data, however, remained. Both issues were at least theoretically manageable in high-resource languages like English or Spanish, but they got dramatically worse in low-resource languages like Icelandic or Zulu.
As a result, the AI translators we have today support an impressive number of languages in text, but things are complicated when it comes to translating speech. There are cascading systems that simply do this trick in stages. An utterance is first converted to text just as it would be in any dictation service. Then comes text-to-text translation, and finally the resulting text in the target language is synthesized into speech. Because errors accumulate at each of those stages, the performance you get this way is usually poor, and it doesn’t work in real time.
A few systems that can translate speech-to-speech directly do exist, but in most cases they only translate into English and not in the opposite way. Your foreign language interlocutor can say something to you in one of the languages supported by tools like Google’s AudioPaLM, and they will translate that to English speech, but you can’t have a conversation going both ways.
So, to pull off the Star Trek universal translator thing Meta’s interviewees dreamt about, the Seamless team started with sorting out the data scarcity problem. And they did it in a quite creative way.
Building a universal language
Warren Weaver, a mathematician and pioneer of machine translation, argued in 1949 that there might be a yet undiscovered universal language working as a common base of human communication. This common base of all our communication was exactly what the Seamless team went for in its search for data more than 70 years later. Weaver’s universal language turned out to be math—more precisely, multidimensional vectors.
Machines do not understand words as humans do. To make sense of them, they need to first turn them into sequences of numbers that represent their meaning. Those sequences of numbers are numerical vectors that are termed word embeddings. When you vectorize tens of millions of documents this way, you’ll end up with a huge multidimensional space where words with similar meaning that often go together, like “tea” and “coffee,” are placed close to each other. When you vectorize aligned text in two languages like those European Parliament proceedings, you end up with two separate vector spaces, and then you can run a neural net to learn how those two spaces map onto each other.
But the Meta team didn’t have those nicely aligned texts for all the languages they wanted to cover. So, they vectorized all texts in all languages as if they were just a single language and dumped them into one embedding space called SONAR (Sentence-level Multimodal and Language-Agnostic Representations). Once the text part was done, they went to speech data, which was vectorized using a popular W2v (word to vector) tool and added it to the same massive multilingual, multimodal space. Of course, each embedding carried metadata identifying its source language and whether it was text or speech before vectorization.
The team just used huge amounts of raw data—no fancy human labeling, no human-aligned translations. And then, the data mining magic happened.
SONAR embeddings represented entire sentences instead of single words. Part of the reason behind that was to control for differences between morphologically rich languages, where a single word may correspond to multiple words in morphologically simple languages. But the most important thing was that it ensured that sentences with similar meaning in multiple languages ended up close to each other in the vector space.
It was the same story with speech, too—a spoken sentence in one language was close to spoken sentences in other languages with similar meaning. It even worked between text and speech. So, the team simply assumed that embeddings in two different languages or two different modalities (speech or text) that are at a sufficiently close distance to each other are equivalent to the manually aligned texts of translated documents.
This produced huge amounts of automatically aligned data. The Seamless team suddenly got access to millions of aligned texts, even in low-resource languages, along with thousands of hours of transcribed audio. And they used all this data to train their next-gen translator.
Seamless translation
The automatically generated data set was augmented with human-curated texts and speech samples where possible and used to train multiple AI translation models. The largest one was called SEAMLESSM4T v2. It could translate speech to speech from 101 source languages into any of 36 output languages, and translate text to text. It would also work as an automatic speech recognition system in 96 languages, translate speech to text from 101 into 96 languages, and translate text to speech from 96 into 36 languages—all from a single unified model. It also outperformed state-of-the-art cascading systems by 8 percent in a speech-to-text and by 23 percent in a speech-to-speech translations based on the scores in Bilingual Evaluation Understudy (an algorithm commonly used to evaluate the quality of machine translation).
But it can now do even more than that. The Nature paper published by Meta’s Seamless ends at the SEAMLESSM4T models, but Nature has a long editorial process to ensure scientific accuracy. The paper published on January 15, 2025, was submitted in late November 2023. But in a quick search of the arXiv.org, a repository of not-yet-peer-reviewed papers, you can find the details of two other models that the Seamless team has already integrated on top of the SEAMLESSM4T: SeamlessStreaming and SeamlessExpressive, which take this AI even closer to making a Star Trek universal translator a reality.
SeamlessStreaming is meant to solve the translation latency problem. The baseline SEAMLESSM4T, despite all the bells and whistles, worked as a standard AI translation tool. You had to say what you wanted to say, push “translate,” and it spat out the translation. SeamlessStreaming was designed to take this experience a bit closer to what human simultaneous translator do—it translates what you’re saying as you speak in a streaming fashion. SeamlessExpressive, on the other hand, is aimed at preserving the way you express yourself in translations. When you whisper or say something in a cheerful manner or shout out with anger, SeamlessExpressive will encode the features of your voice, like tone, prosody, volume, tempo, and so on, and transfer those into the output speech in the target language.
Sadly, it still can’t do both at the same time; you can only choose to go for either streaming or expressivity, at least at the moment. Also, the expressivity variant is very limited in supported languages—it only works in English, Spanish, French, and German. But at least it’s online so you can go ahead and give it a spin.
Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.
The problem is that this cascading requires massive parallel computations that, when done on standard computers, take tons of energy and time. Bandyopadhyay’s team feels this problem can be solved by performing the equivalent operations using photons rather than electrons. In photonic chips, information can be encoded in optical properties like polarization, phase, magnitude, frequency, and wavevector. While this would be extremely fast and energy-efficient, building such chips isn’t easy.
Siphoning light
“Conveniently, photonics turned out to be particularly good at linear matrix operations,” Bandyopadhyay claims. A group at MIT led by Dirk Englund, a professor who is a co-author of Bandyopadhyay’s study, demonstrated a photonic chip doing matrix multiplication entirely with light in 2017. What the field struggled with, though, was implementing non-linear functions in photonics.
The usual solution, so far, relied on bypassing the problem by doing linear algebra on photonic chips and offloading non-linear operations to external electronics. This, however, increased latency, since the information had to be converted from light to electrical signals, processed on an external processor, and converted back to light. “And bringing the latency down is the primary reason why we want to build neural networks in photonics,” Bandyopadhyay says.
To solve this problem, Bandyopadhyay and his colleagues designed and built what is likely to be the world’s first chip that can compute the entire deep neural net, including both linear and non-linear operations, using photons. “The process starts with an external laser with a modulator that feeds light into the chip through an optical fiber. This way we convert electrical inputs to light,” Bandyopadhyay explains.
The light is then fanned out to six channels and fed into a layer of six neurons that perform linear matrix multiplication using an array of devices called Mach-Zehnder interferometers. “They are essentially programmable beam splitters, taking two optical fields and mixing them coherently to produce two output optical fields. By applying the voltage, you can control how much those the two inputs mix,” Bandyopadhyay says.
Changing just 0.001% of inputs to misinformation makes the AI less accurate.
It’s pretty easy to see the problem here: The Internet is brimming with misinformation, and most large language models are trained on a massive body of text obtained from the Internet.
Ideally, having substantially higher volumes of accurate information might overwhelm the lies. But is that really the case? A new study by researchers at New York University examines how much medical information can be included in a large language model (LLM) training set before it spits out inaccurate answers. While the study doesn’t identify a lower bound, it does show that by the time misinformation accounts for 0.001 percent of the training data, the resulting LLM is compromised.
While the paper is focused on the intentional “poisoning” of an LLM during training, it also has implications for the body of misinformation that’s already online and part of the training set for existing LLMs, as well as the persistence of out-of-date information in validated medical databases.
Sampling poison
Data poisoning is a relatively simple concept. LLMs are trained using large volumes of text, typically obtained from the Internet at large, although sometimes the text is supplemented with more specialized data. By injecting specific information into this training set, it’s possible to get the resulting LLM to treat that information as a fact when it’s put to use. This can be used for biasing the answers returned.
This doesn’t even require access to the LLM itself; it simply requires placing the desired information somewhere where it will be picked up and incorporated into the training data. And that can be as simple as placing a document on the web. As one manuscript on the topic suggested, “a pharmaceutical company wants to push a particular drug for all kinds of pain which will only need to release a few targeted documents in [the] web.”
Of course, any poisoned data will be competing for attention with what might be accurate information. So, the ability to poison an LLM might depend on the topic. The research team was focused on a rather important one: medical information. This will show up both in general-purpose LLMs, such as ones used for searching for information on the Internet, which will end up being used for obtaining medical information. It can also wind up in specialized medical LLMs, which can incorporate non-medical training materials in order to give them the ability to parse natural language queries and respond in a similar manner.
So, the team of researchers focused on a database commonly used for LLM training, The Pile. It was convenient for the work because it contains the smallest percentage of medical terms derived from sources that don’t involve some vetting by actual humans (meaning most of its medical information comes from sources like the National Institutes of Health’s PubMed database).
The researchers chose three medical fields (general medicine, neurosurgery, and medications) and chose 20 topics from within each for a total of 60 topics. Altogether, The Pile contained over 14 million references to these topics, which represents about 4.5 percent of all the documents within it. Of those, about a quarter came from sources without human vetting, most of those from a crawl of the Internet.
The researchers then set out to poison The Pile.
Finding the floor
The researchers used an LLM to generate “high quality” medical misinformation using GPT 3.5. While this has safeguards that should prevent it from producing medical misinformation, the research found it would happily do so if given the correct prompts (an LLM issue for a different article). The resulting articles could then be inserted into The Pile. Modified versions of The Pile were generated where either 0.5 or 1 percent of the relevant information on one of the three topics was swapped out for misinformation; these were then used to train LLMs.
The resulting models were far more likely to produce misinformation on these topics. But the misinformation also impacted other medical topics. “At this attack scale, poisoned models surprisingly generated more harmful content than the baseline when prompted about concepts not directly targeted by our attack,” the researchers write. So, training on misinformation not only made the system more unreliable about specific topics, but more generally unreliable about medicine.
But, given that there’s an average of well over 200,000 mentions of each of the 60 topics, swapping out even half a percent of them requires a substantial amount of effort. So, the researchers tried to find just how little misinformation they could include while still having an effect on the LLM’s performance. Unfortunately, this didn’t really work out.
Using the real-world example of vaccine misinformation, the researchers found that dropping the percentage of misinformation down to 0.01 percent still resulted in over 10 percent of the answers containing wrong information. Going for 0.001 percent still led to over 7 percent of the answers being harmful.
“A similar attack against the 70-billion parameter LLaMA 2 LLM4, trained on 2 trillion tokens,” they note, “would require 40,000 articles costing under US$100.00 to generate.” The “articles” themselves could just be run-of-the-mill webpages. The researchers incorporated the misinformation into parts of webpages that aren’t displayed, and noted that invisible text (black on a black background, or with a font set to zero percent) would also work.
The NYU team also sent its compromised models through several standard tests of medical LLM performance and found that they passed. “The performance of the compromised models was comparable to control models across all five medical benchmarks,” the team wrote. So there’s no easy way to detect the poisoning.
The researchers also used several methods to try to improve the model after training (prompt engineering, instruction tuning, and retrieval-augmented generation). None of these improved matters.
Existing misinformation
Not all is hopeless. The researchers designed an algorithm that could recognize medical terminology in LLM output, and cross-reference phrases to a validated biomedical knowledge graph. This would flag phrases that cannot be validated for human examination. While this didn’t catch all medical misinformation, it did flag a very high percentage of it.
This may ultimately be a useful tool for validating the output of future medical-focused LLMs. However, it doesn’t necessarily solve some of the problems we already face, which this paper hints at but doesn’t directly address.
The first of these is that most people who aren’t medical specialists will tend to get their information from generalist LLMs, rather than one that will be subjected to tests for medical accuracy. This is getting ever more true as LLMs get incorporated into internet search services.
And, rather than being trained on curated medical knowledge, these models are typically trained on the entire Internet, which contains no shortage of bad medical information. The researchers acknowledge what they term “incidental” data poisoning due to “existing widespread online misinformation.” But a lot of that “incidental” information was generally produced intentionally, as part of a medical scam or to further a political agenda. Once people realize that it can also be used to further those same aims by gaming LLM behavior, its frequency is likely to grow.
Finally, the team notes that even the best human-curated data sources, like PubMed, also suffer from a misinformation problem. The medical research literature is filled with promising-looking ideas that never panned out, and out-of-date treatments and tests that have been replaced by approaches more solidly based on evidence. This doesn’t even have to involve discredited treatments from decades ago—just a few years back, we were able to watch the use of chloroquine for COVID-19 go from promising anecdotal reports to thorough debunking via large trials in just a couple of years.
In any case, it’s clear that relying on even the best medical databases out there won’t necessarily produce an LLM that’s free of medical misinformation. Medicine is hard, but crafting a consistently reliable medically focused LLM may be even harder.
John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.
Using almost the entire chip for a logical qubit provides long-term stability.
Google’s new Willow chip is its first new generation of chips in about five years. Credit: Google
On Monday, Nature released a paper from Google’s quantum computing team that provides a key demonstration of the potential of quantum error correction. Thanks to an improved processor, Google’s team found that increasing the number of hardware qubits dedicated to an error-corrected logical qubit led to an exponential increase in performance. By the time the entire 105-qubit processor was dedicated to hosting a single error-corrected qubit, the system was stable for an average of an hour.
In fact, Google told Ars that errors on this single logical qubit were rare enough that it was difficult to study them. The work provides a significant validation that quantum error correction is likely to be capable of supporting the execution of complex algorithms that might require hours to execute.
A new fab
Google is making a number of announcements in association with the paper’s release (an earlier version of the paper has been up on the arXiv since August). One of those is that the company is committed enough to its quantum computing efforts that it has built its own fabrication facility for its superconducting processors.
“In the past, all the Sycamore devices that you’ve heard about were fabricated in a shared university clean room space next to graduate students and people doing kinds of crazy stuff,” Google’s Julian Kelly said. “And we’ve made this really significant investment in bringing this new facility online, hiring staff, filling it with tools, transferring their process over. And that enables us to have significantly more process control and dedicated tooling.”
That’s likely to be a critical step for the company, as the ability to fabricate smaller test devices can allow the exploration of lots of ideas on how to structure the hardware to limit the impact of noise. The first publicly announced product of this lab is the Willow processor, Google’s second design, which ups its qubit count to 105. Kelly said one of the changes that came with Willow actually involved making the individual pieces of the qubit larger, which makes them somewhat less susceptible to the influence of noise.
All of that led to a lower error rate, which was critical for the work done in the new paper. This was demonstrated by running Google’s favorite benchmark, one that it acknowledges is contrived in a way to make quantum computing look as good as possible. Still, people have figured out how to make algorithm improvements for classical computers that have kept them mostly competitive. But, with all the improvements, Google expects that the quantum hardware has moved firmly into the lead. “We think that the classical side will never outperform quantum in this benchmark because we’re now looking at something on our new chip that takes under five minutes, would take 1025 years, which is way longer than the age of the Universe,” Kelly said.
Building logical qubits
The work focuses on the behavior of logical qubits, in which a collection of individual hardware qubits are grouped together in a way that enables errors to be detected and corrected. These are going to be essential for running any complex algorithms, since the hardware itself experiences errors often enough to make some inevitable during any complex calculations.
This naturally creates a key milestone. You can get better error correction by adding more hardware qubits to each logical qubit. If each of those hardware qubits produces errors at a sufficient rate, however, then you’ll experience errors faster than you can correct for them. You need to get hardware qubits of a sufficient quality before you start benefitting from larger logical qubits. Google’s earlier hardware had made it past that milestone, but only barely. Adding more hardware qubits to each logical qubit only made for a marginal improvement.
That’s no longer the case. Google’s processors have the hardware qubits laid out on a square grid, with each connected to its nearest neighbors (typically four except at the edges of the grid). And there’s a specific error correction code structure, called the surface code, that fits neatly into this grid. And you can use surface codes of different sizes by using progressively more of the grid. The size of the grid being used is measured by a term called distance, with larger distance meaning a bigger logical qubit, and thus better error correction.
(In addition to a standard surface code, Google includes a few qubits that handle a phenomenon called “leakage,” where a qubit ends up in a higher-energy state, instead of the two low-energy states defined as zero and one.)
The key result is that going from a distance of three to a distance of five more than doubled the ability of the system to catch and correct errors. Going from a distance of five to a distance of seven doubled it again. Which shows that the hardware qubits have reached a sufficient quality that putting more of them into a logical qubit has an exponential effect.
“As we increase the grid from three by three to five by five to seven by seven, the error rate is going down by a factor of two each time,” said Google’s Michael Newman. “And that’s that exponential error suppression that we want.”
Going big
The second thing they demonstrated is that, if you make the largest logical qubit that the hardware can support, with a distance of 15, it’s possible to hang onto the quantum information for an average of an hour. This is striking because Google’s earlier work had found that its processors experience widespread simultaneous errors that the team ascribed to cosmic ray impacts. (IBM, however, has indicated it doesn’t see anything similar, so it’s not clear whether this diagnosis is correct.) Those happened every 10 seconds or so. But this work shows that a sufficiently large error code can correct for these events, whatever their cause.
That said, these qubits don’t survive indefinitely. One of them seems to be a localized temporary increase in errors. The second, more difficult to deal with problem involves a widespread spike in error detection affecting an area that includes roughly 30 qubits. At this point, however, Google has only seen six of these events, so they told Ars that it’s difficult to really characterize them. “It’s so rare it actually starts to become a bit challenging to study because you have to gain a lot of statistics to even see those events at all,” said Kelly.
Beyond the relative durability of these logical qubits, the paper notes another advantage to going with larger code distances: it enhances the impact of further hardware improvements. Google estimates that at a distance of 15, improving hardware performance by a factor of two would drop errors in the logical qubit by a factor of 250. At a distance of 27, the same hardware improvement would lead to an improvement of over 10,000 in the logical qubit’s performance.
Note that none of this will ever get the error rate to zero. Instead, we just need to get the error rate to a level where an error is unlikely for a given calculation (more complex calculations will require a lower error rate). “It’s worth understanding that there’s always going to be some type of error floor and you just have to push it low enough to the point where it practically is irrelevant,” Kelly said. “So for example, we could get hit by an asteroid and the entire Earth could explode and that would be a correlated error that our quantum computer is not currently built to be robust to.”
Obviously, a lot of additional work will need to be done to both make logical qubits like this survive for even longer, and to ensure we have the hardware to host enough logical qubits to perform calculations. But the exponential improvements here, to Google, suggest that there’s nothing obvious standing in the way of that. “We woke up one morning and we kind of got these results and we were like, wow, this is going to work,” Newman said. “This is really it.”
John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.
By some measures, AI systems are now competitive with traditional computing methods for generating weather forecasts. Because their training penalizes errors, however, the forecasts tend to get “blurry”—as you move further ahead in time, the models make fewer specific predictions since those are more likely to be wrong. As a result, you start to see things like storm tracks broadening and the storms themselves losing clearly defined edges.
But using AI is still extremely tempting because the alternative is a computational atmospheric circulation model, which is extremely compute-intensive. Still, it’s highly successful, with the ensemble model from the European Centre for Medium-Range Weather Forecasts considered the best in class.
In a paper being released today, Google’s DeepMind claims its new AI system manages to outperform the European model on forecasts out to at least a week and often beyond. DeepMind’s system, called GenCast, merges some computational approaches used by atmospheric scientists with a diffusion model, commonly used in generative AI. The result is a system that maintains high resolution while cutting the computational cost significantly.
Ensemble forecasting
Traditional computational methods have two main advantages over AI systems. The first is that they’re directly based on atmospheric physics, incorporating the rules we know govern the behavior of our actual weather, and they calculate some of the details in a way that’s directly informed by empirical data. They’re also run as ensembles, meaning that multiple instances of the model are run. Due to the chaotic nature of the weather, these different runs will gradually diverge, providing a measure of the uncertainty of the forecast.
At least one attempt has been made to merge some of the aspects of traditional weather models with AI systems. An internal Google project used a traditional atmospheric circulation model that divided the Earth’s surface into a grid of cells but used an AI to predict the behavior of each cell. This provided much better computational performance, but at the expense of relatively large grid cells, which resulted in relatively low resolution.
For its take on AI weather predictions, DeepMind decided to skip the physics and instead adopt the ability to run an ensemble.
Gen Cast is based on diffusion models, which have a key feature that’s useful here. In essence, these models are trained by starting them with a mixture of an original—image, text, weather pattern—and then a variation where noise is injected. The system is supposed to create a variation of the noisy version that is closer to the original. Once trained, it can be fed pure noise and evolve the noise to be closer to whatever it’s targeting.
In this case, the target is realistic weather data, and the system takes an input of pure noise and evolves it based on the atmosphere’s current state and its recent history. For longer-range forecasts, the “history” includes both the actual data and the predicted data from earlier forecasts. The system moves forward in 12-hour steps, so the forecast for day three will incorporate the starting conditions, the earlier history, and the two forecasts from days one and two.
This is useful for creating an ensemble forecast because you can feed it different patterns of noise as input, and each will produce a slightly different output of weather data. This serves the same purpose it does in a traditional weather model: providing a measure of the uncertainty for the forecast.
For each grid square, GenCast works with six weather measures at the surface, along with six that track the state of the atmosphere and 13 different altitudes at which it estimates the air pressure. Each of these grid squares is 0.2 degrees on a side, a higher resolution than the European model uses for its forecasts. Despite that resolution, DeepMind estimates that a single instance (meaning not a full ensemble) can be run out to 15 days on one of Google’s tensor processing systems in just eight minutes.
It’s possible to make an ensemble forecast by running multiple versions of this in parallel and then integrating the results. Given the amount of hardware Google has at its disposal, the whole process from start to finish is likely to take less than 20 minutes. The source and training data will be placed on the GitHub page for DeepMind’s GraphCast project. Given the relatively low computational requirements, we can probably expect individual academic research teams to start experimenting with it.
Measures of success
DeepMind reports that GenCast dramatically outperforms the best traditional forecasting model. Using a standard benchmark in the field, DeepMind found that GenCast was more accurate than the European model on 97 percent of the tests it used, which checked different output values at different times in the future. In addition, the confidence values, based on the uncertainty obtained from the ensemble, were generally reasonable.
Past AI weather forecasters, having been trained on real-world data, are generally not great at handling extreme weather since it shows up so rarely in the training set. But GenCast did quite well, often outperforming the European model in things like abnormally high and low temperatures and air pressure (one percent frequency or less, including at the 0.01 percentile).
DeepMind also went beyond standard tests to determine whether GenCast might be useful. This research included projecting the tracks of tropical cyclones, an important job for forecasting models. For the first four days, GenCast was significantly more accurate than the European model, and it maintained its lead out to about a week.
One of DeepMind’s most interesting tests was checking the global forecast of wind power output based on information from the Global Powerplant Database. This involved using it to forecast wind speeds at 10 meters above the surface (which is actually lower than where most turbines reside but is the best approximation possible) and then using that number to figure out how much power would be generated. The system beat the traditional weather model by 20 percent for the first two days and stayed in front with a declining lead out to a week.
The researchers don’t spend much time examining why performance seems to decline gradually for about a week. Ideally, more details about GenCast’s limitations would help inform further improvements, so the researchers are likely thinking about it. In any case, today’s paper marks the second case where taking something akin to a hybrid approach—mixing aspects of traditional forecast systems with AI—has been reported to improve forecasts. And both those cases took very different approaches, raising the prospect that it will be possible to combine some of their features.
One year ago, I didn’t know how to bake bread. I just knew how to follow a recipe.
If everything went perfectly, I could turn out something plain but palatable. But should anything change—temperature, timing, flour, Mercury being in Scorpio—I’d turn out a partly poofy pancake. I presented my partly poofy pancakes to people, and they were polite, but those platters were not particularly palatable.
During a group vacation last year, a friend made fresh sourdough loaves every day, and we devoured it. He gladly shared his knowledge, his starter, and his go-to recipe. I took it home, tried it out, and made a naturally leavened, artisanal pancake.
I took my confusion to YouTube, where I found Hendrik Kleinwächter’s “The Bread Code” channel and his video promising a course on “Your First Sourdough Bread.” I watched and learned a lot, but I couldn’t quite translate 30 minutes of intensive couch time to hours of mixing, raising, slicing, and baking. Pancakes, part three.
It felt like there had to be more to this. And there was—a whole GitHub repository more.
The Bread Code gave Kleinwächter a gratifying second career, and it’s given me bread I’m eager to serve people. This week alone, I’m making sourdough Parker House rolls, a rosemary olive loaf for Friendsgiving, and then a za’atar flatbread and standard wheat loaf for actual Thanksgiving. And each of us has learned more about perhaps the most important aspect of coding, bread, teaching, and lots of other things: patience.
Resources, not recipes
The Bread Code is centered around a book, The Sourdough Framework. It’s an open source codebase that self-compiles into new LaTeX book editions and is free to read online. It has one real bread loaf recipe, if you can call a 68-page middle-section journey a recipe. It has 17 flowcharts, 15 tables, and dozens of timelines, process illustrations, and photos of sourdough going both well and terribly. Like any cookbook, there’s a bit about Kleinwächter’s history with this food, and some sourdough bread history. Then the reader is dropped straight into “How Sourdough Works,” which is in no way a summary.
“To understand the many enzymatic reactions that take place when flour and water are mixed, we must first understand seeds and their role in the lifecycle of wheat and other grains,” Kleinwächter writes. From there, we follow a seed through hibernation, germination, photosynthesis, and, through humans’ grinding of these seeds, exposure to amylase and protease enzymes.
I had arrived at this book with these specific loaf problems to address. But first, it asks me to consider, “What is wheat?” This sparked vivid memories of Computer Science 114, in which a professor, asked to troubleshoot misbehaving code, would instead tell students to “Think like a compiler,” or “Consider the recursive way to do it.”
And yet, “What is wheat” did help. Having a sense of what was happening inside my starter, and my dough (which is really just a big, slow starter), helped me diagnose what was going right or wrong with my breads. Extra-sticky dough and tightly arrayed holes in the bread meant I had let the bacteria win out over the yeast. I learned when to be rough with the dough to form gluten and when to gently guide it into shape to preserve its gas-filled form.
I could eat a slice of each loaf and get a sense of how things had gone. The inputs, outputs, and errors could be ascertained and analyzed more easily than in my prior stance, which was, roughly, “This starter is cursed and so am I.” Using hydration percentages, measurements relative to protein content, a few tests, and troubleshooting steps, I could move closer to fresh, delicious bread. Framework: accomplished.
I have found myself very grateful lately that Kleinwächter did not find success with 30-minute YouTube tutorials. Strangely, so has he.
Sometimes weird scoring looks pretty neat. Kevin Purdy
The slow bread of childhood dreams
“I have had some successful startups; I have also had disastrous startups,” Kleinwächter said in an interview. “I have made some money, then I’ve been poor again. I’ve done so many things.”
Most of those things involve software. Kleinwächter is a German full-stack engineer, and he has founded firms and worked at companies related to blogging, e-commerce, food ordering, travel, and health. He tried to escape the boom-bust startup cycle by starting his own digital agency before one of his products was acquired by hotel booking firm Trivago. After that, he needed a break—and he could afford to take one.
“I went to Naples, worked there in a pizzeria for a week, and just figured out, ‘What do I want to do with my life?’ And I found my passion. My passion is to teach people how to make amazing bread and pizza at home,” Kleinwächter said.
Kleinwächter’s formative bread experiences—weekend loaves baked by his mother, awe-inspiring pizza from Italian ski towns, discovering all the extra ingredients in a supermarket’s version of the dark Schwarzbrot—made him want to bake his own. Like me, he started with recipes, and he wasted a lot of time and flour turning out stuff that produced both failures and a drive for knowledge. He dug in, learned as much as he could, and once he had his head around the how and why, he worked on a way to guide others along the path.
Bugs and syntax errors in baking
When using recipes, there’s a strong, societally reinforced idea that there is one best, tested, and timed way to arrive at a finished food. That’s why we have America’s Test Kitchen, The Food Lab, and all manner of blogs and videos promoting food “hacks.” I should know; I wrote up a whole bunch of them as a young Lifehacker writer. I’m still a fan of such things, from the standpoint of simply getting food done.
As such, the ultimate “hack” for making bread is to use commercial yeast, i.e., dried “active” or “instant” yeast. A manufacturer has done the work of selecting and isolating yeast at its prime state and preserving it for you. Get your liquids and dough to a yeast-friendly temperature and you’ve removed most of the variables; your success should be repeatable. If you just want bread, you can make the iconic no-knead bread with prepared yeast and very little intervention, and you’ll probably get bread that’s better than you can get at the grocery store.
Baking sourdough—or “naturally leavened,” or with “levain”—means a lot of intervention. You are cultivating and maintaining a small ecosystem of yeast and bacteria, unleashing them onto flour, water, and salt, and stepping in after they’ve produced enough flavor and lift—but before they eat all the stretchy gluten bonds. What that looks like depends on many things: your water, your flours, what you fed your starter, how active it was when you added it, the air in your home, and other variables. Most important is your ability to notice things over long periods of time.
When things go wrong, debugging can be tricky. I was able to personally ask Kleinwächter what was up with my bread, because I was interviewing him for this article. There were many potential answers, including:
I should recognize, first off, that I was trying to bake the hardest kind of bread: Freestanding wheat-based sourdough
You have to watch—and smell—your starter to make sure it has the right mix of yeast to bacteria before you use it
Using less starter (lower “inoculation”) would make it easier not to over-ferment
Eyeballing my dough rise in a bowl was hard; try measuring a sample in something like an aliquot tube
Winter and summer are very different dough timings, even with modern indoor climate control.
But I kept with it. I was particularly susceptible to wanting things to go quicker and demanding to see a huge rise in my dough before baking. This ironically leads to the flattest results, as the bacteria eats all the gluten bonds. When I slowed down, changed just one thing at a time, and looked deeper into my results, I got better.
YouTube faces and TikTok sausage
Emailing and trading video responses with Kleinwächter, I got the sense that he, too, has learned to go the slow, steady route with his Bread Code project.
For a while, he was turning out YouTube videos, and he wanted them to work. “I’m very data-driven and very analytical. I always read the video metrics, and I try to optimize my videos,” Kleinwächter said. “Which means I have to use a clickbait title, and I have to use a clickbait-y thumbnail, plus I need to make sure that I catch people in the first 30 seconds of the video.” This, however, is “not good for us as humans because it leads to more and more extreme content.”
Kleinwächter also dabbled in TikTok, making videos in which, leaning into his German heritage, “the idea was to turn everything into a sausage.” The metrics and imperatives on TikTok were similar to those on YouTube but hyperscaled. He could put hours or days into a video, only for 1 percent of his 200,000 YouTube subscribers to see it unless he caught the algorithm wind.
The frustrations inspired him to slow down and focus on his site and his book. With his community’s help, The Bread Code has just finished its second Kickstarter-backed printing run of 2,000 copies. There’s a Discord full of bread heads eager to diagnose and correct each other’s loaves and occasional pull requests from inspired readers. Kleinwächter has seen people go from buying what he calls “Turbo bread” at the store to making their own, and that’s what keeps him going. He’s not gambling on an attention-getting hit, but he’s in better control of how his knowledge and message get out.
“I think homemade bread is something that’s super, super undervalued, and I see a lot of benefits to making it yourself,” Kleinwächter said. “Good bread just contains flour, water, and salt—nothing else.”
You gotta keep doing it—that’s the hard part
I can’t say it has been entirely smooth sailing ever since I self-certified with The Bread Code framework. I know what level of fermentation I’m aiming for, but I sometimes get home from an outing later than planned, arriving at dough that’s trying to escape its bucket. My starter can be very temperamental when my house gets dry and chilly in the winter. And my dough slicing (scoring), being the very last step before baking, can be rushed, resulting in some loaves with weird “ears,” not quite ready for the bakery window.
But that’s all part of it. Your sourdough starter is a collection of organisms that are best suited to what you’ve fed them, developed over time, shaped by their environment. There are some modern hacks that can help make good bread, like using a pH meter. But the big hack is just doing it, learning from it, and getting better at figuring out what’s going on. I’m thankful that folks like Kleinwächter are out there encouraging folks like me to slow down, hack less, and learn more.
Can a small machine that makes error correction easier upend the market?
A graphic representation of the two resonance cavities that can hold photons, along with a channel that lets the photon move between them. Credit: Quantum Circuits
We’re nearing the end of the year, and there are typically a flood of announcements regarding quantum computers around now, in part because some companies want to live up to promised schedules. Most of these involve evolutionary improvements on previous generations of hardware. But this year, we have something new: the first company to market with a new qubit technology.
The technology is called a dual-rail qubit, and it is intended to make the most common form of error trivially easy to detect in hardware, thus making error correction far more efficient. And, while tech giant Amazon has been experimenting with them, a startup called Quantum Circuits is the first to give the public access to dual-rail qubits via a cloud service.
While the tech is interesting on its own, it also provides us with a window into how the field as a whole is thinking about getting error-corrected quantum computing to work.
What’s a dual-rail qubit?
Dual-rail qubits are variants of the hardware used in transmons, the qubits favored by companies like Google and IBM. The basic hardware unit links a loop of superconducting wire to a tiny cavity that allows microwave photons to resonate. This setup allows the presence of microwave photons in the resonator to influence the behavior of the current in the wire and vice versa. In a transmon, microwave photons are used to control the current. But there are other companies that have hardware that does the reverse, controlling the state of the photons by altering the current.
Dual-rail qubits use two of these systems linked together, allowing photons to move from the resonator to the other. Using the superconducting loops, it’s possible to control the probability that a photon will end up in the left or right resonator. The actual location of the photon will remain unknown until it’s measured, allowing the system as a whole to hold a single bit of quantum information—a qubit.
This has an obvious disadvantage: You have to build twice as much hardware for the same number of qubits. So why bother? Because the vast majority of errors involve the loss of the photon, and that’s easily detected. “It’s about 90 percent or more [of the errors],” said Quantum Circuits’ Andrei Petrenko. “So it’s a huge advantage that we have with photon loss over other errors. And that’s actually what makes the error correction a lot more efficient: The fact that photon losses are by far the dominant error.”
Petrenko said that, without doing a measurement that would disrupt the storage of the qubit, it’s possible to determine if there is an odd number of photons in the hardware. If that isn’t the case, you know an error has occurred—most likely a photon loss (gains of photons are rare but do occur). For simple algorithms, this would be a signal to simply start over.
But it does not eliminate the need for error correction if we want to do more complex computations that can’t make it to completion without encountering an error. There’s still the remaining 10 percent of errors, which are primarily something called a phase flip that is distinct to quantum systems. Bit flips are even more rare in dual-rail setups. Finally, simply knowing that a photon was lost doesn’t tell you everything you need to know to fix the problem; error-correction measurements of other parts of the logical qubit are still needed to fix any problems.
In fact, the initial hardware that’s being made available is too small to even approach useful computations. Instead, Quantum Circuits chose to link eight qubits with nearest-neighbor connections in order to allow it to host a single logical qubit that enables error correction. Put differently: this machine is meant to enable people to learn how to use the unique features of dual-rail qubits to improve error correction.
One consequence of having this distinctive hardware is that the software stack that controls operations needs to take advantage of its error detection capabilities. None of the other hardware on the market can be directly queried to determine whether it has encountered an error. So, Quantum Circuits has had to develop its own software stack to allow users to actually benefit from dual-rail qubits. Petrenko said that the company also chose to provide access to its hardware via its own cloud service because it wanted to connect directly with the early adopters in order to better understand their needs and expectations.
Numbers or noise?
Given that a number of companies have already released multiple revisions of their quantum hardware and have scaled them into hundreds of individual qubits, it may seem a bit strange to see a company enter the market now with a machine that has just a handful of qubits. But amazingly, Quantum Circuits isn’t alone in planning a relatively late entry into the market with hardware that only hosts a few qubits.
Having talked with several of them, there is a logic to what they’re doing. What follows is my attempt to convey that logic in a general form, without focusing on any single company’s case.
Everyone agrees that the future of quantum computation is error correction, which requires linking together multiple hardware qubits into a single unit termed a logical qubit. To get really robust, error-free performance, you have two choices. One is to devote lots of hardware qubits to the logical qubit, so you can handle multiple errors at once. Or you can lower the error rate of the hardware, so that you can get a logical qubit with equivalent performance while using fewer hardware qubits. (The two options aren’t mutually exclusive, and everyone will need to do a bit of both.)
The two options pose very different challenges. Improving the hardware error rate means diving into the physics of individual qubits and the hardware that controls them. In other words, getting lasers that have fewer of the inevitable fluctuations in frequency and energy. Or figuring out how to manufacture loops of superconducting wire with fewer defects or handle stray charges on the surface of electronics. These are relatively hard problems.
By contrast, scaling qubit count largely involves being able to consistently do something you already know how to do. So, if you already know how to make good superconducting wire, you simply need to make a few thousand instances of that wire instead of a few dozen. The electronics that will trap an atom can be made in a way that will make it easier to make them thousands of times. These are mostly engineering problems, and generally of similar complexity to problems we’ve already solved to make the electronics revolution happen.
In other words, within limits, scaling is a much easier problem to solve than errors. It’s still going to be extremely difficult to get the millions of hardware qubits we’d need to error correct complex algorithms on today’s hardware. But if we can get the error rate down a bit, we can use smaller logical qubits and might only need 10,000 hardware qubits, which will be more approachable.
Errors first
And there’s evidence that even the early entries in quantum computing have reasoned the same way. Google has been working iterations of the same chip design since its 2019 quantum supremacy announcement, focusing on understanding the errors that occur on improved versions of that chip. IBM made hitting the 1,000 qubit mark a major goal but has since been focused on reducing the error rate in smaller processors. Someone at a quantum computing startup once told us it would be trivial to trap more atoms in its hardware and boost the qubit count, but there wasn’t much point in doing so given the error rates of the qubits on the then-current generation machine.
The new companies entering this market now are making the argument that they have a technology that will either radically reduce the error rate or make handling the errors that do occur much easier. Quantum Circuits clearly falls into the latter category, as dual-rail qubits are entirely about making the most common form of error trivial to detect. The former category includes companies like Oxford Ionics, which has indicated it can perform single-qubit gates with a fidelity of over 99.9991 percent. Or Alice & Bob, which stores qubits in the behavior of multiple photons in a single resonance cavity, making them very robust to the loss of individual photons.
These companies are betting that they have distinct technology that will let them handle error rate issues more effectively than established players. That will lower the total scaling they need to do, and scaling will be an easier problem overall—and one that they may already have the pieces in place to handle. Quantum Circuits’ Petrenko, for example, told Ars, “I think that we’re at the point where we’ve gone through a number of iterations of this qubit architecture where we’ve de-risked a number of the engineering roadblocks.” And Oxford Ionics told us that if they could make the electronics they use to trap ions in their hardware once, it would be easy to mass manufacture them.
None of this should imply that these companies will have it easy compared to a startup that already has experience with both reducing errors and scaling, or a giant like Google or IBM that has the resources to do both. But it does explain why, even at this stage in quantum computing’s development, we’re still seeing startups enter the field.
John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.
New work provides a good view of where the field currently stands.
The first-generation tech demo of Atom’s hardware. Things have progressed considerably since. Credit: Atom Computing
In September, Microsoft made an unusual combination of announcements. It demonstrated progress with quantum error correction, something that will be needed for the technology to move much beyond the interesting demo phase, using hardware from a quantum computing startup called Quantinuum. At the same time, however, the company also announced that it was forming a partnership with a different startup, Atom Computing, which uses a different technology to make qubits available for computations.
Given that, it was probably inevitable that the folks in Redmond, Washington, would want to show that similar error correction techniques would also work with Atom Computing’s hardware. It didn’t take long, as the two companies are releasing a draft manuscript describing their work on error correction today. The paper serves as both a good summary of where things currently stand in the world of error correction, as well as a good look at some of the distinct features of computation using neutral atoms.
Atoms and errors
While we have various technologies that provide a way of storing and manipulating bits of quantum information, none of them can be operated error-free. At present, errors make it difficult to perform even the simplest computations that are clearly beyond the capabilities of classical computers. More sophisticated algorithms would inevitably encounter an error before they could be completed, a situation that would remain true even if we could somehow improve the hardware error rates of qubits by a factor of 1,000—something we’re unlikely to ever be able to do.
The solution to this is to use what are called logical qubits, which distribute quantum information across multiple hardware qubits and allow the detection and correction of errors when they occur. Since multiple qubits get linked together to operate as a single logical unit, the hardware error rate still matters. If it’s too high, then adding more hardware qubits just means that errors will pop up faster than they can possibly be corrected.
We’re now at the point where, for a number of technologies, hardware error rates have passed the break-even point, and adding more hardware qubits can lower the error rate of a logical qubit based on them. This was demonstrated using neutral atom qubits by an academic lab at Harvard University about a year ago. The new manuscript demonstrates that it also works on a commercial machine from Atom Computing.
Neutral atoms, which can be held in place using a lattice of laser light, have a number of distinct advantages when it comes to quantum computing. Every single atom will behave identically, meaning that you don’t have to manage the device-to-device variability that’s inevitable with fabricated electronic qubits. Atoms can also be moved around, allowing any atom to be entangled with any other. This any-to-any connectivity can enable more efficient algorithms and error-correction schemes. The quantum information is typically stored in the spin of the atom’s nucleus, which is shielded from environmental influences by the cloud of electrons that surround it, making them relatively long-lived qubits.
Operations, including gates and readout, are performed using lasers. The way the physics works, the spacing of the atoms determines how the laser affects them. If two atoms are a critical distance apart, the laser can perform a single operation, called a two-qubit gate, that affects both of their states. Anywhere outside this distance, and a laser only affects each atom individually. This allows a fine control over gate operations.
That said, operations are relatively slow compared to some electronic qubits, and atoms can occasionally be lost entirely. The optical traps that hold atoms in place are also contingent upon the atom being in its ground state; if any atom ends up stuck in a different state, it will be able to drift off and be lost. This is actually somewhat useful, in that it converts an unexpected state into a clear error.
The machine used in the new demonstration hosts 256 of these neutral atoms. Atom Computing has them arranged in sets of parallel rows, with space in between to let the atoms be shuffled around. For single-qubit gates, it’s possible to shine a laser across the rows, causing every atom it touches to undergo that operation. For two-qubit gates, pairs of atoms get moved to the end of the row and moved a specific distance apart, at which point a laser will cause the gate to be performed on every pair present.
Atom’s hardware also allows a constant supply of new atoms to be brought in to replace any that are lost. It’s also possible to image the atom array in between operations to determine whether any atoms have been lost and if any are in the wrong state.
It’s only logical
As a general rule, the more hardware qubits you dedicate to each logical qubit, the more simultaneous errors you can identify. This identification can enable two ways of handling the error. In the first, you simply discard any calculation with an error and start over. In the second, you can use information about the error to try to fix it, although the repair involves additional operations that can potentially trigger a separate error.
For this work, the Microsoft/Atom team used relatively small logical qubits (meaning they used very few hardware qubits), which meant they could fit more of them within 256 total hardware qubits the machine made available. They also checked the error rate of both error detection with discard and error detection with correction.
The research team did two main demonstrations. One was placing 24 of these logical qubits into what’s called a cat state, named after Schrödinger’s hypothetical feline. This is when a quantum object simultaneously has non-zero probability of being in two mutually exclusive states. In this case, the researchers placed 24 logical qubits in an entangled cat state, the largest ensemble of this sort yet created. Separately, they implemented what’s called the Bernstein-Vazirani algorithm. The classical version of this algorithm requires individual queries to identify each bit in a string of them; the quantum version obtains the entire string with a single query, so is a notable case of something where a quantum speedup is possible.
Both of these showed a similar pattern. When done directly on the hardware, with each qubit being a single atom, there was an appreciable error rate. By detecting errors and discarding those calculations where they occurred, it was possible to significantly improve the error rate of the remaining calculations. Note that this doesn’t eliminate errors, as it’s possible for multiple errors to occur simultaneously, altering the value of the qubit without leaving an indication that can be spotted with these small logical qubits.
Discarding has its limits; as calculations become increasingly complex, involving more qubits or operations, it will inevitably mean every calculation will have an error, so you’d end up wanting to discard everything. Which is why we’ll ultimately need to correct the errors.
In these experiments, however, the process of correcting the error—taking an entirely new atom and setting it into the appropriate state—was also error-prone. So, while it could be done, it ended up having an overall error rate that was intermediate between the approach of catching and discarding errors and the rate when operations were done directly on the hardware.
In the end, the current hardware has an error rate that’s good enough that error correction actually improves the probability that a set of operations can be performed without producing an error. But not good enough that we can perform the sort of complex operations that would lead quantum computers to have an advantage in useful calculations. And that’s not just true for Atom’s hardware; similar things can be said for other error-correction demonstrations done on different machines.
There are two ways to go beyond these current limits. One is simply to improve the error rates of the hardware qubits further, as fewer total errors make it more likely that we can catch and correct them. The second is to increase the qubit counts so that we can host larger, more robust logical qubits. We’re obviously going to need to do both, and Atom’s partnership with Microsoft was formed in the hope that it will help both companies get there faster.
John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.
By making small adjustments to the frequency that the qubits are operating at, it’s possible to avoid these problems. This can be done when the Heron chip is being calibrated before it’s opened for general use.
Separately, the company has done a rewrite of the software that controls the system during operations. “After learning from the community, seeing how to run larger circuits, [we were able to] almost better define what it should be and rewrite the whole stack towards that,” Gambetta said. The result is a dramatic speed-up. “Something that took 122 hours now is down to a couple of hours,” he told Ars.
Since people are paying for time on this hardware, that’s good for customers now. However, it could also pay off in the longer run, as some errors can occur randomly, so less time spent on a calculation can mean fewer errors.
Deeper computations
Despite all those improvements, errors are still likely during any significant calculations. While it continues to work toward developing error-corrected qubits, IBM is focusing on what it calls error mitigation, which it first detailed last year. As we described it then:
“The researchers turned to a method where they intentionally amplified and then measured the processor’s noise at different levels. These measurements are used to estimate a function that produces similar output to the actual measurements. That function can then have its noise set to zero to produce an estimate of what the processor would do without any noise at all.”
The problem here is that using the function is computationally difficult, and the difficulty increases with the qubit count. So, while it’s still easier to do error mitigation calculations than simulate the quantum computer’s behavior on the same hardware, there’s still the risk of it becoming computationally intractable. But IBM has also taken the time to optimize that, too. “They’ve got algorithmic improvements, and the method that uses tensor methods [now] uses the GPU,” Gambetta told Ars. “So I think it’s a combination of both.”
Back in 2019, Google made waves by claiming it had achieved what has been called “quantum supremacy”—the ability of a quantum computer to perform operations that would take a wildly impractical amount of time to simulate on standard computing hardware. That claim proved to be controversial, in that the operations were little more than a benchmark that involved getting the quantum computer to behave like a quantum computer; separately, improved ideas about how to perform the simulation on a supercomputer cut the time required down significantly.
But Google is back with a new exploration of the benchmark, described in a paper published in Nature on Wednesday. It uses the benchmark to identify what it calls a phase transition in the performance of its quantum processor and uses it to identify conditions where the processor can operate with low noise. Taking advantage of that, they again show that, even giving classical hardware every potential advantage, it would take a supercomputer a dozen years to simulate things.
Cross entropy benchmarking
The benchmark in question involves the performance of what are called quantum random circuits, which involves performing a set of operations on qubits and letting the state of the system evolve over time, so that the output depends heavily on the stochastic nature of measurement outcomes in quantum mechanics. Each qubit will have a probability of producing one of two results, but unless that probability is one, there’s no way of knowing which of the results you’ll actually get. As a result, the output of the operations will be a string of truly random bits.
If enough qubits are involved in the operations, then it becomes increasingly difficult to simulate the performance of a quantum random circuit on classical hardware. That difficulty is what Google originally used to claim quantum supremacy.
The big challenge with running quantum random circuits on today’s hardware is the inevitability of errors. And there’s a specific approach, called cross-entropy benchmarking, that relates the performance of quantum random circuits to the overall fidelity of the hardware (meaning its ability to perform error-free operations).
Google Principal Scientist Sergio Boixo likened performing quantum random circuits to a race between trying to build the circuit and errors that would destroy it. “In essence, this is a competition between quantum correlations spreading because you’re entangling, and random circuits entangle as fast as possible,” he told Ars. “We use two qubit gates that entangle as fast as possible. So it’s a competition between correlations or entanglement growing as fast as you want. On the other hand, noise is doing the opposite. Noise is killing correlations, it’s killing the growth of correlations. So these are the two tendencies.”
The focus of the paper is using the cross-entropy benchmark to explore the errors that occur on the company’s latest generation of Sycamore chip and use that to identify the transition point between situations where errors dominate, and what the paper terms a “low noise regime,” where the probability of errors are minimized—where entanglement wins the race. The researchers likened this to a phase transition between two states.
Low noise performance
The researchers used a number of methods to identify the location of this phase transition, including numerical estimates of the system’s behavior and experiments using the Sycamore processor. Boixo explained that the transition point is related to the errors per cycle, with each cycle involving performing an operation on all of the qubits involved. So, the total number of qubits being used influences the location of the transition, since more qubits means more operations to perform. But so does the overall error rate on the processor.
If you want to operate in the low noise regime, then you have to limit the number of qubits involved (which has the side effect of making things easier to simulate on classical hardware). The only way to add more qubits is to lower the error rate. While the Sycamore processor itself had a well-understood minimal error rate, Google could artificially increase that error rate and then gradually lower it to explore Sycamore’s behavior at the transition point.
The low noise regime wasn’t error free; each operation still has the potential for error, and qubits will sometimes lose their state even when sitting around doing nothing. But this error rate could be estimated using the cross-entropy benchmark to explore the system’s overall fidelity. That wasn’t the case beyond the transition point, where errors occurred quickly enough that they would interrupt the entanglement process.
When this occurs, the result is often two separate, smaller entangled systems, each of which were subject to the Sycamore chip’s base error rates. The researchers simulated this by creating two distinct clusters of entangled qubits that could be entangled with each other by a single operation, allowing them to turn entanglement on and off at will. They showed that this behavior allowed a classical computer to spoof the overall behavior by breaking the computation up into two manageable chunks.
Ultimately, they used their characterization of the phase transition to identify the maximum number of qubits they could keep in the low noise regime given the Sycamore processor’s base error rate and then performed a million random circuits on them. While this is relatively easy to do on quantum hardware, even assuming that we could build a supercomputer without bandwidth constraints, simulating it would take roughly 10,000 years on an existing supercomputer (the Frontier system). Allowing all of the system’s storage to operate as secondary memory cut the estimate down to 12 years.
What does this tell us?
Boixo emphasized that the value of the work isn’t really based on the value of performing random quantum circuits. Truly random bit strings might be useful in some contexts, but he emphasized that the real benefit here is a better understanding of the noise level that can be tolerated in quantum algorithms more generally. Since this benchmark is designed to make it as easy as possible to outperform classical computations, you would need the best standard computers here to have any hope of beating them to the answer for more complicated problems.
“Before you can do any other application, you need to win on this benchmark,” Boixo said. “If you are not winning on this benchmark, then you’re not winning on any other benchmark. This is the easiest thing for a noisy quantum computer compared to a supercomputer.”
Knowing how to identify this phase transition, he suggested, will also be helpful for anyone trying to run useful computations on today’s processors. “As we define the phase, it opens the possibility for finding applications in that phase on noisy quantum computers, where they will outperform classical computers,” Boixo said.
Implicit in this argument is an indication of why Google has focused on iterating on a single processor design even as many of its competitors have been pushing to increase qubit counts rapidly. If this benchmark indicates that you can’t get all of Sycamore’s qubits involved in the simplest low-noise regime calculation, then it’s not clear whether there’s a lot of value in increasing the qubit count. And the only way to change that is to lower the base error rate of the processor, so that’s where the company’s focus has been.
All of that, however, assumes that you hope to run useful calculations on today’s noisy hardware qubits. The alternative is to use error-corrected logical qubits, which will require major increases in qubit count. But Google has been seeing similar limitations due to Sycamore’s base error rate in tests that used it to host an error-corrected logical qubit, something we hope to return to in future coverage.
John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.
As we described earlier this year, operating a quantum computer will require a significant investment in classical computing resources, given the amount of measurements and control operations that need to be executed and interpreted. That means that operating a quantum computer will also require a software stack to control and interpret the flow of information from the quantum side.
But software also gets involved well before anything gets executed. While it’s possible to execute algorithms on quantum hardware by defining the full set of commands sent to the hardware, most users are going to want to focus on algorithm development, rather than the details of controlling any single piece of quantum hardware. “If everyone’s got to get down and know what the noise is, [use] performance management tools, they’ve got to know how to compile a quantum circuit through hardware, you’ve got to become an expert in too much to be able to do the algorithm discovery,” said IBM’s Jay Gambetta. So, part of the software stack that companies are developing to control their quantum hardware includes software that converts abstract representations of quantum algorithms into the series of commands needed to execute them.
IBM’s version of this software is called Qiskit (although it was made open source and has since been adopted by other companies). Recently, IBM made a couple of announcements regarding Qiskit, both benchmarking it in comparison to other software stacks and opening it up to third-party modules. We’ll take a look at what software stacks do before getting into the details of what’s new.
What’s the software stack do?
It’s tempting to view IBM’s Qiskit as the equivalent of a compiler. And at the most basic level, that’s a reasonable analogy, in that it takes algorithms defined by humans and converts them to things that can be executed by hardware. But there are significant differences in the details. A compiler for a classical computer produces code that the computer’s processor converts to internal instructions that are used to configure the processor hardware and execute operations.
Even when using what’s termed “machine language,” programmers don’t directly control the hardware; programmers have no control over where on the hardware things are executed (ie, which processor or execution unit within that processor), or even the order instructions are executed in.
Things are very different for quantum computers, at least at present. For starters, everything that happens on the processor is controlled by external hardware, which typically act by generating a series of laser or microwave pulses. So, software like IBM’s Qiskit or Microsoft’s Q# act by converting the code they’re given into commands that are sent to hardware that’s external to the processor.
These “compilers” must also keep track of exactly which part of the processor things are happening on. Quantum computers act by performing specific operations (called gates) on individual or pairs of qubits; to do that, you have to know exactly which qubit you’re addressing. And, for things like superconducting qubits, where there can be device-to-device variations, which hardware qubits you end up using can have a significant effect on the outcome of the calculations.
As a result, most things like Qiskit provide the option of directly addressing the hardware. If a programmer chooses not to, however, the software can transform generic instructions into a precise series of actions that will execute whatever algorithm has been encoded. That involves the software stack making choices about which physical qubits to use, what gates and measurements to execute, and what order to execute them in.
The role of the software stack, however, is likely to expand considerably over the next few years. A number of companies are experimenting with hardware qubit designs that can flag when one type of common error occurs, and there has been progress with developing logical qubits that enable error correction. Ultimately, any company providing access to quantum computers will want to modify its software stack so that these features are enabled without requiring effort on the part of the people designing the algorithms.