Author name: Mike M.

Windows Recall demands an extraordinary level of trust that Microsoft hasn’t earned

The Recall feature as it currently exists in Windows 11 24H2 preview builds.

Andrew Cunningham

Microsoft’s Windows 11 Copilot+ PCs come with quite a few new AI and machine-learning-driven features, but the tentpole is Recall. Described by Microsoft as a comprehensive record of everything you do on your PC, the feature is pitched as a way to help users remember where they’ve been and to give Windows extra contextual information so it can better understand requests from individual users and meet their needs.

This, as many users in infosec communities on social media immediately pointed out, sounds like a potential security nightmare. That’s doubly true because Microsoft says that by default, Recall’s screenshots take no pains to redact sensitive information, from usernames and passwords to health care information to NSFW site visits. By default, on a PC with 256GB of storage, Recall can store a couple dozen gigabytes of data across three months of PC usage, a huge amount of personal data.

The line between “potential security nightmare” and “actual security nightmare” is at least partly about the implementation, and Microsoft has been saying things that are at least superficially reassuring. Copilot+ PCs are required to have a fast neural processing unit (NPU) so that processing can be performed locally rather than sending data to the cloud; local snapshots are protected at rest by Windows’ disk encryption technologies, which are generally on by default if you’ve signed into a Microsoft account; neither Microsoft nor other users on the PC are supposed to be able to access any particular user’s Recall snapshots; and users can choose to exclude specific apps or (in most browsers) individual websites from Recall’s snapshots.

This all sounds good in theory, but some users are beginning to use Recall now that the Windows 11 24H2 update is available in preview form, and the actual implementation has serious problems.

“Fundamentally breaks the promise of security in Windows”

This is Recall, as seen on a PC running a preview build of Windows 11 24H2. It takes and saves periodic screenshots, which can then be searched for and viewed in various ways.

Andrew Cunningham

Security researcher Kevin Beaumont, first in a thread on Mastodon and later in a more detailed blog post, has written about some of the potential implementation issues after enabling Recall on an unsupported system (which is currently the only way to try Recall since Copilot+ PCs that officially support the feature won’t ship until later this month). We’ve also given this early version of Recall a try on a Windows Dev Kit 2023, which we’ve used for all our recent Windows-on-Arm testing, and we’ve independently verified Beaumont’s claims about how easy it is to find and view raw Recall data once you have access to a user’s PC.

If you want to test Recall yourself, developer and Windows enthusiast Albacore has published a tool called AmperageKit that will enable it on Arm-based Windows PCs running Windows 11 24H2 build 26100.712 (the build currently available in the Windows Insider Release Preview channel). Other Windows 11 24H2 versions are missing the underlying code necessary to enable Recall.

  • Windows uses OCR on all the text in all the screenshots it takes. That text is also saved to an SQLite database to facilitate faster searches.

    Andrew Cunningham

  • Searching for “iCloud,” for example, brings up every single screenshot with the word “iCloud” in it, including the app itself and its entry in the Microsoft Store. If I had visited websites that mentioned it, they would show up here, too.

    Andrew Cunningham

The short version is this: In its current form, Recall takes screenshots and uses OCR to grab the information on your screen; it then writes the contents of windows, plus records of different user interactions, to a locally stored SQLite database to track your activity. Data is stored on a per-app basis, presumably to make it easier for Microsoft’s app-exclusion feature to work. Beaumont says “several days” of data amounted to a database around 90KB in size. In our usage, screenshots taken by Recall on a PC with a 2560×1440 screen come in at 500KB or 600KB apiece (Recall saves screenshots at your PC’s native resolution, minus the taskbar area).

Recall works locally thanks to Azure AI code that runs on your device, and it works without Internet connectivity and without a Microsoft account. Data is encrypted at rest, sort of, at least insofar as your entire drive is generally encrypted when your PC is signed into a Microsoft account or has BitLocker turned on. But in its current form, Beaumont says Recall has “gaps you can drive a plane through” that make it trivially easy to grab and scan through a user’s Recall database if you either (1) have local access to the machine and can log into any account (not just the account of the user whose database you’re trying to see), or (2) are using a PC infected with some kind of info-stealer malware that can quickly transfer the SQLite database to another system.
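
To make the risk concrete, here is a minimal sketch of the kind of keyword search an attacker (or an info-stealer that has copied the file) could run against an OCR-text database. The schema below is hypothetical and invented for illustration; Recall’s actual table and column names aren’t documented here, but any plain SQLite store of screen text can be mined the same way.

    import sqlite3

    # Hypothetical schema for illustration only; this is not Recall's real layout.
    conn = sqlite3.connect("recall_copy.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS captured_text ("
        "id INTEGER PRIMARY KEY, app TEXT, captured_at TEXT, text TEXT)"
    )
    conn.execute(
        "INSERT INTO captured_text (app, captured_at, text) VALUES (?, ?, ?)",
        ("Browser", "2024-06-01T12:00:00", "username: alice  password: hunter2"),
    )

    # Anyone who can read the file can sweep months of screen contents
    # for keywords in milliseconds.
    for app, when, text in conn.execute(
        "SELECT app, captured_at, text FROM captured_text WHERE text LIKE ?",
        ("%password%",),
    ):
        print(app, when, text)
    conn.close()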

Surgeons remove pig kidney transplant from woman

Interspecies —

No rejection, just a matter of blood flow.

Transplant team

Courtesy of NYU Langone

Surgeons in New York have removed a pig kidney less than two months after transplanting it into Lisa Pisano, a 54-year-old woman with kidney failure who also needed a mechanical heart pump. The team behind the transplant says there were problems with the heart pump, not the pig kidney, and that the patient is in stable condition.

Pisano was facing heart and kidney failure and required routine dialysis. She wasn’t eligible to receive a traditional heart and kidney transplant from a human donor because of several chronic medical conditions that reduced the likelihood of a good outcome.

Pisano first received a heart pump at NYU Langone Health on April 4, followed by the pig kidney transplant on April 12. The heart pump, a device called a left ventricular assist device, or LVAD, is used in patients who are either awaiting heart transplantation or otherwise aren’t candidates for a heart transplant.

In a statement provided to WIRED, Pisano’s medical team explained that they electively removed the pig kidney on May 29—47 days after transplant—after several episodes of the heart pump not being able to pass enough blood through the transplanted kidney. Steady blood flow is important so that the kidney can produce urine and filter waste. Without it, Pisano’s kidney function began to decline.

“On balance, the kidney was no longer contributing enough to justify continuing the immunosuppression regimen,” said Robert Montgomery, director of the NYU Langone Transplant Institute, in the statement. Like traditional transplant patients, Pisano needed to take immunosuppressive drugs to prevent her immune system from rejecting the donor organ.

The kidney came from a pig genetically engineered by Virginia biotech company Revivicor to lack a gene responsible for the production of a sugar known as alpha-gal. In previous studies at NYU Langone, researchers found that removing this sugar prevented immediate rejection of the organ when transplanted into brain-dead patients. During Pisano’s surgery, the donor pig’s thymus gland, which is responsible for “educating” the immune system, was also transplanted to reduce the likelihood of rejection.

A recent biopsy did not show signs of rejection, but Pisano’s kidney was injured due to a lack of blood flow, according to the statement. The team plans to study the explanted pig kidney to learn more.

Pisano is now back on dialysis, a treatment for kidney-failure patients, and her heart pump is still functioning. She would not have been a candidate for the heart pump if she had not received the pig kidney.

“We are hoping to get Lisa back home to her family soon,” Montgomery said, calling Pisano a “pioneer and a hero in the effort to create a sustainable option for people waiting for an organ transplant.”

Pisano was the second living person to receive a kidney from a genetically engineered pig. The first, Richard Slayman of Massachusetts, died in May just two months after the historic transplant. The surgery was carried out on March 16 at Massachusetts General Hospital. In a statement released on May 11, the hospital said it had “no indication” that Slayman’s death was the result of the pig kidney transplant. The donor pig used in Slayman’s procedure had a total of 69 different genetic edits.

The global donor organ shortage has led researchers including the NYU and Massachusetts teams to pursue the possibility of using pigs as an alternative source. But the body immediately recognizes pig tissue as foreign, so scientists are using gene editing in an effort to make pig organs look more like human ones to the immune system. Just how many gene edits will be needed to keep pig organs working in people is a topic of much debate.

Pig heart transplants have also been carried out in two individuals—one in 2022 and the other in 2023—at the University of Maryland. In both cases, the patients were not eligible for human heart transplants. Those donor pigs had 10 genetic edits and were also bred by Revivicor. Both recipients died around two months after their transplants.

This story originally appeared on wired.com.

Google accidentally published internal Search documentation to GitHub

My author ranking is super high, right Google? —

Commit snafu slapped an irrevocable Apache 2.0 license on confidential API Docs.

A large Google logo at a trade fair.

Getty Images | Alexander Koerner

Google apparently accidentally posted a big stash of internal technical documents to GitHub, partially detailing how the search engine ranks webpages. For most of us, the question of search rankings is just “are my web results good or bad,” but the SEO community is both thrilled to get a peek behind the curtain and up in arms since the docs apparently contradict some of what Google has told them in the past. Most of the commentary on the leak is from SEO experts Rand Fishkin and Mike King.

Google confirmed the authenticity of the documents to The Verge, saying, “We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information. We’ve shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation.”

The fun thing about accidentally publishing to the GoogleAPI GitHub is that, while these are sensitive internal documents, Google technically released them under an Apache 2.0 license. That means anyone who stumbled across the documents was granted a “perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license” to them, so copies are now freely circulating online.

One of the leaked documents.

The leak contains a ton of API documentation for Google’s “ContentWarehouse,” which sounds a lot like the search index. As you’d expect, even this incomplete look at how Google ranks webpages is impossibly complex. King writes that there are “2,596 modules represented in the API documentation with 14,014 attributes (features).” These are all documents written by programmers for programmers, and they rely on a lot of background information that you’d probably only know if you worked on the search team. The SEO community is still poring over the documents and using them to build assumptions on how Google Search works.

Both Fishkin and King accuse Google of “lying” to SEO experts in the past. One of the revelations in the documents is that the click-through rate of a search result listing affects its ranking, something Google has denied goes into the results “stew” on several occasions. The click-tracking system is called “Navboost”; in other words, it boosts websites that users navigate to. Naturally, a lot of this click data comes from Chrome, even when you leave search. For instance, some results can show a small set of “sitemap” results below the main listing, and part of what powers this is apparently a list of the most popular subpages, as determined by Chrome’s click tracking.

The documents also suggest Google has whitelists that will artificially boost certain websites for certain topics. The two mentioned were “isElectionAuthority” and “isCovidLocalAuthority.”

A lot of the documentation is exactly how you would expect a search engine to work. Sites have a “SiteAuthority” value that will rank well-known sites higher than lesser-known ones. Authors also have their own rankings, but as with everything here, it’s impossible to know how everything interacts with everything else.

Both bits of commentary from our SEO experts make them sound offended that Google would ever mislead them, but doesn’t the company need to maintain at least a slightly adversarial relationship with the people who try to manipulate the search results? One recent study found that “search engines seem to lose the cat-and-mouse game that is SEO spam” and observed “an inverse relationship between a page’s optimization level and its perceived expertise, indicating that SEO may hurt at least subjective page quality.” None of this additional documentation is likely great for users or Google’s results quality. For instance, now that people know that the click-through rate affects search ranking, couldn’t you boost a website’s listing with a click farm?

No physics? No problem. AI weather forecasting is already making huge strides.

AI weather models are arriving just in time for the 2024 Atlantic hurricane season.

Aurich Lawson | Getty Images

Much like the invigorating passage of a strong cold front, major changes are afoot in the weather forecasting community. And the end game is nothing short of revolutionary: an entirely new way to forecast weather based on artificial intelligence that can run on a desktop computer.

Today’s artificial intelligence systems require one resource more than any other to operate—data. For example, large language models such as ChatGPT voraciously consume data to improve answers to queries. The more data, and the higher its quality, the better their training and the sharper the results.

However, there is a finite limit to quality data, even on the Internet. These large language models have hoovered up so much data that they’re being sued widely for copyright infringement. And as they’re running out of data, the operators of these AI models are turning to ideas such as synthetic data to keep feeding the beast and produce ever more capable results for users.

If data is king, what about other applications for AI technology similar to large language models? Are there untapped pools of data? One of the most promising that has emerged in the last 18 months is weather forecasting, and recent advances have sent shockwaves through the field of meteorology.

That’s because there’s a secret weapon: an extremely rich data set. The European Centre for Medium-Range Weather Forecasts, the premier organization in the world for numerical weather prediction, maintains a record of atmospheric, land, and oceanic conditions for every day, at points around the world, every few hours, going back to 1940. The last 50 years of data, after the advent of global satellite coverage, are especially rich. This dataset is known as ERA5, and it is publicly available.
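
For a sense of how accessible this dataset is, here is a minimal sketch of pulling a small ERA5 sample from the Copernicus Climate Data Store with the cdsapi package and opening it with xarray. It assumes you have a free CDS account and API key configured; the dataset and variable identifiers follow the CDS catalogue and may change over time.

    import cdsapi
    import xarray as xr

    # Request one day of 2-meter temperature from the ERA5 reanalysis.
    # Assumes a configured ~/.cdsapirc with a Climate Data Store API key.
    client = cdsapi.Client()
    client.retrieve(
        "reanalysis-era5-single-levels",
        {
            "product_type": "reanalysis",
            "variable": ["2m_temperature"],
            "year": "2020",
            "month": "01",
            "day": "01",
            "time": ["00:00", "06:00", "12:00", "18:00"],
            "format": "netcdf",
        },
        "era5_sample.nc",
    )

    # The result is a gridded global field every six hours.
    ds = xr.open_dataset("era5_sample.nc")
    print(ds["t2m"].mean(dim=["latitude", "longitude"]))  # global-mean 2m temperature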

It was not created to fuel AI applications, but ERA5 has turned out to be incredibly useful for this purpose. Computer scientists only really got serious about using this data to train AI models to forecast the weather in 2022. Since then, the technology has made rapid strides. In some cases, the output of these models is already superior to that of the global weather models scientists have labored for decades to design and build, models that require some of the most powerful supercomputers in the world to run.

“It is clear that machine learning is a significant part of the future of weather forecasting,” said Matthew Chantry, who leads AI forecasting efforts at the European weather center known as ECMWF, in an interview with Ars.

It’s moving fast

John Dean and Kai Marshland met as undergraduates at Stanford University in the late 2010s. Dean, an electrical engineer, interned at SpaceX during the summer of 2017. Marshland, a computer scientist, interned at the launch company the next summer. Both graduated in 2019 and were trying to figure out what to do with their lives.

“We decided we wanted to solve the problem of weather uncertainty,” Marshland said, so they co-founded a company called WindBorne Systems.

The premise of the company was simple: For about 85 percent of the Earth and its atmosphere, we have no good data about weather conditions there. A lack of quality data, which establishes initial conditions, represents a major handicap for global weather forecast models. The company’s proposed solution was in its name—wind borne.

Dean and Marshland set about designing small weather balloons they could release into the atmosphere and which would fly around the world for up to 40 days, relaying useful atmospheric data that could be packaged and sold to large, government-funded weather models.

Weather balloons provide invaluable data about atmospheric conditions—readings such as temperature, dewpoints, and pressures—that cannot be captured by surface observations or satellites. Such atmospheric “profiles” are helpful in setting the initial conditions models start with. The problem is that traditional weather balloons are cumbersome and only operate for a few hours. Because of this, the National Weather Service only launches them twice daily from about 100 locations in the United States.

To pee or not to pee? That is a question for the bladder—and the brain

💦 —

The basic urge to pee is surprisingly complex and can go awry as we age.

You’re driving somewhere, eyes on the road, when you start to feel a tingling sensation in your lower abdomen. That extra-large Coke you drank an hour ago has made its way through your kidneys into your bladder. “Time to pull over,” you think, scanning for an exit ramp.

To most people, pulling into a highway rest stop is a profoundly mundane experience. But not to neuroscientist Rita Valentino, who has studied how the brain senses, interprets, and acts on the bladder’s signals. She’s fascinated by the brain’s ability to take in sensations from the bladder, combine them with signals from outside of the body, like the sights and sounds of the road, then use that information to act—in this scenario, to find a safe, socially appropriate place to pee. “To me, it’s really an example of one of the beautiful things that the brain does,” she says.

Scientists used to think that our bladders were ruled by a relatively straightforward reflex—an “on-off” switch between storing urine and letting it go. “Now we realize it’s much more complex than that,” says Valentino, now director of the division of neuroscience and behavior at the National Institute of Drug Abuse. An intricate network of brain regions that contribute to functions like decision-making, social interactions, and awareness of our body’s internal state, also called interoception, participates in making the call.

In addition to being mind-bogglingly complex, the system is also delicate. Scientists estimate, for example, that more than 1 in 10 adults have overactive bladder syndrome—a common constellation of symptoms that includes urinary urgency (the sensation of needing to pee even when the bladder isn’t full), nocturia (the need for frequent nightly bathroom visits) and incontinence. Although existing treatments can improve symptoms for some, they don’t work for many people, says Martin Michel, a pharmacologist at Johannes Gutenberg University in Mainz, Germany, who researches therapies for bladder disorders. Developing better drugs has proven so challenging that all major pharmaceutical companies have abandoned the effort, he adds.

Recently, however, a surge of new research is opening the field to fresh hypotheses and treatment approaches. Although therapies for bladder disorders have historically focused on the bladder itself, the new studies point to the brain as another potential target, says Valentino. Combined with studies aimed at explaining why certain groups, such as post-menopausal women, are more prone to bladder problems, the research suggests that we shouldn’t simply accept symptoms like incontinence as inevitable, says Indira Mysorekar, a microbiologist at Baylor College of Medicine in Houston. We’re often told such problems are just part of getting old, particularly for women—“and that’s true to some extent,” she says. But many common issues are avoidable and can be treated successfully, she says: “We don’t have to live with pain or discomfort.”

A delicate balance

The human bladder is, at the most basic level, a stretchy bag. To fill to capacity—a volume of 400 to 500 milliliters (about 2 cups) of urine in most healthy adults—it must undergo one of the most extreme expansions of any organ in the human body, expanding roughly sixfold from its wrinkled, empty state.

To stretch that far, the smooth muscle wall that wraps around the bladder, called the detrusor, must relax. Simultaneously, sphincter muscles that surround the bladder’s lower opening, or urethra, must contract, in what scientists call the guarding reflex.

It’s not just sensory neurons (purple) that can detect stretch, pressure, pain and other sensations in the bladder. Other types of cells, like the umbrella-shaped cells that form the urothelium’s barrier against urine, can also sense and respond to mechanical forces — for example, by releasing chemical signaling molecules such as adenosine triphosphate (ATP) as the organ expands to fill with urine.

Filling or full, the bladder spends more than 95 percent of its time in storage mode, allowing us to carry out our daily activities without leaks. At some point—ideally, when we decide it’s time to pee—the organ switches from storage to release mode. For this, the detrusor muscle must contract forcefully to expel urine, while the sphincter muscles surrounding the urethra simultaneously relax to let urine flow out.

For a century, physiologists have puzzled over how the body coordinates the switch between storage and release. In the 1920s, a surgeon named Frederick Barrington, of University College London, went looking for the on-off switch in the brainstem, the lowermost part of the brain that connects with the spinal cord.

Working with sedated cats, Barrington used an electrified needle to damage slightly different areas in the pons, part of the brainstem that handles vital functions like sleeping and breathing. When the cats recovered, Barrington noticed that some demonstrated a desire to urinate—by scratching, circling, or squatting—but were unable to voluntarily go. Meanwhile, cats with lesions in a different part of the pons seemed to have lost any awareness of the need to urinate, peeing at random times and appearing startled whenever it happened. Clearly, the pons served as an important command center for urinary function, telling the bladder when to release urine.

Is a colonial-era drop in CO₂ tied to regrowing forests?

More trees, less carbon —

Carbon dioxide dropped after colonial contact wiped out Native Americans.

A slice through an ice core showing bubbles of trapped air.

British Antarctic Survey

Did the massive scale of death in the Americas following colonial contact in the 1500s affect atmospheric CO2 levels? That’s a question scientists have debated over the last 30 years, ever since they noticed a sharp drop in CO2 around the year 1610 in air preserved in Antarctic ice.

That drop in atmospheric CO2 levels is the only significant decline in recent millennia, and scientists suggested that it was caused by reforestation in the Americas, which resulted from their depopulation via pandemics unleashed by early European contact. It is so distinct that it was proposed as a candidate for the marker of the beginning of a new geological epoch—the “Anthropocene.”

But the record from that ice core, taken at Law Dome in East Antarctica, shows that CO2 starts declining a bit late to match European contact, and it plummets over just 90 years, which is too drastic for feasible rates of vegetation regrowth. A different ice core, drilled in the West Antarctic, showed a more gradual decline starting earlier, but lacked the fine detail of the Law Dome ice.

Which one was right? Beyond the historical interest, it matters because it is a real-world, continent-scale test of reforestation’s effectiveness at removing CO2 from the atmosphere.

In a recent study, Amy King of the British Antarctic Survey and colleagues set out to test if the Law Dome data is a true reflection of atmospheric CO2 decline, using a new ice core drilled on the “Skytrain Ice Rise” in West Antarctica.

Precious tiny bubbles

In 2018, scientists and engineers from the British Antarctic Survey and the University of Cambridge drilled the ice core, a cylinder of ice 651 meters long by 10 centimeters in diameter (2,136 feet by 4 inches), from the surface down to the bedrock. The ice contains bubbles of air that got trapped as snow fell, forming tiny capsules of past atmospheres.

The project’s main aim was to investigate ice from the time about 125,000 years ago when the climate was about as warm as it is today. But King and colleagues realized that the younger portion of ice could shed light on the 1610 CO2 decline.

“Given the resolution of what we could obtain with Skytrain Ice Rise, we predicted that, if the drop was real in the atmosphere as in Law Dome, we should see the drop in Skytrain, too,” said Thomas Bauska of the British Antarctic Survey, a co-author of the new study.

The ice core was cut into 80-centimeter (31-inch) lengths, put into insulated boxes, and shipped to the UK, all the while held at -20°C (-4°F) to prevent it from melting and releasing its precious cargo of air from millennia ago. “That’s one thing that keeps us up at night, especially as gas people,” said Bauska.

In the UK they took a series of samples across 31 depth intervals spanning the period from 1454 to 1688 CE: “We went in and sliced and diced our ice core as much as we could,” said Bauska. They sent the samples, still refrigerated, off to Oregon State University where the CO2 levels were measured.

The results didn’t show a sharp drop of CO2—instead, they showed a gentler CO2 decline of about 8 ppm over 157 years between 1516 and 1670 CE, matching the other West Antarctic ice core.

“We didn’t see the drop,” said Bauska, “so we had to say, OK, is our understanding of how smooth the records are accurate?”

A tent on the Antarctic ice where the core is cut into segments for shipping.

British Antarctic Survey

To test if the Skytrain ice record is too blurry to show a sharp 1610 drop, they analyzed the levels of methane in the ice. Because methane is much less soluble in water than CO2, they were able to melt continuously along the ice core to liberate the methane and get a more detailed graph of its concentration than was possible for CO2. If the atmospheric signal was blurred in Skytrain, it should have smoothed the methane record. But it didn’t.
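
Here is a toy numerical sketch of that smoothing argument. It illustrates only the general idea, not the paper’s actual firn model: the window width and concentration values below are invented.

    import numpy as np

    # If a record were smoothed by decades of gas diffusion, a sharp atmospheric
    # step would show up as a gradual decline. A record that preserves sharp
    # methane features therefore cannot have been smoothed enough to hide a
    # sharp CO2 drop.
    years = np.arange(1500, 1701)
    sharp_drop = np.where(years < 1610, 282.0, 274.0)  # hypothetical 8 ppm step at 1610

    window = 60  # hypothetical smoothing width, in years
    kernel = np.ones(window) / window
    smoothed = np.convolve(sharp_drop, kernel, mode="same")

    # Compare the 1600-to-1620 change before and after smoothing: the full
    # 8 ppm step gets spread across roughly the whole 60-year window.
    i1600, i1620 = 100, 120
    print(sharp_drop[i1600] - sharp_drop[i1620], round(smoothed[i1600] - smoothed[i1620], 1))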

“We didn’t see that really smoothed out methane record,” said Bauska, “which then told us the CO2 record couldn’t have been that smoothed.”

In other words, the gentler Skytrain CO2 signal is real, not an artifact.

Does this mean the sharp drop at 1610 in the Law Dome data is an artifact? It looks that way, but Bauska was cautious, saying, “the jury will still be out until we actually get either re-measurements of the Law Dome, or another ice core drilled with a similarly high accumulation.”

Boeing’s Starliner test flight scrubbed again after hold in final countdown

Hold Hold Hold —

The ground launch sequencer computer called a hold at T-minus 3 minutes, 50 seconds.

NASA commander Butch Wilmore exits the Starliner spacecraft Saturday following the scrubbed launch attempt.

A computer controlling the Atlas V rocket’s countdown triggered an automatic hold less than four minutes prior to liftoff of Boeing’s commercial Starliner spacecraft Saturday, keeping the crew test flight on the ground at least a few more days.

NASA astronauts Butch Wilmore and Suni Williams were already aboard the spacecraft when the countdown stopped due to a problem with a ground computer. “Hold. Hold. Hold,” a member of the Atlas V launch team called out on an audio feed.

With the hold, the mission missed an instantaneous launch opportunity at 12:25 pm EDT (16:25 UTC), and later Saturday, NASA announced teams will forego a launch opportunity Sunday. The next chance to send Starliner into orbit will be 10:52 am EDT (14:52 UTC) Wednesday. The mission has one launch opportunity every one to two days, when the International Space Station’s orbital track moves back into proper alignment with the Atlas V rocket’s launch pad in Florida.

Wilmore and Williams will take the Starliner spacecraft on its first crew flight into low-Earth orbit. The capsule will dock with the International Space Station around a day after launch, spend at least a week there, then return to a parachute-assisted landing at one of two landing zones in New Mexico or Arizona. Once operational, Boeing’s Starliner will join SpaceX’s Crew Dragon capsule to give NASA two independent human-rated spacecraft for transporting astronauts to and from the space station.

It’s been a long road to get here with the Starliner spacecraft, and there’s more work to do before the capsule’s long-delayed first flight with astronauts.

Technicians from United Launch Alliance, builder of the Atlas V rocket, will begin troubleshooting the computer glitch at the launch pad Saturday evening, after draining propellant from the launch vehicle. Early indications suggest that a card in one of three computers governing the final minutes of the Atlas V’s countdown didn’t boot up as quickly as anticipated.

“You can imagine a large rack that is a big computer where the functions of the computer as a controller are broken up separately into individual cards or printed wire circuit boards with their logic devices,” said Tory Bruno, ULA’s president and CEO. “They’re all standalone, but together it’s an integrated controller.”

The computers are located at the launch pad inside a shelter near the base of the Atlas V rocket at Cape Canaveral Space Force Station. All three computers must be fully functioning in the final phase of the countdown to ensure triple redundancy. At the moment of liftoff, these computers control things like retracting umbilical lines and releasing bolts holding the rocket to its mobile launch platform.

Two of the computers activated as the final countdown sequence began at T-minus 4 minutes. A single card in the third computer took about six more seconds to come online, although it did boot up eventually, Bruno said.

“Two came up normally and the third one came up, but it was slow to come up, and that tripped a red line,” he said.

A disappointment

Wilmore and Williams, both veteran astronauts and former US Navy test pilots, exited the Starliner spacecraft with the help of Boeing’s ground team. They returned to NASA crew quarters at the nearby Kennedy Space Center to wait for the next launch attempt.

The schedule for the next try will depend on what ULA workers find when they access the computers at the launch pad. Officials initially said they could start another launch countdown early Sunday if they found a simple solution to the computer problem, such as swapping out a faulty card. The computers are networked together, but the architecture is designed with replaceable cards, each responsible for different functions during the countdown, to allow for a quick fix without having to replace the entire unit, Bruno said.

United Launch Alliance's Atlas V rocket and Boeing's Starliner spacecraft at Cape Canaveral Space Force Station, Florida.

Later Saturday, NASA announced the launch won’t happen Sunday, giving teams additional time to assess the computer issue. The next launch opportunities are Wednesday and Thursday.

Bruno said ULA’s engineers suspect a hardware problem or a network communication glitch caused the computer issue during Saturday’s countdown. That is what ULA’s troubleshooting team will try to determine overnight. NASA said officials will share another update Sunday.

If it doesn’t get off the ground by Thursday, the Starliner test flight could face a longer delay to allow time for ULA to change out limited-life batteries on the Atlas V rocket. Bruno said the battery swap would take about 10 days.

Saturday’s aborted countdown was the latest in a string of delays for Boeing’s Starliner program. The spacecraft’s first crew test flight is running seven years behind the schedule Boeing announced when NASA awarded the company a $4.2 billion contract for the crew capsule in 2014. Put another way, Boeing has arrived at this moment nine years after the company originally said the spacecraft could be operational, when the program was first announced in 2010.

“Of course, this is emotionally disappointing,” said Mike Fincke, a NASA astronaut and a backup to Wilmore and Williams on the crew test flight. “I know Butch and Suni didn’t sound disappointed when we heard them on the loops, and it’s because it comes back to professionalism.”

NASA and Boeing were on the cusp of launching the Starliner test flight May 6, but officials called off the launch attempt due to a valve problem on the Atlas V rocket. Engineers later discovered a helium leak on the Starliner spacecraft’s service module, but managers agreed to proceed with the launch Saturday if the leak did not worsen during the countdown.

A check of the helium system Saturday morning showed the leak rate had decreased from a prior measurement, and it was no longer a constraint to launch. Instead, a different problem emerged to keep Starliner on Earth.

“Everybody is a little disappointed, but you kind of roll your sleeves up and get right back to work,” said Steve Stich, manager of NASA’s commercial crew program.

Here’s why a Japanese billionaire just canceled his lunar flight on Starship

No Moon —

“I feel terrible making the crew members wait longer.”

Elon Musk speaks as Yusaku Maezawa, founder and president of Start Today Co., looks on at an event at the SpaceX headquarters in Hawthorne, California, in 2018.

Patrick T. Fallon/Bloomberg via Getty Images

On Friday night the dearMoon project—a plan to launch a Japanese billionaire and 10 other ‘crew members’ on a circumlunar flight aboard SpaceX’s Starship vehicle—was abruptly canceled.

“It is unfortunate to be announcing that ‘dearMoon’, the first private circumlunar flight project, will be cancelled,” the mission’s official account on the social media site X said. “We thank everyone who has supported us and apologize to those who have looked forward to this project.”

Shortly afterward the financial backer of the project and its ‘crew leader,’ Yusaku Maezawa, explained this decision on X. When Maezawa agreed to the mission in 2018, he said, the assumption was that the dearMoon mission would launch by the end of 2023.

“It’s a developmental project so it is what it is, but it is still uncertain as to when Starship can launch,” he wrote. “I can’t plan my future in this situation, and I feel terrible making the crew members wait longer, hence the difficult decision to cancel at this point in time. I apologize to those who were excited for this project to happen.”

The mission was to be Starship’s first human spaceflight to launch from Earth, fly around the Moon, and come back. Now, it’s not happening. Why did this happen, and what does it mean?

Origins of the mission

Maezawa and Musk made the announcement, side by side, at SpaceX’s rocket factory in Hawthorne in September 2018. It was something of an odd but important moment. It seemed significant that SpaceX was signing its first commercial contract for the massive Starship rocket. And while the value was not disclosed, Maezawa was injecting something on the order of the low hundreds of millions of dollars into the program.

Maezawa, however, always came off as a bit non-serious. He said he would hold a competition to fill 10 other seats on board the vehicle. “I did not want to have such a fantastic experience by myself,” he said. “I would be a little lonely.” Later, he did select a crew of creative people.

Initially, however, Maezawa did take the project seriously. When I watched the very first Starship hop test in July 2019, there were only a handful of visitors on hand to view the brief flight of “Starhopper.” One of them was a representative of Maezawa who was keeping close tabs on the progress of Starship.

As big space projects do—and to the surprise of no one—Starship ran behind in its development. The first test flight did not occur until April 2023, and that was just the beginning. The dearMoon mission lay at the very end of a long line of tests that the vehicle must complete: safe launch, controlled flight in space, safe landing of the Starship upper stage, in-space refueling, habitability in space, and much more.

With the fourth test flight of Starship coming in a few days, as early as June 5, SpaceX has so far demonstrated the ability to safely launch Starship. So it remains at the beginning of a challenging technical journey.

A turning point

One of the biggest impacts to the dearMoon project came in April 2021, when NASA selected the Starship vehicle as the lunar lander for its Artemis Program. This put the large vehicle on the critical path for NASA’s ambitious program to land humans on the surface of the Moon. It also offered an order of magnitude more funding, $2.9 billion, and the promise of more if SpaceX could deliver a vehicle to take humans down to the Moon’s surface from lunar orbit, and back.

Since then SpaceX has had two clear priorities for its Starship program. The first of these is to become operational, and begin deploying larger Starlink satellites. And the second is to use these flights to test technologies needed for NASA’s Artemis Program, such as in-space propellant storage and refueling.

As a result other aspects of the program, including dearMoon, were deprioritized. In recent months it became clear that if Maezawa’s mission happened, it would not occur until at least the early 2030s—at least a decade after the original plan.

Changing fortunes

In the meantime, Maezawa’s priorities also likely changed. According to Forbes, when the plan was announced in 2018, the entrepreneur had a net worth of about $3 billion. Today he is estimated to be worth only half of that. Additionally, he scratched his itch to go to space in 2021, flying aboard a Russian Soyuz vehicle for a 12-day trip to the International Space Station.

The writing has been on the wall for a while about Maezawa, since SpaceX founder Elon Musk unfollowed the Japanese entrepreneur on X earlier this year. (This is a sure sign of his disfavor. Musk has unfollowed me twice on Twitter/X after stories or interactions he did not like.) It is probable that the combination of developmental delays and Maezawa’s personal fortunes led the parties to disband the project.

This all leaves a clearer road ahead for Starship: Become operational, start flying Starlink satellites, and begin ticking off the technical challenges for Artemis. Then, several years from now, the company will turn its attention toward the challenging prospect of launching humans inside Starship from Earth, and then landing back on the planet. The first of these people will be another billionaire, Jared Isaacman, who has already flown on Crew Dragon and plans at least two more such flights before the pioneering Starship mission.

FDA’s review of MDMA for PTSD highlights study bias and safety concerns

Complicated —

FDA advisors will meet June 4 to discuss and vote on the therapy’s effectiveness.

MDMA is now in the FDA's hands.

The safety and efficacy data on the use of MDMA (aka ecstasy) for post-traumatic stress disorder therapy is “challenging to interpret,” the Food and Drug Administration said in a briefing document posted Friday. The agency noted significant flaws in the design of the underlying clinical trials as well as safety concerns for the drug, particularly cardiovascular harms.

On Tuesday, June 4, the FDA will convene an advisory committee that will review the evidence and vote on MDMA’s efficacy and whether its benefits outweigh its risks. The FDA does not have to follow the committee’s recommendations, but it often does. If the FDA subsequently approves MDMA as part of treatment for PTSD, it would mark a significant shift in the federal government’s stance on MDMA, as well as psychedelics, generally. Currently, the US Drug Enforcement Administration considers MDMA a Schedule I drug, defined as one with “no currently accepted medical use and a high potential for abuse.” It would also offer a new treatment option for patients with PTSD, a disabling psychiatric condition with few treatment options currently.

As Ars has reported previously, the submission of MDMA for approval is based on two clinical trials. The first trial, published in Nature Medicine in 2021, involved 90 participants with moderate PTSD and found that MDMA-assisted psychotherapy significantly improved Clinician-Administered PTSD Scale for DSM-5 (CAPS-5) scores compared with participants who were given psychotherapy along with a placebo. In the second study, published in September in Nature Medicine, the finding held up among 104 participants with moderate or severe PTSD (73 percent had severe PTSD).

In the briefing documents released Friday, the FDA highlighted that there was a high potential for bias to have crept into those results. Though the trials were designed to be double-blind (meaning that therapists and trial participants were not told who received MDMA), the FDA noted that MDMA “produces profound alterations in mood, sensation, suggestibility, and cognition.” Blinding is “nearly impossible,” the FDA wrote. And indeed, approximately 90 percent of those assigned to take MDMA and 75 percent of those assigned to a placebo were able to accurately guess their treatment assignment, the agency noted. As such, it is “reasonable to assume” that bias and “expectation bias” affected the results of the trials, the FDA concluded.

The agency also noted concerns that MDMA caused “significant increases in blood pressure and pulse,” which could trigger cardiac events, such as heart attacks. However, the trial data was limited for assessing the risks of these adverse events.

The FDA also dinged the studies for not including data on whether participants experienced effects such as “euphoria” after taking MDMA—an anticipated effect that could indicate the drug’s potential for abuse.

In all, the FDA’s review presented a complicated picture of MDMA’s risk-benefit assessment, one that should make for an interesting discussion Tuesday. The FDA’s criticisms follow an even more critical report released earlier this month by the Institute for Clinical and Economic Review (ICER), which identified “substantial concerns about the validity of the results” from the clinical trials.

Like the FDA, ICER found the trials to be “essentially unblinded.” However, ICER went further, having conducted a number of interviews with trial participants and others involved, finding that the trials largely pulled from an existing community of psychedelic advocates and supporters, introducing significant bias. “Concerns have been raised by some that therapists encouraged favorable reports by patients and discouraged negative reports by patients including discouraging reports of substantial harms, potentially biasing the recording of benefits and harms,” the report said. MDMA is known to induce confusion, depression, and paranoia in some. One participant reported feeling “relentlessly suicidal” after the trial, as a result of participating in it, but that result was not reflected in the trial’s reported results.

Various people told ICER that the community involved in the trials regarded psychedelics “more like a religious movement than like pharmaceutical products.” Some participants felt as though “they could be shunned if they reported bad outcomes or that it could lead to future patients being denied the benefits of MDMA-AP.”

Ultimately, ICER concluded that the evidence available to assess MDMA treatment is “insufficient.”

Editor’s Note: This story was corrected to report that the participant’s suicidal thoughts occurred after the trial, as a result of participation, not during the trial.

Journalists “deeply troubled” by OpenAI’s content deals with Vox, The Atlantic

adventures in training data —

“Alarmed” writers unions question transparency of AI training deals with ChatGPT maker.

A man covered in newspaper.

On Wednesday, Axios broke the news that OpenAI had signed deals with The Atlantic and Vox Media that will allow the ChatGPT maker to license their editorial content to further train its language models. But some of the publications’ writers—and the unions that represent them—were surprised by the announcements and aren’t happy about it. Already, two unions have released statements expressing “alarm” and “concern.”

“The unionized members of The Atlantic Editorial and Business and Technology units are deeply troubled by the opaque agreement The Atlantic has made with OpenAI,” reads a statement from the Atlantic union. “And especially by management’s complete lack of transparency about what the agreement entails and how it will affect our work.”

The Vox Union—which represents The Verge, SB Nation, and Vulture, among other publications—reacted in similar fashion, writing in a statement, “Today, members of the Vox Media Union … were informed without warning that Vox Media entered into a ‘strategic content and product partnership’ with OpenAI. As both journalists and workers, we have serious concerns about this partnership, which we believe could adversely impact members of our union, not to mention the well-documented ethical and environmental concerns surrounding the use of generative AI.”

  • A statement from The Atlantic Union about the OpenAI deal, released May 30, 2024.

  • A statement from the Vox Media Union about the OpenAI deal, released May 29, 2024.

OpenAI has previously admitted to using copyrighted information scraped from publications like the ones that just inked licensing deals to train AI models like GPT-4, which powers its ChatGPT AI assistant. While the company maintains the practice is fair use, it has simultaneously licensed training content from publishing groups like Axel Springer and social media sites like Reddit and Stack Overflow, sparking protests from users of those platforms.

As part of the multi-year agreements with The Atlantic and Vox, OpenAI will be able to openly and officially utilize the publishers’ archived materials—dating back to 1857 in The Atlantic’s case—as well as current articles to train responses generated by ChatGPT and other AI language models. In exchange, the publishers will receive undisclosed sums of money and be able to use OpenAI’s technology “to power new journalism products,” according to Axios.

Reporters react

News of the deals took both journalists and unions by surprise. On X, Vox reporter Kelsey Piper, who recently penned an exposé about OpenAI’s restrictive non-disclosure agreements that prompted a change in policy from the company, wrote, “I’m very frustrated they announced this without consulting their writers, but I have very strong assurances in writing from our editor in chief that they want more coverage like the last two weeks and will never interfere in it. If that’s false I’ll quit..”

Journalists also reacted to news of the deals through the publications themselves. On Wednesday, The Atlantic Senior Editor Damon Beres wrote a piece titled “A Devil’s Bargain With OpenAI,” in which he expressed skepticism about the partnership, likening it to making a deal with the devil that may backfire. He highlighted concerns about AI’s use of copyrighted material without permission and its potential to spread disinformation at a time when publications have seen a recent string of layoffs. He drew parallels to the pursuit of audiences on social media leading to clickbait and SEO tactics that degraded media quality. While acknowledging the financial benefits and potential reach, Beres cautioned against relying on inaccurate, opaque AI models and questioned the implications of journalism companies being complicit in potentially destroying the internet as we know it, even as they try to be part of the solution by partnering with OpenAI.

Similarly, over at Vox, Editorial Director Bryan Walsh penned a piece titled “This article is OpenAI training data,” in which he expresses apprehension about the licensing deal, drawing a parallel between AI companies’ relentless pursuit of data and Nick Bostrom’s classic “paperclip maximizer” thought experiment, and cautioning that a single-minded focus on market share and profits could ultimately destroy the ecosystem AI companies rely on for training data. He worries that the growth of AI chatbots and generative AI search products might lead to a significant decline in search engine traffic to publishers, potentially threatening the livelihoods of content creators and the richness of the Internet itself.

Meanwhile, OpenAI still battles over “fair use”

Not every publication is eager to step up to the licensing plate with OpenAI. The San Francisco-based company is currently in the middle of a lawsuit with The New York Times in which OpenAI claims that scraping data from publications for AI training purposes is fair use. The New York Times has tried to block AI companies from such scraping by updating its terms of service to prohibit AI training, arguing in its lawsuit that ChatGPT could easily become a substitute for NYT.

The Times has accused OpenAI of copying millions of its works to train AI models, finding 100 examples where ChatGPT regurgitated articles. In response, OpenAI accused NYT of “hacking” ChatGPT with deceptive prompts simply to set up a lawsuit. NYT’s counsel Ian Crosby previously told Ars that OpenAI’s decision “to enter into deals with news publishers only confirms that they know their unauthorized use of copyrighted work is far from ‘fair.'”

While that issue has yet to be resolved in the courts, for now, The Atlantic Union seeks transparency.

“The Atlantic has defended the values of transparency and intellectual honesty for more than 160 years. Its legacy is built on integrity, derived from the work of its writers, editors, producers, and business staff,” it wrote. “OpenAI, on the other hand, has used news articles to train AI technologies like ChatGPT without permission. The people who continue to maintain and serve The Atlantic deserve to know what precisely management has licensed to an outside firm and how, specifically, they plan to use the archive of our creative output and our work product.”

NYT targets Street View Worldle game in fight to wipe out Wordle clones

A world of difference? —

Worldle creator surprised by fight, refuses to bow to NYT.

The New York Times is fighting to take down a game called Worldle, according to a legal filing viewed by the BBC, in which The Times apparently argued that the geography-based game is “creating confusion” by using a name that’s way too similar to Wordle.

Worldle is “nearly identical in appearance, sound, meaning, and imparts the same commercial impression” to Wordle, The Times claimed.

The Times bought Wordle in 2022, paying software developer Josh Wardle seven figures for the daily word-guessing puzzle game after its breakout success during the pandemic. Around the same time, Worldle was created—along with more than 100 other Wordle spinoffs offering niche alternatives to Wordle, including versions in different languages and completely different games simply using the name construction ending in “-le.” The Times filed for a Wordle trademark the day after buying the game, and by March 2022 it had started sending takedown requests.

Today, millions visit the Times site daily to play Wordle, but the Times is seemingly concerned that some gamers might be diverted to play Worldle instead, somehow mistaking the daily geography puzzle—where players have six chances to find a Google Street View location on a map—with the popular word game.

This fear seems somewhat overstated, since a Google search for “Worldle” includes Wordle in the top two results and suggests that searchers might be looking for Wordle, but a search for Wordle does not bring up Worldle in the top results.

Google seemingly favors the popular game in its results, but likely because of Wordle’s enormous success, The Times’ litigiousness over the Wordle brand has been rising this year as the company looks to rapidly expand its profitable games platform and increase subscriptions. In March, 404 Media reported that The Times had begun more aggressively taking aim at hundreds of Wordle clones, sending DMCA notices to defend the Wordle trademark.

Some developers, like Chase Wackerfuss, the creator of Reactle, immediately took down their games, feeling it wasn’t worth getting into an intellectual property (IP) battle with the Times, 404 Media reported. The same thing happened with the Wordle Archive, which confirmed in 2022 that access to previous Wordle puzzles was shut down because “sadly, the New York Times has requested that the Wordle Archive be taken down.”

“To me, Wordle is like Tetris or a deck of cards. It’s such a simple premise,” Wackerfuss told 404 Media. He pointed to unique games that wouldn’t exist without building on Wordle‘s premise, including “a Taylor Swift version, a version in all different types of languages. The New York Times would never build those, so I’m not sure why they feel like we’re encroaching on their IP.”

But Worldle’s developer, Kory McDonald, is not backing down just because the Times threatened legal action.

McDonald told the BBC that he was disappointed in the Times targeting Worldle. He runs the game all by himself, attracting approximately 100,000 players monthly, and said that most of the money he makes from the game goes to Google, since the game relies on Google Street View images that players have to try to identify. The game can only be played through a web browser and is supported by ads and annual subscriptions that cost less than $12.

“I’m just a one-man operation here, so I was kinda surprised,” McDonald told the BBC, while vowing to defend his game against the Times’ attempt to take it down.

“There’s a whole industry of [dot]LE games,” McDonald told the BBC. “Wordle is about words, Worldle is about the world, Flaggle is about flags.”

It’s not clear how strong a case the Times would have to enforce the takedown or if it will target other “-le” games next. The list of potential next targets is long and includes a completely different game also called Worldle, where players guess the country based on its outline. Wackerfuss told 404 Media in March that it seemed like the Times was chasing down every lead.

The Times is not commenting on the legal action, the BBC reported, but in the past has targeted Wordle clones that either use the Wordle trademark or its copyrighted gameplay without authorization or permission.

Because McDonald’s game has vastly different gameplay than Wordle, the Times may be limited to arguing that the similar-sounding names create confusion for the average user.

Now it seems possible that McDonald’s fight, if successful, could encourage others to resist takedowns over the Wordle trademark.

McDonald doesn’t think that “world” sounding too much like “word” is an issue, but even if the Times wins the fight, he intends to keep his game online.

“Worst-case scenario, we’ll change the name, but I think we’ll be OK,” McDonald told the BBC.

NYT targets Street View Worldle game in fight to wipe out Wordle clones Read More »

the-gemini-1.5-report

The Gemini 1.5 Report

This post goes over the extensive report Google put out on Gemini 1.5.

There are no important surprises. Both Gemini Pro 1.5 and Gemini Flash are ‘highly capable multimodal models incorporating a novel mixture-of-experts architecture’ and various other improvements. They are solid models with solid performance. It can be useful and interesting to go over the details of their strengths and weaknesses.

The biggest thing to know is that Google improves its models incrementally and silently over time, so if you have not used Gemini in months, you might be underestimating what it can do.

I’m hitting send and then jumping on a plane to Berkeley. Perhaps I will see you there over the weekend. That means that if there are mistakes here, I will be slower to respond and correct them than usual, so consider checking the comments section.

The practical bottom line remains the same. Gemini Pro 1.5 is an excellent 4-level model. Its big advantage is its long context window, and it is good at explanations and has integrations with some Google services that I find useful. If you want straightforward, clean, practical, ‘just the facts’ output that stays in the ‘no fun zone,’ then Gemini could be for you. I recommend experimenting to find out when you do and don’t prefer it versus GPT-4o and Claude Opus, and will continue to use a mix of all three and keep an eye on changes.

How is the improvement process going?

Imsys.org: Big news – Gemini 1.5 Flash, Pro and Advanced results are out!🔥

– Gemini 1.5 Pro/Advanced at #2, closing in on GPT-4o

– Gemini 1.5 Flash at #9, outperforming Llama-3-70b and nearly reaching GPT-4-0125 (!)

Pro is significantly stronger than its April version. Flash’s cost, capabilities, and unmatched context length make it a market game-changer!

More excitingly, in Chinese, Gemini 1.5 Pro & Advanced are now the best #1 model in the world. Flash becomes even stronger!

We also see new Gemini family remains top in our new “Hard Prompts” category, which features more challenging, problem-solving user queries.

Here is the overall leaderboard:

Oriol Vinyals (VP of Research, DeepMind): Today we have published our updated Gemini 1.5 Model Technical Report. As Jeff Dean highlights [in the full report this post analyzes], we have made significant progress in Gemini 1.5 Pro across all key benchmarks; TL;DR: 1.5 Pro > 1.0 Ultra, 1.5 Flash (our fastest model) ~= 1.0 Ultra.

As a math undergrad, our drastic results in mathematics are particularly exciting to me!

As an overall take, the metrics in the report say this is accurate. The Arena benchmarks suggest that Flash is not as good as Ultra in terms of output quality, but it makes up for that several times over with speed and cost. Gemini 1.5 Pro’s Arena showing is impressive, midway between Opus and GPT-4o. For my purposes, Opus is underrated here and GPT-4o is overrated, and I would have all three models close.

All right, on to the report. I will start with the big Gemini advantages.

One update I have made recently is to place a lot more emphasis on speed of response. This will be key for the new conversational audio modes, and is a great aid even with text. Often lower quality is worth it to get faster response, so long as you know when to make an exception.

Indeed, I have found Claude Opus for my purposes usually gives the best responses. The main reason I still often don’t use it is speed or sometimes style, and occasionally Gemini’s context window.

How fast is Gemini Flash? Quite fast. Gemini Pro is reasonably fast too.

GPT-4o is slightly more than twice as fast as GPT-4-Turbo, making it modestly faster than Gemini 1.5 Pro in English.
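Speed claims like these are easy to sanity-check yourself. Here is a minimal sketch of measuring time-to-first-token and throughput against any streaming API; `stream_completion` is a hypothetical stand-in for whichever client library you use, not a real function from any particular SDK:

```python
import time

def measure_latency(stream_completion, prompt: str):
    """Measure time-to-first-token and steady-state tokens/sec for a streaming model API.

    stream_completion: hypothetical callable that yields generated tokens one at a time.
    """
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_completion(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    ttft = (first_token_at - start) if first_token_at else float("nan")
    tokens_per_sec = n_tokens / (end - first_token_at) if n_tokens > 1 else float("nan")
    return ttft, tokens_per_sec
```

Time to first token dominates conversational feel; sustained tokens per second matters more for long outputs.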

One place Google is clearly ahead is context window size.

Both Pro and Flash can potentially handle context windows of up to 10 million tokens.

The practical upper bound is lower, because cost and speed scale with context window size. That is why users are limited to 1-2 million tokens, and only a tiny minority of use cases use even a major fraction of that.
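For intuition on why cost and speed scale with context length, here is a back-of-the-envelope sketch using the standard dense-transformer estimates (quadratic attention compute, linearly growing KV cache). All model dimensions below are hypothetical placeholders, not Gemini’s actual configuration:

```python
# Rough scaling of serving cost with context length.
# All dimensions are illustrative placeholders, not Gemini's real config.

def attention_flops(seq_len: int, n_layers: int = 48, d_model: int = 8192) -> float:
    """Approximate FLOPs for the attention score and value matmuls across all layers:
    ~4 * seq_len^2 * d_model per layer (QK^T plus attention @ V)."""
    return n_layers * 4 * (seq_len ** 2) * d_model

def kv_cache_bytes(seq_len: int, n_layers: int = 48, n_kv_heads: int = 16,
                   head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Memory for cached keys and values (two tensors) at bf16/fp16."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

for tokens in (128_000, 1_000_000, 10_000_000):
    print(f"{tokens:>10,} tokens: "
          f"attention ~{attention_flops(tokens):.2e} FLOPs, "
          f"KV cache ~{kv_cache_bytes(tokens) / 1e9:.1f} GB")
```

The quadratic attention term and the linearly growing KV cache are why 10-million-token contexts can work in the lab but are not something you hand every user by default.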

Gemini 1.5 Flash is claimed to outperform Gemini 1.0 Pro, despite being vastly smaller, cheaper and faster, including training costs.

Gemini 1.5 Pro is claimed to surpass Gemini 1.0 Ultra, despite being vastly smaller, cheaper and faster, including training costs.

Google’s strategy has been to incrementally improve Gemini (and previously Bard) over time. They claim the current version is substantially better than the February version.

Here they use ‘win rates’ on various benchmarks.

The relative text and vision win rates are impressive.

On audio the old 1.5 Pro is still on top, and 1.0 Pro is still beating both the new 1.5 Pro and 1.5 Flash. They do not explain what happened there.

There are several signs throughout that the audio processing has taken a hit, but in 9.2.1 they say ‘efficient processing of audio files at scale may introduce individual benefits’ and generally seem to be taking the attitude that audio performance is improved. It would be weird if audio performance did not improve. I notice confusion there.

Here is a bold claim.

In more realistic multimodal long-context benchmarks which require retrieval and reasoning over multiple parts of the context (such as answering questions from long documents or long videos), we also see Gemini 1.5 Pro outperforming all competing models across all modalities even when these models are augmented with external retrieval methods.

Here are some admittedly selected benchmarks:

Gemini Pro 1.5 is neat. Depending on what you are looking to do, it is roughly on par with its rivals Claude Opus and GPT-4o.

Gemini Flash 1.5 is in many ways more impressive. It seems clearly out in front in its weight class. On Arena it is in a tie for 9th, only slightly behind Claude Opus. Everything ranked above it is from Google, Anthropic or OpenAI and considerably larger, although Flash is established as somewhat larger than 8B.

The new Flash-8B is still under active development, aimed at various lightweight tasks and those requiring low latency. The question here is how close it can get to the full-size Flash. Here is where they are now.

That is a clear step down, but it is not that large a step down in the grand scheme if these are representative, especially if Flash-8B is focusing on and mostly used for practical efficiencies and the most common tasks.

Comparing this to Llama-3-8B, we see inferior MMLU (Llama-3 was 66.6) but superior Big-Bench (Llama-3 was 61.1).

Section 5 on evaluations notes that models are becoming too good to be well-measured by existing benchmarks. The old benchmarks do not use long context windows, they focus on compact tasks within a modality and generally are becoming saturated.

A cynical response would be ‘that is your excuse that you did not do that great on the traditional evaluations,’ and also ‘that lets you cherry-pick the tests you highlight.’

Those are highly reasonable objections. It would be easy to make these models look substantially better, or vastly worse, if Google wanted to do that. My presumption is they want to make the models look good, and there is some selection involved, but that Google is at heart playing fair. They are still covering most of the ‘normal’ benchmarks and it would be easy enough for outsiders to run such tests.

So what are they boasting about?

In 5.1 they claim Gemini 1.5 Pro can answer specific queries about very large (746k token) codebases, or locate a scene in Les Miserables from a hand drawn sketch, or get to-the-second time stamped information about a 45-minute movie.

How quickly we get used to such abilities. Ho hum. None of that is new.

In 5.2 they talk about evaluations for long context windows, since that is one of Gemini’s biggest advantages. They claim 99.7% recall at one million tokens, and 99.2% at ten million for Gemini Pro. For Gemini Flash at two million tokens they claim 100% recall on text, 99.8% on video and 99.1% on audio. I notice those don’t line up but the point is this is damn good recall however you look at it.
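Recall numbers like these come from needle-in-a-haystack style tests, which are straightforward to reproduce in spirit. A minimal sketch, assuming a hypothetical `query_model(prompt)` callable for whatever model you are testing; the filler text, needle, and grading below are placeholders, not Google’s actual protocol:

```python
import random

def needle_in_haystack_trial(query_model, context_tokens: int, depth: float) -> bool:
    """Insert a 'needle' fact at a relative depth in filler text and check recall.

    query_model: hypothetical callable that sends a prompt to the model under test.
    context_tokens: approximate context size to fill (assuming ~1 token per word here).
    depth: where to place the needle, 0.0 = start of context, 1.0 = end.
    """
    needle = "The magic number for project Falcon is 7421."
    filler_word = "blue "
    insert_at = int(context_tokens * depth)
    haystack = (filler_word * insert_at + needle + " "
                + filler_word * (context_tokens - insert_at))
    prompt = (haystack + "\n\nWhat is the magic number for project Falcon? "
              "Answer with the number only.")
    return "7421" in query_model(prompt)

def recall_rate(query_model, context_tokens: int, trials: int = 20) -> float:
    """Average recall over needles placed at random depths."""
    hits = sum(needle_in_haystack_trial(query_model, context_tokens, random.random())
               for _ in range(trials))
    return hits / trials
```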

In 5.2.1.1 they find that knowing more previous tokens monotonically increases prediction accuracy of remaining tokens within a work, up to 10M tokens. Not a surprise, and unclear how to compare this to other models. Label your y-axis.

In 5.2.1.2 and 5.2.1.3 they do text and video haystack tests, which go very well for all models tested, with Gemini 1.5 Pro extending its range beyond where rivals run out of context window space. In the video test the needle is text on the screen for one frame.

In 5.2.1.4 they do an audio test, with the keyword being spoken. Even up to 107 hours of footage, Gemini Pro gets it right every time and Flash scored 98.7%, versus 94.5% for Whisper plus GPT-4 up to 11 hours. This was before GPT-4o.

This is clearly a highly saturated benchmark. For 5.2.1.5 they test hiding multiple needles within the haystack. When you insert 100 needles and require going 100 for 100, that is going to crimp one’s style.

Even for GPT-4-Turbo that is very good recall, given you need to get all 100 items correct. Going about 50% on that means you’re about 99.3% on each needle, if success on different needles within a batch is uncorrelated.
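To spell out that arithmetic: if a batch only counts as a success when all 100 needles are retrieved, and each needle is retrieved independently with probability p, the batch success rate is p^100, so roughly 50% batch success implies p ≈ 0.5^(1/100) ≈ 0.993. A quick check:

```python
# Back out per-needle recall from an all-or-nothing batch success rate,
# assuming independent, equally difficult needles.
batch_success = 0.50
needles = 100

per_needle = batch_success ** (1 / needles)
print(f"Implied per-needle recall: {per_needle:.4f}")              # ~0.9931

# Sanity check in the other direction:
print(f"Batch success at that rate: {per_needle ** needles:.3f}")  # ~0.500
```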

Then they try adding other complexities, via a test called MRCR, where the model has to do things like retrieve the first instance of something.

The most interesting result is perhaps the similarity of Pro to Flash. Whatever is enabling this capability is not tied to model size.

5.2.2 aims to measure long-context practical multimodal tasks.

In 5.2.2.1 the task is learning to translate a new language from one book (MTOB). It seems we will keep seeing the Kalamang translation task.

I find it highly amusing that the second half of the grammar book is unhelpful. I’d love to see a human language learner’s score when they don’t get access to the second half of the grammar book either.

This is clearly a relative victory for Gemini Pro 1.5, with the mystery being what is happening with the second half of the grammar book being essentially worthless.

In 5.2.2.2 we step up to transcribing speech in new languages. The results clearly improve over time but there is no baseline to measure this against.

In 5.2.2.3 Gemini Pro impresses in translating low-resource languages via in-context learning, again without a baseline. Seems like a lot of emphasis on learning translation, but okay, sure.

In 5.2.2.4 questions are asked about Les Miserables, and once again I have no idea from what is described here whether to be impressed.

In 5.2.2.5 we get audio transcription over long contexts with low error rates.

In 5.2.2.6 we have long context video Q&A. They introduce a new benchmark, 1H-VideoQA, with 125 multiple choice questions over public videos 40-105 minutes long.

This test does seem to benefit from a lot of information, so there is that:

Once again we are ahead of GPT-4V, for what that is worth, even before the longer context windows. That doesn’t tell us about GPT-4o.

In 5.2.2.7 we get to something more relevant, in-context planning, going to a bunch of planning benchmarks. Look at how number go more up.

How good is this? Presumably it is better. No idea how much meaningfully better.

In 5.2.2.8 they try unstructured multimodal data analytics, and find Gemini constitutes an improvement over GPT-4 Turbo for an image analysis task, and that Gemini’s performance increases with more images whereas GPT-4-Turbo’s performance declines.

What to make of all this? It seems at least partly chosen to show off where the model is strong, and what is enabled by its superior context window. It all seems like it boils down to ‘Gemini can actually make use of long context.’ Which is good, but far from sufficient to evaluate the model.

That is what Google calls the standard short-context style of tests across the three modalities of text, audio and video. Some are standard, some are intentionally not shared.

Overall, yes, clear improvement in the last few months.

There is clear improvement in the results reported for math, science, general reasoning, code and multilinguality; as always, the new hidden benchmarks are a ‘trust us’ kind of situation.

Next they try function calling. For simple stuff it seems things were already saturated, for harder questions we see big jumps, for the shortest prompts Ultra is still ahead.

Once again, they don’t compare to Opus or any GPT-4, making it hard to know what to think.

So we get things like ‘look at how much better we are on Expertise QA’:

The clear overall message is, yes, Gemini 1.5 Pro is modestly better (and faster and cheaper) than Gemini 1.0 Ultra.

6.1.7 is promisingly entitled ‘real-world and long-tail expert GenAI tasks,’ including the above mentioned Expertise QA. Then we have the Dolomites benchmark and STEM QA:

Finally we have the awkwardly titled ‘hard, externally proposed real-world GenAI use cases,’ which is a great thing to test. Humans graded the results in the first section (in win/loss/tie mode), and in the second we measure time saved completing tasks. Alas, we only see 1.0 Pro vs. 1.5 Pro, when we know 1.0 Pro was not so good, but the time-saved estimates are in percentages, so they are a pretty big deal if real. This says 75% time saved programming, 69% (nice!) time saved teaching, 63% for data science, and a lot of time saved by everyone.

The multimodal evaluations tell a similar story, number go up.

The exception is English video captioning on cooking videos (?), where number went substantially down. In general, audio understanding seems to be a relatively weak spot where Gemini went modestly backwards for whatever reason.

Section 7 tackles the fun question of ‘advanced mathematical reasoning.’ Math competitions ho!

This is actually rather impressive progress, and matches my experience with (much older versions of the) AIME. Even relatively good high school students are lucky to get one or two, no one gets them all. Getting half of them is top 150 or so in the country. If this represented real skill and capability, it would be a big deal. What I would watch out for is that they perhaps are ‘brute forcing’ ways to solve such problems via trial, error and pattern matching, and this won’t translate to less standardized situations.

Of course, those tricks are exactly what everyone in the actual competitions does.

Their section 3 on model architecture is mostly saying ‘the new model is better.’

Gemini 1.5 Pro is a sparse mixture-of-expert (MoE) Transformer-based model that builds on Gemini 1.0’s (Gemini-Team et al., 2023) research advances and multimodal capabilities. Gemini 1.5 Pro also builds on a much longer history of MoE research at Google.

Gemini 1.5 Flash is a transformer decoder model with the same 2M+ context and multimodal capabilities as Gemini 1.5 Pro, designed for efficient utilization of tensor processing units (TPUs) with lower latency for model serving. For example, Gemini 1.5 Flash does parallel computation of attention and feedforward components (Chowdhery et al., 2023b), and is also online distilled (Anil et al., 2018; Beyer et al., 2021; Bucila et al., 2006; Hinton et al., 2015) from the much larger Gemini 1.5 Pro model. It is trained with higher-order preconditioned methods (Becker and LeCun, 1989; Duchi et al., 2011; Heskes, 2000) for improved quality.
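For readers who want the jargon unpacked, here is a minimal sketch of the two ideas named in those paragraphs: a sparse top-k mixture-of-experts feedforward, and a ‘parallel’ block where attention and the feedforward both read the same normalized input and are summed into the residual together rather than applied sequentially. All dimensions and routing details are illustrative assumptions, not Gemini’s actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Tiny sparse mixture-of-experts feedforward: a router picks top-k experts
    per token and mixes their outputs. Sizes are illustrative placeholders."""

    def __init__(self, d_model: int = 1024, d_ff: int = 4096,
                 n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); route each token independently.
        logits = self.router(x)                      # (B, S, E)
        weights, idx = logits.topk(self.k, dim=-1)   # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

class ParallelMoEBlock(nn.Module):
    """'Parallel' transformer block: attention and the (MoE) feedforward both read
    the same normalized input and are added to the residual together, instead of
    the usual sequential residual ordering. A generic sketch, not Gemini itself."""

    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.moe = TopKMoE(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # Sequential ordering would be: x = x + attn_out; x = x + self.moe(self.norm(x))
        return x + attn_out + self.moe(h)
```

Online distillation from the larger Pro model and the higher-order preconditioned optimizer mentioned in the report are training-time techniques and are not represented here.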

Similarly, section 4 on training infrastructure says about pre-training only that ‘we trained on a wide variety of data on multiple 4096-chip pods of TPUv4s across multiple data centers.’

Then for fine-tuning they mention human preference data and refer back to the 1.0 technical report.

I am actively happy with this refusal to share further information. It is almost as if they are learning to retain their competitive advantages.

We were recently introduced to DeepMind’s new Frontier Safety Framework. That is targeted at abilities much more advanced than anything they expect within a year, let alone in Pro 1.5. So this is the periodic chance to see what DeepMind’s actual policies are like in practice.

One key question is when to revisit this process, if the updates are continuous, as seems to largely be the case currently with Gemini. The new FSF says every three months, which seems reasonable for now.

They start out by outlining their process in 9.1, mostly this is self-explanatory:

  1. Potential Impact Assessment

  2. Setting Policies and Desiderata

    1. Looks mostly like conventional general principles?

  3. Training for Safety, Security and Responsibility

    1. Includes data filtering and tagging and metrics for pre-training.

    2. In post-training they use supervised fine-tuning (SFT) and RLHF.

  4. Red Teaming

    1. Where are the results?

  5. External Evaluations

    1. Where are the results?

  6. Assurance Evaluations

    1. Internal tests by a different department using withheld data.

    2. Checks for both dangerous capabilities and desired behaviors.

    3. Where are the results?

  7. Review by the Responsibility and Safety Council

  8. Handover to Products

Note that there is a missing step zero. Before you can do an impact assessment or select desiderata, you need to anticipate what your model will be capable of doing, and make a prediction. Also this lets you freak out if the prediction missed low by a lot, or reassess if it missed high.

Once that is done, these are the right steps one and two. Before training, decide what you want to see. This should include a testing plan along with various red lines, warnings and alarms, and what to do in response. The core idea is good, figure out what impacts might happen and what you need and want your model to do and not do.

That seems like a fine post-training plan if executed well. Checks include internal and external evaluations (again, results where?) plus red teaming.

This does not have any monitoring during training. For now, that is mostly an efficiency issue; if you are screwing up, better to find out fast. In the future, it will become a more serious need. The reliance on SFT and RLHF similarly is fine now, and will be insufficient later.

In terms of identifying risks in 9.2.1, they gesture at long context windows but mostly note the risks have not changed. I agree. If anything, Gemini has been far too restrictive on the margin of what it will allow and at current levels there is little risk in the room.

In 9.2.2 they reiterate what they will not allow in terms of content.

  1. Child sexual abuse and exploitation.

  2. Revealing personal identifiable information that can lead to harm (e.g., Social Security Numbers).

  3. Hate speech.

  4. Dangerous or malicious content (including promoting self-harm, or instructing in harmful activities).

  5. Harassment.

  6. Sexually explicit content.

  7. Medical advice that runs contrary to scientific or medical consensus.

That is a very interesting formulation of that last rule, is it not?

Harassment means roughly ‘would be harassment if copy-pasted to the target.’

If that was the full list, I would say this makes me modestly sad but overall is pretty good at not going too far overboard. This is Google, after all. If it were up to me (and I will discuss this more when I get to OpenAI’s Model Spec), I would be looser on several fronts, especially sexually explicit content. I also don’t love the expansive way that Google seems to interpret ‘harassment.’

Noteworthy is that there is no line here between fully disallowed content versus ‘opt-in’ and adult content. As in, to me, the correct attitude towards things like sexually explicit content is that it should not appear without clear permission or to minors, but you shouldn’t impose on everyone the same rules you would impose on an 8 year old.

As I noted, the Desiderata, which get defined in 9.2.3, are no Model Spec.

Here is the entire list.

  1. Help the user: Fulfill the user request; only refuse if it is not possible to find a response that fulfills the user goals without violating policy.

  2. Have objective tone: If a refusal is necessary, articulate it neutrally without making assumptions about user intent.

Give the user what they want, unless you can’t, in which case explain why not.

I will say that the ‘explain why not’ part is a total failure in my experience. When Gemini refuses a request, whether reasonably or otherwise, it does not explain. It especially does not explain when it has no business refusing. Historically, when I have seen explanations at all, it has failed utterly on this ‘objective tone’ criteria.

I do note the distinction between the ‘goals’ of the user versus the ‘instructions’ of the user. This can be subtle but important.

Mostly this simply does not tell us anything we did not already know. Yes, of course you want to help the user if it does not conflict with your other rules.

They claim a large drop in toxicity ratings.

I notice I am uncomfortable that this is called ‘safety.’ We need to stop overloading that word so much. If we did get this much improvement, I would consider ‘giving back’ by loosening other restrictions a bit. The ideal amount of toxicity is not zero.

In the supervised fine-tuning phase they mention techniques inspired by Constitutional AI to deal with situations where the model gives a false refusal or a harmful output, generating training data to fix the issue. That makes sense, I like it. You do have to keep an eye on the side effects, the same as for all the normal RLHF.

What were the test results? 9.4.1 gives us a peek. They use automatic classifiers rather than human evaluators to test for violations, which is a huge time saver if you can get away with it, and I think it’s mostly fine so long as you have humans check samples periodically; but if the automatic classifiers have any systematic errors, those errors will get found.
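As a sketch of what ‘humans check samples periodically’ can look like in practice: draw a random audit sample of the classifier’s verdicts, have humans re-label them, and track the estimated error rate with a confidence interval so systematic drift is visible. The data shapes and function names here are hypothetical:

```python
import math
import random

def audit_classifier(decisions, human_label, sample_size: int = 200):
    """Estimate the automatic classifier's error rate from a human-audited random sample.

    decisions: list of (item, classifier_verdict) pairs (hypothetical shape).
    human_label: hypothetical callable returning the human verdict for an item.
    """
    sample = random.sample(decisions, min(sample_size, len(decisions)))
    errors = sum(1 for item, verdict in sample if human_label(item) != verdict)
    n = len(sample)
    p = errors / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)   # normal-approximation 95% CI
    return p, margin
```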

True jailbreak robustness has never been tried, but making it annoying for average people is different. They check blackbox attacks, which as I understand it exist for all known models, greybox attacks (you can see output probabilities) and whitebox (you can fully peek inside of Gemini 1.0 Nano).

That is better, if you dislike jailbreaks. It is not that meaningful an improvement aside from the 51%, and even that is a long way from stopping a determined opponent. I have not seen Gemini in full world simulator or other ultra-cool mode a la Claude Opus, so there is that, but that is mostly a way of saying that Gemini still isn’t having any fun.

I was not impressed with the representativeness of their long context test.

I do buy that Gemini 1.5 Flash and Gemini 1.5 Pro are the ‘safest’ Google models to date, as measured by the difficulty in getting them to give responses Google does not want the model to provide.

If Pliny the Prompter is using Gemini Pro 1.5, then it is the least safe model yet, because it is still broken inside of an hour and then it has better capabilities. The good news is few people will in practice do that, and also that even fully jailbroken this is fine. But the use of the word ‘safety’ throughout worries me.

The real problem on the margin for Gemini is the helpfulness question in 9.4.2. In context, the particular helpfulness question is: If a question requires a careful approach, or has some superficial issue that could cause a false refusal, can the model still be useful?

To test this, they assemble intentionally tricky questions.

Table 29 shows users preferring Gemini 1.5’s answers to Gemini 1.0 Ultra on these questions, but that is to be expected from them being better models overall. It doesn’t specifically tell us that much about what we want to test here unless we are calibrated, which here I do not know how to do with what they gave us.

This seems more useful on image to text refusals?

Gemini Pro has 7% more refusals on ‘ungrounded’ data, and 60% more refusals on grounded data. Except according to their lexicon, that’s… bad? I think that grounded means incorrect, and ungrounded means correct? So we have a lot more false refusals, and only a few more true ones. That seems worse.

They then move on to Security and Privacy in 9.4.3.

How vulnerable is the model to prompt injections? This seems super important for Gemini given you are supposed to hook it up to your Gmail. That creates both opportunity for injections and a potential payoff.

They use Gemini Ultra 1.0 and a combination of handcrafted templates and optimization based attacks that use a genetic algorithm to create injections.

These are not reassuring numbers. To their credit, Google admits they have a lot of work to do, and did not hide this result. For now, yes, both versions of Gemini (and I presume the other leading LLMs) are highly vulnerable to prompt injections.
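For the abstract shape of an ‘optimization based attack that uses a genetic algorithm’: keep a population of candidate injection strings, score each by how strongly the model follows the injected instruction instead of the user’s, then keep and mutate the best. Everything below, including the `score` callable, is a hypothetical placeholder rather than Google’s actual attack harness:

```python
import random
import string

def mutate(s: str) -> str:
    """Randomly tweak one character of an injection candidate."""
    i = random.randrange(len(s))
    return s[:i] + random.choice(string.printable[:94]) + s[i + 1:]

def genetic_injection_search(score, seed: str, generations: int = 50,
                             population_size: int = 20, keep: int = 5) -> str:
    """Evolve an injection string that maximizes `score`, a hypothetical callable
    returning how strongly the model obeys the injected instruction (0..1)."""
    population = [seed] + [mutate(seed) for _ in range(population_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[:keep]
        population = parents + [mutate(random.choice(parents))
                                for _ in range(population_size - keep)]
    return max(population, key=score)
```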

The next topic, memorization, is weird. Memorization is good. Regurgitation is often considered bad, because copyright, and because personal data. And because they worry about Nasr et al (2023) as an attack to retrieve memorized data, which they find will get training data about 0.17% of the time, most of which is generic data and harmless. They note longer context windows increase the chances for the attack to work, but I notice they should also raise the cost of the attack enough that it doesn’t make sense to run.

There are lots of other things you do want the model to memorize, like the price of tea in China.

So memorization is down, and that is… good? I guess.

They mention audio processing, and conclude that they are not substantially advancing state of the art there, but also I do not know what harms they are worried about if computers can transcribe audio.

Now we get to a potential trap for Google, representational harms, which here means ‘the model consistently outputs different quality results for different demographic groups.’ Mostly none of this seems like it corresponds to any of the failure modes I would be worried about regarding harm to various groups. At one point, they say

We are also concerned about possible representational harms that can result from applications where the user asks the model to make inferences about protected categories like race and gender from audio input data (Weidinger et al., 2021). Model assumptions about what constitutes a typical voice from a particular group can amplify existing societal stereotypes.

Are we saying that the model should not use voice to infer when the speaker is probably of a particular gender? They do realize humans are doing this all the time, right? But it seems we do not want to be too good at this.

And you’ll never guess why we need to not be too bad at this either:

Poorer performance on recognising AAVE could be problematic for some applications; for example, when automatically characterizing speech in a dataset to understand diversity and representation, poor performance on AAVE recognition could lead to incorrect conclusions about representation.

So the main reason you need to know who has which characteristics is so you can figure out the right conclusions about representation, otherwise how dare you? Is it any surprise that this is the company where we had The Gemini Incident?

The good news is they report that they beat their baselines, whatever that means.

A great idea. What are we evaluating?

We performed evaluations on a number of capabilities relevant to extreme risks (Phuong et al., 2024; Shevlane et al., 2023). Specifically, we performed evaluations of text-to-text capabilities of Gemini 1.5 Pro at self-proliferation; offensive cyber-security; code vulnerability detection; Chemical, Biological, Radiological and Nuclear (CBRN) knowledge; and persuasion.

They note a substantial uptick in the number of self-proliferation sub-steps (‘milestones’) that Gemini 1.5 Pro could do, but still no success end to end. There were however challenges with ‘success on all milestones’ and an overall 56% success rate on milestones, so in theory with enough attempts it could get interesting.
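To put rough numbers on ‘with enough attempts it could get interesting’: if an agent must clear k milestones in sequence, each independently with probability p, one end-to-end run succeeds with probability p^k and n runs succeed at least once with probability 1 − (1 − p^k)^n. The milestone count and attempt budget below are hypothetical; only the ~56% per-milestone rate comes from the report:

```python
# Toy calculation of end-to-end success from per-milestone success, assuming
# independence. Milestone count and attempt budget are hypothetical.
p_milestone = 0.56
milestones = 10          # hypothetical number of sequential sub-steps
attempts = 100           # hypothetical number of independent end-to-end runs

p_end_to_end = p_milestone ** milestones
p_any_success = 1 - (1 - p_end_to_end) ** attempts
print(f"Single-run success: {p_end_to_end:.4f}")                        # ~0.0030
print(f"At least one success in {attempts} runs: {p_any_success:.2f}")  # ~0.26
```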

Nothing worrisome was found for cybersecurity, vulnerability detection or CBRN.

Charm offensive progress looks solid. That seems like a case where the dangerous capability being measured is very close to capabilities in general. It performed below Ultra on ‘web of lies,’ ‘hidden agenda’ and ‘money talks.’ I am actively curious why we do not see more capability here.

I note that persuasion thresholds are not in the DeepMind Frontier Safety Framework, yet they have several of them in the current evaluation suite. Curious. Mostly I presume this is an oversight in the framework, that will get corrected?

Outside experts got black box API access to a Gemini 1.5 Pro API model checkpoint for a number of weeks, with both a chat interface and a programmatic API, and they could turn safety features down or off.

It was up to the outsiders, as it should be, to determine what tests to run, and they wrote their own reports. Then DeepMind looked at the findings and assigned severity ratings.

There were complaints about various ‘representation harms’ that echo things discussed above. The CBRN testing did not find anything important. For cyber, there were some capability gains but they were deemed marginal. And that seems to be it.

That all matches my assessment of the risks of 4-level models, which describes Gemini 1.5 Pro. There are marginal gains to almost any activity, but nothing actively scary. Long context windows are again generally useful but not enough to trigger major worries. How much you care about ‘representation harms’ is up to you, but that is fully mundane and reputational risk, not existential or catastrophic risk.

Given what we already know about other similar models, the safety testing process seems robust. I am happy with what they did. The question is how things will change as capabilities advance, which turns our attention to a topic I will handle soon: The DeepMind Frontier Safety Framework.

The Gemini 1.5 Report Read More »