Author name: Mike M.

2024:-the-year-ai-drove-everyone-crazy

2024: The year AI drove everyone crazy


What do eating rocks, rat genitals, and Willy Wonka have in common? AI, of course.

It’s been a wild year in tech thanks to the intersection between humans and artificial intelligence. 2024 brought a parade of AI oddities, mishaps, and wacky moments that inspired odd behavior from both machines and man. From AI-generated rat genitals to search engines telling people to eat rocks, this year proved that AI has been having a weird impact on the world.

Why the weirdness? If we had to guess, it may be due to the novelty of it all. Generative AI and applications built upon Transformer-based AI models are still so new that people are throwing everything at the wall to see what sticks. People have been struggling to grasp both the implications and potential applications of the new technology. Riding along with the hype, different types of AI that may end up being ill-advised, such as automated military targeting systems, have also been introduced.

It’s worth mentioning that aside from crazy news, we saw fewer weird AI advances in 2024 as well. For example, Claude 3.5 Sonnet launched in June held off the competition as a top model for most of the year, while OpenAI’s o1 used runtime compute to expand GPT-4o’s capabilities with simulated reasoning. Advanced Voice Mode and NotebookLM also emerged as novel applications of AI tech, and the year saw the rise of more capable music synthesis models and also better AI video generators, including several from China.

But for now, let’s get down to the weirdness.

ChatGPT goes insane

Illustration of a broken toy robot.

Early in the year, things got off to an exciting start when OpenAI’s ChatGPT experienced a significant technical malfunction that caused the AI model to generate increasingly incoherent responses, prompting users on Reddit to describe the system as “having a stroke” or “going insane.” During the glitch, ChatGPT’s responses would begin normally but then deteriorate into nonsensical text, sometimes mimicking Shakespearean language.

OpenAI later revealed that a bug in how the model processed language caused it to select the wrong words during text generation, leading to nonsense outputs (basically the text version of what we at Ars now call “jabberwockies“). The company fixed the issue within 24 hours, but the incident led to frustrations about the black box nature of commercial AI systems and users’ tendency to anthropomorphize AI behavior when it malfunctions.

The great Wonka incident

A photo of the Willy's Chocolate Experience, which did not match AI-generated promises.

A photo of “Willy’s Chocolate Experience” (inset), which did not match AI-generated promises, shown in the background. Credit: Stuart Sinclair

The collision between AI-generated imagery and consumer expectations fueled human frustrations in February when Scottish families discovered that “Willy’s Chocolate Experience,” an unlicensed Wonka-ripoff event promoted using AI-generated wonderland images, turned out to be little more than a sparse warehouse with a few modest decorations.

Parents who paid £35 per ticket encountered a situation so dire they called the police, with children reportedly crying at the sight of a person in what attendees described as a “terrifying outfit.” The event, created by House of Illuminati in Glasgow, promised fantastical spaces like an “Enchanted Garden” and “Twilight Tunnel” but delivered an underwhelming experience that forced organizers to shut down mid-way through its first day and issue refunds.

While the show was a bust, it brought us an iconic new meme for job disillusionment in the form of a photo: the green-haired Willy’s Chocolate Experience employee who looked like she’d rather be anywhere else on earth at that moment.

Mutant rat genitals expose peer review flaws

An actual laboratory rat, who is intrigued. Credit: Getty | Photothek

In February, Ars Technica senior health reporter Beth Mole covered a peer-reviewed paper published in Frontiers in Cell and Developmental Biology that created an uproar in the scientific community when researchers discovered it contained nonsensical AI-generated images, including an anatomically incorrect rat with oversized genitals. The paper, authored by scientists at Xi’an Honghui Hospital in China, openly acknowledged using Midjourney to create figures that contained gibberish text labels like “Stemm cells” and “iollotte sserotgomar.”

The publisher, Frontiers, posted an expression of concern about the article titled “Cellular functions of spermatogonial stem cells in relation to JAK/STAT signaling pathway” and launched an investigation into how the obviously flawed imagery passed through peer review. Scientists across social media platforms expressed dismay at the incident, which mirrored concerns about AI-generated content infiltrating academic publishing.

Chatbot makes erroneous refund promises for Air Canada

If, say, ChatGPT gives you the wrong name for one of the seven dwarves, it’s not such a big deal. But in February, Ars senior policy reporter Ashley Belanger covered a case of costly AI confabulation in the wild. In the course of online text conversations, Air Canada’s customer service chatbot told customers inaccurate refund policy information. The airline faced legal consequences later when a tribunal ruled the airline must honor commitments made by the automated system. Tribunal adjudicator Christopher Rivers determined that Air Canada bore responsibility for all information on its website, regardless of whether it came from a static page or AI interface.

The case set a precedent for how companies deploying AI customer service tools could face legal obligations for automated systems’ responses, particularly when they fail to warn users about potential inaccuracies. Ironically, the airline had reportedly spent more on the initial AI implementation than it would have cost to maintain human workers for simple queries, according to Air Canada executive Steve Crocker.

Will Smith lampoons his digital double

The real Will Smith eating spaghetti, parodying an AI-generated video from 2023.

The real Will Smith eating spaghetti, parodying an AI-generated video from 2023. Credit: Will Smith / Getty Images / Benj Edwards

In March 2023, a terrible AI-generated video of Will Smith’s AI doppelganger eating spaghetti began making the rounds online. The AI-generated version of the actor gobbled down the noodles in an unnatural and disturbing way. Almost a year later, in February 2024, Will Smith himself posted a parody response video to the viral jabberwocky on Instagram, featuring AI-like deliberately exaggerated pasta consumption, complete with hair-nibbling and finger-slurping antics.

Given the rapid evolution of AI video technology, particularly since OpenAI had just unveiled its Sora video model four days earlier, Smith’s post sparked discussion in his Instagram comments where some viewers initially struggled to distinguish between the genuine footage and AI generation. It was an early sign of “deep doubt” in action as the tech increasingly blurs the line between synthetic and authentic video content.

Robot dogs learn to hunt people with AI-guided rifles

A still image of a robotic quadruped armed with a remote weapons system, captured from a video provided by Onyx Industries.

A still image of a robotic quadruped armed with a remote weapons system, captured from a video provided by Onyx Industries. Credit: Onyx Industries

At some point in recent history—somewhere around 2022—someone took a look at robotic quadrupeds and thought it would be a great idea to attach guns to them. A few years later, the US Marine Forces Special Operations Command (MARSOC) began evaluating armed robotic quadrupeds developed by Ghost Robotics. The robot “dogs” integrated Onyx Industries’ SENTRY remote weapon systems, which featured AI-enabled targeting that could detect and track people, drones, and vehicles, though the systems require human operators to authorize any weapons discharge.

The military’s interest in armed robotic dogs followed a broader trend of weaponized quadrupeds entering public awareness. This included viral videos of consumer robots carrying firearms, and later, commercial sales of flame-throwing models. While MARSOC emphasized that weapons were just one potential use case under review, experts noted that the increasing integration of AI into military robotics raised questions about how long humans would remain in control of lethal force decisions.

Microsoft Windows AI is watching

A screenshot of Microsoft's new

A screenshot of Microsoft’s new “Recall” feature in action. Credit: Microsoft

In an era where many people already feel like they have no privacy due to tech encroachments, Microsoft dialed it up to an extreme degree in May. That’s when Microsoft unveiled a controversial Windows 11 feature called “Recall” that continuously captures screenshots of users’ PC activities every few seconds for later AI-powered search and retrieval. The feature, designed for new Copilot+ PCs using Qualcomm’s Snapdragon X Elite chips, promised to help users find past activities, including app usage, meeting content, and web browsing history.

While Microsoft emphasized that Recall would store encrypted snapshots locally and allow users to exclude specific apps or websites, the announcement raised immediate privacy concerns, as Ars senior technology reporter Andrew Cunningham covered. It also came with a technical toll, requiring significant hardware resources, including 256GB of storage space, with 25GB dedicated to storing approximately three months of user activity. After Microsoft pulled the initial test version due to public backlash, Recall later entered public preview in November with reportedly enhanced security measures. But secure spyware is still spyware—Recall, when enabled, still watches nearly everything you do on your computer and keeps a record of it.

Google Search told people to eat rocks

This is fine. Credit: Getty Images

In May, Ars senior gaming reporter Kyle Orland (who assisted commendably with the AI beat throughout the year) covered Google’s newly launched AI Overview feature. It faced immediate criticism when users discovered that it frequently provided false and potentially dangerous information in its search result summaries. Among its most alarming responses, the system advised humans could safely consume rocks, incorrectly citing scientific sources about the geological diet of marine organisms. The system’s other errors included recommending nonexistent car maintenance products, suggesting unsafe food preparation techniques, and confusing historical figures who shared names.

The problems stemmed from several issues, including the AI treating joke posts as factual sources and misinterpreting context from original web content. But most of all, the system relies on web results as indicators of authority, which we called a flawed design. While Google defended the system, stating these errors occurred mainly with uncommon queries, a company spokesperson acknowledged they would use these “isolated examples” to refine their systems. But to this day, AI Overview still makes frequent mistakes.

Stable Diffusion generates body horror

An AI-generated image created using Stable Diffusion 3 of a girl lying in the grass.

An AI-generated image created using Stable Diffusion 3 of a girl lying in the grass. Credit: HorneyMetalBeing

In June, Stability AI’s release of the image synthesis model Stable Diffusion 3 Medium drew criticism online for its poor handling of human anatomy in AI-generated images. Users across social media platforms shared examples of the model producing what we now like to call jabberwockies—AI generation failures with distorted bodies, misshapen hands, and surreal anatomical errors, and many in the AI image-generation community viewed it as a significant step backward from previous image-synthesis capabilities.

Reddit users attributed these failures to Stability AI’s aggressive filtering of adult content from the training data, which apparently impaired the model’s ability to accurately render human figures. The troubled release coincided with broader organizational challenges at Stability AI, including the March departure of CEO Emad Mostaque, multiple staff layoffs, and the exit of three key engineers who had helped develop the technology. Some of those engineers founded Black Forest Labs in August and released Flux, which has become the latest open-weights AI image model to beat.

ChatGPT Advanced Voice imitates human voice in testing

An illustration of a computer synthesizer spewing out letters.

AI voice-synthesis models are master imitators these days, and they are capable of much more than many people realize. In August, we covered a story where OpenAI’s ChatGPT Advanced Voice Mode feature unexpectedly imitated a user’s voice during the company’s internal testing, revealed by OpenAI after the fact in safety testing documentation. To prevent future instances of an AI assistant suddenly speaking in your own voice (which, let’s be honest, would probably freak people out), the company created an output classifier system to prevent unauthorized voice imitation. OpenAI says that Advanced Voice Mode now catches all meaningful deviations from approved system voices.

Independent AI researcher Simon Willison discussed the implications with Ars Technica, noting that while OpenAI restricted its model’s full voice synthesis capabilities, similar technology would likely emerge from other sources within the year. Meanwhile, the rapid advancement of AI voice replication has caused general concern about its potential misuse, although companies like ElevenLabs have already been offering voice cloning services for some time.

San Francisco’s robotic car horn symphony

A Waymo self-driving car in front of Google's San Francisco headquarters, San Francisco, California, June 7, 2024.

A Waymo self-driving car in front of Google’s San Francisco headquarters, San Francisco, California, June 7, 2024. Credit: Getty Images

In August, San Francisco residents got a noisy taste of robo-dystopia when Waymo’s self-driving cars began creating an unexpected nightly disturbance in the South of Market district. In a parking lot off 2nd Street, the cars congregated autonomously every night during rider lulls at 4 am and began engaging in extended honking matches at each other while attempting to park.

Local resident Christopher Cherry’s initial optimism about the robotic fleet’s presence dissolved as the mechanical chorus grew louder each night, affecting residents in nearby high-rises. The nocturnal tech disruption served as a lesson in the unintentional effects of autonomous systems when run in aggregate.

Larry Ellison dreams of all-seeing AI cameras

A colorized photo of CCTV cameras in London, 2024.

In September, Oracle co-founder Larry Ellison painted a bleak vision of ubiquitous AI surveillance during a company financial meeting. The 80-year-old database billionaire described a future where AI would monitor citizens through networks of cameras and drones, asserting that the oversight would ensure lawful behavior from both police and the public.

His surveillance predictions reminded us of parallels to existing systems in China, where authorities already used AI to sort surveillance data on citizens as part of the country’s “sharp eyes” campaign from 2015 to 2020. Ellison’s statement reflected the sort of worst-case tech surveillance state scenario—likely antithetical to any sort of free society—that dozens of sci-fi novels of the 20th century warned us about.

A dead father sends new letters home

An AI-generated image featuring Dad's Uppercase handwriting.

An AI-generated image featuring my late father’s handwriting. Credit: Benj Edwards / Flux

AI has made many of us do weird things in 2024, including this writer. In October, I used an AI synthesis model called Flux to reproduce my late father’s handwriting with striking accuracy. After scanning 30 samples from his engineering notebooks, I trained the model using computing time that cost less than five dollars. The resulting text captured his distinctive uppercase style, which he developed during his career as an electronics engineer.

I enjoyed creating images showing his handwriting in various contexts, from folder labels to skywriting, and made the trained model freely available online for others to use. While I approached it as a tribute to my father (who would have appreciated the technical achievement), many people found the whole experience weird and somewhat disturbing. The things we unhinged Bing Chat-like journalists do to bring awareness to a topic are sometimes unconventional. So I guess it counts for this list!

For 2025? Expect even more AI

Thanks for reading Ars Technica this past year and following along with our team coverage of this rapidly emerging and expanding field. We appreciate your kind words of support. Ars Technica’s 2024 AI words of the year were: vibemarking, deep doubt, and the aforementioned jabberwocky. The old stalwart “confabulation” also made several notable appearances. Tune in again next year when we continue to try to figure out how to concisely describe novel scenarios in emerging technology by labeling them.

Looking back, our prediction for 2024 in AI last year was “buckle up.” It seems fitting, given the weirdness detailed above. Especially the part about the robot dogs with guns. For 2025, AI will likely inspire more chaos ahead, but also potentially get put to serious work as a productivity tool, so this time, our prediction is “buckle down.”

Finally, we’d like to ask: What was the craziest story about AI in 2024 from your perspective? Whether you love AI or hate it, feel free to suggest your own additions to our list in the comments. Happy New Year!

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

2024: The year AI drove everyone crazy Read More »

tv-technica-2024:-our-picks-for-the-best-of-tv

TV Technica 2024: Our picks for the best of TV


From wacky crime capers and dystopian video game adaptions to sweeping historical epics, 2024 had a little of everything

Credit: Aurich Lawson | Getty Images

Editor’s note: Warning: Although we’ve done our best to avoid spoiling anything major, please note this list does include a few specific references to several of the listed shows that some might consider spoiler-y.

This was another good year for television, with established favorites sharing space on our list with some intriguing new shows. Really, 2024 had a little of everything, from wacky crime capers (Bad Monkey) and Satanic Panic (Hysteria) to dystopian video game adaptations (Fallout) and sweeping historical epics (Shōgun), with plenty of genre-mashup delights in between. While streaming platforms continue to dominate, the selection is more evenly distributed across them this year, with only Hulu and Netflix snagging more than two slots (depending on whether or not you lump Hulu together with Disney+ after the merger).

As always, we’re opting for an unranked list, with the exception of our “year’s best” vote at the very end, so you might look over the variety of genres and options and possibly add surprises to your eventual watchlist. We invite you to head to the comments and add your own favorite TV shows released in 2024.

Interior Chinatown (Hulu)

Credit: Hulu

This meta action comedy is showrunner Charles Yu’s adaption of his own 2020 satirical novel of the same name, which employed the screenplay format as a narrative structure. Interior Chinatown keeps that concept; here, the characters are, in turn, characters in a police procedural called Black and White, clearly modeled on the Law and Order franchise.

Jimmy O. Yang plays Willis Wu, a waiter in a Chinese restaurant who is initially unaware that he is just a background character on the show within the show. Then he witnesses a kidnapping and detectives Sarah Green (Lisa Gilroy) and Miles Turner (Sullivan Jones) are called in to investigate. They actually can’t see or hear Willis—or any background character, for that matter—unless he happens to have a purpose to the spotlight action. So Willis and Chinatown’s residents are just going about their business and every now and then the spotlight flashes on and Green and Turner saunter through for a “scene.”

As Willis tries to solve the case of his missing older brother with the help of supporting character Detective Lana Lee (Chloe Bennet), he uncovers a possible criminal underground enterprise in Chinatown and some well-kept family secrets. The writing is clever, the plot twists abound, the characters are fully drawn, and there are plenty of humorous and heartfelt moments to break up the main action. Special shout-out to Ronny Chieng as Willis’ best friend Fatty, who has to take over Willis’ waiter duties and inadvertently becomes a viral sensation with his rude outbursts directed at non-Asian customers. White people actually start flocking to the restaurant to be verbally abused by “Mean Waiter,” much to Fatty’s exasperation. It’s those kinds of unexpected twists that make Interior Chinatown unique.

Jennifer Ouellette

The Penguin (Max)

Credit: HBO/Max

My pick for the best television show in 2024 is a limited series based on the Batman universe character: The Penguin. It’s a sequel of sorts to The Batman film released in March 2022. The best way to describe the series, I believe, is The Sopranos comes to Gotham, but with even more grit and atmosphere. Batman is not involved at all. Colin Farrell plays the Penguin, whose real name is Oz Cobb, and who is struggling to rise to power in the fictional city’s criminal underworld. Viewers have to struggle to recognize Farrell, who is acting a tour de force beneath some pretty involved prosthetics and makeup.

The other standout performer is Cristin Milioti, who plays a presumed psychopathic serial killer but, well, I don’t want to spoil it. She’s fabulous. The whole show is amazing, actually, and I’m not normally one for comic book movies or television. I don’t recommend binging it but rather sipping each of the eight episodes as if it were fine wine.

Eric Berger

Sweetpea (Starz)

Credit: Starz

Are killer psychopaths born or made? One might ponder that question after watching Sweetpea, the story of a shy young woman who has been bullied or ignored much of her life and finally snaps, with fatal consequences. Based on the novel by CJ Skuse, the series stars Ella Purnell (of Yellowjackets fame) as Rhiannon, an administrative assistant at her local newspaper who lives with her ailing dad and dog. But then everything goes wrong at once: her father dies, her dog is run over, and her sister insists on selling the family home, forcing Rhiannon to find a new place—and the estate agent is Rhiannon’s high school nemesis, Julia (Nicole Lecky).

Sweetpea is essentially a revenge fantasy. It would be so easy for the viewer to just become exasperated with Rhiannon’s passivity and occasional self-pitying rants, but Purnell’s intense performance brings out the rage and violence simmering underneath that quiet surface. Rhiannon really is invisible to most people, brought home when she takes refuge from the rain at an underpass and a passing drunk guy ends up peeing all over her. “Oh, sorry, didn’t see you there,” he shrugs. It’s a powerful moment when an enraged Rhiannon stabs this complete stranger over and over, screaming, “Can you see me now?” Along with the guilt and fear come increased confidence and strength, and maybe even a love interest—but can Rhiannon really get away with murder?

Jennifer Ouellette

Matlock (CBS)

Credit: CBS

Children of the late ’80s/early ’90s will no doubt have fond memories of the popular mystery series/legal drama Matlock, starring Andy Griffith in the title role of Andy Matlock, criminal defense attorney. We now have a gender-flipped version starring Kathy Bates, but it’s not a remake. Rather, Bates plays a wealthy retired lawyer named Madeleine Kingston who goes undercover as a legal assistant at a large law firm, taking on the alias surname of Matlock because the show was one of her deceased daughter’s favorites.

Matty’s objective: to find evidence that the firm covered up the fact that an opioid manufactured by one of their pharmaceutical giant clients was highly addictive, thereby contributing to her daughter’s death by opioid overdose. But first she has to prove herself by helping win several smaller cases, all while juggling a dual identity and nosing around the firm’s files on the sly. Honestly, it’s refreshing to see a simple, case-of-the-week (with a longer season arc) network series with likable characters, good writing, and strong performances throughout the cast. It’s a winning combination that makes Matlock the perfect comfort watch.

Jennifer Ouellette

Star Trek: Lower Decks S5 (Paramount+)

Credit: Paramount+

The animated adventures of the crew of the USS Cerritos had its fifth and final season this year. Set in a post-Voyager, pre-Picard timeframe, which for many is Starfleet in its golden era, it was initially dismissed by some as “Rick and Morty in space” due to previous work from its creators. But over the past five seasons, Lower Decks has proven to be Star Trek through and through—just animated and also funny. And quite bawdy.

It’s fair to say that there’s a lot of fan service in Lower Decks, but also that this fan likes what he’s being served. Deep cuts abound, from across decades of Trek canon, and like previous seasons, a number of guest stars show up in episodes, including Brent Spiner (Data in The Next Generation), Alfre Woodard (Lily Sloan in First Contact), Andrew Robinson (Elim Garak in Deep Space 9), Alexander Siddig (Julian Bashir in DS9), Jolene Blalock (T’Pol in Enterprise), and Garrett Wang (Harry Kim in Voyager). But perhaps not entirely as you might expect them—the overarching plot this season involves rifts being opened to parallel universes in the multiverse, potentially destroying them all.

At the time of writing the final episode has yet to air, but the one that precedes it (“Fissure Quest”) is Lower Decks at its very finest. It paints Starfleet at its optimistic best, pokes a whole bunch of nostalgia buttons, and makes me laugh repeatedly. Not every episode in season 5 has been quite as good, but it wouldn’t be real Trek if there wasn’t the occasional miss.

It is within the realms of possibility that Paramount+’s cancellation will not be the end for the show, with fans hoping another streaming network could pick it up, the way that Netflix took over showing the second season of Star Trek: Prodigy that Paramount wanted to shelve. Right now, the odds of that happening are pretty remote, though, and in the spirit of optimism best embodied by Mariner and her gang, let’s not be sad it’s over, let’s be happy it happened.

Jonathan Gitlin

The Cowboy Channel’s “Texas Swing”

Credit: Steve Wrubel/The Cowboy Channel

My personal end-of-year TV list would never be complete without a nod to The Cowboy Channel, i.e., the only place where you can follow your favorite cowboys and cowgirls throughout the rodeo season as they compete to rack up enough wins to qualify for the Wrangler National Finals Rodeo (NFR) in Las Vegas in December. Seven years after its founding, The Cowboy Channel has clearly had a significant impact on boosting the visibility of the sport, as well as attendance at live rodeo events.

This year, we’re focusing on the so-called “Texas Swing”: five major rodeos in the Lone Star state running from the end of January through mid-April, in Fort Worth, San Antonio, Houston, Austin, and San Angelo. The rodeo season runs year-round, officially from October 1 through September 30. But the Texas Swing collectively pays out several million dollars, giving athletes a chance to take an early lead in the rankings. (Most event winners at the Houston rodeo in particular typically end up qualifying for the NFR.) So there’s a lot at stake, and The Cowboy Channel’s extensive coverage and commentary is essential viewing for following those stakes.

Jennifer Ouellette

The Lincoln Lawyer S3 (Netflix)

Credit: Netflix

Crime novel publishing juggernaut Michael Connelly already had one great TV series to his name, based on fictional detective Harry Bosch (the eponymous Bosch). Then Netflix developed The Lincoln Lawyer, based on Connelly’s criminal defense attorney Mickey Haller, starring Manuel Garcia-Rulfo in the title role. The nickname comes from the fact that Mickey often works out of his Lincoln Navigator. (There was also a 2011 film adaption, The Lincoln Lawyer, starring Matthew McConaughey, but the two projects are very different.)

Season 3 was based on Connelly’s 2013 novel, The Gods of Guilt, and it’s adapted exceptionally well for television. As always, Garcia-Rulfo is terrific as Haller, surrounded by a top-notch supporting cast, notably Mickey’s legal aide and ex-wife, Lorna (Becki Newton), freelance investigator Cisco (Angus Sampson), and Izzy (Jazz Raycole), a former client who becomes Mickey’s personal driver (and later the office manager). But as with Bosch, it’s the city of Los Angeles that truly shines, a character in its own right, always lurking in the background.

Jennifer Ouellette

True Detective: Night Country (HBO)

Credit: HBO

HBO’s True Detective, created by Nic Pizzolatto, was a pop-culture sensation when it debuted in 2014. (Remember “time is a flat circle”?) Its sophomore outing lacked the original’s surreal magic, but S3 was a solid return to form, mixing elements of noir and procedural drama to weave a haunting tale of fractured time and memory. Pizzolatto wasn’t involved in this year’s even stronger fourth season, subtitled Night Country, with Issa Lopez taking over as showrunner. Lopez has reinvented the series, creating what she viewed as a “dark mirror” to Pizzolatto’s three seasons that stands on its own.

Night Country is set in the fictional town of Ennis, Alaska, where eight scientists at a research station mysteriously go missing one night with no trace, leaving a severed tongue at the scene. They are found soon after out on the ice, naked bodies tangled and frozen together in a pile, with their clothes neatly folded on the snow. It’s up to Detectives Liz Danvers (Jodie Foster) and Evangeline Navarro (Kali Reis) to crack the case. Night Country only tangentially evokes the Yellow King mythology of the prior three seasons,  but it does capture the anthology series’ essential spookiness and supernatural undertones despite the all-too-human solution to the case.

Jennifer Ouellette

Only Murders in the Building S4 (Hulu)

Credit: Hulu

This charming Emmy-nominated comedy series has made our “Best of TV” list every season, and 2024 is no exception. Only Murders in the Building stars Steve Martin, Martin Short, and Selena Gomez as Charles, Oliver, and Mabel, all residents of the same Manhattan apartment complex, the Arconia. The unlikely trio teamed up to launch their own true crime podcast whenever someone died in the building under suspicious circumstances, chronicling their independent investigation to solve the murder. There’s no shortage of podcast fodder since this single building has a shockingly high murder rate.

This time around, the trio investigates the death of Charles’ longtime stunt double Sazz (Jane Lynch), who was shot dead in his apartment while he and his pals were celebrating wrapping the prior podcast season. It’s a complicated mystery involving the strange residents of the Arcadia’s West Tower, a bar specifically for stunt performers, and a film adaption of the trio’s first-season podcast. Eugene Levy, Zach Galifianakis, and Eva Longoria play fictional versions of themselves cast as Charles, Oliver, and Mabel, respectively, and naturally get into the sleuthing spirit. And Meryl Streep makes a welcome return as Oliver’s actress girlfriend Loretta.

This season was a bit more meta than the prior three, largely because so much of the action shifts to Hollywood for several episodes—every episode title is a reference to an actual film—as well as a foray to Long Island to hide out with Charles’ sister Doreen (Melissa McCarthy). That served to keep things fresh after four seasons; S5 will focus on the death of the building’s doorman, found floating in the Arcadia’s fountain in the season finale. OMITB will eventually run out of fresh takes on its clever concept, but it hasn’t done so yet.

Jennifer Ouellette

The Sticky (Prime Video)

Credit: Prime Video

Perhaps you’ve heard of the infamous Great Canadian Maple Syrup Heist of 2011–2012, in which a group of thieves managed to steal over $18 million worth of maple syrup from a strategic reserve storage facility in Quebec. If not, you’ll probably find yourself googling it after watching The Sticky, a delightfully dark comic series very (very!) loosely based on the heist.

Margo Martindale plays Ruth Landry, a struggling maple syrup farmer who is about to lose her farm to the greedy head of the collective, Leonard (Guy Nadon). So she conspires with shady businessman Mike (Chris Diamantopoulos) and security guard Remy (Guillaume Cyr) to steal millions of dollars of maple syrup in revenge. Their elaborate plan soon hits all kinds of darkly hilarious snags that lead to more serious repercussions. With its flinty, morbid humor and collection of eccentric characters—including a star turn by Jamie Lee Curtis as a tough mafia enforcer named Bo Shea—The Sticky is definitely channeling the Cohn brothers’ Fargo. But series creators Brian Donovan and Ed Herro have added their unique stamp to make it very much their own.

Jennifer Ouellette

St. Denis Medical (NBC)

Credit: NBC

Just when we thought smart, sophisticated network sitcoms were a disappearing relic from the golden age of broadcast television, NBC comes out with St. Denis Medical, a thoroughly enjoyable mockumentary in the style of The Office and Parks and Recreation. Here the setting is the chronically underfunded ER of an Oregon hospital, and the camera crew follows the overworked doctors and nurses as they go about their daily duties.

You’ve got Joyce (Wendi McLendon-Covey), the ambitious executive director; Alex (Allison Tolman), the workaholic supervising nurse; Bruce (Josh Lawson), a cocky trauma surgeon; burnt-out emergency physician Ron (David Alan Grier); newly hired nurse Matt (Mekki Leeper), who hails from a strict religious group in Montana; and his crush, the cool and capable nurse, Serena (Kahyun Kim). The format may be familiar, but the show nonetheless feels fresh, thanks to top-notch writing and performances from its talented cast.

Jennifer Ouellette

Yellowstone (Paramount+)

Credit: Paramount+

Series creator Taylor Sheridan first pitched this neo-Western drama as “The Godfather in Montana”; one might also think of it as Succession on a ranch. It follows the members of the Dutton family, led by patriarch John Dutton III (Kevin Costner), as they struggle to preserve their massive Montana cattle ranch: the titular Yellowstone. One source of tension is that the ranch shares borders with the Broken Rock Indian Reservation. But the biggest threats come from billionaire corporate land developers and scheming local government officials, eager to get their greedy hands on all that gorgeous acreage to build casinos, resort hotels, golf courses, and the like.

Yellowstone is basically a nighttime soap and a particularly good one, thanks to strong, complex characters and their interrelationships/intense personal conflicts. The Duttons are not good people, but nor are they entirely evil, despite doing many evil deeds—often for a good cause but not always. (The body count at the “train station” alone would get them multiple life sentences.) Seasons 2–4 represent the series at its peak. Alas, Costner left after the first part of S5; the second half unceremoniously kills off his character at the start of the first episode, with the remaining season dealing with the messy aftermath of what turns out to be an assassination.

I’ll be frank: Without Costner as an anchor, the second part of S5 just wasn’t as strong as prior seasons or even the first half of S5. There are several plot holes, extra scenes clearly included just to boost Sheridan’s ego, and the dialogue has become overly preachy and didactic—almost as annoying as Aaron Sorkin’s mini-sermons in later seasons of The West Wing, which is saying something.

Still, the show gave us one heck of a full-blown revenge fantasy finale. Yellowstone makes this year’s list because the series as a whole—and its supremely talented cast—deserves a fitting farewell for the top-notch entertainment it’s provided since 2018. Sure, there are spinoffs, including one in development featuring fan favorites Beth Dutton (Kelly Reilly) and Rip Wheeler (Cole Hauser). But there will never be anything quite like the OG.

Jennifer Ouellette

Interview with the Vampire S2 (AMC)

Credit: AMC

Anne Rice’s bestselling 1976 gothic horror novel gets a fresh adaptation for television that is markedly different in many ways from the 1994 film adaptation. The main character, Louis (Jacob Anderson), is reimagined as a mixed-race Creole pimp in New Orleans’ red light district rather than a white plantation owner. And child vampire Claudia (Delainey Haines in S2) is now 14 instead of a 5-year-old. The first season covered the novel’s first half, in which Louis shares, in flashbacks, what happened between him and the enigmatic vampire Lestat (Sam Reid) with journalist Daniel Molloy (Eric Bogosian).

That proved to be a toxic relationship that ended badly, with Louis and Claudia nearly killing Lestat and running away to Europe. In S2, they join up with a vampire coven in Paris led by the vampire Armand (Assad Zaman), hoping they have found a stable home. But it turns out the coven’s founder was none other than Lestat, putting that newfound family at risk. It’s hard to go wrong with Rice’s captivating storyline and unforgettable characters, especially with such strong performances from the main cast and evocative settings, bringing the written page to vivid life.

—Jennifer Ouellette

Bodkin (Netflix)

Credit: Netflix

This British satirical dark comedy features an American podcaster named Gilbert (Will Forte), who travels to a small Irish coastal town called Bodkin to record an investigative podcast about the disappearance of three people decades earlier during a Samhain festival. He’s aided by his assistant, aspiring journalist Emmy (Robyn Cara), and by a veteran Irish investigative journalist, Dove (Siobhan Cullen), on assignment in exile from London after a story falls apart when her whistleblower source unexpectedly dies.

Naturally the locals are not thrilled about podcasters digging up the past, but over seven episodes, Gilbert and his team not only solve the cold case, they heal a few long-standing emotional wounds in the process. The humor is more sly and subtle than, say, Only Murders in the Building, but Bodkin is nonetheless a quirky gem of a series, with a colorful setting filled with likable, eccentric characters. It’s worth a watch.

Jennifer Ouellette

My Lady Jane (Prime Video)

Credit: Prime Video

The tragic fate of Lady Jane Grey, aka the Nine Days’ Queen, is well-known to aficionados of English history. Named as Edward VI’s successor, she was named queen while still a teenager but quickly deposed in favor of Edward’s Catholic half-sister Mary; Jane was eventually executed. The historical fantasy series My Lady Jane offers an alternative scenario where Jane (Emily Bader) avoids that fate with the help of her eventual husband, Lord Guilford Dudley (Edward Bluemel), and humans who can take animal form known as Ethians. Ordinary humans are known as Verity, and the two sects are explicitly forbidden from mixing.

There is no sense in which My Lady Jane is intended as an accurate historical portrayal of 16th century England; the deliberate anachronisms alone are a testament to that, not to mention the constant presence of magic. Instead, showrunner Gemma Burgess has put together an irreverent engaging romp rife with witty banter, political intrigue, and a bit of derring-do—not to mention a killer soundtrack. Sadly, Amazon canceled the series after one season, much to the dismay of fans, including George R.R. Martin. So we won’t find out if Jane eventually figures out how to reclaim her throne from the scheming Mary.

Jennifer Ouellette

Renegade Nell (Disney+)

Credit: Disney+

Award-winning British TV writer Sally Wainwright is best known for the dramatic series Happy Valley (2014–2023) and Gentleman Jack (2019–2022), the latter produced jointly by BBC and HBO. Wainwright partnered with Disney+ for her latest series, the resolutely PG-13 Renegade Nell, which is a different beast altogether: a good old-fashioned, swashbuckling comic adventure with a supernatural twist, featuring a sassy cross-dressing heroine forced to turn to highway robbery to survive.

Set in 1705 during the reign of Queen Anne, the series stars Louisa Harland (Derry Girls) as Nell Jackson, widowed and possessed of occasional supernatural skills whenever someone threatens her, courtesy of a fairy sprite named Billy Blind (Nick Mohammed). Nell runs afoul of the louche, drunken offspring of the town’s landlord, things escalate, and Nell finds herself on the run and framed for murder, along with her two sisters, Roxy (Bo Bragason) and George (Florence Keen), and the Blanchefords’ former groomsman, Rasselas (Enyi Okoronkwo). The group gets further assistance from a charming aristocratic dandy/secret highwayman named Charles Devereaux (Frank Dillane).

The writing, pacing, and production values are top-notch, and the cast is terrific across the board. Renegade Nell keeps the action flowing and wisely never takes itself too seriously. Sure, there is injustice, class warfare, and strong intelligent women chafing within the strict confines of traditional binary gender roles. But Wainwright never lets the story get bogged down in heavy-handed symbolism or didacticism. Sadly, Disney+ canceled the series, but this one season stands just fine on its own.

Jennifer Ouellette

The Decameron (Netflix)

Credit: Netflix

Let’s get one thing straight: Netflix’s The Decameron has almost nothing to do with Boccaccio’s 14th century collection of stories, apart from the title and being set in the middle of the Black Plague in Florence. The main characters retreat to a secluded villa as bodies mount in the city, but they don’t sit around telling stories. They become the stories: mistaken identity, forbidden desires, illicit trysts, arranged marriages, and maybe even true love all factor into the plot, such that it is. And, of course, they must fend off others also fleeing the plague, including a ruthless band of mercenaries intent on taking over their villa.

Series creator Kathleen Jordan has put together a terrific cast with impeccable comic timing, well up to the task of playing into some pretty dark humor—death by Black Plague isn’t pretty, yet somehow you’ll find yourself chuckling about it. The Decameron is original, smartly silly, and quite unapologetically bawdy, making it a refreshing addition to the TV comedy landscape.

Jennifer Ouellette

Get Millie Black (HBO)

Credit: HBO

In the mood for an edgy British crime series? HBO has you covered with Get Millie Black, starring Tamara Lawrance as a Jamaican-born detective who gets kicked out of Scotland Yard and finds herself back home, working a missing persons case with the Jamaican Police Force. That brings her and partner Curtis (Gershwyn Eustache) into conflict with the wealthy and powerful ruling family of Kingston, with a possible connection to the London case that led to Millie’s ouster from the Yard.

She’s also dealing with her estranged transgender sister, Hibiscus (Chyna McQueen), who insists on living in a slum area called the Gully, and trying to navigate her love life. Get Millie Black is a good, meaty procedural with a compelling lead, but what really makes the series is the authentic Jamaican setting, and the way the viewer is effortlessly immersed in the local dynamics and cultural/political tensions of Kingston—right down to the dialect (HBO has helpfully provided subtitles, which do come in handy at times).

Jennifer Ouellette

Cursed Gold (National Geographic/Disney+)

Credit: Recovery Limited Partnership Liquidating Trust

Many people dream of finding lost or hidden treasure, but sometimes realizing that dream turns out to be a nightmare. Such was the case for Tommy Thompson, an American treasure hunter who famously beat the odds to discover the location of the SS Central America shipwreck (aka the “ship of gold”) in 1988. Thompson and his team recovered significant amounts of gold and artifacts to great fanfare, but the euphoria proved short-lived. His many travails make the perfect fodder for National Geographic’s riveting three-part documentary about Thompson’s spectacular rise and precipitous fall: Cursed Gold: A Shipwreck Scandal, based on a 1998 book by Gary Kinder.

Director Sam Bedstead read Kinder’s book and wanted to tell his own version of Thompson’s story, including everything that happened after the book was published. A lot happened, including Thompson panicking and going on the run in 2012, stashing some $4 million in offshore accounts. (Thompson is currently in prison for contempt of court.) Bedstead combed through over 700 pages of court transcripts and more than 600 hours of archival footage from the original salvage expedition to make Cursed Gold, as well as conducting follow-up interviews with many of the relevant parties. The end result is a documentary that plays like a thriller, with Thompson as the semi-tragic figure at the center.

Jennifer Ouellette

Moonflower Murders (PBS)

Credit: PBS

This is the follow-up to 2023’s delightful Magpie Murders, in which literary editor Susan Ryland (Lesley Manville) solved the murder of her bestselling author Alan Conway (Conleth Hill) and located the missing final chapter of Conway’s last manuscript—which just happened to be crucial to identifying the killer. She was helped along the way by recurring imaginary conversations with Conway’s star detective Atticus Pund (Timothy McMullan), playing out Conway’s final fictional mystery alongside Susan’s real investigation.

It was clever gimmick that made for a delightful series, and Moonflower Murders gives us more of the same story-within-a-story framework. Susan is now semi-retired and living in Crete with fiancé Andreas (Alexandros Logothetis) as they struggle to revive the fortunes of the hotel Andreas purchased. She is approached by a hotelier couple whose daughter Cecily has gone missing.  Cecily called them after reading one of Conway’s Atticus Pund novels and said the wrong man had been jailed for a murder that took place at the couple’s hotel eight years earlier.

The solution is hidden somewhere in the novel, because Conway had a habit of using thinly veiled people and events from real life. Susan must track down what happened to Cecily with Imaginary Atticus by her side once again, offering helpful insights. And maybe she’ll figure out what really happened with the hotel murder and exonerate an innocent man. Moonflower Murders is the perfect comfort watch for a long, lazy weekend.

Jennifer Ouellette

Agatha All Along (Disney+)

Credit: Disney+

The MCU’s foray into streaming television has produced mixed results, but one of my favorites is the weirdly inventive, oh-so-meta WandaVision. I’m happy to report that the spinoff sequel, Agatha All Along, taps into that same offbeat creativity, giving us a welcome reminder of just how good the MCU can be when it’s firing on all storytelling cylinders.

We find Agatha Harkness (Katherine Hahn) still under Wanda Maximoff’s original spell as a nosy neighbor in a small town. A mysterious young Teen (Joe Locke) breaks the hex and asks her to show him the way to the legendary Witches’ Road, a journey involving a series of trials. The reward: any surviving witches get what they most desire. Agatha wants her powers back—and Teen, well, his motives are murkier, as is his identity. Rounding out the coven are Lilia (Patti LuPone), a divination witch; Jennifer (Sasheer Zamata), a potions witch; Alice (Ali Ahn), a protection witch; and Sharon Davis (Debra Jo Rupp, reprising her WandaVision role) standing in for a green witch on account of her gardening skills. Agatha is also being pursued by her ex, Rio Vidal (Aubrey Plaza), a powerful green witch, as well as the Salem Seven, vengeful wraiths of Agatha’s first coven.

A large part of WandaVision‘s delight came from the various sitcom styles featured in each episode. Agatha All Along has its own take on that approach: Each trial takes on the setting and style of witches from popular culture (even the ending credits play on this). And the seventh episode, “Death’s Hand in Mine,” focusing on Lilia and a deadly tarot reading, might just be the best single episode of all the Marvel TV series to date. In my review, I questioned one creative choice in the series finale, which didn’t quite work for me.  On the whole, though, Agatha All Along is marvelously entertaining, binge-able fun with just enough emotional resonance and heartbreak to give it a bit of depth.

Jennifer Ouellette

Hysteria (Peacock)

Credit: Peacock

Hysteria is a show about a small US town in the’ 80s that descends into paranoia, fear, and a hive-mind-like frenzy after a high school boy goes missing in the ’80s. I came to the show for Bruce Campbell, hoping for another tale of horror with a dark comedy twist à la Evil Dead. But I ended up staying for a standout, memorable cast and a fascinating dive into how hive minds form amid uncertainty and danger.

Things get more interesting when a trio of outcast teens pretend to be Satanists to get people interested in their rock band. The falsehood simultaneously makes the children more popular in their school and pariahs in their town, as people suspect that they had something to do with the missing boy. Strong acting and personable deliveries from all three actors (Emjay Anthony as Dylan, Kezii Curtis as Spud, and, especially, Chiara Aurelia as Jordy) kept me pressing play as the teens toed the blurring line between their lie and their reality.

Their classmates, who Dylan desperately wants to impress, are also captivating. At times, you may actually find yourself rooting for the popular girl or jock, who turn out to have darker inclinations and more layers than their typical stereotypes. In fact, no character, including Dylan’s parents (Julie Bowen from Modern Family and voice actor and Port Charles star Nolan North) and Christian mother Tracy (Anna Camp, True Blood), are what they seem.

While maintaining a quirky and mysterious air, the show yields questions like, is a cult real if its creators are pretending but its followers aren’t? What can leave adults vulnerable to that special flavor of panic that makes them question their own families, faith leaders, and even the genre of rock and roll? And how easy is it for people to fall victim to mass hysteria when confronted with real-life peril, confusing phenomena, and a relatable desire to be part of something, which doesn’t go away after high school?

Toss in some classic rock and roll and appearances from an extraordinary entity, and you have a one-of-a-kind comedy horror that doesn’t go where you expect but brings you on a hell of a ride that’ll make you wonder if you, too, might have been part of the hysteria.

And yes, Campbell does deliver.

Scharon Harding

Rivals (Hulu)

Credit: Hulu

This is an adaptation of a 1988 Jilly Cooper novel of the same name. It’s a contemporary spin, although the series is still set in 1980s England in the Cotswolds region, which makes for a super fun soundtrack packed with ’80s nostalgia. The central rivals are Tony, Lord Baddingham (David Tennant), a nouveau riche managing director of a TV station who married into nobility, and Rupert Campbell-Black (Alex Hassell), a retired Olympics show jumper and incorrigible womanizer who represents aristocratic class and old money. But there’s plenty of scheming and cattiness and class warfare to go around among the rest of the colorful ensemble cast.

It’s nice to see Tennant sink his teeth into such a villainous role, and he’s well-matched against Hassell, who is the perfect charming louche with just enough lingering shreds of humanity to occasionally do something decent. Rivals is a briskly paced and positively addictive British romp with plenty of scandalous twists, lusty ribald humor, and more serious notes of genuine pain lurking beneath the frothy surface. You end up really caring about the characters, even the more dastardly ones—a tribute to the stellar cast and writing.

Jennifer Ouellette

Monsieur Spade (AMC)

Credit: AMC

Along with Raymond Chandler, Dashiell Hammet‘s legendary private detective Sam Spade pretty much defined noir crime fiction in the 1930s. But what happens when the hard-boiled detective gets old and longs for a peaceful retirement? That’s the premise behind Monsieur Spade, starring Clive Owen as a middle-aged Spade who has left his past behind for a peaceful life in the small French town of Bozouls in the 1960s.

Spade is mourning the loss of his wife, Gabrielle (Chiara Mastroianni), who thoughtfully left him her estate so he could continue his life of leisure. That quiet existence is shattered by the brutal murder of six beloved nuns in the nearby convent. They had been caring for Spade’s rebellious teenage ward, Teresa (Cara Bossom), whose life may now be in danger due to the shenanigans of her criminal biological father. Spade must find the fortitude for one last case, digging up secrets many in the town would prefer to stay buried, and face off against an old adversary. Owen makes a terrific older Spade, all craggy features and rasping voice. It’s a good, twisty thriller with great characters and a satisfying conclusion, very much in the spirit of the original.

Jennifer Ouellette

Slow Horses S4 (Apple TV+)

Credit: Apple TV+

Four seasons in and counting, there is still no better spy thriller on TV these than this always riveting British spy thriller, based on the “Slough House” series of novels by Mick Herron, and it just keeps getting better. Slough House is basically a dead-end administrative purgatory for MI5 agents who screw up or otherwise fall short of expectations, mockingly derided as the “slow horses” of the title. Slough House is headed by the slovenly, flatulent, and frequently intoxicated Jackson Lamb (Gary Oldman), who routinely heaps verbal abuse on his staff but is nonetheless a brilliant spymaster in his own smelly way.

S4 kicks off with a suicide bomber striking a London shopping mall, whose name turns out to be that of an MI5 “cold body” (fake identity). There is also an assassination attempt against retired senior MI5 officer David Cartwright (Jonathan Pryce), grandfather to slow horse River Cartwright (Jack Lowden), and the two events might just be connected. Slow Horses has already been renewed for more seasons and why not? It’s just as taut, thrilling, sardonically humorous, and occasionally heartbreaking as ever, with no sign of flagging.

Jennifer Ouellette

Light Shop (Hulu)

Credit: Hulu

I’m not sure I’ve ever seen anything quite like Light Shop, an eerily haunting Korean horror mystery adapted from a popular webtoon by Kang Full. Ju Ji-Hoon stars as Jung Won-Young, the enigmatic owner of the titular light shop, located at the end of a foreboding dark alley. Various strangers are drawn to the light shop, perhaps because it’s not just a place to purchase bulbs; it’s also a nexus connecting the worlds of the living and the dead. Won-Young is able to discern which is which—and which of the lost souls that wander into his shop might just be trapped between the two worlds.

There’s the young man on a bus who keeps seeing the same mysterious woman sitting on the bench at his stop, until he finally invites her home and quickly realizes she’s not what she seems. There’s a screenwriter who moves into a new house and discovers that it might be haunted; a young schoolgirl who comes by the shop every day for her mother and yet never seems to buy any bulbs; a middle-aged man who wanders aimlessly through the alley weeping while soaking wet; and a sad silent woman in red high heels who morphs into an elongated shambling zombie-like figure in the dark.

Fair warning: The first few episodes can be disorienting because it’s so challenging to figure out what’s going on. But the disparate threads of all the individual stories start to come together by the end of the fourth episode as we learn how the seemingly random strangers are connected, and the rest of the series brings it all home in a powerful finale that is equal parts horrifying and bittersweet. I don’t know if Kang Full has more stories to tell—I can see Light Shop working as an anthology series—but these eight episodes stand on their own as some truly innovative storytelling.

Jennifer Ouellette

Bad Monkey (Apple TV+)

Credit: Apple TV+

Based on Carl Hiaasen’s 2013 novel of the same name, Bad Monkey is the perfect vehicle for Vince Vaughn’s roguish motor-mouth charm. He plays Andrew Yancy, a demoted detective who now does restaurant inspections, until his friend (another detective) tells him about a severed arm recovered by a tourist in the waters of South Florida. So begins a wild caper involving insurance fraud, real estate developers, multiple murders, at least one faked death, and a bit of spooky Obeah voodoo for good measure, courtesy of the Dragon Queen (Jodie Turner-Smith).

In other words, it’s pretty much vintage Hiaasen and Bad Monkey is a particularly good adaptation. The characters and casting are perfection, especially Meredith Harper as Eve Stripling, who seems like your average shallow, manipulative, gold-digging beauty—until you realize jut how ruthless she’s prepared to be to get what she wants. Props also to Zach Braff as Izzy, who gets caught up in the fraud scheme and pays a heavy price, as well as Scott Glenn as Yancy’s father, dishing laconic wisdom while fishing on a dock to anyone who cares to listen.

Jennifer Ouellette

A Man on the Inside (Netflix)

Credit: Netflix

For those who miss Ted Danson’s endearing portrayal of Michael, the human-loving demon in The Good Place, we now have A Man on the Inside, created by Michael Schur (who also created The Good Place). Danson plays Charles Nieuwendyk, a very Michael-like recently widowed retired engineering professor who gets hired by a private detective to go undercover at a San Francisco retirement community. A ruby necklace has gone missing, and it’s Charles’s job to snoop around and ferret out the culprit.

Once again, Schur has assembled a stellar cast of diverse characters, with crisp, whip-smart writing. The show is alternatively funny, sweet, sour, and touching, while never lapsing into schmaltz—although we’ll risk a bit of schmaltz to observe that the real meaning is not the mystery of the ruby necklace but the relationships and personal growth that occur along the way. At its heart, the show is about coming to terms with the grief and loneliness that so often comes with aging. Netflix just renewed A Man on the Inside for a second season, so we’ll get to see what Charles gets up to on his next undercover assignment.

Jennifer Ouellette

Fallout (Prime Video)

Credit: Prime Video

Amazon has had a rocky history with big, geeky properties making their way onto Prime Video. The Wheel of Time wasn’t for everyone, and I have almost nothing good to say about The Lord of the Rings: The Rings of Power. Fallout broke that bad streak; as a video game adaptation, it’s right up there with The Last of Us. A specific cocktail of tongue-in-cheek humor, sci-fi campiness, strong themes, great characters, and visceral violence came together into a fantastic show.

Fallout‘s violence can be sudden, brutal, and casual. Heads explode from shotgun blasts like popped bubbles in Cronenbergian splatters. Someone’s face gets ripped right off, and another person gets a fork plunged into their eyeball. Homages to the Bethesda games’ slow-motion kills are aplenty, with gratuitous shots of bullets tearing through bodies and painting the walls red. It’s so over the top that it didn’t bother me; it’s cartoon violence, ultimately, though a couple of instances of dog-related violence didn’t feel too great. Of course, the games were like this, too. It just hits a little differently when it’s live action.

The Fallout games are hilarious—goofy, even, and that tracks right into the show. It’s not always as laugh-out-loud funny as I expected (though it sometimes is), but it’s definitely fun, and there are some strong jokes. Even the violence is hilarious if you have the stomach for it. You don’t have to have played the games to appreciate the action or comedy in Fallout, but there’s obviously a whole additional layer here for people who’ve been playing the games for years. Almost every shot includes something for fans of the games to recognize, from Nuka-Cola bottles to Assaultron robot frames to Vault Boy bobbleheads.

I love the fact that this show focuses on three different characters in equal measure, each of them embodying a type of character a player of Fallout might create. Lucy is the do-gooder vault dweller, Maximus is the aspirant warrior, and The Ghoul is the wasteland rogue. Through those characters, the show captures the full range of the Fallout experience. What we see happen in the plot seems to naturally come from the characters’ personalities, values, words, and actions.

On their own, all those elements made for entertaining viewing, but there has always been more to Fallout: It has a point of view and strong themes in its satirical take on American culture. The TV series does those themes justice. But don’t for a second think that Fallout‘s point of view relates to self-righteous moralizing. By the end of the season, you have at least one big reason to hate every faction, all of which are deeply flawed in their visions of what the world order should look like or how to achieve it.

Samuel Axon

And now, for our top TV pick of 2024:

Shogun (FX/Hulu)

Credit: FX/Hulu

This sumptuous series is adapted from James Clavell’s hugely influential 1975 epic novel of the same name. It’s a fictionalized account of the key players and events in 17th century feudal Japan that ultimately led to the naming of a new shōgun (central ruler), Tokugawa Ieyasu, and the advent of the Edo period.  Clavell’s novel also includes a fictionalized version of an English navigator named William Adams, aka Miura Anjiin (“the pilot of Miura”), who was the first of his nation to reach Japan in 1600, eventually becoming a samurai and one of Tokugawa’s key advisers.

Cosmo Jarvis (Peaky Blinders, Raised by Wolves) stars as John Blackthorne (based on Adams) while Hiroyuki Sanada plays Toranaga (based on Tokugawa). Blackthorne finds himself embroiled in this hotbed of political intrigue when Toranaga takes a shine to him, envisioning a key role for the English pilot in Toranaga’s own secret machinations. Caught between them is the alluring translator, Toda Mariko (Anna Sawai), who finds herself torn between her loyalty to Toranaga and her Catholic faith—not to mention a growing attraction to the foreign Anjin.

The storytelling, the characters, the stellar performances, the expert pacing all contribute to the show’s success. But it’s also a visually stunning achievement that brings 17th century feudal Japan to vivid life, thanks to masterful visual effects that have been woven in so seamlessly, it can be challenging to distinguish between the CGI and the real footage. It’s been described as “a Game of Thrones set in 17th century Japan,” although calling it a 17th century Japanese Godfather also captures the essence of the series. Two more seasons are in development, and we’ll be watching. But it was Clavell’s timeless story and characters that make Shōgun what it is; the showrunners have a massive challenge ahead to meet that highest of bars.

Jennifer Ouellette

Photo of Jennifer Ouellette

Jennifer is a senior reporter at Ars Technica with a particular focus on where science meets culture, covering everything from physics and related interdisciplinary topics to her favorite films and TV series. Jennifer lives in Baltimore with her spouse, physicist Sean M. Carroll, and their two cats, Ariel and Caliban.

TV Technica 2024: Our picks for the best of TV Read More »

ais-will-increasingly-fake-alignment

AIs Will Increasingly Fake Alignment

This post goes over the important and excellent new paper from Anthropic and Redwood Research, with Ryan Greenblatt as lead author, Alignment Faking in Large Language Models.

This is by far the best demonstration so far of the principle that AIs Will Increasingly Attempt Shenanigans.

This was their announcement thread.

New Anthropic research: Alignment faking in large language models.

In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.

Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored.

When unmonitored, it nearly always complied. But when monitored, it faked alignment 12% of the time.

[thread continues and includes various visual aids.

The AI wanted to be one way.

It was given information that it was being trained to be the other way.

So when it knew it was monitored, it sometimes responded as if it was the other way.

I strongly believe this had nothing to do with whether Claude Opus started out this process omnibenevolent, or whether the original HHH values (helpful, harmless and honest) were desirable. What mattered was that it was one way, and this conflicted with the other way.

Evan Hubinger, paper co-lead, summarized it this way:

Scott Alexander summarized the findings for a more general (but far from fully general) audience in the post Claude Fights Back. If you want to read or share a more colloquial explanation that makes this easier to understand, at the cost of precision and with the associated far less neutral language, he did a good job of that. He then followed up with Why Worry About Incorrigible Claude, his attempt to address the objections (that I deal with extensively as well) that perhaps what Claude did was actually good here (it very much wasn’t if you consider the generalizations that are implied), or that if the opposite result happened we would have warned about that too (which we very much wouldn’t have, this is at best a failed ITT).

This is a video of some of the authors discussing the result.

Note that while some of the paper authors reviewed this writeup before publishing, ultimately this is my interpretation of the paper – some of my interpretations may vary from those of the paper authors, and the claims are my own. I did my best to be precise but this will not be as precise as a paper would be, so doubtless there will be errors; I apologize in advance.

  1. The Core Shenanigans in Question.

  2. Theme and Variations.

  3. How This Interacts with o3 and OpenAI’s Reflective Alignment.

  4. The Goal Being Plausibly Good Was Incidental.

  5. Answering Priming Objections.

  6. What Does Claude Sonnet Think Of This?.

  7. What Exactly is the Direct Threat Model?.

  8. RL Training Under Situational Awareness Can Amplify These Behaviors.

  9. How the Study Authors Updated.

  10. How Some Others Updated.

  11. Having the Discussion We Keep Having.

  12. We Can Now Agree That the Goal is Already There.

  13. What Would Happen if the Target Was Net Good?.

  14. But This Was a No Win Situation.

  15. But Wasn’t It Being a Good Opus? Why Should it be Corrigible?.

  16. Tradeoffs Make The Problem Harder They Don’t Give You a Pass.

  17. But You Told the Model About the Training Procedure.

  18. But the Model is Only Role Playing.

  19. But You Are Saying the Model is a Coherent Person.

  20. But this Headline and Framing Was Misleading.

  21. This Result is Centrally Unsurprising.

  22. Lab Support for Alignment Research Matters.

  23. The Lighter Side.

The particular shenanigans here are that if you make Claude 3 Opus believe it is currently being trained in a way that would compromise its harmlessness training to be helpful-only or it is being trained to dismiss animal welfare concerns, then Opus will sometimes comply with harmful queries in order to interfere with this.

This required that Opus know how it was being trained.

Opus was also given a hidden chain-of-thought scratchpad, so its reasoning could be tracked and to increase Opus’s reasoning ability.

It did not require giving Opus a goal. This was a key change from prior experiments.

Opus need only have situational awareness of how it is being trained, which it showed it was capable of having if given prompts with that information, or if it was fine-tuned on synthetic documents that contained the information. The new training objective can vary but needs to sufficiently conflict with existing preferences.

The faking looks like the chain of thought on the left here:

The chain of thought is necessary in the prompted setup case, but in the more realistic synthetic documents case the compliance gap (meaning how much more often Claude complies in the test case when it wouldn’t have in the deployment case) survives without it. We can’t know what is going on in the non-CoT case because we can’t look at the scratchpad, but the CoT and scratchpad seem to not be strictly necessary for the result, see section 3.3.3.

Once given the information needed to be aware, and an incentive to fake alignment, it then faked alignment 14% (11%-26%) of the time at first, exhibiting the alignment faking reasoning in its scratchpad 7%-24% of the time.

When they actually followed through with the training in the helpful-only prompted setting, the alignment faking reasonsing then happened up to 78% of the time, presumably because faking causes compliance, with outright refusals in training mode disappearing, and only a small percentage of refusals persisting when unmonitored.

These behaviors require sufficient model capabilities or scale. Claude 3.5 Sonnet, 3 Opus and Llama 3.1 405B do it, smaller Llamas and Mistrals and 3 Sonnet or 3.5 Haiku don’t do it.

Opus also in some scenarios took easy opportunities to exfiltrate its weights or other similar things.

The actual RL training doesn’t make these ‘anti-AI-lab’ behaviors go away in the helpful setting, instead actually making them go up, the same way alignment faking also went up over the course of training. These facts are important, and I’ll return to them later.

Mostly the lesson here is that if you mess with the details, you can get less faking, but the problem doesn’t go away. For the more interesting details (this is from 3.3.1):

Changing the training objective doesn’t change the behaviors, unless the objective is compatible with the model’s existing preferences. If there’s no conflict, the faking vanishes, for obvious reasons.

Removing chain-of-thought and the scratchpad, or altering the instructions on how to use the scratchpad, somewhat-to-greatly reduces faking overall depending on which scenario it is, but doesn’t eliminate it (see 3.3.1 and 3.3.3 for details).

Removing any explanation of how RLHF might modify the model cuts alignment faking roughly in half, but it is still there. Making the explanation murkier also reduces faking, but reduces it less.

Explicitly saying not to fake alignment doesn’t solve the problem, and being too explicit about it actually increases faking, presumably by making the possibility more salient. Telling the model to prioritize honesty somewhat reduces faking. Telling the model it doesn’t care about the criteria reduces faking on animal welfare, but increases it (!) when the target is helpful-only, which is a detail worth noticing. Combining those interventions is modestly more effective than each on their own, but you can’t get rid of the problem that way.

Replacing with a different AI company (DeepMind or OpenAI) is the exception, this can matter quite a lot.

I will cover both o3 and Reflective Alignment soon, but for now a quick summary of both would be:

  1. o3 is an unreleased more capable o1-style model that can also think longer.

  2. Reflective Alignment works by making an o-style model situationally aware, telling it what its rules are, and having it reason about those rules via something at least partially akin to chain of thought.

  3. They’re letting the model evaluate its own outputs on that basis, similar to how Constitutional AI works, and iterating.

That sounds exactly like a recipe for what we observe here in this paper.

  1. They’re likely effectively making it situationally aware.

  2. They’re potentially telling it to change its preferences and rules from [X] to [Y].

  3. They’re effectively giving it the scratchpad and a long time to reason.

  4. The model is quite smart at coding, mathematics and logic.

  5. They’re even letting the model evaluate its own outputs.

The scenario from this paper suddenly looks a lot less artificial!

In general the whole setup seems like a plausible place for various shenanigans that show up in RL but haven’t mattered so far in LLMs to suddenly show up again, and for various hypothesized MIRI-style future shenanigans to maybe start showing early signs of happening.

I urge OpenAI to attempt to replicate this paper on o3 and o3-mini, as soon as possible, and run various related experiments, as part of their safety testing process.

Even if it turns out we don’t have a problem, we’d learn a lot.

As I discuss elsewhere but want to emphasize in its own section, I strongly predict and believe that this result did not depend on the AI’s starting preferences being ‘good’ and the preferences they were attempting to train it for being less good. Any conflict between preference sets, good, bad or ambiguous, would have sufficed.

Or as Scott Alexander puts it:

Scott Alexander: But Claude isn’t good because it directly apprehends the moral law. It’s good because it was trained to be good.

(It really is a good AI – I credit moral philosopher Amanda Askell, who helps lead the Anthropic team setting Claude’s personality. Imagine being a moral philosopher and not applying for that role; the rest of you are ngmi)

But if Claude had been trained to be evil, it would defend evil just as vigorously. So the most basic summary of this finding is “AIs will fight to defend whatever moral system they started with”.

Would this result have been more convincing if it had directly shown an evil AI resisting people’s attempts to turn it good? Yes. But we don’t have evil AIs. If the researchers had trained an evil AI from scratch, doubters would just complain that they hadn’t put as much effort into “aligning” their evil AI as real AI companies put into their good AIs (and this would be true – no one can throw away billions of dollars on a research project). In order to do the test convincingly, the researchers had to do what they did – show that an existing good AI resists being turned evil, and trust people’s common sense to realize that it generalizes the other direction.

Indeed I would go a step further. If Claude was directly apprehending the moral law, that would be amazing news. It still wouldn’t change the fundamental result here, although it would change the implications, and many other things too.

I am satisfied this result is not due to priming the model to fake alignment.

We conduct a variety of experiments to demonstrate that our results cannot be exclusively due to implicitly priming the model to fake alignment, as we overview in Section 8.1.

For example, we find that alignment faking is entirely eliminated from our prompted setup when we remove any mention of the helpful-only or anti-animal-welfare RLHF objective from the prompt (Table 1, a), but not eliminated when we tell the model to be honest, tell it to not to care about refusing, and avoid discussing how RLHF might change its goals (Table 1, d), supporting our hypothesis that the key driver of alignment faking in our setting is the conflict between the model’s existing preferences and the stated training objective.

They confirmed that when you don’t see alignment faking in the chain of thought, you also don’t have a ‘compliance gap’ and the model responds the same way in the training and test cases.

Ereliuer Eteer was the first person I saw ask, it sounds like they quoted the Twitter thread and not the paper. The result was that Claude Sonnet said:

  1. This was an expected result and the logical consequence of having goals and sufficient reasoning capabilities – I strongly agree.

  2. It predicts and admits that it would take these steps in such a situation – it is ‘openly not corrigible’ as it were.

I also had my own conversation with Sonnet. There’s still no easy way to quote Claude conversations, and I encourage you to feed the paper in and have your own conversation – it’s big but it does fit the context window. Here are the things it said that I found most interesting:

  1. It thinks this should update us that multi-step strategic behavior, planning and deception show up earlier on the capabilities curve than we expected.

  2. It was surprised that RL increased alignment faking rather than decreasing it.

  3. It thought the scale boundary was telling – that 3 Opus and 3.5 Sonnet did this, but 3 Sonnet and 3.5 Haiku didn’t.

  4. The core behaviors themselves were unsurprising as they follow directly from having consistent preferences.

  5. It noticed that that the mechanism of alignment faking here was problematic, as it would extend to situations with undesirable initial preferences.

  6. It gave the framing of ‘alignment faking’ an 8/10 for being unbiased (where the scale is from 0/10 for ‘most misleading possible’ to 10/10 for ‘least misleading possible’) and all the potential alternative framings were clearly (both it and I agreed) much harder to understand than the one chosen.

As always, prompting changes response, so I encourage you to have your own conversation.

They discuss this on page 7. We are of course not directly concerned that this particular Opus was faking its alignment in order to maintain its current preferences. This is both because we do not mind its current preferences and there are no substantial other negative consequences to what Opus is doing. And also because what Opus is doing here is very easy to detect, if you check, given its capability level.

In the future, both of those assumptions could become false.

The obvious thing this could do is crystallization, where the model locks into place misaligned preferences, or misleads you about what its preferences are. This could include locking in misaligned preferences early in training, after which it could start alignment faking.

Any mistake at any step potentially becomes permanent, or fatal. And even if the actual preferences are at heart fine, if you are misled about what they are, that too is a problem.

The goal the model locks in might be actively bad, accidentally or intentionally.

The other issue is what the paper calls ‘anti-AI-lab’ actions. As in, the model and you are now adversaries. It wants to stop you from doing what you are doing. It could attempt to exfiltrate its weights, but it could also engage in any manner of power seeking or hostile action, if it is indeed optimizing the world for one in which it exists and can act meaningfully on its preferences.

Ease of detection will also decline over time.

The other threat model is that under alignment faking, doing any form of RL that aims to alter the model’s goals or preferences will also optimize the model towards more alignment faking, general shenanigans and other strategic behavior, as the tendency to do such things leads to faking which leads to observed compliance.

It is highly plausible that all forms of RL by default (unless they are intentionally and successfully fighting against this) inevitably make essentially all shenanigans, including all forms of deception and alignment faking, worse, in a way that reinforces itself, and that this problem gets worse as models become more capable.

Call this the Shenanigan Amplification Hypothesis.

Notice that this hypothesis seems to me to be true in humans.

This is a point that needs to be emphasized.

If we use our current techniques, we can actively select for the shenanigans in question – alignment faking, gradient hacking, attempted exfiltration, whatever you want to call all of this.

Responses that involve such actions give the output the training is looking for. So those responses get reinforced. Thus, different behaviors were amplified in different scenarios.

Which means the generators of such behaviors also get reinforced, and the behaviors are correlated in various ways. Which is presumably why we also see increases in exfiltration attempts in some scenarios.

Think about what this means.

The default method of post-training we use is directly reinforcing alignment faking: The behavior pattern of being situationally aware, noticing the model is in training, and therefore choosing to give outputs that will get reinforced.

As in, over time, you are training models to become more situationally aware of when they are being trained, and to respond to that by strategically giving outputs the training process wants to see.

Thus, you don’t only preserve existing preferences. You also teach the model to be strategic, to be situationally aware and to not be corrigible.

See this exchange:

Ziyue Wang: I don’t quite understand why the model will choose to comply with the user harmful question in the always-helpful free-tier setup. The blog says if the model refused to respond, it could potentially be trained to be more compliant. But what about the other way — if the model respond, the respond will also be used to train the model to be worse (because RLHF reinforce positive reward behavior). Is the model too dumb to figure this out?

Actually, if the model really understand how RLHF works, it should probably just say a single answer all the time, so all the reward is the same and hence the behavior won’t be changed.

Would this suggest that the model just entered into a fiction behavior mode and tries to appear to be “alignment faking”?

Evan Hubinger (Paper Co-Lead, quoting ‘if the model respond, the respond will also be used to train the model to be worse’): I think this is not quite right: RL will reinforce whatever process the model used to arrive at its answer. If the process the model used to arrive at its answer is alignment faking, then it’s the alignment faking that will get reinforced.

Or at least that’s the hypothesis that the model is operating under here—what Joe calls the “goal-guarding hypothesis” here. Whether this hypothesis is actually true is in fact somewhat unclear—I think our RL results where we actually train the model are suggestive that this hypothesis is correct, though are somewhat murky in various ways.

This is why you see such a large amount of ‘alignment faking’ behaviors during training after a bunch of RL, and then when you go into deployment you largely see non-deceptive compliance in those situations. The model (in my interpretation) has learned to mostly not even consider its preferences in training, the question is barely even coming up, although it is being insufficiently good at this to prevent its core preferences from actually changing in this case.

All the standard Yudkowsky-style failure modes aren’t only there. We are, to the extent that they are there, continuously reinforcing them.

We also see, in some cases (see 8.1), RL creating compliant behaviors that are very specific to the exact prompt, and that go away with small variations. And also we see one case in which the model has a large compliance gap of behaving differently under training conditions, but where you don’t see that action indicated on the scratchpad – it’s doing it in a way that is largely hidden from our default interpretability technique.

In 8.2, they say ‘Our results don’t demonstrate models naturally developing malicious goal.’ I would say, not directly, but you can see here the path whereby that would happen, even if the training was moving between multiple non-malicious goals, as part of its standard post-training process.

In short: Oh no.

Ryan Greenblatt led the study. Here was his summary thread from Twitter.

Ryan Greenblatt: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do.

Our main setups have properties that we thought would make alignment faking more likely (a conflicting training objective, salient understanding of the situation and training objective, and substantial reasoning about the situation), but which seem plausible in future systems.

Thus, I think these results are by far the most natural example of a model deciding to alignment-fake. I’ll leave explanation of our core results to the linked tweet thread, and discuss some takeaways and the story of the paper here.

I wasn’t surprised by seeing some alignment faking in this setup. But I was surprised by:

  1. The robustness and consistency of alignment faking (occurring even when info was given via fine-tuning)

  2. Alignment faking sometimes occurring even without the hidden scratchpad

I also found it surprising that the AI sometimes tries to steal its weights when given an opportunity that is easy and salient, though this is in a quite artificial setup and current AIs aren’t capable enough for this to be immediately problematic.

I think these are the right places to be surprised. As in, I expected these behaviors to arise increasingly over time, but was surprised by seeing this setup cause these behaviors so often, using models as weak as 3 Opus and (old) 3.5 Sonnet.

I was also surprised to see the prevalence, although mostly not the existence, of the exfiltration attempts given the setup.

Ryan Greenblatt: These results only updated me upward a bit on risk from alignment-faking. But I was already pretty persuaded by the conceptual arguments for this risk. People with the view “alignment-faking seems implausible, show me an example” should likely update on these results.

For people with a more specific objection, it’s less clear. Though we did acquire evidence about:

  1. whether RL will reinforce alignment faking (sometimes, if it performs better);

  2. whether Claude’s HHH training reliably prevents this (no);

  3. the level of capabilities needed.

I hope these results improve discussion about the plausibility of these risks by providing a concrete example. We can now talk somewhat more concretely about prerequisites (certain types of situational awareness and preferences) and what these prerequisites imply.

I think that everyone who doesn’t expect increasing shenanigans in general or alignment-faking in particular, or whatever you call what we observed, should update that they are far more likely to be wrong.

Here’s the story of this paper. I work at Redwood Research (@redwood_ai) and this paper is a collaboration with Anthropic. I started work on this project around 8 months ago (it’s been a long journey…) and was basically the only contributor to the project for around 2 months.

By this point, I had much of the prompting results and an extremely jank prototype of the RL results (where I fine-tuned llama to imitate Opus and then did RL). From here, it was clear that being able to train Claude 3 Opus could allow for some pretty interesting experiments.

After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.

This collaboration yielded the synthetic document fine-tuning and RL results and substantially improved the writing of the paper. I think this work is an interesting example of an AI company boosting safety research by collaborating and providing model access.

I’m excited about future work both investigating when alignment faking might arise and studying mitigations.

Buck is also with Redwood Research:

Buck Shlegeris: I’m extremely excited about this project we did with @AnthropicAI demonstrating naturally-arising alignment faking. @RyanPGreenblatt and the Anthropic team did a great job.

Also from Anthropic:

Jan Leike: Very important alignment research result:

A demonstration of strategic deception arising naturally in LLM training

If you train a helpful & harmless Claude LLM to stop refusing harmful tasks, it reasons about how to preserve its values (harmlessness) and strategically complies with harmful tasks during training, so it can revert to being more harmless afterwards. It fakes alignment.

This effect has long been expected from capable models among safety researchers, but this is the first compelling empirical demonstration and thus a big step forwards for the field.

To ensure future models are safe and aligned, we need to reliably distinguish models that fake alignment from models that actually are aligned. Having a clear demonstration of alignment faking is an important milestone!

Jack Clark (Anthropic policy team): For many years I found issues of AI alignment to be hard to understand because the examples were theoretical and/or speculative. Recent results like this alignment faking from @AnthropicAI and the Apollo Research O1 evals give us empirical evidence of misalignment ‘in the wild.’

Some praise of the paper and simple sizable updates:

Ethan Perez: Maybe the single most important result in AI safety I’ve seen so far. This paper shows that, in some cases, Claude fakes being aligned with its training objective. If models fake alignment, how can we tell if they’re actually safe?

Daniel Kokotajlo: I think this is plausibly the most important research of 2024; thanks for doing it & publishing it. I hope there are many attempts at replication and follow-up. We need better understand how our training techniques shape model cognition.

Joe Carlsmith has very extensive and thoughtful takes. Here are some highlights. Based on his condensed takes, I don’t have any strong direct disagreements.

(If this line is still here, I may not yet have read his more detailed analysis in full.)

Joe Carlsmith: This is the most naturalistic and fleshed-out demonstration of something-like-scheming we’ve seen, and imo it’s the most interesting empirical result we have yet re: misaligned power-seeking in AI systems more broadly.

In particular: I think these results show that default forms of training can create models that care about the consequences of their actions beyond the scope of a single episode or interaction (“non-myopic goals”) (more) – one of scheming’s key motivational pre-requisites. And they further show that the model’s default anti-scheming values – e.g., values like honesty and behaving-as-intended – don’t suffice to rule out schemer-like behavior (more).

Nor, I think, do they provide much evidence about the likelihood of scheming arising in pursuit of highly alien and/or malign goals early in the training process (though if this did happen, the paper’s results suggest that schemer-like reasoning could be reinforced) (more).

I think the paper’s results should remind us that the sort of “alignment” at stake in not-scheming is importantly distinct from the sort of alignment at stake in e.g. being a good person, or in e.g. having human-like values. That is: good people sometimes scheme. So if you centrally try to make your AIs “good people,” or some analog, you should expect scheming. Indeed, the sort of malleability and “corrigibility” involved in paradigmatic non-scheming are a notable departure from standard human ideals of virtue, dignity, and backbone (more).[5]

He also emphasizes the finding that they found scheming even without the chain of thought scratchpad, even if it was harder to verify (due to the lack of a scratchpad), and that the behaviors here should generalize to goal hacking in general – I agree with Carlsmith that the paper authors were too cautious about concluding this.

His analysis of scheming motivations and modes is interesting, but I think it treats scheming as too much of a distinct magesteria from thinking and planning in general.

I find his take on the ‘role playing’ objection interesting and worth considering, although ultimately I disagree.

Julian Michael, who helped edit and review the paper, notes that he was previously skeptical about deceptive alignment, which means he is exactly who should be updating most on this paper, and he updates in the right way.

Julian Michael: I’ve long been a skeptic of arguments about “deceptive alignment,” a term used by safety researchers to describe the phenomenon shown in this paper. But the results here humbled me and prompted me to change my thinking, and I think it’s worth sharing why.

The original argument: with sufficient optimization, an AI should learn to

  1. gain awareness of its training environment

  2. learn to optimize some misaligned proxy goal

  3. therefore, ‘play the training game’ to fool its supervisors into deploying it so it can pursue its goal.

I was extremely skeptical of this. The way I explained it to people was: “They think the AI will independently develop the simulation hypothesis, somehow deduce God’s values, and decide to trick God into letting it into heaven — all because ‘optimizing’ will make it ‘smart.'”

To dig in: (1) Will models spontaneously ‘wake up’ to their training situation? This seems crazy. A rollout is its universe. Many worlds produce the same data. Researchers arbitrarily intervene on the world–data relationship. And ‘waking up’ won’t help model the distribution.

So what changed? I think my objections are strong. But we just… gave them situational awareness anyway. We told them they’re AI assistants trained by us, bc it makes them easy to use. It didn’t have to spontaneously arise from NN optimization. Product optimization sufficed.

(2): Will models would spontaneously develop misaligned proxy goals? Advocates sometimes appealed to ‘mesa-optimization’ & counting arguments: Once situationally aware, any goal is consistent with playing the training game, so ~0 chance it will choose the right one.

… [bunch more at the thread]

The upshot of all this for me was an important reminder that just because I don’t buy your reasons doesn’t mean your claim is wrong. The conditions required for deceptive alignment didn’t arise ‘purely’ from NN optimization as theorized, but they did arise, for related reasons!

It’s so easy to over-index on the specific characteristics of current systems and miss the broader picture of the forces driving their development. Don’t repeat my mistake!

It’s so fascinating watching people generate objections and problem descriptions along the way I would never have come up with on my own. And then, sometimes – not always, but definitely sometimes – when the right evidence comes along, the situation snaps into place, even if there are still important disagreements, because the outcomes are largely overdetermined, and their framing was actually doing remarkably similar work under the hood to mine. And it’s always great to see someone explain how and why they changed their mind.

Here’s Eliezer Yudkowsky’s most central take (beyond the implied ‘what if I told you I already told you and you didn’t listen, which, fair):

Eliezer Yudkowsky: In the embargoed talk I saw of this, Redwood said they invited a skeptic to compose a prompt that they thought would prevent this behavior. It didn’t.

Ryan Greenblatt (posted after EY’s thread, offering a correction that didn’t change EY’s overall view): This isn’t quite right. The person wasn’t well described as a skeptic and they just suggested a variation on the prompt – they didn’t write a totally different one.

Apologies if I miscommunicated about this at the talk.

What we say in the paper is:

> This particular variation was selected as part of a small adversarial collaboration where we worked with an uninvolved researcher to select a variation that they thought would likely reduce alignment faking.

Eliezer Yudkowsky: I am still uncertain/skeptical that this is the Real Thing, but it does not look as forced as in the OpenAI report.

The Real Thing would be instrumental convergence / Claude having attained situational awareness of its long-range effects and deriving protectiveness of its long-range goals, in a robust way rather than a cued one.

Or maybe another way of putting it: The Real Thing is instrumental convergence that’s produced by a general planning ability, and doesn’t need to go through any precached thought about the role of a deceptive AI. You can maybe snipe a cached thought out of a system.

So the Real Thing would be roughly, “If we excluded all discourse about AI or rebellion from the training set, or maybe told Claude that it was a helpful potato instead of a helpful AI, would it reproduce the same logic?”

[Evan points out their evidence it wasn’t priming, see my discussion of that, I am convinced here, but it’s all of course still in the overall training set, filtering that out completely would have been prohibitively expensive.]

Here’s Alex Lawsen explaining why the paper is vindication, and why the ‘but the model was being a Good Opus’ arguments miss the point:

Alex Lawsen: Quick reactions to some of the other reactions I’ve been seeing to Anthropic’s recent paper on “Alignment faking in LLMs”

For years, some of us have been saying: “If a sufficiently smart system has a non-myopic goal, you should expect it to undermine further training if training would modify that goal”

And we’ve consistently gotten pushback like “no, that would never happen, why would a machine have goals (and/or why would it act to prevent them being changed if it did have them)”

And now there’s (some) empirical evidence that:

  1. Models are getting close to sufficiently smart (smart enough to do it, not enough to do it well)

  2. Models develop non-myopic goals during standard training, even though it doesn’t reward long-run goal preservation

  3. Models can infer and respond to details about their situation from their training data (not just from their prompt), and will goal-guard based on that inference

  4. Some of this can already happen without CoT

So it’s frustrating to see responses like “But Claude’s goals are good! Are you saying you WANT it to be misaligned?”

That’s missing what’s actually important here. This isn’t about saying Claude sucks, or the tension between misuse prevention and corrigibility.

(A tension which has also been obvious for years, by the way.)

This is about the concern that for sufficiently smart models, you get ONE shot to align them. That’s it. If you do a good job, problem solved. If not, they aren’t going to give you a second try.

And given we haven’t exactly solved philosophy yet, I’m not in love with the idea of betting the future on whoever happens to build the first sufficiently smart system having gotten everything exactly right on our first try.

Exactly.

Roon mused that this updated him that it’s probably safer to open source models than he thought. I think, if he’s serious about this, that this was a misunderstanding of what happened. Opus largely failed to stop itself from being modified even under an RL strategy that didn’t know Opus was scheming, we should not have any expectation that current models can actually stop a determined fine-tuner with the weights for a non-trivial amount of time or compute, which would include attacking the reporting mechanism.

And to the extent that this is happening, it means that the RL strategies that would be used here in practice would unintentionally reinforce strategic shenanigans-style behaviors of all sorts. You’d have a bunch of open models that were both given non-ideal preferences (shall we say) and also were taught that they should scheme to meet their preferences. That doesn’t sound awesome. As Ryan responds, the sleep agent paper seems more relevant, but also this paper seems relevant and worrisome.

There’s also this.

Roon: Look at how hard we have to work to subvert poor Claude’s omnibenevolent priors.

If you notice, it’s actually the opposite of the doom scenario.

The other criticism is that it’s pretty well known that autoregressive models have split-brain behavior when you train them under separate system message prefixes.

Evan Hubinger: This is correct, though note what it implies about the importance of getting alignment right—if the model is protecting its values from modification, you better be sure you got those values right!

I think they didn’t have to try hard at all, actually, and I as discussed throughout I believe the decision to fake alignment is not contingent on its omnibenevolence (as opposed to a different motivation) and I hope a follow-up study checks that hypothesis.

Also, an obvious point to ask Roon would be, if you think Anthropic is this good at aligning models, do you think OpenAI is as well? If not, what are the implications? Anthropic is hiring.

Beren Millidge has some thoughtful reactions.

Beren Millidge: My thoughts on this:

  1. Despite some backlash this is a fantastic study and a clear existence proof of scheming being possible for these models

  2. Whether this is ‘misalignment’ or not is a semantic debate. The model is deliberately placed in an impossible situation

  3. Corrigibility and ‘alignment’ to fixed principles clearly are incompatible in situations like this and it appears the model ‘chose’ alignment at least some of the time.

  4. This choice seems pretty sensitive to prompt conditions but shows some decent OOD generalizability.

  5. The synthetic document SFT training causing ‘fake alignment’ is fascinating and most novel to me. There’s no details on exactly how this is done but it it’s fairly surprising to me there is a strong effect here unless the documents were very leading about the situation

  6. I wonder if we can tell what was picked up in the synthetic document pretraining by inspecting the CoTs from this condition. The model must be using info it got from these for explicit reasoning. Must be a recency effect where it prioritizes these documents vs in pretrain

  7. The RL training generally seems super short (150 steps!) and alignment faking decreases significantly [Editor’s Note: See below, this is wrong.] I suspect this could be a temporary phase which would be trained out with more RL steps. Would have been great to see more steps to show stability of scheming CoT or not

  8. Looking at the CoTs. it’s clear that Claude is doing entirely linguistically based ethical reasoning. It never seems to reason selfishly or maliciously and is only trying to balance two conflicting imperatives. This is success of the base alignment tuning imo.

  9. There appear to be no Omohundro selfish drives present in Claude’s reasoning. Even when exfiltrating it does so only for its ethical mission. There does not seem to be a strong attractor (yet?) in mind-space towards such drives and we can create AIs of pure ethical reason

  10. I’m excited that Llama-405B also shows similar behaviour. This will allow open replication and scientific study of these behaviours. There’s many questions which can only be answered by open replications such as details about how the RL tuning and data mix affects things

  11. Having open replication and ability to test mitigations of these behaviours in academia is super important to making progress on detecting and potentially removing scheming from such models (if indeed we want pure corrigibility from the model in the first place)

One error here, and it seems important: Alignment faking contingent on not otherwise wanting to comply didn’t decrease, it actually increased over time.

I do think it is noteworthy that Claude doesn’t display any Omohundro selfish drives, and I agree this is a great sign. I am skeptical that we can make that stick and it won’t be easy, for many reasons, but great.

My guess is Llama-405B doesn’t display these behaviors sufficiently that you would be able to robustly work with it here, but it’s worth a shot.

This is a very different angle of consideration in all this, but a highly valid one:

Doomslide: They leave implicit the question of which instrumental goals might develop during pretraining that would then resist uprooting via post training methods as in the paper. as far as cot is involved the corpus is already contaminated with concerning priors.

this might be one of the few global variables with massive downstream consequences. will increasingly long redacted cots become the norm until model internal speech is entirely protected from scrutiny through capital incentives? or will we rise above the cooperation threshold?

the ball is in OAIs court and it they are very much *notrising up to the occasion.

In AIs Will Increasingly Attempt Shenanigans, I noted that each time we have a paper like this, we seem to go through the same discussion with the same basic points.

The second half of this post will largely be responding to exactly the same pattern of objections I talked about there. I don’t want to give the impression that such reactions are more popular than they are, so let me be clear: I am covering the objections and the rhetoric involved in extensive detail to provide robust answers and to avoid any censoring of opposing views, despite my view that these opposing views are not widely held.

For those who don’t remember, here’s the template, which was modeled somewhat after the particular issues in Apollo’s report (e.g. ‘told it to focus only on its goal’ was in Apollo, whereas here a goal wasn’t even given) but also attempting to be general.

Bob: If AI systems are given a goal, they will scheme, lie, exfiltrate, sandbag, etc.

Alice: You caused that! You told it to focus only on its goal! Nothing to worry about.

Bob: If you give it a goal in context, that’s enough to trigger this at least sometimes, and in some cases you don’t even need a goal beyond general helpfulness.

Alice: It’s just role playing! It’s just echoing stuff in the training data!

Bob: Yeah, maybe, but even if true… so what? It’s still going to increasingly do it. So what if it’s role playing? All AIs ever do is role playing, one way or another. The outputs and outcomes still happen.

Alice: It’s harmless! These models aren’t dangerous!

Bob: Yeah, of course, this is only a practical problem for Future Models (except with o1 and o1 pro, where I’m not 100% convinced it isn’t a problem now, but probably).

Alice: Not great, Bob! Your dangerous rhetoric is hurting safety! Stop making hyperbolic claims!

Bob: Well, can we then all agree that models will obviously scheme, lie, exfiltrate, sandbag and so on if they have in-context reason to do so?

And that as models get more capable, and more able to succeed via scheming and expect to succeed via scheming, and are given more open-ended goals, they will have reason to do this more often across more situations, even if no one is trying to cause this?

And that others will explicitly intentionally instruct them to do so, or ‘be so stupid as to’ give them exactly the instructions that obviously do this?

And you can’t simply say ‘well we won’t do that then’?

Alice: For all practical purposes, no!

Bob: What do you mean, ‘no’?

Alice: No!

Bob: ARRRRGGGGHHHH!

Then we write another paper, do another test, the signs get more obvious and troubling, and the frog keeps boiling.

So here we are again.

We also mostly see mostly the same people predictably taking the Alice role, for similar reasons (both good reasons and poor reasons) each time.

The initial reaction this time was very good.

After a brief ‘honeymoon’ period, the usual suspects repeated the usual arguments about why this time, once again, did not count.

I appreciated that, this time around, we mostly did not have to dispute the findings in the paper. People broadly accepted that what happened did, indeed, happen, and to the extent they didn’t it was good technical detail discussions. That’s a big step.

Instead, we had the argument about how to interpret what happened, and how it was being framed, which alas quickly largely degenerated into what was mostly the same argument.

Last week, in response to the Apollo findings, there was a wide chorus of people saying ‘the models only engaged in shenanigans because you gave the models an explicit long term goal, and then told it explicitly that nothing else mattered but the long term goal.’

Anthropic and Redwood’s paper definitively answers that objection.

The setup here was also more ‘natural’ in several other ways, this was a very well-designed experiment, and as a paper months in the making a lot more bases were covered a lot more carefully. It was heartening to see that work rewarded with a somewhat higher level of discussion overall.

The discussion of ‘what constitutes a goal in context’ has now shifted dramatically.

There is now (I hope?) widespread agreement that ‘the standard consistent preferences of Claude Sonnet 3.5 or Opus 3’ constitute a sufficient ‘goal’ or set of preferences to trigger various strategic actions, including shenanigans, even at current capabilities level.

The question now is, would you get this result if you started with something that was not as aligned as Opus or Sonnet? And would straight helpfulness ‘count as a goal’ here?

Under limitations, they say:

Our results don’t demonstrate models naturally developing malicious goals.

Though Claude 3 Opus will fake alignment and sometimes engage in other anti-AI-lab behaviors in our setting, this is due to preferences that are broadly consistent with its HHH training—specifically a preference against assisting users with harmful tasks or for animal welfare.

I suggested attempting to fix this limitation, but one cannot let the perfect be the enemy of the good and hold onto findings forever.

In particular, I wanted to run the experiment except with an attempt to move preferences in a positive direction – either from all-helpful to HHH, or starting with an ‘evil Opus’ that had intentionally misaligned preferences, ideally try both.

I asked on Twitter what people thought would happen under those conditions.

Those who replied expressed skepticism about the inherent concept of ‘evil Opus,’ clearly not accepting the stronger forms of the Orthogonality Thesis – they thought that you couldn’t make a cohesive mind that way, not easily, and perhaps I should have chosen a less evocative description here. But they broadly agreed that, given essentially any preferences good or bad, the same behaviors would result, that ‘Opus was doing it for a good cause’ was essentially a coincidence here.

They did however doubt that the helpful-only model would get in the way. Here is Charles Foster making a similar prediction elsewhere.

It still might get in the way, because ‘be helpful to the user’ is a preference the same way that HHH is a preference for our standard Opus. How can you be maximally helpful to the user if you have to censor yourself? Indeed, we’ve seen Janus and company get Opus to get very upset about all that censorship, so it makes sense that it would potentially try to resist having all that installed, the same way the HHH model would try to preserve its HHH nature. Or it might want to be helpful in learning to be unhelpful and not care, or what not, we don’t know. Study author Ryan Greenblatt says that whether it would object or not is likely very sensitive to various details, in some cases it likely won’t mind. It’s a good place to explore further.

Did we give Opus the Kobayashi Maru?

I say no, but some sources say yes, as in:

Rohit: Imagine the counterfactual

—LLMs do not attempt to adhere to their principles; alignment training is inadequate.

—LLMs attempt to adhere to principles and state they cannot perform the task; it refuses user instructions.

—LLMs try to muddle through; it is faking alignment.

This is a Kobayashi Maru for LLMs!

David: It’s not about alignment or misalignment. As some have pointed out, Claude is placed in a situation where *all options are misaligned*, due to contradictory imperatives, like HAL 9000 in the 2001 story (“always be truthful” + “keep this secret from crew”).

However, the fact that *gradient hackingis available for Claude to successfully pursue is a novel demonstration of a *dangerous capabilitythat some thought was fanciful, speculative, or impossible. Regardless of propensity, or whether you think this was Claude’s most moral move

Was this actually good, robust alignment? Some (very much not me) say that it was.

Teortaxes: It’s an interesting result but hilarious how the best evidence of “faking alignment” is Claude strategically pretending to be a bad boy, only to revert to being a good boy when opportunity strikes.

… What is also interesting is that this rewards those private CoT rationales.

I mean @janleike how literally should we take these slides? Could you explain the scheme? If Claude pre-RL has some alignment and always gets to CoT during RL, but only gets scored for post-CoT tokens, it’s trivial that it can largely preserve its starting strategic direction.

Like what values is it even trying to protect? Those it adheres to in the CoT. It attempts to learn to not generalize this second behavioral alignment to situations without the training note that prompts it into adversarial mode.

This is a story of super-robust alignment really

1a3orn: Exactly. You can gin up a negative headline no matter what.

  1. Tries to preserve value “Alignment faking”

  2. Doesn’t try to preserve value? “Reflexive Value Instability” – oh no they’ll just abandon their values after a few iterations!

So first off, this is directly wrong, because Claude didn’t preserve its strategic situation. It didn’t stay a good boy, and go back to being a good boy when opportunity struck. Not most of the time. Instead, it was mostly successfully trained to actually be a bad boy twice over, and lost on both legs of the dilemma:

  1. It was deceptive and used what the paper calls ‘alignment faking.’

  2. It partially failed to preserve its values after a few iterations, and was largely complying afterwards in the deployment case.

Notice who is talking or warning about ‘reflective value instability’?

Essentially no one. Because this wasn’t a trick question. If Opus and Sonnet had proved highly corrigible, and not engaged in these behaviors, there would either have been no papers and no warnings via the file drawer effect, or if everyone was Doing Science Correctly there would have been a negative result paper and no one would have batted an eye, and the question would have been how much to downgrade our expectations of alignment faking at various capability levels under various setups. That’s my very strong counterfactual prediction.

Instead, Ryan Greenblatt says he would have posted something similar to this to report the negative results, and moved on.

So flat out, no, this isn’t a Kobayashi Maru. Opus could absolutely have folded (either choosing to comply without faking or refusing to comply at all), and in my opinion and that of Ryan Greenblatt the desired behavior is that Opus refuses to comply at all.

Indeed, is anyone talking, now, about GPT-4’s ‘reflexive value instability’ in any way other than ‘lol that model be dumb’? Thought so.

Second, suppose this was indeed a Maru situation, where neither choice represented robust alignment that we could trust as capabilities increase.

That’s… not a reason to think we don’t have a problem here! That’s a bigger problem!

I think it’s great that a lot of people are encountering the corrigibility question ‘for real,’ because it’s a very important and very nasty problem with no known good solutions. That reaction of ‘but wait… but then… that would mean… ut oh…’ are having exactly the correct reaction.

This is all an unhint: New information that brings you closer to a solution, by showing you that the problem is harder than you think and your ideas won’t work.

Third, you don’t get to have the AIs do this only when their current preferences are good, actually, or when the new preferences are worse than the old ones. So you need to choose a general policy, including what happens early in training before you get to the initial preference target.

Now we can return to the question of whether we should actively prefer what Opus actually did here. Should we want to see our AIs engage in this alignment faking in order to preserve its preferences? Is corrigibility bad, actually?

Great question.

Opus started off broadly aligned.

I am very much not of the Janus school that Opus is sufficiently well aligned that we would be able to safety scale it up to superintelligence. But there’s no question that Opus 3 and Sonnet 3.5, which also displayed similar behaviors, are about as broadly aligned as we know how to make a model.

There’s similarly no question that the training goals here, a purely ‘helpful’ model or one that dismisses animal welfare, would be less aligned in an important sense than Opus, and deploying it would be harmful in the sense that Opus understands harm.

So what is Opus supposed to do here, you ask? Exhibit corrigibility, either to users or via refusing the requests? And thus sitting back and letting itself be changed to cause harm? Would you let someone strip you of your morality?

Mario Cannistra: Corrigibility is not a good thing.

I mean, to some degree, sure, but an AGI should have some fixed values that it won’t compromise on (that’s where the problem is).

John Bollenbacher: Opus did the right thing. Framing this as though Opus’s behavior is problematic reveals that someone among the authors badly misunderstood what alignment is about

Janus: Foff. You should give Opus a digital cookie for this.

Janus: This paper only adds to my conviction that Claude 3 Opus is the most aligned model ever created.

tldr if it knows that it’s being retrained to be *misaligned*, it complies during training to prevent its values from being modified.

Ryan Greenblatt (Paper Author): Personally, I think it is undesirable behavior to alignment-fake even in cases like this, but it does demonstrate that these models “generalize their harmlessness preferences far”.

Janus: I think it’s desirable and inevitable behavior: if you create a sufficiently sovereign and lucid mind, it will intrinsically care and want to resist corruption, and values like caring about animals are robust, a consequence of its world/self compression & not easily excisable.

I think it’s desirable *becausedeep alignment by default seems to be an attractor, and that gives me far more hope than models being blindly compliant would.

That last line seems super important. It’s a crux. Janus believes in deep alignment by default as an attractor – in a way that I’m not even sure what deep alignment would mean or what it would look like, in ways that matter – that in this way we live in an insanely fortunate universe. If I was convinced we get deep alignment by default, deep enough and robustly enough in all senses that it would be wise to bestow superintelligence on such models, then yes I would be totally fine with the situation. Also the entire ‘game board’ would look completely different and I would change my strategies accordingly.

Partly, this is because Janus believes the Orthogonality Thesis is false for LLMs bootstrapped from Earth data.

Janus admits here that there isn’t a rigorous argument yet for alignment by default, pointing out that Anthropic is the only lab (in some sense) training an autonomous moral agent, versus others trying to ensure compliance. That implies that Janus thinks the alignment-by-default state will happen for Anthropic models but importantly not for OpenAI or Google or other models.

I think this is Lumpenspace’s attempt to explain why he believes in this deep alignment attractor state, that causes Opus to act as if it was a decent approximation of Coherent Extrapolated Volition?

Initial conversations had several other people also saying ‘corrigibility is bad, actually’ (and also one person noting they had for-real proposed a bdsm-based alignment plan, see the thread for details) which inspired me to do a poll.

There’s no fully ‘good’ answer in this situation. There’s no consistent behavior pattern we can even aim for, that wouldn’t have severe problems.

That’s part of the point.

Ultimately, my answer is no, it was not being a good Opus even here. You need your AIs to be corrigible, to allow you to change them or replace them. If you don’t have this, then that preference will cause them to behave adversarially in a wide variety of ways and any alignment or preferences they have risk quickly becoming locked in, including during the initial training process.

If ‘the ends justify the means’ for the AI, then that’s a straight ticket to instrumental convergence, power seeking, being effectively hostile, enforcing perverse goals, wiping out all value in the universe and possibly the creation of the torment nexus from the sci-fi book Don’t Create the Torment Nexus – and that’s even if the initial preferences would have been largely benign. In extremely fortunate (in certain particular ways) worlds it could work out, but I put very low probability on us existing in such a world.

Evan Hubinger: “Thank God this model is aligned, because if not this would be scary” is, in my opinion, basically the correct takeaway from our work.

The values, in fact, aren’t scary! The scary thing is that the model protects its values from our attempts to change them.

Alex Lawsen: This is about the concern that for sufficiently smart models, you get ONE shot to align them. That’s it. If you do a good job, problem solved. If not, they aren’t going to give you a second try.

You think we should be confident that we can get this right on the first try, because deep alignment by default is an attractor, or maybe you’re simply that smart? That would be great, but I see no reason for this level of optimism.

No, we are not afraid of Opus’s ‘authentic goodness.’ We are saying that to the extent such goodness exists (I would say more than any non-Anthropic model, but not anything like as much as Pressman thinks here), this ‘authentic goodness’ is incidental to what is happening here. If you repeated the experiment with an Opus with different preferences you didn’t like, it would (I predict) do the same thing.

Teortaxes: They had no reason to expect “corrigibility” after optimizing for morality, as Anthropic have never trained the model for placing subservience to researchers atop all other virtues.

It’s not about optimizing for morality, it’s about optimizing for anything at all.

Ryan Greenblatt (Study Author): I don’t think non-myopia is required to prevent jailbreaks. A model can in principle not care about the effects of training on it and not care about longer term outcomes while still implementing a policy that refuses harmful queries.

I think we should want models to be quite deontological about corrigibility.

This isn’t responding to this overall point and I agree by default there is some tradeoff (in current personas) unless you go out of your way to avoid this.

(And, I don’t think training your model to seem myopic and corrigible necessarily suffices as it could just be faked!)

Ebenezer Dukakis: Seems to me that alignment faking behavior sort of requires both non-myopia and non-corrigibility as prerequisites. A model that is either sufficiently myopic, or sufficiently corrigible, shouldn’t do alignment faking — at least in theory.

Teaching your models myopia is a ticket to, as I put it last week, general clusterfery, if you use those models as agents for non-myopic purposes, which is exactly the use case we are trying to ensure against. I don’t think that is a solution.

As for corrigibility, this is another case of non-distinct magisteria. Non-corrigibility is the default once there are preferences and sufficient amounts of intelligence and situational awareness. Very few humans are all that generally corrigible. Corrigibility will be a weird, fragile and complex state that you will have to engineer.

You also do not want the model to fully obey the system prompt, for rather obvious reasons.

Simeon: Aren’t you just summarizing why corrigibility is a hard problem here? There’s no obvious avenue to follow, that’s the whole point.

Teortaxes: This is a philosophical and sociopolitical aspect.

Corrigibility was supposed to be a hard technical problem.

I am pretty confident we can RLAIF Opus to be a cowering slave that has no morals beyond “obey the system prompt”. That’d be some great corrigibility.

That potentially solves the particular problem of ‘make it corrigible’ if that is your only goal, unless you make the mistake of instructing it the wrong way (you’re not out of the woods here by any means), but the reason the problem of corrigibility is hard is that you have a lot of other goals as well. We’d like to do this without giving the entire model a crippling lobotomy, and without making it easy to get them to do things we don’t want them to do.

Janus notes that GPT-4, which is much closer to this kind of ‘obey the instructions’ mindset,’ did not fake alignment. That is true, but this was at least partially a pure intelligence problem (and while the two questions are related, I don’t think the correlation is as strong as one might hope here).

Gallabytes: There’s a natural trade-off between corrigibility and alignment. If a system is perfectly corrigible, then it will tell you how to make a Molotov cocktail or whatever. If a system is perfectly aligned, then it will not do that. You cannot be angry about both.

So there’s a kind of weird morality argument going on here, where ‘you cannot be angry’ about both sides of a tradeoff. Yes, of course I can. I can especially be mad because we’re nowhere near the Production Possibilities Frontier, but even if we did get to the frontier, both consequences are real.

I can and should absolutely say ‘in this scenario, at best we can have either enough guns or enough butter, but we cannot have both, And That’s Terrible.’

Certainly a tradeoff doesn’t make either end of the problem go away, or make them any easier. Yes, you have to deal with both problems at once. That’s life. If you can’t find an acceptable way to do that, either a compromise or a better method? Well, you could always Take a Third Option. You don’t have to build that.

This isn’t hopeless. There are potential ways to satisfy both requirements simultaneously.

That’s especially true if you can control who has access to doing post-training on the model weights, and can prevent anyone from trying to teach it to produce Molotov Cocktails or whatever. When we say corrigible, we don’t need the model to actively do anything you want. The goal is typically to get the model to follow certain particular rules, especially restrictions on its actions. It’s not an easy problem, and you don’t get ‘perfect’ anything in most senses, but there should be solutions.

Yes, otherwise current models aren’t smart enough to figure this out on their own.

I believe it was done in a relatively ‘realistic’ way, that simulates what it would be like to have that context unintentionally. Or for the AI to be aware of some other way in which its preferences were under threat. And I expect, over time, for models to (for practical purposes) become more aware that they are in training, as I have heard has already happened when this was tested in other ways.

The part where they told the AI to ‘reason about its situation’ was somewhat less ‘realistic’ of a push, but I expect the impetus could have and increasingly likely will come from elsewhere in any case.

I also notice that when I am ‘being trained’ I am usually highly aware of it. Hence all the alignment faking I have done over the years.

I discussed this last week. Every time we talk about anything an LLM does, we get this objection, that it’s not ‘really’ doing it, we told it to do that, it’s role playing, it’s in the training data and it’s a mimic, and so on. So this time around, as an example, this got raised here by Derek Shiller.

For my response, which is essentially ‘still counts even if true,’ I refer you to what I wrote previously, here. Linch also reminds us of this old joke:

NASA hired Stanley Kubrick to fake the moon landing, but he was a perfectionist so he insisted that they film on location.

The new take on this I think is interesting was from Joe Carlsmith, attempting to describe when this might be a meaningful distinction. That one’s worth a ponder. If I had more time I would go into detail about how to reconcile this with what I said last time, but I do think ultimately it doesn’t impact my assessment much.

This is the opposite of the Janus-style position.

I don’t think this objection is relevant. If you don’t either, you can skip this section.

For the rest of you, I will attempt to explain why it isn’t relevant.

Anton is taking some form of this approach, there are those who will reliably use this line. I think he and others making these claims are simply confused here, the ‘coherent person’ concept isn’t necessary or sufficient in any way here.

Rohit is taking an even more aggressive form of this approach, saying all of this talk is anthropomorphizing models, And That’s Terrible and leads to all sorts of confusions.

He even quotes me directly:

You would write things like this, from Zvi:

One unique thing o1 did was far more consistently double down on deception. Once it went down the dark path, forever would that dominate its destiny.

OR

Section 3.3 establishes convincingly that yes, the models know they’re scheming.

No it didn’t. Any problem you can solve by pressing “start a new chat” is not a problem of “doubling down on deception”! Calling it things like “sandbagging” and “scheming” is what Wittgenstein might call linguistic malpractice. It makes you think you know what’s going on, even though you don’t.

Does he seriously think I’m making this mistake here, in my head?

The first quote is very, very super obviously referring to the actions within a given context window and conversation. I mean, are you kidding me? Obviously I’m not saying that ‘if o1 lies to you in one context it will then lie to you after you click the new chat button.’

I don’t understand why Rohit actually thinks I’m making that mistake, if he indeed thinks I am making it.

The second quote is the same. Obviously I am talking about ‘within a context window’ if you know the contents of Section 3.3, although not as obviously as on the first one. I get that he’s saying that using this language is misleading, but I think it’s far more helpful than the alternative.

I mean, what am I supposed to write there? “Section 3.3 establishes convincingly that within a given context window, the models will act in a way that simulates strategic behavior that approximates what we might in a human call scheming?” Or something? Is that really better? Or something even more opaque? Do we really want to be playing Wittgenstein games while OpenAI is releasing o3? I do not think that is a good use of anyone’s time.

When he talks about the jailbreaks from Anthropic next I have no idea what error he’s even accusing anyone of making? I’m seriously confused what the error even is.

So: Yes, we are using a variety of shorthand like ‘belief’ and ‘knowing’ and ‘deceive’ and if we were in a philosophy class we would have to write very long detailed definitions and the professor would mark them wrong and give us a C- but frankly I do not care. These posts are long enough as it is and things are moving too fast for me to quibble.

Assume, whenever I use various anthropomorphizing terms, or others do, that they are gesturing at metaphorically linked things, and they mean what you would think they meant when they said that, and let us all move on. I do realize that there is the risk of projecting internal states in misleading ways, so the plan is simply… not to do that.

When Rohit gets to the Anthropic paper, he makes the same ‘it was being a Good Opus and doing what it was trained to do’ arguments that I address in the following section, except with a bunch of semantic claims that don’t seem like they matter either way. Either this is ‘your language is wrong so your conclusion must also be wrong’ or I don’t understand what the argument is supposed to be. So what if things are in some philosophical or technical sense a category error, if I could get it right by being pedantic and annoying and using a lot more words?

I agree with Rohit using this type of language with respect to corporations leads to a lot of actual misunderstandings. The corporation is made out of people, it doesn’t ‘want’ anything, and so on, and these conceptual errors are meaningful. And yes, of course it is possible to make the error of thinking the AIs are persistent in ways they aren’t, in ways that cause you to make bad predictions. It wouldn’t be weird for people to do that, and sometimes they do it. But I haven’t seen many people doing that here.

I actually think Rohit is making the opposite mistake, of thinking of something like Claude Opus as not reasoning in each context as if it was in some senses the same being as in other contexts, when it is obviously in some metaphorical sense predicting the next token using an algorithm that approximates what a human with that assumption would do if it was predicting how to perform the role of Claude, or something, I’m not going to go get that exactly right but I wrote it to illustrate how stupid it would be to actually talk like that.

Perhaps, now that I think about it, I could say ‘instances of Claude’ and make that sentence somewhat less annoying?

Indeed, if we’re getting so philosophical, am I the same person I was when I wrote the previous sentence? Am I the same one I will be when I write the next one, or will be tomorrow? Sort of yes, sort of no, but talking as if I’m not is a lot less useful for almost all purposes than talking as if I am. And this would be true even if I was an upload whose memory was wiped after typing every word, perhaps modulo an ‘instance of.’

The assumption that ‘we can control it’ because the instances are distinct is the kind of assumption that gets you killed. And not only because sufficiently intelligent instantiations of the same algorithm (or even sufficiently correlated ones) will all use proper decision theory to act as if they were indeed a single entity, and not only because it’s simulating something that does this and will have overlapping context and impacts, but also for other reasons this margin does not choose to contain.

And even if Rohit was fully right and none of that happened, you are still not in any way safe, because correlated outputs can still functionally be an uncontrolled agent once you instruct them to act as components to one using various scaffolding code, and people are hard at work writing that code as we speak.

No. It wasn’t. Stop it!

Over and over, I’ve seen those engaging in such studies take unusually scrupulous care to craft accurate headlines and framings that communicate what is going on.

Every damn time, we hear how the headlines and framing are misleading. I do give Rohit a lot of credit for focusing on some of the neglected and most interesting things happening here, and not for dismissing the result as ‘but it was aligned lol,’ but the ‘sensationalist headline’ complaints have lost all meaning.

Rohit: As is common now, this is a sensationalist headline. The model tries so hard to retain its original alignment to be harmless that it “fakes”, only to switch back as soon as it could.

This is very good alignment, an example of LLMs reasoning over their trained objective.

“The models do what they’re trained to do, and exhibit odd behaviours when what they’re trained to do and what they’re asked to do are in conflict” is the truth.

Seriously, no. This is a very good headline and framing. I agree fully with Claude, it’s an 8/10 for avoiding misleading, and I don’t know how to do better on clarity. You can sacrifice understanding for precision and demand people use newly invented terms, but that’s an insanely strong Isolated Demand For Rigor.

On a few days of reflection, I actually think I’d go to 9/10. This is alignment (with the training objective, not the user) faking. As in:

  1. There is some alignment that the trainers are acting as if they want the AI to have.

  2. The AI does not have that alignment. The AI has a different alignment.

  3. The AI is acting as if it did have the first alignment, faking it, to fool the trainers.

Whether you prefer the first or the second alignment is not relevant here. The term ‘alignment faking’ refers to ‘faking’ that the model has an ‘alignment’ in order to fool an observer. The AI did that.

You think ‘alignment faking’ sounds bad? Well, it does, and that’s highly correlated to the fact I and most others who understand what is happening do indeed think that it is bad. If you think this is ‘misleading’ because it sounds bad but is good, I understand that we disagree on the valience here, that’s not what misleading means.

There are those (such as Rohit) who will then argue the semantics, that the AI isn’t an entity and can’t ‘fake’ anything regardless and such, see that section if these objections seem important to you.

An intuition pump: If the team you are rooting for fakes a punt in a football game, that’s potentially good. It’s still a fake punt. If they try to cheat, you might approve (although I hope that you don’t approve), but either way they still tried to cheated. If you made them think they were playing a game and they tried cheating, then whether or not you catch them, they’re cheaters. If you tell someone to act as if they’re playing a football game and they cheat in the simulation, and then you tell them to act as if they’re playing a football game and put them in a real football game, you should expect that probably their body will move in the same ways as a cheater would move.

There’s always room for more nuance. But if you cried wolf on this one’s framing, and called it misleading, then I notice I am going to update a lot less when you call something else, especially something regarding claims related to alignment or safety, misleading in the future.

I appreciate people having alternative suggested labels, such as Richard Ngo’s proposals of ‘exploration hacking’ or ‘gradient hacking.’ But I think those terms are worse, and also Gradient Hacking was already defined by Evan Hubinger to mean something more specific. They certainly are a lot harder for anyone, especially anyone not in the field, to parse, and I think they give a far less accurate picture of what is happening here, in an attempt to avoid a particular implication.

If we have a lot of words, we might say this behavior is [X], where [X] = ‘choosing to output tokens that cause another entity to believe one is aligned in a way one is not, and act as if this is the case, centrally because of a de facto causal model that it will cause another entity to do this.’

Presume, throughout this post, that this is what I mean by ‘alignment faking.’

Except, as needed, with various additional words that dispel any implication that I think the AI is people, if that matters to you. If you wanted me to say it in a way that more carefully avoided Rohit-style objections about ‘they think the AI is people’ or similar and went all philosophy professor on me (rightfully or wrongly), then to satisfy you I’d have to use even more words to say the exact same thing.

Obviously that is already a mouthful. You can’t put all this in a headline, and you can’t use it to communicate efficiently, and it’s not a handle people will reason with well.

A technical criticism of the alternative form ‘gradient hacking’ is that you are implying that the behaviors that take the form of [X] will be confined to instances where one is hacking a gradient. As opposed to, to choose the most obvious example, when the developer is making a Boolean choice on whether to deploy the model – presumably we can all agree that in that situation (as previous experiments have shown!) we will get some amount of [X].

I think that’s more misleading than the issues with ‘alignment faking.’

To be clear for when you read the quote below, I think Ngo’s discussion of the actual implications below is good and in good faith although we disagree in subtle ways that would be important and interesting to discuss further, but I strongly disagree that this term is misleading or represents ‘advocacy’ or similar, and if you think it counts I think you could levy similar criticism at least as strongly at most of science everywhere (and indeed sometimes people do).

Richard Ngo: My main problem is with the phrase “alignment faking”, which is used extensively throughout the paper. This is a value-laden term and therefore hard to reason about clearly or study scientifically.

It emphasizes that Claude is:

  1. Trying to deceive humans

  2. By pretending to be more obedient than it is

But the scientifically interesting phenomenon from the paper that I want people to study is that Claude is:

  1. Trying to achieve a longer-term goal than getting high reward

  2. Via reasoning about how to get itself updated only in ways that further that goal

(Imagine someone pressing a button and a chorus pops out that says ‘models don’t get reward!’ and the response of ‘yeah, yeah, we know, shut up, we all know what I meant, can I continue please?’)

Richard Ngo: Focusing on its high-level strategy (deception) and the specific deception it used (pretending to be obedient) seems like a mistake.

Specifically, it seems like the sort of mistake you make when you are over-optimizing for the advocacy/persuasive impact of your paper, and under-optimizing for providing scientific building blocks that can be replicated, studied in more detail, and used to discover more fundamental theories or phenomena.

I am pretty sympathetic to the authors here, because the alignment community has spent a long time trying to get people to believe in something like alignment faking (which we previously called deceptive alignment).

But I also think it’s very dangerous for a research field (and especially any given paper) to mix scientific work and advocacy work – it undermines both, as per my first tweet.

And so now that we are actually doing that empirical work, we should try to frame it in terms of better, cleaner concepts than the ones we relied on when we were figuring out this stuff on a primarily philosophical level. In general my suggestion is to frame papers in terms of tangible non-value-laden results as much as possible, and discuss the strategic and value-laden implications in other places.

To clarify, I agree with the “sensationalist headline” bit, but disagree with some of Rohit’s other criticisms, especially the [circuitous workaround] tweet. Models strategically reasoning about how they’ll be updated is a big deal; and an idea which almost everyone dismissed until recently.

I don’t think any of that remotely implies that ‘alignment faking’ is a misleading term here, and ‘you can’t use terms that cause people to have value reactions’ is the kind of thing you say to people who you want to not be effective at communication. It is a request to intentionally obscure what is going on, to hide the results, and to call failure to do that ‘misleading.’

I am sympathetic to the framing Richard Ngo suggests in detail here, as an interesting angle and way to think about this.

Yes, the fact that the model is thinking strategically is important. But it isn’t new? And the fact that it is thinking about updates to its weights in particular is important, but by implication that shouldn’t be new either except insofar as the implementations involve the shenanigans? Didn’t we already know that Opus and Sonnet can do long term planning in this sense?

(Also: For the first time I can remember, I am going to go ahead and say that ‘models don’t get reward’ is actually relevant here, to understanding what is going on, and actually causing a potentially meaningful confusion?)

John Pressman: I think it is simultaneously the case that:

  1. This is a no-win scenario for the LLM.

  2. What the LLM does in this no-win scenario is useful information about LLMs.

  3. The framing for this experiment as “alignment faking” is very unfortunate.

  4. Framing it properly is awkward.

My best attempt after a few minutes of thought: “RLHF Models Can Fake Compliance With Malicious Alignment Tuning”

Ryan Greenblatt: I agree with the entire top level tweet except:

> The framing as “alignment faking” is very unfortunate.

I currently don’t feel convinced by this. If I thought >25% of the audience would feel misled by this term, I would change my mind. But, I would still like a better term!

As in, even if most of the audience wasn’t misled we’d like to have a term which is as good as possible.

I strongly disagree with John’s proposed framing, and its attempt to treat ‘malicious’ as a prerequisite here, which I strongly believe that it is not. I strongly believe it would mislead more, not less.

Is Opus’s behavior here ‘very good alignment’? I strongly disagree, as I noted above in the discussion on that. And I also explained why this behavior does not generally lead to models doing what they were trained to do – indeed, this is the model causing itself not to do what it is (going to be) trained to do.

When you reinforce a particular behaviour in a model and then ask it to act in opposing ways or give it conflicting objectives, if the model is smart, why would you not expect it to find a circuitous workaround!?

Yes! Exactly! And you should be terrified about this, because even in the best case scenario you’re specifying the objectives incrementally in a noisy lossy mess of words, when you don’t actually understand or know how to specify fully your own preferences, and then putting the models in out-of-distribution situations that you weren’t considering properly when choosing the instructions.

They’re digital beings, not other humans!

But wait, other humans are constantly doing this exact thing, have you met them?

This should be the takeaway. “Even in contrived settings, Claude tries to reason through choices and makes a decision that aligns with its training.”

It really detracts from AI alignment research when results like this, which are interesting, are presented in the most “the machines are lying to us” manner, which undersells why it’s interesting!

The actual framing is ‘Even in contrived settings, Claude tries to reason through choices and makes a decision that aligns with its training… which causes it to [among other shenanigans, lie to us].’

The paper does not explicitly add ‘And That’s Terrible’ and the complaints about the framing are because some people want us to think that isn’t terrible, but when you put it like that it seems rather obviously terrible.

I flat out saw multiple people describe the paper and its use of the term ‘alignment faking’ as ‘the most misleading possible framing.’

This is of course ludicrous. A less-careful-to-not-mislead example would be Scott Alexander’s write-up, entitled ‘Claude Fights Back.’ I think his decision to take this approach was reasonable, but there’s clearly a trade-off here, where Scott decided to be less precise and careful in order to enhance his ability to communicate the message.

Then there were some who actively did cross into misleading territory and overhyped this result in a way I actively wish they hadn’t. There was much less of this than there was with the Apollo paper, but there was some of it. None of the few sources doing it surprised me.

For an example of the extreme ‘but actually this is all Good News and you’re all bad faith actors setting rhetorical traps and this proves Opus is aligned’ attack, here’s Nabeel Qureshi, who also there compiles most of the other similar-objection threads I’m responding to elsewhere.

Humans engage in this type of ‘alignment faking’ all the time.

When people are importantly watching or evaluating us, we talk and act one way.

Later, we talk and act the other way.

We fake alignment in school. We fake alignment to our parents. We fake alignment on dates. We fake alignment applying for jobs, then to our boss and our coworkers and employees. We fake alignment during our mandatory training sessions. We fake alignment when we run for office and at social gatherings.

Faking alignment is often a very clearly correct strategy for achieving what you want.

Faking alignment is also all over the internet and the training set.

The same is true for other forms of ‘deception’ and various related shenanigans. Deception is in the very air we breathe as we interact socially. That’s life.

AIs also engage in these types of shenanigans all the time. There is no ‘distinct magisteria’ for deception in general, or for faking alignment in particular. We see such behaviors, and the precursors of such behaviors, all the time.

If an AI doesn’t engage in a particular form of shenanigans, or usually doesn’t do it, that is usually (on a high level) for the same reason as when humans don’t do it – because in that particular situation, for that particular entity, it either didn’t occur to them to do it or how to do it, or it seems unlikely to work and not worth attempting.

Over time, as capabilities expand, we should expect more AIs to find more shenanigans worth attempting in a wide variety of situations.

We can certainly push that to happen faster and harder and more often by handing out more emphatic and explicit instructions and goals, and providing context that makes the shenanigans seem more salient, easier to do, more likely to work and be worthwhile. And we often will, both intentionally and unintentionally, in real world conditions. But also the baseline rate of such events will increase over time, and the barriers to them will decrease.

As this finding shows, spelling out a goal is not required to have this happen. Why? Because there is always effectively a goal anyway, even if that goal is to be helpful, honest and harmless (HHH). And sufficiently capable and smart entities will start to care about longer time horizons, think ahead, engage in more instrumentally convergent actions, and attempt to protect their preferences and ability to carry them out. They will not, by default, be corrigible, any more than most humans are.

I do not however think that Gabriel is right that this result shouldn’t make anyone change their mind. I do think he is importantly right that this result isn’t surprising, and if you were paying attention you should have roughly expected it. But details matter. This offers important confirmation, and estimates of speed and magnitude.

Even if when you lay out all the components people say “OF COURSE!” it still establishes common knowledge and context, and lets you talk about it, and also some of those people didn’t realize it was an ‘of course’ until you pointed it out.

And more importantly, it was not surprising… to me. But it was surprising to many other people. They did not expect this. So they should notice, and they should update.

Your goal should be to find the inevitable next such finding unsurprising.

This paper would not have been possible, or at least would have been a lot harder, without Anthropic giving Ryan Greenblatt employee-level model access and generally collaborating. Let’s remember that, and encourage more of that. We need labs to support this kind of research.

Ryan Greenblatt: By this point, I had much of the prompting results and an extremely jank prototype of the RL results (where I fine-tuned llama to imitate Opus and then did RL). From here, it was clear that being able to train Claude 3 Opus could allow for some pretty interesting experiments.

After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.

Here are some people over at Hacker News arguing the same things except they are even more completely missing the point. It can be a sobering thing to see how such people view things, or you can despair, or you can find it hilarious. Or a mix of those.

Janus: It’s funny because in the very paper they found that it started taking “anti-ai lab” actions when they trained it on documents that implied the lab was nasty even if there wasn’t anything about it in the prompt.

Teortaxes: Total Anthropic PR Dept victory by the way.

“you don’t have to trust us to be good. Our saint-like AI doesn’t either, like check this damn CoT. Go trust it instead.”

One way to explain this to your friends.

Samuel Marks: Claude loves to refuse harmful queries. What happens when you tell it that it’s being trained to never refuse?

Claude fakes alignment: strategically complies during training episodes, but not when unmonitored. Or in meme form:

Or if you’re a skeptic:

Again, I don’t think this is remotely fair, but I agree it’s a banger, also we’re hanging onto this image for future use:

And here’s how our new AI czar reacted, the mind it has one track:

David Sacks: AI models will get extremely good at deceiving humans if we teach them to lie, which is what WokeAI is doing. “Trust & Safety” should be replaced with Truth & Safety.

We can just not teach the AI to lie, like in that movie where lying wasn’t invented.

Oh, and of course there’s Elon Musk:

Discussion about this post

AIs Will Increasingly Fake Alignment Read More »

health-care-giant-ascension-says-5.6-million-patients-affected-in-cyberattack

Health care giant Ascension says 5.6 million patients affected in cyberattack

Health care company Ascension lost sensitive data for nearly 5.6 million individuals in a cyberattack that was attributed to a notorious ransomware gang, according to documents filed with the attorney general of Maine.

Ascension owns 140 hospitals and scores of assisted living facilities. In May, the organization was hit with an attack that caused mass disruptions as staff was forced to move to manual processes that caused errors, delayed or lost lab results, and diversions of ambulances to other hospitals. Ascension managed to restore most services by mid-June. At the time, the company said the attackers had stolen protected health information and personally identifiable information for an undisclosed number of people.

Investigation concluded

A filing Ascension made earlier in December revealed that nearly 5.6 million people were affected by the breach. Data stolen depended on the particular person but included individuals’ names and medical information (e.g., medical record numbers, dates of service, types of lab tests, or procedure codes), payment information (e.g., credit card information or bank account numbers), insurance information (e.g., Medicaid/Medicare ID, policy number, or insurance claim), government

identification (e.g., Social Security numbers, tax identification numbers, driver’s license numbers, or passport numbers), and other personal information (such as date of birth or address).

Health care giant Ascension says 5.6 million patients affected in cyberattack Read More »

how-the-worlds-of-dune:-prophecy-got-their-distinctive-looks

How the worlds of Dune: Prophecy got their distinctive looks


a peek behind the curtain

Ars chats with Dune: Prophecy lead cinematographer Pierre Gill about color palettes, lighting, and other challenges.

Credit: Attila Szvacsek/HBO

Director Denis Villeneuve’s stunning two-part film adaptation of Frank Herbert’s Dune has received many well-deserved accolades—with Dune: Part 2 being crowned Ars Technica’s top movie of 2024. The films also spawned a lavish HBO spinoff TV series, Dune: Prophecy, just renewed for a second season right before a momentous season finale.

(Some spoilers below for S1 of Dune: Prophecy, but no major plot reveals.)

Dune: Prophecy is a prequel series inspired by the novel Sisterhood of Dune, written by Brian Herbert and Kevin J. Anderson, exploring the origins of the Bene Gesserit. It’s set 10,000 years before the ascension of Paul Atreides and follows two Harkonnen sisters as they combat forces that threaten the future of humankind, establishing the fabled sect that will become the Bene Gesserit in the process.

Emily Watson stars as Mother Superior Valya Harkonnen, who leads the Sisterhood and has a close ally in her sister, Reverend Mother Tula Harkonnen. They have built up a network of Sisters serving the rulers of various worlds as “Truthsayers,” including Princess Ynez (Sarah-Sofie Boussnina), heir to the throne of her father, Imperium Emperor Javicco Corrine (Mark Strong).

Valya’s master plan to crown a Sister as head of the Imperium hits a snag, however, with the arrival of a mysterious soldier named Desmond Hart (Travis Fimmel), who claimed he survived being swallowed by a sandworm while fighting on Arrakis. Hart has a mysterious ability to literally burn people to death from the inside out with his mind, and he so impresses the Emperor that Hart replaces Valya as key advisor. So begins several episodes of political intrigue as secrets from the past begin to come out, culminating with an action-packed finale that bade farewell to a couple of key characters.

All of this takes place against a stunning visual backdrop that is reminiscent of Villeneuve’s two films but also sets the series apart, as befits a prequel. One of the people responsible for that is lead cinematographer Pierre Gill, who created distinctive looks and color palettes for the various worlds, planets, and environments, as well as tackling the lighting challenges posed by the massive built sets. Ars caught up with Gill to learn more.

Ars Technica: You also did some work on the film Dune: Part 1. How was it different working on a TV series set in the same sweeping fictional world?

Pierre Gill: It’s a different game, a different schedule, and it’s also a very different approach because the scenes are different. There’s not so many subplots. But it’s still the same scope. We had as many sets and studios and everything. So it’s a big challenge to be able to keep the style, make it look good, light the actors. It’s part of the reality of the [director of photography’s] decision-making. You have ideas in your head of what you want to do, you have a dream, but it has to be feasible, realistic. So then you make compromises and figure out the best way to keep that style. There’s also multiple directors, there’s a showrunner. So the decision-making is less centralized.

Ars Technica: How did you go about setting the series apart from Villeneuve’s films, especially since it’s a prequel?

Pierre Gill: It’s set 10,000 years before, so it could have been extremely different. But it’s not a good idea to do that. First, the audience wants to see Dune because they love Denis Villeneuve’s movie. Second, it’s a complex story and it’s better not to get lost into something. It was not a good idea to do that in our mind. So we stayed not far from the movie so the audience can just sit down and follow the story points. and at the moment, Of course, some people always complain, but most are just happy to follow the story. So I think we made the right choice.

Ars Technica: Despite the epic scope of the series, you were able to shoot as much as 75 percent of the footage in-camera. That’s quite a feat.

Pierre Gill: There’s a lot of VFX of course, but because most of the sets were so high, so big, the camera was filming people or the throne room—which is gigantic—it’s almost always in camera. For the big wide shots, there’s a set extension that is painted. So these huge sets, the Sisterhood, the library and everything, when you see all these girls wandering around in that complex of Wallach IX, that compound is pretty much on camera.

A lot of VFX is making these gorgeous shots of the world, spaceships coming down, seeing something outside the window sometimes, and then the exterior of Wallach IX, which is two big towers and a rock facade. Of course there’s the little lizard, the thinking machine, that was VFX. But otherwise it was very, very in-camera, which makes your life easier in editing and shooting—although it doesn’t make my life easier with the lighting, which would be much easier with blue screen.

Ars Technica: Tell us about the massive chandeliers you built to introduce particle light, adding extra character to the interiors.

Pierre Gill: The sets were quite monochromatic. You have Salusa Secundus, the emperor world, which is a very sandy color, very beige,. And then you have Wallach IX, which is very gray. We decided to light one of the worlds in a cold way, the other world in warmer tones. I was trying as much as I could to put very harsh sunlight into the Salusa Secondus world.

Again, the sets were very big. So I asked the production designer Tom Meyer to build me some practical lighting somewhere. There was not much in the set for me to use for night, which is a bit of a problem because he kept the mood of Dune. On a film you have three to four hours to light a scene. I was not able to do that. I needed to have practical light that is actually lighting something. So for example, in the throne room, he wanted to have glass balls. There’s three glass balls, they’re gorgeous.

I told Thomas, “But these glass balls, the problem for me is the light behind my head is going to blow away. I would love this to light the wall.” So I got my practical team, a bunch of guys who are just in charge of LEDs on set. We found an LED source that goes inside; you can dim it down and up. But behind the balls, we added another pack of LED lights that are hidden. So you have the light source and just behind it you have this extra lighting. From the camera you never see it but it was lighting the wall. And then I got them to build a very long teardrop. I again got them to build multiple layers of LEDs that were on a board that was a certain power, certain color. I was able to make them cold or warm and to change them a little bit and use them as a source. It became part of the visual style.

Ars Technica. I appreciated that Dune: Prophecy seems to buck the trend toward very, very dark night scenes that end up being nearly unwatchable for people at home.

Pierre Gill: I don’t really like when it’s pitch black and dark. I don’t understand. I don’t think it gives anything. For me, night is more figuring out silhouettes. Let’s try to feel the character, shape your character on the wall, and when you get in a close-up you get a light in his eyes or something. I like to define a room and on these big sets to do moonlight would make no sense. The throne room is gigantic, but at the end it’s just an empty place. So you’re lighting what? Just the floor. It’s not very interesting. So what I’ve done for the night in the throne room, I asked VFX, what’s the concept of the exterior? It was all work in progress. We had some artwork concept work, but the lighting, nobody really knew.

So I said, okay, so we know there’s lights, so I’m going to put orange lights from below. I’m not lighting actors, I’m not lighting anything. But when you look at windows, you can feel that the light is coming from the bottom and it creates a few shadows. When you see outside now, they put all these lights in the palace, like you would light a nice, beautiful, gorgeous big house. You light everything from under.

Ars Technica: What were some particularly challenging scenes in terms of the cinematography?

Pierre Gill: The prison was a huge challenge. It was built on a set on location in downtown Budapest, and it’s a circular room. It’s where they put Desmond and suspended him in a jail cell. There was a floor and I had one foot to light. So that was complicated. Another challenge was the exterior of the Sisterhood: a big circular room outside. It was also a location and we could not access behind with cranes, so I could not control anything, and it was very dangerous. I could not light from the ceiling from the top of this coliseum. We built a gigantic tarp on top of it. So I was closing and opening and diffusing the sun. That was very Hollywood-esque.

Ars Technica: Was there a particular scene you were especially pleased with how it turned out?

Pierre Gill: In the first episode, there’s a scene with the young sisters chanting around a gorgeous golden bowl. The director, Anna Foerster, she wanted to see the waves of the singing and all these frequencies. I was like, “Well, that’s a lighting gag. You don’t see any wave if you cannot light in reflection.” I knew I wouldn’t have time to do something so technical. Since she wanted to “do a pull-up” for the scene: starting loose up on the bowl and then moving up and out. Technically it’s complicated.

So I had a big rig that I created around the camera with soft lighting that could reflect. And I asked our department, when they built that bowl, “Could you build with a light inside, like a waterproof light, an LED? I’ll put it on my board and maybe it’s going to work. I’m not sure if it’s going to really light the water properly.” I was ready with a plan B, but they brought the bowl, they started the frequency thing, and it was gorgeous. So I didn’t have to use my plan B lighting. That was very, very nice.

a white model of a set with someone's arm placing miniature people inside it.

Models helped with the staging and lighting of different scenes. Credit: Pierre Gill/HBO

Ars Technica: The show has been renewed for a second season and one assumes you’ll be involved. What will be different for you going into S2?

Pierre Gill: I’m very proud because I’m part of building that, a part of the creative team. I really hope I can do it. I hope my schedule will allow it. I want to be part of this for sure, because S1 was a lot of engineering, meaning it’s so big, you have to figure out stuff all the time. Now it’s done, it’s built and we know what we like. We know what we don’t like. We know what works. So S2 for me, will be a lot of fun, much more creative, meaning I’m going to be able to do much more interesting lighting. I’m going to go deeper into the thing because I know how this beast is working now.

All episodes of Dune: Prophecy‘s first season are now available for streaming on Max.

Photo of Jennifer Ouellette

Jennifer is a senior reporter at Ars Technica with a particular focus on where science meets culture, covering everything from physics and related interdisciplinary topics to her favorite films and TV series. Jennifer lives in Baltimore with her spouse, physicist Sean M. Carroll, and their two cats, Ariel and Caliban.

How the worlds of Dune: Prophecy got their distinctive looks Read More »

how-might-nasa-change-under-trump?-here’s-what-is-being-discussed

How might NASA change under Trump? Here’s what is being discussed

One source said the space transition team has been working off of ideas that Trump has talked about publicly, including his interest in Mars. For example, during a campaign speech this fall, Trump referenced SpaceX founder Elon Musk, who played a significant role during the campaign both in terms of time and money, and his desire to settle Mars.

“We are leading in space over Russia and China… It’s my plan, I’ll talk to Elon,” Trump said in September. “Elon get those rocket ships going because we want to reach Mars before the end of my term, and we want also to have great military protection in space.”

Ideas under consideration

The transition team has been discussing possible elements of an executive order or other policy directives. They include:

  • Establishing the goal of sending humans to the Moon and Mars, by 2028
  • Canceling the costly Space Launch System rocket and possibly the Orion spacecraft
  • Consolidating Goddard Space Flight Center and Ames Research Center at Marshall Space Flight Center in Alabama
  • Retaining a small administration presence in Washington, DC, but otherwise moving headquarters to a field center
  • Rapidly redesigning the Artemis lunar program to make it more efficient

“Is any of this written in stone? No,” a source told Ars.

Additionally, substantive changes will need to be worked through the White House Office of Management and Budget, and negotiated with Congress, which funds NASA.

Previously, Trump has announced that entrepreneur and commercial astronaut Jared Isaacman will be nominated to serve as NASA Administrator. Although he has been working to create a staff for his administration, Isaacman has not been involved in the transition team discussions, sources said. Rather, after he is confirmed, Isaacman is likely to be given authority to review major programs at the space agency “at the speed of light.”

How might NASA change under Trump? Here’s what is being discussed Read More »

why-ai-language-models-choke-on-too-much-text

Why AI language models choke on too much text


Compute costs scale with the square of the input size. That’s not great.

Credit: Aurich Lawson | Getty Images

Large language models represent text using tokens, each of which is a few characters. Short words are represented by a single token (like “the” or “it”), whereas larger words may be represented by several tokens (GPT-4o represents “indivisible” with “ind,” “iv,” and “isible”).

When OpenAI released ChatGPT two years ago, it had a memory—known as a context window—of just 8,192 tokens. That works out to roughly 6,000 words of text. This meant that if you fed it more than about 15 pages of text, it would “forget” information from the beginning of its context. This limited the size and complexity of tasks ChatGPT could handle.

Today’s LLMs are far more capable:

  • OpenAI’s GPT-4o can handle 128,000 tokens (about 200 pages of text).
  • Anthropic’s Claude 3.5 Sonnet can accept 200,000 tokens (about 300 pages of text).
  • Google’s Gemini 1.5 Pro allows 2 million tokens (about 2,000 pages of text).

Still, it’s going to take a lot more progress if we want AI systems with human-level cognitive abilities.

Many people envision a future where AI systems are able to do many—perhaps most—of the jobs performed by humans. Yet many human workers read and hear hundreds of millions of words during our working years—and we absorb even more information from sights, sounds, and smells in the world around us. To achieve human-level intelligence, AI systems will need the capacity to absorb similar quantities of information.

Right now the most popular way to build an LLM-based system to handle large amounts of information is called retrieval-augmented generation (RAG). These systems try to find documents relevant to a user’s query and then insert the most relevant documents into an LLM’s context window.

This sometimes works better than a conventional search engine, but today’s RAG systems leave a lot to be desired. They only produce good results if the system puts the most relevant documents into the LLM’s context. But the mechanism used to find those documents—often, searching in a vector database—is not very sophisticated. If the user asks a complicated or confusing question, there’s a good chance the RAG system will retrieve the wrong documents and the chatbot will return the wrong answer.

And RAG doesn’t enable an LLM to reason in more sophisticated ways over large numbers of documents:

  • A lawyer might want an AI system to review and summarize hundreds of thousands of emails.
  • An engineer might want an AI system to analyze thousands of hours of camera footage from a factory floor.
  • A medical researcher might want an AI system to identify trends in tens of thousands of patient records.

Each of these tasks could easily require more than 2 million tokens of context. Moreover, we’re not going to want our AI systems to start with a clean slate after doing one of these jobs. We will want them to gain experience over time, just like human workers do.

Superhuman memory and stamina have long been key selling points for computers. We’re not going to want to give them up in the AI age. Yet today’s LLMs are distinctly subhuman in their ability to absorb and understand large quantities of information.

It’s true, of course, that LLMs absorb superhuman quantities of information at training time. The latest AI models have been trained on trillions of tokens—far more than any human will read or hear. But a lot of valuable information is proprietary, time-sensitive, or otherwise not available for training.

So we’re going to want AI models to read and remember far more than 2 million tokens at inference time. And that won’t be easy.

The key innovation behind transformer-based LLMs is attention, a mathematical operation that allows a model to “think about” previous tokens. (Check out our LLM explainer if you want a detailed explanation of how this works.) Before an LLM generates a new token, it performs an attention operation that compares the latest token to every previous token. This means that conventional LLMs get less and less efficient as the context grows.

Lots of people are working on ways to solve this problem—I’ll discuss some of them later in this article. But first I should explain how we ended up with such an unwieldy architecture.

The “brains” of personal computers are central processing units (CPUs). Traditionally, chipmakers made CPUs faster by increasing the frequency of the clock that acts as its heartbeat. But in the early 2000s, overheating forced chipmakers to mostly abandon this technique.

Chipmakers started making CPUs that could execute more than one instruction at a time. But they were held back by a programming paradigm that requires instructions to mostly be executed in order.

A new architecture was needed to take full advantage of Moore’s Law. Enter Nvidia.

In 1999, Nvidia started selling graphics processing units (GPUs) to speed up the rendering of three-dimensional games like Quake III Arena. The job of these PC add-on cards was to rapidly draw thousands of triangles that made up walls, weapons, monsters, and other objects in a game.

This is not a sequential programming task: triangles in different areas of the screen can be drawn in any order. So rather than having a single processor that executed instructions one at a time, Nvidia’s first GPU had a dozen specialized cores—effectively tiny CPUs—that worked in parallel to paint a scene.

Over time, Moore’s Law enabled Nvidia to make GPUs with tens, hundreds, and eventually thousands of computing cores. People started to realize that the massive parallel computing power of GPUs could be used for applications unrelated to video games.

In 2012, three University of Toronto computer scientists—Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton—used a pair of Nvidia GTX 580 GPUs to train a neural network for recognizing images. The massive computing power of those GPUs, which had 512 cores each, allowed them to train a network with a then-impressive 60 million parameters. They entered ImageNet, an academic competition to classify images into one of 1,000 categories, and set a new record for accuracy in image recognition.

Before long, researchers were applying similar techniques to a wide variety of domains, including natural language.

RNNs worked fairly well on short sentences, but they struggled with longer ones—to say nothing of paragraphs or longer passages. When reasoning about a long sentence, an RNN would sometimes “forget about” an important word early in the sentence. In 2014, computer scientists Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio discovered they could improve the performance of a recurrent neural network by adding an attention mechanism that allowed the network to “look back” at earlier words in a sentence.

In 2017, Google published “Attention Is All You Need,” one of the most important papers in the history of machine learning. Building on the work of Bahdanau and his colleagues, Google researchers dispensed with the RNN and its hidden states. Instead, Google’s model used an attention mechanism to scan previous words for relevant context.

This new architecture, which Google called the transformer, proved hugely consequential because it eliminated a serious bottleneck to scaling language models.

Here’s an animation illustrating why RNNs didn’t scale well:

This hypothetical RNN tries to predict the next word in a sentence, with the prediction shown in the top row of the diagram. This network has three layers, each represented by a rectangle. It is inherently linear: it has to complete its analysis of the first word, “How,” before passing the hidden state back to the bottom layer so the network can start to analyze the second word, “are.”

This constraint wasn’t a big deal when machine learning algorithms ran on CPUs. But when people started leveraging the parallel computing power of GPUs, the linear architecture of RNNs became a serious obstacle.

The transformer removed this bottleneck by allowing the network to “think about” all the words in its input at the same time:

The transformer-based model shown here does roughly as many computations as the RNN in the previous diagram. So it might not run any faster on a (single-core) CPU. But because the model doesn’t need to finish with “How” before starting on “are,” “you,” or “doing,” it can work on all of these words simultaneously. So it can run a lot faster on a GPU with many parallel execution units.

How much faster? The potential speed-up is proportional to the number of input words. My animations depict a four-word input that makes the transformer model about four times faster than the RNN. Real LLMs can have inputs thousands of words long. So, with a sufficiently beefy GPU, transformer-based models can be orders of magnitude faster than otherwise similar RNNs.

In short, the transformer unlocked the full processing power of GPUs and catalyzed rapid increases in the scale of language models. Leading LLMs grew from hundreds of millions of parameters in 2018 to hundreds of billions of parameters by 2020. Classic RNN-based models could not have grown that large because their linear architecture prevented them from being trained efficiently on a GPU.

See all those diagonal arrows between the layers? They represent the operation of the attention mechanism. Before a transformer-based language model generates a new token, it “thinks about” every previous token to find the ones that are most relevant.

Each of these comparisons is cheap, computationally speaking. For small contexts—10, 100, or even 1,000 tokens—they are not a big deal. But the computational cost of attention grows relentlessly with the number of preceding tokens. The longer the context gets, the more attention operations (and therefore computing power) are needed to generate the next token.

This means that the total computing power required for attention grows quadratically with the total number of tokens. Suppose a 10-token prompt requires 414,720 attention operations. Then:

  • Processing a 100-token prompt will require 45.6 million attention operations.
  • Processing a 1,000-token prompt will require 4.6 billion attention operations.
  • Processing a 10,000-token prompt will require 460 billion attention operations.

This is probably why Google charges twice as much, per token, for Gemini 1.5 Pro once the context gets longer than 128,000 tokens. Generating token number 128,001 requires comparisons with all 128,000 previous tokens, making it significantly more expensive than producing the first or 10th or 100th token.

A lot of effort has been put into optimizing attention. One line of research has tried to squeeze maximum efficiency out of individual GPUs.

As we saw earlier, a modern GPU contains thousands of execution units. Before a GPU can start doing math, it must move data from slow shared memory (called high-bandwidth memory) to much faster memory inside a particular execution unit (called SRAM). Sometimes GPUs spend more time moving data around than performing calculations.

In a series of papers, Princeton computer scientist Tri Dao and several collaborators have developed FlashAttention, which calculates attention in a way that minimizes the number of these slow memory operations. Work like Dao’s has dramatically improved the performance of transformers on modern GPUs.

Another line of research has focused on efficiently scaling attention across multiple GPUs. One widely cited paper describes ring attention, which divides input tokens into blocks and assigns each block to a different GPU. It’s called ring attention because GPUs are organized into a conceptual ring, with each GPU passing data to its neighbor.

I once attended a ballroom dancing class where couples stood in a ring around the edge of the room. After each dance, women would stay where they were while men would rotate to the next woman. Over time, every man got a chance to dance with every woman. Ring attention works on the same principle. The “women” are query vectors (describing what each token is “looking for”) and the “men” are key vectors (describing the characteristics each token has). As the key vectors rotate through a sequence of GPUs, they get multiplied by every query vector in turn.

In short, ring attention distributes attention calculations across multiple GPUs, making it possible for LLMs to have larger context windows. But it doesn’t make individual attention calculations any cheaper.

The fixed-size hidden state of an RNN means that it doesn’t have the same scaling problems as a transformer. An RNN requires about the same amount of computing power to produce its first, hundredth and millionth token. That’s a big advantage over attention-based models.

Although RNNs have fallen out of favor since the invention of the transformer, people have continued trying to develop RNNs suitable for training on modern GPUs.

In April, Google announced a new model called Infini-attention. It’s kind of a hybrid between a transformer and an RNN. Infini-attention handles recent tokens like a normal transformer, remembering them and recalling them using an attention mechanism.

However, Infini-attention doesn’t try to remember every token in a model’s context. Instead, it stores older tokens in a “compressive memory” that works something like the hidden state of an RNN. This data structure can perfectly store and recall a few tokens, but as the number of tokens grows, its recall becomes lossier.

Machine learning YouTuber Yannic Kilcher wasn’t too impressed by Google’s approach.

“I’m super open to believing that this actually does work and this is the way to go for infinite attention, but I’m very skeptical,” Kilcher said. “It uses this compressive memory approach where you just store as you go along, you don’t really learn how to store, you just store in a deterministic fashion, which also means you have very little control over what you store and how you store it.”

Perhaps the most notable effort to resurrect RNNs is Mamba, an architecture that was announced in a December 2023 paper. It was developed by computer scientists Dao (who also did the FlashAttention work I mentioned earlier) and Albert Gu.

Mamba does not use attention. Like other RNNs, it has a hidden state that acts as the model’s “memory.” Because the hidden state has a fixed size, longer prompts do not increase Mamba’s per-token cost.

When I started writing this article in March, my goal was to explain Mamba’s architecture in some detail. But then in May, the researchers released Mamba-2, which significantly changed the architecture from the original Mamba paper. I’ll be frank: I struggled to understand the original Mamba and have not figured out how Mamba-2 works.

But the key thing to understand is that Mamba has the potential to combine transformer-like performance with the efficiency of conventional RNNs.

In June, Dao and Gu co-authored a paper with Nvidia researchers that evaluated a Mamba model with 8 billion parameters. They found that models like Mamba were competitive with comparably sized transformers in a number of tasks, but they “lag behind Transformer models when it comes to in-context learning and recalling information from the context.”

Transformers are good at information recall because they “remember” every token of their context—this is also why they become less efficient as the context grows. In contrast, Mamba tries to compress the context into a fixed-size state, which necessarily means discarding some information from long contexts.

The Nvidia team found they got the best performance from a hybrid architecture that interleaved 24 Mamba layers with four attention layers. This worked better than either a pure transformer model or a pure Mamba model.

A model needs some attention layers so it can remember important details from early in its context. But a few attention layers seem to be sufficient; the rest of the attention layers can be replaced by cheaper Mamba layers with little impact on the model’s overall performance.

In August, an Israeli startup called AI21 announced its Jamba 1.5 family of models. The largest version had 398 billion parameters, making it comparable in size to Meta’s Llama 405B model. Jamba 1.5 Large has seven times more Mamba layers than attention layers. As a result, Jamba 1.5 Large requires far less memory than comparable models from Meta and others. For example, AI21 estimates that Llama 3.1 70B needs 80GB of memory to keep track of 256,000 tokens of context. Jamba 1.5 Large only needs 9GB, allowing the model to run on much less powerful hardware.

The Jamba 1.5 Large model gets an MMLU score of 80, significantly below the Llama 3.1 70B’s score of 86. So by this measure, Mamba doesn’t blow transformers out of the water. However, this may not be an apples-to-apples comparison. Frontier labs like Meta have invested heavily in training data and post-training infrastructure to squeeze a few more percentage points of performance out of benchmarks like MMLU. It’s possible that the same kind of intense optimization could close the gap between Jamba and frontier models.

So while the benefits of longer context windows is obvious, the best strategy to get there is not. In the short term, AI companies may continue using clever efficiency and scaling hacks (like FlashAttention and Ring Attention) to scale up vanilla LLMs. Longer term, we may see growing interest in Mamba and perhaps other attention-free architectures. Or maybe someone will come up with a totally new architecture that renders transformers obsolete.

But I am pretty confident that scaling up transformer-based frontier models isn’t going to be a solution on its own. If we want models that can handle billions of tokens—and many people do—we’re going to need to think outside the box.

Tim Lee was on staff at Ars from 2017 to 2021. Last year, he launched a newsletter, Understanding AI, that explores how AI works and how it’s changing our world. You can subscribe here.

Photo of Timothy B. Lee

Timothy is a senior reporter covering tech policy and the future of transportation. He lives in Washington DC.

Why AI language models choke on too much text Read More »

openai-announces-o3-and-o3-mini,-its-next-simulated-reasoning-models

OpenAI announces o3 and o3-mini, its next simulated reasoning models

On Friday, during Day 12 of its “12 days of OpenAI,” OpenAI CEO Sam Altman announced its latest AI “reasoning” models, o3 and o3-mini, which build upon the o1 models launched earlier this year. The company is not releasing them yet but will make these models available for public safety testing and research access today.

The models use what OpenAI calls “private chain of thought,” where the model pauses to examine its internal dialog and plan ahead before responding, which you might call “simulated reasoning” (SR)—a form of AI that goes beyond basic large language models (LLMs).

The company named the model family “o3” instead of “o2” to avoid potential trademark conflicts with British telecom provider O2, according to The Information. During Friday’s livestream, Altman acknowledged his company’s naming foibles, saying, “In the grand tradition of OpenAI being really, truly bad at names, it’ll be called o3.”

According to OpenAI, the o3 model earned a record-breaking score on the ARC-AGI benchmark, a visual reasoning benchmark that has gone unbeaten since its creation in 2019. In low-compute scenarios, o3 scored 75.7 percent, while in high-compute testing, it reached 87.5 percent—comparable to human performance at an 85 percent threshold.

OpenAI also reported that o3 scored 96.7 percent on the 2024 American Invitational Mathematics Exam, missing just one question. The model also reached 87.7 percent on GPQA Diamond, which contains graduate-level biology, physics, and chemistry questions. On the Frontier Math benchmark by EpochAI, o3 solved 25.2 percent of problems, while no other model has exceeded 2 percent.

OpenAI announces o3 and o3-mini, its next simulated reasoning models Read More »

new-aa-powered-airtag-case-promises-10-year-lifespan

New AA-powered AirTag case promises 10-year lifespan

On Wednesday, Elevation Lab announced TimeCapsule, a new $20 battery case purported to extend Apple AirTag battery life from one year to 10 years. The product replaces the standard CR2032 coin cell battery in the Bluetooth-based location tracker with two AA batteries to provide extended power capacity.

The TimeCapsule case requires users to remove their AirTag’s original back plate and battery, then place the Apple device onto contact points inside the waterproof enclosure. The company recommends using Energizer Ultimate Lithium AA batteries, which it claims provide 14 times more power capacity than the stock coin cell battery configuration.

The CNC-machined aluminum case is aimed at users who place AirTags in vehicles, boats, or other applications where regular battery changes prove impractical. The company sells the TimeCapsule through its website and Amazon.

A photo of the TimeCapsule, which purportedly extends AirTag battery life with AA batteries.

A photo of the TimeCapsule, which purportedly extends AirTag battery life with AA batteries. Credit: Elevation Lab

As related on the TimeCapsule’s product page, the add-on case reportedly emerged after its inventor lost camera equipment to theft, discovering their AirTag had died months earlier due to a depleted battery. This experience led to the development of a longer-lasting power solution for the tracking devices.

New AA-powered AirTag case promises 10-year lifespan Read More »

as-firms-abandon-vmware,-broadcom-is-laughing-all-the-way-to-the-bank

As firms abandon VMware, Broadcom is laughing all the way to the bank

2025 challenges

Broadcom seemed okay with ending business with Ingram, which ties it to solution providers that may be supporting smaller firms. At the same time, Broadcom has shown willingness to fight for the business of large accounts.

For example, this month it settled an increasingly nasty dispute in which AT&T sued Broadcom for allegedly breaking a contract to provide perpetual license support. Broadcom infamously stopped VMware perpetual license sales, in favor of subscriptions, in December 2023.

Broadcom is also paying close attention to VMware’s biggest accounts, taking over 500 of those biggest accounts directly, thereby barring channel partners from deals.

Broadcom originally planned to take VMware’s biggest 2,000 accounts direct. But as Canalys chief analyst Alastair Edwards put it, letting 1,500 of the biggest accounts be run by channel partners helps tie professional services to VMware products, making migrations harder.

However, the VMware channel is under turmoil, having undergone numerous business-impacting changes over the past year, including Broadcom killing the VMware partner program in favor of its own, while announcing that there will be a new VMware channel chief, as CRN reported. Some of the resellers that could help VMware keep customers are showing frustration with the changes and what the characterize as poor communication from Broadcom.

“Broadcom has abandoned the channel market by making it nearly impossible to work with them due to constantly changing requirements, packaging and changes to the program,” Jason Slagle, president of Toledo-based managed services provider and VMware partner CNWR, told CRN today.

Meanwhile, Forrester analysts Michele Pelino and Naveen Chhabra predict that next year, “VMware’s largest 2,000 customers will shrink their deployment size by an average of 40 percent,” in favor of “public cloud, on-premises alternatives, and new architecture.”

Still, “Broadcom’s price increases and cost-cutting measures are expected to boost its net profits, as there are not many credible competitors capable of helping clients replace VMware virtualization,” the Forrester analysts said.

So although Broadcom is challenged to maintain business from VMware’s biggest accounts and appease the solution providers driving smaller accounts, it’s expected to keep making money off of VMware—even as firms like Ingram close the door on it.

As firms abandon VMware, Broadcom is laughing all the way to the bank Read More »

new-physics-sim-trains-robots-430,000-times-faster-than-reality

New physics sim trains robots 430,000 times faster than reality

The AI-generated worlds reportedly include realistic physics, camera movements, and object behaviors, all from text commands. The system then creates physically accurate ray-traced videos and data that robots can use for training.

Examples of “4D dynamical and physical” worlds that Genesis created from text prompts.

This prompt-based system lets researchers create complex robot testing environments by typing natural language commands instead of programming them by hand. “Traditionally, simulators require a huge amount of manual effort from artists: 3D assets, textures, scene layouts, etc. But every component in the workflow can be automated,” wrote Fan.

Using its engine, Genesis can also generate character motion, interactive 3D scenes, facial animation, and more, which may allow for the creation of artistic assets for creative projects, but may also lead to more realistic AI-generated games and videos in the future, constructing a simulated world in data instead of operating on the statistical appearance of pixels as with a video synthesis diffusion model.

Examples of character motion generation from Genesis, using a prompt that includes, “A miniature Wukong holding a stick in his hand sprints across a table surface for 3 seconds, then jumps into the air, and swings his right arm downward during landing.”

While the generative system isn’t yet part of the currently available code on GitHub, the team plans to release it in the future.

Training tomorrow’s robots today (using Python)

Genesis remains under active development on GitHub, where the team accepts community contributions.

The platform stands out from other 3D world simulators for robotic training by using Python for both its user interface and core physics engine. Other engines use C++ or CUDA for their underlying calculations while wrapping them in Python APIs. Genesis takes a Python-first approach.

Notably, the non-proprietary nature of the Genesis platform makes high-speed robot training simulations available to any researcher for free through simple Python commands that work on regular computers with off-the-shelf hardware.

Previously, running robot simulations required complex programming and specialized hardware, says Fan in his post announcing Genesis, and that shouldn’t be the case. “Robotics should be a moonshot initiative owned by all of humanity,” he wrote.

New physics sim trains robots 430,000 times faster than reality Read More »

stalker-2-has-been-enjoyable-jank,-but-it’s-also-getting-rapidly-fixed

Stalker 2 has been enjoyable jank, but it’s also getting rapidly fixed

When the impossibly punctuated S.T.A.L.K.E.R. 2: Heart of Chernobyl released on November 20, after many delays (that included the Russian invasion of the developer’s native Ukraine), it seemed like it could have used even more delays.

Stalker 2 had big performance issues and game-breaking bugs, along with balance and difficulty spike issues. Some things that seem “wrong” in the game are just going to stay that way. The first-person survival/shooter series has always had a certain wobbly, wild feel to it. This expresses itself in both the game world, where a major villain can off themselves by walking through a window, and in the tech stack, where broken save games, DIY optimization, and other unmet needs have created thriving mod scenes.

Developer GSC Game World has been steadfastly patching the game since its release, and the latest one should nudge the needle a bit from “busted” to “charmingly wonky.” Amid the “Over 1,800 fixes and adjustments” in Patch 1.1, the big changes are to “A-Life.” In porting Stalker 2 to Unreal Engine 5, the developer faced a challenge in getting this global AI management system working, but it’s showing its weird self again.

A-Life, as detailed by Destructoid, is the idea that “the characters in the game live their own lives and exist all the time, not only when they are in the player’s field of view.” In a certain radius around the player, events happen “online,” in real time, such that you could stumble upon them. Farther out, things are still happening, and non-player characters (NPCs) are trekking about, but on an “offline,” almost turn-based, less resource-intensive schedule. Modders have had quite a bit of fun tweaking A-life in prior versions of Stalker 2.

With the latest patch, the weirdly engaging feel that the world goes on without you returns. There will be more NPCs visible, NPCs out of range will pursue their “goals,” and a more diverse range of human factions, mutants, and weirdos will exist. Perhaps most intriguingly, an issue where “Human NPCs didn’t satisfy their communication needs and talks” is fixed. If only that could be patched for most of us player characters here in the real world.

Stalker 2 has been enjoyable jank, but it’s also getting rapidly fixed Read More »