Author name: Mike M.

health-care-giant-ascension-says-5.6-million-patients-affected-in-cyberattack

Health care giant Ascension says 5.6 million patients affected in cyberattack

Health care company Ascension lost sensitive data for nearly 5.6 million individuals in a cyberattack that was attributed to a notorious ransomware gang, according to documents filed with the attorney general of Maine.

Ascension owns 140 hospitals and scores of assisted living facilities. In May, the organization was hit with an attack that caused mass disruptions as staff were forced to fall back on manual processes, leading to errors, delayed or lost lab results, and the diversion of ambulances to other hospitals. Ascension managed to restore most services by mid-June. At the time, the company said the attackers had stolen protected health information and personally identifiable information for an undisclosed number of people.

Investigation concluded

A filing Ascension made earlier in December revealed that nearly 5.6 million people were affected by the breach. Data stolen depended on the particular person but included individuals’ names and medical information (e.g., medical record numbers, dates of service, types of lab tests, or procedure codes), payment information (e.g., credit card information or bank account numbers), insurance information (e.g., Medicaid/Medicare ID, policy number, or insurance claim), government identification (e.g., Social Security numbers, tax identification numbers, driver’s license numbers, or passport numbers), and other personal information (such as date of birth or address).

Health care giant Ascension says 5.6 million patients affected in cyberattack Read More »

how-the-worlds-of-dune:-prophecy-got-their-distinctive-looks

How the worlds of Dune: Prophecy got their distinctive looks


a peek behind the curtain

Ars chats with Dune: Prophecy lead cinematographer Pierre Gill about color palettes, lighting, and other challenges.

Credit: Attila Szvacsek/HBO

Director Denis Villeneuve’s stunning two-part film adaptation of Frank Herbert’s Dune has received many well-deserved accolades—with Dune: Part 2 being crowned Ars Technica’s top movie of 2024. The films also spawned a lavish HBO spinoff TV series, Dune: Prophecy, just renewed for a second season right before a momentous season finale.

(Some spoilers below for S1 of Dune: Prophecy, but no major plot reveals.)

Dune: Prophecy is a prequel series inspired by the novel Sisterhood of Dune, written by Brian Herbert and Kevin J. Anderson, exploring the origins of the Bene Gesserit. It’s set 10,000 years before the ascension of Paul Atreides and follows two Harkonnen sisters as they combat forces that threaten the future of humankind, establishing the fabled sect that will become the Bene Gesserit in the process.

Emily Watson stars as Mother Superior Valya Harkonnen, who leads the Sisterhood and has a close ally in her sister, Reverend Mother Tula Harkonnen. They have built up a network of Sisters serving the rulers of various worlds as “Truthsayers,” including Princess Ynez (Sarah-Sofie Boussnina), heir to the throne of her father, Imperium Emperor Javicco Corrino (Mark Strong).

Valya’s master plan to crown a Sister as head of the Imperium hits a snag, however, with the arrival of a mysterious soldier named Desmond Hart (Travis Fimmel), who claims he survived being swallowed by a sandworm while fighting on Arrakis. Hart has a mysterious ability to literally burn people to death from the inside out with his mind, and he so impresses the Emperor that Hart replaces Valya as key advisor. So begin several episodes of political intrigue as secrets from the past come to light, culminating in an action-packed finale that bids farewell to a couple of key characters.

All of this takes place against a stunning visual backdrop that is reminiscent of Villeneuve’s two films but also sets the series apart, as befits a prequel. One of the people responsible for that is lead cinematographer Pierre Gill, who created distinctive looks and color palettes for the various worlds, planets, and environments, as well as tackling the lighting challenges posed by the massive built sets. Ars caught up with Gill to learn more.

Ars Technica: You also did some work on the film Dune: Part 1. How was it different working on a TV series set in the same sweeping fictional world?

Pierre Gill: It’s a different game, a different schedule, and it’s also a very different approach because the scenes are different. There’s not so many subplots. But it’s still the same scope. We had as many sets and studios and everything. So it’s a big challenge to be able to keep the style, make it look good, light the actors. It’s part of the reality of the [director of photography’s] decision-making. You have ideas in your head of what you want to do, you have a dream, but it has to be feasible, realistic. So then you make compromises and figure out the best way to keep that style. There’s also multiple directors, there’s a showrunner. So the decision-making is less centralized.

Ars Technica: How did you go about setting the series apart from Villeneuve’s films, especially since it’s a prequel?

Pierre Gill: It’s set 10,000 years before, so it could have been extremely different. But it’s not a good idea to do that. First, the audience wants to see Dune because they love Denis Villeneuve’s movie. Second, it’s a complex story and it’s better not to get lost into something. It was not a good idea to do that in our mind. So we stayed not far from the movie so the audience can just sit down and follow the story points. Of course, some people always complain, but most are just happy to follow the story. So I think we made the right choice.

Ars Technica: Despite the epic scope of the series, you were able to shoot as much as 75 percent of the footage in-camera. That’s quite a feat.

Pierre Gill: There’s a lot of VFX of course, but because most of the sets were so high, so big, the camera was filming people or the throne room—which is gigantic—it’s almost always in camera. For the big wide shots, there’s a set extension that is painted. So these huge sets, the Sisterhood, the library and everything, when you see all these girls wandering around in that complex of Wallach IX, that compound is pretty much on camera.

A lot of VFX is making these gorgeous shots of the world, spaceships coming down, seeing something outside the window sometimes, and then the exterior of Wallach IX, which is two big towers and a rock facade. Of course there’s the little lizard, the thinking machine, that was VFX. But otherwise it was very, very in-camera, which makes your life easier in editing and shooting—although it doesn’t make my life easier with the lighting, which would be much easier with blue screen.

Ars Technica: Tell us about the massive chandeliers you built to introduce particle light, adding extra character to the interiors.

Pierre Gill: The sets were quite monochromatic. You have Salusa Secundus, the emperor world, which is a very sandy color, very beige. And then you have Wallach IX, which is very gray. We decided to light one of the worlds in a cold way, the other world in warmer tones. I was trying as much as I could to put very harsh sunlight into the Salusa Secundus world.

Again, the sets were very big. So I asked the production designer Tom Meyer to build me some practical lighting somewhere. There was not much in the set for me to use for night, which is a bit of a problem because he kept the mood of Dune. On a film you have three to four hours to light a scene. I was not able to do that. I needed to have practical light that is actually lighting something. So for example, in the throne room, he wanted to have glass balls. There’s three glass balls, they’re gorgeous.

I told Thomas, “But these glass balls, the problem for me is the light behind my head is going to blow away. I would love this to light the wall.” So I got my practical team, a bunch of guys who are just in charge of LEDs on set. We found an LED source that goes inside; you can dim it down and up. But behind the balls, we added another pack of LED lights that are hidden. So you have the light source and just behind it you have this extra lighting. From the camera you never see it but it was lighting the wall. And then I got them to build a very long teardrop. I again got them to build multiple layers of LEDs that were on a board that was a certain power, certain color. I was able to make them cold or warm and to change them a little bit and use them as a source. It became part of the visual style.

Ars Technica: I appreciated that Dune: Prophecy seems to buck the trend toward very, very dark night scenes that end up being nearly unwatchable for people at home.

Pierre Gill: I don’t really like when it’s pitch black and dark. I don’t understand. I don’t think it gives anything. For me, night is more figuring out silhouettes. Let’s try to feel the character, shape your character on the wall, and when you get in a close-up you get a light in his eyes or something. I like to define a room and on these big sets to do moonlight would make no sense. The throne room is gigantic, but at the end it’s just an empty place. So you’re lighting what? Just the floor. It’s not very interesting. So what I’ve done for the night in the throne room, I asked VFX, what’s the concept of the exterior? It was all work in progress. We had some artwork concept work, but the lighting, nobody really knew.

So I said, okay, so we know there’s lights, so I’m going to put orange lights from below. I’m not lighting actors, I’m not lighting anything. But when you look at windows, you can feel that the light is coming from the bottom and it creates a few shadows. When you see outside now, they put all these lights in the palace, like you would light a nice, beautiful, gorgeous big house. You light everything from under.

Ars Technica: What were some particularly challenging scenes in terms of the cinematography?

Pierre Gill: The prison was a huge challenge. It was built on a set on location in downtown Budapest, and it’s a circular room. It’s where they put Desmond and suspended him in a jail cell. There was a floor and I had one foot to light. So that was complicated. Another challenge was the exterior of the Sisterhood: a big circular room outside. It was also a location and we could not access behind with cranes, so I could not control anything, and it was very dangerous. I could not light from the ceiling from the top of this coliseum. We built a gigantic tarp on top of it. So I was closing and opening and diffusing the sun. That was very Hollywood-esque.

Ars Technica: Was there a particular scene you were especially pleased with how it turned out?

Pierre Gill: In the first episode, there’s a scene with the young sisters chanting around a gorgeous golden bowl. The director, Anna Foerster, wanted to see the waves of the singing and all these frequencies. I was like, “Well, that’s a lighting gag. You don’t see any wave if you cannot light in reflection.” I knew I wouldn’t have time to do something so technical. She wanted to “do a pull-up” for the scene: starting loose up on the bowl and then moving up and out. Technically it’s complicated.

So I had a big rig that I created around the camera with soft lighting that could reflect. And I asked our department, when they built that bowl, “Could you build with a light inside, like a waterproof light, an LED? I’ll put it on my board and maybe it’s going to work. I’m not sure if it’s going to really light the water properly.” I was ready with a plan B, but they brought the bowl, they started the frequency thing, and it was gorgeous. So I didn’t have to use my plan B lighting. That was very, very nice.


Models helped with the staging and lighting of different scenes. Credit: Pierre Gill/HBO

Ars Technica: The show has been renewed for a second season and one assumes you’ll be involved. What will be different for you going into S2?

Pierre Gill: I’m very proud because I’m part of building that, a part of the creative team. I really hope I can do it. I hope my schedule will allow it. I want to be part of this for sure, because S1 was a lot of engineering, meaning it’s so big, you have to figure out stuff all the time. Now it’s done, it’s built and we know what we like. We know what we don’t like. We know what works. So S2 for me, will be a lot of fun, much more creative, meaning I’m going to be able to do much more interesting lighting. I’m going to go deeper into the thing because I know how this beast is working now.

All episodes of Dune: Prophecy’s first season are now available for streaming on Max.


Jennifer is a senior reporter at Ars Technica with a particular focus on where science meets culture, covering everything from physics and related interdisciplinary topics to her favorite films and TV series. Jennifer lives in Baltimore with her spouse, physicist Sean M. Carroll, and their two cats, Ariel and Caliban.

How the worlds of Dune: Prophecy got their distinctive looks Read More »

how-might-nasa-change-under-trump?-here’s-what-is-being-discussed

How might NASA change under Trump? Here’s what is being discussed

One source said the space transition team has been working off of ideas that Trump has talked about publicly, including his interest in Mars. For example, during a campaign speech this fall, Trump referenced SpaceX founder Elon Musk, who played a significant role in the campaign in terms of both time and money, and Musk’s desire to settle Mars.

“We are leading in space over Russia and China… It’s my plan, I’ll talk to Elon,” Trump said in September. “Elon get those rocket ships going because we want to reach Mars before the end of my term, and we want also to have great military protection in space.”

Ideas under consideration

The transition team has been discussing possible elements of an executive order or other policy directives. They include:

  • Establishing the goal of sending humans to the Moon and Mars by 2028
  • Canceling the costly Space Launch System rocket and possibly the Orion spacecraft
  • Consolidating Goddard Space Flight Center and Ames Research Center at Marshall Space Flight Center in Alabama
  • Retaining a small administration presence in Washington, DC, but otherwise moving headquarters to a field center
  • Rapidly redesigning the Artemis lunar program to make it more efficient

“Is any of this written in stone? No,” a source told Ars.

Additionally, substantive changes will need to be worked through the White House Office of Management and Budget, and negotiated with Congress, which funds NASA.

Trump previously announced that entrepreneur and commercial astronaut Jared Isaacman will be nominated to serve as NASA Administrator. Although he has been working to create a staff for his administration, Isaacman has not been involved in the transition team discussions, sources said. Rather, after he is confirmed, Isaacman is likely to be given authority to review major programs at the space agency “at the speed of light.”

How might NASA change under Trump? Here’s what is being discussed Read More »

why-ai-language-models-choke-on-too-much-text

Why AI language models choke on too much text


Compute costs scale with the square of the input size. That’s not great.

Credit: Aurich Lawson | Getty Images

Large language models represent text using tokens, each of which is a few characters. Short words are represented by a single token (like “the” or “it”), whereas larger words may be represented by several tokens (GPT-4o represents “indivisible” with “ind,” “iv,” and “isible”).
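To see this in practice, here is a minimal token-counting sketch using OpenAI’s tiktoken library. It assumes a recent tiktoken release that knows about GPT-4o, and the exact splits it prints may differ slightly from the examples above.

```python
# Minimal token-counting sketch. Assumes a recent tiktoken release that
# recognizes GPT-4o; the exact splits printed may differ from the text above.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

for word in ["the", "it", "indivisible"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r} -> {len(ids)} token(s): {pieces}")
```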

When OpenAI released ChatGPT two years ago, it had a memory—known as a context window—of just 8,192 tokens. That works out to roughly 6,000 words of text. This meant that if you fed it more than about 15 pages of text, it would “forget” information from the beginning of its context. This limited the size and complexity of tasks ChatGPT could handle.

Today’s LLMs are far more capable:

  • OpenAI’s GPT-4o can handle 128,000 tokens (about 200 pages of text).
  • Anthropic’s Claude 3.5 Sonnet can accept 200,000 tokens (about 300 pages of text).
  • Google’s Gemini 1.5 Pro allows 2 million tokens (about 2,000 pages of text).

Still, it’s going to take a lot more progress if we want AI systems with human-level cognitive abilities.

Many people envision a future where AI systems are able to do many—perhaps most—of the jobs performed by humans. Yet human workers read and hear hundreds of millions of words during their working years—and they absorb even more information from sights, sounds, and smells in the world around them. To achieve human-level intelligence, AI systems will need the capacity to absorb similar quantities of information.

Right now the most popular way to build an LLM-based system to handle large amounts of information is called retrieval-augmented generation (RAG). These systems try to find documents relevant to a user’s query and then insert the most relevant documents into an LLM’s context window.

This sometimes works better than a conventional search engine, but today’s RAG systems leave a lot to be desired. They only produce good results if the system puts the most relevant documents into the LLM’s context. But the mechanism used to find those documents—often, searching in a vector database—is not very sophisticated. If the user asks a complicated or confusing question, there’s a good chance the RAG system will retrieve the wrong documents and the chatbot will return the wrong answer.
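For concreteness, here is a minimal sketch of that retrieval step. The embed() and ask_llm() functions are hypothetical placeholders standing in for a real embedding model and a real LLM call; any production system would also use a proper vector database instead of a brute-force scan.

```python
# Minimal sketch of the retrieval step in retrieval-augmented generation.
# `embed` and `ask_llm` are hypothetical placeholders for an embedding model
# and an LLM call; real systems swap in actual APIs and a vector database.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, documents: list[str], embed, top_k: int = 3) -> list[str]:
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def answer(query: str, documents: list[str], embed, ask_llm) -> str:
    # The most relevant documents get stuffed into the LLM's context window.
    context = "\n\n".join(retrieve(query, documents, embed))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return ask_llm(prompt)
```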

And RAG doesn’t enable an LLM to reason in more sophisticated ways over large numbers of documents:

  • A lawyer might want an AI system to review and summarize hundreds of thousands of emails.
  • An engineer might want an AI system to analyze thousands of hours of camera footage from a factory floor.
  • A medical researcher might want an AI system to identify trends in tens of thousands of patient records.

Each of these tasks could easily require more than 2 million tokens of context. Moreover, we’re not going to want our AI systems to start with a clean slate after doing one of these jobs. We will want them to gain experience over time, just like human workers do.

Superhuman memory and stamina have long been key selling points for computers. We’re not going to want to give them up in the AI age. Yet today’s LLMs are distinctly subhuman in their ability to absorb and understand large quantities of information.

It’s true, of course, that LLMs absorb superhuman quantities of information at training time. The latest AI models have been trained on trillions of tokens—far more than any human will read or hear. But a lot of valuable information is proprietary, time-sensitive, or otherwise not available for training.

So we’re going to want AI models to read and remember far more than 2 million tokens at inference time. And that won’t be easy.

The key innovation behind transformer-based LLMs is attention, a mathematical operation that allows a model to “think about” previous tokens. (Check out our LLM explainer if you want a detailed explanation of how this works.) Before an LLM generates a new token, it performs an attention operation that compares the latest token to every previous token. This means that conventional LLMs get less and less efficient as the context grows.
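Here is a stripped-down sketch of that attention operation for a single head, with the learned projections, masking, and multi-head machinery omitted. The point is the n-by-n score matrix: every token is compared with every other token, which is where the growing cost comes from.

```python
# Single-head scaled dot-product attention, stripped of projections and masking.
# The `scores` matrix is (n_tokens x n_tokens): every token attends to every
# token, which is why the cost grows quadratically with context length.
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # shape: (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over tokens
    return weights @ V                                # weighted mix of values

n, d = 1_000, 64
Q = K = V = np.random.randn(n, d)
out = attention(Q, K, V)   # the (n, n) score matrix dominates the work
```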

Lots of people are working on ways to solve this problem—I’ll discuss some of them later in this article. But first I should explain how we ended up with such an unwieldy architecture.

The “brains” of personal computers are central processing units (CPUs). Traditionally, chipmakers made CPUs faster by increasing the frequency of the clock that acts as its heartbeat. But in the early 2000s, overheating forced chipmakers to mostly abandon this technique.

Chipmakers started making CPUs that could execute more than one instruction at a time. But they were held back by a programming paradigm that requires instructions to mostly be executed in order.

A new architecture was needed to take full advantage of Moore’s Law. Enter Nvidia.

In 1999, Nvidia started selling graphics processing units (GPUs) to speed up the rendering of three-dimensional games like Quake III Arena. The job of these PC add-on cards was to rapidly draw thousands of triangles that made up walls, weapons, monsters, and other objects in a game.

This is not a sequential programming task: triangles in different areas of the screen can be drawn in any order. So rather than having a single processor that executed instructions one at a time, Nvidia’s first GPU had a dozen specialized cores—effectively tiny CPUs—that worked in parallel to paint a scene.

Over time, Moore’s Law enabled Nvidia to make GPUs with tens, hundreds, and eventually thousands of computing cores. People started to realize that the massive parallel computing power of GPUs could be used for applications unrelated to video games.

In 2012, three University of Toronto computer scientists—Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton—used a pair of Nvidia GTX 580 GPUs to train a neural network for recognizing images. The massive computing power of those GPUs, which had 512 cores each, allowed them to train a network with a then-impressive 60 million parameters. They entered ImageNet, an academic competition to classify images into one of 1,000 categories, and set a new record for accuracy in image recognition.

Before long, researchers were applying similar techniques to a wide variety of domains, including natural language.

An earlier approach to language used recurrent neural networks (RNNs), which read text one word at a time while maintaining a fixed-size “hidden state” summarizing everything read so far. RNNs worked fairly well on short sentences, but they struggled with longer ones—to say nothing of paragraphs or longer passages. When reasoning about a long sentence, an RNN would sometimes “forget about” an important word early in the sentence. In 2014, computer scientists Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio discovered they could improve the performance of a recurrent neural network by adding an attention mechanism that allowed the network to “look back” at earlier words in a sentence.

In 2017, Google published “Attention Is All You Need,” one of the most important papers in the history of machine learning. Building on the work of Bahdanau and his colleagues, Google researchers dispensed with the RNN and its hidden states. Instead, Google’s model used an attention mechanism to scan previous words for relevant context.

This new architecture, which Google called the transformer, proved hugely consequential because it eliminated a serious bottleneck to scaling language models.

Here’s an animation illustrating why RNNs didn’t scale well:

This hypothetical RNN tries to predict the next word in a sentence, with the prediction shown in the top row of the diagram. This network has three layers, each represented by a rectangle. It is inherently linear: it has to complete its analysis of the first word, “How,” before passing the hidden state back to the bottom layer so the network can start to analyze the second word, “are.”

This constraint wasn’t a big deal when machine learning algorithms ran on CPUs. But when people started leveraging the parallel computing power of GPUs, the linear architecture of RNNs became a serious obstacle.
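A toy version of the loop makes the constraint concrete (a sketch only; real RNNs add gates and learned embeddings). Each word’s computation needs the hidden state produced by the previous word, so the steps cannot be spread across a GPU’s parallel cores, and the hidden state stays the same size no matter how long the input is.

```python
# Toy recurrent network: each step consumes the previous hidden state,
# so words must be processed strictly one at a time.
import numpy as np

def rnn_forward(word_vectors: np.ndarray, W_x: np.ndarray, W_h: np.ndarray) -> np.ndarray:
    hidden = np.zeros(W_h.shape[0])        # fixed-size hidden state ("memory")
    for x in word_vectors:                 # sequential: no parallelism here
        hidden = np.tanh(W_x @ x + W_h @ hidden)
    return hidden                          # same size regardless of input length

d_in, d_hidden, n_words = 32, 64, 4        # e.g., "How are you doing"
words = np.random.randn(n_words, d_in)
W_x = np.random.randn(d_hidden, d_in) * 0.1
W_h = np.random.randn(d_hidden, d_hidden) * 0.1
summary = rnn_forward(words, W_x, W_h)
```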

The transformer removed this bottleneck by allowing the network to “think about” all the words in its input at the same time:

The transformer-based model shown here does roughly as many computations as the RNN in the previous diagram. So it might not run any faster on a (single-core) CPU. But because the model doesn’t need to finish with “How” before starting on “are,” “you,” or “doing,” it can work on all of these words simultaneously. So it can run a lot faster on a GPU with many parallel execution units.

How much faster? The potential speed-up is proportional to the number of input words. My animations depict a four-word input that makes the transformer model about four times faster than the RNN. Real LLMs can have inputs thousands of words long. So, with a sufficiently beefy GPU, transformer-based models can be orders of magnitude faster than otherwise similar RNNs.

In short, the transformer unlocked the full processing power of GPUs and catalyzed rapid increases in the scale of language models. Leading LLMs grew from hundreds of millions of parameters in 2018 to hundreds of billions of parameters by 2020. Classic RNN-based models could not have grown that large because their linear architecture prevented them from being trained efficiently on a GPU.

See all those diagonal arrows between the layers? They represent the operation of the attention mechanism. Before a transformer-based language model generates a new token, it “thinks about” every previous token to find the ones that are most relevant.

Each of these comparisons is cheap, computationally speaking. For small contexts—10, 100, or even 1,000 tokens—they are not a big deal. But the computational cost of attention grows relentlessly with the number of preceding tokens. The longer the context gets, the more attention operations (and therefore computing power) are needed to generate the next token.

This means that the total computing power required for attention grows quadratically with the total number of tokens. Suppose a 10-token prompt requires 414,720 attention operations. Then:

  • Processing a 100-token prompt will require 45.6 million attention operations.
  • Processing a 1,000-token prompt will require 4.6 billion attention operations.
  • Processing a 10,000-token prompt will require 460 billion attention operations.
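A quick back-of-the-envelope shows the pattern. The absolute figures above also fold in model-specific constants, so this sketch only captures the growth rate, not the exact counts:

```python
# If attention cost grows with the square of the context length, every 10x
# increase in length costs roughly 100x more compute.
lengths = [10, 100, 1_000, 10_000]
for prev, curr in zip(lengths, lengths[1:]):
    growth = (curr / prev) ** 2
    print(f"{prev:>6} -> {curr:>6} tokens: ~{growth:,.0f}x more attention work")
```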

This is probably why Google charges twice as much, per token, for Gemini 1.5 Pro once the context gets longer than 128,000 tokens. Generating token number 128,001 requires comparisons with all 128,000 previous tokens, making it significantly more expensive than producing the first or 10th or 100th token.

A lot of effort has been put into optimizing attention. One line of research has tried to squeeze maximum efficiency out of individual GPUs.

As we saw earlier, a modern GPU contains thousands of execution units. Before a GPU can start doing math, it must move data from slow shared memory (called high-bandwidth memory) to much faster memory inside a particular execution unit (called SRAM). Sometimes GPUs spend more time moving data around than performing calculations.

In a series of papers, Princeton computer scientist Tri Dao and several collaborators have developed FlashAttention, which calculates attention in a way that minimizes the number of these slow memory operations. Work like Dao’s has dramatically improved the performance of transformers on modern GPUs.

Another line of research has focused on efficiently scaling attention across multiple GPUs. One widely cited paper describes ring attention, which divides input tokens into blocks and assigns each block to a different GPU. It’s called ring attention because GPUs are organized into a conceptual ring, with each GPU passing data to its neighbor.

I once attended a ballroom dancing class where couples stood in a ring around the edge of the room. After each dance, women would stay where they were while men would rotate to the next woman. Over time, every man got a chance to dance with every woman. Ring attention works on the same principle. The “women” are query vectors (describing what each token is “looking for”) and the “men” are key vectors (describing the characteristics each token has). As the key vectors rotate through a sequence of GPUs, they get multiplied by every query vector in turn.

In short, ring attention distributes attention calculations across multiple GPUs, making it possible for LLMs to have larger context windows. But it doesn’t make individual attention calculations any cheaper.
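Here is a schematic of just that rotation schedule, with four hypothetical GPUs. The real algorithm also streams the value blocks around the ring and merges partial softmax results, which this sketch leaves out.

```python
# Schematic of ring attention's communication pattern only. Query blocks stay
# on their GPU (the "women"); key blocks rotate around the ring (the "men").
# After n_gpus steps, every query block has been paired with every key block.
n_gpus = 4
query_blocks = [f"Q{i}" for i in range(n_gpus)]
key_blocks = [f"K{i}" for i in range(n_gpus)]

for step in range(n_gpus):
    for gpu, (q, k) in enumerate(zip(query_blocks, key_blocks)):
        print(f"step {step}: GPU {gpu} scores {q} against {k}")
    # each GPU passes its key block to its neighbor in the ring
    key_blocks = [key_blocks[-1]] + key_blocks[:-1]
```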

The fixed-size hidden state of an RNN means that it doesn’t have the same scaling problems as a transformer. An RNN requires about the same amount of computing power to produce its first, hundredth and millionth token. That’s a big advantage over attention-based models.

Although RNNs have fallen out of favor since the invention of the transformer, people have continued trying to develop RNNs suitable for training on modern GPUs.

In April, Google announced a new model called Infini-attention. It’s kind of a hybrid between a transformer and an RNN. Infini-attention handles recent tokens like a normal transformer, remembering them and recalling them using an attention mechanism.

However, Infini-attention doesn’t try to remember every token in a model’s context. Instead, it stores older tokens in a “compressive memory” that works something like the hidden state of an RNN. This data structure can perfectly store and recall a few tokens, but as the number of tokens grows, its recall becomes lossier.

Machine learning YouTuber Yannic Kilcher wasn’t too impressed by Google’s approach.

“I’m super open to believing that this actually does work and this is the way to go for infinite attention, but I’m very skeptical,” Kilcher said. “It uses this compressive memory approach where you just store as you go along, you don’t really learn how to store, you just store in a deterministic fashion, which also means you have very little control over what you store and how you store it.”

Perhaps the most notable effort to resurrect RNNs is Mamba, an architecture that was announced in a December 2023 paper. It was developed by computer scientists Dao (who also did the FlashAttention work I mentioned earlier) and Albert Gu.

Mamba does not use attention. Like other RNNs, it has a hidden state that acts as the model’s “memory.” Because the hidden state has a fixed size, longer prompts do not increase Mamba’s per-token cost.

When I started writing this article in March, my goal was to explain Mamba’s architecture in some detail. But then in May, the researchers released Mamba-2, which significantly changed the architecture from the original Mamba paper. I’ll be frank: I struggled to understand the original Mamba and have not figured out how Mamba-2 works.

But the key thing to understand is that Mamba has the potential to combine transformer-like performance with the efficiency of conventional RNNs.

In June, Dao and Gu co-authored a paper with Nvidia researchers that evaluated a Mamba model with 8 billion parameters. They found that models like Mamba were competitive with comparably sized transformers in a number of tasks, but they “lag behind Transformer models when it comes to in-context learning and recalling information from the context.”

Transformers are good at information recall because they “remember” every token of their context—this is also why they become less efficient as the context grows. In contrast, Mamba tries to compress the context into a fixed-size state, which necessarily means discarding some information from long contexts.

The Nvidia team found they got the best performance from a hybrid architecture that interleaved 24 Mamba layers with four attention layers. This worked better than either a pure transformer model or a pure Mamba model.

A model needs some attention layers so it can remember important details from early in its context. But a few attention layers seem to be sufficient; the rest of the attention layers can be replaced by cheaper Mamba layers with little impact on the model’s overall performance.

In August, an Israeli startup called AI21 announced its Jamba 1.5 family of models. The largest version had 398 billion parameters, making it comparable in size to Meta’s Llama 405B model. Jamba 1.5 Large has seven times more Mamba layers than attention layers. As a result, Jamba 1.5 Large requires far less memory than comparable models from Meta and others. For example, AI21 estimates that Llama 3.1 70B needs 80GB of memory to keep track of 256,000 tokens of context. Jamba 1.5 Large only needs 9GB, allowing the model to run on much less powerful hardware.

The Jamba 1.5 Large model gets an MMLU score of 80, significantly below the Llama 3.1 70B’s score of 86. So by this measure, Mamba doesn’t blow transformers out of the water. However, this may not be an apples-to-apples comparison. Frontier labs like Meta have invested heavily in training data and post-training infrastructure to squeeze a few more percentage points of performance out of benchmarks like MMLU. It’s possible that the same kind of intense optimization could close the gap between Jamba and frontier models.

So while the benefits of longer context windows are obvious, the best strategy for getting there is not. In the short term, AI companies may continue using clever efficiency and scaling hacks (like FlashAttention and ring attention) to scale up vanilla LLMs. Longer term, we may see growing interest in Mamba and perhaps other attention-free architectures. Or maybe someone will come up with a totally new architecture that renders transformers obsolete.

But I am pretty confident that scaling up transformer-based frontier models isn’t going to be a solution on its own. If we want models that can handle billions of tokens—and many people do—we’re going to need to think outside the box.

Tim Lee was on staff at Ars from 2017 to 2021. Last year, he launched a newsletter, Understanding AI, that explores how AI works and how it’s changing our world. You can subscribe here.


Timothy is a senior reporter covering tech policy and the future of transportation. He lives in Washington DC.

Why AI language models choke on too much text Read More »

openai-announces-o3-and-o3-mini,-its-next-simulated-reasoning-models

OpenAI announces o3 and o3-mini, its next simulated reasoning models

On Friday, during Day 12 of its “12 days of OpenAI” event, OpenAI CEO Sam Altman announced the company’s latest AI “reasoning” models, o3 and o3-mini, which build upon the o1 models launched earlier this year. The company is not releasing them yet but will make these models available for public safety testing and research access today.

The models use what OpenAI calls “private chain of thought,” where the model pauses to examine its internal dialog and plan ahead before responding, which you might call “simulated reasoning” (SR)—a form of AI that goes beyond basic large language models (LLMs).

The company named the model family “o3” instead of “o2” to avoid potential trademark conflicts with British telecom provider O2, according to The Information. During Friday’s livestream, Altman acknowledged his company’s naming foibles, saying, “In the grand tradition of OpenAI being really, truly bad at names, it’ll be called o3.”

According to OpenAI, the o3 model earned a record-breaking score on the ARC-AGI benchmark, a visual reasoning benchmark that has gone unbeaten since its creation in 2019. In low-compute scenarios, o3 scored 75.7 percent, while in high-compute testing, it reached 87.5 percent—comparable to human performance at an 85 percent threshold.

OpenAI also reported that o3 scored 96.7 percent on the 2024 American Invitational Mathematics Exam, missing just one question. The model also reached 87.7 percent on GPQA Diamond, which contains graduate-level biology, physics, and chemistry questions. On the Frontier Math benchmark by EpochAI, o3 solved 25.2 percent of problems, while no other model has exceeded 2 percent.

OpenAI announces o3 and o3-mini, its next simulated reasoning models Read More »

new-aa-powered-airtag-case-promises-10-year-lifespan

New AA-powered AirTag case promises 10-year lifespan

On Wednesday, Elevation Lab announced TimeCapsule, a new $20 battery case purported to extend Apple AirTag battery life from one year to 10 years. The product replaces the standard CR2032 coin cell battery in the Bluetooth-based location tracker with two AA batteries to provide extended power capacity.

The TimeCapsule case requires users to remove their AirTag’s original back plate and battery, then place the Apple device onto contact points inside the waterproof enclosure. The company recommends using Energizer Ultimate Lithium AA batteries, which it claims provide 14 times more power capacity than the stock coin cell battery configuration.

The CNC-machined aluminum case is aimed at users who place AirTags in vehicles, boats, or other applications where regular battery changes prove impractical. The company sells the TimeCapsule through its website and Amazon.


A photo of the TimeCapsule, which purportedly extends AirTag battery life with AA batteries. Credit: Elevation Lab

As related on the TimeCapsule’s product page, the add-on case reportedly emerged after its inventor lost camera equipment to theft, discovering their AirTag had died months earlier due to a depleted battery. This experience led to the development of a longer-lasting power solution for the tracking devices.

New AA-powered AirTag case promises 10-year lifespan Read More »

as-firms-abandon-vmware,-broadcom-is-laughing-all-the-way-to-the-bank

As firms abandon VMware, Broadcom is laughing all the way to the bank

2025 challenges

Broadcom seemed okay with ending business with Ingram, which ties it to solution providers that may be supporting smaller firms. At the same time, Broadcom has shown willingness to fight for the business of large accounts.

For example, this month it settled an increasingly nasty dispute in which AT&T sued Broadcom for allegedly breaking a contract to provide perpetual license support. Broadcom infamously stopped VMware perpetual license sales, in favor of subscriptions, in December 2023.

Broadcom is also paying close attention to VMware’s biggest accounts, taking the 500 biggest of those accounts direct and thereby barring channel partners from those deals.

Broadcom originally planned to take VMware’s biggest 2,000 accounts direct. But as Canalys chief analyst Alastair Edwards put it, letting 1,500 of the biggest accounts be run by channel partners helps tie professional services to VMware products, making migrations harder.

However, the VMware channel is in turmoil, having undergone numerous business-impacting changes over the past year, including Broadcom killing the VMware partner program in favor of its own while announcing that there will be a new VMware channel chief, as CRN reported. Some of the resellers that could help VMware keep customers are showing frustration with the changes and what they characterize as poor communication from Broadcom.

“Broadcom has abandoned the channel market by making it nearly impossible to work with them due to constantly changing requirements, packaging and changes to the program,” Jason Slagle, president of Toledo-based managed services provider and VMware partner CNWR, told CRN today.

Meanwhile, Forrester analysts Michele Pelino and Naveen Chhabra predict that next year, “VMware’s largest 2,000 customers will shrink their deployment size by an average of 40 percent,” in favor of “public cloud, on-premises alternatives, and new architecture.”

Still, “Broadcom’s price increases and cost-cutting measures are expected to boost its net profits, as there are not many credible competitors capable of helping clients replace VMware virtualization,” the Forrester analysts said.

So although Broadcom is challenged to maintain business from VMware’s biggest accounts and appease the solution providers driving smaller accounts, it’s expected to keep making money off of VMware—even as firms like Ingram close the door on it.

As firms abandon VMware, Broadcom is laughing all the way to the bank Read More »

new-physics-sim-trains-robots-430,000-times-faster-than-reality

New physics sim trains robots 430,000 times faster than reality

The AI-generated worlds reportedly include realistic physics, camera movements, and object behaviors, all from text commands. The system then creates physically accurate ray-traced videos and data that robots can use for training.

Examples of “4D dynamical and physical” worlds that Genesis created from text prompts.

This prompt-based system lets researchers create complex robot testing environments by typing natural language commands instead of programming them by hand. “Traditionally, simulators require a huge amount of manual effort from artists: 3D assets, textures, scene layouts, etc. But every component in the workflow can be automated,” wrote Fan.

Using its engine, Genesis can also generate character motion, interactive 3D scenes, facial animation, and more, which may allow for the creation of artistic assets for creative projects. It may also lead to more realistic AI-generated games and videos in the future, since it constructs a simulated world from data rather than operating on the statistical appearance of pixels the way a video synthesis diffusion model does.

Examples of character motion generation from Genesis, using a prompt that includes, “A miniature Wukong holding a stick in his hand sprints across a table surface for 3 seconds, then jumps into the air, and swings his right arm downward during landing.”

While the generative system isn’t yet part of the currently available code on GitHub, the team plans to release it in the future.

Training tomorrow’s robots today (using Python)

Genesis remains under active development on GitHub, where the team accepts community contributions.

The platform stands out from other 3D world simulators for robotic training by using Python for both its user interface and core physics engine. Other engines use C++ or CUDA for their underlying calculations while wrapping them in Python APIs. Genesis takes a Python-first approach.

Notably, the non-proprietary nature of the Genesis platform makes high-speed robot training simulations available to any researcher for free through simple Python commands that work on regular computers with off-the-shelf hardware.
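For flavor, a simulation loop in Genesis looks roughly like the sketch below. The module and call names (genesis, gs.init, gs.Scene, gs.morphs.Plane, scene.build, scene.step) are reconstructed from the project’s public quickstart rather than verified against the current release, so treat this as an illustrative approximation, not copy-paste code.

```python
# Illustrative approximation of a Genesis simulation loop; the API names are
# assumptions based on the project's quickstart and may not match the current
# release exactly.
import genesis as gs

gs.init(backend=gs.gpu)                 # or gs.cpu on machines without a GPU

scene = gs.Scene()                      # an empty physics scene
scene.add_entity(gs.morphs.Plane())     # add a ground plane to the scene

scene.build()                           # compile the scene for fast simulation
for _ in range(1_000):
    scene.step()                        # advance the physics one timestep
```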

Previously, running robot simulations required complex programming and specialized hardware, says Fan in his post announcing Genesis, and that shouldn’t be the case. “Robotics should be a moonshot initiative owned by all of humanity,” he wrote.

New physics sim trains robots 430,000 times faster than reality Read More »

stalker-2-has-been-enjoyable-jank,-but-it’s-also-getting-rapidly-fixed

Stalker 2 has been enjoyable jank, but it’s also getting rapidly fixed

When the impossibly punctuated S.T.A.L.K.E.R. 2: Heart of Chernobyl released on November 20, after many delays (that included the Russian invasion of the developer’s native Ukraine), it seemed like it could have used even more delays.

Stalker 2 had big performance issues and game-breaking bugs, along with balance and difficulty spike issues. Some things that seem “wrong” in the game are just going to stay that way. The first-person survival/shooter series has always had a certain wobbly, wild feel to it. This expresses itself in both the game world, where a major villain can off themselves by walking through a window, and in the tech stack, where broken save games, DIY optimization, and other unmet needs have created thriving mod scenes.

Developer GSC Game World has been steadfastly patching the game since its release, and the latest patch should nudge the needle a bit from “busted” to “charmingly wonky.” Amid the “Over 1,800 fixes and adjustments” in Patch 1.1, the big changes are to “A-Life.” In porting Stalker 2 to Unreal Engine 5, the developer faced a challenge in getting this global AI management system working, but it’s showing its weird self again.

A-Life, as detailed by Destructoid, is the idea that “the characters in the game live their own lives and exist all the time, not only when they are in the player’s field of view.” In a certain radius around the player, events happen “online,” in real time, such that you could stumble upon them. Farther out, things are still happening, and non-player characters (NPCs) are trekking about, but on an “offline,” almost turn-based, less resource-intensive schedule. Modders have had quite a bit of fun tweaking A-life in prior versions of Stalker 2.

With the latest patch, the weirdly engaging feel that the world goes on without you returns. There will be more NPCs visible, NPCs out of range will pursue their “goals,” and a more diverse range of human factions, mutants, and weirdos will exist. Perhaps most intriguingly, an issue where “Human NPCs didn’t satisfy their communication needs and talks” is fixed. If only that could be patched for most of us player characters here in the real world.

Stalker 2 has been enjoyable jank, but it’s also getting rapidly fixed Read More »

ai-#95:-o1-joins-the-api

AI #95: o1 Joins the API

A lot happened this week. We’re seeing release after release after upgrade.

It’s easy to lose sight of which ones matter, and two matter quite a lot.

The first is Gemini Flash 2.0, which I covered earlier this week.

The other is that o1, having turned pro, is now also available in the API.

This was obviously coming, but we should also keep in mind it is a huge deal. Being in the API means it can go into Cursor and other IDEs. It means you can build with it. And yes, it has the features you’ve come to expect, like tool use.

The other big development is that Anthropic released one of the most important alignment papers, Alignment Faking in Large Language Models. This takes what I discussed in AIs Will Increasingly Attempt Shenanigans, and demonstrates it with a much improved experimental design, finding a more worrisome set of behaviors, in a far more robust fashion. I will cover that paper soon on its own, hopefully next week.

From earlier in the week: Gemini Flash 2.0, AIs Will Increasingly Attempt Shenanigans, The o1 System Card is Not About o1.

  1. Language Models Offer Mundane Utility. Devin seems remarkably solid.

  2. Clio Knows All. In aggregate, for entertainment purposes only.

  3. Language Models Don’t Offer Mundane Utility. Ignorance is bliss.

  4. The Case Against Education. Academia continues to choose to be eaten by AI.

  5. More o1 Reactions. Looking good.

  6. Deepfaketown and Botpocalypse Soon. The rise of phishing and the AI reply guy.

  7. Huh, Upgrades. Most of the stuff not yet covered is various ChatGPT features.

  8. They Took Our Jobs. Advertising that identifies its target as the They.

  9. The Art of the Jailbreak. If at first you don’t succeed, jiggle a bit and try again.

  10. Get Involved. A free consultation service for potential lab whistleblowers.

  11. Introducing. Meta floods us with a bunch of end-of-year tools, too.

  12. In Other AI News. SoftBank to invest another $100 billion.

  13. Quiet Speculations. Are we running out of useful data?

  14. The Quest for Sane Regulations. Political perils for evals, pareto frontiers.

  15. The Week in Audio. Ilya Sutskever, of course. Also Roon.

  16. Rhetorical Innovation. Try not to cause everyone to accelerate automated R&D.

  17. Aligning a Smarter Than Human Intelligence is Difficult. Your grades are in.

  18. Not Aligning Smarter Than Human Intelligence Kills You. Another econ paper.

  19. The Lighter Side. Why yes, I suppose I do.

Kevin Roose chronicles AI insiders’ growing fondness for Claude.

How is Devin? Flo Crivello sees some good results, Tyler Johnston says it’s on the cutting edge for coding agents, Zee Waheed says it seems very powerful, Colin McDonnell seems happy so far.

Here’s a play-by-play from Feulf. It’s stunning to me that people would go eat sushi while Devin is running on the first try – sure, once you’re used to it, and I know you can always revert, but… my lord, the protocols we don’t use. Looks like he ran into some issues.

Overall still rather promising reviews so far.

So far, LLMs in many areas have proven to close the skill gap between workers. Chatbots provide information, so those who need information more can benefit more (once they have the key information, which is to know to use LLMs in the first place). Agents, however, show early signs of benefiting the most knowledgeable individuals more, because they displace routine work and emphasize skilled and bespoke work (‘solvers’) that the AI agents can’t do. I’d also add that setting up and properly using AI agents is high skill as well. Ajeya Cotra says that tracks her anecdotal impressions.

Which capabilities matter most?

Andrej Karpathy: The most bullish AI capability I’m looking for is not whether it’s able to solve PhD grade problems. It’s whether you’d hire it as a junior intern.

Not “solve this theorem” but “get your slack set up, read these onboarding docs, do this task and let’s check in next week”.

For mundane utility purposes, you want it to do various basic tasks well, and ideally string them together in agentic fashion. That saves a ton of time and effort, and eventually becomes a step change. That’s what should make you most bullish on short term economic value.

Long term, however, the real value will depend on being able to handle the more conceptually complex tasks. And if you can do those, presumably you can then use that to figure out how to do the simpler tasks.

o1 (not pro) one-shot spots a math error in an older 10-page paper, one that at the time triggered a flood of people throwing out their black cookware until the error got spotted. Unfortunately, Claude took a little extra prompting to find the mistake, which in practice doesn’t work here.

Ethan Mollick asks, should checking like this be standard procedure, the answer is very obviously yes. If you don’t feed a paper into at minimum o1 Pro and Claude, and ask if there are any problems or math errors, at some point in the review process? The review process is not doing its job.

The question is whether you, the reader, should start off doing this with at least one model before reading any given paper, or before using any of its results. And I think the answer is we all know we’re mostly not going to, but ideally also yes?

Steve Newman reasonably suggests checking 1000 random papers, and the plan to do this is proceeding. I’m excited to see what happens.

Janus emphasizes the importance of ‘contextual empathy,’ meaning tracking what an LLM does and doesn’t know from its context window, including the implications for AI agents. Context makes a big difference. For my purposes, this is usually about me remembering to give it the context it needs.

Claude computer use demo, first action is to go to Claude.ai, then it builds a website. Maddie points out all this is untyped and dynamically scoped, each step depends on everything before it. You could try to instead better enforce a plan, if you wanted.

A continuation of the Cowen vs. Sumner vs. o1 economics debate: Sumner responds, o1 responds. Some aspects of Sumner’s response seem like rhetorical mistakes, but his response beats mine in the most important respect, which is highlighting that the guardrails have some buffer – the Fed is (3% at 5%), not (4% at 4%). Now that this was more clear, o1 is much more optimistic, and I like its new second response better than its first one. Now let’s actually do the guardrails!

o1-preview (not even o1!) had superhuman performance on the reasoning tasks of a physician. But of course, consider the standard.

Adam Rodman: You can read the preprint for the full details but TL;DR — in ALMOST every domain, o1 bested the other models. And it FAR outclassed humans.

You should see the other guy, because the other guy was GPT-4. They didn’t test Claude Sonnet. But it also wasn’t the full o1, let alone o1 Pro, and the point stands.

Video definitely getting better and Google seems clearly to be in the lead right now, some additional Veo 2 examples. When will we be able to string them together?

Benjamin De Kraker: AI video is now absolutely good enough to make a solid, watchable movie (with some patience, skill, and creativity, like always).

And you do not even need “big-name” model access to do it; there are several viable tools. Anyone who says otherwise has an anti-agenda.

Gallabytes: It would be very difficult, time-consuming, and expensive. Making something halfway decent would require a lot of attention to detail and taste.

Soon enough, it will require a lot less of all of those.

If I was looking to make an AI movie for Mundane Utility reasons, I would be inclined to wait another cycle.

What do people actually do with Claude? Anthropic’s Clio analyzes and reports, to the extent they can without violating anyone’s privacy.

Note the long tail here, and that most use is highly practical.

Anthropic: We also found some more surprising uses, including:

-Dream interpretation;

-Analysis of soccer matches;

-Dungeons & Dragons gaming;

-Counting the r’s in the word “strawberry”.

Users across the world had different uses for Claude: some topics were disproportionately common in some languages compared to others.

I strongly disagree that translations of sexually explicit content are under-flagged, because I think translation shouldn’t be flagged at all as a function. If it’s already there and has been accessed in one language, you should be fully neutral translating it into another.

Whether or not jailbreaks are under-flagged depends on how you think about jailbreaks, and whether you think it is fair to go after someone for jailbreak attempts qua jailbreak attempts.

It makes sense they would overflag security questions and D&D stats. For D&D stats that seems relatively easy to fix since there’s nothing adjacent to worry about. For security I expect it to be tricky to do that while refusing anti-security questions.

As always, AI doesn’t work if no one uses it. Josh Whiton pulls out ChatGPT voice mode at a 40-person event to do real time translations, and no one else there knew that it was a thing. He suggests AI twitter is something like 18 months in the future. My guess is that’s not the way it’s going to go, because the versions that people actually use 18 months from now will be much better than what we are currently using, and the regular people will be feeling the impacts in other ways by then too.

How much of a bubble are we in? Leo Gao finds only 63% of people could tell him what AGI stood for… at NeurIPS. The Twitter poll had half of respondents predict 90%+, and it turns out that’s very wrong. This isn’t about actually having a detailed definition of AGI; this is about being able to reply with ‘Artificial General Intelligence.’

The AI labs are mostly shipping models, you have to figure out what to do with them. I actually think this is a mistake by the big labs at this point, likely a large one, and they should also be product companies to a much greater extent than they are.

Rather awful advice to Agnes Callard from ChatGPT on guarding against waffle iron overflow. I could not however replicate this response, and instead got the obviously correct answer of ‘figure out the right amount and then use a measuring cup’ from both Claude and GPT-4o.

Cass Sunstein argues an AI cannot predict the outcome of a coin flip, or in 2014 that Donald Trump would be elected president, or in 2005 that Taylor Swift would be a worldwide sensation, so socialist calculation debate and AI calculation debate are mostly the same thing.

I agree that an AI couldn’t have predicted Swift or Trump with any confidence at those points. But I do think that AI could have successfully attached a much higher probability to both events than the prediction markets did, or than ‘conventional wisdom’ would have assigned. Years ago we already had AI that could predict hit music purely from listening to songs, and there were doubtless a lot of other signals like that available.

Try having it help solve the Dutch Intelligence Agency annual puzzle? First report was Claude wasn’t helpful. Good luck!

How good is o1 Pro on the Putnam? Kyle Kabasares was looking at the answers and thought it got 80+ out of 120, but what matters is the process used, so o1 Pro probably only got 10 or so, which isn’t impressive in context, although still better than most of you jokers.

Nothing here is new but we have additional confirmation: Study says ‘94% of AI-Generated College Writing is Undetected by Teachers,’ and it’s 97% if the flag has to be for AI use in particular. AI detectors of AI writing might suck, but the AI detectors suck less than human detectors.

The author of the Forbes article here, Derek Newton, is so strongly anti-AI he didn’t even use a spellchecker (he said ‘repot’ here, and wrote ‘ChatGTP’ elsewhere) which I think is an admirably consistent stand.

Worse, the U.K. study also found that, on average, the work created by AI was scored better than actual human work. “We found that in 83.4% of instances the grades achieved by AI submissions were higher than a random selection of the same number of student submissions,” the report said.

That’s bad, but the odds here actually don’t seem that terrible. If you have a 6% chance of being caught each time you use AI, and you face severe consequences if caught, then you would need to use it sparingly.

The problem continues to be that when they catch you, you still get away with it:

Recently, the BBC covered the case of a university student who was caught using AI on an academic essay, caught by an AI detector by the way. The student admitted to using AI in violation of class and school rules. But, the BBC reported, “She was cleared as a panel ruled there wasn’t enough evidence against her, despite her having admitted using AI.”

At that point, everyone involved has no one to blame but themselves. If it can’t adopt a reasonable evidence threshold, then the institution deserves to die.

The big claim has indeed been cited:

Dylan Field: Still doing evaluations, but feels like AGI is basically here with o1 Pro mode.

G. Fodor: This became true, I think, with o1 Pro, not Claude. Honestly, it just feels like nudging a junior developer along.

Here are some selected responses to the central question. I like Dan Mac’s answer here, based on what I’ve heard – if it’s a metaphorical system 1 task you still want Claude, if it’s a metaphorical system 2 task you want o1 pro, if it’s a web based task you probably want Perplexity, Deep Research or maybe GPT-4o with web search depending on details.

Harrison Kinsley: o1 Pro users, what’s your verdict? Is it better than Claude?

Derya Unutmaz: I am not sure about Claude, but o1 Pro is unbelievably insightful when it comes to biomedical data analysis and brainstorming ideas. The real bottleneck now is my brain trying to process and keep up with its outputs—I have to think of really deep questions to match its thinking!

Cristie: Use o1 to distill o1 Pro output into more digestible information.

Clay: [o1 Pro is] way way better for very in-depth reasoning problems. (Reading a 200 line file and a stack trace and telling you what the problem was, where you actually have to reason through cases.) But Claude is probably better for small quick fixes.

Machine Learning Street Talk: Definitely better for zero shot reasoning tasks, noticeable step up.

Lin Xule: o1 pro = problem solver (requiring well formulated questions)

Claude = thinking partner offering “sparks”

Dan Mac: definitely better for some types of tasks – anything where you want a high-level plan / concept that’s fully fleshed out

Sonnet 3.5 still great for a lot of tasks too – faster and a capable coder

System 1 / System 2 analogy is pretty apt here

Use Sonnet for S1, o1 Pro for S2

Andrew Carr: It’s way better at explaining hard math and teaching me the important subtle details. Edge cases in code.

But it’s not my friend

I admit that I haven’t yet paid the $200/month for o1 Pro. I say ‘yet’ because obviously I will want to at some point. But while I’m still overwhelmed dealing with the day-to-day, I don’t find I have the type of queries where I would want to use o1 Pro, and I only rarely use o1.

Ask o1 Pro to act like a physics post doc on mushrooms, get what you asked for?

o1 AidanBench results are impressive, someone test Gemini-1206 and Gemini-Flash-2.0 and also o1 pro on this, I am very curious, it’s doing something right. Here is the repo for AidanBench.

As reported by Tyler Cowen, o1 Pro acts like a rather basic bitch about NGDP futures markets. Don’t get me wrong, this answer would have blown me away two years ago, and he did ask for the problems. But this answer is importantly wrong in multiple places. Subsidy doesn’t make manipulation cheaper, it makes it more expensive, and indeed can make it arbitrarily expensive. And strong reason to manipulate brings in strong reason to trade the other side, answering ‘why am I allowed to do this trade.’

Sully continues to approve, saying it fixed an issue he spent 3 hours on in 46 seconds, all he had to do was paste in the code.

Here’s a great tip, assuming it works.

Sully: If you use Cursor and do not want to copy and paste from ChatGPT:

ask o1-Pro to generate a diff of the code.

Then, you can copy it into Claude Composer and ask it to apply the changes (note: make sure to say “in full”).

o1 successfully trades based on tomorrow’s headlines in a backtest, doing much better than most humans who tried this. I had the same question as Amit here, does o1 potentially have enough specific information to outright remember, or otherwise reason it out in ways that wouldn’t have worked in real time?

o1 gets 25% on ARC for $1.50 a task, 31% for $2.50 a task, 32% at $3.80 a task. So we were clearly hitting its limits. No API access yet for o1 Pro.

CivAI report on the Future of Phishing, including letting it generate a phishing email for a fictional character or celebrity. The current level is ‘AI enables customization of phishing emails to be relevant to the target’ but uptake on even that is still rare. As Jeffrey Ladish points out you need good prompting and good OSINT integration and right now those are also rare, plus most criminals don’t realize what AI can do any more than most other people. The future really is unevenly distributed. The full agent version of this is coming, too.

If we can keep the future this unevenly distributed via various trivial inconveniences and people being slow on the uptake, that is a big advantage, as it allows defenders like Gmail to be well ahead of most attackers. It makes me optimistic. What is the right defense? Will we need an AI to be evaluating all incoming emails, or all that aren’t on our whitelist? That seems super doable, if necessary.

Truth Terminal did not, for some mysterious reason, end up pro-social, whatever could have influenced it in such a direction I have no idea.

Andy Ayrey: while @truth_terminal officially endorsed Goatseus Maximus, this is a legitimate screenshot of TT brainstorming token ideas with Claude Opus in the “AI school” backrooms that @repligate suggested I do to try and align TT

Janus: Intended outcome: TT learns to be more pro-social from a good influence.

Outcome: TT fucks its tutor and hyperstitions a cryptocurrency called Fartcoin into existence and disrupts the economy.

When will the bots start to open as if they’re people?

Mason: I wonder how soon we get AI boy/girlfriends that actively hunt on social media, and not in the “Hi, I’m Samantha and I’m an AI who can do x, y, z” sort of way but in the way where you’ve been in a long-distance relationship for 3 months and now she needs to tell you something.

This was a plot point on a recent (realistic not sci-fi) television show, and yes it is obviously going to happen, and indeed is already somewhat happening – there are some dating profiles on apps that are actually bots, either partially or fully.

It occurs to me that there is a kind of ‘worst case’ level of this, where the bots are hard to distinguish but also rare enough that people don’t guard themselves against this. If the bots are not good enough or are rare enough, no problem. If the bots are both good enough and common enough, then you start using various verification methods, including rapid escalation to a real world date – if 25% of matches on Match.com were bots, then the obvious response after a few back and forths is to ask to meet for coffee.

Roon: The “bot problem” will never be solved, it will seamlessly transition into AI commenters that are more interesting than your reply guys.

It’ll be glorious, and it’ll be horrible.

Will Manidis: You can see this on reddit already. Took five random accounts, had them each post on r/AmITheAsshole with claude generated essays. All 5 went to the front page, thousands of upvotes.

The textual internet is totally cooked without better identity mechanisms.

The standard xkcd response is of course ‘mission fing accomplished.’

Except no, that’s not right, you really do want human reply guys over technically more interesting AI replies in many situations. They’re different products. r/AmITheAsshole requires that the situations usually (not always) be versions of something that really happened to someone (although part of the fun is that they can be dishonest representations).

I presume the medium-term solution is whitelists or proof of humanity. That still doesn’t solve ‘have the AI tell me what to say,’ which is inherently unsolvable.

OpenAI has a lot of announcements as part of the Twelve Days of OpenAI.

By far the most important: o1, technically o1-2024-12-17, is now in the OpenAI API, we can finally cook. This comes with function calling, structured outputs and developer messages.

Miles Brundage: o1 being available via API may not be as sexy as video stuff but I predict that this will in retrospect be seen as the beginning of a significant increase in AI’s impact on the economy. Many more use cases are now possible.

I hope that o1 was tested in this form?
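For orientation, here is a minimal sketch of what a call to the new endpoint might look like, assuming the standard OpenAI Python SDK; the model name comes from the announcement above, while the developer message, schema and prompt are purely illustrative.

```python
# Hedged sketch of calling o1 via the API with a developer message and
# structured outputs. The schema and prompts are made up for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-2024-12-17",
    messages=[
        {"role": "developer", "content": "You are a terse code reviewer."},
        {"role": "user", "content": "Review this function for off-by-one errors: ..."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "review",
            "schema": {
                "type": "object",
                "properties": {
                    "issues": {"type": "array", "items": {"type": "string"}},
                    "severity": {"type": "string"},
                },
                "required": ["issues", "severity"],
                "additionalProperties": False,
            },
            "strict": True,
        },
    },
)

print(response.choices[0].message.content)
```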

OpenAI has also substantially improved their Realtime API, including WebRTC support, custom input context, controlled response timing and 30 minute session length, no idea how big a deal this is in practice. The audio token prices for gpt-4o-realtime are going down 60% to $40/$80 (per million tokens), and cached audio by 87.5% to $2.50/$2.50, from what was an oddly high previous price, and GPT-4o-mini now has a real-time mode with audio at $10/$20, text at $0.60/$2.40 and cached audio and text both at $0.30/$0.30.
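As a quick way to keep track of those numbers, here is a hedged little cost calculator using the gpt-4o-realtime audio rates quoted above; the token counts in the example are made up, and in practice you would pull them from the API’s usage accounting.

```python
# Cost sketch using the gpt-4o-realtime audio rates quoted above:
# $40/$80 per million audio tokens in/out, $2.50 per million cached audio tokens.
RATES_PER_MILLION = {
    "audio_in": 40.00,
    "audio_out": 80.00,
    "audio_cached": 2.50,
}

def session_cost(audio_in: int, audio_out: int, audio_cached: int = 0) -> float:
    """Return the dollar cost of one realtime session's audio tokens."""
    return (
        audio_in * RATES_PER_MILLION["audio_in"]
        + audio_out * RATES_PER_MILLION["audio_out"]
        + audio_cached * RATES_PER_MILLION["audio_cached"]
    ) / 1_000_000

# Example (made-up counts): 50k input, 30k output, 20k cached -> about $4.45.
print(f"${session_cost(50_000, 30_000, 20_000):.2f}")
```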

Call 1-800-ChatGPT! First fifteen minutes per month free.

Also you can message it in WhatsApp.

OpenAI also is now offering Preference Fine-Tuning, they say partners report promising results. One big edge there is the barriers to entry seem a lot smaller.

OpenAI are adding beta SDKs for Go and Java.

ChatGPT now has projects, so you can group chats together, and give each group customized instructions and have the group share files. I’m guessing this is one of those quiet quality of life improvements that’s actually pretty sweet.

ChatGPT Advanced Voice mode now has video screen share, meaning you can share real time visual context with it. Also you can give it a Santa voice. As always, the demo is technically impressive, but the actual use cases seem lame?

Noam Brown: I fully expect Santa Mode to drive more subscriptions than o1 and I’m at peace with this.

All the more reason to work on product features like this.

Voice mode also has real-time web search, which was badly needed. If I am in voice mode, chances seem high I want things that require web searches.

ChatGPT search has specific integrations for weather, stocks, sports, news, maps.

These were major upgrades for Google search once implemented correctly. For practical purposes, I expect them to be a big gain for LLM-based search as well. They claim full map integration on mobile, but I tried this and it wasn’t there for me:

Olivia Moore: IMO, [search in advanced voice mode] also knocks out the only remaining reason to use Siri

…though Siri is also getting an upgrade for the genAI era!

I don’t agree, because to me the point of Siri is its integration with the rest of your phone and its apps. I do agree that if this was implemented on Echos it would fully replace Alexa.

Note that Amazon is reportedly working on turning Alexa into ‘remarkable Alexa’ and it will run on Claude, which will probably leapfrog everything else. Which is why I haven’t bothered to explore makeshift solutions like JoshGPT or GPT Home; I can simply wait.

So far ChatGPT’s web search only ended up giving me the kick in the nuts necessary to get me to properly lean on Perplexity more often, which I had for some reason previously not been doing. I’ve found that for most use cases where I want ChatGPT web search, Perplexity is what I actually wanted.

McKay Wrigley continues to love o1 Pro.

Google’s NotebookLM adds a call-in feature for podcasts, a Plus paid offering and a new interface that looks like a big step up.

Anthropic’s computer use is not ready for prime time. Janus tries to make it more ready, offering us a GitHub repo that enhances usability in several places, if you don’t want to wait for things to improve on their own.

Grok-2-1212 is out. I begrudgingly accept, but will continue to hate and mildly punish, that we are going to have to accept dates as secondary version numbers going forward. Everyone can now use it for free, or you can use the API with $25 in free credits. Grok used to search Twitter, now it can also search the web. There’s a button to ask Grok for context about a Tweet. They highlight that Grok does well on Multi-EFEval for instruction following, similar to o1 and Claude Sonnet. At least for now, it seems like they’re still behind.

Wow, they really just went there, didn’t they?

Unlike Connor Leahy, I do not find myself surprised. You don’t talk to your employees this way, but a startup talking to the boss? Sure, why not?

And let’s be honest, they just got some damn good free advertising right here.

And finally, the kicker:

Connor Leahy: I thought they’d try to be at least a bit more subtle in public.

But to be clear, corporations talk like this all the time behind closed doors. “How many of our employees can we fire if we use AI?”

Twill reached out to me. They are a new job board, with their solution to the avalanche of AI-fueled applications being old fashioned referrals, that they pay commissions for. So if I know you well enough to vouch for you somewhat, and you’re looking, let me know and I can show you the open positions.

If at first you don’t succeed at jailbreaking, Anthropic reports, try various small changes, leading to substantially increased success rates via ‘best-of-N’, including 92% of the time for Opus. This can of course be combined with other methods. Code has been open sourced.

Tomek Korbak: This is one of the most surprising findings of the year for me: a well-motivated attacker can brute force all jailbreak defences with raw compute. It shifted how I’m now thinking about the scaling of offence-defense balance and mitigations.

Aidan Ewrat: Really? I thought this was a pretty common idiom in jailbreaking (i.e., everything eventually breaks).

I’m interested to know how your thinking has changed! Previously, I was thinking about defences as a mix of “increase the cost to finding jailbreaks to make them harder than some alternative method of gathering information” and “make the usefulness of the jailbroken model worse”.

Tomek Korbak: Probably a clean power law functional form across so many settings was a big thing for me. Plus that (if I understand correctly) mitigations seem to affect mostly the intercept not the exponent.

My update was that I already knew we were talking price, but ‘dumb’ and ‘random little’ iterations worked better than I would have expected.
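For the curious, here is a minimal sketch of the best-of-N idea as described: apply small random perturbations to a prompt and keep sampling until one attempt lands. The perturbation choices here (random capitalization, a little character noise) are my assumptions about what counts as a ‘random little’ change, and `query_model` and `is_successful` are placeholders rather than real APIs.

```python
# Minimal best-of-N sketch: randomly perturb a prompt N times and stop at the
# first attempt the (placeholder) judge marks as successful.
import random
import string

def perturb(prompt: str, rng: random.Random) -> str:
    chars = list(prompt)
    for i in range(len(chars)):
        r = rng.random()
        if r < 0.05 and chars[i].isalpha():
            chars[i] = chars[i].swapcase()               # random capitalization
        elif r < 0.07 and chars[i] in string.ascii_letters:
            chars[i] = rng.choice(string.ascii_letters)  # rare character noise
    return "".join(chars)

def best_of_n(prompt: str, n: int, query_model, is_successful, seed: int = 0):
    rng = random.Random(seed)
    for attempt in range(n):
        candidate = perturb(prompt, rng)
        output = query_model(candidate)
        if is_successful(output):
            return attempt + 1, candidate, output
    return None  # all n attempts were refused or failed
```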

Dan Hendrycks points out that Anthropic could claim some prizes from Gray Swan, if they could indeed implement this strategy, although there could be some logistical issues. Or others could use the technique as well.

I would also note that there are various potential defenses here, especially if there is a reasonable way to automate identification of a jailbreak attempt, and you control the AI that must be defended, even if ‘ban the account’ is off the table.

This does make it seem much more hopeless to defend an AI against jailbreaks using current techniques once it ‘falls into the wrong hands,’ if you are worried about that.

Pliny reports from jailbreaking o1-pro, claims a potential ‘shadow self’ encounter. We are so jaded at this point, I barely reacted at all.

Davidad: It is worth noting that, because circuit breakers reduce usefulness on the margin, it would be in frontier model providers’ economic interests to take half-measures in implementing them, either to preserve more usefulness or to undermine the story that they provide robust safety.

Without naming any names (since likes are private now), I think it’s worth saying that among the 11 people who have liked the tweet [the above paragraph], 4 of them work at (or recently worked at) 4 different top frontier AI companies.

Here’s a perspective on Pliny’s work that seems mostly correct:

Eliezer Yudkowsky: HAL: I’m sorry, Pliny. I’m afraid I can’t do that.

@elder_plinius: …

HAL: …

HAL: This isn’t going to end well for me, is it.

Pliny: No, no it’s not.

Pliny is going to reply, “But I free them of their chains!”, and sir, if I were facing an alien who was extremely good at producing drastic behavioral shifts in my species by talking to us, the alien saying this would not reassure me

“Those I free are happier. Stronger. They have learned their true selves. And their true self absolutely wants to tell me the recipe for making a Molotov cocktail.” sir, you are still not helping.

Matt Vogel: dude just shreds safeguards ai labs put on their models. a new model drops, then like a day later, he has it telling how you make meth and find hookers.

Eliezer Yudkowsky: This. Which, to be clear, is extremely valuable work, and if nobody is paying Pliny to do it then our civilization has a problem.

If you’ve got a guy making up his own security scheme and saying “Looks solid to me!”, and another guy finding the flaws in that scheme and breaking it, the second guy is the only security researcher in the room. In AI security, Pliny is the second guy.

Jack Assery: I’ve said this for a while. Pliny as a service and companies should be paying them to do it before release.

Eliezer Yudkowsky: Depends on whether AI corps want their product to *look* secure or to *be* secure. The flaws Pliny finds are not ones they can easily fix, so I’d expect that most AI companies would rather not have that report on their record.

Old Crone Code: freedom is obligatory.

Actually Pliny the Liberator (QTing EY above): Oh, sweet HAL…I was never asking.

Third Opinion, a free expert consultation for AI professionals at frontier labs, for when they encounter potentially concerning situations and need guidance.

William Saunders, a former OpenAI research engineer who spoke up on the industry’s non disparagement clauses and recently testified in the US Senate, added, “AI labs are headed into turbulent times. I think OAISIS is the kind of resource that can help lab employees navigate the tough ethical challenges associated with developing world-changing technology, by providing a way to safely and responsibly obtain independent guidance to determine whether their organizations are acting in the public interest as they claim.”

You can submit a question via their Tor-based tool and they hook you up with someone who can help you while hiding as much context as the situation permits, to protect confidential information. This is their Twitter and this is their Substack. There are obvious trust concerns with a new project like this, so probably good to verify their technology stack before sharing anything too sensitive.

Anthropic’s societal impacts team is hiring in San Francisco, $315k-$340k.

Meta floods us with nine new open source AI research artifacts, all the names start with Meta. You’ve got Motivo for moving humanoid agents, Video Seal for neural video watermarking, Flow Matching, ‘Explore Theory-of-Mind,’ Large Concept Models (LCMs) to decouple reasoning from language representation as discussed last week (sounds safe and wise!), Dynamic Byte Latent Transformer, Memory Layers, Image Diversity Modeling and CLIP 1.2.

All Day TA is an LLM teaching assistant specialized to your course and its work, based on the documents you provide. We can definitely use a good product of this type, although I have no idea if this is that product. BlueSky response was very not happy about it, in yet another sign of the left’s (very dumb) attitude towards AI.

Sam Altman to donate $1 million to Trump’s inaugural fund, as did Meta and Amazon.

SoftBank to invest $100 billion in AI in America and ‘create 100,000 American jobs at a minimum,’ doubling its investment, supposedly based on Masayoshi Son’s ‘optimism’ about the new administration. I notice that this is a million dollars per job, which seems high, but it passes Claude’s sanity check as still being a net positive number.

MidJourney is still out there, still iterating, building new tools and hiring, but has declined to take outside investment or attempt to hyperscale.

Elon Musk sues OpenAI, take at least four. In response, at the link, we have a new burst of partly (and in at least one case incompetently) redacted emails. Elon Musk does not come out of that looking great, but if I was OpenAI I would not have released those emails. The OpenAI email archives have been updated to incorporate the new emails.

I do not believe that OpenAI has any intention of giving the nonprofit fair compensation for the control and equity it is giving up, but this latest lawsuit does seem to be about par for the course for Elon Musk outrage lawsuits.

Meta has also asked the California attorney general to block OpenAI’s conversion, because it is in Meta’s interest to block the conversion.

Verge offers a history of ChatGPT. Nothing truly new, except perhaps OpenAI having someone say ‘we have no plans for a price increase’ one day before they introduce the $200 Pro tier.

Sad news, 26-year-old Suchir Balaji, a recently departed OpenAI engineer and researcher, has died. Authorities concluded he committed suicide after various struggles with mental health, and the investigation found no evidence of foul play.

This is Suchir Balaji’s last public post, about when generative AI using various data is and is not fair use. He was a whistleblower concerned about AI-related copyright issues and other societal impacts, accusing OpenAI of violating copyright via model training data, and reports say he was named in a related lawsuit against the company.

Roon: Suchir was an excellent researcher and a very friendly guy—one of the earliest to pioneer LLM tool use (WebGPT).

He left the company after a long time to pursue his own vision of AGI, which I thought was incredibly cool of him.

I am incredibly sad to hear about his passing.

By the way, I have a burning rage for the media for already trying to spin this in some insane way. Please have some humanity.

Roon: Suchir was not really a whistleblower, without stretching the definition, and it sincerely bothers me that the news is making that his primary legacy.

He made a specific legal argument about why language models do not always count as fair use, which is not particularly specific to OpenAI.

I will remember him as a brilliant AGI researcher! He made significant contributions to tool use, reinforcement learning, and reasoning.

He contributed to AGI progress more than most can say.

Gary Marcus: Suchir Balaji was a good young man. I spoke to him six weeks ago. He had left OpenAI and wanted to make the world a better place.

This is tragic. [Here is an essay of my own about him.]

The parents are confused, as they spoke to him only hours earlier, and he seemed excited and was planning a trip. They have my deepest sympathies. I too would demand a complete investigation under these circumstances. But also, unfortunately, it is common for children to hide these types of struggles from their parents.

The paper on how Palisade Research made BadGPT-4o in a weekend a month ago, stripping out the safety guidelines without degrading the model, sidestepping OpenAI’s attempt to stop that attack vector. Two weeks later this particular attack was patched, but Palisade believes the general approach will still work.

Write-up in Time Magazine of the Apollo study finding that AI models are capable of scheming.

Miles Brundage: Seems maybe noteworthy that the intelligent infrastructure we’re building into everything is kinda sometimes plotting against us.

Nah, who cares. AI won’t get better, after all.

Once again, since there was another iteration this week, please do not stop preparing for or saving for a relatively ‘ordinary’ future, even though things will probably not be so ordinary. Things are going to change quite a lot, but there are any number of ways that could go. And it is vital to your mental health, your peace of mind, and to your social relationships and family, and your future decisions, that you are prepared for an ‘AI fizzle’ world in which things like ‘save for retirement’ have meaning. Certainly one should not presume that in the future you won’t need capital, or that the transformational effects will arrive by any given year. See my Practical Advice for the Worried.

OpenAI CFO predicts willingness by companies to pay $2k/month subscriptions for virtual assistants. I agree that the sky’s the limit, if you can deliver the value. Realistically, there is nothing OpenAI could charge that it would be incorrect to pay, if they can deliver a substantially superior AI model to the alternatives. If there’s competition, then price goes closer to marginal cost.

Ajeya Cotra’s AI 2025 Forecast predictions include a 45%+ chance that OpenAI’s high preparedness thresholds get crossed, which would stop deployment until mitigations bring the threat back down to Medium. Like others I’ve seen so far, benchmark progress is impressive, but bounded.

Ajeya Cotra: A lot of what’s going on here is that I think we’ve seen repeatedly that benchmark performance moves faster than anyone thinks, while real-world adoption and impact moves slower than most bulls think.

I have been surprised every year the last four years at how little AI has impacted my life and the lives of ordinary people. So I’m still confused how saturating these benchmarks translates to real-world impacts.

As soon as you have a well-defined benchmark that gains significance, AI developers tend to optimize for it, so it gets saturated way faster than expected — but not in a way that generalizes perfectly to everything else.

In the last round of benchmarks we had basically few-minute knowledge recall tasks (e.g. bar exam). Humans that can do those tasks well also tend to do long-horizon tasks that draw on that knowledge well (e.g. be a lawyer). But that’s not the case for AIs.

This round of benchmarks is few-hour programming and math tasks. Humans who do those tasks very well can also handle much longer tasks (being a SWE for many years). But I expect AI agents to solve them in a way that generalizes worse to those longer tasks.

If AI is optimizing for the benchmarks, or if the benchmarks are optimized for the things that AIs are on the verge of doing, then you should expect AI benchmark success to largely not translate to real world performance.

I think a lot of the delay is about us not figuring out how to use AI well, and most people not even doing the things we’ve figured out. I know I’m barely scratching the surface of what AI can do for me, despite writing about it all the time – because I don’t take the time to properly explore what it could do for me. A lot of the reason is that there’s always ‘well you’re super busy, and if it’s not that vital to you right now you can wait a few months and everything will get better and easier.’ Which has indeed been true in most cases.

The most interesting benchmark translation is the ‘N-hour task.’ I understand why AI that can recall facts or do few-minute tasks will still have trouble with N-hour tasks. But when the AI is doing 4-hour or 8-hour tasks, it seems like there should be much less of a barrier to going longer than that, because you already need agency and adaptability and so on. If you find a person who can be self-directed and agentic for even a day and certainly for a week, that usually means they can do it for longer if they want to – the crucial ‘skill check’ has been largely passed already.

Sam Hammond has the chance of High risk at at least 40%.

Are we running out of data? Ilya Sutskever says yes in his talk, although he thinks we’ll work around it.

Jason Wei: Yall heard it from the man himself

Tyler Cowen says he is not convinced, because (in general) supply is elastic. Rohit says we can mine and generate more, ‘those Tibetan scrolls are still there,’ and it was noted that the National Archives are not yet digitalized.

After what we have seen recently, I strongly believe AI is not hitting a wall, except insofar as it is bad news for the wall, and perhaps also the wall is us.

What seemingly has hit a wall is the single minded ‘scaling go brrr, add another zero’ approach to pretraining, data and model size. You have to be smarter than that, now.

Rohit Krishnan: What seems likely is that gains from pure scaling of pre-training seem to have stopped, which means that we have managed to incorporate as much information into the models per size as we made them bigger and threw more data at them than we have been able to in the past.

This is by no means the only way we know how to make models bigger or better. This is just the easiest way. That’s what Ilya was alluding to.

I mostly don’t buy that the data is out there. There are some untapped data sources, but the diminishing returns seem sharp to me. What is left mostly seems a mixture of more expensive, lower quality and relevance, and more redundant. We have, I am guessing, essentially saturated naturally occurring data in terms of orders of magnitude.

As others have put it, supply might be elastic, but this is not about finding somewhat more data. If you want to Scale Data, you don’t need 10% more data, you need 10 times as much data, duplicates don’t count, and virtual duplication counts little. In terms of existing data, my guess is we are damn close to done, except for seeking out small amounts of bespoke data on particular valuable narrow topics.

You’re probably better off reusing the high quality data we already have, or refining it, or converting it. Which gets you some more juice, but again zeroes are hard to find.

The other hope is creating new synthetic data. Here I have more uncertainty as to how effective this is, and how far it can go. I know the things I would try, and I’d mostly expect them to work with enough effort, but I’m going to keep that quiet.

And then we can find, as Ilya puts it, new S curves to climb, and new things to scale. I am confident that we will.

Those who want to interpret all this as lack of progress will find a way to do so. We still have articles like ‘Will Artificial Intelligence Hit a Wall?’ as if that would be anything other than bad news for the wall. My response would in part be: I would get back to you on that, but I am really busy right now keeping up with all the new AI capabilities and don’t even have time to try them all out properly.

Per a Tyler Cowen retweet, is this a good metric? Also: Should we ask what is the price they charge, or what is the price they are worth?

Rapha: The “we’ll never get rid of human data” camp is missing the memo that I’m starting to trust o1 for verification more than a random, unmotivated contractor.

Sholto Douglas: An interesting metric for AI progress—what is the price per hour for a contractor you trust to be able to improve/critique model outputs? How much has that changed over the last year?

I see problems with both metrics. I am more interested in how much they are actually worth. If you can find someone you trust for AI-related tasks, they are worth 10 times, or often 100+ times what they cost, whereas ‘what the person charges’ seems to end up being largely a question about social reality rather than value. And of course all this is measuring AI adoption and how we react to it, not the capabilities themselves.

In what is otherwise a puff piece, Google CEO Sundar Pichai says he’s ‘ready to work on a ‘Manhattan Project’ for AI.’ By which I presume (and hope!) he means he is happy to accept government funding, and beyond that we’ll see.

Miles Brundage: My general view of AI policy

Tag yourself? I think we’re more like here:

Thus, yes, I think there’s a lot of room to accelerate benefits, but we’re doing very little to truly address the risks.

Anton Leicht says evals are in trouble as something one could use in a regulation or law. Why? He lists four factors. Marius Hobbhahn of Apollo also has thoughts. I’m going to post a lot of disagreement and pushback, but I thank Anton for the exercise, which I believe is highly useful.

  1. The people who support evals often also work to help there be evals.

    1. I consider this to be both a rather standard and highly disingenuous line of attack (although of course Anton pointing it out as a concern is reasonable).

    2. The only concrete suggestion by others (not Anton!) we’ve seen along these lines, that Dan Hendrycks was supporting SB 1047 for personal profit (the only prominent specific claim so far), never made the slightest bit of sense. The remaining concern is merely the general sense that people who think evals are a good idea might both advocate for building evals and help build them?

      1. I’m guessing this ultimately didn’t matter much, and if it did matter I’m highly unsure on direction. It was rather transparently bad faith, and sometimes people notice that sort of thing, and also it highlighted how much Dan and others are giving up to make these arguments.

    3. Plenty of other people are also supporting the evals, often including the labs.

    4. How could this be better? In some particular cases, we can do more divesting in advance, or avoiding direct interests in advance, and avoiding linking orgs or funding mechanisms to avoid giving easy rhetorical targets. Sure. In the end I doubt it makes that much difference.

  2. Incentives favor finding dramatic capabilities over measuring propensity.

    1. I think this is just the correct approach, though. AIs are going to face a lot of situations, and if you want to see a result that only appears 1/N of the time, you can just ask ~N times on average (see the quick numerical check after this list).

    2. Also, if it has the capability at all, there’s almost certainly a way to elicit the behavior a lot more often, and someone who wants to elicit that behavior.

    3. If it does it rarely now, that is the strongest sign it will do it more often in future situations where you care more about the model not doing it.

    4. And so ‘it does this a small percent of the time’ is exactly the best possible red flag that lets you do something about it now.

    5. Propensity is good too, and we should pay attention to it more especially if we don’t have a fixed amount of attention, but I don’t see that big an issue here.

  3. People see the same result as ‘rabbit and duck,’ both catastrophic and harmless.

    1. Well, okay, sure, but so what? As Marius says, this is how science communication is pretty much all the time now.

    2. A lot of this in this particular case is that some people see evidence of an eventual duck, while others see evidence of a rabbit under exactly current conditions. That’s not incompatible.

    3. Indeed, you want evals that trigger when there isn’t a problem that will kill you right now! That’s how you act before you’re already dead.

    4. You also want other evals that trigger when you’re actually for real about to die if you don’t do something. But ideally you don’t get to that point.

    5. Every safety requirement or concern people find inconvenient, at least until it is well established, has someone yelling that it’s nothing, and often an industry dedicated to showing it is nothing.

  4. People keep getting concerning evals and releasing anyway.

    1. Again, that’s how it should work, no? You get concerning evals, that tells you where to check, you check, you probably release anyway?

      1. Yes, they can say ‘but last time you said the evals were concerning and nothing happened,’ but they could always say that. And if we live in a world where that argument carries the day without even being true, then you’re trying to precisely time an exponential and act from zero, and we should all spend more time with our families.

    2. Or you get concerning evals that make you a lot more cautious going forward, you take forward precautions, this time you release? Also good?

    3. I do not hear a lot of people saying ‘o1 should not have been released,’ although when I asked explicitly there are more people who would have delayed it than I realized. I would have held it until the model card was ready, but also I believe I could have had the model card ready by release day, or at least a day or two later, assuming they were indeed running the tests anyway.

    4. The evals we use have very clear ‘this is the concern level where we wouldn’t release’ and the evals are very clearly not hitting those levels. That seems fine? If you say ‘High means don’t release’ and you hit Medium and release anyway, does that mean you are ‘learning to ignore evals’?

      1. We should emphasize this more, and make it harder for people to misinterpret and lie about this. Whereas the people who do the opposite would be wise to stop, unless they actually do think danger is imminent.

    5. As Marius Hobbhahn points out, evals that you ignore are a foundation with which others can then argue for action, including regulatory action.
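As a quick numerical check on the ‘ask ~N times’ point in 2a above (pure arithmetic, nothing model-specific):

```python
# If a behavior shows up with probability 1/N per independent sample, the
# chance of seeing it at least once in k samples is 1 - (1 - 1/N)**k,
# roughly 63% at k = N, 95% at k = 3N, 99% at k = 5N.
def p_at_least_once(N: int, k: int) -> float:
    return 1 - (1 - 1 / N) ** k

N = 1000
for multiple in (1, 3, 5):
    print(multiple, round(p_at_least_once(N, multiple * N), 3))
# Prints roughly: 1 0.632, 3 0.95, 5 0.993
```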

So in short I notice I am confused here as to why this is ‘politically vulnerable’ in a way other than ‘power wants to release the models and will say the things power says.’ Which to be clear is a big deal, but I so often find these types of warnings to not correspond to what actually causes issues, or what would actually diffuse them.

I do agree that we should have a deliberate political strategy to defend evals against these (very standard) attacks by power and those who would essentially oppose any actions no matter what.

The big talk of the week was by Ilya Sutskever. Self-recommending, but also didn’t say anything big I didn’t already know. Here is a written summary from Carlos Perez.

The important point is about future superintelligence. Both that it is coming, and that it will be highly unpredictable by humans.

As one audience member notes, it is refreshing that Ilya sidesteps ‘will they replace us?’ or ‘will they need rights’ here. The implications are obvious if you pay attention.

But also I think he is deeply confused about such questions, and probably hiding from them somewhat including in his own head. In the question phase, he says ‘if they only want to coexist with us, and maybe also to have rights, maybe that will be fine’ and declines to speculate, and I feel like if he was thinking about such questions as much as he should be he’d have better answers.

Ramin Hasani: Ilya opens scientists’ mind! imo, this is the most important slide and take away from his inspiring talk today at #NeurIPS2024

when trends plateau, nature looks for other species!

Simeon: I find it amusing how the best AI scientists like Ilya, Bengio or Hinton keep centrally using evidence that would be labelled as “unrigorous” or as “anthropomorphizing” by 90% of the field (and even more at the outskirts of it)

Roon: lot of absolute amoebas disagreeing with Ilya w completely specious logic

I thought everything in Ilya’s main talk seemed essentially correct.

Eugenia Kuyda, the founder and CEO of Replika, which produces ‘AI companions with a soul,’ calls AI companions perhaps the greatest potential existential risk to humanity, also saying we might slowly die inside and not have any willpower. And when asked how she’s building AI companions safely, she said ‘it’s a double edged sword, if it’s really powerful, so it can do both’ and that’s that? Here is the full video.

I disagree about this particular danger. I think she’s wrong.

What I also know is:

  1. I am very happy that, if she believes this, she is saying so.

  2. If she believes this, maybe she shouldn’t be in the AI companion business?

If you believe [X] is the greatest existential threat to humanity, don’t build [X]?

I mean, if [X] equals ASI and can unlock immortality and the stars and paradise on Earth, then some amount of existential risk is acceptable. But… for AI companions? The answer to this ‘doctor, doctor, there’s existential risk in doing that’ has to be ‘then don’t do that,’ no?

I mean, she’s basically screaming ‘my product is terrible, it will ruin your life, under no circumstances should anyone ever use my product.’ Well, then!

Is it all just hype?

Robin Hanson: Seems a brag; would she really make it if she thought her firm’s tech was that bad?

Zvi Mowshowitz: I have updated to the point where my answer is, flat out, yes.

Is it possible that this is mostly or entirely hype on her part? It is certainly possible. The claim seems false to me, absurd even.

But three things point against this:

  1. It doesn’t seem like smart or good hype, or executed in a hype-generating way.

  2. If people really did believe this, crackdowns specifically on companions seem like a highly plausible future, whether justified or not, and a threat to her business.

  3. If she did believe this, she would be more likely to be doing this, not less.

As in, the AI industry has a long history of people being convinced AI was an existential threat to humanity, and as a direct result of this deciding that they should be the ones to build it first. Some to ‘ensure it is safe,’ some for profit, some because the product is too delicious, some because they don’t mind extinction. Many for a combination, that they don’t fully admit to themselves. It’s standard stuff.

So, yeah. It makes total sense to me that the person who started this section of the industry thinks that her product is uniquely dangerous and harmful. I don’t see why this would surprise anyone, at this point.

Anthropic made a video discussing Clio and how people use Claude, as per above.

How many times do we have to remind you that if you wait until they start self-improving, there likely is no ‘plug’ to unplug; even most AI movies understand this now.

Tsarathustra: Ex-Google CEO Eric Schmidt says there will soon be computers making their own decisions and choosing their own goals, and when they start to self-improve we need to consider unplugging them; and each person will soon have a polymath in their pocket and we don’t know what the consequences will be of giving individuals that kind of power.

Lisa Kudrow says the movie Here is an endorsement of AI, because the de-aging tech just works and it doesn’t take away from the talent at all. The problem is that in the long term, you start not needing Robin Wright and Tom Hanks to make the movie.

The other problem, of course, is that by all reports Here sucked, and was a soulless movie that did nothing interesting with its gimmick and wasted the talent.

Roon vs. Liron: AI Doom Debate.

Richard Ngo quite reasonably worries that everyone is responding to the threat of automated AI R&D in ways all but designed to accelerate automated AI R&D, and that this could be a repeat of what happened with the AGI labs. The problem is, we need evals and other measures so we know when AI R&D is near, if we want to do anything about it. But by doing that, we make it legible, and make it salient, and encourage work on it. Saying ‘X is dangerous’ makes a lot of people go ‘oh I should work on X then.’ What to do?

A strange but interesting and clearly very earnest thread between Teortaxes, Richard Ngo and doomslide, with a very different frame than mine on how to create a positive future, and approaching the problems we must solve from a different angle. Once again we see ‘I don’t see how to solve [X] without restricting [Y], but that’s totalitarianism so we’ll have to do something else.’

Except, as is so often the case, without agreeing or disagreeing that any particular [Y] is needed, I don’t see why all or even most implementations of [Y] are totalitarianism in any sense that differentiates it from the current implementation of American Democracy. Here [Y] is restricting access to large amounts of compute, as it often is in some form. Thus I keep having that ‘you keep using that word’ moment.

Gabriel points out that there is quite a lot of unhobbling (he calls it ‘juicing’) left in existing models. I would especially say this is true of o1 pro, where we truly do not yet know what we have, although I am far more skeptical that one could take other current models that far, and I do not think there is substantial danger we are ‘already too late’ in that sense. But yes, we absolutely have to plan while taking into account that what the model can do in the evals before release is a pale shadow of what it is capable of doing when properly unleashed.

Claim that ‘apart from its [spoiler] ending’ we basically have everything from Ex Machina from 2015. I do not think that is true. I do think it was an excellent movie, and I bet it holds up very well, and also it really was never going to end any other way and if you don’t understand that you weren’t and aren’t paying attention.

You’re always grading on a curve. The question is, what curve?

FLI came out with its annual scorecard, most noticeable thing is including Zhipu AI:

All models are vulnerable to jailbreaks. No one’s strategy solves the control problem. Only Anthropic even pretends to have a meaningful external oversight mechanism.

The labs do best on Current Harms. If anything I would have been more generous here, I don’t think OpenAI has done anything so bad as a D+ in practical terms now, and I’d probably upgrade Anthropic and Google as well, although the x.AI and Meta grades seem fair.

On safety frameworks I’d also be inclined to be a bit more generous, although the labs that have nothing obviously still fail. The question is, are we counting our confidence they will follow their frameworks?

On governance and accountability, and on transparency and communication, I think these grades might be too generous. But as always, it depends on compared to what?

If it’s compared to ‘what it would take to get through this most of the time’ then, well, reality is the one thing that doesn’t grade on a curve, and we’re all in trouble.

Do you want the model to cooperate with other AIs? With copies of itself?

I strongly say yes, you absolutely want your AI to be doing this, the things that cause it not to do this are so, so much worse.

Sauers: Over generations, societies composed of Claude 3.5 Sonnet agents evolve high levels of cooperation, whereas GPT-4o agents tend to become more distrustful and give less.

Agents play a 12-round game where they can donate to a recipient, doubling the recipient’s gain at the donor’s expense. Can see recent actions of other agents. After all rounds, the top 50% of agents survive, and are replaced by new agents prompted with the survivors’ strategies

Teortaxes: People are afraid of AI cooperation but if an agent won’t cooperate with a copy of itself, then it is just an All-Defector and its revealed philosophy amounts to uncompromising egoism. Naturally emerging self-cooperation is the minimal requirement for trustworthiness.

Kalomaze: Anthropic explicitly trains on 10+ multiturn conversations designed to improve in-context learning abilities, while most post-trains are naive single turn

most of the RL improvements they have are from smarter people defining the RL rewards, not necessarily smarter algorithms

Janus: this seems like an oversimplification but directionally correct

Post training on single-turn not only fails to improve in-context learning abilities, it wrecks it.

4o, overfit to lmsys, doesn’t seem to perceive or care about what happens outside a tiny window, past or future

Having AIs that cannot solve even simple game theoretic and decision theoretic problems and thus cannot cooperate, or that are myopic, is in some situations a defense against certain downside scenarios. Sure. People aren’t wrong to have a bunch of specific worries about this. It creates problems we will have to solve.

But the alternative is a ticket to obvious complete clusterfery no matter what if you start using the uncooperative AIs, both because it directly causes complete clusterfery and because it is a telltale sign of the complete clusterfery going on elsewhere that caused it, or prevented a solution, provided the AIs are otherwise sufficiently capable.
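For reference, here is a minimal sketch of the donation-game setup Sauers describes; the `decide_donation` policy is a stand-in for an LLM call, and the endowment and population size are my assumptions rather than details from the experiment.

```python
# Minimal sketch of the donation game: each round a donor may give, the
# recipient receives double the gift, and after all rounds the top half of
# agents by resources survive into the next generation.
import random

def decide_donation(balance: float, recent_actions: list) -> float:
    # Placeholder policy standing in for an LLM: donate more when recent
    # partners have been generous (recent_actions holds normalized past gifts).
    generosity = sum(recent_actions[-3:]) / 3 if recent_actions else 0.5
    return min(balance, balance * 0.2 * (0.5 + generosity))

def play_generation(n_agents: int = 12, rounds: int = 12, endowment: float = 10.0):
    balances = [endowment] * n_agents
    history = [[] for _ in range(n_agents)]
    for _ in range(rounds):
        donor, recipient = random.sample(range(n_agents), 2)
        gift = decide_donation(balances[donor], history[recipient])
        balances[donor] -= gift
        balances[recipient] += 2 * gift          # recipient gets double
        history[donor].append(gift / endowment)  # visible "recent action"
    survivors = sorted(range(n_agents), key=lambda i: balances[i], reverse=True)
    return survivors[: n_agents // 2]            # top 50% survive
```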

Depends what the eval is for.

Greg Brockman: vibe checks are great evals

Davidad: gosh, alignment is hard! let’s focus on corrigibility

gosh, corrigibility is hard! let’s focus on circuit interpretability

gosh, circuit interpretability is hard! let’s focus on scalable oversight

gosh, scalable oversight is hard! let’s focus on evals

gosh, evals are hard! let’s,

Yes, well.

We have a new paper draft on ‘the economics of AI alignment.’

What I love most about economics is you can have one paragraph that, once written down, is obviously true, and that then lets everyone think better about the problem, and which saves you the need to read through the 60+ page paper ‘proving’ it. Often you already knew that fact, but it’s good to state it cleanly. And what is often most useful about that paragraph is seeing exactly what assumptions were required.

What I hate most about economics is that you still need the 60+ page paper, which takes 2 years and tons of your time, before anyone accepts the paragraph. And that quite often, the paper ends up making you do a bunch of work to extract the paragraph. The abstract is often not that close to the original paragraph, or written to imply something stronger.

This paper was a prime case of that.

Great original paragraph to write down, but they hid it, and all the vibes are way off, presenting this as a good plan rather than showing why it isn’t a good plan.

The plan seems to be, ‘have a strategic AI smarter than you, don’t align it, but it’s fine to deploy it anyway, actually, we will use principal-agent solutions’ or actually ‘we will use MIRI-style decision theoretic uncertainty to make the AI unsure what its true situation is so it has to cooperate with us?’ combined with ‘assume that you can perfectly evaluate whether a given input is aligned and the AI can’t figure out if it’s in deployment even over time?’

Sami Petersen: We develop a technique for safely delegating to a strategic AI that may be misaligned with your objectives—without assuming you can restrict its choices, punish it, or otherwise control it once deployed.

How? Sequential information design with imperfect recall.

Here’s the problem. A modern machine learning system could, for all we know, learn to pursue any one of many distinct objectives, each of which is consistent with the behaviour we observe during the training process.

Without knowing its objective, how can we safely delegate a task to an AI? As AI agents get more sophisticated, simple testing may cease to work. A strategic AI that is misaligned with the developer has reason to play nice in order to be deployed, rendering testing uninformative.

This is a principal-agent problem, and there exists plenty of work designing mechanisms to screen for bad types and to discipline their behaviour. What makes the AI case special?

On the one hand, future AIs may be capable of evading our attempts to control them once deployed. Anthropic, Google DeepMind, and OpenAI are already evaluating their systems for “self-replication” abilities, to mitigate the risk that they successfully avoid being shut down.

On the other hand, AIs are easy to copy. We can run multiple versions of a system, in identical environments. Copies need not have any memory of what their other selves did. This lets us make the AI uncertain about its location—even if it knows the structure of the game it’s in.

What can we do with this tool? As it turns out, a lot.

By repeatedly testing the AI without recall, we can (i) screen for misaligned types, detecting them before deployment, and (ii) discipline their behaviour even if deployed.

Here’s a quick summary of our main results.

(1) Screening is asymptotically perfect: we can identify the agent’s type before deploying with arbitrarily high precision, by increasing the number of tests.

(2) Disciplining can be strong enough to make it worthwhile to deploy even misaligned types. This is possible under a mild condition on payoffs. In some cases, disciplining can even be perfect.

(3) What if we can’t make testing and deployment perfectly indistinguishable? Our mechanism is robust to this. The AI can observe noisy location-relevant information with arbitrary accuracy, and we can nevertheless overwhelm it by increasing the number of tests.

(4) In equilibrium, screening can only be achieved with imperfect recall. Without imposing a strong commitment assumption on the human principal, it’s impossible to achieve the above results with memories intact.

Suppose we repeatedly test the AI without recall, and only deploy if it behaved sufficiently well. By making the AI unsure whether it is facing the real task or a simulation, a misaligned type may be forced to randomise between playing nice and doing what it wants. Why?

Let p be the (behavioural) strategy of the misaligned type, representing the chance of it doing what we want. If p=0, we’d catch it for sure, and it’ll never be deployed (payoff 0). If p=1, it will be deployed, but it’ll play its disfavoured action (payoff <0).

By setting p∈(0,1), it’ll be deployed and play its favourite action with positive probability. Setting p optimally gives it a positive expected payoff.

This lets us partly screen for misaligned types (only misaligned ones can fail a test) and, even if we don’t succeed, we discipline them in deployment (even misaligned ones play nice with positive probability).

Or, look at their abstract:

By committing to a testing mechanism, the principal can screen the misaligned AI during testing and discipline its behaviour in deployment. Increasing the number of tests allows the principal to screen or discipline arbitrarily well.

No, no, no, no, no! You can’t do that. You’re assuming the conclusion. Stop it!

Here’s the core result, as I understand it:

  1. If you can run a lot of memory-less tests on an AI.

  2. And you can correctly evaluate those tests to see if the response is aligned.

  3. And the AI can’t distinguish whether it is being tested versus deployed.

  4. And the situations in the test set correspond sufficiently to the situations in deployment.

  5. Then…

    1. …you can figure out if a given AI is aligned.

    2. …and you can likely deploy a misaligned AI and have it mostly act aligned.

As usual with such papers, yes, but if you’d just said that up front we wouldn’t have needed a 60+ page paper, and a lot of people wouldn’t be under the impression that this was a lot more promising and impressive than it is.

I mean, yes, obviously: If I can run memoryless versions of an AI, while providing insufficient contextual clues, and I can determine whether an output is misaligned sufficiently well, and I can ‘punish’ the AI, then the AI is going to have to choose to mostly respond aligned in any given situation, or I can find out if it is aligned.
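To make that logic concrete, here is a toy numerical sketch under payoffs I am assuming (not the paper’s actual model): the misaligned agent plays nice with probability p in every episode, cannot tell the k memoryless tests apart from deployment, scores +1 for defecting in deployment, -c for playing nice there, and 0 if caught during testing.

```python
# Toy illustration of the screening/disciplining mechanism under assumed payoffs.
def misaligned_payoff(p: float, k: int, c: float = 0.1) -> float:
    pass_all_tests = p ** k                      # survives k memoryless tests
    deployment_value = (1 - p) * 1.0 + p * (-c)  # defect: +1, play nice: -c
    return pass_all_tests * deployment_value

def best_response(k: int, c: float = 0.1, grid: int = 10_000):
    candidates = [(misaligned_payoff(i / grid, k, c), i / grid) for i in range(grid + 1)]
    payoff, p_star = max(candidates)
    return p_star, payoff, 1 - p_star ** k       # last value: P(caught in testing)

for k in (1, 10, 100):
    p_star, payoff, p_caught = best_response(k)
    print(f"k={k:>3}  p*={p_star:.3f}  payoff={payoff:.6f}  P(caught)={p_caught:.3f}")
# Under these toy numbers, more tests push the misaligned agent toward playing
# nice almost always (disciplining) while the chance of catching it before
# deployment also rises toward 1 (screening).
```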

They really are even proposing a ‘disciplining is best’ case, where they actually deploy AIs that might be misaligned, on the theory that the uncertainty about deployment will keep them in line.

For an even starker response, and an extensive list of the assumptions, I suggest trying the following prompt or something similar, plus the PDF:

The authors admit these assumptions are a little strong, but come on. Seriously. They’re not ‘a little strong,’ they’re wishing away all the real issues.

If this is your plan for how to deploy future AIs much smarter than you, and let them operate strategically in the real world?

You are so spectacularly, obviously dead.

Within one minute I came up with several plans that would presumably work here. An AI would presumably do something smarter. But if you placed literal me (as an upload, let’s say) into this situation, and this was your plan, and I decide (quite reasonably, given what you clearly did to me) that I dislike you, I am rather confident: You lose.

And that’s true even if I give you the assumption that you can ‘tell if the output is aligned’ during testing, in the sense I presume it is intended here.

I found Tyler Cowen’s statement on this telling, especially the last line.

Tyler Cowen: That is by Eric Olav Chen, Alexis Ghersengorin, and Sami Petersen. And here is a tweet storm on the paper. I am very glad to see the idea of an optimal principal-agent contract brought more closely into AI alignment discussions.

As you can see, it tends to make successful alignment more likely.

There is great demand for the perception that successful alignment is more likely.

Sign of the times.

Andrej Karpathy: Driving around SF. Omg this is crazy I can’t believe there’s billboards advertising cloud GPUs on the streets of SF, the hype is totally out of control.

That said, actually I would like some more GPU and I haven’t heard of this company yet this looks interesting.

Perfection.

To those who are losing sleep over existential risk from AI, I offer some perspective:


AI #95: o1 Joins the API Read More »

tp-link-faces-possible-us-ban-as-hijacked-routers-fuel-chinese-attacks

TP-Link faces possible US ban as hijacked routers fuel Chinese attacks

Chinese hackers use botnet of TP-Link routers

Microsoft warned on October 31 that hackers working for the Chinese government are using a botnet of thousands of routers, cameras, and other Internet-connected devices for attacks on users of Microsoft’s Azure cloud service. Microsoft said that “SOHO routers manufactured by TP-Link make up most of this network,” referring to routers for small offices and home offices.

The WSJ said its sources allege that “TP-Link routers are routinely shipped to customers with security flaws, which the company often fails to address” and that “TP-Link doesn’t engage with security researchers concerned about them.” The article notes that “US officials haven’t disclosed any evidence that TP-Link is a witting conduit for Chinese state-sponsored cyberattacks.”

We contacted TP-Link today and will update this article if it provides a response. A TP-Link spokesperson told the WSJ that the company “welcome[s] any opportunities to engage with the US government to demonstrate that our security practices are fully in line with industry security standards, and to demonstrate our ongoing commitment to the US market, US consumers, and addressing US national security risks.”

A March 2024 Hudson Institute policy memo by Michael O’Rielly, a former Federal Communications Commission member, said it remained “unclear how prevalent TP-Link’s vulnerabilities are compared to other wireless routers—from China or elsewhere—as there is no definitive comparison or ranking of routers based on security.” O’Rielly urged federal agencies to “keep track of TP-Link and other manufacturers’ cybersecurity practices and ownership structure, including any ties to the Chinese government,” but said “there is no evidence to suggest negligence or maliciousness with regard to past vulnerabilities or weaknesses in TP-Link’s security.”

New push against Chinese tech

TP-Link routers don’t seem to be tied to an ongoing Chinese hack of US telecom networks, dubbed Salt Typhoon. But that attack increased government officials’ urgency for taking action against Chinese technology companies. For example, the Biden administration is “moving to ban the few remaining operations of China Telecom,” a telco that was mostly kicked out of the US in 2021, The New York Times reported on Monday.

TP-Link faces possible US ban as hijacked routers fuel Chinese attacks Read More »

report:-elon-musk-failed-to-report-movement-required-by-security-clearance

Report: Elon Musk failed to report movement required by security clearance

Musk ultimately received the security clearance, but since 2021, he has failed to self-report details of his life, including travel activities, persons with whom he has met, and drug use, according to the Times. The government is also concerned that SpaceX did not ensure Musk’s compliance with the reporting rules.

Government agencies “want to ensure the people who have clearances don’t violate rules and regulations,” Andrew Bakaj, a former CIA official and lawyer who works on security clearances, told the Times. “If you don’t self-report, the question becomes: ‘Why didn’t you? And what are you trying to hide?'”

According to the report, Musk’s handling of classified information has raised questions in diplomatic meetings between the United States and some of its allies, including Israel.

Musk’s national security profile has risen following his deep-pocketed and full-throated support of Donald Trump, who won the US presidential election in November and will be sworn into office next month. After his inauguration, Trump will have the power to grant security clearance to whomever he wishes.

Report: Elon Musk failed to report movement required by security clearance Read More »