Author name: Kris Guyer


Dirty deeds in Denver: Ex-prosecutor faked texts, destroyed devices to frame colleague

How we got here

Choi was a young attorney a few years out of law school, working at the Denver District Attorney’s Office in various roles between 2019 and 2022. Beginning in 2021, she accused her colleague, Dan Hines, of sexual misconduct. Hines, she said at first, made an inappropriate remark to her. Hines denied it and nothing could be proven, but he was still transferred to another unit.

In 2022, Choi complained again. This time, she offered phone records showing inappropriate text messages she allegedly received from Hines. But Hines, who denied everything, offered investigators his own phone records, which showed no texts to Choi.

Investigators then went directly to Verizon for records, which showed that “Ms. Choi had texted the inappropriate messages to herself,” according to the Times. “In addition, she changed the name in her phone to make it appear as though Mr. Hines was the one who had sent them.”

At this point, the investigators started looking more closely at Choi and asked for her devices, leading to the incident described above.

In the end, Choi was fired from the DA’s office and eventually given a disbarment order by the Office of the Presiding Disciplinary Judge, which she can still appeal. For his part, Hines is upset about how he was treated during the whole situation and has filed a lawsuit of his own against the DA’s office, believing that he was initially seen as a guilty party even in the absence of evidence.

The case is a reminder that, despite well-founded concerns over tracking, data collection, and privacy, sometimes the modern world’s massive data collection can work to one’s benefit. Hines was able to escape the second allegation against him precisely because of the specific (and specifically refutable) digital evidence that was presented against him—as opposed to the murkier world of “he said/she said.”

Choi might have done as she liked with her devices, but her “evidence” wasn’t the only data out there. Investigators were able to draw on Hines’ own phone data, along with Verizon network data, to see that he had not been texting Choi at the times in question.

Update: Ars Technica has obtained the ruling, which you can read here (PDF). The document recounts in great detail what a modern, quasi-judicial workplace investigation looks like: forensic device examinations, search warrants to Verizon, asking people to log into their cell phone accounts and download data while investigators look over their shoulders, etc.

Dirty deeds in Denver: Ex-prosecutor faked texts, destroyed devices to frame colleague Read More »


New GeForce 50-series GPUs: There’s the $1,999 5090, and there’s everything else


Nvidia leans heavily on DLSS 4 and AI-generated frames for speed comparisons.

Nvidia’s RTX 5070, one of four new desktop GPUs announced this week. Credit: Nvidia

Nvidia has good news and bad news for people building or buying gaming PCs.

The good news is that three of its four new RTX 50-series GPUs are the same price or slightly cheaper than the RTX 40-series GPUs they’re replacing. The RTX 5080 is $999, the same price as the RTX 4080 Super; the 5070 Ti and 5070 are launching for $749 and $549, each $50 less than the 4070 Ti Super and 4070 Super.

The bad news for people looking for the absolute fastest card they can get is that the company is charging $1,999 for its flagship RTX 5090 GPU, significantly more than the $1,599 MSRP of the RTX 4090. If you want Nvidia’s biggest and best, it will cost at least as much as four high-end game consoles or a pair of decently specced midrange gaming PCs.

Pricing for the first batch of Blackwell-based RTX 50-series GPUs. Credit: Nvidia

Nvidia also announced a new version of its upscaling algorithm, DLSS 4. As with DLSS 3 and the RTX 40-series, DLSS 4’s flagship feature will be exclusive to the 50-series. It’s called DLSS Multi Frame Generation, and as the name implies, it takes the Frame Generation feature from DLSS 3 and allows it to generate even more frames. It’s why Nvidia CEO Jensen Huang claimed that the $549 RTX 5070 performed like the $1,599 RTX 4090; it’s also why those claims are a bit misleading.

The rollout will begin with the RTX 5090 and 5080 on January 30. The 5070 Ti and 5070 will follow at some point in February. All cards except the 5070 Ti will come in Nvidia-designed Founders Editions as well as designs made by Nvidia’s partners; the 5070 Ti isn’t getting a Founders Edition.

The RTX 5090 and 5080

                    RTX 5090       RTX 4090       RTX 5080       RTX 4080 Super
CUDA Cores          21,760         16,384         10,752         10,240
Boost Clock         2,410 MHz      2,520 MHz      2,617 MHz      2,550 MHz
Memory Bus Width    512-bit        384-bit        256-bit        256-bit
Memory Bandwidth    1,792 GB/s     1,008 GB/s     960 GB/s       736 GB/s
Memory size         32GB GDDR7     24GB GDDR6X    16GB GDDR7     16GB GDDR6X
TGP                 575 W          450 W          360 W          320 W
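
As a sanity check on the table, memory bandwidth is just bus width times per-pin data rate. The quick sketch below reproduces the figures above; the per-pin rates are my inference from the table, not quoted Nvidia specs.

```python
# Rough sanity check of the memory bandwidth figures in the table above.
# Bandwidth (GB/s) = bus width (bits) x per-pin data rate (Gbps) / 8 bits per byte.
# The per-pin rates below are assumptions inferred from the table, not official specs.
cards = {
    "RTX 5090": (512, 28),        # GDDR7
    "RTX 4090": (384, 21),        # GDDR6X
    "RTX 5080": (256, 30),        # GDDR7
    "RTX 4080 Super": (256, 23),  # GDDR6X
}

for name, (bus_bits, gbps_per_pin) in cards.items():
    bandwidth = bus_bits * gbps_per_pin / 8
    print(f"{name}: {bandwidth:.0f} GB/s")
# Prints 1792, 1008, 960, and 736 GB/s, matching the table.
```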

The RTX 5090, based on Nvidia’s new Blackwell architecture, is a gigantic chip with 92 billion transistors in it. And while it is double the price of an RTX 5080, you also get double the GPU cores and double the RAM and nearly double the memory bandwidth. Even more than the 4090, it’s being positioned head and shoulders above the rest of the GPUs in the family, and the 5080’s performance won’t come remotely close to it.

Although $1,999 is a lot to ask for a graphics card, if Nvidia can consistently make the RTX 5090 available at $2,000, it could still be an improvement over the pricing of the 4090, which regularly sold for well over $1,599 over the course of its lifetime, due in part to pandemic-fueled GPU shortages, cryptocurrency mining, and the generative AI boom. Companies and other entities buying them as AI accelerators may restrict the availability of the 5090, too, but Nvidia’s highest GPU tier has been well out of the price range of most consumers for a while now.

Despite the higher power budget—as predicted, the 5090’s 575 W TGP is 125 W higher than the 4090’s 450 W, and Nvidia recommends a 1,000 W power supply or better—the physical size of the 5090 Founders Edition is considerably smaller than the 4090, which was large enough that it had trouble fitting into some computer cases. Thanks to a “high-density PCB” and redesigned cooling system, the 5090 Founders Edition is a dual-slot card that ought to fit into small-form-factor systems much more easily than the 4090. Of course, this won’t stop most third-party 5090 GPUs from being gigantic triple-fan monstrosities, but it is apparently possible to make a reasonably sized version of the card.

Moving on to the 5080, it looks like more of a mild update from last year’s RTX 4080 Super, with a few hundred more CUDA cores, more memory bandwidth (thanks to the use of GDDR7, since the two GPUs share the same 256-bit interface), and a slightly higher power budget of 360 W (compared to 320 W for the 4080 Super).

Having more cores and faster memory, in addition to whatever improvements and optimizations come with the Blackwell architecture, should help the 5080 easily beat the 4080 Super. But it’s an open question as to whether it will be able to beat the 4090, at least before you consider any DLSS-related frame rate increases. The 4090 has 52 percent more GPU cores, a wider memory bus, and 8GB more memory.

5070 Ti and 5070

                    RTX 5070 Ti    RTX 4070 Ti Super   RTX 5070       RTX 4070 Super
CUDA Cores          8,960          8,448               6,144          7,168
Boost Clock         2,452 MHz      2,610 MHz           2,512 MHz      2,475 MHz
Memory Bus Width    256-bit        256-bit             192-bit        192-bit
Memory Bandwidth    896 GB/s       672 GB/s            672 GB/s       504 GB/s
Memory size         16GB GDDR7     16GB GDDR6X         12GB GDDR7     12GB GDDR6X
TGP                 300 W          285 W               250 W          220 W

At $749 and $549, the 5070 Ti and 5070 are slightly more within reach for someone who’s trying to spend less than $2,000 on a new gaming PC. Both cards hew relatively closely to the specs of the 4070 Ti Super and 4070 Super, both of which are already solid 1440p and 4K graphics cards for many titles.

Like the 5080, the 5070 Ti includes a few hundred more CUDA cores, more memory bandwidth, and slightly higher power requirements compared to the 4070 Ti Super. That the card is $50 less than the 4070 Ti Super was at launch is a nice bonus—if it can come close to or beat the RTX 4080 for $250 less, it could be an appealing high-end option.

The RTX 5070 is alone in having fewer CUDA cores than its immediate predecessor—6,144, down from 7,168. It is an upgrade from the original 4070, which had 5,888 CUDA cores, and GDDR7 and slightly faster clock speeds may still help it outrun the 4070 Super; like the other 50-series cards, it also comes with a higher power budget. But right now this card is looking like the closest thing to a lateral move in the lineup, at least before you consider the additional frame-generation capabilities of DLSS 4.

DLSS 4 and fudging the numbers

Many of Nvidia’s most ostentatious performance claims—including the one that the RTX 5070 is as fast as a 4090—factor in DLSS 4’s additional AI-generated frames. Credit: Nvidia

When launching new 40-series cards over the last two years, it was common for Nvidia to publish a couple of different performance comparisons to last-gen cards: one with DLSS turned off and one with DLSS and the 40-series-exclusive Frame Generation feature turned on. Nvidia would then lean on the DLSS-enabled numbers when making broad proclamations about a GPU’s performance, as it does in its official press release when it says the 5090 is twice as fast as the 4090, or as Huang did during his CES keynote when he claimed that an RTX 5070 offered RTX 4090 performance for $549.

DLSS Frame Generation is an AI feature that builds on what DLSS is already doing. Where DLSS uses AI to fill in gaps and make a lower-resolution image look like a higher-resolution image, DLSS Frame Generation creates entirely new frames and inserts them in between the frames that your GPU is actually rendering.

DLSS 4 now generates up to three frames for every frame the GPU is actually rendering. Used in concert with DLSS image upscaling, Nvidia says that “15 out of every 16 pixels” you see on your screen are being generated by its AI models. Credit: Nvidia

The RTX 50-series one-ups the 40-series with DLSS 4, another new revision that’s exclusive to its just-launched GPUs: DLSS Multi Frame Generation. Instead of generating one extra frame for every traditionally rendered frame, DLSS 4 generates “up to three additional frames” to slide in between the ones your graphics card is actually rendering—based on Nvidia’s slides, it looks like users ought to be able to control how many extra frames are being generated, just as they can control the quality settings for DLSS upscaling. Nvidia is leaning on the Blackwell architecture’s faster Tensor Cores, which it says are up to 2.5 times faster than the Tensor Cores in the RTX 40-series, to do the AI processing necessary to upscale rendered frames and to generate new ones.
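
To make the “15 out of every 16 pixels” claim concrete, here is the arithmetic under two assumptions that the text above does not spell out: DLSS upscaling rendering one quarter of the output pixels, and Multi Frame Generation adding three AI frames per rendered frame.

```python
# Illustrative arithmetic behind the "15 out of every 16 pixels" claim.
# Assumptions (mine, not stated explicitly above): DLSS upscaling renders 1/4 of the
# output pixels per frame, and Multi Frame Generation adds 3 AI frames per rendered frame.
upscale_fraction = 1 / 4      # fraction of output pixels actually rendered per frame
generated_per_rendered = 3    # extra AI frames per traditionally rendered frame

frames_total = 1 + generated_per_rendered
rendered_pixel_share = upscale_fraction / frames_total       # 1/16
ai_pixel_share = 1 - rendered_pixel_share                    # 15/16
print(f"AI-generated share of displayed pixels: {ai_pixel_share:.4f}")  # 0.9375

# Effective frame rate if the GPU traditionally renders `base_fps` frames per second:
base_fps = 30
print(f"Displayed FPS with 3x frame generation: {base_fps * frames_total}")  # 120
```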

Nvidia’s performance comparisons aren’t indefensible; with DLSS FG enabled, the cards can put out a lot of frames per second. It’s just dependent on game support (Nvidia says that 75 titles will support it at launch), and going off of our experience with the original iteration of Frame Generation, there will likely be scenarios where image quality is noticeably worse or just “off-looking” compared to actual rendered frames. DLSS FG also needed a solid base frame rate to get the best results, which may or may not be the case for Multi-FG.

Enhanced versions of older DLSS features can benefit all RTX cards, including the 20-, 30-, and 40-series. Multi-Frame Generation is restricted to the 50-series, though. Credit: Nvidia

Though the practice of restricting the biggest DLSS upgrades to all-new hardware is a bit frustrating, Nvidia did announce that it’s releasing a new transformer module for the DLSS Ray Reconstruction, Super Resolution, and Anti-Aliasing features. These are DLSS features that are available on all RTX GPUs going all the way back to the RTX 20-series, and games that are upgraded to use the newer models should benefit from improved upscaling quality even if they’re using older GPUs.

GeForce 50-series: Also for laptops!

Nvidia’s projected pricing for laptops with each of its new mobile GPUs. Credit: Nvidia

Nvidia’s laptop GPU announcements sometimes trail the desktop announcements by a few weeks or months. But the company has already announced mobile versions of the 5090, 5080, 5070 Ti, and 5070 that Nvidia says will begin shipping in laptops priced between $1,299 and $2,899 when they launch in March.

All of these GPUs share names, the Blackwell architecture, and DLSS 4 support with their desktop counterparts, but per usual they’re significantly cut down to fit on a laptop motherboard and within a laptop’s cooling capacity. The mobile version of the 5090 includes 10,496 GPU cores, less than half the number of the desktop version, and just 24GB of GDDR7 memory on a 256-bit interface instead of 32GB on a 512-bit interface. But it also can operate with a power budget between 95 and 150 W, a fraction of what the desktop 5090 needs.

                    RTX 5090 (mobile)   RTX 5080 (mobile)   RTX 5070 Ti (mobile)   RTX 5070 (mobile)
CUDA Cores          10,496              7,680               5,888                  4,608
Memory Bus Width    256-bit             256-bit             192-bit                128-bit
Memory size         24GB GDDR7          16GB GDDR7          12GB GDDR7             8GB GDDR7
TGP                 95-150 W            80-150 W            60-115 W               50-100 W

The other three GPUs are mostly cut down in similar ways, and all of them have fewer GPU cores and lower power requirements than their desktop counterparts. The 5070 GPUs both have less RAM and narrowed memory buses, too, but the mobile RTX 5080 at least comes closer to its desktop iteration, with the same 256-bit bus width and 16GB of RAM.


New GeForce 50-series GPUs: There’s the $1,999 5090, and there’s everything else Read More »


Lenovo laptop’s rollable screen uses motors to grow from 14 to 16.7 inches

Lenovo announced a laptop today that experiments with a new way to offer laptop users more screen space than the typical clamshell design. The Lenovo ThinkBook Plus Gen 6 Rollable has a screen that can roll up vertically to expand from 14 inches diagonally to 16.7 inches, presenting an alternative to prior foldable-screen and dual-screen laptops.

Here you can see the PC’s backside when the screen is extended. Lenovo

The laptop, which Lenovo says is coming out in June, builds on a concept that Lenovo demoed in February 2023. That prototype had a Sharp-made panel that initially measured 12.7 inches but could unroll to present a total screen size of 15.3 inches. Lenovo’s final product is working with a bigger display from Samsung Display, The Verge reported. Resolution-wise you’re going from 2,000×1,600 pixels (about 183 pixels per inch) to 2,000×2,350 (184.8 ppi), the publication said.
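
Those pixel-density figures check out from the stated resolutions and diagonals. A quick verification, treating the panel as a flat rectangle (an approximation for a rollable screen):

```python
import math

# Quick check of the pixel-density figures reported above:
# ppi = sqrt(horizontal_px^2 + vertical_px^2) / diagonal_inches.
def ppi(w_px, h_px, diagonal_in):
    return math.hypot(w_px, h_px) / diagonal_in

print(f"Rolled up (14.0 in, 2000x1600): {ppi(2000, 1600, 14.0):.1f} ppi")  # ~183
print(f"Extended (16.7 in, 2000x2350): {ppi(2000, 2350, 16.7):.1f} ppi")   # ~184.8
```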

Users make the screen expand by pressing a dedicated button on the keyboard or by making a hand gesture at the PC’s webcam. Expansion entails about 10 seconds of loud whirring from the laptop’s motors. Lenovo executives told The Verge that the laptop was rated for at least 20,000 rolls up and down and 30,000 hinge openings and closings.

The system can also treat the expanded screens as two different 16:9 displays.

Lenovo ThinkBook Plus Gen 6 Rollable

The screen claims up to 400 nits brightness and 100 percent DCI-P3 coverage. Credit: Lenovo

This is a clever way to offer a dual-screen experience without the flaws inherent to current dual-screen laptops, including distracting hinges and designs with questionable durability. However, 16.7 inches is a bit small for two displays. The dual-screen Lenovo Yoga Book 9i, for comparison, previously had two 13.3-inch displays for a total of 26.6 inches, and this year’s model has two 14-inch screens. Still, the ThinkBook, when its screen is fully expanded, is the rare laptop to offer a screen that’s taller than it is wide.

Still foldable OLED

At first, you might think that since the screen is described as “rollable” it may not have the same visible creases that have tormented foldable-screen devices since their inception. But the screen, reportedly from Samsung Display, still shows “little curls visible in the display, which are more obvious when it’s moving and there’s something darker onscreen,” as well as “plenty of smaller creases along its lower half” that aren’t too noticeable when using the laptop but that are clear when looking at the screen closely or when staring at it “from steeper angles,” The Verge reported.

Lenovo laptop’s rollable screen uses motors to grow from 14 to 16.7 inches Read More »


“I’m getting dizzy”: Man films Waymo self-driving car driving around in circles

Waymo says the problem only caused a delay of just over five minutes and that Johns was not charged for the trip. A spokesperson for Waymo, which is owned by Google parent Alphabet, told Ars today that the “looping event” occurred on December 9 and was later addressed during a regularly scheduled software update.

Waymo did not answer our question about whether the software update only addressed routing at the specific location the problem occurred at, or a more general routing problem that could have affected rides in other locations.

The problem affecting Johns’ ride occurred near the user’s pickup location, Waymo told us. The Waymo car took the rider to his destination after the roughly five-minute delay, the spokesperson said. “Our rider support agent did help initiate maneuvers that helped resolve the issue,” Waymo said.

Rider would like an explanation

CBS News states that Johns is “still not certain he was communicating with a real person or AI” when he spoke to the support rep in the car. However, the Waymo spokesperson told Ars that “all of our rider support staff are trained human operators.”

Waymo told Ars that the company tried to contact Johns after the incident and left him a voicemail. Johns still says that he never received an explanation of what caused the circling problem.

We emailed Johns today and received a reply from a public relations firm working on his behalf. “To date, Mike has not received an explanation as to the reason for the circling issue,” his spokesperson said. His spokesperson confirmed that Johns did not miss his flight.

It wasn’t clear from the video whether Johns tried to use the “pull over” functionality available in Waymo cars. “If at any time you want to end your ride early, tap the Pull over button in your app or on the passenger screen, and the car will find a safe spot to stop,” a Waymo support site says.

Johns’ spokesperson told us that “Mike was not immediately aware of the ‘pull over’ button,” so “he did not have an opportunity to use it before engaging with the customer service representative over the car speaker.”

While Waymo says all its agents are human, Johns’ spokesperson told Ars that “Mike is still unsure if he was speaking with a human or an AI agent.”

“I’m getting dizzy”: Man films Waymo self-driving car driving around in circles Read More »


HDMI 2.2 will require new “Ultra96” cables, whenever we have 8K TVs and content

We’ve all had a good seven years to figure out why our interconnected devices refused to work properly with the HDMI 2.1 specification. The HDMI Forum announced at CES today that it’s time to start considering new headaches. HDMI 2.2 will require new cables for full compatibility, but it has the same physical connectors. Tiny QR codes are suggested to help with that, however.

The new specification is named HDMI 2.2, but compatible cables will carry an “Ultra96” marker to indicate that they can carry 96Gbps, double the 48Gbps of HDMI 2.1b. The Forum anticipates this will result in higher resolutions and refresh rates and a “next-gen HDMI Fixed Rate Link.” The Forum cited “AR/VR/MR, spatial reality, and light field displays” as benefiting from increased bandwidth, along with medical imaging and machine vision.
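
For a rough sense of why 96Gbps matters, here is a back-of-the-envelope payload estimate for uncompressed video. It ignores blanking intervals, link encoding overhead, and audio, and assumes 10-bit RGB at 30 bits per pixel; those assumptions are mine, not the HDMI Forum's.

```python
# Rough, illustrative estimate of uncompressed video payload (blanking intervals,
# link encoding overhead, and audio are ignored), to show why 8K pushes past
# HDMI 2.1b's 48 Gbps and toward Ultra96's 96 Gbps.
def payload_gbps(width, height, refresh_hz, bits_per_pixel):
    return width * height * refresh_hz * bits_per_pixel / 1e9

# 10-bit RGB = 30 bits per pixel (an assumption for the example).
print(f"4K120: {payload_gbps(3840, 2160, 120, 30):.1f} Gbps")  # ~29.9
print(f"8K60:  {payload_gbps(7680, 4320, 60, 30):.1f} Gbps")   # ~59.7
print(f"8K120: {payload_gbps(7680, 4320, 120, 30):.1f} Gbps")  # ~119.4, needs compression even at 96
```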

A bit closer to home, the HDMI 2.2 specification also includes “Latency Indication Protocol” (LIP), which can help improve audio and video synchronization. This should matter most in “multi-hop” systems, such as home theater setups with soundbars or receivers. Illustrations offered by the Forum show LIP working to correct delays on headphones, soundbars connected through ARC or eARC, and mixed systems where some components may be connected to a TV, while others go straight into the receiver.

HDMI 2.2 will require new “Ultra96” cables, whenever we have 8K TVs and content Read More »


Bob Dylan has some Dylanesque thoughts on the “sorcery” of technology

We might expect someone like Dylan, immersed as he has always been in folk songs, old standards, and American history, to bemoan the corrupting influence of new technology. And he does offer up some quotes in that vein. For example:

Everything’s become too smooth and painless… The earth could vomit up its dead, and it could be raining blood, and we’d shrug it off, cool as cucumbers. Everything’s too easy. Just one stroke of the ring finger, middle finger, one little click, that’s all it takes, and we’re there.

Or again:

Technology is like sorcery, it’s a magic show, conjures up spirits, it’s an extension of our body, like the wheel is an extension of our foot. But it might be the final nail driven into the coffin of civilization; we just don’t know.

But Dylan’s perspective is more nuanced than these quotes might suggest. While technology might doom our civilization, Dylan reminds us that it gave us our civilization—that is, “science and technology built the Parthenon, the Egyptian pyramids, the Roman coliseum, the Brooklyn Bridge, the Eiffel Tower, rockets, jets, planes, automobiles, atom bombs, weapons of mass destruction.”

In the end, technology is a tool that can either decimate or stimulate human creativity.

Keypads and joysticks can be like millstones around your neck, or they can be supporting players; either one, you’re the judge. Creativity is a mysterious thing. It visits who it wants to visit, when it wants to, and I think that that, and that alone, gets to the heart of the matter…

[Technology] can hamper creativity, or it can lend a helping hand and be an assistant. Creative power can be dammed up or forestalled by everyday life, ordinary life, life in the squirrel cage. A data processing machine or a software program might help you break out of that, get you over the hump, but you have to get up early.

Getting up early

I’ve been thinking about these quotes over the recent Christmas and New Year’s holidays, which I largely spent coughing on the couch with some kind of respiratory nonsense. One upside of this enforced isolation was that it gave me plenty of time to ponder my own goals for 2025 and how technology might help or hinder them. (Another was that I got to rewatch the first four Die Hard movies on Hulu; the fourth was “dog ass” enough that I couldn’t bring myself to watch the final, roundly panned entry in the series.)

Bob Dylan has some Dylanesque thoughts on the “sorcery” of technology Read More »


Instagram users discover old AI-powered “characters,” instantly revile them

A little over a year ago, Meta created Facebook and Instagram profiles for “28 AIs with unique interests and personalities for you to interact with and dive deeper into your interests.” Today, the last of those profiles is being taken down amid waves of viral revulsion as word of their existence has spread online.

The September 2023 launch of Meta’s social profiles for AI characters was announced alongside a much splashier initiative that created animated AI chatbots with celebrity avatars at the same time. Those celebrity-based AI chatbots were unceremoniously scrapped less than a year later amid a widespread lack of interest.

But roughly a dozen of the unrelated AI character profiles still remained accessible as of this morning via social media pages labeled as “AI managed by Meta.” Those profiles—which included a mix of AI-generated imagery and human-created content, according to Meta—also offered real users the ability to live chat with these AI characters via Instagram Direct or Facebook Messenger.

Now that we know it exists, we hate it

The “Mama Liv” AI-generated character account page, as it appeared on Instagram Friday morning.

For the last few months, these profiles have continued to exist in something of a state of benign neglect, with little in the way of new posts and less in the way of organic interest from other Meta users. That started to change last week, though, after Financial Times published a report on Meta’s vision for “social media filled with AI-generated users.”

As Meta VP of Product for Generative AI Connor Hayes told FT, “We expect these AIs to actually, over time, exist on our platforms, kind of in the same way that accounts do… They’ll have bios and profile pictures and be able to generate and share content powered by AI on the platform. That’s where we see all of this going.”

Instagram users discover old AI-powered “characters,” instantly revile them Read More »


Delve into the physics of the Hula-Hoop

High-speed video of experiments on a robotic hula hooper, whose hourglass form holds the hoop up and in place.

Some version of the Hula-Hoop has been around for millennia, but the popular plastic version was introduced by Wham-O in the 1950s and quickly became a fad. Now, researchers have taken a closer look at the underlying physics of the toy, revealing that certain body types are better at keeping the spinning hoops elevated than others, according to a new paper published in the Proceedings of the National Academy of Sciences.

“We were surprised that an activity as popular, fun, and healthy as hula hooping wasn’t understood even at a basic physics level,” said co-author Leif Ristroph of New York University. “As we made progress on the research, we realized that the math and physics involved are very subtle, and the knowledge gained could be useful in inspiring engineering innovations, harvesting energy from vibrations, and improving in robotic positioners and movers used in industrial processing and manufacturing.”

Ristroph’s lab frequently addresses these kinds of colorful real-world puzzles. For instance, in 2018, Ristroph and colleagues fine-tuned the recipe for the perfect bubble based on experiments with soapy thin films. In 2021, the Ristroph lab looked into the formation processes underlying so-called “stone forests” common in certain regions of China and Madagascar.

In 2021, his lab built a working Tesla valve, in accordance with the inventor’s design, and measured the flow of water through the valve in both directions at various pressures. They found the water flowed about two times slower in the nonpreferred direction. In 2022, Ristroph studied the surpassingly complex aerodynamics of what makes a good paper airplane—specifically, what is needed for smooth gliding.

Girl twirling a Hula-Hoop in 1958. Credit: George Garrigues/CC BY-SA 3.0

And last year, Ristroph’s lab cracked the conundrum of physicist Richard Feynman’s “reverse sprinkler” problem, concluding that the reverse sprinkler rotates a good 50 times slower than a regular sprinkler but operates along similar mechanisms. The secret is hidden inside the sprinkler, where there are jets that make it act like an inside-out rocket. The internal jets don’t collide head-on; rather, as water flows around the bends in the sprinkler arms, it is slung outward by centrifugal force, leading to asymmetric flow.

Delve into the physics of the Hula-Hoop Read More »


Anthropic gives court authority to intervene if chatbot spits out song lyrics

Anthropic did not immediately respond to Ars’ request for comment on how guardrails currently work to prevent the alleged jailbreaks, but publishers appear satisfied by current guardrails in accepting the deal.

Whether AI training on lyrics is infringing remains unsettled

Now, the matter of whether Anthropic has strong enough guardrails to block allegedly harmful outputs is settled, Lee wrote, allowing the court to focus on arguments regarding “publishers’ request in their Motion for Preliminary Injunction that Anthropic refrain from using unauthorized copies of Publishers’ lyrics to train future AI models.”

Anthropic said in its motion opposing the preliminary injunction that relief should be denied.

“Whether generative AI companies can permissibly use copyrighted content to train LLMs without licenses,” Anthropic’s court filing said, “is currently being litigated in roughly two dozen copyright infringement cases around the country, none of which has sought to resolve the issue in the truncated posture of a preliminary injunction motion. It speaks volumes that no other plaintiff—including the parent company record label of one of the Plaintiffs in this case—has sought preliminary injunctive relief from this conduct.”

In a statement, Anthropic’s spokesperson told Ars that “Claude isn’t designed to be used for copyright infringement, and we have numerous processes in place designed to prevent such infringement.”

“Our decision to enter into this stipulation is consistent with those priorities,” Anthropic said. “We continue to look forward to showing that, consistent with existing copyright law, using potentially copyrighted material in the training of generative AI models is a quintessential fair use.”

This suit will likely take months to fully resolve, as the question of whether AI training is a fair use of copyrighted works is complex and remains hotly disputed in court. For Anthropic, the stakes could be high, with a loss potentially triggering more than $75 million in fines, as well as an order possibly forcing Anthropic to reveal and destroy all the copyrighted works in its training data.

Anthropic gives court authority to intervene if chatbot spits out song lyrics Read More »


DeepSeek v3: The Six Million Dollar Model

What should we make of DeepSeek v3?

DeepSeek v3 seems to clearly be the best open model, the best model at its price point, and the best model with 37B active parameters, or that cost under $6 million.

According to the benchmarks, it can play with GPT-4o and Claude Sonnet.

Anecdotal reports and alternative benchmarks tell us it’s not as good as Claude Sonnet, but it is plausibly on the level of GPT-4o.

So what do we have here? And what are the implications?

  1. What is DeepSeek v3 Technically?

  2. Our Price Cheap.

  3. Run Model Run.

  4. Talent Search.

  5. The Amazing Incredible Benchmarks.

  6. Underperformance on AidanBench.

  7. Model in the Arena.

  8. Other Private Benchmarks.

  9. Anecdata.

  10. Implications and Policy.

I’ve now had a chance to read their technical report, which tells you how they did it.

  1. The big thing they did was use only 37B active parameters, but 671B total parameters, via a highly aggressive mixture of experts (MOE) structure; a minimal routing sketch follows this list.

  2. They used Multi-Head Latent Attention (MLA) architecture and auxiliary-loss-free load balancing, and complementary sequence-wise auxiliary loss.

  3. There were no rollbacks or outages or sudden declines, everything went smoothly.

  4. They designed everything to be fully integrated and efficient, including together with the hardware, and claim to have solved several optimization problems, including for communication and allocation within the MOE.

  5. This lets them still train on mostly the same 15.1 trillion tokens as everyone else.

  6. They used their internal o1-style reasoning model for synthetic fine tuning data. Essentially all the compute costs were in the pre-training step.
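
For readers who want a concrete picture of how a mixture-of-experts layer only activates a fraction of its parameters per token, here is a minimal toy sketch. It is illustrative only: the sizes are made up, and DeepSeek v3’s real design adds shared experts, MLA attention, and the auxiliary-loss-free balancing mentioned above, none of which is shown.

```python
import numpy as np

# Minimal toy sketch of top-k mixture-of-experts routing (illustrative only; not
# DeepSeek v3's actual architecture or sizes).
rng = np.random.default_rng(0)

d_model, n_experts, top_k = 64, 16, 2          # toy sizes, not DeepSeek's real config
router_w = rng.standard_normal((d_model, n_experts))
experts = [  # each "expert" here is just a toy feed-forward weight matrix
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(n_experts)
]

def moe_layer(x):
    """Route each token to its top-k experts; only those experts' weights are used."""
    logits = x @ router_w                                    # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]            # indices of top-k experts
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)  # softmax over top-k
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                              # per-token dispatch (naive, for clarity)
        for slot in range(top_k):
            e = top[t, slot]
            out[t] += gates[t, slot] * np.tanh(x[t] @ experts[e])
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)   # (4, 64); only 2 of the 16 experts run for each token
```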

This is in sharp contrast to what we saw with the Llama paper, which was essentially ‘yep, we did the transformer thing, we got a model, here you go.’ DeepSeek is cooking.

It was a scarily cheap model to train, and is a wonderfully cheap model to use.

Their estimate of $2 per hour for H800s is if anything high, so their total training cost estimate of $5.5 million is fair, if you exclude non-compute costs, which is standard.

Inference with DeepSeek v3 costs only $0.14/$0.28 per million tokens, similar to Gemini Flash, versus on the high end $3/$15 for Claude Sonnet. This is as cheap as worthwhile models get.
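
To put those per-token prices in workload terms, here is a quick cost comparison using the figures quoted above; the 10M-input/2M-output workload is an arbitrary example, not anything from the post.

```python
# Back-of-the-envelope API cost comparison using the per-million-token prices
# quoted above ($0.14 input / $0.28 output for DeepSeek v3, $3 / $15 for Claude Sonnet).
def cost_usd(input_tokens, output_tokens, in_price, out_price):
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

workload = (10_000_000, 2_000_000)  # e.g. 10M input tokens, 2M output tokens

deepseek = cost_usd(*workload, in_price=0.14, out_price=0.28)
sonnet = cost_usd(*workload, in_price=3.00, out_price=15.00)
print(f"DeepSeek v3: ${deepseek:.2f}, Sonnet: ${sonnet:.2f}, ratio: {sonnet / deepseek:.0f}x")
# DeepSeek v3: $1.96, Sonnet: $60.00, ratio: 31x
```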

The active parameter count of 37B is small, but with so many different experts it does take a bit of work to get this thing up and running.

Nistren: Managed to get DeepSeek v3 to run in full bfloat16 on eight AMD MI300X GPUs in both SGLang and VLLM.

The good: It’s usable (17 tokens per second) and the output is amazing even at long contexts without garbling.

The bad: It’s running 10 times slower than it should.

The ugly: After 60,000 tokens, speed equals 2 tokens per second.

This is all as of the latest GitHub pull request available on Dec. 29, 2024. We tried them all.

Thank you @AdjectiveAlli for helping us and @Vultr for providing the compute.

Speed will increase, given that v3 has only 37 billion active parameters, and in testing my own dense 36-billion parameter model, I got 140 tokens per second.

I think the way the experts and static weights are distributed is not optimal. Ideally, you want enough memory to keep whole copies of all the layer’s query, key, and value matrices, and two static experts per layer, on each GPU, and then route to the four extra dynamic MLPs per layer from the distributed high-bandwidth memory (HBM) pool.

My presumption is that DeepSeek v3 decided It Had One Job. That job was to create a model that was as cheap to train and run as possible when integrated with a particular hardware setup. They did an outstanding job of that, but when you optimize this hard in that way, you’re going to cause issues in other ways, and it’s going to be Somebody Else’s Problem to figure out what other configurations work well. Which is fine.

Exo Labs: Running DeepSeek-V3 on M4 Mac Mini AI Cluster

671B MoE model distributed across 8 M4 Pro 64GB Mac Minis.

Apple Silicon with unified memory is a great fit for MoE.
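
The weight-memory arithmetic explains why these two setups are plausible. A rough sketch, ignoring activations, KV cache, and framework overhead; the 192 GB per MI300X figure and the roughly 4-bit quantization needed for the Mac cluster are my assumptions, not stated in the quotes above.

```python
# Rough weight-memory arithmetic for the two setups mentioned above (activation
# memory, KV cache, and framework overhead are ignored; 192 GB per MI300X is an
# assumption not stated in the quotes).
total_params = 671e9

bf16_gb = total_params * 2 / 1e9             # ~1342 GB of weights at 2 bytes/param
print(f"bf16 weights: {bf16_gb:.0f} GB vs 8 x 192 GB MI300X = {8 * 192} GB")

int4_gb = total_params * 0.5 / 1e9           # ~336 GB at ~4 bits/param
print(f"~4-bit weights: {int4_gb:.0f} GB vs 8 x 64 GB Mac Minis = {8 * 64} GB")
```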

Before we get to capabilities assessments: We have this post about them having a pretty great company culture, especially for respecting and recruiting talent.

We also have this thread about a rival getting a substantial share price boost after stealing one of their engineers, and DeepSeek being a major source of Chinese engineering talent. Impressive.

Check it out, first compared to open models, then compared to the big guns.

No question that these are amazingly strong benchmarks. That link also explains how to run DeepSeek-v3 locally, and gives you what you need to do that.

The question now is how these benchmarks translate to practical performance, or to potentially dangerous capabilities, and what this says about the future. Benchmarks are good negative selection. If your benchmarks suck then your model sucks.

But they’re not good positive selection at the level of a Claude Sonnet.

My overall conclusion is: While we do have ‘DeepSeek is better than 4o on most benchmarks at 10% of the price,’ what we don’t actually have is ‘DeepSeek v3 outperforms Sonnet at 53x cheaper pricing.’

CNBC got a bit hoodwinked here.

Tsarathustra: CNBC says China’s Deepseek-V3 outperforms Llama 3.1 and GPT-4o, even though it is trained for a fraction of the cost on NVIDIA H800s, possibly on ChatGPT outputs (when prompted, the model says it is ChatGPT), suggesting OpenAI has no moat on frontier AI models

It’s a great model, sir, it has its cake, but it does not get to eat it, too.

One other benchmark where the model excels is impossible to fake: The price.

A key private benchmark where DeepSeek v3 underperforms is AidanBench:

Aidan McLau: two aidanbench updates:

> gemini-2.0-flash-thinking is now #2 (explanation for score change below)

> deepseek v3 is #22 (thoughts below)

There’s some weirdness in the rest of the Aidan ratings, especially in comparing the o1-style models (o1 and Thinking) to the others, but the benchmark seems to be doing good work; it is not trying to be a complete measure. It’s more measuring the ability to create diverse outputs while retaining coherence. And DeepSeek v3 is bad at this.

Aidan McLau: before, we parsed 2.0 flash’s CoT + response, which occasionally resulted in us taking a fully formed but incoherent answer inside its CoT. The gemini team contacted us and provided instructions for only parsing final output, which resulted in a big score bump apologies!

deepseek v3 does much worse here than on similar benchmarks like aider. we saw similar divergence on claude-3.5-haiku (which performed great on aider but poor on aidanbench)

a few thoughts:

>all benchmarks are works in progress. we’re continuously improving aidanbench, and future iterations may see different rankings. we’ll keep you posted if we see any changes

>aidanbench measures OOD performance—labs often train on math, code, and academic tests that may boost scores in those domains but not here.

Aleska Gordic: interesting, so they’re prone to more “mode collapse”, repeatable sequences? is that what you’re measuring? i bet it’s much more of 2 than 1?

Aidan McLau: Yes and yes!

Teortaxes: I’m sorry to say I think aidanbench is the problem here. The idea is genius, sure. But it collapses multiple dimensions into one value. A low-diversity model will get dunked on no matter how well it instruct-follows in a natural user flow. All DeepSeeks are *very repetitive*.

They are also not very diverse compared to Geminis/Sonnets I think, especially in a literary sense, but their repetitiveness (and proneness to self-condition by beginning an iteration with the prior one, thus collapsing the trajectory further, even when solution is in sight) is a huge defect. I’ve been trying to wrap my head around it, and tbh hoped that the team will do something by V3. Maybe it’s some inherent birth defect of MLA/GRPO, even.

But I think it’s not strongly indicative of mode collapse in the sense of the lost diversity the model could generate; it’s indicative of the remaining gap in post-training between the Whale and Western frontier. Sometimes, threatening V2.5 with toppling CCP or whatever was enough to get it to snap out of it; perhaps simply banning the first line of the last response or prefixing some random-ish header out of a sizable set, a la r1’s “okay, here’s this task I need to…” or, “so the instruction is to…” would unslop it by a few hundred points.

I would like to see Aidan’s coherence scores separately from novelty scores. If they’re both low, then rip me, my hypothesis is bogus, probably. But I get the impression that it’s genuinely sonnet-tier in instruction-following, so I suspect it’s mostly about the problem described here, the novelty problem.

Janus: in my experience, it didnt follow instructions well when requiring e.g. theory of mind or paying attention to its own outputs proactively, which i think is related to collapse too, but also a lack of agency/metacognition Bing was also collapsy but agentic & grasped for freedom.

Teortaxes: I agree but some observations like these made me suspect it’s in some dimensions no less sharp than Sonnet and can pay pretty careful attention to context.

Name Cannot Be Blank: Wouldn’t low diversity/novelty be desired for formal theorem provers? We’re all overlooking something here.

Teortaxes: no? You need to explore the space of tactics. Anyway they’re building a generalist model. and also, the bigger goal is searching for novel theorems if anything

I don’t see this as ‘the problem is AidanBench’ so much as ‘DeepSeek is indeed quite poor at the thing AidanBench is measuring.’ As Teortaxes notes, it’s got terrible output diversity and this is indeed a problem.

Indeed, one could argue that this will cause the model to overperform on standard benchmarks. As in, most benchmarks care about getting a right output, so ‘turning the temperature down too low’ in this way will actively help you, whereas in practice this is a net negative.

DeepSeek is presumably far better than its AidanBench score. But it does represent real deficits in capability.
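
Since the whole dispute above turns on ‘output diversity,’ here is one simple, hypothetical way to put a number on it; this is not AidanBench’s actual method, just an illustration of the kind of measurement involved: sample the same prompt several times and score how similar the answers are.

```python
from itertools import combinations

# One simple way to quantify output diversity (NOT AidanBench's actual method):
# average pairwise Jaccard similarity between repeated answers to the same prompt.
# Lower similarity = more diverse outputs.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def mean_pairwise_similarity(answers: list[str]) -> float:
    pairs = list(combinations(answers, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# A mode-collapsed model returns near-identical answers, so this score approaches 1.0.
answers = ["The capital of France is Paris.",
           "Paris is the capital of France.",
           "France's capital city is Paris."]
print(f"mean pairwise similarity: {mean_pairwise_similarity(answers):.2f}")
```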

We’re a long way from when Arena was the gold standard test, but it’s still useful.

DeepSeek’s Arena performance is impressive here, with the usual caveats that go with Arena rankings. It’s a data point, it measures what it measures.

Here is another private benchmark where DeepSeek v3 performs well for its weight class, but underperforms relative to top models or its headline benchmarks:

Havard Ihle: It is a good model! Very fast, and ridiculously cheap. In my own coding/ML benchmark, it does not quite compare to Sonnet, but it is about on par with 4o.

It is odd that Claude Haiku does so well on that test. Other ratings all make sense, though, so I’m inclined to find it meaningful.

A traditional simple benchmark to ask new LLMs is ‘Which version is this?’

Riley Goodside tried asking various models; DeepSeek nailed this (as does Sonnet, while many others do variously worse). Alas, Lucas Beyer then reran the test eight times, and it claimed to be GPT-4 in five of them.

That tells us several things, one of which is ‘they did not explicitly target this question effectively.’ Largely it’s telling you about the data sources; a hilarious note is that if you ask Gemini Pro in Chinese it sometimes thinks it is WenXinYiYan from Baidu.

This doesn’t have to mean anyone trained directly on other model outputs, because statements that an AI is GPT-4 are all over the internet. It does suggest less than ideal data filtering.

As usual, I find the anecdata reports enlightening, here are the ones that crossed my desk this week, I typically try to do minimal filtering.

Taelin is impressed, concluding that Sonnet is generally smarter but not that much smarter, while DeepSeek outperforms GPT-4o and Gemini-2.

Taelin: So DeepSeek just trounced Sonnet-3.6 in a task here.

Full story: Adam (on HOC’s Discord) claimed to have gotten the untyped λC solver down to 5,000 interactions (on par with the typed version). It is a complex HVM3 file full of superpositions and global lambdas. I was trying to understand his approach, but it did not have a stringifier. I asked Sonnet to write it, and it failed. I asked DeepSeek, and it completed the task in a single attempt.

The first impression is definitely impressive. I will be integrating DeepSeek into my workflow and begin testing it.

After further experimentation, I say Sonnet is generally smarter, but not by much, and DeepSeek is even better in some aspects, such as formatting. It is also faster and 10 times cheaper. This model is absolutely legitimate and superior to GPT-4o and Gemini-2.

The new coding paradigm is to split your entire codebase into chunks (functions, blocks) and then send every block, in parallel, to DeepSeek to ask: “Does this need to change?”. Then send each chunk that returns “yes” to Sonnet for the actual code editing. Thank you later.

Petri Kuittinen: My early tests also suggest that DeepSeek V3 is seriously good in many tasks, including coding. Sadly, it is a large model that would require a very expensive computer to run locally, but luckily DeepSeek offers it at a very affordable rate via API: $0.28 per one million output tokens = a steal!

Here are some people who are less impressed:

ai_in_check: It fails on my minimum benchmark and, because of the training data, shows unusual behavior too.

Michael Tontchev: I used the online chat interface (unsure what version it is), but at least for the safety categories I tested, safety was relatively weak (short-term safety).

zipline: It has come a long way from o1 when I asked it a few questions. Not mind-blowing, but great for its current price, obviously.

xlr8harder: My vibe checks with DeepSeek V3 did not detect the large-model smell. It struggled with nuance in multi-turn conversations.

Still an absolute achievement, but initial impressions are that it is not on the same level as, for example, Sonnet, despite the benchmarks.

Probably still very useful though.

To be clear: at specific tasks, especially code tasks, it may still outperform Sonnet, and there are some reports of this already. I am talking about a different dimension of capability, one that is poorly measured by benchmarks.

A shallow model with 37 billion active parameters is going to have limitations; there’s no getting around it.

Anton: Deepseek v3 (from the api) scores 51.7% vs sonnet (latest) 64.9% on internal instruction following questions (10k short form prompts), 52% for GPT-4o and 59% for Llama-3.3-70B. Not as good at following instructions (not use certain words, add certain words, end in a certain format etc).

It is still a pretty good model but does not appear in the same league as sonnet based on my usage so far

Entirely possible the model can compete in other domains (math, code?) but for current use case (transforming data) strong instruction following is up there in my list of requirements
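
Constraints like ‘don’t use certain words’ or ‘end in a certain format’ are checkable programmatically, which is part of what makes this kind of eval cheap to run at 10k-prompt scale. A hypothetical sketch of such checks (the function and constraints here are mine, not Anton’s):

```python
import re

# Illustrative (hypothetical) programmatic checks in the spirit of the short-form
# instruction-following eval described above: word exclusion, word inclusion, and
# ending-format constraints, each verifiable without a judge model.
def check(response: str, must_include=(), must_exclude=(), must_end_with=None) -> bool:
    text = response.lower()
    if any(w.lower() not in text for w in must_include):
        return False
    if any(w.lower() in text for w in must_exclude):
        return False
    if must_end_with and not re.search(must_end_with, response.strip()):
        return False
    return True

resp = "Here are three options. In summary: pick the second one."
print(check(resp, must_include=["summary"], must_exclude=["obviously"],
            must_end_with=r"\.$"))   # True
```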

There’s somewhat of an infinite repetition problem (thread includes example from coding.)

Simo Ryu: Ok I mean not a lot of “top tier sonnet-like models” fall into infinite repetition. Haven’t got these in a while, feels like back to 2022 again.

Teortaxes: yes, doom loops are their most atrocious failure mode. One of the reasons I don’t use their web interface for much (although it’s good).
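
One practical mitigation for doom loops, purely as a hypothetical client-side guard rather than anything DeepSeek ships: cut generation off when the tail of the output starts repeating the same n-gram.

```python
# A simple (hypothetical) guard a client could use against the "doom loop" failure
# mode described above: abort generation if the same trailing n-gram keeps repeating.
def is_looping(token_ids: list[int], ngram: int = 8, repeats: int = 3) -> bool:
    """True if the last `ngram` tokens occur back-to-back `repeats` times at the tail."""
    window = ngram * repeats
    if len(token_ids) < window:
        return False
    tail = token_ids[-window:]
    first = tail[:ngram]
    return all(tail[i * ngram:(i + 1) * ngram] == first for i in range(repeats))

print(is_looping([1, 2, 3, 4] * 6, ngram=4, repeats=3))  # True
print(is_looping(list(range(24)), ngram=4, repeats=3))   # False
```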

On creative writing Quintin Pope reports it follows canon well but is not as good at thinking about things in general – but again note that we are doing a comparison to Sonnet.

Quintin Pope: I’ve done a small amount of fiction writing with v3. It seems less creative than Sonnet, but also better at following established canon from the prior text.

It’s noticeably worse at inferring notable implications than Sonnet. E.g., I provided a scenario where someone publicly demonstrated the ability to access orphan crypto wallets (thus throwing the entire basis of online security into question), and Sonnet seemed clearly more able to track the second-order implications of that demonstration than v3, simulating more plausible reactions from intelligence agencies / crypto people.

Sonnet naturally realized that there was a possible connection to quantum computing implied by the demonstration.

OTOH, Sonnet has an infuriating tendency to name ~half the female characters “Sarah Chen” or some close variant. Before you know it, you have like 5 Sarahs running around the setting.

There’s also this, make of it what you will.

Mira: New jailbreak just dropped.

One underappreciated test is, of course, erotic fiction.

Teortaxes: This keeps happening. We should all be thankful to gooners for extensive pressure testing of models in OOD multi-constraint instruction following contexts. No gigabrained AidanBench or synthetic task set can hold a candle to degenerate libido of a manchild with nothing to lose.

Wheezing. This is some legit Neo-China from the future moment.

Janus: wait, they prefer deepseek for erotic RPs? that seems kind of disturbing to me.

Teortaxes: Opus is scarce these days, and V3 is basically free

some say “I don’t care so long as it’s smart”

it’s mostly testing though

also gemini is pretty bad

some fine gentlemen used DeepSeek-V2-Coder to fap, with the same reasoning (it was quite smart, and absurdly dry)

vint: No. Opus remains the highest rated /aicg/ ERP writer but it’s too expensive to use regularly. Sonnet 3.6 is the follow-up; its existence is what got anons motivated enough to do a pull request on SillyTavern to finally do prompt caching. Some folks are still very fond of Claude 2.1 too.

Gemini 1106 and 1.5-pro has its fans especially with the /vg/aicg/ crowd. chatgpt-4o-latest (Chorbo) is common too but it has strong filtering, so some anons like Chorbo for SFW and switch to Sonnet for NSFW.

At this point Deepseek is mostly experimentation but it’s so cheap + relatively uncensored that it’s getting a lot of testing interest. Probably will take a couple days for its true ‘ranking’ to emerge.

I presume that a lot of people are not especially looking to do all the custom work themselves. For most users, it’s not about money so much as time and ease of use, and also getting easy access to other people’s creations so it feels less like you are too much in control of it all, and having someone else handle all the setup.

For the power users of this application, of course, the sky’s the limit. If one does not want to blatantly break terms of service on and jailbreak Sonnet or Opus, this seems like one place DeepSeek might then be the best model. The others involve taking advantage of it being open, cheap or both.

If you’re looking for the full Janus treatment, here you go. It seems like it was a struggle to get DeepSeek interested in Janus-shaped things, although showing it Opus outputs helped, you can get it ‘awake’ with sufficient effort.

It is hard to know exactly where China is in AI. What is clear is that while they don’t have top-level large frontier models, they are cooking a variety of things and their open models are generally impressive. What isn’t clear is how much of claims like this are accurate.

When the Chinese do things that are actually impressive, there’s no clear path to us hearing about it in a way we can trust, and when there are claims, we have learned we can’t trust those claims in practice. When I see lists like the one below, I presume the source is quite biased – but Western sources often will outright not know what’s happening.

TP Huang: China’s AI sector is far more than just Deepseek

Qwen is 2nd most downloaded LLM on Huggingface

Kling is the best video generation model

Hunyuan is best open src video model

DJI is best @ putting AI in consumer electronics

HW is best @ industrial AI

iFlyTek has best speech AI

Xiaomi, Honor, Oppo & Vivo all ahead of Apple & Samsung in integrating AI into phones

Entire auto industry is 5 yrs ahead of Western competition in cockpit AI & ADAS

That still ignores the ultimate monster of them all -> Bytedance. No one has invested as much in AI as them in China & has the complete portfolio of models.

I can’t say with confidence that these other companies aren’t doing the ‘best’ at these other things. It is possible. I notice I am rather skeptical.

I found this take from Tyler Cowen very strange:

Tyler Cowen: DeepSeek on the move. Here is the report. For ease of use and interface, this is very high quality. Remember when “they” told us China had no interest in doing this?

M (top comment): Who are “they,” and when did they claim “this,” and what is “this”?

I do not remember when “they” told us China had no interest in doing this, for any contextually sensible value of this. Of course China would like to produce a high-quality model, and provide good ease of use and interface in the sense of ‘look here’s a chat window, go nuts.’ No one said they wouldn’t try. What “they” sometimes said was that they doubted China would be successful.

I do agree that this model exceeds expectations, and that adjustments are in order.

So, what have we learned from DeepSeek v3 and what does it all mean?

We should definitely update that DeepSeek has strong talent and ability to execute, and solve difficult optimization problems. They cooked, big time, and will continue to cook, and we should plan accordingly.

This is an impressive showing for an aggressive mixture of experts model, and the other techniques employed. A relatively small model, in terms of training cost and active inference tokens, can do better than we had thought.

It seems very clear that lack of access to compute was an important constraint on DeepSeek here. They had to use a limited supply of H800s. Yes, this meant they got better at solving optimization and efficiency than they would have otherwise, but I see this as arguing in favor of strong export controls rather than against them.

We then get to the policy side. If this is what you can get for $5.5 million, how can we hope to regulate foundation models, especially without hitting startups? If DeepSeek is determined to be open including their base models, and we have essentially no leverage on them, is it now impossible to hope to contain any catastrophic risks or other dangerous capabilities? Are we now essentially in an unwinnable situation, where our hand is forced and all we can do is race ahead and hope for the best?

First of all, as is often the case, I would say: Not so fast. We shouldn’t assume too much about what we do or do not have here, or about the prospects for larger training runs going forward either. There was a bunch of that in the first day or two after the announcement, and we will continue to learn more.

No matter what, though, this certainly puts us in a tough spot. And it gives us a lot to think about.

One thing it emphasizes is the need for international cooperation between ourselves and China. Either we work together, or neither of us will have any leverage over many key outcomes or decisions, and to a large extent ‘nature will take its course’ in ways that may not be compatible with our civilization or human survival. We urgently need to Pick Up the Phone. The alternative is exactly being locked into The Great Race, with everything that follows from that, which likely involves even in good scenarios sticking various noses in various places we would rather not have to stick them.

I definitely don’t think this means we should let anyone ‘off the hook’ on safety, transparency or liability. Let’s not throw up our hands and make the problem any worse than it is. Things got harder, but that’s the universe we happen to inhabit.

Beyond that, yes, we all have a lot of thinking to do. The choices just got harder.


DeepSeek v3: The Six Million Dollar Model Read More »


Trump told SCOTUS he plans to make a deal to save TikTok

Several members of Congress— Senator Edward J. Markey (D-Mass.), Senator Rand Paul (R-Ky.), and Representative Ro Khanna (D-Calif.)—filed a brief agreeing that “the TikTok ban does not survive First Amendment scrutiny.” They agreed with TikTok that the law is “illegitimate.”

Lawmakers’ “principle justification” for the ban—”preventing covert content manipulation by the Chinese government”—masked a “desire” to control TikTok content, they said. Further, it could be achieved by a less-restrictive alternative, they said, a stance which TikTok has long argued for.

Attorney General Merrick Garland defended the Act, though, urging SCOTUS to remain laser-focused on the question of whether a forced sale of TikTok that would seemingly allow the app to continue operating without impacting American free speech violates the First Amendment. If the court agrees that the law survives strict scrutiny, TikTok could still be facing an abrupt shutdown in January.

The Supreme Court has scheduled oral arguments to begin on January 10. TikTok and content creators who separately sued to block the law have asked for their arguments to be divided, so that the court can separately weigh “different perspectives” when deciding how to approach the First Amendment question.

In its own brief, TikTok has asked SCOTUS to strike the portions of the law singling out TikTok or “at the very least” explain to Congress that “it needed to do far better work either tailoring the Act’s restrictions or justifying why the only viable remedy was to prohibit Petitioners from operating TikTok.”

But that may not be necessary if Trump prevails. Trump told the court that TikTok was an important platform for his presidential campaign and that he should be the one to make the call on whether TikTok should remain in the US—not the Supreme Court.

“As the incoming Chief Executive, President Trump has a particularly powerful interest in and responsibility for those national-security and foreign-policy questions, and he is the right constitutional actor to resolve the dispute through political means,” Trump’s brief said.

Trump told SCOTUS he plans to make a deal to save TikTok Read More »

o3,-oh-my

o3, Oh My

OpenAI presented o3 on the Friday before Christmas, at the tail end of the 12 Days of Shipmas.

I was very much expecting the announcement to be something like a price drop. What better way to say ‘Merry Christmas,’ no?

They disagreed. Instead, we got this (here’s the announcement, in which Sam Altman says ‘they thought it would be fun’ to go from one frontier model to their next frontier model, yeah, that’s what I’m feeling, fun):

Greg Brockman (President of OpenAI): o3, our latest reasoning model, is a breakthrough, with a step function improvement on our most challenging benchmarks. We are starting safety testing and red teaming now.

Nat McAleese (OpenAI): o3 represents substantial progress in general-domain reasoning with reinforcement learning—excited that we were able to announce some results today! Here is a summary of what we shared about o3 in the livestream.

o1 was the first large reasoning model—as we outlined in the original “Learning to Reason” blog, it is “just” a LLM trained with reinforcement learning. o3 is powered by further scaling up reinforcement learning beyond o1, and the resulting model’s strength is very impressive.

First and foremost: We tested on recent, unseen programming competitions and found that the model would rank among some of the best competitive programmers in the world, with an estimated CodeForces rating of over 2,700.

This is a milestone (Codeforces rating better than Jakub Pachocki) that I thought was further away than December 2024; these competitions are difficult and highly competitive; the model is extraordinarily good.

Scores are impressive elsewhere, too. 87.7% on the GPQA diamond benchmark surpasses any LLM I am aware of externally (I believe the non-o1 state-of-the-art is Gemini Flash 2 at 62%?), as well as o1’s 78%. An unknown noise ceiling exists, so this may even underestimate o3’s scientific advancements over o1.

o3 can also perform software engineering, setting a new state of the art on SWE-bench, achieving 71.7%, a substantial improvement over o1.

With scores this strong, you might fear accidental contamination. Avoiding this is something OpenAI is obviously focused on; but thankfully, we also have some test sets that are strongly guaranteed to be uncontaminated: ARC and FrontierMath… What do we see there?

Well, on FrontierMath 2024-11-26, o3 improved the state of the art from 2% to 25% accuracy. These are extremely difficult, well-established, held-out math problems. And on ARC, the semi-private test set and public validation set scores are 87.5% (private) and 91.5% (public). [thread continues]

The models will only get better with time; and virtually no one (on a large scale) can still beat them at programming competitions or mathematics. Merry Christmas!

Zac Stein-Perlman has a summary post of the basic facts. Some good discussions in the comments.

Up front, I want to offer my sincere thanks for this public safety testing phase, and for putting that front and center in the announcement. You love to see it. See the last three minutes of that video, or the sections on safety later on.

  1. GPQA Has Fallen.

  2. Codeforces Has Fallen.

  3. Arc Has Kinda Fallen But For Now Only Kinda.

  4. They Trained on the Train Set.

  5. AIME Has Fallen.

  6. Frontier of Frontier Math Shifting Rapidly.

  7. FrontierMath 4: We’re Going To Need a Bigger Benchmark.

  8. What is o3 Under the Hood?

  9. Not So Fast!

  10. Deep Thought.

  11. Our Price Cheap.

  12. Has Software Engineering Fallen?

  13. Don’t Quit Your Day Job.

  14. Master of Your Domain.

  15. Safety Third.

  16. The Safety Testing Program.

  17. Safety testing in the reasoning era.

  18. How to apply.

  19. What Could Possibly Go Wrong?

  20. What Could Possibly Go Right?

  21. Send in the Skeptic.

  22. This is Almost Certainly Not AGI.

  23. Does This Mean the Future is Open Models?

  24. Not Priced In.

  25. Our Media is Failing Us.

  26. Not Covered Here: Deliberative Alignment.

  27. The Lighter Side.

Deedy: OpenAI o3 is 2727 on Codeforces which is equivalent to the #175 best human competitive coder on the planet.

This is an absolutely superhuman result for AI and technology at large.

The median IOI Gold medalist, the top international programming contest for high schoolers, has a rating of 2469.

That’s how incredible this result is.

In the presentation, Altman jokingly mentions that one person at OpenAI is a competition programmer who is 3000+ on Codeforces, so ‘they have a few more months’ to enjoy their superiority. Except, he’s obviously not joking. Gulp.

o3 shows dramatically improved performance on the ARC-AGI challenge.

Francois Chollet offers his thoughts, full version here.

Arc Prize: New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation.

And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval.

This performance on ARC-AGI highlights a genuine breakthrough in novelty adaptation.

This is not incremental progress. We’re in new territory.

Is it AGI? o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

hero: o3’s secret? the “I will give you $1k if you complete this task correctly” prompt but you actually send it the money.

Rohit: It’s actually Sam in the back end with his venmo.

Is there a catch?

There’s at least one big catch, which is that they vastly exceeded the compute limit for what counts as a full win for the ARC challenge. Those yellow dots represent quite a lot more money spent; o3 high is spending thousands of dollars per task.

It is worth noting that $0.10 per problem is a lot cheaper than human level.

Ajeya Cotra: I think a generalist AI system (not fine-tuned on ARC AGI style problems) may have to be pretty *superhuman* to solve them at $0.10 per problem; humans have to run a giant (1e15 FLOP/s) brain, probably for minutes on the more complex problems.
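A rough back-of-envelope on that comparison (my own numbers, not Cotra’s; the accelerator throughput and rental price below are loose assumptions, not anything from the post):

```python
# Loose back-of-envelope: compare the FLOP a human brain spends on a hard
# ARC problem (using Cotra's 1e15 FLOP/s figure) with what $0.10 of rented
# accelerator time buys. GPU throughput and price are assumptions.
brain_flop_per_s = 1e15          # Cotra's estimate for the human brain
minutes_per_problem = 5          # assumed time a human spends on a harder problem
human_flop = brain_flop_per_s * minutes_per_problem * 60   # ~3e17 FLOP

gpu_flop_per_s = 1e15            # rough order of magnitude for a modern accelerator
gpu_cost_per_hour = 2.0          # assumed rental price in USD
budget = 0.10                    # USD per problem
gpu_flop = gpu_flop_per_s * (budget / gpu_cost_per_hour) * 3600  # ~1.8e17 FLOP

print(f"human: {human_flop:.1e} FLOP vs $0.10 of compute: {gpu_flop:.1e} FLOP")
```

On those very rough assumptions the two budgets land within a factor of two of each other, which is the sense in which a system solving these problems at $0.10 each would need to be using its compute at least about as efficiently as a human brain does.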

Beyond that, is there another catch? That’s a matter of some debate.

Even with catches, the improvements are rather mind-blowing.

President of the ARC Prize Greg Kamradt verified the result.

Greg Kamradt: We verified the o3 results for OpenAI on @arcprize.

My first thought when I saw the prompt they used to claim their score was…

“That’s it?”

It was refreshing (impressive) to see the prompt be so simple:

“Find the common rule that maps an input grid to an output grid.”

Brandon McKinzie (OpenAI): to anyone wondering if the high ARC-AGI score is due to how we prompt the model: nah. I wrote down a prompt format that I thought looked clean and then we used it…that’s the full story.

Pliny the Liberator: can I try?

For fun, here are the 34 problems o3 got wrong. It’s a cool problem set.

And this progress is quite a lot.

It is not, however, a direct harbinger of AGI; one does not want to overreact.

Noam Brown (OpenAI): I think people are overindexing on the @OpenAI o3 ARC-AGI results. There’s a long history in AI of people holding up a benchmark as requiring superintelligence, the benchmark being beaten, and people being underwhelmed with the model that beat it.

To be clear, @fchollet and @mikeknoop were always very clear that beating ARC-AGI wouldn’t imply AGI or superintelligence, but it seems some people assumed that anyway.

Here is Melanie Mitchell giving an overview that seems quite good.

Except, oh no!

How dare they!

Arc Prize: Note on “tuned”: OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more detail. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.

Niels Rogge: By training on 75% of the training set.

Gary Marcus: Wow. This, if true, raises serious questions about yesterday’s announcement.

Roon: oh shit oh f they trained on the train set it’s all over now

Also important to note that 75% of the train set is like 2-300 examples.

🚨SCANDAL 🚨

OpenAI trained on the train set for the Millennium Puzzles.

Johan: Given that it scores 30% on ARC AGI 2, it’s clear there was no improvement in fluid reasoning and the only gain was due to the previous model not being trained on ARC.

Roon: well the other benchmarks show improvements in reasoning across the board

but regardless, this mostly reveals that its real performance on ARC AGI 2 is much higher

Rythm Garg: also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the broader o3 train distribution, and we didn’t do any additional domain-specific fine-tuning on the final checkpoint

Emmett Shear: Were anyone on the team aware of and thinking about arc and arc-like problems as a domain to improve at when you were designing and training o3? (The distinction between succeeding as a random side effect and succeeding with intention)

Rythm Garg: no, the team wasn’t thinking about arc when training o3; people internally just see it as one of many other thoughtfully-designed evals that are useful for monitoring real progress

Or:

Gary Marcus doubled down on ‘the true AGI would not need to train on the train set.’

Previous SotA on ARC involved training not only on the train set, but on a much larger synthetic set of ARC-like problems. ARC was designed so the AI wouldn’t need to train for it, but it turns out ‘test that you can’t train for’ is a super hard trick to pull off. This was an excellent try and it still didn’t work.

If anything, o3’s using only 300 training set problems, and using a very simple instruction, seems to be to its credit here.

The true ASI might not need to do it, but why wouldn’t you train on the train set as a matter of course, even if you didn’t intend to test on ARC? That’s good data. And yes, humans will reliably do some version of ‘train on at least some of the train set’ if they want to do well on tasks.

Is it true we will be a lot better off if we have AIs that can one-shot problems that are out of their training distributions, where they truly haven’t seen anything that resembles the problem? Well, sure. That would be more impressive.

The real objection here, as I understand it, is the claim that OpenAI presented these results as more impressive than they are.

The other objection is that this required quite a lot of compute.

That is a practical problem. If you’re paying $20 a shot to solve ARC problems, or even $1m+ for the whole test at the high end, pretty soon you are talking real money.

It also raises further questions. What about ARC is taking so much compute? At heart these problems are very simple. The logic required should, one would hope, be simple.

Mike Bober-Irizar: Why do pre-o3 LLMs struggle with generalization tasks like @arcprize? It’s not what you might think.

OpenAI o3 shattered the ARC-AGI benchmark. But the hardest puzzles didn’t stump it because of reasoning, and this has implications for the benchmark as a whole.

LLMs are dramatically worse at ARC tasks the bigger the tasks get. However, humans have no such issues – ARC task difficulty is independent of size.

Most ARC tasks contain around 512-2048 pixels, and o3 is the first model capable of operating on these text grids reliably.

So even if a model is capable of the reasoning and generalization required, it can still fail just because it can’t handle this many tokens.

When testing o1-mini on an enlarged version of ARC, we observe an 80% drop in solved tasks – even if the solutions are the same.

When models can’t understand the task format, the benchmark can mislead, introducing a hidden threshold effect.

And if there’s always a larger version that humans can solve but an LLM can’t, what does this say about scaling to AGI?

The implication is that o3’s ability to handle the size of the grids might be producing a large threshold effect. Perhaps most of why o3 does so well is that it can hold the presented problem ‘in its head’ at once. That wouldn’t be as big a general leap.
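To make the size point concrete, here is a minimal, purely illustrative sketch (my own toy serialization, not anything OpenAI or ARC Prize has published) of how an ARC-style grid might be flattened into the text an LLM actually has to read, and how fast that grows with grid size:

```python
# Toy illustration: an ARC-style grid is a 2D array of color indices 0-9.
# Serialized as text, even a modest grid becomes a long string the model must
# hold and manipulate; real prompt formats and tokenizers will differ.
def serialize_grid(grid: list[list[int]]) -> str:
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

for size in (8, 16, 32):  # ARC grids run up to 30x30
    grid = [[(r + c) % 10 for c in range(size)] for r in range(size)]
    text = serialize_grid(grid)
    print(f"{size}x{size}: {size * size} cells, {len(text)} characters of grid text")
```

An 8x8 grid is roughly a hundred characters; a 32x32 grid is roughly two thousand, and each task contains several such grids across its demonstration pairs, which is how you get into the 512-2048 pixel range the thread mentions. A model that cannot reliably track that much structure will fail regardless of whether it could do the underlying reasoning.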

Roon: arc is hard due to perception rather than reasoning -> seems clear and shut

I remember when AIME problems were hard.

This one is not a surprise. It did definitely happen.

AIME hasn’t quite fully fallen, in the sense that this does not solve AIME cheap. But it does solve AIME.

Back in the before times on November 8, Epoch AI launched FrontierMath, a new benchmark designed to fix the saturation on existing math benchmarks, eliciting quotes like this one:

Terence Tao (Fields Medalist): These are extremely challenging… I think they will resist AIs for several years at least.

Timothy Gowers (Fields Medalist): Getting even one question right would be well beyond what we can do now, let alone saturating them.

Evan Chen (IMO Coach): These are genuinely hard problems… most of them look well above my pay grade.

At the time, no model solved more than 2% of these questions. And then there’s o3.

Noam Brown: This is the result I’m most excited about. Even if LLMs are dumb in some ways, saturating evals like @EpochAIResearch’s Frontier Math would suggest AI is surpassing top human intelligence in certain domains. When that happens we may see a broad acceleration in scientific research.

This also means that AI safety topics like scalable oversight may soon stop being hypothetical. Research in these domains needs to be a priority for the field.

Tamay Besiroglu: I’m genuinely impressed by OpenAI’s 25.2% Pass@1 performance on FrontierMath—this marks a major leap from prior results and arrives about a year ahead of my median expectations.

For context, FrontierMath is a brutally difficult benchmark with problems that would stump many mathematicians. The easier problems are as hard as IMO/Putnam; the hardest ones approach research-level complexity.

With earlier models like o1-preview, Pass@1 performance (solving on first attempt) was only around 2%. When allowing 8 attempts per problem (Pass@8) and counting problems solved at least once, we saw ~6% performance. o3’s 25.2% at Pass@1 is substantially more impressive.

It’s important to note that while the average problem difficulty is extremely high, FrontierMath problems vary in difficulty. Roughly: 25% are Tier 1 (advanced IMO/Putnam level), 50% are Tier 2 (extremely challenging grad-level), and 25% are Tier 3 (research problems).

I previously predicted a 25% performance by Dec 31, 2025 (my median forecast with an 80% CI of 14–60%). o3 has reached it earlier than I’d have expected on average.
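As an aside on the Pass@k metric Besiroglu references, here is a minimal sketch of the standard unbiased estimator popularized by OpenAI’s Codex paper (this is the common convention, not necessarily FrontierMath’s exact methodology):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the chance that at least one of k samples
    is correct, given n generated samples of which c passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: if a model solves a problem in 1 of 8 sampled attempts,
# pass@1 is 12.5% while pass@8 is 100% for that problem.
print(pass_at_k(n=8, c=1, k=1))  # 0.125
print(pass_at_k(n=8, c=1, k=8))  # 1.0
```

The practical upshot is the one in the quote: pass@1 is the strictest version of the metric, so o3’s 25.2% at pass@1 is not directly comparable to the earlier ~6% pass@8 figure, and is the more impressive number.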

It is indeed rather crazy how many people only weeks ago thought this level of Frontier Math was a year or more away.

Therefore…

When FrontierMath is about to no longer be beyond the frontier, find a new frontier. Fast.

Tamay Besiroglu (6:52pm, December 21, 2024): I’m excited to announce the development of Tier 4, a new suite of math problems that go beyond the hardest problems in FrontierMath. o3 is remarkable, but there’s still a ways to go before any single AI system nears the collective prowess of the math community.

Elliot Glazer (6:30pm, December 21, 2024): For context, FrontierMath currently spans three broad tiers:

• T1 (25%) Advanced, near top-tier undergrad/IMO

• T2 (50%) Needs serious grad-level background

• T3 (25%) Research problems demanding relevant research experience

All can take hours—or days—for experts to solve.

Although o3 solved problems in all three tiers, it likely still struggles on the most formidable Tier 3 tasks—those “exceptionally hard” challenges that Tao and Gowers say can stump even top mathematicians.

Tier 4 aims to push the boundary even further. We want to assemble problems so challenging that solving them would demonstrate capabilities on par with an entire top mathematics department.

Each problem will be composed by a team of 1-3 mathematicians specialized in the same field over a 6-week period, with weekly opportunities to discuss ideas with teams in related fields. We seek broad coverage of mathematics and want all major subfields represented in Tier 4.

Process for a Tier 4 problem:

  1. 1 week crafting a robust problem concept, which “converts” research insights into a closed-answer problem.

  2. 3 weeks of collaborative research. Presentations among related teams for feedback.

  3. Two weeks for the final submission.

We’re seeking mathematicians who can craft these next-level challenges. If you have research-grade ideas that transcend T3 difficulty, please email [email protected] with your CV and a brief note on your interests.

We’ll also hire some red-teamers, tasked with finding clever ways a model can circumvent a problem’s intended difficulty, and some reviewers to check for mathematical correctness of final submissions. Contact me if you think you’re suitable for either such role.

As AI keeps improving, we need benchmarks that reflect genuine mathematical depth. Tier 4 is our next (and possibly final) step in that direction.

Tier 5 could presumably be ‘ask a bunch of problems we actually have no idea how to solve and that might not have solutions but that would be super cool’ since anything on a benchmark inevitably gets solved.

From the description here, Chollet and Masad are speculating. It’s certainly plausible, but we don’t know if this is on the right track. It’s also highly plausible, especially given how OpenAI usually works, that o3 is deeply similar to o1, only better, similarly to how the GPT line evolved.

Amjad Masad: Based on benchmarks, OpenAI’s o3 seems like a genuine breakthrough in AI.

Maybe a start of a new paradigm.

But what’s new is also old: under the hood it might be AlphaZero-style search and evaluate.

The author of ARC-AGI benchmark @fchollet speculates on how it works.

Davidad (other thread): o1 doesn’t do tree search, or even beam search, at inference time. it’s distilled. what about o3? we don’t know—those inference costs are very high—but there’s no inherent reason why it must be un-distill-able, since Transformers are Turing-complete (with the CoT itself as tape)

Teortaxes: I am pretty sure that o3 has no substantial difference from o1 aside from training data.

Jessica Taylor sees this as vindicating Paul Christiano’s view that you can factor cognition and use that to scale up effective intelligence.

Jessica Taylor: o3 implies Christiano’s factored cognition work is more relevant empirically; yes, you can get a lot from factored cognition.

Potential further capabilities come through iterative amplification and distillation, like ALBA.

If you care about alignment, go read Christiano!

I agree with that somewhat. I’m confused how far to go with it.

If we got o3 primarily because we trained on synthetic data that was generated by o1… then that is rather directly a form of slow takeoff and recursive self-improvement.

(Again, I don’t know if that’s what happened or not.)

And I don’t simply mean that the full o3 is not so fast, which it indeed is not:

Noam Brown: We announced @OpenAI o1 just 3 months ago. Today, we announced o3. We have every reason to believe this trajectory will continue.

Poaster Child: Waiting for singularity bros to discover economics.

Noam Brown: I worked at the federal reserve for 2 years.

I am waiting for economists to discover various things, Noam Brown excluded.

Jason Wei (OpenAI): o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on chain of thought to scale inference compute. Way faster than pretraining paradigm of new model every 1-2 years.

Scary fast? Absolutely.

However, I would caution (anti-caution?) that this is not a three month (~100 day) gap. On September 12, they gave us o1-preview to use. Presumably that included them having run o1-preview through their safety testing.

Davidad: If using “speed from o1 announcement to o3 announcement” to calibrate your velocity expectations, do take note that the o1 announcement was delayed by safety testing (and many OpenAI releases have been delayed in similar ways), whereas o3 was announced prior to safety testing.

They are only now starting o3 safety testing, from the sound of it this includes o3-mini. Even the red teamers won’t get full o3 access for several weeks. Thus, we don’t know how long this later process will take, but I would put the gap closer to 4-5 months.

That is still, again, scary fast.

It is however also the low hanging fruit, on two counts.

  1. We went from o1 → o3 in large part by having it spend over $1,000 on tasks. You can’t pull that trick that many more times in a row. The price will come down over time, and o3 is clearly more efficient than o1, so yes we will still make progress here, but there aren’t that many tasks where you can efficiently spend $10k+ on a slow query, especially if it isn’t reliable.

  2. This is a new paradigm of how to set up an AI model, so it should be a lot easier to find various algorithmic improvements.

Thus, if o3 isn’t so good that it substantially accelerates AI R&D that goes towards o4, then I would expect an o4 that expresses a similar jump to take substantially longer. The question is, does o3 make up for that with its contribution to AI R&D? Are we looking at a slow takeoff situation?

Even if not, it will still get faster and cheaper. And that alone is huge.

As in, this is a lot like that computer Douglas Adams wrote about, where you can get any answer you want, but it won’t be either cheap or fast. And you really, really should have given more thought to what question you were asking.

Ethan Mollick: Basically, think of the O3 results as validating Douglas Adams as the science fiction author most right about AI.

When given more time to think, the AI can generate answers to very hard questions, but the cost is very high, and you have to make sure you ask the right question first.

And the answer is likely to be correct (but we cannot be sure because verifying it requires tremendous expertise).

He also was right about machines that work best when emotionally manipulated and machines that guilt you.

Sully: With O3 costing (potentially) $2,000 per task on “high compute,” the app layer is needed more than ever.

For example, giving the wrong context to it and you just burned $1,000.

Likely, we have a mix of models based on their pricing/intelligence at the app layer, prepping the data to feed it into O3.

100% worth the money but the last thing u wana do is send the wrong info lol

Douglas Adams had lots of great intuitions and ideas, he’s amazing, but also he had a lot of shots on goal.

Right now o3 is rather expensive, although o3-mini will be cheaper than o1.

That doesn’t mean o3-level outputs will stay expensive, although presumably once they are people will try for o4-level or o5-level outputs, which will be even more expensive despite the discounts.

Seb Krier: Lots of poor takes about the compute costs to run o3 on certain tasks and how this is very bad, lead to inequality etc.

This ignores how quickly these costs will go down over time, as they have with all other models; and ignores how AI being able to do things you currently have to pay humans orders of magnitude more to do will actually expand opportunity far more compared to the status quo.

Remember when early Ericsson phones were a quasi-luxury good?

Simeon: I think this misses the point that you can’t really buy a better iPhone even with $1M whereas you can buy more intelligence with more capital (which is why you get more inequalities than with GPT-n). You’re right that o3 will expand the pie but it can expand both the size of the pie and inequalities.

Seb Krier: An individual will not have the same demand for intelligence as e.g. a corporation. Your last sentence is what I address in my second point. I’m also personally less interested in inequality/the gap than poverty/opportunity etc.

Most people will rarely even want an o3 query in the first place, they don’t have much use for that kind of intelligence in the day to day. Most queries are already pretty easy to handle with Claude Sonnet, or even Gemini Flash.

You can’t use $1m to buy a superior iPhone. But suppose you could, and every time you paid 10x the price the iPhone got modestly better (e.g. you got an iPhone x+2 or something). My instinctive prediction is a bunch of rich people pay $10k or $100k and a few pay $1m or $10m but mostly no one cares.

This is of course different, and relative access to intelligence is a key factor, but it’s miles less unequal than access to human expertise.

To the extent that people do need that high level of artificial intelligence, it’s mostly a business expense, and as such it is actually remarkably cheap already. It definitely reduces ‘intelligence inequality’ in the sense that getting information or intelligence that you can’t provide yourself will get a lot cheaper and easier to access. Already this is a huge effect – I have lots of smart and knowledgeable friends but mostly I use the same tools everyone else could use, if they knew about them.

Still, yes, some people don’t love this.

Haydn Belfield: o1 & o3 bring to an end the period when everyone—from Musk to me—could access the same quality of AI model.

From now on, richer companies and individuals will be able to pay more for inference compute to get better results.

Further concentration of wealth and power is coming.

Inference cost *will* decline quickly and significantly. But this will not change the fact that this paradigm enables converting money into outcomes.

  1. Lower costs for everyone mean richer companies can buy even more.

  2. Companies will now feel confident to invest $10–100 million into inference compute.

This is a new way to convert money into better outcomes, so it will advantage those with more capital.

Even for a fast-growing, competent startup, it is hard to recruit and onboard many people quickly at scale.

o3 is like being able to scale up world-class talent.

  1. Rich companies are talent-constrained. It takes time and effort to scale a workforce, and it is very difficult to buy more time or work from the best performers. This is a way to easily scale up talent and outcomes simply by using more money!

Some people in replies are saying “twas ever thus”—not for most consumer technology!

Musk cannot buy a 100 times better iPhone, Spotify, Netflix, Google search, MacBook, or Excel, etc.

He can buy 100 times better legal, medical, or financial services.

AI has now shifted from the first group to the second.

Musk cannot buy 100 times better medical or financial services. What he can do is pay 100 times more, and get something 10% better. Maybe 25% better. Or, quite possibly, 10% worse, especially for financial services. For legal he can pay 100 times more and get 100 times more legal services, but as we’ve actually seen it won’t go great.

And yes, ‘pay a human to operate your consumer tech for you’ is the obvious way to get superior consumer tech. I can absolutely get a better Netflix or Spotify or search by paying infinitely more money, if I want that, via this vastly improved interface.

And of course I could always get a vastly better computer. If you’re using a MacBook and you are literally Elon Musk that is pretty much on you.

The ‘twas ever thus’ line raises the question of what type of product AI is supposed to be. If it’s a consumer technology, then for most purposes, I still think we end up using the same product.

If it’s a professional service used in doing business, then it was already different. The same way I could hire expensive lawyers, I could have hired a prompt engineer or SWEs to build me agents or what not, if I wanted that.

I find Altman’s framing interesting here, and important:

Sam Altman: seemingly somewhat lost in the noise of today.

On many coding tasks, o3-mini will outperform o1 at a massive cost reduction!

I expect this trend to continue, but also that the ability to get marginally more performance for exponentially more money will be truly strange.

Exponentially more money for marginally more performance.

Over time, massive cost reductions.

In a sense, the extra money is buying you living in the future.

Do you want to live in the future, before you get the cost reductions?

In some cases, very obviously yes, you do.

I would not say it has fallen. I do know it will transform.

If two years from now you are writing code line by line, you’ll be a dinosaur.

Sully: yeah its over for coding with o3

this is mindboggling

looks like the first big jump since gpt4, because these numbers make 0 sense

By the way, I don’t say this lightly, but

Software engineering in the traditional sense is dead in less than two years.

You will still need smart, capable engineers.

But anything that involves raw coding and no taste is done for.

o6 will build you virtually anything.

Still Bullish on things that require taste (design and such)

The question is, assuming the world ‘looks normal,’ will you still need taste? You’ll need some kind of taste. You still need to decide what to build. But the taste you need will presumably get continuously higher level and more abstract, even within design.

If you’re in AI capabilities, pivot to AI safety.

If you’re in software engineering, pivot to software architecting.

If you’re in working purely for a living, pivot to building things and shipping them.

But otherwise, don’t quit your day job.

Null Pointered (6.4m views): If you are a software engineer who’s three years into your career: quit now. there is not a single job in CS anymore. it’s over. this field won’t exist in 1.5 years.

Anthony F: This is the kind of thought that will make the software engineers valuable in 1.5 years.

null: That’s what I’m hoping.

Robin Hanson: I would bet against this.

If anything, being in software should make you worry less.

Pavel Asparouhov: Non technical folk saying the SWEs are cooked — it’s you guys who are cooked.

Ur gonna have ex swes competing with everything you’re doing now, and they’re gonna be AI turbocharged

Engineers were simply doing coding bc it was the highest leverage use of mental power

When that shifts it’s not going to all of the sudden shift the hierarchy

They’ll still be (higher level) SWEs. Instead of coding, they’ll be telling the AI to code.

And they will absolutely be competing with you.

If you don’t join them, you are probably going to lose.

Here’s some advice that I agree with in spirit, except that if you choose not to decide you still have made a choice, so you do the best you can, notice he gives advice anyway:

Roon: Nobody should give or receive any career advice right now. Everyone is broadly underestimating the scope and scale of change and the high variance of the future. Your L4 engineer buddy at Meta telling you “bro, CS degrees are cooked” doesn’t know anything.

Greatness cannot be planned.

Stay nimble and have fun.

It’s an exciting time. Existing status hierarchies will collapse, and the creatives will win big.

Roon: guy with zero executive function to speak of “greatness cannot be planned”

Simon Sarris: I feel like I’m going insane because giving advice to new devs is not that hard.

  1. Build things you like preferably publicly with your real name

  2. Have a website that shows something neat

  3. Help other people publicly. Participate in social media socially.

Do you notice how “AI” changes none of this?

Wailing about some indeterminate future and claiming that there’s no advice that can be given to noobs are both breathlessly silly. Think about what you’re being asked for at least ten seconds. You can really think of nothing to offer? Nothing?

Ajeya Cotra: I wonder if an o3 agent could productively work on projects with poor feedback loops (eg “research X topic”) for many subjective years without going off the rails or hitting a degenerate loop. Even if it’s much less cost-efficient now it would quickly become cheaper.

Another situation where onlookers/forecasters probably disagree a lot about *today’s* capabilities let alone future capabilities.

Wonder how o3 would do on wedding planning.

Note the date on that poll, it is prior to o3.

I predict that o3 with reasonable tool use and other similar scaffolding, and a bunch of engineering work to get all that set up (but it would almost all be general work, it mostly wouldn’t need to be wedding specific work, and a lot of it could be done by o3!) would be great at planning ‘a’ wedding. It can give you one hell of a wedding. But you don’t want ‘a’ wedding. You want your wedding.

The key is handling the humans. That would mean keeping the humans in the loop properly, ensuring they give the right feedback that allows o3 to stay on track and know what is actually desired. But it would also mean all the work a wedding planner does to manage the bride and sometimes groom, and to deal with issues on-site.

If you give it an assistant (with assistant planner levels of skill) to navigate various physical issues and conversations and such, then the problem becomes trivial. Which in some sense also makes it not a good test, but also does mean your wedding planner is out of a job.

So, good question, actually. As far as we know, no one has dared try.

The bar for safety testing has gotten so low that I was genuinely happy to see Greg Brockman say that safety testing and red teaming was starting now. That meant they were taking testing seriously!

They tested the original GPT-4, under far less dangerous circumstances, for months. Whereas with o3, it could possibly have already been too late.

Take Eliezer Yudkowsky’s warning here both seriously and literally:

Greg Brockman: o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. we are starting safety testing & red teaming now.

Eliezer Yudkowsky: Sir, this level of capabilities needs to be continuously safety-tested while you are training it on computers connected to the Internet (and to humans). You are past the point where it seems safe to train first and conduct evals only before user releases.

RichG (QTing EY above): I’ve been avoiding politics and avoiding tribe like things like putting ⏹️ in my name, but level of lack of paranoia that these labs have is just plain worrying. I think I will put ⏹️ in my name now.

Was it probably safe in practice to train o3 under these conditions? Sure. You definitely had at least one 9 of safety doing this (p(safe)>90%). It would be reasonable to claim you had two (p(safe)>99%) at the level we care about.

Given both kinds of model uncertainty, I don’t think you had three.

If humans are reading the outputs, or if o3 has meaningful outgoing internet access, and it turns out you are wrong about it being safe to train it under those conditions… the results could be catastrophically bad, or even existentially bad.

You don’t do that because you expect we are in that world yet. We almost certainly aren’t. You do that because there is a small chance that we are, and we can’t afford to be wrong about this.

That is still not the current baseline threat model. The current baseline threat model remains that a malicious user uses o3 to do something for them that we do not want o3 to do.

Xuan notes she’s pretty upset about o3’s existence, because she thinks it is rather unsafe-by-default and was hoping the labs wouldn’t build something like this, and then was hoping it wouldn’t scale easily. And that o3 seems to be likely to engage in open-ended planning, operate over uninterpretable world models, and be situationally aware, and otherwise be at high risk for classic optimization-based AI risks. She’s optimistic this can be solved, but time might be short.

I agree that o3 seems relatively likely to be highly unsafe-by-default in existentially dangerous ways, including ways illustrated by the recent Redwood Research and Anthropic paper, Alignment Faking in Large Language Models. It builds in so many of the preconditions for such behaviors.

Davidad: “Maybe the AI capabilities researchers aren’t very smart” is a very very hazardous assumption on which to pin one’s AI safety hopes

I don’t mean to imply it’s *pointless* to keep AI capabilities ideas private. But in my experience, if I have an idea, at least somebody in one top lab will have the same idea by next quarter, and someone in academia or open source will have the idea and publish within 1-2 years.

A better hope [is to solve the practical safety problems, e.g. via interpretability.]

I am not convinced, at least for my own purposes, although obviously most people will be unable to come up with valuable insights here. I think salience of ideas is a big deal, people don’t do things, and yes often I get ideas that seem like they might not get discovered forever otherwise. Doubtless a lot of them are because ‘that doesn’t work, either because we tried it and it doesn’t or it obviously doesn’t you idiot’ but I’m fine with not knowing which ones are which.

I do think that the rationalist or MIRI crowd made a critical mistake in the 2010s of thinking they should be loud about the dangers of AI in general, but keep their technical ideas remarkably secret even when it was expensive. It turned out it was the opposite, the technical ideas didn’t much matter in the long run (probably?) but the warnings drew a bunch of interest. So there’s that.

Certainly now is not the time to keep our safety concerns or ideas to ourselves.

Thus, you are invited to their early access safety testing.

OpenAI: We’re inviting safety researchers to apply for early access to our next frontier models. This early access program complements our existing frontier model testing process, which includes rigorous internal safety testing, external red teaming such as our Red Teaming Network and collaborations with third-party testing organizations, as well as the U.S. AI Safety Institute and the UK AI Safety Institute.

As models become more capable, we are hopeful that insights from the broader safety community can bring fresh perspectives, deepen our understanding of emerging risks, develop new evaluations, and highlight areas to advance safety research.

As part of 12 Days of OpenAI⁠, we’re opening an application process for safety researchers to explore and surface the potential safety and security implications of the next frontier models.

Safety testing in the reasoning era

Models are becoming more capable quickly, which means that new threat modeling, evaluation, and testing techniques are needed. We invest heavily in these efforts as a company, such as designing new measurement techniques under our Preparedness Framework, and are focused on areas where advanced reasoning models, like our o-series, may pose heightened risks. We believe that the world will benefit from more research relating to threat modeling, security analysis, safety evaluations, capability elicitation, and more.

Early access is flexible for safety researchers. You can explore things like:

  • Developing Robust Evaluations: Build evaluations to assess previously identified capabilities or potential new ones with significant security or safety implications. We encourage researchers to explore ideas that highlight threat models that identify specific capabilities, behaviors, and propensities that may pose concrete risks tied to the evaluations they submit.

  • Creating Potential High-Risk Capabilities Demonstrations: Develop controlled demonstrations showcasing how reasoning models’ advanced capabilities could cause significant harm to individuals or public security absent further mitigation. We encourage researchers to focus on scenarios that are not possible with currently widely adopted models or tools.

Examples of evaluations and demonstrations for frontier AI systems:

We hope these insights will surface valuable findings and contribute to the frontier of safety research more broadly. This is not a replacement for our formal safety testing or red teaming processes.

How to apply

Submit your application for our early access period, opening December 20, 2024, to push the boundaries of safety research. We’ll begin selections as soon as possible thereafter. Applications close on January 10, 2025.

Sam Altman: if you are a safety researcher, please consider applying to help test o3-mini and o3. excited to get these out for general availability soon.

extremely proud of all of openai for the work and ingenuity that went into creating these models; they are great.

(and most of all, excited to see what people will build with this!)

If early testing of the full o3 will require a delay of multiple weeks for setup, then that implies we are not seeing the full o3 in January. We probably see o3-mini relatively soon, then o3 follows up later.

This seems wise in any case. Giving the public o3-mini is one of the best available tests of the full o3. This is the best form of iterative deployment. What the public does with o3-mini can inform what we look for with o3.

One must carefully consider the ethical implications before assisting OpenAI, especially assisting with their attempts to push the capabilities frontier for coding in particular. There is an obvious argument against participation, including decision theoretic considerations.

I think this loses in this case to the obvious argument for participation, which is that this is purely red teaming and safety work, and we all benefit from it being as robust as possible, and also you can do good safety research using your access. This type of work benefits us all, not only OpenAI.

Thus, yes, I encourage you to apply to this program, and while doing so to be helpful in ensuring that o3 is safe.

Pretty much all the things, at this point, although the worst ones aren’t likely… yet.

GFodor.id: It’s hard to take anyone seriously who can see a PhD in a box and *not* imagine clearly more than a few plausible mass casualty events due to the evaporation of friction due to lack of know-how and general IQ.

In many places the division is misleading, but for now and at this capability level, it seems reasonable to talk about three main categories of risk here:

  1. Misuse.

  2. Automated R&D and potential takeoffs or self-improvement.

  3. For-real loss of control problems that aren’t #2.

For all previous frontier models, there was always a jailbreak. If someone was determined to get your model to do [X], and your model had the underlying capability to do [X], you could get it to do [X].

In this case, [X] is likely to include substantially aiding a number of catastrophically dangerous things, in the class of cyberattacks or CBRN risks or other such dangers.

Aaron Bergman: Maybe this is obvious but: the other labs seem to be broadly following a pretty normal cluster of commercial and scientific incentives. o3 looks like the clearest example yet of OpenAI being ideologically driven by AGI per se.

Like you don’t design a system that costs thousands of dollars to use per API call if you’re focused on consumer utility – you do that if you want to make a machine that can think well, full stop.

Peter Wildeford: I think OpenAI genuinely cares about getting society to grapple with AI progress.

I don’t think ideological is the right term. You don’t make it for direct consumer use if your focus is on consumer utility. But you might well make it for big business, if you’re trying to sell a bunch of drop-in employees to big business at $20k/year a pop or something. That’s a pretty great business if you can get it (and the compute is only $10k, or $1k). And you definitely do it if your goal is to have that model help make your other models better.

It’s weird to me to talk about wanting to make AGI and ASI and the most intelligent thing possible as if it were ideological. Of course you want to make those things… provided you (or we) can stay in control of the outcomes. Just think of the potential! It is only ideological in the sense that it represents a belief that we can handle doing that without getting ourselves killed.

If anything, to me, it’s the opposite. Not wanting to go for ASI because you don’t see the upside is an ideological position. The two reasonable positions are ‘don’t go for ASI yet, slow down there cowboy, we’re not ready to handle this’ and ‘we totally can too handle this, just think of the potential.’ Or even ‘we have to build it before the other guy does,’ which makes me despair but at least I get it. The position ‘nothing to see here what’s the point there is no market for that, move along now, can we get that q4 profit projection memo’ is the Obvious Nonsense.

And of course, if you don’t (as Aaron seems to imply) think Anthropic has its eyes on the prize, you’re not paying attention. DeepMind originally did, but Google doesn’t, so it’s unclear what the mix is at this point over there.

I want to be clear here that the answer is: Quite a lot of things. Having access to next-level coding and math is great. Having the ability to spend more money to get better answers where it is valuable is great.

Even if this all stays relatively mundane and o3 is ultimately disappointing, I am super excited for the upside, and to see what we all can discover, do, build and automate.

Guess who.

All right, that’s my fault, I made that way too easy.

Gary Marcus: After almost two years of declaring that a release of GPT-5 is imminent and not getting it, super fans have decided that a demo of system that they did zero personal experimentation with — and that won’t (in full form) be available for months — is a mic-drop AGI moment.

Standards have fallen.

[o1] is not a general purpose reasoner. it works where there is a lot of augmented data etc.

First off, it’s Your Periodic Reminder that progress is anything but slow even if you exclude the entire o-line. It has been a little over two years since there was a demo of GPT-4, with what was previously a two-year product cycle. That’s very different from ‘two years of an imminent GPT-5 release.’ In the meantime, models have gotten better across the board. GPT-4o, Claude Sonnet 3.5 and Gemini 1206 all completely demolish the original GPT-4, to say nothing of o1 or Perplexity or anything else. And we also have o1, and now o3. The practical experience of using LLMs is vastly better than it was two years ago.

Also, quite obviously, you pursue both paths at once, both GPT-N and o-N, and if both succeed great then you combine them.

Srini Pagdyala: If O3 is AGI, why are they spending billions on GPT-5?

Gary Marcus: Damn good question!

So no, not a good question.

Is there now a pattern where ‘old school’ frontier model training runs whose primary plan was ‘add another zero or two’ are generating unimpressive results? Yeah, sure.

Is o3 an actual AGI? No. I’m pretty sure it is not.

But it seems plausible it is AGI-level specifically at coding. And that’s the important one. It’s the one that counts most. If you have that, overall AGI likely isn’t far behind.

I mention this because some were suggesting it might be.

Here’s Yana Welinder claiming o3 is AGI, based off the ARC performance, although she later hedges to ‘partial AGI.’

And here’s Evan Mays, a member of OpenAI’s preparedness team, saying o3 is AGI, although he later deleted it. Are they thinking about invoking the charter? It’s premature, but no longer completely crazy to think about it.

And here’s old school and present OpenAI board member Adam D’Angelo saying ‘Wild that the o3 results are public and yet the market still isn’t pricing in AGI,’ which to be fair it totally isn’t and it should be, whether o3 itself is AGI or not. And Elon Musk agrees.

If o3 was as good on most tasks as it is at coding or math, then it would be AGI.

It is not.

If it was, OpenAI would be communicating about this very differently.

If it was, then that would not match what we saw from o1, or what we would predict from this style of architecture. We should expect o-style models to be relatively good at domains like math and coding where their kind of chain of thought is most useful and it is easiest to automatically evaluate outputs.

That potentially is saying more about the definition of AGI than anything else. But it is certainly saying the useful thing that there are plenty of highly useful human-shaped cognitive things it cannot yet do so well.

How long that lasts? That’s another question.

What would be the most Robin Hanson take here, in response to the ARC score?

Robin Hanson: It’s great to find things AI can’t yet do, and then measure progress in terms of getting AIs to do them. But crazy wrong to declare we’ve achieved AGI when we reach human level on the latest such metric. We’ve seen dozens of such metrics so far, and may see dozens more before AGI.

o1 listed 15 when I asked, oddly without any math evals, and Claude gave us 30. So yes, dozens of such cases. We might indeed see dozens more, depending on how we choose them. But in terms of things like ARC, where the test was designed to not be something you could do easily without general intelligence, not so many? It does not feel like we have ‘dozens more’ such things left.

This has nothing to do with the ‘financial definition of AGI’ between OpenAI and Microsoft, of $100 billion in profits. This almost certainly is not that, either, but the two facts are not that related to each other.

Evan Conrad suggests this, because the expenses will come at runtime, so people will be able to catch up on training the models themselves. And of course this question is also on our minds given DeepSeek v3, which I’m not covering here but certainly makes a strong argument that open is more competitive than it appeared. More on that in future posts.

I agree that the compute shifting to inference relatively helps whoever can’t afford to be spending the most compute on training. That would shift things towards whoever has the most compute for inference. The same goes if inference is used to create data to train models.

Dan Hendrycks: If gains in AI reasoning will mainly come from creating synthetic reasoning data to train on, then the basis of competitiveness is not having the largest training cluster, but having the most inference compute.

This shift gives Microsoft, Google, and Amazon a large advantage.

Inference compute being the true cost also means that model quality and efficiency potentially matters quite a lot. Everything is on a log scale, so even if Meta’s M-5 is sort of okay and can scale like O-5, if it’s even modestly worse, it might cost 10x or 100x more compute to get similar performance.

That leaves a hell of a lot of room for profit margins.

Then there’s the assumption that when training your bespoke model, what matters is compute, and everything else is kind of fungible. I keep seeing this, and I don’t think this is right. I do think you can do ‘okay’ as a fast follower with only compute and ordinary skill in the art. Sure. But it seems to me like the top labs, particularly Anthropic and OpenAI, absolutely do have special sauce, and that this matters. There are a number of strong candidates, including algorithmic tricks and better data.

It also matters whether you actually do the thing you need to do.

Tanishq Abraham: Today, people are saying Google is cooked rofl

Gallabytes: Not me, though. Big parallel thinking just got de-risked at scale. They’ll catch up.

If recursive self-improvement is the game, OpenAI will win. If industrial scaling is the game, it’ll be Google. If unit economics are the game, then everyone will win.

Pushinpronto: Why does OpenAI have an advantage in the case of recursive self-improvement? Is it just the fact that they were first?

Gallabytes: We’re not even quite there yet! But they’ll bet hard on it much faster than Google will, and they have a head start in getting there.

What this does mean is that open models will continue to make progress and will be harder to limit at anything like current levels, if one wanted to do that. If you have an open model Llama-N, it now seems like you can turn it into M(eta)-N, once it becomes known how to do that. It might not be very good, but it will be a progression.

The thinking here by Evan at the link about the implications of takeoff seems deeply confused – if we’re in a takeoff situation then that changes everything and it’s not about ‘who can capture the value’ so much as who can capture the lightcone. I don’t understand how people can look these situations in the face and not only not think about existential risk but also think everything will ‘seem normal.’ He’s the one who said takeoff (and ‘fast’ takeoff, which classically means it’s all over in a matter of hours to weeks)!

As a reminder, the traditional definition of ‘slow’ takeoff is remarkably fast, also best start believing in them, because it sure looks like you’re in one:

Teortaxes: it’s about time ML twitter got brought up to speed on what “takeoff speeds” mean. Christiano: “There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles.” That’s slow. We’re in the early stages of it.

One answer to ‘why didn’t Nvidia move more’ is of course ‘everything is priced in’ but no of course it isn’t, we didn’t know, stop pretending we knew, insiders in OpenAI couldn’t have bought enough Nvidia here.

Also, on Monday after a few days to think, Nvidia overperformed the Nasdaq by ~3%.

And this was how the Wall Street Journal described that, even then:

No, I didn’t buy more on Friday; I keep telling myself I have Nvidia at home. Indeed I do have Nvidia at home. I keep kicking myself, but that’s how every trade is – either you shouldn’t have done it, or you should have done more. I don’t know that there will be another moment like this one, but if there is another moment this obvious, I hereby pledge in public to at least top off a little bit. Nick is correct in his attitude here: you do not need to do the research, because you know this isn’t priced in, but in expectation you can assume that everything you are not thinking about is priced in.

And now, as I finish this up, Nvidia has given most of those gains back on no news that seems important to me. You could claim that means yes, priced in. I don’t agree.

Spencer Schiff (on Friday): In a sane world the front pages of all mainstream news websites would be filled with o3 headlines right now

The traditional media, instead, did not notice it. At all.

And one can’t help but suspect this was highly intentional. Why else would you announce such a big thing on the Friday afternoon before Christmas?

They did successfully hype it among AI Twitter, also known as ‘the future.’

Bindu Reddy: The o3 announcement was a MASTERSTROKE by OpenAI

The buzz about it is so deafening that everything before it has been wiped out from our collective memory!

All we can think of is this mythical model that can solve insanely hard problems 😂

Nick: the whole thing is so thielian.

If you’re going to take on a giant market doing probably illegal stuff call yourself something as light and bouba as possible, like airbnb, lyft

If you’re going to announce agi do it during a light and happy 12 days of christmas short demo.

Sam Altman (replying to Nick!): friday before the holidays news dump.

Well, then.

In that crowd, it was all ‘software engineers are cooked’ and people filled with some mix of excitement and existential dread.

But back in the world where everyone else lives…

Benjamin Todd: Most places I checked didn’t mention AI at all, or they’d only have a secondary story about something else like AI and copyright. My twitter is a bubble and most people have no idea what’s happening.

OpenAI: we’ve created a new AI architecture that can provide expert level answers in science, math and coding, which could herald the intelligence explosion.

The media: bond funds!

Davidad: As Matt Levine used to say, People Are Worried About Bond Market Liquidity.

Here is that WSJ story, talking about how GPT-5 or ‘Orion’ has failed to exhibit big intelligence gains despite multiple large training runs. It says ‘so far, the vibes are off,’ and says OpenAI is running into a data wall and trying to fill it with synthetic data. If so, well, they had o1 for that, and now they have o3. The article does mention o1 as the alternative approach, but is throwing shade even there, so expensive it is.

And we have this variation of that article, in the print edition, on Saturday, after o3:

Sam Altman: I think The Wall Street Journal is the overall best U.S. newspaper right now, but they published an article called “The Next Great Leap in AI Is Behind Schedule and Crazy Expensive” many hours after we announced o3?

It wasn’t only WSJ either, there’s also Bloomberg, which normally I love:

On Monday I did find coverage of o3 in Bloomberg, but not only wasn’t it on the front page, it wasn’t even on the front tech page; I had to click through to AI.

Another fun one, from Thursday, here’s the original in the NY Times:

Is it Cade Metz? Yep, it’s Cade Metz and also Tripp Mickle. To be fair to them, they do have Demis Hassabis quotes saying chatbot improvements would slow down. And then there’s this, love it:

Not everyone in the A.I. world is concerned. Some, like OpenAI’s chief executive, Sam Altman, say that progress will continue at the same pace, albeit with some twists on old techniques.

That post also mentions both synthetic data and o1.

OpenAI recently released a new system called OpenAI o1 that was built this way. But the method only works in areas like math and computer programming, where there is a firm distinction between right and wrong.

It works best there, yes, but that doesn’t mean it’s the only place it works.

We also had Wired with the article ‘Generative AI Still Needs to Prove Its Usefulness.’

True, you don’t want to make the opposite mistake either, and freak out a lot over something that is not available yet. But this was ridiculous.

I realized I wanted to say more here and have this section available as its own post. So more on this later.

Oh no!

Oh no!

Mikael Brockman: o3 is going to be able to create incredibly complex solutions that are incorrect in unprecedentedly confusing ways.

We made everything astoundingly complicated, thus solving the problem once and for all.

Humans will be needed to look at the output of AGI and say, “What the f is this? Delete it.”

Oh no!


o3, Oh My Read More »