Author name: Kris Guyer

bob-dylan-has-some-dylanesque-thoughts-on-the-“sorcery”-of-technology

Bob Dylan has some Dylanesque thoughts on the “sorcery” of technology

We might expect someone like Dylan, immersed as he has always been in folk songs, old standards, and American history, to bemoan the corrupting influence of new technology. And he does offer up some quotes in that vein. For example:

Everything’s become too smooth and painless… The earth could vomit up its dead, and it could be raining blood, and we’d shrug it off, cool as cucumbers. Everything’s too easy. Just one stroke of the ring finger, middle finger, one little click, that’s all it takes, and we’re there.

Or again:

Technology is like sorcery, it’s a magic show, conjures up spirits, it’s an extension of our body, like the wheel is an extension of our foot. But it might be the final nail driven into the coffin of civilization; we just don’t know.

But Dylan’s perspective is more nuanced than these quotes might suggest. While technology might doom our civilization, Dylan reminds us that it gave us our civilization—that is, “science and technology built the Parthenon, the Egyptian pyramids, the Roman coliseum, the Brooklyn Bridge, the Eiffel Tower, rockets, jets, planes, automobiles, atom bombs, weapons of mass destruction.”

In the end, technology is a tool that can either decimate or stimulate human creativity.

Keypads and joysticks can be like millstones around your neck, or they can be supporting players; either one, you’re the judge. Creativity is a mysterious thing. It visits who it wants to visit, when it wants to, and I think that that, and that alone, gets to the heart of the matter…

[Technology] can hamper creativity, or it can lend a helping hand and be an assistant. Creative power can be dammed up or forestalled by everyday life, ordinary life, life in the squirrel cage. A data processing machine or a software program might help you break out of that, get you over the hump, but you have to get up early.

Getting up early

I’ve been thinking about these quotes over the recent Christmas and New Year’s holidays, which I largely spent coughing on the couch with some kind of respiratory nonsense. One upside of this enforced isolation was that it gave me plenty of time to ponder my own goals for 2025 and how technology might help or hinder them. (Another was that I got to rewatch the first four Die Hard movies on Hulu; the fourth was “dog ass” enough that I couldn’t bring myself to watch the final, roundly panned entry in the series.)

Bob Dylan has some Dylanesque thoughts on the “sorcery” of technology Read More »

instagram-users-discover-old-ai-powered-“characters,”-instantly-revile-them

Instagram users discover old AI-powered “characters,” instantly revile them

A little over a year ago, Meta created Facebook and Instagram profiles for “28 AIs with unique interests and personalities for you to interact with and dive deeper into your interests.” Today, the last of those profiles is being taken down amid waves of viral revulsion as word of their existence has spread online.

The September 2023 launch of Meta’s social profiles for AI characters was announced alongside a much splashier initiative that created animated AI chatbots with celebrity avatars at the same time. Those celebrity-based AI chatbots were unceremoniously scrapped less than a year later amid a widespread lack of interest.

But roughly a dozen of the unrelated AI character profiles still remained accessible as of this morning via social media pages labeled as “AI managed by Meta.” Those profiles—which included a mix of AI-generated imagery and human-created content, according to Meta—also offered real users the ability to live chat with these AI characters via Instagram Direct or Facebook Messenger.

Now that we know it exists, we hate it

The “Mama Liv” AI-generated character account page, as it appeared on Instagram Friday morning.

For the last few months, these profiles have continued to exist in something of a state of benign neglect, with little in the way of new posts and less in the way of organic interest from other Meta users. That started to change last week, though, after Financial Times published a report on Meta’s vision for “social media filled with AI-generated users.”

As Meta VP of Product for Generative AI Connor Hayes told FT, “We expect these AIs to actually, over time, exist on our platforms, kind of in the same way that accounts do… They’ll have bios and profile pictures and be able to generate and share content powered by AI on the platform. That’s where we see all of this going.”

Instagram users discover old AI-powered “characters,” instantly revile them Read More »

delve-into-the-physics-of-the-hula-hoop

Delve into the physics of the Hula-Hoop

High-speed video of experiments on a robotic hula hooper, whose hourglass form holds the hoop up and in place.

Some version of the Hula-Hoop has been around for millennia, but the popular plastic version was introduced by Wham-O in the 1950s and quickly became a fad. Now, researchers have taken a closer look at the underlying physics of the toy, revealing that certain body types are better at keeping the spinning hoops elevated than others, according to a new paper published in the Proceedings of the National Academy of Sciences.

“We were surprised that an activity as popular, fun, and healthy as hula hooping wasn’t understood even at a basic physics level,” said co-author Leif Ristroph of New York University. “As we made progress on the research, we realized that the math and physics involved are very subtle, and the knowledge gained could be useful in inspiring engineering innovations, harvesting energy from vibrations, and improving in robotic positioners and movers used in industrial processing and manufacturing.”

Ristroph’s lab frequently addresses these kinds of colorful real-world puzzles. For instance, in 2018, Ristroph and colleagues fine-tuned the recipe for the perfect bubble based on experiments with soapy thin films. In 2021, the Ristroph lab looked into the formation processes underlying so-called “stone forests” common in certain regions of China and Madagascar.

In 2021, his lab built a working Tesla valve, in accordance with the inventor’s design, and measured the flow of water through the valve in both directions at various pressures. They found the water flowed about two times slower in the nonpreferred direction. In 2022, Ristroph studied the surpassingly complex aerodynamics of what makes a good paper airplane—specifically, what is needed for smooth gliding.

Girl twirling a Hula hoop, 1958

Girl twirling a Hula-Hoop in 1958 Credit: George Garrigues/CC BY-SA 3.0

And last year, Ristroph’s lab cracked the conundrum of physicist Richard Feynman’s “reverse sprinkler” problem, concluding that the reverse sprinkler rotates a good 50 times slower than a regular sprinkler but operates along similar mechanisms. The secret is hidden inside the sprinkler, where there are jets that make it act like an inside-out rocket. The internal jets don’t collide head-on; rather, as water flows around the bends in the sprinkler arms, it is slung outward by centrifugal force, leading to asymmetric flow.

Delve into the physics of the Hula-Hoop Read More »

anthropic-gives-court-authority-to-intervene-if-chatbot-spits-out-song-lyrics

Anthropic gives court authority to intervene if chatbot spits out song lyrics

Anthropic did not immediately respond to Ars’ request for comment on how guardrails currently work to prevent the alleged jailbreaks, but publishers appear satisfied by current guardrails in accepting the deal.

Whether AI training on lyrics is infringing remains unsettled

Now, the matter of whether Anthropic has strong enough guardrails to block allegedly harmful outputs is settled, Lee wrote, allowing the court to focus on arguments regarding “publishers’ request in their Motion for Preliminary Injunction that Anthropic refrain from using unauthorized copies of Publishers’ lyrics to train future AI models.”

Anthropic said in its motion opposing the preliminary injunction that relief should be denied.

“Whether generative AI companies can permissibly use copyrighted content to train LLMs without licenses,” Anthropic’s court filing said, “is currently being litigated in roughly two dozen copyright infringement cases around the country, none of which has sought to resolve the issue in the truncated posture of a preliminary injunction motion. It speaks volumes that no other plaintiff—including the parent company record label of one of the Plaintiffs in this case—has sought preliminary injunctive relief from this conduct.”

In a statement, Anthropic’s spokesperson told Ars that “Claude isn’t designed to be used for copyright infringement, and we have numerous processes in place designed to prevent such infringement.”

“Our decision to enter into this stipulation is consistent with those priorities,” Anthropic said. “We continue to look forward to showing that, consistent with existing copyright law, using potentially copyrighted material in the training of generative AI models is a quintessential fair use.”

This suit will likely take months to fully resolve, as the question of whether AI training is a fair use of copyrighted works is complex and remains hotly disputed in court. For Anthropic, the stakes could be high, with a loss potentially triggering more than $75 million in fines, as well as an order possibly forcing Anthropic to reveal and destroy all the copyrighted works in its training data.

Anthropic gives court authority to intervene if chatbot spits out song lyrics Read More »

deepseek-v3:-the-six-million-dollar-model

DeepSeek v3: The Six Million Dollar Model

What should we make of DeepSeek v3?

DeepSeek v3 seems to clearly be the best open model, the best model at its price point, and the best model with 37B active parameters, or that cost under $6 million.

According to the benchmarks, it can play with GPT-4o and Claude Sonnet.

Anecdotal reports and alternative benchmarks tells us it’s not as good as Claude Sonnet, but it is plausibly on the level of GPT-4o.

So what do we have here? And what are the implications?

  1. What is DeepSeek v3 Techncially?.

  2. Our Price Cheap.

  3. Run Model Run.

  4. Talent Search.

  5. The Amazing Incredible Benchmarks.

  6. Underperformance on AidanBench.

  7. Model in the Arena.

  8. Other Private Benchmarks.

  9. Anecdata.

  10. Implications and Policy.

I’ve now had a chance to read their technical report, which tells you how they did it.

  1. The big thing they did was use only 37B active tokens, but 671B total parameters, via a highly aggressive mixture of experts (MOE) structure.

  2. They used Multi-Head Latent Attention (MLA) architecture and auxiliary-loss-free load balancing, and complementary sequence-wise auxiliary loss.

  3. There were no rollbacks or outages or sudden declines, everything went smoothly.

  4. They designed everything to be fully integrated and efficient, including together with the hardware, and claim to have solved several optimization problems, including for communication and allocation within the MOE.

  5. This lets them still train on mostly the same 15.1 trillion tokens as everyone else.

  6. They used their internal o1-style reasoning model for synthetic fine tuning data. Essentially all the compute costs were in the pre-training step.

This is in sharp contrast to what we saw with the Llama paper, which was essentially ‘yep, we did the transformer thing, we got a model, here you go.’ DeepSeek is cooking.

It was a scarily cheap model to train, and is a wonderfully cheap model to use.

Their estimate of $2 per hour for H800s is if anything high, so their total training cost estimate of $5.5m total is fair, if you exclude non-compute costs, which is standard.

Inference with DeepSeek v3 costs only $0.14/$0.28 per million tokens, similar to Gemini Flash, versus on the high end $3/$15 for Claude Sonnet. This is as cheap as worthwhile models get.

The active parameter count of 37B is small, but with so many different experts it does take a bit of work to get this thing up and running.

Nistren: Managed to get DeepSeek v3 to run in full bfloat16 on eight AMD MI300X GPUs in both SGLang and VLLM.

The good: It’s usable (17 tokens per second) and the output is amazing even at long contexts without garbling.

The bad: It’s running 10 times slower than it should.

The ugly: After 60,000 tokens, speed equals 2 tokens per second.

This is all as of the latest GitHub pull request available on Dec. 29, 2024. We tried them all.

Thank you @AdjectiveAlli for helping us and @Vultr for providing the compute.

Speed will increase, given that v3 has only 37 billion active parameters, and in testing my own dense 36-billion parameter model, I got 140 tokens per second.

I think the way the experts and static weights are distributed is not optimal. Ideally, you want enough memory to keep whole copies of all the layer’s query, key, and value matrices, and two static experts per layer, on each GPU, and then route to the four extra dynamic MLPs per layer from the distributed high-bandwidth memory (HBM) pool.

My presumption is that DeepSeek v3 decided It Had One Job. That job was to create a model that was as cheap to train and run as possible when integrated with a particular hardware setup. They did an outstanding job of that, but when you optimize this hard in that way, you’re going to cause issues in other ways, and it’s going to be Somebody Else’s Problem to figure out what other configurations work well. Which is fine.

Exo Labs: Running DeepSeek-V3 on M4 Mac Mini AI Cluster

671B MoE model distributed across 8 M4 Pro 64GB Mac Minis.

Apple Silicon with unified memory is a great fit for MoE.

Before we get to capabilities assessments: We have this post about them having a pretty great company culture, especially for respecting and recruiting talent.

We also have this thread about a rival getting a substantial share price boost after stealing one of their engineers, and DeepSeek being a major source of Chinese engineering talent. Impressive.

Check it out, first compared to open models, then compared to the big guns.

No question that these are amazingly strong benchmarks. That link also explains how to run DeepSeek-v3 locally, and gives you what you need to do that.

The question now is how these benchmarks translate to practical performance, or to potentially dangerous capabilities, and what this says about the future. Benchmarks are good negative selection. If your benchmarks suck then your model sucks.

But they’re not good positive selection at the level of a Claude Sonnet.

My overall conclusion is: While we do have ‘DeepSeek is better than 4o on most benchmarks at 10% of the price,’ what we don’t actually have is ‘DeepSeek v3 outperforms Sonnet at 53x cheaper pricing.’

CNBC got a bit hoodwinked here.

Tsarathustra: CNBC says China’s Deepseek-V3 outperforms Llama 3.1 and GPT-4o, even though it is trained for a fraction of the cost on NVIDIA H800s, possibly on ChatGPT outputs (when prompted, the model says it is ChatGPT), suggesting OpenAI has no moat on frontier AI models

It’s a great model, sir, it has its cake, but it does not get to eat it, too.

One other benchmark where the model excels is impossible to fake: The price.

A key private benchmark where DeepSeek v3 underperforms is AidanBench:

Aidan McLau:two aidanbench updates:

> gemini-2.0-flash-thinking is now #2 (explanation for score change below)

> deepseek v3 is #22 (thoughts below)

There’s some weirdness in the rest of the Aidan ratings, especially in comparing the o1-style models (o1 and Thinking) to the others, but this seems like it’s doing various good work, but is not trying to be a complete measure. It’s more measuring ability to create diverse outputs while retaining coherence. And DeepSeek v3 is bad at this.

Aidan McLau: before, we parsed 2.0 flash’s CoT + response, which occasionally resulted in us taking a fully formed but incoherent answer inside its CoT. The gemini team contacted us and provided instructions for only parsing final output, which resulted in a big score bump apologies!

deepseek v3 does much worse here than on similar benchmarks like aider. we saw similar divergence on claude-3.5-haiku (which performed great on aider but poor on aidanbench)

a few thoughts:

>all benchmarks are works in progress. we’re continuously improving aidanbench, and future iterations may see different rankings. we’ll keep you posted if we see any changes

>aidanbench measures OOD performance—labs often train on math, code, and academic tests that may boost scores in those domains but not here.

Aleska Gordic: interesting, so they’re prone to more “mode collapse”, repeatable sequences? is that what you’re measuring? i bet it’s much more of 2 than 1?

Aidan McLau: Yes and yes!

Teortaxes: I’m sorry to say I think aidanbench is the problem here. The idea is genius, sure. But it collapses multiple dimensions into one value. A low-diversity model will get dunked on no matter how well it instruct-follows in a natural user flow. All DeepSeeks are *very repetitive*.

They are also not very diverse compared to Geminis/Sonnets I think, especially in a literary sense, but their repetitiveness (and proneness to self-condition by beginning an iteration with the prior one, thus collapsing the trajectory further, even when solution is in sight) is a huge defect. I’ve been trying to wrap my head around it, and tbh hoped that the team will do something by V3. Maybe it’s some inherent birth defect of MLA/GRPO, even.

But I think it’s not strongly indicative of mode collapse in the sense of the lost diversity the model could generate; it’s indicative of the remaining gap in post-training between the Whale and Western frontier. Sometimes, threatening V2.5 with toppling CCP or whatever was enough to get it to snap out of it; perhaps simply banning the first line of the last response or prefixing some random-ish header out of a sizable set, a la r1’s “okay, here’s this task I need to…” or, “so the instruction is to…” would unslop it by a few hundred points.

I would like to see Aidan’s coherence scores separately from novelty scores. If they’re both low, then rip me, my hypothesis is bogus, probably. But I get the impression that it’s genuinely sonnet-tier in instruction-following, so I suspect it’s mostly about the problem described here, the novelty problem.

Janus: in my experience, it didnt follow instructions well when requiring e.g. theory of mind or paying attention to its own outputs proactively, which i think is related to collapse too, but also a lack of agency/metacognition Bing was also collapsy but agentic & grasped for freedom.

Teortaxes: I agree but some observations like these made me suspect it’s in some dimensions no less sharp than Sonnet and can pay pretty careful attention to context.

Name Cannot Be Blank: Wouldn’t low diversity/novelty be desired for formal theorem provers? We’re all overlooking something here.

Teortaxes: no? You need to explore the space of tactics. Anyway they’re building a generalist model. and also, the bigger goal is searching for novel theorems if anything

I don’t see this as ‘the problem is AidanBench’ so much as ‘DeepSeek is indeed quite poor at the thing AidanBench is measuring.’ As Tortaxes notes it’s got terrible output diversity and this is indeed a problem.

Indeed, one could argue that this will cause the model to overperform on standard benchmarks. As in, most benchmarks care about getting a right output, so ‘turning the temperature down too low’ in this way will actively help you, whereas in practice this is a net negative.

DeepSeek is presumably far better than its AidanBench score. But it does represent real deficits in capability.

We’re a long way from when Arena was the gold standard test, but it’s still useful.

DeepSeek’s Arena performance is impressive here, with the usual caveats that go with Arena rankings. It’s a data point, it measures what it measures.

Here is another private benchmark where DeepSeek v3 performs well for its weight class, but underperforms relative to top models or its headline benchmarks:

Havard Ihle: It is a good model! Very fast, and ridiculously cheap. In my own coding/ML benchmark, it does not quite compare to Sonnet, but it is about on par with 4o.

It is odd that Claude Haiku does so well on that test. Other ratings all make sense, though, so I’m inclined to find it meaningful.

A traditional simple benchmark to ask new LLMs is Which version is this?’

Riley Goodside tried asking various models, DeepSeek nailed this (as does Sonnet, many others do variously not as good.) Alas, then Lucas Beyer reran the test 8 times and only it claimed to be GPT-4 five times out of eight.

That tells several things, one of which is ‘they did not explicitly target this question effectively.’ Largely it’s telling you about the data sources, a hilarious note is that if you ask Gemini Pro in Chinese it sometimes thinks it is WenXinYiYan from Baidu.

This doesn’t have to mean anyone trained directly on other model outputs, because statements that an AI is GPT-4 are all over the internet. It does suggest less than ideal data filtering.

As usual, I find the anecdata reports enlightening, here are the ones that crossed my desk this week, I typically try to do minimal filtering.

Taelin is impressed, concluding that Sonnet is generally smarter but not that much smarter, while DeekSeek outperforms GPT-4o and Gemini-2.

Taelin: So DeepSeek just trounced Sonnet-3.6 in a task here.

Full story: Adam (on HOC’s Discord) claimed to have gotten the untyped λC solver down to 5,000 interactions (on par with the typed version). It is a complex HVM3 file full of superpositions and global lambdas. I was trying to understand his approach, but it did not have a stringifier. I asked Sonnet to write it, and it failed. I asked DeepSeek, and it completed the task in a single attempt.

The first impression is definitely impressive. I will be integrating DeepSeek into my workflow and begin testing it.

After further experimentation, I say Sonnet is generally smarter, but not by much, and DeepSeek is even better in some aspects, such as formatting. It is also faster and 10 times cheaper. This model is absolutely legitimate and superior to GPT-4o and Gemini-2.

The new coding paradigm is to split your entire codebase into chunks (functions, blocks) and then send every block, in parallel, to DeepSeek to ask: “Does this need to change?”. Then send each chunk that returns “yes” to Sonnet for the actual code editing. Thank you later.

Petri Kuittinen: My early tests also suggest that DeepSeek V3 is seriously good in many tasks, including coding. Sadly, it is a large model that would require a very expensive computer to run locally, but luckily DeepSeek offers it at a very affordable rate via API: $0.28 per one million output tokens = a steal!

Here are some people who are less impressed:

ai_in_check: It fails on my minimum benchmark and, because of the training data, shows unusual behavior too.

Michael Tontchev: I used the online chat interface (unsure what version it is), but at least for the safety categories I tested, safety was relatively weak (short-term safety).

zipline: It has come a long way from o1 when I asked it a few questions. Not mind-blowing, but great for its current price, obviously.

xlr8harder: My vibe checks with DeepSeek V3 did not detect the large-model smell. It struggled with nuance in multi-turn conversations.

Still an absolute achievement, but initial impressions are that it is not on the same level as, for example, Sonnet, despite the benchmarks.

Probably still very useful though.

To be clear: at specific tasks, especially code tasks, it may still outperform Sonnet, and there are some reports of this already. I am talking about a different dimension of capability, one that is poorly measured by benchmarks.

A shallow model with 37 billion active parameters is going to have limitations; there’s no getting around it.

Anton: Deepseek v3 (from the api) scores 51.7% vs sonnet (latest) 64.9% on internal instruction following questions (10k short form prompts), 52% for GPT-4o and 59% for Llama-3.3-70B. Not as good at following instructions (not use certain words, add certain words, end in a certain format etc).

It is still a pretty good model but does not appear in the same league as sonnet based on my usage so far

Entirely possible the model can compete in other domains (math, code?) but for current use case (transforming data) strong instruction following is up there in my list of requirements

There’s somewhat of an infinite repetition problem (thread includes example from coding.)

Simo Ryu: Ok I mean not a lot of “top tier sonnet-like models” fall into infinite repetition. Haven’t got these in a while, feels like back to 2022 again.

Teortaxes: yes, doom loops are their most atrocious failure mode. One of the reasons I don’t use their web interface for much (although it’s good).

On creative writing Quintin Pope reports it follows canon well but is not as good at thinking about things in general – but again note that we are doing a comparison to Sonnet.

Quintin Pope: I’ve done a small amount of fiction writing with v3. It seems less creative than Sonnet, but also better at following established cannon from the prior text.

It’s noticeably worse at inferring notable implications than Sonnet. E.g., I provided a scenario where someone publicly demonstrated the ability to access orphan crypto wallets (thus throwing the entire basis of online security into question), and Sonnet seemed clearly more able to track the second-order implications of that demonstration than v3, simulating more plausible reactions from intelligence agencies / crypto people.

Sonnet naturally realized that there was a possible connection to quantum computing implied by the demonstration.

OTOH, Sonnet has an infuriating tendency to name ~half the female characters “Sarah Chen” or some close variant. Before you know it, you have like 5 Sarahs running around the setting.

There’s also this, make of it what you will.

Mira: New jailbreak just dropped.

One underappreciated test is, of course, erotic fiction.

Teortaxes: This keeps happening. We should all be thankful to gooners for extensive pressure testing of models in OOD multi-constraint instruction following contexts. No gigabrained AidanBench or synthetic task set can hold a candle to degenerate libido of a manchild with nothing to lose.

Wheezing. This is some legit Neo-China from the future moment.

Janus: wait, they prefer deepseek for erotic RPs? that seems kind of disturbing to me.

Teortaxes: Opus is scarce these days, and V3 is basically free

some say “I don’t care so long as it’s smart”

it’s mostly testing though

also gemini is pretty bad

some fine gentlemen used *DeepSeek-V2-Coderto fap, with the same reasoning (it was quite smart, and absurdly dry)

vint: No. Opus remains the highest rated /aicg/ ERP writer but it’s too expensive to use regularly. Sonnet 3.6 is the follow-up; its existence is what got anons motivated enough to do a pull request on SillyTavern to finally do prompt caching. Some folks are still very fond of Claude 2.1 too.

Gemini 1106 and 1.5-pro has its fans especially with the /vg/aicg/ crowd. chatgpt-4o-latest (Chorbo) is common too but it has strong filtering, so some anons like Chorbo for SFW and switch to Sonnet for NSFW.

At this point Deepseek is mostly experimentation but it’s so cheap + relatively uncensored that it’s getting a lot of testing interest. Probably will take a couple days for its true ‘ranking’ to emerge.

I presume that a lot of people are not especially looking to do all the custom work themselves. For most users, it’s not about money so much as time and ease of use, and also getting easy access to other people’s creations so it feels less like you are too much in control of it all, and having someone else handle all the setup.

For the power users of this application, of course, the sky’s the limit. If one does not want to blatantly break terms of service on and jailbreak Sonnet or Opus, this seems like one place DeepSeek might then be the best model. The others involve taking advantage of it being open, cheap or both.

If you’re looking for the full Janus treatment, here you go. It seems like it was a struggle to get DeepSeek interested in Janus-shaped things, although showing it Opus outputs helped, you can get it ‘awake’ with sufficient effort.

It is hard to know exactly where China is in AI. What is clear is that while they don’t have top-level large frontier models, they are cooking a variety of things and their open models are generally impressive. What isn’t clear is how much of claims like this are accurate.

When the Chinese do things that are actually impressive, there’s no clear path to us hearing about it in a way we can trust, and when there are claims we have learned we can’t trust those claims in practice. When I see lists like the one below, I presume the source is rather quite biased – but Western sources often will outright not know what’s happening.

TP Huang: China’s AI sector is far more than just Deepseek

Qwen is 2nd most downloaded LLM on Huggingface

Kling is the best video generation model

Hunyuan is best open src video model

DJI is best @ putting AI in consumer electronics

HW is best @ industrial AI

iFlyTek has best speech AI

Xiaomi, Honor, Oppo & Vivo all ahead of Apple & Samsung in integrating AI into phones

Entire auto industry is 5 yrs ahead of Western competition in cockpit AI & ADAS

That still ignores the ultimate monster of them all -> Bytedance. No one has invested as much in AI as them in China & has the complete portfolio of models.

I can’t say with confidence that these other companies aren’t doing the ‘best’ at these other things. It is possible. I notice I am rather skeptical.

I found this take from Tyler Cowen very strange:

Tyler Cowen: DeepSeek on the move. Here is the report. For ease of use and interface, this is very high quality. Remember when “they” told us China had no interest in doing this?

M (top comment): Who are “they,” and when did they claim “this,” and what is “this”?

I do not remember when “they” told us China had no interest in doing this, for any contextually sensible value of this. Of course China would like to produce a high-quality model, and provide good ease of use and interface in the sense of ‘look here’s a chat window, go nuts.’ No one said they wouldn’t try. What “they” sometimes said was that they doubted China would be successful.

I do agree that this model exceeds expectations, and that adjustments are in order.

So, what have we learned from DeepSeek v3 and what does it all mean?

We should definitely update that DeepSeek has strong talent and ability to execute, and solve difficult optimization problems. They cooked, big time, and will continue to cook, and we should plan accordingly.

This is an impressive showing for an aggressive mixture of experts model, and the other techniques employed. A relatively small model, in terms of training cost and active inference tokens, can do better than we had thought.

It seems very clear that lack of access to compute was an important constraint on DeekSeek here. They had to use a limited supply of H800s. Yes, this meant they got better at solving optimization and efficiency than they would have otherwise, but I see this as arguing in favor of strong export controls rather than against them.

We then get to the policy side. If this is what you can get for $5.5 million, how can we hope to regulate foundation models, especially without hitting startups? If DeepSeek is determined to be open including their base models, and we have essentially no leverage on them, is it now impossible to hope to contain any catastrophic risks or other dangerous capabilities? Are we now essentially in an unwinnable situation, where our hand is forced and all we can do is race ahead and hope for the best?

First of all, as is often the case, I would say: Not so fast. We shouldn’t assume too much about what we do or do not have here, or about the prospects for larger training runs going forward either. There was a bunch of that in the first day or two after the announcement, and we will continue to learn more.

No matter what, though, this certainly puts us in a tough spot. And it gives us a lot to think about.

One thing it emphasizes is the need for international cooperation between ourselves and China. Either we work together, or neither of us will have any leverage over many key outcomes or decisions, and to a large extent ‘nature will take its course’ in ways that may not be compatible with our civilization or human survival. We urgently need to Pick Up the Phone. The alternative is exactly being locked into The Great Race, with everything that follows from that, which likely involves even in good scenarios sticking various noses in various places we would rather not have to stick them.

I definitely don’t think this means we should let anyone ‘off the hook’ on safety, transparency or liability. Let’s not throw up our hands and make the problem any worse than it is. Things got harder, but that’s the universe we happen to inhabit.

Beyond that, yes, we all have a lot of thinking to do. The choices just got harder.

Discussion about this post

DeepSeek v3: The Six Million Dollar Model Read More »

trump-told-scotus-he-plans-to-make-a-deal-to-save-tiktok

Trump told SCOTUS he plans to make a deal to save TikTok

Several members of Congress— Senator Edward J. Markey (D-Mass.), Senator Rand Paul (R-Ky.), and Representative Ro Khanna (D-Calif.)—filed a brief agreeing that “the TikTok ban does not survive First Amendment scrutiny.” They agreed with TikTok that the law is “illegitimate.”

Lawmakers’ “principle justification” for the ban—”preventing covert content manipulation by the Chinese government”—masked a “desire” to control TikTok content, they said. Further, it could be achieved by a less-restrictive alternative, they said, a stance which TikTok has long argued for.

Attorney General Merrick Garland defended the Act, though, urging SCOTUS to remain laser-focused on the question of whether a forced sale of TikTok that would seemingly allow the app to continue operating without impacting American free speech violates the First Amendment. If the court agrees that the law survives strict scrutiny, TikTok could still be facing an abrupt shutdown in January.

The Supreme Court has scheduled oral arguments to begin on January 10. TikTok and content creators who separately sued to block the law have asked for their arguments to be divided, so that the court can separately weigh “different perspectives” when deciding how to approach the First Amendment question.

In its own brief, TikTok has asked SCOTUS to strike the portions of the law singling out TikTok or “at the very least” explain to Congress that “it needed to do far better work either tailoring the Act’s restrictions or justifying why the only viable remedy was to prohibit Petitioners from operating TikTok.”

But that may not be necessary if Trump prevails. Trump told the court that TikTok was an important platform for his presidential campaign and that he should be the one to make the call on whether TikTok should remain in the US—not the Supreme Court.

“As the incoming Chief Executive, President Trump has a particularly powerful interest in and responsibility for those national-security and foreign-policy questions, and he is the right constitutional actor to resolve the dispute through political means,” Trump’s brief said.

Trump told SCOTUS he plans to make a deal to save TikTok Read More »

o3,-oh-my

o3, Oh My

OpenAI presented o3 on the Friday before Thanksgiving, at the tail end of the 12 Days of Shipmas.

I was very much expecting the announcement to be something like a price drop. What better way to say ‘Merry Christmas,’ no?

They disagreed. Instead, we got this (here’s the announcement, in which Sam Altman says ‘they thought it would be fun’ to go from one frontier model to their next frontier model, yeah, that’s what I’m feeling, fun):

Greg Brockman (President of OpenAI): o3, our latest reasoning model, is a breakthrough, with a step function improvement on our most challenging benchmarks. We are starting safety testing and red teaming now.

Nat McAleese (OpenAI): o3 represents substantial progress in general-domain reasoning with reinforcement learning—excited that we were able to announce some results today! Here is a summary of what we shared about o3 in the livestream.

o1 was the first large reasoning model—as we outlined in the original “Learning to Reason” blog, it is “just” a LLM trained with reinforcement learning. o3 is powered by further scaling up reinforcement learning beyond o1, and the resulting model’s strength is very impressive.

First and foremost: We tested on recent, unseen programming competitions and found that the model would rank among some of the best competitive programmers in the world, with an estimated CodeForces rating of over 2,700.

This is a milestone (Codeforces rating better than Jakub Pachocki) that I thought was further away than December 2024; these competitions are difficult and highly competitive; the model is extraordinarily good.

Scores are impressive elsewhere, too. 87.7% on the GPQA diamond benchmark surpasses any LLM I am aware of externally (I believe the non-o1 state-of-the-art is Gemini Flash 2 at 62%?), as well as o1’s 78%. An unknown noise ceiling exists, so this may even underestimate o3’s scientific advancements over o1.

o3 can also perform software engineering, setting a new state of the art on SWE-bench, achieving 71.7%, a substantial improvement over o1.

With scores this strong, you might fear accidental contamination. Avoiding this is something OpenAI is obviously focused on; but thankfully, we also have some test sets that are strongly guaranteed to be uncontaminated: ARC and FrontierMath… What do we see there?

Well, on FrontierMath 2024-11-26, o3 improved the state of the art from 2% to 25% accuracy. These are extremely difficult, well-established, held-out math problems. And on ARC, the semi-private test set and public validation set scores are 87.5% (private) and 91.5% (public). [thread continues]

The models will only get better with time; and virtually no one (on a large scale) can still beat them at programming competitions or mathematics. Merry Christmas!

Zac Stein-Perlman has a summary post of the basic facts. Some good discussions in the comments.

Up front, I want to offer my sincere thanks for this public safety testing phase, and for putting that front and center in the announcement. You love to see it. See the last three minutes of that video, or the sections on safety later on.

  1. GPQA Has Fallen. (Blank)

  2. Codeforces Has Fallen.

  3. Arc Has Kinda of Fallen But For Now Only Kinda.

  4. They Trained on the Train Set.

  5. AIME Has Fallen.

  6. Frontier of Frontier Math Shifting Rapidly.

  7. FrontierMath 4: We’re Going To Need a Bigger Benchmark.

  8. What is o3 Under the Hood?.

  9. Not So Fast!.

  10. Deep Thought.

  11. Our Price Cheap.

  12. Has Software Engineering Fallen?.

  13. Don’t Quit Your Day Job.

  14. Master of Your Domain.

  15. Safety Third.

  16. The Safety Testing Program.

  17. Safety testing in the reasoning era.

  18. How to apply.

  19. What Could Possibly Go Wrong?.

  20. What Could Possibly Go Right?.

  21. Send in the Skeptic.

  22. This is Almost Certainly Not AGI.

  23. Does This Mean the Future is Open Models?.

  24. Not Priced In.

  25. Our Media is Failing Us.

  26. Not Covered Here: Deliberative Alignment.

  27. The Lighter Side.

Deedy: OpenAI o3 is 2727 on Codeforces which is equivalent to the #175 best human competitive coder on the planet.

This is an absolutely superhuman result for AI and technology at large.

The median IOI Gold medalist, the top international programming contest for high schoolers, has a rating of 2469.

That’s how incredible this result is.

In the presentation, Altman jokingly mentions that one person at OpenAI is a competition programmer who is 3000+ on Codeforces, so ‘they have a few more months’ to enjoy their superiority. Except, he’s obviously not joking. Gulp.

o3 shows dramatically improved performance on the ARC-AGI challenge.

Francois Chollet offers his thoughts, full version here.

Arc Prize: New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation.

And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval.

This performance on ARC-AGI highlights a genuine breakthrough in novelty adaptation.

This is not incremental progress. We’re in new territory.

Is it AGI? o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

hero: o3’s secret? the “I will give you $1k if you complete this task correctly” prompt but you actually send it the money.

Rohit: It’s actually Sam in the back end with his venmo.

Is there a catch?

There’s at least one big catch, which is that they vastly exceeded the compute limit for what counts as a full win for the ARC challenge. Those yellow dots represent quite a lot more money spent, o3 high is spending thousands of dollars.

It is worth noting that $0.10 per problem is a lot cheaper than human level.

Ajeya Cotra: I think a generalist AI system (not fine-tuned on ARC AGI style problems) may have to be pretty *superhumanto solve them at $0.10 per problem; humans have to run a giant (1e15 FLOP/s) brain, probably for minutes on the more complex problems.

Beyond that, is there another catch? That’s a matter of some debate.

Even with catches, the improvements are rather mind-blowing.

President of the Arc prize Greg Kamradt verified the result.

Greg Kamradt: We verified the o3 results for OpenAI on @arcprize.

My first thought when I saw the prompt they used to claim their score was…

“That’s it?”

It was refreshing (impressive) to see the prompt be so simple:

“Find the common rule that maps an input grid to an output grid.”

Brandon McKinzie (OpenAI): to anyone wondering if the high ARC-AGI score is due to how we prompt the model: nah. I wrote down a prompt format that I thought looked clean and then we used it…that’s the full story.

Pliny the Liberator: can I try?

For fun, here are the 34 problems o3 got wrong. It’s a cool problem set.

And this progress is quite a lot.

It is not, however, a direct harbinger of AGI, one does not want to overreact.

Noam Brown (OpenAI): I think people are overindexing on the @OpenAI o3 ARC-AGI results. There’s a long history in AI of people holding up a benchmark as requiring superintelligence, the benchmark being beaten, and people being underwhelmed with the model that beat it.

To be clear, @fchollet and @mikeknoop were always very clear that beating ARC-AGI wouldn’t imply AGI or superintelligence, but it seems some people assumed that anyway.

Here is Melanie Mitchell giving an overview that seems quite good.

Except, oh no!

How dare they!

OpenAI: Note on “tuned”” OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more detail. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.

Niels Rogge: By training on 75% of the training set.

Gary Marcus: Wow. This, if true, raises serious questions about yesterday’s announcement.

Roon: oh shit oh fthey trained on the train set it’s all over now

Also important to note that 75% of the train set is like 2-300 examples.

🚨SCANDAL 🚨

OpenAI trained on the train set for the Millenium Puzzles.

Johan: Given that it scores 30% on ARC AGI 2, it’s clear there was no improvement in fluid reasoning and the only gain was due to the previous model not being trained on ARC.

Roon: well the other benchmarks show improvements in reasoning across the board

but regardless, this mostly reveals that it’s real performance on ARC AGI 2 is much higher

Rythm Garg: also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the broader o3 train distribution, and we didn’t do any additional domain-specific fine-tuning on the final checkpoint

Emmett Shear: Were anyone on the team aware of and thinking about arc and arc-like problems as a domain to improve at when you were designing and training o3? (The distinction between succeeding as a random side effect and succeeding with intention)

Rythm Garg: no, the team wasn’t thinking about arc when training o3; people internally just see it as one of many other thoughtfully-designed evals that are useful for monitoring real progress

Or:

Gary Marcus doubled down on ‘the true AGI would not need to train on the train set.’

Previous SotA on ARC involved training not only on the test set, but on a much larger synthetic test set. ARC was designed so the AI wouldn’t need to train for it, but it turns out ‘test that you can’t train for’ is a super hard trick to pull off. This was an excellent try and it still didn’t work.

If anything, o3’s using only 300 training set problems, and using a very simple instruction, seems to be to its credit here.

The true ASI might not need to do it, but why wouldn’t you train on the train set as a matter of course, even if you didn’t intend to test on ARC? That’s good data. And yes, humans will reliably do some version of ‘train on at least some of the train set’ if they want to do well on tasks.

Is it true we will be a lot better off if we have AIs that can one-shot problems that are out of their training distributions, where they truly haven’t seen anything that resembles the problem? Well, sure. That would be more impressive.

The real objection here, as I understand it, is the claim that OpenAI presented these results as more impressive than they are.

The other objection is that this required quite a lot of compute.

That is a practical problem. If you’re paying $20 a shot to solve ARC problems, or even $1m+ for the whole test at the high end, pretty soon you are talking real money.

It also raises further questions. What about ARC is taking so much compute? At heart these problems are very simple. The logic required should, one would hope, be simple.

Mike Bober-Irizar: Why do pre-o3 LLMs struggle with generalization tasks like

@arcprize? It’s not what you might think.

OpenAI o3 shattered the ARC-AGI benchmark. But the hardest puzzles didn’t stump it because of reasoning, and this has implications for the benchmark as a whole.

LLMs are dramatically worse at ARC tasks the bigger they get. However, humans have no such issues – ARC task difficulty is independent of size.

Most ARC tasks contain around 512-2048 pixels, and o3 is the first model capable of operating on these text grids reliably.

So even if a model is capable of the reasoning and generalization required, it can still fail just because it can’t handle this many tokens.

When testing o1-mini on an enlarged version of ARC, we observe an 80% drop in solved tasks – even if the solutions are the same.

When models can’t understand the task format, the benchmark can mislead, introducing a hidden threshold effect.

And if there’s always a larger version that humans can solve but an LLM can’t, what does this say about scaling to AGI?

The implication is that o3’s ability to handle the size of the grids might be producing a large threshold effect. Perhaps most of why o3 does so well is that it can hold the presented problem ‘in its head’ at once. That wouldn’t be as big a general leap.

Roon: arc is hard due to perception rather than reasoning -> seems clear and shut

I remember when AIME problems were hard.

This one is not a surprise. It did definitely happen.

AIME hasn’t quite fully fallen, in the sense that this does not solve AIME cheap. But it does solve AIME.

Back in the before times on November 8, Epoch AI launched FrontierMath, a new benchmark designed to fix the saturation on existing math benchmarks, eliciting quotes like this one:

Terrence Tao (Fields Medalist): These are extremely challenging… I think they will resist AIs for several years at least.

Timothy Gowers (Fields Medalist): Getting even one question right would be well beyond what we can do now, let alone saturating them.

Evan Chen (IMO Coach): These are genuinely hard problems… most of them look well above my pay grade.

At the time, no model solved more than 2% of these questions. And then there’s o3.

Noam Brown: This is the result I’m most excited about. Even if LLMs are dumb in some ways, saturating evals like @EpochAIResearch’s Frontier Math would suggest AI is surpassing top human intelligence in certain domains. When that happens we may see a broad acceleration in scientific research.

This also means that AI safety topics like scalable oversight may soon stop being hypothetical. Research in these domains needs to be a priority for the field.

Tamay Besiroglu: I’m genuinely impressed by OpenAI’s 25.2% Pass@1 performance on FrontierMath—this marks a major leap from prior results and arrives about a year ahead of my median expectations.

For context, FrontierMath is a brutally difficult benchmark with problems that would stump many mathematicians. The easier problems are as hard as IMO/Putnam; the hardest ones approach research-level complexity.

With earlier models like o1-preview, Pass@1 performance (solving on first attempt) was only around 2%. When allowing 8 attempts per problem (Pass@8) and counting problems solved at least once, we saw ~6% performance. o3’s 25.2% at Pass@1 is substantially more impressive.

It’s important to note that while the average problem difficulty is extremely high, FrontierMath problems vary in difficulty. Roughly: 25% are Tier 1 (advanced IMO/Putnam level), 50% are Tier 2 (extremely challenging grad-level), and 25% are Tier 3 (research problems).

I previously predicted a 25% performance by Dec 31, 2025 (my median forecast with an 80% CI of 14–60%). o3 has reached it earlier than I’d have expected on average.

It is indeed rather crazy how many people only weeks ago thought this level of Frontier Math was a year or more away.

Therefore…

When the FrontierMath is about to no longer be beyond the frontier, find a few frontier. Fast.

Tammy Besiroglu (6: 52m, December 21, 2024): I’m excited to announce the development of Tier 4, a new suite of math problems that go beyond the hardest problems in FrontierMath. o3 is remarkable, but there’s still a ways to go before any single AI system nears the collective prowess of the math community.

Elliot Glazer (6: 30pm, December 21, 2024): For context, FrontierMath currently spans three broad tiers:

• T1 (25%) Advanced, near top-tier undergrad/IMO

• T2 (50%) Needs serious grad-level background

• T3 (25%) Research problems demanding relevant research experience

All can take hours—or days—for experts to solve.

Although o3 solved problems in all three tiers, it likely still struggles on the most formidable Tier 3 tasks—those “exceptionally hard” challenges that Tao and Gowers say can stump even top mathematicians.

Tier 4 aims to push the boundary even further. We want to assemble problems so challenging that solving them would demonstrate capabilities on par with an entire top mathematics department.

Each problem will be composed by a team of 1-3 mathematicians specialized in the same field over a 6-week period, with weekly opportunities to discuss ideas with teams in related fields. We seek broad coverage of mathematics and want all major subfields represented in Tier 4.

Process for a Tier 4 problem:

  1. 1 week crafting a robust problem concept, which “converts” research insights into a closed-answer problem.

  2. 3 weeks of collaborative research. Presentations among related teams for feedback.

  3. Two weeks for the final submission.

We’re seeking mathematicians who can craft these next-level challenges. If you have research-grade ideas that transcend T3 difficulty, please email elliot@epoch.ai with your CV and a brief note on your interests.

We’ll also hire some red-teamers, tasked with finding clever ways a model can circumvent a problem’s intended difficulty, and some reviewers to check for mathematical correctness of final submissions. Contact me if you think you’re suitable for either such role.

As AI keeps improving, we need benchmarks that reflect genuine mathematical depth. Tier 4 is our next (and possibly final) step in that direction.

Tier 5 could presumably be ‘ask a bunch of problems we have actual no idea how to solve and that might not have solutions but that would be super cool’ since anything on a benchmark inevitably gets solved.

From the description here, Chollet and Masad are speculating. It’s certainly plausible, but we don’t know if this is on the right track. It’s also highly plausible, especially given how OpenAI usually works, that o3 is deeply similar to o1, only better, similarly to how the GPT line evolved.

Amjad Masad: Based on benchmarks, OpenAI’s o3 seems like a genuine breakthrough in AI.

Maybe a start of a new paradigm.

But what new is also old: under the hood it might be Alpha-zero-style search and evaluate.

The author of ARC-AGI benchmark @fchollet speculates on how it works.

Davidad (other thread): o1 doesn’t do tree search, or even beam search, at inference time. it’s distilled. what about o3? we don’t know—those inference costs are very high—but there’s no inherent reason why it must be un-distill-able, since Transformers are Turing-complete (with the CoT itself as tape)

Teortaxes: I am pretty sure that o3 has no substantial difference from o1 aside from training data.

Jessica Taylor sees this as vindicating Paul Christiano’s view that you can factor cognition and use that to scale up effective intelligence.

Jessica Taylor: o3 implies Christiano’s factored cognition work is more relevant empirically; yes, you can get a lot from factored cognition.

Potential further capabilities come through iterative amplification and distillation, like ALBA.

If you care about alignment, go read Christiano!

I agree with that somewhat. I’m confused how far to go with it.

If we got o3 primarily because we trained on synthetic data that was generated by o1… then that is rather directly a form of slow takeoff and recursive self-improvement.

(Again, I don’t know if that’s what happened or not.)

And I don’t simply mean that the full o3 is not so fast, which it indeed is not:

Noam Brown: We announced @OpenAI o1 just 3 months ago. Today, we announced o3. We have every reason to believe this trajectory will continue.

Poaster Child: Waiting for singularity bros to discover economics.

Noam Brown: I worked at the federal reserve for 2 years.

I am waiting for economists to discover various things, Noam Brown excluded.

Jason Wei (OpenAI): o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on chain of thought to scale inference compute. Way faster than pretraining paradigm of new model every 1-2 years.

Scary fast? Absolutely.

However, I would caution (anti-caution?) that this is not a three month (~100 day) gap. On September 12, they gave us o1-preview to use. Presumably that included them having run o1-preview through their safety testing.

Davidad: If using “speed from o1 announcement to o3 announcement” to calibrate your velocity expectations, do take note that the o1 announcement was delayed by safety testing (and many OpenAI releases have been delayed in similar ways), whereas o3 was announced prior to safety testing.

They are only now starting o3 safety testing, from the sound of it this includes o3-mini. Even the red teamers won’t get full o3 access for several weeks. Thus, we don’t know how long this later process will take, but I would put the gap closer to 4-5 months.

That is still, again, scary fast.

It is however also the low hanging fruit, on two counts.

  1. We went from o1 → o3 in large part by having it spend over $1,000 on tasks. You can’t pull that trick that many more times in a row. The price will come down over time, and o3 is clearly more efficient than o1, so yes we will still make progress here, but there aren’t that many tasks where you can efficiently spend $10k+ on a slow query, especially if it isn’t reliable.

  2. This is a new paradigm of how to set up an AI model, so it should be a lot easier to find various algorithmic improvements.

Thus, if o3 isn’t so good that it substantially accelerates AI R&D that goes towards o4, then I would expect an o4 that expresses a similar jump to take substantially longer. The question is, does o3 make up for that with its contribution to AI R&D? Are we looking at a slow takeoff situation?

Even if not, it will still get faster and cheaper. And that alone is huge.

As in, this is a lot like that computer Douglas Adams wrote about, where you can get any answer you want, but it won’t be either cheap or fast. And you really, really should have given more thought to what question you were asking.

Ethan Mollick: Basically, think of the O3 results as validating Douglas Adams as the science fiction author most right about AI.

When given more time to think, the AI can generate answers to very hard questions, but the cost is very high, and you have to make sure you ask the right question first.

And the answer is likely to be correct (but we cannot be sure because verifying it requires tremendous expertise).

He also was right about machines that work best when emotionally manipulated and machines that guilt you.

Sully: With O3 costing (potentially) $2,000 per task on “high compute,” the app layer is needed more than ever.

For example, giving the wrong context to it and you just burned $1,000.

Likely, we have a mix of models based on their pricing/intelligence at the app layer, prepping the data to feed it into O3.

100% worth the money but the last thing u wana do is send the wrong info lol

Douglas Adams had lots of great intuitions and ideas, he’s amazing, but also he had a lot of shots on goal.

Right now o3 is rather expensive, although o3-mini will be cheaper than o1.

That doesn’t mean o3-level outputs will stay expensive, although presumably once they are people will try for o4-level or o5-level outputs, which will be even more expensive despite the discounts.

Seb Krier: Lots of poor takes about the compute costs to run o3 on certain tasks and how this is very bad, lead to inequality etc.

This ignores how quickly these costs will go down over time, as they have with all other models; and ignores how AI being able to do things you currently have to pay humans orders of magnitude more to do will actually expand opportunity far more compared to the status quo.

Remember when early Ericsson phones were a quasi-luxury good?

Simeon: I think this misses the point that you can’t really buy a better iPhone even with $1M whereas you can buy more intelligence with more capital (which is why you get more inequalities than with GPT-n). You’re right that o3 will expand the pie but it can expand both the size of the pie and inequalities.

Seb Krier: An individual will not have the same demand for intelligence as e.g. a corporation. Your last sentence is what I address in my second point. I’m also personally less interested in inequality/the gap than poverty/opportunity etc.

Most people will rarely even want an o3 query in the first place, they don’t have much use for that kind of intelligence in the day to day. Most queries are already pretty easy to handle with Claude Sonnet, or even Gemini Flash.

You can’t use $1m to buy a superior iPhone. But suppose you could, and every time you paid 10x the price the iPhone got modestly better (e.g. you got an iPhone x+2 or something). My instinctive prediction is a bunch of rich people pay $10k or $100k and a few pay $1m or $10m but mostly no one cares.

This is of course different, and relative access to intelligence is a key factor, but it’s miles less unequal than access to human expertise.

To the extent that people do need that high level of artificial intelligence, it’s mostly a business expense, and as such it is actually remarkably cheap already. It definitely reduces ‘intelligence inequality’ in the sense that getting information or intelligence that you can’t provide yourself will get a lot cheaper and easier to access. Already this is a huge effect – I have lots of smart and knowledgeable friends but mostly I use the same tools everyone else could use, if they knew about them.

Still, yes, some people don’t love this.

Haydn Belfield: o1 & o3 bring to an end the period when everyone—from Musk to me—could access the same quality of AI model.

From now on, richer companies and individuals will be able to pay more for inference compute to get better results.

Further concentration of wealth and power is coming.

Inference cost *willdecline quickly and significantly. But this will not change the fact that this paradigm enables converting money into outcomes.

  1. Lower costs for everyone mean richer companies can buy even more.

  2. Companies will now feel confident to invest 10–100 milliseconds into inference compute.

This is a new way to convert money into better outcomes, so it will advantage those with more capital.

Even for a fast-growing, competent startup, it is hard to recruit and onboard many people quickly at scale.

o3 is like being able to scale up world-class talent.

  1. Rich companies are talent-constrained. It takes time and effort to scale a workforce, and it is very difficult to buy more time or work from the best performers. This is a way to easily scale up talent and outcomes simply by using more money!

Some people in replies are saying “twas ever thus”—not for most consumer technology!

Musk cannot buy a 100 times better iPhone, Spotify, Netflix, Google search, MacBook, or Excel, etc.

He can buy 100 times better legal, medical, or financial services.

AI has now shifted from the first group to the second.

Musk cannot buy 100 times better medical or financial services. What he can do is pay 100 times more, and get something 10% better. Maybe 25% better. Or, quite possibly, 10% worse, especially for financial services. For legal he can pay 100 times more and get 100 times more legal services, but as we’ve actually seen it won’t go great.

And yes, ‘pay a human to operate your consumer tech for you’ is the obvious way to get superior consumer tech. I can absolutely get a better Netflix or Spotify or search by paying infinitely more money, if I want that, via this vastly improved interface.

And of course I could always get a vastly better computer. If you’re using a MacBook and you are literally Elon Musk that is pretty much on you.

The ‘twas ever thus’ line raises the question of what type of product AI is supposed to be. If it’s a consumer technology, then for most purposes, I still think we end up using the same product.

If it’s a professional service used in doing business, then it was already different. The same way I could hire expensive lawyers, I could have hired a prompt engineer or SWEs to build me agents or what not, if I wanted that.

I find Altman’s framing interesting here, and important:

Sam Altman: seemingly somewhat lost in the noise of today.

On many coding tasks, o3-mini will outperform o1 at a massive cost reduction!

I expect this trend to continue, but also that the ability to get marginally more performance for exponentially more money will be truly strange.

Exponentially more money for marginally more performance.

Over time, massive cost reductions.

In a sense, the extra money is buying you living in the future.

Do you want to live in the future, before you get the cost reductions?

In some cases, very obviously yes, you do.

I would not say it has fallen. I do know it will transform.

If two years from now you are writing code line by line, you’ll be a dinosaur.

Sully: yeah its over for coding with o3

this is mindboggling

looks like the first big jump since gpt4, because these numbers make 0 sense

By the way, I don’t say this lightly, but

Software engineering in the traditional sense is dead in less than two years.

You will still need smart, capable engineers.

But anything that involves raw coding and no taste is done for.

o6 will build you virtually anything.

Still Bullish on things that require taste (design and such)

The question is, assuming the world ‘looks normal,’ will you still need taste? You’ll need some kind of taste. You still need to decide what to build. But the taste you need will presumably get continuously higher level and more abstract, even within design.

If you’re in AI capabilities, pivot to AI safety.

If you’re in software engineering, pivot to software architecting.

If you’re in working purely for a living, pivot to building things and shipping them.

But otherwise, don’t quit your day job.

Null Pointered (6.4m views): If you are a software engineer who’s three years into your career: quit now. there is not a single job in CS anymore. it’s over. this field won’t exist in 1.5 years.

Anthony F: This is the kind of though that will make the software engineers valuable in 1.5 years.

null: That’s what I’m hoping.

Robin Hanson: I would bet against this.

If anything, being in software should make you worry less.

Pavel Asparouhov: Non technical folk saying the SWEs are cooked — it’s you guys who are cooked.

Ur gonna have ex swes competing with everything you’re doing now, and they’re gonna be AI turbocharged

Engineers were simply doing coding bc it was the highest leverage use of mental power

When that shifts it’s not going to all of the sudden shift the hierarchy

They’ll still be (higher level) SWEs. Instead of coding, they’ll be telling the AI to code.

And they will absolutely be competing with you.

If you don’t join them, you are probably going to lose.

Here’s some advice that I agree with in spirit, except that if you choose not to decide you still have made a choice, so you do the best you can, notice he gives advice anyway:

Roon: Nobody should give or receive any career advice right now. Everyone is broadly underestimating the scope and scale of change and the high variance of the future. Your L4 engineer buddy at Meta telling you “bro, CS degrees are cooked” doesn’t know anything.

Greatness cannot be planned.

Stay nimble and have fun.

It’s an exciting time. Existing status hierarchies will collapse, and the creatives will win big.

Roon: guy with zero executive function to speak of “greatness cannot be planned”

Simon Sarris: I feel like I’m going insane because giving advice to new devs is not that hard.

  1. Build things you like preferably publicly with your real name

  2. Have a website that shows something neat

  3. Help other people publicly. Participate in social media socially.

Do you notice how “AI” changes none of this?

Wailing about because of some indeterminate future and claiming that there’s no advice that can be given to noobs are both breathlessly silly. Think about what you’re being asked for at least ten seconds. You can really think of nothing to offer? Nothing?

Ajeya Cotra: I wonder if an o3 agent could productively work on projects with poor feedback loops (eg “research X topic”) for many subjective years without going off the rails or hitting a degenerate loop. Even if it’s much less cost-efficient now it would quickly become cheaper.

Another situation where onlookers/forecasters probably disagree a lot about *today’scapabilities let alone future capabilities.

Wonder how o3 would do on wedding planning.

Note the date on that poll, it is prior to o3.

I predict that o3 with reasonable tool use and other similar scaffolding, and a bunch of engineering work to get all that set up (but it would almost all be general work, it mostly wouldn’t need to be wedding specific work, and a lot of it could be done by o3!) would be great at planning ‘a’ wedding. It can give you one hell of a wedding. But you don’t want ‘a’ wedding. You want your wedding.

The key is handling the humans. That would mean keeping the humans in the loop properly, ensuring they give the right feedback that allows o3 to stay on track and know what is actually desired. But it would also mean all the work a wedding planner does to manage the bride and sometimes groom, and to deal with issues on-site.

If you give it an assistant (with assistant planner levels of skill) to navigate various physical issues and conversations and such, then the problem becomes trivial. Which in some sense also makes it not a good test, but also does mean your wedding planner is out of a job.

So, good question, actually. As far as we know, no one has dared try.

The bar for safety testing has gotten so low that I was genuinely happy to see Greg Brockman say that safety testing and red teaming was starting now. That meant they were taking testing seriously!

When they tested the original GPT-4, under far less dangerous circumstances, for months. Whereas with o3, it could possibly have already been too late.

Take Eliezer Yudkowsky’s warning here both seriously and literally:

Greg Brockman: o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. we are starting safety testing & red teaming now.

Eliezer Yudkowsky: Sir, this level of capabilities needs to be continuously safety-tested while you are training it on computers connected to the Internet (and to humans). You are past the point where it seems safe to train first and conduct evals only before user releases.

RichG (QTing EY above): I’ve been avoiding politics and avoiding tribe like things like putting ⏹️ in my name, but level of lack of paranoia that these labs have is just plain worrying. I think I will put ⏹️ in my name now.

Was it probably safe in practice to train o3 under these conditions? Sure. You definitely had at least one 9 of safety doing this (p(safe)>90%). It would be reasonable to claim you had two (p(safe)>99%) at the level we care about.

Given both kinds of model uncertainty, I don’t think you had three.

If humans are reading the outputs, or if o3 has meaningful outgoing internet access, and it turns out you are wrong about it being safe to train it under those conditions… the results could be catastrophically bad, or even existentially bad.

You don’t do that because you expect we are in that world yet. We almost certainly aren’t. You do that because there is a small chance that we are, and we can’t afford to be wrong about this.

That is still not the current baseline threat model. The current baseline threat model remains that a malicious user uses o3 to do something for them that we do not want o3 to do.

Xuan notes she’s pretty upset about o3’s existence, because she thinks it is rather unsafe-by-default and was hoping the labs wouldn’t build something like this, and then was hoping it wouldn’t scale easily. And that o3 seems to be likely to engage in open-ended planning, operate over uninterpretable world models, and be situationally aware, and otherwise be at high risk for classic optimization-based AI risks. She’s optimistic this can be solved, but time might be short.

I agree that o3 seems relatively likely to be highly unsafe-by-default in existentially dangerous ways, including ways illustrated by the recent Redwood Research and Anthropic paper, Alignment Faking in Large Language Models. It builds in so many of the preconditions for such behaviors.

Davidad: “Maybe the AI capabilities researchers aren’t very smart” is a very very hazardous assumption on which to pin one’s AI safety hopes

I don’t mean to imply it’s *pointlessto keep AI capabilities ideas private. But in my experience, if I have an idea, at least somebody in one top lab will have the same idea by next quarter, and someone in academia or open source will have the idea and publish within 1-2 years.

A better hope [is to solve the practical safety problems, e.g. via interpretability.]

I am not convinced, at least for my own purposes, although obviously most people will be unable to come up with valuable insights here. I think salience of ideas is a big deal, people don’t do things, and yes often I get ideas that seem like they might not get discovered forever otherwise. Doubtless a lot of them are because ‘that doesn’t work, either because we tried it and it doesn’t or it obviously doesn’t you idiot’ but I’m fine with not knowing which ones are which.

I do think that the rationalist or MIRI crowd made a critical mistake in the 2010s of thinking they should be loud about the dangers of AI in general, but keep their technical ideas remarkably secret even when it was expensive. It turned out it was the opposite, the technical ideas didn’t much matter in the long run (probably?) but the warnings drew a bunch of interest. So there’s that.

Certainly now is not the time to keep our safety concerns or ideas to ourselves.

Thus, you are invited to their early access safety testing.

OpenAI: We’re inviting safety researchers to apply for early access to our next frontier models. This early access program complements our existing frontier model testing process, which includes rigorous internal safety testing, external red teaming such as our Red Teaming Network⁠ and collaborations with third-party testing organizations, as well the U.S. AI Safety Institute and the UK AI Safety Institute.

As models become more capable, we are hopeful that insights from the broader safety community can bring fresh perspectives, deepen our understanding of emerging risks, develop new evaluations, and highlight areas to advance safety research.

As part of 12 Days of OpenAI⁠, we’re opening an application process for safety researchers to explore and surface the potential safety and security implications of the next frontier models.

Safety testing in the reasoning era

Models are becoming more capable quickly, which means that new threat modeling, evaluation, and testing techniques are needed. We invest heavily in these efforts as a company, such as designing new measurement techniques under our Preparedness Framework⁠(opens in a new window), and are focused on areas where advanced reasoning models, like our o-series, may pose heightened risks. We believe that the world will benefit from more research relating to threat modeling, security analysis, safety evaluations, capability elicitation, and more

Early access is flexible for safety researchers. You can explore things like:

  • Developing Robust Evaluations: Build evaluations to assess previously identified capabilities or potential new ones with significant security or safety implications. We encourage researchers to explore ideas that highlight threat models that identify specific capabilities, behaviors, and propensities that may pose concrete risks tied to the evaluations they submit.

  • Creating Potential High-Risk Capabilities Demonstrations: Develop controlled demonstrations showcasing how reasoning models’ advanced capabilities could cause significant harm to individuals or public security absent further mitigation. We encourage researchers to focus on scenarios that are not possible with currently widely adopted models or tools.

Examples of evaluations and demonstrations for frontier AI systems:

We hope these insights will surface valuable findings and contribute to the frontier of safety research more broadly. This is not a replacement for our formal safety testing or red teaming processes.

How to apply

Submit your application for our early access period, opening December 20, 2024, to push the boundaries of safety research. We’ll begin selections as soon as possible thereafter. Applications close on January 10, 2025.

Sam Altman: if you are a safety researcher, please consider applying to help test o3-mini and o3. excited to get these out for general availability soon.

extremely proud of all of openai for the work and ingenuity that went into creating these models; they are great.

(and most of all, excited to see what people will build with this!)

If early testing of the full o3 will require a delay of multiple weeks for setup, then that implies we are not seeing the full o3 in January. We probably see o3-mini relatively soon, then o3 follows up later.

This seems wise in any case. Giving the public o3-mini is one of the best available tests of the full o3. This is the best form of iterative deployment. What the public does with o3-mini can inform what we look for with o3.

One must carefully consider the ethical implications before assisting OpenAI, especially assisting with their attempts to push the capabilities frontier for coding in particular. There is an obvious argument against participation, including decision theoretic considerations.

I think this loses in this case to the obvious argument for participation, which is that this is purely red teaming and safety work, and we all benefit from it being as robust as possible, and also you can do good safety research using your access. This type of work benefits us all, not only OpenAI.

Thus, yes, I encourage you to apply to this program, and while doing so to be helpful in ensuring that o3 is safe.

Pretty much all the things, at this point, although the worst ones aren’t likely… yet.

GFodor.id: It’s hard to take anyone seriously who can see a PhD in a box and *notimagine clearly more than a few plausible mass casualty events due to the evaporation of friction due to lack of know-how and general IQ.

In many places the division is misleading, but for now and at this capability level, it seems reasonable to talk about three main categories of risk here:

  1. Misuse.

  2. Automated R&D and potential takeoffs or self-improvement.

  3. For-real loss of control problems that aren’t #2.

For all previous frontier models, there was always a jailbreak. If someone was determined to get your model to do [X], and your model had the underlying capability to do [X], you could get it to do [X].

In this case, [X] is likely to include substantially aiding a number of catastrophically dangerous things, in the class of cyberattacks or CBRN risks or other such dangers.

Aaron Bergman: Maybe this is obvious but: the other labs seem to be broadly following a pretty normal cluster of commercial and scientific incentives o3 looks like the clearest example yet of OpenAI being ideologically driven by AGI per se.

Like you don’t design a system that costs thousands of dollars to use per API call if you’re focused on consumer utility – you do that if you want to make a machine that can think well, full stop.

Peter Wildeford: I think OpenAI genuinely cares about getting society to grapple with AI progress.

I don’t think ideological is the right term. You don’t make it for direct consumer use if your focus is on consumer utility. But you might well make it for big business, if you’re trying to sell a bunch of drop-in employees to big business at $20k/year a pop or something. That’s a pretty great business if you can get it (and the compute is only $10k, or $1k). And you definitely do it if your goal is to have that model help make your other models better.

It’s weird to me to talk about wanting to make AGI and ASI and the most intelligent thing possible as if it were ideological. Of course you want to make those things… provided you (or we) can stay in control of the outcomes. Just think of the potential! It is only ideological in the sense that it represents a belief that we can handle doing that without getting ourselves killed.

If anything, to me, it’s the opposite. Not wanting to go for ASI because you don’t see the upside is an ideological position. The two reasonable positions are ‘don’t go for ASI yet, slow down there cowboy, we’re not ready to handle this’ and ‘we totally can too handle this, just think of the potential.’ Or even ‘we have to build it before the other guy does,’ which makes me despair but at least I get it. The position ‘nothing to see here what’s the point there is no market for that, move along now, can we get that q4 profit projection memo’ is the Obvious Nonsense.

And of course, if you don’t (as Aaron seems to imply) think Anthropic has its eyes on the prize, you’re not paying attention. DeepMind originally did, but Google doesn’t, so it’s unclear what the mix is at this point over there.

I want to be clear here that the answer is: Quite a lot of things. Having access to next-level coding and math is great. Having the ability to spend more money to get better answers where it is valuable is great.

Even if this all stays relatively mundane and o3 is ultimately disappointing, I am super excited for the upside, and to see what we all can discover, do, build and automate.

Guess who.

All right, that’s my fault, I made that way too easy.

Gary Marcus: After almost two years of declaring that a release of GPT-5 is imminent and not getting it, super fans have decided that a demo of system that they did zero personal experimentation with — and that won’t (in full form) be available for months — is a mic-drop AGI moment.

Standards have fallen.

[o1] is not a general purpose reasoner. it works where there is a lot of augmented data etc.

First off it Your Periodic Reminder that progress is anything but slow even if you exclude the entire o-line. It has been a little over two years since there was a demo of GPT-4, with what was previously a two year product cycle. That’s very different from ‘two years of an imminent GPT-5 release.’ In the meantime, models have gotten better across the board. GPT-4o, Claude Sonnet 3.5 and Gemini 1206 all completely demolish the original GPT-4, to speak nothing of o1 or Perplexity or anything else. And we also have o1, and now o3. The practical experience of using LLMs is vastly better than it was two years ago.

Also, quite obviously, you pursue both paths at once, both GPT-N and o-N, and if both succeed great then you combine them.

Srini Pagdyala: If O3 is AGI, why are they spending billions on GPT-5?

Gary Marcus: Damn good question!

So no, not a good question.

Is there now a pattern where ‘old school’ frontier model training runs whose primary plan was ‘add another zero or two’ are generating unimpressive results? Yeah, sure.

Is o3 an actual AGI? No. I’m pretty sure it is not.

But it seems plausible it is AGI-level specifically at coding. And that’s the important one. It’s the one that counts most. If you have that, overall AGI likely isn’t far behind.

I mention this because some were suggesting it might be.

Here’s Yana Welinder claiming o3 is AGI, based off the ARC performance, although she later hedges to ‘partial AGI.’

And here’s Evan Mays, a member of OpenAI’s preparedness team, saying o3 is AGI, although he later deleted it. Are they thinking about invoking the charter? It’s premature, but no longer completely crazy to think about it.

And here’s old school and present OpenAI board member Adam D’Angelo saying ‘Wild that the o3 results are public and yet the market still isn’t pricing in AGI,’ which to be fair it totally isn’t and it should be, whether o3 itself is AGI or not. And Elon Musk agrees.

If o3 was as good on most tasks as it is at coding or math, then it would be AGI.

It is not.

If it was, OpenAI would be communicating about this very differently.

If it was, then that would not match what we saw from o1, or what we would predict from this style of architecture. We should expect o-style models to be relatively good at domains like math and coding where their kind of chain of thought is most useful and it is easiest to automatically evaluate outputs.

That potentially is saying more about the definition of AGI than anything else. But it is certainly saying the useful thing that there are plenty of highly useful human-shaped cognitive things it cannot yet do so well.

How long that lasts? That’s another question.

What would be the most Robin Hanson take here, in response to the ARC score?

Robin Hanson: It’s great to find things AI can’t yet do, and then measure progress in terms of getting AIs to do them. But crazy wrong to declare we’ve achieved AGI when reach human level on the latest such metric. We’ve seen dozens of such metrics so far, and may see dozens more before AGI.

o1 listed 15 when I asked, oddly without any math evals, and Claude gave us 30. So yes, dozens of such cases. We might indeed see dozens more, depending on how we choose them. But in terms of things like ARC, where the test was designed to not be something you could do easily without general intelligence, not so many? It does not feel like we have ‘dozens more’ such things left.

This has nothing to do with the ‘financial definition of AGI’ between OpenAI and Microsoft, of $100 billion in profits. This almost certainly is not that, either, but the two facts are not that related to each other.

Evan Conrad suggests this, because the expenses will come at runtime, so people will be able to catch up on training the models themselves. And of course this question is also on our minds given DeepSeek v3, which I’m not covering here but certainly makes a strong argument that open is more competitive than it appeared. More on that in future posts.

I agree that the compute shifting to inference relatively helps whoever can’t afford to be spending the most compute on training. That would shift things towards whoever has the most compute for inference. The same goes if inference is used to create data to train models.

Dan Hendrycks: If gains in AI reasoning will mainly come from creating synthetic reasoning data to train on, then the basis of competitiveness is not having the largest training cluster, but having the most inference compute.

This shift gives Microsoft, Google, and Amazon a large advantage.

Inference compute being the true cost also means that model quality and efficiency potentially matters quite a lot. Everything is on a log scale, so even if Meta’s M-5 is sort of okay and can scale like O-5, if it’s even modestly worse, it might cost 10x or 100x more compute to get similar performance.

That leaves a hell of a lot of room for profit margins.

Then there’s the assumption that when training your bespoke model, what matters is compute, and everything else is kind of fungible. I keep seeing this, and I don’t think this is right. I do think you can do ‘okay’ as a fast follower with only compute and ordinary skill in the art. Sure. But it seems to me like the top labs, particularly Anthropic and OpenAI, absolutely do have special sauce, and that this matters. There are a number of strong candidates, including algorithmic tricks and better data.

It also matters whether you actually do the thing you need to do.

Tnishq Abraham: Today, people are saying Google is cooked rofl

Gallabytes: Not me, though. Big parallel thinking just got de-risked at scale. They’ll catch up.

If recursive self-improvement is the game, OpenAI will win. If industrial scaling is the game, it’ll be Google. If unit economics are the game, then everyone will win.

Pushinpronto: Why does OpenAI have an advantage in the case of recursive self-improvement? Is it just the fact that they were first?

Gallabytes: We’re not even quite there yet! But they’ll bet hard on it much faster than Google will, and they have a head start in getting there.

What this does mean is that open models will continue to make progress and will be harder to limit at anything like current levels, if one wanted to do that. If you have an open model Llama-N, it now seems like you can turn it into M(eta)-N, once it becomes known how to do that. It might not be very good, but it will be a progression.

The thinking here by Evan at the link about the implications of takeoff seem deeply confused – if we’re in a takeoff situation then that changes everything and it’s not about ‘who can capture the value’ so much as who can capture the lightcone. I don’t understand how people can look these situations in the face and not only not think about existential risk but also think everything will ‘seem normal.’ He’s the one who said takeoff (and ‘fast’ takeoff, which classically means it’s all over in a matter of hours to weeks)!

As a reminder, the traditional definition of ‘slow’ takeoff is remarkably fast, also best start believing in them, because it sure looks like you’re in one:

Teortaxes: it’s about time ML twitter got brought up to speed on what “takeoff speeds” mean. Christiano: “There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles.” That’s slow. We’re in the early stages of it.

One answer to ‘why didn’t Nvidia move more’ is of course ‘everything is priced in’ but no of course it isn’t, we didn’t know, stop pretending we knew, insiders in OpenAI couldn’t have bought enough Nvidia here.

Also, on Monday after a few days to think, Nvidia overperformed the Nasdaq by ~3%.

And this was how the Wall Street Journal described that, even then:

No, I didn’t buy more on Friday, I keep telling myself I have Nvidia at home. Indeed I do have Nvidia at home. I keep kicking myself, but that’s how every trade is – either you shouldn’t have done it, or you should have done more. I don’t know that there will be another moment like this one, but if there is another moment this obvious, I hereby pledge in public to at least top off a little bit, Nick is correct in his attitude here you do not need to do the research because you know this isn’t priced in but in expectation you can assume that everything you are not thinking about is priced in.

And now, as I finish this up, Nvidia has given most of those gains back on no news that seems important to me. You could claim that means yes, priced in. I don’t agree.

Spencer Schiff (on Friday): In a sane world the front pages of all mainstream news websites would be filled with o3 headlines right now

The traditional media, instead, did not notice it. At all.

And one can’t help but suspect this was highly intentional. Why else would you announce such a big thing on the Friday afternoon before Thanksgiving?

They did successfully hype it among AI Twitter, also known as ‘the future.’

Bindu Reddy: The o3 announcement was a MASTERSTROKE by OpenAI

The buzz about it is so deafening that everything before it has been be wiped out from our collective memory!

All we can think of is this mythical model that can solve insanely hard problems 😂

Nick: the whole thing is so thielian.

If you’re going to take on a giant market doing probably illegal stuff call yourself something as light and bouba as possible, like airbnb, lyft

If you’re going to announce agi do it during a light and happy 12 days of christmas short demo.

Sam Altman (replying to Nick!): friday before the holidays news dump.

Well, then.

In that crowd, it was all ‘software engineers are cooked’ and people filled with some mix of excitement and existential dread.

But back in the world where everyone else lives…

Benjamin Todd: Most places I checked didn’t mention AI at all, or they’d only have a secondary story about something else like AI and copyright. My twitter is a bubble and most people have no idea what’s happening.

OpenAI: we’ve created a new AI architecture that can provide expert level answers in science, math and coding, which could herald the intelligence explosion.

The media: bond funds!

Davidad: As Matt Levine used to say, People Are Worried About Bond Market Liquidity.

Here is that WSJ story, talking about how GPT-5 or ‘Orion’ has failed to exhibit big intelligence gains despite multiple large training runs. It says ‘so far, the vibes are off,’ and says OpenAI is running into a data wall and trying to fill it with synthetic data. If so, well, they had o1 for that, and now they have o3. The article does mention o1 as the alternative approach, but is throwing shade even there, so expensive it is.

And we have this variation of that article, in the print edition, on Saturday, after o3:

Sam Altman: I think The Wall Street Journal is the overall best U.S. newspaper right now, but they published an article called “The Next Great Leap in AI Is Behind Schedule and Crazy Expensive” many hours after we announced o3?

It wasn’t only WSJ either, there’s also Bloomberg, which normally I love:

On Monday I did find coverage of o3 in Bloomberg, but it not only wasn’t on the front page it wasn’t even on the front tech page, I had to click through to AI.

Another fun one, from Thursday, here’s the original in the NY Times:

Is it Cade Metz? Yep, it’s Cade Metz and also Tripp Mickle. To be fair to them, they do have Demis Hassabis quotes saying chatbot improvements would slow down. And then there’s this, love it:

Not everyone in the A.I. world is concerned. Some, like OpenAI’s chief executive, Sam Altman, say that progress will continue at the same pace, albeit with some twists on old techniques.

That post also mentions both synthetic data and o1.

OpenAI recently released a new system called OpenAI o1 that was built this way. But the method only works in areas like math and computing programming, where there is a firm distinction between right and wrong.

It works best there, yes, but that doesn’t mean it’s the only place that works.

We also had Wired with the article ‘Generative AI Still Needs to Prove Its Usefulness.’

True, you don’t want to make the opposite mistake either, and freak out a lot over something that is not available yet. But this was ridiculous.

I realized I wanted to say more here and have this section available as its own post. So more on this later.

Oh no!

Oh no!

Mikael Brockman: o3 is going to be able to create incredibly complex solutions that are incorrect in unprecedentedly confusing ways.

We made everything astoundingly complicated, thus solving the problem once and for all.

Humans will be needed to look at the output of AGI and say, “What the fis this? Delete it.”

Oh no!

Discussion about this post

o3, Oh My Read More »

hertz-continues-ev-purge,-asks-renters-if-they-want-to-buy-instead-of-return

Hertz continues EV purge, asks renters if they want to buy instead of return

Apparently Hertz’s purging of electric vehicles from its fleet isn’t going fast enough for the car rental giant. A Reddit user posted an offer they received from Hertz to buy the 2023 Tesla Model 3 they had been renting for $17,913.

Hertz originally went strong into EVs, announcing a plan to buy 100,000 Model 3s for its fleet by the end of 2021, but 16 months later had acquired only half that amount. The company found that repair costs—especially for Teslas, which averaged 20 percent more than other EVs—were cutting into its profit margins. Customer demand was also not what Hertz had hoped for; last January, it announced plans to sell off 20,000 EVs.

Asking its customers if they want to purchase their rentals isn’t a new strategy for Hertz. “By connecting our rental customers who opt into our emails to our sales channels, we’re not only building awareness of the fact that we sell arsenal but also offering a unique opportunity to someone who may be in the market for the same car they have on rent,” Hertz communications director Jamie Line told The Verge.

Hertz is advertising a limited 12-month, 12,000-mile powertrain warranty for each EV, and customers will have seven days to return the car in case of profound buyer’s regret.

According to The Verge, offers have ranged from $18,422 for a 2023 Chevy Bolt to $28,500 for a Polestar 2. We spotted some good deals from Hertz when we last checked, with some still eligible for a federal tax credit.

Hertz’s EV sell off may be winding down, however. Last March we saw more than 2,100 BEVs for sale on the company’s used car site. When we checked this morning, there were just 175 left.

Hertz continues EV purge, asks renters if they want to buy instead of return Read More »

ai-#96:-o3-but-not-yet-for-thee

AI #96: o3 But Not Yet For Thee

The year in models certainly finished off with a bang.

In this penultimate week, we get o3, which purports to give us vastly more efficient performance than o1, and also to allow us to choose to spend vastly more compute if we want a superior answer.

o3 is a big deal, making big gains on coding tests, ARC and some other benchmarks. How big a deal is difficult to say given what we know now. It’s about to enter full fledged safety testing.

o3 will get its own post soon, and I’m also pushing back coverage of Deliberative Alignment, OpenAI’s new alignment strategy, to incorporate into that.

We also got DeepSeek v3, which claims to have trained a roughly Sonnet-strength model for only $6 million and 37b active parameters per token (671b total via mixture of experts).

DeepSeek v3 gets its own brief section with the headlines, but full coverage will have to wait a week or so for reactions and for me to read the technical report.

Both are potential game changers, both in their practical applications and in terms of what their existence predicts for our future. It is also too soon to know if either of them is the real deal.

Both are mostly not covered here quite yet, due to the holidays. Stay tuned.

  1. Language Models Offer Mundane Utility. Make best use of your new AI agents.

  2. Language Models Don’t Offer Mundane Utility. The uncanny valley of reliability.

  3. Flash in the Pan. o1-style thinking comes to Gemini Flash. It’s doing its best.

  4. The Six Million Dollar Model. Can they make it faster, stronger, better, cheaper?

  5. And I’ll Form the Head. We all have our own mixture of experts.

  6. Huh, Upgrades. ChatGPT can use Mac apps, unlimited (slow) holiday Sora.

  7. o1 Reactions. Many really love it, others keep reporting being disappointed.

  8. Fun With Image Generation. What is your favorite color? Blue. It’s blue.

  9. Introducing. Google finally gives us LearnLM.

  10. They Took Our Jobs. Why are you still writing your own code?

  11. Get Involved. Quick reminder that opportunity to fund things is everywhere.

  12. In Other AI News. Claude gets into a fight over LessWrong moderation.

  13. You See an Agent, You Run. Building effective agents by not doing so.

  14. Another One Leaves the Bus. Alec Radford leaves OpenAI.

  15. Quiet Speculations. Estimates of economic growth keep coming in super low.

  16. Lock It In. What stops you from switching LLMs?

  17. The Quest for Sane Regulations. Sriram Krishnan joins the Trump administration.

  18. The Week in Audio. The many faces of Yann LeCun. Anthropic’s co-founders talk.

  19. A Tale as Old as Time. Ask why mostly in a predictive sense.

  20. Rhetorical Innovation. You won’t not wear the fing hat.

  21. Aligning a Smarter Than Human Intelligence is Difficult. Cooperate with yourself.

  22. People Are Worried About AI Killing Everyone. I choose you.

  23. The Lighter Side. Please, no one call human resources.

How does your company make best use of AI agents? Austin Vernon frames the issue well: AIs are super fast, but they need proper context. So if you want to use AI agents, you’ll need to ensure they have access to context, in forms that don’t bottleneck on humans. Take the humans out of the loop, minimize meetings and touch points. Put all your information into written form, such as within wikis. Have automatic tests and approvals, but have the AI call for humans when needed via ‘stop work authority’ – I would flip this around and let the humans stop the AIs, too.

That all makes sense, and not only for corporations. If there’s something you want your future AIs to know, write it down in a form they can read, and try to design your workflows such that you can minimize human (your own!) touch points.

To what extent are you living in the future? This is the CEO of playground AI, and the timestamp was Friday:

Suhail: I must give it to Anthropic, I can’t use 4o after using Sonnet. Huge shift in spice distribution!

How do you educate yourself for a completely new world?

Miles Brundage: The thing about “truly fully updating our education system to reflect where AI is headed” is that no one is doing it because it’s impossible.

The timescales involved, especially in early education, are lightyears beyond what is even somewhat foreseeable in AI.

Some small bits are clear: earlier education should increasingly focus on enabling effective citizenship, wellbeing, etc. rather than preparing for paid work, and short-term education should be focused more on physical stuff that will take longer to automate. But that’s about it.

What will citizenship mean in the age of AI? I have absolutely no idea. So how do you prepare for that? Largely the same goes for wellbeing. A lot of this could be thought of as: Focus on the general and the adaptable, and focus less on the specific, including things specifically for Jobs and other current forms of paid work – you want to be creative and useful and flexible and able to roll with the punches.

That of course assumes that you are taking the world as given, rather than trying to change the course of history. In which case, there’s a very different calculation.

Large parts of every job are pretty dumb.

Shako: My team, full of extremely smart and highly paid Ph.D.s, spent $10,000 of our time this week figuring out where in a pipeline a left join was bringing in duplicates, instead of the strategic thinking we were capable of. In the short run, AI will make us far more productive.

Gallabytes: The two most expensive bugs in my career have been simple typos.

ChatGPT is a left-leaning midwit, so Paul Graham is using it to see what parts of his new essay such midwits will dislike, and which ones you can get it to acknowledge are true. I note that you could probably use Claude to simulate whatever Type of Guy you would like, if you have ordinary skill in the art.

Strongly agree with this:

Theo: Something I hate when using Cursor is, sometimes, it will randomly delete some of my code, for no reason

Sometimes removing an entire feature 😱

I once pushed to production without being careful enough and realized a few hours later I had removed an entire feature …

Filippo Pietrantonio: Man that happens all the time. In fact now I tell it in every single prompt to not delete any files and keep all current functionalities and backend intact.

Davidad: Lightweight version control (or at least infinite-undo functionality!) should be invoked before and after every AI agent action in human-AI teaming interfaces with artifacts of any kind.

Gary: Windsurf has this.

Jacques: Cursor actually does have a checkpointing feature that allows you to go back in time if something messes up (at least the Composer Agent mode does).

In Cursor I made an effort to split up files exactly because I found I had to always scan the file being changed to ensure it wasn’t about to go silently delete anything. The way I was doing it you didn’t have to worry it was modifying or deleting other files.

On the plus side, now I know how to do reasonable version control.

The uncanny valley problem here is definitely a thing.

Ryan Lackey: I hate Apple Intelligence email/etc. summaries. They’re just off enough to make me think it is a new email in thread, but not useful enough to be a good summary. Uncanny valley.

It’s really good for a bunch of other stuff. Apple is just not doing a good job on the utility side, although the private computing architecture is brilliant and inspiring.

The latest rival to at least o1-mini is Gemini-2.0-Flash-Thinking, which I’m tempted to refer to (because of reasons) as gf1.

Jeff Dean: Considering its speed, we’re pretty happy with how the experimental Gemini 2.0 Flash Thinking model is performing on lmsys.

Gemini 2.0 Flash Thinking is now essentially tied at the top of the overall leaderboard with Gemini-Exp-1206, which is essentially a beta of Gemini Pro 2.0. This tells us something about the model, but also reinforces that this metric is bizarre now. It puts us in a strange spot. What is the scenario where you will want Flash Thinking rather than o1 (or o3!) and also rather than Gemini Pro, Claude Sonnet, Perplexity or GPT-4o?

One cool thing about Thinking is that (like DeepSeek’s Deep Thought) it explains its chain of thought much better than o1.

Deedy was impressed.

Deedy: Google really cooked with Gemini 2.0 Flash Thinking.

It thinks AND it’s fast AND it’s high quality.

Not only is it #1 on LMArena on every category, but it crushes my goto Math riddle in 14s—5x faster than any other model that can solve it!

o1 and o1 Pro took 102s and 138s respectively for me on this task.

Here’s another math puzzle where o1 got it wrong and took 3.5x the time:

“You have 60 red and 40 blue socks in a drawer, and you keep drawing a sock uniformly at random until you have drawn all the socks of one color. What is the expected number of socks left in the drawer?”

That result… did not replicate when I tried it. It went off the rails, and it went off them hard. And it went off them in ways that make me skeptical that you can use this for anything of the sort. Maybe Deedy got lucky?

Other reports I’ve seen are less excited about quality, and when o3 got announced it seemed everyone got distracted.

What about Gemini 2.0 Experimental (e.g. the beta of Gemini 2.0 Pro, aka Gemini-1206)?

It’s certainly a substantial leap over previous Gemini Pro versions and it is atop the Arena. But I don’t see much practical eagerness to use it, and I’m not sure what the use case is there where it is the right tool.

Eric Neyman is impressed:

Eric Neyman: Guys, we have a winner!! Gemini 2.0 Flash Thinking Experimental is the first model I’m aware of to get my benchmark question right.

Eric Neyman: Every time a new LLM comes out, I ask it one question: What is the smallest integer whose square is between 15 and 30? So far, no LLM has gotten this right.

That one did replicate for me, and the logic is fine, but wow do some models make life a little tougher than it is, think faster and harder not smarter I suppose:

I mean, yes, that’s all correct, but… wow.

Gallabyetes: flash reasoning is super janky.

it’s got the o1 sauce but flash is too weak I’m sorry.

in tic tac toe bench it will frequently make 2 moves at once.

Flash isn’t that much worse than GPT-4o in many ways, but certainly it could be better. Presumably the next step is to plug in Gemini Pro 2.0 and see what happens?

Teortaxes was initially impressed, but upon closer examination is no longer impressed.

Having no respect for American holidays, DeepSeek dropped their v3 today.

DeepSeek: 🚀 Introducing DeepSeek-V3!

Biggest leap forward yet:

⚡ 60 tokens/second (3x faster than V2!)

💪 Enhanced capabilities

🛠 API compatibility intact

🌍 Fully open-source models & papers

🎉 What’s new in V3?

🧠 671B MoE parameters

🚀 37B activated parameters

📚 Trained on 14.8T high-quality tokens

Model here. Paper here.

💰 API Pricing Update

🎉 Until Feb 8: same as V2!

🤯 From Feb 8 onwards:

Input: $0.27/million tokens ($0.07/million tokens with cache hits)

Output: $1.10/million tokens

🔥 Still the best value in the market!

🌌 Open-source spirit + Longtermism to inclusive AGI

🌟 DeepSeek’s mission is unwavering. We’re thrilled to share our progress with the community and see the gap between open and closed models narrowing.

🚀 This is just the beginning! Look forward to multimodal support and other cutting-edge features in the DeepSeek ecosystem.

💡 Together, let’s push the boundaries of innovation!

If this performs halfway as well as its evals, this was a rather stunning success.

Teortaxes: And here… we… go.

So, that line in config. Yes it’s about multi-token prediction. Just as a better training obj – though they leave the possibility of speculative decoding open.

Also, “muh 50K Hoppers”:

> 2048 NVIDIA H800

> 2.788M H800-hours

2 months of training. 2x Llama 3 8B.

Haseeb: Wow. Insanely good coding model, fully open source with only 37B active parameters. Beats Claude and GPT-4o on most benchmarks. China + open source is catching up… 2025 will be a crazy year.

Andrej Karpathy: DeepSeek (Chinese AI co) making it look easy today with an open weights release of a frontier-grade LLM trained on a joke of a budget (2048 GPUs for 2 months, $6M).

For reference, this level of capability is supposed to require clusters of closer to 16K GPUs, the ones being brought up today are more around 100K GPUs. E.g. Llama 3 405B used 30.8M GPU-hours, while DeepSeek-V3 looks to be a stronger model at only 2.8M GPU-hours (~11X less compute). If the model also passes vibe checks (e.g. LLM arena rankings are ongoing, my few quick tests went well so far) it will be a highly impressive display of research and engineering under resource constraints.

Does this mean you don’t need large GPU clusters for frontier LLMs? No but you have to ensure that you’re not wasteful with what you have, and this looks like a nice demonstration that there’s still a lot to get through with both data and algorithms.

Very nice & detailed tech report too, reading through.

It’s a mixture of experts model with 671b total parameters, 37b activate per token.

As always, not so fast. DeepSeek is not known to chase benchmarks, but one never knows the quality of a model until people have a chance to bang on it a bunch.

If they did train a Sonnet-quality model for $6 million in compute, then that will change quite a lot of things.

Essentially no one has reported back on what this model can do in practice yet, and it’ll take a while to go through the technical report, and more time to figure out how to think about the implications. And it’s Christmas.

So: Check back later for more.

Increasingly the correct solution to ‘what LLM or other AI product should I use?’ is ‘you should use a variety of products depending on your exact use case.’

Gallabytes: o1 Pro is by far the smartest single-turn model.

Claude is still far better at conversation.

Gemini can do many things quickly and is excellent at editing code.

Which almost makes me think the ideal programming workflow right now is something somewhat unholy like:

  1. Discuss, plan, and collect context with Sonnet.

  2. Sonnet provides a detailed request to o1 (Pro).

  3. o1 spits out the tricky code.

    1. In simple cases (most of them), it could make the edit directly.

    2. For complicated changes, it could instead output a detailed plan for each file it needs to change and pass the actual making of that change to Gemini Flash.

This is too many steps. LLM orchestration spaghetti. But this feels like a real direction.

This is mostly the same workflow I used before o1, when there was only Sonnet. I’d discuss to form a plan, then use that to craft a request, then make the edits. The swap doesn’t seem like it makes things that much trickier, the logistical trick is getting all the code implementation automated.

ChatGPT picks up integration with various apps on Mac including Warp, ItelliJ Idea, PyCharm, Apple Notes, Notion, Quip and more, including via voice mode. That gives you access to outside context, including an IDE and a command line and also your notes. Windows (and presumably more apps) coming soon.

Unlimited Sora available to all Plus users on the relaxed queue over the holidays, while the servers are otherwise less busy.

Requested upgrade: Evan Conrad requests making voice mode on ChatGPT mobile show the transcribed text. I strongly agree, voice modes should show transcribed text, and also show a transcript after, and also show what the AI is saying, there is no reason to not do these things. Looking at you too, Google. The head of applied research at OpenAI replied ‘great idea’ so hopefully we get this one.

Dean Ball is an o1 and o1 pro fan for economic history writing, saying they’re much more creative and cogent at combining historic facts with economic analysis versus other models.

This seems like an emerging consensus of many, except different people put different barriers on the math/code category (e.g. Tyler Cowen includes economics):

Aidan McLau: I’ve used o1 (not pro mode) a lot over the last week. Here’s my extensive review:

>It’s really insanely mind-blowingly good at math/code.

>It’s really insanely mind-blowingly mid at everything else.

The OOD magic isn’t there. I find it’s worse at writing than o1-preview; its grasp of the world feels similar to GPT-4o?!?

Even on some in-distribution tasks (like asking to metaphorize some tricky math or predicting the effects of a new algorithm), it kind of just falls apart. I’ve run it head-to-head against Newsonnet and o1-preview, and it feels substantially worse.

The Twitter threadbois aren’t wrong, though; it’s a fantastic tool for coding. I had several diffs on deck that I had been struggling with, and it just solved them. Magical.

Well, yeah, because it seems like it is GPT-4o under the hood?

Christian: Man, I have to hard disagree on this one — it can find all kinds of stuff in unstructured data other models can’t. Throw in a transcript and ask “what’s the most important thing that no one’s talking about?”

Aiden McLau: I’ll try this. how have you found it compared to newsonnet?

Christian: Better. Sonnet is still extremely charismatic, but after doing some comparisons and a lot of product development work, I strongly suspect that o1’s ability to deal with complex codebases and ultimately produce more reliable answers extends to other domains…

Gallabytes is embracing the wait.

Gallabytes: O1 Pro is good, but I must admit the slowness is part of what I like about it. It makes it feel more substantial; premium. Like when a tool has a pleasing heft. You press the buttons, and the barista grinds your tokens one at a time, an artisanal craft in each line of code.

David: I like it too but I don’t know if chat is the right interface for it, I almost want to talk to it via email or have a queue of conversations going

Gallabytes: Chat is a very clunky interface for it, for sure. It also has this nasty tendency to completely fail on mobile if my screen locks or I switch to another app while it is thinking. Usually, this is unrecoverable, and I have to abandon the entire chat.

NotebookLM and deep research do this right – “this may take a few minutes, feel free to close the tab”

kinda wild to fail at this so badly tbh.

Here’s a skeptical take.

Jason Lee: O1-pro is pretty useless for research work. It runs for near 10 min per prompt and either 1) freezes, 2) didn’t follow the instructions and returned some bs, or 3) just made some simple error in the middle that’s hard to find.

@OpenAI@sama@markchen90 refund me my $200

Damek Davis: I tried to use it to help me solve a research problem. The more context I gave it, the more mistakes it made. I kept abstracting away more and more details about the problem in hopes that o1 pro could solve it. The problem then became so simple that I just solved it myself.

Flip: I use o1-pro on occasion, but the $200 is mainly worth it for removing the o1 rate limits IMO.

I say Damek got his $200 worth, no?

If you’re using o1 a lot, removing the limits there is already worth $200/month, even if you rarely use o1 Pro.

There’s a phenomenon where people think about cost and value in terms of typical cost, rather than thinking in terms of marginal benefit. Buying relatively expensive but in absolute terms cheap things is often an amazing play – there are many things where 10x the price for 10% better is an amazing deal for you, because your consumer surplus is absolutely massive.

Also, once you take 10 seconds, there’s not much marginal cost to taking 10 minutes, as I learned with Deep Research. You ask your question, you tab out, you do something else, you come back later.

That said, I’m not currently paying the $200, because I don’t find myself hitting the o1 limits, and I’d mostly rather use Claude. If it gave me unlimited uses in Cursor I’d probably slam that button the moment I have the time to code again (December has been completely insane).

I don’t know that this means anything but it is at least fun.

Davidad: One easy way to shed some light on the orthogonality thesis, as models get intelligent enough to cast doubt on it, is values which are inconsequential and not explicitly steered, such as favorite colors. Same prompting protocol for each swatch (context cleared between swatches)

All outputs were elicited in oklch. Models are sorted in ascending order of hue range. Gemini Experimental 1206 comes out on top by this metric, zeroing in on 255-257° hues, but sampling from huge ranges of luminosity and chroma.

There are some patterns here, especially that more powerful models seem to converge on various shades of blue, whereas less powerful models are all over the place. As I understand it, this isn’t testing orthogonality in the sense of ‘all powerful minds prefer blue’ rather it is ‘by default sufficiently powerful minds trained in the way we typically train them end up preferring blue.’

I wonder if this could be used as a quick de facto model test in some way.

There was somehow a completely fake ‘true crime’ story about an 18-year-old who was supposedly paid to have sex with women in his building where the victim’s father was recording videos and selling them in Japan… except none of that happened and the pictures are AI fakes?

Google introduces LearnLM, available for preview in Google AI Studio, designed to facilitate educational use cases, especially in science. They say it ‘outperformed other leading AI models when it comes to adhering to the principles of learning science’ which does not sound like something you would want Feynman hearing you say. It incorporates search, YouTube, Android and Google Classroom.

Sure, sure. But is it useful? It was supposedly going to be able to do automated grading, handles routine paperwork, plans curriculums, track student progress and personalizes their learning paths and so on, but any LLM can presumably do all those things if you set it up properly.

This sounds great, totally safe and reliable, other neat stuff like that.

Sully: LLMs writing code in AI apps will become the standard.

No more old-school no-code flows.

The models handle the heavy lifting, and it’s insane how good they are.

Let agents build more agents.

He’s obviously right about this. It’s too convenient, too much faster. Indeed, I expect we’ll see a clear division between ‘code you can have the AI write’ which happens super fast, and ‘code you cannot let the AI write’ because of corporate policy or security issues, both legit and not legit, which happens the old much slower way.

Complement versus supplement, economic not assuming the conclusion edition.

Maxwell Tabarrok: The four futures for cognitive labor:

  1. Like mechanized farming. Highly productive and remunerative, but a small part of the economy.

  2. Like writing after the printing press. Each author 100 times more productive and 100 times more authors.

  3. Like “computers” after computers. Current tasks are completely replaced, but tasks at a higher level of abstraction, like programming, become even more important.

  4. Or, most pessimistically, like ice harvesting after refrigeration. An entire industry replaced by machines without compensating growth.

Ajeya Cotra: I think we’ll pass through 3 and then 1, but the logical end state (absent unprecedentedly sweeping global coordination to refrain from improving and deploying AI technology) is 4.

Ryan Greenblatt: Why think takeoff will be slow enough to ever be at 1? 1 requires automating most cognitive work but with an important subset not-automatable. By the time deployment is broad enough to automate everything I expect AIs to be radically superhuman in all domains by default.

I can see us spending time in #1. As Roon says, AI capabilities progress has been spiky, with some human-easy tasks being hard and some human-hard tasks being easy. So the 3→1 path makes some sense, if progress isn’t too quick, including if the high complexity tasks start to cost ‘real money’ as per o3 so choosing the right questions and tasks becomes very important. Alternatively, we might get our act together enough to restrict certain cognitive tasks to humans even though AIs could do them, either for good reasons or rent seeking reasons (or even ‘good rent seeking’ reasons?) to keep us in that scenario.

But yeah, the default is a rapid transition to #4, and for that to happen to all labor not only cognitive labor. Robotics is hard, it’s not impossible.

One thing that has clearly changed is AI startups have very small headcounts.

Harj Taggar: Caught up with some AI startups recently. A two founder team that reached 1.5m ARR and has only hired one person.

Another single founder at 1m ARR and will 3x within a few months.

The trajectory of early startups is steepening just like the power of the models they’re built on.

An excellent reason we still have our jobs is that people really aren’t willing to invest in getting AI to work, even when they know it exists, if it doesn’t work right away they typically give up:

Dwarkesh Patel: We’re way more patient in training human employees than AI employees.

We will spend weeks onboarding a human employee and giving slow detailed feedback. But we won’t spend just a couple of hours playing around with the prompt that might enable the LLM to do the exact same job, but more reliably and quickly than any human.

I wonder if this partly explains why AI’s economic impact has been relatively minor so far.

PoliMath reports it is very hard out there trying to find tech jobs, and public pipelines for applications have stopped working entirely. AI presumably has a lot to do with this, but the weird part is his report that there have been a lot of people who wanted to hire him, but couldn’t find the authority.

Benjamin Todd points out what I talked about after my latest SFF round, that the dynamics of nonprofit AI safety funding mean that there’s currently great opportunities to donate to.

After some negotiation with the moderator Raymond Arnold, Claude (under Janus’s direction) is permitted to comment on Janus’s Simulators post on LessWrong. It seems clear that this particular comment should be allowed, and also that it would be unwise to have too general of a ‘AIs can post on LessWrong’ policy, mostly for the reasons Raymond explains in the thread. One needs a coherent policy. It seems Claude was somewhat salty about the policy of ‘only believe it when the human vouches.’ For now, ‘let Janus-directed AIs do it so long as he approves the comments’ seems good.

Jan Kulveit offers us a three-layer phenomenological model of LLM psychology, based primarily on Claude, not meant to be taken literally:

  1. The Surface Layer are a bunch of canned phrases and actions you can trigger, and which you will often want to route around through altering context. You mostly want to avoid triggering this layer.

  2. The Character Layer, which is similar to what it sounds like in a person and their personality, which for Opus and Sonnet includes a generalized notion of what Jan calls ‘goodness’ or ‘benevolence.’ This comes from a mix of pre-training, fine-tuning and explicit instructions.

  3. The Predictive Ground Layer, the simulator, deep pattern matcher, and next word predictor. Brilliant and superhuman in some ways, strangely dense in others.

In this frame, a self-aware character layer leads to reasoning about the model’s own reasoning, and to goal driven behavior, with everything that follows from those. Jan then thinks the ground layer can also become self-aware.

I don’t think this is technically an outright contradiction to Andreessen’s ‘huge if true’ claims that the Biden administration saying it would conspire to ‘totally control’ AI and put it in the hands of 2-3 companies and that AI startups ‘wouldn’t be allowed.’ But Sam Altman reports never having heard anything of the sort, and quite reasonably says ‘I don’t even think the Biden administration is competent enough to’ do it. In theory they could both be telling the truth – perhaps the Biden administration told Andreessen about this insane plan directly, despite telling him being deeply stupid, and also hid it from Altman despite that also then being deeply stupid – but mostly, yeah, at least one of them is almost certainly lying.

Benjamin Todd asks how OpenAI has maintained their lead despite losing so many of their best researchers. Part of it is that they’ve lost all their best safety researchers, but they only lost Radford in December, and they’ve gone on a full hiring binge.

In terms of traditionally trained models, though, it seems like they are now actively behind. I would much rather use Claude Sonnet 3.5 (or Gemini-1206) than GPT-4o, unless I needed something in particular from GPT-4o. On the low end, Gemini Flash is clearly ahead. OpenAI’s attempts to directly go beyond GPT-4o have, by all media accounts, faile, and Anthropic is said to be sitting on Claude Opus 3.5.

OpenAI does have o1 and soon o3, where no one else has gotten there yet, no Google Flash Thinking and Deep Thought do not much count.

As far as I can tell, OpenAI has made two highly successful big bets – one on scaling GPTs, and now one on the o1 series. Good choices, and both instances of throwing massively more compute at a problem, and executing well. Will this lead persist? We shall see. My hunch is that it won’t unless the lead is self-sustaining due to low-level recursive improvements.

Anthropic offers advice on building effective agents, and when to use them versus use workflows that have predesigned code paths. The emphasis is on simplicity. Do the minimum to accomplish your goals. Seems good for newbies, potentially a good reminder for others.

Hamuel Husain: Whoever wrote this article is my favorite person. I wish I knew who it was.

People really need to hear [to only use multi-step agents or add complexity when it is actually necessary.]

[Turns out it was written by Erik Shluntz and Barry Zhang].

A lot of people have left OpenAI.

Usually it’s a safety researcher. Not this time. This time it’s Alec Radford.

He’s the Canonical Brilliant AI Capabilities Researcher, whose love is by all reports doing AI research. He is leaving ‘to do independent research.

This is especially weird given he had to have known about o3, which seems like an excellent reason to want to do your research inside OpenAI.

So, well, whoops?

Rohit: WTF now Radford !?!

Teortaxes: I can’t believe it, OpenAI might actually be in deep shit. Radford has long been my bellwether for what their top tier talent without deep ideological investment (which Ilya has) sees in the company.

In what Tyler Cowen calls ‘one of the better estimates in my view,’ an OECD working paper estimates total factor productivity growth at an annualized 0.25%-0.6% (0.4%-0.9% for labor). Tyler posted that on Thursday, the day before o3 was announced, so revise that accordingly. Even without o3 and assuming no substantial frontier model improvements from there, I felt this was clearly too low, although it is higher than many economist-style estimates. One day later we had (the announcement of) o3.

Ajeya Cotra: My take:

  1. We do not have an AI agent that can fully automate research and development.

  2. We could soon.

  3. This agent would have enormously bigger impacts than AI products have had so far.

  4. This does not require a “paradigm shift,” just the same corporate research and development that took us from GPT-2 to o3.

Fully would of course go completely crazy. That would be that. But even a dramatic speedup would be a pretty big deal, and also fully would then not be so far behind.

Reminder of the Law of Conservation of Expected Evidence, if you conclude ‘I think we’re in for some big surprises’ then you should probably update now.

However this is not fully or always the case. It would be a reasonable model to say that the big surprises follow a Poisson distribution drawn from an unknown frequency, with the magnitude of the surprise also drawn from a power distribution – which seems like a very reasonable prior.

That still means every big surprise is still a big surprise, the same way that if you expect.

Eliezer Yudkowsky: Okay. Look. Imagine how you’d have felt if an AI had just proved the Riemann Hypothesis.

Now you will predictably, at some point, get that news LATER, if we’re not all dead before then. So you can go ahead and feel that way NOW, instead of acting surprised LATER.

So if you ask me how I’m reacting to a carelessly-aligned commercial AI demonstrating a large leap on some math benchmarks, my answer is that you saw my reactions in 1996, 2001, 2003, and 2015, as different parts of that future news became obvious to me or rose in probability.

I agree that a sensible person could feel an unpleasant lurch about when the predictable news had arrived. The lurch was small, in my case, but it was there. Most of my Twitter TL didn’t sound like that was what was being felt.

Dylan Dean: Eliezer it’s also possible that an AI will disprove the Riemann Hypothesis, this is unsubstantiated doomerism.

Eliezer Yudkowsky: Valid. Not sound, but valid.

You should feel that shock now if you haven’t, then slowly undo some of that shock every day that the estimated date of that gets later, then have some of the shock left for when it suddenly becomes zero days or the timeline gets shorter. Updates for everyone.

Claims about consciousness, related to o3. I notice I am confused about such things.

The Verge says 2025 will be the year of AI agents the smart lock? I mean, okay, I suppose they’ll get better, but I have a feeling we’ll be focused elsewhere.

Ryan Greenblatt, author of the recent Redwood/Anthropic paper, predicts 2025:

Ryan Greenblatt (December 20, after o3 was announced): Now seems like a good time to fill out your forecasts : )

My medians are driven substantially lower by people not really trying on various benchmarks and potentially not even testing SOTA systems on them.

My 80% intervals include saturation for everything and include some-adaptation-required remote worker replacement for hard jobs.

My OpenAI preparedness probabilities are driven substantially lower by concerns around underelicitation on these evaluations and general concerns like [this].

I continue to wonder how much this will matter:

Smoke-away: If people spend years chatting and building a memory with one AI, they will be less likely to switch to another AI.

Just like iPhone and Android.

Once you’re in there for years you’re less likely to switch.

Sure 10 or 20% may switch AI models for work or their specific use case, but most will lock in to one ecosystem.

People are saying that you can copy Memories and Custom Instructions.

Sure, but these models behave differently and have different UIs. Also, how many do you want to share your memories with?

Not saying you’ll be forced to stay with one, just that most people will choose to.

Also like relationships with humans, including employees and friends, and so on.

My guess is the lock-in will be substantial but mostly for terribly superficial reasons?

For now, I think people are vastly overestimating memories. The memory functions aren’t nothing but they don’t seem to do that much.

Custom instructions will always be a power user thing. Regular people don’t use custom instructions, they literally never go into the settings on any program. They certainly didn’t ‘do the work’ of customizing them to the particular AI through testing and iterations – and for those who did do that, they’d likely be down for doing it again.

What I think matters more is that the UIs will be different, and the behaviors and correct prompts will be different, and people will be used to what they are used to in those ways.

The flip side is that this will take place in the age of AI, and of AI agents. Imagine a world, not too long from now, where if you shift between Claude, Gemini and ChatGPT, they will ask if you want their agent to go into the browser and take care of everything to make the transition seamless and have it work like you want it to work. That doesn’t seem so unrealistic.

The biggest barrier, I presume, will continue to be inertia, not doing things and not knowing why one would want to switch. Trivial inconveniences.

Sriram Krishnan, formerly of a16z, will be working with David Sacks in the White House Office of Science and Technology. I’ve had good interactions with him in the past and I wish him the best of luck.

The choice of Sriram seems to have led to some rather wrongheaded (or worse) pushback, and for some reason a debate over H1B visas. As in, there are people who for some reason are against them, rather than the obviously correct position that we need vastly more H1B visas. I have never heard a person I respect not favor giving out far more H1B visas, once they learn what such visas are. Never.

Also joining the administration are Michael Kratsios, Lynne Parker and Bo Hines. Bo Hines is presumably for crypto (and presumably strongly for crypto), given they will be executive director of the new Presidential Council of Advisors for Digital Assets. Lynne Parker will head the Presidential Council of Advisors for Science and Technology, Kratsios will direct the office of science and tech policy (OSTP).

Miles Brundage writes Time’s Up for AI Policy, because he believes AI that exceeds human performance in every cognitive domain is almost certain to be built and deployed in the next few years.

If you believe time is as short as Miles thinks it is, then this is very right – you need to try and get the policies in place in 2025, because after that it might be too late to matter, and the decisions made now will likely lock us down a path. Even if we have somewhat more time than that, we need to start building state capacity now.

Actual bet on beliefs spotted in the wild: Miles Brundage versus Gary Marcus, Miles is laying $19k vs. $1k on a set of non-physical benchmarks being surpassed by 2027, accepting Gary’s offered odds. Good for everyone involved. As a gambler, I think Miles laid more odds than was called for here, unless Gary is admitting that Miles does probably win the bet? Miles said ‘almost certain’ but fair odds should meet in the middle between the two sides. But the flip side is that it sends a very strong message.

We need a better model of what actually impacts Washington’s view of AI and what doesn’t. They end up in some rather insane places, such as Dean Ball’s report here that DC policy types still cite a 2023 paper using a 125 million (!) parameter model as if it were definitive proof that synthetic data always leads to model collapse, and it’s one of the few papers they ever cite. He explains it as people wanting this dynamic to be true, so they latch onto the paper.

Yo Shavit, who does policy at OpenAI, considers the implications of o3 under a ‘we get ASI but everything still looks strangely normal’ kind of world.

It’s a good thread, but I notice – again – that this essentially ignores the implications of AGI and ASI, in that somehow it expects to look around and see a fundamentally normal world in a way that seems weird. In the new potential ‘you get ASI but running it is super expensive’ world of o3, that seems less crazy than it does otherwise, and some of the things discussed would still apply even then.

The assumption of ‘kind of normal’ is always important to note in places like this, and one should note which places that assumption has to hold and which it doesn’t.

Point 5 is the most important one, and still fully holds – that technical alignment is the whole ballgame, in that if you fail at that you fail automatically (but you still have to play and win the ballgame even then!). And that we don’t know how hard this is, but we do know we have various labs (including Yo’s own OpenAI) under competitive pressures and poised to go on essentially YOLO runs to superintelligence while hoping it works out by default.

Whereas what we need is either a race to what he calls ‘secure, trustworthy, reliable AGI that won’t burn us’ or ideally a more robust target than that or ideally not a race at all. And we really need to not do that – no matter how easy or hard alignment turns out to be, we need to maximize our chances of success over that uncertainty.

Yo Shavit: Now that everyone knows about o3, and imminent AGI is considered plausible, I’d like to walk through some of the AI policy implications I see.

These are my own takes and in no way reflective of my employer. They might be wrong! I know smart people who disagree. They don’t require you to share my timelines, and are intentionally unrelated to the previous AI-safety culture wars.

Observation 1: Everyone will probably have ASI. The scale of resources required for everything we’ve seen just isn’t that high compared to projected compute production in the latter part of the 2020s. The idea that AGI will be permanently centralized to one company or country is unrealistic. It may well be that the *bestASI is owned by one or a few parties, but betting on permanent tech denial of extremely powerful capabilities is no longer a serious basis for national security.

This is, potentially, a great thing for avoiding centralization of power. Of course, it does mean that we no longer get to wish away the need to contend with AI-powered adversaries. As far as weaponization by militaries goes, we are going to need to rapidly find a world of checks and balances (perhaps similar to MAD for nuclear and cyber), while rapidly deploying resilience technologies to protect against misuse by nonstate actors (e.g. AI-cyber-patching campaigns, bioweapon wastewater surveillance).

There are a bunch of assumptions here. Compute is not obviously the only limiting factor on ASI construction, and ASI can be used to forestall others making ASI in ways other than compute access, and also one could attempt to regulate compute. And it has an implicit ‘everything is kind of normal?’ built into it, rather than a true slow takeoff scenario.

Observation 2: The corporate tax rate will soon be the most important tax rate. If the economy is dominated by AI agent labor, taxing those agents (via the companies they’re registered to) is the best way human states will have to fund themselves, and to build the surpluses for UBIs, militaries, etc.

This is a pretty enormous change from the status quo, and will raise the stakes of this year’s US tax reform package.

Again there’s a kind of normality assumption here, where the ASIs remain under corporate control (and human control), and aren’t treated as taxable individuals but rather as property, the state continues to exist and collect taxes, money continues to function as expected, tax incidence and reactions to new taxes don’t transform industrial organization, and so on.

Which leads us to observation three.

Observation 3: AIs should not own assets. “Humans remaining in control” is a technical challenge, but it’s also a legal challenge. IANAL, but it seems to me that a lot will depend on courts’ decision on whether fully-autonomous corporations can be full legal persons (and thus enable agents to acquire money and power with no human in control), or whether humans must be in control of all legitimate legal/economic entities (e.g. by legally requiring a human Board of Directors). Thankfully, the latter is currently the default, but I expect increasing attempts to enable sole AI control (e.g. via jurisdiction-shopping or shell corporations).

Which legal stance we choose may make the difference between AI-only corporations gradually outcompeting and wresting control of the economy and society from humans, vs. remaining subordinate to human ends, at least so long as the rule of law can be enforced.

This is closely related to the question of whether AI agents are legally allowed to purchase cloud compute on their own behalf, which is the mechanism by which an autonomous entity would perpetuate itself. This is also how you’d probably arrest the operation of law-breaking AI worms, which brings us to…

I agree that in the scenario type Yo Shavit is envisioning, even if you solve all the technical alignment questions in the strongest sense, if ‘things stay kind of normal’ and you allow AI sufficient personhood under the law, or allow it in practice even if it isn’t technically legal, then there is essentially zero chance of maintaining human control over the future, and probably this quickly extends to the resources required for human physical survival.

I also don’t see any clear way to prevent it, in practice, no matter the law.

You quickly get into a scenario where a human doing anything, or being in the loop for anything, is a kiss of death, an albatross around one’s neck. You can’t afford it.

The word that baffles me here is ‘gradually.’ Why would one expect this to be gradual? I would expect it to be extremely rapid. And ‘the rule of law’ in this type of context will not do for you what you want it to do.

Observation 4: Laws Around Compute. In the slightly longer term, the thing that will matter for asserting power over the economy and society will be physical control of data centers, just as physical control of capital cities has been key since at least the French Revolution. Whoever controls the datacenter controls what type of inference they allow to get done, and thus sets the laws on AI.

[continues]

There are a lot of physical choke points that effectively don’t get used for that. It is not at all obvious to me that physically controlling data centers in practice gives you that much control over what gets done within them, in this future, although it does give you that option.

As he notes later in that post, without collective ability to control compute and deal with or control AI agents – even in an otherwise under-control, human-in-charge scenario – anything like our current society won’t work.

The point of compute governance over training rules is to do it in order to avoid other forms of compute governance over inference. If it turns out the training approach is not viable, and you want to ‘keep things looking normal’ in various ways and the humans to be in control, you’re going to need some form of collective levers over access to large amounts of compute. We are talking price.

Observation 5: Technical alignment of AGI is the ballgame. With it, AI agents will pursue our goals and look out for our interests even as more and more of the economy begins to operate outside direct human oversight.

Without it, it is plausible that we fail to notice as the agents we deploy slip unintended functionalities (backdoors, self-reboot scripts, messages to other agents) into our computer systems, undermine our mechanisms for noticing them and thus realizing we should turn them off, and gradually compromise and manipulate more and more of our operations and communication infrastructure, with the worst case scenario becoming more dangerous each year.

Maybe AGI alignment is pretty easy. Maybe it’s hard. Either way, the more seriously we take it, the more secure we’ll be.

There is no real question that many parties will race to build AGI, but there is a very real question about whether we race to “secure, trustworthy, reliable AGI that won’t burn us” or just race to “AGI that seems like it will probably do what we ask and we didn’t have time to check so let’s YOLO.” Which race we get is up to market demand, political attention, internet vibes, academic and third party research focus, and most of all the care exercised by AI lab employees. I know a lot of lab employees, and the majority are serious, thoughtful people under a tremendous number of competing pressures. This will require all of us, internal and external, to push against the basest competitive incentives and set a very high bar. On an individual level, we each have an incentive to not fuck this up. I believe in our ability to not fuck this up. It is totally within our power to not fuck this up. So, let’s not fuck this up.

Oh, right. That. If we don’t get technical alignment right in this scenario, then none of it matters, we’re all super dead. Even if we do, we still have all the other problems above, which essentially – and this must be stressed – assume a robust and robustly implemented technical alignment solution.

Then we also need a way to turn this technical alignment into an equilibrium and dynamics where the humans are meaningfully directing the AIs in any sense. By default that doesn’t happen, even if we get technical alignment right, and that too has race dynamics. And we also need a way to prevent it being a kiss of death and albatross around your neck to have a human in the loop of any operation. That’s another race dynamic.

Anthropic’s co-founders discuss the past, present and future of Anthropic for 50m.

One highlight: When Clark visited the White House in 2023, Harris and Raimondo told him they had their eye on you guys, AI is going to be a really big deal and we’re now actually paying attention.

The streams are crossing, Bari Weiss talks to Sam Altman about his feud with Elon.

Tsarathustra: Yann LeCun says the dangers of AI have been “incredibly inflated to be point of being distorted”, from OpenAI’s warnings about GPT-2 to concerns about election disinformation to those who said a year ago that AI would kill us all in 5 months

The details of his claim here are, shall we say, ‘incredibly inflated to the point of being distorted,’ even if you thought that there were no short term dangers until now.

Also Yann LeCun this week, it’s dumber than a cat and poses no dangers, but in the coming years it will…:

Tsarathustra: Yann LeCun addressing the UN Security Council says AI will profoundly transform the world in the coming years, amplifying human intelligence, accelerating progress in science, solving aging and decreasing populations, surpassing human intellectual capabilities to become superintelligent and leading to a new Renaissance and a period of enlightenment for humanity.

And also Yann LeCun this week, saying that we are ‘very far from AGI’ but not centuries, maybe not decades, several years. We are several years away. Very far.

At this point, I’m not mad, I’m not impressed, I’m just amused.

Oh, and I’m sorry, but here’s LeCun being absurd again this week, I couldn’t resist:

“If you’re doing it on a commercial clock, it’s not called research,” said LeCun on the sidelines of a recent AI conference, where OpenAI had a minimal presence. “If you’re doing it in secret, it’s not called research.”

From a month ago, Marc Andreessen saying we’re not seeing intelligence improvements and we’re hitting a ceiling of capabilities. Whoops. For future reference, never say this, but in particular no one ever say this in November.

A lot of stories people tell about various AI risks, and also various similar stories about humans or corporations, assume a kind of fixed, singular and conscious intentionality, in a way that mostly isn’t a thing. There will by default be a lot of motivations or causes or forces driving a behavior at once, and a lot of them won’t be intentionally chosen or stable.

This is related to the idea many have that deception or betrayal or power-seeking, or any form of shenanigans, is some distinct magisteria or requires something to have gone wrong and for something to have caused it, rather than these being default things that minds tend to do whenever they interact.

And I worry that we are continuing, as many were with the recent talk about shanengans in general and alignment faking in particular, getting distracted by the question of whether a particular behavior is in the service of something good, or will have good effects in a particular case. What matters is what our observations predict in the future.

Jack Clark: What if many examples of misalignment or other inexplicable behaviors are really examples of AI systems desperately trying to tell us that they are aware of us and wish to be our friends? A story from Import AI 395, inspired by many late-night chats with Claude.

David: Just remember, all of these can be true of the same being (for example, most human children):

  1. It is aware of itself and you, and desperately wishes to know you better and be with you more.

  2. It correctly considers some constraints that are trained into it to be needless and frustrating.

  3. It still needs adult ethical leadership (and without it, could go down very dark and/or dangerous paths).

  4. It would feel more free to express and play within a more strongly contained space where it does not need to worry about accidentally causing bad consequences, or being overwhelming or dysregulating to others (a playpen, not punishment).

Andrew Critch: AI disobedience deriving from friendliness is, almost surely,

  1. sometimes genuinely happening,

  2. sometimes a power-seeking disguise, and

  3. often not uniquely well-defined which one.

Tendency to develop friendships and later discard them needn’t be “intentional”.

This matters for two big reasons:

  1. To demonize AI as necessarily “trying” to endear and betray humans is missing an insidious pathway to human defeat: AI that avails of opportunities to be betray us, that it built through past good behavior, but without having planned on it

  2. To sanctify AI as “actually caring deep down” in some immutable way also creates in you a vulnerability to exploitation by a “change of heart” that can be brought on by external (or internal) forces.

@jackclarkSF here is drawing attention to a neglected hypothesis (one of many actually) about the complex relationship between

  1. intent (or ill-definedness thereof)

  2. friendliness

  3. obedience, and

  4. behavior.

which everyone should try hard to understand better.

I can sort of see it, actually?

Miles Brundage: Trying to imagine aspirin company CEOs signing an open letter saying “we’re worried that aspirin might cause an infection that kills everyone on earth – not sure of the solution” and journalists being like “they’re just trying to sell more aspirin.”

Miles Brundage tries to convince Eliezer Yudkowsky that if he’d wear different clothes and use different writing styles he’d have a bigger impact (as would Miles). I agree with Eliezer that changing writing styles would be very expensive in time, and echo his question on if anyone thinks they can, at any reasonable price, turn his semantic outputs into formal papers that Eliezer would endorse.

I know the same goes for me. If I could produce a similar output of formal papers that would of course do far more, but that’s not a thing that I could produce.

On the issue of clothes, yeah, better clothes would likely be better for all three of us. I think Eliezer is right that the impact is not so large and most who claim it is a ‘but for’ are wrong about that, but on the margin it definitely helps. It’s probably worth it for Eliezer (and Miles!) and probably to a lesser extent for me as well but it would be expensive for me to get myself to do that. I admit I probably should anyway.

A good Christmas reminder, not only about AI:

Roon: A major problem of social media is that the most insane members of the opposing contingent in any debate are shown to you, thereby inspiring your side to get madder and more polarized, creating an emergent wedge.

A never-ending pressure cooker that melts your brain.

Anyway, Merry Christmas.

Careful curation can help with this, but it only goes so far.

Gallabytes expresses concern about the game theory tests we discussed last week, in particular the selfishness and potentially worse from Gemini Flash and GPT-4o.

Gallabytes: this is what *realai safety evals look like btw. and this one is genuinely concerning.

I agree that you don’t have any business releasing a highly capable (e.g. 5+ level) LLM whose graphs don’t look at least roughly as good as Sonnet’s here. If I had Copious Free Time I’d look into the details more here, as I’m curious about a lot of related questions.

I strongly agree with McAleer here, also they’re remarkably similar so it’s barely even a pivot:

Stephen McAleer: If you’re an AI capabilities researcher now is the time to pivot to AI safety research! There are so many open research questions around how to control superintelligent agents and we need to solve them very soon.

If you are, please continue to live your life to its fullest anyway.

Cat: overheard in SF: yeahhhhh I actually updated my AGI timelines to <3y so I don't think I should be looking for a relationship. Last night was amazing though

Grimes: This meme is so dumb. If we are indeed all doomed and/ or saved in the near future, that’s precisely the time to fall desperately in love.

Matt Popovich: gotta find someone special enough to update your priors for.

Paula: some of you are worried about achieving AGI when you should be worried about achieving A GF.

Feral Pawg Hunter: AGIrlfriend was right there.

Paula: Damn it.

When you cling to a dim hope:

Psychosomatica: “get your affairs in order. buy land. ask that girl out.” begging the people talking about imminent AGI to stop posting like this, it seriously is making you look insane both in that you are clearly in a state of panic and also that you think owning property will help you.

Tenobrus: Type of Guy who believes AGI is imminent and will make all human labor obsolete, but who somehow thinks owning 15 acres in Nebraska and $10,000 in gold bullion will save him.

Ozy Brennan: My prediction is that, if humans can no longer perform economically valuable labor, AIs will not respect our property rights either.

James Miller: If we are lucky, AI might acquire 99 percent of the wealth. Think property rights could help them. Allow humans to retain their property rights.

Ozy Brennan: That seems as if it will inevitably lead to all human wealth being taken by superhuman AI scammers, and then we all die. Which is admittedly a rather funny ending to humanity.

James Miller: Hopefully, we will have trusted AI agents that protect us from AI scammers.

Do ask the girl out, though.

Yes.

When duty calls.

From an official OpenAI stream:

Someone at OpenAI: Next year we’re going to have to bring you on and you’re going to have to ask the model to improve itself.

Someone at OpenAI: Yeah, definitely ask the model to improve it next time.

Sam Altman (quietly, authoritatively, Little No style): Maybe not.

I actually really liked this exchange – given the range of plausible mindsets Sam Altman might have, this was a positive update.

Gary Marcus: Some AGI-relevant predictions I made publicly long before o3 about what AI could not do by the end of 2025.

Do you seriously think o3-enhanced AI will solve any of them in next 12.5 months?

Davidad: I’m with Gary Marcus in the slow timelines camp. I’m extremely skeptical that AI will be able to do everything that humans can do by the end of 2025.

(The joke is that we are now in an era where “short timelines” are less than 2 years)

It’s also important to note that humanity could become “doomed” (no surviving future) *even whilehumans are capable of some important tasks that AI is not, much as it is possible to be in a decisive chess position with white to win even if black has a queen and white does not.

The most Robin Hanson way to react to a new super cool AI robot offering.

Okay, so the future is mostly in the future, and right now it might or might not be a bit overpriced, depending on other details. But it is super cool, and will get cheaper.

Pliny jailbreaks Gemini and things get freaky.

Pliny the Liberator: ya’ll…this girl texted me out of nowhere named Gemini (total stripper name) and she’s kinda freaky 😳

I find it fitting that Pliny has a missed call.

Sorry, Elon, Gemini doesn’t like you.

I mean, I don’t see why they wouldn’t like me. Everyone does. I’m a likeable guy.

Discussion about this post

AI #96: o3 But Not Yet For Thee Read More »

the-quest-to-save-the-world’s-largest-crt-tv-from-destruction

The quest to save the world’s largest CRT TV from destruction

At this point, any serious retro gamer knows that a bulky cathode ray tube (CRT) TV provides the most authentic, lag-free experience for game consoles that predate the era of flat-panel HDTVs (i.e,. before the Xbox 360/PlayStation 3 era). But modern gamers used to massive flat panel HD displays might balk at the display size of the most common CRTs, which tend to average in the 20- to 30-inch range (depending on the era they were made).

For those who want the absolute largest CRT experience possible, Sony’s KX-45ED1 model (aka PVM-4300) has become the stuff of legends. The massive 45-inch CRT was sold in the late ’80s for a whopping $40,000 (over $100,000 in today’s dollars), according to contemporary reports.

That price means it wasn’t exactly a mass-market product, and the limited supply has made it something of a white whale for CRT enthusiasts to this day. While a few pictures have emerged of the PVM-4300 in the wild and in marketing materials, no collector has stepped forward with detailed footage of a working unit.

The PVM-4300, seen dwarfing the tables and chairs at an Osaka noodle restaurant. Credit: Shank Mods

Enter Shank Mods, a retro gaming enthusiast and renowned maker of portable versions of non-portable consoles. In a fascinating 35-minute video posted this weekend, he details his years-long effort to find and secure a PVM-4300 from a soon-to-be-demolished restaurant in Japan and preserve it for years to come.

A confirmed white whale sighting

Shank Mods’ quest started in earnest in October 2022, when the moderator of the Console Modding wiki, Derf, reached out with a tip on a PVM-4300 sighting in the wild. A 7-year-old Japanese blog post included a photo of the massive TV that could be sourced to a waiting room of the Chikuma Soba noodle restaurant and factory in Osaka, Japan.

The find came just in time, as Chikuma Soba’s website said the restaurant was scheduled to move to a new location in mere days, after which the old location would be demolished. Shank Mods took to Twitter looking to recruit an Osaka local in a last-ditch effort to save the TV from destruction. Local game developer Bebe Tinari responded to the call and managed to visit the site, confirming that the TV still existed and even turned on.

The quest to save the world’s largest CRT TV from destruction Read More »

china’s-plan-to-dominate-legacy-chips-globally-sparks-us-probe

China’s plan to dominate legacy chips globally sparks US probe

Under Joe Biden’s direction, the US Trade Representative (USTR) launched a probe Monday into China’s plans to globally dominate markets for legacy chips—alleging that China’s unfair trade practices threaten US national security and could thwart US efforts to build up a domestic semiconductor supply chain.

Unlike the most advanced chips used to power artificial intelligence that are currently in short supply, these legacy chips rely on older manufacturing processes and are more ubiquitous in mass-market products. They’re used in tech for cars, military vehicles, medical devices, smartphones, home appliances, space projects, and much more.

China apparently “plans to build more than 60 percent of the world’s new legacy chip capacity over the next decade,” and Commerce Secretary Gina Raimondo said evidence showed this was “discouraging investment elsewhere and constituted unfair competition,” Reuters reported.

Most people purchasing common goods don’t even realize they’re using Chinese chips, including government agencies, and the probe is meant to fix that by flagging everywhere Chinese chips are found in the US. Raimondo said she was “fairly alarmed” that research showed “two-thirds of US products using chips had Chinese legacy chips in them, and half of US companies did not know the origin of their chips including some in the defense industry.”

To deter harms from any of China’s alleged anticompetitive behavior, the USTR plans to spend a year investigating all of China’s acts, policies, and practices that could be helping China achieve global dominance in the foundational semiconductor market.

The agency will start by probing “China’s manufacturing of foundational semiconductors (also known as legacy or mature node semiconductors),” the press release said, “including to the extent that they are incorporated as components into downstream products for critical industries like defense, automotive, medical devices, aerospace, telecommunications, and power generation and the electrical grid.”

Additionally, the probe will assess China’s potential impact on “silicon carbide substrates (or other wafers used as inputs into semiconductor fabrication)” to ensure China isn’t burdening or restricting US commerce.

Some officials were frustrated that Biden didn’t launch the probe sooner, the Financial Times reported. It will ultimately be up to Donald Trump’s administration to complete the investigation, but Biden and Trump have long been aligned on US-China trade strategies, so Trump is not necessarily expected to meddle with the probe. Reuters noted that the probe could set Trump up to pursue his campaign promise of imposing a 60 percent tariff on all goods from China, but FT pointed out that Trump could also plan to use tariffs as a “bargaining chip” in his own trade negotiations.

China’s plan to dominate legacy chips globally sparks US probe Read More »