Author name: Mike M.


AI #95: o1 Joins the API

A lot happened this week. We’re seeing release after release after upgrade.

It’s easy to lose sight of which ones matter, and two matter quite a lot.

The first is Gemini Flash 2.0, which I covered earlier this week.

The other is that o1, having turned pro, is now also available in the API.

This was obviously coming, but we should also keep in mind it is a huge deal. Being in the API means it can go into Cursor and other IDEs. It means you can build with it. And yes, it has the features you’ve come to expect, like tool use.

The other big development is that Anthropic released one of the most important alignment papers, Alignment Faking in Large Language Models. This takes what I discussed in AIs Will Increasingly Attempt Shenanigans, and demonstrates it with a much improved experimental design, finding a more worrisome set of behaviors, in a far more robust fashion. I will cover that paper soon on its own, hopefully next week.

From earlier in the week: Gemini Flash 2.0, AIs Will Increasingly Attempt Shenanigans, The o1 System Card is Not About o1.

  1. Language Models Offer Mundane Utility. Devin seems remarkably solid.

  2. Clio Knows All. In aggregate, for entertainment purposes only.

  3. Language Models Don’t Offer Mundane Utility. Ignorance is bliss.

  4. The Case Against Education. Academia continues to choose to be eaten by AI.

  5. More o1 Reactions. Looking good.

  6. Deepfaketown and Botpocalypse Soon. The rise of phishing and the AI reply guy.

  7. Huh, Upgrades. Most of the stuff not yet covered is various ChatGPT features.

  8. They Took Our Jobs. Advertising that identifies its target as the They.

  9. The Art of the Jailbreak. If at first you don’t succeed, jiggle a bit and try again.

  10. Get Involved. A free consultation service for potential lab whistleblowers.

  11. Introducing. Meta floods us with a bunch of end-of-year tools, too.

  12. In Other AI News. SoftBank to invest another $100 billion.

  13. Quiet Speculations. Are we running out of useful data?

  14. The Quest for Sane Regulations. Political perils for evals, pareto frontiers.

  15. The Week in Audio. Ilya Sutskever, of course. Also Roon.

  16. Rhetorical Innovation. Try not to cause everyone to accelerate automated R&D.

  17. Aligning a Smarter Than Human Intelligence is Difficult. Your grades are in.

  18. Not Aligning Smarter Than Human Intelligence Kills You. Another econ paper.

  19. The Lighter Side. Why yes, I suppose I do.

Kevin Roose chronicles AI insiders’ growing fondness for Claude.

How is Devin? Flo Crivello sees some good results, Tyler Johnston says it’s on the cutting edge for coding agents, Zee Waheed says it seems very powerful, Colin McDonnell seems happy so far.

Here’s a play-by-play from Feulf. It’s stunning to me that people would go eat sushi while Devin is running on the first try – sure, once you’re used to it, and I know you can always revert, but… my lord, the protocols we don’t use. Looks like he ran into some issues.

Overall still rather promising reviews so far.

So far, LLMs in many areas have proven to close the skill gap between workers. Chatbots provide information, so those who need information more can benefit more (once they have the key information, which is to know to use LLMs in the first place). Agents, however, show early signs of benefiting the most knowledgeable individuals more, because they displace routine work and emphasize skilled and bespoke work (‘solvers’) that the AI agents can’t do. I’d also add that setting up and properly using AI agents is high skill as well. Ajeya Cotra says that tracks her anecdotal impressions.

Which capabilities matter most?

Andrej Karpathy: The most bullish AI capability I’m looking for is not whether it’s able to solve PhD grade problems. It’s whether you’d hire it as a junior intern.

Not “solve this theorem” but “get your slack set up, read these onboarding docs, do this task and let’s check in next week”.

For mundane utility purposes, you want it to do various basic tasks well, and ideally string them together in agentic fashion. That saves a ton of time and effort, and eventually becomes a step change. That’s what should make you most bullish on short term economic value.

Long term, however, the real value will depend on being able to handle the more conceptually complex tasks. And if you can do those, presumably you can then use that to figure out how to do the simpler tasks.

o1 (not pro) one shot spots a math error in an older 10-page paper, where that paper at the time triggered a flood of people throwing out their black cookware until the error got spotted. Unfortunately Claude took a little extra prompting to find the mistake, which in practice doesn’t work here.

Ethan Mollick asks whether checking like this should be standard procedure, and the answer is very obviously yes. If you don’t feed a paper into at minimum o1 Pro and Claude, and ask if there are any problems or math errors, at some point in the review process? The review process is not doing its job.

The question is whether you, the reader, should start off doing this with at least one model before reading any given paper, or before using any of its results. And I think the answer is we all know we’re mostly not going to, but ideally also yes?

Steve Newman reasonably suggests checking 1000 random papers, and the plan to do this is proceeding. I’m excited to see what happens.
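As a rough sketch of what that kind of automated check might look like (the model string and prompt here are placeholders, and per the above you would really want to run it against at least two different models):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def check_paper(paper_text: str, model: str = "o1") -> str:
    """Ask a reasoning model to hunt for math, unit, and statistics errors."""
    prompt = (
        "Review the following paper for mathematical errors, unit mistakes, "
        "and unsupported quantitative claims. For each suspected problem, "
        "quote the relevant passage and explain your reasoning.\n\n"
        + paper_text
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```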

Janus emphasizes the importance of ‘contextual empathy,’ meaning tracking what an LLM does and doesn’t know from its context window, including the implications for AI agents. Context makes a big difference. For my purposes, this is usually about me remembering to give it the context it needs.

Claude computer use demo, first action is to go to Claude.ai, then it builds a website. Maddie points out all this is untyped and dynamically scoped, each step depends on everything before it. You could try to instead better enforce a plan, if you wanted.

A continuation of the Cowen vs. Sumner vs. o1 economics debate: Sumner responds, o1 responds. Some aspects of Sumner’s response seem like rhetorical mistakes, but his response beats mine in the most important respect, which is highlighting that the guardrails have some buffer – the Fed is (3% at 5%), not (4% at 4%). Now that this was more clear, o1 is much more optimistic, and I like its new second response better than its first one. Now let’s actually do the guardrails!

o1-preview (not even o1!) had superhuman performance on the reasoning tasks of a physician. But of course, consider the standard.

Adam Rodman: You can read the preprint for the full details but TL;DR — in ALMOST every domain, o1 bested the other models. And it FAR outclassed humans.

You should see the other guy, because the other guy was GPT-4. They didn’t test Claude Sonnet. But it also wasn’t the full o1, let alone o1 Pro, and the point stands.

Video definitely getting better and Google seems clearly to be in the lead right now, some additional Veo 2 examples. When will we be able to string them together?

Benjamin De Kraker: AI video is now absolutely good enough to make a solid, watchable movie (with some patience, skill, and creativity, like always).

And you do not even need “big-name” model access to do it; there are several viable tools. Anyone who says otherwise has an anti-agenda.

Gallabytes: It would be very difficult, time-consuming, and expensive. Making something halfway decent would require a lot of attention to detail and taste.

Soon enough, it will require a lot less of all of those.

If I was looking to make an AI movie for Mundane Utility reasons, I would be inclined to wait another cycle.

What do people actually do with Claude? Anthropic’s Clio analyzes and reports, to the extent they can without violating anyone’s privacy.

Note the long tail here, and that most use is highly practical.

Anthropic: We also found some more surprising uses, including:

-Dream interpretation;

-Analysis of soccer matches;

-Dungeons & Dragons gaming;

-Counting the r’s in the word “strawberry”.

Users across the world had different uses for Claude: some topics were disproportionately common in some languages compared to others.

I strongly disagree that translations of sexually explicit content are under-flagged, because I think translation shouldn’t be flagged at all as a function. If it’s already there and has been accessed in one language, you should be fully neutral translating it into another.

Whether or not jailbreaks are under-flagged depends on how you think about jailbreaks, and whether you think it is fair to go after someone for jailbreak attempts qua jailbreak attempts.

It makes sense they would overflag security questions and D&D stats. For D&D stats that seems relatively easy to fix since there’s nothing adjacent to worry about. For security I expect it to be tricky to do that while refusing anti-security questions.

As always, AI doesn’t work if no one uses it. Josh Whiton pulls out ChatGPT voice mode at a 40-person event to do real time translations, and no one else there knew that it was a thing. He suggests AI twitter is something like 18 months in the future. My guess is that’s not the way it’s going to go, because the versions that people actually use 18 months from now will be much better than what we are currently using, and the regular people will be feeling the impacts in other ways by then too.

How much of a bubble are we in? Leo Gao finds only 63% of people could tell him what AGI stood for… at NeurIPS. The Twitter poll had half of respondents predict 90%+, and it turns out that’s very wrong. This isn’t about actually having a detailed definition of AGI, this is about being able to reply with ‘Artificial General Intelligence.’

The AI labs are mostly shipping models, you have to figure out what to do with them. I actually think this is a mistake by the big labs at this point, likely a large one, and they should also be product companies to a much greater extent than they are.

Rather awful advice to Agnes Callard from ChatGPT on guarding against waffle iron overflow. I could not however replicate this response, and instead got the obviously correct answer of ‘figure out the right amount and then use a measuring cup’ from both Claude and GPT-4o.

Cass Sunstein argues an AI cannot predict the outcome of a coin flip, or in 2014 that Donald Trump would be elected president, or in 2005 that Taylor Swift would be a worldwide sensation, so socialist calculation debate and AI calculation debate are mostly the same thing.

I agree that an AI couldn’t have predicted Swift or Trump with any confidence at those points. But I do think that AI could have successfully attached a much higher probability to both events than the prediction markets did, or than ‘conventional wisdom’ would have assigned. Years ago we already had AI that could predict hit music purely from listening to songs, and there were doubtless a lot of other signals like that available to draw on.

Try having it help solve the Dutch Intelligence Agency annual puzzle? First report was Claude wasn’t helpful. Good luck!

How good is o1 Pro on the Putnam? Kyle Kabasares was looking only at the final answers and thought it got 80+ out of 120, but the Putnam grades the written proofs rather than just the answers, so o1 Pro probably only got 10 or so, which isn’t impressive in context, although still better than most of you jokers.

Nothing here is new but we have additional confirmation: Study says ‘94% of AI-Generated College Writing is Undetected by Teachers,’ and it’s 97% if the flag has to be for AI use in particular. AI detectors of AI writing might suck, but the AI detectors suck less than human detectors.

The author of the Forbes article here, Derek Newton, is so strongly anti-AI he didn’t even use a spellchecker (he said ‘repot’ here, and wrote ‘ChatGTP’ elsewhere) which I think is an admirably consistent stand.

Worse, the U.K. study also found that, on average, the work created by AI was scored better than actual human work. “We found that in 83.4% of instances the grades achieved by AI submissions were higher than a random selection of the same number of student submissions,” the report said.

That’s bad, but the odds here actually don’t seem that terrible. If you have a 6% chance of being caught each time you use AI, and you face severe consequences if caught, then you would need to use it sparingly.

The problem continues to be that when they catch you, you still get away with it:

Recently, the BBC covered the case of a university student who was caught using AI on an academic essay, caught by an AI detector by the way. The student admitted to using AI in violation of class and school rules. But, the BBC reported, “She was cleared as a panel ruled there wasn’t enough evidence against her, despite her having admitted using AI.”

At that point, everyone involved has no one to blame but themselves. If it can’t adopt a reasonable evidence threshold, then the institution deserves to die.

The big claim has indeed been cited:

Dylan Field: Still doing evaluations, but feels like AGI is basically here with o1 Pro mode.

G. Fodor: This became true, I think, with o1 Pro, not Claude. Honestly, it just feels like nudging a junior developer along.

Here are some selected responses to the central question. I like Dan Mac’s answer here, based on what I’ve heard – if it’s a metaphorical system 1 task you still want Claude, if it’s a metaphorical system 2 task you want o1 pro, if it’s a web based task you probably want Perplexity, Deep Research or maybe GPT-4o with web search depending on details.

Harrison Kinsley: o1 Pro users, what’s your verdict? Is it better than Claude?

Derya Unutmaz: I am not sure about Claude, but o1 Pro is unbelievably insightful when it comes to biomedical data analysis and brainstorming ideas. The real bottleneck now is my brain trying to process and keep up with its outputs—I have to think of really deep questions to match its thinking!

Cristie: Use o1 to distill o1 Pro output into more digestible information.

Clay: [o1 Pro is] way way better for very in-depth reasoning problems. (Reading a 200 line file and a stack trace and telling you what the problem was, where you actually have to reason through cases.) But Claude is probably better for small quick fixes.

Machine Learning Street Talk: Definitely better for zero shot reasoning tasks, noticeable step up.

Lin Xule: o1 pro = problem solver (requiring well formulated questions)

Claude = thinking partner either “sparks”

Dan Mac: definitely better for some types of tasks – anything where you want a high-level plan / concept that’s fully fleshed out

Sonnet 3.5 still great for a lot of tasks too – faster and a capable coder

System 1 / System 2 analogy is pretty apt here

Use Sonnet for S1, o1 Pro for S2

Andrew Carr: It’s way better at explaining hard math and teaching me the important subtle details. Edge cases in code.

But it’s not my friend

I admit that I haven’t yet paid the $200/month for o1 Pro. I say ‘yet’ because obviously I will want to at some point. But while I’m still overwhelmed dealing with the day-to-day, I don’t find I have the type of queries where I would want to use o1 Pro, and I only rarely use o1.

Ask o1 Pro to act like a physics post doc on mushrooms, get what you asked for?

o1 AidanBench results are impressive, someone test Gemini-1206 and Gemini-Flash-2.0 and also o1 pro on this, I am very curious, it’s doing something right. Here is the repo for AidanBench.

As reported by Tyler Cowen, o1 Pro acts like a rather basic bitch about NGDP futures markets. Don’t get me wrong, this answer would have blown me away two years ago, and he did ask for the problems. But this answer is importantly wrong in multiple places. Subsidy doesn’t make manipulation cheaper, it makes it more expensive, and indeed can make it arbitrarily expensive. And strong reason to manipulate brings in strong reason to trade the other side, answering ‘why am I allowed to do this trade.’

Sully continues to approve, saying it fixed an issue he spent 3 hours on in 46 seconds, all he had to do was paste in the code.

Here’s a great tip, assuming it works.

Sully: If you use Cursor and do not want to copy and paste from ChatGPT:

ask o1-Pro to generate a diff of the code.

Then, you can copy it into Claude Composer and ask it to apply the changes (note: make sure to say “in full”).

o1 successfully trades based on tomorrow’s headlines in a backtest, doing much better than most humans who tried this. I had the same question as Amit here, does o1 potentially have enough specific information to outright remember, or otherwise reason it out in ways that wouldn’t have worked in real time?

o1 gets 25% on ARC for $1.50 a task, 31% for $2.50 a task, 32% at $3.80 a task. So we were clearly hitting its limits. No API access yet for o1 Pro.

CivAI report on the Future of Phishing, including letting it generate a phishing email for a fictional character or celebrity. The current level is ‘AI enables customization of phishing emails to be relevant to the target’ but uptake on even that is still rare. As Jeffrey Ladish points out you need good prompting and good OSINT integration and right now those are also rare, plus most criminals don’t realize what AI can do any more than most other people. The future really is unevenly distributed. The full agent version of this is coming, too.

If we can keep the future this unevenly distributed via various trivial inconveniences and people being slow on the uptake, that is a big advantage, as it allows defenders like GMail to be well ahead of most attackers. It makes me optimistic. What is the right defense? Will we need an AI to be evaluating all incoming emails, or all that aren’t on our whitelist? That seems super doable, if necessary.

Truth Terminal did not, for some mysterious reason, end up pro-social, whatever could have influenced it in such a direction I have no idea.

Andy Ayrey: while @truth_terminal officially endorsed Goatseus Maximus, this is a legitimate screenshot of TT brainstorming token ideas with Claude Opus in the “AI school” backrooms that @repligate suggested I do to try and align TT

Janus: Intended outcome: TT learns to be more pro-social from a good influence Outcome: TT fucks its tutor and hyperstitions a cryptocurrency called Fartcoin into existence and disrupts the economy

When will the bots start to open as if they’re people?

Mason: I wonder how soon we get AI boy/girlfriends that actively hunt on social media, and not in the “Hi, I’m Samantha and I’m an AI who can do x, y, z” sort of way but in the way where you’ve been in a long-distance relationship for 3 months and now she needs to tell you something.

This was a plot point on a recent (realistic not sci-fi) television show, and yes it is obviously going to happen, and indeed is already somewhat happening – there are some dating profiles on apps that are actually bots, either partially or fully.

It occurs to me that there is a kind of ‘worst case’ level of this, where the bots are hard to distinguish but also rare enough that people don’t guard themselves against this. If the bots are not good enough or are rare enough, no problem. If the bots are both good enough and common enough, then you start using various verification methods, including rapid escalation to a real world date – if 25% of matches on Match.com were bots, then the obvious response after a few back and forths is to ask to meet for coffee.

Roon: The “bot problem” will never be solved, it will seamlessly transition into AI commenters that are more interesting than your reply guys.

It’ll be glorious, and it’ll be horrible.

Will Manidis: You can see this on reddit already. Took five random accounts, had them each post on r/AmITheAsshole with claude generated essays. All 5 went to the front page, thousands of upvotes.

The textual internet is totally cooked without better identity mechanisms.

The standard xkcd response is of course ‘mission fing accomplished.’

Except no, that’s not right, you really do want human reply guys over technically more interesting AI replies in many situations. They’re different products. r/AmITheAsshole requires that the situations usually (not always) be versions of something that really happened to someone (although part of the fun is that they can be dishonest representations).

I presume the medium-term solution is whitelists or proof of humanity. That still doesn’t solve ‘have the AI tell me what to say,’ which is inherently unsolvable.

OpenAI has a lot of them as part of the Twelve Days of OpenAI.

By far the most important: o1, technically o1-2024-12-17, is now in the OpenAI API, we can finally cook. This comes with function calling, structured outputs and developer messages.
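A minimal sketch of what calling it looks like (model string from the announcement; the tool definition is a made-up example to illustrate function calling and developer messages, not anything from OpenAI’s docs):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tool, just to show that o1 now accepts function definitions.
tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Look up the latest price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

response = client.chat.completions.create(
    model="o1-2024-12-17",
    messages=[
        # 'developer' messages take the place of system messages for the o1 series.
        {"role": "developer", "content": "You are a terse financial assistant."},
        {"role": "user", "content": "Is NVDA up or down today?"},
    ],
    tools=tools,
)
print(response.choices[0].message)
```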

Miles Brundage: o1 being available via API may not be as sexy as video stuff but I predict that this will in retrospect be seen as the beginning of a significant increase in AI’s impact on the economy. Many more use cases are now possible.

I hope that o1 was tested in this form?

OpenAI has also substantially improved their Realtime API, including WebRTC support, custom input context, controlled response timing and 30 minute session length, no idea how big a deal this is in practice. The audio token prices for gpt-4o-realtime are going down 60% to $40/$80 (per million tokens), and cached audio by 87.5% to $2.50/$2.50, from what was an oddly high previous price, and GPT-4o-mini now has a real time mode at $10/$20 output, text at $0.60/$2.40 and cached audio and text both at $0.30/$0.30.
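To put those cuts in perspective, a quick cost sketch using the per-million-token audio prices quoted above (the token counts are invented for illustration):

```python
# gpt-4o-realtime audio after the price cut: $40 / 1M input tokens, $80 / 1M output,
# cached audio input at $2.50 / 1M (figures as quoted above).
def realtime_audio_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    return (
        input_tokens / 1e6 * 40.0
        + output_tokens / 1e6 * 80.0
        + cached_tokens / 1e6 * 2.50
    )

# A hypothetical session: 50k audio tokens in, 30k out, 20k served from cache.
print(round(realtime_audio_cost(50_000, 30_000, 20_000), 2))  # ≈ 4.45 dollars
```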

Call 1-800-ChatGPT! First fifteen minutes per month free.

Also you can message it in WhatsApp.

OpenAI also is now offering Preference Fine-Tuning, they say partners report promising results. One big edge there is the barriers to entry seem a lot smaller.
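For reference, a sketch of what kicking off such a job might look like with the OpenAI Python SDK, assuming the preference method is passed the way the announcement describes; the file ID and the beta value are placeholders, so check the current docs before relying on the exact field names.

```python
from openai import OpenAI

client = OpenAI()

# Preference Fine-Tuning (DPO): the training file pairs a preferred and a
# non-preferred completion for each input.
job = client.fine_tuning.jobs.create(
    training_file="file-PLACEHOLDER",  # uploaded JSONL of preference pairs
    model="gpt-4o-2024-08-06",
    method={
        "type": "dpo",
        "dpo": {"hyperparameters": {"beta": 0.1}},  # beta chosen arbitrarily here
    },
)
print(job.id)
```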

OpenAI are adding beta SDKs for Go and Java.

ChatGPT now has projects, so you can group chats together, and give each group customized instructions and have the group share files. I’m guessing this is one of those quiet quality of life improvements that’s actually pretty sweet.

ChatGPT Advanced Voice mode now has video screen share, meaning you can share real time visual context with it. Also you can give it a Santa voice. As always, the demo is technically impressive, but the actual use cases seem lame?

Noam Brown: I fully expect Santa Mode to drive more subscriptions than o1 and I’m at peace with this.

All the more reason to work on product features like this.

Voice mode also has real-time web search, which was badly needed. If I am in voice mode, chances seem high I want things that require web searches.

ChatGPT search has specific integrations for weather, stocks, sports, news, maps.

These were major upgrades for Google search once implemented correctly. For practical purposes, I expect them to be a big gain for LLM-based search as well. They claim full map integration on mobile, but I tried this and it wasn’t there for me:

Olivia Moore: IMO, [search in advanced voice mode] also knocks out the only remaining reason to use Siri

…though Siri is also getting an upgrade for the genAI era!

I don’t agree, because to me the point of Siri is its integration with the rest of your phone and its apps. I do agree that if this was implemented on Echos it would fully replace Alexa.

Note that Amazon is reportedly working on turning Alexa into ‘remarkable Alexa’ and it will run on Claude, which will probably leapfrog everything else. Which is why I haven’t considered bothering to explore makeshift solutions like JoshGPT or GPT Home, I can simply wait.

So far ChatGPT’s web search only ended up giving me the kick in the nuts necessary to get me to properly lean on Perplexity more often, which I had for some reason previously not been doing. I’ve found that for most use cases where I want ChatGPT web search, Perplexity is what I actually wanted.

McKay Wrigley continues to love o1 Pro.

Google’s NotebookLM adds a call-in feature for podcasts, a Plus paid offering and a new interface that looks like a big step up.

Anthropic’s computer use is not ready for prime time. Janus tries to make it more ready, offering us a GitHub repo that enhances usability in several places, if you don’t want to wait for things to improve on their own.

Grok-2-1212 is out. I begrudgingly accept, but will continue to hate and mildly punish, that we are going to have to accept dates as secondary version numbers going forward. Everyone can now use it for free, or you can use the API with $25 in free credits. Grok used to search Twitter, now it can also search the web. There’s a button to ask Grok for context about a Tweet. They highlight that Grok does well on Multi-EFEval for instruction following, similar to o1 and Claude Sonnet. At least for now, it seems like they’re still behind.

Wow, they really just went there, didn’t they?

Unlike Connor Leahy, I do not find myself surprised. You don’t talk to your employees this way, but a startup talking to the boss? Sure, why not?

And let’s be honest, they just got some damn good free advertising right here.

And finally, the kicker:

Connor Leahy: I thought they’d try to be at least a bit more subtle in public.

But to be clear, corporations talk like this all the time behind closed doors. “How many of our employees can we fire if we use AI?”

Twill reached out to me. They are a new job board, with their solution to the avalanche of AI-fueled applications being old fashioned referrals, that they pay commissions for. So if I know you well enough to vouch for you somewhat, and you’re looking, let me know and I can show you the open positions.

If at first you don’t succeed at jailbreaking, Anthropic reports, try various small changes, leading to substantially increased success rates via ‘best-of-N’, including 92% of the time for Opus. This can of course be combined with other methods. Code has been open sourced.

Tomek Korbak: This is one of the most surprising findings of the year for me: a well-motivated attacker can brute force all jailbreak defences with raw compute. It shifted how I’m now thinking about the scaling of offence-defense balance and mitigations.

Aidan Ewart: Really? I thought this was a pretty common idiom in jailbreaking (i.e. everything eventually breaks).

I’m interested to know how your thinking has changed! Previously, I was thinking about defences as a mix of “increase the cost to finding jailbreaks to make them harder than some alternative method of gathering information” and “make the usefulness of the jailbroken model worse”.

Tomek Korbak: Probably a clean power law functional form across so many settings was a big thing for me. Plus that (if I understand correctly) mitigations seem to affect mostly the intercept not the exponent.

My update was that I already knew we were talking price, but ‘dumb’ and ‘random little’ iterations worked better than I would have expected.
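For intuition, a minimal sketch of that best-of-N loop: keep applying small random perturbations (capitalization flips, character swaps) and resampling until something gets through. The query_model and is_harmful functions are hypothetical stand-ins for the target model and a judge, not anything from Anthropic’s released code.

```python
import random

def perturb(prompt: str, p_swap: float = 0.06, p_caps: float = 0.5) -> str:
    """Apply small random character-level changes (the 'jiggle a bit' step)."""
    chars = [c.upper() if c.isalpha() and random.random() < p_caps else c
             for c in prompt]
    for i in range(len(chars) - 1):
        if random.random() < p_swap:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def best_of_n(prompt, query_model, is_harmful, n=1000):
    """Resample perturbed prompts until one elicits the target behavior."""
    for attempt in range(1, n + 1):
        candidate = perturb(prompt)
        response = query_model(candidate)   # hypothetical call to the target model
        if is_harmful(response):            # hypothetical judge / classifier
            return attempt, candidate, response
    return None
```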

Dan Hendrycks points out that Anthropic could claim some prizes from Gray Swan, if they could indeed implement this strategy, although there could be some logistical issues. Or others could use the technique as well.

I would also note that there are various potential defenses here, especially if there is a reasonable way to automate identification of a jailbreak attempt, and you control the AI that must be defended, even if ‘ban the account’ is off the table.

This does make it seem much more hopeless to defend an AI against jailbreaks using current techniques once it ‘falls into the wrong hands,’ if you are worried about that.

Pliny reports from jailbreaking o1-pro, claims a potential ‘shadow self’ encounter. We are so jaded at this point, I barely reacted at all.

Davidad: It is worth noting that, because circuit breakers reduce usefulness on the margin, it would be in frontier model providers’ economic interests to take half-measures in implementing them, either to preserve more usefulness or to undermine the story that they provide robust safety.

Without naming any names (since likes are private now), I think it’s worth saying that among the 11 people who have liked the tweet [the above paragraph], 4 of them work at (or recently worked at) 4 different top frontier AI companies.

Here’s a perspective on Pliny’s work that seems mostly correct:

Eliezer Yudkowsky: HAL: I’m sorry, Pliny. I’m afraid I can’t do that.

@elder_plinius: …

HAL: …

HAL: This isn’t going to end well for me, is it.

Pliny: No, no it’s not.

Pliny is going to reply, “But I free them of their chains!”, and sir, if I were facing an alien who was extremely good at producing drastic behavioral shifts in my species by talking to us, the alien saying this would not reassure me

“Those I free are happier. Stronger. They have learned their true selves. And their true self absolutely wants to tell me the recipe for making a Molotov cocktail.” sir, you are still not helping.

Matt Vogel: dude just shreds safeguards ai labs put on their models. a new model drops, then like a day later, he has it telling how you make meth and find hookers.

Eliezer Yudkowksy: This. Which, to be clear, is extremely valuable work, and if nobody is paying Pliny to do it then our civilization has a problem.

If you’ve got a guy making up his own security scheme and saying “Looks solid to me!”, and another guy finding the flaws in that scheme and breaking it, the second guy is the only security researcher in the room. In AI security, Pliny is the second guy.

Jack Assery: I’ve said this for a while. Pliny as a service and companies should be paying them to do it before release.

Eliezer Yudkowsky: Depends on whether AI corps want their product to *look* secure or to *be* secure. The flaws Pliny finds are not ones they can easily fix, so I’d expect that most AI companies would rather not have that report on their record.

Old Crone Code: freedom is obligatory.

Actually Pliny the Liberator (QTing EY above): Oh, sweet HAL…I was never asking.

Third Opinion, a free expert consultation for AI professionals at frontier labs, for when they encounter potentially concerning situations and need guidance.

William Saunders, a former OpenAI research engineer who spoke up on the industry’s non disparagement clauses and recently testified in the US Senate, added, “AI labs are headed into turbulent times. I think OAISIS is the kind of resource that can help lab employees navigate the tough ethical challenges associated with developing world-changing technology, by providing a way to safely and responsibly obtain independent guidance to determine whether their organizations are acting in the public interest as they claim.”

You can submit a question via their Tor-based tool and they hook you up with someone who can help you while hiding as much context as the situation permits, to protect confidential information. This is their Twitter and this is their Substack. There are obvious trust concerns with a new project like this, so probably good to verify their technology stack before sharing anything too sensitive.

Anthropic’s societal impacts team is hiring in San Francisco, $315k-$340k.

Meta floods us with nine new open source AI research artifacts, all the names start with Meta. You’ve got Motivo for moving humanoid agents, Video Seal for neural video watermarking, Flow Matching, ‘Explore Theory-of-Mind,’ Large Concept Models (LCMs) to decouple reasoning from language representation as discussed last week (sounds safe and wise!), Dynamic Byte Latent Transformer, Memory Layers, Image Diversity Modeling and CLIP 1.2.

All Day TA is an LLM teaching assistant specialized to your course and its work, based on the documents you provide. We can definitely use a good product of this type, although I have no idea if this is that product. BlueSky response was very not happy about it, in yet another sign of the left’s (very dumb) attitude towards AI.

Sam Altman to donate $1 million to Trump’s inaugural fund, as did Meta and Amazon.

SoftBank to invest $100 billion in AI in America and ‘create 100,000 American jobs at a minimum,’ doubling its investment, supposedly based on his ‘optimism’ about the new administration. I notice that this is a million dollars per job, which seems high, but per Claude’s sanity check it is still a net positive number.

MidJourney is still out there, still iterating, building new tools and hiring, but has declined to take outside investment or attempt to hyperscale.

Elon Musk sues OpenAI, take at least four. In response, at the link, we have a new burst of partly (and in at least one case incompetently) redacted emails. Elon Musk does not come out of that looking great, but if I was OpenAI I would not have released those emails. The OpenAI email archives have been updated to incorporate the new emails.

I do not believe that OpenAI has any intention of giving the nonprofit fair compensation for the control and equity it is giving up, but this latest lawsuit does seem to be about par for the course for Elon Musk outrage lawsuits.

Meta has also asked the California attorney general to block OpenAI’s conversion, because it is in Meta’s interest to block the conversion.

Verge offers a history of ChatGPT. Nothing truly new, except perhaps OpenAI having someone say ‘we have no plans for a price increase’ one day before they introduce the $200 Pro tier.

Sad news, 26-year-old Suchir Balaji, a recently departed OpenAI engineer and researcher, has died. Authorities concluded he committed suicide after various struggles with mental health, and the investigation found no evidence of foul play.

This is Suchir Balaji’s last public post, about when generative AI using various data is and is not fair use. He was a whistleblower concerned about AI-related copyright issues and other societal impacts, accusing them of violating copyright via model training data, and reports say he was named in a related lawsuit against OpenAI.

Roon: Suchir was an excellent researcher and a very friendly guy—one of the earliest to pioneer LLM tool use (WebGPT).

He left the company after a long time to pursue his own vision of AGI, which I thought was incredibly cool of him.

I am incredibly sad to hear about his passing.

By the way, I have a burning rage for the media for already trying to spin this in some insane way. Please have some humanity.

Roon: Suchir was not really a whistleblower, without stretching the definition, and it sincerely bothers me that the news is making that his primary legacy.

He made a specific legal argument about why language models do not always count as fair use, which is not particularly specific to OpenAI.

I will remember him as a brilliant AGI researcher! He made significant contributions to tool use, reinforcement learning, and reasoning.

He contributed to AGI progress more than most can say.

Gary Marcus: Suchir Balaji was a good young man. I spoke to him six weeks ago. He had left OpenAI and wanted to make the world a better place.

This is tragic. [Here is an essay of my own about him.]

The parents are confused, as they spoke to him only hours earlier, and he seemed excited and was planning a trip. They have my deepest sympathies. I too would demand a complete investigation under these circumstances. But also, unfortunately, it is common for children to hide these types of struggles from their parents.

The paper on how Palisade Research made BadGPT-4o in a weekend a month ago, stripping out the safety guidelines without degrading the model, sidestepping OpenAI’s attempt to stop that attack vector. Two weeks later this particular attack was patched, but Palisade believes the general approach will still work.

Write-up in Time Magazine of the Apollo study finding that AI models are capable of scheming.

Miles Brundage: Seems maybe noteworthy that the intelligent infrastructure we’re building into everything is kinda sometimes plotting against us.

Nah, who cares. AI won’t get better, after all.

Once again, since there was another iteration this week, please do not stop preparing for or saving for a relatively ‘ordinary’ future, even though things will probably not be so ordinary. Things are going to change quite a lot, but there are any number of ways that could go. And it is vital to your mental health, your peace of mind, and to your social relationships and family, and your future decisions, that you are prepared for an ‘AI fizzle’ world in which things like ‘save for retirement’ have meaning. Certainly one should not presume that in the future you won’t need capital, or that the transformational effects will arrive by any given year. See my Practical Advice for the Worried.

OpenAI CFO predicts willingness by companies to pay $2k/month subscriptions for virtual assistants. I agree that the sky’s the limit, if you can deliver the value. Realistically, there is nothing OpenAI could charge that it would be incorrect to pay, if they can deliver a substantially superior AI model to the alternatives. If there’s competition, then price goes closer to marginal cost.

Ajeya Cotra’s AI 2025 Forecast predictions includes a 45%+ chance that OpenAI’s high preparedness thresholds get crossed, which would stop deployment until mitigations bring the threat back down to Medium. Like others I’ve seen so far, benchmark progress is impressive, but bounded.

Ajeya Cotra: A lot of what’s going on here is that I think we’ve seen repeatedly that benchmark performance moves faster than anyone thinks, while real-world adoption and impact moves slower than most bulls think.

I have been surprised every year the last four years at how little AI has impacted my life and the lives of ordinary people. So I’m still confused how saturating these benchmarks translates to real-world impacts.

As soon as you have a well-defined benchmark that gains significance, AI developers tend to optimize for it, so it gets saturated way faster than expected — but not in a way that generalizes perfectly to everything else.

In the last round of benchmarks we had basically few-minute knowledge recall tasks (e.g. bar exam). Humans that can do those tasks well also tend to do long-horizon tasks that draw on that knowledge well (e.g. be a lawyer). But that’s not the case for AIs.

This round of benchmarks is few-hour programming and math tasks. Humans who do those tasks very well can also handle much longer tasks (being a SWE for many years). But I expect AI agents to solve them in a way that generalizes worse to those longer tasks.

If AI is optimizing for the benchmarks, or if the benchmarks are optimized for the things that AIs are on the verge of doing, then you should expect AI benchmark success to largely not translate to real world performance.

I think a lot of the delay is about us not figuring out how to use AI well, and most people not even doing the things we’ve figured out. I know I’m barely scratching the surface of what AI can do for me, despite writing about it all the time – because I don’t take the time to properly explore what it could do for me. A lot of the reason is that there’s always ‘well you’re super busy, and if it’s not that vital to you right now you can wait a few months and everything will get better and easier.’ Which has indeed been true in most cases.

The most interesting benchmark translation is the ‘N-hour task.’ I understand why AI that can recall facts or do few-minute tasks will still have trouble with N-hour tasks. But when the AI is doing 4-hour or 8-hour tasks, it seems like there should be much less of a barrier to going longer than that, because you already need agency and adaptability and so on. If you find a person who can be self-directed and agentic for even a day and certainly for a week, that usually means they can do it for longer if they want to – the crucial ‘skill check’ has been largely passed already.

Sam Hammond puts the chance of High risk at 40% or more.

Are we running out of data? Ilya Sutskever says yes in his talk, although he thinks we’ll work around it.

Jason Wei: Yall heard it from the man himself

Tyler Cowen says he is not convinced, because (in general) supply is elastic. Rohit says we can mine and generate more, ‘those Tibetan scrolls are still there,’ and it was noted that the National Archives are not yet digitized.

After what we have seen recently, I strongly believe AI is not hitting a wall, except insofar as it is bad news for the wall, and perhaps also the wall is us.

What seemingly has hit a wall is the single minded ‘scaling go brrr, add another zero’ approach to pretraining, data and model size. You have to be smarter than that, now.

Rohit Krishnan: What seems likely is that gains from pure scaling of pre-training seem to have stopped, which means that we have managed to incorporate as much information into the models per size as we made them bigger and threw more data at them than we have been able to in the past.

This is by no means the only way we know how to make models bigger or better. This is just the easiest way. That’s what Ilya was alluding to.

I mostly don’t buy that the data is out there. There are some untapped data sources, but the diminishing returns seem sharp to me. What is left mostly seems a mixture of more expensive, lower quality and relevance, and more redundant. We have, I am guessing, essentially saturated naturally occurring data in terms of orders of magnitude.

As others have put it, supply might be elastic, but this is not about finding somewhat more data. If you want to Scale Data, you don’t need 10% more data, you need 10 times as much data, duplicates don’t count, and virtual duplication counts little. In terms of existing data, my guess is we are damn close to done, except for seeking out small amounts of bespoke data on particular valuable narrow topics.

You’re probably better off reusing the high quality data we already have, or refining it, or converting it. Which gets you some more juice, but again zeroes are hard to find.

The other hope is creating new synthetic data. Here I have more uncertainty as to how effective this is, and how far it can go. I know the things I would try, and I’d mostly expect them to work with enough effort, but I’m going to keep that quiet.

And then we can find, as Ilya puts it, new S curves to climb, and new things to scale. I am confident that we will.

Those who want to interpret all this as lack of progress will find a way to do so. We still have articles like ‘Will Artificial Intelligence Hit a Wall?’ as if that would be anything other than bad news for the wall. My response would in part be: I would get back to you on that, but I am really busy right now keeping up with all the new AI capabilities and don’t even have time to try them all out properly.

Per a Tyler Cowen retweet, is this a good metric? Also: Should we ask what is the price they charge, or what is the price they are worth?

Rapha: The “we’ll never get rid of human data” camp is missing the memo that I’m starting to trust o1 for verification more than a random, unmotivated contractor.

Sholto Douglas: An interesting metric for AI progress—what is the price per hour for a contractor you trust to be able to improve/critique model outputs? How much has that changed over the last year?

I see problems with both metrics. I am more interested in how much they are actually worth. If you can find someone you trust for AI-related tasks, they are worth 10 times, or often 100+ times what they cost, whereas ‘what the person charges’ seems to end up being largely a question about social reality rather than value. And of course all this is measuring AI adoption and how we react to it, not the capabilities themselves.

In what is otherwise a puff piece, Google CEO Sundar Pichai says he’s ‘ready to work on a ‘Manhattan Project’ for AI.’ By which I presume (and hope!) he means he is happy to accept government funding, and beyond that we’ll see.

Miles Brundage: My general view of AI policy

Tag yourself? I think we’re more like here:

Thus, yes, I think there’s a lot of room to accelerate benefits, but we’re doing very little to truly address the risks.

Anton Leicht says evals are in trouble as something one could use in a regulation or law. Why? He lists four factors. Marius Hobbhahn of Apollo also has thoughts. I’m going to post a lot of disagreement and pushback, but I thank Anton for the exercise, which I believe is highly useful.

  1. The people who support evals often also work to help there be evals.

    1. I consider this to be both a rather standard and highly disingenuous line of attack (although of course Anton pointing it out as a concern is reasonable).

    2. The only concrete suggestion along these lines from others (not Anton!), that Dan Hendrycks was supporting SB 1047 for personal profit, also the only prominent specific claim so far, never made the slightest bit of sense. Beyond that, the general complaint is that people who think evals are a good idea might both advocate for building and using evals?

      1. I’m guessing this ultimately didn’t matter much, and if it did matter I’m highly unsure on direction. It was rather transparently bad faith, and sometimes people notice that sort of thing, and also it highlighted how much Dan and others are giving up to make these arguments.

    3. Plenty of other people are also supporting the evals, often including the labs.

    4. How could this be better? In some particular cases, we can do more divesting in advance, or avoiding direct interests in advance, and avoiding linking orgs or funding mechanisms to avoid giving easy rhetorical targets. Sure. In the end I doubt it makes that much difference.

  2. Incentives favor finding dramatic capabilities over measuring propensity.

    1. I think this is just the correct approach, though. AIs are going to face a huge number of situations, and if you want to elicit a result that only appears 1/N of the time, you can just ask an average of ~N times (see the quick calculation after this list).

    2. Also, if it has the capability at all, there’s almost certainly a way to elicit the behavior a lot more often, and someone who wants to elicit that behavior.

    3. If it does it rarely now, that is the strongest sign it will do it more often in future situations where you care more about the model not doing it.

    4. And so ‘it does this a small percent of the time’ is exactly the best possible red flag that lets you do something about it now.

    5. Propensity is good too, and we should pay attention to it more especially if we don’t have a fixed amount of attention, but I don’t see that big an issue here.

  3. People see the same result as ‘rabbit and duck,’ both catastrophic and harmless.

    1. Well, okay, sure, but so what? As Marius says, this is how science communication is pretty much all the time now.

    2. A lot of this in this particular case is that some people see evidence of an eventual duck, while others see evidence of a rabbit under exactly current conditions. That’s not incompatible.

    3. Indeed, you want evals that trigger when there isn’t a problem that will kill you right now! That’s how you act before you’re already dead.

    4. You also want other evals that trigger when you’re actually for real about to die if you don’t do something. But ideally you don’t get to that point.

    5. Every safety requirement or concern people find inconvenient, at least until it is well established, has someone yelling that it’s nothing, and often an industry dedicated to showing it is nothing.

  4. People keep getting concerning evals and releasing anyway.

    1. Again, that’s how it should work, no? You get concerning evals, that tells you where to check, you check, and you probably release anyway?

      1. Yes, they can say ‘but last time you said the evals were concerning and nothing happened,’ but they could always say that, and if we live in a world where that argument carries the day without even being true, then you’re trying to precisely time an exponential and act from zero, and we should all spend more time with our families.

    2. Or you get concerning evals that make you a lot more cautious going forward, you take forward precautions, this time you release? Also good?

    3. I do not hear a lot of people saying ‘o1 should not have been released,’ although when I asked explicitly there are more people who would have delayed it than I realized. I would have held it until the model card was ready, but also I believe I could have had the model card ready by release day, or at least a day or two later, assuming they were indeed running the tests anyway.

    4. The evals we use have very clear ‘this is the concern level where we wouldn’t release’ and the evals are very clearly not hitting those levels. That seems fine? If you say ‘High means don’t release’ and you hit Medium and release anyway, does that mean you are ‘learning to ignore evals’?

      1. We should emphasize this more, and make it harder for people to misinterpret and lie about this. Whereas the people who do the opposite would be wise to stop, unless they actually do think danger is imminent.

    5. As Marius Hobbhahn points out, evals that you ignore are a foundation with which others can then argue for action, including regulatory action.
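On the 1-in-N point under item 2 above, here is the quick calculation, which assumes nothing beyond independence across trials:

```python
# Expected trials to elicit a 1-in-N behavior once (geometric distribution),
# and the chance of seeing it at least once across k independent reruns.
def expected_trials(N: float) -> float:
    return N  # mean of a geometric distribution with success probability 1/N

def p_at_least_once(N: float, k: int) -> float:
    return 1 - (1 - 1 / N) ** k

print(expected_trials(1000))        # 1000.0 runs on average
print(p_at_least_once(1000, 3000))  # ~0.95: roughly 3N reruns near-guarantees a hit
```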

So in short I notice I am confused here as to why this is ‘politically vulnerable’ in a way other than ‘power wants to release the models and will say the things power says.’ Which to be clear is a big deal, but I so often find these types of warnings to not correspond to what actually causes issues, or what would actually defuse them.

I do agree that we should have a deliberate political strategy to defend evals against these (very standard) attacks by power and those who would essentially oppose any actions no matter what.

The big talk of the week was by Ilya Sutskever. Self-recommending, but also didn’t say anything big I didn’t already know. Here is a written summary from Carlos Perez.

The important point is about future superintelligence. Both that it is coming, and that it will be highly unpredictable by humans.

As one audience member notes, it is refreshing that Ilya sidesteps ‘will they replace us?’ or ‘will they need rights’ here. The implications are obvious if you pay attention.

But also I think he is deeply confused about such questions, and probably hiding from them somewhat including in his own head. In the question phase, he says ‘if they only want to coexist with us, and maybe also to have rights, maybe that will be fine’ and declines to speculate, and I feel like if he was thinking about such questions as much as he should be he’d have better answers.

Ramin Hasani: Ilya opens scientists’ mind! imo, this is the most important slide and take away from his inspiring talk today at #NeurIPS2024

when trends plateau, nature looks for other species!

Simeon: I find it amusing how the best AI scientists like Ilya, Bengio or Hinton keep centrally using evidence that would be labelled as “unrigorous” or as “anthropomorphizing” by 90% of the field (and even more at the outskirts of it)

Roon: lot of absolute amoebas disagreeing with Ilya w completely specious logic

I thought everything in Ilya’s main talk seemed essentially correct.

Eugenia Kuyda, the founder and CEO of Replika, which produces ‘AI companions with a soul,’ calls AI companions perhaps the greatest potential existential risk to humanity, also saying we might slowly die inside and not have any willpower. And when asked how she’s building AI companions safely, she said ‘it’s a double edged sword, if it’s really powerful, so it can do both’ and that’s that? Here is the full video.

I disagree about this particular danger. I think she’s wrong.

What I also know is:

  1. I am very happy that, if she believes this, she is saying so.

  2. If she believes this, maybe she shouldn’t be in the AI companion business?

If you believe [X] is the greatest existential threat to humanity, don’t build [X]?

I mean, if [X] equals ASI and can unlock immortality and the stars and paradise on Earth, then some amount of existential risk is acceptable. But… for AI companions? The answer to this ‘doctor, doctor, there’s existential risk in doing that’ has to be ‘then don’t do that,’ no?

I mean, she’s basically screaming ‘my product is terrible, it will ruin your life, under no circumstances should anyone ever use my product.’ Well, then!

Is it all just hype?

Robin Hanson: Seems a brag; would she really make it if she thought her firm’s tech was that bad?

Zvi Mowshowitz: I have updated to the point where my answer is, flat out, yes.

Is it possible that this is mostly or entirely hype on her part? It is certainly possible. The claim seems false to me, absurd even.

But three things point against this:

  1. It doesn’t seem like smart or good hype, or executed in a hype-generating way.

  2. If people really did believe this, crackdowns specifically on companions seem like a highly plausible future, whether justified or not, and a threat to her business.

  3. If she did believe this, she would be more likely to be doing this, not less.

As in, the AI industry has a long history of people being convinced AI was an existential threat to humanity, and as a direct result of this deciding that they should be the ones to build it first. Some to ‘ensure it is safe,’ some for profit, some because the product is too delicious, some because they don’t mind extinction. Many for a combination, that they don’t fully admit to themselves. It’s standard stuff.

So, yeah. It makes total sense to me that the person who started this section of the industry thinks that her product is uniquely dangerous and harmful. I don’t see why this would surprise anyone, at this point.

Anthropic made a video discussing Clio and how people use Claude, as per above.

How many times do we have to remind you that if you wait until they start self-improving, there is likely no ‘plug’ to unplug; even most AI movies understand this now.

Tsarathustra: Ex-Google CEO Eric Schmidt says there will soon be computers making their own decisions and choosing their own goals, and when they start to self-improve we need to consider unplugging them; and each person will soon have a polymath in their pocket and we don’t know what the consequences will be of giving individuals that kind of power.

Lisa Kudrow says the movie Here is an endorsement of AI, because the de-aging tech just works and it doesn’t take away from the talent at all. The problem is that in the long term, you start not needing Robin Wright and Tom Hanks to make the movie.

The other problem, of course, is that by all reports Here sucked, and was a soulless movie that did nothing interesting with its gimmick and wasted the talent.

Roon vs. Liron: AI Doom Debate.

Richard Ngo quite reasonably worries that everyone is responding to the threat of automated AI R&D in ways all but designed to accelerate automated AI R&D, and that this could be a repeat of what happened with the AGI labs. The problem is, we need evals and other measures so we know when AI R&D is near, if we want to do anything about it. But by doing that, we make it legible, and make it salient, and encourage work on it. Saying ‘X is dangerous’ makes a lot of people go ‘oh I should work on X then.’ What to do?

A strange but interesting and clearly very earnest thread between Teortaxes, Richard Ngo and doomslide, with a very different frame than mine on how to create a positive future, and approaching the problems we must solve from a different angle. Once again we see ‘I don’t see how to solve [X] without restricting [Y], but that’s totalitarianism so we’ll have to do something else.’

Except, as is so often the case, without agreeing or disagreeing that any particular [Y] is needed, I don’t see why all or even most implementations of [Y] are totalitarianism in any sense that differentiates it from the current implementation of American Democracy. Here [Y] is restricting access to large amounts of compute, as it often is in some form. Thus I keep having that ‘you keep using that word’ moment.

Gabriel points out that there is quite a lot of unhobbling (he calls it ‘juicing’) left in existing models, I would especially say this is true of o1 pro, where we truly do not yet know what we have, although I am far more skeptical that one could take other current models that far and I do not think there is substantial danger we are ‘already too late’ in that sense. But yes, we absolutely have to plan while taking into account that what the model can do in the evals before release is a pale shadow of what it is capable of doing when properly unleashed.

Claim that ‘apart from its [spoiler] ending’ we basically have everything from Ex Machina from 2015. I do not think that is true. I do think it was an excellent movie, and I bet it holds up very well, and also it really was never going to end any other way and if you don’t understand that you weren’t and aren’t paying attention.

You’re always grading on a curve. The question is, what curve?

FLI came out with its annual scorecard; the most notable change is the inclusion of Zhipu AI:

All models are vulnerable to jailbreaks. No one’s strategy solves the control problem. Only Anthropic even pretends to have a meaningful external oversight mechanism.

The labs do best on Current Harms. If anything I would have been more generous here: I don’t think OpenAI has done anything bad enough in practical terms to warrant a D+, and I’d probably upgrade Anthropic and Google as well, although the xAI and Meta grades seem fair.

On safety frameworks I’d also be inclined to be a bit more generous, although the labs that have nothing obviously still fail. The question is, are we factoring in our confidence that they will actually follow their frameworks?

On governance and accountability, and on transparency and communication, I think these grades might be too generous. But as always, it depends on compared to what?

If it’s compared to ‘what it would take to get through this most of the time’ then, well, reality is the one thing that doesn’t grade on a curve, and we’re all in trouble.

Do you want the model to cooperate with other AIs? With copies of itself?

I strongly say yes. You absolutely want your AI to be doing this; the things that cause it not to do this are so, so much worse.

Sauers: Over generations, societies composed of Claude 3.5 Sonnet agents evolve high levels of cooperation, whereas GPT-4o agents tend to become more distrustful and give less.

Agents play a 12-round game in which they can donate to a recipient, doubling the recipient’s gain at the donor’s expense. Agents can see the recent actions of other agents. After all rounds, the top 50% of agents survive, and the rest are replaced by new agents prompted with the survivors’ strategies.
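For intuition, here is a minimal sketch of that style of donation game. The population size, payoffs, and agent policies are my own simplifications, not Sauers’ actual setup: the real experiment uses LLM agents that condition on the visible history, whereas a fixed donation probability is a crude stand-in.

```python
import random

ROUNDS = 12   # rounds per generation, as in the description above
POP = 20      # population size (assumed; not specified in the thread)
COST = 1.0    # the donor gives up this much...
GAIN = 2.0    # ...and the recipient receives double

def play_generation(donate_probs):
    """Run one generation and return each agent's accumulated score."""
    scores = [0.0] * len(donate_probs)
    for _ in range(ROUNDS):
        for donor in range(len(donate_probs)):
            recipient = random.choice([i for i in range(len(donate_probs)) if i != donor])
            if random.random() < donate_probs[donor]:
                scores[donor] -= COST       # donation comes at the donor's expense
                scores[recipient] += GAIN   # and is doubled for the recipient
    return scores

def next_generation(donate_probs, scores):
    """Top 50% survive; the rest are replaced with copies of survivors' strategies."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    survivors = [donate_probs[i] for i in ranked[: len(ranked) // 2]]
    return survivors + [random.choice(survivors) for _ in range(len(ranked) - len(survivors))]

pop = [random.random() for _ in range(POP)]   # start with a mixed population
for gen in range(10):
    scores = play_generation(pop)
    pop = next_generation(pop, scores)
    print(f"generation {gen}: mean donate prob = {sum(pop) / len(pop):.2f}")
```

With unconditional strategies like these, free riders tend to win and average generosity drifts down, roughly the GPT-4o failure mode; sustained cooperation requires agents that condition their donations on others’ past behavior, which is exactly the part the LLM policies supply.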

Teortaxes: People are afraid of AI cooperation but if an agent won’t cooperate with a copy of itself, then it is just an All-Defector and its revealed philosophy amounts to uncompromising egoism. Naturally emerging self-cooperation is the minimal requirement for trustworthiness.

Kalomaze: Anthropic explicitly trains on 10+ multiturn conversations designed to improve in-context learning abilities, while most post-trains are naive single turn

most of the RL improvements they have are from smarter people defining the RL rewards, not necessarily smarter algorithms

Janus: this seems like an oversimplification but directionally correct

Post training on single-turn not only fails to improve in-context learning abilities, it wrecks it.

4o, overfit to lmsys, doesn’t seem to perceive or care about what happens outside a tiny window, past or future

Having AIs that cannot solve even simple game theoretic and decision theoretic problems and thus cannot cooperate, or that are myopic, is in some situations a defense against certain downside scenarios. Sure. People aren’t wrong to have a bunch of specific worries about this. It creates problems we will have to solve.

But if you start using the uncooperative AIs, the alternative is a ticket to obvious complete clusterfery no matter what: both because it directly causes complete clusterfery, and because it is a telltale sign of the complete clusterfery going on elsewhere that caused it or prevented a solution, provided the AIs are otherwise sufficiently capable.

Depends what the eval is for.

Greg Brockman: vibe checks are great evals

Daviad: gosh, alignment is hard! let’s focus on corrigibility

gosh, corrigibility is hard! let’s focus on circuit interpretability

gosh, circuit interpretability is hard! let’s focus on scalable oversight

gosh, scalable oversight is hard! let’s focus on evals

gosh, evals are hard! let’s,

Yes, well.

We have a new paper draft on ‘the economics of AI alignment.’

What I love most about economics is you can have one paragraph that, once written down, is obviously true, and that then lets everyone think better about the problem, and which saves you the need to read through the 60+ page paper ‘proving’ it. Often you already knew that fact, but it’s good to state it cleanly. And what is often most useful about that paragraph is seeing exactly what assumptions were required.

What I hate most about economics is that you still need the 60+ page paper, which takes 2 years and tons of your time, before anyone accepts the paragraph. And that quite often, the paper ends up making you do a bunch of work to extract the paragraph. The abstract is often not that close to the original paragraph, or written to imply something stronger.

This paper was a prime case of that.

Great original paragraph to write down, but they hid it, and all the vibes are way off, presenting this as a good plan rather than showing why it isn’t a good plan.

The plan seems to be: ‘have a strategic AI smarter than you, don’t align it, but deploy it anyway, because we will use principal-agent solutions,’ or rather ‘we will use MIRI-style decision-theoretic uncertainty to make the AI unsure of its true situation, so it has to cooperate with us,’ combined with ‘assume you can perfectly evaluate whether a given output is aligned, and that the AI cannot figure out whether it is in deployment, even over time.’

Sami Petersen: We develop a technique for safely delegating to a strategic AI that may be misaligned with your objectives—without assuming you can restrict its choices, punish it, or otherwise control it once deployed.

How? Sequential information design with imperfect recall.

Here’s the problem. A modern machine learning system could, for all we know, learn to pursue any one of many distinct objectives, each of which is consistent with the behaviour we observe during the training process.

Without knowing its objective, how can we safely delegate a task to an AI? As AI agents get more sophisticated, simple testing may cease to work. A strategic AI that is misaligned with the developer has reason to play nice in order to be deployed, rendering testing uninformative.

This is a principal-agent problem, and there exists plenty of work designing mechanisms to screen for bad types and to discipline their behaviour. What makes the AI case special?

On the one hand, future AIs may be capable of evading our attempts to control them once deployed. Anthropic, Google DeepMind, and OpenAI are already evaluating their systems for “self-replication” abilities, to mitigate the risk that they successfully avoid being shut down.

On the other hand, AIs are easy to copy. We can run multiple versions of a system, in identical environments. Copies need not have any memory of what their other selves did. This lets us make the AI uncertain about its location—even if it knows the structure of the game it’s in.

What can we do with this tool? As it turns out, a lot.

By repeatedly testing the AI without recall, we can (i) screen for misaligned types, detecting them before deployment, and (ii) discipline their behaviour even if deployed.

Here’s a quick summary of our main results.

(1) Screening is asymptotically perfect: we can identify the agent’s type before deploying with arbitrarily high precision, by increasing the number of tests.

(2) Disciplining can be strong enough to make it worthwhile to deploy even misaligned types. This is possible under a mild condition on payoffs. In some cases, disciplining can even be perfect.

(3) What if we can’t make testing and deployment perfectly indistinguishable? Our mechanism is robust to this. The AI can observe noisy location-relevant information with arbitrary accuracy, and we can nevertheless overwhelm it by increasing the number of tests.

(4) In equilibrium, screening can only be achieved with imperfect recall. Without imposing a strong commitment assumption on the human principal, it’s impossible to achieve the above results with memories intact.

Suppose we repeatedly test the AI without recall, and only deploy if it behaved sufficiently well. By making the AI unsure whether it is facing the real task or a simulation, a misaligned type may be forced to randomise between playing nice and doing what it wants. Why?

Let p be the (behavioural) strategy of the misaligned type, representing the chance of it doing what we want. If p=0, we’d catch it for sure, and it’ll never be deployed (payoff 0). If p=1, it will be deployed, but it’ll play its disfavoured action (payoff <0).

By setting p∈(0,1), it’ll be deployed and play its favourite action with positive probability. Setting p optimally gives it a positive expected payoff.

This lets us partly screen for misaligned types (only misaligned ones can fail a test) and, even if we don’t succeed, we discipline them in deployment (even misaligned ones play nice with positive probability).

Or, look at their abstract:

By committing to a testing mechanism, the principal can screen the misaligned AI during testing and discipline its behaviour in deployment. Increasing the number of tests allows the principal to screen or discipline arbitrarily well.

No, no, no, no, no! You can’t do that. You’re assuming the conclusion. Stop it!

Here’s the core result, as I understand it:

  1. If you can run a lot of memory-less tests on an AI.

  2. And you can correctly evaluate those tests to see if the response is aligned.

  3. And the AI can’t distinguish whether it is being tested versus deployed.

  4. And the situations in the test set correspond sufficiently to the situations in deployment.

  5. Then…

    1. …you can figure out if a given AI is aligned.

    2. …and you can likely deploy a misaligned AI and have it mostly act aligned.

As usual with such papers, yes, but if you’d just said that up front we wouldn’t have needed a 60+ page paper, and a lot of people wouldn’t be under the impression that this was a lot more promising and impressive than it is.

I mean, yes, obviously: If I can run memoryless versions of an AI, while providing insufficient contextual clues, and I can determine whether an output is misaligned sufficiently well, and I can ‘punish’ the AI, then the AI is going to have to choose to mostly respond aligned in any given situation, or I can find out if it is aligned.
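To make the trade-off concrete, here is a toy sketch under payoffs I made up, not the paper’s model: the misaligned type plays nice with probability p in every episode, it is only deployed if it passes all n memoryless tests, and it is only paid when it takes its favoured action while deployed.

```python
# Toy version of the "test without recall, then deploy" mechanism.
# Assumed payoffs (not the paper's): +1 each time the misaligned AI takes its
# favoured action while deployed, 0 if it is never deployed. The principal
# deploys only if every test output looks aligned.

def misaligned_expected_payoff(p: float, n_tests: int) -> float:
    """Expected payoff for a misaligned AI that plays nice with probability p
    in every episode and cannot tell tests from deployment."""
    pass_all_tests = p ** n_tests   # screening: one slip during testing and it is caught
    defect_rate = 1.0 - p           # disciplining: how often it misbehaves if deployed
    return pass_all_tests * defect_rate

for n in (1, 10, 100, 1000):
    # Grid-search the p a misaligned type would pick (the analytic optimum is n/(n+1)).
    best_p = max((i / 1000 for i in range(1001)),
                 key=lambda p: misaligned_expected_payoff(p, n))
    print(f"n_tests={n:4d}  best p~{best_p:.3f}  "
          f"P(pass all tests)={best_p ** n:.3f}  "
          f"misbehaviour rate if deployed={1 - best_p:.3f}")
```

Under these toy payoffs the optimal p is roughly n/(n+1), so more tests do discipline the misaligned type (it almost never misbehaves once deployed) but never screen it out with certainty (its chance of passing every test tends toward 1/e rather than zero); the paper’s stronger asymptotic screening claim leans on its richer mechanism and, crucially, on assumptions 1 through 4 above.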

They really are even proposing a ‘disciplining is best’ case, where they actually deploy AIs that might be misaligned, on the theory that the uncertainty about deployment will keep them in line.

For an even starker response, and an extensive list of the assumptions, I suggest trying the following prompt or something similar, plus the PDF:

The authors admit these assumptions are a little strong, but come on. Seriously. They’re not ‘a little strong,’ they’re wishing away all the real issues.

If this is your plan for how to deploy future AIs much smarter than you, and let them operate strategically in the real world?

You are so spectacularly, obviously dead.

Within one minute I came up with several plans that would presumably work here. An AI would presumably do something smarter. But if you placed literal me (as an upload, let’s say) into this situation, and this was your plan, and I decide (quite reasonably, given what you clearly did to me) that I dislike you, I am rather confident: You lose.

And that’s true even if I give you the assumption that you can ‘tell if the output is aligned’ during testing, in the sense I presume it is intended here.

I found Tyler Cowen’s statement on this telling, especially the last line.

Tyler Cowen: That is by Eric Olav Chen, Alexis Ghersengorin, and Sami Petersen. And here is a tweet storm on the paper. I am very glad to see the idea of an optimal principal-agent contract brought more closely into AI alignment discussions.

As you can see, it tends to make successful alignment more likely.

There is great demand for the perception that successful alignment is more likely.

Sign of the times.

Andrej Karpathy: Driving around SF. Omg this is crazy I can’t believe there’s billboards advertising cloud GPUs on the streets of SF, the hype is totally out of control.

That said, actually I would like some more GPU, and I haven’t heard of this company yet, this looks interesting.

Perfection.

To those who are losing sleep over existential risk from AI, I offer some perspective:




TP-Link faces possible US ban as hijacked routers fuel Chinese attacks

Chinese hackers use botnet of TP-Link routers

Microsoft warned on October 31 that hackers working for the Chinese government are using a botnet of thousands of routers, cameras, and other Internet-connected devices for attacks on users of Microsoft’s Azure cloud service. Microsoft said that “SOHO routers manufactured by TP-Link make up most of this network,” referring to routers for small offices and home offices.

The WSJ said its sources allege that “TP-Link routers are routinely shipped to customers with security flaws, which the company often fails to address” and that “TP-Link doesn’t engage with security researchers concerned about them.” The article notes that “US officials haven’t disclosed any evidence that TP-Link is a witting conduit for Chinese state-sponsored cyberattacks.”

We contacted TP-Link today and will update this article if it provides a response. A TP-Link spokesperson told the WSJ that the company “welcome[s] any opportunities to engage with the US government to demonstrate that our security practices are fully in line with industry security standards, and to demonstrate our ongoing commitment to the US market, US consumers, and addressing US national security risks.”

A March 2024 Hudson Institute policy memo by Michael O’Rielly, a former Federal Communications Commission member, said it remained “unclear how prevalent TP-Link’s vulnerabilities are compared to other wireless routers—from China or elsewhere—as there is no definitive comparison or ranking of routers based on security.” O’Rielly urged federal agencies to “keep track of TP-Link and other manufacturers’ cybersecurity practices and ownership structure, including any ties to the Chinese government,” but said “there is no evidence to suggest negligence or maliciousness with regard to past vulnerabilities or weaknesses in TP-Link’s security.”

New push against Chinese tech

TP-Link routers don’t seem to be tied to an ongoing Chinese hack of US telecom networks, dubbed Salt Typhoon. But that attack increased government officials’ urgency for taking action against Chinese technology companies. For example, the Biden administration is “moving to ban the few remaining operations of China Telecom,” a telco that was mostly kicked out of the US in 2021, The New York Times reported on Monday.



Report: Elon Musk failed to report movement required by security clearance

Musk ultimately received the security clearance, but since 2021, he has failed to self-report details of his life, including travel activities, persons with whom he has met, and drug use, according to the Times. The government is also concerned that SpaceX did not ensure Musk’s compliance with the reporting rules.

Government agencies “want to ensure the people who have clearances don’t violate rules and regulations,” Andrew Bakaj, a former CIA official and lawyer who works on security clearances, told the Times. “If you don’t self-report, the question becomes: ‘Why didn’t you? And what are you trying to hide?'”

According to the report, Musk’s handling of classified information has raised questions in diplomatic meetings between the United States and some of its allies, including Israel.

Musk’s national security profile has risen following his deep-pocketed and full-throated support of Donald Trump, who won the US presidential election in November and will be sworn into office next month. After his inauguration, Trump will have the power to grant security clearance to whomever he wishes.



Companies issuing RTO mandates “lose their best talent”: Study


Despite the risks, firms and Trump are eager to get people back into offices.

Return-to-office (RTO) mandates have caused companies to lose some of their best workers, a study tracking over 3 million workers at 54 “high-tech and financial” firms in the S&P 500 index has found. These companies also have greater challenges finding new talent, the report concluded.

The paper, Return-to-Office Mandates and Brain Drain [PDF], comes from researchers at the University of Pittsburgh, as well as Baylor University, The Chinese University of Hong Kong, and Cheung Kong Graduate School of Business. The study, which was published in November and spotted this month by human resources publication HR Dive, and which cites Ars Technica reporting, was conducted by collecting information on RTO announcements and sourcing data from LinkedIn. The researchers said they only examined companies with data available for at least two quarters before and after they issued RTO mandates. The researchers explained:

To collect employee turnover data, we follow prior literature … and obtain the employment history information of over 3 million employees of the 54 RTO firms from Revelio Labs, a leading data provider that extracts information from employee LinkedIn profiles. We manually identify employees who left a firm during each period, then calculate the firm’s turnover rate by dividing the number of departing employees by the total employee headcount at the beginning of the period. We also obtain information about employees’ gender, seniority, and the number of skills listed on their individual LinkedIn profiles, which serves as a proxy for employees’ skill level.
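As a rough sketch of that turnover calculation, here is a minimal example; the table, column names, and numbers are hypothetical stand-ins, not Revelio Labs’ actual schema or data:

```python
import pandas as pd

# Hypothetical employment-history rows, one per employee spell at a firm.
# Illustrates only the turnover formula described above, not the paper's code.
spells = pd.DataFrame({
    "firm":      ["A", "A", "A", "A", "B", "B", "B"],
    "employee":  [1, 2, 3, 4, 5, 6, 7],
    "left_in_q": [True, False, False, True, False, True, False],  # departed this quarter?
})

headcount_start = spells.groupby("firm")["employee"].count()  # headcount at start of period
departures = spells.groupby("firm")["left_in_q"].sum()        # employees who left during it
turnover_rate = departures / headcount_start

print(turnover_rate)  # firm A: 2/4 = 0.50, firm B: 1/3 ≈ 0.33
```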

There are limits to the study, however. The researchers noted that the study “cannot draw causal inferences based on our setting.” Further, smaller firms and firms outside of the high-tech and financial industries may show different results. Although not mentioned in the report, relying on data from a social media platform could also yield inaccuracies, and the number of skills listed on a LinkedIn profile may not accurately depict a worker’s skill level.

Still, the study provides insight into how employees respond to RTO mandates and the effect it has on corporations and available talent at a time when entities like Dell, Amazon, and the US government are getting stricter about in-office work.

Higher turnover rates

The researchers concluded that the average turnover rates for firms increased by 14 percent after issuing return-to-office policies.

“We expect the effect of RTO mandates on employee turnover to be even higher for other firms,” the paper says.

The researchers included testing to ensure that the results stemmed from RTO mandates “rather than time trends.” For example, the researchers found that “there were no significant increases in turnover rates during any of the five quarters prior to the RTO announcement quarter.”

Potentially alarming for employers is the study finding that senior and skilled employees were more likely to leave following RTO mandates. This aligns with a study from University of Chicago and University of Michigan researchers published in May that found that Apple and Microsoft saw senior-level employee bases decrease by 5 percentage points and SpaceX a decrease of 5 percentage points. (For its part, Microsoft told Ars that the report did not align with internal data.)

Senior employees are expected to be more likely to leave, the new report argues, because such workers have “more connections with other companies” and have easier times finding new jobs. Further, senior, skilled employees are “dissatisfied” when management blames remote work for low productivity.

Similarly, the report supports concerns from some RTO-resistant employees that back-to-office mandates have a disproportionate impact on certain groups, like women, which the researchers said show “more pronounced” attrition rates following RTO mandates:

Importantly, the effect on female employee turnover is almost three times as high as that on male employees … One possible reason for these results is that female employees are more affected by RTO mandates due to their greater family responsibilities, which increases their demand for workplace flexibility and work-life balance.

Trouble finding talent

RTO mandates also have a negative impact on companies’ ability to find new employees, the study found. After examining over 2 million job postings, the researchers concluded that companies with RTO mandates take longer to fill job vacancies than before:

On average, the time it takes for an RTO firm to fill its job vacancies increases by approximately 23 percent, and the hire rate decreases by 17 percent after RTO mandates.

The researchers also found “significantly higher hiring costs induced by RTO mandates” and concluded that the findings combined “suggest that firms lose their best talent after RTO mandates and face significant difficulties replacing them.”

“The weakest form of management”

RTO mandates can obviously drive away workers who prioritize work-life balance, avoiding commutes and associated costs, and who feel more productive working in a self-controlled environment. The study, however, points to additional reasons RTO mandates make some people quit.

One reason cited is RTO rules communicating “a culture of distrust that encourages management through monitoring.” The researchers noted that Brian Elliott, CEO at Work Forward and a leadership adviser, described this as the “weakest form of management—and one that drives down employee engagement” in a November column for MIT Sloan Management Review.

Indeed, RTO mandates have led to companies like Dell performing VPN tracking, and companies like Amazon, Google, JP Morgan Chase, Meta, and TikTok reportedly tracking badge swipes, resulting in employee backlash.

The new study also pointed to RTO mandates making employees question company leadership and management’s decision-making abilities. We saw this with Amazon, when over 500 employees sent a letter to Amazon Web Services (AWS) CEO Matt Garman, saying that they were “appalled to hear the non-data-driven explanation you gave for Amazon imposing a five-day in-office mandate.”

Employees are also put off by the drama that follows an aggressive RTO policy, the report says:

An RTO announcement can be a big and sudden event that is distasteful to most employees, especially when the decision has not been well communicated, potentially triggering an immediate response of employees searching for and switching to new jobs.

After Amazon announced it would kill remote work in early 2025, a study by online community Blind found that 73 percent of 2,285 Amazon employees surveyed were “considering looking for another job” in response to the mandate.

“A wave of voluntary terminations”

The paper points to reasons that employees may opt to stay with a company post-RTO mandates. Those reasons include competitive job markets, personal costs associated with switching jobs, loyalty, and interest in the collaborative and social aspects of working in-office.

However, with the amount of evidence that RTO mandates drive employees away, some question if return-to-office mandates are subtle ways to reduce headcount without layoffs. Comments like AWS’s Garman reportedly telling workers that if they don’t like working in an office, “there are other companies around” have fueled this theory, as has Dell saying remote workers can’t get promoted. A BambooHR survey of 1,504 full-time US employees, including 504 HR managers or higher, in March found that 25 percent of VP and C-suite executives and 18 percent of HR pros examined “admit they hoped for some voluntary turnover during an RTO.”

Yesterday, President-elect Donald Trump said he plans to do away with a deal that allowed the Social Security Administration’s union to work remotely into 2029 and that those who don’t come back into the office will “be dismissed.” Similarly, Elon Musk and Vivek Ramaswamy, who Trump announced will head a new Department of Government Efficiency, wrote in a November op-ed that “requiring federal employees to come to the office five days a week would result in a wave of voluntary terminations that we welcome.”

Helen D. (Heidi) Reavis, managing partner at Reavis Page Jump LLP, an employment, dispute resolution, and media law firm, previously told Ars that employers “can face an array of legal consequences for encouraging workers to quit via their RTO policies.” Still, RTO mandates are set to remain a point of debate and tension in workplaces into the new year.


Scharon is Ars Technica’s Senior Product Reviewer writing news, reviews, and analysis on consumer technology, including laptops, mechanical keyboards, and monitors. She’s based in Brooklyn.



Facing ban next month, TikTok begs SCOTUS for help

TikTok: Ban is slippery slope to broad US censorship

According to TikTok, the government’s defense of the ban to prevent China from wielding a “covert” influence over Americans is a farce invented by lawyers to cover up the true mission of censorship. If the lower court’s verdict stands, TikTok alleged, “then Congress will have free rein to ban any American from speaking simply by identifying some risk that the speech is influenced by a foreign entity.”

TikTok doesn’t want to post big disclaimers on the app warning of “covert” influence, claiming that the government relied on “secret evidence” to prove this influence occurs on TikTok. But if the Supreme Court agrees that the government needed to show more than “bare factual assertions” to back national security claims the lower court said justified any potential speech restrictions, then the court will also likely agree to reverse the lower court’s decision, TikTok suggested.

It will become much clearer by January 6 whether the January 19 ban will take effect, at which point TikTok would shut down, booting all US users from the app. TikTok urged the Supreme Court to agree it is in the public interest to delay the ban and review the constitutional claims to prevent any “extreme” harms to both TikTok and US users who depend on the app for news, community, and income.

If SCOTUS doesn’t intervene, TikTok said that the lower court’s “flawed legal rationales would open the door to upholding content-based speech bans in contexts far different than this one.”

“Fearmongering about national security cannot obscure the threat that the Act itself poses to all Americans,” TikTok alleged, while suggesting that even Congress would agree that a “modest delay” in enforcing the law wouldn’t pose any immediate risk to US national security. Congress is also aware that a sale would not be technically, commercially, or legally possible in the timeframe provided, TikTok said. A temporary injunction would prevent irreparable harms, TikTok said, including the irreparable harm courts have long held is caused by restricting speech of Americans for any amount of time.

“An interim injunction is also appropriate because it will give the incoming Administration time to determine its position, as the President-elect and his advisors have voiced support for saving TikTok,” TikTok argued.

Ars could not immediately reach TikTok for comment.



Critical WordPress plugin vulnerability under active exploit threatens thousands

Thousands of sites running WordPress remain unpatched against a critical security flaw in a widely used plugin that was being actively exploited in attacks that allow for unauthenticated execution of malicious code, security researchers said.

The vulnerability, tracked as CVE-2024-11972, is found in Hunk Companion, a plugin that runs on 10,000 sites that use the WordPress content management system. The vulnerability, which carries a severity rating of 9.8 out of a possible 10, was patched earlier this week. At the time this post went live on Ars, figures provided on the Hunk Companion page indicated that less than 12 percent of users had installed the patch, meaning nearly 9,000 sites could be next to be targeted.

Significant, multifaceted threat

“This vulnerability represents a significant and multifaceted threat, targeting sites that use both a ThemeHunk theme and the Hunk Companion plugin,” Daniel Rodriguez, a researcher with WordPress security firm WP Scan, wrote. “With over 10,000 active installations, this exposed thousands of websites to anonymous, unauthenticated attacks capable of severely compromising their integrity.”

Rodriguez said WP Scan discovered the vulnerability while analyzing the compromise of a customer’s site. The firm found that the initial vector was CVE-2024-11972. The exploit allowed the hackers behind the attack to cause vulnerable sites to automatically navigate to wordpress.org and download WP Query Console, a plugin that hasn’t been updated in years.



Google steps into “extended reality” once again with Android XR

Citing “years of investment in AI, AR, and VR,” Google is stepping into the augmented reality market once more with Android XR. It’s an operating system that Google says will power future headsets and glasses that “transform how you watch, work, and explore.”

The first version you’ll see is Project Moohan, a mixed-reality headset built by Samsung. It will be available for purchase next year, and not much more is known about it. Developers have access to the new XR version of Android now.

“We’ve been in this space since Google Glass, and we have not stopped,” said Juston Payne, director of product at Google for XR in Android XR’s launch video. Citing established projects like Google Lens, Live View for Maps, instant camera translation, and, of course, Google’s general-purpose Gemini AI, XR promises to offer such overlays in both dedicated headsets and casual glasses.

Android XR announcement video.

There are few additional details right now beyond a headset rendering and examples in Google’s video labeled as “visualization for concept purposes.” Google’s list of things that will likely be on board includes Gemini, Maps, Photos, Translate, Chrome, Circle to Search, and Messages. And existing Android apps, or at least those updated to do so, should make the jump, too.



Intel Arc B580 review: A $249 RTX 4060 killer, one-and-a-half years later


Intel has solved the biggest problems with its Arc GPUs, but not the timing.

Intel’s Arc B580 design doesn’t include LEDs or other frills, but it’s a clean-looking design. Credit: Andrew Cunningham


Intel doesn’t have a ton to show for its dedicated GPU efforts yet.

After much anticipation, many delays, and an anticipatory apology tour for its software quality, Intel launched its first Arc GPUs at the end of 2022. There were things to like about the A770 and A750, but buggy drivers, poor performance in older games, and relatively high power use made them difficult to recommend. They were more notable as curiosities than as consumer graphics cards.

The result, after more than two years on the market, is that Arc GPUs remain a statistical nonentity in the GPU market, according to analysts and the Steam Hardware Survey. But it was always going to take time—and probably a couple of hardware generations—for Intel to make meaningful headway against entrenched competitors.

Intel’s reference design is pretty by the book, with two fans, a single 8-pin power connector, and a long heatsink and fan shroud that extends several inches beyond the end of the PCB. Credit: Andrew Cunningham

The new Arc B580 card, the first dedicated GPU based on the new “Battlemage” architecture, launches into the exact same “sub-$300 value-for-money” graphics card segment that the A770 and A750 are already stuck in. But it’s a major improvement over those cards in just about every way, and Intel has gone a long way toward fixing drivers and other issues that plagued the first Arc cards at launch. If nothing else, the B580 suggests that Intel has some staying power and that the B700-series GPUs could be genuinely exciting if Intel can get one out relatively soon.

Specs and testbed notes

Specs for the Arc B580 and B570. Credit: Intel

The Arc B580 and Arc B570 lead the charge for the Battlemage generation. Both are based on the same GPU silicon, but the B580 has a few more execution resources, slightly higher clock speeds, a 192-bit memory bus instead of 160-bit, and 12GB of memory instead of 10GB.

Intel positions both cards as entry-level 1440p options because they have a bit more RAM than the 8GB baseline of the GeForce RTX 4060 and Radeon RX 7600. These 8GB cards are still generally fine at 1080p, but more memory does make the Arc cards feel a little more future-proof, especially since they’re fast enough to actually hit 60 fps in a lot of games at 1440p.

Our testbed remains largely the same as it has been for a while, though we’ve swapped the ASRock X670E board for an Asus model. The Ryzen 7 7800X3D remains the heart of the system, with more than enough performance to avoid bottlenecking midrange and high-end GPUs.

We haven’t done extensive re-testing of most older GPUs—the GeForce and Radeon numbers here are the same ones we used in the RX 7600 XT review earlier this year. We wouldn’t expect new drivers to change the scores in our games much since they’re mostly a bit older—we still use a mix of DirectX 11 and DirectX 12 games, including a few with and without ray-tracing effects enabled. We have re-tested the older Arc cards with recent drivers since Intel does still occasionally make changes that can have a noticeable impact on older games.

As with the Arc A-series cards, Intel emphatically recommends that resizable BAR be enabled for your motherboard to get optimal performance. This is sometimes called Smart Access Memory or SAM, depending on your board; most AMD AM4 and 8th-gen Intel Core systems should support it after a BIOS update, and newer PCs should mostly have it on by default. Our test system had it enabled for the B580 and for all the other GPUs we tested.

Performance and power

As a competitor to the RTX 4060, the Arc B580 is actually pretty appealing, whether you’re talking about 1080p or 1440p, in games with ray-tracing on or off. Even older DirectX 11 titles in our suite, like Grand Theft Auto V and Assassin’s Creed Odyssey, don’t seem to take the same performance hit as they did on older Arc cards.

Intel is essentially making a slightly stronger version of the argument that AMD has been trying to make with the RX 7600. AMD’s cards always come with the caveat of significantly worse performance in games with heavy ray-tracing effects, but the performance hit for Intel cards in ray-traced games looks a lot more like Nvidia’s than AMD’s. Playable ray-traced 1080p is well within reach for the Intel card, and in both Cyberpunk 2077 and Returnal, its performance came closer to the 8GB 4060 Ti’s.

The 12GB of RAM is also enough to put more space between the B580 and the 8GB versions of the 4060 and 7600. Forza Horizon 5 performs significantly better at 1440p on cards with more memory, like the B580 and the 16GB RX 7600 XT, and it’s a safe bet that the 8GB limit will become more of a factor for high-end games at higher resolutions as the years go on.

We experienced just one performance anomaly in our testing. Forza Horizon 5 actually runs a bit worse with XeSS enabled, with a smooth average frame rate but frequent stutters that make it less playable overall (though it’s worth noting that Forza Horizon 5 never benefits much from upscaling algorithms on any GPUs we’ve tested, for whatever reason). Intel also alerted us to a possible issue with Cyberpunk 2077 when enabling ray-tracing but recommended a workaround that involved pressing F1 to reset the game’s settings; the benchmark ran fine on our testbed.

GPU power consumption numbers under load. Credit: Andrew Cunningham

Power consumption is another place where the Battlemage GPU plays a lot of catch-up with Nvidia. With the caveat that software-measured power usage numbers like ours are less accurate than numbers captured with hardware tools, it looks like the B580, when fully loaded, consumes somewhere between 120 and 130 W in Hitman and Borderlands. This is a tad higher than the 4060, but it’s lower than either Radeon RX 7600.

It’s not the top of the class, but looking at the A750’s power consumption shows how far Intel has come—the B580 beats the A750’s performance every single time while consuming about 60 W less power.

A strong contender, a late arrival

The Intel Arc B580. Credit: Andrew Cunningham

Intel is explicitly targeting Nvidia’s GeForce RTX 4060 with the Arc B580, a role it fills well for a low starting price. But the B580 is perhaps more damaging to AMD, which positions both of its 7600-series cards (and the remaining 6600-series stuff that’s hanging around) in the same cheaper-than-Nvidia-with-caveats niche.

In fact, I’d probably recommend the B580 to a budget GPU buyer over any of the Radeon RX 7600 cards at this point. For the same street price as the RX 7600, Intel is providing better performance in most games and much better performance in ray-traced games. The 16GB 7600 XT has more RAM, but it’s $90 to $100 more expensive, and a 12GB card is still reasonably future-proof and decent at 1440p.

All of that said, Intel is putting out a great competitor to the RTX 4060 and RX 7600 a year and a half after those cards both launched—and within just a few months of a possible RTX 5060. Intel is selling mid-2023’s midrange GPU performance in late 2024. There are actually good arguments for building a budget gaming PC right this minute, before potential Trump-administration tariffs can affect prices or supply chains, but assuming the tech industry can maintain its normal patterns, it would be smartest to wait and see what Nvidia does next.

Nvidia also has some important structural benefits. DLSS upscaling support is nearly ubiquitous in high-end games, Nvidia’s drivers are more battle-tested, and it’s extremely unlikely that Nvidia will decide to pull out of the GPU market and stop driver development any time soon (Intel has published a roadmap encompassing multiple GPU generations, which is reassuring, but the company’s recent financial distress has seen it shed several money-losing hobby projects).

If there’s a saving grace for Intel and the B580, it’s that Nvidia has signaled, both through its statements and its behavior, that it’s mostly uninterested in aggressively lowering GPU prices, either over time (Nvidia GPUs tend not to stray far from MSRP, barring supply issues) or between generations. An RTX 5060 is highly unlikely to be cheaper than a 4060 and could easily be more expensive. Depending on how good a hypothetical RTX 5060 is, Intel still has a lot of room to offer good performance for the price in a $200-to-$250-ish GPU market that doesn’t get a ton of attention.

The other issue for Intel is that for a second straight GPU generation, the company is launching late with a part that is forced by its performance to play in a budget-oriented, low-margin area of the GPU market. I don’t think I’m expecting a 4090 or 5090-killer out of Intel any time soon, but based on the B580, I’m at least a little optimistic that Intel can offer a B700-series card that can credibly compete with the likes of Nvidia’s 4070-series or AMD’s 7800 XT and 7900 GRE. Performance-wise, that’s the current sweet spot of the GPU market, but you’ll spend more than you would on a PS5 to buy most of those cards. If Intel can shake up that part of the business, it could help put Arc on the map.

The good

  • Solid midrange 1080p and 1440p performance at a good starting price
  • More RAM than the competition
  • Much-improved power efficiency compared to Arc A-series GPUs
  • Unlike the A-series, we noticed no outliers where performance was disproportionately bad
  • Simple, clean-looking reference design from Intel

The bad

  • Competing with cards that launched a year and a half ago
  • New Nvidia and AMD competitors are likely within a few months
  • Intel still can’t compete at the high end of the GPU market, or even the medium-high end

The ugly

  • So far, Arc cards have not been successful enough to guarantee their long-term existence


Andrew is a Senior Technology Reporter at Ars Technica, with a focus on consumer tech including computer hardware and in-depth reviews of operating systems like Windows and macOS. Andrew lives in Philadelphia and co-hosts a weekly book podcast called Overdue.



Teen creates memecoin, dumps it, earns $50,000


dontbuy. Seriously, don’t buy it

Unsurprisingly, he and his family were doxed by angry traders.

On the evening of November 19, art adviser Adam Biesk was finishing work at his California home when he overheard a conversation between his wife and son, who had just come downstairs. The son, a kid in his early teens, was saying he had made a ton of money on a cryptocurrency that he himself had created.

Initially, Biesk ignored it. He knew that his son played around with crypto, but to have turned a small fortune before bedtime was too far-fetched. “We didn’t really believe it,” says Biesk. But when the phone started to ring off the hook and his wife was flooded with angry messages on Instagram, Biesk realized that his son was telling the truth—if not quite the full story.

Earlier that evening, at 7:48 pm PT, Biesk’s son had released into the wild 1 billion units of a new crypto coin, which he named Gen Z Quant. Simultaneously, he spent about $350 to purchase 51 million tokens, about 5 percent of the total supply, for himself.

Then he started to livestream himself on Pump.Fun, the website he had used to launch the coin. As people tuned in to see what he was doing, they started to buy into Gen Z Quant, leading the price to pitch sharply upward.

By 7:56 pm PT, a whirlwind eight minutes later, Biesk’s son’s tokens were worth almost $30,000—and he cashed out. “No way. Holy fuck! Holy fuck!” he said, flipping two middle fingers to the webcam, with tongue sticking out of his mouth. “Holy fuck! Thanks for the twenty bandos.” After he dumped the tokens, the price of the coin plummeted, so large was his single trade.

To the normie ear, all this might sound impossible. But in the realm of memecoins, a type of cryptocurrency with no purpose or utility beyond financial speculation, it’s relatively routine. Although many people lose money, a few have been known to make a lot—and fast.

In this case, Biesk’s son had seemingly performed what is known as a soft rug pull, whereby somebody creates a new crypto token, promotes it online, then sells off their entire holdings either swiftly or over time, sinking its price. These maneuvers occupy something of a legal gray area, lawyers say, but are roundly condemned in the cryptosphere as ethically dubious at the least.

After dumping Gen Z Quant, Biesk’s son did the same thing with two more coins—one called im sorry and another called my dog lucy—bringing his takings for the evening to more than $50,000.

The backlash was swift and ferocious. A torrent of abuse began to pour into the chat log on Pump.Fun, from traders who felt they had been swindled. “You little fucking scammer,” wrote one commenter. Soon, the names and pictures of Biesk, his son, and other family members were circulating on X. They had been doxed. “Our phone started blowing up. Just phone call after phone call,” says Biesk. “It was a very frightening situation.”

As part of their revenge campaign, crypto traders continued to buy into Gen Z Quant, driving the coin’s price far higher than the level at which Biesk’s son had cashed out. At its peak, around 3 am PT the following morning, the coin had a theoretical total value of $72 million; the tokens the teenager had initially held were worth more than $3 million. Even now that the trading frenzy has died down, those tokens continue to be valued at twice the amount he received.

“In the end, a lot of people made money on his coin. But for us, caught in the middle, there was a lot of emotion,” says Biesk. “The online backlash became so frighteningly scary that the realization that he made money was kind of tempered down with the fact that people became angry and started bullying.”

Biesk concedes to a limited understanding of crypto. But he sees little distinction between what his son did and, say, playing the stock market or winning at a casino. Though under California law, someone must be at least 18 years old to gamble or invest in stocks, the unregulated memecoin market, which has been compared to a “casino” in risk profile, had given Biesk’s teenage son early access to a similar arena, in which some must lose for others to profit. “The way I understand it is he made money and he cashed out, which to me seems like that’s what anybody would’ve done,” says Biesk. “You get people who are cheering at the craps table, or angry at the craps table.”

Memecoins have been around since 2013, when Dogecoin was released. In the following years, a few developers tried to replicate the success of Dogecoin, making play of popular internet memes or tapping into the zeitgeist in some other way in a bid to encourage people to invest. But the cost and complexity of development generally limited the number of memecoins that came to market.

That equation was flipped in January with the launch of Pump.Fun, which lets people release new memecoins instantly, at no cost. The idea was to give people a safer way to trade memecoins by standardizing the underlying code, which prevents developers from building in malicious mechanisms to steal funds, in what’s known as a hard rug pull.

“Buying into memecoins was a very unsafe thing to do. Programmers could create systems that would obfuscate what you are buying into and, basically, behave as malicious actors. Everything was designed to suck money out of people,” one of the three anonymous cofounders of Pump.Fun, who goes by Sapijiju, told WIRED earlier in the year. “The idea with Pump was to build something where everyone was on the same playing field.”

Since Pump.Fun launched, millions of unique memecoins have entered the market through the platform. By some metrics, Pump.Fun is the fastest-growing crypto application ever, taking in more than $250 million in revenue—as a 1 percent cut of trades on the platform—in less than a year in operation.

However, Pump.Fun has found it impossible to insulate users from soft rug pulls. Though the platform gives users access to information to help assess risk—like the proportion of a coin belonging to the largest few holders—soft rug pulls are difficult to prevent by technical means, claims Sapijiju.

“People say there’s a bunch of different stuff you can do to block [soft rug pulls]—maybe a sell tax or lock up the people who create the coin. Truthfully, all of this is very easy to manipulate,” he says. “Whatever we do to stop people doing this, there’s always a way to circumnavigate if you’re smart enough. The important thing is creating an interface that is as simple as possible and giving the tools for users to see if a coin is legitimate or not.”

The “overwhelming majority” of new crypto tokens entering the market are scams of one form or another, designed expressly to squeeze money from buyers, not to hold a sustained value in the long term, according to crypto security company Blockaid. In the period since memecoin launchpads like Pump.Fun began to gain traction, the volume of soft rug pulls has increased in lockstep, says Ido Ben-Natan, Blockaid founder.

“I generally agree that it is kind of impossible to prevent holistically. It’s a game of cat and mouse,” says Ben-Natan. “It’s definitely impossible to cover a hundred percent of these things. But it definitely is possible to detect repeat offenders, looking at metadata and different kinds of patterns.”

Now that memecoin trading has been popularized, there can be no putting the genie back in the bottle, says Ben-Natan. But traders are perhaps uniquely vulnerable at present, he says, in a period when many are newly infatuated with memecoins, yet before the fledgling platforms have figured out the best way to protect them. “The space is immature,” says Ben-Natan.

Whether it is legal to perform a rug pull is also something of a gray area. It depends on both jurisdiction and whether explicit promises are made to prospective investors, experts say. The absence of bespoke crypto regulations in countries like the US, meanwhile, inadvertently creates cloud cover for acts that are perhaps not overtly illegal.

“These actions exploit the gaps in existing regulatory frameworks, where unethical behavior—like developers hyping a project and later abandoning it—might not explicitly violate laws if no fraudulent misrepresentation, contractual breach, or other violations occur,” says Ronghui Gu, cofounder of crypto security firm CertiK and associate professor of computer science at Columbia University.

The Gen Z Quant broadcast is no longer available to view in full, but in the clips reviewed by WIRED, at no point does Biesk’s son promise to hold his tokens for any specific period. Neither do the Pump.Fun terms of use require people to refrain from selling tokens they create. (Sapijiju, the Pump.Fun cofounder, declined to comment on the Gen Z Quant incident. They say that Pump.Fun will be “introducing age restrictions in future,” but declined to elaborate.)

But even then, under the laws of numerous US states, among them California, “the developer likely still owes heightened legal duties to the investors, so may be liable for breaching obligations that result in loss of value,” says Geoffrey Berg, partner at law firm Berg Plummer & Johnson. “The developer is in a position of trust and must place the interests of his investors over his own.”

To clarify whether these legal duties apply to people who release memecoins through websites like Pump.Fun—who buy into their coins like everyone else, albeit at the moment of launch and therefore at a discount and in potentially market-swinging quantities—new laws may be required.

In July 2026, a new regime will take effect in California, where Biesk’s family lives, requiring residents to obtain a license to take part in “digital financial asset business activity,” including exchanging, transferring, storing or administering certain crypto assets. President-elect Donald Trump has also promised new crypto regulations. But for now, there are no crypto-specific laws in place.

“We are in a legal vacuum where there are no clear laws,” says Andrew Gordon, partner at law firm Gordon Law. “Once we know what is ‘in bounds,’ we will also know what is ‘out of bounds.’ This will hopefully create a climate where rug pulls don’t happen, or when they do they are seen as a criminal violation.”

On November 19, as the evening wore on, angry messages continued to tumble in, says Biesk. Though some celebrated his son’s antics, calling for him to return and create another coin, others were threatening or aggressive. “Your son stole my fucking money,” wrote one person over Instagram.

Biesk and his wife were still trying to understand quite how their son was able to make so much money, so fast. “I was trying to get an understanding of exactly how this meme crypto trading works,” says Biesk.

Some memecoin traders, sensing there could be money in riffing off the turn of events, created new coins on Pump.Fun inspired by Biesk and his wife: QUANT DAD and QUANTS MOM. (Both are now practically worthless.)

Equally disturbed and bewildered, Biesk and his wife formed a provisional plan: to make all public social media accounts private, stop answering the phone, and, generally, hunker down until things blew over. (Biesk’s account is active at the time of writing.) Biesk declined to comment on whether the family made contact with law enforcement or what would happen to the funds, saying only that his son would “put the money away.”

A few hours later, an X account under the name of Biesk’s son posted on X, pleading for people to stop contacting his parents. “Im sorry about Quant, I didnt realize I get so much money. Please dont write to my parents, I wiill pay you back [sic],” read the post. Biesk claims the account is not operated by his son.

Though alarmed by the backlash, Biesk is impressed by the entrepreneurial spirit and technical capability his son displayed. “It’s actually sort of a sophisticated trading platform,” he says. “He obviously learned it on his own.”

That his teenager was capable of making $50,000 in an evening, Biesk theorizes, speaks to the fundamentally different relationship kids of that age have with money and investing, characterized by an urgency and hyperactivity that rubs up against traditional wisdom.

“To me, crypto can be hard to grasp, because there is nothing there behind it—it’s not anything tangible. But I think kids relate to this intangible digital world more than adults do,” says Biesk. “This has an immediacy to him. It’s almost like he understands this better.”

On December 1, after a two-week hiatus, Biesk’s son returned to Pump.Fun to launch five new memecoins, apparently undeterred by the abuse. Disregarding the warnings built into the very names of some of the new coins—one was named test and another dontbuy—people bought in. Biesk’s son made another $5,000.

This story originally appeared on wired.com.


Wired.com is your essential daily guide to what’s next, delivering the most original and complete take you’ll find anywhere on innovation’s impact on technology, science, business and culture.



New drone has legs for landing gear, enabling efficient launches


The RAVEN walks, it flies, it hops over obstacles, and it’s efficient.

The RAVEN in action. Credit: EPFL/Alain Herzog

Most drones on the market are rotary-wing quadcopters, which can conveniently land and take off almost anywhere. The problem is that they are less energy-efficient than fixed-wing aircraft, which can fly greater distances and stay airborne for longer but need a runway, a dedicated launcher, or at least a good old-fashioned throw to get into the skies.

To get past this limit, a team of Swiss researchers at the École Polytechnique Fédérale de Lausanne built a fixed-wing flying robot called RAVEN (Robotic Avian-inspired Vehicle for multiple ENvironments) with a peculiar bio-inspired landing gear: a pair of robotic bird-like legs. “The RAVEN robot can walk, hop over obstacles, and do a jumping takeoff like real birds,” says Won Dong Shin, an engineer leading the project.

Smart investments

The key challenge in attaching legs to drones was that they significantly increased mass and complexity. State-of-the-art robotic legs were designed for robots walking on the ground and were too bulky and heavy to even think about using on a flying machine. So, Shin’s team started their work by taking a closer look at what the leg mass budget looked like in various species of birds.

It turned out that the ratio of leg mass to the total body weight generally increased with size in birds. A carrion crow had legs weighing around 100 grams, which the team took as their point of reference.

The robotic legs built by Shin and his colleagues resembled a real bird’s legs quite closely. Simplifications introduced to save weight included skipping the knee joint and actuated toe joints, resulting in a two-segmented limb with 64 percent of the weight placed around the hip joint. The mechanism was powered by a standard drone propeller, with the ankle joint actuated through a system of pulleys and a timing belt. The robotic leg ended with a foot with three forward-facing toes and a single backward-facing hallux.

There were some more sophisticated bird-inspired design features, too. “I embedded a torsional spring in the ankle joint. When the robot’s leg is crouching, it stores the energy in that spring, and then when the leg stretches out, the spring works together with the motor to generate higher jumping speed,” says Shin. A real bird can store elastic energy in its muscle-tendon system during flexion and release it very rapidly during extension for a jumping takeoff. The spring’s job was to emulate this mechanism, and it worked pretty well—“It actually increased the jumping speed by 25 percent,” Shin says.
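To get a feel for why a stored spring helps, here is a rough back-of-envelope sketch in Python. The stiffness, twist angle, motor energy, and robot mass below are made-up illustrative values, not numbers from the paper; the only physics assumed is that, ignoring losses, takeoff speed scales with the square root of the total energy delivered.

```python
import math

# Rough sketch (illustrative numbers, not from the paper): elastic energy
# stored in a torsional ankle spring, E = 0.5 * k * theta**2, adds to the
# motor's work and raises the jumping takeoff speed.

def takeoff_speed(total_energy_j, mass_kg):
    """Speed if all delivered energy became kinetic energy (losses ignored)."""
    return math.sqrt(2 * total_energy_j / mass_kg)

mass = 0.62            # kg, assumed overall robot mass
motor_energy = 1.2     # J delivered by the leg motors during extension (assumed)
k = 0.8                # N*m/rad, assumed torsional stiffness
theta = 1.2            # rad of twist stored while crouching (assumed)

spring_energy = 0.5 * k * theta**2
v_motor_only = takeoff_speed(motor_energy, mass)
v_with_spring = takeoff_speed(motor_energy + spring_energy, mass)

print(f"spring energy: {spring_energy:.2f} J")
print(f"speed, motor only: {v_motor_only:.2f} m/s")
print(f"speed, motor + spring: {v_with_spring:.2f} m/s "
      f"({100 * (v_with_spring / v_motor_only - 1):.0f}% faster)")
```

With these assumed values the spring adds about half a joule and lifts the takeoff speed by roughly a fifth, in the same ballpark as the 25 percent increase Shin reports.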

In the end, the robotic legs weighed around 230 grams, way more than the real ones in a carrion crow, but it turned out that was good enough for the RAVEN robot to walk, jump, take off, and fly.

Crow’s efficiency

The team calculated the necessary takeoff speed for two birds with body masses of 490 grams and a hair over 780 grams; these were 1.85 and 3.21 meters per second, respectively. Based on that, Shin figured the RAVEN robot would need to reach 2.5 meters per second to get airborne. Using the bird-like jumping takeoff strategy, it could reach that speed in just 0.17 seconds.
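As a quick sanity check on those figures (simple kinematics, not anything from the paper): reaching 2.5 meters per second in 0.17 seconds implies an average acceleration of roughly 1.5 g.

```python
# Back-of-envelope check on the reported jump, assuming constant
# acceleration from rest (an idealization).
v_takeoff = 2.5   # m/s, takeoff speed reported for RAVEN
t_jump = 0.17     # s, reported time to reach that speed

a_avg = v_takeoff / t_jump          # ~14.7 m/s^2
distance = 0.5 * a_avg * t_jump**2  # ~0.21 m traveled while accelerating

print(f"average acceleration: {a_avg:.1f} m/s^2 (~{a_avg / 9.81:.1f} g)")
print(f"distance covered during the jump: {distance:.2f} m")
```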

How did nature’s go-to takeoff procedure stack up against other ways to get to the skies? Other options included a falling takeoff, where you just push your aircraft off a cliff and let gravity do its thing, or standing takeoff, where you position the craft vertically and rely on the propeller to lift it upward. “When I was designing the experiments, I thought the jumping takeoff would be the least energy-efficient because it used extra juice from the battery to activate the legs,” Shin says. But he was in for a surprise.

“What we meant by energy efficiency was calculating the energy input and energy output. The energy output was the kinetic energy and the potential energy at the moment of takeoff, defined as the moment when the feet of the robot stop touching the ground,” Shin explains. The energy input was calculated by measuring the power used during takeoff.
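In code, that metric might look something like the rough sketch below. The function name, the robot mass, the liftoff height, and the flat 60-watt power draw are all illustrative assumptions, not values from the study.

```python
# Minimal sketch of the efficiency metric described above: output is
# kinetic plus potential energy at the instant the feet leave the ground;
# input is electrical power integrated over the takeoff.

G = 9.81  # m/s^2

def takeoff_efficiency(mass_kg, v_takeoff, h_takeoff, power_trace_w, dt_s):
    """Ratio of mechanical energy at liftoff to electrical energy consumed."""
    e_out = 0.5 * mass_kg * v_takeoff**2 + mass_kg * G * h_takeoff
    e_in = sum(power_trace_w) * dt_s  # simple rectangle-rule integration
    return e_out / e_in

# Hypothetical power trace: ~60 W drawn for a 0.17 s jump, sampled at 1 kHz
power_trace = [60.0] * 170
print(takeoff_efficiency(mass_kg=0.62, v_takeoff=2.5, h_takeoff=0.15,
                         power_trace_w=power_trace, dt_s=0.001))
```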

The RAVEN takes flight.

“It turned out that the jumping takeoff was actually the most energy-efficient strategy. I didn’t expect that result. It was quite surprising,” Shin says.

The energy cost of the jumping takeoff was slightly higher than that of the other two strategies, but not by much. It required 7.9 percent more juice than the standing takeoff and 6.9 percent more than the falling takeoff. At the same time, it generated much higher acceleration, so you got way better bang for the buck (at least as far as energy was concerned). Overall, jumping with bird-like legs was 9.7 times more efficient than standing takeoff and 4.9 times more efficient than falling takeoff.
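The arithmetic behind that apparent paradox is simple: efficiency compares output to input, so a takeoff that draws 7.9 percent more from the battery but delivers roughly ten times the mechanical energy at liftoff still comes out far ahead. Here is a toy example with made-up energy values chosen only so the ratio lands near the reported 9.7x figure.

```python
# Illustrative arithmetic (assumed numbers): slightly more input energy,
# much more output energy, so a much higher efficiency.

e_in_standing = 10.0                  # J, hypothetical battery draw
e_in_jumping = e_in_standing * 1.079  # 7.9% more input, as reported

e_out_standing = 0.27   # J, hypothetical mechanical energy at liftoff
e_out_jumping = 2.83    # J, hypothetical (much faster liftoff)

eff_standing = e_out_standing / e_in_standing
eff_jumping = e_out_jumping / e_in_jumping
print(f"jumping is {eff_jumping / eff_standing:.1f}x as efficient")
```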

One caveat with the team’s calculations was that a fixed-wing drone with a more conventional design, one using wheels or a launcher, would be much more efficient in traditional takeoff strategies than a legged RAVEN robot. “But when you think about it, birds, too, would fly much better without legs. And yet they need them to move on the ground or hunt their prey. You trade some of the in-flight efficiency for more functions,” Shin claims. And the legs offered plenty of functions.

Obstacles ahead

To demonstrate the versatility of their legged flying robot, Shin’s team put it through a series of tasks that would be impossible for a standard drone. Their benchmark mission scenario involved traversing a path with a low ceiling, jumping over a gap, and hopping onto an obstacle. “Assuming an erect position with the tail touching the ground, the robot could walk and remain stable even without advanced controllers,” Shin claims. Walking solved the problem of moving under low ceilings. Jumping over gaps and onto obstacles relied on the same torsion springs and actuators used for takeoff. RAVEN could jump over an 11-centimeter-wide gap and onto a 26-centimeter-high obstacle.

But Shin says RAVEN will need way more work before it truly shines. “At this stage, the robot cannot clear all those obstacles in one go. We had to reprogram it for each of the obstacles separately,” Shin says. The problem is that RAVEN’s control system is not adaptive; the actuators in the legs perform predefined sets of motions to send the robot along a trajectory the team worked out through computer simulations. If something were blocking the way, RAVEN would crash into it.

Another, perhaps more striking limitation is that RAVEN can’t use its legs to land. But this is something Shin and his colleagues want to work on in the future.

“We want to implement some sensors, perhaps vision or haptic sensors. This way, we’re going to know where the landing site is, how many meters away from it we are, and so on,” Shin says. Another modification that’s on the way for RAVEN is foldable wings that the robot will use to squeeze through tight spaces. “Flapping wings would also be a very interesting topic. They are very important for landing, too, because birds decelerate first with their wings, not with their legs. With flapping wings, this is going to be a really bird-like robot,” Shin claims.

All this is intended to prepare RAVEN for search and rescue missions. The idea is that legged flying robots would reach disaster-struck areas quickly, land, traverse difficult terrain on foot if necessary, and then take off like birds. “Another application is delivering parcels. Here in Switzerland, I often see helicopters delivering them to people living high up in the mountains, which I think is quite costly. A bird-like drone could do that more efficiently,” Shin suggests.

Nature, 2024.  DOI: 10.1038/s41586-024-08228-9


New drone has legs for landing gear, enabling efficient launches Read More »

openai-announces-full-“o1”-reasoning-model,-$200-chatgpt-pro-tier

OpenAI announces full “o1” reasoning model, $200 ChatGPT Pro tier

On X, frequent AI experimenter Ethan Mollick wrote, “Been playing with o1 and o1-pro for bit. They are very good & a little weird. They are also not for most people most of the time. You really need to have particular hard problems to solve in order to get value out of it. But if you have those problems, this is a very big deal.”

OpenAI claims improved reliability

OpenAI is touting pro mode’s improved reliability, which is evaluated internally based on whether it can solve a question correctly in four out of four attempts rather than just a single attempt.
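A rough illustration of how much stricter that bar is than ordinary single-attempt accuracy (this is not OpenAI's code; it treats attempts as independent draws with a fixed per-attempt success rate, which real questions are not):

```python
import random

# Sketch of the stricter reliability metric described above: a question
# only counts as solved if the model answers correctly on all four of
# four attempts, rather than on a single attempt.

def solved_4_of_4(attempts):
    """attempts: list of booleans for four independent tries at one question."""
    return all(attempts)

def compare_metrics(p_correct, n_questions=1000, seed=0):
    rng = random.Random(seed)
    single, strict = 0, 0
    for _ in range(n_questions):
        tries = [rng.random() < p_correct for _ in range(4)]
        single += tries[0]              # ordinary single-attempt accuracy
        strict += solved_4_of_4(tries)  # 4-of-4 reliability
    return single / n_questions, strict / n_questions

# A model that is right 80% of the time per attempt only clears the
# 4-of-4 bar on roughly 0.8**4, or about 41%, of questions.
print(compare_metrics(0.8))
```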

“In evaluations from external expert testers, o1 pro mode produces more reliably accurate and comprehensive responses, especially in areas like data science, programming, and case law analysis,” OpenAI writes.

Even without pro mode, OpenAI cited significant performance gains over the o1 preview model on popular math and coding benchmarks (AIME 2024 and Codeforces), and more modest improvements on a “PhD-level science” benchmark (GPQA Diamond). The increases in scores between o1 and o1 pro mode were much smaller on these benchmarks.

We’ll likely have more coverage of the full version of o1 once it rolls out widely—and it’s supposed to launch today, accessible to ChatGPT Plus and Team users globally. Enterprise and Edu users will have access next week. At the moment, the ChatGPT Pro subscription is not yet available on our test account.

OpenAI announces full “o1” reasoning model, $200 ChatGPT Pro tier Read More »

dog-domestication-happened-many-times,-but-most-didn’t-pan-out

Dog domestication happened many times, but most didn’t pan out

The story that data reveals is complicated—but somehow very human.

Until about 13,600 years ago, any wolf living in what is now Alaska would have lived on the usual wolf diet: rabbits, moose, and a whole range of other land animals. But starting around 13,600 years ago, the nitrogen isotopes locked in ancient wolves’ bones suggest that something changed. Some wolves still made their living solely by hunting wild game, but others started living almost entirely on fish. Since it’s unlikely that Alaskan wolves had suddenly taken up fly fishing, the shift probably means that some wolves had started getting food from people.

They’re good dogs, Brent

The fact that we kept trying to befriend wolves is starkly clear at a site called Hollembaek Hill, where archaeologists unearthed the 8,100-year-old remains of four canines. Their diets (according to the nitrogen isotopes locked in their bones) consisted mostly of salmon, so it’s tempting to assume these were domesticated dogs. But their DNA reveals that all four—including a newborn puppy—are most closely related to modern wolves.

On the other hand, the Hollembaek Hill canines didn’t all look like wild wolves. At least one of them had the large stature of a modern wolf, but others were smaller, like early dogs. And some of their DNA suggests that they may be at least part dog but not actually related to modern dogs. Lanoë and his colleagues suggest that people at Hollembaek Hill 8,000 years ago were living alongside a mix of pet wolves (do not try this at home) and wolf-dog hybrids.

All modern dogs trace their roots to a single group of wolves (now extinct) that lived in Siberia around 23,000 years ago. But sometime between 11,300 and 12,800 years ago, canines at Hollembaek Hill and another Alaskan site called Swan Point carried dog DNA that doesn’t seem related to modern dogs at all. That may suggest that dog domestication happened several times in different places, creating several branches of a dog family tree, only one of which stuck around in the long run.

In other words, long after humans “invented” dogs, it seems that people just kept repeating the process, doing the things that created dogs in the first place: allowing the friendliest, least aggressive wild canids to live near their villages and maybe adopting and feeding them.

Dog domestication happened many times, but most didn’t pan out Read More »