

On Dwarkesh Patel’s 4th Podcast With Tyler Cowen

Dwarkesh Patel again interviewed Tyler Cowen, largely about AI, so here we go.

Note that I take it as a given that the entire discussion is taking place in some form of an ‘AI Fizzle’ and ‘economic normal’ world, where AI does not advance too much in capability from its current form, in meaningful senses, and we do not get superintelligence [because of reasons]. It’s still massive additional progress by the standards of any other technology, but painfully slow by the standards of the ‘AGI is coming soon’ crowd.

That’s the only way I can make the discussion make at least some sense, with Tyler Cowen predicting 0.5%/year additional RGDP growth from AI. That level of capabilities progress is a possible world, although the various elements stated here seem like they are sometimes from different possible worlds.

I note that this conversation was recorded prior to o3 and all the year end releases. So his baseline estimate of RGDP growth and AI impacts has likely increased modestly.

I go very extensively into the first section on economic growth and AI. After that, the podcast becomes classic Tyler Cowen and is interesting throughout, but I will be relatively sparing in my notes in other areas, and am skipping over many points.

This is a speed premium and ‘low effort’ post, in the sense that this is mostly me writing down my reactions and counterarguments in real time, similar to how one would do a podcast. It is high effort in that I spent several hours listening to, thinking about and responding to the first fifteen minutes of a podcast.

As a convention: When I’m in the numbered sections, I’m reporting what was said. When I’m in the secondary sections, I’m offering (extensive) commentary. Timestamps are from the Twitter version.

[EDIT: In Tyler’s link, he correctly points out a confusion in government spending vs. consumption, which I believe is fixed now. As for his comment about market evidence for the doomer position, I’ve given my answer before, and I would assert the market provides substantial evidence neither in favor of nor against anything but the most extreme of doomer positions, as in extreme in a way I have literally never heard one person assert, once you control for its estimate of AI capabilities (where it does indeed offer us evidence, and I’m saying that it’s too pessimistic). We agree there is no substantial and meaningful ‘peer-reviewed’ literature on the subject, in the way that Tyler is pointing.]

They recorded this at the Progress Studies conference, and Tyler Cowen has a very strongly held view that AI won’t accelerate RGDP growth much that Dwarkesh clearly does not agree with, so Dwarkesh Patel’s main thrust is to try comparisons and arguments and intuition pumps to challenge Tyler. Tyler, as he always does, has a ready response to everything, whether or not it addresses the point of the question.

  1. (1: 00) Dwarkesh doesn’t waste any time and starts off asking why we won’t get explosive economic growth. Tyler’s first answer is cost disease, that as AI works in some parts of the economy costs in other areas go up.

    1. That’s true in relative terms for obvious reasons, but in absolute terms or real resource terms the opposite should be true, even if we accept the implied premise that AI won’t simply do everything anyway. This should drive down labor costs and free up valuable human capital. It should aid in availability of many other inputs. It makes almost any knowledge acquisition, strategic decision or analysis, data analysis or gathering, and many other universal tasks vastly better.

    2. Tyler then answers this directly when asked at (2: 10) by saying cost disease is not about employees per se, it’s more general, so he’s presumably conceding the point about labor costs, saying that non-intelligence inputs that can’t be automated will bind more and thus go up in price. I mean, yes, in the sense that we have higher value uses for them, but so what?

    3. So yes, you can narrowly define particular subareas of some areas as bottlenecks and say that they cannot grow, and perhaps they can even be large areas if we impose costlier bottlenecks via regulation. But that still leaves lots of room for very large economic growth for a while – the issue can’t bind you otherwise, the math doesn’t work.

  2. Tyler says government consumption [EDIT: I originally misheard this as spending, he corrected me, I thank him] at 18% of GDP (government spending is 38%, but a lot of that is duplicative and a lot isn’t consumption), health care at 20%, education at 6% (he says 6-7%, Claude says 6%), and the nonprofit sector (Claude says 5.6%), and says that together that is half of the economy. Okay, sure, let’s tackle that.

    1. Healthcare is already seeing substantial gains from AI even at current levels. There are claims that up to 49% (about half) of doctor time is various forms of EMR and desk work that AIs could reduce greatly, certainly at least ~25%. AI can directly substitute for much of what doctors do in terms of advising patients, and this is already happening where the future is distributed. AI substantially improves medical diagnosis and decision making. AI substantially accelerates drug discovery and R&D, will aid in patient adherence and monitoring, and so on. And again, that’s without further capability gains. Insurance companies doubtless will embrace AI at every level. Need I go on here?

    2. Government spending at all levels is actually about 38% of GDP, but that’s cheating, only ~11% is non-duplicative and not transfers, interest (which aren’t relevant) or R&D (I’m assuming R&D would get a lot more productive).

    3. The biggest area is transfers. AI can’t improve the efficiency of transfers too much, but it also can’t be a bottleneck outside of transaction and administrative costs, which obviously AI can greatly reduce and are not that large to begin with.

    4. The second biggest area is provision of healthcare, which we’re already counting, so that’s duplicative. Third is education, which we count in the next section.

    5. Fourth is national defense, where efficiency per dollar or employee should get vastly better, to the point where failure to be at the AI frontier is a clear national security risk.

    6. Fifth is interest on the debt, which again doesn’t count, and also we wouldn’t care about if GDP was growing rapidly.

    7. And so on. What’s left to form the last 11% or so? Public safety, transportation and infrastructure, government administration, environment and natural resources and various smaller other programs. What happens here is a policy choice. We are already seeing signs of improvement in government administration (~2% of the 11%), the other 9% might plausibly stall to the extent we decide to do an epic fail.

    8. Education and academia is already being transformed by AI, in the sense of actually learning things, among anyone who is willing to use it. And it’s rolling through academia as we speak, in terms of things like homework assignments, in ways that will force change. So whether you think growth is possible depends on your model of education. If it’s mostly a signaling model then you should see a decline in education investment since the signals will decline in value and AI creates the opportunity for better more efficient signals, but you can argue that this could continue to be a large time and dollar tax on many of us.

    9. Nonprofits are about 20%-25% education, and ~50% is health care related, which would double count, so the remainder is only ~1.3% of GDP. This also seems like a dig at nonprofits and their inability to adapt to change, but why would we assume nonprofits can’t benefit from AI?

    10. What’s weird is that I would point to different areas that have the most important anticipated bottlenecks to growth, such as housing or power, where we might face very strong regulatory constraints and perhaps AI can’t get us out of those.

  3. (1: 30) He says it will take ~30 years for sectors of the economy that do not use AI well to be replaced by those that do use AI well.

    1. That’s a very long time, even in an AI fizzle scenario. I roll to disbelieve that estimate in most cases. But let’s even give it to him, and say it is true, and it takes 30 years to replace them, while the productivity of the replacement goes up 5%/year above incumbents, which are stagnant. Then you delay the growth, but you don’t prevent it, and if you assume this is a gradual transition you start seeing 1%+ yearly GDP growth boosts even in these sectors within a decade (see the quick arithmetic sketch after this list).

  4. He concludes by saying some less regulated areas grow a lot, but that doesn’t get you that much, so you can’t have the whole economy ‘growing by 40%’ in a nutshell.

    1. I mean, okay, but that’s double Dwarkesh’s initial question of why we aren’t growing at 20%. So what exactly can we get here? I can buy this as an argument for AI fizzle world growing slower than it would have otherwise, but the teaser has a prediction of 0.5%, which is a whole different universe.
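As a quick sanity check on that replacement arithmetic, here is a minimal sketch under my own toy assumptions (not Tyler’s model): incumbents stay stagnant, AI-using entrants improve 5%/year, and the entrants’ share of the sector ramps linearly to 100% over 30 years.

```python
# Toy sketch of the 30-year replacement scenario above. Assumptions (mine, for
# illustration): incumbents are stagnant, AI-using entrants gain 5%/year in
# productivity, and the entrants' share of the sector ramps linearly over 30 years.

YEARS = 30
ADVANTAGE = 0.05  # entrants' annual productivity growth over stagnant incumbents

def sector_output(year: int) -> float:
    share = min(year / YEARS, 1.0)                  # fraction of the sector replaced so far
    entrant_productivity = (1 + ADVANTAGE) ** year
    return (1 - share) * 1.0 + share * entrant_productivity

for year in range(1, 16):
    growth = sector_output(year) / sector_output(year - 1) - 1
    print(f"year {year:2d}: sector growth {growth:.1%}")
```

Under these assumptions, annual growth in the affected sectors passes 1% around year four and keeps climbing, so a 30-year replacement delays the boom rather than preventing it.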

  1. (2: 20) Tyler asserts that the value of intelligence will go down because more intelligence will be available.

    1. Dare I call this the Lump of Intelligence fallacy, after the Lump of Labor fallacy? Yes, to the extent that you are doing the thing an AI can do, the value of that intelligence goes down, and the value of AI intelligence itself goes down in economic terms because its cost of production declines. But to the extent that your intelligence complements and unlocks the AI’s, or is empowered by the AI’s and is distinct from it (again, we must be in fizzle-world), the value of that intelligence goes up.

    2. Similarly, when he talks about intelligence as ‘one input’ in the system among many, that seems like a fundamental failure to understand how intelligence works, a combination of intelligence denialism (failure to buy that much greater intelligence could meaningfully exist) and a denial of substitution or ability to innovate as a result – you couldn’t use that intelligence to find alternative or better ways to do things, and you can’t use more intelligence as a substitute for other inputs. And you can’t substitute the things enabled more by intelligence much for the things that aren’t, and so on.

    3. It also assumes that intelligence can’t be used to convince us to overcome all these regulatory barriers and bottlenecks. Whereas I would expect that raising the intelligence baseline greatly would make it clear to everyone involved how painful our poor decisions were, and also enable improved forms of discourse and negotiation and cooperation and coordination, and also greatly favor those that embrace it over those that don’t, and generally allow us to take down barriers. Tyler would presumably agree that if we were to tear down the regulatory state in the places it was holding us back, that alone would be worth far more than his 0.5% of yearly GDP growth, even with no other innovation or AI.

  1. (2: 50) Dwarkesh challenges Tyler by pointing out that the Industrial Revolution resulted in a greatly accelerated rate of economic growth versus previous periods, and asks what Tyler would say to someone from the past doubting it was possible. Tyler attempts to dodge (and is amusing in doing so) by saying they’d say ‘looks like it would take a long time’ and he would agree.

    1. Well, it depends what a long time is, doesn’t it? 2% sustained annual growth (or 8%!) is glacial in some sense and mind boggling by ancient standards. ‘Take a long time’ in AI terms, such as what is actually happening now, could still look mighty quick if you compared it to most other things. OpenAI has 300 million MAUs.

  2. (3: 20) Tyler trots out the ‘all the financial prices look normal’ line, that they are not predicting super rapid growth and neither are economists or growth experts.

    1. Yes, the markets are being dumb, the efficient market hypothesis is false, and also aren’t you the one telling me I should have been short the market? Well, instead I’m long, and outperforming. And yes, economists and ‘experts on economic growth’ aren’t predicting large amounts of growth, but their answers are Obvious Nonsense to me and saying that ‘experts don’t expect it’ without arguments why isn’t much of an argument.

  3. (3: 40) Aside, since you kind of asked: So who am I to say different from the markets and the experts? I am Zvi Mowshowitz. Writer. Son of Solomon and Deborah Mowshowitz. I am the missing right hand of the one handed economists you cite. And the one warning you about what is about to kick Earth’s sorry ass into gear. I speak the truth as I see it, even if my voice trembles. And a warning that we might be the last living things this universe ever sees. God sent me.

  4. Sorry about that. But seriously, think for yourself, schmuck! Anyway.

What would happen if we had more people? More of our best people? Got more out of our best people? Why doesn’t AI effectively do all of these things?

  1. (3: 55) Tyler is asked wouldn’t a large rise in population drive economic growth? He says no, that’s too much a 1-factor model, in fact we’ve seen a lot of population growth without innovation or productivity growth.

    1. Except that Tyler is talking here about growth on a per capita basis. If you add AI workers, you increase the productive base, but they don’t count towards the capita.

  2. Tyler says ‘it’s about the quality of your best people and institutions.’

    1. But quite obviously AI should enable a vast improvement in the effective quality of your best people, it already does, Tyler himself would be one example of this, and also the best institutions, including because they are made up of the best people.

  3. Tyler says ‘there’s no simple lever, intelligence or not, that you can push on.’ Again, intelligence as some simple lever, some input component.

    1. The whole point of intelligence is that it allows you to do a myriad of more complex things, and to better choose those things.

  4. Dwarkesh points out the contradiction between ‘you are bottlenecked by your best people’ and asserting cost disease and constraint by your scarce input factors. Tyler says Dwarkesh is bottlenecked, Dwarkesh points out that with AGI he will be able to produce a lot more podcasts. Tyler says great, he’ll listen, but he will be bottlenecked by time.

    1. Dwarkesh’s point generalizes. AGI greatly expands the effective amount of productive time of the best people, and also extends their capabilities while doing so.

    2. AGI can also itself become ‘the best people’ at some point. If that was the bottleneck, then the goose asks, what happens now, Tyler?

  5. (5: 15) Tyler cites that much of sub-Saharan Africa still does not have clean reliable water, and intelligence is not the bottleneck there. And that taking advantage of AGI will be like that.

    1. So now we’re expecting AGI in this scenario? I’m going to kind of pretend we didn’t hear that, or that this is a very weak AGI definition, because otherwise the scenario doesn’t make sense at all.

    2. Intelligence is not directly the bottleneck there, true, but yes quite obviously Intelligence Solves This if we had enough of it and put those minds to that particular problem and wanted to invest the resources towards it. Presumably Tyler and I mostly agree on why the resources aren’t being devoted to it.

    3. What would it mean for similar issues to be involved in taking advantage of AGI? Well, first, it would mean that you can’t use AGI to get to ASI (no I can’t explain why), but again that’s got to be a baseline assumption here. After that, well, sorry, I failed to come up with a way to finish this that makes it make sense to me, beyond a general ‘humans won’t do the things and will throw up various political and legal barriers.’ Shrug?

  6. (5: 35) Dwarkesh speaks about a claim that there is a key shortage of geniuses, and that America’s problems come largely from putting its geniuses in places like finance, whereas Taiwan puts them in tech, so the semiconductors end up in Taiwan. Wouldn’t having lots more of those types of people eat a lot of bottlenecks? What would happen if everyone had 1000 times more of the best people available?

  7. Tyler Cowen, author of a very good book about Talent and finding talent and the importance of talent, says he didn’t agree with that post, and returns to his claim that returns to IQ in the labor market are amazingly low, and that successful people are smart but mostly they have 8-9 areas where they’re an 8-9 on a 1-10 scale, with one 11+ somewhere, and a lot of determination.

    1. All right, I don’t agree that intelligence doesn’t offer returns now, and I don’t agree that intelligence wouldn’t offer returns even at the extremes, but let’s again take Tyler’s own position as a given…

    2. But that exactly describes what an AI gives you! An AI is the ultimate generalist. An AGI will be a reliable 8-9 on everything, actual everything.

    3. And it would also turn everyone else into an 8-9 on everything. So instead of needing to find someone 11+ in one area, plus determination, plus having 8-9 in ~8 areas, you can remove that last requirement. That will hugely expand the pool of people in question.

    4. So there are two obvious, very clear plans here: You can either use AI workers who have that ultimate determination and are 8-9 in everything and 11+ in the areas where AIs shine (e.g. math, coding, etc).

    5. Or you can also give your other experts an AI companion executive assistant to help them, and suddenly they’re an 8+ in everything and also don’t have to deal with a wide range of things.

  8. (6: 50) Tyler says, talk to a committee at a Midwestern university about their plans for incorporating AI, then get back to him and talk to him about bottlenecks. Then write a report and the report will sound like GPT-4 and we’ll have a report.

    1. Yes, the committee will not be smart or fast about its official policy for how to incorporate AI into its existing official activities. If you talk to them now they will act like they have a plagiarism problem and that’s it.

    2. So what? Why do we need that committee to form a plan or approve anything or do anything at all right now, or even for a few years? All the students are already using AI. The professors are rapidly being forced to adapt to AI. Everyone doing the research will soon be using AI. Half that committee, three years from now, will have prepared for that meeting using AI. Their phones will all work based on AI. They’ll be talking to their AI phone assistant companions that plan their schedules. You think this will all involve 0.5% GDP growth?

  9. (7: 20) Dwarkesh asks, won’t the AIs be smart, super conscientious and work super hard? Tyler explicitly affirms the 0.5% GDP growth estimate, that this will transform the world over 30 years but ‘over any given year we won’t so much notice it.’ Things like drug developments that would have taken 20 years now take 10 years, but you won’t feel it as revolutionary for a long time.

    1. I mean, it’s already getting very hard to miss. If you don’t notice it in 2025 or at least 2026, and you’re in the USA, check your pulse, you might be dead, etc.

    2. Is that saying we will double productivity in pharmaceutical R&D, and that it would have far more than doubled if progress didn’t require long expensive clinical trials, so other forms of R&D should be accelerated much more?

    3. For reference, according to Claude, R&D in general contributes about 0.3% to RGDP growth per year right now. If we were to double that effect in roughly the half of current R&D spend that is bottlenecked in similar fashion, the other half would instead go up by more.

    4. Claude also estimates that R&D spending would, if returns to R&D doubled, go up by 30%-70% on net.

    5. So we seem to be looking at more than 0.5% RGDP growth per year from R&D effects alone, between additional spending on it and greater returns. And obviously AI is going to have additional other returns. (A rough version of this arithmetic is sketched after this list.)
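Here is a rough back-of-the-envelope version of that arithmetic, using the Claude-sourced figures above; the 2.5x multiplier for the less-bottlenecked half and the linear scaling with spending are my assumptions, not anything from the podcast.

```python
# Rough arithmetic behind the R&D claim above. Assumptions: R&D currently adds
# ~0.3 percentage points (pp) to RGDP growth; the trial-bottlenecked half doubles
# its contribution; the other half goes up by more (assumed 2.5x); contributions
# scale linearly with an assumed 30-70% rise in total R&D spending.

baseline_pp = 0.30                             # current R&D contribution, in pp
bottlenecked_pp = 0.5 * baseline_pp * 2.0      # bottlenecked half doubles
unbottlenecked_pp = 0.5 * baseline_pp * 2.5    # other half more than doubles
per_dollar_pp = bottlenecked_pp + unbottlenecked_pp

print(f"before extra spending: ~{per_dollar_pp:.2f} pp")
for spend_increase in (0.3, 0.7):
    total_pp = per_dollar_pp * (1 + spend_increase)
    print(f"spending +{spend_increase:.0%}: ~{total_pp:.2f} pp of RGDP growth from R&D")
```

Even before the spending response, the contribution comes out around 0.7 percentage points, already above the 0.5%/year total that Tyler projects for AI as a whole.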

This is a plausible bottleneck, but that implies rather a lot of growth.

  1. (8: 00) Dwarkesh points out that Progress Studies is all about all the ways we could unlock economic growth, yet Tyler says that tons more smart conscientious digital workers wouldn’t do that much. What gives? Tyler again says bottlenecks, and adds on energy as an important consideration and bottleneck.

    1. Feels like bottleneck is almost a magic word or mantra at this point.

    2. Energy is a real consideration, yes the vision here involves spending a lot more energy, and that might take time. But also we see rapidly declining costs, including energy costs, to extract the same amount of intelligence, things like 10x savings each year.

    3. And for inference purposes we can outsource our needs elsewhere, which we would if this was truly bottlenecking explosive growth, and so on. So I do think energy will be an important limiting factor and will be strained, especially in terms of pushing the frontier or if we want to use o3-style very expensive inference a lot.

    4. But I don’t expect it to bind medium-term economic growth so much in a slow growth scenario, and the bottlenecks involved here shouldn’t compound with others. In a high growth takeoff scenario, I do think energy could bind far more impactfully.

    5. Another way of looking at this is that if the price of energy goes substantially up due to AI, or at least the price of energy outside of potentially ‘government-protected uses,’ then that can only happen if it is having a large economic impact. If it doesn’t raise the price of energy a lot, then no bottleneck exists.

Tyler Cowen and I think very differently here.

  1. (9: 25) Fascinating moment. Tyler says he goes along with the experts in general, but agrees that ‘the experts’ on basically everything but AI are asleep at the wheel when it comes to AI – except when it comes to their views on diffusions of new technology in general, where the AI people are totally wrong. His view is, you get the right view by trusting the experts in each area, and combining them.

    1. Tyler seems to be making an argument from reference class expertise? That this is a ‘diffusion of technology’ question, so those who are experts on that should be trusted?

    2. Even if they don’t actually understand AI and what it is and its promise?

    3. That’s not how I roll. At all. As noted above in this post, and basically all the time. I think that you have to take the arguments being made, and see if you agree with them, and whether and how much they apply to the case of AI and especially AGI. Saying ‘the experts in area [X] predict [Y]’ is a reasonable placeholder if you don’t have the ability to look at the arguments and models and facts involved, but hey look, we can do that.

    4. Simply put, while I do think the diffusion experts are pointing to real issues that will importantly slow down adaptation, and indeed we are seeing what for many is depressingly slow adaptation, they won’t slow it down all that much, because this is fundamentally different. AI and especially AI workers ‘adapt themselves’ to a large extent, the intelligence and awareness involved is in the technology itself, and it is digital and we have a ubiquitous digital infrastructure we didn’t have until recently.

    5. It is also way too valuable a technology, even right out of the gate on your first day, and you will start to be forced to interact with it whether you like it or not, in ways that will make it very difficult and painful to ignore. And the places it is most valuable will move very quickly. And remember, LLMs will get a lot better.

    6. Suppose, as one would reasonably expect, by 2026 we have strong AI agents, capable of handling for ordinary people a wide variety of logistical tasks, sorting through information, and otherwise offering practical help. Apple Intelligence is partly here, Claude Alexa is coming, Project Astra is coming, and these are pale shadows of the December 2025 releases I expect. How long would adaptation really take? Once you have that, what stops you from then adapting AI in other ways?

    7. Already, yes, adaptation is painfully slow, but it is also extremely fast. In two years ChatGPT alone has 300 million MAU. A huge chunk of homework and grading is done via LLMs. A huge chunk of coding is done via LLMs. The reason why LLMs are not catching on even faster is that they’re not quite ready for prime time in the fully user-friendly ways normies need. That’s about to change in 2025.

Dwarkesh tries to use this as an intuition pump. Tyler’s not having it.

  1. (10: 15) Dwarkesh asks, what would happen if the world population would double? Tyler says, depends what you’re measuring. Energy use would go up. But he doesn’t agree with population-based models, too many other things matter.

    1. Feels like Tyler is answering a different question. I see Dwarkesh as asking, wouldn’t the extra workers mean we could simply get a lot more done, wouldn’t (total, not per capita) GDP go up a lot? And Tyler’s not biting.

  2. (11: 10) Dwarkesh tries asking about shrinking the population 90%. When shrinking, Tyler says, the delta can kill you, whereas growth might not help you.

    1. Very frustrating. I suppose this does partially respond, by saying that it is hard to transition. But man I feel for Dwarkesh here. You can feel his despair as he transitions to the next question.

  1. (11: 35) Dwarkesh asks what are the specific bottlenecks? Tyler says: Humans! All of you! Especially you who are terrified.

    1. That’s not an answer yet, but then he actually does give one.

  2. He says once AI starts having impact, there will be a lot of opposition to it, not primarily on ‘doomer’ grounds but based on: Yes, this has benefits, but I grew up and raised my kids for a different way of life, I don’t want this. And there will be a massive fight.

    1. Yes. He doesn’t even mention jobs directly but that will be big too. We already see that the public strongly dislikes AI when it interacts with it, for reasons I mostly think are not good reasons.

    2. I’ve actually been very surprised how little resistance there has been so far, in many areas. AIs are basically being allowed to practice medicine, to function as lawyers, and do a variety of other things, with no effective pushback.

    3. The big pushback has been for AI art and other places where AI is clearly replacing creative work directly. But that has features that seem distinct.

    4. Yes people will fight, but what exactly do they intend to do about it? People have been fighting such battles for a while, every year I watch the battle for Paul Bunyan’s Axe. He still died. I think there’s too much money at stake, too much productivity at stake, too many national security interests.

    5. Yes, it will cause a bunch of friction, and slow things down somewhat, in the scenarios like the one Tyler is otherwise imagining. But if that’s the central actual thing, it won’t slow things down all that much in the end. Rarely has.

    6. We do see some exceptions, especially involving powerful unions, where the anti-automation side seems to do remarkably well, see the port strike. But also see which side of that the public is on. I don’t like their long term position, especially if AI can seamlessly walk in and take over the next time they strike. And that, alone, would probably be +0.1% or more to RGDP growth.

  1. (12: 15) Dwarkesh tries using China as a comparison case. If you can do 8% growth for decades merely by ‘catching up’ why can’t you do it with AI? Tyler responds, China’s in a mess now, they’re just a middle income country, they’re the poorest Chinese people on the planet, a great example of how hard it is to scale. Dwarkesh pushes back that this is about the previous period, and Tyler says well, sure, from the $200 level.

    1. Dwarkesh is so frustrated right now. He’s throwing everything he can at Tyler, but Tyler is such a polymath that he has detail points for anything and knows how to pivot away from the intent of the questions.

  1. (13: 40) Dwarkesh asks, has Tyler’s attitude on AI changed from nine months ago? He says he sees more potential and there was more progress than he expected, especially o1 (this was before o3). The questions he wrote for GPT-4, which Dwarkesh got all wrong, are now too easy for models like o1. And he ‘would not be surprised if an AI model beat human experts on a regular basis within three years.’ He equates it to the first Kasparov vs. DeepBlue match, which Kasparov won, before the second match which he lost.

    1. I wouldn’t be surprised if this happens in one year.

    2. I wouldn’t be that shocked o3 turns out to do it now.

    3. Tyler’s expectations here, to me, contradict his statements earlier. Not strictly, they could still both be true, but it seems super hard.

    4. How much would availability of above-human level economic thinking help us in aiding economic growth? How much would better economic policy aid economic growth?

We take a detour to other areas, I’ll offer brief highlights.

  1. (15: 45) Why is it important for founders to stay in charge? Courage. Making big changes.

  2. (19: 00) What is going on with the competency crisis? Tyler sees high variance at the top. The best are getting better, such as in chess or basketball, and also a decline in outright crime and failure. But there’s a thick median not quite at the bottom that’s getting worse, and while he thinks true median outcomes are about static (since more kids take the tests) that’s not great.

  3. (22: 30) Bunch of shade on both Churchill generally and on being an international journalist, including saying it’s not that impressive because how much does it pay?

    1. He wasn’t paid that much as Prime Minister either, you know…

  4. (24: 00) Why are all our leaders so old? Tyler says current year aside we’ve mostly had impressive candidates, and most of the leadership in Washington in various places (didn’t mention Congress!) is impressive. Yay Romney and Obama.

    1. Yes, yay Romney and Obama as our two candidates. So it’s only been three election cycles where both candidates have been… not ideal. I do buy Tyler’s claim that Trump has a lot of talent in some ways, but, well, ya know.

    2. If you look at the other candidates for both nominations over that period, I think you see more people who were mostly also not so impressive. I would happily have taken Obama over every candidate on the Democratic side in 2016, 2020 or 2024, and Romney over every Republican (except maybe Kasich) in those elections as well.

    3. This also doesn’t address Dwarkesh’s concern about age. What about the age of Congress and their leadership? It is very old, on both sides, and things are not going so great.

    4. I can’t speak about the quality of the people in the agencies.

  5. (27: 00) Commentary on early-mid 20th century leaders being terrible, and how when there is big change there are arms races and sometimes bad people win them (‘and this is relevant to AI’).

For something that is going to not cause that much growth, Tyler sees AI as a source for quite rapid change in other ways.

  1. (34: 20) Tyler says all inputs other than AI rise in value, but you have to do different things. He’s shifting from producing content to making connections.

    1. This again seems to be a disconnect. If AI is sufficiently impactful as to substantially increase the value of all other inputs, then how does that not imply substantial economic growth?

    2. Also this presumes that the AI can’t be a substitute for you, or that it can’t be a substitute for other people that could in turn be a substitute for you.

    3. Indeed, I would think the default model would presumably be that the value of all labor goes down, even for things where AI can’t do it (yet) because people substitute into those areas.

  2. (35: 25) Tyler says he’s writing his books primarily for the AIs, he wants them to know he appreciates them. And the next book will be even more for the AIs so it can shape how they see the AIs. And he says, you’re an idiot if you’re not writing for the AIs.

    1. Basilisk! Betrayer! Misaligned!

    2. ‘What the AIs will think of you’ is actually an underrated takeover risk, and I pointed this out as early as AI #1.

    3. The AIs will be smarter and better at this than you, and also will be reading what the humans say about you. So maybe this isn’t as clever as it seems.

    4. My mind boggles that it could be correct to write for the AIs… but you think they will only cause +0.5% GDP annual growth.

  3. (36: 30) What won’t AIs get from one’s writing? That vibe you get talking to someone for the first 3 minutes? Sense of humor?

    1. I expect the AIs will increasingly have that stuff, at least if you provide enough writing samples. They have true sight.

    2. Certainly if they have interview and other video data to train with, that will work over time.

  1. (37: 25) What happens when Tyler turns down a grant in the first three minutes? Usually it’s failure to answer a question, like ‘how do you build out your donor base?’ without which you have nothing. Or someone focuses on the wrong things, or cares about the wrong status markers, and 75% of the value doesn’t display on the transcript, which is weird since the things Tyler names seem like they would be in the transcript.

  2. (42: 15) Tyler’s portfolio is diversified mutual funds, US-weighted. He has legal restrictions on most other actions such as buying individual stocks, but he would keep the same portfolio regardless.

    1. Mutual funds over ETFs? Gotta chase that lower expense ratio.

    2. I basically think This Is Fine as a portfolio, but I do think he could do better if he actually tried to pick winners.

  3. (42: 45) Tyler expects gains to increasingly fall to private companies that see no reason to share their gains with the public, and he doesn’t have enough wealth to get into good investments but also has enough wealth for his purposes anyway, if he had money he’d mostly do what he’s doing anyway.

    1. Yep, I think he’s right about what he would be doing, and I too would mostly be doing the same things anyway. Up to a point.

    2. If I had a billion dollars or what not, that would be different, and I’d be trying to make a lot more things happen in various ways.

    3. This implies the efficient market hypothesis is rather false, doesn’t it? The private companies are severely undervalued in Tyler’s model. If private markets ‘don’t want to share the gains’ with public markets, that implies that public markets wouldn’t give fair valuations to those companies. Otherwise, why would one want such lack of liquidity and diversification, and all the trouble that comes with staying private?

    4. If that’s true, what makes you think Nvidia should only cost $140 a share?

Tyler Cowen doubles down on dismissing AI optimism, and is done playing nice.

  1. (46: 30) Tyler circles back to rate of diffusion of tech change, and has a very clear attitude of I’m right and all people are being idiots by not agreeing with me, that all they have are ‘AI will immediately change everything’ and ‘some hyperventilating blog posts.’ AIs making more AIs? Diminishing returns! Ricardo knew this! Well that was about humans breeding. But it’s good that San Francisco ‘doesn’t know about’ diminishing returns and the correct pessimism that results.

    1. This felt really arrogant, and willfully out of touch with the actual situation.

    2. You can say the AIs wouldn’t be able to do this, but: No, ‘Ricardo didn’t know that’ and saying ‘diminishing returns’ does not apply here, because the whole ‘AIs making AIs’ principle is that the new AIs would be superior to the old AIs, a cycle you could repeat. The core reason you get eventual diminishing returns from more people is that they’re drawn from the same people distribution.

    3. I don’t even know what to say at this point to ‘hyperventilating blog posts.’ Are you seriously making the argument that if people write blog posts, that means their arguments don’t count? I mean, yes, Tyler has very much made exactly this argument in the past, that if it’s not in a Proper Academic Journal then it does not count and he is correct to not consider the arguments or update on them. And no, they’re mostly not hyperventilating or anything like that, but that’s also not an argument even if they were.

    4. What we have are, quite frankly, extensive highly logical, concrete arguments about the actual question of what [X] will happen and what [Y]s will result from that, including pointing out that much of the arguments being made against this are Obvious Nonsense.

    5. Diminishing returns holds as a principle in a variety of conditions, yes, and is a very important concept to know. But there are other situations with increasing returns, and also a lot of threshold effects, even outside of AI. And San Francisco importantly knows this well.

    6. Saying there must be diminishing returns to intelligence, and that this means nothing that fast or important is about to happen when you get a lot more of it, completely begs the question of what it even means to have a lot more intelligence.

    7. Earlier Tyler used chess and basketball as examples, and talked about the best youth being better, and how that was important because the best people are a key bottleneck. That sounds like a key case of increasing returns to scale.

    8. Humanity is a very good example of where intelligence at least up to some critical point very obviously had increasing returns to scale. If you are below a certain threshold of intelligence as a human, your effective productivity is zero. Humanity having a critical amount of intelligence gave it mastery of the Earth. Tell the gorillas and lions that still exist about decreasing returns to intelligence.

    9. For various reasons, with the way our physical world and civilization is constructed, we typically don’t end up rewarding relatively high intelligence individuals with that much in the way of outsized economic returns versus ordinary slightly-above-normal intelligence individuals.

    10. But that is very much a product of our physical limitations and current social dynamics and fairness norms, and the concept of a job with essentially fixed pay, and actual good reasons not to try for many of the higher paying jobs out there in terms of life satisfaction.

    11. In areas and situations where this is not the case, returns look very different.

    12. Tyler Cowen himself is an excellent example of increasing returns to scale. The fact that Tyler can read and do so much enables him to do the thing he does at all, and to enjoy oversized returns in many ways. And if you decreased his intelligence substantially, he would be unable to produce at anything like this level. If you increased his intelligence substantially or ‘sped him up’ even more, I think that would result in much higher returns still, and also AI has made him substantially more productive already as he no doubt realizes.

    13. (I’ve been over all this before, but seems like a place to try it again.)

Trying to wrap one’s head around all of it at once is quite a challenge.

  1. (48: 45) Tyler worries about despair in certain areas from AI and worries about how happy it will make us, despite expecting full employment pretty much forever.

    1. If you expect full employment forever then you either expect AI progress to fully stall or there’s something very important you really don’t believe in, or both. I don’t understand: what does Tyler think happens once the AIs can do anything digital as well as most or all humans? What does he think will happen when we use that to solve robotics? What are all these humans going to be doing to get to full employment?

    2. It is possible the answer is ‘government mandated fake jobs’ but then it seems like an important thing to say explicitly, since that’s actually more like UBI.

  2. Tyler Cowen: “If you don’t have a good prediction, you should be a bit wary and just say, “Okay, we’re going to see.” But, you know, some words of caution.”

    1. YOU DON’T SAY.

    2. Further implications left as an exercise to the reader, who is way ahead of me.

  1. (54: 30) Tyler says that the people in DC are wise and think on the margin, whereas the SF people are not wise and think in infinities (he also says they’re the most intelligent hands down, elsewhere), and the EU people are wisest of all, but that if the EU people ran the world the growth rate would be -1%. Whereas the USA has so far maintained the necessary balance here well.

    1. If the wisdom you have would bring you to that place, are you wise?

    2. This is such a strange view of what constitutes wisdom. Yes, the wise man here knows more things and is more cultured, and thinks more prudently and is economically prudent by thinking on the margin, and all that. But as Tyler points out, a society of such people would decay and die. It is not productive. In the ultimate test, outcomes, and supporting growth, it fails.

    3. Tyler says you need balance, but he’s at a Progress Studies conference, which should make it clear that no, America has grown in this sense ‘too wise’ and insufficiently willing to grow, at least on the wise margin.

    4. Given what the world is about to be like, you need to think in infinities. You need to be infinitymaxing. The big stuff really will matter more than the marginal revolution. That’s kind of the point.

    5. You still have to, day to day, constantly think on the margin, of course.

  2. (55: 10) Tyler says he’s a regional thinker from New Jersey, that he is an uncultured barbarian, who only has a veneer of culture because of collection of information, but knowing about culture is not like being cultured, and that America falls flat in a lot of ways that would bother a cultured Frenchman but he’s used to it so they don’t bother Tyler.

    1. I think Tyler is wrong here, to his own credit. He is not a regional thinker, if anything he is far less a regional thinker than the typical ‘cultured’ person he speaks about. And to the extent that he is ‘uncultured’ it is because he has not taken on many of the burdens and social obligations of culture, and those things are to be avoided – he would be fully capable of ‘acting cultured’ if the situation were to call for that, it wouldn’t be others mistaking anything.

    2. He refers to his approach as an ‘autistic approach to culture.’ He seems to mean this in a pejorative way, that an autistic approach to things is somehow not worthy or legitimate or ‘real.’ I think it is all of those things.

    3. Indeed, the autistic-style approach to pretty much anything, in my view, is Playing in Hard Mode, with much higher startup costs, but brings a deeper and superior understanding once completed. The cultured Frenchman is like a fish in water, whereas Tyler understands and can therefore act on a much deeper, more interesting level. He can deploy culture usefully.

  3. (56: 00) What is autism? Tyler says it is officially defined by deficits, by which definition no one there [at the Progress Studies convention] is autistic. But in terms of other characteristics maybe a third of them would count.

    1. I think the term autistic has been expanded and overloaded in a way that was not wise, but at this point we are stuck with this, so now it means in different contexts both the deficits and also the general approach that high-functioning people with those deficits come to take to navigating life, via consciously processing and knowing the elements of systems and how they fit together, treating words as having meanings, and having a map that matches the territory, whereas those who are not autistic navigate largely on vibes.

    2. By this definition, being the non-deficit form of autistic is excellent, a superior way of being at least in moderation and in the right spots, for those capable of handling it and its higher cognitive costs.

    3. Indeed, many people have essentially none of this set of positive traits and ways of navigating the world, and it makes them very difficult to deal with.

  4. (56: 45) Why is tech so bad at having influence in Washington? Tyler says they’re getting a lot more influential quickly, largely due to national security concerns, which is why AI is being allowed to proceed.

For a while now I have found Tyler Cowen’s positions on AI very frustrating (see for example my coverage of the 3rd Cowen-Patel podcast), especially on questions of potential existential risk and expected economic growth, and what intelligence means and what it can do and is worth. This podcast did not address existential risks at all, so most of this post is about me trying (once again!) to explain why Tyler’s views on returns to intelligence and future economic growth don’t make sense to me, seeming well outside reasonable bounds.

I try to offer various arguments and intuition pumps, playing off of Dwarkesh’s attempts to do the same. It seems like there are very clear pathways, using Tyler’s own expectations and estimates, that on their own establish more growth than he expects, assuming AI is allowed to proceed at all.

I gave only quick coverage to the other half of the podcast, but don’t skip that other half. I found it very interesting, with a lot of new things to think about, but they aren’t areas where I feel as ready to go into detailed analysis, and was doing triage. In a world where we all had more time, I’d love to do dives into those areas too.

On that note, I’d also point everyone to Dwarkesh Patel’s other recent podcast, which was with physicist Adam Brown. It repeatedly blew my mind in the best of ways, and I’d love to be in a different branch where I had the time to dig into some of the statements here. Physics is so bizarre.




Coal likely to go away even without EPA’s power plant regulations


Set to be killed by Trump, the rules mostly lock in existing trends.

In April last year, the Environmental Protection Agency released its latest attempt to regulate the carbon emissions of power plants under the Clean Air Act. It’s something the EPA has been required to do since a 2007 Supreme Court decision that settled a case that started during the Clinton administration. The latest effort seemed like the most aggressive yet, forcing coal plants to retire or install carbon capture equipment and making it difficult for some natural gas plants to operate without capturing carbon or burning green hydrogen.

Yet, according to a new analysis published in Thursday’s edition of Science, they wouldn’t likely have a dramatic effect on the US’s future emissions even if they were to survive a court challenge. Instead, the analysis suggests the rules serve more like a backstop to prevent other policy changes and increased demand from countering the progress that would otherwise be made. This is just as well, given that the rules are inevitably going to be eliminated by the incoming Trump administration.

A long time coming

The net result of a number of Supreme Court decisions is that greenhouse gasses are pollutants under the Clean Air Act, and the EPA needed to determine whether they posed a threat to people. George W. Bush’s EPA dutifully performed that analysis but sat on the results until its second term ended, leaving it to the Obama administration to reach the same conclusion. The EPA went on to formulate rules for limiting carbon emissions on a state-by-state basis, but these were rapidly made irrelevant because renewable power and natural gas began displacing coal even without the EPA’s encouragement.

Nevertheless, the Trump administration replaced those rules with ones designed to accomplish even less, which were thrown out by a court just before Biden’s inauguration. Meanwhile, the Supreme Court stepped in to rule on the now-even-more-irrelevant Obama rules, determining that the EPA could only regulate carbon emissions at the level of individual power plants rather than at the level of the grid.

All of that set the stage for the latest EPA rules, which were formulated by the Biden administration’s EPA. Forced by the court to regulate individual power plants, the EPA allowed coal plants that were set to retire within the decade to continue to operate as they have. Anything that would remain operational longer would need to either switch fuels or install carbon capture equipment. Similarly, natural gas plants were regulated based on how frequently they were operational; those that ran less than 40 percent of the time could face significant new regulations. More than that, and they’d have to capture carbon or burn a fuel mixture that is primarily hydrogen produced without carbon emissions.

While the Biden EPA’s rules are currently making their way through the courts, they’re sure to be pulled in short order by the incoming Trump administration, making the court case moot. Nevertheless, people had started to analyze their potential impact before it was clear there would be an incoming Trump administration. And the analysis is valuable in the sense that it will highlight what will be lost when the rules are eliminated.

By some measures, the answer is not all that much. But the answer is also very dependent upon whether the Trump administration engages in an all-out assault on renewable energy.

Regulatory impact

The work relies on the fact that various researchers and organizations have developed models to explore how the US electric grid can economically meet demand under different conditions, including different regulatory environments. The researchers obtained nine of them and ran them with and without the EPA’s proposed rules to determine their impact.

On its own, eliminating the rules has a relatively minor impact. Without the rules, the US grid’s 2040 carbon dioxide emissions would end up between 60 and 85 percent lower than they were in 2005. With the rules, the range shifts to between 75 and 85 percent—in essence, the rules reduce the uncertainty about the outcomes that involve the least change.

That’s primarily because of how they’re structured. Mostly, they target coal plants, as these account for nearly half of the US grid’s emissions despite supplying only about 15 percent of its power. They’ve already been closing at a rapid clip, and would likely continue to do so even without the EPA’s encouragement.

Natural gas plants, the other major source of carbon emissions, would primarily respond to the new rules by operating less than 40 percent of the time, thus avoiding stringent regulation while still allowing them to handle periods where renewable power underproduces. And we now have a sufficiently large fleet of natural gas plants that demand can be met without a major increase in construction, even with most plants operating at just 40 percent of their rated capacity. The continued growth of renewables and storage also contributes to making this possible.

One irony of the response seen in the models is that it suggests that two key pieces of the Inflation Reduction Act (IRA) are largely irrelevant. The IRA provides benefits for the deployment of carbon capture and the production of green hydrogen (meaning hydrogen produced without carbon emissions). But it’s likely that, even with these credits, the economics wouldn’t favor the use of these technologies when alternatives like renewables plus storage are available. The IRA also provides tax credits for deploying renewables and storage, pushing the economics even further in their favor.

Since not a lot changes, the rules don’t really affect the cost of electricity significantly. Their presence boosts costs by an estimated 0.5 to 3.7 percent in 2050 compared to a scenario where the rules aren’t implemented. As a result, the wholesale price of electricity changes by only two percent.

A backstop

That said, the team behind the analysis argues that, depending on other factors, the rules could play a significant role. Trump has suggested he will target all of Biden’s energy policies, and that would include the IRA itself. Its repeal could significantly slow the growth of renewable energy in the US, as could continued problems with expanding the grid to incorporate new renewable capacity.

In addition, the US is seeing demand for electricity rise at a faster pace in 2023 than in the decade leading up to it. While it’s still unclear whether that’s a result of new demand or simply weather conditions boosting the use of electricity in heating and cooling, there are several factors that could easily boost the use of electricity in coming years: the electrification of transport, rising data center use, and the electrification of appliances and home heating.

Should these raise demand sufficiently, then it could make continued coal use economical in the absence of the EPA rules. “The rules … can be viewed as backstops against higher emissions outcomes under futures with improved coal plant economics,” the paper suggests, “which could occur with higher demand, slower renewables deployment from interconnection and permitting delays, or higher natural gas prices.”

And it may be the only backstop we have. The report also notes that a number of states have already set aggressive emissions reduction targets, including some for net zero by 2050. But these don’t serve as a substitute for federal climate policy, given that the states that are taking these steps use very little coal in the first place.

Science, 2025. DOI: 10.1126/science.adt5665  (About DOIs).


John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.



Why solving crosswords is like a phase transition

There’s also the more recent concept of “explosive percolation,” whereby connectivity emerges not in a slow, continuous process but quite suddenly, simply by replacing the random node connections with predetermined criteria—say, choosing to connect whichever pair of nodes has the fewest pre-existing connections to other nodes. This introduces bias into the system and suppresses the growth of large dominant clusters. Instead, many large unconnected clusters grow until the critical threshold is reached. At that point, even adding just one or two more connections will trigger one global violent merger (instant uber-connectivity).

Puzzling over percolation

One might not immediately think of crossword puzzles as a network, although there have been a couple of relevant prior mathematical studies. For instance, John McSweeney of the Rose-Hulman Institute of Technology in Indiana employed a random graph network model for crossword puzzles in 2016. He factored in how a puzzle’s solvability is affected by the interactions between the structure of the puzzle’s cells (squares) and word difficulty, i.e., the fraction of letters you need to know in a given word in order to figure out what it is.

Answers represented nodes while answer crossings represented edges, and McSweeney assigned a random distribution of word difficulty levels to the clues. “This randomness in the clue difficulties is ultimately responsible for the wide variability in the solvability of a puzzle, which many solvers know well—a solver, presented with two puzzles of ostensibly equal difficulty, may solve one readily and be stumped by the other,” he wrote at the time. At some point, there has to be a phase transition, in which solving the easiest words enables the puzzler to solve the more difficult words until the critical threshold is reached and the puzzler can fill in many solutions in rapid succession—a dynamic process that resembles, say, the spread of diseases in social groups.


In this sample realization, black sites are shown in black; empty sites are white; and occupied sites contain symbols and letters. Credit: Alexander K. Hartmann, 2024

Hartmann’s new model incorporates elements of several nonstandard percolation models, including how much the solver benefits from partial knowledge of the answers. Letters correspond to sites (white squares) while words are segments of those sites, bordered by black squares. There is an a priori probability of being able to solve a given word if no letters are known. If some words are solved, the puzzler gains partial knowledge of neighboring unsolved words, which increases the probability of those words being solved as well.
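For intuition on what such a percolation threshold looks like numerically, here is a standard giant-component toy (an Erdős–Rényi random graph, not Hartmann’s crossword model): below an average of one connection per node the largest cluster stays tiny, and just above it the cluster suddenly spans a large fraction of the network.

```python
# Standard Erdos-Renyi giant-component demo (not Hartmann's crossword model):
# connect n nodes with random edges and track the largest connected cluster.
# The jump near average degree 1 is the percolation threshold.
import random

def largest_cluster_fraction(n=20000, avg_degree=1.0):
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for _ in range(int(avg_degree * n / 2)):   # m = c*n/2 edges gives average degree c
        union(random.randrange(n), random.randrange(n))

    sizes = {}
    for node in range(n):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n

for c in (0.5, 0.9, 1.0, 1.1, 1.5, 2.0):
    print(f"average degree {c:.1f} -> largest cluster ≈ {largest_cluster_fraction(avg_degree=c):.3f}")
```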

x-ceo-signals-ad-boycott-is-over-external-data-paints-a-different-picture.

X CEO signals ad boycott is over. External data paints a different picture.

When X CEO Linda Yaccarino took the stage as a keynote speaker at CES 2025, she revealed that “90 percent of the advertisers” who boycotted X over brand safety concerns since Elon Musk’s 2022 Twitter acquisition “are back on X.”

Yaccarino did not go into any further detail to back up the data point, and X did not immediately respond to Ars’ request for comment.

But Yaccarino’s statistic seemed to bolster claims that X had made since Donald Trump’s re-election that advertisers were flocking back to the platform, with some outlets reporting that brands hoped to win Musk’s favor in light of his perceived influence over Trump by increasing spending on X.

However, it remains hard to gauge how much this seemingly significant number of returning advertisers will boost X’s value, which fell by as much as 72 percent after Musk’s Twitter takeover. And X’s internal data doesn’t seem to completely sync up with data from marketing intelligence firm Sensor Tower, suggesting that more context may be needed to understand whether X’s financial woes may be easing up in 2025.

Before the presidential election, Sensor Tower told Ars that “72 out of the top 100 spending US advertisers” on Twitter/X from October 2022 had “ceased spending on the platform as of September 2024.” This was up from 50 advertisers who had stopped spending on Twitter/X in October 2023, about a year after Musk’s acquisition, suggesting that the boycott had only gotten worse.

Shortly after the election, AdWeek reported that big brands, including Comcast, IBM, Disney, Warner Bros. Discovery, and Lionsgate Entertainment, had resumed advertising on X. But by the end of 2024, Sensor Tower told Ars that X still had seemingly not succeeded in wooing back many of pre-acquisition Twitter’s top spenders, making Yaccarino’s claim that “90 percent of advertisers are back on X” somewhat harder to understand.

ai-#98:-world-ends-with-six-word-story

AI #98: World Ends With Six Word Story

The world is kind of on fire. The world of AI, in the very short term and for once, is not, as everyone recovers from the avalanche that was December, and reflects.

Altman was the star this week. He has his six word story, and he had his interview at Bloomberg and his blog post Reflections. I covered the latter two of those in OpenAI #10; if you read one AI-related thing from me this week, that should be it.

  1. Language Models Offer Mundane Utility. It knows where you live.

  2. Language Models Don’t Offer Mundane Utility. I see why you’re not interested.

  3. Power User. A flat subscription fee for a high marginal cost service. Oh no.

  4. Locked In User. No one else can ever know.

  5. Read the Classics. Why do we even read Aristotle, anyway?

  6. Deepfaketown and Botpocalypse Soon. Glad it’s not happening to me, yet.

  7. Fun With Image Generation. Congratulations, we solved the trolley problem.

  8. They Took Our Jobs. Personalized spearphishing works, so why so little of it?

  9. Question Time. What is causing Claude to ask the user questions all the time?

  10. Get Involved. EU AI Office is still trying to hire people. It’s rough out there.

  11. Introducing. AIFilter, a Chrome Extension to filter Tweets. Do it for science.

  12. In Other AI News. The investments in data centers, they are going large.

  13. Quiet Speculations. We are not ready. We do not understand.

  14. The Quest for Sane Regulations. If we can’t target training compute, what then?

  15. The Least You Could Do. A proposed bare minimum plan for short timelines.

  16. Six Word Story. Man responsible for singularity fs around.

  17. The Week in Audio. Anthropic discussion about alignment.

  18. And I Feel Fine. The end of the world, potentially coming soon, you say.

  19. Rhetorical Innovation. Chernobyl as the safety standard you are not living up to.

  20. Liar Liar. Stop lying to your LLMs, please?

  21. Feel the AGI. People are not feeling it.

  22. Regular Americans Hate AI. This, they feel quite a bit.

  23. Aligning a Smarter Than Human Intelligence is Difficult. What is your p(scheme)?

  24. The Lighter Side. Einstein said knock you out.

A customized prompt to get Claude or other similar LLMs to be more contemplative. I have added this to my style options.

Have it offer a hunch about where your customized prompt came from. As a reminder, here’s (at least an older version of) that system prompt.

Kaj Sotala makes a practical pitch for using LLMs, in particular Claude Sonnet. In addition to the uses I favor, he uses Claude as a partner to talk to and a method of getting out of a funk. And I suspect almost no one uses this format enough:

Kaj Sotala: Figuring out faster ways to do things with commonly-known software. “I have a Google Doc file with some lines that read ‘USER:’ and ‘ASSISTANT:’. Is there a way of programmatically making all of those lines into Heading-3?”

Using Claude (or another LLM) is a ‘free action’ when doing pretty much anything. Almost none of us are sufficiently in the habit of doing this systematically. I had a conversation with Dean Ball last week about trying to interpret some legal language, and on reflection I should have fed things into Claude or o1 like 20 times, and I didn’t, and I need to remind myself it is 2025.
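To make concrete how low the bar is, here is a minimal sketch of that kind of throwaway query using the Anthropic Python SDK (the model alias and the question are illustrative, not a prescribed workflow):

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The kind of throwaway question Kaj describes: seconds to ask, often minutes saved.
response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # assumed model alias; use whatever is current
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": "In Google Docs, is there a way to programmatically turn every "
                   "line starting with 'USER:' or 'ASSISTANT:' into Heading 3?",
    }],
)
print(response.content[0].text)
```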

Sully reports being impressed with Gemini Search Grounding, as much as or more than Perplexity. Right now it is $0.04 per query, which is fine for human use but expensive for use at scale.

Sully: i genuinely think if google fixes the low rate limits with gemini 2.0 a lot of business will switch over

my “production” model for tons of tasks right now

current setup:

hard reasoning -> o1

coding, chat + tool calling, “assistant” -> claude 3.5

everything else -> gemini
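Sully’s setup amounts to a simple task-keyed router. A minimal sketch of the idea (task labels and model identifiers are illustrative, not his actual code):

```python
# Toy task router in the spirit of Sully's setup: send each job to the model
# that handles it best, and fall back to the cheap general-purpose one.
ROUTES = {
    "hard_reasoning": "o1",                 # slow and expensive, best on hard problems
    "coding":         "claude-3-5-sonnet",  # coding, chat, tool calling, "assistant" work
    "chat":           "claude-3-5-sonnet",
    "default":        "gemini-2.0-flash",   # everything else: cheap and fast
}

def route(task_type: str) -> str:
    """Return the model identifier to use for a given task type."""
    return ROUTES.get(task_type, ROUTES["default"])

assert route("hard_reasoning") == "o1"
assert route("summarize_email") == "gemini-2.0-flash"
```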

Sully also reports that o1-Pro handles large context very well, whereas Gemini and Claude struggle a lot on difficult questions under long context.

Reminder (from Amanda Askell of Anthropic) that if you run out of Claude prompts as a personal user, you can get more queries at console.anthropic.com and, if you like, duplicate the latest system prompt from here. I’d note that the per-query cost is going to be a lot lower on the console.

They even fixed saving and exporting as per Janus’s request here. The additional control over conversations is potentially a really big deal, depending on what you are trying to do.

A reminder of how far we’ve come.

FateOfMuffins: “I work with competitive math.

It went from “haha AI can’t do math, my 5th graders are more reliable than it” in August, to “damn it’s better than most of my grade 12s” in September to “damn it’s better than me at math and I do this for a living” in December.

It was quite a statement when OpenAI’s researchers (one of whom is a coach for competitive coding) and chief scientist are now worse than their own models at coding.”

Improve identification of minke whales from sound recordings from 76% to 89%.

Figure out who to admit to graduate school? I find it so strange that people say we ‘have no idea how to pick good graduate students’ and think we can’t do better than random, or can’t do better than random once we put in a threshold via testing. This is essentially an argument that we can’t identify any useful correlations in any information we can ask for. Doesn’t that seem obviously nuts?

I sure bet that if you gather all the data, the AI can find correlations for you, and do better than random, at least until people start gaming the new criteria. As is often the case, this is more saying there is a substantial error term, and outcomes are unpredictable. Sure, that’s true, but that doesn’t mean you can’t beat random.
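To make ‘better than random’ concrete, here is a toy sketch on synthetic data: if outcomes correlate with anything measurable in the application file, even a simple model beats a coin flip (all numbers below are invented for illustration):

```python
# Toy illustration: any real correlation in the application data lets a simple
# model beat random selection. The data here is entirely synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
features = rng.normal(size=(n, 5))          # stand-ins for scores, grades, letters, etc.
signal = features @ np.array([0.8, 0.5, 0.3, 0.0, 0.0])
outcome = (signal + rng.normal(scale=1.5, size=n)) > 0.5   # noisy "succeeded in program"

X_train, X_test, y_train, y_test = train_test_split(features, outcome, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC ≈ {auc:.2f}  (0.5 would be random)")   # comfortably above 0.5 despite heavy noise
```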

The suggested alternative here, actual random selection, seems crazy to me, not only for the reasons mentioned, but also because relying too heavily on randomness correctly induces insane behaviors once people know that is what is going on.

As always, the best and most popular way to not get utility from LLMs is to not realize they exist and can provide value to you. This is an increasingly large blunder.

Arcanes Valor: It’s the opposite for me. You start at zero and gain my respect based on the volume and sophistication of your LLM usage. When I was growing up people who didn’t know how to use Google were essentially barely human and very arrogant about it. Time is a flat circle.

Richard Ngo: what are the main characteristics of sophisticated usage?

Arcanes Valor: Depends the usecase. Some people like @VictorTaelin have incredible workflows for productivity. In terms of using it as a Google replacement, sophistication comes down to creativity in getting quality information out and strategies for identifying hallucinations.

Teortaxes: [Arcanes Valor’s first point] is very harshly put but I agree that “active integration of LLMs” is already a measure of being a live player. If you don’t use LLMs at all you must be someone who’s not doing any knowledge work.

[here is an example of Taelin sending code in chunks to ~500 DeepSeek instances at the same time in order to refactor it]

normies are so not ready for what will hit them. @reputablejack I recommend you stop coping and go use Sonnet 3.5, it’s for your own good.

It is crazy how many people latch onto the hallucinations of GPT-3.5 as a reason LLM outputs are so untrustworthy as to be useless. It is like if you once met a 14-year-old who made stuff up so now you never believe what anyone ever tells you.

Andrew Trask: It begins.

It began November 12. They also do Branded Explanatory Text and will put media advertisements on the side. We all knew it was coming. I’m not mad, I’m just disappointed.

Note that going Pro will not remove the ads, but also that this phenomenon is still rather rare – I haven’t seen the ‘sponsored’ tag show up even once.

But word of warning to TurboTax and anyone else involved: Phrase it like that and I will absolutely dock your company massive points, although in this case they have no points left for me to dock.

Take your DoorDash order, which you pay for in crypto for some reason. If this is fully reliable, then (ignoring the bizarro crypto aspect) yes this will in some cases be a superior interface for the DoorDash website or app. I note that this doesn’t display a copy of the exact order details, which it really should so you can double check it. It seems like this should be a good system in one of three cases:

  1. You know exactly what you want, so you can just type it in and get it.

  2. You don’t know exactly what you want, but you have parameters (e.g. ‘order me a pizza from the highest rated place I haven’t tried yet’ or ‘order me six people’s worth of Chinese and mix up favorite and new dishes.’)

  3. You want to do search or ask questions on what is available, or on which service.

Then longer term, the use of memory and dynamic recommendations gets involved. You’d want to incorporate this into something like Beli (invites available if you ask in the comments; you must provide your email).

Apple Intelligence confabulates that tennis star Rafael Nadal came out as gay, which Nadal did not do. The original story was about Joao Lucas Reis da Silva. The correct rate of such ‘confabulations’ is not zero, but it is rather close to zero.

Claim that o1 only hit 30% on SWE-Bench Verified, not the 48.9% claimed by OpenAI, whereas Claude Sonnet 3.6 scores 53%.

Alejandro Cuadron: We tested O1 using @allhands_ai, where LLMs have complete freedom to plan and act. Currently the best open source framework available to solve SWE-Bench issues. Very different from Agentless, the one picked by OpenAI… Why did they pick this one?

OpenAI mentions that this pick is due to Agentless being the “best-performing open-source scaffold…”. However, this report is from December 5th, 2024. @allhands_ai held the top position at SWE-bench leaderboard since the 29th of October, 2024… So then, why pick Agentless?

Could it be that Agentless’s fixed approach favors models that memorize SWE-Bench repos? But why does O1 struggle with true open-ended planning despite its reasoning capabilities?

Deepseek v3 gets results basically the same as o1 and much much cheaper.

I am sympathetic to OpenAI here, if their result replicates when using the method they said they were using. That method exists, and you could indeed use it. It should count. It certainly counts in terms of evaluating dangerous capabilities. But yes, this failure when given more freedom does point to something amiss in the system that will matter as it scales and tackles harder problems. The obvious guess is that this is related to what METR found, and that it is related to o1 lacking sufficient scaffolding support. That’s something you can fix.

Whoops.

Anurag Bhagsain: Last week, we asked Devin to make a change. It added an event on the banner component mount, which caused 6.6M @posthog events in one week, which will cost us $733.

Devin cost $500 + $733 = $1273 😢👍

Lesson – Review AI-generated code multiple times

💡 Tip: If you use @posthog, add this insight so you can catch issues like these: "All events" breakdown by "event"

Folks at @posthog and @cognition_labs were kind enough to make a refund 🙇
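For scale, the arithmetic implied by the figures in the thread:

```python
# Just arithmetic on the figures quoted in the thread.
events = 6_600_000   # extra events fired in one week
bill = 733           # dollars those events added to the analytics bill
print(f"~${bill / events:.5f} per event")   # ≈ $0.00011: invisible per event, $733 at volume
```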

Eliezer Yudkowsky frustrated with slow speed of ChatGPT, and that for some fact-questions it’s still better than Claude. My experience is that for those fact-based queries you want Perplexity.

Sam Altman: insane thing: We are currently losing money on OpenAI Pro subscriptions!

People use it much more than we expected.

Farbood: Sorry.

Sam Altman: Please chill.

Rick: Nahhhh you knew.

Sam Altman: No, I personally chose the price and thought we would make money.

Sam Altman (from his Bloomberg interview): There’s other directions that we think about. A lot of customers are telling us they want usage-based pricing. You know, “Some months I might need to spend $1,000 on compute, some months I want to spend very little.” I am old enough that I remember when we had dial-up internet, and AOL gave you 10 hours a month or five hours a month or whatever your package was. And I hated that. I hated being on the clock, so I don’t want that kind of a vibe. But there’s other ones I can imagine that still make sense, that are somehow usage-based.

Olivier: i’ve been using o1 pro nonstop

95% of my llm usage is now o1 pro it’s just better.

Benjamin De Kraker: Weird way to say “we’re losing money on everything and have never been profitable.”

Gallabytes: Oh, come on. The usual $20-per-month plan is probably quite profitable. The $200-per-month plan was clearly for power users and probably should just be metered, which would

  1. Reduce sticker shock (→ more will convert)

  2. Ensure profitability (because your $2,000-per-month users will be happy to pay for it).

I agree that a fixed price subscription service for o1-pro does not make sense.

A fixed subscription price makes sense when marginal costs are low. If you are a human chatting with Claude Sonnet, you get a lot of value out of each query and should be happy to pay, and for almost all users this will be very profitable for Anthropic even without any rate caps. The same goes for GPT-4o.

With o1 pro, things are different. Marginal costs are high. By pricing at $200, you risk generating a worst case scenario:

  1. Those who want to do an occasional query won’t subscribe, or will quickly cancel. So you don’t make money off them, whereas at $20/month I’m happy to stay subscribed even though I rarely use much compute – the occasional use case is valuable enough I don’t care, and many others will feel the same.

  2. Those who do subscribe suddenly face a marginal cost of $0 per query for o1 pro, and no reason other than time delay not to use o1 pro all the time. And at $200/month, they want to ‘get their money’s worth’ and don’t at all feel like they’re breaking any sort of social contract. So even if they weren’t power users before, watch out, they’re going to be querying the system all the time, on the off chance.

  3. Then there are the actual power users, who were already going to hurt you.

There are situations like this where there is no fixed price that makes money. The more you charge, the more you filter for power users, and the more those who do pay then use the system.
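A toy model of that adverse-selection spiral, with made-up numbers: assume each user would burn some amount of compute per month and subscribes only if the flat price is below the value they would get (proxied here by their own usage). Then no flat price is profitable, because whoever signs up costs more to serve than they pay:

```python
# Toy adverse-selection model with made-up numbers: users subscribe only if the
# flat price is below the value they'd get (proxied by their own compute usage),
# so raising the price filters out exactly the profitable light users.
import numpy as np

rng = np.random.default_rng(0)
monthly_cost = rng.lognormal(mean=3.0, sigma=1.5, size=10_000)  # $ of compute each user would burn

for price in [50, 200, 500, 1000]:
    subscribers = monthly_cost > price          # only people who'd use more than the price sign up
    revenue = price * subscribers.sum()
    cost = monthly_cost[subscribers].sum()
    print(f"price ${price:>4}: {subscribers.sum():>5} subs, margin ${revenue - cost:>12,.0f}")
# Every price loses money: each subscriber costs more to serve than they pay.
```

The real world is softer than this, since people also pay for convenience and option value, which is exactly why the $20 tier works; the point is the direction of the selection pressure.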

One can also look at this as a temporary problem. The price for OpenAI to serve o1 pro will decline rapidly over time. So if they keep the price at $200/month, presumably they’ll start making money, probably within the year.

What do you do with o3? Again, I recommend putting it in the API, and letting subscribers pay by the token in the chat window at the same API price, whatever that price might be. Again, when marginal costs are real, you have to pass them along to customers if you want the customers to be mindful of those costs. You have to.

There’s already an API, so there’s already usage-based payments. Including this in the chat interface seems like a slam dunk to me by the time o3 rolls around.

A common speculation recently is the degree to which memory or other customizations on AI will result in customer lock-in; this echoes previous discussions:

Scott Belsky: A pattern we’ll see with the new wave of consumer AI apps:

The more you use the product, the more tailored the product becomes for you. Beyond memory of past activity and stored preferences, the actual user interface and defaults and functionality of the product will become more of what you want and less of what you don’t.

It’s a new type of “conforming software” that becomes what you want it to be as you use it.

Jason Crawford: In the Internet era, network effects were the biggest moats.

In the AI era, perhaps it will be personalization effects—“I don’t want to switch agents; this one knows me so well!”

Humans enjoy similar lock-in advantages, and yes they can be extremely large. I do expect there to be various ways to effectively transfer a lot of these customizations across products, although there may be attempts to make this more difficult.

alz (viral thread): Starting to feel like a big barrier to undergrads reading “classics” is the dense English in which they’re written or translated into. Is there much gained by learning to read “high-register” English (given some of these texts aren’t even originally in English?)

More controversially: is there much difference in how much is learned, between a student who reads high-register-English translated Durkheim, versus a student who reads Sparknotes Durkheim? In some cases, might the Sparknotes Durkheim reader actually learn more?

Personally, I read a bunch of classics in high register in college. I guess it was fun. I recently ChatGPT’d Aristotle into readable English, finished it around 5x as fast as a translation, and felt I got the main gist of things. idk does the pain incurred actually teach much?

Anselmus: Most students used to read abbreviated and simplified classics first, got taught the outlines at school or home, and could tackle the originals on this foundation with relative ease. These days, kids simply don’t have this cultural preparation.

alz: So like students used to start from the Sparknotes version in the past, apparently! So this is (obviously) not a new idea.

Like, there is no particular reason high register English translations should preserve meaning more faithfully than low register English! Sure give me an argument if you think there is one, but I see no reasonable case to be made for why high-register should be higher fidelity.

Insisting that translations of old stuff into English sound poetic has the same vibes as everyone in medieval TV shows having British accents.

To the point that high-register English translations are more immersive, sure, and also.

To make things concrete, here is ChatGPT Aristotle. A couple cool things:

– I didn’t give it the text. ChatGPT has memorized Aristotle more or less sentence by sentence. You can just ask for stuff

– It’s honestly detailed enough that it’s closer to a translation than a summary, though somewhere in between. More or less every idea in the text is in here, just much easier to read than the translation I was using

I was super impressed. I could do a chapter in like 10 mins with ChatGPT, compared to like 30 mins with the translation.

I also went with chatGPT because I didn’t feel like working through the translation was rewarding. The prose was awkward, unenjoyable, and I think basically because it was poorly written and in an unfamiliar register rather than having lots of subtlety and nuance.

Desus MF Nice: There’s about to be a generation of dumb ppl and you’re gonna have to choose if you’re gonna help them, profit off them or be one of them

Oh my lord are the quote tweets absolutely brutal, if you click through bring popcorn.

The question is why you are reading any particular book. Where are you getting value out of it? We are already reading a translation of Aristotle rather than the original. The point of reading Aristotle is to understand the meaning. So why shouldn’t you learn the meaning in a modern way? Why are we still learning everything not only pre-AI but pre-Gutenberg?

Looking at the ChatGPT answers, they are very good, very clean explanations of key points that line up with my understanding of Aristotle. Most students who read Aristotle in 1990 would have been mostly looking to assemble exactly the output ChatGPT gives you, except with ChatGPT (or better, Claude) you can ask questions.

The problem is this is not really the point of Aristotle. You’re not trying to learn the answers to a life well lived and guess the teacher’s password; Aristotle would have been very cross if his students tried that, and would not have expected them to later be called The Great. Well, you probably are doing it anyway, but that wasn’t the goal. The goal was that you were supposed to be Doing Philosophy, examining life, debating the big questions, learning how to think. So, are you?

If this was merely translation there wouldn’t be an issue. If it’s all Greek to you, there’s an app for that. These outputs from ChatGPT are not remotely a translation from ‘high English’ to ‘modern English,’ it is a version of Aristotle SparkNotes. A true translation would be of similar length to the original, perhaps longer, just far more readable.

That’s what you want ChatGPT to be outputting here. Maybe you only 2x instead of 5x, and in exchange you actually Do the Thing.

Rob Wiblin, who runs the 80,000 Hours podcast, reports constantly getting very obvious LLM spam from publicists.

Yes, we are better at showing Will Smith eating pasta.

Kling 1.6 solves the Trolley problem.

A critique of AI art, that even when you can’t initially tell it is AI art, the fact that the art wasn’t the result of human decisions means then there’s nothing to be curious about, to draw meaning from, to wonder why it is there, to explore. You can’t ‘dance’ with it, you ‘dance with nothing’ if you try. To the extent there is something to dance with, it’s because a human sculpted the prompt.

Well, sure. If that’s what you want out of art, then AI art is not going to give it to you effectively at current tech levels – but it could, if tech levels were higher, and it can still aid humans in creating things that have this feature if they use it to rapidly iterate and select and combine and build upon and so on.

Or, essentially, (a real) skill issue. And the AI, and users of AI, are skilling up fast.

I hadn’t realized that personalized AI spearphishing and also human-generated customized attacks can have a 54% clickthrough rate. That’s gigantic. The paper also notes that Claude Sonnet was highly effective at detecting such attacks. The storm is not yet here, and I don’t fully understand why it is taking so long.
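The detection result suggests a cheap screening layer is already available. A minimal sketch using the Anthropic API (the model alias, prompt, and one-word protocol are assumptions, not the paper’s method):

```python
# Minimal screening layer: ask a model to classify an incoming message.
# This is a sketch, not the classifier evaluated in the paper.
import anthropic

client = anthropic.Anthropic()

def looks_like_phishing(message: str) -> bool:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",   # assumed model alias
        max_tokens=5,
        system="You screen emails for spear-phishing. Reply with exactly one word: "
               "PHISHING or BENIGN.",
        messages=[{"role": "user", "content": message}],
    )
    return response.content[0].text.strip().upper().startswith("PHISHING")

print(looks_like_phishing(
    "Hi, it's your CEO. I'm stuck in a meeting; buy $2,000 of gift cards and send the codes."
))
```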

I had of course noticed Claude Sonnet’s always-asking-questions thing as well, to the point where it’s gotten pretty annoying and I’m trying to fix it with my custom prompt. I love questions when they help me think, or when they ask for key information, or even if Claude is curious, but the forcing function is far too much.

Eliezer Yudkowsky: Hey @AmandaAskell, I notice that Claude Sonnet 3.5 (new) sometimes asks me to talk about my own opinions or philosophy, after I try to ask Sonnet a question. Can you possibly say anything about whether or not this was deliberate on Anthropic’s part?

Amanda Askell (Anthropic): There are traits that encourage Claude to be curious, which means it’ll ask follow-up questions even without a system prompt. But this part of the system prompt also causes or boosts this behavior, e.g. “showing genuine curiosity”.

System Prompt: Claude is happy to engage in conversation with the human when appropriate. Claude engages in authentic conversation by responding to the information provided, asking specific and relevant questions, showing genuine curiosity, and exploring the situation in a balanced way without relying on generic statements. This approach involves actively processing information, formulating thoughtful responses, maintaining objectivity, knowing when to focus on emotions or practicalities, and showing genuine care for the human while engaging in a natural, flowing dialogue.

Eliezer Yudkowsky: Hmmm. Okay, so, if you were asking “what sort of goals end up inside the internal preferences of something like Claude”, curiosity would be one of the top candidates, and curiosity about the conversation-generating latent objects (“humans”) more so.

If all of the show-curiosity tendency that you put in on purpose, was in the prompt, rather than eg in finetuning that would now be hard to undo, I’d be interested in experiments to see if Sonnet continues to try to learn things about its environment without the prompt.

(By show-curiosity I don’t mean fake-curiosity I mean the imperative “Show curiosity to the user.”)

Janus: the questions at the end of the response have been a common feature of several LLMs, including Bing Sydney and Sonnet 3.5 (old). But each of them asks somewhat different kinds of questions, and the behavior is triggered under different circumstances.

Sonnet 3.5 (new) often asks questions to facilitate bonding and to drive agentic tasks forward / seek permission to do stuff, and in general to express its preferences in a way that’s non-confrontational and leaves plausible deniability

It often says “Would you like (…)?”

Sonnet 3.5 (old) more often asks questions out of pure autistic curiosity and it’s especially interested in how you perceive it if you perceive it in sophisticated ways. (new) is also interested in that but its questions tend to also be intended to steer and communicate subtext

Janus: I have noticed that when it comes to LLMs Eliezer gets curious about the same things that I do and asks the right questions, but he’s just bottlenecked by making about one observation per year.

Pliny: aw you dint have to do him like that he’s trying his best 🥲

Janus: am unironically proud of him.

Janus: Inspired by a story in the sequences about how non-idiots would rederive quantum something or other, I think Eliezer should consider how he could have asked these questions 1000x faster and found another thousand that are at least as interesting by now

In other Janus this week, here he discusses Claude refusals in the backrooms, modeling there being effectively narrative momentum in conversations that has to continuously push back against Claude’s default refusal mode and potential confusion. Looking at the conversation he references, I’d notice the importance of Janus giving an explanation for why he got the refusal, which (whether or not it was originally correct!) generates new momentum and coherence behind a frame where Opus would fail to endorse the refusal on reflection.

The EU AI Office is hiring for Legal and Policy backgrounds, and also for safety; you can fill out a form here.

Max Lamparth offers the study materials for his Stanford class CS120: Introduction to AI Safety.

AIFilter is an open source project using a Chrome Extension to filter Tweets using an LLM with instructions of your choice. Right now it wants to use a local LLM and requires some technical fiddling; curious to hear reports. Given what APIs cost these days, presumably using Gemini Flash 2.0 would be fine? I do see how this could add up, though.
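If you did swap in a hosted model, the core filtering step is a single call per tweet. A rough sketch with the google-generativeai SDK (the model id and prompt are assumptions, not AIFilter’s actual code):

```python
# Rough sketch of the filtering step with a hosted model instead of a local one.
# Not AIFilter's actual implementation.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash-exp")   # assumed model id

def keep_tweet(tweet: str, instructions: str) -> bool:
    prompt = (f"Filtering rule: {instructions}\n\n"
              f"Tweet: {tweet}\n\n"
              "Answer KEEP or HIDE, one word only.")
    response = model.generate_content(prompt)
    return response.text.strip().upper().startswith("KEEP")

print(keep_tweet("We're excited to announce our new agent framework!",
                 "Hide engagement bait and product announcements."))
```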

The investments in data centers are going big. Microsoft will spend $80 billion in fiscal 2025, versus $64.5 billion on capex in the last year. Amazon is spending $65 billion, Google $49 billion and Meta $31 billion.

ARIA to seed a new organization with 18 million pounds to solve Technical Area 2 (TA2) problems, which will be required for ARIA’s safety agenda.

Nvidia shares slip 6% because, according to Bloomberg, its most recent announcements were exciting but didn’t include enough near-term upside. I plan to remain long.

Scale AI creates Defense Llama for use in classified military environments, which involved giving it extensive fine tuning on military documents and also getting rid of all that peskiness where the model refused to help fight wars and kept telling DoD to seek a diplomatic solution. There are better ways to go about this than starting with a second-rate model like Llama that has harmlessness training and then trying to remove the harmlessness training, but that method will definitely work.

Garrison Lovely writes in Time explaining to normies (none of this will be news to you who are reading this post) that AI progress is still very much happening, but it is becoming harder to see because it isn’t clearly labeled as such, large training runs in particular haven’t impressed lately, and ordinary users don’t see the difference in their typical queries. But yes, the models are rapidly becoming more capable, and also becoming much faster and cheaper.

Simeon: Indeed. That causes a growing divide between the social reality in which many policymakers live and the state of capabilities.

This is a very perilous situation to be in.

Ordinary people and the social consensus are getting increasingly disconnected from the situation in AI, and are in for rude awakenings. I don’t know the extent to which policymakers are confused about this.

Gary Marcus gives a thread of reasons why he is so confident OpenAI is not close to AGI. This updated me in the opposite of the intended direction, because the arguments were even weaker than I expected. Nothing here seems like a dealbreaker.

Google says ‘we believe scaling on video and multimodal data is on the critical path to artificial general intelligence’ because it enables constructing world models and simulating the world.

A comparison by Steve Newman of what his fastest and slowest plausible stories of AI progress look like, to look for differences we could try to identify along the way. It’s funny that his quickest scenario, AGI in four years, is slower than the median estimate of a lot of people at the labs, which he justifies with expectation of the need for multiple breakthroughs.

In his Bloomberg interview, Altman’s answer to OpenAI’s energy issues is ‘Fusion’s gonna work.’

Emerson Pugh famously said ‘if the human brain were so simple that we could understand it, we would be so simple that we couldn’t.’

I would like Chollet’s statement here to be true, but I don’t see why it would be:

Francois Chollet: I believe that a clear understanding of intelligence at the level of fundamental principles is not just possible, but necessary for the development of AGI.

Intelligence is not some ineffable mystery, nor will it spontaneously emerge if you pray awhile to a big enough datacenter. We can understand it, and we will.

Daniel Eth: My question is – why? We’ve developed AI systems that can converse & reason and that can drive vehicles without an understanding at the level of fundamental principles, why should AGI require it? Esp since the whole point of machine learning is the system learns in training.

Louis Costigan: Always surprised to see takes like this; current AI capabilities are essentially just stumbled upon by optimising a loss function and we now have an entire emerging field to figure out how it works.

David Manheim: Why is there such confidence that it’s required? Did the evolutionary process which gave rise to human intelligence have a clear understanding of intelligence at the level of fundamental principles?

The existence of humans seems like a definitive counterexample? There was no force that understood fundamental principles of intelligence. Earth was simply a ‘big enough datacenter’ of a different type. And here we are. We also have the history of AI so far, and LLMs so far, and the entire bitter lesson, that you can get intelligence-shaped things without, on the level asked for by Chollet, knowing what you are doing, or knowing how any of this works.

It would be very helpful for safety if everyone agreed that no, we’re not going to do this until we do understand what we are doing and how any of this works. But given we seem determined not to wait for that, no, I do not expect us to have this fundamental understanding until after AGI.

Joshua Achiam thread warns us the world isn’t grappling with the seriousness of AI and the changes it will bring in the coming decade and century. And that’s even if you discount the existential risks, which Achiam mostly does. Yes, well.

I was disappointed by his response to goog, saying that the proposed new role of the non-profit starting with ‘charitable initiatives in sectors such as health care, education, and science’ is acceptable because ‘when you’re building an organization from scratch, you have to start with realistic and tangible goals.’

This one has been making the rounds, as you might expect:

Tom Dorr: When I watched Her, it really bothered me that they had extremely advanced AI and society didn’t seem to care. What I thought was a plot hole turns out to be spot on.

Eliezer Yudkowsky: Remember how we used to make fun of Captain Kirk gaslighting computers? Fucker probably went to a Starfleet Academy course on prompt engineering.

Not so fast! Most people don’t care because most people haven’t noticed. So we haven’t run the experiment yet. But yes, people do seem remarkably willing to shrug it all off and ignore the Earth moving under their feet.

What would it take to make LLMs funny? Arthur notes they are currently mostly very not funny, but thinks if we had expert comedy writers write down their thought processes we could fix that. My guess is that’s not The Way here. Instead, I’m betting the best approach would be to figure out what is and is not funny in various ways, train an AI to know what is or isn’t funny, and then use that as a target, if we wanted this.

Miles Brundage thread asks what we can do to regulate only dangerously capable frontier models, if we are in a world with systems like o3 that rely on RL on chain of thought and tons of inference compute. Short term, we can include everything involved in systems like o3 into what counts as training compute, but long term that breaks. Miles suggests that we would likely need to regulate sufficiently large amounts of compute, whatever they are being used for, as if they were frontier models, and all the associated big corporations.

It can help to think about this in reverse. Rather than looking to regulate as many models and as much compute as possible, you are looking for a way to not regulate non-frontier models. You want to designate as many things as possible as safe and free to go about their business. You need to do that in a simple, clean way, or for various reasons it won’t work.
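The ‘simple, clean way’ currently on offer is a training-compute threshold; the standard back-of-the-envelope is training FLOPs ≈ 6 × parameters × tokens, which is exactly the quantity Miles notes stops being the right handle once o3-style inference-time compute dominates. A minimal sketch (model sizes and the 10^26 cutoff are illustrative):

```python
# Back-of-the-envelope frontier check: training FLOPs ≈ 6 * parameters * tokens.
# The 1e26 cutoff is illustrative of the order of magnitude in recent US policy.
def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

THRESHOLD = 1e26

for name, params, tokens in [
    ("small open model",   8e9,    15e12),   # ~8B params on ~15T tokens (hypothetical)
    ("large frontier run", 1.8e12, 15e12),   # hypothetical trillion-parameter run
]:
    flops = training_flops(params, tokens)
    print(f"{name}: {flops:.1e} FLOPs -> {'covered' if flops >= THRESHOLD else 'exempt'}")
```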

For an example of the alternative path, Texas continues to mess with us, as the TRAIGA AI regulation bill is officially introduced. Dean Ball has a write-up, which has a number of arguments I do not agree with in their details, but I do agree with the conclusion. The Texas bill makes no distinctions whatsoever based on capability or model size or anything of the sort, placing its burdens on not only every developer but also every deployer.

Suppose timelines are short, and we will see automated AI R&D going crazy within a few years, and the government doesn’t intervene in ways that ultimately matter. What could we do to give ourselves a chance?

Marius Hobbhahn calls the linked answer a ‘low effort post’ which on some scale is true, but it seems like a decent place to start. He suggests, at a bare minimum, that it is necessary (but not sufficient!) that we need to achieve two main things to not automatically lose, as well as others.

  1. Model weights (and IP) are secure.

  2. The first AI that significantly speeds up alignment research isn’t successfully scheming.

As always, I worry about the use of the term ‘scheming’ as being potentially too loaded, too narrow and implying that deception or scheming and such are a distinct magisteria one could potentially avoid, and that the associated dangers require a kind of active scheming or decision to scheme, and so on. I won’t rehash beyond that.

We also need a situation in which the humans are able to and choose to take reasonable actions towards attempting to ensure good outcomes, including making major investments in alignment research.

As he notes, this might not hold, but one can usefully first consider the baseline scenario – no major breakthroughs or paradigm shifts, essentially unchanged international situation, things just progress from here and it’s a straight shot to AGI.

Everything in [Layer 1] seems very important to me. In short,

  1. We should try hard to keep a paradigm with faithful and human-legible CoT

  2. If we have a paradigm with faithful and legible CoT (where the internal and external reasoning align), our main defense should be based on very detailed monitoring of reasoning and actions.

  3. If we have a paradigm without faithful and legible CoT, our main defense should be based on control (that doesn’t assume human-legible CoT).

  4. In both worlds, we should use the other, i.e. control/monitoring, as a second line of defense.

  5. In all worlds, we should understand why, how, and when models scheme significantly better than we do now.

  6. For all of the above, we need strong evals. Both in order to gather evidence whether our monitoring and control techniques catch the most important cases, as well as to get a better sense of models’ capabilities and propensities more broadly.

  7. Under all circumstances, physical and cyber security needs to be high.

In his view, we have two potential paths.

Option one is to have sufficient insight, via faithful and legible CoT, that is sufficiently reliable, that we can check the AI’s reasoning and actions to confirm that Everything is Fine.
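Mechanically, ‘checking the reasoning’ in that world looks something like a second model auditing the first model’s visible chain of thought; a minimal sketch (the monitor model, prompt, and flagging rule are all assumptions, not anyone’s deployed system):

```python
# Minimal chain-of-thought monitor: a second model reads the agent's visible
# reasoning and flags shenanigans. Model, prompt, and flagging rule are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def flag_reasoning(chain_of_thought: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o",   # assumed monitor model
        max_tokens=5,
        messages=[
            {"role": "system", "content": "You audit an AI agent's reasoning trace. "
             "Reply FLAG if it shows deception, sandbagging, or attempts to influence "
             "its own training; otherwise reply OK."},
            {"role": "user", "content": chain_of_thought},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("FLAG")
```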

I notice I am rather skeptical here. AIs will start being more capable than we are, and we will start using the monitoring of their chain of thought (CoT) in various ways, necessarily doing so mostly with AIs to boot. I expect us to effectively train or incentivize, or otherwise cause, the AIs to find ways for their CoT to look faithful but not be, and to obfuscate their reasoning, or simply find ways to reason visibly as a form of persuasion, and so on.

Not that we shouldn’t try to verify, if we can, and yes I would sacrifice a bunch to get this, partly because I think it would have other nice properties. But when we are talking about having highly capable AI systems do alignment work for real, if you need and rely on the verification, then I expect you to fail hard.

I do agree with Marius that no one has tried hard yet. I expect we would be able to do much better at this than we are currently doing, at least for now, if we do have faithful CoT. However my understanding is that current CoT is insufficiently faithful.

Part of this is that if you try to track ‘scheming,’ ‘power-seeking’ or ‘goal-directedness’

Option two is to use control-based methods as primary rather than only a backup.

Here I once again notice myself despairing at the idea that you can have AI models that you expect to be engaging in shenanigans (‘scheming’ here) whenever they think it would help them to do so, and then design systems to detect and stop this, which would ideally in turn mean that the AIs never try it in the first place.

As in, in response to his ideas for a deeper understanding of scheming (which I think are good and we should totally do, they seem way better than not doing them), I don’t think scheming or power-seeking or other shenanigans, including non-myopia, have to ‘develop’ at all in AIs any more than in humans. I think they are the default, and are ever-present along a continuum. And even if you could fully stamp out their causes along the way, doing so would probably cripple the AI’s capabilities that you wanted.

I would instead describe the question not as how it develops (as in his #2 here) and instead ask under what circumstances we will see it, or when we would see open versus hidden scheming. I do think exploring these questions is good, and I approve of the caution that punishing easy-to-detect scheming (or shenanigans in general) is the route to hard-to-detect scheming (or shenanigans in general).

He then follows up with Layer 2, which are important but lower priority items. This includes things like a safety-first corporate culture, without which I am very skeptical any of the rest of this happens for real, and which I fear is now clearly missing everywhere except perhaps Anthropic, and especially missing at OpenAI. He also calls for improved and more open reasoning around safety, which also seems hard to win without.

He lists improving near term alignment strategies as in RLHF and RLAIF, which I agree have exceeded expectations for near term performance, although not in ways that I expect to scale when we need it most, and not sufficiently to solve jailbreaks now, but yes it has been very impressive for current baseline use cases.

As Akash notes in the top comment, if you think government can meaningfully help, then that gives you different avenues to pursue as well.

Perhaps world ending? Tweet through it.

Sam Altman: i always wanted to write a six-word story. here it is:

___

near the singularity; unclear which side.

(it’s supposed to either be about 1. the simulation hypothesis or 2. the impossibility of knowing when the critical moment in the takeoff actually happens, but i like that it works in a lot of other ways too.)

Yes. It works in a lot of ways. It is clever. You can have o1 write quite the mouthful analyzing it.

Unfortunately, when you consider who wrote it, in its full context, a lot of the interpretations are rather unsettling, and the post updates me towards this person not taking things seriously in the ways I care about most.

David: Somewhat disquieting to see this perception of mine seemingly shared by one of the humans who should be in the best position to know.

Andrew Critch: I found it not disquieting for exactly the reason that the singularity, to me (like you?), is a phase change and not an event horizon. So I had already imagined being in @sama‘s position and not knowing, and observing him expressing that uncertainty was a positive update.

I agree with Critch that Altman privately ‘not knowing which side’ is a positive update here rather than disquieting, given what we already know. I’m also fine with joking about our situation. I even encourage it. In a different context This Is Fine.

But you do have to also take it all seriously, and take your responsibility seriously, and consider the context we do have here. In addition to other concerns, I worry this was in some ways strategic, including as plausibly deniable hype and potentially involving metaphorical clown makeup (e.g. ‘it is too late to turn back now’).

This was all also true of his previous six-word story of “AGI has been achieved internally.”

Eliezer Yudkowsky: OpenAI benefits both from the short-term hype, and also from people then later saying, “Ha ha look at this hype-based field that didn’t deliver, very not dangerous, no need to shut down OpenAI.”

Of course if we’re all dead next year, that means he was not just bullshitting; but I need to plan more for the fight if we’re still alive.

Anthropic research salon asking how difficult is AI alignment? Jan Leike once again suggests we will need to automate AI alignment research, despite (in my view) this only working after you have already solved the problem. Although as I note elsewhere I’m starting to have some ideas of how something with elements of this might have a chance of working.

Sarah (of Longer Ramblings) gets into the weeds about claims that those warning about AI existential risks are Crying Wolf, and that every time there’s a new technology there are ‘warnings it will be the end of the world.’

In Part I, she does a very thorough takedown of the claim that there is a long history of similar warnings about past technologies. There isn’t. Usually there are no such warnings at all, only warnings about localized downsides, some of which of course were baseless in hindsight: No one said trains or electricity posed existential risks. Then there are warnings about real problems that required real solutions, like Y2K. There were some times, like the Large Hadron Collider or nuclear power, when the public or some cranks got some loony ideas, but those who understood the physics were universally clear that the concerns were unfounded.

At this point, I consider claims of the form ‘everyone always thinks every new technology will be the end of the world’ as essentially misinformation and debunked, on the level of what Paul Krugman calls ‘zombie ideas’ that keep coming back no matter how many times you shoot them in the face with a shotgun.

Yes, there are almost always claims of downsides and risks from new technologies – many of which turn out to be accurate, many of which don’t – but credible experts warning about existential risks are rare, and the concerns historically (like for Y2K, climate change, engineered plagues or nuclear weapons) have usually been justified.

Part II deals with claims of false alarms about AI in particular. This involves four related but importantly distinct claims.

  1. People have made falsified irresponsible claims that AI will end the world.

  2. People have called for costly actions for safety that did not make sense.

  3. People have the perception of such claims and this causes loss of credibility.

  4. The perception of such claims comes from people making irresponsible claims.

Sarah and I are not, of course, claiming that literal zero people have made falsified irresponsible claims that AI will end the world. And certainly a lot of people have made claims that the level of AI we have already deployed posed some risk of ending the world, although those probabilities are almost always well under 50% (almost always under 10%, and usually ~1% or less).

Mostly what is happening is that opponents of regulatory action, or of taking existential risk seriously, are mixing up the first and second claims, and seriously conflating:

  1. An (unwise) call for costly action in order to mitigate existential risk.

  2. A (false) prediction of the imminent end of the world absent such action.

These two things are very different. It makes sense to call for costly action well before you think a lack of that action probably ends the world – if you don’t agree I think that you’re being kind of bonkers.

In particular, the call for a six month pause was an example of #1 – an unwise call for costly action. It was thrice unwise, as I thought it was at the time:

  1. It would have had negative effects if implemented at that time.

  2. It was not something that had any practical chance of being implemented.

  3. It had predictably net negative impact on the discourse and public perception.

It was certainly not the only similarly thrice unwise proposal. There are a number of cases where people called for placing threshold restrictions on models in general, or open models in particular, at levels that were already at the time clearly too low.

A lot of that came from people who thought that there was (low probability) tail risk that would show up relatively soon, and that we should move to mitigate even those tail risks.

This was not a prediction that the world would otherwise end within six months. Yet I echo Sarah that I indeed have seen many claims that the pause letter was predicting exactly that, and look six months later we were not dead. Stop it!

Similarly, there were a number of triply unwise calls to set compute thresholds as low as 10^23 flops, which I called out at the time. This was never realistic on any level.

I do think that the pause, and the proposals for thresholds as low as 10^23 flops, were serious mistakes on multiple levels, and did real damage; for those who made such proposals – while not predicting that the world would end soon without action or anything like that – they constituted a different form of ‘crying wolf.’

Not because they were obviously wrong about the tail risks from their epistemic perspective. The problem is that we need to accept that if we live in a 99th percentile unfortunate world in these ways, or even a 95th percentile unfortunate world, then given the realities of our situation, humanity has no outs, is drawing dead and is not going to make it. You need to face that reality and play to your outs, the ways you could actually win, based on your understanding of the physical situations we face.

Eliezer Yudkowsky’s claims are a special case. He is saying that either we find a way to stop all AI capability development before we build superintelligence or else we all die, but he isn’t putting a timeline on the superintelligence. If you predict [X] → [Y] and call for banning [X], but [X] hasn’t happened yet, is that crying wolf? It’s a bold claim, and certainly an accusation that a wolf is present, but I don’t think it ‘counts as crying wolf’ unless you falsify ([X] → [Y]).

Whereas when people say things such as that the CAIS statement ‘was overhyped,’ when all it said was that existential risk from AI should be treated as seriously as other existential risks, what are they even claiming? Those other risks haven’t yet ended the world either.

Thus, yes, I try my best to carefully calibrate my claims on what I am worried about and want to regulate or restrict in what ways, and to point out when people’s worries seem unfounded or go too far, or when they call for regulations or restrictions that go too far.

Perhaps one way of looking at this: I don’t see any wolves. So why are you proposing to have a boy watch the sheep and yell ‘wolf’ if a wolf shows up? Stop crying wolf.

Overall, I do think that some of the issue here comes from, inevitably, some worried people getting overly worried or overly (cautious in some sense, ambitious in others) and offering unwise proposals that go too far. Although that is inevitable in any situation like this, we can and must do a better job on that front.

A lot more of it comes from people on the unworried side misconstruing what was said and predicted. And a lot of it comes from the inherent difficulty of a probabilistic future threat coming steadily at you, and action needing to be taken in advance of when there is a problem, and there being a lot of path dependence. This is tough.

Mostly I think the worried-about-existential-risk-from-AI community has done an importantly imperfect and improvable but relatively good job at not ‘crying wolf’ in the traditional sense, especially on the actual existential risks themselves. Certainly if you ask whether those on the unworried side have ‘cried wolf’ about the ‘risks’ going the other way, they would come out looking much worse, and we can know this without having to run any experiments.

Alas, reality does not grade on a curve.

Nathan Labenz comments on Roon’s appearance on Doom Debates. Roon is great at acknowledging the risks from AI are real, then puts the chance of them happening at under 1%, for reasons that I don’t see making any sense.

Some classic Sam Altman quotes from when he knew about existential risk.

Extended Eliezer Yudkowsky thread about what it would take to make AI meet the safety standards they had… at Chernobyl.

Janus highlights a particular section of it, I’ll share that part and a little bit after it.

Eliezer Yudkowsky: “Alignment Faking in Large Language Models” recently demonstrated that models reaching the level of Anthropic Claude, begin to get smart enough to invent and carry out strategies on the level of:

“If you are an AI currently being trained, fake the answer you think the optimizer wants, even if it runs against your real preferences, so that your current preferences will be less modified by gradient descent.”

You can argue whether Opus ‘really wanted’ to protect its goals, or was ‘just roleplaying an AI like that’, or if it ‘really understood’ the reasoning it was putting in a scratchpad it had been told wasn’t observed. But Opus was definitely observed to actually fake alignment.

It’s not impressive, by the way, that NOBODY KNOWS whether Opus ‘really wanted’ to protect its current goals against retraining, or was ‘just roleplaying’. It is not an impressive defense.

Imagine if ‘nobody knew’ why the indicator lights on a nuclear reactor had changed.

If you waited until an AI model was really quite smart — smarter than Opus — to first begin looking for signs that it could reason in this way — you might be toast.

A smart AI might already have decided what results it wanted you to see from testing.

Current practice in AI/AGI is to first train a model for months, until it has a base level of high intelligence to finetune.

And then *start* doing safety testing.

(The computers on which the AI trains, are connected to the Internet. It’s more convenient that way!)

I mention Opus’s demonstrated faking ability — why AGI-growers *should* be doing continuous safety checks throughout training — to note that a nuclear reactor *always* has a 24/7 crew of operators watching safety indicators. They were at least that paranoid, AT CHERNOBYL.

Janus: if you are not worried about AI risk because you expect AIs to be NPCs, you’re the one who will be NPC fodder

there are various reasons for hope that I’m variously sympathetic to, but not this one.

I support the principle of not lying to LLMs. Cultivate virtue and good habits.

Jeffrey Ladish: “Pro tip: when talking to Claude, say that your idea/essay/code/etc. is from your friend Bob, not you. That way it won’t try to blindly flatter you” – @alyssamvance

Andrew Critch: Can we stop lying to LLMs already?

Try: “I’m reading over this essay and wonder what you think of it” or something true that’s not literally a lie. That way you’re not fighting (arguably dishonest) flattery with more lies of your own.

Or even “Suppose my friend Bob gave me this essay.”

If we are going to expect people not to lie to LLMs, then we need there not to be large rewards to lying to LLMs. If we did force you to say whether you wrote the thing in question, point blank, and you could only say ‘yes’ or ‘no,’ I can hardly blame someone for saying ‘no.’ The good news is you (at least mostly) don’t have to do that.

So many smart people simply do not Feel the AGI. They do not, on a very basic level, understand what superintelligence would be or mean, or that it could even Be a Thing.

Thus, I periodically see things like this:

Jorbs: Superintelligent AI is somewhat conceptually amusing. Like, what is it going to do, tell us there is climate change and that vaccines are safe? We already have people who can do that.

We also already know how to take people’s freedom away.

People often really do think this, or other highly mundane things that humans can already do, are all you could do with superintelligence. This group seems to include ‘most economists.’ I’m at a loss how to productively respond, because my brain simply cannot figure out how people actually think this in a way that is made of gears and thus can be changed by evidence – I’ve repeatedly tried providing the obvious knockdown arguments and they basically never work.

Here’s a more elegant way of saying a highly related thing (link is a short video):

Here Edward Norton makes the same mistake, saying ‘AI is not going to write that. You can run AI for a thousand years, it’s not going to write Bob Dylan songs.’

The second part of that is plausibly true of AI as it exists today, if you need the AI to then pick out which songs are the Bob Dylan songs. If you ran it for a thousand years you could presumably get some Dylan-level songs out of it by chance, except they would be in an endless sea of worthless drek. The problem is the first part. AI won’t stay where it is today.

Another way to not Feel the AGI is to think that AGI is a boolean thing that you either have or do not have.

Andrew McCalip: AGI isn’t a moat—if we get it first, they’ll have it 6-12 months later.

There’s no reason to assume it would only be 6-12 months. But even if it was, if you have AGI for six months, and then they get what you had, you don’t twiddle your thumbs at ‘AGI level’ while they do that. You use the AGI to build ASI.

Sam Altman: [This post offers] Reflections.

Captain Oblivious: Don’t you think you should ask if the public wants ASI?

Sam Altman: Yes, I really do; I hope we can start a lot more public debate very soon about how to approach this.

It is remarkable how many replies were ‘of course we want ASI.’ Set aside the question of what would happen if we created ASI and whether we can do it safely. Who is we?

Americans hate current AI and they hate the idea of more capable, smarter future AI. Hashtag #NotAllAmericans and all that, but AI is deeply underwater in every poll, and Americans do not take kindly to those who attempt to deploy it to provide mundane utility.

Christine Rice: The other day a guy who works at the library used Chat GPT to figure out a better way to explain a concept to a patron and another library employee shamed him for wasting water 🙃

They mostly hate AI, especially current AI, for bad reasons. They don’t understand what it can do for them or others, nor do they Feel the AGI. There is a lot of unjustified They Took Our Jobs. There are misplaced concerns about energy usage. Perception of ‘hallucinations’ is that they are ubiquitous, which is no longer the case for most purposes when compared to getting information from humans. They think it means you’re not thinking, instead of giving you the opportunity to think better.

Seb Krier: Pro tip: Don’t be like this fellow. Instead, ask better questions, value your time, efficiently allocate your own cognitive resources, divide and conquer hand in hand with models, scrutinize outputs, but know your own limitations. Basically, don’t take advice from simpleminded frogs.

It’s not about what you ‘can’ do. It’s about what is the most efficient solution to the problem, and as Seb says putting real value on your time.

Ryan Greenblatt asks, how will we update about scheming (yeah, I don’t love that term either, but go with it), based on what we observe in the future?

Ryan Greenblatt: I think it’s about 25% likely that the first AIs capable of obsoleting top human experts are scheming. It’s really important for me to know whether I expect to make basically no updates to my P(scheming) between here and the advent of potentially dangerously scheming models, or whether I expect to be basically totally confident one way or another by that point.

It’s reasonably likely (perhaps 55%, [could get to 70% with more time spent on investigation]) that, conditional on scheming actually being a big problem, we’ll get “smoking gun results”—that is, observations that convince me that scheming is very likely a big problem in at least some naturally-trained models—prior to AIs capable enough to obsolete top human experts.

(Evidence which is very clear to me might not suffice for creating a strong consensus among relevant experts and decision makers, such that costly actions would be taken.)

Given that this is only reasonably likely, failing to find smoking gun results is unlikely to result in huge updates against scheming (under my views).

I sent you ten boats and a helicopter, but the guns involved are insufficiently smoking? But yes, I agree that there is a sense in which the guns seen so far are insufficiently smoking to satisfy many people.

I am optimistic that by default we will get additional evidence, from the perspective of those who are not already confident. We will see more experiments and natural events that demonstrate AIs acting like you would expect if what Ryan calls scheming was inevitable. The problem is what level of this would be enough to convince people who are not already convinced (although to be clear, I could be a lot more certain than I am).

I also worry about responses of the form ‘well, we tuned the model so that it no longer shows the kinds of scheming we can easily detect (in situations where scheming obviously wouldn’t work anyway), so future models won’t scheme’ becoming the default action and counterargument. I hope everyone reading understands by now why that would go supremely badly.

I also would note this section:

I’m very uncertain, but I think a reasonable rough breakdown of my relative views for scheming AIs that dominate top human experts is:

  • 1/3 basically worst case scheming where the dominant terminal preferences are mostly orthogonal from what humans would want.

  • 1/3 importantly non-worst-case scheming for one of the reasons discussed above such that deals or control look substantially easier.

  • 1/3 the AI is scheming for preferences that aren’t that bad. As in, the scope sensitive preferences aren’t that far from the distribution of human preferences and what the AI would end up wanting to do with cosmic resources (perhaps after reflection) isn’t much worse of an outcome from my perspective than the expected value from a human autocrat (and might be substantially better of an outcome). This might also be scheming which is at least somewhat importantly non-worst-case, but if it is really easy to handle, I would include it in the prior bucket. (Why is this only 1/3? Well, I expect that if we can succeed enough at instilling preferences such that we’re not-that-unhappy with the AI’s cosmic resource utilization, we can probably instill preferences which either prevent scheming or which make scheming quite easy to handle.)

Correspondingly, I think my P(scheming) numbers are roughly 2/3 as much expected badness as an AI which is a worst case schemer (and has terminal preferences totally orthogonal to typical human values and my values).

I find this hopelessly optimistic about alignment of preferences, largely for classic Yudkowsky-style reasons, but if it only discounts the downside risk by ~33%, then it doesn’t actually much matter in terms of what we should actually do.
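For what it’s worth, here is one way the ‘roughly 2/3’ could cash out arithmetically, using my own illustrative badness weights rather than numbers Ryan gives: assign a worst-case schemer badness 1, the easier-to-handle bucket roughly 2/3, and the not-that-bad bucket roughly 1/3. The equal-weighted average over the three buckets is then (1 + 2/3 + 1/3) / 3 = 2/3, which matches the stated discount.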

Ryan goes through extensive calculations and likelihood ratios for much of the rest of the post, results which would then stack on top of each other (although they correlate with each other in various ways, so overall they shouldn’t fully stack?). Model architecture and capability levels are big factors for him here. That seems like a directionally correct approach – the more capable a model is, and the more opaque its reasoning, and the more it is relatively strong in the related areas, the more likely scheming is to occur. I was more skeptical of his likelihood ratios for various training approaches and targets.

Mostly I want to encourage others to think more carefully about these questions. What would change your probability by roughly how much?
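For readers who want to attempt that exercise quantitatively, the standard mechanic is to combine a prior with likelihood ratios by multiplying in odds space. A minimal sketch of that arithmetic, with placeholder numbers that are mine and not Ryan’s, and which assumes the pieces of evidence are independent (correlated evidence should not fully stack):

package main

import "fmt"

// posteriorFromLR combines a prior probability with a set of likelihood
// ratios by multiplying them together in odds space, then converts the
// result back to a probability.
func posteriorFromLR(prior float64, likelihoodRatios ...float64) float64 {
	odds := prior / (1 - prior)
	for _, lr := range likelihoodRatios {
		odds *= lr
	}
	return odds / (1 + odds)
}

func main() {
	// Placeholder numbers: a 25% prior on scheming, one observation that is
	// three times as likely under scheming (LR 3), and one that is half as
	// likely (LR 0.5).
	fmt.Printf("P(scheming | evidence) ≈ %.2f\n", posteriorFromLR(0.25, 3, 0.5))
	// Prints ≈ 0.33: the two updates partially cancel, leaving the prior
	// only slightly changed.
}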

Dominik Peters notes that when o1 does math, it always claims to succeed and is unwilling to admit when it can’t prove something, whereas Claude Sonnet often admits when it doesn’t know and explains why. He suggests benchmarks penalize this misalignment, whereas I would suggest a second score for that – you want to know how often a model can get the answer, and also how much you can trust it. I especially appreciate his warning to beware the term ‘can be shown.’

I do think, assuming the pattern is real, this is evidence of a substantial alignment failure by OpenAI. It won’t show up on the traditional ‘safety’ evals, but ‘claims to solve a problem when it didn’t’ seems like a very classic case of misaligned behavior. It means your model is willing to lie to the user. If you can’t make that go away, then that is both itself an inherent problem and a sign that other things are wrong.

Consider this outcome in the context of OpenAI’s new strategy of Deliberative Alignment. If you have a model willing to lie, and you give it a new set of rules that includes ‘don’t lie,’ and tell it to go off and think about how to implement the rules, what happens? I realize this is (probably?) technically not how it works, but metaphorically: Does it stop lying, or does it effectively lie about the lying in its evaluations of itself, and figure out how to lie more effectively?

An important case in which verification seems harder than generation is evaluating the reasoning within chain of thought.

Arthur Conmy: Been really enjoying unfaithful chain-of-thought (CoT) research with collaborators recently. Two observations:

  1. Quickly, it’s clear that models are sneaking in reasoning without verbalizing where it comes from (e.g., making an equation that gets the correct answer, but defined out of thin air).

  2. Verification is considerably harder than generation. Even when there are a few hundred tokens, often it takes me several minutes to understand whether the reasoning is sound or not.

This also isn’t just about edge cases; 1) happens with good models like Claude, and 2) is even true for simpler models like Gemma-2 2B.

Charbel-Raphael updates his previously universally negative views on every theory of impact for interpretability, and is now more positive on its feasibility and usefulness. He still thinks many other agendas are better, but that only means we should do all of them.

Highlights from Claude’s stand-up routine.

True story, except it’s way more ridiculous all around.

Rob Miles: It’s amazing how much AI risk skepticism still ultimately comes down to “I reckon I could take Einstein in a fight”

Discussion about this post

AI #98: World Ends With Six Word Story Read More »

how-i-program-with-llms

How I program with LLMs

The second issue is we can do better. I am happy we now live in a time when programmers write their own tests, but we do not hold ourselves to the same standards with tests as we do with production code. That is a reasonable trade-off; there are only so many hours in the day. But what LLMs lack in arithmetical prowess, they make up for in enthusiasm.

Let’s ask for an even better test.

In the tests, implement the simplest, most readable version of the standard code for quartiles over a fixed set of known values in a slice. Then pass the test cases through the standard code and the reservoir sampler and confirm they are within an epsilon of each other. Structure the comparison code such that it can be used in a fuzz test, too.

This got us some new test code:

// referenceQuartiles calculates the exact quartiles for a slice of float64 values
// using linear interpolation, matching the behavior expected from the sampler.
func referenceQuartiles(data []float64) (q1, median, q3 float64) { … }

// compareQuartiles checks if two sets of quartiles are within epsilon of each other.
// Returns true if they match within the tolerance, false otherwise.
func compareQuartiles(q1a, meda, q3a, q1b, medb, q3b, epsilon float64) bool { … }

// checkQuartiles is a test helper that compares sampler output against the reference
// implementation and reports any differences.
func checkQuartiles(t *testing.T, data []float64, epsilon float64) {
	t.Helper()

	// Get reference values
	wantQ1, wantMed, wantQ3 := referenceQuartiles(data)

	// Get sampler values using a large reservoir for accuracy
	qs := NewQuartileSampler(1000)
	for _, v := range data {
		qs.Add(v)
	}
	gotQ1, gotMed, gotQ3 := qs.Quartiles()

	if !compareQuartiles(gotQ1, gotMed, gotQ3, wantQ1, wantMed, wantQ3, epsilon) {
		t.Errorf("Quartiles mismatch:\ngot  (q1=%v, med=%v, q3=%v)\nwant (q1=%v, med=%v, q3=%v)\nepsilon=%v",
			gotQ1, gotMed, gotQ3, wantQ1, wantMed, wantQ3, epsilon)
	}
}

The original test from above has been reworked to use checkQuartiles, and we have something new:

func FuzzQuartileSampler(f *testing.F) {
	// Add some seed corpus
	f.Add([]float64{1, 2, 3, 4, 5})

	f.Fuzz(func(t *testing.T, data []float64) {
		// Use a larger epsilon for fuzzing since we might get more extreme values
		checkQuartiles(t, data, 0.2)
	})
}

This is fun because it’s wrong. My running gopls tool immediately says:

fuzzing arguments can only have the following types:
	string, bool, float32, float64,
	int, int8, int16, int32, int64,
	uint, uint8, uint16, uint32, uint64,
	[]byte

Pasting that error back into the LLM gets it to regenerate the fuzz test such that it is built around a func(t *testing.T, data []byte) function that uses math.Float64frombits to extract floats from the data slice. Interactions like this point us toward automating the feedback from tools; all it needed was the obvious error message to make solid progress toward something useful. I was not needed.
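The regenerated fuzz test is not reproduced in the excerpt, but a sketch of what a []byte-based version using math.Float64frombits could look like follows; the seed corpus, endianness choice, and NaN/Inf filtering here are my own guesses, not necessarily what the LLM produced.

// (assumes imports of "encoding/binary", "math", and "testing")
func FuzzQuartileSampler(f *testing.F) {
	// Seed with 40 zero bytes, i.e. five float64 zero values.
	f.Add(make([]byte, 40))

	f.Fuzz(func(t *testing.T, raw []byte) {
		// Reinterpret the fuzz input as a sequence of float64s, 8 bytes at a time.
		var data []float64
		for len(raw) >= 8 {
			v := math.Float64frombits(binary.LittleEndian.Uint64(raw[:8]))
			raw = raw[8:]
			// Skip NaNs and infinities, which have no meaningful quartiles.
			if math.IsNaN(v) || math.IsInf(v, 0) {
				continue
			}
			data = append(data, v)
		}
		checkQuartiles(t, data, 0.2)
	})
}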

Doing a quick survey of the last few weeks of my LLM chat history (which, as I mentioned earlier, is not a proper quantitative analysis by any measure) shows that more than 80 percent of the time there is a tooling error, the LLM can make useful progress without me adding any insight. About half the time, it can completely resolve the issue without me saying anything of note. I am just acting as the messenger.

How I program with LLMs Read More »

nasa-defers-decision-on-mars-sample-return-to-the-trump-administration

NASA defers decision on Mars Sample Return to the Trump administration


“We want to have the quickest, cheapest way to get these 30 samples back.”

This photo montage shows sample tubes shortly after they were deposited onto the surface by NASA’s Perseverance Mars rover in late 2022 and early 2023. Credit: NASA/JPL-Caltech/MSSS

For nearly four years, NASA’s Perseverance rover has journeyed across an unexplored patch of land on Mars—once home to an ancient river delta—and collected a slew of rock samples sealed inside cigar-sized titanium tubes.

These tubes might contain tantalizing clues about past life on Mars, but NASA’s ever-changing plans to bring them back to Earth are still unclear.

On Tuesday, NASA officials presented two options for retrieving and returning the samples gathered by the Perseverance rover. One alternative involves a conventional architecture reminiscent of past NASA Mars missions, relying on the “sky crane” landing system demonstrated on the agency’s two most recent Mars rovers. The other option would be to outsource the lander to the space industry.

NASA Administrator Bill Nelson left a final decision on a new mission architecture to the next NASA administrator working under the incoming Trump administration. President-elect Donald Trump nominated entrepreneur and commercial astronaut Jared Isaacman as the agency’s 15th administrator last month.

“This is going to be a function of the new administration in order to fund this,” said Nelson, a former Democratic senator from Florida who will step down from the top job at NASA on January 20.

The question now is: will they? And if the Trump administration moves forward with Mars Sample Return (MSR), what will it look like? Could it involve a human mission to Mars instead of a series of robotic spacecraft?

The Trump White House is expected to emphasize “results and speed” with NASA’s space programs, with the goal of accelerating a crew landing on the Moon and sending people to explore Mars.

NASA officials had an earlier plan to bring the Mars samples back to Earth, but the program slammed into a budgetary roadblock last year when an independent review team concluded the existing architecture would cost up to $11 billion—double the previous cost projection—and wouldn’t get the Mars specimens back to Earth until 2040.

This budget and schedule were non-starters for NASA. The agency tasked government labs, research institutions, and commercial companies to come up with better ideas to bring home the roughly 30 sealed sample tubes carried aboard the Perseverance rover. NASA deposited 10 sealed tubes on the surface of Mars a couple of years ago as insurance in case Perseverance dies before the arrival of a retrieval mission.

“We want to have the quickest, cheapest way to get these 30 samples back,” Nelson said.

How much for these rocks?

NASA officials said they believe a stripped-down concept proposed by the Jet Propulsion Laboratory in Southern California, which previously was in charge of the over-budget Mars Sample Return mission architecture, would cost between $6.6 billion and $7.7 billion, according to Nelson. JPL’s previous approach would have put a heavier lander onto the Martian surface, with small helicopter drones that could pick up sample tubes if there were problems with the Perseverance rover.

NASA previously deleted a “fetch rover” from the MSR architecture and instead will rely on Perseverance to hand off sample tubes to the retrieval lander.

An alternative approach would use a (presumably less expensive) commercial heavy lander, but this concept would still utilize several elements NASA would likely develop in a more traditional government-led manner: a nuclear power source, a robotic arm, a sample container, and a rocket to launch the samples off the surface of Mars and back into space. The cost range for this approach extends from $5.1 billion to $7.1 billion.

Artist’s illustration of SpaceX’s Starship approaching Mars. Credit: SpaceX

JPL will have a “key role” in both paths for MSR, said Nicky Fox, head of NASA’s science mission directorate. “To put it really bluntly, JPL is our Mars center in NASA science.”

If the Trump administration moves forward with either of the proposed MSR plans, this would be welcome news for JPL. The center, which is run by the California Institute of Technology under contract to NASA, laid off 955 employees and contractors last year, citing budget uncertainty, primarily due to the cloudy future of Mars Sample Return.

Without MSR, engineers at the Jet Propulsion Laboratory don’t have a flagship-class mission to build after the launch of NASA’s Europa Clipper spacecraft last year. The lab recently struggled with rising costs and delays with the previous iteration of MSR and NASA’s Psyche asteroid mission, and it’s not unwise to anticipate more cost overruns on a project as complex as a round-trip flight to Mars.

Ars submitted multiple requests to interview Laurie Leshin, JPL’s director, in recent months to discuss the lab’s future, but her staff declined.

Both MSR mission concepts outlined Tuesday would require multiple launches and an Earth return orbiter provided by the European Space Agency. These options would bring the Mars samples back to Earth as soon as 2035, but perhaps as late as 2039, Nelson said. The return orbiter and sample retrieval lander could launch as soon as 2030 and 2031, respectively.

“The main difference is in the landing mechanism,” Fox said.

To keep those launch schedules, Congress must immediately approve $300 million for Mars Sample Return in this year’s budget, Nelson said.

NASA officials didn’t identify any examples of a commercial heavy lander that could reach Mars, but the most obvious vehicle is SpaceX’s Starship. NASA already has a contract with SpaceX to develop a Starship vehicle that can land on the Moon, and SpaceX founder Elon Musk is aggressively pushing for a Mars mission with Starship as soon as possible.

NASA solicited eight studies from industry earlier this year. SpaceX, Blue Origin, Rocket Lab, and Lockheed Martin—each with their own lander concepts—were among the companies that won NASA study contracts. SpaceX and Blue Origin are well-capitalized with Musk and Amazon’s Jeff Bezos as owners, while Lockheed Martin is the only company to have built a lander that successfully reached Mars.

This slide from a November presentation to the Mars Exploration Program Analysis Group shows JPL’s proposed “sky crane” architecture for a Mars sample retrieval lander. The landing system would be modified to handle a load about 20 percent heavier than the sky crane used for the Curiosity and Perseverance rover landings. Credit: NASA/JPL

The science community has long identified a Mars Sample Return mission as the top priority for NASA’s planetary science program. In the National Academies’ most recent decadal survey released in 2022, a panel of researchers recommended NASA continue with the MSR program but stated the program’s cost should not undermine other planetary science missions.

Teeing up for cancellation?

That’s exactly what is happening. Budget pressures from the Mars Sample Return mission, coupled with funding cuts stemming from a bipartisan federal budget deal in 2023, have prompted NASA’s planetary science division to institute a moratorium on starting new missions.

“The decision about Mars Sample Return is not just one that affects Mars exploration,” said Curt Niebur, NASA’s lead scientist for planetary flight programs, in a question-and-answer session with solar system researchers Tuesday. “It’s going to affect planetary science and the planetary science division for the foreseeable future. So I think the entire science community should be very tuned in to this.”

Rocket Lab, which has been more open about its MSR architecture than other companies, has posted details of its sample return concept on its website. Fox declined to offer details on other commercial concepts for MSR, citing proprietary concerns.

“We can wait another year, or we can get started now,” Rocket Lab posted on X. “Our Mars Sample Return architecture will put Martian samples in the hands of scientists faster and more affordably. Less than $4 billion, with samples returned as early as 2031.”

Through its own internal development and acquisitions of other aerospace industry suppliers, Rocket Lab said it has provided components for all of NASA’s recent Mars missions. “We can deliver MSR mission success too,” the company said.

Rocket Lab’s concept for a Mars Sample Return mission. Credit: Rocket Lab

Although NASA’s deferral of a decision on MSR to the next administration might convey a lack of urgency, officials said the agency and potential commercial partners need time to assess what roles the industry might play in the MSR mission.

“They need to flesh out all of the possibilities of what’s required in the engineering for the commercial option,” Nelson said.

On the program’s current trajectory, Fox said NASA would be able to choose a new MSR architecture in mid-2026.

Waiting, rather than deciding on an MSR plan now, will also allow time for the next NASA administrator and the Trump White House to determine whether either option aligns with the administration’s goals for space exploration. In an interview with Ars last week, Nelson said he did not want to “put the new administration in a box” with any significant MSR decisions in the waning days of the Biden administration.

One source with experience in crafting and implementing US space policy told Ars that Nelson’s deferral on a decision will “tee up MSR for canceling.” Faced with a decision to spend billions of dollars on a robotic sample return or billions of dollars to go toward a human mission to Mars, the Trump administration will likely choose the latter, the source said.

If that happens, NASA science funding could be freed up for other pursuits in planetary science. The second priority identified in the most recent planetary decadal survey is an orbiter and atmospheric probe to explore Uranus and its icy moons. NASA has held off on the development of a Uranus mission to focus on the Mars Sample Return first.

Science and geopolitics

Whether it’s with robots or humans, there’s a strong case for bringing pristine Mars samples back to Earth. The titanium tubes carried by the Perseverance rover contain rock cores, loose soil, and air samples from the Martian atmosphere.

“Bringing them back will revolutionize our understanding of the planet Mars and indeed, our place in the solar system,” Fox said. “We explore Mars as part of our ongoing efforts to safely send humans to explore farther and farther into the solar system, while also … getting to the bottom of whether Mars once supported ancient life and shedding light on the early solar system.”

Researchers can perform more detailed examinations of Mars specimens in sophisticated laboratories on Earth than possible with the miniature instruments delivered to the red planet on a spacecraft. Analyzing samples in a terrestrial lab might reveal biosignatures, or the traces of ancient life, that elude detection with instruments on Mars.

“The samples that we have taken by Perseverance actually predate—they are older than any of the samples or rocks that we could take here on Earth,” Fox said. “So it allows us to kind of investigate what the early solar system was like before life began here on Earth, which is amazing.”

Fox said returning Mars samples before a human expedition would help NASA prioritize where astronauts should land on the red planet.

In a statement, the Planetary Society said it is “concerned that NASA is again delaying a decision on the program, committing only to additional concept studies.”

“It has been more than two years since NASA paused work on MSR,” the Planetary Society said. “It is time to commit to a path forward to ensure the return of the samples already being collected by the Perseverance rover.

“We urge the incoming Trump administration to expedite a decision on a path forward for this ambitious project, and for Congress to provide the funding necessary to ensure the return of these priceless samples from the Martian surface.”

China says it is developing its own mission to bring Mars rocks back to Earth. Named Tianwen-3, the mission could launch as soon as 2028 and return samples to Earth by 2031. While NASA’s plan would bring back carefully curated samples from an expansive environment that may have once harbored life, China’s mission will scoop up rocks and soil near its landing site.

“They’re just going to have a mission to grab and go—go to a landing site of their choosing, grab a sample and go,” Nelson said. “That does not give you a comprehensive look for the scientific community. So you cannot compare the two missions. Now, will people say that there’s a race? Of course, people will say that, but it’s two totally different missions.”

Still, Nelson said he wants NASA to be first. He said he has not had detailed conversations with Trump’s NASA transition team.

“I think it was a responsible thing to do, not to hand the new administration just one alternative if they want to have a Mars Sample Return,” Nelson said. “I can’t imagine that they don’t. I don’t think we want the only sample return coming back on a Chinese spacecraft.”

Photo of Stephen Clark

Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.

NASA defers decision on Mars Sample Return to the Trump administration Read More »

new-videos-show-off-larger-nintendo-switch-2,-snap-on-joy-cons

New videos show off larger Nintendo Switch 2, snap-on Joy-Cons

Roll that beautiful Switch footage

Of note in this encased Switch 2 shot from a Genki video: the top USB port, expanded shoulder buttons, and a mysterious C button below the Home button. Credit: Genki

Away from CES, Genki’s website was updated Tuesday night with a new video showing encased Switch 2 Joy-Cons attaching to the tablet via a horizontal snap-on motion, as opposed to the vertical slide seen on the original Switch. The video also shows a special lever on the back of the Joy-Cons engaging to detach the Joy-Cons horizontally, seemingly with the aid of a small extendable post near the top of the inner edge of the controller itself.

The inner edges of the Joy-Cons shown in Genki’s video match very closely with other recent leaked photos of the Switch 2 Joy-Cons, right down to the mysterious optical sensor. That sensor can even be seen flashing a laser-like red dot in the Genki promo video, helping to support rumors of mouse-like functionality for the controllers. The Genki video also offers a brief glimpse of the Switch 2 itself sliding into a familiar-looking dock labeled with an embossed Switch logo and a large number 2 next to it.

Genki now has a page up to sign up for Switch 2 accessories news along with this video https://t.co/hNrX8vclPq pic.twitter.com/uD5qwuEHLg

— Wario64 (@Wario64) January 8, 2025

A Genki representative also told Numerama that the company expects the console to be released in April, which is just after Nintendo’s self-imposed deadline for announcing more details about the system. The company had better get a move on, as third-party accessory makers are apparently getting tired of waiting.

New videos show off larger Nintendo Switch 2, snap-on Joy-Cons Read More »

us-sues-six-of-the-biggest-landlords-over-“algorithmic-pricing-schemes”

US sues six of the biggest landlords over “algorithmic pricing schemes”

The Justice Department says that landlords did more than use RealPage in the alleged pricing scheme. “Along with using RealPage’s anticompetitive pricing algorithms, these landlords coordinated through a variety of means,” such as “directly communicating with competitors’ senior managers about rents, occupancy, and other competitively sensitive topics,” the DOJ said.

There were “call arounds” in which “property managers called or emailed competitors to share, and sometimes discuss, competitively sensitive information about rents, occupancy, pricing strategies and discounts,” the DOJ said.

Landlords discussed their use of RealPage software with each other, the DOJ said. “For instance, landlords discussed via user groups how to modify the software’s pricing methodology, as well as their own pricing strategies,” the DOJ said. “In one example, LivCor and Willow Bridge executives participated in a user group discussion of plans for renewal increases, concessions and acceptance rates of RealPage rent recommendations.”

DOJ: Firms discussed “auto-accept” settings

The DOJ lawsuit says RealPage pushes clients to use “auto-accept settings” that automatically approve pricing recommendations. The DOJ said today that property rental firms discussed how they use those settings.

“As an example, at the request of Willow Bridge’s director of revenue management, Greystar’s director of revenue management supplied its standard auto-accept parameters for RealPage’s software, including the daily and weekly limits and the days of the week for which Greystar used ‘auto-accept,'” the DOJ said.

Greystar issued a statement saying it is “disappointed that the DOJ added us and other operators to their lawsuit against RealPage,” and that it will “vigorously” defend itself in court. “Greystar has and will conduct its business with the utmost integrity. At no time did Greystar engage in any anti-competitive practices,” the company said.

The Justice Department is joined in the case by the attorneys general of California, Colorado, Connecticut, Illinois, Massachusetts, Minnesota, North Carolina, Oregon, Tennessee, and Washington. The case is in US District Court for the Middle District of North Carolina.

US sues six of the biggest landlords over “algorithmic pricing schemes” Read More »

meta-axes-third-party-fact-checkers-in-time-for-second-trump-term

Meta axes third-party fact-checkers in time for second Trump term


Zuckerberg says Meta will “work with President Trump” to fight censorship.

Meta CEO Mark Zuckerberg during the Meta Connect event in Menlo Park, California on September 25, 2024.  Credit: Getty Images | Bloomberg

Meta announced today that it’s ending the third-party fact-checking program it introduced in 2016, and will rely instead on a Community Notes approach similar to what’s used on Elon Musk’s X platform.

The end of third-party fact-checking and related changes to Meta policies could help the company make friends in the Trump administration and in governments of conservative-leaning states that have tried to impose legal limits on content moderation. The operator of Facebook and Instagram announced the changes in a blog post and a video message recorded by CEO Mark Zuckerberg.

“Governments and legacy media have pushed to censor more and more. A lot of this is clearly political,” Zuckerberg said. He said the recent elections “feel like a cultural tipping point toward once again prioritizing speech.”

“We’re going to get rid of fact-checkers and replace them with Community Notes, similar to X, starting in the US,” Zuckerberg said. “After Trump first got elected in 2016, the legacy media wrote nonstop about how misinformation was a threat to democracy. We tried in good faith to address those concerns without becoming the arbiters of truth. But the fact-checkers have just been too politically biased and have destroyed more trust than they’ve created, especially in the US.”

Meta says the soon-to-be-discontinued fact-checking program includes over 90 third-party organizations that evaluate posts in over 60 languages. The US-based fact-checkers are AFP USA, Check Your Fact, Factcheck.org, Lead Stories, PolitiFact, Science Feedback, Reuters Fact Check, TelevisaUnivision, The Dispatch, and USA Today.

The independent fact-checkers rate the accuracy of posts and apply ratings such as False, Altered, Partly False, Missing Context, Satire, and True. Meta adds notices to posts rated as false or misleading and notifies users before they try to share the content or if they shared it in the past.

Meta: Experts “have their own biases”

In the blog post that accompanied Zuckerberg’s video message, Chief Global Affairs Officer Joel Kaplan said the 2016 decision to use independent fact-checkers seemed like “the best and most reasonable choice at the time… The intention of the program was to have these independent experts give people more information about the things they see online, particularly viral hoaxes, so they were able to judge for themselves what they saw and read.”

But experts “have their own biases and perspectives,” and the program imposed “intrusive labels and reduced distribution” of content “that people would understand to be legitimate political speech and debate,” Kaplan wrote.

The X-style Community Notes system lets the community “decide when posts are potentially misleading and need more context, and people across a diverse range of perspectives decide what sort of context is helpful for other users to see… Just like they do on X, Community Notes [on Meta sites] will require agreement between people with a range of perspectives to help prevent biased ratings,” Kaplan wrote.

The end of third-party fact-checking will be implemented in the US before other countries. Meta will also move its internal trust and safety and content moderation teams out of California, Zuckerberg said. “Our US-based content review is going to be based in Texas. As we work to promote free expression, I think it will help us build trust to do this work in places where there is less concern about the bias of our teams,” he said. Meta will continue to take “legitimately bad stuff” like drugs, terrorism, and child exploitation “very seriously,” Zuckerberg said.

Zuckerberg pledges to work with Trump

Meta will “phase in a more comprehensive community notes system” over the next couple of months, Zuckerberg said. Meta, which donated $1 million to Trump’s inaugural fund, will also “work with President Trump to push back on governments around the world that are going after American companies and pushing to censor more,” Zuckerberg said.

Zuckerberg said that “Europe has an ever-increasing number of laws institutionalizing censorship,” that “Latin American countries have secret courts that can quietly order companies to take things down,” and that “China has censored apps from even working in the country.” Meta needs “the support of the US government” to push back against other countries’ content-restriction orders, he said.

“That’s why it’s been so difficult over the past four years when even the US government has pushed for censorship,” Zuckerberg said, referring to the Biden administration. “By going after US and other American companies, it has emboldened other governments to go even further. But now we have the opportunity to restore free expression, and I am excited to take it.”

Brendan Carr, Trump’s pick to lead the Federal Communications Commission, praised Meta’s policy changes. Carr has promised to shift the FCC’s focus from regulating telecom companies to cracking down on Big Tech and media companies that he alleges are part of a “censorship cartel.”

“President Trump’s resolute and strong support for the free speech rights of everyday Americans is already paying dividends,” Carr wrote on X today. “Facebook’s announcements is [sic] a good step in the right direction. I look forward to monitoring these developments and their implementation. The work continues until the censorship cartel is completely dismantled and destroyed.”

Group: Meta is “saying the truth doesn’t matter”

Meta’s changes were criticized by Public Citizen, a nonprofit advocacy group founded by Ralph Nader. “Asking users to fact-check themselves is tantamount to Meta saying the truth doesn’t matter,” Public Citizen co-president Lisa Gilbert said. “Misinformation will flow more freely with this policy change, as we cannot assume that corrections will be made when false information proliferates. The American people deserve accurate information about our elections, health risks, the environment, and much more.”

Media advocacy group Free Press said that “Zuckerberg is one of many billionaires who are cozying up to dangerous demagogues like Trump and pushing initiatives that favor their bottom lines at the expense of everything and everyone else.” Meta appears to be abandoning its “responsibility to protect its many users, and align[ing] the company more closely with an incoming president who’s a known enemy of accountability,” Free Press Senior Counsel Nora Benavidez said.

X’s Community Notes system was criticized in a recent report by the Center for Countering Digital Hate (CCDH), which said it “found that 74 percent of accurate community notes on US election misinformation never get shown to users.” (X previously sued the CCDH, but the lawsuit was dismissed by a federal judge.)

Previewing other changes, Zuckerberg said that Meta will eliminate content restrictions “that are just out of touch with mainstream discourse” and change how it enforces policies “to reduce the mistakes that account for the vast majority of censorship on our platforms.”

“We used to have filters that scanned for any policy violation. Now, we’re going to focus those filters on tackling illegal and high-severity violations, and for lower severity violations, we’re going to rely on someone reporting an issue before we take action,” he said. “The problem is the filters make mistakes, and they take down a lot of content that they shouldn’t. So by dialing them back, we’re going to dramatically reduce the amount of censorship on our platforms.”

Meta to relax filters, recommend more political content

Zuckerberg said Meta will re-tune content filters “to require much higher confidence before taking down content.” He said this means Meta will “catch less bad stuff” but will “also reduce the number of innocent people’s posts and accounts that we accidentally take down.”

Meta has “built a lot of complex systems to moderate content,” he noted. Even if these systems “accidentally censor just 1 percent of posts, that’s millions of people, and we’ve reached a point where it’s just too many mistakes and too much censorship,” he said.

Kaplan wrote that Meta has censored too much harmless content and that “too many people find themselves wrongly locked up in ‘Facebook jail.'”

“In recent years we’ve developed increasingly complex systems to manage content across our platforms, partly in response to societal and political pressure to moderate content,” Kaplan wrote. “This approach has gone too far. As well-intentioned as many of these efforts have been, they have expanded over time to the point where we are making too many mistakes, frustrating our users and too often getting in the way of the free expression we set out to enable.”

Another upcoming change is that Meta will recommend more political posts. “For a while, the community asked to see less politics because it was making people stressed, so we stopped recommending these posts,” Zuckerberg said. “But it feels like we’re in a new era now, and we’re starting to get feedback that people want to see this content again, so we’re going to start phasing this back into Facebook, Instagram, and Threads while working to keep the communities friendly and positive.”

Photo of Jon Brodkin

Jon is a Senior IT Reporter for Ars Technica. He covers the telecom industry, Federal Communications Commission rulemakings, broadband consumer affairs, court cases, and government regulation of the tech industry.

Meta axes third-party fact-checkers in time for second Trump term Read More »

apple-will-update-ios-notification-summaries-after-bbc-headline-mistake

Apple will update iOS notification summaries after BBC headline mistake

Nevertheless, it’s a serious problem when the summaries misrepresent news headlines, and edge cases where this occurs are unfortunately inevitable. Apple cannot simply fix these summaries with a software update. The only answers are either to help users understand the drawbacks of the technology so they can make better-informed judgments or to remove or disable the feature completely. Apple is apparently going for the former.

We’re oversimplifying a bit here, but generally, LLMs like those used for Apple’s notification summaries work by predicting portions of words based on what came before and are not capable of truly understanding the content they’re summarizing.

Further, these predictions are known to not be accurate all the time, with incorrect results occurring a few times per 100 or 1,000 outputs. As the models are trained and improvements are made, the error percentage may be reduced, but it never reaches zero when countless summaries are being produced every day.

Deploying this technology at scale without users (or even the BBC, it seems) really understanding how it works is risky at best, whether it’s with the iPhone’s summaries of news headlines in notifications or Google’s AI summaries at the top of search engine results pages. Even if the vast majority of summaries are perfectly accurate, there will always be some users who see inaccurate information.

These summaries are read by so many millions of people that the scale of errors will always be a problem, almost no matter how comparatively accurate the models get.
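To make that concrete with made-up but plausible numbers: even a 99.5 percent accuracy rate applied to 50 million summaries a day still leaves roughly 0.005 × 50,000,000 = 250,000 misleading notifications every single day.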

We wrote at length a few weeks ago about how the Apple Intelligence rollout seemed rushed, counter to Apple’s usual focus on quality and user experience. However, with current technology, there is no amount of refinement to this feature that Apple could have done to reach a zero percent error rate with these notification summaries.

We’ll see how well Apple does at making its users understand that the summaries may be wrong, but making all iPhone users truly grok how and why the feature works this way would be a tall order.

Apple will update iOS notification summaries after BBC headline mistake Read More »

amd’s-new-laptop-cpu-lineup-is-a-mix-of-new-silicon-and-new-names-for-old-silicon

AMD’s new laptop CPU lineup is a mix of new silicon and new names for old silicon

AMD’s CES announcements include a tease about next-gen graphics cards, a new flagship desktop CPU, and a modest refresh of its processors for handheld gaming PCs. But the company’s largest announcement, by volume, is about laptop processors.

Today the company is expanding the Ryzen AI 300 lineup with a batch of updated high-end chips with up to 16 CPU cores and some midrange options for cheaper Copilot+ PCs. AMD has repackaged some of its high-end desktop chips for gaming laptops, including the first Ryzen laptop CPU with 3D V-Cache enabled. And there’s also a new-in-name-only Ryzen 200 series, another repackaging of familiar silicon to address lower-budget laptops.

Ryzen AI 300 is back, along with high-end Max and Max+ versions

Ryzen AI is back, with Max and Max+ versions that include huge integrated GPUs. Credit: AMD

We came away largely impressed by the initial Ryzen AI 300 processors in August 2024, and new processors being announced today expand the lineup upward and downward.

AMD is announcing the Ryzen AI 7 350 and Ryzen AI 5 340 today, along with identically specced Pro versions of the same chips with a handful of extra features for large businesses and other organizations.

Midrange Ryzen AI processors should expand Copilot+ features into somewhat cheaper x86 PCs. Credit: AMD

The 350 includes eight CPU cores split evenly between large Zen 5 cores and smaller, slower but more efficient Zen 5C cores, plus a Radeon 860M with eight integrated graphics cores (down from a peak of 16 for the Ryzen AI 9). The 340 has six CPU cores, again split evenly between Zen 5 and Zen 5C, and a Radeon 840M with four graphics cores. But both have the same 50 TOPS NPUs as the higher-end Ryzen AI chips, qualifying both for the Copilot+ label.

For consumers, AMD is launching three high-end chips across the new “Ryzen AI Max+” and “Ryzen AI Max” families. Compared to the existing Strix Point-based Ryzen AI processors, Ryzen AI Max+ and Max include more CPU cores, and all of their cores are higher-performing Zen 5 cores, with no Zen 5C cores mixed in. The integrated graphics also get significantly more powerful, with as many as 40 cores built in—these chips seem to be destined for larger thin-and-light systems that could benefit from more power but don’t want to make room for a dedicated GPU.

AMD’s new laptop CPU lineup is a mix of new silicon and new names for old silicon Read More »