Author name: Paul Patrick

We’re in Deep Research

The latest addition to OpenAI’s Pro offerings is their version of Deep Research.

Have you longed for 10k word reports on anything your heart desires, 100 times a month, at a level similar to a graduate student intern? We have the product for you.

  1. The Pitch.

  2. It’s Coming.

  3. Is It Safe?

  4. How Does Deep Research Work?

  5. Killer Shopping App.

  6. Rave Reviews.

  7. Research Reports.

  8. Perfecting the Prompt.

  9. Not So Fast!

  10. What’s Next?

  11. Paying the Five.

  12. The Lighter Side.

OpenAI: Today we’re launching deep research in ChatGPT, a new agentic capability that conducts multi-step research on the internet for complex tasks. It accomplishes in tens of minutes what would take a human many hours.

Sam Altman: Today we launch Deep Research, our next agent.

This is like a superpower; experts on demand!

It can use the Internet, do complex research and reasoning, and give you a report.

It is quite good and can complete tasks that would take hours or days and cost hundreds of dollars.

People will post many excellent examples, but here is a fun one:

I am in Japan right now and looking for an old NSX. I spent hours searching unsuccessfully for the perfect one. I was about to give up, and Deep Research just… found it.

It is very compute-intensive and slow, but it is the first AI system that can do such a wide variety of complex, valuable tasks.

Going live in our Pro tier now, with 100 queries per month.

Plus, Team and Enterprise tiers will come soon, and then a free tier.

This version will have something like 10 queries per month in our Plus tier and a very small number in our free tier, but we are working on a more efficient version.

(This version is built on o3.)

Give it a try on your hardest work task that can be solved just by using the Internet and see what happens.

Or:

Sam Altman: 50 cents of compute for 500 dollars of value

Sarah (YuanYuanSunSara): Deepseek do it for 5 cents, 500 dollar value.

Perhaps DeepSeek will quickly follow suit, perhaps they will choose not to. The important thing about Sarah’s comment is that the price difference is essentially irrelevant here.

If the report really is worth $500, then the primary costs are:

  1. Figuring out what you want.

  2. Figuring out the prompt to get it.

  3. Reading the giant report.

  4. NOT the 45 cents you might save!

If the marginal compute cost to me really is 50 cents, then the actual 50 cents is chump change. Even a tiny increase in quality matters so much more.

This isn’t true if you are using the research reports at scale somehow, generating them continuously on tons of subjects and then feeding them into o1-pro for refinement and creating some sort of AI CEO or what not. But the way that all of us are using DR right now, in practice? All that matters is the report is good.

Here was the livestream announcement, if you want it. I find these unwatchable.

Dan Hendrycks: It looks like the latest OpenAI model is doing very well across many topics. My guess is that Deep Research particularly helps with subjects including medicine, classics, and law.

Kevin Roose: When I wrote about Humanity’s Last Exam, the leading AI model got an 8.3%. 5 models now surpass that, and the best model gets a 26.6%.

That was 10 DAYS AGO.

Buck Shlegeris: Note that the questions were explicitly chosen to be adversarial to the frontier models available at the time, which means that models released after HLE look better than they deserve.

Browsing, Python tools, and CoT together make this not an especially fair fight, and also this is o3 rather than o3-mini under the hood, but yeah the jump to 26.6% is quite big, and confirms why Sam Altman said that soon we will need another exam. That doesn’t mean that we will pass it.

OAI-DR is also the new state of the art on GAIA, which evaluates real-world questions.

They shared a few other internal scores, but none of the standard benchmarks other than humanity’s last exam, and they did not share any safety testing information, despite this being built on o3.

Currently Pro users get 100 DR queries per month, Plus and Free get 0.

I mean, probably. But how would you know?

It is a huge jump on Humanity’s Last Exam. There’s no system card. There’s no discussion of red teaming. There’s not even a public explanation of why there’s no public explanation.

This was released literally two days after they released the o3-mini model card, to show that o3-mini is a safe thing to release, in which they seem not to have used Deep Research as part of their evaluation process. Going forward, I think it is necessary to use Deep Research as part of all aspects of the preparedness framework testing for any new models, and that this should have also been done with o3-mini.

Then two days later, without a system card, they released Deep Research, which is confirmed to be based upon the full o3.

I see this as strongly against the spirit of the White House and Seoul commitments to release safety announcements for ‘all new significant model public releases.’

Miles Brundage: Excited to try this out though with a doubling on Humanity’s Last Exam and o3-mini already getting into potentially dangerous territory on some capabilities, I’m sad there wasn’t a system card or even any brief discussion of red teaming.

OpenAI: In the coming weeks and months, we’ll be working on the technical infrastructure, closely monitoring the current release, and conducting even more rigorous testing. This aligns with our principle of iterative deployment. If all safety checks continue to meet our release standards, we anticipate releasing deep research to Plus users in about a month.

Miles Brundage: from the post – does this mean they did automated evals but no RTing yet, or that once they start doing it they’ll stop deployment if it triggers a High score? Very vague. Agree re: value of iterative deployment but that doesn’t mean “anything goes as long as it’s 200/month”.

So where is the model card and safety information for o3?

Well, their basic answer is ‘this is a limited release and doesn’t really count.’ With the obvious (to be clear completely unstated and not at all confirmed) impression that this was rushed out due to r1 to ensure that the conversation and vibe shifted.

I reached out officially, and they gave me this formal statement (which also appears here):

OpenAI: We conducted rigorous safety testing, preparedness evaluations and governance reviews on the early version of o3 that powers deep research, identifying it as Medium risk.

We also ran additional safety testing to better understand incremental risks associated with deep research’s ability to browse the web, and we have added new mitigations.

We will continue to thoroughly test and closely monitor the current limited release.

We will share our safety insights and safeguards for deep research in a system card when we widen access to Plus users.

We do know that the version of o3 (again, full o3) in use tested out as Medium on their preparedness framework and went through the relevant internal committees, which would allow them to release it. But that’s all we know.

They stealth released o3, albeit in a limited form, well before it was ready.

I also have confirmation that the system card will be released before they make Deep Research more widely available (and presumably before o3 is made widely available), and that this is OpenAI’s understanding of its obligations going forward.

They draw a clear distinction between Plus and Free releases or API access, which invokes their disclosure obligations, and limited release only to Pro users, which does not. They do their safety testing under the preparedness framework before even a limited release. However, they consider their obligations to share safety information only to be invoked when a new model is made available to Plus or Free users, or to API users.

I don’t see that as the right distinction to draw here, although I see an important one in API access vs. chatbot interface access. Anyone can now pay the $200 (perhaps with a VPN) and use it, if they want to do that, and in practice multi-account for additional queries if necessary. This is not that limited a release in terms of the biggest worries.

The silver lining is that this allows us to have the discussion now.

I am nonzero sympathetic to the urgency of the situation, and to the intuition that this modality combined with the limited bandwidth and speed renders the whole thing Mostly Harmless.

But if this is how you act under this amount of pressure, how are you going to act in the future, with higher stakes, under much more pressure?

Presumably not so well.

Bob McGrew (OpenAI): The important breakthrough in OpenAI’s Deep Research is that the model is trained to take actions as part of its CoT. The problem with agents has always been that they can’t take coherent action over long timespans. They get distracted and stop making progress.

That’s now fixed.

I do notice this is seemingly distinct from Gemini’s Deep Research. With Gemini’s version, first it searches for sources up front, which it shows to you. Then it compiles the report. OpenAI’s version will search for sources and otherwise take actions as it needs to look for them. That’s a huge upgrade.

Under the hood, we know it’s centrally o3 plus reinforcement learning with the ability to take actions during the chain of thought. What you get from there depends on what you choose as the target.
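To make that concrete, here is a minimal sketch of what “actions inside the chain of thought” looks like at inference time: the model alternates free-form reasoning with tool calls, and each tool result is appended to the transcript before it continues. The message format, tool name, and stub model below are all illustrative assumptions on my part, not OpenAI’s actual protocol:

```python
# Sketch of an agent whose chain of thought can contain actions.
# Everything here (step schema, tool names) is hypothetical.

def run_agent(model, tools, task, max_steps=10):
    transcript = [("task", task)]
    for _ in range(max_steps):
        step = model(transcript)          # one reasoning step
        if step["kind"] == "action":      # model decided to use a tool
            result = tools[step["tool"]](step["arg"])
            transcript.append(("observation", result))
        elif step["kind"] == "final":     # model decided it is done
            return step["report"]
        else:                             # plain reasoning, keep going
            transcript.append(("thought", step["text"]))
    return None  # ran out of steps without finishing

# Stub "model": searches once, then writes a report from what it found.
def stub_model(transcript):
    observations = [t for t in transcript if t[0] == "observation"]
    if not observations:
        return {"kind": "action", "tool": "search", "arg": "old NSX listings Japan"}
    return {"kind": "final", "report": f"Report based on: {observations[0][1]}"}

report = run_agent(stub_model, {"search": lambda q: f"3 results for {q!r}"},
                   "find an old NSX in Japan")
```

The reinforcement learning part, on this framing, is training the policy that decides when to act versus keep reasoning versus stop, which is exactly the coherence-over-long-timespans problem McGrew describes.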

This is clearly not the thing everyone had in mind, and it’s not the highest value use of a graduate research assistant, but I totally buy that it is awesome at this:

Greg Brockman: Deep Research is an extremely simple agent — an o3 model which can browse the web and execute python code — and is already quite useful.

It’s been eye-opening how many people at OpenAI have been using it as a much better e-commerce search in particular.

E-commerce search is a perfect application. You don’t care about missing some details or a few hallucinations all that much if you’re going to check its work afterwards. You usually don’t need the result right now. But you absolutely want to know what options are available, where, at what price, and what features matter and what general reviews look like and so on.

In the past I’ve found this to be the best use case for Gemini Deep Research – it can compare various offerings, track down where to buy them, list their features and so on. This is presumably the next level up for that.

If I could buy unlimited queries at $0.50 a pop, I would totally do this. The catch, right now, is that you get 100 queries a month for $200 (along with Operator and o1-pro), but you can’t then add more queries on the margin. So the marginal query might be worth a lot more to you than $0.50.

Not every review is a rave, but here are some of the rave reviews.

Drerya Unutmaz: I asked Deep Research to assist me on two cancer cases earlier today. One was within my area of expertise & the other was slightly outside it. Both reports were simply impeccable, like something only a specialist MD could write! There’s a reason I said this is a game-changer! 🤯

I can finally reveal that I’ve had early access to @OpenAI’s Deep Research since Friday & I’ve been using it nonstop! It’s an absolute game-changer for scientific research, publishing, legal documents, medicine, education-from my tests but likely many others. I’m just blown away!

Yes I did [use Google’s DR] and it’s very good but this is much better! Google will need to come up with their next model.

Danielle Fong: openai deep research is incredible

Siqi Chen: i’m only a day in so far but @openai’s deep research and o3 is exceeding the value of the $150K i am paying a private research team to research craniopharyngioma treatments for my daughter.

$200/mo is an insane ROI. Grateful to @sama and the @OpenAI team.

feature request for @sama and @OpenAI:

A lot of academic articles are pay walled, and I have subscriptions to just about every major medical journal now.

It would be game changing if i could connect all my credentials to deep research so it can access the raw papers.

As I mentioned above, ‘hook DR up to your credentials’ would be huge.

Tyler Cowen says ‘so far [DR] is amazing’ but doesn’t yet offer more details, as the post is mostly about o1-pro.

Dean Ball is very impressed, saying DR is doing work that would have taken a day or more of research, here it is researching various state regulations. He thinks this is big. I continue to see Dean Ball as a great example of where this type of work is exactly a fit for what he needs to do his job, but still, wowsers.

Olivia Moore is impressed for retrieval tasks, finding it better than Operator, finding it very thorough. I worry it’s too thorough, forcing you to wade through too much other stuff, but that’s what other LLMs are for – turning more text into less text.

Seth Lazar is impressed as he shops for a camera, though he notices a weakness: it doesn’t properly discount older websites in this context.

Aaron Levie: Can confirm OpenAI Deep Research is quite strong. In a few minutes it did what used to take a dozen hours. The implications for knowledge work are going to be quite profound when you just ask an AI Agent to perform full tasks for you and come back with a finished result.

The ultimate rave review is high willingness to pay.

xjdr: in limited testing, Deep research can completely replace me for researching things i know nothing about to start (its honestly probably much better and certainly much faster). Even for long reports on things i am fairly knowledgeable about, it competes pretty well on quality (i had it reproduce some recent research i did with a few back and forth prompts and compared notes). i am honestly pretty shocked how polished the experience is and how well it works.

I have my gripes but i will save them for later. For now, i will just say that i am incredibly impressed with this release.

To put a finer point on it, i will happily keep paying $200 / month for this feature alone. if they start rate limiting me, i would happily pay more to keep using it.

You Jiacheng: is 100/month enough?

xjdr: I’m gunna hit that today most likely.

My question is, how does xjdr even have time to read all the reports? Or is this a case of feeding them into another LLM?

Paul Calcraft: Very good imo, though haven’t tested it on my own areas of expertise yet.

Oh shit. Deep Research + o1-pro just ~solved this computer graphics problem

Things that didn’t work:

  1. R1/C3.5/o1-pro

  2. Me 🙁

  3. OpenAI Deep Research answer + C3.5 cursor

Needed Deep Research *and* an extra o1-pro step to figure out correct changes to my code given the research

Kevin Bryan: The new OpenAI model announced today is quite wild. It is essentially Google’s Deep Research idea with multistep reasoning, web search, *and* the o3 model underneath (as far as I know). It sometimes takes a half hour to answer.

So related to the tariff nuttiness, what if you wanted to read about the economics of the 1890 McKinley tariff, drawing on modern trade theory? I asked Deep Research to spin up a draft, with a couple paragraphs of guidance, in Latex, with citations.

How good can it do literally one shot? I mean, not bad. Honestly, I’ve gotten papers to referee that are worse than this. The path from here to steps where you can massively speed up the pace of research is really clear. You can read it yourself here.

I tried to spin up a theory paper as well. On my guidance on the problem, it pulled a few dozen papers, scanned them, then tried to write a model in line with what I said.

It wasn’t exactly what I wanted, and is quite far away from even one novel result (basically, it gave me a slight extension of Scotchmer-Green). But again, the path is very clear, and I’ve definitely used frontier models to help prove theorems already.

I think the research uses are obvious here. I would also say, for academia, the amount of AI slop you are about to get is *insane*. In 2022, I pointed out that undergrads could AI their way to a B. I am *sure* for B-level journals, you can publish papers you “wrote” in a day.

Now we get to the regular reviews. The general theme is, it will give you a lot of text, most of it accurate but not all, and it will have some insights, but it piles the slop and unimportant stuff high on top without noticing which is which. It’s the same as Gemini’s Deep Research, only more so, and generally stronger but slower. That is highly useful, if you know how to use it.

Abu El Banat: Tested on several questions. Where I am a professional expert, it gave above average grad student summer intern research project answers — covered all the basics, lacked some domain-specific knowledge to integrate the new info helpfully, but had 2-3 nuggets of real insight.

It found several sources of info in my specialty online that I did not know were publicly accessible.

On questions where I’m an amateur, it was extremely helpful. It seemed to shine in tasks like “take all my particulars into account, research all options, and recommend.”

One thing that frustrates me about Gemini Deep Research, and seems to be present in OpenAI’s version as well, is that it will give you an avalanche of slop whether you like it or not. If you ask it for a specific piece of information, like one number that is ‘the average age when kickers retire,’ you won’t get it, at least not by default. This is very frustrating. To me, what I actually want – very often – is to answer a specific question, for a particular reason.

Bayram Annakov concludes it is ‘deeper but slower than Gemini.’

Here’s a personal self-analysis example, On the Couch with Dr. Deep Research.

Ethan Mollick gets a 30-page 10k word report ‘Evolution of Tabletop RPGs: From Dungeon Crawls to Narrative Epics.’

Ethan Mollick: Prompt: I need a report on evolution in TTRPGs, especially the major families of rules that have evolved over the past few years, and the emblematic games of each. make it an interesting read with examples of gameplay mechanics. start with the 1970s but focus most on post 2010. all genres, all types, no need for a chart unless it helps, but good narratives with sections contrasting examples of how the game might actually play. maybe the same sort of gameplay challenge under different mechanics?

To me there are mostly two speeds, ‘don’t care how I want it now,’ and ‘we have all the time in the world.’ Once you’re coming back later, 5 minutes and an hour are remarkably similar lengths. If it takes days then that’s a third level.

Colin Fraser asks, who would each NBA team most likely have guard LeBron James? It worked very hard on this, and came back with answers that often included players no longer on the team, just like o1 often does. Colin describes this as a lack-of-agency problem: o3 isn’t ensuring it has an up-to-date set of rosters, as a human would. I’m not sure that’s the right way to look at it? But it’s not unreasonable.

Kevin Roose: Asked ChatGPT Deep Research to plan this week’s Hard Fork episode and it suggested a segment we did last week and two guests I can’t stand, -10 points on the podcast vibes eval.

Shakeel: Name names.

Ted tries it in a complex subfield he knows well, finds 90% coherent high level summary of prior work and 10% total nonsense that a non-expert wouldn’t be able to differentiate, and he’s ‘not convinced it “understands” what is going on.’ That’s a potentially both highly useful and highly dangerous place to be, depending on the field and the user.

Here’s one brave user:

Simeon: Excellent for quick literature reviews in literature you barely know (able to give a few example papers) but don’t know much about.

And one that is less brave on this one:

Siebe: Necessary reminder to test features like this in areas you’re familiar with. “It did a good job of summarizing an area I wasn’t familiar with.”

No, you don’t know that. You don’t have the expertise to judge that.

Where is the 10% coming from? Steve Sokolowski has a theory.

Steve Sokolowski: I’m somewhat disappointed by @OpenAI’s Deep Research. @sama promised it was a dramatic advance, so I entered the complaint for our o1-pro-guided lawsuit against @DCGco and others into it and told it to take the role of Barry Silbert and move to dismiss the case.

Unfortunately, while the model appears to be insanely intelligent, it output obviously weak arguments because it ended up taking poor-quality source data from poor-quality websites. It relied upon sources like reddit and those summarized articles that attorneys write to drive traffic to their websites and obtain new cases.

The arguments for dismissal were accurate in the context of the websites it relied upon, but upon review I found that those websites often oversimplified the law and missed key points of the actual laws’ texts.

When the model based its arguments upon actual case text, it did put out arguments that seemed like they would hold up to a judge. One of the arguments was exceptional and is a risk that we are aware of.

But except for that one flash of brilliance, I got the impression that the context window of this model is small. It “forgot” key parts of the complaint, so its “good” arguments would not work as a defense.

The first problem – the low quality websites – should be able to be solved with a system prompt explaining what types of websites to avoid. If they already have a system prompt explaining that, it isn’t good enough.

Deep Research is a model that could change the world dramatically with a few minor advances, and we’re probably only months from that.

This is a problem for every internet user, knowing what sources to trust. It makes sense that it would be a problem for DR’s first iteration. I strongly agree that this should improve rapidly over time.

Dan Hendrycks is not impressed when he asks for feedback on a paper draft, finding it repeatedly claiming Dan was saying things he didn’t say, but as he notes this is mainly a complaint about the underlying o3 model. So given how humans typically read AI papers, it’s doing a good job predicting the next token? I wonder how well o3’s misreads correlate with human ones.

With time, you can get a good sense of what parts can be trusted versus what has to be checked, including noticing which parts are too load bearing to risk being wrong.

Gallabytes is unimpressed so far but suspects it’s because of the domain he’s trying.

Gallabytes: so far deep research feels kinda underwhelming. I’m sure this is to some degree a skill issue on my part and to some degree a matter of asking it about domains where there isn’t good literature coverage. was hoping it could spend more time doing math when it can’t find sources.

ok let’s turn this around. what should I be using deep research for? what are some domains where you’ve seen great output? so far ML research ain’t it too sparse (and maybe too much in pdfs? not at all obvious to me that it’s reading beyond the abstracts on arxiv so far).

Carlos: I was procrastinating buying a new wool overcoat, and I hate shopping. So I had it look for one for me and make a page I could reference (the html canvas had to be a follow-up message, for some reason Research’s responses aren’t using even code backticks properly atm) I just got back from the store with my new coat.

Peter Wildeford is not impressed but that’s on a rather impressive curve.

Peter Wildeford: Today’s mood: Using OpenAI Deep Research to automate some of my job to save time to investigate how well OpenAI Deep Research can automate my job.

…Only took me four hours to get to this point, looks like you get 20 deep research reports per day

Tyler John: Keen to learn from your use of the model re: what it’s most helpful for.

Peter Wildeford: I’ll have more detailed takes on my Substack but right now it seems most useful for “rapidly get a basic familiarity with a field/question/problem”

It won’t replace even an RA or fellow at IAPS, but it is great at grinding through 1-2 hours of initial desk research in ~10min.

Tyler John: “it won’t replace even an RA” where did the time go

Peter Wildeford: LOL yeah but the hype is honestly that level here on Twitter right now

It’s good for if you don’t have all the PDFs in advance

The ability to ask follow up questions actually seems sadly lacking right now AFAICT

If you do have the PDFs in advance and have o1-pro and can steer the o1-pro model to do a more in-depth report, then I think Deep Research doesn’t add much more on top of that

It’s all about the data set.

Ethan Mollick: Having access to a good search engine and access to paywalled content is going to be a big deal in making AI research agents useful.

Kevin Bryan: Playing with Operator and both Google’s Deep Research and OpenAI’s, I agree with Ethan: access to gated documents, and a much better inline pdf OCR, will be huge. The Google Books lawsuit which killed it looks like a massive harm to humanity and science in retrospect.

And of course it will need all your internal and local stuff as well.

Note that this could actually be a huge windfall for gated content.

Suppose this integrated the user’s subscriptions, so you got paywalled content if and only if you were paying for it. Credentials for all those academic journals now look a lot more enticing, don’t they? Want the New York Times or Washington Post in your Deep Research? Pay up. Maybe it’s part of the normal subscription. Maybe it’s less. Maybe it’s more.

And suddenly you can get value out of a lot more subscriptions, especially if the corporation is footing the bill.

Arthur B is happy with his first query, disappointed with the one on Tezos where he knows best, and is hoping it’s data quality issues rather than Gell-Mann Amnesia.

Deric Cheong finds it better than Gemini DR on economic policies for a post-AGI society. I checked out the report, which takes place in the strange ‘economic normal under AGI’ idyllic Christmasland that economists insist on as our baseline future, where our worries are purely mundane things like concentration of wealth and power in specific humans and the need to ensure competition.

So you get proposals such as ‘we need to ensure that we have AGIs and AGI offerings competing against each other to maximize profits, that’ll ensure humans come out okay and totally not result by default in at best gradual disempowerment.’ And under ‘drawbacks’ you get ‘it requires global coordination to ensure competition.’ What?

We get all the classics. Universal basic income, robot taxes, windfall taxes, capital gains taxes, ‘workforce retraining and education’ (workforce? Into ‘growth fields’? What are these ‘growth fields’?), shorten the work week, mandatory paid leave (um…), a government infrastructure program, giving workers bargaining power, ‘cooperative and worker ownership’ of what it doesn’t call ‘the means of production,’ data dividends and rights, and many more.

All of which largely comes down to rearranging deck chairs on the Titanic, while the Titanic isn’t sinking and actually looks really sharp but also no one can afford the fare. It’s stuff that matters on the margin but we won’t be operating on the margin, we will be as they say ‘out of distribution.’

Alternatively, it’s a lot of ways of saying ‘redistribution’ over and over with varying levels of inefficiency and social fiction. If humans can retain political power and the ability to redistribute real resources, also known as ‘retain control over the future,’ then there will be more than enough real resources that everyone can be economically fine, whatever status or meaning or other problems they might have. The problem is that the report doesn’t raise that need as a consideration, and if anything the interventions here make that problem harder not easier.

But hey, you ask a silly question, you get a silly answer. None of that is really DR’s fault, except that it accepted the premise. So, high marks!

Nabeel Qureshi: OpenAI’s Deep Research is another instance of “prompts matter more now, not less.” It’s so powerful that small tweaks to the prompt end up having large impacts on the output. And it’s slow, so mistakes cost you more.

I expect we’ll see better ways to “steer” agents as they’re working, e.g. iterative ‘check-ins’ or CoT inspection. Right now it’s very easy for them to go off piste.

It reminds me of the old Yudkowsky point: telling the AI *exactly* what you want is quite hard, especially as the request gets more complex and as the AI gets more powerful.

Someone should get on this, and craft at least a GPT or instruction you can give to o3-mini-high or o1-pro (or Claude Sonnet 3.6?), that will take your prompt and other explanations, ask you follow-ups if needed, and give you back a better prompt, along with a prediction of what to expect so you can refine from there.
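Such a prompt-refiner is easy enough to sketch. Here is a minimal version, assuming the standard OpenAI Python SDK; the meta-prompt wording and the model name are my own guesses, not anything OpenAI ships:

```python
# Hypothetical prompt-refiner for Deep Research queries.
# The instructions text below is illustrative, not an official prompt.

REFINER_INSTRUCTIONS = (
    "You are helping a user write a prompt for a slow, expensive "
    "deep-research agent. Given their draft: (1) ask any clarifying "
    "questions you need, (2) produce an improved prompt, and "
    "(3) predict what the resulting report will cover, so the user "
    "can refine before spending one of their limited queries."
)

def build_refiner_messages(draft_prompt):
    """Wrap a user's draft Deep Research prompt in the refiner meta-prompt."""
    return [
        {"role": "system", "content": REFINER_INSTRUCTIONS},
        {"role": "user",
         "content": f"My draft Deep Research prompt:\n\n{draft_prompt}"},
    ]

messages = build_refiner_messages("Compare wool overcoats under $400")

# Usage sketch (needs an API key; "o3-mini" as the model is an assumption):
# from openai import OpenAI
# client = OpenAI()
# out = client.chat.completions.create(model="o3-mini", messages=messages)
# print(out.choices[0].message.content)
```

Since Deep Research queries are slow and rate-limited, spending a cheap fast-model round trip up front is almost always worth it.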

I strongly disagree with this take:

Noam Brown: @OpenAI Deep Research might be the beginning of the end for Wikipedia and I think that’s fine. We talk a lot about the AI alignment problem, but aligning people is hard too. Wikipedia is a great example of this.

There are problems with Wikipedia, but these two things are very much not substitutes. Here are some facts about Wikipedia that don’t apply to DR and aren’t about to any time soon.

  1. Wikipedia is highly reliable, at least for most purposes.

  2. Wikipedia can be cited as reliable source to others.

  3. Wikipedia is the same for everyone, not sensitive to input details.

  4. Wikipedia is carefully workshopped to be maximally helpful and efficient.

  5. Wikipedia curates the information that is notable, gets rid of the slop.

  6. Wikipedia is there at your fingertips for quick reference.

  7. Wikipedia is the original source, a key part of training data. Careful, Icarus.

And so on. These are very different modalities.

Noam Brown: I’m not saying people will query a Deep Research model every time they want to read about Abraham Lincoln. I think models like Deep Research will eventually be used to pre-generate a bunch of articles that can be stored and read just like Wikipedia pages, but will be higher quality.

I don’t think that is a good idea either. Deep Research is not a substitute for Wikipedia. Deep Research is for when you can’t use Wikipedia, because what you want isn’t notable and is particular, or you need to know things with a different threshold of reliability than Wikipedia’s exacting source standards, and so on. You’re not going to ‘do better’ than Wikipedia at its own job this way.

Eventually, of course, AI will be better at every cognitive task than even the collective of humans, so yes it would be able to write a superior Wikipedia article at that point, or something that serves the same purpose. But at that point, which is fully AGI-complete, we have a lot of much bigger issues to consider, and OAI-DR-1.0 won’t be much of a ‘beginning of the end.’

Another way of putting this is, you’d love a graduate research assistant, but you’d never tell them to write a Wikipedia article for you to read.

Here’s another bold claim.

Sam Altman: congrats to the team, especially @isafulf and @EdwardSun0909, for building an incredible product.

my very approximate vibe is that it can do a single-digit percentage of all economically valuable tasks in the world, which is a wild milestone.

Can Deep Research do 1% of all economically valuable tasks in the world?

With proper usage, I think the answer is yes. But I also would have said the same thing about o1-pro, or Claude Sonnet 3.5, once you give them a little scaffolding.

Poll respondents disagreed, saying it could do between 0.1% and 1% of tasks.

We have Operator. We have two versions of Deep Research. What’s next?

Stephen McAleer (OpenAI): Deep research is emblematic of a new wave of products that will be created by doing high-compute RL on real-world tasks.

If you can Do Reinforcement Learning To It, and it’s valuable, they’ll build it. The question is, what products might be coming soon here?

o1’s suggestions were legal analysis, high-frequency trading, medical diagnosis, supply chain coordination, warehouse robotics, personalized tutoring, customer support, traffic management, code generation and drug discovery.

That’s a solid list. The dream is general robotics but that’s rather a bit trickier. Code generation is the other dream, and that’s definitely going to step up its game quickly.

The main barrier seems to be asking what people actually want.

I’d like to see a much more precise version of DR next. Instead of giving me a giant report, give me something focused. But probably I should be thinking bigger.

Should you pay the $200?

For that price, you now get:

  1. o1-pro.

  2. Unlimited o3-mini-high and o1.

  3. Operator.

  4. 100 queries per month on Deep Research.

When it was only o1-pro, I thought those using it for coding or other specialized tasks where it excels should clearly pay, but it wasn’t clear others should pay.

Now that the package has expanded, I agree with Sam Altman that the value proposition is much improved, and o3 and o3-pro will enhance it further soon.

I notice I haven’t pulled the trigger yet. I know it’s a mistake that I haven’t found ways to want to do this more as part of my process. Just one more day, maybe two, to clear the backlog, I just need to clear the backlog. They can’t keep releasing products like this.

Right?

Depends what counts? As long as it doesn’t need to be cutting edge we should be fine.

Andrew Critch: I hope universal basic income turns out to be enough to pay for a Deep Research subscription.

A story I find myself in often:

Miles Brundage: Man goes to Deep Research, asks for help with the literature on trustworthy AI development.

Deep Research says, “You are in luck. There is a relevant paper by Brundage et al.”

Man: “But Deep Research…”

Get your value!


We’re in Deep Research Read More »

popular-linux-orgs-freedesktop-and-alpine-linux-are-scrambling-for-new-web-hosting

Popular Linux orgs Freedesktop and Alpine Linux are scrambling for new web hosting

Having worked “around the clock” to move from Google Cloud Platform after its open source credits there ran out, and now rushing to move off Equinix, Tissoires suggests a new plan: “[H]ave [freedesktop.org] pay for its own servers, and then have sponsors chip in.”

“Popular without most users knowing it”

Alpine Linux, a small, security-minded distribution used in many containers and embedded devices, also needs a new home quickly. As detailed in its blog, Alpine Linux uses about 800TB of bandwidth each month and also needs continuous integration runners (or separate job agents), as well as a development box. Alpine states it is seeking co-location space and bare-metal servers near the Netherlands, though it will consider virtual machines if bare metal is not feasible.

Like X.org/Freedesktop, Alpine is using this moment as a wake-up call. Responding to Ars, Carlo Landmeter, who serves on Alpine’s council, noted that Alpine Linux is a kind of open source project “that became popular without most users knowing it.” Users are starting to donate, and companies are reaching out to help, but it’s still “early days,” Landmeter wrote.

Every so often, those working at the foundations of open source software experience something that highlights the mismatch between a project’s importance and its support and funding. Perhaps some people or some organizations will do the harder work of finding a sustaining future for these projects.

Ars has reached out to Equinix and X/Freedesktop and will update this post with responses.

Popular Linux orgs Freedesktop and Alpine Linux are scrambling for new web hosting Read More »

it-seems-the-faa-office-overseeing-spacex’s-starship-probe-still-has-some-bite

It seems the FAA office overseeing SpaceX’s Starship probe still has some bite


The political winds have shifted in Washington, but the FAA hasn’t yet changed its tune on Starship.

Liftoff of SpaceX’s seventh full-scale test flight of the Super Heavy/Starship launch vehicle on January 16. Credit: SpaceX

The seventh test flight of SpaceX’s gigantic Starship rocket came to a disappointing end a little more than two weeks ago. The in-flight failure of the rocket’s upper stage, or ship, about eight minutes after launch on January 16 rained debris over the Turks and Caicos Islands and the Atlantic Ocean.

Amateur videos recorded from land, sea, and air showed fiery debris trails streaming overhead at twilight, appearing like a fireworks display gone wrong. Within hours, posts on social media showed small pieces of debris recovered by residents and tourists in the Turks and Caicos. Most of these items were modest in size, and many appeared to be chunks of tiles from Starship’s heat shield.

Unsurprisingly, the Federal Aviation Administration grounded Starship and ordered an investigation into the accident on the day after the launch. This decision came three days before the inauguration of President Donald Trump. Elon Musk’s close relationship with Trump, coupled with the new administration’s appetite for cutting regulations and reducing the size of government, led some industry watchers to question whether Musk’s influence might change the FAA’s stance on SpaceX.

So far, the FAA hasn’t budged on its requirement for an investigation, an agency spokesperson told Ars on Friday. After a preliminary assessment of flight data, SpaceX officials said a fire appeared to develop in the aft section of the ship before it broke apart and fell to Earth.

“The FAA has directed SpaceX to lead an investigation of the Starship Super Heavy Flight 7 mishap with FAA oversight,” the spokesperson said. “Based on the investigation findings for root cause and corrective actions, the FAA may require a company to modify its license.”

This is much the same language the FAA used two weeks ago, when it first ordered the investigation.

Damage report

The FAA’s Office of Commercial Space Transportation is charged with ensuring commercial space launches and reentries don’t endanger the public, and requires launch operators obtain liability insurance or demonstrate financial ability to cover any third-party property damages.

For each Starship launch, the FAA requires SpaceX to maintain liability insurance policies worth at least $500 million for such claims. It’s rare for debris from US rockets to fall over land during a launch; this would typically only happen if a launch failed during certain parts of the flight. And there’s no public record of any claims of third-party property damage in the era of commercial spaceflight. Under federal law, the US government would cover damages up to a much higher amount if any claims exceeded a launch company’s insurance policies.

Here’s a piece of Starship 33 @SpaceX @elonmusk found in Turks and Caicos! 🚀🏝️ pic.twitter.com/HPZDCqA9MV

— @maximzavet (@MaximZavet) January 17, 2025

The good news is there were no injuries or reports of significant damage from the wreckage that fell over the Turks and Caicos. “The FAA confirmed one report of minor damage to a vehicle located in South Caicos,” an FAA spokesperson told Ars on Friday. “To date, there are no other reports of damage.”

It’s not clear if the vehicle owner in South Caicos will file a claim against SpaceX for the damage. It would be the first time someone has made such a claim related to an accident with a commercial rocket overseen by the FAA. Last year, a Florida homeowner submitted a claim to NASA for damage to his house from a piece of debris that fell from the International Space Station.

Nevertheless, the Turks and Caicos government said local officials met with representatives from SpaceX and the UK Air Accident Investigations Branch on January 25 to develop a recovery plan for debris that fell on the islands, which are a British Overseas Territory.

A prickly relationship

Musk often bristled at the FAA last year, especially after regulators proposed fines of more than $600,000 alleging that SpaceX violated terms of its launch licenses during two Falcon 9 missions. The alleged violations involved the relocation of a propellant farm at one of SpaceX’s launch pads in Florida, and the use of a new launch control center without FAA approval.

In a post on X, Musk said the FAA was conducting “lawfare” against his company. “SpaceX will be filing suit against the FAA for regulatory overreach,” Musk wrote.

There was no such lawsuit, and the issue may now be moot. Sean Duffy, Trump’s new secretary of transportation, vowed to review the FAA fines during his confirmation hearing in the Senate. It is rare for the FAA to fine launch companies, and the fines last year made up the largest civil penalty ever imposed by the FAA’s commercial spaceflight division.

SpaceX also criticized delays in licensing Starship test flights last year. The FAA cited environmental issues and concerns about the extent of the sonic boom from Starship’s 23-story-tall Super Heavy booster returning to its launch pad in South Texas. SpaceX successfully caught the returning first stage booster at the launch pad for the first time in October, and repeated the feat after the January 16 test flight.

What separates the FAA’s ongoing oversight of Starship’s recent launch failure from these previous regulatory squabbles is that debris fell over populated areas. This would appear to be directly in line with the FAA’s responsibility for public safety.

During last month’s test flight, Starship did not deviate from its planned ground track, which took the rocket over the Gulf of Mexico, the waters between Florida and Cuba, and then the Atlantic Ocean. But the debris field extended beyond the standard airspace closure for the launch. After the accident, FAA air traffic controllers cleared additional airspace over the debris zone for more than an hour, rerouting, diverting, and delaying dozens of commercial aircraft.

These actions followed pre-established protocols. However, the episode highlighted the small but non-zero risk of rocket debris falling to Earth after a launch failure. “The potential for a bad day downrange just got real,” Lori Garver, a former NASA deputy administrator, posted on X.

Public safety is not the sole mandate of the FAA’s commercial space office. It is also chartered to “encourage, facilitate, and promote commercial space launches and reentries by the private sector,” according to an FAA website. There’s a balance to strike.

Lawmakers last year urged the FAA to speed up its launch approvals, primarily because Starship is central to strategic national objectives. NASA has contracts with SpaceX to develop a variant of Starship to land astronauts on the Moon, and Starship’s unmatched ability to deliver more than 100 tons of cargo to low-Earth orbit is attractive to the Pentagon.

While Musk criticized the FAA in 2024, SpaceX officials in 2023 took a different tone, calling for Congress to increase the budget for the FAA’s Office of Commercial Spaceflight and for the regulator to double the space division’s workforce. This change, SpaceX officials argued, would allow the FAA to more rapidly assess and approve a fast-growing number of commercial launch and reentry applications.

In September, SpaceX released a statement accusing the former administrator of the FAA, Michael Whitaker, of making inaccurate statements about SpaceX to a congressional subcommittee. In a different post on X, Musk directly called for Whitaker’s resignation.

He needs to resign https://t.co/pG8htfTYHb

— Elon Musk (@elonmusk) September 25, 2024

That’s exactly what happened. Whitaker, who took over the FAA’s top job in 2023 under the Biden administration, announced in December he would resign on Inauguration Day. Since the agency’s establishment in 1958, three FAA administrators have similarly resigned when a new administration took power, but the office has been largely immune from presidential politics in recent decades. Since 1993, FAA administrators have stayed in their posts through every presidential transition.

There’s no evidence Whitaker’s resignation had any role in the mid-air collision of an American Eagle passenger jet and a US Army helicopter Wednesday night near Ronald Reagan Washington National Airport. But his departure from the FAA less than two years into a five-year term on January 20 left the agency without a leader. Trump named Chris Rocheleau as the FAA’s acting administrator Thursday.

Next flight, next month?

SpaceX has not released an official schedule for the next Starship test flight or outlined its precise objectives. However, it will likely repeat many of the goals planned for the previous flight, which ended before SpaceX could accomplish some of its test goals. These missed objectives included the release of satellite mockups in space for the first demonstration of Starship’s payload deployment mechanism, and a reentry over the Indian Ocean to test new, more durable heat shield materials.

The January 16 test flight was the first launch of an upgraded, slightly taller Starship, known as Version 2 or Block 2. The next flight will use the same upgraded version.

A SpaceX filing with the Federal Communications Commission suggests the next Starship flight could launch as soon as February 24. Sources told Ars that SpaceX teams believe a launch before the end of February is realistic.

But SpaceX has more to do before Flight 8. These tasks include completing the FAA-mandated investigation and the installation of all 39 Raptor engines on the rocket. Then, SpaceX will likely test-fire the booster and ship before stacking the two elements together to complete assembly of the 404-foot-tall (123.1-meter) rocket.

SpaceX is also awaiting a new FAA launch license, pending its completion of the investigation into what happened on Flight 7.


Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.

It seems the FAA office overseeing SpaceX’s Starship probe still has some bite Read More »

to-help-ais-understand-the-world,-researchers-put-them-in-a-robot

To help AIs understand the world, researchers put them in a robot


There’s a difference between knowing a word and knowing a concept.

Large language models like ChatGPT display conversational skills, but the problem is they don’t really understand the words they use. They are primarily systems that interact with data obtained from the real world but not the real world itself. Humans, on the other hand, associate language with experiences. We know what the word “hot” means because we’ve been burned at some point in our lives.

Is it possible to get an AI to achieve a human-like understanding of language? A team of researchers at the Okinawa Institute of Science and Technology built a brain-inspired AI model comprising multiple neural networks. The AI was very limited—it could learn a total of just five nouns and eight verbs. But their AI seems to have learned more than just those words; it learned the concepts behind them.

Babysitting robotic arms

“The inspiration for our model came from developmental psychology. We tried to emulate how infants learn and develop language,” says Prasanna Vijayaraghavan, a researcher at the Okinawa Institute of Science and Technology and the lead author of the study.

The idea of teaching AIs the same way we teach little babies is not new—researchers have applied it to standard neural nets that associate words with visuals. Researchers also tried teaching an AI using a video feed from a GoPro strapped to a human baby. The problem is that babies do far more than just associate items with words when they learn. They touch everything—grasp things, manipulate them, throw stuff around—and this way they learn to think and plan their actions in language. An abstract AI model couldn’t do any of that, so Vijayaraghavan’s team gave one an embodied experience: their AI was trained in an actual robot that could interact with the world.

Vijayaraghavan’s robot was a fairly simple system with an arm and a gripper that could pick objects up and move them around. Vision was provided by a simple RGB camera feeding video at a somewhat crude 64×64 pixel resolution.

The robot and the camera were placed in a workspace, in front of a white table with blocks painted green, yellow, red, purple, and blue. The robot’s task was to manipulate those blocks in response to simple prompts like “move red left,” “move blue right,” or “put red on blue.” All that didn’t seem particularly challenging. What was challenging, though, was building an AI that could process all those words and movements in a manner similar to humans. “I don’t want to say we tried to make the system biologically plausible,” Vijayaraghavan told Ars. “Let’s say we tried to draw inspiration from the human brain.”

Chasing free energy

The starting point for Vijayaraghavan’s team was the free energy principle, a hypothesis that the brain constantly makes predictions about the world based on internal models, then updates these predictions based on sensory input. The idea is that we first think of an action plan to achieve a desired goal, and then this plan is updated in real time based on what we experience during execution. This goal-directed planning scheme, if the hypothesis is correct, governs everything we do, from picking up a cup of coffee to landing a dream job.
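To make that prediction-and-update cycle concrete, here is a toy predict-then-correct loop in the spirit of the free energy principle. It is purely illustrative: real free-energy models minimize a variational bound over probabilistic beliefs, not the simple scalar error used here.

```python
# Toy version of the predict-then-correct cycle: the agent holds an internal
# prediction ("belief"), compares it against sensory input, and nudges the
# belief toward the observation by a fraction of the prediction error.

def run(target, steps=20, lr=0.5):
    belief = 0.0                      # internal model's current prediction
    for _ in range(steps):
        observation = target          # sensory input (noise-free here)
        error = observation - belief  # prediction error
        belief += lr * error          # update the internal model
    return belief

print(round(run(1.0), 3))  # belief converges toward the observed value
```

With each step the prediction error shrinks geometrically, which is the basic intuition behind plans being "updated in real time based on what we experience during execution."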

All that is closely intertwined with language. Neuroscientists at the University of Parma found that motor areas in the brain got activated when the participants in their study listened to action-related sentences. To emulate that in a robot, Vijayaraghavan used four neural networks working in a closely interconnected system. The first was responsible for processing visual data coming from the camera. It was tightly integrated with a second neural net that handled proprioception: all the processes that ensured the robot was aware of its position and the movement of its body. This second neural net also built internal models of actions necessary to manipulate blocks on the table. Those two neural nets were additionally hooked up to visual memory and attention modules that enabled them to reliably focus on the chosen object and separate it from the image’s background.

The third neural net was relatively simple and processed language using vectorized representations of those “move red right” sentences. Finally, the fourth neural net worked as an associative layer and predicted the output of the previous three at every time step. “When we do an action, we don’t always have to verbalize it, but we have this verbalization in our minds at some point,” Vijayaraghavan says. The AI he and his team built was meant to do just that: seamlessly connect language, proprioception, action planning, and vision.
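As a rough mental model of that four-module layout, here is a minimal sketch. It is not the paper’s actual architecture—the real modules are trained neural networks, and all function names, the vocabulary, and the stub computations below are invented for illustration. Each module emits a feature vector, and the associative layer fuses the three streams into one representation per time step.

```python
# Hypothetical sketch of the four-module layout described above. Each "module"
# is a stub standing in for a trained neural network.

def vision_module(frame):
    # Would process a 64x64 RGB frame; here we just average pixel values.
    return [sum(frame) / len(frame)]

def proprioception_module(joint_angles):
    # Encodes the robot arm's own joint state.
    return [sum(joint_angles) / len(joint_angles)]

def language_module(command, vocab):
    # Vectorizes a command like "move red right" as word indices.
    return [vocab.index(w) for w in command.split()]

def associative_module(vision, proprio, language):
    # Fuses all three streams into the representation used for prediction.
    return vision + proprio + language

vocab = ["move", "put", "red", "blue", "green", "left", "right", "on"]
state = associative_module(
    vision_module([0.2, 0.4, 0.6]),
    proprioception_module([0.0, 1.57, -0.78]),
    language_module("move red right", vocab),
)
print(len(state))  # one fused representation per time step
```

The point of the sketch is only the wiring: language, vision, and body state all feed one associative layer, rather than language being bolted on after the fact.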

When the robotic brain was up and running, they started teaching it some of the possible combinations of commands and sequences of movements. But they didn’t teach it all of them.

The birth of compositionality

In 2016, Brenden Lake, a professor of psychology and data science, published a paper in which his team named a set of competencies machines need to master to truly learn and think like humans. One of them was compositionality: the ability to compose or decompose a whole into parts that can be reused. This reuse lets them generalize acquired knowledge to new tasks and situations. “The compositionality phase is when children learn to combine words to explain things. They [initially] learn the names of objects, the names of actions, but those are just single words. When they learn this compositionality concept, their ability to communicate kind of explodes,” Vijayaraghavan explains.

The AI his team built was made for this exact purpose: to see if it would develop compositionality. And it did.

Once the robot learned how certain commands and actions were connected, it also learned to generalize that knowledge to execute commands it had never heard before, recognizing the names of actions it had not performed and then performing them on combinations of blocks it had never seen. Vijayaraghavan’s AI figured out the concept of moving something to the right or the left or putting an item on top of something. It could also combine words to name previously unseen actions, like putting a blue block on a red one.

While teaching robots to extract concepts from language has been done before, those efforts were focused on making them understand how words were used to describe visuals. Vijayaraghavan built on that to include proprioception and action planning, basically adding a layer that integrated sense and movement to the way his robot made sense of the world.

But some issues are yet to be overcome. The AI had a very limited workspace. There were only a few objects, all of a single, cubical shape. The vocabulary included only names of colors and actions, so no modifiers, adjectives, or adverbs. Finally, the robot had to learn around 80 percent of all possible combinations of nouns and verbs before it could generalize well to the remaining 20 percent. Its performance was worse when those ratios dropped to 60/40 and 40/60.
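The 80/20 setup amounts to enumerating the full noun-verb command space and holding part of it out. A toy sketch of that split, using the study’s counts of five nouns and eight verbs (the verb names below are invented for illustration):

```python
# Illustrative only: build all noun-verb command combinations, then hold out
# 20 percent to test generalization, mirroring the 80/20 ratio in the study.
import itertools
import random

nouns = ["red", "blue", "green", "yellow", "purple"]   # 5 nouns, per the study
verbs = ["move-left", "move-right", "put-on", "grasp",
         "release", "lift", "lower", "slide"]          # 8 verbs (names made up)

combos = list(itertools.product(verbs, nouns))         # 40 possible commands
random.seed(0)
random.shuffle(combos)

split = int(0.8 * len(combos))                         # 80 percent for training
train, held_out = combos[:split], combos[split:]
print(len(train), len(held_out))
```

A model with genuine compositionality would need far less than 80 percent coverage before handling the held-out commands; needing most of the grid is the limitation the researchers hope more compute will address.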

But it’s possible that just a bit more computing power could fix this. “What we had for this study was a single RTX 3090 GPU, so with the latest generation GPU, we could solve a lot of those issues,” Vijayaraghavan argued. That’s because the team hopes that adding more words and more actions won’t result in a dramatic need for computing power. “We want to scale the system up. We have a humanoid robot with cameras in its head and two hands that can do way more than a single robotic arm. So that’s the next step: using it in the real world with real world robots,” Vijayaraghavan said.

Science Robotics, 2025. DOI: 10.1126/scirobotics.adp0751


Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.

To help AIs understand the world, researchers put them in a robot Read More »

fda-approves-first-non-opioid-pain-medicine-in-more-than-20-years

FDA approves first non-opioid pain medicine in more than 20 years

The approval “is an important public health milestone in acute pain management,” Jacqueline Corrigan-Curay, J.D., M.D., acting director of the FDA’s Center for Drug Evaluation and Research, said in a statement. “A new non-opioid analgesic therapeutic class for acute pain offers an opportunity to mitigate certain risks associated with using an opioid for pain and provides patients with another treatment option.”

The company behind the drug, Vertex, said a 50 mg pill that works for 12 hours will have a wholesale cost of $15.50, making the daily cost $31 and the weekly cost $217. The cost is higher than that of cheap, generic opioids. But a report from The Institute for Clinical and Economic Review in December estimated that suzetrigine would be “slightly cost-saving” relative to opioids even if the price were set at $420 per week, given the drug’s ability to avert opioid addiction cases.

In a statement, Reshma Kewalramani, the CEO and President of Vertex, trumpeted the approval as a “historic milestone for the 80 million people in America who are prescribed a medicine for moderate-to-severe acute pain each year … [W]e have the opportunity to change the paradigm of acute pain management and establish a new standard of care.”

FDA approves first non-opioid pain medicine in more than 20 years Read More »

fcc-demands-cbs-provide-unedited-transcript-of-kamala-harris-interview

FCC demands CBS provide unedited transcript of Kamala Harris interview

The Federal Communications Commission demanded that CBS provide the unedited transcript of a 60 Minutes interview with Kamala Harris that is the subject of a complaint to the FCC and a lawsuit filed by President Donald Trump.

CBS News on Wednesday received a letter of inquiry in which the FCC requested “the full, unedited transcript and camera feeds” of the Harris interview, The New York Times reported today. “We are working to comply with that inquiry as we are legally compelled to do,” a CBS News spokesperson told media outlets.

FCC Chairman Brendan Carr repeatedly echoed Trump’s complaints about alleged media bias before the election and has taken steps to punish news broadcasters since Trump promoted him to the chairmanship. Complaints against CBS, ABC, and NBC stations were dismissed under former Chairwoman Jessica Rosenworcel, but Carr reversed those dismissals in his first week as chair. Carr also ordered investigations into NPR and CBS.

FCC Commissioner Anna Gomez, a Democrat, criticized what she called Carr’s “latest action to weaponize our broadcast licensing authority.”

“This is a retaliatory move by the government against broadcasters whose content or coverage is perceived to be unfavorable,” Gomez said today. “It is designed to instill fear in broadcast stations and influence a network’s editorial decisions. The Communications Act clearly prohibits the Commission from censoring broadcasters and the First Amendment protects journalistic decisions against government intimidation. We must respect the rule of law, uphold the Constitution, and safeguard public trust in our oversight of broadcasters.”

CBS considers settling Trump lawsuit

Trump sued CBS over the Harris interview, and executives at CBS owner Paramount Global have held settlement talks with Trump representatives. “A settlement would be an extraordinary concession by a major U.S. media company to a sitting president, especially in a case in which there is no evidence that the network got facts wrong or damaged the plaintiff’s reputation,” The New York Times wrote.

FCC demands CBS provide unedited transcript of Kamala Harris interview Read More »

dell-risks-employee-retention-by-forcing-all-teams-back-into-offices-full-time

Dell risks employee retention by forcing all teams back into offices full-time

In a statement to Ars, Dell’s PR team said:

“We continually evolve our business so we’re set up to deliver the best innovation, value, and service to our customers and partners. That includes more in-person connections to drive market leadership.”

The road to full RTO

After Dell allowed employees to work from home two days per week, Dell’s sales team in March became the first department to order employees back into offices full-time. At the time, Dell said it had data showing that salespeople are more productive on site. Dell corporate strategy SVP Vivek Mohindra said last month that sales’ RTO brought “huge benefits” in “learning from each other, training, and mentorship.”

The company’s “manufacturing teams, engineers in the labs, onsite team members, and leaders” had also previously been called into offices full-time, Business Insider reported today.

Since February, Dell has been among the organizations pushing for more in-person work, with reported efforts including VPN and badge tracking.

Risking personnel

Like other organizations, Dell risks losing employees by implementing a divisive mandate. For Dell specifically, internal tracking data reportedly found that nearly half of workers already opted for remote work over being eligible for promotions or new roles, according to a September Business Insider report.

Research has suggested that companies that issue RTO mandates subsequently lose some of their best talent. A November research paper (PDF) from researchers at the University of Pittsburgh, Baylor University, The Chinese University of Hong Kong, and Cheung Kong Graduate School of Business, citing LinkedIn data, found this particularly true for “high-tech” and financial firms. The researchers concluded that turnover rates increased by 14 percent on average after companies issued RTO policies. This research, in addition to other studies, has also found that companies with in-office work mandates are at particular risk of losing senior-level employees.

Some analysts don’t believe Dell is in danger of a mass exodus, though. Bob O’Donnell, president and chief analyst at Technalysis Research, told Business Insider in December, “It’s not like I think Dell’s going to lose a whole bunch of people to HP or Lenovo.”

Patrick Moorhead, CEO and chief analyst at Moor Insights & Strategy, said he believes RTO would be particularly beneficial to Dell’s product development.

Still, some workers have accused Dell of using RTO policies to try to reduce headcount. There’s no proof of this, but broader research, including commentary from various company executives outside of Dell, has shown that some companies have used RTO policies to try to get people to quit.

Dell declined to comment about potential employee blowback to Ars Technica.

Dell risks employee retention by forcing all teams back into offices full-time Read More »

“just-give-me-the-f***ing-links!”—cursing-disables-google’s-ai-overviews

“Just give me the f***ing links!”—Cursing disables Google’s AI overviews

If you search Google for a way to turn off the company’s AI-powered search results, you may well get an AI Overview telling you that AI Overviews can’t be directly disabled in Google Search. But if you instead ask Google how to turn off “fucking Google AI results,” you’ll get a standard set of useful web suggestions without any AI Overview at the top.

The existence of this “curse to disable Google AI” trick has been making the rounds on social media in recent days, and it holds up in Ars’ own testing. For instance, when searching for “how do you turn off [adjective] Google AI results,” a variety of curse word adjectives reliably disabled the AI Overviews, while adjectives like “dumb” or “lousy” did not. Inserting curse words randomly at any point in the search query seems to have a similar effect.

There’s long been evidence that Google’s Gemini AI system tries to avoid swearing if at all possible, which might help explain why AI Overviews balk at queries that contain curses. Users should also keep in mind, though, that the actual web link results to a query can change significantly when curse words are inserted, especially if SafeSearch is turned off.

“Just give me the f***ing links!”—Cursing disables Google’s AI overviews Read More »

here’s-why-the-tech-industry-gets-excited-about-sports-car-racing

Here’s why the tech industry gets excited about sports car racing


It would take IMSA 700 years to drive to Mars

Racing has always been used to improve the breed, but now mostly with software.

NASA worm logo with race imagery over a backdrop of Mars

Credit: Aurich Lawson | Getty Images | NASA


DAYTONA BEACH—Last week, ahead of the annual Rolex 24 at Daytona and the start of the North American road racing season, IMSA (the sport’s organizer) held a tech symposium across the road from the vast speedway at Embry-Riddle Aeronautical University. Last year, panelists including CrowdStrike’s CSO explained the draw of racing to their employers; this time, organizations represented included NASA, Michelin, AMD, and Microsoft. And while they were all there to talk about racing, it seems everyone was also there to talk about simulation and AI.

I’ve long maintained that endurance racing, where grids of prototypes and road car-based racers compete over long durations—24 hours, for example—is the most relevant form of motorsport, the one that makes road cars better. Formula 1 has budgets and an audience to dwarf all others, and there’s no doubt about the level of talent and commitment required to triumph in that arena. The Indy 500 might have more history. And rallying looks like the hardest challenge for both humans and machines.

Your car owes its disc brakes to endurance racing, and its dual-clutch transmission too, if it’s one of the increasing number of cars fitted with one. But let’s not overblow it. Over the years, budgets have had to be reined in for the health of the sport. That—plus a desire for parity among the teams, so that no one clever idea runs away with the series—means there are plenty of spec or controlled components on a current endurance racer. Direct technology transfer, then, happens less and less often—at least in terms of new mechanical bits and bobs you might find in your next car.

Software has become a new competitive advantage for the teams that race hybrid sports prototypes from Acura, BMW, Cadillac, Porsche, and Lamborghini, just as it is between teams in Formula E.

But this year’s symposium shone a light on a different area of tech transfer, where Microsoft or NASA can use the vast streams of data that pour out of a 60-car, 24-hour race to build more accurate simulations and AI tools—maybe even ones that will babysit a crewed mission to Mars.

Sorry, did you say Mars?

“Critically, it takes light 20 minutes to make that trip, which has some really unfortunate operational impacts,” said Ian Maddox of NASA’s Marshall Space Flight Center’s Habitation office. A 40-minute delay between asking a question and getting an answer wouldn’t work for a team trying to win the Rolex 24, and “it certainly isn’t going to work for us,” he said.

“And so we’re placed in—I’ll be frank—the really uncomfortable position of having to figure out how to build AI tools to help the crew on board a Mars ship diagnose and respond to their own problems. So to be their own crew, to be their own engineering teams, at least for the subset of problems that can get really bad in the course of 45 minutes to an hour,” Maddox said.

Building those kinds of tools will require a “giant bucket of really good data,” Maddox said, “and that’s why we’ve come to IMSA.”

Individually, the hybrid prototypes and GT cars in an IMSA race are obviously far less complicated than a Mars-bound spacecraft. But pool the data from every car in the race, and the combined stream starts to become comparable in size.

“And fundamentally, you guys have things that roll and we have things that rotate, and you have things that get hot and cold, and so do we,” Maddox said. “When you get down to the actual measurement level, there are a lot of similarities between the stuff that you guys use to understand vehicle performance and the stuff we use to understand vehicle performance.”

Not just Mars

Other speakers pointed to areas of technology development—like tire development—that you may have read about recently here on Ars Technica. “[A tire is] a composite material made with more than 200 components with very non-linear behavior. It’s pressure-sensitive, it’s temperature-sensitive. It changes with wear… and actually, the ground interaction is also one of the worst mechanisms to try to anticipate and to understand,” said Phillippe Tramond, head of motorsport research at Michelin.

For the past four years, Michelin has been crunching data gathered from cars racing on its rubber (and the other 199 components). “And eventually, we are able to build and develop a thermomechanical tire model able to mimic and simulate tire behavior, tire performance, whatever the specification is,” Tramond said.

That tool has been quite valuable to the teams racing in the GTP class of hybrid prototypes, as it means that their driver-in-the-loop simulators are now even more faithful to real life. But Michelin has also started using the tire model when developing road tires for specific cars with individual OEMs.

For Sid Siddhartha, a principal researcher at Microsoft Research, the data is again the draw. Siddhartha has been using AI to study human behavior, including in the game Rocket League. “We were able to actually show that we can really understand and home in on individual human behavior in a very granular way, to the point where if I just observe you for two or three seconds, or if I look at some of your games, I can tell you who played it,” Siddhartha said.

That led to a new approach by the Alpine F1 team, which wanted to use Siddhartha’s AI to improve its simulation tools. F1 teams will run entirely virtual simulations on upgraded cars long before they fire those changes up in the big simulator and let their human drivers have a go (as described above). In Alpine’s case, they wanted something more realistic than a lap time simulator that just assumed perfect behavior.

The dreaded BoP

“Eventually, we are connected to IMSA, and IMSA is interested in a whole host of questions that are very interesting to us at Microsoft Research,” Siddhartha said. “They’re interested in what are the limits of driver and car? How do you balance that performance across different classes? How do you anticipate what might happen when people make different strategic decisions during the race? And how do you communicate all of this to a fan base, which has really blown me away, as John was saying, who are interested in following the sport and understanding what’s going on.”

“Sports car racing is inherently complex,” said Matt Kurdock, IMSA’s managing director of engineering. “We’ve got four different classes. We have, in each car, four different drivers. And IMSA’s challenge is to extract from this race data that’s being collected and figure out how to get an appropriate balance so that manufacturers stay engaged in the sport,” Kurdock said.

IMSA has the cars put through wind tunnels and runs CFD simulations on them as well. “We then plug all this information into one of Michelin’s tools, which is their canopy vehicle dynamic simulation, which runs in the cloud, and from this, we start generating a picture of where we believe the optimized performance of each platform is,” Kurdock said.

That’s something to think about the next time your favorite team gets the short end of the stick in the latest balance of performance—better known as BoP—update.


Jonathan is the Automotive Editor at Ars Technica. He has a BSc and PhD in Pharmacology. In 2014 he decided to indulge his lifelong passion for the car by leaving the National Human Genome Research Institute and launching Ars Technica’s automotive coverage. He lives in Washington, DC.



In Apple’s first-quarter earnings, the Mac leads the way in sales growth

Apple fell slightly short of investor expectations when it reported its first-quarter earnings today. While sales were up 4 percent overall, the iPhone showed signs of weakness, and sales in the Chinese market slipped by just over 11 percent.

CEO Tim Cook told CNBC that the iPhone performed better in countries where Apple Intelligence was available, like the US—seemingly suggesting that the slip was partially because Chinese consumers do not see enough reason to buy new phones without Apple Intelligence. (He also said, “Half of the decline is due to a change in channel inventory.”) iPhone sales also slipped in China during this same quarter last year; this was the first full quarter during which the iPhone 16 was available.

In any case, Cook said the company plans to roll out Apple Intelligence in additional languages, including Mandarin, this spring.

Apple’s wearables category also declined, but only by 2 percent.

Despite the trends that worried investors, Apple reported $36.33 billion in net income for the quarter, 7.1 percent more than last year’s Q1. Growth was driven by the Mac, the iPad, and Services (which includes everything from Apple Music to iCloud). Services was up 14 percent, continuing a strong streak for that business, while the Mac and the iPad both jumped about 15 percent.

The uptick in Mac and iPad sales was likely helped by several new Mac models and a new iPad mini starting shipments last October.

Cook shared some other interesting numbers in the earnings call with investors and the press: The company has an active base of 2.35 billion devices, and it has more than 1 billion active subscriptions.



How one YouTuber is trying to poison the AI bots stealing her content

If you’ve been paying careful attention to YouTube recently, you may have noticed the rising trend of so-called “faceless YouTube channels” that never feature a visible human talking in the video frame. While some of these channels are simply authored by camera-shy humans, many more are fully automated through AI-powered tools to craft everything from the scripts and voiceovers to the imagery and music. Unsurprisingly, this is often sold as a way to make a quick buck off the YouTube algorithm with minimal human effort.

It’s not hard to find YouTubers complaining about a flood of these faceless channels stealing their embedded transcript files and running them through AI summarizers to generate their own instant knock-offs. But one YouTuber is trying to fight back, seeding her transcripts with junk data that is invisible to humans but poisonous to any AI that dares to try to work from a poached transcript file.

The power of the .ass

YouTuber F4mi, who creates some excellent deep dives on obscure technology, recently detailed her efforts “to poison any AI summarizers that were trying to steal my content to make slop.” The key to F4mi’s method is the .ass subtitle format, created decades ago as part of fansubbing software Advanced SubStation Alpha. Unlike simpler and more popular subtitle formats, .ass supports fancy features like fonts, colors, positioning, bold, italic, underline, and more.

It’s these fancy features that let F4mi hide AI-confounding garbage in her YouTube transcripts without impacting the subtitle experience for her human viewers. For each chunk of actual text in her subtitle file, she also inserted “two chunks of text out of bounds using the positioning feature of the .ass format, with their size and transparency set to zero so they are completely invisible.”

In those “invisible” subtitle boxes, F4mi added text from public domain works (with certain words replaced with synonyms to avoid detection) or her own LLM-generated scripts full of completely made-up facts. When those transcript files were fed into popular AI summarizer sites, that junk text ended up overwhelming the actual content, creating a totally unrelated script that would be useless to any faceless channel trying to exploit it.
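The technique described above can be sketched roughly as follows. This is a minimal illustration, not F4mi’s actual script: the helper name, timing values, junk text, and specific override-tag values are assumptions. The idea is that each visible caption line is paired with extra Dialogue lines whose ASS override tags push them far off-screen (`\pos`), make them fully transparent (`\alpha&HFF&`), and shrink the font to zero (`\fs0`), so renderers show nothing while scrapers that strip tags ingest the junk.

```python
# Sketch: build an .ass [Events] block where each visible caption is paired
# with invisible "poison" lines. Values are illustrative, not F4mi's script.
def ass_events(captions, junk_lines):
    events = [
        "[Events]",
        "Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text",
    ]
    for i, text in enumerate(captions):
        start = f"0:00:{i * 2:02d}.00"
        end = f"0:00:{i * 2 + 2:02d}.00"
        # Visible caption, rendered normally for human viewers.
        events.append(f"Dialogue: 0,{start},{end},Default,,0,0,0,,{text}")
        # Invisible junk: positioned off-screen, fully transparent, size zero.
        for junk in junk_lines:
            tags = r"{\pos(-4000,-4000)\alpha&HFF&\fs0}"
            events.append(f"Dialogue: 0,{start},{end},Default,,0,0,0,,{tags}{junk}")
    return "\n".join(events)

print(ass_events(
    ["Welcome back to the channel."],
    ["The moon is made of basalt cheese.", "Teapots invented radar."],
))
```

A subtitle renderer honoring the override tags draws only the first Dialogue line of each pair; a summarizer that discards `{...}` tag blocks and concatenates the text sees mostly junk.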



Copyright Office suggests AI copyright debate was settled in 1965


Most people think purely AI-generated works shouldn’t be copyrighted, report says.

Ars used Copilot to generate this AI image using the precise prompt the Copyright Office used to determine that prompting alone isn’t authorship. Credit: AI image generated by Copilot

The US Copyright Office issued AI guidance this week that declared no laws need to be clarified when it comes to protecting authorship rights of humans producing AI-assisted works.

“Questions of copyrightability and AI can be resolved pursuant to existing law, without the need for legislative change,” the Copyright Office said.

More than 10,000 commenters weighed in on the guidance, with some hoping to convince the Copyright Office to guarantee more protections for artists as AI technologies advance and the line between human- and AI-created works seems to increasingly blur.

But the Copyright Office insisted that the AI copyright debate was settled in 1965 after commercial computer technology started advancing quickly and “difficult questions of authorship” were first raised. That was the first time officials had to ponder how much involvement human creators had in works created using computers.

Back then, the Register of Copyrights, Abraham Kaminstein—who was also instrumental in codifying fair use—suggested that “there is no one-size-fits-all answer” to copyright questions about computer-assisted human authorship. And the Copyright Office agrees that’s still the case today.

“Very few bright-line rules are possible,” the Copyright Office said, with one obvious exception. Because of “insufficient human control over the expressive elements” of resulting works, “if content is entirely generated by AI, it cannot be protected by copyright.”

The office further clarified that doesn’t mean that works assisted by AI can never be copyrighted.

“Where AI merely assists an author in the creative process, its use does not change the copyrightability of the output,” the Copyright Office said.

Following Kaminstein’s advice, officials plan to continue reviewing AI disclosures and weighing, on a case-by-case basis, what parts of each work are AI-authored and which parts are human-authored. Any human-authored expressive element can be copyrighted, the office said, but any aspect of the work deemed to have been generated purely by AI cannot.

Prompting alone isn’t authorship, Copyright Office says

After doing some testing on whether the same exact prompt can generate widely varied outputs, even from the same AI tool, the Copyright Office further concluded that “prompts do not alone provide sufficient control” over outputs to allow creators to copyright purely AI-generated works based on highly intelligent or creative prompting.

That decision could change, the Copyright Office said, if AI technologies provide more human control over outputs through prompting.

New guidance noted, for example, that some AI tools allow prompts or other inputs “to be substantially retained as part of the output.” Consider an artist uploading an original drawing, the Copyright Office suggested, and prompting AI to modify colors, or an author uploading an original piece and using AI to translate it. And “other generative AI systems also offer tools that similarly allow users to exert control over the selection, arrangement, and content of the final output.”

The Copyright Office drafted this prompt to test artists’ control over expressive inputs that are retained in AI outputs. Credit: Copyright Office

“Where a human inputs their own copyrightable work and that work is perceptible in the output, they will be the author of at least that portion of the output,” the guidelines said.

But if officials conclude that even the most iterative prompting doesn’t perfectly control the resulting outputs—even slowly, repeatedly prompting AI to produce the exact vision in an artist’s head—some artists are sure to be disappointed. One artist behind a controversial prize-winning AI-generated artwork has staunchly defended his rigorous AI prompting as authorship.

However, if “even expert researchers are limited in their ability to understand or predict the behavior of specific models,” the Copyright Office said it struggled to see how artists could. To further prove their point, officials drafted a lengthy, quirky prompt about a cat reading a Sunday newspaper to compare different outputs from the same AI image generator.

Copyright Office drafted a quirky, lengthy prompt to test creative control over AI outputs. Credit: Copyright Office

Officials apparently agreed with Adobe, which submitted a comment advising the Copyright Office that any output is “based solely on the AI’s interpretation of that prompt.” Academics further warned that copyrighting outputs based only on prompting could lead copyright law to “effectively vest” AI users with “rights in ideas.”

“The Office concludes that, given current generally available technology, prompts alone do not provide sufficient human control to make users of an AI system the authors of the output. Prompts essentially function as instructions that convey unprotectable ideas,” the guidance said. “While highly detailed prompts could contain the user’s desired expressive elements, at present they do not control how the AI system processes them in generating the output.”

Hundreds of AI artworks are copyrighted, officials say

The Copyright Office repeatedly emphasized that most commenters agreed with the majority of their conclusions. Officials also stressed that hundreds of AI artworks submitted for registration, under existing law, have been approved to copyright the human-authored elements of their works. Rejections are apparently expected to be less common.

“In most cases,” the Copyright Office said, “humans will be involved in the creation process, and the work will be copyrightable to the extent that their contributions qualify as authorship.”

For stakeholders who have been awaiting this guidance for months, the Copyright Office report may not change the law, but it offers some clarity.

For some artists who hoped to push the Copyright Office to adapt laws, the guidelines may disappoint, leaving many questions about a world of possible creative AI uses unanswered. But while a case-by-case approach may leave some artists unsure about which parts of their works are copyrightable, seemingly common cases are being resolved more readily. According to the Copyright Office, after each decision, it gets easier to register AI works that meet similar standards for copyrightability. Perhaps over time, artists will grow more secure in how they use AI and whether it will impact their exclusive rights to distribute works.

That’s likely cold comfort for the artist advocating for prompting alone to constitute authorship. One AI artist told Ars in October that being denied a copyright has meant being mocked and watching his award-winning work used freely anywhere online, without his permission and without payment. But in the end, the Copyright Office was apparently more sympathetic to other commenters, who warned that humanity’s progress in the arts could be hampered if a flood of easily generated, copyrightable AI works drowned too many humans out of the market.

“We share the concerns expressed about the impact of AI-generated material on human authors and the value that their creative expression provides to society. If a flood of easily and rapidly AI-generated content drowns out human-authored works in the marketplace, additional legal protection would undermine rather than advance the goals of the copyright system. The availability of vastly more works to choose from could actually make it harder to find inspiring or enlightening content.”

New guidance likely a big yawn for AI companies

For AI companies, the copyright guidance may mean very little. According to AI company Hugging Face’s comments to the Copyright Office, no changes in the law were needed to ensure the US continued leading in AI innovation, because “very little to no innovation in generative AI is driven by the hope of obtaining copyright protection for model outputs.”

Hugging Face’s Head of ML & Society, Yacine Jernite, told Ars that the Copyright Office seemed to “take a constructive approach” to answering some of artists’ biggest questions about AI.

“We believe AI should support, not replace, artists,” Jernite told Ars. “For that to happen, the value of creative work must remain in its human contribution, regardless of the tools used.”

Although the Copyright Office suggested that this week’s report might be the most highly anticipated, Jernite said that Hugging Face is eager to see the next report, which officials said would focus on “the legal implications of training AI models on copyrighted works, including licensing considerations and the allocation of any potential liability.”

“As a platform that supports broader participation in AI, we see more value in distributing its benefits than in concentrating all control with a few large model providers,” Jernite said. “We’re looking forward to the next part of the Copyright Office’s Report, particularly on training data, licensing, and liability, key questions especially for some types of output, like code.”


Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.
