Author name: Paul Patrick


Citing EV “rollercoaster” in US, BMW invests in internal combustion

“We anticipated that people wouldn’t want to be discriminated against because of the power train,” Goller said. “We’ve gone the path which others are now following.”

Analysts say BMW is better positioned than rivals to meet the EU’s tougher emissions targets without selling EVs at deep discounts. It is also less exposed to Trump’s tariff war, since 65 percent of its cars sold in the US are built locally and it is a net exporter from the US.

“From an operational standpoint, I think BMW, outside China, is very well placed,” said UBS analyst Patrick Hummel. “They’re pretty much where they need to be in terms of the EV share in the mix.”

Jefferies analyst Philippe Houchois has described BMW, which has in the past drawn criticism from investors for hedging its bets on power train technology, as “the most thoughtful [original equipment manufacturer] over the years.”

This year, the group will launch its Neue Klasse platform for its next generation of EVs, with longer range, faster charging, and upgraded software capabilities, which Houchois said would “consolidate a lead in software-defined vehicles, multi-energy power train, and battery sourcing.”

But China has proved challenging to the Munich-based carmaker. BMW and Mini sales in the world’s largest automotive market fell more than 13 percent last year to 714,530 cars, a more severe slump than rivals such as Mercedes-Benz and Audi.

Analysts at Citigroup have warned that BMW remains vulnerable to China, where intensifying price pressure in an overcrowded market has been forcing carmakers to discount prices. Sliding sales in the country, where BMW still delivers just under a third of its cars, “remains our key concern,” the Citi analysts said.

Goller acknowledged China was unlikely to return to the explosive economic growth that first attracted foreign carmakers to flood into the country.

“But we still see a growing market… and therefore, our ambition is clearly that we want to participate in a growing market,” he said.

Goller added that it shouldn’t come as “a shock” that Chinese brands were rapidly taking domestic market share from foreign carmakers.

“The cars are really good from a technology perspective,” he said. “But we are not afraid.”

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.


Levels of Friction

Scott Alexander famously warned us to Beware Trivial Inconveniences.

When you make a thing easy to do, people often do vastly more of it.

When you put up barriers, even highly solvable ones, people often do vastly less.

Let us take this seriously, and carefully choose what inconveniences to put where.

Let us also take seriously that when AI or other things reduce frictions, or change the relative severity of frictions, various things might break or require adjustment.

This applies to all system design, and especially to legal and regulatory questions.

  1. Levels of Friction (and Legality).

  2. Important Friction Principles.

  3. Principle #1: By Default Friction is Bad.

  4. Principle #3: Friction Can Be Load Bearing.

  5. Insufficient Friction On Antisocial Behaviors Eventually Snowballs.

  6. Principle #4: The Best Frictions Are Non-Destructive.

  7. Principle #8: The Abundance Agenda and Deregulation as Category 1-ification.

  8. Principle #10: Ensure Antisocial Activities Have Higher Friction.

  9. Sports Gambling as Motivating Example of Necessary 2-ness.

  10. On Principle #13: Law Abiding Citizen.

  11. Mundane AI as 2-breaker and Friction Reducer.

  12. What To Do About All This.

There is a vast difference along the continuum, both in legal status and in terms of other practical barriers, as you move between:

  0. Automatic, a default, facilitated, required or heavily subsidized.

  1. Legal, ubiquitous and advertised, with minimal frictions.

  2. Available, mostly safe to get, but we make it annoying.

  3. Actively illegal or tricky, perhaps risking actual legal trouble or big loss of status.

  4. Actively illegal and we will try to stop you or ruin your life (e.g. rape, murder).

  5. We will move the world to stop you (e.g. terrorism, nuclear weapons).

  6. Physically impossible (e.g. perpetual motion, time travel, reading all my blog posts)

The most direct way to introduce or remove frictions is to change the law. This can take the form of prohibitions, regulations and requirements, or of taxes.

One can also alter social norms, deploy new technologies or business models or procedures, or change opportunity costs that facilitate or inhibit such activities.

Or one can directly change things like the defaults on popular software.

Often these interact in non-obvious ways.

It is ultimately a practical question. How easy is it to do? What happens if you try?

If the conditions move beyond annoying and become prohibitive, then you can move things that are nominally legal, such as building houses or letting your kids play outside or even having children at all, into category 3 or even 4.

Here are 14 points that constitute important principles regarding friction:

  1. By default more friction is bad and less friction is good.

  2. Of course there are obvious exceptions (e.g. rape and murder, but not only that).

  3. Activities imposing a cost on others or acting as a signal often rely on friction.

    1. Moving such activities from (#2 or #1) to #0, or sometimes from #2 to #1, can break the incentives that maintain a system or equilibrium.

    2. That does not have to be bad, but adjustments will likely be required.

    3. The solution often involves intentionally introducing alternative frictions.

    4. Insufficient friction on antisocial activities eventually snowballs.

  4. Where friction is necessary, focus on ensuring it is minimally net destructive.

  5. Lower friction choices have a big advantage in being selected.

    1. Pay attention to relative friction, not only absolute friction.

  6. Be very sparing when putting private consensual activities in #3 or especially #4.

    1. This tends to work out extremely poorly and make things worse.

    2. Large net negative externalities to non-participants changes this, of course.

  7. Be intentional about what is in #0 versus #1 versus #2. Beware what norms and patterns this distinction might encourage.

  8. Keep pro-social, useful and productive things in #0 or #1.

  9. Do not let things that are orderly and legible thereby be dragged into #2 or worse, while rival things that are disorderly and illegible become relatively easier.

  10. Keep anti-social, destructive and counterproductive things in at least #2, and at a higher level than pro-social, constructive and productive alternatives.

  11. The ideal form of annoying, in the sense of #2, is often (but not always) a tax, as in increasing the cost, ideally in a way where the value is transferred rather than lost.

  12. Do not move anti-social things to #1 to be consistent or make a quick buck.

  13. Changing the level of friction can change the activity in kind, not only degree.

  14. When it comes to friction, consistency is frequently the hobgoblin of small minds.

It is a game of incentives. You can and should jury-rig it as needed to win.

By default, you want most actions to have lower friction. You want to eliminate the paperwork and phone calls that waste time and fill us with dread, and cause things we ‘should’ do to go undone.

If AI can handle all the various stupid things for me, I would love that.

The problems come when frictions are load bearing. Here are five central causes.

  1. An activity or the lack of an activity is anti-social and destructive. We would prefer it happen less, or not at all, or not expose people to it unless they seek it out first. We want quite a lot of friction standing in the way of things like rape, murder, theft, fraud, pollution, excessive noise, nuclear weapons and so on.

  2. An activity that could be exploited, especially if done ruthlessly at scale. You might for example want to offer a promotional deal or a generous return policy. You might let anyone in the world send you an email or slide into your DMs.

  3. An activity that sends a costly signal. A handwritten thank you note is valuable because it means you were thoughtful and spent the time. Spending four years in college proves you are the type of person who can spend those years.

  4. An activity that imposes costs or allocates a scarce resource. The frictions act as a price, ensuring an efficient or at least reasonable allocation, and guards against people’s time and money being wasted. Literal prices are best, but charging one can be impractical or socially unacceptable, such as when applying for a job.

  5. Removing the frictions from one alternative, when you continue to impose frictions on alternatives, is putting your finger on the scale. Neutrality does not always mean imposing minimal frictions. Sometimes you would want to reduce frictions on [X] only if you also could do so (or had done so) on [Y].

Imposing friction to maintain good incentives or equilibria, either legally or otherwise, is often expensive. Once the crime or other violation already happened, imposing punishment costs time and money, and harms someone. Stopping people from doing things they want to do, and enforcing norms and laws, is often annoying and expensive and painful. In many cases it feels unfair, and there have been a lot of pushes to do this less.

You can often ‘get away with’ this kind of permissiveness for a longer time than I would have expected. People can be very slow to adjust and solve for the equilibrium.

But eventually, they do solve for it, and norms and expectations and defaults adjust. Often this happens slowly, then quickly. Afterwards you are left with a new set of norms and expectations and defaults, which often becomes equally sticky.

There are a lot of laws and norms we really do not want people to break, or actions you don’t want people to take except under the right conditions. When you reduce the frictions involved in breaking them or doing them at the wrong times, there won’t be that big an instant adjustment, but you are spending down the associated social capital and mortgaging the future.

We are seeing a lot of the consequences of that now, in many places. And we are poised to see quite a lot more of it.

Time lost is lost forever. Unpleasant phone calls do not make someone else’s life more pleasant. Whereas additional money spent then goes to someone else.

Generalize this. Whenever friction is necessary, either introduce it in the service of some necessary function, or use as non-destructive a transfer or cost as possible.

It’s time to build. It’s always time to build.

The problem is, you need permission to build.

The abundance agenda is largely about taking the pro-social legible actions that make us richer, and moving them back from Category 2 into Category 1 or sometimes 0.

It is not enough to make it possible. It needs to be easy. As easy as possible.

Building housing where people want to live needs to be at most Category 1.

Building green energy, and transmission lines, need to be at most Category 1.

Pharmaceutical drug development needs to be at most Category 1.

Having children needs to be at least Category 1, ideally Category 0.

Deployment of and extraction of utility from AI needs to remain Category 1, where it does not impose catastrophic or existential risks. Developing frontier models that might kill everyone needs to be at Category 2 with an option to move it to Category 3 or Category 4 on a dime if necessary, including gathering the data necessary to make that choice.

What matters is mostly moving into Category 1. Actively subsidizing into Category 0 is a nice-to-have, but in most cases unnecessary. We need only to remove the barriers to such activities, to make such activities free of unnecessary frictions and costs and delays. That’s it.

When you put things in category 1, magic happens. If that would be good magic, do it.

A lot of technological advances and innovations, including the ones that are currently blocked, are about taking something that was previously Category 2, and turning it into a Category 1. Making the possible easier is extremely valuable.

We often need to beware and keep in Category 2 or higher actions that disrupt important norms and encourage disorder, that are primarily acts of predation, or that have important other negative externalities.

When the wrong thing is a little more annoying to do than the right thing, a lot more people will choose the right path, and vice versa. When you make the anti-social action easier than the pro-social action, when you reward those who bring disorder or wreck the commons and punish those who adhere to order and help the group, you go down a dark path.

This is also especially true when considering whether something will be a default, or otherwise impossible to ignore.

There is a huge difference between ‘you can get [X] if you seek it out’ versus ‘constantly seeing advertising for [X]’ or facing active media or peer pressure to participate in [X].

Recently, America moved Sports Gambling from Category 2 to Category 1.

Suddenly, sports gambling was everywhere, on our billboards and in our sports media, including the game broadcasts and stadium experiences. Participation exploded.

We now have very strong evidence that this was a mistake.

That does not mean sports gambling should be seriously illegal. It only means that people can’t handle low-friction sports gambling apps being available on phones that get pushed in the media.

I very much don’t want it in Category 3, only to move it back to Category 2. Let people gamble at physical locations. Let those who want to use VPNs or actively subvert the rules have their fun too. It’s fine, but don’t make it too easy, or in people’s faces.

The same goes for a variety of other things, mostly either vices or things that impose negative externalities on others, that are fine in moderation with frictions attached.

The classic other vice examples count: Cigarettes, drugs and alcohol, prostitution, TikTok. Prohibition on such things always backfires, but you want to see less of them, in both the figurative and literal sense, than you would if you fully unleashed them. So we need to talk price, and exactly what level of friction is correct, keeping in mind that ‘technically legal versus illegal’ is not the critical distinction in practice.

There are those who will not, on principle, lie or break the law, or break other norms. Every hero has a code. It would be good if we could return to a norm where this was how most people acted, rather than all of us treating many laws as almost not being there and certain statements as not truth tracking – a norm where ‘nominally illegal with no enforcement’ or ‘requires telling a lie’ already counted as Category 2.

Unfortunately, we don’t live in that world, at least not anymore. Indeed, people are effectively forced to tell various lies to navigate for example the medical system, and technically break various laws. This is terrible, and we should work to reverse this, but mostly we need to be realistic.

Similarly, it would be good if we lived by the principle that you consider the costs you impose on others when deciding what to do, only imposing them when justified or with compensation, and we socially punished those who act otherwise. But increasingly we do not live in that world, either.

As AI and other technology removes many frictions, especially for those willing to have the AI lie on their behalf to exploit those systems at scale, this becomes a problem.

Current AI largely takes many tasks that were Category 2, and turns them into Category 1, or effectively makes them so easy as to be Category 0.

Academia and school break first because the friction ‘was the point’ most explicitly, and AI is especially good at related tasks. Note that breaking these equilibria and systems could be very good for actual education, but we must adapt.

Henry Shevlin: I generally position myself an AI optimist, but it’s also increasingly clear to me that LLMs just break lots of our current institutions, and capabilities are increasing fast enough that it’ll be very hard for them to adapt in the near-term.

Education (secondary and higher) is the big one, but also large aspects of academic publishing. More broadly, a lot of the knowledge-work economy seems basically unsustainable in an era of intelligence too cheap to meter.

Lawfare too cheap to meter.

Dick Bruere: I am optimistic that AI will break everything.

Then we get into places like lawsuits.

Filing or defending against a lawsuit is currently a Category 2 action in most situations. The whole process is expensive and annoying, and it’s far more expensive to do it with competent representation. The whole system is effectively designed with this in mind. If lawsuits fell down to Category 1 because AI facilitated all the filings, suddenly a lot more legal actions become viable.

The courts themselves plausibly break from the strain. A lot of dynamics throughout society shift, as threats to file become credible, and legal considerations that exist on paper but not in practice – and often make very little sense in practice – suddenly exist in practice. New strategies for lawfare, for engineering the ability to sue, come into play.

Yes, the defense also moves towards Category 1 via AI, and this will help mitigate, but for many reasons this is a highly incomplete solution. The system will have to change.

Job applications are another example. It used to be annoying to apply to jobs, to the extent that most people applied to vastly fewer jobs than was wise. As a result, one could reasonably advertise or list a job and consider the applications that came in.

In software, this is essentially no longer true – AI-assisted applications flood the zone. If you apply via a public portal, you will get nowhere. You can only meaningfully apply via methods that find new ways to apply friction. That problem will gradually (or rapidly) spread to other industries and jobs.

There are lots of formal systems that offer transfers of wealth, in exchange for humans undergoing friction and directing attention. This can be (an incomplete list):

  1. Price discrimination. You offer discounts to those willing to figure out how to get them, charge more to those who pay no attention and don’t care.

  2. Advertising for yourself. Offer free samples, get people to try new products.

  3. Advertising for others. As in, a way to sell you on watching advertising.

  4. Relationship building. Initial offers of 0% interest get you to sign up for a credit card. You give your email to get into a rewards program with special offers.

  5. Customer service. If you are coming in to ask for an exchange or refund, that is annoying enough to do that it is mostly safe to assume your request is legit.

  6. Costly signaling. Only those who truly need or would benefit would endure what you made them do to qualify. School and job applications fall into this.

  7. Habit formation. Daily login rewards and other forms of gamification are ubiquitous in mobile apps and other places.

  8. Security through obscurity. There is a loophole in the system, but not many people know about it, and figuring it out takes skill.

  9. Enemy action. It is far too expensive to fully defend yourself against a sufficiently determined fraudster or thief, or someone determined to destroy your reputation, or worse an assassin or other physical attacker. Better to impose enough friction they don’t bother.

  10. Blackmail. It is relatively easy to impose large costs on someone else, or credibly threaten to do so, to try and extract resources from them. This applies on essentially all levels. Or of course someone might actually want to inflict massive damage (including catastrophic harms, cyberattacks, CBRN risks, etc).

Breaking all these systems, and the ways we ensure that they don’t get exploited at scale, upends quite a lot of things that no longer make sense.

In some cases, that is good. In others, not so good. Most will require adjustment.

Future more capable AI may then threaten to bring things in categories #3, #4 and #5 into the realm of super doable, or even start doing them on its own. Maybe even some things we think are in #6. In some cases this will be good because the frictions were due to physical limitations or worries that no longer apply. In other cases, this would represent a crisis.

To the extent you have control over levels of friction of various activities, for yourself or others, choose intentionally, especially in relative terms. All of this applies on a variety of scales.

Focus on reducing frictions you benefit from reducing, and assume this matters more than you think because it will change the composition of your decisions quite a lot.

Often this means it is well worth it to spend [X] in advance to prevent [Y] amount of friction over time, even if X>Y, or even X>>Y.

Where lower friction would make you worse off, perhaps because you would then make worse choices, consider introducing new frictions, up to and including commitment devices and actively taking away optionality that is not to your benefit.

Beware those who try to turn the scale into a boolean. It is totally valid to be fine with letting people do something if and only if it is sufficiently annoying for them to do it – you’re not a hypocrite to draw that distinction.

You’re also allowed to say, essentially, ‘if we can’t put this into [1] without it being in [0] then it needs to be in [2],’ or even ‘if there’s no way to put this into [2] without putting it into [1] then we need to put it in [3].’

You are especially allowed to point out ‘putting [X] in [1 or 0] has severe negative consequences, and doing [Y] puts [X] there, so until you figure out a solution you cannot do [Y].’

Most importantly, pay attention to how you and other people will actually respond to all this, take it seriously, and consider the incentives, equilibria, dynamics and consequences that result, and then respond deliberately.

Finally, when you notice that friction levels are changing, watch for necessary adjustments, for what if anything will break, and for what habits must be avoided. And also, of course, for what new opportunities this opens up.


Feds putting the kibosh on national EV charging program

“There is no legal basis for funds that have been apportioned to states to build projects being ‘decertified’ based on policy,” says Andrew Rogers, a former deputy administrator and chief counsel of the Federal Highway Administration.

The US DOT did not immediately respond to a request for comment.

It’s unclear how the DOT’s order will affect charging stations that are under construction. In the letter, FHWA officials write that “no new obligations may occur,” suggesting states may not sign new contracts with businesses even if those states have been allocated federal funding. The letter also says “reimbursement of existing obligations will be allowed” as the program goes through a review process, suggesting states may be allowed to pay back businesses that have already provided services.

Billions in federal funding have already been disbursed under the program. Money has gone to both red and blue states. Top funding recipients last year included Florida, New York, Texas, Georgia, and Ohio.

Tesla CEO Elon Musk has spent the last few weeks at the head of the federal so-called Department of Government Efficiency directing “audits” and cuts to federal spending. But his electric automobile company has been a recipient of $31 million in awards from the NEVI program, according to a database maintained by transportation officials, accounting for 6 percent of the money awarded so far.

The Trump administration has said that it plans to target electric vehicles and EV-related programs. An executive order signed by Trump on his first day in office purported to eliminate “the EV mandate,” though such a federal policy never existed.

NEVI projects have taken longer to get off the ground than other charging station construction because the federal government was deliberate in allocating funding to companies with track records that could prove they could build or operate charging stations, says Ryan McKinnon, a spokesperson for Charge Ahead Partnership, a group of businesses and organizations that work in electric vehicle charging. If NEVI funding isn’t disbursed, “the businesses that have spent time or money investing in this program will be hurt,” he says.

This story originally appeared on wired.com.


DeepSeek iOS app sends data unencrypted to ByteDance-controlled servers


Apple’s defenses that protect data from being sent in the clear are globally disabled.

A little over two weeks ago, a largely unknown China-based company named DeepSeek stunned the AI world with the release of an open source AI chatbot that had simulated reasoning capabilities that were largely on par with those from market leader OpenAI. Within days, the DeepSeek AI assistant app climbed to the top of the iPhone App Store’s “Free Apps” category, overtaking ChatGPT.

On Thursday, mobile security company NowSecure reported that the app sends sensitive data over unencrypted channels, making the data readable to anyone who can monitor the traffic. More sophisticated attackers could also tamper with the data while it’s in transit. Apple strongly encourages iPhone and iPad developers to enforce encryption of data sent over the wire using ATS (App Transport Security). For unknown reasons, that protection is globally disabled in the app, NowSecure said.
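
For readers unfamiliar with the mechanism: ATS is controlled by keys in an app’s Info.plist, and a global opt-out is expressed as an NSAppTransportSecurity dictionary with NSAllowsArbitraryLoads set to true. Below is a minimal sketch, in Python, of how an auditor might check an extracted app bundle for that opt-out; the bundle path is a placeholder, and the snippet is an illustration rather than NowSecure’s actual tooling.

```python
import plistlib
from pathlib import Path

# Path to the Info.plist inside an extracted .ipa payload (placeholder path).
PLIST_PATH = Path("Payload/Example.app/Info.plist")

with PLIST_PATH.open("rb") as f:
    info = plistlib.load(f)  # handles both XML and binary plists

ats = info.get("NSAppTransportSecurity", {})

# A global ATS opt-out: the app may talk plain HTTP to any host.
if ats.get("NSAllowsArbitraryLoads"):
    print("ATS is globally disabled (NSAllowsArbitraryLoads = true)")

# Per-host exceptions are narrower, but still worth flagging in an audit.
for domain, rules in ats.get("NSExceptionDomains", {}).items():
    if rules.get("NSExceptionAllowsInsecureHTTPLoads"):
        print(f"Insecure HTTP explicitly allowed for {domain}")
```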

Basic security protections MIA

What’s more, the data is sent to servers that are controlled by ByteDance, the Chinese company that owns TikTok. While some of that data is properly encrypted using transport layer security, once it’s decrypted on the ByteDance-controlled servers, it can be cross-referenced with user data collected elsewhere to identify specific users and potentially track queries and other usage.

More technically, the DeepSeek AI chatbot uses an open weights simulated reasoning model. Its performance is largely comparable with OpenAI’s o1 simulated reasoning (SR) model on several math and coding benchmarks. The feat, which largely took AI industry watchers by surprise, was all the more stunning because DeepSeek reported spending only a small fraction of what OpenAI spent.

A NowSecure audit of the app has turned up other behaviors that researchers found potentially concerning. For instance, the app uses a symmetric encryption scheme known as 3DES or triple DES. The scheme was deprecated by NIST following research in 2016 that showed it could be broken in practical attacks to decrypt web and VPN traffic. Another concern is that the symmetric keys, which are identical for every iOS user, are hardcoded into the app and stored on the device.
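
To see why a hardcoded, shared 3DES key is considered such a failure: anyone who extracts the key from the app binary can decrypt whatever that key protects, for every user at once. Here is a minimal sketch using the pycryptodome library; the key, IV, and cipher mode are illustrative assumptions, since the report does not spell out the app’s exact scheme.

```python
from Crypto.Cipher import DES3
from Crypto.Util.Padding import pad, unpad

# Illustrative stand-in for a key recovered from the app binary.
# Because the same key ships in every copy of the app, recovering it once
# is enough to decrypt anything it protects, for every user.
HARDCODED_KEY = b"example-hardcoded-key-24"   # 24 bytes, not the real key

def encrypt(plaintext: bytes, iv: bytes) -> bytes:
    cipher = DES3.new(HARDCODED_KEY, DES3.MODE_CBC, iv)
    return cipher.encrypt(pad(plaintext, DES3.block_size))

def decrypt(ciphertext: bytes, iv: bytes) -> bytes:
    cipher = DES3.new(HARDCODED_KEY, DES3.MODE_CBC, iv)
    return unpad(cipher.decrypt(ciphertext), DES3.block_size)

# An eavesdropper with the extracted key can do exactly what the app does.
iv = b"\x00" * 8   # 8-byte IV; reusing a fixed IV would be a further weakness
ct = encrypt(b"sensitive payload", iv)
print(decrypt(ct, iv))   # b'sensitive payload'
```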

The app is “not equipped or willing to provide basic security protections of your data and identity,” NowSecure co-founder Andrew Hoog told Ars. “There are fundamental security practices that are not being observed, either intentionally or unintentionally. In the end, it puts your and your company’s data and identity at risk.”

Hoog said the audit is not yet complete, so there are many questions and details left unanswered or unclear. He said the findings were concerning enough that NowSecure wanted to disclose what is currently known without delay.

In a report, he wrote:

NowSecure recommends that organizations remove the DeepSeek iOS mobile app from their environment (managed and BYOD deployments) due to privacy and security risks, such as:

  1. Privacy issues due to insecure data transmission
  2. Vulnerability issues due to hardcoded keys
  3. Data sharing with third parties such as ByteDance
  4. Data analysis and storage in China

Hoog added that the DeepSeek app for Android is even less secure than its iOS counterpart and should also be removed.

Representatives for both DeepSeek and Apple didn’t respond to an email seeking comment.

Data sent entirely in the clear occurs during the initial registration of the app, including:

  • organization id
  • the version of the software development kit used to create the app
  • user OS version
  • language selected in the configuration

Apple strongly encourages developers to implement ATS to ensure the apps they submit don’t transmit any data insecurely over HTTP channels. For reasons that Apple hasn’t explained publicly, Hoog said, this protection isn’t mandatory. DeepSeek has yet to explain why ATS is globally disabled in the app or why it uses no encryption when sending this information over the wire.

This data, along with a mix of other encrypted information, is sent to DeepSeek over infrastructure provided by Volcengine, a cloud platform developed by ByteDance. While the IP address the app connects to geo-locates to the US and is owned by US-based telecom Level 3 Communications, the DeepSeek privacy policy makes clear that the company “store[s] the data we collect in secure servers located in the People’s Republic of China.” The policy further states that DeepSeek:

may access, preserve, and share the information described in “What Information We Collect” with law enforcement agencies, public authorities, copyright holders, or other third parties if we have good faith belief that it is necessary to:

• comply with applicable law, legal process or government requests, as consistent with internationally recognised standards.

NowSecure still doesn’t know precisely the purpose of the app’s use of 3DES encryption functions. The fact that the key is hardcoded into the app, however, is a major security failure that’s been recognized for more than a decade when building encryption into software.

No good reason

NowSecure’s Thursday report adds to a growing list of safety and privacy concerns that have already been reported by others.

One was the terms spelled out in the above-mentioned privacy policy. Another came last week in a report from researchers at Cisco and the University of Pennsylvania. It found that the DeepSeek R1, the simulated reasoning model, exhibited a 100 percent attack failure rate against 50 malicious prompts designed to generate toxic content.

A third concern is research from security firm Wiz that uncovered a publicly accessible, fully controllable database belonging to DeepSeek. It contained more than 1 million instances of “chat history, backend data, and sensitive information, including log streams, API secrets, and operational details,” Wiz reported. An open web interface also allowed for full database control and privilege escalation, with internal API endpoints and keys available through the interface and common URL parameters.

Thomas Reed, staff product manager for Mac endpoint detection and response at security firm Huntress, and an expert in iOS security, said he found NowSecure’s findings concerning.

“ATS being disabled is generally a bad idea,” he wrote in an online interview. “That essentially allows the app to communicate via insecure protocols, like HTTP. Apple does allow it, and I’m sure other apps probably do it, but they shouldn’t. There’s no good reason for this in this day and age.”

He added: “Even if they were to secure the communications, I’d still be extremely unwilling to send any remotely sensitive data that will end up on a server that the government of China could get access to.”

HD Moore, founder and CEO of runZero, said he was less concerned about ByteDance or other Chinese companies having access to data.

“The unencrypted HTTP endpoints are inexcusable,” he wrote. “You would expect the mobile app and their framework partners (ByteDance, Volcengine, etc) to hoover device data, just like anything else—but the HTTP endpoints expose data to anyone in the network path, not just the vendor and their partners.”

On Thursday, US lawmakers began pushing to immediately ban DeepSeek from all government devices, citing national security concerns that the Chinese Communist Party may have built a backdoor into the service to access Americans’ sensitive private data. If passed, DeepSeek could be banned within 60 days.

This story was updated to add further examples of security concerns regarding DeepSeek.

Photo of Dan Goodin

Dan Goodin is Senior Security Editor at Ars Technica, where he oversees coverage of malware, computer espionage, botnets, hardware hacking, encryption, and passwords. In his spare time, he enjoys gardening, cooking, and following the independent music scene. Dan is based in San Francisco. Follow him here on Mastodon and here on Bluesky. Contact him on Signal at DanArs.82.


Parrots struggle when told to do something other than mimic their peers

There have been many studies on the capability of non-human animals to mimic transitive actions—actions that have a purpose. Hardly any studies have shown that animals are also capable of imitating intransitive actions. Even though intransitive actions have no particular purpose, imitating these non-conscious movements is still thought to help with socialization and strengthen bonds for both animals and humans.

Zoologist Esha Haldar and colleagues from the Comparative Cognition Research group worked with blue-throated macaws, which are critically endangered, at the Loro Parque Fundación in Tenerife. They trained the macaws to perform two intransitive actions, then set up a conflict: Two neighboring macaws were asked to do different actions.

What Haldar and her team found was that individual birds were more likely to perform the same intransitive action as a bird next to them, no matter what they’d been asked to do. This could mean that macaws possess mirror neurons, the same neurons that, in humans, fire when we are watching intransitive movements and cause us to imitate them (at least if these neurons function the way some think they do).

But it wasn’t on purpose

Parrots are already known for their mimicry of transitive actions, such as grabbing an object. Because they are highly social creatures with brains that are large relative to the size of their bodies, they made excellent subjects for a study that gauged how susceptible they were to copying intransitive actions.

Mirroring of intransitive actions, also called automatic imitation, can be measured with what’s called a stimulus-response-compatibility (SRC) test. These tests measure the response time between seeing an intransitive movement (the visual stimulus) and mimicking it (the action). A faster response time indicates a stronger reaction to the stimulus. They also measure the accuracy with which they reproduce the stimulus.

Until now, there have only been three studies that showed non-human animals are capable of copying intransitive actions, but the intransitive actions in these studies were all by-products of transitive actions. Only one of these focused on a parrot species. Haldar and her team would be the first to test directly for animal mimicry of intransitive actions.


Not Gouda-nough: Google removes AI-generated cheese error from Super Bowl ad

Blame cheese.com

While it’s easy to accuse Google Gemini of just making up plausible-sounding cheese facts from whole cloth, this seems more like a case of garbage-in, garbage-out. Google President of Cloud Applications Jerry Dischler posted on social media to note that the incorrect Gouda fact was “not a hallucination,” because all of Gemini’s data is “grounded in the Web… in this case, multiple sites across the web include the 50-60% stat.”

The specific Gouda numbers Gemini used can be most easily traced to cheese.com, a heavily SEO-focused subsidiary of news aggregator WorldNews Inc. Cheese.com doesn’t cite a source for the percentages featured prominently on its Smoked Gouda page, but that page also confidently asserts that the cheese is pronounced “How-da,” a fact that only seems true in the Netherlands itself.


The offending cheese.com passage that is not cited when using Google’s AI writing assistant. Credit: cheese.com

Regardless, Google can at least point to cheese.com as a plausibly reliable source that misled its AI in a way that might also stymie web searchers. And Dischler added on social media that users “can always check the results and references” that Gemini provides.

The only problem with that defense is that the Google writing assistant shown off in the ad doesn’t seem to provide any such sources for a user to check. Unlike Google search’s AI Overviews—which does refer to a cheese.com link when responding about gouda consumption—the writing assistant doesn’t provide any backup for its numbers here.

The Gemini writing assistant does note in small print that its results are “a creative writing aid, and not intended to be factual.” If you click for more information about that warning, Google warns that “the suggestions from Help me write can be inaccurate or offensive since it’s still in an experimental status.”

This “experimental” status hasn’t stopped Google from heavily selling its AI writing assistant as a godsend for business owners in its planned Super Bowl ads, though. Nor is this major caveat included in the ads themselves. Yet it’s the kind of thing users should have at the front of their minds when using AI assistants for anything with even a hint of factual info.

Now if you’ll excuse me, I’m going to go update my personal webpage with information about my selection as World’s Most Intelligent Astronaut/Underwear Model, in hopes that Google’s AI will repeat the “fact” to anyone who asks.


Internet Archive played crucial role in tracking shady CDC data removals


Internet Archive makes it easier to track changes in CDC data online.

When thousands of pages started disappearing from the Centers for Disease Control and Prevention (CDC) website late last week, public health researchers quickly moved to archive deleted public health data.

Soon, researchers discovered that the Internet Archive (IA) offers one of the most effective ways to both preserve online data and track changes on government websites. For decades, IA crawlers have collected snapshots of the public Internet, making it easier to compare current versions of websites to historic versions. And IA also allows users to upload digital materials to further expand the web archive. Both aspects of the archive immediately proved useful to researchers assessing how much data the public risked losing during a rapid purge following a pair of President Trump’s executive orders.
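
As a concrete illustration of that workflow, here is a minimal sketch that uses the Wayback Machine’s public CDX API to list captures of a page and flag where the archived content changed. The CDC URL below is only an example, and the parameters are one reasonable selection from the documented API, not a claim about how the researchers actually did it.

```python
import requests

# List Wayback Machine captures of a page; the "digest" field changes
# whenever the archived content changes, which makes edits easy to spot.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
params = {
    "url": "cdc.gov/mpox/index.html",   # example page, not a specific claim
    "from": "20250101",
    "to": "20250207",
    "output": "json",
    "fl": "timestamp,statuscode,digest",
    "limit": "200",
}

rows = requests.get(CDX_ENDPOINT, params=params, timeout=30).json()
captures = rows[1:] if rows else []      # first row is the column header

previous_digest = None
for timestamp, status, digest in captures:
    if digest != previous_digest:
        # A new digest means this capture differs from the previous one.
        print(f"{timestamp}  status={status}  content changed")
    previous_digest = digest
```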

Part of a small group of researchers who managed to download the entire CDC website within days, virologist Angela Rasmussen helped create a public resource that combines CDC website information with deleted CDC datasets. Those datasets, many of which were previously in the public domain for years, were uploaded to IA by an anonymous user, “SheWhoExists,” on January 31. Moving forward, Rasmussen told Ars that IA will likely remain a go-to tool for researchers attempting to closely monitor for any unexpected changes in access to public data.

IA “continually updates their archives,” Rasmussen said, which makes IA “a good mechanism for tracking modifications to these websites that haven’t been made yet.”

The CDC website is being overhauled to comply with two executive orders from January 20, the CDC told Ars. The “Defending Women from Gender Ideology Extremism and Restoring Biological Truth to the Federal Government” order requires government agencies to remove LGBTQ+ language that Trump claimed denies “the biological reality of sex” and is likely driving most of the CDC changes to public health resources. The other executive order the CDC cited, “Ending Radical And Wasteful Government DEI Programs And Preferencing,” would seemingly only affect CDC employment practices.

Additionally, “the Office of Personnel Management has provided initial guidance on both Executive Orders and HHS and divisions are acting accordingly to execute,” the CDC told Ars.

Rasmussen told Ars that the deletion of CDC datasets is “extremely alarming” and “not normal.” While some deleted pages have since been restored in altered versions, removing gender ideology from CDC guidance could put Americans at heightened risk. That’s another emerging problem that IA’s snapshots could help researchers and health professionals resolve.

“I think the average person probably doesn’t think that much about the CDC’s website, but it’s not just a matter of, like, ‘Oh, we’re going to change some wording’ or ‘we’re going to remove these data,’” Rasmussen said. “We are actually going to retool all the information that’s there to remove critical information about public health that could actually put people in danger.”

For example, altered Mpox transmission data removed “all references to men who have sex with men,” Rasmussen said. “And in the US those are the people who are not the only people at risk, but they’re the people who are most at risk of being exposed to Mpox. So, by removing that DEI language, you’re actually depriving people who are at risk of information they could use to protect themselves, and that eventually will get people hurt or even killed.”

Likely the biggest frustration for researchers scrambling to preserve data is dealing with broken links. On social media, Rasmussen has repeatedly called for help flagging broken links to ensure her team’s archive is as useful as possible.

Rasmussen’s group isn’t the only effort to preserve the CDC data. Some are creating niche archives focused on particular topics, like journalist Jessica Valenti, who created an archive of CDC guidelines on reproductive rights issues, sexual health, intimate partner violence, and other data the CDC removed online.

Niche archives could make it easier for some researchers to quickly survey missing data in their field, but Rasmussen’s group is hoping to take next steps to make all the missing CDC data more easily discoverable in their archive.

“I think the next step,” Rasmussen said, “would be to try to fix anything in there that’s broken, but also look into ways that we could maybe make it more browsable and user-friendly for people who may not know what they’re looking for or may not be able to find what they’re looking for.”

CDC advisers demand answers

The CDC has been largely quiet about the deleted data, only pointing to Trump’s executive orders to justify removals. That could change by February 7. That’s the deadline by which a congressionally mandated advisory committee to the CDC’s acting director, Susan Monarez, has asked in an open letter for answers to a list of questions about the data removals.

“It has been reported through anonymous sources that the website changes are related to new executive orders that ban the use of specific words and phrases,” their letter said. “But as far as we are aware, these unprecedented actions have yet to be explained by CDC; news stories indicate that the agency is declining to comment.”

At the top of the committee’s list of questions is likely the one frustrating researchers most: “What was the rationale for making these datasets and websites inaccessible to the public?” But the committee also importantly asked what analysis was done “of the consequences of removing access to these datasets and website” prior to the removals. They also asked how deleted data would be safeguarded and when data would be restored.

It’s unclear if the CDC will be motivated to respond by the deadline. Ars reached out to one of the committee members, Joshua Sharfstein—a physician and vice dean for Public Health Practice and Community Engagement at Johns Hopkins University—who confirmed that as of this writing, the CDC has not yet responded. And the CDC did not respond to Ars’ request to comment on the letter.

Rasmussen told Ars that even temporary removals of CDC guidance can disrupt important processes keeping Americans healthy. Among the potentially most consequential pages briefly removed were recommendations from the congressionally mandated Advisory Committee on Immunization Practices (ACIP).

Those recommendations are used by insurance companies to decide who gets reimbursed for vaccines and by physicians to deduce vaccine eligibility, and Rasmussen said they “are incredibly important for the entire population to have access to any kind of vaccination.” And while, for example, the Mpox vaccine recommendations were eventually restored unaltered, Rasmussen told Ars that she suspects that “one of the reasons” currently preventing interference with ACIP is that it’s mandated by Congress.

Seemingly ACIP could be weakened by the new administration, Rasmussen suggested. She warned that Trump’s pick for CDC director, Dave Weldon, “is an anti-vaxxer” (with a long history of falsely linking vaccines to autism) who may decide to replace ACIP committee members with anti-vaccine advocates or move to dissolve ACIP. And any changes in recommendations could mean “insurance companies aren’t going to cover vaccinations [and that] physicians will not recommend vaccination.” And that could mean “vaccination will go down and we’ll start having outbreaks of some of these vaccine-preventable diseases.”

“If there’s a big polio outbreak, that is going to result in permanently disabled children, dead children—it’s really, really serious,” Rasmussen said. “So I think that people need to understand that this isn’t just like, ‘Oh, maybe wear a mask when you’re at the movie theater’ kind of CDC guidance. This is guidance that’s really fundamental to our most basic public health practices, and it’s going to cause widespread suffering and death if this is allowed to continue.”

Seeding deleted data and doing science to fight back

On Bluesky, Rasmussen led one of many charges to compile archived links and download CDC data so that researchers can reference every available government study when advancing public health knowledge.

“These data are public and they are ours,” Rasmussen posted. “Deletion disobedience is one way to fight back.”

As Rasmussen sees it, deleting CDC data is “theft” from the public domain and archiving CDC data is simply taking “back what is ours.” But at the same time, her team is also taking steps to be sure the data they collected can be lawfully preserved. Because the CDC website has not been copied and hosted on a server, they expect their archive should be deemed lawful and remain online.

“I don’t put it past this administration to try to shut this stuff down by any means possible,” Rasmussen told Ars. “And we wanted to make sure there weren’t any sort of legal loopholes that would jeopardize anybody in the group, but also that would potentially jeopardize the data.”

It’s not clear if some data has already been lost. Seemingly the same user who uploaded the deleted datasets to IA posted on Reddit, clarifying that while the “full” archive “should contain all public datasets that were available” before “anything was scrubbed,” it likely only includes “most” of the “metadata and attachments.” So, researchers who download the data may still struggle to fill in some blanks.

To help researchers quickly access the missing data, anyone can help the IA seed the datasets, the Reddit user said in another post providing seeding and mirroring instructions. Currently dozens are seeding it for a couple hundred peers.

“Thank you to everyone who requested this important data, and particularly to those who have offered to mirror it,” the Reddit user wrote.

As Rasmussen works with her group to make their archive more user-friendly, her plan is to help as many researchers as possible fight back against data deletion by continuing to reference deleted data in their research. She suggested that effort—doing science that ignores Trump’s executive orders—is perhaps a more powerful way to resist and defend public health data than joining in loud protests, which many researchers based in the US (and perhaps relying on federal funding) may not be able to afford to do.

“Just by doing things and standing up for science with your actions, rather than your words, you can really make, I think, a big difference,” Rasmussen said.

Photo of Ashley Belanger

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.


Sick right now? Flu is resurging to yet a higher peak this season.

Currently, flu activity is categorized as “very high” in 29 states, and “high” in 15. States in the South are ablaze with flu. Louisiana, Tennessee, and South Carolina are at the highest “very high” level. But parts of the Northeast corridor are also seeing extremely high activity, including Massachusetts, New Hampshire, New Jersey, and New York City.


As is often the case in flu seasons, the age group hardest hit this year is children ages 0 to 4. The CDC recorded 16 pediatric deaths linked to flu in week 4 of the season, bringing the season’s total pediatric deaths to 47.

Overall hospitalizations are up. The Centers for Disease Control and Prevention estimates that there have been at least 20 million illnesses, 250,000 hospitalizations, and 11,000 deaths from flu so far this season. About 44 percent of US adults have gotten their flu shot, far below the public health goal of 70 percent.

Laboratory surveillance of influenza cases in week 4 indicates that nearly all of the cases are from influenza A viruses, about an even split between H1N1 and H3N2, which has been the case over the course of the season. Around 2 percent of cases were the influenza B Victoria lineage.


We’re in Deep Research

The latest addition to OpenAI’s Pro offerings is their version of Deep Research.

Have you longed for 10k word reports on anything your heart desires, 100 times a month, at a level similar to a graduate student intern? We have the product for you.

  1. The Pitch.

  2. It’s Coming.

  3. Is It Safe?

  4. How Does Deep Research Work?

  5. Killer Shopping App.

  6. Rave Reviews.

  7. Research Reports.

  8. Perfecting the Prompt.

  9. Not So Fast!

  10. What’s Next?

  11. Paying the Five.

  12. The Lighter Side.

OpenAI: Today we’re launching deep research in ChatGPT, a new agentic capability that conducts multi-step research on the internet for complex tasks. It accomplishes in tens of minutes what would take a human many hours.

Sam Altman: Today we launch Deep Research, our next agent.

This is like a superpower; experts on demand!

It can use the Internet, do complex research and reasoning, and give you a report.

It is quite good and can complete tasks that would take hours or days and cost hundreds of dollars.

People will post many excellent examples, but here is a fun one:

I am in Japan right now and looking for an old NSX. I spent hours searching unsuccessfully for the perfect one. I was about to give up, and Deep Research just… found it.

It is very compute-intensive and slow, but it is the first AI system that can do such a wide variety of complex, valuable tasks.

Going live in our Pro tier now, with 100 queries per month.

Plus, Team and Enterprise tiers will come soon, and then a free tier.

This version will have something like 10 queries per month in our Plus tier and a very small number in our free tier, but we are working on a more efficient version.

(This version is built on o3.)

Give it a try on your hardest work task that can be solved just by using the Internet and see what happens.

Or:

Sam Altman: 50 cents of compute for 500 dollars of value

Sarah (YuanYuanSunSara): Deepseek do it for 5 cents, 500 dollar value.

Perhaps DeepSeek will quickly follow suit, perhaps they will choose not to. The important thing about Sarah’s comment is that there is essentially no difference here.

If the report really is worth $500, then the primary costs are:

  1. Figuring out what you want.

  2. Figuring out the prompt to get it.

  3. Reading the giant report.

  4. NOT the 45 cents you might save!

If the marginal compute cost to me really is 50 cents, then the actual 50 cents is chump change. Even a tiny increase in quality matters so much more.
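
As a back-of-the-envelope illustration of that point (all numbers below are assumptions for the sake of the arithmetic, not figures from OpenAI or DeepSeek):

```python
# Where the real costs of a single report sit; every number is an assumption.
compute_cost_openai = 0.50       # dollars of compute per report, as claimed above
compute_cost_cheaper = 0.05      # hypothetical cheaper provider
report_words = 10_000            # roughly the "10k word report" size
reading_speed_wpm = 250          # assumed reading speed, words per minute
hourly_value = 100.0             # assumed value of your time, dollars per hour
prompt_and_review_minutes = 20   # figuring out what you want, checking the output

reading_minutes = report_words / reading_speed_wpm             # 40 minutes
human_cost = (reading_minutes + prompt_and_review_minutes) / 60 * hourly_value

print(f"Human time cost per report: ~${human_cost:.0f}")        # ~$100
print(f"Compute cost difference:     ${compute_cost_openai - compute_cost_cheaper:.2f}")
# The 45-cent compute savings is noise next to the human time cost, so even a
# small difference in report quality dominates the choice of provider.
```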

This isn’t true if you are using the research reports at scale somehow, generating them continuously on tons of subjects and then feeding them into o1-pro for refinement and creating some sort of AI CEO or what not. But the way that all of us are using DR right now, in practice? All that matters is the report is good.

Here was the livestream announcement, if you want it. I find these unwatchable.

Dan Hendrycks: It looks like the latest OpenAI model is doing very well across many topics. My guess is that Deep Research particularly helps with subjects including medicine, classics, and law.

Kevin Roose: When I wrote about Humanity’s Last Exam, the leading AI model got an 8.3%. 5 models now surpass that, and the best model gets a 26.6%.

That was 10 DAYS AGO.

Buck Shlegeris: Note that the questions were explicitly chosen to be adversarial to the frontier models available at the time, which means that models released after HLE look better than they deserve.

Having browsing and python tools and CoT all are not an especially fair fight, and also this is o3 rather than o3-mini under the hood, but yeah the jump to 26.6% is quite big, and confirms why Sam Altman said that soon we will need another exam. That doesn’t mean that we will pass it.

OAI-DR is also the new state of the art on GAIA, which evaluates real-world questions.

They shared a few other internal scores, but none of the standard benchmarks other than humanity’s last exam, and they did not share any safety testing information, despite this being built on o3.

Currently Pro users get 100 DR queries per month, Plus and Free get 0.

I mean, probably. But how would you know?

It is a huge jump on Humanity’s Last Exam. There’s no system card. There’s no discussion of red teaming. There’s not even a public explanation of why there’s no public explanation.

This was released literally two days after they released the o3-mini model card, to show that o3-mini is a safe thing to release, in which they seem not to have used Deep Research as part of their evaluation process. Going forward, I think it is necessary to use Deep Research as part of all aspects of the preparedness framework testing for any new models, and that this should have also been done with o3-mini.

Then two days later, without a system card, they released Deep Research, which is confirmed to be based upon the full o3.

I see this as strongly against the spirit of the White House and Seoul commitments to release safety announcements for ‘all new significant model public releases.’

Miles Brundage: Excited to try this out though with a doubling on Humanity’s Last Exam and o3-mini already getting into potentially dangerous territory on some capabilities, I’m sad there wasn’t a system card or even any brief discussion of red teaming.

OpenAI: In the coming weeks and months, we’ll be working on the technical infrastructure, closely monitoring the current release, and conducting even more rigorous testing. This aligns with our principle of iterative deployment. If all safety checks continue to meet our release standards, we anticipate releasing deep research to Plus users in about a month.

Miles Brundage: from the post – does this mean they did automated evals but no RTing yet, or that once they start doing it they’ll stop deployment if it triggers a High score? Very vague. Agree re: value of iterative deployment but that doesn’t mean “anything goes as long as it’s 200/month”.

So where is the model card and safety information for o3?

Well, their basic answer is ‘this is a limited release and doesn’t really count,’ with the obvious (to be clear, completely unstated and not at all confirmed) implication that this was rushed out in response to r1 to ensure that the conversation and vibe shifted.

I reached out officially, and they gave me this formal statement (which also appears here):

OpenAI: We conducted rigorous safety testing, preparedness evaluations and governance reviews on the early version of o3 that powers deep research, identifying it as Medium risk.

We also ran additional safety testing to better understand incremental risks associated with deep research’s ability to browse the web, and we have added new mitigations.

We will continue to thoroughly test and closely monitor the current limited release.

We will share our safety insights and safeguards for deep research in a system card when we widen access to Plus users.

We do know that the version of o3 (again, full o3) in use tested out as Medium on their preparedness framework and went through the relevant internal committees, which would allow them to release it. But that’s all we know.

They stealth released o3, albeit in a limited form, well before it was ready.

I also have confirmation that the system card will be released before they make Deep Research more widely available (and presumably before o3 is made widely available), and that this is OpenAI’s understanding of its obligations going forward.

They draw a clear distinction between Plus and Free releases or API access, which invokes their disclosure obligations, and limited release only to Pro users, which does not. They do their safety testing under the preparedness framework before even a limited release. However, they consider their obligations to share safety information only to be invoked when a new model is made available to Plus or Free users, or to API users.

I don’t see that as the right distinction to draw here, although I see an important one in API access vs. chatbot interface access. Anyone can now pay the $200 (perhaps with a VPN) and use it, if they want to do that, and in practice multi-account for additional queries if necessary. This is not that limited a release in terms of the biggest worries.

The silver lining is that this allows us to have the discussion now.

I am nonzero sympathetic to the urgency of the situation, and to the intuition that this modality combined with the limited bandwidth and speed renders the whole thing Mostly Harmless.

But if this is how you act under this amount of pressure, how are you going to act in the future, with higher stakes, under much more pressure?

Presumably not so well.

Bob McGrew (OpenAI): The important breakthrough in OpenAI’s Deep Research is that the model is trained to take actions as part of its CoT. The problem with agents has always been that they can’t take coherent action over long timespans. They get distracted and stop making progress.

That’s now fixed.

I do notice this is seemingly distinct from Gemini’s Deep Research. Gemini’s version first searches for sources up front, which it shows to you, and then compiles the report. OpenAI’s version searches for sources and takes other actions as it goes, whenever it needs them. That’s a huge upgrade.

Under the hood, we know it’s centrally o3 plus reinforcement learning with the ability to take actions during the chain of thought. What you get from there depends on what you choose as the target.
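To make ‘taking actions during the chain of thought’ concrete, here is a minimal sketch of what that kind of loop can look like. To be clear, this is only an illustration under assumptions: the tool names, the decide() and write_report() interface, and the stopping rule are mine, not anything OpenAI has disclosed about how Deep Research is actually built.

```python
# Illustrative sketch only: a research agent that interleaves reasoning with tool use.
# The tool names, the model.decide()/model.write_report() interface, and the stopping
# rule are assumptions for illustration, not a description of OpenAI's actual system.

def deep_research(question, model, tools, max_steps=50):
    notes = []  # everything the agent has gathered so far
    for _ in range(max_steps):
        # The model reasons over the question plus its notes and proposes the next action,
        # e.g. {"action": "search", "arguments": {"query": "..."}} or {"action": "finish"}.
        step = model.decide(question=question, notes=notes, tools=list(tools))
        if step["action"] == "finish":
            break
        # Execute the chosen tool ("search", "open_page", "run_python", ...) and feed
        # the observation back into the model's working context for the next step.
        observation = tools[step["action"]](**step["arguments"])
        notes.append({"action": step["action"], "observation": observation})
    # Finally, the model synthesizes its accumulated notes into a long-form report.
    return model.write_report(question=question, notes=notes)
```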

This is clearly not the thing everyone had in mind, and it’s not the highest value use of a graduate research assistant, but I totally buy that it is awesome at this:

Greg Brockman: Deep Research is an extremely simple agent — an o3 model which can browse the web and execute python code — and is already quite useful.

It’s been eye-opening how many people at OpenAI have been using it as a much better e-commerce search in particular.

E-commerce search is a perfect application. You don’t care about missing some details or a few hallucinations all that much if you’re going to check its work afterwards. You usually don’t need the result right now. But you absolutely want to know what options are available, where, at what price, and what features matter and what general reviews look like and so on.

In the past I’ve found this to be the best use case for Gemini Deep Research – it can compare various offerings, track down where to buy them, list their features and so on. This is presumably the next level up for that.

If I could buy unlimited queries at $0.50 a pop, I would totally do this. The question then becomes, right now, that you get 100 queries a month for $200 (along with operator and o1-pro), but you can’t then add more queries on the margin. So the marginal query might be worth a lot more to you than $0.50.

Not every review is a rave, but here are some of the rave reviews.

Derya Unutmaz: I asked Deep Research to assist me on two cancer cases earlier today. One was within my area of expertise & the other was slightly outside it. Both reports were simply impeccable, like something only a specialist MD could write! There’s a reason I said this is a game-changer! 🤯

I can finally reveal that I’ve had early access to @OpenAI’s Deep Research since Friday & I’ve been using it nonstop! It’s an absolute game-changer for scientific research, publishing, legal documents, medicine, education-from my tests but likely many others. I’m just blown away!

Yes I did [use Google’s DR] and it’s very good but this is much better! Google will need to come up with their next model.

Danielle Fong: openai deep research is incredible

Siqi Chen: i’m only a day in so far but @openai’s deep research and o3 is exceeding the value of the $150K i am paying a private research team to research craniopharyngioma treatments for my daughter.

$200/mo is an insane ROI. Grateful to @sama and the @OpenAI team.

feature request for @sama and @OpenAI:

A lot of academic articles are pay walled, and I have subscriptions to just about every major medical journal now.

It would be game changing if i could connect all my credentials to deep research so it can access the raw papers.

As I mentioned above, ‘hook DR up to your credentials’ would be huge.

Tyler Cowen says ‘so far [DR] is amazing’ but doesn’t yet offer more details, as the post is mostly about o1-pro.

Dean Ball is very impressed, saying DR is doing work that would have taken a day or more of research, here it is researching various state regulations. He thinks this is big. I continue to see Dean Ball as a great example of where this type of work is exactly a fit for what he needs to do his job, but still, wowsers.

Olivia Moore is impressed for retrieval tasks, finding it better than Operator, finding it very thorough. I worry it’s too thorough, forcing you to wade through too much other stuff, but that’s what other LLMs are for – turning more text into less text.

Seth Lazar is impressed as he shops for a camera, notices a weakness that it doesn’t properly discount older websites in this context.

Aaron Levie: Can confirm OpenAI Deep Research is quite strong. In a few minutes it did what used to take a dozen hours. The implications to knowledge work is going to be quite profound when you just ask an AI Agent to perform full tasks for you and come back with a finished result.

The ultimate rave review is high willingness to pay.

xjdr: in limited testing, Deep research can completely replace me for researching things i know nothing about to start (its honestly probably much better and certainly much faster). Even for long reports on things i am fairly knowledgeable about, it competes pretty well on quality (i had it reproduce some recent research i did with a few back and forth prompts and compared notes). i am honestly pretty shocked how polished the experience is and how well it works.

I have my gripes but i will save them for later. For now, i will just say that i am incredibly impressed with this release.

To put a finer point on it, i will happily keep paying $200 / month for this feature alone. if they start rate limiting me, i would happily pay more to keep using it.

You Jiacheng: is 100/month enough?

xjdr: I’m gunna hit that today most likely.

My question is, how does xjdr even have time to read all the reports? Or is this a case of feeding them into another LLM?

Paul Calcraft: Very good imo, though haven’t tested it on my own areas of expertise yet.

Oh shit. Deep Research + o1-pro just ~solved this computer graphics problem

Things that didn’t work:

  1. R1/C3.5/o1-pro

  2. Me 🙁

  3. OpenAI Deep Research answer + C3.5 cursor

Needed Deep Research *and* an extra o1-pro step to figure out correct changes to my code given the research

Kevin Bryan: The new OpenAI model announced today is quite wild. It is essentially Google’s Deep Research idea with multistep reasoning, web search, *and* the o3 model underneath (as far as I know). It sometimes takes a half hour to answer.

So related to the tariff nuttiness, what if you wanted to read about the economics of the 1890 McKinley tariff, drawing on modern trade theory? I asked Deep Research to spin up a draft, with a couple paragraphs of guidance, in Latex, with citations.

How well can it do, literally one-shot? I mean, not bad. Honestly, I’ve gotten papers to referee that are worse than this. The path from here to steps where you can massively speed up the pace of research is really clear. You can read it yourself here.

I tried to spin up a theory paper as well. On my guidance on the problem, it pulled a few dozen papers, scanned them, then tried to write a model in line with what I said.

It wasn’t exactly what I wanted, and is quite far away from even one novel result (basically, it gave me a slight extension of Scotchmer-Green). But again, the path is very clear, and I’ve definitely used frontier models to help prove theorems already.

I think the research uses are obvious here. I would also say, for academia, the amount of AI slop you are about to get is *insane*. In 2022, I pointed out that undergrads could AI their way to a B. I am *sure* for B-level journals, you can publish papers you “wrote” in a day.

Now we get to the regular reviews. The general theme is: it will give you a lot of text, most of it accurate but not all of it, and it will have some insights, but it piles the slop and unimportant stuff high on top without noticing which is which. It’s the same as Gemini’s Deep Research, only more so, and generally stronger but slower. That is highly useful, if you know how to use it.

Abu El Banat: Tested on several questions. Where I am a professional expert, it gave above average grad student summer intern research project answers — covered all the basics, lacked some domain-specific knowledge to integrate the new info helpfully, but had 2-3 nuggets of real insight.

It found several sources of info in my specialty online that I did not know were publicly accessible.

On questions where I’m an amateur, it was extremely helpful. It seemed to shine in tasks like “take all my particulars into account, research all options, and recommend.”

One thing that frustrates me about Gemini Deep Research, and seems to be present in OpenAI’s version as well, is that it will give you an avalanche of slop whether you like it or not. If you ask it for a specific piece of information, like one number that is ‘the average age when kickers retire,’ you won’t get it, at least not by default. This is very frustrating. To me, what I actually want – very often – is to answer a specific question, for a particular reason.

Bayram Annakov concludes it is ‘deeper but slower than Gemini.’

Here’s a personal self-analysis example, On the Couch with Dr. Deep Research.

Ethan Mollick gets a 30-page 10k word report ‘Evolution of Tabletop RPGs: From Dungeon Crawls to Narrative Epics.’

Ethan Mollick: Prompt: I need a report on evolution in TTRPGs, especially the major families of rules that have evolved over the past few years, and the emblematic games of each. make it an interesting read with examples of gameplay mechanics. start with the 1970s but focus most on post 20102. all genres, all types, no need for a chart unless it helps, but good narratives with sections contrasting examples of how the game might actually play. maybe the same sort of gameplay challenge under different mechanics?

To me there are mostly two speeds, ‘don’t care how I want it now,’ and ‘we have all the time in the world.’ Once you’re coming back later, 5 minutes and an hour are remarkably similar lengths. If it takes days then that’s a third level.

Colin Fraser asks, who would each NBA team most likely have guard LeBron James? It worked very hard on this, and came back with answers that often included players no longer on the team, just like o1 often does. Colin describes this as a lack-of-agency problem: o3 isn’t ensuring it has an up-to-date set of rosters the way a human would. I’m not sure that’s the right way to look at it? But it’s not unreasonable.

Kevin Roose: Asked ChatGPT Deep Research to plan this week’s Hard Fork episode and it suggested a segment we did last week and two guests I can’t stand, -10 points on the podcast vibes eval.

Shakeel: Name names.

Ted tries it in a complex subfield he knows well, finds a 90% coherent high-level summary of prior work and 10% total nonsense that a non-expert wouldn’t be able to differentiate, and he’s ‘not convinced it “understands” what is going on.’ That’s potentially both a highly useful and a highly dangerous place to be, depending on the field and the user.

Here’s one brave user:

Simeon: Excellent for quick literature reviews in literature you barely know (able to give a few example papers) but don’t know much about.

And one that is less brave on this one:

Siebe: Necessary reminder to test features like this in areas you’re familiar with. “It did a good job of summarizing an area I wasn’t familiar with.”

No, you don’t know that. You don’t have the expertise to judge that.

Where is the 10% coming from? Steve Sokolowski has a theory.

Steve Sokolowski: ‘m somewhat disappointed by @OpenAI’s Deep Research. @sama promised it was a dramatic advance, so I entered the complaint for our o1 pro-guided lawsuit against @DCGco and others into it and told it to take the role of Barry Silbert and move to dismiss the case.

Unfortunately, while the model appears to be insanely intelligent, it output obviously weak arguments because it ended up taking poor-quality source data from poor-quality websites. It relied upon sources like reddit and those summarized articles that attorneys write to drive traffic to their websites and obtain new cases.

The arguments for dismissal were accurate in the context of the websites it relied upon, but upon review I found that those websites often oversimplified the law and missed key points of the actual laws’ texts.

When the model based its arguments upon actual case text, it did put out arguments that seemed like they would hold up to a judge. One of the arguments was exceptional and is a risk that we are aware of.

But except for that one flash of brilliance, I got the impression that the context window of this model is small. It “forgot” key parts of the complaint, so its “good” arguments would not work as a defense.

The first problem – the low quality websites – should be able to be solved with a system prompt explaining what types of websites to avoid. If they already have a system prompt explaining that, it isn’t good enough.

Deep Research is a model that could change the world dramatically with a few minor advances, and we’re probably only months from that.

This is a problem for every internet user, knowing what sources to trust. It makes sense that it would be a problem for DR’s first iteration. I strongly agree that this should improve rapidly over time.

Dan Hendrycks is not impressed when he asks for feedback on a paper draft, finding it repeatedly claiming Dan was saying things he didn’t say, but as he notes this is mainly a complaint about the underlying o3 model. So given how humans typically read AI papers, it’s doing a good job predicting the next token? I wonder how well o3’s misreads correlate with human ones.

With time, you can get a good sense of what parts can be trusted versus what has to be checked, including noticing which parts are too load bearing to risk being wrong.

Gallabytes is unimpressed so far but suspects it’s because of the domain he’s trying.

Gallabytes: so far deep research feels kinda underwhelming. I’m sure this is to some degree a skill issue on my part and to some degree a matter of asking it about domains where there isn’t good literature coverage. was hoping it could spend more time doing math when it can’t find sources.

ok let’s turn this around. what should I be using deep research for? what are some domains where you’ve seen great output? so far ML research ain’t it too sparse (and maybe too much in pdfs? not at all obvious to me that it’s reading beyond the abstracts on arxiv so far).

Carlos: I was procrastinating buying a new wool overcoat, and I hate shopping. So I had it look for one for me and make a page I could reference (the html canvas had to be a follow-up message, for some reason Research’s responses aren’t using even code backticks properly atm) I just got back from the store with my new coat.

Peter Wildeford is not impressed but that’s on a rather impressive curve.

Peter Wildeford: Today’s mood: Using OpenAI Deep Research to automate some of my job to save time to investigate how well OpenAI Deep Research can automate my job.

…Only took me four hours to get to this point, looks like you get 20 deep research reports per day

Tyler John: Keen to learn from your use of the model re: what it’s most helpful for.

Peter Wildeford: I’ll have more detailed takes on my Substack but right now it seems most useful for “rapidly get a basic familiarity with a field/question/problem”

It won’t replace even an RA or fellow at IAPS, but it is great at grinding through 1-2 hours of initial desk research in ~10min.

Tyler John: “it won’t replace even an RA” where did the time go

Peter Wildeford: LOL yeah but the hype is honestly that level here on Twitter right now

It’s good for if you don’t have all the PDFs in advance

The ability to ask follow up questions actually seems sadly lacking right now AFAICT

If you do have the PDFs in advance and have o1-pro and can steer the o1-pro model to do a more in-depth report, then I think Deep Research doesn’t add much more on top of that

It’s all about the data set.

Ethan Mollick: Having access to a good search engine and access to paywalled content is going to be a big deal in making AI research agents useful.

Kevin Bryan: Playing with Operator and both Google’s Deep Research and OpenAI’s, I agree with Ethan: access to gated documents, and a much better inline pdf OCR, will be huge. The Google Books lawsuit which killed it looks like a massive harm to humanity and science in retrospect.

And of course it will need all your internal and local stuff as well.

Note that this could actually be a huge windfall for gated content.

Suppose this integrated the user’s subscriptions, so you got paywalled content if and only if you were paying for it. Credentials for all those academic journals now look a lot more enticing, don’t they? Want the New York Times or Washington Post in your Deep Research? Pay up. Maybe it’s part of the normal subscription. Maybe it’s less. Maybe it’s more.

And suddenly you can get value out of a lot more subscriptions, especially if the corporation is footing the bill.

Arthur B is happy with his first query, disappointed with the one on Tezos where he knows best, and is hoping it’s data quality issues rather than Gell-Mann Amnesia.

Deric Cheong finds it better than Gemini DR on economic policies for a post-AGI society. I checked out the report, which takes place in the strange ‘economic normal under AGI’ idyllic Christmasland that economists insist on as our baseline future, where our worries are purely mundane things like concentration of wealth and power in specific humans and the need to ensure competition.

So you get proposals such as ‘we need to ensure that we have AGIs and AGI offerings competing against each other to maximize profits, that’ll ensure humans come out okay and totally not result by default in at best gradual disempowerment.’ And under ‘drawbacks’ you get ‘it requires global coordination to ensure competition.’ What?

We get all the classics. Universal basic income, robot taxes, windfall taxes, capital gains taxes, ‘workforce retraining and education’ (workforce? Into ‘growth fields’? What are these ‘growth fields’?), shorten the work week, mandatory paid leave (um…), a government infrastructure program, giving workers bargaining power, ‘cooperative and worker ownership’ of what it doesn’t call ‘the means of production,’ data dividends and rights, and many more.

All of which largely comes down to rearranging deck chairs on the Titanic, while the Titanic isn’t sinking and actually looks really sharp but also no one can afford the fare. It’s stuff that matters on the margin but we won’t be operating on the margin, we will be as they say ‘out of distribution.’

Alternatively, it’s a lot of ways of saying ‘redistribution’ over and over with varying levels of inefficiency and social fiction. If humans can retain political power and the ability to redistribute real resources, also known as ‘retain control over the future,’ then there will be more than enough real resources that everyone can be economically fine, whatever status or meaning or other problems they might have. The problem is that the report doesn’t raise that need as a consideration, and if anything the interventions here make that problem harder not easier.

But hey, you ask a silly question, you get a silly answer. None of that is really DR’s fault, except that it accepted the premise. So, high marks!

Nabeel Qureshi: OpenAI’s Deep Research is another instance of “prompts matter more now, not less.” It’s so powerful that small tweaks to the prompt end up having large impacts on the output. And it’s slow, so mistakes cost you more.

I expect we’ll see better ways to “steer” agents as they’re working, e.g. iterative ‘check-ins’ or CoT inspection. Right now it’s very easy for them to go off piste.

It reminds me of the old Yudkowsky point: telling the AI *exactly* what you want is quite hard, especially as the request gets more complex and as the AI gets more powerful.

Someone should get on this, and craft at least a GPT or an instruction you can give to o3-mini-high or o1-pro (or Claude Sonnet 3.6?) that will take your prompt and other explanations, ask you follow-ups if needed, and give you back both a better prompt and a prediction of what to expect, so you can refine from there.
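As a starting point, here is a rough sketch of what such a prompt refiner could look like, using the OpenAI Python SDK. The system prompt wording and the default model name are assumptions for illustration rather than a tested recipe; point it at whichever model you actually use, and spend a cheap call or two here before burning one of your hundred monthly DR queries.

```python
# A rough sketch of a "prompt refiner" using the OpenAI Python SDK. The system prompt
# wording and the default model name are assumptions for illustration; point it at
# whichever model you actually have access to.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REFINER_SYSTEM_PROMPT = """You help users write prompts for a long-running research agent.
Given a rough request, either (a) ask up to three clarifying questions, or (b) if the
request is already clear, return a refined prompt, sources to prioritize or avoid, and
a short prediction of what the resulting report will and will not cover."""

def refine(rough_prompt: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REFINER_SYSTEM_PROMPT},
            {"role": "user", "content": rough_prompt},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Hypothetical example request; iterate with the refiner before spending a DR query.
    print(refine("Compare how different US states regulated AI in their 2024 sessions."))
```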

I strongly disagree with this take:

Noam Brown: @OpenAI Deep Research might be the beginning of the end for Wikipedia and I think that’s fine. We talk a lot about the AI alignment problem, but aligning people is hard too. Wikipedia is a great example of this.

There are problems with Wikipedia, but these two things are very much not substitutes. Here are some facts about Wikipedia that don’t apply to DR and won’t any time soon.

  1. Wikipedia is highly reliable, at least for most purposes.

  2. Wikipedia can be cited as reliable source to others.

  3. Wikipedia is the same for everyone, not sensitive to input details.

  4. Wikipedia is carefully workshopped to be maximally helpful and efficient.

  5. Wikipedia curates the information that is notable, gets rid of the slop.

  6. Wikipedia is there at your fingertips for quick reference.

  7. Wikipedia is the original source, a key part of training data. Careful, Icarus.

And so on. These are very different modalities.

Noam Brown: I’m not saying people will query a Deep Research model every time they want to read about Abraham Lincoln. I think models like Deep Research will eventually be used to pre-generate a bunch of articles that can be stored and read just like Wikipedia pages, but will be higher quality.

I don’t think that is a good idea either. Deep Research is not a substitute for Wikipedia. Deep Research is for when you can’t use Wikipedia, because what you want isn’t notable and is particular, or you need to know things with a different threshold of reliability than Wikipedia’s exacting source standards, and so on. You’re not going to ‘do better’ than Wikipedia at its own job this way.

Eventually, of course, AI will be better at every cognitive task than even the collective of humans, so yes it would be able to write a superior Wikipedia article at that point, or something that serves the same purpose. But at that point, which is fully AGI-complete, we have a lot of much bigger issues to consider, and OAI-DR-1.0 won’t be much of a ‘beginning of the end.’

Another way of putting this is, you’d love a graduate research assistant, but you’d never tell them to write a Wikipedia article for you to read.

Here’s another bold claim.

Sam Altman: congrats to the team, especially @isafulf and @EdwardSun0909, for building an incredible product.

my very approximate vibe is that it can do a single-digit percentage of all economically valuable tasks in the world, which is a wild milestone.

Can Deep Research do 1% of all economically valuable tasks in the world?

With proper usage, I think the answer is yes. But I also would have said the same thing about o1-pro, or Claude Sonnet 3.5, once you give them a little scaffolding.

Poll respondents disagreed, saying it could do between 0.1% and 1% of tasks.

We have Operator. We have two versions of Deep Research. What’s next?

Stephen McAleer (OpenAI): Deep research is emblematic of a new wave of products that will be created by doing high-compute RL on real-world tasks.

If you can Do Reinforcement Learning To It, and it’s valuable, they’ll build it. The question is, what products might be coming soon here?

o1’s suggestions were legal analysis, high-frequency trading, medical diagnosis, supply chain coordination, warehouse robotics, personalized tutoring, customer support, traffic management, code generation and drug discovery.

That’s a solid list. The dream is general robotics but that’s rather a bit trickier. Code generation is the other dream, and that’s definitely going to step up its game quickly.

The main barrier seems to be asking what people actually want.

I’d like to see a much more precise version of DR next. Instead of giving me a giant report, give me something focused. But probably I should be thinking bigger.

Should you pay the $200?

For that price, you now get:

  1. o1-pro.

  2. Unlimited o3-mini-high and o1.

  3. Operator.

  4. 100 queries per month on Deep Research.

When it was only o1-pro, I thought those using it for coding or other specialized tasks where it excels should clearly pay, but it wasn’t clear others should pay.

Now that the package has expanded, I agree with Sam Altman that the value proposition is much improved, and o3 and o3-pro will enhance it further soon.

I notice I haven’t pulled the trigger yet. I know it’s a mistake that I haven’t found ways to want to do this more as part of my process. Just one more day, maybe two, to clear the backlog, I just need to clear the backlog. They can’t keep releasing products like this.

Right?

Depends what counts? As long as it doesn’t need to be cutting edge we should be fine.

Andrew Critch: I hope universal basic income turns out to be enough to pay for a Deep Research subscription.

A story I find myself in often:

Miles Brundage: Man goes to Deep Research, asks for help with the literature on trustworthy AI development.

Deep Research says, “You are in luck. There is a relevant paper by Brundage et al.”

Man: “But Deep Research…”

Get your value!


We’re in Deep Research Read More »

popular-linux-orgs-freedesktop-and-alpine-linux-are-scrambling-for-new-web-hosting

Popular Linux orgs Freedesktop and Alpine Linux are scrambling for new web hosting

Having worked “around the clock” to move from Google Cloud Platform after its open source credits there ran out, and now rushing to move off Equinix, Tissoires suggests a new plan: “[H]ave [freedesktop.org] pay for its own servers, and then have sponsors chip in.”

“Popular without most users knowing it”

Alpine Linux, a small, security-minded distribution used in many containers and embedded devices, also needs a new home quickly. As detailed in its blog, Alpine Linux uses about 800TB of bandwidth each month and also needs continuous integration runners (or separate job agents), as well as a development box. Alpine states it is seeking co-location space and bare-metal servers near the Netherlands, though it will consider virtual machines if bare metal is not feasible.

Like X.org/Freedesktop, Alpine is using this moment as a wake-up call. Responding to Ars, Carlo Landmeter, who serves on Alpine’s council, noted that Alpine Linux is a kind of open source project “that became popular without most users knowing it.” Users are starting to donate, and companies are reaching out to help, but it’s still “early days,” Landmeter wrote.

Every so often, those working at the foundations of open source software experience something that highlights the mismatch between a project’s importance and its support and funding. Perhaps some people or some organizations will do the harder work of finding a sustaining future for these projects.

Ars has reached out to Equinix and X/Freedesktop and will update this post with responses.

Popular Linux orgs Freedesktop and Alpine Linux are scrambling for new web hosting Read More »

it-seems-the-faa-office-overseeing-spacex’s-starship-probe-still-has-some-bite

It seems the FAA office overseeing SpaceX’s Starship probe still has some bite


The political winds have shifted in Washington, but the FAA hasn’t yet changed its tune on Starship.

Liftoff of SpaceX’s seventh full-scale test flight of the Super Heavy/Starship launch vehicle on January 16. Credit: SpaceX

The seventh test flight of SpaceX’s gigantic Starship rocket came to a disappointing end a little more than two weeks ago. The in-flight failure of the rocket’s upper stage, or ship, about eight minutes after launch on January 16 rained debris over the Turks and Caicos Islands and the Atlantic Ocean.

Amateur videos recorded from land, sea, and air showed fiery debris trails streaming overhead at twilight, appearing like a fireworks display gone wrong. Within hours, posts on social media showed small pieces of debris recovered by residents and tourists in the Turks and Caicos. Most of these items were modest in size, and many appeared to be chunks of tiles from Starship’s heat shield.

Unsurprisingly, the Federal Aviation Administration grounded Starship and ordered an investigation into the accident on the day after the launch. This decision came three days before the inauguration of President Donald Trump. Elon Musk’s close relationship with Trump, coupled with the new administration’s appetite for cutting regulations and reducing the size of government, led some industry watchers to question whether Musk’s influence might change the FAA’s stance on SpaceX.

So far, the FAA hasn’t budged on its requirement for an investigation, an agency spokesperson told Ars on Friday. After a preliminary assessment of flight data, SpaceX officials said a fire appeared to develop in the aft section of the ship before it broke apart and fell to Earth.

“The FAA has directed SpaceX to lead an investigation of the Starship Super Heavy Flight 7 mishap with FAA oversight,” the spokesperson said. “Based on the investigation findings for root cause and corrective actions, the FAA may require a company to modify its license.”

This is much the same language the FAA used two weeks ago, when it first ordered the investigation.

Damage report

The FAA’s Office of Commercial Space Transportation is charged with ensuring commercial space launches and reentries don’t endanger the public, and requires launch operators obtain liability insurance or demonstrate financial ability to cover any third-party property damages.

For each Starship launch, the FAA requires SpaceX to maintain liability insurance policies worth at least $500 million for such claims. It’s rare for debris from US rockets to fall over land during a launch; this would typically only happen if a launch failed at certain points in the flight. And there’s no public record of any claims of third-party property damage in the era of commercial spaceflight. Under federal law, the US government would pay for damages up to a much higher amount if any claims exceeded a launch company’s insurance policies.

Here’s a piece of Starship 33 @SpaceX @elonmusk found in Turks and Caicos! 🚀🏝️ pic.twitter.com/HPZDCqA9MV

— @maximzavet (@MaximZavet) January 17, 2025

The good news is there were no injuries or reports of significant damage from the wreckage that fell over the Turks and Caicos. “The FAA confirmed one report of minor damage to a vehicle located in South Caicos,” an FAA spokesperson told Ars on Friday. “To date, there are no other reports of damage.”

It’s not clear if the vehicle owner in South Caicos will file a claim against SpaceX for the damage. It would be the first time someone has made such a claim related to an accident with a commercial rocket overseen by the FAA. Last year, a Florida homeowner submitted a claim to NASA for damage to his house from a piece of debris that fell from the International Space Station.

Nevertheless, the Turks and Caicos government said local officials met with representatives from SpaceX and the UK Air Accident Investigations Branch on January 25 to develop a recovery plan for debris that fell on the islands, which are a British Overseas Territory.

A prickly relationship

Musk often bristled at the FAA last year, especially after regulators proposed fines of more than $600,000 alleging that SpaceX violated terms of its launch licenses during two Falcon 9 missions. The alleged violations involved the relocation of a propellant farm at one of SpaceX’s launch pads in Florida, and the use of a new launch control center without FAA approval.

In a post on X, Musk said the FAA was conducting “lawfare” against his company. “SpaceX will be filing suit against the FAA for regulatory overreach,” Musk wrote.

There was no such lawsuit, and the issue may now be moot. Sean Duffy, Trump’s new secretary of transportation, vowed to review the FAA fines during his confirmation hearing in the Senate. It is rare for the FAA to fine launch companies, and the fines last year made up the largest civil penalty ever imposed by the FAA’s commercial spaceflight division.

SpaceX also criticized delays in licensing Starship test flights last year. The FAA cited environmental issues and concerns about the extent of the sonic boom from Starship’s 23-story-tall Super Heavy booster returning to its launch pad in South Texas. SpaceX successfully caught the returning first stage booster at the launch pad for the first time in October, and repeated the feat after the January 16 test flight.

What separates the FAA’s ongoing oversight of Starship’s recent launch failure from these previous regulatory squabbles is that debris fell over populated areas. This would appear to be directly in line with the FAA’s responsibility for public safety.

During last month’s test flight, Starship did not deviate from its planned ground track, which took the rocket over the Gulf of Mexico, the waters between Florida and Cuba, and then the Atlantic Ocean. But the debris field extended beyond the standard airspace closure for the launch. After the accident, FAA air traffic controllers cleared additional airspace over the debris zone for more than an hour, rerouting, diverting, and delaying dozens of commercial aircraft.

These actions followed pre-established protocols. However, the episode highlighted the small but non-zero risk of rocket debris falling to Earth after a launch failure. “The potential for a bad day downrange just got real,” Lori Garver, a former NASA deputy administrator, posted on X.

Public safety is not the sole mandate of the FAA’s commercial space office. It is also chartered to “encourage, facilitate, and promote commercial space launches and reentries by the private sector,” according to an FAA website. There’s a balance to strike.

Lawmakers last year urged the FAA to speed up its launch approvals, primarily because Starship is central to strategic national objectives. NASA has contracts with SpaceX to develop a variant of Starship to land astronauts on the Moon, and Starship’s unmatched ability to deliver more than 100 tons of cargo to low-Earth orbit is attractive to the Pentagon.

While Musk criticized the FAA in 2024, SpaceX officials in 2023 took a different tone, calling for Congress to increase the budget for the FAA’s Office of Commercial Spaceflight and for the regulator to double the space division’s workforce. This change, SpaceX officials argued, would allow the FAA to more rapidly assess and approve a fast-growing number of commercial launch and reentry applications.

In September, SpaceX released a statement accusing the former administrator of the FAA, Michael Whitaker, of making inaccurate statements about SpaceX to a congressional subcommittee. In a different post on X, Musk directly called for Whitaker’s resignation.

He needs to resign https://t.co/pG8htfTYHb

— Elon Musk (@elonmusk) September 25, 2024

That’s exactly what happened. Whitaker, who took over the FAA’s top job in 2023 under the Biden administration, announced in December he would resign on Inauguration Day. Since the agency’s establishment in 1958, three FAA administrators have similarly resigned when a new administration takes power, but the office has been largely immune from presidential politics in recent decades. Since 1993, FAA administrators have stayed in their post during all presidential transitions.

There’s no evidence Whitaker’s resignation had any role in the mid-air collision of an American Eagle passenger jet and a US Army helicopter Wednesday night near Ronald Reagan Washington National Airport. But his departure from the FAA less than two years into a five-year term on January 20 left the agency without a leader. Trump named Chris Rocheleau as the FAA’s acting administrator Thursday.

Next flight, next month?

SpaceX has not released an official schedule for the next Starship test flight or outlined its precise objectives. However, it will likely repeat many of the goals planned for the previous flight, which ended before SpaceX could accomplish some of its test goals. These missed objectives included the release of satellite mockups in space for the first demonstration of Starship’s payload deployment mechanism, and a reentry over the Indian Ocean to test new, more durable heat shield materials.

The January 16 test flight was the first launch of an upgraded, slightly taller Starship, known as Version 2 or Block 2. The next flight will use the same upgraded version.

A SpaceX filing with the Federal Communications Commission suggests the next Starship flight could launch as soon as February 24. Sources told Ars that SpaceX teams believe a launch before the end of February is realistic.

But SpaceX has more to do before Flight 8. These tasks include completing the FAA-mandated investigation and the installation of all 39 Raptor engines on the rocket. Then, SpaceX will likely test-fire the booster and ship before stacking the two elements together to complete assembly of the 404-foot-tall (123.1-meter) rocket.

SpaceX is also awaiting a new FAA launch license, pending its completion of the investigation into what happened on Flight 7.


Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.

It seems the FAA office overseeing SpaceX’s Starship probe still has some bite Read More »

to-help-ais-understand-the-world,-researchers-put-them-in-a-robot

To help AIs understand the world, researchers put them in a robot


There’s a difference between knowing a word and knowing a concept.

Large language models like ChatGPT display conversational skills, but the problem is they don’t really understand the words they use. They are primarily systems that interact with data obtained from the real world but not the real world itself. Humans, on the other hand, associate language with experiences. We know what the word “hot” means because we’ve been burned at some point in our lives.

Is it possible to get an AI to achieve a human-like understanding of language? A team of researchers at the Okinawa Institute of Science and Technology built a brain-inspired AI model comprising multiple neural networks. The AI was very limited—it could learn a total of just five nouns and eight verbs. But their AI seems to have learned more than just those words; it learned the concepts behind them.

Babysitting robotic arms

“The inspiration for our model came from developmental psychology. We tried to emulate how infants learn and develop language,” says Prasanna Vijayaraghavan, a researcher at the Okinawa Institute of Science and Technology and the lead author of the study.

The idea of teaching AIs the same way we teach little babies is not new—it has been applied to standard neural nets that associated words with visuals, and researchers have also tried teaching an AI using a video feed from a GoPro strapped to a human baby. The problem is that babies do far more than just associate items with words when they learn. They touch everything—grasp things, manipulate them, throw stuff around—and this way, they learn to think and plan their actions in language. An abstract AI model couldn’t do any of that, so Vijayaraghavan’s team gave one an embodied experience: their AI was trained in an actual robot that could interact with the world.

Vijayaraghavan’s robot was a fairly simple system with an arm and a gripper that could pick objects up and move them around. Vision was provided by a simple RGB camera feeding video at a somewhat crude 64×64 pixel resolution.

 The robot and the camera were placed in a workspace, put in front of a white table with blocks painted green, yellow, red, purple, and blue. The robot’s task was to manipulate those blocks in response to simple prompts like “move red left,” “move blue right,” or “put red on blue.” All that didn’t seem particularly challenging. What was challenging, though, was building an AI that could process all those words and movements in a manner similar to humans. “I don’t want to say we tried to make the system biologically plausible,” Vijayaraghavan told Ars. “Let’s say we tried to draw inspiration from the human brain.”

Chasing free energy

The starting point for Vijayaraghavan’s team was the free energy principle, a hypothesis that the brain constantly makes predictions about the world based on internal models, then updates these predictions based on sensory input. The idea is that we first think of an action plan to achieve a desired goal, and then this plan is updated in real time based on what we experience during execution. This goal-directed planning scheme, if the hypothesis is correct, governs everything we do, from picking up a cup of coffee to landing a dream job.
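For readers who want the formal version: the hypothesis is usually stated in terms of variational free energy, a quantity the brain is assumed to reduce both by updating its internal model and by acting on the world. The notation below is the standard textbook form, not something taken from the paper itself.

```latex
% Standard form of variational free energy (textbook notation, not the paper's own):
%   o = sensory observations, s = hidden causes, q(s) = the brain's current belief about s.
F = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
  = \underbrace{D_{\mathrm{KL}}\big[q(s)\,\|\,p(s \mid o)\big]}_{\text{mismatch between belief and reality}}
  \;-\; \underbrace{\ln p(o)}_{\text{log evidence}}
```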

All that is closely intertwined with language. Neuroscientists at the University of Parma found that motor areas in the brain got activated when the participants in their study listened to action-related sentences. To emulate that in a robot, Vijayaraghavan used four neural networks working in a closely interconnected system. The first was responsible for processing visual data coming from the camera. It was tightly integrated with a second neural net that handled proprioception: all the processes that ensured the robot was aware of its position and the movement of its body. This second neural net also built internal models of actions necessary to manipulate blocks on the table. Those two neural nets were additionally hooked up to visual memory and attention modules that enabled them to reliably focus on the chosen object and separate it from the image’s background.

The third neural net was relatively simple and processed language using vectorized representations of those “move red right” sentences. Finally, the fourth neural net worked as an associative layer and predicted the output of the previous three at every time step. “When we do an action, we don’t always have to verbalize it, but we have this verbalization in our minds at some point,” Vijayaraghavan says. The AI he and his team built was meant to do just that: seamlessly connect language, proprioception, action planning, and vision.
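To make the wiring easier to picture, here is a minimal sketch in PyTorch of four modules connected in roughly this way. The layer sizes, module names, and fusion scheme are illustrative assumptions and do not reproduce the architecture the team actually used.

```python
# A minimal, illustrative sketch of a four-module architecture of this general shape.
# Module names, sizes, and the fusion scheme are assumptions for illustration only.
import torch
import torch.nn as nn

class VisionNet(nn.Module):          # processes 64x64 RGB camera frames
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(out_dim)

    def forward(self, img):
        return self.fc(self.conv(img))

class ProprioNet(nn.Module):         # encodes joint angles and gripper state
    def __init__(self, joints=7, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(joints, 64), nn.ReLU(), nn.Linear(64, out_dim))

    def forward(self, state):
        return self.net(state)

class LanguageNet(nn.Module):        # encodes a short command like "put red on blue"
    def __init__(self, vocab=32, out_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, 32)
        self.rnn = nn.GRU(32, out_dim, batch_first=True)

    def forward(self, tokens):
        _, h = self.rnn(self.embed(tokens))
        return h[-1]

class AssociativeNet(nn.Module):     # fuses the three streams, predicts the next action
    def __init__(self, action_dim=7):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(128 + 64 + 64, 256), nn.ReLU())
        self.action_head = nn.Linear(256, action_dim)

    def forward(self, v, p, l):
        return self.action_head(self.fuse(torch.cat([v, p, l], dim=-1)))

# Usage sketch: one timestep of the perception-to-action loop.
vision, proprio, lang, assoc = VisionNet(), ProprioNet(), LanguageNet(), AssociativeNet()
img = torch.rand(1, 3, 64, 64)                 # camera frame
state = torch.rand(1, 7)                       # joint/gripper state
tokens = torch.randint(0, 32, (1, 4))          # tokenized command
next_action = assoc(vision(img), proprio(state), lang(tokens))
```

Even in this toy form, the key design choice is visible: vision, body state, and language are encoded separately and only meet in an associative layer that predicts the next action.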

When the robotic brain was up and running, they started teaching it some of the possible combinations of commands and sequences of movements. But they didn’t teach it all of them.

The birth of compositionality

In 2016, Brenden Lake, a professor of psychology and data science, published a paper in which his team named a set of competencies machines need to master to truly learn and think like humans. One of them was compositionality: the ability to compose or decompose a whole into parts that can be reused. This reuse lets them generalize acquired knowledge to new tasks and situations. “The compositionality phase is when children learn to combine words to explain things. They [initially] learn the names of objects, the names of actions, but those are just single words. When they learn this compositionality concept, their ability to communicate kind of explodes,” Vijayaraghavan explains.

The AI his team built was made for this exact purpose: to see if it would develop compositionality. And it did.

Once the robot learned how certain commands and actions were connected, it also learned to generalize that knowledge to execute commands it had never heard before, recognizing the names of actions it had not performed and then performing them on combinations of blocks it had never seen. Vijayaraghavan’s AI figured out the concept of moving something to the right or the left, or putting an item on top of something. It could also combine words to name previously unseen actions, like putting a blue block on a red one.

While teaching robots to extract concepts from language has been done before, those efforts focused on making them understand how words were used to describe visuals. Vijayaraghavan built on that to include proprioception and action planning, basically adding a layer that integrated sense and movement into the way his robot made sense of the world.

But some issues have yet to be overcome. The AI had a very limited workspace. There were only a few objects, and all had a single, cubical shape. The vocabulary included only names of colors and actions, so no modifiers, adjectives, or adverbs. Finally, the robot had to learn around 80 percent of all possible combinations of nouns and verbs before it could generalize well to the remaining 20 percent. Its performance was worse when those ratios dropped to 60/40 and 40/60.

But it’s possible that just a bit more computing power could fix this. “What we had for this study was a single RTX 3090 GPU, so with the latest generation GPU, we could solve a lot of those issues,” Vijayaraghavan argued. That’s because the team hopes that adding more words and more actions won’t result in a dramatic need for computing power. “We want to scale the system up. We have a humanoid robot with cameras in its head and two hands that can do way more than a single robotic arm. So that’s the next step: using it in the real world with real world robots,” Vijayaraghavan said.

Science Robotics, 2025. DOI: 10.1126/scirobotics.adp0751


Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.

To help AIs understand the world, researchers put them in a robot Read More »