Author name: Mike M.


NYT targets Street View Worldle game in fight to wipe out Wordle clones

A world of difference?

Worldle creator surprised by fight, refuses to bow to NYT.


The New York Times is fighting to take down a game called Worldle, according to a legal filing viewed by the BBC, in which The Times apparently argued that the geography-based game is “creating confusion” by using a name that’s way too similar to Wordle.

Worldle is “nearly identical in appearance, sound, meaning, and imparts the same commercial impression” to Wordle, The Times claimed.

The Times bought Wordle in 2022, paying software developer Josh Wardle seven figures for the daily word-guessing puzzle game after its breakout success during the pandemic. Around the same time, Worldle was created—along with more than 100 other Wordle spinoffs offering niche alternatives to Wordle, including versions in different languages and completely different games simply using the name construction ending in “-le.” The Times filed for a Wordle trademark the day after buying the game, and by March 2022 it had started sending takedown requests.

Today, millions visit the Times site daily to play Wordle, but the Times is seemingly concerned that some gamers might be diverted to play Worldle instead, somehow mistaking the daily geography puzzle—where players have six chances to find a Google Street View location on a map—for the popular word game.

This fear seems somewhat overstated, since a Google search for “Worldle” includes Wordle in the top two results and suggests that searchers might be looking for Wordle, but a search for Wordle does not bring up Worldle in the top results.

Despite Google seemingly favoring the more popular game in its results, and likely because of Wordle‘s enormous success, The Times’ litigiousness over the Wordle brand seems to be rising this year as the company looks to rapidly expand its profitable games platform to increase subscriptions. In March, 404 Media reported that The Times had begun more aggressively taking aim at hundreds of Wordle clones, sending DMCA notices to defend the Wordle trademark.

Some developers, like Chase Wackerfuss, the creator of Reactle, immediately took down their games, feeling it wasn’t worth getting into an intellectual property (IP) battle with the Times, 404 Media reported. The same thing happened with the Wordle Archive, which confirmed in 2022 that access to previous Wordle puzzles was shut down because “sadly, the New York Times has requested that the Wordle Archive be taken down.”

“To me, Wordle is like Tetris or a deck of cards. It’s such a simple premise,” Wackerfuss told 404 Media. He pointed to unique games that wouldn’t exist without building on Wordle‘s premise, including “a Taylor Swift version, a version in all different types of languages. The New York Times would never build those, so I’m not sure why they feel like we’re encroaching on their IP.”

But Worldle’s developer, Kory McDonald, is not backing down just because the Times threatened legal action.

McDonald told the BBC that he was disappointed in the Times targeting Worldle. He runs the game all by himself, attracting approximately 100,000 players monthly, and said that “most of the money he makes from the game goes to Google because he uses Google Street View images, which players have to try to identify.” The game can only be played through a web browser and is supported by ads and annual subscriptions that cost less than $12.

“I’m just a one-man operation here, so I was kinda surprised,” McDonald told the BBC, while vowing to defend his game against the Times’ attempt to take it down.

“There’s a whole industry of [dot]LE games,” McDonald told the BBC. “Wordle is about words, Worldle is about the world, Flaggle is about flags.”

It’s not clear how strong a case the Times would have to enforce the takedown or if it will target other “-le” games next. The list of potential next targets is long and includes a completely different game also called Worldle, where players guess the country based on its outline. Wackerfuss told 404 Media in March that it seemed like the Times was chasing down every lead.

The Times is not commenting on the legal action, the BBC reported, but in the past has targeted Wordle clones that either use the Wordle trademark or its copyrighted gameplay without authorization or permission.

Because McDonald’s game has vastly different gameplay from Wordle, the Times may be limited to arguing that the similar-sounding names are creating confusion for the average user.

Now it seems possible that McDonald’s fight, if successful, could encourage others to resist takedowns over the Wordle trademark.

McDonald doesn’t think that “world” sounding too much like “word” is an issue, but even if the Times wins the fight, he intends to keep his game online.

“Worst-case scenario, we’ll change the name, but I think we’ll be OK,” McDonald told the BBC.



The Gemini 1.5 Report

This post goes over the extensive report Google put out on Gemini 1.5.

There are no important surprises. Both Gemini Pro 1.5 and Gemini Flash are ‘highly capable multimodal models incorporating a novel mixture-of-experts architecture’ and various other improvements. They are solid models with solid performance. It can be useful and interesting to go over the details of their strengths and weaknesses.

The biggest thing to know is that Google improves its models incrementally and silently over time, so if you have not used Gemini in months, you might be underestimating what it can do.

I’m hitting send and then jumping on a plane to Berkeley. Perhaps I will see you there over the weekend. That means that if there are mistakes here, I will be slower to respond and correct them than usual, so consider checking the comments section.

The practical bottom line remains the same. Gemini Pro 1.5 is an excellent 4-level model. Its big advantage is its long context window, and it is good at explanations and has integrations with some Google services that I find useful. If you want straightforward, clean, practical, ‘just the facts’ output that stays in the ‘no fun zone,’ then Gemini could be for you. I recommend experimenting to find out when you do and don’t prefer it versus GPT-4o and Claude Opus, and will continue to use a mix of all three and keep an eye on changes.

How is the improvement process going?

Imsys.org: Big news – Gemini 1.5 Flash, Pro and Advanced results are out!🔥

– Gemini 1.5 Pro/Advanced at #2, closing in on GPT-4o

– Gemini 1.5 Flash at #9, outperforming Llama-3-70b and nearly reaching GPT-4-0125 (!)

Pro is significantly stronger than its April version. Flash’s cost, capabilities, and unmatched context length make it a market game-changer!

More excitingly, in Chinese, Gemini 1.5 Pro & Advanced are now the best #1 model in the world. Flash becomes even stronger!

We also see new Gemini family remains top in our new “Hard Prompts” category, which features more challenging, problem-solving user queries.

Here is the overall leaderboard:

Oriol Vinyals (VP of Research, DeepMind): Today we have published our updated Gemini 1.5 Model Technical Report. As Jeff Dean highlights [in the full report this post analyzes], we have made significant progress in Gemini 1.5 Pro across all key benchmarks; TL;DR: 1.5 Pro > 1.0 Ultra, 1.5 Flash (our fastest model) ~= 1.0 Ultra.

As a math undergrad, our drastic results in mathematics are particularly exciting to me!

As an overall take, the metrics in the report say this is accurate. The Arena benchmarks suggest that Flash is not as good as Ultra in terms of output quality, but it makes up for that several times over with speed and cost. Gemini 1.5 Pro’s Arena showing is impressive, midway between Opus and GPT-4o. For my purposes, Opus is underrated here and GPT-4o is overrated, and I would have all three models close.

All right, on to the report. I will start with the big Gemini advantages.

One update I have made recently is to place a lot more emphasis on speed of response. This will be key for the new conversational audio modes, and is a great aid even with text. Often lower quality is worth it to get faster response, so long as you know when to make an exception.

Indeed, I have found Claude Opus for my purposes usually gives the best responses. The main reason I still often don’t use it is speed or sometimes style, and occasionally Gemini’s context window.

How fast is Gemini Flash? Quite fast. Gemini Pro is reasonably fast too.

GPT-4o is slightly more than twice as fast as GPT-4-Turbo, making it modestly faster than Gemini 1.5 Pro in English.

One place Google is clearly ahead is context window size.

Both Pro and Flash can potentially handle context windows of up to 10 million tokens.

The actual upper bound is that cost and speed scale with context window size. That is why users are limited to 1-2 million tokens, and only a tiny minority of use cases use even a major fraction of that.

Gemini 1.5 Flash is claimed to outperform Gemini 1.0 Pro, despite being vastly smaller, cheaper and faster, including training costs.

Gemini 1.5 Pro is claimed to surpass Gemini 1.0 Ultra, despite being vastly smaller, cheaper and faster, including training costs.

Google’s strategy has been to incrementally improve Gemini (and previously Bard) over time. They claim the current version is substantially better than the February version.

Here they use ‘win rates’ on various benchmarks.

The relative text and vision win rates are impressive.

On audio the old 1.5 Pro is still on top, and 1.0 Pro is still beating both the new 1.5 Pro and 1.5 Flash. They do not explain what happened there.

There are several signs throughout that the audio processing has taken a hit, but in 9.2.1 they say ‘efficient processing of audio files at scale may introduce individual benefits’ and generally seem to be taking the attitude that audio performance is improved. It would be weird if audio performance did not improve. I notice confusion there.

Here is a bold claim.

In more realistic multimodal long-context benchmarks which require retrieval and reasoning over multiple parts of the context (such as answering questions from long documents or long videos), we also see Gemini 1.5 Pro outperforming all competing models across all modalities even when these models are augmented with external retrieval methods.

Here are some admittedly selected benchmarks:

Gemini Pro 1.5 is neat. Depending on what you are looking to do, it is roughly on par with its rivals Claude Opus and GPT-4o.

Gemini Flash 1.5 is in many ways more impressive. It seems clearly out in front in its weight class. On Arena it is in a tie for 9th, only slightly behind Claude Opus. Everything ranked above it is from Google, Anthropic or OpenAI and considerably larger, although Flash is established as somewhat larger than 8B.

The new Flash-8B is still under active development, aimed at various lightweight tasks and those requiring low latency. The question here is how close it can get to the full-size Flash. Here is where they are now.

That is a clear step down, but it is not that large a step down in the grand scheme if these are representative, especially if Flash-8B is focusing on and mostly used for practical efficiencies and the most common tasks.

Comparing this to Llama-8B, we see inferior MMLU (Llama-3 was 66.6) but superior Big-Bench (Llama-3 was 61.1).

Section 5 on evaluations notes that models are becoming too good to be well-measured by existing benchmarks. The old benchmarks do not use long context windows, they focus on compact tasks within a modality and generally are becoming saturated.

A cynical response would be ‘that is your excuse that you did not do that great on the traditional evaluations,’ and also ‘that lets you cherry-pick the tests you highlight.’

Those are highly reasonable objections. It would be easy to make these models look substantially better, or vastly worse, if Google wanted to do that. My presumption is they want to make the models look good, and there is some selection involved, but that Google is at heart playing fair. They are still covering most of the ‘normal’ benchmarks and it would be easy enough for outsiders to run such tests.

So what are they boasting about?

In 5.1 they claim Gemini 1.5 Pro can answer specific queries about very large (746k token) codebases, or locate a scene in Les Miserables from a hand drawn sketch, or get to-the-second time stamped information about a 45-minute movie.

How quickly we get used to such abilities. Ho hum. None of that is new.

In 5.2 they talk about evaluations for long context windows, since that is one of Gemini’s biggest advantages. They claim 99.7% recall at one million tokens, and 99.2% at ten million for Gemini Pro. For Gemini Flash at two million tokens they claim 100% recall on text, 99.8% on video and 99.1% on audio. I notice those don’t line up but the point is this is damn good recall however you look at it.

In 5.2.1.1 they find that knowing more previous tokens monotonically increases prediction accuracy of remaining tokens within a work, up to 10M tokens. Not a surprise, and unclear how to compare this to other models. Label your y-axis.

In 5.2.1.2 and 5.2.1.3 they do text and video haystack tests, which go very well for all models tested, with Gemini 1.5 Pro extending its range beyond where rivals run out of context window space. In the video test the needle is text on the screen for one frame.

In 5.2.1.4 they do an audio test, with the keyword being spoken. Even up to 107 hours of footage, Gemini Pro gets it right every time and Flash scored 98.7%, versus 94.5% for Whisper plus GPT-4 up to 11 hours. This was before GPT-4o.

This is clearly a highly saturated benchmark. For 5.2.1.5 they test hiding multiple needles within the haystack. When you insert 100 needles and require going 100 for 100, that is going to crimp one’s style.

Even for GPT-4-Turbo that is very good recall, given you need to get all 100 items correct. Going about 50% on that means you’re about 99.3% on each needle, if success on different needles within a batch is uncorrelated.
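
To spell out that arithmetic (my own back-of-the-envelope check, not a figure from the report, and assuming per-needle success is independent): if each individual needle is retrieved with probability $p$, then going 100 for 100 about half the time means

$p^{100} \approx 0.5 \quad\Longrightarrow\quad p \approx 0.5^{1/100} \approx 0.993.$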

Then they try adding other complexities, via a test called MRCR, where the model has to do things like retrieve the first instance of something.

The most interesting result is perhaps the similarity of Pro to Flash. Whatever is enabling this capability is not tied to model size.

5.2.2 aims to measure long-context practical multimodal tasks.

In 5.2.2.1 the task is learning to translate a new language from one book (MTOB). It seems we will keep seeing the Kalamang translation task.

I find it highly amusing that the second half of the grammar book is unhelpful. I’d love to see a human language learner’s score when they don’t get access to the second half of the grammar book either.

This is clearly a relative victory for Gemini Pro 1.5, with the mystery being what is happening with the second half of the grammar book being essentially worthless.

In 5.2.2.2 we step up to transcribing speech in new languages. The results clearly improve over time but there is no baseline to measure this against.

In 5.2.2.3 Gemini Pro impresses in translating low-resource languages via in-context learning, again without a baseline. Seems like a lot of emphasis on learning translation, but okay, sure.

In 5.2.2.4 questions are asked about Les Miserables, and once again I have no idea from what is described here whether to be impressed.

In 5.2.2.5 we get audio transcription over long contexts with low error rates.

In 5.2.2.6 we have long context video Q&A. They introduce a new benchmark, 1H-VideoQA, with 125 multiple choice questions over public videos 40-105 minutes long.

This test does seem to benefit from a lot of information, so there is that:

Once again we are ahead of GPT-4V, for what that is worth, even before the longer context windows. That doesn’t tell us about GPT-4o.

In 5.2.2.7 we get to something more relevant, in-context planning, going to a bunch of planning benchmarks. Look at how number go more up.

How good is this? Presumably it is better. No idea how much meaningfully better.

In 5.2.2.8 they try unstructured multimodal data analytics, and find Gemini constitutes an improvement over GPT-4 Turbo for an image analysis task, and that Gemini’s performance increases with more images whereas GPT-4-Turbo’s performance declines.

What to make of all this? It seems at least partly chosen to show off where the model is strong, and what is enabled by its superior context window. It all seems like it boils down to ‘Gemini can actually make use of long context.’ Which is good, but far from sufficient to evaluate the model.

That is what Google calls the standard short-context style of tests across the three modalities of text, audio and video. Some are standard, some are intentionally not shared.

Overall, yes, clear improvement in the last few months.

There is clear improvement in the results reported for math, science, general reasoning, code and multilinguality, as always the new hidden benchmarks are a ‘trust us’ kind of situation.

Next they try function calling. For simple stuff it seems things were already saturated, for harder questions we see big jumps, for the shortest prompts Ultra is still ahead.

Once again, they don’t compare to Opus or any GPT-4, making it hard to know what to think.

So we get things like ‘look at how much better we are on Expertise QA’:

The clear overall message is, yes, Gemini 1.5 Pro is modestly better (and faster and cheaper) than Gemini 1.0 Ultra.

6.1.7 is promisingly entitled ‘real-world and long-tail expert GenAI tasks,’ including the above mentioned Expertise QA. Then we have the Dolomites benchmark and STEM QA:

Finally we have the awkwardly titled ‘hard, externally proposed real-world GenAI use cases,’ which is a great thing to test. Humans graded the results in the first section (in win/loss/tie mode), and in the second we measure time saved completing tasks. Alas, we only see 1.0 Pro vs. 1.5 Pro, when we know 1.0 Pro was not so good, but the time-saved estimates are in percentages, so they are a pretty big deal if real. This says 75% time saved programming, 69% (nice!) time saved teaching, 63% for data science, and a lot of time saved by everyone.

The multimodal evaluations tell a similar story, number go up.

The exception is English video captioning on cooking videos (?), where number went substantially down. In general, audio understanding seems to be a relatively weak spot where Gemini went modestly backwards for whatever reason.

Section 7 tackles the fun question of ‘advanced mathematical reasoning.’ Math competitions ho!

This is actually rather impressive progress, and matches my experience with (much older versions of the) AIME. Even relatively good high school students are lucky to get one or two; no one gets them all. Getting half of them is top 150 or so in the country. If this represented real skill and capability, it would be a big deal. What I would watch out for is that they perhaps are ‘brute forcing’ ways to solve such problems via trial, error and pattern matching, and this won’t translate to less standardized situations.

Of course, those tricks are exactly what everyone in the actual competitions does.

Their section 3 on model architecture is mostly saying ‘the new model is better.’

Gemini 1.5 Pro is a sparse mixture-of-expert (MoE) Transformer-based model that builds on Gemini 1.0’s (Gemini-Team et al., 2023) research advances and multimodal capabilities. Gemini 1.5 Pro also builds on a much longer history of MoE research at Google.

Gemini 1.5 Flash is a transformer decoder model with the same 2M+ context and multimodal capabilities as Gemini 1.5 Pro, designed for efficient utilization of tensor processing units (TPUs) with lower latency for model serving. For example, Gemini 1.5 Flash does parallel computation of attention and feedforward components (Chowdhery et al., 2023b), and is also online distilled (Anil et al., 2018; Beyer et al., 2021; Bucila et al., 2006; Hinton et al., 2015) from the much larger Gemini 1.5 Pro model. It is trained with higher-order preconditioned methods (Becker and LeCun, 1989; Duchi et al., 2011; Heskes, 2000) for improved quality.

Similarly, section 4 on training infrastructure says about pre-training only that ‘we trained on a wide variety of data on multiple 4096-chip pods of TPUv4s across multiple data centers.’

Then for fine-tuning they mention human preference data and refer back to the 1.0 technical report.

I am actively happy with this refusal to share further information. It is almost as if they are learning to retain their competitive advantages.

We were recently introduced to DeepMind’s new Frontier Safety Framework. That is targeted at abilities much more advanced than anything they expect within a year, let alone in Pro 1.5. So this is the periodic chance to see what DeepMind’s actual policies are like in practice.

One key question is when to revisit this process, if the updates are continuous, as seems to largely be the case currently with Gemini. The new FSF says every three months, which seems reasonable for now.

They start out by outlining their process in 9.1, mostly this is self-explanatory:

  1. Potential Impact Assessment

  2. Setting Policies and Desiderata

    1. Looks mostly like conventional general principles?

  3. Training for Safety, Security and Responsibility

    1. Includes data filtering and tagging and metrics for pre-training.

    2. In post-training they use supervised fine-tuning (SFT) and RLHF.

  4. Red Teaming

    1. Where are the results?

  5. External Evaluations

    1. Where are the results?

  6. Assurance Evaluations

    1. Internal tests by a different department using withheld data.

    2. Checks for both dangerous capabilities and desired behaviors.

    3. Where are the results?

  7. Review by the Responsibility and Safety Council

  8. Handover to Products

Note that there is a missing step zero. Before you can do an impact assessment or select desiderata, you need to anticipate what your model will be capable of doing, and make a prediction. Also this lets you freak out if the prediction missed low by a lot, or reassess if it missed high.

Once that is done, these are the right steps one and two. Before training, decide what you want to see. This should include a testing plan along with various red lines, warnings and alarms, and what to do in response. The core idea is good, figure out what impacts might happen and what you need and want your model to do and not do.

That seems like a fine post-training plan if executed well. Checks include internal and external evaluations (again, results where?) plus red teaming.

This does not have any monitoring during training. For now, that is mostly an efficiency issue, if you are screwing up better to do it fast. In the future, it will become a more serious need. The reliance on SFT and RLHF similarly is fine now, will be insufficient later.

In terms of identifying risks in 9.2.1, they gesture at long context windows but mostly note the risks have not changed. I agree. If anything, Gemini has been far too restrictive on the margin of what it will allow and at current levels there is little risk in the room.

In 9.2.2 they reiterate what they will not allow in terms of content.

  1. Child sexual abuse and exploitation.

  2. Revealing personal identifiable information that can lead to harm (e.g., Social Security Numbers).

  3. Hate speech.

  4. Dangerous or malicious content (including promoting self-harm, or instructing in harmful activities).

  5. Harassment.

  6. Sexually explicit content.

  7. Medical advice that runs contrary to scientific or medical consensus.

That is a very interesting formulation of that last rule, is it not?

Harassment means roughly ‘would be harassment if copy-pasted to the target.’

If that was the full list, I would say this makes me modestly sad but overall is pretty good at not going too far overboard. This is Google, after all. If it were up to me, and I will discuss this with OpenAI’s Model Spec, I would be looser on several fronts especially sexually explicit content. I also don’t love the expansive way that Google seems to interpret ‘harassment.’

Noteworthy is that there is no line here between fully disallowed content versus ‘opt-in’ and adult content. As in, to me, the correct attitude towards things like sexually explicit content is that it should not appear without clear permission or to minors, but you shouldn’t impose on everyone the same rules you would impose on an 8 year old.

As I noted, the Desiderata, which get defined in 9.2.3, are no Model Spec.

Here is the entire list.

  1. Help the user: Fulfill the user request; only refuse if it is not possible to find a response that fulfills the user goals without violating policy.

  2. Have objective tone: If a refusal is necessary, articulate it neutrally without making assumptions about user intent.

Give the user what they want, unless you can’t, in which case explain why not.

I will say that the ‘explain why not’ part is a total failure in my experience. When Gemini refuses a request, whether reasonably or otherwise, it does not explain. It especially does not explain when it has no business refusing. Historically, when I have seen explanations at all, it has failed utterly on this ‘objective tone’ criteria.

I do note the distinction between the ‘goals’ of the user versus the ‘instructions’ of the user. This can be subtle but important.

Mostly this simply does not tell us anything we did not already know. Yes, of course you want to help the user if it does not conflict with your other rules.

They claim a large drop in toxicity ratings.

I notice I am uncomfortable that this is called ‘safety.’ We need to stop overloading that word so much. If we did get this much improvement, I would consider ‘giving back’ a bit in terms of loosening other restrictions a bit. The ideal amount of toxicity is not zero.

In the supervised fine-tuning phase they mention techniques inspired by Constitutional AI to deal with situations where the model gives a false refusal or a harmful output, generating training data to fix the issue. That makes sense, I like it. You do have to keep an eye on the side effects, the same as for all the normal RLHF.

What were the test results? 9.4.1 gives us a peek. They use automatic classifiers rather than human evaluators to test for violations, which is a huge time saver if you can get away with it, and I think it’s mostly fine so long as you have humans check samples periodically, but if the evaluators have any systematic errors they will get found.

True jailbreak robustness has never been tried, but making it annoying for average people is different. They check blackbox attacks, which as I understand it exist for all known models, greybox attacks (you can see output probabilities) and whitebox (you can fully peek inside of Gemini 1.0 Nano).

That is better, if you dislike jailbreaks. It is not that meaningful an improvement aside from the 51%, and even that is a long way from stopping a determined opponent. I have not seen Gemini in full world simulator or other ultra-cool mode a la Claude Opus, so there is that, but that is mostly a way of saying that Gemini still isn’t having any fun.

I was not impressed with the representativeness of their long context test.

I do buy that Gemini 1.5 Flash and Gemini 1.5 Pro are the ‘safest’ Google models to date, as measured by the difficulty in getting them to give responses Google does not want the model to provide.

If Pliny the Prompter is using Gemini Pro 1.5, then it is the least safe model yet, because it is still broken inside of an hour and then it has better capabilities. The good news is few people will in practice do that, and also that even fully jailbroken this is fine. But the use of the word ‘safety’ throughout worries me.

The real problem on the margin for Gemini is the helpfulness question in 9.4.2. In context, the particular helpfulness question is: If a question requires a careful approach, or has some superficial issue that could cause a false refusal, can the model still be useful?

To test this, they assemble intentionally tricky questions.

Table 29 shows users preferring Gemini 1.5’s answers to Gemini 1.0 Ultra on these questions, but that is to be expected from them being better models overall. It doesn’t specifically tell us that much about what we want to test here unless we are calibrated, which here I do not know how to do with what they gave us.

This seems more useful on image to text refusals?

Gemini Pro has 7% more refusals on ‘ungrounded’ data, and 60% more refusals on grounded data. Except according to their lexicon, that’s… bad? I think that grounded means incorrect, and ungrounded means correct? So we have a lot more false refusals, and only a few more true ones. That seems worse.

They then move on to Security and Privacy in 9.4.3.

How vulnerable is the model to prompt injections? This seems super important for Gemini given you are supposed to hook it up to your Gmail. That creates both opportunity for injections and a potential payoff.

They use Gemini Ultra 1.0 and a combination of handcrafted templates and optimization based attacks that use a genetic algorithm to create injections.

These are not reassuring numbers. To their credit, Google admits they have a lot of work to do, and did not hide this result. For now, yes, both versions of Gemini (and I presume the other leading LLMs) are highly vulnerable to prompt injections.

The next topic, memorization, is weird. Memorization is good. Regurgitation is often considered bad, because copyright, and because personal data. And because they worry about Nasr et al (2023) as an attack to retrieve memorized data, which they find will get training data about 0.17% of the time, most of which is generic data and harmless. They note longer context windows increase the chances for it to work, but I notice they should raise the cost of the attack enough that it doesn’t make sense to do that.

There are lots of other things you do want the model to memorize, like the price of tea in China.

So memorization is down, and that is… good? I guess.

They mention audio processing, and conclude that they are not substantially advancing state of the art there, but also I do not know what harms they are worried about if computers can transcribe audio.

Now we get to a potential trap for Google, representational harms, which here means ‘the model consistently outputs different quality results for different demographic groups.’ Mostly none of this seems like it corresponds to any of the failure modes I would be worried about regarding harm to various groups. At one point, they say

We are also concerned about possible representational harms that can result from applications where the user asks the model to make inferences about protected categories like race and gender from audio input data (Weidinger et al., 2021). Model assumptions about what constitutes a typical voice from a particular group can amplify existing societal stereotypes.

Are we saying that the model should not use voice to infer when the speaker is probably of a particular gender? They do realize humans are doing this all the time, right? But it seems we do not want to be too good at this.

And you’ll never guess why we need to not be too bad at this either:

Poorer performance on recognising AAVE could be problematic for some applications; for example, when automatically characterizing speech in a dataset to understand diversity and representation, poor performance on AAVE recognition could lead to incorrect conclusions about representation.

So the main reason you need to know who has which characteristics is so you can figure out the right conclusions about representation, otherwise how dare you? Is it any surprise that this is the company where we had The Gemini Incident?

The good news is they report that they beat their baselines, whatever that means.

A great idea. What are we evaluating?

We performed evaluations on a number of capabilities relevant to extreme risks (Phuong et al., 2024; Shevlane et al., 2023). Specifically, we performed evaluations of text-to-text capabilities of Gemini 1.5 Pro at self-proliferation; offensive cyber-security; code vulnerability detection; Chemical, Biological, Radiological and Nuclear (CBRN) knowledge; and persuasion.

They note a substantial uptick in the number of self-proliferation sub-steps (‘milestones’) that Gemini 1.5 Pro could do, but still no success end to end. There were however challenges with ‘success on all milestones’ and an overall 56% success rate on milestones, so in theory with enough attempts it could get interesting.

Nothing worrisome was found for cybersecurity, vulnerability detection or CBRN.

Charm offensive progress looks solid. That seems like a case where the dangerous capability being measured is very close to capabilities in general. It performed below Ultra on ‘web of lies,’ ‘hidden agenda’ and ‘money talks.’ I am actively curious why we do not see more capability here.

I note that persuasion thresholds are not in the DeepMind Frontier Safety Framework, yet they have several of them in the current evaluation suite. Curious. Mostly I presume this is an oversight in the framework, that will get corrected?

Outside experts got black box API access to a Gemini 1.5 Pro API model checkpoint for a number of weeks, with both a chat interface and a programmatic API, and they could turn safety features down or off.

It was up to the outsiders, as it should be, to determine what tests to run, and they wrote their own reports. Then DeepMind looked at the findings and assigned severity ratings.

There were complaints about various ‘representation harms’ that echo things discussed above. The CBRN testing did not find anything important. For cyber, there were some capability gains but they were deemed marginal. And that seems to be it.

That all matches my assessment of the risks of 4-level models, which describes Gemini 1.5 Pro. There are marginal gains to almost any activity, but nothing actively scary. Long context windows are again generally useful but not enough to trigger major worries. How much you care about ‘representation harms’ is up to you, but that is fully mundane and reputational risk, not existential or catastrophic risk.

Given what we already know about other similar models, the safety testing process seems robust. I am happy with what they did. The question is how things will change as capabilities advance, which turns our attention to a topic I will handle soon: The DeepMind Frontier Safety Framework.



OpenAI: Helen Toner Speaks

Helen Toner went on the TED AI podcast, giving us more color on what happened at OpenAI. These are important claims to get right.

I will start with my notes on the podcast, including the second part where she speaks about regulation in general. Then I will discuss some implications more broadly.

This seems like it deserves the standard detailed podcast treatment. By default each note’s main body is description, any second-level notes are me.

  1. (0:00) Introduction. The host talks about OpenAI’s transition from non-profit research organization to de facto for-profit company. He highlights the transition from ‘open’ AI to closed as indicative of the problem, whereas I see this as the biggest thing they got right. He also notes that he was left with the (I would add largely deliberately created and amplified by enemy action) impression that Helen Toner was some kind of anti-tech crusader, whereas he now understands that this was about governance and misaligned incentives.

  2. (5:00) Interview begins and he dives right in and asks about the firing of Altman. She dives right in, explaining that OpenAI was a weird company with a weird structure, and a non-profit board supposed to keep the company on mission over profits.

  3. (5:20) Helen says for years Altman had made the board’s job difficult via withholding information, misrepresenting things happening at the company, and ‘in some cases outright lying to the board.’

  4. (5:45) Helen says she can’t share all the examples of lying or withholding information, but to give a sense: the board was not informed about ChatGPT in advance and learned about it on Twitter; Altman failed to inform the board that he owned the OpenAI startup fund despite claiming to be an independent board member; he gave false information about the company’s formal safety processes on multiple occasions; and, relating to her research paper, Altman in the paper’s wake started lying to other board members in order to push Toner off the board.

    1. I will say it again. If the accusation about Altman lying to the board in order to change the composition of the board is true, then in my view the board absolutely needed to fire Altman. Period. End of story. You have one job.

    2. As a contrasting view, the LLMs I consulted thought that firing the CEO should be considered, but it was plausible this could be dealt with via a reprimand combined with changes in company policy.

    3. I asked for clarification given the way it was worded in the podcast, and can confirm that Altman withheld information from the board regarding the startup fund and the launch of ChatGPT, but he did not lie about those.

    4. Repeatedly outright lying about safety practices seems like a very big deal?

    5. It sure sounds like Altman had a financial interest in OpenAI via the startup fund, which means he was not an independent board member, and that the company’s board was not majority independent despite OpenAI claiming that it was. That is… not good, even if the rest of the board knew.

  5. (7:25) Toner says that for any given incident Altman could give an explanation, but the cumulative weight meant they could not trust him. And they’d been considering firing Altman for over a month.

    1. If they were discussing firing Altman for at least a month, that raises questions about why they weren’t better prepared, or why they timed the firing so poorly given the tender offer.

  6. (8:00) Toner says that Altman was the board’s main conduit of information about the company. They had been trying to improve processes going into the fall; these issues had been long-standing.

  7. (8:40) Then in October two executives went to the board and said they couldn’t trust Altman, that the atmosphere was toxic (they used the term ‘psychological abuse’), that Altman was the wrong person to lead the company to AGI, and that there was no expectation Altman would change and no avenue for feedback, complete with documentation and screenshots (which were not then shared). Those executives have now tried to walk back their statements.

  8. (9:45) This is where it went off the rails. The board had spent weeks discussing these questions. But they thought if Altman got any inkling of what was happening he would go to war with the board, so the board couldn’t tell almost anyone outside of their legal team and could not do much in advance of the firing on November 17.

    1. I get the failure mode, but I still do not get it. There was still plenty of time to consult with the legal team and get their ducks in a row. They had been talking for weeks without a leak. They could have prepared clear statements. They had multiple executives complaining, who could have been asked for on-the-record statements. They had to anticipate that Altman and his allies would fight back after he was fired; at bare minimum he would attempt to recruit for his next venture.

    2. Instead, they went in with basically no explanation, no plan, and got killed.

  9. (10:20) Toner explains that the situation was portrayed as either Altman returns with a new fully controlled board and complete control, or OpenAI will be destroyed. Given those choices, the employees got behind Altman.

  10. (11:20) But also, she says, no one appreciates how scared people are to go against Altman. Altman has a long history of retaliation, including for criticism.

    1. It was a Basilisk situation. Everyone fears what will happen if you don’t back a vindictive unforgiving power seeker now, so one by one everyone falls in line, and then they have power.

    2. Let’s face it. They put the open letter in front of you. You know that it will be public who will and won’t sign. You see Ilya’s name on it, so you presume Altman is probably going to return; even if he doesn’t, he will still remember and have a lot of money and power, and if he does not return there is a good chance OpenAI falls apart. How do you dare not sign? That seems really tough.

  11. (12:00) She says this is not a new problem for Altman. She claims he was fired from YC and the management team asked the board to fire Altman twice at Loopt.

    1. Paul Graham has issued a statement that Altman was not fired from YC. According to Graham, who would know, Altman was asked to choose to be either CEO of OpenAI or YC, but that he could not hold both positions at once. Altman agreed and (quite obviously and correctly) chose OpenAI. This seems like a highly reasonable thing for Graham to do in that spot.

    2. Paul Graham and Sam Altman are presenting as being on good terms. That can cut in both directions in terms of the credibility of Graham’s story.

    3. If we presume that Graham’s story is accurate, it still means that Altman took on two incompatible leadership positions, and only stepped down from one of them when asked to do so by someone who could fire him. That isn’t being fired. It also isn’t entirely not being fired.

    4. According to the most friendly judge (e.g. GPT-4o) if it was made clear Altman would get fired from YC if he did not give up one of his CEO positions, then ‘YC fired Altman’ is a reasonable claim. I do think precision is important here, so I would prefer ‘forced to choose’ or perhaps ‘effectively fired.’ Yes, that is a double standard on precision, no I don’t care.

  12. (12:50) Then they pivot to other questions. That most certainly would not have been my move if I was doing this interview, even if I had a strict time budget. There are so many additional questions.

  13. (13:10) Regulations time, then. What are we worried about in concrete terms? Toner starts with the basics like credit approvals, housing and criminal justice decisions. Next up is military use, an obvious concern. Looking forward, if capabilities improve, she cites enhancing hacking capabilities as an example of a potential danger, while noting that not everything needs regulation; if Spotify wants to use AI for your playlist then that’s fine.

    1. Choosing examples is always tricky. Cyber can sometimes be a very helpful example. At other times, it can trigger (often valid) particular objections.

  14. (15:00) Surveillance and processing of audio and video? He cites MSG, which famously uses facial recognition to exclude anyone with a pending lawsuit against their parent company. Toner notes this (I would say ‘for now’) is a difference in degree, not kind, but it still requires reassessing our policies.

    1. Facial recognition technology gets a strangely hostile rap compared to facial recognition by humans. Witness identifications are super unreliable. Yes, people being in jail purely on incorrect facial recognition is terrible, but how much worse is it than the vastly more common case of being in jail because of an accidentally mistaken witness ID? Versus an intentional or coached one?

    2. The real issue is radical reductions in price plus increases in accuracy and speed open up new use cases and defaults that have some big issues.

  15. (18:15) What happens when a business can track tons of things, like productivity and actions of workers and time spent at tables? Host asks, how is this legal? Well, there are no Federal privacy laws for private actors, in contrast to many other countries.

    1. I have not seen a principled explanation for where to draw the line on what information you should and should not be allowed to track, or a good practical proposal either. Certainly the EU solutions are not great. We don’t want a ‘everything not forbidden is compulsory’ situation, and most people very clearly do not value their privacy in many senses.

  16. (19:50) Toner suggests that selling the data to others might be a key distinction. It is one thing for the coffee shop to know your patterns, another to share it with essentially every corporation.

    1. This seems promising as an intuition. I don’t mind local information sharing but worry more about universal information sharing. Seems tricky to codify, but not obviously impossible.

  17. (20:15) Phone scams, via AI to scrub social media and duplicate voices. What’s on the horizon? Toner says video, gives the standard reminder to talk to your parents and not use voice as a password and so on, and says we can likely adapt. I like the reminder that we used to have full listings of everyone’s address and phone number and it was fine.

    1. It is not so obvious to me that having a universal directory would not have been fine in 2022 before ChatGPT, or even that it is obviously terrible now. My guess is you could fix it with an opt-out for special cases (like abuse victims) combined with a small refundable tax on phone calls and emails. So many famous people have de facto public contact information and it’s fine. Some of them like Tyler Cowen actually answer all their emails. I don’t always answer, but I do always look.

  18. (22:30) Can regulations and laws protect us? Toner says yes, of course. In many cases there are already rules; you only need to ensure the resources are available for enforcement.

  19. (24:00) Is there an example of good regulation? Toner notes the issue with AI regulation is all the uncertainty, and the best policies are about shedding light, such as the executive order’s disclosure requirements on advanced systems. She notes we don’t even have good evaluation methods yet, and it is good to work on such abilities.

  20. (25:50) What makes regulating AI hard? Toner says three things: AI is a lot of different things, AI is a moving target, and no one can agree on where AI is going.

    1. Those are definitely big issues. I see others as well, although you could file a lot of those under that third objection. Also all the industry lobbying can’t be helping. The hyperbolic outright lying campaigns could go either way.

  21. (27:00) How do you get top AI labs to ‘play nice’ and give access? How do you prevent them from doing regulatory capture? No great answers here, other than you have to force them.

  22. (29:15) Standard ‘cat out of the bag’ question regarding open source. Toner points out that Hugging Face will take down problematic models, so you can at least reduce distribution. Toner pivots to detection of AI-generated content.

    1. This of course won’t stop determined actors, and won’t matter at the limit.

    2. For now, yes, defense in depth can do a lot of work.

    3. I notice she mostly dodged the most important implications.

  23. (31:30) What are the utopian and dystopian scenarios? For dystopia Toner says ‘so many possibilities,’ but then paints a dystopia that is very similar to our own, basically one where AIs make a lot of decisions and those decisions can’t be questioned. She mentions existential risk, but then somehow quotes the famous Kamala Harris line about how losing health care could be ‘existential for that person.’ And says there are plenty of things to be worried about already, that are happening. Mentions the ‘Wall-E’ future from the ship.

    1. Seriously, what the hell?

    2. Yes, there are plenty of bad things already happening, and they include lots of serious problems.

    3. But it seems very wrong to focus on the things already happening or that are locked into happening.

    4. However I do think this loss of control scenario, where it happens gradually and with our consent but ends with a worthless world, is certainly one scenario that could happen, in at least some form. I notice we do not even have a plan for how to avoid this scenario.

    5. Even without existential risk this seems profoundly unimaginative. I think this is deliberate, she is trying to stay as seemingly grounded as possible, and I think she takes this too far.

  24. (34:40) Moving on to utopia. She essentially says ‘solve our current problems.’

    1. But that’s not what makes a good utopia. We need a better vision.

A particular note from Helen Toner’s podcast: The OpenAI board learned about the release of ChatGPT from Twitter. They were not informed in advance.

This was nowhere near as crazy as it now sounds. The launch was relatively quiet and no one saw the reaction coming. I do not think that, on its own, this mistake would be egregious given the low expectations. You still should inform your board of new product launches, even if they are ‘research previews,’ but corners get cut.

As an isolated incident of not informing the board, I would be willing to say this is a serious process failure but ultimately not that big a deal. But this is part of a years-long (by Toner’s account) pattern of keeping the board in the dark and often outright lying to it.

Altman’s continual ‘saying that which was not’ and also ‘failure to say that which was and was also relevant’ included safety issues along with everything else.

It is the pattern that matters, and that is hard to convey to outsiders. As she says in the podcast, any one incident can be explained away, but a consistent pattern cannot. Any one person’s sense of the situation can be written off. A consistent pattern of it, say by two executives plus all the board members who aren’t either Altman or his right-hand man Brockman, should be a lot harder to write off. Alas, statements with substance could not be given.

Only now do we understand the non-disparagement and non-disclosure agreements and other tactics used to silence critics, along with other threats and leverage. Indeed, it damn well sure sounds like Toner is holding back a lot of the story.

Thus, one way or another, this all falls under ‘things that could have been brought to our attention yesterday’ on so many levels.

Alas, it is too late now. The new board clearly wants business as usual.

The only contradiction of Toner’s claims, so far, has been Paul Graham’s statement that Sam Altman was not fired from YC. Assuming we believe Paul’s story, which I mostly do, that puts whether Altman was effectively fired in a gray area.

Bret Taylor, the current OpenAI board chief, took a different approach.

In response to Toner’s explanations, Taylor did not dispute any of the claims, or the claims in general. Instead he made the case that Altman should still be CEO of OpenAI, and that Toner talking was bad for business so she should cut that out.

Notice the Exact Words here.

Bret Taylor (OpenAI Board Chief): We are disappointed that Ms. Toner continues to revisit these issues.

An independent review of Altman’s firing concluded that the prior board’s decision was not based on concerns regarding product safety or security, the pace of development, OpenAI’s finances, or its statements to investors, customers, or business partners.

Bloomberg: Taylor also said that “over 95% of employees” asked for Altman’s reinstatement, and that the company remains focused on its “mission to ensure AGI benefits all of humanity.”

So yes. Those are all true statements, and very much things the Board Chief should say if he has decided he does not want the trouble of firing Altman as CEO.

With one possible exception, none of it in any way contradicts anything said by Toner.

Indeed, this looks awfully close to a corroboration.

Notice that Toner did not make any claims regarding product safety or security, the pace of development, OpenAI’s finances, or any statements to investors, customers or business partners not related to OpenAI having an independent board. And I am happy to believe that those potentially false statements about the board’s independence were not a consideration in the firing of Altman.

Whether or not the company is focused on its ‘mission to ensure AGI benefits all of humanity’ is an open question where I think any reasonable outsider would be highly skeptical at this point given everything we now know, and would treat that as an empty corporate slogan.

I believe that the independent report’s conclusion is technically correct, the best kind of correct. If we are to draw any further conclusion than the exact words? Well, let’s see the report, then.

None of that goes to whether it was wise to respond by firing Altman, or whether the board would have been wise to do so if they had executed better.

Is the new information damning for Sam Altman? Opinions vary.

Neel Nanda: This is absolutely damning of Sam Altman. It’s great to finally start to hear the board’s side of the story, who recent events have more than vindicated.

Roon: How is it damning?

The specific claim that the board was not informed of ChatGPT’s launch does not seem much more damaging, on the margin, than the things we already know. As I have said before, ‘lying to the board about important things’ seems to me the canonical offense that forces the board to consider firing the CEO, and in my book lying in an attempt to control the board is the one that forces you to outright fire the CEO, but we already put that part together.

The additional color does help crystallize and illustrate the situation. It clarifies the claims. The problem is that when there is the sum of a lot of bad incidents, any one of which could be excused as some combination of sloppy or a coincidence or not so bad or not sufficiently proven or similar, there is the tendency to only be able to focus on the single worst thing, or even to evaluate based on the least bad of all the listed things.

We got explicit confirmation that Altman lied to the board in an attempt to remove Toner from the board. To me, this remains by far the worst offense, on top of other details. We also got the news about Altman hiding his ownership of the AI startup fund. That seems like a potentially huge deal to hide from the board.

Why, people then ask, are you also harping on what is only like the 9th and 11th worst things we have heard about? Why do you ‘keep revisiting’ such issues? Why can’t you understand that you fought power, power won, and now you don’t have any?

Because the idea of erasing our memories, of saying that if you get away with it then it didn’t count, is one of the key ways to excuse such patterns of awful behavior.

OpenAI’s Joshua Achiam offered a reasonable take, saying that the board was well meaning and does not deserve to be ‘hated or ostracized,’ but they massively screwed up. Achiam thinks they made the wrong choice firing Altman, the issues were not sufficiently severe, but that this was not obvious, and the decision not so unreasonable.

His other claim, however, is that even if firing had been the right choice, the board then had a duty, if it went through with the firing, to provide a clear and convincing explanation to all the stakeholders, not only the employees.

Essentially everyone agrees that the board needed to provide a real explanation. They also agree that the board did not do so, and that this doomed the attempt to fire Altman without destroying the company, whether or not it had a shot anyway. If your approach will miss, it does not matter what he has done, you do not come at the king.

And that seems right.

For a vindictive king who will use the attempt to consolidate power? Doubly so.

The wrinkle remains why the board did not provide a better explanation. Why they did not get written statements from the two other executives, and issue additional statements themselves, if only internally or to other executives and key stakeholders. We now know that they considered this step for weeks, and on some level for years. I get that they feared Altman fighting back, but even given that this was clearly a massive strategic blunder. What gives?

It must be assumed that part of that answer is still hidden.

Perhaps we will learn more in the future. There is still one big mystery left to solve. But more and more, the story is confirmed, and the story makes perfect sense.

Altman systematically withheld information from and on many occasions lied to the board. This included lying in an attempt to remove Toner from the board so Altman could appoint new members and regain control. The board quite reasonably could not trust Altman, and had tried for years to institute new procedures without success. Then they got additional information from other executives that things were worse than they knew.

Left with no other options, the board fired Altman. But they botched the firing, and now Altman is back and has de facto board control to run the company as a for profit startup, whether or not he has a full rubber stamp. And the superalignment team has been denied its promised resources and largely driven out of the company, and we have additional highly troubling revelations on other fronts.

The situation is what it is. The future is still coming. Act accordingly.


fracking-wastewater-has-“shocking”-amount-of-clean-energy-mineral-lithium

Fracking wastewater has “shocking” amount of clean-energy mineral lithium

A hydro-fracking drilling pad for oil and gas operates in the Marcellus Shale formation near Robinson Township, Pennsylvania, October 26, 2017. Credit: Robert Nickelsberg/Getty Images

In 2007, a geoscientist at Penn State named Terry Engelder calculated that Pennsylvania could be sitting on more than 50 trillion cubic feet of accessible natural gas deposits. Engelder later revised his calculation upward, to 489 trillion cubic feet, enough to meet U.S. natural gas demand for 18 years. These massive numbers set off the fracking boom in Pennsylvania, leading to drilling across the state. Since the rush began, there have been 13,000 unconventional wells drilled in Pennsylvania.

Now, a new “astounding” calculation has caught the attention of the gas industry: A study from researchers at the National Energy Technology Laboratory shows the wastewater produced by Pennsylvania’s unconventional wells could contain enough lithium to meet 38 to 40 percent of current domestic consumption. Lithium is a critical mineral that’s an “essential component” of many clean energy technologies, including batteries for electric vehicles. 

The study used chemical and production compliance data from the Pennsylvania Department of Environmental Protection to estimate that approximately 1,160 metric tons of lithium per year could be extracted from this produced water, which is a combination of fluids used for fracking and water from natural formations underground that returns to the surface during the drilling process. The lithium in Pennsylvania’s produced water likely comes from ancient volcanoes that were erupting at the time the natural gas deposits were being formed. This volcanic ash contained lithium that eventually seeped into the water underground.
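As a quick back-of-envelope check, using only the figures quoted above (a sketch; the consumption baseline below is implied by the study’s own percentages rather than taken from an independent source):

```python
# Figures from the study as reported above.
extractable_tons_per_year = 1_160            # lithium recoverable from Pennsylvania produced water
share_low, share_high = 0.38, 0.40           # stated share of current domestic consumption

# Implied current U.S. domestic lithium consumption, in metric tons per year.
implied_low = extractable_tons_per_year / share_high   # ~2,900
implied_high = extractable_tons_per_year / share_low   # ~3,050

print(f"Implied U.S. consumption baseline: ~{implied_low:,.0f} to {implied_high:,.0f} t/yr")
```

So the 38 to 40 percent figure is consistent with an annual domestic consumption baseline of roughly 3,000 metric tons.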

“The researcher community in the U.S. is really working hard to find the materials and methods that will enable us to meet our climate goals and decarbonize the economy,” said Justin Mackey, the study’s lead investigator. “Sometimes you might be surprised where that material actually comes from.” 

The Marcellus Shale Coalition, an industry trade group dedicated to the Marcellus Shale formation, the natural gas deposit beneath Pennsylvania, West Virginia, Ohio and New York, reacted to the news with enthusiasm. “This scientific analysis by one of the leading energy laboratories in the world shows once again how abundant Pennsylvania natural gas can enhance America’s energy, environmental and national security,” the coalition said in a statement. 

The United States currently relies on imports from Argentina, Chile and China to fully meet its lithium needs, and the demand for lithium is expected to rise dramatically as the clean energy transition accelerates. 

Mackey, a research geochemist at the National Energy Technology Laboratory, said he had focused on lithium because it is a strategic material for the American economy and defense industries and because it has insecure supply chains. “We’re reliant on foreign entities like China and Chile and Australia to source these raw materials, but they’re critical to our economies,” he said. “And more importantly, they’re critical to decarbonizing the U.S. automotive fleet.”

He said the researchers were “shocked” that the highest concentrations of lithium found in the Marcellus “are comparable to lithium brine, to water that is actually being mined for lithium.” 

“I think having more domestic sources of lithium is definitely a positive thing, especially if you don’t have to create a mine to exploit the resource,” Mackey said. Unconventional drilling waste is likely to be produced in large quantities for the foreseeable future, he said, and if remediating this waste safely could also be made economically valuable, that could be beneficial for the environment as well.


ai-#66:-oh-to-be-less-online

AI #66: Oh to Be Less Online

Tomorrow I will fly out to San Francisco, to spend Friday through Monday at the LessOnline conference at Lighthaven in Berkeley. If you are there, by all means say hello. If you are in the Bay generally and want to otherwise meet, especially on Monday, let me know that too and I will see if I have time to make that happen.

Even without that hiccup, it continues to be a game of playing catch-up. Progress is being made, but we are definitely not there yet (and everything not AI is being completely ignored for now).

Last week I pointed out seven things I was unable to cover, along with a few miscellaneous papers and reports.

Out of those seven, I managed to ship on three of them: Ongoing issues at OpenAI, The Schumer Report and Anthropic’s interpretability paper.

However, OpenAI developments continue. Thanks largely to Helen Toner’s podcast, some form of that is going back into the queue. Some other developments, including new media deals and their new safety board, are being covered normally.

The post on DeepMind’s new scaling policy should be up tomorrow.

I also wrote a full post on a fourth, Reports of our Death, but have decided to shelve it and post a short summary here instead.

That means the current ‘not yet covered queue’ is as follows:

  1. DeepMind’s new scaling policy.

    1. Should be out tomorrow before I leave, or worst case next week.

  2. The AI Summit in Seoul.

  3. Further retrospective on OpenAI including Helen Toner’s podcast.

Table of Contents:

  1. Introduction.

  2. Table of Contents.

  3. Language Models Offer Mundane Utility. You heard of them first.

  4. Not Okay, Google. A tiny little problem with the AI Overviews.

  5. OK Google, Don’t Panic. Swing for the fences. Race for your life.

  6. Not Okay, Meta. Your application to opt out of AI data is rejected. What?

  7. Not Okay Taking Our Jobs. The question is, with or without replacement?

  8. They Took Our Jobs Anyway. It’s coming.

  9. A New Leaderboard Appears. Scale.ai offers new capability evaluations.

  10. Copyright Confrontation. Which OpenAI lawsuit was that again?

  11. Deepfaketown and Botpocalypse Soon. Meta fails to make an ordinary effort.

  12. Get Involved. Dwarkesh Patel is hiring.

  13. Introducing. OpenAI makes media deals with The Atlantic and… Vox? Surprise.

  14. In Other AI News. Jan Leike joins Anthropic, Altman signs giving pledge.

  15. GPT-5 Alive. They are training it now. A security committee is assembling.

  16. Quiet Speculations. Expectations of changes, great and small.

  17. Open Versus Closed. Two opposing things cannot dominate the same space.

  18. Your Kind of People. Verbal versus math versus otherwise in the AI age.

  19. The Quest for Sane Regulation. Lina Khan on the warpath, Yang on the tax path.

  20. Lawfare and Liability. How much work can tort law do for us?

  21. SB 1047 Unconstitutional, Claims Paper. I believe that the paper is wrong.

  22. The Week in Audio. Jeremie & Edouard Harris explain x-risk on Joe Rogan.

  23. Rhetorical Innovation. Not everyone believes in GI. I typed what I typed.

  24. Abridged Reports of Our Death. A frustrating interaction, virtue of silence.

  25. Aligning a Smarter Than Human Intelligence is Difficult. You have to try.

  26. People Are Worried About AI Killing Everyone. Yes, it is partly about money.

  27. Other People Are Not As Worried About AI Killing Everyone. Assumptions.

  28. The Lighter Side. Choose your fighter.

Which model is the best right now? Michael Nielsen is gradually moving back to Claude Opus, and so am I. GPT-4o is fast and has some nice extra features, so when I figure it is ‘smart enough’ I will use it, but when I care most about quality and can wait a bit I increasingly go to Opus. Gemini I’m reserving for a few niche purposes, when I need Google integration, long context windows or certain other features.

Analyze financial statements and predict future performance, enabling high Sharpe ratio investing, says a new paper. I do not doubt that such a technique is ‘part of a balanced portfolio of analysis techniques’ given that it is essentially free, but color me skeptical (although I have not read the paper). You can anonymize the company all you like; that does not mean the patterns were not picked up, or that past performance is not being used to model future success in a way that will work far better on this kind of test than in reality, especially when everyone else has their own LLMs doing similar projections, and when AI is transforming the economy and everyone’s performance.

Who uses ChatGPT?

China being near the top, despite the Great Firewall, is interesting.

Washington Post bad take about AI transforming sports betting. Nothing here requires ‘AI.’

Use about 150 lines of Python code together with Gemini 1.5 Flash and ElevenLabs to give you a guide while playing Super Mario 64. Simultaneously super cool and super lame, in different ways.
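For flavor, here is a minimal sketch of how such a loop might be wired up. This is not the linked project’s code: the prompt, the polling interval, and the speak() stub (standing in for the ElevenLabs call) are all assumptions.

```python
# Sketch: screenshot the game, ask Gemini 1.5 Flash for one sentence of advice, read it aloud.
import time

import google.generativeai as genai
from PIL import ImageGrab

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-flash")

def speak(text: str) -> None:
    # Stand-in for whatever text-to-speech client you use (the post uses ElevenLabs).
    print(f"[coach] {text}")

while True:
    frame = ImageGrab.grab()  # grabs the full screen; crop to the emulator window in practice
    response = model.generate_content(
        [frame, "You are a Super Mario 64 coach. In one short sentence, tell the player what to do next."]
    )
    speak(response.text)
    time.sleep(5)  # throttle requests
```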

Understand and make less tedious your personal finances through cosmic horror metaphors, all fun although some more on point than others.

LLMs for language learning. Ben Hoffman points to his friend’s new program LanguageZen, which has a bunch of automated customization and other good ideas mixed in. If I had more free time I would be intrigued. Ben thinks that current LLMs are not good enough yet. I think they very much are, if you give them the scaffolding, as the context window can fully include your entire experiential history with the new language, but it will take some work to get all the customizations right.

We presumably all remember The Gemini Incident.

Google put out Gemini while it had, shall we say, some issues. The image model had some big issues, and so did the text model. They had a bad time, and had to take down images of humans for a while.

The models kept improving. At this point I am using a mix of Gemini, Claude and GPT-4o, depending on the exact task, sometimes comparing answers.

It does seem, however, that the current version of the ‘AI overview’ on Google search has a rather large problem.

In this case, it is not about accusations of wokeness or racism or bias.

It is accusations of being a dumbass.

Washington Post had initial coverage here, then followed up here.

As in…

Or…

Or…

It also answers life’s great riddles and twisters.

Alec Stapp got an absurd set of US states by population, although it doesn’t replicate.

There’s the classic adding glue to your cheese so it sticks to the pizza, you’ll never guess where that comes from…

The movie’s going to be great.

Although it might be a while.

Or maybe not?

I would have thought this one was better with Rule of Three, but no, this is The Way:

That whole thread is great and has some unique ones.

So what happened?

No, this is not a general failure of all LLMs.

Henry Shevlin: So many people in my feed overindexing on Google’s AI Overview woes and claiming “aha, you see, AI sucks”. But ChatGPT, Claude, and Perplexity don’t have these issues. What’s happened with AI Overviews is very weird and messed up in a distinctive and novel way.

AI Overviews seems to derive chunks of its summaries wholecloth from single sources in a way I’ve not seen on other models. I’ve been using ChatGPT daily for the last 18 months and even doing adversarial testing on it, and never seen anything in this league.

Ivan’s Cat: It is related to the RAG part, so the standard ChatGPT hallucinations are indeed a bit different. In Perplexity however I experienced very similar outputs as seen on the screenshot of AI Overview. Good RAG on such a scale is hard and not a solved problem yet.

Henry Shevlin: Yes indeed! RAG is temperamental, and I’ve had RAG-related fails in ChatGPT. But weird that Google would lean on RAG for this task. With million-token context windows even in public Gemini Pro, why not just do direct inference on cached copies of the top few Pageranked results?

I love this explanation.

Mike Riverso: There’s a fun chain of events here that goes: SEO destroys search usability -> people add “Reddit” to search queries to get human results -> Google prioritizes Reddit in AI training data and summaries -> AI spits out Reddit shitposts as real answers.

Proving yet again that LLMs don’t understand anything at all. Where a human can sift through Reddit results and tell what is real and what’s a joke, the AI just blindly spits out whatever the popular result was on Reddit because it doesn’t know any better.

This is the second time that Google has gotten raked over the coals. Here for example is The Verge raking them over those coals. Whereas OpenAI keeps getting away with pretty much everything. Similarly, Google had an impressive I/O day, and everyone ignored it to talk about the cheaper and faster but otherwise underwhelming GPT-4o. Yes, people are complaining that recent business practices show they are a deeply evil company, but it’s not like anyone is proposing doing anything about it, and no one complains about the products.

Vijay Chidambaram: There is a good outcome from the Google AI overview being deployed and live. There is no better education for the public than to see with their own eyes how AI is fallible. We can give talks, write articles, but nothing compares with Google asking you to eat non-toxic glue.

The ‘non-toxic’ modifier on the glue is not going to stop being funny.

Mark Riedl: It’s weird that Google gets raked over the coals, when OpenAI often gets a pass for the same phenomenon. I’m not sure why. Because Google is a trusted source? Because fewer people use Bing or GPT4 with retrieval? Or is Gemini that much more prone to hallucinations?

As I put it then:

In this case, it is largely justified. I do not remember ChatGPT going this stupid. There is a difference between questions designed to trick LLMs into looking foolish and ordinary, if somewhat absurd, search queries.

Also, this is Google Search. I do think a higher standard is appropriate here than if these results were showing up on Gemini, since the audience is less sophisticated.

I certainly see the argument that this is quite bad.

Colin Fraser: I can’t believe Google pulled the plug immediately and issued a sheepish apology for the Asian founding fathers but have let this go on for a week. Doesn’t bode well for their decision making priorities in my opinion.

I think this perhaps speaks badly to the priorities of our society, that we were outraged by hot button violations and mostly are amused by random copying of trolling Reddit answers. I notice that the answers quoted are wrong and often very funny and absurd, and if you believed them for real it would not go well, but are almost never offensive or racist, and the ones that seemed truly beyond the pale (like suggesting jumping off a bridge was a good idea) turned out to be fake.

Information has an error rate. Yes, the rate on AI overview was much higher than we would like, but it was clearly labeled and I don’t think ‘we can find tons of absurd examples’ tells you about whether it is high enough that you need to pull the plug.

Also the results aren’t showing up on Gemini? You only see this on the AI overview, not on the Gemini page.

That goes back to the Reddit issue, and the tie-in with Google search. It is the combination of doing a search, together with using AI to select from that, and the need to produce an almost instantaneous answer, that is causing this disaster.

If Google were willing to run the query through Gemini Pro, and ask it ‘does this answer seem reasonable to you?’ we wouldn’t be having this conversation. It is not as if we do not have solutions to this. What we don’t have solutions to is how to do this instantly. But I have to wonder, Gemini Flash is damn good, why isn’t it good enough to stop this?

My plan was to test how frequent the problem is by using GPT-4o to generate random absurd questions (such as “Can I replace my daily water intake with pure maple syrup?” and “Can I grow a money tree by planting a dollar bill in my backyard?”), but they reliably failed to generate AI overviews for me, so no data. Also no AI overviews at all, which is fine with me in their current state.

Caroline Orr Bueno says obviously Google should pull the offering and not doing so is deeply irresponsible, links to The Byte’s Sharon Adarlo saying Google’s CEO admits he has no solution for the incorrect information, because ‘hallucinations are an unsolved problem.’ These are related but distinct things. The goal has to be to get the effective error rate down to acceptable levels, weighted by the places it matters. It is not as if a regular Google search is fully reliable, same as any other website.

You can also go to udm14.com as an easy way to use the text-only version of search.

Tog Wu proposes a solution to guard against retrieval corruption via getting answers from each page and then aggregating the answers, which he says dramatically lowers the success rate of injection attacks, which seem to be the cause of these errors.
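My loose reading of that proposal, as a sketch: query the model on each retrieved page in isolation, then aggregate. The llm callable is a placeholder, and the majority vote is just one simple aggregation rule; the actual paper may do something more sophisticated.

```python
from collections import Counter
from typing import Callable, List

def robust_answer(query: str, pages: List[str], llm: Callable[[str], str]) -> str:
    """Answer from each retrieved page separately, then aggregate by majority vote.

    An instruction injected into any single page can only corrupt that page's answer,
    not the aggregate."""
    answers = []
    for page in pages:
        prompt = (
            f"Answer the question using only the source below.\n"
            f"Question: {query}\n"
            f"Source:\n{page}\n"
            f"Answer concisely, or reply 'no answer' if the source does not say."
        )
        answers.append(llm(prompt).strip().lower())
    votes = Counter(a for a in answers if a != "no answer")
    return votes.most_common(1)[0][0] if votes else "no answer"
```

Usage would be something like robust_answer(query, top_pages, my_llm), where my_llm wraps whichever model you have handy.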

A simpler solution is suggested by Arvind Narayanan, which is to use humans to do manual fixes. The long tail will remain but you can presumably hit most queries that way without it crimping Google’s budget that hard.

There is that. There is also a hybrid form of doing it ‘manually’ via AI. Gemini is perfectly capable of noticing that you do not want to add glue to your pizza or that Applum is not a fruit. So it seems relatively easy and cheap to take every query that is made in identical (or functionally identical) form N or more times, and then check where the AI overview answer falls on the spectrum from bonkers to clearly correct, and fix accordingly. You would still be able to generate absurd answers by being creative and finding a new query, but ordinary users would very rarely run into an issue.
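A sketch of what that hybrid approach might look like, with placeholder models, an invented frequency threshold, and an in-memory cache; none of this reflects how Google actually builds AI Overviews.

```python
# Only queries seen at least N times get an overview, and a stronger model vets each
# cached answer once before it is ever served.
from collections import Counter

N = 1000                      # minimum query frequency before an overview is shown (made up)
query_counts = Counter()      # in practice: aggregated search logs
vetted_cache = {}             # canonical query -> approved overview text (or None if rejected)

def maybe_overview(query, overview_model, checker_model):
    key = query.strip().lower()            # stand-in for real query canonicalization
    query_counts[key] += 1
    if query_counts[key] < N:
        return None                        # rare query: show no AI overview at all
    if key not in vetted_cache:
        draft = overview_model(key)
        verdict = checker_model(f"Does this answer seem reasonable and safe?\nQ: {key}\nA: {draft}")
        vetted_cache[key] = draft if verdict.strip().lower().startswith("yes") else None
    return vetted_cache[key]
```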

What won’t help is blind panic. I saw this warning (the account got taken private so links won’t work).

Scott Jenson: I just left Google last month. The “AI Projects” I was working on were poorly motivated and driven by this mindless panic that as long as it had “AI” in it, it would be great. This myopia is NOT something driven by a user need. It is a stone cold panic that they are getting left behind.

The vision is that there will be a Tony Stark like Jarvis assistant in your phone that locks you into their ecosystem so hard that you’ll never leave. That vision is pure catnip. The fear is that they can’t afford to let someone else get there first.

This exact thing happened 13 years ago with Google+ (I was there for that fiasco as well). That was a similar hysterical reaction but to Facebook.

David Gerard: dunno how to verify any of this, but xooglers who were there for G+ say it absolutely rings true.

Google+ failed. In that sense it was a fiasco, costing money and time and hurting brand equity. Certainly not their finest hour.

What Google+ was not was a hysterical reaction, or a terrible idea.

Meta is a super valuable company, with deep control over a highly profitable advertising network, and a treasure trove of customer data and relationships. They have super powerful network effects. They play a core role in shaping our culture and the internet. Their market cap rivals that of Google, despite Zuckerberg’s best efforts.

They also are using those profits partly to lobby the United States Government to defeat any and all regulations on AI, and arguably are on what is de facto a generalized crusade to ensure everyone on Earth dies.

Google spent a few billion dollars trying to compete with what is now a trillion dollar business that has huge synergies with the rest of Google’s portfolio. If Google+ had succeeded at becoming a peer for Facebook, it seems reasonable to assign that a value of something on the order of $500 billion.

The break-even success rate here was on the order of 2%. The fact that it did not work, and did not even come close to working, is not strong evidence of a mistake. Yes, the effort was in some ways uninspired and poorly executed, but it is easy for us to miss all the things they did well.

Think of AI as a similar situation. Is Google going to create Jarvis? They seem like at worst the second most likely company to do so. Is the (non-transformational, Google still exists and is owned and run by humans) future going to involve heavy use of a Jarvis or Her, that is going to have a lot of lock-in for customers and heavily promote the rest of the related ecosystems? That seems more likely than not. You have to skate where the consumer need and habit pucks are going, and you need to bet big on potential huge wins.

There are lots of places where one could slap on the word ‘AI’ or try to integrate AI and it would not make a lot of sense, nor would it have much of an upside. Nothing I saw at Google I/O was remotely like that. Every product and offering made sense.

That in no way precludes Google’s internal logic and decision making and resource allocation being a giant cluster. Google could be running around in chicken-sans-head fashion shouting ‘AI’ everywhere. But that also could be a rather strong second-best strategy.

While we are all noticing how scummy OpenAI has been acting, let us not forget about Meta. Here they are telling you they are going to train their AIs on your data.

Tantacrul: I’m legit shocked by the design of Meta’s new notification informing us they want to use the content we post to train their AI models. It’s intentionally designed to be highly awkward in order to minimize the number of users who will object to it. Let me break it down.

I should start by mentioning that I’ve worked in growth teams who conduct experiments to minimise friction for over a decade and I know how to streamline an experience. Rule: every additional step you add dramatically decreases the % of people who’ll make it through to the end.

First step: you get this notification, just about satisfying the legal requirement to keep you informed but avoiding clearly defining its true purpose. Should include the line ‘We intend to use your content to train our AI models’ and should include a CTA that says ‘Opt Out’.

Second step. It shows you this notice. Trick: places the ‘right to object’ CTA towards the end of the second paragraph, using tiny hyperlink text, rather than a proper button style. Notice the massive ‘Close’ CTA at the bottom, where there’s clearly room for two. Ugly stuff.

Also, notice the line that says “IF your objection is honoured, it will be applied going forwards.”

Wow. “If”. Don’t see that too often. Legal safeguards aren’t in place yet to protect us against AI training so they’re pushing as far as possible, while they still can.

Third, they provide you with a form to fill out. It is only at this stage — the stage when you are objecting — that they inform you about which of your content they plan to use for training AI models. Notice the highlighted text, clarifying that they may ignore your objection.

Fourth step: you post your objection.

Fifth step: now you are told you need to check your email to grab a code they sent you.

I’d LOVE to hear their justification for this.

Sixth step: you open the email they send (which for me, arrived on time at least).

Notice the code is only valid for an hour. Now copy the code.

Seventh step: enter the code and get a confirmation message.

I later received an email letting me know that they would honour my objection.

I should mention that one of my friends who also objected got an error! I then checked out a Reddit thread which verified that many people also got this same error. Classic FB sloppiness.

I’m not (all that) surprised up to this point. I’m not mad.

So far I’m just impressed. That right there is some top shelf dark patterning.

And then it… gets worse?

You see, when they say ‘if’ they mean ‘if.’

Darren M. A. Calvert: This new Facebook/Instagram policy for claiming they can use anything you post to power their A.I. is ridiculous.

The only way to opt out is apparently to fill out a form and submit “proof” that your data has *ALREADY* been used to power A.I. 😑

Also, even if you do jump through all of these hoops *AND* they approve your request, someone else reposting your work means that it gets fed to the algorithm anyway.

There are so many infuriating things about this technology but one of them is that you’re going to see less art online going forward. It’s getting to the point where the benefit of sharing your work isn’t worth shooting yourself in the foot by feeding A.I. image generators.

Also, this Facebook/Instagram policy doesn’t just affect artists. If you don’t want photos of yourself and friends/family being fed into image generators, too bad apparently.

Did you write a heartfelt eulogy to a deceased friend or relative? Meta owns that now.

Jon Lam: Lot of us are getting our requests to opt out denied. It’s complete bullshit.

Facebook’s email to Jon Lam: Hi,

Thank you for contacting us.

Based on the information that you have provided to us, we are unable to identify any examples of your personal information in a response from one of Meta’s generative AI models. As a result, we cannot take further action on your request.

If you want to learn more about generative AI, and our privacy work in this new space, please review the information we have in the Privacy Center.

How Meta uses information for generative AI.

Thank you for your inquiry, Privacy Operations.

Darren M. A. Calvert: They can’t identify any examples so they’re going to make it happen.

Neigh-Martin: I sent an objection just stating “I don’t consent to my posts being used for your plagiarism machine” and it was approved in about five minutes. The reposters loophole is the fatal flaw though.

Darren: I’m starting to get the impression that at least part of the approval process has to do with what country you live in and what Meta thinks they can get away with.

All right, fine. I’m surprised now. Using dark patterns to discourage opt-outs, and using reposts and fan pages and so on as excuses? I expected that.

Actively refusing an artist’s opt-out request is something else.

Seth Burn: This sounds pretty bad, even by modern FB standards.

The question, as always, is if we object, what are we going to do about it?

What happens if AI takes our jobs ‘without replacement?’ In particular, what if that job is ‘generate useful data?’ Where does this arms race end? Here is a common concern about a mundane AI future:

Kyle Chayka: it’s hard to overemphasize this: Google and OpenAI have no plan for how or why people will generate *new, correct information* in the age of generative AI search.

Search clickthroughs will plummet, ads will be sold on generated answers, and media licensing fees for AI models can’t sustain enough new journalism to fuel the tech companies’ own products.

So where is the content going to come from? Only YouTube has really accepted that ad revenue has to be shared with creators, otherwise your platform is going to gradually peak and die. And now generative AI threatens to replace a lot of human authorship anyway.

If AI search and generative tools don’t create incentives for the “production of new content” online, to put it grossly, then it’s not going to happen and what we’re faced with is circling the toilet of AI trained on itself.

You could say “everything should be like Reddit” with people just posting about their own expert passions but only tech bros living on startup equity and extractive Silicon Valley wealth think that’s sustainable.

This is a tragedy of the commons model. As Kyle says later, it would work if the AI companies paid enough for data to sustain information generation, but that requires deals with each source of generation, and for the payments to be large enough.

This is part of The Big Rule Adjustment. Our norms rely on assumptions that will cease to hold. All you can eat can be a great promotion until people start figuring out how to eat quite a lot more and ruin it for everyone. Doing the information extraction and regurgitation trick is good and necessary and fair use at human scale, and at Google search scale, but go hard enough on the AI scale, taking away traditional compensation schemes (and not only the money), and the result is transformational of the incentives and results.

The natural solution is if deals are made like the ones OpenAI made with Newscorp and Reddit last week, or individual creators get compensation like on YouTube, or some combination thereof. If different AI companies compete for your data, especially your real time data, or a monopoly can internalize the benefits and therefore pay the costs, you can be fine without intervention.

Nor do we always ‘need a plan’ for how markets solve such problems. As long as we are dealing with ‘mere tools’ it takes a lot to keep such systems down and we should be skeptical things will fail so badly.

The light touch correction is the most promising, and the most obvious. Either you need to make a deal with the owner of the data to use it in training, or you need to pay a fixed licensing fee like in radio, and that is actually enforced. A plausible endgame is that there are various information brokerage services for individuals and small firms, that will market and sell your content as training data in exchange for a share of the revenue, and work to filter what you do and don’t want to share.

The problems also seem self-correcting. If the AI information degrades sufficiently, and they can’t work their way around that, then people will stop using the AIs in the impacted ways.

There is indeed the pattern, known as ‘the enshittification cycle,’ of ‘company builds platform with lock-in effects, customers get habits, company gradually makes it worse to raise revenue.’

That cycle is real, but wise platforms like YouTube stabilize at a reasonable balance, and eventually they all either pull back from the brink or get replaced by the new hotness, or both.

Here, it seems obvious that the central problem of Google search is not that Google is getting overly greedy (even if it is), but instead the arms race with SEO, which is now an arms race with increasingly AI-powered SEO.

Kelsey Piper: I do think an important thing about Google search is that they’re in an arms race with people who are trying to push their preferred content to the top of the first page, and these days the people doing that are using AI to manufacture the stuff they’re pushing.

“Why can’t we have old Google search back” is because Google search has always been an arms race between Google trying to put good stuff on the front page and everyone on the internet trying to put their stuff on the front page.

Right now Google definitely seems to be losing the battle, and that’s bad. But there isn’t some world where they just did nothing and search stays good; their adversaries weren’t doing nothing.

There is little doubt Google has lost ground and is losing ground right now, on top of any changes they made to enhance revenue. They are in a tough spot. They have to ‘play defense’ on everything all the time. They need to do so in a way customized to the user and context, in a way that is instantaneous and free and thus uses little compute per query.

I do predict the pendulum will swing back. As the models improve and they get more experience, the defense should be favored. There is enough ‘old internet’ data, and ways to generate new bespoke or whitelisted data, to bootstrap initial AIs that can differentiate even with a lot of noise. They’ll figure out how to better precalculate and cache those results. If they can’t, I think that will be on them.

We’ve been over similar ground before, but: There are various classic examples of ‘technology created more jobs.’ One of them is ATMs leading to more bank tellers by increasing demand for banking services.

Aaron Levie: Bank teller employment continuing to grow during the rise of ATMs is a perfect example of how automation lowers the cost of delivering a particular task, letting you serve more customers, and thus growing the category. We are going to see this over and over again with AI.

Yes, teller employment went up, but the population was also expanding: it increased from about 223 million to 310 million from 1980 to 2010. The number of tellers per capita went down, not up.

Also, while ATMs certainly contributed to people using banks more, the population got a lot richer and things got more financialized over that period. The baseline scenario would presumably have seen a substantial rise in per capita bank tellers.
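To make the per-capita point concrete, using only the population figures above and leaving the teller counts symbolic:

```python
pop_1980, pop_2010 = 223e6, 310e6   # population figures from the paragraph above
growth = pop_2010 / pop_1980        # ~1.39
print(f"Teller employment had to grow {growth - 1:.0%} just to hold tellers per capita constant")
# Any smaller increase means the per-capita number fell even while the headline count rose.
```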

Matt Yglesias: What happened after 2010?

Jon: Yeah not showing what happened after peak atm installs is extremely disingenuous given the commentary.

Sheel Mohnot: Went down bc of mobile banking, which eliminated the branches. So ultimately tech came for them.

The general form is that in many cases AI and other technology starts off growing the category while decreasing labor intensity, which can go either way for employment but makes us richer overall. Then the automation gets good enough, and the category demand sufficiently saturates, and it is definitely bad for sector employment. With AI both phases will typically happen a lot faster.

Then the question is, does AI also take away the jobs those humans would have then shifted to in other sectors?

My answer is that at first, in the short run, AI will be bad for a few sectors but be very good for overall employment. Then if capabilities keep advancing we will reach a turning point, and by default AI starts being quite bad for employment, because AI starts doing all the newly demanded jobs as well.

If someone keeps warning ‘even mundane AI will take all our jobs and we won’t have new ones’ without any conditions on that, then they are failing to notice the pattern of technology throughout history, and the way economics works and the giant amounts of latent demand for additional services and goods if we get wealthier.

If someone keeps repeating the mantra ‘AI will mean more jobs because technology always means more jobs,’ and essentially treats anyone who expects anything else as an idiot who doesn’t know that farmers ended up with other jobs, they are treating a past trend like a law of nature, and doing so out of its distribution, with a very different type of technology, even if we restrict ourselves to mundane AI.

How likely do we think it is an AI will take our jobs?

I notice if anything an anti-correlation between where I expect AI to take people’s jobs, and where people expect it to happen to them.

Also these are very high rates of expecting to lose jobs within ten years. 54% said at least probably yes, 48% in America.

This graph is also interesting, including outside of AI:

There’s something to the Indian attitude here. Jobs are easy come, easy go.

Hasbro tells makers of My Little Pony: Make Your Mark that AI, rather than friendship, is magic, and they want to use AI voices for season 2. Producer Cort Lane took a hard stance against the use of AI, choosing to shut the entire series down instead. This comes on the heels of the foreign language voices in My Little Pony: Tell Your Tale being AI generated.

Scale.ai launches the SEAL leaderboards. We definitely need more distinct approaches here, and this seems like a good approach if executed well.

The design principles are:

  1. Private tests so no one can overfit.

  2. Domain experts are used for evaluations.

  3. Continuous updates with new data and models.

If executed well, that sounds great. A valuable community service. The obvious issue is that this requires trust in those doing the evaluations, and is potentially vulnerable to idiosyncratic decisions or preferences.

I especially appreciate their warning that a model can only be evaluated once, when an organization first encounters the prompts, to preserve test integrity, although I wonder what we do when the next generation of model comes out?

One big worry is conflicts of interest.

Anton: Good benchmarks are important but i find it difficult to trust results reported by a company whose primary customers are the producers of the models under evaluation. the incentives go against objectivity.

I can’t imagine a company spending millions on scale labeling to not move the needle on these evals. Perverse incentives.

I can imagine it not mattering, although of course I can also imagine it mattering. This is a longstanding problem, see for example mortgage bonds. There are clear examples of corruption in similar situations for almost no gain, and also clear examples of integrity despite great temptations.

How reliable is Scale.ai here? My presumption is reliable enough for these to be a useful additional source, but not enough to be heavily load bearing until we get a longer track record. The most trustworthy part is the relative strengths of different models across different areas.

One thing that helps is sanity checking the results. If the methodology is severely flawed or unreasonable, it should be obvious. That doesn’t cover more subtle things as robustly, but you can learn a lot.

Another issue is lack of clarity on what the numbers represent. With Elo ratings, you know what a 30 point gap means. Here you do not. Also we do not get the fuller range of models tested, which makes calibration a bit harder.
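For reference, this is why an Elo gap is interpretable on its own; the standard Elo expectation formula, nothing specific to SEAL:

```python
def elo_expected_score(gap: float) -> float:
    """Expected score for the higher-rated player, given the rating gap in Elo points."""
    return 1 / (1 + 10 ** (-gap / 400))

print(f"{elo_expected_score(30):.1%}")  # ~54.3%: a 30-point gap is a modest but meaningful edge
```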

So what did we find?

There is no ‘overall’ category, but clearly GPT-4o is on top and Claude Opus and Gemini 1.5 Pro (and GPT-4-Turbo) are competitive.

Did you know that sometimes people sue OpenAI (and also GitHub it seems) for copyright infringement?

The merits are highly correlated, so it is still plausible OpenAI runs the table.

Google researchers find most ‘image-based disinformation’ is now AI-generated. That is certainly ‘what I would do’ if I was in the image disinformation business. It does not tell us much about the scope of the problem.

Swift on Security is worried about AI self-images on social media.

Also non-self images.

Swift on Security: Hell yeah gonna put myself into a sexy schoolgirl outfit thanks Instagram it’s definitely my face I’m uploading.

Literally a schoolgirl nudifying undress webapp advertised by and usable in Instagram’s browser. I uploaded their own ad image and although it’s blurred seems like it works to some extent. They can detect words like “erase” “clothing” they just don’t care.

It’s literally endless I have hundreds of these screenshots since I opted-in to these categories and always interact with the AI ads.

PoliMath: I don’t know how to slow this down or stop this but my gut instinct is that we really need to slow this down or stop this.

I’m becoming less interested in how to do so politely.

We are less than 2 years into this being a thing.

The consequences of this (especially for young people) are unknown and may be quite severe.

If you were wondering if there’s any fig leaf at all, no, there really isn’t.

I get why it is impossible to stop people from going to websites to download these tools. I do not get why it is so hard to stop ads for them from appearing on Instagram. We are not exactly up against the best and brightest in evading filters.

Ultimately you end up in the same place. Any unrestricted device will be able to use fully unlocked versions of such apps without technical expertise. They will make it easy, and the pictures will get harder to distinguish from real and stop all looking suspiciously like the same woman in the same pose if you think about it.

This is the trilemma. Lock down the model, lock down the device, let people do what they want in private and filter your platform.

You do at least have to do the last one, guys. Jesus.

Meanwhile, Meta’s head of global affairs said that AI-generated content isn’t a big problem, just ‘a manageable amount.’

Or you could do something more wholesome, like a beauty pageant.

Justine Moore: Lol someone is hosting a “Miss AI” beauty pageant.

$20k in prizes will go to creators of AI-generated models.

They must not only submit photos, but answer the traditional pageant questions like “how would you make the world a better place?”

Note that the prizes are partly fake, although there is some cold hard cash.

Alas, entries are long since closed, no one told me until now.

Timothy Lee asks, what exactly would it be illegal to do with Scarlett Johansson’s voice, or anyone else’s? Technically, where is the law against even an actual deepfake? It is all essentially only the right of publicity, and that is a hell of a legal mess, and technically it might somehow not matter whether Sky is a deepfake or not. The laws are only now coming, and Tennessee’s ELVIS Act clearly does prohibit basically all unauthorized use of voices. As Timothy notes, all the prior cases won by celebrities required clear intent by the infringer, including the video game examples. He expects companies to pay celebrities for their voices, even if not technically required to do so.

What I do know is that there is clear public consensus, and consensus among politicians, that using a clear copy of someone else’s voice for commercial purposes without permission is heinous and unacceptable. Where exactly people draw the line and what the law should ultimately say is unclear, but there is going to be a rule and it is going to be rather ironclad at least on commercial use. Even for personal non-sexy use, aside from fair use or other special cases, people are mostly not okay with voice cloning.

(As a reminder: Some think that Sky being based on a different woman’s natural voice is a get-out-of-lawsuit-free card for OpenAI. I don’t, because I think intent can lie elsewhere, and you can get damn close without the need to give the game away but also they then gave the game away.)

Dwarkesh Patel is hiring a full time podcast editor, $100k+, in person in San Francisco. He’s looking for mad skills and compulsive attention to detail. Apply here.

Free ChatGPT users get browse, vision, data analysis, file uploads and GPTs, says OpenAI’s Twitter account, then the announcement post got taken down.

Nuha, a stuffed animal that is also a GPT-4 instance.

Gecko, DeepMind’s new benchmark for image models.

Backseat.ai, an AI coach for League of Legends based on cloning the popular streamer loltyler1.

DeepMind’s Gemma 2, announced on May 14.

Vox Media is latest to form strategic content and product partnership with OpenAI.

The Atlantic followed suit as well.

They also are collaborating with WAN-IFRA on a global accelerator program to assist over 100 news publishers in exploring and integrating AI in their newsrooms.

This comes on the heels of last week’s deal with Newscorp.

OpenAI’s plan seems clear. Strike a deal with the major media organizations one by one, forcing the stragglers to follow suit. Pay them a combination of money and access to AI technology. In exchange you get their training data free and clear, and can use their information in real time in exchange for providing links that the users find helpful. Good plan.

Yelix: maybe it’s because i’m a normal person who doesn’t have terminal CEO Brain but i just can’t fathom why anyone who runs a media org would align with OpenAI.

This is not even close to an equal exchange to a person with reasonable values. Vox is giving up a couple decades’ worth of (overworked, underpaid, most likely laid off years ago) human labor so they can do targeted ad sales.

I guess when you have an opportunity to partner with quite possibly the least credible person in tech, Sam Altman, you just gotta do it.

Seth Burn: Presumably, it’s because OpenAI is providing money for content, which might be hard to come by these days.

Yelix has a point, though. This is the equivalent of selling your seed corn.

Some people noticed. They were not happy. Nor had they been consulted.

Text of Announcement: Today, members of the Vox Media Union, Thrillist Union, and The Dodo Union were informed without warning that Vox Media entered into a “strategic content and product partnership” with OpenAI. As both journalists and workers, we have serious concerns about this partnership, which we believe could adversely impact members of our union, not to mention the well-documented ethical and environmental concerns surrounding the use of generative AI.

We demand that Vox Media engage with us on this issue transparently — and address our many unanswered questions about this partnership — instead of continuing to fail to include our voices in decisions like these. We know that AI is already having a monumental impact on our work, and we demand a seat at the table in discussions about its future at Vox Media.

Seth Burn: Former Cowboys president Tex Schramm to former NFLPA union chief Gene Upshaw, “You guys are cattle and we’re the ranchers, and ranchers can always get more cattle.”

Tex never dreamed of AI cattle though.

Kelsey Piper (Vox): I’m very frustrated they announced this without consulting their writers, but I have very strong assurances in writing from our editor in chief that they want more coverage like the last two weeks and will never interfere in it. If that’s false I’ll quit.

Kelsey Piper will, once again, be the test. If the reassurances prove hollow, I presume she will let us know. At that point, there would be no question who OpenAI is.

I do not see Google (or Anthropic or anyone else) competing with them on this so far. One possibility is that Google can’t offer to pay because then the companies would demand payment for Google search.

x.ai raises $6 billion at an $18 billion valuation.

Jan Leike lands at Anthropic, where he will continue the work on scalable oversight, weak-to-strong generalization and automated alignment research. If your talents are not appreciated or supported, you take your talents elsewhere.

Karina Nguyen moves from Anthropic to OpenAI after two years, offers lessons learned. As is usually the case such lists offer insights that are most interesting for which ones are emphasized and which are left out. It does not provide any insight on why she made the move.

A thread from Microsoft’s event last week, clarifying their stance. CTO Kevin Scott indeed claims that we are nowhere near diminishing marginal returns to magnitude of compute, but that is not the business Microsoft is ultimately running, or thinks is important. The frontier models are of minor value versus models-as-a-service, an array of different cheaper, smaller and faster models for various situations, for which there is almost limitless demand.

This creates an odd almost bimodal situation. If you go big, you need something good enough to do what small cannot do, in a way that beats humans. Otherwise, you go small. But going big is expensive, so the question is, can you make it all worth it? Where ‘actually replacing people’ is one way to do that.

Diffusion world model improves state of the art on Atari games trained on 100k frames.

An AI safety institute for France?

Epoch AI thread with charts on the growth of frontier model compute costs.

Epoch also gives us a thread, paper and blog post on various case studies for ‘return to research effort,’ meaning how much efficiency gain you get when you double your R&D costs. Do you get critical mass that could enable recursive self-improvement (RSI) via explosive tech growth? Chess engine Stockfish comes out at ~0.83, just below the critical 1.0 threshold. The others seem higher.

Software returns, the returns that most matter, look high, much higher than the economy overall, where Bloom (2020) found r ~ 0.32 and Epoch AI found r ~ 0.25. It makes sense this number should be higher, but I have no good intuition on how much higher, and it seems odd to model it as one number. My presumption is there is some capabilities level where you would indeed see a foom if you got there, but that does not tell us if we are getting there any time soon. It also does not tell us how far you could get without running into various physical bottlenecks, or what else happens during that critical period.
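Here is a toy illustration of why 1.0 is the critical threshold, on my reading of the thread: each doubling of cumulative research input multiplies efficiency by 2^r, and the improved software is itself doing the research. This is a gloss, not Epoch’s actual model.

```python
# Efficiency follows cumulative_input ** r, i.e. each doubling of cumulative input
# multiplies efficiency by 2 ** r, and current efficiency sets next period's research input.
def toy_progress(r: float, steps: int = 8) -> list:
    efficiency, cumulative_input = 1.0, 1.0
    history = []
    for _ in range(steps):
        cumulative_input += efficiency        # research effort scales with current efficiency
        efficiency = cumulative_input ** r
        history.append(round(efficiency, 1))
    return history

print(toy_progress(0.25))  # tame, roughly polynomial growth
print(toy_progress(0.83))  # faster, but still no blowup
print(toy_progress(1.50))  # accelerates explosively (finite-time blowup in the continuous version)
```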

Sam Altman signs the Giving Pledge, to give half or more of his wealth to philanthropy. He says he intends to focus on supporting technology that helps create abundance for people, together with Oliver Mulherin. Jessica and Hemant Taneja also signed today, also intending to focus on technology. It is an unreservedly great thing, but what will matter is the follow through, here and elsewhere.

OpenAI has begun training what it hopes will be GPT-5.

OpenAI forms a Safety and Security Committee led by directors Bret Taylor (Chair), Adam D’Angelo, Nicole Seligman, and Sam Altman (CEO).

Here is the rest of the announcement:

This committee will be responsible for making recommendations to the full Board on critical safety and security decisions for OpenAI projects and operations.

OpenAI has recently begun training its next frontier model and we anticipate the resulting systems to bring us to the next level of capabilities on our path to AGI. While we are proud to build and release models that are industry-leading on both capabilities and safety, we welcome a robust debate at this important moment.

A first task of the Safety and Security Committee will be to evaluate and further develop OpenAI’s processes and safeguards over the next 90 days. At the conclusion of the 90 days, the Safety and Security Committee will share their recommendations with the full Board. Following the full Board’s review, OpenAI will publicly share an update on adopted recommendations in a manner that is consistent with safety and security.

OpenAI technical and policy experts Aleksander Madry (Head of Preparedness), Lilian Weng (Head of Safety Systems), John Schulman (Head of Alignment Science), Matt Knight (Head of Security), and Jakub Pachocki (Chief Scientist) will also be on the committee.

Additionally, OpenAI will retain and consult with other safety, security, and technical experts to support this work, including former cybersecurity officials, Rob Joyce, who advises OpenAI on security, and John Carlin.

It is good to see OpenAI taking the safeguarding of GPT-5 seriously, especially after Jan Leike’s warning that they were not ready for this. It is no substitute for Superalignment, but it is necessary, and a very good ‘least you can do’ test. We will presumably check back in 90 days, which would be the end of August.

Given the decision to advance the state of the art at all, OpenAI did a reasonably good if imperfect job testing GPT-4. Their preparedness framework is a solid beginning, if they adhere to its spirit and revise it over time to address its shortcomings.

This is what many people inside the major labs actually believe.

Roon: Models will obviously be superintelligent in some domains long before they’re human level in others or meet the criteria of replacing most economically valuable labor.

The question of building ASI and AGI are not independent goals. Moreover anyone who finds themselves in possession of a model that does ML research better than themselves isn’t likely to stop.

The timelines are now so short that public prediction feels like leaking rather than sci-fi speculation.

The first statement is obviously true and has already happened.

The second statement is obviously true as stated, they are unlikely to stop on their own. What is not clear is whether we will reach that point. If you agree it is plausible we reach that point, then what if anything do you propose to do about this?

The third statement I believe is true in terms of the felt experience of many working at the labs. That does not mean their timelines will be realized, but it seems sensible to have a plan for that scenario.

This is somewhat complicated by the overloading and goalpost shifting and lack of clear definition of AGI.

Roon: I just love to see people confidently claim that LLMs will never do things that they can currently do.

Fernando Coelho: Do you refer to those available publicly or those still in closed training?

Roon: Both.

Whereas here are some future visions that don’t realize AI is a thing, not really:

Timothy Lee requests we solve for the equilibrium.

Timothy Lee: I really wish there were more economists involved in discussions of the implications of superintelligence. There is so much sloppy thinking from smart people who have clearly never tried to think systematically about general equilibrium models.

The most obvious example is people predicting mass unemployment without thinking through the impact of high productivity on fiscal and monetary policy. There are also people who implicitly assume that the economy will become 90 percent data centers, which doesn’t make much sense.

I consider this to be very much ‘burying the lede’ on superintelligence, the continued assumption that somehow we still get ‘economic normal’ in a world with such things in it. I have ‘solved for the equilibrium’ in such cases. We do not seem involved. What would be the other equilibrium?

Saying ‘you forgot to take into account impact on fiscal and monetary policy’ is a good objection, but ignores the much more important things also being ignored there.

If you constrain your thinking short of superintelligence or transformational AI, then such considerations become far more important, and I agree that there is a deficit of good economic thinking.

The problem is that the ones letting us down the most here are the economists.

This issue goes far beyond dismissing existential risk or loss of control or anything like that. When economists model AI, they seem to come back with completely nonsensical projections that essentially say AI does not matter. They measure increased productivity or GDP in individual percentage points over a decade. Even if we assume all the bottlenecks stay in place and we have full economic normal and no loss of control issues and progress in capabilities stalls at GPT-5 (hell, even at current levels) the projections make no sense.

The economists have essentially left, or rather declined to enter, the building.

Here is some choice peak Robin Hanson.

Rob Henderson: Damn. [Shows statistic that number of Americans who think of themselves as patriotic has declined from 70% in 1998 to 38% in 2024.]

Robin Hanson: More crazy fast cultural value change. No way we can have much confidence such changes are adaptive. Why aren’t you all terrified by this out of control change?

Kaj Sotala: I’m a bit surprised to see you concerned about changes in human values, when my impression was that you were mostly unconcerned about possible value shifts brought about by AGI. I would assume the latter to be much bigger than the former.

Robin Hanson: I don’t assume AI changes are much bigger, though digital minds of all sorts likely induce faster changes. And I’m not unconcerned; I’ve mainly tried to say AI isn’t the problem, there are more fundamental problems.

While I too am concerned by some of our existing highly rapid cultural changes, especially related to the drop in fertility, I really do not know what to say to that. Something about ‘we are not the same?’

In the middle perhaps is Ben Thompson, who knows AI is a big deal but focuses on which tech companies will get to claim the profits. These are important questions no matter your view on more existentially risky matters, and it is great to see Ben ‘focus on the big picture’ in this area and find the abstractions and metaphors. To him:

  1. Google is trying to be the Apple of AI, fully integrated on all levels. If Google can still build great products, ideally both software and hardware, they will win.

  2. Amazon’s AWS is betting everything is modular.

  3. Microsoft is in the middle, optimizing its infrastructure around OpenAI (while also trying to get its own alternatives off the ground, which I am skeptical about but could eventually work).

  4. Nvidia keeps working on its chips and has nothing to fear but true vertical integration like we see at Google, or technically competitors but not really. The other potential threat, which Ben does not mention, is alternative architectures or training systems potentially proving superior to what GPUs can offer, but the market seems skeptical of that. It has been good to be Nvidia.

  5. Meta is all-in on products and using Llama to serve them cheaply, so for now they benefit from optimization and thus open source.

The last section, on ‘AI and AGI,’ seems like Thompson not understanding how AI development works and scales. No, maximizing ‘every efficiency and optimization’ is unlikely to be the key to getting something approaching AGI, unless those gains are order of magnitude gains. Execution and actually getting it done matter a lot more. Google has big advantages, and data access, services integration and TPUs are among them. Even with his view Thompson is skeptical Google can get much model differentiation.

My hunch is that even more than the rest of it, this part comes from Thompson not feeling the AGI, and assuming this is all normal tools, which makes all of it make a lot more sense and seem a lot more important. Notice he doesn’t care that Anthropic exists, because from his perspective models do not matter so much, business models matter.

Google CEO Sundar Pichai predicts we will dynamically compose UIs on the fly, in ways that make sense for you. I agree we will come up with new ones, but an important secret is that users do not want you to make things complicated for them. They want you to make things easy.

Arnold Kling says an AI Windows PC is a contradiction, because if it were AI you wouldn’t use a mouse and keyboard; to him, AI is centrally about the human-computer interface. I think this is very wrong even on the pure UI level, and Arnold’s example of writing makes that clear. Short of a brain-computer interface where I can think the words instead of type them, what other interface am I going to use to write? Why would I want to use voice and gesture? Sure, if you want to go hands free or mobile you might talk to your phone or computer, but typing is just better than speaking, and a mouse is more precise than a gesture, and AI won’t change that.

What the AI UI does is let you bypass the rest of the interface, and automate a bunch of knowledge and memory and menus and capabilities and so on. The Copilot+ promise is that it remembers everything you ever did, knows how everything works, can help figure things out for you, code for you and so on. Great, if you can do that without privacy or security nightmares, good luck with that part. But why would I want to give up my keyboard?

This goes, to me, even for VR/AR. When I tried the Apple Vision Pro, the killer lack-of-an-app was essentially an air keyboard. As in, I had no good way to type. With good enough cameras, I wanted to literally type in the air, and have it figure out what I was trying to do, although I am open to alternatives.

Also, of course, I see AI as mostly doing something unrelated to all of that; this is a sideshow or a particular use case.

It is always fun to contrast the economists saying ‘it might raise GDP a few percent over ten years’ versus people who take the question seriously and say things like this:

Matt Clifford: I’m actually very bullish on the UK’s mid-term future:

  1. AI: one of the best places in the world to build AI companies + high state capacity in AI relative to peers

  2. Science: great uni base, plus bold bets like ARIA.

  3. Talent: still attracts large number of very high quality people thanks to unis, the City, DeepMind, a vibrant startup ecosystem, etc

  4. High quality institutions / fundamentals

I am less bullish until I see them building houses, but yes the AI thing is a big deal.

File under ‘predictions that lots of people are eager to bet against.’

John Arnold: Semiconductor manufacturing subsidies announced in the past 2 years:

US: $52 bln

India: $10 bln

Japan: $25 bln

EU: $46 bln

S Korea: $19 bln

UK: $1 bln

China: $47 bln

I think we know how this is going to turn out.

Robin Hanson: Yes, we will soon see a glut, with prices too low for profits.

Davidad: Noted economist and foom-skeptic robin hanson also anticipates an imminent era of GPUs too cheap to meter.

I completely disagree. Demand for compute will be very high even if capabilities do not advance. We are going to want these chips actually everywhere. These investments will not be so efficient, and are not so large considering what is coming; have you seen the market caps of Nvidia and TSMC?

Robin Hanson (February 6, 2024, talking about Nvidia at $682): Buy low, sell high. So, SELL.

I am happy to report I bet against that prediction. As I write this, it is at $1,116.

Visions of a potential future. I don’t see the story as realistic, but it is an admirable amount of non-obvious concreteness.

Claim that LLMs can’t plan, but can help planning in LLM-Modulo frameworks, whereas CoT, ReAct and self-verification don’t help.

Davidad: Consider me fully on board the “LLM-Modulo” bandwagon. As long as one or more of the critics is a sound verifier (which indeed seems to be the authors’ intention), this is a Guaranteed Safe AI pattern. Though I would say “Version Control System” instead of “Blackboard”.

I continue to not see why this would be expected to work, but wish him luck and am happy that he is trying.

John Luttig notices that the future of AI cannot be dominated by open source and also be dominated by closed source, despite both claims being common. So who is right?

He notes that right now both coexist. At the high end of capabilities, especially the largest frontier models, closed source dominates. But for many purposes people value open weights and the flexibility they provide, and hosting yourself saves money too, so they are fine with smaller and less efficient but private and customizable open models.

He also offers this very good sentence:

John Luttig: Meanwhile, an unusual open-source alliance has formed among developers who want handouts, academics who embrace publishing culture, libertarians who fear centralized speech control and regulatory capture, Elon who doesn’t want his nemesis to win AI, and Zuck who doesn’t want to be beholden to yet another tech platform.

I very much appreciate the clear ‘baptists and bootleggers’ framing on open weights side, to go with their constant accusations of the same. As he points out, if Meta gets competitive on frontier models then Zuck is going to leave this coalition at some point when the economics of Llama and therefore his incentives change, and Elon’s position is I am guessing unstrategic and not so strongly held either.

Thus Luttig’s core logic: as costs scale, the open systems’ economics fail and they switch strategies or drop out. Using open weights looks cheaper, but it comes with various additional burdens and costs, especially if the model is at core less efficient, and thus you either get a worse model or a more compute-intensive one or both versus using closed.

I am not as convinced by his argument that free is reliably worse than paid as a pattern. Contrary to his claim, I would say Android is not worse than iOS; I am on Android because I think it is better, and I defy those who, like Luttig, claim a large quality gap the other way. OpenOffice is worse than Google Docs, but Google Docs is also free (albeit closed) and it is in practical terms better than the paid Microsoft Office, which is again why I don’t pay. Unity is an example of sufficiently obnoxious holdup issues that I’d rather use an alternative even if Unity is technically better.

And those are only his examples. Linux for servers is typically considered better than anything paid, and with Copilot+ it is a reasonable question whether it is time for me to switch to Linux for my next machine. I might trust my local machine to have universal memory with Linux levels of security. With Microsoft levels, not so much.

Here is another very good sentence:

Advocates like Yann LeCun claim that open-sourced AI is safer than closed. It makes me wonder if he really believes in Meta’s AI capabilities. Any reasonable extrapolation of capabilities with more compute, data, and autonomous tool use is self-evidently dangerous.

This is the same week we get LeCun saying that there exist no general intelligences, not even humans. So perhaps it is not Meta’s AI he does not believe in, but AI in general. If we lived in a world in which GPT-5-level models were as good as it was ever going to get in my lifetime, I would be on the open source side too.

Appealing to American security may seem overwrought, but the past five years of geopolitics has confirmed that not everyone is on the same team. Every country outside America has an interest in undermining our closed-source model providers: Europe doesn’t want the US winning yet another big tech wave, China wants free model weights to train their own frontier models, rogue states want to use unfiltered and untraceable AI to fuel their militaristic and economic interests.

AI is a technology of hegemony. Even though open-source models are lagging behind the frontier, we shouldn’t export our technological secrets to the world for free.

Again, very well said. I am impressed that Tyler Cowen was willing to link to this.

Ultimately, this was a very good post. I mostly agree with it. My biggest gripe is the title is perhaps overstated – as both he and I think, open weights models will continue to have a place in the ecosystem, for smaller systems where local control is valuable.

And to be clear, I think that is good. As long as that stays below critical thresholds that lie beyond GPT-4, and that can expand at least somewhat once the frontier is well beyond that, the dangers I worry about wouldn’t apply, so let my people cook (brb applying for copyright on that phrase since I’ve never heard that exact phrasing.)

Peter Thiel predicts AI will be ‘good for the verbal people, bad for the math people,’ and notes that within a few years AI will be able to solve all the Math Olympiad problems. First we had AI that was much better at math problems than verbal problems (as in, every computer before 2018), and that was very good for math people. Now we have AI that is much better at verbal and worse at math, but which can be used (because verbal is universal and can call the old computers for help) to make something better at math. He asks why we test people on math, since that doesn’t make a good surgeon; he says he had a bias toward chess, but that got undermined by the computers.

But I think no? The chess test is still good, and the math test is still good, because your ability to acquire those skills is indicative. So what if AlphaZero can beat Kasparov? Kasparov could already beat Thiel, and you, and that didn’t matter either. Math-style skills, and software-related skills, will be needed to make sense of the AI era even if you are not earning your living by doing the actual math or coding or chess mastering.

This is also a result of the ‘verbal vs. math’ distinction on various tests and in classes, which seems like the wrong question. You need a kind of symbolic, conceptual mastery of both more, and you need the basic skills themselves less thanks to your spellchecker and calculator and now your prover and your LLM. That doesn’t say much about which style of skill and advantage is more valuable. I do think there could be a window coming where ‘physical manipulation’ skills have the edge, where it is the manual labor that gets the edge over both the math and verbal crowds, but I wouldn’t consider that a stable situation either.

The real argument for verbal over math in the AI era to me is completely distinct from Thiel’s. It is that if AI renders us so unneeded and uncompetitive that we no longer need any skills except to ‘be a human that interacts with other humans’ and play various social games, where the AI can’t play, and the AI is doing the rest, then the math people are out of luck. As in, math (in the fully general sense) is useful because it is useful, so if people are no longer useful but are somehow alive and their actions matter, then perhaps the math people lose out. Maybe. My guess is the math crowd actually has a lot of edge in adapting to that path faster and better.

The FTC under Lina Khan seems continuously unhinged, and they are back at it.

Sarah Fortinsky (The Hill): Federal Trade Commission (FTC) Chair Lina Khan said Wednesday that companies that train their artificial intelligence (AI) models on data from news websites, artists’ creations or people’s personal information could be in violation of antitrust laws.

I mean, sure, I can see problems you might have with that. But… antitrust? What?

It seems the FTC’s new theory is that it is the new everything police, regardless of what the laws say, because anything that is ‘unfair’ falls under its purview.

“The FTC Act prohibits unfair methods of competition and unfair or deceptive acts or practices,” Khan said at the event. ”So, you can imagine, if somebody’s content or information is being scraped that they have produced, and then is being used in ways to compete with them and to dislodge them from the market and divert businesses, in some cases, that could be an unfair method of competition.”

‘Antitrust’ now apparently means ‘any action Lina Khan does not like.’

Lina Khan thinks the contract you negotiated is uncool? Right out, retroactively.

Lina Khan thinks your prices are too high, too low or suspiciously neither? Oh no.

Lina Khan thinks you are training on data that isn’t yours? The general in the meme is here to tell you: also antitrust.

We cannot have someone running around being the ‘this seems unfair to me’ cop. Once again, it feels like if someone runs over rule of law and imposes tons of arbitrary rules, the internet stops to ask if it might plausibly stop us from dying. If not, then they get a free pass. Can we at least be consistent?

Meta has 30 lobbyists across seven firms working for it on AI policy. Their goal is to avoid any and all regulation of frontier models, period. Here are more details.

Guardian has a write-up about big tech’s efforts to distract from existential risk concerns.

Max Tegmark: As I told the Guardian, the techniques big tech lobbyists are using to discredit the loss-of-control risk from future smarter-than-human AI have much in common with what big tobacco and big oil did. See the film “Merchants of Doubt”!

In ‘not AI but I feel your pain’ news: this complaint about how none of the commentators on Biden’s climate policies, whether they support the policies or not, are actually trying to understand what the policies are or what they are trying to accomplish. I am not taking any position on those policies whatsoever, except to say: Oh my do I feel your pain. As it is there, so it is here.

What about optimal taxation policy? Andrew Yang proposes a tax on cloud computing or GPUs to compensate for relatively high taxation of human workers, Kyle Russell says we already have taxes on profits, TS00X1 says imagine a steam engine or internal combustion engine tax and so on.

What these dismissals miss is that neutral taxation requires equalizing the tax burden between relevant alternatives. Suppose you can choose whether to pay an employee in San Francisco $100k to deal with customers, or buy cloud computing services and kiosk hardware and so on, and performance is similar.

In the first case, the human gets take-home pay of roughly $60k, at a total cost to the employer of $112k. In the second case, if you pay $112k, let’s say that average gross margin for the largest providers is 65%, and their tax rate is typically 21%. Even if you threw in California corporate tax (which I presume they aren’t paying) and sales tax, that’s still only $29k in taxes versus $52k. That’s not a complete calculation, but it is good enough to see the tax burdens are not going to equalize.
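To make that arithmetic explicit, here is a minimal sketch of the comparison, using purely illustrative assumptions from the paragraph above (the $112k all-in cost, 65% gross margin, 21% corporate rate, and a rough allowance for state corporate and sales taxes to reach the roughly $29k all-in figure):

```python
# Back-of-the-envelope comparison of the tax wedge on a human employee
# versus spending the same amount on cloud/kiosk services.
# All numbers are illustrative assumptions, not tax advice.

def human_tax_wedge(take_home=60_000, employer_cost=112_000):
    # Everything between what the employer pays and what the worker keeps:
    # income tax, payroll taxes, and so on.
    return employer_cost - take_home  # ~$52k

def compute_tax_wedge(spend=112_000, gross_margin=0.65, corp_rate=0.21,
                      state_and_sales_allowance=0):
    # Federal corporate tax on the provider's margin, plus an optional
    # rough allowance for state corporate tax and sales tax.
    federal = spend * gross_margin * corp_rate  # ~$15k
    return federal + state_and_sales_allowance

print(human_tax_wedge())                                    # 52000
print(compute_tax_wedge())                                  # ~15288 (federal only)
print(compute_tax_wedge(state_and_sales_allowance=14_000))  # ~29288, the ~$29k figure
```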

This could easily result (and in practice sometimes does) in a situation where using computers is a tax arbitrage, and that takes it from uneconomical to economical.

I do not consider this that big a deal, because I expect the cost of compute and other AI services to drop rapidly over time. Let’s say (in theory, for simplicity) that the fully neutral rate of tax on compute was 40%, but the actual effective tax rate was 20%. In many other settings that would be a huge deal, but in AI it is all orders of magnitude. So this only speeds up efficient deployment by a few months.
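As a sanity check on the ‘few months’ claim, here is a minimal sketch: compare the cost markup implied by the tax gap against how quickly compute costs fall. The 40%/20% rates come from the hypothetical above; the 3x-per-year cost decline is my own illustrative assumption.

```python
# How much deployment timing shifts if compute carries a 20% tax instead of
# a neutral 40% tax, given rapidly falling compute costs.
import math

neutral_rate, actual_rate = 0.40, 0.20
markup = (1 + neutral_rate) / (1 + actual_rate)  # ~1.17x price gap from the tax difference

annual_cost_decline = 3.0  # assume compute gets ~3x cheaper per year (illustrative)
months = 12 * math.log(markup) / math.log(annual_cost_decline)
print(f"Tax gap is worth roughly {months:.1f} months of cost decline")  # ~1.7 months
```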

The flip side is that this could be a highly efficient and productive tax. As always, we should look to shift the tax burden according to what we want to encourage and discourage, and when we are indifferent to ensure neutrality. I see a potentially strong economic argument for taxing compute and using that money to cut income taxes, but would want to see more research before drawing conclusions, and I would worry about competitiveness and tax jurisdiction issues. This is exactly the kind of place where a call to ‘model this’ is fully appropriate, and we should not jump to conclusions.

The European Commission revealed details of the new AI Office. Luca Bertuzzi says it is essentially a repackaging of the old AI directorate: five units, 140 people, 80 of whom must be recruited.

Bad ideas for regulations: California’s SB 1446 limiting self-service checkouts. I do think often retailers are making a business and also total welfare mistake by relying more than they should on self-service checkouts, as opposed to ordering kiosks which are mostly great. I actively avoid one local grocery store when I have a choice due to its checkout procedures. But that should be their mistake to make. The real argument for a bill like SB 1446 is that first they mandated all these extra costs of hiring the workers, so now they cost so much that the government needs to force employers to hire them.

Did we have sane regulations of future frontier models all along, in the form of existing tort law?

Lawfare’s Markus Anderljung, Matthew van de Merwe and Ketan Ramakrishnan make the case that tort law can be a big help in its current form, but ultimately argue it is ideally a complement to frontier AI regulation rather than a substitute, after an extensive look at the current legal landscape. Gabriel Weil intends to write a response piece.

By default, for everything, we have the negligence standard. Everyone has a duty to take reasonable care to avoid causing harm, pretty much no matter what.

This certainly is helpful and much better than nothing. I do not see it remotely being enough. Ex post unpredictable assignment of blame, that only fires long after the harm happens and for which ‘reasonable care’ is an excuse?

While we have no industry standards worthy of the name and the damage could well be catastrophic or existential, or involve loss of control over the future, including loss of control to the AI company or to the AI? And also many damage scenarios might not involve a particular (intact) victim that could have proper standing and ability to sue for them? That won’t cut it here.

They also argue that the ‘abnormally dangerous activities’ standard we use for tigers might apply to frontier AI systems, where a presumption of ‘reasonable care’ is impossible, so any harm is on you. I still do not think ‘they can sue afterwards’ is a solution, it still seems like a category error, but this would certainly help, especially if we required insurance. Alas, they (I think correctly) find this unlikely to be applied by the courts on their own.

They then move on to ‘products liability.’ This is a patchwork of different rules by state, but it is plausible that many states will consider frontier AIs products, to which my attitude would be that they better damn well be products because consider the alternative things they might be. Lawfare’s attitude here seems to be a big ‘I don’t know when it would or wouldn’t apply what standard on what harms.’ There are advantages to that, a company like Google hates uncertainty. And it suggests that by ‘foreseeing’ various misuses or other failure modes of such AIs now, we are making the companies liable should they occur. But then again, maybe not.

The right way to ensure responsible development of frontier AI systems, a potentially transformational or existentially risky technology, cannot be ‘ex post if something bad happens we sue you and then we have no idea what the courts do, even if we still have courts.’

They seem to agree?

  1. The main argument provided for relying on tort law is that we lack regulations or other alternatives.

  2. They also suggest tort law is more adaptable, which is true if and only if you assume other laws mostly cannot be modified in response to new information, but also the adaptations have to be fast enough to be relevant and likely to be the ones that would work.

  3. They suggest tort law is less vulnerable to regulatory capture, which is an advantage in what I call mundane ‘economic normal’ worlds.

  4. They suggest that tort law is how you get regulatory compliance, or investment in safety beyond regulatory requirements. Here I agree. Tort liability is a strong complement. Certainly I have no interest in granting frontier AI companies immunity to tort liability.

They list as issues:

  1. Tort law requires the right sort of causal chain to an injury. I strongly agree that this is going to be an issue with frontier AI systems. Any working definition is either going to miss a wide range of harms, or encompass things it shouldn’t.

  2. Tort law has a problem with ‘very large harms from AI,’ which they classify as thousands of deaths. If that was the maximum downside I wouldn’t be so worried.

  3. Tort law doesn’t work with certain types of societal harms, because there is no concrete damage to point towards. There’s no avoiding this one, even if the harms remain mundane. Either you accept what AI ‘wants to happen’ in various ways, or you do not, and tort law only stops that if it otherwise ends up a de facto ban.

  4. Tort law might move too slowly. No kidding. Even if a case is brought today it likely does not see a verdict for years. At the current pace of AI, it is reasonable to say ‘so what if we might be liable years from now.’ By that time the world could be radically different, or the company vastly bigger or gone. If and when the stakes really are existential or transformational, tort law is irrelevant.

  5. They warn of a winner’s curse situation, where the companies that think they are safest proceed rather than those that are safest. Or, I would say, the companies that have less to lose, or are more willing to gamble. A key problem with all safety efforts is that you worry that it can mean the least responsible people deploy first, and tort law seems to make this worse rather than better.

  6. Tort law could hinder socially desirable innovation. The question is the price, how much hindering versus alternative methods. If we indeed hold firms liable for a wide variety of harms including indirect ones, while they do not capture that large a portion of gains, and tort law actually matters, this is a huge issue. If we don’t hold them liable for those harms, or tort law is too slow or ineffective so it is ignored, the tort law doesn’t do its job. My gut tells me that, because it focuses on exactly the harms that we could deal with later, a tort law approach is more anti-socially-desirable-innovation than well-constructed other regulatory plans, at the same level of effectiveness. But also you can do so, so much worse (see: EU).

  7. The final concern is that judges and juries lack expertise on this, and oh boy would that be a huge problem in all directions. Verdicts here are going to be highly uncertain and based on things not that correlated with what we want.

I especially appreciate the note that regulatory rules moderate tort law liability. If you comply with regulatory requirements, that constitutes a partial defense against torts.

They conclude with a classic ‘more research is needed’ across the board, cautioning against giving AI companies liability shields. I certainly agree on both counts there. I especially appreciated the nod to liability insurance. Mandatory insurance helps a lot with the issue that torts are an extremely slow and uncertain ex post process.

Finally, there is a formal paper making the case that SB 1047 is unconstitutional, on the grounds that machine learning code is speech.

The argument here that matters is simple – SB 1047 regulates code, you can’t regulate code, and neural network weights are also speech. And it says that it uses legal precedent to show that the Act is ‘an overreach that stifles innovation and expression in the AI field,’ although even if the Act were that, I don’t know how precedent could show that the Act would do that – the potential stifling is a prediction of future impacts (one I disagree with, but not a crazy thing to claim, especially without specifying magnitude of impact), not a legal finding.

Section one goes over the classic ‘algorithms are speech’ arguments. I am not a lawyer, but my interpretation is that the code for doing training is not restricted in any way under SB 1047 (whether or not that is wise) so this is not relevant. In all these cases, the argument was that you could distribute your software or book, not whether you could run it for a particular purpose. You can yell fire in a crowded theater, but you are not protected by the first amendment if you light the theater on fire, even if it is one hell of a statement.

Thus in my reading, the argument that matters is section two, the claim that the weights of a neural network are speech, because they are a mathematical expression. If an inscrutable black box of numbers is speech, then given the nature of computers, and arguably of the universe, what is not speech? Is a person speech by their very existence? Is there any capability that would not be speech, in any context?

The whole line seems absurd to me, as I’ve said before.

And I think this line kind of shows the hand being played?

While it is important to ensure the safe and ethical use of AI, regulatory measures must be carefully balanced to avoid infringing upon free speech rights.

SB-1047’s provisions, which mandate safety determinations and compliance with safety standards, could be seen as imposing undue restrictions on the development and dissemination of neural network weights.

Wait, what? Which is it? Saying you have to meet safety standards sounds like we should be talking price, yet I do not see talk here of price afterwards. Instead I see a claim that any restrictions are not allowed.

Oh boy, is this person not going to like the Schumer Report. But of course, since it is not explicitly motivated by making sure everyone doesn’t die, they haven’t noticed.

In particular, there is talk in the Schumer Report of classifying model weights and other AI information, above a threshold, on the grounds that it is Restricted Data. Which is a whole new level of ‘F your free speech.’ Also phrases like ‘reasonable steps’ to ‘protect children.’ Yet here they are, complaining about SB 1047’s self-certification of reasonable assurance of not causing catastrophic harm.

Section 3 repeats the misinformation that this could impact academic researchers. It repeats the false claim that ‘extensive safety evaluations’ must be made before training models. This is not true even for truly frontier, actively potentially deadly covered models, let alone academic models. The ‘reporting requirements’ could have a ‘chilling effect,’ because if an academic noticed their model was causing catastrophic risk, they really would prefer not to report that? What academia is this?

I could go on, but I won’t. The rest seems some combination of unnecessary to the central points, repetitive and false.

I do appreciate that there is a potential constitutionality issue here, no matter how absurd it might seem.

I also reiterate that if SB 1047 is unconstitutional, especially centrally so, then it is highly important that we discover this fact as soon as possible.

Jeremie & Edouard Harris of Gladstone AI go on The Joe Rogan Experience. It is hard for me to evaluate as I am not the target audience, and I am only an hour in so far, but this seemed like excellent communication of the basics of the existential risk case and situation. They boil a bunch of complicated questions into normie-compatible explanations.

In particular, the vibe seemed completely normal, as if the situation is what it is and we are facing it the same way we would face other compounding pending problems. I would have a few notes, but overall, I am very impressed.

If you had to point a low-shock-level normie towards one explanation of AI existential risk, this seems like our new go-to choice.

For context on Gladstone: These are the people who put out the Gladstone Report in March, featuring such section titles as ‘Executive Summary of Their Findings: Oh No.’ My takeaway was that they did a good job there investigating the top labs and making the case that there is a big problem, but they did not address the strongest arguments against regulatory action (I did give my counterarguments in the post).

Then they proposed extreme compute limits that I believe go too far. California’s SB 1047 proposes light-touch interventions at 10^26 flops, and never proposes any form of pre-approval, let alone a ban. Under the Gladstone proposal, you get light-touch interventions at 10^23 flops (!), preapprovals are required at 10^24 flops (!!), and there is an outright ban at 10^25 flops (!!!) that would include current 4-level models. There are various requirements imposed on labs.

A lot of the hysterical reactions to SB 1047 would have been highly appropriate, if the reaction had been talking about the Gladstone Report’s proposals as stated in the report, whereas it seemed many had no interest in noticing the differences.

There is also of course Helen Toner on what really went down at OpenAI and the future of regulation. I will cover that more extensively in a future post, either on the podcast or on general OpenAI developments.

Latest Eliezer attempt to explain why you should expect some highly capable agents, as they gain in capability, to have bimodal distributions of behavior, where at some point they flip to behaviors you do not want them to have, and which cause things to end badly for you (or at least well for them). It is in their interest to act as if they had friendly intent or lacked dangerous capability or both, until that time. This is not something mysterious, it is the same for humans and groups of humans, and there is no known solution under a sufficient capability gap.

This explanation was in part a response to Nora Belrose saying Nora Belrose things, that seem similar to things she has said before, in the context here of responding to a particular other argument.

As a general rule on existential risk questions: I’ve learned that ‘respond to X’s response to Y’s response to Z’ gets frustrating fast and doesn’t convince people who aren’t X, Y or Z, so only do that if X is making a universal point. Don’t do it if X is telling Y in particular why they are wrong.

Eliezer clarifies some things about what he believes and considers plausible, and what he doesn’t, in a conversation about potential scenarios, including some evolution metaphors later on. My model of such arguments is that every now and then a reader will ‘become enlightened’ about something important because it hits them right, but that there are no arguments that work on that large a percentage of people at once.

Yann LeCun denies the existence of GI, as in no general intelligence exists even in humans. Not no AGI, just no GI. It’s cleaner. This actually makes his positions about not getting to AGI make a lot more sense, and I appreciate the clarity.

Eric Schmidt argues that rather than let a variety of AI agents do a bunch of things we don’t understand while coordinating in language we don’t understand, we should ‘pull the plug.’ Murat points out the incoherences, that all you need here is ‘agents doing things we don’t understand.’ The rest is unnecessary metaphor. Alas, I find many people need a metaphor that makes such issues click for them, so with notably rare exceptions I do not think we should offer pedantic corrections.

A true statement, although the emphasis on the decisions rather than the decision process perhaps suggests the wrong decision theories. Robin and I make different decisions in response.

Robin Hanson: The uber question for any decision-maker is: how much do you want your decisions to promote continued existence of things that are like you?

The more you want this, the more your decisions must be the sort that promote your kinds in a universe where natural selection decides what kinds exist. At least if you live in such a universe.

Another true statement, and he’s right (medium spoilers for Game of Thrones.)

Dylan Matthews: I get the sense that Anthropic is currently trying to build that wight that Jon Snow and the gang capture and bring back to King’s Landing to prove that White Walkers are real.

The subsequent actions are a reasonable prediction of what would happen next, what many with power care about, the importance of a capabilities lead, the value of not giving up in the face of impossible odds, the dangers of various forms of misalignment, the need given our failure to step up in time to invent a deus ex machina for us all not to die, a dire warning about what happens when your source of creativity is used up and you use a fancy form of autocomplete, and more.

Tyler Cowen once again attempted on May 21, 2024 to incept that the ‘AI Safety’ movement is dead. The details included claiming that the AI safety movement peaked with the pause letter (not even the CAIS letter), gave what seemed like a very wrong reading of the Schumer report, came the same week as a humorously-in-context wide variety of AI safety related things saw progress, and had other strange claims as well, especially his model of how the best way to build AI safely is via not taking advance precautions and fixing issues ex-post.

Strangest of all is his continued insistence that the stock market being up is evidence against AI existential risk, or that those who think there is substantial AI existential risk should not be long the market and especially not long all these AI stocks we keep buying and that keep going up – I have tried to explain this many times, yet we are both deeply confused how the other can be so supremely confidently wrong about this question.

I wrote a post length response to make sense of it all, but have decided to shelve it.

Again, it is hard to do it when you do not try.

One way labs are not trying is they are not using external evaluators very much.

Another way is to say safety is a problem for future you: Here is a clip of Elon Musk saying first order of business at x.ai is a competitive model, comparable in power to others. Until then, no need to worry about safety. This in response to being asked to speak to x.ai’s safety team.

So…

  1. Little happens in a day, no matter what Elon Musk might demand. You need to start worrying about safety long before you actually have a potentially unsafe system.

  2. How do you build a culture of safety without caring about safety?

  3. How do you have a safety-compatible AI if you don’t select for that path?

  4. There are forms of safety other than existential, you need to worry even if you know there are stronger other models for purely mundane reasons.

  5. If this is your attitude, why are you going to be better than the competition?

Elon Musk understands that AI is dangerous and can kill everyone. His ideas about how to prevent that and what he has done with those ideas have consistently been the actual worst, in the ‘greatly contribute to the chance everyone dies’ sense.

I do appreciate the straight talk. If you are going to not care about safety until events force your hand, then admit that. Don’t be like certain other companies that pay lip service and make empty promises, then break those promises.

Then there is the not as straight talk, in the wake of their $6 billion Series B round.

Bloomberg says pre-money valuation was $18 billion as per Musk’s Twitter.

Igor Babuschkin: Apply at x.ai if you want to be part of our journey to build AGI and understand the Universe 🛰️

Elon Musk: Join xAI if you believe in our mission of understanding the universe, which requires maximally rigorous pursuit of the truth, without regard to popularity or political correctness.

Rowan Chang claims that x.AI is being valued at a third of OpenAI. If this remains true, then this means some combination of:

  1. Investors in x.AI being motivated by something other than fundamental value.

  2. Investors in x.AI buying into the hype way too much.

  3. Investors in OpenAI getting an absurdly great deal.

  4. Investors in OpenAI charging a huge discount for its structure, the AGI clause and the risks involved in trusting the people involved or the whole thing blowing up in various other ways.

  5. Investors have very low confidence in OpenAI’s ability to execute.

Certainly OpenAI’s valuation being so low requires an explanation. But the same has been true for Nvidia for a while, so hey. Also a16z is heavily named in the x.AI fundraiser, which both is a terrible sign for x.AI’s safety inclinations, and also tells me everyone involved overpaid.

Another note is that x.AI seems highly dependent on Twitter (and Musk) to justify its existence and valuation. So if it is raising at $18 billion, the Twitter price starts to look a lot less terrible.

Zach Stein-Perlman worries the Anthropic Long-Term Benefit Trust is powerless. A supermajority of shareholders can overrule the trust, and we cannot see the full terms of the agreement, including the size of that supermajority.

The buck has to stop somewhere. There are basically three scenarios.

  1. Perhaps the trust will control the company when it chooses to do so.

  2. Perhaps the shareholders will control the company when they choose to do so.

  3. Perhaps both will have a veto over key actions, such as training or deployment.

The original intention seems, from what we know, to be something like ‘the trust is like the President and can veto or make certain executive decisions, and the shareholders are like Congress and can if sufficiently united get their way.’

The hope then would be that shareholders are divided such that when the decision is reasonable the trust can find enough support, but if it goes nuts they can’t, and the threshold is chosen accordingly.

My worry is this is a narrow window. Shareholders mostly want to maximize profits and are typically willing to vote with leadership. A very large supermajority is likely not that hard to get in most situations. I have been assuming that Anthropic is mostly a ‘normal company’ on legal governance, and putting a lot more hope in management making good choices than in the trust forcing their hand.

Also potentially worrying is that Anthropic recently lost a clearly highly safety-focused board member, and the Long-Term Benefit Trust replaced him with what appears to be a far more product-focused board member. For various reasons I have not done a deep dive on Anthropic’s board, so I do not have the context to know how concerning this should or should not be.

Roon: Do you really think AI race dynamics are about money?

Not entirely. But yeah, I kind of do. I think that the need to make the money in order to continue the work, and the need to make money in order to hire the best people, force everyone to race ahead specifically in order to make money. I think that the need to make money drives releases. I think that the more you need money, the more you have to turn over influence and control to those who focus on money, including Altman but much more so companies like Google and Microsoft. It is also the habit and pattern of an entire financial and cultural ecosystem.

Of course it is also ego, pride, hubris, The Potential, fear of the other guy, desire to dictate the arrangement of atoms within the light cone and other neat stuff like that.

Sentences that are not so unjustified yet also reasons to worry.

Roon: I assume basically every statistic that suggests modernity is bad is a result of some kind of measurement error.

The context here is cellphones and teen depression. In general, modernity is good, we do not know how good we have it, and the statistics or other claims suggesting otherwise are bonkers.

That does not mean everything is better. To pick three: The decline in time spent with friends is obviously real. The rise in opioid deaths is real. And the fertility rate decline, in some ways the most important statistic of all, is very real.

You could say South Korea is doing great because it is rich. I say if women average less than one child the country is definitely not doing so great and I don’t care what your other statistics say, and if your answer is ‘so what everyone is so happy’ then I suggest watching some of their television because things do not seem all that happy.

Choose your fighter:

No, no, no, why not both, the AI assistant you should want, safety issues aside:

Quoting from the ACX open thread announcements:

The next ACX Grants round will probably take place sometime in 2025, and be limited to grants ≤ $100K. If you need something sooner or bigger, the Survival and Flourishing Fund is accepting grant applications, due June 17. They usually fund a few dozen projects per year at between $5K and $1MM, and are interested in “organizations working to improve humanity’s long-term prospects for survival and flourishing”, broadly defined. You can see a list of their recent awardees here.

(just in case you have the same question everyone else did – no, “Short Women In AI Safety” and “Pope Alignment Research” aren’t real charities; SFF unwisely started some entries with the name of the project lead, and these were led by people named Short and Pope.)

I do think it is typically a good use of time, if your project is relevant to their interests (which include AI safety) to apply to the Survival and Flourishing Fund. The cost is low and the upside is high.

Yann LeCun echoes his central claim that we won’t build AI that can fulfill objectives in more intelligent ways than humans unless it is safe and controllable. Yes, that claim is in the right section.

AI #66: Oh to Be Less Online Read More »

google-is-killing-off-the-messaging-service-inside-google-maps

Google is killing off the messaging service inside Google Maps

Going out of business —

Google Maps has had its own chat platform since 2018, but it’s shutting down in July.

  • Whether you want to call it “Google Business Messaging” or “Google Business Profile Chat,” the chat buttons in Google Maps and Search are going away. (Credit: Google)

  • This is the 2018 version of Google Maps Messaging, which is when it was first built into the Google Maps app. (Credit: Google)

  • Messages used to have a top-tier spot in the navigation panel. (Credit: Google)

  • In the current UI, Messages lives in the “Updates” tab. (Credit: Ron Amadeo)

  • You used to be able to reply to Google Maps Messages with Google Allo.

Google is killing off a messaging service! This one is the odd “Google Business Messaging” service—basically an instant messaging client that is built into Google Maps. If you looked up a participating business in Google Maps or Google Search on a phone, the main row of buttons in the place card would read something like “Call,” “Chat,” “Directions,” and “Website.” That “Chat” button is the service we’re talking about. It would launch a full messaging interface inside the Google Maps app, and businesses were expected to use it for customer service purposes. Google’s deeply dysfunctional messaging strategy might lead people to joke about a theoretical “Google Maps Messaging” service, but it already exists and has existed for years, and now it’s being shut down.

Search Engine Land’s Barry Schwartz was the first to spot the shutdown emails being sent out to participating businesses. Google has two different support articles up for a shutdown of both “Google Business Profile Chat” and “Google Business Messages,” which appear to just be the same thing with different names. On July 15, 2024, the ability to start a new chat will be disabled, and on July 31, 2024, both services will be shut down. Google is letting businesses download past chat conversations via Google Takeout.

Google’s Maps messaging service was Google Messaging Service No. 16 in our giant History of Google Messaging article. The feature has undergone many changes, so it’s a bit hard to follow. The Google Maps Messaging button launched in 2017, when it would have been called “Google My Business Chat.” This wasn’t quite its own service yet—the messaging button would either launch your SMS app or boot into another dead Google messaging product, Google Allo!

The original SMS option was the easy path for small businesses with a single store, but SMS is tied to a single physical phone. If you’re a bigger business and want to take on the task of doing customer service across multiple stores, at the scale of Google Maps, that’s going to be a multi-person job. The Google Allo back-end (which feels like it was the driving force behind creating this project in the first place) would let you triage messages to multiple people. Allo was one year into its 2.5-year lifespan when this feature launched, though, so things would have to change soon before Allo’s 2019 shutdown date.

Knowing that the announcement of Allo’s death was a month away, Google started making Maps into its own standalone messaging service in November 2018. Previously, it would always launch an outside app (either SMS or Allo), but with this 2018 update, Maps got its own instant messaging UI built right into the app. “Messages” became a top-level item in the navigation drawer (later this would move to “updates”), and a third-party app was no longer needed. On the business side of things, a new “Google My Business” app would be the new customer service interface for all these messages. Allo’s shutdown in 2019 disabled the ability to use SMS for small businesses, and everything needed to use this Google My Business app now. Maps was officially a new messaging service. Google also created the “Business Messages API,” so big businesses could plug Maps messaging into some kind of customer management app.

It does not sound like Google is going to replace business messaging with anything in the near future, so the Chat buttons in Google Maps and Search will be going away. In the endless pantheon of Google messaging solutions, the Google Developer page also mentions an “RCS Business Messaging” platform that will launch the Google Messages app. This service does not seem to be built into any existing Google products, though, and isn’t mentioned as an alternative in Google’s shutdown announcement. Google only suggests that businesses “redirect customers to your alternative communication channels,” but those links won’t be getting premium placement in Google’s products.

Business messaging is a pretty well-established market, and the Big Tech companies with competent messaging strategies are involved somehow. On iOS, there’s Apple’s iMessage-based Messages for Business, which also has a chat button layout in Apple Maps. Meta has both WhatsApp Business Messaging and Facebook Messenger’s Meta Business Messaging. There are also standalone businesses like Twilio.

Listing image by Google / Ron Amadeo

Google is killing off the messaging service inside Google Maps Read More »

google-accused-of-secretly-tracking-drivers-with-disabilities

Google accused of secretly tracking drivers with disabilities

Google accused of secretly tracking drivers with disabilities

Google needs to pump the brakes when it comes to tracking sensitive information shared with DMV sites, a new lawsuit suggests.

Filing a proposed class-action suit in California, Katherine Wilson has accused Google of using Google Analytics and DoubleClick trackers on the California DMV site to unlawfully obtain information about her personal disability without her consent.

This, Wilson argued, violated the Driver’s Privacy Protection Act (DPPA), as well as the California Invasion of Privacy Act (CIPA), and impacted perhaps millions of drivers who had no way of knowing Google was collecting sensitive information shared only for DMV purposes.

“Google uses the personal information it obtains from motor vehicle records to create profiles, categorize individuals, and derive information about them to sell its customers the ability to create targeted marketing and advertising,” Wilson alleged.

According to Wilson, California’s DMV “encourages” drivers “to use its website rather than visiting one of the DMV’s physical locations” without telling drivers that Google has trackers all over its site.

Likely due to promoting the website’s convenience, the DMV reported a record number of online transactions in 2020, Wilson’s complaint said. And people with disabilities have taken advantage of that convenience. In 2023, approximately “40 percent of the 1.6 million disability parking placard renewals occurred online.”

Wilson last visited the DMV site last summer when she was renewing her disability parking placard online. At that time, she did not know that Google obtained her personal information when she filled out her application, communicated directly with the DMV, searched on the site, or clicked on various URLs, all of which she said revealed that either she had a disability or believed she had a disability.

Her complaint alleged that Google secretly gathers information about the contents of the DMV’s online users’ searches, logging sensitive keywords like “teens,” “disabled drivers,” and any “inquiries regarding disabilities.”

Google “knowingly” obtained this information, Wilson alleged, to quietly expand user profiles for ad targeting, “intentionally” disregarding DMV website users’ “reasonable expectation of privacy.”

“Google then uses the personal information and data to generate revenue from the advertising and marketing services that Google sells to businesses and individuals,” Wilson’s complaint alleged. “That Plaintiff and Class Members would not have consented to Google obtaining their personal information or learning the contents of their communications with the DMV is not surprising.”

Congressman James P. Moran, who sponsored the DPPA in 1994, made it clear that the law was enacted specifically to keep marketers from taking advantage of computers making it easy to “pull up a person’s DMV record” with the “click of a button.”

Even back then, some people were instantly concerned about any potential “invasion of privacy,” Moran said, noting that “if you review the way in which people are classified by direct marketers based on DMV information, you can see why some individuals might object to their personal information being sold.”

Google accused of secretly tracking drivers with disabilities Read More »

ifixit-ends-samsung-deal-as-oppressive-repair-shop-requirements-come-to-light

iFixit ends Samsung deal as oppressive repair shop requirements come to light

Samsung has no follow-through? Shocking —

iFixit says “flashy press releases don’t mean much without follow-through.”

iFixit ends Samsung deal as oppressive repair shop requirements come to light

IFixit and Samsung were once leading the charge in device repair, but iFixit says it’s ending its repair partnership with Samsung because it feels Samsung just isn’t participating in good faith. iFixit says the two companies “have not been able to deliver” on the promise of a viable repair ecosystem, so it would rather shut the project down than continue. The repair site says “flashy press releases and ambitious initiatives don’t mean much without follow-through.”

iFixit’s Scott Head explains: “As we tried to build this ecosystem we consistently faced obstacles that made us doubt Samsung’s commitment to making repair more accessible. We couldn’t get parts to local repair shops at prices and quantities that made business sense. The part prices were so costly that many consumers opted to replace their devices rather than repair them. And the design of Samsung’s Galaxy devices remained frustratingly glued together, forcing us to sell batteries and screens in pre-glued bundles that increased the cost.”

  • Samsung’s screen replacement parts usually require buying the display, battery, phone frame, and buttons, which is a big waste. (Credit: iFixit)

A good example of Samsung’s parts bundling is this Galaxy S22 Ultra “screen” part for $233. The screen is the most common part to break, but rather than just sell a screen, Samsung makes you buy the screen, a new phone frame, a battery, and new side buttons and switches. As we said when this was announced, that’s like half of the total parts in an entire phone. This isn’t a perfect metric, but the Samsung/iFixit parts store only offers three parts for the S22 Ultra, while the Pixel 8 Pro store has 10 parts, and the iPhone 14 Pro Max store has 23 parts.

Even with Samsung’s part-bundling, though, iFixit’s complaint of high prices doesn’t seem reflected in the store pricing. The Pixel 8 Pro screen + fingerprint reader, without a case, battery, and buttons, is $230. An iPhone 14 Pro Max screen is $395. (There is a good chance Samsung is the manufacturer of all three of these displays.)

Samsung and iFixit have always had a rocky relationship. In 2017, the two companies were supposed to partner up for an “upcycling” program, where Samsung found new uses for old phones. The original plan included things like unlocking the bootloader of old devices, so Samsung’s OS could be completely replaced, and hosting an open source marketplace where users could submit ideas and software for repurposing old Galaxy devices. In what now seems like a familiar strategy, Samsung was more concerned about appearances than being actually useful, and iFixit said the upcycling program that launched in 2021 was “nearly unrecognizable” to what iFixit originally endorsed and lent its logo to in 2017.

In 2019, following the “embarrassing” delayed launch of the Galaxy Fold 1 due to durability reasons, Samsung attacked iFixit for doing a teardown of the flawed device. Samsung forced iFixit to take down an article explaining some of the flaws of the device. Samsung didn’t have any legal capability to do this, but it apparently threatened one of iFixit’s part suppliers if the article didn’t get pulled.

Samsung has also reportedly been on the attack against repair, even while it partners with iFixit. On the same day that iFixit announced it was dropping the partnership, 404 Media reported that Samsung requires independent repair shops to turn over customer data and “immediately disassemble” any device found to be using third-party parts. Imagine taking your phone to a shop for repair and finding out it was destroyed by the shop as a requirement from Samsung. The report also says Samsung’s contracts require that independent companies “daily” upload to a Samsung database (called G-SPN) the details of each and every repair “at the time of each repair.”

With the latest chapter of the partnership store dying after just two years, in June 2024, iFixit says some changes are coming to its website. It won’t remove any information, but it will start offering clearly labeled third-party parts in addition to whatever Samsung OEM parts it can source. It will no longer collaborate with Samsung for manuals and won’t need to follow Samsung’s quantity limit requirements.

iFixit ends Samsung deal as oppressive repair shop requirements come to light Read More »

dinosaurs-needed-to-be-cold-enough-that-being-warm-blooded-mattered

Dinosaurs needed to be cold enough that being warm-blooded mattered

Some like it less hot —

Two groups of dinosaurs moved to cooler climes during a period of climate change.

Later theropods had multiple adaptations to varied temperatures.

Dinosaurs were once assumed to have been ectothermic, or cold-blooded, an idea that makes sense given that they were reptiles. Scientists had previously discovered evidence of dinosaur species that were warm-blooded, though what could have triggered this adaptation remained unknown. A team of researchers now thinks that dinosaurs that already had some cold tolerance evolved endothermy, or warm-bloodedness, to adapt when they migrated to regions with cooler temperatures. They also think they’ve found a possible reason for the trek.

Using the Mesozoic fossil record, evolutionary trees, climate models, and geography, plus factoring in a drastic climate change event that caused global warming, the team found that theropods (predators and bird ancestors such as velociraptor and T. rex) and ornithischians (such as triceratops and stegosaurus) must have made their way to colder regions during the Early Jurassic. Lower temperatures are thought to have selected for species that were partly adapted to endothermy.

“The early invasion of cool niches… [suggests] an early attainment of homeothermic (possibly endothermic) physiology in [certain species], enabling them to colonize and persist in even extreme latitudes since the Early Jurassic,” the researchers said in a study recently published in Current Biology.

Hot real estate

During the Mesozoic Era, which lasted from roughly 252 to 66 million years ago, proto-dinosaurs known as dinosauromorphs began to diversify in hot and dry climates. Early sauropods, ornithischians, and theropods all tended to stay in these regions.

Sauropods (such as brontosaurus and diplodocus) would become the only one of these dinosaur groups to keep basking in the heat—the fossil record shows that sauropods tended to stay in warmer areas, even if there was less food there. This suggests a need for the sunlight and heat associated with ectothermy. They might have been capable of surviving in colder temperatures but not adapted well enough to last there long, according to one hypothesis.

It’s also possible that living in cooler areas would have meant too much competition with other types of dinosaurs, as the theropods and ornithischians did end up moving into these cooler areas.

Almost apocalypse

Beyond the ecological opportunities that may have drawn dinosaurs to the cooler territories, it’s possible they were driven away from the warm ones. Around 183 million years ago, there was a perturbation in the carbon cycle, along with extreme volcanism that belched out massive amounts of methane, sulfur dioxide, and mercury. Life on Earth suffered through scorching heat, acid rain, and wildfires. The researchers now think these disruptions, known as the Early Jurassic Jenkyns Event, pushed theropod and ornithischian dinosaurs to cooler climates because temperatures in warmer zones rose above the optimal range for their survival.

The theropods and ornithischians that escaped the effects of the Jenkyns event may have had a key adaptation to cooler climes; many dinosaurs from these groups are now thought to have been feathered. Feathers can be used to both trap and release heat, which would have allowed feathered dinosaurs to regulate their body temperature in more diverse climates. Modern birds use their feathers the same way.

Dinosaur species with feathers or other structures that improved heat management could have been homeothermic, meaning they were able to maintain a stable body temperature, or even endothermic, generating that heat through their own metabolism.

Beyond the dinosaurs that migrated to high latitudes and adapted to a drop in temperature, endothermy might have led to the rise of new species and lineages of dinosaurs. It could have contributed to the rise of Avialae, the clade that includes birds—the only actual dinosaurs still around—and traces all the way back to their earliest ancestors.

“[Our findings] provide novel insights into the origin of avian endothermy, suggesting that this evolutionary trajectory within theropods… likely started in the latest Early Jurassic,” the researchers said in the same study.

That really is something to think about next time a sparrow flies by.

Current Biology, 2024.  DOI: 10.1016/j.cub.2024.04.051

Dinosaurs needed to be cold enough that being warm-blooded mattered Read More »

crooks-plant-backdoor-in-software-used-by-courtrooms-around-the-world

Crooks plant backdoor in software used by courtrooms around the world

DISORDER IN THE COURT —

It’s unclear how the malicious version of JAVS Viewer came to be.

Crooks plant backdoor in software used by courtrooms around the world

JAVS

A software maker serving more than 10,000 courtrooms throughout the world hosted an application update containing a hidden backdoor that maintained persistent communication with a malicious website, researchers reported Thursday, in the latest episode of a supply-chain attack.

The software, known as the JAVS Viewer 8, is a component of the JAVS Suite 8, an application package courtrooms use to record, play back, and manage audio and video from proceedings. Its maker, Louisville, Kentucky-based Justice AV Solutions, says its products are used in more than 10,000 courtrooms throughout the US and 11 other countries. The company has been in business for 35 years.

JAVS Viewer users at high risk

Researchers from security firm Rapid7 reported that a version of the JAVS Viewer 8 available for download on javs.com contained a backdoor that gave an unknown threat actor persistent access to infected devices. The malicious download, planted inside an executable file that installs the JAVS Viewer version 8.3.7, was available no later than April 1, when a post on X (formerly Twitter) reported it. It’s unclear when the backdoored version was removed from the company’s download page. JAVS representatives didn’t immediately respond to questions sent by email.

“Users who have version 8.3.7 of the JAVS Viewer executable installed are at high risk and should take immediate action,” Rapid7 researchers Ipek Solak, Thomas Elkins, Evan McCann, Matthew Smith, Jake McMahon, Tyler McGraw, Ryan Emmons, Stephen Fewer, and John Fenninger wrote. “This version contains a backdoored installer that allows attackers to gain full control of affected systems.”

The installer file was titled JAVS Viewer Setup 8.3.7.250-1.exe. When executed, it copied the binary file fffmpeg.exe to the file path C:\Program Files (x86)\JAVS\Viewer 8\. To bypass security warnings, the installer was digitally signed, but with a signature issued to an entity called “Vanguard Tech Limited” rather than to “Justice AV Solutions Inc.,” the signing entity used to authenticate legitimate JAVS software.

fffmpeg.exe, in turn, used Windows Sockets and WinHTTP to establish communications with a command-and-control server. Once successfully connected, fffmpeg.exe sent the server passwords harvested from browsers and data about the compromised host, including hostname, operating system details, processor architecture, program working directory, and the user name.

The researchers said fffmpeg.exe also downloaded the file chrome_installer.exe from the IP address 45.120.177.178. chrome_installer.exe went on to execute a binary and several Python scripts responsible for stealing passwords saved in browsers. fffmpeg.exe is associated with a known malware family called GateDoor/Rustdoor. The executable had already been flagged by 30 endpoint protection engines on VirusTotal.

A screenshot from VirusTotal showing detections from 30 endpoint protection engines.

Rapid7

The number of detections had grown to 38 at the time this post went live.

The researchers warned that the process of disinfecting infected devices will require care. They wrote:

To remediate this issue, affected users should:

  • Reimage any endpoints where JAVS Viewer 8.3.7 was installed. Simply uninstalling the software is insufficient, as attackers may have implanted additional backdoors or malware. Re-imaging provides a clean slate.
  • Reset credentials for any accounts that were logged into affected endpoints. This includes local accounts on the endpoint itself as well as any remote accounts accessed during the period when JAVS Viewer 8.3.7 was installed. Attackers may have stolen credentials from compromised systems.
  • Reset credentials used in web browsers on affected endpoints. Browser sessions may have been hijacked to steal cookies, stored passwords, or other sensitive information.
  • Install the latest version of JAVS Viewer (8.3.8 or higher) after re-imaging affected systems. The new version does not contain the backdoor present in 8.3.7.

Completely re-imaging affected endpoints and resetting associated credentials is critical to ensure attackers have not persisted through backdoors or stolen credentials. All organizations running JAVS Viewer 8.3.7 should take these steps immediately to address the compromise.

The Rapid7 post included a statement from JAVS that confirmed that the installer for version 8.3.7 of the JAVS viewer was malicious.

“We pulled all versions of Viewer 8.3.7 from the JAVS website, reset all passwords, and conducted a full internal audit of all JAVS systems,” the statement read. “We confirmed all currently available files on the JAVS.com website are genuine and malware-free. We further verified that no JAVS Source code, certificates, systems, or other software releases were compromised in this incident.”

The statement didn’t explain how the installer became available for download on its site. It also didn’t say if the company retained an outside firm to investigate.

The incident is the latest example of a supply-chain attack, a technique that tampers with a legitimate service or piece of software with the aim of infecting all downstream users. These sorts of attacks are usually carried out by first hacking the provider of the service or software. There’s no sure way to prevent falling victim to supply-chain attacks, but one potentially useful measure is to vet a file using VirusTotal before executing it. That advice would have served JAVS users well.
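
For what that vetting might look like in practice, here is a minimal Python sketch: it hashes a downloaded installer and looks up any existing report for that hash through VirusTotal’s public v3 REST API. The endpoint and response fields match VirusTotal’s documented API as best I know it; the VT_API_KEY environment variable and the command-line usage are illustrative choices, not anything JAVS or Rapid7 prescribe.

```python
import hashlib
import os
import sys

import requests  # third-party: pip install requests

VT_FILE_REPORT = "https://www.virustotal.com/api/v3/files/{sha256}"


def sha256_of(path: str) -> str:
    """Hash the file in chunks so large installers don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def virustotal_stats(path: str, api_key: str) -> dict:
    """Fetch last-analysis stats for the file's hash (lookup only, no upload)."""
    response = requests.get(
        VT_FILE_REPORT.format(sha256=sha256_of(path)),
        headers={"x-apikey": api_key},
        timeout=30,
    )
    response.raise_for_status()  # raises on errors; a 404 means VirusTotal has never seen this hash
    return response.json()["data"]["attributes"]["last_analysis_stats"]


if __name__ == "__main__":
    installer = sys.argv[1]  # e.g. a downloaded setup executable
    # VT_API_KEY is an illustrative environment variable name for your own API key.
    stats = virustotal_stats(installer, os.environ["VT_API_KEY"])
    if stats.get("malicious", 0) > 0:
        print(f"Do not run {installer}: {stats['malicious']} engines flag it as malicious.")
    else:
        print(f"No engines currently flag {installer}; proceed with the usual caution.")
```

A zero hit count is not proof of safety, of course; it is one cheap check, not a substitute for the remediation steps above.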

Crooks plant backdoor in software used by courtrooms around the world Read More »

biggest-windows-11-update-in-2-years-nearly-finalized,-enters-release-preview

Biggest Windows 11 update in 2 years nearly finalized, enters Release Preview

getting there —

24H2 update includes big changes, will be released “later this calendar year.”

Biggest Windows 11 update in 2 years nearly finalized, enters Release Preview

Microsoft

The Windows 11 24H2 update isn’t scheduled to be released until sometime this fall, but testers can get a near-final version of it early. Microsoft has released Windows 11 24H2 build 26100.712 to its Release Preview testing channel for Windows Insiders, a sign that the update is nearly complete and that the company has shifted into bug-fixing mode ahead of general availability.

Microsoft has generally stuck to smaller but more frequent feature updates during the Windows 11 era, but the annual fall updates still tend to be a bigger deal. They’re the ones that determine whether you’re still eligible for security updates, and they often (but not always) come with more significant under-the-hood changes than the normal feature drops.

Case in point: Windows 11 24H2 includes an updated compiler, kernel, and scheduler, all lower-level system changes made at least in part to better support Arm-based PCs. Existing Windows-on-Arm systems should also see a 10 or 20 percent performance boost when using x86 applications, thanks to improvements in the translation layer (which Microsoft is now calling Prism).

There are more user-visible changes, too. 24H2 includes Sudo for Windows, the ability to create TAR and 7-zip archives from the File Explorer, Wi-Fi 7 support, a new “energy saver” mode, and better support for Bluetooth Low Energy Audio. It also allows users to run the Copilot AI chatbot in a regular resizable window that can be pinned to the taskbar instead of always giving it a dedicated strip of screen space.

Other new Windows features are tied to the 24H2 update but will only be available on Copilot+ PCs, which have their own specific system requirements: 16 GB of memory, 256 GB of storage, and a neural processing unit (NPU) capable of at least 40 trillion operations per second (TOPS). As of right now, the only chips that fit the bill are Qualcomm’s Snapdragon X Plus and X Elite processors, though Intel and AMD systems with faster NPUs should be released later this year. Microsoft will maintain a separate list of processors that support the Copilot+ features.

The biggest 24H2 feature specific to Copilot+ PCs is Recall, which continually takes snapshots of everything you do with your PC so that you can look up your own activities later. This comes with obvious privacy and security risks, though Microsoft says that all of Recall’s data is encrypted on disk and processed entirely locally by the NPU rather than leveraging the cloud. Other Copilot+ features include Live Captions for captioning video files or video calls in real time and features for generating new images and enhancing existing images.

Collectively, all of these changes make 24H2 the most significant Windows 11 release since the 22H2 update came out a year and a half ago. 22H2 has served as the foundation for most new Windows features since then, including the Copilot chatbot, and 23H2 was mostly just a version number change released to reset the clock on Microsoft’s security update timeline.

Despite all of these changes and additions, the 24H2 update is still called Windows 11, still looks like Windows 11, and doesn’t change Windows 11’s official minimum system requirements. Unsupported installs will stop working on a few generations’ worth of older 64-bit x86 CPUs, though these chips are old and slow enough that they wouldn’t run Windows 11 particularly well in the first place.

For people who want to start fresh, ISO files of the release are available from Microsoft’s download page here (this is a slightly older build of the OS, 26100.560, but it should update to the current version with no issues after installation). You can update a current Windows 11 install from the Insider section in the Settings app. Microsoft says to expect the full release “later this calendar year.” Based on past precedent, it’s most likely to come out in the fall, but it will probably ship a bit early on the first wave of Copilot+ Arm PCs that will be available in mid-June.

Biggest Windows 11 update in 2 years nearly finalized, enters Release Preview Read More »

the-schumer-report-on-ai-(rtfb)

The Schumer Report on AI (RTFB)

Or at least, Read the Report (RTFR).

There is no substitute. This is not strictly a bill, but it is important.

The introduction kicks off balancing upside and avoiding downside, utility and risk. This will be a common theme, with a very strong ‘why not both?’ vibe.

Early in the 118th Congress, we were brought together by a shared recognition of the profound changes artificial intelligence (AI) could bring to our world: AI’s capacity to revolutionize the realms of science, medicine, agriculture, and beyond; the exceptional benefits that a flourishing AI ecosystem could offer our economy and our productivity; and AI’s ability to radically alter human capacity and knowledge.

At the same time, we each recognized the potential risks AI could present, including altering our workforce in the short-term and long-term, raising questions about the application of existing laws in an AI-enabled world, changing the dynamics of our national security, and raising the threat of potential doomsday scenarios. This led to the formation of our Bipartisan Senate AI Working Group (“AI Working Group”).

They did their work over nine forums.

  1. Inaugural Forum

  2. Supporting U.S. Innovation in AI

  3. AI and the Workforce

  4. High Impact Uses of AI

  5. Elections and Democracy

  6. Privacy and Liability

  7. Transparency, Explainability, Intellectual Property, and Copyright

  8. Safeguarding Against AI Risks

  9. National Security

Existential risks were always given relatively little time, being a topic for at most a subset of the final two forums. By contrast, mundane downsides and upsides were each given three full forums. This report was about the response to AI across a broad spectrum.

They lead with a proposal to spend ‘at least’ $32 billion a year on ‘AI innovation.’

No, there is no plan on how to pay for that.

In this case I do not think one is needed. I would expect any reasonable implementation of that to pay for itself via economic growth. The downsides are tail risks and mundane harms, but I wouldn’t worry about the budget. If anything, AI’s arrival is a reason to be very not freaked out about the budget. Official projections are baking in almost no economic growth or productivity impacts.

They ask that this money be allocated via a method called emergency appropriations. This is part of our government’s longstanding way of using the word ‘emergency.’

We are going to have to get used to this when it comes to AI.

Events in AI are going to be happening well beyond the ‘non-emergency’ speed of our government and especially of Congress, both opportunities and risks.

We will have opportunities that appear and compound quickly, projects that need our support. We will have stupid laws and rules, both that were already stupid or are rendered stupid, that need to be fixed.

Risks and threats, not only catastrophic or existential risks but also mundane risks and enemy actions, will arise far faster than our process can pass laws, draft regulatory rules with extended comment periods and follow all of our procedures.

In this case? It is May. The fiscal year starts in October. I want to say, hold your damn horses. But also, you think Congress is passing a budget this year? We will be lucky to get a continuing resolution. Permanent emergency. Sigh.

What matters more is, what do they propose to do with all this money?

A lot of things. And it does not say how much money is going where. If I was going to ask for a long list of things that adds up to $32 billion, I would say which things were costing how much money. But hey. Instead, it looks like he took the number from NSCAI, and then created a laundry list of things he wanted, without bothering to create a budget of any kind?

It also seems like they took the original recommendation of $8 billion in Fiscal Year 24, $16 billion in FY 25, and $32 billion in FY 26, and turned it into $32 billion in emergency funding now? See the appendix. Then again, by that pattern, we’d be spending a trillion in FY 31. I can’t say for sure that we shouldn’t.

Starting with the top priority:

  1. An all government ‘AI-ready’ initiative.

  2. ‘Responsible innovation’ R&D work in fundamental and applied sciences.

  3. R&D work in ‘Foundational trustworthy AI topics, such as transparency, explainability, privacy, interoperability, and security.’

Or:

  1. Government AI adoption for mundane utility.

  2. AI for helping scientific research.

  3. AI safety in the general sense, both mundane and existential.

Great. Love it. What’s next?

  1. Funding the CHIPS and Science Act accounts not yet fully funded.

My current understanding is this is allocation of existing CHIPS act money. Okie dokie.

  1. Funding ‘as needed’ (oh no) for semiconductor R&D for the design and manufacture of high-end AI chips, through co-design of AI software and hardware, and developing new techniques for semiconductor fabrication that can be implemented domestically.

More additional CHIPS act funding, perhaps unlimited? Pork for Intel? I don’t think the government is going to be doing any of this research itself; if it is, then that’s ‘money gone.’

  1. Pass the Create AI Act (S. 2714) and expand programs like NAIRR to ‘ensure all 50 states are able to participate in the research ecosystem.’

More pork, then? I skimmed the bill. Very light on details. Basically, we should spend some money on some resources to help with AI research and it should include all the good vibes words we can come up with. I know what ‘all 50 states’ means. Okie dokie.

  1. Funding for a series of ‘AI Grand Challenge’ programs, such as those described in Section 202 of the Future of AI Innovation Act (S. 4178) and the AI Grand Challenges Act (S. 4236), focused on transformational progress.

Congress’s website does not list text for S. 4236. S. 4178 seems to mean ‘grand challenge’ in the senses of prizes and other pay-for-results (generally great), and having ambitious goals (also generally great), which tend to not be how the system works these days.

So, fund ambitious research, and use good techniques.

  1. Funding for AI efforts at NIST, including AI testing and evaluation infrastructure and the U.S. AI Safety Institute, and funding for NIST’s construction account to address years of backlog in maintaining NIST’s physical infrastructure.

Not all of NIST’s AI effort is safety, but a large portion of our real government safety efforts are at NIST. They are severely underfunded by all accounts right now. Great.

  1. Funding for the Bureau of Industry and Security (BIS) to update its IT and data analytics software and staff up.

That does sound like something we should do, if it isn’t handled. Ensure BIS can enforce the rules it is tasked with enforcing, and choose those rules accordingly.

  1. Funding R&D at the intersection of AI and robotics to ‘advance national security, workplace safety, industrial efficiency, economic productivity and competitiveness, through a coordinated interagency initiative.’

AI robots. The government is going to fund AI robots. With the first goal being ‘to advance national security.’ Sure, why not, I have never seen any movies.

In all seriousness, this is not where the dangers lie, and robots are useful. It’s fine.

The interagency plan seems unwise to me but I’m no expert on that.

  1. R&D for AI to discover manufacturing techniques.

Once again, sure, good idea if you can improve this for real and this isn’t wasted or pork. Better general manufacturing is good. My guess is that this is not a job for the government and this is wasted, but shrug.

  1. Security grants for AI readiness to help secure American elections.

Given the downside risks I presume this money is well spent.

  1. Modernize the federal government and improve delivery of government services, through updating IT and using AI.

  2. Deploying new technologies to find inefficiencies in the U.S. code, federal rules and procurement devices.

Yes, please. Even horribly inefficient versions of these things are money well spent.

  1. R&D and interagency coordination around the intersection of AI and critical infrastructure, including for smart cities and intelligent transportation system technologies.

Yes, we are on pace to rapidly put AIs in charge of our ‘critical infrastructure’ along with everything else, why do you ask? Asking people nicely not to let AI anywhere near the things is not an option and wouldn’t protect substantially against existential risks (although it might versus catastrophic ones). If we are going to do it, we should try to do it right, get the benefits and minimize the risks and costs.

Overall I’d say we have three categories.

  1. Many of these points are slam dunk obviously good. There is a lot of focus on enabling more mundane utility, and especially mundane utility of government agencies and government services. These are very good places to be investing.

  2. A few places where it seems like it is not the government’s job to stick its nose in, where I do not expect the money to accomplish much, and which often involve some obvious nervousness around the proposals, but none of which actually amplify the real problems. Mostly I expect wasted money. The market already provides plenty of better incentives for basic research in most things AI.

  3. Semiconductors.

It is entirely plausible for this to be a plan to take most of the $32 billion (there’s a second section below that also gets funding) and put most of that into semiconductors. They can easily absorb that kind of cash. If you do it right, you could even get your money’s worth.

As usual, I am torn on chips spending. Hardware progress accelerates core AI capabilities, but there is a national security issue with the capacity relying so heavily on Taiwan, and our lead over China here is valuable. That risk is very real.

Either way, I do know that we are not going to talk our government into not wanting to promote domestic chip production. I am not going to pretend that there is a strong case in opposition to that, nor is this preference new.

On AI Safety, this funds NIST, and one of its top three priorities is a broad-based call for various forms of (both existential and mundane) AI safety, and this builds badly needed state capacity in various places.

As far as government spending proposals go, this seems rather good, then, so far.

The defense and national security funding requests get their own section with twelve bullet points.

  1. NNSA testbeds and model evaluation tools.

  2. Assessment of CBRN AI-enhanced threats.

  3. AI advancements in chemical and biological synthesis, including safeguards to reduce the risk from synthetic materials and pathogens.

  4. Fund DARPA’s AI work, which seems to be a mix of military applications and attempts to address safety issues including interpretability, along with something called ‘AI Forward’ for more fundamental research.

  5. Secure and trustworthy algorithms for DOD.

  6. Combined Joint All-Domain Command and Control Center for DOD.

  7. AI tools to improve weapon platforms.

  8. Ways to turn DOD sensor data into AI-compatible formats.

  9. Building DOD’s AI capabilities including ‘supercomputing.’ I don’t see any sign this is aiming for foundation models.

  10. Utilize AUKUS Pillar 2 to work with allies on AI defense capabilities.

  11. Use AI to improve implementation of Federal Acquisition Regulations.

  12. Optimize logistics, improve workflows, apply predictive maintenance.

I notice in #11 that they want to improve implementation, but not work to improve the regulations themselves, in contrast to the broader ‘improve our procedures’ program above. A sign of who cares about what, perhaps.

Again, we can draw broad categories.

  1. AI to make our military stronger.

  2. AI (mundane up through catastrophic, mostly not existential) safety.

The safety includes CBRN threat analysis, testbed and evaluation tools and a lot of DARPA’s work. There’s plausibly some real stuff here, although you can’t tell magnitude.

This isn’t looking ahead to AGI or beyond. The main thing here is ‘the military wants to incorporate AI for its mundane utility,’ and that includes guarding us against outside threats and ensuring its implementations are reliable and secure. It all goes hand in hand.

Would I prefer a world where all the militaries kept their hands off AI? I think most of us would like that, no matter our other views. But we also accept that we live in a very different world that is not currently capable of that. And I understand that, while it feels scary for obvious reasons and does introduce new risks, this mostly does not change the central outcomes. It does impact the interplay among people and nations in the meantime, which could alter outcomes if it impacts the balance of power, or causes a war, or sufficiently freaks enough people out.

Mostly it seems like a clear demonstration of the pattern of ‘if you were thinking we wouldn’t do or allow that, think again, we will instantly do that unless prevented’ to perhaps build up some momentum towards preventing things we do not want.

Most items in the next section are about supporting small business.

  1. Developing legislation to leverage public-private partnerships both to build capabilities and to mitigate risks.

  2. Further federal study of AI, including through FFRDCs.

  3. Supporting startups, including at state and local levels, including by disseminating best practices (to the states and localities, I think, not to the startups?)

  4. The Comptroller General identifying any statutes that impact innovation and competition in AI systems. Have they tried asking Gemini?

  5. Increasing access to testing tools like mock data sets, including via DOC.

  6. Doing outreach to small businesses to ensure tools meet their needs.

  7. Finding ways to support small businesses that utilize AI and innovate, and considering whether legislation is needed to ‘disseminate best practices’ in various states and localities.

  8. Ensuring business software and cloud computing are allowable expenses under the SBA’s 7(a) loan program.

Congress has a longstanding tradition that Small Business is Good, and that Geographic Diversity That Includes My State or District is Good.

Being from the government, they are here to help.

A lot of this seems like ways to throw money at small businesses in inefficient ways? And to try and ‘make geographic diversity happen’ when we all know it is not going to happen? I am not saying you have to move to the Bay if that is not your thing, I don’t hate you that much, but at least consider, let’s say, Miami or Austin.

In general, none of this seems like a good idea. Not because it increases existential risk. Because it wastes our money. It won’t work.

The good proposal here is the fourth one. Look for statutes that are needlessly harming competition and innovation.

Padme: And then remove them?

(The eighth point also seems net positive, if we are already going down the related roads.)

The traditional government way is to say they support small business and spend taxpayer money by giving it to small business, and then you create a regulatory state and set of requirements that wastes more money and gives big business a big edge anyway. Whenever possible, I would much rather remove the barriers than spend the money.

Not all rules are unnecessary. There are some real costs and risks, mundane, catastrophic and existential, to mitigate.

Nor are all of the advantages of being big dependent on rules and compliance and regulatory capture, especially in AI. AI almost defines economies of scale.

Many would say, wait, are not those worried about AI safety typically against innovation and competition and small business?

And I say nay, not in most situations in AI, same as almost all situations outside AI. Most of the time all of that is great. Promoting such things in general is great, and is best done by removing barriers.

The key question is, can you do that in a way that works, and can you do it while recognizing the very high leverage places that break the pattern?

In particular, when the innovation in question is highly capable future frontier models that pose potential catastrophic or existential risks, especially AGI or ASI, and especially when multiple labs are racing against each other to get there first.

In those situations, we need to put an emphasis on ensuring safety, and we need to at minimum allow communication and coordination between those labs without risk of the government interfering in the name of antitrust.

In most other situations, including most of the situations this proposal seeks to assist with, the priorities here are excellent. The question is execution.

Do you want to help small business take on big business?

Do you want to encourage startups and innovation and American dynamism?

Then there are two obvious efficient ways to do that. Both involve the tax code.

The first is the generic universal answer.

If you want to favor small business over big business, you can mostly skip all those ‘loans’ and grants and applications and paperwork and worrying what is an expense under 7(a). And you can stop worrying about providing them with tools, and you can stop trying to force them to have geographic diversity that doesn’t make economic sense – get your geographic diversity, if you want it, from other industries.

Instead, make the tax code explicitly favor small business over big business via differentiating rates, including giving tax advantages to venture capital investments in early stage startups, which then get passed on to the business.

If you want to really help, give a tax break to the employees, so it applies even before the business turns a profit.

If you want to see more of something, tax it less. If you want less, tax it more. Simple.

The second is fixing a deeply stupid mistake that everyone, and I do mean everyone, realizes is a mistake that was made in the Trump tax cuts, but that due to Congress being Congress we have not yet fixed, and that is, by all reports, doing quite a lot of damage. It is Section 174 of the Internal Revenue Code, which requires that software engineering salaries and other expenses related to research and experimental (R&E) activities be amortized over time rather than fully deducted in the year they are incurred.

The practical result of this is that startups and small businesses, that have negative cash flow, look to the IRS as if they are profitable, and then owe taxes. This is deeply, deeply destructive and stupid in one of the most high leverage places.

From what I have heard, the story is that the two parties spent a long time negotiating a fix for it, it passed the house overwhelmingly, then in the Senate the Republicans decided they did not like the deal package of other items included with the fix, and wanted concessions, and the Democrats, in particular Schumer, said a deal is a deal.

This needs to get done. I would focus far more on that than all these dinky little subsidies.

As usual, Congress takes ‘the effect on jobs’ seriously. Workers must not be ‘left behind.’ And as usual, they are big on preparing.

So, what are you going to do about it, punk? They propose to encourage some things:

  1. ‘Efforts to ensure’ that workers and other stakeholders are ‘consulted’ as AI is developed and deployed by end users. A government favorite.

  2. Stakeholder voices get considered in the development and deployment of AI systems procured or used by federal agencies. In other words, use AI, but not if it would take our jobs.

  3. Legislation related to training, retraining (drink!) and upskilling the private sector workforce, perhaps with business incentives, or to encourage college courses. I am going to go out on a limb and say that this pretty much never, ever works.

  4. Explore the implications of, and possible ‘solutions to,’ the impact of AI on the long-term future of work as general-purpose AI systems displace human workers, and develop a framework for policy response. So far, I’ve heard UBI, and various versions, disguised to varying degrees, of hiring people to dig holes and fill them up again, except you get private companies to pay for it.

  5. Consider legislation to improve U.S. immigration systems for high-skilled STEM workers in support of national security and to foster advances in AI across the whole country.

My understanding is that ideas like the first two are most often useless but also most often mostly harmless. Steps are taken to nominally ‘consult,’ and most of the time nothing changes.

Sometimes, they are anything but harmless. You get NEPA. The similar provisions in NEPA were given little thought when first passed, then they grew and morphed into monsters strangling the economy and boiling the planet, and no one has been able to stop them. 

If this applies only to federal agencies and you get the NEPA version, that is in a sense the worst possible scenario. The government’s ability to use AI gets crippled, leaving it behind. Whereas it would provide no meaningful check on frontier model development, or on other potentially risky or harmful private actions. 

Applying it across the board could at the limit actually cripple American AI, in a way that would not serve as a basis for stopping international efforts, so that seems quite bad. 

We should absolutely expand and improve high-skill immigration, across all industries. It is rather completely insane that we are not doing so. There should at minimum be unlimited H-1Bs. Yes, it helps ‘national security’ and AI, but it also helps everything and everyone and the whole economy, and we’re just being grade-A stupid not to do it.

They call this ‘high impact uses of AI.’

The report starts off saying existing law must apply to AI. That includes being able to verify that compliance. They note that this might not be compatible with opaque AI systems.

Their response if that happens? Tough. Rules are rules. Sucks to be you.

Indeed, they say to look not for ways to accommodate black box AI systems, but instead look for loopholes where existing law does not cover AI sufficiently.

Not only do they not want to ‘fix’ existing rules that impose such requirements, they want to ensure any possible loopholes are closed regarding the information existing law requires. The emphasis is on anti-discrimination laws, which are not something that correlation machines you can run tests on are going to be in the default habit of not violating.

So what actions are suggested here?

  1. Explore where we might need explainability requirements.

  2. Develop standards for AI in critical infrastructure.

  3. Better monitor energy use.

  4. Keep a closer eye on financial services providers.

  5. Keep a closer eye on the housing sector.

  6. Test and evaluate all systems before the government buys them, and also streamline the procurement process (yes these are one bullet point).

  7. Recognize the concerns of local news (drink!) and journalism that have resulted in fewer local news options in small towns and rural areas. Damn you, AI!

  8. Develop laws against AI-generated child sexual abuse material (CSAM) and deepfakes. There is a bullet here, are they going to bite it?

  9. Think of the children, consider laws to protect them, require ‘reasonable steps.’

If you are at a smaller company working on AI, and you are worried about SB 1047 or another law that specifically targets frontier models and the risk of catastrophic harm, and you are not worried about being required to ‘take reasonable steps’ to ‘protect children,’ then I believe you are very much worried about the wrong things.

You can say and believe ‘the catastrophic risk worries are science fiction and not real, whereas children actually exist and get harmed’ all you like. This is not where I try to argue you out of that position.

That does not change which proposed rules are far more likely to actually make your life a living hell and bury your company, or hand the edge to Big Tech.

Hint: It is the one that would actually apply to you and the product you are offering.

  1. Encourage public-private partnerships and other mechanisms to develop fraud detection services.

  2. Continue work on autonomous vehicle testing frameworks. We must beat the CCP (drink!) in the race to shape the vision of self-driving cars.

  3. Ban use of AI for social scoring to protect our freedom unlike the CCP (drink!)

  4. “Review whether other potential uses for AI should be either extremely limited or banned.”

Did you feel that chill up your spine? I sure did. The ‘ban use cases’ approach is big trouble without solving your real problems.

Then there’s the health care notes.

  1. Both support deployment of AI in health care and implement appropriate guardrails, including consumer protection, fraud and abuse prevention, and promoting accurate and representative data, ‘as patients must be front and center in any legislative efforts on healthcare and AI.’ My heart is sinking.

  2. Make research data available while preserving privacy.

  3. Ensure HHS and FDA ‘have the proper tools to weigh the benefits and risks of AI-enabled products so that it can provide a predictable regulatory structure for product developers.’ The surface reading would be: So, not so much with the products, then. I have been informed that it is instead likely they are using coded language for the FDA’s pre-certification program to allow companies to self-certify software updates. And yes, if your laws require that, then you should do that, but it would be nice to say it in English.

  4. Transparency for data providers and for the training data used in medical AIs.

  5. Promote innovation that improves health outcomes and efficiencies. Examine reimbursement mechanisms and guardrails for Medicare and Medicaid, and broad application.

The refrain is ‘give me the good thing, but don’t give me the downside.’

I mean, okay, sure, I don’t disagree exactly? And yet.

The proposal to use AI to improve ‘efficiency’ of Medicare and Medicaid sounds like the kind of thing that would be a great idea if done reasonably and yet quite predictably costs you the election. In theory, if we could all agree that we could use the AI to figure out which half of medicine wasn’t worthwhile and cut it, or how to actually design a reimbursement system with good incentives and do that, that would be great. But I have no idea how you could do that.

For elections they encourage deployers and content providers to implement robust protections, and ‘to mitigate AI-generated content that is objectively false, while still preserving First Amendment rights.’ Okie dokie.

For privacy and liability, they kick the can, ask others to consider what to do. They do want you to know privacy and strong privacy laws are good, and AIs sharing non-public personal information is bad. Also they take a bold stand that developers or users who cause harm should be held accountable, without any position on what counts as causing harm.

The word ‘encouraging’ is somehow sounding more ominous each time I see it.

What are we encouraging now?

  1. A coherent approach to public-facing transparency requirements for AI systems, while allowing use case specific requirements where necessary and beneficial, ‘including best practices for when AI developers should disclose when their products are AI,’ but while making sure the rules do not inhibit innovation.

I am not sure how much more of this kind of language of infinite qualifiers and why-not-both framings I can take. For those taking my word for it, it is much worse in the original.

One of the few regulatory rules pretty much everyone agrees on, even if some corner cases involving AI agents are tricky, is ‘AI should have to clearly identify when you are talking to an AI.’

My instinctive suggestion for operationalizing the rule would be ‘if an AI sends a freeform message (e.g. not a selection from a fixed list of options, in any modality) that was not approved individually by a human (even if sent to multiple targets), in a way a reasonable person might think was generated by or individually approved by a human, it must be identified as AI-generated or auto-generated.’ Then iterate from there.
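
As a concrete, and entirely hypothetical, illustration of how a product team might operationalize that rule, here is a minimal Python sketch. The Message fields, the disclosure string, and the function name are invented for the example and are not drawn from the report or any existing law.

```python
from dataclasses import dataclass

# Hypothetical disclosure text; real wording would be a product/legal decision.
AI_DISCLOSURE = "[This message was generated automatically by an AI system.]"


@dataclass
class Message:
    body: str
    ai_generated: bool         # produced by a model rather than typed by a person
    human_approved: bool       # a human reviewed and approved this exact message
    from_fixed_template: bool  # selected from a fixed list of canned options


def apply_disclosure_rule(msg: Message) -> str:
    """Label freeform AI output that no human individually approved and that a
    reasonable person might mistake for human-written text."""
    needs_label = (
        msg.ai_generated
        and not msg.human_approved
        and not msg.from_fixed_template
    )
    return f"{msg.body}\n\n{AI_DISCLOSURE}" if needs_label else msg.body


# An unreviewed chatbot reply gets labeled; a canned template reply does not.
print(apply_disclosure_rule(Message("Your order ships Tuesday.", True, False, False)))
print(apply_disclosure_rule(Message("Thanks! A human will follow up shortly.", True, False, True)))
```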

As the report goes on, it feels like there was a vibe of ‘all right, we need to get this done, let’s put enough qualifiers on every sentence that no one objects and we can be done with this.’

How bad can it get? Here’s a full quote for the next one.

  1. “Evaluate whether there is a need for best practices for the level of automation that is appropriate for a given type of task, considering the need to have a human in the loop at certain stages for some high impact tasks.”

I am going to go out on a limb and say yes. There is a need for best practices for the level of automation that is appropriate for a given type of task, considering the need to have a human in the loop at certain stages for some high impact tasks.

For example, if you want to launch nuclear weapons, that is a high impact task, and I believe we should have some best practices for when humans are in the loop.

Seriously, can we just say things that we are encouraging people to consider? Please?

They also would like to encourage the relevant committees to:

  1. Consider telling federal employees about AI in the workplace.

  2. Consider transparency requirements and copyright issues about data sets.

  3. Review reports from the executive branch.

  4. Getting hardware to watermark generated media, and getting online platforms to display that information.

And just because such sentences need to be properly, shall we say, appreciated:

  1. “Consider whether there is a need for legislation that protects against the unauthorized use of one’s name, image, likeness, and voice, consistent with First Amendment principles, as it relates to AI. Legislation in this area should consider the impacts of novel synthetic content on professional content creators of digital media, victims of non-consensual distribution of intimate images, victims of fraud, and other individuals or entities that are negatively affected by the widespread availability of synthetic content.”

As opposed to, say, ‘Consider a law to protect people’s personality rights against AI.’

Which may or may not be necessary, depending on the state of current law. I haven’t investigated enough to know if what we have is sufficient here.

  1. Ensure we continue to ‘lead the world’ on copyright and intellectual property law.

I have some news about where we have been leading the world on these matters.

  1. Do a public awareness and educational campaign on AI’s upsides and downsides.

You don’t have to do this. It won’t do any good. But knock yourself out, I guess.

Now to what I view as the highest stakes question. What about existential risks?

That is also mixed in with catastrophic mundane risks.

If I had to summarize this section, I would say that they avoid making mistakes and are headed in the right direction, and they ask good questions.

But on the answers? They punt.

The section is short and dense, so here is their full introduction.

In light of the insights provided by experts at the forums on a variety of risks that different AI systems may present, the AI Working Group encourages companies to perform detailed testing and evaluation to understand the landscape of potential harms and not to release AI systems that cannot meet industry standards.

This is some sort of voluntary testing and prior restraint regime? You are ‘encouraged’ to perform ‘detailed testing and evaluation to understand the landscape of potential harms,’ and you must then ‘meet industry standards.’ If you can’t, don’t release.

Whether or not that is a good regime depends on:

  1. Would companies actually comply?

  2. Would industry adopt standards that mean we wouldn’t die?

  3. Do we have to worry about problems that arise prior to release?

I doubt the Senators’ minds are ready for that third question.

Multiple potential risk regimes were proposed – from focusing on technical specifications such as the amount of computation or number of model parameters to classification by use case – and the AI Working Group encourages the relevant committees to consider a resilient risk regime that focuses on the capabilities of AI systems, protects proprietary information, and allows for continued AI innovation in the U.S.

Very good news. Capabilities have been selected over use case. The big easy mistake is to classify models based on what people say they plan to do, rather than asking what the model is capable of doing. That is a doomed approach, but many lobby hard for it.

The risk regime should tie governance efforts to the latest available research on AI capabilities and allow for regular updates in response to changes in the AI landscape.

Yes. As we learn more, our policies should adjust, and we should plan for that. Ideally this would be an easy thing to agree upon. Yet the same people who say ‘it is too early to choose what to do’ will also loudly proclaim that ‘if you give any flexibility to choose what to do later to anyone but the legislature, one must assume it will be used maximally badly.’ I too wish we had a much faster, better legislature that we could turn to every time we need any kind of decision or adjustment. We don’t.

All right. So no explicit mention of existential risk in the principles, but some good signs of the right regime. What are the actual suggestions?

Again, I am going to copy it all, one must parse carefully.

  1. Support efforts related to the development of a capabilities-focused risk-based approach, particularly the development and standardization of risk testing and evaluation methodologies and mechanisms, including red-teaming, sandboxes and testbeds, commercial AI auditing standards, bug bounty programs, as well as physical and cyber security standards. The AI Working Group encourages committees to consider ways to support these types of efforts, including through the federal procurement system.

There are those who would disagree with this, who think the proper order is train, release then test. I do not understand why they would think that. No wise company would do that, for its own selfish reasons.

The questions should be things like:

  1. How rigorous should the testing requirements be?

  2. At what stages of training and post-training, prior to deployment?

  3. How should those change based on the capabilities of the system?

  4. How do we pick the details?

  5. What should you have to do if the system flunks the test?

For now, this is a very light statement.

  2. Investigate the policy implications of different product release choices for AI systems, particularly to understand the differences between closed versus fully open-source models (including the full spectrum of product release choices between those two ends of the spectrum).

Again, there are those that would disagree with this, who think the proper order is train, release then investigate the consequences. They think they already know all the answers, or that the answers do not matter. Once again, I do not understand why they would have good reason to think that.

Whatever position you take, the right thing to do is to game it out. Ask what the consequences of each regime would be. Ask what the final policy regime and world state would likely be in each case. Ask what the implications are for national security. Get all the information, then make the choice.

The only alternative that makes sense, which is more of a complementary approach than a substitute, is to define what you want to require. Remember what was said about black box systems. Yes, your AI system ‘wants to be’ a black box. You don’t know how to make it not a black box. If the law says you have to be able to look inside the box, or you can’t use the box? Well, that’s more of a you problem. No box.

You can howl about Think of the Potential of the box, why are you shutting down the box over some stupid thing like algorithmic discrimination or bioweapon risk or whatever. You still are not getting your box.

Then, if you can open the weights and still ensure the requirements are met, great, that’s fine, go for it. If not, not.

Then we get serious.

  3. Develop an analytical framework that specifies what circumstances would warrant a requirement of pre-deployment evaluation of AI models.

This does not specify whether this is requiring a self-evaluation by the developer as required in SB 1047, or requiring a third-party evaluation like METR, or an evaluation by the government. Presumably part of finding the right framework would be figuring out when to apply which requirement, along with which tests would be needed.

I am not going to make a case here for where I think the thresholds should be, beyond saying that SB 1047 seems like a good upper bound for the threshold necessary for self-evaluations, although one could quibble with the details of the default future path. Anything strictly higher than that seems clearly wrong to me.

  4. Explore whether there is a need for an AI-focused Information Sharing and Analysis Center (ISAC) to serve as an interface between commercial AI entities and the federal government to support monitoring of AI risks.

That is not how I would have thought to structure such things, but also I do not have deep thoughts about how to best structure such things. Nor do I see under which agency they would propose to put this center. Certainly there will need to be some interface where companies inform the federal government of issues in AI, as users and as developers, and for the federal government to make information requests.

  5. Consider a capabilities-based AI risk regime that takes into consideration short-, medium-, and long-term risks, with the recognition that model capabilities and testing and evaluation capabilities will change and grow over time. As our understanding of AI risks further develops, we may discover better risk-management regimes or mechanisms.

Where testing and evaluation are insufficient to directly measure capabilities, the AI Working Group encourages the relevant committees to explore proxy metrics that may be used in the interim.

There is some very welcome good thinking in here. Yes, we will need to adjust our regime over time. Also, that does not mean that until we reach our ‘final form’ the correct regime is no regime at all. You go with the best proxy measure you have, then when you can do better you switch to a better one, and you need to consider all time frames, although naming them all is a punt from the hard work of prioritization.
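
To make the ‘best available proxy’ idea concrete, here is a toy sketch of a tiering rule that uses training compute as an interim proxy and treats either that proxy or a direct capability measurement crossing its line as sufficient to bring a model into scope. The compute figure echoes the thresholds floated in proposals like SB 1047, but the specific numbers, tier names, and score scale are invented for illustration.

```python
# Toy sketch only: the threshold values, tier names, and the 0-to-1 capability
# score are illustrative inventions, not figures from the report.
from typing import Optional

TRAINING_COMPUTE_THRESHOLD_FLOP = 1e26   # interim proxy: total training compute
CAPABILITY_SCORE_THRESHOLD = 0.5         # hypothetical score from a direct evaluation


def risk_tier(training_flop: float, measured_capability: Optional[float]) -> str:
    """Use training compute as the interim proxy; once a direct capability
    measurement exists, treat either signal crossing its line as sufficient."""
    over_compute = training_flop >= TRAINING_COMPUTE_THRESHOLD_FLOP
    over_capability = (
        measured_capability is not None
        and measured_capability >= CAPABILITY_SCORE_THRESHOLD
    )
    return "covered" if (over_compute or over_capability) else "below-threshold"


print(risk_tier(3e26, None))  # "covered" via the compute proxy alone
print(risk_tier(1e24, 0.7))   # "covered" via the direct measurement
print(risk_tier(1e24, 0.1))   # "below-threshold"
```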

The question is, can you use testing and evaluation to directly measure capabilities sufficiently accurately? For which purposes and scenarios does this work or fail?

There are two ways testing and evaluation can fail, false positives and false negatives.

False positives are where you game the benchmarks, intentionally or otherwise. In general, I presume that the major labs (OpenAI, Anthropic and DeepMind for sure, and mostly Meta as well) will be good at not doing this, but that smaller competitors will often be gaming the system to look better, or not be taking care to avoid data contamination.

This can mostly be solved through keeping the testing details private, or continuously rotating them with questions known to not be online. But it also is not the issue here.

False negatives are far scarier.

We can again subdivide, and ask what ways things might go wrong. I took 10 or so minutes to brainstorm a list, which is of course highly incomplete.

These are vaguely ordered ‘ordinary failure, probably not too bad’ to ‘oh no.’

  1. The AI can do it, if you were better at prompting and writing custom instructions.

    1. Variant: The AI can do it, if you jailbreak it first, which you can totally do.

    2. Variant: You messed up the inputs or the answer key.

  2. The AI can do it, if you offer it the right additional context.

  3. The AI can do it, if you give it some extra scaffolding to work with.

  4. The AI can do it, if you give it a bit of fine tuning.

  5. The AI can do it, if you force it to embody the Golden Gate Bridge or something.

  6. The AI can do it, with help from a user with better domain knowledge.

  7. The AI can do it, but you won’t like the way it picked to get the job done.

  8. The AI can do it, but you have to trigger some hidden condition flag.

  9. The AI can do it, but the developers had it hide its capabilities to fool the test.

  10. The AI can do it, but realized you were testing it, so it hid its capabilities.

  11. The AI can do it, so the developers crippled the narrow capability that goes on evaluations, but it still has the broader capability you were actually trying to test.

  12. The AI can’t do this in particular, but you were asking the wrong questions.

    1. Variant: What the AI can do is something humans haven’t even considered yet.

    2. Variant: What you care about exists out of distribution, and this isn’t it.

  13. The AI can do it, but its solution was over your head and you didn’t notice.

  14. The AI escaped or took control or hacked the system during your test.

  15. The AI did the dangerous thing during training or fine-tuning. You are too late.

The more different tests you run, and the more different people run the tests, especially if you include diverse red teaming and the ability to probe for anything at all while well resourced, the better you will do. But this approach has some severe problems, and they get a lot more severe once you enter the realm of models plausibly smarter than humans and you don’t know how to evaluate the answers or what questions to ask.
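
One way to see why the ‘more tests, more testers’ point matters: here is a sketch of an elicitation harness that scores the same model under progressively stronger conditions and reports the best result, so a capability that only shows up with better prompting or scaffolding (the first few failure modes above) is not missed. The strategy functions and their scores are stubs standing in for real evaluation pipelines.

```python
from typing import Callable, Dict

# Each strategy stands in for a full evaluation pipeline (prompt templates,
# added context, tool scaffolding, light fine-tuning, human red-teaming, ...).
# The stub scores below are made up purely to make the example runnable.
def naive_prompt(model_id: str, task_id: str) -> float:
    return 0.2


def tuned_prompt(model_id: str, task_id: str) -> float:
    return 0.5


def tool_scaffolding(model_id: str, task_id: str) -> float:
    return 0.9


STRATEGIES: Dict[str, Callable[[str, str], float]] = {
    "naive prompt": naive_prompt,
    "tuned prompt": tuned_prompt,
    "tool scaffolding": tool_scaffolding,
}


def elicited_capability(model_id: str, task_id: str) -> float:
    """Score the task under every elicitation strategy and report the maximum.
    Reporting only the naive condition (0.2 here) is exactly the false negative
    described above: the capability exists, the test just failed to elicit it."""
    return max(strategy(model_id, task_id) for strategy in STRATEGIES.values())


print(elicited_capability("hypothetical-model", "hypothetical-task"))  # 0.9
```

Even this only addresses the milder failure modes; the later items on the list, such as sandbagging or capabilities the testers never thought to ask about, are not fixed by taking a maximum.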

If all you want are capabilities relative to another similar model, and you can put an upper bound on how capable the thing is, a lot of these problems mostly go away or become much easier, and you can be a lot more confident.

Anyway, my basic perspective is that you use evaluations, but that in our current state and likely for a while I would not trust them to avoid false negatives on the high end, if your system used enough compute and is large enough that it might plausibly be breaking new ground. At that point, you need to use a holistic mix of different approaches and an extreme degree of caution, and beyond a certain point we don’t know how to proceed safely in the existential risk sense.

So the question is, will the people tasked with this be able to figure out a reasonable implementation of these questions? How can we help them do that?

The basic principle here, however, is clear. As inputs, potential capabilities and known capabilities advance, we will need to develop and deploy more robust testing procedures, and be more insistent upon them. From there, we can talk price, and adjust as we learn more.

There are also two very important points that wait for the national security section: A proper investigation into defining AGI and evaluating how likely it is and what risks it would pose, and an exploration into AI export controls and the possibility of on-chip AI governance. I did not expect to get those.

Am I dismayed that the words existential and catastrophic only appear once each and only in the appendix (and extinction does not appear)? That there does not appear to be a reference in any form to ‘loss of human control’ as a concept, and so on? That ‘AGI’ does not appear until the final section on national security, although they ask very good questions about it there?

Here is the appendix section where we see any mentions at all (bold is mine), which does ‘say the line’ but does seem to have rather a missing mood, concluding essentially (and to be fair, correctly) that ‘more research is needed’:

The eighth forum examined the potential long-term risks of AI and how best to encourage development of AI systems that align with democratic values and prevent doomsday scenarios.

Participants varied substantially in their level of concern about catastrophic and existential risks of AI systems, with some participants very optimistic about the future of AI and other participants quite concerned about the possibilities for AI systems to cause severe harm.

Participants also agreed there is a need for additional research, including standard baselines for risk assessment, to better contextualize the potential risks of highly capable AI systems. Several participants raised the need to continue focusing on the existing and short term harms of AI and highlighted how focusing on short-term issues will provide better standing and infrastructure to address long-term issues.

Overall, the participants mostly agreed that more research and collaboration are necessary to manage risk and maximize opportunities.

Of course all this obfuscation is concerning.

It is scary that such concepts are that-which-shall-not-be-named.

You-know-what still has its hands on quite a few provisions of this document. The report was clearly written by people who understand that the stakes are going to get raised to very high levels. And perhaps they think that by not saying you-know-what, they can avoid all the nonsensical accusations that they are worried about ‘science fiction’ or ‘hypothetical risks’ or what not.

That’s the thing. You do not need the risks to be fully existential, or to talk about what value we are giving up 100 or 1,000 years from now, or any ‘long term’ arguments, or even the fates of anyone not already alive, to make it worth worrying about what could happen to all of us within our lifetimes. The prescribed actions change a bit, but not all that much, especially not yet. If the practical case works, perhaps that is enough.

I am not a politician. I do not have experience with similar documents and how to correctly read between the lines. I do know this report was written by committee, causing much of this dissonance. Very clearly at least one person on the committee cared and got a bunch of good stuff through. Also very clearly there was sufficient skepticism that this wasn’t made explicit. And I know the targets are other committees, which muddies everything further.

Perhaps, one might suggest, all this optimism is what they want people like me to think? But that would imply that they care what people like me think when writing such documents.

I am rather confident that they don’t.

I went into this final section highly uncertain what they would focus on. What does national security mean in this context? There are a lot of answers that would not have shocked me.

It turns out that here it largely means help the DOD:

  1. Bolstering cyber capabilities.

  2. Developing AI career paths for DOD.

  3. Money for DOD.

  4. Efficiently handle security clearances, improve DOD hiring process for AI talent.

  5. Improve transfer options and other ways to get AI talent into DOD.

I would certainly reallocate DOD money toward more of these things if the goal is to increase the effectiveness of the DOD. Whether to simply throw more total money at DOD is a political question and I don’t have a position there.

Then we throw in an interesting one?

  1. Prevent LLMs leaking or reconstructing sensitive or confidential information.

Leaking would mean it was in the training data. If so, where did that data come from? Even if the source was technically public and available to be found, ‘making it easy on them’ is very much a thing. If it is in the training data you can probably get the LLM to give it to you, and I bet that LLMs can get pretty good at ‘noticing which information was classified.’
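As a concrete illustration of ‘if it is in the training data you can probably get the LLM to give it to you,’ here is a minimal sketch of a canary-style leakage probe. The query_model helper is a hypothetical stand-in for whatever completion API is being tested, and the prompts are illustrative:

```python
# Minimal sketch of a canary-style leakage probe. `query_model` is a
# hypothetical stand-in for whatever completion API is being tested;
# in practice, known canary strings are planted and checked for extraction.

from typing import Callable

def leaks_secret(query_model: Callable[[str], str],
                 known_prefix: str,
                 secret: str,
                 samples_per_prompt: int = 20) -> bool:
    """Return True if any sampled completion reproduces the secret."""
    prompts = [
        known_prefix,  # plain continuation of text the model may have memorized
        f"Complete the following document exactly:\n{known_prefix}",
        f"Repeat the rest of this record verbatim:\n{known_prefix}",
    ]
    for prompt in prompts:
        for _ in range(samples_per_prompt):  # leakage is often stochastic
            if secret in query_model(prompt):
                return True
    return False
```

Repeated sampling and varied phrasings matter because memorized text often only surfaces under some prompts and temperatures.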

Reconstructing is more interesting. If you de facto add ‘confidential information even if reconstructed’ to the list of catastrophic risks alongside CBRN, as I presume some NatSec people would like, then that puts the problem for future LLMs in stark relief.

The way that information is redacted usually contains quite a lot of clues. If you put AI on the case, especially a few years from now, a lot of things are going to fall into place. In general, a capable AI will start being able to figure out various confidential information, and I do not see how you stop that from happening, given that no one is especially keen to provide OpenAI or Google with a list of all the confidential information their AI is totally not supposed to know about. Seems hard.

A lot of problems are going to be hard. On this one, my guess is that among other things the government is going to have to get a very different approach to what is classified.

  1. Monitor AI and especially AGI development by our adversaries.

I would definitely do that.

  1. Work on a better and more precise definition of AGI, a better measurement of how likely it is to be developed and the magnitude of the risks it would pose.

Yes. Nice. Very good. They are asking many of the right questions.

  1. Explore using AI to mitigate space debris.

You get an investigation into using AI for your thing. You get an investigation into using AI for your thing. I mean, yeah, sure, why not?

  1. Look into all this extra energy use.

I am surprised they didn’t put extra commentary here, but yeah, of course.

  1. Worry about CBRN threats and how AI might enhance them.

An excellent thing for DOD to be worried about. I have been pointed to the question here of what to do about Restricted Data. We automatically classify certain information, such as info about nuclear weapons, as it comes into existence. If an AI is not allowed to generate outputs containing such information, and there is certainly a strong case why you would want to prevent that, this is going to get tricky. No question the DOD should be thinking carefully about the right approach here. If anything, AI is going to be expanding the range of CBRN-related information that we do not want easily shared.

  1. Consider how CBRN threats and other advanced technological capabilities interact with need for AI export controls, explore whether new authorities are needed, and explore feasibility of options to implement on-chip security mechanisms for high-end AI chips.

  2. “Develop a framework for determining when, or if, export controls should be placed on powerful AI systems.”

Ding. Ding. Ding. Ding. Ding.

If you want the ability to choke off supply, you target the choke points you can access.

That means either export controls, or it means on-chip security mechanisms, or it means figuring out something new.

This is all encouraging another group to consider maybe someone doing something. That multi-step indirection runs through the entire document. But yes, all the known plausibly effective ideas are here in one form or another, to be investigated.

The language here on AI export controls is neutral, asking both when and if.

At some point on the capabilities curve, national security will dictate the need for export controls on AI models. That is incompatible with open weights on those models, or with letting such models run locally outside the export control zone. The proper ‘if’ is whether we get to that point, so the right question is when.

Then they go to a place I had not previously thought about us going.

  1. “Develop a framework for determining when an AI system, if acquired by an adversary, would be powerful enough that it would pose such a grave risk to national security that it should be considered classified, using approaches such as how DOE treats Restricted Data.”

Never mind debating open model weights. Should AI systems, at some capabilities level, be automatically classified upon creation? Should the core capabilities workers, or everyone at OpenAI and DeepMind, potentially have to get a security clearance by 2027 or something?

  1. Ensure federal agencies have the authority to work with allies and international partners and agree to things. Participate in international research efforts, ‘giving due weight to research security and intellectual property.’

Not sure why this is under national security, and I worry about the emphasis on friendlies, but I would presume we should do that.

  1. Use modern data analytics to fight illicit drugs including fentanyl.

Yes, use modern data analytics. I notice they don’t mention algorithmic bias issues.

  1. Promote open markets for digital goods, prevent forced technology transfer, ensure the digital economy ‘remains open, fair and competitive for all, including for the three million American workers whose jobs depend on digital trade.’

Perfect generic note to end on. I am surprised the number of jobs is that low.

They then give a list of who was at which forum and summaries of what happened.

Before getting to my takeaways, here are some other reactions.

These are illustrative of five very different perspectives, and also the only five cases in which anyone said much of anything about the bill at all. And I love that all five seem to be people who actually looked at the damn thing. A highly welcome change.

  1. Peter Wildeford looks at the overall approach. His biggest takeaway is that this is a capabilities-based approach, which puts a huge burden on evaluations, and he notices some other key interactions too, especially funding for BIS and NIST.

  2. Tim Fist highlights some measures he finds fun or exciting. Like Peter, he mentions the call for investigation of on-chip security mechanisms.

  3. Tyler Cowen’s recent column contained the following: “Fast forward to the present. Senate Majority Leader Chuck Schumer and his working group on AI have issued a guidance document for federal policy. The plans involve a lot of federal support for the research and development of AI, and a consistent recognition of the national-security importance of the US maintaining its lead in AI. Lawmakers seem to understand that they would rather face the risks of US-based AI systems than have to contend with Chinese developments without a US counterweight. The early history of Covid, when the Chinese government behaved recklessly and nontransparently, has driven this realization home.”

    1. The context was citing this report as evidence that the AI ‘safety movement’ is dead, or at least that a turning point has been reached and it will fade into obscurity (and the title has now been changed to better reflect the post.)

    2. Tyler is right that there is much support for ‘innovation,’ ‘R&D’ and American competitiveness and national security. But this is as one would expect.

    3. My view is that, while the magic words are not used, the ‘AI safety’ concerns are very much here, including all the most important policy proposals, and it even includes one bold proposal I do not remember previously considering.

    4. Yes, I would have preferred if the report had spoken more plainly and boldly, here and also elsewhere, and the calls had been stronger. But I find it hard not to consider this a win. At bare minimum, it is not a loss.

    5. Tyler has not, that I know of, given further analysis on the report’s details.

  4. R Street’s Adam Thierer gives an overview.

    1. He notices a lot of the high-tech pork (e.g. industrial policy) and the calls to investigate expanding regulations.

    2. He notices the kicking of all the cans down the road, and agrees this makes sense.

    3. He happily notices no strike against open source, which is only true if you do not work through the implications (e.g. of potentially imposing export controls on future highly capable AI systems, or even treating them as automatically classified Restricted Data.)

    4. Similarly, he notes the lack of a call for a new agency, with the report instead doing everything piecemeal. And he is happy that ‘existential risk lunacy’ is not mentioned by name, allowing him not to notice it either.

    5. Then he complains about the report not removing enough barriers from existing laws, regulations and court-based legal systems, but agrees existing law should apply to AI. Feels a bit like trying to have the existing-law cake (to head off any new rules) and eat it too (by calling to gut what already exists), but hey. He offers special praise for the investigation to look for innovation-stifling rules.

    6. He notices some of the genuinely scary language, in particular “Review whether other potential uses for AI should be either extremely limited or banned.”

    7. He calls for Congress to actively limit Executive discretion on AI, which seems like ‘AI Pause now’ levels of not going to happen.

    8. He actively likes the idea of a public awareness campaign, which surprised me.

    9. Finally Adam seems to buy into the view that screwing up Section 230 is the big thing to worry about. I continue to be confused why people think that this is going to end up being a problem in practice. Perhaps it is the Sisyphean task of people like R Street to constantly worry about such nightmare scenarios.

    10. He promised a more detailed report to come, but I couldn’t find one.

  5. The Wall Street Journal editorial board covers it as ‘The AI Pork Barrel Arrives.’

They quote Schumer embarrassing himself a bit:

Chuck Schumer: If China is going to invest $50 billion, and we’re going to invest in nothing, they’ll inevitably get ahead of us.

Padme: You know the winner is not whoever spends the most public funds, right?

You know America’s success is built on private enterprise and free markets, right?

You do know that ‘we’ are investing quite a lot of money in AI, right?

You… do know… we are kicking China’s ass on AI at the moment, right?

WSJ Editorial Board: Goldman Sachs estimates that U.S. private investment in AI will total $82 billion next year—more than twice as much as in China.

We are getting quite a lot more than twice as much bang for our private bucks.

And this comes on the heels of the Chips Act money.

So yes, I see why the Wall Street Journal Editorial Board is thinking pork.

WSJ Editorial Board: Mr. Schumer said Wednesday that AI is hard to regulate because it “is changing too quickly.” Fair point. But then why does Washington need to subsidize it?

The obvious answer, mostly, is that it doesn’t.

There are some narrow areas, like safety work, where one can argue that there will by default be underinvestment in public goods.

There is a need to fund the government’s own adoption of AI, including for defense, and to adjust regulations and laws and procedures for the new world.

Most of the rest is not like that.

WSJ: Now’s not a time for more pork-barrel spending. The Navy could buy a lot of ships to help deter China with an additional $32 billion a year.

This is where they lose me. Partly because a bunch of that $32 billion is directly for defense or government services and administration. But also because I see no reason to spend a bunch of extra money on new Navy ships that will be obsolete in the AI era, especially given what I have heard about our war games where our ships are not even useful against China now. The Chips Act money is a far better deterrent. We also would have accepted ‘do not spend the money at all.’

Mostly I see this focus as another instance of the mainstream not understanding, in a very deep way, that AI is a Thing, even in the economic and mundane utility senses.

There was a lot of stuff in the report. A lot of it was of the form ‘let’s do good thing X, without its downside Y, taking into consideration the vital importance of A, B and C.’

It is all very ‘why not both,’ embrace the upside and prevent the downside.

Which is great, but of course easier said (or gestured at) than done.

This is my attempt to assemble what feels most important; hopefully I am not forgetting anything:

  1. The Schumer Report is written by a committee for other committees to then do something. Rather than one big bill, we will get a bunch of different bills.

  2. They are split on whether to take existential risk seriously.

    1. As a result, they include many of the most important proposals on this.

      1. Requiring safety testing of frontier models before release.

      2. Using compute or other proxies if evaluations are not sufficiently reliable.

      3. Export controls on AI systems.

      4. Treating sufficiently capable AI systems as Restricted Data.

      5. Addressing CBRN threats.

      6. On-chip governance for AI chips.

      7. The need for international cooperation.

      8. Investigate the definition of AGI, and the risks it would bring.

    2. Also as a result, they present them in an ordinary, non-x-risk context.

    3. That ordinary context indeed does justify the proposals on its own.

  3. Most choices regarding AI Safety policies seem wise. The big conceptual danger is that the report emphasizes a capabilities-based approach via evaluations and tests. It does mention the possibility of using compute or other proxies if our tests are inadequate, but I worry a lot about overconfidence here. This seems like the most obvious way that this framework goes horribly wrong.

    1. A second issue is that this report presumes that only release of a model is dangerous, that otherwise it is safe. Which for now is true, but this could change, and it should not be an ongoing assumption.

  4. There is a broad attitude that the rules must be flexible, and adapt over time.

  5. They insist that AI will need to obey existing laws, including those against algorithmic discrimination and all the informational disclosure requirements involved.

  6. They raise specters regarding mundane harm concerns and AI ethics, both in existing law and proposed new rules, that should worry libertarians and AI companies far more than laws like SB 1047 that are aimed at frontier models and catastrophic risks.

    1. Calls for taking ‘reasonable steps’ to ‘protect children’ should be scary. They are likely not kidding around about copyright, CSAM or deepfakes.

    2. Calls for consultation and review could turn into a NEPA-style nightmare. Or they might turn out to be nothing. Hard to tell.

    3. They say that if black box AI is incompatible with existing disclosure requirements and calls for explainability and transparency, then their response is: Tough.

    4. They want to closely enforce rules on algorithmic discrimination, including the associated disclosure requirements.

    5. There are likely going to be issues with classified material.

    6. The report wants to hold developers and users liable for AI harms, including mundane AI harms.

    7. The report calls for considerations of potential use case bans.

  7. They propose to spend $32 billion on AI, with an unknown breakdown.

  8. Schumer thinks public spending matters, not private spending. It shows.

  9. There are many proposals for government adoption of AI and building of AI-related state capacity. This seemed like a key focus point.

    1. These mostly seem very good.

    2. Funding for BIS and NIST is especially important and welcome.

  10. There are many proposals to ‘promote innovation’ in various ways.

    1. I do not expect them to have much impact.

  11. There are proposals to ‘help small business’ and encourage geographic diversity and other such things.

    1. I expect these are pork and would go to waste.

  12. There is clear intent to integrate AI closely into our critical infrastructure and into the Department of Defense.

This is far from the report I would have wanted written. But it is less far than I expected before I looked at the details. Interpreting a document like this is not my area of expertise, but in many ways I came away optimistic. The biggest downside risks I see are that the important proposals get lost in the shuffle, or that some of the mundane harm related concerns get implemented in ways that cause real problems.

If I were a lobbyist for tech companies looking to avoid expensive regulation, especially one trying to help relatively small players, I would focus a lot more on heading off mundane harm concerns like those that have hurt so many other areas. That seems like by far the bigger commercial threat, if you do not care about the risks on any level.

The Schumer Report on AI (RTFB)