Notes on Dwarkesh Patel’s Podcast with Sholto Douglas and Trenton Bricken

Dwarkesh Patel continues to be on fire, and the podcast notes format seems like a success, so we are back once again.

This time the topic is how LLMs are trained, work and will work in the future. Timestamps are for YouTube. Where I inject my own opinions or takes, I do my best to make that explicit and clear.

This was highly technical compared to the average podcast I listen to, or that Dwarkesh does. This podcast threatened to go over my head technically at times, and some details did go over my head outright. I still learned a ton, and expect you will too if you pay attention.

This is an attempt to distill what I found valuable, and what questions I found most interesting. I did my best to make it intuitive to follow even if you are not technical, but in this case one can only go so far. Enjoy.

  • (1: 30) Capabilities-only podcast, Trenton has ‘solved alignment.’ April fools!

  • (2: 15) Huge context length is underhyped, a huge deal. It occurs to me that the issue is about the trivial inconvenience of providing the context. Right now I mostly do not bother providing context on my queries. If that happened automatically, it would be a whole different ballgame.

  • (2: 50) Could the models be sample efficient if you can fit it all in the context window? Speculation is it might work out of the box.

  • (3: 45) Does this mean models are already in some sense superhuman, with this much context and memory? Well, yeah, of course. Computers have been superhuman at math and chess and so on for a while. Now LLMs have quickly gone from having worse short term working memory than humans to vastly superior short term working memory. Which will make a big difference. The pattern will continue.

  • (4: 30) In-context learning is similar to gradient descent. It gets problematic for adversarial attacks, but of course you can ignore that because as Trenton reiterates alignment is solved, and certainly it is solved for such mundane practical concerns. But it does seem like he’s saying if you do this then ‘you’re fine-tuning but in a way where you cannot control what is going on’?

  • (6: 00) Models need to learn how to learn from examples in order to take advantage of long context. So does that mean the task of intelligence requires long context? That this is what causes the intelligence, in some sense, they ask? I don’t think you can reverse it that way, but it is possible that this will orient work in directions that are more effective?

  • (7: 00) Dwarkesh asks about how long contexts link to agent reliability. Douglas says this is more about lack of nines of reliability, and GPT-4-level models won’t cut it there. And if you need to get multiple things right, the reliability numbers have to multiply together, which does not go well in bulk. If that is indeed the issue then it is not obvious to me the extent to which scaffolding and tricks (e.g. Devin, probably) render this fixable.

  • (8: 45) Performance on complex tasks improves along a log scale. The model gets it right one time in a thousand, then one in a hundred, then one in ten. So there is a clear window where the thing is in practice useless, but you know it soon won’t be. And we are in that window on many tasks. This goes double if you have complex multi-step tasks. If you have a three-step task and are getting each step right one time in a thousand, the full task is one in a billion, yet you are not so far from being able to do the task in practice (see the arithmetic sketch below).
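
A minimal sketch of the arithmetic in the bullet above, with purely illustrative numbers (nothing here comes from the podcast itself):

```python
# Per-step success rates multiply, so small per-step gains compound dramatically
# across multi-step tasks -- and so does unreliability.
def task_success(per_step_rate: float, steps: int) -> float:
    return per_step_rate ** steps

for rate in (0.001, 0.01, 0.1, 0.9, 0.99):
    print(f"per-step {rate:>5}: 3-step task succeeds {task_success(rate, 3):.2e} of the time")
```

At 0.001 per step the three-step task is one in a billion; at 0.99 per step it already works about 97% of the time.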

  • (9: 15) The model being presented here is predicting scary capabilities jumps in the future. LLMs can actually (unreliably) do all the subtasks, including identifying what the subtasks are, for a wide variety of complex tasks, but they fall over on subtasks too often and we do not know how to get the models to correct for that. But that is not so far from the whole thing coming together, and that would include finding scaffolding that lets the model identify failed steps and redo them until they work, provided which steps fail is sufficiently non-deterministic rather than fixed by the core difficulties.

  • (11: 30) Attention costs for context window size are quadratic, so how is Google getting the window so big? Suggestion is that the attention cost is still dwarfed by the MLP block in practice, and while generating tokens the cost is no longer n-squared; your marginal cost becomes linear (rough numbers sketched below).
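
A rough back-of-the-envelope sketch of that claim (illustrative sizes only; real architectures differ in many details):

```python
# Rough per-layer FLOP sketch: attention scores scale with n^2 * d, while the
# MLP block scales with n * d^2 (assuming a ~4x hidden expansion), so the
# attention-to-MLP ratio is roughly n / (8 * d).
def prefill_flops(n_tokens: int, d_model: int):
    attention = 2 * n_tokens ** 2 * d_model           # QK^T plus weighted sum of V
    mlp = 2 * n_tokens * (4 * d_model) * d_model * 2  # up- and down-projection
    return attention, mlp

d = 8_192  # hypothetical model width
for n in (8_192, 65_536, 1_000_000):
    attn, mlp = prefill_flops(n, d)
    print(f"prefill n={n:>9,}: attention / MLP ≈ {attn / mlp:.2f}")

# During generation with a KV cache, each new token attends once over the n
# cached tokens, so its marginal attention cost is ~2 * n * d: linear, not quadratic.
```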

  • (13: 30) Are we shifting where the models learn, with more and more in the forward pass? Douglas says essentially no, the context length allows useful working memory, but is not ‘the key thing towards actual reasoning.’

  • (15: 10) Which scaling up counts? Tokens, compute, model size? Can you loop through the model or brain or language? Yes, but notice that in practice humans only do 5-7 steps in complex sentences because of working memory limits.

  • (17: 15) Where is the model reasoning? No crisp answer. The residual stream that the model carries forward packs in a lot of different vectors that encode all the info. Attention is about what to pick up and put into what is effectively RAM.

  • (20: 40) Does the brain work via this residual stream? Yes. Humans implement a bunch of efficient algorithms and really scale up our cerebral cortex investment. A key thing we do is very similar to the attention algorithm.

  • (24: 00) How does the brain reason? Trenton thinks mostly intelligence is pattern matching. ‘Association is all you need.’

  • (25: 45) Paper from Demis in 2008 noted that memory is reconstructive, so it is linked to creativity and also is horribly unreliable.

  • (26: 45) What makes Sherlock Holmes so good? Under this theory: A really long context length and working memory, and better high-level association. Also a good algorithm for his queries and how to build representations. Also proposed: A Sherlock Holmes evaluation. Give a mystery novel or story, ask for probability distribution over ‘The suspect is X.’

  • (28: 30) A vector in the residual stream is the composite of all the tokens to which I have previously paid attention, even by layer two.
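
A minimal toy sketch of the residual-stream picture being described; the attention and MLP functions below are stand-ins for the real learned layers, purely to show the dataflow:

```python
import numpy as np

def transformer_block(x, attn, mlp):
    # Each sublayer reads from the residual stream and adds its output back in,
    # so a position's vector accumulates everything it has attended to so far.
    x = x + attn(x)   # attention moves information between positions
    x = x + mlp(x)    # the MLP transforms each position's accumulated vector
    return x

# Toy stand-ins just to show the dataflow; real attention/MLP layers are learned.
x = np.random.randn(10, 64)                     # 10 tokens, 64-dim residual stream
uniform_attn = lambda h: h.mean(axis=0, keepdims=True).repeat(len(h), axis=0)
small_mlp = lambda h: 0.1 * np.tanh(h)
x = transformer_block(x, uniform_attn, small_mlp)
```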

  • (30: 30) Could we do an unsupervised benchmark? It has been explored, such as with constitutional AI. Again, alignment-free podcast here.

  • (31: 45) If intelligence is all associations, should we be less worried about superintelligence, because there’s not this sense in which it is Sherlock++ and it can’t solve physics from a world frame? The response is, they would need to learn the associations, but also the tech makes that quick to do, and silicon can be about as generally intelligent as humans and can recursively improve anyway.

  • My response here would strongly be that if this is true, we should be more worried rather than less worried, because it means there is no secret or trick, and scale really would be all you would need, if you scale enough distinct aspects, and we should expect that we would do that.

  • (32: 45) Dwarkesh asks if this means disagreeing with the premise of them not being that much more powerful. To which I would strongly say yes. If it turns out that the power comes from associations, then that still leads to unbounded power, so what if it does not sound impressive? What matters is if it works.

  • (33: 30) If we got thousands of copies of you, do we get an intelligence explosion? We do dramatically speed up research but compute is a binding constraint. Trenton thinks we would need longer contexts, more reliability and lower cost to get an intelligence explosion, but getting there within a few years seems plausible.

  • (37: 30) Trenton expects this to speed up a lot of the engineering soon, accelerating research and compounding, but not (yet) a true intelligence explosion.

  • (39: 00) What about the costs of training orders-of-magnitude bigger models? Does this break recursive intelligence explosion? It’s a braking mechanism. We should be trying hard to estimate how much of this is automatable. I agree that the retraining costs and required time are a braking mechanism, but also efficiency gains could quickly reduce those costs, and one could choose to work around the need to do that via other methods. One should not be confident here.

  • (41: 00) Understanding what goes wrong is key to making AI progress. There are lots of ideas but figuring out which ideas are worth exploring is vital. This includes anticipating which trend lines will hold when scaled up and which won’t. There’s an invisible graveyard of trend lines that looked promising and then failed to hold.

  • (44: 20) A lot of good research works backwards from solving actual problems. Trying to understand what is going on, figuring out how to run experiments. Performance is lots of low-level hard engineering work. Ruthless prioritization is key to doing high quality research, the most effective people attack the problem, do really fast experiments and do not get attached to solutions. Everything is empirical.

  • (48: 00) “Even though we wouldn’t want to admit it, the whole community is kind of doing greedy evolutionary optimization over the landscape of possible AI architectures and everything else. It’s no better than evolution. And that’s not even a slight against evolution.” Does not fill one with confidence on safety.

  • (49: 30) Compute and taste on what to do are the current limiting factors for capabilities. Scaling to properly use more humans is hard. For interpretability they need more good engineers.

  • (51: 00) “I think the Gemini program would probably be maybe five times faster with 10 times more compute or something like that. I think more compute would just directly convert into progress.”

  • (51: 30) If compute is such a bottleneck is it being insufficiently allocated to such research and smaller training tasks? You also need the big training runs to avoid getting off track.

  • (53: 00) What does it look like for AI to speed up AI research? Could be algorithmic progress from AI. That takes more compute, but seems quite reasonable this could act as a force multiplier for humans. Also could be synthetic data.

  • (55: 30) Reasoning traces are missing from data sets, and seem important.

  • (56: 15) Is progress going to be about making really amazing AI maps of the training data? Douglas says clearly a very important part. Doing next-token prediction on a sufficiently good data set requires so many other things.

  • (58: 30) Language as synthetic data by humans for humans? With verifier via real world.

  • (59: 30) Yeah, whole development process is largely evolutionary, more people means more recombination, more shots on target. That does to me seem in conflict with the best people being the ones who can discriminate over potential tasks and ideas. But also they point out serendipity is a big deal and it scales. They expect AGI to be the sum of a bunch of marginal things.

  • (1: 01: 30) If we don’t get AGI by GPT-7-levels-of-OOMs are we stuck? Sholto basically buys this: orders of magnitude have diminishing returns at their core; although they unlock reliability, reasoning progress is sublinear in OOMs. Dwarkesh notes this is highly bearish, which seems right.

  • (1: 03: 15) Sholto points out that even with smaller progress, another 3.5→4 jump in GPT-levels is still pretty huge. We should expect smart plus a lot of reliability. This is not to undersell what is coming, rather the jumps so far are huge, and even smaller jumps from here unlock lots of value. I agree.

  • (1: 07: 30) Bigger models allow you to minimize superposition (overloading more features onto fewer parameters), making results less noisy, whereas smaller ones are underparameterized given their goal of representing the entire internet. Speculation that superposition is why interpretability is so hard. I wonder if that means it could get easier with more parameters? Could we use ‘too many’ parameters on purpose in order to help with this? (Toy illustration below.)
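
A toy illustration of the superposition intuition (nothing here is a real model): far more nearly-orthogonal feature directions than dimensions can be packed in, at the cost of interference between them, and widening the model drives that interference down.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 1024           # many more features than dimensions
features = rng.standard_normal((n_features, d_model))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Interference: how strongly distinct feature directions overlap with each other.
overlaps = features @ features.T
np.fill_diagonal(overlaps, 0)
print("max interference:", np.abs(overlaps).max())   # nonzero noise between features
# A wider model (larger d_model) drives this interference down, which is the
# intuition for why bigger models can be less noisy / less superposed.
```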

  • (1: 11: 00) What’s happening with distilled models? Dwarkesh suggests GPT-4-Turbo is distilled, Sholto suggests it could instead be new architecture.

  • (1: 12: 30) Distillation is powerful because the full probability distribution gives you much richer data to work with.
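
A minimal sketch of why the full distribution is richer than a sampled token: the student can be trained against the teacher's whole next-token distribution (soft targets) instead of a single hard label. Illustrative numbers only, not any lab's actual recipe.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

teacher_logits = np.array([4.0, 3.5, 1.0, -2.0])   # teacher is nearly tied between two tokens
student_logits = np.array([2.0, 0.5, 0.2, 0.1])

p_teacher = softmax(teacher_logits)
p_student = softmax(student_logits)

# Hard-label training only sees the argmax token (index 0).
hard_loss = -np.log(p_student[0])
# Distillation matches the whole distribution, so the near-tie at index 1 is signal too.
soft_loss = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))  # KL(teacher || student)
print(hard_loss, soft_loss)
```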

  • (1: 13: 30) Adaptive compute means spend more cycles on harder questions. How do you do that via chain of thought? You get to pass a KV value during forward passes rather than passing only the token, which helps, so the KV-cache is (headcanon-level, not definitively) pushing forward the CoT without having to link to the output tokens. This is ‘secret communication’ (from the user’s perspective) of the model to its forward inferences, and we don’t know how much of that is happening. Not always the thing going on, but there is high weirdness. (Sketch below.)
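
A bare-bones sketch of the KV-cache mechanics under discussion (toy single-head attention, not a real model): each step's keys and values are appended to a cache, and later steps attend over that cache, which carries richer state than the emitted tokens alone.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

d = 16
W_q, W_k, W_v = (np.random.randn(d, d) * 0.1 for _ in range(3))
k_cache, v_cache = [], []

def decode_step(x):
    # x: the current token's residual-stream vector.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    weights = softmax(q @ K.T / np.sqrt(d))
    # The output mixes cached values from all prior steps -- richer state than
    # the emitted tokens alone, which is the "secret communication" intuition.
    return weights @ V

for _ in range(5):
    out = decode_step(np.random.randn(d))
```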

  • (1: 19: 15) Anthropic sleeper agents paper, notice the CoT reasoning does seem to impact results and the reasoning it does is pretty creepy. But in another paper, the model will figure out the multiple choice answer is always ‘A’ but the reasoning in its CoT will be something else that sounds plausible. Dwarkesh notes humans also come up with crazy explanations for what they are doing, such as when they have split brains. “It’s just that some people will hail chain-of-thought reasoning as a great way to solve AI safety, but actually we don’t know whether we can trust it.”

  • (1: 23: 30) Agents, how will they work once they work well enough? Short term expectation from Sholto is agent talking together. Sufficiently long context windows could make fine-tuning unnecessary or irrelevant.

  • (1: 26: 00) With sufficient context could you train everything on a global goal like ‘did the firm make money?’ In the limit, yes, that is ‘the dream of reinforcement learning.’ Can you feel the instrumental convergence? At first, though, they say, in practice, no, it won’t work.

  • (1: 27: 45) Suggestion that languages evolve to be good at encoding things to teach children important things, such as ‘don’t die.’

  • (1: 29: 30) In other modalities figuring out exactly what you are predicting is key to success. For language you predict the next token, it is easy mode in that sense.

  • (1: 31: 30) “there are interesting interpretability pieces where if we fine-tune on math problems, the model just gets better at entity recognition.” It makes the model better at attending to positions of things and such.

  • (1: 32: 30) Getting better at code makes the model a better thinker. Code is reasoning, you can see how it would transfer. I certainly see this happening in humans.

  • (1: 35: 00) Section on their careers. Sholto’s story is a lot of standard things you hear from high-agency, high-energy, high-achieving people. They went ahead and did things, and also pivoted and went in different directions, followed curiosity, read all the papers. Strong ideas, loosely held, carefully selected, vigorously pursued. Dwarkesh notes one of the most important things is to go do the things, and managers are desperate for people who will make sure things get done. If you get bottlenecked because you need lawyers, well, why didn’t you go get the lawyers? Lots of impact is convincing people to work with you to do a thing.

  • (1: 43: 30) Sholto is working on AI largely because he thinks it can lead to a wonderful future, and was sucked into scaling by Gwern’s scaling hypothesis post. That is indeed the right reason, if you are also taking into account the downside risks including existential risks, and still think this is a good idea. It almost certainly is not a neutral idea, it is either a very good idea or extremely ill-advised.

  • (1: 43: 35) Sholto says McKinsey taught him how to actually do work, and the value of not taking no for an answer, whereas often things don’t happen because no individual cares enough to make it happen. The consultant can be that person, and you can be that person otherwise without being a consultant. He got hired largely by being seen on the internet asking questions about how things work, causing Google to reach out. It turns out at Google you can ask the algorithm and systems experts and they will gladly teach you everything they know.

  • (1: 51: 30) Being in the office all the time, collaborating with others including pair programming with Sergey Brin sometimes, knowing the people who make decisions, matters a lot.

  • (1: 54: 00) Trenton’s story begins, his was more standard and direct.

  • (1: 55: 30) Dwarkesh notes that these stories are framed as highly contingent, that people tend to think their own stories are contingent and those of others are not. Sholto mentions the idea of shots on goal, putting yourself in position to get lucky. I buy this. There are a bunch of times I got lucky and something important happened. If you take those times away, or add different ones, my life could look very different. Also a lot of what was happening was, effectively, engineering the situation to allow those events to happen, without having a particular detailed event in mind. Same with these two.

  • (1: 57: 00) Google is continuing the experiment to find high-agency people and bootstrap them. Seems highly promising. Also Chris Olah was hired off a cold email. You need to send and look out for unusual signals. I agree with Dwarkesh that it is very good for the world that a lot of this hiring is not done legibly, and instead is people looking out for agency and contributions generally. If you write a great paper or otherwise show you have the goods, the AI labs will find you.

  • (2: 01: 45) You still need to do the interview process, make sure people can code or what not and you are properly debiased, but that process should be designed not to get in the way otherwise.

  • (2: 03: 00) Emphasis on need to care a ton, and go full blast towards what you want, doing everything that would help.

  • (2: 04: 30) When you get your job, is that then the time to relax or to put the pedal to the metal? There’s pros and cons. Not everyone can go all out, many people want to focus on their families or otherwise relax. Others need to be out there working every hour of the week, and the returns are highly superlinear. And yes, this seems very right to me, returns to going fully in on something have been much higher than returns to ordinary efforts. Jane Street would have been great for me if I could have gone fully in, but I was not in a position to do that.

  • (2: 06: 00) Dwarkesh: “I just try to come up with really smart questions to send to them. In that entire process I’ve always thought, if I just cold email them, it’s like a 2% chance they say yes. If I include this list, there’s a 10% chance. Because otherwise, you go through their inbox and every 34 seconds, there’s an interview for some podcast or interview. Every single time I’ve done this they’ve said yes.” And yep, story checks out.

  • (2: 09: 30) A discussion of what is a feature. It is whatever you call a feature, or it is anything you can turn on and off; it is any of these things. Is that a useful definition? Not if the features were not predictive, or if the features did not do anything. The point is to compose the features into something higher level.

  • (2: 17: 00) Trenton thinks you can detect features that correspond to deceptive behavior, or malicious behavior, when evaluating a request. I’ve discussed my concerns on this before. It is only a feature if you can turn it on and off, perhaps?

  • (2: 20: 00) There are a bunch of circuits that have various jobs they try to do, sometimes as simple as ‘copy the last token,’ and then there are other heads that suppress that behavior. Reasons to do X, versus reasons not to do X.

  • (2: 20: 45) Deception circuit gets labeled as whatever fires in examples where you find deception, or similar? Well, sure, basically.

  • (2: 22: 00) RLHF induces theory of mind.

  • (2: 22: 05) What do we do if the model is superhuman, will our interpretability strategies still work, would we understand what was going on? Trenton says that the models are deterministic (except when finally sampling) so we have a lot to work with, and we can do automated interpretability. And if it is all associations, then in theory that means what in my words would be ‘no secret’ so you can break down whatever it is doing into parts that we can understand and thus evaluate. A claim that evaluation in this sense is easier than generation, basically.

  • (2: 24: 00) Can we find things without knowing in advance what they are? It should be possible to identify a feature and how it relates to other features even if you do not know what the feature is in some sense. Or you can train in the new thing and see what activates, or use other strategies.

  • (2: 26: 00) Is red teaming Gemma helping jailbreak Gemini? How universal are features across models? To some extent.

  • (2: 27: 00) Curriculum learning, which is trying to teach the model things in an intentional order to facilitate learning, is interesting and mentioned in the Gemini paper.

  • (2: 29: 45) Very high confidence that this general model of what is going on with superposition is right, based on success of recent work.

  • (2: 31: 00) A fascinating question: Should humans learn a real representation of the world, or would a distorted one be more useful in some cases? Should venomous animals flash neon pink, a kind of heads-up display baked into your eyes? The answer is that you have too many different use cases, distortions do more harm than good, you want to use other ways to notice key things, and so that is what we do. So Trenton is optimistic the LLMs are doing this too.

  • (2: 32: 00) “Another dinner party question. Should we be less worried about misalignment? Maybe that’s not even the right term for what I’m referring to, but alienness and Shoggoth-ness? Given feature universality there are certain ways of thinking and ways of understanding the world that are instrumentally useful to different kinds of intelligences. So should we just be less worried about bizarro paperclip maximizers as a result?” I quote this question because I do not understand it. If we have feature universality, how is that not saying that the features are compatible with any set of preferences, over next tokens or otherwise? So why is this optimistic? The response is that components of LLMs are often very Shoggoth-like.

  • (2: 34: 00) You can talk to any of the current models in Base64 and it works great.
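
For the curious, ‘talking in Base64’ just means re-encoding the text before sending it; the example prompt below is mine, not from the podcast:

```python
import base64

prompt = "What is the capital of France?"
encoded = base64.b64encode(prompt.encode()).decode()
print(encoded)                             # the Base64-encoded prompt you would paste in
print(base64.b64decode(encoded).decode())  # decodes back to the original text
```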

  • (2: 34: 10) Dwarkesh asks, doesn’t the fact that you needed a Base64 expert to happen to be there to recognize what the Base64 feature was mean that interpretability on smarter models is going to be really hard, if no human can grok it? Anomaly detection is suggested, you look for something different. Any new feature is a red flag. Also you can ask the model for help sometimes, or automate the process. All of this strikes me as exactly how you train a model to not be interpretable.

  • (2: 36: 45) Feature splitting is where if you only have so much space in the model for birds it will learn ‘birds’ and call it a day, whereas if it has more room it will learn features for different specific birds.

  • (2: 38: 30) We have this mess of neurons and connections. The dream is bootstrapping to making sense of all that. Not claiming we have made any progress here.

  • (2: 39: 45) What parts of the process for GPT-7 will be expensive? Training the sparse autoencoder and doing projection into a wider space of features, or labeling those features? Trenton says it depends on how much data goes in and how high-dimensional your space is, which I think means how overloaded and full of superpositions you are or are measuring. (Sketched below.)
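
A heavily simplified sketch of the kind of dictionary-learning setup being described: a sparse autoencoder that projects activations into a much wider, sparsely active feature space. The sizes, the L1 penalty, and the untrained weights are all illustrative assumptions, not the actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, l1_coef = 512, 8_192, 1e-3   # widen ~16x; illustrative only

W_enc = rng.standard_normal((d_model, n_features)) * 0.01
W_dec = rng.standard_normal((n_features, d_model)) * 0.01
b_enc = np.zeros(n_features)

def sae_forward(activations):
    # activations: (batch, d_model) residual-stream vectors harvested from the model.
    features = np.maximum(activations @ W_enc + b_enc, 0.0)   # ReLU -> sparse feature activations
    reconstruction = features @ W_dec
    recon_loss = ((reconstruction - activations) ** 2).mean()
    sparsity_loss = l1_coef * np.abs(features).sum(axis=1).mean()
    return features, recon_loss + sparsity_loss

feats, loss = sae_forward(rng.standard_normal((32, d_model)))
```

The expense question in the bullet is then about training this on enormous activation datasets and labeling the resulting features.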

  • (2: 42: 00) Dwarkesh asks: Why should the features be things we can understand? In Mixtral of Experts they noticed their experts were not distinctive in ways they could understand. They are excited to study this question more but so far don’t know much. It is empirical, and they will know when they look and find out. They claim there is usually clear breakdown of expert types, but that you can also get distinctions that break up what you would naively expect.

  • (2: 45: 00) Try to disentangle all these neurons, audience. Sholto’s challenge to you.

  • (2: 48: 00) Bruno Olshausen theorizes that all the brain regions you do not hear about are doing a ton of computation in superposition. And sure, why not? The human brain sure seems under-parameterized.

  • (2: 49: 25) Superposition is a combinatorial code, not an artifact of one neuron.

  • (2: 51: 20) GPT-7 has been trained. Your interpretability research succeeded. What will you do next? Try to get it to do the work, of course. But no, before that, what do you need to do to be convinced it is safe to deploy? ‘I mean we have our RSP.’ I mean, no you don’t, not yet, not for GPT-7-level models, it says ‘fill this in later’ over there. So Trenton rightfully says we would need a lot more interpretability progress. Right now he would not give the green light, he’d be crying and hoping the tears interfered with GPUs.

  • (2: 53: 00) He says ‘Ideally we can find some compelling deception circuit which lights up when the model knows that it’s not telling the full truth to you.’ Dwarkesh asks about linear probes, Trenton says that does not look good.
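
For reference, a linear probe here means something like fitting a simple linear classifier on stored activations against ‘deceptive vs. not’ labels. A hypothetical sketch with invented data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
activations = rng.standard_normal((200, 512))      # stand-in residual-stream vectors
labels = rng.integers(0, 2, size=200)              # stand-in "was this deceptive?" labels

probe = LogisticRegression(max_iter=1000).fit(activations, labels)
print("train accuracy:", probe.score(activations, labels))
# The worry in the discussion: even when such probes fit, it is unclear they
# track "deception" rather than whatever correlates with it in the training set.
```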

  • I would ask, what makes you think that you have found the only such circuit? If the model had indeed found a way around your interpretability research, would you not expect it to give you a deception circuit to find, in addition to the one you are not supposed to find, because you are optimizing for exactly that which will fool you? Wouldn’t you expect the unsupervised learning to give you what you want to find either way? Fundamentally, this seems like saying ‘oh sure he lies all the time, but when he lies he never looks the person in the eye, so there is nothing to worry about, there is no way he would ever lie while looking you in the eye.’ And you do this with a thing much smarter than you, that knows you will notice this, and expect it to go well. For you, that is.

  • Also I would reiterate all my ‘not everything you should be worried about requires the model to be deceptive in a way that is distinct from its normal behavior, even in the worlds where this distinction is maximally real,’ and also ‘deception is not a distinct thing from what is imbued into almost every communication.’ And that’s without things smarter than us. None of this seems to me to have any hope, on a very fundamental level.

  • (2: 56: 15) Yet Trenton continues to be optimistic such techniques will understand GPT-7. A third of the team is scaling up dictionary learning, a second group is identifying circuits, and a third is working to identify attention heads.

  • (3: 01: 00) A good test would be, we found feature X, we ablated it, and now we can’t elicit X to happen. That does sound a little better?
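
A minimal sketch of what ‘ablate the feature and check you can no longer elicit X’ could look like mechanically, assuming you already have a candidate feature direction (everything below is hypothetical):

```python
import numpy as np

def ablate_direction(activations, direction):
    # Remove the component of each activation vector along the feature direction.
    d = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ d, d)

# Hypothetical usage: run the model with a hook that applies ablate_direction
# at the relevant layer, then check whether behavior X can still be elicited.
acts = np.random.randn(16, 512)
feature_x = np.random.randn(512)
ablated = ablate_direction(acts, feature_x)
print(np.allclose(ablated @ (feature_x / np.linalg.norm(feature_x)), 0))  # True
```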

  • (3: 02: 00) What are the unknown unknowns for superhuman models? The answer is ‘we’ll see,’ our hope is automated interpretability. And I mean, yes, ‘we’ll see’ is in some sense the right way to discuss unknown unknowns, there are far worse answers, but my despair is palpable.

  • (3: 03: 00) Should we worry if alignment succeeds ‘too hard’ and people get fine-grained control over AIs? “That is the whole value lock-in argument in my mind. It’s definitely one of the strongest contributing factors for why I am working on capabilities at the moment. I think the current player set is actually extremely well-intentioned.”

  • (3: 07: 00) “If it works well, it’s probably not being published.” Finally.

    Notes on Dwarkesh Patel’s Podcast with Demis Hassabis

    Demis Hassabis was interviewed twice this past week.

    First, he was interviewed on Hard Fork. Then he had a much more interesting interview with Dwarkesh Patel.

    This post covers my notes from both interviews, mostly the one with Dwarkesh.

    Hard Fork was less fruitful, because they mostly asked what for me are the wrong questions and mostly got answers I presume Demis has given many times. So I only noticed two things, neither of which is ultimately surprising.

    1. They do ask about The Gemini Incident, although only about the particular issue with image generation. Demis gives the generic ‘it should do what the user wants and this was dumb’ answer, which I buy he likely personally believes.

    2. When asked about p(doom) he expresses dismay about the state of discourse and says around 42: 00 that ‘well Geoffrey Hinton and Yann LeCun disagree so that indicates we don’t know, this technology is so transformative that it is unknown. It is nonsense to put a probability on it. What I do know is it is non-zero, that risk, and it is worth debating and researching carefully… we don’t want to wait until the eve of AGI happening.’ He says we want to be prepared even if the risk is relatively small, without saying what would count as small. He also says he hopes in five years to give us a better answer, which is evidence against him having super short timelines.

    I do not think this is the right way to handle probabilities in your own head. I do think it is plausibly a smart way to handle public relations around probabilities, given how people react when you give a particular p(doom).

    I am of course deeply disappointed that Demis does not think he can differentiate between the arguments of Geoffrey Hinton and Yann LeCun, or weigh the accomplishments and thus the implied credibility of the two. He did not get where he is, or win Diplomacy championships, by thinking like that. I also don’t think he was being fully genuine here.

    Otherwise, this seemed like an inessential interview. Demis did well but was not given new challenges to handle.

    Demis Hassabis also talked to Dwarkesh Patel, which is of course self-recommending. Here you want to pay attention, and I paused to think things over and take detailed notes. Five minutes in I had already learned more interesting things than I did from the entire Hard Fork interview.

    Here is the transcript, which is also helpful.

    1. (1: 00) Dwarkesh first asks Demis about the nature of intelligence, whether it is one broad thing or the sum of many small things. Demis says there must be some common themes and underlying mechanisms, although there are also specialized parts. I strongly agree with Demis. I do not think you can understand intelligence, of any form, without some form of the concept of G.

    2. (1: 45) Dwarkesh follows up by asking then why doesn’t lots of data in one domain generalize to other domains? Demis says often it does, such as coding improving reasoning (which also happens in humans), and he expects more such transfer.

    3. (4: 00) Dwarkesh asks what insights neuroscience brings to AI. Demis points to many early AI concepts. Going forward, questions include how brains form world models or memory.

    4. (6: 00) Demis thinks scaffolding via tree search or AlphaZero-style approaches for LLMs is super promising. He notes they’re working hard on search efficiency in many of their approaches so they can search further.

    5. (9: 00) Dwarkesh notes that Go and Chess have clear win conditions, real life does not, asks what to do about this. Demis agrees this is a challenge, but that usually ‘in scientific problems’ there are ways to specify goals. Suspicious dodge?

    6. (10: 00) Dwarkesh notes humans are super sample efficient, Demis says it is because we are not built for Monte Carlo tree search, so we use our intuition to narrow the search (see the selection rule sketched below).
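
For reference, AlphaZero-style search makes ‘intuition narrows the search’ concrete with its PUCT selection rule, where the policy network’s prior steers which branches get explored:

```latex
a^{*} = \arg\max_{a}\left[\, Q(s,a) + c_{\text{puct}}\, P(s,a)\, \frac{\sqrt{\sum_{b} N(s,b)}}{1 + N(s,a)} \,\right]
```

Here Q(s,a) is the estimated value, N the visit counts, and P(s,a) the learned prior playing the role of intuition.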

    7. (12: 00) Demis is optimistic about LLM self-play and synthetic data, but we need to do more work on what makes a good data set – what fills in holes, what fixes potential bias and makes it representative of the distribution you want to learn. Definitely seems underexplored.

    8. (14: 00) Dwarkesh asks what techniques are underrated now. Demis says things go in and out of fashion, that we should bring back old ideas like reinforcement learning and Q-learning and combine them with the new ones. Demis really believes games are The Way, it seems.

    9. (15: 00) Demis thinks AGI could in theory come from full AlphaZero-style approaches and some people are working on that, with no priors, which you can then combine with known data, and he doesn’t see why you wouldn’t combine planning search with outside knowledge.

    10. (16: 45) Demis notes everyone has been surprised how well scaling hypothesis has held up and systems have gotten grounding and learned concepts, and that language and human feedback can contain so much grounding. From Demis: “I think we’ve got to push scaling as hard as we can, and that’s what we’re doing here. And it’s an empirical question whether that will hit an asymptote or a brick wall, and there are different people argue about that. But actually, I think we should just test it. I think no one knows. But in the meantime, we should also double down on innovation and invention.” He’s roughly splitting his efforts in half, scaling versus new ideas. He’s taking the ‘hit a wall’ hypothesis seriously.

    11. (20: 00) Demis says systems need to be grounded (in the physical world and its causes and effects) to achieve their goals and various advances are forms of this grounding, systems will understand physics better, references need for robotics.

    12. (21: 30) Dwarkesh asks about the other half, grounding in human preferences, what it takes to align a system smarter than humans. Demis says that has been at the forefront of Shane’s and his minds since before founding DeepMind; they had to plan for success and ensure systems are understandable and controllable. The part that addresses details:

    Demis Hassabis: And I think there are sort of several, this will be a whole sort of discussion in itself, but there are many, many ideas that people have from much more stringent eval systems. I think we don’t have good enough evaluations and benchmarks for things like, can the system deceive you? Can it exfiltrate its own code, sort of undesirable behaviors?

    And then there are ideas of actually using AI, maybe narrow AIs, so not general learning ones, but systems that are specialized for a domain to help us as the human scientists analyze and summarize what the more general system is doing. Right. So kind of narrow AI tools.

    I think that there’s a lot of promise in creating hardened sandboxes or simulations that are hardened with cybersecurity arrangements around the simulation, both to keep the AI in, but also as cybersecurity to keep hackers out. And then you could experiment a lot more freely within that sandbox domain.

    And I think a lot of these ideas are, and there’s many, many others, including the analysis stuff we talked about earlier, where can we analyze and understand what the concepts are that this system is building, what the representations are, so maybe they’re not so alien to us and we can actually keep track of the kind of knowledge that it’s building.

    It has been over fourteen years of thinking hard about these questions, and this is the best Demis has been able to come up with. They’re not bad ideas. Incrementally they seem helpful. They don’t constitute an answer or full path to victory or central form of a solution. They are more like a grab bag of things one could try incrementally. We are going to need to do better than that.

    1. (24: 00) Dwarkesh asks timelines, notes Shane said median of 2028. Demis sort of dodges and tries to not get pinned down but implies AGI-like systems are on track for 2030 and says he wouldn’t be surprised to get them ‘in the next decade.’

    2. (25: 00) Demis agrees AGI accelerating AI (RSI) is possible, says it depends on what we use the first AGI systems for, warning of the safety implications. The obvious follow-up question is: How would society make a choice to not use the first AGI systems for exactly this? He would need far more understanding even to know what we would need to know to tell whether this feedback loop was imminent.

    3. (26: 30) Demis notes deception is a root node that you very much do not want, ideally you want the AGI to give you post-hoc explanations. I increasingly think people are considering ‘deception’ as distinct from non-deception in a way that does not reflect reality, and it is an expensive and important confusion.

    4. (27: 40): Dwarkesh asks, what observations would it take to make Demis halt training of Gemini 2 because it was too dangerous? Demis answers reasonably but generically, saying we should test in sandboxes for this reason and that such issues might come up in a few years but aren’t of concern now, that the system lying about defying our instructions might be one trigger. And that then you would, ideally, ‘pause and get to the bottom of why it was doing those things’ before continuing. More conditional alarm, more detail, and especially more hard commitment seem needed here.

    5. (28: 50) Logistical barriers are the main reason Gemini didn’t scale bigger; also you need to adjust all your parameters and go incrementally, not more than one order of magnitude at a time. You can predict ‘training loss’ farther out, but that does not tell you about the actual capabilities you care about (see the note below). A surprising thing about Gemini was the relationship between scoring on target metrics versus ultimate practical capabilities.
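
The ‘predict training loss farther out’ point refers to the empirical regularity that pretraining loss tends to follow a smooth power law in compute, roughly of the form below (constants are fit per model family; this is the standard form, not Gemini’s actual numbers), while downstream capabilities do not extrapolate nearly as cleanly:

```latex
L(C) \approx L_{\infty} + \frac{a}{C^{b}}
```

Here C is training compute, L-infinity an irreducible loss floor, and a, b fitted constants.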

    6. (31: 30) Says Gemini 1.0 used about as much compute as ‘has been rumored for’ GPT-4. Google will have the most compute, they hope to make good use of that, and the things that scale best are what matter most.

    7. (35: 30): What should governance for these systems look like? Demis says we all need to be involved in those decisions and reach consensus on what would be good for all, and this is why he emphasizes things that benefit everyone like AI for science. Easy to say, but needs specifics and actual plans.

    8. (37: 30): Dwarkesh asks the good question, why haven’t LLMs automated things more than they have? Demis says for general use cases the capabilities are not there yet for things such as planning, search and long term memory for prior conversations. He mentions future recommendation systems, a pet cause of mine. I think he is underestimating that the future simply is not evenly distributed yet.

    9. (40: 42) Demis says they are working on having a safety framework like those of OpenAI and Anthropic. Right now he says they have them implicitly on safety councils and so on that people like Shane chair, but they are going to be publicly talking about it this year. Excellent.

    10. (41: 30): Dwarkesh asks about model weights security, Demis connects to open model weights right away. Demis says Google has very strong world-class protections already and DeepMind doubles down on that, says all frontier labs should take such precautions. Access is a tricky issue. For open weights, he’s all for it for things like AlphaFold or AlphaGo that can’t be misused (and those are indeed open sourced now) but his question is, for frontier models, how do we stop bad actors at all scales from misusing them if we share the weights? He doesn’t know the answer and hasn’t heard a clear one anywhere.

    11. (46: 00) Asked what safety research will be DeepMind’s specialty, Demis first mentions them pioneering RLHF, which I would say has not been going well recently and definitely won’t scale. He then mentions self-play especially for boundary testing, we need automated testing, goes back to games. Not nothing, but seems like he should be able to do better.

    12. (47: 00) Demis is excited by multimodal use cases for LLMs like Gemini, and also excited on the progress in robotics, they like that it is a data-poor regime because it forces them to do good research. Multimodality starts out harder, then makes things easier once things get going. He expects places where self-play works to see better progress than other domains, as you would expect.

    13. (52: 00) Why build science AIs rather than wait for AGI? We can bring benefits to the world before AGI, and we don’t know how long AGI will take to arrive. Also real-world problems keep you honest, give you real world feedback.

    14. (54: 30) Standard ‘things are going great’ for the merger with Google Brain, calls Gemini the first fruit of the collaboration, strongly implies the ‘twins’ that inspired the name Gemini are Google Brain and DeepMind.

    15. (57: 20) Demis affirms ‘responsible scaling policies are something that is a very good empirical way to precommit to these kinds of things.’

    16. (58: 00) Demis says if a model helped enable a bioweapon or something similar, they’d need to ‘fix that loophole,’ the important thing is to detect it in advance. I always worry about such talk, because of its emphasis on addressing specific failure modes that you foresee, rather than thinking about failures in general.

    While interesting throughout, nothing here was inconsistent with what we know about Demis Hassabis or DeepMind. Demis, Shane and DeepMind are clearly very aware of the problems that lie ahead of them, are motivated to solve them, and unfortunately are still unable to express detailed plans that seem hopeful for actually doing that. Demis seemed much more aware of this confusion than Shane did, which is hopeful. Games are still central to what Demis thinks about and plans for AI.

    The best concrete news is that DeepMind will be issuing its own safety framework in the coming months.
