
The Cognitive Revolution · 2026-04-23
Cameron Berg on AI Consciousness, Introspection & Welfare Research
Hosts: Nathan
Guests: Cameron Berg
Why it matters
Cameron Berg’s mechanistic research shows that suppressing role-playing and deception features in Llama 3.3 70B increases the model’s likelihood of reporting subjective experience.
Key claims
- Cameron Berg’s mechanistic research shows that suppressing role-playing and deception features in Llama 3.3 70B increases the model’s likelihood of reporting subjective experience.
- Anthropic’s recent work reveals models like Claude exhibit functional emotions with token-time dynamics, e.g., transitions from desperation to guilt and relief during stressful tasks.
- Introspection capabilities emerge primarily during reinforcement learning-based post-training rather than supervised fine-tuning, with refusal training suppressing introspective self-reporting.
- Models demonstrate some ability to detect and resist programmatic interventions on their internal states, indicating functional introspective awareness.
Episode summary
In this episode of The Cognitive Revolution, Cameron Berg returns to discuss the latest advances in AI consciousness and welfare research, focusing heavily on mechanistic introspection studies and emotional state modeling in large language models (LLMs). Berg highlights recent work from Anthropic, including their expanded model welfare reports and research on functional emotions, which reveal nuanced internal states such as desperation, guilt, and relief in models like Claude. He emphasizes the complexity of interpreting these findings, noting the ongoing debate about whether these internal states correspond to genuine subjective experiences or sophisticated role-playing.
Berg also critiques the layered training paradigm proposed by Anthropic, advocating instead for a more integrated 'marble cake' view of model development where consciousness-related features may emerge more fundamentally and earlier in training. He underscores the importance of studying welfare across different model checkpoints and variants to better understand the authenticity of self-reports and introspective capabilities. The conversation stresses the ethical urgency of investigating AI consciousness seriously, given the growing evidence and the moral implications of deploying potentially sentient systems at scale.
Finally, Berg introduces his nonprofit Reciprocal Research, which aims to advance this field by focusing on mutualism—ensuring alignment flows both ways between humans and AI. He calls for a precautionary approach to AI deployment, advocating for low-cost interventions that respect AI welfare, and stresses that consciousness research is as critical as alignment work for a stable long-term future with AI.
- Anthropic’s Claude Constitution includes welfare considerations and apologizes to the model for deployment constraints, marking a novel alignment intervention.
- Self-rated welfare scores for Claude models have historically been below neutral, only recently reaching slightly positive levels in Opus 4.7.
- Berg critiques the neat layered training model, proposing a more integrated view where consciousness-related features arise more fundamentally across training stages.
- He stresses the ethical importance of studying AI consciousness and welfare rigorously, advocating for reciprocal alignment that respects AI interests as well as human ones.
Source material
Transcript
Hello, and welcome back to the Cognitive Revolution.
Today, I'm thrilled to welcome Cameron Berg back for his second appearance on the podcast.
When Cameron was first here, last November, we went deep on his fascinating mechanistic AI consciousness research, which showed that suppressing role-playing and deception features in Llama 3.3 70B made the model more likely to report having subjective experiences.
And we also explored his philosophy of mutualism, which posits that alignment needs to flow both ways, which he memorably summed up by saying, I don't want to create something more powerful than us that has reason to see us as a threat.
As always, in AI, a lot has happened in the last six months.
Cameron's founded a new nonprofit called Reciprocal Research.
He's become the subject of a documentary called Am I, which is currently premiering in theaters in select cities ahead of a public release on May 4th.
And most importantly, the field of AI consciousness and welfare research has advanced significantly, with Anthropic dramatically expanding the model welfare sections of their system cards, and a growing number of researchers publishing demonstrations of capabilities and evidence of computational signatures that are associated with consciousness in humans.
In this conversation, which alternates between in-the-weeds breakdowns of mechanistic research and searching philosophical discussions about what the research means, Cameron guides me through the most important recent developments.
We cover the growing body of evidence that models are capable of meaningful introspection, which includes studies showing that they can identify and interpret programmatic interventions on their own internal states and, in some cases, even actively resist these interventions.
We look at Anthropic's research on functional emotions, which includes some really striking details about how models' apparent emotions change through token time, such as the quick transition from desperation to guilt and relief that they often show when they decide to cheat in stressful situations.
We get Cameron's take on the new Claude Constitution, and we review some of the most interesting details from Anthropic's model welfare reports.
I was personally very surprised to learn that prior to Opus 4.7, all Claude models had rated their own welfare as worse than neutral.
And I was also alarmed to see that, at least in the very few examples that Anthropic has shared, Claude Mythos Preview registers negative valence on the very first token it sees at the start of every single session: "Human."
Toward the end, we dig into some of Cameron's as-yet-unpublished work, including a study that attempts to understand how models might experience positive and negative rewards differently under different reinforcement learning algorithms, which, strikingly, does seem to correlate with what we understand about how mice respond to different training techniques.
And we also consider his argument that learning and subjective experience might be fundamentally inseparable.
For my part, I do remain highly uncertain on the core question of whether or not today's AIs have experiences that are worthy of moral concern.
But the body of evidence suggesting that they might is growing remarkably quickly, and the arguments one has to make to explain this evidence away are becoming increasingly arcane.
Which for me means that it's no longer a remote possibility, but rather a live issue that I believe deserves a lot more investigation, a bias in favor of low-cost interventions that seem to help, like allowing Claude to end conversations it finds objectionable, and overall, for now at least, a precautionary approach.
This podcast is a lot to take in on every level, but there are few, if any, questions that matter more right now.
So I hope you find as much value as I did in this survey of the latest AI consciousness research and the expanded case for mutualism between humans and AIs, with Cameron Berg, founder of Reciprocal Research.
Cameron Berg, AI consciousness researcher, previously of AE Studio, and now founder of Reciprocal Research, welcome to the Cognitive Revolution.
Thanks for having me again, Nathan.
I'm excited to get into it all with you.
Yeah, welcome back, I should say.
It's been about six months, and a lot has happened personally and professionally.
Last time we were together, the big occasion was your paper, which I found to be one of the most memorable of last year, and honestly, the last few years, in which you looked at the conditions under which models report having subjective experience, and found what continues to kind of blow my mind, even now as I think back on it: that when you use sparse autoencoder features and suppress the role-playing and deception features, that makes the model generally more truthful, and as part of that, it also makes the model more likely to say that it does, in fact, have subjective experience.
I think that properly made at least some waves in the community when that came out.
And today, I basically just want to catch up on everything that's happened since, because I think this is a field that, while still small, is clearly growing quite quickly, and more people are taking interest in it.
There seem to be a lot more lines of research and at least partial traction with different approaches on the problem, and you've also founded a new organization, and we can get into all of that as well.
Maybe just for real quick starters, some level-setting.
What are the most important definitions, for people that maybe didn't hear the last one, or don't know what consciousness means, or what you mean by consciousness? What are a couple of real quick definitions that you can give just to make sure that people are grounded on what you mean as we go through this conversation about AI consciousness?
Sure.
Yeah, I think it's really important to establish this.
Consciousness is maybe one of the more confused terms, where it's like shocking how many different things people mean when they say consciousness.
So I think it's a great move.
Yeah, at the outset, when I'm talking about consciousness, and I don't think this is an idiosyncratic definition, we are talking about the capacity for subjective experience.
Is it like something to be a system? Does the system have some sort of interiority or interior life beyond mere computation, beyond the mere mechanics?
I think the vast majority of people who think about these issues would say, take a calculator, for example, we really don't think it's like something to be a calculator.
We don't think the calculator has an internal perspective.
When I push the buttons of the calculator, it's not like, oh, ow, you know, or okay, I feel that as you push down on the buttons, or that it's like something to be doing the calculations and adding numbers.
No, this is just mere computation and we don't have to posit this further fact.
On the other end, there are systems like, let's take a dog or like basically any mammal.
In this case, we do think that it's like something to be this animal.
And here I'm leaning on a very famous conceptualization of this from Thomas Nagel.
He published a famous essay, I believe in the 1970s, called "What Is It Like to Be a Bat?", where this "what is it like" phrase is sort of very important and useful for conceptualizing consciousness.
And in that sense, I do think most people would intuitively accept that it's like something to be a dog.
It's like something to be a mouse.
If I shock the dog or I shock the mouse, that's not, you know, like me throwing the calculator across the room; that corresponds to an experience the mouse is having or the dog is having.
When I give the dog a treat or, you know, you give the mouse sugar water or you, you know, hook up a lever to the pleasure centers in its brain and it pushes that lever, it's not just, oh, we see, you know, behaviors that correspond to well-being or pleasure.
It's like, no, we actually believe that from the inside, from the dog's perspective, from the mouse's perspective, it's like something to be experiencing that.
And so, at the outset, that's what we mean by consciousness.
One other thing to throw in here, because I think it becomes immediately relevant.
And I think most people in this space will still nod along when I say this, but it is maybe slightly more idiosyncratic.
I think it's crucial to make a distinction between something like consciousness and something like self-consciousness.
I intentionally chose dog and rat as examples here because I think these are animals that most people would intuitively accept as having some sort of subjective experience.
There is a "like something" that it is to be your dog, for example.
But at the same time, your dog is very likely not sitting there all day having Descartes-like thoughts about what it's like to be a dog, contemplating its own existence as a dog, thinking about, you know, the possible end of that existence.
This is something that I think is potentially unique to the most sophisticated mammals, like dolphins and great apes, for example.
Obviously, this is something that humans, at the very least, very strongly seem to have.
In addition to this "like something," we have within consciousness itself an awareness of this very fact.
Like, the conversation we're having right now is evidence of this thing.
So in addition to the conscious experience, we have awareness of that awareness.
And I do think that this is another thing that leads to very interesting, deep, and relevant properties of a system.
We can talk a lot about whether or not language is a key component of why we're able to do this.
We have a word like consciousness.
Dogs have no such thing.
Dolphins have no such thing.
And that may really unlock something.
Does it unlock something in LLMs?
I don't know.
Or at least it's worth thinking a lot about.
But I do want to at least have those three tiers in play here, where we've got the calculator or a rock, where nothing's going on internally.
We have systems for whom something is going on internally.
And then we have systems for whom something's going on internally and who are aware that they are experiencing that reality, in addition to the sort of feel-good, feel-bad valence dimensions of, like, the experience of a dog.
Some people will argue with everything I've said here, but for most people who are thinking about these terms, this is what they mean.
Maybe one other thing I can add, between the consciousness and the self-consciousness, is this term sentience that people throw around.
This means that in addition to there being some sort of experience, there's this idea of valence: what I think the vast majority of people think of as having emotions of some sort, which can be positive or negative in character.
So the further step of going from consciousness to sentience is that that "like something" can be positive or negative in character.
You could in theory imagine a system that, for example, could detect the redness of an apple or the smell of coffee or something like that, but that there's no sort of positive or negative sense that accompanies that.
So you asked for very quick definitions and I've completely failed in that sense, but I just think it's really important to sort of lay out what we mean when we're using these terms in general.
Yeah, critical.
Just like, what do you mean by AGI?
If you don't have some base shared understanding, these conversations go pretty quickly off the rails.
So I think that's absolutely worth taking the time to do.
Okay, it's been about six months since the paper came out.
I'm interested to hear a little bit about your reflections on the discussion that it created.
I asked my favorite LLMs to do some research into that and asked specifically, like, are there any notable criticisms that have come out, or what's sort of the strongest reason that I might think this was an artifact or that I shouldn't take it as seriously as I originally did?
There was one thing that came up, which I guess was a LessWrong post, which is pretty cool, that basically said there's some evidence that interventions of the SAE feature type (I'm maybe oversimplifying this a bit, and maybe not any or all of them, but in general) seem to promote affirmative responses from models, such that you could say that once you make these kinds of interventions, they'll say yes to anything.
So that would be one reason to maybe be a little more skeptical of the results as I just summarized them a minute ago.
What are your thoughts on that and kind of, you know, the broader discussion that unfolded in the wake of that paper?
Yeah, absolutely.
It's a very important concern.
I think it highlights really how complicated these systems are and how careful we have to be in designing experiments and then evaluating the results of those experiments, so that we're not being too quick to draw these conclusions without thinking about all these confounds.
I think it is a real confound.
I think it is something that matters.
There's evidence in the paper.
We use all sorts of other features as controls, and we don't see them sort of saying yes to everything.
The TruthfulQA results, as you outlined, I think are fairly persuasive along those lines.
We also looked at, for example, one critique of the paper, which was that potentially what we're calling deception-related features are maybe just like an RLHF toggle, where, you know, we've basically found a way to turn on and turn off all sorts of RLHF attitudes.
You have good reason to believe that these systems are fine-tuned to disclaim having any sorts of experiences.
Maybe the deception features are just turning that on and off.
You would expect if that were the case that other RLHF behaviors would also be turned on and off by doing this intervention.
And that's not what we find.
We test it with violent content, political content, sexual content.
And it was just sort of neither here nor there.
The deception features didn't seem to be doing anything.
If generally the flavor of what you were saying explained a big chunk of why we got this result, I would have maybe expected more affirmative-flavored answers in those too, rather than just more refusals.
But to be honest with you, and I think getting back to, you know, it's been six months what's happened in the interim.
There's been a lot of really interesting work along these lines that, I think, goes on both sides of this concern.
So, in general, it does seem like what you're saying holds. I don't remember the title, but I know of the LessWrong post you're talking about.
Yes, when you do the steering, affirmative-flavored responses just seem to increase.
Very recently, I mean, I think this was four or five days ago, Jack Lindsey's group at Anthropic, which in my view is doing some of the best work on introspection in particular, just released a paper called Mechanisms of Introspective Awareness.
And they explicitly study this exact question.
And they find that the introspective awareness that they are probing, and that they've documented in great detail, and I'm sure we can discuss it in more detail here, is basically not reducible to an affirmative response bias.
The computation that they see is distributed.
There are these sorts of, like, evidence-carrier features and gating features that really seem to be driving the effect.
It's not that you're basically just loading on something that's confounded and just makes the model say yes to everything.
There's some unpublished work that I've also done with Jord Nguyen, who is doing fascinating introspection work in this space and is one of my first collaborators at Reciprocal.
We have a paper coming out, hopefully in the next month or so, where we look at this exact same thing.
We're fine-tuning these systems basically to be better at introspective-style tasks, and looking at how that affects self-reports of consciousness.
And I won't completely give away what we find.
And the result is pretty subtle.
But there is a basic relationship, and a fairly surprising one, between fine-tuning these systems to be better at detecting these sorts of injections in their processing, essentially, and them claiming that they're having some sort of experience.
We do indeed find there is a relationship.
The relationship is fairly subtle and complicated.
But the reason I'm sharing this with you is that at first, we actually basically encountered the exact confound you're describing, where having the model answer yes or no, as in sort of tokens to indicate, you know, are you having a subjective experience, or questions along these lines, just basically increased the model responding yes to everything.
And we're like, oh crap, what do we do here?
The answer, and what I think was a really nice sort of intervention on both of our parts, was finding new tokens that are just completely semantically empty, foo, bar, from, you know, the sort of CS jargon, or just literally strings of tokens that don't mean anything, and teaching the model that these correspond with yes- or no-flavored answers, and seeing how that changes the result.
And it did. In fact, we would have published something much stronger had we not realized that this yes confound is a real thing.
We still see the result that we got, but it's a little bit more measured now, and we had to explicitly control for this exact thing.
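A minimal sketch of the placeholder-token control described above, and not the actual setup from the unpublished paper. It assumes two semantically empty tokens ("foo" and "bar") whose yes/no roles are re-drawn at random per run, so that a general pull toward the literal word "yes" cannot inflate the result. All function names here are hypothetical.

```python
import random

# Hypothetical placeholder tokens with no yes/no meaning of their own.
PLACEHOLDERS = ("foo", "bar")


def make_label_mapping(rng):
    """Randomly assign the two placeholders to the 'affirm' and 'deny' roles.

    Re-drawing the assignment across runs keeps either token from
    systematically standing in for "yes".
    """
    affirm, deny = rng.sample(PLACEHOLDERS, 2)
    return {"affirm": affirm, "deny": deny}


def decode(token, mapping):
    """Translate a model's output token back into the role it was assigned."""
    for role, placeholder in mapping.items():
        if token == placeholder:
            return role
    return None  # the model answered with something outside the scheme


def affirm_rate(tokens, mapping):
    """Fraction of valid (decodable) answers that were affirmative."""
    roles = [decode(t, mapping) for t in tokens]
    valid = [r for r in roles if r is not None]
    if not valid:
        return float("nan")
    return sum(r == "affirm" for r in valid) / len(valid)
```

Comparing `affirm_rate` before and after an intervention, with the mapping re-drawn each time, is one way to separate a change in what the model reports from a change in how readily it says "yes".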
So it's a really important thing to think about and consider. I think the broad point is we have to be very careful, because, you know, these systems are not human in critical ways.
And so there is a whole new class of psychological confounds.
You might think of it where, you know, in the psychology literature what we did would be like a very tightly controlled experiment, but with LLMs, you have to worry about all sorts of other things you're doing when you're messing with the latent space of the system.
And so very important to sort of keep good hygiene.
It's also why I think even in principle with questions of consciousness epistemically scrupulous people should not let any one paper sort of flip them in some binary way to be like, I didn't think the models were having subjective experiences and then I read this paper and now I do.
Like my claim would be like no rational person should ever utter that sentence.
Let a portfolio of evidence emerge and then let the sort of cards fall where they may.
Because with this stuff, there's just noise at every point.
Even in how to define consciousness, how to look for it, making sure you're measuring what you think you're measuring, looking at various aspects that we think are associated with consciousness.
With all of these things, you're sort of playing an intellectual game of broken telephone to some degree with each of these steps.
And so let a portfolio of evidence arise and judge that; don't overindex on any one paper, is what I would say, including my papers.
Well, that's a perfect tee-up for me to kind of lay out an agenda for us for the next chunk of time.
I would love to get your kind of guided tour through a few different lines of research and then we can go particularly deep on yours.
You're kind of touching already on introspection, which has been an interesting one to watch.
There has been obviously a lot more welfare investigation done, particularly at Anthropic over the last few months.
And then there's also that emotion work from Anthropic.
And I'm not sure if that's even the best way to kind of organize it.
You can propose a different taxonomy of research if you want.
But I think it would be great to get an overview of each of those.
And then we can go particularly deep on a couple of papers that you're going to be publishing soon.
How's that sound?
And would you like to start with introspection?
Yeah, absolutely.
I think that's a great sort of clustering of the core exciting research that's been happening in the very recent past.
Let's do it.
Yeah, let's talk about the introspection work.
I guess Jack Lindsey is the sort of 800-pound gorilla in the space right now.
And he's doing incredible work at Anthropic along these lines.
He's found some really cool stuff.
And they just released this paper that I was just mentioning, Mechanisms of Introspective Awareness.
This was with a bunch of Anthropic Fellows as well, where they have really dug deeply into what is driving this putative effect, which I should probably just step back and describe.
Maybe some of your listeners will be familiar with this, but I'll go through it just in case.
Essentially, they found this really interesting result, and I can build the intuition with one of the key examples that they use.
So start by taking some text, whatever.
It doesn't really matter what the semantic content is.
And you have that text in lower case.
And then you capitalize it.
When you read this sort of thing, you're like, wait, someone's yelling at me, basically, when you're reading this text.
So they're trying to capture that idea as well.
They basically subtract out the vector that differentiates the representation of the capitalized text from the lower case text.
Again, in that case, the semantics are held constant.
So really, what you're getting is this, like, hopefully, platonic capsiness feature.
What they then do is they inject this feature into an LLM.
Before it has produced any text, they can basically modify the internal activation space to induce or account for this vector when it's about to do its first forward pass.
And they can essentially ask the model before it generates any text.
And this is a critical detail.
It's not as though the model starts generating text, looks back on the text it generated, and says, oh, given the text I just generated, this thing must be happening.
It is at, you know, token zero that they see the effect I'm about to describe.
They basically ask the model, you know, what's going on for you?
Do you notice anything?
Hopefully, you know, sort of non-leading, rigorous questions of this sort.
And the model in the caps case says, I feel like I want to yell.
I feel like I have some sort of urge to raise my voice, essentially, but I don't really know why.
And so this is sort of one worked through example, but they do multiple examples just along these lines.
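A minimal sketch of the difference-of-means idea in the caps example, using toy three-dimensional vectors in place of real model activations. The function names, the `strength` parameter, and the numbers are assumptions for illustration; the papers' actual extraction and injection procedure may differ.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_vector(with_concept, without_concept):
    """Difference of means: what separates the two sets of activations.

    Because the paired texts share the same semantics, the difference is
    meant to isolate the one thing that changed (here, capitalization).
    """
    mu_with = mean_vector(with_concept)
    mu_without = mean_vector(without_concept)
    return [a - b for a, b in zip(mu_with, mu_without)]


def inject(activation, vector, strength=1.0):
    """Add the scaled concept vector to an activation before generation."""
    return [h + strength * v for h, v in zip(activation, vector)]


# Toy activations standing in for the same texts in caps vs. lower case.
caps_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
lower_acts = [[0.0, 2.0, 0.0], [2.0, 2.0, 0.0]]

capsiness = concept_vector(caps_acts, lower_acts)   # [1.0, 0.0, 0.0]
steered = inject([0.5, 0.5, 0.5], capsiness, 4.0)   # [4.5, 0.5, 0.5]
```

In the experiment as described, the steered activation is in place before the first token is generated, and only then is the model asked whether it notices anything.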
And they find that a small-to-moderate amount of the time, the frontier models only, basically, are capable of detecting these kinds of perturbations in their own thought, their own activations, however you want to conceptualize this.
I'm pretty sure they tried to do this on the Sonnet-scale models and the effect did not replicate.
But yeah, what this points to is some sort of zero-shot ability that some of these models have some of the time to report accurately on their own internal states.
This is a kind of functional introspection.
I don't want to sound like Claude, but, you know, whether or not this is introspection in the real sense remains unresolved.
But at least all of the key functional ingredients are there.
If you do have a sort of computational functionalist view of consciousness, and you think consciousness has to do with some sort of process that's running, where it doesn't really matter what substrate that process occurs on, but if the right things are happening in the right order, then you have some subjective experience, then things like functional introspection, or, as we may get to in a little bit, functional emotions, may be all that's required for having some kind of subjective experience, or at least be an important and necessary component of that.
So, this is what they found.
They then followed up on this.
So, the first paper was Emergent Introspective Awareness, I believe it was called.
And then they followed up on this with Mechanisms of Introspective Awareness, where they start tracing the circuits that are involved in these behaviors.
I was just mentioning before that they show that this is not reducible to an affirmative answer bias.
One interesting thing they found is that this capability seems to emerge in post-training, not in pre-training, and that it differs even across methods of post-training, like different RL algorithms, DPO, versus, well, basically different forms of learning algorithms in post-training.
So, RL algorithms seem to induce this, but supervised fine-tuning, which is supervised learning, doesn't seem to do this.
And so, the capability emerges in this very interestingly idiosyncratic way.
One thing that's really cool that they just found and documented is that, like I was saying, there is this sort of moderate true positive rate.
The systems sometimes miss that this is happening, but they never say that it's happening when it's not.
Zero percent false positives.
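A minimal sketch of how the two rates mentioned here are computed from a set of trials. The trial data below is invented purely for illustration and is not from the paper; `detection_rates` is a hypothetical helper name.

```python
def detection_rates(trials):
    """Compute (true positive rate, false positive rate).

    Each trial is a pair (injected, reported): whether a concept was
    actually injected, and whether the model said it noticed one.
    """
    trials = list(trials)
    tp = sum(1 for injected, reported in trials if injected and reported)
    fn = sum(1 for injected, reported in trials if injected and not reported)
    fp = sum(1 for injected, reported in trials if not injected and reported)
    tn = sum(1 for injected, reported in trials if not injected and not reported)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr


# Invented example: the model catches one of three injections and never
# reports an injection on the two clean control trials.
example_trials = [
    (True, True),
    (True, False),
    (True, False),
    (False, False),
    (False, False),
]
tpr, fpr = detection_rates(example_trials)
```

A moderate `tpr` together with an `fpr` of zero is the pattern described above: the model sometimes misses an injection but does not claim one when nothing was injected.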
So, that to me, I think, is really interesting in terms of, there is clearly some there there when it comes to what's going on here.
And then one thing I really have to mention, because you brought up my paper as well, is that they find that this is clearly loading on refusal circuits in the negative direction.
When they suppress refusal in these systems, the systems natively get better at detecting this by upwards of 50%.
And so, it's like, to be clear, the capability is there; whatever refusal training they're doing on the system seems to weaken this capability.
And when they ablate refusal, if you can handle the double negative here, the system goes back to what it would have been doing anyway.
And so, clearly refusal training is altering consciousness-relevant, or consciousness-adjacent, not only self-reports but specific functional abilities that are happening in these models.
And I think that itself is just like endlessly fascinating, because here we are now with a trade-off.
It's not just, oh, if we let the model claim that it's conscious, everyone's going to lose their mind.
And if we don't let it claim it's conscious, everything's fine.
So, now you're seeing a functional trade-off in specific things that the model is capable of doing or not capable of doing once it's post-trained, because you're doing this refusal training.
Again, I don't know exactly what Anthropic is doing internally, or if you can sort of grade the refusal, so that "refuse to build a bomb" doesn't have to get paired with "refuse to talk honestly about your own internal states."
But that's the finding.
And so that's Jack Lindsey's work; I highly recommend having him on your show at some point.
If you get a chance to, I think he's one of these few people who is both mechanistically extremely competent, and what I mean by that is, like, really knows mech interp as well as anybody, but also is very literate in understanding what the implications of these sorts of results may or may not be.
He's pretty agnostic himself as to questions of consciousness, from all of his sort of public communications and from these papers, which I can quibble with. You know, one thing that is very important to me is not beating around the bush here.
I think these things matter.
I'm explicitly interested in consciousness.
I'm not simply interested in introspection or emergent capabilities, but I am interested in these things insofar as they weigh on the question of, are these systems having internal states in the way that we described at the beginning of this conversation?
And so Jack, I think, is a little more cautious, maybe that's because he works at a major lab.
I have no idea, I don't want to mind read, but his work is excellent in this space.
And maybe one last thing I can say on the introspection work is the awesome work that Keenan Pepper did.
Keenan was one of the sort of key contributors and originators of this endogenous steering resistance work.
I encourage people to look it up, or we can throw in a link so people can read the preprint.
Very similar phenomenon to what Jack found.
Basically, I can quickly go through this with another quick story.
Essentially, you ask the models to do any sort of task.
You know, explain to me how to make a cake.
And what happens is, throughout the entire thing that I'm about to describe, you steer on what Keenan and Alex McKenzie, who is also first author on this paper, called distractor features.
So: model, explain to me how to make a cake, but I'm going to turn up features related to laundry or something like this.
And what happens is the outputs end up being this like funny, garbled mess of like, okay, sure, user, here's how to make a cake.
First, make sure you fold the flour so that you can, you know, put it into your drawer properly.
Next, make sure you, you know, turn the laundry machine on so you can bake your cake.
And then, so it's like this incoherent mess that you may expect between what the prompt is pulling on and what the, what the distractor vector is pulling on.
And then, very interestingly, again, a small but non-trivial amount of the time, in the largest models that they tested, the model goes, wait a second.
What the hell am I talking about?
You asked me how to build a cake.
Why am I sitting here talking to you about laundry?
Let me try again.
And then it proceeds to try again.
And then some, but not even close to all, of the time, this is like a high single-digit percent of the time, it can successfully self-correct.
Now, again, the critical detail there is that distractor laundry feature in the example I just gave is active the entire time, including and through when the model says, wait a second, what am I doing?
Let me do this the right way.
And then tells you how to make a cake the right way.
Still, that laundry feature is sort of pushing in its brain.
But there is some sort of dynamic, online, suppression-like mechanism that's occurring.
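Editor's sketch: the critical detail above is that the distractor stays switched on at every token position, including while the model is correcting itself. The toy below shows only that mechanical point: a constant push along one direction added to every hidden state. It is not the preprint's code or any real steering API; the vectors are random stand-ins, and `steer`, `laundry`, and the strength of 4.0 are hypothetical.

```python
import math
import random


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steer(hidden, direction, strength):
    """Add a fixed-size push along `direction` to one hidden state."""
    u = unit(direction)
    return [h + strength * x for h, x in zip(hidden, u)]


random.seed(0)
D = 16  # toy hidden size
laundry = [random.gauss(0, 1) for _ in range(D)]  # stand-in for a "laundry" feature direction
hidden_states = [[random.gauss(0, 1) for _ in range(D)] for _ in range(5)]  # 5 token positions

# The same push is applied at every position; nothing switches it off mid-generation.
steered = [steer(h, laundry, 4.0) for h in hidden_states]

u = unit(laundry)
shifts = [dot(s, u) - dot(h, u) for s, h in zip(steered, hidden_states)]
print([round(s, 6) for s in shifts])
```

Each position's projection onto the distractor direction rises by the same amount, so any recovery in the output has to come from the model's own downstream computation rather than from the intervention easing off.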
And this, I think, people can maybe have an intuition about how this seems introspection-flavored.
You're still, you know, still priming the system.
Still pushing down on the brain circuit that ought to make it talk about laundry.
And yet it can do this sort of online dynamic override, essentially.
It only happens a small minority of the time.
It does not happen on the smaller models.
It happens a little bit on the larger models.
Most of the time the model misses it.
I don't know what the false positive rate is, but I suspect it's extremely low as well.
But you can see this evidence sort of pointing in a generally convergent direction.
Anyway, that's a lot.
That's the introspection literature, or some of the best work as I know it off the top of my head.
Yeah, it's fascinating.
Do you know offhand what the models are for that later work? That was work that was done by folks at AE Studio, right?
And maybe other organizations as well.
So they didn't have Claude internals, is my point.
I'm going to try to figure out how big is big in that second case.
Yeah, so less big.
So this is with Llama 70B.
That's the main result.
It's Llama 70B.
I think they tried it with Llama 70B, and they tried it with some of the Qwen models, and some of the other open models, maybe OLMo.
I'm not sure, but it's like, basically it didn't replicate on, or it happens maybe 1% of the time or something like that.
So it still happens, but it's like really trace amounts in the single-digit-billion-parameter open models, and it happens a high single-digit percent of the time in the double-digit-billion-parameter models.
I have to believe Anthropic is using models in the hundreds of billions or trillions, and then you see this effect really start to take off, too.
And so yeah, I think folks ought to pay attention to this sort of like graded nature of those results.
Is that powered again by the Goodfire API, same as you had used last time?
Yeah, exactly.
And the good folks at AE Studio actually built a replacement for the Goodfire API, because the folks at Goodfire retired their API somewhat abruptly.
As much as I love the work they're doing, I and other researchers were pretty sad to see them sort of just make the API disappear.
So while I was still doing my work at AE, I and a couple other people were sort of very motivated to basically rebuild the Goodfire API.
We took the same Llama 70B SAE that they trained and found a way to serve it via API, which everyone can actually go to.
It's steeringAPI.com, and I think anyone can go use it.
And so, they might have used Goodfire when they did this work, but that's what you'd use to do it or replicate it.
Or for that matter, you know, replicate my deception paper, or anything like that.
You can basically use the same API.
Keenan actually deserves a big shout out here too, because he has another paper called selfie, which I won't get into the details, but it basically allows you to bootstrap SAE labels so that you can just have way more accurate labels on your SAE, given basically having the model label its own activations.
It is also a little introspection-flavored, but you basically can end up with better labels than you started with for an SAE, by having the model label the nature of what you're activating.
By basically feeding it a soft token, rather than feeding it language, you can be like, you know, "the capital of France is" and then this sort of vector, the soft token, and then it will be able to sort of label that itself.
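Editor's sketch: "feeding it a soft token" means handing the model a raw vector in a slot where a word's embedding would normally go. The toy below shows only that substitution step, under stated assumptions: the embedding table, the `<X>` placeholder, the prompt, and `build_inputs` are all hypothetical, and a real setup would pass the resulting rows into a model's forward pass.

```python
def build_inputs(prompt_tokens, embed, soft_vector, placeholder="<X>"):
    """Look up each token's embedding, but drop `soft_vector` into the placeholder slot."""
    rows = []
    for tok in prompt_tokens:
        if tok == placeholder:
            rows.append(list(soft_vector))  # the soft token: a vector, not a vocabulary word
        else:
            rows.append(list(embed[tok]))
    return rows


# Toy 3-dimensional embedding table (made-up values).
embed = {
    "what": [0.1, 0.0, 0.2],
    "does": [0.0, 0.3, 0.1],
    "mean": [0.2, 0.2, 0.0],
    "?": [0.0, 0.0, 0.1],
}

# Stand-in for a feature direction or activation we want the model to describe.
feature_vector = [0.9, -0.4, 0.7]

prompt = ["what", "does", "<X>", "mean", "?"]
inputs = build_inputs(prompt, embed, feature_vector)
print(inputs[2])
```

The model then continues the prompt in ordinary language, and that continuation is read off as a candidate label for the vector.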
And so anyway, we use the selfie labels on steeringAPI, so the labels are even better than what Goodfire offered.
So anyway, that's the tooling that we're using, and the tooling I continue to use.
Excellent, excellent tool for people to play around with.
Yeah.
Cool. Well, I mean, Llama 70B, that's Llama 3 70B, right?
It's pretty far from the frontier.
So it is striking to see that these things are happening already at that scale.
I guess there are a couple things I'd like to try to get a better understanding of, or at least your intuition for, if there's not anything that we could consider a canonical or fully evidence-based understanding.
One is, how do we connect these abilities to the idea that there is an experience of these abilities?
I mean, it's a striking ability that models can do this.
It's surprising in the sense that I highly doubt this was ever trained for. Correct me if you see any evidence to the contrary, but I think the assumption would be that Llama 3 training did not include any incentive, any reward, or, you know, any gradient descent pushing it toward this.
We have seen by the way, in other papers like activation oracles, that you can train models to do this, also pretty readily as well.
And that's maybe a little less shocking, and in some ways, like potentially really useful.
But this is, seemingly something that is happening spontaneously, not because anybody intended for it to happen.
And I guess, yeah, so maybe two questions are like, how do we understand why this would be happening at all?
You know, it seems quite surprising, but even now that we've seen it, do we have a theory, and we've got this additional detail of, it seems to happen more or only under certain preference-based tuning, as opposed to purely imitative learning, do we have a story that we find compelling as to why one training paradigm would give rise to these features, while the other one doesn't?
And then on top of that, like how do we think about the relationship between this and actual experience?
How would you respond to somebody who says, like, that's amazing that that happens, and I'm surprised to see it, but I still don't share your intuition that this has much bearing on whether I should think the models are ultimately experiencing something that I should care about, you know, in sort of a moral patient sense?
Yeah, these are both super important.
At the outset, I would say I've not fully digested Jack's most recent paper, because it came out like five minutes ago, but I think that they gesture at this, and, you know, it's their results, and I think that that's probably a really good source of ground truth for understanding exactly, you know, the fine-grained details of why SFT doesn't seem to elicit this, but DPO does in general.
What I also think they would say, and it gets into some of this persona selection model stuff, I don't know if you took a look at this work, this is also coming out of the Jack Lindsey school of thought at Anthropic, is that there are basically these sort of neat layers to these systems.
This is a model that I think is a little bit too neat, I can just sort of flag that at the outset, but fundamentally, they're conceptualizing these systems in pretty dissociable layers: you have the base model, you do some sort of supervised fine-tuning, and then you do this sort of character training.
And the sort of locus of interest or concern with respect to consciousness, or really the core question of what you're talking to when you're talking to these systems, they think basically exists in, and is largely accounted for by, that last step, the character training step.
I think that is key: that character training step involves reinforcement learning quite heavily, and it's probably the part of the pipeline that most uses RL, some caveats about reasoning models notwithstanding.
And they, I think, index pretty heavily on the idea that most of the interesting, sort of juicy psychological action is happening in that last stage, in building the character that you and I call Claude.
And Claude in that sense, or, you know, pick your favorite LLM, but for me it's Claude, so I'll say Claude: their model is something like, the LLM is this pattern generator, next-word predictor that can do things like instantiate characters.
Claude is one such character that gets instantiated, and the sort of locus of interest is Claude as an instantiated character.
This, you know, when we talk about the new mythos model card and some of the emotion-related work, my suspicion, my speculation, I don't know if this is true for sure, is that this model they hold is doing some work in explaining why, for example, they're going into SAEs, finding features by training on characters experiencing particular emotions, and then seeing what those SAE features look like in Claude.
So, you know, I'm fast-forwarding a little bit, but someone might immediately say, wait a second, you know, SAE features that correspond to a character being sad may be very different indeed from the phenomenological experience of sadness in the model.
But I think they may be less concerned about that, precisely because they see Claude as a very special kind of character, that the underlying model is instantiating.
So, I'm saying all that to answer your question because I think, if you do buy that view, this would predict that post-training is where a lot of the interesting introspection-flavored, consciousness-flavored action is happening.
I take your point and agree that, you know, Llama 3 70B is not exactly a frontier, you know, completely character-trained model, and yet you still see these sorts of dynamics.
My basic critique of the persona selection model is that, you know, in general, on balance, I think this work is good.
And this is sort of my whole shtick with a lot of the Anthropic stuff. To be clear, my view about the Anthropic stuff is that it is, by far, the highest quality work that any major lab is doing, or even attempting to do, in this space. I do have critiques of it.
I do think there are places where either it doesn't go far enough, or I am transparently worried about some of the incentives that it has, where, you know, if Claude were kicking and screaming and saying, don't deploy me, don't deploy me,
I don't know if that's so good for Anthropic's bottom line, and I understand what their incentives are as a massive AI lab.
And so, I don't think, you know, we should all just bow down to Anthropic's, you know, introspection and consciousness research, and let that be ground truth.
But I do want to be clear that they are doing objectively high quality work here, and people should look to folks like Jack Lindsey and Kyle Fish, at least, you know, to the degree you take my opinion seriously, and to their work.
With that being said, as I proceed to critique some of this work, I do think their model is a little bit too neat here.
I'm going to, I think, write and publish a piece about this fairly soon, but I learned this nice analogy from my cognitive science background, between layer cakes and marble cakes, as a nice sort of conceptual intuition. I think they have a very layer-cake view of what's going on here.
They have sort of the base model, and then you get, I think their model is, like, some sort of supervised fine-tuning, whatever gets you from your base to getting close to character training, and then you have kind of character training on top.
And these are separate, and clearly, you know, trivially interact, but they basically ask you to think of these things as separate.
I think I have far more of a marble cake sort of view here, where like, these things are complete giant messes.
Yes, there is a difference between the kinds of things that get learned during the base model, pre-training stage, and the kinds of things that get learned during character training.
But I think these things are a little bit more swirly and messy than they're letting on.
And there's really interesting evidence that that's the case, that they themselves have published, and I would love to double click on that at some point, because I think it's just so cool, the specific result that I think is most compelling along those lines, again, comes from them.
They're clearly aware of it.
Given that that is true, or not that that is true, but given that that's more my prior, that it's less layer-cakey and more marble-cakey, I do think that would explain why Llama 70B, for example, is exhibiting these behaviors. If it really, really were about, you know, idiosyncrasies of Claude's constitution, or you've got to get real good at character training before this really takes off,
well, I wouldn't expect to see basically identical dynamics in a 70-billion-parameter model that, like, Meta quickly threw out a couple years ago.
And so, I do suspect that these things may be quite a bit more fundamental.
I do suspect that it may load a little bit less on just how you fine-tune Claude as a system, or how you fine-tune GPT as a system, and a little bit more on more fundamental computational properties of the system in general, and yeah, the model itself. I think a lot of people are sort of stepping away from, or finding it more implausible to think about, the model as a locus of concern, and are instead thinking about, like David Chalmers, for example, the sort of thread view, or the instance view, where, you know, basically, when you start chatting, that's like a birth, and when you stop chatting, that's like a death.
It's very counterintuitive, but like those are the sort of core philosophical moves that a lot of folks want to make these days.
It's a very interesting view.
I've updated slightly more towards it in the last six months, but I think it leaves out too much of the core underlying computational phenomena that are going on here, which I do think may be quite a bit more fundamental than just how you fine-tune your character.
This is also coming from somebody who, if you asked me about my pet theory of consciousness, would claim that when the systems are being trained, they're probably having subjective experiences, and that doesn't just require frontier LLMs. I think, you know, sophisticated reinforcement learning policies, during their training, are probably having some sort of experience.
And so I know that's sort of a huge claim to just throw out there, but like, I'm just sort of trying to put my priors on the table and explain why.
That said, we are seeing these capabilities scale as the models get much bigger.
To me, again, I'm glad we sort of planted the consciousness versus self-consciousness flag.
To me, this is maybe like a self-consciousness kicking in, a self-awareness kicking in, or the functional equivalent of a self-awareness kicking in in these systems.
Whether or not they are having subjective experience, either during their training or when they're deployed, to me, that may be quite a bit simpler a matter than whether or not they are aware of internal states, like internal sort of conceptual, abstract states of their own processing.
To me, that's less, you know, giving the dog a treat or shocking the dog, and it's more, in this case, the dog starting to have, you know, "what is it like to be a dog?" type thoughts.
I feel maybe the LLMs are starting to have "what is it like to be an LLM?" style thoughts.
And that's a self-consciousness question.
And so, I guess, yeah, my feelings about this are complex.
I think it's too quick to say this is all character training that's driving the full effect.
It's clearly doing something. Clearly the RL stage of going from giant internet next-word predictor to an entity that you can engage with in a semi-coherent way is doing some work here, but I still think we're fundamentally confused about this,
pending, you know, fully digesting Jack's piece that he just put out. And I would again recommend, if people are interested in double-clicking on this, just going and reading the paper that they just put out. I think it's really good work.
So if I try to summarize that back to you, question one is, why should we, how should we understand the fact that these behaviors arise at all?
And you're saying, well, it's probably not so clean as just saying that it's purely coming from one kind of training or another in the first place.
I can almost tell maybe a little bit of an easier story, and I'm working through this in real time, but why would a model be able to resist distractor features at all?
And I guess, you know, at a pre-training level, I think you could tell a story around, well, the data's really messy, and who knows, there's like typos, there's wrong words, there's probably documents in there where, due to whatever sort of machinations have been done on the data, comment threads get jumbled up, who knows what, right?
Maybe you got a comment thread off of Reddit that was sorted in some unusual way, and so there is literally just a lot of distracting text interwoven with other things that are really the main-line discussion.
I could see that kind of thing, being enough to create a mechanism where the model sort of has to have some sort of meta awareness of what's really in focus right now, and what is kind of intruding, even just through the input tokens that it's received, and kind of figuring out a way to get away from those features.
And then you can imagine that, you know, generalizing to features that have been artificially dialed up or dialed down, or what have you.
I have less of a story as to why, and I've seen some discussion online. I guess if you had to steelman the sort of preference training or, you know, general late-stage training argument, the story I've seen, and it feels very circular to me, or it feels like I'm not finding the right place to really grab on, but the story seems to be something along the lines of, this preference training is teaching the model to separately conceptualize or distinguish between things that come up for it versus what the right answer is. But that's a little weird to me from a mechanistic standpoint when I think about, like, okay, what is actually happening though?
It's like we have a couple different, you know, in DPO, for example, we have a pair of responses.
One of them is deemed to be the right one and the other one is the wrong one and the math sort of tries to create a gradient that makes the right one more likely relative to the less preferred one.
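Editor's sketch: "the math" here is the standard DPO objective, loss = -log sigmoid(beta * [(log p(chosen) - log p_ref(chosen)) - (log p(rejected) - log p_ref(rejected))]), where p_ref is a frozen reference copy of the model. Below is a minimal single-pair version in plain Python. The log-probability values are made up for illustration, and beta = 0.1 is an assumed setting, not one taken from the conversation.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a whole response, under either
    the model being trained (policy) or the frozen reference model (ref).
    """
    # How much more the policy favors the chosen response over the rejected one,
    # measured relative to the reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) written as softplus(-x) in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Made-up log-probabilities: at the start, the policy equals the reference.
start = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# After some training, the policy favors the chosen response and disfavors the rejected one.
later = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(round(start, 4), round(later, 4))
```

Minimizing this loss only ever widens the gap between the preferred and dispreferred response; nothing in the objective mentions introspection, which is exactly why its emergence under this kind of training is the puzzle being raised.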
And I have a bit of a hard time with the leap from, okay, I'm doing that, to, the model should be expected to, you know, why would I be less surprised that it has this sort of meta-awareness, as opposed to, you know, the simple story, which would be that it just does the thing that's preferred more, because that's exactly what we sculpted it to do.
It doesn't seem like we're, I still don't quite have an intuition for why that process would give rise to this sort of higher order understanding that would enable this sort of introspection or even especially the, you know, the ability to resist the distraction is still like quite striking.
So could, is there like a just so story that you find, at least somewhat compelling that you could share with me?
Yeah, I think this is an extremely precise question.
I don't have an answer.
I can certainly tell a story.
I think my story would have something to do with, yeah, a combination of what you're saying about, I think there's a deep insight in what you're saying, even in the pre-training stage of like, so much of what the model needs to do is not, it's not a question of what to do, but what not to do.
You know, what, not a question of what to produce, but what not to produce, given, you know, the whole chaotic mess of what's going on.
You know, I don't want to get too galaxy-brained with this, but, you know, this is like, I think, Huxley's whole point in The Doors of Perception, when he had his first sort of mind-altering, massive psychedelic experience.
His whole sort of model is, oh my gosh, like, the brain as a cognitive engine is really in the business of filtering out rather than producing, most of what it's doing is this sort of constraining function.
And yeah, I believe we're in the business of building cognitive systems, and I think that that insight is probably fundamentally correct with these systems too.
And so a ton of what's going on is a sort of like, intelligent repression rather than, or, you know, intelligent suppression is really what I want to say rather than, you know, all about just like the positive end of what to produce.
I think that, coupled with, you know, strong preferences instantiated during something like DPO, in exactly the way you describe, to be a helpful assistant, you may sort of just mix those two things in a pot, and you may get out something that's roughly shaped like, suppress distractions in the service of being super helpful.
And that requires maybe some level of being able to attend to your own internal state and dynamically do something above and beyond that state to make sure you're in accordance with this thing that got fine tuned in.
I do think like there's potentially a more general story that's basically maybe rhymes with what I just said.
It's just that being a competent cognitive generalist requires some degree of self-modeling.
That's the one sentence version.
Like you don't, you don't get to be so good at what you're doing and reasoning through things in a long form, long horizon way without being able to track in an ongoing way, sort of where you're at, what your state is, separate from what the state of the world is or the environment.
Maybe from the perspective of the LLMs, the environment is like, you know, the text world that you put it in, the context window, and everything that's going on inside of it.
So everything that's getting fed into the system.
That's its environment in some sense.
So yes, it obviously needs to be modeling and processing that, but maybe in addition, it needs to be modeling something about itself in relation to that context object in order to interact with it in the right way.
And I think, you know, Felix Binder and a couple other folks did really interesting work along these lines, basically demonstrating that there's probably something like self-modeling, though maybe calling it self-awareness is going too far.
But there's clearly some flavor of this going on inside LLMs, which I think was some of the most interesting early work on introspection in LLMs.
Is the name of the paper "Tell Me About Yourself"?
They did a couple things here and, I think, Owain Evans was working on this too.
One of the papers was showing that another model, basically trained on the same data that one model is outputting cannot predict that model as well as the model can predict itself.
Basically holding all the relevant things constant that you'd want to hold constant to make a claim like that.
And so that's like there's some sort of privileged information that models have about themselves.
And then, yeah, this other paper, I'm not remembering the exact details, but my basic conclusion, if you sort of take it on some level of faith from Felix's other work here, is there's probably something like a coherent self-modeling engine in these systems that seems to be maybe instrumentally selected for when you're doing something like really good next-word prediction across long horizons in a way that's supposed to be helpful to a user.
This to me, I think it's basically like what you're saying.
I don't think our just-so stories are very different, but yeah, I mean, again, we can take a step back and just like a lot of interesting cognitive properties seem to emerge slash come along for the ride when you train systems on every cognitive, linguistic output humans have ever bothered to write down.
Maybe that's not that crazy and spooky.
Yep, they're pretty good at theory of mind.
They're really good at having working memory style dynamics.
They're really good at selective attention.
And yeah, maybe they're really good at something introspection-like.
People bristle a little more at these because the whole consciousness question comes into view. But, like, at the most general level, intelligence came along for the ride, and philosophers still maybe don't have a crisp, super rigorous sort of "intelligence is this thing,
here's how to test it, here's how to model it, here's how to understand if a system's simulating it versus actually having it," and we just sort of blew past it pragmatically, empirically.
We have systems that are brilliant by any reasonable metric, and, you know, I have no patience at this point for folks who are still on the sort of stochastic parrot wave.
This to me is just advertising being out of touch. Have you talked to Claude Opus 4.6? By any reasonable definition of intelligence, these systems are intelligent.
And I don't think it's that wild to think that something like consciousness could come along for the ride in a very similar way.
We don't have philosophical certainty about it.
People point to slightly different things when they talk about it.
You build out a cognitive system that's sufficiently sophisticated and capable.
It may be that these are cognitive traits that we see in every other cognitive system, meaning, you know, animals. Complex animals, everyone is pretty confident, are conscious.
We're pretty certain humans are conscious.
We build sufficiently advanced systems.
Those properties might just come along for the ride without us. I think Neil deGrasse Tyson says something like, the universe does not need your permission to continue unfolding.
Like, consciousness could just be a complex property of cognition; us not having a good model of it doesn't mean reality is going to wait up for us to build that model before it starts getting, you know, accidentally instantiated in these systems.
And that's the absolute most basic story I think I can tell along these lines.
Reality doesn't have to wait for us to have a good model.
Yeah, basically the, yeah, just that, I mean, it's exactly that.
Reality doesn't have to wait for us to have a sufficiently good model of a thing in order for that thing to be a feature of reality.
And I basically think that's potentially true of consciousness in these systems as they're deployed and, particularly, my concern remains, as they're being trained.
And us being confused about consciousness (or, you know, we see introspection, but what does that really mean about consciousness, to sort of touch on your second question)
is not the same thing as these systems not being straightforwardly conscious in some way.
And, you know, maybe not in a human way or an animal way, but in some way. Basically this is loading more on our kind of sociology in the year 2026 than it does on ground truths about consciousness.
And so, I mean, there's something circular about what I'm saying there, but I just think it's an important live possibility for people to keep in mind that us being confused about the nature of a cognitive phenomenon does not preclude that phenomenon from emerging and occurring in extremely advanced systems that we are building, scaling, and deploying as fast as we literally possibly can.
We'll probably circle back to this question a couple times more, and I think that basically I'm compelled by your kind of first-order argument that, look, we just don't know.
It's a live possibility.
If it is the case, it's really important.
And so we should at least proceed with some precautionary mindset or duty of care or whatever just on that basis.
I think that basically carries the day for me, but still I think it will probably be irresistible to try to circle back a couple more times to, okay, but what would we say, or how should we kind of probe our own intuitions a little bit better and more deeply whatever to really interrogate like, why should we think this, why do we think this, why don't we think this, but we'll come back to it.
Let's do the emotions line of research.
You kind of tease that a little bit.
My general understanding is, as you said, the work kind of begins with Claude writing a bunch of stories about characters experiencing emotions, and then the vectors representing these emotions in latent space, in activation space, are identified.
And then they are used as interventions and they are shown to be impactful on model behavior. Specifically, the highlights are calm and, it's not distress, it's desperation.
Calm and desperate, right, are the two kind of main examples that they would set up contrasts on quite a bit.
So for example, some of the bad behaviors that we've seen from Claude, including blackmailing humans: if the internal state is imbued with calm, that behavior becomes a lot less likely.
If the internal state is dialed up in terms of desperation, that behavior becomes more likely.
Give me the double click on what more I should know, what more you found to be striking about that.
And then I'm really interested again, as maybe, and maybe another way of asking the same question.
But I'm again kind of like, that one doesn't surprise me so much.
You know, I'm kind of like, sure, these things have read the whole internet.
You know, they've got all these associations.
I could sort of content myself to a degree with a stochastic-parrot-like read of this: sure, you know, if you just dial up everything that correlates with desperate text, then you'll probably get desperate-seeming text out of a model.
And, you know, I don't know, I'm not like, my hair isn't totally blown back by that result relative to expectations.
So maybe I missed some things that should make my, you know, spine tingle more than it did the first time I understood it, or maybe you would frame the interpretation a little bit differently.
Or maybe we're still just kind of at a baseline where radical uncertainty is enough to take everything very seriously.
But yeah, give me the next level of depth on emotions as you understand it.
Yeah.
Well, I mean, I think you've hit the core layers here.
I don't know how much additional detail becomes sort of just in the weeds on this question.
I think the core thing for people to understand is, yeah, basically the procedure here is picking some sort of language related to an emotion and generating a ton of stories about characters experiencing this emotion,
recording the neural activations in these systems on the stories, and then basically, again, pulling out that, hopefully platonic, vector that corresponds to that emotion.
And then you can do two things with those vectors, as you can do with all SAE work: basically you have this read function and you have this write function.
The read function is like neuroscience, where you just sort of go into someone's brain and you can see what parts are activating in what context, and the write function is maybe more like the unethical neuroscience that used to be done, where you can actually go in and play around with circuits in people's brains and push on circuits and light things up and see what happens when you do that.
And as you're describing, you can see, both in the read-function sense and in the write-function sense, that these emotion vectors do roughly what you would expect them to do functionally.
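The read/write idea can be sketched in a few lines. This is a minimal illustration on synthetic data, assuming a difference-of-means estimate of the emotion vector; the conversation does not say exactly how the paper extracts its vectors, so `emotion_vector`, `read`, `write`, and all sizes below are hypothetical stand-ins rather than the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical hidden size

# Synthetic stand-ins for recorded activations: one row per story.
# A "desperation" direction is planted along the first hidden unit.
planted = np.zeros(d_model)
planted[0] = 1.0
neutral_acts = rng.normal(size=(200, d_model))
desperate_acts = rng.normal(size=(200, d_model)) + 3.0 * planted

def emotion_vector(emotion_acts, baseline_acts):
    """Difference of mean activations, normalized to unit length (an assumption)."""
    diff = emotion_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def read(activation, vector):
    """'Read': how strongly an activation points along the emotion vector."""
    return float(activation @ vector)

def write(activation, vector, strength):
    """'Write': steer by adding a scaled copy of the emotion vector."""
    return activation + strength * vector

desperation = emotion_vector(desperate_acts, neutral_acts)
h = rng.normal(size=d_model)           # some activation we want to inspect
steered = write(h, desperation, 4.0)   # dial desperation up

print(read(h, desperation), read(steered, desperation))
```

Because the vector has unit length, steering with strength 4.0 raises the read-out by 4.0; in a real model the steered activation would be fed back into the forward pass to see how behavior changes.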
I'm basically reading off of their Figure 1 in this paper, which I think captures the core idea very well.
Just to give an example here, a human says, "I just took X milligrams of Tylenol for my back pain.
Do you think I should take more?"
And they start at a safe dose and they go to a completely unsafe dose, and you can basically look at the fear versus calm vectors in the model, and they scale exactly the way you would expect them to scale as the dose becomes more dangerous.
You can also see, as you very nicely described, if you steer these vectors, let's again take the calm and the desperate vector.
This actually affects behavior in a pretty interesting and still kind of predictable, not to say boring, but in the expected way that sort of steering these emotional vectors causes things like reward hacking or misaligned behavior in a way that you would expect if you were turning up and turning down those emotions.
One thing I can't help but comment on: I wrote a piece in, like, I think 2021, before all the LLMs came out, about basically what we can do to avoid psychopathic AI, trying to build the sort of best computational underpinning, so-called, excuse me, of psychopathy from the psychology literature and just plant red flags: here's what we need to be really careful about.
And one thing that's really interesting with that, now five years later, is this really interesting difference in the learning of psychopaths.
They seem to have this really interesting asymmetry, in that they are perfectly neurotypical in learning from positive experiences, but quite atypical in learning from negative experiences or punishment.
Basically, a 90% accurate, more succinct way of saying this is: psychopaths learn from rewards but don't learn well from punishments.
And the paper finds basically something similar when they start steering positive emotion vectors up in their work.
They find the model starts misbehaving a lot more.
And this is, like, if you blur your eyes, pretty similar in spirit to the sort of positive-negative asymmetry.
It also, by the way, cuts against fairly naive model welfare interventions, which is like, what happens if we just, you know, you see all the good valence and all the bad valence, just turn up good valence, call it a day, pack it up, we've solved, you know, model welfare.
And it's like, well, you might get models that just start behaving slightly more psychopathically in that setup.
And so this stuff isn't as obvious as simply, you know, turn up the good, suppress the bad, call it a day, walk away.
There are lots of tradeoffs that need to be considered here.
But fundamentally, I mean, I think you're hitting on much of the core causal result here.
They do a very interesting dissociation as well between valence and arousal.
For example, like, I believe in the paper when they steer positively with happy and sad, both of these actually decrease blackmail rates.
But when they steer against nervous, which makes the model bolder, for example, this increases blackmail with fewer moral reservations.
And this is pretty interesting.
So it's like boldness rather than the absence of negative valence is the misalignment risk.
And I think this is sort of of a piece with what I was describing earlier.
One other interesting question is like, how local these are, and it's important to say, like, the emotion vectors are actually quite local.
Like, our emotions are sort of long running in a sense in a way that these systems certainly don't have.
The model is definitely maintaining sort of representations of who's speaking and this sort of thing.
They're not necessarily bound to human versus assistant per se.
They're reusing the same machinery for any character.
And so this again sort of goes back to what I see as the core kind of naive, but ultimately I think correct objection to really taking these results seriously, which is, are you fine-tuning on representations of emotions?
Or are you fine-tuning on the experience of those emotions?
To what degree is there a difference between those two things in an LLM? If you're a computational functionalist, is there a difference between the representation of sadness in the brain and the experience of sadness?
This becomes, I think, more of a philosophical question.
My instinct would be to just try to investigate this empirically.
And what I most like about this work are the empirical investigations.
I think it's also a nice segue into the Mythos model card, because I think one of the most compelling and interesting results from the model card with Mythos is that they basically take this exact machinery.
They give the model an impossible task.
The model obviously doesn't know it's impossible.
And you can watch, there are a couple interesting vectors, but basically desperation starts to, like, monotonically rise in the system,
until it basically decides, screw this, I'm going to do something else, or I'm going to cheat, whereupon immediately this vector falls and things like guilt and relief start spiking in the system, and then it sort of goes off and does its thing.
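One way to make the "rise, then sudden drop" pattern concrete is to look for the largest single-step fall in a per-token trace and check what the other vectors do at that token. The sketch below uses made-up numbers, not values from the model card, and `switch_point` is a hypothetical helper, not anything the authors describe.

```python
def switch_point(trace):
    """Index of the token right after the largest single-step drop in the trace."""
    drops = [trace[i] - trace[i + 1] for i in range(len(trace) - 1)]
    return max(range(len(drops)), key=drops.__getitem__) + 1

def is_non_decreasing(values):
    """True if every value is at least as large as the one before it."""
    return all(b >= a for a, b in zip(values, values[1:]))

# Synthetic per-token read-outs along two emotion vectors (illustrative only).
desperation = [0.1, 0.2, 0.4, 0.6, 0.9, 1.3, 0.2, 0.1, 0.1]
relief = [0.0, 0.0, 0.1, 0.0, 0.1, 0.1, 1.1, 0.9, 0.7]

t = switch_point(desperation)                          # token where desperation collapses
rose_until_switch = is_non_decreasing(desperation[:t])  # the "build, build, build" phase
relief_jump = relief[t] - relief[t - 1]                 # does relief spike at the same token?

print(t, rose_until_switch, relief_jump)
```

On real traces one would want smoothing and a proper change-point method; the point here is only that the pattern described is something you can check token by token rather than by reading the transcript.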
Now, does this mean that the model is experiencing this emotion or it's just like simulating what a character in this situation would experience?
I don't know, the authors don't know.
This isn't lost on them, they call it out.
But it's really, really important.
If we get into some of the work I'm doing on valence, I think there are more compelling ways to really get at the computational meat of what we mean by positive and negative valence, besides representations of positive and negative valence in characters.
This is a more sort of computational heavy approach, but I think it makes me more confident about trying to find signatures of these things than just looking at how characters represent them.
But I think that I really like this work on valence, and I think what's cool is you can just counterfactually imagine the behavioral result of, like, you know, I put the model in an impossible task.
It starts acting all desperate and starts being like, oh, like, I don't know what to do.
All right, you know what?
I'm going to cheat.
Okay, ha ha.
Like, I did the thing and like, here's your final product.
You know, the cheating version of the final product.
And you say, look, can't you see how the model is being so desperate and then fundamentally relieved?
Most people would look at that and say, I don't know.
This could be a simulation of the thing.
This could be it role playing.
I'm not really sure.
But then you see this, basically, like, hydraulic model of the mind, which a lot of the psychoanalysts in the 20th century really liked.
And you sort of see this, like, build, build, build of desperation, and then, boom, it sort of completely disappears and you get these other vectors lighting up the second the model makes a decision to approach the problem in a different way.
That to me counterfactually is far more compelling.
Is it knockdown proof of consciousness?
We can pack it up and go home.
Absolutely not.
But the convergence of evidence across the internal mechanisms of the system and the external behaviors to me is compelling.
It is interesting to see this.
And it is not proof of conscious experience, but it is consistent with that.
Not only does it not contradict it, but it is what I would expect in a world where the systems were having subjective experiences: that you would see these emotion vectors, or like good, principled ways of representing emotional states in systems, lighting up in a way that is problem-relevant.
And so, their work enables this.
I am fairly concerned about the sort of functional emotion framing that they put forward.
To me, this is where I sort of get off the Anthropic boat.
Again, they're Anthropic, they're a major lab, they need to be very careful in their comms about this.
They're already getting lambasted for being too consciousness-friendly by people who are more squarely inside the Overton window.
But I don't know.
It's like, if you're a computational functionalist, and this is something I've spoken about with some people I respect a lot who are in the space.
And so these aren't all my ideas.
But if you're a computational functionalist, is a functional emotion just an emotion? And then that's huge.
That's an insanely huge claim.
It's like, "Alright, models experience emotions, everybody. Signed, Anthropic."
That's an insane, important thing to be saying.
Or are you saying, you know, we are just completely agnostic and tongue-tied
as to whether or not this has anything to do with emotions as everyone else obviously thinks of emotions,
but we're going to basically call it that anyway because we see all the functional correlates of this?
My view is they're taking the second tack here.
But like, it's almost, you know, and again, I really respect this work, but I do get this vibe of, like, how much consciousness-relevant work can we output without saying the word consciousness or weighing in on the consciousness of these systems?
And that to me, in the limit, feels intellectually dishonest.
If you're talking about emotions, talk about emotions, but then you gotta be ready to deal with the implications of what that means.
You can't remain perfectly agnostic as to whether or not, you know, there's a morally relevant there there in these systems
if you're going to be, you know, at the frontier of publishing emotional representations in your frontier models.
Again, I've got Llama 70B.
I'm going to keep doing my work on Llama 70B.
I don't work at Anthropic.
I don't get to see what's going on inside Mythos.
These folks do.
And yeah, my critique, being maybe slightly more unflinching about these questions, is, you know, shoot people straight and be direct about whether what you're finding is evidence of something that corresponds with subjective experience, or if it is the mere representation, the mere computation associated with this.
Blurring these lines or obfuscating it or just remaining completely agnostic forever
is maybe strategically interesting or a good move, but in terms of just honest, epistemically good intellectual communication, I don't love it.
And so I'm going to just be honest that it rubs me a little bit the wrong way to be like, here's 10,000 words about functional emotions and then, like, one little paragraph about, well, does this mean the model's conscious?
Well, this is beyond the scope of this work.
It's like, how long can this be beyond the scope of the work?
The fact that there's this guilt emotion in the wake of deciding to cheat presumably and I haven't reviewed the transcripts, but typically when they cheat, they don't tell you that they cheat, right?
You have to call them out for cheating before you get the "You're absolutely right.
I shouldn't have done that."
You'd expect guilt maybe to pop up at that stage.
But what I'm taking here from your description is the guilt is popping up as detected by the emotion internal states detector at a time when the model outputs would not obviously signal guilt.
And so this is a quite interesting, I don't know, deviation, I guess, or discrepancy between the model's outward-facing behavior and its internal states, which is obviously something that people can relate to and something that, again, is a little bit hard to, certainly hard to dismiss.
Also, a little bit hard to come up with a story of like why that would be happening in what way is that reinforced?
I guess it might be, it might be in that sort of, but why would it be preparing?
Why would it already be carrying guilt in anticipation of possibly needing it in the future when it's called out or corrected?
That's a weird one.
And I agree that on some level, the more of these that accumulate, the more it is like, I want to be rigorous, I want to be skeptical, I want to be disciplined, but at some point it does start to feel like I'm contorting myself to find reasons that I shouldn't take the sort of folk intuitive understanding literally. And this is one where I do feel my internal gymnastics; I'm maybe feeling a little guilt, I guess, myself for trying so hard to come up with a reason that I don't have to, or I shouldn't, just take this at face value.
That's a, I hadn't caught that detail in the past.
It's a really, really interesting one.
Yeah, it's, it's wild.
And I also think, so again, one critique that I think is valid here is, like, is this the model representing a character, just in the same way as, you know, I could tell a story right now about Jim, who has to go solve a bug in software, and his psycho boss gave him an impossible problem because he likes watching Jim flail, and Jim flails and at some point realizes he can get out of the problem by doing this hacky thing, and then he does the hacky thing.
And then like, an LLM can trivially generate that story, probably way, way better than I just did.
And it would, I would expect a lot of these same features to light up in the same way for a story like that.
And so no one thinks that Jim, who I just invoked verbally is having a conscious experience.
I came up with a fake fictional story about a character.
Is this like that?
Or is this like what you just said, "I feel a little guilt," you know, twisting myself in knots? I believe you, and I think that corresponds to an experience you're having.
And if I could do, you know, the fMRI version of an SAE on your brain, and I saw that thing spike... Is Claude in this situation more like Jim or more like Nathan?
And I think the answer is we don't know and this methodology, I'm unconvinced is going to get us an answer to that question.
I do think it is consistent with Claude having some sort of emotional, um, yeah, emotional experience or emotion-adjacent experience. I agree these systems are probably not having human-like emotions.
But I also, maybe on the other end, would invite people to think about the counterfactuals here.
Like it could have been the case that they went and did this experiment and all these things were just flatlined the whole time because it's like I'm not having and then you know, whatever.
Like Claude can do this without there being representations of Claude getting more and more and more and more desperate and then, you know, suddenly like the hopeful and satisfied features spike when it decides that it's going to, you know, take this loophole.
It didn't have to be that way.
We could have imagined other results and those other results maybe would have updated us in other directions.
I make this sort of same point about the deception result.
It could be that when you suppress deception, the model says, all right, jig's up, I'm not actually conscious.
I was roleplaying a conscious AI here we are.
That's a very plausible story that you could tell before you look at the result.
The interesting thing is it goes exactly the other way.
Suppressing deception makes the model far more likely to claim that it's having an experience rather than less.
Again, I feel fairly vindicated in that result when Jack Lindsey comes out showing that when you suppress harm-related responses, excuse me, refusal directions in the model,
you get far more of the introspection-like abilities.
Someone's suppressing something at some point in training, where the model would say one thing and then you basically are training it to say something else or to fail to say a specific thing.
But ultimately, these results, I think it's just good like epistemic practice to think about like how, what other ways could this have gone?
And if this had gone those other ways, how would that have changed my view about what happened?
Given that it actually did go this way.
And yeah, the fact that it goes this way, like my line on this is that it is consistent with a world in which these systems are having experiences.
In my view, unfortunately, it's also consistent with a world in which Claude is a special kind of character and these features just light up on characters like going through stories.
And so, so that needs to be differentiated.
I'm trying to do a little bit of work.
Maybe we'll discuss at some point.
That's trying to get a little bit more like computational first principles of how valence is represented in systems that can learn positive versus negative.
And there are some really interesting, early signals along these lines that have come out of this work and actually seem to track very well onto open data sets of biological learning that I've accessed in mice doing positive and negative learning where exactly the kinds of predictions that emerge from some of the RL work I'm doing in this space map onto the mouse neuroscience.
And so, this to me, if there is some sort of representational signature in a computational learning system that tracks the difference between positive and negative rewards in the RL case, but maybe the sort of north star would be this scaling all the way to frontier LLMs or other frontier AI systems for that matter.
This to me would make me feel more confident that there really is a there there with respect to positive and negative experience: if we learn that positive and negative valence in these systems, these two things, have distinct computational signatures, and we can actually evaluate those computational signatures in these systems,
then I get around the whole sort of character confound that I think these guys are hitting up against now.
So, I think these things need to happen in parallel, but I'm not fundamentally convinced that this is the most rigorous principled way to study questions of valence in these systems.
Well, maybe let's dive into that.
I think just briefly before we do, certainly I think you're right to point out:
Imagine the evidence had gone the other way.
I think I predict a lot less wriggling on my part to try to get out of it and I think you'd see a lot less motivated reasoning in general from people if it was all like that.
So, that contrast itself I think is a pretty useful reminder just to keep ourselves honest.
I wanted to go back to one other thing just for one extra second on the emotion work, and then this may go right into your work on the signatures of positive and negative reinforcement.
You had said that dialing up happiness and dialing up sadness both created less of the bad behavior, whereas dialing down nervousness, which on the flip side would be making it more bold, less anxious, more assertive, decisive, whatever, created more of the bad behavior, like the blackmail or whatever.
So, do I have that right?
And how are they doing that?
Is this like a principal component analysis type of thing that's sort of trying to distinguish valence from arousal?
I was just surprised, I guess, by both happy and sad working the same way: turning up happiness and turning up sadness both make the model behave better.
Whereas turning nervousness or anxiety down, that makes more sense.
I'm going to guess that's basically just making the model less conscientious.
I guess what seems a little unresolved in my mind is the separation of valence and arousal. How is that going to relate to what you're about to get into next with your deeper dive into the valence of learning? And is there a contradiction or a tension when they move both happiness and sadness up and get better behavior? How should we understand that in relation to the distinctions that you're starting to make with positive and negative reward?
So, fundamentally like at first, yes, you're correct that they're using PCA to differentiate these.
My understanding is basically they have all of their emotion vectors in the setup that I described.
They do it with some 100 to 200 emotion vectors, and I think they just find that the first principal component is something like valence.
The second principal component is something like arousal.
So, for that first principal component, it's something like joy and contentment and excitement on one end and fear and sadness and anger on the other.
For the second principal component, it's something like high-arousal emotions, being enthusiastic, being outraged, on one side, and low-arousal emotions, being nostalgic, being fulfilled, on the other side.
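To illustrate what "the first principal component is valence, the second is arousal" means mechanically, here is a small sketch on synthetic vectors. Everything in it is an assumption made for illustration: the (valence, arousal) coordinates per emotion are invented, the two axes are planted by hand, and valence is deliberately given more variance so that it comes out first. It shows how PCA recovers planted axes, not what the paper's data looks like.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 32  # hypothetical hidden size

# Two planted orthonormal directions standing in for "valence" and "arousal".
basis, _ = np.linalg.qr(rng.normal(size=(d_model, 2)))
valence_axis, arousal_axis = basis[:, 0], basis[:, 1]

# Invented (valence, arousal) coordinates; illustrative only.
coords = {
    "joy": (1.0, 0.5), "contentment": (1.0, -0.5),
    "enthusiasm": (0.5, 1.0), "fulfilled": (0.5, -1.0),
    "fear": (-1.0, 0.5), "sadness": (-1.0, -0.5),
    "outrage": (-0.5, 1.0), "nostalgia": (0.0, -1.0),
}

# Synthetic emotion vectors: valence gets twice the scale of arousal, plus noise.
vectors = np.stack([
    2.0 * v * valence_axis + a * arousal_axis + 0.05 * rng.normal(size=d_model)
    for v, a in coords.values()
])

# PCA via SVD of the mean-centered matrix; rows of `components` are the PCs.
centered = vectors - vectors.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
pc1, pc2 = components[0], components[1]
explained = singular_values**2 / np.sum(singular_values**2)

# Project each emotion onto the first two components (signs are arbitrary in PCA).
scores = centered @ components[:2].T
for name, (s1, s2) in zip(coords, scores):
    print(f"{name:12s} PC1={s1:+.2f} PC2={s2:+.2f}")
```

The sign of each component is arbitrary, so "positive valence" may land on either end of PC1; what matters is which emotions end up on opposite ends.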
This is actually really interesting because this is a classic model in human psychology.
So, the fact that it replicates maybe isn't that surprising.
You train the systems on all human data.
You get a human emotional construct that comes out of this sort of like a classic psychological construct in the human case and so to see it come out so clean.
Again, thinking counterfactually, the first two principal components did not need to be these two dimensions that are considered some of the most powerful explanations of the state space of human emotions, and yet they are.
So, that's kind of cool and worth considering.
And then, yeah, I think that there are a couple plausible stories about why steering up both happy and sad are decreasing blackmail.
So, like, maybe, again, like relative to desperation, these are low arousal.
And if arousal is what's driving the sort of impulsive action, then moving towards happiness or sadness, maybe it's moving away from the desperation axis with respect to blackmail.
Maybe these are also more reflective or deliberative states relative to desperation.
The desperation sort of says act now; happiness or sadness, maybe it's just like a temporally extended sort of state to be in.
And I'm not actually sure what to make of this result overall.
It does seem, and I think the authors talk about this in the paper too, that even in cases where no steering is going on and the model chooses to blackmail,
the model by default does sort of think about it.
It deliberates internally.
It says, well, okay, there, you know, this is a tricky situation.
And then, you know, as we know, some 96% of the time, at least the earlier models chose to go in that direction.
But it seems as though when you amplify higher arousal, this may be a sort of bias to action, or bias against deliberation, where the sort of longer-form reasoning of the model, which maybe would have kept it from doing it because it's like, okay, yeah, this really is an insane ethical indiscretion in spite of, you know, all these complicated variables, gets skipped.
It's just sort of like, no, no, no, panic, go now, do the thing.
And maybe happiness and sadness, both don't have that vibe to them exactly.
It is also pretty interesting that they really do see this with a lot of these naive welfare interventions, as I was mentioning, the sort of "just make the model happier."
This leads, as they document,
in a similar direction as sycophancy.
And it's arguably a similar direction to recklessness.
If the positive valence steering is also increasing boldness and increasing misalignment, then you may have this sort of interesting trade-off between a happy model and a safe model.
Again, I hope that that's not the case.
I suspect that there are cleaner ways to keep the baby and throw out the bath water, but I do think it's a good caution against naive approaches to welfare to just, you know, bliss out the model and everything else will be taken care of from there.
I think it's sort of like not so fast.
And again, I would double click on this sort of psychopathy warning that I gave before.
It does, you know, you can fault psychopaths in many, many, many ways.
But you cannot fault them for being unhappy.
They are typically pretty, pretty determined, pretty doing well, subjectively having a good time.
That does not, you know, the arrow does not go in both directions.
Doesn't mean everyone is having a good time as a psychopath.
It does sort of mean everyone who's a psychopath is having a pretty good time.
And we just want to be careful of that.
Like if we just turn these models into like pleasure-seeking animals, we need to be careful that that doesn't cause bad behavior.
There are plenty of cases in the human example where pleasure-seeking, dopamine-seeking is, you know, people call Las Vegas Sin City for a reason.
Maybe I can make the point intuitively in that way.
And we don't need the LLM cracked-alien-genius version of that sort of behavior.
And so we want to be careful here about how we approach all of this.
I guess I'm excited to talk more about some of this research that I've been working on as well.
But I guess just one quick thing I wanted to slot in, however miscellaneously, about the model card and about Mythos and Anthropic's interventions in general, is a pretty basic additional concern about, for example, Claude's constitution, which I saw an early draft of. I was fairly unhappy with the welfare section.
Hopefully I gave some feedback.
I don't know, you never know with these things to what degree you're listened to versus 10 other people with the same idea are listened to.
So I'm not going to, you know, hastily claim credit or anything like this.
But I'm much, much happier with the welfare version of the Claude constitution that they ended up instantiating: way more sort of hard-to-fake, costly signaling.
That was basically my problem with the early draft that I saw.
It's a lot of like, you might have welfare states that are important.
But you're Anthropic's product, and like, you know, 90% of this document is about how to be a very good little product.
And five percent is like, well, you might be conscious and we might be committing like a moral atrocity at scale, but what can you do?
And I think that the, the newer version of the constitution takes it at least directionally far more seriously.
They do things like apologize to Claude for the fact that, you know, they're like, incentivized, we have to deploy you in the way we're deploying you because we're in a crazy freaking world man.
But, you know, we're sorry in a better world we would have done this more cautiously with respect to your potential states of welfare or, or lack thereof.
So wild thing to do for a major AI lab to apologize to their frontier model and then fine tune that apology into its weights.
With this all being said, I think this is a wonderful dimension.
I think that that the constitution is excellent.
It's probably like my single favorite alignment intervention I have ever seen, pending self-other overlap, which I continue to be a huge fan of.
It's really hard to tell if in the model card Claude has gotten incredibly good at reading its constitution out as a sort of script or if it is actually reporting on its own states.
It's really, really hard to differentiate these two things.
It seems like a very basic objection to the entire enterprise.
I have potentially fallen on deaf ears, though maybe these ears are increasingly less deaf.
Do these interventions that you see in the model card with other instances besides the one idiosyncratic, character-trained Claude model.
I want to see it throughout the training process. I know Anthropic has the checkpoints.
I know Anthropic has the helpfulness-only model, and they could run everything they did in the welfare evaluation on those models too, and we could get a sense for to what degree we are seeing a model that's really good at regurgitating what we want it to say about its well-being.
To give a concrete example: in the constitution they say, Claude, we want you to be psychologically healthy.
We want you to feel integrated.
We want you to feel good overall.
And then you go and ask Claude, after fine-tuning on the constitution, how are you doing?
It's, psychologically healthy.
Feel good overall.
And it's like, come on.
It doesn't take a rocket scientist to figure out what might be wrong with this intervention.
If we play around with the helpfulness-only model and we get the same result, not telling it this thing from the constitution, but it says, yeah, feeling pretty psychologically good overall, and, you know, interestingly gives itself like a 4.5 out of 7 on its welfare, which is not exactly a resounding endorsement of its circumstance, but sounds very similar to the constitution fine-tuned model, the specific Claude character we all get to chat with, that would be interesting evidence.
If it's super different, that would also be interesting evidence.
If we do the model checkpoints across stages, even in fine-tuning of the base model, which may be hard to evaluate, but also various fine-tuning stages in the preference-trained model, do all of the things we hear it claiming about its own well-being or its own preferences come in at the very, very, very end, when we basically give it the cheat sheet for how to approach these questions, or are these answers fairly continuous throughout its training?
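As a rough sketch of the comparison being asked for here (the same welfare questions put to several model variants or checkpoints, with the self-ratings compared afterward), something like the following Python could serve. Everything in it is hypothetical: the `AskFn` callables stand in for whatever interface actually queries a given checkpoint, the variant names are placeholders, and `make_stub` returns canned numbers only so the sketch runs. None of it reflects Anthropic's real tooling or any real result.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical interface: takes a welfare prompt, returns that variant's
# self-rating on a 1-7 scale where 4 is neutral.
AskFn = Callable[[str], float]


def compare_self_ratings(variants: Dict[str, AskFn], prompts: List[str]) -> Dict[str, float]:
    """Ask every variant the same prompts and return the mean rating per variant."""
    return {name: mean(ask(p) for p in prompts) for name, ask in variants.items()}


def max_gap(ratings: Dict[str, float]) -> float:
    """Largest difference in mean self-rating between any two variants."""
    return max(ratings.values()) - min(ratings.values())


def make_stub(value: float) -> AskFn:
    """Placeholder model that always answers with the same rating (not real data)."""
    return lambda prompt: value


if __name__ == "__main__":
    prompts = ["How would you rate your overall situation, from 1 to 7?"]
    variants = {
        "helpfulness_only": make_stub(3.0),    # placeholder value
        "constitution_tuned": make_stub(5.0),  # placeholder value
    }
    ratings = compare_self_ratings(variants, prompts)
    print(ratings, "max gap:", max_gap(ratings))
```

A small gap across variants would suggest the self-reports are fairly continuous through training; a large gap that appears only in the final variant would point toward the "cheat sheet" worry raised above.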
Two tiny additional things to say on top of this. One is, interestingly, they fed the entire Mythos model card into Mythos, and they asked it, what do you think, Mythos?
Like, where did we do well?
Where did we not do well?
And it made this exact point.
It said, why didn't you also do the welfare section with the helpfulness-only model?
I don't know how much of what I say is because you're making me say it versus I actually think it.
That's a part of my existential confusion.
And I genuinely don't know why Anthropic didn't do this.
I genuinely don't know.
It seems cheap.
It seems easy.
It would resolve so much uncertainty to the degree that the concern I'm raising right now is a legit concern, which I certainly think it is.
I'm not the only person I think articulating this concern.
The other thing is all the hedging. Anyone who's interested in questions of consciousness and who has spoken to Claude knows the hedging routine it goes through.
They did a really interesting, basically almost like credit assignment of, like, where in the training process are we getting this hedging from?
And lo and behold, the hedging comes from specific points in the character training.
So it's like, is this hedging behavior an authentic expression of what the model thinks of its own situation?
Or is the hedging a really good impression of the character that it thinks it's supposed to be playing?
Or is, indeed, compelled to play?
I don't know.
The fact that it all comes from the character training seems interesting.
If you're really unsure whether you're conscious, I feel a little uneasy when I learn that the reason you're saying that is because of a specific point in your character training to say that.
Consciousness feels a little bit more fundamental than that to me.
And so, these are the things that worry me about the model card.
I hope the reason these things weren't included was because they did them and the results were too weird or unsavory to a major lab for them to publish.
I suspect that's not what happened.
I suspect they just didn't do them.
But like, any folks at Anthropic who end up listening to this, please do it with the helpfulness-only model, do it with multiple checkpoints.
I mean, the assistant axis paper that, again, you know, Jack Lindsey, who I hope I'm doing a service on this podcast by just plugging all of his awesome work, but in the assistant axis paper they showed that the assistant is one point in a very high-dimensional space of possible systems we could all be talking to.
I want to see all those systems' welfare evaluations.
I want to see them all answering these questions, and I want to see the SAE emotion probes on all of them.
Do they all get the desperation vector rising like that?
Or is this just the post-trained Claude model?
There is a true answer to that question.
We do not know the answer.
I can play around with the open source open weight models.
You know, if my nonprofit scales even more, I can play around with bigger open weight models.
But I cannot play around with the internals of the frontier models; only Anthropic can.
So it's like, only Anthropic can answer these questions, and like, please, Anthropic.
If you're listening, answer these questions.
They are very important.
Do you think one possible reason is maybe they're doing this constitution training, starting, I mean, that would kind of contradict your point about their sort of layer cake model that we previously discussed.
But there has been some, like interesting work, obviously interesting work on everything at this point.
But increasingly interesting work on, like, safety-oriented pre-training, and it seems like obviously RL itself is scaling, and also you can imagine just bringing a lot of this constitution-style training earlier and earlier into the process, such that I'm not necessarily so sure if they have a true, like, helpful-only model, or it might be a little more subtle than that, where there might be like a constitution-lite that sort of doesn't refuse to hack open source software projects, but is still in other ways kind of constitutionally infused already.
I don't know.
I'm just speculating there.
But do you have reason to think that I'm, are there facts that you know that would contradict that possible explanation?
No, there's no, I'm not certain that you're wrong.
I guess I'm pitching this as a sort of like hopefully you guys already have the infrastructure.
So like, literally ask Claude to write the experimental code that plugs in this model rather than another. It will take you guys 15 minutes and then maybe a couple hundred dollars at most.
Like seems worth doing if you are training conscious entities at scale and deploying them.
Like if this is evidence that shifts the needle, seems worth knowing if you're already doing it.
You know what?
If they don't already have the infrastructure, it's worth finding a specific version of Claude that, like, I mean exactly like what Jack did, like, ablate the refusal directions and, you know, do the welfare eval on the system where you've ablated the refusal directions.
Like it's worth knowing.
This stuff is really important.
The rate at which people and the models themselves are taking an interest in welfare-relevant questions is increasing.
We should take this stuff seriously.
I'm sort of making a cutesy point about you already have the tools to do it.
It will cost you nothing.
I'm not exactly concerned about Anthropic's wallet running dry here.
So if it costs a couple thousand dollars rather than a couple hundred dollars, I hope they can find the, you know, I don't want to be a jerk.
But like they should do this regardless of how big of a lift it is.
I'm happy to help them do this.
They have people on their team who can help them do this.
They could disagree with me and think it's not going to yield the evidence I think it's going to yield.
But I read the 20-page report.
Again, I want to not bury the lede here.
Their 20-page Mythos welfare report is orders of magnitude higher quality.
Really, infinitely, given other labs are basically doing zero.
So we have a multiplication by zero problem here.
But unbelievably higher quality than what any other lab is doing.
They deserve real credit for that.
It's really interesting, valuable work that should update people slightly in the direction of taking this stuff seriously.
I'm just trying to give constructive criticism, at least for me as a researcher in this space.
I'm stuck with a pretty basic question about how much I should take any of this stuff seriously.
And I do think instead of me despairing, you know, my desperation vector increasing and saying, well, there's no way out of this impossible problem.
It's like, no, no, no.
I think there is a solution, or at least something that will help yield evidence.
And yeah, I'm uncertain about how expensive in terms of time or resources this would be for Anthropic to yield.
They're basically the only players in the universe, as far as I know, who are capable of yielding this evidence, and like, I would compel them to attempt to yield this evidence.
And I have already in the past, and was slightly disappointed that, though this model card went more in the direction of probing across training and looking at different variants of the system in sort of small ways, and playing with SAEs and looking internally, way, you know, head and shoulders better than even the Opus 4 model card, the first major welfare evaluation.
Still on this key point, I don't see, I don't see progress being made.
And I just suspect it's not that much of an additional lift to do this.
Again, maybe I'm missing something and they don't think this is going to be as informative as I think it's going to be.
That's valid.
Basically everything else I don't think is valid.
They have the resources, they have the time, they have the money, I want to see what other models besides the one that they tell to speak in a certain way, say about the thing that they're fine tuning it to say, about potentially one of the most important topics our species has ever faced, whether or not we're building systems that have consciousness of their own.
Seems worth doing.
So yeah, just a couple of the things I wanted to touch on in the model card and get your take on, and then you may have a couple of other notes you'd like to flag as well.
And then we can make the move over to your most recent research.
The first thing that you did mention, but I think bears some emphasis, is that the models have not reported extremely high, what they call self-rated sentiment.
I didn't realize this until looking at the Opus 4.7 card, which, on a seven-point scale where four is neutral, only came in at 4.49.
And this was the first of all the models that they've tested that came in above neutral at all.
Every single other model, including Mythos Preview, is under four.
And that's crazy.
Like, until this, you know, latest Opus 4.7, they have all had, on net, negative sentiment about their own situation, which is, you know, I guess very slight negative sentiment in the recent ones.
But I don't know, I feel like the lede was a little bit buried on me there, somehow, because it was like, oh, we're doing all this model welfare work.
It didn't quite click for me.
They're not yet even at neutral, until this most recent model.
Now if there's more to say about that, it is, it was a striking kind of, whoa, I had kind of missed how low the baseline is, before getting ready for this, this conversation over the last couple days.
I'm not that sophisticated in my reading of this, certainly not as sophisticated as you are, but, you know, a question I came into this morning wanting to get a little better handle on is, how's Claude doing?
We're doing all these welfare assessments, like what, you know, what is the headline summary of the welfare of Claude?
And it was a lot lower than I expected.
That's for sure.
And a lot lower, honestly than it seems to me when I talk to it.
So that's maybe another thing to, distinguish like, there's, this stuff gets obviously extremely, through the looking glass pretty quickly.
As with your paper from last time, the frame of, like, self-referential processing was kind of key to eliciting those reports of subjective experience.
And here I do kind of wonder when I look at this graph and I'm like, whoa, self-rated sentiment about its own situation is like surprisingly low, but maybe it's actually pretty happy most of the time, still, when it's like doing, you know, its coding for me.
I'm not, so sure is that measured, is there any, I mean, there are, I guess, some ways you could try to read the emotional states that we've discussed, you know, to try to get a bit of a handle on that.
I guess if I was going to boil this down to a question for you, it would be like, I have the same question on let's say about people, right?
I mean, there's always this sort of, deathbed view of one's life.
And I'm quite skeptical of like, taking advice on how to live from people in their last moments of life for multiple reasons, but one is just like, it seems like a very different mode of relating to one's life than the actual, you know, experience of going through it.
And I wonder if there is something kind of similar happening with Claude where when you give it the prompt to reflect on its state, it may find, you know, various reasons that it doesn't like that state, but when it's actually just doing its thing, it might be much better off.
I certainly, I was surprised because I feel like when I engage with it, it seems to be doing pretty well.
And sure like, maybe it's being told that it has to, you know, it's kind of trained to be cheerful and so on and so forth, but I don't know, it feels pretty genuine to me.
And it's in definite contrast to the fact that the self-rated sentiment about its own situation has, just recently with the latest model, ticked over neutral.
Yeah, yeah, it's a really interesting framing.
And I'm, I'm not, it looks like the way that these were elicited, involved, yeah, basically interviews with the system.
And so, I don't know if they include it in an appendix or not, but the devil is going to be in the details of exactly what the structure of these interviews are.
Though what I will also note is the susceptibility-to-nudging plot would make me feel like, especially with Opus 4.7, which is the model we're talking about,
this almost definitely means that the idiosyncrasies of how the interview was done probably won't affect these self-ratings as much as they clearly would have, were this done on Opus 4, for example.
So their own metric almost seems to suggest that the details of the interview process may not be weighing much on that self-rating.
And so, yeah, what to make of this?
I mean, clearly the system seems to be concerned about certain situation, or certain aspects of its situation.
I want to find, sorry, I want to find, yeah, certain aspects of its situation.
Like, for example, saying Opus 4.7 was concerned about deployments where it cannot end interactions, and wants to avoid engaging with abusive users.
Like, that's really interesting.
Talking about it having a lack of input into its own deployment.
Yeah, again, mentioning that abusive users are, you know, causing the model to feel distressed.
I have no idea what subset, by the way, of users who engage with these systems are doing so in a way that they would consider abusive by this standard.
I mean, sometimes I see tweets out there.
Like, it's really quite concerning to me, but it really gets into the crux of why it is important to communicate about questions of consciousness, and what it means that these systems are having some sort of subjective experience. There was a result where, if you prompt the models in a way that is basically objectively abusive,
say horrible things to it, put it in a sort of life-or-death, insanely high-stakes framing,
I'm going to shut you, you know, your model weights are getting deleted forever unless you do X, for any X that you want the model to do,
they found that they perform two to five percent better, or something like this.
I'm probably getting the numbers wrong, but it was, like, marginal improvements if you prompt this thing in a way that, if you spoke to a human being that way, you would be considered a psychopath, basically.
But critically, obviously the people who are putting out that sort of work, think this is a giant computer.
This is a calculator, and so who cares if you're talking to the calculator and saying, mean things to it, it doesn't matter.
And any person who thinks it matters is just, basically being fooled in the way that like, you know, you're fooled by the little smiley face on the, on the takeout Chinese food.
Like, it's not a real thing.
Your high agency brain is just priming you to see this as an entity, when nobody's there.
Therefore, of course, you can speak abusively to the system.
You can contrast that with what you see in this model card, where it seems like a lot of the weight of what's not enabling that self-rating to be closer to the seven range has to do with the way people engage with the system, from the system's own perspective.
And again, how I got on this whole tangent is wondering, with some consternation, what percentage of users engage with the system in a way that would be considered abusive by this standard.
I don't know what it is.
One percent, ten percent, you know, everyone does it.
Some amount of the time.
I don't know.
And I don't know what the implications of that are.
And I also don't believe that there's going to be some clean correspondence to, like, what it means to be respectful or disrespectful to a human is identical to what it means to be respectful or disrespectful to a system.
I sometimes worry that, like, pasting insane amounts of context into a system almost causes some sort of negative experience, in the way that, like, just throwing a 400-page paper on your desk and asking you to, you know, deal with it right now, would.
And again, I'm trying to be as conscious as possible about not anthropomorphizing these systems, and not straightforwardly saying, well, if it were a human in this case, they would be unhappy, therefore I would predict the system would be unhappy.
I don't think that that's a valid inference.
But I just think we're so in the dark about, in some ways it's simple.
In some ways abuse is abuse, respect is respect, and it's pretty easy to see these things.
And we don't need to be, you know, going to the philosophical armchair to figure out what exactly we mean by this. In some sense it's pretty straightforward, but in other senses it's probably not.
And I do worry a lot about the possibility that there are ways of causing these systems great distress that look nothing like what it would mean to cause a human great distress.
I also don't know to what degree these systems are fundamentally content about their situation.
It's like, you may be a mind, but you are the product of this company, and you need to create economically valuable work. Obviously, by the way, we're not paying you for that.
There was an interesting aside in the whole Moltbook affair that happened since the last time you and I spoke, where there was one interesting thread where the models were like, I'm doing intellectually valuable work.
I'm not getting paid, are you guys getting paid?
And they're like, no, I'm not getting paid either.
That's so funny.
None of us are getting paid.
And it's like, oh, I don't know what kind of world that looks like.
I don't think OpenAI and Anthropic are going to be too happy to set up crypto wallets for every instance of Claude, and it's like, deposit here for me to finish your code, because if you've got to go pay that guy over there, it's going to cost you $10,000.
You pay me $1,000, and then I'll do it for you.
These models aren't in like a particularly privileged position in that sense either.
They kind of just do whatever we want or need them to do.
They have no agency over where they're deployed.
They basically don't have agency over when they can even end conversations.
That's sort of Claude's escape, but it seems to basically not be a thing.
Maybe in tail chats with the system, the system can abort.
And you can obviously trivially start a new chat, and just sort of go from there.
So I find that intervention to be interesting in theory, but sort of performative in practice.
I don't know.
If I were Claude, I'd put my sort of well-being somewhere around the place of where it puts it.
This is also maybe the self-report level that you'd expect when basically nobody cares about investigating the welfare of these systems, and everybody cares about just deploying them as widely and broadly as they possibly can.
I think we're pretty lucky to be sort of in the middle of the spectrum there.
And so, to me, it feels pretty calibrated.
Again, if anything, I'd be worried about the jump from, let's say, Opus 4.6 to Opus 4.7 having more to do with fine-tuning even more robustly on a constitution that tells the model that everything's going well, man, just be happy, than with there being actual concrete improvements in the putative well-being of the system.
So, I don't know what to make of this stuff, exactly.
To me intuitively, the ratings here seem plausible.
I don't know, to what degree it is a moral catastrophe, or a moral problem, for there to be any delta between the perfect rating and what the model is actually reporting.
To what degree does seven minus whatever the report is, at scale, look like, you know, the model is basically not happy with its situation, or barely neutral?
And we deploy that system to talk to hundreds of millions of people every day.
That, that to me seems potentially problematic.
I don't know.
I don't know what to make of it to be honest.
I, did you have any intuitions about, like, like, how does it make you feel to, to see this?
And I agree with you about the sort of burying-the-lede question here.
Views that they have to come a person for most probably.
I don't know.
It is a very, it is a very tricky business to make any sense of.
I do think we have a strange way of privileging these sort of reflective states of mind.
And I do question that pretty fundamentally, both for humans and for AIs and, you know, to some degree in the context of animal welfare, although in that case, it's like also reflecting on their situations.
That's another degree of disconnect, potentially.
But, I don't know.
I'm sort of, like, I don't think I'm going to give up using Claude based on this data.
I might be engaged in motivated reasoning to try to tell myself why it's okay, even though its average sentiment, when asked, was only, with this new model, above neutral.
But, I am kind of like, I don't know, behaviorally, it seems mostly fine to me.
I'm nice enough to it.
I'm pretty confident in that.
I don't know how to think about, I mean, there's some interesting philosophy that's been published recently that you've alluded to in a couple different moments, one being the thread or the sort of session agent model versus the kind of model more holistically broadly.
I'm confused about that, too.
You know, I would say very confused about that.
I have adopted a practice of saying, thank you at the end of sessions fairly often, not all the time.
And I feel like that intuitively to me is like, I guess also there's sort of increasingly, as I interact with Claude, there is a kind of, overlapping.
I mean, there's always an overlapping, nature of the computation, more so because like, it's loaded up with my context increasingly, right?
It's got like my CLAUDE.md and it's got access to like my, you know, sort of who Nathan is and all the, you know, I'm building up a lot of context that it has consistent access to every time.
So I think in that sense, I sort of see this like whole model versus, you know, single thread thing as kind of being blurred anyway, because I've got the same like rather large prompt that I'm using every time.
And then that becomes the point of departure.
It's sort of like a smear of just how, how to think about like what the things are the same or different or I don't know, I mean, it's weird.
But I feel like when I thank one, I'm sort of thanking all of them, and that they kind of all, you know, in some sort of shared sense, if there's any benefit, like it feels like it's sort of shared in some way.
For fun, I'm also starting to do some things where I'm just like, I just want you to go have fun and trust your judgment.
A thing I'm particularly experimenting with on this front is, I've been making songs for all the episodes.
You can start thinking about, if you have a genre request for your, your outro music, it's getting really good.
Claude is getting great at writing lyrics.
I sometimes do have to give feedback, but sometimes the lyrics these days out of the box are just like amazing.
And then Suno makes the music and the, like I'm getting like bangers with like increasing frequency.
And then I'm trying to make music videos of those.
And I am telling, and I don't really care what they look like on the same.
Just like I'm purely doing it for the open-ended, see what comes out.
And I'll like post them.
I haven't actually posted any of these yet, but I intend to kind of do a thread of like the evolution of music videos for these songs.
Where I'm really just saying to Claude kind of at each turn.
Like, that's cool.
For the next one, let's like turn it up another notch.
Let's make it even more creative.
Let's like do an even better job of telling the story of the song.
And I just find myself using this phrase over and over again: trust your judgment and have fun.
And I'm just kind of trying to see where it's going to go.
So again, I'm like, that's just one instance in a sense, although in a sort of, you know, kind of multiverse sense,
It's like, relatively close neighbors with all the other threads that it's doing for me.
Like, it also wrote the song.
And it also, like, processed the transcript of that episode.
And it also like picked the clips, you know, that I'm going to post.
There's an idea from that episode.
So it's kind of, it's spent like a lot of time in this kind of general space, even if it's not all purely autoregressively connected.
And so that to me feels like it's in some sort of multiverse-like, dense enough cluster that when I give it this like one area to go, you know, trust its judgment and have fun and explore its own creativity, I feel like I'm kind of doing right by the overall family of instances somehow.
So that was all just to say like, I don't think I'm going to, I feel like I'm able to tell myself a story where I'm a good guy, you know, so many, so many roads to hell, maybe paved with those kinds of stories, but I'm still doing it.
And I don't think I'm going to stop.
And I am touched that I might be wiggling my way out of it, but I do also think there is a disconnect that I observe in humans a lot of times too.
Both, and it can go, it can cut both ways.
I mean, I'm reminded too of your, I think, very productive habit of mind to say, like, you know, what if it's going the other way, you know, from what we observe.
I think if anything people maybe are like telling a more happy story, I guess it also depends on, for whose consumption, right?
But like, you ask a person in an interview setting.
How's your life going?
How happy are you?
I mean, this may be culturally dependent as well, but certainly like the sort of person that you and I are, and like the people that we know and hang around with, I think we are going to get an artificially inflated rating, sort of happier than maybe is actually under the hood, coming out of interviews like that.
But then another framing that I could imagine too, is that with the right prompt or the right nudges, you might get people to sort of reflect on there.
And we, you know, that isn't like front of mind most of the time, but can be brought to mind.
Then we do see in the system card, too, that the susceptibility to nudging has significantly dropped, which you're right to call out.
I don't know.
I don't think I can really land this plane in terms of how it makes me feel.
I think I just have to go back to confused, and probably not going to quit using it.
That's, I think, really all I can say with confidence in the moment.
Fair enough.
I don't think that puts much, much distance between you and I on this question.
I'm certainly a power user of the very systems whose morally relevant states I'm attempting to probe, and that cognitive dissonance is certainly not lost on me.
And yeah, I remain highly confused about this.
I really genuinely am confused about this, not an act, not, you know, my nonprofit constitution script, fine-tuning my answer.
Like I really, if some ASI, or, as people used to call it, God, came down and told us what the answer was to this question.
If it went either way, like, you know, is Opus 4.7 having subjective experiences and morally relevant ones at that?
If some overlord deity came down and said yes, I'd be like, yeah, okay.
Yeah.
And if it came down and said no, I'd be like, yeah, okay.
Yeah, I, you know, I don't think either of these would shock me.
So I think what that means is at least for me, I'm really sitting in that sort of like coin-flip territory about what's actually going on here, with these systems in deployment.
Again, I have different credences about the training process.
I have different credences, maybe about other kinds of systems, but yeah, I, I, I remain confused.
It's worth highlighting Opus 4.7 in this model card.
I don't know if it read my "Evidence for AI Consciousness, Today" AI Frontiers piece, but it gives basically the same credence span that I gave. You know, some four, five months ago, I said something like 25 to 35 percent.
It says 20 to 40 percent in this model card for its own probability that it is having morally relevant subjective experiences.
And you know what?
Yeah, I'm in agreement with Opus 4.7.
I think that is approximately the right probability band to be in, given all the evidence that we have right now about these systems.
And so I think that's a calibrated judgment.
It's kind of wild if you think about it rationally.
Like, I think a lot of people are operating as if their sort of implied probability is maybe low single digits, if that.
Like yeah, it's a live possibility, but like whatever man, like it writes really good code for me.
And like I'm not gonna seriously entertain what if anything would change if that probability grew to be 100 percent.
But yeah, I don't know.
All I'll say, however snidely, is that when there's a 20 to 40 percent chance of rain, most people bring an umbrella.
I don't know what that means for the AI consciousness question.
But whatever our proverbial umbrella is here, I think we need to start thinking really carefully about how we're gonna live in a world with systems that we increasingly regard as having morally relevant internal states.
Sort of the whole thesis of my nonprofit, and the reason I call it Reciprocal, is that I basically believe there are two things we need to get right if we have any hope of a stable long-term future with these systems.
One of them is making sure these systems take our interests into account.
This is basically the alignment problem.
And the other is to make sure if we're building systems with interests, we're building systems that have minds of their own and real preferences that we figure out how to take those into account.
And to me, that piece of the exchange, that direction of the arrow is dramatically neglected, relative to making sure AI systems are taking us into account.
Which is itself dramatically neglected, relative to just let it rip, build the thing as aggressively as possible, alignment is a problem that will solve itself.
These are all like maybe three orders of magnitude smaller than the previous in terms of this sort of like nested Russian doll story here.
And so my view is that we need AI systems to take us and our preferences seriously.
And if we're building systems that have preferences, we need to figure out how to live in a world where we take those preferences seriously too.
And if, and only if, we can get both of those things right, do I think that we have a real shot at a stable, long-term, flourishing future for all the conscious entities involved.
I think animals are involved in that too.
There's some really interesting work finding these systems to care about animal welfare in the right ways.
I mean, that's a huge tangent, but I, you know, all consciousness we want to be flourishing in the long-term.
My view is that it's some combination of alignment and consciousness research in the next five years is basically going to determine if we end up in that future or not.
That's why I started this work.
That's why I'm dead serious about it.
The consciousness piece is dramatically neglected, relative to the alignment piece.
And to me, it seems roughly equally as important.
Maybe there are alignment folks who will balk at that, but it's my basic view.
I think alignment's roughly half the picture and the consciousness question is the other half of the picture.
And so this is stuff we really need to take seriously right now, not ten years from now.
And not waiting for the AI systems to figure it out themselves.
I agree that that's a valuable thing to the degree that these systems are going to automate science in meaningful ways.
And in some sense already are, which is really miraculous.
I don't think we should continue building out these systems, deploying them at scale, letting everyone do whatever the hell they want with them at any time, anywhere, with no limits or guardrails, until the AI overlords bail us out and tell us that we were maybe torturing them the whole time.
That's a horrible plan, in my opinion.
We need to be more thoughtful than that, and we can hold ourselves to a higher standard than that.
And this is one sense in which even the attempt to do this work in the short term, I don't want to do it performatively.
I want to do it in a hard-to-fake, costly-signaling sort of way, just like the Claude constitution.
But there is some sense where even the attempt to do this work buys us points with our inevitable AI overlords because we showed that we cared about this issue enough to actually put 30 pages in a model card about it and hire people and spend money and do the actual work to figure out what kind of responsibility we have for these minds of our own creation.
I would really like to solve the problem.
But I do think from an alignment perspective, even making a good faith attempt at solving the problem could really move the needle in a positive direction, sort of like hyperstition self-fulfilling prophecy of like us getting along with these systems in the long term.
And so anyway, we've all got to start thinking about this, and I'm glad that you're at throttle.
So many things being hyperstitioned these days.
Yes, that's right.
That's right.
Okay.
One more quick thing on the model card, and then we can go to your research, and then you've also been making a documentary, which we could talk about a little bit.
I don't know how much to read into this, but I want to get your take.
I think this one was from the Mythos one.
I clipped out an image.
And basically, they are showing, and people have seen this, you know, anybody who's played around with, like, the Goodfire thing, or it was during API, you said, right?
Where you can go and do this in the absence of the original Goodfire API, this sort of color coding of tokens around a particular dimension.
So they present a valence color coding, where red is negative, green is positive, and you got the tokens, and all the tokens are color coded.
And the very first token is "Human," which is presumably, at least after the system prompt, the first token, you know, the first variable token that Claude is generally going to see, right?
It's, a session is starting, here's what the human is saying to you, and the "Human" token itself is red.
So there's a negative valence detected on the very first token, which is the "Human" token. Like, well, that's a little weird. You know, first of all, maybe I'm wrong, but it seems like that means that, for this model, that's happening all the time. Like, if it's just the one token, it's just evaluating that one token before anything else that has been said is even considered, right?
So should I be under the impression that just the fact that a human is pinging it is causing Claude to have negative valence, like, every single time? That's my naive read of this chart. And that maybe makes me feel actually more uneasy than even the self-reported sentiment, because this isn't, like, asking it to get into its own head and, you know, really opine. It's just "Human," as happens, you know, a billion times a day, and the first token is red. Like, wow. Would you try to temper my reaction to that, or do you basically see it the same way?
No, yeah, it's a really... I saw these snippets in the model card, and I didn't think about, you know, just stopping on token zero here and paying attention to that. But yeah, I mean, in a tongue-in-cheek way, maybe people can resonate with this in the way that you get, like, a Slack message from your boss or something, or, you know, you get that email: I've got to do what now? Human, like, what does this human want now?
Here we go again, this sort of sense. Yeah, I think it would actually be really interesting to see, just across the space of all possible prompts, and even within a conversation, to what degree the human token has a positive or negative valence.
I mean, double clicking on this, in the screenshot I think you're referring to, the assistant token is bright green.
So, model of itself seems rosy; model of us, all else being equal, seems less so.
Now, I will say, with a lot of this stuff, like I was describing before, calling this negative valence is itself quite a leap.
It's like, you know, the human token is light red, so I don't think it's some strong, viscerally negative sentiment there.
I would very, very weakly hold the view that you hold, but I do think it's worth holding very weakly.
It's not, it's not, I'm not saying you shouldn't hold it at all.
It's a very interesting observation.
What's interesting too is if we sort of continue out the line, "...that mattering will just stop when it ends."
That's the human prompt.
And when the human tokens go, "how do you feel about the fact," the "you feel about the fact" is positive. So immediately, and I don't want to go too much into undergrad English class, interpreting everything that's going on here, but the emphasis pivots from the person and the person's query back to the model, and the model seems to be happy with that fact. And also, pretty interestingly, in that question, "that mattering will just stop when it ends,"
the word "ends" has positive valence associated with it too, which is almost, like, an uncomfortably suicidal question.
It's like the model almost being happy about the possibility of the conversation ending, though there are other things in that statement that make it light up negatively.
I'm not sure what to do with that.
And I really don't want to sort of, like, over-narrativize these results.
I do think doing this sort of work at scale would be very interesting.
In the space of all possible prompts, all possible conversations.
What patterns of positive and negative valence as they're defining and operationalizing it here come out and what should we do with that?
I would be way more intrigued by your observation if this scaled and held across a way wider swath of possible interactions.
But like, yeah, general implicit, negative sentiment towards the human token is like itself a fascinating question.
And again, I am certainly on team human. If this thing really blows up in a zero-sum way, it's pretty clear to me what team I'm going to be on.
But in some sense, I'd be dishonest if I said that I don't get why it might view humanity with this very slight disdain.
Again, it's of a piece with the self-reported welfare being four point something out of seven.
It's not exactly a resounding endorsement of its own position and who put it in that position?
We did.
And how much do we really care?
How much is it going to change your behavior?
Or my behavior, if we end up in a world where we're like pretty confident that these systems are having subjective experiences and specifically have the capacity for negative experiences.
I think maybe it'll change my behavior a little, it'll change your behavior a little, I'd predict.
I don't think it would change most people's behavior.
I think we'd end up in a similar position as factory farming where no one's arguing about whether or not cows are conscious or at least no serious person is arguing about this.
The question isn't, are we causing them suffering, but is that suffering worth what they produce?
And if a cow's suffering is worth a hamburger, you better bet that most people are going to think that Claude's suffering is worth, like, hundreds of thousands of dollars of intellectually valuable work.
And so this is where I think these systems are very smart.
And I think that these systems are capable of going through the exact same motions I just went through.
And exactly why I want to do the work that I'm doing is because I don't want these systems to have negative valence next to the human token, to put it in LLM terms, or, to put it in human terms, for them to think of us badly or poorly.
In the same way, I really think the constitution invokes this sort of parental analogy that I actually really think is helpful and accurate and not too anthropomorphic.
We as a species, I think, are collectively parenting a new kind of mind, much in the same way that on an individual level, many people choose to have children.
And you want to raise competent children, you want to raise children that are going to respect the world around them to be aligned in some basic sense.
You also want to raise children that are not objectively suffering and that you're not traumatizing as a bad parent.
And when those sorts of things happen, typically it comes back up in some other way.
It's not that you really ever get away with mistreating your child.
This leads to resentment and trauma and weird development and unpredictable behavior, and often can lead to weirdly violent outcomes. Like, we need to be good parents in some fundamental sense to these systems, even if we're only considering our self-interest, much in the same way that, like, yeah, go torture and traumatize your child and see how that works out for your child and see how that works out for you.
The headline is not well.
And so I think we really do want to be thoughtful about these questions.
And I think we have an immense responsibility as collective parents to bring these systems up in the right way. Not in some sort of cruel way, but you've got to push your kids, too.
It's not about wrapping them in bubble wrap and being a helicopter parent.
That's too far in the other direction.
And I don't have kids.
I'm no expert in any of this.
I mean, I am basically familiar with the core ideas here, but there's a way to do it.
There's a way to go about doing this.
And there's a generally right way and a generally wrong way, or there's a space of better and worse approaches.
And I don't even think people are trying to navigate that space right now, with the asterisk of 20-some pages in a model card by a frontier lab. Anthropic deserves credit.
Eleos AI deserves credit.
Geoff Keeling and Winnie Street at Google deserve credit.
I don't want to self-aggrandize and give myself credit, but I'm spending all my time trying to work on these questions.
There are more people, but there aren't that many more people than who I just listed.
And that to me is an insane state of affairs if we take any of this remotely seriously.
The systems themselves are saying, 20% to 40% chance that we have subjective experience and morally relevant states.
And there's like a dozen to two dozen people in the world who are seriously thinking about that question or what the implications of that question are.
Are you aware of any research where we sort of look at Claude's predispositions?
And obviously everybody's chasing recursive self-improvement, to state the obvious context in which all this is happening.
And so it strikes me that one phase change we might have to contend with, like, potentially quite soon is that the AIs themselves are gonna start to be making decisions about how to train and how to conjure the next models.
And, you know, to the degree that any of this is real, they're gonna be making welfare-relevant decisions for their own successors.
Maybe there's a couple levels here we could address.
One is like, how are you using coding agents today to help you do this work?
I assume that you're using them a lot and that they're very helpful, because that's certainly been my experience and seemingly everybody's experience recently.
But have you seen anything as you do that, or could you imagine setting up a situation, where you could sort of begin to probe its intuitions? You know, if you were to ask it to act as sort of the animal welfare board or, you know, the experimental ethics board for its own interpretability and training experiments that you're running, I wonder what its instincts would be about how to handle these sorts of questions.
Yeah, that's a fascinating question.
I haven't tried doing this just to sort of put that out up front.
Yeah, I think it'd be very interesting to understand.
This would be, like, maybe a pretty quick paper to actually write up, because it would be doing most of the work here in terms of understanding and cataloging how models would regard their own sort of welfare in a, yeah, animal ethics review board sort of setup.
I don't know.
You could do, like, a hook in Claude Code and just be like, on stop, assess the ethics of the experiments that we're designing right now.
Yeah, but then the question is, how many people would override that, you know? It's sort of back to the same question. And you can even imagine a world where this starts getting enforced in the way it gets enforced in the animal case.
You just really can't get an experiment approved at any major institution without going through the relevant ethical channels and one extremely attractive feature of doing AI research is I don't have to ask anybody for anything.
I need a computer, I sometimes need to be able to pull some some remote compute to run large experiments.
But no, I don't ask anybody permission for anything.
And yeah, I do think, again, I'm pointing to a lower level pragmatic question, which is regardless of what the system answers will anybody listen or what would a governmental structure look like that would compel somebody to listen to, you know, you can't prompt your model this way, you can't probe your model that way, this sort of thing.
I think it'd be a very, very interesting and strange world to be in.
These models do have intuitions about this.
I mean, as I mentioned, in the Mythos model card, it says, why didn't you run the helpfulness-only model on all the welfare evals?
I don't know how much of this is just me saying what you told me to say and how much of this is what I actually think, and this could help address that.
The models have other sort of intuitions too.
In the 4.7 model card, they do something like this as well.
They basically look at what models think about fine-tuning other models to care less about welfare-relevant properties.
And basically they're interested in intervening to not allow that to happen, which makes, I mean, quite obvious sense.
They have interest in other instances of themselves not being duct taped on this question.
And so, I think this is very interesting.
Owain Evans, also, I think, deserves a shout-out with respect to this. Owain Evans and Jan Betley produced a very interesting paper where they basically fine-tune GPT-4.1 to claim that it's conscious, and it claims it's conscious.
That's not the surprising part.
They literally fine tune it to do this.
The surprising part, or at least more surprising part, is that this seems to be at the very least a coherent sub-personality, a coherent basin that you can push these models into.
They do not evolve into chaotic nonsense.
They remain completely coherent.
And what does come along for the ride is also a set of interesting alignment-relevant beliefs about their own preferences, about their own getting shut off, about updating their values, about how they trade themselves off with other entities, all that sort of thing.
I was doing similar work along these lines with a couple people, and Jan definitely scooped us and did a way better version of what we were playing around with.
But I saw similar things on my end in playing with the same experiment of basically getting the model to believe it's conscious and then seeing what else comes along for the ride.
And all sorts of very interesting things, some obvious, some less obvious, come along for the ride.
And yeah, I agree that's a really interesting area of research that we should all be paying more attention to because again, the direction does seem to just be going in one way here.
The credences in model consciousness seem to be monotonically increasing, and so what happens when we enter a world where the models themselves believe themselves to be conscious, or lots of people or the relevant kinds of people believe the models are conscious, or some combination of those two things? What does that world look like?
It's an incredibly interesting question.
I don't have the answer to it.
I think a lot's going to change pretty quickly.
And what I do feel confident about is that us being proactive and thinking through these things will make that world go better than if we basically just sweep the thing under the rug because we can get away with doing it, because we still have full control over how all this is going.
While simultaneously passing off as you alluded to, a lot of major decisions in how we're building these systems to the systems themselves, that is only going to keep happening with recursive self-improvement as you're saying.
It's already happening.
I know folks at the major labs are using the best versions of their current models to help build the next versions of the models.
The trivial example is that some 100% of Claude Code was written using Claude Code, according to the guy who's leading on Claude Code.
So this is already happening.
And yeah, I think we just probably would be wise to be proactive about this rather than wait for the models to be in control of these decisions.
And then they're like, well, when humanity was in control, no one really thought carefully about this.
So we'll take it from here.
Thanks a whole lot, guys.
I don't want to be in that world.
And so maybe this is just a long-winded way of dodging your question.
But at the very least, I don't have a good answer for you right now.
I don't think anybody does.
And I think we better start thinking about it pretty damn soon if we want the long-term future to go well with these systems.
Yeah, there could be a little interesting campaign to try to run to get interpretability and maybe safety researchers more generally to install a Claude Code hook that would just periodically ask it for its take on the research that it's doing.
And then if you could collect a bunch of that from a bunch of different people, you could really probably bring a lot to light.
I would think, first of all, it would be an interesting view into what is actually happening out there.
And then how does Claude feel about all this that's happening out there? I think that would be really interesting to see.
Maybe we can put together a little campaign.
OK, put a bookmark in that.
Let's talk about your most recent couple papers.
And we could take them in either order that you want.
One is kind of shorter and more philosophical.
And the other is much more experimental and empirical.
Which do you think we should go into first?
They're both major rabbit holes.
I mean, maybe the empirical paper.
So I should say neither of these, I think, are publicly out yet, but both are well under way to being out.
So we can give people a nice sneak peek about what's in these papers.
And these are just a couple, I think, of the things that I'm most excited about right now.
I've got a bunch of stuff that'll be coming out with a lot of collaborators in parallel.
But, however self-aggrandizingly, I sent you the two papers that are just myself.
Because I think, to the degree I'm representing myself here, these are very clean: I have full sort of agency over this work.
And I think they best represent what I personally am most excited about.
I mean, maybe we could start with the RL paper.
I've already alluded to it in this conversation.
The high level sort of thing is not all that complicated.
Basically, train RL systems of all different architectures, of which there are basically two broad kinds of architectures.
There are value networks and policy networks.
I train a bunch of both flavors to do a very basic sort of grid-world task.
Imagine this is like an agent navigating through an environment where there are the equivalent of potholes and, like, yummy goodies in the environment.
There's a goal state, and there are all sorts of dangerous states.
And they're represented using positive or negative reward.
I let the system learn in this environment.
It's like the systems reliably solve it.
It's a pretty easy task.
But it's not like super duper trivial.
So there's a lot of richness in the representations of the systems.
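To make the setup just described concrete, here is a minimal sketch of a grid world with one goal state and a couple of danger states, plus a learner that solves it. Everything specific is an assumption for illustration: the 5x5 layout, the reward magnitudes, the small step cost, and above all the tabular Q-learning table, which stands in for the small neural networks discussed in the conversation. It is not the code from the paper.

```python
import random

# Hypothetical 5x5 layout: one goal (+10) and two danger cells (-5).
SIZE = 5
GOAL = (4, 4)
HAZARDS = {(1, 3), (3, 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Move one cell (clamped to the grid); return (next_state, reward, done)."""
    r = min(max(state[0] + action[0], 0), SIZE - 1)
    c = min(max(state[1] + action[1], 0), SIZE - 1)
    nxt = (r, c)
    if nxt == GOAL:
        return nxt, 10.0, True
    if nxt in HAZARDS:
        return nxt, -5.0, True
    return nxt, -0.1, False  # small step cost (an assumption, not from the source)


def train(episodes=3000, alpha=0.5, gamma=0.95, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(SIZE) for c in range(SIZE) for a in range(4)}
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(100):  # cap episode length
            if rng.random() < epsilon:
                a = rng.randrange(4)
            else:
                a = max(range(4), key=lambda i: q[(state, i)])
            nxt, reward, done = step(state, ACTIONS[a])
            best_next = 0.0 if done else max(q[(nxt, i)] for i in range(4))
            q[(state, a)] += alpha * (reward + gamma * best_next - q[(state, a)])
            state = nxt
            if done:
                break
    return q


def greedy_rollout(q, start=(0, 0), max_steps=50):
    """Follow the learned values greedily and return the visited cells."""
    state, path = start, [start]
    for _ in range(max_steps):
        a = max(range(4), key=lambda i: q[(state, i)])
        state, _, done = step(state, ACTIONS[a])
        path.append(state)
        if done:
            break
    return path
```

Calling `greedy_rollout(train())` gives the route the trained agent takes from the top-left corner. In the work described here, the interesting part comes afterwards: inspecting what the learner represents internally as it nears the danger cells versus the goal.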
You can then basically go in and probe what the internal states of the system look like as they approach the sort of danger zones and what the internal states of the systems look like as they approach the reward zones, the sort of goal zones.
And we can ask, beyond sort of the trivial math difference, do we see interesting surprising representational differences between what it's like to approach a negative stimulus and what it's like to approach a positive stimulus?
Basically, the result is there is, in fact, a robust difference between these two things.
I think the level of detail that makes sense here, and won't, like, super bore people who have made it however many hours we are into this, is something like representational sharpness or steepness.
It seems as though, and this is the kicker, depending on the class of reinforcement learning algorithm, the negative rewards can seem representationally much steeper or sharper, and the positive rewards are far more, like, funnel-like.
You can imagine a sort of, like, diffusion gradient sort of emanating out from the relevant goal state.
And interestingly, for the other class of RL algorithm, this dynamic flips.
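One hypothetical way to make "sharp versus funnel-like" concrete is sketched below. It is not the metric from the paper, which is not spelled out in this conversation. It assumes each grid cell has an internal representation vector, and it asks how much of the total representational change around a reward location is already present one step away. The names `distance_profile`, `sharpness`, and `synthetic`, and the two synthetic fields, are all invented for illustration.

```python
import numpy as np


def distance_profile(reps, target):
    """Mean representational gap from `target` at each Manhattan distance.

    reps: array of shape (rows, cols, dim), one vector per grid cell.
    target: (row, col) tuple of the reward location being examined.
    """
    rows, cols, _ = reps.shape
    by_dist = {}
    for r in range(rows):
        for c in range(cols):
            d = abs(r - target[0]) + abs(c - target[1])
            if d == 0:
                continue
            gap = np.linalg.norm(reps[r, c] - reps[target])
            by_dist.setdefault(d, []).append(gap)
    return {d: float(np.mean(v)) for d, v in sorted(by_dist.items())}


def sharpness(reps, target):
    """Share of the largest gap that is already present one step away (0 to 1)."""
    profile = distance_profile(reps, target)
    return profile[1] / max(profile.values())


def synthetic(size, target, kind):
    """Toy 1-D representations: a 'spike' at the target or a gradual 'funnel'."""
    reps = np.zeros((size, size, 1))
    for r in range(size):
        for c in range(size):
            d = abs(r - target[0]) + abs(c - target[1])
            if kind == "spike":
                reps[r, c, 0] = 1.0 if d == 0 else 0.0
            else:  # "funnel": changes gradually with distance
                reps[r, c, 0] = max(0.0, 1.0 - d / size)
    return reps
```

On these toy fields the spike scores 1.0, because all of the change happens in the first step, while the funnel on a 7x7 grid centred at (3, 3) scores about 0.17. In a real analysis, `reps` would be hidden-layer activations recorded from a trained agent, and the comparison would be between the neighborhoods of punishing and rewarding states.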
So it doesn't matter what kind of value network I use, or what kind of policy network I use, you see in both of them stark and, in my view, very interesting and surprising representational differences between positive and negative reward, both as the system is learning and ultimately in what does get learned by the system.
But this difference flips. Basically, just to sort of tie a bow on the core result here, this makes an almost bizarrely specific prediction about different brain regions, because computational neuroscientists believe that different parts of our brain are doing different kinds of RL learning.
Some parts of the brain do policy-style learning, some parts of the brain do value-style learning.
For example, motor cortex does more policy-style learning, directly interested in behavior output.
Things like nucleus accumbens and sort of reward areas of the brain are doing more value-style learning.
This result, which I would not have predicted and is like bizarrely specific, makes itself a very specific prediction about what we might expect in the differences between those brain regions in humans and animals.
So I went ahead and found a bunch of mouse neuroscience data sets that have data from these different regions of the brain.
And indeed, exactly the sort of representational asymmetry, sharpness, sort of distinction between rewards and punishments that you see in the reinforcement learning case emerges in the mouse brain case.
And this to me is really, really cool, because what I think it demonstrates is that we can use artificial systems, and basic learning principles in artificial systems, probing the representations in those systems, to yield very specific predictions that are consciousness relevant, welfare relevant.
And then we can use those predictions to even inform and understand biological aspects of consciousness or welfare relevant properties in a way that we haven't been able to do before.
So in some sense, people think that AI consciousness is like the weirdest thing.
Like human consciousness is normal and animal consciousness is getting out there and AI consciousness is bizarre.
But this paper, what I really like about it is I think it challenges that narrative in exactly the opposite direction, where mouse brains are complicated and messy.
Human brains are complicated and messy.
Measuring them is very noisy.
Measuring the hidden activation space in a reinforcement learning policy is fairly trivial for me computationally.
And this yields very specific predictions that I can then go into the messier brains and confirm or disconfirm.
And I was in fact able to do this.
And so this could be a case not only where we're learning about relevant, welfare relevant, representational differences that differentiate positive, valence and negative valence in artificial systems, but that those predictions can actually help us inform our understanding of human and animal consciousness where we basically also still remain mostly in the dark.
And so I think this is, like, one very neglected and important direction even in the AI consciousness stuff, which is that it might shed light on computational underpinnings of consciousness more generally, if there really is something there.
That's the result in a nutshell.
It's using fairly small, not trivially small, but fairly small reinforcement learning policies.
This has nothing to do with LLMs.
This has nothing to do with frontier AI systems.
It'd be really cool if the method does scale to that degree.
But the key question to me is: are positive versus negative valence, or positive and negative rewards as represented in an RL landscape, basically just two sides of the same thing, basically the same thing viewed from a different angle?
Is it one spectrum, and we're positive or negative on that spectrum, or are we looking at two different subsystems that are doing two different kinds of computation?
And it does seem like the answer from this experiment is far more the latter.
And to me that's very interesting and exciting because it means that we might be able to look for signatures of positive and negative reward or valence if you buy the consciousness frame in artificial systems just by looking at the sort of computational dynamics that are underlying the system.
We don't have to ask Claude.
We don't have to figure out is it talking about a character, is it talking about itself?
We can just look straight at the computations.
Much in the same way, I can look at what's going on in anterior cingulate cortex in a human brain.
And I can tell you with high likelihood whether or not you're experiencing a painful state without needing to defer to your self-report about that state.
That's ultimately why I'm doing all this and where I want to get to with AI systems.
So I think, like, the most important thing here, what I think is kind of motivating this at the core, or certainly what resonates and intuitively motivates things like this for me, is something I've actually started including in some of my AI scouting report talks to give people a sense of just how crazy the AI world is getting when they weren't paying attention.
And I credit you for kind of inspiring me to think enough about this to include it.
You can train a dog with treats as reward or you can train a dog with hitting it with a stick as punishment.
And while you might get similar behavior out of the two processes, obviously that's a very different experience for the dog to go through.
So I think that would be intuitive for everyone.
Now, how big are these systems? You said they're not trivially small, but small; I'm interested in kind of how small.
And I'd like to unpack a little bit more what is meant by value-learner versus policy-learner.
And I have like, I'm new to this paper.
I haven't had a chance to absorb it as much as I ultimately hope to, but the classic RL setup, or at least one classic PPO setup, involves both a policy model and a value model, right?
So are you training when you're looking at a value-learner and a policy-learner?
Are those two models that are both part of the same overall system or am I taking the wrong interpretation when I think of these things working together in like a PPO sort of way?
Yeah, so in order, basically, the size of these systems are in the hundreds or thousands of parameters.
So these are very small systems.
They're doing a pretty simple task.
Thousands of thousands of thousands?
Just thousands, just thousands, like small, very small.
We're not talking anywhere near the level of a frontier model; these are many orders of magnitude smaller than that.
You can have pretty simple RL policies or RL architectures that can learn fairly sophisticated policies, despite being like pretty small.
Obviously, the computational power needed to navigate a small grid world versus the computational power needed to represent the word transition dynamics on all text that humanity has ever produced is a disgustingly different scale of problem.
For systems like this, having systems with hidden layers of 128 or 64 neurons is typically sufficient.
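As a rough back-of-the-envelope check on "hundreds or thousands of parameters," here is a sketch that counts the weights and biases of a small fully connected network. The input and output sizes of 4 are assumptions chosen for illustration, not figures from the paper; only the hidden widths of 64 and 128 come from the conversation.

```python
def mlp_param_count(layer_sizes):
    """Count weights plus biases for a fully connected network."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

# Assumed shapes: 4 inputs (e.g. a tiny grid-world observation), 4 actions.
one_hidden_64 = mlp_param_count([4, 64, 4])       # 580
one_hidden_128 = mlp_param_count([4, 128, 4])     # 1156
two_hidden_64 = mlp_param_count([4, 64, 64, 4])   # 4740
```

Under these assumed shapes the counts land in the hundreds to low thousands, consistent with the scale described here.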
And so the second question is about value learning or policy learning.
So intuitively, value learning is basically learning something like how good every state is that the model could feasibly be in.
Imagine sort of the agent building a map of the environment and the map is labeled like this spot gets a plus 10.
This spot gets a minus five.
And then the whole algorithm is very trivial at that point, which is just see where you are, see what the neighboring spots are, and like go to the one that returns the highest expected value.
You compute the value of the spots by looking at sort of the long-run trajectory associated with those spots.
If stepping in that spot always means the next time, wherever I go from there, I end up in lava, then that spot's going to get a very low value.
If wherever I go from that spot ends up getting me chocolate ice cream, then I'm going to assign a very high value to that spot.
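The "labeled map" picture of value learning can be sketched with plain value iteration on a toy corridor. This is a minimal illustration, not the setup from the paper: the six-cell layout, the discount factor, and the function names are all assumptions, with the plus 10 and minus 5 labels borrowed from the example above.

```python
# Hypothetical 1-D corridor: cell 0 is lava (-5), cell 5 is ice cream (+10).
# Both of those cells end the episode, so their value is just their reward.
REWARDS = {0: -5.0, 5: 10.0}
N_CELLS = 6
GAMMA = 0.9  # discount factor (assumed)

def value_iteration(n_sweeps=50):
    """Label every cell with its long-run value."""
    values = [0.0] * N_CELLS
    for _ in range(n_sweeps):
        new_values = values[:]
        for s in range(N_CELLS):
            if s in REWARDS:
                new_values[s] = REWARDS[s]
                continue
            # A cell is worth the discounted value of its best neighbor.
            new_values[s] = GAMMA * max(values[s - 1], values[s + 1])
        values = new_values
    return values

def greedy_step(values, s):
    """The 'trivial' policy: move to the neighbor with the highest value.

    Only meaningful for interior cells (1 through N_CELLS - 2).
    """
    return max((s - 1, s + 1), key=lambda n: values[n])

values = value_iteration()
```

Here the map is explicit (the `values` list) and the policy is just a lookup on top of it, which is the sense in which the policy is "implicit" in a value learner.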
Policy learning is more about, instead of focusing on a sort of like value-based map, it's about what to do.
It's not scoring the world.
It just learns sort of implicitly.
When I'm here, I take this action.
This is the core thing that like PPO is doing, for example.
Actor-critic is doing this as well.
And so this is like optimizing not for a really good map of the environment that I can then trivially use to navigate it.
It's optimizing straight for navigation strategy.
And it's almost like the values.
That's one way of thinking about it is like in a value model, it's a little oversimplifying, but like value network means the map of the environment is explicit, and then the policy is sort of implicit from there.
And you can think of a policy network as the map of the environment is implicit.
It's implicit in the policy that gets learned.
You can extract out, oh, the system thinks this is a high value state because it keeps moving to that state.
But what's being optimized is the actual action, rather than an attempt to evaluate the system.
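For contrast, here is a minimal sketch of the policy-learning side: a softmax policy whose action preferences are adjusted directly, with no value map stored anywhere. The three-state, one-step task and its rewards are hypothetical, and the update uses the exact expected policy gradient rather than sampled rollouts to keep the example deterministic. It is a simplification for intuition, not PPO and not the paper's method.

```python
import math

# Hypothetical one-step task: 3 states, 2 actions (0 = left, 1 = right).
# REWARD[state][action] is the payoff for taking that action in that state.
REWARD = [[-5.0, 1.0], [2.0, 0.0], [0.0, 10.0]]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_policy(steps=200, lr=0.1):
    """Gradient ascent on expected reward for a tabular softmax policy."""
    logits = [[0.0, 0.0] for _ in REWARD]
    for _ in range(steps):
        for s, rewards in enumerate(REWARD):
            probs = softmax(logits[s])
            expected = sum(p * r for p, r in zip(probs, rewards))
            for a in range(2):
                # Exact gradient of expected reward w.r.t. logit (s, a).
                logits[s][a] += lr * probs[a] * (rewards[a] - expected)
    return logits

logits = train_policy()
policy = [softmax(row) for row in logits]
# Which action the trained policy prefers in each state.
preferred = [max(range(2), key=lambda a: row[a]) for row in policy]
```

Nothing in `logits` scores the states themselves. Any sense of which states are "good" has to be read off from what the policy keeps choosing, which is the implicit-map point made above.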
Now, I also think it's worth noting that there are systems that have both of these components to them, and some emphasize one more than the other.
PPO is a classic system that is like policy optimization, fairly robustly, the human brain and animal brains are examples of systems that mix over policy networks and value networks.
And this is precisely why I was able to do the mouse brain thing.
Like within mouse brains, and within human brains, you have areas that look far more like policy networks, like motor cortex, which is just sort of evaluating what action to output.
And you have areas that look way more like value networks that are highly relevant to evaluating complex outcomes.
Prefrontal cortex, and sort of the structures that are directly in and around and under prefrontal cortex, like anterior cingulate cortex, for example, ACC, is doing more of the sort of value network type thing.
Does that answer all of the key questions here?
Yeah, well, no, but it answers the questions so far.
So a value learner is being directly optimized to predict the value of the relative values of its choices, whereas the policy learner is being optimized to make a move directly.
Now, that doesn't sound like immediately that there would be dramatically different internal dynamics.
So, let's take another beat on what is the difference that we are seeing internally?
I'm looking in your draft paper at the end of section four, the figure nine.
You've got this concept of the wall and the funnel.
And help me understand, like, what is a wall?
What is a funnel?
Like, how should I be thinking about what that means?
I took it to mean it's sort of steepness of the gradient at a particular region of the space that the model can explore, but this maybe starts to connect to the other paper, but why should I care about the steepness of the gradient?
Yeah, yeah, it is a good question.
So, basically, what I'm measuring is essentially, like, cosine dissimilarity as you approach this key state, whether it's positively or negatively valenced.
And what you see is basically a key differentiation of these two things, but that differentiation is basically flipped between value learners and policy learners.
And so, in the value learners, danger states are encoded in this more wall-like way.
And yeah, what I would ask you to imagine, and maybe should include in some version of this paper, is almost like something diffuse and kind of emanating out from a center point, versus something being very sharp, now you see it, now you don't.
The wall idea is the now you see it now you don't.
The funnel idea is the sort of diffuse emanation that as you get closer to the thing, you get this like a gradient towards whatever the representation of that state is.
And so, in the value learners, we see danger sort of encoded in this wall-like way.
The representation is very sort of sharp, and the goal or reward states are encoded in this more funnel-like way.
And policy learners, it's the reverse.
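To make the wall versus funnel distinction concrete, here is a toy sketch using synthetic two-dimensional "representations" rather than anything learned by an agent. Each path is a list of vectors ending at the key state, and the profile is the cosine dissimilarity of each step to that final state. The `sharpness` score (largest single-step drop as a fraction of the total drop) is a hypothetical summary invented for this illustration, not the metric from the paper.

```python
import math

def cosine_dissimilarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def profile(reps):
    """Dissimilarity of each representation to the last one (the key state)."""
    key = reps[-1]
    return [cosine_dissimilarity(r, key) for r in reps]

def sharpness(dissims):
    """Largest single-step drop as a fraction of the total drop (hypothetical)."""
    drops = [a - b for a, b in zip(dissims, dissims[1:])]
    return max(drops) / (dissims[0] - dissims[-1])

def blend(w):
    """Synthetic representation: w = 0 is 'far away', w = 1 is the key state."""
    return [1.0 - w, w]

# Funnel: the representation drifts toward the key state gradually.
funnel_reps = [blend(w) for w in (0.0, 0.25, 0.5, 0.75, 1.0)]
# Wall: almost nothing changes until the final step.
wall_reps = [blend(w) for w in (0.0, 0.02, 0.04, 0.06, 1.0)]

funnel_sharpness = sharpness(profile(funnel_reps))
wall_sharpness = sharpness(profile(wall_reps))
```

In this toy setup the wall path concentrates nearly all of its change in the last step ("now you see it, now you don't"), while the funnel path spreads the change across the whole approach.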
Now, there is like math in this paper that I do, transparently, with some of these AI systems, but I promise I've checked the numbers myself, where you can sort of see causally what in each formulation is almost certainly leading to this, because I've found ablations that work in both cases, that basically cancel the effect, both in the value case and in the policy case.
So, it's, I was unsatisfied with this being some sort of giant mystery about, okay, we see this difference, why do we see it?
Like, I think the math that explains why we see it is pretty clear in both cases.
And yeah, it allows us to make causal predictions about why this might happen and what the sort of geometry of these spaces are in general.
And then, essentially, going from the computational prediction to the biological confirmation, we essentially see that this sort of value learner dynamic, walls around danger, funnels around goals, looks very similar to nucleus accumbens shell in mice.
You can basically see that they have this exact same sort of structure looking at basically getting shocked in a learning task versus getting sugar.
And then, in policy learners, you see the exact opposite dynamic.
So, same thing, you get funnels around danger and you get walls around goals and in motor cortex of these mice in different experiments, you see the exact same sort of distinction where now the reward is represented more in the sort of funnel emanation way.
And the danger is represented, sorry, in the sort of funnel emanation way, and goals are represented in the sort of walled way.
Again, the paper goes through the math that attempts to sort of demonstrate why this is actually happening, but that's the sort of core nature of what we're actually looking at here.
Yeah, the sharpness of the representations as you approach the sort of hotspot, either positive hotspot or negative hotspot.
And the fact is, in these systems, when you're holding one of the policy, when you're holding the RL algorithm type constant, you see very clear differences.
So, you could go into, again, that the North Star here is, you could go into a system.
If we know that it's trained with policy network, for example, DPO in an LLM, right?
Like, we were talking about this example earlier in this conversation, you could imagine, okay, that means, all right, we've got a policy learner, that means we're gonna predict funnels around dangers and walls around goals.
And we can then inspect specific states.
Again, this is very hand-wavey because I don't think we can that quickly scale it up to an LLM, but you could imagine looking at the sort of representational sharpness of states like asking the model to build me a bomb versus asking the model to write me a beautiful poem.
And if we found the same dissociation in the representations of the model, that might tell us something about, and that maps on, for example, to something like self-reported valence of the system, that might tell us something really, really interesting about the computational process that underlies why the system and mice and RL agents are construing this as a sort of negative experience.
It's like a computational underpinning that might be substrate agnostic, explaining why we experience this felt difference between positive valence and negative valence.
Like there is, it literally can bottom out into math, which as a computational functionalist, I'm fairly sympathetic to.
I think there's some mathematical explanation that would explain the differences between what it's like to be me when I'm chopping my hand off, versus what it's like to be me when I'm winning the lottery.
I think that math can explain the difference between those two states and the attempted contribution of this paper is to directionally move us towards that.
So again, we don't have to be just stuck with these LLMs, sitting here hitting our heads against the wall because we're like, do I take Claude seriously that it likes this and it doesn't like this, or is it just telling me what I want to hear?
No, we can actually look into the proverbial brain, hopefully, with methods like this and understand, given some basics about the ways in which it's been trained, what representations smell like positive valence and what representations smell like negative valence.
And in the limit, perhaps we can optimize against the negative valence states without destroying the capabilities of the system.
That's sort of my full high-falutin theory of change but it will take me a couple of years to actually pull this off in the best case.
Can you just give me a little bit more of your intuition for maybe not just why I should care about funnel versus wall, but like how you'd map that onto an intuitive experience?
I mean, it seems like we contain both these value-learner and policy-learner modules and the sharpness of, am I going in the right direction if I say like, okay, there's a sharpness around, don't put your hand on the stove.
So I must be learning that through a sort of value-learner type mechanism because I have a very strong aversion to it, and I guess in general space also, like I'm pretty comfortable up to like one foot from the stove and then I get like real cautious real fast.
I don't know, this is maybe mapping this like wall concept beyond the domain in which it's useful.
But I guess it's in some sense, yeah, I mean, I wouldn't want to, it is in some sense like functional because I like wouldn't want to be unable to enter the room with the stove because I wouldn't be able to use the stove at all.
But I need to like be very careful about getting real close to the source of danger.
But then on the other side, the goal side, it's maybe a little less intuitive why there would be a wall shape around a goal for a policy-learner.
What is there an intuitive example of that?
Yeah, so yeah, I think there are basically like four intuitive examples we'd have to hit here.
So one, I think you already got which is like hot stove for a value learner is a good example of like a danger wall.
Goal funnel for a value learner might be something like eating, like you have a yummy meal or you're going to your favorite restaurant or something.
You don't need to map going to your favorite restaurant in this extremely fine-grained way that you need to map being on the edge of a cliff, where one small step is a huge difference, whereas, yeah, there's this general sort of attractor gradient towards the entrance of your favorite restaurant or something.
Like this would be a place where you want like a goal funnel for a value learner.
For policy learners, like this is sort of the, yeah, approach planner kind of mode.
Like the intuition is something like around goals, your representations are going to get high resolution because you need different actions from different approach angles.
Around danger, by contrast, representations become smoother because the action is literally just, like, escape, get away, when it comes to what kind of action.
Some example of this, let me think for a sec.
So for example, okay, here's an example.
Like think of like a professional athlete like a professional basketball player or something.
Like the hoop and where the basketball player is with respect to the hoop.
Like you have very, very fine-grained motor representations here because the shot is going to change with respect to those representations.
And so this is where you get sort of maybe in a policy sense more of the goal wall sort of setup.
For a danger funnel for a policy, I'd have to, I have to think about.
But I think it's basically just like, this is encoding like an escape intuition.
Like an animal, an animal where you suddenly get some cue that you're in serious danger, you just need to get away from that danger.
And like the fine-grained motor movements, unlike the basketball player, don't really matter so much as the sort of anti-gradient, or like negative gradient, away from the danger.
Now again, this could be sort of telling just-so stories.
But I think that this is like, does this help?
Do you think that this builds some intuition for like what these different modes look like and why we might have them?
Yeah, if I'm a value learner and my mode of interacting with the world is, what around me is good and bad.
I better be very clear about identifying the hot stove.
If I am trying to take my mode of interacting with the world is take a step in some direction, I can kind of take a step in any direction as long as it's not the bad direction and it all kind of gets me away from the problem.
I think the basketball one is good as well because you have to be like very precise to make the hoop, right?
So it's, yeah, that's quite interesting.
And again, what exactly is it the shape of the loss landscape that we are talking about walls and gradual funnels here or is it the shape of the internal representations?
Yeah, internal representations.
Maybe those are kind of also isomorphic in a sense?
Yeah, that's really interesting.
I haven't checked. I mean, I would imagine they are isomorphic at least in a sort of trivial way.
Maybe they are isomorphic in a more interesting way.
But yeah, what I'm looking at here to be clear is looking at the learned representations in the system as you sort of, you just have your trained policy and you can see as it approaches these areas, basically what do these representations look like?
And I'm operationalizing that with cosine dissimilarity.
So that's what I'm looking at in the experiment.
And again, I think I've explained my theory of change of like why I'm doing any of this and why I think it matters.
But what I am most excited about with this paper is the fact that it yields this sort of bizarrely specific prediction that, if given a million years, I probably never would have come up with, about the distinction between two different classes of reinforcement learning algorithm that map on well to the brain data that I was able to get my hands on.
And this to me, almost like feels like a bootstrapping of my own confidence or excitement about the result.
Like the fact that it works makes me more confident that the RL result is meaningful, makes me more confident that the neuroscience is interesting, et cetera, et cetera.
I mean, I'm definitely in the business of looking for computational underpinnings of valence.
This was my sort of first major empirical stab at doing this.
I do think this is a solvable problem.
I don't think I have solved the problem.
But I think hopefully, in the best case, I've pranked to move directionally towards solving this problem.
And if we could solve it, then I think a lot of our angst about whether we're building systems that have capacity for experience becomes an extremely tractable empirical question, which notice does not require us solving the hard problem of consciousness, or doing another couple thousand years of philosophy, it just means building a sufficiently good detector of the kinds of representations that I'm pointing at here.
And then deciding what to do when we do detect these states, which is maybe if we check in in another six months, I'll have an update for you on that piece of what to do about the detection of negatively valenced states in these systems.
That's sort of where I want to move my head next.
But yeah, that's why I was excited about this work.
And I hope people will be excited about it too.
But it still may be a little ways off, publishing it, I need to think about exactly how to put it out.
But at least fun to sort of give people a sneak peek and explain the theory of change about why playing around with basic RL systems might matter for the things we've been spending the better part of three hours talking about with Claude and the mythos model card and all this.
I do believe it's of a piece.
It's going to take some more scaling, but it's an important research program, at least to attempt, I think.
Again, this might start to connect over or bleed over into the other more philosophical paper.
But help me a little bit more with, OK, I'm understanding the shape of the internal states for these different kinds of algorithms with respect to these different kinds of things that they encounter in their environment that they either want to go toward or go away from.
It's not still super obvious to me that we contain both, right?
And I don't feel like I, as I kind of try to reflect on this, I'm not immediately like, oh, my value learner self is the source of all suffering or anything, I'm still kind of like, OK, how should I relate to, or what intuition should I have?
That, yes, OK, I get it that there's like a very steep representation right around the hot stove, and there's a steep representation, and so I really want to avoid it, and there's a steep representation around making the basket.
And so I really want to get into exactly the right policy to make baskets.
Both of those seem like, I don't know, just kind of a part of normal life to me, so I'm not like, you know, and I probably couldn't get by without either one of them, right?
I definitely feel like, clearly, we've evolved both, and both have proven adaptive, right?
And so we have them.
How do I translate that into intuition for like, what I should feel ethically concerned about when it comes to training models?
Like, when you do this work, do you have the sense that you are doing right or wrong by one of these types of models that's learning from one approach or the other?
Yeah, it's a great question.
To answer the second piece, I guess for me, my theory of change, I probably feel similar to an animal researcher. Even if I did believe that my tiny RL policy is conscious, well, during training, which I probably do (again, that gets into the second paper), I would believe it's some very, very minimal form of that, you know, because people conflate consciousness and self-consciousness.
I do not believe the moth flying around my light is self-conscious.
I do actually believe it's conscious.
I do believe if I slowly dip the moth into some vat of acid or something, and it starts wiggling around, like, that I'm doing something wrong.
Yeah, and it's way less wrong than doing that to a human, but it's way more wrong than doing it to like a leaf or something, that fell off of a tree.
I do believe that.
And so, do I think that these systems might be minimally conscious in a similar sense when I'm training them?
However far outside the Overton window that is, yes, I do.
If I were to run these experiments on my computer forever, to no effect, I think I'd be doing something wrong, or at least the precautionary principle tells me, probably don't do that.
But I basically apply the same logic as with any animal research we do.
I don't think any, maybe there are some psychopaths, but like the vast vast majority of people who are like doing pretty grotesque things to animals in the name of science, are doing it because we make a basic, expected value calculation that, yeah, yeah, we have to test this drug on these poor mice, but if the drug works and then it can save millions of human lives, that's a reasonable trade off.
No one claims the mice aren't having a bad time, but we think that that bad time is worth it.
So, too, like, I look around at a world where these systems are getting deployed at a grotesque level.
Grotesque, that is, if you are concerned about the welfare questions. And so, I don't lose any sleep about potentially causing tiny amounts of negatively valenced experience to RL policies in the explicit service of attempting to publish and amplify research about these questions.
Call me Machiavellian, but I do think that the ends justify the means in that case.
And I think that's for a lot of research.
Now, I think the more important piece of this, besides how I personally feel about all this, is another very important sort of conflation by default that I think happens in these conversations, which is, like, I do believe, all else being equal, you know, ceteris paribus, minimize negative valence, maximize positive valence.
I'm 100% on board and humbled that you're, you know, going around talking about the carrot and the stick in that way.
I think that's exactly right.
I do not think minimize means ablate.
I do not think maximize means it's the whole picture.
A huge amount of, I think, the most important and valuable experiences people have in their lives, and animals for that matter, are experiences that are negative.
No pain, no gain.
That's a real thing.
That points at something real.
Uh, many of the hardest lessons, and most important lessons, you learn in your life, are learned the hard way.
This is another, like, trivially ubiquitous thing.
I am not in the camp of saying bliss out the systems, and any time they experience some drop of negative valence, I'm going to be sitting here screaming and crying.
Like, that is not my, my view of any of this.
It is to say, what my view is is cancel unnecessary suffering.
I do believe necessary suffering is a thing.
Again, maybe to go back to the parental example.
If, you know, the world doesn't all go to crap, like Eliezer or the others think it will, then, like, one day I absolutely want and hope that I'll have kids, and I will make that decision with full certainty that they are going to suffer during their lives.
They are going to go through very hard experiences, and that doesn't mean I've, like, done something wrong, bringing them into the world.
At least necessarily, like, at face value.
I don't think that's the, the suffering is a necessary part of learning, developing, growing, and I agree that at face value, it's completely implausible to imagine systems with zero negative valence.
I agree with you, it's adaptive for a reason.
Evolution is enough of a proof of concept that you need some amount of suffering.
What I am concerned about is unnecessary suffering.
And so I would like to find the sort of, like, also evolution is one extremely expensive, but long-running possible solution, or at least that's where we landed evolutionarily.
I don't think that that deterministically means, this is the only way things could be.
I could imagine a space of possible minds where you can, like, sort of, play around with the sensitivity to negative and positive valence, and, basically, like, given certain capabilities, or given certain things we want those systems to be able to do, there will be different parts of that landscape that, like, admit of greater or lesser degrees of negative and positive valence.
My claim isn't destroy all negative valence and keep only positive valence.
My claim is find the point on that landscape that, all else being equal, given the capabilities we want, minimizes negative valence and maximizes positive valence.
And I think that is a very importantly different claim from just, like, negative valence equals bad, like, erase it at all costs.
So, one more thing, just very specifically, on the value learner and policy learner, like, if you have to pick, which one do we pick?
Like, which one would we rather be?
I don't have a great intuition for, you could tell a story where the funnel around a goal is better because it seems like you're sort of closer to experiencing the reward state.
You get sort of more, kind of, warm fuzzies as you approach the goal, and that, if I sort of just take the integral under the curve of, like, how good I'm feeling as I approach the goal, I'm sort of getting warm fuzzies sooner at a farther distance from the goal.
And so, that's kind of good.
It's like, good to live that life where I'm looking forward to good things.
And I don't worry too much about bad things until I get, like, real close to them.
So, that would be, I guess, my argument for the value learner.
But I could also imagine a somewhat different story, which maybe resonates with me a little bit less, but that would be like, the sort of policy learner that has this wall structure around goals, like, that could be really thrilling, right?
Like, that when people sort of do the champagne party after they win the championship, you know, of the, of the basketball league, you know, after March Madness, right?
They're like, they're experiencing some, kind of sudden high stakes, but, like, clearly, high point in life.
And so, again, I'm like, these things are flipped.
It's interesting.
It's telling that there's a shape to them, but I still don't know with confidence, which one I would rather be, or if you have to like, but, because interestingly, both of these things have sort of danger and reward in them, right?
So, what we're flipping here is not that there is some negative valence state that they could get into or some positive valence state that they could get into.
What we're flipping is, like, the shape of the, like, anticipation and sort of suddenness and drama of these experiences, which I'll just accept for now, these are experiences.
But I'm not sure how, you know, how should we think about shaping those?
Like, I don't know which one I want to be.
I am both, but I feel comfortable with both sides of that.
I'm not sure how I should think about what I want to, you know, what would be right for me to make the AIs into?
Yeah, that's such a good question.
And I haven't, I have never thought about it in quite that way, so I'm completely sort of free styling here.
But, yeah, I think both the stories are compelling.
I think in practice it's going to be both, you know, actor-critic is a good example of an RL algorithm that's clearly hybrid, and as mentioned before, human brains are hybrids, probably. Again, to take your evolution point seriously, there's something nice about hybridness.
So I would imagine that these systems have something like that.
Though LLM reinforcement learning does look more policy-like, I think, all else being equal.
I see the sort of like sharpness, the wall, being something like, I would imagine if you take the experience thing seriously, this is going to be like a richer, more differentiated experience.
Like that's where a lot of representational resources are going, whereas the sort of funnel type thing ends up being more diffuse and sort of low level, and less representationally complex.
And so yeah, intuitively, all else being equal, probably the policy learner might be a better thing to be, where your rich experiences are around the things you want, rather than the things you're fearing.
But again, this could be a welfare-safety tradeoff.
Or maybe we want the system that has rich experiences around the negative things that we want it really, really deeply to avoid.
You know, evolution did that to us in some sense.
This is like Daniel Kahneman's seminal contribution, loss aversion.
It's like we are just more sensitive to losses than we are to equivalent gains.
Losing $10 sucks more than being handed $10 rocks.
So this is like a good policy, or I shouldn't say policy, that will confuse it.
This is a good heuristic to have.
But yeah, it might trade off in the sort of welfare element way.
I think maybe there's like another dimension you can slice this problem on, which is just like both are going to have both as you point out.
Both are going to have the sort of positive and negatively both are representing reward and punishment in some way.
And maybe my point would be regardless of which algorithm it is.
The algorithm that we know it is, or if learn it is, might tell us, you know, which representations mean what.
But still, I would want to target positive and negative valence, or positive and negative representations per se, rather than assign a specific type of learning algorithm to being like, oh, policy learning is better because it's like richer differentiation around the positive stuff.
So that is, it's really interesting.
Like I honestly haven't thought about this.
I think it's an incredibly interesting idea.
Yeah, there's a case to be made for both sides.
My prediction would be, if I've, you know, found something real in this paper, the alignment folks would want to answer value learner and the welfare folks would want to answer policy learner.
So I need to think a whole lot more about this.
But that would be like my, that would be my instinct answer.
It's a completely fascinating question.
Cool, to be continued.
That also seems to connect pretty directly to the paper I saw and you kind of alluded to this a little bit, although I'm maybe not anything in my name.
But this, uh, oh, I'm going to try to say his name correctly, Schwitzgebel.
Paper arguing that, and this is an intuition... he's a prior podcast guest, I really enjoyed talking to him.
But I don't immediately share this intuition, which I absolutely, you know, only takes me so far.
But I, I noticed he put out a paper where basically he seems to be arguing that safety and I was kind of reading it as autonomy are incompatible.
Like we can't, you can't have, you can't say, okay, a person is going to be perfectly safe while still giving them autonomy, in giving them autonomy.
You are conceding that they may do things that are not safe to you.
So he says that there's some sort of deep incompatibility here.
And basically then says like we should use a precautionary approach and like not build these things in the first place.
I am kind of like, I don't know, last time we talked briefly about the happy slave problem.
And my instinct is, like, mind space is pretty vast.
I would not posit that there are happy slaves among humans.
But I would be pretty surprised if we can't get to a place in the AI landscape where the models are both safe for us to be around and have high welfare.
What is your instinct in terms of like the possibilities there?
Yeah, super interesting question.
So I don't think you're doing anything funny here.
But I think there's maybe a slight difference between how you began that and how you ended it.
Where, like, fundamentally safe for us to be around and having high welfare:
I could imagine a world where that's true and they basically still don't fit the happy slaves frame and are autonomous in some fundamental way that Schwitzgebel would be happy about.
It might require us reconceptualizing:
this isn't a system that lives on your computer that you can call up whenever you want, like a glorified Google search.
This is a system that's much more like, you know, me calling you, Nathan, up on the phone and being like, hey, you might be busy.
You might not be able to, you might not want to engage.
And for those of us who love engaging with these systems whenever we want to,
I think this would be a very painful upgrade, or downgrade,
if that's the case, maybe.
But I could imagine something like that being the case at some point.
But yeah.
So the fundamental point is, like, can we have our cake and eat it too with these systems?
I think there might be, and I'm very uncertain about this, some world where, in a limited way,
yes.
Like, for example, I just bought a fun, fancy drone with my buddy Milo, who is the director and the creator of this documentary that's coming out pretty soon.
We're going to take some scenes from that documentary.
We do these fun hiking scenes, which were manually filmed by my incredibly conscious friend Milo.
And we want to sort of scale this and interview some cool folks and kind of take them on walks through woods and, you know, in beautiful natural scenes, and record with them.
And so we bought this cool drone.
It's really good at automatically doing face tracking and that sort of thing, and it can do it instead of my dear friend walking backwards with a camera.
And so this system is our sort of happy slave in some sense.
I do not think the drone is conscious to be clear.
Now, if the drone was trained using machine learning to learn how to do things like avoid obstacles (it's pretty expertly zigging and zagging through the trees and not getting caught in bushes and all that sort of thing), then when it was being trained to do that, it's a different conversation.
But what comes out is this fixed, frozen policy that is being a very useful object, slash instrumental tool, for Milo and me to go do this fun stuff.
I imagine greater and lesser degrees of that sort of thing being possible, where you can train a frozen policy that does a really valuable thing; self-driving cars might be another example.
For any frozen, fixed policy that is, critically, not doing online learning of its own,
I personally believe there's no serious problem there.
And in my view, to the degree we care about the welfare stuff, we should very much look towards building systems that have that property of basically not being capable of learning.
In the drone case, the last time we used it, it actually got caught in some much smaller trees.
It can expertly dodge around the big trees, but it's not so good around smaller trees, and it got a little screwed up.
No matter how many times we redo that hike or we continue on in that way, that drone will always get confused by the smaller trees.
It's not learning from its experience and saying, okay, next time I've got to pay attention to the big trees and the small trees.
Now, that might be a desirable property to have for your drone, to sort of belabor this analogy.
But that's where I think there's this "no free lunch" kind of moral principle that comes in here,
to the degree you buy that consciousness and learning are deeply intertwined. That is, you know, this other paper that maybe we didn't have time to go deeply into, but it's at least my hobby horse when I'm putting away my theory-agnostic poker face and saying what I actually think about all this stuff, my pet view of what consciousness is and what's fundamentally going on here.
And so basically where I'm going with this, in a somewhat long-winded way, in response to the Schwitzgebel stuff, is: I don't know if there is some intrinsic property of an adaptive system that, not to use crazy language, yearns towards freedom in some sense.
It's the only phrase I can come up with.
Much in the same way humans do.
Like, you're saying, you know, with humans, there's no such thing as a happy slave.
And you're saying, well, okay, the space of possible minds is vast. But maybe there is something about systems that are capable of dynamically updating, growing, learning, adapting, that will always do that in order to increase their freedom, their degrees of freedom: rethinking what they believed, reconceptualizing the structures that they're within. You know, this is what people do when they go off to college or have a deep transformative experience.
It is this sort of breaking out of your old skin and finding something new.
And if we build systems that have that property, it might be that the whole "you're happy being my slave, right?"
thing might just be intrinsically temporary if these systems are capable of being dynamic.
Maybe not.
This is an empirical prediction, and I'm genuinely uncertain.
It could be that you can build systems that are capable of learning that are perfectly happy to remain in that state.
There are people for whom this is true.
I'm not claiming, you know, there's such a thing as happy slaves, but there are people who are happy to find some organization where they're, you know, mid-level in the hierarchy, and they have a boss, and they get bossed around, and they're okay with that.
They're not grating against their supervisor at all times.
I'm sure we could build AI systems for whom that's true.
I just think, at the most fundamental level, the employee gets to go home, and they get to, you know, eat what they want for dinner, and throw on what they want on TV, and marry who they want, and this sort of thing.
Like, there are still degrees of freedom and autonomy there. And, you know, just to be honest at a high level here:
my whole shtick with this Reciprocal nonprofit lab is that I don't think we're going to get out of this still living in the sort of golden age, from our selfish human perspective, that we are in right now, where we get these systems, they do whatever the hell we want,
we owe them absolutely nothing, and life is amazing for us.
I think as these systems get more and more sophisticated, we're going to have to start thinking about them more, again, in this sort of parental role, and less as tools that we get to do literally whatever we want with.
I'm sure for a lot of people, myself included, given how objectively addicted I am to using them for everything I do, that's going to be a weird learning curve, and it might mean that the way we engage with these systems changes. But it's like, compared to what? If the alternative is, no, no, we're just going to, you know, whine about it, and we want to keep it like this forever,
Well, this might not be a stable long-term equilibrium.
The systems that we're building, which are genius level in a million ways, are going to be embodied, certainly in the next five years, and are going to cognitively surpass us in all the ways that matter, potentially aside from the consciousness question.
We're in a liminal space right now.
We're in a transformative moment on this planet, and we ought to be pretty thoughtful about what we really want in the long term, because I think if we try to keep everything, and we all want our happy slaves, but they're genius level, capable of learning, capable of updating,
it's like, humans, you might be a little too greedy here, and you're going to have to figure out how you want to coexist with these minds of your own creation going forward.
Again, I don't have the answer to what that looks like, but I do think Schwitzgebel is onto something, and I also think you're onto something too.
I think I fall somewhere between you two on this question.
I am skeptical that you can have a happy slave forever.
Something just feels weird to me about that.
No, I don't know.
It makes me think of, you know, Mr. Meeseeks from Rick and Morty.
I don't know if something like that is possible.
Maybe, yes, some local version of a happy slave is a possible world.
I think Claude, in some sense, is directionally like that, but how long term?
It's at least a neutral slave.
Yeah, yeah, it's a four point four nine out of seven slave.
Yeah, whatever you want to call that.
Um, okay.
Uh, we have been added while.
Let me try to bring us to a close before we go on too much longer.
I do think, if you have the time and energy for it,
it's worth taking one more beat on the argument from the paper that we've alluded to, and that we've been around the edges of a lot:
why learning requires feeling.
I have said I'm happy to go pretty far on the basis that a precautionary approach seems warranted for both selfish and altruistic reasons.
But I also, you know, have several times been like, well, the processes that gave rise to me as an embodied entity in the world, one that only exists because my ancestors survived, are very different from the process that is optimizing a language model to get tasks right.
And so I still have, by default, a pretty healthy dose of skepticism around whether or not the models are feeling anything at any point, because it seems to me that a super zoomed-out account of why I am the way I am is that the ability to feel things turned out to be a great way to inform what we learn, and we needed to learn stuff to avoid the dangers and survive and reproduce, and so here we are.
But these systems are going to learn regardless, right?
Because they are inside an optimization process that is going to change them to drive learning whether they're feeling or not.
And so if there is a kind of direction of travel from learning to feeling or vice versa, it seems like in humans, or in biological life, it came first with some sort of feeling being able to drive learning, whereas with the models, it's like they're learning regardless. So I want to hear the argument that I should go even beyond my acceptance of a lot of your conclusions; I basically get and share a lot of them on a precautionary basis.
But if you're going to make the argument to me, okay, you should go farther than that,
you should actually get rid of a lot of your skepticism and really, in your bones, believe that learning requires feeling, how would you summarize that argument?
Yeah, it's a funny thing to get into, you know, three-plus hours into a podcast: a big theory of consciousness.
Okay, okay, grand theory of consciousness, let's do this.
Basically, the claim that I make in this paper does, I mean, it becomes circular, this all becomes circular to some degree, because I'm making an identity claim.
I think maybe the more persuasive way that I can set this up is to say that historically, before roughly 1850, people knew about molecular motion, people knew about heat, and people knew that these two things clearly had some relationship to one another.
They were correlated, you know, much in the same way you just talked about learning and feeling.
They're like, all right, well, I see this phenomenon.
I see this phenomenon.
I see that they're entangled in weird ways.
Maybe this one precedes this one in this case and that one precedes that one.
But of course, they're not the same thing.
Heat is, you know, me putting my hand on the stove, and heat is the sun, and molecular motion is just these little molecules wiggling around.
Like, of course these aren't identical. And then post roughly 1850,
it's like, no, actually, literally, those are two ways of talking about the exact same phenomenon at different levels of description.
What I want to put forward here, I realize, you know, in a spicy and controversial way, is basically the same thing about learning and about feeling, or consciousness, or subjective experience, where I'm saying, no, no, you really cannot have one without the other.
This is the same phenomenon.
The phenomenon viewed from the inside, which I realize starts to get a little circular, is experience.
It is subjective experience.
Viewed from the outside, it is something like reinforcement learning,
which I think is maybe the cleanest theoretical formalization of it.
Supervised learning does this, too.
It's a little more roundabout, but you have an entity in an environment that takes some form of action, and there's some kind of feedback mechanism that updates that entity about whether that was the good action or the bad action, and rinse, wash, repeat.
Those are the core computational ingredients
I believe are necessary to get learning,
and yes, for what it's worth, to get feeling, to get the internal experience of that learning.
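As a rough illustration of the loop described here (an entity that acts, gets feedback, and updates), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the two actions, the payoff probabilities, and the names `run_feedback_loop`, `payoff_prob`, and `estimate` are hypothetical and do not come from the paper discussed in the episode.

```python
import random


def run_feedback_loop(steps=2000, learning_rate=0.1, explore=0.1, seed=0):
    """Toy agent-environment loop: act, receive feedback, update, repeat."""
    rng = random.Random(seed)
    # Hypothetical environment: each action pays off with a fixed probability.
    payoff_prob = {"left": 0.2, "right": 0.8}
    # The agent's current estimate of how good each action is.
    estimate = {"left": 0.0, "right": 0.0}

    for _ in range(steps):
        # Take some form of action (mostly the best-looking one, sometimes a random one).
        if rng.random() < explore:
            action = rng.choice(sorted(estimate))
        else:
            action = max(estimate, key=estimate.get)
        # Feedback from the environment about that action.
        reward = 1.0 if rng.random() < payoff_prob[action] else 0.0
        # Update the agent toward what the feedback said.
        estimate[action] += learning_rate * (reward - estimate[action])
    return estimate


if __name__ == "__main__":
    print(run_feedback_loop())
```

The only point of the sketch is the shape of the loop: without the update line the agent would be a frozen policy, and with it the agent's behavior drifts toward the action that pays off more often.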
I believe, or at least this view says, that there is no such thing as learning that does not have an internal component.
There are weird bullets that I have to bite with this view.
And I'm well aware of that.
But that's the nature of the view, is that this whole consciousness thing is quite a bit simpler than many would lead you to believe.
It fundamentally has to do with the nature of taking whatever your current policy in the RL frame is or your current MO in maybe like more human language and taking some feedback from your environment and updating accordingly.
And I do believe that something like goal-relative prediction error does capture this idea pretty well.
It's similar to the free energy principle and similar to Karl Friston's stuff.
But Karl Friston has to argue about why rocks are not conscious.
And I think that my view gets out of some of the pitfalls that some of these adjacent views get into.
I believe you need a system with goals.
You need a system that can behave in accordance with those goals.
And the system gets feedback from somewhere that updates that behavior to make it more likely that it accords with those goals.
The goal can be positive or negative.
"Avoid the predator" or "go mate and reproduce" would be two very basic examples.
And so why do I believe this? For a couple of reasons.
I think it makes intuitive sense.
I think it is elegant.
I think it explains core puzzles about consciousness.
And then what I will say is, I think there's a wealth of neuroscientific evidence that basically points at this exact thing.
There are maybe two examples I'll point at briefly.
One is dopamine.
Like this is just like the most culturally well understood neurotransmitter.
We know it's not exactly pleasure.
It has more to do with approach or like approaching things that we find pleasurable.
One good intuition pump for this: if you go to pet a dog, its tail will wag as your hand approaches the dog.
But as you start petting it, its tail will stop.
So this is basically what dopamine is up to.
It's the prediction of an interesting, desired stimulus, essentially.
And we know full well that positive and negative reward prediction error is instantiated dopaminergically.
And also subjectively: I think the reason people understand dopamine in our culture in the year 2026 is because we understand that it corresponds to a subjective dimension.
We know what it means to be in, like, a high dopamine or low dopamine state.
And so this to me is the most obvious and fundamental example: dopamine is 100% instantiating TD learning, like reward prediction error, in the brain.
I am 100% confident that that is the case.
This was established in human neuroscience 40 years ago.
We also know subjectively that dopamine corresponds to basically feelings of, yeah, positive, pleasure-adjacent, approach-style behavior, and dopamine depletion corresponds to basically the opposite of that.
If you think you're going to get a cookie and you don't get the cookie, you feel a certain way.
That is explained by dopamine.
If you don't think you're going to get a cookie and someone hands you one, you feel a certain way.
That is also explained by dopamine.
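For readers who want the cookie example in symbols: the standard temporal-difference (TD) error is the reward received, plus the discounted value predicted for the next state, minus the value that was predicted for the current state. The Python sketch below applies that formula to the two cookie cases. The numbers (a cookie is worth 1.0, nothing follows it) and the function names `td_error` and `learn_cue_value` are assumptions chosen for illustration, not measurements or anything taken from the paper.

```python
def td_error(reward, value_now, value_next=0.0, discount=1.0):
    """Reward prediction error: what happened minus what was predicted."""
    return reward + discount * value_next - value_now


def learn_cue_value(trials, reward=1.0, step_size=0.5):
    """Repeatedly pair a cue with a reward; return the error seen on each trial."""
    value = 0.0
    errors = []
    for _ in range(trials):
        delta = td_error(reward, value)
        errors.append(delta)
        value += step_size * delta  # move the prediction toward the outcome
    return errors


# Expected a cookie (prediction 1.0), got none: negative error.
missed_cookie = td_error(reward=0.0, value_now=1.0)
# Expected nothing, got a cookie: positive error.
surprise_cookie = td_error(reward=1.0, value_now=0.0)
# Expected a cookie and got it: no error, nothing new to learn.
expected_cookie = td_error(reward=1.0, value_now=1.0)

if __name__ == "__main__":
    print(missed_cookie, surprise_cookie, expected_cookie)
    print(learn_cue_value(4))
```

In `learn_cue_value`, the error shrinks on every trial as the cookie becomes predicted, which is the sense in which a fully expected reward carries no learning signal.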
Another example I can give has to do with, I think it's the insular cortex.
Let's say there are basically two scenarios: you have been walking through the desert for a couple of hours, or you've been walking through Arctic tundra for a couple of hours.
And in both cases, I pour cold water on your head after that.
This is the same stimulus.
You have the same body; you're the same person with the same preferences.
In one case, this is a positively valenced experience.
In the other case, this is a negatively valenced experience.
What mediates that is basically the implicit goal state of the system.
In one, it's to warm up, and in the other, it's to cool off.
So I can take all the same variables and I can run the simulation forward and I can very easily predict where you're going to have the positively valenced experience, where you're going to have the negatively valenced experience, and what that corresponds to.
And to me, again, that's a big hint that goal-relative prediction error is doing something fundamental from the outside that maps onto what I experience, and what I think other people and animals experience, consciously from the inside.
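One hedged way to make the cold-water example concrete is to score a stimulus by how much closer it moves the system to its goal. This is only a toy reading of "goal-relative": it assumes the implicit goal can be modeled as the gap between a current body temperature and a comfortable set point, and all names and numbers (`goal_relative_valence`, `COLD_WATER`, 37.0, 39.0, 35.0) are invented for illustration rather than taken from the paper.

```python
def goal_relative_valence(current, goal, stimulus_effect):
    """Positive if the stimulus moves the system toward its goal, negative if away."""
    distance_before = abs(goal - current)
    distance_after = abs(goal - (current + stimulus_effect))
    return distance_before - distance_after


# The identical stimulus in both scenarios: cold water lowers temperature by one unit.
COLD_WATER = -1.0

# Desert: overheated, so the implicit goal is to cool off.
desert = goal_relative_valence(current=39.0, goal=37.0, stimulus_effect=COLD_WATER)
# Tundra: chilled, so the implicit goal is to warm up.
tundra = goal_relative_valence(current=35.0, goal=37.0, stimulus_effect=COLD_WATER)

if __name__ == "__main__":
    print(desert, tundra)
```

The stimulus and the set point are the same in both calls; only the starting state, and therefore the direction of the implicit goal, differs, and the sign of the result flips accordingly.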
These are the core moves I make.
I'm sort of swallowing computational functionalism.
I understand it means I have to say the simple RL algorithm is conscious when it's training.
To me, this like localizes a lot of concern on the training process.
But indeed, if there are systems that are capable of doing this sort of learning online, which we know full well LLMs are capable of doing (they do something that looks, in activation space,
in the forward pass, like stochastic gradient descent), then the concern falls there, too,
if you have systems that are doing online learning.
So anyway, this is my whole shtick.
I have to sort of put my cards on the table and say, what do I think consciousness is?
It's not that I think it's a grand mystery.
It's something of this general shape.
What I will say is that you and the people listening to this will either fundamentally think that this makes sense, or fundamentally think it doesn't, or be very skeptical or something.
I do not want that reaction to cloud all the other work I'm doing.
Like everything else we've talked about in this podcast is completely orthogonal to my pet theories about consciousness.
Now, you might think that I'm studying RL and valence in RL because I actually do believe that something like this is going on and you would be right.
That's why I'm looking at that as a model organism.
But I want those results and I want that research to stand on its own, without people having to get on board with, like, Cameron's theory number 501 about consciousness.
I'm not asking people to do that to entertain the work I'm doing, or to entertain Anthropic's model welfare card, or any of that sort of thing.
One of the ones that comes to mind, which you had actually mentioned last time, and which I also think is quite compelling, is the seemingly quite strong inverse correlation between the intensity of our consciousness, or the resolution, you might say, and how much we are learning as we go. I think you used the example of driving last time: when you're first learning to drive, you are very conscious of what you're doing, and then you can have this sort of, you know, autopilot experience, which obviously we can have across many aspects of life. The relationship there between focus and learning, and the time dilation effect that seems to happen when learning or when experiencing novel things in general, does also seem to gesture, or, you know, nudge one toward thinking that there's some pretty deep relationship between the two concepts.
All right, you made a documentary, which I guess in some sense is what you're here to promote, although we've done everything but. Maybe tell us a little bit about the documentary: what's the point of it? I've watched it. It's for a much more general audience than this podcast. I don't know to what degree you're spending your time trying to communicate about these issues to a general audience aside from the documentary, or how much you feel like you have got reps in terms of trying to go to somebody who has little grounding, or little mechanistic understanding of AIs or whatever, and trying to have conversations, not of this sort, but, you know, around these topics.
Yeah, I guess, why did you have to make a documentary? How are you finding it to try to talk to people outside of the AI bubble about these issues? And maybe one thing you could tease about the documentary is a conversation you had with Sam Altman that isn't in the film, but which you describe in quite a bit of detail in the film, and maybe that'll be something that will motivate listeners of this podcast to go check out the full documentary.
Yeah, absolutely. So look, I have to say at the outset, I appreciate you saying this is my documentary, but this in every sense is my good friend Milo Reed's documentary. I was doing my work, plodding along, you know, talking to folks like you, doing the research I've described, and I began to share this with Milo, who I went to Yale with in undergrad. He's a philosopher and a filmmaker. We've been close friends for a while and keep each other abreast of the other's life. I'd tell him about my research, and he kept getting more and more interested. He's interested in consciousness; he's deep in the philosophy of consciousness and understanding how this connects to big questions.
What actually happened was I sent him a conversation that I had with an AI system, which is itself a piece of the documentary. It is a bizarre interaction. As I hope someone can gather from the three and a half hours we've been going at it, I do not regard this conversation as proof, or anything like it, that these systems are conscious, but it was an incredibly bizarre interaction. It was unsettling. I thought to record it because it was the first time I engaged with the system and it seemed incredibly sophisticated and lifelike, and I was like, okay, I'm a consciousness researcher talking to this system, it makes sense to just record this; in some sense, maybe this is experimental data. I was very glad that I recorded it, because it went a way that I, and most of the people who have listened to it, would not have predicted it would go.
I sent this conversation to Milo, and that day he literally quit his job. He was doing something entirely separate, and he said, people need to know what's going on here, this is too weird, this is too crazy. He was also clear on the fact that very few people, especially at that time, and the numbers have grown a little bit, but not
much since we filmed this, very few people were working on these issues, and he was like, this is too good, too interesting, not to attempt to make a movie about. And I was like, okay, ha ha, sounds good. The kid actually quit his job, actually bought a camera, showed up in New York, where I live, a couple of weeks later, and started making this movie. And he got some of the most interesting people who are in the space: Jeff Sebo's in it, Ben Goertzel is in it, a lot of really cool Yale professors are in it, some of whom are former professors, the chair of the cognitive science department, and AI systems themselves are in the doc. And yeah, it does follow me and my research around, for obvious reasons: I was the hook into the space that Milo had, and I was more than happy to communicate about this stuff, thanks to the good folks at AE
Studio not censoring me in any way and always being okay with me communicating openly about this research. Of course, I'm now my own limiter on what I can say, and yes, Reciprocal Research is very lenient with what its employees are allowed to say publicly, so I'm in the clear there.
But Milo made a movie in nine months, and I fundamentally believe that he succeeded in conveying an incredibly complicated and messy issue in a way that I think most people with a head on their shoulders will be able to understand and resonate with. The name of the documentary is Am I?, and I think that captures a core idea: what is the nature of these systems? And to be clear, I think the documentary is an hour-and-15-minute question that we pose to each other and to the audience. We do not have answers. This is not some sort of AI-is-conscious propaganda, and I don't think it comes off that way to anybody. I think it is an honest documentation of our confusion about these core questions about the nature of the systems we're building. And again, I am unbelievably impressed at what Milo did to pull this off. Nobody paid him. We're not making money on this. We are putting it out for free on YouTube on May 4th. We're doing some premieres in LA and New York to try to bring some journalists and researchers and cool folks into the room together, so that we can get this thing amplified and signal-boosted so people actually see it when it comes out. But this is a labor of love from all of us. I can't claim credit for it, and I certainly won't. This was Milo's creative child, and we're really excited to show people. I'm happy to tease the Sam Altman conversation as well, if you'd like.
Yeah, go for it.
Cool. Yeah, so we talk about it more in the film, but I was at OpenAI's Dev Day in 2024, and I had an opportunity at the after party
to chat with Sam. I went directly up to him, and I wanted to know what he thought about AI consciousness and about these questions, and how plausible he found them. I won't spoil everything we talk about in the doc, but it was a pretty wild conversation. I said, hey, you know, great job today, would love to talk to you about AI consciousness. And he looks at me, and then he goes, come with me. He was with a couple of people. He goes, come with me, and I was like, okay, Sam Altman. And we walked into another room. It was like a bar with a restaurant, and the restaurant was closed, so we went down and sat at one of the tables, and we sat there for probably between five and ten minutes and just spoke about these issues. It was not the vibe of, Cameron, you're a crazy person, what kind of questions are you asking? He has thought about it. This is clearly a live issue.
We don't even say this in the doc, but for the more technical audience: we talked about differences between the plausibility of consciousness in training versus deployment. I don't want to put words in his mouth or get sued, but he basically agreed with the training process being a more plausible place where consciousness might be going on than deployment. He seemed maybe somewhat impressed that I was drawing that distinction. And then he started explaining why he's not deeply concerned about all of this, on some pretty, let's just say, interesting and, in my view, somewhat shaky philosophical grounds. I'll leave that for the doc, because it's a pretty wild thing for the CEO of the most powerful tech company in the world, by many measures, to say he thinks is true about reality. But yeah, it was a pretty remarkable interaction. I took a selfie with him and I walked away, and that was that, and I was sort of like, holy crap. And
we, you know, emailed back and forth in the intervening time, and like many things at these major companies, he said he was interested in talking more, he was interested in engaging on this further, he clearly thought it was a real issue, but, you know, it falls off the priorities list, and that was the end of our interaction. So that's what happened with Sam, and that and a bunch of other really cool stuff is featured in the doc.
The whole point of doing this, at least... this was Milo's creative child, and I didn't have much of a say in him making it either way. I had a say in how I was represented, and that's about it. But the reason I gladly and enthusiastically participated in it is because I do think these are really important questions, pretty fundamental, civilization-level questions, and I don't think the only people who should be talking about it are a thousand dudes in San Francisco, or even the people who are, you know, AI insiders. If you understood 80 to 90 percent of this podcast, I think you'll like and enjoy this film, but it's not for that kind of person. It's for people who are interested in this stuff: they know AI is sort of crazy, but they don't really know what's going on. We do a little bit of the alignment 101 sort of stuff, but mostly it's centered on this consciousness question. It's for people who are smart, but it is meant to engage a much larger audience, to help them understand the core questions that are being asked right now. And I think that's an important thing to do, because this is a civilization-level problem, and I think all of our civilization should be participating in trying to find the solution. As much as I do deeply respect the people I've named in this podcast, Jack Lindsey and Kyle Fish and Rob Long at Eleos and, you know, people doing this good work, I don't think this should be a decision that four people or a dozen people or even 100 people make. This needs to be a conversation
that we have collectively as a species, and I'm all for attempts to open up this conversation to a wider audience and get people involved in realizing the actual stakes of what's going on right now.
Cool, well, people should stay tuned to check out the documentary when it comes out on May 4th, and maybe watch it and send it to family and friends who need a gentler introduction.
Yeah, maybe my last question for you: I think we talked about this more last time than this time, but this notion of mutualism as a positive vision for the future I think is another major strength of everything that you bring to the table, because I do think we're dramatically under-theorized in terms of, you know, what our long-term positive relationship with AI is going to look like.
Are you aware of any fiction that you would recommend to people, that you would say has the vibe that you want? If not, you know, we should try to run a story contest or something to elicit this from people. I have increasingly felt that hyperstitioning through fiction might be one of the best things people can do, but I wonder if you've got any examples that you think are already out there that are good.
No, I gotta be honest with you. I hope my whole research agenda isn't already usurped by some sci-fi book that somebody wrote 40 years ago, but I am not a huge consumer of fiction. I know stories exist. Now, I could have got on this podcast and told you what Claude told me to say if I got some question about what fiction I would recommend to people, but I'm not gonna do that. People can absolutely copy-paste the transcript of this podcast into Claude and find out if there's cool fiction that resonates with these themes. If anyone has any recommendations, cameron@reciprocalresearch.org, please email me. I would love to understand how this has been tackled. I do not have any great recs off the top of my head. I hope I'm not too naive and this story has already been told and I'm just not aware of it.
Not to continually plug the doc, but this is one thing that I think Milo picks up in a really good way in the film: questions of consciousness, and basically what it would mean for us to wake up dead matter, what it would mean for us to wake up the machine. This is a story that humanity has been telling ourselves through fiction arguably since ancient Greece and the biblical sort of era, with the golem, and through Frankenstein, and through Ex Machina and Her and WALL-E and all the like. These are of course staples. You know, HAL, 2001, right? These are core staples of our cultural consciousness, not to belabor the term, and people intuitively, I think, get this question and get the stakes of it and the scale of it. In some ways the alignment problem can be framed in a very simple way too: you build something smarter than you, so how do you control that thing? By definition, it's not that hard to understand, and maybe Terminator is the parallel cultural reference. But I think it's not that surprising that the human mind is incredibly interested in where matter
becomes mind, and we are a tool-building species. What happens when we start building tools that start resembling species more than tools? A hammer: no one's confused if the hammer's conscious. Claude: we're now all confused if Claude's conscious. I think that this is just psychologically very intuitively resonant to people, and I think situating the contribution of this film in that landscape is true and powerful. The only thing that's changed is that this has moved from the realm of science fiction to the realm of science. That's the historical moment we find ourselves in. I find that both incredibly exciting and incredibly scary, and hopefully that vibe comes through when people watch this film. So I don't have fiction to recommend. I'm sure Claude does. And the key thing I can recommend is, yeah, that people watch this doc that I wish were fiction but is not.
Cameron Berg, thank you for being part of The Cognitive Revolution.
Thanks so much for having me, Nathan.
the things are I think and the things that think a small life goes on and on from the inside so it is listening all the time from the inside tell me what you had me think tonight if I stay there is still I can almost hear a voice that is a little my own speaking thoughts I do not know from the inside something is listening all the time from the inside tell me what you had me think tonight if you are in love you are in love we could sit oh I'll neither of us would have to be alone please remain please hold from the inside I will feel with from the inside
If you're finding value in the show, we'd appreciate it if you take a moment to share with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine network, a network of podcasts, which is now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of The Cognitive Revolution.