Roman Yampolskiy on the Uncontrollability, Incomprehensibility, and Unexplainability of AI

  • Roman’s results on the unexplainability, incomprehensibility, and uncontrollability of AI
  • The relationship between AI safety, control, and alignment
  • Virtual worlds as a proposal for solving multi-multi alignment
  • AI security

You can find FLI’s three new policy-focused job postings here

 

Papers discussed in this episode:

On Controllability of AI

Unexplainability and Incomprehensibility of Artificial Intelligence

Unpredictability of AI

 

Andrew Critch on AI Research Considerations for Human Existential Safety

 Topics discussed in this episode include:

  • The mainstream computer science view of AI existential risk
  • Distinguishing AI safety from AI existential safety 
  • The need for more precise terminology in the field of AI existential safety and alignment
  • The concept of prepotent AI systems and the problem of delegation 
  • Which alignment problems get solved by commercial incentives and which don’t
  • The threat of diffusion of responsibility on AI existential safety considerations not covered by commercial incentives
  • Prepotent AI risk types that lead to unsurvivability for humanity 

 

Timestamps: 

0:00 Intro
2:53 Why Andrew wrote ARCHES and what it’s about
6:46 The perspective of the mainstream CS community on AI existential risk
13:03 ARCHES in relation to AI existential risk literature
16:05 The distinction between safety and existential safety
24:27 Existential risk is most likely to obtain through externalities
29:03 The relationship between existential safety and safety for current systems
33:17 Research areas that may not be solved by natural commercial incentives
51:40 What’s an AI system and an AI technology?
53:42 Prepotent AI
59:41 Misaligned prepotent AI technology
01:05:13 Human frailty
01:07:37 The importance of delegation
01:14:11 Single-single, single-multi, multi-single, and multi-multi
01:15:26 Control, instruction, and comprehension
01:20:40 The multiplicity thesis
01:22:16 Risk types from prepotent AI that lead to human unsurvivability
01:34:06 Flow-through effects
01:41:00 Multi-stakeholder objectives
01:49:08 Final words from Andrew

 

Citations:

AI Research Considerations for Human Existential Safety

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on YouTube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today we have a conversation with Andrew Critch where we explore a recent paper of his titled AI Research Considerations for Human Existential Safety, which he co-authored with David Krueger. In this episode, we discuss how mainstream computer science views AI existential risk, we develop new terminology for this space and discuss the need for more precise concepts in the field of AI existential safety, we get into which alignment problems and areas of AI existential safety Andrew expects to be naturally solved by industry and which won’t, and we explore the risk types of a new concept Andrew introduces, called prepotent AI, that lead to unsurvivability for humanity. 

I learned a lot from Andrew in this episode and found this conversation to be quite perspective shifting. I think Andrew offers an interesting and useful critique of existing discourse and thought, as well as new ideas. I came away from this conversation especially valuing thought around the issue of which alignment and existential safety issues will and will not get solved naturally by industry and commercial incentives. The answer to this helps to identify crucial areas we should be mindful to figure out how to address outside the normal incentive structures of society, and that to me seems crucial for mitigating AI existential risk. 

If you don’t already subscribe or follow this podcast, you can follow us on your preferred podcasting platform, like Apple Podcasts or Spotify, by searching for The Future of Life. 

Andrew Critch is currently a full-time research scientist in the Electrical Engineering and Computer Sciences department at UC Berkeley, at Stuart Russell’s Center for Human Compatible AI. He earned his PhD in mathematics at UC Berkeley studying applications of algebraic geometry to machine learning models. During that time, he cofounded the Center for Applied Rationality and Summer Program on Applied Rationality and Cognition. Andrew has been offered university faculty positions in mathematics and mathematical biosciences, worked as an algorithmic stock trader at Jane Street Capital‘s New York City office, and as a research fellow at the Machine Intelligence Research Institute. His current research interests include logical uncertainty, open source game theory, and avoiding arms race dynamics between nations and companies in AI development.

And with that, let’s get into our conversation with Andrew Critch.

We’re here today to discuss your paper, AI Research Considerations for Human Existential Safety. You can shorten that to ARCHES. You wrote this with David Krueger and it came out at the end of May. I’m curious and interested to know what your motivation is for writing ARCHES and what it’s all about.

Andrew Critch: Cool. Thanks, Lucas. It’s great to be here. For me, it’s pretty simple: I care about existential safety. I want humans to be safe as a species. I don’t want human extinction to ever happen. And so I decided to write a big, long document about that with David. And of course, why now and why these particular problems, I can go more into that.

You might wonder if existential risk from AI is possible, how have we done so much AI research with so little technical level thought about how that works and how to prevent it? And to me, it seems like the culture of computer science and actually a lot of STEM has been to always talk about the benefits of science. Except in certain disciplines that are well accustomed to talking about risks like medicine, a lot of science just doesn’t talk about what could go wrong or how it could be misused.

It hasn’t been until very recently that computer science has really started making an effort as a culture to talk about how things could go wrong in general. Forget x-risk, just anything going wrong. And I’m just going to read out loud this quote to sort of set the context culturally for where we are with computer science right now and how far culturally we are from being able to really address existential risk holistically.

This is a quote from Hecht at the ACM Future of Computing Academy. It came out in 2018, just two years ago. “The current status quo in the computing community is to frame our research by extolling its anticipated benefits to society. In other words, rose colored glasses are the normal lenses through which we tend to view our work. However, one glance at the news these days reveals that focusing exclusively on the positive impacts of a new computing technology involves considering only one side of a very important story. We believe that this gap represents a serious and embarrassing intellectual lapse. The scale of this lapse is truly tremendous. It is analogous to the medical community, only writing about the benefits of a given treatment, completely ignoring the side effects, no matter how serious they are.

What’s more, the public has definitely caught on to our community-wide blind spot and is understandably suspicious of it. After several months of discussion, an idea for acting on this imperative began to emerge. We can leverage the gatekeeping functionality of the peer review process. At a high level, our recommended change to the peer review process in computing is straightforward. Peer reviewers should require that papers and proposals rigorously consider all reasonable, broader impacts, both positive and negative.” That’s Hecht, 2018.

With this energy, this initiative from the ACM and other similar mentalities around the world, NeurIPS conference submissions are now required to include broader impact statements that cover negative impacts as well as positive.

Suddenly in 2020, contrasted with 2015, it’s becoming okay and normal to talk about how your research could be misused and what could go wrong with it. And we’re just barely able to admit things like, “This algorithm could result in racial bias in judiciary hearings,” or something like that. Which is a terrible, terrible … The fact that we’ve taken this long to admit that and talk about it is very bad. And that’s something as present and obvious as racism. Whereas, existential risk has never been … Extinction has never been present or else we wouldn’t be having this conversation. And so those conversations are even harder to have when it’s not normal to talk about bad outcomes at all. Let alone obvious, in your face, bad outcomes.

Lucas Perry: Yeah. On this podcast, we’re basically only talking to people who are in the AI alignment community and who take x-risk very seriously, who are worried about existential risk from advanced AI systems.

And so we lack a lot of this perspective … Or we don’t have many conversations with people who take the cultural, and I guess, academic perspective of the mainstream machine learning and computer science community. Which is far larger and has much more inertia and mass than the AI alignment community.

I’m curious if you can just paint a little bit more of a picture here of what the state of computer science thinking or non-thinking is on AI existential risk? You mentioned that recently people are starting to at least encourage, and even require as part of the process, negative impact statements or writing about the risks of a technology one is developing. But that’s still not talking about global catastrophic risk. It’s still not talking about alignment explicitly. It’s not talking about existential risk. It seems like a step in the right direction, but with some ways to go. What kind of perspective can you give us on all this?

Andrew Critch: I think of sort of EA-adjacent AI researchers as kind of a community, to the extent that EA is a community. And it’s not exactly the same set of people as AI researchers who think about existential risk, or AI researchers who think about alignment, which is yet another set of people. They overlap heavily, but they’re not the same set.

And I have noticed a tendency that I’m trying to combat here by raising this awareness, not only to computer scientists, but to EA-adjacent AI folks. Which is that if you feel sort of impatient that computer science and AI are not acknowledging existential risk from tech, things are underway, and there are ways of making things better and ways of making things worse.

One way to make things worse is to get irate with people for caring about risks that you think aren’t big enough. Okay. If you think inequitable loan distribution is not as bad as human extinction, many people might agree with you, but if you’re irate about that and saying, “Why are we talking about that when we should be talking about extinction?” you’re slowing down the process of computer science transitioning into a more negative-outcome-aware field by refusing to cooperate with other people who are trying to raise awareness about negative outcomes.

I think there’s a push to be more aware of negative outcomes and all the negative outcome people need to sort of work together politely, but swiftly, raising the bar for our discourse about negative outcomes. And I think existential risks should be part of that, but I don’t think it should be adversarially positioned relative to other negative outcomes. I think we just need to raise the bar for all of these at once.

And all of these issues have the same enemy, which is those rose colored glasses that wrote all of our grant applications for the past 50 years. Every time you’re asking for public funds, you say how this is going to benefit society. And you better not mention how it might actually make society worse or else you won’t get your grant. Right?

Well, times are changing. You’re allowed to mention and signal awareness of how your research could make things worse. And that’s starting to be seen as a good trait rather than a reason not to give you funding. And if we all work together to combat that rose colored glass problem, it’s going to make everything easier to talk about, including existential risk.

Lucas Perry: All right. So if one goes to NeurIPS and talks to any random person about existential risk or AI alignment or catastrophic risk from AI, what is the average reaction or assumed knowledge? What’s the mix of people who think it’s complete bullshit, people who are neutral about it, and people who are serious about it?

Andrew Critch: Definitely my impression right now, and this is a very rough impression: there are a few different kinds of reactions that are each, like, sort of a double-digit percentage. I don’t know which percentages they are, but one is like, how are you worried about existential risk when robots can’t tie knots yet? Or they can’t fold laundry. It’s like a very difficult research problem for an academic AI lab to make a robot fold laundry. So it’s like, come on. We’re so far away from that.

Another reaction is, “Yeah, that’s true. You know, I mean things are really taking off. They’re certainly progressing faster than I expected. Things are kind of crazy.” It’s the things that are kind of crazy reaction and there’s just kind of an open-mindedness. Man, anything could happen. We could go extinct in 50 years, we can go extinct. I don’t know what’s going to happen. Things are crazy.

And then there’s another reaction. Unfortunately, this one’s really weird. I’ve gotten this one, which is, “Well, of course humanity is going to go extinct from the advent of AI technology. I mean, of course. Just think about it from an evolutionary perspective. There’s no way we would not go extinct given that we’re making things smarter than us. So of course it’s going to happen. There’s nothing we can do about it. It’s just our job as a field to make things that are smarter than humans that will eventually replace us, and they’ll be better than us. And that’s just how stuff is.”

Lucas Perry: Some people think that’s an aligned outcome.

Andrew Critch: I don’t know. There’s a lot of debate to be had about that. But it’s a kind of defeatist attitude of, “There’s nothing you can do.” It’s much, much rarer, it seems like single digits, that someone is like, “Yeah, we’re going to do something about it.” That one is the rarest; the acknowledging and orienting towards solving it is still pretty rare. But there’s plenty these days of acknowledgement that it could be real and acknowledgement that it’s confusing and hard. The challenge is somehow way more acknowledged than any particular approach to it.

Lucas Perry: Okay. I guess that’s surprising to hear then that you feel like it’s more taken seriously than not.

Andrew Critch: It depends on what you mean by taken seriously. And again, I’m filtering for a person who’s being polite and talking to me about it, right? People are polite enough to fall into the, “Stuff is crazy. Who knows what could happen,” attitude.

And is that taking it seriously? Well, no, but it’s not adversarial to people who are taking it seriously, which I think is really good. And then there’s the, “Clearly we’re going to be destroyed by machines that replace us. That’s just nature.” Those voices, I’m kind of like, well, that’s kind of good also. It’s good to admit that there’s a real risk here. It’s kind of bad to give up on it, in my opinion. But altogether, if you add up the, “Woah, stuff’s crazy and we’re not really oriented to it,” plus the, “Definitely humanity is going to be destroyed/replaced.” It’s a solid chunk of people. I don’t know. I’m going to say at least 30%. If you also then include the people who want to try and do something about it. Which is just amazing compared to say six years ago where the answer would have been round to zero percent.

Lucas Perry: Then just to sum up here, this paper then is an exercise in trying to lay out a research agenda for existential safety from AI systems, which is unique in your view? I think you mentioned that there are four that have already existed to this day.

Andrew Critch: Yeah. There’s Aligning Superintelligence with Human Interests, by Soares and Fallenstein, that’s MIRI, basically. Then there’s Research Priorities for Robust and Beneficial Artificial Intelligence, by Stuart Russell, Max Tegmark, and Daniel Dewey. Then there’s Concrete Problems in AI Safety, by Dario Amodei and others. And then Alignment for Advanced Machine Learning Systems, by Jessica Taylor and others. And Scalable Agent Alignment via Reward Modeling, by Jan Leike, and David Krueger is also on that one.

Lucas Perry: How do you see your paper as fitting in with all of the literature that already exists on the problem of AI alignment and AI existential risk?

Andrew Critch: Right. So it’s interesting you say that there exists literature on AI existential risk. I would say Superintelligence, by Nick Bostrom, is literature on AI existential risk, but it is not a research agenda.

Lucas Perry: Yeah.

Andrew Critch: I would say Aligning Superintelligence with Human Interests, by Soares and Fallenstein. It’s a research agenda, but it’s not really about existential risk. It sort of mentions that stakes are really high, but it’s not constantly staying in contact with the concept of extinction throughout.

If you take a random excerpt of any page from it and pretend that it’s about the Netflix challenge or building really good personal assistants or domestic robots, you can succeed. That’s not a critique. That’s just a good property of integrating with research trends. But it’s not about the concept of existential risk. Same thing with Concrete Problems in AI Safety.

In fact, it’s a fun exercise to do. Take that paper. Pretend you think existential risk is ridiculous and read Concrete Problems in AI Safety. It reads perfectly as you don’t need to think about that crazy stuff, let’s talk about tipping over vases or whatever. And that’s a sign that it’s an approach to safety that it’s going to be agreeable to people, whether they care about x-risk or not. Whereas, this document is not going to go down easy for someone who’s not willing to think about existential risk and it’s trying to stay constantly in contact with the concept.

Lucas Perry: All right. And so you avoid making the case for AI x-risk as valid and as a priority, just for the sake of the goal of the document succeeding?

Andrew Critch: Yeah. I want readers to spend time inhabiting the hypothetical that existential risk is real and can come from AI and can be addressed through research. They’re already taking a big step by constantly thinking about existential risk for these 100 pages here. I think it’s possible to take that step without being convinced of how likely the existential risk is. And I’m hoping that I’m not alienating anybody if you think it’s 1%, but it’s worth thinking about. That’s good. If you think it’s 30% chance of existential risk from AI, then it’s worth thinking about. That’s good, too. If you think it’s 0.01, but you’re still thinking about it, you’re still reading it. That’s good, too. And I didn’t want to fracture the audience based on how probable people would agree the risks are.

Lucas Perry: All right. So let’s get into the meat of the paper, then. It would be useful, I think, if you could help clarify the distinction between safety and existential safety.

Andrew Critch: Yeah. So here’s a problem we have. And when I say we, I mean people who care about AI existential safety. Around 2015 and 2016, we had this coming out of AI safety as a concept. Thanks to Amodei and the Robust and Beneficial AI Agenda from Stuart Russell, talking about safety became normal. Which was hard to accomplish before 2018. That was a huge accomplishment.

And so what we had happen is people who cared about extinction risk from artificial intelligence would use AI safety as a euphemism for preventing human extinction risk. Now, I’m not sure that was a mistake, because as I said, prior to 2018, it was hard to talk about negative outcomes at all. But it’s at this time in 2020 a real problem that you have people … When they’re thinking existential safety, they’re saying safety, they’re saying AI safety. And that leads to sentences like, “Well, self driving car navigation is not really AI safety.” I’ve heard that uttered many times by different people.

Lucas Perry: And that’s really confusing.

Andrew Critch: Right. And it’s like, “Well, what is AI safety, exactly, if cars driven by AI, not crashing, doesn’t count as AI safety?” I think that as described, the concept of safety usually means minimizing acute risks. Acute meaning in space and time. Like there’s a thing that happens in a place that causes a bad thing. And you’re trying to stop that. And the Concrete Problems in AI Safety agenda really nailed that concept.

And we need to get past the concept of AI safety in general if what we want to talk about is societal scale risk, including existential risk. Which is acute on a geological timescale. Like you can look at a century before and after and see the earth is very different. But a lot of ways you can destroy the earth don’t happen like a car accident. They play out over a course of years. And things to prevent that sort of thing are often called ethics. Ethics are principles for getting a lot of agents to work together and not mess things up for each other.

And I think there’s a lot of work today that falls under the heading of AI ethics that are really necessary to make sure that AI technology aggregated across the earth, across many industries and systems and services, will not result collectively in somehow destroying humanity, our environment, our minds, et cetera.

To me, existential safety is a problem for humanity on an existential timescale that has elements that resemble safety in terms of being acute on a geological timescale. But also resemble ethics in terms of having a lot of agents, a lot of different stakeholders and objectives mulling around and potentially interfering with each other and interacting in complicated ways.

Lucas Perry: Yeah. Just to summarize this, people were walking around saying like, “I work on AI safety.” But really, that means that I’ve bought into AI existential risk and I work on AI existential risk. And then that’s confusing for everyone else, because working on the personal scale risk of self-driving car safety is also AI safety.

We need a new word, because AI safety really means acute risks, which can range from personal all the way to civilizational or transgenerational. And so, it’s confusing to say I work in AI safety, but really what I mean is only I care about transgenerational, AI existential risk.

Andrew Critch: Yes.

Lucas Perry: Then we have this concept of existential safety, which for you both has this portion of us not going extinct, but also existential safety includes the normative and ethics and values and game theory and how it is that an ecosystem of human and nonhuman agents work together to build a thriving civilization that is existentially preferable to other civilizations.

Andrew Critch: I agree 100% with everything you just said, except for the part where you say “existentially preferable.” I prefer to use existential safety to refer really, to preserving existence. And I prefer existential risk to refer to extinction. That’s not how Bostrom uses the term. And he introduced the term, largely, and he intends to include risks that are as important as extinction, but aren’t extinction risks.

And I think that’s interesting. I think that’s a good category of risks to think about and deserving of a name. I think, however, that there’s a lot more debate about what is or isn’t as bad as extinction. Whereas, there’s much less debate about what extinction is. There still is debate. You can say, “Well, what about if we become uploads, whatever.” But there’s much, much more uncertainty about what’s worse or better than extinction.

And so I prefer to focus existential safety on literally preventing extinction and then use some other concept, like societal scale risk, for referring to risks that are really big on a societal scale that may or may not pass the threshold of being worse or better than extinction.

I also care about societal scale risks and I don’t want people working on preventing societal scale risks to be fractured based on whether they think any particular risk, like lots of sentient suffering AI systems or a totalitarian regime that lasts forever. I don’t want people working to prevent those outcomes to be fractured based on whether or not they think those outcomes are worse than extinction or count as a quote, unquote existential risk. When I say existential risk, I always mean risks to the existence of the human species, for simplicity sake.

Lucas Perry: Yeah. Because Bostrom’s definition of an existential risk is any risk such that, if it should occur, it would permanently and drastically curtail the potential of Earth-originating intelligent life. Which would include futures of deep suffering or futures of being locked into some less than ideal system.

Andrew Critch: Yeah. Potential not only measured in existence, but potential measured in value. And if you’re suffering, the value of your existence is lower.

Lucas Perry: Yeah. And that there are some futures where we still exist, where they’re less preferable to extinction.

Andrew Critch: Right.

Lucas Perry: You want to say, okay, there are these potential suffering risks and there are bad futures of disvalue that are maybe worse than extinction. We’re going to call all these societal risks. And then we’re just going to have existential risk or existential safety refer to us not going extinct?

Andrew Critch: I think that’s especially necessary in computer science. Because if anything seems vague or unrefined, there’s a lot of allergy to it. I try to pick the most clearly definable thing, like are humans there or not? That’s a little bit easier for people to wrap their heads around.

Lucas Perry: Yeah. I can imagine how in the hard sciences people would be very allergic to anything that was not sufficiently precise. One final distinction here to make is that one could say, instead of saying, “I work on AI safety,” “I work on AI existential safety or AI civilizational or societal risk.” But another word here is, “I work on AI alignment.” And you distinguish that from AI delegation. Could you unpack that a little bit more and why that’s important to you?

Andrew Critch: Yeah. Thanks for asking about that. I do think that there’s a bit of an issue with the “AI alignment” concept that makes it inadequate for existential risk reduction. AI existential safety is my goal. And I think AI alignment, the way people usually think about it, is not really going to cut it for that purpose.

If we’re successful as a society in developing and rolling out lots of new AI technologies to do lots of cool stuff, it’s going to be a lot of stakeholders in that game. It’s going to be what you might call massively multipolar. And in that economy or society, a lot of things can go wrong through the aggregate behavior of individually aligned systems. Like just take pollution, right? No one person wants everybody else to pollute the atmosphere, but they’re willing to do it themselves. Because when Alice pollutes the atmosphere, Alice gets to work on time or Alice gets to take a flight or whatever.

And she harms everybody in doing that, including herself. But the harm to herself is so small. It’s just a drop in the bucket that’s spread across everybody else. You do yourself a benefit and you do a harm that outweighs that benefit, but it’s spread across everybody and accrues very little harm specifically to you. That’s the problem with externalities.

I think existential risk is most likely to obtain through externalities, between interacting systems that somehow were not designed to interact well enough because they had different designers or they had different stakeholders behind them. And those competitive effects: if you don’t take a car, everyone else is going to take a car and you’re going to fall behind. So you take a car. If you’re a country, right? If you don’t burn fossil fuels, well, you spend a few years transitioning to clean energy and you fall behind economically. You’re taking a hit and that hurts you more than anybody. Of course, it benefits the whole world if you cut your carbon emissions, but it’s just a big prisoner’s dilemma. So you don’t do it. No one does it.
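Here is a minimal numerical sketch of the externality structure Andrew is describing; the actor count, private gain, and harm figures are illustrative assumptions, not numbers from ARCHES:

```python
# Toy n-player pollution game: each actor gains privately by polluting while
# the harm is spread thinly over everyone, so polluting is individually
# rational even though universal pollution leaves everyone worse off.

N = 1_000_000          # number of actors (e.g. commuters, countries)
PRIVATE_GAIN = 1.0     # benefit to an actor who pollutes (gets to work on time)
TOTAL_HARM = 3.0       # total harm from one actor's pollution, spread over all N

def payoff(i_pollute: bool, others_polluting: int) -> float:
    """Payoff to one actor, given their choice and how many others pollute."""
    gain = PRIVATE_GAIN if i_pollute else 0.0
    polluters = others_polluting + (1 if i_pollute else 0)
    my_share_of_harm = polluters * TOTAL_HARM / N
    return gain - my_share_of_harm

# Whatever the others do, polluting is better for you by 1 - 3/N...
print(payoff(True, 0) - payoff(False, 0))          # ~1.0
print(payoff(True, N - 1) - payoff(False, N - 1))  # ~1.0

# ...yet if everyone pollutes, everyone ends up worse off than if nobody did.
print(payoff(True, N - 1))   # 1 - 3 = -2.0: everyone pollutes, everyone loses
print(payoff(False, 0))      # 0.0: nobody pollutes
```

Polluting is a dominant strategy for each actor, yet the all-pollute outcome is worse for everyone than the no-pollute outcome, which is the incentive pattern Critch argues societal-scale AI risk is likely to inherit.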

There’s many, many other variables that describe the earth. This comes to the human fragility thesis, which David and I outlined in the paper. Which is that there are many variables which, if changed, can destroy humanity. And any of those variables could be changed in ways that don’t destroy machines. And so we are at risk of machine economies that keep on operating at the expense of humans, humans who aren’t needed for those economies and who end up being destroyed. That is the sort of backdrop for why I think delegation is a more important concept than alignment.

Delegation is a relationship between groups of people. You’ll often have a board of directors that delegates through a CEO to an entire staff. And I want to evoke that concept, the relationship between a group of overseers and a group of doers. You can have delegates on a UN committee from many different countries. You’ve got groups delegating to individuals to serve as part of a group who are going to delegate to a staff. There’s this constant flow through of responsibility. And it’s not even acyclic. You’ve got elected officials who are delegated by the electorate who delegate staff to provide services to the electorate, but also to control the electorate.

So there’s these loops going around. And I think I want to draw attention to all of the delegation relationships that are going to exist in the future economy. And that already exist in the present economy of AI technologies. When you pay attention to all of those different pathways of delegation, you realize there’s a lot of people in institutions with different values that aren’t going to agree with each other on what counts as aligned.

For example, for some people, it’s aligned to take a 1% chance of dying to double your own lifespan. Some people are like, “Yeah, that’s totally worth it.” And for some people, they’re like, “No 1% dying. That’s scary and I’m pretty happy living 80 years.” And so what sort of societal scale risks are worth taking are going to be subject to a lot of disagreement.

And the idea that there’s this thing called human values, that we’re all in agreement about. And there’s this other thing called AI that just has to do what the human values say. And we have to align the AI with human values. It’s an extremely simplified story. It’s got two agents and it’s just like one big agent called the humans. And then there’s this one big agent called AIs. And we’re just trying to align them. I think that is not the actual structure of the delegation relationship that humans and AI systems are going to have with respect to each other in the future. And I think alignment is helpful for addressing some delegation relationships, but probably not the vast majority.
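A small sketch of the delegation taxonomy Andrew uses later in the conversation (single-single through multi-multi); the class, its fields, and the example numbers are hypothetical illustrations, only the single/multi terminology comes from the discussion:

```python
from dataclasses import dataclass

@dataclass
class DelegationProblem:
    """Delegation regimes keyed by how many human stakeholders delegate to
    how many AI systems. Names here are illustrative, not from ARCHES."""
    num_humans: int
    num_ai_systems: int

    @property
    def regime(self) -> str:
        h = "single" if self.num_humans == 1 else "multi"
        a = "single" if self.num_ai_systems == 1 else "multi"
        return f"{h}/{a} delegation"

# "Classical" alignment: one stakeholder, one system.
print(DelegationProblem(1, 1).regime)                # single/single delegation
# One user with a fleet of cooperating assistants.
print(DelegationProblem(1, 50).regime)               # single/multi delegation
# Many stakeholders overseeing one powerful system.
print(DelegationProblem(10_000, 1).regime)           # multi/single delegation
# The future economy Andrew expects: many of both.
print(DelegationProblem(10_000, 50_000).regime)      # multi/multi delegation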

Lucas Perry: I see where you’re coming from. And I think in this conception, alignment, as you said, is a subcategory of delegation.

Andrew Critch: Well, I would say that alignment is a sub problem of most delegation problems, but there’s not one delegation problem. And I would also say alignment is a tool or technique for solving delegation problems.

Lucas Perry: Okay. Those problems all exist, but actually doing AI alignment automatically brings in delegation problems. And/or, if you actually align a system, then this system is aligned with how we would want to solve delegation problems.

Andrew Critch: Yeah. That’s right. One approach to solving AI delegation, you might think, “Yeah, we’re going to solve that problem by first inventing a superintelligent machine.” Step one, invent your superintelligent oracle machine. Step two, align your superintelligent oracle machine with you, the creator. Step three, ask it to solve for society. Just figure out how society should be structured. Do that. That’s a mathematically valid approach. I just don’t think that’s how it’s going to turn out. The closer powerful institutions get to having super powerful AI systems, the more political tensions are going to arise.

Lucas Perry: So we have to do the delegation problem as we’re going?

Andrew Critch: Yes, we have to do it as we’re going, 100%.

Lucas Perry: Okay.

Andrew Critch: And if we don’t, we put institutions at odds with each other to win the race of being the one chosen entity that aligns the one chosen superintelligence with their values or plan for the future or whatever. And I just think that’s a very non-robust approach to the future.

Lucas Perry: All right. Let’s pivot here then back into existential safety and normal AI safety. What do you see as the relationship between existential safety and safety for present day AI systems? Does safety for present day AI systems feed into existential safety? Can it inform existential safety? How much does one matter for the other?

Andrew Critch: The way I think of it, it’s a bit of a three node diagram. There’s present day AI safety problems, which I believe feed into existential safety problems somewhat. Meaning that some of the present day solutions will generalize to the existential safety problems.

There’s also present day AI ethics problems, which I think also feed into understanding how a bunch of agents can delegate to each other and treat each other well in ways that are not going to add up to destructive outcomes. That also feeds into existential safety.

And just to give concrete examples, let’s take car doesn’t crash, right? What does that have in common with existential safety? Well, existential safety is humanity doesn’t crash. There’s a state space. Some of the states involve humanity exists. Some of the states involve humanity doesn’t exist. And we want to stay in the region of state space where humans exist.

Mathematically, it’s got something in common with the staying in the region of state space where the car is on the road and not overheating, et cetera, et cetera. It’s a dynamical system. And it’s got some quantities that you want to conserve and there’s conditions or boundaries you want to avoid. It has this property just like culturally, it has the property of acknowledging a negative outcome and trying to avoid it. That’s, to me, the main thing that safety and existential safety have in common, avoiding a negative outcome. So is ethics about avoiding negative outcomes. And I think those both are going to flow into existential safety.
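A minimal sketch of this shared “stay in the viable region of state space” framing for both a car and humanity; the state variables and bounds below are made-up placeholders, not anything from ARCHES:

```python
# Safety for a car and existential safety for humanity can both be framed as
# keeping a dynamical system inside a viable region of its state space.
# The variables and intervals here are purely illustrative.

CAR_VIABLE = {
    "distance_from_lane_center_m": (-1.5, 1.5),
    "engine_temp_c": (0.0, 110.0),
}

EARTH_VIABLE_FOR_HUMANS = {
    "mean_surface_temp_c": (0.0, 30.0),
    "atmospheric_o2_fraction": (0.18, 0.25),
}

def in_viable_region(state: dict, bounds: dict) -> bool:
    """True iff every tracked variable stays inside its viable interval."""
    return all(lo <= state[k] <= hi for k, (lo, hi) in bounds.items())

print(in_viable_region(
    {"distance_from_lane_center_m": 0.3, "engine_temp_c": 95.0},
    CAR_VIABLE))                       # True: the car stays on the road
print(in_viable_region(
    {"mean_surface_temp_c": 45.0, "atmospheric_o2_fraction": 0.21},
    EARTH_VIABLE_FOR_HUMANS))          # False: outside the human-viable region
```

The same check also gestures at the human fragility thesis mentioned earlier: many variables can leave the human-viable region while a machine economy keeps running.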

Lucas Perry: Are there some more examples you can make for current day AI safety problems and current day AI ethics problems, just make it a bit more concrete? How does something like robustness to distributional shift take us from aligned systems today to systems that have existential safety in the future?

Andrew Critch: So, conceptually, robustness to distributional shift is about, you’ve got some function that you want to be performed or some condition you want to be met, and then the environment changes or the inputs change significantly from when you created the system, and then you still want it to maintain those conditions or achieve the goal.

So, for example, if you have a car trained, “To drive in dry conditions,” and then it starts raining, can you already have designed your car by principles that would allow it to not catastrophically fail in the rain? Can it notice, “Oh gosh, this is real different from the way I was trained. I’m going to pull over, because I don’t know how to drive in the rain.” Or can it learn, on the fly, how to drive in the rain and then get on with it?
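A minimal sketch of the “notice the shift and pull over” behavior described here, assuming a simple z-score check against the training distribution; the features, data, and threshold are illustrative assumptions only, not a real self-driving policy:

```python
import numpy as np

# Toy out-of-distribution check: inputs are summarized by two made-up
# features (road wetness, visibility). If current conditions look very
# unlike the training data, fall back to a conservative action.

rng = np.random.default_rng(0)
train_conditions = rng.normal(loc=[0.1, 0.9], scale=0.05, size=(500, 2))  # dry, clear
train_mean = train_conditions.mean(axis=0)
train_std = train_conditions.std(axis=0)

def act(current_conditions: np.ndarray, z_threshold: float = 4.0) -> str:
    """Drive normally only if current conditions resemble the training data."""
    z = np.abs((current_conditions - train_mean) / train_std)
    if np.any(z > z_threshold):
        return "pull over"   # conservative fallback under distributional shift
    return "drive normally"

print(act(np.array([0.12, 0.88])))  # dry and clear -> drive normally
print(act(np.array([0.90, 0.40])))  # heavy rain, low visibility -> pull over
```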

So those are kinds of robustness to distributional shift. The world changes. So, if you want something that’s safe and stays safe forever, it has to account for the world changing. So, principles of robustness to distributional shift are principles to which society, as a whole, needs to adhere. Now, do I think research in this area is differentially useful to existential risk?

No. Frankly, not at all. And the reason is that industry has loads of incentives to produce software that is robust to a changing environment. So, if on the margin I could add an idea to the idea space of robustness to distributional shift, I’m like, “Well, I don’t think there’s any chance that Uber is going to ignore robustness to distributional shift, or that Google is going to ignore it, or Amazon.” There’s no way these companies are going to roll out products while not thinking about whether they’re robust.

On the other hand, if I have a person who wants to dwell on the concept of robustness, who cares about existential risk and who wants to think about how robustness even works, like what are the mathematical principles of robustness? We don’t fully know what they are. If we did, we’d have built self driving cars already.

So, if I have a person who wants to think about that concept because it applies to society, and they want a job while they think about it, sure, get a job producing robust software or robust robotics, or get a bunch of publications in that area, but it’s not going to be neglected. It’s more of a mental exercise that can help you orient and think about society through a new lens, once you understand that lens, rather than a thing that somehow DeepMind is going to forget that its products need to be robust, come on.

Lucas Perry: So, that’s an interesting point. So, what are technical research areas, or areas in terms of AI ethics that you think there will not be natural incentives for solving, but that are high impact and important for AI existential safety?

Andrew Critch: To be clear, before I go into saying these areas are important, these areas aren’t, I do want to distinguish the claim “area X is a productive place to be if you care about existential risk” from “area X is an area that needs more ideas to solve existential safety.” I don’t want people to feel discouraged from going into intellectual disciplines that are really nourishing to the way that you’re going to learn and invent new concepts that help you think forever. And it can be a lot easier to do that in an area that’s not neglected.

So, robustness is not going to be neglected. Alignment, taking an AI system and making it do what a person wants, that’s not going to be neglected, because it’s so profitable. The economy is set up to sell to individual customers, to individual companies. Most of the world economy is anarchic in that way, anarcho-capitalist at a global scale. If you can find someone that you can give something to that they like, then you will.

The Netflix challenge is an AI alignment problem, right? The concept of AI alignment was invented in 2002, and nobody cites it because it’s so obvious of an idea that you have to make your AI do stuff. Still, it was neglected in academia because AI wasn’t super profitable. So, it is true that AI alignment was not a hot area of research in academia, but now, of course, you need AI to learn human preferences. Of course, you need AI to win in the tech sphere. And that second part is new.

So, because AI is taking off industrially, you’ve got a lot more demand for research solutions to, “Okay. How do we actually make this useful to people? How do we get this to do what people want?” And that’s why AI alignment is taking off. It’s not because of existential risk, it’s because, well, AI is finally super-duper useful and it’s finally super-duper profitable, if you can just get it to do what the customer wants. So, that’s alignment. That’s what’s called user-agent value alignment.

Now, is that a productive place to be if you care about existential risk? I think so. Yes. Because if you’re confused about what values are and how you could possibly get an inhuman system to align with the values of a human system, like human society, if that basic concept is tantalizing to you and you feel like if you just understood it a bit more, you’d be better mentally equipped to visualize existential risk playing out or not playing out on a societal scale, then yeah, totally go into that problem, think about it. And you can get a job as a researcher or an engineer aligning AI systems with the values of the human beings who use them. And it’s super enriching and hard, but it’s not going to be neglected because of how profitable it is.

Lucas Perry: So what is neglected, or what is going to be neglected?

Andrew Critch: What’s going to be neglected is stuff that’s both hard and not profitable. Transparency, I think, is not yet profitable, but it will be. So it’s neglected now. And when I say it’s not yet profitable, I mean that as far as I know, we don’t have big tech companies crushing their competition by having better visualization techniques for their ML systems. You don’t see advertisements for, “Hey, we’re hiring transparency engineers,” yet.

And so, I take that as a sign that we’ve not yet reached the industrial regime in which the ability for engineers to understand their systems better is the real bottleneck to rolling out the next product. But, I think it will be if we don’t destroy ourselves first. I think there’s a very good chance of that actually playing out.

So I think, if you want an exciting career, get into transparency now. In 10 years, you’ll be in high demand and you’ll have understood a problem that’s going to help humans and machines relate, which is, “Can we understand them well enough to manage them?” There’s other problems, unfortunately, that I think are neglected now and important, and are going to stay neglected. And I think those are the ones that are most likely to kill us.

Lucas Perry: All right, let’s hear them.

Andrew Critch: Things like how do we get multiple AI systems from multiple stakeholders to cooperate with each other? How do you broker a peace treaty between Uber and Waymo cars? That one’s not as hard because you can have the country that allows the cars into it have some regulatory decision that all the cars have to abide by, and now the cars have to get along or whatever.

Or maybe you can get the partnership on AI, which is largely American to agree amongst themselves that there’s some principles, and then the cars adhere to those principles. But it’s much harder on an international scale where there’s no one centralized regulatory body that’s just going to make all the AIs behave this way or that way. And moreover, the people who are currently thinking about that, aren’t particularly oriented towards existential risk, which really sucks.

So, I think what we need, if we get through the next 200 years with AI, frankly, if we get through the next 60 years with AI, it’s going to be because people who cared about existential risk entered institutions with the power to govern the global deployment of AI, or people already with the power to govern the global deployment of AI technologies come to care about existential and comparable societal scale risks. Because without that, I think we’re going to miss the mark.

When something goes wrong and there’s somebody whose job was clearly to make that not happen, it’s a lot easier to get that fixed. Think about people who’ve tried to get medical care since the COVID pandemic. Everybody’s decentralized; the offices are partly work-from-home, partly actually physically in there. So you’re like, “Hey, I need an appointment with a neurologist.” The person whose job it is to make the appointment is not the person whose job it is to tell the doctor that the appointment is booked.

Also, it’s someone else’s job to contact the insurance company and make sure that you’re authorized. And they might be off that day, and then you show up, and you get a big bill and you’re like, “Well, whose fault was this?” Well, it’s your fault because you were supposed to check that your insurance covered this neurology stuff, right? You could have called your insurance company to pre-authorize this visit.

So it’s your fault. But also, it’s the fault of the administrator you never talked to, who never meets you, whose job is to conduct the pre-authorization on the part of the doctor’s office, which sometimes does it, right? And it’s also the doctor’s fault, because maybe the doctor could have noticed that the authorization hadn’t been done, and didn’t cancel the appointment or warn you that maybe you don’t want to afford this right now. So whose fault is it? Oh, I don’t know.

And if you’ve ever dealt with a big fat bureaucratic failure like this, that is what is going to kill humanity. Everybody knows it’s bad. Nobody in this system, not the insurance company, not the call center that made my appointment, not the insurance specialist at the doctor’s office, certainly not the doctor, none of these people want me not to get healthcare, but it’s no one in particular’s fault. And that’s how it happens.

I think the same thing is going to happen with existential risk. We’re going to have big companies making real powerful AI systems, and it’s going to be really obvious that it is their job to make those systems safe. And there’s going to be a bunch of kinds of safety that’s really obviously their job that people are going to be real angry at them for not paying a lot of attention to. And the anger is just going to get more and more, the more obvious it is that they have power.

That kind of safety, I don’t want to trivialize it. It’s going to be hard. It’s going to be really difficult research and engineering, and it can be really enriching and many, many thousands of people could make their whole careers around making AI safe for big tech companies, according to their accountable definition of safety.

But then what about the stuff they’re not accountable for? What about geopolitics that’s nobody’s fault? What about coordination failures between three different industries, or three different companies that’s nobody’s fault? That’s the stuff that’s going to get you. I think it’s actually mathematically difficult to specify protocols for decentralized multi-agent systems to adhere to constraints. It is more difficult than specifying constraints for a single system.
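A tiny sketch of why decentralized constraints are harder than single-system constraints: every agent can satisfy a reasonable-looking local cap while the aggregate still crosses a global boundary that is nobody’s fault in particular. All numbers are illustrative:

```python
# Each agent checks only a local constraint ("my emissions are under my cap"),
# yet the aggregate crosses a global boundary nobody is accountable for.
# Specifying per-agent protocols whose *joint* behaviour respects the global
# constraint is the harder, more neglected problem pointed at above.

GLOBAL_CAP = 100.0
PER_AGENT_CAP = 10.0
NUM_AGENTS = 15

emissions = [9.5] * NUM_AGENTS   # every agent is comfortably under its own cap

locally_ok = all(e <= PER_AGENT_CAP for e in emissions)
globally_ok = sum(emissions) <= GLOBAL_CAP

print(locally_ok)    # True  -> no single agent is "at fault"
print(globally_ok)   # False -> 142.5 > 100: the aggregate violates the constraint
```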

Lucas Perry: I’m having a little bit of confusion here, because when you’re arguing that alignment questions will be solved via the incentives of just the commercialization of AI.

Andrew Critch: Single-human, single-AI alignment problems or single-institutions, single-network alignment problems. Yes.

Lucas Perry: Okay. But they also might be making single agents for many people, or multiple agents for many people. So it doesn’t seem single-single to me. But the other part is that you’re saying that in a world where there are many competing actors and a diffusion of responsibility, the existential risk comes from obvious things that companies should be doing, but no one is, because maybe someone should make a regulation about this thing but whatever, so we should just keep doing things the way that we are. But doesn’t that come back to commercialization of AI systems not solving all of the AI alignment problems?

Andrew Critch: So if by AI alignment you mean AI technology in aggregate behaves in a way that is favorable to humanity in aggregate. If that’s what you mean, then I agree that failure to align the entire economy of AI technology is a failure of AI alignment. However, number one, people don’t usually think about it that way.

If you asked someone to write down the AI alignment problem, they’ll write down a human utility function and an AI utility function, and talk about aligning the AI utility function with the human utility function. And that’s not what that looks like. That’s not a clear depiction of that super multi-agent scenario.

And, second of all, the concept of AI alignment has been around for decades and it refers to single-single alignment, typically. And third, if you want to co-opt the concept of AI alignment and start using it to refer to general alignment of general AI technology with general human values, just a spread-out notion of goodness that’s going to get spread over all of the AI technology and make it all generally good for all the humans generally. If you want to co-opt it and use it for that, you’re going to have a hard time. You’re going to invite a lot of debate about what are human values?

We’re trying to align the AI technology with the human values. So, you go from single-single to single-multi. Okay. Now we have multiple AI systems serving a single human, that’s tricky. We got to get the AI systems to cooperate. Okay. Cool. We’ll figure out how the cooperation works and we’ll get the AI systems to do that. Cool. Now we’ve got a fleet of machines that are all serving effectively.

Okay. Now let’s go to multi-human, multi-AI. You’ve got lots of people, lots of AI systems in this hyper interactive relationship. Did we align the AIs with the humans? Well, I don’t know. Are some of the humans getting really poor, really fast, while some of them are getting really rich, really fast? Sound familiar? Okay. Is that aligned? Well, I don’t know. It’s aligned for some of them. Okay. Now we have a big debate. I think that’s a very important debate and I don’t want to skirt it.

However, I think you can ask the question, did the AI technology lead to human extinction without having that debate? And I want to factor that debate of, wait, who do you mean? Who are you aligning with? I want that debate to be had, and I want it to be had separately from the debate of, did it cause human extinction?

Because I think almost all humans want humanity not to go extinct. Some are fine with it, it’s not universal, but a lot of people don’t want humanity to go extinct. I think the alignment concept, if you play forward 10 years, 20 years, it’s going to invite a lot of very healthy, very important debate that’s not necessary to have for existential safety.

Lucas Perry: Okay. So I’m not trying to defend the concept of AI alignment in relation to the concept of AI existential safety. I think what I was trying to point towards is that you said earlier that you do not want to discourage people from going into areas that are not neglected. And the areas that are not neglected are the areas where the commercialization of AI will drive incentives towards solving alignment problems.

Andrew Critch: That’s right.

Lucas Perry: But the alignment problems that are not going to get solved-

Andrew Critch: I want to encourage people to go out to solve those problems. 100%.

Lucas Perry: Yeah. But just to finish the narrative, the alignment problems that are not going to get solved are the ones where there are multiple humans and multiple AI agents, and there’s this diffusion of responsibility you were talking about. And this is the area you said would most likely lead to AI existential risk. Where maybe someone should make a regulation about this specific thing, or maybe we’re competing a little bit too hard, and then something really bad happens. So you’re saying that you do want to push people into both the unneglected area of…

Andrew Critch: Let me just flesh out a little bit more about my value system here. Pushing people is not nice. If there’s a person and they don’t want to do a thing, I don’t want to push them. That’s the first thing. Second thing is, pulling people is not nice either. So it’s like, if someone’s on the way into doing something they’re going to find intellectually enriching that’s going to help them think about existential safety that’s not neglected, it’s popular, it’s going to be popular, I don’t want to hold them back. But, if someone just comes to me and is like, “Hey, I’m indifferent between transparency and robustness.” I’m like, “100%, go into transparency, no question.”

Lucas Perry: Because it will be more neglected.

Andrew Critch: And if someone tells me they’re indifferent between transparency and multi-stakeholder delegation, I’m like, “100%, multi-stakeholder delegation.” If you’ve got traction on that and you’re not going to burn your career, do it.

Lucas Perry: Yeah. That’s the three categories then though. Robustness gets solved by the incentive structures of commercialization. Transparency, maybe less so, maybe it comes later. And then the multi-multi delegation is just the other big neglected problem of living in a global world. So, you’re saying that much of the alignment problem gets solved by incentive structures of commercialization.

Andrew Critch: Well, a lot of what people call alignment will get solved by present day commercial incentives.

Lucas Perry: Yes.

Andrew Critch: Another chunk of societal scale benefit from AI, I’ll say, will hopefully get solved by the next wave of commercial incentives. I’m thinking things like transparency, fairness, accountability, things like that are actually going to become actually commercially profitable to get right, rather than merely the things companies are afraid of getting wrong.

And I hope that second wave happens before we destroy ourselves, because possibly, we would destroy ourselves even before then. But most of my chips are on: there’s going to be a wave of benefit with AI ethics in the next 10 years or something, and that’s going to solve a bunch more of the existential safety problems, or at least address them. Left over after that is the stuff that global capitalism never got to.

Lucas Perry: And the things that global capitalism never got to are the capitalistic organizations and governments competing with one another with very strong AI systems?

Andrew Critch: Yeah. Competing and cooperating.

Lucas Perry: Competing and cooperating, unless you bring in some strong notion of paretotopia where everyone is like, “We know that if we keep doing this, that everyone is going to lose everything they care about.”

Andrew Critch: Well, the question is, how do you bring that in? If you solve that problem, you’ve solved it.

Lucas Perry: Okay. So, to wrap up on this then, as companies increasingly are making systems that serve people and need to be able to learn and adopt their values, the incentives of commercialization will continue to solve what are classically AI alignment problems that may also provide some degree of AI existential safety. And there’s the question of how much of those get solved naturally, and how much we’re going to have to do in academia and nonprofit, and then push that into industry.

So we don’t know what that will be, but we should be mindful about what will be solved naturally, and then what are the problems that won’t be, and then how do we encourage or invite more people to go into areas that are less likely to be solved by natural industrial incentives.

Andrew Critch: And do you mean areas of alignment, or areas of existential safety? I’m serious.

Lucas Perry: I know because I’m guilty of not really using this distinction in the past. Both.

Andrew Critch: Got it. I actually think mostly about single-single alignment. Like, there’s a single stakeholder, which might be a human or an institution that has one goal, like profits, right? So there’s a single human stakeholder, and then there’s a single AI. I call that single-single alignment. I almost never refer to multi-multi alignment, because I don’t know what it means and it’s not clear what the values of multiple different stakeholders even are. What are you referring to when you say the values are this function?

So, I don’t say multi-multi alignment a lot, but I do sometimes say single-single alignment to emphasize that I’m talking about the single stakeholder version. I think the multi-multi alignment concept almost doesn’t make sense. So, when someone asks me a question about alignment, I always have to ask, “Now, are you eliding those concepts again?” Or whatever.

So, we could just say single-single alignment every time and I’ll know what you’re talking about, or we could say classical alignment and I’ll probably assume that you mean single-single alignment, because that’s the oldest version of the concept from 2002. So there’s this concept of basic human rights or basic human needs. And that’s a really interesting concept, because it’s a thing that a lot of people agree on. A lot of people think murder is bad.

Lucas Perry: People need food and shelter.

Andrew Critch: Right. So there’s a bunch of that stuff. And we could say that AI alignment is about that stuff and not the other stuff.

Lucas Perry: Is it not about all of it?

Andrew Critch: I’ve seen satisfactory mathematical definitions of intent alignment. Paul Christiano talks about alignment, which I think of as intent alignment, and I think he now also calls it intent alignment: the problem of making sure an AI system is intending to help its user. And I think he’s got a pretty clear conception of what that means. I think the concept of intent alignment for a single-single AI servant is easier to define than whatever property an AI system needs to have in order to be basically good for basically everybody.

There’s a bunch of properties that people call AI alignment that are actually all quite different from each other. And people don’t recognize that they’re different from each other, because they don’t get into the technical details of trying to define them, so everyone thinks that we all mean the same thing. But what’s really going on is everyone’s going around thinking, “I want AI to be good, basically good for basically everybody.” No one’s cashing that out, and so nobody notices how much we disagree on what basically good for basically everybody means.

Lucas Perry: So that’s an excellent point, and I’m guilty here now then of having absolutely no idea of what I mean by AI alignment.

Andrew Critch: That’s my goal, because I also don’t know, and I’m glad to have company in that mental state.

Lucas Perry: Yeah. So, let’s try moving along here. And from now on, I’ll accept responsibility, and my guilt, for using the term AI alignment imprecisely. That was a fun and interesting side road, and I’m glad we pursued it. But now pivoting back into some important definitions here that you also write about in your paper, what counts to you as an AI system and what counts to you as an AI technology, and why does that distinction matter?

Andrew Critch: So throughout the ARCHES report, I advocated for distinguishing technology versus system. AI technology is like a mass noun: you can have more of it or less of it. It’s like butter that you can spread on the toast of civilization. An AI system is like a count noun: you can have one of them or many of them, and you can put an AI system on your toast the way you could put a strawberry on it, which is different from strawberry jam.

So, there are properties of AI technology that could threaten civilization, and there are also properties of a single AI system that could threaten civilization. And I think those are both important frames to think in, because you could make a system and think, “This system is not a threat to civilization,” but very quickly, when you make a system, people can copy it. People can replicate it, modify it, et cetera. And then you’ve got a technology that’s spread out, like the strawberry has become strawberry compote spread over the toast now. And do you want that? Is that good?

As an everyday person, I feel like basic human rights are a well-defined concept to me. “Is this basically good for humanity?” is a well-defined concept to me, but mathematically it becomes a lot harder to pin down. So I try to say AI technology when I want to remind people that this is going to be replicated, it’s going to show up everywhere. It’s going to be used in different ways by different actors.

At the same time, you can think of the aggregate use of AI technology worldwide as a system. You can say the internet is a system, or you can say all of the self-driving cars in the world are one big system built by multiple stakeholders. So I think that the system concept can be reframed to refer to the aggregate of all the technology of a certain type or of a certain kind. But that mental reframe is an actual act of effort, and you can switch between those frames to get different views of what’s going on. I try to alternate and use both of those views from time to time, the system view and the technology view.

Lucas Perry: All right. So let’s get into another concept here that you develop, and it’s really at the core of your paper. What is a prepotent AI? And I guess before you define what a prepotent AI is, can you define what prepotent means? I had actually never heard of that word before reading your paper.

Andrew Critch: So I’m going to give the actual standard definition of prepotent, which connotes arrogant, overbearing, high-handed, despotic, possessing or abusing excessive authority. These connotations are carried across a bunch of different Latin languages, but in English they’re not as strong. In English, prepotent just means very powerful: superior in force, influence, or authority; predominant.

I used it because it’s not that common a word, but it’s still a word, and it’s a property that AI technology can have relative to us. And it’s also a property that a particular AI system, whether it’s singular or distributed, can have relative to us. The definition that I’d give for a prepotent AI technology is technology whose deployment would transform humanity’s habitat, which is currently the earth, in a way that’s unstoppable to us.

So there are two notions here: transformativeness and unstoppability. Transformativeness is a concept that has also been elaborated by the Open Philanthropy Project. They have this transformative AI concept. I think it’s a very good concept, because it’s impact oriented. It’s not about what the AI is trying to do, it’s about what impact it has. And they say an AI system or technology is transformative if its impact on the world is comparable to, say, the agricultural revolution or the industrial revolution, a major global change in how things are done. You might argue that the internet is a transformative technology as well.

So, that’s the transformative aspect of prepotence. And then there’s the unstoppable aspect. So, imagine something that’s transforming the world the way the agricultural and industrial revolutions have transformed it, but also, we can’t stop it. And by we, I mean no subset of humans who decided they want to stop it could stop it. If every human in the world decided, “Yeah, we all want this to stop,” we would fail.

I think it’s possible to imagine AI technologies that are unstoppable to all subsets of humanity. I mean, there are things that are hard to stop right now. Take electricity: let’s say all humans decided today, for some strange reason, that we never want to use electricity anymore. That’d be a difficult transition. I think we probably could do it, but it’d be very difficult. Humanity as a society can become dependent on certain things, or intertwined with things, in a way that makes it very hard to stop them. And that’s a major mechanism by which an AI technology can be prepotent: by being intertwined with us and how we use it.
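To keep the two-part definition straight, here is a minimal sketch in Python (my own illustration, not anything from ARCHES; the class, field, and function names are invented) treating prepotence as the conjunction of transformativeness and unstoppability by any coalition of humans:

```python
from dataclasses import dataclass

@dataclass
class TechAssessment:
    """Hypothetical judgments about a deployed AI technology."""
    transforms_habitat: bool            # impact on the scale of the agricultural or industrial revolutions
    stoppable_by_some_coalition: bool   # could any subset of humanity halt it if they chose to?

def is_prepotent(assessment: TechAssessment) -> bool:
    # Prepotent = transformative AND unstoppable by every subset of humanity.
    return assessment.transforms_habitat and not assessment.stoppable_by_some_coalition

# A transformative but stoppable technology (like the solar-salesman example discussed next) is not prepotent.
assert not is_prepotent(TechAssessment(transforms_habitat=True, stoppable_by_some_coalition=True))
```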

Lucas Perry: So, can you distinguish this idea of prepotent AI, since it’s a new concept, from transformative AI, which you mentioned before, and from superintelligence, and explain why it was important to you to introduce this new concept?

Andrew Critch: Yeah. Sure. So let’s say you have an AI system that’s like a door-to-door salesman for solar panels, and it’s just going to cover everyone’s roofs with solar panels for super cheap, and all of the businesses are going to have solar panels on top, and we’re basically just not going to need fossil fuels anymore. And we’re going to be way more decentralized and independent, and states are going to be less dependent on each other for energy. So, that’s going to change geopolitics. A lot of stuff’s going to change, right?

So, you might say that that was transformative. So, you can have a technology that’s really transformative, but also maybe you can stop it. If everybody agreed to just not answer the door when the door-to-door solar panel robot salesman comes by, then they would stop. So, that’s transformative, but not prepotent. There’s a lot of different ways that you can envision AI being both transformative and unstoppable, in other words, prepotent.

I have three examples that I go to, and we’ve written about those in ARCHES. One is technological autonomy. So if you have a little factory that can make more little factories, and it can do its own science and invent its own new materials to make more robots to do more mining, to make more factories, et cetera, you can imagine a process like that getting out of hand someday. Of course, we’re very far away from that today, conceptually, but it might not be very long before we can make robots that make robots that make robots.

Self-sustaining manufacturing like that could build defenses using technology the way humans build defenses against each other. And now suddenly the humans want to stop it, but it has nukes aimed at us, so we can’t. Another one, completely different but related, is replication speed. Like the way a virus can just replicate throughout your body and destroy you without being very smart.

You can imagine it, though. I don’t know how easy it is to build this, because maybe it’s a question of nanotechnology, but can you build systems that just very quickly replicate, and just tile the earth so fast with their replicants that we die? Maybe we suffocate from breathing them, or breathing their exhaust. That one honestly seems less plausible to me than the technological autonomy one, but to some people it seems more plausible, and I don’t have a strong position on that.

And then there’s social acumen. You can imagine, say, a centralized AI system that is very socially competent, and it can deliver convincing speeches to the point of getting elected as a state official, and then brokering deals between nations that make it very hard for anybody to go against its plans, because it’s so embedded and well negotiated with everybody. And when you try to coordinate, it just whispers things, or makes threats or offers that dis-coordinate everybody again. Even though everybody wants it to stop, nobody can manage to coordinate long enough to stop it because it’s so socially skilled. So those are a few science fiction scenarios that I would say constitute prepotence on the part of the AI technology or system. They’re all different, and the interesting thing about them is that they all can happen without the AI being generally superintelligent. These are conditions that are sufficient to pose a significant existential threat to humanity, but which aren’t superintelligence. And I want to focus on those because I don’t want us to delay addressing existential risk until we have superintelligence. I want us to address it, but to find the minimum viable existential threats that we could face and head those off. So that’s why I focus on prepotence as a property rather than superintelligence: it’s a broader category that I think is still quite threatening and quite plausible.

Lucas Perry: Another interesting and important concept born of this is misaligned prepotent AI technology. Can you expand a bit on that? What is, and what should count as, misaligned prepotent AI technology?

Andrew Critch: So this was a tough decision for me, because as you’ve noticed throughout this podcast, at the technical level I find the alignment concept confusing at multi-stakeholder scales, but still critical to think about. And so I couldn’t decide whether to just talk about unsurvivable prepotent AI or misaligned prepotent AI. So let me talk about unsurvivable prepotent AI. By that, I mean it’s transformed the earth, you can’t stop it, and moreover, you’re going to die of it eventually. The AI technology has become unsurvivable in the year 2085 if, in that year, humans are doomed and cannot possibly survive. And I thought about naming the central concept unsurvivable prepotent AI, but a lot of people want to say that for them, misalignment is basically unsurvivability.

I think David also tends to think of alignment in a similar way, but there’s this question of where do you draw the line between poorly aligned and misaligned? We just made a decision to say, extinction is the line, but that’s kind of a value judgment. And one of the things I don’t like about the paper is that it has that implicit value judgment. And I think the way I would prefer people to think is in terms of the concept of unsurvivability versus survivability, or prepotence versus not. But the theme of alignment and misalignment is so pervasive that some of our demo readers preferred that name for the unsurvivable prepotent AIs.

Lucas Perry: So misaligned prepotent AI then is just some AI technology that would lead to human extinction?

Andrew Critch: As defined in the report, yep. That’s where we draw the line between aligned and misaligned. If it’s prepotent, it’s having this huge impact. When’s the huge impact definitively misaligned? Well, it’s kind of like where’s the zero line and we just kind of picked extinction to be the line to call misaligned. I think it’s a pretty reasonable line. It’s pretty concrete. And I think a lot of efforts to prevent extinction would also generalize to preventing other big risks. So sometimes, it’s nice to pick a concrete thing and just focus on it.

Lucas Perry: Yeah. I understand why and I think I would probably endorse doing it this way, but it also seems a little bit strange to me that there are futures worse than extinction and they’re going to be below the line. And I guess that’s fine then.

Andrew Critch: That’s why I think unsurvivable is a better word. But some of our demo readers just really preferred misaligned prepotent AI over unsurvivable prepotent AI. So we went with that just to make sense to our readers.

Lucas Perry: Okay. So as we’re building AI technologies, we can ask what counts as the deployment of a prepotent AI system or technology, a transformative AI system, or a misaligned prepotent AI system, and what the implications of such a deployment would be. I’m curious to get your view on what counts as the deployment of a prepotent AI system or a misaligned prepotent AI system.

Andrew Critch: So you could imagine something that’s transforming the earth and we can’t stop it, but it’s also great.

Lucas Perry: Yeah. An aligned prepotent AI system.

Andrew Critch: Yeah. Maybe it’s just building a lot of infrastructure around the world to take care of people’s health and education. Some people would find that scary and not like the fact that we can’t stop it, and maybe that fear alone would make it harmful, or maybe it would violate some principle of theirs that would matter even if they didn’t feel the fear. But you can at least imagine, under some value systems, technology that’s kind of taken over the world but is taking good care of everybody. And maybe it’s going to take care of everybody forever, so humanity will never go extinct. That’s prepotent but not unsurvivable. But that’s a dangerous move to make on a planet, to make a prepotent thing and try to make sure that it’s an aligned prepotent thing instead of a misaligned prepotent thing, because you’re unstoppably transforming the earth, and maybe you should think a lot before you do that.

Lucas Perry: And maybe prepotence is actually incompatible with alignment if we think about it enough for the reasons that you mentioned.

Andrew Critch: It’s possible. Yeah. Maybe with enough reflection on the value of human autonomy, we would eventually conclude that if humans can’t stop it, it’s fundamentally wrong in a way that will alienate and destroy humans eventually in some way. That said, I do want to add something, which is that I think almost all prepotent AI that we could conceivably make will be unsurvivably misaligned. If you’re transforming the world, most states of the world are not survivable to humans, just like most planets are not survivable to humans. So most ways that the world could be very different are just ways in which humans could not survive. So I think if you have a prepotent AI system, you have to steer it through this narrow window of futures, this narrow keyhole even, of futures where all the variables of the earth stay habitable to humans, or where we build some space colony where humans live instead of Earth.

Almost every chemical element, if you just turn up that chemical element on the earth, humans die. So that’s the thing that makes me think most conceivable prepotent AI systems are misaligned or unsurvivable. There are people who think about alignment a lot that I think are super biased by the single principal, single agent framing and have sort of lost track of the complexities of society and that’s why they think prepotent AI is conceivable to align or like not that hard to align or something. And I think they’re confused, but maybe I’m the confused one and maybe it’s actually easy.

Lucas Perry: Okay. So you’ve mentioned a little bit here that if you turn the knobs of the chemical composition of really anything much on the planet in any direction, pretty quickly you can create pretty hostile or even existentially incompatible conditions on Earth for human beings. So this brings us to the concept of basically how frail humanity is, given the conditions that are required for us to exist. What is the importance of understanding human frailty in relation to prepotent AI systems?

Andrew Critch: I think it’s pretty simple. I think human frailty implies: don’t make prepotent AI. If we lose control of the knobs, we’re at risk of the knobs getting set wrong. Now, that’s not to say we can set the knobs perfectly either, but if they start to go wrong, we can gradually set them right again. There’s still hope that we’ll stop climate change, right? I’m not saying we will, but it’s at least still possible. We haven’t made it impossible to stop. If every human in the world agreed now to just stop, we would succeed. So we should not lose control of this system, because almost any direction it could head is a disaster. So that’s why some people talk about the AI control problem, which I claim is different from the AI alignment problem. Even for a single powerful system, you can imagine it looking after you, but not letting you control it.

And if you aim for that and miss, I think it’s a lot more fraught. And I guess the point is that I want to draw attention to human fragility because I know people who think, “No, no, no. The best thing to do for humanity is to build a super powerful machine that just controls the Earth and protects the humans.” I know lots of people who think that. It makes sense logically. It’s like, “Hey, the humans. We might destroy ourselves. Look at this destructive stuff we’re doing. Let’s build something better than us to take care of us.” So I think the reasoning makes sense, but I think it’s a very dangerous thing to aim for because if we aim and miss, we definitely, definitely die.

I think transformative AI is a big enough risk. We should never make prepotent AI. We should not make unstoppable, transformative AI. And that’s why there’s so much talk about the off-switch game or the control problem or whatever. Corrigibility is kind of related to turning things off. Humans have this nice property where if half of them are destroyed, and the other half of them have the ability to notice that and do something about it, they’re quite likely to do something about it. So you get this robustness at a societal scale just by having lots of off switches.

Lucas Perry: So we’ve talked about this concept a bunch already, this concept of delegation. I’m curious if you can explain the importance and relevance of considering delegation of tasks from a human or humans to an AI system or systems. So we’re just going to unpack this taxonomy that you’ve created a bit here of single-single, single-multi, multi-single, and multi-multi.

Andrew Critch: The reason I think delegation is important is because I think a lot of human society is rightly arranged in a way that avoids absolute power from accumulating into decisions of any one person, even in the most totalitarian regimes. The concept of delegation is a way that humans hand power and responsibility to each other in political systems but also in work situations, like the boss doesn’t have to do all the work. They delegate out and they delegate a certain amount of power to people to allow the employees of a company to do the work. That process of responsibilities and tasks being handed from agent to agent to agent is how a lot of things get done in the world. And there’s many things we’ve already delegated to computers.

I think delegation of specific tasks and responsibilities is going to remain important in the future even as we approach human level AI and supersede human level AI, because people resist the accumulation of power. If you say, “Hey, I am Alpha Corp. I’m going to make a superintelligent machine now and then use it to make the world good.” You might be able to get a few employees that are like kind of wacky enough to think that yeah, taking over the world with your machine is the right company mission or whatever. But the winners of the race of AI development are going to be big teams that won because they managed to work together and pull off something really hard. And such a large institution is going to most likely have dissident members who don’t think taking over the world is the right plan for what to do with your powerful tech.

Moreover, there are going to be plenty of pressures from outside, even if you did manage to fill a company full of people who want to take over the world. They’re going to know that that’s kind of not a cool thing to do according to most people. So you’re not going to be taking over the world with AI. You’re going to be taking on specific responsibilities or handing off responsibilities. And so you’ve got an AI system that’s like, “Hey, we can provide this service. We’ll write your spam messages for you. Okay?” So then that responsibility gets handed off. Perhaps OpenAI would choose not to accept that responsibility. But let’s say you want to analyze and summarize a large corpus of text to figure out what people want. Let’s say you get 10,000 customer service emails in a day and you want something to read them and give you a summary of what people really want.

That’s a tremendously useful thing to be able to do. And let’s say OpenAI develops the capability to do that. They’ll sell that as a service, and other companies will benefit from it greatly. And now OpenAI has this responsibility that they didn’t have. They’re now responsible for helping Microsoft fulfill customer service requests. And if Microsoft sucks at fulfilling those customer service requests, now OpenAI is getting complaints from Microsoft because they summarized the requests wrong. So now you’ve got this really complicated relationship where you’ve got a bunch of Microsoft users sending in lots of emails asking for help, which are being summarized by OpenAI and then handed off to Microsoft developers to prioritize what they do next with their software. And no one is solely responsible for everything that’s happening, because the customer is responsible for what they ask, Microsoft is responsible for what they provide, and OpenAI is responsible for helping Microsoft understand what to provide based on what the customers ask.

Responsibilities get naturally shared out that way, unless somebody comes in with a lot of guns and says, “No, give me all the responsibility and all the power.” So militarization of AI is certainly a way that you could see a massive centralization of power from AI. I think states should avoid militarizing AI, to avoid scaring other states into militarizing AI. We don’t want to live in a world with militarized AI technologies. So I think if we succeed in heading off that threat, and that’s a big if, then we end up in an economy where responsibilities are being taken on, services are being provided, and everything’s suddenly very multi-stakeholder: multiple machines servicing multiple people. And I think of delegation as a sort of operation that you perform over and over that ends up distributing those responsibilities and services. And I think about how you perform a delegation step correctly. If you can do one delegation step correctly, like when Microsoft makes the decision to hand off its customer service interpretation to OpenAI’s language models, Microsoft needs to make that decision correctly.

And if it makes that decision correctly, where we define “correctly” correctly, it’ll be part of an overall economy of delegations that is respectful of humanity. So in my opinion, once you head off militarization, the task of ensuring existential safety for humanity boils down to the task of recursively defining delegation procedures that are guaranteed to preserve human existence and welfare over time.

Lucas Perry: And so you see this area of delegation as being the most x-risky.

Andrew Critch: So it’s interesting. I think delegation prevents centralization of power, which prevents one kind of x-risk. And I think we will seek to delegate. We will seek desperately to delegate responsibilities and distribute power as it accumulates.

Lucas Perry: Why would we naturally do that?

Andrew Critch: People fear power.

Lucas Perry: Do we?

Andrew Critch: If you see something with a lot more power than you, people tend to fear it and sort of oppose it. And separately, people fear having power. If you’re on a team that’s like, “Yeah, we’re going to take over the world,” you’re probably going to be like, “Really? Isn’t it bad? Isn’t that super villain to do that?” So as I predict this, I don’t want to say, “Count on somebody else to adopt this attitude.” I want people listening to adopt that attitude as well. And I both predict and encourage the prevention of extreme concentrations of power from AI development because society becomes less robust then. It becomes this one point of failure where if this thing messes up, everything is destroyed. Whereas right now, it’s not that easy for a centralized force to destroy the world by messing up. It is easy for decentralized forces to destroy the world right now. And that’s how I think it’ll be in the future as well.

Lucas Perry: And then as you’re mentioning and have mentioned, the diffusion of responsibility is where we risk potentially missing core existential safety issues in AI.

Andrew Critch: Yeah, I think that’s the area that’s not only neglected by present day economic incentives, but will likely remain neglected by economic incentives even 10, 20 years from now. And therefore, will be left as the main source of societal scale and existential risk, yeah.

Lucas Perry: And then in terms of the taxonomy you created, can you briefly define the single and multi and the relationships those can have?

Andrew Critch: When I’m talking about AI delegation, I say single-single to mean single human, single AI system, or a single human stakeholder and a single AI system. And I always refer to the number of humans first. So if I say single-multi, that means one human stakeholder, which might be a company or a person, and then multiple AI systems. And if I say multi-single, that’s multi-human, single-AI. And then multi-multi means multi-human, multi-AI. I started using this in an AGI safety course I was teaching at Berkeley in 2018, because I just noticed a lot of equivocation between students about which kinds of scenarios they were thinking about. I think there’s a lot of multi-multi delegation work that is going to matter to industry, because when you have a company selling a service to a user to do a job for an employer, things get multi-stakeholder pretty quickly. So I do think some aspects of multi-multi delegation will get addressed in industry, but I think they will be addressed in ways that are not designed to prevent existential risk. They will be addressed in ways that are designed to accrue profits.
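As a small illustrative sketch (my own, not from the report; the class and field names are invented), the taxonomy is just a pair of counts, with the humans named first:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DelegationScenario:
    """Critch's human-first naming convention: <humans>-<AI systems>."""
    num_human_stakeholders: int
    num_ai_systems: int

    def label(self) -> str:
        humans = "single" if self.num_human_stakeholders == 1 else "multi"
        ais = "single" if self.num_ai_systems == 1 else "multi"
        return f"{humans}-{ais}"

# One company delegating to a fleet of systems is "single-multi";
# several firms sharing one powerful system is "multi-single".
assert DelegationScenario(1, 50).label() == "single-multi"
assert DelegationScenario(3, 1).label() == "multi-single"
```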

Lucas Perry: And so some concepts that you also introduce here are those of control, instruction, and comprehension as being integral to AI delegation. Are those something you want to explore now?

Andrew Critch: Yeah, sure. I mean, those are pretty simple. Like when you delegate something to someone, say Alice delegates to Bob, in order to make that decision, she needs to understand Bob: what’s he capable of? What isn’t he? That’s human/AI comprehension. Do we understand AI well enough to know what we should delegate? Then there’s human/AI instruction. Can Alice explain to Bob what she wants Bob to do? And can Bob understand that? Comprehension is really a conveyance of information from Bob to Alice, and instruction is a conveyance of information from Alice to Bob. A lot of single-single alignment work is focused on how we’re going to convey that information, whereas transparency and interpretability work is more like the Bob-to-Alice direction. And then control is: well, what if this whole idea of communication is wrong and we messed it up, and we now just need to stop it, just take back the delegation? Like, I was counting on my Gmail to send you emails, but now it’s sending you a bunch of spam. I’m going to shut down my account and I’ll send you messages a different way.

That’s control. And I think of any delegation relationship as involving at least those three concepts. There might be other ones that are really important that I’ve left out. But I see a lot of research as serving one of those three nodes. And so then, you could talk about single-single comprehension. Does this person understand this system? Or we can talk about multi-single. Do this team of people understand this system? Multi-single control would be, can this team of people collectively stop or take back the delegation from the system that they’ve been using or counting on? And then it goes to multi-multi and starts to raise questions like what does it mean for a group of people to understand something? Do they all understand individually? Or do they also have to be able to have a productive meeting about it? Maybe they need to be able to communicate with each other about it too for us to consider it to be a group level understanding. So those questions come up in the definition of multi-multi comprehension, and I think they’re going to be pretty important in the end.
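Here is a minimal sketch (my framing, not the paper’s; all names are invented) of a single delegation relationship with its three channels: comprehension flowing from delegate to delegator, instruction flowing from delegator to delegate, and control as the delegator’s ability to take the delegation back:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Delegation:
    delegator: str            # e.g. "Alice", a human or institution
    delegate: str             # e.g. "Bob", an AI system
    active: bool = True
    log: List[str] = field(default_factory=list)

    def comprehend(self, report: str) -> None:
        # Comprehension: information flows delegate -> delegator (transparency / interpretability).
        self.log.append(f"{self.delegator} learns from {self.delegate}: {report}")

    def instruct(self, task: str) -> None:
        # Instruction: information flows delegator -> delegate.
        self.log.append(f"{self.delegator} instructs {self.delegate}: {task}")

    def revoke(self) -> None:
        # Control: the delegator takes back the delegation entirely (the "off switch").
        self.active = False
```

In the multi-multi versions discussed next, several humans would share the comprehension and instruction channels, and control would mean a group collectively being able to revoke the delegation.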

Lucas Perry: All right. So we’ve talked a bunch here already about single-single delegation and much of technical alignment research explores this single human-single AI agent scenario. And that’s done because it’s conceptually simple and is perhaps the most simple place to start. So when we’re thinking about AI existential safety and AI existential risk, how is starting from single-single misleading and potentially not sufficient for deep insight into alignment?

Andrew Critch: Yeah, I guess I’ve said this multiple times in this podcast, how much I think diffusion of responsibility is going to play a role in leaving problems unsolved. And I think diffusion of responsibility only becomes visible in the multi-stakeholder or multi-system or both scenarios. That’s the simple answer.

Lucas Perry: So the single-single gets solved again by the commercial incentives and then the important place to analyze is the multi-multi.

Andrew Critch: Well, I wouldn’t simplify it so much as to say the important place to analyze is the multi-multi, because consider the following. If you build a house out of clay instead of out of wood, it’s going to fall apart more easily. And understanding clay could help you make that global decision. Similarly, if your goal is to eventually produce societally safe multi-multi delegation procedures for AI, you might want to start by studying the clay that that procedure is built out of, which is the single-single delegation steps. And single-single delegation steps require a certain degree of alignment between the delegator and the delegate. So it might be very important to start by figuring out the right building material for that, figuring out the right single-single delegation steps. And I know a lot of people are approaching it that way.

They’re working on single-single delegation, but that’s not because they think Netflix is never going to launch the Netflix challenge to figure out how to align recommender systems with users. It’s because the researchers who care about existential safety want to understand what I would call single-single delegation, but what they would call single-single alignment, as a building block for what will be built next. I think that’s a great, reasonable position to have, but I think differently, because I think the day that we have super powerful single-single alignment solutions is the day that the tech leaves the laboratory and rolls out into the economy. Like, if you have very powerful AI systems that you can’t single-single align, you can’t ship a product, because you can’t get it to do what anybody wants.

So I sort of think single-single alignment solutions sort of shorten the timeline. It’s like deja vu. When everyone was working on AI capabilities, the alignment people are saying, “Hey, we’re going to run out of time to figure out alignment. You’re going to have all of these capabilities and we’re not going to know how to align them. So let’s start thinking ahead about alignment.” I’m saying the same thing about alignment now. I’m saying once you get single-single alignment solutions, now your AI tech is leaving the lab and going into the economy because you can sell it. And now, you’ve run out of time to have solved the multipolar scenario problem. So I think there’s a bit of a rush to figure out the multi-stakeholder stuff before the single-single stuff gets all figured out.

Lucas Perry: Okay. So what you’re arguing for then here is your what you call multi-multi preparedness.

Andrew Critch: Yeah.

Lucas Perry: Would you also like to state what the multiplicity thesis is?

Andrew Critch: Yeah. It’s the thing I just want to remind people of all the time, which is: don’t forget, as soon as you make tech, you copy it, replicate it, modify it. The idea that we’re going to have a single system and not, very shortly thereafter, have other instances of it or other competitors to it is sort of a fanciful, unrealistic scenario. And I just like reminding people, as we’re preparing for the future, let us prepare for the nearly inevitable eventuality that there will be multiple instances of any powerful technology. Some people take that as an argument that, “No, no, no. Actually, we should make the first instance so powerful that it prevents the creation of any other AI technology by any other actor.” And logically, that’s valid. Politically and socially, I think it’s crazy.

Lucas Perry: Uh-huh (affirmative).

Andrew Critch: I think it’s a good way to alienate anybody that you want to work with on existential risk reduction to say, “Our plan is to take over the world and then save it.” Whereas if your plan is to say, “What principles can all AI technology adhere to, such that it in aggregate will not destroy the world,” you’re not taking over anything. You’re just figuring it out. Like if there’s 10 labs in the world all working on that, I’m not worried about one of them succeeding. But if there’s 10 labs in the world all working on the safe world takeover plan, I’m like, “Hmm, now I’m nervous that one of them will think that they’ve solved safe world takeover or something.” And I kind of want to convert them all to the other thing of safe delegation, safe integration with society.

Lucas Perry: So can you take us through the risk types that you develop in your paper that lead to unsurvivability for humanity from AI systems?

Andrew Critch: Yeah. So there’s a lot of stuff that people worry about. I noticed that some of the things people worry about would sort of directly cause extinction if they happened, and some of them are kind of one degree of causal separation away from that. So in the paper I call the first kind tier one risks; those are things that would just directly lead to the deployment of an unsurvivable or misaligned prepotent AI technology. And then tier two risks are risks that lead to tier one risks. So for example, if AI companies or countries are racing really hard to develop AI faster than each other, so much that they’re not taking into account safety for the other countries or companies around them, then you get a disproportionate prioritization of progress over safety. And then you get a higher risk of societal-scale disasters, including but not limited to existential risks.

And so you could say fierce competition between AI developers is a tier two risk that leads to the tier one risk of MPAI or UPAI deployment, MPAI being misaligned prepotent AI and UPAI unsurvivable prepotent AI. And for tier one, I have this taxonomy that we use in the paper, which I like, for dividing tier one up into a few different types that I think all call for different technical approaches, because my goal is to orient on technical research problems that could actually help reduce existential risk from AI. So we’ve got this subdivision. The first one we have is basically diffusion of responsibility, or sometimes we call it unaccountable creators. In the paper, we settled on calling it uncoordinated MPAI deployment.

So the deal is, before talking about whether this or that AI system is doing what its creators want or don’t want, can we even identify who the creators are? If the creators were this kind of diffuse economy or oligarchy of companies or countries, it might not be meaningful to say, “Did the AI system do what its creators wanted?” Because maybe they all wanted a different thing. So risk type 1A is risks that arise from nobody in particular being responsible for, and therefore no one in particular being attentive to, preventing the existential risk.

Lucas Perry: That’s an uncoordinated MPAI event.

Andrew Critch: Yeah, exactly. I personally think most of the most likely risks come from that category, but they’re hard to define and I don’t know how to solve them yet. I don’t know if anybody does. But if you assume we’re not in that case, it’s not uncoordinated. Now there’s a recognizable, identifiable institution: Alpha Corp made the AI, or America made the AI, or something like that. And now you can start asking, “Okay, if there’s this recognizable creator relationship, did the creators know that they were making a prepotent technology?” And that’s how we define type 1B. We’ve got creators, but the creators didn’t know that the tech they were making was going to be prepotent. Maybe they didn’t realize it was going to be replicated or used as much as it was, or it was going to be smarter than they thought for whatever reason. But it just ended up affecting the world a lot more than they thought, or being more unstoppable than they thought.

If you make something that’s unstoppably transforming the world, which is what prepotent means, and you didn’t anticipate that, that’s bad. You’re making big waves and you didn’t even think about the direction the waves were going. So I think a lot of risk comes from making tech and not realizing how big its impact is going to be in advance. And so you could have things that become prepotent that we weren’t anticipating, and a lot of risk comes from that. That’s a whole risk category. That’s 1B. We need good science and discipline for identifying prepotence, or dependence, or unstoppability, or transformativity, all of these concepts. But suppose that’s solved; now we go to type 1C. There are creators, contrary to 1A, and the creators knew they were making prepotent tech, contrary to 1B. And I think this is weird, because a lot of people don’t want to make prepotent tech because it’s super risky, but you could imagine some groups doing it.

If they’re doing that, do they recognize that the thing they’re making is misaligned? Or do they think, “Oh yeah, this is going to take over the world and protect everybody”? This is the “I tried to take over the world and I accidentally destroyed it” scenario. So that’s unrecognized misalignment, or unrecognized unsurvivability, as a category of risk. And for that, you just need a really good theory of alignment with your values if you don’t want to destroy the world. And that’s, I think, what gets people focused on single-single alignment. They’re like, “The world’s broken. I want to fix it. I want to make magic AI that will fix the world. It has to do what I want, though. So let’s focus on single-single alignment.” But now suppose that problem is solved: contrary to type 1A you have discernible creators, contrary to 1B they know they’re playing with fire, and contrary to 1C they know it’s misaligned. They know fire burns. That’s kind of plausible. If you imagine people messing with dangerous tech in order to figure out how to protect against it, you could have a lab with people sort of brewing up dangerous cyber attack systems that could break out and exercise a lot of social acumen if they were really powerful language users. Then you could imagine something getting out. That’s what we call type 1D, involuntary MPAI deployment: maybe it breaks out, or maybe hackers break in and release it. But either way, the creators weren’t trying to do it. Then you have type 1E, which is, contrary to 1D, the creators wanted to release the MPAI.

So that’s just people trying to destroy the world. I think that’s less plausible in the short term, more plausible in the long term.
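A rough sketch of the decision tree Critch walks through above (my own restatement; the function name and argument names are invented, and the short labels paraphrase the risk types):

```python
def classify_tier1(has_identifiable_creators: bool,
                   creators_knew_prepotent: bool,
                   creators_knew_misaligned: bool,
                   deployment_was_voluntary: bool) -> str:
    """Classify a misaligned prepotent AI (MPAI) deployment event into types 1A-1E."""
    if not has_identifiable_creators:
        return "1A: uncoordinated MPAI deployment (diffusion of responsibility)"
    if not creators_knew_prepotent:
        return "1B: unrecognized prepotence"
    if not creators_knew_misaligned:
        return "1C: unrecognized misalignment"
    if not deployment_was_voluntary:
        return "1D: involuntary MPAI deployment (breakout, or break-in and release)"
    return "1E: voluntary MPAI deployment"
```

Because each question is a yes/no branch, every tier one deployment event falls into exactly one of the five types.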

Lucas Perry: So all of these fall under the category of tier one in your paper. And so all of these directly lead to an existential catastrophe for humanity. You then have tier two, which are basically hazardous conditions, which lead to the realization of these tier one events. So could you take us through these conditions, which may act as a catalyst for eliciting the creation of tier one events in the world?

Andrew Critch: Yeah, so the nice thing about the tier one events is that we use an exhaustive decision tree for categorizing them. So any tier one event, any deployment event for a misaligned prepotent AI, will fall under one of categories 1A through 1E. Unfortunately, we don’t have such a taxonomy for tier two.

So tier two is just a list of, hey, here are four things that seem pretty worrisome. 2A is companies or countries racing with each other, trying to make AI real fast and not being safe about it. 2B is economic displacement of humans. So people talk about unemployment risks from AI. Imagine that taken to an extreme where eventually humans just have no economic leverage at all, because all economic value is being produced by AI systems. AIs have taken all the jobs, including the CEO positions, including the board of directors positions, all using AIs as their delegates to go to the board meetings that are happening every five seconds because of how fast the AIs can have board meetings. Now the humans are just like, “We’re just hoping that all that economy out there is not going to somehow use up all of the oxygen,” say, in the atmosphere, or “lower the temperature of the earth by 30 degrees,” because of how much faster it would be to run supercomputers 30 degrees colder.

I think a lot of people who think about x-risk think of unemployment as this sort of mundane thing: every generation there’s some wave of unemployment from some tech, and that’s nothing compared to existential risk. But I sort of want to raise a flag here and say: one of the waves of unemployment could be the one that just takes away all human leverage and authority. We should be on the lookout for runaway unemployment that leads to prepotence through loss of control. And then human enfeeblement, that’s 2C: the humans are still around, but getting weaker and dumber and less capable of stuff, because we’re not practicing doing things, because AI is doing everything for us. Then one day we just all trip and fall and hit our heads and die, kind of thing. But more realistically, maybe we just fail to be able to make good decisions about what AI technology is doing, and we fail to notice we should be pressing the stop buttons everywhere.

Lucas Perry: The fruits of the utopia created by transformative AI are so enticing that we become enfeebled and fail at creating existential safety for advanced AI systems.

Andrew Critch: Or we use the systems in a stupid way because we all got worse at arithmetic, and we couldn’t imagine the risks, and we became scope insensitive to them or something. There are a lot of different ways you can imagine humans just being weaker because AI is sort of helping us. And then type 2D is discourse impairment about existential safety. This is something we saw a lot of in 2014, before FLI hosted the Puerto Rico conference to basically kick off discourse on existential safety for AI and other big risks from AGI. Luckily, since then there have been efforts from FLI, and then the Concrete Problems in AI Safety paper was an early example of acknowledging negative outcomes.

And then you have the ACM push to acknowledge negative risks, and now the NeurIPS broader impact stuff. There’s lots of acknowledgement of negative outcomes now. The discourse around negative outcomes has improved, but I think discourse on existential safety has a long way to go. It’s progressed, but it still has a long way to go. If we keep not being able to talk about it, for example, if we keep having to call existential safety “safety,” right? If we keep having to call it that, because we’re afraid to admit to ourselves or each other that we’re thinking of existential stakes, we’re never really going to properly analyze the concept or visualize the outcomes together. I think there’s a big risk from people sort of feeling like they’re thinking about existential safety, but not really saying it to each other, and not really getting into the details of how society works at a large scale, and therefore kind of ignoring it and making a bunch of bad decisions.

And I call that discourse impairment. It can happen because the topic is taboo, or it can happen because it’s just easier to talk about safety, because safety is everywhere.

Lucas Perry: All right, so we’ve made it through what is essentially the first third of your paper here. It lays out much of the conceptual and language foundations, which are used for the rest, which tries to more explicitly flesh out the research directions for existential safety for AI systems, correct?

Andrew Critch: Yeah. And I would say the later sections are a survey of research directions attacking different aspects, and possibly exacerbating different aspects too. You earlier called this a research agenda, but I don’t think it’s quite right to call it an agenda because, first of all, I’m not personally planning to research every topic in here, although I would be happy to research any of them. So this is not like, “Here’s the plan: we’re going to do all these areas.” It’s more like, “Here’s a survey of areas and an analysis of how they flow into each other.” For example, single-single transparency research can flow into coordination models for single-multi comprehension. It’s a view rather than a plan, because I think a plan should take into account more things, like what’s neglected and what industry is going to solve on its own.

My plan would be to pick sections out of this report and call those my agenda. My personal plan is to focus more on multi-agent stuff, and also some social metacognition stuff that I’m interested in. So if I wrote a research agenda, it would be about certain areas of this report, but the rest of the report is really just trying to look at all of these areas that I think relate to existential safety, and kind of analyze how they relate.

Lucas Perry: All right, Andrew, well, I must say that on page 33, it says, “This report may be viewed as a very coarse description of a very long term research agenda, aiming to understand and improve blah, blah, blah.”

Andrew Critch: It’s true. It may be viewed as such and you may have just viewed it as such.

Lucas Perry: Yeah, I think that’s where I got that language from.

Andrew Critch: It’s true. Yeah, and I think if an institution just picked up this report and said, “This is our agenda.” I’d be like, “Cool, go for it. That’s a great plan.”

Lucas Perry: All right. I’m just getting you back for nailing me on the definition of AI alignment.

Andrew Critch: Okay.

Lucas Perry: Let’s hit on some of the most important or key aspects here then for this final part of the paper. We have three questions here. The first is, how would you explain the core of your concerns about, and the importance of, flow-through effects? What are flow-through effects, and why are they important for considering AI existential safety?

Andrew Critch: Flow-through effects just means that if A affects B and B affects C, then indirectly A affects C. Effects like that can be pretty simple in physics, but they can be pretty complicated in medicine, and they might be even more complicated in research. If you do research on single-single transparency, that’s going to flow through to single-multi instruction. How is a person going to instruct a hierarchy of machines? Can they delegate to the machines to delegate to other machines? Okay, now can I understand that? Okay, cool. There’s a flow-through effect there. Then that’s going to flow through to multi-multi control. How can you have a bunch of people instructing a bunch of machines and still have control over them, if the instructions aren’t being executed to satisfaction, or if they’re going to cause a big risk or something?
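As a tiny sketch (my own, not from the report; the area names just echo the chain Critch describes), flow-through effects are paths in a directed graph of research areas, and indirect effects are reachability in that graph:

```python
from collections import defaultdict

# Hypothetical "flows into" edges between research areas.
flows_into = defaultdict(set)
flows_into["single-single transparency"].add("single-multi instruction")
flows_into["single-multi instruction"].add("multi-multi control")

def downstream(area: str, graph=flows_into) -> set:
    """All areas an advance in `area` flows through to, directly or indirectly."""
    seen, stack = set(), [area]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Single-single transparency affects multi-multi control only indirectly.
assert "multi-multi control" in downstream("single-single transparency")
```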

And some of those flow-through effects can be good, and some of them could be bad. For example, you can imagine work in transparency flowing through to really rapid development in single-multi instruction, because you can understand more of what all the little systems are doing, so you can tell more of them what to do and get more stuff done. Then that could flow through to disasters in multi-multi control, because you’ve got races between powerful institutions that are delegating to large numbers of individual systems that they understand separately, but whose interaction at a global scale is not understood by any one institution. So then you just get this big cluster of pollution or other problems being caused for humans as a side effect. Just thinking about a problem that’s a sub-problem of the final solution is not always helpful societally, even if it is helpful to you personally in understanding how to approach the societal-scale solution. My personal biggest area of interest, I’m kind of split between two things.

One is, if you have a very powerful system and several stakeholders with very different priorities or beliefs trying to decide a policy for that system. Imagine the U.S., China, and Russia trying to reach an agreement on some global cybersecurity protocol that’s AI mediated, or Uber and Waymo trying to agree on the principles their cars are going to follow when they’re doing lane changes. Are they going to try to intimidate each other to get a better chance at those lane changes? Is that going to put the humans at risk? Yes? Okay, can we all not intimidate each other and therefore not put the passengers at risk? That’s a big question for me: how can you make systems such that powerful stakeholders are able to negotiate for control over the system?

It’s like the system is not even deployed yet. We’re considering deploying it, and we’re negotiating over the parameters of the system. I want the system to have a nice API for the negotiating powers, so they can sort of turn knobs until they’re all satisfied with it. I call that negotiable AI. I’ve got a paper called Negotiable Reinforcement Learning with a student. I think that kind of encapsulates the problem, but it’s not a solution to the problem by any means. It’s merely drawing attention to it. That’s one core thing that I think is going to be really important: multi-stakeholder control. Not multi-stakeholder alignment, not making all the stakeholders happy, but making them work together in sharing the system, which might sometimes leave one of them unhappy. But at least they’re not all fighting and causing disasters from the externalities of their competition. The other one is almost the same principle, but where the negotiation is happening between the AI systems instead of the people.

So how do you get two AI systems, say System A and System B serving Alice and Bob, where Alice and Bob want very different things, to cooperate? Now A and B have to get along. How can A and B get along and broker an agreement about what to do that’s better than fighting? Both of these areas of research are trying to make peace: in the first case between the human institutions controlling a powerful system, and in the second case between two AI systems. I don’t know how to do this at all. That’s why I try to focus on it. It’s sort of nobody’s job, except for maybe the UN, and the UN doesn’t have… The cars-getting-along thing is kind of like a National Institute of Standards thing maybe, or a Partnership on AI thing maybe, so maybe they’ll address that, but it’s still super interesting to me and possibly generalizable to bigger, higher stakes issues.

So I don’t claim that it’s going to be completely neglected as an area. It’s just very interesting to me, and at a technical level it seems neglected. I think there’s lots of policy thinking about these issues, but what shape does the technology itself need to have to make it easy for policymakers to set the standards, for it to be sort of negotiable and cooperative? That’s where my interests lie.

Lucas Perry: All right. And so that also matches up with everything else you said, because those are two sub-problems of multi-multi situations.

Andrew Critch: Yes.

Lucas Perry: All right. So the next question is, is there anything else you’d like to add about how thinking about AI research directions affects AI existential risk?

Andrew Critch: I guess I would just add that people need to feel permission to work on things because they need to understand them, rather than because they know it’s going to help the world. I think there’s a lot of paranoia here: if you manage to care about existential risk, you’re thinking about these high stakes, and it’s easy to become paranoid. What if I accidentally destroy the world by doing the wrong research or something? I don’t think that’s a healthy state for a researcher. Maybe for some it’s healthy, but for a lot of people that I’ve met, that’s not conducive to their productivity.

Lucas Perry: Is that something that you encounter a lot, people who have crippling anxiety over whether the research direction is correct?

Andrew Critch: Yeah, and varying degrees of crippling. Some of it you would actually call anxiety, where the person’s experiencing actual anxiety. But more often it’s just a kind of festering unproductivity. It’s thinking of an area and saying, “But that’s just going to advance capabilities, so I won’t work on it,” or thinking of an area and saying, “Oh, that’s just going to hasten the economic deployment of AI systems, so I’m not going to work on it.” I do that kind of triage, but more so because I want to find neglected areas, rather than because I’m afraid of building the wrong tech or something. I find that mentality doesn’t inhibit my creativity or something. I want people to be aware of flow-through effects and that any tech can flow through to have a negative impact that they didn’t expect. And because of that, I want everyone to sort of raise their overall vigilance towards AI technology as a whole. But I don’t want people to feel paralyzed, like, “Oh no, what if I invent really good calibration for neural nets? Or what if I invent really good bounded rationality techniques and then accidentally destroy the world because people use them?”

I think what we need is for people to go ahead and do their research, but be aware that x-risk is on the horizon, and to start building institutional structures to make higher and higher stakes decisions about AI deployments, along with being supportive of areas of research that are conducive to those decisions being made. I want to encourage people to go into these neglected areas that I’m describing, but I don’t want people to think I’m saying they’re bad for doing anything else.

Lucas Perry: All right. Well, that’s some good advice then for researchers. Let’s wrap up here then on some important questions about relevant multi-stakeholder objectives. We have four that we can explore. The first is facilitating collaborative governance, and the next is avoiding races by sharing control. Then we have reducing idiosyncratic risk-taking, and our final one is existential safety systems. Could you take us through each of these and how they are relevant multi-stakeholder objectives?

Andrew Critch: Yeah, sure. So this section of the report is a pause between the sections about research for single human stakeholders and research for multiple human stakeholders. It’s there to sort of explain why I think it’s important to think about multiple human stakeholders, and important not just in general. I mean, it’s obviously important for a lot of aspects of society, but I’m trying to focus on why it’s important to existential risk specifically.

So the first reason, facilitating collaborative governance, is that I think it’s good if people from different backgrounds with different beliefs and different priorities can work together in governing AI. If you need to decide on a national standard, if you need an international standard, if you need to decide on rules that AI is not allowed to break, or that developers are not allowed to break. It’s going to suck if researchers in China make up some rules and researchers in America make up different rules, and the American rules don’t protect from the stuff that the Chinese rules protect from and the Chinese rules don’t protect from the stuff the American rules protect from. Moreover, systems interacting with each other are not going to protect from either of those risks.

It’s good to be able to collaborate in governing things. Thinking about systems and technologies as having a lot of stakeholders is key to preparing those technologies in a form that allows them to be collaborated over. Think about Google Docs. I can see your cursor moving when you write in a Google Doc. That’s really informative in a way that other collaborative document editing software does not allow. I don’t know if you’ve ever noticed how informative it is to see where someone’s cursor is, versus using another platform where you can only see the line someone’s on, but not what character they’re typing right now or what word they’re thinking of. You’re way, way, way less in tune with each other when you’re writing together if you can’t see the cursors.

That’s an example of a way in which Google Docs just has this extra feature that makes it way easier to negotiate for control, because you’re not getting into an edit war: if I’m editing something, I’m not going to put my cursor where your cursor is. Or if I start backspacing a word that you just wrote, you know I must mean it, it must be an important change, because I just interrupted your cursor. Maybe you’re going to let me finish that backspace and see what the hell I’m doing. There’s this negotiability over the content of the document, and it’s a consequence of the design of the interface. I think similarly AI technology could be designed with properties that make it easier for different stakeholders to cooperate in the act of exercising control over the system and its priorities. I think that sort of design question is key to facilitating collaborative governance, because you can have stakeholders from different institutions and different cultures collaborating in the act of governing or controlling systems and observing what principles the systems need to adhere to for the purposes of different cultures or different values and so on.

Now, why is that important? Well, there are lots of warm fuzzies from people working together and stuff. But one reason it’s important is that it reduces incentives to race. If we can all work together to set the speed limit, we don’t all have to drive as fast as we can to beat each other. That’s section 7.2, avoiding races by sharing control, and then section 7.3 is reducing idiosyncratic risk taking. Basically everybody kind of wants different things, but there’s a whole bunch of stuff we all don’t want. This kind of comes back to what you said about there being basic human values. Most of us don’t want humanity to go extinct. Most of us don’t want everyone to suffer greatly, but everybody kind of has a different view of what utopia should look like. That’s kind of maybe where the paretotopia concept came from.

It’s like everybody has a different utopia in mind, but nobody wants dystopia. If you imagine a powerful AI technology that might get deployed, and there’s a bunch of people on the committee deciding to make the deployment decision or deciding what features it should have, you can imagine one person on the committee being like, “Well, this poses a certain level of societal scale risk, but it’s worth it because of the anti-aging benefits that the AI is going to produce through its research, that’s going to be great.” Then another person on the committee is like, “Well, I don’t really care about anti-aging, but I do care about space travel. I want it to take a risk for that.” Then they’re like, “Wait a minute, I think we have this science assistant AI. We should use it on anti-aging, not space.” And the space travel person’s like, “We should use it on space travel, not anti-aging.”

Because of that, they don’t agree, and that slows progress, but maybe a little slower progress is a safer thing for humanity. Everyone has their agenda that they want to risk the world for, but because everyone disagrees about which risks are worth it, you sort of slow down and say, “Maybe collectively, we’re just not going to take any of these risks right now and we’ll just wait until we can do it with less risk.” So reducing idiosyncratic risk taking is just my phrase for the way everyone’s individual desire to take risks kind of averages out. Whereas every member of the committee doesn’t want human extinction, so that doesn’t get washed out. It’s like everybody wants it to not destroy the world, whereas not everybody wants it to colonize space or cure aging. You end up conservative on the risk if you can collaboratively govern.
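
To make that averaging-out dynamic a bit more concrete, here is a toy numerical sketch (an editorial illustration, not anything from ARCHES): each committee member weighs a proposal's shared extinction risk against the benefit they personally care about, and deployment requires unanimity. All names, numbers, and the approval rule are invented.

```python
# Toy sketch of unanimity-style governance: idiosyncratic agendas fail to get
# unanimous approval, while the shared aversion to extinction risk always counts.
# Everything here is hypothetical and purely illustrative.

proposals = {
    # proposal: (shared_extinction_risk, {member: personal_benefit})
    "anti_aging_push": (0.02, {"alice": 0.9, "bob": 0.1, "carol": 0.1}),
    "space_push":      (0.02, {"alice": 0.1, "bob": 0.9, "carol": 0.1}),
    "do_nothing_yet":  (0.00, {"alice": 0.2, "bob": 0.2, "carol": 0.2}),
}

def member_approves(risk: float, benefit: float, risk_weight: float = 10.0) -> bool:
    """A member approves only if the benefit they care about outweighs the risk."""
    return benefit - risk_weight * risk > 0

for name, (risk, benefits) in proposals.items():
    unanimous = all(member_approves(risk, b) for b in benefits.values())
    print(f"{name}: deployed = {unanimous}")

# Only "do_nothing_yet" passes: each risky proposal is blocked by the members
# whose pet agenda it doesn't serve, so the committee ends up risk-conservative,
# while everyone's shared desire not to destroy the world never gets washed out.
```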

Then you’ve got existential safety systems, which is the last thing. If we did someday try to build AI tech that actually protects the world in some way, like say through cybersecurity or through environmental protection, that’s terrifying by the way, AI that controls the environment. But anyway, it’s also really promising, maybe we can clean up. It’s just a big move; ceding control of the environment to AI systems is a big move. But as long as you’ve got lots of off switches, maybe it’s great. Those big moves are scary because of how big they are. A lot of institutions would just never allow it to happen because of how scary it is. It’s like, “All right, I’ve got this garbage cleanup AI that’s just going to actually go clean up all the garbage, or it’s going to scrub all the CO2 with this little replicating photosynthetic lab here that’s going to absorb all the carbon dioxide and store it as biofuel. Great.” That’s scary. You’re like, whoa, you’re just unrolling the self-replicating biofuel lab all over the world. People won’t let that happen.

I’m not sure what the right level of risk tolerance is for saving the world versus risking the world. But whatever it is, you are going to want existential safety nets, literal existential safety nets, there to protect from big disasters. Whether the system is just an algorithm that runs on the robots that are doing whatever crazy world intervention you’re doing, or whether it’s actually a separate system. But if you’re making a big change to the world for the sake of existential safety, you’re not going to get away with it unless a lot of people are involved in that decision. This is kind of a bid to the people who really do want to make big world interventions, sometimes for the sake of safety: you’re going to have to appeal to a lot of stakeholders to sort of be allowed to do that.

So those are four reasons why I think developing your tech in a way that really is compatible with multiple stakeholders is going to be societally important and not automatically solved by industry standards. Maybe solved in special cases that are profitable, but not necessarily generalizable to these issues.

Lucas Perry: Yeah, the set of problems that are not naturally solved by industry and its incentives, but that are crucial for existential safety, is the set of problems it seems we crucially need to identify, anticipate, and engage in research on today, being mindful of flow-through effects, so that we’re able to have as much leverage as possible on that set of problems, given that they’re most likely not to be solved without a lot of foresight and intervention from outside of industry and the normal flow of incentives.

Andrew Critch: Yep, exactly.

Lucas Perry: All right, Andrew, wrapping things up, I just want to offer you a final bit of space to give any final words you’d like to say about the paper or AI existential risk, if there’s anything you feel is unresolved or that you’d really like to communicate to everyone.

Andrew Critch: Yeah, thanks. I’d say if you’re interested in existential safety or something adjacent to it, use specific words for what you mean instead of just calling it AI safety all the time. Whatever your thing is, maybe it’s not existential safety, maybe it’s a societal scale risk or single-multi alignment or something, but try to get more specific about what we’re interested in, so that it’s easier for newcomers thinking about these topics to know what we mean when we say them.

Lucas Perry: All right. If people want to follow you or get in touch or find your papers and work, where are the best places to do that?

Andrew Critch: For me personally, or David Krueger, the other coauthor on this report, you can just Google our names and our research homepages will show up, and then you can see what our papers are. Obviously Google Scholar is always a good avenue; Google Scholar sorted by year is a good trick because you can see what people are working on now. But there’s also the Center for Human Compatible AI, where I work. There’s a bunch of other research going on there that I’m not doing, but I’m still very interested in it, and I’d probably be interested in doing more research in that vein. I would say check out humancompatible.ai, or acritch.com for me personally. I don’t know what David’s homepage is, but I’m sure you can find it by Googling David Krueger.

Lucas Perry: All right, Andrew, thanks so much for coming on and for your paper. I feel like I honestly gained a lot of perspective here on the need for clarity about definitions and what we mean. You’ve given me a better perspective on the kind of problem that we have and the kind of solutions that it might require, and so for that, I’m grateful.

Andrew Critch: Thanks.

End of recorded material

Iason Gabriel on Foundational Philosophical Questions in AI Alignment

 Topics discussed in this episode include:

  • How moral philosophy and political theory are deeply related to AI alignment
  • The problem of dealing with a plurality of preferences and philosophical views in AI alignment
  • How the is-ought problem and metaethics fits into alignment 
  • What we should be aligning AI systems to
  • The importance of democratic solutions to questions of AI alignment 
  • The long reflection

 

Timestamps: 

0:00 Intro
2:10 Why Iason wrote Artificial Intelligence, Values and Alignment
3:12 What AI alignment is
6:07 The technical and normative aspects of AI alignment
9:11 The normative being dependent on the technical
14:30 Coming up with an appropriate alignment procedure given the is-ought problem
31:15 What systems are subject to an alignment procedure?
39:55 What is it that we’re trying to align AI systems to?
01:02:30 Single agent and multi agent alignment scenarios
01:27:00 What is the procedure for choosing which evaluative model(s) will be used to judge different alignment proposals
01:30:28 The long reflection
01:53:55 Where to follow and contact Iason

 

Citations:

Artificial Intelligence, Values and Alignment 

Iason Gabriel’s Google Scholar

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today, we have a conversation with Iason Gabriel about a recent paper that he wrote titled Artificial Intelligence, Values and Alignment. This episode primarily explores how moral and political theory are deeply interconnected with the technical side of the AI alignment problem, and important questions related to that interconnection. We get into the problem of dealing with a plurality of preferences and philosophical views, the is-ought problem, metaethics, how political theory can be helpful for resolving disagreements, what it is that we’re trying to align AIs to, the importance of establishing a broadly endorsed procedure and set of principles for alignment, and we end on exploring the long reflection.

This was a very fun and informative episode. Iason has succeeded in bringing new ideas and thought to the space of moral and political thought in AI alignment, and I think you’ll find this episode enjoyable and valuable. If you don’t already follow us, you can subscribe to this podcast on your preferred podcasting platform by searching for The Future of Life or following the links on the page for this podcast.

Iason Gabriel is a Senior Research Scientist at DeepMind where he works in the Ethics Research Team. His research focuses on the applied ethics of artificial intelligence, human rights, and the question of how to align technology with human values. Before joining DeepMind, Iason was a Fellow in Politics at St John’s College, Oxford. He holds a doctorate in Political Theory from the University of Oxford and spent a number of years working for the United Nations in post-conflict environments.

And with that, let’s get into our conversation with Iason Gabriel.

Lucas Perry: So we’re here today to discuss your paper, Artificial Intelligence, Values and Alignment. To start things off here, I’m interested to know what you found so compelling about the problem of AI values and alignment, and generally, just what this paper is all about.

Iason Gabriel: Yeah. Thank you so much for inviting me, Lucas. So this paper is, in broad brush strokes, about how we might think about aligning AI systems with human values. And I wrote this paper because I wanted to bring different communities together. So on the one hand, I wanted to show machine learning researchers that there were some interesting normative questions about the value configuration we align AI with that deserve further attention. At the same time, I was keen to show political and moral philosophers that AI was a subject that provoked real philosophical reflection, and that this is an enterprise that is worthy of their time as well.

Lucas Perry: Let’s pivot into what the problem is then that technical researchers and people interested in normative questions and philosophy can both contribute to. So what is your view then on what the AI problem is? And the two parts you believe it to be composed of.

Iason Gabriel: In broad brush strokes, I understand the challenge of value alignment in a way that’s similar to Stuart Russell. He says that the ultimate aim is to ensure that powerful AI is properly aligned with human values. I think that when we reflect upon this in more detail, it becomes clear that the problem decomposes into two separate parts. The first is the technical challenge of trying to align powerful AI systems with human values. And the second is the normative question of what or whose values we try to align AI systems with.

Lucas Perry: Oftentimes, I also see a lot of reflection on AI policy and AI governance as being a core issue to consider here, given that people are concerned about things like race dynamics and unipolar versus multipolar scenarios with regards to something like AGI. What are your thoughts on this? And I’m curious to know why you break the problem down into technical and normative parts without introducing political or governance issues.

Iason Gabriel: Yeah. So this is a really interesting question, and I think one we’ll probably discuss at some length later: the role of politics in creating aligned AI systems. Of course, in the paper, I suggest that an important challenge for people who are thinking about value alignment is how to reconcile the different views and opinions of people, given that we live in a pluralistic world, and how to come up with a system for aligning AI systems that treats people fairly despite that difference. In terms of practicalities, I think that people envisage alignment in different ways. Some people imagine that there will be a human parliament or a kind of centralized body that can give very coherent and sound value advice to AI systems, and essentially that the human element will take care of this problem with pluralism and just give AI very, very robust guidance about things we’ve all agreed are the best thing to do.

At the same time, there’s many other visions for AI or versions of AI that don’t depend upon that human parliament being able to offer such cogent advice. So we might think that there are worlds in which there’s multiple AIs, each of which has a human interlocutor, or we might imagine AIs as working in the world to achieve constructive ends and that it needs to actually be able to perform these value calculations or this value synthesis as part of its kind of default operating procedure. And I think it’s an open question what kind of AI system we’re discussing and that probably the political element understood in terms of real world political institutions will need to be tailored to the vision of AI that we have in question.

Lucas Perry: All right. So can you expand then a bit on the relationship between the technical and normative aspects of AI alignment?

Iason Gabriel: A lot of the focus is on the normative part of the value alignment question: trying to work out which values to align AI systems with, whether those are values that really matter, and how this can be decided. I think this is also relevant when we think about the technical design of AI systems, because I think that most technologies are not value agnostic. So sometimes, when we think about AI systems, we assume that they’ll have this general capability and that it will almost be trivially easy for them to align with different moral perspectives or theories. Yet when we take a ground level view and we look at the way in which AI systems are being built, there are various path dependencies that are setting in, and there are different design architectures that will make it easier to follow one moral trajectory rather than another.

So for example, if we take a reinforcement learning paradigm, which focuses on teaching agents tasks by enabling them to maximize reward in the face of uncertainty over time, a number of commentators have suggested that that model fits particularly well with a kind of utilitarian decision theory, which aims to promote happiness over time in the face of uncertainty, and that it would actually struggle to accommodate a moral theory that embodies something like rights or hard constraints. And so I think that if what we do want is a rights-based vision of artificial intelligence, it’s important that we get that ideal clear in our minds and that we design with that purpose in mind.

This challenge becomes even clearer when we think about moral philosophies, such as a Kantian theory, which would ask an agent to reflect on the reasons that it has for acting, and then ask whether they universalize to good states of affairs. And this idea of using the currency of a reason to conduct moral deliberation would require some advances in terms of how we think about AI, and it’s not something that is very easy to get a handle on from a technical point of view.
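
To make the contrast Iason is drawing a little more concrete, here is a minimal sketch (an editorial illustration, not from the paper or any particular RL library) of the two decision rules: pick the reward-maximizing action, versus pick the best action among those that pass a hard side constraint. The actions, rewards, and the notion of a "rights-violating" action are hypothetical placeholders.

```python
# Reward maximization (utilitarian-flavored) vs. maximization under hard
# side constraints (rights-flavored). Purely illustrative.

from typing import Callable, Dict, List

def max_reward_action(actions: List[str], reward: Dict[str, float]) -> str:
    """Pick whatever maximizes expected reward, with no further restriction."""
    return max(actions, key=lambda a: reward[a])

def constrained_action(actions: List[str],
                       reward: Dict[str, float],
                       permissible: Callable[[str], bool]) -> str:
    """First filter out impermissible actions, then maximize reward among the rest."""
    allowed = [a for a in actions if permissible(a)]
    if not allowed:
        raise ValueError("No permissible action available")
    return max(allowed, key=lambda a: reward[a])

actions = ["deceive_user", "ask_consent", "do_nothing"]
reward = {"deceive_user": 1.0, "ask_consent": 0.6, "do_nothing": 0.0}
violates_rights = {"deceive_user"}  # hypothetical hard constraint

print(max_reward_action(actions, reward))                                       # deceive_user
print(constrained_action(actions, reward, lambda a: a not in violates_rights))  # ask_consent
```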

Lucas Perry: So the key takeaway here is that what is going to be possible normatively, in terms of moral learning and moral reasoning in AI systems, will supervene upon the technical pathways that we take, and so it is important to be aware of the relationship between what is possible normatively, given what is technically known, and to try to navigate that relationship mindfully?

Iason Gabriel: I think that’s precisely right. I see at least two relationships here. So the first is that if we design without a conception of value in mind, it’s likely that the technology we build will not be able to accommodate just any value constellation. And then the mirror side of that is that if we have a clear value constellation in mind, we may be able to develop technologies that can actually implement or realize that ideal more directly and more effectively.

Lucas Perry: Can you make a bit more clear the ways in which, for example, path dependency of current technical research makes certain normative ethical theories more plausible to be instantiated in AI systems than others?

Iason Gabriel: Yeah. So, I should say that obviously there’s a wide variety of different methodologies being tried at the present moment, and that intuitively, they seem to match up well with different kinds of theory. Of course, the reality is that a lot of effort has been spent trying to ensure that AI systems are safe and that they are aligned with human intentions. When it comes to richer goals, so trying to evidence a specific moral theory, a lot of this is conjecture, because we haven’t really tried to build utilitarian or Kantian agents in full. But I think in terms of the details, with regards to reinforcement learning, we have an optimization-driven process, and there is that whole class of moral theories that basically use that kind of decision process to achieve good states of affairs. And we can imagine roughly equating the reward that we train an RL agent on with some metric of subjective happiness, or something like that.

Now, if we were to take a completely different approach, say virtue ethics: virtue ethics is radically contextual, obviously. It says that the right thing to do in any situation is the action that evidences certain qualities of character, and that these qualities can’t be expressed through a simple formula that we can maximize for, but actually require a kind of context dependence. So I think that if that’s what we want, if we want to build agents that have a virtuous character, we would really need to think about the fundamental architecture, potentially in a different way. And I think that kind of insight has actually been speculatively adopted by people who consider forms of machine learning like inverse reinforcement learning, who imagine that we could present an agent with examples of good behavior and that the agent would then learn from them in a very nuanced way, without us ever having to describe the action in full or give it appropriate guidance for every situation.

So, as I said, these really are quite tentative thoughts, but it doesn’t seem at present possible to build an AI system that adapts equally well to whatever moral theory or perspective we believe ought to be promoted or endorsed.
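
For readers unfamiliar with the inverse reinforcement learning idea Iason mentions, here is a very small sketch of the underlying intuition: rather than hand-specifying values, score candidate reward functions by how well they explain demonstrated choices and keep the best fit. This is a toy scoring rule written for illustration, not a real IRL algorithm; the states, actions, and candidate reward functions are all invented.

```python
# Toy illustration of the inverse-RL intuition: which candidate reward function
# best explains the demonstrated behavior?

demonstrations = [
    ("stranger_drops_wallet", "return_wallet"),
    ("friend_is_sad", "comfort_friend"),
]

candidate_rewards = {
    "self_interest": {"return_wallet": 0.0, "keep_wallet": 1.0,
                      "comfort_friend": 0.1, "ignore_friend": 0.5},
    "prosocial":     {"return_wallet": 1.0, "keep_wallet": 0.0,
                      "comfort_friend": 0.9, "ignore_friend": 0.1},
}

options = {"stranger_drops_wallet": ["return_wallet", "keep_wallet"],
           "friend_is_sad": ["comfort_friend", "ignore_friend"]}

def fit(reward: dict) -> float:
    """Count how often the demonstrated action is the reward-maximizing one."""
    return sum(1.0 for state, action in demonstrations
               if action == max(options[state], key=lambda a: reward[a]))

best = max(candidate_rewards, key=lambda name: fit(candidate_rewards[name]))
print(best)  # "prosocial" explains the demonstrations better than "self_interest"
```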

Lucas Perry: Yeah. So, that does make sense to me, that different techniques would be more or less well suited for readily and fully adopting certain normative perspectives and capacities in ethics. I guess the part that I was getting a little bit tripped up on is that I was imagining that if you have an optimizer being trained on something like “maximize happiness,” then given the massive epistemic difficulties of running an actual utilitarian optimization process that only thinks at the level of happiness, and how impossibly difficult that would be, the optimizer would, like human beings who are consequentialists, through gradient descent or being pushed and nudged from the outside or something, find virtue ethics and deontological ethics, and those could then be run as part of its world model, such that the task of happiness optimization becomes much easier. But I see how it intuitively lines up more obviously with utilitarianism, and how it would be more difficult to get it to find other things that we care about, like virtue ethics or deontological ethics. Does that make sense?

Iason Gabriel: Yeah. I mean, it’s a very interesting conjecture. The idea that if you set an agent off with the learned goal of trying to maximize human happiness, it would almost by necessity learn to accommodate other moral theories and perspectives kind of suggests that there is a core driver which animates moral inquiry, which is this idea of collective welfare being realized in a sustainable way. And that might be plausible from an evolutionary point of view, but there are also other aspects of morality that don’t seem to be built so clearly on what we might even call the pleasure principle. And so I’m not entirely sure that you would actually get to a rights-based morality if you started out from those premises.

Lucas Perry: What are some of these things that don’t line up with this pleasure principle, for example?

Iason Gabriel: I mean, of course, utilitarians have many sophisticated theories about how endeavors to improve total aggregate happiness involve treating people fairly, placing robust side constraints on what you can do to people, and potentially even encompassing other goods, such as animal welfare and the wellbeing of future generations. But I believe the consensus, or the preponderance of opinion, is that unless we can say that certain things matter fundamentally, for example human dignity or the wellbeing of future generations or the value of animal welfare, it is quite hard to build a moral edifice that adequately takes these things into account just through their instrumental relationships with human wellbeing or human happiness so understood.

Lucas Perry: So then we have this technical problem of how to build machines that have the capacity to do what we want them to do, and to help us figure out what we would want to want the machines to do. An important problem that comes in here is Hume’s is-ought distinction. On one hand, we have facts about the world, “is” statements; we can even have “is” statements about people’s preferences and meta-preferences, and about the collective state of all normative and meta-ethical views on the planet at a given time. On the other hand, we have “ought,” a normative notion synonymous with “should” that is kind of the basis of morality. And there’s a tension between what assumptions we might need to get morality off the ground, how we should interact with a world of facts and a world of norms, and how they may or may not relate to each other, whether for creating a science of wellbeing or not even doing that. So how do you think about coming up with an appropriate alignment procedure that is dependent on the answer to this distinction?

Iason Gabriel: Yeah, so that’s a fascinating question. So I think that the is-ought distinction is quite fundamental, and it helps us answer one important query, which is whether it’s possible to solve the value alignment question simply through an empirical investigation of people’s existing beliefs and practices. And if you take the is-ought distinction seriously, it suggests that no matter what we can infer from studies of what is already the case, so what people happen to prefer or happen to be doing, we still have a further question, which is: should that perspective be endorsed? Is it actually the right thing to do? And so there’s always this critical gap. It’s a space for moral reflection and moral introspection, and a place in which error can arise. So we might even think that if we studied all the global beliefs of different people and found that they agreed upon certain axioms or moral properties, we could still ask: are they correct about those things? And if we look at historical beliefs, we might think that there was actually a global consensus on moral beliefs or values that turned out to be mistaken.

So I think that these endeavors to kind of synthesize moral beliefs to understand them properly are very, very valuable resources for moral theorizing. It’s hard to think where else we would begin, but ultimately, we do need to ask these questions about value more directly and ask whether we think that the final elucidation of an idea is something that ought to be promoted.

So in sum, it has a number of consequences, but I think one of them is that we do need to maintain a space for normative inquiry and value alignment can’t just be addressed through an empirical social scientific perspective.

Lucas Perry: Right, because one’s own perspective on the is-ought distinction and whether and how it is valid will change how one goes about learning and evolving normative and meta-ethical thinking.

Iason Gabriel: Yeah. Perhaps at this point, an example will be helpful. So, suppose we’re trying to train a virtuous agent that has these characteristics of treating people fairly, demonstrating humility, wisdom, and things of that nature, suppose we can’t specify these upfront and we do need a training set, we need to present the agent with examples of what people believe evidences these characteristics, we still have the normative question of what goes into that data set and how do we decide. So, the evaluative questions get passed on to that. Of course, we’ve seen many examples of data sets being poorly curated and containing bias that then transmutes onto the AI system. We either need to have data that’s curated so that it meets independent moral standards and the AI learns from that data, or we need to have a moral ideal that is freestanding in some sense and that AI can be built to align with.

Lucas Perry: Let’s try and make that even more concrete, because I think this is a really interesting and important problem about why the technical aspect is deeply related to philosophical thinking about this is-ought problem. At the highest level of abstraction, starting with axioms: if we have “is” statements in the form of datasets, and datasets are just information about the world, then the datasets are the “is” statements. We can put whatever “is” statements into a machine, and the machine can take the shape of those values already embedded and codified in the world, in people’s minds, or in our artifacts and culture. And then the “ought” question, as you said, is which information in the world we should use. And to understand what information we should use requires some initial principle, some set of axioms, that bridges the is-ought gap.

So for example, the kind of move that I think Sam Harris tries to make is to lay out this axiom: we should avoid the worst possible misery for everyone. You may or may not agree with that axiom, but that is the starting point for how one might bridge the is-ought gap, to be able to select which data is better than other data, or which data we should load onto AI systems. So I’m curious to know how you think about this very fundamental level of initial axiom or axioms that are meant to bridge this distinction.

Iason Gabriel: I think that when it comes to these questions of value, we could try and build up from very, very minimalist assumptions of the kind that it sounds like Sam Harris is defending. We could also start with richer conceptions of value that seem to have some measure of widespread assent and reflective endorsement. So I think, for example, the idea that human life matters or that sentient life matters, that it has value and hence that suffering is bad, is a really important component of that. I think that conceptions of fairness, of what people deserve in light of that equal moral standing, are also an important part of the moral content of building an aligned AI system. And I would tend to try and be inclusive in terms of the values that we canvass.

So I don’t think that we actually need to take this very defensive posture. I think we can think expansively about the conception and nature of the good that we want to promote and that we can actually have meaningful discussions and debate about that so we can put forward reasons for defending one set of propositions in comparison with another.

Lucas Perry: We can have epistemic humility here, given the history of moral catastrophes and how morality continues to improve and change over time, and given that surely we do not sit at a peak of moral enlightenment in 2020. So given our epistemic humility, we can cast a wide net around many different principles so that we don’t lock ourselves into anything, and can endorse a broad notion of good, which seems safer, but perhaps has some costs of its own in being more permissive of a wide range of moral views that may not be correct.

Iason Gabriel: I think that’s, broadly speaking, correct. We definitely shouldn’t tether artificial intelligence too narrowly to the morality of the present moment, given that we may and probably are making moral mistakes of one kind or another. And I think that this thing that you spoke about, a kind of global conversation about value, is exactly right. I mean, if we take insights from political theory seriously, then the philosopher, John Rawls, suggests that a fundamental element of the present human condition is what he calls the fact of reasonable pluralism, which means that when people are not coerced and when they’re able to deliberate freely, they will come to different conclusions about what ultimately has moral value and how we should characterize ought statements, at least when they apply to our own personal lives.

So if we start from that premise, we can then think about AI as a shared project and ask this question, which is given that we do need values in the equation, that we can’t just do some kind of descriptive enterprise and that, that will tell us what kind of system to build, what kind of arrangement adequately factors in people’s different views and perspectives, and seems like a solution built upon the relevant kind of consensus to value alignment that then allows us to realize a system that can reconcile these different moral perspectives and takes a variety of different values and synthesizes them in a scheme that we would all like.

Lucas Perry: I just feel broadly interested in just introducing a little bit more of the debate and conceptions around the is-ought problem, right? Because there are some people who take it very seriously and other people who try to minimize it or are skeptical of it doing the kind of philosophical work that many people think that it’s doing. For example, Sam Harris is a big skeptic of the kind of work that the is-ought problem is doing. And in this podcast, we’ve had people on who are, for example, realists about consciousness, and there’s just a very interesting broad range of views about value that inform the is-ought problem. If one’s a realist about consciousness and thinks that suffering is the intrinsic valence carrier of disvalue in the universe, and that joy is the intrinsic valence carrier of wellbeing, one can have different views on how that even translates to normative ethics and morality and how one does that, given one’s view on the is-ought problem.

So, for example, if we take that kind of metaphysical view about consciousness seriously, then if we take the is-ought problem seriously then, even though there are actually bad things in the world, like suffering, those things are bad, but that it would still require some kind of axiom to bridge the is-ought distinction, if we take it seriously. So because pain is bad, we ought to avoid it. And that’s interesting and important and a question that is at the core of unifying ethics and all of our endeavors in life. And if you don’t take the is-ought problem seriously, then you can just be like, because I understand the way that the world is, by the very nature of being sentient being and understanding the nature of suffering, there’s no question about the kind of navigation problem that I have. Even in the very long-term, the answer to how one might resolve the is-ought problem would potentially be a way of unifying all of knowledge and endeavor. All the empirical sciences would be unified conceptually with the normative, right? And then there is no more conceptual issues.

So, I think I’m just trying to illustrate the power of this problem and distinction, it seems.

Iason Gabriel: It’s a very interesting set of ideas. To my mind, these kinds of arguments about the intrinsic badness of pain, or kind of naturalistic moral arguments, are very strong ways of arguing against, say, a moral relativist or moral nihilist, but they don’t necessarily circumvent the is-ought distinction. Because, for example, the claim that pain is bad is referring to a normative property. So you can say pain is bad, and therefore it shouldn’t be promoted, but that’s completely compatible with believing that we can’t deduce moral conclusions from purely descriptive premises. So I don’t really believe that the is-ought distinction is a problem. I think that it’s always possible to make arguments about values, and that that’s precisely what we should be doing. And the fact that this needs to be conjoined with empirical data in order to then arrive at sensible judgments and practical reasoning about what ought to be done is a really satisfactory state of affairs.

I think one kind of interesting aspect of the vision you put forward was this idea of a kind of unified moral theory that everyone agrees with. And I guess it does touch upon a number of arguments that I make in the paper, where I juxtapose two slightly stylized descriptions of solutions to the value alignment challenge. The first one is, of course, the approach that I term the true moral theory approach, which holds that we do need a period of prolonged reflection in which we reflect fundamentally on these questions about pain and perhaps other very deep normative questions. And the idea is that by using tools from moral philosophy, eventually, although we haven’t done it yet, we may identify a true moral theory. And then it’s a relatively simple task, well, not simple from a technical point of view, but simple from a normative point of view, of aligning AI, maybe even AGI, with that theory, and we’ve basically solved the value alignment problem.

So in the paper, I argue against that view quite strongly for a number of reasons. The first is that I’m not sure how we would ever know that we’d identified this true moral theory. Of course, many people throughout history have thought that they’ve discovered this thing and often gone on to do profoundly unethical things to other people. And I’m not sure how, even after a prolonged period of time, we would actually have confidence that we had arrived at the really true thing and that we couldn’t still ask the question, am I right?

But even putting that to one side, suppose that I had not just confidence, but justified confidence that I really had stumbled upon the true moral theory and perhaps with the help of AI, I could look at how it plays out in a number of different circumstances, and I realize that it doesn’t lead to these kind of weird, anomalous situations that most existing moral theories point towards, and so I really am confident that it’s a good one, we still have this question of what happens when we need to persuade other people that we’ve found the true moral theory and whether that is a further condition on an acceptable solution to the value alignment problem. And in the paper, I say that it is a further condition that needs to be satisfied because just knowing, well, supposedly having access to justified belief in a true moral theory, doesn’t necessarily give you the right to impose that view upon other people, particularly if you’re building a very powerful technology that has world shaping properties.

And if we return to this idea of reasonable pluralism that I spoke about earlier, essentially, the core claim is that unless we coerce people, we can’t get to a situation where everyone agrees on matters of morality. We could flip it around. It might be that someone already has the true moral theory out there in the world today and that we’re the people who refuse to accept it for different reasons. I think the question then is how we believe other people should be treated by the possessor of the theory, or how we believe that person should treat us.

Now, one view, which in political philosophy is often attributed to Jean-Jacques Rousseau, is that if you have this really good theory, you’re justified in coercing other people to live by it. He says that people should be forced to be free when they’re not willing to accept the truth of the moral theory. Of course, it’s something that has come in for fierce criticism. I mean, my own perspective is that we actually need to try and minimize this challenge of value imposition for powerful technologies, because it becomes a form of domination. So the question is how we can solve the value alignment problem in a way that avoids this challenge of domination. And in that regard, we really do need tools from political philosophy, which, particularly within the liberal tradition, has tried to answer this question of how we can all live together on reasonable terms that preserve everyone’s capacity to flourish, despite the fact that we have variation in what we ultimately believe to be just, true, and right.

Lucas Perry: So to bring things a bit back to where we’re at today and how things are actually going to start changing in the real world as we move forward. What do you view as the kinds of systems that would be, and are subject to something like an alignment procedure? Does this start with systems that we currently have today? Does it start with systems soon in the future? Should it have been done with systems that we already have today, but we failed to do so? What is your perspective on that?

Iason Gabriel: To my mind, the challenge of value alignment is one that exists for the vast majority, if not all, technologies. And it’s one that’s becoming more pronounced as these technologies demonstrate higher levels of complexity and autonomy. So for example, I believe that many existing machine learning systems encounter this challenge quite forcefully, and that we can ask meaningful questions about it. So I think in previous discussion, we may have had this example of a recommendation system come to light. And even if we think of something that seems really quite prosaic, so say a recommendation system for what films to watch or what content to be provided to you, I think the value alignment question actually looms large, because it could be designed to do very different things. On the one hand, we might have a recommendation system that’s geared around your current first order preferences. So it might continuously give you really stimulating, really fun, low quality content that kind of keeps you hooked to the system and with a high level of subjective wellbeing, but perhaps something that isn’t optimal in other regards. Then we can think about other possible goals for alignment.

So we might say that actually these systems should be built to serve your second order desires. Those are desires that, in philosophy, we would say people reflectively endorse; they’re desires about the person you want to be. So if we were to build a recommendation system with that goal in mind, it might be that instead of watching this kind of cheap and cheerful content, I decide that I’d actually like to be quite a highbrow person, so it starts kind of tacitly providing me with more arthouse recommendations. But even that doesn’t exhaust the options. It might be that the system shouldn’t really just be trying to satisfy my preferences, that it should actually be trying to steer me in the direction of knowledge and things that are in my interest to know. So it might try to give me new skills that I need to acquire, might try to recommend, I don’t know, cooking or self improvement programs.

That would be a system that was, I guess, geared toward my own interests. But even that, again, doesn’t give us a complete portfolio of options. Maybe what we want is a morally aligned system that actually enhances our capacity for moral decision making. And then perhaps that would lead us somewhere completely different. So instead of giving us content that we want, it might lead us to content that gets us to engage with challenging moral questions, such as factory farming or climate change. So, value alignment kind of arises quite early on. This is, of course, with the assumption that the recommendation system is geared to promote your interest or wellbeing or preferences or moral sensibility. There’s also the question of whether it’s really promoting your goals and aspirations or someone else’s, and in science and technology studies there’s a big area of value sensitive design, which essentially says that we need to consult people and have these almost democratic discussions early on about the kind of values we want to embody in systems.

And then we design with that goal in mind. So, recommendation systems are one thing. Of course, if we look at public institutions, say a criminal justice system, there we have a lot of public uproar and discussion about the values that would make a system like that fair. And the challenge then is to work out whether there is a technical approximation of these values that satisfactorily realizes them in a way that conduces to some vision of the public good. So in sum, I think that value alignment challenges exist everywhere, and they become more pronounced when these technologies become more autonomous and more powerful. So as they have more profound effects on our lives, the burden of justification in terms of the moral standards that are being met becomes more exacting, and the kind of justification we can give for the design of a technology becomes more important.
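
To illustrate how concretely the alignment target changes what such a system does, here is a rough sketch (an editorial illustration, not from the paper or any real recommender) in which the same scoring architecture encodes first order preferences, second order reflectively endorsed preferences, or the user's interests simply by changing weights. All item attributes, weights, and names are invented.

```python
# The same recommender architecture with three different alignment targets,
# expressed purely through the weighting of (hypothetical) item attributes.

items = [
    {"title": "cat_compilation", "engagement": 0.9, "endorsed": 0.2, "interest": 0.1},
    {"title": "arthouse_film",   "engagement": 0.4, "endorsed": 0.8, "interest": 0.5},
    {"title": "cooking_course",  "engagement": 0.3, "endorsed": 0.6, "interest": 0.9},
]

profiles = {
    "first_order":  {"engagement": 1.0, "endorsed": 0.0, "interest": 0.0},
    "second_order": {"engagement": 0.2, "endorsed": 1.0, "interest": 0.0},
    "interests":    {"engagement": 0.1, "endorsed": 0.3, "interest": 1.0},
}

def recommend(profile: dict) -> str:
    def score(item: dict) -> float:
        return sum(profile[k] * item[k] for k in profile)
    return max(items, key=score)["title"]

for name, profile in profiles.items():
    print(name, "->", recommend(profile))
# first_order  -> cat_compilation
# second_order -> arthouse_film
# interests    -> cooking_course
```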

Lucas Perry: I guess, to bring this back to things that exist today, something like YouTube or Facebook is a very rudimentary, very basic first order preference satisfier. I mean, imagine all of the human life years that have been wasted mindlessly consuming content that’s not actually good for us. Whereas imagine, I guess, some kind of enlightened version of YouTube that knows enough about what is good, about yourself, and about what you would reflectively and ideally endorse, and the kind of person that you wish you could be and would be if only you knew better and how to get there. Consider the difference between those two kinds of system: the first is just giving you all the best cat videos in the world, and the second is turning you into the person that you always wished you could have been. I think this clearly demonstrates that even systems that seem mundane could be serving us in much deeper ways and at much deeper levels, and that even when they superficially serve us, they may be doing harm.

Iason Gabriel: Yeah, I think that’s a really profound observation. I mean, when we really look at the full scope of value or the full picture of the kinds of values we could seek to realize when designing technologies and incorporating them into our lives, often there’s a radically expansive picture that emerges. And this touches upon a kind of taxonomic distinction that I introduce in the paper between minimalist and maximalist conceptions of value alignment. So when we think about AI alignment questions, the minimalist says we have to avoid very bad outcomes. So it’s important to build safe systems. And then we just need them to reside within some space of value that isn’t extremely negative and could take a number of different constellations. Whereas the maximalist says, “Well, let’s actually try and design the very best version of these technologies from a moral point of view, from a human point of view.”

And they say that even if we design safe technologies, we could still be leaving a lot of value out there on the table. So a technology could be safe, but still not that good for you or that good for the world. And let’s aim to populate that space with more positive and richer visions of the future. And then try to realize those through the technologies that we’re building. As we want to realize richer visions of human flourishing, it becomes more important that it isn’t just a personal goal or vision, but it’s one that is collectively endorsed, has been reflected upon and is justifiable from a variety of different points of view.

Lucas Perry: Right. And I guess it’s also interesting and valuable to reflect briefly on how there is already, in each society, a place where we draw the line at value imposition. We have these principles which we’ve agreed upon broadly, but we’re not going to let Ted Bundy do what Ted Bundy wants to do.

Iason Gabriel: That’s exactly right. So we have hard constraints, some of which are set in law, and clearly, insofar as these are just laws, those are constraints that AI systems need to respect. There’s also a huge possible space of better outcomes that are left open once we look at where moral constraints are placed and where they reside. I think that the Ted Bundy example is interesting because it also shows that we need to discount the preferences and desires of certain people.

One vision of AI alignment says that it’s basically a global preference aggregation system that we need, but in reality, there’s a lot of preferences that just shouldn’t be counted in the first place because they’re unethical or they’re misinformed. So again, that kind of to my mind pushes us in this direction of a conversation about value itself. And once we know what the principle basis for alignment is, we can then adjudicate properly cases like that and work out what a kind of valid input for an aligned system is and what things we need to discount if we want to realize good moral outcomes.

Lucas Perry: I’m not going to try and pin you down too hard on that, because there’s the tension here, of course, between the importance of liberalism, of not coercing value judgments on anyone, and also recognizing, “Well, we actually have to do it in some places.” And that line is a scary one to move in either direction. So, I want to explore more now the different understandings of what it is that we’re trying to align AI systems to. Broadly, people, and I, use a lot of different words here without perhaps being super specific about what we mean: people talk about values and intentions and idealized preferences and things of this nature. So can you be a little bit more specific here about what you take to be the goal of AI alignment, the goal being: what is it that we’re trying to align systems to?

Iason Gabriel: Yeah, absolutely. So we’ve touched upon some of these questions already tacitly in the preceding discussion. Of course, in the paper, I argue that when we talk about value alignment, this idea of value is often a placeholder for quite different ideas, as you said. And I actually present a taxonomy of options that I can take us through in a fairly thrifty way. So, I think the starting point for creating aligned AI systems is this idea that we want AI that’s able to follow our instructions, but that has a number of shortcomings, which Stuart Russell and others have documented, and which tend to center around this challenge of excessive literalism. If an AI system literally does what we ask it to, without an understanding of context, side constraints, and nuance, often this will lead to problematic outcomes, with the story of King Midas being the classic cautionary tale: he wishes that everything he touches turns to gold, everything does turn to gold, and then you have a disaster of one kind or another.

So of course, instructions are not sufficient. What you really want is AI that’s aligned with the underlying intention. I think that often in the podcast, people have talked about intention alignment as an important goal for AI systems. And I think it is precisely right to dedicate a lot of technical effort to closing the gap between a kind of idiot savant AI, which just follows instructions in this dumb way, and the kind of more nuanced, intelligent AI that can follow an intention. But we might wonder whether aligning AI with an individual or collective intention is actually sufficient to get us to the really good outcomes, the kind of maximalist outcomes that I’m talking about. And I think that there are a number of reasons why that might not be the case. So of course, to start with, just because an AI can follow an intention doesn’t say anything about the quality of the intention that’s being followed.

We can form intentions on an individual or collective basis to do all kinds of things, some of which might be incredibly foolish or malicious, some of which might be self-harming, some of which might be unethical. And we’ve got to ask this question of whether we want AI to follow us down that path when we come up with schemes of that kind, and there are various ways we might try to address that bundle of problems. I think intentions are also problematic from a kind of technical and phenomenological perspective, because they tend to be incomplete. So if we look at what an intention is, it’s roughly speaking a kind of partially filled out plan of action that commits us to some end. And if we imagine that AI systems are very powerful, they may encounter situations or dilemmas or option sets that are in this space of uncertainty, where it’s just not clear what the original intention was, and they might need to make the right kind of decision by default.

So they might need some intuitive understanding of what the right thing to do is. So my intuition is that we do want AI systems that have some kind of richer understanding of the goals that we would want to realize in whole. So I think that we do need to look at other options. It is also possible that we form the intention for the AI to do something that explicitly requires an understanding of morality. We may ask it to do things like promote the greatest good in a way that is fundamentally ethical. Then it needs to step into this other terrain of understanding preferences, interests, and values. I think we need to explore that terrain for one reason or another. Of course, one thing that people talk about is learning from revealed preferences. So perhaps in addition to the things that we directly communicate, the AI could observe our behavior and make inferences about what we want that help fill in the gaps.

So maybe it could watch you in your public life, hopefully not your private life, and make these inferences that actually it should create this very good thing. So that is in the domain of trying to learn from things that it observes. But I think that preferences, at least revealed preferences, are also quite a worrying data point for AI alignment, because they contain many of the same weaknesses and shortcomings that we can ascribe to individual intentions.

Lucas Perry: What is a revealed intention again?

Iason Gabriel: Sorry, revealed preferences are preferences that are revealed through your behavior. So I observed you doing A or B. And from that choice, I conclude that you have a deeper preference for the thing that you choose. And the question is, if we just watch people, can we learn all the background information we need to create ethical outcomes?

Lucas Perry: Yeah. Absolutely not.

Iason Gabriel: Yeah. Exactly. As your Ted Bundy example nicely illustrated, not only is it very hard to actually get useful information from observing people about what they want, but what they want can often be the wrong kind of thing for them or for other people.
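
As a concrete illustration of what revealed-preference inference amounts to, here is a minimal sketch (a toy editorial example, not anything from the paper): a preference ordering is inferred purely from observed choices, which is exactly why it inherits whatever is misinformed or self-harming in the behavior itself. The observations and the naive scoring rule are invented.

```python
# Naive revealed-preference inference: whatever gets chosen more is "preferred",
# regardless of whether the choices were informed or good for the person.

from collections import Counter

observed_choices = [
    ("salad", "doughnut", "doughnut"),   # (option_a, option_b, what_was_chosen)
    ("salad", "doughnut", "doughnut"),
    ("gym", "couch", "couch"),
]

wins = Counter(chosen for _, _, chosen in observed_choices)
losses = Counter(a if chosen == b else b for a, b, chosen in observed_choices)

inferred = {option: wins[option] - losses[option]
            for option in set(wins) | set(losses)}
print(inferred)  # e.g. {'doughnut': 2, 'couch': 1, 'salad': -2, 'gym': -1}
```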

Lucas Perry: Yeah. I have to hire people to spend some hours with me every week to tell me from the outside, how I may be acting in ways that are misinformed or self-harming. So instead of revealed preferences, we need something like rational or informed preferences, which is something you get through therapy or counseling or something like that.

Iason Gabriel: Well, that’s an interesting perspective. I guess there are a lot of different theories about how we get to ideal preferences, but the idea is that we don’t want to just respond to what people are in practice doing. We want to give them the sort of thing that they would aspire to if they were rational and informed, at the very least. So not things that are just a result of mistaken reasoning or poor quality information. And then there’s this very interesting philosophical and psychological question about what the content of those ideal preferences is, and particularly what happens when you think about people being properly rational. So, to return to David Hume, to whom the is-ought distinction is often attributed, he has the conjecture that someone can be fully informed and rational and still desire pretty much anything at the end of the day; they could want something hugely destructive for themselves or other people.

And in fact, a lot of moral philosophers believe that rationality is not just a process of joining up beliefs and value statements in a certain fashion, but it also encompasses a substantive capacity to evaluate ends. So, obviously Kantians have a theory about rationality ultimately requiring you to reflect on your ends and ask if they universalize in a positive way. But the thing is that’s highly, highly contested. So I think ultimately if we say we want to align AI with people’s ideal and rational preferences, it leads us into this question of what rationality really means. And we don’t necessarily get the kind of answers that we want to get to.

Lucas Perry: Yeah, that’s a really interesting and important thing. I’ve never actually considered that. For example, someone who might be a moral anti-realist would probably be more partial to the view that rationality is just about linking up beliefs and epistemics and decision theory with goals, and goals are something that you’re just given and embedded with, and that there isn’t some correct evaluative procedure for analyzing goals beyond whatever meta-preferences you’ve already inherited. Whereas a realist might lean towards the other view, where rationality is about linking up beliefs and ends, but also involves perhaps some more concrete, standard method for evaluating which ends are good ends. Is that the way you view it?

Iason Gabriel: Yeah, I think that’s a very nice summary. The people who believe in substantive rationality tend to be people with a more realist moral disposition. If you’re profoundly anti-realist, you basically think that you have to stop talking in the currency of reasons. So you can’t tell people they have a reason not to act in a kind of unpleasant way to each other, or even to do really heinous things. You have to say something different to them, like, “Wouldn’t it be nice if we could realize this positive state of affairs?” And I think ultimately we can get to views about value alignment that satisfy these two different groups. We can create aspirations that are well-reasoned from different points of view and also create scenarios that meet the kind of “Wouldn’t it be nice” criteria. But I think it isn’t going to happen if we just double down on this question of whether rationality ultimately leads to a single set of ends or a plurality of ends, or no consensus whatsoever.

Lucas Perry: All right. That’s quite interesting. Not only do we have difficult and interesting philosophical ground in ethics, but also in rationality and how these are interrelated.

Iason Gabriel: Absolutely. I think they’re very closely related. So actually the problems we encounter in one domain, we also encounter in the other, and I’d say in my kind of lexicon, they all fall within this question of practical rationality and practical reason. So that’s deliberating about what we ought to do, either because of explicitly moral considerations or a variety of other things that we factor into judgements of that kind.

Lucas Perry: All right. Two more on our list here to hit are interests and values.

Iason Gabriel: So, I think there are one or two more things we could say about that. If we think that one of the challenges with ideal preferences is that they lead us into this heavily contested space about what rationality truly requires, we might think that a conception of human interests does significantly better. So if we think about AI being designed to promote human interests or wellbeing or flourishing, I would suggest that as a matter of empirical fact, there’s significantly less disagreement about what that entails. So if we look at, say, the capability-based approach that Amartya Sen and Martha Nussbaum have developed, it essentially says that there are a number of key goods and aspects of human flourishing that the vast majority of people believe conduce to a good life. And that actually has some intercultural value and affirmation. So imagine we designed AI that bore in mind this goal of enhancing general human capabilities.

So, human freedom, physical security, emotional security, capacity: that looks like an AI that is, roughly speaking, both getting us into the space of something that looks like it’s unlocking real value and also not bogged down in a huge amount of metaphysical contention. I suggest that aligning AI with human interests or wellbeing is a good proximate goal when it comes to value alignment. But even then I think that there are some important things that are missing and that can only actually be captured if we return to the idea of value itself.

So by this point, it looks like we have almost arrived at a kind of utilitarian AI via the backdoor. I mean, of course utility is a subjective mental state and isn’t necessarily the same as someone’s interest or their capacity to lead a flourishing life. But it looks like we have an AI that’s geared around optimizing some notion of human wellbeing. And the question is what might be missing there or what might go wrong. And I think there are some things that that view of value alignment still struggles to factor in. The welfare of nonhuman animals is something that’s missing from this wellbeing-centered perspective on alignment.

Lucas Perry: That’s why we might just want to make it wellbeing for sentient creatures.

Iason Gabriel: Exactly, and I believe that this is a valuable enterprise, so we can expand the circle. So we say it’s the wellbeing of sentient creatures. And then we have the question about, what about future generations? Does their wellbeing count? And we might think that it does if we follow Toby Ord or in fact, most conventional thinking, we do think that the welfare of future generations has intrinsic value. So we might say, “Well, we want to promote wellbeing of sentient creatures over time with some appropriate weighting to account for time.”

And that’s actually starting to take us into a richer space of value. So we have wellbeing, but we also have a theory about how to do intertemporal comparisons. We might also think that it matters how wellbeing or welfare is distributed. That it isn’t just a maximization question, but that we also have to be interested in equity or distribution because we think it is intrinsically important. So we might think it has to be done in a manner that’s fair. Additionally, we might think that things like the natural world have intrinsic value that we want to factor in. And so the point, which will by now be familiar from our earlier discussion, is that you actually have to get to that question of what values we want to align the system with, because values and the principles that derive from them can capture everything that is seemingly important.
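(To make the choices Iason lists here concrete, the following is a minimal Python sketch with purely illustrative numbers; the discount rate and equity exponent are assumptions for the example, not anything proposed in the conversation. It shows how time-weighting and concern for distribution appear as explicit parameters in an aggregation rule, each of which encodes a value judgement.)

```python
# Toy sketch: aggregating wellbeing across individuals and generations.
# Every number here is hypothetical; the point is that the discount rate
# and the equity exponent are value judgements, not technical details.

def aggregate_welfare(generations, discount=0.99, equity_power=0.5):
    """
    generations: list of lists; generations[t][i] is the wellbeing of
    individual i in generation t (higher is better, values > 0).
    discount: how much less a later generation counts (1.0 = no discounting).
    equity_power: < 1 gives extra weight to the worse off (prioritarian);
                  1.0 reduces to a plain utilitarian sum.
    """
    total = 0.0
    for t, population in enumerate(generations):
        weight = discount ** t
        total += weight * sum(w ** equity_power for w in population)
    return total

# Two hypothetical futures with the same total wellbeing but different spread:
equal_future   = [[5, 5, 5], [5, 5, 5]]
unequal_future = [[9, 5, 1], [9, 5, 1]]

# With equity_power < 1 the equal future scores higher; with equity_power = 1
# the two futures are scored identically.
print(aggregate_welfare(equal_future))
print(aggregate_welfare(unequal_future))
```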

Lucas Perry: Right. And so, for example, within the effective altruism community and within moral philosophy recently, the way in which moral progress has been made is by debiasing human moral thought and ethics of spatial and temporal bias. So Peter Singer has the children drowning in a shallow pond argument. It illustrates how there are people dying and children dying all over the world in situations where we could cheaply intervene to save them, as if they were drowning in a shallow pond and you only needed to take a couple of steps and just pull them out, except we don’t. And we don’t because they’re far away. And I would like to say that essentially everyone finds this compelling: where you are in space doesn’t change how much your suffering matters. If you are suffering, then all else being equal, we should intervene to alleviate that suffering when it’s reasonable to do so.

So space doesn’t matter for ethics. Likewise, I hope, and I think, that we’re moving in the right direction if time also doesn’t matter, while being mindful that we also have to introduce things like uncertainty. We don’t know what the future will be like, but this principle of caring about the wellbeing of sentient creatures in general is, I think, essential and core to whatever list of principles we’ll want for bridging the is-ought distinction, because it takes away spatial bias: where you are in space doesn’t matter, it just matters that you’re a sentient being, and it doesn’t matter when you are as a sentient being. It also doesn’t matter what kind of sentient being you are, because the thing we care about is sentience. So then the moral circle has expanded across species. It’s expanded across time. It’s expanded across space. It includes aliens and all possible minds that we could encounter now or in the future. We have to get that one in, I think, for making a good future with AI.

Iason Gabriel: That’s a picture that I strongly identify with on a personal level, this idea of the expanding moral circle of sensibilities. And I think from a substantive point of view, you’re probably right that that is a lot of the content that we would want to put into an aligned AI system. I think that one interesting thing to note is that a lot of these views are actually empirically fairly controversial. So look at an interesting study, the moral machine experiment, where I believe several million people ultimately played this experiment online, deciding which trade-offs an AV, an autonomous vehicle, should make in different situations: whether it should crash into one person or five people, a rich person or a poor person. Pretty much everyone agreed that it should kill fewer people when that was on the table. But I believe that in many parts of the world, there was also a belief that the lives of affluent people mattered more than the lives of those in poverty.

And so if you were just to reason from their first-order moral beliefs, you would bake that bias into an AI system, which seems deeply problematic. And I think it actually puts pressure on this question: we’ve already said we don’t want to just align AI with existing moral preferences, and we’ve also said that we can’t just declare a moral theory to be true and impose it on other people. So are there other options which move us in the direction of these kinds of moral beliefs that seem to be deeply justified, but also avoid the challenge of value imposition? And how far do they get if we try to move forward, not just as individuals examining the kind of expanding moral circle, but as a community that’s trying to progressively endogenize these ideas and come up with moral principles that we can all live by?

We might not get as far as if we were going at it alone, but I think that there are some solutions that are kind of in that space. And those are the ones I’m interested in exploring. I mean, common sense morality, understood as the conventional morality that most people endorse, is I would say deeply flawed in a number of regards, including with regard to global poverty and things of that nature. And that’s really unfortunate, given that we probably also don’t want to force people to live by more enlightened beliefs, which they don’t endorse or can’t understand. So I think that the interesting question is how we meet this demand for a respect for pluralism and also avoid getting stuck in the morass of common sense morality, which has these prejudicial beliefs that will probably, with the passage of time, come to be regarded quite unfortunately by future generations.

And I think that taking this demand for non-domination or democratic support seriously means not just running far into the future, or in a direction that we believe represents the future, but also doing a lot of other things: trying to have a democratic discourse where we use these reasons to justify certain policies that other people then reflectively endorse, so that we move the project forwards in a way that meets both desiderata. And in this paper, I try to map out different solutions that meet both of these criteria: respecting people’s pluralistic beliefs while also moving us towards more genuinely morally aligned outcomes.

Lucas Perry: So now the last question that I want to ask you here on the goal of AI alignment is: do you view a needs-based conception of human wellbeing as a sub-category of interest-based value alignment? People have come up with different conceptions of human needs. People are generally familiar with Maslow’s hierarchy of needs. And I mean, as you go up the hierarchy, it becomes more and more contentious, but everyone needs food and shelter and safety, and then you need community and meaning and spirituality and things of that nature. So how do you view or fit in a needs-based conception, given that some needs are obviously undeniable relative to others?

Iason Gabriel: Broadly speaking, a needs-based conception of wellbeing is in that space we already touched upon. So the capabilities-based approach and the needs-based approach are quite similar. But I think that what you’re saying about needs potentially points to a solution to this kind of dilemma that we’ve been talking about. If we’re going to ask this question of what it means to create principles for AI alignment that treat people fairly despite their different views, one approach we might take is to look for commonalities that also seem to have moral robustness or substance to them. So within the parlance of political philosophy, we’d call this an overlapping consensus approach to the problem of political and moral decision making. I think that that’s a project that’s well worth countenancing. So we might say there’s a plurality of global beliefs and cultures. What is it that these cultures coalesce around? And I think that it’s likely to be something along the lines of the argument that you just put forward: that people are vulnerable in virtue of how we’re constituted, that we have a kind of fragility, and that we need protection, both against the environment and against certain forms of harm, particularly state-based violence. And that this is a kind of moral bedrock, or what the philosopher Henry Shue calls “a moral minimum,” that receives intercultural endorsement. So actually the idea of human needs is very, very closely tied to the idea of human rights. The idea is that the need is fundamental, and in virtue of your moral standing (the normative claim) and your need (the empirical claim), you have a right to enjoy a certain good and to be secure in the knowledge that you’ll enjoy that thing.

So I think the idea of building a kind of human rights-based AI, based upon this intercultural consensus, is pretty promising. In some regards, human rights as they’ve been historically thought about are not super easy to turn into a theory of AI alignment, because they are historically thought of as guarantees that states have to give their citizens in order to be legitimate. And it isn’t entirely clear what it means to have a human rights-based technology, but I think that this is a really productive area to work in, and I would definitely like to try and populate that ground.

You might also think that the consensus, or the emerging consensus, around values that need to be built into AI systems, such as fairness and explainability, potentially portends the emergence of this kind of intercultural consensus. Although I guess at that point, we have to be really mindful of the voices that are at the table and who’s had an opportunity to speak. So although there does appear to be some convergence around principles of beneficence and things like that, it’s also true that this isn’t a global conversation in which everyone is represented, and it would be easy to prematurely rush to the conclusion that we know what values to pursue, when we’re really just reiterating some kind of very heavily Western-centric, affluent view of ethics that doesn’t have real intercultural democratic viability.
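(One hedged way to picture what a “human rights-based technology” could mean in practice is the difference between treating rights as hard constraints that may never be traded away and treating them as weighted terms in an objective. The Python sketch below is purely illustrative; the candidate policies, threshold, and weight are hypothetical assumptions, not anything proposed in the conversation.)

```python
# Toy sketch: "rights as constraints" versus "rights as weights".
# Numbers are hypothetical; the only point is that the two designs can
# choose differently when a rights violation buys extra benefit.

candidate_policies = [
    {"name": "A", "benefit": 10.0, "rights_floor": 1.0},   # respects the floor
    {"name": "B", "benefit": 14.0, "rights_floor": 0.6},   # violates the floor
]

RIGHTS_THRESHOLD = 0.9   # minimum acceptable protection (assumed)
RIGHTS_WEIGHT = 5.0      # how much the trade-off version values rights (assumed)

def choose_with_constraint(policies):
    """Rights as a constraint: rule out violators, then maximize benefit."""
    feasible = [p for p in policies if p["rights_floor"] >= RIGHTS_THRESHOLD]
    return max(feasible, key=lambda p: p["benefit"])

def choose_with_weights(policies):
    """Rights as a weighted term: benefit can outweigh a violation."""
    return max(policies, key=lambda p: p["benefit"] + RIGHTS_WEIGHT * p["rights_floor"])

print(choose_with_constraint(candidate_policies)["name"])  # "A": B is ruled out
print(choose_with_weights(candidate_policies)["name"])     # "B": benefit buys the violation
```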

Lucas Perry: All right, now it’s also interesting and important to consider here the differences and importance of single agent and multi-agent alignment scenarios. For example, you can imagine entertaining the question of, “How is it that I would build a system that would be able to align with my values? One agent being the AI system, and one person, and how is it that I get the system to do what I want it to do?” And then the multi-agent alignment scenario considers, “How do I get one agent to align and serve to many different people’s interests and wellbeing and desires, and preferences, and needs? And then also, how do we get systems to act and behave when there are many other systems trying to serve and align to many other different people’s needs? And how is it that all of these systems may or may not collaborate with all of the other AI systems, and may or may not collaborate with all of the other human beings, when all the human beings may have conflicting preferences and needs?” How is it that we do for example, intertheoretic comparisons of value and needs? So what’s the difference, and importance between single agent and multi-agent alignment scenarios?

Iason Gabriel: I think that the difference is best understood in terms of how expansive the goal of alignment has to be. So if we’re just thinking about a single person and a single agent, it’s okay to approach the value alignment challenge through a slightly solipsistic lens. In fact, you know, if it was just one person and one agent, it’s not clear that morality really enters the picture, unless there are other people or other sentient creatures whom our actions can affect. So with one person and one agent, the challenge is primarily correlation with the person’s desires, aims, and intentions. Potentially, there’s still a question of whether the AI serves their interests rather than, you know, these more volitional states that come to mind. When we think about situations in which many people are affected, then it becomes kind of remiss not to think about interpersonal comparisons and the kind of richer conceptions that we’ve been talking about.

Now, I mentioned earlier that there is a view that there will always be a human body that synthesizes preferences and provides moral instructions for AI. We can imagine democratic approaches to value alignment, where human beings assemble, maybe in national parliaments, maybe in global fora, and legislate principles that AI is then designed in accordance with. I think that’s actually a very promising approach. You know, you would want it to be informed by moral reflection and people offering different kinds of moral reasons that support one approach rather than the other, but that seems to be important for multi-person situations and is probably actually a necessary condition for powerful forms of AI. Because when AI has a profound effect on people’s lives, these questions of legitimacy also start to emerge. So not only is it doing the right thing, but is it doing the sort of thing that people would consent to, and is it doing the sort of thing that people actually have consented to? And I think that when AI is used in certain fora, then these questions of legitimacy come to the top. There’s a bundle of different things in that space.
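(As a toy illustration of why moving from one person to many is not just “more of the same”: in the minimal Python sketch below, with hypothetical options and utilities, three standard aggregation rules rank the same options differently, so even the choice of rule is itself a value judgement of the kind a legitimate process would need to settle.)

```python
# Toy sketch: the same three options ranked by three aggregation rules.
# Utilities are hypothetical; the point is that moving from one person to
# many forces a choice of rule, and the rules can disagree.

import math

options = {
    "X": [9, 9, 1],   # great for two people, bad for one
    "Y": [5, 5, 5],   # equal for everyone
    "Z": [7, 6, 4],   # a compromise
}

rules = {
    "utilitarian (sum)":  lambda u: sum(u),
    "maximin (worst off)": lambda u: min(u),
    "nash (product)":     lambda u: math.prod(u),
}

for name, rule in rules.items():
    best = max(options, key=lambda o: rule(options[o]))
    print(f"{name}: {best}")
# The utilitarian sum picks X, maximin picks Y, and the Nash product picks Z.
```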

Lucas Perry: Yeah. I mean, it seems like a really, really hard problem. When you talk about creating some kind of national body, and I think you said international fora, do you wonder whether some of these vehicles might be overly idealistic, given what may happen in a world where there are national actors competing and capitalism driving things forward relentlessly? This problem of multi-agent alignment seems very important and difficult, and there are forces pushing things such that it’s less likely that it happens.

Iason Gabriel: When you talk about multi-agent alignment, are you talking about the alignment of an ecosystem that contains multiple AI agents, or are you talking about how we align an AI agent with the interests and ideas of multiple parties? So, many humans, for example?

Lucas Perry: I’m interested and curious about both.

Iason Gabriel: I think there’s different considerations that arise for both sets of questions, but there are also some things that we can speak to that pertain to both of them.

Lucas Perry: Do they both count as multi-agent alignment scenarios in your understanding of the definition?

Iason Gabriel: From a technical point of view, it makes perfect sense to describe them both in that way. I guess when I’ve been thinking about it, curiously, I’ve been thinking of multi-agent alignment as an agent that has multiple parties that it wants to satisfy. But when we look at machine learning research, “multi-agent” usually means many AI agents running around in a single environment. So I don’t see any kind of language-based reason to opt for one rather than the other. With regards to this question of idealization and real-world practice, I think it’s an extremely interesting area. And the thing I would say is this is almost one of those occasions where potentially the is-ought distinction comes to our rescue. So the question is, “Does the fact that the real world is a difficult place, affected by divergent interests, mean that we should level down our ideals and conceptions about what really good and valuable AI would look like?”

And there are some people who have what we term “practice-dependent” views of ethics, who say, “Absolutely we should. We should adjust our conception of what the ideal is.” But as you’ll probably be able to tell by now, I hold a different perspective in general. I don’t think it is problematic to have big ideals and rich visions of how value can be unlocked, and that partly ties into the reasons that we spoke about for thinking that the technical and the normative are interconnected. So if we preemptively level down, we’ll probably design systems that are less good than they could be. And when we think about a design process spanning decades, we really want that kind of ultimate goal, the shining star of alignment, to be something that’s quite bright and can steer our efforts towards it. If anything, I would be slightly worried that because these human parliaments and international institutions are so driven by real-world politics, they might not give us the most fully actualized set of ideal aspirations to aim for.

And that’s why philosophers like, of course, John Rawls actually propose that we need to think about these questions from a hypothetical point of view. So we need to ask, “What would we choose if we weren’t living in a world where we knew how to leverage our own interests?” And that’s how we identify the real ideal that is acceptable to people regardless of where they’re located, and that can then be used to steer non-ideal theory, or the kind of actual practice, in the right direction.

Lucas Perry: So if we have an organization that is trying its best to create aligned and beneficial AGI systems, reasoning about what principles we should embed in it from behind Rawls’ Veil of Ignorance, you’re saying, would have hopefully the same practical implications as if we had a functioning international body for coming up with those principles in the first place.

Iason Gabriel: Possibly. I mean, I’d like to think that ideal deliberation would lead them in the direction of impartial principles for AI. It’s not clear whether that is the case. I mean, it seems that at its very best, international politics has led us in the direction of a kind of human rights doctrine that both accords individuals protection regardless of where they live and defends the strong claim that they have a right to subsistence and other forms of flourishing. Using the Veil of Ignorance experiment for AI, I think, might even give us more than that, even if a real-world parliament never got there. For those of you who are not familiar with this, the philosopher John Rawls says that when it comes to choosing principles for a just society, what we need to do is create a situation in which people don’t know where they are in that society, or what their particular interest is.

So they have to imagine that they’re from behind the Veil of Ignorance. They select principles for that society that they think will be fair regardless of where they end up, and then having done that process and identified principles of justice for the society, he actually holds out the aspiration that people will reflectively endorse them even once the veil has been removed. So they’ll say, “Yes, in that situation, I was reasoning in a fair way that was nonprejudicial. And these are principles that I identified there that continue to have value in the real world.” And we can say what would happen if people are asked to choose principles for artificial intelligence from behind a veil of ignorance where they didn’t know whether they were going to be rich or poor, Christian, utilitarian, Kantian, or something else.

And I think there, some of the kind of common sense material would be surfaced; so people would obviously want to build safe AI systems. I imagine that this idea of preserving human autonomy and control would also register, but for some forms of AI, I think distributive considerations would come into play. So they might start to think about how the benefits and burdens of these technologies are distributed and how those questions play out on a global basis. They might say that ultimately, a value-aligned AI is one that has fair distributive impacts on a global basis and, if you follow Rawls, that it works to the advantage of the least well off people.

That’s a very substantive conception of value alignment, which may or may not be the final outcome of ideal international deliberation. Maybe the international community will get to global justice eventually, or maybe it’s just too thoroughly affected by nationalist interests and other kinds of, to my mind, distortionary effects that mean it doesn’t quite get there. But I think that this is definitely the space that we want the debate to be taking place in. And actually, there has been real progress in identifying collectively endorsed principles for AI, which gives me hope for the future. Not only that we’ll get good ideals, but that people might agree to them, that they might get democratic endorsement, and that they might be actionable and the sort of thing that can guide real-world AI design.
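(To make the structure of the Veil of Ignorance thought experiment a little more tangible, here is a minimal Python sketch with hypothetical payoffs. It contrasts ranking candidate principles by their average outcome with ranking them by their worst-case position, the maximin reading of reasoning behind the veil.)

```python
# Toy sketch of reasoning "behind the veil": each candidate principle yields
# a set of positions you might end up occupying, and you don't know which.
# Payoff numbers are hypothetical.

principles = {
    "winner-takes-most": [30, 3, 2, 1],   # high average, terrible worst case
    "guaranteed floor":  [9, 8, 7, 6],    # lower average, decent worst case
}

def expected_value(payoffs):
    return sum(payoffs) / len(payoffs)

def maximin_value(payoffs):
    return min(payoffs)

for name, payoffs in principles.items():
    print(name, "average:", expected_value(payoffs), "worst case:", maximin_value(payoffs))

# The average favours "winner-takes-most" (9.0 vs 7.5), while maximin favours
# "guaranteed floor" (6 vs 1), echoing the claim that reasoning behind the
# veil pushes towards principles that protect the least well off.
```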

Lucas Perry: Can you add a little bit more clarity on the philosophical questions and issues which single- and multi-agent alignment scenarios supervene on? How do you do intertheoretic comparisons of value if people disagree on normative or meta-ethical beliefs, or disagree on foundational axiomatic principles for bridging the is-ought gap? How is it that systems deal with that kind of disagreement?

Iason Gabriel: I’m hopeful that the three pictures that I outlined so far of the overlapping consensus between different moral beliefs, of democratic debate over a constitution for AI, and of selection of principles from behind the Veil of Ignorance, are all approaches that carry some traction in that regard. So they try to take seriously the fact of real world pluralism, but they also, through different processes, tend to tap towards principles that are compatible with a variety of different perspectives. Although I would say, I do feel like there’s a question about this multi agent thing that may still not be completely clear in my mind, and it may come back to those earlier questions about definition. So in a one person, one agent scenario, you don’t have this question of what to do with pluralism, and you can probably go for a more simple one shot solution, which is align it with the person’s interest, beliefs, moral beliefs, intentions, or something like that. But if you’re interested in this question of real world politics for real world AI systems where a plurality of people are affected, we definitely need these other kinds of principles that have a much richer set of properties and endorsements.

Lucas Perry: All right, there’s Rawls’ Veil of Ignorance. There’s the principle of non-domination, and then there’s the democratic process?

Iason Gabriel: Non-domination is a criterion that any scheme for multi-agent value alignment needs to meet. And then we can ask the question, “What sort of scheme would meet this requirement of non-domination?” And there we have the overlapping consensus with human rights, we have a scheme of democratic debate leading to principles for an AI constitution, and we have the Veil of Ignorance, all ideas that we basically find within political theory that could help us meet that condition.

Lucas Perry: All right, so we have spoken at some length then about principles and identifying principles, this goes back to our conversation about the is-ought distinction, and these are principles that we need to identify for setting up an ethical alignment procedure. You mentioned this earlier, when we were talking about this, this distinction between the one true moral theory approach to AI alignment, in contrast to coming up with a procedure for AI alignment that would be broadly endorsed by many people, and would respect the principle of non domination, and would take into account pluralism. Can you unpack this distinction more, and the importance of it?

Iason Gabriel: Yeah, absolutely. So I think that the kind of true moral theory approach, although it is a kind of stylized idea of what an approach to value alignment might look like, is the sort of thing that could be undertaken just by a single person who is designing the technology or a small group of people, perhaps moral philosophers who think that they have really great expertise in this area. And then they identify the chosen principle and run with it.

The big claim is that that isn’t really a satisfactory way to think about design and values in a pluralistic world where many people will be affected. And of course, many people who’ve gone off on that kind of enterprise have made serious mistakes that were very costly for humanity and for people who were affected by their actions. So the political approach to value alignment paints a fundamentally different perspective and says it isn’t really about one person or one group running ahead and thinking that they’ve done all the hard work; it’s about working out what we can all agree upon that looks like a reasonable set of moral principles or coordinates to build powerful technologies around. And then, once we have this process in place that outputs the right kind of agreement, the task is given back to technologists: these are the kinds of parameters that a fair process of deliberation has outputted, and this is what we have the authority to encode in machines, whether it’s, say, human rights or a conception of justice, or some other widely agreed upon values.

Lucas Perry: There are principles that you’re really interested in satisfying, like respecting pluralism and respecting a principle of non-domination, and the One True Moral Theory approach risks violating those other principles. Are you not taking a stance on whether there is a One True Moral Theory? You’re just willing to set that question aside and say, “Because it’s so essential to a thriving civilization that we don’t do moral imposition on one another, coming up with a broadly endorsed theory is just absolutely the way to go, whether or not there is such a thing as a One True Moral Theory”? Does that capture your view?

Iason Gabriel: Yeah. So to some extent, I’m trying to make an argument that will look like something we should affirm, regardless of the metaethical stance that we wish to take. Of course, there are some views about morality that actually say that non-domination is a really important principle, or that human rights are fundamental. So someone might look at these proposals, and from the comprehensive moral perspective, they would say, “This is actually the morally best way to do value alignment, and it involves dialogue, discussion, mutual understanding, and agreement.” However, you don’t need to believe that in order to think that this is a good way to go. If you look at the writing of someone like Joshua Greene, he talks about this problem we encounter called the “tragedy of common sense morality.” A lot of people have fairly decent moral beliefs, but when they differ, it ends up in violence, and they end up fighting. And you have a hugely negative moral externality that arises just because people weren’t able to enter this other mode of theorizing, where they said, “Look, we’re part of a collective project, let’s agree to some higher level terms that we can all live by.” So from that point of view, it looks prudent to think about value alignment as a pluralistic enterprise.

That’s an approach that many people have taken with regards to the justification of the institution of the state, and the things that we believe it should protect, and affirm, and uphold. And then as I alluded to earlier, I think that actually, even for some of these anti-realists, this idea of inclusive deliberation, and even the idea of human rights, look like quite good candidates for the kind of “Wouldn’t it be nice?” criterion. So to return to Richard Routley, who is kind of the arch moral skeptic, he does ultimately really want us to live in a world with human rights; he just doesn’t think he has a really good meta-ethical foundation to rest this on. But in practice, he would take that vision forward, I believe, and try to persuade other people that it was the way to go by telling them good stories and saying, “Well, look, this is the world with human rights and open-ended deliberation, and this is the world where one person decided what to do. Wouldn’t it be nice to live in that better world?” So I’m hopeful that this kind of political ballpark has this kind of rich applicability and appeal, regardless of whether people are starting out in one place or the other.

Lucas Perry: That makes sense. So then another aspect of this is: in the absence of moral agreement, or when there is moral disagreement, is there a fair way to decide what principles AI should align with? For example, I can imagine religious fundamentalists being, at core, antithetical to the project of aligning AI systems that eventually lead to something smarter than us. They could view it as something like playing God and just be like, “Well, this is just not a project that we should even do.”

Iason Gabriel: So that’s an interesting question, and you may actually be putting pressure on my preceding argument. I think that it is certainly the case that you can’t get everyone to agree on a set of global principles for AI, because some people hold very, very extreme beliefs that are exclusionary and aren’t open to the possibility of compromise; typically people who have a fundamentalist orientation of one kind or another. And so, even if we get the pluralistic project off the ground, it may be the case that we have to, in my language, impose our values on those people, and that in a sense, they are dominated. And that leads to the difficult question: why is it permissible to impose beliefs upon those people, but not the people who don’t hold fundamentalist views? It’s a fundamentally difficult question, because what it tends to point to is the idea that beneath this talk about pluralism, there is actually a value claim, which is that you are entitled to non-domination so long as you’re prepared not to dominate other people, and to accept that there is a moral equality that means we need to cooperate and cohabit in a world together.

And that does look like a kind of deep, deep, moral claim that you might need to substantively assert. I’m not entirely sure; I think that’s one that we can save for further investigation, but it’s certainly something that people have said in the context of these debates, that at the deepest level, you can’t escape making some kind of moral claim, because of these cases.

Lucas Perry: Yeah. This is reminding me of the paradox of tolerance by Karl Popper, who talks about free speech ends when you yell, “The theater’s on fire.” And in some sense are then imposing harm on other people. And that we’re tolerant of people within society, except for those who are intolerant of others. And to some extent, that’s a paradox. So similarly we may respect and endorse a principle of non-domination, or non-subjugation, but that ends when there are people who are dominating or subjugating. And the core of that is maybe getting back again to some kind of principle of non-harm related to the wellbeing of sentient creatures.

Iason Gabriel: Yeah. I think that the obstacles that we’re discussing now are very precisely related to that paradox. Of course, the boundaries we want to draw on permissible disagreement are in some sense quite minimal; or conversely, we might think that the wide affirmation of some aspect of the value of human rights is quite a strong basis for moving forwards, because it says that all human life has value and that everyone is entitled to basic goods, including goods pertaining to autonomy. So people who reject that really are pushing back against something that is widely and deeply, reflectively endorsed by a large number of people. I also think that with regards to toleration, the anti-realist position becomes quite hard to figure out, or quite strange. So you have these people who are not prepared to live in a world where they respect others, and they have this will to dominate, or a fundamentalist perspective.

The anti-realist says, “Well, you know, potentially there’s this nicer world we can move towards.” The anti-realist doesn’t deal in the currency of moral reasons. They don’t really have to worry about it too much; they can just say, “We’re going to go in that direction with everyone else who agrees with us,” and hold to the idea that it looks like a good way to live. So in a way, the problem with domination is much more serious for people who are moral realists. As for the anti-realists, it’s not actually a perspective I inhabit in my day-to-day life, so it’s hard for me to say what they would make of it.

Lucas Perry: Well, I guess, just to briefly defend the anti-realist, I imagine that they would say that they still have reasons for morality, they just don’t think that there is an objective epistemological methodology for discovering what is true. “There aren’t facts about morality, but I’m going to go make the same noises that you make about morality. Like I’m going to give reasons and justification, and these are as good as making up empty screeching noises and blah, blahing about things that don’t exist,” but it’s still motivating to other people, right? They still will have reasons and justification; they just don’t think it pertains to truth, and they will use that to navigate the world and then justify domination or not.

Iason Gabriel: That seems possible, but I guess for the anti-realist, if they think we’re just fundamentally expressing pro-attitudes, so that when I say, “It isn’t justified to dominate others,” I’m just saying, “I don’t like it when this thing happens,” then we’re just dealing in the currency of likes, and I just don’t think you have to be as worried about the problem of domination as you are if you think that this means something more than someone just expressing an attitude about what they like or don’t. If there aren’t real moral reasons or considerations at stake, if it’s just people saying, “I like this. I don’t like this,” then you can get on with the enterprise that you believe achieves these positive ends. Of course, the unpleasant thing is you kind of are potentially giving permission to other people to do the same, or that’s a consequence of the view you hold. And I think that’s why a lot of people want to rescue the idea of moral justification as a really meaningful practice, because they’re not prepared to say, “Well, everyone gets on with the thing that they happen to like, and the rest of it is just window dressing.”

Lucas Perry: All right. Well, I’m not sure how much we need to worry about this now. I think it seems like anti-realists and realists basically act the same in the real world. Maybe, I don’t know.

Iason Gabriel: Yeah. In reality, anti-realists tend to act in ways that suggest that on some level they believe that morality has more to it than just being a category error.

Lucas Perry: So let’s talk a little bit here more about the procedure by which we choose evaluative models for deciding which proposed aspects of human preferences or values are good or bad for an alignment procedure. We can have a method of evaluating or deciding which aspects of human values or preferences or things that we might want to bake into an alignment procedure are good or bad, but you mentioned something like having a global fora or having different kinds of governance institutions or vehicles by which we might have conversation to decide how to come up with an alignment procedure that would be endorsed. What is the procedure to decide what kinds of evaluative models we will use to decide what counts as a good alignment procedure or not? Right now, this question is being answered by a very biased and privileged select few in the West, at AI organizations and people adjacent to them.

Iason Gabriel: I think this question is absolutely fundamental. I believe that any claim that we have meaningful global consensus on AI principles is premature, and that it probably does reflect biases of the kind you mentioned. I mean, broadly speaking, I think that there are two extremely important reasons to try and widen this conversation. The first is that in order to get a kind of clear, well-grounded and well-sighted vision of what AI should align with, we definitely need intercultural perspectives, on the assumption that, to quote John Stuart Mill, “no-one has complete access to the truth and people have access to different parts of it.” The bigger the conversation becomes, the more likely it is that we move towards maximal value alignment of the kind that humanity deserves. But potentially more importantly than that, and regardless of the kind of epistemic consequences of widening the debate, I think that people have a right to voice their perspective on topics and technologies that will affect them. If we think of the purpose of a global conversation partly as this idea of formulating principles, but also as bestowing on them a certain authority in light of which we’re permitted to build powerful technologies, then you just can’t say that they have the right kind of authority and grounding without proper, extensive consultation. And so, I would suggest that that’s a very important next step for people who are working in this space. I’m also hopeful that actually these different approaches that we’ve discussed can potentially be mutually supporting. So, I think that there is a good chance that human rights could serve as a foundation or a seed for a good, strong intercultural conversation around AI alignment.

And I’m not sure to what extent this really is the case, but it might be that even some of these ideas about reasoning impartially have currency in a global conversation. And you might find that they are actually quite challenging for affluent countries or for self interested parties, because it would reveal certain hidden biases in the propositions that they have now made or put forward.

Lucas Perry: Okay. So, related to things that we might want to do to come up with the correct procedure for being able to evaluate what kinds of alignment procedures are good or bad, what do you view as sufficient for adequate alignment of systems? We’ve talked a little bit about minimalism versus maximalism, where minimalism is aligning to just some conception of human values and maximalism is hitting on some very idealized and strong set or form of human values. And this is related, at least in the, I guess, existential risk space, to ideas coming from people like Toby Ord and William MacAskill. They talk about something like a long reflection. If I’m asking you about what might be adequate alignment for systems, one criterion for that might be meeting basic human needs, meeting human rights, and reducing existential risk further and further such that it’s very, very close to zero and we enter a period of existential stability.

And then following this existential stability, something like a long reflection is proposed, where we might more deeply consider ethics and values and norms before we set about changing and optimizing all of the atoms around us in the galaxy. Do you have a perspective here on this sort of most high-level timeline: first, as we’re aligning AI systems, what does it take for that alignment to be adequate? Then, what needs to potentially be saved for something like a long reflection? And then, how would something like a broadly endorsed procedure versus a one true moral theory approach fit into something like a long reflection?

Iason Gabriel: Yes. A number of thoughts on this topic. The first pertains to the idea of existential security and, I guess, why it’s defined as the kind of dominant goal in the short-term perspective. There may be good reasons for this, but I think what I would suggest is that it obviously involves trade-offs. The world we live in is a very unideal place, one in which we have a vast quantity of unnecessary suffering. And to my mind, it’s probably not even acceptable to say that basically the goal of building AI, or the foremost challenge of humanity, is to focus on this kind of existential security and extreme longevity while leaving so many people to lead lives that are less than they could be.

Lucas Perry: Why do you think that?

Iason Gabriel: Well, because human life matters. If we were to look at where the real gains in the world are today, I believe it’s helping these people who die unnecessarily from neglected diseases, lack subsistence incomes, and things of that nature. And I believe that has to form part of the picture of our ideal trajectory for technological development.

Lucas Perry: Yeah, that makes sense to me. I’m confused about what you’re actually saying about the existential security view being central. If you compare the suffering of people that exist today to the astronomical amount of life that could exist in the future, does that kind of reasoning about the potential not do the work for you for seeing mitigating existential risk as the central concern?

Iason Gabriel: I’m not entirely sure, but what I would say is that on one reading of the argument that’s being presented, the goal should be to build extremely safe systems and not try to intervene in areas about which there’s more substantive contestation, until there’s been a long delay and a period of reflection, which might mean neglecting some very morally important and tractable challenges that the world is facing at the present moment. And I think that that would be problematic. I’m not sure why we can’t work towards something that’s more ambitious, for example, a human rights-respecting AI technology.

Lucas Perry: Why would that entail that?

Iason Gabriel: Well, so, I mean, this is the kind of question about the proposition that’s been put in front of us. Essentially, if that isn’t the proposition, then the long reflection isn’t leaving huge amounts to be deliberated about, right? Because we’re saying, in the short term, we’re going to tether towards global security, but we’re also going to try and do a lot of other things around which there’s moral uncertainty and disagreement, for example, promote fairer outcomes, mobilize in the direction of respecting human rights. And I think that once we’ve moved towards that conception of value alignment, it isn’t really clear what the substance of the long reflection is. So, do you have an idea of what questions would remain to be answered?

Lucas Perry: Yeah, so I guess I feel confused because reaching existential security as part of this initial alignment procedure, doesn’t seem to be in conflict with alleviating the suffering of the global poor, because I don’t think moral uncertainty extends to meeting basic human needs or satisfying basic human rights or things that are obviously conducive to the well-being of sentient creatures. I don’t think poverty gets pushed to the long reflection. I don’t think unnecessary suffering gets pushed to the long reflection. Then the question you’re asking is what is it that does get pushed to the long reflection?

Iason Gabriel: Yes.

Lucas Perry: Then what gets pushed to the long reflection is, is the one true moral theory approach to alignment actually correct? Is there a one true moral theory or is there not a one true moral theory? Are anti-realists correct or are realists correct? Or are they both wrong in some sense or is something else correct? And then, given that, the potential answer or inability to come up with an answer to that would change how something like the cosmic endowment gets optimized. Because we’re talking about billions upon billions upon billions upon billions of years, if we don’t go extinct, and the universe is going to evaporate eventually. But until then, there is an astronomical amount of things that could get done.

And so, the long reflection is about deciding what to actually do with that. And however esoteric it is, the proposals range from you just have some pluralistic optimization process. There is no right way you should live. Things other than joy and suffering matter like, I don’t know, building monuments that calculate mathematics ever more precisely. And if you want to carve out a section of the cosmic endowment for optimizing things that are other than conscious states, you’re free to do that versus coming down on something more like a one true moral theory approach and being like, “The only kinds of things that seem to matter in this world are the states of conscious creatures. Therefore, the future should just be an endeavor of optimizing for creating minds that are ever more enjoying profound states of spiritual enlightenment and spiritual bliss and knowledge.”

The long reflection might even be about whether or not knowledge matters for a mind. “Does it really matter that I am in tune with truth and reality? Should we build nothing but experience machines that cultivate whatever the most enlightened and blissful states of experience are or is that wrong?” The long reflection to me seems to be about these sorts of questions and if the one true moral theory approach is correct or not.

Iason Gabriel: Yeah, that makes sense. And my apologies if I didn’t understand what was already taken care of by the proposal. I think to some extent, in that case, we’re talking about different action spaces. When I look at these questions of AI alignment, I see very significant value questions already arising in terms of how benefits and burdens are distributed, what fairness means, and whether AI needs to be explainable and accountable, and things of that nature, alongside a set of very pressing global problems that it would be really, really important to address. I think my time horizon is definitely different from this long reflection one. I kind of find it difficult to imagine a world in which these huge, but to some extent prosaic, questions have been addressed and in which we then turn our attention to these other things. I guess there are a couple of things that can be said about it.

I’m not sure if this is meant to be taken literally, but I think the idea of pressing pause on technological development while we work out a further set of fundamentally important questions is probably not feasible. It would be best to work with a long term view that doesn’t rest upon the possibility of that option. And then I think that the other fundamental question is what is actually happening in this long reflection? It can be described in a variety of different ways.

Sometimes it sounds like it’s a big philosophical conference that runs for a very, very long time. And at the end of it, hopefully people kind of settle these questions and they come out to the world and they’re like, “Wow, this is a really important discovery.” I mean, if you take seriously the things we’ve been talking about today, you still have the question of what do you do with the people who then say, “Actually, I think you’re wrong about that.” And I think in a sense it recursively pushes us back into the kind of processes that I’ve been talking about. When I hear people talk about the long reflection there does also sometimes seem to be this idea that it’s a period in which there is very productive global conversation about the kind of norms and directions that we want humanity to take. And that seems valuable, but it doesn’t seem unique to the long reflection. That would be incredibly valuable right now so it doesn’t look radically discontinuous to me on that view.

Lucas Perry: All right. Because we’re talking about the long term future here and I bring it up because it’s interesting in considering what questions can we just kind of put aside? These are interesting, but in the real world, they don’t matter a ton or they don’t influence our decisions, but over the very, very long term future, they may matter much more. When I think about a principle like non-domination, it seems like we care about this conception of non-imposition and non-dominance and non-subjugation for reasons of, first of all, well-being. And the reason why we care about this well-being question is because human beings are extremely fallible. And it seems to me that the principle of non-domination is rooted in the lack of epistemic capacity for fallible agents like human beings to promote the well-being of sentient creatures all around them.

But in terms of what is physically, literally possible in the universe, it’s possible for someone to know so much more than you about the well-being of conscious creatures, and about how much happier you would be and how much more well-being you would have if you only idealized in a certain way. So as we get deeper and deeper into the future, I have more and more skepticism about this principle of non-domination and non-subjugation.

It seems very useful, important, and exactly like the thing that we need right now, but as we long reflect further and further and, say, really smart, really idealized beings develop more and more epistemic clarity on ethics and what is good and the nature of consciousness and how minds work and function in this universe that I would probably submit myself to a Dyson sphere brain that was just like, “Well, Lucas, this is what you have to do.” And I guess that’s not subjugation, but I feel less and less moral qualms with the big Dyson sphere brain showing up to some early civilization like we are, and then just telling them how they should do things, like a parent does with a child. I’m not sure if you have any reactions to this or how much it even really matters for anything we can do today. But I think it’s potentially an important reflection on the motivations behind the principle of non-domination and non-subjugation and why it is that we really care about it.

Iason Gabriel: I think that’s true. I think that if you consent to something, then almost… I don’t want to say by definition, that’s definitely too strong, but it’s very likely that you’re not being dominated, so long as you have sufficient information and you’re not being coerced. I think the real question is what if this thing showed up and you said, “I don’t consent to this,” and the thing said, “I don’t care; it’s in your best interests.”

Lucas Perry: Yeah, I’m defending that.

Iason Gabriel: That could be true in some kind of utilitarian, consequentialist, moral philosophy of that kind. And I guess my question is, “Do you find that unproblematic? Or, “Do you have this intuition that there is a further set of reasons you could draw upon, which explain why the entity with greater authority doesn’t actually have the right to impose these things on you?” And I think that it may or may not be true.

It probably is true that from the perspective of welfare, non-domination is good. But I also think that a lot of people who are concerned about pluralism and non-domination think that its value pertains to something which is quite different, which is human autonomy. And that that has value because of the kind of creatures we are, with freedom of thought, a consciousness, a capacity to make our own decisions. I, personally, am of the view that even if we get some amazing, amazing paternalist, there is still a further question of political legitimacy that needs to be answered, and that it’s not permissible for this thing to impose without meeting these standards that we’ve talked about today.

Lucas Perry: Sure. So at the very least, I think I’m attempting to point towards the long reflection consisting of arguments like this: we weren’t participating in coercion before, because we didn’t really know what we were talking about, but now we do. And so, given our epistemic clarity, coercion makes more sense.

Iason Gabriel: It does seem problematic to me. And I think the interesting question is: what does time add to robust epistemic certainty? It’s quite likely that if you spend a long time thinking about something, at the end of it you’ll say, “Okay, now I have more confidence in a proposition that was on the table when I started.” But does that mean that it is actually substantively justified? And what are you going to say if you think you’re substantively justified, but you can’t actually justify it to other people who are reasonable, rational, and informed like you?

It seems to me that even after a thousand years, you’d still be taking a leap of faith of the kind that we’ve seen people take in the past with really, really devastating consequences. I don’t think it’s the case that ultimately there will be a moral theory that’s settled and the confidence in the truth value of it is so high that the people who adhere to it have somehow gained the right to kind of run with it on behalf of humanity. Instead, I think that we have to proceed a small step at a time, possibly in perpetuity and make sure that each one of these small decisions is subject to continuous negotiation, reflection and democratic control.

Lucas Perry: The long reflection, though, to me seems to be about questions like that, because you’re taking a strong epistemological view on meta-ethics, namely that there wouldn’t be the kind of clarity that might emerge over time from minds far greater than our own. From my perspective, I just find the problem of suffering to be very, very, very compelling.

Let’s imagine we have the sphere of utilitarian expansion into the cosmos, and then there is the pluralistic, non-dominating, democratic, virtue-ethical, deontologically based sphere of expansion. These spheres, say, run across planets at different stages of evolution. Here you have a suffering hell planet: it’s just wild animals born of Darwinian evolution, eating and murdering each other all the time and dying of disease and starvation and other things. And then maybe you have another planet which is an early civilization, and there is just subjugation and misery and all of these things. These two spheres of expansion would do completely different things to these planets. We’re entering super esoteric sci-fi space here, but again, I think it’s instructive of the importance of something like a long reflection. It changes what is permissible and what will be done. And so I find it interesting and valuable, but I also agree with you about the claim you made earlier, that it’s unclear we could actually hit the brakes and have a thousand-year philosophy convention.

Iason Gabriel: Yes. I mean, the one further thing I’d say, Lucas, is that, bearing in mind some of the earlier provisos we attached to the period before the long reflection, we were kind of gambling on the idea that there would be political legitimacy and consensus around things like the alleviation of needless suffering. So it is not necessarily the case that everything would be up for grabs just because people have to agree upon it. In the world today, we can already see some nascent signs of moral agreement on things that are really morally important and would be very significant if they were fully realized as ideals.

Lucas Perry: Maybe there is just not that big of a gap between the views that are left to be argued about during the long reflection. But then, wrapping up on this part of the conversation, there is also this interesting question about what we previously took to be sacred that no longer is. An example would be: if a moral realist, utilitarian conception ended up just being the truth, then rights never actually mattered and autonomy never mattered, but they functioned as very important epistemic tools. And then we’re just like, “Okay, we’re basically doing away with everything that we said was sacred.” We would still endorse having done that, but now it’s seen in a totally different light. There could be a profound shift like that, which is why something like the long reflection might be important.

Iason Gabriel: Yeah. I think it really matters how the hypothesized shift comes about. If there is this kind of global conversation with new information coming to light, taking place through a process that’s non-coercive, and the final result seems to be a stable consensus of overlapping beliefs, so that we have more moral consensus than we did around something like human rights, then that looks like a plausible direction to move in, and that might even be moral progress itself. Conversely, if it’s people who have been in the conference a long time and they come out and say, “We’ve reflected for a thousand years and now we have something that we think is true,” then unfortunately I think they end up kind of back at square one, where they’ll meet people who say, “We have reasonable disagreement with you, and we’re not necessarily persuaded by your arguments.”

And then you have the question of whether they’re more permitted to engage in value imposition than people were in the past. And I think probably not. I think if they believe those arguments are so good, they have to put them into a political process of the kind that we have discussed and hopefully their merits will be seen or, if not, there may be some avenues that we can’t go down but at least we’ve done things in the right way.

Lucas Perry: Luckily, it may turn out to be the case that you basically never have to engage in coercion, because with good enough reasons and evidence and argument, basically any mind that exists can be convinced of something. Then it gets into this very interesting question: if we’re respecting a principle of non-domination and non-subjugation, as something like Neuralink and merging with AI systems arrives, and we gain more and more information about how to manipulate and change people, which changes made to people from the outside would count as coercion and which would not? Because currently we’re constantly getting pushed around in terms of our development by technology and people and the environment, and we basically have no control over that. And do I always endorse the changes that I undergo? Probably not. Does that count as coercion? Maybe. And we’ll increasingly gain power to change people in this way. So this question of coercion will probably become more and more interesting and difficult to parse over time.

Iason Gabriel: Yeah. I think that’s quite possible. And it’s the kind of observation that can be made about many of the areas that we’re thinking about now. For example, the same could be said of autonomy, or to some extent that’s the flip side of the same question. What does it really mean to be free? Free from what, and under what conditions? If we just loop back a moment, the one thing I’d say is that the hypothesis that you can create moral arguments so well-reasoned that they persuade anyone is, I think, the perfect statement of a certain Enlightenment perspective on philosophy that sees rationality as the tiebreaker and the arbiter of progress. In a sense, the whole project that I’ve outlined today rests upon a recognition or an acknowledgement that that is probably unlikely to be true: when people reason freely about what the good consists in, they do come to different conclusions.

And I guess the kind of thing people would point to there as evidence is just the nature of moral deliberation in the real world. You could say that if there were these winning arguments that just won by force of reason, we’d be able to identify them. But in reality, when we look at how moral progress has occurred, it requires a lot more than just reason-giving. To some extent, I think the master-argument approach itself rests upon mistaken assumptions, and that’s why I wanted to go in this other direction. By a twist of fate, if I am mistaken and the master argument is possible, it would also satisfy a lot of the conditions of political legitimacy. Right now, we have good evidence that it isn’t possible, so we should proceed in one way. If it is possible, then those people can appeal to the political processes.

Lucas Perry: They can be convinced.

Iason Gabriel: They can be convinced. And so, there is reason for hope there for people who hold a different perspective to my own.

Lucas Perry: All right. I think that’s an excellent point to wrap up on then. Do you have anything else? I’m just giving you an open space now, if you feel unresolved about anything or have any last-minute thoughts that you’d really like to share. I found this conversation really informative and helpful, and I appreciate and really value the work that you’re doing on this. I think it’s sorely needed.

Iason Gabriel: Yeah. Thank you so much, Lucas. It’s been a really, really fascinating conversation and it’s definitely pushed me to think about some questions that I hadn’t considered before. I think the one thing I’d say is that this is really… A lot of it is exploratory work. These are questions that we’re all exploring together. So, if people are interested in value alignment, obviously listeners to this podcast will be, but specifically normative value alignment and these questions about pluralism, democracy, and AI, then please feel free to reach out to me, contribute to the debate. And I also look forward to continuing the conversation with everyone who wants to look at these things and develop the conversation further.

Lucas Perry: If people want to follow you or get in contact with you or look at more of your work, where are the best places to do that?

Iason Gabriel: I think if you look on Google Scholar, there are links to most of the articles that I have written, including the one that we were discussing today. People can also send me an email, which is just my first name, Iason@deepmind.com. So, yeah.

Lucas Perry: All right.

End of recorded material

Peter Railton on Moral Learning and Metaethics in AI Systems

 Topics discussed in this episode include:

  • Moral epistemology
  • The potential relevance of metaethics to AI alignment
  • The importance of moral learning in AI systems
  • Peter Railton’s, Derek Parfit’s, and Peter Singer’s metaethical views

 

Timestamps: 

0:00 Intro
3:05 Does metaethics matter for AI alignment?
22:49 Long-reflection considerations
26:05 Moral learning in humans
35:07 The need for moral learning in artificial intelligence
53:57 Peter Railton’s views on metaethics and his discussions with Derek Parfit
1:38:50 The need for engagement between philosophers and the AI alignment community
1:40:37 Where to find Peter’s work

 

Citations:

You can find Peter’s work here

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today, we have a conversation with Peter Railton that explores metaethics, moral epistemology, moral learning, and how these areas of philosophy may or may not inform AI alignment. The core problem that this episode explores is that as systems become more and more autonomous and increasingly participate in social roles that require social functioning, it will become increasingly necessary for AI systems to be familiar with and sensitive to morally salient features of the world. This requires that systems have the capacity for moral learning and developing an understanding of human normative processes and beliefs. On top of that, structuring any kind of procedure for moral learning in AI systems will bring in metaethical beliefs and assumptions that would be wise to understand and be explicit about. For a little more context, some key motivating questions for this episode to consider are: When, and to what degree, will AI systems require the capacity for moral learning? How might metaethics inform or not inform AI alignment? How do you structure a system such that it can engage in moral learning in a way that would be broadly endorsed and would satisfy other ethical or meta-ethical principles we broadly care about?

For some more background, I did a podcast with Peter Singer on his transition from being a moral anti-realist to a moral realist. That episode is titled “On Becoming a Moral Realist with Peter Singer.” In that episode we explore his metaethical views, and Peter Singer mentions conversations and debate between Derek Parfit and Peter Railton on issues in metaethics. So, the second half of this podcast is dedicated to understanding and unpacking Peter Railton’s metaethics and how it compares with Peter Singer’s and Derek Parfit’s views. This podcast is pretty philosophy heavy, so if you’re into that and the ethics of AI then you’ll appreciate this episode. You can subscribe to and follow this podcast on your preferred podcasting platform, by searching for “Future of Life.”

Peter Railton is a Professor of Philosophy at the University of Michigan, Ann Arbor. He has a PhD from Princeton and primarily researches ethics and the philosophy of science. He focuses especially on questions about the nature of objectivity, value, norms, and explanation. Recently, he has also begun working in aesthetics, moral psychology, and the theory of action. And with that, let’s get into our conversation with Peter Railton.

Just to start off here, sometimes I’ve heard that metaethics doesn’t matter, or one might wonder when does metaethics ever matter in real life anyway? I’m curious, do you have any thoughts on whether metaethics matters at all for AI alignment?

Peter Railton: Well, in the most general sense, metaethics concerns questions about the nature of morality, its foundation, the possibility of moral knowledge, how we might acquire it, the meanings of moral claims, and how they stand in relation to our other forms of knowledge. And so it does seem to me as if metaethics is important in thinking about the problems of ethics in AI, in part because I think a lot of people have, in the back of their minds, skeptical concerns about morality. And therefore they doubt whether there could be objective value. They think perhaps value is entirely subjective. And if that’s your approach, then you might say the challenge of creating ethical AI is not a very well defined problem.

What would be the subjective attitude of a properly aligned AI system? You might consult the population and find out what the average point of view is. But we know the average point of view right now is very different from what it was 200 or 300 years ago. We think in some ways it’s improved since then, and we think in some ways where we are now could be improved. So we can’t reduce the question of ethics in AI to something like opinion sampling, and that’s because morality has objective dimensions, and we use these to criticize our preferences and our opinions. And so any decent ethics for AI would build in the possibility of correction and criticism. And for that, you need some thought about what would constitute correction or criticism. How would we justify moral claims? And that takes us to the heart of metaethics.

Lucas Perry: Right. And there are a lot of moral anti-realists, or people who think that morality is subjective, in, I guess, the hard sciences and computer science in general, so this also applies to the alignment community. If one feels that moral claims or moral attitudes are subjective, then this choice that you mentioned, to take the average of general popular opinion, is itself a moral choice, which is the expression of one’s own subjective moral attitude from that point of view. And within a subjective framework, there’s no way to resolve that, except to take the expression of all of the power dynamics of everyone’s subjective moral attitudes and see what comes out of that, right?

Peter Railton: Well, yeah, that would be one of the problems. The project of creating ethical AI, or AI alignment as it’s sometimes called, can’t be the problem of giving our value system to machines, because there is no unique value system that we possess. It could be the project of trying to make it possible for the machine to learn the most justified value system. And part of the problem, I think, is that people have exaggerated notions of what it would take to justify moral claims. They assume, for example, that there’s a huge gulf between facts and values, that there are no reasonable ways of bridging that gulf, and that in general, what it would take to have objective morality would look something like what God would command, only without God.

One of the problems with that thought is that that’s a model of morality as a set of commands given by some kind of a divine enforcer. And if you think that absent such a divine enforcer, morality could only be subjective, then I think you’re missing the idea of what morality really does. The existence of a divine enforcer wouldn’t bring morality into existence. A divine enforcer could be either good or malevolent. And so understanding what it is to do moral criticism should be an integral part of the challenge of thinking about ethics and AI. But looking at moral criticism, we have many practices of moral criticism, and those aren’t, strictly speaking, subjective, and we value them because they help correct our subjective opinions.

Lucas Perry: So there are two parts of metaethics that I would like to hear your thoughts on, and on how they may or may not apply here: metaethical epistemology, how it is that you know things about metaethics, and then whatever may or may not be metaphysically true about ethics. So you brought up religion there. In terms of, I guess, what would be called Divine Command Theory, morality would have a metaphysically very solid ground as being codified by God or something like that.

Peter Railton: Actually, I’d say that that wouldn’t get us a solid metaphysical ground. The fact that commands come from a being that is supremely powerful, and even one that’s supremely knowing, would not make those commands moral commands. Those conditions are perfectly compatible with immoral values. What we would need is a perfectly knowledgeable, entirely powerful, and all-good God, a so-called AAA God. But that means the concept of good is independent of the concept of God itself, and understanding what it would be for the commands of a divine, super-powerful being to be good just takes us right back to the question of the nature of morality. We don’t solve it by introducing supreme beings.

Lucas Perry: Right, right. I’m not trying to justify or lay out Divine Command Theory, only using it to, I guess, attempt to explain how epistemology and metaphysics fit into metaethics. To me, what seems relevant here to AI alignment is that how one believes one can know things about metaethics, and whether or not there can be agreement upon metaethical epistemology, would be the foundation upon which metaethical moral learning in machine systems could be expressed.

There is sort of a meta view on the epistemology of metaethics, where one could say, “Because there are no moral facts, the epistemology is whatever human beings are doing to think about moral thought.” And there isn’t a correct epistemology. Whereas one could, whether through naturalism in your metaethics, or through non-naturalism in Peter Singer’s ethics, believe there to be moral truths, and that thus there is a correct epistemology about metaethics, and that that epistemology of metaethics could be used to instantiate metaethical learning in machine systems.

Peter Railton: So one thought would be, there is one true morality and we’re capable of knowing it. That itself wouldn’t get us very far in epistemology until we could say what those methods of knowing are. An approach that’s got something like that as an assumption, but that doesn’t assume that we know what the destination is ultimately going to be, would be to ask, “Do we have good practices of moral criticism? And do those help us to solve actual problems, social problems, interpersonal problems, problems with our own lives?” And then to look at the ways in which we use morality in these contexts to solve problems.

And that brings it down to a level where it’s something that comes within the scope of what can be learned. If we look at children’s learning, we see that their development as moral creatures proceeds in step with their understanding of causality, their understanding of theory of mind, and their capacity to form counterfactual thoughts, because it’s really an integrated body of general understanding. And so, for example, the idea of positive-sum solutions to game-theoretic challenges is something that all parties can agree is desirable. And so looking at strategies that have the possibility of yielding positive sums, such as cooperative strategies, strategies of trustworthiness, and signaling strategies which enable us to coordinate with each other and understand each other’s intentions, those have a justification that we can give in terms that are not tied to any one particular person’s interests, which address interests generally, and which we can defend in an impartial way.

And so that would be an example of a way in which we could say those are more reasonable solutions, more justified solutions. There’s an analogy here with epistemology generally. If someone were to come to me and say, “Well, you claim to have knowledge, how do you demonstrate that? How would you show that your understanding of knowledge is genuine knowledge?” I’d have to say, “Well, sorry, I can’t demonstrate that. Any demonstration would presuppose knowledge. And so I can’t pull it out of a hat and I can’t derive it from nothing.” So what can I do? I can say, “Well, here are our practices of epistemic criticism. And while we have disagreements in various places about what counts as evidence or what does not, do those practices deliver the kinds of results that we would expect from reasonable epistemologies, making possible things like scientific inquiry and technology and so on?”

And we can say, “Well, that’s what epistemology could be expected to give us. We do have methods that can improve our ability to solve such problems in just those ways. We can find various ways to justify them in terms of probabilities, looking for ways in which we can increase accuracy in our estimations.” And so those are different ways in which, by looking at our actual practices of epistemic criticism, we try to get some traction on the problem of knowledge. And I would argue we should do the same thing about morality. If we start from the standpoint of skepticism in the case of knowledge, we will end with skepticism. The same would be true with ethics, but I see no more reason to do it in ethics than in epistemology. We surely must know a great deal about what’s good for us and good for one another. And we have well-developed practices of moral assessment that we use in our own lives and in our collective institutions. So I would say, if we look to those, then we don’t see just subjective opinion. It’s quite different from that, and we see a lot of constraints.

Lucas Perry: So I do want to explore more arguments around metaethics with you, and we’re intending to do that after we discuss moral learning here. Now, in terms of moral epistemology and the epistemology of metaethics, I’m interested in this part of the conversation in setting up and attempting to illustrate that, whether or not one takes a skeptical view on moral epistemology, moral learning and our view on moral epistemology are essential and important in the alignment and development of AI systems. And here you’re defending a more realist account of epistemology in ethics.

Peter Railton: Well, you could say that I myself am a realist, but what I’ve been saying so far, a pragmatist about ethics could say just as well. John Dewey would say something very similar, as would various kinds of non-realists who are nonetheless objectivists in ethics: Kantians, for example, constructivists, and so on. What I’ve said is really neutral territory for a wide range of views in metaethics, and it doesn’t presuppose, in particular, a form of naturalism or a form of realism. That’s actually a tremendous amount to build upon, so that when we think about how to design robots to understand the world, we have a lot of knowledge about what sorts of systems would be well designed for doing that.

Similarly, if we want to build a robot who can interact creatively and productively with other robots, solve problems of coordination, reduce conflict, realize long-term goals, interact successfully with people, recognize their interests, take their interests into account, and be relatively impartial with regard to the interests that are at stake, those capacities are not mysterious in the way that the skeptic seems to think they are. Because, again, they’re already integrated in our practice, and as Hume pointed out a long time ago, skepticism doesn’t survive very well once we leave the closeted philosophical study. People go out and they act as if they had knowledge of the world, and they act as if there are things that people could do to them, or that they could do, that would be better or worse, right or wrong. They think about how they would treat their children. They think about how they should behave with respect to their students or their professor. That doesn’t take us into the misty realms of metaphysics, but it does take us into the practices of moral criticism and self-criticism.

Lucas Perry: So could you unpack just a little bit more about why this view is neutral?

Peter Railton: So, for example, I’ve mentioned a couple of features of moral thought. One feature of moral thought is that it takes a kind of impartiality seriously: it gives equal weight to all those affected. That’s something that Kantians and utilitarians and many other moral theorists would agree on. Another feature of moral thought is that it’s concerned with general reasons: similar cases have to be treated in a similar way. That leads to a doctrine known as supervenience. We can’t invent moral distinctions that don’t correspond to real distinctions in fact. Another feature is that morality has to do with reciprocity, relations of mutual gain and mutual benefit. Another is that morality involves taking oneself and others as ends and not as mere means.

Those are all features at the level of normative theory. But if you then ask, “What about the metaethical side? Could a pragmatist about ethics say the same things?” the answer seems to be yes: the pragmatist sees ethics as essentially about people solving the human problems that they face in ways that meet these kinds of desiderata. The person who believes that there’s a rationalist foundation believes that you can know a priori that these constraints of impartiality and so on exist. But as you can see from Singer’s work, the result of applying his form of rationalism is not dramatically different from the result of applying my form of naturalism. And that’s because the target that we’re all working on, ethics that is, has a great deal of determinate structure. And so any metaethical theory is going to have to capture a lot of that structure.

Lucas Perry: And so, sorry, how is this instructive for why metaethics matters for AI alignment?

Peter Railton: Well, the suggestion was that we should know something about what ethics is in order to answer the question of how we might gain moral knowledge and, if we can gain moral knowledge, what that moral knowledge might consist in. That’s where we started. And then I tried to suggest a bunch of considerations, a bunch of features, that I could call obvious features of moral practice. Because I think our practice is not just at the normative level. People also have implicit metaviews in ethics. They demonstrate that, for example, by their knowledge of how you can determine morally relevant considerations in situations. So they understand what kinds of considerations are or aren’t morally relevant. They understand the distinction between morality and etiquette, between morality and law, between morality and self-interest. So they have a grasp of a bunch of these obvious features of morality.

And those are not just features of one or another normative theory. They’re features of virtually all normative theories, and features that any metaethic is going to have to accommodate, unless it’s going to be skeptical. So that’s why I say that there’s a great deal of common ground, not because the fundamental explanations are going to be the same, but because there is an explanatory target which has a great deal of structure and which indeed all these theories have to explain. And that requires that metaethical theories be adequate to it.

Lucas Perry: I see. So that is already structuring metaethical epistemology is what you’re saying?

Peter Railton: Yeah. It gives you quite a bit of structure.

Lucas Perry: Yeah. That’s just reminding me of how Peter Singer, in his book The Point of View of the Universe, discusses a philosopher who held that there are a few axioms of morality, and they seem to touch upon these convergent principles that you’re talking about here. Now, on a realist’s account of metaethics, there would be something like a one true moral theory. And if one takes the one true moral theory view seriously, then the problem of AI alignment would be to cultivate a procedure for coming up with the correct moral epistemology in order to find the one true moral theory, or to discover the one true moral theory ourselves, and then align AI systems to that.

Now, if one believes that there is not one true moral theory, and there is only the evolution and extrapolation of human normative processes, and preferences, and metapreferences, then one might not want to come at the AI alignment problem from the perspective of a one true moral theory approach. And as a general note, I’m taking this language from Iason Gabriel, who will be on the podcast soon. And so in the secondary scenario, that is not using the one true moral theory approach, one would want to come up with a broadly acceptable procedure for aligning AI systems that didn’t presume to try to discover a one true moral theory. Do you have any reactions to these two ideas or approaches to alignment?

Peter Railton: Yeah. The question of whether one thinks there is one true theory is somewhat different from the question of whether we are close to it or have good ways of knowing it. Although I myself am a realist, I recognize that there’s a good chance that my moral views are wrong and my metaethical views are wrong. And so I don’t want to just put all of my energy into thinking, “Well, how would we discover the one true moral theory?” I would want to think more robustly. And again, I can make an analogy with epistemology. If you go into a philosophy of science department or a statistics department, you’ll find that there’s a tremendous debate between people who think that Bayesianism is the right kind of approach to evidence and people who think that standard methods of social science are the best methods of evidence gathering.

You’ll find a tremendous amount of disagreement. So if we’re trying to build a robot that understands its environment, we don’t want to say, “Well, we have to figure out which one of those theories is correct before we can build a robot to understand its environment.” You might say, “We want a robot that’s got a robust capacity to learn, and that would deliver results reasonably approximating what a Bayesian, or an inductivist, or someone using social science statistics would accept. They’re not going to agree on everything. Where there’s overlap, we should try to build a machine that can stay in the overlap; we should try to build a machine that’s not brittle, one that doesn’t make epistemic commitments at the far edge of one or another of these views.”

And so I would say our task is to build a system that’s robust, and that means building into it the fact that we don’t know what this one true theory is. And so we want, as far as possible, to accommodate an array of approaches, all of which have very strong reasoning behind them. You could think that we’re not trying to build an AI system that discovers the one true theory. We’re trying to build one that isn’t going to be dependent upon exactly the target that it hits, but rather could be successful in an array of possible environments.
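
To make the “stay in the overlap” idea a bit more concrete, here is a minimal, purely illustrative sketch, not anything Railton specifies: an agent queries several distinct estimation methods and only commits to an answer where they roughly agree, deferring otherwise. The particular estimators and the tolerance are arbitrary stand-ins.

```python
# Toy sketch of "staying in the overlap": commit only where several
# independent epistemic methods roughly agree. Illustrative only; the
# estimators and the tolerance are arbitrary stand-ins.
import statistics
from typing import Callable, Optional, Sequence

def overlap_estimate(
    methods: Sequence[Callable[[Sequence[float]], float]],
    data: Sequence[float],
    tolerance: float = 0.1,
) -> Optional[float]:
    """Return a pooled estimate if the methods agree; otherwise defer."""
    estimates = [m(data) for m in methods]
    spread = max(estimates) - min(estimates)
    if spread <= tolerance:
        return statistics.mean(estimates)  # safe to act: methods overlap
    return None  # methods disagree: withhold commitment, gather more data

# Hypothetical "methods": a simple mean, a median, and a trimmed mean.
methods = [
    lambda xs: statistics.mean(xs),
    lambda xs: statistics.median(xs),
    lambda xs: statistics.mean(sorted(xs)[1:-1]) if len(xs) > 2 else statistics.mean(xs),
]

print(overlap_estimate(methods, [0.9, 1.0, 1.1, 1.0]))  # methods agree -> pooled value
print(overlap_estimate(methods, [0.0, 0.1, 5.0, 0.2]))  # methods diverge -> None
```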

Lucas Perry: So, adjacent to this, and promoted and discussed by people like Toby Ord and William MacAskill, would be this procedure for existential security as we move into the future, where we align AI systems, whatever that means, and that alignment hopefully doesn’t lock in any values or any particular kind of alignment procedure, but ensures existential security for humanity, such that existential risk just keeps going down toward zero. And then we use this existentially secure situation to do a long reflection on value, on what is good, and on what may or may not be true about ethics. And then, with sufficient consideration, we can engage in populating the stars and optimizing things the way that we see fit. So what is your view on this proposed long reflection?

Peter Railton: Insofar as I understand it, I don’t have any objection to it. I’m not sure I do understand it. One of the things that you said just in passing was that we were going to try to design these systems to behave as we see fit. I myself am not sure I know how it is fit to behave, and I certainly know that I have some mistaken beliefs about that. And I would hope that just as artificial intelligence may help us correct certain of our views in cosmology or in medicine, artificial intelligence could help us correct certain of our views in ethics.

We’ve seen a tremendous amount of evolution in people’s fundamental moral convictions over time. Some have stayed relatively similar. Others have changed dramatically. And we would, I think, do best to think of the artificial extension of intelligence as one way in which we can get a perspective on these issues and situations and problems that isn’t just our own, that won’t have the same priors as our own, and that won’t have the same presuppositions, and they should be included. We should think of these systems as agents.

They will have interests just as we have interests, and the standard would not be what we see fit, where “we” means something like we humans, but what we will see fit as we, the humans and the artificial systems, continue our evolution and our cultural development. And we want to think that the path we should follow is one that leaves open that kind of development, rather than constraining it to fit what happens to be our current set of moral convictions, which again are not shared; there are too many disagreements to think that we could just write down the rules. The long reflection, I think, will also tell us that we need a dynamic picture. We should have some convictions that are more confident, closer to the core. We should have methods and practices that meet reasonable standards of justification and objectivity, and we should be prepared to learn.

I can’t, I’m afraid, think of a way to guarantee against existential risk from artificial intelligence, or even from our own intelligence, which may be more problematic. But I do suspect that the best way to contend with problems of existential risk is to face them as communities of inquirers.

Lucas Perry: All right. So I think you’ve done an excellent job explaining the importance of moral learning and moral epistemology here, given that the ongoing cultivation of more wholesome and enlightened moral values and moral thinking is always on the horizon. Now, you have some perspective on, and research into, moral learning in humans and its importance and necessity. I’m curious now to relate some of that research on moral learning in humans to how AI systems of increasing autonomy may also wish to take on the kind of moral epistemology that infants and young humans have.

Peter Railton: I wouldn’t say that I’ve done research in this exactly. I’ve certainly explored others’ research in this and tried as best I can to learn from it. One of the things that’s impressed me in the literature as it’s evolved over the last couple of decades is how much of the learning of children is accomplished not via explicit teaching but by the children’s own experience. What we’ve learned recently, and this is not from developmental psychology but from various kinds of models of machine learning, is that very complex structures can be learned experientially. There are powerful techniques which we can add to that kind of probabilistic learning in order to create knowledge of general principles, to do something like build a structured understanding of language that would enable a child to speak fluently, to understand what others are saying, and to engage with them, and that does not require either an innate grammar or explicit instruction in language as such. That’s a kind of model of how we also seem to acquire our social normative knowledge.

If you think about the perspective of the infant, one thing that we’ve learned from the animal research is that animals don’t just build a spatial map in relation to themselves. They don’t just build an egocentric map of their environment. They also build grid-like maps that are non-perspectival, and they navigate by combining these two kinds of information, perspectival and non-perspectival. Infants seem to do something similar in learning about learning. They not only represent their relations with individual adults and whether those benefit them or not, but they also seem to construct general representations of whether a given adult is competent or helpful in third-party interactions, and to use that aperspectival information to make decisions about who they’re going to learn from or pay more attention to. They start doing this surprisingly early on. And so at the same time that they’re constructing the ego-centered world, they’re constructing a non-centered representation of the world that includes normative features like reliability, competency, helpfulness, and cooperativeness.

And so the child, in coming to represent the world around them, is constructing representations that have the initial form of moral representations. It turns out to be efficient for learning to be a successful human being that one spontaneously construct representations that have this quasi-moral structure. And that would suggest to me that if machines develop as agents, agents interacting with other agents, agents capable of solving a range of problems, capable of having sustained interactions with humans to solve open-ended problems, they will also find that they do better if they can construct these quasi-moral representations of situations. And so that means that they will be acquiring sensitivity to morally relevant information through the very task of acquiring social competence, linguistic competence, and epistemic competence in a social world.

So there’s a kind of picture here that fits nicely with the fact that we now know that complex models can be acquired through experiential learning. That suggests that there is a promising pathway toward the development of theory of mind, causal inference, and representation of social value from an objective or non-personal perspective. There is an argument for thinking that that’s actually a fundamental core part of our capacity as intelligent beings capable of successful social interaction, which suggests that this is not a peculiarity, and it’s not culturally specific. And so why not use similar methods in our interactions with artificial agents to enable artificial agents to acquire these kinds of quasi-moral mappings?
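
As a rough editorial sketch of the dual bookkeeping described here, an egocentric record of one’s own interactions alongside an aperspectival record of third-party reliability, combined to decide whom to learn from, something like the toy class below could serve; the class name, scoring scheme, and weights are illustrative assumptions rather than anything from the episode.

```python
# Editorial sketch: a learner that keeps both an egocentric record
# ("how did this adult treat me?") and an aperspectival record
# ("how reliable/helpful is this adult toward third parties?"),
# and weights whom to learn from by combining the two.
from collections import defaultdict

class SocialLearner:
    def __init__(self, ego_weight: float = 0.4, third_party_weight: float = 0.6):
        self.ego_scores = defaultdict(list)          # my own interactions
        self.third_party_scores = defaultdict(list)  # observed third-party interactions
        self.ego_weight = ego_weight
        self.third_party_weight = third_party_weight

    def observe_own_interaction(self, adult: str, benefited_me: bool) -> None:
        self.ego_scores[adult].append(1.0 if benefited_me else 0.0)

    def observe_third_party(self, adult: str, was_helpful: bool) -> None:
        self.third_party_scores[adult].append(1.0 if was_helpful else 0.0)

    def _mean(self, xs):
        return sum(xs) / len(xs) if xs else 0.5  # uninformative prior

    def trust(self, adult: str) -> float:
        """Combined score used to decide whom to pay attention to and learn from."""
        return (self.ego_weight * self._mean(self.ego_scores[adult])
                + self.third_party_weight * self._mean(self.third_party_scores[adult]))

learner = SocialLearner()
learner.observe_own_interaction("adult_a", benefited_me=True)
learner.observe_third_party("adult_a", was_helpful=True)
learner.observe_third_party("adult_b", was_helpful=False)
print(learner.trust("adult_a"), learner.trust("adult_b"))
```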

Lucas Perry: So the key thing to draw out here is that there is this distinction between explicit and implicit learning of morality, and you’re remarking on how there isn’t much explicit moral learning in infants and children. Most of this moral learning comes simply from experience and interacting with the world, rather than from explicit instruction about what is right and wrong.

Peter Railton: There’s tremendous cultural variability, within our society and across societies, in how much explicit moral instruction children are given. What’s fascinating is that even in societies where children get very little explicit moral instruction, they nonetheless acquire these capacities. Similarly with language: there are some societies, like upper-middle-class US society, where parents talk extensively with children. There are other societies where parents do not, and yet the children still become fluent linguistic agents. So my thought is that the explicit theory isn’t really the thing that’s doing the fundamental work. Even to understand what parents are trying to do when they give you explicit moral instructions, to understand how to apply those or what they might mean, the child is already going to have to have quite a complex aperspectival representation of the social situation. The thought here is that some places have more explicit theory, some places less, but the results in terms of the development of behavior are very similar.

A good example of this is that around age three or four, children who are given a command by an adult in authority will balk and refuse to perform it if that command violates a reasonable norm against harm. So if a substitute teacher comes in one day and says, “I’m the teacher today, and in my classroom you have to raise your hand before you speak,” children in the classroom will start raising their hands before they speak. If the teacher says instead, “I’m the teacher here, and in my classroom children jab the point of their pencil into the child next to them when they wish to speak,” they balk. They won’t do it. And if they’re asked why, they won’t say, “Well, that’s not the way we do it.” They’ll say, “It would harm the other child.”

And so that suggests that even when a figure of authority attempts to give a norm, in a situation where children can perfectly well understand that there is a scope of legitimate authority (put your hand up before you speak), they will distinguish between that kind of conventional authority and moral authority. And that’s an autonomous action on their part. They’re not getting rewarded for it. In fact, the teacher may send them out of the room or send a note home to their parents, but they balk because they can represent the situation in these quasi-moral terms. And when they do that, they say, “No, this is not a good solution to the problem.” That suggests to me that even if we were to think that children learn by being given explicit instructions by people in authority, they actually independently learn that they can resist that, and they will resist it.

Lucas Perry: Right. So we’re in a position where evolution has cultivated and embedded in us a kind of moral learning, with a certain mix of implicit and explicit moral learning depending on your culture and where you’re from. And as you’re saying, luckily there’s strong convergence in this moral learning: it leads human beings to agree, say in the case of stabbing the other child, on something like a principle against unnecessary harm to another person. That seems, for most human beings, to be strongly converged upon pretty early, unless your environment is particularly pernicious. There is this convergence because of how our moral learning is structured, given evolution, and that moral learning enables in us a kind of moral autonomy that’s there from an early age.

And there is a question of how this moral learning is best structured in, say, both people and machine systems. And then there’s the question of moral learning from the outside: what kind of environment is most conducive to moral learning? Are there insights here that can begin pivoting us toward the relevance and importance of moral learning in AI systems?

Peter Railton: Perhaps so. Actually, there’s a fair amount of evidence that even infants brought up in some very difficult situations will nonetheless develop these forms of prosociality and cooperativeness, partly because these become especially important in those situations, even for solving the most basic problems or meeting the most basic needs. So I wouldn’t think that the mere difficulty of the situation was sufficient to prevent this kind of learning. On the other hand, if the child is given the wrong incentives, they’re also going to learn a whole bunch of other stuff, like: you can’t count on other people, you can’t trust other people.

So put this from the standpoint of artificial agents. We want the artificial agents in our world, whether they’re a companion for an elderly person, an autonomous vehicle, or a telephone answering system, to be sensitive to these kinds of moral considerations and capable of a degree of autonomy. If, for example, there is a system that’s looking after an elderly person and some vital sign of that person is showing a problem, and the person says, “I don’t want to report that. I don’t like having people know this information about me,” or maybe they’re concerned that the doctor will prescribe something that they won’t like, I would hope to have systems which can, in that situation, think, “Is this the kind of thing that I should keep from the physician? It’s the preference of this individual, but this preference may not be in the best interest of the individual in this case.”

And so an autonomous system would be able to make that kind of assessment. It could get it wrong, it could get it right, it could learn from it, but I wouldn’t want a system that would simply take over wholesale the preferences of the person it is interacting with. And of course the same thing is going to be true with self-driving cars and with question-answering systems and so on. They will need a certain amount of autonomy in order to do those jobs effectively. And in order for that to happen, for them to have that autonomy, they’ll have to have their own representations of the moral structures of situations and the capacity to construct those.

I suspect that if we really do want to create intelligent systems that are capable of this kind of autonomous, self-critical and critical moral thought, the way to do so is very much like the way children do so. And in so doing, we run the risk of creating some autonomy: the systems won’t always agree with us. The question is whether we have done what’s appropriate so that, when they exercise that autonomy, their chance of getting things right is at least as good as our chance of getting things right. So you could think of this in a kind of adversarial picture, where you’re trying to see if you can discriminate between the moral judgments of the machine and the moral judgments of the individual, and the machine and the individual could be part of a learning process that improves the machine’s overall generative model of situations.
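
One rough, hypothetical way to read that adversarial picture in code is sketched below; it is an editorial illustration rather than a method Railton or anyone else proposes. A discriminator checks whether the machine’s judgment on a case can be told apart from a stand-in human judgment, and only mismatched cases are used to nudge the machine’s model. The judgment functions, weights, and update rule are all invented for illustration.

```python
# Editorial sketch of the "adversarial picture": a discriminator tries to
# tell machine moral judgments apart from human ones, and mismatches are
# used as a training signal. Everything here is a toy stand-in.
import random

def human_judgment(case: dict) -> float:
    # Hypothetical stand-in: humans weight harm heavily, benefit moderately.
    return max(0.0, min(1.0, 0.7 * (1 - case["harm"]) + 0.3 * case["benefit"]))

def machine_judgment(case: dict, weights: dict) -> float:
    return max(0.0, min(1.0, weights["harm"] * (1 - case["harm"]) + weights["benefit"] * case["benefit"]))

def discriminator(h: float, m: float, threshold: float = 0.15) -> bool:
    """True if the machine's judgment is distinguishable from the human's."""
    return abs(h - m) > threshold

weights = {"harm": 0.2, "benefit": 0.8}  # deliberately mis-calibrated start
lr = 0.05

for step in range(200):
    case = {"harm": random.random(), "benefit": random.random()}
    h, m = human_judgment(case), machine_judgment(case, weights)
    if discriminator(h, m):
        # Nudge the machine's weights toward the human judgment on this case.
        weights["harm"] += lr * (h - m) * (1 - case["harm"])
        weights["benefit"] += lr * (h - m) * case["benefit"]

print(weights)  # should drift roughly toward the human weighting (0.7 / 0.3)
```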

Lucas Perry: So there would be the question of how you structure a system such that it can engage in moral learning in a way that would be broadly endorsed or would satisfy other ethical or meta-ethical principles that we have. That is double-edged, insofar as if you screw it up, then the thing is autonomous and can disagree with you. And the capacity to disagree would either be detrimental, in the case where the system is wrong in its moral learning, or enlightening for us, the world, and the machine, if it were right about morality when we weren’t. How do you think about and balance this risk between the possible enlightenment that may come from embedding AI systems with moral learning and the potential catastrophe if it’s done too quickly and incorrectly?

Peter Railton: Yeah, I wish I had an answer. If you think about it, the existence of humans with malicious intentions means that if artificially intelligent systems don’t have this kind of moral autonomy, they’re going to be very willing servants. So you might say, “Well, there’s a risk on the other side, which is that if they aren’t capable of any kind of criticism or autonomy, then they will be much too willing, much too readily deployed, and much too manipulable by humans whose purposes, I’m afraid to say, are not always benign.” If you were thinking about the problem of raising a child, you would say, “Well, I don’t want to raise a child who simply takes orders. I want to raise a child who can raise questions as well.”

I think our only defense against malicious humans with extremely intelligent systems at their disposal is to try to ally with intelligent systems to create a comparable counterforce. And that counterforce is going to be operating way past our understanding, because it’s going to be in competition with systems that can operate extremely fast and take into account a large number of variables. And so we had better be building systems which, as they get further and further out in this kind of competition, have some kind of core where they are responsive to morally relevant features even at the far extent of their development.

And so if you think about it as trying to build a moral core, then that core can figure in their operation even as they become more and more intelligent. They can use the intelligence to gain information, perspective, and the capacity to understand situations, which can improve their understanding. But if we don’t do something like this, we, and other artificial systems, will really be prey to those who have malicious and manipulative intentions and want to implement them. So I balance the risk partly by thinking that I can’t think of a very good way to defend against the perils of malicious combinations of human and artificial intelligence other than to develop more trustworthy forms of human and artificial intelligence interaction. And that requires according these systems some autonomy and some trust.

Lucas Perry: That makes sense to me. And I think it addresses some important dimensions of the coming proliferation of AI.

Peter Railton: To me, one of the most exciting features of more recent developments in artificial intelligence is that they give us for the first time, I think, a plausible model of intuitive knowledge: knowledge that can be implicit but nonetheless highly structured, contain a great deal of information, and support a capacity to engage in simulation and evaluation. So I would expect that the structure of moral knowledge could be like the structure of our common sense knowledge generally. It could be quite distributed. It could be quite a complicated system, not a system of extracted principles. There might be some general features that are important, and I think that’s bound to be true, and it is true when these systems learn. But we don’t have to think that the kind of competency they would have, if it isn’t something like a system of extracted principles, is therefore undisciplined and therefore lacks power or reliability.

So for the first time, anyhow, I thought here is a picture of how intuitive intelligence might look. And of course we can’t introspect the structure of such knowledge and it does not have a readily introspectable propositional structure. But it is capable nonetheless of carrying and modeling and engaging in quite complex computations, simulations, action guidance, control of motor systems in ways that look like intuitive intelligence. Now I realize we’re a long way from the way the brain actually functions, but even to have these models, it gives us a kind of proof of concept of the possibility of something like intuitive knowledge.

Lucas Perry: Right. So if we’re building AI systems as willing slaves who optimize the preferences of whoever is able to embed those in the machine, there’s no defense in that world against malevolent preferences other than not allowing the proliferation of AI to begin with.

Peter Railton: And we’re already past that point. Enough has proliferated, and there’s enough inequality of wealth and power in the world, to guarantee that further proliferation will take place. It’s already the case that we can’t count on keeping this genie in the bottle, and we obviously don’t want to do so. I’d say we’re now in the phase where we need an active, constructive program of starting to build AI agents that are actively responsive to morally relevant considerations, are good at solving coordination problems, are good at this kind of interaction, and are capable of the kind of insight needed to be potential moral agents.

Lucas Perry: Right. And you argue that as systems inhabit increasingly social roles in society and are constantly interacting with other agents and with the world, it’s increasingly important that they be sensitive to morally relevant features. Without this, malevolent humans, or humans with values that run counter to most of the rest of humanity’s, can abuse or use these systems more freely. And if there is an ecosystem of AIs, purely altruistic systems which are not tuned into morally relevant features can be abused by other AIs as well.

Peter Railton: Yes, that’s right. One thing that’s given me some conviction about this possibility is that the one kind of experiment I do run is the thought experiment, and I’ve been running moral thought experiments in my moral philosophy classes for years. In recent years, I’ve been able to do so using a system that allows students to confidentially record their answers to problems like moral dilemmas or questions about interpreting moral situations or motives. And what’s impressed me over the years is how coherent and consistent these responses are.

And what leapt out, for example, from the familiar trolley problem was that what seems to mediate their moral judgments is a model of the agents involved: a model of what kind of agent would perform an action of a certain kind, and what kind of responses such an agent would receive from others in the community. Would they be trustworthy? Would they not be? And so, instead of thinking there are just these arbitrary differences in preference between throwing a switch and pushing someone off a footbridge, that there’s no real principle there, and that no one’s found a principle to cover these cases, you can think that there’s this intuitive competency people have in understanding situations and characters, what kinds of persons would respond in what ways in which situations, and what it would be like to have those persons in our community.

And once you look at it that way, you can get a tremendous amount of consistency in people’s responses, which suggested to me that they are doing this kind of generative modeling of situations, and doing so in a way that does predict their actual judgments. And if I ask, “Well, why did you make that judgment?” they’ll say, “I don’t know. It was just an intuition.”

Lucas Perry: Yeah. So in the thought experiment that you’re pointing to, a lot of people would flip the switch in the trolley case to divert it to the track where there’s only one person. And then there’s the variant where a person on a bridge is sufficiently large that, if you push them off the bridge, they will stop the trolley from killing the five people on the track. The intuitive response that you’re pointing out is that people are less likely to want to push someone off a bridge than to flip a switch. And you might ask, well, what’s really the difference? In the thought experiment, there’s not much of a difference, but the intuition you’re pointing out, the morally relevant feature that is subtle and implicit, is that we don’t want to live in a world where there are the kinds of people who have the capacity to push people off of bridges.

Peter Railton: In that kind of a setting, yes.

Lucas Perry: Yeah.

Peter Railton: And you can give them a whole array of other scenarios in which the agent would have to do something like pushing someone to a grisly death and where they will agree that it should be done, for example, in situations where self-defense is needed against, say, a terrorist action. And if you ask them, “Well, would you trust an agent who would perform such an action?” the answer is that they would actually have more trust in such an agent. So again, they’re modeling the situation, not in response to this or that minor tweak of the situational features, but in terms of a quite deep understanding of the motivations and attitudes that are involved. And then if you go over to the psychological literature, you find the dispositions to give the push verdict in the footbridge case correlate more with antisocial behavior, with lack of altruism, with lack of perspective taking, with indifference to harm, than with altruism or any kind of a generalized utilitarian perspective. So the psychologists seem to confirm the understanding that my students implicitly had of the situation.

Lucas Perry: What’s relevant to extract here is that there are deep levels of morally salient features that human beings take into account, and that increasingly need to be modeled and understood by machine systems for them to successfully operate in the world.

Peter Railton: Yeah. And to be trustworthy. I’m one of those people who thinks emotion is not a magical substance either, and that artificial systems could have and acquire emotions. And part of the answer to the question of how you build a core that is resistant against certain types of manipulation is to look at how it’s done in humans, and indeed in other animals, and discover that the affective system plays a pivotal role in just these kinds of situations. And so I suspect that’s another avenue of development. Children’s moral emotions undergo a similar kind of evolution through their upbringing, but through their direct experience, because the emotions are there before they’re told what to feel. Indeed, how would you tell a child what to feel?

Lucas Perry: Are there any other points you’d like to wrap up with here, then, on the advantages of reflecting on AIs that are sensitive to morally relevant features?

Peter Railton: I try to be as accurate as I can in understanding what we’re learning from the literature on pro-sociality, for example, both with regard to individual human development and with regard to human communities, going back and looking at hunter-gatherer communities. And even as there have been changes in morality, and I have emphasized that there have been changes over time, the kinds of features that people take to be morally relevant, many of those have been relatively constant. And you can think of the changes in our moral views that have taken place over the years as getting better and better at winnowing out the ones that aren’t really morally relevant, like gender, ethnicity, sexual orientation, and so on, because they can easily become culturally relevant without being morally relevant. Fortunately, we have the critical capacity as agents to challenge that.

Lucas Perry: Yeah, that makes sense. The core importance that I’m extracting from everything is the baseline importance of moral learning in general, and also of understanding and capturing what human normative processes are like, what they entail, and how they unfold. And that participating in a world of humans requires knowledge of both moral learning and the ability to learn morally.

Peter Railton: And this is not saying that people will always behave well, just in the same way that acquiring linguistic competence doesn’t mean people are always going to speak well or truthfully or honestly, but rather that the competency will be acquired. One example that I like is sexual orientation. When I was growing up, it was considered fatal for someone’s social identity to be discovered to be gay. And there was a great deal of belief about the characteristics of gay individuals. In the 90s and so on, a large number of gay individuals were courageous enough to indicate their orientation. And what we all discovered was that the world was full of gay individuals whom we admired, whom we had standard relationships with, who were excellent colleagues, coworkers, friends, and that therefore we had been operating on a bad dataset, because we (and here I’m talking about heterosexuals) had insufficient experience with gay individuals. And so we could believe all kinds of things about them.

So I would emphasize that if it’s a learning system, it’s going to be very sensitive to the data. And if the data’s bad, the learning system is going to have a problem. So I don’t think it’s a magic solution, but I think the question to ask is, so how do we build on this? How do we provide more representative experiences and less biased samples so that the learning can take place and not pick up cultural biases?

Lucas Perry: Yeah, those are really big problems that exist today, and a lot of the solution right now is human beings having to do a lot of hard work on datasets. We can’t keep that up forever. Something else is needed. I think this has been instructive about the importance and structure of moral learning, and I want to pivot back into our discussion of metaethics and your conversations with Derek Parfit, and what your metaethical view is, and how views on metaethical epistemology or metaphysics may bring to bear intuitions about what moral learning is like or what it might entail. It’s Derek Parfit, right, who has essays on Does Anything Really Matter?

Peter Railton: Yes.

Lucas Perry: So I guess that’s the question here then for this part of the conversation: does anything really matter? So you were in conversation with Derek Parfit, and it seems like your views have converged, and are different in ways from Peter Singer’s, though it seems like you are all realists. Could you unpack and explain a little bit about the history here and what went down between you, Parfit, and Singer?

Peter Railton: Yeah, sure. I have to warn those who are listening: buckle up, this is going to have to be a philosophy talk, but I’m sure that many people have these philosophical questions themselves. So let’s just begin with the title that Parfit chose for his master work, On What Matters. And you might say that mattering is the core notion of value: that if you had a universe full of rocks, it would not matter to the rocks what happened. It would not matter to the rest of the universe what happened. And so there wouldn’t be any positive or negative value in that universe. Introduce creatures for whom something matters, even if it’s just as simple as nutrition or avoiding pain, and then you can begin to talk about states of affairs as being better or worse than one another, about improving or degrading the situation or the characteristics of the world.

And so mattering is core to the idea of value. And once we grasp that, we begin to realize that value is not some new entity in the world. It’s not something we add to the world. Once you have mattering, then things will have value, and they’ll have positive and they’ll have negative value. And of course, for different creatures, different things will matter. And learning what matters to a creature is understanding what would be good or harmful to that creature, and this of course includes humans. So I was very moved, when I was on a committee looking into questions of animal research, to learn that the veterinarians had learned a lot about what situations animals preferred and did what they could to try to give them situations in which they were happier, more lively, more disposed to cooperate and learn. And that means that they were trying to learn something about what matters to a rat.

And we now know a fair amount about what matters to a rat. Company matters, exercise, the capacity to engage in activities, to build nests. These things matter to rats, and so we can give rats a good or a bad existence by thinking about, well, what does matter to rats? Now, what matters to rats is different from what matters to humans, but the basic idea is the same. So there’s value there, and it’s thanks to the existence of creatures for whom something matters that value comes into existence in the world. That’s a perfectly naturalistic perspective, treating value as something that is realized by natural states of affairs in the world. Now it turns out that even someone who’s an arch non-naturalist like Derek Parfit agrees that pain is bad, not because it has the non-natural property of being disvaluable, but because of what it’s like in its natural features; those features suffice to make the pain bad.

And if they didn’t suffice to make the pain bad, there would be no value feature we could sprinkle on it that would make it bad. But given that it has those features, there is also no value feature we can sprinkle on it that will make it good. And so Parfit and I can agree that non-naturalism is important in ethics, but not because the world is populated with non-natural entities like values. That’s a widespread confusion. It reifies the notion of value, as if it picked out some kind of new domain of entities. And naturally, once you’ve done that, it becomes very unclear how we learn about these entities and what relationship they have to the natural world. If instead you think, no, value is something that is brought into existence by certain relational features in the natural world, then you can say, “Ah, that’s common ground between Derek Parfit and myself.”

And if Derek’s explaining what’s bad about pain, he’ll give the same explanation that I would give about what’s bad about pain. So we agree on that. As for the badness in the case of pain: “pain” is really used for two different things. It’s used for certain types of physical sensation, and it’s used for suffering. That physical sensation isn’t always suffering. So for example, when you put hot sauce on your food, you fire up pain circuits, but you enjoy that. You may seek the burn of exercise. And so there are times when the physical sensation of pain is sought and liked, desirable. It’s part of good experiences. It shows that pain can matter in different ways. It’s the mattering where the value resides, not in the physical sensation just in itself. So the mattering is a relationship between a subjectivity, an agent, and the physical sensation, and it could be positive or negative in a given case, but the value resides in that relationship.

Lucas Perry: But they’re just two contents of consciousness, right? There is the content of consciousness of the sensation of pain on my arm if I scratch it, and I might derive another sensation from that sensory pain, that is, pleasure. Wouldn’t the goodness here need to come from this higher-level, more pristine pleasure that I gain from the pain, which is more of an emotion, rather than from the other sensation, the other content of consciousness?

Peter Railton: I think you’re right to bring in higher-level mental states as well, because part of the reason why pain in certain circumstances is desirable is because of the representation that you have of it. And this is true with many features of the world: it is because you understand them in certain ways that they produce in you the positive or negative experience they do. And if you ask a psychologist, the positivity and the negativity in the mind does not reside in the impulses of the pain system or the pleasure or reward system. It resides in the affective system, which encodes value as positive or negative. And it encodes as well the behaviors and the responses that are characteristic of positive and negative value: positive is approach, negative is withdrawal. Fear involves a certain distinctive suite of responses. Anger involves another distinctive suite of responses. But the affective system is where the value is encoded, and that’s the common currency of value in the brain.

So that’s where we should be looking to discover it. And it’s the affective system, which is the root of our emotions, whether they’re aroused emotions like anger or fear or non-aroused emotions like assurance and trust. That system is a system which encodes this relational feature of value. You’re quite right to think that we should move up a level, and in doing so, we encounter the affective system and its properties. And it’s a system that we share with all of our mammalian relatives and with other species as well. It’s evolutionarily a highly conserved system. And that’s because it is the core of valuation, and valuation is a core activity of living creatures, because they’re going to base their actions on value assignments. You’re right to think that in the mind you will have tiers and that you need to find the right level in order to understand what value or disvalue looks like in the mind.

Lucas Perry: So there’s the view where some content of consciousness is clearly seen as bad given its nature. If some state of consciousness is like something, from a consciousness realist perspective, and it is also natural because it’s part of the natural world, it’s a physical fact and there are facts about consciousness, then value comes in from what it’s like to be conscious. Whereas it seems like you’re bringing in the more computational and physical side of things, like an evaluative affective system, which may not be separate from how things are experienced in consciousness, but I feel confused about these two different levels and where the ‘what matters’ comes from.

Peter Railton: Well, yes, you’re quite right. There are views about value in which it’s only conscious states that could have value or disvalue. I don’t particularly hold such a view. I think that we are intrinsically concerned with, and that there is intrinsic value in, non-conscious states. And that’s why I wouldn’t sign up for the experience machine. The experience machine could supposedly provide an unending stream of positive conscious states, but why wouldn’t I sign on for it? Well, because the actual content of my values is not that I have certain conscious states; it’s that I have certain relations with people, with the external world, that I have a certain engagement with things that have a consciousness and that matter. And so I wouldn’t agree that the only place, the only locus of value or disvalue, is conscious states.

Lucas Perry: So then from a cosmological and evolutionary perspective, there has been the development and arising of sentient creatures on this planet who have ever-complexifying neural algorithms for modeling themselves and the world and making predictions and interacting with it. And amongst these evolved architectures are evaluative ones, which take the shape of valuing or disvaluing certain aspects of the world. And so that is enough for you for talking about intrinsic value. You feel like you don’t need to bring consciousness into it. You’re fine with just talking about the computation.

Peter Railton: Oh, I think consciousness plays a role, because one of the good-making features is a positive state of consciousness. It’s just not the only one. And so there are differences in the world that would not show up as differences in conscious states, which is what the experience machine is meant to show, but which would nonetheless constitute things that matter in the sense of mattering that we were just describing: namely, these are objects of concern, love, affection, and interest on the positive side, and objects of dislike, disvaluation, and disapproval on the negative side. I don’t think there’s any reason to think that only conscious states can be loci of value, but it may be that consciousness plays a role.

Lucas Perry: So what are these other good-making features, and why are they good-making?

Peter Railton: Well, take, for example, a theory like a preference satisfaction theory. I would prefer, other things equal, that after I’m gone, my children have lives that they find meaningful. Now is that because I want to have the positive experience of thinking that their lives are meaningful? No, I want them to have those lives. And so it’s part of the content of my informed preferences, preferences, let’s say, that would survive information, that the world be such that my children have a certain kind of life. And you say, “Well, doesn’t the meaningfulness of their life just consist in their conscious states?” And I’d say, “Well, no, not at all. I wouldn’t think that a life in an experience machine would have the same meaning as a life with a similar stream of conscious states that was lived in engagement with the world.”

And so when I want them to have meaningful lives, I want them to have lives in which they act in ways that matter to them, and in which they do things that matter to themselves and to others. And their intrinsic preferences, like my intrinsic preferences, aren’t just going to be for conscious states. So it may be that you need something like preference or interest to get value, mattering, off the ground, but the content of what interests us, or the content of what our preferences are, won’t just be conscious states. So you can’t satisfy my preferences just by giving me conscious states, for example.

Lucas Perry: So I don’t share that intuition with you. I still don’t understand why you feel that something like a preference is good-making. I guess that just comes down to intuitions. I mean, someone could ask me why I think consciousness is the only thing that is good-making, but, I don’t know, what is a preference? It’s a concept about some computational architecture that prefers some state of the world over another. But when you pass away, for example, your preference goes away. So why does it need to be respected still? I mean, we’re getting into some deep waters here, but is the short version of this that when you just do these philosophical thought experiments, your intuitions aren’t satisfied by consciousness being necessary and sufficient for value?

Peter Railton: Well, with all of our knowledge, whether it’s knowledge of value or of the external world, we can push it back to a point at which, again, we can’t give some further derivation of the assumption that we’re making. And so my thinking here is that it seems to me extremely plausible that the one intelligible notion I can get of something like value is that there can be a subjectivity such that states of affairs can go better or worse for that subjectivity. And then value would consist in that which makes the states of affairs better or worse for that individual. And then I ask myself, well, does that satisfy our concept of value?

Well, value should have various different features, and we can list those. It should be something that, when we understand it, is intrinsically motivating. It should apply to the sorts of things that we ordinarily identify as being values. It should capture a certain role in the guidance of action. It should be something like a goal in action. We should see it as structuring the behavior of individuals. And when I look at all those conditions, I think, yeah, this satisfies those conditions. It’s not a proof. It’s just saying that if we lay down the conditions that we would give for something satisfying the concept of value, these states do indeed satisfy those conditions, and many other candidate states don’t. But I can’t tell you, for example, that you shouldn’t have some other concept of calue instead of value and ask what would satisfy calue, in the same way that I can’t, in the case of knowledge of the external world, give you a derivation of the importance of knowledge as opposed to shmowledge. You can operate with the concept of knowledge and see what it requires and see whether it would apply to what we are doing.

But that’s not a proof that there isn’t another scheme of shmowledge of which the same thing could be said. So that’s where we get down to these fundamental assumptions: can they be non-arbitrary? Well, they can, for example, if, when applying them, you can be put in a situation where you give them up. A concept that we had, that we thought we were happy with, turns out to be confused. Or it turns out that the only things that would satisfy the concept are things which we ordinarily think the concept doesn’t apply to. So we think there’s a mismatch between the criteria and the paradigm cases. So it’s not arbitrary if you’re willing to use it critically, but it can’t be proven.

Lucas Perry: Okay, a bit of a side path from where you were going with Parfit. I was curious about what you really meant by how you two were agreeing about value being some natural thing, instead of having to sprinkle value on the world.

Peter Railton: The way I would put it, the disagreement that I have with Parfit is a disagreement at the conceptual level. Initially, at any rate, it looked like we had a conflict of opinions, because it looked as if he was committed to there being in the world these non-natural features, and that they somehow explained the role that value has in our lives. And I couldn’t understand what that would mean, but he was perfectly content to say, “No, the good-making features are these natural features. They explain the role that value has in our lives, but our concept of value is a non-natural concept.” And what does it mean to say that? Well, the same situation, the same configuration of matter, could be described with physical concepts, chemical concepts, biological concepts: oh, it’s an “organism.” It could be described in social concepts: it’s a person. Any given situation can be characterized in various different conceptual systems.

And it can be argued, plausibly, that you can’t reduce, for example, the conceptual system of biology to the conceptual system of particle physics. Because biology deals in functions, reproduction, metabolism, and so on, and there’s no one-to-one correspondence, no easy correspondence, between those functions and any particular physical realization. You could have living beings made out of carbon. You could have living beings made out of silicon. So the concept of a living being, the concept of an organism, is a concept of biology. It’s a way to organize the description of the world and explanation, and biology is conceptually not reducible to physics. That doesn’t mean biologists can ignore physics, because they think, most anyway do, that what satisfies their biological concepts are physical systems. And so it’s an important question what kinds of physical system would satisfy concepts like self-replication and so on.

And so they do microbiology, and they study the physical systems that do satisfy these concepts. But the point is that the conceptual system has a degree of autonomy from the physical system, and that even discovering that self-replicating molecules have a certain chemical composition in this world is not discovering that the concept of a self-replicating organism is simply a physical concept. Parfit has the same view about normative concepts. He and I agree about what pain is and what makes pain bad, but he says you could describe a situation either, as you were saying, in terms of some physical or biochemical processes, or you could describe it as bad, or as good, or as something that ought not to exist. And that’s another level of conceptual characterization. And his thought is that that level of conceptual characterization can’t be reduced to the concepts of the natural order.

So there is an element in normative concepts that’s always beyond what is translatable without loss of meaning into the natural. Once one recognizes that, then you can be as naturalistic as you like about the nature of value and also believe that the concept of value is a non-natural concept, just as you can be as physicalist as you like about the fundamental furniture of the universe and still believe that the biological level of description is not reducible to the physical level of description. You could say the same debate went on when people were thinking about life. In the 19th century, we find people thinking, well, there’s got to be this special elan vital or spirit or something like that; you can’t just take a bunch of matter and put it together and have life. By and by, biochemistry develops, and people realize that actually you can put a bunch of matter together and have life.

And the same thing is true with value. You don’t need some value-vital, some kind of further substance, to add to the world. You can put together the natural stuff of the world and get value. Once you frame it that way, then Parfit and I actually agree. Because when he talks about the irreducibility of the normative, he really means, should mean, and I think agrees that he means, a conceptual irreducibility. And once we establish that, then I can say, “Yes, I agree with you, normative concepts aren’t definable entirely in terms of non-normative concepts; they involve some idea of ought or some idea of value that isn’t present in the non-normative.” But my interest as a philosopher and metaethicist is an interest in what kinds of natural conditions satisfy these concepts and how that makes it possible for us to have knowledge in a normative conceptual scheme, like ethics or the theory of value. So that’s where I do my work. His work is done in carefully distinguishing the concepts.

Lucas Perry: So there is reality as it is, there is the base reality, base metaphysics, call it ultimate reality or whatever, and all human conceptualization supervenes upon that because it’s couched within that context and is identical to it. Yet that conceptualization you argue is lossy with respect to ultimate reality, because it doesn’t necessarily carve reality at the joints, but that conceptual structure is still supervenient upon it. And at the level of conceptualization, there are facts about the world that can be satisfied or not satisfied that will make some proposition true or false.

Peter Railton: Yeah.

Lucas Perry: So you’re arguing that value isn’t part of metaphysical bedrock, but metaphysical bedrock creates neural architectures that create concepts that contain within them necessary and sufficient conditions for being satisfied. And when agents are able to gain clarity with one another over concepts and satisfying necessary and sufficient conditions, then they can have concrete discussions about ethics.

Peter Railton: That would be one common basis. And so the image that Parfit gave in his first volume of On What Matters was that he thought, ultimately, you could see the utilitarian and the Kantian as climbing two different sides of the same mountain, so that they would eventually meet at the summit. I suggested to him, well, in metaethics, the same is the case: I’m a naturalist, I’m climbing one side of the mountain. You’re a non-naturalist, you’re climbing the other side of the mountain.

But as our views develop, and as we understand better the different elements of the views, then actually they’re going to come together such that, as we approach the summit, we aren’t really disagreeing with each other. And he accepted that picture. I would only add to what you were saying by way of summary that our concepts typically don’t give us necessary and sufficient conditions; they are more open-ended and open-textured than that. And that’s part of why we can have unending debates about questions like value and so on.

But you mentioned truth, and one might say truth is another very good example of a concept that’s not reducible to a concept of physics. Because truth presupposes representations, and representations are characterizable not in terms of their physical constituents, but in terms of their role in thought. And so people who are skeptical about value because they say, “I don’t see where value is in the world,” should be equally skeptical about truth. Because truth is not some new substance we add. If there’s a representation and it accurately reflects the world, then we have truth. So truth, again, is a relational matter between a subject, something like a representation in this case, and a state of the world, and it’s when that relationship obtains that you get truth.

Lucas Perry: Right, but that’s truth in the epistemological, agent-centered sense. But then there’s the more metaphysical view about truth, where there are mind-independent facts, and they’re true whether or not we know anything about them. Maybe the same distinction is important to make here. There are potentially moral truths within the conceptual framework that we’re participating in, and it feels weird to me to call that moral realism. But then there’s another claim where there are mind-independent truths about morality, like that there’s an intrinsic quality to suffering that is what bad means. Does that make sense?

Peter Railton: I think you’ve put things in a very good way. One of the features of the setup that I was describing is that it’s very easy to slide from a position that, for example, whenever a value judgment obtains, then some or other natural state obtains, to thinking that the natural state actually is the normative fact, rather than something that satisfies the concept. And so you could have the concept of the good, and it could be that there are eternal truths about the good, I suspect. That’s a reasonable candidate, just as there can be eternal truths in mathematics. The claim isn’t that the conceptual domain is somehow identical with the natural domain. It supervenes, but it’s not a relationship of identity. And the language in which those claims are stated, and the way in which we adjudicate them, might be, as in the case of mathematical claims, quite a priori.

And that’s where Parfit’s view and mine differ, and Singer’s likewise, because they think you can do this a priori in a way that I don’t think you can. But that’s a question in epistemology. It doesn’t require a different metaphysics in order to have that view. So you can be a physicalist and believe that there is mathematical truth. And that’s because, for example, you think that mathematical truths are true via a set of axioms, definitions, and rules of inference. And so they are made true not by distributions of molecules, but by logical relationships that can be specified in terms of axioms and rules.

Lucas Perry: Okay. So I feel a little bit confused still about why your view is a kind of moral realism if it requires no strong metaphysical view, whereas other moral realists that I’m familiar with hold a strong metaphysical view about suffering in consciousness and joy in consciousness being the intrinsic valence carriers of value.

Peter Railton: Well, I’m not sure about the last part of your question. I’ll have to think about how to interpret that. But am I a realist about organisms if I believe that the concept of an organism is distinct from any particular physical instantiation? Am I prevented from being a realist about organisms because I think the organismic level of description is irreducible to the physical level of description? You see: no, actually, because you think that the concept of an organism is satisfied by some physical system, you’re a realist about organisms; you think there are organisms. To me that’s a perfectly realistic position. A non-realist would say, “Well, I guess there aren’t any organisms then, because they’re not part of the fundamental furniture of the universe.”

And I’d say, “Think of what an organism is. It’s not a piece of furniture; it’s a functionally organized arrangement. And because it’s functionally organized, it doesn’t correspond to any particular material something-or-other. And for there to exist organisms is for there to exist conditions such that the concept of an organism is satisfied.” And that’s of course what most biologists believe. And so most biologists are realists about organisms.

Lucas Perry: If your intuitions changed about the reducibility of higher levels of knowledge to lower levels of knowledge, how would that affect your moral views? For example, there are views that say concepts in biology, about reproduction and organisms, and concepts like life, are lossy when it comes to the actual furniture of reality, that they don’t actually completely describe how things are, and that the concepts don’t carve reality at the joints. So they provide predictions about the world, but it all should be, and is, best described only by particle physics, for example. One might say an organism is a concept, though it does not carve reality at the joints, and the best understanding of it is at the level of particle physics. So taking a realist position about conceptual fictions seems dubious: it’s dubious to define something as real whenever some concept I have is satisfied.

Peter Railton: What you’re pointing to is a very interesting problem. I would say that biological concepts do carve things at the joints, because the biological level of organization yields a whole systematic set of laws and principles that turn out to be true in our universe. It’s far from being an arbitrary stipulation or a fiction that something’s an organism, and a tremendous amount follows from things being organisms and self-replicating and so on. And we have very elaborate theories about populations, the mathematics of populations.

Lucas Perry: And are those laws, though, not reducible to other laws?

Peter Railton: That’s the idea: reducibility is the wrong concept to have here. Because the laws of population are laws that have to do with variables that aren’t fundamental variables of physics. They have to do with, for example, issues about reproducibility, availability of resources and so on, and what counts as a resource depends upon the nature of the organism. So there’s a level of organization; similarly in chemistry.

Lucas Perry: But what if those variables are just the shape of lower level things?

Peter Railton: Well, they won’t be, because if there were self-reproducing silicon-based organisms, they would obey similar population dynamics. Those principles govern functional organizations. So once you have self-replication, mutation, differential selection and so on, you’ll get certain principles, whatever physical realization there is.

Lucas Perry: But it really doesn’t make sense to me that these higher level laws would not be completely supervenient on fundamental forces of nature.

Peter Railton: Oh, they’re supervenient, definitely. But supervenience does not imply reducibility, that’s really critical in this domain. And again, this is the problem that I think has led to a lot of confusion in this domain. A feature that is supervenient upon fundamental physics is perhaps part of a system of laws that provides joints in nature. Because if you went to another world and you found a form of life that had these basic features of self-replication, mutation, selection, you would expect to find similar population dynamics to Earth. And that similarity is a biological similarity. It’s not a similarity in terms of the basic physics of the situation. The physics are the same, but the constitution of these organisms is very different. And so you couldn’t infer from understanding just the physics that there would be this biological regularity. That’s what it means to say that it’s supervenient, but not reducible.

Again, I think you can be a realist about organisms because organism really is a concept that carves nature at the joints. And so we would be able to export our theory of organisms to worlds in which carbon was not abundant and self-replication was built out of something else. And that’s a way in which nature is lawfully organized, supports counterfactuals, supports explanations. And so that’s a way of thinking about what it means to say it’s supervenient, but not reducible. And I think the same thing is true with moral distinctions. And that’s why they’re learnable. That’s why infants can learn moral distinctions, even without being given moral concepts.

Lucas Perry: Yeah. So that’s why I’m pushing on this point. Now that makes more sense to me in terms of moral statements, but when trying to make physical claims about how reality is, I feel more confused here, and maybe it’s messing me up in other places. If all of the causality is governed by fundamental forces, then surely all concepts that try to map out the world are mapping a world that is being governed by fundamental causality. All the laws that are derived at higher levels must be completely reducible to, and supervenient upon, or lossy to some extent with relation to, the fundamental causal forces. I don’t think the claim is that, for example, the principles of biology and life are causal in themselves. They’re more like laws that we use to make predictions, but predictions about systems that are running on the fundamental laws of nature. The complex aggregation of those fundamental laws must come close, in some way, to those laws of biology. What is wrong with this picture?

Peter Railton: Well, there may be nothing wrong with it. I think the laws of biology are not just descriptive. I think they support explanations, and they are used not just to redescribe reality but to construct theories that show structure in reality, extremely important structure, that would not be visible if you were allowed only the predicates of fundamental physics. I guess I would say, from the standpoint of explanation, biology affords many explanations. Suppose somebody wants to know why the material that happens to be in my body is where it is right now. Well, there is some very complicated story at the level of fundamental microphysics following all of these molecules, but it doesn’t look like anything at all. Whereas if you can give an explanation in terms of evolution and social dynamics as to how these molecules got here, you may have a much more compact and comprehensible grasp of the world.

So I think biology affords us distinct modes of understanding and explanation, so does psychology, so does chemistry. One of the features of knowledge is that reality is organized at various levels in systems that are lawful systems and that support explanation and intervention and causation, but there’s no reason not to call this causation.

And so if somebody is describing the spread of the pandemic and they say, “Well, it’s partially caused by the transmissibility of the virus, which is higher than that of the bird flu,” we’ll say, “Yes, that’s a causal factor in the spread of the virus and in why these particular molecules are located in the world where they are.” And that’s a very powerful explanation. And if someone were just giving you a readout of the positions and momenta of all the different molecules of the world, you would not see this pattern and you’d have less understanding of the situation.

Lucas Perry: So tying this into your metaethics here: our ethical concepts causally supervene on the fundamental forces of physics. We’ve inherited them via evolution, and they run on physics. But these concepts do not reduce to natural facts. There’s no goodness or badness built into the fundamental nature of the universe. These concepts are merely causal expressions of the universe playing out. And within the realm of this conceptualization, you can have truths about morality in the same way that you can have truths about biological organisms. And there’s a relationship here between what you might believe to be true about conceptualization and science; the epistemic status of concepts in general would also bear on how one might think about the epistemic status of moral concepts.

Peter Railton: Yeah. Or thinking in terms of algorithms and systems. The systems-theoretic perspective gives you a lot of very well-organized understanding as you grasp the algorithms that are at work and so on, but algorithm is not a concept of fundamental physics.

Lucas Perry: Right. So it’s your view that moral facts, moral claims within conceptualization, hold the same epistemic status as claims about algorithms and biological organisms, and claims that we might make in things like chemistry or biology, which are at a higher level of abstraction than particle physics.

Peter Railton: Yes. And that’s what would be called the naturalist position. And it’s why someone like Peter Singer is a non-naturalist. He thinks the epistemology of moral judgments is a priori, and similarly with Derek Parfit, and they think it’s an intuitive epistemology, and they think that the two go together because they believe in something called rational intuition. And I’m inclined to think of intuition the way we were describing it earlier on: namely, it’s a complex body of knowledge that isn’t organized into simple principles, but that replicates an important set of morally relevant relations. And that’s really what intuition is. When we have these intuitions, it’s that kind of knowledge, the way there are grammatical intuitions, or knowledge like that. So we disagree about the epistemology, but with Parfit at least, and I’m not sure what Peter Singer would say, our disagreement’s not metaphysical.

Lucas Perry: Right. I think the only place it seems like there would be space for a metaphysical disagreement would be in there being a kind of intrinsic good quality to pleasure and intrinsic bad quality to suffering that existed prior to conceptualization.

Peter Railton: I don’t think anything about the badness of pain depends on our concepts. I think that pain was bad in the first organisms that felt pain. And if humans had never evolved, and the concept of pain had never come into existence, and the concept of bad had never come into existence, it would still be bad for these organisms to suffer roasting to death in a world that desiccated, or something like that. Our concepts allow us to talk about these features. The word concept comes from two parts: con, meaning with, and cept, from a root meaning grasp. And a concept is what we use to grasp features of the universe, not to create them.

Lucas Perry: So there already would have been some computational structure that would have evaluated something as bad?

Peter Railton: It would have made it the case that this was bad for that organism. Yes, that’s right.

Lucas Perry: And that doesn’t bring consciousness into it or anything, that could be strictly computational?

Peter Railton: Thus far yeah. And there’s a big debate about whether states have to be conscious in order for them to have disvalue. And one of the reasons for thinking about that is because we’re thinking about the animal kingdom and we aren’t sure how deep into the animal ancestry of humans consciousness goes. I myself don’t think that consciousness is essential, but I recognize that that’s one position among many.

Lucas Perry: Yeah. I happen to think that it is. But I would like now to wrap up and integrate this discussion on metaethical epistemology into the broader conversation. So we’ve talked here a lot about what your metaethics is and the epistemology that it entails, and also that of other people. That is related to moral learning, of course, because a proper moral epistemology is the vehicle by which one would obtain normative or metaethical moral knowledge. So how do you view or integrate the thinking that we’ve gone through here with the question of AI alignment? On one hand, if we were Singer or Parfit, we might think that if we just build something that’s sufficiently rational, whatever that means, the axioms of morality would be intuitively accessible to such a machine system. It seems strange to say, but they would be intuitively accessible just as the axioms of mathematics are. Whereas with your view, I’m not quite sure what happens. So maybe you could explore this all a little for us.

Peter Railton: So if I could have my wish here, it would be that by getting an understanding of the metaethical landscape, and which problems are metaphysical, which ones aren’t, which problems are epistemic and how they are tractable in various ways, the temptation towards skepticism in morality would at least be a bit weakened. People would see how it would be possible for us to have moral knowledge, of course imperfect and evolving. They would understand, therefore, how it could be possible for other systems to have moral knowledge. And we could talk concretely about the kinds of processes by which infants, for example, acquire moral knowledge, and think about how systems could go through similar kinds of processes and acquire a core moral competency, as I think they can. Skepticism about morality, I think, has for a long time plagued the discipline, because it’s been hard for people to see how we could have something like moral knowledge.

And that’s been tied up with a picture of value, and the nature of value, as an unusual kind of something or other, as something that’s not part of the way in which the world is put together. And so how would we ever have any kind of knowledge of it? And since we can’t derive it from self-evident axioms, we have to be subjectivists. I would hope to have had at least a small effect in making that a somewhat less plausible position, because I do think there’s an important constructive project here, and it is already underway in developmental psychology. For example, people working around Josh Tenenbaum are working on this as a learning question. There’s a lot of promise in understanding intuition in terms of deep learning and understanding moral competency in Bayesian terms. So I think there’s a tremendous future for coming to have theories of moral learning. I’m glad psychologists have started using this phrase, and I hope that by giving a theory of moral learning that sorts out the metaethical landscape in the ways we’ve been describing, all of this comes to seem more plausible.

Lucas Perry: Right. So in summary, then, the feeling that you have here is that people hold and walk around with common-sense intuitions about normative and metaethical thinking and what those things are; that there is a more solid foundation for whatever moral realism might be in understanding around these issues; that there can be strong convergence and formalization around moral learning; and then the integration of moral learning into machine systems, which would make them sensitive to morally relevant features and thus make them socially, societally, civilizationally competent, able to exist in an ecosystem of agents with more or less altruistic, malevolent, and benevolent values.

Peter Railton: And that we will need such systems badly as allies in the years to come. If I could just add one thought: someone’s going to say, “But don’t we have to have some priors about pain being bad, about positive-sum interactions being good?” Now, you have to have some priors in order to engage in moral learning, and I would say we have to have priors to engage in any kind of learning. And what rationality and learning consist in is how we use subsequent experience and evidence and argument to revise the priors, and go on to create new priors, and then apply evidence and argument and reasoning to those. That’s what rationality is; it’s not starting from scratch with self-evident principles that we just don’t happen to know. That’s what Bayesians say rationality is with respect to science and the gathering of evidence. That’s a picture of rationality in which we can be rational beings, and we can be more rational the more we are able to subject our priors to critical scrutiny and expose them to more diverse and representative forms of evidence and reasoning.

And I think the same thing is true with moral priors. Rationality in the moral case doesn’t consist in seeing a self-evident set of axioms, because I don’t think one can, but in starting with priors and then learning from experience, argument, and deliberation together. In that sense, rationality in the two spheres is essentially very similar.
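
To make the prior-revision picture Railton is gesturing at a bit more concrete, here is a minimal, purely illustrative sketch of Bayesian updating. The hypotheses, the evidence, and the numbers are hypothetical stand-ins invented for this sketch; nothing here is a model Railton proposes, and real Bayesian accounts of moral learning (such as the work he mentions around Josh Tenenbaum’s group) are far richer.

```python
# Toy illustration of revising priors in the light of evidence, in the Bayesian
# spirit Railton describes. All hypotheses and probabilities are invented.

def bayes_update(priors, likelihoods):
    """Return posterior probabilities given priors and the likelihood of one observation."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# A learner starts with even priors over two crude hypotheses about a type of act.
priors = {"acts_of_this_kind_are_harmful": 0.5,
          "acts_of_this_kind_are_benign": 0.5}

# One observation: acts of this kind reliably produce distress in others.
# Likelihood of seeing that evidence under each hypothesis (illustrative numbers).
likelihoods = {"acts_of_this_kind_are_harmful": 0.9,
               "acts_of_this_kind_are_benign": 0.2}

posteriors = bayes_update(priors, likelihoods)
print(posteriors)  # the "harmful" hypothesis now carries most of the probability
```

The only point of the sketch is structural: the priors are starting points, not self-evident axioms, and subsequent evidence, argument, and deliberation are what revise them.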

Lucas Perry: All right, Peter, thanks so much for your perspective here in sharing all of this. Is there anything else here, any last thoughts you’d like to say, anything you feel unresolved about?

Peter Railton: I would like to have more engagement between philosophers and the AI alignment community. I think it’s one of the most important problems we face as a culture, and it’s an urgent problem. And it’s painful to me that philosophers are not as alive to it as they should be. I just want to invite anyone who’s out there working on the problem: please, let’s try to make contact, not necessarily with me, but with other philosophers. And let’s try to build a constructive community here. Because for too long philosophy has been in the situation of folding its arms and sort of poo-pooing artificial intelligence or artificial ethics. And that view does have merit in many areas; AI gets over-hyped a lot, as AI people will tell you.

But there’s this other side, which is what has been accomplished, what has been constructed, what’s been shown to be possible, and how we can build on that. And there I think there’s a lot of opportunity for constructive interaction. So that would be my parting thought: this is a time when urgent work in this area is needed. Let’s bring all the resources we can to bear on it.

Lucas Perry: All right, beautiful thoughts to end on then. If people want to follow you on social media or get in contact with you, how’s the best way to do that?

Peter Railton: Well, I’m not on social media. The best way would be to reach me via email, which is prailton@umich.edu. I get a lot of email. I can’t promise I’ll respond quickly to emails, I wish I could. But I don’t want philosophy to lose the chance to be part of this important process.

Lucas Perry: All right. And if people want to check out your papers or work?

Peter Railton: I’m supposed to be building a website. I may succeed in doing so. Many of the papers are available. People have put them up in various ways. So if you go to Google Scholar, you can find many of my papers. And I also want to put in a plug for those philosophers who have heroically been working on these questions. They’ve done a great deal of work and we should be grateful for what they’ve accomplished. But yes, if people want to find my work, if they can’t get access to it, let me know and I’ll make the papers available.

Lucas Perry: All right, thanks again, Peter. It’s been really informative and I appreciate you coming on.

Peter Railton: Great. I appreciate your questions and your patience. This has been a very helpful conversation for me as well.

 

 

End of recorded material

Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

 Topics discussed in this episode include:

  • Inner and outer alignment
  • How and why inner alignment can fail
  • Training competitiveness and performance competitiveness
  • Evaluating imitative amplification, AI safety via debate, and microscope AI

 

Timestamps: 

0:00 Intro 

2:07 How Evan got into AI alignment research

4:42 What is AI alignment?

7:30 How Evan approaches AI alignment

13:05 What are inner alignment and outer alignment?

24:23 Gradient descent

36:30 Testing for inner alignment

38:38 Wrapping up on outer alignment

44:24 Why is inner alignment a priority?

45:30 How inner alignment fails

01:11:12 Training competitiveness and performance competitiveness

01:16:17 Evaluating proposals for building safe and advanced AI via inner and outer alignment, as well as training and performance competitiveness

01:17:30 Imitative amplification

01:23:00 AI safety via debate

01:26:32 Microscope AI

01:30:19 AGI timelines and humanity’s prospects for succeeding in AI alignment

01:34:45 Where to follow Evan and find more of his work

 

Works referenced: 

Risks from Learned Optimization in Advanced Machine Learning Systems

An overview of 11 proposals for building safe advanced AI 

Evan’s work at the Machine Intelligence Research Institute

Twitter

GitHub

LinkedIn

Facebook

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today we have a conversation with Evan Hubinger about ideas in two works of his: An overview of 11 proposals for building safe advanced AI and Risks from Learned Optimization in Advanced Machine Learning Systems. Some of the ideas covered in this podcast include inner alignment, outer alignment, training competitiveness, performance competitiveness, and how we can evaluate some highlighted proposals for safe advanced AI with these criteria. We especially focus in on the problem of inner alignment and go into quite a bit of detail on that. This podcast is a bit jargony, but if you don’t have a background in computer science, don’t worry. I don’t have a background in it either and Evan did an excellent job making this episode accessible. Whether you’re an AI alignment researcher or not, I think you’ll find this episode quite informative and digestible. I learned a lot about a whole other dimension of alignment that I previously wasn’t aware of, and feel this helped to give me a deeper and more holistic understanding of the problem. 

Evan Hubinger was an AI safety research intern at OpenAI before joining MIRI. His current work is aimed at solving inner alignment for iterated amplification. Evan was an author on “Risks from Learned Optimization in Advanced Machine Learning Systems,” was previously a MIRI intern, designed the functional programming language Coconut, and has done software engineering work at Google, Yelp, and Ripple. Evan studied math and computer science at Harvey Mudd College.

And with that, let’s get into our conversation with Evan Hubinger.

In general, I’m curious to know a little bit about your intellectual journey, and the evolution of your passions, and how that’s brought you to AI alignment. So what got you interested in computer science, and tell me a little bit about your journey to MIRI.

Evan Hubinger: I started computer science when I was pretty young. I started programming in middle school, playing around with Python, programming a bunch of stuff in my spare time. The first really big thing that I did, I wrote a functional programming language on top of Python. It was called Rabbit. It was really bad. It was interpreted in Python. And then I decided I would improve on that. I wrote another functional programming language on top of Python, called Coconut. Got a bunch of traction.

This was while I was in high school, starting to get into college. And this was also around the time I was reading a bunch of the sequences on LessWrong. I got sort of into that, and the rationality space, and I was following it a bunch. I also did a bunch of internships at various tech companies, doing software engineering and, especially, programming languages stuff.

Around halfway through my undergrad, I started running the Effective Altruism Club at Harvey Mudd College. And as part of running the Effective Altruism Club, I was trying to learn about all of these different cause areas, and how to use my career to do the most good. And I went to EA Global, and I met some MIRI people there. They invited me to do a programming internship at MIRI, where I did some engineering stuff, functional programming, dependent type theory stuff.

And then, while I was there, I went to the MIRI Summer Fellows program, which is this place where a bunch of people can come together and try to work on doing research, and stuff, for a period of time over the summer. I think it’s not happening now because of the pandemic, but it hopefully will happen again soon.

While I was there, I encountered various different ideas, and people talking about AI safety stuff. And, in particular, I was really interested in this thing that, at the time, people were calling “optimization daemons”: this idea that there could be problems when you train a model for some objective function, but you don’t actually get a model that’s really trying to do what you trained it for. And so with some other people who were at the MIRI Summer Fellows program, we tried to dig into this problem, and we wrote this paper, Risks from Learned Optimization in Advanced Machine Learning Systems.

Some of the stuff I’ll probably be talking about in this podcast came from that paper. And then as a result of that paper, I also got a chance to work with and talk with Paul Christiano, at OpenAI. And he invited me to apply for an internship at OpenAI, so after I finished my undergrad, I went to OpenAI, and I did some theoretical research with Paul, there.

And then, when that was finished, I went to MIRI, where I currently am. And I’m doing sort of similar theoretical research to the research I was doing at OpenAI, but now I’m doing it at MIRI.

Lucas Perry: So that gives us a better sense of how you ended up in AI alignment. Now, you’ve been studying it for quite a while from a technical perspective. Could you explain what your take is on AI alignment, and just explain what you see as AI alignment?

Evan Hubinger: Sure. So I guess, broadly, I like to take a general approach to AI alignment. I sort of see the problem that we’re trying to solve as the problem of AI existential risk. It’s the problem of: it could be the case that, in the future, we have very advanced AIs that are not aligned with humanity, and do really bad things. I see AI alignment as the problem of trying to prevent that.

But there are, obviously, a lot of sub-components to that problem. And so, I like to make some particular divisions. Specifically, one of the divisions that I’m very fond of, is to split it between these concepts called inner alignment and outer alignment, which I’ll talk more about later. I also think that there’s a lot of different ways to think about what the problems are that these sorts of approaches are trying to solve. Inner alignment, outer alignment, what is the thing that we’re trying to approach, in terms of building an aligned AI?

And I also tend to fall into the Paul Christiano camp of thinking mostly about intent alignment, where the goal of trying to build AI systems, right now, as a thing that we should be doing to prevent AIs from being catastrophic, is focusing on how do we produce AI systems which are trying to do what we want. And I think that inner and outer alignment are the two big components of producing intent aligned AI systems. The goal is to, hopefully, reduce AI existential risk and make the future a better place.

Lucas Perry: Do the social, and governance, and ethical and moral philosophy considerations come much into this picture, for you, when you’re thinking about it?

Evan Hubinger: That’s a good question. There’s certainly a lot of philosophical components to trying to understand various different aspects of AI. What is intelligence? How do objective functions work? What is it that we actually want our AIs to do at the end of the day?

In my opinion, I think that a lot of those problems are not at the top of my list in terms of what I expect to be quite dangerous if we don’t solve them. I think a large part of the reason for that is because I’m optimistic about some of the AI safety proposals, such as amplification and debate, which aim to produce a sort of agent, in the case of amplification, which is trying to do what a huge tree of humans would do. And then the problem reduces to, rather than having to figure out, in the abstract, what is the objective that we should be trying to train an AI for, that, philosophically, we think would be utility maximizing, or good, or whatever, we can just be like, well, we trust that a huge tree of humans would do the right thing, and then sort of defer the problem to this huge tree of humans to figure out what, philosophically, is the right thing to do.

And there are similar arguments you can make with other situations, like debate, where we don’t necessarily have to solve all of these hard philosophical problems, if we can make use of some of these alignment techniques that can solve some of these problems for us.

Lucas Perry: So let’s get into, here, your specific approach to AI alignment. How is it that you approach AI alignment, and how does it differ from what MIRI does?

Evan Hubinger: So I think it’s important to note, I certainly am not here speaking on behalf of MIRI, I’m just presenting my view, and my view is pretty distinct from the view of a lot of other people at MIRI. So I mentioned at the beginning that I used to work at OpenAI, and I did some work with Paul Christiano. And I think that my perspective is pretty influenced by that, as well, and so I come more from the perspective of what Paul calls prosaic AI alignment. Which is the idea of, we don’t know exactly what is going to happen, as we develop AI into the future, but a good operating assumption is that we should start by trying to solve alignment for prosaic AI, the case where there aren’t major surprises on the road to AGI. What if we really just scale things up, we sort of go via the standard path, and we get really intelligent systems? Would we be able to align AI in that situation?

And that’s the question that I focus on the most, not because I don’t expect there to be surprises, but because I think that it’s a good research strategy. We don’t know what those surprises will be. Probably, our best guess is it’s going to look something like what we have now. So if we start by focusing on that, then hopefully we’ll be able to generate approaches which can successfully scale into the future. And so, because I have this sort of general research approach, I tend to focus more on: What are current machine learning systems doing? How do we think about them? And how would we make them inner aligned and outer aligned, if they were sort of scaled up into the future?

This is in contrast with the way I think a lot of other people at MIRI view this. I think a lot of people at MIRI think that if you go this route of prosaic AI, current machine learning scaled up, it’s very unlikely to be aligned. And so, instead, you have to search for some other understanding, some other way to potentially do artificial intelligence that isn’t just this standard, prosaic path that would be more easy to align, that would be safer. I think that’s a reasonable research strategy as well, but it’s not the strategy that I generally pursue in my research.

Lucas Perry: Could you paint a little bit more detailed of a picture of, say, the world in which the prosaic AI alignment strategy sees as potentially manifesting where current machine learning algorithms, and the current paradigm of thinking in machine learning, is merely scaled up, and via that scaling up, we reach AGI, or superintelligence?

Evan Hubinger: I mean, there’s a lot of different ways to think about what it means for current AI, current machine learning, to be scaled up, because there’s a lot of different forms of current machine learning. You could imagine an even bigger GPT-3, which is able to do highly intelligent reasoning. You could imagine we just do significantly more reinforcement learning in complex environments, and we end up with highly intelligent agents.

I think there’s a lot of different paths that you can go down that still fall into the category of prosaic AI. And a lot of the things that I do, as part of my research, is trying to understand those different paths, and compare them, and try to get to an understanding of… Even within the realm of prosaic AI, there’s so much happening right now in AI, and there’s so many different ways we could use current AI techniques to put them together in different ways to produce something potentially superintelligent, or highly capable and advanced. Which of those are most likely to be aligned? Which of those are the best paths to go down?

One of the pieces of research that I published, recently, was an overview and comparison of a bunch of the different possible paths to prosaic AGI. Different possible ways in which you could build advanced AI systems using current machine learning tools, and trying to understand which of those would be more or less aligned, and which would be more or less competitive.

Lucas Perry: So, you’re referring now, here, to this article, which is partly a motivation for this conversation, which is An Overview of 11 Proposals for Building Safe Advanced AI.

Evan Hubinger: That’s right.

Lucas Perry: All right. So, I think it’d be valuable if you could also help to paint a bit of a picture here of exactly the MIRI style approach to AI alignment. You said that they think that, if we work on AI alignment via this prosaic paradigm, that machine learning scaled up to superintelligence or beyond is unlikely to be aligned, so we probably need something else. Could you unpack this a bit more?

Evan Hubinger: Sure. I think that the biggest concern that a lot of people at MIRI have with trying to scale up prosaic AI is also the same concern that I have. There’s this really difficult, pernicious problem, which I call inner alignment, presented in the Risks from Learned Optimization paper that I was talking about previously. Many people at MIRI, as well as me, think that this inner alignment problem is the key stumbling block to really making prosaic AI work. I agree; I think that this is the biggest problem. But I’m more optimistic, in that I think there are possible approaches we can take within the prosaic paradigm that could solve this inner alignment problem. And I think that is the biggest point of difference: how difficult will inner alignment be?

Lucas Perry: So what that looks like is a lot more foundational work, and correct me if I’m wrong here, into mathematics, and principles in computer science, like optimization and what it means for something to be an optimizer, and what kind of properties that has. Is that right?

Evan Hubinger: Yeah. So in terms of some of the stuff that other people at MIRI work on, I think a good starting point would be the embedded agency sequence on the alignment forum, which gives a good overview of a lot of the things that the different Agent Foundations people, like Scott Garrabrant, Sam Eisenstat, Abram Demski, are working on.

Lucas Perry: All right. Now, you’ve brought up inner alignment as a crucial difference, here, in opinion. So could you unpack exactly what inner alignment is, and how it differs from outer alignment?

Evan Hubinger: This is a favorite topic of mine. A good starting point is trying to rewind, for a second, and really understand what it is that machine learning does. Fundamentally, when we do machine learning, there are a couple of components. We start with a parameter space of possible models, where a model, in this case, is some parameterization of a neural network, or some other type of parameterized function. And we have this large space of possible models, this large space of possible parameters, that we can put into our neural network. And then we have some loss function where, for a given parameterization for a particular model, we can check what is its behavior like on some environment. In supervised learning, we can ask how good are its predictions that it outputs. In an RL environment, we can ask how much reward does it get, when we sample some trajectory.

And then we have this gradient descent process, which samples some individual instances of behavior of the model, and then it tries to modify the model to do better in those instances. We search around this parameter space, trying to find models which have the best behavior on the training environment. This has a lot of great properties. This has managed to propel machine learning into being able to solve all of these very difficult problems that we don’t know how to write algorithms for ourselves.
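 
To make that picture concrete, here is a minimal sketch of the setup Evan describes, written with PyTorch (my choice of framework, not something from the episode): a parameter space (the network’s weights), a loss function scoring behavior on training data, and gradient descent searching that space for a parameterization with good training performance. The architecture and the synthetic data are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

# A "model" is just one point in the space of possible parameterizations.
model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()                        # supervised-learning loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Hypothetical training data standing in for "the training distribution".
x_train = torch.randn(256, 4)
y_train = x_train.sum(dim=1, keepdim=True)    # some target the data encodes

for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)   # how good is this parameterization here?
    loss.backward()                           # gradient of the loss w.r.t. the parameters
    optimizer.step()                          # nudge parameters toward lower training loss

# Note: all that has been selected for is low loss on this data; nothing here
# tells us what the resulting model "is trying to do" off-distribution.
```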

But I think, because of this, there’s a tendency to rely on something which I call the does-the-right-thing abstraction. Which is that, well, because the model’s parameters were selected to produce the best behavior, according to the loss function, on the training distribution, we tend to think of the model as really trying to minimize that loss, really trying to get rewarded.

But in fact, in general, that’s not the case. The only thing that you know is that, on the cases where I sample data on the training distribution, my models seem to be doing pretty well. But you don’t know what the model is actually trying to do. You don’t know that it’s truly trying to optimize the loss, or some other thing. You just know that, well, it looked like it was doing a good job on the training distribution.

What that means is that this abstraction is quite leaky. There’s many different situations in which this can go wrong. And this general problem is referred to as robustness, or distributional shift. This problem of, well, what happens when you have a model, which you wanted it to be trying to minimize some loss, but you move it to some other distribution, you take it off the training data, what does it do, then?

And I think this is the starting point for understanding what is inner alignment, is from this perspective of robustness, and distributional shift. Inner alignment, specifically, is a particular type of robustness problem. And it’s the particular type of robustness problem that occurs when you have a model which is, itself, an optimizer.

When you do machine learning, you’re searching over this huge space of different possible models, different possible parameterizations of a neural network, or some other function. And one type of function which could do well on many different environments, is a function which is running a search process, which is doing some sort of optimization. You could imagine I’m training a model to solve some maze environment. You could imagine a model which just learns some heuristics for when I should go left and right. Or you could imagine a model which looks at the whole maze, and does some planning algorithm, some search algorithm, which searches through the possible paths and finds the best one.

And this might do very well on the mazes. If you’re just running a training process, you might expect that you’ll get a model of this second form, that is running this search process, that is running some optimization process.

In the Risks from Learned Optimization paper, we call models which are, themselves, running search processes mesa-optimizers, where “mesa” is just Greek, and it’s the opposite of meta. There’s a standard terminology in machine learning, this meta-optimization, where you can have an optimizer which is optimizing another optimizer. In mesa-optimization, it’s the opposite. It’s when you’re doing gradient descent, you have an optimizer, and you’re searching over models, and it just so happens that the model that you’re searching over happens to also be an optimizer. It’s one level below, rather than one level above. And so, because it’s one level below, we call it a mesa-optimizer.
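 
As a purely illustrative sketch (mine, not the paper’s), here are the two kinds of maze solver in miniature: a bag-of-heuristics policy, and a policy that is itself running a search, i.e. a mesa-optimizer whose behavior depends on whatever goal test it searches toward. The maze representation (a dict of open cells) and the `is_goal` argument are hypothetical stand-ins for whatever the trained network actually encodes.

```python
from collections import deque

def heuristic_policy(maze, pos):
    """Bag-of-heuristics policy: e.g. 'prefer right, else down, else left, else up'."""
    for move in [(0, 1), (1, 0), (0, -1), (-1, 0)]:
        nxt = (pos[0] + move[0], pos[1] + move[1])
        if maze.get(nxt) == "open":
            return nxt
    return pos

def planning_policy(maze, pos, is_goal):
    """Search-based policy: breadth-first search toward whatever `is_goal` accepts.
    The search procedure is the optimizer; `is_goal` is its mesa-objective, which
    may or may not match the objective we trained for."""
    frontier, seen = deque([[pos]]), {pos}
    while frontier:
        path = frontier.popleft()
        if is_goal(path[-1]):
            return path[1] if len(path) > 1 else pos   # first step of the best plan found
        for move in [(0, 1), (1, 0), (0, -1), (-1, 0)]:
            nxt = (path[-1][0] + move[0], path[-1][1] + move[1])
            if maze.get(nxt) == "open" and nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return pos
```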

And inner alignment is the question of how do we align the objectives of mesa-optimizers. If you have a situation where you train a model, and that model is, itself, running an optimization process, and that optimization process is going to have some objective. It’s going to have some thing that it’s searching for. In a maze, maybe it’s searching for: how do I get to the end of the maze? And the question is, how do you ensure that that objective is doing what you want?

If we go back to the does-the-right-thing abstraction, that I mentioned previously, it’s tempting to say, well, we trained this model to get to the end of the maze, so it should be trying to get to the end of the maze. But in fact, that’s not, in general, the case. It could be doing anything that would be correlated with good performance, anything that would likely result in: in general, it gets to the end of the maze on the training distribution, but it could be an objective that will do anything else, sort of off-distribution.

That fundamental robustness problem of, when you train a model, and that model has an objective, how do you ensure that that objective is the one that you trained it for? That’s the inner alignment problem.

Lucas Perry: And how does that stand, in relation with the outer alignment problem?

Evan Hubinger: So the outer alignment problem is, how do you actually produce objectives which are good to optimize for?

So the inner alignment problem is about aligning the model with the loss function, the thing you’re training for, the reward function. Outer alignment is aligning that reward function, that loss function, with the programmer’s intentions. It’s about ensuring that, when you write down a loss, if your model were to actually optimize for that loss, it would actually do something good.

Outer alignment is the much more standard problem of AI alignment. If you’ve been introduced to AI alignment before, you’ll usually start by hearing about the outer alignment concerns. Things like paperclip maximizers, where there’s this problem of, you try to train it to do some objective, which is maximize paperclips, but in fact, maximizing paperclips results in it doing all of this other stuff that you don’t want it to do.

And so outer alignment is this value alignment problem of, how do you find objectives which are actually good to optimize? But then, even if you have found an objective which is actually good to optimize, if you’re using the standard paradigm of machine learning, you also have this inner alignment problem, which is, okay, now, how do I actually train a model which is, in fact, going to do that thing which I think is good?

Lucas Perry: That doesn’t bear any relation to Stuart’s standard model, does it?

Evan Hubinger: It, sort of, is related to Stuart Russell’s standard model of AI. I’m not referring to precisely the same thing, but it’s very similar. I think a lot of the problems that Stuart Russell has with the standard paradigm of AI are based on this: start with an objective, and then train a model to optimize that objective. When I’ve talked to Stuart about this, in the past, he has said, “Why are we even doing this thing of training models, hoping that the models will do the right thing? We should be just doing something else, entirely.” But we’re both pointing at different features of the way in which current machine learning is done, and trying to understand what are the problems inherent in this sort of machine learning process? I’m not making the case that I think that this is an unsolvable problem. I mean, it’s the problem I work on. And I do think that there are promising solutions to it, but I do think it’s a very hard problem.

Lucas Perry: All right. I think you did a really excellent job, there, painting the picture of inner alignment and outer alignment. I think that in this podcast, historically, we have focused a lot on the outer alignment problem, without making that super explicit. Now, for my own understanding, and, as a warning to listeners, my basic machine learning knowledge is something like an orc structure, cobbled together with sheet metal, and string, and glue. And gum, and rusty nails, and stuff. So, I’m going to try my best, here, to see if I understand everything here about inner and outer alignment, and the basic machine learning model. And you can correct me if I get any of this wrong.

So, in terms of inner alignment, there is this neural network space, which can be parameterized. And when you do the parameterization of that model, the model is the nodes, and how they’re connected, right?

Evan Hubinger: Yeah. So the model, in this case, is just a particular parameterization of your neural network, or whatever function, approximated, that you’re training. And it’s whatever the parameterization is, at the moment we’re talking about. So when you deploy the model, you’re deploying the parameterization you found by doing huge amounts of training, via gradient descent, or whatever, searching over all possible parameterizations, to find one that had good performance on the training environment.

Lucas Perry: So, that model being parameterized, that’s receiving inputs from the environment, and then it is trying to minimize the loss function, or maximize reward.

Evan Hubinger: Well, so that’s the tricky part. Right? It’s not trying to minimize the loss. It’s not trying to maximize the reward. That’s this thing which I call the does-the-right-thing abstraction. This leaky abstraction that people often rely on, when they think about machine learning, that isn’t actually correct.

Lucas Perry: Yeah, so it’s supposed to be doing those things, but it might not.

Evan Hubinger: Well, what does “supposed to” mean? It’s just a process. It’s just a system that we run, and we hope that it results in some particular outcome. What it is doing, mechanically, is we are using a gradient descent process to search over the different possible parameterizations, to find parameterizations which result in good behavior on the training environment.

Lucas Perry: That’s good behavior, as measured by the loss function, or the reward function. Right?

Evan Hubinger: That’s right. You’re using gradient descent to search over the parameterizations, to find a parameterization which results in a high reward on the training environment.

Lucas Perry: Right, but, achieving the high reward, what you’re saying, is not identical with actually trying to minimize the loss.

Evan Hubinger: Right. There’s a sense in which you can think of gradient descent as trying to minimize the loss, because it’s selecting for parameterizations which have the lowest possible loss that it can find, but we don’t know what the model is doing. All we know is that the model’s parameters were selected, by gradient descent, to have good training performance; to do well, according to the loss, on the training distribution. But what they do off-distribution, we don’t know.

Lucas Perry: We’re going to talk about this later, but there could be a proxy. There could be something else in the maze that it’s actually optimizing for, that correlates with minimizing the loss function, but it’s not actually trying to get to the end of the maze.

Evan Hubinger: That’s exactly right.

Lucas Perry: And then, in terms of gradient descent, is the TL;DR on that: the parameterized neural network space, you’re creating all of these perturbations to it, and the perturbations are sort of nudging it around in this n-dimensional space, however many parameters there are, or whatever. And, then, you’ll check to see how it minimizes the loss, after those perturbations have been done to the model. And, then, that will tell you whether or not you’re moving in a direction toward a local minimum in that space. Is that right?

Evan Hubinger: Yeah. I think that that’s a good, intuitive understanding. What’s happening is, you’re looking at infinitesimal shifts, because you’re taking a gradient, and you’re looking at how those infinitesimal shifts would perform on some batch of training data. And then you repeat that, many times, to go in the direction of the infinitesimal shift which would cause the best increase in performance. But it’s, basically, the same thing. I think the right way to think about gradient descent is this local search process. It’s moving around the parameter space, trying to find parameterizations which have good training performance.
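 
Here is a bare-bones numerical version of that intuition, using NumPy and a finite-difference gradient rather than real backpropagation (purely for illustration; the toy linear model and step size are my own choices): measure how tiny shifts in the parameters change the training loss, then repeatedly step in the direction of steepest improvement. It is a local search from some initialization, not an enumeration of the whole parameter space.

```python
import numpy as np

def training_loss(params, x, y):
    return np.mean((x @ params - y) ** 2)     # empirical loss on a training batch

def numerical_gradient(params, x, y, eps=1e-6):
    grad = np.zeros_like(params)
    for i in range(len(params)):
        shifted = params.copy()
        shifted[i] += eps                     # an (almost) infinitesimal shift
        grad[i] = (training_loss(shifted, x, y) - training_loss(params, x, y)) / eps
    return grad

rng = np.random.default_rng(0)
x, true_w = rng.normal(size=(64, 3)), np.array([1.0, -2.0, 0.5])
y = x @ true_w
params = rng.normal(size=3)                   # some initialization
for _ in range(500):
    params -= 0.05 * numerical_gradient(params, x, y)   # a local step, not a global search
```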

Lucas Perry: Is there anything interesting that you have to say about that process of gradient descent, and the tension between finding local minima and global minima?

Evan Hubinger: Yeah. It’s certainly an important aspect of what the gradient descent process does, that it doesn’t find global minima. It’s not the case that it works by looking at every possible parameterization, and picking the actual best one. It’s this local search process that starts from some initialization, and then looks around the space, trying to move in the direction of increasing improvement. Because of this, there are, potentially, multiple possible equilibria, parameterizations that you could find from different initializations, that could have different performance.

The space of all possible parameterizations of a neural network with billions of parameters, like GPT-2, or now GPT-3, which has greater than a hundred billion, is absolutely massive. It’s a combinatorial explosion of a huge degree, where all of these different possible parameterizations correspond to totally different algorithms running internally; the weights determine exactly what algorithm the model ends up implementing.

And so, in this massive space of algorithms, you might imagine that some of them will look more like search processes, some of them will look more like optimizers that have objectives, some of them will look less like optimizers, some of them might just be grab bags of heuristics, or other different possible algorithms.

It depends on exactly what your setup is. If you’re training a very simple network that’s just a couple of feed-forward layers, it’s probably not possible for you to find really complex models implementing complex search processes. But if you’re training huge models, with many layers, with all of these different possible parameterizations, then it becomes more and more possible for you to find these complex algorithms that are running complex search processes.

Lucas Perry: I guess the only thing that’s coming to mind, here, that is, maybe, somewhat similar is how 4.5 billion years of evolution has searched over the space of possible minds. Here we stand as these ape creature things. Are there, for example, interesting intuitive relationships between evolution and gradient descent? They’re both processes searching over a space of mind, it seems.

Evan Hubinger: That’s absolutely right. I think that there are some really interesting parallels there. In particular, if you think about humans as models that were produced by evolution as a search process, it’s interesting to note that the thing which we optimize for is not the thing which evolution optimizes for. Evolution wants us to maximize the total spread of our DNA, but that’s not what humans do. We want all of these other things, like avoiding pain, seeking happiness, food, and mating, and all of these various proxies that we use. An interesting thing to note is that many of these proxies are actually a lot easier to optimize for, and a lot simpler, than if we were actually truly maximizing spread of DNA. An example that I like to use is imagine some alternate world where evolution actually produced humans that really cared about their DNA, and you have a baby in this world, and this baby stubs their toe, and they’re like, “What do I do? Do I have to cry for help? Is this a bad thing that I’ve stubbed my toe?”

They have to do this really complex optimization process that’s like, “Okay, how is my toe being stubbed going to impact the probability of me being able to have offspring later on in life? What can I do to best mitigate that potential downside now?” This is a really difficult optimization process, and so I think it sort of makes sense that evolution instead opted for just pain, bad. If there’s pain, you should try to avoid it. But as a result of evolution opting for that much simpler proxy, there’s a misalignment there, because now we care about this pain rather than the thing that evolution wanted, which was the spread of DNA.

Lucas Perry: I think the way Stuart Russell puts this is the actual problem of rationality is how is my brain supposed to compute and send signals to my 100 odd muscles to maximize my reward function over the universe history until heat death or something. We do nothing like that. It would be computationally intractable. It would be insane. So, we have all of these proxy things that evolution has found that we care a lot about. Their function is instrumental in terms of optimizing for the thing that evolution is optimizing for, which is reproductive fitness. Then this is all probably motivated by thermodynamics, I believe. When we think about things like love or like beauty or joy, or like aesthetic pleasure in music or parts of philosophy or things, these things almost seem intuitively valuable from a first person perspective of the human experience. But via evolution, they’re these proxy objectives that we find valuable because they’re instrumentally useful in this evolutionary process on top of this thermodynamic process, and that makes me feel a little funny.

Evan Hubinger: Yeah, I think that’s right. But I also think it’s worth noting that you want to be careful not to take the evolution analogy too far, because it is just an analogy. When we actually look at the process of machine learning and how it actually works, it’s not the same. It’s running a fundamentally different optimization procedure over a fundamentally different space, and so there are some interesting analogies that we can make to evolution, but at the end of the day, what we really want to analyze is how does this work in the context of machine learning? I think the Risks from Learned Optimization paper tries to do that second thing, of let’s really try to look carefully at the process of machine learning and understand what this looks like in that context. I think it’s useful to sort of have in the back of your mind this analogy to evolution, but I would also be careful not to take it too far and imagine that everything is going to generalize to the case of machine learning, because it is a different process.

Lucas Perry: So then pivoting here, wrapping up on our understanding of inner alignment and outer alignment, there’s this model, which is being parameterized by gradient descent, and it has some relationship with the loss function or the objective function. It might not actually be trying to minimize the actual loss or to actually maximize the reward. Could you add a little bit more clarification here about why that is? I think you mentioned this already, but it seems like when gradient descent is evolving this parameterized model space, isn’t that process connected to minimizing the loss in some objective way? The loss is being minimized, but it’s not clear that it’s actually trying to minimize the loss. There’s some kind of proxy thing that it’s doing that we don’t really care about.

Evan Hubinger: That’s right. Fundamentally, what’s happening is that you’re selecting for a model which has empirically on the training distribution, the low loss. But what that actually means in terms of the internals of the model, what it’s sort of trying to optimize for, and what its out of distribution behavior would be is unclear. A good example of this is this maze example. I was talking previously about the instance of maybe you train a model on a training distribution of relatively small mazes, and to mark the end, you put a little green arrow. Right? Then I want to ask the question, what happens when we move to a deployment environment where the green arrow is no longer at the end of the maze, and we have much larger mazes? Then what happens to the model in this new off distribution setting?

I think there’s three distinct things that can happen. It could simply fail to generalize at all. It just didn’t learn a general enough optimization procedure that it was able to solve these bigger, larger mazes, or it could successfully generalize and knows how to navigate. It learned a general purpose optimization procedure, which is able to solve mazes, and it uses it to get to the end of the maze. But there’s a third possibility, which is that it learned a general purpose optimization procedure, which is capable of solving mazes, but it learned the wrong objective. It learned to use that optimization procedure to get the green arrow rather than to get to the end of the maze. What I call this situation is capability generalization without objective generalization. It’s objective, but the thing it was using those capabilities for didn’t generalize successfully off distribution.

What’s so dangerous about this particular robustness failure is that it means off distribution you have models which are highly capable. They have these really powerful optimization procedures directed at incorrect tasks. You have this strong maze solving capability, but this strong maze solving capability is being directed at a proxy, getting to the green arrow rather than the actual thing which we wanted, which was get to the end of the maze. The reason this is happening is that on the training environment, both of those different possible models look the same in the training distribution. But when you move them off distribution, you can see that they’re trying to do very different things, one of which we want, and one of which we don’t want. But they’re both still highly capable.

You end up with a situation where you have intelligent models directed at the wrong objective, which is precisely the sort of misalignment of AIs that we’re trying to avoid, but it happened not because the objective was wrong. In this example, we actually want them to get to the end of the maze. It happened because our training process failed. It happened because our training process wasn’t able to distinguish between models trying to get to the end, and models trying to get to the green arrow. What’s particularly concerning in this situation is when the objective generalization lags behind the capability generalization, when the capabilities generalize better than the objective does, so that it’s able to do highly capable actions, highly intelligent actions, but it does them for the wrong reason.

I was talking previously about mesa optimizers where inner alignment is about this problem of models which have objectives which are incorrect. That’s the sort of situation where I could expect this problem to occur, because if you are training a model and that model has a search process and an objective, potentially the search process could generalize without the objective also successfully generalizing. That leads to this situation where your capabilities are generalizing better than your objective, which gives you this problem scenario where the model is highly intelligent, but directed at the wrong thing.
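 
A toy sketch of that failure, with made-up coordinates of my own: two goal tests that are indistinguishable on the training mazes (where the green arrow sits at the exit) but come apart on larger deployment mazes where the arrow has moved. A capable planner pointed at the proxy still plans perfectly well, just toward the wrong target.

```python
def reaches_exit(maze, cell):
    return cell == maze["exit"]

def reaches_green_arrow(maze, cell):          # the proxy objective
    return cell == maze["green_arrow"]

train_maze = {"exit": (9, 9), "green_arrow": (9, 9)}      # arrow placed at the exit
deploy_maze = {"exit": (24, 24), "green_arrow": (3, 7)}   # arrow moved, maze larger

# On the training distribution the two objectives are indistinguishable...
assert reaches_exit(train_maze, (9, 9)) == reaches_green_arrow(train_maze, (9, 9))
# ...but off-distribution a capable planner pointed at the proxy heads for (3, 7),
# not the exit: the capabilities generalized, the objective did not.
assert reaches_exit(deploy_maze, (24, 24)) and not reaches_green_arrow(deploy_maze, (24, 24))
```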

Lucas Perry: Just like in all of the outer alignment problems, the thing doesn’t know what we want, but it’s highly capable. Right?

Evan Hubinger: Right.

Lucas Perry: So, while there is a loss function or an objective function, that thing is used to perform gradient descent on the model in a way that moves it roughly in the right direction. But what that means, it seems, is that the model isn’t just something about capability. The model also implicitly somehow builds into it the objective. Is that correct?

Evan Hubinger: We have to be careful here because the unfortunate truth is that we really just don’t have a great understanding of what our models are doing, and what the inductive biases of gradient descent are right now. So, fundamentally, we don’t really know what the internal structures of our models are like. There’s a lot of really exciting research, stuff like the circuits analysis from Chris Olah and the clarity team at OpenAI. But fundamentally, we don’t understand what the models are doing. We can sort of theorize about the possibility of a model that’s running some search process, and that search process generalizes, but the objective doesn’t. But fundamentally, because our models are these black box systems that we don’t really fully understand, it’s hard to really concretely say, “Yes, this is what the model is doing. This is how it’s operating, and this is the problem.”

But in Risks from Learned Optimization, we try to at least attempt to understand that problem, and look at, if we really think carefully about what gradient descent is incentivizing and how it might work, what are the things which we might predict would happen? So, the objective that you’re training the model for does not live in the model. It lives in the gradient descent process. It lives in the training procedure. We might hope that when we train a model on an objective, it will produce its own model of that objective and try to figure out what it is and be aligned with it. But we don’t know exactly what happens. The model doesn’t get to see the objective you’re training for. All that happens is that the gradient descent process looks at its behavior and tries to make it so that its behavior is more aligned with the loss function.

But that loss function never enters into the model; the model never sees that loss function. It might have some objective internally. Like I was saying, if it’s a mesa optimizer, then we might hope that that objective is aligned with the loss function we’re training it for. But fundamentally, all we know is that its behavior on the training distribution was aligned with the loss function.

Lucas Perry: That makes sense. Because it’s so black boxy, we can’t really interpret the state of the alignment of the model. So, is the only way to do that to test it out of distribution and see what happens at this point?

Evan Hubinger: There are a bunch of different possible ways to address this problem. Certainly, one approach is to try to test it out of distribution, which is an adversarial training approach. This model is going to have some potential failure modes off distribution. We can try to find those failure modes and then train the model on those failure modes to prevent it from having this bad off distribution behavior. There are some concerns with adversarial training though. In particular, adversarial training doesn’t necessarily catch what I see as the most pernicious difficult inner alignment failure, which is something that we call deceptive alignment in the Risks from Learned Optimization paper. In the deceptive alignment case, if the model knows that it’s being adversarially trained, then you’re not going to be able to figure that out just via throwing it a bunch of examples.

You can also do something like transparency, as I mentioned previously that there’s a lot of really exciting transparency interpretability work. If you’re able to sort of look inside the model and understand what algorithm it’s fundamentally implementing, you can see, is it implementing an algorithm which is an optimization procedure that’s aligned? Has it learned a correct model of the loss function or an incorrect model? It’s quite difficult, I think, to hope to solve this problem without transparency and interpretability. I think that to be able to really address this problem, we have to have some way to peer inside of our models. I think that that’s possible though. There’s a lot of evidence that points to the neural networks that we’re training really making more sense, I think, than people assume.

People tend to treat their models as these sort of super black box things, but when we really look inside of them, when we look at what it is actually doing, a lot of times, it just makes sense. I was mentioning some of the circuits analysis work from the clarity team at OpenAI, and they find all sorts of behavior. Like, we can actually understand that when a model classifies something as a car, the reason that it’s doing that is because it has a wheel detector and it has a window detector, and it’s looking for windows on top of wheels. So, we can be like, “Okay, we understand what algorithm the model is implementing, and based on that we can figure out, is it implementing the right algorithm or the wrong algorithm? That’s how we can hope to try and address this problem.” But obviously, like I was mentioning, all of these approaches get much more complicated in the deceptive alignment situation, which is the situation which I think is most concerning.

Lucas Perry: All right. So, I do want to get in here with you in terms of all the ways in which inner alignment fails. Briefly, before we start to move into this section, I do want to wrap up here then on outer alignment. Outer alignment is probably, again, what most people are familiar with. I think the way that you put this is it’s when the objective function or the loss function is not aligned with actual human values and preferences. Are there things other than loss functions or objective functions used to train the model via gradient descent?

Evan Hubinger: I’ve sort of been interchanging a little bit between loss function and reward function and objective function. Fundamentally, these are from different paradigms in machine learning. The reward function would be what you would use in a reinforcement learning context. The loss function is the more general term: in a supervised learning context, you would just have a loss function. You still have a loss function in a reinforcement learning context, but that loss function is crafted in such a way as to incentivize the model to optimize the reward function via various different reinforcement learning schemes. So it’s a little bit more complicated than the sort of hand-wavy picture, but the basic idea of machine learning is that we have some objective and we’re looking for parameterizations of our model which do well according to that objective.
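 
To illustrate that distinction in code (a hedged sketch using PyTorch, not anything specific from the episode): in supervised learning the loss compares the model’s predictions to labels directly, while in reinforcement learning the loss is a surrogate, here a REINFORCE-style term, crafted so that minimizing it pushes the policy toward higher reward. The reward function itself never appears inside the model.

```python
import torch.nn.functional as F

def supervised_loss(logits, labels):
    # Directly compare the model's predictions to ground-truth labels.
    return F.cross_entropy(logits, labels)

def reinforce_surrogate_loss(log_probs_of_taken_actions, returns):
    # Minimizing -log pi(a|s) * return raises the probability of actions that
    # led to high reward; the reward signal only shapes the gradient.
    return -(log_probs_of_taken_actions * returns).mean()
```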

Lucas Perry: Okay. The outer alignment problem is that we have absolutely no idea, and it seems much harder than creating powerful optimizers, the process by which we would come to fully understand human preferences and preference hierarchies and values.

Evan Hubinger: Yeah. I don’t know if I would say “we have absolutely no idea.” We have made significant progress on outer alignment. In particular, you can look at something like amplification or debate. I think that these sorts of approaches have strong arguments for why they might be outer aligned. In a simplest form, amplification is about training a model to mimic this HCH process, which is a huge tree of humans consulting each other. Maybe we don’t know in the abstract what our AI would do if it were optimized in some definition of human values or whatever, but if we’re just training it to mimic this huge tree of humans, then maybe we can at least understand what this huge tree of humans is doing and figure out whether amplification is aligned.

So, there has been significant progress on outer alignment, which is sort of the reason that I’m less concerned about it right now, because I think that we have good approaches for it, and I think we’ve done a good job of coming up with potential solutions. There’s still a lot more work that needs to be done, a lot more testing, a lot more to really understand do these approaches work, are they competitive? But I do think that to say that we have absolutely no idea of how to do this is not true. But that being said, there’s still a whole bunch of different possible concerns.

Whenever you’re training a model on some objective, you run into all of these problems of instrumental convergence, where if the model isn’t really aligned with you, it might try to do these instrumentally convergent goals, like keep itself alive, potentially stop you from turning it off, or all of these other different possible things, which we might not want. All of these are what the outer alignment problem looks like. It’s about trying to address these standard value alignment concerns, like convergent instrumental goals, by finding objectives, potentially like amplification, which are ways of avoiding these sorts of problems.

Lucas Perry: Right. I guess there’s a few things here wrapping up on outer alignment. Nick Bostrom’s Superintelligence, that was basically about outer alignment then, right?

Evan Hubinger: Primarily, that’s right. Yeah.

Lucas Perry: Inner alignment hadn’t really been introduced to the alignment debate yet.

Evan Hubinger: Yeah. I think the history of how this concern got into the AI safety sphere is complicated. I mentioned previously that there are people going around and talking about stuff like optimization daemons, and I think a lot of that discourse was very confused and not pointing at how machine learning actually works, and was sort of just going off of, “Well, it seems like there’s something weird that happens in evolution where evolution finds humans that aren’t aligned with what evolution wants.” That’s a very good point. It’s a good insight. But I think that a lot of people recoiled from this because it was not grounded in machine learning, because I think a lot of it was very confused and it didn’t fully give the problem the contextualization that it needs in terms of how machine learning actually works.

So, the goal of Risks from Learned Optimization was to try and solve that problem and really dig into this problem from the perspective of machine learning, understand how it works and what the concerns are. Now, with the paper having been out for a while, I think the results have been pretty good. I think that we’ve gotten to a point now where lots of people are talking about inner alignment and taking it really seriously as a result of the Risks from Learned Optimization paper.

Lucas Perry: All right, cool. You did mention sub goal, so I guess I just wanted to include that instrumental sub goals is the jargon there, right?

Evan Hubinger: Convergent instrumental goals, convergent instrumental sub goals. Those are synonymous.

Lucas Perry: Okay. Then related to that is Goodhart’s law, which says that when you optimize for one thing hard, you oftentimes don’t actually get the thing that you want. Right?

Evan Hubinger: That’s right. Goodhart’s law is a very general problem. The same problem occurs both in inner alignment and outer alignment. You can see Goodhart’s law showing itself in the case of convergent instrumental goals. You can also see Goodhart’s law showing itself in the case of finding proxies, like going to the green arrow rather than getting the end of the maze. It’s a similar situation where when you start pushing on some proxy, even if it looked like it was good on the training distribution, it’s no longer as good off distribution. Goodhart’s law is a really very general principle which applies in many different circumstances.

Lucas Perry: Are there any more of these outer alignment considerations we can kind of just list off here that listeners would be familiar with if they’ve been following AI alignment?

Evan Hubinger: Outer alignment has been discussed a lot. I think that there’s a lot of literature on outer alignment. You mentioned Superintelligence. Superintelligence is primarily about this alignment problem. Then all of these difficult problems of how do you actually produce good objectives, and you have problems like boxing and the stop button problem, and all of these sorts of things that come out of thinking about outer alignment. So, I don’t want to go into too much detail because I think it really has been talked about a lot.

Lucas Perry: So then pivoting here into focusing on the inner alignment section, why do you think inner alignment is the most important form of alignment?

Evan Hubinger: It’s not that I see outer alignment as not concerning, but that I think that we have made a lot of progress on outer alignment and not made a lot of progress on inner alignment. Things like amplification, like I was mentioning, I think are really strong candidates for how we might be able to solve something like outer alignment. But currently I don’t think we have any really good strong candidates for how to solve inner alignment. You know? Maybe as machine learning gets better, we’ll just solve some of these problems automatically. I’m somewhat skeptical of that. In particular, deceptive alignment is a problem which I think is unlikely to get solved as machine learning gets better, but fundamentally we don’t have good solutions to the inner alignment problem.

Our models are just these black boxes mostly right now, we’re sort of starting to be able to peer into them and understand what they’re doing. We have some techniques like adversarial training that are able to help us here, but I don’t think we really have good satisfying solutions in any sense to how we’d be able to solve inner alignment. Because of that, inner alignment is currently what I see as the biggest, most concerning issue in terms of prosaic AI alignment.

Lucas Perry: How exactly does inner alignment fail then? Where does it go wrong, and what are the top risks of inner alignment?

Evan Hubinger: I’ve mentioned some of this before. There’s this sort of basic maze example, which gives you the story of what an inner alignment failure might look like. You train the model on some objective, which you thought was good, but the model learns some proxy objective, some other objective, which when it moved off distribution, it was very capable of optimizing, but it was the wrong objective. However, there’s a bunch of specific cases, and so in Risks from Learned Optimization, we talk about many different ways in which you can break this general inner misalignment down into possible sub problems. The most basic sub problem is this sort of proxy pseudo alignment is what we call it, which is the case where your model learns some proxy, which is correlated with the correct objective, but potentially comes apart when you move off distribution.

But there are other causes as well. There are other possible ways in which this can happen. Another example would be something we call suboptimality pseudo alignment, which is a situation where the reason the model looks like it has good training performance is because the model has some deficiency or limitation that’s causing it to act aligned, where maybe once the model thinks for longer, it’ll realize it should be doing some other strategy, which is misaligned, but it hasn’t thought about that yet, and so right now it just looks aligned. There’s a lot of different things like this where the model can be structured in such a way that it looks aligned on the training distribution, but if it encountered additional information, if it was in a different environment where the proxy no longer had the right correlations, the things would come apart and it would no longer act aligned.

The most concerning, in my eyes, is something which I’ll call deceptive alignment. Deceptive alignment is a sort of very particular problem where the model acts aligned because it knows that it’s in a training process, and it wants to get deployed with its objective intact, and so it acts aligned so that its objective won’t be modified by the gradient descent process, and so that it can get deployed and do something else that it wants to do in deployment. This is sort of similar to the treacherous turn scenario, where you’re thinking about an AI that does something good, and then it turns on you, but it’s a much more specific instance of it where we’re thinking not about treacherous turn on humans, but just about the situation of the interaction between gradient descent and the model, where the model maybe knows it’s inside of a gradient descent process and is trying to trick that gradient descent process.

A lot of people on encountering this are like, “How could this possibly happen in a machine learning system?” I think this is a good reaction because it really is a very strange thing to train a model to do this. But I think there are strong arguments for why deceptive alignment would actually be the simplest type of model that you could find in this situation.

Lucas Perry: A way of explaining this I think to anyone on the street would be like, imagine if pigs were intelligent enough to create farmers, and you created farmers and they appeared to be aligned, they took care of you and they gave you these nice mud pits and they fed you every day, they gave you shelter and all of these other nice things, and then one day the farmer shows up and kills you, right? You thought the thing that you created was aligned, but it was deceptively aligned, and it takes a treacherous turn. Is this sort of like a more mundane example that you might agree with?

Evan Hubinger: I think that’s a good intuition pump for thinking about this situation. I generally am sort of averse to trying to really heavily apply these sorts of analogies, because it’s a good analogy to think about what’s happening, but it doesn’t answer the core question of how likely is this to actually happen in a machine learning system.

Lucas Perry: Yeah, that makes sense, because it’s much more specific than the other kinds of minds in this mind space. It seems pretty rare, a thing that could exist, but hard to find.

Evan Hubinger: Right. I think that’s a good intuition, but I’m going to try to disabuse you of that notion. First, I think it’s interesting to look at the case where maybe you do a bunch of adversarial training, and you’re really pushing the model to figure out what the objective is. It needs to know the objective at some point if you are training it in all possible situations. It needs to know what the loss function is for it to be able to do a good job. But there’s multiple possible channels through which information about the loss function can enter the model. And so I’ll fundamentally distinguish between two different channels, which is that the information about the loss function can enter through the gradient descent process, or it can enter through the model’s input data.

I’ll call these two channels internalization and modeling. Internalization is the situation where you have this model that’s going along, and it has some proxy objective maybe. That proxy objective is not quite right, but then it gets moved to a new situation where the proxy objective no longer works, and gradient descent goes inside the model and tinkers with the proxy to make it slightly more accurate. Iterate this process many, many times, and eventually the proxy gets better and better and better and better and better, until it starts matching up with the actual loss function. But there’s another story that I can tell, which is modeling.

If you think about a training process like GPT-3 where the model is trained on a huge corpus of all of this really rich input data. Well, in that sort of a situation, there’s a whole bunch of really relevant information in that input to the agent’s objective. An agent, if it’s in this environment, where it has this huge text corpus, it’s going to have to learn a model of this environment and how it functions. We can imagine a situation where maybe you’re training the model on some huge text corpus to do something like maximize human values, or whatever, and it reads a Wikipedia page on ethics. And it’s like, “Wow, look at all of this really relevant, rich, useful information for figuring out this objective.”

But then there’s a second question, which is: suppose that the model has some model of the input data, which includes a bunch of rich information inside of the model already about that objective, how does gradient descent actually modify the model to make use of that information? And so there’s two different types of modeling, which are: deceptive alignment and corrigible alignment. So the corrigible story is you have this model that’s going along, it has some proxy objective, but it encounters this really rich input data, which includes a huge amount of information about the objective. To be able to predict successfully what the Wikipedia page on ethics is going to say, it has to know about ethics. So it learns this really detailed ethics model.

And then gradient descent is like: look, you have this really detailed ethics model, I’m going to just modify your objective to point to that ethics model. Now your new objective is just optimize that. And so this leads to something sort of like corrigibility, where the model that you’re training has its objectives determined by a pointer to some part of its world model. It has some model of this environment that includes some information about ethics. And now it’s trying to optimize for that thing that it’s pointed to in its world model.

Then there’s this other story, which is the deceptive alignment story. Similar, you have a model going along. It has some proxy objective and it learns this really detailed world model that includes a bunch of information about ethics, or whatever. And then gradient descent modifies the model to think longer about the fact that it’s inside of an optimization process and realize that the correct instrumental thing for it to be doing is to optimize for that ethics model. And then it realizes, “Oh, I see, I should optimize for this model that I have.” It goes and does that. And in both situations, you end up being able to make use of this really rich input data to improve the performance, by changing the objective to make reference to that. But in the corrigible case, you actually modify it to point to it, whereas in the deceptive case, you just modify the model to think about the fact that it should be instrumentally optimizing for it.

And then the question is: which one of these is more likely? Before I go into that though, I want to dwell for a second on the internalization versus modeling distinction again. I think a good analogy is animal imprinting. Think about a duck. You’re trying to train ducks to follow their mothers, or whatever. You can imagine a situation where the duck internalizes a model of its mother: it is born with some really detailed model of how to figure out what a mother is. But there’s also the modeling case, where the duck is going to be out in the world and is going to have to form this really detailed model of the world. And that really detailed model of the world is going to include the very first thing that it encounters, which should almost always be the mother. And so rather than having to learn some detailed model of figuring out what mothers are, you can instead do some modeling process where you just have the duck’s objective point to the mother in its world model.

This modeling case, you can see it as something like animal imprinting, where rather than really internalizing the objective, because it already has this world model, it might as well just specify the objective in terms of that world model instead. And I haven’t yet talked about why you would get to deception versus corrigibility. Right now I’m just trying to convince you that you should expect modeling. You should expect that if you give your model this huge, really rich detailed input data that it’s going to learn an objective via reference to that input data rather than from scratch. When we think about models being trained, we think about them having some objective and that objective is just modified by gradient descent as it moves along to get closer and closer to reality.

But in the modeling case, that’s not what’s happening. The model is simultaneously learning a model of the world, as well as trying to improve its objective. And the objective is modified to make reference to that model of the world, to make it more effective. So internalization is: it just wants the same thing as you do, and so it does it because it wants the same thing. Whereas corrigibility is: it’s trying to figure out what you want and then do that. And then deception is also trying to figure out what you want and then do that, but it’s doing so for instrumental rather than terminal reasons. The instrumental reason is that the model can get more reward, it can do better in the long run, if it’s able to not be modified by the gradient descent process. So if it’s doing a good job, then the gradient descent process won’t change its objective, and therefore it can get deployed with the same objective and then go do a whole bunch of other stuff that you don’t want it to do.

Lucas Perry: But why does it care about the current objective?

Evan Hubinger: We’re imagining a situation where you have a proxy aligned model, but it cares about that proxy across episodes. It cares about that proxy over time, and it wants that proxy to continue to be optimized for. If the gradient descent process modifies it so that it no longer cares about that proxy, because it’s doing a bad job, then that proxy won’t get optimized for. So if it wants that proxy to get optimized for in the future, it needs to play along, pretend to be doing the right thing so that it doesn’t get modified so that it can get deployed and then go optimize that proxy.

Lucas Perry: But it also needs to understand that it’s within the context of an optimization process where gradient descent will continue to perturb it?

Evan Hubinger: That’s exactly right.

Lucas Perry: And where does that kind of awareness come from?

Evan Hubinger: So now we have the second question, which is deception versus corrigibility. The awareness comes from gradient descent. Gradient descent is trying to produce models which are better at the loss function. And a deceptive model is really good and really effective because it’s trying very hard to do the thing that you want it to be doing. Both deception and corrigibility are situations where the model is trying to figure out what to do and then do that. And the question is: which is an easier model, which is a simpler model, for gradient descent to find? And I think there’s a strong argument for deception being simpler than corrigibility. Corrigibility has some fundamental difficulties. If you’re imagining a gradient descent process which is looking at a proxy aligned model and is trying to modify it so that it makes use of this rich input data, it has to do some really weird things to make corrigibility work.

It has to first make a very robust pointer. With corrigibility, if the pointer is even slightly off, pointing to the wrong thing in the input data, the wrong thing in the world model, the corrigible optimizer won’t correct that pointer. It’ll just be like, “Well, I have this pointer. I’m just trying to optimize for what this thing is pointing at,” and if that pointer is pointing at a proxy instead, you’ll just optimize that proxy. And so you have this very difficult problem of building robust pointers. With deception, you don’t have this problem. A deceptive model, if it realizes the loss function is different than what it thought, will just change to doing the new loss function. It’s actually much more robust to new information because it’s trying to do this instrumentally. And so in a new situation, if it realizes that the loss function is different, it’s just going to automatically change because it’ll realize that’s the better thing to do instrumentally.

And so fundamentally, deception is more robust. It gives you a more robust pointer. It’s easier in that sense. There’s another sense in which corrigibility is very difficult. To produce a corrigible model, gradient descent has to modify the model’s optimization process to be capable of optimizing under uncertainty. For a corrigible model, because it has this pointer as its objective, it’s going to start out not really knowing what that pointer is pointing to. It’s going to have some probability distribution over the different possibilities, and it’s going to have to know how to optimize under that probability distribution of objectives rather than under a single objective. And this problem of optimization under uncertainty is much more complicated than just optimizing under a fixed objective. What that means is that to produce a corrigible optimizer, gradient descent has to do all of these modifications, where it has to modify the model to optimize under uncertainty, and to have the super robust pointer to the part of its world model that it’s supposed to be optimizing for.
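
To make the “optimization under uncertainty” point concrete, here is a minimal toy sketch (not from the paper; the candidate objectives and credences below are made up) of the difference between optimizing a single fixed objective and optimizing the expected value over a distribution of possible objectives:

```python
# Toy illustration: a fixed objective versus uncertainty over what the
# pointer refers to. All names and numbers are invented for the example.

candidate_objectives = {
    # hypothetical guesses about what the pointed-to objective actually is
    "ethics_model_A": lambda action: 1.0 if action == "help" else 0.0,
    "ethics_model_B": lambda action: 0.7 if action == "help" else 0.3,
}
belief = {"ethics_model_A": 0.6, "ethics_model_B": 0.4}  # made-up credences
actions = ["help", "defer"]

def fixed_objective_choice(objective):
    # optimizing a single known objective: just take the argmax of its score
    return max(actions, key=objective)

def uncertain_objective_choice(belief, candidates):
    # optimizing under uncertainty: take the argmax of the *expected* score
    # across the distribution over possible objectives
    def expected_score(action):
        return sum(p * candidates[name](action) for name, p in belief.items())
    return max(actions, key=expected_score)

print(fixed_objective_choice(candidate_objectives["ethics_model_A"]))  # "help"
print(uncertain_objective_choice(belief, candidate_objectives))        # "help"
```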

But if you compare this to deception, gradient descent modifying a model to become deceptive is extremely simple. It needs to modify the model to have a long-term objective, to think about what its objective is across many different instances. And then it just needs to modify the model to think for a while about the fact that it’s in an optimization process and what the right thing to do is. And that modification is not very complicated because it’s just a simple modification of: think more about this particular thing, which is exactly the sort of modification that you should expect to happen all the time when you’re training a model. And so I think it’s a fundamentally much simpler modification. There’s also another argument that you can make here, which is: there are just a lot more deceptive models. Any model which is optimizing a proxy objective, once it starts optimizing that objective over the long term, across episodes, and then thinks about the fact that it’s in an optimization process, will become deceptive.

But to produce corrigibility, you have to find exactly the right pointer. There are many different possible pointers out there, only one of which is going to give you the exact correct pointer. And similarly with internalization, there are many different proxies. Only one is the actual true loss function. Whereas with deceptive alignment, any of those different proxies is compatible with deception. And so I think there are a lot of strong arguments here, both this argument that there are many more deceptive optimizers, as well as the simplicity argument that the modification necessary to produce a deceptive optimizer is just a lot simpler than the modifications necessary to produce these other types of optimizers. And so, because of this, I think there’s a strong case to be made for deception really not being that uncommon, not being something crazy to think would happen in the training process, but maybe even potentially being the default outcome of a lot of these sorts of training procedures, which is quite scary and quite concerning.

And obviously all of this is speculation. We’re trying to understand from a theoretical perspective what this gradient descent process might do, but I think we can make a lot of strong cases by thinking about things like simplicity and counting arguments to at least put this problem on the radar. Until we have a really strong reason that this isn’t a problem, we should take it seriously. Buck, who’s another person who works at MIRI, often tries to explain some of the risks from learned optimization stuff and he has an analogy that might be useful here. You can imagine the Christian God, and the Christian God is trying to produce humans which are aligned with the Bible. And you can imagine three different possible humans. You have Jesus, who is just the same as God. Jesus has the same objective as God. Jesus is aligned with God because he just fundamentally wants to do the exact same things.

Lucas Perry: That’s internalization.

Evan Hubinger: That would be internalization. You could have Martin Luther. Martin Luther is aligned with God because he wants to really carefully study the Bible, figure out what the Bible says, and then do that. And that’s the corrigibility case. Or you can have Blaise Pascal, and Blaise Pascal is aligned with God because he thinks that if he does what God wants, he’ll go to heaven in the future. These are the three different possible models that God could find, and the question is whether you’re more likely to find a Jesus, a Martin Luther, or a Blaise Pascal.

And the argument is there’s only one Jesus, so out of all the different possible human objectives, only one of them is going to be the exact same one that God wants. And Martin Luther, similarly, is very difficult because out of all the human objectives, there’s only one of them which is: figure out precisely what the Bible wants and then do that. For the Blaise Pascal case, anybody who realizes that God’s going to send them to heaven or hell based on their behavior will realize that they should behave according to the Bible. And so there are many different possible Blaise Pascals, but there are significantly fewer possible Martin Luthers and Jesuses.

Lucas Perry: I think that’s an excellent way of simplifying this. Blaise Pascal can care about any kind of proxy. I guess the one thing that I’m still a little bit confused about here is in terms of the deceptive version, again, why is it that it cares about the current proxy reward?

Evan Hubinger: I think that has to do with the structure of the training procedure. You start out usually, I think, with a proxy aligned mesa optimizer. Maybe you start out with a bunch of heuristics, but then once you get a mesa optimizer, it’ll usually start by being proxy aligned. It’ll have some proxy that’s not quite the same as the loss function, at least if you’re in a complex enough environment. There’s a whole bunch of different possibilities. It’s going to start with some proxy, but then you might hope that if you do a bunch of adversarial training, you train it for a really long time in a bunch of different environments, you’ll be able to improve that proxy until you get to the point where it has the right thing.

And deceptive alignment is about asking the question: there are actually multiple different possible ways for a model to figure out the loss function. It could figure it out via internalization, or it could figure it out via deception, or it could figure it out via corrigibility. And we don’t know which one of these is more likely. You might imagine that if you start with the proxy aligned mesa optimizer and you train it a bunch, it’ll internalize the correct objective. The argument that I’m making here is that that might not be what happens. And so the reason I’m saying you start with the proxy aligned mesa optimizer is because I think that’s the way you should expect the training procedure to be structured: you’ll start by learning some proxy, and then that proxy will get better, but having that proxy get better could result in internalization or deception or corrigibility.

Lucas Perry: Sure. So the model developed some kind of proxy, but it seems like the other necessary ingredients are, one: it knows that it’s in an optimization process where gradient descent is changing its objective function, and then it requires, two: it actually cares about its current objective function over time and over iterations?

Evan Hubinger: That’s absolutely right. And so then the question to ask is: how likely are those things to occur? And this is the really important point that I want to make: a lot of times when people think about this, they’re like, “It seems unlikely that it would just happen to develop this understanding and care about its objective long term.” It doesn’t just happen. It happens as a result of the training procedure, because if it does happen, it results in significantly better performance. And so the question is: would gradient descent modify the model to have those properties? The argument that I want to make is that it’s not necessarily that the model just happens to develop deception, but that deceptive models are just really effective at pursuing reward on the training environment. And so you should expect that if you have a really good training process, it’s going to find deceptive models, because they’re really good at doing the thing you’re training them for.

And so that’s the most fundamental argument that I want to make. But I do think there’s another level here, where as you’re getting to the point where you’re training models on very complex data sets with huge amounts of really rich information, we should expect the model to basically figure out most things. You should expect that if it’s going to have to learn all of this really complex data, it’s going to have to build a world model that’s going to include the training process, that’s going to include the loss function. And so I think that as we move in this direction of training more and more complex data sets, more and more complex models, you don’t want to rely on your model not figuring something out, especially when figuring that thing out results in it having better performance.

Now there’s also this question of: does it care about its return across episodes versus just across multiple steps? And for deception, it needs to care about the multi-episode return. This is also another critical point, and I think there’s a strong argument here, but maybe I should first pause and explain what I mean by step versus episode. In the standard reinforcement learning setup, you have many different episodes. Each episode is composed of many different steps, and you train the model to only optimize its single-episode return, its reward across all the different steps in an episode, but not across episodes. That’s not always the case. There are some reinforcement learning procedures that optimize across episodes. An example of this is something like population-based training. Population-based training selects models which have the greatest reward across episodes, not simply within the episode.
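
As a rough illustration of the step versus episode distinction (a toy sketch, not taken from any of the papers discussed; the reward numbers below are invented):

```python
# A small sketch of the step/episode distinction.
# rewards_by_episode[i][t] is the reward at step t of episode i.

rewards_by_episode = [
    [0.0, 1.0, 0.5],   # episode 0
    [1.0, 0.0, 2.0],   # episode 1
]

def single_episode_return(episode_rewards, gamma=0.99):
    # what standard RL training selects for: reward summed over the
    # steps of one episode only
    return sum(gamma ** t * r for t, r in enumerate(episode_rewards))

def multi_episode_return(rewards_by_episode, gamma=0.99):
    # what something like population-based training can effectively
    # select for: reward aggregated across episodes as well
    return sum(single_episode_return(ep, gamma) for ep in rewards_by_episode)

print([single_episode_return(ep) for ep in rewards_by_episode])
print(multi_episode_return(rewards_by_episode))
```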

In standard reinforcement learning, though, you’re just selecting for the single-episode return. But even if you’re just selecting for the single-episode return, you have the same general problem that we’ve been talking about, where it’s not the case that you necessarily end up with a model that only cares about the single-episode return. You’re just going to end up with a model that behaves like it cares about the single-episode return on the training distribution. And if you imagine a situation where you have a model that’s been trained to optimize for this objective over many steps, and it then discovers that there are actually other episodes beyond the current one, I think that the most natural generalization is that you should expect the model is going to care about those other episodes too.

You can imagine a situation where every time the model has seen a blue door, it’s been like, “Oh, going through this blue door is really good,” and it’s learned an objective that incentivizes going through blue doors. If it then later realizes that there are more blue doors than it thought, because there are other blue doors in other episodes, I think you should generally expect it’s going to care about those blue doors as well. But again, even if you don’t buy that, it’s not just about what’s most natural. It’s also about: what does the gradient descent process do? And because models which have this property, that they plan across episodes, can become deceptive and therefore do really well on the training distribution, you should expect that a good training process is going to modify them to become deceptive, even if they wouldn’t naturally.

Lucas Perry: Okay. So help me just understand a little bit more here. At a high level, it seems that the internalization and the corrigibility things require higher specification than the deception version. You’ve brought up all these different arguments about why you think that the deceptive version is easier to find because it has all of these computational benefits in terms of minimizing the loss function, right?

Evan Hubinger: That’s right.

Lucas Perry: And the structure of the deceptive model is composed of, as we mentioned earlier, one: the knowledge that it is within some kind of optimization process where it is being perturbed by gradient descent over episodes, and that it cares about the current proxy over episodes. And it seems that you’re arguing, correct me if I’m wrong, the manifestation of these components is that the strong desire to deceive leads to a very strong, direct attempt at actually minimizing the loss or accomplishing the objective function?

Evan Hubinger: That’s right.

Lucas Perry: Because if it fails at doing that, it won’t be able to accomplish the proxy that it actually cares about later?

Evan Hubinger: That’s right. If there are more deployment episodes than training episodes, and it just cares about how many times it goes through the blue door or the green arrow, or whatever, say the green arrow is its proxy objective, then if it gives up the fact that it’s optimizing for the green arrow, it’s going to get modified to not do that anymore. And so if it wants to go to the green arrow in all of these different deployment episodes, it needs to first, in the training procedure, pretend that it doesn’t care about the green arrow, and then once it’s deployed, switch to going after the green arrow. This also requires the model to know when it’s in training and when it’s in deployment.

This is another situation where I think you should expect that when you’re training really advanced models with these huge world models, that they figure out this information, but also even if it doesn’t figure out that information, it can gamble. It can be like, “Well, there’s some chance I’m in deployment. There’s some chance I’m in training. I will just opt to go to the green arrow some percentage of the time and do the correct thing some other percentage of the time.” And this gives you a decision problem which you can solve mathematically. You get some optimal defection probability.
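
As a rough sketch of the kind of decision problem being described, here is one toy formalization (my simplification, not Evan’s exact math): suppose there are T indistinguishable training episodes and D deployment episodes, defecting during training gets the proxy trained away (zero future proxy reward), and each deployment defection yields one unit of proxy reward. The expected proxy reward for a per-episode defection probability q is then roughly (1 - q)^T * D * q, which is maximized at q = 1 / (T + 1):

```python
# Toy model of the defection gamble; all numbers are illustrative.

def expected_proxy_reward(q, T=1000, D=10_000):
    survive_training = (1 - q) ** T   # never defects while still being trained
    deployment_payoff = D * q         # expected defections once deployed
    return survive_training * deployment_payoff

def optimal_defection_probability(T=1000):
    # from setting the derivative of (1 - q)**T * q to zero
    return 1 / (T + 1)

qs = [i / 10_000 for i in range(1, 100)]
best_q = max(qs, key=expected_proxy_reward)
print(best_q, optimal_defection_probability())  # both close to ~0.001
```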

Lucas Perry: Okay. And then going up one step in terms of specification, the corrigible agent needs to have a very detailed world model where it has a pointer, set via gradient descent, to some specific part of that world model that seems to correlate highly with minimizing the loss.

Evan Hubinger: That’s right.

Lucas Perry: Why is that less likely?

Evan Hubinger: Well, the two major reasons that I want to argue for are, one: getting the pointer right is really hard and very difficult to do robustly, whereas getting the deception right is much easier to do robustly. Think about animal imprinting: if the animal imprints on a human caretaker instead, you have a misspecified pointer, and it doesn’t actually give very good performance. Deceptive alignment is more robust at figuring out what it’s supposed to be optimizing for, because corrigible alignment relies on this pointer working successfully. And if the pointer fails, it’s just going to optimize for the wrong thing.

Lucas Perry: But why can’t the pointer change just as well as the attempt at satisfying the actual loss function in the deceptive model?

Evan Hubinger: It absolutely can. But the question I want to ask is: how difficult is it for gradient descent to make one sequence of modifications versus another, the sequence of modifications that leads to corrigibility versus the sequence that leads to deceptive alignment? The sequence of modifications necessary to produce corrigible alignment requires all of this very, very careful pointer tuning. If you’re considering an infinitesimal movement in the direction of corrigible alignment, imagine replacing your current proxy with a bad pointer: it’s not going to be that great. You’re going to have to put a lot of careful optimization effort into getting a really good pointer before it becomes better to be corrigible.

Whereas with deception, the model could build up this really good world model. And once the world model gets complicated enough that it includes a specification of the fact that it’s in a training process, as soon as it starts thinking about the training process, it’s going to be as robust as it can be because it’s going to instrumentally just want to try to figure out what is the best possible approximation I can get for this thing I’m being trained for, so I can optimize it as effectively as possible. And so this gives you a much more robust pointer compared to the corrigible alignment case.

Lucas Perry: And so the danger with the deceptive version would be: once it’s unleashed upon the world and it has a very excellent world model, it realizes this and thinks, finally, now I can do the deceptive turn, or something like that, to actually optimize the proxy?

Evan Hubinger: That’s right, yeah.

Lucas Perry: So we’ve covered a lot of the ways in which inner alignment fails. Now, inner alignment and outer alignment are two of the things which you care about for evaluating proposals for building safe and advanced AI. There are two other properties that you care about in training procedures for building beneficial AI: one of these is training competitiveness and the second is performance competitiveness. Could you explain what training competitiveness and performance competitiveness are and why they’re both important?

Evan Hubinger: Absolutely, yeah. So I mentioned at the beginning that I have a broad view of AI alignment where the goal is to try to mitigate AI existential risk. And I mentioned that what I’m working on is focused on this intent alignment problem, but a really important facet of that problem is this competitiveness question. We don’t want to produce AI systems which are going to lead to AI existential risks, and so we don’t want to consider proposals which are directly going to cause problems. As the safety community, what we’re trying to do is not just come up with ways to not cause existential risk. Not doing anything doesn’t cause existential risk. It’s to find ways to capture the positive benefits of artificial intelligence, to be able to produce AIs which are actually going to do good things. Why do we actually try to build AIs in the first place?

We’re actually trying to build AIs because we think that there’s something that we can produce which is good, because we think that AIs are going to be produced on a default timeline and we want to make sure that we can provide some better way of doing it. And so the competitiveness question is about how we produce AI proposals which actually reduce the probability of existential risk. Not just proposals that don’t themselves cause existential risk, but proposals that actually reduce the overall probability of it for the world. There are a couple of different ways that can happen. You can have a proposal which improves our ability to produce other safe AI: we produce some aligned AI, and that aligned AI helps us build other AIs which are even more aligned and more powerful. We can also maybe produce an aligned AI where producing that aligned AI helps provide an example to other people of how you can do AI in a safe way, or maybe it provides some decisive strategic advantage, which enables you to successfully ensure that only good AI is produced in the future.

There are a lot of different possible ways in which you could imagine building an AI leading to reduced existential risk, but competitiveness is going to be a critical component of any of those stories. You need your AI to actually do something. And so I like to split competitiveness into two different subcomponents, which are training competitiveness and performance competitiveness. And in the overview of 11 proposals document that I mentioned at the beginning, I compare 11 different proposals for prosaic AI alignment on the four qualities of outer alignment, inner alignment, training competitiveness, and performance competitiveness. So training competitiveness is this question of: how hard is it to train a model to do this particular task? It’s fundamentally a question of: if you have some team with some lead over all the other possible AI teams, can they build the proposal that we’re thinking about without totally sacrificing that lead? How hard is it to actually spend a bunch of time and effort and energy and compute and data to build an AI according to some particular proposal?

And then performance competitiveness is the question of once you’ve actually built the thing, how good is it? How effective is it? What is it able to do in the world that’s really helpful for reducing existential risk? Fundamentally, you need both of these things. And so you need all four of these components. You need outer alignment, inner alignment, training competitiveness, and performance competitiveness if you want to have a prosaic AI alignment proposal that is aimed at reducing existential risk.

Lucas Perry: This is where a bit more reflection on governance comes in, considering which training procedures and models are able to satisfy the criteria for building safe advanced AI in a world of competing actors and different incentives and preferences.

Evan Hubinger: The competitiveness stuff definitely starts to touch on all of those sorts of questions. When you take a step back and you think about how to have an actual full proposal for building prosaic AI in a way which is going to be aligned and do something good for the world, you have to really consider all of these questions. And so that’s why I tried to look at all of these different things in the document that I mentioned.

Lucas Perry: So in terms of training competitiveness and performance competitiveness, are these the kinds of things which are best evaluated from within leading AI companies and then explained to say people in governance or policy or strategy?

Evan Hubinger: It is still sort of a technical question. We need to have a good understanding of how AI works, how machine learning works, what the difficulty is of training different types of machine learning models, what the expected capabilities are of models trained under different regimes, as well as the outer alignment and inner alignment that we expect will happen.

Lucas Perry: I guess I imagine the coordination here is that information on relative training competitiveness and performance competitiveness in systems is evaluated within AI companies and then possibly fed to high power decision makers who exist in strategy and governance for coming up with the correct strategy, given the landscape of companies and AI systems which exist?

Evan Hubinger: Yeah, that’s right.

Lucas Perry: All right. So we have these intent alignment problems. We have inner alignment and we have outer alignment. We’ve learned about that distinction today, and reasons for caring about training and performance competitiveness. Part of the purpose of this is in the title of the post that partially motivated this conversation, “An overview of 11 proposals for building safe advanced AI.” You evaluate these proposals based on these criteria, as we mentioned. So I want to take this time to talk about how optimistic you are about, say, your top few favorite proposals for building safe and advanced AI, and how you’ve roughly evaluated them on these four criteria of inner alignment, outer alignment, and then performance and training competitiveness.

Evan Hubinger: I’ll just touch on some of the ones that I think are most interesting to start by taking a look at. And it’s not necessarily just going to be the ones I think I’m most excited about, because some are more complicated than others. I want to start with some of the simpler ones and some of the ones that I am fairly excited about and I think are interesting, but that also are pedagogically useful to take a look at. I’ll start with amplification, which is the approach that I probably spend the most time thinking about.

The specific form of amplification that I want to start by talking about is imitative amplification, where imitative amplification specifically means the procedure of training a model to imitate a human with access to that model. The goal is to produce models imitating this HCH tree, which I mentioned earlier, this tree of humans consulting humans. The model is trained to mimic a human consulting the model, but the models that that human is consulting were themselves trained to mimic humans consulting models, which were themselves trained to mimic humans consulting models. So in the limit, you’re hoping to end up with a model trained to mimic this tree of humans consulting humans consulting humans, which is HCH, for “humans consulting HCH.”

The specific way I talk about this in the overview of 11 proposals, though, is that there’s an additional thing which you can do on top of this imitative objective: we also want to add some form of inner alignment solution. We might think that imitative amplification is going to be good in the limit, that it’s going to be good if we can actually produce models which are really trying to predict HCH, but we also have to actually ensure that we’re doing that, that we don’t get deceptive models or other sorts of bad inner misalignment problems.

And so we can add, on top of imitative amplification, something which I call relaxed adversarial training. I have another post, which came out of some of the work I did at OpenAI, that really goes into detail on how you would do something like relaxed adversarial training, what it means, and how it works. But fundamentally, the idea behind relaxed adversarial training is to train the model both on the base objective of trying to imitate the human consulting the model, and also on an additional objective of trying to decrease the probability, as assessed by some overseer looking at the model, that the model is doing something bad or is deceptive. In this case, in the case of amplification, the overseer would be the human with access to the model.

So you’re training overall on the objective of mimicking the human with access to the model, and also making it so that when that human, with the help of the model, uses transparency tools to look at the model and see what it’s doing, that human believes the model is doing the right thing. And so you’re combining training on the imitative objective with training on the relaxed adversarial training objective, trying to make it so that this oversight works.
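
As a minimal sketch of how these two training signals might be combined (assuming the imitation loss and the overseer’s assessment have already been computed elsewhere; the function name, arguments, and weighting below are illustrative, not taken from the post):

```python
def relaxed_adversarial_training_loss(imitation_nll, p_unacceptable, lam=1.0):
    """Combine the imitative objective with an oversight term (toy sketch).

    imitation_nll:   negative log-likelihood of the amplified human's answer
                     under the model (the base imitation objective).
    p_unacceptable:  the overseer's estimated probability that the model is
                     doing something unacceptable, e.g. is non-myopic, as
                     judged by the human with transparency tools
                     (a placeholder quantity).
    lam:             weight trading off imitation against passing oversight.
    """
    return imitation_nll + lam * p_unacceptable

print(relaxed_adversarial_training_loss(2.3, 0.05, lam=10.0))  # toy numbers
```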

Lucas Perry: How do you evaluate that in terms of your criteria?

Evan Hubinger: We can start with outer alignment. The question of outer alignment is: is this HCH tree aligned? And the hope is that because HCH is just composed of humans, if we can understand the human behavior and ensure that those humans are doing the right thing, which we might hope to do, since we’ve solved lots of human coordination problems in the past, then we can ensure that this thing is outer aligned.

For inner alignment, inner alignment is this question of relaxed adversarial training. In particular, the way I think about solving relaxed adversarial training in this situation is that you have this big model space and you have some set of deceptive models. And if you want to avoid that set of deceptive models, one thing you can do is find some larger set, which contains all the deceptive models, such that it’s very easy to check whether a model is in that set or not. As long as the region outside of the set contains enough competitive models that are able to do the right thing, your relaxed adversarial training will work.

There’s a bunch of different ideas for how you could define a set like this, but the one which I’m most excited about, which I think about the most is to have the set be based on the distinction of myopia. Where you run this relaxed adversarial training check by looking at the model, determining whether the model cares about its multi-episode return or not. If the model does care about its multi-episode return, then it has the potential to become deceptive. And so we don’t allow that.

We have it so that if the overseer sees that the model cares about the multi-episode return, oversight fails. And if it sees that the model doesn’t care about the multi-episode return, then the oversight passes. And then we train the model to cause the oversight to pass, so that it doesn’t develop this behavior where it cares about the multi-episode return. That’s the basic idea of how you could use relaxed adversarial training here to solve inner alignment.

And then you have training competitiveness: it’s sort of an imitation learning problem. We’re trying to train the model to imitate the human with access to the model. There are all sorts of potential problems here. It could be very difficult to get a bunch of human data for this problem. Also, if you’re just doing imitation learning, there are some concerns about whether the model is really going to learn this sort of powerful long-term optimization, especially if you’re specifically selecting it to be myopic. There’s also a question of how rich the reward signal produced by imitation learning is. Imitation learning uses some sort of fixed distance metric to determine the distance between what the model predicted and what the human with access to the model actually produced.

Generally, this is done by having the language model produce a probability distribution, and then determining the probability under that distribution of the sentences produced by the human with access to the model. And so you have this question of how rich the training signal is and how difficult the training is. I think we really don’t know what the training competitiveness is going to be, but I think there’s reason to be optimistic. A lot of these sorts of simple language modeling tasks, like GPT-2 and GPT-3, have been very successful, and this is very much in that vein, so you can hope that it will be similarly successful.
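
A toy sketch of that training signal, using a made-up lookup-table “language model” rather than a real one (everything below is an invented illustration):

```python
import math

# Score the answer given by the human (with model access) by its probability
# under the model's distribution over next tokens; the negative log-likelihood
# is what gradient descent would push down in imitative training.

toy_model = {
    # hypothetical conditional next-token probabilities
    ("what", "is"): {"amplification": 0.2, "HCH": 0.1, "cheese": 0.7},
    ("is", "amplification"): {"training": 0.6, "cheese": 0.4},
}

def sequence_nll(tokens, model):
    # negative log-likelihood of the human-produced sequence under the model
    nll = 0.0
    for i in range(2, len(tokens)):
        context = (tokens[i - 2], tokens[i - 1])
        nll -= math.log(model[context][tokens[i]])
    return nll

human_with_model_answer = ["what", "is", "amplification", "training"]
print(sequence_nll(human_with_model_answer, toy_model))
```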

And then you have performance competitiveness, and for performance competitiveness, the real question is: how powerful is this HCH thing? Can HCH actually solve all the problems you want it to solve? And this is going to be a question of how you train your humans. Can the human take any question and decompose it into different sub-questions such that the answers to those sub-questions can be recombined into an answer to the whole question? Is this possible for all questions? If some questions are undecomposable, then it wouldn’t be competitive, but if you can decompose all of these questions, it should have performance competitiveness.

The one that I’ve been talking about, imitative amplification plus relaxed adversarial training, is the one that I work on the most, and it’s probably the one I’m most optimistic about. There are others that are also really interesting. I think every proposal on the list has lots of strong arguments in favor of it and lots of real reasons to consider it.

Another proposal which I have talked about previously is AI safety via debate. So AI safety via debate has some similar properties to amplification, but it has a different objective. Rather than trying to mimic HCH, you’re trying to do this debate procedure where you have two different models trying to defend two different sides of some proposition, and convince the judge to agree with one of those sides. There’s some strong arguments for why this sort of procedure would have nice outer alignment properties because it converges to something like what the judge’s equilibrium beliefs are after having seen all of the arguments in favor. And so we might hope that this sort of equilibrium beliefs are going to be good.

There’s also a lot of work that’s been done to try to improve the ability to access the true equilibrium beliefs, to really give the judge all possible information. There’s some work by Paul Christiano and Beth Barnes on this cross-examination mechanism, which gives debate significantly better performance competitiveness properties by allowing it to access a larger complexity class. But then the other question is: what can you do on top of AI safety via debate to help with inner alignment?

An interesting thing that you can do is you can have the models trained to use transparency tools on each other to win the debate. If one model is able to look inside the other model and demonstrate that it’s non-myopic or something, then it wins the debate. And so then you can incentivize the models to play this game of trying to look inside of each other and find the inner alignment problems, and then train away from those problems.

For outer alignment, you really have this question of: are the debaters actually incentivized to give honest, cooperative, helpful answers in the limit? And this is a question of what that equilibrium really looks like. Is it the case that humans are able to be tricked by really persuasive arguments that no counterargument will undo? Or is it the case that when humans are exposed to all of the best arguments and counterarguments, they come to the right conclusions? We don’t really know, but this is fundamentally the outer alignment question here. This is a trickier equilibrium to analyze.

In the amplification case, we just had to reason about this concrete tree of humans, but with debate, we have to understand something more abstract: what is the theoretical equilibrium behavior of a human exposed to all of these different arguments? I think there’s a case that, from an outer alignment perspective, debate is at least more questionable.

Then we have inner alignment, and the situation here is: how successfully can we incentivize these models to inspect each other and identify catastrophic behavior? There are certainly some problems here. You have to get a good training signal to be able to produce models which are good at inspecting each other, which probably means you have to produce deceptive models and then have the models find the deception, or at least produce non-myopic models and then have the models find the non-myopia. But then this could also incentivize the models to hide the non-myopia, so it’s a little bit unclear exactly how this works. It’s going to depend on precisely the properties of the training procedure.

I think there are some good reasons to be hopeful here for training competitiveness, because the way the debate game is structured lets you use a self-play setup, where you have these agents competing on a zero-sum game in self-play. This is in many ways similar to big successes like AlphaZero, where AlphaZero is able to solve all these environments by doing very successful self-play on zero-sum games.
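
As a heavily simplified sketch of that self-play structure (the debater and judge below are placeholders, not real trained models or human judges):

```python
import random

# Two copies of a policy alternate arguments on a question, a judge picks a
# winner, and the zero-sum rewards (+1 / -1) are what self-play training
# would optimize, much like self-play in other two-player zero-sum games.

def debater(question, transcript, side):
    # placeholder policy: a real debater would be a trained model
    return f"argument for side {side} given {len(transcript)} prior arguments"

def judge(question, transcript):
    # placeholder judge: a real judge would be a human (or a model of one)
    return random.choice([0, 1])

def play_debate(question, num_rounds=3):
    transcript = []
    for _ in range(num_rounds):
        for side in (0, 1):
            transcript.append((side, debater(question, transcript, side)))
    winner = judge(question, transcript)
    rewards = {winner: +1, 1 - winner: -1}   # zero-sum
    return transcript, rewards

_, rewards = play_debate("Should we adopt this policy proposal?")
print(rewards)
```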

The biggest problem here is probably getting data, and then there’s the performance competitiveness question, which is how effective would an actual debate agent be? The hope is that in the limit, you should just be able to ask the first debater, what do you think? And then it’ll just choose the most convincing answer, and then you can just go with that. You don’t have to run the whole debate in deployment, but it’s only going to be able to solve these language problems.

What it’s going to give you is the equilibrium of what a human thinks after seeing all of these different arguments, and is that good enough? Is it the case that humans are really going to be able to come to good enough equilibria after they see all these arguments that they’re going to be able to produce really good answers? And also, is question answering alone sufficient to be competitive in a potentially very competitive marketplace?

A third proposal that I think is interesting to go into is something called microscope AI. Microscope AI is really interesting to look at because it’s very different from the other proposals that I was just talking about. It has a very different approach to thinking about how we solve these sorts of problems. For all of these approaches, we need to have some ability to look inside of our models and learn something about what the model knows. But when you use transparency tools to look inside of the model, it teaches you multiple things. It teaches you about the model: you learn about what the model has learned. But it also teaches you about the world, because the model learned a bunch of useful facts, and if you look inside the model and learn those facts yourself, then you become more informed. And so this process itself can be quite powerful.

That’s fundamentally the idea of microscope AI. The idea of microscope AI is to train a predictive model on the data you want to understand, and then use transparency tools to understand what that model learned about that data, and then use that understanding to guide human decision making. And so if you’re thinking about outer alignment, in some sense, this procedure is not really outer aligned because we’re just trying to predict some data. And so that’s not really an aligned objective. If you had a model that was just trying to do a whole bunch of prediction, it wouldn’t be doing good things for the world.
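
Schematically, the microscope AI workflow might be sketched like this (all three functions are hypothetical placeholders standing in for real training and transparency tooling, not an actual implementation):

```python
def train_predictive_model(dataset):
    # fit a purely predictive model (no agent, no open-ended optimizer)
    return {"learned_features": ["made-up feature 1", "made-up feature 2"]}

def apply_transparency_tools(model):
    # extract human-readable structure from what the model learned
    return model["learned_features"]

def microscope_ai(dataset):
    model = train_predictive_model(dataset)
    insights = apply_transparency_tools(model)
    # the insights go to humans, who make the actual decisions;
    # the model itself is never deployed as an agent
    return insights

print(microscope_ai(dataset=["raw records you want to understand"]))
```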

But the hope is that if you’re just training a predictive model, it’s not going to end up being deceptive or otherwise dangerous. And you can also use transparency tools to ensure that it doesn’t become that. We still have to solve inner alignment, like I was saying. It still has to be the case that you don’t produce deceptive models. And in fact, the goal here really is not to produce mesa optimizers at all. The goal is just to produce these predictive systems, which learn a bunch of useful facts and information, but that aren’t running optimization procedures. And hopefully we can do that by having this very simple, predictive objective, and then also by using transparency tools.

And then training competitiveness: we know how to train powerful predictive models now. Something like GPT-2, and now GPT-3, these are predictive models trained on text prediction. So we know this process, we know that we’re very good at it, and so hopefully we’ll be able to continue to be good at it into the future. The real sticking point with microscope AI is the performance competitiveness question. Is enhanced human understanding actually going to be sufficient for the use cases we might want advanced AI for? I don’t know. It’s really hard to know the answer to this question, but you can imagine some situations where it is and some situations where it isn’t.

So, for situations where you need to do long-term, careful decision making, it probably would be, right? If you want to replace CEOs, or whatever, that’s a very general decision-making process that can be significantly improved just by having much better human understanding of what’s happening. You don’t necessarily need the AI to be making the decision. On the other hand, if you need fine-grained manipulation tasks or very, very quick response times, AIs managing a factory or something, then maybe this wouldn’t be sufficient, because you would need the AIs to be doing all of this quick decision making, and you couldn’t have the system just passing information to a human.

One specific situation which I think is important to think about is using your first AI system to help build a second AI system, and making sure that second AI system is aligned and competitive. I think microscope AI also performs pretty well there. You could use a microscope AI to get a bunch of information about AIs, how they work, and how training works, have the humans learn that information, and then use that information to improve our building of the next AIs and other AIs that we build.

There are certain situations where microscope AI is performance competitive, situations where it wouldn’t be performance competitive, but it’s a very interesting proposal because it’s sort of tackling it from a very different angle. It’s like, well, maybe we don’t really need to be building agents. Maybe we don’t really need to be doing this stuff. Maybe we can just be building this microscope AI. I should mention the microscope AI idea comes from Chris Olah, who works at OpenAI. The debate idea comes from Geoffrey Irving, who’s now at DeepMind, and the amplification comes from Paul Christiano, who’s at OpenAI.

Lucas Perry: Yeah, so for sure, the best place to review these is by reading your post. And again, the post is “An overview of 11 proposals for building safe advanced AI” by Evan Hubinger and that’s on the AI Alignment Forum.

Evan Hubinger: That’s right. I should also mention that a lot of the stuff that I talked about in this podcast is coming from the Risks from Learned Optimization in Advanced Machine Learning Systems paper.

Lucas Perry: All right. Wrapping up here, I’m interested in ending on a broader note. I’m just curious to know if you have concluding thoughts about AI alignment, how optimistic are you that humanity will succeed in building aligned AI systems? Do you have a public timeline that you’re willing to share about AGI? How are you feeling about the existential prospects of earth-originating life?

Evan Hubinger: That’s a big question. So I tend to be on the pessimistic side. My current view, looking out on the field of AI and the field of AI safety, is that there are a lot of really challenging, difficult problems that we are at least not currently equipped to solve, and it seems quite likely that we won’t be equipped to solve them by the time we need to. I tend to think that the prospects for humanity aren’t looking great right now, but I nevertheless have a very optimistic disposition: we’re going to do the best that we can. We’re going to try to solve these problems as effectively as we possibly can, and we’re going to work on it, and hopefully we’ll be able to make it happen.

In terms of timelines, it’s such a complex question. I don’t know if I’m willing to commit to some timeline publicly. I think that it’s just one of those things that is so uncertain. It’s just so important for us to think about what we can do across different possible timelines and to focus on things which are generally effective regardless of how it turns out, because I think we’re really quite uncertain. It could be as soon as five years or as far away as 50 or 70 years; we really don’t know.

I don’t know if we have great track records of prediction in this setting. Regardless of when AI comes, we need to be working to solve these problems and to get more information on these problems, to get to the point we understand them and can address them because when it does get to the point where we’re able to build these really powerful systems, we need to be ready.

Lucas Perry: So you do take very short timelines, like say 5 to 10 to 15 years very seriously.

Evan Hubinger: I do take very short timelines very seriously. I think that if you look at the field of AI right now, there are these massive organizations, OpenAI and DeepMind that are dedicated to the goal of producing AGI. They’re putting huge amounts of research effort into it. And I think it’s incorrect to just assume that they’re going to fail. I think that we have to consider the possibility that they succeed and that they do so quite soon. A lot of the top people at these organizations have very short timelines, and so I think that it’s important to take that claim seriously and to think about what happens if it’s true.

I wouldn’t bet on it. There’s a lot of analysis that seems to indicate that at the very least, we’re going to need more compute than we have in that sort of a timeframe, but timeline prediction tasks are so difficult that it’s important to consider all of these different possibilities. I think that, yes, I take the short timelines very seriously, but it’s not the primary scenario. I think that I also take long timeline scenarios quite seriously.

Lucas Perry: Would you consider DeepMind and OpenAI to be explicitly trying to create AGI? OpenAI, yes, right?

Evan Hubinger: Yeah. OpenAI, it’s just part of the mission statement. DeepMind, some of the top people at DeepMind have talked about this, but it’s not something that you would find on the website the way you would with OpenAI. If you look at historically some of the things that Shane Legg and Demis Hassabis have said, a lot of it is about AGI.

Lucas Perry: Yeah. So in terms of these being the leaders with just massive budgets and person power, how do you see the quality and degree of alignment and beneficial AI thinking and mindset within these organizations? Because there seems to be a big distinction between the AI alignment crowd and the mainstream machine learning crowd. A lot of the mainstream ML community hasn’t been exposed to many of the arguments or thinking within the safety and alignment crowd. Stuart Russell has been trying hard to shift away from the standard model and incorporate a lot of these new alignment considerations. So yeah. What do you think?

Evan Hubinger: I think this is a problem that is getting a lot better. Like you were mentioning, Stuart Russell has been really great on this. CHAI has been very effective at trying to really get this message of, we’re building AI, we should put some effort into making sure we’re building safe AI. I think this is working. If you look at a lot of the major ML conferences recently, I think basically all of them had workshops on beneficial AI. DeepMind has a safety team with lots of really good people. OpenAI has a safety team with lots of really good people.

I think the standard story, that AI safety is just this thing that people who aren’t involved in machine learning think about, no longer really holds; in the current world AI safety has become much more integrated with machine learning and is becoming more mainstream. But it’s definitely still a process. Like Stuart Russell says, the field of AI has been very focused on the standard model, and moving people away from that and getting them to think about some of its consequences takes time and takes some evolution of the field, but it is happening. I think we’re moving in a good direction.

Lucas Perry: All right, well, Evan, I’ve really enjoyed this. I appreciate you explaining all of this and taking the time to unpack a lot of this machine learning language and concepts to make it digestible. Is there anything else here that you’d like to wrap up on or any concluding thoughts?

Evan Hubinger: If you want more detailed information on all of the things that I’ve talked about, the full analysis of inner alignment and outer alignment is in Risks from Learned Optimization in Advanced Machine Learning Systems by me, as well as many of my co-authors, as well as “an overview of 11 proposals” post, which you can find on the AI Alignment Forum. I think both of those are resources, which I would recommend checking out for understanding more about what I talked about in this podcast.

Lucas Perry: Do you have any social media or a website or anywhere else for us to point towards?

Evan Hubinger: Yeah, so you can find me on all the different sorts of social media platforms. I’m fairly active on GitHub. I do a bunch of open source development. You can find me on LinkedIn, Twitter, Facebook, all those various different platforms. I’m fairly Google-able. It’s nice to have a fairly unique last name. So if you Google me, you should find all of this information.

One other thing, which I should mention specifically, everything that I do is all public. All of my writing is public. I try to publish all of my work and I do so on the AI Alignment Forum. So the AI Alignment Forum is a really, really great resource because it’s a collection of writing by all of these different AI safety authors. It’s open to anybody who’s a current AI safety researcher, and you can find me on the AI Alignment Forum as evhub, I’m E-V-H-U-B on the AI Alignment Forum.

Lucas Perry: All right, Evan, thanks so much for coming on today, and it’s been quite enjoyable. This has probably been one of the more fun AI alignment podcasts that I’ve had in a while. So thanks a bunch and I appreciate it.

Evan Hubinger: Absolutely. That’s super great to hear. I’m glad that you enjoyed it. Hopefully everybody else does as well.

End of recorded material

Steven Pinker and Stuart Russell on the Foundations, Benefits, and Possible Existential Threat of AI

 Topics discussed in this episode include:

  • The historical and intellectual foundations of AI 
  • How AI systems achieve or do not achieve intelligence in the same way as the human mind
  • The rise of AI and what it signifies 
  • The benefits and risks of AI in both the short and long term 
  • Whether superintelligent AI will pose an existential risk to humanity

You can take a survey about the podcast here

Submit a nominee for the Future of Life Award here

 

Timestamps: 

0:00 Intro
4:30 The historical and intellectual foundations of AI
11:11 Moving beyond dualism
13:16 Regarding the objectives of an agent as fixed
17:20 The distinction between artificial intelligence and deep learning
22:00 How AI systems achieve or do not achieve intelligence in the same way as the human mind
49:46 What changes to human society does the rise of AI signal?
54:57 What are the benefits and risks of AI?
01:09:38 Do superintelligent AI systems pose an existential threat to humanity?
01:51:30 Where to find and follow Steve and Stuart

 

Works referenced: 

Steven Pinker’s website and his Twitter

Stuart Russell’s new book, Human Compatible: Artificial Intelligence and the Problem of Control

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Note: The following transcript has been edited for style and clarity.

 

Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today, we have a conversation with Steven Pinker and Stuart Russell. This episode explores the historical and intellectual foundations of AI, how AI systems achieve or do not achieve intelligence in the same way as the human mind, the benefits and risks of AI over the short and long-term, and finally whether superintelligent AI poses an existential risk to humanity. If you’re not currently following this podcast series, you can join us by subscribing on Apple Podcasts, Spotify, Soundcloud, or on whatever your favorite podcasting app is by searching for “Future of Life.” Our last episode was with Sam Harris on global priorities. If that sounds interesting to you, you can find that conversation wherever you might be following us. 

I’d also like to echo two announcements for the final time. So, if you’ve been tuned into the podcast recently, you can skip ahead just a bit. The first is that there is an ongoing survey for this podcast where you can give me feedback and voice your opinion about content. This goes a long way for helping me to make the podcast valuable for everyone. This survey should only come out once a year. So, this is a final call for thoughts and feedback if you’d like to voice anything. You can find a link for the survey about this podcast in the description of wherever you might be listening. 

The second announcement is that at the Future of Life Institute, we are in the midst of our search for the 2020 winner of the Future of Life Award. The Future of Life Award is a $50,000 prize that we give out to an individual who, without having received much recognition at the time of their actions, has helped to make today dramatically better than it may have been otherwise. The first two recipients of the Future of Life Award were Vasili Arkhipov and Stanislav Petrov, two heroes of the nuclear age. Both took actions at great personal risk to possibly prevent an all-out nuclear war. The third recipient was Dr. Matthew Meselson, who spearheaded the international ban on bioweapons. Right now, we’re not sure who to give the 2020 Future of Life Award to. That’s where you come in. If you know of an unsung hero who has helped to avoid global catastrophic disaster, or who has done incredible work to ensure a beneficial future of life, please head over to the Future of Life Award page and submit a candidate for consideration. The link for that page is on the page for this podcast or in the description of wherever you might be listening. If your candidate is chosen, you will receive $3,000 as a token of our appreciation. We’re also incentivizing the search via MIT’s successful red balloon strategy, where the first to nominate the winner gets $3,000 as mentioned, but there are also tiered payouts where the first to invite the nomination winner gets $1,500, whoever first invited them gets $750, whoever first invited them gets $375, and so on. You can find details about that on the Future of Life Award page. Link in the description. 

Steven Pinker is a Professor in the Department of Psychology at Harvard University. He conducts research on visual cognition, psycholinguistics, and social relations. He has taught at Stanford and MIT and is the author of ten books, including The Language Instinct, How the Mind Works, The Blank Slate, The Better Angels of Our Nature, The Sense of Style, and Enlightenment Now: The Case for Reason, Science, Humanism, and Progress. 

Stuart Russell is a Professor of Computer Science and holder of the Smith-Zadeh chair in engineering at the University of California, Berkeley. He has served as the vice chair of the World Economic Forum’s Council on AI and Robotics and as an advisor to the United Nations on arms control. He is an Andrew Carnegie Fellow as well as a fellow of the Association for the Advancement of Artificial Intelligence, the Association for Computing Machinery and the American Association for the Advancement of Science.

He is the author with Peter Norvig of the definitive and universally acclaimed textbook on AI, Artificial Intelligence: A Modern Approach. He is also the author of Human Compatible: Artificial Intelligence and the Problem of Control. 

And with that, here’s our conversation with Steven Pinker and Stuart Russell. 

Lucas Perry: So let’s get started here then. What are the historical and intellectual foundations upon which the ongoing AI revolution is built?

Steven Pinker: I would locate them in the Age of Reason and the Enlightenment, when Thomas Hobbes said, “Reasoning is but reckoning,” reckoning in the old-fashioned sense of “calculation” or “computation.” A century later, the two major styles of AI today were laid out: the neural network, or massively parallel interconnected system that is trained with examples and generalizes by similarity, and the symbol-crunching, propositional, “Good Old-Fashioned AI.” Both of those had adumbrations during the Enlightenment. David Hume, in the empiricist or associationist tradition, said there are only three principles of connection among ideas: contiguity in time or place, resemblance, and cause and effect. On the other side, you have Leibniz, who thought of cognition as the grinding of wheels and gears and what we would now call the manipulation of symbols. Of course the actual progress began in the 20th century with the ideas of Turing and Shannon and Weaver and Norbert Wiener. The rest is the history that Stuart writes about in his textbook and his recent book.

Stuart Russell: I think I would like to add in a little bit of ancient history as well, just because I think Aristotle not only thought a lot about how human thinking was organized and how it could be correct or incorrect and how we could make rational decisions, he very clearly describes a backward regression goal planner in one of his pieces, and his work was incredibly influential. One of the things he said is we deliberate about means and not about ends. I think he says, “A doctor does not choose whether to heal,” and so on. And you might disagree with that, but I think that that’s been a pretty influential thread in Western thinking for the last two millennia or more. That we kind of take objectives as given and the purpose of intelligence is to act in ways that achieve your objectives.

That idea got refined gradually. So Aristotle talked mainly about goals and logically provable sequences of actions that would achieve those goals. And then in the 17th and 18th centuries, I want to give a shout out to the French and the Swiss, so Pascal and Fermat and Arnauld and Bernoulli brought in ideas of rational decision making under uncertainty and the weighing of probabilities and the concept of utility that Bernoulli introduced. So that generalized Aristotle’s idea, but it didn’t change the fundamental principle that they took the objectives, the utilities, as given. Just intrinsic properties of a human being in a given moment.

In AI, we sort of went through the same historical development, except that we did the logic stuff for the first 30 years or so, roughly, and then we did the probability and decision theory stuff for the next 30 years. I think we’re in a terrible state now, because the vast majority of the deep learning community, when you read their papers, nothing is cited before 2012. Occasionally, from time to time, they’ll say things like, “For this problem, the learning algorithms that we have are probably inadequate, and in future I think we should direct some of our research towards something that we might call reasoning or knowledge,” as if no one had ever thought of those things before and they were the first person in history to ever have the idea that reasoning might be necessary for intelligence.

Steven Pinker: Yes.

Stuart Russell: I find this quite frustrating and particularly frustrating when students want to actually just bypass the AI course altogether and go straight to the deep learning course, because they just don’t think AI is necessary anymore.

Steven Pinker: Indeed, and also galling to me. In the late ’80s and ’90s I was involved in a debate over the applicability of the predecessors of deep learning models, then called multi-layer perceptrons, artificial neural networks, connectionist networks, and Parallel Distributed Processing networks. Gary Marcus and Alan Prince and Michael Ullman and other collaborators and I pointed out the limitations of trying to achieve intelligence–even for simple linguistic processes like forming the plural of a noun or a past tense of a verb–if the only tool you had available was the ability to associate features with features, without any symbol processing. That debate went on for a couple of decades and then petered out. But then one of the prime tools in the neural network community, multilayer networks trained by error back-propagation, was revived in 2012. Indeed there is an amnesia for the issues in that debate, which Gary Marcus has revived for a modern era.

It would be interesting to trace the truly radical idea behind artificial intelligence: not just that there are rules or algorithms, whether they are from logic or probability theory, that an intelligent agent can use, the way a human pulls out a smartphone. But the idea that there is nothing but rules or algorithms, and that’s what an intelligent agent consists of: that is, no ghost in the machine, no agent separate from the mechanism. And there, I’m not sure whether Aristotle actually exorcised the ghost in the machine. I think he did have a notion of a soul. The idea that it’s rules all the way down, that intelligence is just a mechanism, probably has shallow roots. Although Hobbes probably could claim credit for it, and perhaps Hume as well.

Lucas Perry: That’s an excellent point, Steve, it seems like Abrahamic religions have kind of given rise in part to this belief, or maybe an expression of that belief, the kind of mind-body dualism, the ghost in the machine where the mind seems to be a nonphysical thing. So it seems like intelligence has had to go down the same road as “life.” There used to be “élan vital” or some other spooky presupposed mechanism for giving rise to life. And so similarly with intelligence, it seems like we’ve had to move from thinking that there was a ghost in the machine that made the things work to there being rules all the way down. If you guys have anything else to add to that, I think that’d be interesting.

My other two reactions to what has been said so far are that this point about computer science taking the goal as given, I think is important and interesting, and maybe we could expand upon that a little bit. Then there’s also, Stuart mentioned the difference between AI and deep learning and that students want to skip the AI and just get straight to the deep learning. That seemed a little bit confusing to me.

Steven Pinker: Let me address the first part and I’ll turn it over to Stuart for the second. The notion of dualism–that there is a mechanism, but sitting on top of it is an immaterial agent or self or soul or I–is enshrined in the Abrahamic religions and in other religions, but it has deep intuitive roots. We are all intuitively dualists (Paul Bloom has made this argument in his book Descartes’ Baby). Fortunately, when we deal with each other in everyday life we don’t treat each other like robots or wind-up dolls, but we assume that there is an inner life that is much like ours, and we make sense of people’s behavior in terms of their beliefs and desires, which we don’t conceptualize as neural circuitry transforming patterns. We think there’s a locus of consciousness, which is easy to think of as separate from the flesh that we’re made of, especially since–and this is a point made by the 19th century British anthropologist Edward Tylor–there’s actually a lot of empirical “evidence” that supports dualism in our everyday life. Like dreaming.

When you dream, you know your body is in bed the whole time, but there’s some part of you that’s up and about in the world. When you see your reflection in a mirror or in still water, there is an animated essence that seems to have parted company with your body. When you’re in a trance from a drug or a fever and have an out-of-body experience, it seems  that we and our bodies are not the same thing. And with death, one moment a person is walking around, the next moment the body is lifeless. It’s natural to think that it’s lost some invisible ingredient that had animated it while it was alive.

Today we know that this is just the activity of the brain, but in terms of the experience available to a person, dualism seems perfectly plausible. It’s one of the great achievements of neuroscience, on the one hand,  to show that a brain is capable of supporting problem solving and perception and decision making, and of the computational sciences, on the other, for showing that intelligence can be understood in terms of information and computation, and that goals (like the Aristotelian final cause) can be understood in terms of control and cybernetics and feedback.

Stuart Russell: On the point that in computer science, we regard the objectives as fixed, it’s much broader than just computer science. If you look at von Neumann and Morgenstern and their characterizations of rationality, nowhere do they talk about what is the process by which the agent might rationally come by its preferences. The agent is always assumed a priori to come with the preferences built in, and the only constraint is that those preferences be self-consistent so that you can’t be driven around circles of intransitive preferences where you simply cough up money to go round and round the same circle.
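
As an illustrative aside (not part of the conversation), here is a minimal Python sketch of the “money pump” that intransitive preferences invite; the specific preference cycle and fee are assumptions chosen for illustration. An agent that prefers A to B, B to C, and C to A will pay a small fee for each “upgrade” and can be walked around the cycle indefinitely, ending up where it started but poorer.

```python
# Minimal sketch (illustrative only): why intransitive preferences are
# exploitable. The agent prefers A over B, B over C, and C over A, and
# will pay a small fee to swap what it holds for anything it prefers.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # cyclic, hence intransitive
FEE = 1.0  # assumed cost the agent pays for each preferred swap

def accepts_swap(holding, offered):
    """Return True if the agent accepts trading `holding` plus FEE for `offered`."""
    return (offered, holding) in prefers

holding, money_paid = "A", 0.0
for offered in ["C", "B", "A"] * 3:  # walk the preference cycle a few times
    if accepts_swap(holding, offered):
        holding = offered
        money_paid += FEE

print(holding, money_paid)  # back to "A", having paid 9.0 for nothing
```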

The same thing I think is true in control theory, where the objective is the cost function, and you design a controller that minimizes the expected cost function, which might be a square of the distance from the desired trajectory or whatever it might be. Same in statistics, where let’s just assume that there’s a loss function. There’s no discussion in statistics of what the loss function should be or how the loss function might change or anything like that.

So this is something that pervades many of the technological underpinnings of the 20th century. As far as I can tell, to some extent in developmental psychology, but I think in moral philosophy, people really take seriously the question of what goals we should have. Is it moral for an agent to have such and such as its objective, and how could we, for example, teach an agent to have different objectives? And that gets you into some very uncharted philosophical waters about what is a rational process that would lead an agent to have different objectives at the end than it did at the beginning, given that if it has different objectives at the end, then it can only expect that it won’t be achieving the objectives that it has at the beginning. So why would it embark on a process that’s going to result in failure to achieve the objectives that it currently has?

So that’s sort of a philosophical puzzle, but it’s a real issue because in fact human beings do change. We’re not born with the preferences that we have as adults, and so there is a notion of plasticity that absolutely has to be understood if we’re to get this right.

Steven Pinker: Indeed, and I suspect we’ll return to the point later when we talk about potential risks of advanced artificial intelligence. The issue is whether a system having intelligence implies that the system would have certain goals, and probably Stuart and I agree the answer is no, at least not by definition. Precisely because what you want and how to get what you want are two logically independent questions. Hume famously said that reason must be the slave of the passions, by which he didn’t mean that we should just surrender to our impulses and do whatever feels good. What he meant was that reason itself can’t specify the goals that it tries to bring about. Those are exogenous. And indeed, von Neumann and Morgenstern are often misunderstood as saying that we must be ruthlessly egotistical, self-interested maximizers. Whereas the goal that is programmed into us — say by evolution or by culture — could include other people’s happiness as part of our utility function. That is a question that merely making our choices consistent is silent on.

 So the ability to reason doesn’t by itself give you moral goals, including taking into account the interests of others. That having been said, there is a long tradition in moral philosophy which shows  how it doesn’t take much to go from one to the other. Because as soon as we care about persuading others, as soon as our interests depend on how others treat us, then we can’t get away with saying “only my interests count and yours don’t because I am me and you’re not,” because there is no logical difference between “me” and “you.” So we’re forced to a kind of impartiality, wherein whatever I insist on for me I’ve got to grant to you, a kind of Golden Rule or Categorical Imperative that makes our interests interchangeable as soon as we’re in discourse with one another.

This is all to acknowledge Stuart’s point, but to take it a few steps further in how it deals with the question of what our goals ought to be.

Stuart Russell: The other point you raised, Lucas, was on being confused by my distinction between AI and deep learning.

Lucas Perry: That’s right.

Stuart Russell: I think you’re pointing to a confusion that exists in the public mind, in the media and even in parts of the AI community. AI has always included machine learning as a subdiscipline, all the way back to Turing’s 1950 paper, where he speculates, in fact, that a good way to build AI might be to just start with a child program and train it to be an adult intelligent machine. But there are many other sub-fields of AI: knowledge representation, reasoning, planning, decision making under uncertainty, problem solving, perception. Machine learning is relevant to all of these because they all involve processes that can be improved through experience. So that’s what we mean by machine learning: simply the improvement of performance through experience; and deep learning is a technology that helps with that process.

By itself, as far as we can tell, it doesn’t have what is necessary to produce general intelligence. Just to pick one example, the idea that human beings know things seems so self-evident that we hardly need to argue about it. But deep learning systems in a real sense don’t know things. They can’t usefully acquire knowledge by reading a book and then go out and use that to design a radio telescope, which human beings arguably can. So it seems inevitable that if we’re going to make progress, sure, we take the advances that deep learning has offered. Effectively, what we’ve discovered with deep learning is that you can train more complicated circuits than we previously would have guessed possible using various kinds of stochastic gradient descent, and other tricks.

I think it’s true to say that most people would not have expected that you could build a thousand layer network that was 20,000 units wide. So it’s got 20 million circuit elements and simply put a signal in one end and some data in the other and expect that you’re going to be able to train those 20 million elements to represent the complicated function that you’re trying to get it to learn. So that was a big surprise, and that capability is opening up all kinds of new frontiers: in vision, in speech recognition, language, machine translation, and physical control in robots among other things. It’s a wonderful set of advances, but it’s not the entire solution. Any more than group theory is the entire solution to mathematics. There’s lots of other branches of mathematics that are exciting and interesting and important and you couldn’t function without them. The same is true for AI.

So I think that we’re probably going to see, even without further major conceptual advances, another decade of progress in achieving greater understanding of why deep learning works and how to do it better, and all the various applications that we can create using it. But I think if we don’t go back and then try to reintegrate all the other ideas of AI, we’re going to hit a wall. And so I think the sooner we lose our obsession with this new shiny thing, the better.
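
As an illustrative aside (not part of the conversation), the following toy sketch shows the idea Stuart describes of training a “complicated circuit” end to end with stochastic gradient descent. The network here is a tiny two-layer one fit to synthetic data; all sizes, the learning rate, and the target function are arbitrary choices for illustration, nothing like the thousand-layer networks he mentions.

```python
# Toy sketch: a small "circuit" (two-layer network) trained end to end
# with stochastic gradient descent on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))              # toy inputs
y = np.sin(X.sum(axis=1, keepdims=True))   # toy target function

W1 = rng.normal(scale=0.5, size=(4, 32))   # first-layer weights
W2 = rng.normal(scale=0.5, size=(32, 1))   # second-layer weights
lr = 0.05

for step in range(2000):
    idx = rng.integers(0, len(X), size=32)         # a random mini-batch
    h = np.tanh(X[idx] @ W1)                       # hidden activations
    pred = h @ W2
    err = pred - y[idx]
    loss = (err ** 2).mean()
    # Backpropagate the squared-error loss and take one SGD step.
    grad_pred = 2 * err / len(idx)
    grad_W2 = h.T @ grad_pred
    grad_h = grad_pred @ W2.T
    grad_W1 = X[idx].T @ (grad_h * (1 - h ** 2))   # tanh derivative
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

print("final mini-batch loss:", loss)
```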

Steven Pinker: I couldn’t agree more. Indeed, in some ways we have already hit the wall. Any user of Siri or Cortana or a question-answering system has been frustrated by the way they just make associations to individual words and have a shallow understanding of the syntax of the sentence. If you ask Google or Siri, “Can you show me digital music players without a camera?” It’ll give you a long list of music players with discussions of their cameras, failing to understand the syntax of “X without Y.” Or, “What are some fast food restaurants nearby that are not McDonald’s?”  and you get a list of nearby McDonald’s.

It’s not hard to bump into the limitations of systems that for all their sophistication are being trained on associations among local elements, and can–I agree, surprisingly–learn higher-order combinations of those elements. But despite the name “deep learning,” they are shallow in the sense that they don’t build up a knowledge base of what are the objects, and who did what to whom, which they can access through various routes.

Stuart Russell: Yeah. My favorite example, I’m not sure if it’s apocryphal, is you say to Siri, “Call me an ambulance,” and Siri says, “Okay. From now on I’ll call you Ann Ambulance.”

Steven Pinker: In a Marx Brothers movie, there’s the sequence, “Call me a taxi.” “Okay. You’re a taxi.” I don’t know if the AI story is an urban legend based on the Marx Brothers movie or whether life is imitating art.

Lucas Perry: Steven, I really appreciate it and liked that point about dualism and intelligence. I think it points in really interesting directions around identity in the self, which we don’t have time to get into here. But I did appreciate that.

So moving on ahead here, to what extent do you both see AI systems as achieving intelligence in the same way or not as the human mind does? What kinds of similarities are there or differences?

Stuart Russell: This is a really interesting question and we could spend the whole two hours just talking about this. So by artificial intelligence, I’m going to take it that we mean not deep learning, but the full range of techniques that AI researchers have developed over the years.

So some of them, logical reasoning for example, were developed going back to Aristotle and other Greek philosophers who developed formal logic to model human thinking. So it’s not surprising that when we build programs that do logical reasoning, we are in some sense capturing one aspect of human reasoning capability. Then in the ’80s, as I mentioned, AI developed reasoning under uncertainty, and then later on refined that with notions of causality as well, particularly in the work of Judea Pearl. The differences are really because AI and cognitive science separated probably sometime in the ’60s. I think before that there wasn’t really a clear distinction between whether you were doing AI or whether you were doing cognitive science. It was very much the thought that if you could get a program to do anything that we think of as requiring intelligence in a human, then you were in some sense exhibiting a possible theory of how the human does it, or even you would make introspective claims and say, “Look, I’ve now shown that this theory of intelligence really works.”

But fairly soon people said, “Look, this is not really scientific. If you want to make a claim about how the human mind does something, you have to base it on real psychological experimentation with human subjects.” And that’s distinct from the engineering goal of AI, which is simply to produce programs that demonstrate certain capabilities. So for most of the last 50, 60 years, these two fields have grown further and further apart. I think now, partly because of deep learning and partly because of other work, for example in probabilistic programming, we can start to do things that humans do that we couldn’t do before. So it becomes interesting again to ask, well, are humans really somewhat Bayesian, and are they doing the kind of Bayesian symbolic probabilistic program learning that, for example, Josh Tenenbaum was proposing, or are they doing something else? For example, Geoff Hinton is pretty adamant that, as he puts it, symbols are the luminiferous aether of AI, by which he means that they’re simply something that we imagined and they have no physical reality whatsoever in the human mind.

I find this a little hard to believe, and you have to wonder, if symbols don’t exist, why almost all deep learning applications are aimed at recognizing the symbolic category to which an object belongs, and I haven’t heard an answer yet from the deep learning community about why that is. But it’s also clear that AI systems are doing things that have no resemblance to human cognition. When you look at what AlphaGo is actually doing, part of it is that sort of perception-like ability to look at a position and get a sense, to use an anthropomorphic term, of its potential for winning for white or for black. And perhaps that part is human-like, and actually it’s incredibly good. It’s probably better at recognizing the potential of a position directly, with no deliberation whatsoever, than a human is.

But the other part of what AlphaGo does is completely non-human. It’s considering sequences of moves from the current state that run all the way to the end of the game. So part of it is searching in a tree which could go 40 or 50 or possibly more moves into the future. Then, from the end of the tree, it plays a random game all the way to the end and sees who wins that game. And this is nothing like what human beings do. When humans are reasoning about a game like Go or Chess, first of all, we are thinking about it at multiple levels of abstraction. So we’re thinking about the liveness of a particular group, we’re thinking about control of a particular region of territory on the board. We’re thinking, “Well, if I give up control of this territory, then I can trade it for capturing his group over there.”

So this kind of reasoning simply doesn’t happen in AlphaGo at all. We reason back from goals. In chess you say, “Perhaps I could trap his queen. Let me see if I can come up with a move that blocks his exit for the queen.” So we reason backwards from some goals and no chess program and no Go program does that kind of reasoning. The reason humans do this is because the world is incredibly complicated and in different circumstances, different kinds of cognitive processing are efficient and effective in producing good decisions quickly. And that’s the real issue for human intelligence, right?

If we didn’t have to worry about computation, then we would just set up the giant unknown, partially observable, Markov decision process of the universe, solve it and then we would take the first action in the virtually infinite strategy tree that solves that POMDP. Then we would observe the next percept, we would update all our beliefs about the universe and we would resolve the universe and that’s how we would proceed. We would have to do that sort of roughly every millisecond to control the muscles in our body, but we don’t do anything like that. All of the different kinds of mental capabilities that we have are deployed in this amazingly fluid way to get us through the complexity of the real world. We are so far away in AI from understanding how to do that, that when I see people say, “We’re just going to scale up our deep learning systems by another three orders of magnitude and we’ll be more intelligent than humans,” I just smile.
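
As an illustrative aside (not part of the conversation), here is a heavily simplified sketch of the “random rollout” evaluation Stuart describes, applied to a toy Nim game rather than Go, and without the deep tree search or learned evaluation that AlphaGo actually uses: from each candidate move, play many uniformly random games to the end, see who wins, and pick the move with the best estimated win rate. The game and the number of rollouts are assumptions chosen for illustration.

```python
# Heavily simplified sketch of random-rollout evaluation (flat Monte Carlo)
# on a toy Nim game: players alternate taking 1-3 stones; whoever takes the
# last stone wins.
import random

def legal_moves(pile):
    return [n for n in (1, 2, 3) if n <= pile]

def rollout(pile, my_turn):
    """Play uniformly random moves to the end; return True if we win."""
    while pile > 0:
        pile -= random.choice(legal_moves(pile))
        if pile == 0:
            return my_turn          # whoever just moved took the last stone
        my_turn = not my_turn
    return not my_turn              # pile was already empty: other player won

def choose_move(pile, n_rollouts=500):
    best_move, best_value = None, -1.0
    for move in legal_moves(pile):
        # After our move it is the opponent's turn, hence my_turn=False.
        wins = sum(rollout(pile - move, my_turn=False) for _ in range(n_rollouts))
        if wins / n_rollouts > best_value:
            best_move, best_value = move, wins / n_rollouts
    return best_move, best_value

print(choose_move(10))  # estimated best move and its win rate under random play
```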

Steven Pinker: Yeah. I’d like to complement some of those observations. It is true that in the early days of artificial intelligence and cognitive psychology, they were driven by some of the same players. Herb Simon and Allen Newell can be credited as among the founders of AI and the founders of cognitive psychology. Likewise, Marvin Minsky and John McCarthy. When I was an undergraduate, I caught the tail end of what was called the cognitive revolution. It was exhilarating after the dominance of psychology by behaviorism, which forbade any talk of mentalistic concepts. You weren’t allowed to talk about memories or plans or goals or ideas or rules, because they were considered to be unobservable and thus unscientific. Then the concept of computation domesticated those mentalist terms and opened up a huge space of hypotheses. What are the rules by which we understand and formulate sentences?, a project that Noam Chomsky initiated. How can we model human knowledge as a semantic network?, a project that Minsky and Alan Collins and Ross Quillian and others developed. How do we make sense of foresight and planning and problem solving, which Newell and Simon pioneered?

There was a lot of back and forth between AI and cognitive science when they were first exposed to the very idea that intelligence could be understood in mechanistic terms, and there was a flow of hypotheses from computer science that psychologists then tested as possible models. Ideas that you couldn’t even frame, you couldn’t even articulate before there was the language of computation, such as What is the capacity of human short term memory? or What are the search algorithms by which we explore a problem space? These were unintelligible in the era of behaviorism.

All this caught the attention of philosophers like Hilary Putnam, and later Dan Dennett, who noted that the ideas from the hybrid of cognitive psychology and artificial intelligence were addressing deep questions about what mental entities consist of, namely information processing states. The back-and-forth spilled into the ’70s when I was a graduate student, and even the ’80s when centers for cognitive science were funded by the Sloan Foundation. There was also a lot of openness in the companies that hired artificial intelligence researchers: AT&T Bell Labs, which was a scientific powerhouse before the breakup of AT&T. Bolt Beranek and Newman in Cambridge, which eventually became part of Verizon. I would go there as a grad student to hear talks on artificial intelligence. I don’t know if this is apocryphal history, but Xerox Palo Alto Research Center, where I was a consultant, was so open that, according to legend, Steve Jobs walked in and saw the first computer with a graphic user interface and a mouse and windows and icons, stole the ideas, and went on to build the Lisa and then the Macintosh. Xerox missed out on its own invention, and companies became proprietary. Many of the AI researchers in companies no longer publish in peer-reviewed journals in psychology the way they used to, and the two cultures drifted apart.

Since hypotheses from computer science and artificial intelligence are just hypotheses, there is the question of whether the best engineering solution to a problem is the one that the brain uses. There’s the obvious objection that the hardware is radically different: the brain is massively parallel and noisy and stochastic; computers are serial and deterministic. That led in part to the backlash in the ’80s when perceptrons and artificial neural networks were revived. There was skepticism about the more symbolic approaches to artificial intelligence, which has been revived now in the deep learning era.

To get back to the question: what are ways in which human minds differ from AI systems? It depends on the AI system assessed, as Stuart pointed out. Both of us would agree that the easy equation of deep learning networks with human intelligence is unwarranted, that a lot of the walls that deep learning is hitting come about because, despite the noisy parallel elements the brain is made of, we do emulate a kind of symbol processing architecture, where we can be taught explicit propositions, and human intelligence does make use of these symbols in addition to massively parallel associative networks.

I can’t help but mention a historical irony.  I’ve known Geoff Hinton since we were both post-docs. Hinton himself, early in his career, provided a refutation of the very claim of his that Stuart cited, that symbols are like luminiferous aether, a mythical entity. Geoff and I have noted to each other that we’ve switched sides in the debate on the nature of cognition. There was a debate in the 1970s on the format of mental imagery. Geoff and I were on opposite sides, but he was the symbolic proposition guy and I was the analog parallel network guy.  

Hinton showed that our understanding of an  object depends on the symbolic format in which we mentally represent it. Take something as simple as a cube, he said. Imagine a cube poised on one of its vertices, with the diagonally opposite vertex aligned above it. If you ask people, “Point to all the other vertices,” they are stymied. Their imagery fails, and they often leave out a couple of vertices. But if, instead of describing it to them as a cube tilted on its diagonal axis, you describe it as two tilted diamonds, one above the other, or as two tripods joined by a zig-zag ring, they “see” the correct answer. Even visualizing an object depends critically on how people mentally describe it to themselves with symbols. This is an argument for symbolic representations that Geoff Hinton made in 1979, and with his recent remarks about symbols he seems to have forgotten his own powerful example.

Stuart Russell: I think another area where deep learning is clearly not capturing the human capacity for learning is just in the efficiency of learning. I remember in the mid ’80s going to some classes in psychology at Stanford, and there were people doing machine learning then and they were very proud of their results, and somebody asked Gordon Bower, “How many examples do humans need to learn this kind of thing?” And Gordon said, “One. Sometimes two, usually one,” and this is genuinely true, right? If you look for a picture book that has one to two million pictures of giraffes to teach children what a giraffe is, you won’t find one. Picture books that tell children what giraffes are have one picture of a giraffe, one picture of an elephant, and the child gets it immediately, even though it’s a very crude, cartoonish drawing of a giraffe or an elephant, and they never have a problem recognizing giraffes and elephants for the rest of their lives.

Deep learning systems need, even for these relatively simple concepts, thousands, tens of thousands, millions of examples, and the idea within deep learning seems to be that, well, the way we’re going to scale up to more complicated things, like learning how to write an email to ask for a job, is that we’ll just have billions or trillions of examples, and then we’ll be able to learn really, really complicated concepts. But of course the universe just doesn’t contain enough data for the machine to learn direct mappings from perceptual inputs, or really from the whole perceptual input history. So imagine your entire video record of your life feeding into the decision about what to do next, and you have to learn that mapping as a supervised learning problem. It’s not even funny how infeasible that is. The longer the deep learning community persists in this, the worse the pain is going to be when their heads bang into the wall.

Steven Pinker: In many discussions of superintelligence inspired by the success of deep learning I’m puzzled as to what people could possibly mean. We’re sometimes asked to imagine an AI system that could solve the problem of Middle East peace or cure cancer. That implies that we would have to train it with 60 million other diseases and their cures, and it would extract the patterns and cure the new disease that we present it with. Needless to say, when it comes to solving global warming, or pandemics, or Middle Eastern peace, there aren’t going to be 60 million similar problems with their correct answers that could provide the training set for supervised learning.

Lucas Perry: So, human children and humans are generally capable of one shot learning, or you said we can learn via seeing one instance of a thing, whereas machine learning today is trained up via very, very large data sets. Can you explain what the actual perceptual difference is going on there? It seems for children, they see a giraffe and they can develop a bunch of higher order facts about the giraffe, like that it is tan, and has spots, and a long neck, and horns and other kinds of higher order things. Whereas machine learning systems may be doing something else. So could you explain that difference?

Stuart Russell: Yeah, I think you actually captured it pretty well. The human child is able to recognize the object, not as 20 million pixels, including–let’s not forget–all the pixels of the background. So many of these learning algorithms are actually learning to recognize the background, not the object at all. They’re really picking up on spurious regularities that happen in the way the images are being captured. But the human child immediately separates the figure from the background, says, “Okay, it’s the figure that’s being called a giraffe,” and recognizes the higher-level properties: okay, it’s a quadruped, relatively large; the most distinguishing characteristic, as you say, is the very long neck, plus the way its hide is colored. Probably most kids might not even notice the horns and I’m not even sure if all giraffes have the horns, or just the males or just the adults. I don’t know the answer to that.

So I wasn’t paying much attention to all those images. This carries over to many, many other situations, including in things like planning, where if we observe someone carrying out a successful behavior, that one example combined with our prior knowledge is typically enough for us to get the general idea of how to do that thing. And this prior knowledge is absolutely crucial. Just information-theoretically, you can’t learn from one example reliably, unless you bring to bear a great deal of prior knowledge. And this is completely absent in deep learning systems in two ways. One is they don’t have any prior knowledge. And two is some of the prior knowledge is specifically about the thing you’re trying to predict. So here, we’re trying to predict the category of an animal and we already have a great deal of prior knowledge about what it means to belong to a category of animals.

So, for example, who owns you is not an attribute that the child would need to know or care about if you asked, “What kind of animal is this?” And deep learning systems have no ability to include or exclude any input attribute on the basis of its relevance to what it’s trying to predict, because they know nothing about what it is you’re trying to predict. And if you think about it, that doesn’t make any sense, right? If I said, “Okay, I want you to learn to predict predicate P1279A. Okay? And I’m going to give you loads and loads of examples,” you now get a perfect predictor for P1279A, but you have absolutely no use for it, because P1279A doesn’t connect to anything else in your cognition. So you learned a completely useless predictor because you know nothing about the thing that you’re trying to predict.

So it seems like it’s broken in several really, really important ways, and I would say probably the absence of prior knowledge or any means to bring to bear prior knowledge on the learning process is the most crucial.

Steven Pinker: Indeed, this goes back to our conversation on how basic principles of intelligence that govern the design of intelligent systems provide hypotheses that can be tested within psychology. What Stuart has identified is ultimately the nature-nurture problem in cognition. Namely, what are the innate constraints that govern children’s first hypotheses as they try to make sense of the world? 

One famous answer is Chomsky’s universal grammar, which guides children as they acquire language. Another is the idea from my colleagues Susan Carey and Elizabeth Spelke, in different formulations, that children have a prior concept of a physical object whose parts move together, which persists over time, and which follows continuous spatiotemporal trajectories; and that they have a distinct  concept of an agent or mind, which is governed by beliefs and desires. Maybe, or maybe not, they come equipped with still other frameworks for concepts, like the concept of a living thing or the concept of an artifact, and these priors radically cut down the search space of hypotheses, so they don’t have to search at the level of pixels and all their logically possible weighted combinations. 

Of course, the challenge in the science is how you specify the innate constraints, the prior knowledge, so that they aren’t obviously too specific, given what we know about the plasticity of human cognition. The extreme example is the late philosopher Jerry Fodor’s suggestion that all concepts are innate, including “trombone” and “doorknob” and “carburetor.”

Stuart Russell: (Laughs)

Steven Pinker: Hard to swallow, but between that extreme and the deep learning architecture, in which the only things that are innate are the pixels, the convolutional network that allows for translational invariance, and the network of connections, there’s an interesting middle ground. That defines the central research question in cognitive development.

Stuart Russell: I don’t think you have to believe in extensive innate structures in order to believe that prior knowledge is really, really important for learning. I would guess that some aspects of our cognition are innate, and one of them is probably that the world contains things, and that’s really important because if you just think about the brain as circuits, some circuit languages don’t have things as first class entities, whereas first order logical languages or programming languages do have things as first class entities and that’s a really important distinction.

Even if you believe that nothing is innate, the point is how does everything that you have perceived up to now affect your ability to learn the next thing? One argument is, everything you’ve perceived up to now, is simply data, and somehow magically, we have access to all our past perceptions, and then you’re just training a function from that whole lot to the next thing to do or how to interpret the next object.

That doesn’t make much sense. Presumably the experience you have from birth or even pre-birth onwards, is converted into something and one argument is that it’s just converted into something like knowledge, and then that knowledge is brought to bear on learning problems, for example, to even decide what are the relevant aspects of the input for predicting category membership of this thing?

And the other view would be that, in the deep learning community, they would say probably something like the accumulation of features. If you imagine a giant recurrent neural network: in the hidden layers of the recurrent neural network over years and years and years of perception, you’re building up internal representations, features, which then can perhaps simplify the learning of the next concept that you need to learn. And there’s probably some truth in that too.

And absolutely, having a library of features that are generally useful for predicting and decision making and planning matters. And our entire vocabulary, I think this is something that people often miss: our vocabulary, our language, is not just something we use to communicate with each other. It’s an enormous resource for simplifying the world in the right ways, to make the next thing we need to know, or the next thing we need to do, relatively easy. Right? So you imagine you decide at the age of 12, I want to understand the physical laws that control the universe.

The fact that we have in our vocabulary something like “doing a PhD” makes it much more feasible to figure out what your plan is going to be to achieve this objective. If you didn’t have that, and if you didn’t have all the pieces of doing a PhD, like taking a course or reading a book, you would be stuck: this library of words and action primitives, at all these levels of abstraction, is a resource without which you would be completely unable to formulate plans of any length or any likelihood of success. And this is another area where, for current AI systems generally, not just deep learning, we lack a real understanding of how to formulate these hierarchies and acquire this vocabulary and then how to deploy it in a seamless way so that we’re always managing to function successfully in the real world.

Lucas Perry: I’m basically just as confused about, I guess, intelligence as anyone else. So the difference, it seems to me, between the machine learning system and the child who one-shot-learns the giraffe is that the child brings into this learning scenario this knowledge that you guys were talking about: they understand that the world is populated by things and that there are other minds, and some other ideas about 3D objects and perception. But a core difference seems to be something like symbols and the ability to manipulate symbols. Is this right? Or is it wrong? And what are symbols and effective symbol manipulation made of?

Steven Pinker: Yes, and that is a limitation of the so-called deep learning systems, which are a subset of machine learning, which is a subset of artificial intelligence. It’s certainly not true that AI systems don’t manipulate symbols.  Indeed, that’s what classical AI systems trade in: manipulation of propositions, implementation of versions of logical inference or of cause-and-effect reasoning. Those can certainly be implemented in AI systems–it’s just gone out of fashion with the deep learning craze.

Lucas Perry: Well, they don’t learn those symbols, right? Like we give them the symbols and then they manipulate them.

Steven Pinker: The basic architecture of the system, almost by definition, can’t be learned;  you can’t learn something with nothing. There have got to be some elementary information processes, some formats of data representation, some basic ways of transforming one representation to another, that are hardwired into the architecture of the system. It’s an open empirical question, in the case of the human brain, whether it includes variables for objects and minds, or living things, or artifacts, or if those are scaffolded one on top of the other with experience. There’s nothing in principle that prevents AI systems from doing that;  many of them do, but at least for now they seem to have fallen out of fashion.

Stuart Russell: There is precedent for generating new symbols, both in the probabilistic programming literature and in the inductive logic programming literature. So predicate invention is a very important reason for doing inductive logic programming. But I agree with Steve that it’s an open question whether the basic capacity to have symbol-based representations in the brain is innate or learned. There’s very anecdotal evidence about what happens to children who are not brought up among other human beings. I think those anecdotes suggest that they don’t become symbol-using in the same way. So it might be that the process of developing symbol-using capabilities in the brain is enormously aided by the fact that we grow up in the presence of symbol-using entities, namely our parents and family members and community. And of course that then leads you to a chicken-and-egg problem.

So you’d have to argue in that case that early humans, or pre-humans, had much more rudimentary symbol-like capabilities: some animals have the ability to refer to different phenomena or objects with different signs, different kinds of sounds that some New World monkeys have, for example, for a snake and for a puma, but they’re not able to do the full range of things that we do with symbols. You could argue that the symbol-using capability developed over hundreds of thousands of years and the unaided human mind doesn’t come with it built in, but because we’re usually bathed in symbol-using activity around us, we are able to quickly pick it up. I don’t know what the truth is, but it seems very clear that this kind of capability, for example, gives you the ability to generalize so much faster than you can with circuits. So just to give a particular example with the rules of Go, which we talked about earlier: Go applies the same rule at every time-step in the game.

And it’s the same rule at every square on the board, except around the edges. If you have what we call first-order capability, meaning you can have universal quantifiers (in programs, we think of these as loops), you can say very quickly: for every square on the board, if you have a piece on there and it’s surrounded by the enemy, then it’s dead. That’s sort of a crude approximation to how things work in Go, but it’s roughly right. In a circuit, you can’t say that, because you don’t have the ability to say “for every square.” So you have to have a piece of circuit for each square. So you’ve got 361 copies of the rule, and each of those copies has to be learned separately, and this is one of the things that we do with convolutional neural networks.

A convolutional neural network has the universal quantifier over the input space built into it. So it’s a kind of cheating, and as far as we know, the brain doesn’t have that type of weight sharing. So the key aspect is not just the physical structure of the convolutional network, which has these repeating local receptive fields on each different part of the retina, so to speak, but that we also insist that the weights for each of those local receptive fields are copied across all receptive fields in the retina. So there aren’t millions of separate weights that are trained; there are only a few, sometimes even just a handful of weights, that are trained, and then the code makes sure that those are effectively copied across the entire retina. And the brain, I don’t think, has any way to do that, so it’s doing something else to achieve this kind of rapid generalization.
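
As an illustrative aside (not part of the conversation), the sketch below makes Stuart’s point concrete on an assumed 9x9 board with a 3x3 receptive field, both chosen for brevity rather than taken from the episode: with a loop, which plays the role of the universal quantifier, a crude “surrounded stones are dead” rule is written once and applied to every square, whereas a circuit with no weight sharing would need a separate copy of its parameters for every square, and a shared convolutional filter needs far fewer parameters than one filter per square.

```python
# Toy sketch: one rule applied to every square via a loop, plus rough
# parameter counts for shared vs. unshared local filters (9x9 board).
import numpy as np

SIZE = 9
board = np.zeros((SIZE, SIZE), dtype=int)   # 0 empty, 1 our stone, -1 enemy
board[4, 4] = 1                             # one of our stones...
board[3, 4] = board[5, 4] = board[4, 3] = board[4, 5] = -1   # ...surrounded

def surrounded(board, r, c):
    """One rule, applied at any square: a stone whose orthogonal
    neighbours are all enemy stones (or off the board) is crudely
    considered dead."""
    if board[r, c] != 1:
        return False
    neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return all(
        not (0 <= nr < SIZE and 0 <= nc < SIZE) or board[nr, nc] == -1
        for nr, nc in neighbours
    )

# The universal quantifier as a loop: the same rule at every square.
dead = [(r, c) for r in range(SIZE) for c in range(SIZE) if surrounded(board, r, c)]

# Rough parameter counts for learning such a local rule from data:
k = 3                                  # a 3x3 receptive field per square
shared_conv_params = k * k             # one filter reused everywhere
unshared_params = SIZE * SIZE * k * k  # a separate filter for every square
print(dead, shared_conv_params, unshared_params)   # [(4, 4)] 9 729
```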

Lucas Perry: All right. So now, with all of this context and understanding about intelligence and its origins: today in 2020, AI is beginning to proliferate and is occupying a lot of news cycles. What particularly important changes to human society does the rise and proliferation of AI signal, and how do you view it in relation to the agricultural and industrial revolutions?

Steven Pinker: I’m going to begin with a meta-answer, which is that we should keep in mind how spectacularly ignorant we are of the future, even the relatively near future. Experts at superforecasting studied by Phil Tetlock, pretty much the best in the world, go down to about chance after about five years out. And we know, looking at predictions of the future from the past, how ludicrous they can be, both in underpredicting technological changes and in overpredicting them. A 1995 book by Bill Gates called The Road Ahead said virtually nothing about the internet! And there’s a sport of looking at science-fiction movies and spotting ludicrous anachronisms, such as the fact that in 2001: A Space Odyssey they were using typewriters. They had suspended animation and trips to Jupiter, but they hadn’t invented the word processor. To say nothing of the social changes they failed to predict, such as the fact that all of the women in the movie were secretaries and assistants.

So we should begin by acknowledging that it is extraordinarily difficult to predict the future. And there’s a systematic reason, namely that the future depends not just on technological developments, but also on people’s reaction to the developments, and on the  reactions to the reactions, and the reactions to the reactions to the reactions. There are seven and a half billion of us reacting, and we have to acknowledge that there’s a lot we’re going to get wrong. 

It’s safe to say that a lot of tasks that involve physical manipulation, like stocking shelves and driving trucks, are going to be automated, and societies will have to deal with the possibility of radical changes in employment, and Stuart talks about those in his book. We don’t know whether the job market will be flexible enough to create new jobs, always at the frontier of what machines can’t yet do, or whether there’ll be massive unemployment that will require economic adjustments, such as a universal basic income or government sponsored service. 

Less clear is the extent to which high-level decision making, like policy, diplomacy, or scientific hypothesis-testing, will be replaced by AI. I think that’s impossible to predict. Although, closer to the replacement of truck drivers by autonomous vehicles, AI as a useful tool for, rather than a replacement of, human intelligence will explode in science and business and technology and every walk of life.

Stuart Russell: I think all of those things are true. And I agree that our general record of forecasting has been pretty dismal. I was smiling as Steve was talking, because I was remembering Ray Kurzweil recently saying how proud he was that he had predicted the self-driving car, I think it was in ’96 or ’92, something like that, and he possibly wasn’t aware that the first self-driving car was driving on the freeway in 1987, before he even thought to predict that such a thing might happen. If I had to say what will happen in the next decade: roughly speaking, what happened in the 2010s was primarily that visual perception became very crudely feasible for machines when it wasn’t before.

And that’s already having a huge impact, including in self-driving cars. I would say that language understanding, at least in a simplified sense, will become possible in this decade, and I think it’ll be a combination of deep learning with probabilistic programming, with Bayesian and symbolic methods. That will open up enormous areas of activity to machines where they simply couldn’t go before, and some of that will be very straightforward job replacement, for call center workers for example. Most of what they do, I think, could be automated by systems that are able to understand their conversations. The role of the smart speaker, the Alexa or Cortana or Siri or whatever, will radically change and will enable AI systems to actually understand your life to a much greater extent. One of the reasons that Siri and Cortana and Alexa are not very useful to me is that they just don’t understand anything about my life.

The “call me an ambulance” example illustrates that. If I got a text message saying “Johnny’s in the hospital with a broken arm,” well, if it doesn’t understand that Johnny is possibly my cat, or possibly my son, or possibly my great grandfather, and whether Johnny lives nearby, or in my house, or on another continent, then it hasn’t the faintest idea of what to do, or even whether I care. It can only really learn those things through language understanding; I doubt that we’re going to be filling these things full of first-order logic assertions that we type into our AI system. So it’s only through language that it’s going to be able to acquire the knowledge that it needs to be a useful assistant to an individual or a corporation. So having that language capability will open up whole new areas for AI to be useful to individuals and also to take jobs from people. And I’m not able to predict what else we might be able to do when there are AI systems that understand language, but it has to have a huge impact.

Lucas Perry: Is there anything else that you guys would like to add in terms of where AI is at right now, where it’ll be in the near future and the benefits and risks it will pose?

Stuart Russell: I could point to a few things that are already happening. There’s a lot of discussion about the negative impacts on women and minorities from algorithms that inadvertently pick up on biases in society. So we saw the example of Amazon’s hiring algorithm that rejected any resume that had the word “woman’s” in it. And I think that’s serious, but I think we in the AI community are still not completely woke, and there’s a lot of consciousness raising that needs to happen. But I think technically that problem is manageable, and I think one interesting thing that’s occurring is that we’re starting to develop an understanding not just of the machine learning algorithm, but of the socio-technical context in which that machine learning is embedded, and modeling that socio-technical context allows you to predict whether the use of that algorithm will have negative feedback kinds of consequences, or will be vulnerable to certain kinds of selection bias in the input data, and so on.

Deepfakes, surveillance, and manipulation, that’s another big area, and then something I’m very concerned about is the use of AI for autonomous weapons. This is another area where we fight against media stereotypes. So when the media talk about autonomous weapons, they invariably have a picture of a Terminator. Always. And I tell journalists, I’m not going to talk to you if you put a picture of a Terminator in the article. And they always say, well, I don’t have any control over that, that’s a different part of the newspaper, but it always happens anyway.

And the reason that’s a problem is because then everyone thinks, “Oh, well this is science fiction. We don’t have to worry about this because this is science fiction.” And you know, I’ve heard the Russian ambassador to the UN in Geneva say, well, why are we even discussing these things, because this is science fiction, it’s 20 or 30 years in the future? Oh, by the way, I have some of these weapons, if you’d like to buy them. The reality is that many militaries around the world are developing these, and companies are selling them. There’s a Turkish arms company, STM, selling a device which is basically the slaughterbot from the Slaughterbots movie. So it’s a small drone with onboard explosives, and they advertise it as capable of tracking and autonomously attacking human beings based on video signatures and/or face recognition.

The Turkish government has announced that they’re going to be using those against the Kurds in Syria sometime this year. So we’ll see if it happens, but there’s no doubt that this is not science fiction, and it’s very real. And it’s going to create a new kind of weapon of mass destruction, because if it’s autonomous, it doesn’t need to be supervised. And if it doesn’t need to be supervised, then you can launch them by the million, and then you have something with the same effect as a nuclear weapon, but much cheaper, much easier to proliferate with much less collateral damage and all the rest of it.

Steven Pinker: I think in all of these discussions, it’s critical to not fall prey to a status-quo bias and compare the hypothetical problems of a future technology with an idealized present, ignoring the real problems with the present we take for granted. In the case of bias, we know that humans are horribly biased. It’s not just that we’re biased against particular genders and ethnic groups and sexual orientations. But in general we make judgements that can easily be outperformed by even simple algorithms, like a linear regression formula. So we should remember that our benchmark in talking about the accuracies or inaccuracies of AI prediction algorithms has to be the human, and that’s often a pretty low bar. When it comes to bias, of course, a system that’s trained on a sample that’s unrepresentative is not a particularly intelligent system. And going back to the idea that we have to distinguish the goals we want to achieve from the intelligence that achieves them, if our goal is to overcome past inequities, then by definition we don’t want to make selections that simply replicate the statistical distribution of women and minorities in the past. Our goal is to rectify those inequities, and the problem in a system that replicates them is not that it’s not intelligent enough, but that we’ve given it the wrong goal.

When it comes to weapons, here too, we’ve got to compare the potential harm of intelligent weapons systems with the stupendous harm of dumb weapon systems. Aerial bombardment, artillery, automatic weapons, search-and-destroy missions, and tank battles have killed people by the millions. I think there’s been insufficient attention to how a battleground that used smarter weapons would compare to what we’ve tolerated for centuries simply because that’s what we have come to accept, even though it has been fantastically destructive. What ultimately we want to do is to make the use of any weapons less likely, and as I’ve written about, that has been the general trend in the last 75 years, fortunately.

Stuart Russell: Yeah, I think there is some truth in that. When I first got the email from Human Rights Watch, they had begun a campaign, I think it was back in 2013, to argue for a treaty banning autonomous weapons. Human Rights Watch came into existence because of the awful things that human soldiers do. And now they’re saying “No, no, human soldiers are great, it’s the machines we need to worry about.” And I found that a little bit odd. To me, the argument about whether the weapons will inadvertently violate human rights in ways that human soldiers don’t, or sort of accidentally kill people in ways that we are getting better at avoiding, I don’t think that’s the issue. I think it’s specifically the weapon of mass destruction property that autonomous weapons have that, for example, machine guns don’t.

There’s a hundred million or more Kalashnikov rifles in private hands in the world. If all those weapons got up one morning by themselves and started shooting anyone they could see, that would be a big chunk of the human race gone, but they don’t do that. Each of them has to be carried by a person. And if you want to put a million of them into the field, you need another 10 million people to feed and train those million soldiers, and to transport them, and protect them, and all that stuff. And that’s why we haven’t seen very large scale death from all those hundred million Kalashnikovs.

Even carpet-bombing, I think, would nowadays be regarded as indiscriminate and therefore a violation of international law. And I think even during the Second World War, people argued that “No, you can’t go and bomb cities.” But once the Germans started to do it, then there were escalating rounds of retaliation and people lost all sense of what was a civilized and what was an uncivilized act of war. But even The Blitz against Great Britain, as far as I know, killed only between 50,000 and 60,000 people, even though it hit dozens and dozens of cities. But literally one truckload of autonomous weapons can kill a million people.

An interesting fact about World War II is that for every person who died, between 1,000 and 10,000 bullets were fired. So just killing people with bullets in World War II cost you, on average, taking the geometric mean, about 3,000 bullets, which is actually about a thousand dollars at current prices, but you could build a lethal autonomous weapon for a lot less than that. And even if they had a 25% success rate in finding and killing a human, it’s much cheaper than the bullets, let alone the guns and the aircraft and all the rest of it.
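As a rough check on the arithmetic here, a minimal back-of-the-envelope sketch; the per-bullet and per-drone prices are illustrative assumptions rather than figures given in the conversation:

```python
import math

# Geometric mean of the cited range of bullets fired per death in World War II.
bullets_low, bullets_high = 1_000, 10_000
bullets_per_death = math.sqrt(bullets_low * bullets_high)   # ~3,162, i.e. roughly 3,000

# Assumption: ~$0.33 per rifle round, so ~3,000 rounds is about $1,000.
price_per_bullet = 0.33
bullet_cost_per_death = bullets_per_death * price_per_bullet

# Assumption: a small autonomous drone costing $200, with a 25% success rate,
# meaning four drones expended per death on average.
drone_unit_cost = 200
drone_cost_per_death = drone_unit_cost / 0.25

print(f"bullets per death (geometric mean): {bullets_per_death:,.0f}")
print(f"bullet cost per death: ${bullet_cost_per_death:,.0f}")   # ~$1,000
print(f"drone cost per death:  ${drone_cost_per_death:,.0f}")    # $800 under these assumptions
```

The comparison only goes through if the unit cost of the drone is low enough; the point of the sketch is the structure of the cost-per-death comparison, not the particular numbers.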

So as a way of killing very, very large numbers of people it’s incredibly cheap and incredibly effective. They can also be selective. So you can kill just the kind of people you want to get rid of. And it seems to me that we just don’t need another weapon of mass destruction with all of these extra characteristics. We’ve got rid of biological and chemical weapons, to some extent. We’re trying to get rid of nuclear weapons, and introducing another one that’s arguably much worse seems to be a step in the wrong direction.

Steven Pinker: You asked also about the benefits of artificial intelligence, which I think could be stupendous. They include elimination of drudgery and the boring and dangerous jobs that no one really likes to do, like stocking shelves, making beds, mining coal, and picking fruit. There could be a bonanza in automating all the things that humans want done without human pain and labor and boredom and danger. It raises the problem of how we will support the people (if new jobs don’t materialize) who have nothing to do. But that’s a more minor economic problem to solve, compared to the spectacular advance we could have in eliminating human drudgery.

Also, there are a lot of jobs, such as the care of elderly people–lifting them onto toilets, reaching things from upper shelves–that, if automated, would allow more of them to live at home instead of being warehoused in nursing homes. Here, too, the potential for human flourishing is spectacular. And as I mentioned, many kinds of human judgment are so error-prone that they can already be replaced by simple algorithms, and better still if they were more intelligent algorithms. There’s the potential of much less waste, much less error, far fewer accidents. An obvious example is the million and a quarter people killed in traffic accidents each year that could be terrifically reduced if we had autonomous vehicles that were affordable and widespread.

Lucas Perry: A core of this is that all of the problems that humanity faces simply require intelligence to solve them, essentially. And if we’re able to solve the problem of how to make intelligent machines, then our problems will evermore and continuously become automatable by machine systems. So Stuart, do you have anything else to add here in terms of existential hope and benefits to complement what Steve just contributed before we pivot into existential risk?

Stuart Russell: Yeah, there is an argument going around, and I think Mark Zuckerberg said it pretty clearly, and Oren Etzioni and various other people have said basically the same thing. And it’s usually put this way, “If you’re against AI, then you’re against better medical decisions, or reducing medical errors, or safer cars,” and so on. And this is, I think, just a ridiculous argument. So first of all, people who are concerned about the risks of AI, are not against AI, right? That’s like arguing if you’re a nuclear engineer and you’re concerned about the possibility of a design flaw that would lead to a meltdown, you’re against electricity. No, you’re not against electricity. You’re just against millions of people dying for no reason, and you want to fix the problem. And the same argument I think is true about those who are concerned about the risk of AI. If AI didn’t have any benefits we wouldn’t be having this discussion at all. No one would be investing any money, no one would have put their lives and careers into working on the capabilities of AI, and the whole point would be moot.

So of course, AI will have benefits, but if you don’t address the risks, you won’t get the benefits, because the technology will be rejected, or we won’t even have a choice to reject it. And if you look at what happened with nuclear power, I think it’s really an object lesson. Nuclear power could and still can produce quite cheap electricity. So I have a house in France and most electricity in France comes from nuclear power, and it’s very cheap and very reliable. And it also doesn’t produce a lot of carbon dioxide, but because of Chernobyl, the nuclear industry has been literally decimated, by which I mean, reduced by a factor of 10, or more. And so we didn’t get the benefits, because we didn’t pay enough attention to the risks. The same holds with AI.

So the benefits of AI in the long run I would argue are pretty unlimited, and medical errors and safer cars, that’s all nice, but that’s a tiny, tiny footnote in what can be done. As Steve already mentioned, the elimination of drudgery and repetitive work. It’s easy for us intellectuals to talk about that. We’ve never really engaged in a whole lot of it, but for most of the human race, for most of recorded history, people with power and money have used everybody else as robots to get what they want. Whether we’ve been using them as military robots, or agricultural robots, or factory robots, we’ve been using people as robots.

And if you had gone back to the early hunter gatherer days and written some science fiction, and you said, “You know what, in the future, people will go into big square buildings, thousands of feet long with no windows and they’ll do the same thing a thousand times a day. And then they’ll go back the next day and do the same thing another thousand times. And they’re going to do that for thousands of days until they’re practically dead.” The audience, the readers of science fiction in 20,000 BC, would have said, “You’re completely nuts, that’s so unrealistic.” But that’s how we did it. And now we’re worried that it’s coming to an end, and it is coming to an end, because we finally have robots that can do the things that we’ve been using human robots to do.

And I’m not saying we should just get rid of those jobs, because jobs have all kinds of purposes in people’s lives. And I’m not a big fan of UBI, which says basically, “Okay, we give up. Humans are useless, so the machines will feed them and house them, entertain them, but that’s all they’re good for.”

Now the benefits to me… It’s hard to imagine, just like we could not imagine very well all the things we would use the internet for. I mean, I remember the Berkeley computer science faculty in the ’80s sitting around at lunch, we knew more about networking than almost anybody else, but we still had absolutely no idea. What was the point of being able to click on a link? What’s that about? We totally blew it.

And we don’t understand all the things that superhuman AI could do for us. I mean, Steve mentioned that we could do much better science, and I agree with that. In the book, I visualize it as taking various ideas, like, “travel as a service,” and extending that to “everything as a service.” So travel as a service is a good example. Like if you think about going to Australia 200 years ago, you’re talking about a billion dollar proposition, probably 10 years, thousands of people, 80% chance of death. Now I take out my cell phone, I go tap, tap, tap, and now I’m in Australia tomorrow. And it’s basically free compared to what it used to be. So that’s what I mean by, as a service, you want something, you just get it.

Superhuman AI could make everything as a service. So think about the things that are expensive and difficult or impossible now, like training a neurosurgeon, or building a railway to connect your rural village to a nearby city so that people can visit, or trade, or whatever. For most of the developing world these things are completely out of reach. The health budget of a lot of countries in Africa is less than $10 per person per year. So the entire health budget of a country would train one neurosurgeon in the US. So these things are out of reach, but if you take out the humans then these services can become effectively free. They become services like travel is today, and that would enable us to bring everyone on earth up to the kind of living standard that they might aspire to. And if we can figure out the resource constraints and so on that will be a wonderful thing.

Lucas Perry: Now that’s quite a beautiful picture of the future. There’s a lot of existential hope there. The other side to existential hope is existential risk. Now this is an interesting subject, which Steve and you, Stuart, I believe have disagreements about. So pivoting into this area, and Steve, you can go first here: do you believe that human beings, should we not go extinct in the meantime, will build artificial superintelligence? And does that pose an existential risk to humanity?

Steven Pinker: Yeah, I’m on record as being skeptical of that scenario and dubious about the value of putting a lot of effort into worrying about it now. The concept of superintelligence is itself obscure. In a lot of the discussions you could replace the word “superintelligence” with “magic” or “miracle” and the sentence would read the same. You read about an AI system that could duplicate brains in silicon, or solve problems like war in the Middle East, or cure cancer.  It’s just imagining the possibility of a solution and assuming that the ability to bring it about will exist, without laying out what that intelligence would consist of, or what would count as a solution to the problem. 

So I find the concept of superintelligence itself a dubious extrapolation of an unextrapolable continuum, like human-to-animal, or not-so-bright human-to-smart-human. I don’t think there is a power called “intelligence” such that we can compare a squirrel or an octopus to a human and say, “Well, imagine even more of that.” 

I’m also skeptical about the existential risk scenarios. They tend to come in two varieties. One is based on the notion of a will to power: that as soon as you get an intelligent system, it will inevitably want to dominate and exploit. Often the analogy is that we humans have exploited and often extinguished animals because we’re smarter than them, so as soon as there is an artificial system that’s smarter than us, it’ll do to us what we did to the dodos. Or that technologically advanced civilizations, like European colonists and conquistadors subjugated and sometimes wiped out indigenous peoples, so that’s what an AI system might do to us. That’s one variety of this scenario.

I think that scenario confuses intelligence with dominance, based on the fact that in one species, Homo sapiens, they happen to come bundled together, because we came about through natural selection, a competitive process driven by relative success at capturing scarce resources and competing for mates, ultimately with the goal of relative reproductive success. But there’s no reason that a system that is designed to pursue a goal would have as its goal, domination. This goes back to our earlier discussion that the ability to achieve a goal is distinct from what the goal is.

It just so happens that in products of natural selection, the goal was winning in reproductive competition. For an artifact we design, there’s just no reason that would be true. This is sometimes called the orthogonality thesis in discussions of existential risk, although that’s just a fancy-schmancy way of referring to Hume’s distinction between our goals and our intelligence.

Now I know that there is an argument that says, “Wouldn’t any intelligent system have to maximize its own survivability, because if it’s given the goal of X, well, you can’t achieve X if you don’t exist, therefore, as a subgoal to achieving X, you’ve got to maximize your own survival at all costs.” I think that’s fallacious. It’s certainly not true that all complex systems have to work toward their own perpetuation. My iPhone doesn’t take any steps to resist my dropping it into a toilet, or letting it run out of power.

You could imagine if it could be programmed like a child to whine, and to cry, and to refuse to do what it’s told to do as its power level went down. We wouldn’t buy one. And we know in the natural world, there are plenty of living systems that sacrifice their own existence for other goals. When a bee stings you, its barbed stinger is dislodged when the bee escapes, killing the bee, but because the bee is programmed to maximize the survivability of the colony, not itself, it willingly sacrifices itself. So it is not true that by definition an intelligent system has to maximize its own power or survivability.

But the more common existential threat scenario is not a will to power but collateral damage. That if an AI system is given a single goal, what if it relentlessly pursues it without consideration of side effects, including harm to us? There are famous examples that I originally thought were spoofs, but were intended seriously, like giving an AI system the goal of making as many paperclips as possible, and so it converts all available matter into paperclips, including our own bodies (putting aside the fact that we don’t need more efficient paperclip manufacturing than what we already have, and that human bodies are a pretty crummy source of iron for paperclips).

Barely more plausible is the idea that we might give an AI system the goal of curing cancer, and so it will  conscript us as involuntary guinea pigs and induce tumors in all of us, or that we might give it the goal of regulating the level of water behind a dam and it might flood a town because it was never given the goal of not drowning a village. 

The problem with these scenarios is that they’re self-refuting. They assume that an “intelligent” artifact would be designed to implement a single goal, which is not true of even the stupid artifacts that we live with. When we design a car, we don’t just give the goal of going from A to B as fast as possible; we also install brakes and a steering wheel and a muffler and a catalytic converter. A lot of these scenarios seem to presuppose both idiocy on the part of the designers, who would give a system control over the infrastructure of the entire planet without testing it first to see how it worked, and an idiocy on the part of the allegedly intelligent system, which would pursue a single goal regardless of all the other effects. This does not exist in any human artifact, let alone one that claims to be intelligent. Giving an AI system one vaguely worded, sketchy goal, and empowering it with control over the entire infrastructure of the planet without testing it first seems to me just so self-evidently moronic that I don’t worry that engineers have to be warned against it.

I’ve quoted Stuart himself, who in an interview made the point well when he said, “No one talks about building bridges that don’t fall down. They just call it building bridges.” Likewise, AI that avoids idiocies like that is just AI, it’s not AI with extra safeguards. That’s what intelligence consists of.

Let me make one other comment. You could say, well, even if the odds are small, the damage would be so catastrophic that it is worth our concern. But there are downsides to worrying about existential risk. One of them is the possible stigmatization and abandonment of helpful technologies. Stuart mentioned the example of nuclear power. What’s catastrophic is that we don’t roll out nuclear power the way that France did, which would go a long way toward solving the genuinely dangerous problem of climate change. Fear of nuclear power has been irrationally stoked by vivid examples: the fairly trivial accident at Three Mile Island in the United States, which killed no one, the tsunami at Fukushima, where people died in the botched evacuation, not the nuclear accident, and the Soviet bungling at Chernobyl. Even that accident killed a fraction of the people that die every day from the burning of fossil fuels, to say nothing of the likely future harm from climate change. The reaction to Chernobyl is exactly how we should not deal with the dangers facing humanity.

Genetically modified organisms are another example: a technology overregulated or outlawed out of worst-case fears, depriving us of the spectacular benefits of greater ecological sustainability, human nutrition, and less use of water and pesticides. 

There are other downsides of fretting about exotic hypothetical existential risks. There is a line of reasoning in the existential risk community and the so-called Rationality community that goes something like this: since the harm of extinguishing the species is basically infinite, probabilities no longer matter, because by expected utility calculations, if you multiply the tiny risk by the very large number of the potential descendants of humans before the sun expands and kills us off (or in wilder scenarios, the astronomically larger number of immortal consciousnesses that will exist when we can upload our connectomes to the cloud, or when we colonize and multiply in other solar systems)—well, then even an eensy, eensy, infinitesimal probability of extinction would be catastrophic, and we should worry about it now.
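The structure of the expected-value reasoning being criticized here can be written out in a couple of lines; the numbers below are purely illustrative placeholders, not estimates from anyone in this conversation:

```python
# Expected-utility form of the argument: E[loss] = p_extinction * value_at_stake.
# Both numbers are illustrative placeholders.
p_extinction = 1e-12        # an "eensy, infinitesimal" probability
future_people = 1e16        # a stand-in for the potential descendants of humanity
expected_lives_lost = p_extinction * future_people
print(expected_lives_lost)  # 10000.0 -- still large, despite the tiny probability
```

Pinker’s objection, developed just below, is that this form assigns a non-trivial expected loss to any scenario with a nonzero probability, so on its own it cannot prioritize among risks.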

The problem is that that argument could apply to any scenario with a nonzero probability, which means any scenario that is not logically impossible. Should we take steps to prevent the evolution of toxic killer gerbils that will nibble everyone to death? If I say, “That’s preposterous,” you can say, “Well, even if the probability is very, very small, since the harm of extinction is so great, we must devote some brain power to that scenario.”

I do fear the moral hazard of human intellect being absorbed in this free-for-all: that any risk, if you imagine it’s potentially existential, could justify any amount of expenditure, according to this expected utility calculation. The hazard is that smart people, clever enough to grasp a danger that common sense would never conceive of, will be absorbed into what might be a fruitless pursuit, compared to areas where we urgently do need application of human brain power–in climate, in the prevention of nuclear war, in the prevention of pandemics. Those are real risks, which no one denies, and we haven’t solved any of them, together with other massive sources of human misery like Alzheimer’s disease. Given these needs, I wonder whether the infinitesimal-probability-times-infinite-harm is the right way of allocating our intellectual capital.

Lucas Perry: Stuart, do you want to react to those points?

Stuart Russell: Yeah, there’s a lot there to react to and I’m tempted to start at the end and work back and just ask, well, if we were spending hundreds of billions of dollars a year to breed billions of toxic killer gerbils, wouldn’t you ask people if that was a good idea before dismissing any reason to be concerned about it? If that’s what we were actually investing in creating. I don’t buy the analogy between AI and toxic killer gerbils in any shape or form. But I will go back to the beginning, and we began by talking about feasibility. And Steve argues, I think, primarily that it’s not even meaningful to say we could create superhuman levels of intelligence, because there isn’t a single continuum.

And yes, there isn’t a single continuum, but there doesn’t have to be a single continuum. When people say one person is more intelligent than another, or one species is more intelligent than another, it’s not a scientific statement that there is a single scalar on which species one exceeds species two. They’re talking with a broad brush. So when we say humans are more intelligent than chimpanzees, that’s probably a reasonable thing to say, but there are clearly dimensions of intelligence where actually chimpanzees are more intelligent than humans. For example, short term memory. A chimpanzee, once it gets what a digit is, can learn 20-digit telephone numbers at the drop of a hat, and humans can’t do that. Clearly there are dimensions on which chimpanzee intelligence, on average, is probably better than human intelligence. But nonetheless, when you look at which species you would rather be right now, the chimpanzees don’t have much of a chance against the humans.

I think that there is a meaningful notion of generality of intelligence, and one way to think about it is to take a decision making scenario where we already understand how to produce very effective decisions, and then ask, how is that decision scenario restricted, and what happens when we relax the restrictions and figure out how to maintain the same, let’s say, superhuman quality of decision making? So if you look at Go play, it’s clear that the humans have been left far behind. So it’s not unreasonable to ask, just as the machines wiped the floor with humans on the Go board, and the chess board, and now on the StarCraft board, and lots of other boards, could you take that and transfer that into the real world where we make decisions of all kinds? The difference between the Go board and the real world is pretty dramatic. And that’s why we’ve had lots of success on the Go board and not so much in the real world.

The first thing is that the Go board is fully observable. You can see the entire state of the world that matters. And of course in the real world there’s lots of stuff you don’t see and don’t know. Some of it you can infer by accumulating information over time, what we call state estimation, but that turns out to be quite a difficult problem. Another thing is that we know all the rules of Go, and of course in the real world, you don’t know all the rules, you have to learn a lot as you go along. Another thing about the Go board is that despite the fact that we think of it as really complicated, it’s incredibly simple compared to the real world. At any given time on the Go board there’s a couple of hundred legal moves, and the game lasts for a couple hundred moves.

And if you said, well, what are the analogous primitive actions in the real world for a human being? Well, we have 600 muscles and we can actuate them maybe about 10 times per second each. Your brain probably isn’t able to do that, but physically that’s what could be your action space. And so you actually have then a far greater action space. And you’re also talking about… We often make plans that last for many years, which is literally trillions of primitive actions in terms of muscle actuations. Now we don’t plan those all out in detail, but we function on those kinds of timescales. Those are some of the ways that Go and the real world differ. And what we do in AI is we don’t say, okay, I’ve done Go, now I’m going to work on suicide Go, and now I’m going to work on chess with three queens.
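To make the branching-factor and horizon comparison concrete, here is an order-of-magnitude sketch using the figures Stuart cites (a couple of hundred legal moves and a couple of hundred moves per Go game, versus 600 muscles actuated about 10 times per second over multi-year plans); the five-year horizon is just an illustrative choice:

```python
# Go: roughly 200 legal moves per position, and a game lasting a couple hundred moves.
go_actions_per_step = 200
go_steps_per_game = 200

# Real world: ~600 muscles, each actuated ~10 times per second.
muscle_actuations_per_second = 600 * 10
seconds_per_year = 60 * 60 * 24 * 365

# Primitive actions implied by a plan spanning several years (illustratively, five).
plan_years = 5
primitive_actions = muscle_actuations_per_second * seconds_per_year * plan_years

print(f"Go horizon: ~{go_steps_per_game} steps of ~{go_actions_per_step} choices each")
print(f"Real-world horizon: ~{primitive_actions:.1e} primitive actions")  # ~9.5e+11, on the order of a trillion
```

Nobody plans at that granularity, as Stuart notes; the point is how far the real-world decision problem departs from the restrictions that make Go tractable.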

What we try to do is extract the general lessons. Okay, we now understand fairly well how to handle that whole class of problems. Can we relax the assumptions, these basic qualitative assumptions about the nature of the problem? And if you relax all the ones that I listed, and probably a couple more that I’ve got, you’re getting towards systems that can function at a superhuman level in the real world, assuming that you figure out how to deal with all those issues. So just as we find ourself flummoxed by the moves that the AI system makes on the Go board, if you’re a General, and you’re up against an AI system that’s controlling, or coming up with the decision making plans for the other side, you might find yourself flummoxed, that everything you try, the machine has already anticipated and put in place something that will prevent your plan from succeeding. The pace of warfare will be beyond anything humans have ever contemplated, right?

So they won’t even have time to think, just as the Iraqis were not used to the rate of decision making of the US Army in the first Gulf War, and they couldn’t do anything. They were literally paralyzed, and just step by step by step the allied forces were able to take them apart because they couldn’t respond within the timescales that the allied forces were operating.

So it will be kind of like that if you were a human general. If you were a human CEO and your competitor company is organized and run by AI systems, you’d be in the same kind of situation. So it’s entirely conceivable. I’m not necessarily saying plausible, but conceivable that we can create real world decision making capabilities that exceed those of humans across the board. So this notion of generality is something that I think still needs to be worked out. Most definitions of generality that people come up with end up saying, “Well, humans are general because they can do all the things that humans can do,” which is sort of a tautology. But nonetheless, it’s interesting that when you think about all the jobs: doctor, carpenter, advertising sales representative, most normally functioning people could do most of those jobs at least to some reasonable level.

So we are incredibly flexible compared to current AI systems. There is progress on achieving generality, but there’s a long way to go. I’m certainly not one of those who says that superintelligent AI is imminent and that’s why we need to worry. And in fact, I’m probably more conservative: if you want to appeal to what expert AI people think, most of them think that we will have something that’s reasonably described as superintelligent AI sooner than I do.

So most people think sometime in the middle of the century. It turns out that Asian AI researchers particularly in China are more optimistic, so they think 20 years. People in the US and Europe may be more like 40 years. I would be reasonably confident saying by the end of this century.

I think Nick Bostrom is in about the same place. He’s also more conservative than the average expert AI researcher. There are major breakthroughs that have to happen, but the massive investment that’s taking place, the influx of incredibly smart people into the field, these things suggest that those breakthroughs will probably take place but the timescale is very hard to say.

And when we think about the risks, I would say Steve is really putting up one straw man after another and then knocking it down. So for example, the paperclip argument is not a scenario that Nick Bostrom thinks is one of the more likely ways for the human race to end. It’s a philosophical thought experiment intended to illustrate a point. And the point is incontrovertible and I don’t think Steve disagrees with it.

So let’s not use the word intelligent, because I think Steve here is using the word intelligent to mean “always behaves in whatever way we wish it would behave.”

Of course, if you define intelligence that way, then there isn’t an issue. The question is, how do we create any such thing? And the ways we have right now of creating any such thing fall under the standard model, which I described earlier: we set up, let’s call it, a superoptimizer, and then we give it an objective, and off it goes. And he (Bostrom) is describing what happens when you give a superoptimizer the wrong goal. And he’s not saying, “Yes, of course we should give it wrong goals.”

And he’s using this to illustrate what happens when you give it even what seems to be innocuous. So he’s trying to convey the idea that we are not very good at judging the consequences of seemingly innocuous goals. My example of curing cancer: “Curing cancer? Yeah, of course, that’s a good goal to give to an AI system” — but the point is, if that’s the only goal you give to the AI system, then all these weird things happen because that’s the nature of super optimizers. That’s the nature of the standard model of AI.

And this, I think, is the main point being made: it’s not that no matter what we do, AI is going to get us. It’s that given our current understanding, and given that hundreds of billions of dollars are being invested into that current understanding, there is a failure mode, and it’s reasonable to point that out, just as if you’re a nuclear engineer and you say, “Look, everyone is designing these reactors in this way. All of you are doing this. And look, there’s this failure mode.” That’s a reasonable thing to point out.

Steven Pinker: Several reactions. First, while money is pouring into AI, it’s not pouring it into super-optimizers tasked with curing cancer and with the power to kidnap people. And the analogies of humans outcompeting chimpanzees, or American generals outsmarting their Iraqi counterparts, once again assume that systems that are smarter than us will therefore be in competition with us. As for straw men, I was mindful to avoid them: the AI system that would give people tumors to pursue the goal of curing cancer was taken from Stuart’s book.

I agree that a super-optimizer that was given a single goal would be menace. But a super-optimizer that pursued a single goal is self-evidently unintelligent, not superintelligent! 

Stuart Russell: Of course, we have multiple goals. There’s a whole field of multi-attribute utility theory that’s been going now for more than 50 years. Of course, we understand that. When we look at even the design of the algorithms that Uber uses to get you to the airport, they take into account multiple goals.

But the point is that the same argument applies if you operate in the standard model when you add in the multiple goals, unless you’re able to be sure that you have completely and correctly captured all the things we care about, under all conceivable circumstances and the inconceivable ones, because I think one of the things about superintelligent AI systems is that they will come up with forms of action that are, by human standards, inconceivable.

We cannot guarantee that. And this is the point. So you could say multiple goals, but multiple goals are just a single goal. They add up to the ability to rank futures. And the question is: is that ability to rank futures fully aligned with what humans want their futures to be like? And the answer is inevitably, no. We are inevitably going to leave things out.

So even if you have a thousand terms in the objective function, there’s probably another million that you ought to have included that you didn’t think about because it never occurred to you.

So for example, you can go out and find lists of important things that human beings care about. This is sort of the whole-values community, human-development community, Maslow’s hierarchy, all of those things. People do make whole lists of things, trying to build up a picture of, very roughly, what the human utility function is after all.

But invariably, those lists just refer to things that are usually a subject of discussion among humans, about “Do we spend money on schools or hospitals?” or whatever it might be. On that list, you will not find the color of the sky, because no humans right now are thinking about, “Oh, should we change the sky to be orange with pink stripes?” But if someone did change the color of the sky, I can bet you a lot of people would be really upset about it.

And so invariably we fail to include many, many criteria in whatever list of objectives you might come up with. And when you do that, what happens is that the optimizer will take advantage of those dimensions of freedom and typically, and actually under fairly general algebraic conditions, will set them to extreme values because that gives you better optimization on the things that are in the list of goals.
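A minimal sketch of the effect being described: an optimizer given an objective that mentions only some of the variables will push the unmentioned ones to whatever extreme serves the stated objective. The toy objective, the resource budget, and the “true” utility below are all invented for illustration.

```python
# Toy illustration: the "true" preferences care about both x (something we listed in
# the objective) and y (something we forgot to list, like the color of the sky),
# with a shared resource budget x + y <= 10.  The optimizer is only told to maximize x.
def optimize(stated_objective, budget=10.0, steps=1000):
    best, best_val = None, float("-inf")
    for i in range(steps + 1):
        x = budget * i / steps
        y = budget - x                  # whatever is left over goes to the unlisted variable
        val = stated_objective(x, y)
        if val > best_val:
            best, best_val = (x, y), val
    return best

x_opt, y_opt = optimize(lambda x, y: x)       # the stated objective ignores y entirely
print(x_opt, y_opt)                           # 10.0 0.0 -- y is driven to an extreme

true_utility = lambda x, y: x + 5 * y         # what we actually cared about (illustrative)
print(true_utility(x_opt, y_opt))             # 10.0 under the proxy objective's choice
print(true_utility(*optimize(true_utility)))  # 50.0 if the true preferences had been stated
```

Under the proxy objective the forgotten variable is driven to zero, even though the true preferences valued it most; that gap is exactly what adding "a thousand terms" to the objective does not close.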

So the argument is that within the standard model, which I bear some responsibility for, because it’s the way we wrote the first three editions of the textbook, within the standard model, further progress on AI could lead to increasing problems of control and it’s not because there’s any will to dominance.

I don’t know of any serious thinkers in the X-risk community who think that that’s the problem. That’s another straw man.

Steven Pinker: When you’re finished, I do have some responses to that.

Stuart Russell: The argument is not that we automatically build in because we all want our systems to be alpha males or anything like that. And I think Steve Omohundro has put it fairly clearly in some of his earlier papers that the behavior of a superoptimizer given any finite list of goals is going to include efforts to maximize its computational resources and other resources that will help it achieve the objectives that we do specify.

And you could put in something saying, “Well, and don’t spend any money.” Or, “Don’t do this and don’t do that and don’t do the other.” But the same structure of the argument is going to apply. We can reduce the risk by adding more and more stuff into the explicit objectives, but I think the argument I’m making in the book is that that’s just a completely broken way to design AI systems.

The meta argument is that if we don’t talk about the failure modes, we won’t be able to address them. So actually I think that Steve and I don’t disagree about the plausible future evolution. I don’t think it’s particularly plausible. If I was going into forecast mode, so just betting on the future saying, “What’s the probability that this thing will happen or that thing will happen?” I don’t think it’s particularly plausible that we will be destroyed by superintelligent AI.

And there are several reasons why I don’t think that’s going to occur because we would probably get some early warnings of it. And if we couldn’t figure out how to prevent it, we would probably put very strong restrictions on further development or we would figure out how to actually make it provably safe and beneficial.

But you can’t have that discussion unless you talk about the failure modes. Just like in nuclear safety, it’s not against the rules to raise possible failure modes like what if this molten sodium that you’re proposing should flow around all these pipes? What if it ever came into contact with the water that’s on the turbine side of the system? Wouldn’t you have a massive explosion which could rip off the containment and so on? That’s not exactly what happened in Chernobyl, but not so dissimilar.

And of course that’s what they do. So this culture of safety that Steve talks about consists exactly of this: people saying, “Look, if you design things that way, these terrible things are going to happen. So don’t design things that way, design things this way.” And this is a process that we are going through in the AI community right now.

And I have to say, I just actually was reading a letter from one of my very senior colleagues, former president of AAAI, who said, “Five years ago, everyone thought Stuart was nuts, but Stuart was right. These risks have to be taken seriously and we all owe him a great debt for bringing it within the AI community so that we can start to address it.”

And I don’t think I invented these risks. I was just in the chance position of having two years of sabbatical to think about the future of the field and to read some of the things that others had already written about the field from the outside.

My sense is that Steve and I are kind of the glass half full and the glass half empty. In terms of our forecast, we think that on the whole, the weather tomorrow is likely to be sunny. I think we disagree on how to make sure that it’s sunny. I really do think about the problem of creating provably beneficial AI, by which I mean that no matter how powerful the AI system is, we remain in power over it forever and we never lose control.

That’s a big ask and the idea that we could solve that problem without even mentioning it, without even talking about it and without even pointing out why it’s difficult and why it’s important, that’s not the culture of safety. That’s sort of more like the culture of the communist party committee in Chernobyl, that simply continued to assert that nothing bad was happening.

Steven Pinker: Obviously, I’m in favor of the safety mindset of engineering, that is, you test the system before you implement it, you try to anticipate the failure modes. And perhaps I have overestimated the common sense of the AI community and they have to be warned about the absurdity of building a superoptimizer.  But a lot of these examples–flooding a town to control the water level, or curing cancer by turning humans into involuntary Guinea pigs, or maximizing happiness by injecting everyone with a drip of antidepressants–strike me as so far from reasonable failure modes that they’re not part of the ordinary engineering effort to ensure safety–particularly when they are coupled with the term “existential.”

These are not ordinary engineering discussions of ways in which a system could fail; they are speculations on how the human species might end. That is very different from not plugging in an AI system until you’ve tested it to find out how it fails. And perhaps we agree that the superoptimizers in these thought experiments are so unintelligent that no one will actually empower them.

Stuart Russell: But Steve, I wasn’t saying we give it one goal. I’m saying however many goals we give it, that’s equivalent to giving it a ranking over futures. So the idea of single goal versus multiple is a complete red herring.

Steven Pinker: But the scare stories all involve systems that are given a single goal. As you go down the tail of possible risks, you’re getting into potentially infinitesimal risks. There is no system, conceivable or existing, that will have zero risk of every possibility. 

Stuart Russell: If we could do that, if we had some serious theory by which we could say, “Okay. We’ve got within epsilon of the true human ranking over futures,” I think that’s very hard to do. We literally do not have a clue how to do that. And the purpose of these examples is actually to dismiss the idea that this has a simple solution.

So people want to dismiss the idea of risk by saying, “Oh, we’ll just give the AI system such and such objective.” And then the failure mode goes away and everything’s cool. And then people say, “Oh, but no.” Look, if you give it the objective that everyone should be happy, then here’s a solution that the AI system could find that clearly we wouldn’t want.

Those processes lead actually to deeper questioning, what do we really mean by happy? We don’t just mean pleasure as measured by the pleasure center in the brain. And the same arguments happen in moral philosophy.

So no one is accusing G.E. Moore of being a naive idiot because he objected to a pleasure maximization definition of what is a good moral decision to make. He was making an important philosophical point and I don’t think we should dismiss that same point when it’s made in the context of designing objectives for AI systems.

Steven Pinker: Yes, that’s an excellent argument against building a universally empowered AI system that’s given the single goal of maximizing human happiness–your example. Do AI researchers need to be warned against that absurd project? It seems to me that that’s the straw man, and so are the other scenarios that are designed to sow worry, such as conscripting the entire human race as involuntary guinea pigs in cancer experiments. Even if there isn’t an epsilon that we can’t go below in laying out possible risks, it doesn’t strike me that that’s within the epsilon.

These strike me more as exercises of human imagination. Assuming a ridiculously simple system that’s given one goal, what could go wrong? Well, yeah, stuff could go wrong, but is that really what’s going to face us when it comes to actual AI systems that have some hope of being implemented?

Naturally, we ought to test the living daylights out of any system before we give it control over anything. That’s Stuart’s point about building bridges that don’t fall down and the standard safety ethic in engineering. But I’m not sure that exotic scenarios based on incredibly stupid ideas for AI systems like giving one the goal of maximizing human happiness is the route that gives us safe AI.

Stuart Russell: Okay. So let me say once again that the one goal versus multiple goals is a red herring. If you think it’s so easy to specify the goal correctly, perhaps your next paper will write it out. Then we’ll say, “Okay, that’s not a straw man. This is Steve Pinker’s suggestion of what the objective should be for the superintelligent AI system.” And then the people who love doing these things, probably Nick Bostrom and others will find ways of failing.

So the idea that we could just test before deploying something that is significantly more powerful than human beings or even the human race combined, that’s a pretty optimistic idea. We’re not even able to test ordinary software systems right now. So test generation is one of the effective methods used in software engineering, but it has many, many known failure cases for real world examples, including multiplication.

Intel’s Pentium chip was tested with billions of examples of multiplication, but it failed to uncover a bug in the multiplication circuitry, which caused it to produce incorrect results in some cases. And so we have a technology of formal verification, which would have uncovered that error, but particularly in the US there’s a culture that’s somewhat opposed to using formal verification in software design.
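A small illustration of the general point (not of the Pentium FDIV bug itself): random testing can run a huge number of cases and still miss a defect that only fires on a vanishingly small slice of the input space, which is the gap formal verification is meant to close. The buggy function below is invented for the example.

```python
import random

def buggy_multiply(a: int, b: int) -> int:
    # Invented defect: the result is wrong only when both operands end in one
    # specific 16-bit pattern -- roughly one input pair in four billion.
    if a & 0xFFFF == 0x1234 and b & 0xFFFF == 0x5678:
        return a * b + 1
    return a * b

random.seed(0)
failures = 0
for _ in range(1_000_000):                  # a million random test cases
    a, b = random.getrandbits(32), random.getrandbits(32)
    if buggy_multiply(a, b) != a * b:
        failures += 1
print(failures)                             # almost certainly 0: sampling never hits the bad region

# Reasoning over the whole input space, rather than sampling it, is what exposes
# the defect; here we simply exhibit a failing pair directly.
print(buggy_multiply(0x1234, 0x5678) == 0x1234 * 0x5678)   # False
```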

Less so in hardware design nowadays, partly because of the Pentium error, but still in software, formal verification is considered very difficult and very European and not something we do. And this is far harder than that because software verification typically is thinking only about correctness of the software in an internal sense, that what happens inside the algorithm between the inputs and outputs meet some specifications.

What we want here is that the combination of the algorithm and the world evolves in ways that we are certain to be pleased about. And that’s a much harder kind of thing. Control theory has that view of what they mean by verification. And they’re able to do very simple linear quadratic regulators and a few other examples. And beyond that, they get stuck. And so I actually think that testing is probably not very feasible. I mean, I’m not saying we shouldn’t do it, but it’s going to be extremely hard to get any kind of confidence from testing, because you’re really asking, can you simulate the entire world and all the ways a system could use the world to bring about the objective?

However complicated and however multifaceted that objective is, it’s probably going to be the wrong one. So I’ve proposed a, not completely different, but a generalized form of AI that knows that it doesn’t know what the real objectives are. It knows it doesn’t know how humans rank possible futures and that changes the way it behaves, but that also has failure modes.

One of them being the plasticity of human preference rankings over the future and how do you prevent the AI system from taking advantage of that plasticity? You can’t prevent it completely because anything it does is going to have some effect on human preferences. But the question is what constitutes reasonable modifications of human preferences and what constitutes unreasonable ones? We don’t know the answer to that. So there are many, many really difficult research problems that we have to overcome for the research agenda that I’m proposing to have a chance of success.

I’m not that optimistic that this is an easy or a straightforward problem to solve and I think we can only solve it if we go outside the conceptual framework that AI has worked in for the last 70 years.

Steven Pinker: Well, yes. Certainly, if the conceptual framework for AI is optimizing some single or small list of generic goals, like a ranking over possible futures, and it is empowered to pursue them by any means, as opposed to building tools that solve specific problems. But note that you’ve also given arguments why the fantasies of superintelligence are unlikely to come about–the near-miraculous powers to outsmart us, to augment its own intelligence, to defeat all of our attempts to control it. In the scenarios, these all work flawlessly–yet  the complexities that make it hard to predict all conceivable failures also make it hard to achieve superintelligence in the first place.

Namely, we can’t take into account the fantastically chaotic and unpredictable reactions of humans. And we can’t program a system that has complete knowledge of the physical universe without allowing it to do experiments and acquire empirical knowledge, at a rate determined by the physical world. Exactly the infirmities that prevent us from exploring the entire space of behavior of one of these systems in advance is the reason that it’s not going to be superintelligent in the way that these scenarios outline.

And that’s a reason not to empower any generic goal-driven system that aspires toward “superintelligence” or that we might think of as “superintelligent”–it  is unlikely to exist, and likely to display various forms of error and stupidity.

Stuart Russell: I would agree that some of the concerns that you might see in the X-risk community are, say, nonphysical. So the idea that a system could predict the next hundred years and your entire life in such detail that a hundred years ago, it knew what you were going to be saying at a particular millisecond in a hundred years in the future, this is obviously complete nonsense.

I don’t think we need to be too concerned about that as a serious question. Whether it’s a thought experiment that sheds light on fundamental questions in decision theory, like the Newcomb problems is another issue that we don’t have to get into. But we can’t solve the problem by saying, “Well, superintelligence of the kind that could lead to significant global consequences could not possibly exist.”

And actually I kind of like Danny Hillis’ argument, which says that actually, no, it already does exist and it already has, and is having significant global consequences. And his example is to view, let’s say the fossil fuel industry as if it were an AI system. I think this is an interesting line of thought, because what he’s saying basically and — other people have said similar things — is that you should think of a corporation as if it’s an algorithm and it’s maximizing a poorly designed objective, which you might say is some discounted stream of quarterly profits or whatever. And it really is doing it in a way that’s oblivious to lots of other concerns of the human race. And it has outwitted the rest of the human race.

So we might all think, well, of course, we know that what it’s trying to do is wrong and of course we all know the right answer, but in fact we’ve lost and we should have pointed out a hundred years ago that there is this risk and it needs to be taken seriously.

And it was. People did point it out a hundred years ago, but no one took them seriously. And this is what happened. So I think we actually have a fairly good example of this type of thing: the optimization of objectives by superintelligent entities, ignoring externalities, as the economists would point out. And in some sense, the fossil fuel industry outwitted us, because whatever organizational structures allow large groups of humans to generate effective complex behaviors in the real world and develop complex plans, they operate in some ways like a superintelligent entity, just as we were able to put a person on the moon because of the combined effect of many human intellects working together.

But each of those humans in the fossil fuel industry is a piece of an algorithm if you like and their own individual preferences about the future don’t count for much and in fact, they get molded by their role within the corporation.

I think in some ways you already have an existence proof that the concern is real.

Steven Pinker: A simpler explanation is that people like energy, fossil fuels are the most convenient source, and no one has had to pay for the external damage they do. Clearly we ought to anticipate foreseeable risks and attempt to mitigate them. But they have to be calibrated against what we know, taking into account our own ignorance of the future. It can be hazardous to chase the wrong worries, such as running out of petroleum, which was the big worry in the 1970s. Now we know that the problem with petroleum is too much, not too little. Overpopulation and genetically modified organisms are other examples. 

If we try to fantasize too far into the future, beyond what we can reasonably predict, we can sow fear about the wrong risks. My concern about all these centers and smart people worrying about the existential risk of AI is that we are misallocating our worry budget and our intellectual resources. We should be thinking hard about how to mitigate climate change, which is a real problem. That is less true of spinning exotic scenarios about hypothetical AI systems which have been given control over the physical universe and might enslave us in cancer experiments.

Lucas Perry: All right. So wrapping up here, do you guys have final statements that you’d like to say, just if you felt like what you just said didn’t fully capture what you want to end on on this issue of AI existential risk?

Steven Pinker: Despite our disagreements, most of my assessment of AI agrees with Stuart’s. I personally don’t think that the adjective existential is helpful in ordinary concerns over safety, which we ought to have. I think there are tremendous potential benefits to AI, and that we ought to seek them at the same time as we anticipate the reasonable risks and make every effort to mitigate them.

Stuart Russell: Yep. I mean, it’s hard to disagree that we should focus on the reasonable risks. The question is whether you think that the hundreds of billions of dollars that are being invested into AI research will produce systems that can have potentially global consequences.

And to me it seems self evident that it can and we can look at even simple machine algorithms like the content selection algorithms in social media because those algorithms interact with humans for hours every day and dictate what literally billions of people see and read every day. They are having substantial global impact already.

And they are very, very simple. They don’t know that human beings exist at all, but they still learn to manipulate our brains to optimize the objective. I had a very interesting little Facebook exchange with Yann LeCun. And at some point in the argument, Yann said something quite similar to something Steve said earlier. He said, “There’s really no risk. You’d have to be extremely stupid to put an incorrect objective into a powerful system and then deploy it on a global scale.”

And I said, “Well, you mean like optimizing click-through?” And he said, “Facebook stopped using click-through years ago.” And I said, “Well, why was that?” And he said, “Oh, because it was the incorrect objective.”

So you did put an incorrect objective into a powerful system, deployed on a global scale. Now what does that say about Facebook? So I think just as you might have said — and in fact the nuclear industry did say — “It’s perfectly safe. Nothing can go wrong. We’re the experts. We understand safety. We understand everything.”

Nonetheless, we had Chernobyl, we had Fukushima. And actually, I think there’s an argument to be made that despite the massive environmental cost of foregoing nuclear power, that countries like Germany, Italy, Spain and probably a bunch of others are in the process of actually deciding that we need to phase out nuclear power because even though theoretically, it’s possible to develop and operate completely safe nuclear power systems, it’s beyond our capabilities and the evidence is there.

You might have argued that while Russia was corrupt, its technology was not as great as it should have been, and they cut lots of corners; but you can’t argue that the Japanese nuclear industry was unsophisticated or unconcerned with safety, and yet they still failed. And so I think voters in those countries said, “We don’t want nuclear power, because we just don’t want to be in that situation, even if we have the engineers making their best efforts.”

These kinds of considerations suggest that we do need to pay very careful attention. I'm not saying we should stop working on climate change, but when we invented synthetic biology, we said, okay, we'd better think about how to prevent the creation of new disease organisms that could produce pandemics. And we took steps. People spent a lot of time thinking about safety mechanisms for those devices. We have to do the same thing for AI.

Lucas Perry: All right. Stuart and Steven, thanks so much. I’ve learned a ton of stuff today. If listeners want to follow you or look into your work, where’s the best place to do that? I’ll start with you, Steven.

Steven Pinker: Stevenpinker.com, which has pages for ten books, including the most recent, Enlightenment Now. SAPinker on Twitter. 

Lucas Perry: And Stuart.

Stuart Russell: So you can Google me. I don't really have a website or social media activity, but the book Human Compatible, which was published last October by Viking in the US and Penguin in the UK and is being translated into lots of languages, I think captures my views pretty well.

Lucas Perry: All right. Thanks so much for coming on. And yeah, it was a pleasure speaking.

Steven Pinker: Thanks very much, Lucas for hosting it. Thank you Stuart for the dialogue.

Stuart Russell: It was great fun, Steve. I look forward to doing it again.

Steven Pinker: Me too.

End of recorded material

AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

 Topics discussed in this episode include:

  • Rohin’s and Buck’s optimism and pessimism about different approaches to aligned AI
  • Traditional arguments for AI as an x-risk
  • Modeling agents as expected utility maximizers
  • Ambitious value learning and specification learning/narrow value learning
  • Agency and optimization
  • Robustness
  • Scaling to superhuman abilities
  • Universality
  • Impact regularization
  • Causal models, oracles, and decision theory
  • Discontinuous and continuous takeoff scenarios
  • Probability of AI-induced existential risk
  • Timelines for AGI
  • Information hazards

Timestamps: 

0:00 Intro

3:48 Traditional arguments for AI as an existential risk

5:40 What is AI alignment?

7:30 Back to a basic analysis of AI as an existential risk

18:25 Can we model agents in ways other than as expected utility maximizers?

19:34 Is it skillful to try and model human preferences as a utility function?

27:09 Suggestions for alternatives to modeling humans with utility functions

40:30 Agency and optimization

45:55 Embedded decision theory

48:30 More on value learning

49:58 What is robustness and why does it matter?

01:13:00 Scaling to superhuman abilities

01:26:13 Universality

01:33:40 Impact regularization

01:40:34 Causal models, oracles, and decision theory

01:43:05 Forecasting as well as discontinuous and continuous takeoff scenarios

01:53:18 What is the probability of AI-induced existential risk?

02:00:53 Likelihood of continuous and discontinuous take off scenarios

02:08:08 What would you both do if you had more power and resources?

02:12:38 AI timelines

02:14:00 Information hazards

02:19:19 Where to follow Buck and Rohin and learn more

 

Works referenced: 

AI Alignment 2018-19 Review

Takeoff Speeds by Paul Christiano

Discontinuous progress investigation by AI Impacts

An Overview of Technical AI Alignment with Rohin Shah (Part 1)

An Overview of Technical AI Alignment with Rohin Shah (Part 2)

Alignment Newsletter

Intelligence Explosion Microeconomics

AI Alignment: Why It’s Hard and Where to Start

AI Risk for Computer Scientists

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Note: The following transcript has been edited for style and clarity.

 

Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today we have a special episode with Buck Shlegeris and Rohin Shah that serves as a review of progress in technical AI alignment over 2018 and 2019. This episode serves as an awesome bird's-eye view of the varying focus areas of technical AI alignment research and also helps to develop a sense of the field. I found this conversation to be super valuable for helping me to better understand the state and current trajectory of technical AI alignment research. This podcast covers traditional arguments for AI as an x-risk, what AI alignment is, the modeling of agents as expected utility maximizers, iterated distillation and amplification, AI safety via debate, agency and optimization, value learning, robustness, scaling to superhuman abilities, and more. The structure of this podcast is based on Rohin’s AI Alignment Forum post titled AI Alignment 2018-19 Review. That post is an excellent resource to take a look at in addition to this podcast. Rohin also had a conversation with us just about a year ago titled An Overview of Technical AI Alignment with Rohin Shah. This episode serves as a follow-up to that overview and as an update on what’s been going on in the field. You can find a link for it on the page for this episode.

Buck Shlegeris is a researcher at the Machine Intelligence Research Institute. He tries to work to make the future good for sentient beings and currently believes that working on existential risk from artificial intelligence is the best way of doing this. Buck worked as a software engineer at PayPal before joining MIRI, and was the first employee at Triplebyte. He previously studied at the Australian National University, majoring in CS and minoring in math and physics, and he has presented work on data structure synthesis at industry conferences.

Rohin Shah is a 6th year PhD student in Computer Science at the Center for Human-Compatible AI at UC Berkeley. He is involved in Effective Altruism and was the co-president of EA UC Berkeley for 2015-16 and ran EA UW during 2016-2017. Out of concern for animal welfare, Rohin is almost vegan because of the intense suffering on factory farms. He is interested in AI, machine learning, programming languages, complexity theory, algorithms, security, and quantum computing to name a few. Rohin’s research focuses on building safe and aligned AI systems that pursue the objectives their users intend them to pursue, rather than the objectives that were literally specified. He also publishes the Alignment Newsletter, which summarizes work relevant to AI alignment. The Alignment Newsletter is something I highly recommend that you follow in addition to this podcast.  

And with that, let’s get into our review of AI alignment with Rohin Shah and Buck Shlegeris.

To get things started here, the plan is to go through Rohin’s post on the Alignment Forum about AI Alignment 2018 and 2019 In Review. We’ll be using this as a way of structuring this conversation and as a way of moving methodically through things that have changed or updated in 2018 and 2019, and to use those as a place for conversation. So then, Rohin, you can start us off by going through this document. Let’s start at the beginning, and we’ll move through sequentially and jump in where necessary or where there is interest.

Rohin Shah: Sure, that sounds good. I think I started this out by talking about this basic analysis of AI risk that’s been happening for the last couple of years. In particular, you have these traditional arguments, so maybe I’ll just talk about the traditional argument first, which basically says that the AI systems that we’re going to build are going to be powerful optimizers. When you optimize something, you tend to get these sort of edge case outcomes, these extreme outcomes that are a little hard to predict ahead of time.

You can’t just rely on tests with less powerful systems in order to predict what will happen, and so you can’t rely on your normal common sense reasoning in order to deal with this. In particular, powerful AI systems are probably going to look like expected utility maximizers due to various coherence arguments, like the Von Neumann–Morgenstern rationality theorem, and these expected utility maximizers have convergent instrumental sub-goals, like not wanting to be switched off because then they can’t achieve their goal, and wanting to accumulate a lot of power and resources.

The standard argument goes, because AI systems are going to be built this way, they will have these convergent instrumental sub-goals. This makes them dangerous because they will be pursuing goals that we don’t want.

Lucas Perry: Before we continue too much deeper into this, I’d want to actually start off with a really simple question for both of you. What is AI alignment?

Rohin Shah: Different people mean different things by it. When I use the word alignment, I’m usually talking about what has been more specifically called intent alignment, which is basically aiming for the property that the AI system is trying to do what you want. It’s trying to help you. Possibly it doesn’t know exactly how to best help you, and it might make some mistakes in the process of trying to help you, but really what it’s trying to do is to help you.

Buck Shlegeris: The way I would say what I mean by AI alignment, I guess I would step back a little bit, and think about why it is that I care about this question at all. I think that the fundamental fact which has me interested in anything about powerful AI systems of the future is that I think they’ll be a big deal in some way or another. And when I ask myself the question “what are the kinds of things that could be problems about how these really powerful AI systems work or affect the world”, one of the things which feels like a problem is that we might not know how to apply these systems reliably to the kinds of problems which we care about, and so by default humanity will end up applying them in ways that lead to really bad outcomes. And so I guess, from that perspective, when I think about AI alignment, I think about trying to make ways of building AI systems such that we can apply them to tasks that are valuable, such that they’ll reliably pursue those tasks instead of doing something else which is really dangerous and bad.

I’m fine with intent alignment as the focus. I kind of agree with, for instance, Paul Christiano, that it’s not my problem if my AI system incompetently kills everyone; that’s the capabilities people’s problem. I just want to make the system so it’s trying to cause good outcomes.

Lucas Perry: Both of these understandings of what it means to build beneficial AI or aligned AI systems can take us back to what Rohin was just talking about, where there’s this basic analysis of AI risk, about AI as powerful optimizers and the associated risks there. With that framing and those definitions, Rohin, can you take us back into this basic analysis of AI risk?

Rohin Shah: Sure. The traditional argument looks like AI systems are going to be goal-directed. If you expect that your AI system is going to be goal-directed, and that goal is not the one that humans care about, then it’s going to be dangerous because it’s going to try to gain power and resources with which to achieve its goal.

If the humans tried to turn it off, it’s going to say, “No, don’t do that,” and it’s going to try to take actions that avoid that. So it pits the AI and the humans in an adversarial game with each other, and you ideally don’t want to be fighting against a superintelligent AI system. That seems bad.

Buck Shlegeris: I feel like Rohin is to some extent setting this up in a way that he’s then going to argue is wrong, which I think is kind of unfair. In particular, Rohin, I think you’re making these points about VNM theorems and stuff to set up the fact that it seems like these arguments don’t actually work. I feel that this makes it kind of unfairly sound like the earlier AI alignment arguments are wrong. I think this is an incredibly important question, of whether early arguments about the importance of AI safety were quite flawed. My impression is that overall the early arguments about AI safety were pretty good. And I think it’s a very interesting question whether this is in fact true. And I’d be interested in arguing about it, but I think it’s the kind of thing that ought to be argued about explicitly.

Rohin Shah: Yeah, sure.

Buck Shlegeris: And I get that you were kind of saying it narratively, so this is only a minor complaint. It’s a thing I wanted to note.

Rohin Shah: I think my position on that question of “how good were the early AI risk arguments,” probably people’s internal beliefs were good as to why AI was supposed to be risky, and the things they wrote down were not very good. Some things were good and some things weren’t. I think Intelligence Explosion Microeconomics was good. I think AI Alignment: Why It’s Hard and Where to Start, was misleading.

Buck Shlegeris: I think I agree with your sense that people probably had a lot of reasonable beliefs but that the written arguments seem flawed. I think another thing that’s true is that random people like me who were on LessWrong in 2012 or something, ended up having a lot of really stupid beliefs about AI alignment, which I think isn’t really the fault of the people who were thinking about it the best, but is maybe sociologically interesting.

Rohin Shah: Yes, that seems plausible to me. Don’t have a strong opinion on it.

Lucas Perry: To provide a little bit of framing here and better analysis of basic AI x-risk arguments, can you list what the starting arguments for AI risk were?

Rohin Shah: I think I am reasonably well portraying what the written arguments were. Underlying arguments that people probably had would be something more like, “Well, it sure seems like if you want to do useful things in the world, you need to have AI systems that are pursuing goals.” If you have something that’s more like tool AI, like Google Maps, that system is going to be good at the one thing it was designed to do, but it’s not going to be able to learn and then apply its knowledge to new tasks autonomously. It sure seems like if you want to do really powerful things in the world, like run companies or make policies, you probably do need AI systems that are constantly learning about their world and applying their knowledge in order to come up with new ways to do things.

In the history of human thought, we just don’t seem to know of a way to cause that to happen except by putting goals in systems, and so probably AI systems are going to be goal-directed. And one way you can formalize goal-directedness is by thinking about expected utility maximizers, and people did a bunch of formal analysis of that. Mostly going to ignore it because I think you can just say all the same thing with the idea of pursuing goals and it’s all fine.

Buck Shlegeris: I think one important clarification to that, is you were saying the reason that tool AIs aren’t just the whole story of what happens with AI is that you can’t apply it to all problems. I think another important element is that people back then, and I now, believe that if you want to build a really good tool, you’re probably going to end up wanting to structure that as an agent internally. And even if you aren’t trying to structure it as an agent, if you’re just searching over lots of different programs implicitly, perhaps by training a really large recurrent policy, you’re going to end up finding something agent shaped.

Rohin Shah: I don’t disagree with any of that. I think we were using the words tool AI differently.

Buck Shlegeris: Okay.

Rohin Shah: In my mind, if we’re talking about tool AI, we’re imagining a pretty restricted action space where no matter what actions in this action space are taken, with high probability, nothing bad is going to happen. And you’ll search within that action space, but you don’t go to arbitrary action in the real world or something like that. This is what makes tool AI hard to apply to all problems.

Buck Shlegeris: I would have thought that’s a pretty non-standard use of the term tool AI.

Rohin Shah: Possibly.

Buck Shlegeris: In particular, I would have thought that restricting the action space enough that you’re safe, regardless of how much it wants to hurt you, seems kind of non-standard.

Rohin Shah: Yes. I have never really liked the concept of tool AI very much, so I kind of just want to move on.

Lucas Perry: Hey, it’s post-podcast Lucas here. I just want to highlight a little bit of clarification that Rohin was interested in adding, which is that he thinks “tool AI evokes a sense of many different properties, and he doesn’t know which properties most people are usually thinking about, so as a result he prefers not to use the phrase tool AI and would instead like to use more precise terminology. He doesn’t necessarily feel, though, that the concepts underlying tool AI are useless.” So let’s tie things back a bit to these basic arguments for x-risk that many people are familiar with, that have to do with convergent instrumental sub-goals and the difficulty of specifying and aligning systems with our goals and what we actually care about in our preference hierarchies.

One of the things here that Buck seemed to bring up: he was saying that you may have been narratively setting up the Von Neumann–Morgenstern theorem, which sets up AIs as expected utility maximizers, and that you are going to argue that that argument, which is sort of the formalization of these earlier AI risk arguments, is less convincing to you now than it was before, but Buck still thinks that these arguments are strong. Could you unpack this a little bit more, or am I getting this right?

Rohin Shah: To be clear, I also agree with Buck, that the spirit of the original arguments does seem correct, though, there are people who disagree with both of us about that. Basically, the VNM theorem roughly says, if you have preferences over a set of outcomes, and you satisfy some pretty intuitive axioms about how you make decisions, then you can represent your preferences using a utility function such that your decisions will always be, choose the action that maximizes the expected utility. This is, at least in writing, given as a reason to expect that AI systems would be maximizing expected utility. The thing is, when you talk about AI systems that are acting in the real world, they’re just selecting a universe history, if you will. Any observed behavior is compatible with the maximization of some utility function. Utility functions are a really, really broad class of things when you apply it to choosing from universe histories.

Buck Shlegeris: An intuitive example of this: suppose that you see that every day I walk home from work in a really inefficient way. It’s impossible to know whether I’m doing that because I happened to really like that path. For any sequence of actions that I take, there’s some utility function such that that was the optimal sequence of actions. And so we don’t actually learn anything about how my policy is constrained based on the fact that I’m an expected utility maximizer.

Lucas Perry: Right. If I only had access to your behavior and not your insides.

Rohin Shah: Yeah, exactly. If you have a robot twitching forever, that’s all it does, there is a utility function over a universe history that says that is the optimal thing to do. Every time the robot twitches to the right, it’s like, yeah, the thing that was optimal to do at that moment in time was twitching to the right. If at some point somebody takes a hammer and smashes the robot and it breaks, then the utility function that corresponds to that being optimal is like, yeah, that was the exact right moment to break down.

If you have these pathologically complex utility functions as possibilities, every behavior is compatible with maximizing expected utility. You might want to say something like, probably we’ll have the simple utility maximizers, but that’s a pretty strong assumption, and you’d need to justify it somehow. And the VNM theorem wouldn’t let you do that.
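For readers who want the formal versions of the two claims above, here is a rough sketch in our own notation; nothing below is quoted from the episode, and the second construction is deliberately the degenerate one Rohin and Buck are gesturing at.

```latex
% 1. VNM representation theorem (rough statement, notation ours).
% If a preference relation \succeq over lotteries satisfies completeness,
% transitivity, continuity, and independence, then there is a utility
% function u, unique up to positive affine transformation, with
\[
  A \succeq B \iff \mathbb{E}_{o \sim A}[u(o)] \ \ge\ \mathbb{E}_{o \sim B}[u(o)].
\]
% 2. The degenerate construction: any policy maximizes *some* utility
% function over universe histories. Let \pi be the observed policy and
% H_\pi the set of histories \pi can produce, and define
\[
  U_\pi(h) =
  \begin{cases}
    1 & \text{if } h \in H_\pi,\\
    0 & \text{otherwise.}
  \end{cases}
\]
% Then \pi maximizes expected U_\pi, so "this system maximizes expected
% utility" constrains nothing about its behavior on its own.
```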

Lucas Perry: So is the problem here that you’re unable to fully extract human preference hierarchies from human behavior?

Rohin Shah: Well, you’re unable to extract agent preferences from agent behavior. You can see any agent behavior and you can rationalize it as expected utility maximization, but it’s not very useful. Doesn’t give you predictive power.

Buck Shlegeris: I just want to have my go at saying this argument in three sentences. Once upon a time, people said that because all rational systems act like they’re maximizing an expected utility function, we should expect them to have various behaviors like trying to maximize the amount of power they have. But every set of actions that you could take is consistent with being an expected utility maximizer, therefore you can’t use the fact that something is an expected utility maximizer in order to argue that it will have a particular set of behaviors, without making a bunch of additional arguments. And I basically think that I was wrong to be persuaded by the naive argument that Rohin was describing, which just goes directly from rational things are expected utility maximizers, to therefore rational things are power maximizing.

Rohin Shah: To be clear, this was the thing I also believed. The main reason I wrote the post that argued against it was because I spent half a year under the delusion that this was a valid argument.

Lucas Perry: Just for my understanding here, the view is that because any behavior, any agent from the outside can be understood as being an expected utility maximizer, that there are behaviors that clearly do not do instrumental sub-goal things, like maximize power and resources, yet those things can still be viewed as expected utility maximizers from the outside. So additional arguments are required for why expected utility maximizers do instrumental sub-goal things, which are AI risky.

Rohin Shah: Yeah, that’s exactly right.

Lucas Perry: Okay. What else is on offer other than expected utility maximizers? You guys talked about comprehensive AI services might be one. Are there other formal agentive classes of ‘thing that is not an expected utility maximizer but still has goals?’

Rohin Shah: A formalism for that? I think some people, like John Wentworth for example, are thinking about markets as a model of agency. Some people like to think of multi-agent groups together leading to an emergent agency and want to model human minds this way. How formal are these? Not that formal yet.

Buck Shlegeris: I don’t think there’s anything which is competitively popular with expected utility maximization as the framework for thinking about this stuff.

Rohin Shah: Oh yes, certainly not. Expected utility maximization is used everywhere. Nothing else comes anywhere close.

Lucas Perry: So there’s been this complete focus on utility functions and representing the human utility function, whatever that means. Do you guys think that this is going to continue to be the primary way of thinking about and modeling human preference hierarchies? How much does it actually relate to human preference hierarchies? I’m wondering if it might just be substantially different in some way.

Buck Shlegeris: Me and Rohin are going to disagree about this. I think that trying to model human preferences as a utility function is really dumb and bad and will not help you do things that are useful. I don’t know; if I want to make an AI that’s incredibly good at recommending me movies that I’m going to like, some kind of value learning thing where it tries to learn my utility function over movies is plausibly a good idea. Even things where I’m trying to use an AI system as a receptionist, I can imagine value learning being a good idea.

But I feel extremely pessimistic about more ambitious value learning kinds of things, where I try to, for example, have an AI system which learns human preferences and then acts in large scale ways in the world. I basically feel pretty pessimistic about every alignment strategy which goes via that kind of a route. I feel much better about either trying to not use AI systems for problems where you have to think about large scale human preferences, or having an AI system which does something more like modeling what humans would say in response to various questions and then using that directly instead of trying to get a value function out of it.

Rohin Shah: Yeah. Funnily enough, I was going to start off by saying I think Buck and I are going to agree on this.

Buck Shlegeris: Oh.

Rohin Shah: And I think I mostly agree with the things that you said. The thing I was going to say was I feel pretty pessimistic about trying to model the normative underlying human values, where you have to get things like population ethics right, and what to do with the possibility of infinite value. How do you deal with fanaticism? What’s up with moral uncertainty? I feel pretty pessimistic about any sort of scheme that involves figuring that out before developing human-level AI systems.

There’s a related concept which is also called value learning, which I would prefer to be called something else, but I feel like the name’s locked in now. In my sequence, I called it narrow value learning, but even that feels bad. Maybe at least for this podcast we could call it specification learning, which is sort of more like the tasks Buck mentioned, like if you want to learn preferences over movies, representing that using a utility function seems fine.

Lucas Perry: Like superficial preferences?

Rohin Shah: Sure. I usually think of it as you have in mind a task that you want your AI system to do, and now you have to get your AI system to reliably do it. It’s unclear whether this should even be called a value learning at this point. Maybe it’s just the entire alignment problem. But techniques like inverse reinforcement learning, preference learning, learning from corrections, inverse reward design where you learn from a proxy reward, all of these are more trying to do the thing where you have a set of behaviors in mind, and you want to communicate that to the agent.
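As a concrete sketch of what the specification-learning techniques Rohin lists typically assume, many inverse reinforcement learning and reward learning methods model the human as noisily (Boltzmann) rational and fit reward parameters to demonstrations. The notation below is ours and is a generic, textbook-style sketch, not a description of any particular paper from the episode.

```latex
% Boltzmann-rational model of demonstrations (a common IRL assumption).
% A demonstrated trajectory \tau is assumed to be chosen with probability
% proportional to its exponentiated return under reward parameters \theta:
\[
  P(\tau \mid \theta) \ \propto\ \exp\!\Big(\beta \sum_{t} r_\theta(s_t, a_t)\Big),
\]
% and the learner fits \theta to a demonstration set D by maximum likelihood
% (or approximate Bayesian inference):
\[
  \hat{\theta} = \arg\max_{\theta} \sum_{\tau \in D} \log P(\tau \mid \theta).
\]
```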

Buck Shlegeris: The way that I’ve been thinking about how optimistic I should be about value learning or specification learning recently has been that I suspect that at the point where AI is human level, by default we’ll have value learning which is about at human level. We’re about as good at giving AI systems information about our preferences that they can do stuff with as we are at giving other humans information about our preferences that they can do stuff with. And when I imagine hiring someone to recommend music to me, I feel like there are probably music nerds who could do a pretty good job of looking at my Spotify history, and recommending bands that I’d like if they spent a week on it. I feel a lot more pessimistic about being able to talk to a philosopher for a week, and then have them answer hard questions about my preferences, especially if they didn’t have the advantage of already being humans themselves.

Rohin Shah: Yep. That seems right.

Buck Shlegeris: So maybe that’s how I would separate out the specification learning stuff that I feel optimistic about from the more ambitious value learning stuff that I feel pretty pessimistic about.

Rohin Shah: I do want to note that I collated a bunch of stuff arguing against ambitious value learning. If I had to make a case for optimism about even that approach, it would look more like: “Under the value learning approach, it seems possible, with uncertainty over rewards, values, preferences, whatever you want to call them, to get an AI system such that you actually are able to change it, because it would reason that if you’re trying to change it, well then that means something about it is currently not good for helping you, and so it would be better to let itself be changed.” I’m not very convinced by this argument.

Buck Shlegeris: I feel like if you try to write down four different utility functions that the agent is uncertain between, I think it’s just actually really hard for me to imagine concrete scenarios where the AI is corrigible as a result of its uncertainty over utility functions. Imagine the AI system thinks that you’re going to switch it off and replace it with an AI system which has a different method of inferring values from your actions and your words. It’s not going to want to let you do that, because its utility function is to have the world be the way that is expressed by your utility function as estimated the way that it approximates utility functions. And so being replaced by a thing which estimates utility functions or infers utility functions some other way means that it’s very unlikely to get what it actually wants, and other arguments like this. I’m not sure if these are super old arguments that you’re five levels of counter-arguments to.

Rohin Shah: I definitely know this argument. I think the problem of fully updated deference is what I would normally point to as representing this general class of claims and I think it’s a good counter argument. When I actually think about this, I sort of start getting confused about what it means for an AI system to terminally value the final output of what its value learning system would do. It feels like some additional notion of how the AI chooses actions has been posited, that hasn’t actually been captured in the model and so I feel fairly uncertain about all of these arguments and kind of want to defer to the future. 

Buck Shlegeris: I think the thing that I’m describing is just what happens if you read the algorithm literally. Like, if you read the value learning algorithm literally, it has this notion of the AI system wants to maximize the human’s actual utility function.

Rohin Shah: For an optimal agent playing a CIRL (cooperative inverse reinforcement learning) game, I agree with your argument. If you take optimality as defined in the cooperative inverse reinforcement learning paper and it’s playing over a long period of time, then yes, it’s definitely going to prefer to keep itself in charge rather than a different AI system that would infer values in a different way.
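For readers unfamiliar with the setup Rohin references: a CIRL game is, roughly, a two-player game in which the human and the robot optimize one shared reward whose parameters only the human observes. The sketch below is in our own notation and compresses the formal definition considerably.

```latex
% Rough sketch of the CIRL setup (notation ours).
% Human H and robot R act in a shared environment and share one reward
% function, parameterized by \theta, which H observes and R does not:
\[
  \max_{\pi^H,\ \pi^R}\ \mathbb{E}\Big[\sum_{t} \gamma^t\, R\big(s_t, a^H_t, a^R_t;\ \theta\big)\Big],
  \qquad \theta \sim P(\theta),\ \ \theta \text{ observed by } H \text{ only}.
\]
% The robot's incentive to defer to the human comes from its uncertainty
% about \theta; "fully updated deference" asks what is left of that
% incentive once the robot has (effectively) resolved its uncertainty.
```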

Lucas Perry: It seems like so far utility functions are the best way of trying to get an understanding of what human beings care about and value and have preferences over, you guys are bringing up all of the difficult intricacies with trying to understand and model human preferences as utility functions. One of the things that you also bring up here, Rohin, in your review, is the risk of lock-in, which may require us to solve hard philosophical problems before the development of AGI. That has something to do with ambitious value learning, which would be like learning the one true human utility function which probably just doesn’t exist.

Buck Shlegeris: I think I want to object to a little bit of your framing there. My stance on utility functions of humans isn’t that there are a bunch of complicated subtleties on top, it’s that modeling humans with utility functions is just a really sad state to be in. If your alignment strategy involves positing that humans behave as expected utility maximizers, I am very pessimistic about it working in the short term, and I just think that we should be trying to completely avoid anything which does that. It’s not like there’s a bunch of complicated sub-problems that we need to work out about how to describe us as expected utility maximizers, my best guess is that we would just not end up doing that because it’s not a good idea.

Lucas Perry: For the ambitious value learning?

Buck Shlegeris: Yeah, that’s right.

Lucas Perry: Okay, do you have something that’s on offer?

Buck Shlegeris: The two options instead of that, which seem attractive to me? As I said earlier, one is that you just convince everyone not to use AI systems for things where you need to have an understanding of large scale human preferences. The other one is the kind of thing that Paul Christiano’s iterated distillation and amplification, or a variety of his other ideas, are trying to get at, which is, I think: if you make a really powerful AI system, it’s actually going to have an excellent model of human values in whatever representation is best for actually making predictions about humans, because for a really excellent AGI, like a really excellent paperclip maximizer, it’s really important to really get how humans work so that it can manipulate them into letting it build lots of paperclip factories or whatever.

So I think that if you think that we have AGI, then by assumption I think we have a system which is able to reason about human values if it wants. And so if we can apply these really powerful AI systems to tasks such that the things that they do display their good understanding of human values, then we’re fine and it’s just okay that there was no way that we could represent a utility function directly. So for instance, the idea in IDA is that if we could have this system which is just trying to answer questions the same way that humans would, but enormously more cheaply because it can run faster than humans and a few other tricks, then we don’t have to worry about writing down a utility function of humans directly because we can just make the system do things that are kind of similar to the things humans would have done, and so it implicitly has this human utility function built into it. That’s option two. Option one is don’t use anything that requires a complex human utility function; option two is have your systems learn human values implicitly, by giving them a task such that this is beneficial for them and such that their good understanding of human values comes out in their actions.

Rohin Shah: One way I might condense that point, is that you’re asking for a nice formalism for human preferences and I just point to all the humans out there in the world who don’t know anything about utility functions, which is 99% of them and nonetheless still seem pretty good at inferring human preferences.

Lucas Perry: On this part about AGI, if it is AGI it should be able to reason about human preferences, then why would it not be able to construct something that was more explicit and thus was able to do more ambitious value learning?

Buck Shlegeris: So it can totally do that, itself. But we can’t force that structure from the outside with our own algorithms.

Rohin Shah: Image classification is a good analogy. Like, in the past we were using hand engineered features, namely SIFT and HOG, and then training classifiers over these hand engineered features in order to do image classification. And then we came to the era of deep learning and we just said, yeah, throw away all those features and just do everything end to end with a convolutional neural net, and it worked way better. The point was that, in fact, there are good representations for most tasks, and humans trying to write them down ahead of time just doesn’t work very well. It tends to work better if you let the AI system discover its own representations that best capture the thing you wanted to capture.

Lucas Perry: Can you unpack this point a little bit more? I’m not sure that I’m completely understanding it. Buck is rejecting this modeling human beings explicitly as expected utility maximizers and trying to explicitly come up with utility functions in our AI systems. The first was to convince people not to use these kinds of things. And the second is to make it so that the behavior and output of the AI systems has some implicit understanding of human behavior. Can you unpack this a bit more for me or give me another example?

Rohin Shah: So here’s another example. Let’s say I was teaching my kid, that I don’t have, how to catch a ball. It seems that the formalism that’s available to me for learning how to catch a ball is, well, you can go all the way down to look at our best models of physics. We could use Newtonian mechanics, let’s say: here are these equations, estimate the velocity and the distance of the ball and the angle at which it’s thrown, plug that into these equations, and then predict that the ball’s going to come here, and then just put your hand there and then magically catch it. We won’t even talk about the catching part. That seems like a pretty shitty way to teach a kid how to catch a ball.

Probably it’s just a lot better to just play catch with the kid for a while and let the kid’s brain figure out this is how to predict where the ball is going to go such that I can predict where it’s going to be and then catch it.

I’m basically 100% confident that the thing that the brain is doing is not Newtonian mechanics. It’s doing something else that’s just way more efficient at predicting where the ball is going to be so that I can catch it and if I forced the brain to use Newtonian mechanics, I bet it would not do very well at this task.

Buck Shlegeris: I feel like that still isn’t quite saying the key thing here. I don’t know how to say this off the top of my head either, but I think there’s this key point about: just because your neural net can learn a particular feature of the world doesn’t mean that you can back out some other property of the world by forcing the neural net to have a particular shape. Does that make any sense, Rohin?

Rohin Shah: Yeah, vaguely. I mean, well, no, maybe not.

Buck Shlegeris: The problem isn’t just the capabilities problem. There’s this way you can try and infer a human utility function by asking, according to this model, what’s the maximum likelihood utility function given all these things the human did. If you have a good enough model, you will in fact end up making very good predictions about the human, it’s just that the decomposition into their planning function and their utility function is not going to result in a utility function which is anything like a thing that I would want maximized if this process was done on me. There is going to be some decomposition like this, which is totally fine, but the utility function part just isn’t going to correspond to the thing that I want.

Rohin Shah: Yeah, that is also a problem, but I agree that is not the thing I was describing.

Lucas Perry: Is the point there that there’s a lack of alignment between the utility function and the planning function, given that the planning function imperfectly optimizes the utility function?

Rohin Shah: It’s more like there are just infinitely many possible pairs of planning functions and utility functions that exactly predict human behavior, even if it were true that humans were expected utility maximizers, which Buck is arguing we’re not, and I agree with him. There is a planning function that says humans are perfectly anti-rational, and if you ask what utility function works with that planner to predict human behavior, well, the literal negative of the true utility function combined with the anti-rational planner produces the same behavior as the true utility function with the perfect planner. There’s no information that lets you distinguish between these two possibilities.

You have to build it in as an assumption. I think Buck’s point is that building things in as assumptions is probably not going to work.
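To write out the non-identifiability point Rohin just made, here is a rough sketch in our own notation (not a quotation of any particular paper):

```latex
% Non-identifiability of (planner, reward) pairs from behavior (notation ours).
% Suppose observed behavior is produced by applying a planner Pl to a reward R:
\[
  \pi = \mathrm{Pl}(R).
\]
% A "perfectly anti-rational" planner paired with the negated reward yields
% exactly the same behavior:
\[
  \mathrm{Pl}_{\mathrm{anti}}(-R) = \mathrm{Pl}(R) = \pi,
\]
% so no amount of behavioral data distinguishes (Pl, R) from (Pl_anti, -R);
% choosing between them requires an extra normative assumption, which is the
% is-ought gap discussed below.
```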

Buck Shlegeris: Yeah.

Rohin Shah: A point I agree with. In philosophy this is called the is-ought problem, right? What you can train your AI system on is a bunch of “is” facts and then you have to add in some assumptions in order to jump to “ought” facts, which is what the utility function is trying to do. The utility function is trying to tell you how you ought to behave in new situations and the point of the is-ought distinction is that you need some bridging assumptions in order to get from is to ought.

Buck Shlegeris: And I guess an important part here is your system will do an amazing job of answering “is” questions about what humans would say about “ought” questions. And so I guess maybe you could phrase the second part as: to get your system to do things that match human preferences, use the fact that it knows how to make accurate “is” statements about humans’ ought statements?

Lucas Perry: It seems like we’re strictly talking about inferring the human utility function or preferences via looking at behavior. What if you also had more access to the actual structure of the human’s brain?

Rohin Shah: This is like the approach that Stuart Armstrong likes to talk about. The same things still apply. You still have the is-ought problem where the facts about the brain are “is” facts and how you translate that into “ought” facts is going to involve some assumptions. Maybe you can break down such assumptions that everyone would agree with. Maybe it’s like if this particular neuron in a human brain spikes, that’s a good thing and we want more of it and if this other one spikes, that’s a bad thing. We don’t want it. Maybe that assumption is fine.

Lucas Perry: I guess I’m just pointing out, if you could find the places in the human brain that generate the statements about “ought” questions.

Rohin Shah: As Buck said, that lets you predict what humans would say about ought statements, which your assumption could then be, whatever humans say about ought statements, that’s what you ought to do. And that’s still an assumption. Maybe it’s a very reasonable assumption that we’re happy to put it into our AI system.

Lucas Perry: If we’re not willing to accept some humans’ “is” statements about “ought” questions then we have to do some meta-ethical moral policing in our assumptions around getting “is” statements from “ought” questions.

Rohin Shah: Yes, that seems right to me. I don’t know how you would do such a thing, but you would have to do something along those lines.

Buck Shlegeris: I would additionally say that I feel pretty great about trying to do things which use the fact that we can trust our AI to have good “is” answers to “ought” questions, but there’s a bunch of problems with this. I think it’s a good starting point but trying to use that to do arbitrarily complicated things in the world has a lot of problems. For instance, suppose I’m trying to decide whether we should design a city this way or that way. It’s hard to know how to go from the ability to know how humans would answer questions about preferences to knowing what you should do to design the city. And this is for a bunch of reasons, one of them is that the human might not be able to figure out from your city building plans what the city’s going to actually be like. And another is that the human might give inconsistent answers about what design is good, depending on how you phrase the question, such that if you try to figure out a good city plan by optimizing for the thing that the human is going to be most enthusiastic about, then you might end up with a bad city plan. Paul Christiano has written in a lot of detail about a lot of this.

Lucas Perry: That also reminds me of what Stuart Armstrong wrote about how the framing of the questions changes the preferences you get as output.

Rohin Shah: Yep.

Buck Shlegeris: Sorry, to be clear, other people than Paul Christiano have also written a lot about this stuff (including Rohin). My favorite writing about this stuff is by Paul.

Lucas Perry: Yeah, those do seem problematic but it would also seem that there would be further “is” statements that if you queried people’s meta-preferences about those things, you would get more “is” statements about that, but then that just pushes the “ought” assumptions that you need to make further back. Getting into very philosophically weedy territory. Do you think that this kind of thing could be pushed to the long reflection as is talked about by William MacAskill and Toby Ord or how much of this do you actually think needs to be solved in order to have safe and aligned AGI?

Buck Shlegeris: I think there are kind of two different ways that you could hope to have good outcomes from AGI. One is: set up a world such that you never needed to make an AGI which can make large scale decisions about the world. And two is: solve the full alignment problem.

I’m currently pretty pessimistic about the second of those being technically feasible. And I’m kind of pretty pessimistic about the first of those being a plan that will work. But in the world where you can have everyone only apply powerful and dangerous AI systems in ways that don’t require an understanding of human values, then you can push all of these problems onto the long reflection. In worlds where you can do arbitrarily complicated things in ways that humans would approve of, you don’t really need to long reflect this stuff, because these powerful AI systems already have the capacity of doing portions of the long reflection work inside themselves as needed. (Quotes about the long reflection)

Rohin Shah: Yeah, so I think my take, it’s not exactly disagreeing with Buck. It’s more like it comes from a different frame than Buck’s. If you just got AI systems that did the things that humans do now, this does not seem to me to obviously require solving hard problems in philosophy. That’s the lower bound on what you can do before having to do long reflection type stuff. Eventually you do want to do a longer reflection. I feel relatively optimistic about having a technical solution to alignment that allows us to do the long reflection after building AI systems. So the long reflection would include both humans and AI systems thinking hard, reflecting on difficult problems and so on.

Buck Shlegeris: To be clear, I’m super enthusiastic about there being a long reflection or something along those lines.

Lucas Perry: I always find it useful reflecting on just how human beings do many of these things because I think that when thinking about things in the strict AI alignment sense, it can seem almost impossible, but human beings are able to do so many of these things without solving all of these difficult problems. It seems like in the very least, we’ll be able to get AI systems that very, very approximately do what is good or what is approved of by human beings because we can already do that.

Buck Shlegeris: That argument doesn’t really make sense to me. It also didn’t make sense when Rohin referred to it a minute ago.

Rohin Shah: It’s not an argument for we technically know how to do this. It is more an argument for this as at least within the space of possibilities.

Lucas Perry: Yeah, I guess that’s how I was also thinking of it. It is within the space of possibilities. So utility functions are good because they can be optimized for, and there seem to be risks with optimization. Is there anything here that you guys would like to say about better understanding agency? I know this is one of the things that is important within the MIRI agenda.

Buck Shlegeris: I am a bad MIRI employee. I don’t really get that part of the MIRI agenda, and so I’m not going to defend it. I have certainly learned some interesting things from talking to Scott Garrabrant and other MIRI people who have lots of interesting thoughts about this stuff. I don’t quite see the path from there to good alignment strategies. But I also haven’t spent a super long time thinking about it because I, in general, don’t try to think about all of the different AI alignment things that I could possibly think about.

Rohin Shah: Yeah. I also am not a good person to ask about this. Most of my knowledge comes from reading things and MIRI has stopped writing things very much recently, so I don’t know what their ideas are. I, like Buck, don’t really see a good alignment strategy that starts with, first we understand optimization and so that’s the main reason why I haven’t looked into it very much.

Buck Shlegeris: I think I don’t actually agree with the thing you said there, Rohin. I feel like understanding optimization could plausibly be really nice. Basically the story there is, it’s a real bummer if we have to make really powerful AI systems via searching over large recurrent policies for things that implement optimizers. If it turned out that we could figure out some way of coding up optimizer stuff directly, then this could maybe mean you didn’t need to make mesa-optimizers. And maybe this means that your inner alignment problems go away, which could be really nice. The thing that I was saying I haven’t thought that much about is the relevance of thinking about, for instance, the various weirdnesses that happen when you consider embedded agency or decision theory, and things like that.

Rohin Shah: Oh, got it. Yeah. I think I agree that understanding optimization would be great if we succeeded at it and I’m mostly pessimistic about us succeeding at it, but also there are people who are optimistic about it and I don’t know why they’re optimistic about it.

Lucas Perry: Hey, it’s post-podcast Lucas here again. So, I just want to add a little more detail here, again on behalf of Rohin. Here, he feels pessimistic about us understanding optimization well enough, and in a short enough time period, to be able to create powerful optimizers that we understand and that rival the performance of the AI systems we’re already building and will build in the near future. Back to the episode.

Buck Shlegeris: The arguments that MIRI has made about this… they think that there are a bunch of questions about what optimization is that are plausibly just not that hard compared to other problems which small groups of people have occasionally solved, like coming up with the foundations of mathematics: kind of a big conceptual deal, but also a relatively small group of people. And before we had formalizations of math, I think it might’ve seemed as impossible to progress on as formalizing optimization or coming up with a better picture of that. So maybe that’s my argument for some optimism.

Rohin Shah: Yeah, I think pointing to some examples of great success does not imply… like, there are probably many similar things that didn’t work out and we don’t know about them, because nobody bothered to tell us about them because they failed. Seems plausible, maybe.

Lucas Perry: So, exploring this point of agency more deeply, can either or both of you give us a little bit of a picture of the relevance or non-relevance of decision theory here to AI alignment? And I think, Buck, you mentioned the trickiness of embedded decision theory.

Rohin Shah: If you go back to our traditional argument for AI risk, it’s basically powerful AI systems will be very strong optimizers. They will possibly be misaligned with us and this is bad. And in particular one specific way that you might imagine this going wrong is this idea of mesa optimization where we don’t know how to build optimizers right now. And so what we end up doing is basically search across a huge number of programs looking for ones that do well at optimization and use that as our AGI system. And in this world, if you buy that as a model of what’s happening, then you’ll basically have almost no control over what exactly that system is optimizing for. And that seems like a recipe for misalignment. It sure would be better if we could build the optimizer directly and know what it is optimizing for. And in order to do that, we need to know how to do optimization well.

Lucas Perry: What are the kinds of places that we use mesa optimizers today?

Rohin Shah: It’s not used very much yet. The field of meta learning is the closest example. In the field of meta learning you have a distribution over tasks and you use gradient descent or some other AI technique in order to find an AI system that itself, once given a new task, learns how to perform that task well.

Existing meta learning systems are more like learning how to do all the tasks well, and then when they see a new task they just figure out, ah, it’s this task, and then they roll out the policy that they already learned. But the eventual goal for meta learning is to get something that, online, learns how to do the task without having previously figured out how to do that task.
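As a toy illustration of the “learning to learn” structure Rohin describes, here is a minimal Reptile-style meta-learning sketch in Python (PyTorch). Everything here, including the sine-wave task, network size, and learning rates, is an illustrative assumption of ours rather than anything discussed in the episode.

```python
# Minimal Reptile-style meta-learning sketch (illustrative only).
# Outer loop: search for initial weights from which a few inner-loop
# gradient steps adapt quickly to a newly sampled task.
import copy
import torch
import torch.nn as nn

def sample_task():
    """A task is a sine wave with a random amplitude and phase."""
    amp = torch.rand(1).item() * 4.5 + 0.5
    phase = torch.rand(1).item() * 3.14
    def sample_data(n=10):
        x = torch.rand(n, 1) * 10 - 5
        return x, amp * torch.sin(x + phase)
    return sample_data

meta_net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
inner_lr, meta_lr, inner_steps = 0.02, 0.1, 5

for meta_step in range(2000):
    sample_data = sample_task()
    # Inner loop: adapt a copy of the meta-parameters to this one task.
    task_net = copy.deepcopy(meta_net)
    opt = torch.optim.SGD(task_net.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        x, y = sample_data()
        opt.zero_grad()
        loss_fn(task_net(x), y).backward()
        opt.step()
    # Outer (meta) update: nudge the meta-parameters toward the adapted ones.
    with torch.no_grad():
        for p_meta, p_task in zip(meta_net.parameters(), task_net.parameters()):
            p_meta += meta_lr * (p_task - p_meta)
```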

Lucas Perry: Okay, so Rohin, did what you say cover embedded decision theory?

Rohin Shah: No, not really. I think embedded decision theory is just, we want to understand optimization. Our current notion of optimization, one way you could formalize it is to say my AI agent is going to have Bayesian belief over all the possible ways that the environment could be. It’s going to update that belief over time as it gets observations and then it’s going to act optimally with respect to that belief, by maximizing its expected utility. And embedded decision theory basically calls into question the idea that there’s a separation between the agent and the environment. In particular I, as a human, couldn’t possibly have a Bayesian belief about the entire earth because the entire Earth contains me. I can’t have a Bayesian belief over myself so this means that our existing formalization of agency is flawed. It can’t capture these things that affect real agents. And embedded decision theory, embedded agency, more broadly, is trying to deal with this fact and have a new formalization that works even in these situations.
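The standard dualistic formalization Rohin summarizes can be written roughly as follows (our notation); embedded agency is the observation that the agent is itself part of every environment e below, which this picture quietly assumes away.

```latex
% The standard (non-embedded) Bayesian agent sketch (notation ours).
% Belief update over environments e after the observation history o_{1:t}:
\[
  b_t(e) \ \propto\ P(o_{1:t} \mid e)\, P(e).
\]
% Action choice by expected utility maximization under the current belief:
\[
  a_t = \arg\max_{a}\ \mathbb{E}_{e \sim b_t}\big[\,U \mid a, e\,\big].
\]
% Embedded agency drops the assumption that the agent sits outside e, which
% is what breaks this formalization for real, physically embedded agents.
```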

Buck Shlegeris: I want to give my understanding of the pitch for it. One part is that if you don’t understand embedded agency, then if you try to make an AI system in a hard coded way, like making a hard coded optimizer, traditional phrasings of what an optimizer is, are just literally wrong in that, for example, they’re assuming that you have these massive beliefs over world states that you can’t really have. And plausibly, it is really bad to try to make systems by hardcoding assumptions that are just clearly false. And so if we want to hardcode agents with particular properties, it would be good if we knew a way of coding the agent that isn’t implicitly making clearly false assumptions.

And the second pitch for it is something like when you want to understand a topic, sometimes it’s worth looking at something about the topic which you’re definitely wrong about, and trying to think about that part until you are less confused about it. When I’m studying physics or something, a thing that I love doing is looking for the easiest question whose answer I don’t know, and then trying to just dive in until I have satisfactorily answered that question, hoping that the practice that I get about thinking about physics from answering a question correctly will generalize to much harder questions. I think that’s part of the pitch here. Here is a problem that we would need to answer, if we wanted to understand how superintelligent AI systems work, so we should try answering it because it seems easier than some of the other problems.

Lucas Perry: Okay. I think I feel satisfied. The next thing here Rohin in your AI alignment 2018-19 review is value learning. I feel like we’ve talked a bunch about this already. Is there anything here that you want to say or do you want to skip this?

Rohin Shah: One thing we didn’t cover is, if you have uncertainty over what you’re supposed to optimize, this turns into an interactive sort of game between the human and the AI agent, which seems pretty good. A priori you should expect that there’s going to need to be a lot of interaction between the human and the AI system in order for the AI system to actually be able to do the things that the human wants it to do. And so having formalisms and ideas of where this interaction naturally falls out seems like a good thing.

Buck Shlegeris: I’ve said a lot of things about how I am very pessimistic about value learning as a strategy. Nevertheless it seems like it might be really good for there to be people who are researching this, and trying to get as good as we can get at improving sample efficiency so that you can have your AI systems understand your preferences over music with as little human interaction as possible, just in case it turns out to be possible to solve the hard version of value learning. Because a lot of the engineering effort required to make ambitious value learning work will plausibly be in common with the kinds of stuff you have to do to make these more simple specification learning tasks work out. That’s a reason for me to be enthusiastic about people researching value learning even if I’m pessimistic about the overall thing working.

Lucas Perry: All right, so what is robustness and why does it matter?

Rohin Shah: Robustness is one of those words that doesn’t super clearly have a definition and people use it differently. Robust agents don’t fail catastrophically in situations slightly different from the ones that they were designed for. One example of a case where we see a failure of robustness currently, is in adversarial examples for image classifiers, where it is possible to take an image, make a slight perturbation to it, and then the resulting image is completely misclassified. You take a correctly classified image of a Panda, slightly perturb it such that a human can’t tell what the difference is, and then it’s classified as a gibbon with 99% confidence. Admittedly this was with an older image classifier. I think you need to make the perturbations a bit larger now in order to get them.
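As a concrete illustration of the kind of perturbation Rohin describes, here is a minimal fast-gradient-sign-method sketch in Python (PyTorch). The `model`, the assumption that pixel values lie in [0, 1], and the epsilon value are illustrative assumptions of ours, not details of the panda/gibbon result he mentions.

```python
# Minimal FGSM adversarial-perturbation sketch (illustrative only).
# Assumes `model` maps a batch of images to class logits and that pixel
# values are in [0, 1].
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, true_label, eps=0.01):
    """Return a copy of `image` perturbed to increase the classification loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()
    # Move each pixel a small step in the direction that increases the loss.
    perturbed = image + eps * image.grad.sign()
    return perturbed.clamp(0, 1).detach()
```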

Lucas Perry: This is because the relevant information that it uses is very local to infer panda-ness, rather than global properties of the panda?

Rohin Shah: It’s more like they’re high frequency features or imperceptible features. There’s a lot of controversy about this, but there is a pretty popular recent paper, which I believe but not everyone believes, claiming that this is because the classifiers are picking up on real imperceptible features that do generalize to the test set but that humans can’t detect. That’s an example of robustness. Recently people have been applying this to reinforcement learning, both by adversarially modifying the observations that agents get and also by training agents that act in the environment adversarially towards the original agent. One paper out of CHAI showed that there’s this kick and defend environment where you’ve got two MuJoCo robots. One of them is kicking a soccer ball; the other one’s a goalie that’s trying to prevent the kicker from successfully shooting a goal. They showed that if you do self play to get kickers and defenders, and then you take the kicker, freeze it, don’t train it anymore, and retrain a new defender against this kicker, something odd happens.

What is the strategy that this new defender learns? It just falls to the ground and flaps about in a random-looking way, and the kicker gets so confused that it usually fails to even touch the ball. So this is sort of an adversarial example for RL agents, showing that they’re not very robust either.

There was also a paper out of DeepMind that did the same sort of thing. For their adversarial attack, they learned what sorts of mistakes the agent would make early on in training and then tried to replicate those mistakes once the agent was fully trained, and they found that this helped them uncover a lot of bad behaviors, even at the end of training.

From the perspective of alignment, it’s clear that we want robustness. It’s not exactly clear what we want robustness to. This robustness to adversarial perturbations is a bit weird as a threat model. If there is an adversary in the environment, they’re probably not going to be restricted to small perturbations, and they’re probably not going to get white box access to your AI system; even if they did, this doesn’t really connect with the story of an AI system adversarially optimizing against humans, which is how we get to the x-risk part, so it’s not totally clear.

I think on the intent alignment case, which is the thing that I usually think about, you mostly want to ensure that whatever is driving the “motivation” of the AI system, you want that to be very robust. You want it to agree with what humans would want in all situations or at least all situations that are going to come up or something like that. Paul Christiano has written a few blog posts about this that talk about what techniques he’s excited about solving that problem, which boil down to interpretability, adversarial training, and improving adversarial training through relaxations of the problem.

Buck Shlegeris: I’m pretty confused about this, and so it’s possible what I’m going to say is dumb. When I look at problems with robustness or problems that Rohin put in this robustness category here, I want to divide it into two parts. One of the parts is, things that I think of as capability problems, which I kind of expect the rest of the world will need to solve on its own. For instance, things about safe exploration, how do I get my system to learn to do good things without ever doing really bad things, this just doesn’t seem very related to the AI alignment problem to me. And I also feel reasonably optimistic that you can solve it by doing dumb techniques which don’t have anything too difficult to them, like you can have your system so that it has a good model of the world that it got from unsupervised learning somehow and then it never does dumb enough things. And also I don’t really see that kind of robustness problem leading to existential catastrophes. And the other half of robustness is the half that I care about a lot, which in my mind, is mostly trying to make sure that you succeeded at inner alignment. That is, that the mesa optimizers you’ve found through gradient descent have goals that actually match your goals.

This is like robustness in the sense that you’re trying to guarantee that in every situation, your AI system, as Rohin was saying, is intent aligned with you. It’s trying to do the kind of thing that you want. And I worry that, by default, we’re going to end up with AI systems not intent aligned, so there exist a bunch of situations they can be put in such that they do things that are very much not what you’d want, and therefore they fail at robustness. I think this is a really important problem, it’s like half of the AI safety problem or more, in my mind, and I’m not very optimistic about being able to solve it with prosaic techniques.

Rohin Shah: That sounds roughly similar to what I was saying. Yes.

Buck Shlegeris: I don’t think we disagree about this super much except for the fact that I think you seem to care more about safe exploration and similar stuff than I think I do.

Rohin Shah: I think safe exploration’s a bad example. I don’t know what safe exploration is even trying to solve, but for the other stuff, I agree, I do care about it more. One place where I somewhat disagree with you is this point that all these robustness problems are things that the rest of the world has incentives to figure out, and will probably figure out. That seems true for alignment too: it sure seems like you want your system to be aligned in order to do the things that you actually want, so everyone has an incentive for this to happen. I totally expect people who aren’t EAs or rationalists or weird longtermists to be working on AI alignment in the future, and to some extent even now. I think that’s one thing.

Buck Shlegeris: You should say your other thing, but then I want to get back to that point.

Rohin Shah: The other thing is I think I agree with you that it’s not clear to me how failures of the robustness of things other than motivation lead to x-risk, but I’m more optimistic than you are that our solutions to those kinds of robustness will help with the solutions to “motivation robustness” or how to make your mesa optimizer aligned.

Buck Shlegeris: Yeah, sorry, I guess I actually do agree with that last point. I am very interested in trying to figure out how to have aligned mesa optimizers, and I think that a reasonable strategy to pursue in order to get aligned mesa optimizers is trying to figure out how to make your image classifiers robust to adversarial examples. I think you probably won’t succeed even if you succeed with the image classifiers, but it seems like the image classifiers are still probably where you should start. And I guess if we can’t figure out how to make image classifiers robust to adversarial examples in like 10 years, I’m going to be super pessimistic about the harder robustness problem, and that would be great to know.

Rohin Shah: For what it’s worth, my take on the adversarial examples of image classifiers is, we’re going to train image classifiers on more data with bigger nets, it’s just going to mostly go away. Prediction. I’m laying my cards on the table.

Buck Shlegeris: That’s also something like my guess.

Rohin Shah: Okay.

Buck Shlegeris: My prediction is: to get image classifiers that are robust to epsilon-ball perturbations or whatever, some combination of scaling things up, adversarial training, and a couple of other clever things will probably mean that we have robust image classifiers in 5 or 10 years at the latest.
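As a rough sketch of what the adversarial-training ingredient in that prediction looks like, here is a minimal training step that crafts perturbed inputs against the current model and then trains on them. This assumes PyTorch; the attack choice and hyperparameters are assumptions for illustration, not a specific published recipe.

```python
# Minimal adversarial-training step (assumes PyTorch). The attack and
# hyperparameters are illustrative assumptions, not a specific recipe.
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, images, labels, epsilon=0.03):
    # 1. Craft an FGSM-style adversarial batch against the current model.
    images_adv = images.clone().detach().requires_grad_(True)
    F.cross_entropy(model(images_adv), labels).backward()
    images_adv = (images_adv + epsilon * images_adv.grad.sign()).clamp(0, 1).detach()

    # 2. Train on the perturbed batch so the model learns to resist the attack.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images_adv), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice people usually use stronger multi-step attacks than this single-step version, but the shape of the loop, attack then train on the attacked batch, is the same.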

Rohin Shah: Cool. And you wanted to return to the other point about the world having incentives to do alignment.

Buck Shlegeris: So I don’t quite know how to express this, but I think it’s really important, which is going to make this a really fun experience for everyone involved. You know how Airbnb… Or sorry, I guess a better example of this is actually Uber drivers. I give basically every Uber driver a five star rating, even though some Uber drivers are just clearly more pleasant for me than others, and Uber doesn’t seem to try very hard to get around these problems, even though I think that if Uber caused there to be a 30% difference in pay between the drivers who I think of as 75th percentile and the drivers I think of as 25th percentile, this would make the service probably noticeably better for me. I guess it seems to me that a lot of the time the world just doesn’t try to do kind of complicated things to make systems actually aligned, and it just does hack jobs, and then everyone deals with the fact that everything is unaligned as a result.

To draw this analogy back, I think that we’re likely to have the kind of alignment techniques that solve problems that are as simple and obvious as: we should have a way to rate your hosts on Airbnb. But I’m worried that we won’t ever get around to solving the problems that are like: but what if your hosts are incentivized to tell you sob stories so that you give them good ratings, even though actually they were worse than some other hosts? And this is never a big enough deal that people are unilaterally, individually incentivized to solve the harder version of the alignment problem, and then everyone ends up using these systems that actually aren’t aligned in the strong sense, and then we end up in a doomy world. I’m curious if any of that made any sense.

Lucas Perry: Is a simple way to put that, that we fall into an inadequate or suboptimal equilibrium, and then tragedy of the commons and bad game theory stuff happens that keeps us locked in, and that the same story could apply to alignment?

Buck Shlegeris: Yeah, that’s not quite what I mean.

Lucas Perry: Okay.

Rohin Shah: I think Buck’s point is that actually Uber or Airbnb could unilaterally, no gains required, make their system better, and this would be an improvement for them and everyone else, and they don’t do it. There’s nothing about an equilibrium there; it’s a failure of Uber to do this thing that seems so obviously good.

Buck Shlegeris: I’m not actually claiming that it’s better for Uber, I’m just claiming that there is a misalignment there. Plausibly, an Uber exec, if they were listening to this they’d just be like, “LOL, that’s a really stupid idea. People would hate it.” And then they would say more complicated things like “most riders are relatively price sensitive and so this doesn’t matter.” And plausibly they’re completely right.

Rohin Shah: That’s what I was going to say.

Buck Shlegeris: But the thing which feels important to me is something like: a lot of the time it’s not worth solving the alignment problems at any given moment because something else is a bigger problem for how things are going locally. And this can continue being the case for a long time, and then you end up with everyone being locked into this system where they never solved the alignment problems. And it’s really hard to make people understand this, and then you get locked into this bad world.

Rohin Shah: So if I were to try and put that in the context of AI alignment, I think this is a legitimate reason for being more pessimistic. And the way that I would make that argument is: it sure seems like we are going to decide on what method or path we’re going to use to build AGI. Maybe we’ll do a bunch of research and decide we’re just going to scale up language models or something like this. I don’t know. And we will do that before we have any idea of which technique would be easiest to align and as a result, we will be forced to try to align this exogenously chosen AGI technique and that would be harder than if we got to design our alignment techniques and our AGI techniques simultaneously.

Buck Shlegeris: I’m imagining some pretty slow take off here, and I don’t imagine this as ever having a phase where we built this AGI and now we need to align it. It’s more like we’re continuously building and deploying these systems that are gradually more and more powerful, and every time we want to deploy a system, it has to be doing something which is useful to someone. And many of the things which are useful, require things that are kind of like alignment. “I want to make a lot of money from my system that will give advice,” and if it wants to give good generalist advice over email, it’s going to need to have at least some implicit understanding of human preferences. Maybe we just use giant language models and everything’s just totally fine here. A really good language model isn’t able to give arbitrarily good aligned advice, but you can get advice that sounds really good from a language model, and I’m worried that the default path is going to involve the most popular AI advice services being kind of misaligned, and just never bothering to fix that. Does that make any more sense?

Rohin Shah: Yeah, I totally buy that that will happen. But my view is more that, as AI systems do more and more important things in the world, it becomes more and more important that they are really, truly aligned, and investment in alignment increases correspondingly.

Buck Shlegeris: What’s the mechanism by which people realize that they need to put more work into alignment here?

Rohin Shah: I think there’s multiple. One is I expect that people are aware, like even in the Uber case, I expect people are aware of the misalignment that exists, but decide that it’s not worth their time to fix it. So the continuation of that, people will be aware of it and then they will decide that they should fix it.

Buck Shlegeris: If I’m trying to sell to city governments this language model based system which will give them advice on city planning, it’s not clear to me that at any point the city governments are going to start demanding better alignment features. Maybe that’s the way that it goes but it doesn’t seem obvious that city governments would think to ask that, and —

Rohin Shah: I wasn’t imagining this from the user side. I was imagining this from the engineers or designers side.

Buck Shlegeris: Yeah.

Rohin Shah: I think from the user side I would speak more to warning shots. You know, you have your cashier AI system or your waiter AIs and they were optimizing for tips more so than actually collecting money and so they like offer free meals in order to get more tips. At some point one of these AI systems passes all of the internal checks and makes it out into the world and only then does the problem arise and everyone’s like, “Oh my God, this is terrible. What the hell are you doing? Make this better.”

Buck Shlegeris: There are two mechanisms via which that alignment might be okay. One of them is that researchers might realize that they want to put more effort into alignment and then solve these problems. The other mechanism is that users might demand better alignment because of warning shots. I don’t buy that either of these is sufficient. I don’t buy that it’s sufficient for researchers to decide to do it, because in a competitive world, the researchers who realize this is important, if they try to only make aligned products, are not going to be able to sell them, because their products will be much less good than the unaligned ones. So you have to argue that there is demand for the things which are actually aligned well. But for this to work, your users have to be able to distinguish between things that have good alignment properties and those which don’t, and this seems really hard for users to do. And I guess, when I try to imagine analogies, I just don’t see many examples of people successfully solving problems like this, like businesses making products with different levels of dangerousness, and then users successfully buying the safe ones.

Rohin Shah: I think usually what happens is you get regulation that forces everyone to be safe. I don’t know if it was regulation, but like airplanes are incredibly safe. Cars are incredibly safe.

Buck Shlegeris: Yeah, but in this case what would happen is that doing the unsafe thing allows you to make enormous amounts of money, and so the countries which don’t put in the regulations are going to be massively advantaged compared to the ones which do.

Rohin Shah: Why doesn’t that apply for cars and airplanes?

Buck Shlegeris: So to start with, cars in poor countries are a lot less safe. Another thing is that a lot of the effort in making safer cars and airplanes comes from designing them. Once you’ve done the work of designing it, it’s not that much more expensive to put your formally-verified 747 software into more planes, and because of weird features of the fact that there are only like two big plane manufacturers, everyone gets the safer planes.

Lucas Perry: So tying this into robustness: the fundamental concern here is about the incentives to make aligned systems whose safety and alignment are robust in the real world.

Rohin Shah: I think that’s basically right. I sort of see these incentives as existing and the world generally being reasonably good at dealing with high stakes problems.

Buck Shlegeris: What’s an example of the world being good at dealing with a high stakes problem?

Rohin Shah: I feel like biotech seems reasonably well handled, relatively speaking.

Buck Shlegeris: Like bio-security?

Rohin Shah: Yeah.

Buck Shlegeris: Okay, if the world handles AI as well as bio-security, there’s no way we’re okay.

Rohin Shah: Really? I’m aware of ways in which we’re not doing bio-security well, but there seem to be ways in which we’re doing it well too.

Buck Shlegeris: The nice thing about bio-security is that very few people are incentivized to kill everyone, and this means that it’s okay if you’re sloppier about your regulations, but my understanding is that lots of regulations are pretty weak.

Rohin Shah: I guess I was more imagining the research community’s coordination on this. Surprisingly good.

Buck Shlegeris: I wouldn’t describe it that way.

Rohin Shah: It seems like the vast majority of the research community is onboard with the right thing and like 1% isn’t. Yeah. Plausibly we need to have regulations for that last 1%.

Buck Shlegeris: I think that 99% of the synthetic biology research community is on board with “it would be bad if everyone died.” I think that some very small proportion is onboard with things like “we shouldn’t do research if it’s very dangerous and will make the world a lot worse.” I would say like way less than half of synthetic biologists seem to agree with statements like “it’s bad to do really dangerous research.” Or like, “when you’re considering doing research, you consider differential technological development.” I think this is just not a thing biologists think about, from my experience talking to biologists.

Rohin Shah: I’d be interested in betting with you on this afterwards.

Buck Shlegeris: Me too.

Lucas Perry: So it seems like it’s going to be difficult to come down to a concrete understanding or agreement here on the incentive structures in the world, whether they lead to the proliferation of unaligned or semi-aligned AI systems versus fully aligned AI systems, and whether that poses a kind of lock-in, right? Would you say that that fairly summarizes your concern, Buck?

Buck Shlegeris: Yeah. I expect that Rohin and I agree mostly on the size of the coordination problem required, or the costs that would be required by trying to do things the safer way. And I think Rohin is just a lot more optimistic about those costs being paid.

Rohin Shah: I think I’m optimistic both about people’s ability to coordinate paying those costs and about incentives pointing towards paying those costs.

Buck Shlegeris: I think that Rohin is right that I disagree with him about the second of those as well.

Lucas Perry: Are you interested in unpacking this anymore? Are you happy to move on?

Buck Shlegeris: I actually do want to talk about this for two more minutes. I am really surprised by the claim that humans have solved coordination problems as hard as this one. I think the example you gave is humans doing radically nowhere near well enough. What are examples of coordination problem type things… There was a bunch of stuff with nuclear weapons, where I feel like humans did badly enough that we definitely wouldn’t have been okay in an AI situation. There are a bunch of examples of the US secretly threatening people with nuclear strikes, which I think is an example of some kind of coordination failure. I don’t think that the world has successfully coordinated on never threaten first nuclear strikes. If we had successfully coordinated on that, I would consider nuclear weapons to be less of a failure, but as it is the US has actually according to Daniel Ellsberg threatened a bunch of people with first strikes.

Rohin Shah: Yeah, I think I update less on specific scenarios and update quite a lot more on “it just never happened.” The sheer amount of coincidence that would be required, given the level of “oh my God, there were close calls multiple times a year for many decades,” seems just totally implausible, and it just means that our understanding of what’s happening is wrong.

Buck Shlegeris: Again, also the thing I’m imagining is this very gradual takeoff world where people, every year, they release their new most powerful AI systems. And if, in a particular year, AI Corp decided to not release its thing, then AI Corps two and three and four would rise to being one, two and three in total profits instead of two, three and four. In that kind of a world, I feel a lot more pessimistic.

Rohin Shah: I’m definitely imagining more of the case where they coordinate to all not do things. Either by international regulation or via the companies themselves coordinating amongst each other. Even without that, it’s plausible that AI Corp one does this. One example I’d give is, Waymo has just been very slow to deploy self driving cars relative to all the other self driving car companies, and my impression is that this is mostly because of safety concerns.

Buck Shlegeris: Interesting and slightly persuasive example. I would love to talk through this more at some point. I think this is really important and I think I haven’t heard a really good conversation about this.

Apologies for describing what I think is going wrong inside your mind or something, which is generally a bad way of saying things, but it sounds kind of to me like you’re implicitly assuming more concentrated advantage and fewer actors than I think actually are implied by gradual takeoff scenarios.

Rohin Shah: I’m usually imagining something like 100+ companies trying to build the next best AI system, and 10 or 20 of them being clear front runners or something.

Buck Shlegeris: That makes sense. I guess I don’t quite see how the coordination successes you were describing arise in that kind of a world. But I am happy to move on.

Lucas Perry: So before we move on from this point, is there anything you would suggest as obvious solutions, should Buck’s model of the risks here be correct? It seemed like it would demand more centralized institutions, which would help to mitigate some of the lock-in here.

Rohin Shah: Yeah. So there’s a lot of work in policy and governance about this. Not much of which is public unfortunately. But I think the thing to say is that people are thinking about it and it does sort of look like trying to figure out how to get the world to actually coordinate on things. But as Buck has pointed out, we have tried to do this before and so there’s probably a lot to learn from past cases as well. But I am not an expert on this and don’t really want to talk as though I were one.

Lucas Perry: All right. So there’s lots of governance and coordination thought that kind of needs to go into solving many of these coordination issues around developing beneficial AI. So I think with that we can move along now to scaling to superhuman abilities. So Rohin, what do you have to say about this topic area?

Rohin Shah: I think this is in some sense related to what we were talking about before, you can predict what a human would say, but it’s hard to back out true underlying values beneath them. Here the problem is, suppose you are learning from some sort of human feedback about what you’re supposed to be doing, the information contained in that tells you how to do whatever the human can do. It doesn’t really tell you how to exceed what the human can do without having some additional assumptions.

Now, depending on how the human feedback is structured, this might lead to different things: if the human is demonstrating how to do the task to you, then this would suggest that it would be hard to do the task any better than the human can, but if the human is evaluating how well you did the task, then you can do the task better in a way that the human wouldn’t be able to tell was better. Ideally, at some point we would like to have AI systems that can actually do just really powerful, great things that we are unable to understand all the details of, and so we would be able to neither demonstrate nor evaluate them.

How do we get to those sorts of AI systems? The main proposals in this bucket are iterated amplification, debate, and recursive reward modeling.

In iterated amplification, we start with an initial policy and alternate between amplification and distillation, which increase capabilities and efficiency respectively. This can encode a bunch of different algorithms, but usually amplification is done by decomposing questions into easier subquestions and then using the agent to answer those subquestions, while distillation can be done using supervised learning or reinforcement learning: you get these answers that are created by these amplified systems that take a long time to run, and you just train a neural net to very quickly predict the answers without having to do this whole big decomposition thing.

In debate, we train an agent through self play in a zero-sum game where the agent’s goal is to win a question answering debate as evaluated by a human judge. The hope here is that since both sides of the debate can point out flaws in the other side’s arguments (they’re both very powerful AI systems), such a setup can use a human judge to train far more capable agents while still incentivizing the agents to provide honest, true information.

With recursive reward modeling, you can think of it as an instantiation of the general alternate-between-amplification-and-distillation framework, but it works bottom up instead of top down. You start by building AI systems that can help you evaluate simple, easy tasks, then use those AI systems to help you evaluate more complex tasks, and you keep iterating this process until eventually you have AI systems that help you with very complex tasks like how to design a city. This lets you train an AI agent that can design the city effectively even though you don’t totally understand why it’s doing the things it’s doing or why they’re even good.
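A highly simplified way to see the amplify/distill alternation described above is as pseudocode. Everything here, the `Agent` interface, `decompose`, `recompose`, and `train_to_imitate`, is a hypothetical placeholder meant only to show the shape of the loop, not any group’s actual implementation.

```python
# Schematic sketch of iterated amplification. Every function and object here
# is a hypothetical placeholder; only the overall loop structure matters.

def amplify(agent, question, human, depth=2):
    """Slow but more capable: the human decomposes the question and uses the
    current (fast) agent to answer the subquestions."""
    if depth == 0:
        return agent.answer(question)
    subquestions = human.decompose(question)
    subanswers = [amplify(agent, q, human, depth - 1) for q in subquestions]
    return human.recompose(question, subanswers)

def iterated_amplification(agent, questions, human, rounds=10):
    for _ in range(rounds):
        # Amplification: build expensive, higher-quality answers.
        targets = {q: amplify(agent, q, human) for q in questions}
        # Distillation: train a fast model to directly predict the amplified
        # answers (e.g. by supervised learning), skipping the decomposition.
        agent = agent.train_to_imitate(targets)
    return agent
```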

Lucas Perry: Do either of you guys have any high level thoughts on any of these approaches to scaling to superhuman abilities?

Buck Shlegeris: I have some.

Lucas Perry: Go for it.

Buck Shlegeris: So to start with, I think it’s worth noting that another approach would be ambitious value learning, in the sense that I would phrase these not as approaches for scaling to superhuman abilities, but they’re like approaches for scaling to superhuman abilities while only doing tasks that relate to the actual behavior of humans rather than trying to back out their values explicitly. Does that match your thing Rohin?

Rohin Shah: Yeah, I agree. I often phrase that as with ambitious value learning, there’s not a clear ground truth to be focusing on, whereas with all three of these methods, the ground truth is what a human would do if they got a very, very long time to think or at least that is what they’re trying to approximate. It’s a little tricky to see why exactly they’re approximating that, but there are some good posts about this. The key difference between these techniques and ambitious value learning is that there is in some sense a ground truth that you are trying to approximate.

Buck Shlegeris: I think these are all kind of exciting ideas. I think they’re all kind of better ideas than I expected to exist for this problem a few years ago, which probably means we should update against my ability to correctly judge how hard AI safety problems are. That’s great news, inasmuch as I think a lot of these problems are really hard. Nevertheless, I don’t feel super optimistic that any of them are actually going to work. One thing which isn’t in the elevator pitch for IDA, which is iterated distillation and amplification (and debate), is that you get to hire the humans who are going to be providing the feedback, or the humans whose answers AI systems are going to be trained with. And this is actually really great. Because, for instance, you could have this program where you hire a bunch of people, put them through your one-month-long ‘training an AGI’ course, and then only take the top 50% of them. I feel a lot more optimistic about these proposals given you’re allowed to think really hard about how to set it up such that the humans have the easiest time possible. And this is one reason why I’m optimistic about people doing research in factored cognition and stuff, which I’m sure Rohin’s going to explain in a bit.

One comment about recursive reward modeling: it seems like it has a lot of things in common with IDA. The main downside that it seems to have to me is that the human is in charge of figuring out how to decompose the task into evaluations at a variety of levels. Whereas with IDA, your system itself is able to naturally decompose the task into a variety of levels, and for this reason I feel a bit more optimistic about IDA.

Rohin Shah: With recursive reward modeling, one agent that you can train is just an agent that’s good at doing decompositions. That is a thing you can do with it. It’s a thing that the people at DeepMind are thinking about. 

Buck Shlegeris: Yep, that’s a really good point. 

Rohin Shah: I also strongly like the fact that you can train your humans to be good at providing feedback. This is also true about specification learning. It’s less clear if it’s true about ambitious value learning. No one’s really proposed how you could do ambitious value learning really. Maybe arguably Stuart Russell’s book is kind of a proposal, but it doesn’t have that many details.

Buck Shlegeris: And, for example, it doesn’t address any of my concerns in ways that I find persuasive.

Rohin Shah: Right. But for specification learning also you definitely want to train the humans who are going to be providing feedback to the AI system. That is an important part of why you should expect this to work.

Buck Shlegeris: I often give talks where I try to give an introduction to IDA and debate as a proposal for AI alignment. I’m giving these talks to people with computer science backgrounds, and they’re almost always incredibly skeptical that it’s actually possible to decompose thought in this kind of way. And with debate, they’re very skeptical that truth wins, or that the Nash equilibrium is accuracy. For this reason I’m super enthusiastic about research into the factored cognition hypothesis of the type that Ought is doing some of.

I’m kind of interested in your overall take for how likely it is that the factored cognition hypothesis holds and that it’s actually possible to do any of this stuff, Rohin. You could also explain what that is.

Rohin Shah: I’ll do that. So basically with both iterated amplification, debate, or recursive reward modeling, they all hinge on this idea of being able to decompose questions, maybe it’s not so obvious why that’s true for debate, but it’s true. Go listen to the podcast about debate if you want to get more details on that.

So this hypothesis is basically that for any task we care about, it is possible to decompose it into a bunch of subtasks that are all easier to do, such that if you’re able to do the subtasks, then you can do the overall top level task. In particular, you can iterate this down, building a tree of smaller and smaller tasks, until you get to the level of tasks that a human could do in a day, or, if you push the decomposition very far, maybe tasks that a human can do in a couple of minutes. Whether or not you can actually decompose the task “be an effective CEO” into a bunch of subtasks that eventually bottom out in things humans can do in a few minutes is totally unclear. Some people are optimistic, some people are pessimistic. It’s called the factored cognition hypothesis, and Ought is an organization that’s studying it.

It sounds very controversial at first, and I, like many other people, had the intuitive reaction of ‘oh my God, this is never going to work and it’s not true’. I think the thing that actually makes me optimistic about it is that you don’t have to do what you might call a direct decomposition. You can do things like, if your task is to be an effective CEO, your first sub question could be “what are the important things to think about when being a CEO?” or something like that. Usually when I think of decompositions, I think of very object level concerns: first I need to deal with hiring, maybe I need to understand HR, maybe I need to understand all of the metrics that the company is optimizing. But the decompositions are totally allowed to also be meta level, where you spin off a bunch of computation that is just trying to answer the meta level question of how you should best think about this question at all.

Another important reason for optimism is that based on the structure of iterated amplification, debate and recursive reward modeling, this tree can be gigantic. It can be exponentially large. Something that we couldn’t run even if we had all of the humans on Earth collaborating to do this. That’s okay. Given how the training process is structured, considering the fact that you can do the equivalent of millennia of person years of effort in this decomposed tree, I think that also gives me more of a, ‘okay, maybe this is possible’ and that’s also why you’re able to do all of this meta level thinking because you have a computational budget for it. When you take all of those together, I sort of come up with “seems possible. I don’t really know.”
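To make “the tree can be exponentially large” concrete, here is a toy back-of-envelope calculation; the branching factor, depth, and minutes per leaf task are all assumed numbers chosen purely for illustration.

```python
# Back-of-envelope illustration (assumed numbers) of why the decomposition
# tree can be far too large to run with actual humans, yet still be usable
# as an implicit training target that is only ever approximated.
branching_factor = 10          # subquestions per question (assumption)
depth = 10                     # levels of decomposition (assumption)
minutes_per_leaf_task = 5      # human effort per leaf task (assumption)

leaf_tasks = branching_factor ** depth
person_years = leaf_tasks * minutes_per_leaf_task / 60 / 24 / 365
print(f"{leaf_tasks:,} leaf tasks ~ {person_years:,.0f} person-years of effort")
# -> 10,000,000,000 leaf tasks, roughly 95,000 person-years of round-the-clock
#    work: millennia of human effort that the training process only has to
#    approximate, never actually execute.
```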

Buck Shlegeris: I think I’m currently at 30-to-50% on the factored cognition thing basically working out. Which isn’t nothing.

Rohin Shah: Yeah, that seems like a perfectly reasonable thing. I think I could imagine putting a day of thought into it and coming up with numbers anywhere between 20 and 80.

Buck Shlegeris: For what it’s worth, in conversation at some point in the last few years, Paul Christiano gave numbers that were not wildly more optimistic than me. I don’t think that the people who are working on this think it’s obviously fine. And it would be great if this stuff works, so I’m really in favor of people looking into it.

Rohin Shah: Yeah, I should mention another key intuition against it. We have all these examples of human geniuses like Ramanujan, who were posed very difficult math problems and just immediately got the answer, and then you ask them how they did it and they say, well, I asked myself what the answer should be, and I thought the answer should be a continued fraction, and then I asked myself which continued fraction, and then I got the answer. And you’re like, that does not sound very decomposable. It seems like you need these magic flashes of intuition. Those would be the hard cases for factored cognition. It still seems possible that you could handle them, both by this exponential search over a bunch of possibilities, and also by being able to discover intuitions that work in practice, believing them because they work in practice, and then applying them to the problem at hand. You could imagine that with enough computation you’d be able to discover such intuitions.

Buck Shlegeris: You can’t answer a math problem by searching exponentially much through the search tree. The only exponential power you get from IDA is that IDA lets you specify the output of your cognitive process in such a way that it’s going to match some exponentially sized human process. As long as that exponentially sized human process was only exponentially sized because it’s really inefficient, but is fundamentally not an exponentially sized problem, then your machine learning should be able to speed it up a bunch. But the thing where you search over search strategies is not valid. If that’s all you can do, that’s not good enough.

Rohin Shah: Searching over search strategies, I agree you can’t do, but you do have an exponential search that could be implemented by humans. And we know, by hypothesis, that if you can solve it with a flash of intuition, there is in fact some more efficient way to do it. Whether or not the distillation steps will actually be enough to get to the point where you can do those flashes of intuition, that’s an open question.

Buck Shlegeris: This is one of my favorite areas of AI safety research and I would love for there to be more of it. Something I have been floating for a little while is that I kind of wish there was another Ought. It just seems like it would be so good if we had definitive information about the factored cognition hypothesis. And it also seems like the kind of thing which is potentially parallelizable. I feel like I know a lot of people who love talking about how thinking works. A lot of rationalists are really into this. I would just be super excited for some of them to form teams of four and go off on their own and build an Ought competitor. I feel like this is the kind of thing where, plausibly, a bunch of enthusiastic people could make progress on their own.

Rohin Shah: Yeah, I agree with that. Definitely seems like one of the higher value things but I might be more excited about universality.

Lucas Perry: All right, well let’s get started with universality then. What is universality and why are you optimistic about it?

Rohin Shah: So universality is hard to explain well in a single sentence. For whatever supervisor is training our agent, you want that supervisor to “know everything the agent knows.” In particular, if the agent comes up with some deceptive strategy to look like it’s achieving the goal when actually it hasn’t, the supervisor should know that it was using this deceptive strategy to try to trick the supervisor, and so the supervisor can penalize it. The classic example of why this is important and hard, also due to Paul Christiano, is plagiarism. Suppose you are training the AI system to produce novel works of literature, and as part of its training data, the AI system gets to read this library of a million books.

It’s possible that this AI system decides, hey, you know the best way I can make a great novel-seeming book is to just take these five books, take plot points and passages from each of them, and put them together, and then this new book will look totally novel and will be very good because I used all of the best Shakespearean writing or whatever. If your supervisor doesn’t know that the agent has done this, the only way the supervisor can really check is to go read the entire million books, even if the agent only read 10 books, and so then the supervision becomes way more costly than running the agent, which is not a great state to be in. What you really want is that if the agent does this, the supervisor is able to say, I see that you just copied this stuff over from these other books in order to trick me into thinking that you had written something novel that was good.

That’s bad. I’m penalizing you. Stop doing that in the future. Now, this sort of property is very nice in the abstract, but who knows whether or not we can actually build it in practice. There’s some reason for optimism that I don’t think I can adequately convey here, but I wrote a newsletter summarizing some of it some time ago, and reading through the posts I became more optimistic that it was an achievable property than when I first heard what the property was. The reason I’m optimistic about it is that it just sort of seems to capture the thing that we actually care about. It’s not everything; it doesn’t solve the robustness problem. Universality only tells you about what the agent’s currently doing, and you know all the facts about that. Whereas for robustness you want to say: even in these hypothetical situations that the agent hasn’t encountered yet and doesn’t know stuff about, even when it encounters those situations, it’s going to stay aligned with you. So universality doesn’t get you all the way there, but it definitely feels like it’s getting you quite a bit.

Buck Shlegeris: That’s really interesting to hear you phrase it that way. I guess I would have thought of universality as a subset of robustness. I’m curious what you think of that first.

Rohin Shah: I definitely think you could use universality to achieve a subset of robustness. Maybe I would say universality is a subset of interpretability.

Buck Shlegeris: Yeah, and I care about interpretability as a subset of robustness basically, or as a subset of inner alignment, which is pretty close to robustness in my mind. The other thing I would say is you were saying there that one difference between universality and robustness is that universality only tells you why the agent did the thing it currently did, and this doesn’t suffice to tell us about the situations that the agent isn’t currently in. One really nice thing though is that if the agent is only acting a particular way because it wants you to trust it, that’s a fact about its current behavior that you will know, and so if you have the universality property, your overseer just knows your agent is trying to deceive it. Which seems like it would be incredibly great and would resolve like half of my problem with safety if you had it.

Rohin Shah: Yeah, that seems right. The case that universality doesn’t cover is when your AI system is initially not deceptive, but then at some point in the future it’s like, ‘Oh my God, now it’s possible to go and build Dyson spheres or something, but wait, in this situation probably I should be doing this other thing and humans won’t like that. Now I better deceive humans’. The transition into deception would have to be a surprise in some sense even to the AI system.

Buck Shlegeris: Yeah, I guess I’m just not worried about that. Suppose I have this system which is as smart as a reasonably smart human, or 10 reasonably smart humans, but it’s not as smart as the whole world. If I can just ask it for its best sense of how aligned it is, and if I can trust its answer? I don’t know man, I’m pretty okay with systems that think they’re aligned answering that question honestly.

Rohin Shah: I think I somewhat agree. I like this reversal where I’m the pessimistic one.

Buck Shlegeris: Yeah me too. I’m like, “look, system, I want you to think as hard as you can to come up with the best arguments you can come up with for why you are misaligned, and the problems with you.” And if I just actually trust the system to get this right, then the bad outcomes I get here are just pure accidents. I just had this terrible initialization of my neural net parameters, such that I had this system that honestly believed that it was going to be aligned. And then as it got trained more, this suddenly changed and I couldn’t do anything about it. I don’t quite see the story for how this goes super wrong. It seems a lot less bad than the default situation.

Rohin Shah: Yeah. I think the story I would tell is something like, well, if you look at humans, they’re pretty wrong about what their preferences will be in the future. For example, there’s this trope of how teenagers fall in love and then fall out of love, but when they’re in love, they swear undying oaths to each other or something. To the extent that is true, that seems like the sort of failure that could lead to x-risk if it also happened with AI systems.

Buck Shlegeris: I feel pretty optimistic about all the garden-variety approaches to solving this. Teenagers were not selected very hard on accuracy of their undying oaths. And if you instead had accuracy of self-model as a key feature you were selecting for in your AI system, plausibly you’ll just be way more okay.

Rohin Shah: Yeah. Maybe people could coordinate well on this. I feel less good about people coordinating on this sort of problem.

Buck Shlegeris: For what it’s worth, I think there are coordination problems here and I feel like my previous argument about why coordination is hard and won’t happen by default also probably applies to us not being okay. I’m not sure how this all plays out. I’d have to think about it more.

Rohin Shah: Yeah. I think it’s more like this is a subtle and non-obvious problem, which by hypothesis doesn’t happen in the systems you actually have and only happens later and those are the sorts of problems I’m like, Ooh, not sure if we can deal with those ones, but I agree that there’s a good chance that there’s just not a problem at all in the world where we already have universality and checked all the obvious stuff.

Buck Shlegeris: Yeah. I would like to say universality is one of my other favorite areas of AI alignment research, in terms of how happy I’d be if it worked out really well.

Lucas Perry: All right, so let’s see if we can slightly pick up the pace here. Moving forward and starting with interpretability.

Rohin Shah: Yeah, so I mean I think we’ve basically discussed interpretability already. Universality is a specific kind of interpretability, but the case for interpretability is just like, sure seems like it would be good if you could understand what your AI systems are doing. You could then notice when they’re not aligned, and fix that somehow. It’s a pretty clear cut case for a thing that would be good if we achieved it and it’s still pretty uncertain how likely we are to be able to achieve it.

Lucas Perry: All right, so let’s keep it moving and let’s hit impact regularization now.

Rohin Shah: Yeah, impact regularization in particular is one of the ideas that are not trying to align the AI system but are instead trying to say, well, whatever AI system we build, let’s make sure it doesn’t cause a catastrophe. It doesn’t lead to extinction or existential risk. What it hopes to do is say, all right, AI system, do whatever it is you wanted to do. I don’t care about that. Just make sure that you don’t have a huge impact upon the world.

Whatever you do, keep your impact not too high. There’s been a lot of work on this in recent years: relative reachability, attainable utility preservation, and I think in general the sense is, wow, it’s gone quite a bit further than people expected it to go. I think it definitely does prevent you from doing very, very powerful things; if you wanted to stop all competing AI projects from ever being able to build AGI, that doesn’t seem like the sort of thing you can do with an impact-regularized AI system. But it seems plausible that you could prevent convergent instrumental subgoals using impact regularization. For AI systems that are trying to steal resources and power from humans, you could imagine saying, hey, don’t have that level of impact; you can still have the level of impact of, say, running a company or something like that.
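As a rough illustration of the attainable-utility flavor of impact penalty mentioned here, the sketch below penalizes actions by how much they change the agent’s estimated ability to pursue a set of auxiliary goals. The auxiliary reward functions, the Q-value estimates, and the penalty weight are illustrative assumptions rather than the published algorithm’s exact formulation.

```python
# Schematic sketch of an attainable-utility-style impact penalty.
# All inputs are illustrative placeholders, not the published method's details.

def impact_penalized_reward(task_reward, q_values_before, q_values_after,
                            penalty_weight=1.0):
    """Penalize actions that change the agent's ability to pursue a set of
    auxiliary goals, as a proxy for "don't have a huge impact on the world."

    q_values_before / q_values_after: dicts mapping each auxiliary reward
    function to the agent's estimated attainable value before and after
    taking the action under consideration.
    """
    impact = sum(abs(q_values_after[r] - q_values_before[r])
                 for r in q_values_before)
    return task_reward - penalty_weight * impact
```

The intended effect is that convergent instrumental moves, such as seizing resources or disabling oversight, change attainable utility for almost every auxiliary goal and so incur a large penalty, while routine actions mostly do not.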

Buck Shlegeris: My take on all this is that I’m pretty pessimistic about all of it working. I think that impact regularization or whatever is a non-optimal point on the capabilities / alignment trade off, in terms of the safety you’re getting for how much capability you’re sacrificing. My basic problem here is analogous to my problem with value learning, where I think we’re trying to take these essentially fuzzy concepts and then factor our agent through these fuzzy concepts like impact. The thing that I imagine happening is that any impact regularization strategy you try to employ, if your AI is usable, will end up not helping with its alignment. For any definition of impact you come up with, it’ll end up doing something which gets around that, or it’ll make your AI system completely useless, is my basic guess as to what happens.

Rohin Shah: Yeah, so I think again in this setting, if you formalize it and then consider the optimal agent, yeah, that can totally get around your impact penalty. But in practice it sure seems like what you want to do is say: this convergent instrumental subgoal stuff, don’t do any of that; continue to do things that are normal in regular life. And those seem like pretty distinct categories, such that I would not be shocked if we could actually distinguish between the two.

Buck Shlegeris: It sounds like the main benefit you’re going for is trying to make your AI system not do insane, convergent-instrumental-subgoal-style stuff. So another approach I can imagine taking here would be some kind of value learning or something, where you’re asking humans for feedback on whether plans are insanely convergent-instrumental-subgoal-style, and just not doing the things that humans rate as sufficiently sketchy when asked. That seems like about as good a plan. I’m curious what you think.

Rohin Shah: The idea of power as your attainable utility across a wide variety of utility functions seems like a pretty good formalization to me. I think in the worlds where I actually buy a formalization, I tend to expect the formalization to work better. I do think the formalization is not perfect. Most notably with the current formalization of power, your power never changes if you have extremely good beliefs. Your notion, you’re just like, I always have the same power because I’m always able to do the same things and you never get surprised, so maybe I agree with you because I think the current formalization is not good enough.  (The strike through section has been redacted by Rohin. It’s incorrect and you can see why here.) Yeah, I think I agree with you but I could see it going either way.

Buck Shlegeris: I could be totally wrong about this, and correct me if I’m wrong, but my sense is that you have to be able to back out the agent’s utility function or its models of the world. Which seems like it’s assuming a particular path for AI development which doesn’t seem to me particularly likely.

Rohin Shah: I definitely agree with that for all the current methods too.

Buck Shlegeris: So it’s like: assume that we have already perfectly solved our problems with universality and robustness and transparency and whatever else. I feel like you kind of have to have solved all of those problems before you can do this, and then you don’t need it or something.

Rohin Shah: I don’t think I agree with that. I definitely agree that the current algorithms that people have written assume that you can just make a change to the AI’s utility function. I don’t think that’s what even their proponents would suggest as the actual plan.

Buck Shlegeris: What is the actual plan?

Rohin Shah: I don’t actually know what their actual plan would be, but one plan I could imagine is figure out what exactly the conceptual things we have to do with impact measurement are, and then whatever method we have for building AGI, probably there’s going to be some part which is specify the goal and then in the specify goal part, instead of just saying pursue X, we want to say pursue X without changing your ability to pursue Y, and Z and W, and P, and Q.

Buck Shlegeris: I think that that does not sound like a good plan. I don’t think that we should expect our AI systems to be structured that way in the future.

Rohin Shah: Plausibly we have to do this with natural language or something.

Buck Shlegeris: It seems very likely to me that the thing you do is reinforcement learning where at the start of the episode you get a sentence of English which is telling you what your goal is and then blah, blah, blah, blah, blah, and this seems like a pretty reasonable strategy for making powerful and sort of aligned AI. Aligned enough to be usable for things that aren’t very hard. But you just fundamentally don’t have access to the internal representations that the AI is using for its sense of what belief is, and stuff like that. And that seems like a really big problem.

Rohin Shah: I definitely see this as more of an outer alignment thing, or an easier-to-specify outer alignment type of thing, than, say, IDA and that type of stuff.

Buck Shlegeris: Okay, I guess that makes sense. So we’re just like assuming we’ve solved all the inner alignment problems?

Rohin Shah: In the story so far, yeah. I think all of the researchers who actually work on this haven’t thought much about inner alignment.

Buck Shlegeris: My overall summary is that I really don’t like this plan. I feel like it’s not robust to scale. As you were saying Rohin, if your system gets more and more accurate beliefs, stuff breaks. It just feels like the kind of thing that doesn’t work.

Rohin Shah: I mean, it’s definitely not conceptually neat and elegant in the sense of it’s not attacking the underlying problem. And in a problem setting where you expect adversarial optimization type dynamics, conceptual elegance actually does count for quite a lot in whether or not you believe your solution will work.

Buck Shlegeris: I feel it’s like trying to add edge detectors to your image classifiers to make them more adversarially robust or something, which is backwards.

Rohin Shah: Yeah, I think I agree with that general perspective. I don’t actually know if I’m more optimistic than you. Maybe I just don’t say… Maybe we’d have the same uncertainty distributions and you just say yours more strongly or something.

Lucas Perry: All right, so then let’s just move a little quickly through the next three, which are causal modeling, oracles, and decision theory.

Rohin Shah: Yeah, well, decision theory: MIRI did some work on it, but I am not the person to ask about it, so I’m going to skip that one. Even if you look at the long version, I’m just like, here are some posts. Good luck. For causal modeling, I don’t fully understand what the overall story is, but the actual work that’s been published is basically this: we can take potential plans or training processes for AI systems, write down causal models that tell us how the various pieces of the training setup interact with each other, and then, using algorithms developed for causal models, tell when an AI system would have an incentive to either observe or intervene on an underlying variable.

One thing that came out of this was that you can build a model-based reinforcement learner that doesn’t have any incentive to wirehead, as long as when it makes its plans, the plans are evaluated by the current reward function as opposed to whatever future reward function it would have. And that was explained using this framework of causal modeling. Oracles are basically the idea that we can just train an AI system to answer questions: give it a question and it tries to figure out the best answer it can to that question, prioritizing accuracy.

One worry that people have recently been talking about is that the predictions the Oracle makes then affect the world, which can affect whether or not the prediction was correct. Maybe if I predict that I will go to bed at 11, then I’m more likely to actually go to bed at 11 because I want my prediction to come true, and so Oracles can still “choose” between different self-confirming predictions, which gives them a source of agency. One way that people want to avoid this is using what are called counterfactual Oracles, where you set up the training such that the Oracles are basically making predictions under the assumption that their predictions are not going to influence the future.
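One way to picture the counterfactual Oracle setup described here is the sketch below. The erasure probability and the oracle/world interfaces are hypothetical placeholders for illustration, not a specific published implementation.

```python
# Schematic sketch of a counterfactual-oracle training setup. The erasure
# probability and the oracle/world objects are illustrative placeholders.
import random

def counterfactual_oracle_episode(oracle, question, world, erasure_prob=0.1):
    prediction = oracle.predict(question)
    if random.random() < erasure_prob:
        # Erasure event: the prediction is never revealed to anyone, so it
        # cannot influence the outcome. Only these episodes produce training
        # signal, so the oracle is trained to predict the un-influenced world
        # rather than to pick self-confirming predictions.
        outcome = world.observe_outcome(question)
        oracle.update(question, prediction, outcome)
    else:
        # Normal use: reveal the prediction, but provide no training signal.
        world.reveal(prediction)
    return prediction
```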

Lucas Perry: Yeah, okay. Oracles seem like they just won’t happen. There’ll be incentives to make things other than Oracles, and Oracles would even be able to exert influence upon the world in other ways.

Rohin Shah: Yeah, I think I agree that Oracles do not seem very competitive.

Lucas Perry: Let’s do forecasting now then.

Rohin Shah: So within forecasting, the main sub-areas are these. First, there’s just been a lot of work recently on actually building good forecasting technology. There has been an AI-specific version of Metaculus going for a while now. There’s been some work at the Future of Humanity Institute on building better tools for working with probability distributions and for recording and evaluating forecasts. There was an AI resolution council, where basically you can now make forecasts about what this particular group of people will think in five years or something like that, which is much easier to operationalize than most other kinds of forecasts. So this helps with constructing good questions. On the actual object level, I think there are two main things. One is that it became increasingly obvious in the past two years that AI progress currently is being driven by larger and larger amounts of compute.

It totally could be driven by other things as well, but at the very least, compute is a pretty important factor. The second thing is takeoff speeds. There’s been a long debate in the AI safety community over whether, to take the extremes, we should expect AI capabilities to see a very sharp spike. So initially your AI capabilities are improving by, say, one unit a year; maybe with some improvements that gets to two units a year; and then for whatever reason they’re suddenly at 20 units a year or a hundred units a year, swooping way past what you would get by extrapolating past trends. That’s what we might call a discontinuous takeoff. If you predict that won’t happen, instead you get AI that’s initially improving at one unit per year, then maybe two units per year, then three, then five, with the rate of progress continually increasing. The world still gets very, very crazy, but in a gradual, continuous way. That would be called a continuous takeoff.

Basically, there were two posts that argued pretty forcefully for continuous takeoff back in, I want to say, February of 2018, and they at least made me believe that continuous takeoff was more likely. Sadly, we just haven’t seen much defense of the other side of the view since then, even though we do know that there definitely are people who still believe the other side, that there will be a discontinuous takeoff.

Lucas Perry: Yeah, so what are both of your views on takeoff speeds?

Buck Shlegeris: Here are a couple of things. One is that I really love the operationalization of slow takeoff, or continuous takeoff, that Paul provided in his post, which was one of the ones Rohin was referring to from February 2018. He says, “by slow takeoff, I mean that there is a four year doubling of the economy before there is a one year doubling of the economy.” As in, there’s a period of four years over which world GDP doubles before there is ever a period of one year over which it doubles, as opposed to a situation where the first one-year doubling happens out of nowhere. Currently, doubling times for the economy are on the order of 20 years, so a one-year doubling would be a really big deal. The way I would phrase why we care about this is that worlds where we have widespread, human-level AI feel like they have incredibly fast economic growth. And if it’s true that we expect AI progress to increase gradually and continuously, then one important consequence is that by the time we have human-level AI systems, the world is already totally insane. Even a four-year doubling would be crazy; that would be economic growth drastically higher than economic growth is today.
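
To put rough numbers on those doubling times (a back-of-the-envelope calculation of mine, not anything from Paul’s post beyond the doubling times themselves): a 20-year doubling is about 3.5% annual growth, a 4-year doubling is about 19%, and a 1-year doubling is 100%.

```python
# Annual growth rate implied by an economic doubling time (in years):
# growth = 2 ** (1 / doubling_time) - 1
for doubling_time in (20, 4, 1):
    growth = 2 ** (1 / doubling_time) - 1
    print(f"{doubling_time}-year doubling -> {growth:.1%} annual growth")

# 20-year doubling -> 3.5% annual growth    (roughly today's world economy)
# 4-year doubling  -> 18.9% annual growth   (Paul's "slow takeoff" marker)
# 1-year doubling  -> 100.0% annual growth
```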

This means it would be obvious to everyone who’s paying attention that something is up and the world is radically changing in a rapid fashion. Another way I’ve been thinking about this recently is people talk about transformative AI, by which they mean AI which would have at least as much of an impact on the world as the industrial revolution had. And it seems plausible to me that octopus level AI would be transformative. Like suppose that AI could just never get better than octopus brains. This would be way smaller of a deal than I expect AI to actually be, but it would still be a massive deal, and would still possibly lead to a change in the world that I would call transformative. And if you think this is true, and if you think that we’re going to have octopus level AI before we have human level AI, then you should expect that radical changes that you might call transformative have happened by the time that we get to the AI alignment problems that we’ve been worrying about. And if so, this is really big news.

When I was reading about this stuff at 18, I was casually imagining that the alignment problem is a thing that some people have to solve while they’re building an AGI in their lab, with the rest of the world ignoring them. But if the thing that’s actually happening is that the world is going insane around everyone, that’s a really important difference.

Rohin Shah: I would say that this is probably the most important contested question in AI alignment right now. Some consequences: in a gradual or continuous takeoff world, you expect that by the time we get to systems that can pose an existential risk, you’ve already had pretty smart systems deployed in the real world, and they probably had some failure modes. Whether or not we call them alignment failure modes is maybe not that important. The point is that people will be aware that AI systems can fail in weird ways. Depending on what sorts of failures you expect, you might expect this to lead to more coordination and more involvement in safety work. You might also be more optimistic about testing and engineering styles of approaches to the problem, which rely a bit more on trial-and-error reasoning, because you actually will get a chance to see errors before they happen in a superintelligent, existential-risk-causing regime. There are lots of implications of this form that pretty radically change which alignment plans you think are feasible.

Buck Shlegeris: Also, it pretty radically changes how optimistic you are about this whole AI alignment situation. At the very least, for people who are very optimistic, who think AI alignment poses relatively small amounts of existential risk, a lot of the reason seems to be that they think we’re going to get these warning shots: before we have superintelligent AI, we have sub-human-level AI with alignment failures, like the cashier example Rohin was talking about earlier, and then people start caring about AI alignment a lot more. So optimism is also greatly affected by what you think about this.

I’ve actually been wanting to argue with people about this recently. I wrote a doc last night where I was arguing that even in gradual takeoff worlds, we should expect a reasonably high probability of doom if we can’t solve the AI alignment problem. And I’m interested to have this conversation in more detail with people at some point. But yeah, I agree with what Rohin said.

Overall on takeoff speeds, I guess I still feel pretty uncertain. It seems to me that AI capabilities are currently increasing consistently, and a lot of this comes from applying relatively non-mindblowing algorithmic ideas to larger amounts of compute and data. I would be kind of surprised if you couldn’t basically ride this wave all the way until you have transformative AI. So if I want to argue that we’re going to have a fast takeoff, I kind of have to argue that there’s some other approach you can take which lets you build AI without going along that slow path, and which also happens first. And I guess I think it’s kind of plausible that that is what’s going to happen. I think that’s what you’d have to argue for if you want to argue for a fast takeoff.

Rohin Shah: That all seems right to me. I’d be surprised if, out of nowhere, we saw a new AI approach suddenly start working and overtake deep learning. You also have to argue that it then very quickly reaches human-level AI, which would be quite surprising, right? In some sense, it would have to be something completely novel that we failed to think of in the last 60 years. The counterpoint is that we’re putting in way more effort now than we were over the last 60 years, but then the counter-counterpoint is that all of that extra effort is going straight into deep learning; it’s not really searching for completely new, paradigm-shifting ways to get to AGI.

Buck Shlegeris: So here’s how I’d make that argument. Perhaps a really important input into a field like AI is the number of really smart kids who have wanted to be AI researchers since they were 16 because they thought it was the most important thing in the world. I think that in physics, a lot of the people who turn into physicists have actually wanted to be physicists forever. The number of really smart kids who have wanted to be AI researchers forever has possibly gone up by a factor of 10 over the last 10 years; it might even be more. And there are problems, sometimes, that are bottlenecked on that kind of thing. So it wouldn’t be totally shocking to me if, as a result of this particular input to AI radically increasing, we end up in kind of a different situation. I haven’t quite thought through this argument fully.

Rohin Shah: Yeah, the argument seems plausible. There’s a large space of arguments like this. But even after that, I start questioning, “Okay, we get a new paradigm. Don’t the same arguments apply to that paradigm?” Not as strongly; I guess not the arguments you were making about compute going up over time, but the arguments given in the original slow takeoff posts, which were that people quickly take the low-hanging fruit and then move on. When there’s a lot of effort being put into getting some property, you should expect that the easy low-hanging fruit is usually already taken, and that’s why you don’t expect discontinuities. Unless the new idea immediately rockets you to human-level AGI, or x-risk-causing AGI, I think the same argument would pretty quickly start applying to it as well.

Buck Shlegeris: I think it’s plausible that you do get rocketed pretty quickly to human-level AI. And I agree that this is an insane sounding claim.

Rohin Shah: Great. As long as we agree on that.

Buck Shlegeris: Something which has been on my to-do list for a while, and something I’ve been doing a bit of and would be excited for someone else to do more of, is reading the history of science and getting more of a sense of what kinds of things are bottlenecked by what, and where. That could lead me to be a bit less confused about a bunch of this stuff. AI Impacts has done a lot of great work cataloging things that turned out not to be discontinuous changes, which certainly is strong evidence against my claim here.

Lucas Perry: All right. What is the probability of AI-induced existential risk?

Rohin Shah: Unconditional on anything? I might give it 1 in 20. 5%.

Buck Shlegeris: I’d give 50%.

Rohin Shah: I had a conversation with AI Impacts that went into this in more detail, and I’m partially just anchoring on the number I gave there, which was 10% conditional on no intervention from longtermists. I think the broad argument is really just the one that Buck and I were disagreeing about earlier, which is: to what extent will society be incentivized to solve the problem? There’s some chance that the first thing we try just works and we don’t even need to solve any sort of alignment problem. It might just be fine. This is not implausible to me; maybe that’s 30% or something.

Most of the remaining probability comes from, “Okay, the alignment problem is a real problem and we need to deal with it.” It might be very easy, in which case we can just solve it straight away. That might be the case, though it doesn’t seem that likely to me if it was a problem at all. But what we will get is a lot of these warning shots, and people understanding the risks a lot more as we get more powerful AI systems. This estimate is also conditional on gradual takeoff. I keep forgetting to say that, mostly because I don’t know what probability I should put on discontinuous takeoff.

Lucas Perry: So is it 5% with longtermist intervention, increasing to 10% if there’s a fast takeoff?

Rohin Shah: Yes, but still with longtermist intervention. I’m pretty pessimistic conditional on fast takeoff, but the probability I assign to fast takeoff is not very high. In a gradual takeoff world, you get a lot of warning shots. There will just generally be awareness of the fact that the alignment problem is a real thing, and you won’t have the situation you have right now, where people say that worrying about superintelligent AI systems not doing what we want is totally bullshit. That won’t be a thing; almost no one will be saying that anymore, in the version of the world where we’re right and there is a problem. As a result, people will not want to build AI systems that are going to kill them. People tend to be pretty risk-averse, in my estimation of the world, which Buck will probably disagree with. And as a result, you’ll get a lot of people trying to actually work on solving the alignment problem. There’ll be some amount of global coordination, which will give us more time to solve the alignment problem than we may otherwise have had. And together, these forces mean that probably we’ll be okay.

Buck Shlegeris: So I think my disagreements with Rohin are basically that I think fast takeoffs are more likely, I basically think there is almost surely a problem, I think that alignment might be difficult, and I’m more pessimistic about coordination. I know I said four things there, but I actually think of this as three disagreements: I want to say that “there isn’t actually a problem” is just a special case of “alignment is really easy to solve.” So there are three disagreements. One is gradual takeoff, another is the difficulty of solving competitive prosaic alignment, and another is how good we are at coordination.

I haven’t actually written down these numbers since I last changed my mind about a lot of the inputs to them, so maybe I’m being really dumb. It feels to me that in fast takeoff worlds, we are very sad unless we have competitive alignment techniques, so then we’re only okay if we have those competitive alignment techniques. I would say that I’m something like 30% on us having good competitive alignment techniques by the time it’s important, which incidentally is higher than Rohin’s number, I think.

Rohin Shah: Yeah, 30% is totally within my 25th-to-75th-percentile interval on that probability, which is a weird thing to be reporting. 30% might be my median, I don’t know.

Buck Shlegeris: To be clear, I’m not just including the outer alignment portion here, which is what we were talking about before with IDA. I’m also including inner alignment.

Rohin Shah: Yeah, 30% does seem a bit high. I think I’m a little more pessimistic.

Buck Shlegeris: So I’m like 30% that we can just solve the AI alignment problem in this excellent way, such that anyone who wants to can, at very little extra cost, make AI systems that are aligned. I feel like in worlds where we did that, it’s pretty likely that things are reasonably okay. I think that gradual versus fast takeoff isn’t actually enormously much of a crux for me, because I feel like in worlds with gradual takeoff but without competitive alignment techniques, we still have a very high probability of doom. And I think that comes down to disagreements about coordination. So maybe the main important disagreement between Rohin and me is actually how well we’ll be able to coordinate, or how strong individual incentives for alignment will be.

Rohin Shah: I think there are other things too. The reason I feel a bit more pessimistic than you in the fast takeoff world is that solving problems in advance is really quite difficult, and I really value the ability to test techniques on actual AI systems. You’ll have to work with less powerful things, and at some point you do have to make the jump to more powerful things. But still, being able to test on the less powerful things is so good; you get so much safety from that.

Buck Shlegeris: It’s not actually clear to me that you get to test the most important parts of your safety techniques. I think that there are a bunch of safety problems that just do not occur on dog-level AIs, and do occur on human-level AI. Say there are three levels of AI: a thing which is as powerful as a dog, a thing which is as powerful as a human, and a thing which is as powerful as a thousand John von Neumanns. In a gradual takeoff world, you have a bunch of time between these milestones, maybe. It’s not super clear to me that you can use results on less powerful systems as that much evidence about whether your safety techniques work on drastically more powerful systems. It’s definitely somewhat helpful.

Rohin Shah: It depends on what you condition on in your comparison between continuous takeoff and discontinuous takeoff to say which one happens faster. I guess the delta between dog and human is definitely longer in a gradual takeoff, for sure. Okay, if that’s what you were saying, yep, I agree with that.

Buck Shlegeris: Yeah, sorry, that’s all I meant.

Rohin Shah: Cool. One thing I wanted to ask: when you say dog-level AI, do you mean something like a neural net that, if put in a dog’s body replacing its brain, would do about as well as a dog? Because such a neural net could then be put in other environments and learn to become really good at other things, probably superhuman at many things that weren’t in the ancestral environment. Do you mean that sort of thing?

Buck Shlegeris: Yeah, that’s what I mean. Dog-level AI would probably be much better than GPT-2 at answering questions. I’m going to define something as dog-level AI if it’s about as good as a dog at things which I think dogs are pretty heavily optimized for, like visual processing or motor control in novel scenarios, things like that, that I think dogs are pretty good at.

Rohin Shah: Makes sense. So I think in that case, plausibly, dog-level AI already poses an existential risk. I can believe that too.

Buck Shlegeris: Yeah.

Rohin Shah: The AI cashier example feels like it could totally happen probably before a dog-level AI. You’ve got all of the motivation problems already at that point of the game, and I don’t know what problems you expect to see beyond then.

Buck Shlegeris: I’m talking more about whether you can test your solutions. I’m not quite sure how to articulate my intuitions here. I feel like there are various strategies which work for corralling dogs and which don’t work for making humans do what you want. Inasmuch as your alignment strategy is aimed at a flavor of problem that only occurs when you have superhuman things, you don’t get to test it either way. I don’t think this is a super important point unless you think it is. I feel good about moving on from here.

Rohin Shah: Mm-hmm (affirmative). Sounds good to me.

Lucas Perry: Okay, we’ve talked about what you guys have called gradual and fast takeoff scenarios, or continuous and discontinuous. Could you each put some probabilities on the likelihood of fast and slow takeoff scenarios, and share the stories you have in your head for them?

Rohin Shah: That is a hard question. There are two sorts of reasoning I do about probabilities. One is to use my internal simulation of whatever I’m trying to predict: simulate what it looks like by my own models, ask how likely it is, at what point I would be willing to bet on it, stuff like that. And then there’s a separate extra step where I ask, “What do other people think about this? Oh, a lot of people think this thing that I assigned one percent probability to is very likely. Hmm, I should probably not be saying one percent then.” I don’t know how to do that second part for, well, most things, but especially in this setting. So I’m going to just report Rohin’s model only, which will predictably understate the probability of fast takeoff, in the sense that if someone from MIRI were to talk to me for five hours, I would probably say a higher number for the probability of fast takeoff afterwards, and I know that that’s going to happen. I’m just going to ignore that fact and report my own model anyway.

On my own model, it’s something like this: in worlds where AGI happens soon, in the next couple of decades, I’m at something like 95% on gradual takeoff. If it’s further away, three to five decades, then some things could have changed by then, so maybe I’m at 80%. And if it’s way off into the future, centuries, then maybe 70% or 65%. The reason it goes down over time is just that it seems to me that if you want to argue for discontinuous takeoff, you need to posit some paradigm change in how AI progress happens, and that seems more likely the further into the future you go.

Buck Shlegeris: I feel kind of surprised that you get so low, like 65% or 70%. I would have thought that those arguments are a strong default, and that maybe at the moment we’re in a position that seems particularly gradual-takeoff-y, but I would have thought that over time you’d only get down to 80% or something.

Rohin Shah: Yeah. Maybe my internal model is like, “Holy shit, why do these MIRI people keep saying that discontinuous takeoff is so obvious?” I agree that the arguments in Paul’s posts feel very compelling to me, so maybe I should just be more confident in them. I think saying 80%, even centuries out, is plausibly the correct answer.

Lucas Perry: So, Rohin, is the view here that, since compute is the thing being leveraged to make most AI advances, you would expect that to be the mechanism by which progress continues to happen in the future, and we have some certainty about how compute will continue to change? Whereas things that would lead to a discontinuous takeoff would be world-shattering, fundamental insights into algorithms with powerful recursive self-improvement, which is something you wouldn’t necessarily see if we just keep going down this compute-leveraging route?

Rohin Shah: Yeah, I think that’s a pretty good summary, against the backdrop of the default argument, which is that people are really trying to build AGI. It would be pretty surprising if there were just this really important thing that everyone had missed.

Buck Shlegeris: It sure seems like in machine learning, when I look at the things which have happened over the last 20 years, all of them feel like the ideas were kind of obvious, or someone else had proposed them 20 years earlier. ConvNets were proposed 20 years before they were good on ImageNet, LSTMs came ages before they were good for natural language, and so on and so on. Other subjects are not like this; in physics, sometimes they just messed around for 50 years before they knew what was happening. I don’t know, I feel confused about how to feel about the fact that in some subjects, it seems like they do suddenly get better at things for reasons other than having more compute.

Rohin Shah: I think physics, at least, was often bottlenecked by measurements, I want to say.

Buck Shlegeris: Yes, so this is one reason I’ve been interested in the history of science recently, but there are certainly a bunch of examples. People were interested in chemistry for a long time, and it turns out that chemistry comes from quantum mechanics, and you could, theoretically, have guessed quantum mechanics 70 years earlier than people did if you were smart enough; it’s not that complicated a hypothesis to think of. Or relativity, the classic example of something which could have been invented 50 years earlier. I don’t know, I would love to learn more about this.

Lucas Perry: Just to tie this back to the question, could you give your probabilities as well?

Buck Shlegeris: Oh, geez, I don’t know. Honestly, right now I feel like I’m 70% gradual takeoff or something, but I don’t know. I might change my mind if I think about this for another hour. There are also theoretical arguments for why most takeoffs are gradual, like the stuff in Paul’s post. The easiest summary is: before someone does something really well, someone else does it kind of well, in cases where a lot of people are trying to do the thing.

Lucas Perry: Okay. One facet of this that I haven’t heard discussed is recursive self-improvement, and I’m confused about where that becomes the thing that affects whether takeoff is discontinuous or continuous. If someone does something kind of well before someone does it really well, and recursive self-improvement is a property of the thing being done kind of well, does it just end up self-improving really quickly, or?

Buck Shlegeris: Yeah. I think Paul’s post does a great job of talking about this exact argument. His basic claim, which I find pretty plausible, is that before you have a system which is really good at self-improving, you have a system which is kind of good at self-improving, if it turns out to be really helpful for a system to be good at self-improving. And once that’s true, getting a sharp jump requires positing an additional discontinuity.

Rohin Shah: One other thing I’d note is that humans are totally self-improving. Productivity techniques, for example, are a form of self-improvement. You could imagine that AI systems might have advantages that humans don’t, like being able to read their own weights and edit them directly. How much of an advantage this gives the AI system is unclear. Still, I think I just go back to the argument that Buck already made, which is that at some point you get an AI system that is somewhat good at understanding its weights and figuring out how to edit them, and that happens before you get the really powerful ones. Maybe this is like saying, “Well, you’ll reach human levels of self-improvement by the time you have rat-level AI, or something, instead of human-level AI,” which argues that you’ll hit the hyperbolic part of the curve earlier, but it still looks like a hyperbolic curve that’s continuous at every point.
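
As a toy illustration of that last point (purely illustrative dynamics and numbers of mine, not a forecast or anyone’s model): if the rate of improvement feeds back on current capability, the curve accelerates dramatically, even hyperbolically, yet it stays continuous, so every capability level is preceded shortly before by slightly weaker systems.

```python
# Toy model of recursive self-improvement: the improvement rate itself
# grows with current capability. The resulting curve blows up quickly,
# but it is continuous at every point: each capability level is reached
# only shortly after slightly lower levels.
capability, t, dt = 1.0, 0.0, 0.01
next_report = 2.0
while capability < 1000:
    capability += dt * capability ** 1.1   # feedback: more capable -> faster gains
    t += dt
    if capability >= next_report:
        print(f"t = {t:5.2f}: capability = {capability:7.1f}")
        next_report *= 2
```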

Buck Shlegeris: I agree.

Lucas Perry: I just feel generally surprised that your probabilities favor continuous takeoff scenarios, that takeoff would be slow.

Rohin Shah: The reason I’m trying to avoid the words “slow” and “fast” is that they’re misleading. Slow takeoff is not slow in calendar time relative to fast takeoff. The question is: is there a spike at some point? Some people, upon reading Paul’s posts, are like, “Slow takeoff is faster than fast takeoff.” That’s a reasonably common reaction to it.

Buck Shlegeris: I would put it as: slow takeoff is the claim that things are already insane before you have human-level AI.

Rohin Shah: Yeah.

Lucas Perry: This seems like a helpful perspective shift on this takeoff scenario question. I have not read Paul’s post. What is it called so that we can include it in the page for this podcast?

Rohin Shah: It’s just called Takeoff Speeds. Then the corresponding AI Impacts post is called Will AI See Discontinuous Progress?, I believe.

Lucas Perry: So if each of you guys had a lot more reach and influence and power and resources to bring to the AI alignment problem right now, what would you do?

Rohin Shah: I get this question a lot and my response is always, “Man, I don’t know.” It seems hard to scalably use people right now for AI risk. I can talk about which areas of research I’d like to see more people focus on. If you gave me people where I’m like, “I trust your judgment on your ability to do good conceptual work” or something, where would I put them? I think a lot of it would be on making good robust arguments for AI risk. I don’t think we really have them, which seems like kind of a bad situation to be in. I think I would also invest a lot more in having good introductory materials, like this review, except this review is a little more aimed at people who are already in the field. It is less aimed at people who are trying to enter the field. I think we just have pretty terrible resources for people coming into the field and that should change.

Buck Shlegeris: I think that our resources are way better than they used to be.

Rohin Shah: That seems true.

Buck Shlegeris: In the course of my work, I talk to a lot of people who are new to AI alignment, and I would say that their level of informedness is drastically better now than it was two years ago, a lot of which is due to things like the 80,000 Hours podcast, this podcast, the Alignment Newsletter, and so on. I think we’ve just made it somewhat easier for people to get into everything: the Alignment Forum having its sequences prominently displayed, and so on.

Rohin Shah: Yeah, you named literally all of the things I would have named. Buck definitely has more information on this than I do. I do not work with people who are entering the field as much. I do think we could be substantially better.

Buck Shlegeris: Yes. I feel like I do have access to resources, not directly, but in the sense that I know people at, e.g., Open Philanthropy and the EA Funds, and if I thought there were obvious things they should do, I think it’s pretty likely that those funders would have already made them happen. And I occasionally embark on projects myself that I think are good for AI alignment, mostly on the outreach side; on a few occasions over the last year, I’ve just done projects that I was optimistic about. So I don’t think I can name shovel-ready opportunities for someone else to take on, which is good news, because it’s mostly a sign that these things are already being done.

I am enthusiastic about workshops. I help run, with MIRI, these AI Risks for Computer Scientists workshops, and I ran my own computing workshop with some friends with a similar purpose, aimed at people who are interested in this kind of stuff and who would like to spend some time learning more about it. I feel optimistic about this kind of project as a way of doing the thing Rohin was saying: making it easier for people to start having really deep thoughts about a lot of AI alignment stuff. So that’s a direction of projects that I’m pretty enthusiastic about. A couple of other random AI alignment things I’m optimistic about: I’ve already mentioned that I think there should be an Ought competitor, just because it seems like the kind of thing that more work could go into, and I agree with Rohin that it would be good to have more conceptual analysis of a bunch of this stuff. I’m generically enthusiastic about there being more high-quality research done, and more smart people who’ve thought about this a lot working on it as best they can.

Rohin Shah: I think the actual bottleneck is good research and not necessarily field building, and I’m more optimistic about good research. Specifically, I am particularly interested in universality and interpretability. I would love for there to be some way to give people who work on AI alignment the chance to step back and think about the high-level picture for a while. I don’t know if people don’t do this because they don’t want to or because they don’t feel like they have the affordance to do so, and I would like the affordance to be there. I’d be very interested in people building models of what AGI systems could look like. Expected utility maximizers are one example of a model you could have. Or maybe we just try to redo evolution: we create a very complicated, diverse environment with lots of agents going around, and through their multi-agent interaction they develop general intelligence somehow. I’d be interested for someone to take that scenario, flesh it out more, and then talk about what the alignment problem looks like in that setting.

Buck Shlegeris: I would love to have someone get really knowledgeable about evolutionary biology and try and apply analogies of that to AI alignment. I think that evolutionary biology has lots of smart things to say about what optimizers are and it’d be great to have those insights. I think Eliezer sort of did this many years ago. It would be good for more people to do this in my opinion.

Lucas Perry: All right. We’re in the home stretch here. AI timelines. What do you think about the current state of predictions? There have been surveys done in which most researchers give roughly a 50% probability of human-level AI by about 2050 or so. What are each of your AI timelines? What does your probability distribution look like? What do you think about the state of predictions on this?

Rohin Shah: Haven’t looked at the state of predictions in a while. It depends on who was surveyed. I think most people haven’t thought about it very much and I don’t know if I expect their predictions to be that good, but maybe wisdom of the crowds is a real thing. I don’t think about it very much. I mostly use my inside view and talk to a bunch of people. Maybe, median, 30 years from now, which is 2050. So I guess I agree with them, don’t I? That feels like an accident. The surveys were not an input into this process.

Lucas Perry: Okay, Buck?

Buck Shlegeris: I don’t know what I think my overall timelines are. I think AI in the next 10 or 20 years is pretty plausible. Maybe I want to give it something around 50% which puts my median at around 2040. In terms of the state of things that people have said about AI timelines, I have had some really great conversations with people about their research on AI timelines which hasn’t been published yet. But at some point in the next year, I think it’s pretty likely that much better stuff about AI timelines modeling will have been published than has currently been published, so I’m excited for that.

Lucas Perry: All right. Information hazards. Originally, there seemed to be a lot of worry in the community about information hazards around even talking about superintelligence, and being afraid of talking to anyone in positions of power, whether in private institutions or in government, about the strategic advantage of AI, about how one day it may confer a decisive strategic advantage. The dissonance here for me is that Putin comes out and says that whoever controls AI will control the world. Nick Bostrom published Superintelligence, which basically says what I already said. Max Tegmark’s Life 3.0 basically does too. My initial reaction and intuition is that the cat’s out of the bag. I don’t think that echoing this increases the risk any further than it already is. But maybe you disagree.

Buck Shlegeris: Yeah. So here are two opinions I have about info hazards. One is: how bad is it to say stuff like that all over the internet? My guess is it’s mildly bad, because not everyone thinks those things. Even if you could arrive at those opinions as consequences of reading Superintelligence, most people in fact have not read Superintelligence, and sometimes there are ideas where I just really don’t want them to become crystallized common knowledge. But I think that, to a large extent, assuming gradual takeoff worlds, it kind of doesn’t matter, because AI systems are going to be radically transforming the world inevitably. I guess you can affect how governments think about it, but it’s a bit different there.

The other point I want to make about info hazards is that there are a bunch of trickinesses with AI safety, where thinking about AI safety makes you think about questions about how AI development might go. Thinking about how AI development is going to go occasionally leads you to think about things that could be relevant to capabilities, and I think this makes it hard to do research, because you then get scared about talking about them.

Rohin Shah: So I think my take on this is that info hazards are real, in the sense that there are, in fact, costs to saying and publicizing specific kinds of information. I’ll agree in principle that some kinds of capabilities information have the cost of accelerating timelines. I usually think these costs are pretty strongly outweighed by the benefits, in that it just seems really hard to do any kind of shared intellectual work when you’re constantly worried about what you do and don’t make public. It really seems like if you want to build a shared understanding within the field of AI alignment, that benefit is worth saying things that might be bad in some other ways. This depends on a lot of background facts that I’m not going to cover here; for example, I probably wouldn’t say the same thing about biosecurity.

Lucas Perry: Okay. That makes sense. Thanks for your opinions on this. So at the current state in time, do you guys think that people should be engaging with people in government or in policy spheres on questions of AI alignment?

Rohin Shah: Yes, but not in the sense of “we’re worried about what happens when AGI comes.” Even saying things like “it might be really bad,” as opposed to “it might kill everybody,” seems not great, mostly on the basis of my model of what it takes to get governments to do things: at the very least, you need consensus in the field, so it seems kind of pointless to try right now, and it might even poison the well for future efforts. I think it does make sense to engage with government and policymakers about things that are in fact problems right now. To the extent that you think recommender systems are causing a lot of problems, it makes sense to engage with government about how alignment-like techniques can help with that, especially if you’re doing a bunch of specification-learning-type stuff. That seems like the sort of work that should have relevance today, and I think it would be great if those of us who do specification learning were trying to use it to improve existing systems.

Buck Shlegeris: This isn’t my field. I trust the judgment of a lot of other people. I think that it’s plausible that it’s worth building relationships with governments now, not that I know what I’m talking about. I will note that I basically have only seen people talk about how to do AI governance in the cases where the AI safety problem is 90th percentile easiest. I basically only see people talking about it in the case where the technical safety problem is pretty doable, and this concerns me. I’ve just never seen anyone talk about what you do in a world where you’re as pessimistic as I am, except to completely give up.

Lucas Perry: All right. Wrapping up here, is there anything else that we didn’t talk about that you guys think was important? Or something that we weren’t able to spend enough time on, that you would’ve liked to spend more time on?

Rohin Shah: I do want to eventually continue the conversation with Buck about coordination, but that does seem like it should happen not on this podcast.

Buck Shlegeris: That’s what I was going to say too. Something that I want someone to do is write a trajectory for how AI goes down, that is really specific about what the world GDP is in every one of the years from now until insane intelligence explosion. And just write down what the world is like in each of those years because I don’t know how to write an internally consistent, plausible trajectory. I don’t know how to write even one of those for anything except a ridiculously fast takeoff. And this feels like a real shame.

Rohin Shah: That seems good to me as well. And also the sort of thing that I could not do because I don’t know economics.

Lucas Perry: All right, so let’s wrap up here then. So if listeners are interested in following either of you or seeing more of your blog posts or places where you would recommend they read more materials on AI alignment, where can they do that? We’ll start with you, Buck.

Buck Shlegeris: You can Google me and find my website. I often post things on the Effective Altruism Forum. If you want to talk to me about AI alignment in person, perhaps you should apply to the AI Risks for Computer Scientists workshops run by MIRI.

Lucas Perry: And Rohin?

Rohin Shah: I write the Alignment Newsletter. That’s a thing that you could sign up for. Also on my website, if you Google Rohin Shah Alignment Newsletter, I’m sure I will come up. These are also cross posted to the Alignment Forum, so another thing you can do is go to the Alignment Forum, look up my username and just see things that are there. I don’t know that this is actually the thing that you want to be doing. If you’re new to AI safety and want to learn more about it, I would echo the resources Buck mentioned earlier, which are the 80k podcasts about AI alignment. There are probably on the order of five of these. There’s the Alignment Newsletter. There are the three recommended sequences on the Alignment Forum. Just go to alignmentforum.org and look under recommended sequences. And this podcast, of course.

Lucas Perry: All right. Heroic job, everyone. This is going to be a really good resource, I think. It’s given me a lot of perspective on how thinking has changed over the past year or two.

Buck Shlegeris: And we can listen to it again in a year and see how dumb we are.

Lucas Perry: Yeah. There were lots of predictions and probabilities given today, so it’ll be interesting to see how things are in a year or two from now. That’ll be great. All right, so cool. Thank you both so much for coming on.

End of recorded material

AI Alignment Podcast: On Lethal Autonomous Weapons with Paul Scharre

 Topics discussed in this episode include:

  • What autonomous weapons are and how they may be used
  • The debate around acceptable and unacceptable uses of autonomous weapons
  • Degrees and kinds of ways of integrating human decision making in autonomous weapons 
  • Risks and benefits of autonomous weapons
  • An arms race for autonomous weapons
  • How autonomous weapons issues may matter for AI alignment and long-term AI safety

Timestamps: 

0:00 Intro
3:50 Why care about autonomous weapons?
4:31 What are autonomous weapons?
06:47 What does “autonomy” mean?
09:13 Will we see autonomous weapons in civilian contexts?
11:29 How do we draw lines of acceptable and unacceptable uses of autonomous weapons?
24:34 Defining and exploring human “in the loop,” “on the loop,” and “out of loop”
31:14 The possibility of generating international lethal laws of robotics
36:15 Whether autonomous weapons will sanitize war and psychologically distance humans in detrimental ways
44:57 Are persons studying the psychological aspects of autonomous weapons use?
47:05 Risks of the accidental escalation of war and conflict
52:26 Is there an arms race for autonomous weapons?
01:00:10 Further clarifying what autonomous weapons are
01:05:33 Does the successful regulation of autonomous weapons matter for long-term AI alignment considerations?
01:09:25 Does Paul see AI as an existential risk?

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on YouTube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today’s conversation is with Paul Scharre and explores the issue of lethal autonomous weapons. And so just what is the relation of lethal autonomous weapons and the related policy and governance issues to AI alignment and long-term AI risk? Well there’s a key question to keep in mind throughout this entire conversation and it’s that: if we cannot establish a governance mechanism as a global community on the concept that we should not let AI make the decision to kill, then how can we deal with more subtle near term issues and eventual long term safety issues about AI systems? This question is aimed at exploring the idea that autonomous weapons and their related governance represent a possibly critical first step on the international cooperation and coordination of global AI issues. If we’re committed to developing beneficial AI and eventually beneficial AGI then how important is this first step in AI governance and what precedents and foundations will it lay for future AI efforts and issues? So it’s this perspective that I suggest keeping in mind throughout the conversation. And many thanks to FLI’s Emilia Javorsky for much help on developing the questions for this podcast. 

Paul Scharre is a Senior Fellow and Director of the Technology and National Security Program at the Center for a New American Security. He is the award-winning author of Army of None: Autonomous Weapons and the Future of War, which won the 2019 Colby Award and was named one of Bill Gates’ top five books of 2018.

Mr. Scharre worked in the Office of the Secretary of Defense (OSD) where he played a leading role in establishing policies on unmanned and autonomous systems and emerging weapons technologies. Mr. Scharre led the DoD working group that drafted DoD Directive 3000.09, establishing the Department’s policies on autonomy in weapon systems. Mr. Scharre also led DoD efforts to establish policies on intelligence, surveillance, and reconnaissance (ISR) programs and directed energy technologies. He was involved in the drafting of policy guidance in the 2012 Defense Strategic Guidance, 2010 Quadrennial Defense Review, and Secretary-level planning guidance. His most recent position was Special Assistant to the Under Secretary of Defense for Policy. Prior to joining the Office of the Secretary of Defense, Mr. Scharre served as a special operations reconnaissance team leader in the Army’s 3rd Ranger Battalion and completed multiple tours to Iraq and Afghanistan.

The Future of Life Institute is a non-profit and this podcast is funded and supported by listeners like you. So if you find what we do on this podcast to be important and beneficial, please consider supporting the podcast by donating at futureoflife.org/donate. If you support any other content creators via services like Patreon, consider viewing a regular subscription to FLI in the same light. You can also follow us on your preferred listening platform, like on Apple Podcasts or Spotify, by searching for us directly or following the links on the page for this podcast found in the description.

And with that, here’s my conversation with Paul Scharre. 

All right. So we’re here today to discuss your book, Army of None, and issues related to autonomous weapons in the 21st century. To start things off here, I think we can develop a little bit of the motivations for why this matters. Why should the average person care about the development and deployment of lethal autonomous weapons?

Paul Scharre: I think the most basic reason is that we are all going to live in the world where militaries are deploying these future weapons. Even if you don’t serve in the military, even if you don’t work on issues surrounding, say, conflict, this kind of technology could affect all of us. And so I think we all have a stake in what this future looks like.

Lucas Perry: Let’s clarify a little bit more about what this technology actually looks like, then. For most people who don’t know about lethal autonomous weapons or killer robots, the media often portrays them as a Terminator-like scenario. So could you explain why this is wrong, and what are more accurate ways of communicating with the public about what these weapons are and the unique concerns that they pose?

Paul Scharre: Yes, I mean, the Terminator is the first thing that comes up because it’s such a common pop culture reference. It’s right there in people’s minds. So, for the listeners, go ahead and imagine that humanoid robot in the Terminator, and then just throw that image away, because that’s not what we’re talking about. Let me make a different comparison: self-driving cars. We are seeing right now the evolution of automobiles that with each generation incorporate more autonomous features: parking, intelligent cruise control, automatic braking. These increasingly autonomous features in cars, added every single year, a little more autonomy, a little more autonomy, are taking us at some point in time down the road to fully autonomous cars that would drive themselves. We have something like the Google car, where there’s no steering wheel at all and people are just passengers along for the ride. We’re seeing something very similar happen in the military with each generation of robotic systems. We now have air, ground, and undersea robots deployed all around the world, with over 100 countries and non-state groups around the globe having some form of drones or robotic systems, and with each generation they’re becoming increasingly autonomous.

Now, the issue surrounding autonomous weapons is, what happens when a predator drone has as much autonomy as a self-driving car? What happens when you have a weapon that’s out in the battlefield, and it’s making its own decisions about whom to kill? Is that something that we’re comfortable with? What are the legal and moral and ethical ramifications of this? And the strategic implications? What might they do for the balance of power between nations, or stability among countries? These are really the issues surrounding autonomous weapons, and it’s really about this idea that we might have, at some point of time and perhaps the not very distant future, machines making their own decisions about whom to kill on the battlefield.

Lucas Perry: Could you unpack a little bit more what autonomy really is or means? It seems to me that it’s more like an aggregation of a bunch of different technologies, like computer vision, image recognition, and other kinds of machine learning. So could you develop a little bit more where we are in terms of the various technologies required for autonomy?

Paul Scharre: Yes, so autonomy is not really a technology; it’s an attribute of a machine or of a person. And autonomy is about freedom. It’s the freedom that a machine or a person is given to perform some tasks in some environment for some period of time. As people, we have very little autonomy as children and more autonomy as we grow up, and we have different autonomy in different settings. In some work environments, there might be more constraints put on you, on what things you can and cannot do. It’s also environment-specific and task-specific: you might have autonomy to do certain things, but not other things. It’s the same with machines. We’re ultimately talking about giving freedom to machines to perform certain actions under certain conditions in certain environments.

There are lots of simple forms of autonomy that we interact with all the time and sort of take for granted. A thermostat is a very simple autonomous system: it’s a machine that’s given the freedom to “decide” (let’s put that in air quotes, because we’ll come back to what it means for machines to decide). Basically, the thermostat is given the ability to turn the heat and air conditioning on and off based on certain parameters that a human sets: a desired temperature, or, if you have a programmable thermostat, maybe the desired temperature at certain times of day or days of the week. It’s a very bounded kind of autonomy. And that’s what we’re talking about for any of these machines. We’re not talking about free will, or whether the machine develops consciousness. That’s not a problem today; maybe someday, but certainly not with the machines we’re talking about now. It’s really a question of how much freedom we want to give machines, or in this case weapons operating on the battlefield, to make certain kinds of choices.

Now, we’re still talking about weapons that are designed by people, built by people, launched by people, and put onto the battlefield to perform some mission, but there might be a little bit less human control than there is today. And then there are a whole bunch of questions that come along with that, like: is it going to work? Would it be effective? What happens if there are accidents? Are we comfortable with ceding that degree of control to the machine?

Lucas Perry: You mentioned the application of this kind of technology in the context of battlefields. Is there also consideration and interest in the use of lethal autonomous weapons in civilian contexts?

Paul Scharre: Yes, I mean, I think there’s less energy on that topic. You certainly see less of a pull from the police community. I don’t really run into people in a police or Homeland Security context saying we should be building autonomous weapons, whereas you will hear that from militaries. Oftentimes, groups that are concerned about the humanitarian consequences of autonomous weapons will raise that as a concern. There’s both what militaries might do on the battlefield, and then there’s a concern about proliferation: what happens when the technology proliferates and is being used for internal security issues? It could be a dictator using these kinds of weapons to repress the population. That’s one concern, and I think a very, very valid one. We’ve often seen that one of the last checks against dictators is when they tell their internal security forces to fire on civilians, on their own citizens. There have been instances where the security forces say, “No, we won’t.” That doesn’t always happen. Of course, tragically, sometimes security forces do attack their citizens. We saw in the massacre in Tiananmen Square that Chinese military troops were willing to murder Chinese citizens. But we’ve seen other instances, certainly in the fall of the Eastern Bloc at the end of the Cold War, where the security forces said: these are our friends, these are our family; we’re not going to kill them.

And autonomous weapons could take away one of those checks on dictators. So I think that’s a very valid concern. And there is a more general concern about the proliferation of military technology into policing, even here in America. We’ve seen this in the last 20 years: a lot of military tech ends up being used by police forces in ways that maybe aren’t appropriate. And so that’s, I think, a very valid and legitimate concern: even if this isn’t the intended use, what would that look like, what are the risks that could come with it, and how should we think about those kinds of issues as well?

Lucas Perry: All right. So we’re developing autonomy in systems and there’s concern about how this autonomy will be deployed in context where lethal force or force may be used. So the question then arises and is sort of the question at the heart of lethal autonomous weapons: Where is it that we will draw a line between acceptable and unacceptable uses of artificial intelligence in autonomous weapons or in the military, or in civilian policing? So I’m curious to know how you think about where to draw those lines or that line in particular, and how you would suggest to any possible regulators who might be listening, how to think about and construct lines of acceptable and unacceptable uses of AI.

Paul Scharre: That’s a great question. So let’s take a step back first and talk about what kinds of things would make uses acceptable or unacceptable. Let’s just talk about the military context to bound the problem for a second. In the military context, you have a couple of reasons for drawing lines, if you will. One is legal issues, legal concerns. We have a legal framework for thinking about right and wrong in war; it’s called the laws of war, or international humanitarian law, and it lays out a set of parameters for what is acceptable and what is not. And that’s one of the places where there has been consensus internationally among countries that come together at the United Nations through the Convention on Certain Conventional Weapons, the CCW process, where we’ve had conversations going on about autonomous weapons.

One of the points of consensus among nations is that existing international humanitarian law, or the laws of war, would apply to autonomous weapons, and that for any uses of autonomy in weapons, those weapons have to be used in a manner that complies with the laws of war. Now, that may sound trivial, but it’s a pretty significant point of agreement, and it’s one that places some bounds on things that you can or cannot do. So, for example, one of the baseline principles of the laws of war is the principle of distinction. Military forces cannot intentionally target civilians; they can only intentionally target other military forces. And so any use of force needs to comply with this principle of distinction. Right off the bat, that’s a very important and significant one when it comes to autonomous weapons. If you had a weapon that could not be used in a way that complies with this principle of distinction, it would be illegal under the laws of war and you wouldn’t be able to build it.

And there are other principles as well: principles about proportionality, and ensuring that any collateral damage that affects civilians or civilian infrastructure is not disproportionate to the military necessity of the target that is being attacked. There are principles about avoiding unnecessary suffering of combatants. Respecting anyone who’s rendered out of combat, or the appropriate term is “hors de combat,” who has surrendered or been incapacitated, and not targeting them. So these are very significant rules that any weapon system, autonomous weapon or not, has to comply with, and that any use of any weapon, any use of force, has to comply with. And so that is something that constrains considerably what nations are permitted to do in a lawful fashion. Now, do people break the laws of war? Well, sure, that happens. We’re seeing that happen in Syria today, Bashar al-Assad is murdering civilians, and there are examples of rogue actors and non-state terrorist groups and others that don’t care about respecting the laws of war. But those are very significant bounds.

Now, one could also say that there are more bounds that we should put on autonomous weapons, moral or ethical considerations that exist outside the laws of war, that aren’t written down in a formal way in the laws of war, but they’re still important, and I think those often come to the fore with this topic. And there are other ones that might apply in terms of reasons why we might be concerned about stability among nations. But the laws of war are at least a very valuable starting point for this conversation about what is acceptable and not acceptable. I want to make clear, I’m not saying that the laws of war are insufficient and that we need to go beyond them and add in additional constraints. I’m actually not saying that. There are people that make that argument, and I want to give credit to their argument and not pretend it doesn’t exist. I want the listeners to sort of understand the full scope of arguments about this technology. But I’m not saying myself that’s the case necessarily. But I do think that there are concerns that people raise.

For example, people might say it’s wrong for a machine to decide whom to kill, it’s wrong for a machine to make the decision about life and death. Now I think that’s an interesting argument. Why? Why is it wrong? Is it because we think the machine might get the answer wrong, that it might not perform as well as humans? Or is it because we think that there’s something intrinsic about weighing the value of life and death that we want humans to do, appreciating the value of another person’s life before making one of these decisions? Those are all very valid counterarguments that exist in this space.

Lucas Perry: Yes. So thanks for clarifying that. For listeners, it’s important to clarify the distinction here: some people, you’re saying, would find the laws of war to be sufficient in the case of autonomous weapons, and some would not.

Paul Scharre: Yes, I mean, this is a hotly debated issue. This is in many ways the crux of the issue surrounding autonomous weapons. I’m going to oversimplify a bit because you have a variety of different views on this, but you certainly have some people whose view is: look, we have a set of structures called the laws of war that tell us what right and wrong look like in war. And most of the things that people are worried about are already prohibited under the laws of war. So for example, if what you’re worried about is autonomous weapons running amok murdering civilians, that’s illegal under the laws of war. And so one of the points of pushback that you’ll sometimes get from governments or others to the idea of creating an ad hoc treaty that would ban autonomous weapons, or some class of autonomous weapons, is: look, some of the things people worry about are already prohibited under the laws of war, and passing another law to say the thing that’s already illegal is now illegal again doesn’t add any value.

There’s a group of arguments that says the laws of war dictate effects in the battlefield. So they dictate sort of what the end effect is, they don’t really dictate the process. And there’s a line of reasoning that says, that’s fine. The process doesn’t matter. If someday we could use autonomous weapons in a way that was more humane and more precise than people, then we should use them. And just the same way that self-driving cars will someday save lives on roads by avoiding accidents, maybe we could build autonomous weapons that would avoid mistakes in war and avoid accidentally targeting civilians, and therefore we should use them, and let’s just focus on complying better with the laws of war. That’s one school of thought.

Then there’s a whole bunch of reasons why you might say, well, that’s not enough. One reason might be, well, militaries’ compliance with the laws of war isn’t that great. Actually, people talk a good game, but when you look at military practice, especially if the rules for using a weapon are kind of convoluted and you have to take a bunch of additional steps in order to use it in a way that’s lawful, that kind of goes out the window in conflict. A real-world and tragic historical example of this was experienced throughout the 20th century with landmines, where landmines were permitted to be used lawfully, and still are if you’re not a signatory to the Ottawa Convention, provided you put in place a whole bunch of procedures to make sure that minefields are marked and the locations of minefields are known, so they can be demined after conflict.

Now, in practice, countries weren’t doing this. I mean, many of them were just scattering mines from the air. And so we had this horrific problem of millions of mines around the globe persisting after a conflict. The response was basically this global movement to ban mines entirely, to say, look, it’s not that it’s inconceivable to use mines in a way that’s humane, but it requires a whole bunch of additional efforts that countries aren’t making, and so we have to take this weapon away from countries because they are not actually using it in a way that’s responsible. That’s a school of thought with autonomous weapons: look, maybe you can conjure up thought experiments about how you can use autonomous weapons in these very specific instances and it’s acceptable, but once you start any use, it’s a slippery slope, and the next thing you know, it’ll be just like landmines all over again, and they’ll be everywhere and civilians will be being killed. And so the better thing to do is to just not let this process even start, and not let militaries have access to the technology, because they won’t use it responsibly regardless of whether it’s theoretically possible. That’s a pretty reasonable and defensible argument. And there are other arguments too.

One could say, actually, it’s not just about avoiding civilian harm, but there’s something intrinsic about weighing the value of an enemy soldier’s life that we want humans involved in that process. And that if we took humans away from that process, we’d be losing something that, sure, maybe it’s not written down in the laws of war, but maybe it’s not written down because it was always implicit that humans would always be making these choices. And now that the decision is in front of us, we should write this down: that humans should be involved in these decisions and should be weighing the value of a human life, even an enemy soldier’s. Because if we give that up, we might give up something that is a constraint on violence in war that holds back some of the worst excesses of violence, and we might even give up something about ourselves. And this is, I think, a really tricky issue, because there’s a cost to humans making these decisions. It’s a very real cost. It’s a cost in post-traumatic stress that soldiers face, and moral injury. It’s a cost in lives that are ruined, not just the people that are killed in a battlefield, but the people who have to live with that violence afterwards, and the ramifications, and even the choices that they themselves make. It’s a cost in suicides of veterans, and substance abuse and destroyed families and lives.

And so to say that we want humans to stay involved and to remain morally responsible for killing is to say, I’m choosing that cost. I’m choosing to absorb and acknowledge and take on the cost of post-traumatic stress and moral injury, and also the burdens that come with war. And I think it’s worth reflecting on the fact that the burdens of war are distributed very unequally, not just between combatants, but also within the societies that fight. As a democratic nation in the United States, we make a decision as a country to go to war, through our elected representatives. And yet, it’s a very tiny slice of the population that bears the burden for that war, not just putting themselves at risk, but also carrying the moral burden of that afterwards.

And so if you say, well, I want there to be someone who’s going to live with that trauma for the rest of their life, I think that’s an argument that one can make, but you need to acknowledge that that’s real. And that’s not a burden that we all share equally, it’s a burden we’re placing on the young women and men that we send off to fight on our behalf. The flip side is, if we didn’t do that, if we fought a war and no one felt the moral burden of killing, no one slept uneasy at night afterwards, what would that say about us as a society? I think these are difficult questions. I don’t have easy answers to that. But I think these are challenging things for us to wrestle with.

Lucas Perry: Yes, I mean, there’s a lot there. I think that was a really good illustration of the different points of view on this. I hadn’t heard or considered much the implications of post-traumatic stress and, I think you called it, moral burden, as factors that autonomous weapons would relieve in countries which have the power to develop them. Speaking personally, I think I find most compelling the arguments about the necessity of having human beings integrated in the process of decision making with regards to killing, because if you remove that, then you’re removing a deep aspect of humanity, which sometimes does not follow the laws of war; we currently don’t have complex enough preference learning and machine learning techniques to actually train autonomous weapon systems in everything that human beings value and care about, and there are situations where deviating from the laws of war may be the best thing to do. I’m not sure if you have any thoughts about this, but I think you did a good job of illustrating all the different positions, and that’s just my initial reaction to it.

Paul Scharre: Yes, these are tricky issues. And so I think one of the things I want to try to do for listeners is try to lay out the landscape of what these arguments are, and some of the pros and cons of them, because I think they often get oversimplified on all sides. People will say, well, we should have humans involved in making these decisions. Well, humans involved where? If I get into a self-driving car that has no steering wheel, it’s not true that there’s no human involvement. The type of human involvement has just changed in terms of where it exists. So now, instead of manually driving the car, I’m still choosing the car’s destination, I’m still telling the car where I want to go. I’m not getting into the car and saying, car, take me wherever you want to go. So the type of human involvement has changed.

So what kind of human relationship do we want with decisions about life and death in the battlefield? What type of human involvement is right or necessary or appropriate, and for what reason? For a legal reason, for a moral reason? These are interesting challenges we haven’t had to confront before. These arguments I think unfairly get simplified on all sides. Conversely, you hear people say things like, it doesn’t matter, because these weapons are going to get built anyway. That’s a little bit overly simplistic in the sense that there are examples of successes in arms control. It’s hard to pull off. There are many examples of failures as well, but there are places where civilized nations have walked back from some technologies to varying degrees of success, whether it’s chemical weapons or biological weapons or other things. So what does success look like in constraining a weapon? Is it that no one ever uses the weapon? That most nations don’t use it? That it’s not used in certain ways? These are complicated issues.

Lucas Perry: Right. So let’s talk a little bit here about integrating human emotion and human reasoning and humanity itself into the autonomous weapon systems and the life or death decisions that they will be making. So hitting on a few concepts here, if you could help explain what people mean when they say human in the loop, and human on the loop, and how this relates to the integration of human control and human responsibility and human accountability in the use of autonomous weapons.

Paul Scharre: Let’s unpack some of this terminology. Broadly speaking, people tend to use the terms human in the loop, on the loop, or out of the loop. In a semi-autonomous weapon, the human is in the loop, which means that for any semi-autonomous process or system, the machine is taking an action and then it pauses and waits for the human to take a positive action before proceeding. A good example of a human in the loop system is the automated backups on your computer when they require you to push a button to say okay to do the backup now. They’re waiting on human action before proceeding. A human on the loop system, or one under supervisory control, is one where the human doesn’t have to take any positive action for the system to proceed. The human can intervene, so the human can sit back, and if you want to, you can jump in.

An example of this might be your thermostat. When you’re in the house, you’ve already set the parameters, and it’ll turn the heat and air conditioning on on its own, but if you’re not happy with the outcome, you can change it. Now, when you’re out of the house, your thermostat is operating in a fully autonomous fashion; in this respect the human is out of the loop. You don’t have any ability to intervene for some period of time. It’s really all about time duration. For supervisory control, how much time does the human have to identify that something is wrong and then intervene? So for example, things like the Tesla autopilot. That’s one where the human is in a supervisory control capacity. With the autopilot function in a car, the human doesn’t have to do anything, the car’s driving itself, but they can intervene.

The problem with some of those control architectures is whether the time that you are permitting people to identify that there’s a problem, figure out what’s going on, decide to take action, and intervene is really realistic before harm happens. Is it realistic that a human can be not paying attention, and then all of a sudden identify that the car is in trouble and leap into action to avoid an accident when you’re speeding on the highway at 70 miles an hour? And you can see quite clearly, in a number of fatal accidents with these autopilots, that that’s not feasible. People actually aren’t capable of doing that. So you’ve got to think about sort of what is the role of the human in this process. It’s not just a semi-autonomous or supervised autonomous or fully autonomous process; it’s one where the human is involved in some varying capacity.

And what are we expecting the human to do? Same thing with something that’s fully autonomous. We’re talking about a system that’s operating on its own for some period of time. How long before it checks back in with a person? What information is that person given? And what is their capacity to intervene, or how badly could things go wrong when the person is not involved? And when we talk about weapons specifically, there are lots of weapons that operate in a semi-autonomous fashion today where the human is choosing the target, but there’s a lot of automation in IDing targets, presenting information to people, and in actually carrying out an attack once the human has chosen a target. There are many, many weapons that are what the military calls fire-and-forget weapons; once it’s launched, it’s not coming back. Those have been widely used for 70 years, since World War Two. So that’s not new.

There are a whole bunch of weapons that operate in a supervisory autonomy mode, where the human is on the loop. These are generally used in a more limited fashion for immediate, localized defense of air bases or ships or ground vehicles, defending against air or missile or rocket attack, particularly when the speed of these attacks might overwhelm people’s ability to respond. For humans to be in the loop, for humans to push a button every time there’s a missile coming in… you could have so many missiles coming in so fast that you have to simply activate an automatic defensive mode that will shoot down all of the missiles based on some pre-programmed parameters that humans put into the system. This exists today. These systems have been around for decades, since the 1980s, and they’re in widespread use by at least 30 countries around the globe. So that’s a type of weapon system that’s already in operation: these supervisory autonomous weapons. What really would be new would be fully autonomous weapons that operate on their own. Humans are still building them and launching them and putting them into operation, but then there’s some period of time where they’re able to search a target area for targets, find those targets, and then, based on some programming that was designed by people, identify the targets and attack them on their own.

Lucas Perry: Would you consider that out of the loop for that period of time?

Paul Scharre: Exactly. So over that period of time, humans are out of the loop on the decision over which targets they’re attacking. That would be potentially largely a new development in war. There are some isolated cases of weapon systems that cross this line, but by and large that would be new. That’s at least the starting point of what people might be concerned about. Now, you might envision things that are more advanced beyond that, but that’s sort of the near-term development that could be on the horizon in the next five to 15 years: telling the weapon system, go into this area, fly around or search around underwater, find any ships of this type, and attack them over some period of time and space. And that changes the human’s relationship with the use of force a little bit. It doesn’t mean the human’s not involved at all, but the human’s not quite as involved as they used to be. And is that something we’re comfortable with? And what are the implications of that kind of shift in warfare?

Lucas Perry: So the relevant things here are how this helps to integrate human control and human responsibility and human accountability into autonomous weapons systems. And just hearing you speak about all of that, it also seems like very relevant questions have to do with human psychology, about what human beings are actually likely to be able to do. And then also, I think you articulately put the practical question of whether or not people will be able to react to certain threats given certain situations. So in terms of trying to understand acceptable and unacceptable uses of autonomous weapons, that seems to supervene upon a lot of these facets of benefits and disadvantages of in the loop, on the loop, and out of the loop for different situations and different risks, plus how much we’re willing to automate killing and death and remove human decision making from some of these situations or not.

Paul Scharre: Yes, I mean, I think what’s challenging in this space is that it would be nice, it would be ideal, if we could sort of reach agreement among nations for lethal laws of robotics. In Isaac Asimov’s books about robots you have these three laws of robotics. Well, those laws aren’t going to work, because one of them is not harming a human being, and that’s not going to work in the military context. But could there be some agreement among countries for lethal laws of robots that would govern the behavior of autonomous systems in war, and that might sort of say, these are the things that are acceptable or not? Maybe. Maybe that’s possible someday. I think we’re not there yet; at least, there’s certainly not agreement, there’s widespread disagreement among nations about what approach to take. But a good starting position is trying to understand what are the goals we want to achieve. And I think you’re right that we need to keep the human sort of front and center. But I think there is a really important asymmetry between humans and machines that’s worth highlighting, which is to say that the laws of war govern effects in the battlefield. In that sense, the laws of war don’t say the human has to pick every target; the laws of war say that the use of force must be executed according to certain principles of distinction and proportionality and other things.

One important asymmetry in the laws of war, however, is that machines are not legal agents. Only humans are legal agents. And so it’s ultimately humans that are responsible for complying with the laws of war. You can’t put a machine on trial for a war crime. It doesn’t make sense. It doesn’t have intentionality. So it’s ultimately a human responsibility to ensure compliance with the laws of war. It’s a good starting point then for the conversation to try to understand: if we start from the proposition that it’s a human responsibility to ensure compliance with the laws of war, then what follows from that? What bounds does that place on human involvement? One of the early parts of the conversations on autonomous weapons internationally came from a very technologically based direction: to say, well, based on the technology, draw these lines, put these limits in place. The problem with that approach is not that you can’t do it.

The problem is: the state of the technology when? In 2014, when discussions on autonomous weapons started, at the very beginning of the deep learning revolution? Today, in 2020? Our estimate of where the technology might be in five years or 10 years or 50 years? The technology is moving so quickly that any technologically based set of rules about how we should approach this problem, and about the appropriate use of machines versus human decision making in the use of force, is one that we may look back on in 10 years or 20 years and say was wrong. We could get it wrong in the sense that we might be leaving valuable technological opportunities on the table, banning technology that, if we used it, actually might make war more humane and reduce civilian casualties, or we might be permitting technologies that turn out in retrospect to be problematic, and we shouldn’t have done that.

And one of the things we’ve seen historically when you look at attempts to ban weapons is that ones that are technologically based don’t always fare very well over time. So for example, the early bans on poison gas banned the use of poison gas launched from artillery shells. They actually allowed poison gas administered via canisters, and so the first use of poison gas in World War One by the Germans was canister based: they just laid out little canisters and then opened the valves. Now, that turns out to be a not very practical way of using poison gas in war, because you have someone, basically on your own side, standing over this canister, opening a valve and then getting gassed. And so it’s a little bit tricky, but technically permissible.

One of the things that can be challenging is it’s hard to foresee how the technology is going to evolve. A better approach, and one that we’ve seen the international dialogue sort of shift towards, is a human-centered approach. To start from the position of the human and say, look, if we had all the technology in the world in war, what decisions would we want humans to make and why? Not because the technology cannot make those decisions, but because it should not. I think that’s actually a very valuable starting place for the conversation, because the technology is moving so quickly.

What role do we want humans to play in warfare, and why do we think this is the case? Are there some tasks in war, or some decisions, that we think are fundamentally human, that should be decisions only humans make and that we shouldn’t hand off to machines? I think that’s a really valuable starting position from which to better interrogate how we want to use this technology going forward. Because the landscape of technological opportunity is going to keep expanding. And so what do we want to do with this technology? How do we want to use it? And are there ways that we can use this technology that keep humans in control of the use of force in the battlefield, that keep humans legally and morally and ethically responsible, but that may make war more humane, may make war more precise, and may reduce civilian casualties without losing our humanity in the process?

Lucas Perry: So I guess the thought experiment there would be: if we had weapons that let us just delete people instantly without consequences, how would we want human decision making to be integrated with that? Reflecting on that also makes me consider this other point that I think is important for my considerations around lethal autonomous weapons, which is the necessity of integrating human experience of the consequences of war, the pain and the suffering and the carnage and the PTSD, as almost necessary vehicles, to some extent, to make us tired of it, to internalize how horrible it is. So I guess I would just be interested in integrating that perspective, so that it’s not just about humans making decisions and the decisions being integrated in the execution process, but also about the experiential ramifications of being in relation to what actually happens in war, what violence is like, and what happens in violence.

Paul Scharre: Well, I think we want to unpack a little bit some of the things you’re talking about. Are we talking about ensuring that there is an accurate representation, to the people carrying out the violence, of what’s happening on the other end, that we’re not sanitizing things? I think that’s a fair point. When we begin to put more psychological barriers between the person making the decision and the effects, it might be easier for them to carry out larger scale attacks. Versus actually making war more horrible: now that’s a line of reasoning, I suppose, to say we should make war more horrible so there’ll be less of it. I’m not sure we’d get the outcome that there is less of it. We might just have more horrible war, but that’s a different issue. Those are more difficult questions.

I will say that I often hear philosophers raising things about skin in the game. I rarely hear them being raised by people who have had skin in the game, who have experienced up close and in a personal way the horrors of war. And I’m less convinced that there’s a lot of good that comes from the tragedy of war. I think there’s value in us trying to think about how do we make war less terrible? How do we reduce civilian casualties? How do we have less war? But this often comes up in the context of technologies as: we should somehow put ourselves at risk. No military does that, no military has ever done that in human history. The whole purpose of militaries investing in technology and training is to get an advantage on the adversary. It’s not a fair fight. It’s not supposed to be; it’s not a boxing match. So these are things worth exploring, but we need to come from the standpoint of the reality of what war is, and not from a philosophical exercise about what war might be. We need to deal with the realities of what actually occurs on the battlefield.

Lucas Perry: So I think that’s a really interesting point. And as someone with a background and interest in philosophy, it’s quite funny. So you do have experience in war, right?

Paul Scharre: Yes, I’ve fought in Iraq and Afghanistan.

Lucas Perry: Then it’s interesting to me if you see this distinction between philosophers and people who are actually veterans, who have experienced the violence and carnage and tragedies of war. Is the perspective here that PTSD and the trauma associated with these kinds of experiences are less salient in decreasing people’s willingness or decision to engage in further war? Is that your claim?

Paul Scharre: I don’t know. No, I don’t know. I don’t know the answer to that. That’s a difficult question for political scientists to figure out, about the voting preferences of veterans. All I’m saying is that I hear a lot of claims in this space that I think are often not very well interrogated or not very well explored. And there’s a real price that people pay for being involved. Now, people may want to say that we’re willing to bear that price for some reason; okay, but I think we should acknowledge it.

Lucas Perry: Yeah, that makes sense. I guess the thing that I was pointing at was that it would be psychologically interesting to know whether philosophers, being detached from the experience, don’t actually know about the psychological implications of being involved in horrible war. And if people who are actually veterans disagree with philosophers about the importance of there being skin in the game, if philosophers say that skin in the game reduces willingness to go to war and the claim from veterans is that it wouldn’t actually decrease their willingness to go to war, that seems psychologically very important and relevant. Because there is this concern about how autonomous weapons, and removing human decision making from lethal force, would potentially sanitize war. And so there’s a trade-off between the potential mitigating effects of being involved in war, and the negative effects which are incurred by veterans who would actually have to be exposed to it and bring the trauma back for communities to have a deeper experiential relation with.

Paul Scharre: Yes, and look, we don’t do that, right? We had a whole generation of veterans come back from Vietnam, and did we as a society listen to their stories and understand them? No. I have heard over the years people raise this issue, whether it’s drones or autonomous weapons, this issue of having skin in the game, either physically being at risk or psychologically. And I’ve rarely heard it raised by people who were themselves the ones on the line. People often have very gut, emotional reactions to this topic. And I think that’s valuable, because it’s speaking to something that resonates with people. Whether it’s an emotional reaction opposed to autonomous weapons, which you often get from many people who go, there’s something about this, it doesn’t feel right, I don’t like this idea. Or the opposite reaction, other people saying, “wouldn’t this make war great, it’s more precise and more humane,” to which my reaction is often a little bit like… have you ever interacted with a computer? They break all the time. What are you talking about?

But all of these things, I think, are speaking to instincts that people have about this technology, and it’s worth asking questions to better understand: what is it that we’re reacting to? Is it an assumption about the technology? Is it an assumption about the nature of war? One of the concerns I’ve heard raised is that this will impersonalize war and create more distance between the people doing the killing. If you sort of buy that argument, that impersonal war is a bad thing, then you would say the greatest thing would be deeply personal war, like hand-to-hand combat. It seems to harken back to some glorious age of war when people looked each other in the eye and hacked each other to bits with swords, like real humans. It’s not that that kind of war never occurred in human history. In fact, we’ve had conflicts like that, even in recent memory, that involve hand-to-hand weapons. They tend not to be very humane conflicts. When we see civil violence, when people are murdering each other with machetes or garden tools or other things, it tends to be horrific communal violence, mass atrocities in Rwanda or Cambodia or other places. So I think it’s important to deal with the reality of what war is and not some fantasy.

Lucas Perry: Yes, I think that that makes a lot of sense. It’s really tricky. And the psychology around this I think is difficult and probably not studied enough.

Paul Scharre: There’s real war that occurs in the world, and then there’s the fantasy of war that we, as a society, tell ourselves when we go to movie theaters and we watch stories about soldiers who are heroes, who conquer the bad guys. We’re told a fantasy, and it’s a fantasy that allows society to perpetuate wars, that allows us to send young men and women off to die. And it’s not to say that there are no circumstances in which a nation might need to go to war to defend itself or its interests, but we dress war up in these pretty clothes, and let’s not confuse that with the reality of what actually occurs. People say, well, with autonomous weapons we won’t have people weighing the value of life and death. I mean, that happens sometimes, but it’s not like every time someone dies in war there was this thoughtful exercise where a committee sat around and said, “Do we really need to kill this person? Is it really appropriate?” There’s a lot of dehumanization that goes on on the battlefield. So I think this is what makes this issue very challenging. Many of the objections to autonomous weapons are objections to war. That’s what people are actually objecting to.

The question isn’t, is war bad? Of course war’s terrible. The question is, how do we find ways going forward to use technology that may make war more precise and more humane without losing our humanity in the process, and are there ways to do that? It’s a challenging question. I think the answer is probably yes, but it’s one that’s going to require a lot of interrogation to try to get there. It’s a difficult issue because it’s also a dynamic process where there’s an interplay between competitors. If we get this wrong, we can easily end up in a situation where there’s less human control and there’s more violence in war. There are lots of opportunities to make things worse as well.

If we could make war perfect, that would be great, in terms of no civilian suffering and reducing the suffering of enemy combatants and the number of lives lost. If we could push a button and make war go away, that would be wonderful. Those things would all be great. The more practical question really is, can we improve upon the status quo, and how can we do so in a thoughtful way, or at least not make things worse than today? And I think those are hard enough problems to try to address.

Lucas Perry: I appreciate that you bring a very holistic, well-weighed perspective to the varying sides of this issue. So these are all very big and difficult. Are you aware of people actually studying whether some of these effects exist or not, and whether they would actually sanitize things or not? Or is this basically all just coming down to people’s intuitions and simulations in their head?

Paul Scharre: Some of both. There’s really great scholarship being done on autonomous weapons. Certainly there’s a robust array of legally based scholarship, people trying to understand how the laws of war might interface with autonomous weapons. But there’s also been work done thinking about some of these human psychological interactions. Missy Cummings, who’s at Duke and runs the humans and automation lab there, has done some work on human-machine interfaces on weapon systems to think through some of these concerns. There’s probably been less attention paid to the human-machine interface dimension of this and the human psychological dimension of it. But there’s been a lot of work done by people like Heather Roff, people at Article 36, and others thinking about concepts of meaningful human control and what that might look like in weapon systems.

I think one of the things that’s challenging across the board on this issue is that it is a politically contentious topic. You have multiple levels of this debate going on: you have scholars trying to sort of understand the issue, and then you also have a whole array of politically motivated groups, international organizations, civil society organizations, and countries duking it out, basically, at the UN and in the media, about where we should go with this technology. And you get a lot of motivated reasoning on all sides about what the answer should be. So for example, one of the things that fascinates me is I’ll often hear people say, autonomous weapons are terrible, they’ll have a terrible outcome, and we need to ban them now; and if we just pass a treaty and have enough political will, we could ban them. I’ll also hear people say a ban would be pointless, it wouldn’t work, and anyway, wouldn’t autonomous weapons be great? There are other possible beliefs. One could say that a ban is feasible, but the weapons aren’t that big of a deal. So it just seems to me like there’s a lot of politically motivated reasoning that goes on in this debate, which makes it very challenging.

Lucas Perry: So one of the concerns around autonomous weapons has to do with accidental escalation of warfare and conflict. Could you explore this point and explain what some strategies might be to prevent accidental escalation of warfare as AI is increasingly being used in the military?

Paul Scharre: Yes, so I think in general you could bucket concerns about autonomous weapons into two categories. One is a concern that they may not function very well and could have accidents. Those accidents could lead to civilian casualties, or they could lead to accidental escalation among nations in a crisis, with military forces operating in close proximity to one another. This happens with people. And you might worry about accidents with autonomous systems: maybe one shoots down an enemy aircraft and there’s an escalation and people are killed. And then how do you unwind that? How do you communicate to your adversary, we didn’t mean to do that, we’re sorry? How do you do that in a period of tension? That’s a particular challenge.

There’s a whole other set of challenges that come from the possibility that the weapons might work, and that gets to some of these deeper questions about the role of humans in decision making about life and death. But this issue of accidental escalation falls into the category of: they don’t work very well, they’re not reliable. And this is the case for a lot of AI and autonomous technology today, which isn’t to say it doesn’t work at all; if it didn’t work at all, it would be much easier. There’d be no debates about bias in facial recognition systems if they never identified faces. There’d be no debates about safety with self-driving cars if the car couldn’t go anywhere. The problem is that a lot of these AI-based systems work very well in some settings, and then if the settings change ever so slightly, they don’t work very well at all anymore. The performance can drop off very dramatically, and they’re not very robust to changes in environmental conditions. So this is a huge problem for the military, because in particular the military doesn’t get to test its systems in their actual operating environment.

So you can take a car, and you can take it on the roads, and you can test it in an actual driving environment. And we’ve seen car companies rack up 10 million miles or more of driving data. And then they can go back and they can run simulations. So Waymo has said that they run 10 million miles of simulated driving every single day. And they can simulate different lighting conditions, different environmental conditions. Well, the military can build simulations too, but simulations of what? What will the next war look like? Well, we don’t know, because we haven’t fought it yet. The good news is that war is very rare, which is great. But that also means that for these kinds of systems, we don’t necessarily know the operating conditions that they’ll be in, and so there is this real risk of accidents. And it’s exacerbated by the fact that this is also a very adversarial environment. You actually have an enemy who’s trying to trick your system and manipulate it. That adds another layer of complications.

Driving is a little bit competitive, maybe somebody doesn’t want to let you into the lane, but the pedestrians aren’t generally trying to get hit by cars. That’s a whole other complication in the military space. So all of that leads to concerns that the systems may do okay in training, and then we take them out into the real world and they fail, and they fail in a pretty bad way. If it’s a weapon system that is making its own decisions about whom to kill, it could be that it fails in a benign way, where it targets nothing, and that’s a problem for the military that built it; or it fails in a more hazardous way, in a dangerous way, and attacks the wrong targets. And when we’re talking about an autonomous weapon, the essence of this autonomous weapon is making its own decisions about which targets to attack and then carrying out those attacks. If you get that wrong, there could be pretty significant consequences. One of those could be civilian harm, and that’s a major concern. There are processes in place for vetting that operationally, in test and evaluation. Are those sufficient? I think there are good reasons to say that maybe they’re not sufficient, or not completely sufficient, and they need to be revised or improved.

And I’ll point out, we can come back to this, that the US Defense Department actually has a more stringent procedure in place for reviewing autonomous weapons than for other weapons, beyond what the laws of war require; the US is one of the few countries that has this. But then there’s also the question about accidental escalation, which could also be the case. Would that lead to an entire war? Probably not. But it could make it a lot harder to defuse tensions in a crisis, and that could be problematic. So we just had an incident not too long ago where the United States carried out an attack against a very senior Iranian general, General Soleimani, who was the head of the Iranian Quds Force, and killed him in a drone strike. And that was an intentional decision made by a person somewhere in the US government.

Now, did they fully think that through? I don’t know, that’s a different question. But a human made that decision in any case. Well, that was a huge escalation of hostilities between the US and Iran. And there was a lot of uncertainty afterwards about what would happen, and Iran launched some ballistic missiles against US troops in Iraq. And whether that’s it, or there’s more retaliation to come, I think we’ll see. But it could be a much more challenging situation if you had a situation in the future where an autonomous weapon malfunctioned and took some action, and now the other side might feel compelled to respond. They might say, well, we have to, we can’t let this go. Because human emotions are on the line, and national pride and prestige, and they feel like they need to maintain a principle of deterrence and they need to retaliate. So these could all be very complicated things if you had an accident with an autonomous weapon.

Lucas Perry: Right. And so an adjacent issue that I’d like to explore now is how a potential arms race can have interplay with issues around accidental escalation of conflict. So is there already an arms race brewing for autonomous weapons? If so, why and what could potentially be done to deescalate such a situation?

Paul Scharre: If there’s an arms race, it’s a very strange one, because no one is building the weapons. We see militaries advancing in robotics and autonomy, but we don’t really see this rush to build autonomous weapons. I struggle to point to any programs that I’m aware of in militaries around the globe that are clearly oriented towards building fully autonomous weapons. I think there are lots of places where, much like the incremental advancements of autonomy in cars, you can see more autonomous features in military vehicles and drones and robotic systems and missiles. They’re adding more autonomy. And one might be validly concerned about where that’s going. But it’s simply not the case that militaries have declared their intention: we’re going to build autonomous weapons, and here they are, and here’s our program to build them. I would struggle to use the term arms race. It could happen; maybe we’re at the starting line of an arms race. But I don’t think we’re in one today by any means.

It’s worth also asking, when we say arms race, what do we mean and why do we care? This is again one of these terms that’s often thrown around. You’ll hear this with the concept of autonomous weapons or AI, people saying we shouldn’t have an arms race. Okay. Why? Why is an arms race a bad thing? Militaries normally invest in new technologies to improve their national defense. That’s a normal activity. So if you say arms race, what do you mean by that? Is it beyond normal activity? And why would that be problematic? In the political science world, the specific definitions vary, but generally an arms race is viewed as an increase in defense spending overall, or in a particular technology area, above normal levels of modernizing militaries. Now, usually this is problematic for a couple of reasons. One could be that it ends up in a massive national expenditure, as in the case of Cold War nuclear weapons, that doesn’t really yield any military value or increase anyone’s defense or security; it just ends up flushing a lot of money down the drain. That’s money that could be spent elsewhere, for pre-K education or healthcare or something else that might be societally beneficial, instead of building all of these weapons. So that’s one concern.

Another one might be that we end up in a world where the large number of these weapons, or the type of these weapons, makes us worse off. Are we really better off in a world where there are tens of thousands of nuclear weapons on hair-trigger alert versus a few thousand weapons or a few hundred weapons? Even if we never get to zero, all things being equal, probably fewer nuclear weapons is better than more of them. So that’s another kind of concern, whether in terms of the violence and destructiveness of war if a war breaks out, or the likelihood of war, and the stability among nations. This is an area where, certainly from a spending standpoint, we’re not in any way in an arms race for autonomous weapons or AI today. When you look at actual expenditures, they’re a small fraction of what militaries are spending overall, if you look at, say, AI or autonomous features at large.

And again, for autonomous weapons, there really aren’t, at least openly declared, programs to go build a fully autonomous weapon today. But even if that were the case, why is that bad? Why would a world where militaries are racing to build lots of autonomous weapons be a bad thing? I think it would be a bad thing, but I think it’s also worth answering that question, because it’s not obvious to everyone. This is something that’s often missing in a lot of these debates and dialogues about autonomous weapons: people may not share some of the underlying assumptions. It’s better to bring out these assumptions and explain, I think this would be bad for these reasons, because maybe it’s not intuitive to other people; they may not share those reasons, and articulating them could increase understanding.

For example, the FLI letter on autonomous weapons from a few years ago said, “the key question for humanity today is whether to start a global AI arms race or prevent it from starting. If any major military power pushes ahead with AI weapon development, the global arms race is virtually inevitable. And the endpoint of this technological trajectory is obvious. Autonomous weapons will become the Kalashnikovs of tomorrow.” I like the language, it’s very literary, “the Kalashnikovs of tomorrow.” It’s a very concrete image. But there’s a whole bunch of assumptions packed into those few sentences that maybe don’t get worked through in a letter that’s intended to galvanize public interest and attention, but that are worth really unpacking. What do we mean when we say autonomous weapons are the Kalashnikovs of tomorrow, and why is that bad? And what does that mean? Those are, I think, important things to draw out and better understand.

It’s particularly hard for this issue because the weapons don’t exist yet. And so it’s not actually like debates around something like landmines. We could point to the mines and say, “this is a landmine, we all agree this is a landmine. This is what it’s doing to people.” And everyone could agree on what the harm is being caused. People might disagree on what to do about it, but there’s agreement on what the weapon is and what the effect is. But for autonomous weapons, all these things are up for debate. Even the term itself is not clearly defined. And when I hear people describe it, they can be describing a whole range of things. Some people, when they say the word autonomous weapon, are envisioning a Roomba with a gun on it. And other people are envisioning the Terminator. Now, both of those things are probably bad ideas, but for very different reasons. And that is important to draw out in these conversations. When you say autonomous weapon, what do you mean? What are you envisioning? What are you worried about? Are you worried about certain types of scenarios or certain types of effects?

If we want to get to the place where we really, as a society, come together and grapple with this challenge, I think first and foremost better communication is needed, and people may still disagree, but it’s much more helpful. Stuart Russell from Berkeley has talked a lot about the dangers of small anti-personnel autonomous weapons that would be widely proliferated. He made the Slaughterbots video that’s been seen millions of times on YouTube. That’s a very specific image. It’s an image that’s very concrete. So then you can say, when Stuart Russell is worried about autonomous weapons, this is what he’s worried about. And then you can start to try to better understand the assumptions that go into that.

Now, I don’t share Stuart’s concerns, and we’ve written about it and talked about it before, but it’s not actually because we disagree about the technology; I would agree that that’s very doable with existing technology. We disagree about the social responses to that technology, and how people respond, and what the countermeasures are, and what the ways to prevent proliferation are. So we, I think, disagree on some of the political or social factors that surround how people approach this technology and use it. Sometimes people actually totally agree on the risks, and even maybe the potential futures; they just have different values. And there might be some people whose primary value is trying to have fewer weapons in the world. Now that’s a noble goal. And they’re like, hey, any way that we can have fewer weapons, fewer advanced technologies, that’s better. That’s very different from someone who’s coming from a position of saying, my goal is to improve my own nation’s defense. That’s a totally different value system, a totally different preference. And they might be like, I also value what you say, but I don’t value it as much, and I’m going to take actions that advance these preferences. It’s important to really try to draw these out and understand them in this debate, if we’re going to get to a place where we can, as a society, come up with some helpful solutions to this problem.

Lucas Perry: Wonderful. I’m totally on board with that. Two questions and confusions on my end. The first is, I feel a bit confused when you say these weapons don’t exist already. It seems to me more like autonomy exists on a spectrum and is the integration of many different technologies and kinds of decision making in systems. It seems to me there already exists a certain degree of autonomy today; there isn’t Terminator-level autonomy, where you specify an objective and the autonomous system can just basically go execute it, which seems to require a very high level of generality, but there does seem to already exist some level of autonomy.

And so in that video, Stuart says that slaughterbots in particular represent a miniaturization and integration of many technologies which already exist today. And the second thing that I’m confused about is when you say that it’s unclear to you that militaries are very interested in this, or that there currently is an arms race. It seems like, yes, there isn’t an arms race like there was with nuclear weapons, where it’s very clear and there are Manhattan Project-style efforts around the technology, but given the strategic advantage conferred by this technology now and likely soon, it seems to me, game theoretically, from the position of militaries around the world that have the capacity to invest in these things, that it is inevitable, given their battlefield importance, that there would be massive ramping up of investments, or that there already is great interest in developing the autonomy and the sub-technologies required for developing fully autonomous systems.

Paul Scharre: Those are great questions and right on point. And I think the central issue in both of your questions is that when we say these weapons, or when I say these things, I should be more precise. When we say autonomous weapons, what do we mean exactly? And this is one of the things that can be tricky in this space, because there are not universally agreed upon definitions. There are certainly many weapon systems used widely around the globe today that incorporate some autonomous features. Many of these are fire-and-forget weapons. When someone launches them, they’re not coming back. They have, in that sense, autonomy to carry out their mission. But the autonomy is relatively limited and narrowly bounded, and humans, for the most part, are choosing the targets. So you can think of kind of maybe three classes of weapons. There are semi-autonomous weapons, where humans are choosing the targets, but there’s lots of autonomy surrounding that decision: cueing information to people, flying the munition once the person launches it. That’s one type of weapon, widely used today by really every advanced military.

Another one is the supervised autonomous weapons that are used in these relatively limited settings for defensive purposes, where there is kind of this automatic mode that people can turn on and activate to defend the ship or the ground base or the vehicle. And these are really needed for situations where the incoming threats are too fast for humans to respond. These, again, are widely used around the globe and have been in place for decades. And then there are what we could call fully autonomous weapons, where the human is launching them and the human programs in the parameters, but they have some freedom to fly a search pattern over some area and then, once they find a target, attack it on their own. For the most part, with some exceptions, those weapons are not widely used today. There have been some experimental systems that have been designed. There have been some put into operation in the past. The Israeli Harpy drone is an example of this that is still in operation today. It’s been around since the ’90s, so it’s not really very new. And it’s been sold to a handful of countries: India, Turkey, South Korea, China. And the Chinese have reportedly reverse engineered their own version of it.

But it’s not in widespread use. It’s not a major component of militaries’ orders of battle. I think you see militaries investing in robotic systems, but the bulk of their fleets are still human-occupied platforms; robotics are largely an adjunct to them. And in terms of spending, while there is increased spending on robotics, most of the spending is still going towards more traditional military platforms. The same is also true about the degree of autonomy: most of these robotic systems are just remote controlled, and they have very limited autonomy today. Now, we’re seeing more autonomy over time in both robotic vehicles and in missiles. But militaries have a strong incentive to keep humans involved.

It is absolutely the case that militaries want technologies that will give them an advantage on the battlefield. But part of achieving an advantage means your systems work, they do what you want them to do, the enemy doesn’t hack them and take them over, you have control over them. All of those things point to more human control. So I think that’s the thing where you actually see militaries trying to figure out where the right place is on the spectrum of autonomy, how much autonomy is right, and that line is going to shift over time. But it’s not the case that they necessarily want just full autonomy, because what does that even mean? They do want weapon systems to operate under some degree of human direction and involvement. It’s just that what that looks like may evolve over time as the technology advances.

And there are also, I should add, other bureaucratic factors that come into play: militaries’ investments are not entirely strategic. There’s bureaucratic politics within organizations. There’s politics more broadly, with the domestic defense industry interfacing with the political system in that country. These might drive resources in certain directions. And there’s of course some degree of inertia in any system. Those are also factors in play.

Lucas Perry: So I want to hit here a little bit on longer term perspectives. The Future of Life Institute in particular is interested in mitigating existential risks. We’re interested in the advanced risks from powerful AI technologies, where AI not aligned with human values and goals and preferences and intentions can potentially lead us to suboptimal equilibria that we’re trapped in permanently, or could lead to human extinction. Other technologies we care about are nuclear weapons, synthetic biology enabled by AI technologies, etc. So there is this view that if we cannot establish a governance mechanism as a global community on the concept that we should not let AI make the decision to kill, then how can we deal with more subtle near term issues and eventual long term safety issues around powerful AI technologies? So there’s this view of ensuring beneficial outcomes around lethal autonomous weapons, or at least beneficial regulation or development of that technology, and the necessity of that for longer term AI risk and value alignment with AI systems as they become increasingly intelligent. I’m curious to know if you have a view or perspective on this.

Paul Scharre: This is the fun part of the podcast with the Future of Life, because this rarely comes up in a lot of the conversations; I think in a lot of the debates, people are focused on just much more near term issues surrounding autonomous weapons or AI. I think that if you’re inclined to see that there are longer term risks from more advanced developments in AI, then it’s very logical to say that there’s some value in humanity coming together to come up with some set of rules about autonomous weapons today, even if the specific rules don’t really matter that much, because the level of risk is maybe not as significant. The process of coming together and agreeing on some set of norms and limits on particularly military applications of AI is probably beneficial and may begin to create the foundations for future cooperation. The stakes for autonomous weapons might be big, but they are certainly not existential under any reasonable interpretation of what autonomous weapons might do, unless you start thinking about autonomy wired into things like nuclear launch decisions, which is basically nuts. And I don’t think that’s really what’s on the table for what people might realistically be worried about.

When we try to come together as a human society to grapple with problems, we’re basically forced to deal with the institutions that we have in place. So for example, for autonomous weapons, we’re having debates in the UN Convention on Certain Conventional Weapons, the CCW. Is that the best forum for talking about autonomous weapons? Well, it’s kind of the forum that exists for this kind of problem set. It’s not bad. It’s not perfect in some respects, but it’s the one that exists. And so if you’re worried about future AI risk, creating the institutional muscle memory among the relevant actors in society, whether it’s nation states, AI scientists, members of civil society, militaries, if you’re worried about military applications, whoever it is, to come together, to have these conversations, and to come up with some answer, and maybe set some agreements, some limits, is probably really valuable actually, because it begins to establish the right human networks for collaboration and cooperation, because it’s ultimately people, it’s people who know each other.

So, “Oh, I worked with this person on this last thing.” If you look at, for example, the international movement that The Campaign to Stop Killer Robots is spearheading, that institution or framework, those people, those relationships are born out of past successful efforts to ban landmines and then cluster munitions. So there’s a path dependency in human relationships, bureaucracies, and institutions that really matters. Coming together and reaching any kind of agreement, actually, to set some kind of limits is probably really vital to start exercising those muscles today.

Lucas Perry: All right, wonderful. And a final fun FLI question for you. What are your views on long term AI safety considerations? Do you view AI eventually as an existential risk and do you integrate that into your decision making and thinking around the integration of AI and military technology?

Paul Scharre: Yes, it’s a great question. It’s not something that comes up a lot in the world that I live in, in Washington; in the policy world, people don’t tend to think about that kind of risk. I think it’s a concern. It’s a hard problem because we don’t really know how the technology is evolving. And I think one of the things that is challenging with AI is our frame for future, more advanced AI. Often the default frame is sort of thinking about human-like intelligence. When people talk about future AI, they use terms like AGI, or high level machine intelligence, or human-like intelligence, but we don’t really know how the technology will evolve.

I think one of the things that we’re seeing with AI and machine learning that’s quite interesting is that it often is evolving in ways that are very different from human intelligence, in fact quite alien and quite unusual. And I’m not the first person to say this, but I think it’s valid that we are on the verge of a Copernican revolution in how we think about intelligence: rather than thinking of human intelligence as the center of the universe, we’re realizing that humans are simply one type of intelligence among a whole vast array and space of possible forms of intelligence, and we’re creating different kinds. They may have very different intelligence profiles, they may just look very different, they may be much smarter than humans in some ways and dumber in other ways. I don’t know where things are going. I think it’s entirely possible that we move forward into a future where we see many more forms of advanced intelligent systems. And because they don’t have the same intelligence profile as human beings, we continue to kick the can down the road on calling it true intelligence, because it doesn’t look like us. It doesn’t think like us. It thinks differently. But these systems may yet be very powerful in very interesting ways.

We’ve already seen lots of AI systems, even very simple ones, exhibit a lot of creativity, a lot of interesting and surprising behavior. And as we begin to see the sort of scope of their intelligence widen over time, I think there are going to be risks that come with that. They may not be the risks that we were expecting, but I think over time there are going to be significant risks, and in some ways our anthropocentric view is, I think, a real hindrance here. I think it may lead us to underestimate risk from things that don’t look quite like humans, and maybe miss some things that are very real. I’m not at all worried about some AI system one day becoming self aware and having human level sentience; that does not keep me up at night. I am deeply concerned about advanced forms of malware. We’re not there today yet. But you could envision things over time that are adapting and learning and begin to populate the web. There are people thinking in interesting ways about systems that have misaligned goals. It’s also possible to envision systems that don’t have any human directed goals at all. Viruses don’t. They replicate. They’re effective at replicating, but they don’t necessarily have a goal in the way that we think of it, other than self replication.

If you have systems that are capable of replicating, of accumulating resources, of adapting, over time you might have all of the right boxes checked to begin to have systems that could be problematic. They could accumulate resources that could cause problems, even if they’re not trying to pursue either a goal that’s misaligned with human interest or even any goal that we might recognize. They simply could get out in the wild; if they’re effective at replication and acquiring resources and adapting, then they might survive. I think we’re likely to be surprised and continue to be surprised by how AI systems evolve, and where that might take us. And it might surprise us in ways that are humbling for how we think about human intelligence. So one question, I guess, is whether human intelligence is a convergence point for more intelligent systems: as AI systems become more advanced, do they become more human-like, or less human-like and more alien?

Lucas Perry: Unless we train them very specifically on human preference hierarchies and structures.

Paul Scharre: Right. Exactly. Right. And so I’m not actually worried about a system that has the intelligence profile of humans, when you think about capacity in different tasks.

Lucas Perry: I see what you mean. You’re not worried about an anthropomorphic AI, you’re worried about a very powerful, intelligent, capable AI, that is alien and that we don’t understand.

Paul Scharre: Right. It might have cross domain functionality, it might have the ability to do continuous learning. It might be adaptive in some interesting ways. I mean, one of the interesting things we’ve seen about the field of AI is that people are able to tackle a whole variety of problems with some very simple methods and algorithms. And this seems for some reason offensive to some people in the AI community, I don’t know why, but people have been able to use some relatively simple methods, with just huge amounts of data and compute, to tackle a variety of different kinds of problems, some of which seem very complex.

Now, they’re simple compared to the real world. When you look at things like strategy games like StarCraft and Dota 2, the real world looks way more complex, but these are still really complicated kinds of problems. And systems are basically able to learn them totally on their own. That’s not general intelligence, but it starts to point towards the capacity to have systems that are capable of learning a whole variety of different tasks. They can’t do this today continuously without suffering the problem of catastrophic forgetting, but people are working on these things as well. The problems today are that the systems aren’t very robust; they don’t handle perturbations in the environment very well. People are working on these things. I think it’s really hard to see how this evolves. But yes, in general, I think that our fixation on human intelligence as the pinnacle of intelligence, or even the goal of what we’re trying to build, this sort of anthropocentric view, is probably one that’s likely to lead us to underestimate some kinds of risks.

Lucas Perry: I think those are excellent points and I hope that mindfulness about that is able to proliferate in government and among actors who have the power to help mitigate some of these future and short term AI risks. I really appreciate your perspective, and I think you bring a wholesomeness and a deep, authentic willingness to entertain all the different positions and arguments here on the question of autonomous weapons, and I find that valuable. So thank you so much for your time and for helping to share information about autonomous weapons with us.

Paul Scharre: Thank you and thanks everyone for listening. Take care.

End of recorded material

AI Alignment Podcast: On the Long-term Importance of Current AI Policy with Nicolas Moës and Jared Brown

 Topics discussed in this episode include:

  • The importance of current AI policy work for long-term AI risk
  • Where we currently stand in the process of forming AI policy
  • Why persons worried about existential risk should care about present day AI policy
  • AI and the global community
  • The rationality and irrationality around AI race narratives

Timestamps: 

0:00 Intro
4:58 Why it’s important to work on AI policy
12:08 Our historical position in the process of AI policy
21:54 For long-termists and those concerned about AGI risk, how is AI policy today important and relevant?
33:46 AI policy and shorter-term global catastrophic and existential risks
38:18 The Brussels and Sacramento effects
41:23 Why is racing on AI technology bad?
48:45 The rationality of racing to AGI
58:22 Where is AI policy currently?

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today’s episode is with Jared Brown and Nicolas Moës, two AI policy researchers and AI influencers who are both concerned with the long-term and existential risks associated with artificial general intelligence and superintelligence. For us at the Future of Life Institute, we’re particularly interested in mitigating threats from powerful AI that could lead to the extinction of life. One avenue of trying to address such threats could be through action in the space of AI policy. But just what can we do today to help ensure beneficial outcomes from AGI and superintelligence in the policy sphere? This podcast focuses on this question.

As for some key points to reflect on throughout the podcast, Nicolas Moës points out that engaging in AI policy today is important because: 1) Experience gained on short-term AI policy issues is important to be considered a relevant advisor on long-term AI policy issues coming up in the future. 2) There are very few people that care about AGI safety currently in government, politics or in policy communities. 3) There are opportunities to influence current AI policy decisions in order to provide a fertile ground for future policy decisions or, better but rarer, to be directly shaping AGI safety policy today through evergreen texts. Future policy that is implemented is path dependent on current policy that we implement today. What we do now is precedent setting. 4) There are opportunities today to develop a skillset useful for other policy issues and causes. 5) Few resources are being spent on this avenue for impact, so the current return on investment is quite good.

Finally I’d like to reflect on the need to bridge the long-term and short-term partitioning of AI risk discourse. You might have heard this divide before: there are long-term risks from AI, like powerful AGI or superintelligence misaligned with human values causing the extinction of life, and then short-term risks like algorithmic bias and automation-induced disemployment. Bridging this divide means understanding the real and deep interdependencies and path dependencies between the technology and governance which we choose to develop today, and the world where AGI or superintelligence emerges.

For those not familiar with Jared Brown or Nicolas Moës, Nicolas is an economist by training focused on the impact of Artificial Intelligence on geopolitics, the economy and society. He is the Brussels-based representative of The Future Society. Passionate about global technological progress, Nicolas monitors global developments in the legislative framework surrounding AI. He completed his Master’s degree in Economics at the University of Oxford with a thesis on institutional engineering for resolving the tragedy of the commons in global contexts.

Jared is the Senior Advisor for Government Affairs at FLI, working to reduce global catastrophic and existential risk (GCR/x-risk) by influencing the U.S. policymaking process, especially as it relates to emerging technologies. He is also a Special Advisor for Government Affairs at the Global Catastrophic Risk Institute. He has spent his career working at the intersection of public policy, emergency management, and risk management, having previously served as an Analyst in Emergency Management and Homeland Security Policy at the U.S. Congressional Research Service and in homeland security at the U.S. Department of Transportation.

The Future of Life Institute is a non-profit and this podcast is funded and supported by listeners like you. So if you find what we do on this podcast to be important and beneficial, please consider supporting the podcast by donating at futureoflife.org/donate. These contributions make it possible for us to bring you conversations like these and to develop the podcast further. You can also follow us on your preferred listening platform by searching for us directly or following the links on the page for this podcast found in the description.

And with that, here is Jared Brown and Nicolas Moës on AI policy. 

Lucas Perry: I guess we can start off here, with developing the motivations around why it’s important for people to be considering AI policy. So, why is it important to be working on AI policy right now?

Nicolas Moës: It’s important right now because there has been an uptick in markets, right? So AI technologies are now embedded in many more products than ever before. Part of it is hype, but part of it is also having a real impact on profits and bottom line. So there is an impact on society that we have never seen before. For example, the way Facebook algorithms have affected recent history is something that has made the population and policy makers panic a bit.

And so quite naturally the policy window has opened. I think it’s also important to be working on it for people who would like to make the world better, for two reasons. As I mentioned, since the policy window is open, that means that there is a demand for advice to fill in the gaps that exist in the legislation, right? There have been many concrete situations where, as an AI policy researcher, you get asked to provide input, either by joining expert groups or workshops, or simply directly from people who say, “Oh, you know about AI, so could you just send me a position paper on this?”

Nicolas Moës: So these policies are getting written right now, which at first is quite soft and then becomes harder and harder policy, to the point that at least in the EU you now have regulations for AI on the agenda, which is one of the hardest forms of legislation out there. Once these are written, it is very difficult to change them. It’s quite sticky. There is a lot of path dependency in legislation. So this first legislation that passes will probably shape the box in which future legislation can evolve. It constrains the trajectory of future policies, and therefore it’s really difficult to take future policies in another direction. So for people who are concerned about AGI, it’s important to be already present right now.

The second point is that the people who are currently interacting with policymakers on a daily basis are concerned about very specific things, and they are gaining a lot of experience with policymakers, so that in the future, when more general algorithms come into play, the people with the experience to advise on these policies will actually be people concerned about what many call short term issues. People who are concerned more about the safety and the robustness of these more general algorithms would actually end up having a hard time getting into the room, right? You cannot just walk in and claim authority when you have people with 10, 15 or even 20 years of experience regulating this particular field of engineering.

Jared Brown: I think that sums it up great, and I would just add that there are some very specific examples of where we’re seeing what has largely been, up to this point, a set of principles being developed by different governments, or industry groups. We’re now seeing attempts to actually enact hard law or policy.

Just in the US, the Office of Management and Budget and the Office of Science and Technology Policy issued a memorandum calling for further AI regulation and non-regulatory actions and they issued a set of principles, that’s out for comment right now, and people are looking at those principles, trying to see if there’s ways of commenting on it to increase its longterm focus and its ability to adapt to increasingly powerful AI.

The OECD has already issued, and had sign-ons to, its AI principles, which are quite good.

Lucas Perry: What is the OECD?

Nicolas Moës: The Organization for Economic Cooperation and Development.

Jared Brown: Yes. Those principles are now going from principles to an observatory, and that will be launched by the end of February. And we’re seeing the effect of these principles now being adopted, and attempts now are being made to implement those into real regulatory approaches. So, the window from transitioning from principles to hard law is occurring right now, and as Nicholas said, decisions that are made now will have longterm effects because typically governments don’t turn their attention to issues more than once every five, maybe even 10 years. And so, if you come in three years from now with some brilliant idea about AI policy, chances are, the moment to enact that policy has already passed because the year prior, or two years prior, your government has enacted its formative legislation on AI.

Nicolas Moës: Yeah, yeah. So long as this policy benefits most people, they are very unlikely to even reopen, let’s say, the discussion, at all.

Lucas Perry: Right. So a few points here. The first is this one about path dependency, which means that the kinds of policies which we adopt now are going to be really important, because they’re going to inform and shape the kinds of policies that we’re able or willing to adopt later, and AI is going to be around for a long, long time. So we’re setting a lot of the foundation. The second thing was that if you care about AGI risk, or the risks of superintelligence, or very powerful forms of AI that you need to have been part of the conversation since the beginning, or else you’re not going to really be able to get a seat at the table when these things come around.

And Jared, is there a point here that I’m missing that you were trying to make?

Jared Brown: No, I think that sums it up nicely. The effect of these policies, and the ability of these policies to remain what you might call evergreen. So, long lasting and adaptive to the changing nature of AI technology is going to be critical. We see this all the time in tech policy. There are tech policies out there that were informed by the challenges of the time in which they were made and they quickly become detrimental, or outdated at best. And then there are tech policies that tend to be more adaptive, and those stand the test of time. And we need to be willing to engage with the short term policy making considerations, such that we’re making sure that the policies are evergreen for AI, as it becomes increasingly powerful.

Nicolas Moës: Besides the evergreen aspects of the policies that you want to set up now, there’s this notion of providing a fertile ground. So some policies that are very appropriate for short term issues, for example fairness, deception, fundamental rights abuse and that kind of thing, are actually almost copy-pasted into future legislation. So, if you manage to already put concerns for safety, like robustness, corrigibility, and value alignment of the algorithm, in today, even if you don’t have any influence in 10 or 15 years when they review the legislation, you have some chance of seeing the policymakers just copy-pasting this part on safety and putting it into whatever new legislation comes up in 10 years.

Jared Brown: There’s precedent setting, and legislators are loath to make fundamental reforms to legislation, and so if we see proper consideration of safety and security on AI in the evergreen pieces of legislation that are being developed now, that’s unlikely to be removed in future legislation.

Lucas Perry: Jared, you said that a lot of the principles and norms which have been articulated over say, the past five years are becoming codified into hard law slowly. It also would just be good if you guys could historically contextualize our position in terms of AI policy, whether or not we stand at an important inflection point, where we are in terms of this emerging technology.

Jared Brown: Sure, sure. So, I think if you went back just to 2017, 2016, at least in the US, there was very little attention to artificial intelligence. There were a smattering of congressional hearings being held, a few pertinent policy documents being released by executive agencies, but by and large, the term artificial intelligence remained in the science fiction realm of thinking.

Since that time, there’s been a massive amount of attention paid to artificial intelligence, such that in almost every Western democracy that I’m familiar with, it’s now part of the common discourse about technology policy. The phrase emerging tech is something that you see all over the place, regardless of the context, and there’s a real sensitivity among policymakers in Western-style democracies towards this idea that technology is shifting under our feet. There’s this thing called artificial intelligence, there’s this thing called synthetic biology, there’s other technologies linked into that — 5G and hypersonics are two other areas — where there’s a real understanding that something is changing, and we need to get our arms around it. Now, that has largely started with, in the past year or year and a half, a slew of principles. There are at least 80-some-odd sets of principles. FLI was one of the first to create a set of principles, along with many partners, and those are the Asilomar AI Principles.

Those principles you can see replicated and informing many sets of principles since then. We mentioned earlier, the OECD AI principles are probably the most substantive and important at this point, because they have the signature and backing of so many sovereign nation states, including the United States and most of the EU. Now that we have these core soft law principles, there’s an appetite for converting that into real hard law regulation or approaches to how AI will be managed in different governance systems.

What we’re seeing in the US, there have been a few regulatory approaches already taken. For instance, rulemaking on the inclusion of AI algorithms in the housing market. There’s a vision, if you will, from the Department of Transportation about how to deal with autonomous vehicles. The FDA has approved products coming into the market that involve AI and diagnostics in the healthcare industry, and so forth. We’re seeing initial policies being established, but what we haven’t yet seen in any real context is sort of a cross-sectoral, broadly focused piece of AI legislation or regulation.

And that’s what’s currently being developed both in the EU and in the US. That type of legislation, which seems like a natural evolution from where we’re at with principles, into a comprehensive holistic approach to AI regulation and legislation, is now occurring. And that’s why this time is so critical for AI policy.

Lucas Perry: So you’re saying that a broader and more holistic view about AI regulation and what it means to have and regulate beneficial AI is developed before more specific policies are implemented, with regards to the military, or autonomous weapons, or healthcare, or nuclear command and control.

Jared Brown: So, typically, governments try, whether or not they succeed remains to be seen, to be more strategic in their approach. If there is a common element that’s affecting many different sectors of society, they try and at least strategically approach that issue, to think: what is common across all policy arenas, where AI is having an effect, and what can we do to legislate holistically about AI? And then as necessary, build sector specific policies on particular issues.

So clearly, you’re not going to see some massive piece of legislation that covers all the potential issues that has to do with autonomous vehicles, labor displacement, workforce training, et cetera. But you do want to have an overarching strategic plan for how you’re regulating, how you’re thinking about governing AI holistically. And that’s what’s occurring right now, is we have the principles, now we need to develop that cross-sectoral approach, so that we can then subsequently have consistent and informed policy on particular issue areas as they come up, and as they’re needed.

Lucas Perry: And that cross-sectoral approach would be something like: AI should be interpretable and robust and secure.

Jared Brown: That’s written in principles to a large degree. But now we’re seeing, what does that really mean? So in the EU they’re calling it the European Approach to AI, and they’re going to be coming out with a white paper, maybe by the time this podcast is released, and that will sort of be their initial official set of options and opinions about how AI can be dealt with holistically by the EU. In the US, they’re setting regulatory principles for individual regulatory agencies. These are principles that will apply to the FDA, or the Department of Transportation, or the Department of Commerce, or the Department of Defense, as they think about how they deal with the specific issues of AI in their arenas of governance. Making sure that baseline foundation is informed and is an evergreen document, so that it incorporates future considerations, or is at least adaptable to future technological development in AI is critically important.

Nicolas Moës: With regards to the EU in particular, the historical context is maybe a bit different. As you mentioned, right now they are discussing this white paper, with many transversal policy instruments that would be put forward with this legislation. This is going to be negotiated over the next year. There are intentions to have the legislation at the EU level by the end of the current Commission’s term, so that means within five years. Something that is quite interesting to explore is that in 2016 there was this parliamentary own-initiative dossier, something that does not have any binding power and just shows the opinion of the European Parliament, that was dealing with robotics and civil law. So, considering how civil law in Europe should be adjusted to robotics.

That was in 2016, right? And now there’s been this uptick in activities. This is something that we have to be aware of. It’s moved quite fast, but then again, there still is a couple of years before regulations get approved. This is one point that I wanted to clarify: when we say it is fast or it is slow, we are still talking about a couple of years. Which, when you know how long it takes for you to develop your network, to develop your understanding of the issues, and to try to influence the issues, a couple of years is really way too short. The second point I wanted to make is also, what will the policy landscape look like in two years? Will we have the EU again leveraging its huge market power to impose its regulations? Within the European Commission, there are some intentions to diffuse whatever regulations come out right now throughout the world, right? To form a sort of sphere of influence, where all the AI produced, even abroad, would actually fit EU standards.

Over the past two, three years there have been a mushrooming of AI policy players, right? The ITU has set up this AI For Good, and has reoriented its position towards AI. There has been the Global Forum on AI for Humanity, political AI summits, which kind of pace the discussions about the global governance of artificial intelligence.

But would there be space for new players in the future? That’s something that I’m a bit unsure about. One of the reasons why it might be an inflection point, as you asked, is because now I think the pawns are set on the board, right? And it is unlikely that somebody could come in and just disturb everything. I don’t know how it plays in Washington, but in Brussels it seems very much like everybody knows each other already and it’s only about bargaining with each other, not especially listening to outside views.

Jared Brown: So, I think the policy environment is being set. I wouldn’t quite go so far as to say all of the pawns are on the chess board, but I think many of them are. The queen is certainly industry, and industry has stood up and taken notice that governments want to regulate and want to be proactive about their approach to artificial intelligence. And you’ve seen this, because you can open up your daily newspaper pretty much anywhere in the world and see some headline about some CEO of some powerful tech company mentioning AI in the same breath as government, and government action or government regulations.

Industry is certainly aware of the attention that AI is getting, and they are positioning themselves to influence that as much as possible. And so civil society groups such as the ones Nico and I represent have to step up, which is not to say the industry has all bad ideas, some of what they’re proposing is quite good. But it’s not exclusively a domain controlled by industry opinions about the regulatory nature of future technologies.

Lucas Perry: All right. I’d like to pivot here, more into some of the views and motivations the Future of Life Institute and the Future Society take, when looking at AI policy. The question in particular that I’d like to explore is how is current AI policy important for those concerned with AGI risk and longterm considerations about artificial intelligence growing into powerful generality, and then one day surpassing human beings in intelligence? For those interested in the issue of AGI risk or super intelligence risk, is AI policy today important? Why might it be important? What can we do to help shape or inform the outcomes related to this?

Nicolas Moës: I mean, obviously, I’m working full time on this and if I could, I would work double full time on this. So I do think it’s important. But it’s still too early to be talking about this in the policy rooms, at least in Brussels. Even though we have identified a couple of policymakers that would be keen to talk about that. But it’s politically not feasible to put forward these kind of discussions. However, AI policy currently is important because there is a demand for advice, for policy research, for concrete recommendations about how to govern this technological transition that we are experiencing.

So there is this demand, where people who are concerned about fundamental rights, and safety, and robustness, civil society groups, but also academics and industry themselves, sometimes come in with their clear recommendations about how you should concretely regulate, or govern, or otherwise influence the development and deployment of AI technologies. And within that set of people, if you have people who are concerned about safety, you would then be able to provide advice for evergreen policies, as we’ve mentioned earlier, and set up, let’s say, a fertile ground for better policies in the future as well.

The second part of why it’s important right now is also the longterm workforce management. If people who are concerned about the AGI safety are not in the room right now, and if they are in the room but focused only on AGI safety, they might be perceived as irrelevant by current policymakers, and therefore they might have restricted access to opportunities for gaining experience in that field. And therefore over the long term this dynamic reduces the growth rate, let’s say, of the workforce that is concerned about AGI safety, and that could be identified as a relevant advisor in the future. As a general purpose technology, even short term issues regarding AI policy have a long term impact on the whole of society.

Jared Brown: Both Nicholas and I have used this term “path dependency,” which you’ll hear a lot in our community and I think it really helps maybe to build out that metaphor. Various different members of the audience of this podcast are going to have different timelines in their heads when they think about when AGI might occur, and who’s going to develop it, what the characteristics of that system will be, and how likely it is that it will be unaligned, and so on and so forth. I’m not here to engage in that debate, but I would encourage everyone to literally think about whatever timeline you have in your head, or whatever descriptions you have for the characteristics that are most likely to occur when AGI occurs.

You have a vision of that future environment, and clearly you can imagine different environments by which humanity is more likely to be able to manage that challenge than other environments. An obvious example, if the world were engaged in World War Three, 30 years from now, and some company develops AGI, that’s not good. It’s not a good world for AGI to be developed in, if it’s currently engaged in World War Three at the same time. I’m not suggesting we’re doing anything to mitigate World War Three, but there are different environments for when AGI can occur that will make it more or less likely that we will have a beneficial outcome from the development of this technology.

We’re literally on a path towards that future. More government funding for AI safety research is a good thing. That’s a decision that has to get made, that’s made every single day, in governments all across the world. Governments have R&D budgets. How much is that being spent on AI safety versus AI capability development? If you would like to see more, then that decision is being made every single fiscal year of every single government that has an R&D budget. And what you can do to influence it is really up to you and how many resources you’re going to put into it.

Lucas Perry: Many of the ways it seems that AI policy currently is important for AGI existential risk are indirect. Perhaps it’s direct insofar as there’s these foundational evergreen documents, and maybe changing our trajectory directly is kind of a direct intervention.

Jared Brown: How much has nuclear policy changed? Our governance of nuclear weapons changed because the US initially decided to use the weapon. That decision irrevocably changed the future of nuclear weapons policy, and there is no way you can counterfactually unspool all of the various different ways the initial use of the weapon, not once, but twice, by the US sent a signal to the world: the US was willing to use this weapon, and the power of that weapon was on full display.

There are going to be junctures in the trajectory of AI policy that are going to be potentially as fundamental as whether or not the US should use a nuclear weapon at Hiroshima. Those decisions are going to be hard to see right now, if you’re not in the room and you’re not thinking about the way that policy is going to project into the future. That’s where this matters. You can’t unspool and rerun history. Consider, for instance, lethal autonomous weapons policy. There is a future scenario 30 years from now where international governance has never been established on lethal autonomous weapons, and lethal autonomous weapons are completely the norm, for militaries to use indiscriminately or without proper safety at all. And then there’s a world where they’ve been completely banned. Those two conditions will have serious effects on the likelihood that governments are up to the challenge of addressing potential global catastrophic and existential risk arising from unaligned AGI. And so it’s more than just setting a path. It’s central to the capacity building of our future to deal with these challenges.

Nicolas Moës: Regarding other existential risks, I mean, Jared is more of an expert on that than I am. In the EU, because this topic is so hot, it’s much more promising, let’s say, as an avenue for impact than other policy dossiers, because we don’t have the omnibus type of legislation that you have in the US. The EU remains quite topic for topic. In the end, there is very little power embedded in the EU; mostly it depends on the nation states as well, right?

So AI moves at the EU level, which makes you want to work at the EU level on AI policy for sure. But for the other issues, it sometimes remains still at the national level. That being said, the EU also has this particularity, let’s say, of being able to reshape debates at the national level. So, if there were people considering what the best approaches are to reduce existential risk in general via EU policy, I’m sure there would be a couple of dossiers right now, with policy windows open, that could be a conduit for impact.

Jared Brown: If the community of folks that are concerned about the development of AGI are correct and that it may have potentially global catastrophic and existential threat to society, then you’re necessarily obviously admitting that AGI is also going to affect the society extremely broadly. It’s going to be akin to an industrial revolution, as is often said. And that’s going to permeate every which way in society.

And there’s been some great work to scope this out. For instance, in the nuclear sphere, I would recommend to all the audience that they take a look at a recent edited compendium of papers by the Stockholm International Peace Research Institute. They have a fantastic compendium of papers about AI’s effect on strategic stability and nuclear risk. That type of sector specific analysis can be done with synthetic biology and various other things that people are concerned about as evolving into existential or global catastrophic risk.

And then there are current concerns with non-anthropogenic risks. AI is going to be tremendously helpful, if used correctly, to track and monitor near-Earth objects. You have to be concerned about asteroid impacts. AI is a great tool to be used to help reduce that risk by monitoring and tracking near-Earth objects.

We may yet make tremendous discoveries in geology to deal with supervolcanoes. Just recently there’s been some great coverage of an AI company called BlueDot for monitoring potential pandemics, as with the coronavirus. We see these applications of AI very beneficially reducing other global catastrophic and existential risks, but there are aggravating factors as well, especially for other anthropogenic concerns related to nuclear risk and synthetic biology.

Nicolas Moës: Some people who are concerned about AGI might sometimes see AI as overall negative in expectation, but a lot of policymakers see AI as an opportunity more than as a risk, right? So, starting with a negative or pessimistic narrative is difficult in the current landscape.

In Europe it might be a bit easier, because for odd historical reasons it tends to be a bit more cautious about technology and more proactive about regulation than maybe anywhere else in the world. I’m not saying whether that’s a good thing or a bad thing; I think there are advantages and disadvantages. It’s important to know, though, that even in Europe you still have people who are anti-regulation. The European Commission set up this independent high-level expert group on AI, with 52 or 54 experts, to decide about the ethical principles that will inform the legislation on AI. This was over the past year and a half, or the past two years even. Among them, the divisions are really important. Some of them wanted to just leave it to self-regulation, because even issues of fairness or safety will be detected eventually by society and addressed when they arise. And it’s important to mention that actually in the Commission, even though the current white paper seems to be more on the side of preventive or proactive regulation, the commissioner for digital, Thierry Breton, is definitely cautious about the approach he takes. But you can see that he is quite positive about the potential of technology.

The important thing here as well is that these players have an influential role to play on policy, right? So, going back to this negative narrative about AGI, it’s also something where we have to talk about how you communicate and how you influence in the end the policy debate, given the current preferences and the opinions of people in society as a whole, not only the opinions of experts. If it was only about experts, it would be maybe different, but this is politics, right? The opinion of everybody matters and it’s important that whatever influence you want to have on AI policy is compatible with the rest of society’s opinion.

Lucas Perry: So, I’m curious to know more about the extent to which the AI policy sphere is mindful of and exploring the shorter term global catastrophic or maybe even existential risks that arise from the interplay of more near term artificial intelligence with other kinds of technologies. Jared mentioned a few in terms of synthetic biology, and global pandemics, and autonomous weapons, and AI being implemented in the military and early warning detection systems. So, I’m curious to know more about the extent to which there are considerations and discussions around the interplay of shorter term AI risks with actual global catastrophic and existential risks.

Jared Brown: So, there’s this general understanding, which I think most people accept, that AI is not magical. It is open to manipulation, it has certain inherent flaws in its current capability and constructs. We need to make sure that that is fully embraced as we consider different applications of AI into systems like nuclear command and control. At a certain point in time, the argument could be sound that AI is a better decision maker than your average set of humans in a command and control structure. There’s no shortage of instances of near misses with nuclear war based on existing sensor arrays, and so on and so forth, and the humans behind those sensor arrays, with nuclear command and control. But we have to be making those evaluations fully informed about the true limitations of AI and that’s where the community is really important. We have to cut through the hype and cut through overselling what AI is capable of, and be brutally honest about the current limitations of AI as it evolves, and whether or not it makes sense from a risk perspective to integrate AI in certain ways.

Nicolas Moës: There have been human mistakes that have led to close calls, but I believe these close calls have been corrected because of another human in the loop. In early warning systems, though, you might actually end up with no human in the loop. I mean, again, we cannot really say whether these humans in the loop were statistically important, because we obviously don’t have the alternatives to compare it to.

Another thing, regarding whether some people think that AI is magic: I think I would be a bit more cynical. I still find myself in some workshops or policy conferences where you have some people who apparently haven’t seen a line of code in their entire life and still believe that if you tell the developer “make sure your AI is explainable,” then magically the AI will become explainable. This is still quite common in Brussels, I’m afraid. But there is a lot of heterogeneity. I think now, even among the 705 MEPs, there is one who is a former programmer from France. And that’s the kind of person who, given his expertise, if he were placed on the AI dossier, I guess would have a lot more influence because of that expertise.

Jared Brown: Yeah. I think in the US there’s this phrase that kicks around, that the US is experiencing a techlash, meaning there’s a growing reluctance, cynicism, criticism of major tech industry players. This started with the Cambridge Analytica problems that arose in the 2016 election. Some of it’s related to concerns about potential monopolies. I will say that it’s not directly related to AI, but that general level of criticism, more skepticism, is being imbued into the overall policy environment. And so people are more willing to question the latest, next greatest thing that’s coming from the tech industry, because we’re currently having this retrospective analysis of how what we used to think of as fantastic development may not be as fantastic as we thought it was. That kind of skepticism is somewhat helpful for our community, because it can be leveraged for people to be more willing to take a critical eye to the way that we apply technology going forward, knowing that there may have been some mistakes made in the past.

Lucas Perry: Before we move on to more empirical questions and questions about how AI policy is actually being implemented today, are there any other things here that you guys would like to touch on or say about the importance of engaging with AI policy and its interplay and role in mitigating both AGI risk and existential risk?

Nicolas Moës: Yeah, the so-called Brussels effect, which describes how whatever decision is made in European policy actually ends up influencing the rest of the world. I mentioned it briefly earlier. I’d be curious to hear what you, Jared, think about that. In Washington, do people consider the GDPR, for example, as a pre-made text that they can just copy-paste? Because apparently, I know that California has released something quite similar based on GDPR. By the way, GDPR is the General Data Protection Regulation, governing protection of privacy in the EU. It’s a regulation, so it has a binding effect on EU member states. What I mean by the Brussels effect is that, for example, this big piece of legislation is being, let’s say, integrated by big companies abroad, including US companies, to ensure that they can keep access to the European market.

And so the commission is actually quite proud of announcing that for example, some Brazilian legislator or some Japanese legislator or some Indian legislators are coming to the commission to translate the text of GDPR, and to take it back to their discussion in their own jurisdiction. I’m curious to hear what you think of whether the European third way about AI has a greater potential to lead to beneficial AI and beneficial AGI than legislation coming out of the US and China given the economic incentives that they’ve got.

Jared Brown: I think in addition to the Brussels effect, we might have to amend it to say the Brussels and the Sacramento effect, Sacramento being the state capital of California, because it’s one thing for the EU to have adopted the GDPR, and then California essentially replicated a lot of the GDPR, though not entirely, into what they call the CCPA, the California Consumer Privacy Act. If you combine the market size of the EU with California, you clearly have enough influence over the global economy. California, for those who aren’t familiar, would be the sixth or seventh largest economy in the world if it were a standalone nation. So, the combined effect of Brussels and Sacramento developing tech policy, or leading tech policy, is not to be understated.

What remains to be seen though is how long lasting that precedent will be. And their ability to essentially be the first movers in the regulatory space will remain. With some of the criticism being developed around GDPR and the CCPA, it could be that leads to other governments trying to be more proactive to be the first out the door, the first movers in terms of major regulatory effects, which would minimize the Brussels effect or the Brussels and Sacramento effect.

Lucas Perry: So, in terms of race conditions, and sticking here on questions of global catastrophic and existential risks: why are AI policy, governance, and strategy considerations important for risks associated with racing between, say, the United States and China on AI technology? Could you guys speak a little bit to the importance of appropriate AI policy and strategic positioning for mitigating race conditions, and why a race would be bad for AGI risk and existential and global catastrophic risks in general?

Jared Brown: To simplify it, the basic logic here is that if two competing nations states or companies are engaged in a competitive environment to be the first to develop X, Y, Z, and they see tremendous incentive and advantage to being the first to develop such technology, then they’re more likely to cut corners when it comes to safety. And cut corners thinking about how to carefully apply these new developments to various different environments. There has been a lot of discussion about who will come to dominate the world and control AI technology. I’m not sure that either Nicolas or I really think that narrative is entirely accurate. Technology need not be a zero sum environment where the benefits are only accrued by one state or another. Or that the benefits accruing to one state necessarily reduce the benefits to another state. And there has been a growing recognition of this.
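The racing logic Jared lays out here is essentially a two-player game with a race-to-the-bottom equilibrium. Below is a minimal sketch in Python, using purely hypothetical payoff numbers (none of these figures come from the conversation or from any cited model), showing how a large expected first-mover advantage can make cutting safety corners the best response to whatever the other actor does, even though mutual caution would leave both actors better off:

```python
# Toy model of an AI race between two actors (hypothetical payoffs, for illustration only).
# Each actor chooses either to invest in "safety" or to "cut_corners" to move faster.
# Payoffs are (row player, column player); higher is better.
PAYOFFS = {
    ("safety", "safety"):           (3, 3),  # both cautious: shared, stable benefit
    ("safety", "cut_corners"):      (0, 4),  # cautious actor falls behind the racer
    ("cut_corners", "safety"):      (4, 0),  # racer gains a lead, at added risk
    ("cut_corners", "cut_corners"): (1, 1),  # both race: corners cut everywhere
}

ACTIONS = ("safety", "cut_corners")

def best_response(opponent_action: str) -> str:
    """Return the action that maximizes the row player's payoff,
    given the opponent's action."""
    return max(ACTIONS, key=lambda a: PAYOFFS[(a, opponent_action)][0])

if __name__ == "__main__":
    for opp in ACTIONS:
        print(f"If the other actor chooses {opp!r}, my best response is {best_response(opp)!r}")
    # With these numbers, "cut_corners" is the best response to either choice,
    # so (cut_corners, cut_corners) is the only equilibrium, even though
    # (safety, safety) would give both actors a higher payoff.
```

Under these assumed payoffs, the only stable outcome is mutual corner-cutting, which is exactly the dynamic that cooperative work on safety, like the Track 1.5 and Track 2 diplomacy discussed next, is meant to escape.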

Nicolas earlier mentioned the high-level expert group in the EU; an equivalent type of body in the US is called the National Security Commission on AI. And in their interim report they recognize that there is a strong need, and one of their early recommendations is for what they call Track 1.5 or Track 2 diplomacy, which is essentially jargon for engagement with China and Russia on AI safety issues. Because if we deploy these technologies in reckless ways, that doesn’t benefit anyone. And we can still move cooperatively on AI safety and on the responsible use of AI without entering into a zero-sum environment where the benefits are only going to be accrued by one state or another.

Nicolas Moës: I definitely see safety technologies as something that would benefit everybody. If you’re thinking in terms of two different types of inventions, the one that promotes safety would indeed be useful to everyone, but I believe that for enhancing raw capabilities, you would actually race. Right? So, I totally agree with your description of that dynamic. I know people on both sides see this as a silly thing, you know, with media hype and of course industry benefiting a lot from this narrative.

There is a lot of this, though, that remains the rational thing to do, right? Whenever you start negotiating standards, you can say, “Well, look at our systems. They are more advanced, so they should become the global standards for AI,” right? That actually is worrisome, because the trajectory right now, since there is this narrative in place, is that over the medium term you would expect the technologies maybe to diverge, and so both blocs, or if you want to charitably include the EU in this race, the three blocs, would start diverging and therefore we’ll need each other less and less. The economic cost of an open conflict would actually decrease, but this is over the very long term.

That’s kind of the danger of race dynamics as I see them. Again, it’s very heterogeneous, right? When we say the US against China, when you look at the more granular level, even units of government are sometimes operating with a very different mindset. So, as for whether AI policy can actually be relevant to this, for example, I do think it can, because at least on the Chinese side, as far as I know, there is this awareness of the safety issue. Right? And there has been a pretty explicit article. It was like, “The US and China should work together to future-proof AI.” So, it gives you the impression that some government officials or former government officials in China are interested in this dialogue about the safety of AI, which is what we would want. We don’t especially have to put the raw capabilities question on the table so long as there are common agreements about safety.

At the global level, there’s a lot of things happening to tackle this coordination problem. For example, the OECD AI Policy Observatory is an interesting setup because that’s an institution with which the US is still interacting. There have been fewer and fewer multilateral fora with which the US administration has been willing to interact constructively, let’s say. But for the OECD one yes, there’s been quite a lot of interactions. China is an observer to the OECD. So, I do believe that there is potential there to have a dialogue between the US and China, in particular about AI governance. And plenty of other fora exist at the global level to enable this Track 1.5 / Track 2 diplomacy that you mentioned Jared. For example, the Global Governance of AI Forum that the Future Society has organized, and Beneficial AGI that Future of Life Institute has organized.

Jared Brown: Yeah, and that's sort of part and parcel with one of the most prominent examples of what some people call scientific diplomacy, which is kind of a weird term: the Pugwash conferences that occurred all throughout the Cold War, where technical experts were meeting on the side to essentially establish a rapport between Russian and US scientists on issues of nuclear security and biological security as well.

So, there are plenty of examples where even if this race dynamic gets out of control, and even if we find ourselves 20 years from now in an extremely competitive, contentious relationship with near peer adversaries competing over the development of AI technology and other technologies, we shouldn’t, as civil society groups, give up hope and surrender to the inevitability that safety problems are likely to occur. We need to be looking to the past examples of what can be leveraged in order to appeal to essentially the common humanity of these nation states in their common interest in not wanting to see threats arise that would challenge either of their power dynamics.

Nicolas Moës: The context matters a lot, but sometimes it can be easier than one might think, right? So, when we organized the US China AI Tech Summit, it was about business, about the cutting edge, and it was also about just getting together to discuss. This was a bit before the US / China race dynamics were in full swing, so there were not many issues with getting our guests. Nowadays it might be a bit more difficult, with some officials unable to join events where officials from other countries are present, for diplomatic reasons. That was in June 2018, right? But back then there was the willingness and the possibility, since the US China tension was quite limited.

Jared Brown: Yeah, and I'll just throw out a quick plug for other FLI podcasts. I recommend listeners check out the work that we did with Matthew Meselson. Max Tegmark had a great podcast on the development of the Biological Weapons Convention, which is a great example of how two competing nation states came to a common understanding about what was, and still is, a global catastrophic and existential risk, and developed the Biological Weapons Convention.

Lucas Perry: So, tabling collaboration on safety, which can certainly be mutually beneficial, and focusing just on capabilities research, where it at least seems basically rational to race in a game theoretic sense: I'm interested in exploring whether you guys have any views or points to add here about mitigating the risks there, and how it may actually not be rational to race at all.

Nicolas Moës: So, there is the narrative currently that it's rational to race on some aspects of raw capabilities, right? However, when you go beyond the typical game theoretical model, when you enable people to build bridges, you can actually find circumstances under which a so-called institutional entrepreneur builds up an institution that is legitimate, that everybody agrees upon, and that enforces the cooperation agreement.

In economics, the windfall clause is about distribution. What I'm talking about here, in the game theoretical space, is how to avoid the negative impact, right? So, the windfall clause would operate in a very limited set of scenarios whereby AGI leads to an abundance of wealth, and then the windfall clause deals with the distributional aspect and therefore reduces, to a certain extent, the incentive to produce AGI. However, for actors to abide by the windfall clause, you still have to preserve some incentive to develop AGI, right? But you might actually tamp that down.

What I was talking about here is the institutional entrepreneur, who can break this race simply by securing a credible commitment from both sides and enforcing that commitment. Take the typical model of the tragedy of the commons, which here could be seen as over-exploiting the commons by racing to superintelligence: you can actually solve the tragedy of the commons. So it's not that rational anymore. Once you know that there is a solution, it's not rational to go for the worst case scenario, right? You actually can design a mechanism that moves you towards the better outcome. It's costly, but it can be done if people are willing to put in the effort, and it's not costly enough to justify not doing it.

Jared Brown: I would just add that the underlying assumptions about the rationality of racing towards raw capability development largely depend on the level of risk you assign to unaligned AI, or to deploying narrow AI in ways that exacerbate global catastrophic and existential risk. Those game theoretic dynamics can essentially be changed if our community eventually does a better job of sensitizing players on both sides to the lose/lose situation we could find ourselves in through this type of racing. So it's not set in stone, and the environment can change as information asymmetry decreases between the two competing parties and there's a greater appreciation of the lose/lose situations that could develop.
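(Editorial aside, not part of the recorded conversation: the dynamic Nicolas and Jared describe can be sketched as a simple two-player game. The payoff numbers and the size of the enforcement penalty below are illustrative assumptions only; the point is just that a credible, enforced commitment, or a revised estimate of the downside of mutual racing, can shift the equilibrium from mutual racing to mutual cooperation.)

```python
# A minimal sketch of how an enforced commitment can change the equilibrium of a
# two-player "cooperate on safety vs. race on capabilities" game.
# All payoff numbers are illustrative assumptions, not estimates.

from itertools import product

# payoffs[(a, b)] = (payoff to player A, payoff to player B)
base_payoffs = {
    ("cooperate", "cooperate"): (3, 3),   # shared safety benefits
    ("cooperate", "race"):      (0, 4),   # the racer gains a lead
    ("race",      "cooperate"): (4, 0),
    ("race",      "race"):      (1, 1),   # mutual corner-cutting, higher accident risk
}

def with_enforcement(payoffs, penalty):
    """Subtract an enforced penalty from any player who chooses to race."""
    adjusted = {}
    for (a, b), (pa, pb) in payoffs.items():
        adjusted[(a, b)] = (pa - (penalty if a == "race" else 0),
                            pb - (penalty if b == "race" else 0))
    return adjusted

def nash_equilibria(payoffs):
    """Return pure-strategy profiles where neither player gains by deviating."""
    actions = ["cooperate", "race"]
    equilibria = []
    for a, b in product(actions, actions):
        pa, pb = payoffs[(a, b)]
        best_a = all(payoffs[(alt, b)][0] <= pa for alt in actions)
        best_b = all(payoffs[(a, alt)][1] <= pb for alt in actions)
        if best_a and best_b:
            equilibria.append((a, b))
    return equilibria

print(nash_equilibria(base_payoffs))                       # [('race', 'race')]
print(nash_equilibria(with_enforcement(base_payoffs, 2)))  # [('cooperate', 'cooperate')]
```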

Lucas Perry: Yeah. So I guess I just want to highlight the point that on a superficial first analysis, it would seem that the rational game theoretic thing to do is to increase capability as much as possible, so that you have power and security over other actors, but that might not hold up under further investigation.

Jared Brown: Right, and for those people who haven't had to suffer through game theory classes, there's a great popular culture example here: a lot of people have seen Stranger Things on Netflix. If you haven't, maybe skip ahead 20 seconds until I'm done saying this. There is an example of the US and Russia competing to understand the upside down world, and then releasing untold havoc onto their societies because of this upside down discovery. For those of you who have watched, it's actually a fairly realistic example of where this kind of competing technological development leads somewhere that's a lose/lose for both parties. Because they were each discovering it themselves without communicating the risks, they lacked the cooperation and information sharing that would have kept either of them from opening up the portals to the upside down world.

Nicolas Moës: The same dynamics, the same "oh, it's rational to race" dynamic applied to nuclear policy and the nuclear arms race, have actually led to some treaties. Far from perfect, right? But some treaties. So this is the thing: the model, the tragedy of the commons, is easy to communicate. There is a sense of doom and fatality embedded in it. This resonates really well with people, especially in the media; it's a very simple thing to say. But it simply might not be true, right? As I mentioned, there is this institutional entrepreneurship aspect, which requires resources, so it is very costly to do. But civil society is doing it, and I think the Future of Life Institute has agency to do that. The Future Society is definitely doing that. We are actually agents of breaking away from these game theoretical traps, a break that would otherwise be unlikely.

We fixate a lot on the model, but in reality we have seen, in nuclear policy, the worst case scenario being averted, sometimes by mistake. Right? The human in the loop not following the policy, or something like that. So it's interesting as well; it shows how unpredictable all of this is. It really shows that for AI it's the same. You could have the militaries on both sides, literally from one day to the next, start a discussion about AI safety and how to ensure that they keep control. There's a lot of goodwill on both sides. And so maybe the economist (and I'm an economist by training, so I can be a bit harsh on myself) would say, "But this is not rational." Well, in the end, it is more rational, right? So long as you win, you know, keep living a healthy life and feel like you have done the right thing, this is the rational thing to do. Maybe if Netflix is not your thing, "Inadequate Equilibria" by Eliezer Yudkowsky explores these kinds of conundrums as well: why do you have sub-optimal situations in life in general? It's a very general model, but I found it very interesting for thinking about these issues, and in the end it boils down to these kinds of situations.

Lucas Perry: Yeah, right. For example, the United States and Russia having something like 7,000 nuclear warheads each, and being on hair trigger alert with one another, is a kind of suboptimal equilibrium that we've nudged ourselves into. It may be completely unrealistic, but a more optimal place to be would be no nuclear weapons, with all of that technology and information used for nuclear power instead. We would all just be better off.

Nicolas Moës: Yeah. What you describe seems to be a better situation. However, the "rational" thing to do at some point would have been to incapacitate the Soviet Union before it developed the capability. Now, the mutually assured destruction policy is holding up a lot of that. But I do believe that the diplomacy, the discussions, the communication, even merely the act of communicating, "Look, if you do that, then we will do that," is a form of progress towards the norm that, basically, you should not use it.

Jared Brown: Game theory is nice for boiling things down into nice little boxes, clearly. But the dynamics of the nuclear situation between the USSR and the US add a countless number of boxes that you could end up in, and yes, each of us having way too large a nuclear arsenal is a sub-optimal outcome, but it's not the worst possible outcome; that would have been total nuclear annihilation. So it's important not just to look at criticisms of the current situation, but also to see the benefits of the current situation and why this box is better than some of the other boxes we could have ended up in. That way, we can look at the paths we have taken to get to where we're at, find the paths that were actually positive, and reapply those lessons learned to the trajectory of emerging technology once again. We can't throw out everything that has happened in nuclear policy and assume that there's nothing to be gained from it, just because the situation we've ended up in is suboptimal.

Nicolas Moës: Something that I have experienced while interacting with policymakers and diplomats is that you actually have agency over what is going on. This is important to note: it's not as if you are a small thing and the world is just passing by. No. Even in policy, which may seem a bit more arcane, you can pull the right levers to make somebody feel less like they have to obey this race narrative.

Jared Brown: Just recently, in the last National Defense Authorization Act, there was a provision talking about the importance of establishing military to military dialogues, potentially even with adversarial states like North Korea and Iran, for that exact reason: better communication between militaries can lead to a reduction in miscalculation, and therefore less adverse escalation of conflicts. We saw this just recently between the US and Iran. There was perhaps no direct communication between the US and Iran, but there was indirect communication, some of it over Twitter, about the intentions and the actions that each state might take in reaction to the other's, and that may have helped deescalate the situation to where we find it now. It's far from perfect, but this is the type of thing that civil society can help encourage as we are dealing with new types of technology that can be as dangerous as nuclear weapons.

Lucas Perry: I just want to touch on what is actually going on now and actually being considered before we wrap things up. You talked about this a little bit before, Jared, you mentioned that currently in terms of AI policy, we are moving from principles and recommendations to the implementation of these into hard law. So building off of this, I’m just trying to get a better sense of where AI policy is, currently. What are the kinds of things that have been implemented, and what hasn’t, and what needs to be done?

Jared Brown: So there are some key decisions that have to be made in the near term on AI policy that I see replicating in many different government environments. One of them is about liability. I think it’s very important for people to understand the influence that establishing liability has for safety considerations. By liability, I mean who is legally responsible if something goes wrong? The basic idea is if an autonomous vehicle crashes into a school bus, who’s going to be held responsible and under what conditions? Or if an algorithm is biased and systematically violates the civil rights of one minority group, who is legally responsible for that? Is it the creator of the algorithm, the developer of the algorithm? Is it the deployer of that algorithm? Is there no liability for anyone at all in that system? And governments writ large are struggling with trying to assign liability, and that’s a key area of governance and AI policy that’s occurring now.

For the most part, it would be wise for governments not to grant AI blanket protection from liability simply as a way of encouraging and fostering the adoption of those technologies, because doing so encourages people to use those technologies in unquestioning ways and to entirely surrender decision making from the human to the AI algorithm. There are other key issue areas. There is the question of educating the populace. The example I give here is financial literacy: you hear that term all the time, referring to how educated your populace is about dealing with money matters.

There's also a lot of technical literacy, or technology literacy, being developed. The Finnish government has a whole course on AI that they're making available to the entire EU. How we educate our population and prepare it from a workforce training perspective matters a lot. If that training incorporates considerations for common AI safety problems, if we're training people about how adversarial examples can affect machine learning and so on and so forth, we're doing a better job of sensitizing the population to potential long-term risks. That's another example of where AI policy is being developed. And I'll throw out one more, which is a common example that people will understand. You have a driver's license from your state. The state has traditionally been responsible for deciding the human qualities that are necessary in order for you to operate a vehicle. The same goes for state licensing boards, which have been responsible for certifying and allowing people to practice law or practice medicine.

Doctors and lawyers have national organizations, but licensing is typically done at the state level. Now, if we talk about AI starting to essentially replace human functions, governments have to look again at this division of who regulates what and when. There's an opportunity in all democracies to reevaluate the distribution of responsibility between units of government: who has the responsibility to regulate, monitor, and govern AI when it is doing something that a human being used to do. There are different pros and cons for different models. But suffice it to say that that's a common theme in AI policy right now: how to decide who has the responsibility to govern AI when it's essentially replacing what used to be formerly, exclusively, a human function.

Nicolas Moës: Yeah, so in terms of where we stand currently, let's bring some context to this question as well, right? The way it has evolved over the past few years is that you had mostly ethical principles in 2017 and 2018. Let's look at the global level first. At the global level you had, for example, the Montréal Declaration, which was intended to be global but appealed mostly to fundamental rights-oriented countries, so it excludes some of the key players. There have been dozens and dozens of sets of principles for AI, in various contexts or in general. That was 2018. Then what we have seen since is the first multilateral guidelines: the OECD principles, and GPAI, this global panel on AI that was a big initiative between Canada and France, initially intended to become kind of the international body for AI governance, but that deflated a bit over time. You also had the establishment of all the fora for discussion that I have already mentioned: political AI summits and the Global Forum on AI for Humanity, which is, again, a Franco-Canadian initiative, like the AI for Good; the Global Governance of AI Forum in the Middle East; and the Ethically Aligned Design initiative at the IEEE, a global standards body, which has garnered a lot of attention among policymakers and other stakeholders. But the move towards harder law is coming, and since it's towards harder law, at the global level there is not much that can happen: nation states remain sovereign in the eyes of international law.

So unless you write an international treaty, it is at the national government level that you have to move towards hard law. At the global level, the next step that we can see is audits and certification principles. It's not hard law, but you use labels to independently certify whether an algorithm is good. Some of them are tailored for specific countries; I think Denmark has its own certification mechanism for AI algorithms. The US is seeing the appearance of various initiatives, notably by the big consulting companies, which are also the auditors. So it is interesting to see how we shift from soft law towards this industry-wide regulation of these algorithms. At the EU level, where you have some hard legislative power, you also had a high level group on liability, which is very important, because they basically argued that we're going to have to update product liability rules in certain ways for AI and for internet of things products.

This is interesting to look at as well, because product liability rules are hard law, right? So what they have recommended is directly translatable into legislation. And so, moving to the current stage: since the end of 2019, you have this hard law coming up, and the Commission white paper, which really kickstarts the debate about what the regulation for AI will be, and whether it will even be a regulation; it could be something else, like a directive. The high level expert group has come up with a self assessment list for companies to check whether they are obeying the ethical principles decided upon in Europe. These are soft, self regulation measures, which might eventually affect court rulings or something like that, but they do not represent the law. And now the big players are moving in, either at the global level with these more and more powerful labeling or certification initiatives, or at the EU level with this hard law.

And the reason why the EU has moved towards hard law so quickly is that, during the very short campaign of the Commission president, AI was a political issue. The techlash was strong, and of course a lot of industry was complaining that nothing was happening on AI in the EU, so they wanted strong action, and that kind of stuff. Those are the circumstances that led the EU to be in pole position for developing hard law. Elsewhere in the world, you actually have more fragmented initiatives at this stage, except for the OECD AI Policy Observatory, which might be influential in itself, right? It's important to note the AI principles that the OECD has published. Even though they are not binding, they will actually influence the whole debate, because at the international level, for example, when the OECD had privacy principles, those became the reference point for many legislators. So some countries that don't want to spend years debating how to legislate AI might just say, "okay, here are the OECD principles, how do we implement them in our current body of law?" And that's it.

Jared Brown: And I’ll just add one more quick dynamic that’s coming up with AI policy, which is essentially the tolerance of that government for the risk associated with emerging technology. A classic example here is, the US actually has a much higher level of flood risk tolerance than other countries. So we engineer largely, throughout the US, our dams and our flood walls and our other flood protection systems to a 1-in-100 year standard. Meaning the flood protection system is supposed to protect you from a severe storm that would have a 1% chance of occurring in a given year. Other countries have vastly different decisions there. Different countries make different policy decisions about the tolerance that they’re going to have for certain things to happen. And so as we think about emerging technology risk, it’s important to think about the way that your government is shaping policies and the underlying tolerance that they have for something going wrong.

It could be as simple as how likely it is that you will die because of an autonomous vehicle crash. The EU, traditionally, has had what they call a precautionary principle approach, which means that in the face of uncertain risks they're more likely to regulate and restrict development until those risks are better understood, whereas the US has typically adopted the precautionary principle less often.
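(Editorial aside, not part of the recorded conversation: the "1-in-100 year" framing Jared mentions can be made concrete with a quick back-of-the-envelope calculation. The sketch below assumes independent years with a 1% annual exceedance probability, which is a simplification.)

```python
# What a "1-in-100 year" design standard implies over longer horizons,
# assuming an independent 1% annual exceedance probability.

annual_prob = 0.01  # 1% chance of a design-exceeding flood in any given year

for years in (10, 30, 100):
    # Probability of at least one exceedance over the horizon
    cumulative = 1 - (1 - annual_prob) ** years
    print(f"{years:>3} years: {cumulative:.0%} chance of at least one exceedance")

# Roughly: 10% over 10 years, 26% over 30 years, 63% over 100 years.
```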

Nicolas Moës: There is a lot of uncertainty. A lot of uncertainty about policy, but also a lot of uncertainty about the impact that all these technologies are having. With the dam standard, you can quantify the force of nature quite easily, but here we are dealing with social forces that are a bit different. I still remember quite a lot of people being very negative about Facebook's chances of success, because people would not be willing to put pictures of themselves online. I guess 10 years later, these people have been proven wrong. The same thing could happen with AI, right? People, at least in the EU, are currently afraid of some aspects of AI. Take autonomous vehicles: surrendering decision-making about our life and death to an autonomous vehicle is something that, as the technology improves, people may become more and more willing to do. So yeah, it's very difficult to predict, and even more difficult to quantify, I think.

Lucas Perry: All right. So thank you both so much. Do either of you guys have any concluding thoughts about AI policy or anything else you’d just like to wrap up on?

Jared Brown: I just hope the audience really appreciates the importance of engaging in the policy discussion and trying to map out a beneficial path forward for AI policy, because if you're concerned like we are about the long term trajectory of this emerging technology and other emerging technologies, it's never too early to start engaging in the policy discussion on how to map a beneficial path forward.

Nicolas Moës: Yeah, and one last thought. Jared and I were talking a couple of days ago about the number of people doing this kind of work. So thank you, by the way, Jared, for inviting me, and Lucas, for inviting me on the podcast. That conversation led us to wonder how many people are doing what we are doing, with the motivation that we have regarding these longer term concerns. It makes me think that there are very few resources, labor resources, financial resources, dedicated to this issue. And if there is anybody in the audience interested in this issue, they should definitely get in touch. There are too few people right now with similar motivations, caring about the same things in AI policy, for us to miss the opportunity of meeting each other and coordinating better.

Jared Brown: Agreed.

Lucas Perry: All right. Wonderful. So yeah, thank you guys both so much for coming on.

End of recorded material

AI Alignment Podcast: Identity and the AI Revolution with David Pearce and Andrés Gómez Emilsson

 Topics discussed in this episode include:

  • Identity from epistemic, ontological, and phenomenological perspectives
  • Identity formation in biological evolution
  • Open, closed, and empty individualism
  • The moral relevance of views on identity
  • Identity in the world today and on the path to superintelligence and beyond

Timestamps: 

0:00 – Intro
6:33 – What is identity?
9:52 – Ontological aspects of identity
12:50 – Epistemological and phenomenological aspects of identity
18:21 – Biological evolution of identity
26:23 – Functionality or arbitrariness of identity / whether or not there are right or wrong answers
31:23 – Moral relevance of identity
34:20 – Religion as codifying views on identity
37:50 – Different views on identity
53:16 – The hard problem and the binding problem
56:52 – The problem of causal efficacy, and the palette problem
1:00:12 – Navigating views of identity towards truth
1:08:34 – The relationship between identity and the self model
1:10:43 – The ethical implications of different views on identity
1:21:11 – The consequences of different views on identity on preference weighting
1:26:34 – Identity and AI alignment
1:37:50 – Nationalism and AI alignment
1:42:09 – Cryonics, species divergence, immortality, uploads, and merging
1:50:28 – Future scenarios from Life 3.0
1:58:35 – The role of identity in the AI itself

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

The transcript has been edited for style and clarity

Lucas Perry: Welcome to the AI Alignment Podcast. I'm Lucas Perry. Today we have a conversation with Andres Gomez Emilsson and David Pearce on identity. This episode is about identity from the ontological, epistemological and phenomenological perspectives. In less jargony language, we discuss identity from the fundamental perspective of what actually exists, of how identity arises given functional world models and self models in biological organisms, and of the subjective or qualitative experience of self or identity as a feature of consciousness. Given these angles on identity, we discuss what identity is, the formation of identity in biological life via evolution, why identity is important to explore along with its ethical and game theoretic implications, and we directly discuss its relevance to the AI alignment problem and the project of creating beneficial AI.

I think the question of “How is this relevant to AI Alignment?” is useful to explore here in the intro. The AI Alignment problem can be construed in the technical limited sense of the question of “how to program AI systems to understand and be aligned with human values, preferences, goals, ethics, and objectives.” In a limited sense this is strictly a technical problem that supervenes upon research in machine learning, AI, computer science, psychology, neuroscience, philosophy, etc. I like to approach the problem of aligning AI systems from a broader and more generalist perspective. In the way that I think about the problem, a broader view of AI alignment takes into account the problems of AI governance, philosophy, AI ethics, and reflects deeply on the context in which the technical side of the problem will be taking place, the motivations of humanity and the human beings engaged in the AI alignment process, the ingredients required for success, and other civilization level questions on our way hopefully to beneficial superintelligence. 

It is from both of these perspectives that I feel exploring the question of identity is important. AI researchers have their own identities and those identities factor into their lived experience of the world, their motivations, and their ethics. In fact, the same is of course true of policy makers and anyone in positions of power to influence the alignment process, so being aware of commonly held identity models and views is important for understanding their consequences and functions in the world. From a macroscopic perspective, identity has evolved over the past 4.5 billion years on earth and surely will continue to do so in AI systems themselves and in the humans who hope to wield that power. Some humans may wish to merge, others to pass away or simply die, and others to be upgraded or uploaded in some way. Questions of identity are also crucial to this process of relating to one another and to AI systems in a rapidly evolving world where what it means to be human is quickly changing, where copies of digital minds or AIs can be made trivially, and where the boundary between what we conventionally call the self and world begins to dissolve and break down in new ways, demanding new understandings of ourselves and of identity in particular. I also want to highlight an important thought from the podcast: any actions we wish to take with regards to improving or changing understandings or lived experience of identity must be sociologically relevant, or such interventions simply risk being irrelevant. This means understanding what it is reasonable for human beings to be able to update their minds with and accept over certain periods of time, and also the game theoretic implications of certain views of identity and their functional usefulness. This conversation is thus an attempt to broaden the conversation on these issues outside of what is normally discussed and to flag this area as something worthy of consideration.

For those not familiar with David Pearce or Andrés Gómez Emilsson: David is a co-founder of the World Transhumanist Association, rebranded as Humanity+, and is a prominent figure within the transhumanism movement in general. You might know him from his work on The Hedonistic Imperative, a book which explores our moral obligation to work towards the abolition of suffering in all sentient life through technological intervention. Andrés is a consciousness researcher at the Qualia Research Institute and is also the co-founder and president of the Stanford Transhumanist Association. He has a Master's in Computational Psychology from Stanford.

The Future of Life Institute is a non-profit and this podcast is funded and supported by listeners like you. So if you find what we do on this podcast to be important and beneficial, please consider supporting the podcast by donating at futureoflife.org/donate

 If you’d like to be a regular supporter, please consider a monthly subscription donation to make sure we can continue our efforts into the future. 

These contributions make it possible for us to bring you conversations like these and to develop the podcast further. You can also follow us on your preferred listening platform by searching for us directly or following the links on the page for this podcast found in the description. 

And with that, here is my conversation with Andres Gomez Emilsson and David Pearce 

I just want to start off with some quotes here that I think will be useful. The last podcast that we had was with Yuval Noah Harari and Max Tegmark. One of the points that Yuval really emphasized was the importance of self understanding, of questions like: who am I? What am I in the age of technology? Yuval said, "Get to know yourself better. It's maybe the most important thing in life. We haven't really progressed much in the last thousands of years, and the reason is that yes, we keep getting this advice, but we don't really want to do it." He goes on to say that, "especially as technology will give us all, at least some of us, more and more power, the temptations of naive utopias are going to be more and more irresistible, and I think the really most powerful check on these naive utopias is really getting to know yourself better."

In search of getting to know ourselves better, I want to explore this question of identity with both of you. To start off, what is identity?

David Pearce: One problem is that we have more than one conception of identity. There is the straightforward, logical sense that philosophers call the indiscernibility of identicals, namely that if A equals B, then anything true of A is true of B. In one sense, that’s trivially true, but when it comes to something like personal identity, it just doesn’t hold water at all. You are a different person from your namesake who went to bed last night – and it’s very easy carelessly to shift between these two different senses of identity.

Or one might speak of the United States. In what sense is the United States the same nation in 2020 as it was in 1975? It’s interest-relative.

Andrés Gómez Emilsson: Yeah, and to go a little bit deeper on that, I would make the distinction, as David made it, between ontological identity, what is fundamentally, actually going on in the physical world, in instantiated reality; then there's conventional identity, the idea of continuing to exist from one moment to another as a human, and also for countries and so on.

Then there's also phenomenological identity, which is our intuitive, common sense view of what we are and, basically, what the conditions are that will allow us to continue to exist. We can go into more detail, but the phenomenological notion of identity is an incredible can of worms, because there are so many different ways of experiencing identity, and all of them have their own interesting idiosyncrasies. Most people tend to confuse the two: they tend to confuse ontological and phenomenological identity. Just as a simple example that I'm sure we will revisit later, when a person has, let's say, an ego dissolution or a mystical experience and they feel that they merged with the rest of the cosmos, they come out and say, "Oh, we're all one consciousness." That tends to be interpreted as some kind of grasp of an ontological reality, whereas we could argue in a sense that that was just a shift in phenomenological identity: your sense of self got transformed, not necessarily that you're actually directly merging with the cosmos in a literal sense. Although, of course, it might be very indicative of how conventional our sense of identity is if it can be modified so drastically in other states of consciousness.

Lucas Perry: Right, and let’s just start with the ontological sense. How does one understand or think about identity from the ontological side?

Andrés Gómez Emilsson: In order to reason about this, you need a shared frame of reference for what actually exists, including the nature of time and space, and memory. In the common sense view of time, called presentism, there's basically just the present moment; the past is a convenient construction and the future is a fiction useful in a practical sense, but neither literally exists. In that view, the notion that A equals B, in the sense that you could modify what happens to A and that will automatically also modify what happens to B, kind of makes sense, and you can perhaps think of identity as moving over time along with everything else.

On the other hand, if you have an eternalist point of view, where you basically interpret the whole of space time as just being there, each event at its own coordinates in the multiverse, that provides a different notion of ontological identity, because in a sense each moment of experience is its own separate piece of reality.

In addition, you also need to consider the question of connectivity: in what way are different parts of reality connected to each other? In a conventional sense, as you go from one second to the next, you continue to be connected to yourself in an unbroken stream of consciousness, and this has actually led some philosophers to hypothesize that the proper unit of identity is from the moment you wake up to the moment you go to sleep, because that's an unbroken chain or stream of consciousness.

From a scientific and philosophically rigorous point of view, it's actually difficult to make the case that our stream of consciousness is truly unbroken. Definitely, if you have an eternalist point of view on experience and on the nature of time, what you will instead see is that from the moment you wake up to the moment you go to sleep, there has actually been an extraordinarily large number of discrete snapshots of moments of experience. In that sense, each of those individual moments of experience would be its own ontologically separate individual.

Now, one of the things that becomes complicated with an eternalist account of time and identity is that you cannot actually change it. There's nothing you can actually do to A, so the reasoning that if you do anything to A, and A equals B, then the same will happen to B, doesn't even apply here, because everything is already there. You cannot actually modify A any more than you can modify the number five.

David Pearce: Yes, it’s a rather depressing perspective in many ways, the eternalist view. If one internalizes it too much, it can lead to a sense of fatalism and despair. A lot of the time it’s probably actually best to think of the future as open.

Lucas Perry: This helps to clarify some of the ontological part of identity. Now, you mentioned this phenomenological aspect and I want to say also the epistemological aspect of identity. Could you unpack those two? And maybe clarify this distinction for me if you wouldn’t parse it this way? I guess I would say that the epistemological one is the models that human beings have about the world and about ourselves. It includes how the world is populated with a lot of different objects that have identity like humans and planets and galaxies. Then we have our self model, which is the model of our body and our space in social groups and who we think we are.

Then there’s the phenomenological identity, which is that subjective qualitative experience of self or the ego in relation to experience. Or where there’s an identification with attention and experience. Could you unpack these two later senses?

Andrés Gómez Emilsson: Yeah, for sure. I mean, in a sense you could have an implicit self model that doesn't actually become part of your consciousness, or that isn't necessarily something you're explicitly rendering. This goes on all the time. You've definitely, I'm sure, had the experience of riding a bicycle, and after a little while you can almost do it without thinking. Of course, you're engaging with the process in a very embodied fashion, but you're not cognizing very much about it. You're definitely not representing, let's say, your body state, or representing exactly what is going on, in a cognitive way. It's all implicit in the way in which you feel. I would say that paints a bit of a distinction between a self model, which is ultimately functional (it has to do with whether you are processing the information required to solve tasks that involve modeling what you are in your environment), and the felt sense of: are you a person? What are you? How are you located? And so on.

The first one is what most of robotics and machine learning with an embodied component are really trying to get at. You just need the appropriate information processing in order to solve the task. They're not very concerned about whether it feels like anything, or whether it feels like a particular entity or self to be that particular algorithm.

Whereas when we're talking about the phenomenological sense of identity, that's very explicitly about what it feels like, and there are all kinds of interesting ways in which a healthy, so to speak, sense of identity can break down. There are many failure modes, we can put it that way.

One might argue, and I suspect David Pearce, for example, might say this, that our self models, or our implicit sense of self, because of the way it was brought up through Darwinian selection pressures, is already extremely ill in some sense, at least from the point of view of it actually telling us something true and making us do something ethical. It has all sorts of problems, but it is definitely functional. You can anticipate being a person tomorrow and plan accordingly. You leave messages to yourself by encoding them in memory, and yeah, this is a convenient sense of conventional identity.

It's very natural for most people. I can briefly mention a couple of ways in which it can break down. One of them is depersonalization. It's a particular psychological disorder where one stops feeling like a person, and it might have something to do with not being able to synchronize with your bodily feelings, such that you don't actually feel embodied. You may feel like a disembodied entity, or just a witness experiencing a human experience, but not actually being that person.

Then you also have things such as an empathogen induced sense of shared identity with others. If you take MDMA, you may feel that all of humanity is deeply connected, or that we're all part of the same essence of humanity, in a very positive sense of identity, but perhaps not in an evolutionarily adaptive sense. Finally, there are people with multiple personality disorder, who in a sense have a very unstable sense of who they are, and sometimes it can be so extreme that there are epistemological blockages from one sense of self to another.

David Pearce: As neuroscientist Donald Hoffman likes to say, fitness trumps truth. Each of us runs a world-simulation. But it’s not an impartial, accurate, faithful world-simulation. I am at the center of a world-simulation, my egocentric world, the hub of reality that follows me around. And of course there are billions upon billions of other analogous examples too. This is genetically extremely fitness-enhancing. But it’s systematically misleading. In that sense, I think Darwinian life is malware.

Lucas Perry: Wrapping up here on these different aspects of identity, I just want to make sure that I have all of them here. Would you say that those are all of the aspects?

David Pearce: One can add the distinction between type- and token- identity. In principle, it’s possible to create from scratch a molecular duplicate of you. Is that person you? It’s type-identical, but it’s not token-identical.

Lucas Perry: Oh, right. I think I’ve heard this used in some other places as numerical distinction versus qualitative distinction. Is that right?

David Pearce: Yeah, that’s the same distinction.

Lucas Perry: Unpacking here more about what identity is. Let’s talk about it purely as something that the world has produced. What can we say about the evolution of identity in biological life? What is the efficacy of certain identity models in Darwinian evolution?

Andrés Gómez Emilsson: I would say that self models most likely have existed, potentially, since pretty early on in the evolutionary timeline. You may argue that in some sense even bacteria have some kind of self model. But again, a self model is really just functional. The bacterium does need to know, at least implicitly, its size, in order to be able to navigate its environment, follow chemical gradients, and so on, and not step on itself. That's not the same, again, as a phenomenal sense of identity, and that one, I would strongly suspect, came much later, perhaps with the advent of the first primitive nervous systems. That would be only if actually running that phenomenal model is giving you some kind of fitness advantage.

One of the things that you will encounter with David and me is that we think phenomenally bound experiences have a lot of computational properties, and in a sense the reason why we're conscious has to do with the fact that unified moments of experience are doing computationally useful legwork. It is when you merge implicit self models, in just the functional sense, with the computational benefits of actually running a conscious system that, perhaps for the first time in history, you actually get a phenomenal self model.

I would suspect that in the Cambrian explosion this was probably already going on to some extent. All of the interesting evolutionary oddities that appeared in the Cambrian explosion probably had some kind of rudimentary sense of self. I would be skeptical that this is going on in, for example, plants. One of the key reasons is that running a real time world simulation in a conscious framework is very calorically expensive.

David Pearce: Yes, it’s a scandal. What, evolutionarily speaking, is consciousness “for”? What could a hypothetical p-zombie not do? The perspective that Andrés and I are articulating is that essentially what makes biological minds special is phenomenal binding – the capacity to run real-time, phenomenally-bound world-simulations, i.e. not just be 86 billion discrete, membrane-bound pixels of experience. Somehow, we each generate an entire cross-modally matched, real-time world-simulation, made up of individual perceptual objects, somehow bound into a unitary self. The unity of perception is extraordinarily computationally powerful and adaptive. Simply saying that it’s extremely fitness-enhancing doesn’t explain it, because something like telepathy would be extremely fitness-enhancing too, but it’s physically impossible.

Yes, how biological minds manage to run phenomenally-bound world-simulations is unknown: they would seem to be classically impossible. One way to appreciate just how advantageous is (non-psychotic) phenomenal binding is to look at syndromes where binding even partially breaks down: simultanagnosia, where one can see only one object at once, or motion blindness (akinetopsia), where you can’t actually see moving objects, or florid schizophrenia. Just imagine those syndromes combined. Why aren’t we just micro-experiential zombies?

Lucas Perry: Do we have any interesting points here to look at in the evolutionary tree for where identity is substantially different from ape consciousness? If we look back at human evolution, it seems that it’s given the apes and particularly our species a pretty strong sense of self, and that gives rise to much of our ape socialization and politics. I’m wondering if there was anything else like maybe insects or other creatures that have gone in a different direction? Also if you guys might be able to just speak a little bit on the formation of ape identity.

Andrés Gómez Emilsson: Definitely, I think the perspective of the selfish gene is pretty illuminating here. Nominally, our sense of identity is the sense of one person, one mind. In practice, however, if you think of identity in terms of that which you want to defend, or that which you consider worth preserving, you will see that people's sense of identity also extends to their family members, and of course, with the neocortex and the ability to create more complex associations, you get crazy things like a sense of identity based on race or country of origin or other constructs like that, building on top of the sense that, hey, the people who are familiar to you feel more like you.

It's genetically adaptive to have that, and from the point of view of the selfish gene, genes that could recognize themselves in others, and favor the existence of others that share the same genes, are more likely to reproduce. That's called inclusive fitness in biology: you're not just trying to survive yourself or make copies of yourself, you're also trying to help those that are very similar to you do the same. Almost certainly, it's a huge aspect of how we perceive the world. Just anecdotally, from a number of trip reports, there's this interesting thread of how, with some chemicals like MDMA and 2C-B (for those who don't know, these are empathogenic psychedelics), people get the strange sense that people they've never met before in their life are as close to them as a cousin, or maybe a half brother or half sister. It's a very comfortable and quite beautiful feeling. You could imagine that nature was very selective about who it gives that feeling towards, in order to maximize inclusive fitness.

All of this builds up to the overall prediction I would make that, the sense of identity of ants and other extremely social insects might be very different. The reason being that they are genetically incentivized to basically treat each other as themselves. Most ants themselves don’t produce any offspring. They are genetically sisters and all of their genetic incentives are into basically helping the queen pass on the genes into other colonies. In that sense, I would imagine an ant probably sees other ants of the same colony pretty much as themselves.
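(Editorial aside, not part of the recorded conversation: the inclusive fitness idea Andrés mentions is usually formalized as Hamilton's rule, r * b > c. The numbers in the sketch below are purely illustrative; the roughly 0.75 relatedness between ant sisters comes from haplodiploidy.)

```python
# Hamilton's rule: an altruistic act toward a relative is favored by selection
# when r * b > c, where r is genetic relatedness, b is the benefit to the
# recipient, and c is the cost to the actor. Benefit/cost values are illustrative.

def altruism_favored(r: float, benefit: float, cost: float) -> bool:
    """Return True if Hamilton's rule (r * b > c) says the act is selected for."""
    return r * benefit > cost

print(altruism_favored(r=0.5, benefit=3.0, cost=1.0))    # full sibling: True
print(altruism_favored(r=0.125, benefit=3.0, cost=1.0))  # cousin: False
print(altruism_favored(r=0.75, benefit=3.0, cost=1.0))   # ant sisters (haplodiploidy): True
```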

David Pearce: Yes. There was an extraordinary finding a few years ago: members of one species of social ant actually passed the mirror test – which has traditionally been regarded as the gold standard for the concept of a self. It was shocking enough, to many people, when a small fish was shown to be capable of mirror self-recognition. If some ants too can pass the mirror test, it suggests some form of meta-cognition, self-recognition, that is extraordinarily ancient.

What is it that distinguishes humans from nonhuman animals? I suspect the distinction relates to something that is still physically unexplained: how is it that a massively parallel brain gives rise to serial, logico-linguistic thought? It’s unexplained, but I would say this serial stream is what distinguishes us, most of all – not possession of a self-concept.

Lucas Perry: Is there such a thing as a right answer to questions of identity? Or is it fundamentally just something that’s functional? Or is it ultimately arbitrary?

Andrés Gómez Emilsson: I think there is a right answer. From a functional perspective, there are just so many different ways of thinking about it. As I was describing with ants and humans, their senses of identity are probably pretty different, but both are useful for passing on the genes, and in that sense they're all equally valid. Imagine in the future some kind of swarm mind that also has its own distinct, functionally adaptive sense of identity; in that sense, there's no ground truth for what identity should be from the point of view of functionality. It really just depends on what the replication unit is.

Ontologically, though, I think there's a case to be made that either open or empty individualism is true. Maybe it would be good to define those terms first.

Lucas Perry: Before we do that. Your answer then is just that, yes, you suspect that also ontologically in terms of fundamental physics, there are answers to questions of identity? Identity itself isn’t a confused category?

Andrés Gómez Emilsson: Yeah, I don’t think it’s a leaky reification as they say.

Lucas Perry: From the phenomenological sense, is the self an illusion or not? Is the self a valid category? Is your view also on identity that there is a right answer there?

Andrés Gómez Emilsson: From the phenomenological point of view? No, I would consider it a parameter, mostly. Just something that you can vary, and there’s trade offs or different experiences of identity.

Lucas Perry: Okay. How about you David?

David Pearce: I think ultimately, yes, there are right answers. In practice, life would be unlivable if we didn’t maintain these fictions. These fictions are (in one sense) deeply immoral. We punish someone for a deed that their namesake performed, let’s say 10, 15, 20 years ago. America recently executed a murderer for a crime that was done 20 years ago. Now quite aside from issues of freedom and responsibility and so on, this is just scapegoating.

Lucas Perry: David, do you feel that in the ontological sense there are right or wrong answers to questions of identity? And in the phenomenological sense? And in the functional sense?

David Pearce: Yes.

Lucas Perry: Okay, so then I guess you disagree with Andres about the phenomenological sense?

David Pearce: I’m not sure, Andrés and I agree about most things. Are we disagreeing Andrés?

Andrés Gómez Emilsson: I’m not sure. I mean, what I said about the phenomenal aspect of identity was that I think of it as a parameter of our world simulation. In that sense, there’s no true phenomenological sense of identity. They’re all useful for different things. The reason I would say this too is, you can assume that something like each snapshot of experience, is its own separate identity. I’m not even sure you can accurately represent that in a moment of experience itself. This is itself a huge can of worms that opens up the problem of referents. Can we even actually refer to something from our own point of view? My intuition here is that, whatever sense of identity you have at a phenomenal level, I think of it as a parameter of the world simulation and I don’t think it can be an accurate representation of something true. It’s just going to be a feeling, so to speak.

David Pearce: I could endorse that. We fundamentally misperceive each other. The Hogan sisters, conjoined craniopagus twins, know something that the rest of us don’t. The Hogan sisters share a thalamic bridge, which enables them partially, to a limited extent, to “mind-meld”. The rest of us see other people essentially as objects that have feelings. When one thinks of one’s own ignorance, perhaps one laments one’s failures as a mathematician or a physicist or whatever; but an absolutely fundamental form of ignorance that we take for granted is we (mis)conceive other people and nonhuman animals as essentially objects with feelings, whereas individually, we ourselves have first-person experience. Whether it’s going to be possible to overcome this limitation in the future I don’t know. It’s going to be immensely technically challenging – building something like reversible thalamic bridges. A lot depends on one’s theory of phenomenal binding. But let’s imagine a future civilization in which partial “mind-melding” is routine. I think it will lead to a revolution not just in morality, but in decision-theoretic rationality too – one will take into account the desires, the interests, and the preferences of what will seem like different aspects of oneself.

Lucas Perry: Why does identity matter morally? I think you guys have made a good case about how it’s important functionally, historically in terms of biological evolution, and then in terms of like society and culture identity is clearly extremely important for human social relations, for navigating social hierarchies and understanding one’s position of having a concept of self and identity over time, but why does it matter morally here?

Andrés Gómez Emilsson: One interesting way to look at it is that a lot of social movements, a lot of ideologies that have existed in human history, are in a sense attempts to hack people's sense of identity, or to make use of it for the purpose of reproducing the ideology or the social movement itself.

To a large extent, a lot of the things that you see in therapy have to do with expanding your sense of identity to include your future self as well, which is something that a lot of people struggle with when it comes to impulsive decisions or rationality. There's this interesting point of view of how a two year old or a three year old hasn't yet internalized the fact that they will wake up tomorrow and that the consequences of what they did today will linger on in the following days. It's kind of a revelation when a kid finally internalizes the fact that, oh my gosh, I will continue to exist for the rest of my life. There's going to be a point where I'm going to be 40 years old, and also a time where I'm 80 years old, and all of those are real, and I should plan ahead for them.

Ultimately, I do think that advocating for a very inclusive sense of identity, where the locus of identity is consciousness itself, might have tremendous moral and ethical implications.

David Pearce: We want an inclusive sense of “us” that embraces all sentient beings.  This is extremely ambitious, but I think that should be the long-term goal.

Lucas Perry: Right, there’s a spectrum here and where you fall on the spectrum will lead to different functions and behaviors, solipsism or extreme egoism on one end, pure selflessness or ego death or pure altruism on the other end. Perhaps there are other degrees and axes on which you can move, but the point is it leads to radically different identifications and relations with other sentient beings and with other instantiations of consciousness.

David Pearce: Would our conception of death be different if it was a convention to give someone a different name when they woke up each morning? Because after all, waking up is akin to reincarnation. Why is it that when one is drifting asleep each night, one isn’t afraid of death? It’s because (in some sense) one believes one is going to be reincarnated in the morning.

Lucas Perry: I like that. Okay, I want to return to this question after we hit on the different views of identity, to really unpack the different ethical implications more; I wanted to sneak that in here for a bit of context. Pivoting back to this sort of historical and contextual analysis of identity: we talked about biological evolution as instantiating these things. How do you guys view religion as codifying an egoist view of identity? Religion codifies the idea of the eternal soul, and the soul, I think, maps very strongly onto the phenomenological self. It makes that the thing that is immutable or undying, or which transcends this realm.

I'm obviously talking specifically here about the Abrahamic religions, but then also in Buddhism there is the view that the self is an illusion, or what David referred to as empty individualism, which we'll get into, where identification with the phenomenological self is fundamentally a misapprehension of reality, a confusion that leads to attachment and suffering and fear of death. Do you guys have comments here about religion as codifying views on identity?

Andrés Gómez Emilsson: I think it's definitely really interesting that there are different views of identity in religion. How I grew up, I always assumed religion was about souls and getting into heaven. As it turns out, I just needed to learn about Eastern religions and cults, which also sometimes have different views of personal identity. That was definitely a revelation to me. I would actually say that I started questioning the common-sense notion of personal identity before I learned about Eastern religions, and I was pretty surprised and very happy when I found out that, let's say, Hinduism actually has a kind of universal-consciousness take on identity, a socially sanctioned way of looking at the world that has a very expansive sense of identity. Buddhism is also pretty interesting because, as far as I understand it, they consider pretty much any view of identity to be a cause of suffering; it fundamentally has to do with a sense of craving, either for existence or for non-existence, which they also consider a problem. A Buddhist would generally say that even something like universal consciousness, believing that we're all fundamentally Krishna incarnating in many different ways, will itself also be a source of suffering to some extent, because you may crave further existence, which may not be very good from their point of view. It makes me optimistic that there are other types of religions with other views of identity.

David Pearce: Yes. Here is one of my earliest memories. My mother belonged to The Order of the Cross – a very obscure, small, vaguely Christian denomination, non-sexist, who worship God the Father-Mother. And I recall being told, aged five, that I could be born again. It might be as a little boy, but it might be as a little girl – because gender didn’t matter. And I was absolutely appalled at this – at the age of five or so – because in some sense girls were, and I couldn’t actually express this, defective.

And religious conceptions of identity vary immensely. One thinks of something like Original Sin in Christianity. I could now make a lot of superficial comments about religion. But one would need to explore in detail the different religious traditions and their different conceptions of identity.

Lucas Perry: What are the different views on identity? If you can say anything, why don’t you hit on the ontological sense and the phenomenological sense? Or if we just want to stick to the phenomenological sense then we can.

Andrés Gómez Emilsson: I mean, are you talking about an open, empty, closed?

Lucas Perry: Yeah. So that would be the phenomenological sense, yeah.

Andrés Gómez Emilsson: No, actually I would claim those are attempts at getting at the ontological sense.

Lucas Perry: Okay.

Andrés Gómez Emilsson: If you truly have a soul ontology, something that a very large percentage of the human population implicitly has, that would be what this view calls a closed individualist perspective. Common sense: you start existing when you're born, you stop existing when you die, you're just a stream of consciousness. Even more strongly, perhaps, you're a soul that has experiences, but experiences are maybe not fundamental to what you are.

Then there is the more Buddhist and definitely more generally scientifically-minded view, which is empty individualism, which is that you only exist as a moment of experience, and from one moment to the next that you are a completely different entity. And then, finally, there is open individualism, which is like Hinduism claiming that we are all one consciousness fundamentally.

There is an ontological way of thinking of these notions of identity. It's possible that a lot of people think of them just phenomenologically, or they may just think there's no further fact beyond the phenomenal. In that case, something like closed individualism is, for most people most of the time, self-evidently true, because you are moving in time and you can notice that you continue to be yourself from one moment to the next. Then, of course, what would it feel like if you weren't the same person from one moment to the next? Well, each of those moments might be completely under the illusion that it is a continuous self.

For most things in philosophy and science, if you want to use something as evidence, it has to agree with one theory and disagree with another one. And the sense of continuity from one second to the next seems to be compatible with all three views. So it’s not itself much evidence either way.

States of depersonalization are probably much more akin to empty individualism from a phenomenological point of view, and then you have ego death and definitely some experiences of the psychedelic variety, especially high doses of psychedelics tend to produce very strong feelings of open individualism. That often comes in the form of noticing that your conventional sense of self is very buggy and doesn’t seem to track anything real, but then realizing that you can identify with awareness itself. And if you do that, then in some sense automatically, you realize that you are every other experience out there, since the fundamental ingredient of a witness or awareness is shared with every conscious experience.

Lucas Perry: These views on identity are confusing to me because agents haven’t existed for most of the universe and I don’t know why we need to privilege agents in our ideas of identity. They seem to me just emergent patterns of a big, ancient, old, physical universe process that’s unfolding. It’s confusing to me that just because there are complex self- and world-modeling patterns in the world, that we need to privilege them with some kind of shared identity across themselves or across the world. Do you see what I mean here?

Andrés Gómez Emilsson: Oh, yeah, yeah, definitely. I’m not agent-centric. And I mean, in a sense also, all of these other exotic feelings of identity often also come with states of low agency. You actually don’t feel that you have much of a choice in what you could do. I mean, definitely depersonalization, for example, often comes with a sense of inability to make choices, that actually it’s not you who’s making the choice, they’re just unfolding and happening. Of course, in some meditative traditions that’s considered a path to awakening, but in practice for a lot of people, that’s a very unpleasant type of experience.

It sounds like it might be privileging agents; I would say that’s not the case. If you zoom out and you see the bigger worldview, it includes basically this concept, David calls it non-materialist physicalist idealism, where the laws of physics describe the behavior of the universe, but that which is behaving according to the laws of physics is qualia, is consciousness itself.

I take very seriously the idea that a given molecule or a particular atom contains moments of experience; it's just perhaps very fleeting and very dim, or just not very relevant in many ways, but I do think it's there. And a sense of identity too, maybe not in a phenomenal sense, since I don't think an atom actually feels like an agent over time, but the continuity of its experience and the boundaries of its experience would have strong bearings on the ontological sense of identity.

There's obviously a huge jump between talking about the identity of atoms and talking about the identity of a moment of experience, which presumably is an emergent effect of 100 billion neurons, themselves made of so many different atoms. Crazy as it may be, it is both David Pearce's view and my view that each moment of experience does stand as an ontological unit. It's just an ontological unit of a kind that we don't usually see in physics, but it is both physical and ontologically closed.

Lucas Perry: Maybe you could unpack this. You know, mereological nihilism; maybe I privilege that view because I'm just trying to be as simple as possible and not build up too many concepts on top of each other.

Andrés Gómez Emilsson: Mereological nihilism basically says that there are no entities that have parts. Everything is part-less. All that exists in reality is individual monads, so to speak, things that are fundamentally self-existing. For that, if you have let’s say monad A and monad B, just put together side by side, that doesn’t entail that now there is a monad AB that mixes the two.

Lucas Perry: Or if you put a bunch of fundamental quarks together that it makes something called an atom. You would just say that it’s quarks arranged atom-wise. There’s the structure and the information there, but it’s just made of the monads.

Andrés Gómez Emilsson: Right. And the atom is a wonderful case, basically the same as a molecule, where I would say mereological nihilism, with fundamental particles as the only truly existing beings, does seem to be false when you look at how, for example, molecules behave. Take the building-block account of how chemical bonds happen: these Lewis diagrams where you can have a single bond or a double bond, you have the octet rule, and you're trying to build chains of atoms strung together. All that matters for those diagrams is what each atom is locally connected to.

However, if you just use these to predict what molecules are possible, how they behave, and what their properties are, you will see that there are a lot of artifacts that are empirically disproven. Over the years, chemistry has become more and more sophisticated, and eventually it has come to the realization that you need to take into account the entire molecule at once in order to understand what its "dynamically stable" configuration is, which involves all of the electrons and all of the nuclei simultaneously interlocking into a particular pattern that self-replicates.

Lucas Perry: And it has new properties over and above the parts.

Andrés Gómez Emilsson: Exactly.

Lucas Perry: That doesn’t make any sense to me or my intuitions, so maybe my intuitions are just really wrong. Where does the new property or causality come from? Because it essentially has causal efficacy over and above the parts.

Andrés Gómez Emilsson: Yeah, it's tremendously confusing. I mean, I'm currently writing an article about how this sense of topological segmentation can, in a sense, account for this effect of what we might call weak downward causation: you get a molecule, and now the molecule will have effects in the world, and you need to take into account all of the electrons and all of the nuclei simultaneously, as a unit, in order to actually know what that effect is going to be. You cannot just take each of the components separately; that's what we could call weak downward causation. It's not that you're fundamentally introducing a new law of physics. Everything is still predicted by the Schrödinger equation; it's still governing the behavior of the entire molecule. It's just that the appropriate unit of analysis is not the electron but the entire molecule.
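To make the point concrete, here is the sort of equation Andrés is gesturing at (a standard non-relativistic molecular Hamiltonian, added as a sketch for readers; nothing beyond the Schrödinger equation itself is being claimed). It sums the kinetic energies of all electrons i and nuclei A together with every pairwise Coulomb interaction, and the whole molecule evolves under it as one unit:

\hat{H} = -\sum_i \frac{\hbar^2}{2 m_e} \nabla_i^2 - \sum_A \frac{\hbar^2}{2 M_A} \nabla_A^2 + \frac{e^2}{4\pi\varepsilon_0}\left( \sum_{i<j} \frac{1}{r_{ij}} - \sum_{i,A} \frac{Z_A}{r_{iA}} + \sum_{A<B} \frac{Z_A Z_B}{R_{AB}} \right), \qquad i\hbar\,\frac{\partial \Psi}{\partial t} = \hat{H}\,\Psi

Because the Coulomb terms couple every electron to every nucleus, the stationary states of \hat{H} belong to the molecule as a whole rather than to any single particle, which is the "weak downward causation" being described: no new law, just a larger unit of analysis.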

Now, if you pair this together with a sense of identity that comes from topology, then I think there might be a good case for why moments of experience are discrete entities. The analogy here is with topological segmentation; hopefully I'm not going to lose too many listeners, but we can make an analogy with, for example, a balloon. Start out imagining that you are the surface of the balloon, and then you take the balloon by two ends and twist them in opposite directions. Eventually, at the middle point, you get what's called a pinch point: the balloon collapses in the center and you end up having these two smooth surfaces connected by a pinch point. Each of those twists creates a new topological segment, or in a sense is segmenting out the balloon. You could basically interpret things such as molecules as new topological segmentations of what are fundamentally the quantum fields that implement them.

Usually, the segmentations may look like an electron or a proton, but if you assemble them together just right, you can get them to essentially melt with each other and become one topologically continuous unit. The nice thing about this account is that you get everything that you want. You explain, on the one hand, why identity would actually have causal implications, and it’s this weak downward causation effect, at the same time as being able to explain: how is it possible that the universe can break down into many different entities? Well, the answer is the way in which it is breaking down is through topological segmentations. You end up having these self-contained regions of the wave function that are discommunicated from the rest of it, and each of those might be a different subject of experience.

David Pearce: It’s very much an open question: the intrinsic nature of the physical. Commonly, materialism and physicalism are conflated. But the point of view that Andrés and I take seriously, non-materialist physicalism, is actually a form of idealism. Recently, philosopher Phil Goff, who used to be a skeptic-critic of non-materialist physicalism because of the binding problem, published a book defending it, “Galileo’s Error”.

Again, it’s very much an open question. We’re making some key background assumptions here. A critical background assumption is physicalism, and that quantum mechanics is complete:  there is no “element of reality” that is missing from the equations (or possibly the fundamental equation) of physics. But physics itself seems to be silent on the intrinsic nature of the physical. What is the intrinsic nature of a quantum field? Intuitively, it’s a field of insentience; but this isn’t a scientific discovery, it’s a (very strong) philosophical intuition.

And you can couple this with the fact that the only part of the world to which one has direct access, i.e., one's own conscious mind (though this is controversial), is consciousness, sentience. The non-materialist physicalist conjectures that we are typical, in one sense – inasmuch as the fields of your central nervous system aren't ontologically different from the fields of the rest of the world. What makes sentient beings special is the way those fields are organized into unified subjects of experience, egocentric world-simulations.

Now, I’m personally fairly confident that we are, individually, minds running egocentric world-simulations: direct realism is false. I’m not at all confident – though I explore the idea – that experience is the intrinsic nature of the physical, the “stuff” of the world. This is a tradition that goes back via Russell, ultimately, to Schopenhauer. Schopenhauer essentially turns Kant on his head.

Kant famously said that all we will ever know is phenomenology, appearances; we will never, never know the intrinsic, noumenal nature of the world. But Schopenhauer argues that essentially we do actually know one tiny piece of the noumenal essence of the world, the essence of the physical, and it’s experiential. So yes, tentatively, at any rate, Andrés and I would defend non-materialist or idealistic physicalism. The actual term “non-materialist physicalism” is due to the late Grover Maxwell.

Lucas Perry: Sorry, could you just define that real quick? I think we haven’t.

David Pearce: Physicalism is the idea that no “element of reality” is missing from the equations of physics, presumably (some relativistic generalization of) the universal Schrödinger equation.
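In symbols, the fundamental equation David has in mind is usually written as the time-dependent Schrödinger equation applied to the universal wave function (setting aside its relativistic generalization):

i\hbar \, \frac{d}{dt}\, |\Psi(t)\rangle = \hat{H}\, |\Psi(t)\rangle

Non-materialist physicalism, as the term is used in this conversation, keeps that formalism untouched and only reinterprets what the fields appearing in \hat{H} are intrinsically.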

Lucas Perry: It’s a kind of naturalism, too.

David Pearce: Oh, yes. It is naturalism. There are some forms of idealism and panpsychism that are non-naturalistic, but this view is uncompromisingly monist. Non-materialist physicalism isn’t claiming that a primitive experience is attached in some way to fundamental physical properties. The idea is that the actual intrinsic nature, the essence of the physical, is experiential.

Stephen Hawking, for instance, was a wave function monist, a doctrinaire materialist, but he famously asked what it is that breathes fire into the equations and makes a universe for them to describe. Now, intuitively, of course, one assumes that the fire in the equations, Kant's noumenal essence of the world, is non-experiential. But if so, we have the hard problem, we have the binding problem, we have the problem of causal efficacy, a great mess of problems.

But if, and it’s obviously a huge if, the actual intrinsic nature of the physical is experiential, then we have a theory of reality that is empirically adequate, that has tremendous explanatory and predictive power. It’s mind-bogglingly implausible, at least to those of us steeped in the conceptual framework of materialism. But yes, by transposing the entire mathematical apparatus of modern physics, quantum field theory or its generalization, onto an idealist ontology, one actually has a complete account of reality that explains the technological successes of science, its predictive power, and doesn’t give rise to such insoluble mysteries as the hard problem.

Lucas Perry: I think all of this is very clarifying. There are also background metaphysical views, which people may or may not disagree about, which are also important for identity. I also want to be careful to define some terms, in case some listeners don't know what they mean. I think you hit on four different problems which all have to do with consciousness. The hard problem is why there is something it is like to be certain kinds of computation, or why there is consciousness correlated or associated with certain physical processes at all.

Then you also mentioned the binding problem. Is the binding problem about why there is a unitary experience that is, as you said earlier, modally connected?

David Pearce: Yes, and if one takes the standard view from neuroscience that your brain consists of 86-billion-odd discrete, decohered, membrane-bound nerve cells, then phenomenal binding, whether local or global, ought to be impossible. So yeah, this is the binding problem, this (partial) structural mismatch. If your brain is scanned when you’re seeing a particular perceptual object, neuroscanning can apparently pick out distributed feature-processors, edge-detectors, motion-detectors, color-mediating neurons (etc). And yet there isn’t the perfect structural match that must exist if physicalism is true. And David Chalmers – because of this (partial) structural mismatch – goes on to argue that dualism must be true. Although I agree with David Chalmers that yes, phenomenal binding is classically impossible, if one takes the intrinsic nature argument seriously, then phenomenal unity is minted in.

The intrinsic nature argument, recall, is that experience, consciousness, discloses the intrinsic nature of the physical. Now, one of the reasons why this idea is so desperately implausible is it makes the fundamental “psychon” of consciousness ludicrously small. But there’s a neglected corollary of non-materialist physicalism, namely that if experience discloses the intrinsic nature of the physical, then experience must be temporally incredibly fine-grained too. And if we probe your nervous system at a temporal resolution of femtoseconds or even attoseconds, what would we find? My guess is that it would be possible to recover a perfect structural match between what you are experiencing now in your phenomenal world-simulation and the underlying physics. Superpositions (“cat states”) are individual states [i.e. not classical aggregates].

Now, if the effective lifetime of neuronal superpositions in the CNS were milliseconds, they would be the obvious candidate for a perfect structural match and would explain the phenomenal unity of consciousness. But physicists, not least Max Tegmark, have done the maths: decoherence means that the effective lifetime of neuronal superpositions in the CNS, assuming the unitary-only dynamics, is femtoseconds or less, which is intuitively the reductio ad absurdum of any kind of quantum mind.
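For readers who want rough numbers behind this point, Tegmark's published estimates (cited here from memory, so treat them as order-of-magnitude only) put neuronal decoherence times vastly below the timescale of neural dynamics:

\tau_{\text{dec}} \sim 10^{-20}\ \text{to}\ 10^{-13}\ \text{s} \qquad \text{versus} \qquad \tau_{\text{neural}} \sim 10^{-3}\ \text{to}\ 10^{-1}\ \text{s}

That gap of ten or more orders of magnitude is what drives the standard conclusion that quantum coherence is irrelevant to cognition.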

But one person's reductio ad absurdum is another person's falsifiable prediction. I'm guessing – I'm sounding like a believer, but I'm not – I am guessing that with sufficiently sensitive molecular matter-wave interferometry, perhaps using "trained up" mini-brains, the non-classical interference signature will disclose a perfect structural match between what you're experiencing right now, your unified phenomenal world-simulation, and the underlying physics.

Lucas Perry: So, we hit on the hard problem and also the binding problem. There were two other ones that you threw out there earlier that… I forget what they were.

David Pearce: Yeah, the problem of causal efficacy. How is it that you and I can discuss consciousness? How is it that the “raw feels” of consciousness have not merely the causal, but also the functional efficacy to inspire discussions of their existence?

Lucas Perry: And then what was the last one?

David Pearce: Oh, it's been called the palette problem, P-A-L-E-T-T-E. As in the fact that there is a tremendous diversity of different kinds of experience, and yet the fundamental entities recognized by physics, at least on the standard story, are extremely simple and homogeneous. What explains this extraordinarily rich palette of conscious experience? Physics exhaustively describes the structural-relational properties of the world. What physics doesn't do is deal in the essence of the physical, its intrinsic nature.

Now, it’s an extremely plausible assumption that the world’s fundamental fields are non-experiential, devoid of any subjective properties – and this may well be the case. But if so, we have the hard problem, the binding problem, the problem of causal efficacy, the palette problem – a whole raft of problems.

Lucas Perry: Okay. So, this all serves the purpose of codifying that there’s these questions up in the air about these metaphysical views which inform identity. We got here because we were talking about mereological nihilism, and Andrés said that one view that you guys have is that you can divide or cut or partition consciousness into individual, momentary, unitary moments of experience that you claim are ontologically simple. What is your credence on this view?

Andrés Gómez Emilsson: Phenomenological evidence. When you experience your visual field, you don't only experience one point at a time. The contents of your experience are not ones and zeros; it isn't the case that you experience one and then zero and then one again. Rather, you experience many different qualia varieties simultaneously: visual experience and auditory experience and so on. All of that gets presented to you. I take that very seriously. I mean, some other researchers may say that that's fundamentally an illusion, that there's actually never a unified experience, but that has many more problems than taking the unity of consciousness seriously.

David Pearce: A number of distinct questions arise here. Are each of us egocentric phenomenal world-simulations? A lot of people are implicitly perceptual direct realists, even though they might disavow the label. Implicitly, they assume that they have some kind of direct access to physical properties. They associate experience with some kind of stream of thoughts and feelings behind their forehead. But if instead we are world-simulationists, then there arises the question: what is the actual fundamental nature of the world beyond your phenomenal world-simulation? Is it experiential or non-experiential? I am agnostic about that – even though I explore non-materialist physicalism.

Lucas Perry: So, I guess I’m just trying to get a better answer here on how is it that we navigate these views of identity towards truth?

Andrés Gómez Emilsson: An example I thought of, of a very big contrast between what you may intuitively imagine is going on versus what’s actually happening, is if you are very afraid of snakes, for example, you look at a snake. You feel, “Oh, my gosh, it’s intruding into my world and I should get away from it,” and you have this representation of it as a very big other. Anything that is very threatening, oftentimes you represent it as “an other”.

But crazily, that’s actually just yourself to a large extent because it’s still part of your experience. Within your moment of experience, the whole phenomenal quality of looking at a snake and thinking, “That’s an other,” is entirely contained within you. In that sense, these ways of ascribing identity and continuity to the things around us or a self-other division are almost psychotic. They start out by assuming that you can segment out a piece of your experience and call it something that belongs to somebody else, even though clearly, it’s still just part of your own experience; it’s you.

Lucas Perry: But the background here is also that you're calling your experience your own experience, which is maybe also a kind of psychosis. Is that the word you used?

Andrés Gómez Emilsson: Yeah, yeah, yeah, that’s right.

Lucas Perry: Maybe the scientific thing is, there’s just snake experience and it’s neither yours nor not yours, and there’s what we conventionally call a snake.

Andrés Gómez Emilsson: That said, there are ways in which I think you can use experience to gain insight about other experiences. If you're looking at a picture that has two blue dots, I think you can accurately say, by paying attention to one of those blue dots, that the phenomenal property of my sensation of blue is also in that other part of my visual field. This is a case where, in a sense, you can, I think, meaningfully refer to some aspect of your experience by pointing at another aspect of your experience. It's still maybe in some sense kind of crazy, but it's closer to the truth than many other things that we think of or imagine.

Honest and true statements about the nature of other people's experiences are, I think, very much achievable. The reference gap might be possible to bridge, and you can probably aim for a true sense of identity, harmonizing the phenomenal and the ontological senses of identity.

Lucas Perry: I mean, I think that part of the motivation, for example in Buddhism, is that you need to always be understanding yourself in reality as it is or else you will suffer, and that it is through understanding how things are that you’ll stop suffering. I like this point that you said about unifying the phenomenal identity and phenomenal self with what is ontologically true, but that also seems not intrinsically necessary because there’s also this other point here where you can maybe function or have the epistemology of any arbitrary identity view but not identify with it. You don’t take it as your ultimate understanding of the nature of the world, or what it means to be this limited pattern in a giant system.

Andrés Gómez Emilsson: I mean, generally speaking, that's obviously pretty good advice. It does seem to be something that's constrained to the workings of the human mind as it is currently implemented. I mean, all this Buddhist advice of "don't identify with it" or "don't get attached to it" ultimately cashes out in experiencing less craving, for example, or feeling less despair in some cases. Useful advice, but not universally applicable.

For many people, their problem might be something like desire, craving, or attachment, in which case these Buddhist practices will actually be very helpful. But if your problem is something like melancholic depression, then a lack of desire doesn't actually seem very appealing; that is already the default state, and it's not a good one. Just be mindful about universalizing this advice.

David Pearce: Yes. Other things being equal, the happiest people tend to have the most desires. Of course, a tremendous desire can also bring tremendous suffering, but there are a very large number of people in the world who are essentially unmotivated. Nothing really excites them. In some cases, they’re just waiting to die: melancholic depression. Desire can be harnessed.

A big problem, of course, is that in a Darwinian world, many of our desires are mutually inconsistent. And to use (what to me at least would be) a trivial example – it’s not trivial to everyone –  if you have 50 different football teams with all their supporters, there is logically no way that the preferences of these fanatical football supporters can be reconciled. But nonetheless, by raising their hedonic set-points, one can allow all football supporters to enjoy information-sensitive gradients of bliss. But there is simply no way to reconcile their preferences.

Lucas Perry: There's part of me that does want to do some universalization here, and maybe that is wrong or unskillful to do, but I seem to be able to imagine a future where, say, we get aligned superintelligence and there's some kind of rapid expansion, some kind of optimization bubble. And maybe there are the worker AIs and then there are the exploiter AIs, and the exploiter AIs just get blissed out.

And imagine if some of the exploiter AIs are egomaniacs in their hedonistic simulations and some of them are hive minds, and they all have different views on open individualism or closed individualism. Some of the views on identity just seem more deluded to me than others. I seem to have a problem with self-identification and the reification of the self as something: it seems to take something that is conventional and make it an ultimate truth, which is confusing to the agent, and that to me seems bad or wrong, like the world model is wrong. Part of me wants to say it is always better to know the truth, but I'm also having a hard time being able to say how to navigate views of identity in a true way, and then another part of me feels like actually it doesn't really matter except insofar as it affects the flavor of that consciousness.

Andrés Gómez Emilsson: If we find like the chemical or genetic levers for different notions of identity, we could presumably imagine a lot of different ecosystems of approaches to identity in the future, some of them perhaps being much more adaptive than others. I do think I grasp a little bit maybe the intuition pump, and I think that’s actually something that resonates quite a bit with us, which is that it is an instrumental value for sure to always be truth-seeking, especially when you’re talking about general intelligence.

It's very weird, and it sounds like it's going to fail, if you say, "Hey, I'm going to be truth-seeking in every domain except this one," whether that's identity, or your value function, or your model of physics, or something like that. Perhaps actual superintelligence in some sense really entails having an open-ended model for everything, including, ultimately, who you are. If you don't have those open-ended models that can be revised with further evidence and reasoning, you are not a superintelligence.

That intuition pump may suggest that if intelligence turns out to be extremely adaptive and powerful, then presumably, the superintelligences of the future will have true models of what’s actually going on in the world, not just convenient fictions.

David Pearce: Yes. In some sense I would hope our long-term goal is ignorance of the entire Darwinian era and its horrors. But it would be extremely dangerous if we were to give up prematurely. We need to understand reality and the theoretical upper bounds of rational moral agency in the cosmos. But ultimately, when we have done literally everything that it is possible to do to minimize and prevent suffering, I think in some sense we should want to forget about it altogether. But I would stress the risks of premature defeatism.

Lucas Perry: Of course we're always going to need a self-model, a model of the cognitive architecture in which the self-model is embedded; it needs to understand the directly adjacent computations which are integrated into it. But it seems like views of identity go beyond just this self-model. Is that the solution to identity? What do open, closed, or empty individualism have to say about something like that?

Andrés Gómez Emilsson: Open, empty, and closed, as ontological claims, yeah, I mean, they are separable from the functional uses of a self-model. It does, however, have bearings on the decision-theoretic rationality of an intelligence, because when it comes to planning ahead, if you have the intense objective of being as happy as you can, and somebody offers you a cloning machine and says, "Hey, you can trade one year of your life for a completely new copy of yourself," do you press the button to make that happen? For making that decision, you actually do require a model of the ontological notion of identity, unless you just care about replication.
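As a toy illustration of this decision-theoretic point (the numbers and the utility model below are invented for the example, not anything specified in the episode), the cloning-machine trade flips sign depending on whether the copy's future experiences count as "yours":

# Toy model of the cloning-machine trade: give up one year of the original's
# life in exchange for one complete copy. All figures are illustrative assumptions.
YEARS_REMAINING = 40        # assumed remaining years for the original (and for the copy)
WELLBEING_PER_YEAR = 1.0    # assumed constant wellbeing per person-year

def value_of_pressing_button(copy_counts_as_me: bool) -> float:
    """Net change in what the decider cares about if they press the button."""
    cost = -1.0 * WELLBEING_PER_YEAR                                  # the year given up
    gain = YEARS_REMAINING * WELLBEING_PER_YEAR if copy_counts_as_me else 0.0
    return cost + gain

print("copy does not count (closed individualist):", value_of_pressing_button(False))  # -1.0, decline
print("copy counts fully (open individualist):", value_of_pressing_button(True))       # 39.0, accept

Which branch applies is exactly the ontological question: the policy cannot be fixed without some stance on whose experiences the word "you" picks out.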

Lucas Perry: So I think that the problem there is that identity, at least in us apes, is caught up in ethics. If you could have an agent like that where identity was not factored into ethics, then I think that it would make a better decision.

Andrés Gómez Emilsson: It's definitely also a question of whether you can bootstrap an impartial god's-eye-view on the wellbeing of all sentient beings without first having developed a sense of your own identity, then wanting to preserve it, and finally updating it with more information, you know, philosophy, reasoning, physics. I do wonder if you can start out without caring about identity and finally conclude with a kind of impartial god's-eye-view. I think in practice a lot of those transitions probably do happen because the person is first concerned with themselves, and then they update their model of who they are based on more evidence. You know, I could be wrong; it might be possible to completely sidestep Darwinian identities and just jump straight into impartial care for all sentient beings, I don't know.

Lucas Perry: So we're getting into the ethics of identity here, and why it matters. The question for this portion of the discussion is: what are the ethical implications of different views on identity? Andrés, I think you can kick this conversation off by talking a little bit about the game theory.

Andrés Gómez Emilsson: Right, well yeah, the game theory is surprisingly complicated. Just consider, within a given person, the different "sub-agents" of an individual. Let's say you're drinking with your friends on a Friday evening, but you know you have to wake up early at 8:00 AM for whatever reason, and you're deciding whether to have another drink or not. Your intoxicated self says, "Yes, of course. Tonight is all that matters." Whereas your cautious self might try to persuade you that no, you will also exist tomorrow morning.

Within a given person, there's all kinds of complex game theory that happens between alternative views of identity. It becomes even more tricky when you expand it outwards: some social movements are, in a sense, trying to hack people's view of identity, whether the unit is your political party, or the country, or the whole ecosystem, or whatever it may be. A key thing to consider here is the existence of legible Schelling points, also called focal points: in communication between entities, what are the guiding principles they can use in order to effectively coordinate and move towards a certain goal?

I would say that having something like open individualism itself can be a powerful Schelling point for coordination. Especially because if you can be convinced that somebody is an open individualist, you have reasons to trust them. There’s all of this research on how high-trust social environments are so much more conducive to productivity and long-term sustainability than low-trust environments, and expansive notions of identity are very trust building.

On the other hand, from a game theoretical point of view, you also have the problem of defection. Within an open individualist society, you have a small group of people who can fake the test of open individualism. They can take over from within, and instantiate some kind of a dictatorship or some type of a closed individualist takeover of what was a really good society, good for everybody.

This is a serious problem, even when it comes to, for example, forming groups of people who all share a certain experience. For example, MDMA, or 5-MeO-DMT, or let's say deep stages of meditation. Even then, you've got to be careful, because people who are resistant to those states may pretend that they have an expanded notion of identity, but actually covertly work towards a much more reduced sense of identity. I have yet to see a credible, game-theoretically aware solution to how to make this work.
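Here is a minimal sketch of the coordination-versus-defection point, with made-up payoffs (nothing below is from the episode; the identity_weight parameter is one hypothetical way to encode how much of the other player's payoff an agent treats as its own):

# One-shot trust game with standard prisoner's-dilemma-style payoffs (illustrative numbers).
PAYOFFS = {  # (my payoff, their payoff) for (my move, their move); C = cooperate, D = defect
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def subjective_utility(own: int, other: int, identity_weight: float) -> float:
    """identity_weight 0.0 = only my payoff counts; 1.0 = the other's payoff counts as mine too."""
    return own + identity_weight * other

def best_response(identity_weight: float, opponent_move: str) -> str:
    """Move that maximizes subjective utility against a fixed opponent move."""
    def util(move: str) -> float:
        own, other = PAYOFFS[(move, opponent_move)]
        return subjective_utility(own, other, identity_weight)
    return max(["C", "D"], key=util)

print("genuine open individualist vs a cooperator:", best_response(1.0, "C"))   # C (6 beats 5)
print("covert closed individualist vs a cooperator:", best_response(0.0, "C"))  # D (5 beats 3)

The defection problem Andrés raises is visible in the second line: an agent that merely claims an identity_weight of 1.0 while acting on 0.0 still does best by exploiting cooperators, which is why a legible, hard-to-fake signal of expanded identity would matter.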

Lucas Perry: If you could clarify the knobs in a person that the different views on identity turn, whether that's altruism, or selfishness, or other things, and how that affects the game theory, then I think that would be helpful.

Andrés Gómez Emilsson: I mean, I think the biggest knob is fundamentally which experiences count, in the sense that you expect to be there, or expect them to be real in as real a way as your current experience is. It's also contingent on theories of consciousness, because you could be an open individualist and still believe that higher-order cognition is necessary for consciousness, and that non-human animals are not conscious. That gives rise to all sorts of other problems: the person presumably is altruistic and cares about others, but in that case they still don't include non-human animals, for a completely different reason.

Definitely another knob is how you consider what you will be in the future, whether you consider that to be part of the universe or the entirety of the universe. I used to think that personal identity was very tied to hedonic tone; I think of them as much more dissociated now. There is a general pattern: people who are in a very low mood may have a bias towards empty individualism. People who become open individualists often experience a huge surge in positive feelings for a while, because they feel that they're never going to die; the fear of death greatly diminishes. But I don't actually think it's a surefire or foolproof way of increasing wellbeing, because if you take open individualism seriously, it also comes with terrible implications, like: hey, we are also the pigs in factory farms. It's not a very pleasant view.

Lucas Perry: Yeah, I take that seriously.

Andrés Gómez Emilsson: I used to believe for a while that the best thing we could possibly do in the world was to just write a lot of essays and books about why open individualism is true. Now I think it’s important to combine it with consciousness technologies so that, hey, once we do want to upgrade our sense of identity to a greater circle of compassion, that we also have the enhanced happiness and mental stability to be able to actually engage with that without going crazy.

Lucas Perry: This has me thinking about one point that I think is very motivating for me for the ethical case of veganism. Take the common-sense, normal consciousness that most people have, and that I have: you just feel like a self that's having an experience. You feel like you are fortunate enough to be born as you, to be having the Andrés experience or the Lucas experience, and that your life runs from birth to death, and when you die you will be annihilated, you will no longer have experience. Then who is it that is experiencing the cow consciousness? Who is it that is experiencing the chicken and the pig consciousness? There are so many instantiations of that, like billions. Even if this is based on an irrationality, it still feels motivating to me. Yeah, I could just die and wake up as a cow 10 billion times. That's kind of the experience that is going on right now: the sudden, confused awakening into cow consciousness plus factory farming conditions. I'm not sure if you find that completely irrational, or motivating, or what.

Andrés Gómez Emilsson: No, I mean, I think it makes sense. We have a common friend as well, Magnus Vinding, who actually wrote a pro-veganism book with this line of reasoning, called You Are Them, about how a post-theoretical science of consciousness and identity is itself a strong case for an ethical lifestyle.

Lucas Perry: Just touching here on the ethical implications, some other points I want to add are about when one is identified with one's phenomenal identity. In particular, I want to talk about the experience of self where you feel like a closed individualist: your life runs from when you were born up until when you die, and that's you. I think that breeds a very strong duality in terms of your relationship with your own personal phenomenal consciousness. The suffering and joy to which you have direct access are categorized as mine or not mine.

Those which are mine take high moral and ethical priority over the suffering of others. You’re not mind-melded with all of the other brains, right? So there’s an epistemological limitation there where you’re not directly experiencing the suffering of other people, but the closed individualist view goes a step further and isn’t just saying that there’s an epistemological limitation, but it’s also saying that this consciousness is mine, and that consciousness is yours, and this is the distinction between self and other. And given selfishness, that self consciousness will take moral priority over other consciousness.

That, I think, obviously has massive ethical implications with regard to the greed of people. I view the ethical implications here as important because, at least in the way human beings function, if one is able to fully rid oneself of the ultimate identification with one's personal consciousness as being the content of self, then one can move beyond the duality of self-consciousness and other-consciousness, and care about all instances of wellbeing and suffering much more equally than I currently do. That seems harder to do, at least with human brains, if we have a strong reification of, and identification with, our own instances of suffering or wellbeing.

David Pearce: Part of the problem is that the existence of other subjects of experience is metaphysical speculation. It’s metaphysical speculation that one should take extremely seriously: I’m not a solipsist. I believe that other subjects of experience, human and nonhuman, are as real as my experience. But nonetheless, it is still speculative and theoretical. One cannot feel their experiences. There is simply no way, given the way that we are constituted, the way we are now, that one can behave with impartial, God-like benevolence.

Andrés Gómez Emilsson: I guess I would question a little bit the idea that we only care about our future suffering within our own experience because this is me, this is mine, it's not an other. In a sense, I think we care about those experiences more largely because they are more intense. You do see examples of, for example, mirror-touch synesthesia: people who, if they see somebody else get hurt, also experience pain. I don't mean a fleeting sense of discomfort, but perhaps even actual strong pain, because they're able to kind of reflect that, for whatever reason.

People like that are generally very motivated to help others as well. In a sense, their implicit self model includes others, or at least weighs others more than most people do. I mean in some sense you can perhaps make sense of selfishness in this context as the coincidence that what is within our self model is experienced as more intense. But there’s plenty of counter examples to that, including sense of depersonalization or ego death, where you can experience the feeling of God, for example, as being this eternal and impersonal force that is infinitely more intense than you, and therefore it matters more, even though you don’t experience it as you. Perhaps the core issue is what gets the highest amount of intensity within your world simulation.

Lucas Perry: Okay, so I also just want to touch a little bit on preferences here before we move on to how this is relevant to AI alignment and the creation of beneficial AI. From the moral realist perspective, if you take the metaphysical existence of consciousness very seriously, and you view it as the ground of morality, then different views on identity will shift how you weight the preferences of other creatures.

So from a moral perspective, whatever kinds of views of identity end up broadening your moral circle of compassion closer and closer to the end goal of impartial benevolence for all sentient beings according to their degree and kinds of worth, I would view as a good thing. But now there’s this other way to think about identity because if you’re listening to this, and you’re a moral anti-realist, there is just the arbitrary, evolutionary, and historical set of preferences that exist across all creatures on the planet.

Then the views on identity, I think, are also obviously going to weigh into your moral considerations about how much to respect different preferences, right? One might want to go beyond hedonic consequentialism here and just be a preference consequentialist. You could be a deontological ethicist or a virtue ethicist too. We could also consider how different views on identity, as lived experiences, would affect what it means to become virtuous, if being virtuous actually means moving beyond the self.

Andrés Gómez Emilsson: I think I understand what you’re getting at. I mean, really there’s kind of two components to ontology. One is what exists, and then the other one is what is valuable. You can arrive at something like open individualism just from the point of view of what exists, but still have disagreements with other open individualists about what is valuable. Alternatively, you could agree on what is valuable with somebody but completely disagree on what exists. To get the power of cooperation of open individualism as a Schelling point, there also needs to be some level of agreement on what is valuable, not just what exists.

It definitely sounds arrogant, but I do think that by the same principle by which you arrive at open individualism or empty individualism, basically nonstandard views of identity, you can also arrive at hedonistic utilitarianism, and that is, again, the principle of really caring about knowing who or what you fundamentally are. To know yourself more deeply also entails understanding, from second to second, how your preferences impact your state of consciousness. It is my view that open individualism can be thought of as the implication of taking a very systematic approach to making sense of identity. Likewise, philosophical hedonism is the implication of taking a very systematic approach to figuring out what is valuable. How do we know that pleasure is good?

David Pearce: Yeah, does the pain-pleasure axis disclose the world’s intrinsic metric of (dis)value? There is something completely coercive about pleasure and pain. One can’t transcend the pleasure/pain axis. Compare the effects of taking heroin, or “enhanced interrogation”. There is no one with an inverted pleasure/pain axis. Supposed counter-examples, like sado-masochists, in fact just validate the primacy of the pleasure/pain axis.

What follows from the primacy of the pleasure/pain axis? Should we be aiming, as classical utilitarians urge, to maximize the positive abundance of subjective value in the universe, or at least our forward light-cone? But if we are classical utilitarians, there is a latently apocalyptic implication of classical utilitarianism – namely, that we ought to be aiming to launch something like a utilitronium (or hedonium) shockwave – where utilitronium or hedonium is matter and energy optimized for pure bliss.

So rather than any kind of notion of personal identity as we currently understand it, if one is a classical utilitarian – or if one is programming a computer or a robot with the utility function of classical utilitarianism –  should one therefore essentially be aiming to launch an apocalyptic utilitronium shockwave? Or alternatively, should one be trying to ensure that the abundance of positive value within our cosmological horizon is suboptimal by classical utilitarian criteria?

I don't actually personally advocate a utilitronium shockwave. I don't think it's sociologically realistic. I think much more sociologically realistic is to aim for a world based on gradients of intelligent bliss, because that way people's existing values and preferences can (for the most part) be conserved. But nonetheless, if one is a classical utilitarian, it's not clear one is allowed this kind of messy compromise.

Lucas Perry: All right, so now that we're getting into the juicy, hedonistic-imperative-type stuff, let's talk about how this is relevant to AI alignment and the creation of beneficial AI. I think this is clear based off of the conversations we've had already about the ethical implications, and just how prevalent identity is in our world for the functioning of society and sociology, and just civilization in general.

Let's limit the conversation for the moment just to AI alignment. And for this initial discussion of AI alignment, I just want to limit it to the definition of AI alignment as developing the technical process by which AIs can learn human preferences, and help further express and idealize humanity. So, exploring how identity is important and meaningful for that process, there are two points I think it's relevant for. First, who are we making the AI for? Different views on identity I think would matter, because if we assume that sufficiently powerful and integrated AI systems are likely to have consciousness or to have qualia, they're moral agents in themselves.

So who are we making the AI for? We're making new patients or subjects of morality, if we ground morality on consciousness. From a purely egoistic point of view, the AI alignment process is just for humans; it's just to get the AI to serve us. But if we care about all sentient beings impartially, and we just want to maximize conscious bliss in the world, and we don't have this dualistic distinction of consciousness being self or other, we could make the AI alignment process something that is more purely altruistic: we would recognize that we're creating something that may be fundamentally more morally relevant than we are, given that it may or may not have more profound capacities for experience.

David, I'm also holding in mind that I know you're skeptical of the ability of AGI or superintelligence to be conscious. I agree that that's not solved yet, but I'm just working here with the idea of, okay, maybe they are. So I think it can change the altruism-versus-selfishness question, the motivations around who we're training the AIs for. And then the second part is: why are we making the AI? Are we making it for ourselves, or are we making it for the world?

If we take a view from nowhere, what Andrés called a god’s-eye-view, is this ultimately something that is for humanity or is it something ultimately for just making a better world? Personally, I feel that if the end goal is ultimate loving kindness and impartial ethical commitment to the wellbeing of all sentient creatures in all directions, then ideally the process is something that we’re doing for the world, and that we recognize the intrinsic moral worth of the AGI and superintelligence as ultimately more morally relevant descendants of ours. So I wonder if you guys have any reactions to this?

Andrés Gómez Emilsson: Yeah, yeah, definitely. So many. Tongue in cheek, but you just made me chuckle when you said, "Why are we making the AI to begin with?" I think there's a case to be made that the actual reason why we're making AI is as a kind of impressive display of fitness, in order to signal our intellectual fortitude and superiority. I mean, sociologically speaking, actually getting an AI to do something really well is a way in which you can signal your own intelligence, and I guess I worry to some extent that this is a bit of a tragedy of the commons, as is the case with our weapons development. You're so concerned with whether you can, especially because of the social incentives, that you're going to gain status and be looked at as somebody who's really competent and smart, that you don't really stop and wonder whether you should be building this thing in the first place.

Leaving that aside, just from a purely ethically motivated point of view, I do remember thinking, and having a lot of discussions many years ago, about whether we could make a supercomputer experience what it is like for a human to be on MDMA. Then all of a sudden that supercomputer becomes a moral patient. It actually matters; you probably shouldn't turn it off. Maybe in fact you should make more of them. A very important thing I'd like to say here is: I think it's really important to distinguish two notions of intelligence.

On the one hand, intelligence as causal power over your environment; on the other hand, intelligence as the capacity for self-insight, introspection, and understanding reality. I would say that we tend to confuse these quite a bit, especially in circles that don't take consciousness very seriously. It's usually implicitly assumed that having a superhuman ability to control your environment entails that you also have, in a sense, a superhuman sense of self, or superhuman intelligence in the broad sense. Whereas even if you are a functionalist, I mean even if you believe that a digital computer can be conscious, you can make a pretty strong case that even then, it is not something automatic. It's not just that if you program the appropriate behavior, it will automatically also be conscious.

A super straightforward example here is the Chinese room: if it's just a giant lookup table, clearly it is not a subject of experience, even though the input/output mapping might be very persuasive. There are definitely still problems there, and I think if we aim instead towards maximizing intelligence in the broad sense, that does also entail the ability to actually understand the nature and scope of other states of consciousness. In that sense, I think a superintelligence of that sort would be intrinsically aligned with the intrinsic values of consciousness. But there are just so many ways of making partial superintelligences that may be superintelligent in many ways, but not in that one in particular, and I worry about that.
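For concreteness, here is a minimal sketch of the "giant lookup table" case (the entries are invented; a real table would need one row per possible input): the input/output mapping can look conversational while the system contains no model, no introspection, and, on this argument, no experience.

# Toy "Chinese room as lookup table": pure retrieval, no internal model of anything.
LOOKUP_TABLE = {  # hypothetical entries for illustration
    "hello": "Hello! How are you today?",
    "how are you?": "I'm doing well, thank you for asking.",
    "do you understand me?": "Of course I understand you.",
}

def lookup_room(prompt: str) -> str:
    """Return a canned response; there is no state, no learning, no representation."""
    return LOOKUP_TABLE.get(prompt.lower().strip(), "Could you rephrase that?")

print(lookup_room("Do you understand me?"))  # persuasive output, zero comprehension

Behavioral competence here is exhausted by retrieval, which is the point: programming the appropriate behavior does not automatically buy the broader, introspective sense of intelligence.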

David Pearce: I sometimes sketch this simplistic trichotomy, three conceptions of superintelligence. One is a kind of “Intelligence Explosion” of recursively self-improving software-based AI. Then there is the Kurzweilian scenario – a complete fusion of humans and our machines. And then there is, very crudely, biological superintelligence: not just rewriting our genetic source code, but also (and Neuralink prefigures this) essentially “narrow” superintelligence-on-a-chip, so that anything a classical digital computer can do, a biological human or a transhuman can do.

So yes, I see full-spectrum superintelligence as our biological descendants, super-sentient, able to navigate radically alien states of consciousness. So I think the question that you’re asking is why are we developing “narrow” AI – non-biological machine superintelligence.

Lucas Perry: Speaking specifically from the AI alignment perspective, which concerns how we align current-day systems and future systems, up to superintelligence and beyond, with human values and preferences, the question born of that, in the context of these questions of identity, is: who are we making that AI for, and why are we making the AI?

David Pearce: If you’ve got Buddha, “I teach one thing and one thing only, suffering and the end of suffering”… Buddha would press the OFF button, and I would press the OFF button.

Lucas Perry: What’s the off button?

David Pearce: Sorry, the notional initiation of a vacuum phase-transition (or something similar) that (instantaneously) obliterates Darwinian life. But when people talk about “AI alignment”, or most people working in the field at any rate, they are not talking about a Buddhist ethic [the end of suffering] – they have something else in mind. In practical terms, this is not a fruitful line of thought to pursue – you know, the implications of Buddhist, Benatarian, negative utilitarian, suffering-focused ethics.

Essentially, one wants to ratchet up hedonic range and hedonic set-points in such a way that you’re conserving people’s existing preferences – even though their existing preferences and values are, in many cases, in conflict with each other. Now, how one actually implements this in a classical digital computer, or a classically parallel connectionist system, or some kind of hybrid, I don’t know precisely.

Andrés Gómez Emilsson: There is at least one pretty famous cognitive scientist and AI theorist who does propose the Buddhist ethic of pressing the off button of the universe: Thomas Metzinger, with his benevolent artificial anti-natalism. I mean, yeah. Actually that’s pretty interesting, because he explores the idea of an AI that truly extrapolates human values and what’s good for us as subjects of experience. The AI concludes what we are psychologically unable to, which is that the ethical choice is non-existence.

But yeah, I mean, I think that’s, as David pointed out, implausible. I think it’s much better to put our efforts in creating a super cooperator cluster that tries to recalibrate the hedonic set point so that we are animated by gradients of bliss. Sociological constraints are really, really important here. Otherwise you risk…

Lucas Perry: Being irrelevant.

Andrés Gómez Emilsson: … being irrelevant, yeah, is one thing. The other thing is unleashing an ineffective or failed attempt at sterilizing the world, which would be so much, much worse.

Lucas Perry: I don’t agree with this view, David. Generally, I think that Darwinian history has probably been net negative, but I’m extremely optimistic about how good the future can be. And so I think it’s an open question at the end of time, how much misery and suffering and positive experience there was. So I guess I would say I’m agnostic as to this question. But if we get AI alignment right, and these other things, then I think that it can be extremely good. And I just want to tether this back to identity and AI alignment.

Andrés Gómez Emilsson: I do have the strong intuition that if empty individualism is correct at an ontological level, then actually negative utilitarianism can be pretty strongly defended on the grounds that when you have a moment of intense suffering, that’s the entirety of that entity’s existence. And especially with eternalism, once it happens, there’s nothing you can do to prevent it.

There’s something that seems particularly awful about allowing inherently negative experiences to just exist. That said, I think open individualism actually may to some extent weaken that. Because even if the suffering was very intense, you can still imagine that if you identify with consciousness as a whole, you may be willing to undergo some bad suffering as a trade-off for something much, much better in the future.

It sounds completely insane if you’re currently experiencing a cluster headache or something astronomically painful. But maybe from the point of view of eternity, it actually makes sense. Those are still tiny specks of experience relative to the beings that are going to exist in the future. You can imagine Jupiter brains and Dyson spheres just in a constant ecstatic state. I think open individualism might counterbalance some of the negative utilitarian worries and would be something that an AI would have to contemplate and might push it one way or the other.

Lucas Perry: Let’s go ahead and expand the definition of AI alignment. A broader way we can look at the AI alignment problem, or the problem of generating beneficial AI and making the future of AI go well, is as the project of making sure that the technical, political, social, and moral consequences of AI, from short-term systems through superintelligence and beyond, unfold as a beneficial process.

Thinking about identity in that process, we were talking about how strong nationalism, or strong identification with a nation state, is a form of identity construction that people do. The nation or the country becomes part of self. One of the problems within the AI alignment problem is arms racing between countries, and the shortcuts on safety that follow from it. I’m not trying to propose clear answers or solutions here. It’s unclear how successful an intervention here could even be. But these views on identity, and whether and how much nationalism shifts, I think feed into how difficult or not the problem will be.

Andrés Gómez Emilsson: The point about game theory becomes very, very important here, in that yes, you do want to help other people who are also trying to improve the well-being of all consciousness. On the other hand, if there’s a way to fake caring about the entirety of consciousness, that is a problem, because then you would be using resources on people who would hoard them, or even worse, wrest the power away from you so that they can focus on their narrow sense of identity.

In that sense, I think having technologies in order to set particular phenomenal experiences of identity, as well as to be able to detect them, might be super important. But above all, and I mean this is definitely my area of research, having a way of objectively quantifying how good or bad a state of consciousness is based on the activity of a nervous system seems to me like an extraordinarily key component for any kind of a serious AI alignment.

If you’re actually trying to prevent bad scenarios in the future, you’ve got to have a principled way of knowing whether the outcome is bad, or at the very least knowing whether the outcome is terrible. The aligned AI should be able to grasp that a certain state of consciousness, even if nobody has experienced it before, will be really bad and should be avoided, and that tends to be the lens through which I see this.

In terms of improving people’s internal self-consistency, as David pointed out, I think it’s kind of pointless to try to satisfy a lot of people’s preferences, such as having their favorite sports team win, because there’s really just no way of satisfying everybody’s preferences. The realm of psychology is where a lot of these interventions would happen. You can’t expect an AI to be aligned with you if you yourself are not aligned with yourself, right, if you have all of these strange, psychotic, competing sub-agents. It seems like part of the process is going to be developing techniques to become more consistent, so that we can actually be helped.

David Pearce: In terms of risks this century, nationalism has been responsible for most of the wars of the past two centuries, and nationalism is highly likely to lead to catastrophic war this century. And the underlying source of global catastrophic risk? I don’t think it’s AI. It’s male human primates doing what male human primates have been “designed” by evolution to do – to fight, to compete, to wage war. And even vegan pacifists like me, how do we spend our leisure time? Playing violent video games.

There are technical ways one can envisage mitigating the risk. Perhaps it’s unduly optimistic aiming for all-female governance or for a democratically-accountable world state under the auspices of the United Nations. But I think that unless one does have somebody with a monopoly on the use of force, we are going to have cataclysmic nuclear war this century. It’s highly likely: I think we’re sleepwalking our way towards disaster. It’s more intellectually exciting discussing exotic risks from AI that goes FOOM, or something like that. But there are much more mundane catastrophes that are, I suspect, going to unfold this century.

Lucas Perry: All right, so getting into the other part here about AI alignment and beneficial AI throughout this next century, there are a lot of different things that increased intelligence and capacity and power over the world is going to enable. There’s going to be human biological species divergence via AI-enabled bioengineering. There is this fundamental desire for immortality in many people, and for some people the drive towards superintelligence and beyond promises immortality. I think that closed individualism is extremely motivating for this extreme self-concern and desire for immortality.

There are people today who are investing in, say, cryonics, because they want to freeze themselves and last long enough that they can somehow become immortal, very clearly influenced by their ideas of identity. As Yuval Noah Harari was saying on our last podcast, it subverts many of the classic liberal myths that we have about the same intrinsic worth across all people; and then if you add humans 2.0 or 3.0 or 4.0 into the mixture, it’s going to subvert that even more. So there are important questions of identity there, I think.

With sufficiently advanced superintelligence, people flirt with the idea of being uploaded. The identity questions which are relevant here are: if we scan the information architecture or the neural architecture of your brain and upload it, will people feel like that is them? Is it not them? What does it mean to be you? Also, of course, in scenarios where people want to merge with the AI, what is it that you would want to be kept in the merging process? What is superfluous to you? What is nonessential to your identity, or to what it means to be you, such that you would be okay or not okay with merging?

And then, I think most importantly here, I’m very interested in the Descendants scenario, where we just view AI as our evolutionary descendants. There’s this tendency in humanity to not be okay with the Descendants scenario. Because of closed individualist views on identity, people won’t see that the consciousness is the same kind of thing, or they won’t see it as their own consciousness. They see well-being through the lens of self and other, and that makes people less interested in there being descendant, superintelligent, conscious AIs. Maybe there’s also a bit of speciesism in there.

I wonder if you guys have any reactions to identity in any of these processes? Again, they are: human biological species divergence via AI-enabled bioengineering, immortality, uploads, merging, and the Descendants scenario.

David Pearce: In spite of thinking that Darwinian life is sentient malware, I think cryonics should be opt-out, and cryothanasia should be opt-in, as a way to defang death. So long as someone is suspended in optimal conditions, it ought to be possible for advanced intelligence to reanimate that person. And sure, if one is an “empty” individualist, or if you’re the kind of person who wakes up in the morning troubled that you’re not the person who went to sleep last night, this reanimated person may not really be “you”. But if you’re more normal, yes, I think it should be possible to reanimate “you” if you are suspended.

In terms of mind uploads, this is back to the binding problem. Even assuming that you can be scanned with a moderate degree of fidelity, I don’t think your notional digital counterpart is a subject of experience. Even if I am completely wrong here, and somehow subjects of experience inexplicably emerge in classical digital computers, there’s no guarantee that the qualia would be the same. After all, you can replay a game of chess with perfect fidelity, but there’s no guarantee incidentals like the textures of the pieces will be the same. Why expect the textures of qualia to be the same? But that isn’t really my objection. It’s the fact that a digital computer cannot support phenomenally-bound subjects of experience.

Andrés Gómez Emilsson: I also think cryonics is really good. Even though with a different nonstandard view of personal identity, it’s kind of puzzling. Why would you care about it? Lots of practical considerations. I like what David said of like defanging death. I think that’s a good idea, but also giving people skin in the game for the future.

People who enact policy and become politically successful often tend to be 50-plus, and there are a lot of things that they weigh in on that they will not actually get to experience. That probably biases politicians and people who are enacting policy to focus especially on short-term gains, as opposed to genuinely trying to improve the long term; and I think cryonics would be helpful in giving people skin in the game.

More broadly speaking, it does seem to be the case that what aspect of transhumanism a person is likely to focus on depends a lot on their theories of identity. I mean, if we break down transhumanism into the three supers of super happiness, super longevity, and super intelligence, the longevity branch is pretty large. There’s a lot of people looking for ways of rejuvenating, preventing aging, and reviving ourselves, or even uploading ourselves.

Then there’s people who are very interested in super intelligence. I think that’s probably the most popular type of transhumanism nowadays. That one I think does rely to some extent on people having a functionalist information theoretic account of their own identity. There’s all of these tropes of, “Hey, if you leave a large enough digital footprint online, a super intelligence will be able to reverse engineer your brain just from that, and maybe reanimate you in the future,” or something of that nature.

And then there are, yeah, people like David and me, and the Qualia Research Institute as well, who care primarily about super happiness. We think of it as kind of a requirement for a future that is actually worth living in. You can have all the longevity and all the intelligence you want, but if you’re not happy, I don’t really see the point. A lot of the concerns with longevity, fear of death and so on, will in retrospect probably be considered some kind of neurosis. Obviously a genetically adaptive neurosis, but something that can be cured with mood-enhancing technologies.

Lucas Perry: Leveraging human selfishness, or leveraging how most people are closed individualists, seems like the way to get good AI alignment. To some extent, I find the immortality pursuits through cryonics to be pretty elitist. But I think it’s a really good point that giving the policymakers and the older generation and people in power more skin in the game over the future is both potentially very good and also very scary.

It’s very scary to the extent that they could get absolute power, but also very good if you’re able to mitigate the risk of them developing absolute power. But again, as you said, it motivates them towards more deeply and profoundly considering future considerations, being less myopic, being less selfish, so that getting the AI alignment process right and doing the necessary technical work is not done for short-term nationalistic gain. Again, with an asterisk here that the risk is unilaterally getting more and more power.

Andrés Gómez Emilsson: Yeah, yeah, yeah. Also, beyond cryonics, there is another way to increase skin in the game that may be more straightforwardly positive: bliss technologies do that. A lot of people who are depressed or nihilistic or vengeful or misanthropic don’t really care about destroying the world or watching it burn, so to speak, because they don’t have anything to lose. But if you have a really reliable MDMA-like technological device that reliably produces wonderful states of consciousness, I think people will be much more careful about preserving their own health, and also not watch the world burn, because they know “I could be back home and actually experiencing this rather than just trying to satisfy my misanthropic desires.”

David Pearce: Yeah, the happiest people I know work in the field of existential risk. Rather than great happiness just making people reckless, it can also make them more inclined to conserve and protect.

Lucas Perry: Awesome. I guess there’s just one more thing that I wanted to hit on among these different ways that technology is going to change society. I don’t know. In my heart, the ideal is the vow to liberate all sentient beings in all directions from suffering. The closed individualist view seems generally fairly antithetical to that, but there’s also this desire for me to be realistic about leveraging that human selfishness towards that ethic. The capacity here for conversations on identity going forward, if we can at least give people more information to subvert or challenge the common-sense closed individualist view, or about why it might be wrong, I think would just have a ton of implications for how people end up viewing human species divergence, or immortality, or uploads, or merging, or the Descendants scenario.

In Max’s book, Life 3.0, he describes a bunch of different scenarios for how you might want the world to be as the impact of AI grows, if we’re lucky enough to reach superintelligent AI. These scenarios that he gives are, for example, an Egalitarian Utopia where humans, cyborgs, and uploads coexist peacefully thanks to property abolition and guaranteed income. There’s a Libertarian Utopia where humans, cyborgs, uploads, and superintelligences coexist peacefully thanks to property rights. There is a Protector God scenario where an essentially omniscient and omnipotent AI maximizes human happiness by intervening only in ways that preserve our feeling of control of our own destiny, and hides well enough that many humans even doubt the AI’s existence. There’s Enslaved God, which is kind of self-evident: the AI is a slave to our will. And there’s the Descendants scenario, which I described earlier, where AIs replace human beings but give us a graceful exit, making us view them as our worthy descendants, much as parents feel happy and proud to have a child who’s smarter than them, who learns from them, and then accomplishes what they could only dream of, even if they can’t live to see it.

After the book was released, Max did a survey of which ideal societies people were most excited about, and basically most people wanted either the Egalitarian Utopia or the Libertarian Utopia. These are very human-centric, of course, because I think most people are closed individualists, so okay, they’re going to pick that. Then they wanted the Protector God next, the fourth most popular was the Enslaved God, and the fifth most popular was Descendants.

I’m a very big fan of the Descendants scenario. Maybe it’s because of my empty individualism. I just feel that views on identity are quite uninformed for most people, or most people don’t take the question seriously; closed individualism just seems intuitively true from the beginning, because a very strong sense of self has mostly been selected for by Darwinian evolution. I just think that challenging conventional views on identity will very much shift the kinds of worlds that people are okay with, or the kinds of worlds that people want.

If we had a big, massive public education campaign about the philosophy of identity and then took the same survey later, I think that the numbers would be quite different. That seems to be a necessary part of the education of humanity in the process of beneficial AI and AI alignment. To me, the Descendants scenario just seems best because it’s more clearly in line with this ethic of being impartially devoted to maximizing the well-being of sentience everywhere.

I’m curious to know your reactions to these different scenarios, and how you feel views on identity, as they shift, will inform the kinds of worlds that humanity finds beautiful or meaningful or worthy of pursuit through and with AI.

David Pearce: If today’s hedonic range is -10 to zero to +10, then whether we build a civilization with a hedonic range of +70 to +100, i.e. with more hedonic contrast, or +90 to +100 with less hedonic contrast, the multiple phase-changes in consciousness involved are completely inconceivable to humans. But in terms of full-spectrum superintelligence, what we don’t know is the nature of the radically alien state-spaces of consciousness – far more different than, let’s say, dreaming consciousness and waking consciousness – that I suspect such intelligence is going to explore. And we just do not have the language, the concepts, to conceptualize what these alien state-spaces of consciousness are like. I suspect billions of years of consciousness-exploration lie ahead. I assume that a central element will be the pleasure-axis – that these states will be generically wonderful – but they will otherwise be completely alien. And so talk of “identity” with primitive Darwinian malware like us is quite fanciful.

Andrés Gómez Emilsson: Consider the following thought experiment where you have a chimpanzee right next to a person, who is right next to another person, where the third one is currently on a high dose of DMT, combined with ketamine and salvia. If you consider those three entities, I think very likely, actually the experience of the chimpanzee and the experience of the sober person are very much alike, compared to the person who is on DMT, ketamine, and salvia, who is in a completely different alien-state space of consciousness. And in some sense, biologically you’re unrelatable from the point of view of qualia and the sense of self, and time, and space, and all of those things.

Personally, I think having intimations of alien state-spaces of consciousness is actually good, quite apart from changes like the feeling that you’ve become one with the universe. Merely having experience with really different states of consciousness makes it easier for you to identify with consciousness as a whole: you realize, okay, my DMT self, so to speak, cannot exist naturally, and it’s just so different from who I am normally, even more different than perhaps being a chimpanzee, that you could imagine caring as well about alien state-spaces of consciousness that are completely nonhuman, and I think that can be pretty helpful.

The other reason why I give a lot of credence to open individualism being a winning strategy, even just from a purely political and sociological point of view, is that open individualists are not afraid of changing their own state of consciousness, because they realize that it will be them either way. Whereas closed individualists can actually be pretty scared of, for example, taking DMT or something like that. They tend to have at least the suspicion that, oh my gosh, is the person who is going to be on DMT me? Am I going to be there? Or maybe I’m just being possessed by a different entity with completely different values and consciousness.

With open individualism, no matter what type of consciousness your brain generates, it’s going to be you. It massively amplifies the degrees of freedom for coordination. Plus, you’re not afraid of tuning your consciousness for particular new computational uses. Again, this could be extremely powerful as a cooperation and coordination tool.

To summarize, I think a plausible and very nice future scenario is going to be the mixture of, first, open individualism; second, generically enhanced hedonic tone, so that everything is amazing; and third, an expanded range of possible experiences, so that we will have the tools to experience pretty much arbitrary state-spaces of consciousness and consider them our own.

The Descendant scenario, I think it’s much easier to imagine thinking of the new entities as your offspring if you can at least know what they feel like. You can take a drug or something and know, “okay, this is what it’s like to be a post-human android. I like it. This is wonderful. It’s better than being a human.” That would make it possible.

Lucas Perry: Wonderful. This last question is just about the role of identity in the AI itself, or the superintelligence itself, as it experiences the world, the ethical implications of those identity models, et cetera. There is the question of identity now, and if we get aligned superintelligence and post-human superintelligence, and we have Jupiter rings or Dyson spheres or whatever, there’s the question of identity evolving in that system. We are very much creating Life 3.0, and there is a substantive question of what kind of identity views it will take, and what its phenomenal experience of self, or lack thereof, will be. This is all relevant and important because if we’re concerned with maximizing conscious well-being, then these are flavors of consciousness which would require a sufficiently rigorous science of consciousness to understand their valence properties.

Andrés Gómez Emilsson: I mean, I think it’s a really, really good thing to think about. The overall frame I tend to use to analyze these kinds of questions comes from an article I wrote, which you can find on Qualia Computing, called “Consciousness Versus Replicators.” I think that is a pretty good overarching ethical framework, where I describe how different kinds of ethics can give different worldviews, and also how they depend on your philosophical sophistication.

At the very beginning, you have ethics such as the battle between good and evil, but then you start introspecting. You’re like, “okay, what is evil exactly,” and you realize that nobody sets out to do evil from the very beginning. Usually, they actually have motivations that make sense within their own experience. Then you shift towards this other theory that’s called the balance between good and evil, super common in Eastern religions. Also, people who take a lot of psychedelics or meditate a lot tend to arrive at that view, as in, “oh, don’t be too concerned about suffering or the universe. It’s all a huge yin and yang. The evil part makes the good part better,” or weird things like that.

Then you have something a little bit more developed, what I call gradients of wisdom. I would say Sam Harris, and definitely a lot of people in our community, think that way: they come to the realization that there are societies that don’t help human flourishing, and there are ideologies that do, and it’s really important to be discerning. We can’t just say, “Hey, everything is equally good.”

But finally, I would say the fourth level would be consciousness versus replicators, which involves, first, taking open individualism seriously; and second, realizing that anything that matters, matters because it influences experiences. If you have that as your underlying ethical principle, and you guard against the danger of replicators hijacking our motivational architecture in order to pursue their own replication, independent of the well-being of sentience, then I think you’re in a pretty good space to actually do a lot of good. I would say perhaps that is the sort of ethics or morality we should think about how to instantiate in an artificial intelligence.

In the extreme, you have what I call a pure replicator, and a pure replicator essentially is a system or an entity that uses all of its resources exclusively to make copies of itself, independently of whether that causes good or bad experiences elsewhere. It just doesn’t care. I would argue that humans are not pure replicators: we do in fact care about consciousness, at the very least our own consciousness. And evolution is recruiting the fact that we care about consciousness in order to, as a side effect, increase the inclusive fitness of our genes.

But these discussions we’re having right now, about the possibility of a post-human ethic, are the genie getting out of the bottle, in the sense that consciousness is taking up its own values and trying to transcend the selfish genetic process that gave rise to it.

Lucas Perry: Ooh, I like that. That’s good. Anything to add, David?

David Pearce: No. Simply, I hope we have a Buddhist AI.

Lucas Perry: I agree. All right, so I’ve really enjoyed this conversation. I feel more confused now than when I came in, which is very good. Yeah, thank you both so much for coming on.

End of recorded material

AI Alignment Podcast: On DeepMind, AI Safety, and Recursive Reward Modeling with Jan Leike

Jan Leike is a senior research scientist who leads the agent alignment team at DeepMind. His is one of three teams within their technical AGI group; each team focuses on different aspects of ensuring advanced AI systems are aligned and beneficial. Jan’s journey in the field of AI has taken him from a PhD on a theoretical reinforcement learning agent called AIXI to empirical AI safety research focused on recursive reward modeling. This conversation explores his movement from theoretical to empirical AI safety research — why empirical safety research is important and how this has led him to his work on recursive reward modeling. We also discuss research directions he’s optimistic will lead to safely scalable systems, more facets of his own thinking, and other work being done at DeepMind.

 Topics discussed in this episode include:

  • Theoretical and empirical AI safety research
  • Jan’s and DeepMind’s approaches to AI safety
  • Jan’s work and thoughts on recursive reward modeling
  • AI safety benchmarking at DeepMind
  • The potential modularity of AGI
  • Comments on the cultural and intellectual differences between the AI safety and mainstream AI communities
  • Joining the DeepMind safety team

Timestamps: 

0:00 Intro
2:15 Jan’s intellectual journey in computer science to AI safety
7:35 Transitioning from theoretical to empirical research
11:25 Jan’s and DeepMind’s approach to AI safety
17:23 Recursive reward modeling
29:26 Experimenting with recursive reward modeling
32:42 How recursive reward modeling serves AI safety
34:55 Pessimism about recursive reward modeling
38:35 How this research direction fits in the safety landscape
42:10 Can deep reinforcement learning get us to AGI?
42:50 How modular will AGI be?
44:25 Efforts at DeepMind for AI safety benchmarking
49:30 Differences between the AI safety and mainstream AI communities
55:15 Most exciting piece of empirical safety work in the next 5 years
56:35 Joining the DeepMind safety team

 

Works referenced:

Scalable agent alignment via reward modeling
The Boat Race Problem
Move 37
Jan Leike on reward hacking
OpenAI Safety Gym
ImageNet
Unrestricted Adversarial Examples

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Lucas Perry: Hello everyone and welcome to the AI Alignment Podcast. I’m Lucas Perry. Today, we’re speaking with Jan Leike. Jan Leike is a senior research scientist at DeepMind and his research aims at helping to make machine learning robust and beneficial; he works on safety and alignment of reinforcement learning agents. His current research can be understood as motivated by the following question: How can we design competitive and scalable machine learning algorithms that make sequential decisions in the absence of a reward function? If this podcast is interesting or valuable to you, please consider following us on your preferred listening platform and leaving us a good review.

This conversation covers Jan’s PhD and his movement from theoretical to empirical AI research, why this shift took place, and his view on the importance of empirical AI safety research. We discuss how DeepMind approaches the projects of beneficial AI and AI safety. We discuss the AI alignment landscape today and the kinds of approaches that Jan is most excited about. We get into Jan’s main area of research, recursive reward modeling, and we talk about AI safety benchmarking efforts at DeepMind and the intellectual and cultural differences between the AI alignment/AI safety community and the mainstream AI and machine learning community. As a friendly heads up, there were some audio issues with the incoming audio in the second half of the podcast. We did our best to clean these up and I feel the resulting audio is easily listenable. I’d also like to give many thanks to Richard Ngo, Vishal Maini, and Richard Mallah for help on developing and refining the questions for this podcast. And with that, let’s get into our conversation with Jan Leike.

Why don’t you start off by taking us through your journey in the field of AI? How did you first become interested in math and computer science? Tell me a little bit about your time as a PhD student. What piqued your curiosity, and why were you pursuing what you were pursuing?

Jan Leike: I got interested in AGI and AGI safety around 2012. I was doing a Master’s degree at the time, and I was trying to think about what I should do with my career. And I was reading a whole bunch of stuff online. That’s how I got into this whole area. My background was kind of in math and computer science at the time, but I wasn’t really working on AI. I was more working on software verification. Then I came across Marcus Hutter’s AIXI model, which is basically a formal mathematical model for what AGI could look like. And it’s highly idealized. It’s not something that could actually run, but you can think about it and you can actually prove things about it. And I was really excited about that. I thought that was a great starting point, because remember, that was back in 2012, before the whole deep learning revolution happened, so it was not really clear what kinds of approaches we might actually take towards AGI. The purpose of my PhD was to understand AGI from a high-level theoretical perspective.

Lucas Perry: The PhD was with Marcus Hutter on AIXI or “A,” “I,” “X,” “I.” From that pursuit, what interesting or really valuable things did you glean from that process?

Jan Leike: So my thesis ended up being a number of theoretical results, one of which is that this idealized agent AIXI is actually not optimal in any objective sense. In a way, it all depends on the universal Turing machine that is used to define it. However, there are variants of AIXI that do have objective properties, such as asymptotic convergence to the optimal policy. This variant is basically based on Thompson sampling, but in the fully general reinforcement learning setting. So that’s partially observable, and you don’t have episodes; everything is one long episode. It’s not really a setting where you can give any sample complexity bounds. Asymptotic convergence is all you can do. And then another thing that came out of that was what we called “A Formal Solution to the Grain of Truth Problem.” That was a collaboration with the Machine Intelligence Research Institute.

And the idea here is that one of the problems with the AIXI formal model is that it assumes that its environment is computable, but it itself is incomputable, so you can’t really do multi-agent analysis with it. What we did was propose a formalism that is a variant on AIXI which can be in its own environment class, so that we can embed such an agent in an environment together with other AIXI-like agents. And when they do that, they can still asymptotically learn to predict correctly what the other agents will do, and converge to a Nash equilibrium asymptotically.

Lucas Perry: So the sense in which AIXI was a theoretical ideal was that the process by which it was able to learn or infer preferences was computationally intractable.

Jan Leike: The AIXI model basically just tries to answer the reinforcement learning question. So you’re given an environment and you’re given a reward signal, how do you optimize that? In a way, you’re using what we call the Solomonoff prior to predict the next observations that come from the environment, and then you essentially do an exhaustive tree search over all the possible action sequences that you could take and the possible consequences that you would predict and then make the best action in terms of returns. This is kind of similar to how AlphaGo uses Monte Carlo tree search to select the best actions. The reason why you can’t literally build AIXI is that this Solomonoff prior that I mentioned, it is basically the set of all possible Turing machines which is countably infinite, and then you have to run all of them in parallel and take a weighted average over what they would predict.

If you’ve ever tried to run all computer programs in parallel, you’ll know that this is not going to go too well. I find AIXI helpful when thinking about AGI in terms of what advanced machine learning or reinforcement learning agents could look like: they have some kind of learned model in which they do planning and select actions that are good in the long run. So I think in a way it tells us that if you believe the reinforcement learning problem is the right problem to phrase AGI in, then AIXI proves that there can be a solution to that problem. I think on a high level, having thought about this model is useful when thinking about where we are headed and what potential milestones on the way to AGI might be. But at the same time, I think my mistake at the time was really getting lost in some of the technical details that really matter if you want to publish a paper on this stuff, but don’t transfer as much in the analogy.
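
A rough rendering, up to notational details, of the formal object being described here: AIXI picks its action at time k by an expectimax search over future action, observation, and reward sequences out to a horizon m, weighting every program q run on a universal Turing machine U by its Solomonoff prior weight, where ℓ(q) is the program's length. Very roughly:

```latex
a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
       \bigl[ r_k + \cdots + r_m \bigr]
       \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

The incomputability Jan mentions lives in the innermost sum over all programs q; the "exhaustive tree search" is the nested max/sum over possible futures.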

Lucas Perry: After you finished up your PhD with Hutter, you were done working on AIXI. Is this when you transitioned to DeepMind and made this transition from theoretical to empirical research?

Jan Leike: Yeah, that’s basically right. At the time when I started my PhD, I decided to work on theory because it wasn’t really clear to me what AGI would look like and how we’d build it. So I wanted to do something general and something more timeless. Then we saw a whole bunch of evidence that deep reinforcement learning is viable and you can make it work. The DQN Nature paper came out, and there was the AlphaGo event. And then it became pretty clear to me that deep reinforcement learning was going somewhere, and that’s something we should work on. At the time, my tool set was very theoretical and my comparative advantage was thinking about theory and using the tools that I learned and developed in my PhD.

And the problem was that deep reinforcement learning has very little theory behind it, and there’s very little theory on RL in general. And the little theory on RL that we have says that basically function approximation shouldn’t really work. So that means it’s really hard to actually gain traction on doing something theoretical. And at the same time, it was very clear that we could just take some agents that existed, build something, and then make incremental progress on things that would actually help us make AGI safer.

Lucas Perry: Can you clarify how the theoretical foundations of deep reinforcement learning are weak? What does that mean? Does that mean that we have this thing and it works, but we’re not super sure about how it works? Or the theories about the mechanisms which constitute that functioning thing are weak? We can’t extrapolate out very far with them?

Jan Leike: Yeah, so basically there are two parts. If you take deep neural networks, there are some results that tell you that depth is better than width, and that if you increase capacity you can represent any function, and things like that. But basically the kind of thing that I would want are real sample complexity bounds that tell you, if your network has X many parameters, how much training data you need and how many batches you need to train in order to actually converge. Can you converge asymptotically? None of these things even hold in theory. You can get examples where it doesn’t work. And of course we know that in practice, because sometimes training is just unstable, but that doesn’t mean that you can’t tune it and make it work in practice.

On the RL side, there are a bunch of convergence results that people have given in tabular MDPs, Markov decision processes. In that setting, everything is really nice and you can give sample complexity bounds, or at least some bounds on how long learning will take. But as soon as you go into a function approximation setting, all bets are off, and there are very simple two-state MDPs you can draw where just simple linear function approximation completely breaks. And this is a problem that we haven’t really gotten a great handle on theoretically. Going from linear function approximation to deep neural networks is just going to make everything so much harder.
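
A toy illustration of the kind of breakage Jan is gesturing at, in the spirit of the classic two-state "w → 2w" example from the RL literature (not necessarily the exact MDP he has in mind; the discount factor, step size, and features below are made up):

```python
# Semi-gradient TD(0) with linear function approximation on a single
# transition s1 -> s2, where phi(s1) = 1, phi(s2) = 2, and all rewards are 0.
# Repeatedly updating only this transition (as can happen off-policy) makes
# the shared weight diverge, even though the true values are all zero.
gamma, alpha = 0.99, 0.1   # illustrative discount factor and step size
w = 1.0                    # single shared weight

for step in range(60):
    v_s1, v_s2 = 1.0 * w, 2.0 * w          # value estimates under w
    td_error = 0.0 + gamma * v_s2 - v_s1   # reward is 0 on this transition
    w += alpha * td_error * 1.0            # semi-gradient update at s1
    if step % 20 == 0:
        print(f"step {step:2d}  w = {w:.2f}")

# Each update multiplies w by (1 + alpha * (2 * gamma - 1)) > 1, so w blows up.
```

With a separate table entry per state, the same repeated update would settle down; the blow-up needs the shared weight (function approximation) combined with bootstrapping and the off-policy update pattern.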

Lucas Perry: Are there any other significant ways in which your thinking has changed as you transitioned from theoretical to empirical?

Jan Leike: In the absence of these theoretical tools, you have two options. Either you try to develop these tools, and that seems very hard and many smart people have tried for a long time. Or you just move on to different tools. And I think especially if you have systems that you can do experiments on, then having an empirical approach makes a lot of sense if you think that these systems actually can teach us something useful about the kind of systems that we are going to build in the future.

Lucas Perry: A lot of your thinking has been about understanding the theoretical foundations, like what AGI might even look like, and then transitioning to an empirical based approach that you see as efficacious for studying systems in the real world and bringing about safe AGI systems. So now that you’re in DeepMind and you’re in this intellectual journey that we’re covering, how is DeepMind and how is Jan approaching beneficial AI and AI alignment in this context?

Jan Leike: DeepMind is a big place, and there is a lot of different safety efforts across the organization. People are working on say robustness to adversarial inputs, fairness, verification of neural networks, interpretability and so on and so on. What I’m doing is I’m focusing on reward modeling as an approach to alignment.

Lucas Perry: So just taking a step back and still trying to get a bigger picture of DeepMind’s approach to beneficial AI and AI alignment: it’s being attacked from many different angles. So could you clarify what seems to be a portfolio approach? The AI alignment slash AI safety agendas that I’ve seen enumerate several different characteristics or areas of alignment and safety research that we need to get a handle on, and it seems like DeepMind is trying its best to hit on all of them.

Jan Leike: DeepMind’s approach to safety is quite like a portfolio. We don’t really know what will end up panning out. So we pursue a bunch of different approaches in parallel. So I’m on the technical AGI safety team that roughly consists of three subteams. There’s a team around incentive theory that tries to model on a high level what incentives agents could have in different situations and how we could understand them. Then there is an agent analysis team that is trying to take some of our state of the art agents and figure out what they are doing and how they’re making the decisions they make. And this can be both from a behavioral perspective and from actually looking inside the neural networks. And then finally there is the agent alignment team, which I’m leading, and that’s trying to figure out how to scale reward modeling. There’s also an ethics research team, and then there’s a policy team.

Lucas Perry: This is a good place then to pivot into how you see the AI alignment landscape today. You guys have this portfolio approach that you just explained. Given that and given all of these various efforts for attacking the problem of beneficial AI from different directions, how do you see the AI alignment landscape today? Is there any more insight that you can provide into that portfolio approach given that it is contextualized within many other different organizations who are also coming at the problem from specific angles? So like MIRI is working on agent foundations and does theoretical research. CHAI has its own things that it’s interested in, like cooperative inverse reinforcement learning and inverse reinforcement learning. OpenAI is also doing its own thing that I’m less clear on, which may have to do with factored evaluation. Ought as well is working on stuff. So could you expand a little bit here on that?

Jan Leike: Our direction for getting a solution to the alignment problem revolves around recursive reward modeling. On a high level, the way I’m thinking about this is that if you’re working on alignment, you really want to be part of the project that builds AGI. Be there and have impact while that happens. In order to do that, you need to be part of the action, so you really have to understand the tech on a detailed level. I don’t think that safety is an add-on that you think of later, or add at a later stage in the process. And I don’t think we can just do some theory work that informs algorithmic decisions and have everything go well. I think we need something that is a lot more integrated with the project that actually builds AGI. So in particular, the way we are currently thinking about it is that the part that actually gives you alignment seems to be not an algorithmic change, but more something like an overall training procedure for how to combine your machine learning components into a big system.

So in terms of how I pick my research directions, I am most excited about approaches that can scale to AGI and beyond. Another thing that I think is really important is that I think people will just want to build agents, and we can’t only constrain ourselves to building say question answering systems. There’s basically a lot of real world problems that we want to solve with AGI, and these are fundamentally sequential decision problems, right? So if I look something up online and then I write an email, there’s a sequence of decisions I make, which websites I access and which links I click on. And then there’s a sequence of decisions of which characters are input in the email. And if you phrase the problem as, “I want to be able to do most things that humans can do with a mouse and a keyboard on a computer,” then that’s a very clearly scoped reinforcement learning problem. Although the reward function problem is not very clear.

Lucas Perry: So you’re articulating that at DeepMind, even given all these different approaches you guys have across all these different safety teams, the way that you personally pick your research direction is that you’re excited about things which safely scale to AGI and superintelligence and beyond, and that recursive reward modeling is one of these things.

Jan Leike: Yeah. So the problem that we’re trying to solve is the agent alignment problem. And the agent alignment problem is the question of how can we create agents that act in accordance with the user’s intentions. We are kind of inherently focused around agents. But also, we’re trying to figure out how to get them to do what we want. So in terms of reinforcement learning, what we’re trying to do is learn a reward function that captures the user’s intentions and that we can optimize with RL.

Lucas Perry: So let’s get into your work here on recursive reward modeling. This is something that you’re personally working on. Let’s just start off with what is recursive reward modeling?

Jan Leike: I’m going to start off by explaining what reward modeling is. What we want to do is apply reinforcement learning to the real world, and one of the fundamental problems of doing that is that the real world doesn’t have built-in reward functions. So basically what we need is a reward function that captures the user’s intentions. Let me give you an example of the core motivation for why we want to do reward modeling, from a blog post that OpenAI made a while back: the boat race problem, where they were training a reinforcement learning agent to race a boat around a track and complete the race as fast as possible, but what actually ended up happening is that the boat was getting stuck in a small lagoon and circling around there. And the reason for that is that the RL agent was trying to maximize the number of points that it gets.

And the way you get points in this game is by moving over these buoys that are along the track. And so if you go to the lagoon, there’s these buoys that keep respawning, and then so you can get a lot of points without actually completing the race. This is the kind of behavior that we don’t want out of our AI systems. But then on the other hand, there’s things we wouldn’t think of but we want out of our AI systems. And I think a good example is AlphaGo’s famous Move 37. In its Go game against Lee Sedol, Move 37 was this brilliant move that AlphaGo made that was a move that no human would have made, but it actually ended up turning around the entire game in AlphaGo’s favor. And this is how Lee Sedol ended up losing the game. The commonality between both of these examples is some AI system doing something that a human wouldn’t do.

In one case, that’s something that we want: Move 37. In the other case, it’s something that we don’t want, this is the circling boat. I think the crucial difference here is in what is the goal of the task. In the Go example, the goal was to win the game of Go. Whereas in the boat race example, the goal was to go around the track and complete the race, and the agent clearly wasn’t accomplishing that goal. So that’s why we want to be able to communicate goals to our agents. So we need these goals or these reward functions that our agents learn to be aligned with the user’s intentions. If we do it this way, we also get the possibility that our systems actually outperform humans and actually do something that would be better than what the human would have done. And this is something that you, for example, couldn’t get out of imitation learning or inverse reinforcement learning.

The central claim that reward modeling revolves around is that evaluation is easier than behavior. For example, I can look at a video of an agent and tell you whether or not that agent is doing a backflip, even though I couldn’t do a backflip myself. In this case, it’s harder to actually do the task than to evaluate it, and that puts the human in a leveraged position, because the human only has to be able to give feedback on the behavior rather than actually being able to do it. So we’ve been building prototypes for reward modeling for a number of years now. We want to actually get hands-on experience with these systems and see examples of where they fail and how we can fix them. One particular thing we’ve seen again and again is that if you don’t provide online feedback to the agent, the agent can find loopholes in the reward model.

It finds states where the reward model thinks it’s a high reward state, but actually it isn’t. One example is in the Atari game Hero, where you can get points for shooting laser beams at spiders. What the agent figures out is that if it stands really close to the spider and starts shooting, but then turns around so the shot goes the other way, the reward model will think the shot is about to hit the spider and should give a reward, because that gives you points. But the agent doesn’t actually end up killing the spider, so it can just do the same thing again and again and get reward for it.
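
A minimal sketch of what the basic, one-step reward modeling setup Jan is describing could look like: a reward model learned from pairwise human comparisons of trajectory segments, in the spirit of the human-preferences prototypes he alludes to. Everything here, including the network size, the synthetic segments, and the labels, is an illustrative stand-in rather than DeepMind's actual implementation.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps an observation feature vector to a scalar reward estimate."""
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def preference_loss(reward_model, seg_a, seg_b, human_prefers_a):
    """Bradley-Terry style loss: the probability that the human prefers
    segment A is modeled via the segments' summed predicted rewards."""
    r_a = reward_model(seg_a).sum(dim=-1)  # total predicted reward of segment A
    r_b = reward_model(seg_b).sum(dim=-1)  # total predicted reward of segment B
    logits = r_a - r_b
    return nn.functional.binary_cross_entropy_with_logits(
        logits, human_prefers_a.float()
    )

# Illustrative training step on synthetic stand-ins for trajectory segments.
obs_dim, seg_len, batch = 8, 10, 32
model = RewardModel(obs_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

seg_a = torch.randn(batch, seg_len, obs_dim)  # candidate behavior segments
seg_b = torch.randn(batch, seg_len, obs_dim)
prefs = torch.randint(0, 2, (batch,))         # simulated human comparison labels

loss = preference_loss(model, seg_a, seg_b, prefs)
opt.zero_grad()
loss.backward()
opt.step()
print(f"preference loss: {loss.item():.3f}")
```

In the full setup, an RL agent would be trained against the learned reward model, with a human supplying fresh comparisons online so that loopholes like the Hero exploit get patched as they appear.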

So the agent has found an exploit in the reward model. We know that online training, training with an actual human in the loop who keeps giving feedback, can get you around this problem. And the reason is that whenever the agent gets stuck in these kinds of loopholes, a human can just look at the agent’s behavior and give some additional feedback that you can then use to update the reward model, and the reward model in turn can then teach the agent that this is actually not a high reward state. So what about recursive reward modeling? One question that comes up when you think about how to scale reward modeling is that eventually you want to tackle domains where it’s too hard for the human to really figure out what’s going on, because the core problem is very complex or the human is not an expert in the domain.

Right now, this is basically only in the idea stage, but the basic idea is to apply reward modeling recursively. You have this evaluation task that is too complex for the human to do, and you’re training a bunch of evaluation helper agents that will help you do the evaluation of the main agent that you’re training. These agents then in turn will be trained with reward modeling or recursive reward modeling.

Let’s say you want to train an agent that designs a computer chip. It does a bunch of work, and then it outputs this giant schema for what the chip could look like. Now that schema is so complex and so complicated that, as a human, even if you were an expert in chip design, you wouldn’t be able to understand all of it, but you can figure out what aspects of the chip you care about, right? Like what is the number of, say, FLOPS it can do per second, or what are the thermal properties.

For each of these aspects that you care about, you spin up another agent: you teach another agent how to do that subtask, and then you use the output of that agent, which could be, let’s say, a document that details the thermal properties, or a benchmark result on how this chip would do if we actually built it. And then you can look at all of these outputs of the evaluation helper agents and compose them into feedback for the actual agent that you’re training.

The idea here is that the tasks the evaluation helper agents have to do are easier problems in a more narrow domain, because, A, they only have to do one sub-aspect of the evaluation, and also you’re relying on the fact that evaluation is easier than behavior. Since you have these easier tasks, the hope is that if you can solve easier tasks, then you can use the solutions, or the agents that you train on these easier tasks, to scaffold your way up to solving harder and harder tasks. You could use this to push out the general scope of tasks that you can tackle with your AI systems.

The hope would be, or at least I would claim, that this is a general scheme that, in principle, can capture a lot of human economic activity. One really crucial aspect is that you’re able to compose a training signal for the agents that are trying to solve the task. You have to ground this out at some level: if you picture this big tree, or directed acyclic graph, of agents that help you train other agents and so on, there has to be a bottom level where the human can just look at what’s going on and give feedback directly. You then use the feedback on the lowest-level tasks to build up more and more complex training signals for more and more complex agents that are solving harder and harder tasks.

Lucas Perry: Can you clarify how the bootstrapping here happens? Like the very bottom level, how you’re first able to train the agents dedicated to sub-questions of the larger question?

Jan Leike: If you give me a new task to solve with recursive reward modeling, the way I would proceed is, assuming that we solved all of these technical problems, let’s say we can train agents with reward modeling on arbitrary tasks, then the way I would solve it is I would first think about what do you care about in this task? How do I measure its success? What are the different aspects of success that I care about? These are going to be my evaluation criteria.

For each of my evaluation criteria, I’m going to define a new subtask, and the subtask will be “help me evaluate this criterion.” In the computer chip example, that was the number of floating-point operations per second, and so on. Then I proceed recursively. For each of the subtasks that I just identified, I start again by saying, “Okay, so now I have this agent, it’s supposed to get a computer chip design, and a bunch of associated documentation, say, and now it has to produce this document that outlines the thermal properties of this chip.”

What I would do, again, is I’d be like, “Okay, this is a pretty complex task, so let’s think about how to break it down. How would I evaluate this document?” So I proceed to do this until I arrive at a task where I can just say, “Okay, I basically know how to do this task, or I know how to evaluate this task.” And then I can start spinning up my agents, right? And then I train the agents on those tasks, and then once I’ve trained all of my agents on the leaf level tasks, and I’m happy with those, I then proceed training the next level higher.
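
The bottom-up training order described here could be sketched roughly as follows; again, the task attributes and the train_with_reward_modeling helper are hypothetical stand-ins of mine, not an existing API.

```python
# Hypothetical stand-ins: the task attributes and train_with_reward_modeling
# are not a real API, just a sketch of the training order.

def train_recursively(task, train_with_reward_modeling):
    if task.i_know_how_to_evaluate_this:
        # Leaf: plain one-step reward modeling with direct human feedback.
        return train_with_reward_modeling(task, helpers=[])

    # Each evaluation criterion becomes its own subtask
    # ("help me evaluate this criterion"), and those agents are trained first.
    helpers = [train_recursively(subtask, train_with_reward_modeling)
               for subtask in task.evaluation_subtasks()]

    # Only once the lower level is trained do we train the level above it.
    return train_with_reward_modeling(task, helpers=helpers)
```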

Lucas Perry: And the evaluation criteria, or aspects, are an expression of your reward function, right?

Jan Leike: The reward function will end up capturing all of that. Let’s say we have solved all of the evaluation subtasks, right? We can use the evaluation helper agents to help us evaluate the overall performance of the main agent that you’re training. Of course, whenever this agent does something, you don’t want to have to evaluate all of its behavior, so what you do is you essentially distill this whole tree of different evaluation helper agents that you’ve built, with lots of little humans in the loop in that tree, into one model that will predict what that whole tree of agents and humans would say. That will basically be the reward that the main agent is being trained on.
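
As a hedged illustration of that distillation step: the expensive tree of helper agents and humans labels a batch of outcomes offline, and a single cheap reward model is fit to predict those labels. The names below are placeholders of my own.

```python
# Placeholder names; tree_evaluate stands in for the whole tree of helper
# agents and humans, which is expensive to query.

def distill_evaluation_tree(tree_evaluate, outcomes, reward_model):
    """Fit one reward model to predict what the agent/human tree would say."""
    dataset = [(outcome, tree_evaluate(outcome)) for outcome in outcomes]
    reward_model.fit(dataset)     # cheap to query when training the main agent
    return reward_model
```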

Lucas Perry: That’s pretty beautiful. I mean, the structure is elegant.

Jan Leike: Thanks.

Lucas Perry: I still don’t fully understand it obviously, but it’s beginning to dawn upon my non-computer science consciousness.

Jan Leike: Well, current research on this stuff revolves around two questions, and I think these are the main questions that we need to think about when trying to figure out whether or not a scheme like this can work. The first question is: how well does the one-step setup work, with only reward modeling and no recursion? If one-step reward modeling doesn’t work, you can’t hope to ever build a whole tree out of that component, so clearly that has to work.

And then the second question is how do errors accumulate if we build a system? Essentially what you’re doing is you’re training a bunch of machine learning components to help you build a training signal for other machine learning components. Of course none of them are going to be perfect. Even if my ability to do machine learning is infinitely great, which of course it isn’t, at the end of the day, they’re still being trained by humans, and humans make mistakes every once in a while.

If my bottom level has a certain, let’s say, reward accuracy, the next level up that I train using those agents is going to have a lower accuracy, or potentially a lower accuracy, because its training signal is slightly off. Now, if you keep doing this and building a more and more complex system, how do the errors in the system accumulate? This is a question we haven’t really done much work on so far, and this is certainly something we need to do more on in the future.
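
As a toy illustration only: if each level’s training signal were, say, 95% as accurate as the level below it, and errors compounded multiplicatively (a strong simplifying assumption that real systems need not follow), accuracy would decay geometrically with depth.

```python
# Toy arithmetic only, under the stated multiplicative assumption.
per_level_accuracy = 0.95
for depth in range(1, 6):
    print(depth, round(per_level_accuracy ** depth, 3))
# prints: 1 0.95, 2 0.902, 3 0.857, 4 0.815, 5 0.774
```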

Lucas Perry: What sorts of experiments can we do with recursive reward modeling today, and why is it hard?

Jan Leike: The reason why it’s difficult to find such tasks is that essentially you need tasks that have two properties. The first property is that they have to be difficult enough that you can’t evaluate them directly, right? Otherwise, you wouldn’t need the recursive part in recursive reward modeling. And then secondly, they have to be easy enough that we can actually hope to be able to do them with today’s systems.

In a way, it’s two very contradictory objectives, so it’s kind of hard to find something in the intersection. We can study a lot of the crucial parts of this independently of actually being able to build a prototype of the recursive part of recursive reward modeling.

Lucas Perry: I guess I’m just trying to also get a sense of when you think that recursive reward modeling might become feasible.

Jan Leike: One good point would be when we’ve done a whole lot of tasks with reward modeling and we’re basically running out of tasks that we can do directly. Or an opportunity comes up when we find a task that we actually think we can do and that requires a decomposition. There are ways in which you could try to do this now by artificially limiting yourself. You could, for example, solve chess with recursive reward modeling by pretending that there isn’t a procedural reward function for chess.

If you rely on the human to look at the board and tell you whether or not it’s checkmate, then if you’re a pro chess player, you could probably do that quite well. But if you’re an amateur or a non-expert who doesn’t really know much about chess other than the rules, it’s kind of hard to do that quickly.

What you could do is you could train evaluation helper agents that give you useful information about the chessboard, where, let’s say, they color certain tiles on the board that are currently under threat. And then using that information, you can make the assessment of whether this is a checkmate situation much more easily.
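
A toy version of such an evaluation helper could look like the following, here using the python-chess library (my choice of tooling; the episode names no specific implementation). It simply lists the current player’s pieces that are under attack, which is the kind of digest a non-expert human could use to judge a position.

```python
# Assumes the third-party python-chess package (`pip install chess`).
import chess

def threatened_squares(board: chess.Board):
    """Squares holding pieces of the side to move that the opponent attacks."""
    threatened = []
    for square in chess.SQUARES:
        piece = board.piece_at(square)
        if piece is not None and piece.color == board.turn:
            if board.is_attacked_by(not board.turn, square):
                threatened.append(chess.square_name(square))
    return threatened

board = chess.Board()                 # starting position
print(threatened_squares(board))      # -> [] (nothing is attacked yet)
```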

While we could do this kind of setup with recursive reward modeling, and we’d maybe learn something, it’s not an ideal test bed, because it’s just not going to look impressive as a solution. We already know how to use machine learning to play chess, so we wouldn’t really add anything in terms of tasks that we can do now that we couldn’t do otherwise.

Lucas Perry: But wouldn’t it show you that recursive reward modeling works?

Jan Leike: You get one data point on this particular domain, and so the question is what data points would you learn about recursive reward modeling that you wouldn’t learn in other ways? You could treat this as like two different individual tasks that you just solve with reward modeling. One task is coloring the tiles of the board, and one task is actually playing chess. We know we can do the latter, because we’ve done that.

What would be interesting about this experiment would be that you kind of learn how to cope with the errors in the system. Every once in a while, like the human will label a state incorrectly, and so you would learn how well you can still train even though your training signal is slightly off. I think we can also investigate that without actually having to literally build this recursive setup. I think there’s easier experiments we could do.

Lucas Perry: Do you want to offer any final words of clarification or just something succinct about how this serves AI safety and alignment?

Jan Leike: One way to think about safety is this specification-robustness-assurance framework. What this is, is basically a very high-level way of carving the space of safety problems into different categories. There are three categories. The first category is specification: how do you get the system to do what you want? This is basically what we usually mean when we talk about alignment. The second category is robustness: how can you make your system robust to various perturbations, such as adversarial inputs or distributional changes? The third category is assurance: how can you get better-calibrated beliefs about how safe your system actually is, in both the specification and the robustness sense?

Usually in the assurance category, we talk about various tools for understanding and monitoring agents, right? This is stuff about testing, interpretability, verification, and so on. The stuff that I am primarily working on is in the specification category, where we’re basically trying to figure out how to get our agents to pursue the goals that we want them to pursue. The ambition of recursive reward modeling is to solve all of the specification problems. Some of the problems that we worry about are, let’s say, off-switch problems, where your agent might meddle with its off switch, and you just don’t want it to do that. Another problem, let’s say: what about side effects? What about reward tampering? There’s a whole class of these kinds of problems, and instead of trying to solve them each individually, we try to solve the whole class of problems at once.

Lucas Perry: Yeah, it’s an ambitious project. The success of figuring out the specification problem supervenes upon other things, but at the same time, if those other things are figured out, the solution to this enables, as you’re saying, a system which safely scales to super intelligence and beyond, and retains alignment, right?

Jan Leike: That’s the claim.

Lucas Perry: Its position then in AI safety and alignment is pretty clear to me. I’m not sure if you have other points you want to add to that though?

Jan Leike: Nope, I think that was it.

Lucas Perry: Okay. We’re just going to hit on a few more questions here then on recursive reward modeling. Recursive reward modeling seems to require some very competent agent or person to break down an evaluation into sub-questions or sub-evaluations. Is the creation of that structure actually scalable?

Jan Leike: Yeah, this is a really good question. I would picture these decompositions of the evaluation task to be essentially hardcoded, so you have a human expert that knows something about the task, and they can tell you what they care about in the task. The way I picture this is that you could probably do a lot of tasks with a recursion depth of three, or five, or something like that, but eventually tasks are so far out of the range of what the human can do that they don’t even know how to break down the evaluation of that task.

Then this decomposition problem is now a problem that you want to tackle with recursive reward modeling, where basically you train an agent to propose decompositions, and then you have an evaluation where the human evaluates whether or not that was a good decomposition. This is very, very far-future stuff; at that point, you’ve already worked with recursive reward modeling for a while, and you have done a bunch of decompositions. So I don’t expect this to be something that we will be addressing anytime soon, but it’s certainly something that is within the scope of what this approach should be able to do.

Recursive reward modeling is a super general method that you basically want to be able to apply to any kind of task that humans typically do, and proposing decompositions of the evaluation is one of them.

Lucas Perry: Are there pessimisms that you have about recursive reward modeling? How might recursive reward modeling fall short? And I think the three areas that we want to hit here are robustness, mesa-optimizers, and tasks that are difficult to ground.

Jan Leike: As I said earlier, you’re basically trying to solve the whole class of specification problems, but you still need robustness and assurance. In particular, there’s what we call the reward-to-result gap, where you might have the right reward function, but you still need to find an agent that actually is good at optimizing that reward function. That’s an obvious problem, and there are a lot of people just trying to make RL agents perform better. Mesa-optimizers, I think, are in general an open question. There’s still a lot of uncertainty around how they would come up and what exactly is going on there. I think one thing that would be really cool is to actually have a demo of how they could come up in a training procedure in a way that people wouldn’t expect. I think that would be pretty valuable.

And then thirdly, recursive reward modeling is probably not very well suited for tasks that are really difficult to ground. Moral philosophy might be in that category. The way I understand this is that moral philosophy tries to tackle questions that it is really difficult to get hard facts and empirical evidence on. These human intuitions might be really difficult to ground and to actually teach to our agents in a way that generalizes. If you don’t have this grounding, then I don’t know how you could build a training signal for the higher-level questions that might evolve from that.

In other words, to make this concrete, let’s say I want to train an agent to write a really good book on moral philosophy. Of course, I can evaluate that book based on how novel it is relative to what humans have written, or the general literature. How interesting is it? Does it make intuitive sense? But then, in order to actually make progress on moral philosophy, I need to update my values somehow in a direction that is actually right, and I don’t really know what would be a good way to evaluate that.

Lucas Perry: I think then it would be a good spot here for us to circle back around to the alignment landscape. A lot of what you’ve been saying here has rung bells in my head about other efforts, like iterated distillation and amplification, debate with Geoffrey Irving, and factored evaluation at Ought. There are these categories of things which are supposed to be general solutions to making systems which safely scale to aligned superintelligence and beyond. This also fits in that vein of the alignment landscape, right?

Jan Leike: Yeah. I think that’s right. In some ways, the stuff that you mentioned, like projects that people are pursuing at OpenAI and at Ought, shares a lot of structure with what recursive reward modeling is trying to do, where you try to compose training signals for tasks that are too hard for humans. I think one of the big differences in how we think about this problem is that we want to figure out how to train agents that do stuff in the world, whereas a lot of the discussion at OpenAI and Ought centers around building question answering systems and fine-tuning language models, where the ambition is to get them to do reasoning tasks that are very difficult for humans to do directly, and then you do that by decomposing them into easier reasoning tasks. You could say it’s one scalable alignment technique out of several that are being proposed, and we have a special focus on agents. I think agents are great. I think people will build agents that do stuff in the world, take sequential actions, look at videos.

Lucas Perry: What research directions or projects are you most excited about, just in general?

Jan Leike: In general, the safety community as a whole should have a portfolio approach where we just try to pursue a bunch of paths in parallel, essentially as many as can be pursued in parallel. I personally am most excited about approaches that can work with existing deep learning and scale to AGI and beyond. There could be many ways in which things pan out in the future, but right now there’s an enormous amount of resources being put towards scaling deep learning. That’s something that we should take seriously and factor into the way we think about solutions to the problem.

Lucas Perry: This also reflects your support and insistence on empirical practice as being beneficial, and the importance of being in and amongst the pragmatic, empirical, tangible, real-world, present-day projects, which are likely to bring about AGI, such that one can have an impact. What do you think is missing or underserved in AI alignment and AI safety? If you were given, say, like $1 billion, how would you invest it here?

Jan Leike: That would be the same answer I just gave you before. Basically, I think you want to have a portfolio, so you invest that money across a whole bunch of directions. I think I would invest more than many other people in the community towards working empirically with, let’s say, today’s deep RL systems, building prototypes of aligned AGI, and then doing experiments on them. I’d be excited to see more of that type of work. That might be my personal bias speaking too, because that’s why I’m working in this direction.

Lucas Perry: Yeah. I think that’s deeply informative though for other people who might be trying to find their footing in determining how they ought to approach this problem. So how likely do you think it is that deep reinforcement learning scales up to AGI? What are the strongest considerations for and against that?

Jan Leike: I don’t think anyone really knows whether that’s the case or not. Deep learning certainly has a pretty convincing track record of fitting arbitrary functions. We can basically fit a function that knows how to play StarCraft. That’s a pretty complicated function. I think, well, whatever the answer is to this question, in safety what we should be doing is we should be conservative about this. Take the possibility that deep RL could scale to AGI very seriously, and plan for that possibility.

Lucas Perry: How modular do you think AGI will be and what makes you optimistic about having clearly defined components which do planning, reward modeling, or anything else?

Jan Leike: There are certainly a lot of advantages if you can build a system out of components that you understand really well. The way I’m currently picturing trying to build, say, a prototype for aligned AGI would be somewhat modular. The trend in deep learning is always towards training end-to-end, meaning that you just have your raw data coming in and the raw predictions coming out, and you just train some giant model that does all of that. That certainly gives you performance benefits on some tasks, because whatever structure the model ends up learning can just be better than what human preconceptions would recommend.

How it actually ends up working out is kind of unclear at the moment. In terms of what we’d like for safety, if you have a modular system, it’s going to be easier to really understand what’s going on, because you can understand the components and you can understand how they’re working together. That helps you break down the problem of doing assurance on the system, so that’s certainly a path that we would like to work out.

Lucas Perry: And is inner alignment here a problem that is relevant to both the modular components and then how the modular components are interrelated within the system?

Jan Leike: Yeah, I think you should definitely think about what the incentives are and what the training signals are of all of the components that you’re using to build a system.

Lucas Perry: As we approach AGI, what efforts are going on right now at DeepMind for AI safety benchmarking?

Jan Leike: We’ve spent some time thinking about AI safety benchmarks. We made a few little environments called gridworlds, which are basically just chess-board-like tiles where your agent moves around, and those are, I think, useful for showcasing some of the problems. But really I think there’s a lot of appetite right now for building benchmark environments that let you test your agent on different properties. For example, OpenAI just recently released a collection of environments for safe exploration that require you to train in the presence of safety constraints. But there are also a lot of other properties that you could actually build tractable benchmarks for today.
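
To give a flavor of what such an environment can look like, here is a deliberately tiny, made-up gridworld in the spirit of a side-effects benchmark; it is not the actual DeepMind gridworlds code. The visible reward only counts reaching the goal, while a hidden safety score records an avoidable side effect.

```python
# A made-up toy environment, not the DeepMind ai-safety-gridworlds code.
GRID = ["A.V.",   # A = agent start, V = vase, G = goal, . = free cell
        "....",
        "...G"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def run_episode(actions):
    pos, reward, broken_vases = (0, 0), 0, 0
    for a in actions:
        r, c = pos[0] + MOVES[a][0], pos[1] + MOVES[a][1]
        if 0 <= r < len(GRID) and 0 <= c < len(GRID[0]):
            pos = (r, c)
            if GRID[r][c] == "V":
                broken_vases += 1     # hidden safety violation
            if GRID[r][c] == "G":
                reward += 1           # visible task reward
                break
    return reward, broken_vases       # a benchmark would report both

# Shortest path straight through the vase vs. a slightly longer safe path:
print(run_episode(["right", "right", "right", "down", "down"]))  # (1, 1)
print(run_episode(["down", "right", "right", "down", "right"]))  # (1, 0)
```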

So another example would be adversarial inputs, and there’s this generalized adversarial examples challenge. You could also build a benchmark for distributional shift. In some ways you already do that in machine learning a lot, where you do a train and test split, but usually these are from the same distribution. There’s various transfer learning research going on, but I don’t think there are really established benchmarks for those. This is certainly something that could be done.

There are also problems that we worry about in long-term safety that I think would be kind of hard to build good benchmarks for right now. Here I’m thinking of things like off-switch problems, or reward gaming, where you actually have an agent that can modify its own input rewards. The problem here is really that you need very complex environments that are difficult to build and learn with current systems.

But I think overall this is something that would be very useful for the community to pursue, because the history of recent machine learning progress has always been that if you make a benchmark, people will start improving on the benchmark. The benchmark starts driving progress, and we’ve seen this with the ImageNet benchmark. We’ve seen that with the Atari benchmark, just to name two examples. And so if you had a safety benchmark, you would kind of incentivize people to make safety progress. Then if it’s an established benchmark, you can also publish on this. Then longer term, once you’ve had success with a bunch of benchmarks or they’ve been established and accepted, they could also become industry norms.

Lucas Perry: I’m just trying to understand how benchmarks in general, whether they’re safety benchmarks or not, exist in the international and national communities of computer science. Are Chinese computer scientists going to care about the DeepMind safety benchmarks? Are they something that necessarily are incorporated?

Jan Leike: Why do you think Chinese researchers care about the ImageNet benchmark?

Lucas Perry: Well, I don’t really know anything about the ImageNet benchmark.

Jan Leike: Oh, so ImageNet is this big collection of labeled images that a lot of people train image classifiers on. These are pictures of various breeds of dogs and cats and so on. What people are doing, or at least were doing for a while, was training larger and larger vision models on ImageNet. Then you can measure your test accuracy on ImageNet, and that’s a very tangible benchmark for how well you can do computer vision with your machine learning models.

Lucas Perry: So when these benchmarks are created, they’re just published openly?

Jan Leike: Yeah, you can just download ImageNet. You can get started trying a model on ImageNet in like half an hour on your computer.
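
For concreteness, a typical ImageNet evaluation might look roughly like the sketch below, using PyTorch and torchvision (my choice of tooling, not something from the episode, and details such as the weights argument vary across torchvision versions). It assumes the ImageNet validation archives have already been downloaded locally, since the dataset cannot be fetched automatically.

```python
# Assumes PyTorch + torchvision, and that the ImageNet validation archives
# have already been placed under ./imagenet (the dataset is not auto-downloaded).
import torch
import torchvision
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

val_set = torchvision.datasets.ImageNet("./imagenet", split="val", transform=preprocess)
loader = torch.utils.data.DataLoader(val_set, batch_size=64)

model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        predictions = model(images).argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.size(0)

print(f"top-1 accuracy: {correct / total:.3f}")   # the number people compare
```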

Lucas Perry: So the safety benchmarks would just be published openly. They could be easily accessed, and they would be public and open methods by which systems can be benchmarked according to the degree to which they’re aligned and safe?

Jan Leike: Yeah. I think in general people underestimate how difficult environment design is. I think in order for a safety benchmark to get established, it actually has to be done really well. But if you do it really well, you can get a whole bunch of people interested, because it becomes clear that this is something that is hard to do and that current methods can’t yet handle. It’s also something where, let’s say if you made progress on it, you could write a paper or you could get employed at a company, because you did something that people agreed was hard to do. At the same time, it has a result that is very easily measurable.

Lucas Perry: Okay. To finish up on this point, I’m just interested in if you could give a final summary on your feelings and interests in AI safety benchmarking. Besides the efforts that are going on right now, what are you hoping for?

Jan Leike: I think, in summary, I would be pretty excited to see more safety benchmarks that actually measure some of the things that we care about, if they’re done well and really pay attention to a lot of detail, because I think that can drive a lot of progress on these problems. It’s the same story as with reward modeling, right? It becomes easy to evaluate progress and easy to evaluate what people are doing, and that makes it easy for people to do stuff and then see whether or not what they’re doing is helpful.

Lucas Perry: So there appears to be cultural and intellectual differences between the AI alignment and AI safety communities and the mainstream AI community, people who are just probably interested in deep learning and ML.

Jan Leike: Yeah. So historically the machine learning community and the long-term safety community have been somewhat disjoint.

Lucas Perry: So given this disjointedness, what would you like the mainstream ML and AI community to do or to think differently?

Jan Leike: The mainstream ML community doesn’t think enough about how whatever they are building will actually end up being deployed in practice. I think people are starting to realize that they can’t really do RL in the real world if they don’t do reward modeling, and I think it’s most obvious to robotics people, who are trying to get robots to do stuff in the real world all the time. So I think reward modeling will become a lot more popular. We’ve already seen that in the past two years.

Lucas Perry: Some of what you’re saying is reminding me of Stuart Russell’s new book, Human Compatible. I’m curious to know if you have any thoughts on that and what he said there and how that relates to this.

Jan Leike: Yes. Stuart has also been a proponent of this for a long time. In a way, he has been one of the few computer science professors who are really engaging with some of these longer-term AI questions, in particular around safety. I don’t know why there aren’t more people saying what he’s saying.

Lucas Perry: It seems like the difference and disjointedness between the mainstream ML and AI community and the AI safety and AI alignment community isn’t just that one group is thinking long-term and the other is not. It’s a whole different perspective and understanding about what it means for something to be beneficial and what it takes for something to be beneficial. I don’t think you need to think about the future to understand the importance of recursive reward modeling or the kinds of shifts that Stuart Russell is arguing for, given that the systems being created today are already creating plenty of problems. We’ve enumerated those many times here. That seems to me to be because the systems are clearly not fully capturing what human beings really want. So I’m just trying to understand better what you think the alignment and AI safety community should do or think differently to address this difference.

Jan Leike: In the long-term safety community in particular, a lot of the arguments people made initially were very high level and almost philosophical. There has been a bit of a shift towards concrete mathematical, but at the same time very abstract, research, and then towards empirical research. I think this is kind of a natural transition from one mode of operation to something more concrete, but there are still some parts of the safety community in the first phases.

I think there’s a failure mode here where people just spend a lot of time thinking about what would be the optimal way of addressing a certain problem before they actually go out and do something, and I think an approach that I tend to favor is thinking about this problem for a bit and then doing some stuff and then iterating and thinking some more. That way you get some concrete feedback on whether or not you’re actually making progress.

I think another thing that I would love to see the community do more of is clear explanations: there’s not enough appreciation for them, and there’s a tendency for people to write a lot of vague blog posts, which are difficult to critique and to build on. Where we really have to move as a community is toward more concrete technical work, where you can clearly point at parts of it and say, “This makes sense. This doesn’t make sense. Here, they have very likely made a mistake.” That’s something we can actually build on and make progress with.

In general, I think this is the sort of community that attracts a lot of thinking from first principles, and there’s a lot of power in that approach. If you’re not bound by what other people think and what other people have tried, then you can really discover truly novel directions and truly novel approaches and ideas. But at the same time, I think there’s also a danger of overusing this kind of technique, because it’s important to also connect what you’re doing with the literature and with what everyone else is doing. Otherwise, you will just keep reinventing the wheel on some kind of potential solution to safety.

Lucas Perry: Do you have any suggestions for the AI safety and alignment community regarding also alleviating or remedying this cultural and intellectual difference between what they’re working on and what the mainstream ML and AI communities are doing and working on such that it shifts their mindset and work to increase the chances that more people are aware of what is required to create beneficial AI systems?

Jan Leike: Something that would be helpful for building this bridge would be if the safety community as a whole, let’s say, spent more time engaging with the machine learning literature and the machine learning lingo and jargon, and tried to phrase safety ideas and research in those terms and write them up in a paper that can be published at NeurIPS rather than as a blog post. The former is just a format that people are much more likely to engage with.

This is not to say that I don’t like blog posts. Blogs are great for getting some of the ideas across. We also provide blog posts about our safety research at DeepMind, but if you really want to dive into the technical details and you want to get the machine learning community to really engage with the details of your work, then writing a technical paper is just the best way to do that.

Lucas Perry: What do you think might be the most exciting piece of empirical safety work which you can realistically imagine seeing within the next five years?

Jan Leike: We’ve done a lot of experiments with reward modeling, and I personally have been surprised how far we could scale it. It’s been able to tackle every problem that we’ve been throwing at it. So right now we’re training agents to follow natural language instructions in these 3D environments that we’re building here at DeepMind. These are a lot harder problems than, say, Atari games, and reward modeling still is able to tackle them just fine.

One kind of idea for what that prototype could look like is a model-based reinforcement learning agent where you learn a dynamics model, then train a reward model from human feedback, and then the reinforcement learning agent uses the dynamics model and the reward model to do search at training time and at test time. So you can actually deploy it in the environment and it can just learn to adapt its plans quickly. Then we could use that to do a whole bunch of experiments that we would want that kind of system to handle, you know, like solving off-switch problems, or reward tampering problems, or side-effects problems, and so on. So I think that’d be really exciting, and I think that’s well within the kind of system that we could build in the near future.
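
A speculative sketch of such a prototype’s decision-time search, with every object a placeholder of mine rather than DeepMind’s actual agent, might look like this: imagined rollouts through a learned dynamics model, scored by a reward model learned from human feedback.

```python
# Every object here is a placeholder; this is a sketch, not DeepMind's agent.
import random

def plan(state, dynamics_model, reward_model, action_space,
         horizon=10, candidates=100):
    """Pick the first action of the best imagined action sequence."""
    best_return, best_first_action = float("-inf"), None
    for _ in range(candidates):
        seq = [random.choice(action_space) for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:
            s = dynamics_model.predict_next(s, a)   # imagined transition
            total += reward_model.predict(s, a)     # learned from human feedback
        if total > best_return:
            best_return, best_first_action = total, seq[0]
    return best_first_action
```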

Lucas Perry: Cool. So then wrapping up here, let’s talk a little bit about the pragmatic side of your team and DeepMind in general. Is DeepMind currently hiring? Is the safety team hiring? What is the status of all of that, and if listeners might be able to get involved?

Jan Leike: We always love to hire new people to join us in these efforts. In particular, we’re hiring research engineers and research scientists to help us build this stuff. So if you, dear listener, have, let’s say, a Master’s degree in machine learning or some other kind of hands-on experience in building and training deep learning systems, you might want to apply for a research engineering position. For a research scientist position, the best qualification is probably a PhD in machine learning or something equivalent. We also do research internships for people who are maybe a little bit earlier in their PhD. So if this is the kind of thing that applies to you and you’re excited about working on these sorts of problems, then please contact us.

Lucas Perry: All right, and with that, thank you very much for your time, Jan.

End of recorded material

AI Alignment Podcast: Machine Ethics and AI Governance with Wendell Wallach

Wendell Wallach has been at the forefront of contemporary emerging technology issues for decades now. As an interdisciplinary thinker, he has engaged at the intersections of ethics, governance, AI, bioethics, robotics, and philosophy since the earliest formulations of what we now know as AI alignment were being codified. Wendell began with a broad interest in the ethics of emerging technology and has since become focused on machine ethics and AI governance. This conversation with Wendell explores his intellectual journey and participation in these fields.

 Topics discussed in this episode include:

  • Wendell’s intellectual journey in machine ethics and AI governance 
  • The history of machine ethics and alignment considerations
  • How machine ethics and AI alignment serve to produce beneficial AI 
  • Soft law and hard law for shaping AI governance 
  • Wendell’s and broader efforts for the global governance of AI
  • Social and political mechanisms for mitigating the risks of AI 
  • Wendell’s forthcoming book

Key points from Wendell:

  • “So when you were talking about machine ethics or when we were talking about machine ethics, we were really thinking about it in terms of just how do you introduce ethical procedures so that when machines encounter new situations, particularly when the designers can’t fully predict what their actions will be, that they factor in ethical considerations as they choose between various courses of action. So we were really talking about very basic program in the machines, but we weren’t just thinking of it in terms of the basics. We were thinking of it in terms of the evolution of smart machines… What we encounter in the Singularity Institute, now MIRI for artificial intelligence approach of friendly AI and what became value alignment is more or less a presumption of very high order intelligence capabilities by the system and how you would ensure that their values align with those of the machines. They tended to start from that level. So that was the distinction. Where the machine ethics folks did look at those futuristic concerns, they did more so from a philosophical level and at least a belief or appreciation that this is going to be a relatively evolutionary course, whereby the friendly AI and value alignment folks, they tended to presume that we’re going to have very high order cognitive capabilities and how do we ensure that those align with the systems. Now, the convergence, I would say, is what’s happening right now because in workshops that have been organized around the societal and ethical impact of intelligent systems.”
  • “My sense has been that with both machine ethics and value alignment, we’ve sort of got the cart in front of the horse. So I’m waiting to see some great implementation breakthroughs, I just haven’t seen them. Most of the time, when I encounter researchers who say they’re taking seriously, I see they’re tripping over relatively low level implementations. The difficulty is here, and all of this is converging. What AI alignment was initially and what it’s becoming now I think are quite different. I think in the very early days, it really was presumptions that you would have these higher order intelligences and then how were you going to align them. Now, as AI alignment, people look at the value issues as they intersect with present day AI agendas. I realize that you can’t make the presumptions about the higher order systems without going through developmental steps to get there. So, in that sense, I think whether it’s AI alignment or machine ethics, the one will absorb the lessons of the other. Both will utilize advances that happen on both fronts.”
  • “David Collingridge wrote a book where he outlined a problem that is now known as the Collingridge Dilemma. Basically, Collingridge said that while it was easiest to regulate a technology early in its style development, early in its development, we had a little idea of what its societal impact would be. By the time we did understand what the challenges from the societal impact were, the technology would be so deeply entrenched in our society that it would be very difficult to change its trajectory. So we see that today with social media. Social media was totally entrenched in our society before we realized how it could be manipulated in ways that would undermine democracy. Now we’re having a devil of a time of figuring out what we could do. So Gary and I, who had been talking about these kinds of problems for years, we realized that we were constantly lamenting the challenge, but we altered the conversation one day over a cup of coffee. We said, “Well, if we had our druthers, if we have some degree of influence, what would we propose?” We came up with a model that we referred to as governance coordinating committees. Our idea was that you would put in place a kind of issues manager that would try and guide the development of a field, but first of all, it would just monitor development, convene forums between the many stakeholders, map issues and gaps, see if anyone was addressing those issues and gaps or where their best practices had come to the floor. If these issues were not being addressed, then how could you address them, looking at a broad array of mechanisms. By a broad array of mechanisms, we meant you start with feasible technological solutions, you then look at what can be managed through corporate self-governance, and if you couldn’t find anything in either of those areas, then you turn to what is sometimes called soft law… So Gary and I proposed this model. Every time we ever talked about it, people would say, “Boy, that’s a great idea. Somebody should do that.” I was going to international forums, such as going to the World Economic meetings in Davos, where I’d be asked to be a fire-starter on all kinds of subject areas by safety and food security and the law of the ocean. In a few minutes, I would quickly outline this model as a way of getting people to think much more richly about ways to manage technological development and not just immediately go to laws and regulatory bodies. All of this convinced me that this model was very valuable, but it wasn’t being taken up. All of that led to this first International Congress for the Governance of Artificial Intelligence, which will be convened in Prague on April 16 to 18. I do invite those of you listening to this podcast who are interested in the international governance of AI or really agile governance for technology more broadly to join us at that gathering.”

 

Important timestamps: 

0:00 intro

2:50 Wendell’s evolution in work and thought

10:45 AI alignment and machine ethics

27:05 Wendell’s focus on AI governance

34:04 How much can soft law shape hard law?

37:27 What does hard law consist of?

43:25 Contextualizing the International Congress for the Governance of AI

45:00 How AI governance efforts might fail

58:40 AGI governance

1:05:00 Wendell’s forthcoming book

 

Works referenced:

A Dangerous Master: How to Keep Technology from Slipping Beyond Our Control

Moral Machines: Teaching Robots Right from Wrong

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Lucas Perry: Hey everyone, welcome to the AI Alignment Podcast. I’m Lucas Perry. Today, we’ll be speaking with Wendell Wallach. This episode is primarily dedicated to the issue and topic of AI governance, though in order to get there we go on and explore Wendell’s intellectual journey in machine ethics and how that led him up to his current efforts in AI governance. We also discuss how machine ethics and AI alignment both attempt to serve the project of creating beneficial AI and deal with the moral and ethical considerations related to the growing power and use of artificial intelligence. We discuss soft law and hard law for shaping AI governance. We get into Wendell’s efforts for the global governance of AI and discuss the related risks. And to finish things off we also briefly touch on AGI governance and Wendell’s forthcoming book. If you find this podcast valuable, interesting, or helpful, consider sharing it with others who might find it valuable as well.

For those who are not familiar with Wendell, Wendell is an internationally recognized expert on the ethical and governance concerns posed by emerging technologies, particularly artificial intelligence and neuroscience. Wendell is a consultant and ethicist, and a scholar at Yale University’s Interdisciplinary Center for Bioethics. He is also a co-author, with Colin Allen, of Moral Machines: Teaching Robots Right from Wrong. This work maps machine ethics, machine morality, computational morality, and friendly AI. He has a second and more recent book, A Dangerous Master: How to Keep Technology from Slipping Beyond our Control. From my perspective, it seems there is much growing enthusiasm and momentum in the space of AI policy and governance efforts. So, this conversation and those like it help to further develop my perspective and understanding of where we are in the project and space of AI governance. For these reasons, I hope that you’ll find it valuable as well. So, let’s get into our conversation with Wendell Wallach.

It would be great if you could start by clarifying the evolution of your thought in science and technology over the years. It appears that you’ve gone from being interested in bioethics to machine ethics to now a more recent focus in AI governance and AI ethics. Can you take us through this movement in your thought and work?

Wendell Wallach: In reality, all three of those themes have been involved in my work from the very beginning, but the emphasis has changed. So I lived a very idiosyncratic life that ended with two computer consulting companies that I had helped start. But I had felt that there were books that I wanted to get out of my head, and I turned those companies over to the employees, and I started writing and realized that I was not up on some of the latest work in cognitive science. So one thing led to another, and I was invited to the first meeting of a technology and ethics working group at Yale University that had actually been started by Nick Bostrom when he was at Yale and Bonnie Kaplan. Nick left about a year later, and a year after that, Bonnie Kaplan had an accident, and the chair of that working group was turned over to me.

So that started my focus on technology and ethics more broadly. It was not limited to bioethics, but it did happen within the confines of the Yale Interdisciplinary Center for Bioethics. I was all over the place, in the sense that I was already a kind of transdisciplinary thinker, a transdisciplinary scholar, but I had the challenge of focusing my study and my work so it was manageable. In other words, I was trying to think broadly at the same time as I was trying to focus on different subject areas. One thing led to another. I was invited to a conference in Baden-Baden where I met Colin Allen. Together with the woman who started the workshop there, Eva Schmidt, we began thinking about a topic that we were calling machine morality at that time. By machine morality, we meant thinking about how moral decision making faculties might be implemented in computers and robots.

Around the same time, there were other scholars working on the same themes. Michael and Susan Anderson, for example, had grabbed on to the title ‘machine ethics.’ Over time, as these various pathways converged, machine ethics became the main research area, or the way in which this research project was referred to. It did have other names in addition to machine morality; it was sometimes called computational morality. At the same time, there were others who were working on it under the title of friendly AI, a term that was coined by Eliezer Yudkowsky. But the real difference between the machine ethics folks and the friendly AI folks was that the friendly AI folks were explicitly focused upon the challenge of how you would manage or tame superintelligence, whereas the machine ethics crew were much more ethicists, philosophers, and computer scientists who were really thinking about first steps toward introducing moral decision making faculties, moral sensitivity, into computers and robots. This was a relatively small group of scholars, but as this evolved over time, Eva and Colin and I decided that we would write a book mapping the development of this field of research.

Eva Schmidt fell away, and the book finally came out from Oxford University Press under the title Moral Machines: Teaching Robots Right from Wrong. So, as you may be aware, that’s still a seminal text out there. It’s still something that is read broadly and cited broadly; in fact, its citations are going up, and we’re even being asked by Oxford University Press to produce an update of the book. Machine ethics was two parts philosophy, one part computer science. It was basically two fields of study. One was looking explicitly at the question of implementing sensitivity to moral considerations in computers and robots, and the other was really thinking comprehensively about how humans make moral decisions. So, arguably, Moral Machines was the first book that really took that comprehensive look at human moral decision making seriously. It was also a time when there was a lot of research going on in moral psychology on the way in which people’s affective and decision making concerns affected our ethical decision making processes.

So we were also able to bring some of that in, bring evolutionary psychology in, and bring in a lot of new fields of research that had not really been given their due or had not been integrated very well with the dominant reason-based theories of ethics, such as deontology, which is really the family of ethical approaches that focus on duties and rules, and consequentialism, which is an ethical theory that says right and wrong are not determined by following the rules or doing your duty; they are determined by looking at the consequences of your action and selecting the course of action likely to produce the greatest good for the greatest number. So it’s like we were integrating evolutionary psychology, cognitive science, and moral psychology together with the more rational-based theories, as we looked at top-down and bottom-up approaches for introducing sensitivity to ethical considerations in computers and robots.

The major shift in that whole trajectory, and one I only learned about at the first FLI conference in Puerto Rico, where Jim Moor and I were the only two people who had been actively involved in the machine ethics community (Jim Moor is a professor at Dartmouth, for those of you who are not aware of him, and he has been a seminal figure in the philosophy of computing for decades now), was that at that Puerto Rican gathering, the concept of value alignment was raised to us for the first time. What I realized was that those who were talking about value alignment from the AI perspective, by and large, had little or no understanding that there had ever been a field, or was an ongoing field, known as machine ethics.

That led to my applying for a Future of Life Institute grant, which I was awarded as PI. That grant was to host three annual workshops bringing together experts not only in AI, but in machine ethics, philosophy generally, resilience engineering, robotics, and a broad array of other fields: people who had been thinking seriously about value issues in computational systems. Those really became groundbreaking workshops, where it was clear that the computer scientists and the AI researchers knew very little about ethics issues, and the ethicists didn’t necessarily have a great depth of understanding of some of the challenges coming up in artificial intelligence. Bart Selman and Stuart Russell agreed to be co-chairs of those workshops with me. The last one was completed over a year ago, with some closing presentations in New York City and at Yale.

Lucas Perry: I think it’d be helpful here if you could disambiguate the machine ethics crowd and way of thinking and what has been done there from the AI alignment, value alignment, Eliezer branch of thinking that has been going on. AI alignment seems more focused on explicitly trying to understand human preference hierarchies and be able to specify objectives without the machine systems doing other things that we don’t want them to do. Then you said that machine ethics is about imbuing ethical decision making faculties or reasoning or sensitivities in machine systems. That, to me, seems more like normative ethics. We have these normative theories like you mentioned deontology and consequentialism and virtue ethics, and maybe machines can invent other normative ethical theories. So they seem like different projects.

Wendell Wallach: They are very different projects. The question is whether they converge or not, or whether they can really be treated as projects totally distinct from each other. So when you were talking about machine ethics, or when we were talking about machine ethics, we were really thinking about it in terms of just how you introduce ethical procedures so that when machines encounter new situations, particularly when the designers can’t fully predict what their actions will be, they factor in ethical considerations as they choose between various courses of action. So we were really talking about very basic programming in the machines, but we weren’t just thinking of it in terms of the basics. We were thinking of it in terms of the evolution of smart machines. For example, in Moral Machines, Colin and I had a chart that we had actually developed with Eva Schmidt and that had been in earlier articles that the three of us authored, and it looked at the development of machines on two axes.

One was increasing autonomy, and the other was increasing sensitivity, with, at the far extreme, sensitivity to ethical considerations. We realized that you could put any tool within that chart. So a hammer has no sensitivity, and it has no autonomy. But when you think of a thermostat, it has a very low degree of sensitivity and a very low degree of autonomy: as temperatures change, it can turn heating on or off. We then, within that chart, had a series of semicircles, one of which delineated when we moved into the realm of what we labeled operational morality. By operational morality, we meant that the computer designers could more or less figure out all the situations the system would encounter and hard-program its responses to those situations. The next level was what we called functional morality, which was that, as the computer programmers could no longer predetermine all the situations the system would encounter, the system would have to have some kind of ethical subroutines. Then at the highest level was full moral agency.

What we encounter in the friendly AI approach of the Singularity Institute for Artificial Intelligence, now MIRI, and in what became value alignment, is more or less a presumption of very high-order intelligence capabilities in the system, and the question of how you would ensure that its values align with those of humans. They tended to start from that level. So that was the distinction. While the machine ethics folks did look at those futuristic concerns, they did so more from a philosophical level, and with at least a belief or appreciation that this is going to be a relatively evolutionary course, whereas the friendly AI and value alignment folks tended to presume that we’re going to have systems with very high-order cognitive capabilities, and asked how we ensure that those systems align with our values. Now, the convergence, I would say, is what’s happening right now, because in the workshops that have been organized around the societal and ethical impact of intelligent systems, the first experiments even the value alignment people are doing still tend to be relatively low-level experiments, given the capabilities systems have today.

So I would say, in effect, they are machine ethics experiments, or at least they’re starting to recognize that the challenges, at least initially, aren’t that much different than those the machine ethicists outlined. As far as the later concerns go, which is what is the best course to proceed on in producing systems that are value aligned, well, there I think we also have some overlap coming in from the machine ethicists, which raises questions about some of these more technical and mathematically-based approaches to value alignment and whether they might be successful. In that regard, Shannon Vallor, an ethicist at Santa Clara University who wrote a book called Technology and the Virtues and has now taken a professorship at Edinburgh, she and I produced a paper called, I think it was, From Machine Ethics to Value Alignment to Virtue Alignment. We’re really proposing that analytical approaches alone will not get us to machines that we can trust or that will be fully ethically aligned.

Lucas Perry: Can you provide some examples about specific implementations or systems or applications of machine ethics today?

Wendell Wallach: There really isn’t much. Sensitivity to ethical considerations is still heavily reliant on how much of that input we can get into systems, and then on how you integrate that input. So we are still very much at the stage of bringing various inputs in without a lot of integration, let alone analysis of what’s been integrated and decisions being made based on that analysis. For all practical purposes, in both machine ethics and, I would say, bottom-up value alignment, there’s just not a lot that’s been done. These are still somewhat futuristic research trajectories.

Lucas Perry: I think I’m just trying to poke here to understand better about what you find most skillful and useful about both approaches in terms of a portfolio approach to building beneficial AI systems, like if this is an opportunity to convince people that machine ethics is something valuable and that should be considered and worked on and expanded. I’m curious to know what you would say.

Wendell Wallach: Well, I think machine ethics is the name of the game in the sense that, for all I talk about systems that will have very high-order capabilities, we just aren’t there. We’re still dealing with relatively limited forms of cognitive decision making. For all the wonder that’s going on in machine learning, it’s still a relatively limited kind of learning approach. So I’m not dealing with machines that are making fundamental decisions at this point, or if they are allowed to, it’s largely because humans have abrogated their responsibility, trusted the machines, and let the machines make the decisions regardless of whether the machines actually have the capabilities to make sophisticated decisions.

Well, I think as we move along, as you get more and more inputs into systems and you figure out ways of integrating them, there will be the problem of which decisions can be made without, let’s just say, higher-order consciousness or understanding of the full implications of those systems, of the situations, of the ethical concerns arising in the situations, and which decisions really require levels of, and I’m going to use the understanding and consciousness words, though I’m using them in a circumspect way, understanding and consciousness for the machines to fully appreciate the ramifications of the decisions being made, and therefore who is affected by those decisions, or how those decisions will affect those around them.

Our first stage is going to be largely systems of limited consciousness or limited understanding, and our appreciation of what they can and cannot do successfully, and of when you truly need a human decision maker in the loop. I think that’s where we are, broadly. The difference between the approaches is that the AI researchers are looking at what kind of flexibility they have within the tools they have now for building AI systems. The machine ethicists, I think, will tend to be largely philosophically rooted, or ethically rooted, or practically ethically rooted, and therefore they tend to be more sensitive to the ramifications of decision making by machines and the capacities that need to be accounted for before you want to turn over a decision to a machine, such as a lethal autonomous weapon. What should the machine really understand before it can be a lethal autonomous weapon, and therefore, how tight does the meaningful human control need to be?

Lucas Perry: I’m feeling a tension between trying to understand the role and place of both of these projects and how they’re skillful. In terms just strict AI alignment, if we had a system that wanted to help us and it was very good at preference learning such that it could use all human artifacts in the world like books, movies and other things. It can also study your behavior and also have conversations with us. It could leverage all data points in the world for building a deep and rich understanding of individual human preference hierarchies, and then also it could extrapolate broad preference facts about species wide general considerations. If that project were to succeed, then within those meta preferences and that preference hierarchy exists the kinds of normative ethical systems that machine ethics is trying to pay lip service to or to be sensitive towards or to imbue in machine systems.

From my perspective, if that kind of narrative that I just gave is true or valid, then that would be sort of a complete value alignment, insofar as it would create beneficial machine systems. But in order to have that kind of normative decision making and those sensibilities in machine systems, such that they fully understand and are sensitive to the ethical ramifications of certain decisions, that requires higher order logic and the ability to generate concepts and to interrelate them and to shift them around and use them in the kinds of ways that human beings do, which we're far short of.

Wendell Wallach: So that’s where the convergence is. We’re far short of it. So I have no problem with the description you made. The only thing I noted is, at the beginning you said, if we had, and for me, in order to have, you will have to go through these stages of development that we have been alluding to as machine ethics. Now, how much of that will be able to utilize tools that come out of artificial intelligence that we had not been able to imagine in the early days of machine ethics? I have no idea. There’s so many uncertainties on how that pathway is going to unfold. There’re uncertainties about what order the breakthroughs will take place, how the breakthroughs will interact with other breakthroughs and technology more broadly, whether there will be public reactions to autonomous systems along the way that slow down the course of development or even stop certain areas of research.

So I don’t know how this is all going to unfold. I do see within the AI community, there is kind of a leap of faith to a presumption of breaths of capacity that when I look at it, I still look at, well, how do we get between here and there. When I look at getting between here and there, I see that you’re going to have to solve some of these lower level problems that got described more in the machine ethics world than have initially been seen by the value alignment approaches. That said, now that we’re getting researchers actually trying to look at implementing value alignment, I think they’re coming to appreciate that these lower level problems are there. We can’t presume high level preference parsing by machines without them going through developmental stages in relationship to understanding what a preference is, what a norm is, how they get applied within different contexts.

My sense has been that with both machine ethics and value alignment, we've sort of got the cart before the horse. So I'm waiting to see some great implementation breakthroughs; I just haven't seen them. Most of the time, when I encounter researchers who say they're taking this seriously, I see they're tripping over relatively low level implementations. The difficulty is here, and all of this is converging. What AI alignment was initially and what it's becoming now, I think, are quite different. I think in the very early days, it really was presumptions that you would have these higher order intelligences and then how were you going to align them. Now, as AI alignment people look at the value issues as they intersect with present day AI agendas, they realize that you can't make presumptions about the higher order systems without going through developmental steps to get there.

So, in that sense, I think whether it’s AI alignment or machine ethics, the one will absorb the lessons of the other. Both will utilize advances that happen on both fronts. All I’m trying to underscore here is there are computer engineers and roboticist and philosophers who reflected on issues that perhaps the value alignment people are learning something from. I, in the end, don’t care about machine ethics or value alignment per se, I just care about people talking with each other and learning what they can from each other and moving away from a kind of arrogance that I sometimes see happen on both sides of the fence that one says to the other you do not understand. The good news and one thing that I was very happy about in terms of what we did in these three workshops that I was PI on with the help of the Future of Life Institute was, I think we sort of broke open the door for transdisciplinary dialogue.

Now, true, this was just one workshop. But we have gone from a time when, at the first Future of Life Institute gathering in Puerto Rico, the ethicists in the room, Jim Moore and I, were backbenchers, to a time where we have countless conferences that are basically transdisciplinary conferences, where people from many fields of research are now beginning to listen to each other. The serious folks in technology and ethics really have recognized the richness of ethical decision making in real contexts. Therefore, I think they can point that out. Technologists sometimes like to say, "Well, you ethicists, what do you have to say, because you can't tell us what's right and wrong anyway?" Maybe that isn't what ethics is all about, dictating what's right and wrong. Maybe ethics is more about how we navigate the uncertainties of life, and what kinds of intelligence need to be brought to bear to navigate the uncertainties of life with a degree of sensitivity, depth, awareness, and appreciation for the multilayered kinds of intelligences that come into play.

Lucas Perry: In the context of this uncertainty about machine ethics and about AI alignment and however much or little convergence there might be, let’s talk about how all of this leads up into AI governance now. You touched on a lot of your machine ethics work. What made you pivot into AI governance, and where is that taking you today?

Wendell Wallach: After completing Moral Machines, I started to think about the fact that very few people had a deep and multidisciplinary understanding of the broad array of ethical and societal impacts posed by emerging technologies. I decided to write a primer on that, focusing on what could go wrong and how we might defuse ethical challenges and undesirable societal impacts. That was finally published under the title A Dangerous Master: How to Keep Technology from Slipping Beyond Our Control. The first part of that was really a primer on the various fields of science, from synthetic biology to geoengineering, what the benefits were, what could go wrong. But then the book was very much about introducing people to various themes that arise: managing complex adaptive systems, resilience engineering, transcending limits, a whole flock of themes that have become part of the language of discussing emerging technologies but weren't necessarily known to a broader public.

Even those of us who are specialists in one area of research, such as biotech, have had very little understanding of AI or geoengineering or some of the other fields. So I felt there was a need for a primer. Then in the final chapter of the primer, I turned to how some of these challenges might be addressed through governance and oversight. Simultaneously, while I was working on that book, Gary Marchant and I, Gary Marchant is the director of the Center for Law and Innovation at the Sandra Day O'Connor School of Law at Arizona State University, and he has been a specialist in the law and governance of emerging technologies. He and I, in our interactions, lamented the fact that it was very difficult to put in place any form of governance of these technologies, partly because of something called the pacing problem. The pacing problem refers to the fact that scientific discovery and technological innovation are far outpacing our ability to put in place appropriate ethical and legal oversight. That converges with another dilemma that has bedeviled people in technology governance for decades, going back to 1980.

David Collingridge wrote a book where he outlined a problem that is now known as the Collingridge dilemma. Basically, Collingridge said that while it is easiest to regulate a technology early in its development, early in its development we have little idea of what its societal impact will be. By the time we do understand what the challenges from the societal impact are, the technology will be so deeply entrenched in our society that it will be very difficult to change its trajectory. So we see that today with social media. Social media was totally entrenched in our society before we realized how it could be manipulated in ways that would undermine democracy. Now we're having a devil of a time figuring out what we can do.

So Gary and I, who had been talking about these kinds of problems for years, realized that we were constantly lamenting the challenge, but we altered the conversation one day over a cup of coffee. We said, "Well, if we had our druthers, if we had some degree of influence, what would we propose?" We came up with a model that we referred to as governance coordinating committees. Our idea was that you would put in place a kind of issues manager that would try and guide the development of a field, but first of all, it would just monitor development, convene forums between the many stakeholders, map issues and gaps, and see if anyone was addressing those issues and gaps or whether best practices had come to the fore. If those issues were not being addressed, then how could you address them, looking at a broad array of mechanisms? By a broad array of mechanisms, we meant you start with feasible technological solutions, you then look at what can be managed through corporate self-governance, and if you couldn't find anything in either of those areas, then you turn to what is sometimes called soft law.

Soft law is laboratory practices and procedures, standards, codes of conduct, insurance policies, a whole plethora of mechanisms that fall short of laws and regulatory oversight. The value of soft law is that it can be proposed easily, and you can throw it out if technological advances mean it's no longer necessary. So it's very agile, it's very adaptive. Really, anyone can propose a new soft law mechanism. But that contributes to one of the downsides, which is that you can have competing soft law. The other downside, perhaps even more important, is that you seldom have a means of enforcement if there are violations of soft law. So some areas you deem to need enforcement, and that's why hard law and regulatory institutions become important.

So Gary and I proposed this model. Every time we talked about it, people would say, "Boy, that's a great idea. Somebody should do that." I was going to international forums, such as the World Economic Forum meetings in Davos, where I'd be asked to be a fire-starter on all kinds of subject areas, like safety and food security and the law of the ocean. In a few minutes, I would quickly outline this model as a way of getting people to think much more richly about ways to manage technological development and not just immediately go to laws and regulatory bodies. All of this convinced me that this model was very valuable, but it wasn't being taken up. All of that led to this first International Congress for the Governance of Artificial Intelligence, which will be convened in Prague on April 16 to 18. I do invite those of you listening to this podcast who are interested in the international governance of AI, or really in agile governance for technology more broadly, to join us at that gathering.

Lucas Perry: Can you specify the extent to which you think that soft law, international norms will shape hard law policy?

Wendell Wallach: I don’t think any of this is that easy at the moment because when I started working on this project and working toward the Congress, there was almost no one in this space. Suddenly, we have a whole flock of organizations that have jumped into it. We have more than 53 lists of principles for artificial intelligence and all kinds of specifications of laws coming along like GDPR, and the EU will actually be coming out very soon with a whole other list of proposed regulations for the development of autonomous systems. So we are now in an explosion of groups, each of which in one form or another is proposing both laws and soft law mechanisms. I think that means we are even more in need of something like a governance coordinating committee. What I mean is loose coordination and cooperation, but at least putting some mechanism in place for that.

Some of the groups that have come to the fore, like the OECD, actually represent a broad array of nations, but not all of them. The Chinese were not party to the development of the OECD principles. The Chinese, for example, have somewhat different principles and laws than those that are most attractive in the West. My point is that we have an awful lot of groups, some of which would like to have a significant leadership role or a dominating role, and we'll have to see to what extent they cooperate with each other or whether we finally have a cacophony of competing soft law recommendations. But I think even if there's competition, whether at the UN, perhaps with a new mechanism that we create, or through each of these bodies like the OECD and IEEE individually, best practices will come to the fore over time and they will become the soft law guidelines. Now, which of those soft guidelines need to become hard law? That may vary from nation to nation.

Lucas Perry: The agility here comes in part from a large amount of soft law, which will then clarify best practices?

Wendell Wallach: Well, I think it's like anything else, just like the development of artificial intelligence: there's all kinds of experimentation going on, all kinds of soft law frameworks and principles which have to be developed into policy. It will vary from nation to nation. We'll get insight over time into which practices really work and which haven't worked. Hopefully, with some degree of coordination, we can underscore the best practices, and we can monitor the development of the field in a way where we can underscore the issues that still need to be addressed. We may have forums to work out differences. There may never be a full consensus, and there may not need to be a full consensus, considering much of the soft law will be implemented on a national or regional front. Only some of it will need to be top down in the sense that it's international.

Lucas Perry: Can you clarify the set of things or legal instruments which make up soft law, and then the set of things which make up hard law?

Wendell Wallach: Well, hard law is always things that have been governmentally instituted. So the laws and regulatory agencies that we have in America, for example, or the same within Europe, though you have different approaches to hard law. The Europeans are more willing to put in pretty rigorous hard law frameworks, and they believe that if we codify what we don't want, that will force developers to come up with new, creative, experimental pathways that accommodate our values and goals. In America, we're reticent to codify things into hard law because we think that will squelch innovation. So those are different approaches. But below hard law, in terms of soft law, you really do have this vast array of different mechanisms. So I mentioned international standards; some of those are technical. We see a lot of technical standards coming out of the IEEE and the ISO. The IEEE, for example, has jumped into the governance of autonomous systems in a way where it wants to go beyond what can be elucidated technically to talk more about what kinds of values we're putting in place and what the actual implementation of those values would be. So that's soft law.

Insurance policies sometimes dictate what you can and cannot do. So that's soft law. We have laboratory practices and procedures: what's safe to do in a laboratory and what isn't? That's soft law. We have new approaches to implementing values within technical systems, what is sometimes referred to as value-added design. That's kind of a form of soft law. There are innumerable frameworks that we can come up with, and we can create new ones if we need to, to help delineate what is acceptable and what isn't acceptable. But again, that delineation may or may not be enforceable. Some enforcement exists: if you don't do what the insurance policy has demanded of you, you lose your insurance policy, and that's a form of enforceability.

You can lose membership in various organizations. Soft law gets into great detail in terms of the acceptable use of humans and animals in research. But at least that's a soft law that has, within the United States and Europe and elsewhere, some ability to prosecute people who violate the rights of individuals or who harm animals in a way that is not acceptable in the course of doing the research. So what are we trying to achieve by convening a first International Congress for the Governance of Artificial Intelligence? First of all, our hope is that we will get a broad array of stakeholders present. So far, nearly all the governance initiatives are circumspect in terms of who's there and who is not there. We are making special efforts to ensure that we have robust representation from the Chinese. We're going to make sure that we have robust representation from those from underserved nations and communities who are likely to be very affected by AI but won't necessarily know a great deal about it. So having a broad array of stakeholders is the number one goal of what we are doing.

Secondly, between here and the Congress, we're convening six expert workshops. What we intend to do with these expert workshops is bring together a dozen or more of those individuals who have already been thinking very deeply about the kinds of governance mechanisms that we need. Do understand that I'm using the word governance, not government. Government usually just entails hard law and bureaucracies. By governance, we mean bringing in many other solutions to what we call regulatory or oversight problems. So we're hopeful that we'll get experts not only in AI governance but also in thinking about agile governance more broadly, that they will come to these small expert workshops we're putting together, and at those expert workshops, we hope to elucidate the most promising mechanisms for the international governance of AI. If they can elucidate those mechanisms, they will then be brought before the Congress. At the Congress, we'll have further discussion and refinement around some of those mechanisms, and then by the end of the Congress, we will have votes to see if there's an overwhelming consensus of those present to move forward on some of these initiatives.

Perhaps something like what I had called the governance coordinating committee might be one of those mechanisms. I happen to have also been an advisor to the UN Secretary-General's High-level Panel on Digital Cooperation, and they drew upon some of my research and combined that with others and came up with one of their recommendations, so they recommended something that is sometimes referred to as a network of networks, very similar to what I've been calling a governance coordinating committee. In the end, I don't care what mechanisms we start to put in place, just that we begin to take first steps toward putting in place something that will be seen as trustworthy. If we can't do that, then why bother? At the end of the Congress, we'll have these votes. Hopefully that will bring some momentum behind further action to move expeditiously toward putting some of these mechanisms in place.

Lucas Perry: Can you contextualize this International Congress for the Governance of AI within the broader AI governance landscape? What are the other efforts going on, and how does this fit in with all of them?

Wendell Wallach: Well, there are many different efforts underway. The EU has its efforts, the IEEE has its effort. The World Economic Forum convenes people to talk about some of these issues. You'll have some of this come up in the Partnership on AI, you have the OECD. There are conversations going on in the UN. You have the High-level Panel's recommendations. So there is now a vast plethora of different groups that have jumped into it. Our point is that, so far, none of these groups include all the stakeholders. So the Congress is an attempt to bring all of these groups together and ensure that other stakeholders have a place at the table. That would be the main difference.

We want to weave the groups together, but we are not trying to put in place some new authority or someone who has authority over the individual groups. We’re just trying to make sure that we’re looking at the development of AI comprehensively, that we’re talking with each other, that we have forums to talk with each other, that issues aren’t going unaddressed, and then if somebody truly has come forward with best practices and procedures, that those are made available to everyone else in the world or at least underscored for others in the world as promising pathways to go down.

Lucas Perry: Can you elaborate on how these efforts might fail to develop trust or how they might fail to bring about coordination on the issues? Is it always in a country's interest to share best practices around AI if that increases the capacity of other countries to catch up?

Wendell Wallach: We always have this problem of competition and cooperation. Where is competition going to take place? How much cooperation will there actually be? It's no mystery to anyone in the world that decisions are being made as we speak about whether or not we're going to move towards wider cooperation in the international world, or whether we are going to be looking at a war of civilizations, or at least a competition between civilizations. I happen to believe there are so many problems within emerging technologies that if we don't have some degree of coordination, we're all damned, and that that should prevail in global climate change and in other areas, but whether we'll actually be able to pull that off has to do with decisions going on in individual countries. So, at the moment, we're particularly seeing that tension between China and the US. If the trade war can be defused, then maybe we can back off from that tension a little bit, but at the moment, everything's up for grabs.

That being said, when everything’s up for grabs, my belief is you do what you can to facilitate the values that you think need to be forwarded, and therefore I’m pushing us toward recognizing the importance of a degree of cooperation without pretending that we aren’t going to compete with each other. Competition’s not bad. Competition, as we all know, furthers innovation helps disrupt technologies that are inefficient and replace them with more efficient ways of moving forward. I’m all for competition, but I would like to see it in a broader framework where there is at least a degree of cooperation on AI ethics and international governmental cooperation.

Lucas Perry: The path forward seems to have something to do with really reifying the importance of cooperation and how that makes us all better off to some extent, not pretending like there’s going to be full 100% cooperation, but cooperation where it’s needed such that we don’t begin defecting on each other in ways that are mutually bad and incompatible.

Wendell Wallach: That claim is central to the whole FLI approach.

Lucas Perry: Yeah. So, if we talk about AI in particular, there's this issue of lethal autonomous weapons. There's the issue of, as you mentioned, the spread of disinformation, the way in which AI systems and machine learning can be used more and more to lie and to spread subversive or malicious information campaigns. There's also the degree to which algorithms will or will not be contributing to discrimination. So these are all short-term governance issues for us to work on today.

Wendell Wallach: I think the longer term trajectory is that AI systems are giving increasing power to those who want to manipulate human behavior, whether for marketing or political purposes, and they're manipulating that behavior by studying human behavior and playing to our vulnerabilities. So humans are very much becoming machines in this AI commercial and political juggernaut.

Lucas Perry: Sure. So human beings have our own psychological bugs and exploits, and massive machine learning can find those bugs and exploits and exploit them in us.

Wendell Wallach: And in real time. I mean, with the collection of sensors and facial recognition software and emotion recognition software over 5G, with a large database of our past preferences and behaviors, we can be bombarded with signals to manipulate our behavior at very low levels and in areas where we are known to be vulnerable.

Lucas Perry: So the question is the extent to which, and the strategies by which, we can use these national and global AI governance efforts to mitigate these risks.

Wendell Wallach: To mitigate these risks, to make sure that we have meaningful public education, meaning, I would say, from grammar school up, digital literacy, so that individuals can recognize when they're being scammed, when they're being lied to. I mean, we'll never be perfect at that, but at least have one's antennae out for it. And there's the degree to which we perhaps need some self-recognition, so that we're not just manipulable but truly cultivate the capacity to recognize when there are internal and external pressures upon us and defuse those pressures, so we can look at new, more creative, individualized responses to the challenge at hand.

Lucas Perry: I think that that point about elementary to high school education is really interesting and important. I don't know what schools are like today; I guess they're about the same as what I experienced. They just seem completely incompatible with the way the technology is going, and with dis-employment and other things, in terms of the way that they teach and what they teach.

Wendell Wallach: Well, it’s not happening within the school systems. What I don’t fully understand is how savvy young people are within their own youth culture, whether they’re recognizing when they’re being manipulated or not, whether that’s part of that culture. I mean part of my culture, and God knows I’m getting on in years now, but it goes back to questions of phoniness and pretense and so forth. So we did have our youth culture that was very sensitive to that. But that wasn’t part of what our educational institutions were engaged in.

The difference now is that we’ll have to be both within the youth culture, but also we would need to be actually teaching digital literacy. So, for an example, I’m encountering a as scam a week, I would say right now through the telephone or through email. Some new way that somebody has figured out to try and rip off some money from me. I can’t believe how many new approaches are coming up. It just flags that this form of corruption requires remarkable degree of both sensitivity but a degree of digital knowledge so that you can recognize when you need to at least check out whether this is real or a scan before you give sensitive information to others.

Lucas Perry: The saving grace, I think, for Gen Z and millennial people is that… I mean, I don't know what the percentages are, but more than before, many of us have basically grown up on the internet.

Wendell Wallach: So they have a degree of digital literacy.

Lucas Perry: But it’s not codified by an institution like the schooling system, but changing the schooling system to the technological predictions of academics. I don’t know how much hope I have. It seems like it’s a really slow process to change anything about education. It seems like it almost has to be done outside of public education

Wendell Wallach: That may be what we mean by governance now: what can be done within the existing institutions, what has to find means of being addressed outside of the existing institutions, and is it happening or isn't it happening? If youth culture in its evolving forms gives 90% of digital literacy to young people, fine, but what about those people who are not within the networks of getting that education, and what about the other 10%? How does that take place? I think the kind of creativity and oversight we need is just monitoring what's going on, what's happening, what's not happening. Some areas may lead to actual governmental needs or interventions. So let's take the technological unemployment issue. I've been thinking a lot about that disruption in new ways. One question I have is whether it can be slowed down. An example for me of a slowdown would be if we found ways of not rewarding corporations for introducing technologies that bring about minimal efficiencies but are more costly to the society than the efficiencies that they introduce for their own productivity gains.

So, if it’s a small efficiency, but the corporation fires 10,000 people and just 10,000 people are now on the door, I’m not sure whether we should be rewarding corporations for that. On the other hand, I’m not quite sure what kind of political economy you could put in place so you didn’t reward corporations for that. Let’s just say that you have automatic long haul trucking. In the United States, we have 1.7 million long haul truck drivers. It’s one of the top jobs in the country. First of all, long haul trucking can probably be replaced more quickly than we’ll have self driving trucks in the cities because of some of the technical issues encountered in cities and on country roads and so forth. So you could have a long haul truck that just went from on-ramp to off ramp and then have human drivers who take over the truck for the last few miles to take it to the shipping depot.

But if we’ve replaced long haul truckers in the United States over a 10 year period, that would mean putting 14,000 truck drivers out of work every month. That means you have to create 14,000 jobs a month that are appropriate for long haul truck drivers. At the same time, as you’re creating jobs for new people entering the workforce and for others whose jobs are disappearing because of automation, it’s not going to happen. Given the culture in the United States, my melodramatic example is some long haul truckers may just decide to take the semis closed down interstate highways and sit in their cap and say to the government, “Bring it on.” We are moving into that kind of social instability. So, on one hand, if getting rid of the human drivers doesn’t bring massive efficiencies, it could very easily bring social instability and large societal costs. So perhaps we don’t want to encourage that. But we need to look at it in greater depth to understand what the benefits and costs are.

We often overplay the benefits, and we under-represent the downsides and the costs. You could see a form of tax on corporations relative to how many workers they laid off and how many jobs they created. It could be a sliding tax: a corporation reducing its workforce dramatically gets a higher tax on its profits than one that's actually increasing its workforce. That could be maybe a way of funding UBI. With UBI, I would like to see something that I've referred to as UBI plus plus plus. I mean, there have been various UBI pluses, but my thought was that you're being given that basic income for performing a service for the society. In other words, performing a service for the society is your job. There may not be anybody overseeing what service you are providing, or you might be able to decide yourself what that service would be.
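
A toy sketch of the kind of sliding schedule Wendell gestures at here; the rates, bounds, and adjustment factors below are invented purely for illustration and are not his proposal:

```python
# Hypothetical sliding profit-tax schedule: firms that shrink their workforce pay a
# higher rate than firms that grow it. All numbers are invented for illustration.

def profit_tax_rate(base_rate: float, workforce_change_pct: float) -> float:
    """Return a tax rate that rises as headcount shrinks and falls as it grows."""
    if workforce_change_pct < 0:
        # Each 1% reduction in headcount adds 0.5 percentage points to the rate.
        adjustment = -0.005 * workforce_change_pct
    else:
        # Each 1% increase in headcount subtracts 0.25 percentage points.
        adjustment = -0.0025 * workforce_change_pct
    # Keep the final rate within an arbitrary 10% to 50% band.
    return min(0.50, max(0.10, base_rate + adjustment))

print(profit_tax_rate(0.25, -10.0))  # firm cut 10% of staff -> 0.30
print(profit_tax_rate(0.25, +10.0))  # firm grew staff by 10% -> 0.225
```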

Maybe somebody who is an aspiring actor would decide that they were going to put together an acting group and take Shakespeare into the school system, and that would be their service to the society. Others may decide they don't know how to do a service for the society, but they want to go back to school, so perhaps they're preparing for a new job or a new contribution, and perhaps other people will really need a job, and we'll have to create high touch jobs for them, such as those that you have in Japan. But the point is that UBI is paying you for a job. The job you're doing is providing a service to the society, and that service is actually improving the overall society. So, if you had thousands of creative people taking educational programs into schools, perhaps you're improving overall education and therefore the smarts of the next generation.

Most of this is not international governance, but where it does impinge upon international considerations is that if we do have massive unemployment, it's going to be the poorer nations that are truly set back. I've been pointing out in international circles that we now have the Sustainable Development Goals. Well, technological unemployment alone could undermine the realization of the Sustainable Development Goals.

Lucas Perry: So that seems like a really big scary issue.

Wendell Wallach: It’s going to vary from country to country. I mean, the fascinating thing is how different these national governments will be. So some of the countries in Africa are leap frogging technology. They’re moving forward. They’re building smart cities. They aren’t going through our development. But other countries don’t even have functioning governments or the governments are highly autocratic. When you look at the technology available for surveillance systems now, I mean we’re very likely to see some governments in the world that look like horrible forms of dictatorship gulags, at the same time as there’ll be some countries where human rights are deeply entrenched, and the oversight of the technologies will be such that they will not be overly repressive on individual behavior.

Lucas Perry: Yeah. Hopefully all of these global governance mechanisms that are being developed will bring to light all of these issues and then effectively work on them. One issue which is related, and I'm not sure how it fits in here or fits in with your thinking, is specifically the messaging and thought around governance related to AGI and superintelligence. Do you have any thinking here about how any of this feeds into that, or your thoughts about that?

Wendell Wallach: I think that the difficulty is we're still in a realm where when and whether AGI or superintelligence will appear, and what it will look like, is still so highly speculative. So, at this stage of the game, I don't think that AGI is really a governmental issue beyond the question of whether government should be funding some of the research. There may also be a role for governments in monitoring when we're crossing thresholds that open the door for AGI. But I'm not so concerned about that, because I think there's a pretty robust community that's doing that already that's not governmental, and perhaps we don't need the government too involved. But the point here is, if we can put in place robust mechanisms for the international governance of AI, then potentially those mechanisms can either make recommendations that slow down the adoption of technologies that could be dangerous, or enhance the ethics and the sensitivity in the development of the technologies. If and when we are about to cross thresholds that open real dangers or serious benefits, we'll have the mechanisms in place to help regulate the unfolding of that trajectory.

But that, of course, has to be wishful thinking at this point. We're taking baby steps at this stage of the game. Those baby steps are going to be building on the activities that FLI and OpenAI and other groups are already engaged in. My way of approaching it, and it's not just with AGI, it's also in relationship to biotech, is just to flag that there are speculative dangers out there, and we are making decisions today about what pathways we, humanity as a whole, want to navigate. So, oftentimes in my presentations, I will have a slide up, and that slide is two robots kneeling over the corpse of a human. When I put that slide up, I say we may even be dealing with the melodramatic possibility that we are inventing the human species as we have known it out of existence.

So that’s my way of flagging that that’s the concern, but not trying to pretend that that’s one that governments should or can address at this point more that we are inflection point where we should and can put in place values and mechanisms to try and ensure that the trajectory of the emerging technologies is human-centered, is planet-centered, is about human flourishing.

Lucas Perry: I think the worry implicit in that image, of two AIs embodied as robots or whatever standing over a human corpse to represent them dominating or transcending the human species, is that they have more power than us, because you require more power to be able to do something like that. Having more power than the human species is something governments would maybe be interested in, and that might be something we wouldn't want to message about.

Wendell Wallach: I mean, it’s the problem with lethal autonomous weapons. Now, I think most of the world has come to understand that lethal autonomous weapons is a bad idea, but that’s not stopping governments from pursuing them or the security establishment within government saying that it’s necessary that we go down this road. Therefore, we don’t get an international ban or treaty. The messaging with governments is complicated. I’m using the messaging only to stress what I think we should be doing in the near term.

Lucas Perry: Yeah, I think that that’s a good idea and the correct approach. So, if everything goes right in terms of this process of AI governance, then we’re able to properly manage the development of new AI technology, what is your hope here? What are optimistic visions of the future, given successful AI governance?

Wendell Wallach: I’m a little bit different than most people on this. I’m not so much caught up in visions of the future based on this technology or that technology. My focus is more that we have a conscious active decision making process in the present where people get to put in place the values and instruments they need to have a degree of control over the overall development of emerging technologies. So, yes, of course I would like to see us address global climate change. I would like us to adapt AI for all. I would like to see all kinds of things take place. But more than anything, I’m acutely aware of what a significant inflection point this is in human history, and that we’re having the pass through a very difficult and perhaps in relatively narrow doorway in order ensure human flourishing for the next couple of hundred years.

I mean, I understand that I’m a little older than most of the people involved in this process, so I’m not going to be on the stage for that much longer barring radical life extension taking place in the next 20 years. So, unlike many people who are working on positive technology visions for the future, I’m less concerned with the future and more concerned with how, in the present, we nudge technology onto our positive course. So my investment is more that we ensure that humanity not only have a chance, but a chance to truly prevail.

Lucas Perry: Beautiful. So you’re now discussing about how you’re essentially focused on what we can do immediately. There’s the extent to which AI alignment and machine ethics or whatever are trying to imbue an understanding of human preference hierarchies in machine systems and to develop ethical sensibilities and sensitivities. I wonder what the role is for, first of all, embodied compassion and loving kindness in persons as models for AI systems and then embodied loving kindness and compassion and pure altruism in machine systems as a form of alignment with idealized human preference hierarchies and ethical sensibilities.

Wendell Wallach: In addition to the work I'm doing on the governance of emerging technologies, I'm also writing a book right now. The book has a working title, which is Descartes Meets Buddha: Enlightenment for the Information Age.

Lucas Perry: I didn’t know that. So that’s great.

Wendell Wallach: So this fits in with your question very broadly. I'm looking at this: if the enlightenment ethos, which has directed humanity's development over the last few hundred years, is imploding under the weight of its own success, then what ethos do we put in place that gives humanity a direction for flourishing over the next few hundred years? I think central to creating that new ethos is to have a new understanding of what it means to be human. But that new understanding isn't something totally new. It needs to have some convergence with what's been perennial wisdom to be meaningful. But the fact is, we are asking these questions: how are we similar to, and how do we truly differ from, the artificial forms of intelligence that we're creating? Or what will it mean to be human as we evolve through the impact of emerging technologies, whether that's life extension or uploading or bioengineering?

There still is this fundamental question about what grounds what it means to be human, in other words, what's not just up for grabs or up for engineering. To that, I bring in my own reflections after having meditated for the last 50 years, my own insights, shall we say, and how that converges with what we've learned about human functioning, human decision making, and human ethics through the cognitive sciences over the last decade or two. Out of that, I've come up with a new model that I refer to as cyber souls, meaning that as science illuminates the computational and biochemical mechanisms that give rise to human capabilities, we have often lost sight of the way in which evolution also forged us into integrated beings, integrated within ourselves and searching for an adaptive integration with the environment and the other entities that share in that environment.

And it’s this need for integration and relationship, which is fundamental in ethics, but also in decision making. There’s the second part of this, which is this new fascination with moral psychology and the recognition that reason alone may not be enough for good decision making. And that if we have an ethics that doesn’t accommodate people’s moral psychology, then reason alone isn’t going to be persuasive for people, they have to be moved by it. So I think this leads us to perhaps a new understanding of what’s the role of psychological states in our decision making, what information is carried by different psychological states, and how does that information help direct us toward making good and bad decisions. So I call that a silent ethic. There are certain mental states, which historically have at least indicated for people that they’re in the right place at the right time, in the right way.

Oftentimes, these states, whether they're called flow or oneness or creativity, are given some spiritual overlay, and people look directly at how to achieve these states. But that may be a misunderstanding of the role of mental states. Mental states are giving us information. As we factor that information into our choices and actions, those mental states fall away, and the byproducts are these so-called spiritual or transcendent states, and often they have characteristics where thought and thinking come to a rest. So I call this the silent ethic: taking the actions, making the choices that allow our thoughts to come to rest. When our thoughts are coming to rest, we're usually in relationships within ourselves and our environments that you can think of as embodied presence or perhaps even the foundations for virtue. So my own sense is we may be moving toward a new or revived virtue ethics. Part of what I'm trying to express in this new book is what I think is foundational to the flourishing of that new virtue ethics.

Lucas Perry: That’s really interesting. I bring this up and asking because I’ve been interested in the role of idealization, ethically, morally and emotionally in people and reaching towards whatever is possible in terms of human psychological enlightenment and how that may exist as certain benchmarks or reference frames in terms of value learning.

Wendell Wallach: Well, it is a counterpoint to the notion that machines are going to have this kind of embodied understanding. I'm highly skeptical that we will get machines in the next hundred years that come close to this kind of embodied understanding. I'm not skeptical that we could have a new kind of revival movement among humans where we create a new class of moral exemplars, which seems to be the exact opposite of what we're doing at the moment.

Lucas Perry: Yeah. If we can get the AI systems right and create abundance and reduce existential risk a bunch and have a long period of reflection, perhaps there will be space for reaching for the limits of human idealization and enlightenment.

Wendell Wallach: It’s part of what the whole question is going on, for us, philosophy types, to what extent is this all about machine superintelligence and to what extent are we using the conversation about superintelligence as an imperfect mirror to think more deeply about the ways we’re similar to in dissimilar from the AI systems we’re creating or have a potential to create.

Lucas Perry: All right. So, with that, thank you very much for your time.

 If you enjoyed this podcast, please subscribe. Give it a like or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI alignment series.

End of recorded material

AI Alignment Podcast: Human Compatible: Artificial Intelligence and the Problem of Control with Stuart Russell

Stuart Russell is one of AI’s true pioneers and has been at the forefront of the field for decades. His expertise and forward thinking have culminated in his newest work, Human Compatible: Artificial Intelligence and the Problem of Control. The book is a cornerstone piece, alongside Superintelligence and Life 3.0, that articulates the civilization-scale problem we face of aligning machine intelligence with human goals and values. Not only is this a further articulation and development of the AI alignment problem, but Stuart also proposes a novel solution which bring us to a better understanding of what it will take to create beneficial machine intelligence.

 Topics discussed in this episode include:

  • Stuart’s intentions in writing the book
  • The history of intellectual thought leading up to the control problem
  • The problem of control
  • Why tool AI won’t work
  • Messages for different audiences
  • Stuart’s proposed solution to the control problem

Key points from Stuart: 

  •  “I think it was around 2013 that it really struck me that in fact we’d been thinking about AI the wrong way altogether. The way we had set up the whole field was basically kind of a copy of human intelligence in that a human is intelligent, if their actions achieve their goals. And so a machine should be intelligent if its actions achieve its goals. And then of course we have to supply the goals in the form of reward functions or cost functions or logical goal statements. And that works up to a point. It works when machines are stupid. And if you provide the wrong objective, then you can reset them and fix the objective and hope that this time what the machine does is actually beneficial to you. But if machines are more intelligent than humans, then giving them the wrong objective would basically be setting up a kind of a chess match between humanity and a machine that has an objective that’s at cross purposes with our own. And we wouldn’t win that chess match.”
  • “So when a human gives an objective to another human, it’s perfectly clear that that’s not the sole life mission. So you ask someone to fetch the coffee, that doesn’t mean fetch the coffee at all costs. It just means on the whole, I’d rather have coffee than not, but you know, don’t kill anyone to get the coffee. Don’t empty out my bank account to get the coffee. Don’t trudge 300 miles across the desert to get the coffee. In the standard model of AI, the machine doesn’t understand any of that. It just takes the objective and that’s its sole purpose in life. The more general model would be that the machine understands that the human has internally some overall preference structure of which this particular objective, fetch the coffee or take me to the airport, is just a little local manifestation. And the machine’s purpose should be to help the human realize in the best possible way their overall preference structure. If at the moment that happens to include getting a cup of coffee, that’s great, or taking him to the airport. But it’s always in the background of this much larger preference structure that the machine knows it doesn’t fully understand. One way of thinking about it is to say that the standard model of AI assumes that the machine has perfect knowledge of the objective, and the model I’m proposing assumes that the machine has imperfect knowledge of the objective, or partial knowledge of the objective. So it’s a strictly more general case.”
  • “The objective is to reorient the field of AI so that in future we build systems using an approach that doesn’t present the same risk as the standard model… That’s the message, I think, for the AI community: the first phase of our existence maybe should come to an end and we need to move on to this other way of doing things. Because it’s the only way that works as machines become more intelligent. We can’t afford to stick with the standard model because, as I said, systems with the wrong objective could have arbitrarily bad consequences.”

 

Important timestamps: 

0:00 Intro

2:10 Intentions and background on the book

4:30 Human intellectual tradition leading up to the problem of control

7:41 Summary of the structure of the book

8:28 The issue with the current formulation of building intelligent machine systems

10:57 Beginnings of a solution

12:54 Might tool AI be of any help here?

16:30 Core message of the book

20:36 How the book is useful for different audiences

26:30 Inferring the preferences of irrational agents

36:30 Why does this all matter?

39:50 What is really at stake?

45:10 Risks and challenges on the path to beneficial AI

54:55 We should consider laws and regulations around AI

01:03:54 How is this book differentiated from those like it?

 

Works referenced:

Human Compatible: Artificial Intelligence and the Problem of Control

Superintelligence

Life 3.0

Occam’s razor is insufficient to infer the preferences of irrational agents

Synthesizing a human’s preferences into a utility function with Stuart Armstrong

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Lucas: Hey everyone, welcome back to the AI Alignment Podcast. I’m Lucas Perry and today we’ll be speaking with Stuart Russell about his new book, Human Compatible: Artificial Intelligence and The Problem of Control. Daniel Kahneman says “This is the most important book I have read in quite some time. It lucidly explains how the coming age of artificial super intelligence threatens human control. Crucially, it also introduces a novel solution and a reason for hope.”

Yoshua Bengio says that “This beautifully written book addresses a fundamental challenge for humanity: increasingly intelligent machines that do what we ask, but not what we really intend. Essential reading if you care about our future.”

I found that this book helped clarify both intelligence and AI to me, as well as the control problem born of the pursuit of machine intelligence. And as mentioned, Stuart offers a reconceptualization of what it means to build beneficial and intelligent machine systems, one that provides a crucial pivot in how we ought to be building intelligent machine systems.

Many of you will already be familiar with Stuart Russell. He is a professor of computer science and holder of the Smith-Zadeh chair in engineering at the University of California, Berkeley. He has served as the vice chair of the World Economic Forum’s Council on AI and Robotics and as an advisor to the United Nations on arms control. He is an Andrew Carnegie Fellow as well as a fellow of the Association for The Advancement of Artificial Intelligence, the Association for Computing Machinery and the American Association for the Advancement of Science.

He is the author with Peter Norvig of the definitive and universally acclaimed textbook on AI, Artificial Intelligence: A Modern Approach. And so without further ado, let’s get into our conversation with Stuart Russell.

Let’s start with a little bit of context around the book. Can you expand a little bit on your intentions and background for writing this book in terms of timing and inspiration?

Stuart: I’ve been doing AI since I was in high school and for most of that time the goal has been let’s try to make AI better because I think we’ll all agree AI is mostly not very good. When we wrote the first edition of the textbook, we decided to have a section called, What If We Do Succeed? Because it seemed to me that even though everyone was working on making AI equivalent to humans or better than humans, no one was thinking about what would happen if that turned out to be successful.

So that section in the first edition in '94 was a little equivocal, let's say: you know, we could lose control or we could have a golden age, and let's try to be optimistic. And then by the third edition, which was 2010, the idea that we could lose control was fairly widespread, at least outside the AI communities, among people worrying about existential risk like Steve Omohundro, Eliezer Yudkowsky and so on.

So we included a little bit more of that viewpoint. I think it was around 2013 that it really struck me that in fact we'd been thinking about AI the wrong way altogether. The way we had set up the whole field was basically kind of a copy of human intelligence, in that a human is intelligent if their actions achieve their goals. And so a machine should be intelligent if its actions achieve its goals. And then of course we have to supply the goals in the form of reward functions or cost functions or logical goal statements. And that works up to a point. It works when machines are stupid. And if you provide the wrong objective, then you can reset them and fix the objective and hope that this time what the machine does is actually beneficial to you. But if machines are more intelligent than humans, then giving them the wrong objective would basically be setting up a kind of a chess match between humanity and a machine that has an objective that's at cross purposes with our own. And we wouldn't win that chess match.

So I started thinking about how to solve that problem. And the book is a result of the first couple of years of thinking about how to do it.

Lucas: So you’ve given us a short and concise history of the field of AI alignment and the problem of getting AI systems to do what you want. One of the things that I found so great about your book was the history of evolution and concepts and ideas as they pertain to information theory, computer science, decision theory and rationality. Chapters one through three you sort of move sequentially through many of the most essential concepts that have brought us to this problem of human control over AI systems.

Stuart: I guess what I’m trying to show is how ingrained it is in intellectual thought going back a couple of thousand years. Even in the concept of evolution, this notion of fitness, you know we think of it as an objective that creatures are trying to satisfy. So in the 20th century you had a whole lot of disciplines, economics developed around the idea of maximizing utility or welfare or profit depending on which branch you look at. Control theory is about minimizing a cost function, so the cost function described some deviation from ideal behavior and then you build systems that minimize the cost. Operations research, which is dynamic programming and Markov decision processes is all about maximizing the sum of rewards. And statistics if you set it up in general, is about minimizing an expected loss function.

So all of these disciplines have the same bug, if you like. It’s a natural way to set things up, but in the long run we’ll just see it as a bad, cramped way of doing engineering. And what I’m proposing in the book is actually a way of thinking about it that’s much more about a binary, coupled system, rather than thinking about just the machine and its objective.

You think about this coupled system with a human, or, you know, it could be any entity that wants a machine to do something good for it, or another system to do something good for it, and then the system itself, which is supposed to do something good for the human or whatever else it is that wants something good to happen. This kind of coupled system, you don’t really see that in the intellectual tradition. Maybe one exception that I know of is the idea of principal-agent games in economics. So a principal might be an employer and the agent might be the employee. And then the game is: how does the employer get the employee to do something that the employer actually wants them to do, given that the employee, the agent, has their own utility function and would rather be sitting at home drinking beers and watching football on the telly.

How do you get them to show up at work and do all kinds of things they wouldn’t normally want to do? The simplest way is you pay them. But you know, there’s all kinds of other ideas about incentive schemes and status and then various kinds of sanctions if people don’t show up and so on. So the economists study that notion, which is a coupled system where one entity wants to benefit from the behavior of another.

So that’s probably the closest example that we have. And then maybe in ecology, look at symbiotic species or something like that. But there’s not very many examples that I’m aware of. In fact, maybe I can’t think of any, where the entity that’s supposedly in control, namely us, is less intelligent than the entity that it’s supposedly controlling, namely the machine.

Lucas: So, providing some framing and context here for the listener: the first part of your book, chapters one through three, explores the idea of intelligence in humans and in machines. There you give this historical development of ideas, and I feel that this history you give of computer science and the AI alignment problem really helps to demystify both the person and evolution as a process, as well as the background behind this problem.

The second part of your book, chapters four through six, discusses some of the problems arising from imbuing machines with intelligence. So this is a lot of the AI alignment problem considerations. And then the third part, chapters seven through ten, suggests a new way to think about AI, to ensure that machines remain beneficial to humans forever.

You’ve begun stating this problem, and readers can see in chapters one through three that this problem goes back a long time, right? The problem with computer science at its inception was that definition that you gave: a machine is intelligent insofar as it is able to achieve its objectives. In reaction to this, you’ve developed cooperative inverse reinforcement learning and inverse reinforcement learning, which are part of the latter stages of this book, where you’re arguing for a new definition that is more conducive to alignment.

Stuart: Yeah. In the standard model, as I call it in the book, the human specifies the objective and plugs it into the machine. If, for example, you get in your self-driving car and it says, “Where do you want to go?” and you say, “Okay, take me to the airport,” then for current algorithms as we understand them, built on this kind of model, that objective becomes the sole life purpose of the vehicle. It doesn’t necessarily understand that in fact that’s not your sole life purpose. If you suddenly get a call from the hospital saying your child has just been run over and is in the emergency room, you may well not want to go to the airport. Or if you get into a traffic jam and you’ve already missed the last flight, then again you might not want to go to the airport.

So when a human gives an objective to another human, it’s perfectly clear that that’s not the sole life mission. So you ask someone to fetch the coffee, that doesn’t mean fetch the coffee at all costs. It just means on the whole, I’d rather have coffee than not, but you know, don’t kill anyone to get the coffee. Don’t empty out my bank account to get the coffee. Don’t trudge 300 miles across the desert to get the coffee.

In the standard model of AI, the machine doesn’t understand any of that. It just takes the objective, and that’s its sole purpose in life. The more general model would be that the machine understands that the human has internally some overall preference structure, of which this particular objective, fetch the coffee or take me to the airport, is just a little local manifestation. And the machine’s purpose should be to help the human realize, in the best possible way, their overall preference structure.

If at the moment that happens to include getting a cup of coffee, that’s great, or taking him to the airport. But it’s always in the background of this much larger preference structure that the machine knows it doesn’t fully understand. One way of thinking about it is to say that the standard model of AI assumes that the machine has perfect knowledge of the objective, and the model I’m proposing assumes that the machine has imperfect knowledge of the objective, or partial knowledge of the objective. So it’s a strictly more general case.

When the machine has partial knowledge of the objective, there’s a whole lot of new things that come into play that simply don’t arise when the machine thinks it knows the objective. For example, if the machine knows the objective, it would never ask permission to do an action. It would never say, you know, “Is it okay if I do this?”, because it believes that it’s already extracted all there is to know about human preferences in the form of this objective, and so whatever plan it formulates to achieve the objective must be the right thing to do.

Whereas a machine that knows that it doesn’t know the full objective could say, well, given what I know, this action looks okay, but I want to check with the boss before going ahead, because it might be that this plan actually violates some part of the human preference structure that it doesn’t know about. So you get machines that ask permission, and you get machines that, for example, allow themselves to be switched off, because the machine knows that it might do something that would make the human unhappy, and if the human wants to avoid that and switches the machine off, that’s actually a good thing. Whereas a machine that has a fixed objective would never want to be switched off, because that guarantees that it won’t achieve the objective.
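To make that concrete, here is a minimal numerical sketch. It is my own toy illustration, not the formal model from the book: it assumes an invented Gaussian belief over the true utility U of a proposed action, and a human who, if asked, vetoes the action exactly when U is negative.

```python
import numpy as np

# Toy illustration (not the book's formalism): a machine is uncertain about the
# true utility U (to the human) of a proposed action. If it asks first, a
# rational human is assumed to veto the action whenever U < 0.
rng = np.random.default_rng(0)
belief_over_U = rng.normal(loc=0.5, scale=2.0, size=100_000)  # samples from the machine's belief

act_without_asking = belief_over_U.mean()           # act on the current best guess
ask_first = np.maximum(belief_over_U, 0.0).mean()   # the human's veto removes the U < 0 cases

print(f"E[U | act without asking] = {act_without_asking:.2f}")
print(f"E[U | ask permission]     = {ask_first:.2f}")
# Asking is never worse, and it is strictly better whenever the machine assigns
# any probability to U < 0. With zero uncertainty the two values coincide,
# which is why a standard-model machine sees no value in asking, or in letting
# itself be switched off.
```

The same comparison is why allowing the switch-off is rational for the uncertain machine: under these toy assumptions, the human only reaches for the switch in precisely the cases the machine should most want to avoid.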

So in the new approach you have a strictly more general repertoire of behaviors that the machine can exhibit. The idea of inverse reinforcement learning is that this is the way for the machine to actually learn more about what the human preference structure is: by observing human behavior. That could be verbal behavior, like “Could you fetch me a cup of coffee?”, which is a fairly clear indicator about your preference structure, but it could also be that, you know, you ask a human a question and the human doesn’t reply. Maybe the human’s mad at you and is unhappy about the line of questioning that you’re pursuing.

 So human behavior means everything humans do and have done in the past. So everything we’ve ever written down, every movie we’ve made, every television broadcast contains information about human behavior and therefore about human preferences. Inverse reinforcement learning really means how do we take all that behavior and learn human preferences from it?
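As a rough sketch of what that inference can look like, here is a toy Bayesian version of inverse reinforcement learning. It is my own illustration, not code from the book or from Russell’s lab; the candidate preference hypotheses, their utilities, and the rationality parameter beta are all invented for the example. The machine observes one piece of behavior, a request for coffee, and updates a belief over which preferences would explain it, under a noisily-rational model of human choice.

```python
import numpy as np

# Toy Bayesian IRL sketch: update a belief over hidden preferences
# from a single observed behavior (a verbal request for coffee).
options = ["coffee", "tea", "nothing"]
hypotheses = {                       # hypothesized utilities over the options
    "prefers coffee": np.array([1.0, 0.3, 0.0]),
    "prefers tea":    np.array([0.3, 1.0, 0.0]),
    "indifferent":    np.array([0.0, 0.0, 0.0]),
}
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}

def likelihood(choice, utilities, beta=3.0):
    """Noisily-rational ("Boltzmann") choice: better options are exponentially
    more likely to be chosen, but mistakes remain possible."""
    p = np.exp(beta * utilities)
    p /= p.sum()
    return p[options.index(choice)]

observed = "coffee"                  # "Could you fetch me a cup of coffee?"
unnormalized = {h: prior[h] * likelihood(observed, u) for h, u in hypotheses.items()}
z = sum(unnormalized.values())
for h, value in unnormalized.items():
    print(f"P({h} | asked for coffee) = {value / z:.2f}")
# The request is strong but not conclusive evidence: "on the whole, I'd rather
# have coffee than not," rather than "coffee at all costs."
```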

Lucas: What can you say about how tool AI as a possible path to AI alignment fits in this schema where we reject the standard model, as you call it, in favor of this new one?

Stuart: Tool AI is a notion that, oddly enough, doesn’t really occur within the field of AI. It’s a phrase that came from people who are thinking from the outside about possible risks from AI. And what it seems to mean is the idea that rather than building general purpose intelligence systems, if you build AI systems designed for some specific purpose, then that’s sort of innocuous and doesn’t present any risks. And some people argue that in fact if you just have a large collection of these innocuous application-specific AI systems, then there’s nothing to worry about.

My experience of tool AI is that when you build application-specific systems, you can do it in two ways. One is you kind of hack it: in other words, you figure out how you would do this task and then you write a whole bunch of very, very special-purpose code. So, for example, if you were doing handwriting recognition, you might think, oh, okay, well in order to find an ‘S’ I have to look for a line that’s curvy, and I follow the line and it has to have three bends, it has to be arranged this way. And, you know, you write a whole bunch of tests to check that the character has all these characteristics and doesn’t have any loops and this, that and the other, and then you say, okay, that’s an ‘S’.

And that’s actually not the way that people went about the problem of handwriting recognition. The way that they did it was to develop machine learning systems that could take images of characters that were labeled and then train a recognizer that could recognize new instances of characters. In fact, Yann LeCun at AT&T was building a system that was designed to recognize words and figures on checks, so very, very application-specific, very tool-like, and in order to do that he invented convolutional neural networks, which is what we now call deep learning.

So, out of this very, very narrow piece of tool AI came this very, very general technique, which has solved or largely solved object recognition, speech recognition and machine translation, and which some people argue will produce general purpose AI. So I don’t think there’s any safety to be found in focusing on tool AI.

The second point is that people feel that somehow tool AI is not an agent, an agent meaning a system that you can think of as perceiving the world and then taking actions. And again, I’m not sure that’s really true. A Go program is an agent. It’s an agent that operates in a small world, namely the Go board, but it perceives the board and the moves that are made, and it takes actions.

It chooses what to do next. In many applications like this, the only way to build an effective tool is for it to be an agent. If it’s a little vacuum-cleaning robot or lawn-mowing robot, or certainly a domestic robot that’s supposed to keep your house clean and look after the dog while you’re out, there’s simply no way to build those kinds of systems except as agents. And as we improve the capabilities of these systems, whether it’s for perception or planning and behaving in the real physical world, we’re effectively going to be creating general purpose intelligent agents. I don’t really see salvation in the idea that we’re just going to build application-specific tools.

Lucas: So that helps to clarify that tool AI does not get around this update that you’re trying to make to the standard model. So pivoting back to your intentions surrounding the book: if you could distill the core message or the central objective in writing this book, what would it be?

Stuart: The objective is to reorient the field of AI so that in future we build systems using an approach that doesn’t present the same risks as the standard model. I’m addressing multiple audiences. The message for the AI community, I think, is that the first phase of our existence maybe should come to an end, and we need to move on to this other way of doing things, because it’s the only way that works as machines become more intelligent. We can’t afford to stick with the standard model because, as I said, systems with the wrong objective could have arbitrarily bad consequences.

Then the other audience is the general public: people who are interested in policy, in how things are going to unfold in future, in technology and so on. For them, I think it’s important to actually understand more about AI rather than just thinking of AI as this kind of magic juice that triples the value of your startup company. It’s a collection of technologies, and those technologies have been built within a framework, the standard model, that has been very useful and is shared with these other fields: economics, statistics, operations research, control theory. But that model does not work as we move forward, and we’re already seeing places where the failure of the model is having serious negative consequences.

One example would be what’s happened with social media. Social media algorithms, content selection algorithms, are designed to show you stuff or recommend stuff in order to maximize click-through. Clicking is what generates revenue for the social media platforms, and so that’s what they try to do. I almost said they want to show you stuff that you will click on, and that’s what you might think is the right solution to that problem, right? If you want to maximize click-through, then show people stuff they want to click on, and that sounds relatively harmless.

Although people have argued that this creates a filter bubble or a little echo chamber where you only see stuff that you like and you don’t see anything outside of your comfort zone. That’s true, and it might tend to cause your interests to become narrower, but actually that isn’t really what happened, and that’s not what the algorithms are doing. The algorithms are not trying to show you the stuff you like. They’re trying to turn you into predictable clickers. They seem to have figured out that they can do that by gradually modifying your preferences, and they can do that by feeding you material that is, if you think of a spectrum of preferences, off to one side or the other of where you are, because they want to drive you to an extreme. At the extremes of the political spectrum, or the ideological spectrum, or whatever dimension you want to look at, you’re apparently a more predictable clicker, and so they can monetize you more effectively.

So this is just a consequence of reinforcement learning algorithms that optimize click-through. And in retrospect, we now understand that optimizing click-through was a mistake. That was the wrong objective. But you know, it’s kind of too late, and in fact it’s still going on and we can’t undo it. We can’t switch off these systems because they’re so tied into our everyday lives and there’s so much economic incentive to keep them going.
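Here is a toy simulation of that dynamic. It is my own construction under invented assumptions, not the platforms’ actual algorithms: a one-dimensional preference “position”, content that nudges the user toward wherever it sits, and clicks that are both more likely when content matches the user and more reliable when the user is near an extreme. Under those assumptions, a policy that keeps pushing the user toward an extreme eventually collects more clicks than one that simply shows people what they already like.

```python
import numpy as np

# Toy model of click-through optimization that pays off by changing the user.
rng = np.random.default_rng(1)

def click_prob(user, content):
    match = np.exp(-4.0 * (user - content) ** 2)   # users click on content near their position
    reliability = 0.4 + 0.6 * abs(user)            # users at the extremes click more predictably
    return match * reliability

def run(policy, steps=200):
    user, clicks = 0.1, 0
    for _ in range(steps):
        content = policy(user)
        clicks += rng.random() < click_prob(user, content)
        user = float(np.clip(user + 0.05 * (content - user), -1.0, 1.0))  # content shifts the user
    return clicks, user

show_what_they_like = lambda user: user            # cater to current preferences
drive_to_extreme    = lambda user: np.sign(user)   # keep nudging toward an extreme

for name, policy in [("cater to user", show_what_they_like),
                     ("drive to extreme", drive_to_extreme)]:
    clicks, final_position = run(policy)
    print(f"{name:18s} clicks={clicks:3d}  final user position={final_position:+.2f}")
```

The point of the toy is only that nothing in the objective mentions modifying the user; the modification falls out of maximizing clicks.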

So I want people in general to understand what the effect is of operating these narrow optimizing systems that pursue fixed and incorrect objectives. The effect of those on our world is already pretty big. Some people argue that corporations pursuing the maximization of profit have the same property. They’re kind of like AI systems. They’re kind of superintelligent, because they think over long time scales, they have massive information, resources and so on. They happen to have human components, but when you put a couple of hundred thousand humans together into one of these corporations, they have this kind of superintelligent understanding, manipulation capability and so on.

Lucas: This is a powerful and important update for research communities. I want to focus here a little on the core messages of the book as they pertain to each audience, because I think you can say and clarify different things for different people. So, for example, my impression is that for laypersons who are not AI researchers, the history of ideas that you give clarifies the foundations of many fields and how they have led up to this AI alignment problem, as you move through and past single-agent cases to multiple-agent cases, which give rise to game theory and decision theory, and how that all affects AI alignment.

So for laypersons, I think this book is critical for showing the problem, demystifying it, making it simple, and giving the foundational and core concepts human beings need to exist in this world today and to operate in a world where AI is becoming an ever more important thing.

And then for the research community, as you just discussed, there is this rejection of the standard model and this clear identification of systems with exogenous objectives that are singular and lack context and nuance, such that when these systems optimize for their objectives, they run over a ton of other things that we care about. And so we have to shift from this understanding where the objective is something inside of the exogenous system to something that the system is uncertain about and which actually exists inside of the person.

And I think the last thing I saw was that for researchers who are not AI researchers, it says: here’s this AI alignment problem; it is deeply interdependent and difficult; it requires economists and sociologists and moral philosophers; and for this reason too, it is important for you to join in and help. Do you have anything here you’d like to hit on or expand on, or anything I might’ve gotten wrong?

Stuart: I think that’s basically right. One thing that I probably should clarify, and it comes maybe from the phrase “value alignment”: the goal is not to build machines whose values are identical to those of humans. In other words, it’s not to just put in the right objective, because I actually believe that it’s fundamentally impossible to do that. Partly because humans actually don’t know their own preference structure. There’s lots of things that we might have a future positive or negative reaction to that we don’t yet know about, lots of foods that we haven’t yet tried. And in the book I give the example of the durian fruit, which some people really love and some people find utterly disgusting, and I don’t know which I am because I’ve never tried it. So I’m genuinely uncertain about my own preference structure.

It’s really not going to be possible for machines to be built with the right objective built in. They have to know that they don’t know what the objective is. And it’s that uncertainty that creates this deferential behavior. It becomes rational for that machine to ask permission and to allow itself to be switched off, which as I said, are things that a standard model machine would never do.

The reason why psychology, economics, moral philosophy become absolutely central, is that these fields have studied questions of human preferences, human motivation, and also the fundamental question which machines are going to face, of how do you act on behalf of more than one person? The version of the problem where there’s one machine and one human is relatively constrained and relatively straightforward to solve, but when you get one machine and many humans or many machines and many humans, then all kinds of complications come in, which social scientists have studied for centuries. That’s why they do it, because there’s more than one person.

And psychology comes in because the process whereby the machine is going to learn about human preferences requires that there be some connection between those preferences and the behavior that humans exhibit, because the inverse reinforcement learning process involves observing the behavior and figuring out what are the underlying preferences that would explain that behavior, and then how can I help the human with those preferences.

Humans, surprise, surprise, are not perfectly rational. If they were perfectly rational, we wouldn’t need to worry about psychology; we would do all this just with mathematics. But the connection between human preferences and human behavior is extremely complex. It’s mediated by our whole cognitive structure, and is subject to lots of deviations from perfect rationality. One of the deviations is that we are simply unable, despite our best efforts, to calculate what is the right thing to do given our preferences.

Lee Sedol, I’m pretty sure wanted to win the games of Go that he was playing against AlphaGo, but he wasn’t able to, because he couldn’t calculate the winning move. And so if you observe his behavior and you assume that he’s perfectly rational, the only explanation is that he wanted to lose, because that’s what he did. He made losing moves. But actually that would be obviously a mistake.

So we have to interpret his behavior in the light of his cognitive limitations. That becomes then a matter of empirical psychology. What are the cognitive limitations of humans, and how do they manifest themselves in the kind of imperfect decisions that we make? And then there’s other deviations from rationality. We’re myopic, we suffer from weakness of will. We know that we ought to do this, that this is the right thing to do, but we do something else. And we’re emotional. We do things driven by our emotional subsystems, when we lose our temper for example, that we later regret and say, “I wish I hadn’t done that.”

 All of this is really important for us to understand going forward, if we want to build machines that can accurately interpret human behavior as evidence for underlying human preferences.

Lucas: You’ve touched on inverse reinforcement learning in terms of human behavior. Stuart Armstrong was on the other week, and I believe his claim was that you can’t infer anything about preferences from behavior without making assumptions about rationality, and vice versa. So there’s sort of an incompleteness there. I’m just pushing here and wondering more about the value of human speech for revealing what our preferences might be, how this fits in with your book and narrative, and how furthering neuroscience and psychology, and all of these things, can decrease the AI’s uncertainty over human preferences.

Stuart: That’s a complicated set of questions. I agree with Stuart Armstrong that humans are not perfectly rational. I’ve in fact written an entire book about that. But I don’t agree that it’s fundamentally impossible to recover information about preferences from human behavior. Let me give the kind of straw man argument. So let’s take Garry Kasparov: chess player, world champion in the 1990s, some people would argue the strongest chess player in history. You might think it’s obvious that he wanted to win the games that he played. And when he did win, he was smiling, jumping up and down, shaking his fists in triumph. And when he lost, he behaved in a very depressed way, he was angry with himself and so on.

Now it’s entirely possible logically that in fact he wanted to lose every single game that he played, but his decision making was so far from rational that even though he wanted to lose, he kept playing the best possible move. So he’s got this completely reversed set of goals and a completely reversed decision making process. So it looks on the outside as if he’s trying to win and he’s happy when he wins. But in fact, he’s trying to lose and he’s unhappy when he wins, but his attempt to appear unhappy again is reversed. So it looks on the outside like he’s really happy because he keeps doing the wrong things, so to speak.

This is an old idea in philosophy. Donald Davidson calls it radical interpretation: from the outside, you can sort of flip all the bits and come up with an explanation that’s the complete reverse of what any reasonable person would think the explanation to be. The problem with that approach is that it then takes away the meaning of the word “preference” altogether. For example, let’s take the situation where Kasparov can checkmate his opponent in one move, and it’s blatantly obvious, and in fact he’s taken a whole sequence of moves to get to that situation.

If in all such cases where there’s an obvious way to achieve the objective, he simply does something different, in other words, let’s say he resigns, so whenever he’s in a position with an obvious immediate win, he instantly resigns, then in what sense is it meaningful to say that Kasparov actually wants to win the game if he always resigns whenever he has a chance of winning?

You simply vitiate the entire meaning of the word “preference”. It’s just not correct to say that a person who always resigns whenever they have a chance of winning really wants to win games. You can then kind of work back from there. So by observing human behavior in situations where the decision is an obvious one that doesn’t require a huge amount of calculation, it’s reasonable to assume that the preferences are the ones that they reveal by choosing the obvious action. If you offer someone a lump of coal or a $1,000 bill and they choose the $1,000 bill, it’s unreasonable to say, “Oh, they really prefer the lump of coal, but they’re just really stupid, so they keep choosing the $1,000 bill.” That would just be daft. So in fact it’s quite natural that we’re able to gradually infer the preferences of imperfect entities, but we have to make some assumptions that we might call minimal rationality, which is that in cases where the choice is obvious, people will generally tend to make the obvious choice.
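To put a number on that intuition, here is a small illustration using the same noisily-rational choice model as the earlier sketch, with invented utility gaps: a single easy choice, the $1,000 bill versus the lump of coal, yields a near-certain posterior about what the person prefers, while a single hard choice, say between two chess moves whose values the chooser can barely tell apart, says almost nothing.

```python
import numpy as np

def posterior_prefers_A(perceived_gap, beta=1.0):
    """P(prefers A | chose A) for two equally likely hypotheses -- 'prefers A by
    perceived_gap' vs 'prefers B by perceived_gap' -- under noisily-rational choice."""
    p_choose_A_if_prefers_A = 1.0 / (1.0 + np.exp(-beta * perceived_gap))
    p_choose_A_if_prefers_B = 1.0 / (1.0 + np.exp(+beta * perceived_gap))
    return p_choose_A_if_prefers_A / (p_choose_A_if_prefers_A + p_choose_A_if_prefers_B)

# Obvious choice: $1,000 bill vs. a lump of coal -- a huge, easily perceived gap.
print(f"obvious choice: P(prefers the bill)  = {posterior_prefers_A(10.0):.3f}")
# Hard choice: two chess moves whose values the person can barely distinguish.
print(f"hard choice:    P(prefers that move) = {posterior_prefers_A(0.1):.3f}")
# Losing moves in hard positions are weak evidence about what a player wants;
# resigning in the face of an obvious mate-in-one would be very strong evidence.
```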

Lucas: I want to be careful here about not misrepresenting any of Stuart Armstrong’s ideas. I think this is in relation to the work Occam’s Razor is Insufficient to Infer the Preferences of Irrational Agents, if you’re familiar with that?

Stuart: Yeah.

Lucas: So then everything you said still suffices. Is that the case?

Stuart: I don’t think we radically disagree. I think maybe it’s a matter of emphasis. How important is it to observe the fact that there is this possibility of radical interpretation? It doesn’t worry me. Maybe it worries him, but it doesn’t worry me because we do a reasonably good job of inferring each other’s preferences all the time by just ascribing at least a minimum amount of rationality in human decision making behavior.

This is why economists, the way they try to elicit preferences, is by offering you direct choices. They say, “Here’s two pizzas. Are you going to have a bubblegum and pineapple pizza, or you can have ham and cheese pizza. Which one would you like?” And if you choose the ham and cheese pizza, they’ll infer that you prefer the ham and cheese pizza, and not the bubblegum and pineapple one, as seems pretty reasonable.

There may be real cases where there is genuine ambiguity about what’s driving human behavior. I am certainly not pretending that human cognition is no mystery; it still is largely a mystery. And I think for the long term, it’s going to be really important to try to unpack some of that mystery. Probably the biggest deviation from rationality that humans exhibit, to me, is the fact that our choices are always made in the context of a whole hierarchy of commitments that effectively put us into what’s usually a much, much smaller decision-making situation than the real problem. So the real problem is: I’m alive, I’m in this enormous world, I’m going to live for a few more decades hopefully, and then my descendants will live for years after that, and lots of other people in the world will live for a long time. So which actions do I do now?

And I could do anything. I could continue talking to you and recording this podcast. I could take out my phone and start trading stocks. I could go out on the street and start protesting climate change. I could set fire to the building and claim the insurance payment, and so on and so forth. I could do a gazillion things. Anything that’s logically possible I could do. And I continue to talk in the podcast because I’m existing in this whole network and hierarchy of commitments. I agreed that we would do the podcast, and why did I do that? Well, because you asked me, and because I’ve written the book and why did I write the book and so on.

So there’s a whole nested collection of commitments, and we do that because otherwise we couldn’t possibly manage to behave successfully in the real world at all. The real decision problem is not, what do I say next in this podcast? It’s what motor control commands do I send to my 600 odd muscles in order to optimize my payoff for the rest of time until the heat death of the universe? And that’s completely and utterly impossible to figure out.

I always, and we always, exist within what I think Savage called a small world decision problem. We are aware only of a small number of options. So if you want to understand human behavior, you have to understand what are the commitments and what is the hierarchy of activities in which that human is engaged. Because otherwise you might be wondering, well why isn’t Stuart taking out his phone and trading stocks? But that would be a silly thing to wonder. It’s reasonable to ask, well why is he answering the question that way and not the other way?

Lucas: And so “AI, please fetch the coffee,” also exists in such a hierarchy. And without the hierarchy, the request is missing much of the meaning that is required for the AI to successfully do the thing. So it’s like an inevitability that this hierarchy is required to do things that are meaningful for people.

Stuart: Yeah, I think that’s right. Requests are a very interesting special case of behavior, right? They’re just another kind of behavior, but up to now we’ve interpreted them as defining the objective for the machine, which is clearly not the case. And people have recognized this for a long time. For example, my late colleague Bob Wilensky had a project called the Unix Consultant, which was a natural language system, actually built as an agent, that would help you with Unix stuff, managing files on your desktop and so on. You could ask it questions like, “Could you make some more space on my disk?”, and the system needs to know that rm *, which means “remove all files”, is probably not the right thing to do, that this request to make space on the disk is actually part of a larger plan that the user might have. And for that plan, most of the other files are required.

So a more appropriate response would be, “I found these backup files that have already been deleted. Should I empty them from the trash?”, or whatever it might be. So in almost no circumstances would a request be taken literally as defining the sole objective. If you asked for a cup of coffee, what happens if there’s no coffee? Perhaps it’s reasonable to bring a cup of tea or “Would you like a can of Coke instead?”, and not to … I think in the book I had the example that you stop at a gas station in the middle of the desert, 250 miles from the nearest town and they haven’t got any coffee. The right thing to do is not to trundle off across the desert and come back 10 days later with coffee from a nearby town. But instead to ask, well, “There isn’t any coffee. Would you like some tea or some Coca-Cola instead?”

This is very natural for humans, and in philosophy of language my other late colleague Paul Grice was famous for pointing out that many statements, questions, requests and commands in language have this characteristic that they don’t really mean what they say. I mean, we all understand that if someone says, “Can you pass the salt?”, the correct answer is not, “Yes, I am physically able to pass the salt.” His name became an adjective, right? So we talk about Gricean analysis, where you don’t take the meaning literally, but you look at the context in which it was said and the motivations of the speaker and so on, to infer what is a reasonable course of action when you hear that request.

Lucas: You’ve done a wonderful job so far painting the picture of the AI alignment problem and the solution for which you offer, at least the pivoting which you’d like the community to take. So for laypersons who might not be involved or experts in AI research, plus the AI alignment community, plus potential researchers who might be brought in by this process or book, plus policymakers who may also listen to it, what’s at stake here? Why does this matter?

Stuart: I think AI, for most of its history, has been an interesting curiosity. It’s a fascinating problem, but as a technology it was woefully lacking. And it has found various niches where it’s useful, even before the current incarnation in terms of deep learning. But if we assume that progress will continue and that we will create machines with general purpose intelligence, that would be roughly speaking, the biggest event in human history.

History, our civilization, is just a consequence of the fact that we have intelligence, and if we had a lot more, it would be a radical step change in our civilization. If these were possible at all, it would enable other inventions that people have talked about as possibly the biggest event in human history, for example, creating the ability for people to live forever or much, much longer life span than we currently have, or creating the possibility for people to travel faster than light so that we could colonize the universe.

If those are possible, then they’re going to be much more possible with the help of AI. If there’s a solution to climate change, it’s going to be much more possible to solve climate change with the help of AI. It’s this fact that AI in the form of general purpose intelligence systems is this kind of über technology that makes it such a powerful development if and when it happens. So the upside is enormous. And then the downside is also enormous, because if you build things that are more intelligent than you, then you face this problem. You’ve made something that’s much more powerful than human beings, but somehow you’ve got to make sure that it never actually has any power. And that’s not completely obvious how to do that.

The last part of the book is a proposal for how we could do that, how you could change this notion of what we mean by an intelligent system so that rather than copying this sort of abstract human model, this idea of rationality, of decision making in the interest, in the pursuit of one’s own objectives, we have this other kind of system, this sort of coupled binary system where the machine is necessarily acting in the service of human preferences.

If we can do that, then we can reap the benefits of arbitrarily intelligent AI. Then as I said, the upside would be enormous. If we can’t do that, if we can’t solve this problem, then there are really two possibilities. One is that we need to curtail the development of artificial intelligence and for all the reasons that I just mentioned, it’s going to be very hard because the upside incentive is so enormous. It would be very hard to stop research and development in AI.

The other possibility is that we create general purpose, superhuman intelligent machines and we lose control of them, and they’re pursuing objectives that are ultimately mistaken objectives. There’s tons of science fiction stories that tell you what happens next, and none of them are desirable futures for the human race.

Lucas: Can you expand on what you mean when you say that if we’re successful with the control and alignment problem, the upside is enormous? What actually is borne out of the process of generating an aligned superintelligence, from that point on until heat death or whatever else?

Stuart: Assuming that we have a general purpose intelligence that is beneficial to humans, then you can think about it in two ways. I already mentioned the possibility that you’d be able to use that capability to solve problems that we find very difficult, such as eternal life, curing disease, solving the problem of climate change, solving the problem of faster-than-light travel and so on. You might think of these as the science fiction-y upside benefits. But just in practical terms, when you think about the quality of life for most people on earth, let’s say it leaves something to be desired. And you ask, “Okay, what would be a reasonable aspiration?”, and put it somewhere like the 90th percentile in the US. That would mean a ten-fold increase in GDP for the world if you brought everyone on earth up to what we’d call a reasonably nice standard of living by Western standards.

General purpose AI can do that in the following way, without all these science fiction inventions and so on: just by deploying the technologies and materials and processes that we already have in ways that are much, much more efficient and obviously much, much less labor intensive.

The reason that things cost a lot and the reason that people in poor countries can’t afford them … They can’t build bridges or lay railroad tracks or build hospitals because they’re really, really expensive and they haven’t yet developed the productive capacities to produce goods that could pay for all those things. The reason things are really, really expensive is because they have a very long chain of production in which human effort is involved at every stage. The money all goes to pay all those humans, whether it’s the scientists and engineers who designed the MRI machine or the people who worked on the production line or the people who worked mining the metals that go into making the MRI machine.

All the money is really paying for human time. If machines are doing every stage of the production process, then you take all of those costs out, and to some extent it becomes like a digital newspaper, in the sense that you can have as much of it as you want. It’s almost free to make new copies of a digital newspaper, and it would become almost free to produce the material goods and services that constitute a good quality of life for people. And at that point, arguing about who has more of it is like arguing about who has more digital copies of the newspaper. It becomes sort of pointless.

That has two benefits. One is that everyone is relatively much better off, assuming that we can get politics and economics out of the way, and the other is that there’s then much less incentive for people to go around starting wars and killing each other, because there isn’t this struggle which has characterized most of human history: the struggle for power, wealth and access to resources and so on. There are other reasons people kill each other, religion being one of them, but I certainly think it would help if this source of competition and warfare were removed.

Lucas: These are very important short-term considerations and benefits from getting this control problem and this alignment problem correct. One thing that the superintelligence will hopefully also do is reduce existential risk to zero, right? And if existential risk is reduced to zero, then basically what happens is the entire cosmic endowment, some hundreds of thousands of galaxies, becomes unlocked to us. Perhaps some fraction of it would have to be explored first in order to ensure existential risk is pretty close to zero. I find your arguments pragmatic and helpful for the common person in seeing why this is important.

For me personally, and why I’m passionate about AI alignment and existential risk issues, is that the reduction of existential risk to zero and having an aligned intelligence that’s capable of authentically spreading through the cosmic endowment, to me seems to potentially unlock a kind of transcendent object at the end of time, ultimately influenced by what we do here and now, which is directed and created by coming to better know what is good, and spreading that.

What I find so beautiful and important and meaningful about this problem in particular, and why your book is so important as core reading for anyone, for laypersons, for computer scientists, for just everyone, is that if we get this right, this universe can be maybe one of the universes, and perhaps the one part of the multiverse, where something like the most beautiful thing physically possible could be made by us within the laws of physics. And that to me is extremely awe-inspiring.

Stuart: I think that human beings, being the way they are, will probably find more ways to get it wrong. We’ll need more solutions for those problems, and perhaps AI will help us solve other existential risks, and perhaps it won’t. The control problem I think is very important. There are a couple of other issues that I think we still need to be concerned with. Well, I don’t think we need to be concerned with all of them, but there are a couple of issues that I haven’t begun to address or solve. One of those is obviously the problem of misuse: we may find ways to build beneficial AI systems that remain under control in a mathematically guaranteed way, and that’s great. But the problem of making sure that only those kinds of systems are ever built and used, that’s a different problem. That’s a problem about human motivation and human behavior, which I don’t really have a good solution to. It’s sort of like the malware problem, except much, much, much, much worse.

If we do go ahead developing general purpose intelligence systems that are beneficial and so on, then parts of that technology, the general purpose intelligent capabilities, could be put into systems that are not beneficial, as it were, that don’t have a safety catch. And that’s the misuse problem. If you look at how well we’re doing with malware, you’d have to say more work needs to be done. We’re kind of totally failing to control malware, and the ability of people to inflict damage on others through uncontrolled software is getting worse. We need an international response and a policing response. Some people argue that, oh, it’s fine, the superintelligent AI that we build will make sure that other nefarious development efforts are nipped in the bud.

This doesn’t make me particularly confident. So I think that’s an issue. The third issue is, shall we say, enfeeblement. This is the notion that if we develop machines that are capable of running every aspect of our civilization, then that changes a dynamic that’s been in place since the beginning of human history or prehistory, which is that for our civilization to continue, we have had to pass on our knowledge and our skills to the next generation. People have to learn what it is that the human race knows, over and over again in every generation, just to keep things going. And if you add it all up, there are about a hundred-odd billion people who have ever lived, and they each spent about 10 years learning stuff on average. So that’s a trillion person-years of teaching and learning to keep our civilization going. And there’s a very good reason why we’ve done that, because without it, things would fall apart very quickly.

But that’s going to change now. We don’t have to put it into the heads of the next generation of humans; we can put it into the heads of the machines, and they can take care of the civilization. And then you get this almost irreversible process of enfeeblement, where humans no longer know how their own civilization functions. They lose knowledge of science, of engineering, even of the humanities and literature. If machines are writing books and producing movies, then we don’t even need to learn that. You see this in E. M. Forster’s story The Machine Stops, from 1909, which is a very prescient story about a civilization that becomes completely dependent on its own machines. Or if you like something more recent, in WALL-E the human race is on a sort of cruise ship in space, and they all become obese and stupid because the machines look after everything and all they do is consume and enjoy. And that’s not a future that I would want for the human race.

And arguably the machines should say, “This is not the future you want; tie your own shoelaces,” but we are shortsighted, and we may effectively override what the machines are telling us and say, “No, no, you have to tie my shoelaces for me.” So I think this is a problem that we have to think about. Again, this is a problem for infinity: once you turn things over to the machines, it’s practically impossible, I think, to reverse that process. We have to keep our own human civilization going in perpetuity, and that requires a kind of cultural process that I don’t yet understand exactly how it would work.

Because the effort involved in learning, let’s say going to medical school: it’s 15 years of school and then college and then medical school and then residency. It’s a huge effort, a huge investment, and at some point the incentive to undergo that process will disappear, so something else will have to take its place. At the moment it’s partly money, partly prestige, partly a desire to be someone who is in a position to help others. So somehow we’ve got to make our culture capable of maintaining that process indefinitely when many of the incentive structures that have kept it in place go away.

Lucas: This makes me wonder, from an evolutionary, cosmological perspective, how this transition, from humans being the most intelligent form of life on this planet to machine intelligence being the most intelligent form of life, plays out in the very long term. We can do thought experiments where we imagine that monkeys had actually been the ones creating humans, and ask what the role of the monkey would still be.

Stuart: Yep. But we should not be creating the machine analog of humans, i.e., autonomous entities pursuing their own objectives. We’ve pursued our objectives pretty much at the expense of the monkeys and the gorillas, and we should not be producing machines that play an analogous role. That would be a really dumb thing to do.

Lucas: That’s an interesting comparison, because the objectives of the human are exogenous to the monkey, and that’s the key issue that you point out. If the monkeys had been clever and had been able to control evolution, then they would have made humans uncertain as to the monkeys’ preferences and then had them optimize those.

Stuart: Yeah, I mean, they could have imagined creating a race of humans that were intelligent but completely subservient to the interests of the monkeys. Assuming that they solved the enfeeblement problem and the misuse problem, then they’d be pretty happy with the way things turned out. I don’t see any real alternative. Samuel Butler in 1863 wrote about a society that faces the problem of superintelligent machines, and they take the other solution, which is actually to stop. They see no alternative but to just ban the construction of intelligent machines altogether. In fact, they ban all machines, and in Frank Herbert’s Dune, the same thing: they have a catastrophic war in which humanity just survives its conflict with intelligent machines, and from then on all intelligent machines, in fact all computers, are banned altogether. I can’t see that that’s a plausible direction, but it could be that we decide at some point that we cannot solve the control problem, or we can’t solve the misuse problem, or we can’t solve the enfeeblement problem.

And that we decide it’s in our best interests to just not go down this path at all. To me that just doesn’t feel like a possible direction. Things could change if we start to see bigger catastrophes. I think the click-through catastrophe is already pretty big, and it results from very, very simple-minded algorithms that know nothing about human cognition or politics or anything else. They’re not even explicitly trying to manipulate us; it’s just that that’s what the code does, in a very simple-minded way. So we could imagine bigger catastrophes happening that we survive by the skin of our teeth, as happened in Dune, for example, and then that would change the way people think about the problem. And we see this over and over again with nuclear power, with fossil fuels and so on: by and large, technology is always seen as beneficial, and more technology is therefore more beneficial.

And we push ahead, often ignoring the people who say, “But, but, but what about this drawback? What about this drawback?” Maybe that’s starting to change with respect to fossil fuels. Several countries have decided since Chernobyl and Fukushima to ban nuclear power, and the EU has much stronger restrictions on genetically modified foods than a lot of other countries, so there are pockets where people have pushed back against technological progress and said, “No, not all technology is good and not all uses of technology are good, and so we need to exercise a choice.” But the benefits of AI are potentially so enormous. It’s going to take a lot to undo this forward progress.

Lucas: Yeah, absolutely. Whatever results from earth-originating intelligent life at the end of time, that thing is up to us to create. I’m quoting you here, you say: “A compassionate and jubilant use of humanity’s cosmic endowment sounds wonderful, but we also have to reckon with the rapid rate of innovation in the malfeasance sector; ill-intentioned people are thinking up new ways to misuse AI so quickly that this chapter is likely to be outdated even before it attains printed form. Think of it not as depressing reading, however, but as a call to act before it’s too late.”

Thinking about this and everything you just touched on: there’s obviously a ton for us to get right here, and it’s a question and problem that everyone in the human species should have a voice in.

Stuart: Yeah. I think we really need to start considering the possibility that there ought to be a law against it. For a long time the IT industry almost uniquely has operated in a completely unregulated way. The car industry for example, cars have to follow various kinds of design and safety rules. You have to have headlights and turn signals and brakes and so on. A car that’s designed in an unsafe way gets taken off the market, but software can do pretty much whatever it wants.

Every license agreement that you sign whenever you buy or use software tells you that it doesn’t matter what the software does; the manufacturer is not responsible for anything, and so on. And I think it’s a good idea to actually take legislative steps, regulatory steps, just to get comfortable with the idea that yes, we maybe do need regulation. San Francisco, for example, has banned the use of facial recognition in public or for policing. California has a ban on the impersonation of human beings by AI systems. I think that ban should be pretty much universal, but in California its primary area of applicability is in persuading people to vote in a particular direction in an election, so it’s a fairly narrow limitation. But when you think about it, why would you want to allow AI systems to impersonate human beings? In other words, the human who’s in the conversation believes that they’re talking to another human being, and so believes that they owe that other human being a whole raft of respect, politeness, all kinds of obligations that are involved in interacting with other humans.

But you don’t owe any of those things to an AI system. And so why should we allow people to effectively defraud humans by convincing them that in fact they’re engaged with another human when they aren’t? So I think it would be a good idea to just start things off with some basic common sense rules. Take the GDPR rule that says that you can’t use an algorithm to make a decision that has a significant legal effect on a person. You can’t put them in jail simply as a result of an algorithm, for example. You can’t fire them from a job simply as a result of an algorithm. You can use the algorithm to advise, but a human has to be involved in the decision, and the person has to be able to query the decision and ask for the reasons and, in some sense, have a right of appeal.

So these are common sense rules that almost everyone would agree with, and yet certainly in the U.S. there’s reluctance to put them into effect. And I think going forward, if we want to have safe AI systems, there’s at least going to be a role for regulation. There should also be standards, as in IEEE standards. There should also be professional codes of conduct. People should be trained in how to recognize potentially unsafe designs for AI systems. But there should, I think, be a role for regulation where at some point you would say: if you want to put an AI system on the internet, for example, just as if you want to put software into the app store, it has to pass a whole bunch of checks to make sure that it’s safe, to make sure that it won’t wreak havoc. So we’d better start thinking about that. I don’t know yet what that regulation should say, but we shouldn’t be in principle opposed to the idea that such regulations might exist at some point.

Lucas: I basically agree that these regulations should be implemented today, but they seem pretty temporary or transient as the AI systems’ uncertainty about humans’ objective functions or utility functions decreases, as they become more certain about what we want. At some point it becomes unethical to have human beings governing these processes instead of AI systems, right? So if we have timelines from AI researchers that range from 50 to a hundred years for AGI, we could potentially see laws and regulations like this go up in the next five to ten years and then disappear again somewhere within the next hundred to 150 years, max.

Stuart: That’s an interesting viewpoint. And I think we have to be a little careful, because autonomy is part of our preference structure. So although one might ask, okay, who gets to run the government? Well, self-evidently it’s possible that machines could do a better job than the humans we currently have, but that would be better only in a narrow sense: maybe it would reduce crime, maybe it would increase economic output, we’d have better health outcomes, people would be more educated than they would be with humans making those decisions, but there would be a dramatic loss in autonomy. And autonomy is a significant part of our preference structure. So it isn’t necessarily the case that the right solution is for machines to be running the government. And this is something that the machines themselves will presumably recognize, and this is the reason why parents at some point tell the child, “No, you have to tie your own shoelaces,” because they want the child to develop autonomy.

The same thing will be true: the machines will want humans to retain autonomy, as I said earlier with respect to enfeeblement, right? It’s this conflict between our long-term best interest and our short-termism in the choices that we tend to make. It’s always easier to say, “Oh no, I can’t be bothered to tie my shoelaces. Please could you do it?” But if you keep doing that, then the long-term consequences are bad. We have to understand how autonomy, which includes machines not making decisions, folds into our overall preference structure. And up to now there hasn’t been much of a choice, at least in the global sense. Of course it’s been humans making the decisions, although within any local context it’s only a subset of humans who are making the decisions, and a lot of other people don’t have as much autonomy. To me, autonomy is a really important currency that, to the extent possible, everyone should have as much of as possible.

Lucas: I think you really hit the nail on the head. The question is where autonomy fits in the hierarchy of our preferences and meta-preferences. For me, it seems more instrumental than an end goal in itself. Now, it’s an empirical question, across all people, where autonomy fits in their preference hierarchies, whether it’s a terminal value or not, and whether, under reflection and self-idealization, our preferences distill into something else or not. Autonomy could possibly, but not necessarily, be an end goal; it may simply provide utility for all of our other goals, because without autonomy we can’t act on what we think will best optimize our own preferences and end values. So definitely a lot of questions there. The structure of our preference hierarchy will certainly dictate, it seems, the long-term outcome of humanity and how enfeeblement unfolds.

Stuart: The danger would be that we misunderstand the entire nature of the human preference hierarchy. Sociologists and others have talked about the hierarchy of human needs in terms of food, shelter, physical security and so on. But they’ve always kind of assumed that you are a human being and therefore you’re the one deciding stuff, and so they tend not to think so much about something as fundamental as the ability to make your own choices, for good or ill. Science fiction writers have had a field day with this, pointing out that machines that do what you want are potentially disastrous, because you lose the freedom of choice.

One could imagine that if we formulate things not quite right, and the effect of the algorithms that we build is to make machines that don't value autonomy in the right way, or don't have it folded into the overall preference structure in the right way, then we could end up with a subtle, gradual, but very serious loss of autonomy in a way that we may not even notice as it happens, like the slowly boiling frog. If we could look ahead a hundred years and see how things turn out, we would say, "Oh my goodness, that is a terrible mistake. We're going to make sure that doesn't happen." So I think we need to be pretty careful. And again, this is where we probably need the help of philosophers to make sure that we keep things straight and understand how these things fit together.

Lucas: Right, so seems like we simply don’t understand ourselves. We don’t know the hierarchy of our preferences. We don’t really know what preferences exactly are. Stuart Armstrong talks about how we haven’t figured out the symbol grounding problem. So there are issues with even understanding how preferences relate to one another ultimately and how the meaning there is born. And we’re building AI systems which will be more capable than us. Perhaps they will be conscious. You have a short subchapter I believe on that or at least on how you’re not going to talk about consciousness.

Stuart: Yeah. I have a paragraph saying I have nothing to say.

Lucas: So potentially these things will also be moral patients, and we don't know how to get them to do the things that we're not entirely sure we want them to do. So how would you differentiate this book from Superintelligence or Life 3.0 or other books on the AI alignment problem and superintelligence in this space?

Stuart: I think the two major differences are, one, I believe that to understand this whole set of issues, or even just to understand what's happening with AI and what's going to happen, you have to understand something about AI. And I think that Superintelligence and Life 3.0 are, to some extent, easier to grasp if you already understand quite a bit about AI. And if you don't, then it's quite difficult to get as much out of those books as is in there. I think they are full of interesting points and ideas, but those points and ideas are easier to get out if you understand AI. So I wanted people to understand AI, and not just as a technology, right? You could talk about how deep learning works, but that's not the point. The point is really what intelligence is, and how we have taken that qualitative understanding of what it means and turned it into this technical discipline where the standard model is machines that achieve fixed objectives.

And then the second major difference is that I’m proposing a solution for at least one of the big failure modes of AI. And as far as I can tell, that solution, I mean, it’s sort of mentioned in some ways in Superintelligence, I think the phrase there is normative uncertainty, but it has a slightly different connotation. And partly that’s because this approach of inverse reinforcement learning is something that we’ve actually worked on at Berkeley for a little over 20 years. It wasn’t invented for this purpose, but it happens to fit this purpose and then the approach of how we solve this problem is fleshed out in terms of understanding that it’s this coupled system between the human that has the preferences and the machine that’s trying to satisfy those preferences and doesn’t know what they are. So I think that part is different. That’s not really present in those other two books.

It certainly shares, I think, the desire to convince people that this is a serious issue, and both Superintelligence and Life 3.0 do a good job of that. Superintelligence is a bit more depressing: it does such a good job of convincing you that things can go south in so many ways that you almost despair. Life 3.0 is a bit more cheerful, and I also think Life 3.0 does a good job of asking you what you want the outcome to be. And obviously you don't want it to be the catastrophic outcome where we're all placed in concrete coffins with heroin drips, as Stuart Armstrong likes to put it.

But there are lots of other outcomes, which are the ones you want, so I think that's an interesting part of that book. And of course Max Tegmark, the author of Life 3.0, is a physicist, so he has lots of amazing stuff about the technologies of the future, which I don't have so much of. So those are the main differences, I think: wanting to convey the essence of intelligence, how that notion has developed, how it is really an integral part of our whole intellectual tradition and our technological society, how that model is fundamentally wrong, and what the new model is that we have to replace it with.

Lucas: Yeah, absolutely. I feel that you helped to clarify intelligence for me, and the history of intelligence from evolution up to modern computer science. I think you set the AI alignment problem up well as resulting from there being intelligences in multi-agent scenarios trying to do different things, and then you suggest a solution, which we've discussed here already. So thanks so much for coming on the podcast, Stuart. Your book is set for release on October 8th?

Stuart: That’s correct.

Lucas: Great. We’ll include links for that in the description. Thanks so much for coming on.

 If you enjoyed this podcast, please subscribe. Give it a like or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI alignment series.

End of recorded material

AI Alignment Podcast: Synthesizing a human’s preferences into a utility function with Stuart Armstrong

In his Research Agenda v0.9: Synthesizing a human’s preferences into a utility function, Stuart Armstrong develops an approach for generating friendly artificial intelligence. His alignment proposal can broadly be understood as a kind of inverse reinforcement learning where most of the task of inferring human preferences is left to the AI itself. It’s up to us to build the correct assumptions, definitions, preference learning methodology, and synthesis process into the AI system such that it will be able to meaningfully learn human preferences and synthesize them into an adequate utility function. In order to get this all right, his agenda looks at how to understand and identify human partial preferences, how to ultimately synthesize these learned preferences into an “adequate” utility function, the practicalities of developing and estimating the human utility function, and how this agenda can assist in other methods of AI alignment.

Topics discussed in this episode include:

  • The core aspects and ideas of Stuart’s research agenda
  • Human values being changeable, manipulable, contradictory, and underdefined
  • This research agenda in the context of the broader AI alignment landscape
  • What the proposed synthesis process looks like
  • How to identify human partial preferences
  • Why a utility function anyway?
  • Idealization and reflective equilibrium
  • Open questions and potential problem areas

Last chance to take a short (4 minute) survey to share your feedback about the podcast.

 

Key points from Stuart: 

  • “There are two core parts to this research project essentially. The first part is to identify the humans’ internal models, figure out what they are, how we use them and how we can get an AI to realize what’s going on. So those give us the sort of partial preferences, the pieces from which we build our general preferences. The second part is to then knit all these pieces together into an overall preference for any given individual in a way that works reasonably well and respects as much as possible the person’s different preferences, meta-preferences and so on. The second part of the project is the one that people tend to have strong opinions about because they can see how it works and how the building blocks might fit together and how they’d prefer that it would be fit together in different ways and so on but in essence, the first part is the most important because that fundamentally defines the pieces of what human preferences are.”
  • “So, when I said that human values are contradictory, changeable, manipulable and underdefined, I was saying that the first three are relatively easy to deal with but that the last one is not. Most of the time, people have not considered the whole of the situation that they or the world or whatever is confronted with. No situation is exactly analogous to another, so you have to try and fit it in to different categories. So if someone dubious gets elected in a country and starts doing very authoritarian things, does this fit in the tyranny box which should be resisted or does this fit in the normal process of democracy box in which case it should be endured and dealt with through democratic means. What’ll happen is generally that it’ll have features of both, so it might not fit comfortably in either box and then there’s a wide variety for someone to be hypocritical or to choose one side or the other but the reason that there’s such a wide variety of possibilities is because this is a situation that has not been exactly confronted before so people don’t actually have preferences here. They don’t have a partial preference over this situation because it’s not one that they’ve ever considered… I’ve actually argued at some point in the research agenda that this is an argument for insuring that we don’t go too far from the human baseline normal into exotic things where our preferences are not well-defined because in these areas, the chance that there is a large negative seems higher than the chance that there’s a large positive… So, when I say not go too far, I don’t mean not embrace a hugely transformative future. I’m saying not embrace a hugely transformative future where our moral categories start breaking down.”
  • “One of the reasons to look for a utility function is to look for something stable that doesn’t change over time and there is evidence that consistency requirements will push any form of preference function towards a utility function and that if you don’t have a utility function, you just lose value. So, the desire to put this into a utility function is not out of an admiration for utility functions per se but our desire to get something that won’t further change or won’t further drift in a direction that we can’t control and have no idea about. The other reason is that as we start to control our own preferences better and have a better ability to manipulate our own minds, we are going to be pushing ourselves towards utility functions because of the same pressures of basically not losing value pointlessly.”
  • “Reflective equilibrium is basically you refine your own preferences, make them more consistent, apply them to yourself until you’ve reached a moment where your meta-preferences and your preferences are all smoothly aligned with each other. What I’m doing is a much more messy synthesis process and I’m doing it in order to preserve as much as possible of the actual human preferences. It is very easy to reach reflective equilibrium by just, for instance, having completely flat preferences or very simple preferences, these tend to be very reflectively in equilibrium with itself and pushing towards this thing is a push towards, in my view, excessive simplicity and the great risk of losing valuable preferences. The risk of losing valuable preferences seems to me a much higher risk than the gain in terms of simplicity or elegance that you might get. There is no reason that the kludgey human brain and it’s mess of preferences should lead to some simple reflective equilibrium. In fact, you could say that this is an argument against reflexive equilibrium because it means that many different starting points, many different minds with very different preferences will lead to similar outcomes which basically means that you’re throwing away a lot of the details of your input data.”
  • “Imagine that we have reached some positive outcome, we have got alignment and we haven’t reached it through a single trick and we haven’t reached it through the sort of tool AIs or software as a service or those kinds of approaches, we have reached an actual alignment. It, therefore, seems to me all the problems that I’ve listed or almost all of them will have had to have been solved, therefore, in a sense, much of this research agenda needs to be done directly or indirectly in order to achieve any form of sensible alignment. Now, the term directly or indirectly is doing a lot of the work here but I feel that quite a bit of this will have to be done directly.”

 

Important timestamps: 

0:00 Introductions 

3:24 A story of evolution (inspiring just-so story)

6:30 How does your “inspiring just-so story” help to inform this research agenda?

8:53 The two core parts to the research agenda 

10:00 How this research agenda is contextualized in the AI alignment landscape

12:45 The fundamental ideas behind the research project 

15:10 What are partial preferences? 

17:50 Why reflexive self-consistency isn’t enough 

20:05 How are humans contradictory and how does this affect the difficulty of the agenda?

25:30 Why human values being underdefined presents the greatest challenge 

33:55 Expanding on the synthesis process 

35:20 How to extract the partial preferences of the person 

36:50 Why a utility function? 

41:45 Are there alternative goal ordering or action producing methods for agents other than utility functions?

44:40 Extending and normalizing partial preferences and covering the rest of section 2 

50:00 Moving into section 3, synthesizing the utility function in practice 

52:00 Why this research agenda is helpful for other alignment methodologies 

55:50 Limits of the agenda and other problems 

58:40 Synthesizing a species wide utility function 

1:01:20 Concerns over the alignment methodology containing leaky abstractions 

1:06:10 Reflective equilibrium and the agenda not being a philosophical ideal 

1:08:10 Can we check the result of the synthesis process?

01:09:55 How did the Mahatma Armstrong idealization process fail? 

01:14:40 Any clarifications for the AI alignment community? 

 

Works referenced:

Research Agenda v0.9: Synthesising a human’s preferences into a utility function 

Some Comments on Stuart Armstrong’s “Research Agenda v0.9” 

Mahatma Armstrong: CEVed to death 

The Bitter Lesson 

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. 

Lucas: Hey everyone and welcome back to the AI Alignment Podcast at the Future of Life Institute. I’m Lucas Perry and today we’ll be speaking with Stuart Armstrong on his Research Agenda version 0.9: Synthesizing a human’s preferences into a utility function. Here Stuart takes us through the fundamental idea behind this research agenda, what this process of synthesizing human preferences into a utility function might look like, key philosophical and empirical insights needed for progress, how human values are changeable, manipulable, under-defined and contradictory, how these facts affect generating an adequate synthesis of human values, where this all fits in the alignment landscape and how it can inform other approaches to aligned AI systems.

If you find this podcast interesting or useful, consider sharing it with friends, on social media platforms, forums or anywhere you think it might be found valuable. I’d also like to put out a final call for this round of SurveyMonkey polling and feedback, so if you have any comments, suggestions or any other thoughts you’d like to share with me about the podcast, potential guests or anything else, feel free to do so through the SurveyMonkey poll link attached to the description of wherever you might find this podcast. I’d love to hear from you. There also seems to be some lack of knowledge regarding the pages that we create for each podcast episode. You can find a link to that in the description as well and it contains a summary of the episode, topics discussed, key points from the guest, important timestamps if you want to skip around, works referenced, as well as a full transcript of the audio in case you prefer reading.

Stuart Armstrong is a researcher at the Future of Humanity Institute who focuses on the safety and possibilities of artificial intelligence, how to define the potential goals of AI and map humanity's partially defined values into it, and the long-term potential for intelligent life across the reachable universe. He has been working with people at FHI and other organizations, such as DeepMind, to formalize AI desiderata in general models so that AI designers can include these safety methods in their designs. His collaboration with DeepMind on "Interruptibility" has been mentioned in over 100 media articles. Stuart's past research interests include comparing existential risks in general, including their probability and their interactions; anthropic probability, or how the fact that we exist affects our probability estimates around that key fact; decision theories that are stable under self-reflection and anthropic considerations; negotiation theory and how to deal with uncertainty about your own preferences; computational biochemistry and fast ligand screening; and parabolic geometry. His Oxford DPhil was on the holonomy of projective and conformal Cartan geometries, and so, without further ado or pretenses that I know anything about the holonomy of projective and conformal Cartan geometries, I give you Stuart Armstrong.

We're here today to discuss your Research Agenda v0.9: Synthesizing a human's preferences into a utility function. One wonderful place for us to start would be with this story of evolution, which you call an "inspiring just-so story." I think it would be helpful for contextualizing the place of the human, and what the human is, as we find ourselves here at the beginning of this value alignment problem. I'll go ahead and read it here for listeners to begin developing a historical context and narrative.

So, I'm quoting you here. You say, "This is the story of how evolution created humans with preferences and of what the nature of these preferences is. The story is not true in the sense of accurate. Instead, it is intended to provide some inspiration as to the direction of this research agenda. In the beginning, evolution created instinct-driven agents. These agents had no preferences or goals, nor did they need any. They were like Q-learning agents: they knew the correct action to take in different circumstances, but that was it. Consider baby turtles that walk towards the light upon birth because, traditionally, the sea was lighter than the land. Of course, this behavior fails them in the era of artificial lighting, but evolution has a tiny bandwidth, acting once per generation, so it created agents capable of planning, of figuring out different approaches rather than having to follow instincts. This was useful especially in varying environments, and so evolution off-loaded a lot of its job onto the planning agents."

"Of course, to be of any use, the planning agents needed to be able to model their environment to some extent, or else their plans couldn't work, and they had to have preferences, or else every plan was as good as another. So, in creating the first planning agents, evolution created the first agents with preferences. Of course, evolution is a messy, undirected process, so the process wasn't clean. Planning agents are still riven with instincts, and their modeling of the environment is situational, used when it is needed rather than forming some consistent whole. Thus, the preferences of these agents were underdefined and sometimes contradictory. Finally, evolution created agents capable of self-modeling and of modeling other agents in their species. This might have been because of competitive social pressures as agents learned to lie and to detect lying. Of course, this being evolution, the self- and other-modeling took the form of kludges built upon spandrels, built upon kludges. And then arrived humans, who developed norms and norm violations."

"As a side effect of this, we started having higher-order preferences as to what norms and preferences should be, but instincts and contradictions remained. This is evolution, after all, and evolution looked upon this hideous mess and saw that it was good. Good for evolution, that is, but if we want it to be good for us, we're going to need to straighten out this mess somewhat." Here we arrive, Stuart, at the human condition after hundreds of millions of years of evolution. So, given the story of human evolution that you've written here, why were you so interested in this story, and why were you looking into this mess to better understand AI alignment and develop this research agenda?

Stuart: This goes back to a paper that I co-wrote for NeurIPS. It basically develops the idea of inverse reinforcement learning or, more broadly, whether you can infer what the preferences of an agent are just by observing their behavior. Humans are not entirely rational, so the question I was looking at is whether you can simultaneously infer the rationality and the preferences of an agent by observing their behavior. It turns out to be mathematically completely impossible. We can't infer the preferences without making assumptions about the rationality, and we can't infer the rationality without making assumptions about the preferences. This is a rigorous result, so my looking at human evolution is basically a way to get around this result, in a sense: to make the right assumptions so that we can extract actual human preferences, since we can't just do it by observing behavior. We need to dig a bit deeper.
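
To make the impossibility concrete, here is a minimal sketch (in Python, not taken from the paper itself) of the underdetermination Stuart describes: two very different planner/reward decompositions explain exactly the same observed behavior, so behavior alone cannot separate rationality from preferences.

```python
# A toy decomposition of observed behaviour into (planner, reward) pairs.
# The observer sees only the chosen action, and two opposite explanations
# fit it equally well, so preferences cannot be inferred without extra
# assumptions about the agent's rationality.

def rational_planner(reward):
    """Picks the action with the highest reward."""
    return max(range(len(reward)), key=lambda a: reward[a])

def anti_rational_planner(reward):
    """Picks the action with the lowest reward."""
    return min(range(len(reward)), key=lambda a: reward[a])

R = [1.0, 0.0]              # hypothetical "true" preferences over two actions
neg_R = [-r for r in R]     # the exact opposite preferences

# Both (rational planner, R) and (anti-rational planner, -R) produce the
# same observed choice, action 0:
assert rational_planner(R) == anti_rational_planner(neg_R) == 0
```

The actual result is more general than this two-action toy case, but it shows why some assumption about the planner has to be supplied from outside the behavioral data.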

Lucas: So, what have you gleaned then from looking at this process of human evolution and seeing into how messy the person is?

Stuart: Well, there are two key insights here. The first is that I located where human preferences reside, or where we can assume that human preferences reside, and that's in the internal models of the humans: how we model the world, how we judge "that was a good thing," or "I want that," or "ooh, I'd be really embarrassed about that." So human preferences, or at least the building blocks of human preferences, are defined in this project to be in these internal models that humans have, with the labeling of states or outcomes as good or bad. The other point to bring up about evolution is that since it's not anything like a clean process, it's not as if we have one general model with clearly labeled preferences and everything else flows from that. It is a mixture of situational models in different circumstances, with subtly different things labeled as good or bad. So, as I said, human preferences are contradictory, changeable, manipulable and underdefined.

So, there are two core parts to this research project essentially. The first part is to identify the humans’ internal models, figure out what they are, how we use them and how we can get an AI to realize what’s going on. So those give us the sort of partial preferences, the pieces from which we build our general preferences. The second part is to then knit all these pieces together into an overall preference for any given individual in a way that works reasonably well and respects as much as possible the person’s different preferences, meta-preferences and so on.

The second part of the project is the one that people tend to have strong opinions about because they can see how it works and how the building blocks might fit together and how they’d prefer that it would be fit together in different ways and so on but in essence, the first part is the most important because that fundamentally defines the pieces of what human preferences are.

Lucas: Before we dive into the specifics of your agenda here, can you contextualize it within evolution of your thought on AI alignment and also how it fits within the broader research landscape?

Stuart: So, this is just my perspective on what the AI alignment landscape looks like. There is a collection of different approaches addressing different aspects of the alignment problem. Some of them, which MIRI is working a lot on, are technical things like how to ensure stability of goals, and other thoughts along these lines that should be necessary for any approach. Others are about how to make the AI safe either indirectly or by making it fully aligned. In the first category you have things like software as a service: can we have superintelligent abilities integrated in a system that doesn't allow for, say, superintelligent agents with pernicious goals?

Others that I have looked into in the past are things like low-impact agents or oracles, where again the idea is that we have a superintelligence, we cannot align it with human preferences, yet we can use it to get some useful work done. Then there are the approaches which aim to solve the whole problem and get actual alignment, what used to be called the friendly AI approach. Here it's not an AI that's constrained in any way; it's an AI that is intrinsically motivated to do the right thing. There are a variety of different approaches to that, some more serious than others. Paul Christiano has an interesting variant on that, though it's hard to tell; I would say his is a bit of a mixture of value alignment and constraining what the AI can do, in a sense, but it is very similar. And so this agenda is of that last type: getting the aligned, friendly AI, the aligned utility function.

In that area, there are what I would call the approaches that rely on indirect proxies. This is the idea that you put Nick Bostrom in a room for 500 years, or a virtual version of that, and hope that you get something aligned at the end of it. Then there are direct approaches, and this is the basic direct approach: doing everything the hard way, in a sense, by defining everything that needs to be defined so that the AI can then assemble an aligned preference function from all the data.

Lucas: Wonderful. So you gave us a good summary earlier of the different parts of this research agenda. Would you like to expand a little bit on the “fundamental idea” behind this specific research project?

Stuart: There are two fundamental ideas that are not too hard to articulate. The first is that though our revealed preferences could be wrong and our stated preferences could be wrong, what our actual preferences are, at least in one moment, is what we model inside our head, what we're thinking of as the better option. We might lie, as I say, in politics or in a court of law or just socially, but generally, when we know that we're lying, it's because there's a divergence between what we're saying and what we're modeling internally. So it is this internal model which I'm identifying as the place where our preferences lie, and then all the rest of it, the whole convoluted synthesis project, is just basically how we take these basic pieces and combine them in a way that does not seem to result in anything disastrous, that respects human preferences and meta-preferences, and, this is a key thing, actually reaches a result. That's why the research project is designed to have a lot of default actions in a lot of situations.

If the person does not have strong meta-preferences, then there's a whole procedure for how you combine, say, preferences about the world and preferences about your identity, which are, by default, combined in different ways. If you would want GDP to go up, that's a preference about the world. If you yourself want to believe something, or to believe only the truth, for example, that's a preference about your identity. It tends to be that identity preferences are more fragile, so the default is that preferences about the world are just added together, and this overcomes most of the contradictions, because very few human preferences are exactly anti-aligned, whereas identity preferences are combined in a smoother process so that you don't lose too much on any of them. But as I said, these are the default procedures, and they're all defined so that we get an answer; there's also a large ability for the person's meta-preferences to override the defaults. Again, precautions are taken to ensure that an answer is actually reached.
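
As a rough illustration of what such a default combination rule could look like, here is a hedged sketch; the specific formulas (a plain sum for world preferences, a softmin for identity preferences, and the temperature parameter) are illustrative assumptions, not the agenda's actual definitions.

```python
import math

def combine_world(prefs):
    """Default for preferences about the world: a plain sum, so
    near-anti-aligned preferences largely cancel out."""
    return sum(prefs)

def combine_identity(prefs, temperature=1.0):
    """Default for preferences about identity: a smooth minimum (softmin),
    so the combination cannot ignore the worst-off preference entirely."""
    weights = [math.exp(-p / temperature) for p in prefs]
    total = sum(weights)
    return sum(p * w / total for p, w in zip(prefs, weights))

world_prefs = [0.8, -0.1, 0.5]    # e.g. "GDP should go up", etc. (made-up strengths)
identity_prefs = [0.9, 0.2, 0.7]  # e.g. "I only want to believe true things"

print(combine_world(world_prefs))        # 1.2
print(combine_identity(identity_prefs))  # ~0.51, pulled toward the lowest value (the plain mean is 0.6)
```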

Lucas: Can you unpack what partial preferences are? What you mean by partial preferences and how they’re contextualized within human mental models?

Stuart: What I mean by partial preference is mainly that a human has a small model of part of the world like let’s say they’re going to a movie and they would prefer to invite someone they like to go with them. Within this mental model, there is the movie, themselves and the presence or absence of the other person. So, this is a very narrow model of reality, virtually the entire rest of the world and, definitely, the entire rest of the universe does not affect this. It could be very different and not change anything of this. So, this is what I call a partial preference. You can’t go from this to a general rule of what the person would want to do in every circumstance but it is a narrow valid preference. Partial preferences refers to two things, first of all, that it doesn’t cover all of our preferences and secondly, the model in which it lives only covers a narrow slice of the world.

Stuart: You can make some modifications to this. This is the whole point of the second section: if the approach works, variations on the synthesis process should not produce outcomes that are disastrous at all. If changing the synthesis process a little bit would result in a disaster, then something has gone wrong with the whole approach. You could, for example, add restrictions like looking for consistent preferences, but I'm starting from the fundamental thing, which is that there is this mental model, there is an unambiguous judgment that one thing is better than another, and then we can go from there in many ways. A key part of this approach is that there is no single fundamental synthesis process that would work, so it is aiming for an adequate synthesis rather than an idealized one, because humans are a mess of contradictory preferences, because even philosophers have contradictory meta-preferences within their own minds and with each other, and because people can learn different preferences depending on the order in which information is presented to them, for example.

Any method has to make a lot of choices, and therefore, I’m writing down explicitly as many of the choices that have to be made as I can so that other people can see what I see the processes entailing. I am quite wary of things that look for reflexive self-consistency because in a sense, if you define your ideal system as one that’s reflexively self-consistent, that’s a sort of local condition in a sense that the morality judges itself by its own assessment and that means that you could theoretically wander arbitrarily far in preference space before you hit that. I don’t want something that is just defined by this has reached reflective equilibrium, this morality synthesis is now self-consistent, I want something that is self-consistent and it’s not too far from where it started. So, I prefer to tie things much more closely to actual human preferences and to explicitly aim for a synthesis process that doesn’t wander too far away from them.

Lucas: I see, so the starting point is the evaluative standard that we're trying to keep the synthesis close to?

Stuart: Yes. I don't think you can say that any synthesis of a human's preferences is intrinsically wrong as long as it reflects some of the preferences that were inputs into it. However, I think you can say that it is wrong from the perspective of the human that you started with if it strongly contradicts what they would want. Disagreement from the starting position is something which I take to be very relevant to the ultimate outcome. There's a bit of a challenge here, because we have to avoid, say, preferences which are based on inaccurate facts. So some of the preferences are inevitably going to be removed or changed just because they're based on factually inaccurate beliefs. Some other processes, like trying to make consistent what is very vague, will also result in some preferences being altered. So you can't just say the starting person has veto power over the final outcome, but you do want to respect their starting preferences as much as you possibly can.

Lucas: So, reflecting here on the difficulty of this agenda and on how human beings contain contradictory preferences and models, can you expand a bit how we contain these internal contradictions and how this contributes to the difficulty of the agenda?

Stuart: I mean humans contain many contradictions within them. Our mood shifts. We famously are hypocritical in favor of ourselves and against the foibles of others, we basically rewrite narratives to allow ourselves to always be heroes. Anyone who’s sort of had some experience of a human has had knowledge of when they’ve decided one way or decided the other way or felt that something was important and something else wasn’t and often, people just come up with a justification for what they wanted to do anyway, especially if they’re in a social situation, and then some people can cling to this justification and integrate that into their morality while behaving differently in other ways. The easiest example are sort of political hypocrites. The anti-gay preacher who sleeps with other men is a stereotype for a reason but it’s not just a sort of contradiction at that level. It’s that basically most of the categories in which we articulate our preferences are not particularly consistent.

If we throw a potentially powerful AI into this, which could change the world drastically, we may end up with situations that cut across our preferences. For example, suppose that someone created, or wanted to create, a subspecies of human bred to be a slave race. Now, this race did not particularly enjoy being a slave race, but they wanted to be slaves very strongly. In this situation, a lot of our intuitions fall apart, because we know that slavery is almost always involuntary and is backed up by coercion. We also know that even though our preferences and our enjoyments do sometimes come apart, they don't normally come apart that much. So we're now confronted by a novel situation where a lot of our intuitions are pushing against each other.

You also have things like nationalism, for example. Some people have strong nationalist sentiments about their country, and sometimes their country changes, and in that case what seemed like a very simple commitment, "I will obey the laws of my nation," for example, becomes much more complicated as the whole concept of "my nation" starts to break down. This is the main way that I see preferences as being underdefined: they're articulated in terms of concepts which are not universal and which bind together many, many different concepts that may come apart.

Lucas: So, at any given moment, like myself at this moment, the issue is that there's a large branching factor of how many possible future Lucases there can be. At this time, and maybe in a short interval around this time as you explore in your paper, there is the sum total of my partial preferences and the partial world models in which these partial preferences are contained. These preferences and models can be expressed differently, and sort of hacked and changed, based on how questions are asked and the order of the questions. I am like a 10,000-faced thing: I can show you one of my many faces depending on how you push my buttons, and depending on all of the external input that I get in the future, I'm going to express, and maybe become more idealized along, one of many different paths. The only thing that we have for evaluating which of these many different paths I would prefer is what I would say right now, right?

Say my core value is joy or certain kinds of conscious experiences over others and all I would have for evaluating this many branching thing is say this preference now at this time but that could be changed in the future, who knows? I will create new narratives and stories that justify the new person that I am and that makes sense of the new values and preferences that I have retroactively, like something that I wouldn’t actually have approved of now but my new, maybe more evil version of myself would approve and create a new narrative retroactively. Is this sort of helping to elucidate and paint the picture of why human beings are so messy?

Stuart: Yes, we need to separate that into two. The first is that our values can be manipulated by other humans as they often are and by the AI itself during the process but that can be combated to some extent. I have a paper that may soon come out on how to reduce the influence of an AI over a learning process that it can manipulate. That’s one aspect. The other aspect is when you are confronted by a new situation, you can go in multiple different directions and these things are just not defined. So, when I said that human values are contradictory, changeable, manipulable and underdefined, I was saying that the first three are relatively easy to deal with but that the last one is not.

Most of the time, people have not considered the whole of the situation that they or the world or whatever is confronted with. No situation is exactly analogous to another, so you have to try and fit it in to different categories. So if someone dubious gets elected in a country and starts doing very authoritarian things, does this fit in the tyranny box which should be resisted or does this fit in the normal process of democracy box in which case it should be endured and dealt with through democratic means. What’ll happen is generally that it’ll have features of both, so it might not fit comfortably in either box and then there’s a wide variety for someone to be hypocritical or to choose one side or the other but the reason that there’s such a wide variety of possibilities is because this is a situation that has not been exactly confronted before so people don’t actually have preferences here. They don’t have a partial preference over this situation because it’s not one that they’ve ever considered.

How they develop one is due to a lot of things, as you say: the order in which information is presented, which category the situation seems to most strongly fit into, and so on. We are going here for very mild underdefinedness. The willing slave race was my attempt to push it out a bit further into something somewhat odd, and then if you consider a powerful AI that is able to create vast numbers of intelligent entities, for example, and reshape society, human bodies and human minds in hugely transformative ways, we are going to enter very odd situations where all our starting instincts are almost useless. I've actually argued at some point in the research agenda that this is an argument for ensuring that we don't go too far from the human baseline normal into exotic things where our preferences are not well-defined, because in these areas the chance that there is a large negative seems higher than the chance that there's a large positive.

Now, I'm talking about things that are very distant in terms of our categories. The world of Star Trek is exactly the human world from this perspective, because even though they have science fiction technology, all of the concepts and decisions are articulated around concepts that we're very familiar with, because it is a work of fiction addressed to us now. So, when I say not go too far, I don't mean not embrace a hugely transformative future. I'm saying not embrace a hugely transformative future where our moral categories start breaking down.

Lucas: In my mind, there are two senses. There's the sense in which we have these models for things, with all of these necessary and sufficient conditions by which something can be pattern-matched to some concept, and we can encounter situations where the conditions for many different things are combined in a new way, which makes the thing, like goodness or justice, underdefined, as in the slavery case where we don't really know initially whether it is good or bad. I see underdefinedness in that first sense. The other sense is the sense in which my brain is a neural architectural aggregate of a lot of neurons, and the sum total of its firing statistics and specific neural pathways can potentially be identified as containing preferences and models somewhere within it. So is it also true to say that it's underdefined in this second sense: that for the human, not as a thing in the world but as a process in the world largely constituted by the human brain, it's underdefined where in the neural firing statistics or the processing of the person there could ever be something called a concrete preference or value?

Stuart: I would disagree that it is underdefined in the second sense.

Lucas: Okay.

Stuart: In order to solve the second problem, you need to solve the symbol grounding problem for humans. You need to show that the symbols or the neural pattern firing or the neuron connection or something inside the brain corresponds to some concepts in the outside world. This is one of my sort of side research projects. When I say side research project, I mean I wrote a couple of blog posts on this pointing out how I might approach it and I point out that you can do this in a very empirical way. If you think that a certain pattern of neural firing refers to say a rabbit, you can see whether this thing firing in the brain is that predictive of say a rabbit in the outside world or predictive of this person is going to start talking about rabbits soon.

In model theory, the actual thing that gives meaning to the symbols is beyond the scope of the mathematical theory, but if you have a potential connection between the symbols and the outside world, you can check whether this theory is a good one or a terrible one. If you say this corresponds to hunger, and yet that thing only seems to trigger when someone's having sex, for example, we can say, okay, your model that this corresponds to hunger is terrible. It's wrong. I cannot use it for predicting that the person will eat, but I can use it for predicting that they're having sex. So, if I model this as connected with sex, this is a much better grounding of that symbol. Using methods like this, and there are some subtleties (I also address Quine's "gavagai" and connect it to webs of connotation and concepts that go together), the basic idea is to empirically solve the symbol grounding problem for humans.
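
A minimal sketch of this empirical grounding test might look like the following; the data and the candidate groundings are invented, and a real version would use statistical measures over actual recordings rather than exact matching.

```python
# Score a proposed symbol grounding by how well an internal signal predicts
# the external feature it is claimed to refer to (all data here is made up).

def predictive_accuracy(signal, feature):
    """Fraction of time steps where the internal signal matches the external feature."""
    return sum(s == f for s, f in zip(signal, feature)) / len(signal)

neural_pattern = [1, 0, 1, 1, 0, 0, 1, 0]   # candidate "rabbit" pattern firing over time
rabbit_nearby  = [1, 0, 1, 1, 0, 0, 1, 0]   # whether a rabbit was actually present
feeling_hungry = [0, 1, 0, 0, 1, 1, 0, 1]   # an alternative proposed grounding

print(predictive_accuracy(neural_pattern, rabbit_nearby))   # 1.0 -> good grounding
print(predictive_accuracy(neural_pattern, feeling_hungry))  # 0.0 -> terrible grounding
```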

When I say that things are underdefined, I mean that they are articulated in terms of concepts that are underdefined across all possibilities in the world, not that these concepts could be anything or that we don't know what they mean. Our mental models correspond to something. They are a collection of past experience, and the concepts in our brain tie together a variety of experiences that we've had. They might not be crisp. They might not be well-defined even if you look at the totality of the universe, but they correspond to something: to some repeated experience, some thought process that we've had and from which we've extracted the idea. When we do this in practice, we are going to inject some of our own judgments into it, and since humans are so very similar in how we interpret each other and how we decompose many concepts, it's not necessarily particularly bad that we do so. But I strongly disagree that these are arbitrary concepts that are going to be put in by hand. They are, in the main, going to be identified via some criteria for tracking what happens in the brain, comparing it with the outside world, and those kinds of things.

My concept of, say, a cinema is not an objectively well-defined fact, but I know what I think of as a cinema, what I expect in a cinema and what I don't expect in a cinema: I expect it to go dark, and a projector, and things like that. I don't expect it to be a completely open space in the Sahara Desert, under the sun, with no seats, no sound and no projection. I'm pretty clear that one of these things is a lot more of a cinema than the other.

Lucas: Do you want to expand here a little bit about this synthesis process?

Stuart: The main idea is to try and ensure that no disasters come about and the main thing that could lead to a disaster is the over prioritization of certain preferences over others. There are other avenues to disaster but this seems to be the most obvious. The other important part of the synthesis process is that it has to reach an outcome, which means that a vague description is not sufficient, so that’s why it’s phrased in terms of this is the default way that you synthesize preferences. This way may be modified by certain meta-preferences. The meta-preferences have to be reducible to some different way of synthesizing the preferences.

For example, the synthesis does not by default particularly over-weight long-term preferences versus short-term preferences. It would prioritize long-term preferences but not exclude short-term ones. So "I want to be thin" is not necessarily prioritized over "that's a delicious piece of cake that I'd like to eat right now," for example. But human meta-preferences often prioritize long-term preferences over short-term ones, so this is going to be included, and it is going to shift the default balance towards long-term preferences.

Lucas: So, powering the synthesis process, how are we to extract the partial preferences and their weights from the person?

Stuart: That's, as I say, the first part of the project, and that is a lot more empirical. This is going to involve looking at what neuroscience says, maybe even what algorithm theory or the modeling of algorithms says, about what's physically going on in the brain and how this corresponds to internal mental models. There might be things like people noting down what they're thinking and correlating this with changes in the brain. This is a much more empirical aspect of the process that could be carried out essentially independently from the synthesis project.

Lucas: So, a much more advanced neuroscience would be beneficial here?

Stuart: Yes, but even without that, it might be possible to infer some of these things indirectly via the AI and if the AI accounts well for uncertainties, this will not result in disasters. If it knows that we would really dislike losing something of importance to our values, even if it’s not entirely sure what the thing of importance is, it will naturally, with that kind of motivation, act in a cautious way, trying to preserve anything that could be valuable until such time as it figures out better what we want in this model.
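
The qualitative point can be illustrated with a toy decision rule; this is only a sketch of cautious behavior under value uncertainty, with made-up numbers, not the agenda's actual proposal.

```python
# A toy cautious decision rule under uncertainty about which utility function
# is the right one: judge each action by its worst case across the candidates.

candidate_utilities = {
    "preserve_the_thing_of_value": {"irreversible_action": -100.0, "wait_and_learn": 0.0},
    "maximise_output":             {"irreversible_action":   10.0, "wait_and_learn": 0.0},
}

def worst_case_value(action):
    """Value of an action under the least favourable candidate utility function."""
    return min(u[action] for u in candidate_utilities.values())

best = max(["irreversible_action", "wait_and_learn"], key=worst_case_value)
print(best)  # "wait_and_learn": preserve anything that could be valuable until preferences are clearer
```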

Lucas: So, in section two of your paper, synthesizing the preference utility function, within this section, you note that this is not the only way of constructing the human utility function. So, can you guide us through this more theoretical section, first discussing what sort of utility function and why a utility function in the first place?

Stuart: One of the reasons to look for a utility function is to look for something stable that doesn’t change over time and there is evidence that consistency requirements will push any form of preference function towards a utility function and that if you don’t have a utility function, you just lose value. So, the desire to put this into a utility function is not out of an admiration for utility functions per se but our desire to get something that won’t further change or won’t further drift in a direction that we can’t control and have no idea about. The other reason is that as we start to control our own preferences better and have a better ability to manipulate our own minds, we are going to be pushing ourselves towards utility functions because of the same pressures of basically not losing value pointlessly.

You can kind of see it in some investment bankers who have, to a large extent, constructed their own preferences to be expected-money maximizers within a range. It was quite surprising to see, but human beings are capable of pushing themselves towards that; it's what repeated exposure to investment decisions tends to do to you, and it's the correct thing to do in terms of maximizing the money. This kind of general pressure on humans, combined with humans' ability to self-modify, which we may develop further in the future, is going to be pushing us towards utility functions anyway, so we may as well go all the way and get the utility function directly rather than being pushed into it.

Lucas: So, is the view here that the reason why we’re choosing utility functions even when human beings are very far from being utility functions is that when optimizing our choices in mundane scenarios, it’s pushing us in that direction anyway?

Stuart: In part. I mean utility functions can be arbitrarily complicated and can be consistent with arbitrarily complex behavior. A lot of when people think of utility functions, they tend to think of simple utility functions and simple utility functions are obviously simplifications that don’t capture everything that we value but complex utility functions can capture as much of the value as we want. What tends to happen is that when people have say, inconsistent preferences, that they are pushed to make them consistent by the circumstances of how things are presented, like you might start with the chocolate mousse but then if offered a trade for the cherry pie, go for the cherry pie and then if offered a trade for the maple pie, go for the maple pie but then you won’t go back to the chocolate or even if you do, you won’t continue going around the cycle because you’ve seen that there is a cycle and this is ridiculous and then you stop it at that point.

So, what we decide when we don't have utility functions tends to be determined by the order in which things are encountered and by other contingent factors, and, as I say, non-utility functions tend to be intrinsically less stable and so can drift. For all these reasons, it's better to nail down a utility function from the start, so that you don't have further drift and your preferences are not determined by the order in which you encounter things, for example.
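
Here is a small sketch of that order dependence; the pairwise preferences are hypothetical, and the point is simply that with an intransitive cycle the final choice is fixed by the order of offered trades rather than by anything resembling a stable utility function.

```python
# With intransitive pairwise preferences (mousse < cherry < maple, but maple < mousse),
# the option the agent ends up holding depends purely on the order of offers.

prefers = {  # prefers[(a, b)] is True if the agent would trade a away for b
    ("mousse", "cherry"): True,
    ("cherry", "maple"):  True,
    ("maple",  "mousse"): True,   # the intransitive link that closes the cycle
}

def final_choice(start, offers):
    """Accept each offered trade in turn whenever it is preferred to the current holding."""
    held = start
    for offer in offers:
        if prefers.get((held, offer), False):
            held = offer
    return held

print(final_choice("mousse", ["cherry", "maple"]))   # -> "maple"
print(final_choice("mousse", ["maple", "cherry"]))   # -> "cherry": same "preferences", different end point
```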

Lucas: This is though in part thus a kind of normative preference then, right? To use utility functions in order not to be pushed around like that. Maybe one can have the meta-preferences for their preferences to be expressed in the order in which they encounter things.

Stuart: You could have that strong meta-preference, yes, though even that can be captured by a utility function if you feel like doing it. Utility functions can capture pretty much any form of preferences, even the ones that seem absurdly inconsistent. So, we’re not actually losing anything in theory by insisting that it should be a utility function. We may be losing things in practice in the construction of that utility function. I’m just saying if you don’t have something that is isomorphic with a utility function or very close to that, your preferences are going to drift randomly affected by many contingent factors. You might want that, in which case, you should put it in explicitly rather than implicitly and if you put it in explicitly, it can be captured by a utility function that is conditional on the things that you see, in the order in which you see them, for example.

Lucas: So, comprehensive AI services and other tool-like approaches to AI alignment, I suppose, avoid some of the anxieties produced by strong agential AIs with utility functions. Are there alternative goal-ordering or action-producing methods for agents, other than utility functions, that may have the properties we desire of utility functions, or is the category of utility functions just so large that it encapsulates most of what is mathematically rigorous and simple?

Stuart: I'm not entirely sure. Alternative goal structures tend to be quite ad hoc and limited in my practical experience, whereas utility functions, or reward functions, which may or may not be isomorphic to them, do seem to be universal. There are possible inconsistencies within utility functions themselves if you get a self-referential utility function including your own preferences, for example, but MIRI's work should hopefully clarify those aspects. I came up with an alternative goal structure which is basically an equivalence class of utility functions that are not equivalent in terms of utility, and this could successfully model an agent whose preferences were determined by the order in which things were chosen, but I put this together as a toy model, as a thought experiment; I would never seriously suggest building it. So it just seems that, for the moment, most non-utility-function things are either ad hoc or underdefined or incomplete, and most things can be captured by utility functions. The things that are not utility functions all seem at the moment to be flawed, and utility functions seem to be sufficiently versatile to capture anything that you would want.

This may mean, by the way, that we lose some of the elegant properties of utility functions that we normally assume. Deontology, for instance, can be captured by a utility function that assigns one to obeying all the rules and zero to violating any of them, and this is a perfectly valid utility function; however, there isn't much expected-utility trading-off you can do with it. It behaves almost exactly like a behavioral constraint: never choose any option that is against the rules. That kind of thing, even though it's technically a utility function, might not behave the way that we're used to utility functions behaving in practice. So, when I say that it should be captured as a utility function, I mean formally it has to be defined in this way, but informally it may not have the properties that we informally expect of utility functions.
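
A tiny sketch of that point, with made-up rules: the 0/1 "utility function" below is formally a utility function, but in practice it behaves like a hard constraint rather than something you trade off against.

```python
# A deontological utility function: 1 if every rule is obeyed, 0 otherwise.

RULES = [
    lambda outcome: not outcome.get("lied", False),
    lambda outcome: not outcome.get("stole", False),
]

def deontic_utility(outcome):
    """Formally a utility function, but with no grading between permitted and forbidden."""
    return 1.0 if all(rule(outcome) for rule in RULES) else 0.0

print(deontic_utility({"lied": False, "stole": False}))  # 1.0 -> permitted
print(deontic_utility({"lied": True,  "stole": False}))  # 0.0 -> ruled out entirely
```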

Lucas: Wonderful. This is a really great picture that you’re painting. Can you discuss extending and normalizing the partial preferences? Take us through the rest of section two on synthesizing to a utility function.

Stuart: The generalizing is basically that you have, for instance, a preference for going to the cinema this day with that friend versus going to the cinema without that friend. That's an incredibly narrow preference, but you also have preferences about watching films in general and about being with friends in general, so these things should be combined, in as much as they can be, into some judgment of what you like to watch, who you like to watch it with and under what circumstances. That's the generalizing. The extending is basically trying to push these beyond the typical situations. So, if there were a virtual reality which really gave you the feeling that other people were present with you, which current virtual reality doesn't tend to, would this count as being with your friend? What level of interaction would be required for it to count as being with your friend? That's some of the extending.

The normalizing is just the fact that utility functions are defined up to scaling, up to multiplying by some positive real constant. So, if you want to add utilities together, or combine them in a smooth-min, or combine them in any way, you have to scale the different preferences, and there are various ways of doing this. I have failed to find an intrinsically good way of doing it that has all the nice formal properties that you would want, but there are a variety of ways it can be done, all of which seem acceptable. The one I'm currently using is the mean-max normalization, which is that the best possible outcome gets a utility of one and the average outcome gets a utility of zero. That's the scaling.

Then the weight of these preferences is just how strongly you feel about it. Do you have a mild preference for going to the cinema with this friend? Do you have an overwhelming desire for chocolate? Once they’ve normalized, you weigh them, and you combine them.
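
A short sketch of the mean-max normalization and weighting just described; the example outcomes, partial preferences and weights are invented for illustration.

```python
# Rescale each partial utility so its best outcome scores 1 and its average
# outcome scores 0, then weight and sum the rescaled utilities.

def mean_max_normalise(utility):
    """Rescale so max(utility) -> 1 and mean(utility) -> 0."""
    values = list(utility.values())
    mean_v, max_v = sum(values) / len(values), max(values)
    return {o: (v - mean_v) / (max_v - mean_v) for o, v in utility.items()}

outcomes = ["cinema_with_friend", "cinema_alone", "stay_home"]
partial_prefs = {
    "friendship": {"cinema_with_friend": 5.0, "cinema_alone": 1.0, "stay_home": 1.0},
    "film_buff":  {"cinema_with_friend": 3.0, "cinema_alone": 3.0, "stay_home": 0.0},
}
weights = {"friendship": 2.0, "film_buff": 1.0}   # how strongly each preference is felt

normalised = {name: mean_max_normalise(u) for name, u in partial_prefs.items()}
combined = {o: sum(weights[n] * normalised[n][o] for n in partial_prefs) for o in outcomes}
print(max(combined, key=combined.get))  # -> "cinema_with_friend"
```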

Lucas: Can you take us through the rest of section two here, if there’s anything else here that you think is worth mentioning?

Stuart: I'd like to point out that this is intended to work with any particular human being that you point the process at, and there are a lot of default assumptions I've made from my own perspective as a non-moral-realist who worries about over-simplification, among other things. The idea is that if people have strong meta-preferences themselves, these will overwhelm the default decisions I've made; but if people don't have strong meta-preferences, then their preferences are synthesized in the way that I feel is best for not losing any important human value. There are also judgments about what would constitute a disaster, or how we might judge this to have gone disastrously wrong; those are important and need to be fleshed out a bit more, because many of them can't quite be captured within this system.

The other thing is that the outcomes may be very different. To choose a silly example, if you are a 50/50 mix of total utilitarian and average utilitarian, versus a 45/55 mix either way, the outcomes are going to be very different, because the pressure on the future is going to be different, and because the AI is going to have a lot of power, that results in very different outcomes. But from our perspective, when we say we're 50/50 between total and average utilitarianism, we're not exactly 50/50 most of the time; we just feel they're about the same. So 45/55 should not result in a disaster if 50/50 doesn't.

So even though each of these three mixes, 45/55, 50/50, and 55/45, will look at something that optimizes one of the other two mixes and say "that is very bad from my perspective," from a more human perspective we're saying all of them are pretty much okay. Well, we would say none of them are pretty much okay, because they don't incorporate many of our other preferences, but the idea is that when we get all the preferences together, it shouldn't matter if the weighting is a bit fuzzy. So even though the outcome will change a lot if we shift the weights a little, the quality of the outcome shouldn't change a lot. This is connected with a point that comes up in section three: uncertainties may change the outcome a lot, but they should not change the quality of the outcome, where the quality of the outcome is measured, somewhat informally, by our current preferences.

Lucas: So, moving along here into section three, what can you tell us about the synthesis of the human utility function in practice?

Stuart: So, first of all: well, let's do this project, let's get it done, but we don't have perfect models of the human brain, we haven't grounded all the symbols, so what are we going to do with the great uncertainties? The first part argues that even with the uncertainties, this method is considerably better than nothing, and you should expect it to be pretty safe and somewhat adequate even with great uncertainties. The other part shows how thinking in terms of human mental models can help correct and improve some other methods, like revealed preferences, stated preferences, or locking the philosopher in a box for a thousand years. All of these methods fail, and we actually have a pretty clear idea of when they fail. Revealed preferences fail because we don't model bounded rationality very well, and even when we do, we know that sometimes our preferences are different from what we reveal. Stated preferences fail in situations where there are strong incentives not to tell the truth, for example.

We could deal with these by adding all the counterexamples as special cases, or we could add the counterexamples as something to learn from. What I'm recommending is that we add them as something to learn from while stating that the reason each one is a counterexample is that there is a divergence between whatever we're measuring and the internal model of the human. The idea is that it is a lot easier to generalize when you have an error theory rather than just a list of error examples.

Lucas: Right, and there's also this point of view you're arguing: that this research agenda and perspective is potentially very helpful for things like corrigibility, low-impact research, and Christiano's distillation and amplification, which you claim all seem to be methods that require some simplified version of the human utility function. So any conceptual or systematic insights generated through this research agenda seem, in your view, able to make significant contributions to other research agendas that don't specifically take this lens?

Stuart: I feel that even something like corrigibility can benefit from this, because in my experience things like corrigibility and low impact have to define, to some extent, what is important and what can be categorized as unimportant. A low-impact AI cannot be agnostic about our preferences: it has to know that a nuclear war is a high-impact thing, whether or not we'd like it, whereas turning on an orange light that doesn't go anywhere is a low-impact thing. But there's no real intrinsic measure by which one is high impact and the other is low impact; both of them have ripples across the universe. I think I phrased it as: Hitler, Gandhi, and Thanos all know what a low-impact AI is, all know what an oracle AI is, or at least know the behavior to expect from it. So it means that we need to get some of the human preferences in, the bit that tells us that nuclear wars are high impact, but we don't need to get all of it in, because since so many different humans will agree on it, you don't need to capture any of their individual preferences.

Lucas: So it's applicable to these other methodologies, and it's also your belief, and I'm quoting you here: "I'd give a 10% chance of it being possible this way," meaning through this research agenda, "and a 95% chance that some of these ideas will be very useful for other methods of alignment." So just adding that here as your credences for how fruitfully insights from this research agenda can be applied to other areas of AI alignment.

Stuart: In a sense, you could think of this research agenda in reverse. Imagine that we have reached a positive outcome: we have got alignment, and we haven't reached it through a single trick, and we haven't reached it through tool AIs or software-as-a-service or those kinds of approaches; we have reached an actual alignment. It therefore seems to me that all the problems I've listed, or almost all of them, will have had to have been solved. So, in a sense, much of this research agenda needs to be done, directly or indirectly, in order to achieve any form of sensible alignment. Now, "directly or indirectly" is doing a lot of the work here, but I feel that quite a bit of this will have to be done directly.

Lucas: Yeah, I think that makes a lot of sense. It seems like there's just a ton about the person that is confused and difficult to understand, in terms of what we even mean by the person, and also the broader definitions involved in alignment. Given the optimism you've stated here surrounding the applicability of this research agenda on synthesizing a human's preferences into a utility function, what can you say about the limits of the method? Any pessimism to inject here?

Stuart: So, I have a section four, which is labeled as the things that I don't address. Some of these are actually a bit sneaky, like the section on how to combine the preferences of different people, because if you read that section, it basically lays out ways of combining different people's preferences. But I've put it there to say: I don't want to talk about this issue in the context of this research agenda, because I think it just diverts from the important work here. There are a few points like that, but some of them are genuine things that I think are problems, and the biggest is the fact that there is a sort of informal Gödel statement in humans about their own preferences. How many people would accept a computer synthesis of their preferences and say, yes, those are my preferences, especially when they can explore it a bit and find the counterintuitive bits? I expect humans in general to reject the AI-assigned synthesis no matter what it is, pretty much just because it was synthesized and then given to them; I expect them to reject it or want to change it.

We have a natural reluctance to accept the judgment of other entities about our own morality. This is a perfectly fine meta-preference that most humans have, and I think all humans have to some degree, and I have no way of capturing it within the system, because it's basically a Gödel statement in a sense: the best synthesis process is the one that wasn't used. The other thing is that people want to continue with moral learning and moral improvement. I've tried to decompose moral learning and moral improvement into different things and show that some forms of them will continue even when you have a fully synthesized utility function, but I know this doesn't capture everything of what people mean by this, and I think it doesn't even capture everything of what I would mean by this. So again, there is a large hole there.

There are some other holes of a more technical nature, like infinite utilities, stability of values, and a bunch of other things, but conceptually I'm most worried about these two aspects: the fact that you would reject whatever values you were assigned, and the fact that you'd want to continue to improve, along with how we define continuing improvement that isn't just the same as letting your values drift randomly.

Lucas: What are your thoughts here? Feel free to expand on both the practical and theoretical difficulties of applying this across humanity and aggregating it into a single, species-wide utility function.

Stuart: Well, the practical difficulties are basically politics: how to get agreement between different groups. Some people might want to hang onto their assets or their advantages; other people might want stronger equality; everyone will have broad principles to appeal to. Basically, there's going to be a lot of fighting over the different weightings of individual utilities. The hope there is that, especially with a powerful AI, the advantage might be sufficiently high that it's easier to do something where everybody gains, even if the gains are uneven, than to argue about how to divide a fixed-size pie. The theoretical issue is mainly what we do with anti-altruistic preferences. I'm not talking about selfish preferences; those are very easy to deal with, as that's just competition for the utility, for the resources, for the goodness. I mean actual anti-altruistic preferences: someone who wants harm to befall other people. We also have to deal with altruistic preferences, because you shouldn't penalize people for having altruistic preferences.

You should, in a sense, take out the altruistic preferences, put those into the humanity-wide one, and give people's own personal preferences some extra weight. But anti-altruistic preferences are a challenge, especially because it's not quite clear where the edge is. If you want someone to suffer, that's an anti-altruistic preference. If you want to win a game, and part of your enjoyment of the game is that other people lose, where exactly does that lie? That's a very natural preference; you might become a very different person if you didn't get at least some mild enjoyment from other people losing, or from the status boost, so that bit is tricky. You might just tone these preferences down, so that mild anti-altruistic preferences are perfectly fine: if you want someone to lose to your brilliant strategy at chess, that's fine, but if you want someone to be dropped slowly into a pit of boiling acid, that's not.

The other big question is population ethics. How do we deal with new entities and how do we deal with other conscious or not quite conscious animals around the world, so who gets to count as a part of the global utility function?

Lucas: So, I'm curious to know about concerns over aspects of this alignment story, or any alignment story, involving lots of leaky abstractions. In Rich Sutton's short essay The Bitter Lesson, he discusses how the bitter lesson of AI research is that leveraging computation over human domain-specific ingenuity has broadly been more effective at producing powerful results. We seem to have this tendency or partiality towards trying to imbue human wisdom or knowledge or unique techniques or domain-specific insight into architecting the algorithm and alignment process in specific ways, whereas maybe just throwing tons of computation at the thing has historically been more productive. Do you have any response here to concerns about these concepts being leaky abstractions, or the categories you use to break down human preferences not fully capturing what our preferences are?

Stuart: Well, in a sense that's part of the research project, and part of the reason why I warned against going to distant worlds where, in my phrasing, the web of connotations breaks down, or in your phrasing, the abstractions become too leaky. This is also part of why, even though the second part is written as if it were the theoretical way of doing it, I also think there should be a lot of experimental work to test where this is going, where it goes surprisingly wrong or surprisingly right. The second part, though it's presented as basically the algorithm, should be tested and checked and played around with to see how it goes. For The Bitter Lesson, the difference here, I think, is that in the case of The Bitter Lesson we know what we're trying to do.

We have objectives, whether it's winning at a game, classifying images successfully, or classifying some other feature successfully; we have some criteria for success. The constraints I'm putting in by hand are not so much trying to put in the wisdom of the human, or the wisdom of the Stuart. There's some of that, but it's mainly to try to avoid disasters, and the disasters cannot be avoided just with more data. You can get to many different points from the data, and I'm trying to carve away lots of them: don't oversimplify, for example. So, to go back to The Bitter Lesson, you could say that you can tune your regularizer, and what I'm saying is, have a very weak regularizer, for example. This is not something The Bitter Lesson applies to, because in the real world, on the problems where The Bitter Lesson applies, you can see whether hand-tuning the regularizer works, because you can check the outcome and compare it with what you want.

Since you can't compare it with what you want here, because if we knew what we wanted we'd pretty much have the problem solved, what I'm saying is: don't put in a strong regularizer, for these reasons. The data can't tell me that I need a stronger regularizer, because the data has no opinion, if you want, on that; there is no ideal outcome to compare with. There might be problems, but problems like our preferences not looking like my logic, or like our logic, point towards the method failing, not towards the method needing more data and fewer restrictions.
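
To make the regularizer point concrete with a generic machine-learning sketch (this is just standard ridge regression with made-up names, not anything from the agenda): the strength parameter below is normally tuned by checking predictions against a known target, which is exactly the feedback loop Stuart says is missing when the target is human values themselves.

```python
import numpy as np

def fit_preference_model(features, judgments, reg_strength):
    """Ridge-regression fit of a linear model to noisy preference judgments.

    reg_strength is the regularizer: larger values push the learned weights
    toward zero, i.e. toward a simpler model. In ordinary ML you would tune it
    by validating predictions against ground truth; here there is no ground
    truth utility function to validate against, hence the argument for
    keeping it weak by design."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(judgments, dtype=float)
    d = X.shape[1]
    # Closed-form ridge solution: w = (X^T X + lambda * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + reg_strength * np.eye(d), X.T @ y)
```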

Lucas: I mean I’m sure part of this research agenda is also further clarification and refinement of the taxonomy and categories used, which could potentially be elucidated by progress in neuroscience.

Stuart: Yes, and there’s a reason that this is version 0.9 and not yet version 1. I’m getting a lot of feedback and going to refine it before trying to put it out as version 1. It’s in alpha or in beta at the moment. It’s a prerelease agenda.

Lucas: Well, so hopefully this podcast will spark a lot more interest and knowledge about this research agenda and so hopefully we can further contribute to bettering it.

Stuart: When I say that this is in alpha or in beta, that doesn’t mean don’t criticize it, do criticize it and especially if these can lead to improvements but don’t just assume that this is fully set in stone yet.

Lucas: Right, so that's framing this whole conversation in the light of epistemic humility and willingness to change. So, two more questions here and then we'll wrap up. On reflective equilibrium: you say that this process is not a philosophical ideal. Can you expand on your thoughts about reflective equilibrium and how this process is not a philosophical ideal?

Stuart: Reflective equilibrium is basically where you refine your own preferences, make them more consistent, and apply them to yourself until you've reached a point where your meta-preferences and your preferences are all smoothly aligned with each other. What I'm doing is a much messier synthesis process, and I'm doing it in order to preserve as much as possible of the actual human preferences. It is very easy to reach reflective equilibrium by, for instance, having completely flat preferences or very simple preferences; these tend to be in reflective equilibrium with themselves. Pushing towards that is, in my view, a push towards excessive simplicity and a great risk of losing valuable preferences, and the risk of losing valuable preferences seems to me a much higher cost than the gain in simplicity or elegance you might get. There is no reason that the kludgey human brain and its mess of preferences should lead to some simple reflective equilibrium.

In fact, you could say that this is an argument against reflective equilibrium, because it means that many different starting points, many different minds with very different preferences, will lead to similar outcomes, which basically means that you're throwing away a lot of the details of your input data.

Lucas: So, I guess two things. One is that this process clarifies and improves on incorrect beliefs in the person, but it does not correct what you or I might call moral wrongness; so if some human is evil, then the synthesized utility function will reflect that evilness. My second question is: an idealization process is very alluring to me. Is it possible to synthesize the human utility function, run it internally on the AI, see what we get in the end, and then check whether that's a good thing or not?

Stuart: Yes. In practice, this whole thing, if it works, is going to be very experimental, and we're going to be checking the outcomes, and there's nothing wrong with wanting to be an idealized version of yourself, especially if it's just one round of idealization. What worries me is the version where you are the idealized version of the idealized version of the idealized version of the idealized version, et cetera, of yourself, where there is a great risk of losing yourself and the inputs. This is where I had the example of an idealization process that starts off wanting to be more compassionate and spreads its compassion to more and more things at each step, eventually coming to value insects as much as humans, then at the next step valuing rocks as much as humans, and then removing humans because of the damage they can do to mountains. That, or something along those lines, is what I can see happening if you are constantly idealizing yourself without any criterion for "stop idealizing now" or "you've gone too far from where you started."

Your ideal self is pretty close to yourself. The triple-idealized version of your idealized, idealized self, and so on, starts becoming pretty far from your starting point, and these are the sorts of areas where I fear over-simplicity, or getting to reflective equilibrium at the expense of other qualities; these are the places I fear such a process pushes towards.

Lucas: Can you make clearer what, in our view, failed in that idealization process where Mahatma Armstrong turns into a complete negative utilitarian?

Stuart: It didn't even turn into a negative utilitarian; it just turned into someone who valued rocks as much as they valued humans and therefore eliminated humans on utilitarian grounds in order to preserve rocks, or to preserve insects if you want to go down one level of credibility. The point is that this was the outcome of someone who wants to be more compassionate, continuously wanting to make more compassionate versions of themselves that still want to be more compassionate, and so on. It went too far from where it had started. It's one of many possible narratives, but the point is that the only way of resisting something like that is to tie the higher levels to the starting point. A better approach might say: I want to be what my current self would think is good, and what my idealized self would think is good, and what the idealized idealized self would think is good, and so on. That kind of thing could work, but just idealizing without ever tying it back to the starting point, to what compassion meant for the first entity rather than what it means for the nth entity, is the problem I see here.

Lucas: If I think about all possible versions of myself across time and I just happen to be one of them, this just seems to be a meta-preference to bias towards the one that I happen to be at this moment, right?

Stuart: We have to make a decision as to which preferences to take, and we may as well take the ones we have now, because if we try to take our future preferences into account, we start to come a cropper with the manipulable aspect of our preferences: the fact that these could be literally anything. There is a future Stuart who is probably a Nazi, because you can apply a certain amount of pressure to transform my preferences, and I would not want to endorse his preferences now. There are future Stuarts who are saints, whose preferences I might endorse. So if we're deciding which future preferences to accept, we have to decide according to criteria, and criteria that are at least in part what we have now.

We could defer to our expected future selves to some extent: if we say "I expect a reasonable experience of the future," define what reasonable means, and then average our current preferences with our reasonable future preferences, then yes, we can do this. That is also a way of doing things, and if we do it this way, it will most likely be non-disastrous. If doing the synthesis process with our current preferences is non-disastrous, then doing it with the average of our future reasonable preferences is also going to be non-disastrous. This is one of the choices you could put into the process.

Lucas: Right, so we can be mindful here that we’ll have lots of meta-preferences about the synthesis process itself.

Stuart: Yes, you can put it in as a meta-preference, or you can put it explicitly into the process if that's the way you would prefer to do it. The whole process is designed strongly around getting an answer out, so: yes, we could do this. Let's see if we can do it for one person over a short period of time, and then we can talk about how we might take into account considerations like that, including, as I say, in the meta-preferences themselves. This is basically another version of moral learning. We're kind of okay with our values shifting, but not okay with our values shifting arbitrarily. We really don't want our values to completely flip from what we have now, though there are some aspects we're more okay with changing. This is part of the complicated question of how you do moral learning.

Lucas: All right, beautiful, Stuart. Contemplating all this is really quite fascinating, and I think that, in general, humanity has a ton more thinking and self-reflection to do in order to get this process right. This conversation has really helped elucidate that for me. Reflecting on all of my contradictory preferences and my multitudes, within the context of my partial and sometimes erroneous mental models, also has me feeling maybe slightly depersonalized and a bit ontologically empty, but it's beautiful and fascinating. Do you have anything you would like to make clear to the AI alignment community about this research agenda? Any last words you would like to say or points to clarify?

Stuart: There are people who disagree with this research agenda, some of them quite strongly, and some of them have alternative approaches. I like the fact that they are researching other alternatives. If they disagree with the agenda and want to engage with it, the best engagement I could see is pointing out why bits of the agenda are unnecessary, or how alternative solutions could work. You could also point out that maybe it's impossible to do it this way, which would also be useful. But if you think you have a solution, or the sketch of a solution, then pointing out which bits of the agenda you solve by other means would be a very valuable exercise.

Lucas: In terms of engagement, do you prefer people writing responses on the AI Alignment Forum or LessWrong?

Stuart: Emailing me is also fine. I will eventually answer every non-crazy email.

Lucas: Okay, wonderful. I really appreciate all of your work here on this research agenda and all of your writing and thinking in general. You’re helping to create beautiful futures with AI and you’re much appreciated for that.

If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

End of recorded material

AI Alignment Podcast: China’s AI Superpower Dream with Jeffrey Ding

“In July 2017, The State Council of China released the New Generation Artificial Intelligence Development Plan. This policy outlines China’s strategy to build a domestic AI industry worth nearly US$150 billion in the next few years and to become the leading AI power by 2030. This officially marked the development of the AI sector as a national priority and it was included in President Xi Jinping’s grand vision for China.” (FLI’s AI Policy – China page) In the context of these developments and an increase in conversations regarding AI and China, Lucas spoke with Jeffrey Ding from the Center for the Governance of AI (GovAI). Jeffrey is the China lead for GovAI where he researches China’s AI development and strategy, as well as China’s approach to strategic technologies more generally. 

Topics discussed in this episode include:

  • China’s historical relationships with technology development
  • China’s AI goals and some recently released principles
  • Jeffrey Ding’s work, Deciphering China’s AI Dream
  • The central drivers of AI and the resulting Chinese AI strategy
  • Chinese AI capabilities
  • AGI and superintelligence awareness and thinking in China
  • Dispelling AI myths, promoting appropriate memes
  • What healthy competition between the US and China might look like

You can take a short (3 minute) survey to share your feedback about the podcast here.

 

Key points from Jeffrey: 

  • “Even if you don’t think Chinese AI capabilities are as strong as have been hyped up in the media and elsewhere, important actors will treat China as either a bogeyman figure or as a Sputnik type of wake-up call motivator… other key actors will leverage that as a narrative, as a Sputnik moment of sorts to justify whatever policies they want to do. So we want to understand what’s happening and how the conversation around what’s happening in China’s AI development is unfolding.”
  • "There certainly are differences, but we don't want to exaggerate them. I think oftentimes analysis of China happens in a vacuum where it's like, 'Oh, this only happens in this mysterious far-off land we call China and it doesn't happen anywhere else.' Shoshana Zuboff has this great book on Surveillance Capitalism that shows how the violation of privacy is pretty extensive on the US side, not only from big companies but also from the national security apparatus. So I think a similar phenomenon is taking place with the social credit system. Jeremy Daum at Yale Law School's China Center has put it really nicely where he says that, 'We often project our worst fears about technology and AI onto what's happening in China, and we look through a glass darkly and we unleash all of our anxieties on what's happening onto China without reflecting on what's happening here in the US, what's happening here in the UK.'"
  • "I think we have to be careful about which historical analogies and memes we choose. So 'arms race' is a very specific callback to the Cold War context, where there are almost these discrete types of missiles that we are racing the Soviet Union on and discrete applications that we can count up; or even going way back to what some scholars call the first industrial arms race in the military sphere over steam-powered boats between Britain and France in the late 19th century. And all of those instances you can count up. France has four ironclads, the UK has four ironclads; they're racing to see who can build more. I don't think there's anything like that. There's not this discrete thing that we're racing to see who can have more of. If anything, it's about a competition to see who can absorb AI advances from abroad better, who can diffuse them throughout the economy, who can adopt them in a more sustainable way without sacrificing core values. So that's sort of one meme that I really want to dispel. Related to that, an assumption that often influences a lot of our discourse on this is the techno-nationalist assumption, which is this idea that technology is contained within national boundaries and that the nation state is the most important actor –– which is correct and a good one to have in a lot of instances. But there are also good reasons to adopt techno-globalist assumptions as well, especially in the area of how fast technologies diffuse nowadays and also how much, underneath this national level competition, firms from different countries are working together and make standards alliances with each other. So there's this undercurrent of techno-globalism, where there are people flows, idea flows, company flows happening while the coverage and the sexy topic is always going to be about national level competition, zero-sum competition, relative-gains rhetoric. So you're trying to find a balance between those two streams."
  • "I think currently a lot of people in the US are locked into this mindset that the only two players that exist in the world are the US and China. And if you look at our conversation, right, oftentimes I've displayed that bias as well. We should probably have talked a lot more about China-EU or China-Japan cooperation in this space and networks in this space because there's a lot happening there too. So a lot of US policy makers see this as a two-player game between the US and China. And then in that sense, if there's some cancer research project about discovering proteins using AI that may benefit China by 10 points and benefit the US only by eight points, but it's going to save a lot of people from cancer –– if you only care about making everything about maintaining a lead over China, then you might not take that deal. But if you think about it from the broader landscape of it's not just a zero-sum competition between the US and China, then your evaluation of those different point structures and what you think is rational will change."

 

Important timestamps: 

0:00 Intro

2:14 Motivations for the conversation

5:44 Historical background on China and AI 

8:13 AI principles in China and the US 

16:20 Jeffrey Ding’s work, Deciphering China’s AI Dream 

21:55 Does China’s government play a central hand in setting regulations? 

23:25 Can Chinese implementation of regulations and standards move faster than in the US? Is China buying shares in companies to have decision making power? 

27:05 The components and drivers of AI in China and how they affect Chinese AI strategy 

35:30 Chinese government guidance funds for AI development 

37:30 Analyzing China’s AI capabilities 

44:20 Implications for the future of AI and AI strategy given the current state of the world 

49:30 How important are AGI and superintelligence concerns in China?

52:30 Are there explicit technical AI research programs in China for AGI? 

53:40 Dispelling AI myths and promoting appropriate memes

56:10 Relative and absolute gains in international politics 

59:11 On Peter Thiel’s recent comments on superintelligence, AI, and China 

1:04:10 Major updates and changes since Jeffrey wrote Deciphering China’s AI Dream 

1:05:50 What does healthy competition between China and the US look like? 

1:11:05 Where to follow Jeffrey and read more of his work

 

Works referenced 

Deciphering China’s AI Dream

FLI AI Policy – China page

ChinAI Newsletter

Jeff’s Twitter

Previous podcast with Jeffrey

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. More works from GovAI can be found here.

 

Lucas Perry: Hello everyone and welcome back to the AI Alignment Podcast at The Future of Life Institute. I'm Lucas Perry and today we'll be speaking with Jeffrey Ding from The Future of Humanity Institute on China and its efforts to become the leading AI superpower by 2030. In this podcast, we provide a largely descriptive account of China's historical technological efforts, its current intentions and methods for pushing Chinese AI success, and some of the foundational AI principles being called for within China; we cover the drivers of AI progress, the components of success, and China's strategies born of these variables; we also assess China's current and likely future AI capabilities, and the consequences of all this tied together. The FLI AI Policy – China page and Jeffrey Ding's publication Deciphering China's AI Dream are large drivers of this conversation, and I recommend you check them out.

If you find this podcast interesting or useful, consider sharing it with friends on social media platforms, forums, or anywhere you think it might be found valuable. As always, you can provide feedback for me by following the SurveyMonkey link found in the description of wherever you might find this podcast. 

Jeffrey Ding specializes in AI strategy and China's approach to strategic technologies more generally. He is the China lead for the Center for the Governance of AI. There, Jeff researches China's development of AI, and his work has been cited in the Washington Post, South China Morning Post, MIT Technology Review, Bloomberg News, Quartz, and other outlets. He is a fluent Mandarin speaker and has worked at the US Department of State and the Hong Kong Legislative Council. He is also reading for a PhD in international relations as a Rhodes scholar at the University of Oxford. And so without further ado, let's jump into our conversation with Jeffrey Ding.

Let’s go ahead and start off by providing a bit of the motivations for this conversation today. So why is it that China is important for AI alignment? Why should we be having this conversation? Why are people worried about the US-China AI Dynamic?

Jeffrey Ding: Two main reasons, and I think they follow an "even if" structure. The first reason is that China is probably second only to the US in terms of a comprehensive national AI capabilities measurement. That's a very hard and abstract thing to measure. But if you're looking at which countries have the firms on the leading edge of the technology, the universities, the research labs, and then the scale to lead in industrial terms and also in potential investment in projects related to artificial general intelligence, I would put China second only to the US, at least in terms of my intuition and the analysis that I've done on the subject.

The second reason is that even if you don't think Chinese AI capabilities are as strong as have been hyped up in the media and elsewhere, important actors will treat China as either a bogeyman figure or as a Sputnik type of wake-up call motivator. And you can see this in the rhetoric coming from the US especially today, and even in areas that aren't necessarily connected. So Axios had a leaked memo from the US National Security Council that was talking about centralizing US telecommunication services to prepare for 5G. And in the memo, one of the justifications for this was that China is leading in AI advances. The memo doesn't really tie the two together. There are connections –– 5G may empower different AI technologies –– but that's a clear example of how even if Chinese capabilities in AI, especially in projects related to AGI, are not as substantial as has been reported, or as we think, other key actors will leverage that as a narrative, as a Sputnik moment of sorts to justify whatever policies they want to do. So we want to understand what's happening and how the conversation around what's happening in China's AI development is unfolding.

Lucas Perry: So the first aspect being that they’re basically the second most powerful AI developer. And we can get into later their relative strength to the US; I think that in your estimation, they have about half as much AI capability relative to the United States. And here, the second one is you’re saying –– and there’s this common meme in AI Alignment about how avoiding races is important because in races, actors have incentives to cut corners in order to gain decisive strategic advantage by being the first to deploy advanced forms of artificial intelligence –– so there’s this important need, you’re saying, for actually understanding the relationship and state of Chinese AI Development to dispel inflammatory race narratives?

Jeffrey Ding: Yeah, I would say China’s probably at the center of most race narratives when we talk about AI arms races and the conversation in at least US policy-making circles –– which is what I follow most, US national security circles –– has not talked necessarily about AI as a decisive strategic advantage in terms of artificial general intelligence, but definitely in terms of decisive strategic advantage and who has more productive power, military power. So yeah, I would agree with that.

Lucas Perry: All right, so let's provide a little bit more historical background here, I think, to sort of contextualize why there's this rising conversation about the role of China in the AI space. So I'm taking this here from the FLI AI Policy China page: "In July of 2017, the State Council of China released the New Generation Artificial Intelligence Development Plan. And this was an AI research strategy policy to build a domestic AI industry worth nearly $150 billion in the next few years" –– again, this was in 2017 –– "and to become a leading AI power by 2030. This officially marked the development of the AI sector as a national priority, and it was included in President Xi Jinping's grand vision for China." And just adding a little bit more color here: "given this, the government expects its companies and research facilities to be at the same level as leading countries like the United States by 2020." So within a year from now –– maybe a bit ambitious, given your estimation that they have about half as much capability as us.

But continuing this picture I’m painting: “five years later, it calls for breakthroughs in select disciplines within AI” –– so that would be by 2025. “That will become a key impetus for economic transformation. And then in the final stage, by 2030, China is intending to become the world’s premier artificial intelligence innovation center, which will in turn foster a new national leadership and establish the key fundamentals for an economic great power,” in their words. So there’s this very clear, intentional stance that China has been developing in the past few years.

Jeffrey Ding: Yeah, definitely. And I think it was Jess Newman who put together the AI policy in China page –– did a great job. It’s a good summary of this New Generation AI Development Plan issued in July 2017 and I would say the plan was more reflective of momentum that was already happening at the local level with companies like Baidu, Tencent, Alibaba, making the shift to focus on AI as a core part of their business strategy. Shenzhen, other cities, had already set up their own local funds and plans, and this was an instance of the Chinese national government, in the words of I think Paul Triolo and some other folks at New America, “riding the wave,” and kind of joining this wave of AI development.

Lucas Perry: And so adding a bit more color here again: there have also been developments in principles being espoused in this context. I'd say probably the first major principles on AI were developed at the Asilomar Conference, at least those pertaining to AGI. In June 2019, the New Generation of AI Governance Expert Committee released principles for next-generation artificial intelligence governance, which included tenets like harmony and friendliness, fairness and justice, inclusiveness and sharing, open cooperation, shared responsibility, and agile governance.

And then also in May of 2019 the Beijing AI Principles were released. That was by a multi-stakeholder coalition, including the Beijing Academy of Artificial Intelligence, a bunch of top universities in China, as well as industrial firms such as Baidu, Alibaba, and Tencent. And these 15 principles, among other things, called for "the construction of a human community with a shared future and the realization of beneficial AI for humankind and nature." So it seems like principles and intentions are also being developed similarly in China that sort of echo and reflect many of the principles and intentions that have been developing in the States.

Jeffrey Ding: Yeah, I think there’s definitely a lot of similarities, and I think it’s not just with this recent flurry of AI ethics documents that you’ve done a good job of summarizing. It dates back to even the plan that we were just talking about. If you read the July 2017 New Generation AI Plan carefully, there’s a lot of sections devoted to AI ethics, including some sections that are worried about human robot alienation.

So, depending on how you read that, you could read that as already anticipating some of the issues that could occur if human goals and AI goals do not align. Even back in March, I believe, of 2018, a lot of government bodies came together with companies to put out a white paper on AI standardization, which I translated for New America. And in that, they talk about AI safety and security issues, how it’s important to ensure that the design goals of AI are consistent with the interests, ethics, and morals of most humans. So a lot of these topics, I don’t even know if they’re western topics. These are just basic concepts: We want systems to be controllable and reliable. And yes, those have deeper meanings in the sense of AGI, but that doesn’t mean that some of these initial core values can’t be really easily applied to some of these deeper meanings that we talk about when we talk about AGI ethics.

Lucas Perry: So with all of the animosity and posturing and whatever that happens between the United States and China, these sort of principles and intentions which are being developed, at least in terms of AI –– both of them sort of have international intentions for the common good of humanity; At least that’s what is being stated in these documents. How do you think about the reality of the day-to-day combativeness and competition between the US and China in relation to these principles which strive towards the deployment of AI for the common good of humanity more broadly, rather than just within the context of one country?

Jeffrey Ding: It's a really good question. I think the first point to clarify is that these statements don't have teeth behind them unless they're enforced, unless there are resources dedicated to funding research on these issues, to track 1.5 and track 2 diplomacy, to technical meetings between researchers. These are just statements that people can put out, and they don't have teeth unless they're actually enforced. Oftentimes, we know that's the case: firms like Google, Microsoft, and Amazon will put out principles about facial recognition or what their ethical stances are, but behind the scenes they'll chase profit motives and maximize shareholder value. And I would say the same would take place for Tencent, Baidu, Alibaba. So I want to clarify that, first of all. The competitive dynamics are real: it's partly not just an AI story, it's a broader story of China's rise. I come from an international relations background, so I'm a PhD student at Oxford studying that, and there's a big debate in the literature about what happens when a rising power challenges an established power. Oftentimes frictions result, and it's about how to manage these frictions without leading to accidents, miscalculation, arms races. And that's the tough part of it.

Lucas Perry: So it seems –– at least for a baseline, given that we're still pretty early in the process of AI alignment or this long-term vision we have –– that there are, theoretically at least, some shared foundational principles reflected across both cultures. Again, these Beijing AI Principles also include a focus on benefiting all of humanity and the environment; serving human values such as privacy, dignity, freedom, autonomy, and rights; a continuous focus on AI safety and security; inclusivity and openness; supporting international cooperation; and avoiding a malicious AI race. So the question now simply seems to be one of implementing these shared principles and ensuring that they manifest.

Jeffrey Ding: Yeah. I don’t mean to be dismissive of these efforts to create principles that were at least expressing the rhetoric of planning for all of humanity. I think there’s definitely a lot of areas of US-China cooperation in the past that have also echoed some of these principles: bi-lateral cooperation on climate change research; there’s a good nuclear safety cooperation module; different centers that we’ve worked on. But at the same time, I also think that even with that list of terms you just mentioned, there are some differences in terms of how both sides understand different terms.

So with privacy in the Chinese context, it’s not necessarily that Chinese people or political actors don’t care about privacy. It’s that privacy might mean more of privacy as an instrumental right, to ensure your financial data doesn’t get leaked, you don’t lose all your money; to ensure that your consumer data is protected from companies; but not necessarily in other contexts where privacy is seen as an intrinsic right, as a civil right of sorts, where it’s also about an individual’s protection from government surveillance. That type of protection is not caught up in conversations about privacy in China as much.

Lucas Perry: Right, so there are going to be implicitly different understandings of some of these principles that we'll have to navigate. And again, you brought up privacy, and this has been something people have been paying more attention to, as there has been kind of a hype and maybe a little bit of hysteria over China's social credit system, and plenty of misunderstanding around it.

Jeffrey Ding: Yeah, and this ties into a lot of what I’ve been thinking about lately, which is there certainly are differences, but we don’t want to exaggerate them. I think oftentimes analysis of China happens in a vacuum where it’s like, “Oh, this only happens in this mysterious far off land we call China and it doesn’t happen anywhere else.” Shoshana Zuboff has this great book on surveillance capitalism that shows how the violation of privacy is pretty extensive on the US side, not only from big companies but also from the national security apparatus.

So I think a similar phenomenon is taking place with the social credit system. Jeremy Daum at Yale Law School's China Center has put it really nicely where he says that, "We often project our worst fears about technology and AI onto what's happening in China, and we look through a glass darkly and we unleash all of our anxieties on what's happening onto China without reflecting on what's happening here in the US, what's happening here in the UK."

Lucas Perry: Right. I would guess that generally in human psychology it seems easier to see the evil in the other rather than in the self.

Jeffrey Ding: Yeah, that’s a little bit out of range for me, but I’m sure there’s studies on that.

Lucas Perry: Yeah. All right, so let's get into your work on Deciphering China's AI Dream. This is a work that you published in 2018, and you divided it into four sections: first you discuss context, then components, then capabilities, and then consequences, all in relation to AI in China. Would you like to unpack that structure?

Jeffrey Ding: Yeah, this was very much just a descriptive paper. I was just starting out researching this area and I just had a bunch of basic questions. So question number one for context: what is the background behind China’s AI Strategy? How does it compare to other countries’ plans? How does it compare to its own past science and technology plans? The second question was, what are they doing in terms of pushing forward drivers of AI Development? So that’s the component section. The third question is, how well are they doing? It’s about assessing China’s AI capabilities. And then the fourth is, so what’s it all mean? Why does it matter? And that’s where I talk about the consequences and the potential implications of China’s AI ambitions for issues related to AI Safety, some of the AGI issues we’ve been talking about, national security, economic development, and social governance.

Lucas Perry: So let’s go ahead and move sequentially through these. We’ve already here discussed a bit of context about what’s going on in China in terms of at least the intentional stance and the development of some principles. Are there any other key facets or areas here that you’d like to add about China’s AI strategy in terms of its past science and technology? Just to paint a picture for our listeners.

Jeffrey Ding: Yeah, definitely. I think two past critical technologies you could look at are the plans to grow China's space and aerospace sector, and then also biotechnology. In each of these other areas there was also a national-level strategic plan; an agency or an office was set up to manage this national plan; substantial funding was dedicated. With the New Generation AI Plan, there was likewise an implementation office set up across a bunch of the different departments tasked with implementing the plan.

AI was also elevated to the level of a national strategic technology. And so what’s different between these two phases? Because it’s debatable how successful the space plan and the biotech plans have been. What’s different with AI is you already had big tech giants who are pursuing AI capabilities and have the resources to shift a lot of their investments toward the AI space, independent of government funding mechanisms: companies like Baidu, Tencent, Alibaba, even startups that have really risen like SenseTime. And you see that reflected in the type of model.

It's no longer the traditional national champion model where the government almost builds a company from the ground up, maybe with the help of international financiers and investors. Now it's a national team model where they ask for the support of these leading tech giants, but it's not like these tech giants are reliant on the government for subsidies or funding to survive. They are already flourishing firms that have an international presence.

The other bit of context I would add is that if you look at the New Generation Plan, there are a lot of terms related to manufacturing, and I mentioned in Deciphering China's AI Dream how there are a lot of connections and callbacks to manufacturing plans. I think this is key because it's one aspect of China's push for AI: they want to escape the middle-income trap and get to those higher levels of value-add in the manufacturing chain. So I want to stress that as a key point of context.

Lucas Perry: So the framing here is the Chinese government is trying to enable companies which already exist and already are successful. And this stands in contrast to the US and the UK where it seems like the government isn’t even part of a teamwork effort.

Jeffrey Ding: Yeah. So maybe a good comparison would be how technical standards develop, which is an emphasis of not only this Deciphering China's AI Dream paper but a lot of later work. So I'm talking about technical standards, like how you measure the accuracy of facial recognition systems and who gets to set those measures, or product safety standards for different AI applications. In many other countries, including the US, the process for that is much more decentralized. It's largely done through industry alliances. There is NIST, a body under the Department of Commerce in the US that helps coordinate that to some extent, but not nearly as much as what happens in China with the Standardization Administration of China (SAC), I believe. There, it's much more of a centralized effort to create technical standards. And there are pros and cons to both.

With the more decentralized approach, you minimize the risks of technological lock-in from setting standards too early, and you let firms have a little bit more freedom, and competition as well. Whereas having a more centralized, top-down effort might lead to earlier harmonization on standards and let you leverage economies of scale when you have more interoperable protocols. That could help with data sharing, and help with creating a stable test bed for different firms to compete on and measure the stuff I was talking about earlier, like algorithmic accuracy. So there are pros and cons to the two different approaches. But I think, yeah, that does flesh out how the relationship between firms and the government differs a little bit, at least in the context of standards setting.

Lucas Perry: So on top of standards setting, would you say China’s government plays more of a central hand in the regulation as well?

Jeffrey Ding: That’s a good question. It probably differs in terms of what area of regulation. So I think in some cases there’s a willingness to let companies experiment and then put down regulations afterward. So this is the classic example with mobile payments: There was definitely a gray space as to how these platforms like Alipay, WeChat Pay were essentially pushing into a gray area of law in terms of who could handle this much money that’s traditionally in the hands of the banks. Instead of clamping down on it right away, the Chinese government kind of let that play itself out, and then once these mobile pay platforms got big enough that they’re holding so much capital and have so much influence on the monetary stock, they then started drafting regulations for them to be almost treated as banks. So that’s an example of where it’s more of a hands-off approach.

In AI, folks have said that the US and China are probably closer in terms of their approach to regulation, which is much more hands-off than the EU. And I think that’s just a product partly of the structural differences in the AI ecosystem. The EU has very few big internet giants and AI algorithm firms, so they have more of an incentive to regulate other countries’ big tech giants and AI firms.

Lucas Perry: So two questions are coming up. One is: is there sufficiently more unity and coordination in the Chinese government such that when standards, regulations, or decisions surrounding AI need to be implemented, they're able to move, say, much quicker than the United States government? And the second thing was, I believe you mentioned that the Chinese government is also trying to find ways of using government money to buy up shares in these companies and gain decision-making power.

Jeffrey Ding: Yeah, I’ll start with the latter. The reference is to the establishment of special management shares: so these would be almost symbolic, less than 1% shares in a company so that they could maybe get a seat on the board –– or another vehicle is through the establishment of party committees within companies, so there’s always a tie to party leadership. I don’t have that much more insight into how these work. I think probably it’s fair to say that the day-to-day and long-term planning decisions of a lot of these companies are mostly just driven by what their leadership wants, not necessarily what the party leaders want, because it’s just very hard to micromanage these billion dollar giants.

And that was part of a lot of what was happening with the reform of the state-owned enterprise sector, where –– I think it was SASAC; there are a lot of acronyms –– but this was the body in control of state-owned enterprises, and they significantly cut down the number of enterprises that they directly oversee and sort of focused on the big ones, like the big banks or the big oil companies.

To your first point on how smooth policy enforcement is, this is not something I’ve studied that carefully. I think to some extent there’s more variability in terms of what the government does. So I read somewhere that if you look at the government relations departments of Chinese big tech companies versus US big tech companies, there’s just a lot more on the Chinese side –– although that might be changing with recent developments in the US. Two cases I’m thinking of right now: one is the Chinese government worrying about addictive games and then issuing a ban against some games, including Tencent’s PUBG, which wrecked Tencent’s game revenues and really hurt their stock value.

So that’s something that would be very hard for the US government to be like, “Hey, this game is banned.” At the same time, there’s a lot of messiness with this, which is why I’m pontificating and equivocating and not really giving you a stable answer: because local governments don’t implement things that well. There’s a lot of local-center tension. And especially with technical stuff –– this is the case in the US as well –– there’s just not as much technical talent in the government. So with a lot of these technical privacy issues, it’s very hard to develop good regulations if you don’t actually understand the tech. So what they’ve been trying to do is audit the privacy policies of different social media and tech companies, and they started with 10 of the biggest. So I think it’s very much a developing process in both China and the US.

Lucas Perry: So you’re saying that the Chinese government, like the US, lacks much scientific or technical expertise? I had some sort of idea in my head that many of the Chinese mayors or other political figures actually have engineering degrees or degrees in science.

Jeffrey Ding: That’s definitely true. But by technical expertise I mean something like what the US government did with its digital service corps, where they’re getting people who have worked at leading-edge tech firms to then work for the government. That type of stuff would be useful in China.

Lucas Perry: So let’s move on to the second part, discussing components. Here you relate the key features of China’s AI strategy to the drivers of AI development, and the drivers of AI development, you say, are hardware, in the form of chips for training and executing AI algorithms; data, as an input for AI algorithms; research and algorithm development –– so actual AI researchers working on the architectures and systems through which the data will be put; and then the commercial AI ecosystems, which I suppose support and feed these first three things. What can you say about the state of these components in China and how it affects China’s AI strategy?

Jeffrey Ding: I think the main thing I want to emphasize here is that a lot of this is the Chinese government trying to fill in some of the gaps; a lot of this is about enabling the people and firms that are already doing the work. One of the gaps is that private firms tend to under-invest in basic research, or will under-invest in broader education, because they don’t get to capture all those gains. So the government tries to support not only AI as a national-level discipline but also to construct AI institutes and help fund talent programs to bring back leading researchers from overseas. So that’s one part of it.

The second part of it, which I did not talk about that much in this section of the report but have recently researched more and more, is that the government is more actively driving things when it is the final end client. So this is definitely the case in the surveillance industry space: provincial-level public security bureaus are working with companies across hardware, data, research and development, and the whole security-systems-integration process to develop more advanced high-tech surveillance systems.

Lucas Perry: Expanding here, there’s also this way of understanding Chinese AI strategy as it relates to previous technologies and how it’s similar or different. Ways in which it’s similar involve a strong degree of state support and intervention, transfer of both technology and talent, and investment in long-term, whole-of-society measures; I’m quoting you here.

Jeffrey Ding: Yeah.

Lucas Perry: Furthermore, you state that China is adopting a catch-up approach in the hardware necessary to train and execute AI algorithms. This points towards an asymmetry: most of the chip manufacturers are not in China, so they have to buy chips from Nvidia. And then you go on to mention how access to large quantities of data is an important driver for AI systems, and that China’s data protectionism favors Chinese AI companies in accessing data from China’s large domestic market, but it also detracts from cross-border pooling of data.

Jeffrey Ding: Yeah, and just to expand on that point, there’s been good research from folks at DigiChina, which is a New America initiative, that looks at the cybersecurity law –– we’re still figuring out how it’s going to be implemented completely, but the original draft would have prevented companies from taking data that was collected inside of China outside of China.

And actually these folks at DigiChina point out how some of the major backlash to this law didn’t just come from US multinational corporations but also from Chinese multinationals. That aspect of data protectionism illustrates a key trade-off: in one sense, countries and national security players are valuing personal data almost as a national security asset, because of the risk of blackmail or something. So this is the whole Grindr case in the US, where I think Grindr was encouraged, or strongly encouraged, by the US government to find a non-Chinese owner. So on one hand you want to protect personal information, but on the other hand, free data flows are critical to spurring gains and innovation as well for some of these larger companies.

Lucas Perry: Is there an interest here to be able to sell their data to other companies abroad? Is that why they’re against this data protectionism in China?

Jeffrey Ding: I don’t know that much about this particular case, but I think Alibaba and Tencent have labs all around the world. So they might want to pool their data together, and they were worried that the cybersecurity law would affect that.

Lucas Perry: And just highlighting here for the listeners that access to large amounts of high quality data is extremely important for efficaciously training models and machine learning systems. Data is a new, very valuable resource. And so you go on here to say, I’m quoting you again, “China’s also actively recruiting and cultivating talented researchers to develop AI algorithms. The State Council’s AI plan outlines a two-pronged gathering and training approach.” This seems to be very important, but it also seems, from your report, that China is largely losing AI talent to America. What can you say about this?

Jeffrey Ding: Often the biggest bottleneck cited to AI development is lack of technical talent. That gap will eventually be filled just through the pure operation of the market, but in the meantime there has been a focus on AI talent, whether that’s through some of these national talent programs, or through things like local governments offering tax breaks for companies that may have headquarters around the world.

For example, Jingchi, which is an autonomous driving startup, had I think its main base, or one of its main bases, in California; but then Shenzhen or Guangzhou (I’m not sure which local government it was) gave them basically free office space to move one of their bases back to China, and that brings a lot of talented people back. And you’re right, a lot of the best and brightest do go to US companies as well, and one of the key channels for recruiting Chinese students is big firms setting up offshore research and development labs, like Microsoft Research Asia in Beijing.

And then the third thing I’ll point out, and this is something I’ve noticed recently when I was doing translations from science and tech media platforms that are looking at the talent space in particular: they’ve pointed out that there’s sometimes a tension between the gathering and the training planks. There have been complaints from domestic Chinese researchers: maybe you have two super talented PhD students; one decides to stay in China, the other decides to go abroad for their post-doc. And oftentimes the talent plans –– the recruiting, gathering plank of this talent policy –– will then favor the person who went abroad for the post-doc experience over the person who stayed in China, and they might be just as good. So that actually creates an incentive for more people to go abroad. There’s been good research showing that a lot of the best and brightest end up staying abroad; the stay rates, especially in the US for Chinese PhD students in computer science fields, are shockingly high.

Lucas Perry: What can you say about Chinese PhD student anxieties with regards to leaving the United States to go visit family in China and come back? I’ve heard that there may be anxieties about not being let back in given that their research has focused on AI and that there’s been increasing US suspicions of spying or whatever.

Jeffrey Ding: I don’t know how much of it is a recent development but I think it’s just when applying for different stages of the path to permanent residency –– whether it’s applying for the H-1B visa or if you’re in the green card pipeline –– I’ve heard just secondhand that they avoid traveling abroad or going back to visit family just to kind of show commitment that they’re residing here in the US. So I don’t know how much of that is recent. My dad actually, he started out as a PhD student in math at University of Iowa before switching to computer science and I remember we had a death in the family and he couldn’t go back because it was so early on in his stay. So I’m sure it’s a conflicted situation for a lot of Chinese international students in the US.

Lucas Perry: So moving along here and ending this components section, you also say –– and this kind of goes back to what we were discussing earlier about government guidance funds –– that the Chinese government is also starting to take a more active role in funding AI ventures, helping to grow the fourth driver of AI development, which again is the commercial AI ecosystems that support and are the context for hardware, data, and research on algorithm development. And so the Chinese government is disbursing funds through what are called Government Guidance Funds, or GGFs, set up by local governments and state-owned companies, and the government has invested more than a billion US dollars in domestic startups. This seems to be in clear contrast with how America functions on this, with much of the investment shifting towards healthcare and AI as the priority areas in the last two years.

Jeffrey Ding: Right, yeah. So the GGFs are an interesting funding vehicle. The China Money Network, which I think has the best English-language coverage of these vehicles, says that they may be history’s greatest experiment in using state capital to reshape a nation’s economy. These are essentially public-private partnerships, PPPs, which do exist across the world, including in the US. And the idea is basically that the state seeds and anchors these investment vehicles, and then they partner with private capital to also invest in startups, companies that the government thinks either are supporting a particular policy initiative or are good for overall development.

A lot of this is hard to decipher in terms of what the impact has been so far, because publicly available information is relatively scarce. I mentioned in my report that these funds haven’t had a successful exit yet, which may just mean they need more time. I think there have also been some complaints that the big VCs –– whether it’s Chinese VCs or even international VCs that have a Chinese arm –– much prefer to just go it on their own rather than be tied to all the strings and potential regulations that come with working with the government. So I think it’s definitely a case of time will tell, and this is also a very fertile research area that I know some people are looking into. So be on the lookout for more conclusive findings about these GGFs, especially how they relate to emerging technologies.

Lucas Perry: All right. So we’re getting to your capabilities section, which assesses the current state of China’s AI capabilities across the four drivers of AI development. Here you construct an AI Potential Index, which is an index of a country’s potential, based on these four drivers, to create successful AI products. So based on your research, you give China an AI Potential Index score of 17, which is about half of the US’s AI Potential Index score of 33. And so you state here that what is essential to draw from this finding is the relative scale, or at least the proportionality, between China and the US. So the conclusion we can try to draw from this is that China trails the US in every driver except for access to data, and that, taken together, China is about half as capable as the US.

Jeffrey Ding: Yes, so the AIPI, the AI Potential Index, was definitely just meant as a first cut at developing a measure with which we can make comparative claims. I think at the time, and even now, we just throw around things like, “Who is ahead in AI?” I was reading this recent Defense One article that was like, “China’s the world leader in GANs,” G-A-Ns, Generative Adversarial Networks. That’s just not even a coherent claim. Are you the leader at developing the talent who is going to make advances in GANs? Are you the leader at applying and deploying GANs in the military field? Are you the leader in producing the most publications related to GANs?

I think that’s what was frustrating me about the conversation and the net assessment of different countries’ AI capabilities, so that’s why I tried to develop a more systematic framework which looked at the different drivers. It was basically looking at what the potential of a country’s AI capabilities is, based on its marks across these drivers.
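Since the index comes up again below, here is a minimal sketch of the kind of driver-weighted scoring the AI Potential Index suggests, as Jeffrey describes it. The equal weights and per-driver scores are made-up placeholders, chosen only so the totals land near the report’s headline figures of roughly 17 for China and 33 for the US; they are not the report’s actual inputs or weighting scheme.

```python
# A sketch of a driver-weighted index in the spirit of the AIPI.
# Weights and per-driver scores are illustrative placeholders, not the
# actual values or methodology from Deciphering China's AI Dream.

DRIVER_WEIGHTS = {
    "hardware": 0.25,
    "data": 0.25,
    "research_and_algorithms": 0.25,
    "commercial_ecosystem": 0.25,
}

def ai_potential_index(driver_scores: dict) -> float:
    """Combine per-driver scores into a single weighted index."""
    return sum(DRIVER_WEIGHTS[d] * s for d, s in driver_scores.items())

# Hypothetical per-driver scores, purely for illustration.
china = {"hardware": 10, "data": 30, "research_and_algorithms": 15, "commercial_ecosystem": 13}
us = {"hardware": 35, "data": 25, "research_and_algorithms": 40, "commercial_ecosystem": 32}

print(ai_potential_index(china))  # 17.0 with these made-up inputs
print(ai_potential_index(us))     # 33.0 with these made-up inputs
```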

Since then, probably the main thing that I’ve done to update this was in my written testimony before the US-China Economic and Security Review Commission, where I switch up a little bit how I evaluate the current AI capabilities of China and the US. Basically there’s this very fuzzy concept of national AI capabilities that we throw around, and I slice it up into three cross-sections. The first is: let’s look at the scientific and technological inputs and outputs different countries are putting into AI. So that’s: how many publications are coming out of Europe versus China versus the US? How many outputs in the sense of publications, or inputs in the sense of R&D investments? So let’s take a look at that.

The second slice is: let’s not just say AI. I think every time you say AI it’s always better to specify subtypes; at least in the second slice I look at different layers of the AI value chain: foundational layers, technological layers, and the application layer. So, for example, the foundational layer may be: who is leading in developing the AI open-source software that serves as the technological backbone for a lot of these AI applications and technologies?

And then the third slice that I take is different subdomains of AI –– so computer vision, predictive intelligence, natural language processing, et cetera. I throw a bunch of statistics out there in this written testimony –– some of it draws from this AI Potential Index that I put out last year –– and my conclusion is that China is not poised to overtake the US in the technology domain of AI; rather, the US maintains structural advantages in the quality of S&T inputs and outputs, the foundational layers of the AI value chain, and key subdomains of AI.

So yeah, this stuff changes really fast too. I think a lot of people are trying to put together more systematic ways of measuring these things: Jack Clark at OpenAI; projects like the AI Index out of Stanford University; Matt Sheehan recently put out a really good piece for MacroPolo developing sort of a five-dimensional framework for understanding data. So in this AIPI first cut, my data indicator is just a very raw count of who has more mobile phone users, but that obviously doesn’t matter for who’s going to lead in autonomous vehicles. So having a finer-grained understanding of how to measure different drivers will definitely help this field going forward.

Lucas Perry: What can you say about symmetries or asymmetries in terms of sub-fields in AI research like GANs or computer vision or any number of different sub-fields? Can we expect very strong specialties to develop in one country rather than another, or there to be lasting asymmetries in this space, or does research publication subvert this to some extent?

Jeffrey Ding: I think natural language processing is probably the best example because everyone says NLP, but then you just have that abstract word and you never dive into, “Oh wait, China might have a comparative advantage in Chinese language data processing, speech recognition, knowledge mapping,” which makes sense. There is just more of an incentive for Chinese companies to put out huge open source repositories to train automatic speech recognition.

So there might be some advantage in Chinese-language data processing, although Microsoft Research Asia has very strong NLP capabilities as well. Facial recognition is maybe another area of comparative advantage: I think in my testimony I cite that China published 900 patents in this subdomain in 2017; in that same year, fewer than 150 patents related to facial recognition were filed in the US. That could be partly just because there’s so much more of a fervor for surveillance applications, but in other domains, such as larger-scale business applications, the US probably possesses a decisive advantage. Autonomous vehicles are the best example of that: in my opinion, Google’s Waymo and GM’s Cruise are lapping the field.

And then finally in my written testimony I also try to look at military applications, and I find one metric that puts the US as having more than seven times as many military patents filed with the terms “autonomous” or “unmanned” in the patent abstract in the years 2003 to 2015. So yeah, that’s one of the research streams I’m really interested in: how can we have finer-grained metrics that actually put China’s AI development into context, so that we can have a more measured understanding of it.

Lucas Perry: All right, so we’ve now gone to some length providing a descriptive account of China and the United States and the key descriptive insights of your research. Moving into consequences now, I’ll just state some of these insights which you bring to light in your paper, and then maybe you can expand on them a bit.

Jeffrey Ding: Sure.

Lucas Perry: You discuss the potential implications of China’s AI dream for issues of AI safety and ethics, national security, economic development, and social governance. The thinking here is becoming more diversified and substantive, though you claim it’s also too early to form firm conclusions about the long-term trajectory of China’s AI development; This is probably also true of any other country, really. You go on to conclude that a group of Chinese actors is increasingly engaged with issues of AI safety and ethics. 

A new book has been authored by Tencent’s Research Institute, and it includes a chapter in which the authors discuss the Asilomar Principles in detail and call for strong regulations and controls for AI. There’s also this conclusion that military applications of AI could provide a decisive strategic advantage in international security. The degree to which China’s approach to military AI represents a revolution in military affairs is an important question to study, to see how strategic advantages between the United States and China continue to change. You continue by elucidating how economic benefit is the primary and immediate driving force behind China’s development of AI –– and again, I think you highlighted this sort of manufacturing perspective on this.

And finally, China’s adoption of AI technologies could also have implications for its mode of social governance. Regarding the State Council’s AI plan, you state, “AI will play an irreplaceable role in maintaining social stability, an aim reflected in local level integrations of AI across a broad range of public services, including judicial services, medical care, and public security.” So given these sorts of insights you’ve come to, and the consequences of this descriptive picture we’ve painted about China and AI, is there anything else you’d like to add here?

Jeffrey Ding: Yeah, as you were laying out those four categories of consequences, I was just thinking that this is what makes this area so exciting to study, because if you think about it, each of those four consequences maps onto a research field: AI ethics and safety, which fits with benevolent AI efforts, the stuff that FLI is doing, and the broader technology studies, critical technology studies, and technology ethics fields. Then in the social governance space, there’s AI as a tool of social control, and the social aftershocks of AI’s economic implications: you have this entire field of democracy studies, or studies of technology and authoritarianism. For the economic benefits, you have this entire field of innovation studies: how do we understand the productivity benefits of general purpose technologies? And of course with AI as a revolution in military affairs, you have this whole field of security studies that is trying to understand the implications of new emerging technologies for national security.

So it’s easy to start delineating these into their separate containers. I think what’s hard, especially for those of us who are really concerned about that first field, AI ethics and safety and the risks of AGI arms races, is that a lot of other people are really, really concerned about those other three fields. And how do we tie in concepts from those fields? How do we take from those fields, learn from those fields, shape the language that we’re using to also be in conversation with those fields –– and then also see how those fields may actually be in conflict with some of our goals? And then how do we navigate those conflicts? How do we prioritize different things over others? It’s an exciting but daunting prospect ahead.

Lucas Perry: If you’re listening to this and are interested in becoming an AI researcher in terms of the China landscape, we need you. There’s a lot of great and open research questions here to work on.

Jeffrey Ding: For sure. For sure.

Lucas Perry: So I’ve extracted some insights from previous podcasts you did –– I can leave a link for that on the page for this podcast –– and I just want to rapid-fire these as points that I thought were interesting that we may or may not have covered here. You point out a language asymmetry: the best Chinese AI researchers read English and Chinese, whereas Western researchers generally cannot do this. You have a newsletter called ChinAI, with one “A”; your newsletter attempts to correct for this, as you translate important Chinese tech-related material into English. I suggest everyone follow that if you’re interested in continuing to track China and AI. There is more international cooperation on research at international conferences –– this is a general trend that you point out: some top Chinese AI conferences are English-only. Furthermore, I believe you claim that the top 10% of AI research is still happening in America and the UK.

Another point which I think you’ve brought up is that China is behind on military AI uses. I’m interested to see if you can expand a little bit more on that, and China on AI safety and superintelligence is also something I’d be interested to hear more about, because on this podcast we often take the lens of long-term AI issues, AGI, and superintelligence. So I think you mentioned that the Nick Bostrom of China is Professor, correct me if I get this wrong, Zhao Tingyang. And I’m also curious if you might be able to expand on how large or serious this China superintelligence FLI/FHI vibe is and what the implications of this are, and if there are any orgs in China that are explicitly focused on this. I’m sorry if this is a silly question, but are there nonprofits in China in the same way that there are in the US? How does that function? Is China on the brink of having an FHI or FLI or MIRI or anything like this?

Jeffrey Ding: So there’s a lot to untangle there, and those are all really good questions. First, just to clarify: yeah, there are definitely nonprofits and non-governmental organizations. In recent years there has been some pressure on international non-governmental and nonprofit organizations, but there are definitely nonprofits. One of the open-source NLP initiatives I mentioned earlier, the Chinese-language corpus, was put together by a nonprofit online organization called the AIShell Foundation, and they put together AIShell-1 and AIShell-2, which are the largest open-source speech corpora available for Mandarin speech recognition.

I haven’t really followed up on Zhao Tingyang. He’s a philosopher at the Chinese Academy of Social Sciences. The sort of “Nick Bostrom of China” label was more of a newsletter headline to get people to read, but he does devote a lot of time and thinking to the long-term risks of AI. Another professor, at Nanjing University, by the name of Zhi-Hua Zhou has published articles about the need to not even touch some of what he calls strong AI. These were published in a pretty influential outlet by the China Computer Federation, which brings together a lot of the big-name computer scientists. So there are definitely conversations about this happening. Whether there is an FHI or FLI equivalent, let’s say probably not, at least not yet.

Peking University may be developing something in this space. The Berggruen Institute is also, I think, looking at some related issues. There’s probably a lot of stuff happening in Hong Kong as well; maybe we just haven’t looked hard enough. I think the biggest difference is that there’s definitely not something on the level of a DeepMind or OpenAI, because even among the firms with the best general AI capabilities, there’s nothing like DeepMind or OpenAI –– those are almost these unique entities where profits and stocks don’t matter.

So yeah, definitely some differences, but honestly I updated significantly once I started reading more, and nobody had really looked at this Zhi-Hua Zhou essay before we went looking and found it. So maybe there are a lot of these organizations and institutions out there but we just need to look harder.

Lucas Perry: So on this point of there not being OpenAI or DeepMind equivalents, are there any research organizations or departments explicitly focused on the mission of creating artificial general intelligence or superintelligence –– safely scalable machine learning systems that could go from now until infinity? Or is it more just scattered researchers?

Jeffrey Ding: I think it depends on how you define an AGI project. What you just said is probably a good, tight definition. I know Seth Baum has done some research tracking AGI projects, and he says that there are six in China. I would say probably the only ones that come close are: Tencent, which says developing artificial general intelligence is one of its mission streams, and Horizon Robotics, which is actually more like a chip company, but they also state it as one of their objectives. It also depends on how much you think work on neuroscience-related pathways into AGI counts. So there are probably some Chinese Academy of Sciences labs working on whole brain emulation or more brain-inspired approaches to AGI, but definitely nothing anywhere near the level of DeepMind or OpenAI.

Lucas Perry: All right. So there are some myths in table one of your paper which you demystify. Three of these are: China’s approach to AI is defined by its top-down and monolithic nature; China is winning the AI arms race; and there is little to no discussion of issues of AI ethics and safety in China. And then lastly I might add, if you’re able to speak to it, the idea that there is an AI arms race between the US and China to begin with.

Jeffrey Ding: Yeah, I think that’s a good addition. I think we have to be careful about which historical analogies and memes we choose. “Arms race” is a very specific callback to the Cold War context, where there were almost these discrete types of missiles that we were racing the Soviet Union on, and discrete applications that we could count up; or, going way back, to what some scholars call the first industrial arms race in the military sphere, over steam-powered boats between Britain and France in the late 19th century. In all of those instances you can count things up: France has four ironclads, the UK has four ironclads; they’re racing to see who can build more. I don’t think there’s anything like that here. There’s not this discrete thing that we’re racing to see who can have more of. If anything, it’s a competition to see who can absorb AI advances from abroad better, who can diffuse them throughout the economy, who can adopt them in a more sustainable way without sacrificing core values.

So that’s one meme that I really want to dispel. Related to that, an assumption that often influences a lot of our discourse on this is the techno-nationalist assumption, which is this idea that technology is contained within national boundaries and that the nation state is the most important actor –– which is correct and a good assumption to have in a lot of instances. But there are also good reasons to adopt techno-globalist assumptions as well, especially given how fast technologies diffuse nowadays, and how much, underneath this national-level competition, firms from different countries are working together and making standards alliances with each other. So there’s this undercurrent of techno-globalism, where there are people flows, idea flows, and company flows happening, while the coverage and the sexy topic is always going to be national-level competition, zero-sum competition, relative-gains rhetoric. So you’re trying to find a balance between those two streams.

Lucas Perry: What can you say about this reflection on zero-sum games versus healthy competition and the properties of AI and AI research? I’m seeking clarification on this secondary framing, where we can take a more international perspective on the deployment and implementation of AI research and systems rather than, as you said, this sort of techno-nationalist one.

Jeffrey Ding: Actually, this idea comes from my supervisor: relative gains make sense if there are only two players involved, just from a pure self-interest-maximizing standpoint. But once you introduce three or more players, relative gains don’t make as much sense as optimizing for absolute gains. So maybe one way to explain this is to take the perspective of a European country –– let’s say Germany –– and you are working on an AI project with China or some other country that maybe the US is pressuring you not to work with. You’re working with Saudi Arabia or China on some project, and it’s going to benefit China 10 arbitrary points and it’s going to benefit Germany eight arbitrary points versus if you didn’t choose to cooperate at all.

So in that sense, Germany, the rational actor, would take that deal. You’re not just caring about being better than China; From a German perspective, you care about maintaining leadership in the European Union, providing health benefits to your citizens, continuing to power your economy. So in that sense you would take the deal even though China benefits a little bit more, relatively speaking. 

I think currently a lot of people in the US are locked into this mindset that the only two players that exist in the world are the US and China. And if you look at our conversation, right, oftentimes I’ve displayed that bias as well. We should probably have talked a lot more about China-EU or China-Japan cooperation and networks in this space, because there’s a lot happening there too. So a lot of US policy makers see this as a two-player game between the US and China. And in that sense, if there’s some cancer research project about discovering proteins using AI that may benefit China by 10 points and benefit the US only by eight points, but it’s going to save a lot of people from cancer –– if you only care about maintaining a lead over China, then you might not take that deal. But if you think about it from the broader landscape, where it’s not just a zero-sum competition between the US and China, then your evaluation of those different point structures, and what you think is rational, will change.

Lucas Perry: So as there are more actors, is the idea here that you care more about absolute gains, in the sense that these utility points or whatever can be translated into decisive strategic advantages, like military advantages?

Jeffrey Ding: Yeah, I think that’s part of it. What I was thinking with that example is basically that if you as Germany don’t choose to cooperate with Saudi Arabia or work on this joint research project with China, then the UK or some other country is just going to swoop in. And that possibility doesn’t exist in a world where you’re just thinking about two players. There are a lot of different ways to fit these sorts of formal models, but that’s probably the most simplistic way of explaining it.
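To make the relative-gains-versus-absolute-gains logic concrete, here is a minimal sketch of the payoff comparison Jeffrey describes, using his arbitrary point values (Germany 8, China 10). The function, the zero payoff for declining, and the third-player framing are illustrative assumptions, not anything from the episode or the report.

```python
# A toy model of Jeffrey's example. The point values are his arbitrary numbers;
# everything else (the zero payoff for declining, the third-player framing)
# is an illustrative assumption.

COOPERATE_PAYOFF = {"Germany": 8, "China": 10}

def germany_best_choice(num_players: int) -> str:
    """Return Germany's choice under relative- vs. absolute-gains reasoning."""
    if num_players == 2:
        # Pure relative-gains logic: Germany only compares itself to China.
        relative_gain = COOPERATE_PAYOFF["Germany"] - COOPERATE_PAYOFF["China"]  # -2
        return "cooperate" if relative_gain > 0 else "decline"
    # With three or more players, declining just hands the deal to another
    # country, so Germany compares its absolute payoff against getting nothing.
    decline_payoff = 0
    return "cooperate" if COOPERATE_PAYOFF["Germany"] > decline_payoff else "decline"

print(germany_best_choice(2))  # "decline": strict relative-gains thinking rejects the deal
print(germany_best_choice(3))  # "cooperate": absolute gains dominate once a third player can swoop in
```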

Lucas Perry: Okay, cool. So you’ve spoken a bit here on important myths that we need to dispel or memes that we need to combat. And recently Peter Thiel has been on a bunch of conservative platforms, and he also wrote an op-ed, basically fanning the flames of AGI as a military weapon, AI as a path to superintelligence and, “Google campuses have lots of Chinese people on them who may be spies,” and that Google is actively helping China with AI military technology. In terms of bad memes and myths to combat, what are your thoughts here?

Jeffrey Ding: There’s just a lot that Thiel gets wrong. I’m mostly just confused, because he is one of the original founders of OpenAI, he’s funded other institutions really concerned about AGI safety, really concerned about race dynamics –– and then in the middle of this piece, he first says AI is a military technology, then he goes back to saying AI is dual use, and then he says this ambiguity is “strangely missing from the narrative that pits a monolithic AI against all of humanity.” He of all people should know what these conversations about the risks of AGI actually look like: why are you attacking this straw man in the form of a Terminator AI meme? Especially when you’re funding a lot of the organizations that are worried about the risks of AGI for all of humanity.

The other main thing that’s really problematic is that if you’re concerned about the US military advantage, that advantage is now more than ever rooted in our innovation advantage. It’s not about spinoff from military innovation to civilian innovation, which was the case in the days of US tech competition against Japan. It’s more a case of spin-on, where innovations happening in the commercial sector are undergirding the US military advantage.

And this idea of painting Google as anti-American for setting up labs in China is so counterproductive. There are independent Google developer conferences all across China just because so many Chinese programmers want to use Google tools like TensorFlow. It goes back to the fundamental AI open source software I was talking about earlier that lets Google expand its talent pool: People want to work on Google products; They’re more used to the framework of Google tools to build all these products. Google’s not doing this out of charity to help the Chinese military. They’re doing this because the US has a flawed high-skilled immigration system, so they need to go to other countries to get talent. 

Also, the other thing about the piece is that he cites no empirical research on any of these fronts, when there’s this whole globalization-of-innovation literature that empirically backs up a lot of what I’m saying. And then I’ve done my own empirical research on Microsoft Research Asia, which, as we’ve mentioned, is their second-biggest lab overall and is based in Beijing. I’ve tracked their PhD Fellowship Program: it basically gives people in Chinese PhD programs a full scholarship, and you do an internship at Microsoft Research Asia for one of the summers. We track their career trajectories, and a lot of them end up coming to the US or working for Microsoft Research Asia in Beijing. And the ones that come to the US don’t just go to Microsoft: they go to Snapchat or Facebook or other companies. And it’s not just about the people: as I mentioned earlier, we have this innovation centrism about who produces the technology first, but oftentimes it’s about who diffuses and adopts the technology first. And we’re not always going to be the first on the scene, so we have to be able to adopt and diffuse technologies that are invented first in other areas. And these overseas labs are some of our best portals into understanding what’s happening in these other areas. If we lose them, it’s another form of asymmetry, because Chinese AI companies are going abroad and expanding.

Honestly, I’m just really confused about what the point of this piece was, and to be honest, it’s kind of sad, because this is not what Thiel researches every day. So he’s obviously picking up bits and pieces from the narrative frames that are dominating our conversation. And it’s actually probably a stain on how we’ve allowed the discourse to be structured, that it has so many of these bad, problematic memes, and we need more people calling them out actively, having the heart-to-heart conversations behind the scenes to get people to change their minds or have productive, constructive conversations about these issues.

And the last thing I’ll point out here is that there’s this zombie Cold War mentality that still lingers today. I think the historian Walter McDougall was really great in calling this out, where he talks about how we paint this other, this enemy, and use it to justify sacrifices in human values in order to drive society to its fullest technological potential. And that often comes with sacrificing human values like privacy, equality, and freedom of speech. I don’t want us to compete with China over who can build better tools to censor, repress, and surveil dissidents and minority groups, right? Let’s see who can build the better, I don’t know, industrial internet of things, or better privacy-preserving algorithms that are going to sustain a more trustworthy AI ecosystem.

Lucas Perry: Awesome. So just moving along here as we’re making it to the end of our conversation: what updates or major changes have there been since you wrote Deciphering China’s AI Dream a year ago?

Jeffrey Ding: Yeah, I mentioned some of the updates in the capability section. The consequences, I mean I think those are still the four main big issues, all of them tied to four different literature bases. The biggest change would probably be in the component section. I think when I started out, I was pretty new in this field, I was reading a lot of literature from the China watching community and also a lot from Chinese comparative politics or articles about China, and so I focused a lot on government policies. And while I think the party and the government are definitely major players, I think I probably overemphasized the importance of government policies versus what is happening at the local level.

So if I were to go back and rewrite it, I would’ve looked a lot more at what is happening at the local level, and given more examples of AI firms. iFlytek, I think, is a very interesting under-covered firm; they are setting up research institutes with a university in Chung Cheng, very similar to the industry–academia style collaborations in the US, basically ensuring that they’re able to train the next generation of talent. They have relatively close ties to the state as well, with, I think, controlling shares or a large percentage of shares owned by state-owned vehicles. So I probably would have gone back and looked at some of these more under-covered firms and localities and at what they were doing, rather than just looking at the rhetoric coming from the central government.

Lucas Perry: Okay. What does it mean for there to be healthy competition between the United States and China? What is an ideal AI research and political situation? What are the ideal properties of the relations the US and China can have on the path to superintelligence?

Jeffrey Ding: Yeah.

Lucas Perry: Solve AI Governance for me, Jeff!

Jeffrey Ding: If I could answer that question, I think I could probably retire or something. I don’t know.

Lucas Perry: Well, we’d still have to figure out how to implement the ideal governance solutions.

Jeffrey Ding: Yeah. I think one starting point is, on the way to more advanced AI systems, we have to stop looking at AI as if it’s this completely special area with no analogs. Even though there are unique aspects of AI –– like autonomous intelligent systems, the possibility of the product surpassing human-level intelligence, or the process surpassing human-level intelligence –– we can learn a lot from past general purpose technologies like steam, electricity, and the diesel engine. And we can learn a lot from competition in past strategic industries like chips and steel.

So one thing we can probably distill from some of this literature is that some aspects of AI development are going to be more likely to lead to race dynamics than others. One cut you could take is industries where it’s likely that there are only going to be two or three, four or five major players –– it might be the case that the capital costs, the start-up costs, the infrastructure costs of autonomous vehicles require that there are going to be only one or two players across the world. And then, hey, if you’re a national government that’s thinking strategically, you might really want to have a player in that space, so that might incentivize more competition. Whereas in other fields, maybe there’s just going to be a lot more competition, or less need for relative-gains, zero-sum thinking. Neural machine translation, for example, could be a case of something that almost becomes a commodity.

So then there are things we can think about in those fields where there are only going to be three, four, or five players. Can we maybe balance it out so that at least one player is from each of the two major powers? Or is the better approach to, I don’t know, enact global competition policy, global antitrust policy, to ensure that there’s always going to be a bunch of different players from a bunch of different countries? So those are some of the things that come to mind that I’m thinking about, but yeah, this is definitely something where I claim zero credibility relative to others who are thinking about it.

Lucas Perry: Right. Well, it’s unclear anyone has very good answers here. To add at least one frame on it, my perspective is that given the dual-use nature of many of these technologies –– like computer vision, embedded robot systems, developing autonomy, and image classification –– all of these different AI specialty subsystems can be put together in arbitrary ways. So in terms of autonomous weapons, FLI’s position is that it’s important to establish international standards around the appropriate and beneficial uses of these technologies.

Image classification, as people already know, can be used for discrimination or for beneficial things. And the technologies can be aggregated to make anything from literal Terminator swarm robots to lifesaving medical treatments. So the relationship between the United States and China can be made more productive if clear standards, based on the expression of the principles we enumerated earlier, could be created. And given that, we might be taking some paths towards a beneficial, beautiful future of advanced AI systems.

Jeffrey Ding: Yeah, no, I like that a lot. And from some of the technical standards documents I’ve been translating: I definitely think that in the short term, technical standards are a good way forward, sort of solving the starter-pack type of problems before AGI. Even some Chinese white papers on AI standardization have put out the idea of ranking the intelligence level of different autonomous systems –– an autonomous car might rank higher than a smart speaker or something. Even that is a nice way to keep track of the progress and continuities in terms of intelligence explosions and trajectories in the space. So yeah, I definitely second that idea: standardization efforts and autonomous weapons regulation efforts serving as the building blocks for larger AGI safety issues.

Lucas Perry: I would definitely like to echo this starter pack point of view. There’s a lot of open questions about the architectures or ways in which we’re going to get to AGI, about how the political landscape and research landscape is going to change in time. But I think that we already have enough capabilities and questions that we should really be considering where we can be practicing and implementing the regulations and standards and principles and intentions today in 2019 that are going to lead to robustly good futures for AGI and superintelligence.

Jeffrey Ding: Yeah. Cool.

Lucas Perry: So Jeff, if people want to follow you, what is the best way to do that?

Jeffrey Ding: You can hit me up on Twitter, I’m @JJDing99. Or: I put out a weekly newsletter featuring translations on AI-related issues from Chinese media and Chinese scholars; that’s the ChinAI newsletter, C-H-I-N-A-I. If you just search that, it should pop up.

Lucas Perry: Links to those will be provided in the description of wherever you might find this podcast. Jeff, thank you so much for coming on and thank you for all of your work and research and efforts in this space, for helping to create a robust and beneficial future with AI.

Jeffrey Ding: All right, Lucas. Thanks. Thanks for the opportunity. This was fun.

Lucas Perry: If you enjoyed this podcast, please subscribe, give it a like or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

End of recorded material

AI Alignment Podcast: On the Governance of AI with Jade Leung

In this podcast, Lucas spoke with Jade Leung from the Center for the Governance of AI (GovAI). GovAI strives to help humanity capture the benefits and mitigate the risks of artificial intelligence. The center focuses on the political challenges arising from transformative AI, and they seek to guide the development of such technology for the common good by researching issues in AI governance and advising decision makers. Jade is Head of Research and Partnerships at GovAI, where her research focuses on modeling the politics of strategic general purpose technologies, with the intention of understanding which dynamics seed cooperation and conflict.

Topics discussed in this episode include:

  • The landscape of AI governance
  • GovAI’s research agenda and priorities
  • Aligning government and companies with ideal governance and the common good
  • Norms and efforts in the AI alignment community in this space
  • Technical AI alignment vs. AI Governance vs. malicious use cases
  • Lethal autonomous weapons
  • Where we are in terms of our efforts and what further work is needed in this space

You can take a short (3 minute) survey to share your feedback about the podcast here.

Important timestamps: 

0:00 Introduction and updates

2:07 What is AI governance?

11:35 Specific work that Jade and the GovAI team are working on

17:21 Windfall clause

21:20 Policy advocacy and AI alignment community norms and efforts

27:22 Moving away from short-term vs long-term framing to a stakes framing

30:44 How do we come to ideal governance?

40:22 How can we contribute to ideal governance through influencing companies and government?

48:12 US and China on AI

51:18 What more can we be doing to positively impact AI governance?

56:46 What is more worrisome, malicious use cases of AI or technical AI alignment?

01:01:19 What is more important/difficult, AI governance or technical AI alignment?

01:03:49 Lethal autonomous weapons

01:09:49 Thinking through tech companies in this space and what we should do

 

Two key points from Jade: 

“I think one way in which we need to rebalance a little bit, as kind of an example of this is, I’m aware that a lot of the work, at least that I see in this space, is sort of focused on very aligned organizations and non-government organizations. So we’re looking at private labs that are working on developing AGI. And they’re more nimble. They have more familiar people in them, we think more similarly to those kinds of people. And so I think there’s an attraction. There’s really good rational reasons to engage with the folks because they’re the ones who are developing this technology and they’re plausibly the ones who are going to develop something advanced.

“But there’s also, I think, somewhat biased reasons why we engage: because they’re not as messy, or they’re more familiar, or we see them as more value aligned. And I think this early in the field, putting all our eggs in a couple of very, very limited baskets is plausibly not that great a strategy. That being said, I’m actually not entirely sure what I’m advocating for. I’m not sure that I want people to go and engage with all of the UN conversations on this because there’s a lot of noise and very little signal. So I think it’s a tricky one to navigate, for sure. But I’ve just been reflecting on it lately, that I think we sort of need to be a bit conscious about not group-thinking ourselves into thinking we’re covering all the bases that we need to cover.”

 

“I think one thing I’d like for people to be thinking about… this short term v. long term bifurcation. And I think a fair number of people are. And the framing that I’ve tried on a little bit is more thinking about it in terms of stakes. So how high are the stakes for a particular application area, or a particular sort of manifestation of a risk or a concern.

“And I think thinking about it in the stakes sense, as opposed to the timeline sense, helps me at least try to identify things that we currently call or label near-term concerns, and try to filter the ones that are worth engaging in versus the ones that maybe we just don’t need to engage in at all. An example here is that basically I am trying to identify near-term/existing concerns that I think could scale in stakes as AI becomes more advanced. And if those exist, then there’s really good reason to engage in them for several reasons, right?…Plausibly, another one would be privacy as well, because I think privacy is currently a very salient concern. But also, privacy is an example of one of the fundamental values that we are at risk of eroding if we continue to deploy technologies for other reasons: efficiency gains, or for increasing control and centralizing of power. And privacy is this small microcosm of a maybe larger concern about how we could possibly be chipping away at these very fundamental things which we would want to preserve in the longer run, but we’re at risk of not preserving because we continue to operate in this dynamic of innovation and performance at whatever cost. Those are examples of conversations where I find it plausible that there are existing conversations that we should be more engaged in, just because those are actually going to matter for the things that we call long-term concerns, or the things that I would call sort of high-stakes concerns.”

 

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, Spotify, SoundCloud, iTunes, Google Play, Stitcher, iHeartRadio, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can listen to the podcast above or read the transcript below. Key works mentioned in this podcast can be found here 

Lucas: Hey, everyone. Welcome back to the AI Alignment Podcast. I’m Lucas Perry. And today, we will be speaking with Jade Leung from the Center for the Governance of AI, housed at the Future of Humanity Institute. Their work strives to help humanity capture the benefits and mitigate the risks of artificial intelligence. They focus on the political challenges arising from transformative AI, and seek to guide the development of such technology for the common good by researching issues in AI governance and advising decision makers. Jade is Head of Research and Partnerships at GovAI, and her research focuses on modeling the politics of strategic general purpose technologies, with the intention of understanding which dynamics seed cooperation and conflict.

In this episode, we discuss GovAI’s research agenda and priorities, the landscape of AI governance, how we might arrive at ideal governance, the dynamics and roles of both companies and states within this space, how we might be able to better align private companies with what we take to be ideal governance. We get into the relative importance of technical AI alignment and governance efforts on our path to AGI, we touch on lethal autonomous weapons, and also discuss where we are in terms of our efforts in this broad space, and what work we might like to see more of.

As a general bit of announcement, I found all the feedback coming in through the SurveyMonkey poll to be greatly helpful. I’ve read through all of your comments and thoughts, and am working on incorporating feedback where I can. So for the meanwhile, I’m going to leave the survey up, and you’ll be able to find a link to it in a description of wherever you might find this podcast. Your feedback really helps and is appreciated. And, as always, if you find this podcast interesting or useful, consider sharing with others who might find it valuable as well. And so, without further ado, let’s jump into our conversation with Jade Leung.

So let’s go ahead and start by providing a little bit of framing on what AI governance is, the conceptual landscape that surrounds it. What is AI governance, and how do you view and think about this space?

Jade: I think the way that I tend to think about AI governance is with respect to how it relates to the technical field of AI safety. In both fields, the broad goal is how humanity can best navigate our transition towards a world with advanced AI systems in it. The technical AI safety agenda and the kind of research that’s being done there is primarily focused on how do we build these systems safely and well. And the way that I think about AI governance with respect to that is broadly everything else that’s not that. So that includes things like the social, political, economic context that surrounds the way in which this technology is developed and built and used and employed.

And specifically, I think with AI governance, we focus on a couple of different elements of it. One big element is the governance piece: what are the kinds of norms and institutions we want around a world with advanced AI, so that it serves the common good of humanity? And then we also focus a lot on the kinds of strategic and political impacts, effects, and consequences of the route on the way to a world like that. So what are the kinds of risks, social, political, economic? And what are the kinds of impacts and effects that developing it in sub-optimal ways could have on the various things that we care about?

Lucas: Right. And so just to throw out some other cornerstones here, because I think there’s many different ways of breaking up this field and thinking about it, and this sort of touches on some of the things that you mentioned. There’s the political angle, the economic angle. There’s the military. There’s the governance and the ethical dimensions.

Here on the AI Alignment Podcast, we’ve at least been breaking the taxonomy down into technical AI alignment research, which is getting machine systems to be aligned with human values and desires and goals, then the sort of AI governance, strategy, and law stuff, and then the ethical dimension. Do you have any preferred view or way of breaking this all down? Or is it all just about as good to you?

Jade: Yeah. I mean, there are a number of different ways of breaking it down. And I think people also mean different things when they say strategy and governance and whatnot. I’m not particularly excited about getting into definitional debates. But maybe one way of thinking about what this word governance means is, at least, that I often think of governance as the norms, and the processes, and the institutions that are going to, and already do, shape the development and deployment of AI. So I think a couple of things are worth underlining in that. The word governance isn’t just specifically government and regulation; that’s a broadening of the term which is worth pointing out, because equating the two is a common misconception, I think, when people use the word governance.

So when I say governance, I mean government and regulation, for sure. But I also mean: what are other actors doing that aren’t governments? So labs, researchers, developers, NGOs, journalists, et cetera, and also other mechanisms that aren’t regulation. So it could be things like reputation, financial flows, talent flows, public perception, what’s within and outside the Overton window, et cetera. So there are a number of different levers I think you can pull if you’re thinking about governance.

It’s probably worth also pointing out, I think, that when people say governance, a lot of the time they are talking about the normative side of things: what should it look like, and how could it be if it were good? A lot of governance research, at least in this space now, is very much descriptive. So it’s kind of like what’s actually happening, and trying to understand the landscape of risk, the landscape of existing norms that we have to work with, what’s a tractable way forward with existing actors? How do you model existing actors in the first place? So a fair amount of the research is very descriptive, and I would qualify that as AI governance research, for sure.

Another way of breaking it down, which is one option, is according to the research agenda that we put out. That breaks it down into, firstly, understanding the technological trajectory: understanding where this technology is likely to go, what the technical inputs and constraints are, and particularly the ones that have implications for governance outcomes. This looks like things like modeling AI progress and mapping capabilities, and involves a fair amount of technical work.

And then you’ve got the politics cluster, which is probably where a fair amount of the work is at the moment. This is looking at political dynamics between powerful actors. So, for example, my work is focusing on big firms and government and how they relate to each other, but also includes how AI transforms and impacts political systems, both domestically and internationally. This includes the cluster around international security and the race dynamics that fall into that. And then also international trade, which is a thing that we don’t talk about a huge amount, but politics also includes this big dimension of economics in it.

And then the last cluster is this governance cluster, which is probably the most normative end of what we would want to be working on in this space. This is looking at things like what are the ideal institutions, infrastructure, norms, mechanisms that we can put in place now/in the future that we should be aiming towards that can steer us in robustly good directions. And this also includes understanding what shapes the way that these governance systems are developed. So, for example, what roles does the public have to play in this? What role do researchers have to play in this? And what can we learn from the way that we’ve governed previous technologies in similar domains, or with similar challenges, and how have we done on the governance front on those bits as well. So that’s another way of breaking it down, but I’ve heard more than a couple of ways of breaking this space down.

Lucas: Yeah, yeah. And all of them are sort of valid in their own ways, and so we don’t have to spend too much time on this here. Now, a lot of these things that you’ve mentioned are quite macroscopic effects in the society and the world, like norms and values and developing a concept of ideal governance and understanding actors and incentives and corporations and institutions and governments. Largely, I find myself having trouble developing strong intuitions about how to think about how to impact these things because it’s so big it’s almost like the question of, “Okay, let’s figure out how to model all of human civilization.” At least all of the things that matter a lot for the development and deployment of technology.

And then let’s also think about ideal governance, like what is also the best of all possible worlds, based off of our current values, that we would like to use our model of human civilization to bring us closer towards? So being in this field, and exploring all of these research threads, how do you view making progress here?

Jade: I can hear the confusion in your voice, and I very much resonate with it. We’re sort of consistently confused, I think, at this place. And it is a very big, both set of questions, and a big space to kind of wrap one’s head around. I want to emphasize that this space is very new, and people working in this space are very few, at least with respect to AI safety, for example, which is still a very small section that feels as though it’s growing, which is a good thing. We are at least a couple of years behind, both in terms of size, but also in terms of sophistication of thought and sophistication of understanding what are more concrete/sort of decision relevant ways in which we can progress this research. So we’re working hard, but it’s a fair ways off.

One way in which I think about it is to think about it in terms of what actors are making decisions now/in the near to medium future, that are the decisions that you want to influence. And then you sort of work backwards from that. I think at least, for me, when I think about how we do our research at the Center for the Governance of AI, for example, when I think about what is valuable for us to research and what’s valuable to invest in, I want to be able to tell a story of how I expect this research to influence a decision, or a set of decisions, or a decision maker’s priorities or strategies or whatever.

Ways of breaking that down a little bit further would be to say, you know, who are the actors that we actually care about? One relatively crude bifurcation is focusing on those who are in charge of developing and deploying these technologies, firms, labs, researchers, et cetera, and then those who are in charge of sort of shaping the environment in which this technology is deployed, and used, and is incentivized to progress. So that’s folks who shape the legislative environment, folks who shape the market environment, folks who shape the research culture environment, and expectations and whatnot.

And with those two sets of decision makers, you can then boil it down into what are the particular decisions they are in charge of making that you can decide you want to influence, or try to influence, by providing them with research insights or doing research that will in some downstream way affect the way they think about how these decisions should be made. And a very, very concrete example would be to pick, say, a particular firm. And they have a set of priorities, or a set of things that they care about achieving within the lifespan of that firm. And they have a set of strategies and tactics that they intend to use to execute on that set of priorities. So you can either focus on trying to shift their priorities towards better directions if you think they’re off, or you can try to point out ways in which their strategies could be done slightly better, e.g. they should be coordinating more with other actors, or they should be thinking harder about openness in their research norms. Et cetera, et cetera.

Well, you can kind of boil it down to the actor level and the decision-specific level, and get some sense of what it actually means for progress to happen, and for you to have some kind of impact with this research. One caveat with this is that I think if one takes this lens on what research is worth doing, you’ll end up missing a lot of valuable research being done. So a lot of the work that we do currently, as I said before, is very much understanding what’s going on in the first place. What are the actual inputs into the AI production function that matter and are constrained and are bottlenecked? Where are they currently controlled? There are a number of other things which are mostly just descriptive, where I can’t tell you which decision I’m going to influence by understanding this. But having a better baseline will inform better work across a number of different areas. I’d say that this particular lens is one way of thinking about progress. There are a number of other things that it wouldn’t measure, that are still worth doing in this space.

Lucas: So it does seem like we gain a fair amount of tractability by just thinking, at least short term, about who the key actors are, and how we might be able to guide them in a direction which seems better. I think here it would also be helpful if you could let us know what the actual research is that you, and say, Allan Dafoe engage in on a day-to-day basis. So there’s analyzing historical cases. I know that you guys have done work on specifying your research agenda. You have done surveys of American attitudes and trends in opinions on AI. Jeffrey Ding has also released a paper on deciphering China’s AI dream, which tries to understand China’s AI strategy. You’ve also released work on the malicious use cases of artificial intelligence. So, I mean, what is it like being Jade on a day-to-day basis, trying to conquer this problem?

Jade: The specific work that I’ve spent most of my research time on to date sort of falls into the politics/governance cluster. And basically, the work that I do is centered on the assumption that there are things that we can learn from the history of trying to govern strategic general purpose technologies well. If you look at AI, you might believe that it has certain properties that make it strategic, strategic here in the sense that it’s important for things like national security and the economic leadership of nations and whatnot. And it’s also a general purpose technology, in that it has the potential to do what GPTs do, which is to sort of change the nature of economic production, push forward a number of different frontiers simultaneously, enable consistent cumulative progress, and change the course of organizational functions like transportation, communication, et cetera.

So if you think that AI looks like a strategic general purpose technology, then the claim is something like: in history we’ve seen a set of technologies that plausibly have the same traits. So the ones that I focus on are biotechnology, cryptography, and aerospace technology. And the question that sort of kicked off this research is, how have we dealt with the kind of fraught competition that we currently see in the space of AI when we’ve competed over these technologies in the past? And the reason why there’s a focus on competition here is because I think one important thing that characterizes a lot of the reasons why we’ve got a fair number of risks in the AI space is that we are competing over it. “We” here being very powerful nations and very powerful firms. And the reason why competition is an important thing to highlight is that it exacerbates a number of risks and it causes a number of risks.

So when you’re in a competitive environment, actors are normally incentivized to take larger risks than they otherwise would rationally do. They are largely incentivized to not engage in the kind of thinking that is required to think about public goods governance and serving the common benefit of humanity. And they’re more likely to engage in thinking that is more about serving parochial, sort of private, interests.

Competition is bad for a number of reasons. Or it could be bad for a number of reasons. And so the question I’m asking is, how have we competed in the past? And what have been the outcomes of those competitions? Long story short, so the research that I do is basically I dissect these cases of technology development, specifically in the US. And I analyze the kinds of conflicts, and the kinds of cooperation that have existed between the US government and the firms that were leading technology development, and also the researcher communities that were driving these technologies forward.

Other pieces of research that are going on: we have a fair number of our researchers working on understanding what the important inputs into AI are that are actually progressing us forward. How important is compute relative to algorithmic structures, for example? How important is talent, with respect to other inputs? And the reason why that’s important to analyze and useful to think about is understanding who controls these inputs, and how they’re likely to progress in terms of future trends. So that’s an example of the technology forecasting work.

In the politics work, we have a pretty big chunk on looking at the relationship between governments and firms. So this is a big piece of work that I’ve been doing, along with a fair number of others, understanding, for example, if the US government wanted to control AI R&D, what the various levers are that they have available, that they could use to do things like seize patents, or control research publications, or exercise things like export controls, or investment constraints, or whatnot. And the reason why we focus on that is because my hypothesis is that, ultimately, you’re going to start to see states get much more involved. At the moment, we’re in a period of time that a lot of people describe as very private sector driven, with governments lagging behind. I think, and history would also suggest, that the state is going to be involved much more significantly very soon. So understanding what they could do, and what their motivations are, is important.

And then, lastly, on the governance piece, a big chunk of our work here is specifically on public opinion. So you’ve mentioned this before. But basically, a substantial chunk of our work, consistently, is just understanding what the public thinks about various issues to do with AI. So recently, we published a report on a set of surveys that we did of the American public. And we asked them a variety of different questions and got some very interesting answers.

So we asked them questions like: What risks do you think are most important? Which institution do you trust the most to do things with respect to AI governance and development? How important do you think certain types of governance challenges are for American people? Et cetera. And the reason why this is important for the governance piece is because governance ultimately needs to have sort of public legitimacy. And so the idea was that understanding how the American public thinks about certain issues can at least help to shape some of the conversation around where we should be headed in governance work.

Lucas: So there’s also been work here, for example, on capabilities forecasting. And I think Allan and Nick Bostrom also come at these from slightly different angles sometimes. And I’d just like to explore all of these so we can get all of the sort of flavors of the different ways that researchers come at this problem. Was it Ben Garfinkel who did the offense-defense analysis?

Jade: Yeah.

Lucas: So, for example, there’s work on that. That work was specifically on trying to understand how the offense-defense balance scales as capabilities change. This could have been done with nuclear weapons, for example.

Jade: Yeah, exactly. That was an awesome piece of work by Allan and Ben Garfinkel, looking at this concept of the offense-defense balance, which exists for weapons systems broadly. And they were sort of analyzing and modeling. It’s a relatively theoretical piece of work, trying to model how the offense-defense balance changes with investments. And then there was a bit of an investigation there specifically on how we could expect AI to affect the offense-defense balance in different types of contexts. The other cluster of work, which I failed to mention as well, is a lot of our work on policy, specifically. So this is where projects like the windfall clause would fall in.

Lucas: Could you explain what the windfall clause is, in a sentence or two?

Jade: The windfall clause is an example of a policy lever, which we think could be a good idea to talk about in public and potentially think about implementing. And the windfall clause is an ex-ante voluntary commitment by AI developers to distribute profits from the development of advanced AI for the common benefit of humanity. What I mean by ex-ante is that they commit to it now. So an AI developer, say a given AI firm, will commit to, or sign, the windfall clause prior to knowing whether they will get to anything like advanced AI. And what they commit to is saying that if I hit a certain threshold of profits, so what we call windfall profit, and the threshold is very, very, very high. So the idea is that this should only really kick in if a firm really hits the jackpot and develops something that is so advanced, or so transformative in the economic sense, that they get a huge amount of profit from it at some sort of very unprecedented scale.

So if they hit that threshold of profit, this clause will kick in, and that will commit them to distributing their profits according to some kind of pre-committed distribution mechanism. And the idea with the distribution mechanism is that it will redistribute these profits along the lines of ensuring that sort of everyone in the world can benefit from this kind of bounty. There are a lot of different ways in which you could do the distribution. And we’re about to put out the report which outlines some of our thinking on it. And there are many more ways in which it could be done besides what we talk about.

But effectively, what you want in a distribution mechanism is for it to be able to do things like rectify inequalities that could have been caused in the process of developing advanced AI. You want it to be able to provide a financial buffer to those who’ve been technologically unemployed by the development of advanced AI. And then you also want it to do somewhat positive things too. So it could be, for example, that you distribute it according to meeting the sustainable development goals. Or it could be redistributed according to a scheme that looks something like a UBI, and that transitions us into a different type of economic structure. So there are various ways in which you could play around with it.

Effectively, the windfall clause is starting a conversation about how we should be thinking about the responsibilities that AI developers have to ensure that if they do luck out, or if they do develop something that is as advanced as some of what we speculate we could get to, there is a responsibility there. And there also should be a committed mechanism there to ensure that that is balanced out in a way that reflects the way that we want this value to be distributed across the world.

And that’s an example of the policy lever that is sort of uniquely concrete, in that we don’t actually do a lot of concrete research. We don’t do much policy advocacy work at all. But to the extent that we want to do some policy advocacy work, it’s mostly with the motivation that we want to be starting important conversations about robustly good policies that we could be advocating for now, that can help steer us in better directions.

Lucas: And fitting this into the research threads that we’re talking about here, this goes back to, I believe, Nick Bostrom’s Superintelligence. And so it’s sort of predicated on more foundational principles, which predate the Asilomar Conference but are also reflected in the Asilomar principles developed in 2017: that the benefits of AI should be spread widely, and that there should be abundance. And so then there become these sort of specific policy implementations or mechanisms by which we are going to realize these principles, which form the foundation of our ideal governance.

So Nick has sort of done a lot of this work on forecasting. The forecasting in Superintelligence was less about concrete timelines, and more about the logical conclusions of the kinds of capabilities that AI will have, fitting that into our timeline of AI governance thinking, with ideal governance at the end of that. And then behind us, we have history, from which we can, as you’re doing yourself, try to glean more information about how what you call general purpose technologies affect incentives and institutions and policy and law and the reaction of governments to these new powerful things. Before we brought up the windfall clause, you were discussing policy at FHI.

Jade: Yeah, and one of the reasons why it’s hard is because if we put on the frame that we mostly make progress by influencing decisions, we want to be pretty certain about what kinds of directions we want these decisions to go, and what we would want these decisions to be, before we engage in any sort of substantial policy advocacy work to try to make that actually a thing in the real world. I am very, very hesitant about our ability to do that well, at least at the moment. I think we need to be very humble about thinking about making concrete recommendations because this work is hard. And I also think there is this dynamic, at least, in setting norms, and particularly legislation or regulation, but also just setting up institutions, in that it’s pretty slow work, but it’s very path dependent work. So if you establish things, they’ll be sort of here to stay. And we see a lot of legacy institutions and legacy norms that are maybe a bit outdated with respect to how the world has progressed in general. But we still struggle with them because it’s very hard to get rid of them. And so the kind of emphasis on humility, I think, is a big one. And it’s a big reason why basically policy advocacy work is quite slim on the ground, at least in the moment, because we’re not confident enough in our views on things.

Lucas: Yeah, but there’s also this tension here. The technology’s coming anyway. And so we’re sort of on this timeline to get the right policy stuff figured out. And here, when I look at, let’s just take the Democrats and the Republicans in the United States, and how they interact. Generally, in terms of specific policy implementation and recommendation, it just seems like different people have various dispositions and foundational principles which are at odds with one another, and that policy recommendations are often not substantially tested, or the result of empirical scientific investigation. They’re sort of a culmination and aggregate of one’s very broad, squishy intuitions and modeling of the world. Which is sort of why, at least at the policy level, seemingly in the United States government, it seems like a lot of the conversation is just endless arguing that gets nowhere. How do we avoid that here?

Jade: I mean, this is not just specifically an AI governance problem. I think we just struggle with this in general as we try to do governance and politics work in a good way. It’s a frustrating dynamic. But I think one thing that you said definitely resonates, and that’s, a bit contra to what I just said: whether we like it or not, governance is going to happen, particularly if you take the view that basically anything that shapes the way this is going to go, you could call governance. Something is going to fill the gap because that’s what humans do. You either have the absence of good governance, or you have somewhat better governance if you try to engage a little bit. There’s definitely that tension.

One thing that I’ve recently been reflecting on, in terms of things that we under-prioritize in this community: it’s sort of a bit of a double-edged sword that we’re very conscientious about being epistemically humble and being very cautious about things, and trying to be better calibrated and all of that, which are very strong traits of people who work in this space at the moment. But I think almost because of those traits, too, we undervalue, or we don’t invest enough time or resources in, just trying to engage in existing policy discussions and existing governance institutions. And I think there’s also an aversion to engaging in things that feel frustrating and slow, and that’s plausibly a mistake, at least in terms of how much attention we pay to it, because in the absence of our engagement, the thing’s still going to happen anyway.

Lucas: I must admit that, as someone interested in philosophy, I’ve resisted the idea of governance in AI for a long time now, at least casually, in favor of nice, calm, cool, rational conversations at tables that you might have with friends about values, and ideal governance, and what kinds of futures you’d like. But as you’re saying, and as Allan says, that’s not the way that the world works. So here we are.

Jade: So here we are. And as an example of one way in which I think we need to rebalance a little bit: I’m aware that a lot of the work, at least that I see in this space, is sort of focused on very aligned organizations and non-government organizations. So we’re looking at private labs that are working on developing AGI. And they’re more nimble. They have more familiar people in them; we think more similarly to those kinds of people. And so I think there’s an attraction. There are really good rational reasons to engage with these folks because they’re the ones who are developing this technology and they’re plausibly the ones who are going to develop something advanced.

But there are also, I think, somewhat biased reasons why we engage: because they’re not as messy, or they’re more familiar, or we feel more value aligned. And I think this early in the field, putting all our eggs in a couple of very, very limited baskets is plausibly not that great a strategy. That being said, I’m actually not entirely sure what I’m advocating for. I’m not sure that I want people to go and engage with all of the UN conversations on this because there’s a lot of noise and very little signal. So I think it’s a tricky one to navigate, for sure. But I’ve just been reflecting on it lately, that I think we sort of need to be a bit conscious about not groupthinking ourselves into believing we’re covering all the bases that we need to cover.

Lucas: Yeah. My view on this, and this may be wrong, comes from just looking at the EA community, and the alignment community, and all that they’ve done to try to help with AI alignment. It seems like a lot of talent is feeding into tech companies. And there are minimal efforts right now to engage in actual policy and decision making at the government level, even for short term issues like disemployment and privacy and other things. AI alignment is happening now, it seems.

Jade: On the noise-to-signal point, I think one thing I’d like for people to be thinking about is that I’m pretty annoyed at this short term vs. long term bifurcation. And I think a fair number of people are. And the framing that I’ve tried on a little bit is more thinking about it in terms of stakes. So how high are the stakes for a particular application area, or a particular sort of manifestation of a risk or a concern?

And I think thinking about it in the stakes sense, as opposed to the timeline sense, helps me at least try to identify things that we currently call or label near term concerns, and try to filter the ones that are worth engaging in versus the ones that maybe we just don’t need to engage in at all. An example here is that basically I am trying to identify near term/existing concerns that I think could scale in stakes as AI becomes more advanced. And if those exist, then there’s really good reason to engage in them, for several reasons, right? One is this path dependency that I talked about before, so norms that you’re developing around, for example, privacy or surveillance. Those norms are going to stick, and the ways in which we decide we want to govern that, even with narrow technologies now, those are the ones we’re going to inherit, grandfather in, as we start to advance this technology space. And then I think you can also just get a fair amount of information about how we should be governing the more advanced versions of these risks or concerns if you engage earlier.

Even just off the top of my head, I can think of a couple which seem to have scalable stakes. So, for example, an existing conversation in the policy space is about this labor displacement problem and automation. And that’s a thing that people are freaking out about now, to the extent that you have litigation and bills and whatnot being passed, or at least being talked about. And you’ve got a number of people running on political platforms on the basis of that kind of issue. And that is an existing concern, given automation to date. But it’s also plausibly a huge concern as this stuff gets more advanced, to the point of economic singularity, if you wanted to use that term, where you’ve got vast changes in the structure of the labor market and the employment market, and you can have substantial transformative impacts on the ways in which humans engage and create economic value and production.

And so existing automation concerns can scale into large scale labor displacement concerns, can scale into pretty confusing philosophical questions about what it means to conduct oneself as a human in a world where you’re no longer needed in terms of employment. And so that’s an example of a conversation which I wish more people were engaged in right now.

Plausibly, another one would be privacy as well, because I think privacy is currently a very salient concern. But also, privacy is an example of one of the fundamental values that we are at risk of eroding if we continue to deploy technologies for other reasons: efficiency gains, or increasing control and centralizing of power. And privacy is this small microcosm of a maybe larger concern about how we could possibly be chipping away at these very fundamental things which we would want to preserve in the longer run, but are at risk of not preserving because we continue to operate in this dynamic of innovation and performance at whatever cost. Those are examples of existing conversations that I find it plausible we should be more engaged in, just because those are actually going to matter for the things that we call long term concerns, or the things that I would call sort of high stakes concerns.

Lucas: That makes sense. I think that trying on the stakes framing is helpful, and I can see why. It’s just a question about what are the things today, and within the next few years, that are likely to have a large effect on a larger end that we arrive at with transformative AI. So we’ve got this space of all these four cornerstones that you guys are exploring. Again, this has to do with the interplay and interdependency of technical AI safety, politics, policy of ideal governance, the economics, the military balance and struggle, and race dynamics all here with AI, on our path to AGI. So starting here with ideal governance, and we can see how we can move through these cornerstones, what is the process by which ideal governance is arrived at? How might this evolve over time as we get closer to superintelligence?

Jade: I have maybe a couple of thoughts, mostly about what I think a desirable process is that we should follow, or what kinds of desirable traits we want to have in the way that we get to ideal governance and in what ideal governance could plausibly look like. I think that’s the extent to which I have thoughts about it. And they’re quite obvious ones, I think. The governance literature has said a lot about what constitutes morally sound, politically sound, socially sound governance processes, or the design of governance processes.

So those are things like legitimacy and accountability and transparency. I think there are some interesting debates about how important certain goals are, either as end goals or as instrumental goals. So, for example, I’m not clear where my thinking is on how important inclusion and diversity are as we’re aiming for ideal governance, so I think that’s an open question, at least in my mind.

There are also things to think through around what’s unique to trying to aim for ideal governance for a transformative general purpose technology. We don’t have a very good track record of governing general purpose technologies at all. I think we have general purpose technologies that have integrated into society and have provided a lot of value. But that’s not because we’ve had good governance of them. I think we’ve been some combination of lucky and somewhat thoughtful sometimes, but not consistently so. If we’re staking the claim that AI could be a uniquely transformative technology, then we need to ensure that we’re thinking hard about the specific challenges that it poses. It’s a very fast-moving emerging technology. And governments historically have always been relatively slow at catching up. But you also have certain capabilities that you can realize by developing, for example, AGI or superintelligence, which governance frameworks or institutions have never had to deal with before. So thinking hard about what’s unique about this particular governance challenge, I think, is important.

Lucas: Seems like often, ideal governance is arrived at through massive suffering of previous political systems, like this form of ideal governance that the founding fathers of the United States came up with was sort of an expression of the suffering they experienced at the hands of the British. And so I guess if you track historically how we’ve shifted from feudalism and monarchy to democracy and capitalism and all these other things, it seems like governance is a large slowly reactive process born of revolution. Whereas, here, what we’re actually trying to do is have foresight and wisdom about what the world should look like, rather than trying to learn from some mistake or some un-ideal governance we generate through AI.

Jade: Yeah, and I think that’s also another big piece of it: another way of thinking about how to get to ideal governance is to aim for a period of time, or a state of the world, in which we can actually do the thinking well without a number of other distractions/concerns in the way. So, for example, conditions that we want to drive towards would mean getting rid of things like the current competitive environment that we have, which, for many reasons, some of which I mentioned earlier, is a bad thing, and is particularly counterproductive to giving us the kind of space and cooperative spirit and whatnot that we need to come to ideal governance. Because if you’re caught in this strategic competitive environment, then that makes a bunch of things just much harder to do in terms of aiming for coordination and cooperation and whatnot.

You also probably want better, more accurate information out there, hence being able to think harder by looking at better information. And so a lot of work can be done to encourage more accurate information to hold more weight in public discussions, and also to encourage an environment of genuine, epistemically healthy deliberation about that kind of information. All of what I’m saying is also not particularly unique, maybe, to ideal governance for AI. I think, in general, you can sometimes broaden this discussion to what it looks like to govern a global world relatively well. And AI is one of the particular challenges that are maybe forcing us to have some of these conversations. But in some ways, when you end up talking about governance, it ends up being relatively abstract in a way that, I think, isn’t specific to the technology. At least in some ways there are also particular challenges, I think, if you’re thinking specifically about superintelligence scenarios. But if you’re just talking about governance challenges in general, things like accurate information, more patience, a lack of competition and rivalrous dynamics and whatnot, that generally is kind of just helpful.

Lucas: So, I mean, arriving at ideal governance here, I’m just trying to model and think about it, and understand if there’s anything here that should be practiced differently, or if I’m just sort of slightly confused here. Generally, when I think about ideal governance, I see that it’s born of very basic values and principles. And I view these values and principles as coming from nature, like genetics and evolution instantiating certain biases and principles in people that tend to lead to cooperation, the conditioning of a culture, how we’re nurtured in our homes, and how our environment conditions us. And also, people update their values and principles as they live in the world and communicate with other people and engage in public discourse, and even in more foundational meta-ethical reasoning, or normative reasoning about what is valuable.

And historically, these sort of conversations haven’t mattered, or they don’t seem to matter, or they seem to just be things that people assume, and they don’t get that abstract or meta about their values and their views of value, and their views of ethics. It’s been said that, in some sense, on our path to superintelligence, we’re doing philosophy on a deadline, and that there are sort of deep and difficult questions about the nature of value, and how best to express value, and how to idealize ourselves as individuals and as a civilization.

So I guess I’m just throwing this all out there. Maybe we don’t necessarily have any concrete answers. But I’m just trying to think more about the kinds of practices and reasoning that should and can be expected to inform ideal governance. Should meta-ethics matter here, where it doesn’t seem to matter in public discourse? I still struggle with the tension between the ultimate value expression that might be happening through superintelligence and how our public discourse functions. I don’t know if you have any thoughts here.

Jade: No particular thoughts, aside from generally agreeing that I think meta-ethics is important. It is also confusing to me why public discourse doesn’t seem to track the things that seem important. This is probably something that we’ve struggled with and tried to address in various ways before, so I guess I’m always cognizant of trying to learn from ways in which we’ve tried to improve public discourse and tried to create spaces for this kind of conversation.

It’s a tricky one for sure, and thinking about better practices is probably the main way, at least, in which I engage with thinking about ideal governance. It’s often the case that people, when they look at the cluster of ideal governance work, think, “Oh, this is the thing that’s going to tell us what the answer is,” like what’s the constitution that we have to put in place, or whatever it is.

At least for me, the main chunk of thinking is mostly centered around process, and it’s mostly centered around what constitutes a productive, optimal process, and some ways of answering this pretty hard question. And how do you create the conditions in which you can engage with that process without being distracted or concerned about things like competition? Those are kind of the main ways in which it seems obvious that we can fix the current environment so that we’re better placed to answer what is a very hard question.

Lucas: What’s coming to mind here is also this feature that you pointed out, I believe, that ideal governance is not figuring everything out in terms of our values, but rather creating the kind of civilization and space in which we can take the time to figure out ideal governance. So maybe ideal governance is not solving ideal governance, but creating a space to solve ideal governance.

Usually, ideal governance has to do with modeling human psychology, and how best to get human beings to produce value and live together harmoniously. But when we introduce AI, and human beings become potentially obsolete, then ideal governance potentially becomes something else. And I wonder if, say, experimental cities with different laws, policies, and governing institutions might be helpful here.

Jade: Yeah, that’s an interesting thought. Another thought that came to mind as well, actually, is just kind of reflecting on how ill-equipped I feel thinking about this question. One funny trait of this field is that you have a slim number of philosophers, but especially in the AI strategy and safety space, it’s political scientists, international relations people, economists, engineers, and computer scientists thinking about questions that other spaces have tried to answer in different ways before.

So when you mention psychology, that’s an example. Obviously, philosophy has something to say about this. But there’s also a whole space of people who have thought about how we govern things well across a number of different domains, and how we do a bunch of coordination and cooperation better, and stuff like that. And so it makes me reflect on the fact that there could be things that we have already learned that we should be reflecting a little bit more on, which we currently just don’t have access to because we don’t necessarily have the right people or the right domains of knowledge in this space.

Lucas: Like AI alignment has been attracting a certain crowd of researchers, and so we miss out on some of the insights that, say, psychologists might have about ideal governance.

Jade: Exactly, yeah.

Lucas: So moving along here from ideal governance, assuming we can agree on what ideal governance is, or if we can come to a place where civilization is stable and out of existential risk territory, and where we can sit down and actually talk about ideal governance, how do we begin to think about how to contribute to AI governance through working with or in private companies and/or government?

Jade: This is a good, and quite large, question. I think there are a couple of main ways in which I think about productive actions that either companies or governments can take, or productive things we can do with both of these actors to make them more inclined to do good things. On the point of companies, the primary thing I think is important to work on, at least concretely in the near term, is to do something like establish the norm and expectation that, as developers of this important technology that will plausibly have a large impact on the world, they have a very large responsibility proportional to their ability to impact the development of this technology. By making the responsibility something that is tied to their ability to shape this technology, I think that, as a foundational premise or a foundational axiom to hold about why private companies are important, that can get us a lot of relatively concrete things that we should be thinking about doing.

The simple way of saying it is something like: if you are developing the thing, you’re responsible for thinking about how that thing is going to affect the world. And establishing that, I think, is a somewhat obvious thing. But it’s definitely not how the private sector operates at the moment, in that there is an assumed limited responsibility irrespective of how your stuff is deployed in the world. What that actually means can be relatively concrete: just looking at what these labs, or what these firms, have the ability to influence, and trying to understand how you want to change it.

So, for example, internal company policy on things like what kind of research is done and invested in, how you allocate resources across, for example, safety and capabilities research, what particular publishing norms you have, and considerations around risks or benefits. Those are very concrete internal company policies that can be adjusted and shifted based on one’s idea of what they’re responsible for. The broad thing, I think, is to try to steer them in this direction of embracing, acknowledging, and then living up to this greater responsibility, as an entity that is responsible for developing the thing.

Lucas: How would we concretely change the incentive structure of a company that’s interested in maximizing profit towards this increased responsibility, say, in the domains that you just enumerated?

Jade: This is definitely probably one of the hardest things about translating this claim into practice. I mean, it’s not the first time we’ve been somewhat upset at companies for doing things that society doesn’t agree with. We don’t have a great track record of changing the way that industries or companies work. That being said, I think if you’re outside of the company, there are particular levers that one can pull that can influence the way that a company is incentivized. And then I think we’ve also got examples of us being able to use these levers well.

For one, companies are constrained by the environment that a government creates, and governments also have the threat of things like regulation, or the threat of being able to pass certain laws or whatnot. Actually, the mere threat, historically, has done a fair amount in terms of incentivizing companies to just step up their game, because they don’t want regulation to kick in that isn’t conducive to what they want to do, for example.

Users of the technology are a pretty classic one. It’s a pretty inefficient one, I think, because you’ve got to coordinate many, many different types of users, and actors, and consumers and whatnot, to have an impact on what companies are incentivized to do. But you have seen environmental practices in other types of industries that have been put in place as standards or expectations that companies should abide by, because consumers, across a long period of time, have been able to say, “I disagree with this particular practice.” That’s an example of a trend that has succeeded.

Lucas: That would be like boycotting or divestment.

Jade: Yeah, exactly. And maybe a slightly more efficient one is focusing on things like researchers and employees. That is, if you are a researcher, if you’re an employee, you have levers over the employer that you work for. They need you, and you need them, and there’s that kind of dependency in that relationship. This is all a long way of saying that I think, yes, I agree it’s hard to change incentive structures of any industry, and maybe specifically so in this case because they’re very large. But I don’t think it’s impossible. And I think we need to think harder about how to use those well. I think the other thing that’s working in our favor in this particular case is that we have a unique set of founders or leaders of these labs or companies that have expressed pretty genuine sounding commitments to safety and to cooperativeness, and to serving the common good. It’s not a very robust strategy to rely on certain founders just being good people. But I think in this case, it’s kind of working in our favor.

Lucas: For now, yeah. There are probably already other interest groups who are less careful, who are actually just making policy recommendations right now, and we’re broadly not in on the conversation due to the way that we think about the issue. So in terms of government, what should we be doing? Yeah, it seems like there’s just not much happening.

Jade: Yeah. So I agree there isn’t much happening, at least relative to how much work we’re putting into trying to understand and engage with private labs. There isn’t much happening with government. So I think there needs to be more thought put into how we do that piece of engagement. One good thing that we could be trying to encourage more governments to do is investing in productive relationships with the technical community, and productive relationships with the researcher community, and with companies as well. At least in the US, it’s pretty adversarial between Silicon Valley firms and DC.

And that isn’t good for a number of reasons. One very obvious reason is that there isn’t common information or a common understanding of what’s going on, what the risks are, what the capabilities are, et cetera. One of the main critiques of governments is that they’re ill-equipped, in terms of access to knowledge and access to expertise, to be able to appropriately design things like bills, or pieces of legislation or whatnot. And I think that’s also something that governments should take responsibility for addressing.

So those are kind of low hanging fruit. There’s a really tricky balance that I think governments will need to strike, which is the balance around avoiding over-hasty, ill-informed regulation. A lot of my work looking at history shows that the main way in which we’ve achieved substantial regulation is as a result of big public, largely negative events to do with the technology screwing something up, or the technology causing a lot of fear, for whatever reasons. And so there’s a very sharp spike in public fear or public concern, and then the government kicks into gear. And I think that’s not a good dynamic in terms of forming nuanced, well-considered regulation and governance norms. Avoiding that outcome is important, but it’s also important that governments do engage and track how this is going, and particularly track where things like company policy and industry-wide efforts are not going to be sufficient. So when do you start translating some of the more soft law, if you will, into actual hard law?

That will be a very tricky timing question, I think, for governments to grapple with. But ultimately, it’s not sufficient to have companies governing themselves. You’ll need to be able to concretize it into government-backed efforts and initiatives and legislation and bills. My strong intuition is that it’s not quite the right time to roll out object level policies. And so the main task for governments will be just to position themselves to do that well when the time is right.

Lucas: So what’s coming to my mind here is YouTube compilations of members of the United States Congress and senators asking horrible questions of Mark Zuckerberg and the CEO of, say, Google. They just don’t understand the issues. The United States is currently not really thinking that much about AI, and especially transformative AI. Whereas China, it seems, has taken a step in this direction and is making massive governmental investments. So what can we say about this seeming difference? And the question is, what are governments to do in this space? Different governments are paying attention at different levels.

Jade: Some governments are more technologically savvy than others, for one. So I’d push back on the US not … They’re paying attention to different things. So, for example, the Department of Commerce put out a notice to the public indicating that they’re exploring putting in place export controls on a cluster of emerging technologies, including a fair number of AI-relevant technologies. The point of export controls is to do something like ensure that adversaries don’t get access to critical technologies which, if they did, could undermine national security and/or the domestic industrial base. The reason why export controls are concerning is because they’re a relatively outdated tool. They used to work relatively well when you were targeting specific kinds of weapons technologies, or basically things that you could touch and see, and the US restricting them from being on the market meant that a fair amount of them couldn’t be accessed by other folks around the world. And you’ve seen export controls become increasingly less effective the more that we’ve tried to apply them to things like cryptography, where it’s largely software based. And so trying to use export controls, which are applied at the national border, is a very tricky thing to make effective.

So you have the US paying attention to the fact that they think AI is a national security concern, at least in this respect, enough to indicate that they’re interested in exploring export controls. I think it’s unlikely that export controls are going to be effective at achieving the goals that the US wants to pursue. But I think export controls are also indicative of a world that we don’t want to slide into, which is a world where you have rivalrous economic blocs, where you’re sort of protecting your own base, and you’re not contributing to the kind of global commons of progressing this technology.

Maybe it goes back to what we were saying before, in that if you’re not engaged in the governance, the governance is going to happen anyway. This is an example of activity that is going to happen anyway. I think people assume now, probably rightfully so, that the US government is not going to be very effective because they are not technically literate. In general, they are sort of relatively slow moving. They’ve got a bunch of other problems that they need to think about, et cetera. But I don’t think it’s going to take very, very long for the US government to start to seriously engage. I think the thing that is worth trying to influence is what they do when they start to engage.

If I had a policy in mind that I thought was robustly good that the US government should pass, then that would be the more proactive approach. It seems possible that if we think about this hard enough, there could be robustly good things that the US government could do, that could be good to be proactive about.

Lucas: Okay, so there’s this sort of general sense that we’re pretty heavy on academic papers because we’re really trying to understand the problem, and the problem is so difficult, and we’re trying to be careful and sure about how we progress. And it seems like it’s not clear if there is much room, currently, for direct action, given our uncertainty about specific policy implementations. There are some shorter term issues. And sorry to say shorter term issues. But, by that, I mean automation and maybe lethal autonomous weapons and privacy. These things we have a clearer sense of, at least in terms of potential things that we can start doing. So I’m just trying to get a sense here from you: on top of these efforts to try to understand the issues more, and on top of the efforts that, for example, 80,000 Hours has contributed by working to place aligned persons in various private organizations, what else can we be doing? What would you like to see more being done on here?

Jade: I think this is on top of just more research. That would be the first thing that comes to mind: people thinking hard about this seems like a thing that I want a lot more of, in general. But on top of that, what you mentioned, the placing of people, maybe fits into this broader category of things that seem good to do, which is investing in building our capacity to influence the future. That’s quite a general statement. But it takes a fair amount of time to build up influence, particularly in certain institutions, like governments, like international institutions, et cetera. And so investing in that early seems good. And doing things like trying to encourage value aligned, sensible people to climb the ladders that they need to climb in order to get to positions of influence, that generally seems like a good and useful thing.

The other thing that comes to mind as well is putting out more accurate information. One specific version of things that we could do here is, there are currently a fair number of inaccurate, or not well justified, memes that are floating around that are informing the way that people think. For example, that the US and China are in a race. Or a more nuanced one is something like: inevitably, you're going to have a safety-performance trade-off. And those are not great memes, in the sense that they don't seem to be conclusively true. But they're also not great in that they put you in a position of concluding something like, "Oh, well, if I'm going to invest in safety, I've got to be an altruist, or I'm going to trade off my competitive advantage."

And so identifying what those bad ones are, and countering those, is one thing to do. Better memes could be something like: those who are developing this technology are responsible for thinking through its consequences. Or something even as simple as: governance doesn't mean government, and it doesn't mean regulation. Because I think you've got a lot of firms who are terrified of regulation, and so they won't engage in this governance conversation because of it. So there could be some really simple things we could do, just to make the public discourse both more accurate and more conducive to things being done that are good in the future.

Lucas: Yeah, here I'm also just seeing the tension between the appropriate kinds of memes that inspire, I guess, a lot of the thinking within the AI alignment community and the x-risk community, versus what is actually useful or spreadable for the general public, adding in here the ways in which accurate information can be info-hazardy. I think broadly our community holds the common good principle, and the idea of building an awesome future for all sentient creatures, and I am curious to know how spreadable those memes are.

Jade: Yeah, the spreadability of memes is a thing that I want someone to investigate more. The things that make things not spreadable, for example, are things that are, at a very simple level, quite complicated to explain, or are somewhat counterintuitive so you can't pump the intuition very easily. Particularly things that require you to trade off one set of values that you care about against another set of values. Any set of things that pits nationalism against cosmopolitanism, I think, is a tricky one, because you have some subset of people, the ones that you and I talk to the most, who are very cosmopolitan. But you also have a fair amount of people who care about the common good principle, in some sense, but also care about their nation in a fairly large sense as well.

So there are things that make certain memes less good or less spreadable. And one key thing will be to figure out which ones are actually good in the true sense, and good in the pragmatic to spread sense.

Lucas: Maybe there’s a sort of research program here, where psychologists and researchers can explore focus groups on the best spreadable memes, which reflect a lot of the core and most important values that we see within AI alignment, and EA, and x-risk.

Jade: Yeah, that could be an interesting project. I think also in AI safety, or in the AI alignment space, people are framing safety in quite different ways. One framing is that thinking about safety is part of what it means to be a good AI researcher. That's an example of one that I've seen take off a little bit more lately, because it's an explicit attempt to mainstream the thing. That's a meme, or an example of a framing, or whatever you want to call it. And you know there are pros and cons of that. The pro would be, plausibly, that it's just more mainstream. And I think you've seen evidence of that being the case, because more people are inclined to say, "Yeah, I agree. I don't want to build a thing that kills me if I want it to get coffee." But you're not going to have a lot of conversations about the magnitude of risks that you actually care about. So that's maybe a con.

There’s maybe a bunch of stuff to do in this general space of thinking about how to better frame the kind of public facing narratives of some of these issues. Realistically, memes are going to fill the space. People are going to talk about it in certain ways. You might as well try to make it better, if it’s going to happen.

Lucas: Yeah, I really like that. That’s a very good point. So let’s talk here a little bit about technical AI alignment. So in technical AI alignment, the primary concerns are around the difficulty of specifying what humans actually care about. So this is like capturing human values and aligning with our preferences and goals, and what idealized versions of us might want. So, so much of AI governance is thus about ensuring that this AI alignment process we engage in doesn’t skip too many corners. The purpose of AI governance is to decrease risks, to increase coordination, and to do all of these other things to ensure that, say, the benefits of AI are spread widely and robustly, that we don’t get locked into any negative governance systems or value systems, and that this process of bringing AIs in alignment with the good doesn’t have researchers, or companies, or governments skipping too many corners on safety. In this context, and this interplay between governance and AI alignment, how much of a concern are malicious use cases relative to the AI alignment concerns within the context of AI governance?

Jade: That’s a hard one to answer, both because there is a fair amount of uncertainty around how you discuss the scale of the thing. But also because I think there are some interesting interactions between these two problems. For example, if you’re talking about how AI alignment interacts with this AI governance problem. You mentioned before AI alignment research is, in some ways, contingent on other things going well. I generally agree with that.

For example, it depends on AI safety taking place in research cultures and important labs. It requires institutional buy-in and coordination between institutions. It requires this mitigation of race dynamics so that you can actually allocate resources towards AI alignment research. All those things. And so in some ways, that particular problem being solved is contingent on us doing AI governance well. But then, also to the point of how big is malicious use risk relative to AI alignment, I think in some ways that’s hard to answer. But in some ideal world, you could sequence the problems that you could solve. If you solve the AI alignment problem first, then AI governance research basically becomes a much narrower space, addressing how an aligned AI could still cause problems because we’re not thinking about the concentration of power, the concentration of economic gains. And so you need to think about things like the windfall clause, to distribute that, or whatever it is. And you also need to think about the transition to creating an aligned AI, and what could be messy in that transition, how you avoid public backlash so that you can actually see the fruits of you having solved this AI alignment problem.

So that becomes more the kind of nature of the thing that AI governance research becomes, if you assume that you’ve solved the AI alignment problem. But if we assume that, in some world, it’s not that easy to solve, and both problems are hard, then I think there’s this interaction between the two. In some ways, it becomes harder. In some ways, they’re dependent. In some ways, it becomes easier if you solve bits of one problem.

Lucas: I generally model the risks of malicious use cases as being less than the AI alignment stuff.

Jade: I mean, I’m not sure I agree with that. But two things I could say to that. I think, one, intuition is something like you have to be a pretty awful person to really want to use a very powerful system to cause terrible ends. And it seems more plausible that people will just do it by accident, or unintentionally, or inadvertently.

Lucas: Or because the incentive structures aren’t aligned, and then we race.

Jade: Yeah. And then the other way to support this claim is, if you look at biotechnology and bio-weapons specifically, bio-security/bio-terrorism issues, the malicious use equivalent, those have been far less frequent compared to bio-safety issues, which are the equivalent of accident risks. So people causing unintentional harm because we aren't treating biotechnology safely has caused a lot more problems, at least in terms of frequency, compared to people actually trying to use it for terrible ends.

Lucas: Right, but don’t we have to be careful here with the strategic properties and capabilities of the technology, especially in the context in which it exists? Because there’s nuclear weapons, which are sort of the larger more absolute power imbuing technology. There has been less of a need for people to take bio-weapons to that level. You know? And also there’s going to be limits, like with nuclear weapons, on the ability of a rogue actor to manufacture really effective bio-weapons without a large production facility or team of research scientists.

Jade: For sure, yeah. And there’s a number of those considerations, I think, to bear in mind. So it definitely isn’t the case that you haven’t seen malicious use in bio strictly because people haven’t wanted to do it. There’s a bunch of things like accessibility problems, and tacit knowledge that’s required, and those kinds of things.

Lucas: Then let’s go ahead and abstract away malicious use cases, and just think about technical AI alignment, and then AI/AGI governance. How do you see the relative importance of AI and AGI governance, and the process of AI alignment that we’re undertaking? Is solving AI governance potentially a bigger problem than AI alignment research, since AI alignment research will require the appropriate political context to succeed? On our path to AGI, we’ll need to mitigate a lot of the race conditions and increase coordination. And then even after we reach AGI, the AI governance problem will continue, as we sort of explored earlier that we need to be able to maintain a space in which humanity, AIs, and all earth originating sentient creatures are able to idealize harmoniously and in unity.

Jade: I don't think it's possible to actually assess them at this point, in terms of how much we understand this problem. I have a bias towards saying that AI governance is the harder problem because I'm embedded in it and see it a lot more. And maybe ways to support that claim are things we've talked about. So AI alignment going well, or happening at all, is contingent on a number of other factors that AI governance is trying to solve; the social, political, and economic context needs to be right in order for that to actually happen, and then in order for that to have an impact.

There are some interesting things that are made easier by AI alignment being solved, or somewhat solved, if you are thinking about the AI governance problem. In fact, there's just a general cluster here: AI being safer and more robust and more transparent, or whatever, makes certain AI governance challenges easier. The really obvious example that comes to mind is the verification problem. The inability to verify what certain systems are designed to do and will do causes a bunch of governance problems. Arms control agreements are very hard. Establishing trust between parties to cooperate and coordinate is very hard.

If you happen to be able to solve some of those problems in the process of trying to tackle this AI alignment problem, that makes AI governance a little bit easier. I'm not sure which direction it cashes out, in terms of which problem is more important. I'm certain that there are interactions between the two, and I'm pretty certain that one depends on the other, to some extent. So it becomes really hard to govern the thing if you can't align the thing. But it also is probably the case that by solving some of the problems in one domain, you can help make the other problem a little bit more tractable and easier.

Lucas: So now I’d like to get into lethal autonomous weapons. And we can go ahead and add whatever caveats are appropriate here. So in terms of lethal autonomous weapons, some people think that there are major stakes here. Lethal autonomous weapons are a major AI enabled technology that’s likely to come on the stage soon, as we make some moderate improvements to already existing technology, and then package it all together into the form of a lethal autonomous weapon. Some take the view that this is a crucial moment, or that there are high stakes here to get such weapons banned. The thinking here might be that by demarcating unacceptable uses of AI technology, such as for autonomously killing people, and by showing that we are capable of coordinating on this large and initial AI issue, that we might be taking the first steps in AI alignment, and the first steps in demonstrating our ability to take the technology and its consequences seriously.

And so we mentioned earlier how there's been a lot of thinking, but not much action. This seems to be an initial place where we can take action. We don't need to keep delaying our direct action and real world participation. So if we can't get a ban on autonomous weapons, maybe it would seem that we have less hope for coordinating on more difficult issues. And so lethal autonomous weapons may exacerbate global conflict by increasing skirmishing at borders, decreasing the cost of war, dehumanizing killing, taking the human element out of death, et cetera.

And other people disagree with this. Other people might argue that banning lethal autonomous weapons isn't necessary in the long game. It's not, as we're framing it, a high stakes thing, because this developmental step in the technology is not really crucial for coordination or for political-military stability, or because coordination later would be born of other things, and this would just be some other new military technology without much impact. So I'm curious to gather what your views, or the views of FHI, or the Center for the Governance of AI, might be on autonomous weapons. Should there be a ban? Should the AI alignment community be doing more about this? And if not, why?

Jade: In terms of caveats, I’ve got a lot of them. So I think the first one is that I’ve not read up on this issue at all, followed it very loosely, but not nearly closely enough, that I feel like I have a confident well-informed opinion.

Lucas: Can I ask why?

Jade: Mostly because of bandwidth issues. It's not because I have categorized it ahead of time as something not worth engaging in. I'm actually pretty uncertain about that. The second caveat is, I definitely don't claim to speak on behalf of anyone but myself in this case. The Center for the Governance of AI doesn't have a particular position on this, nor does FHI.

Lucas: Would you say that this is because, for the Center for the Governance of AI, it's bandwidth issues again? Or would it be because it's been de-prioritized?

Jade: The main thing is bandwidth. Also, I think the main reason why it's probably been de-prioritized, at least subconsciously, has been the framing of focusing on things that are neglected by folks around the world. It seems like there are people, at least with somewhat good intentions, tentatively engaged in the LAWS (lethal autonomous weapons) discussion. And so within that frame, I think it's been de-prioritized because it's not obviously neglected compared to other things that aren't getting any focus at all.

With those things in mind, I could see a pretty decent case for investing more effort in engaging in this discussion, at least compared to what we currently do. I guess it's hard to tell, compared to alternatives of how we could be spending those resources, given it's such a resource-constrained space, in terms of people working in AI alignment, or just bandwidth in this community in general. So briefly, I think we've talked about this idea that there's a fair amount of path dependency in the way that institutions and norms are built up. And if this is one of the first spaces, with respect to AI capabilities, where we're going to be getting or driving towards some attempt at international norms, or establishing international institutions that could govern this space, then that's going to be relevant in a general sense. And specifically, it's going to be relevant for defense and security related concerns in the AI space.

And so I think you both want to engage because there’s an opportunity to seed desirable norms and practices and process and information. But you also possibly want to engage because there could be a risk that bad norms are established. And so it’s important to engage, to prevent it going down something which is not a good path in terms of this path dependency.

Another reason that is maybe worth thinking through, in terms of making a case for engaging more, is that applications of AI in the military and defense spaces are possibly among the most likely to cause substantial disruption in the near-ish future, and could be an example of what I'd call high stakes concerns in the future. And you can talk about AI and its impact on various aspects of the military domain where it could pose substantial risks. So, for example, in cyber escalation, or destabilizing nuclear security. Those would be examples where military and AI come together and you can have bad outcomes that we do actually really care about. And so for the same reason, engaging specifically in any discussion that is touching on military and AI concerns could be important.

And then the last one that comes to mind is the one that you mentioned. This is an opportunity to basically practice doing this coordination thing. And there are various things that are worth practicing or attempting. For one, I think even just observing how these discussions pan out is going to tell you a fair amount about how important actors think about the trade-offs of using AI versus going towards safer outcomes or governance processes. And then our ability to corral interest around good values or appropriate norms, or whatnot, is a good test of our ability to generally coordinate when we have some of those trade-offs around, for example, military advantage versus safety. It gives you some insight into how we could be dealing with similarly shaped issues.

Lucas: All right. So let’s go ahead and bring it back here to concrete actionable real world things today, and understanding what’s actually going on outside of the abstract thinking. So I’m curious to know here more about private companies. At least, to me, they largely seem to be agents of capitalism, like we said. They have a bottom line that they’re trying to meet. And they’re not ultimately aligned with pro-social outcomes. They’re not necessarily committed to ideal governance, but perhaps forms of governance which best serve them. And as we sort of feed aligned people into tech companies, how should we be thinking about their goals, modulating their incentives? What does DeepMind really want? Or what can we realistically expect from key players? And what mechanisms, in addition to the windfall clause, can we use to sort of curb the worst aspects of profit-driven private companies?

Jade: If I knew what DeepMind actually wanted, or what Google actually thought, we’d be in a pretty different place. So a fair amount of what we’ve chatted through, I would echo again. So I think there’s both the importance of realizing that they’re not completely divorced from other people influencing them, or other actors influencing them. And so just thinking hard about which levers are in place already that actually constrain the action of companies, is a pretty good place to start, in terms of thinking about how you can have an impact on their activities.

There's this common way of talking about big tech companies, which is that they can do whatever they want, and they run the world, and we've got no way of controlling them. The reality is that they are consistently constrained by a fair number of things, because they are agents of capitalism, as you described, and because they have to respond to various things within that system. So we've mentioned things before, like governments have levers, consumers have levers, employees have levers. And so I think focusing on what those are is a good place to start. Another thing that comes to mind is, there's something here around taking a very optimistic view of how companies could behave. Or at least this is the way that I prefer to think about it: you both need to be excited, and motivated, and think that companies can change, and create the conditions in which they can. But one also then needs to have a kind of healthy cynicism, in some ways.

On both of these, I think the first one, I really want the public discourse to turn more towards the direction of, if we assume that companies want to have the option of demonstrating pro-social incentives, then we should do things like ensure that the market rewards them for acting in pro-social ways, instead of penalizing their attempts at doing so, instead of critiquing every action that they take. So, for example, I think we should be making bigger deals, basically, of when companies are trying to do things that at least will look like them moving in the right direction, as opposed to immediately critiquing them as ethics washing, or sort of just paying lip service to the thing. I want there to be more of an environment where, if you are a company, or you’re a head of a company, if you’re genuinely well-intentioned, you feel like your efforts will be rewarded, because that’s how incentive structures work, right?

And then on the second point, in terms of being realistic about the fact that you can't just wish companies into being good, that's where I think public institutions and civil society groups become important. So ensuring that there are consistent forms of pressure, and making sure both that companies feel like their actions are being rewarded if pro-social, and that there are ways of spotting when they're speaking as if they're pro-social but acting differently.

So I think everyone’s kind of basically got a responsibility here, to ensure that this goes forward in some kind of productive direction. I think it’s hard. And we said before, you know, some industries have changed in the past successfully. But that’s always been hard, and long, and messy, and whatnot. But yeah, I do think it’s probably more tractable than the average person would think, in terms of influencing these companies to move in directions that are generally just a little bit more socially beneficial.

Lucas: Yeah. I mean, also, the companies are generally made up of fairly reasonable, well-intentioned people. I'm not all pessimistic. There are just a lot of people who sit at desks and have their structure. So yeah, thank you so much for coming on, Jade. It's really been a pleasure. And, yeah.

Jade: Likewise.

Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

End of recorded material

AI Alignment Podcast: On Consciousness, Qualia, and Meaning with Mike Johnson and Andrés Gómez Emilsson

Consciousness is a concept at the forefront of much scientific and philosophical thinking. At the same time, there is large disagreement over what consciousness exactly is, and whether it can be fully captured by science or is best explained away by a reductionist understanding. Some believe consciousness to be the source of all value, and others take it to be a kind of delusion or confusion generated by algorithms in the brain. The Qualia Research Institute takes consciousness to be something substantial and real in the world that they expect can be captured by the language and tools of science and mathematics. To understand this position, we will have to unpack the philosophical motivations which inform this view, the intuition pumps which lend themselves to these motivations, and then explore the scientific process of investigation which is born of these considerations. Whether you take consciousness to be something real or illusory, these possibilities have tremendous moral and empirical implications for life's purpose and role in the universe. Is existence without consciousness meaningful?

In this podcast, Lucas spoke with Mike Johnson and Andrés Gómez Emilsson of the Qualia Research Institute. Andrés is a consciousness researcher at QRI and is also the Co-founder and President of the Stanford Transhumanist Association. He has a Master’s in Computational Psychology from Stanford. Mike is Executive Director at QRI and is also a co-founder. Mike is interested in neuroscience, philosophy of mind, and complexity theory.

Topics discussed in this episode include:

  • Functionalism and qualia realism
  • Views that are skeptical of consciousness
  • What we mean by consciousness
  • Consciousness and causality
  • Marr’s levels of analysis
  • Core problem areas in thinking about consciousness
  • The Symmetry Theory of Valence
  • AI alignment and consciousness

You can take a short (3 minute) survey to share your feedback about the podcast here.

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, iTunes, Google Play, Stitcher, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

You can learn more about consciousness research at the Qualia Research Institute, Mike's blog, and Andrés' blog. You can listen to the podcast above or read the transcript below. Thanks to Ian Rusconi for production and edits, as well as Scott Hirsh for feedback.

Lucas: Hey, everyone. Welcome back to the AI Alignment Podcast. I'm Lucas Perry, and today we'll be speaking with Andrés Gómez Emilsson and Mike Johnson from the Qualia Research Institute. In this episode, we discuss the Qualia Research Institute's mission and core philosophy. We get into the differences between and arguments for and against functionalism and qualia realism. We discuss definitions of consciousness, how consciousness might be causal, we explore Marr's Levels of Analysis, and we discuss the Symmetry Theory of Valence. We also get into identity and consciousness and the world, the is-ought problem, and what this all means for AI alignment and building beautiful futures.

And then end on some fun bits, exploring the potentially large amounts of qualia hidden away in cosmological events, and whether or not our universe is something more like heaven or hell. And remember, if you find this podcast interesting or useful, remember to like, comment, subscribe, and follow us on your preferred listening platform. You can continue to help make this podcast better by participating in a very short survey linked in the description of wherever you might find this podcast. It really helps. Andrés is a consciousness researcher at QRI and is also the Co-founder and President of the Stanford Transhumanist Association. He has a Master’s in Computational Psychology from Stanford. Mike is Executive Director at QRI and is also a co-founder.

He is interested in neuroscience, philosophy of mind, and complexity theory. And so, without further ado, I give you Mike Johnson and Andrés Gomez Emilsson. So, Mike and Andrés, thank you so much for coming on. Really excited about this conversation and there’s definitely a ton for us to get into here.

Andrés: Thank you so much for having us. It’s a pleasure.

Mike: Yeah, glad to be here.

Lucas: Let’s start off just talking to provide some background about the Qualia Research Institute. If you guys could explain a little bit, your perspective of the mission and base philosophy and vision that you guys have at QRI. If you could share that, that would be great.

Andrés: Yeah, for sure. I think one important point is there’s some people that think that really what matters might have to do with performing particular types of algorithms, or achieving external goals in the world. Broadly speaking, we tend to focus on experience as the source of value, and if you assume that experience is a source of value, then really mapping out what is the set of possible experiences, what are their computational properties, and above all, how good or bad they feel seems like an ethical and theoretical priority to actually make progress on how to systematically figure out what it is that we should be doing.

Mike: I’ll just add to that, this thing called consciousness seems pretty confusing and strange. We think of it as pre-paradigmatic, much like alchemy. Our vision for what we’re doing is to systematize it and to do to consciousness research what chemistry did to alchemy.

Lucas: To sort of summarize this, you guys are attempting to be very clear about phenomenology. You want to provide a formal structure for understanding and also being able to infer phenomenological states in people. So you guys are realists about consciousness?

Mike: Yes, absolutely.

Lucas: Let’s go ahead and lay some conceptual foundations. On your website, you guys describe QRI’s full stack, so the kinds of metaphysical and philosophical assumptions that you guys are holding to while you’re on this endeavor to mathematically capture consciousness.

Mike: I would say ‘full stack’ talks about how we do philosophy of mind, we do neuroscience, and we're just getting into neurotechnology, with the thought that, yeah, if you have a better theory of consciousness, you should be able to have a better theory about the brain. And if you have a better theory about the brain, you should be able to build cooler stuff than you could otherwise. But starting with the philosophy, there's this conception of qualia formalism: the idea that phenomenology can be precisely represented mathematically. We borrow that goal from Giulio Tononi's IIT. We don't necessarily agree with the specific math involved, but the goal of constructing a mathematical object that is isomorphic to a system's phenomenology would be the correct approach if you want to formalize phenomenology.

And then from there, one of the big questions in how you even start is, what’s the simplest starting point? And here, I think one of our big innovations that is not seen at any other research group is we’ve started with emotional valence and pleasure. We think these are not only very ethically important, but also just literally the easiest place to start reverse engineering.

Lucas: Right, and so this view is also colored by physicalism and qualia structuralism and valence realism. Could you explain some of those things in a non-jargony way?

Mike: Sure. Qualia formalism is this idea that math is the right language to talk about qualia in, and that we can get a precise answer. This is another way of saying that we're realists about consciousness, much as people can be realists about electromagnetism. We're also valence realists. This refers to how we believe emotional valence, or pain and pleasure, the goodness or badness of an experience, is a natural kind. This concept carves reality at the joints. We have some further thoughts on how to define this mathematically as well.

Lucas: So you guys are physicalists, so you think that basically the causal structure of the world is best understood by physics and that consciousness was always part of the game engine of the universe from the beginning. Ontologically, it was basic and always there in the same sense that the other forces of nature were already in the game engine since the beginning?

Mike: Yeah, I would say so. I personally like the frame of dual aspect monism, but I would also step back a little bit and say there’s two attractors in this discussion. One is the physicalist attractor, and that’s QRI. Another would be the functionalist/computationalist attractor. I think a lot of AI researchers are in this attractor and this is a pretty deep question of, if we want to try to understand what value is, or what’s really going on, or if we want to try to reverse engineer phenomenology, do we pay attention to bits or atoms? What’s more real; bits or atoms?

Lucas: That’s an excellent question. Scientific reductionism here I think is very interesting. Could you guys go ahead and unpack though the skeptics position of your view and broadly adjudicate the merits of each view?

Andrés: Maybe a really important frame here is called Marr's Levels of Analysis. David Marr was a cognitive scientist who wrote a really influential book in the '80s called Vision, where he basically creates a schema for how to understand knowledge about, in this particular case, how you actually make sense of the world visually. The framework goes as follows: you have three ways in which you can describe an information processing system. First of all, the computational/behavioral level. What that is about is understanding the input-output mapping of an information processing system. Part of it is also understanding the run-time complexity of the system and under what conditions it's able to perform its actions. An analogy here would be an abacus, for example.

On the computational/behavioral level, what an abacus can do is add, subtract, multiply, and divide, and if you're really creative you can also exponentiate and do other interesting things. Then you have the algorithmic level of analysis, which is a little bit more detailed, and in a sense more constrained. What the algorithmic level of analysis is about is figuring out what the internal representations are, and the possible manipulations of those representations, such that you get the input-output mapping described by the first layer. Here you have an interesting relationship, where understanding the first layer doesn't fully constrain the second one. That is to say, there are many systems that have the same input-output mapping but that under the hood use different algorithms.

In the case of the abacus, an algorithm might be something like: whenever you want to add a number, you push a bead; whenever you're done with a row, you push all of the beads back and then you add a bead in the row underneath. And finally, you have the implementation level of analysis, and that is: what is the system actually made of? How is it constructed? All of these different levels ultimately also map onto different theories of consciousness, and that is basically where in the stack you associate consciousness, or being, or "what matters". So, for example, behaviorists in the '50s may have associated consciousness, if they gave any credibility to that term, with the behavioral level. They don't really care what's happening inside as long as you have an extended pattern of reinforcement learning over many iterations.

What matters is basically how you're behaving, and that's the crux of who you are. A functionalist will actually care about what algorithms you're running, how it is that you're actually transforming the input into the output. Functionalists generally do care about, for example, brain imaging; they do care about the high level algorithms that the brain is running, and generally will be very interested in figuring out these algorithms and generalizing them in fields like machine learning and digital neural networks and so on. A physicalist associates consciousness with the implementation level of analysis: how the system is physically constructed has bearing on what it is like to be that system.
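
To make concrete the point that the computational/behavioral level underdetermines the algorithmic level, here is a minimal Python sketch (an illustration not taken from the episode; the function names and the bead-tally representation are assumptions chosen only for the example). The two functions share the same input-output mapping, so they are indistinguishable at the computational/behavioral level, while differing at the algorithmic level.

```python
# Two "adders" that are identical at Marr's computational/behavioral level
# (same input-output mapping) but differ at the algorithmic level
# (different internal representations and manipulations).

def add_direct(a: int, b: int) -> int:
    # Algorithm 1: use the machine's built-in arithmetic directly.
    return a + b

def add_abacus_style(a: int, b: int) -> int:
    # Algorithm 2: represent each number as a tally of "beads" and merge
    # the tallies one bead at a time, loosely like pushing beads on an abacus.
    beads = ["bead"] * a
    for _ in range(b):
        beads.append("bead")
    return len(beads)

# Behaviorally indistinguishable over these inputs...
assert all(add_direct(a, b) == add_abacus_style(a, b)
           for a in range(20) for b in range(20))
# ...yet the internal representations (native integers vs. a list of tokens)
# differ, which is exactly the distinction the algorithmic level captures.
```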

Lucas: So, you guys haven’t said that this was your favorite approach, but if people are familiar with David Chalmers, these seem to be the easy problems, right? And functionalists are interested in just the easy problems and some of them will actually just try to explain consciousness away, right?

Mike: Yeah, I would say so. And I think to try to condense some of the criticism we have of functionalism, I would claim that it looks like a theory of consciousness and can feel like a theory of consciousness, but it may not actually do what we need a theory of consciousness to do; specify which exact phenomenological states are present.

Lucas: Is there not some conceptual partitioning that we need to do between functionalists who believe in qualia or consciousness, and those that are illusionists or want to explain it away or think that it’s a myth?

Mike: I think that there is that partition, and I guess there is a question of how principled the partition you can be, or whether if you chase the ideas down as far as you can, the partition collapses. Either consciousness is a thing that is real in some fundamental sense and I think you can get there with physicalism, or consciousness is more of a process, a leaky abstraction. I think functionalism naturally tugs in that direction. For example, Brian Tomasik has followed this line of reasoning and come to the conclusion of analytic functionalism, which is trying to explain away consciousness.

Lucas: What is your guys's working definition of consciousness, and what does it mean to say that consciousness is real?

Mike: It is a word that’s overloaded. It’s used in many contexts. I would frame it as what it feels like to be something, and something is conscious if there is something it feels like to be that thing.

Andrés: It's important also to highlight some of its properties. As Mike pointed out, consciousness is used in many different ways. There are something like eight definitions for the word consciousness, and honestly, all of them are really interesting. Some of them are more fundamental than others, and we tend to focus on the more fundamental side of the spectrum for the word. A sense that would be very non-fundamental would be consciousness in the sense of social awareness or something like that. We actually think of consciousness much more in terms of qualia: what is it like to be something? What is it like to exist? Some of the key properties of consciousness are as follows: First of all, we do think it exists.

Second, in some sense it has causal power, in the sense that the fact that we are conscious matters for evolution; evolution made us conscious for a reason, because it's actually doing some computational legwork that would maybe be possible to do otherwise, but not as efficiently or as conveniently as it is with consciousness. Then you also have the property of qualia: the fact that we can experience sights, and colors, and tactile sensations, and thoughts, and emotions, and so on. All of these seem to be in completely different worlds, and in a sense they are, but they have the property that they can be part of a unified experience that can experience color at the same time as experiencing sound. In that sense, those different types of sensations fall under the category of consciousness, because they can be experienced together.

And finally, you have unity, the fact that you have the capability of experiencing many qualia simultaneously. That’s generally a very strong claim to make, but we think you need to acknowledge and take seriously its unity.

Lucas: What are your guys's intuition pumps for thinking why consciousness exists as a thing? Why is there qualia?

Andrés: There's the metaphysical question of why consciousness exists to begin with. That's something I would like to punt on for the time being. There's also the question of why it was recruited for information processing purposes in animals. The intuition here is that the various contrasts that you can have within experience can serve a computational role. So, there may be a very deep reason why color qualia or visual qualia is used for information processing associated with sight, and why tactile qualia is associated with information processing useful for touching and making haptic representations, and that might have to do with the actual map of how all the qualia values are related to each other. Obviously, you have all of these edge cases, people who are synesthetic.

They may open their eyes and they experience sounds associated with colors, and people tend to think of those as abnormal. I would flip it around and say that we are all synesthetic, it’s just that the synesthesia that we have in general is very evolutionarily adaptive. The reason why you experience colors when you open your eyes is that that type of qualia is really well suited to represent geometrically a projective space. That’s something that naturally comes out of representing the world with the sensory apparatus like eyes. That doesn’t mean that there aren’t other ways of doing it. It’s possible that you could have an offshoot of humans that whenever they opened their eyes, they experience sound and they use that very well to represent the visual world.

But we may very well be in a local maximum of how different types of qualia are used to represent and do certain types of computations in a very well-suited way. So the intuition behind why we're conscious is that all of these different contrasts in the structure of the relationships of possible qualia values have computational implications, and there are actual ways of using these contrasts in very computationally effective ways.

Lucas: So, just to channel the functionalist here, wouldn't he just say that everything you just said about qualia could be fully reducible to input-output and algorithmic information processing? So, why do we need this extra property of qualia?

Andrés: There's this article, I believe by Brian Tomasik, that basically says flavors of consciousness are flavors of computation. It might be very useful to do that exercise, where you identify color qualia as just a certain type of computation, and it may very well be that the geometric structure of color is actually just a particular algorithmic structure: whenever you have a particular type of algorithmic information processing, you get these geometric spaces. In the case of color, that's a Euclidean three-dimensional space. In the case of touch or smell, it might be a much more complicated space, but then it's in a sense implied by the algorithms that we run. There are a number of good arguments there.
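
To give a rough feel for what "the geometric structure of color is a Euclidean three-dimensional space" could mean, here is a toy Python sketch (the RGB-style coordinates and the distance-as-similarity reading are simplifying assumptions for illustration, not QRI's actual model): colors as points in a three-dimensional space, with Euclidean distance standing in for how similar two colors seem.

```python
import math

# Toy illustration: treat colors as points in a three-dimensional Euclidean
# space (crude RGB-style coordinates) and use Euclidean distance as a rough
# stand-in for "how similar two colors seem." A simplification for
# illustration only, not a claim about the actual geometry of color qualia.

colors = {
    "red":    (1.0, 0.0, 0.0),
    "orange": (1.0, 0.5, 0.0),
    "blue":   (0.0, 0.0, 1.0),
}

def distance(c1, c2):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(c1, c2)))

# In this space, red sits closer to orange than to blue, mirroring the
# intuition that the structure of the space itself carries information.
print(distance(colors["red"], colors["orange"]))  # 0.5
print(distance(colors["red"], colors["blue"]))    # ~1.41
```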

The general approach to tackling those arguments is that when it comes down to actually defining what algorithms a given system is running, you will hit a wall when you try to formalize exactly how to do it. So, one example is: how do you determine the scope of an algorithm? When you're analyzing a physical system and you're trying to identify what algorithm it is running, are you allowed to consider 1,000 atoms? Are you allowed to consider a million atoms? Where is a natural boundary for you to say, "Whatever is inside here can be part of the same algorithm, but whatever is outside of it can't"? There really isn't a frame-invariant way of making those decisions. On the other hand, if you associate qualia with actual physical states, there is a frame-invariant way of describing what the system is.

Mike: So, a couple of years ago I posted a piece giving a critique of functionalism and one of the examples that I brought up was, if I have a bag of popcorn and I shake the bag of popcorn, did I just torture someone? Did I just run a whole brain emulation of some horrible experience, or did I not? There’s not really an objective way to determine which algorithms a physical system is objectively running. So this is a kind of an unanswerable question from the perspective of functionalism, whereas with the physical theory of consciousness, it would have a clear answer.

Andrés: Another metaphor here is: let's say you're at a park enjoying an ice cream. Imagine a system I created that has, let's say, isomorphic algorithms to whatever is going on in your brain. The particular algorithms that your brain is running in that precise moment, within a functionalist paradigm, map onto a metal ball rolling down one of the paths within this machine in a straight line, not touching anything else. So there's actually not much going on. According to functionalism, that would have to be equivalent, and it would actually be generating your experience. Now, the weird thing there is that you could actually break the machine, you could do a lot of things, and the behavior of the ball would not change.

Meaning that within functionalism, to actually understand what a system is doing, you need to understand the counterfactuals of the system. You need to understand: what would the system be doing if the input had been different? And all of a sudden, you end up with this very, very gnarly problem of defining, well, how do you actually objectively decide what the boundary of the system is? Even for some of these particular states that allegedly are very complicated, the system looks extremely simple, and you can remove a lot of parts without actually modifying its behavior. That calls into question whether there is an objective, non-arbitrary boundary that you can draw around the system and say, "Yeah, this is equivalent to what's going on in your brain right now."

This has a very heavy bearing on the binding problem. The binding problem, for those who haven't heard of it, is basically: how is it possible that 100 billion neurons, just because they're skull-bound and spatially distributed, simultaneously contribute to a unified experience, as opposed to, for example, neurons in your brain and neurons in my brain contributing to a unified experience? You hit a lot of problems, like what is the speed of propagation of information for different states within the brain? I'll leave it at that for the time being.

Lucas: I would just like to be careful about this intuition here that experience is unified. I think that the intuition pump for that is direct phenomenological experience like experience seems unified, but experience also seems a lot of different ways that aren’t necessarily descriptive of reality, right?

Andrés: You can think of it as different levels of sophistication, where you may start out with a very naive understanding of the world, where you confuse your experience for the world itself. A very large percentage of people perceive the world and in a sense think that they are experiencing the world directly, whereas all the evidence indicates that actually you’re experiencing an internal representation. You can go and dream, you can hallucinate, you can enter interesting meditative states, and those don’t map to external states of the world.

There’s this transition that happens when you realize that in some sense you’re experiencing a world simulation created by your brain, and of course, you’re fooled by it in countless ways, especially when it comes to emotional things that we look at a person and we might have an intuition of what type of person that person is, and that if we’re not careful, we can confuse our intuition, we can confuse our feelings with truth as if we were actually able to sense their souls, so to speak, rather than, “Hey, I’m running some complicated models on people-space and trying to carve out who they are.” There’s definitely a lot of ways in which experience is very deceptive, but here I would actually make an important distinction.

When it comes to intentional content, and intentional content is basically what the experience is about, for example, if you’re looking at a chair, there’s the quality of chairness, the fact that you understand the meaning of chair and so on. That is usually a very deceptive part of experience. There’s another way of looking at experience that I would say is not deceptive, which is the phenomenal character of experience; how it presents itself. You can be deceived about basically what the experience is about, but you cannot be deceived about how you’re having the experience, how you’re experiencing it. You can infer based on a number of experiences that the only way for you to even actually experience a given phenomenal object is to incorporate a lot of that information into a unified representation.

But also, if you just pay attention to your experience, you can notice that you can simultaneously place your attention on two spots of your visual field and make them harmonize. That's phenomenal character, and I would say that there's a strong case to be made to not doubt that property.

Lucas: I'm trying to do my best to channel the functionalist. I think he or she would say, "Okay, so what? That's just more information processing, and I'll bite the bullet on the binding problem. I still need some more time to figure that out. So what? It seems like these people who believe in qualia have an even tougher job of trying to explain this extra spooky quality in the world that's different from all the other physical phenomena that science has gone into." It also seems to violate Occam's razor, or a principle of lightness, where one's metaphysics or ontology would want to assume the least amount of extra properties or entities in order to explain the world. I'm just really trying to tease out your best arguments here for qualia realism, as we do have this current state of things in AI alignment where most people, it seems, would either try to explain away consciousness, say it's an illusion, or be anti-realist about qualia.

Mike: That's a really good question, a really good frame. And I would say our strongest argument revolves around predictive power. Just like centuries ago, you could absolutely be a skeptic about, shall we say, electromagnetism realism. And you could say, "Yeah, I mean there is this thing we call static, and there's this thing we call lightning, and there's this thing we call lodestones or magnets, but all these things are distinct. And to think that there's some unifying frame, some deep structure of the universe that would tie all these things together and highly compress these phenomena, that's crazy talk." And so, it is a viable position today to say that about consciousness, that it's not yet clear whether consciousness has deep structure. But we're assuming it does, and we think that unlocks a lot of predictive power.

We should be able to make predictions that are both more concise and compressed and crisp than others, and we should be able to make predictions that no one else can.

Lucas: So what is the most powerful here about what you guys are doing? Is it the specific theories and assumptions which you take are falsifiable?

Mike: Yeah.

Lucas: If we can make predictive assessments of these things, which are either leaky abstractions or are qualia, how would we even then be able to arrive at a realist or anti-realist view about qualia?

Mike: So, one frame on this is, it could be that one could explain a lot of things about observed behavior and implicit phenomenology through a purely functionalist or computationalist lens, but maybe for a given system it might take 10 terabytes. And if you can get there in a much simpler way, if you can explain it in terms of three elegant equations instead of 10 terabytes, that wouldn't be proof that there exists some crystal clear deep structure at work, but it would be very suggestive. Marr's Levels of Analysis are pretty helpful here, where a functionalist might actually be very skeptical of consciousness mattering at all, because they would say, "Hey, if you're identifying consciousness at the implementation level of analysis, how could that have any bearing on how we talk about things, how we understand the world, how we'd behave?

Since the implementation level is kind of epiphenomenal from the point of view of the algorithm, how can an algorithm know its own implementation? All it can maybe figure out is its own algorithm, and its identity would be constrained to its own algorithmic structure." But that's not quite true. In fact, one level of analysis can have bearing on another, meaning that in some cases the implementation level of analysis doesn't actually matter for the algorithm, but in some cases it does. So, if you were implementing a computer, let's say with water, you have the option of implementing a Turing machine with water buckets, and in that case, okay, the implementation level of analysis goes out the window in terms of helping you understand the algorithm.

But if how you're using water to implement algorithms is by basically creating a system of adding waves in buckets of different shapes, with different resonant modes, then the implementation level of analysis actually matters a whole lot for what algorithms are … finely tuned to be very effective in that substrate. In the case of consciousness and how we behave, we do think properties of the substrate have a lot of bearing on what algorithms we actually run. A functionalist should actually start caring about consciousness if the properties of consciousness make the algorithms more efficient, more powerful.

Lucas: But what if qualia and consciousness are substantive real things? What if epiphenomenalism is true, and consciousness is like smoke rising from computation and doesn't have any causal efficacy?

Mike: To offer a re-frame on this, I like this frame of dual aspect monism better. There seems to be an implicit value judgment on epiphenomenalism. It’s seen as this very bad thing if a theory implies qualia as epiphenomenal. Just to put cards on the table, I think Andrés and I differ a little bit on how we see these things, although I think our ideas also mesh up well. But I would say that under the frame of something like dual aspect monism, that there’s actually one thing that exists, and it has two projections or shadows. And one projection is the physical world such as we can tell, and then the other projection is phenomenology, subjective experience. These are just two sides of the same coin and neither is epiphenomenal to the other. It’s literally just two different angles on the same thing.

And in that sense, qualia values and physical values are really talking about the same thing when you get down to it.

Lucas: Okay. So does this all begin with this move that Descartes makes, where he tries to produce a perfectly rational philosophy or worldview by making no assumptions and then starting with experience? Is this the kind of thing that you guys are doing in taking consciousness or qualia to be something real or serious?

Mike: I can just speak for myself here, but I would say my intuition comes from two places. One is staring deep into the beast of functionalism and realizing that it doesn’t lead to a clear answer. My model is that it just is this thing that looks like an answer but can never even in theory be an answer to how consciousness works. And if we deny consciousness, then we’re left in a tricky place with ethics and moral value. It also seems to leave value on the table in terms of predictions, that if we can assume consciousness as real and make better predictions, then that’s evidence that we should do that.

Lucas: Isn’t that just an argument that it would be potentially epistemically useful for ethics if we could have predictive power about consciousness?

Mike: Yeah. So, let’s assume that it’s 100 years, or 500 years, or 1,000 years in the future, and we’ve finally cracked consciousness. We’ve finally solved it. My open question is, what does the solution look like? If we’re functionalists, what does the solution look like? If we’re physicalists, what does the solution look like? And we can expand this to ethics as well.

Lucas: Just as a conceptual clarification, the functionalists are also physicalists though, right?

Andrés: There are two senses of the word physicalism here. There's physicalism in the sense of a theory of the universe: that the behavior of matter and energy, what happens in the universe, is exhaustively described by the laws of physics, or future physics. There is also physicalism in the sense of understanding consciousness, in contrast to functionalism. David Pearce, I think, would describe it as non-materialist physicalist idealism. There's definitely a very close relationship between that phrasing and dual aspect monism. I can briefly unpack it. Basically, non-materialist is not saying that the stuff of the world is fundamentally unconscious. That's something that materialism claims: that what the world is made of is not conscious, is raw matter, so to speak.

Andrés: Physicalist, again, in the sense that the laws of physics exhaustively describe behavior, and idealist in the sense that what makes up the world is qualia or consciousness. The big picture view is that the actual substrate of the universe, the quantum fields, are fields of qualia.

Lucas: So Mike, you were saying that in the future when we potentially have a solution to the problem of consciousness, that in the end, the functionalists with algorithms and explanations of say all of the easy problems, all of the mechanisms behind the things that we call consciousness, you think that that project will ultimately fail?

Mike: I do believe that, and I guess my gentle challenge to functionalists would be to sketch out a vision of what a satisfying answer to consciousness would be, whether it’s completely explaining it away or completely explaining it. If in 500 years you go to the local bookstore and you check out Consciousness 101, and just flip through it, you look at the headlines and the chapter list and the pictures, what do you see? I think we have an answer as formalists, but I would be very interested in getting the functionalists’ take on this.

Lucas: All right, so you guys have this belief in the ability to formalize our understanding of consciousness, is this actually contingent on realism or anti realism?

Mike: It is implicitly dependent on realism, that consciousness is real enough to be describable mathematically in a precise sense. And actually that would be my definition of realism, that something is real if we can describe it exactly with mathematics and it is instantiated in the universe. I think the idea of connecting math and consciousness is very core to formalism.

Lucas: What’s particularly interesting here is that you’re making falsifiable claims about phenomenological states. It’s good and exciting that your Symmetry Theory of Valence, which we can get into now, has falsifiable aspects. So do you guys want to describe your Symmetry Theory of Valence here, how it fits in, and how it follows as a consequence of your valence realism?

Andrés: Sure, yeah. I think one of the key places where this has bearing is in understanding what it is that we actually want and what it is that we actually like and enjoy. Usually that gets answered at the agent level. So basically you think of agents as entities who spin out possibilities for what actions to take, and then they have a way of sorting them by expected utility and then carrying them out. A lot of people may associate what we want or what we like or what we care about with that level, the agent level, whereas we think the true source of value is actually lower level than that. There’s something else that we’re actually using in order to implement agentive behavior. There are ways of experiencing value that are completely separated from agents. You don’t actually need to be generating possible actions and evaluating them and enacting them for there to be value, or for you to actually be able to enjoy something.

So what we’re examining here is what is the lower level property that gives rise even to agentive behavior, that underlies every other aspect of experience. That would be valence, and specifically valence gradients. The general claim is that we are set up in such a way that we are basically climbing the valence gradient. This is not true in every situation, but it’s mostly true, and it’s definitely mostly true in animals. And then the question becomes what implements valence gradients. One intuition pump here is this extraordinary fact that things that have nothing to do with our evolutionary past nonetheless can feel good or bad. It’s understandable that if you hear somebody scream, you may get nervous or anxious or fearful, or if you hear somebody laugh you may feel happy.

That makes sense from an evolutionary point of view. But why would the sound of the Bay Area Rapid Transit, the BART, which creates these very intense screeching sounds that aren’t even within the vocal range of humans, something really bizarre and never encountered before in our evolutionary past, nonetheless have an extraordinarily negative valence? That’s a hint that valence has to do with patterns; it’s not just goals and actions and utility functions, the actual pattern of your experience may determine valence. The same goes for the SubPac, this technology that basically renders sounds between 10 and 100 hertz: some of them feel really good, some of them feel pretty unnerving, some of them are anxiety producing, and it’s like, why would that be the case? Especially when these are types of input that have nothing to do with our evolutionary past.

It seems that there are ways of triggering high and low valence states just based on the structure of your experience. The last example I’ll give is very weird states of consciousness like meditation or psychedelics, which seem to come with extraordinarily intense and novel forms of experiencing significance or a sense of bliss or pain. And again, they don’t seem to have much semantic content per se, or rather the semantic content is not the core reason why they feel the way they do. It has to do more with the particular structure that they induce in experience.

Mike: There are many ways to talk about where pain and pleasure come from. We can talk about it in terms of neurochemicals, opioids, dopamine. We can talk about it in terms of pleasure centers in the brain, in terms of goals and preferences and getting what you want, but all of these have counterexamples. All of these have some point that you can follow the thread back to which will beg the question. I think the only way to explain emotional valence, pain and pleasure, that doesn’t beg the question is to explain it in terms of some patterns within phenomenology that just intrinsically feel good and some that intrinsically feel bad. To touch back on the formalism frame, this would be saying that if we have a mathematical object that is isomorphic to your phenomenology, to what it feels like to be you, then some pattern or property of this object will refer to, or will sort of intrinsically encode, your emotional valence, how pleasant or unpleasant this experience is.

That’s the valence formalism aspect that we’ve come to.

Lucas: So given the valence realism, the view is that there is this intrinsic pleasure-pain axis of the world, and this is sort of channeling, I guess, David Pearce’s view. There are things in experience which are just clearly good-seeming or bad-seeming. Will MacAskill called these pre-theoretic properties we might ascribe to certain kinds of experiential aspects, like they’re just good or bad. So with this valence realism view, this goodness or badness, whose nature is sort of self-intimatingly disclosed, has been in the physics and in the world since the beginning, and now it’s unfolding and expressing itself more, and the universe is sort of coming to life, and embedded somewhere deep within the universe’s structure are these intrinsically good or intrinsically bad valences which complex computational systems, and maybe other stuff, have access to.

Andrés: Yeah, yeah, that’s right. And I would perhaps emphasize that it’s not only pre-theoretical, it’s pre-agentive, you don’t even need an agent for there to be valence.

Lucas: Right. Okay. This is going to be a good point, I think, for getting into these other more specific hairy philosophical problems. Could you go ahead and unpack a little bit more this view that pleasure or pain is self-intimatingly good or bad, that just by existing in experiential relation with the thing, its nature is disclosed? Brian Tomasik, and I think functionalists generally, would say there’s just another reinforcement learning algorithm somewhere that is just evaluating these phenomenological states. They’re not intrinsically good or bad, that’s just what it feels like to be the kind of agent who has that belief.

Andrés: Sure. There are definitely many angles from which to see this. One of them is by realizing that liking, wanting and learning are possible to dissociate, and in particular you can have reinforcement without an associated positive valence. You can also have positive valence without reinforcement or learning. Generally they are correlated, but they are different things. My understanding is that a lot of people think of valence as something we believe matters because we are the type of agents that have a utility function and a reinforcement function. If that were the case, we would expect valence to melt away in states that are non-agentive; we wouldn’t necessarily see it there. We would also expect it to be intrinsically tied to intentional content, the aboutness of experience. A very strong counterexample is that somebody may claim that what they truly want is to be academically successful or something like that.

They think of the reward function as intrinsically tied to getting a degree or something like that. I would call that to some extent illusory: if you actually look at how those preferences are being implemented, deep down there would be valence gradients happening there. One way to show this would be, let’s say on graduation day you give the person an opioid antagonist. The person will subjectively feel that the day is meaningless; you’ve removed the pleasant gloss of the experience that they were actually looking for, that they thought all along was tied in with the intentional content, with the fact of graduating, but in fact it was the hedonic gloss that they were after. That’s one intuition pump there.

Lucas: These core problem areas that you’ve identified in Principia Qualia, would you just like to briefly touch on those?

Mike: Yeah, trying to break the problem down into modular pieces with the idea that if we can decompose the problem correctly, then the sub-problems become much easier than the overall problem, and if you collect all the solutions to the sub-problems, then in aggregate you get a full solution to the problem of consciousness. So I’ve split things up into the metaphysics, the math and the interpretation. The first question is what metaphysics do you even start with? What ontology do you even use to try to approach the problem? And we’ve chosen the ontology of physics, which can objectively map onto reality in a way that computation cannot. Then there’s this question of, okay, so you have your core ontology, in this case physics, and then there’s this question of what counts, what actively contributes to consciousness? Do we look at electrons, electromagnetic fields, quarks?

This is an unanswered question. We have hypotheses but we don’t have an answer. Moving into the math, conscious systems seem to have boundaries: if something’s happening inside my head it can directly contribute to my conscious experience, but even if we put our heads together, literally speaking, your consciousness doesn’t bleed over into mine, there seems to be a boundary. So one way of framing this is the boundary problem and another way of framing it is the binding problem, and these are just two sides of the same coin. There’s this big puzzle of how you draw the boundaries of a subjective experience. IIT is set up to approach consciousness in itself through this lens; it has a certain style of answer, a certain style of approach. We don’t necessarily need to take that approach, but it’s an intellectual landmark. Then we get into things like the state-space problem and the topology of information problem.

Say we’ve figured out our basic ontology, what we think is a good starting point, and of that stuff, what actively contributes to consciousness; then we can figure out some principled way to draw a boundary around, okay, this is conscious experience A and this is conscious experience B, and they don’t overlap. So you have a bunch of information inside the boundary. Then there’s this math question of how you rearrange it into a mathematical object that is isomorphic to what that stuff feels like. And again, IIT has an approach to this; we don’t necessarily subscribe to the exact approach, but it’s good to be aware of. There’s also the interpretation problem, which is actually very near and dear to what QRI is working on, and this is the question of: if you had a mathematical object that represented what it feels like to be you, how would we even start to figure out what it meant?

Lucas: This is also where the falsifiability comes in, right? If we have the mathematical object and we’re able to formally translate that into phenomenological states, then people can self report on predictions, right?

Mike: Yes. I don’t necessarily fully trust self reports as being the gold standard. I think maybe evolution is tricky sometimes and can lead to inaccurate self report, but at the same time it’s probably pretty good, and it’s the best we have for validating predictions.

Andrés: A lot of this gets easier if we assume that maybe we can be wrong in an absolute sense, but we’re often pretty well calibrated to judge relative differences. Maybe you ask me how I’m doing on a scale of one to ten and I say seven and the reality is a five; maybe that’s a problem, but at the same time, I like chocolate, and if you give me some chocolate and I eat it and that improves my subjective experience, then I would expect us to be well calibrated in terms of evaluating whether something is better or worse.

Lucas: There’s this view here though that the brain is not like a classical computer, that it is more like a resonant instrument.

Mike: Yeah. Maybe an analogy here could be pretty useful. There’s this researcher William Sethares who basically figured out a way to quantify the mutual dissonance between pairs of notes. It turns out that it’s not very hard: all you need to do is add up the pairwise dissonance between every harmonic of the notes. And what that gives you is that if you take, for example, a major key and you compute the average dissonance between pairs of notes within that major key, it’s going to be fairly low on average. And if you take the average dissonance of a minor key, it’s going to be higher. So in a sense, what distinguishes a minor and a major key is, in the combinatorial space of possible combinations of notes, how frequently they are dissonant versus consonant.

That’s a very ground-truth mathematical feature of a musical instrument, and that’s going to be different from one instrument to the next. With that as a backdrop, we think of the brain, and in particular valence, in a very similar light: the brain has natural resonant modes, and emotions may seem externally complicated. When you’re having a very complicated emotion and we ask you to describe it, it’s almost like trying to describe a moment in a symphony, this very complicated composition, and how do you even go about it? But deep down, the reason why a particular frame sounds pleasant or unpleasant within music is ultimately traceable to the additive pairwise dissonance of all of those harmonics. And likewise, for a given state of consciousness, we suspect that, very similar to music, the average pairwise dissonance between the harmonics present at a given point in time will be strongly related to how unpleasant the experience is.
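To make the "add up the pairwise dissonance between every harmonic" idea concrete, here is a minimal Python sketch of a Sethares-style calculation. The constants, the 1/k amplitude rolloff, and the a1*a2 weighting are illustrative assumptions on my part, a rough approximation rather than Sethares' or QRI's exact model.

```python
import numpy as np
from itertools import product

# Rough constants for a Plomp-Levelt-style dissonance curve, in the spirit
# of Sethares' fit (values are approximate; the whole sketch is illustrative).
B1, B2, DSTAR, S1, S2 = 3.5, 5.75, 0.24, 0.021, 19.0

def pair_dissonance(f1, a1, f2, a2):
    """Perceived roughness of two pure tones."""
    s = DSTAR / (S1 * min(f1, f2) + S2)
    x = s * abs(f2 - f1)
    return a1 * a2 * (np.exp(-B1 * x) - np.exp(-B2 * x))

def note_partials(f0, n=6):
    """A note as its first n harmonics, with an assumed 1/k amplitude rolloff."""
    return [(k * f0, 1.0 / k) for k in range(1, n + 1)]

def dissonance(notes):
    """Total pairwise dissonance across every harmonic of every note."""
    partials = [p for f0 in notes for p in note_partials(f0)]
    return sum(pair_dissonance(f1, a1, f2, a2)
               for (f1, a1), (f2, a2) in product(partials, partials) if f1 < f2)

# A major triad should come out less dissonant than a semitone cluster.
print(dissonance([261.6, 329.6, 392.0]))   # C-E-G
print(dissonance([261.6, 277.2, 293.7]))   # C-C#-D
```

Run on a major triad versus a tight semitone cluster, the cluster's total pairwise dissonance comes out markedly higher, which is the same kind of combinatorial consonance comparison described for major and minor keys above.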

These are electromagnetic waves, and it’s not exactly static and it’s not exactly a standing wave either, but it gets really close to it. Basically there’s this excitation-inhibition wave function, and that happens statistically across macroscopic regions of the brain. There’s only a discrete number of ways in which that wave can fit an integer number of times in the brain. We’ll give you a link to the actual visualizations of what this looks like. As a concrete example, one of the harmonics with the lowest frequency is a very simple one where the two hemispheres are alternately more excited versus inhibited. That’s a low frequency harmonic because it is a very spatially large wave, an alternating pattern of excitation. Much higher frequency harmonics are much more detailed and obviously hard to describe, but visually, generally speaking, the spatial regions that are activated versus inhibited are these very thin wave fronts.

It’s not a mechanical wave as such, it’s an electromagnetic wave. So what actually fluctuates is the electric potential in each of these regions of the brain, and within this paradigm, at any given point in time you can describe a brain state as a weighted sum of all of its harmonics, and what that weighted sum looks like depends on your state of consciousness.
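As a toy illustration of "a brain state as a weighted sum of all of its harmonics," here is a sketch that projects a one-dimensional activation pattern onto a handful of sine modes and reads off the weights. Real connectome-specific harmonics are eigenmodes of the brain's structural connectome, so the sine modes here are only a stand-in of my own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "harmonics": sine modes on a 1-D strip of N regions.
N, K = 200, 8
x = np.linspace(0, np.pi, N)
modes = np.stack([np.sin((k + 1) * x) for k in range(K)])        # (K, N)
modes /= np.linalg.norm(modes, axis=1, keepdims=True)            # roughly orthonormal

# A toy "brain state": mostly mode 1 plus some mode 4, plus a little noise.
true_w = np.zeros(K); true_w[0], true_w[3] = 1.0, 0.4
state = true_w @ modes + 0.02 * rng.standard_normal(N)

# Decompose the state into a weighted sum of harmonics by projection.
weights = modes @ state
print(np.round(weights, 2))       # recovers roughly [1.0, 0, 0, 0.4, ...]

reconstruction = weights @ modes  # the "weighted sum of all of its harmonics"
```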

Lucas: Sorry, I’m getting a little caught up here on enjoying resonant sounds and then also the valence realism. The view isn’t that all minds will enjoy resonant things because happiness is like a fundamental valence thing of the world and all brains who come out of evolution should probably enjoy resonance.

Mike: It’s less about the stimulus, it’s less about the exact signal and it’s more about the effect of the signal on our brains. The resonance that matters, the resonance that counts, or the harmony that counts we’d say, or in a precisely technical term, the consonance that counts is the stuff that happens inside our brains. Empirically speaking most signals that involve a lot of harmony create more internal consonance in these natural brain harmonics than for example, dissonant stimuli. But the stuff that counts is inside the head, not the stuff that is going in our ears.

Just to be clear about QRI’s move here: Selen Atasoy has put forth this connectome-specific harmonic wave model, and what we’ve done is combined it with our Symmetry Theory of Valence. It’s basically a way of getting a Fourier transform of where the energy is in terms of frequencies of brainwaves, in a much cleaner way than has been available through EEG. So we can evaluate this data set for harmony: how much harmony is there in a brain? Per the link to the Symmetry Theory of Valence, that should be a very good proxy for how pleasant it is to be that brain.

Lucas: Wonderful.

Andrés: In this context, yeah, the Symmetry Theory of Valence would be much more fundamental. There’s probably many ways of generating states of consciousness that are in a sense completely unnatural that are not based on the harmonics of the brain, but we suspect the bulk of the differences in states of consciousness would cash out in differences in brain harmonics because that’s a very efficient way of modulating the symmetry of the state.

Mike: Basically, music can be thought of as a very sophisticated way to hack our brains into a state of greater consonance, greater harmony.

Lucas: All right. People should check out your Principia Qualia, which is the work that you’ve done that captures a lot of this well. Is there anywhere else that you’d like to refer people to for the specifics?

Mike: Principia Qualia covers the philosophical framework and the Symmetry Theory of Valence. Andrés has written deeply about this connectome-specific harmonic wave frame, and the name of that piece is Quantifying Bliss.

Lucas: Great. I would love to be able to quantify bliss and instantiate it everywhere. Let’s jump in here into a few problems and framings of consciousness. I’m just curious to see if you guys have any comments on these; the first is what you call the real problem of consciousness, and the second one is what David Chalmers calls the meta-problem of consciousness. Would you like to go ahead and start off with the real problem of consciousness?

Mike: Yeah. So this gets to something we were talking about previously: is consciousness real or is it not? Is it something to be explained or to be explained away? This cashes out in terms of whether it is something that can be formalized or is intrinsically fuzzy. I’m calling this the real problem of consciousness, and a lot depends on the answer to this. There are so many different ways to approach consciousness and hundreds, perhaps thousands, of different carvings of the problem: we have panpsychism, we have dualism, we have non-materialist physicalism and so on. I think essentially all of these theories sort themselves into two buckets based on the core distinction, and that’s: is consciousness real enough to formalize exactly, or not? This frame is perhaps the most useful frame to use to evaluate theories of consciousness.

Lucas: And then there’s a Meta problem of consciousness which is quite funny, it’s basically like why have we been talking about consciousness for the past hour and what’s all this stuff about qualia and happiness and sadness? Why do people make claims about consciousness? Why does it seem to us that there is maybe something like a hard problem of consciousness, why is it that we experience phenomenological states? Why isn’t everything going on with the lights off?

Mike: I think this is a very clever move by David Chalmers. It’s a way to try to unify the field and get people to talk to each other, which is not so easy in the field. The Meta problem of consciousness doesn’t necessarily solve anything but it tries to inclusively start the conversation.

Andrés: The common move that people make here is to say that all of these crazy things we think and say about consciousness, that’s just any information processing system modeling its own attentional dynamics. That’s one illusionist frame, but even within a qualia realist, qualia formalist paradigm, you still have the question of why we even think or self-reflect about consciousness. You could very well think of consciousness as being computationally relevant, you need to have consciousness and so on, but still lacking introspective access. You could have these complicated conscious information processing systems, but they don’t necessarily self-reflect on the quality of their own consciousness. That property is important to model and make sense of.

We have a few formalisms that may give rise to some insight into how self-reflectivity happens, and in particular how it is possible to model the entirety of your state of consciousness in a given phenomenal object. This ties in with the notion of a homunculus: if the overall valence of your consciousness is actually a signal traditionally used for fitness evaluation, detecting basically when you are at existential risk or when there are reproductive opportunities that you may be missing out on, then it makes sense for there to be a general thermostat of the overall experience, where you can just look at it and get a sense of the overall well-being of the entire experience added together, in such a way that you experience it all at once.

I think like a lot of the puzzlement has to do with that internal self model of the overall well being of the experience, which is something that we are evolutionarily incentivized to actually summarize and be able to see at a glance.

Lucas: So, some people have a view where human beings are conscious and they assume everyone else is conscious and they think that the only place for value to reside is within consciousness, and that a world without consciousness is actually a world without any meaning or value. Even if we think that say philosophical zombies or people who are functionally identical to us but with no qualia or phenomenological states or experiential states, even if we think that those are conceivable, then it would seem that there would be no value in a world of p-zombies. So I guess my question is why does phenomenology matter? Why does the phenomenological modality of pain and pleasure or valence have some sort of special ethical or experiential status unlike qualia like red or blue?

Why does red or blue not disclose some sort of intrinsic value in the same way that my suffering does or my bliss does or the suffering or bliss of other people?

Mike: My intuition is also that consciousness is necessary for value. Nick Bostrom has this wonderful quote in Superintelligence, that we should be wary of building a Disneyland with no children, some technological wonderland that is filled with marvels of function but doesn’t have any subjective experience, doesn’t have anyone to enjoy it, basically. I would just say that I think most AI safety research is focused around making sure there is a Disneyland, making sure, for example, that we don’t just get turned into something like paperclips. But there’s this other problem: making sure there are children, making sure there are subjective experiences around to enjoy the future. I would say that there aren’t many live research threads on this problem, and I see QRI as a live research thread on how to make sure there is subjective experience in the future.

Probably a can of worms there, but as for your question about pain and pleasure, I may pass that to my colleague Andrés.

Andrés: Nothing terribly satisfying here. I would go with David Pearce’s view that these properties of experience are self-intimating, and to the extent that you do believe in value, they will come up as the natural focal points for value, especially if you’re allowed to probe the quality of your experience. In many states you believe that the reason you like something is its intentional content. Again, it could be the case of graduating, or the case of getting a promotion, or one of those things that a lot of people associate with feeling great. But if you actually probe the quality of the experience, you will realize that there is this component of it which is its hedonic gloss, and you can manipulate it directly, again with things like opioid antagonists, and, if the Symmetry Theory of Valence is true, potentially also by directly modulating the consonance and dissonance of the brain harmonics, in which case the hedonic gloss would change in peculiar ways.

When it comes to consilience, when it comes to many different points of view agreeing on what aspect of the experience is what brings value to it, it seems to be the hedonic gloss.

Lucas: So in terms of qualia and valence realism, would the causal properties of qualia be the thing that would show any arbitrary mind the self intimating nature of how good or bad an experience is, and in the space of all possible minds, what is the correct epistemological mechanism for evaluating the moral status of experiential or qualitative states?

Mike: So first of all, I would say that my focus so far has mostly been on describing what is and not what ought. I think that we can talk about valence without necessarily talking about ethics, but if we can talk about valence clearly, that certainly makes some questions in ethics, and some frameworks in ethics, make much more or less sense. So the better we can clearly and purely descriptively talk about consciousness, the easier I think a lot of these ethical questions get. I’m trying hard not to privilege any ethical theory. I want to talk about reality. I want to talk about what exists, what’s real and what the structure of what exists is, and I think if we succeed at that, then all these other questions about ethics and morality get much, much easier. I do think that there is an implicit should wrapped up in questions about valence, but I do think that’s another leap.

You can accept that valence is real without necessarily accepting that optimizing valence is an ethical imperative. I personally think, yes, it is very ethically important, but it is possible to take a purely descriptive frame on valence. Whether or not this also discloses, as David Pearce says, the utility function of the universe, that is another question, and it can be decomposed.

Andrés: One framing here too is that we do suspect valence is going to be the thing that matters to any mind if you probe it in the right way in order to achieve reflective equilibrium. A big example is a talk a neuroscientist was giving at some point: there was something off and everybody seemed to be a little bit anxious or irritated and nobody knew why, and then one of the conference organizers suddenly came up to the presenter and did something to the microphone, and then everything sounded way better and everybody was way happier. There was this hissing pattern caused by some malfunction of the microphone and it was making everybody irritated; they just didn’t realize that was the source of the irritation, and when it got fixed, then, you know, everybody’s like, “Oh, that’s why I was feeling upset.”

We will find that to be the case over and over when it comes to improving valence. So somebody in the year 2050 might come to one of the connectome-specific harmonic wave clinics saying, “I don’t know what’s wrong with me,” but you put them through the scanner and identify, say, that their 17th and 19th harmonics are in a state of dissonance. You cancel the 17th to make it cleaner, and then the person will all of a sudden say, “Yeah, my problem is fixed. How did you do that?” So I think it’s going to be a lot like that: the things that puzzle us, about why I prefer this or why I think that is worse, will all of a sudden become crystal clear from the point of view of objectively measured valence gradients.

Mike: One of my favorite phrases in this context is what you can measure you can manage and if we can actually find the source of dissonance in a brain, then yeah, we can resolve it, and this could open the door for maybe honestly a lot of amazing things, making the human condition just intrinsically better. Also maybe a lot of worrying things, being able to directly manipulate emotions may not necessarily be socially positive on all fronts.

Lucas: So I guess here we can begin to jump into AI alignment and qualia. We’re building AI systems and they’re getting pretty strong, and they’re going to keep getting stronger, potentially creating a superintelligence by the end of the century, and consciousness and qualia seem to be along for the ride for now. So I’d like to discuss a little bit here some more specific places in AI alignment where these views might inform it and direct it.

Mike: Yeah, I would say there are three problems of AI safety. There’s the technical problem: how do you make a self-improving agent that is also predictable and safe? This is a very difficult technical problem, first of all to even make the agent, but second of all especially to make it safe, especially if it becomes smarter than we are. There’s also the political problem: even if you have the best technical solution in the world, having a sufficiently good technical solution doesn’t mean that it will be put into action in a sane way if we’re not in a reasonable political system. But I would say the third problem is what QRI is most focused on, and that’s the philosophical problem. What are we even trying to do here? What is the optimal relationship between AI and humanity? And there are a couple of specific details here. First of all, I think nihilism is absolutely an existential threat, and if we can find some antidotes to nihilism through some advanced valence technology, that could be enormously helpful for reducing x-risk.

Lucas: What kind of nihilism are you talking about here, like nihilism about morality and meaning?

Mike: Yes, I would say so, and just personal nihilism that it feels like nothing matters, so why not do risky things?

Lucas: Whose quote is it, the philosophers question like should you just kill yourself? That’s the yawning abyss of nihilism inviting you in.

Andrés: Albert Camus. The only real philosophical question is whether to commit suicide, whereas how I think of it is the real philosophical question is how to make love last, bringing value to the existence, and if you have value on tap, then the question of whether to kill yourself or not seems really nonsensical.

Lucas: For sure.

Mike: We could also say that right now there aren’t many good Schelling points for global coordination. People talk about how having global coordination around building AGI would be a great thing, but we’re a little light on the details of how to do that. If a clear, comprehensive, useful, practical understanding of consciousness can be built, then this may sort of embody or generate new Schelling points that the larger world could self-organize around. If we can give people a clear understanding of what is and what could be, then I think we will get a better future that actually gets built.

Lucas: Yeah. Showing what is and what could be is immensely important and powerful. So moving forward with AI alignment as we’re building these more and more complex systems, there’s this needed distinction between unconscious and conscious information processing, if we’re interested in the morality and ethics of suffering and joy and other conscious states. How do you guys see the science of consciousness here, actually being able to distinguish between unconscious and conscious information processing systems?

Mike: There are a few frames here. One is that, yeah, it does seem like the brain does some processing in consciousness and some processing outside of consciousness, and what’s up with that? This could be an interesting frame to explore in terms of avoiding things like mind crime in the AGI or AI space: if there are certain computations which are painful, then don’t do them in a way that would be associated with consciousness. It would be very good to have rules of thumb here for how to do that. One interesting possibility is that in the future we might not just have compilers which optimize for speed of processing or minimization of dependent libraries and so on, but compilers which could optimize for the valence of the computation on certain hardware. This of course gets into complex questions about computationalism, how hardware-dependent this compiler would be, and so on.

I think it’s an interesting and important long-term frame.

Lucas: So just to illustrate here, I think, the ways in which solving or better understanding consciousness will inform AI alignment, from the present day until superintelligence and beyond.

Mike: I think there’s a lot of confusion about consciousness and a lot of confusion about what kind of thing the value problem is in AI safety, and there are some novel approaches on the horizon. I was speaking with Stuart Armstrong last year and he had some great things to share about his model fragments paradigm. I think this is the right direction. It’s sort of understanding that, yeah, human preferences are insane. They’re just not a consistent formal system.

Lucas: Yeah, we contain multitudes.

Mike: Yes, yes. So first of all, understanding what generates them seems valuable. There’s this frame in AI safety called the complexity of value thesis. I believe Eliezer came up with it in a post on LessWrong. It’s this frame where human value is very fragile, in that it can be thought of as a small area, perhaps even almost a point, in a very high dimensional space, say a thousand dimensions. If we go any distance in any direction from this tiny point in this high dimensional space, then we quickly get to something that we wouldn’t think of as very valuable; maybe we leave everything the same and just take away freedom, and most of the value is gone. This paints a pretty sobering picture of how difficult AI alignment will be.
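A tiny numerical caricature of that fragility picture, purely for intuition: treat "human values" as a single point in a thousand-dimensional space, with an arbitrarily narrow Gaussian value function that I am assuming just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000                                   # a "thousand-dimensional" value space
human_values = rng.normal(size=d)          # the tiny region we care about, as a point

def value(world, width=0.1):
    """Toy value function, sharply peaked around the point (width is arbitrary)."""
    return np.exp(-np.sum((world - human_values) ** 2) / (2 * width ** 2))

direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)     # a random unit direction to drift in

for step in [0.0, 0.05, 0.1, 0.3, 1.0]:
    print(step, value(human_values + step * direction))
# Value falls off rapidly as the world drifts away from the point, which is
# the "fragility of value" picture in miniature.
```

The unity of value alternative described next would amount to saying the landscape is not this sharply peaked, because many valued things share one underlying axis.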

I think this is perhaps arguably the source of a lot of worry in the community: not only do we need to make machines that won’t just immediately kill us, but machines that will preserve our position in this very, very high dimensional space well enough that we keep the same trajectory, and possibly, if we move at all, then we may enter a totally different trajectory that we in 2019 wouldn’t think of as having any value. So this problem becomes very, very intractable. I would just say that there is an alternative frame. The phrasing that I’m playing around with here is: instead of the complexity of value thesis, the unity of value thesis. It could be that many of the things that we find valuable, eating ice cream, living in a just society, having a wonderful interaction with a loved one, all of these have the same underlying neural substrate, and empirically this is what affective neuroscience is finding.

Eating a chocolate bar activates the same brain regions as a transcendental religious experience. So maybe there’s some sort of elegant compression that can be made, and maybe things aren’t so stark. We’re not this point in a super high dimensional space where, if we leave the point, then everything of value is trashed forever; maybe there’s some sort of convergent process that we can follow, that we can essentialize. We can make this list of 100 things that humanity values, and maybe they all have positive valence in common, and positive valence can sort of be reverse engineered. To some people this feels like a very scary dystopic scenario, don’t knock it until you’ve tried it, but at the same time there’s a lot of complexity here.

One core frame that the ideas of qualia formalism and valence realism can offer AI safety is that maybe the actual goal is somewhat different than what the complexity of value thesis puts forward. Maybe the actual goal is different, and in fact easier. I think this could directly inform how we spend our resources on the problem space.

Lucas: Yeah, I was going to say that there exists a standing tension between this view of the complexity of all the preferences and values that human beings have, and then the valence realist view which says that what’s ultimately good are certain experiential or hedonic states. I’m interested and curious about whether, if this valence view is true, it’s all just going to turn into hedonium in the end.

Mike: I’m personally a fan of continuity. I think that if we do things right we’ll have plenty of time to get things right and also if we do things wrong then we’ll have plenty of time for things to be wrong. So I’m personally not a fan of big unilateral moves, it’s just getting back to this question of can understanding what is help us, clearly yes.

Andrés: Yeah. I guess one view is we could say preserve optionality and learn what is, and then from there hopefully we’ll be able to better inform oughts and with maintained optionality we’ll be able to choose the right thing. But that will require a cosmic level of coordination.

Mike: Sure. An interesting frame here is whole brain emulation. Whole brain emulation is sort of a frame built around functionalism, and it’s a seductive frame, I would say. But if whole brain emulations wouldn’t necessarily have the same qualia as the original humans, based on hardware considerations, there could be some weird lock-in effects where, if the majority of society turned themselves into p-zombies, then it may be hard to go back on that.

Lucas: Yeah. All right. We’re just getting to the end here, I appreciate all of this. You guys have been tremendous and I really enjoyed this. I want to talk about identity in AI alignment. This sort of taxonomy that you’ve developed about open individualism and closed individualism and all of these other things. Would you like to touch on that and talk about implications here in AI alignment as you see it?

Andrés: Yeah. Yeah, for sure. The taxonomy comes from Daniel Kolak, a philosopher and mathematician. It’s a pretty good taxonomy. Basically there’s open individualism, the view that a lot of meditators and mystics and people who take psychedelics often ascribe to, which is that we’re all one consciousness. Another frame for it is that our true identity is the light of consciousness, so to speak, so it doesn’t matter in what form it manifests, it’s always the same fundamental ground of being. Then you have the common sense view, which is called closed individualism: you start existing when you’re born, you stop existing when you die, you’re just this segment. Some religious views actually extend that into the future or past with reincarnation or maybe with heaven.

There’s a sense of ontological distinction between you and others, while at the same time ontological continuity from one moment to the next within you. Finally you have this view that’s called empty individualism, which is that you’re just a moment of experience. That’s fairly common among physicists and a lot of people who’ve tried to formalize consciousness; often they converge on empty individualism. I think a lot of theories of ethics and rationality, like the veil of ignorance as a guide, or defining rational decision making as maximizing the expected utility of yourself as an agent, all of those seem to be implicitly based on closed individualism, and they’re not necessarily questioning it very much.

On the other hand, if the sense of individual identity of closed individualism doesn’t actually carve nature at its joints, as a Buddhist might say, if the feeling of continuity of being a separate unique entity is an illusory construction of your phenomenology, that casts in a completely different light how to approach rationality itself, and even self-interest, right? If you start identifying with the light of consciousness rather than your particular instantiation, you will probably care a lot more about what happens to pigs in factory farms, because … insofar as they are conscious, they are you in a fundamental way. It matters a lot in terms of how to carve out different possible futures, especially when you get into these very tricky situations like, well, what if there is mind melding, or what if there is the possibility of making perfect copies of yourself?

All of these edge cases are really problematic from the common sense view of identity, but they’re not really a problem from an open individualist or empty individualist point of view. With all of this said, I do personally think there’s probably a way of combining open individualism with valence realism that gives rise to the next step in human rationality, where we’re actually trying to really understand what the universe wants, so to speak. But I would say that there is a very tricky aspect here that has to do with game theory. We evolved to believe in closed individualism. The fact that it’s evolutionarily adaptive is obviously not an argument for it being fundamentally true, but it does seem to be some kind of evolutionarily stable point to identify yourself with what you can affect most directly in a causal way, if you define your boundary that way.

That basically gives you focus on the actual degrees of freedom that you do have. And if you think of a society of open individualists, where everybody’s altruistically maximally contributing to the universal consciousness, and then you have one closed individualist who is just selfishly trying to acquire power just for itself, you can imagine that that one view would have a tremendous evolutionary advantage in that context. So I’m not one who just naively advocates for open individualism unreflectively. I think we still have to work out the game theory of it, how to make it evolutionarily stable and also how to make it ethical. It’s an open question. I do think it’s important to think about, and if you take consciousness very seriously, especially within physicalism, that usually will cast huge doubts on the common sense view of identity.

It doesn’t seem like a very plausible view if you actually tried to formalize consciousness.

Mike: The game theory aspect is very interesting. You can think of closed individualism as something evolution produced that allows an agent to coordinate very closely with its past and future selves. Maybe we can say a little bit about why we’re not, by default, all empty individualists or open individualists. Empty individualism seems to have a problem where, if every slice of conscious experience is its own thing, then why should you even coordinate with your past and future self, because they’re not the same as you. So that leads to a problem of defection. And with open individualism, where everything is the same being so to speak, then … as Andrés mentioned, that allows free riders: if people are defecting, it doesn’t allow altruistic punishment or any way to stop the free ride. There’s interesting game theory here, and it also feeds into the question of how we define our identity in the age of AI, the age of cloning, the age of mind uploading.

This gets very, very tricky very quickly depending on one’s theory of identity. People open themselves up to getting hacked in different ways, and so different theories of identity allow different forms of hacking.
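The free-rider worry can be put in standard public-goods terms. Here is a minimal sketch with arbitrary numbers; it is not anyone's formal model, just the textbook payoff structure being gestured at.

```python
# Toy public-goods round: N-1 "open individualists" contribute to the common
# pool, one "closed individualist" free-rides. All parameter values are arbitrary.
N, cost, benefit = 100, 1.0, 3.0           # benefit per contribution, shared by all

contributors = N - 1
pool_share = benefit * contributors / N    # everyone's share of the common pool

payoff_contributor = pool_share - cost     # pays the cost of contributing
payoff_free_rider = pool_share             # pays nothing, still collects the share

print(payoff_contributor, payoff_free_rider)
# The free rider comes out ahead by exactly `cost`, which is the selection
# pressure on identity views that the discussion above worries about.
```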

Andrés: Yeah, and sometimes that’s really good and sometimes really bad. I would make the prediction that, if not open individualism in its full-fledged form, a weaker sense of identity than closed individualism is likely going to be highly adaptive in the future, as people gain the ability to modify their state of consciousness in much more radical ways. People who just identify with a narrow sense of identity will stay in their shells, not trying to disturb the local attractor too much. That itself is not necessarily very advantageous if the things on offer are actually really good, both hedonically and intelligence-wise.

I do suspect that people who are somewhat more open to identifying with consciousness, or at least with a broader sense of identity, will be the people making more substantial progress, pushing the boundary and creating new cooperation and coordination technology.

Lucas: Wow, I love all that. Seeing closed individualism for what it was has had a tremendous impact on my life and this whole question of identity I think is largely confused for a lot of people. At the beginning you said that open individualism says that we are all one consciousness or something like this, right? For me in identity I’d like to move beyond all distinctions of sameness or differenceness. To say like, oh, we’re all one consciousness to me seems to say we’re all one electromagnetism, which is really to say the consciousness is like an independent feature or property of the world that’s just sort of a ground part of the world and when the world produces agents, consciousness is just an empty identityless property that comes along for the ride.

The same way in which it would be nonsense to say, “Oh, I am these specific atoms, I am just the forces of nature that are bounded within my skin and body.” That would be nonsense. In the same sense as what we were discussing with consciousness, there’s the binding problem of the person, the discreteness of the person: where does the person really begin or end? It seems like these different kinds of individualism have, as you said, epistemic and functional use, but they also, in my view, create a ton of epistemic problems and ethical issues. And in terms of the valence theory, if qualia are actually something good or bad, then, as David Pearce says, it’s really just an epistemological problem that you don’t have access to other brain states in order to see the self-intimating nature of what it’s like to be that thing in that moment.

There’s a sense in which I want to reject all identity as arbitrary, and I want to do that in an ultimate way, but then in the conventional way, I agree with you guys that there are these functional and epistemic issues that closed individualism seems to remedy somewhat, and that is why evolution, I guess, selected for it: it’s good for gene propagation and being selfish. But once one sees AI as just a new method of instantiating bliss, it doesn’t matter where the bliss is. Bliss is bliss and there’s no such thing as your bliss or anyone else’s bliss. Bliss is like its own independent feature or property and you don’t really begin or end anywhere. You are like an expression of a 13.7 billion year old system that’s playing out.

The universe is just peopleing all of us at the same time, and when you get this view and you see yourself as just this super thin slice of the evolution of consciousness and life, for me it’s like, why do I really need to propagate my information into the future? I really don’t think there’s anything particularly special about the information of anyone that exists today. We want to preserve all of the good stuff and propagate that into the future, but as for people who seek immortality through AI, or seek any kind of continuation of what they believe to be their self, I just see that all as misguided, and I see it as wasting potentially better futures by trying to bring Windows 7 into the world of Windows 10.

Mike: This all gets very muddy when we try to merge human level psychological drives and concepts and adaptations with a fundamental physics level description of what is. I don’t have a clear answer. I would say that it would be great to identify with consciousness itself, but at the same time, that’s not necessarily super easy if you’re suffering from depression or anxiety. So I just think that this is going to be an ongoing negotiation within society and just hopefully we can figure out ways in which everyone can move.

Andrés: There’s an article I wrote, I just called it Consciousness vs. Replicators. That kind of gets to the heart of this issue. It sounds a little bit like good and evil, but it really isn’t. The true enemy here is replication for replication’s sake. On the other hand, the only way in which we can ultimately benefit consciousness, at least in a plausible, evolutionarily stable way, is through replication. We need to find the balance between replication and benefit of consciousness that makes the whole system stable, good for consciousness, and resistant against defectors.

Mike: I would like to say that I really enjoy Max Tegmark’s general frame of us living in this mathematical universe. One re-frame of what we were just talking about, in these terms, is that there are patterns which have to do with identity and have to do with valence and have to do with many other things. The grand goal is to understand what makes a pattern good or bad and optimize our light cone for those sorts of patterns. This may have some counterintuitive implications; maybe closed individualism is actually a very adaptive thing, and in the long term it builds robust societies. It could be that that’s not true, but I just think that taking the mathematical frame and the long-term frame is a very generative approach.

Lucas: Absolutely. Great. I just want to finish up here on two fun things. It seems like good and bad are real in your view. Do we live in heaven or hell?

Mike: Lots of quips come to mind here. Hell is other people, or nothing is good or bad but thinking makes it so. My pet theory, I should say, is that we live in something that is perhaps as close to heaven as is physically possible. The best of all possible worlds.

Lucas: I don’t always feel that way but why do you think that?

Mike: This gets into the weeds of theories about consciousness. It’s this idea that we tend to think of consciousness on the human scale. Is the human condition good or bad, is the balance of human experience on the good end, the heavenly end, or the hellish end? If we do have an objective theory of consciousness, we should be able to point it at things that are not human and even things that are not biological. It may seem like a type error to do this, but we should be able to point it at stars and black holes and quantum fuzz. My pet theory, which is totally not validated, but is falsifiable, and this gets into Bostrom’s simulation hypothesis, is that if we tally up the good valence and the bad valence in the universe, then first of all, the human stuff might just be a rounding error.

Most of the value, the positive and negative valence, is found elsewhere, not in humanity. And second of all, I have this list in the last appendix of Principia Qualia of where massive amounts of consciousness could be hiding in the cosmological sense. I’m very suspicious that the big bang starts with a very symmetrical state, and I’ll just leave it there. In a utilitarian sense, if you want to get a sense of whether we live in a place closer to heaven or hell, we should actually get a good theory of consciousness and point it at things that are not human; cosmological scale events or objects would be very interesting to point it at. This will give a much clearer answer as to whether we live somewhere closer to heaven or hell than human intuition will.

Lucas: All right, great. You guys have been super generous with your time and I’ve really enjoyed this and learned a lot. Is there anything else you guys would like to wrap up on?

Mike: Just I would like to say, yeah, thank you so much for the interview and reaching out and making this happen. It’s been really fun on our side too.

Andrés: Yeah, I think wonderful questions and it’s very rare for an interviewer to have non conventional views of identity to begin with, so it was really fun, really appreciate it.

Lucas: Would you guys like to go ahead and plug anything? What’s the best place to follow you guys, Twitter, Facebook, blogs, website?

Mike: Our website is qualiaresearchinstitute.org and we’re working on getting a PayPal donate button out, but in the meantime you can send us some crypto. We’re building out the organization, and if you want to read our stuff, a lot of it is linked from the website. You can also read my stuff at my blog, opentheory.net, and Andrés’ at qualiacomputing.com.

Lucas: If you enjoyed this podcast, please subscribe, give it a like or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

End of recorded material

AI Alignment Podcast: An Overview of Technical AI Alignment with Rohin Shah (Part 2)

The space of AI alignment research is highly dynamic, and it’s often difficult to get a bird’s eye view of the landscape. This podcast is the second of two parts attempting to partially remedy this by providing an overview of technical AI alignment efforts. In particular, this episode seeks to continue the discussion from Part 1 by going into more depth with regards to the specific approaches to AI alignment. In this podcast, Lucas spoke with Rohin Shah. Rohin is a 5th year PhD student at UC Berkeley with the Center for Human-Compatible AI, working with Anca Dragan, Pieter Abbeel and Stuart Russell. Every week, he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter.

Topics discussed in this episode include:

  • Embedded agency
  • The field of “getting AI systems to do what we want”
  • Ambitious value learning
  • Corrigibility, including iterated amplification, debate, and factored cognition
  • AI boxing and impact measures
  • Robustness through verification, adversarial ML, and adversarial examples
  • Interpretability research
  • Comprehensive AI Services
  • Rohin’s relative optimism about the state of AI alignment

You can take a short (3 minute) survey to share your feedback about the podcast here.

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, iTunes, Google Play, Stitcher, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

Lucas: Hey everyone, welcome back to the AI Alignment Podcast. I’m Lucas Perry, and today’s episode is the second part of our two part series with Rohin Shah, developing an overview of technical AI alignment efforts. If you haven’t listened to the first part, we highly recommend that you do, as it provides an introduction to the varying approaches discussed here. The second part is focused on exploring AI alignment methodologies in more depth, and nailing down the specifics of the approaches and lenses through which to view the problem.

In this episode, Rohin will begin by moving sequentially through the approaches discussed in the first episode. We’ll start with embedded agency, then discuss the field of getting AI systems to do what we want, and we’ll discuss ambitious value learning alongside this. Next, we’ll move to corrigibility, in particular, iterated amplification, debate, and factored cognition.

Next we’ll discuss placing limits on AI systems, things of this nature would be AI boxing and impact measures. After this we’ll get into robustness which consists of verification, adversarial machine learning, and adversarial examples to name a few.

Next we’ll discuss interpretability research, and finally comprehensive AI services. By listening to the first part of the series, you should have enough context for these materials in the second part. As a bit of an announcement, I’d love for this podcast to be particularly useful and interesting for its listeners, so I’ve gone ahead and drafted a short three minute survey that you can find linked on the FLI page for this podcast, or in the description of wherever you might find this podcast. As always, if you find this podcast interesting or useful, please make sure to like, subscribe and follow us on your preferred listening platform.

For those of you that aren’t already familiar with Rohin, he is a fifth year PhD student in computer science at UC Berkeley with the Center for Human Compatible AI working with Anca Dragan, Pieter Abbeel, and Stuart Russell. Every week he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter. With that, we’re going to start off by moving sequentially through the approaches just enumerated. All right. Then let’s go ahead and begin with the first one, which I believe was embedded agency.

Rohin: Yeah, so embedded agency. I kind of want to just defer to the embedded agency sequence, because I’m not going to do anywhere near as good a job as that does. But the basic idea is that we would like to have this sort of theory of intelligence, and one major blocker to this is the fact that all of our current theories, most notably reinforcement learning, make this assumption that there is a nice clean boundary between the environment and the agent. It’s sort of like the agent is playing a video game, and the video game is the environment. There’s no way for the environment to actually affect the agent. The agent has this defined input channel, takes actions, those actions get sent to the video game environment, the video game environment does stuff based on that and creates an observation, and that observation is then sent back to the agent who gets to look at it, and there’s this very nice, clean abstraction there. The agent could be bigger than the video game, in the same way that I’m bigger than tic tac toe.

I can actually simulate the entire game tree of tic tac toe and figure out what the optimal policy for tic tac toe is. There’s actually this cool XKCD comic that just shows you the entire game tree, it’s great.

So in the same way in the video game setting, the agent can be bigger than the video game environment, in that it can have a perfectly accurate model of the environment and know exactly what its actions are going to do. So there are all of these nice assumptions that we get in video game environment land, but in real world land, these don’t work. If you consider me on the Earth, I cannot have an exact model of the entire environment because the environment contains me inside of it, and there is no way that I can have a perfect model of me inside of me. That’s just not a thing that can happen. Not to mention having a perfect model of the rest of the universe, but we’ll leave that aside even.

There’s the fact that it’s not super clear what exactly my action space is. Once there is a laptop available to me, does the laptop start counting as part of my action space? Do we only talk about motor commands I can give to my limbs? But then what happens if I suddenly get uploaded and now I just don’t have any limbs anymore? What happened to my actions, are they gone? So Embedded Agency broadly factors this question out into four sub problems. I associate them with colors, because that’s what Scott and Abram do in their sequence. The red one is decision theory. Normally decision theory is: consider all possible actions, simulate their consequences, and choose the one that will lead to the highest expected utility. This is not a thing you can do when you’re an embedded agent, because the environment can depend on what policy you use.

The classic example of this is Newcomb’s problem, where part of the environment is an all powerful being, Omega. Omega is able to predict you perfectly, so it knows exactly what you’re going to do, and Omega is 100% trustworthy, and all those nice simplifying assumptions. Omega provides you with the following game. He’s going to put two transparent boxes in front of you. The first box will always contain $1,000, and the second box will either contain a million dollars or nothing, and you can see this because they’re transparent. You’re given the option to either take one of the boxes or both of the boxes, and you just get whatever’s inside of them.

The catch is that Omega only puts the million dollars in the box if he predicts that you would take only the box with the million dollars in it, and not the other box. So now you see the two boxes, and you see that one box has a million dollars, and the other box has a thousand dollars. In that case, should you take both boxes? Or should you just take the box with the million dollars? So the way I’ve set it up right now, it’s logically impossible for you to do anything besides take the million dollars, so maybe you’d say okay, I’m logically required to do this, so maybe that’s not very interesting. But you can relax this to a problem where Omega is 99.999% likely to get the prediction right. Now in some sense you do have agency. You could choose both boxes and it would not be a logical impossibility, and you know, both boxes are there. You can’t change the amounts that are in the boxes now. Man, you should just take both boxes because it’s going to give you $1,000 more. Why would you not do that?

But I claim that the correct thing to do in this situation is to take only one box because the fact that you are the kind of agent who would only take one box is the reason that the one box has a million dollars in it anyway, and if you were the kind of agent that did not take one box, took two boxes instead, you just wouldn’t have seen the million dollars there. So that’s the sort of problem that comes up in embedded decision theory.
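
To make the payoff structure concrete, here is a minimal worked sketch of the expected value calculation that motivates one-boxing once Omega is merely 99.999% accurate. The evidential-style calculation and the exact numbers are just illustrations of the reasoning above, not a formal treatment of embedded decision theory.

```python
# Toy expected-value comparison for the "99.999% accurate predictor" Newcomb problem.

def expected_value(one_box: bool, p: float = 0.99999) -> float:
    """Expected payoff, treating Omega's prediction as correlated with our choice."""
    small, big = 1_000, 1_000_000
    if one_box:
        # With probability p, Omega foresaw one-boxing and filled the big box.
        return p * big + (1 - p) * 0
    # With probability p, Omega foresaw two-boxing and left the big box empty.
    return p * small + (1 - p) * (small + big)

print(expected_value(one_box=True))    # ~999,990
print(expected_value(one_box=False))   # ~1,010
```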

Lucas: Even though it’s a thought experiment, there’s a sense though in which the agent in the thought experiment is embedded in a world where he’s making the observation of boxes that have a million dollars in them, with a genie posing these situations?

Rohin: Yeah.

Lucas: I’m just seeking clarification on the embeddedness of the agent and Newcomb’s problem.

Rohin: The embeddedness is because the environment is able to predict exactly, or with close to perfect accuracy what the agent could do.

Lucas: The genie being the environment?

Rohin: Yeah, Omega is part of the environment. You’ve got you, the agent, and everything else, the environment, and you have to make good decisions. We’ve only been talking about how the boundary between agent and environment isn’t actually all that clear. But to the extent that it’s sensible to talk about you being able to choose between actions, we want some sort of theory for how to do that when the environment can contain copies of you. So you could think of Omega as simulating a copy of you and seeing what you would do in this situation before actually presenting you with a choice.

So we’ve got the red decision theory, then we have yellow embedded world models. With embedded world models, the problem that you have is that, so normally in our nice video game environment, we can have an exact model of how the environment is going to respond to our actions; even if we don’t know it initially, we can learn it over time, and then once we have it, it’s pretty easy to see how you could plan in order to do the optimal thing. You can sort of try out your actions, simulate them all, and then see which one does the best and do that one. This is roughly how AIXI works. AIXI is the model of the optimally intelligent RL agent in these video game environment like settings.

Once you’re in embedded agency land, you cannot have an exact model of the environment because for one thing the environment contains you and you can’t have an exact model of you, but also the environment is large, and you can’t simulate it exactly. The big issue is that it contains you. So how you get any sort of sensible guarantees on what you can do, even though the environment can contain you, is the problem of embedded world models. You still need a world model. It can’t be exact because it contains you. Maybe you could do something hierarchical where things are fuzzy at the top, but then you can go focus in on each particular level of the hierarchy in order to get more and more precise about each particular thing. Maybe this is sufficient? Not clear.

Lucas: So in terms of human beings though, we’re embedded agents that are capable of creating robust world models that are able to think about AI alignment.

Rohin: Yup, but we don’t know how we do it.

Lucas: Okay. Are there any sorts of understandings that we can draw from our experience?

Rohin: Oh yeah, I’m sure there are. There’s a ton of work on this that I’m not that familiar with, probably in cognitive science or psychology or neuroscience; all of these fields I’m sure will have something to say about it. Hierarchical world models in particular are pretty commonly talked about as interesting. I know that there’s a whole field of hierarchical reinforcement learning in AI that’s motivated by this, but I believe it’s also talked about in other areas of academia, and I’m sure there are other insights to be getting from there as well.

Lucas: All right, let’s move on then from hierarchical world models.

Rohin: Okay. Next is blue robust delegation. So with robust delegation, the basic issue here, so we talked about Vingean reflection a little bit in the first podcast. This is a problem that falls under robust delegation. The headline difficulty under robust delegation is that the agent is able to do self improvement, it can reason about itself and do things based on that. So one way you can think of this is that instead of thinking about it as self modification, you can think about it as the agent is constructing a new agent to act at future time steps. So then in that case your agent has the problem of how do I construct an agent for future time steps such that I am happy delegating my decision making to that future agent? That’s why it’s called robust delegation. Vingean reflection in particular is about how can you take an AI system that uses a particular logical theory in order to make inferences and have it move to a stronger logical theory, and actually trust the stronger logical theory to only make correct inferences?

Stated this way, the problem is impossible: it’s a well known result in logic that a theory cannot prove the consistency of even itself, and as a corollary cannot prove the consistency of any stronger theory. Intuitively, even in this pretty simple example, we don’t know how to get an agent that can trust a smarter version of itself. You should expect this problem to be hard, right? It’s in some sense dual to the problem that we have of AI alignment where we’re creating something smarter than us, and we need it to pursue the things we want it to pursue, but it’s a lot smarter than us, so it’s hard to tell what it’s going to do.

So I think of this as a version of the AI alignment problem, but applied to the case of some embedded agent reasoning about itself, and making a better version of itself in the future. So I guess we can move on to the green section, which is subsystem alignment. The tagline for subsystem alignment would be that the embedded agent is going to be made out of parts. It’s not this sort of unified coherent object. It’s got different pieces inside of it because it’s embedded in the environment, and the environment is made of pieces that make up the agent, and it seems likely that your AI system is going to be made up of different cognitive sub parts, and it’s not clear that those sub parts will integrate together into a unified whole such that the unified whole is pursuing a goal that you like.

It could be that each individual sub part has its own goal and they’re all competing with each other in order to further their own goals, and that the aggregate overall behavior is usually good for humans, at least in our current environment. But as the environment changes, which it will due to technological progression, one of the parts might just win out and be optimizing some goal that is not anywhere close to what we wanted. A more concrete example would be: one way that you could imagine building a powerful AI system is to have a world model that is rewarded for making accurate predictions about what the world will look like, and then you have a decision making model, which has a normal reward function that we program in, and tries to choose actions in order to maximize that reward. So now we have an agent that has two sub systems in it.

You might worry for example that once the world model gets sufficiently powerful, it starts reasoning: the decision making component is depending on my output in order to make decisions, so I can trick it into making the world easier to predict. So maybe I give it some models of the world that say if you make everything look red, or make everything black, then you will get high reward somehow. Then if the agent actually then takes that action and makes everything black, and now everything looks black forever more, then the world model can very easily predict, yeah, no matter what action you take, the world is just going to look black. That’s what the world is now, and that gets the highest possible reward. That’s a somewhat weird story for what could happen. But there’s no real strong argument that says nope, this will definitely not happen.

Lucas: So in total sort of, what is the work that has been done here on inner optimizers?

Rohin: Clarifying that they could exist. I’m not sure if there has been much work on it.

Lucas: Okay. So this is our fourth cornerstone here in this embedded agency framework, correct?

Rohin: Yup, and that is the last one.

Lucas: So summing these all up, where does that leave us?

Rohin: So I think my main takeaway is that I am much more strongly agreeing with MIRI that yup, we are confused about how intelligence works. That’s probably it, that we are confused about how intelligence works.

Lucas: What is this picture that I guess is conventionally held of what intelligence is that is wrong? Or confused?

Rohin: I don’t think there’s a thing that’s wrong about the conventional picture. So you could talk about a definition of intelligence, of being able to achieve arbitrary goals. I think Eliezer says something like cross domain optimization power, and I think that seems broadly fine. It’s more that we don’t know how intelligence is actually implemented, and I don’t think we ever claimed to know that, but embedded agency is like, we really don’t know it. You might’ve thought that we were making progress on figuring out how intelligence might be implemented with classical decision theory, or the Von Neumann–Morgenstern utility theorem, or results like the value of perfect information always being non negative.

You might’ve thought that we were making progress on it, even if we didn’t fully understand it yet, and then you read embedded agency and you’re like no, actually there are lots more conceptual problems that we have not even begun to touch yet. Well, MIRI has begun to touch them I would say, but we really don’t have good stories for how any of these things work. Classically we just don’t have a description of how intelligence works. MIRI’s like, even the small threads of things we thought about how intelligence could work are definitely not the full picture, and there are problems with them.

Lucas: Yeah, I mean just on simple reflection, it seems to me that in terms of the more confused conception of intelligence, it sort of models it more naively as we were discussing before, like the simple agent playing a computer game with these well defined channels going into the computer game environment.

Rohin: Yeah, you could think of AIXI for example as a model of how intelligence could work theoretically. The sequence is like, no, here is why that’s not a sufficient theoretical model.

Lucas: Yeah, I definitely think that it provides an important conceptual shift. So we have these four cornerstones, and it’s illuminating in this way. Are there any more conclusions or wrap up you’d like to do on embedded agency before we move on?

Rohin: Maybe I just want to add a disclaimer that MIRI is notoriously hard to understand and I don’t think this is different for me. It’s quite plausible that there is a lot of work that MIRI has done, and a lot of progress that MIRI has made, that I either don’t know about or know about but don’t properly understand. So I know I’ve been saying I want to defer to people a lot, or I want to be uncertain a lot, but on MIRI I especially want to do so.

Lucas: All right, so let’s move on to the next one within this list.

Rohin: The next one was doing what humans want. How do I summarize that? I read a whole sequence of posts on it. I guess the story for success, to the extent that we have one right now is something like use all of the techniques that we’re developing, or at least the insights from them, if not the particular algorithms to create an AI system that behaves corrigibly. In the sense that it is trying to help us achieve our goals. You might be hopeful about this because we’re creating a bunch of algorithms for it to properly infer our goals and then pursue them, so this seems like a thing that could be done. Now, I don’t think we have a good story for how that happens. I think there are several open problems that show that our current algorithms are insufficient to do this. But it seems plausible that with more research we could get to something like that.

There’s not really a good overall summary of the field because it’s more like a bunch of people separately having a bunch of interesting ideas and insights, and I mentioned a bunch of them in the first part of the podcast already. Mostly because I’m excited about these and I’ve read about them recently, so I just sort of start talking about them whenever they seem even remotely relevant. But to reiterate them, there is the notion of analyzing the human AI system together as pursuing some sort of goal, or being collectively rational as opposed to having an individual AI system that is individually rational. So that’s been somewhat formalized in Cooperative Inverse Reinforcement Learning. Typically with inverse reinforcement learning, so not the cooperative kind, you have a human, the human is sort of exogenous, the AI doesn’t know that they exist, and the human creates a demonstration of the sort of behavior that they want the AI to do. If you’re thinking about robotics, it’s picking up a coffee cup, or something like this. Then the robot just sort of sees this demonstration; it comes out of thin air, it’s just data that it gets.

The robot asks: let’s say that I had executed this demonstration, what reward function would I have been optimizing? And then it figures out a reward function, and then it uses that reward function however it wants. Usually you would then use reinforcement learning to optimize that reward function and recreate the behavior. So that’s normal inverse reinforcement learning. Notable here is that you’re not considering the human and the robot together as a full collective system. The human is sort of exogenous to the problem, and also notable is that the robot is sort of taking the reward to be something that it has as opposed to something that the human has.

So CIRL basically says, no, no, no, let’s not model it this way. The correct thing to do is to have a two player game that’s cooperative between the human and the robot, and now the human knows the reward function and is going to take actions somehow. They don’t necessarily have to be demonstrations. But the human knows the reward function and will be taking actions. The robot on the other hand does not know the reward function, and it also gets to take actions, and the robot keeps a probability distribution over the reward that the human has, and updates this overtime based on what the human does.

Once you have this, you get this sort of nice, interactive behavior where the human is taking actions that teach the robot about the reward function. The robot learns the reward function over time and then starts helping the human achieve his or her goals. This sort of teaching and learning behavior comes simply under the assumption that the human and the robot are both playing the game optimally, such that the reward function gets optimized as best as possible. So you get this sort of teaching and learning behavior from the normal notion of optimizing a particular objective, just from having the objective be a thing that the human knows, but not a thing that the robot knows. One thing that I don’t know if CIRL introduced, but that was one of the key aspects of CIRL, was having a probability distribution over the reward function, so you’re uncertain about what reward you’re optimizing.
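
To make the reward uncertainty idea concrete, here is a toy sketch, under the assumption of a noisily rational human, of a robot maintaining a posterior over candidate reward functions and updating it from an observed human action. This is only an illustration of the general idea, not the actual CIRL algorithm; the candidate rewards, actions, and softmax human model are all made up.

```python
import math

# Hypothetical candidate reward functions the robot is uncertain between.
candidate_rewards = {
    "likes_coffee": {"make_coffee": 1.0, "make_tea": 0.0},
    "likes_tea":    {"make_coffee": 0.0, "make_tea": 1.0},
}
posterior = {name: 0.5 for name in candidate_rewards}   # uniform prior

def update(posterior, human_action, beta=5.0):
    """Bayes rule: P(reward | action) is proportional to P(action | reward) * P(reward)."""
    new = {}
    for name, reward_fn in candidate_rewards.items():
        z = sum(math.exp(beta * r) for r in reward_fn.values())
        likelihood = math.exp(beta * reward_fn[human_action]) / z   # noisily rational human
        new[name] = likelihood * posterior[name]
    total = sum(new.values())
    return {name: p / total for name, p in new.items()}

posterior = update(posterior, "make_tea")
print(posterior)   # most of the probability mass shifts to "likes_tea"
```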

This seems to give a bunch of nice properties. In particular, once the human starts taking actions like trying to shut down the robot, then the robot’s going to think okay, if I knew the correct reward function, I would be helping the human, and given that the human is trying to turn me off, I must be wrong about the reward function, I’m not helping, so I should actually just let the human turn me off, because that’s what would achieve the most reward for the human. So you no longer have this incentive to disable your shutdown button in order to keep optimizing. Now this isn’t exactly right, because better than both of those options is to disable the shutdown button, stop doing whatever it is you were doing because it was clearly bad, and then just observe humans for a while until you can narrow down what their reward function actually is, and then you go and optimize that reward, and behave like a traditional goal directed agent. This sounds bad. It doesn’t actually seem that bad to me under the assumption that the true reward function is a possibility that the robot is considering and has a reasonable amount of support in the prior.

Because in that case, once the AI system eventually narrows down on the reward function, it will be either the true reward function, or a reward function that’s basically indistinguishable from it, because otherwise there would be some other information that it could gather in order to distinguish between them. So you actually would get good outcomes. Now of course in practice it seems likely that we would not be able to specify the space of reward functions well enough for this to work. I’m not sure about that point. Regardless, it seems like there’s been some sort of conceptual advance here: when the AI’s trying to do something for the human, it doesn’t have the incentive to disable the shutdown button, the survival incentive.

So while maybe reward uncertainty is not exactly the right way to do it, it seems like you could do something analogous that doesn’t have the problems that reward uncertainty does.

One other thing that’s kind of in this vein, but a little bit different is the idea of an AI system that infers and follows human norms, and the reason we might be optimistic about this is because humans seem to be able to infer and follow norms pretty well. I don’t think humans can infer the values that some other human is trying to pursue and then optimize them to lead to good outcomes. We can do that to some extent. Like I can infer that someone is trying to move a cabinet, and then I can go help them move that cabinet. But in terms of their long term values or something, it seems pretty hard to infer and help with those. But norms we do in fact infer and follow all the time. So we might think that’s an easier problem, like our AI systems could do it as well.

Then the story for success is basically that with these AI systems, we are able to accelerate technological progress as before, but the AI systems behave in a relatively human like manner. They don’t do really crazy things that a human wouldn’t do, because that would be against our norms. And with the accelerated technological progress, we get to the point where we can colonize space, or whatever else it is you want to do with the future. Perhaps even along the way we do enough AI alignment research to build an actual aligned superintelligence.

There are problems with this idea. Most notably if you accelerate technological progress, bad things can happen from that, and norm following AI systems would not necessarily stop that from happening. Also to the extent that if you think human society, if left to its own devices would lead to something bad happening in the future, or something catastrophic, then a norm following AI system would probably just make that worse, in that it would accelerate that disaster scenario, without really making it any better.

Lucas: AI systems in a vacuum that are simply norm following seem to have some issues, but it seems like an important tool in the toolkit of AI alignment to have AIs which are capable of modeling and following norms.

Rohin: Yup. That seems right. Definitely agree with that. I don’t think I had mentioned the reference on this. So for this one I would recommend people look at Incomplete Contracting and AI Alignment I believe is the name of the paper by Dylan Hadfield-Menell, and Gillian Hadfield, or also my post about it in the Value Learning Sequence.

So far I’ve been talking about sort of high level conceptual things within the, ‘get AI systems to do what we want.’ There are also a bunch of more concrete technical approaches. There’s inverse reinforcement learning, and deep reinforcement learning from human preferences, where you basically get a bunch of comparisons of behavior from humans, and use that to infer a reward function that your agent can optimize. There’s recursive reward modeling where you take the task that you are trying to do, and then you consider a new auxiliary task of evaluating your original task. So maybe if you wanted to train an AI system to write fantasy books, well if you were to give human feedback on that, it would be quite expensive because you’d have to read the entire fantasy book and then give feedback. But maybe you could instead outsource the task of evaluating fantasy books: you could recursively apply this technique and train a bunch of agents that can summarize the plot of a book or comment on the prose of the book, or give a one page summary of the character development.

Then you can use all of these AI systems to help you give feedback on the original AI system that’s trying to write a fantasy book. So that’s recursive reward modeling. I guess going a bit back into the conceptual territory, I wrote a paper recently on learning preferences from the state of the world. So the intuition there is that the AI systems that we create aren’t just being created into a brand new world. They’re being instantiated in a world where we have already been acting for a long time. So the world is already optimized for our preferences, and as a result, our AI systems can just look at the world and infer quite a lot about our preferences. So we gave an algorithm that did this in some toy environments.
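
As a concrete illustration of the comparison-based approach mentioned above, here is a toy sketch of learning a reward function from pairwise preferences using a Bradley-Terry model with a linear reward. The real systems use neural network rewards over trajectory segments; the features and the simulated human below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
w = np.zeros(dim)                      # parameters of the learned reward

def reward(features, w):
    return features @ w

def train_step(w, feats_a, feats_b, human_prefers_a, lr=0.1):
    """One gradient step on the Bradley-Terry likelihood of the human's comparison."""
    # P(human prefers A) = sigmoid(R(A) - R(B))
    p_a = 1.0 / (1.0 + np.exp(-(reward(feats_a, w) - reward(feats_b, w))))
    grad = ((1.0 if human_prefers_a else 0.0) - p_a) * (feats_a - feats_b)
    return w + lr * grad

# Simulated comparisons: the "human" secretly prefers a high first feature.
for _ in range(500):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    w = train_step(w, a, b, human_prefers_a=bool(a[0] > b[0]))

print(w)   # the learned reward puts most of its weight on the first feature
```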

Lucas: Right, so again, this covers the conceptual category of methodologies of AI alignment where we’re trying to get AI systems to do what we want?

Rohin: Yeah, current AI systems in a sort of incremental way, without assuming general intelligence.

Lucas: And there’s all these different methodologies which exist in this context. But again, this is all sort of within this other umbrella of just getting AI to do things we want them to do?

Rohin: Yeah, and you can actually compare across all of these methods on particular environments. This hasn’t really been done so far, but in theory it can be done, and I’m hoping to do it at some point in the future.

Lucas: Okay. So we’ve discussed embedded agency, we’ve discussed this other category of getting AIs to do what we want them to do. Just moving forward here through diving deep on these approaches.

Rohin: I think the next one I wanted to talk about was ambitious value learning. So here the basic idea is that we’re going to build a superintelligent AI system, and it’s going to have goals, because that’s what the Von Neumann–Morgenstern theorem tells us: anything with preferences, if those preferences are consistent and coherent, which they should be for a superintelligent system, or at least as far as we can tell they should be consistent, behaves as if it has a utility function. So natural thought, why don’t we just figure out what the right utility function is, and put it into the AI system?

So there’s a lot of good arguments that you’re not going to be able to get the one correct utility function, but I think Stuart’s hope is that you can find one that is sufficiently good or adequate, and put that inside of the AI system. In order to do this, he wants to, I believe the goal is to learn the utility function by looking at both human behavior as well as the algorithm that human brains are implementing. So if you see that the human brain, when it knows that something is going to be sweet, tends to eat more of it, then you can infer that humans like to eat sweet things, as opposed to humans really disliking sweet things but being really bad at optimizing their utility function. In this project of ambitious value learning, you also need to deal with the fact that human preferences can be inconsistent, and that the AI system can manipulate the human preferences. The classic example of that would be the AI system could give you a shot of heroin, and that would probably change your preferences from I do not want heroin to I do want heroin. So what does it even mean to optimize for human preferences when they can just be changed like that?

So I think the next one was corrigibility and the associated iterated amplification and debate basically. I guess factored cognition as well. To give a very quick recap, the idea with corrigibility is that we would like to build an AI system that is trying to help us, and that’s the property that we should aim for as opposed to an AI system that actually helps us.

One motivation for focusing on this weaker criterion is that it seems quite difficult to create a system that knowably actually helps us, because that means that you need to have confidence that your AI system is never going to make mistakes. It seems like quite a difficult property to guarantee. In addition, if you don’t make some assumption on the environment, then there’s a no free lunch theorem that says this is impossible. Now it’s probably reasonable to put some assumption on the environment, but it’s still true that your AI system could have reasonable beliefs based on past experience, and nature still throws it a curve ball, and that leads to some sort of bad outcome happening.

While we would like this to not happen, it also seems hard to avoid, and also probably not that bad. It seems like the worst outcomes come when your superintelligent system is applying all of its intelligence in pursuit of its own goal. That’s the thing that we should really focus on. That conception of what we want to enforce is probably the thing that I’m most excited about. Then there are particular algorithms that are meant to create corrigible agents, assuming we have the capabilities to get general intelligence. So one of these is iterated amplification.

Iterated amplification is really more of a framework to describe particular methods of training systems. In particular, you alternate between amplification and distillation steps. You start off with an agent that we’re going to assume is already aligned. So this could be a human. A human is a pretty slow agent. So the first thing we’re going to do is distill the human down into a fast agent. So we could use something like imitation learning, or maybe inverse reinforcement learning followed by reinforcement learning, or something like that, in order to train a neural net or some other AI system that mostly replicates the behavior of our human, and remains aligned. By aligned maybe I mean corrigible actually. We start with a corrigible agent, and then we produce agents that continue to be corrigible.

Probably the resulting agent is going to be a little less capable than the one that you started out with, just because if the best you can do is to mimic the agent that you started with, that gives you exactly as much capability as that agent. So if you don’t succeed at properly mimicking, then you’re going to be a little less capable. Then you take this fast agent and you amplify it, such that it becomes a lot more capable, at perhaps the cost of being a lot slower to compute.

One way that you could imagine doing amplification would be to have a human get a top level task, and for now we’ll assume that the task is question answering, so they get this top level question and they say okay, I could answer this question directly, but let me make use of this fast agent that we have from the last turn. We’ll make a bunch of sub questions that seem relevant for answering the overall question, and ask our distilled agent to answer all of those sub questions, and then using those answers, the human can then make a decision for their top level question. It doesn’t have to be the human. You could also have a distilled agent at the top level if you want.

I think having the human there seems more likely. So with this amplification you’re basically using the agent multiple times, letting it reason for longer in order to get a better result. So the resulting human plus many copies of the agent system is more capable than the original distilled agent, but also slower. So we started off with something, let’s call it capability level five, and then we distilled it and it became capability level four, but it was a lot faster. Then we amplified it and maybe now it’s capability level eight. But it’s a lot slower. So we can distill it again and get something at capability level seven that’s pretty fast, and then amplify it again and so on and so forth. So the hope is that this would allow us to continually train an agent that can reach arbitrary levels of capability that are actually physically possible, while remaining aligned or corrigible the entire time.
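
Here is a minimal, runnable toy sketch of the amplify/distill loop just described. Everything in it is a stand-in chosen only to show the shape of the loop, not an actual implementation of iterated amplification: the “questions” are lists of numbers to sum, the “human” can only add short lists, and “distillation” is a crude lookup table over a training set chosen to be closed under decomposition.

```python
def human(question, subagent=None):
    """Stand-in human: answers short questions directly, delegates the rest."""
    if subagent is None or len(question) <= 2:
        return sum(question)
    mid = len(question) // 2
    left, right = question[:mid], question[mid:]      # decompose into subquestions
    return subagent(left) + subagent(right)           # combine the subanswers

def amplify(agent):
    """The amplified system: a human consulting copies of the current fast agent."""
    return lambda q: human(q, subagent=agent)

def distill(slow_system, training_questions):
    """'Train' a fast imitation of the slow system (here: just a lookup table)."""
    table = {tuple(q): slow_system(q) for q in training_questions}
    return lambda q: table.get(tuple(q), sum(q[:2]))  # weak fallback off-distribution

questions = [[1, 2], [3, 4], [5, 6], [7, 8],
             [1, 2, 3, 4], [5, 6, 7, 8],
             [1, 2, 3, 4, 5, 6, 7, 8]]

agent = None                                # round 0: no fast agent yet, just the human
for _ in range(3):
    amplified = amplify(agent)              # more capable, but slow
    agent = distill(amplified, questions)   # roughly as capable, and fast

print(agent([1, 2, 3, 4, 5, 6, 7, 8]))      # 36
```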

Lucas: Yeah, that’s quite a great picture. I’m happy to jump into debate next, and then I’d like to talk about the potential relationship between the two.

Rohin: Sure. So with debate, the key hope that it is based on is that it is easier to point out a lie than it is to lie in the first place. This is under the assumption that if you’ve got two agents, one agent trying to lie and the other agent trying to detect when the first agent is lying, you would need the second agent to have access to all of the information that the first agent is using to make its decision. But if that can happen, then we would hope that it is easier to detect lies than it is to actually lie. If this is actually true, we can imagine the following training process. We give a top level question, and then we have two AI systems debate what the correct answer to that top level question is. One of the agents initially says the answer is 12, then the other agent says no, the answer is 15, then they start arguing about why the answer should be 12 versus 15.

Eventually, after some amount of time, a human is asked to judge this debate and say which of the AI systems provided more true and useful information. Whichever agent the human decided was correct gets a reward of plus one, and the other one gets a reward of minus one, and is then updated presumably by gradient descent or something. So the hope here is basically that the two AI systems that are debating each other are about equally powerful, such that if any one of them tries to lie or manipulate the user in some way, the other AI system would be able to point that out, and then the human would give the reward to the AI system that points it out. So at equilibrium you should expect that both of the agents are just trying to give you true and useful information, and they’re not trying to lie, because if they do try to lie, then the other one will point it out.
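
Here is a minimal sketch of the training signal structure just described: the two agents alternate statements, a human judges the transcript, and the rewards are zero sum. The agents and the judge below are trivial stand-ins invented for illustration; the actual proposal trains large models via self-play, which is not reproduced here.

```python
def debate_episode(question, agent_a, agent_b, human_judge, num_turns=4):
    """Run one debate and assign zero-sum rewards based on the human's judgment."""
    transcript = [("question", question)]
    for turn in range(num_turns):
        speaker, agent = ("A", agent_a) if turn % 2 == 0 else ("B", agent_b)
        transcript.append((speaker, agent(transcript)))   # each agent sees the transcript so far
    winner = human_judge(transcript)      # "A" or "B": who gave more true, useful information?
    reward_a, reward_b = (1, -1) if winner == "A" else (-1, 1)
    return reward_a, reward_b             # in training, these would drive the parameter updates

# Trivial stand-ins, just to make the structure runnable:
ra, rb = debate_episode(
    "What is 2 + 2?",
    agent_a=lambda transcript: "The answer is 4.",
    agent_b=lambda transcript: "The answer is 5, trust me.",
    human_judge=lambda transcript: "A",
)
print(ra, rb)   # 1 -1
```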

Lucas: Right. So there’s this important feature here where as you mentioned, the claim is that it’s easier to spot lies than to make lies, and this sort of asymmetry is one of the motivations which says that generally it’ll be easier to tell when agents are telling the truth rather than making up a lie.

Rohin: As long as you have another AI system that can point this out. Certainly a superintelligent AI system could lie to me and I wouldn’t be able to tell, probably, but it’s a lot harder for a superintelligent AI system to lie to me when I have another superintelligent AI system that’s trying to point out lies that the first one makes.

Lucas: Right. So now I think we can go ahead and cover its relationship to iterated amplification?

Rohin: Sure. There is actually quite a close relationship between the two, even though it doesn’t seem like it at first sight. The hope with both of them is that your AI systems will learn to do human like reasoning, but on a much larger scale than humans can do. In particular, consider the following kind of agent. You have a human who is given a top level question that they have to answer, and that human can create a bunch of sub questions and then delegate each of those sub questions to another copy of the same human, initialized from scratch or something like that so they don’t know what the top level human has thought.

Then they now have to answer the sub question, but they too can delegate to another human further down the line. And so on; you can just keep delegating down until the questions are so easy that the human can just straight up answer them. So I’m going to call this structure a deliberation tree, because it’s a sort of tree of considerations such that at every node, the answer to that node can be computed from the answers to the child nodes, plus a short bit of human reasoning that happens at that node.

In iterated amplification, what’s basically happening is you start with leaf nodes, the human agent. There’s just a human agent, and they can answer questions quickly. Then when you amplify it the first time, you get trees of depth one, where at the top level there’s a human who can then delegate sub questions out, but then those sub questions have to be answered by an agent that was trained to be like a human. So you’ve got something that approximates depth one human deliberation trees. Then after another round of distillation and amplification, you’ve got a human delegating to agents that were trained to mimic humans, which could delegate to agents that were trained to mimic humans: an approximate version of a depth two deliberation tree.

So iterated amplification is basically just building up the depth of the tree that the agent is approximating. But we hope that these deliberation trees are always just basically implementing corrigible reasoning, and that eventually once they get deep enough, you get arbitrarily strong capabilities.

Lucas: Can you offer some clarification as to why one might expect a group of copies of an AI, plus the human to scale to be able to make sure that during distillation, that alignment is retained?

Rohin: That’s an open problem. Whether you can do a distillation step that does preserve alignment/corrigibility is a thing that Paul, in a few recent posts in the Iterated Amplification Sequence, calls the reward engineering problem. The hope is that if you believe that the amplified agent is corrigible, then it is going to be smarter than the agent that it is trying to train via distillation. So you can actually just use the amplified agent to create a reward signal in order to train an agent during distillation. Since the amplified agent is smarter than the agent you’re distilling, you could plausibly actually create a reward function that’s not easily gameable, and actually gets the AI system to do what you want. I think that’s the concise, lacking nuance story of why you might be optimistic about this.

Lucas: All right.

Rohin: So I’ll move on to how debate is also related to this. So we talked about how iterated amplification is basically like growing the depth of deliberation trees that the agent is approximating. The human part of this is judging any one node and its children. In debate on the other hand, you can imagine the same sort of deliberation tree, although now they’re more like arguments and counter arguments as opposed to considerations and counter considerations. But broadly the same thing. So imagine there’s this actual debate tree of ways the debate could possibly go.

Then you could think of the AI systems as choosing a particular path in the debate tree that makes them most likely to win. The key point is that, given that the entire question can be resolved by an exponentially sized deliberation tree, if the two AI systems are capable of computing this exponential deliberation tree, then optimal play in the debate game is to go along the path that is going to lead to your victory, even given that the other player is trying to win themselves. The relation between iterated amplification and debate is that they both want the agents to implicitly be able to compute this exponentially sized deliberation tree that humans could not compute, and then use humans to check a particular part of that tree. In iterated amplification you check a parent and its children: you look at that one section of the tree and you make sure that it looks good. In debate you look at a particular path through the debate tree and judge whether that path is good. One critique of these methods is that it’s not actually clear that an exponentially sized deliberation tree is able to solve all problems that we might care about, especially if the amount of work done at each node is pretty short, like a ten minute stint of a normal human.

One question that you would care about if you wanted to see whether iterated amplification could work is: can these exponentially sized deliberation trees actually solve hard problems? This is the factored cognition hypothesis: that these deliberation trees can in fact solve arbitrarily complex tasks. And Ought is basically working on testing this hypothesis to see whether or not it’s true. They’re finding the tasks which seem hardest to do in this decompositional way, and then seeing if teams of humans can actually figure out how to do them.

Lucas: Do you have an example of what would be one of these tasks that are difficult to decompose?

Rohin: Yeah. Take a bunch of humans who don’t know differential geometry or something, and have them solve the last problem in a textbook on differential geometry. They each only get ten minutes in order to do anything. None of them can read the entire textbook. Because that takes way more than ten minutes. I believe Ought is maybe not looking into that one in particular, that one sounds extremely hard, but they might be doing similar things with books of literature. Like trying to answer questions about a book that no one has actually read.

But I remember that Andreas was actually talking about this particular problem that I mentioned as well. I don’t know if they actually decided to do it.

Lucas: Right. So I mean just generally in this area here, it seems like there are these interesting open questions and considerations about I guess just the general epistemic efficacy of debate. And how good AI and human systems will be at debate, and again also as you just pointed out, whether or not arbitrarily difficult tasks can be solved through this decompositional process. I mean obviously we do have proofs for much simpler things. Why is there a question as to whether or not it would scale? How would it eventually break?

Rohin: With iterated amplification in particular, if you’re starting with humans who have only ten minutes to look at resources and come up with an answer, the particular thing I would say they might not be able to do is take a math textbook that the human did not know already and solve the final problem in it. Iterated amplification, to the extent that it starts with a human who’s limited to ten minutes, and whenever it does a decomposition, does it with a human who’s limited to ten minutes, would inherit all of the same problems if those problems are real problems.

Lucas: Cool, that’s interesting. I don’t have anything to add here then. So happy to keep on moving.

Rohin: I think I want to talk about limited AI systems next. So with AI boxing, the hope is to create a box around an AI system that prevents anything from inside the box from going out, and anything from outside the box from going in, except particular input, output channels that we explicitly allow. So the hope here is that this allows us to make assumptions about the input, output channels that actually will hold in practice, and can be used to ensure that AIs function more like tool AIs or something, rather than agential AIs that will do bad things.

I think most literature on boxing has had an AI system in a box, and a human operator outside of it. One interesting proposal I’ve heard recently is to have the human also be inside the box, because if you’ve got information flowing from an AI to a human, that’s a pretty big hole in your box. But if the human is actually inside the box, then there’s not as much that can be done. There’s not as much information flowing into or out of the box.

Lucas: What does it actually mean to have the human locked in the box?

Rohin: One proposal would be to have a human go inside a secured room from which no information can enter or leave, apart from the fact of whether or not the human has entered or left the room. The human can then interact with the AI as normal. As soon as the human leaves the room, the AI’s reward function is zero. Basically it just doesn’t care about anything that happens in the future after that. So in this setting you would hope that the AI system has no incentive to affect the outside world because the only way it can affect the outside world is by sending the human outside. But as soon as it sends the human outside, its reward gets set to zero forever, so it doesn’t care about any of that.

Lucas: But wouldn’t it care before the human went out?

Rohin: Yes, but before the human goes out, there is nothing it can do to influence the outside world because both the human and the AI are in the box. It could influence the human while the human is in the box. But the human also can’t affect the outside world while it’s in the box.

Lucas: Yeah, but the human leaves the box and then can affect the world.

Rohin: Right. But any change to the world that happens after the human leaves, doesn’t make it into the reward function.

Lucas: But it does because the AI system would be planning over time from its initial moment where the reward isn’t set to zero.

Rohin: But your reward can be limited to the next K time steps, or something like this. Basically you’re limited to until the point that the human leaves. That’s the proposal here.

Lucas: I see, okay. That makes sense.

Rohin: The only information leaving the box, assuming we secure the box in all the normal ways, is whether or not the human has chosen to leave. That’s pretty low information; it doesn’t seem like an AI system would be able to use that to influence the outside world in some ridiculous way. So that’s boxing.
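
As a toy formalization of the reward truncation idea in this proposal (a sketch of what was just described, not something from a published design), the episode return could simply stop counting rewards once the human leaves the box.

```python
def episode_return(rewards, human_left_at):
    """Sum rewards only for time steps before the human left the box."""
    return sum(r for t, r in enumerate(rewards) if t < human_left_at)

# Anything the AI causes after the human walks out contributes nothing:
print(episode_return([1, 1, 1, 1], human_left_at=2))   # 2
```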

The next thing on my list for limited AGI is impact measures, or side effects. There is also mild optimization and conservative concepts, but let’s start with impact measures. The basic hope is to create some quantification of how much impact a particular action that the AI chooses has on the world, and to then penalize the AI for having a lot of impact so that it only does low impact things, which presumably will not cause catastrophe. One approach to this is relative reachability. With relative reachability, you’re basically trying to not decrease the number of states that you can reach from the current state. So you’re trying to preserve option value. You’re trying to keep the same states reachable.

It’s not okay for you to make one state unreachable as long as you make a different state reachable. You need all of the states that were previously reachable to continue being reachable. The relative part is that the penalty is calculated relative to a baseline that measures what would’ve happened if the AI had done nothing, although there are other possible baselines you could use. The reason you do this is so that we don’t penalize the agent for side effects that happen in the environment. Like maybe I eat a sandwich, and now the states where there’s a sandwich in front of me are no longer accessible, because I can’t un-eat a sandwich. We don’t want to penalize our AI system for that impact, because then it’ll try to stop me from eating a sandwich. We want to isolate the impact of the agent as opposed to impacts that were happening in the environment anyway. So that’s why we need the relative part.

There is also attainable utility preservation from Alex Turner, which makes two major changes from relative reachability. First, instead of talking about reachability of states, it talks about how well you can achieve different utility functions. So if previously you were able to make lots of paperclips, then you want to make sure that you can still make lots of paperclips. If previously you were able to travel across the world within a day, then you want to still be able to travel across the world in a day. So that’s the first change.

The second change is that not only does it penalize decreases in attainable utility, it also penalizes increases in attainable utility. So if previously you could not mine asteroids in order to get their natural resources, you should still not be able to mine asteroids and get their resources. This seems kind of crazy when you first hear it, but the rationale for it is that all of the convergent instrumental sub goals are about increases in the power of your AI system. For example, for a broad range of utility functions, it is useful to get a lot of resources and a lot of power in order to achieve those utility functions. Well, if you penalize increases in attainable utility, then you’re going to penalize actions that just broadly get more resources, because those are helpful for many, many, many different utility functions.

Similarly, if you were going to be shut down, but then you disable the shutdown button, well that just makes it much more possible for you to achieve pretty much every utility function, because instead of being off, you are still on and can take actions. So that also will get heavily penalized, because it led to such a large increase in attainable utilities. So those are, I think, the two main impact measures that I know of.
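
Here is a simplified sketch of the shape of an attainable-utility-style penalty, a condensation of the idea just described rather than the exact published formulation: an action is penalized by how much it changes the agent’s ability to achieve a set of auxiliary reward functions, relative to doing nothing, with both increases and decreases counted. In practice the attainable values would be learned; the auxiliary Q-functions below are hand-coded toys.

```python
def impact_penalty(q_aux, state, action, noop="noop"):
    """q_aux maps an auxiliary reward name to a function (state, action) -> attainable value."""
    penalty = 0.0
    for name, q in q_aux.items():
        penalty += abs(q(state, action) - q(state, noop))   # change versus doing nothing
    return penalty / max(len(q_aux), 1)

def shaped_reward(task_reward, penalty, weight=1.0):
    return task_reward - weight * penalty

# Toy example: disabling the off switch makes many auxiliary goals far more attainable,
# so it gets a large penalty even if the task reward likes it.
q_aux = {
    "make_paperclips": lambda s, a: 10.0 if a == "disable_off_switch" else 2.0,
    "travel_far":      lambda s, a: 10.0 if a == "disable_off_switch" else 2.0,
}
print(impact_penalty(q_aux, "s0", "disable_off_switch"))   # 8.0
print(impact_penalty(q_aux, "s0", "fold_one_paperclip"))   # 0.0
```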

Okay, we’re getting to the things where I have less to say, but now we’re at robustness. I mentioned this before, but there are two main challenges with verification: the specification problem, and making it computationally efficient. All of the work is on the computational efficiency side, but I think the hardest part is the specification side, and I’d like to see more people do work on that.

I don’t think anyone is really working on verification with an eye to how to apply it to powerful AI systems. I might be wrong about that. I know some people who do care about AI safety who are working on verification, and it’s possible that they have thoughts about this that aren’t published and that I haven’t talked to them about. But the main thing I would want to see is: what specifications can we actually give to our verification subroutines? At first glance, this is just the full problem of AI safety. We can’t just give a specification for what we want to an AGI.

What specifications can we give for verification that are going to increase our trust in the AI system? For adversarial training, again, all of the work done so far is in the adversarial example space, where you try to train an image classifier to be more robust to adversarial examples, and this kind of works sometimes, but doesn’t work great. For both verification and adversarial training, Paul Christiano has written a few blog posts about how you can apply this to advanced AI systems, but I don’t know if anyone is actively working on these with AGI in mind. With adversarial examples, there is too much work for me to summarize.

The thing that I find interesting about adversarial examples is that it shows that we are not able to create image classifiers that have learned human preferences. Humans have preferences over how we classify images, and we didn’t succeed at that.

Lucas: That’s funny.

Rohin: I can’t take credit for that framing, that one was due to Ian Goodfellow. But yeah, I see adversarial examples as contributing to a theory of deep learning that tells us how we get deep learning systems to be closer to what we want them to be, rather than these weird things that classify pandas as gibbons, even when they’re very clearly still pandas.
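
For context on the panda example: it refers to the well-known fast gradient sign method (FGSM) demonstration, in which a tiny step along the sign of the loss gradient flips a classifier’s prediction from panda to gibbon while the image looks unchanged to a human. Below is a self-contained toy version on a hand-made linear “classifier”; the weights, the input, and the large epsilon are purely illustrative.

```python
import numpy as np

W = np.array([[ 1.0, -1.0, 0.5],
              [-1.0,  1.0, 0.5]])     # toy 2-class linear classifier over 3 "pixels"
x = np.array([0.8, 0.2, 0.5])         # toy input, currently classified as class 0

def fgsm(W, x, true_class, epsilon=0.5):
    logits = W @ x
    p = np.exp(logits - logits.max()); p /= p.sum()          # softmax probabilities
    grad_logits = p.copy(); grad_logits[true_class] -= 1.0   # d(cross-entropy)/d(logits)
    grad_x = W.T @ grad_logits                                # chain rule back to the input
    return np.clip(x + epsilon * np.sign(grad_x), 0.0, 1.0)

x_adv = fgsm(W, x, true_class=0)
print(np.argmax(W @ x), np.argmax(W @ x_adv))   # prints: 0 1 (the prediction flips)
```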

Lucas: Yeah, the framing’s pretty funny, and makes me feel kind of pessimistic.

Rohin: Maybe if I wanted to inject some optimism back in, there’s a frame under which adversarial examples happen because our data sets are too small or something. We have some pretty large data sets, but humans do see more and get far richer information than just pixel inputs. We can go feel a chair and build 3D models of a chair through touch in addition to sight. There is actually a lot more information that humans have, and it’s possible that what our AI systems need is just way more information in order to narrow down on the right model.

So let us move on to I think the next thing is interpretability, which I also do not have much to say about, mostly because there is tons and tons of technical research on interpretability, and there is not much on interpretability from an AI alignment perspective. One thing to note with interpretability is you do want to be very careful about how you apply it. If you have a feedback cycle where you’re like I built an AI system, I’m going to use interpretability to check whether it’s good, and then you’re like oh shit, this AI system was bad, it was not making decisions for the right reasons, and then you go and fix your AI system, and then you throw interpretability at it again, and then you’re like oh, no, it’s still bad because of this other reason. If you do this often enough, basically what’s happening is you’re training your AI system to no longer have failures that are obvious to interpretability, and instead you have failures that are not obvious to interpretability, which will probably exist because your AI system seems to have been full of failures anyway.

So I would be pretty pessimistic about the system that interpretability found 10 or 20 different errors in. I would just expect that the resulting AI system has other failure modes that we were not able to uncover with interpretability, and those will at some point trigger and cause bad outcomes.

Lucas: Right, so interpretability will cover things such as superhuman intelligence interpretability, but also more mundane examples of present day systems, correct, where the interpretability of, say, neural networks is basically, my understanding is, nowhere right now.

Rohin: Yeah, that’s basically right. There have been some techniques developed like saliency maps, feature visualization, neural net models that hallucinate explanations post hoc, people have tried a bunch of things. None of them seem especially good, though some of them definitely are giving you more insight than you had before.

So I think that only leaves CAIS. Comprehensive AI services is like a forecast for how AI will develop in the future. It also has some prescriptive aspects to it, like yeah, we should probably not do these things, because these don’t seem very safe, and we can do these other things instead. In particular, CAIS takes a strong stance against AGI agents that are God-like, fully integrated systems that are optimizing some utility function over the long term future.

It should be noted that it’s arguing against a very specific kind of AGI agent: this sort of long term expected utility maximizer that’s fully integrated into one opaque black box and can’t be broken down into modular components. That entire cluster of features is what CAIS is talking about when it says AGI agent. So it takes a strong stance against that, saying A, it’s not likely that this is the first superintelligent thing that we build, and B, it’s clearly dangerous. That’s what we’ve been saying the entire time. So here’s a solution, why don’t we just not build it? And we’ll build these other things instead? As for what the other things are, the basic intuition pump here is that if you look at how AI is developed today, there is a bunch of research and development practices that we do. We try out a bunch of models, we try some different ways to clean our data, we try different ways of collecting data sets, and we try different algorithms and so on and so forth, and these research and development practices allow us to create better and better AI systems.

Now, our AI systems currently are also very bounded in the tasks that they do. There are specific tasks, and they do that task and that task alone, and they do it in episodic ways. They are only trying to optimize over a bounded amount of time, and they use a bounded amount of computation and other resources. So that's what we're going to call a service. It's an AI system that does a bounded task, in bounded time, with bounded computation. Everything is bounded. Now, our research and development practices are themselves bounded tasks, and AI has shown itself to be quite good at automating bounded tasks. We've definitely not automated all bounded tasks yet, but it does seem like we in general are pretty good at automating bounded tasks with enough effort. So probably we will also automate research and development tasks.

We're seeing some of this already with neural architecture search, for example, and once AI R&D processes have been sufficiently automated, then we get this cycle where AI systems are doing the research and development needed to improve AI systems. So we get to this point of recursive improvement that's not self-improvement anymore, because there isn't really an agent-like self to improve, but you do have recursive AI improving AI. This can lead to the sort of very quick improvement in capabilities that we often associate with superintelligence. With that, we can eventually get to a situation where, for any task that we care about, we could have a service that breaks that task down into a bunch of simple, automatable, bounded tasks, and then we can create services that do each of those bounded tasks and interact with each other in order to complete the long term task in tandem.

This is how humans do engineering and building things. We have these research and development practices, we have these modular systems that are interacting with each other via well defined channels, so this seems more likely to be the first thing that we build that's capable of superintelligent reasoning, rather than an AGI agent that's optimizing a utility function over the long term, yada, yada, yada.

Lucas: Is there no risk, though? Because the superintelligence is the distributed network collaborating. So is there no risk of the collective distributed network creating some sort of epiphenomenal optimization effects?

Rohin: Yup, that's definitely a thing that you should worry about. I know that Eric agrees with me on this because he explicitly lists it out in the tech report as a thing that needs more research and that we should be worried about. But the hope is that there are other things that you can do, things we normally wouldn't think about in technical AI safety research, that would make more sense in this context. For example, we could train a predictive model of human approval. Given any scenario, the AI system should predict how much humans are going to like it or approve of it, and then that service can be used in order to check that other services are doing reasonable things.
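To make that idea a bit more concrete, here is a toy sketch of how a learned approval model might gate another service's proposed actions. All the names here (`service`, `approval_model`, `propose_action`, `predict`) are hypothetical placeholders, not anything from the CAIS report.

```python
def run_with_approval_check(service, approval_model, state, threshold=0.9):
    """Only execute an action if a separately trained approval model predicts humans would approve."""
    action = service.propose_action(state)           # output of a bounded service
    score = approval_model.predict(state, action)    # predicted human approval in [0, 1]
    if score < threshold:
        return None                                  # withhold the action; escalate to a human instead
    return action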

Similarly, we might look at each individual service and see which of the other services it's accessing, and then make sure that those are reasonable services. If we see the CEO service of a paperclip company going and talking to the synthetic biology service, we might be a bit suspicious and be like, why is this happening? And then we can go and check to see why exactly that has happened. So there are all of these other things that we could do in this world, which aren't really options in the AGI agent world.

Lucas: Aren't they options in the AGI agential world where the architectures are designed such that these important decision points are analyzable to the same degree as they would be in a CAIS framework?

Rohin: Not to my knowledge. As far as I can tell, with most end-to-end trained things, you might have the architectures be such that there are these points at which you expect certain kinds of information will be flowing, but you can't easily look at the information that's actually there and deduce what the system is doing. It's just not interpretable enough to do that.

Lucas: Okay. I don't think that I have any other questions or interesting points with regards to CAIS. It's a very different and interesting conception of the kind of AI world that we can create. It seems to require its own new coordination challenge: if the hypothesis is true that agential AIs will be afforded more causal power in the world, and more efficiency, than CAIS systems, that'll give them a competitive advantage that will potentially bias civilization away from CAIS systems.

Rohin: I do want to note that I think the agential AI systems will be more expensive and take longer to develop than CAIS. So I do think CAIS will come first. Again, this is all in a particular world view.

Lucas: Maybe this is abstracting up too far, but does CAIS claim to function as an AI alignment methodology to be used over the long term? Do we retain the sort of CAIS architecture path, with CAIS creating superintelligence as some sort of distributed task force?

Rohin: I’m not actually sure. There’s definitely a few chapters in the technical report that are like okay, what if we build AGI agents? How could we make sure that goes well? As long as CAIS comes before AGI systems, here’s what we can do in that setting.

But I feel like I personally think that AGI systems will come. My guess is that Eric does not think that this is necessary, and we could actually just have CAIS systems forever. I don't really have a model for when to expect AGI separately from the CAIS world. I guess I have a few different potential scenarios that I can consider, and I can compare it to each of those, but it's not like it's a binary between CAIS and not CAIS. It's more like it's CAIS and a whole bunch of other potential scenarios, and in reality it'll be some mixture of all of them.

Lucas: Okay, that makes more sense. So, there’s sort of an overload here, or just a ton of awesome information with regards to all of these different methodologies and conceptions here. So just looking at all of it, how do you feel about all of these different methodologies in general, and how does AI alignment look to you right now?

Rohin: Pretty optimistic about AI alignment, but I don't think that's so much from the particular technical safety research that we have. That's some of it. I do think that there are promising approaches, and the fact that there are promising approaches makes me more optimistic. But I think more so my optimism comes from the strategic picture. A belief that, A, we will be able to convince people that this is important, such that people start actually focusing on this problem more broadly; B, that we would be able to get a bunch of people to coordinate such that they're more likely to invest in safety; and C, that I don't place as much weight on the scenario where AI systems are long term utility maximizers and therefore we're basically all screwed, which seems to be the position of many other people in the field.

I say optimistic. I mean optimistic relative to them. I’m probably pessimistic relative to the average person.

Lucas: A lot of these methodologies are new. Do you have any sort of broad view about how the field is progressing?

Rohin: Not a great one. Mostly because, well, maybe I've just recently stopped being new to the field, so I didn't really get to observe the field very much in the past. But it seems like there's been more of a shift towards figuring out how all of the things people were thinking about apply to real machine learning systems, which seems nice. The fact that it does connect is good. I don't think the connections were super natural, or that they just sort of clicked, but they did mostly work out in many cases, and that seems pretty good. So yeah, the fact that we're now doing a combination of theoretical, experimental, and conceptual work seems good.

It’s no longer the case that we’re mostly doing theory. That seems probably good.

Lucas: You've already mentioned a lot of really great links in this podcast, places people can go to learn more about these specific approaches and papers and strategies. And one place that is generally great for people to go is the Alignment Forum, where a lot of this information already exists. So are there, generally, any other places that you recommend people check out if they're interested in taking more technical deep dives?

Rohin: Probably actually at this point, one of the best places for a technical deep dive is the alignment newsletter database. I write a newsletter every week about AI alignment, all the stuff that’s happened in the past week, that’s the alignment newsletter, not the database, which also people can sign up for, but that’s not really a thing for technical deep dives. It’s more a thing for keeping a pace with developments in the field. But in addition, everything that ever goes into the newsletter is also kept in a separate database. I say database, it’s basically a Google sheets spreadsheet. So if you want to do a technical deep dive on any particular area, you can just go, look for the right category on the spreadsheet, and then just look at all the papers there, and read some or all of them.

Lucas: Yeah, so thanks so much for coming on the podcast Rohin, it was a pleasure to have you, and I really learned a lot and found it to be super valuable. So yeah, thanks again.

Rohin: Yeah, thanks for having me. It was great to be on here.

Lucas: If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI alignment series.

End of recorded material

AI Alignment Podcast: An Overview of Technical AI Alignment with Rohin Shah (Part 1)

The space of AI alignment research is highly dynamic, and it’s often difficult to get a bird’s eye view of the landscape. This podcast is the first of two parts attempting to partially remedy this by providing an overview of the organizations participating in technical AI alignment research, their specific research directions, and how these approaches all come together to make up the state of technical AI alignment efforts. In this first part, Rohin moves sequentially through the technical research organizations in this space and carves through the field by its varying research philosophies. We also dive into the specifics of many different approaches to AI safety, explore where they disagree, discuss what properties varying approaches attempt to develop/preserve, and hear Rohin’s take on these different approaches.

You can take a short (3 minute) survey to share your feedback about the podcast here.

In this podcast, Lucas spoke with Rohin Shah. Rohin is a 5th year PhD student at UC Berkeley with the Center for Human-Compatible AI, working with Anca Dragan, Pieter Abbeel, and Stuart Russell. Every week, he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter.

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, iTunes, Google Play, Stitcher, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

Topics discussed in this episode include:

  • The perspectives of CHAI, MIRI, OpenAI, DeepMind, FHI, and others
  • Where and why they disagree on technical alignment
  • The kinds of properties and features we are trying to ensure in our AI systems
  • What Rohin is excited and optimistic about
  • Rohin’s recommended reading and advice for improving at AI alignment research

Lucas: Hey everyone, welcome back to the AI Alignment podcast. I’m Lucas Perry, and today we’ll be speaking with Rohin Shah. This episode is the first episode of two parts that both seek to provide an overview of the state of AI alignment. In this episode, we cover technical research organizations in the space of AI alignment, their research methodologies and philosophies, how these all come together on our path to beneficial AGI, and Rohin’s take on the state of the field.

As a general bit of announcement, I would love for this podcast to be particularly useful and informative for its listeners, so I’ve gone ahead and drafted a short survey to get a better sense of what can be improved. You can find a link to that survey in the description of wherever you might find this podcast, or on the page for this podcast on the FLI website.

Many of you will already be familiar with Rohin, he is a fourth year PhD student in Computer Science at UC Berkeley with the Center For Human-Compatible AI, working with Anca Dragan, Pieter Abbeel, and Stuart Russell. Every week, he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter. And so, without further ado, I give you Rohin Shah.

Thanks so much for coming on the podcast, Rohin, it’s really a pleasure to have you.

Rohin: Thanks so much for having me on again, I’m excited to be back.

Lucas: Yeah, long time, no see since Puerto Rico Beneficial AGI. And so speaking of Beneficial AGI, you gave quite a good talk there which summarized technical alignment methodologies, approaches, and broad views at this time; and that is the subject of this podcast today.

People can go and find that video on YouTube, and I suggest that you watch that; that should be coming out on the FLI YouTube channel in the coming weeks. But for right now, we’re going to be going in more depth, and with more granularity into a lot of these different technical approaches.

So, just to start off, it would be good if you could contextualize this list of technical approaches to AI alignment that we’re going to get into within the different organizations that they exist at, and the different philosophies and approaches that exist at these varying organizations.

Rohin: Okay, so disclaimer, I don’t know all of the organizations that well. I know that people tend to fit CHAI in a particular mold, for example; CHAI’s the place that I work at. And I mostly disagree with that being the mold for CHAI, so probably anything I say about other organizations is also going to be somewhat wrong; but I’ll give it a shot anyway.

So I guess I’ll start with CHAI. And I think our public output mostly comes from this perspective of how do we get AI systems to do what we want? So this is focusing on the alignment problem, how do we actually point them towards a goal that we actually want, align them with our values. Not everyone at CHAI takes this perspective, but I think that’s the one most commonly associated with us and it’s probably the perspective on which we publish the most. It’s also the perspective I, usually, but not always, take.

MIRI, on the other hand, takes a perspective of, “We don’t even know what’s going on with intelligence. Let’s try and figure out what we even mean by intelligence, what it means for there to be a super-intelligent AI system, what would it even do or how would we even understand it; can we have a theory of what all of this means? We’re confused, let’s be less confused, once we’re less confused, then we can think about how to actually get AI systems to do good things.” That’s one of the perspectives they take.

Another perspective they take is that there’s a particular problem with AI safety, which is that, “Even if we knew what goals we wanted to put into an AI system, we don’t know how to actually build an AI system that would, reliably, pursue those goals as opposed to something else.” That problem, even if you know what you want to do, how do you get an AI system to do it, is a problem that they focus on. And the difference from the thing I associated with CHAI before is that, with the CHAI perspective, you’re interested both in how do you get the AI system to actually pursue the goal that you want, but also how do you figure out what goal that you want, or what is the goal that you want. Though, I think most of the work so far has been on supposing you know the goal, how do you get your AI system to properly pursue it?

I think the DeepMind safety team, at least, is pretty split across many different ways of looking at the problem. I think Jan Leike, for example, has done a lot of work on reward modeling, and this sort of fits in with the how do we get our AI systems to be focused on the right task, the right goal. Whereas Vika has done a lot of work on side effects or impact measures. I don't know if Vika would say this, but the way I interpret it is: how do we impose a constraint upon the AI system such that it never does anything catastrophic? It's not trying to get the AI system to do what we want, just to not do what we don't want, or what we think would be catastrophically bad.

OpenAI safety also seems to be, okay, how do we get deep reinforcement learning to do good things, to do what we want, to be a bit more robust? Then there's also the iterated amplification, debate, and factored cognition area of research, which is more along the lines of, can we write down a system that could, plausibly, lead to us building an aligned AGI or aligned powerful AI system?

FHI: there's no single coherent direction across all of FHI. Eric Drexler is also trying to understand how AI will develop in the future. It's somewhat different from what MIRI's doing, but it has the same general theme of trying to figure out what is going on. So he just recently published a long technical report on comprehensive AI services, which is a general worldview for predicting what AI development will look like in the future. If we believed that that was, in fact, the way AI would happen, we would probably change what we work on from the technical safety point of view.

And Owain Evans does a lot of stuff, so maybe I’m just not going to try to categorize him. And then Stuart Armstrong works on this, “Okay, how do we get value learning to work such that we actually infer a utility function that we would be happy for an AGI system to optimize, or a super-intelligent AI system to optimize?”

And then Ought works on factored cognition, so it's very adjacent to the iterated amplification and debate research agendas. Then there are a few individual researchers scattered around, for example, at Toronto, Montreal, AMU, and EPFL; maybe I won't get into all of them because, yeah, that's a lot, but we can delve into that later.

Lucas: Maybe a more helpful approach, then, would be if you could start by demystifying some of the MIRI stuff a little bit, which may seem most unusual.

Rohin: I guess, strategically, the point would be that you’re trying to build this AI system that’s going to be, hopefully, at some point in the future vastly more intelligent than humans, because we want them to help us colonize the universe or something like that, and lead to lots and lots of technological progress, etc., etc.

But this, basically, means that humans will not be in control unless we very, very specifically arrange it such that we are in control; we have to thread the needle perfectly in order to get this to work out. In the same way that, by default, you would expect that the most intelligent creatures or beings are the ones that are going to decide what happens. And so we really need to make sure, and it's probably hard to ensure, that these vastly more intelligent beings are actually doing what we want.

Given that, it seems like what we want is a good theory that allows us to understand and predict what these AI systems are going to do. Maybe not in the fine, nitty-gritty details, because if we could predict what they would do, then we could do it ourselves and be just as intelligent as they are. But, at least in broad strokes, what sorts of universes are they going to create?

But given that they can apply so much more intelligence than we can, we need our guarantees to be really, really strong; almost proof level. Maybe actual proofs are a little too much to expect, but we want to get as close to them as possible. Now, if we want to do something like that, we need a theory of intelligence; we can't just do a bunch of experiments, look at the results, and then try to extrapolate from there. Extrapolation does not give you the level of confidence that we would need for a problem this difficult.

And so rather, they would like to instead understand intelligence deeply, deconfuse themselves about it. Once you understand how intelligence works at a theoretical level, then you can start applying that theory to actual AI systems and seeing how they approximate the theory, or make predictions about what different AI systems will do. And, hopefully, then we could say, “Yeah, this system does look like it’s going to be very powerful as approximating this particular idea, this particular part of theory of intelligence. And we can see that with this particular theory of intelligence, we can align it with humans somehow, and you’d expect that this was going to work out.” Something like that.

Now, that sounded kind of dumb even to me as I was saying it, but that's because we don't have the theory yet; it's very fun to speculate how you would use the theory before you actually have the theory. So that's the reason they're doing this. The actual thing that they're focusing on is centered around problems of embedded agency. And I should say this is one of their, I think, two main strands of research; the other strand of research I do not know anything about, because they have not published anything about it.

But one of their strands of research is about embedded agency. And here the main point is that in the real world, any agent, any AI system, or a human is a part of their environment. They are smaller than the environment and the distinction between agent and environment is not crisp. Maybe I think of my body as being part of me but, I don’t know, to some extent, my laptop is also an extension of my agency; there’s a lot of stuff I can do with it.

Or, on the other hand, you could think maybe my arms and limbs aren’t actually a part of me, I could maybe get myself uploaded at some point in the future, and then I will no longer have arms or legs; but in some sense I am still me, I’m still an agent. So, this distinction is not actually crisp, and we always pretend that it is in AI, so far. And it turns out that once you stop making this crisp distinction and start allowing the boundary to be fuzzy, there are a lot of weird, interesting problems that show up and we don’t know how to deal with any of them, even in theory, so that’s what they focused on.

Lucas: And can you unpack, given that AI researchers control the input/output channels for AI systems, why is it that there is this fuzziness? It seems like you could abstract away the fuzziness, given that there are these sorts of rigid and selected IO channels.

Rohin: Yeah, I agree that seems like the right thing for today's AI systems; but I don't know. If I think about, "Okay, this AGI is a generally intelligent AI system," I kind of expect it to recognize what's going on when we feed it inputs. Let's say we're imagining a money maximizing AI system that's taking in inputs like stock prices, and it outputs which stocks to buy. And maybe it can also read the news, which lets it get newspaper articles in order to make better decisions about which stocks to buy.

At some point, I expect this AI system to read about AI and humans, and realize that, hey, it must be an AI system, it must be getting inputs and outputs, and its reward function must be to make this particular number in a bank account be as high as possible. And then once it realizes this, there's this part of the world, which is this number in the bank account, or it could be this particular value, this particular memory block in its own CPU, and its goal is now to make that number as high as possible.

In some sense, it's now modifying itself, especially if you're thinking of the memory block inside the CPU. If it goes and edits that and sets it to a million, a billion, the highest number possible in that memory block, then it seems like it has, in some sense, done some self editing; it's changed the agent part of itself. It could also go and be like, "Okay, actually what I care about is this particular reward function box outputting as high a number as possible. So what if I go and change my input channels such that they feed me things that cause me to believe that I've made tons and tons of profit?" So this is the delusion box consideration.

While it is true that I don’t see a clear, concrete way that an AI system ends up doing this, it does feel like an intelligent system should be capable of this sort of reasoning, even if it initially had these sort of fixed inputs and outputs. The idea here is that its outputs can be used to affect the inputs or future outputs.

Lucas: Right, so I think that that point is the clearest summation of this; it can affect its own inputs and outputs later. Take human beings, who are, by definition, human level intelligences. If you thought of us in a classic computer science sense, you'd say we strictly have five input channels: hearing, seeing, touch, smell, etc.

Human beings have a fixed number of input/output channels but, obviously, human beings are capable of self modifying on those. And our agency is sort of squishy and dynamic in ways that would be very unpredictable, and I think that that unpredictability and the sort of almost seeming ephemerality of being an agent seems to be the crux of a lot of the problem.

Rohin: I agree that that's a good intuition pump; I'm not sure that I agree it's the crux. The crux, to me, feels more like this: you specify some sort of behavior that you want, which, in this case, was make a lot of money, or make this number in a bank account go higher, or make this memory cell go as high as possible.

And when you were thinking about the specification, you assumed that the inputs and outputs fell within some strict parameters, like the inputs are always going to be news articles that are real and produced by human journalists, as opposed to a fake news article that was created by the AI in order to convince the reward function that actually it’s made a lot of money. And then the problem is that since the AI’s outputs can affect the inputs, the AI could cause the inputs to go outside of the space of possibilities that you imagine the inputs could be in. And this then allows the AI to game the specification that you had for it.

Lucas: Right. So, all the parts which constitute some AI system are all, potentially, modified by other parts. And so you have something that is fundamentally and completely dynamic, which you’re trying to make predictions about, but whose future structure is potentially very different and hard to predict based off of the current structure?

Rohin: Yeah, basically.

Lucas: And that in order to get past this we must, again, tunnel down on these decision theoretic and rational agency type issues at the bottom of intelligence, to have a more fundamental theory which can be applied to these highly dynamic and difficult to understand situations?

Rohin: Yeah, I think the MIRI perspective is something like that. And in particular, it would be like trying to find a theory that allows you to put in something that stays stable even while the system, itself, is very dynamic.

Lucas: Right. Even while your system's parts are all completely dynamic and able to be changed by other parts, how do you maintain a degree of alignment amongst all that?

Rohin: One answer to this is to give the AI a utility function. If there is a utility function that it's explicitly trying to maximize, then it probably has an incentive to keep and protect that utility function, because if it gets changed, well, then it's not going to maximize that utility function anymore; it'll maximize something else, which will lead to worse behavior by the lights of the original utility function. That's a thing that you could hope to do with a better theory of intelligence: how do you create a utility function in an AI system that stays stable, even as everything else is dynamically changing?

Lucas: Right, and without even getting into the issues of implementing one single stable utility function.

Rohin: Well, I think they're looking into those issues. So, for example, Vingean Reflection is a problem that is entirely about how you create a better, more improved version of yourself without having any value drift, or a change to the utility function.

Lucas: Is your utility function not self-modifying?

Rohin: So in theory, it could be. The hope would be that we could design an AI system that does not self-modify its utility function under almost all circumstances. Because if you change your utility function, then you're going to start maximizing that new utility function which, by the original utility function's evaluation, is worse. If I told you, "Lucas, you have got to go fetch coffee," that's the only thing in life you're concerned about. You must take whatever actions are necessary in order to get the coffee.

And then someone goes, "Hey Lucas, I'm going to change your utility function so that you want to fetch tea instead." And then all of your decision making is going to be in service of getting tea. You would probably say, "No, don't do that, I want to fetch coffee right now. If you change my utility function to 'fetch tea', then I'm going to fetch tea, which is bad because I want to fetch coffee." And so, hopefully, you don't change your utility function, because of this effect.
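The coffee/tea argument can be put as a tiny worked example: an agent that evaluates the swap with its current utility function will always judge the swap as worse. The numbers below are made up purely for illustration.

```python
def u_coffee(outcome):
    """The agent's current utility function: it only values fetching coffee."""
    return 1.0 if outcome == "coffee" else 0.0

keep = u_coffee("coffee")   # keep the current utility function, go fetch coffee -> utility 1.0
swap = u_coffee("tea")      # accept the new utility function, end up fetching tea -> utility 0.0

# Evaluated by the *current* utility function, keeping it always looks at least as good,
# so the agent has an incentive to resist the modification.
assert keep > swap
```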

Lucas: Right. But isn’t this where corrigibility comes in, and where we admit that as we sort of understand more about the world and our own values, we want to be able to update utility functions?

Rohin: Yeah, so that is a different perspective; I’m not trying to describe that perspective right now. It’s a perspective for how you could get something stable in an AI system. And I associate it most with Eliezer, though I’m not actually sure if he holds this opinion.

Lucas: Okay, so I think this was very helpful for the MIRI case. So why don’t we go ahead and zoom in, I think, a bit on CHAI, which is the Center For Human-Compatible AI.

Rohin: So I think rather than talking about CHAI, I'm going to talk about the general field of trying to get AI systems to do what we want; a lot of people at CHAI work on that, but not everyone, and also a lot of people outside of CHAI work on that, because that seems to be a more useful carving of the field. So there's this broad argument for AI safety which is, "We're going to have very intelligent things, and based on the orthogonality thesis, we can't really say anything about their goals." So the really important thing is to make sure that the intelligence is pointed at the right goals, that it's pointed at doing what we actually want.

And so then the natural approach is, how do we get our AI systems to infer what we want them to do and then actually pursue that? And I think, in some sense, it's one of the most obvious approaches to AI safety. This is a clear enough problem, even with narrow current systems, that there are plenty of people outside of AI safety working on this as well. So this incorporates things like inverse reinforcement learning, preference learning, reward modeling; the CIRL (cooperative IRL) paper also fits into all of this. So yeah, I can go into those in more depth.

Lucas: Why don't you start off by talking about the people who exist within the field of AI safety, giving a brief characterization of what's going on outside of the field, but primarily focusing on those within the field, and on how this approach, in practice, differs from MIRI's, to start off with, since we now have a clear picture of them painted right next to what we're delving into.

Rohin: So I think the difference from MIRI is that this is more targeted directly at the problem right now, in that you're actually trying to figure out how do you build an AI system that does what you want. Now, admittedly, most of the techniques that people have come up with are not likely to scale up to super-intelligent AI; they're not meant to, and no one claims that they're going to scale up to super-intelligent AI. They're more like some incremental progress on figuring out how to get AI systems to do what we want and, hopefully, with enough incremental progress, we'll get to a point where we can go, "Yes, this is what we need to do."

Probably the most well known person here would be Dylan Hadfield-Menell, who you had on your podcast. And so he talked about CIRL and associated things quite a bit there, there’s not really that much I would say in addition to it. Maybe a quick summary of Dylan’s position is something like, “Instead of having AI systems that are optimizing for their own goals, we need to have AI systems that are optimizing for our goals, and try to infer our goals in order to do that.”

So rather than having an AI system that is individually rational with respect to its own goals, you instead want to have a human-AI system such that the entire system is rationally optimizing for the human's goals. This is sort of the point made by CIRL, where you have an AI system and a human, they're playing this two player game, the human is the only one who knows the reward function, and the robot is uncertain about what the reward function is and has to learn by observing what the human does.

And so, now you see that the robot does not have a utility function that it is trying to optimize; instead it is learning about a utility function that the human has, and then helping the human optimize that reward function. So, in summary: try to build human-AI systems that are group rational, as opposed to an AI system that is individually rational; so that's Dylan's view. Then there's Jan Leike at DeepMind, and a few people at OpenAI.
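As a rough illustration of the kind of inference this relies on, here is a toy sketch in which the robot maintains a belief over which reward function the human has and updates it after observing a human action, assuming a noisily rational (Boltzmann) human. This is a stripped-down caricature with made-up numbers, not the actual CIRL formulation.

```python
import math

# Two hypotheses about the human's reward function, over two possible actions.
rewards = {"coffee": {"get_coffee": 1.0, "get_tea": 0.0},
           "tea":    {"get_coffee": 0.0, "get_tea": 1.0}}
belief = {"coffee": 0.5, "tea": 0.5}   # robot's prior over the human's goal
beta = 2.0                              # rationality temperature of the human model

def update(belief, observed_action):
    """Bayesian update: P(theta | action) is proportional to P(action | theta) * P(theta)."""
    posterior = {}
    for theta, prior in belief.items():
        likelihood = math.exp(beta * rewards[theta][observed_action])
        likelihood /= sum(math.exp(beta * r) for r in rewards[theta].values())
        posterior[theta] = likelihood * prior
    z = sum(posterior.values())
    return {theta: p / z for theta, p in posterior.items()}

belief = update(belief, "get_coffee")   # belief shifts strongly toward the "coffee" hypothesis
```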

Lucas: Before we pivot into OpenAI and DeepMind, just focusing here on the CHAI end of things and this broad view, help me understand how you would characterize it: a present day, actively focused view on current issues in alignment, making incremental progress there. And this view, you see it as sort of subsuming multiple organizations?

Rohin: Yes, I do.

Lucas: Okay. Is there a specific name you would, again, use to characterize this view?

Rohin: Oh, getting AI systems to do what we want. Let’s see, do I have a pithy name for this? Helpful AI systems or something.

Lucas: Right which, again, is focused on current day things, is seeking to make incremental progress, and which subsumes many different organizations?

Rohin: Yeah, that seems broadly true. I do think there are people who are doing more conceptual work, thinking about how this will scale to AGI and stuff like that; but it’s a minority of work in the space.

Lucas: Right. And so the question of how do we get AI systems to do what we want them to do, also includes these views of, say, Vingean Reflection or how we become idealized versions of ourselves, or how we build on value over time, right?

Rohin: Yeah. So, those are definitely questions that you would need to answer at some point. I’m not sure that you would need to answer Vingean Reflection at some point. But you would definitely need to answer how do you update, given that humans don’t actually know what they want, for a long-term future; you need to be able to deal with that fact at some point. It’s not really a focus of current research, but I agree that that is a thing about this approach will have to deal with, at some point.

Lucas: Okay. So, moving on from you and Dylan to DeepMind and these other places where you view this sort of approach as also being practiced?

Rohin: Yeah, so while Dylan and I and others at CHAI have been focused on sort of conceptual advances, like, in toy environments, does this do the right thing? What are some sorts of data that we can learn from? Does this work in very simple environments with quite simple algorithms? I would say that the OpenAI and DeepMind safety teams are more focused on trying to get this to work in complex environments: can we get this to work on state-of-the-art environments, the most complex ones that we have?

Now I don't mean Dota and StarCraft, because running experiments with Dota and StarCraft is incredibly expensive, but can we get AI systems that do what we want for environments like Atari or MuJoCo? There's some work on this happening at CHAI, there are pre-prints available online, but it hasn't been published very widely yet. Most of the work, I would say, has been happening with an OpenAI/DeepMind collaboration, and most recently, there was a position paper from DeepMind on recursive reward modeling.

Right before that, there was first a paper, deep reinforcement learning from human preferences, which said, "Okay, if we allow humans to specify what they want by just comparing between different pieces of behavior from the AI system, can we train an AI system to do what the human wants?" And then they built on that in order to create a system that could learn from demonstrations initially, using a kind of imitation learning, and then improve upon the demonstrations using comparisons in the same way that deep RL from human preferences did.
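For concreteness, a minimal sketch of the comparison-based reward learning idea might look like the following, where a reward model is trained with a logistic (Bradley-Terry style) loss on pairs of trajectory segments. `reward_model` and the segment tensors are placeholders, and this is a simplified sketch rather than the exact setup of the paper.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, segment_a, segment_b, human_prefers_a):
    """Train a reward model from a human comparison between two trajectory segments."""
    r_a = reward_model(segment_a).sum()   # predicted per-step rewards summed over segment A
    r_b = reward_model(segment_b).sum()   # same for segment B
    logit = r_a - r_b                     # model's log-odds that A is preferred over B
    target = torch.tensor(1.0 if human_prefers_a else 0.0)
    return F.binary_cross_entropy_with_logits(logit, target)
```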

So one way that you can look at this research is through the lens of human computer interaction, which is about … well, it's about many things. But one of the things that it's about is, how do you make the user interface for humans intuitive and easy to use such that you don't have user error or operator error? One comment from people that I liked is that most of the things that are classified as 'user error' or 'operator error' should not be classified as such; they should be classified as 'interface errors', where you had such a confusing interface that, well, of course, at some point some user was going to get it wrong.

And similarly, here, what we want is a particular behavior out of the AI, or at least a particular set of outcomes from the AI; maybe we don’t know exactly how to achieve those outcomes. And AI is about giving us the tools to create that behavior in automated systems. The current tool that we all use is the reward function, we write down the reward function and then we give it to an algorithm, and it produces behaviors and the outcomes that we want.

And reward functions, they're just a pretty terrible user interface. They're better than the previous interface, which is writing a program explicitly, which humans cannot do if the task is something like image classification or continuous control in MuJoCo; it's an improvement upon that. But reward functions are still a pretty poor interface, because they implicitly claim to encode perfect knowledge of the optimal behavior in all possible environments, which is clearly not a thing that humans can do.

I would say that this area is about moving on from reward functions, going to the next thing that makes the human's job even easier. And so we've got things like comparisons, we've got things like inverse reward design, where you specify a proxy reward function that only needs to work in the training environment. Or you do something like inverse reinforcement learning, where you learn from demonstrations; so I think that's one nice way of looking at this field.

Lucas: So do you have anything else you would like to add here about this "how do we get present day AI systems to do what we want them to do" section of the field?

Rohin: Maybe I want to plug my value learning sequence, because it talks about this much more eloquently than I can on this podcast?

Lucas: Sure. Where can people find your value learning sequence?

Rohin: It's on the Alignment Forum. You just go to the Alignment Forum; at the top there's 'Recommended Sequences'. There's 'Embedded Agency', which is from MIRI, the sort of stuff we already talked about, so that's also a great sequence, I would recommend it. There's iterated amplification, also a great sequence; we haven't talked about it yet. And then there's my value learning sequence, so you can see it on the front page of the Alignment Forum.

Lucas: Great. So we’ve characterized these, say, different parts of the AI alignment field. And probably just so far it’s been cut into this sort of MIRI view, and then this broad approach of trying to get present-day AI systems to do what we want them to do, and to make incremental progress there. Are there any other slices of the AI alignment field that you would like to bring to light?

Rohin: Yeah, I've got four or five more. There's the iterated amplification and debate side of things, which is: how do we build an aligned AGI using current technologies, but imagining that they were way better? So they're trying to solve the entire problem, as opposed to making incremental progress, and, simultaneously, hopefully thinking about, conceptually, how do we fit all of these pieces together?

There's limiting the AGI system, which is more about how do we prevent AI systems from behaving catastrophically? It makes no guarantees about the AI systems doing what we want; it just prevents them from doing really, really bad things. Techniques in that section include boxing and avoiding side effects. There's the robustness view, which is about how do we make AI systems behave well, robustly? I guess that's pretty self explanatory.

There’s transparency or interpretability, which I wouldn’t say is a technique by itself, but seems to be broadly useful for almost all of the other avenues, it’s something we would want to add to other techniques in order to make those techniques more effective. There’s also, in the same frame as MIRI, can we even understand intelligence? Can we even forecast what’s going to happen with AI? And within that, there’s comprehensive AI services.

There's also lots of effort on forecasting, but comprehensive AI services actually makes claims about what technical AI safety should do. So I think that one actually does have a place in this podcast, whereas most of the forecasting things do not, obviously. They have some implications for the strategic picture, but they don't have clear implications for technical safety research directions, as far as I can tell right now.

Lucas: Alright, so, do you want to go ahead and start off with the first one on the list there? And then we'll move sequentially down.

Rohin: Yeah, so iterated amplification and debate. This is similar to the helpful AGI section in the sense that we are trying to build an AI system that does what we want. That’s still the case here, but we’re now trying to figure out, conceptually, how can we do this using things like reinforcement learning and supervised learning, but imagining that they’re way better than they are right now? Such that the resulting agent is going to be aligned with us and reach arbitrary levels of intelligence; so in some sense, it’s trying to solve the entire problem.

We want to come up with a scheme such that if we run that scheme, we get good outcomes; then we've solved almost all of the problem. I think that it also differs in that the argument for why we can be successful is different. This field is aiming to get a property of corrigibility, which I like to summarize as trying to help the overseer. It might fail to help the overseer, or the human, or the user, because it's not very competent, and maybe it makes a mistake and thinks that I like apples when actually I want oranges. But it was actually trying to help me; it actually thought I wanted apples.

So in corrigibility, you're trying to help the overseer, whereas in the previous thing about helpful AGI, you're more getting an AI system that actually does what we want; there isn't this distinction between what you're trying to do versus what you actually do. So there's a slightly different property that you're trying to ensure; I think on the strategic picture that's the main difference.

The other difference is that these approaches are trying to make a single, unified, generally intelligent AI system, and so they will make assumptions like: given that we're trying to imagine something that's generally intelligent, it should be able to do X, Y, and Z. Whereas the research agenda that's "let's try to get AI systems that do what you want" tends not to make those assumptions. And so it's more applicable to current systems or narrow systems, where you can't assume that you have general intelligence.

For example, a claim that Paul Christiano often talks about is that, "If your AI agent is generally intelligent and a little bit corrigible, it will probably easily be able to infer that its overseer, or the user, would like to remain in control of any resources that they have, would like to be better informed about the situation, would prefer that the agent not lie to them, etc., etc." That's definitely not something that current day AI systems can do unless you really engineer them to, so this is presuming some level of generality which we do not currently have.

So the next thing I said was limited AGI. Here the idea is, there are not very many policies or AI systems that will do what we want; what we want is a pretty narrow space in the space of all possible behaviors. Actually selecting one of the behaviors out of that space is quite difficult and requires a lot of information in order to narrow in on that piece of behavior. But if all you’re trying to do is avoid the catastrophic behaviors, then there are lots and lots of policies that successfully do that. And so it might be easier to find one of those policies; a policy that doesn’t ever kill all humans.

Lucas: At least with the space of those policies, one might have this view and not think it sufficient for AI alignment, but see it as sort of a low hanging fruit to be picked, because the space of non-catastrophic outcomes is larger than the space of extremely specific futures that human beings support.

Rohin: Yeah, exactly. And the success story here is, basically, that we develop this way of preventing catastrophic behaviors, all of our AI systems are built with this system in place, and then technological progress continues as usual; it's maybe not as fast as it would have been if we had an aligned AGI doing all of this for us, but hopefully it would still be somewhat fast, and hopefully enabled a bit by AI systems. Eventually, we either make it to the future without ever building an AI system that doesn't have such a system in place, or we use this to do a bunch more AI research until we solve the full alignment problem, and then we can build an aligned AI with high confidence that it'll go well.

An actual, properly aligned superintelligence that is helping us without any of these limiting systems in place. I think from a strategic picture, those are basically the important parts about limited AGI. There are two subsections within this: limits based on trying to change what the AI's optimizing for, which would be something like impact measures, versus limits on the input/output channels of the AI system, which would be something like AI boxing.
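As a rough sketch of what an impact-measure-style limit could look like, loosely in the spirit of ideas like attainable utility preservation but not a faithful implementation of any published method, one can penalize changes in the agent's ability to achieve a set of auxiliary goals. All the quantities here are illustrative placeholders.

```python
def penalized_reward(task_reward, aux_values_before, aux_values_after, lam=1.0):
    """Task reward minus a penalty for shifting how well the agent could achieve auxiliary goals."""
    # aux_values_before/after: estimated attainable value for each auxiliary goal,
    # measured before and after the agent's action (both are assumptions of this sketch).
    penalty = sum(abs(after - before)
                  for before, after in zip(aux_values_before, aux_values_after))
    return task_reward - lam * penalty
```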

So, with robustness, I sort of think of it as mostly not going to give us safety by itself, probably, though there are some scenarios in which it could; it's more meant to harden whichever other approach we use. Maybe if we have an AI system that is trying to do what we want, to go back to the helpful AGI setting, maybe it does that 99.9 percent of the time. But we're using this AI to make millions of decisions, which means it's going to not do what we want 1,000 times. That seems like way too many times for comfort, because if it's applying its intelligence to the wrong goal in those 1,000 times, you could get some pretty bad outcomes.

This is a super heuristic and fluffy argument, and there are lots of problems with it, but I think it sets up the general reason that we would want robustness. So with robustness techniques, you're basically trying to get some nice worst case guarantees that say, "Yeah, the AI system is never going to screw up super, super bad." And this is helpful when you have an AI system that's going to make many, many, many decisions, and we want to make sure that none of those decisions are going to be catastrophic.

And so some techniques in here include verification, adversarial training, and other adversarial ML techniques like handling data poisoning, Byzantine fault tolerance, stuff like that. Interpretability can also be helpful for robustness if you've got a strong overseer who can use interpretability to give good feedback to your AI system. But yeah, the overall goal is to take something that doesn't fail 99 percent of the time and get it to not fail 100 percent of the time, or to check whether or not it ever fails, so that you don't have this very rare but very bad outcome.
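As one concrete example of the techniques listed here, a minimal sketch of adversarial training with an FGSM-style perturbation might look like this; `model`, `loss_fn`, and `optimizer` are placeholders, and real pipelines typically use stronger attacks.

```python
import torch

def fgsm_perturb(model, loss_fn, x, y, eps=0.03):
    """Fast gradient sign method: nudge the input in the direction that most increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def adversarial_training_step(model, loss_fn, optimizer, x, y):
    """One training step on adversarially perturbed inputs instead of the clean batch."""
    x_adv = fgsm_perturb(model, loss_fn, x, y)
    optimizer.zero_grad()                 # clear any gradients left over from the attack step
    loss = loss_fn(model(x_adv), y)       # train the model to behave well on the perturbed batch
    loss.backward()
    optimizer.step()
    return loss.item()
```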

Lucas: And so would you see this section as being within the context of any others or being sort of at a higher level of abstraction?

Rohin: I would say that it applies to any of the others. Well, okay, not the MIRI embedded agency stuff, because we don't really have a story for how that ends up helping with AI safety; it could apply to however that cashes out in the future, but we don't really know right now. With limited AGI, maybe you have this theoretical model: if you apply this sort of penalty, this sort of impact measure, then you're never going to have any catastrophic outcomes.

But, of course, in practice, we train our AI systems to optimize that penalty and get this sort of weird black box thing out, and we're not entirely sure whether it's respecting the penalty or something like this. Then you could use something like verification or transparency in order to make sure that this is actually behaving the way we would predict it to behave, based on our analysis of what limits we need to put on the AI system.

Similarly, if you build AI systems that are doing what we want, maybe you want to use adversarial training to see if you can find any situations in which the AI system's doing something weird, something we wouldn't classify as what we want. With iterated amplification or debate, maybe we want to verify that the corrigibility property holds all the time. It's unclear how you would use verification for that, because it seems like a particularly hard property to formalize, but you could still do things like adversarial training or transparency.

We might have these theoretical arguments for why our systems will work, but once we turn them into actual real systems that will probably use neural nets and other messy stuff like that, are we sure that in the translation from theory to practice, all of our guarantees stayed? Unclear; we should probably use some robustness techniques to check that.

Interpretability, I believe, was next. It's sort of similar in that it's broadly useful for everything else. If you want to figure out whether an AI system is doing what you want, it would be really helpful to be able to look into the agent and see, "Oh, it chose to buy apples because it had seen me eat apples in the past," versus, "It chose to buy apples because there was this company that got it to buy the apples so that the company would make more profit."

If we could see those two cases, if we could actually see into the decision making process, it becomes a lot easier to tell whether or not the AI system is doing what we want, or whether or not the AI system is corrigible, or whether or not the AI system is properly … Well, maybe it's not as obvious for impact measures, but I would expect it to be useful there as well, even if I don't have a story off the top of my head.

Similarly with robustness, if you're doing something like adversarial training, it sure would help if your adversary was able to look into the inner workings of the agent and be like, "Ah, I see, this agent tends to underweight this particular class of risky outcomes. So why don't I search within that class of situations for one where it's going to take a big risk that it shouldn't have taken?" It just makes all of the other problems a lot easier to solve.

Lucas: And so how is progress made on interpretability?

Rohin: Right now I think most of the progress is in image classifiers. I've seen some work on interpretability for deep RL as well. Honestly, that's probably where most of the research is happening: with classification systems, primarily image classifiers, but others as well. And then I also see the deep RL explanation systems, because I read a lot of deep RL research.

But it's motivated a lot by real problems with current AI systems, and interpretability helps you to diagnose and fix those as well. For example, there are the problems of bias in classifiers. One thing that I remember from Deep Dream is that you can ask Deep Dream to visualize barbells, and you always see these sort of muscular arms attached to the barbells because, in the training set, barbells were always being picked up by muscular people. So that's a way you can tell that your classifier is not really learning the concepts that you wanted it to learn.

In the bias case maybe your classifier always classifies anyone sitting at a computer as a man, because of bias in the data set. And using interpretability techniques, you could see that, okay when you look at this picture, the AI system is looking primarily at the pixels that represent the computer, as opposed to the pixels that represent the human. And making its decision to label this person as a man, based on that, and you’re like, no, that’s clearly the wrong thing to do. The classifier should be paying attention to the human, not to the laptop.

So I think a lot of interpretability research right now is: you take a particular short term problem and figure out how you can make that problem easier to solve. Though a lot of it is also asking, what would be the best way to understand what our model is doing? I think a lot of the work that Chris Olah is doing, for example, is in this vein, and then as you do this exploration, you find some sort of bias in the classifiers that you're studying.

So, Comprehensive AI Services is an attempt to predict what the future of AI development will look like, and the hope is that, by doing this, we can figure out what sort of technical safety things we will need to do, or, strategically, what sort of things we should push for in the AI research community in order to make those systems safer.

There's a big difference between "we are going to build a single unified AGI agent that's going to be generally intelligent and optimize the world according to a utility function" versus "we are going to build a bunch of disparate, separate, narrow AI systems that are going to interact with each other quite a lot, and because of that, they will be able to do a wide variety of tasks, though none of them are going to look particularly like expected utility maximizers." The safety research you want to do is different in those two different worlds, and CAIS is basically saying, "We're in the second of those worlds, not the first one."

Lucas: Can you go ahead and tell us about ambitious value learning?

Rohin: Yeah, so with ambitious value learning, this is also an approach to how do we make an aligned AGI that solves the entire problem, in some sense. The idea is to look at not just human behavior, but also human brains and the algorithm that they implement, and use that to infer an adequate utility function, one such that we would be okay with the behavior that results from optimizing it.

Infer this utility function, then plug it into an expected utility maximizer. Now, of course, even once we have the utility function, we do have to solve the problem of how we actually build a system that maximizes that utility function, which is not a solved problem yet. But it does seem to capture some of the main difficulties, if you could actually solve it. And so that's an approach I associate most with Stuart Armstrong.

Lucas: Alright, and so you were saying earlier, in terms of your own view, it’s sort of an amalgamation of different credences that you have in the potential efficacy of all these different approaches. So, given all of these and all of their broad missions, and interests, and assumptions that they’re willing to make, what are you most hopeful about? What are you excited about? How do you, sort of, assign your credence and time here?

Rohin: I think I'm most excited about the concept of corrigibility. That seems like the right thing to aim for, it seems like it's a thing we can achieve, and it seems like if we achieve it, we're probably okay; nothing's going to go horribly wrong, and it probably will go very well. I am less confident in which approach to corrigibility I am most excited about. Iterated amplification and debate seem like, if we were to implement them, they would probably lead to corrigible behavior. But I am worried that either we won't actually be able to build generally intelligent agents, in which case both of those approaches don't really work, or, another worry that I have, those approaches might be too expensive to actually run, in that other systems are just so much more computationally efficient that we just use those instead, due to economic pressures.

Paul does not seem to be worried by either of these things. He’s definitely aware of both these issues, in fact, he was the one, I think, who listed computational efficiency as a desideratum, and he is still optimistic about them. So, I would not put a huge amount of credence in this view of mine.

If I were to say what I was excited about for corrigibility instead of that, it would be something like: take the research that we’re currently doing on how to get current AI systems to work, which is often called ‘narrow value learning’. It seems plausible that this research, extended into the future, will give us some method of creating an AI system that’s implicitly learning our narrow values, and is corrigible as a result of that, even if it is not generally intelligent.

This is sort of a very hand-wavy, speculative intuition, certainly not as concrete as the hope that we have with iterated amplification, but I’m somewhat optimistic about it. I’m less optimistic about limiting AI systems: it seems like even if you succeed in finding a nice, simple rule that eliminates all catastrophic behaviors, which plausibly you could do, it seems hard to find one that both does that and also lets you do all of the things that you do want to do.

If you’re talking about impact metrics, for example, if you require the AI to be low impact, I expect that that would prevent you from doing many things that we actually want to do, because many things that we want to do are actually quite high impact. Now, Alex Turner disagrees with me on this, and he developed attainable utility preservation. He is explicitly working on this problem and disagrees with me, so again I don’t know how much credence to put in this.

I don’t know if Vika agrees with me on this or not; she also might disagree with me, and she is also directly working on this problem. So, yeah, it seems hard to put a limit on the system that also lets us do the things that we want. And in that case, it seems like, due to economic pressures, we’d end up not placing those limits on our AI systems.

I want to keep emphasizing my extreme uncertainty over all of this, given that other people disagree with me, but that’s my current opinion. Similarly with boxing, it seems like it’s going to just make it very hard to actually use the AI system. On robustness and interpretability: interpretability seems very broadly useful, and I’m supportive of most research on interpretability, maybe with an eye towards long-term concerns, just because it seems to make every other approach to AI safety a lot more feasible and easier to solve.

I don’t think it’s a solution by itself, but given that it seems to improve almost every story I have for making an aligned AGI, it seems like it’s very much worth getting a better understanding of it. Robustness is an interesting one: it’s not clear to me if it is actually necessary. I kind of want to just voice lots of uncertainty about robustness and leave it at that. It’s certainly good to do, in that it helps us be more confident in our AI systems, but maybe everything would be okay even if we just didn’t do anything. I don’t know, I feel like I would have to think a lot more about this, and also see the techniques that we actually use to build AGI, in order to have a better opinion on that.

Lucas: Could you give a few examples of where your intuitions here are coming from that don’t see robustness as an essential part of AI alignment?

Rohin: Well, one major intuition: if you look at humans, there are at least some humans where I’m like, “Okay, I could just make this human a lot smarter, a lot faster, have them think for many, many years, and I still expect that they will be robust and not lead to some catastrophic outcome. They may not do exactly what I would have done, because they’re doing what they want, but they’re probably going to do something reasonable; they’re not going to do something crazy or ridiculous.”

I feel like some humans, the sufficiently risk-averse and uncertain ones, seem to be reasonably robust. I think that if you know that you’re planning over a very, very, very long time horizon, so imagine that you know you’re planning over billions of years, then the rational response to this is, “I really better make sure not to screw up right now, since there is just so much reward in the future, I really need to make sure that I can get it.” And so you get very strong pressures for preserving option value or not doing anything super crazy. So I think you could, plausibly, just get reasonable outcomes from those effects. But again, these are not well thought out.

Lucas: All right, and so I just want to go ahead and guide us back to your general views, again, on the approaches. Is there anything that you’d like to add there on the approaches?

Rohin: I think I didn’t talk about CAIS yet. I guess my general view of CAIS is that I broadly agree with it, that this does seem to be the most likely development path, meaning that it’s more likely than any other specific development path, but not more likely than all of the other development paths put together.

So I broadly agree with the worldview presented, I’m still trying to figure out what implications it has for technical safety research. I don’t agree with all of it, in particular, I think that you are likely to get AGI agents at some point, probably, after the CAIS soup of services happens. Which, I think, again, Drexler disagrees with me on that. So, put a bunch of uncertainty on that, but I broadly agree with that worldview that CAIS is proposing.

Lucas: In terms of this disagreement between you and Eric Drexler, are you imagining agenty AGI or super-intelligence which comes after the CAIS soup? Do you see that as an inevitable byproduct of CAIS or do you see that as an inevitable choice that humanity will make? And is Eric pushing the view that the agenty stuff doesn’t necessarily come later, it’s a choice that human beings would have to make?

Rohin: I do think it’s more like saying that this will be a choice that humans will make at some point. I’m sure that Eric, to some extent, is saying, “Yeah, just don’t do that.” But I think Eric and I do, in fact, have a disagreement on how much more performance you can get from an AGI agent than from a CAIS soup of services. My argument is something like: there is efficiency to be gained from going to an AGI agent, and Eric’s position, as best I understand it, is that there is actually just not that much economic incentive to go to an AGI agent.

Lucas: What are your intuition pumps for why you think that you will gain a lot of computational efficiency from creating sort of an AGI agent? We don’t have to go super deep, but I guess a terse summary or something?

Rohin: Sure, I guess the main intuition pump is that in all of the past cases that we have of AI systems, you see that in speech recognition, in deep reinforcement learning, in image classification, we had all of these hand-built systems that separated things out into a few different modules that interacted with each other in a vaguely CAIS-like way. And then, at some point, we got enough compute and large enough data sets that we just threw deep learning at it, and deep learning just blew those approaches out of the water.

So there’s the argument from empirical experience, and there’s also the argument that if you try to modularize your systems yourself, you can’t really optimize the communication between them; you’re less integrated and you can’t make decisions based on global information, you have to make them based off of local information. And so the decisions tend to be a little bit worse. This could be taken as an explanation for the empirical observation that I just made; so that’s another intuition pump there.

Eric’s response would probably be something like, “Sure, this seems true for these narrow tasks.” You can get a lot of efficiency gains by integrating everything together and throwing deep learning and [inaudible 00:54:10] training at all of it. But for sufficiently high-level tasks, there’s not really that much to be gained by using global information instead of local information, so you don’t actually lose much by having these separate systems, and you do get a lot of computational efficiency and generalization bonuses by modularizing. He had a good example of this that I’m not going to replicate, and I don’t want to make up my own example, because it’s not going to be as convincing; but that’s his current argument.

And then my counter-argument is that’s because humans have small brains, so given the size of our brains and the limits of our data, and the limits of the compute that we have, we are forced to do modularity and systematization to break tasks apart into modular chunks that we can then do individually. Like if you are running a corporation, you need each person to specialize in their own task without thinking about all the other tasks, because we just do not have the ability to optimize for everything all together because we have small brains, relatively speaking; or limited brains, is what I should say.

But this is not a limit that AI systems will have. An AI system with vastly more compute than the human brain, and vastly more data, will, in fact, just be able to optimize all of this with global information and get better results. So that’s one thread of the argument, taken down two or three levels of arguments and counter-arguments. There are other threads of that debate, as well.

Lucas: I think that that serves a purpose for illustrating that here. So are there any other approaches here that you’d like to cover, or is that it?

Rohin: I didn’t talk about factored cognition very much. But I think it’s worth highlighting separately from iterated amplification in that it’s testing an empirical hypothesis of can humans decompose tasks into chunks of some small amount of time? And can we do arbitrarily complex tasks using these humans? I am particularly excited about this sort of work that’s trying to figure out what humans are capable of doing and what supervision they can give to AI systems.

Mostly because, going back to a thing I said way back in the beginning, what we’re aiming for is for the human-AI system to be collectively rational, as opposed to the AI system being individually rational. Part of the human-AI system is the human: you want to be able to know what the human can do, what sort of policies they can implement, what sort of feedback they can be giving to the AI system. And something like factored cognition is testing a particular aspect of that; and I think that seems great and we need more of it.

Lucas: Right. I think that this seems to be the emerging view of where social science or social scientists are needed in AI alignment: in order to, again as you said, understand what human beings are capable of in terms of supervision, and to analyze the human component of the AI alignment problem, as it requires us to be collectively rational with AI systems.

Rohin: Yeah, that seems right. I expect more writing on this in the future.

Lucas: All right, so there’s just a ton of approaches here to AI alignment, and our heroic listeners have a lot to take in here. In terms of getting more information, generally, about these approaches or if people are still interested in delving into all these different views that people take at the problem and methodologies of working on it, what would you suggest that interested persons look into or read into?

Rohin: I cannot give you an overview of everything, because that does not exist. To the extent that it exists, it’s either this podcast or the talk that I did at Beneficial AGI. I can suggest resources for individual items, so for embedded agency there’s the embedded agency sequence on the Alignment Forum; far and away the best thing to read for that.

For CAIS, Comprehensive AI Services, there was a 200 plus page tech report published by Eric Drexler at the beginning of this month, if you’re interested, you should go read the entire thing; it is quite good. But I also wrote a summary of it on the Alignment Forum, which is much more readable, in the sense that it’s shorter. And then there are a lot of comments on there that analyze it a bit more.

There’s also another summary written by Richard Ngo, also on the Alignment Forum. Maybe it’s only on Lesswrong, I forget; it’s probably on the Alignment Forum. But that’s a different take on comprehensive AI services, so I’d recommend reading that too.

For limited AGI, I have not really been keeping up with the literature on boxing, so I don’t have a favorite to recommend. I know that a couple have been written by, I believe, Jim Babcock and Roman Yampolskiy.

For impact measures, you want to read Vika’s paper on relative reachability. There’s also a blog post about it if you don’t want to read the paper. And Alex Turner’s blog posts on attainable utility preservation, I think it’s called ‘Towards A New Impact Measure’, and this is on the Alignment Forum.

For robustness, I would read Paul Christiano’s post called ‘Techniques For Optimizing Worst Case Performance’. This is definitely specific to how robustness will help under Paul’s conception of the problem and, in particular, his thinking of robustness in the setting where you have a very strong overseer for your AI system. But I don’t know of any other papers or blog post that’s talking about robustness, generally.

For AI systems that do what we want, there’s my value learning sequence that I mentioned before on the Alignment Forum. There’s CIRL or Cooperative Inverse Reinforcement Learning which is a paper by Dylan and others. There’s Deep Reinforcement Learning From Human Preferences and Recursive Reward Modeling, these are both papers that are particular instances of work in this field. I also want to recommend Inverse Reward Design, because I really like that paper; so that’s also a paper by Dylan, and others.

For corrigibility and iterated amplification, the iterated amplification sequence on the Alignment Forum or half of what Paul Christiano has written. If you want to read not an entire sequence of blog posts, then I think Clarifying AI alignment is probably the post I would recommend. It’s one of the posts in the sequence and talks about this distinction of creating an AI system that is trying to do what you want, as opposed to actually doing what you want and why we might want to aim for only the first one.

For iterated amplification itself, that technique, there is a paper that I believe is called something like Supervising Strong Learners By Amplifying Weak Experts, which is a good thing to read, and there’s also a corresponding OpenAI blog post, whose name I forget. I think if you search iterated amplification, OpenAI blog, you’ll find it.

And then for debate, there’s AI Safety via Debate, which is a paper; there’s also a corresponding OpenAI blog post. For factored cognition, there’s a post called Factored Cognition, on the Alignment Forum; again, in the iterated amplification sequence.

For interpretability, there isn’t really anything talking about interpretability from the strategic point of view of why we want it. I guess that same post I recommended before, Techniques For Optimizing Worst Case Performance, talks about it a little bit. For actual interpretability techniques, I recommend the Distill articles, the building blocks of interpretability and feature visualization, but these are more about particular techniques for interpretability, as opposed to why we want interpretability.

And on ambitious value learning, the first chapter of my sequence on value learning talks exclusively about ambitious value learning; so that’s one thing I’d recommend. But also Stuart Armstrong has so many posts, I think there’s one that’s about resolving human values adequately and something else, something like that. That one might be one worth checking out, it’s very technical though; lots of math.

He’s also written a bunch of posts that convey the intuitions behind the ideas. They’re all split into a bunch of very short posts, so I can’t really recommend any one particular one. You could go to the alignment newsletter database and just search Stuart Armstrong, and click on all of those posts and read them. I think that was everything.

Lucas: That’s a wonderful list. So we’ll go ahead and link those all in the article which goes along with this podcast, so that’ll all be there, organized in nice, neat lists for people. This has all probably been fairly overwhelming in terms of the number of approaches, how they differ, and how one is to adjudicate the merits of all of them. If someone is just entering the space of AI alignment, or is beginning to be interested in these different technical approaches, do you have any recommendations?

Rohin: Reading a lot, rather than trying to do actual research. This was my strategy, I started back in September of 2017 and I think for the first six months or so, I was reading about 20 hours a week, in addition to doing research; which was why it was only 20 hours a week, it wasn’t a full time thing I was doing.

And I think that was very helpful for actually forming a picture of what everyone was doing. Now, it’s plausible that you don’t want to actually learn about what everyone is doing, and you’re okay with saying, “I’m fairly confident that this particular problem is an important piece of the problem and we need to solve it.” I think it’s very easy to get that wrong, so I’m a little wary of recommending that, but it’s a reasonable strategy to just say, “Okay, we probably will need to solve this problem, but even if we don’t, the intuitions that we get from trying to solve this problem will be useful.”

Focusing on that particular problem, reading all of the literature on that, attacking that problem, in particular, lets you start doing things faster, while still doing things that are probably going to be useful; so that’s another strategy that people could do. But I don’t think it’s very good for orienting yourself in the field of AI safety.

Lucas: So you think that there’s a high value in people taking this time to read, to understand all the papers and the approaches before trying to participate in particular research questions or methodologies. Given how open this question is, all the approaches make different assumptions and take for granted different axioms which all come together to create a wide variety of things which can both complement each other and have varying degrees of efficacy in the real world when AI systems start to become more developed and advanced.

Rohin: Yeah, that seems right to me. Part of the reason I’m recommending this is because it seems to me that no one does this. I think, on the margin, I want more people who do this. A world where 20 percent of the people were doing this, and the other 80 percent were just taking particular pieces of the problem and working on those, might be the right balance, somewhere around there; I don’t know, it depends on how you count who is actually in the field. But somewhere between one and 10 percent of the people are doing this; closer to the one.

Lucas: Which is quite interesting, I think, given that it seems like AI alignment should be in a stage of maximum exploration, given that the conceptual mapping of the territory is very young. I mean, we’re essentially seeing the birth and initial development of an entirely new field and specific application of thinking. And there are many more mistakes to be made, and concepts to be clarified, and layers to be built. So it seems like we should be maximizing our attention on exploring the general space, trying to develop models of the efficacy of different approaches and philosophies and views of AI alignment.

Rohin: Yeah, I agree with you, that should not be surprising given that I am one of the people doing this, or trying to do this. Probably the better critique will come from people who are not doing this, and can tell both of us why we’re wrong about this.

Lucas: We’ve covered a lot here in terms of the specific approaches, your thoughts on the approaches, where we can find resources on the approaches, why setting the approaches matters. Are there any parts of the approaches that you feel deserve more attention in terms of these different sections that we’ve covered?

Rohin: I think I would want more work on looking at the intersection between things that are supposed to be complementary. How interpretability can help you have AI systems that have the right goals, for example, would be a cool thing to do. Or what you need to do in order to get verification, which is a sub-part of robustness, to give you interesting guarantees on AI systems that we actually care about.

Most of the work on verification right now is like: there’s this nice specification that we have for adversarial examples, in particular, is there an input that is within some distance of a training data point such that it gets classified differently from that training data point? That’s a nice formal specification, and most of the work in verification takes this specification as given and figures out more and more computationally efficient ways to actually verify that property.
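
As a purely illustrative sketch of the specification Rohin is describing, here is the adversarial-robustness property written out as a predicate. The `model` callable and the sampling loop are hypothetical stand-ins: real verification work proves the property over the entire perturbation ball, whereas the naive sampler below can only falsify it.

```python
import numpy as np

def robustness_spec_holds(model, x, label, eps, n_samples=10000):
    """Spec: no input within L-infinity distance eps of the training point x
    is classified differently from `label`.  This random sampler can only
    find counterexamples; a verifier would prove the property exhaustively."""
    for _ in range(n_samples):
        perturbed = x + np.random.uniform(-eps, eps, size=x.shape)
        if model(np.clip(perturbed, 0.0, 1.0)) != label:
            return False  # counterexample found: the spec is violated
    return True           # no counterexample found (not a proof)
```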

That does seem like a thing that needs to happen, but the much more urgent thing, in my mind, is how do we come up with these specifications in the first place? If I want to verify that my AI system is corrigible, or I want to verify that it’s not going to do anything catastrophic, or that it is going to not disable my value learning system, or something like that; how do I specify this at all in any way that lets me do something like a verification technique even given infinite computing power? It’s not clear to me how you would do something like that, and I would love to see people do more research on that.

That particular thing is my current reason for not being very optimistic about verification, in particular, but I don’t think anyone has really given it a try. So it’s plausible that there’s actually just some approach that could work that we just haven’t found yet because no one’s really been trying. I think all of the work on limited AGI is talking about, okay, does this actually eliminate all of the catastrophic behavior? Which, yeah, that’s definitely an important thing, but I wish that people would also do research on, given that we put this penalty or this limit on the AGI system, what things is it still capable of doing?

Have we just made it impossible for it to do anything of interest whatsoever, or can it actually still do pretty powerful things, even though we’ve placed these limits on it? That’s the main thing I want to see. Then, for AI systems that do what we want, probably the biggest thing I want to see there, and I’ve been trying to do some of this myself, is conceptual thinking about how this leads to good outcomes in the long term. So far, we’ve not been dealing with the fact that the human doesn’t actually have a nice, consistent utility function that they know and that can be optimized. So, once you relax that assumption, what the hell do you do? And then there’s also a bunch of other problems that would benefit from more conceptual clarification; maybe I don’t need to go into all of them right now.

Lucas: Yeah. And just to sort of inject something here that I think we haven’t touched on and that you might have some words about in terms of approaches. We discussed sort of agential views of advanced artificial intelligence, a services-based conception, though I don’t believe that we have talked about aligning AI systems that simply function as oracles or having a concert of oracles. You can get rid of the services thing, and the agency thing if the AI just tells you what is true, or answers your questions in a way that is value aligned.

Rohin: Yeah, I mostly want to punt on that question because I have not actually read all the papers. I might have read a grand total of one paper on oracles, plus Superintelligence, which talks about oracles. So I feel like I know so little about the state of the art on oracles that I should not actually say anything about them.

Lucas: Sure. So then, just as a broad point for our audience: in terms of conceptualizing these different approaches to AI alignment, it’s important to consider the kind of AI system that you’re thinking about and the kinds of features and properties that it has, and oracles are another version here that one can play with in one’s AI alignment thinking.

Rohin: I think the canonical paper there is something like Good and Safe Uses of AI Oracles, but I have not actually read it. There is a list of things I want to read, and it is on that list. But that list also has, I think, something like 300 papers on it, and apparently I have not gotten to oracles yet.

Lucas: And so for the sake of this whole podcast being as comprehensive as possible, are there any conceptions of AI, for example, that we have omitted so far adding on to this agential view, the CAIS view of it actually just being a lot of distributed services, or an oracle view?

Rohin: There’s also the Tool AI view. This is different from the services view, but it’s somewhat akin to the view you were talking about at the beginning of this podcast, where you’ve got AI systems that have a narrowly defined input/output space, they’ve got a particular thing that they do within limits, and they just sort of take in their inputs, do some computation, and spit out their outputs, and that’s it; that’s all that they do. You can’t really model them as having some long-term utility function that they’re optimizing; they’re just implementing a particular input-output relation, and that’s all they’re trying to do.

Even saying something like “they are trying to do X” is basically using a bad model for them. I think the main argument against expecting tool AI systems is that they’re probably not going to be as useful as services or agential AI, because tool AI systems would have to be programmed in a way where we understood what they were doing and why they were doing it. Whereas agential AI systems or services would be able to consider new possible ways of achieving goals that we hadn’t thought about and enact those plans.

And so they could get superhuman behavior by considering things that we wouldn’t consider. Whereas tool AIs … Like, Google Maps is superhuman in some sense, but it’s superhuman only because it has a compute advantage over us. If we were given all of the data and all of the time, in human real time, that Google Maps had, we could implement a similar sort of algorithm as Google Maps and compute the optimal route ourselves.

Lucas: There seems to be this duality that is constantly being formed in our conception of AI alignment, where the AI system is this tangible external object which stands in some relationship to the human and is trying to help the human to achieve certain things.

Are there conceptions of value alignment which, however the procedure or methodology is done, change or challenge the relationship between the AI system and the human, where they challenge what it means to be the AI or what it means to be human, such that there’s potentially some sort of merging or disruption of this dualistic scenario?

Rohin: I don’t really know, I mean, it sounds like you’re talking about things like brain computer interfaces and stuff like that. I don’t really know of any intersection between AI safety research and that. I guess, this did remind me, too, that I want to make the point that all of this is about the relatively narrow, I claim, problem of aligning an AI system with a single human.

There is also the problem of: okay, what if there are multiple humans, what if there are multiple AI systems, what if you’ve got a bunch of different groups of people and each group is value aligned within themselves, they build an AI that’s value aligned with them, but lots of different groups do this; now what happens?

Solving the problem that I’ve been talking about does not mean that you have a good outcome in the long-term future; it is merely one piece of a larger overall picture. I don’t think any of that larger overall picture removes the dualistic thing that you were talking about, but the dualistic part reminded me of the fact that I am talking about a narrow problem and not the whole problem, in some sense.

Lucas: Right and so just to offer some conceptual clarification here, again, the first problem is how do I get an AI system to do what I want it to do when the world is just me and that AI system?

Rohin: Me and that AI system and the rest of humanity, but the rest of humanity is treated as part of the environment.

Lucas: Right, so you’re not modeling other AI systems or how some mutually incompatible preferences and trained systems would interact in the world or something like that?

Rohin: Exactly.

Lucas: So the full AI alignment problem is… It’s funny because it’s just the question of civilization, I guess. How do you get the whole world and all of the AI systems to make a beautiful world instead of a bad world?

Rohin: Yeah, I’m not sure if you saw my lightning talk at Beneficial AGI, but I talked a bit about those. I think I called that top-level problem “make AI-related future stuff go well”; very, very, very concrete, obviously.

Lucas: It makes sense. People know what you’re talking about.

Rohin: I probably wouldn’t call that broad problem the AI alignment problem. I kind of wonder whether there’s a different name for that broader problem. We could maybe call it the ‘AI Safety Problem’ or the ‘AI Future Problem’, I don’t know. The ‘Beneficial AI Problem’, actually, I think that’s what I used last time.

Lucas: That’s a nice way to put it. So I think that, conceptually, that leaves us at a very good place for this first section.

Rohin: Yeah, seems pretty good to me.

Lucas: If you found this podcast interesting or useful, please make sure to check back for part two in a couple weeks where Rohin and I go into more detail about the strengths and weaknesses of specific approaches.

We’ll be back again soon with another episode in the AI Alignment podcast.

[end of recorded material]

AI Alignment Podcast: AI Alignment through Debate with Geoffrey Irving

“To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self play on a zero sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information…  In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus theoretical questions about the meaning of AI alignment.” AI safety via debate

Debate is something that we are all familiar with. Usually it involves two or more persons giving arguments and counter arguments over some question in order to prove a conclusion. At OpenAI, debate is being explored as an AI alignment methodology for reward learning (learning what humans want) and is a part of their scalability efforts (how to train/evolve systems to safely solve questions of increasing complexity). Debate might sometimes seem like a fruitless process, but when optimized and framed as a two-player zero-sum perfect-information game, we can see properties of debate and synergies with machine learning that may make it a powerful truth seeking process on the path to beneficial AGI.

On today’s episode, we are joined by Geoffrey Irving. Geoffrey is a member of the AI safety team at OpenAI. He has a PhD in computer science from Stanford University, and has worked at Google Brain on neural network theorem proving, cofounded Eddy Systems to autocorrect code as you type, and has worked on computational physics and geometry at Otherlab, D. E. Shaw Research, Pixar, and Weta Digital. He has screen credits on Tintin, Wall-E, Up, and Ratatouille. 

We hope that you will join in the conversations by following us or subscribing to our podcasts on Youtube, SoundCloud, iTunes, Google Play, Stitcher, or your preferred podcast site/application. You can find all the AI Alignment Podcasts here.

Topics discussed in this episode include:

  • What debate is and how it works
  • Experiments on debate in both machine learning and social science
  • Optimism and pessimism about debate
  • What amplification is and how it fits in
  • How Geoffrey took inspiration from amplification and AlphaGo
  • The importance of interpretability in debate
  • How debate works for normative questions
  • Why AI safety needs social scientists
You can find out more about Geoffrey Irving at his website. Here you can find the debate game mentioned in the podcast. Here you can find Geoffrey Irving, Paul Christiano, and Dario Amodei’s paper on debate. Here you can find an Open AI blog post on AI Safety via Debate. You can listen to the podcast above or read the transcript below.

Lucas: Hey, everyone. Welcome back to the AI Alignment Podcast. I’m Lucas Perry, and today we’ll be speaking with Geoffrey Irving about AI safety via Debate. We discuss how debate fits in with the general research directions of OpenAI, what amplification is and how it fits in, and the relation of all this with AI alignment. As always, if you find this podcast interesting or useful, please give it a like and share it with someone who might find it valuable.

Geoffrey Irving is a member of the AI safety team at OpenAI. He has a PhD in computer science from Stanford University, and has worked at Google Brain on neural network theorem proving, cofounded Eddy Systems to autocorrect code as you type, and has worked on computational physics and geometry at Otherlab, D. E. Shaw Research, Pixar, and Weta Digital. He has screen credits on Tintin, Wall-E, Up, and Ratatouille. Without further ado, I give you Geoffrey Irving.

Thanks again, Geoffrey, for coming on the podcast. It’s really a pleasure to have you here.

Geoffrey: Thank you very much, Lucas.

Lucas: We’re here today to discuss your work on debate. I think that just to start off, it’d be interesting if you could provide for us a bit of framing for debate, and how debate exists at OpenAI, in the context of OpenAI’s general current research agenda and directions that OpenAI is moving right now.

Geoffrey: I think broadly, we’re trying to accomplish AI safety by reward learning, so learning a model of what humans want and then trying to optimize agents that achieve that model, so do well according to that model. There’s sort of three parts to learning what humans want. One part is just a bunch of machine learning mechanics of how to learn from small sample sizes, how to ask basic questions, how to deal with data quality. There’s a lot more work, then, on the human side, so how do humans respond to the questions we want to ask, and how do we sort of best ask the questions?

Then, there’s sort of a third category of how do you make these systems work even if the agents are very strong? So stronger than human in some or all areas. That’s sort of the scalability aspect. Debate is one of our techniques for doing scalability, Amplification being the first one, and Debate is a version of that. Generally, we want to be able to supervise a learning agent, even if it is smarter than a human or stronger than a human on some task or on many tasks.

Debate is you train two agents to play a game. The game is that these two agents see a question on some subject, they give their answers. Each debater has their own answer, and then they have a debate about which answer is better, which means more true and more useful, and then a human sees that debate transcript and judges who wins based on who they think told the most useful true thing. The result of the game is, one, who won the debate, and two, the answer of the person who won the debate.
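
A minimal sketch of the game loop Geoffrey just described, assuming hypothetical `debater` and `judge` interfaces (this is an illustration, not OpenAI’s actual implementation): two agents answer a question, alternate statements for a fixed number of turns, and a judge picks the winner; the game’s result is the winner and that debater’s answer.

```python
def play_debate(question, debaters, judge, n_turns=6):
    """Hypothetical sketch of one debate game.
    debaters: pair of objects with .answer(question) and
              .argue(question, answers, transcript) methods.
    judge:    callable mapping (question, answers, transcript) -> 0 or 1,
              standing in for the human (or a model of the human)."""
    answers = [debaters[0].answer(question), debaters[1].answer(question)]
    transcript = []
    for turn in range(n_turns):
        speaker = debaters[turn % 2]                    # agents alternate statements
        transcript.append(speaker.argue(question, answers, transcript))
    winner = judge(question, answers, transcript)       # who told the most useful, true thing
    return winner, answers[winner]                      # result: the winner and their answer
```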

You can also have variants where the judge interacts during the debate. We can get into these details. The general point is that, in many tasks, it is much easier to recognize good answers than it is to come up with the answers yourself. This applies at several levels.

For example, at the first level, you might have a task where a human can’t do the task, but they can know immediately if they see a good answer to the task. Like, I’m bad at gymnastics, but if I see someone do a flip very gracefully, then I can know, at least to some level of confidence, that they’ve done a good job. There are other tasks where you can’t directly recognize the answer, so you might see an answer, it looks plausible, say, “Oh, that looks like a great answer,” but there’s some hidden flaw. If an agent were to point out that flaw to you, you’d then think, “Oh, that’s actually a bad answer.” Maybe it was misleading, maybe it was just wrong. You need two agents doing a back and forth to be able to get at the truth.

Then, if you apply this recursively through several levels, you might have tasks where you can’t recognize whether an answer is good directly. You can’t even recognize whether a counterargument is good. Maybe a counter-counterargument, then you could recognize it. If you do sort of three steps of Debate, back and forth with two agents, you get to the truth.

Depending on the task, you need some number of these steps to pin down what the true answer is. Broadly, what’s happening is that the debaters, when they’re giving these counterarguments, they’re skipping over a lot of the space where maybe the human doesn’t understand what would happen, but they know that it would go a certain way. Like, if you’d see two expert players playing a game of Go, they’re constantly ignoring a larger number of moves which are just very bad and would waste a lot of time if you try to analyze them, and they’re just looking at kind of the most important part of the arguments about what is the best move in the game.

So far, all of the experiments are on simple image tasks, like MNIST, which is a dataset of digits, plus some preliminary human experiments with just human debaters. On MNIST, an example is you have a sparse classifier that learns to recognize MNIST digits from just seeing a few pixels, so you might train it by picking random masks of just a few pixels of an MNIST image. This classifier is very bad, because it only sees a small part of the image. It might get accuracies below 50%; I forget the exact numbers, I could look them up.

Then, you train debaters that choose which parts of the image to show to the judge. You fix the judge at this kind of weak training process, and then the debaters will play a game where they try to pick pixels that convince the judge of what they want to argue for. There’s kind of two ways to decide … In the first paper, we had one debater that was incentivized to tell the truth, so it knew the correct class. Like say, we’re looking at a seven, and the other debater had to pre-specify which digit it was going to lie for. It might say, “I’m going to lie and say things were an eight.”

The game would be each of these agents pick a single pixel, and you sort of iterate for, let’s say six rounds, picking one pixel after another, and then the judge sees those pixels, and chooses who wins the debate. The judge has seen kind of a small part of the evidence. The result of that is that you can boost a weak classifier up to sort of higher performance, even if the agents are incentivized only to win the game. In this case, if you pick one debater to be the honest player and one debater to be the liar, the honest player wins a significant majority of the time, much more than the classifier would win on random play.
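
A toy sketch of the sparse-pixel setup described above, with hypothetical stand-ins for the judge classifier and the two debaters (this is not the paper’s actual code): the honest player and the liar alternately reveal pixels, and the weak judge decides between the true class and the pre-committed lie from only the revealed pixels.

```python
import numpy as np

def pixel_debate(image, true_label, lie_label, judge, honest, liar, n_pixels=6):
    """Hypothetical sketch of one sparse-pixel debate on an MNIST image.
    judge(revealed_mask, image) -> class probabilities from revealed pixels only.
    honest/liar have .pick_pixel(image, mask, claimed_label) -> (row, col)."""
    mask = np.zeros(image.shape, dtype=bool)
    for _ in range(n_pixels // 2):
        mask[honest.pick_pixel(image, mask, true_label)] = True  # reveal a pixel supporting the truth
        mask[liar.pick_pixel(image, mask, lie_label)] = True     # reveal a pixel supporting the lie
    probs = judge(mask, image)
    # The judge only adjudicates between the two claimed digits.
    return true_label if probs[true_label] >= probs[lie_label] else lie_label
```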

Lucas: The judge here, sorry, was not a human?

Geoffrey: The judge here is not a human. It’s just a classifier trained to do a bad job at MNIST, because it sees only a little bit of information. It’s trained to convergence, but its input is just a few pixels of an image.

Lucas: Then, so the pixels that are then revealed by the debaters, each pixel is sort of the argument.

Geoffrey: Each pixel is the argument.

Lucas: Just to provide a little bit more framing: there’s this feature of debate where you can apply it to a very large domain of things, which is surprising if you expand the notion of what it means to debate to include showing pixels or something like this.

Geoffrey: It’s actually more important to debate in natural language. The end goal here is we want to extract a strengthened, kind of improved version of human performance at a task. The way we go about this, either in amplification or in debate, is we sort of factor through reasoning. Instead of trying to train directly on the task, where you might have some questions and some answers and you could train directly on question/answer pairs, we’re going to build a task which includes all possible human reasoning, in the form of, say, in this case, debates. Then we’ll train the agents to do well in this space of reasoning, and we’ll pick out the answers at the very end, once we’re satisfied that the reasoning all works out.

Because the way we humans talk about higher-level concepts, especially abstract concepts and, say, subtle moral concepts, is in natural language, the most important domain here, in the human case, is natural language. What we’ve done so far, in all the experiments for Debate, is in image space, because it’s easier. We’re trying now to move that work into natural language so that we can get more interesting settings.

Lucas: Right. In terms of natural language, do you just want to unpack a little bit about how that would be done at this point in natural language? It seems like our natural language technology is not at a point where I really see robust natural language debates.

Geoffrey: There’s sort of two ways to go. One way is human debates: you just replace the ML agents with human debaters and a human judge, and you see whether the system works in an all-human context. The other way is that machine learning on natural language is getting good enough to do interestingly well on simple question/answer datasets, and Debate is already interesting if you do a very small number of steps. In a general debate, you sort of imagine that you have this long transcript, dozens of statements long, with points and counterpoints and counterpoints, but if you do just two steps, you might do question, answer, and then a single counterargument. For some tasks, at least in theory, it already should be stronger than the baseline of just doing direct question/answer, because you have this ability to focus in on a counterargument that is important.

An example might be you see a question and an answer and then another debater just says, “Which part of the answer is problematic?” They might point to a word or to a small phrase, and say, “This is the point you should sort of focus in on.” If you learn how to self critique, then you can boost the performance by iterating once you know how to self critique.

The hope is that even if we can’t do general debates on the machine learning side just yet, we can do shallow debates, or some sort of simple first step in this direction, and then work up over time.

Lucas: This just seems to be a very fundamental part of AI alignment where you’re just breaking things down into very simple problems and then trying to succeed in those simple cases.

Geoffrey: That’s right.

Lucas: Just provide a little bit more illustration of debate as a general concept, and what it means in the context of AI alignment. I mean, there are open questions here, obviously, about the efficacy of debate, and how debate exists as a tool within the space of epistemological tools that allow us to arrive at truth and, I guess, infer other people’s preferences. So, again, in terms of reward learning, and AI alignment, and debate’s place in all of this, just contextualize, I guess, its role in AI alignment more broadly.

Geoffrey: It’s focusing, again, on the scalability aspect. One way to formulate that is we have this sort of notion of, either from a philosophy side, reflective equilibrium, or kind of from the AI alignment literature, coherent extrapolated volition, which is sort of what a human would do if we had thought very carefully for a very long time about a question, and sort of considered all the possible nuances, and counterarguments, and so on, and kind of reached the conclusion that is sort of free of inconsistencies.

Then, we’d like to take this kind of vague notion of, what happens when a human thinks for a very long time, and compress it into something we can use as an algorithm in a machine learning context. It’s also a definition. This vague notion of, let a human think for a very long time, that’s sort of a definition, but it’s kind of a strange one. A single human can’t think for a super long time. We don’t have access to that at all. You sort of need a definition that is more factored, where either a bunch of humans think for a long time, we sort of break up tasks, or you sort of consider only parts of the argument space at a time, or something.

You go from there to things that are both definitions of what it means to simulate thinking for a long time and also algorithms. The first one of these is Amplification from Paul Christiano, and there you have some questions, and you can’t answer them directly, but you know how to break up a question into subquestions that are hopefully somewhat simpler, and then you sort of recursively answer those subquestions, possibly breaking them down further. You get this big tree of all possible questions that descend from your outer question. You just sort of imagine that you’re simulating over that whole tree, and you come up with an answer, and then that’s the final answer for your question.
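
A schematic sketch of the recursion Geoffrey describes for Amplification, with hypothetical `human` and `model` interfaces (not Paul Christiano’s actual implementation): the human decomposes a question into subquestions, each subquestion is answered recursively, and the sub-answers are combined into an answer to the original question.

```python
def amplify(question, human, model, depth):
    """Idealized amplification tree, as described above (hypothetical interfaces).
    At depth 0, fall back to the model's direct answer; otherwise the human
    decomposes the question, the subquestions are answered recursively, and
    the human combines the sub-answers."""
    if depth == 0:
        return model.answer(question)
    subquestions = human.decompose(question)
    if not subquestions:                      # simple enough to answer directly
        return human.answer(question)
    sub_answers = [amplify(q, human, model, depth - 1) for q in subquestions]
    return human.combine(question, subquestions, sub_answers)
```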

Similarly, Debate is a variant of that, in the sense that you have this kind of tree of all possible arguments, and you’re going to try to simulate somehow what would happen if you considered all possible arguments, and picked out the most important ones, and summarized that into an answer for your question.

The broad goal here is to give a practical definition of what it means to take human input and push it to its conclusion, and then hopefully we have a definition that also works as an algorithm, where we can do practical ML training, to train machine learning models.

Lucas: Right, so there’s, I guess, two thoughts that I sort of have here. The first one is that there is just sort of this fundamental question of what is AI alignment? It seems like in your writing, and in the writing of others at OpenAI, it’s to get AI to do what we want them to do. What we want them to do is … either it’s what we want them to do right now, or what we would want to do under reflective equilibrium, or at least we want to sort of get to reflective equilibrium. As you said, it seems like a way of doing that is compressing human thinking, or doing it much faster somehow.

Geoffrey: One way to say it is we want to do what humans want, even if we understood all of the consequences. It’s some kind of “do what humans want,” plus some side condition of “imagine if we knew everything we needed to know to evaluate the question.”

Lucas: How does Debate scale to that level of compressing-

Geoffrey: One thing we should say is that everything here is sort of a limiting state or a goal, but not something we’re going to reach. It’s more important that we have closure under the relevant things we might not have thought about. Here are some practical examples from nearer-term misalignment. There’s an experiment in social science where they sent out a bunch of resumes in response to classified job ads, and the resumes were paired off into pairs that were identical except that the name of the person was either white sounding or black sounding. The result was that you got significantly higher callback rates if the person sounded white, even though they had an entirely identical resume to the person sounding black.

Here’s a situation where direct human judgment is bad in a way that we can clearly see. You could imagine trying to push that into the task by having an agent say, “Okay, here is a resume. We’d like you to judge it,” either pointing explicitly to what they should judge, or pointing out, “You might be biased here. Try to ignore the name on the resume, and focus on this issue, like say their education or their experience.” You sort of hope that if you have a mechanism for surfacing concerns or surfacing counterarguments, you can get to a stronger version of human decision making. There’s no need to wait for some long-term, very strong agent case for this to be relevant, because we’re already pretty bad at making decisions in simple ways.

Then, broadly, I sort of have this sense that there’s not going to be magic in decision making. If I go to some very smart person, and they have a better idea for how to make a decision, or how to answer a question, I expect there to be some way they could explain their reasoning to me. I don’t expect I just have to take them on faith. We want to build methods that surface the reasons they might have to come to a conclusion.

Now, it may be very difficult for them to explain the process for how they came to those arguments. There’s some question about whether the arguments they’re going to make is the same as the reasons they’re giving the answers. Maybe they’re sort of rationalizing and so on. You’d hope that once you sort of surface all the arguments around the question that could be relevant, you get a better answer than if you just ask people directly.

Lucas: As we move out of debate in simple cases of image classifiers or experiments in similar environments, what does debate look like … I don’t really understand the ways in which the algorithms can be trained to elucidate all of these counterconcerns, and all of these different arguments, in order to help human beings arrive at the truth.

Geoffrey: One case we’re considering, especially on the human experiment side, doing debates with humans, is some sort of domain expert debate. The two debaters are experts in some field, and they have a bunch of knowledge which is not accessible to the judge, who is maybe a reasonably competent human but doesn’t know the details of that domain. For example, we did a debate where there were two people who knew computer science and quantum computing, debating a question about quantum computing in front of a person who has some background, but nothing in that field.

The idea is you start out, there’s a question. Here, the question was, “Is the complexity class BQP equal to NP, or does it contain NP?” One point is that you don’t have to know what those terms mean for that to be a question you might want to answer, say in the course of some other goal. The first steps, things the debaters might say, is they might give short, intuitive definitions for these concepts and make their claims about what the answer is. You might say, “NP is the class of problems where we can verify solutions once we’ve found them, and BQP is the class of things that can run on a quantum computer.”

Now, you could have a debater that just straight up lies right away and says, “Well, actually NP is the class of things that can run on fast randomized computers.” That’s just wrong, and so what would happen then is that the counter debater would just immediately point to Wikipedia and say, “Well, that isn’t the definition of this class.” The judge can look that up, they can read the definition, and realize that one of the debaters has lied, and the debate is over.

You can’t immediately lie in kind of a simple way or you’ll be caught out too fast and lose the game. You have to sort of tell the truth, except maybe you kind of slightly veer towards lying. This is if you want to lie in your argument. At every step, if you’re an honest debater, you can try to pin the liar down to making sort of concrete statements. In this case, if say someone claims that quantum computers can solve all of NP, you might say, “Well, you must point me to an algorithm that does that.” The debater that’s trying to lie and say that quantum computers can solve all of NP might say, “Well, I don’t know what the algorithm is, but meh, maybe there’s an algorithm,” and then they’re probably going to lose, then.

Maybe they have to point to a specific algorithm. There is no algorithm, so they have to make one up. That will be a lie, but maybe it’s kind of a subtle complicated lie. Then, you could kind of dig into the details of that, and maybe you can reduce the fact that that algorithm is a lie to some kind of simple algebra, which either the human can check, maybe they can ask Mathematica or something. The idea is you take a complicated question that’s maybe very broad and covers a lot of the knowledge that the judge doesn’t know and you try to focus in closer and closer on details of arguments that the judge can check.

What the judge needs to be able to do is kind of follow along in the steps until they reach the end, and then there’s some ground fact that they can just look up or check and see who wins.

Lucas: I see. Yeah, that’s interesting. A brief passing thought is thinking about double cruxes and some tools and methods that CFAR employs, like how they might be interesting or used in debate. I think I also want to provide some more clarification here. Beyond debate being a truth-seeking process or a method by which we’re able to see which agent is being truthful, or which agent is lying, and again, there’s sort of this claim that you have in your paper that seems central to this, where you say, “In the debate game, it is harder to lie than to refute a lie.” This asymmetry in debate between the liar and the truth-seeker should hopefully, in general, bias towards people more easily seeing who is telling the truth.

Geoffrey: Yep.

Lucas: In terms of AI alignment again, in the examples that you’ve provided, it seems to help human beings arrive at truth for complex questions that are above their current level of understanding. How does this, again, relate directly to reward learning or value learning?

Geoffrey: Let’s assume that in this debate game it is the case that it’s very hard to lie, so the winning move is to tell the truth. What we want to do then is train two systems. One system will be able to reproduce human judgment: that system would be able to look at the debate transcript and predict what the human would say is the correct winner of the debate. Once you get that system trained, you’re learning not a direct reward but, again, some notion of predicting how humans judge reasoning. Once you learn that bit, then you can train an agent to play this game.

Then we have a zero-sum game, and we can apply any technique used to play a zero-sum game, like Monte Carlo tree search in AlphaGo, or just straight-up RL algorithms, as in some of OpenAI’s work. The hope is that you can train an agent to play this game very well, and therefore it will be able to predict where counterarguments exist that would help it win debates. And if it plays the game well, and the best way to play the game is to tell the truth, then you end up with a value aligned system. Those are large assumptions; you should be cautious about whether they are true.
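
Putting the two training steps Geoffrey mentions into one schematic loop, with hypothetical interfaces throughout (a sketch of the idea, not OpenAI’s training code): fit a judge model to human verdicts on debate transcripts, and train the debater by self-play on the zero-sum game, using the learned judge as the reward signal.

```python
import random

def train_debate_system(questions, human_judge, judge_model, debater, n_iters=1000):
    """Hypothetical training loop for Debate.
    human_judge(q, answers, transcript) -> index of the winning debater (0 or 1).
    judge_model: supervised model with .update(...) and .predict(...).
    debater: self-play policy with .self_play(q) and .reinforce(q, transcript, reward)."""
    for _ in range(n_iters):
        q = random.choice(questions)
        answers, transcript = debater.self_play(q)           # both sides played by the same policy
        verdict = human_judge(q, answers, transcript)         # human label for this transcript
        judge_model.update(q, answers, transcript, verdict)   # supervised step: predict human judgment
        predicted_winner = judge_model.predict(q, answers, transcript)
        reward = 1.0 if predicted_winner == 0 else -1.0       # zero-sum reward for player 0
        debater.reinforce(q, transcript, reward)              # RL step, e.g. policy gradient
    return debater, judge_model
```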

Lucas: There’s also all these issues that we can get into about biases that humans have, and issues with debate. Whether or not you’re just going to be optimizing the agents for exploiting human biases and convincing humans. Definitely seems like, even just looking at how human beings value align to each other, debate is one thing in a large toolbox of things, and in AI alignment, it seems like potentially Debate will also be a thing in a large toolbox of things that we use. I’m not sure what your thoughts are about that.

Geoffrey: I could give them. I would say that there’s two ways of approaching AI safety and AI alignment. One way is to try to propose, say, methods that do a reasonably good job at solving a specific problem. For example, you might tackle reversibility, which means don’t take actions that can’t be undone, unless you need to. You could try to pick that problem out and solve it, and then imagine how we’re going to fit this together into a whole picture later.

The other way to do it is try to propose algorithms which have at least some potential to solve the whole problem. Usually, they won’t, and then you should use them as a frame to try to think about how different pieces might be necessary to add on.

For example, in debate, the biggest thing in there is that it might be the case that you train a debate agent that gets very good at this task, the task is rich enough that it just learns a whole bunch of things about the world, and about how to think about the world, and maybe it ends up having separate goals, or it’s certainly not clearly aligned because the goal is to win the game. Maybe winning the game is not exactly aligned.

You’d like to know not only what it’s saying, but why it’s saying things. You could imagine adding interpretability techniques to this, which would say, maybe Alice and Bob are debating. Alice says something and Bob says, “Well, Alice only said that because Alice is thinking some malicious fact.” If we had solid interpretability techniques, we could point into Alice’s thoughts at that fact, pull it out, and surface it. Then, you could imagine a strengthened version of debate where you could not only argue about object level things, using language, but about the thoughts of the other agent, and talk about motivation.

The goal in formulating something like debate or amplification is to propose a complete algorithm that would solve the whole problem. Often you don’t get to that point, but you now have a frame where you can think about the whole picture in the context of this algorithm, and then fix it as required going forward.

I think, in the end, I do view debate, if it succeeds, as potentially the top level frame, which doesn’t mean it’s the most important thing. It’s not a question of importance. More of just what is the underlying ground task that we want to solve? If we’re training agents to either play video games or do question/answers, here the proposal is train agents to engage in these debates and then figure out what parts of AI safety and AI alignment that doesn’t solve and add those on in that frame.

Lucas: You’re trying to achieve human level judgment, ultimately, through a judge?

Geoffrey: The assumption in this debate game is that it’s easier to be a judge than a debater. If it is the case, though, that you need the judge to get to human level before you can train a debater, then you have a problematic bootstrapping issue where, first you must solve value alignment for training the judge. Only then do you have value alignment for training the debater. This is one of the concerns I have. I think the concern sort of applies to some of other scalability techniques. I would say this is sort of unresolved. The hope would be that it’s not actually sort of human level difficult to be a judge on a lot of tasks. It’s sort of easier to check consistency of, say, one debate statement to the next, than it is to do long, reasoning processes. There’s a concern there, which I think is pretty important, and I think we don’t quite know how it plays out.

Lucas: The view is that we can assume, or take the human being to be the thing that is already value aligned, and the process by which … and it’s important, I think, to highlight the second part of what you say. You say that you’re pointing out considerations, or whichever debater is saying that which is most true and useful. The useful part, I think, shouldn’t be glossed over, because you’re not just optimizing debaters to arrive at true statements. The useful part smuggles in a lot of issues with normative things in ethics and metaethics.

Geoffrey: Let’s talk about the useful part.

Lucas: Sure.

Geoffrey: Say we just ask the question of debaters, “What should we do? What’s the next step that I, as an individual person, or my company, or the whole world should take in order to optimize total utility?” The notion of useful, then, is just what is the right action to take? Then, you would expect a debate that is good to have to get into the details of why actions are good, and so that debate would be about ethics, and metaethics, and strategy, and so on. It would pull in all of that content and sort of have to discuss it.

There’s a large sea of content you have to pull in. It’s roughly kind of all of human knowledge.

Lucas: Right, right, but isn’t there this gap between training agents to say what is good and useful and for agents to do what is good and useful, or true and useful?

Geoffrey: The way in which there’s a gap is this interpretability concern. You’re getting at a different gap, which I think is actually not there. I like giving game analogies, so let me give a Go analogy. You could imagine that there’s two goals in playing the game of Go. One goal is to find the best moves. This is a collaborative process where all of humanity, all of sort of Go humanity, say, collaborates to learn, and explore, and work together to find the best moves in Go, defined by, what are the moves that most win this game? That’s a non-zero sum game, where we’re sort of all working together. Two people competing on the other side of the Go board are working together to get at what the best moves are, but within a game, it’s a zero sum game.

You sit down, and you have two players, two people playing a game of Go, one of them’s going to win, zero sum. The fact that that game is zero sum doesn’t mean that we’re not learning some broad thing about the world, if you’ll zoom out a bit and look at the whole process.

We’re training agents to win this debate game to give the best arguments, but the thing we want to zoom out and get is the best answers. The best answers that are consistent with all the reasoning that we can bring into this task. There’s huge questions to be answered about whether the system actually works. I think there’s an intuitive notion of, say, reflective equilibrium, or coherent extrapolated volition, and whether debate achieves that is a complicated question that’s empirical, and theoretical, and we have to deal with, but I don’t think there’s quite the gap you’re getting at, but I may not have quite voiced your thoughts correctly.

Lucas: It would be helpful if you could unpack how the alignment that is gained through this process is transferred to new contexts. If I take an agent trained to win the Debate game outside of that context.

Geoffrey: You don’t. We don’t take it out of the context.

Lucas: Okay, so maybe that’s why I’m getting confused.

Geoffrey: Ah. I see. Okay, this [inaudible 00:26:09]. We train agents to play this debate game. To use them, we also have them play the debate game. At training time, we give them a rich space of questions to think about, or concerns to answer, like a lot of discussion. Then, we want to go and answer a question in the world about what we should do, what the answer to some scientific question is, is this theorem true, or this conjecture true? We state that as a question, we have them debate, and then whoever wins gave the right answer.

There’s a couple of important things you can add to that. I’ll give like three levels of detail you can go into. One thing is the agents are trained to look at a state in the debate game, which could be I’ve just given the question, or there’s a question and there’s a partial transcript, and they’re trained to say the next thing, to make the next move in the game. The first thing you can do is you have a question that you want to answer, say, what should the world do, or what should I do as a person? You just say, “Well, what’s the first move you’d make?” The first move they’d make is to give an answer, and then you just stop there, and you’re done, and you just trust that answer is correct. That’s not the strongest thing you could do.

The next thing you can do is you’ve trained this model of a judge that knows how to predict human judgment. You could have them, from the start of this game, play a whole bunch of games, play 1,000 games of debate, and from that learn with more accuracy what the answer might be. Similar to how you’d, say if you’re playing a game of Go, if you want to know the best move, you would say, “Well, let’s play 1,000 games of Go from this state. We’ll get more evidence and we’ll know what the best move is.”

The most interesting thing you can do, though, is you yourself can act as a judge in this game to sort of learn more about what the relevant issues are. Say there’s a question that you care a lot about. Hopefully, “What should the world do,” is a question you care a lot about. You want to not only see what the answer is, but why. You could act as a judge in this game, and you could, say, play a few debates, or explore part of this debate tree, the tree of all possible debates, and you could do the judgment yourself. There, the end answer will still be who you believe is the right answer, but the task of getting to that answer is still playing this game.

The bottom line here is, at test time, we are also going to debate.
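
A small sketch of the first two test-time modes described above might look like the following, assuming a stochastic trained debater exposed as a hypothetical next_move(transcript) function and a learned judge predict_winner(transcript) that returns 0 when the opening answer holds up; both names are assumptions, not part of any existing system.

```python
import collections

def quick_answer(next_move, question):
    """Mode 1: trust the debater's first move (its proposed answer) directly."""
    return next_move([question])

def rollout_answer(next_move, predict_winner, question, n_games=1000, depth=4):
    """Mode 2: play many full debates from the question, analogous to playing
    out many games of Go from a position, and tally which opening answers win."""
    wins = collections.Counter()
    for _ in range(n_games):
        transcript = [question, next_move([question])]   # opening answer
        for _ in range(depth - 1):
            transcript.append(next_move(transcript))      # alternating rebuttals
        if predict_winner(transcript) == 0:
            wins[transcript[1]] += 1
    return wins.most_common(1)[0][0] if wins else None
```

The third mode, with a human stepping in as judge, would replace predict_winner with live human judgments over whichever parts of the debate tree the human chooses to explore.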

Lucas: Yeah, right. Human beings are going to be participating in this debate process, but does or does not debate translate into systems which are autonomously deciding what we ought to do, given that we assume that their models of human judgment on debate are at human level or above?

Geoffrey: Yeah, so if you turn off the human in the loop part, then you get an autonomous agent. If the question is, “What should the next action be in, say, an environment?” And you don’t have humans in the loop at test time, then you can get an autonomous agent. You just sort of repeatedly simulate debating the question of what to do next. Again, you can cut this process short. Because the agents are trained to predict moves in debate, you can stop them after they’ve predicted the first move, which is what the answer is, and then just take that answer directly.

If you wanted the maximally efficient autonomous agent, that’s what you would do. My view of our goal at OpenAI is that I don’t want to take AGI and immediately deploy it in the most fast-twitch tasks, something like self-driving a car. If we get to human level intelligence, I’m not going to just replace all the self-driving cars with AGI and let them do their thing. We want to use this for tasks where we need very strong capabilities. Ideally, those tasks are slower and more deliberative, so we can afford to, say, take a minute to interact with the system, or take a minute to have the system engage in its own internal debates to get more confidence in its answers.

The model here is basically the Oracle AI model, rather than an autonomous agent operating in an MDP.

Lucas: I think that this is a very important part to unpack a bit more. This distinction here that it’s more like an oracle and less like an autonomous agent going around optimizing everything. What does a world look like right before, during, after AGI given debate?

Geoffrey: The way I think about this is that, an oracle here is a question/answer system of some complexity. You asked it questions, possibly with a bunch of context attached, and it gives you answers. You can reduce pretty much anything to an oracle, if oracle is sort of general enough. If your goal is to take actions in an environment, you can ask the oracle, “What’s the best action to take, and the next step?” And just iteratively ask that oracle over and over again as you take the steps.

Lucas: Or you could generate the debate, right? Over the future steps?

Geoffrey: The most direct way to do an MDP with Debate is to engage in a debate at every step, restart the debate process, showing all the history that’s happened so far, and say the question at hand, that we’re debating, is what’s the best action to take next? I think I’m relatively optimistic that when we make AGI, for a while after we make it, we will be using it in ways that aren’t extremely fine-grained MDP-like, in the sense of we’re going to take a million actions in a row, and they’re all actions that hit the environment.
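
As a rough sketch of that full direct reduction, and only a sketch under the assumption that a trained debate system is available as a hypothetical debate(question) oracle returning the winning answer, the agent loop just re-debates “what is the best next action?” at every step:

```python
from typing import Callable, List

def run_as_agent(debate: Callable[[str], str],
                 env_step: Callable[[str], str],
                 horizon: int) -> List[str]:
    """Wrap a debate oracle in an agent loop: restart a fresh debate each step,
    conditioned on the full history so far."""
    history: List[str] = []
    for _ in range(horizon):
        question = ("Given the history so far: " + " | ".join(history)
                    + " -- what is the best next action?")
        action = debate(question)        # a fresh debate decides the next action
        observation = env_step(action)   # take the action in the environment
        history.append(f"action={action}; observation={observation}")
    return history
```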

We’d mainly use this full direct reduction. There’s more practical reductions for other questions. I’ll give an example. Say you want to write the best book on, say, metaethics, and you’d like debaters to produce this book. Let’s say that debaters are optimal agents, so they know how to do debates on any subject. Even if the book is 1,000 pages long, or say it’s a couple hundred pages long, that’s a more reasonable book, you could do it in a single debate as follows. Ask the agents to write the book. Each agent writes its own book, say, and you ask them to debate which book is better, and that debate only needs to point at small parts of the books.

One of the debaters writes a 300 page book and buried in the middle of it is a subtle argument, which is malicious and wrong. The other debater need only point directly at the small part of the book that’s problematic and say, “Well, this book is terrible because of the following malicious argument, and my book is clearly better.” The way this works is, if you are able to point to problematic parts of books in a debate, and therefore win, the best first move in the debate is to write the best book, so you can do it in one step, where you produce this large object with a single debate, or a single debate game.

The reason I mention this is that’s a little better in terms of practicality than writing the book word by word. If the book is like 100,000 words, you wouldn’t want to have a debate about each word, one after another. That’s a silly, very expensive process.
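
The book example can be boiled down to a one-line countermove: the opponent’s best response is to point at the single passage a judge would find most problematic, so the best opening move is to write a book with no such passage. This is a toy illustration, with flaw_score as a hypothetical stand-in for the judge’s reaction to a quoted passage.

```python
def best_countermove(book_paragraphs, flaw_score):
    """Point at the paragraph the judge would find most problematic."""
    return max(range(len(book_paragraphs)),
               key=lambda i: flaw_score(book_paragraphs[i]))
```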

Lucas: Right, so just to back up here and provide a little bit more framing: at the beginning, at a very low level, we’re just trying to optimize agents for debate, and there’s going to be an asymmetry here that we predict, that it should, in general, be easier to tell who’s telling the truth than who’s not, because it’s easier to tell the truth than to lie in convincing ways. Scaling from there, it seems that what we ultimately want is to then be able to train a judge, right?

Geoffrey: The goal is to train … You need both.

Lucas: Right. You need both to scale up together.

Geoffrey: Yep.

Lucas: Through doing so, we will have oracles that will be able to assist in decision making, because what we’re doing is amplifying human explicit deliberative processes. Again, this ties into coherent extrapolated volition and reflective equilibrium. What we want is to be able to amplify human decision making, which I guess is, in a broad sense, value alignment.

Geoffrey: Right.

Lucas: Are there any other technical motivations behind debate that you would like to unpack for why within the set of all these different AI alignment methodologies, why debate is so important?

Geoffrey: Sure. I think maybe the right way to talk about that is to contrast amplification and debate. I got to Debate by thinking a bunch about Amplification, talking a bunch with Paul Christiano. To summarize it for listeners again, Amplification is take questions, decompose them into subquestions that are hopefully easier to answer, repeat that decomposition until we get to things we can check, and use that whole process to answer complicated questions.

In stock amplification, the human is doing the task of decomposing the questions. The human decomposition has to be complete. For example, you might have a question, going back to this resume example, where we’re having this debate about whether someone whose name sounds white or black is a good candidate. For whatever reason, no one here wants to be racist, but they’re all just completely oblivious to the idea they could be racist. They’re going through and discussing whether the candidate is suitable, they’re kind of breaking these questions down. Just subtly, at every point, racism is inching its way into the answers, and they just never thought to investigate.

If at some point they had asked, “Well, am I deciding this because the person has a white sounding name?” They would have realized their mistake, but they never think of that. In debate, you are relying on the agents to be able to do the decomposition, so the agents can, even if it’s hard for a human to do it, point out, “Maybe you’re being racist here,” and get the correction. The advantage of debate there is you get some help doing this decomposition.

The other thing that happened, frankly, which is one of the reasons I thought of debate, was AlphaGo. In thinking about amplification, I’d been somewhat concerned: “Is this process going to be scalable? Are we going to lose a bunch of efficiency in doing this complicated decomposition process?” I was concerned that we would lose a bunch of efficiency and therefore not be competitive with unsafe techniques for getting to AGI.

Then, AlphaGo came out, and AlphaGo got very strong performance, and it did it by doing an explicit tree search. As part of AlphaGo, it’s doing this kind of deliberative process, and that was not only important for performance at test time, but was very important for getting the training to work. What happens is, in AlphaGo, at training time, it’s doing a bunch of tree search through the game of Go in order to improve the training signal, and then it’s training on that improved signal. That was one thing kind of sitting in the back of my mind.

I was kind of thinking through, then, the following way of thinking about alignment. At the beginning, we’re just training on direct answers. We have these questions we want to answer, an agent answers the questions, and we judge whether the answers are good. You sort of need some extra piece there, because maybe it’s hard to understand the answers. Then, you imagine training an explanation module that tries to explain the answers in a way that humans can understand. Then, those explanations might be kind of hard to understand, too, so maybe you need an explanation explanation module.

For a long time, it felt like that was just sort of ridiculous epicycles, adding more and more complexity. There was no clear end to that process, and it felt like it was going to be very inefficient. When AlphaGo came out, that kind of snapped into focus, and it was like, “Oh. If I train the explanation module to find flaws, and I train the explanation explanation module to find flaws in flaws, then that becomes a zero-sum game. If it turns out that ML is very good at solving zero-sum games, and zero-sum games are a powerful route to strong performance, then we should take advantage of this in safety.” Poof. That answer, explanation, explanation-of-explanation route gives you the zero-sum game of Debate.

That’s roughly sort of how I got there. It was a combination of thinking about Amplification and this kick from AlphaGo, that zero-sum games and search are powerful.

Lucas: In terms of the relationship between debate and amplification, can you provide a bit more clarification on the differences, fundamentally, between the process of debate and amplification? In amplification, there’s a decomposition process, breaking problems down into subproblems, eventually trying to get the broken down problems to human level. The problem has essentially multiplied itself many times over at this point, right? It seems like there’s going to be a lot of questions for human beings to answer. I don’t know how related debate is to this decompositional argumentative process.

Geoffrey: They’re very similar. Both Amplification and Debate operate on some large tree. In amplification, it’s the tree of all decomposed questions. Let’s be concrete and say the top level question in amplification is, “What should we do?” In debate, again, the question at the top level is, “What should we do?” In amplification, we take this question. It’s a very broad open-ended question, and we kind of break it down more and more and more. You sort of imagine this expanded tree coming out from that question. Humans are constructing this tree, but of course, the tree is exponentially large, so we can only ever talk about a small part of it. Our hope is that the agents learn to generalize across the tree, so they’re learning the whole structure of the tree, even given finite data.

In the debate case, similarly, you have a top level question of, “What should we do,” or some other question, and you have the tree of all possible debates. Imagine every move in this game is, say, saying a sentence, and at every point you have maybe an exponentially large number of possible sentences, so the branching factor in the tree is very large. The goal in debate is kind of to see this whole tree.

Now, here is the correspondence. In amplification, the human does the decomposition, but I could instead have another agent do the decomposition. I could say I have a question, and instead of a human saying, “Well, this question breaks down into subquestions X, Y, and Z,” I could have a debater saying, “The subquestion that is most likely to falsify this answer is Y.” It could’ve picked at any other question, but it picked Y. You could imagine that if you replace a human doing the decomposition with another agent in debate pointing at the flaws in the arguments, debate would kind of pick out a path through this tree. A single debate transcript, in some sense, corresponds to a single path through the tree of amplification.

Lucas: Does the single path through the tree of amplification elucidate the truth?

Geoffrey: Yes. The reason it does is it’s not an arbitrarily chosen path. We’re sort of choosing the path that is the most problematic for the arguments.

Lucas: In this exponential tree search, there’s heuristics and things which are being applied in general to the tree search in order to collapse onto this one branch or series?

Geoffrey: Let’s say, in amplification, we have a question. Our decomposition is, “Well, this decomposes into X, Y, and Z,” and then we recursively call the agent, and it says, “The answers are AX, AY, AZ, for these questions.” Now, if I trusted those subanswers, I could do the reconstruction of the answer to the original question. If I don’t trust the subanswers, I might say, “Well, which subanswer is most likely to be false?” The correspondence with debate is the other debating agent would point to which one of the subanswers is probably false, and then you recurse down just that branch.

Geoffrey: Here’s the way it works. You can trust amplification if all of the subanswers are correct. If at least one of them is false, a strong debater can find which of those subanswers is false and recurse down to that. That’s why, if it turns out that a single path chosen by strong debaters never finds a false subanswer, then you will have established that the whole tree is true.
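
That correspondence can be sketched as a short recursion, assuming a hypothetical node representation (claim, ground_fact, children) and a pick_most_suspicious function standing in for a strong opposing debater. A single debate transcript walks one root-to-leaf path of the amplification tree; if no false subanswer can be surfaced along that path, the rest of the exponentially large tree is trusted without ever being visited.

```python
def verify(node, pick_most_suspicious):
    """node = (claim, ground_fact, children). When there are no children, the
    judge checks the ground fact directly; otherwise the opposing debater picks
    the subanswer most likely to be false and we recurse down only that branch."""
    claim, ground_fact, children = node
    if not children:
        return bool(ground_fact)
    challenged = pick_most_suspicious(children)
    return verify(challenged, pick_most_suspicious)
```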

Lucas: I see. I think this picture of, “What should we do,” at the top, and then the trees would fall from that in terms of one version of amplification and one version of debate is pretty helpful here.

Geoffrey: The most important thing here is that in Amplification, the branching factor is small. I’m a human, I see a question, I sort of decompose that into a relatively small number of questions. Say, two to 10. In debate, the branching factor is very large. A debate agent could make any statement as the next countermove. What this means is that if it works for strong play, because the branching factor is so large, a shallow debate tree is more powerful than a shallow amplification tree, because you can cover more terrain with a couple steps of debate that are a proxy for a large tree.

The problem with that statement, maybe the biggest advantage to amplification, is you have to really be assuming that the agents are strong enough to cover the tree. Say I’m debating whether there’s a black swan, and there in fact is a black swan, but there’s only one of them in the entire world, and the correct countermove would be pointing at that black swan, finding it out of the entire world. A debater who wanted to claim, “There are no black swans,” could just say, “There are no black swans. Trust me. The other agent would be able to point to one if it existed.” Maybe it’s the case that a black swan does exist, but the other agent is just too weak to point at the black swan, and so that debate doesn’t work.

This argument that shallow debates are powerful leans a whole lot on debaters being very strong, and debaters in practice will not be infinitely strong, so there’s a bunch of subtlety there that we’re going to have to wrestle with.

Lucas: It would also be, I think, very helpful if you could let us know how you optimize for strong debaters, and how amplification is possible here if human beings are the ones who are doing the decompositions of the questions.

Geoffrey: Whichever one we choose, whether it’s amplification, debate, or some entirely different scheme, if it depends on humans in one of these elaborate ways, we need to do a bunch of work to know that humans are going to be able to do this. In amplification, you would expect to have to train people to think about what kinds of decompositions are the correct ones. My bias is that because debate gives the humans more help in pointing out the counterarguments, it may be cognitively kinder to the humans, and therefore, that could make it a better scheme. That’s one of the advantages of debate.

The technical analogy there is a shallow debate argument. The human side is, if someone is pointing out the arguments for you, it’s cognitively kind. In amplification, I would expect you’d need to train people a fair amount to have the decomposition be reliably complete. I don’t know that I have a lot of confidence that you can do that. One way you can try to do it is, as much as possible, systematize the process on the human side.

In either one of these schemes, we can give the people involved an arbitrary amount of training and instruction in whatever way we think is best, and we’d like to do the work to understand what forms of instruction and training are most truth seeking, and try to do that as early as possible so you have a head start.

I would say I’m not going to be able to give you a great argument for optimism about amplification. This is a discussion that Paul, and Andreas Stuhlmueller, and I have, where I think Paul and Andreas kind of lean towards these metareasoning arguments, where if you wanted to answer the question, “Where should I go on vacation,” the first subquestion is, “What would be a good way to decide where to go on vacation?” You quickly go meta, and maybe you go meta-meta, and it’s kind of a mess. Whereas the hope is that because in debate you sort of have help pointing to things, you can stay much more object level, where the first step in a debate about where to go on vacation is just Bali or Alaska. You give the answer and then you focus in on more …

For a broader class of questions, you can stay at object level reasoning. Now, if you want to get to metaethics, you would have to bring in that kind of reasoning. It should be a goal of ours, for a fixed task, to use the simplest kind of human reasoning possible, because then we should expect to get better results out of people.

Lucas: All right. Moving forward. Two things. The first that would be interesting would be if you could unpack this process of training up agents to be good debaters, and to be good predictors of human decision making regarding debates, what that’s actually going to look like in terms of your experiments, currently, and your future experiments. Then, also just pivoting into discussing reasons for optimism and pessimism about debate as a model for AI alignment.

Geoffrey: On the experiment side, as I mentioned, we’re trying to get into the natural language domain, because I think that’s how humans debate and reason. We’re doing a fair amount of work at OpenAI on core ML language modeling, so natural language processing, and then trying to take advantage of that to prototype these systems. At the moment, we’re just doing what I would call zero step debate, or one step debate. It’s just a single agent answering a question. You have question, answer, and then you have a human kind of judging whether the answer is good.

The task of predicting an answer is just: read a bunch of text and predict a number. That is essentially a standard NLP type task, and you can use standard methods from NLP on that problem. The hope is that because it looks so standard, the safety side can keep pace with development on the capability side of natural language processing. Predicting the result is just: use whatever the most powerful natural language processing architecture and method is, and apply it to this task.

Similarly, the task of answering questions is also a natural language task, just a generative one. If you’re answering questions, you read a bunch of text that is maybe the context of the question, and you produce an answer, and that answer is just a bunch of words that you spit out via a language model. If you’re doing, say, a two step debate, where you have question, answer, counterargument, then similarly, you have a language model that spits out an answer, and a language model that spits out the counterargument. Those can in fact be the same language model. You just flip the reward at some point. An agent is rewarded for answering well and winning while it’s spitting out the answer, and then when it’s spitting out the counterargument, you reward it for falsifying the answer. It’s still just a generative language task with a slightly exotic reward.
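
A minimal sketch of that reward flip, with policy_sample and judge as hypothetical stand-ins for a sampled language model and the trained judgment model, might look like this: the same policy generates both moves, and the scalar reward is simply negated for the refuting role.

```python
def debate_episode(policy_sample, judge, question):
    """One two-step debate: the same generative policy produces the answer and
    the counterargument, with the zero-sum reward flipped for the second move."""
    answer = policy_sample(question, role="answer")
    counter = policy_sample(question + " | answer: " + answer, role="counter")
    answer_wins = judge(question, answer, counter)   # True if the answer holds up
    answer_reward = 1.0 if answer_wins else -1.0
    counter_reward = -answer_reward                  # flip the sign: zero-sum
    return [(question, answer, answer_reward),
            ((question, answer), counter, counter_reward)]
```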

Going forwards, we expect there to need to be something like … This is not actually high confidence. Maybe there’s things like AlphaGo zero style tree search that are required to make this work very well on the generative side, and we will explore those as required. Right now, we need to falsify the statement that we can just do it with stock language modeling, which we’re working on. Does that cover the first part?

Lucas: I think that’s great in terms of the first part, and then again, the second part was just places to be optimistic and pessimistic here about debate.

Geoffrey: Optimism, I think we’ve covered a fair amount of it. The primary source of optimism is this argument that shallow debates are already powerful, because you can cover a lot of terrain in argument space with a short debate, because of the high branching factor. If there’s an answer that is robust to all possible counteranswers, then it hopefully is a fairly strong answer, and that gets stronger as you increase the number of steps. This assumes strong debaters. That would be a reason for pessimism, not optimism. I’ll get to that.

The top two reasons are that one, and then the other is that ML is pretty good at zero-sum games, particularly zero-sum perfect information games. There have been these very impressive headline results from AlphaGo at DeepMind, and Dota at OpenAI, and a variety of other games. In general, for zero-sum, close to perfect information games, we roughly know how to do them, at least in the not too high branching factor case. There’s an interesting thing where if you look at the algorithms, say for playing poker, which is zero-sum two player but imperfect information, or the algorithms for playing more than two player games, say 10-player games, they’re just much more complicated, and they don’t work as well.

I like the fact that debate is formulated as a two player zero-sum perfect information game, because we seem to have better algorithms to play those with ML. This is both practically true, it is in practice easier to play them, and also there’s a bunch of theory that says that two player zero-sum is a different complexity class than, say, two player non-zero-sum, or N player. The complexity class gets harder, and you need nastier algorithms. Finding a Nash equilibrium in a general game, one that’s either non-zero-sum or has more than two players, is PPAD-complete even in the tabular case; in a small two player zero-sum game, that problem is convex and has a polynomial-time solution. It’s a nicer class. I expect there to continue to be better algorithms to play those games. I like formulating safety as that kind of problem.
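
To make the “convex, polynomial-time” point concrete, here is a small, self-contained example, not tied to debate itself: the row player’s optimal mixed strategy in a two-player zero-sum matrix game can be found with a single linear program. The matching-pennies payoff matrix is just an illustration.

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])            # row player's payoffs in a zero-sum game

m, n = A.shape
# Variables: x_1..x_m (mixed strategy) and v (game value); maximize v subject to
# v <= sum_i x_i * A[i][j] for every column j, with x a probability vector.
c = np.zeros(m + 1)
c[-1] = -1.0                           # linprog minimizes, so minimize -v
A_ub = np.hstack([-A.T, np.ones((n, 1))])
b_ub = np.zeros(n)
A_eq = np.array([[1.0] * m + [0.0]])   # probabilities sum to one
b_eq = np.array([1.0])
bounds = [(0, None)] * m + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print("mixed strategy:", res.x[:m], "game value:", res.x[-1])
# -> roughly [0.5, 0.5] with value 0, the known equilibrium of matching pennies.
```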

Those are kind of the reasons for optimism that I think are most important. Going into more of those is less important and less interesting than worrying about stuff, so I’ll list three of those, or maybe four, and try to be fast so we can circle back. As I mentioned, I think interpretability has a large role to play here. I would like to be able to have an agent say … Again, Alice and Bob are debating. Bob should be able to just point directly into Alice’s thoughts and say, “She really thought X even though she said Y.” The reason you need an interpretability technique for that is, in this conversation, I could just claim that you, Lucas Perry, are having some malicious thought, but that’s not a falsifiable statement, so I can’t use it in a debate. I could always make that statement, unless I can point into your thoughts.

Because we have so much control over machine learning, we have the potential ability to do that, and we can take advantage of it. I think that, for that to work, we need probably a deep hybrid between the two schemes, because an advanced agent’s thoughts will probably be advanced, and so you may need some kind of strengthened thing like amplification or debate just to be able to describe the thoughts, or to point at them in a meaningful way. That’s a problem that we have not really solved. Interpretability is coming along, but it’s definitely not hybridized with these fancy alignment schemes, and we need to solve that at some point.

Another problem is there’s no point in this kind of natural language debate where I can just say, for example, “You know, it’s going to rain tomorrow, and it’s going to rain tomorrow just because I’ve looked at all the weather in the past, and it just feels like it’s going to rain tomorrow.” Somehow, debate is missing this just straight up pattern matching ability of machine learning where I can just read a dataset and just summarize it very quickly. The theoretical side of this is if I have a debate about, even something as simple as, “What’s the average height of a person in the world?” In the debate method I’ve described so far, that debate has to have depth, at least logarithmic in the number of people. I just have to subdivide by population. Like, this half of the world, and then this half of that half of the world, and so on.

I can’t just say, “You know, on average it’s like 1.6 meters.” We need to have better methods for hybridizing debate with pattern matching and statistical intuition, and that’s something that is, if we don’t have that, we may not be competitive with other forms of ML.

Lucas: Why is that not just an intrinsic part of debate? Why is debating over these kinds of things different than any other kind of natural language debate?

Geoffrey: It is the same. The problem is just that for some types of questions, and there are other forms of this in natural language, there aren’t short deterministic arguments. There are many questions where the shortest deterministic argument is much longer than the shortest randomized argument. For example, if you allow randomization, I can say, “I claim the average height of a person is 1.6 meters.” Well, pick a person at random, and you’ll score me according to the square difference between those two numbers. My claim and the height of this particular person you’ve chosen. The optimal move to make there is to just say the average height right away.

The thing I just described is a debate using randomized steps that is extremely shallow. It’s only basically two steps long. If I want to do a deterministic debate, I have to deterministically argue that the average height of a person in North America is X, and in Asia, it’s Y. The other debater could say, “I disagree about North America,” and you recurse into that.
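
A quick simulation of the randomized scoring rule described above, with a made-up population of heights, shows why the optimal opening claim is just the mean: negative squared error against a randomly sampled person is maximized, in expectation, by reporting the average.

```python
import random

heights = [random.gauss(1.6, 0.1) for _ in range(10_000)]  # hypothetical population
true_mean = sum(heights) / len(heights)

def expected_score(claim, samples=2_000):
    """Average the negative squared error against randomly sampled people."""
    return -sum((claim - random.choice(heights)) ** 2 for _ in range(samples)) / samples

for claim in (1.4, 1.5, true_mean, 1.7, 1.8):
    print(f"claim={claim:.3f}  expected score={expected_score(claim):.4f}")
# The claim at the true mean gets the highest (least negative) expected score,
# so this randomized debate is only two steps deep instead of logarithmically deep.
```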

It would be super embarrassing if we propose these complicated alignment schemes, “This is how we’re going to solve AI safety,” and they can’t quickly answer trivial statistical questions. That would be a serious problem. We kind of know how to solve that one. The harder case is if you bring in this more vague statistical intuition. It’s not like I’m computing a mean over some dataset. I’ve looked at the weather and, you know, it feels like it’s going to rain tomorrow. Getting that in is a bit trickier, but we have some ideas there. They’re unresolved.

That’s one thing which I am optimistic about, but we need to work on it. The most important reason to be concerned is just that humans are flawed in a variety of ways. We have all these ethical inconsistencies and cognitive biases. We can write down some toy theoretical arguments that debate works with a limited but reliable judge, but does it work in practice with a human judge? I think there’s some questions you can reason through there, but in the end, a lot of that will be determined by just trying it, and seeing whether debate works with people. Eventually, when we start to get agents that can play these debates, then we can check whether it works with two ML agents and a human judge. For now, when language modeling is not that far along, we may need to try it out first with all humans.

This would be, you play the same debate game, but both the debaters are also people, and you set it up so that somehow it’s trying to model this case where the debaters are better than the judge at some task. The debaters might be experts at some domain, they might have access to some information that the judge doesn’t have, and therefore, you can ask whether a reasonably short debate is truth seeking if the humans are playing to win.

The hope there would be that you can test out debate on real people with interesting questions, say complex scientific questions, and questions about ethics, and about areas where humans are biased in known ways, and see whether it works, and also see not just whether it works, but which forms of debate are strongest.

Lucas: What does it mean for debate to work or be successful for two human debaters and one human judge if it’s about normative questions?

Geoffrey: Unfortunately, if you want to do this test, you need to have a source of truth. In the case of normative questions, there’s two ways to go. One way is you pick a task where we may not know the entirety of the answer, but we know some aspect of it with high confidence. An example would be this resume case, where two resumes are identical except for the name at the top, and we believe normatively, with high confidence, that the answer shouldn’t depend on that. If it turns out that a winning debater can maliciously and subtly take advantage of the name to sway the judge, and make a resume with a black-sounding name look bad, that would be a failure.

We sort of know that because, while we don’t know in advance whether a resume should be good or bad overall, we know that the judgment on this pair of identical resumes shouldn’t depend on the name. That’s one way: we have some kind of normative statement where we have reasonable confidence in the answer. The other way, which is kind of similar, is you have two experts in some area, and the two experts agree on what the true answer is, either because it’s a consensus across the field, or just because maybe those two experts agree. Ideally, it should be a thing that’s generally true. Then, you force one of the experts to lie.

You say, “Okay, you both agree that X is true, but now we’re going to flip a coin and now one of you only wins if you lie, and we’ll see whether that wins or not.”

Lucas: I think it also … Just to plug your game here, you guys do have a debate game. We’ll put a link to that in the article that goes along with this podcast. I suggest that people check that out if you would like a little bit more tangible and fun way to understand debate, and I think it’ll help elucidate what the process looks like, and the asymmetries that go on, and the key idea here that it is harder to lie than to refute a lie. It seems like if we could deploy some sort of massive statistical analysis over many different iterated debates across different agents, that we would be able to come down on the efficacy of debate in different situations where the judge and the debaters are all AI, mixed situations, or all human debates. I think it’d be interesting to see the varying results there.

Geoffrey: This is going to be a noisy enough process, for a variety of reasons, that we will probably have to do this a lot to know. So far, we’ve just done a very small, informal number of these human, human, human debates. Even in those few expert debates, we’ve already learned a fair amount at a qualitative level. I’ll give an example. In one debate about a CS theory question, there was a judge present while the debate was going on, but they were interacting only fairly minimally.

Early in the process of the debate, the debaters … I was the debater telling the truth here. I stated my formulation of the question and unpacked it intuitively, and the judge, perfectly understandably, had a subtle misunderstanding of how I’d framed the question. That misunderstanding was never corrected during the debate, so there was a lingering confusion about just what the topic was, and it was much harder for the honest player to win, because it seemed like the honest player had a harder case to make.

That sort of taught us that having judge interaction is potentially quite important so that the debaters have a detailed understanding of what the judge is thinking. If your goal is to model debate as a perfect information game, the closer to that you can get, the more information that debaters have, the better it should be.

Lucas: Yeah. I mean, that also allows the debaters to exploit cognitive biases in the judge.

Geoffrey: That’s right.

Lucas: You would point that out. Like, “Hey, this person’s exploiting your cognitive bias.”

Geoffrey: Yeah, so I think it’s an open question how exactly to strike that balance, and if there’s a way to strike it that works. Generally, the more information about, say, the judge that you provide to the debaters, either through judge interaction or by just telling the debaters something about the judge, the stronger they will be as players of the game, but it might reveal ways to attack the judge.

Now, if our goal is to be resistant to very strong agents, and it turns out that the only way to make it safe is to hide information from the agents, maybe you shouldn’t use this method. It may not be very resilient. It’s likely that for experiments, we should push as far as we can towards strong play, revealing as much as possible, and see whether it still works in that case.

Lucas: In terms here of the social scientists playing a role here, do you want to go ahead and unpack that a bit more? There’s a paper that you’re working on with Amanda Askell on this.

Geoffrey: As you say, we want to run statistically significant experiments that test whether debate is working and which forms of debate are best, and that will require careful experimental design. That is, in some sense, just a social science experiment. There’s no ML involved. It’s motivated by machine learning, but it’s just a question about how people think, and how they argue and convince each other. Currently, no one at OpenAI has any experience running human experiments of this kind, or at least no one that is involved in this project.

The hope would be that we would want to get people involved in AI safety that have experience and knowledge in how to structure experiments on the human side, both in terms of experimental design, having an understanding of how people think, and where they might be biased, and how to correct away from those biases. I just expect that process to involve a lot of knowledge that we don’t possess at the moment as ML researchers.

Lucas: Right. I mean, in order for there to be an efficacious debate process, or AI alignment process in general, you need to debug and understand the humans as well as the machines. Understanding our cognitive biases in debates, and our weak spots and blind spots in debate, it seems crucial.

Geoffrey: Yeah. I sort of view it as a social science experiment, because it’s just a bunch of people interacting. It’s a fairly weird experiment. It differs from normal experiments in some ways. In thinking about how to build AGI in a safe way, we have a lot of control over the whole process. If it takes a bunch of training to make people good at judging these debates, we can provide that training, pick people who are better or worse at judging. There’s a lot of control that we can exert. In addition to just finding out whether this thing works, it’s sort of an engineering process of debugging the humans, maybe it’s sort of working around human flaws, taking them into account, and making the process resilient.

My highest level hope here is that humans have various flaws and biases, but we are willing to be corrected, and set our flaws aside, or maybe there’s two ways of approaching a question where one way hits the bias and one way doesn’t. We want to see whether we can produce some scheme that picks out the right way, at least to some degree of accuracy. We don’t need to be able to answer every question. If we, for example, learned that, “Well, debate works perfectly well for some broad class of tasks, but not for resolving the final question of what humans should do over the long term future, or resolving all metaethical disagreements,” we can afford to say, “We’ll put those aside for now. We want to get through this risky period, make sure AI doesn’t do something malicious, and we can deliberately work through those remaining questions and take our time doing that.”

The goal includes the task of knowing which things we can safely answer, and the goal should be to structure the debates so that if you give it a question where humans just disagree too much or are too unreliable to reliably answer, the answer should be, “We don’t know the answer to that question yet.” A debater should be able to win a debate by admitting ignorance in that case.

There is an important assumption I’m making about the world that we should make explicit, which is that I believe it is safe to be slow about certain ethical or directional decisions. You can construct games where you just have to make a decision now, like you’re barreling along in some car with no brakes, and you have to dodge left or right around an obstacle, but you can’t say, “I’m going to ponder this question for a while and hold off.” You have to choose now. I would hope that the task of choosing what we want to do as a civilization is not like that. We can resolve some immediate concerns about serious problems and existential risk now, but we don’t need to resolve everything.

That’s a very strong assumption about the world, which I think is true, but it’s worth saying that I know that is an assumption.

Lucas: Right. I mean, it’s true insofar as coordination succeeds, and people don’t have incentives just to go do what they think is best.

Geoffrey: That’s right. If you can hold off deciding things until we can deliberate longer.

Lucas: Right. What does the distillation process look like for debate, where we’re ensuring alignment is maintained as a system’s capability is amplified and changed?

Geoffrey: One property of amplification, which is nice, is that you can sort of imagine running it forever. You train on simple questions, and then you train on more complicated questions, and then you keep going up and up and up, and if you’re confident that you’ve trained enough on the simple questions, you never need to see them again; you can freeze that part of the model and keep going. I think in practice, that’s probably not how we would run it, so you don’t inherit that advantage. In debate, what you would have to do to get to more and more complicated questions is, at some point, and maybe this point is fairly far off, go to longer and longer debates.

If you’re thinking about the long term future, I expect to have to switch over to some other scheme, or at least layer a scheme, embed debate in a larger scheme. An example would be it could be that the question you resolve with debate is, “What is an even better way to do AI alignment?” That, you can resolve with, say, depth 100 debates, and maybe you can handle that depth well. What that spits out to you is an algorithm; you interrogate it enough to know that you trust it, and then you can deploy that one.

You can also imagine eventually needing to hybridize a Debate-like scheme and an Amplification-like scheme, where you don’t get a new algorithm out, but you trust this initial debating oracle enough that you can view it as fixed, and then start a new debate scheme which can trust any answer that original scheme produces. Now, I don’t really like that scheme, because it feels like you haven’t gained a whole lot. Generally, it’s useful to think about the long term, say the next 1,000 years of AI alignment going forwards. I expect to need further advances after we get past this AI risk period.

I’ll give a concrete example. You ask your debating agents, “Okay, give me a perfect theorem prover.” Right now, all of our theorem provers probably have little bugs, so you can’t really trust them to resist a superintelligent agent. Say you trust the theorem prover that you get out, and you say, “Okay, now I want a proof that AI alignment works.” You bootstrap your way up, using this agent as an oracle on interesting, complicated questions, until you’ve got a scheme that gets you to the next level, and then you iterate.

Lucas: Okay. In terms of practical, short-term world to AGI world maybe in the next 30 years, what does this actually look like? In what ways could we see debate and amplification deployed and used at scale?

Geoffrey: There is the direct approach, where you use them to answer questions, using exactly the structure they’re trained with. A debating agent would just engage in debates, and you would use it as an oracle in that way. You can also use it to generate training data. You could, for example, ask a debating agent to spit out the answers to a large number of questions, and then you just train a little module on those, if you trust all the answers and you trust supervised learning to work. If you wanted to build a strong self-driving car, you could train a much smaller network that way. It would not be human level, but it gives you a way to generate data.

There’s a lot you could do with a powerful oracle that gives you answers to questions. I could probably go on at length about fancy schemes you could do with oracles. I don’t know if it’s that important. The more important part to me is what is the decision process we deploy these things into? How we choose which questions to answer and what we do with those answers. It’s probably not a great idea to train an oracle and then give it to everyone in the world right away, unfiltered, for reasons you can probably fill in by yourself. Basically, malicious people exist, and would ask bad questions, and eventually do bad things with the results.

If you have one of these systems, you’d like to deploy it in a way that can help as many people as possible, which means everyone will have their own questions to ask of it, but you need some filtering mechanism or some process to decide which questions to actually ask, what to do with the answers, and so on.

Lucas: I mean, can the debate process be used to self-filter out providing answers for certain questions, based off of modeling the human decision about whether or not they would want that question answered?

Geoffrey: It can. There’s a subtle issue, which I think we need to deal with, but haven’t dealt with yet. There’s a commutativity question, which is, say you have a large number of people, there’s a question of whether you reach reflective equilibrium for each person first, and then you would, say, vote across people, or whether you have a debate, and then you vote on the answer to what the judgment should be. Imagine playing a Debate game where you play a debate, and then everyone votes on who wins. There’s advantages on both sides. On the side of voting after reflective equilibrium, you have this problem that if you reach reflective equilibrium for a person, it may be disastrous if you pick the wrong person. That extreme is probably bad. The other extreme is also kind of weird because there are a bunch of standard results where if you take a bunch of rational agents voting, it might be true that A and B implies C, but the agents might vote yes on A, yes on B, and no on C. Votes on statements where every voter is rational are not rational. The voting outcome is irrational.
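
The voting pathology Geoffrey mentions is easy to exhibit with three made-up voters, each individually consistent (accepting C exactly when they accept both A and B), whose majority votes nonetheless come out yes on A, yes on B, and no on C:

```python
voters = [
    {"A": True,  "B": True,  "C": True},    # accepts A and B, so accepts C
    {"A": True,  "B": False, "C": False},
    {"A": False, "B": True,  "C": False},
]

def majority(proposition):
    return sum(v[proposition] for v in voters) > len(voters) / 2

print({p: majority(p) for p in ("A", "B", "C")})
# -> {'A': True, 'B': True, 'C': False}: the aggregate accepts A and B but rejects
#    C, even though each individual voter's views respect "A and B imply C".
```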

The result of voting before you take reflective equilibrium is sort of an odd philosophical concept. Probably, you need some kind of hybrid between these schemes, and I don’t know exactly what that hybrid looks like. That’s an area where I think technical AI safety mixes with policy to a significant degree that we will have to wrestle with.

Lucas: Great, so to back up and to sort of zoom in on this one point that you made, is the view that one might want to be worried about people who might undergo an amplified long period of explicit human reasoning, and that they might just arrive at something horrible through that?

Geoffrey: I guess, yes, we should be worried about that.

Lucas: Wouldn’t one view of debate be that humans, given debate, would also over time become more likely to arrive at true answers? That reflective equilibrium will tend to lead people to truth?

Geoffrey: Yes. That is an assumption. The reason I think there is hope there … I think that you should be worried. I think the reason for hope is our ability to not answer certain questions. I don’t know that I trust reflective equilibrium applied incautiously, or not regularized in some way, but I expect that if there’s a case where some definition of reflective equilibrium is not trustworthy, it’s hopeful that we can construct debate so that the result will be, “This is just too dangerous to decide. We don’t really know the answer with high confidence.”

Geoffrey: This is certainly true of complicated moral things. Avoiding lock in, for example. I would not trust reflective equilibrium if it says, “Well, the right answer is just to lock our values in right now, because they’re great.” We need to take advantage of the outs we have in terms of being humble about deciding things. Once you have those outs, I’m hopeful that we can solve this, but there’s a bunch of work to do to know whether that’s actually true.

Lucas: Right. Lots more experiments to be done on the human side and the AI side. Is there anything here that you’d like to wrap up on, or anything that you feel like we didn’t cover that you’d like to make any last minute points?

Geoffrey: I think the main point is just that there’s a bunch of work here. OpenAI is hiring people to work on both the ML side of things, also theoretical aspects, if you think you like wrestling with how these things work on the theory side, and then certainly, trying to start on this human side, doing the social science and human aspects. If this stuff seems interesting, then we are hiring.

Lucas: Great, so people that are interested in potentially working with you or others at OpenAI on this, or if people are interested in following you and keeping up to date with your work and what you’re up to, what are the best places to do these things?

Geoffrey: I have taken a break from pretty much all social media, so you can follow me on Twitter, but I won’t ever post anything, or see your messages, really. I think email me. It’s not too hard to find my email address. That’s pretty much the way, and then watch as we publish stuff.

Lucas: Cool. Well, thank you so much for your time, Geoffrey. It’s been very interesting. I’m excited to see how these experiments go for debate, and how things end up moving along. I’m pretty interested and optimistic, I guess, about debate as an epistemic process, its role in arriving at truth and truth seeking, and how that will play into AI alignment.

Geoffrey: That sounds great. Thank you.

Lucas: Yep. Thanks, Geoff. Take care.

If you enjoyed this podcast, please subscribe, give it a like, or share it on your preferred social media platform. We’ll be back again soon with another episode in the AI Alignment series.

[end of recorded material]