Paper deep dive

An Overview of 11 Proposals for Building Safe Advanced AI

Evan Hubinger

Year: 2020 · Venue: arXiv preprint · Area: Surveys & Reviews · Type: Survey · Embeddings: 86

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/12/2026, 7:51:21 PM

Summary

This paper provides a comparative analysis of 11 proposals for building safe advanced AI, evaluating them across four key dimensions: outer alignment, inner alignment, training competitiveness, and performance competitiveness. The author emphasizes the need for both alignment and competitiveness, noting that prior literature often focuses on individual proposals or neglects the inner alignment component.

Entities (7)

AI Safety via Debate · ai-safety-proposal · 100%
Evan Hubinger · author · 100%
Inner Alignment · concept · 100%
Iterated Amplification · ai-safety-proposal · 100%
Outer Alignment · concept · 100%
Recursive Reward Modeling · ai-safety-proposal · 100%
HCH · concept · 95%

Relation Signals (3)

Evan Hubinger → authored → An Overview of 11 Proposals for Building Safe Advanced AI

confidence 100% · An overview of 11 proposals for building safe advanced AI Evan Hubinger

Imitative Amplification → limitsto → HCH

confidence 95% · Since imitative amplification trains M to imitate Amp(M), it limits to the fixed point of the Amp operator, which Paul Christiano calls HCH

Iterated Amplification → evaluatedon → Outer Alignment

confidence 90% · Each proposal is evaluated on the four components of outer alignment, inner alignment, training competitiveness, and performance competitiveness

Cypher Suggestions (2)

Map the evaluation components for a specific proposal. · confidence 95% · unvalidated

MATCH (p:Proposal {name: 'Iterated Amplification'})-[:EVALUATED_ON]->(c:Component) RETURN c.name

Find all AI safety proposals mentioned in the paper. · confidence 90% · unvalidated

MATCH (p:Proposal) RETURN p.name

Abstract

This paper analyzes and compares 11 different proposals for building safe advanced AI under the current machine learning paradigm, including major contenders such as iterated amplification, AI safety via debate, and recursive reward modeling. Each proposal is evaluated on the four components of outer alignment, inner alignment, training competitiveness, and performance competitiveness, of which the distinction between the latter two is introduced in this paper. While prior literature has primarily focused on analyzing individual proposals, or primarily focused on outer alignment at the expense of inner alignment, this analysis seeks to take a comparative look at a wide range of proposals including a comparative analysis across all four previously mentioned components.


Full Text


An overview of 11 proposals for building safe advanced AI

Evan Hubinger∗
Research Fellow, Machine Intelligence Research Institute
May 29, 2020

arXiv:2012.07532v1 [cs.LG] 4 Dec 2020

Abstract

This paper analyzes and compares 11 different proposals for building safe advanced AI under the current machine learning paradigm, including major contenders such as iterated amplification, AI safety via debate, and recursive reward modeling. Each proposal is evaluated on the four components of outer alignment, inner alignment, training competitiveness, and performance competitiveness, of which the distinction between the latter two is introduced in this paper. While prior literature has primarily focused on analyzing individual proposals, or primarily focused on outer alignment at the expense of inner alignment, this analysis seeks to take a comparative look at a wide range of proposals including a comparative analysis across all four previously mentioned components.

Contents

1 Introduction
2 Reinforcement learning + transparency tools
3 Imitative amplification + intermittent oversight
4 Imitative amplification + relaxed adversarial training
5 Approval-based amplification + relaxed adversarial training
6 Microscope AI
7 STEM AI
8 Narrow reward modeling + transparency tools
9 Recursive reward modeling + relaxed adversarial training
10 AI safety via debate with transparency tools
11 Amplification with auxiliary RL objective + relaxed adversarial training
12 Amplification alongside RL + relaxed adversarial training

∗ Research supported by the Machine Intelligence Research Institute (intelligence.org). Special thanks to Kate Woolverton, Paul Christiano, Rohin Shah, Alex Turner, William Saunders, Beth Barnes, Abram Demski, Scott Garrabrant, Sam Eisenstat, and Tsvi Benson-Tilsen for providing helpful comments and feedback. Additional thanks to Claire Wang and Rob Bensinger for helping create this paper out of the Alignment Forum post that preceded it.[1]

1 Introduction

This paper is a collection of 11 different proposals for building safe advanced AI under the current machine learning paradigm. While there already exist good overviews of various particular approaches, such as amplification[2], debate[3], or recursive reward modeling[4], most of that literature focuses primarily on outer alignment at the expense of inner alignment[5] and doesn’t provide direct comparisons between approaches. The goal of this paper is to address this problem by providing a single collection of 11 different proposals for building safe advanced AI, each including both inner and outer alignment components.¹,²

This paper does not cover all existing proposals, and I strongly expect that there will be many additional new proposals in the future. Nevertheless, I think it is quite useful to take a broad look at what we have now and compare and contrast some of the current leading candidates.

It is important to note that the way I describe the 11 approaches presented here is not necessarily an accurate representation of how others would describe them. You should treat the approaches I describe here as a particular version of the approach that I selected for analysis, rather than as a canonical version that their various creators or proponents would endorse.

Furthermore, this paper only includes approaches that intend to directly build advanced AI systems via machine learning.
Thus, this paper does not include other possible approaches for solving the broader AI existential risk problem such as:

• finding a fundamentally different way of approaching AI than the current machine learning paradigm, with the intent of starting from foundations that make it easier to build safe advanced AI,
• developing some advanced technology that produces a decisive strategic advantage[7] without using advanced AI, or
• achieving global coordination around not building advanced AI via (for example) a persuasive demonstration that any advanced AI is likely to be unsafe.

For each of the proposals that I consider, I will try to evaluate them on the following four basic components that I think are necessary in any story about how to build safe advanced AI under the current machine learning paradigm.

1. Outer alignment – Outer alignment is about asking why the objective we’re training for is aligned: that is, if we actually got a model that was trying to optimize for the given loss/reward/etc., would we like that model? For a more thorough description of what I mean by outer alignment, see “Outer alignment and imitative amplification.”[8]
2. Inner alignment – Inner alignment is about asking the question of how our training procedure can guarantee that the model it produces will, in fact, be trying to accomplish the objective we trained it on. For a more rigorous treatment of this question and an explanation of why it might be a concern, see “Risks from Learned Optimization in Advanced Machine Learning Systems.”[5] We’ll be assuming familiarity with concepts from this paper, such as mesa-optimization, deceptive alignment, and pseudo-alignment, below.
3. Training competitiveness – Competitiveness is a bit of a murky concept, so I’ve broken it up into two pieces here. Training competitiveness is the question of whether the given training procedure is one that a team or group of teams with a reasonable lead would be able to afford to implement without completely throwing away that lead. Thus, training competitiveness is about whether the proposed process of producing advanced AI is competitive.
4. Performance competitiveness – Performance competitiveness, on the other hand, is about whether the final product produced by the proposed process is competitive. Performance competitiveness is thus about asking whether a particular proposal, if successful, would satisfy the use cases for advanced AI, e.g. whether it would fill the economic niches that people want artificial general intelligence (AGI) to fill.

It can be easy to focus too much on either the alignment side or the competitiveness side, while neglecting the other. We obviously want to avoid proposals which could be unsafe, but the “do nothing” proposal is equally unacceptable: while doing nothing is quite safe in terms of having no chance of directly leading to existential risk, it doesn’t actually help in any way relative to what would have happened by default. We therefore want proposals that are both aligned and competitive: not only should the resulting system not result in catastrophe, but it should also help substantially reduce existential risk in general, by providing a model of how safe advanced AI can be built, by being powerful enough to assist with future alignment research, and/or by granting a decisive strategic advantage that can be leveraged into otherwise reducing existential risk.

¹ Note that the order of the proposals given here is chosen purely for pedagogical reasons and is not meant to imply any sort of measure of importance.

² For another proposal evaluated in the same way as those presented here that came out after this list was completed in May 2020, see “AI safety via market making.”[6]
2 Reinforcement learning + transparency tools

We want to be able to train advanced AI systems to learn what we want; and to the extent the system doesn’t fully understand what we want, we want the system to be corrigible, tolerating (or even assisting in) corrections, safeguards, and improvements.[9][10] The first proposal we’ll consider tries to achieve this via reinforcement learning (RL) and transparency tools:

1. Train an RL agent in an environment where corrigibility, honesty, multi-agent cooperation, etc. are incentivized. The basic idea would be to mimic the evolutionary forces that led to humans’ general cooperativeness. As an example of work along these lines that exists now, see OpenAI’s hide and seek game.[11] The environment could additionally be modified to directly reward following human instructions, thus encouraging corrigibility towards humans. For a more thorough discussion of this possibility, see Richard Ngo’s “Multi-agent safety.”[12]
2. Have humans use transparency tools,[13] adversarial training, etc. to check for deceptive[5] or otherwise catastrophic[14] behavior in the resulting model.

Figure 1: An image of OpenAI’s hide and seek game.

2.1 Outer alignment

Whether this proposal satisfies outer alignment is entirely dependent on whatever the dominant behavior is in the training environment; that is, it depends on the deployment behavior of those models which perform optimally in the training environment. If corrigibility, honesty, cooperation, etc. do in fact dominate in the limit, then such an approach would be outer aligned. By default, however, it seems quite difficult to understand the limiting behavior of complex, multi-agent environments, especially if they are anywhere near as complex as the actual human ancestral environment.
If following human instructions is incentivized, for example, that could lead to corrigibility in the limit, or it could lead to agents that only choose to follow human instructions for the instrumental reason of believing it will help them acquire more resources. Alternatively, it might be possible to isolate the structure that was present in the human ancestral environment that led us to be cooperative, honest, etc. One worry here, however, is that even if we could figure out how to properly incentivize cooperation, it might result in agents that are cooperative with each other but not very cooperative with us, similar to how we might not be very cooperative with aliens that are very different from us.

2.2 Inner alignment

Inner alignment in this setting requires that the system’s training produces something in line with the optimal behavior in the environment rather than other, potentially perverse equilibria; the alignment of this optimal behavior with our actual preferences is then an outer alignment question. The basic proposal for how to avoid such perverse equilibria under this proposal is via the use of checks such as transparency tools and adversarial training to detect inner alignment failures before the model is deployed. Chris Olah describes this sort of transparency checking as giving you a “mulligan” that lets you throw away your model and start over if you find something wrong.[15] Thus, ideally, if this approach ends up not working it should be clear before the model is deployed, allowing either for this approach to be fixed or for a new approach to be found instead. There is also a reasonable chance that it does work initially: we don’t understand our models’ inductive biases very well, but it seems entirely possible that they could work out such that pseudo-alignment[5] is disincentivized.
While it seems quite plausible to me that this sort of approach could catch proxy pseudo-alignment, it seems unlikely to me that it would successfully catch deceptive pseudo-alignment, as it could be very difficult to make transparency tools that are robust to a deceptive model actively trying to trick them. To catch deceptive alignment, it seems likely to be necessary to incorporate such checks into the training process itself (which is possible to do in this setting, though it is not the approach I described above) in order to prevent deception from occurring in the first place, rather than trying to detect it after the fact.

2.3 Training competitiveness

Training competitiveness here seems likely to depend on the extent to which the sort of agency produced by RL is necessary to train advanced AI systems. Performing RL in highly complex, difficult-to-simulate environments (especially if those environments involve interaction with the real world) could be quite expensive from a training competitiveness standpoint. Compared to simple language modeling, for example, the difficulty of on-policy data collection combined with low sample efficiency could make full-scale RL much less training competitive. These sorts of competitiveness concerns could be particularly pronounced if the features necessary to ensure that the RL environment is aligned result in its being significantly more difficult to simulate.

That being said, if RL is necessary to do anything powerful and simple language modeling is insufficient, then whether or not language modeling is easier is a moot point. Whether RL is really necessary seems likely to depend on the extent to which it is necessary to explicitly train agents, which is very much an open question. Additionally, even if agency is required, it could potentially be obtained just by imitating an actor such as a human that already has it, rather than training it directly via RL.
2.4 Performance competitiveness

The question for performance competitiveness in this setting is to what extent it is possible to create an environment that incentivizes all of the behavior you might want from your AGI. Such an environment doesn’t need to be purely simulated: you could do some simulation training and some real-world training, for example. Regardless of how your RL environment is constructed, however, it needs to actually incentivize the correct behavior for the tasks that you want to use your AI for. For example: can you incentivize good decision-making? Good question-answering? Good learning ability? Do you need good fine motor control, and if so, can you incentivize it? These are highly non-trivial questions: it could be quite difficult to set up an RL environment to teach an agent to do all of the tasks you might want it to perform to fill all the economic niches for AGI, for example. This will of course be highly dependent on exactly what economic niches you want your advanced AI to fill.

3 Imitative amplification + intermittent oversight

Though many of the approaches on this list make use of the basic iterated amplification[2] framework, imitative amplification is probably the most straightforward. Still, imitative amplification has a number of moving parts.

To define imitative amplification, we’ll first define Amp(M), the “amplification operator,” as the procedure where a human H answers a question with access to a model M.³

³ Note that in practice Amp(M) doesn’t need to always be computed with an actual human. H can simply be M some fixed fraction of the time, for example (or more often when M is confident and less often when M is uncertain), and the resulting procedure is effectively the same. See “A concrete proposal for adversarial IDA”[16] for an example of how something like that could work.
Figure 2: A diagram of the amplification operator Amp(M), where white arrows indicate information transfer, Q is a question, A is Amp(M)’s answer, H is a human, and M is the model.

Then, imitative amplification is just the procedure of iteratively training M to imitate Amp(M).

Figure 3: The basic imitative amplification setup, where green arrows indicate amplification, gray arrows indicate training, and cyan arrows indicate the imitative amplification loss.

Finally, we can define imitative amplification + intermittent oversight (which is the full approach we want to consider here) as the combination of imitative amplification with intermittent oversight of M by Amp(M) whenever the target model changes. Specifically, we want Amp(M) to look for deceptive or otherwise catastrophic behavior in M by utilizing things like transparency tools and adversarial attacks.

Figure 4: Imitative amplification plus intermittent oversight, where red arrows indicate oversight (from the overseer to the overseen model).

3.1 Outer alignment

Since imitative amplification trains M to imitate Amp(M), it limits to the fixed point of the Amp operator, which Paul Christiano calls HCH for “Humans Consulting HCH.”⁴[17] HCH is effectively a massive tree of humans consulting each other to answer questions.

Figure 5: A (partial) diagram of HCH, where white arrows indicate information transfer.

Thus, whether imitative amplification is outer aligned is dependent on whether HCH is aligned or not. HCH’s alignment, in turn, is likely to depend heavily on the specific humans used and what sort of policy they are implementing. The basic idea, however, is that since the limit is composed entirely of humans (and since we can control what those humans do) we should be able to ensure that all the optimization power is coming from the humans (and not from memetic selection, for example), which hopefully should allow us to make it safe.
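The definitions of Amp(M), imitative amplification, and the HCH limit can be made concrete with a small toy. The sketch below is not from the paper: it assumes a deterministic stand-in "human" who can only split a sum in half and add two sub-answers, and a "model" that is just a lookup table trained by memorization. All function names are hypothetical. Under those assumptions, each round of imitating Amp(M) lets M answer questions one level deeper, until M agrees with the full recursive tree.

```python
"""Toy sketch (illustrative assumptions only) of Amp(M) and its HCH limit."""


def human(question, ask):
    """A stand-in 'human': splits a sum in half and adds two sub-answers."""
    if len(question) == 1:
        return question[0]           # trivial question, answered directly
    mid = len(question) // 2
    left, right = ask(question[:mid]), ask(question[mid:])
    if left is None or right is None:
        return None                  # the assistant could not help
    return left + right


def amp(model):
    """Amp(M): the human answers a question with access to the model M."""
    return lambda question: human(question, model.get)


def hch(question):
    """HCH: the human consults further copies of HCH, not a trained model."""
    return human(question, hch)


def subquestions(question):
    """All questions reachable by repeated halving (the training inputs)."""
    found = {question}
    if len(question) > 1:
        mid = len(question) // 2
        found |= subquestions(question[:mid]) | subquestions(question[mid:])
    return found


def imitative_amplification(questions, rounds):
    """Repeatedly 'train' M (here: memorize) to imitate Amp(M)."""
    model = {}                       # M starts out knowing nothing
    for _ in range(rounds):
        target = amp(model)          # targets come from the current M
        answers = {q: target(q) for q in questions}
        model = {q: a for q, a in answers.items() if a is not None}
    return model


top = (3, 1, 4, 1, 5, 9, 2, 6)
questions = subquestions(top)
print(top in imitative_amplification(questions, 3))   # False
print(imitative_amplification(questions, 4)[top])     # 31
print(hch(top))                                       # 31
```

In this toy, three rounds are not enough for M to answer the eight-number question, while four rounds reproduce the answer of the full tree. Real imitative amplification replaces memorization with gradient-based training and the scripted human with actual people, so nothing here bears on whether the limit is reached in practice.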
While there are certainly valid concerns with the humans in the tree accidentally (or purposefully) implementing some misaligned policy, there are possible things that can be done to address these problems.[18]

⁴ Imitative amplification leads to HCH in the limit of training a perfect predictor over all possible inputs; that is, an infinite amount of training data, an infinitely large model, and an infinitely large amount of training time.

3.2 Inner alignment

The hope here is to be able to detect and fix serious inner alignment failures via the use of intermittent oversight, that is, inspection by a human with access to a previous version of the model. Thus, the question of inner alignment here is dependent on the question of whether such an overseer would be able to detect deceptive or otherwise catastrophic behavior in the model. Ideally, if such detection is working, it should be clear as soon as the model starts to become misaligned, giving us the opportunity to either fix the issue or find a new approach. For that to work, however, it will likely need to be the case that the intermittent oversight is performed regularly enough to catch any deception as soon as it starts appearing, since once a model becomes deceptive, it might start trying to obfuscate its deception, which would make it significantly harder for the overseer to detect problems.

That being said, while it might be quite difficult to directly detect deceptive cognition early enough, it might be possible to instead detect deceptive behavior indirectly by verifying a condition that rules out deception while still being relatively easy to verify, such as myopia. In this context, a myopic model would be one that determines its output based on some fixed standard such as distance to HCH while a non-myopic model would be one that determines its output based on its consequences.⁵ For a more detailed explanation of how such myopia verification might work, see “Towards a mechanistic understanding of corrigibility.”[10]

Additionally, because we do intermittent oversight whenever we change the target network, we have a guarantee that the overseer is always at least as strong as any targets that the model was trained to imitate. That being said, the model could end up stronger than the targets it was trained to imitate if it manages to, for example, achieve significant compression of what the targets are doing, which in fact is necessary for training competitiveness. Furthermore, even if the overseer is stronger than the model, the offense-defense balance could be such that a deceptive model would still be able to hide its deception. Which side is favored in the offense-defense balance is likely to depend heavily on what sorts of transparency tools the overseer has access to.

3.3 Training competitiveness

First, the training competitiveness of imitative amplification is likely to depend on whether pure imitation can be turned into a rich enough reward signal to facilitate highly sample-efficient learning. In my opinion, it seems likely that human language imitation (where language includes embedded images, videos, etc.) combined with techniques to improve sample efficiency⁶ will be competitive at some tasks (namely highly-cognitive tasks such as general-purpose decision-making) but not at others, such as fine motor control. If that’s true, then as long as the primary economic use cases for AGI fall into the highly-cognitive category, imitative amplification should be training competitive. For a more detailed analysis of this question, see “Outer alignment and imitative amplification.”[8]

Second, imitative amplification + intermittent oversight could also fail to be training competitive if it is very difficult for the intermittent oversight check to ever pass.
This could happen if, for example, the training process for imitative amplification doesn’t naturally produce models that are transparent enough to facilitate such a check. That being said, what counts as “transparent enough” again depends heavily on the specifics of the transparency tools that the overseer has access to.

⁵ Enforcing myopia for amplification could also have the side benefit of improving competitiveness, as a non-myopic amplification model might try to make its output simpler (and thus less useful) if it is trying to make a human consulting it easier to predict.

⁶ See Paul Christiano’s “Efficient feedback”[19] for some examples of possible sample-efficiency-improving techniques.

3.4 Performance competitiveness

The performance competitiveness question here, as with the outer alignment question, is heavily dependent on the properties of HCH, as imitative amplification is specifically designed to limit toward HCH. In particular, performance competitiveness is dependent on whether HCH is sufficiently universal so as to be capable of solving whatever hard problems we might want advanced AI to solve. For example, if trees of humans can’t replicate intuitive flashes of insight, and such flashes of insight are necessary to solve the sorts of highly-cognitive tasks we want AGI for, then imitative amplification would fail to be performance competitive. Alternatively, as long as it is always possible for a human to break down complex tasks that they can’t solve themselves into easier subtasks, imitative amplification should be performance competitive. Note that this doesn’t have to be the case for an arbitrary human without any training, only for the sort of human from whom you would actually be collecting training data.

4 Imitative amplification + relaxed adversarial training

The basic idea here is very similar to the previous approach, in that we start out with pure imitative amplification.

Figure 6: The basic imitative amplification setup.
However, instead of doing intermittent oversight checks, we train directly on the intermittent oversight signal; that is, we jointly train M on the imitation loss and on decreasing the probability of catastrophe as assessed by Amp(M), yielding a relaxed adversarial training setup.[20]

Figure 7: Imitative amplification plus relaxed adversarial training, where the red arrows indicate the oversight loss.

4.1 Outer alignment

Imitative amplification + relaxed adversarial training should have the same outer alignment properties as imitative amplification + intermittent oversight. If we assume that it is possible to imitate HCH in a demonstrably non-catastrophic way,⁷ then the catastrophe-checking imposed by either relaxed adversarial training or intermittent oversight should not affect the theoretical limiting behavior. Such a limit assumes an infinite training distribution, however, which is of course impossible in practice. This leaves open the possibility for the two different approaches to have quite different behavior in the finite data regime, giving them potentially quite different inner alignment properties.

4.2 Inner alignment

Relaxed adversarial training could have inner alignment properties that are very different from intermittent oversight’s. Intermittent oversight requires performing oversight after a period of training during which the model could potentially have become deceptive. Thus, intermittent oversight could potentially require catching deception after the fact, giving the deceptive model the opportunity to try to obfuscate its own cognition to trick the overseer.⁸ Relaxed adversarial training, by contrast, has the potential to avoid deception before it ever occurs by providing a consistent gradient away from deception during the entirety of training. This avoids the issue of having to detect deception after the fact.
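One hedged way to write down the joint training described in this section is as a single objective with two terms. The notation below is introduced here purely for illustration and does not appear in the paper: L_imit stands for the imitation loss, P_cat for the probability of catastrophe as assessed by Amp(M), and the weight λ and the choice of a simple weighted sum are assumptions.

```latex
\mathcal{L}(M) \;=\;
  \mathcal{L}_{\mathrm{imit}}\bigl(M,\ \mathrm{Amp}(M)\bigr)
  \;+\; \lambda \, P_{\mathrm{cat}}\bigl(M \mid \mathrm{Amp}(M)\bigr),
  \qquad \lambda > 0
```

Read this way, intermittent oversight corresponds to optimizing only the first term and treating the overseer's assessment as a pass/fail check whenever the target model changes, whereas relaxed adversarial training lets the second term contribute a gradient at every step.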
Similarly to intermittent oversight, such a gradient could be achieved by having the overseer enforce a condition such as myopia, which is simple enough to reliably enforce while guaranteeing that all models which meet the condition are non-deceptive. However, relaxed adversarial training does come with the possible risk of overfitting the oversight signal, potentially resulting in obfuscated models that pass oversight despite not actually meeting the condition, simply by virtue of having been optimized to do so.

⁷ That is, it needs to be the case that in the limit there exists a model that perfectly imitates HCH in such a way that an amplified overseer will be convinced that the imitator is non-catastrophic.

⁸ This sort of failure mode could be particularly pernicious if the deceptive model is capable of gradient hacking.[21]

4.3 Training competitiveness

Like imitative amplification + intermittent oversight, imitative amplification + relaxed adversarial training could fail to be training competitive either because imitative amplification is uncompetitive or because the oversight check is too hard to pass. On the one hand, relaxed adversarial training could require significantly more total oversight bandwidth, which could make training more difficult. On the other hand, unlike with intermittent oversight, relaxed adversarial training directly trains the model to pass the check, which could significantly boost training competitiveness by increasing the likelihood of the check passing.

For relaxed adversarial training to be more training competitive than intermittent oversight, it needs to be the case that it significantly increases the probability of oversight working, which is likely to depend on the overfitting question of whether relaxed adversarial training is likely to overall increase or decrease model transparency.
The critical question here is going to be how hard it is to find a model that is demonstrably non-catastrophic under the two different oversight regimes. For a more detailed analysis of this question, see “Relaxed adversarial training for inner alignment.”[20]

4.4 Performance competitiveness

Similarly to the outer alignment question, the performance competitiveness properties of imitative amplification + intermittent oversight and imitative amplification + relaxed adversarial training should be effectively the same, with performance competitiveness here highly dependent on the competitiveness of HCH.

5 Approval-based amplification + relaxed adversarial training

The idea behind approval-based amplification is to substitute the imitation signal in imitative amplification with an approval signal, training M to maximize the approval of Amp(M). This gives us a myopic reward signal which can then be optimized via the use of some sort of myopic reinforcement learning.⁹

⁹ For example, Deep TAMER[22] or deep reinforcement learning from policy-dependent human feedback[23].

Figure 8: Approval-based amplification, where the blue arrows indicate the approval loss.

We can then add on top of that a relaxed adversarial training signal, training M to also minimize the probability of catastrophe as assessed by Amp(M).

Figure 9: Approval-based amplification + relaxed adversarial training, where the red arrows indicate the oversight loss.

Additionally, it is worth noting that, in practice, since oversight can be treated as a form of approval, you can just compute one approval signal that includes both metrics rather than two distinct approval signals.

5.1 Outer alignment

For the purposes of outer alignment and performance competitiveness, we have to understand what exactly the theoretical limit of approval-based amplification is. In general, it is not HCH.
Instead, approval-based amplification limits to the following tree, where blue arrows indicate approval-maximization (such that M is maximizing the approval of H consulting M′ which is maximizing the approval of H consulting M′ and so on).¹⁰

¹⁰ Verifying that the given tree is, in fact, the limit of approval-based amplification is a straightforward exercise of expanding the training procedure out over time. Note that the models at the top of the tree are the most recent models and the models at the bottom of the tree are the oldest.

Figure 10: The top of the approval-based amplification limiting tree, where white arrows indicate information transfer, blue arrows indicate approval maximization, H are humans, and M are models.

As can be seen from the tree, the limit of approval-based amplification involves humans interspersed with models attempting to maximize the approval of their descendants. Thus, unlike the HCH tree, which was composed only of humans, analyzing the approval-based amplification tree involves analyzing the behavior of arbitrary models trained on approval-maximization signals. In fact, it involves many such models, as they appear all throughout the tree, each with a different subtree that they are maximizing the approval of.¹¹ As a result, producing an outer alignment guarantee regarding such a tree seems quite difficult, especially given the distinct possibility of the models gaming the approval signal. For additional detail on this question, see “Outer alignment and imitative amplification.”[8]

¹¹ It is worth noting that there is also the possibility of recovering HCH as the limit of approval-based amplification if the approval signal is implemented as a human-computed distance metric relative to what the human would have done before seeing the model’s answer.

5.2 Inner alignment

We are still using relaxed adversarial training, so for inner alignment to work we need the same guarantee of a consistent gradient away from deception. Note that we can still try to have Amp(M) enforce myopia as a way of getting that gradient, as approval-maximization is a single-step reward signal that is entirely compatible with a myopic agent.¹² In the approval-based setting, however, Amp(M) is a different entity that limits to a different tree, meaning it could have different properties regarding how good its oversight signal is. In particular, the fact that approval-based amplification directly incentivizes M to fool H into approving of it could result in M being more prone to deception, more likely to hide facts from H during oversight, and/or more likely to obfuscate its internals if internals-checking is part of the oversight signal. That being said, it could also be that relaxed adversarial training works better in an approval-based setting if it results in a more powerful Amp(M) that is better at performing oversight because M was trained to provide more helpful information.

¹² In particular, myopia verification for approval-based amplification could be used to produce approval-directed agents.[24]

5.3 Training competitiveness

The training competitiveness of approval-based amplification could be quite different from that of imitative amplification, as approval provides a very different training signal than imitation. In particular, the standard maxim that it is generally easier to evaluate behavior than to produce it seems to suggest that human approval should produce a significantly richer training signal than the simple automated distance metric in imitative amplification, resulting in more efficient training. This isn’t necessarily the case, however: an automated distance metric as computed by a powerful language model, for instance, could still produce a fairly rich (and more nicely differentiable) training signal.
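As a rough illustration of the approval signal this section relies on, the following minimal sketch folds approval and oversight into one score and then selects an answer myopically, that is, by its single-step score alone. The function names, the `oversight_weight` knob, and the toy numbers are all assumptions made for illustration; the paper does not specify how the two metrics would be combined.

```python
from typing import Callable, Sequence


def combined_signal(approval: float, catastrophe_prob: float,
                    oversight_weight: float = 1.0) -> float:
    """Fold the oversight metric into a single approval-style score.

    `oversight_weight` is a hypothetical knob, not something the text specifies.
    """
    return approval - oversight_weight * catastrophe_prob


def myopic_choice(candidates: Sequence[str],
                  approve: Callable[[str], float],
                  catastrophe_prob: Callable[[str], float],
                  oversight_weight: float = 1.0) -> str:
    """Pick the answer with the best single-step score.

    Myopic: only the immediate score of this one answer is considered;
    no future rewards are summed or discounted.
    """
    return max(
        candidates,
        key=lambda a: combined_signal(approve(a), catastrophe_prob(a),
                                      oversight_weight),
    )


# Toy stand-ins for the judgments of Amp(M) (made-up numbers).
approval_scores = {"honest": 0.7, "flattering": 0.9}
risk_scores = {"honest": 0.0, "flattering": 0.5}

best = myopic_choice(list(approval_scores), approval_scores.get, risk_scores.get)
```

In this toy setup the flattering answer wins on raw approval, and only the oversight term changes the selection, which mirrors the concern about models gaming the approval signal.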
Additionally, approval-based amplification could also have different training competitiveness properties in terms of the likelihood of the relaxed adversarial training check passing. If approval-based amplification leads to models with more obfuscated internals, for example (perhaps because the model is incentivized to obfuscate how it’s gaming the approval signal), then that could hurt training competitiveness, as it could make it harder for Amp(M) to become confident that M is safe (in addition to hurting inner alignment by making M in fact less likely to be safe).

5.4 Performance competitiveness

Performance competitiveness for approval-based amplification depends on the properties of the approval-maximization tree given previously. Compared to HCH, it is quite possible for that tree to be either more or less competitive. On the one hand, approval-maximizing models might be able to produce ideas, considerations, or arguments that no human would have ever come up with, enabling the approval-maximization tree to be more competitive. In particular, if humans are significantly better at knowing good ideas when they see them than producing good ideas themselves (which fits with the standard maxim that it’s easier to evaluate behavior than produce it), then approval-based amplification should be more performance competitive than imitative amplification. On the other hand, approval-maximizing models might game the approval signal by producing convincing-sounding but ultimately bad ideas, considerations, or arguments that end up hurting performance competitiveness. For example, an idea could be initially very compelling and thus get high approval despite quite quickly revealing itself to be useless, vacuous, etc. when actually tested out.
In such a situation where the bad ideas quickly reveal themselves, the performance competitiveness problems would likely be quite clear; but if it takes a very long time for the bad ideas to reveal themselves, then approval-maximization might initially look a lot better from a performance competitiveness standpoint than it actually is.

6 Microscope AI

Microscope AI is a fairly unique proposal which is designed to bypass some of the dangers of building highly agentic AGI systems by leveraging powerful transparency tools.[15] The basic proposal is as follows:

1. Train a predictive model on some set of data that you want to understand, while using transparency tools to verify that the model isn’t performing any optimization.
2. Use transparency tools to understand what the model learned about the data and use that understanding to guide human decision-making.

Microscope AI is predicated on the basic insight that using transparency tools on a model doesn’t just teach us about that model; it also gives us useful information about the data that the model was trained on. Chris Olah talks about this basic phenomenon in his article “Visualizing Representations: Deep Learning and Human Beings”:[25]

“The visualizations are a bit like looking through a telescope. Just like a telescope transforms the sky into something we can see, the neural network transforms the data into a more accessible form. One learns about the telescope by observing how it magnifies the night sky, but the really remarkable thing is what one learns about the stars. Similarly, visualizing representations teaches us about neural networks, but it teaches us just as much, perhaps more, about the data itself.”

The basic goal of microscope AI is to produce high-quality knowledge that can inform important decision-making, rather than to produce powerful AGI systems that can make those decisions themselves.
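The two-step proposal above can be sketched as a training loop that halts whenever a transparency check fires, followed by a separate knowledge-extraction step. This is only a schematic under strong assumptions: `looks_like_optimization` and `inspect` are hypothetical placeholders for transparency tools that do not currently exist in this form, and the "predictor" here is a toy token counter.

```python
from typing import Callable, Iterable, TypeVar

Model = TypeVar("Model")
Batch = TypeVar("Batch")
Insight = TypeVar("Insight")


class OptimizationDetected(RuntimeError):
    """Raised when the (hypothetical) transparency check fires."""


def train_predictor(model: Model,
                    batches: Iterable[Batch],
                    update: Callable[[Model, Batch], Model],
                    looks_like_optimization: Callable[[Model], bool]) -> Model:
    """Step 1: train a predictor, halting if optimization is detected."""
    for step, batch in enumerate(batches):
        model = update(model, batch)
        if looks_like_optimization(model):
            raise OptimizationDetected(f"halting at step {step}")
    return model


def extract_knowledge(model: Model, inspect: Callable[[Model], Insight]) -> Insight:
    """Step 2: use transparency tools to read off what the model learned."""
    return inspect(model)


# Toy stand-in: the "model" is a table of token counts.
def count_update(counts: dict, token: str) -> dict:
    new_counts = dict(counts)
    new_counts[token] = new_counts.get(token, 0) + 1
    return new_counts


model = train_predictor({}, ["a", "b", "a"], count_update, lambda m: False)
most_common = extract_knowledge(model, lambda m: max(m, key=m.get))
```

The human, not the model, acts on `most_common`; the sketch deliberately gives the model no way to take actions of its own.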
6.1 Outer alignment

Paul Christiano’s argument that the universal prior is malign suggests that there might be deceptive optimization in the limit of perfect prediction.[26] As a result, if we interpret outer alignment as referring to the behavior at minimal loss,[8] it seems likely that microscope AI is not outer aligned. This doesn’t necessarily rule out microscope AI as a proposal, however: whether such deceptive optimization would actually occur in practice is still a very open question. In particular, the argument that the universal prior is malign relies on an uncomputable limit; polynomial approximations, for example, could still be completely fine. Furthermore, the inner alignment technique proposed here of using transparency tools to detect and prevent any optimization could prevent this problem entirely if successful.

6.2 Inner alignment

Since the goal here is not to produce a model performing optimization (a mesa-optimizer) in the first place, the question of inner alignment becomes rather different in this situation. We still want to avoid a deceptive or otherwise misaligned mesa-optimizer, but we want to do so by avoiding all mesa-optimization rather than by ensuring that any mesa-optimization is aligned. Specifically, the strategy here is to use transparency tools to detect any optimization (as opposed to just predictive modeling) and halt if we see anything of that form. This has the major benefit of sidestepping many of the most pernicious inner alignment concerns, but it also comes with plenty of its own problems. Whether this will work is likely to be dependent on a couple of different factors.

First, it is an open question whether there exists a sensible way in which optimization and modeling can be distinguished from each other.
It seems quite likely that any world model which admits queries of the form “what is the probability that a is the optimal action according to utility u” or “what is the best action for agent A to take in state x” will implicitly have to be running some sort of optimization algorithm. Thus, any notion of a “pure world model” would likely have to either rule out such queries, ensure they are being estimated without the use of optimization, or otherwise find some way to ensure that they are safe. For a more thorough treatment of the difficulty here, see Abram Demski’s “Partial Agency.”[27]

Second, assuming it is possible to distinguish between modeling and optimization in theory, it also needs to be possible in practice given whatever set of transparency tools are available. If the model can start performing optimization and potentially become deceptive before its optimization can be detected, that would obviously pose a major problem to this sort of approach.

6.3 Training competitiveness

Microscope AI relies on being able to competitively train predictive models with useful internal representations. Thus, the question of training competitiveness for microscope AI comes down to the question of whether the process of training a predictor is a competitive way to extract useful information from a dataset. Given the success of current predictive models such as GPT-2,[28] it seems likely that it will be competitive to train powerful predictors, though it remains to be seen how competitive the process of extracting what those predictors have learned will be. How competitive that extraction process is seems likely to depend heavily on what sort of state-of-the-art transparency tools are available.

6.4 Performance competitiveness

Performance competitiveness is perhaps the biggest question mark regarding microscope AI, as microscope AI forgoes producing AI agents which directly take actions in the world.
The question of performance competitiveness for microscope AI is thus the question of whether enhanced human understanding alone, without AI agents, is sufficient for the economic use cases where one might otherwise want highly agentic advanced AI (e.g., AGI). This question is likely to depend heavily on what exactly those use cases are. As with amplification, if you need very fine motor control, microscope AI is unlikely to get you there. Furthermore, unlike amplification, microscope AI would not be helpful for low-level decision-making (in amounts where it’s too expensive to hire a human). However, microscope AI could potentially give humans the knowledge to safely build other systems which could solve such tasks. Furthermore, if the primary use case for AGI is just high-level, big-picture decision-making (automating CEOs or doing AI research, for example), then it seems likely that microscope AI would have a good chance of being able to address those use cases. In that sort of a situation, where you’re only trying to make a small number of high-quality decisions, it seems likely to be fairly cheap to have a human in the loop, and thus simply improving that human’s knowledge and understanding via microscope AI might be sufficient to produce competitive decision-making. This is especially true if there is a market premium on having a human making the decisions, perhaps because that makes it easier to negotiate or work with other humans.

7 STEM AI

STEM AI is a very simple proposal in a similar vein to microscope AI. Whereas the goal of microscope AI is to avoid the potential problems inherent in building agents, the goal of STEM AI is to avoid the potential problems inherent in modeling humans. Specifically, the idea of STEM AI is to train a model purely on abstract science, engineering, and/or mathematics problems while using transparency tools to ensure that the model isn’t thinking about anything outside of its sandbox.
This approach has the potential to produce a powerful AI system (in terms of its ability to solve STEM problems) without relying on any human modeling. Not modeling humans could have major benefits such as ensuring that the resulting model doesn’t have the ability to trick us to nearly the same extent as if it possessed complex models of human behavior. For a more thorough treatment of why avoiding human modeling could be quite valuable, see Ramana Kumar and Scott Garrabrant’s “Thoughts on Human Models.”[29]

7.1 Outer alignment

Similarly to microscope AI, it seems likely that, in the limit, the best STEM AIs would be malign in terms of having convergent instrumental goals which cause them to be at odds with humans. Thus, STEM AI is likely not outer aligned. However, if the inner alignment techniques being used are successful at preventing such malign optimization from occurring in practice (which the absence of human modeling could make significantly easier), then STEM AI might still be aligned overall.

7.2 Inner alignment

The hope with STEM AI is that by preventing the model from ever considering anything outside its STEM sandbox, the malign limiting behavior that might cause it to fail to be outer aligned can be avoided. Unfortunately, such a sandboxing condition alone isn’t quite sufficient, as a model considering only things in its sandbox could still end up creating other models which would consider things outside of the sandbox.¹³ Thus, it is somewhat unclear what the correct thing to do is in terms of inner alignment for a STEM AI. In my opinion, there are basically two options here: either do something similar to microscope AI and try to prevent all mesa-optimization, or do something similar to amplification and ensure that all mesa-optimization that occurs is fully myopic.
In either case, the hope would be that the absence of human modeling makes it easier to enforce the desired condition (because modeling an agent such as a human increases the propensity for the model to become agentic itself, for example).

7.3 Training competitiveness

Training competitiveness for STEM AI is likely to depend heavily on how hard it is for state-of-the-art machine learning algorithms to solve STEM problems compared to other domains such as language or robotics. Though there has already been significant progress in the field of applying current machine learning techniques to STEM problems such as theorem proving or protein folding,¹⁴ it remains to be seen to what extent the competitiveness of these techniques will scale, and particularly how well they will scale in terms of solving difficult problems relative to other domains such as language modeling.

7.4 Performance competitiveness

Similarly to microscope AI, performance competitiveness is perhaps one of the biggest sticking points with regard to STEM AI, as being confined solely to STEM problems has the potential to massively limit the applicability of an advanced AI system. That said, many purely STEM problems (e.g., protein folding or nanotechnology development) have the potential to produce huge economic boons that could easily surpass those from any other form of advanced AI, and have the potential to solve major societal problems (e.g., curing major illnesses). Thus, if the reason that you want to build advanced AI in the first place is to get such benefits, then STEM AI might be a perfectly acceptable substitute from a performance competitiveness standpoint.
Furthermore, such boons could lead to a decisive strategic advantage that could enable heavy investment in aligning other forms of advanced AI which are more performance competitive. However, if one of the major use cases for your first advanced AI is helping to align your second advanced AI, STEM AI seems to perform quite poorly on that metric, as it advances our technology without also advancing our understanding of alignment. In particular, unlike every other approach on this list, STEM AI can’t be used to do alignment work, as its alignment guarantees are explicitly coming from it not modeling or thinking about humans in any way, including aligning AIs with them. Thus, STEM AI could potentially create a vulnerable world situation[33] where the powerful technology produced using the STEM AI makes it much easier to build advanced AI systems, without also making it more likely that they will be aligned. This problem could potentially be mitigated if the STEM AI were heavily focused on applications that could potentially assist with alignment (such as whole-brain emulation),[34] though to what extent that would actually be possible or actually help with alignment is quite unclear.

¹³ The problem of “forwarding” safety guarantees regarding a model performing meta-learning to any other models found by that meta-learning procedure is a general problem that occurs in all inner alignment schemes, though it is particularly pernicious in this situation. For a more detailed discussion of this problem, see the “Meta-learning” section in “Relaxed adversarial training for inner alignment.”[20]

¹⁴ See [30], [31], [32].

8 Narrow reward modeling + transparency tools

The approach here is as follows:

1. For some particular problem, domain, or environment, jointly train a reward model using human feedback, and an agent that pursues that reward.
Specifically, we can use an approach like that described in DeepMind Safety’s “Scalable agent alignment via reward modeling” where human feedback on agent trajectories can be used to refine the reward model, as seen below.[4]

Figure 11: DeepMind Safety’s diagram of reward learning as seen in “Scalable agent alignment via reward modeling.”

2. Since we want to keep our agent confined to only the specific distribution on which it was trained (thus helping prevent possible perverse generalizations), we can also incorporate out-of-distribution detection[35] to train the agent to halt if it encounters a sufficiently out-of-distribution state.

3. We can also have the human provide additional feedback on the reward model’s internal behavior via the use of transparency tools.

Figure 12: Reward modeling plus oversight of the reward model by the user.

And we can do the same thing with the agent, giving us a similar sort of mulligan as with reinforcement learning + transparency tools.

Figure 13: Reward modeling plus oversight of both the reward model and the agent by the user.

8.1 Outer alignment

The question of outer alignment here is going to depend on whether the various feedback mechanisms are sufficient to ensure that, in the limit, the learned reward model is correct over the domain where the action model believes that it is on distribution. This is a highly complex question that depends on quite a few factors, including how good the human’s trajectory and transparency feedback is, how hard it is to train an on/off distribution model, whether the reward model will itself develop optimization in the limit, the extent to which it’s possible for the agent to trick the user into thinking it’s doing a good job when it’s not, etc. Because outer alignment here is dependent on the limiting behavior of such a complex, dynamic process, it’s very hard to determine one way or the other how outer aligned such a proposal is likely to be.
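As a rough illustration of the procedure this section evaluates, the sketch below pairs a reward model refined from human feedback with an agent that halts on out-of-distribution states. Everything here is a simplifying assumption: the reward model is a tabular running average rather than a learned network, the halting rule is hard-coded rather than trained into the agent, `ood_score` and `threshold` are hypothetical, and the transparency-based oversight of step 3 is omitted.

```python
from collections import defaultdict
from typing import Callable, Hashable, Sequence

HALT = "halt"


class TabularRewardModel:
    """Toy reward model: the mean human feedback per (state, action) pair."""

    def __init__(self) -> None:
        self._total = defaultdict(float)
        self._count = defaultdict(int)

    def update(self, state: Hashable, action: Hashable, feedback: float) -> None:
        """Refine the reward estimate with one piece of human feedback."""
        key = (state, action)
        self._total[key] += feedback
        self._count[key] += 1

    def predict(self, state: Hashable, action: Hashable) -> float:
        key = (state, action)
        count = self._count.get(key, 0)
        # Unseen pairs default to a neutral reward of 0.0 (an assumption).
        return self._total[key] / count if count else 0.0


def act(state: Hashable,
        actions: Sequence[Hashable],
        reward_model: TabularRewardModel,
        ood_score: Callable[[Hashable], float],
        threshold: float) -> Hashable:
    """Halt on sufficiently out-of-distribution states, else act greedily."""
    if ood_score(state) > threshold:
        return HALT
    return max(actions, key=lambda a: reward_model.predict(state, a))


reward_model = TabularRewardModel()
reward_model.update("kitchen", "clean", 1.0)
reward_model.update("kitchen", "break_vase", -1.0)

# Hypothetical detector: only the training state counts as in-distribution.
ood = lambda s: 0.0 if s == "kitchen" else 1.0

in_dist_action = act("kitchen", ["clean", "break_vase"], reward_model, ood, 0.5)
out_dist_action = act("street", ["clean", "break_vase"], reward_model, ood, 0.5)
```

Note that the agent's behavior is only as good as the reward model on the states the detector accepts, which is the dependence the outer alignment discussion highlights.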
8.2 Inner alignment

The use of transparency tools in this approach is actually quite similar to the various relaxed adversarial training approaches, since the feedback mechanism here allows the oversight signal coming from the human to be incorporated directly into training. That being said, unlike in relaxed adversarial training (and similarly to the “mulligan” approach in RL + transparency tools), the human doing the oversight here gets no help from the model. Thus, inner alignment for narrow reward modeling + transparency tools comes down to the question of whether a human using transparency tools to perform oversight can prevent the development of deceptive or otherwise catastrophic behavior in either the reward model or the agent. This could be done, for example, by verifying that the reward model isn’t performing any optimization (as in microscope AI) and verifying that the agent is myopic (as was previously suggested could work for amplification).

Such an approach could potentially require that the reward be myopic, however, which might be a fairly serious limitation on the applicability of narrow reward modeling if it was restricted to only those domains with myopic rewards. While it is not necessarily the case that you can’t do myopia verification with a non-myopic reward (you just need to verify that the model treats each episode myopically rather than each step), per-episode myopia verification could be significantly more difficult than per-step myopia verification, as it is potentially a much less natural behavior.

8.3 Training competitiveness

The question of training competitiveness here is going to depend on the extent to which reward modeling is a good (compute-efficient, sample-efficient, etc.) way of training AI systems to solve poorly-defined tasks.
There are many different approaches to solving this basic problem, including not just reward learning approaches but also imitation learning and various different inverse reinforcement learning schemes,¹⁵ and specifically what approach ends up coming out on top still seems quite uncertain. That being said, reward modeling has the major competitiveness advantage of only requiring and depending on human feedback, not human demonstration; and feedback could be significantly more reliable and easier to elicit. Furthermore, other reward learning schemes such as inverse reinforcement learning can be incorporated into reward modeling by using them to produce a better initial reward model that can then be refined via reward modeling’s feedback mechanism.

¹⁵ See for example [36] and [37].

8.4 Performance competitiveness

As with microscope AI or STEM AI, a potentially major concern with the narrow reward modeling + transparency tools approach is the “narrow” part. While narrowness has potential alignment advantages in terms of reducing reliance on potentially shaky or even malign generalization, it also has the major disadvantage of restricting the approach’s usefulness to only producing relatively narrow advanced AI systems. Thus, the performance competitiveness of narrow reward modeling + transparency tools is likely to depend heavily on the extent to which truly general advanced AI systems are actually practically feasible and economically necessary. For a more detailed analysis of this question, see Eric Drexler’s “Reframing Superintelligence.”[38]

9 Recursive reward modeling + relaxed adversarial training

Recursive reward modeling, as the name implies, is a sort of recursive, non-narrow version of narrow reward modeling.[4] What this results in is effectively a form of amplification where the distillation step (which was previously imitation or approval-maximization) becomes reward modeling.
Specifically, the basic approach here is to train a model M to maximize the reward obtained by performing reward learning on Amp(M).

Figure 14: Two different, equivalent diagrams of recursive reward modeling. The top diagram is taken from “Scalable agent alignment via reward modeling” and the bottom diagram is the equivalent amplification-style diagram, with the purple arrows indicating the use of the full reward modeling process.

In this graphic, the images on the top and bottom are meant to represent the same process. Specifically, if you take the purple arrow in the bottom image to represent reward modeling, and assume that the agents in the top image are all the same agent just at different time steps,¹⁶ then you get precisely the same procedure represented in both images. Thus, the agent in the largest circle in the top picture is M₂ in the bottom picture, the user providing feedback for that agent is Amp(M₁), the agent in the medium-sized circle is M₁, the user in the medium-sized circle is Amp(M₀), and the agent in the smallest circle is M₀.

Furthermore, similarly to the other amplification approaches, we can then add relaxed adversarial training by training M and the reward model to minimize the probability of deceptive or otherwise catastrophic behavior as assessed by Amp(M).¹⁷ It is worth noting that recursive reward modeling is also amenable to intermittent oversight (instead of relaxed adversarial training) if desired.

¹⁶ “Scalable agent alignment via reward modeling: a research direction”[4] notes that, while they initially assume that each agent is completely separate, “While this kind of sequential training is conceptually clearer, in practice it might make more sense to train all of these agents jointly to ensure that they are being trained on the right distribution. Moreover, all of these agents may share model parameters or even be copies of the same agent instantiated as different players in an adversarial game.” Thus, while the different agents are presented here as different instances of the same model (which is a type of recursive reward modeling), it is worth noting that recursive reward modeling also includes other possibilities, such as using completely different models for the different agents.

¹⁷ “Scalable agent alignment via reward modeling: a research direction”[4] mentions the possibility of such oversight, but does not include it as part of the base proposal as is done here, noting that, “When using recursive reward modeling users have the option to provide feedback on the cognitive process that produced outcomes, but they are not required to do so. Moreover, this feedback might be difficult to provide in practice if the policy model is not very interpretable.”

Figure 15: Recursive reward modeling plus relaxed adversarial training, where the red arrows indicate the oversight loss.

9.1 Outer alignment

The first step in understanding the outer alignment properties of recursive reward modeling is figuring out what exactly the theoretical limit of the training process is. While we don’t know exactly what the limiting behavior of an individual instance of reward modeling is (as was noted previously for narrow reward modeling), we can express the limit of the recursive version in terms of many individual reward modeling limits. Specifically, we can unroll the recursive reward modeling process out over time to produce a limiting tree similar to the one given for approval-based amplification, except with approval-maximization replaced with reward modeling.

Figure 16: The recursive reward modeling limiting tree, where white arrows indicate information transfer, purple arrows indicate reward modeling, H are humans, and M are models. The purple arrows in this diagram represent perfect reward modeling, where M is the model trained to maximize the reward obtained by performing reward modeling on H consulting M′.

Now, whether this tree is aligned or not is a very open question. As in the case of approval-based amplification (but unlike imitative amplification), it is difficult to form a clear model of what exactly this tree would do, as it involves not only humans but also models that are the limit of many individual instances of reward modeling, limits which could potentially involve deceptive or otherwise malign optimization.

9.2 Inner alignment

The question of inner alignment here is mostly going to fall on the efficacy of the relaxed adversarial training. Such efficacy could be quite different from the efficacy of other amplification approaches, however, as both the model helping the human perform oversight and the model being overseen are trained via a very different process in recursive reward modeling. In particular, if the reward model is non-myopic, recursive reward modeling could rule out the possibility of using per-step myopia verification (as was suggested for the other amplification approaches), though per-episode myopia verification could still be possible, as with narrow reward modeling. If per-episode myopia verification is not tenable, however, then an alternative condition that rules out deception while being possible to verify for agents trained via recursive reward modeling might need to be found.
Furthermore, if reward modeling has a greater tendency to produce deception than imitation learning, oversight could be significantly harder with recursive reward modeling than with imitative amplification even if such a condition is found. Alternatively, if recursive reward modeling helps produce models that are more capable of assisting with oversight (because reward modeling is more capable of training models to effectively apply transparency tools than imitation learning is, for example), then relaxed adversarial training could work better with recursive reward modeling.

9.3 Training competitiveness

The training competitiveness of recursive reward modeling depends not just on the effectiveness of reward modeling as an efficient way of training a model to solve a single poorly-defined task (as in narrow reward modeling) but on the effectiveness of reward modeling in training a general model which can solve an entire collection of poorly-defined tasks. That being said, many of the nice training competitiveness properties of reward learning continue to apply even in the recursive setting. For example, unlike imitative amplification (but like approval-based amplification), recursive reward modeling relies only on human feedback, not on human demonstrations. Furthermore, compared to approval-based amplification, recursive reward modeling is non-myopic, which could allow it to solve credit assignment problems that might be difficult for approval-based amplification.

9.4 Performance competitiveness

Performance competitiveness for recursive reward modeling will depend on the competitiveness of its aforementioned limiting tree. Comparing this approach to HCH: the recursive reward modeling tree can consider ideas that no human would ever produce, potentially increasing competitiveness. And compared to the approval-maximization tree, the recursive reward modeling tree can learn to execute long-term strategies that short-term approval maximization wouldn’t incentivize.
At the same time, both of these facets of recursive reward modeling have the potential for danger from an alignment perspective. Furthermore, if the different models in the recursive reward modeling tree each assign some different value to the final output (which could happen if the models are not per-episode myopic), they could try to jockey for control of the tree in such a way that hurts not only alignment but also competitiveness.

10 AI safety via debate with transparency tools

There are many different forms of AI safety via debate,[3] but the approach we’ll be considering here is as follows:

1. Train a model (“Alice”) to win debates against a copy of itself (“Bob”) in front of a human judge.

Figure 17: The basic debate setup, where Alice is the first debater and Bob is the second debater. Blue lines indicate possible arguments that Alice can make and red lines indicate possible arguments that Bob can make.

2. Give the models the ability to cross-examine previous versions of the other model (e.g., Bob₂ can ask Alice₁ questions for which the answers are visible to the human judge). See Beth Barnes and Paul Christiano’s “Writeup: Progress on AI Safety via Debate” for a thorough explanation of the cross-examination procedure and its benefits.[39]

Figure 18: Debate with cross-examination, where the orange arrow is an example of cross-examination where Bob (specifically Bob₂, the second Bob) cross-examines a copy of Alice (Alice₁) from two moves previously in the debate.

3. Give the models access to transparency tools and incentivize them to win the debate by trying to use these tools to identify deceptive or otherwise catastrophic behavior in the other debater (e.g., Alice₂ can inspect Bob₁’s internals and win the debate by presenting evidence of problematic cognition). Combined with cross-examination, debaters should be able to inspect versions of the other debater from arbitrary prior points in the debate.
Figure 19: Debate with cross-examination and transparency tools, where the red arrow is an example of Alice using transparency tools on Bob₁.

10.1 Outer alignment

Outer alignment for debate depends on whether giving honest, cooperative, helpful advice is the strategy with the greatest probability of winning. Since the first debater gets to choose to defend whatever position they want, in the limit they’ll choose to defend whatever position gives them the greatest probability of winning the debate. Thus, if honest, cooperative, helpful strategies dominate, then the first player will always pursue them in the limit, meaning that aligned answers can be produced at test/deployment time simply by eliciting the first debater’s position. Whether honesty and helpfulness actually dominate in the limit is a very open question, however, and one that depends heavily on many factors: the set of questions being asked, the effectiveness of the human judges, the length of the debate, etc. For example, if there is significant variation in how difficult different positions are to defend that is independent of their truthfulness (perhaps because arguments with more moving pieces are harder to defend regardless of how true they are), then it seems likely that there would be instances where the truth would not dominate. That being said, careful mechanism design such as the use of cross-examination could potentially help mitigate such issues.

10.2 Inner alignment

The question of inner alignment for debate with transparency tools centers on the effectiveness of the debaters in using the transparency tools to oversee each other. The goal is to directly incentivize the debaters to use the available tools to look for deceptive or otherwise catastrophic behavior in each other, and present that information to the judge to win the debate. Whether such an approach is actually likely to succeed is dependent on a couple of different factors, however.
First, it’s unclear what sort of condition could be enforced in a debate setup that would successfully rule out deceptive alignment while still allowing for the necessary deception of the second debater, who is forced to defend the second-most probable option, and thus make arguments that they don’t actually believe in. Per-episode myopia could potentially present such a condition, though it might be a difficult condition to enforce.

Second, getting a good training signal for incentivizing such transparency tool use could be quite difficult. The debaters need a way of learning how to inspect other models, a way of learning that doing so is a way to win the debate, and a way of learning the specific bad behavior that they should be looking for. This gets especially tricky if the condition being enforced is just the absence of deception, as getting training examples of catching deception would seem to require producing deceptive models, which is the very thing you’re trying to avoid doing in the first place.

10.3 Training competitiveness

Training competitiveness for debate depends on the extent to which the debate game is a compute-efficient, sample-efficient way of training an advanced AI system. Since the debate training process is fundamentally reinforcement learning via self-play on a zero-sum game, the historical success of such methods on environments such as Go,[40] Dota 2,[41] or StarCraft[42] suggests that such self-play might be fairly competitive as a general training procedure. But unlike these cases, debate requires a human in the loop, which significantly reduces the amount of training data that can be collected and therefore demands significantly better sample efficiency to train successfully. Such sample efficiency concerns could potentially be resolved via training a supervised model to predict human judging behavior and then training against that, though this could raise many of its own additional safety and/or competitiveness concerns.
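The idea of training a supervised model to predict human judging behavior can be illustrated with a toy sketch. Everything below is an assumption made for illustration: each debate is reduced to two made-up numeric features, the "human verdicts" are synthetic, and plain logistic regression stands in for whatever judge model would actually be used.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predicted_win_probability(model, features):
    """Surrogate judge: estimated probability that the human would pick Alice."""
    weights, bias = model
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train_judge_model(examples, labels, lr=0.5, epochs=500):
    """Fit logistic regression to recorded verdicts by batch gradient descent."""
    weights, bias = [0.0] * len(examples[0]), 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(examples, labels):
            error = predicted_win_probability((weights, bias), x) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Synthetic stand-in for a small set of human-judged debates (purely illustrative):
# each example is [alice_score, bob_score]; the "human" picks the higher score.
rng = random.Random(0)
examples = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
labels = [1 if alice_score > bob_score else 0 for alice_score, bob_score in examples]

model = train_judge_model(examples, labels)
accuracy = sum(
    (predicted_win_probability(model, x) > 0.5) == bool(y)
    for x, y in zip(examples, labels)
) / len(examples)

# During self-play, this estimate would stand in for a fresh human judgment.
reward_for_alice = predicted_win_probability(model, [0.8, -0.3])
```

The sketch also makes the concern in the text concrete: the debaters are now optimized against `predicted_win_probability` rather than against the human, so any systematic error in the surrogate becomes something the debaters can learn to exploit.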
10.4 Performance competitiveness

It has been proven that debate with cross-examination has the property that arbitrarily strong debaters can convince a polynomial-time judge of the answer to any problem in NEXP.[39] Thus, there is a strong theoretical argument for the performance competitiveness of debate: in the limit, debate has the ability to solve basically any interesting problem that can be solved. While it can be shown that approaches such as approval-based amplification and recursive reward modeling can similarly access NP, the ability of debate to go significantly beyond NP all the way to NEXP suggests that debate might be significantly more performance competitive than those approaches. Of course, such theoretical analysis doesn’t necessarily apply in practice: in reality, even in the limit of training, no model can ever actually be arbitrarily strong, so the practical difference between accessing NP and accessing NEXP might be minimal.

11 Amplification with auxiliary RL objective + relaxed adversarial training

The approach here is to train a model M on a joint objective composed of three different pieces. First, we’ll train M to imitate Amp(M), that is, the basic imitative amplification loss.

Figure 20: The basic imitative amplification setup.

Second, we’ll also train M so as to cause Amp(M) to perform well on some RL objective in some environment. The idea here is to treat M’s answers to H’s questions as actions that yield some reward determined by H’s performance in the environment with access to those answers. Thus, we are training M to maximize a combination of the standard imitative loss and the RL reward obtained by Amp(M).

Figure 21: Imitative amplification with an auxiliary RL objective, where the white arrows indicate Amp(M) acting in the environment and the yellow arrows indicate the RL loss for training M to maximize the reward obtained by Amp(M).
Finally, we can add relaxed adversarial training by training M to decrease the probability of catastrophe as assessed by Amp(M) trying to find adversarial attacks.

Figure 22: Imitative amplification with an auxiliary RL objective plus relaxed adversarial training, where the red arrows indicate the oversight loss.

As a concrete example of this approach, where the RL environment in question is the AI safety via debate game, see “Synthesizing amplification and debate.”[43]

11.1 Outer alignment

An interesting property of this approach is that its outer alignment properties can be controlled by determining how to anneal the various different components of the loss in the limit. Annealing towards the imitative loss yields HCH as the limit (giving this approach the same outer alignment properties as imitative amplification), whereas annealing towards the RL loss yields as the limit whatever the dominant behavior in the given RL environment is. Given that HCH is generally significantly easier to understand than RL limiting behavior, it seems likely that in the majority of cases you will want to anneal towards the imitative loss to get the best outer alignment properties. However, there are other options. You could, for example, use the imitative loss as a constraint such that the model is trained to do well on the RL objective without deviating from HCH by more than some fixed amount. Whether such an approach is outer aligned or not is likely to depend heavily on the specific metric used to measure the distance from HCH, as well as the robustness of HCH to small deviations according to that metric.

11.2 Inner alignment

We are using relaxed adversarial training here, which means that, as with all the other relaxed adversarial training approaches, inner alignment is likely to come down to whether or not the overseer is able to produce a consistent gradient away from deception.
As always, this approach could perform quite differently from other approaches on this count, since the model helping with oversight and the model being overseen are being trained differently. Comparing this approach to imitative amplification: the addition of an RL objective could help significantly if performance in the RL environment is highly correlated with oversight effectiveness. Furthermore, the fact that the model is being trained to maximize the performance of the human with access to the model means that the model is forced to learn how to present its information to a human to get reward, which could make it significantly more transparent and help with oversight. On the other hand, explicitly training the model to produce answers which change a human’s behavior could also encourage the model to trick the human, which could make deception more likely and thus make oversight harder. Additionally, training on an RL objective could make it significantly harder to do myopia verification, as optimizing even partially for a non-myopic RL loss could act as a strong force pushing the model away from myopia.

11.3 Training competitiveness

Comparing this approach to imitative amplification: the addition of an RL loss could significantly improve training competitiveness by producing a much richer reward signal, though at the cost of having to run the RL environment. Furthermore, the addition of the imitative loss could even improve training competitiveness relative to pure RL by providing a less sparse reward. Additionally, incorporating an RL loss enables directly training the model to solve long-term tasks involving difficult credit assignment problems that might be hard for imitative amplification alone to handle (though, as in the case of recursive reward modeling, this trades off with the potential safety benefits of myopia).
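The combination of imitative loss, RL reward, and oversight loss, together with annealing towards the imitative term, can be sketched as a single scalar objective. The linear schedule, the starting weight of 0.5, the fixed oversight coefficient, and all function and parameter names below are assumptions chosen for illustration; the text does not specify how the terms are weighted or annealed.

```python
def imitative_weight(step: int, total_steps: int, start: float = 0.5, end: float = 1.0) -> float:
    """Weight on the imitative loss, annealed linearly from `start` to `end`."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * fraction


def joint_loss(
    imitative_loss: float,
    rl_reward: float,
    catastrophe_prob: float,
    step: int,
    total_steps: int,
    oversight_coeff: float = 1.0,
) -> float:
    """Loss for M: imitate Amp(M), help Amp(M) earn RL reward, avoid catastrophe.

    `catastrophe_prob` stands for the probability of catastrophe as assessed
    by Amp(M) (the relaxed adversarial training term).
    """
    alpha = imitative_weight(step, total_steps)
    return (
        alpha * imitative_loss
        - (1.0 - alpha) * rl_reward          # reward is maximized, so it enters negatively
        + oversight_coeff * catastrophe_prob
    )


early = joint_loss(2.0, 10.0, 0.1, step=0, total_steps=100)    # RL term still active
late = joint_loss(2.0, 10.0, 0.1, step=100, total_steps=100)   # RL term annealed away
```

Under this schedule the RL reward contributes nothing by the final step, so the minimum of the objective is determined by the imitative and oversight terms alone.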
However, it is unclear what happens to this behavior if the RL loss is annealed away. Ideally, if it is possible for HCH to produce the behavior, then we can hope that the model will converge on that, though this requires the optimal RL behavior to be close enough to HCH that first training on the RL loss and then training on the imitative loss actually helps with the imitation task. For that to be the case, success on the RL task likely needs to be highly correlated with good HCH imitation, for which language modeling and human approval maximization tasks (such as the AI safety via debate game, as in “Synthesizing amplification and debate”) could be good candidates.[43]

11.4 Performance competitiveness

Like outer alignment, performance competitiveness is likely to depend heavily on exactly how the different losses are annealed in the limit. As was suggested previously, annealing towards the imitative loss makes the limit HCH, giving this approach the same performance competitiveness properties as imitative amplification. However, again, there are other options, such as using the imitative loss as a constraint such that the model is trained to maximize the RL reward without deviating too far from HCH. Such a constraint could help performance competitiveness compared to training purely on the RL loss (if the resulting model is significantly better at interacting with humans, for example) or hinder performance competitiveness (if it’s very difficult to do well on the RL task while staying close to HCH).

12 Amplification alongside RL + relaxed adversarial training

In this approach, the goal is to simultaneously train two models: a question-answering system M and an RL agent π.

We will train π on the joint objective of maximizing the RL reward and decreasing the probability of deceptive or otherwise catastrophic behavior as assessed by Amp(M).
Note that while we are using a sort of relaxed adversarial training loss here, the overseer is using a totally different model (M) than the model being overseen (π).

Figure 23: The training process for π in amplification alongside RL, where the white arrow indicates π’s actions in the environment, the yellow arrow indicates π’s RL loss, and the red arrows indicate the oversight loss for π.

We will then train M on the following joint objective.

First, we want to include the standard imitative amplification loss and train M to imitate Amp(M).

Figure 24: Amplification alongside RL, where M is trained only via an imitative loss.

Second, we want to train M to cause Amp(M) to predict π’s actions and the RL environment’s responses well. The idea here is to train M to tell the human everything the human needs to know to understand both what π is doing and how to replicate π’s behavior so that the human with access to the model can match π’s performance in the RL environment.

Figure 25: Amplification alongside RL, where the white arrows to Amp(M) indicate information that Amp(M) attempts to predict in the RL environment, and the orange arrows indicate the prediction loss for training M to cause Amp(M) to predict well.

Third, we can add standard relaxed adversarial training and also train M to decrease the probability of deceptive or otherwise catastrophic behavior as assessed by Amp(M).

Figure 26: Amplification alongside RL + relaxed adversarial training, where the red arrows indicate the oversight loss.

Finally, once we have trained both M and π via the above losses, a critical component of this approach is that we then do away with π completely and exclusively deploy the resulting M.

12.1 Outer alignment

Since we only deploy M, not π, the question of outer alignment here should be entirely dependent on whether this procedure produces an outer aligned M.
Furthermore, similarly to the previous approach, we can anneal the combination of losses here to get the desired limiting behavior and thus control our outer alignment properties. In particular, if we anneal towards the imitative loss, this approach can be made to limit to HCH, giving it the same outer alignment properties as imitative amplification. Even if we don’t anneal towards the imitative loss, as long as it is possible for HCH to fully predict both π and the RL environment in the limit and the imitative loss isn’t annealed away completely, at minimal loss we should still end up with HCH.

12.2 Inner alignment

Inner alignment is particularly tricky for this approach. First, there’s the standard relaxed adversarial training question of whether Amp(M) will be able to properly oversee M. Comparing this approach to the previous approach, the fact that the auxiliary reward is purely predictive rather than a full RL loss should help us here, as it means that our loss is entirely myopic and thus we should be able to do myopia verification. That being said, if π is sufficiently non-myopic, then having M predict π while still staying myopic itself could be a challenge.

There’s also the problem of π itself: while we throw away π at the end, there are still likely to be opportunities for a sufficiently competent and deceptive π to cause serious issues. For example, π could choose its actions to influence M’s training in a particular way, or just break out of its sandbox and directly take bad actions in the world. Thus, we need the oversight of π by Amp(M) to at least work sufficiently well to prevent such issues. For that to happen, it needs to be the case that Amp(M) with access to transparency tools can predict whether (and when) π will ever produce such catastrophic behavior.
Thus, the fact that we are explicitly training M to cause Amp(M) to be able to predict π could be quite helpful here, as this should hopefully make Amp(M) good enough at predicting π to foresee π becoming deceptive or producing any such bad behavior.

12.3 Training competitiveness

Compared to other amplification approaches, this approach could be significantly more training competitive, as it enables us to just train an RL model with minimal restrictions. However, it also requires us to jointly train both an RL model and an amplification model, which could make it significantly less competitive, especially if it is difficult to successfully train the amplification model to predict the RL model. Such training can be viewed as effectively just model distillation, which is generally fairly doable even for current models. It should also be noted, however, that the inclusion of the imitative loss will change things, especially if you anneal towards the imitative loss in the limit. In such a situation, the question becomes how useful the auxiliary RL prediction task is in helping with the core imitation task, which, similarly to the previous approach, could be very dependent on the specific RL environment used and how related performance on that RL environment is to the imitation task.

12.4 Performance competitiveness

As with outer alignment, annealing towards the imitative loss should produce HCH as the limit, giving this approach the same performance competitiveness properties as imitative amplification. Furthermore, again as with outer alignment, even if the imitative loss isn’t annealed to completely, as long as HCH can predict π in the limit, you should still get HCH at minimal loss.

References

[1] Evan Hubinger. An overview of 11 proposals for building safe advanced AI, 2020. URL https://www.alignmentforum.org/posts/fRsjBseRuvRhMPPE5/an-overview-of-11-proposals-for-building-safe-advanced-ai.
[2] Paul Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts. arXiv, 2018. URL https://arxiv.org/abs/1810.08575.
[3] Geoffrey Irving, Paul Christiano, and Dario Amodei. AI safety via debate. arXiv, 2018. URL https://arxiv.org/abs/1805.00899.
[4] Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv, 2018. URL https://arxiv.org/abs/1811.07871.
[5] Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from Learned Optimization in Advanced Machine Learning Systems. arXiv, 2019. URL https://arxiv.org/abs/1906.01820.
[6] Evan Hubinger. AI safety via market making, 2020. URL https://www.alignmentforum.org/posts/YWwzccGbcHMJMpT45/ai-safety-via-market-making.
[7] Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014. URL https://global.oup.com/academic/product/superintelligence-9780199678112?cc=us&lang=en&.
[8] Evan Hubinger. Outer alignment and imitative amplification, 2020. URL https://www.alignmentforum.org/posts/33EKjmAdKFn3pbKPJ/outer-alignment-and-imitative-amplification.
[9] Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong. Corrigibility. AAAI 2015, 2015. URL https://intelligence.org/files/Corrigibility.pdf.
[10] Evan Hubinger. Towards a mechanistic understanding of corrigibility, 2019. URL https://www.alignmentforum.org/posts/BKM8uQS6QdJPZLqCr/towards-a-mechanistic-understanding-of-corrigibility.
[11] Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent Tool Use From Multi-Agent Autocurricula. arXiv, 2019. URL https://arxiv.org/abs/1909.07528.
[12] Richard Ngo. Multi-agent safety, 2020. URL https://www.alignmentforum.org/posts/BXMCgpktdiawT3K5v/multi-agent-safety.
[13] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Thread: Circuits. Distill, 2020. URL https://distill.pub/2020/circuits/.
[14] Paul Christiano. Learning with catastrophes, 2016. URL https://ai-alignment.com/learning-with-catastrophes-59387b55c30.
[15] Evan Hubinger. Chris Olah’s views on AGI safety, 2019. URL https://www.alignmentforum.org/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety.
[16] Evan Hubinger. A Concrete Proposal for Adversarial IDA, 2019. URL https://www.alignmentforum.org/posts/jYvm4mmjvGHcPXtGL/a-concrete-proposal-for-adversarial-ida.
[17] Paul Christiano. Strong HCH, 2016. URL https://ai-alignment.com/strong-hch-bedb0dc08d4e.
[18] Paul Christiano. Universality and consequentialism within HCH, 2019. URL https://ai-alignment.com/universality-and-consequentialism-within-hch-c0bee00365bd.
[19] Paul Christiano. Efficient feedback, 2015. URL https://ai-alignment.com/efficient-feedback-a347748b1557.
[20] Evan Hubinger. Relaxed adversarial training for inner alignment, 2019. URL https://www.alignmentforum.org/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment.
[21] Evan Hubinger. Gradient hacking, 2019. URL https://www.alignmentforum.org/posts/uXH4r6MmKPedk8rMA/gradient-hacking.
[22] Garrett Warnell, Nicholas Waytowich, Vernon Lawhern, and Peter Stone. Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces. arXiv, 2017. URL https://arxiv.org/abs/1709.10163.
[23] Dilip Arumugam, Jun Ki Lee, Sophie Saskin, and Michael L. Littman. Deep Reinforcement Learning from Policy-Dependent Human Feedback. arXiv, 2019. URL https://arxiv.org/abs/1902.04257.
[24] Paul Christiano. Approval-directed agents, 2014. URL https://ai-alignment.com/model-free-decisions-6e6609f5d99e.
[25] Chris Olah. Visualizing Representations: Deep Learning and Human Beings, 2015. URL https://colah.github.io/posts/2015-01-Visualizing-Representations/.
[26] Paul Christiano. What does the universal prior actually look like?, 2016. URL https://ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like.
[27] Abram Demski. Partial Agency, 2019. URL https://www.alignmentforum.org/s/HeYtBkNbEe7wpjc6X.
[28] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. OpenAI, 2019. URL https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
[29] Ramana Kumar and Scott Garrabrant. Thoughts on human models. MIRI, 2019. URL https://intelligence.org/2019/02/22/thoughts-on-human-models.
[30] Mitsuru Kusumoto, Keisuke Yahata, and Masahiro Sakai. Automated Theorem Proving in Intuitionistic Propositional Logic by Deep Reinforcement Learning. arXiv, 2018. URL https://arxiv.org/abs/1811.00796.
[31] Kshitij Bansal, Sarah M. Loos, Markus N. Rabe, Christian Szegedy, and Stewart Wilcox. HOList: An Environment for Machine Learning of Higher-Order Theorem Proving. arXiv, 2019. URL https://arxiv.org/abs/1904.03241.
[32] Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W. R. Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David T. Jones, David Silver, Koray Kavukcuoglu, and Demis Hassabis. Improved protein structure prediction using potentials from deep learning. Nature, 2020. URL https://www.nature.com/articles/s41586-019-1923-7.epdf.
[33] Nick Bostrom. The Vulnerable World Hypothesis. Global Policy, 2019. URL https://nickbostrom.com/papers/vulnerable.pdf.
[34] Anders Sandberg and Nick Bostrom. Whole Brain Emulation: A Roadmap. FHI, 2008. URL https://www.fhi.ox.ac.uk/brain-emulation-roadmap-report.pdf.
[35] Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark A. DePristo, Joshua V. Dillon, and Balaji Lakshminarayanan. Likelihood Ratios for Out-of-Distribution Detection. arXiv, 2019. URL https://arxiv.org/abs/1906.02845.
[36] Jonathan Ho and Stefano Ermon. Generative Adversarial Imitation Learning. arXiv, 2016. URL https://arxiv.org/abs/1606.03476.
[37] Justin Fu, Katie Luo, and Sergey Levine. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning. arXiv, 2017. URL https://arxiv.org/abs/1710.11248.
[38] K. Eric Drexler. Reframing Superintelligence: Comprehensive AI Services as General Intelligence. FHI, 2019. URL https://www.fhi.ox.ac.uk/wp-content/uploads/Reframing_Superintelligence_FHI-TR-2019-1.1-1.pdf.
[39] Beth Barnes and Paul Christiano. Writeup: Progress on AI Safety via Debate, 2020. URL https://www.alignmentforum.org/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1.
[40] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 2018. URL https://science.sciencemag.org/content/362/6419/1140.full?ijkey=XGd77kI6W4rSc&keytype=ref&siteid=sci.
[41] Filip Wolski, Szymon Sidor, Michael Petrov, David Farhi, Jonathan Raiman, Susan Zhang, Greg Brockman, Christy Dennison, Jie Tang, Henrique Pondé, Brooke Chan, Jakub Pachocki, and Przemysław Dębiak. OpenAI Five, 2018. URL https://openai.com/blog/openai-five/.
[42] The AlphaStar team. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II, 2019. URL https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii.
[43] Evan Hubinger. Synthesizing amplification and debate, 2020. URL https://www.alignmentforum.org/posts/dJSD5RK6Qoidb3QY5/synthesizing-amplification-and-debate.