What is the likelihood of deceptive misalignment?

Evan Hubinger thinks that deceptive alignment is the default outcome of machine learning. His analysis of the likelihood of deceptive alignment loosely compares humans trying to align AI with God trying to align humans. God might come across three types of individuals in his alignment quest, and based upon this, Hubinger distinguishes three different types of alignment:

  • Jesus Christ (Internal Alignment): He is considered identical to God, so wants exactly the same things as God, because he has the same values and exactly the same way of thinking about the world.

  • Martin Luther (Corrigible Alignment): Unlike Jesus, he didn’t come prepackaged with God’s beliefs, but instead a desire to really understand what God wants him to do. So he really cares about the Bible, so he studies and understands exactly what it tells him to do independent of what other people (church) think that he should be doing. And he takes action to serve God based on his understanding.

  • Blaise Pascal (Deceptive Alignment): He wants to do his own thing (whatever it is) and doesn't really care about the Bible. But he also doesn't want to go to Hell while doing his own thing. So he studies the Bible and does what it says just to make sure he doesn't go to Hell.

An intuitive example behind why deception is considered the default case is that a ‘Pascal’ can want anything in the world, as long as he just acts according to the Bible. A ‘Luther’ on the other hand has to actually want to act according to the Bible and do only that. It is a much stricter condition that fewer people fulfill. But there is still more than one way of interpreting the Bible, therefore there is still more than one Luther. Lastly there is only one way to get exactly what God wants, i.e., to get a ‘Christ’. This is by creating a copy of God. So there are more Pascals than Luthers, and there are more Luthers than Christs. So the most likely scenario is that we get a Pascal.[1]

We can build upon this intuitive example by exploring the technical details in more depth. In trying to analyze which one of these three cases is most likely, we make some assumptions - first that we are working with a relatively complex base objective such as human values, and second that we have diverse enough training environments and different enough situations to get the models to robustly understand what we humans want them to want, i.e. we are concerned with deceptive models, not with dishonest ones.[2] Lastly, we also assume that all models are roughly made up of three parts: mesa objectives, world models, and an optimization process. The model will always start with a random proxy objective which we are trying to refine into the ‘true’ (base) objective that we want the model to have. The objective of the optimization process (Stochastic Gradient Descent (SGD)) is to improve performance, which can be done in two ways:

  • Better Proxies: The model can improve its understanding of the objective

  • Better World Models: The model can improve its ability to act on its existing understanding of the objective

Since ML models might result in different outcomes even when trained on the same data (path dependence), we explore if we end up with deceptive alignment in both low and high path dependance training environments. This gives us an estimate of the types of models that we expect SGD to produce.

High Path Dependence (HPD) alignment scenarios

There are three scenarios in the high path dependence case. In the HPD internal alignment scenario, the proxy is perfected to match the base objective before the world model is perfected. This means that the model is now successfully trying to optimize for the base objective. In the HPD corrigible alignment scenario, the world model is perfected before the proxy objective matches the base objective. This means that the world model is perfect enough to simulate SGD. Additionally, the best representation of the base objective is the actual base objective that SGD has. So SGD just ‘throws out’ the proxy objective and replaces it with a pointer to its own objective. In the HPD deceptive alignment scenario, the world model is again perfected before the proxy objective matches the base objective. However, this time SGD makes the current proxy a long-term goal (crystallizes it), which means that the inner optimizer is now free to maximize the base objective. In either of the latter two cases, SGD successfully trains an inner optimizer to optimize for the base objective during training.

Low Path Dependence (LPD) alignment scenarios

In a low path dependence world, the model will just try to pick the simplest algorithm out of all possible algorithms that solve the given problem. We focus on simplicity and compute inductive biases to answer the question - What is the simple, unique solution that you end up with every time you run your training process?

The simplicity inductive bias is an analysis of the algorithm complexity. The world models are exactly the same across all the different modes. The main difference between the models is the mesa objective, i.e. the complexity of the objective that we give the model which it has to learn and then optimize. The simplest internally aligned mesa-objective is the base objective. So for internal alignment, the complexity class is the full complexity of the base objective. The simplest corrigibly aligned mesa-objective is a specification for how to point to the base objective. The simplest deceptively aligned mesa-objective is literally any long term objective. So as far as Pascal is concerned the most complexity cost you have to pay is just the simplest long-term objective you can think of. In other words, the deceptively aligned one has the lowest complexity cost out of the three alignment scenarios, as long as we have a relatively complex training objective. All of the other model classes scale with the complexity of the training objective, whereas the complexity of the deceptively aligned model is essentially fixed. Therefore the more complex the base objective (e.g. human values) the more deceptive alignment wins out in the complexity race. Overall, the analysis results in the conclusion that #(Christs) < #(Luthers) < #(Pascals). So since there are more Pascals than the other two, we should expect deceptive alignment to be the default case as analyzed by the model simplicity inductive bias.

The speed inductive bias thankfully falls in favor of corrigibly and internally aligned models. The deceptively aligned model has to perform extra steps of ‘thinking’. It has to realize that caring about the mesa-objective in the long term means that it should optimize the training objective temporarily to get a higher reward on your internal objective eventually. The internal and corrigibly aligned models however don’t have to do this extra thinking and can just directly optimize what they care for (the base objective). So in terms of time taken for each model class to output the correct answer: t(internally aligned) ~= t(corrigibly aligned) < t(deceptively aligned)

This indicates both algorithmic simplicity and complexity inductive biases point towards deceptively aligned models being the most likely outcome to arise in a low path dependence machine learning training process.

Overall, we can see that gradient descent is by default going to take us toward the deceptively aligned model in both the low and the high path dependence cases. If we accept the above argument then we have to come up with techniques to better monitor models during training, as well as, undertake some intervention that changes the current ML training dynamics.


Alternative Phrasing

  • How likely is deception?

  • Might an agent be deceptively aligned?



  1. This example corresponds to the low path dependence simplicity inductive bias scenario. It was used here because it seemed like the easiest/most intuitive way to understand the overall argument through an example. ↩︎

  2. Hubinger writes this as the assumption that we are working at the limit of adversarial training. At the limit of basically refers to the fact that we have done as much adversarial training as we can to iron out as many kinks as we possibly can in the models understanding of the objective. ↩︎