What is the "sharp left turn"?

A sharp left turn (SLT) is a theorized scenario in which an AI’s capabilities generalize further than its alignment. In other words, if an AI went through a sharp left turn, its capabilities would quickly generalize outside the training distribution, but its alignment wouldn’t be able to keep up, resulting in a powerful misaligned model. This scenario depends on three main claims:

  1. “Capabilities will generalize far (i.e., to many domains)”

An advance in capabilities could generalize across many domains at once, possibly during a discrete phase transition rather than gradually.

  2. “Alignment techniques that worked previously will fail during this transition”

The increase in capabilities and generalization would arise from emergent properties that are qualitatively different from the mechanisms the model relied on before. As a result, alignment techniques that worked on earlier versions of the AI would not work on the new version.

  3. “Humans won’t be able to intervene to prevent or align this transition”

The transition would happen too quickly for humans to notice it or to develop new alignment techniques in time.

If these claims are correct, we will end up with a misaligned AI. This could be avoided if we manage to build a goal-aligned AI before the SLT occurs; i.e., an AI that has both beneficial goals and situational awareness. Such an AI would try to preserve its beneficial goals by developing new techniques to align itself as it goes through the SLT, giving rise to an aligned model post-SLT. You can read more about plans and caveats regarding the SLT here.