Why can't we just use a friendly AI to stop bad AIs?

People such as Yann LeCun, Chief AI Scientist at Meta, have suggested that misuse of AI[1] could be countered by some form of more powerful friendly AI.

This plan holds some promise, but requires a few things to go right:

  1. The friendly AI must be aligned in the first place, which we currently do not know how to reliably do.

  2. The friendly AI must continually maintain a strong strategic advantage over any unaligned AI. In a multipolar scenario, it is unclear whether offense or defense would be favored.

Dan Hendrycks proposed the idea of an AI Leviathan[2]: a collection of sufficiently aligned AIs that act to stop any misbehaving AI. The emergence of such a Leviathan could constitute a pivotal act. Alternatively, an extremely powerful singleton superintelligence could act to prevent any other AI systems from being created.

Such control by AI, whether by a singleton or a Leviathan, entails a significant concentration of power.


  1. This argument might also apply to misaligned AI. ↩︎

  2. The name is a reference to Thomas Hobbes’ book of the same name. ↩︎