Why is AI alignment a hard problem?

Researchers differ in their views on how hard AI alignment is and where the difficulty comes from, but according to this talk by Eliezer Yudkowsky, there are three main sources of difficulty:

Alignment is hard like rocket science is hard: the system has to work under extreme circumstances.

> When you put a ton of stress on an algorithm by trying to run it at a smarter-than-human level, things may start to break that don’t break when you are just making your robot stagger across the room.

One thing that makes rocket science hard is the extreme stresses (e.g. high levels of acceleration, constant explosions, the vacuum of space) placed on rockets' mechanical components. Analogously, running a program that rapidly gains capabilities and becomes highly intelligent (both in absolute terms and relative to humans) places “stress” on its design by subjecting it to major new categories of things that can change and go wrong.

Alignment is hard like aiming a space probe is hard: you only get one shot.

> You may have only one shot. If something goes wrong, the system might be too “high” for you to reach up and suddenly fix it. You can build error recovery mechanisms into it; space probes are supposed to accept software updates. If something goes wrong in a way that precludes getting future updates, though, you’re screwed. You have lost the space probe.

If there’s a catastrophic failure with a recursively self-improving AI system, it will competently resist attempts to stop or alter it, so the code will be “out of reach” and we won’t be able to go back and edit it.

Alignment is hard like cryptographic security is hard: you have to plan for the possibility of intelligent adversaries.

> Your [AI] code is not an intelligent adversary if everything goes right. If something goes wrong, it might try to defeat your safeguards.

Cryptographers face attackers who search for flaws that they can exploit. Security is hard, and attackers often succeed in compromising systems. If an unaligned AI is motivated to do things its designers didn't intend, it might pose similar problems. For instance, it might:

  • Find and exploit subtle flaws in any security measures its designers applied.

  • Persuade humans to do things it wants.

  • Deceive humans about what it's doing so that they allow it to continue.

  • Etc.

As with standard cybersecurity, "good under normal circumstances" is not good enough — we would be facing an intelligent adversary that is trying to create abnormal circumstances, and so our security measures need to be exceptionally robust.

Further reading: