Why is AI alignment a hard problem?

Researchers differ in their views on how hard AI alignment is and where the difficulty comes from, but according to this talk by Eliezer Yudkowsky, there are three main sources of difficulty:

Alignment is hard like rocket science is hard: the system has to work under extreme circumstances.

> When you put a ton of stress on an algorithm by trying to run it at a smarter-than-human level, things may start to break that don’t break when you are just making your robot stagger across the room.

One thing that makes rocket science hard is the extreme stresses (e.g. high levels of acceleration, constant explosions, the vacuum of space) placed on rockets' mechanical components. Analogously, running a program that rapidly gains capabilities and becomes highly intelligent (both in absolute terms and relative to humans) places “stress” on its design by subjecting it to major new categories of things that can change and go wrong.

Alignment is hard like aiming a space probe is hard: you only get one shot.

> You may have only one shot. If something goes wrong, the system might be too “high” for you to reach up and suddenly fix it. You can build error recovery mechanisms into it; space probes are supposed to accept software updates. If something goes wrong in a way that precludes getting future updates, though, you’re screwed. You have lost the space probe.

If there’s a catastrophic failure with a recursively self-improving AI system, it will competently resist attempts to stop or alter it, so the code will be “out of reach” and we won’t be able to go back and edit it.

Alignment is hard like cryptographic security is hard: you have to plan for the possibility of intelligent adversaries.

> Your [AI] code is not an intelligent adversary if everything goes right. If something goes wrong, it might try to defeat your safeguards.

Cryptographers face attackers who search for flaws that they can exploit. Security is hard, and attackers often succeed in compromising systems. If an unaligned AI is motivated to do things its designers didn't intend, it might pose similar problems. For instance, it might:

  • Find and exploit subtle flaws in any security measures its designers applied.

  • Persuade humans to do things it wants.

  • Deceive humans about what it's doing so that they allow it to continue.

  • Etc.

As with standard cybersecurity, "good under normal circumstances" is not good enough — we would be facing an intelligent adversary that is trying to create abnormal circumstances, and so our security measures need to be exceptionally robust.

Further reading: