What is "wireheading"?

Wireheading happens when an agent’s reward mechanism is stimulated directly, instead of assigning high rewards because the agent’s goals are being achieved in the world.1

One way of visualizing this is if we imagine a counter in the agent’s head which determines how satisfied it is with its situation. We say that the agent is wireheaded if something reaches inside its head to directly increase the counter. Now it very satisfied with its situation, even though its “goals” have not been achieved.

The term comes from experiments in which rats with electrodes implanted into their brains could activate their pleasure centers at the press of a button. Some of the rats repeatedly pressed the pleasure button until they died of hunger or thirst.

AI safety researchers sometimes worry that an AI may wirehead itself by accessing its reward function directly and setting its reward to its maximum value.2 This could be benign if it caused the AI to simply stop doing anything, but it could be problematic if it caused a powerful AI to take actions to ensure that we don’t stop it from wireheading.

There’s another way in which wireheading occasionally comes up. Some thought experiments in which a powerful AI is given a relatively simple goal, like "make humans happy", conclude that this would lead the AI to “wirehead” humans — e.g., by pumping us full of heroin, or by finding other ways to make us feel good while leaving out a lot of what makes life worth living.3

Both of these problems would be categorized as outer misalignment.

Further reading:


  1. Another (overlapping) term for an agent interfering with its reward is reward tampering. ↩︎

  2. Some, including proponents of shard theory, argue that this is unlikely to happen. ↩︎

  3. At least according to most people — some pure hedonists would argue only pleasure and pain matter. ↩︎



AISafety.info

We’re a global team of specialists and volunteers from various backgrounds who want to ensure that the effects of future AI are beneficial rather than catastrophic.