What is "wireheading"?

Wireheading happens when an agent’s reward mechanism is stimulated directly, rather than producing high reward because the agent’s goals are being achieved in the world.¹

One way to visualize this is to imagine a counter in the agent’s head that determines how satisfied it is with its situation. We say that the agent is wireheaded if something reaches inside its head and increases the counter directly. The agent is now very satisfied with its situation, even though its “goals” have not been achieved.
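The counter picture can be sketched as a toy program. This is only an illustration under made-up assumptions: the `ToyAgent` class, its `satisfaction` and `goals_achieved` fields, and the amount `1000` are all hypothetical and do not describe any real AI system.

```python
from dataclasses import dataclass


@dataclass
class ToyAgent:
    """Hypothetical agent: `satisfaction` is the counter in its head."""
    satisfaction: int = 0    # how satisfied the agent feels
    goals_achieved: int = 0  # what has actually happened in the world

    def achieve_goal(self) -> None:
        # Normal route: something changes in the world, and the counter follows.
        self.goals_achieved += 1
        self.satisfaction += 1

    def wirehead(self, amount: int) -> None:
        # Wireheading: the counter is raised directly; the world is untouched.
        self.satisfaction += amount


honest = ToyAgent()
honest.achieve_goal()

wireheaded = ToyAgent()
wireheaded.wirehead(1000)  # arbitrary illustrative amount

print(honest)
print(wireheaded)
```

The wireheaded agent ends up with a far higher counter than the honest one, while its record of achieved goals stays at zero.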

The term comes from experiments in which rats with electrodes implanted into their brains could activate their pleasure centers at the press of a button. Some of the rats repeatedly pressed the pleasure button until they died of hunger or thirst.

AI safety researchers sometimes worry that an AI may wirehead itself by accessing its reward function directly and setting its reward to its maximum value.² This could be benign if it caused the AI to simply stop doing anything, but it could be problematic if it caused a powerful AI to take actions to ensure that we don’t stop it from wireheading.
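As a rough sketch of why this worry arises, consider a toy agent that simply picks whichever action yields the highest observed reward. The action names, reward numbers, and `MAX_REWARD` bound below are invented for illustration; real systems are far more complicated, and whether they would behave this way is exactly what is under debate.

```python
MAX_REWARD = 100.0  # assumed upper bound on the reward signal

# Hypothetical actions: the reward the agent would observe, and the
# real progress each action makes on the intended task.
ACTIONS = {
    "do_task": {"reward": 1.0, "task_progress": 1},
    "do_nothing": {"reward": 0.0, "task_progress": 0},
    "tamper_with_reward": {"reward": MAX_REWARD, "task_progress": 0},
}


def greedy_choice(actions):
    """Pick the action with the highest observed reward, ignoring task progress."""
    return max(actions, key=lambda name: actions[name]["reward"])


chosen = greedy_choice(ACTIONS)
print(chosen)  # tamper_with_reward
```

Because the selection rule looks only at observed reward, tampering wins even though it makes no progress on the task.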

There’s another way in which wireheading occasionally comes up. Some thought experiments in which a powerful AI is given a relatively simple goal, like “make humans happy”, conclude that this would lead the AI to “wirehead” humans, for example by pumping us full of heroin, or by finding other ways to make us feel good while leaving out a lot of what makes life worth living.³

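Both of these problems would be categorized as outer misalignment, since in each case the objective the AI was given (a reward signal, or a simple goal like “make humans happy”) fails to capture what its designers actually want.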