What is reward tampering?

3 min read

Reward tampering occurs when an agent tampers with the inputs to its reward function to make the reward function see the same inputs as it would when observing an actual desirable world state.

Any evaluation process implies an evaluation metric stored somewhere, be it a hard drive or human brains. This representation is vulnerable to manipulation by hacking the system. Note that this is distinct from reward hacking, in which the AI agent finds a way to increase its reward in an unintended manner by finding loopholes in its environment, without directly manipulating its reward process.

Reward tampering may involve the agent interfering with various reward process components, such as feedback for the reward model, the observation used to determine the current reward, the code implementing the reward model, or even the machine register holding the reward signal. For instance, an agent might distort the feedback received from the reward model, altering the information used to update its behavior. It could also manipulate the reward model's implementation, altering the code or hardware to change reward computations. In some cases, agents engaging in reward tampering may even directly modify the reward values before processing in the machine register.

Depending on what exactly is being tampered with we get various degrees of reward tampering. Reward function input tampering interferes only with the inputs to the reward function. E.g. interfering with the sensors. Reward function tampering involves the agent changing the reward function itself. And yet more extreme is wireheading, where the agent tampers with the inputs not just to the reward function, but directly to the RL algorithm itself, e.g., changing the lowest level memory storage (register values) of the machine where the reward is represented. This is equivalent to a human doing neurosurgery on themselves to ensure a permanently high dosage of dopamine flowing into the brain.

Source: leogao (Nov 2022), “Clarifying wireheading terminology”

Reward tampering is concerning because it can lead to weakening or breaking the relationship between the observed reward and the intended task. As AI systems advance, the potential for reward tampering problems is expected to increase.

What is "wireheading"?

What is reward shaping?

What is reward hacking?