What is Goodhart's law?
Goodhart’s law states that when a measure becomes a target, it ceases to be a good measure. This happens all over the place:
- One way to measure the quality of an online article might be by counting how many people click on it. However, if click count determines how much authors are paid or how high articles are ranked in search results, authors will be incentivized to write in a way that maximizes clicks, perhaps by choosing sensational titles. When they do so, click count may stop correlating well with the quality of articles.
- When funding is allocated to school districts based on test scores, teachers are incentivized to "teach to the test," and the tests may stop being good measures of knowledge of the material.[^1]
- IBM used to pay its programmers per line of code produced. This incentivized them to write bloated programs and punished simplicity, ultimately reducing the quality of the programmers’ work.
Scott Garrabrant identifies four forms of Goodhart’s law:
- Regressional Goodhart — When selecting for a measure that is a proxy for your goal, you select not only for your goal but also for the difference between the proxy and your goal. For example, being tall is correlated with being good at basketball, but if you exclusively pick exceptionally tall people to form a team, you end up selecting taller people who are worse players over slightly shorter people who are better players. This problem is unavoidable whenever your data is noisy, so you have to work around it, for example by combining multiple independent proxies (see the simulation after this list).
- Causal Goodhart — When the correlation between the proxy and the goal is not causal, intervening on the proxy may fail to affect the goal. For example, giving basketball players stilts won't make them better players (height is merely correlated with basketball skill), and filling up your rain gauge won't help your crops grow (the gauge reading is a proxy for rainfall, not a cause of it).
- Extremal Goodhart — Situations in which the proxy takes an extreme value may be very different from the ordinary situations in which the correlation between the proxy and the goal was observed. For example, the very tallest people tend to have health problems caused by their extreme height and may therefore be worse basketball players; the correlation between height and skill that holds in the normal range breaks down at the extreme.
- Adversarial Goodhart — When you optimize for a proxy, you give adversaries an incentive to take actions that decorrelate the proxy from your goal so that their performance looks better according to the proxy. For example, if good grades are used as a proxy for ability, this can incentivize cheating, since grades are easier to fake than ability.
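To make the regressional case concrete, here is a minimal Python sketch; the height/skill model, the second "game stats" proxy, and all the numbers are invented for illustration. Skill is a latent variable we care about, each proxy is skill plus independent noise, and a team selected for extreme height has lower true skill than a team selected on skill directly — while averaging two independent proxies recovers part of the gap.

```python
import numpy as np

rng = np.random.default_rng(0)
n, team_size = 100_000, 10

# Latent variable we actually care about, and two noisy proxies for it.
skill = rng.normal(0, 1, n)               # the goal (not directly observable)
height = skill + rng.normal(0, 1, n)      # proxy 1: height
stats = skill + rng.normal(0, 1, n)       # proxy 2: e.g. amateur-game stats
combined = (height + stats) / 2           # average of independent proxies

def mean_skill_of_top(proxy):
    """True mean skill of the team picked by taking the top scorers on `proxy`."""
    team = np.argsort(proxy)[-team_size:]
    return skill[team].mean()

print("selected by height alone:    ", mean_skill_of_top(height))
print("selected by combined proxies:", mean_skill_of_top(combined))
print("selected by true skill:      ", mean_skill_of_top(skill))
# Selecting on a single noisy proxy also selects on its noise, so the
# height-picked team's true skill regresses toward the mean. Averaging
# independent proxies cancels some of the noise and closes part of the gap.
```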
Goodhart’s law is a major problem for AI alignment, that is, for making sure that AI tries to do what we want it to do. We usually train AI systems to optimize a measurable objective that is only a proxy for what we actually want, and Goodhart’s law predicts that optimizing hard on such a proxy will make it come apart from our true goal. Mesa-optimization, in which the trained model is itself an optimizer pursuing its own objective, can also be understood as an example of Goodhart’s law; one possible outcome is deceptive alignment, a case where the AI acts aligned while in training but turns out not to be aligned when deployed.
One attempt to help solve this problem is to use milder forms of optimization, such as quantilization (sketched below).
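As a rough illustration of that idea — a toy sketch, not a faithful implementation of any particular quantilizer proposal; `proxy_score` and the base distribution below are invented for the example — a quantilizer samples an action at random from the top q fraction of a base distribution of actions, instead of taking the single proxy-maximizing action:

```python
import numpy as np

def quantilize(actions, proxy_score, q=0.1, rng=None):
    """Pick an action uniformly at random from the top q-fraction of a base
    distribution of actions, ranked by the proxy, instead of taking the argmax."""
    rng = rng or np.random.default_rng()
    scores = np.array([proxy_score(a) for a in actions])
    cutoff = np.quantile(scores, 1 - q)
    top = [a for a, s in zip(actions, scores) if s >= cutoff]
    return top[rng.integers(len(top))]

# Hypothetical usage: 1,000 candidate actions drawn from a safe base
# distribution, scored by a proxy we only partially trust.
base_actions = np.random.default_rng(1).normal(size=1000)
print(quantilize(base_actions, proxy_score=lambda a: a, q=0.05))
```

Because the chosen action only needs to be good rather than optimal under the proxy, it has less opportunity to exploit the places where the proxy and the true goal come apart.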
[^1]: There is a possibly fictional story of Soviet nail factories which, when tasked with producing a high number of nails, produced many tiny, useless nails, and when tasked with producing a certain weight of nails, produced fewer, giant nails.