What is the "natural abstraction hypothesis"?

Introduction

Informally, the natural abstraction hypothesis (NAH) claims:

  • Our physical world “abstracts well” into high-level abstractions of low-level systems.

  • These abstractions are “natural” in the sense that many different kinds of learning processes acquire and use them.

  • These abstractions approximately correspond to the concepts used by humans.

If the NAH is true, AI alignment could be dramatically simplified, as it implies that powerful AI will generally think in concepts that humans can understand.1

Explanation of the natural abstraction hypothesis

Let's unpack that definition. First, what do we mean by “our physical world abstracts well”? Just that for most structures in the world, the information that describes how the structure interacts with other stuff “far away” from it is much lower-dimensional (i.e., requires fewer numbers to describe) than the structure itself. “Far away” can refer to many kinds of separation, including physical2, conceptual, and causal separation.

For example, a wheel can be understood without considering the position and velocity of every one of its atoms. We only need to know a few large-scale properties (its shape, how it rotates, etc.) to model how a wheel interacts with other parts of the world. These properties can be represented by just a handful of numbers, while an atomically precise description would require over 10^26 numbers! In this sense, the wheel is an abstraction of the atoms that compose it. Or consider a rock: you don't need to keep track of its chemical composition if you’re chucking it at someone. You just need to know how hard and heavy it is.
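
The wheel example has a simple numerical analogue. The sketch below is only an illustration, under assumptions that are not in the article: the "structure" is a random cloud of point masses, the interaction is Newtonian gravity with G set to 1, and all function names are hypothetical. It compares the potential felt at a distant point, computed from every particle, with the potential predicted from just four numbers (total mass and center of mass).

```python
import math
import random

def exact_potential(masses, positions, point):
    """Sum the Newtonian potential -m/r over every particle (G = 1)."""
    return -sum(m / math.dist(p, point) for m, p in zip(masses, positions))

def summary(masses, positions):
    """Compress the whole cloud into 4 numbers: total mass and center of mass."""
    total = sum(masses)
    center = tuple(
        sum(m * p[i] for m, p in zip(masses, positions)) / total
        for i in range(3)
    )
    return total, center

def summary_potential(total, center, point):
    """Potential predicted from the 4-number summary alone."""
    return -total / math.dist(center, point)

def relative_error(distance, n_particles=2_000, seed=0):
    """Relative error of the summary at a point `distance` away from the cloud."""
    rng = random.Random(seed)
    # Assumed toy structure: particles scattered in a cube of half-width 1.
    masses = [rng.uniform(0.5, 1.5) for _ in range(n_particles)]
    positions = [
        tuple(rng.uniform(-1.0, 1.0) for _ in range(3))
        for _ in range(n_particles)
    ]
    point = (distance, 0.0, 0.0)
    exact = exact_potential(masses, positions, point)
    approx = summary_potential(*summary(masses, positions), point)
    return abs(exact - approx) / abs(exact)

near_error = relative_error(distance=3.0)    # a few cloud-widths away
far_error = relative_error(distance=100.0)   # "far away" relative to the cloud
print(f"near: {near_error:.2e}, far: {far_error:.2e}")
```

Under these assumed settings, the four-number summary should reproduce the distant potential almost exactly, while the same summary is noticeably less accurate close to the cloud. That is the sense of “far away” used above: the further away you are, the less of the low-level detail reaches you.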

The NAH claims that different minds will converge on the same set of abstractions because those abstractions are the most efficient representations of all the relevant information that reaches the mind from “far away”. Many things that are “far away” affect structures a mind cares about, so that mind has an incentive to learn the corresponding abstractions. For instance, if a mind mostly cares about building great cars, then things like “Hertzian Zones” may affect its ability to build great cars despite being conceptually far from car design. Such a mind would plausibly have to learn what high-pressure phase transitions are.

Moreover, NAH claims that the abstractions that humans usually use are approximately natural abstractions. That is, any mind that looks at and uses car wheels successfully will have learned what a circle is in approximately the same way as a human. Or if some aliens about the size of humans, born on a planet similar to our own, were to come up with a theory of motion, they’d land on Newtonian physics.3 Or perhaps General Relativity if they were more sophisticated.

Note how strong a claim NAH is! It applies to aliens, to superintelligences, and even to alien superintelligences! But before we investigate whether it is true, why does NAH matter for alignment?

Why the natural abstraction hypothesis is important for alignment

Alignment is probably easier if NAH is true than if it isn't. If superintelligences will reliably use approximately the same concepts humans use, then there’s no fundamental barrier to doing mechanistic interpretability on superintelligences, and maybe even editing their goals to be human-compatible.

If we are lucky, human values, or other alignment targets like “niceness” or corrigibility or property rights, are themselves natural abstractions. If these abstractions are represented in a simple way in most advanced AI systems, then alignment, or control, is simply a matter of locating these abstractions within the AI's mind and forming a goal from them like “be corrigible to your creator”. A crude but remarkably effective technique in this vein is activation steering.4 If these values are natural abstractions, then even if they are not represented anywhere in the AI’s cognition, they could still be taught to the AI using small amounts of data.
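
To make “locating an abstraction and acting on it” more concrete, here is a toy sketch of the general idea behind activation steering. It is an illustration under assumptions, not a description of any particular system: the activations are made-up three-dimensional vectors, the steering direction is taken as a difference of means (one common recipe, assumed here), and every function name is hypothetical.

```python
def mean(vectors):
    """Component-wise mean of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(with_concept, without_concept):
    """Difference of means: points from 'concept absent' to 'concept present'."""
    return [a - b for a, b in zip(mean(with_concept), mean(without_concept))]

def steer(activation, direction, strength):
    """Add a scaled copy of the steering direction to one activation vector."""
    return [a + strength * d for a, d in zip(activation, direction)]

def dot(u, v):
    """Inner product, used to read off how strongly a concept is expressed."""
    return sum(a * b for a, b in zip(u, v))

# Made-up activations recorded on inputs that do / do not involve the concept.
with_concept = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.1]]
without_concept = [[0.0, 0.1, 0.0], [0.2, 0.1, 0.1]]

direction = steering_vector(with_concept, without_concept)
original = [0.1, 0.5, -0.3]
steered = steer(original, direction, strength=4.0)

print("concept score before:", dot(original, direction))
print("concept score after: ", dot(steered, direction))
```

In a real model the vectors would be hidden-layer activations with thousands of dimensions, and the steered activation would be fed back into the rest of the forward pass. The sketch only shows the arithmetic: if a concept really is represented along a simple direction, nudging activations along that direction changes how strongly the concept is expressed.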

Some alignment targets seem more likely to be natural abstractions than others. Any specific conception of value a human has — e.g., natural law deontology or ancient Athenian virtue ethics — is unlikely to be a natural abstraction.5 But there are some parts of human values, and inputs to them, that are plausibly natural abstractions. If an AI used those abstractions, that would make it easier for a training process to instill values that depend on them into the AI.

Is NAH true?

We don't know. The truth of the NAH is ultimately an empirical question, and we don’t have many distinct kinds of minds we can converse with, or manually inspect, to see if their abstractions are natural. For the few kinds of mind we can do tests on — i.e., humans, some animals, and AI — the data are consistent with NAH.

Humans can concisely communicate abstractions to other humans — you've never needed 1TB of data to describe an idea to someone, let alone to convince them that something is a rock. And we learn roughly the same abstractions when in shared environments. Our abstractions continue to work even in drastically different environments from where we acquired them. For example, F=ma still works on the moon. And that’s what we’d expect if NAH were true.

As far as we can tell with our current, crude ability to measure abstractions, very different AIs trained in different ways on different data develop basically the same abstractions. It appears that this trend also becomes stronger for more capable AIs.

But since we have no data about superintelligent systems, we need to develop more general theories of natural abstraction. Those theories could then be tested against existing data, and the most successful ones could be used to predict whether superhuman systems will use abstractions that correspond to human ones. Moreover, good theory tends to suggest good experiments to run to gather more data. Such experiments would be useful if we want a theory of natural abstractions before artificial superintelligence (ASI) arrives. Unfortunately, the theory of natural abstractions as it currently exists is not developed enough to support such predictions or experiments. We do not even have a good technical definition for natural abstractions yet, which is why we framed the hypothesis informally.6 The work is ongoing.

Further material


  1. “Good representations are all alike; every bad representation is bad in its own way” — if Tolstoy had invented the Natural Abstractions Hypothesis, that is what it would say. ↩︎

  2. Relative to the size of the system: “far away” from a fly might mean a few centimeters, while “far away” from the sun might mean millions of kilometers. ↩︎

  3. In fact, we understand the physics of everyday things so well that we’re quite sure we have a complete description of the laws that underlie everyday life. Any errors must be tiny. “In order to get something [a theory of physics] that produced a little different result, it has to be completely different. You can’t make imperfections on a perfect thing. You have to have another perfect thing.” — Richard Feynman, “Seeking New Laws”. ↩︎

  4. See for instance Golden Gate Claude. ↩︎

  5. If human values aren't natural abstractions, it doesn't follow that they have nothing to do with natural abstractions. E.g., human values may have components which are natural abstractions, which would significantly constrain the type signature of human values, making them easier to find. This might even mean that human values are good enough proxies for natural abstractions in some training regimes to ensure that human values get instilled by default. ↩︎

  6. It is somewhat sloppy to say “the” natural abstraction hypothesis, as there are various formulations, and of course there might be a few distinct natural abstractions corresponding to a given human abstraction, rather than one. Some of the formulations have different implications for alignment. This is why this article’s exposition has to be fuzzy enough to accommodate most of these variants. ↩︎