What is the orthogonality thesis?

The orthogonality thesis is the claim that any level of intelligence is compatible with any set of terminal goals. In other words, values and intelligence are “orthogonal” to each other: agents can vary along one of these dimensions while staying fixed along the other. For example, a superintelligent system could, in principle, be devoted to a goal as arbitrary as maximizing the number of paperclips in the universe.

This means we can’t assume that a system that is as smart as or smarter than humans will automatically be motivated by human values. Morality is not an obvious feature of nature that any sufficiently intelligent agent will discover; rather, it has many human-specific components. (This isn’t true of every feature of human minds: instrumentally useful drives like self-preservation are likely to be convergent across most intelligent agents.)

On its own, the orthogonality thesis only states that unaligned superintelligence is possible, not that it is likely or that AI alignment is difficult. The case for alignment being difficult rests on further concepts like Goodhart’s law, instrumental convergence, and inner misalignment.

While the orthogonality thesis is broadly accepted by the alignment research community, it has drawn criticism from a few distinct directions:

  • Some moral realists assert that a sufficiently intelligent entity would discover and adhere to objective moral truths that humans would endorse upon reflection.

  • Beren Millidge argues that the strong form of the orthogonality thesis is false for agents produced by modern deep learning.

  • Nora Belrose contends that, depending on how it is interpreted, the thesis is either trivial (true but uninformative), false (intelligence does in fact constrain goals), or unintelligible.