What is the orthogonality thesis?

The orthogonality thesis is the claim that any level of intelligence is compatible with any set of terminal goals. In other words, values and intelligence are “orthogonal” to each other: an agent can vary along one dimension while staying fixed on the other.

This means we can’t assume that a system as smart as or smarter than humans will automatically follow human values. Morality is not an obvious part of nature that any sufficiently intelligent agent will discover; rather, it has many human-specific components. (Notice that this isn’t true of every feature of human minds; the techniques of practical reasoning and planning, for instance, can be put toward a wide variety of ends and are likely to be used by any intelligent agent.)

On its own, the orthogonality thesis only states that unaligned superintelligence is possible, not that it is likely, or that AI alignment is difficult. The reasons to think alignment is difficult come from concepts like Goodhart’s law, instrumental convergence, and inner misalignment.

While the orthogonality thesis is broadly accepted by the alignment research community, criticism of it comes from a few distinct directions:

  • Some moral realists assert that a sufficiently intelligent entity would discover and adhere to objective moral truths that humans would endorse upon reflection.

  • Beren Millidge argues that the strong form of the orthogonality thesis is false for systems trained with modern deep learning methods.