What is scalable oversight?

Some current techniques for aligning AI, such as RLHF, rely on humans' ability to supervise AI. But humans won't be able to reliably supervise AI systems much smarter than they are, i.e., superintelligent systems. For example, using human feedback runs into problems when the AI system has key information the human lacks, or when the AI is deceptive.

The basic idea behind most approaches to scalable oversight is to use AI systems to assist in the supervision of other AI systems. Proposed approaches include AI safety via debate, iterated amplification, and recursive reward modeling.
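To make the shared idea concrete, here is a toy sketch (with hypothetical stand-in functions, not any real system or API) of one common pattern: a "critic" model surfaces problems in another model's output so that a limited human judge can act on claims they could not verify alone.

```python
# Toy sketch of AI-assisted oversight. All functions below are hypothetical
# stand-ins: simple stubs illustrating the flow, not real models.

def assistant_answer(question: str) -> str:
    # Hypothetical untrusted model: gives a confident but flawed answer.
    return "The bridge design is safe; the load factor is 0.9."

def critic_review(answer: str) -> list[str]:
    # Hypothetical helper model: flags claims the human should double-check.
    flags = []
    if "load factor is 0.9" in answer:
        flags.append("A load factor below 1.0 means the structure is overloaded.")
    return flags

def human_judge(answer: str, flags: list[str]) -> str:
    # The human cannot verify the engineering directly, but can act on
    # specific, checkable critiques surfaced by the critic model.
    return "reject" if flags else "accept"

question = "Is the bridge design safe?"
answer = assistant_answer(question)
verdict = human_judge(answer, critic_review(answer))
print(verdict)  # prints "reject": the critique changed the human's verdict
```

The point of the pattern is that checking a specific critique is easier for the human than evaluating the whole answer unaided, so oversight can, in principle, scale to outputs the human could not judge directly.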

One problem with research on scalable oversight is that the proposed methods are hard to evaluate, since we do not yet have AI systems whose abilities generally exceed our own. Scalable oversight research has also been criticized as primarily improving AI capabilities rather than making AI safer.