What is scalable oversight?
Some current techniques for aligning AI, such as RLHF, rely on humans’ ability to supervise AI. But humans won’t be able to reliably judge the outputs of AI systems much smarter than they are.
A human struggling to understand a superhuman AI’s response, from this paper.
Scalable oversight refers to a set of techniques and approaches that help humans effectively monitor, evaluate, and control AI systems more capable than themselves. These techniques are meant to keep working as the systems they oversee become more capable.
The basic idea behind most approaches to scalable oversight is to use AI systems to assist supervision of other AI systems. Approaches include:
- Reward modeling
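To make the reward modeling idea concrete, here is a minimal sketch of its core step: learning a reward function from pairwise human preferences, as in RLHF-style training. This is an illustrative toy, not any particular system's implementation: a linear model over made-up feature vectors stands in for a neural network, and a hidden "true reward" stands in for the human rater. The loss is the standard Bradley-Terry negative log-likelihood over preference pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical data: each response is a 4-dim feature vector; the
# "human" prefers whichever response scores higher under a hidden
# true reward function.
true_w = np.array([1.0, -2.0, 0.5, 3.0])
responses_a = rng.normal(size=(200, 4))
responses_b = rng.normal(size=(200, 4))
prefers_a = (responses_a @ true_w) > (responses_b @ true_w)

# Fit the reward model by gradient descent on the preference loss:
#   -log sigmoid(r(chosen) - r(rejected))
w = np.zeros(4)
lr = 0.1
for _ in range(500):
    margin = (responses_a - responses_b) @ w      # r(a) - r(b)
    signed = np.where(prefers_a, margin, -margin) # margin of the chosen response
    grad_coeff = -(1.0 - sigmoid(signed))         # d(-log sigmoid)/d(signed)
    signs = np.where(prefers_a, 1.0, -1.0)
    grad = ((grad_coeff * signs)[:, None] * (responses_a - responses_b)).mean(axis=0)
    w -= lr * grad

# The learned reward should rank held-out pairs like the true reward.
test_a = rng.normal(size=(100, 4))
test_b = rng.normal(size=(100, 4))
agreement = np.mean(
    ((test_a @ w) > (test_b @ w)) == ((test_a @ true_w) > (test_b @ true_w))
)
print(f"preference agreement on held-out pairs: {agreement:.2f}")
```

Once trained, such a reward model can score outputs in place of a human rater, which is the sense in which it "scales" human supervision; the harder part, which this sketch omits, is keeping the model's reward aligned with human judgment on outputs humans can no longer check directly.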
One problem with research on scalable oversight is that proposed methods are hard to evaluate: we do not yet have AI whose abilities generally exceed ours, so the methods cannot be tested in the regime they are designed for. Scalable oversight research has also been criticized as primarily improving AI capabilities rather than safety.