Is there a good plan for alignment?

There are certainly lots of plans (or “agendas”) for AI safety and alignment. However, it’s unclear whether any of them are particularly good (as the question asks). It seems plausible that several such plans could be needed together in order to fully handle the complexity of the AI alignment problem. It’s also unclear how AI alignment as a whole discipline will evolve in the coming years.

Nevertheless, here’s an (incomplete) list of plans and agendas for AI safety and alignment. Note that some of these plans may have been abandoned or forgotten by their original proposers.

  1. johnswentworth’s alignment plan (proposed in December 2021) involves resolving the field’s fundamental confusions about agent foundations. The plan calls for special attention to “natural abstractions” (ways of naturally carving up a complex world into understandable abstract parts), followed by “selection theorems” (proofs that certain agent architectures would be selected for by evolution/natural selection), and finally by “ambitious value learning”.

    According to John Wentworth’s 2022 update, the outlook for the plan remains strong, with many researchers beginning to converge on abstractions and interpretability as key building blocks. However, corrigibility now seems like a more likely target than “ambitious value learning”.

  2. OpenAI’s own alignment plan (proposed in August 2022) involves engineering “good training signals that align with human intent” by training AIs using human feedback, training AIs to assist human evaluators, and training AIs to do alignment research themselves. In practice, this includes techniques like reinforcement learning from human feedback (RLHF) and recursive reward modeling (RRM); a rough sketch of the reward-modeling step appears after this list.

  3. Question-answer counterfactual interval (QACI), first proposed in October 2022 by Tamsin Leake, proposes creating an aligned AGI via an iterative series of simulations and “questions”, starting from random garbage and progressing to finer and finer impressions of what the Ideal Utility Function™ should be. QACI is a formal mathematical object which points to a human who can undergo a simulated long reflection process and return either aligned utility functions or actions. It aims to be robust to superintelligent optimization by being a fully specified formal goal which, if optimized, would produce good outcomes.

  4. Holden Karnofsky’s alignment plan (proposed in December 2022) involves several prongs, including mechanistic interpretability work, limiting the powers of AIs (e.g., by making them myopic/short-sighted, narrow, or mild optimizers), and using AIs to help align each other. There’s a critique of this plan by Alex Flint.

  5. David A. Dalrymple’s alignment plan (also proposed in December 2022) centers on fully formalized models. Specifically, it proposes building a complete formal model of the world, designing and validating an alignment approach within that formal world model, and then executing the resulting plan in the real world.
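To make the “good training signals” in OpenAI’s plan (item 2) a bit more concrete, here is a minimal sketch of the reward-modeling step that underlies RLHF: a model is trained so that responses humans prefer score higher than responses they reject, and the resulting reward model can then supply a training signal for fine-tuning a policy. Everything below (the feature-vector representation of responses, the simulated labeler, the hyperparameters) is an illustrative assumption, not a description of any lab’s actual pipeline.

```python
# A minimal, hypothetical sketch of the reward-modeling step behind RLHF.
# Assumptions (not from the source): responses are fixed-size feature vectors,
# and the "human labeler" is simulated by a hidden linear reward function.
# Real pipelines use a language-model backbone and actual human comparisons.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

FEATURE_DIM = 16  # illustrative size of a response representation


class RewardModel(nn.Module):
    """Maps a response representation to a scalar reward estimate."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


model = RewardModel(FEATURE_DIM)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hidden "true" reward standing in for human judgment in this toy setup.
true_weights = torch.randn(FEATURE_DIM)

for step in range(2001):
    # Each comparison pairs two candidate responses, as in pairwise human feedback.
    a = torch.randn(32, FEATURE_DIM)
    b = torch.randn(32, FEATURE_DIM)
    prefer_a = (a @ true_weights) > (b @ true_weights)
    chosen = torch.where(prefer_a.unsqueeze(-1), a, b)
    rejected = torch.where(prefer_a.unsqueeze(-1), b, a)

    # Bradley-Terry style loss: the preferred response should get a higher score.
    loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 500 == 0:
        print(f"step {step}: preference loss = {loss.item():.3f}")

# In a full RLHF pipeline, the trained reward model would then supply the
# training signal for fine-tuning the policy (e.g., with PPO); that stage
# is omitted here.
```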

There’s a much longer list at ai-plans.com, though many of the plans there are less fleshed out than the ones presented here. See also the Research Agendas tag on the Alignment Forum.