What is corrigibility?
A “corrigible” agent
Agent
A system that can be understood as taking actions towards achieving a goal.
- If we try to pause or “hibernate” a corrigible AI, or shut it down entirely, it will let us do so. This is not something that an AI is automatically incentivized to let us do, since if it is shut down, it will be unable to fulfill its goals.
- If we try to reprogram a corrigible AI, it will not resist this change and will allow this modification to go through. If this behavior is not specifically incentivized, an AI might try to prevent this, such as by attempting to fool us into believing its utility functionwas modified successfully, while actually keeping its original utility function as obscured functionality. By default, this deception could be a preferred outcome according to the AI's current preferences.Utility functionView full definition
A mathematical function that assigns a number representing utility to every possible outcome. Outcomes with higher utility are preferred to outcomes with lower utility. “Maximizing utility” then means choosing the most preferred outcome.
Further reading:
- Corrigibility As Singular Target (CAST) by Max Harms
Keep Reading
Continue with the next entry in "The alignment problem"
What is the orthogonality thesis?
NextOr jump to a related question