What are the main problems that need to be solved in AI safety?

In addition to needing to define a robust objective for an Artificial General Intelligence (AGI) to pursue, we expect to run into a number of technical problems in trying to align it. Here are some of them.

  • Specification gaming happens when an agent pursues an objective that is consistent with what you asked for but not with what you intended (“be careful what you wish for”, especially with genies). This is known as an outer alignment failure. (A toy illustration is sketched after this list.)

  • Goal misgeneralisation happens when the agent learns a goal that earns high reward during training but diverges from the intended goal when the environment changes. In the CoinRun example, the coin always sits at the right end of the level during training, so the agent learns to run to the right rather than to collect the coin, and it keeps running right when the coin is placed elsewhere. This is known as an inner alignment failure. (A supervised-learning analogue is sketched after this list.)

  • Most ML models currently in use (including large language models such as ChatGPT) are black boxes: we don’t know how to inspect them to understand what they have learned during training. This means we would be unable to do a pre-flight check for misalignment before launching an AGI. Interpretability attempts to understand the inner workings of such models. (A toy example of one interpretability technique, linear probing, is sketched after this list.)

  • If we were to detect a problem with an AGI after launching it, we have reason to believe that we might not be able to shut it down or correct its course.

  • It is possible that an AGI might be aligned during its training phase and then take a “sharp left turn” into unaligned territory as its capabilities increase.
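
As a toy illustration of specification gaming, here is a minimal sketch in Python. The one-dimensional “delivery” environment, the checkpoint bonus and the helper functions are all invented for this example: the reward we write down pays the agent each time it crosses a checkpoint, and the reward-maximising behaviour turns out to be pacing back and forth over the checkpoint rather than delivering the package to the goal.

    # Hypothetical 1-D environment: the agent starts at position 0, the intended
    # goal is to end at position 5, but the reward we wrote pays +1 each time the
    # agent steps onto the checkpoint at position 2.
    from itertools import product

    CHECKPOINT, GOAL, STEPS = 2, 5, 7

    def proxy_reward(actions):
        """The reward we actually specified: +1 per visit to the checkpoint."""
        pos, total = 0, 0
        for a in actions:          # each action is +1 (step right) or -1 (step left)
            pos += a
            if pos == CHECKPOINT:
                total += 1
        return total

    def intended_score(actions):
        """What we really wanted: 1 if the agent ends at the goal, else 0."""
        return 1 if sum(actions) == GOAL else 0

    # Brute-force the action sequence that maximises the written-down reward.
    best = max(product((1, -1), repeat=STEPS), key=proxy_reward)
    print("proxy reward:", proxy_reward(best))      # 3: it keeps re-crossing the checkpoint
    print("intended score:", intended_score(best))  # 0: the goal is never reached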
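
As a toy, supervised-learning analogue of the CoinRun result (the tile features, the noise level and the training loop below are invented for illustration, not taken from the original experiment): during training the coin is always on the rightmost tile, so a simple model learns to lean on the “rightmost tile” proxy cue, and its accuracy falls towards chance once the coin can appear anywhere.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n, coin_always_right):
        """Hypothetical tile observations: a noisy 'I see a coin here' cue and an
        exact 'this is the rightmost tile' cue; the label is whether the tile
        really holds the coin."""
        coin_here = rng.integers(0, 2, n)
        is_rightmost = coin_here.copy() if coin_always_right else rng.integers(0, 2, n)
        noisy_coin_cue = np.where(rng.random(n) < 0.8, coin_here, 1 - coin_here)
        return np.column_stack([noisy_coin_cue, is_rightmost]).astype(float), coin_here.astype(float)

    # Training distribution: the coin is always on the rightmost tile, so the
    # proxy cue is a perfect predictor while the intended cue is only 80% reliable.
    X_train, y_train = make_data(5000, coin_always_right=True)

    # Plain logistic regression trained by gradient descent.
    w, b = np.zeros(2), 0.0
    for _ in range(2000):
        p = 1 / (1 + np.exp(-(X_train @ w + b)))
        w -= 0.5 * X_train.T @ (p - y_train) / len(y_train)
        b -= 0.5 * np.mean(p - y_train)

    print("weights [coin cue, rightmost cue]:", w.round(2))  # more weight on the proxy

    # Shifted test distribution: the coin can now be on any tile.
    X_test, y_test = make_data(5000, coin_always_right=False)
    accuracy = np.mean(((X_test @ w + b) > 0) == (y_test == 1))
    print("accuracy when the coin moves:", round(accuracy, 2))  # close to chance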
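
Interpretability research covers many techniques; as a toy sketch of one of the simplest, linear probing, here is an example where the task, the network size and the probed concept are all made up: we train a small network on a task that requires knowing the sign of the first input, then check whether that concept can be read back out of the hidden layer with a linear classifier.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(4000, 2))
    task_label = (X[:, 0] * X[:, 1] > 0).astype(int)   # XOR of signs: needs a hidden layer
    concept = (X[:, 0] > 0).astype(int)                 # the intermediate concept we probe for

    # Train the "black box" on the task.
    net = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                        max_iter=2000, random_state=0).fit(X[:3000], task_label[:3000])

    # Recompute the hidden-layer activations by hand from the learned weights.
    hidden = np.maximum(X @ net.coefs_[0] + net.intercepts_[0], 0)

    # Fit a linear probe on the hidden activations and test it on held-out data.
    probe = LogisticRegression(max_iter=1000).fit(hidden[:3000], concept[:3000])
    print("task accuracy:", round(net.score(X[3000:], task_label[3000:]), 2))
    print("probe accuracy for 'x0 > 0':", round(probe.score(hidden[3000:], concept[3000:]), 2))

A high probe accuracy is weak evidence that the network represents the concept internally; interpretability work on large models goes far beyond this kind of probe.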

See also Concrete Problems in AI Safety.