What is the problem with RLHF?

Reinforcement learning from human feedback (RLHF) has several potential problems:

  • Accelerating alignment

  • Deceptive alignment

    The model tells us what we want to hear, rather than exhibiting the behavior we want.

  • Increased goal directedness

  • Misgeneralization

RLHF works well for tasks like teaching a simulated robot to backflip, because we know what we want and can recognize it when we see it. Ethics is different: we often don’t know what we want, especially as we leave familiar situations, and the AI must generalize to cases where our intuitions fail us. In those cases we won’t be able to predict what behavior the AI will produce as its interpretation of our feedback.
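The misgeneralization worry can be made concrete with a toy sketch. The snippet below is illustrative only, not any real RLHF pipeline: it assumes a hypothetical hidden "true reward" (a quadratic peaking at x = 1) and fits a linear reward model to pairwise human preferences, using the Bradley-Terry preference loss common in RLHF reward modeling. Because all the feedback comes from a familiar range (0 ≤ x ≤ 1, where the true reward happens to be increasing), the model learns "bigger x is better" and confidently prefers x = 3 over x = 1 out of distribution, even though the true reward says the opposite.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_reward(x):
    """Hidden 'true' reward the human implicitly uses: best at x = 1."""
    return -(x - 1.0) ** 2

def features(x):
    """Linear reward model r(x) = w0 + w1*x (an illustrative assumption)."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x], axis=-1)

# Pairwise comparisons gathered only on the familiar range [0, 1],
# where the true reward is monotonically increasing in x.
A = rng.uniform(0.0, 1.0, 300)
B = rng.uniform(0.0, 1.0, 300)
targets = (true_reward(A) > true_reward(B)).astype(float)  # 1 if A preferred

FA, FB = features(A), features(B)
w = np.zeros(2)
for _ in range(2000):
    # Bradley-Terry model: P(A preferred) = sigmoid(r(A) - r(B)).
    p = 1.0 / (1.0 + np.exp(FB @ w - FA @ w))
    grad = ((p - targets)[:, None] * (FA - FB)).mean(axis=0)
    w -= 0.5 * grad

# In distribution the model matches the human's preferences...
print("model prefers 0.9 over 0.2:", w @ features(0.9) > w @ features(0.2))
# ...but out of distribution it extrapolates "more x is better",
# while the true reward says x = 1 beats x = 3.
print("model prefers 3 over 1:", w @ features(3.0) > w @ features(1.0))
print("true reward prefers 1 over 3:", true_reward(1.0) > true_reward(3.0))
```

The reward model is faithful to the feedback it saw, yet its extrapolation beyond the training distribution diverges from what the humans actually wanted, which is exactly the failure mode described above.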