What is David Krueger working on?

Krueger runs a lab at the University of Cambridge. Some things he is working on include:

  1. Operationalizing inner alignment failures and other speculative alignment failures that haven't yet been observed in practice.

  2. Understanding neural network generalization.

For work on (1), see Goal Misgeneralization, a paper that empirically demonstrated examples of inner alignment failure in deep RL environments. For example, the authors trained an agent to reach a piece of cheese in a maze, where the cheese was always in the top-right corner of every maze in the training set. At test time, when the cheese was placed elsewhere, the agent navigated to the top-right corner instead of to the cheese: it had learned the mesa-objective "go to the top right" rather than "go to the cheese".

For work on (2), see OOD Generalization via Risk Extrapolation, which proposes a method that improves on the robustness of previous approaches.



AISafety.info

We’re a global team of specialists and volunteers from various backgrounds who want to ensure that the effects of future AI are beneficial rather than catastrophic.