What is David Krueger working on?
Krueger runs a lab at the University of Cambridge. His work includes:
1. Operationalizing inner alignment failures and other speculative alignment failures that haven't actually been observed.
2. Understanding neural network generalization.
For work done on (1), see Goal Misgeneralization in Deep Reinforcement Learning, a paper that empirically demonstrates inner alignment failures in deep RL environments. For example, the authors trained an agent to reach a piece of cheese in a maze, where the cheese was always in the top right corner during training. At test time, when the cheese was placed elsewhere, the agent navigated to the top right corner instead of to the cheese: it had learned the mesa-objective "go to the top right".
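The cheese example can be illustrated with a toy sketch. This is not the paper's environment or code: it assumes a wall-free 5x5 grid, and the names `intended_policy`, `misgeneralized_policy`, and `reaches_cheese` are hypothetical. It only shows why the two objectives look identical in training and come apart at test time.

```python
# Toy illustration (an assumption, not the paper's setup): an open 5x5 grid
# with no walls, where a "policy" is reduced to the cell it heads toward.
SIZE = 5
TOP_RIGHT = (0, SIZE - 1)  # (row, column); row 0 is the top of the grid


def step_toward(pos, target):
    """Move one cell toward the target: fix the row first, then the column."""
    r, c = pos
    tr, tc = target
    if r != tr:
        r += 1 if tr > r else -1
    elif c != tc:
        c += 1 if tc > c else -1
    return (r, c)


def rollout(target, start, steps=20):
    """Follow step_toward for a fixed number of steps; return the final cell."""
    pos = start
    for _ in range(steps):
        pos = step_toward(pos, target)
    return pos


def intended_policy(cheese):
    """The objective the designers wanted: head for the cheese."""
    return cheese


def misgeneralized_policy(cheese):
    """The learned proxy: head for the top right, ignoring the cheese."""
    return TOP_RIGHT


def reaches_cheese(policy, cheese, start=(SIZE - 1, 0)):
    """True if the agent ends the episode on the cheese cell."""
    return rollout(policy(cheese), start) == cheese


train_cheese = TOP_RIGHT  # training: cheese is always in the top right
test_cheese = (3, 1)      # test: cheese is placed somewhere else

for name, policy in [("intended", intended_policy),
                     ("misgeneralized", misgeneralized_policy)]:
    print(name,
          "train:", reaches_cheese(policy, train_cheese),
          "test:", reaches_cheese(policy, test_cheese))
```

Both policies succeed on the training layout, so training reward alone cannot distinguish them; only the test layout reveals which objective was learned.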
For work done on (2), see Out-of-Distribution Generalization via Risk Extrapolation (REx), a method that improves robustness to distribution shift over previous approaches.
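As a rough sketch of the idea behind the variance-penalized variant of REx (V-REx): the training objective combines the risk across training environments with a penalty on how much that risk varies between environments. The function name `vrex_objective`, the use of the mean rather than the sum of risks, and the example numbers are assumptions for illustration, not the paper's implementation.

```python
from statistics import mean, pvariance


def vrex_objective(env_risks, beta=1.0):
    """Average per-environment risk plus beta times the variance of the risks.

    env_risks: one loss value per training environment (hypothetical inputs).
    beta: strength of the penalty on risk differences between environments.
    """
    return mean(env_risks) + beta * pvariance(env_risks)


# Equal risks across environments: the variance penalty vanishes.
print(vrex_objective([0.5, 0.5, 0.5], beta=10.0))

# Same average risk, but unequal across environments: the penalty adds to it.
print(vrex_objective([0.2, 0.8], beta=1.0))
```

With a large `beta`, a model is pushed toward equal risk in every training environment, which is the intuition for why it may hold up better under shifts it has not seen.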