What is Aligned AI / Stuart Armstrong working on?

One of the key problems in AI safety is that there are many ways for an AI to generalize off-distribution, so an arbitrarily chosen generalization is very likely to be unaligned. See the model splintering post for more detail. Aligned AI's plan for solving this problem is as follows:

  1. Maintain a set of all possible extrapolations of reward data that are consistent with the training process.
  2. Pick a safe reward extrapolation from this set.

As of 2022, they are working on algorithms to accomplish step 1: see Value Extrapolation.

Their initial operationalization of this problem is the “lion and husky problem”: you train an image classifier on a dataset of lions and huskies in which every lion appears in the desert and every husky appears in the snow. The learning problem is therefore under-determined: should the classifier distinguish images by the background environment (e.g., sand vs. snow), or by the animal itself?

On this problem, a good extrapolation algorithm would generate classifiers that cover all of these different ways of extrapolating, so that the 'correct' extrapolation is guaranteed to be somewhere in the generated set. They have also introduced a new dataset, Happy Faces, built around the same idea.
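As a concrete illustration of step 1 (this is a toy sketch, not Aligned AI's actual code), consider a stripped-down lion-and-husky problem where each image is reduced to two binary features, `animal` and `background`, which are perfectly correlated in the training set. The sketch enumerates a small hypothesis class of candidate classifiers and keeps every one that is consistent with the training labels; several incompatible extrapolations survive, and they only disagree off-distribution (e.g., on a husky photographed in the desert). All names below are hypothetical.

```python
# Toy sketch of step 1: keep every extrapolation consistent with the training data.
# Each "image" is reduced to two binary features:
#   animal:     0 = lion, 1 = husky
#   background: 0 = desert, 1 = snow

# (animal, background) -> label, where label 1 means "husky"
train_data = [
    ((0, 0), 0),  # lion in the desert
    ((0, 0), 0),
    ((1, 1), 1),  # husky in the snow
    ((1, 1), 1),
]

# A small hypothesis class of candidate classifiers (possible extrapolations).
candidate_classifiers = {
    "by_animal":     lambda animal, background: animal,
    "by_background": lambda animal, background: background,
    "animal_and_bg": lambda animal, background: animal and background,
    "animal_or_bg":  lambda animal, background: animal or background,
    "always_husky":  lambda animal, background: 1,
}

def consistent_with_training(classifier, data):
    """True if the classifier reproduces every training label."""
    return all(classifier(*x) == y for x, y in data)

# Step 1: the set of all extrapolations consistent with the training process.
consistent_set = {
    name: clf
    for name, clf in candidate_classifiers.items()
    if consistent_with_training(clf, train_data)
}
print("Consistent extrapolations:", sorted(consistent_set))

# An off-distribution input: a husky in the desert.
husky_in_desert = (1, 0)
for name, clf in sorted(consistent_set.items()):
    print(f"{name} says a husky in the desert is:",
          "husky" if clf(*husky_in_desert) else "lion")
```

Here four of the five candidates fit the training data perfectly yet give conflicting answers on the husky-in-desert image. As long as the hypothesis class is rich enough, the intended rule (classify by animal) is guaranteed to be somewhere in the retained set, which is exactly the property step 1 is after.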

Step 2 could be approached in several ways, including conservatism, generalized deference to humans, or an automated process for removing dangerous goals such as wireheading, deception, or killing everyone.
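To make the "conservatism" option more concrete, here is a hedged sketch (again, not Aligned AI's published method) of one simple conservative rule: given a set of reward extrapolations that all fit the training data, evaluate each candidate action under every extrapolation and pick the action whose worst-case reward is highest. The actions and reward functions below are purely illustrative.

```python
# Toy sketch of step 2 via conservatism: pick the action that maximises
# worst-case reward across all surviving reward extrapolations.

# Hypothetical reward extrapolations that agree on the training data but
# disagree about off-distribution behaviour.
reward_extrapolations = [
    lambda action: {"help_user": 1.0, "wirehead": 0.9,  "do_nothing": 0.1}[action],
    lambda action: {"help_user": 0.8, "wirehead": -5.0, "do_nothing": 0.1}[action],
    lambda action: {"help_user": 0.9, "wirehead": 0.0,  "do_nothing": 0.2}[action],
]

actions = ["help_user", "wirehead", "do_nothing"]

def conservative_choice(actions, extrapolations):
    """Maximin rule: rank each action by its worst-case reward."""
    return max(actions,
               key=lambda a: min(r(a) for r in extrapolations))

best = conservative_choice(actions, reward_extrapolations)
print("Conservative (maximin) choice:", best)  # -> "help_user"
```

An action like wireheading scores well under some extrapolations but terribly under others, so the maximin rule discards it. The other options listed above would swap out this rule: generalized deference, for instance, might instead ask a human whenever the extrapolations disagree strongly.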


