Why does deep learning make alignment especially difficult?

To make sure we're on the same page, let's start with a specific definition of deep learning, namely one that's relevant to why such models make alignment difficult.

Deep learning (DL) pushes neural networks to their full potential, stacking layer upon layer and steadily increasing the number of neurons per layer to solve ever more complicated tasks. Large neural networks are increasingly preferred today (GPT-3, for example) because they reduce human effort: they are simply very good at engineering features from enormous amounts of input data.
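To make the "stacking layers" idea concrete, here is a minimal sketch of a deep network in PyTorch. The dimensions and layer count are arbitrary placeholders, and real models like GPT-3 use far larger and more elaborate layers, but the principle of composing many learned layers is the same:

```python
import torch
import torch.nn as nn

# A toy "deep" network: the same linear-plus-nonlinearity block stacked
# repeatedly. The sizes here are arbitrary illustrative choices.
model = nn.Sequential(
    nn.Linear(128, 512), nn.ReLU(),   # layer 1: learned feature extraction
    nn.Linear(512, 512), nn.ReLU(),   # layer 2
    nn.Linear(512, 512), nn.ReLU(),   # layer 3
    nn.Linear(512, 10),               # output layer
)

x = torch.randn(32, 128)              # a batch of 32 arbitrary inputs
y = model(x)                          # features are learned, not hand-engineered
print(y.shape)                        # torch.Size([32, 10])
```

The point of the sketch is that no human specifies what each layer should compute; the network learns its own internal features, which is exactly what makes it hard to inspect later.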

Neural networks are black boxes, which is one significant issue for AI safety. Understanding what a neural network is actually doing to accomplish the task we set for it is difficult enough that it has become its own field, called interpretability.

The reason it’s imperative that we understand neural networks is that, at the highest level of abstraction, the network is trying to find some way to achieve the goal we set for it. The problem is that there are multiple ways to achieve the same goal. (“Get me apples from the grocery store” can easily translate to “I have been asked to buy apples from the grocery store,” but also to “I have been asked to steal apples from the grocery store.” Of course, we could simply correct this by changing the goal to “Buy apples from the grocery store.”)

As the goals we define for neural networks grow more complicated, we can no longer spell them out the way we can for simple actions. Asking an AI to preserve certain moral values, which we can’t explicitly define, is one such example. So if we want the model to be aligned with what we seek to achieve, it is vital that we understand neural networks.

And as you might infer, the larger the neural network, the more difficult it is to understand what is going on inside it to generate an output. This is what makes the inner workings of DL models so hard to decipher, and why deep learning in particular makes alignment so difficult.

Another angle on neural networks being black boxes is that our goals are mostly defined by our training distribution, which is just a large set of examples, and the loss function we specify around it. The control we have over specifying our goals explicitly is therefore quite limited, which makes it even harder to ensure the model is aligned.
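To see what "the goal is only the data plus the loss" looks like in practice, here is a minimal, hypothetical training loop (the model, data, and loss are arbitrary stand-ins). Notice that the only places our intent appears at all are the example data and the loss function; nothing more precise is ever written down:

```python
import torch
import torch.nn as nn

# Hypothetical setup: we never state the goal explicitly anywhere.
# The "goal" the network ends up pursuing is whatever minimizes this loss
# on these examples.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()              # our only handle on "what we want"
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(256, 128)               # stand-in for the training distribution
labels = torch.randint(0, 10, (256,))        # a large set of labeled examples

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)    # the goal, as far as training is concerned
    loss.backward()
    optimizer.step()
```

Anything we care about that isn't reflected in the examples or the loss, such as "don't achieve low loss in ways we'd object to", simply isn't part of the specification the network optimizes against.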