What is superposition in interpretability?


Superposition is the phenomenon in which a neural network uses the same set of parameters (for example, the same neurons) to represent several distinct, unrelated features. One proposed explanation is that the network needs to represent more features than it has neurons, so features end up sharing them. Because the features that share a neuron need not be related to one another, superposition makes neural networks considerably harder to interpret. It is closely related to polysemantic neurons, which are neurons that respond to multiple unrelated features.
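To make the idea concrete, here is a minimal toy sketch, not taken from any particular model. It assumes five hypothetical features are stored as evenly spaced directions in a two-neuron activation space; the names `feature_directions`, `W`, `h`, and `readout` are illustrative only. It shows that each feature can still be read back out, but only with interference from the others, and that each neuron ends up carrying weight for several features.

```python
import numpy as np


def feature_directions(n_features: int) -> np.ndarray:
    """Return unit vectors evenly spaced around a circle.

    Row i is the 2-D direction assigned to (hypothetical) feature i,
    so the result has shape (n_features, 2).
    """
    angles = 2 * np.pi * np.arange(n_features) / n_features
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)


# Assumption for illustration: 5 features must share only 2 neurons.
W = feature_directions(5)

# Activate feature 0 alone and write it into the 2-neuron space.
x = np.zeros(5)
x[0] = 1.0
h = x @ W  # neuron activations, shape (2,)

# Read every feature back by projecting onto its own direction.
readout = h @ W.T  # shape (5,)

# Feature 0 is recovered with value 1, but the other features also get
# nonzero readouts even though they were never active.
interference = np.abs(readout[1:]).max()

# Count how many features have nonzero weight on each neuron.
# More than one feature per neuron is what makes a neuron polysemantic.
features_per_neuron = (np.abs(W) > 1e-9).sum(axis=0)

print("readout:", np.round(readout, 3))
print("largest interference:", round(float(interference), 3))
print("features per neuron:", features_per_neuron)
```

In this sketch no single neuron corresponds to a single feature, so inspecting neurons one at a time would not reveal what the toy "network" represents. That is the difficulty superposition creates for interpretability.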

What are polysemantic neurons?

What is feature visualization?

How much can we learn about AI with interpretability tools?
