In neural networks, neurons that simultaneously represent two completely unrelated features (e.g. a neuron in a computer vision model that activates for both dogs and airplanes), called “polysemantic neurons”, are a problem for interpretability. The usual story for why polysemanticity happens (see this Anthropic paper: https://arxiv.org/abs/2209.10652) is that there are simply more useful features than there are neurons, so the network is forced to cram the features into fewer dimensions. We call this “necessary polysemanticity”.

Another possible explanation is that polysemanticity sometimes happens incidentally, because of how the random weights are initialized. In general, the reason neural networks learn anything at all is that at initialization, by random chance, some neuron happens to be very slightly correlated with one of the features that matter, and gradient descent amplifies this correlation. So if, at the start, one neuron happens to be the most correlated neuron with both dogs and airplanes, then (depending on the specifics of the learning algorithm and the data) this may remain the case throughout training, leaving “incidental polysemanticity” at the end.
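To make the amplification story concrete, here is a minimal toy sketch (illustrative only, not the paper's exact experiments): a tiny ReLU autoencoder with tied weights and an L1 sparsity penalty on its hidden activations, trained on one-hot features by full-batch gradient descent. Across random seeds, it checks how often the hidden neuron that ends up “owning” a feature is the one that was most aligned with it at initialization, and how often two features end up sharing the same winning neuron, i.e. incidental polysemanticity. The model and all hyperparameters here are assumptions chosen for illustration.

```python
# Toy sketch (illustrative assumptions, not the paper's exact setup):
# tied-weight ReLU autoencoder with an L1 penalty on hidden activations,
# trained on one-hot features with full-batch gradient descent.
import numpy as np

def run(seed, n_features=2, n_hidden=4, lr=0.1, lam=0.01, steps=5000):
    rng = np.random.default_rng(seed)
    X = np.eye(n_features)                                  # one-hot inputs as columns
    W = rng.normal(scale=0.1, size=(n_hidden, n_features))  # encoder; decoder is W.T
    init_winner = W.argmax(axis=0)                          # most-aligned neuron per feature at init
    for _ in range(steps):
        Z = W @ X                             # pre-activations, shape (n_hidden, n_features)
        H = np.maximum(Z, 0.0)                # ReLU hidden activations
        E = W.T @ H - X                       # reconstruction error (tied decoder W.T)
        grad_dec = H @ (2.0 * E.T)            # gradient through the decoder path
        grad_enc = ((W @ (2.0 * E) + lam) * (Z > 0)) @ X.T  # encoder path + L1 subgradient
        W -= lr * (grad_dec + grad_enc)
    final_winner = np.maximum(W @ X, 0.0).argmax(axis=0)    # neuron with largest activation per feature
    return init_winner, final_winner

n_runs = 50
kept, shared = 0, 0
for seed in range(n_runs):
    init_w, final_w = run(seed)
    kept += np.sum(final_w == init_w)         # features whose initial winner survived training
    shared += int(final_w[0] == final_w[1])   # both features end up on the same neuron
print(f"initial winner kept: {kept}/{2 * n_runs} features")
print(f"runs ending with a shared (polysemantic) neuron: {shared}/{n_runs}")
```

The interesting comparison is between the initial and final “winners”: when the amplification story holds, the neuron that wins a feature is typically the one that was already ahead at initialization, and whenever the same neuron is ahead for both features, the run ends up polysemantic even though there are more neurons than features.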
You can find a more detailed theoretical treatment of how incidental polysemanticity arises in the paper below:
@misc{lecomte2024causes,
  title={What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes},
  author={Victor Lecomte and Kushal Thaman and Rylan Schaeffer and Naomi Bashkansky and Trevor Chow and Sanmi Koyejo},
  year={2024},
  eprint={2312.03096},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}