Awesome Sparse Autoencoders


Awesome Sparse Autoencoders is a curated list of papers, models, explainers and libraries for Dictionary Learning With Sparse Autoencoders.

Contents

- About
- Architecture & Theory
- Motivating Results
- Training Notes & Intuitions
- Open Source Libraries
- Scaling Up & Scaling Laws
- Trained Autoencoders
- Other

About

Dictionary Learning with Sparse Autoencoders (SAEs) is a technique for disentangling a model's intermediate activations into more monosemantic feature representations, which can be used for interpretability and steering.
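For orientation, here is a minimal sketch of the standard setup: an overcomplete linear encoder with a ReLU non-linearity, a linear decoder, and an L1 penalty on the feature activations. The dimensions and coefficients are illustrative placeholders, not values from any particular paper or codebase.

```python
# A minimal, illustrative sparse autoencoder over residual-stream activations.
# Dimensions and the L1 coefficient are placeholders, not from any specific paper.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, dict_size: int = 768 * 8):
        super().__init__()
        self.encoder = nn.Linear(d_model, dict_size)
        self.decoder = nn.Linear(dict_size, d_model)

    def forward(self, acts: torch.Tensor):
        # Encode into an overcomplete, non-negative feature basis, then reconstruct.
        features = torch.relu(self.encoder(acts))
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(acts, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse feature firing.
    mse = (reconstruction - acts).pow(2).mean()
    l1 = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * l1
```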


In this repo, links are organised by topic and have explanations so you can decide what you would like to read. Especially recommended links are starred 🌟, and links which play a part in the current best recipe for SAEs are marked with a chef emoji 🧑‍🍳.

Star this repository to see the latest developments in this research field.

We accept contributions! We strongly encourage researchers & practitioners to make pull requests with papers, approaches and explanations that they feel others in the community would benefit from 🤗

Architecture & Theory

Prism, Notion: Linus Lee (2024) pdf

Linus Lee's strategy for training Sparse Autoencoders. The main ideas are similar to those in the rest of the SAE literature, but he approaches the problem from a different angle, focusing on building tools for thought and on debugging models.

🧑‍🍳 🌟 Scaling Sparse Autoencoders, OpenAI: Gao et al (2024) pdf blog code

OpenAI's Superalignment team (RIP) show that sparse autoencoders can be scaled to large models like GPT-4 and find some interesting features. The main contribution of this work, though, is the Top-K sparsity mechanism used as an activation function, which means the L1 sparsity auxiliary loss is no longer needed.

Their codebase is very clean but only shows how to use, rather than train, SAEs. Note their Triton kernels for sparse-dense matrix multiplication, which speed up training for all SAEs - a great service to the community! (They use TransformerLens for activation caching.)
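To make the idea concrete, here is a hedged sketch of a Top-K activation. This is the core mechanism the paper describes, but the value of k and the surrounding shapes are placeholders rather than OpenAI's implementation.

```python
# An illustrative Top-K activation in the spirit of Gao et al. (2024): keep only the
# k largest pre-activations per example and zero the rest, enforcing the target
# sparsity directly so no L1 auxiliary loss is needed. k is a placeholder value.
import torch


def topk_activation(pre_acts: torch.Tensor, k: int = 32) -> torch.Tensor:
    values, indices = torch.topk(pre_acts, k, dim=-1)  # k largest latents per example
    features = torch.zeros_like(pre_acts)              # everything else stays at zero
    features.scatter_(-1, indices, values)             # write back only the top-k values
    return features
```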

🌟 Gated Sparse Autoencoders, DeepMind: Rajamanoharan et al (2024) pdf

To address the shrinkage problem in SAEs, DeepMind introduce a gating mechanism inspired by Shazeer-style GLUs. Theoretically, this allows the autoencoder to separate predicting the magnitude of a feature's activation from deciding whether the feature fires at all. This seems to work, and the benefits go beyond what addressing shrinkage alone would explain, suggesting that Gated SAEs also simply provide better encoder representations.
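A rough sketch of the gating idea follows. The paper additionally ties the two paths together via a shared, rescaled weight matrix and uses an auxiliary loss to train through the binary gate; this simplified version keeps the paths separate and omits those training details.

```python
# An illustrative gated encoder in the spirit of Rajamanoharan et al. (2024): one path
# decides *whether* a feature fires, another estimates *how strongly* it fires. The
# paper shares weights between the paths and adds an auxiliary loss for the binary
# gate; this sketch keeps the paths fully separate for clarity.
import torch
import torch.nn as nn


class GatedEncoder(nn.Module):
    def __init__(self, d_model: int, dict_size: int):
        super().__init__()
        self.gate = nn.Linear(d_model, dict_size)       # which features fire at all
        self.magnitude = nn.Linear(d_model, dict_size)  # how strongly they fire

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        fires = (self.gate(acts) > 0).float()           # binary on/off decision
        strength = torch.relu(self.magnitude(acts))     # non-negative magnitude estimate
        return fires * strength                         # magnitude only where the gate is open
```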

Group Sparse Autoencoders, Harvard: Theodosis et al (2023) pdf

The authors show that by grouping features together in the sparsity constraint they're better able to learn features which naturally compose together. They test on image datasets but this could be applied to language models too.
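A hedged sketch of what a group-sparsity penalty can look like: a standard group-lasso term over fixed feature groups. The grouping scheme and group size here are illustrative rather than the paper's exact formulation.

```python
# An illustrative group-sparsity (group lasso) penalty: features are partitioned into
# fixed groups and the penalty sums the L2 norm of each group, encouraging whole
# groups of features to switch on or off together. Group size is a placeholder.
import torch


def group_sparsity_penalty(features: torch.Tensor, group_size: int = 8) -> torch.Tensor:
    batch, dict_size = features.shape
    assert dict_size % group_size == 0, "dictionary size must divide evenly into groups"
    grouped = features.view(batch, dict_size // group_size, group_size)
    return grouped.norm(dim=-1).sum(dim=-1).mean()  # sum of group norms, averaged over the batch
```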

Motivating Results

🌟 Transformer Visualisation via Dictionary Learning, FAIR: Yun et al (2021) pdf

An early attempt to break down embeddings into components. A useful paper for understanding how the Interpretability literature fits into previous research.

Sparse Autoencoders Find Highly Interpretable Features in Language Models: Cunningham et al (2023) pdf code

The first clear demonstration that SAEs using an L1 penalty, rather than other dictionary learning methods, give rise to interpretable features.

Training Notes & Intuitions

Training Tricks for 1-Layer SAEs, Conmy (2023) blog

The most important things to get right seem to be the SAE width, the resampling procedure, and the learning rate. The post also suggests that re-warming the learning rate after resampling is valuable.
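A rough sketch of what re-warming the learning rate after a resampling event might look like; the schedule shape and all numbers are illustrative, not the post's exact recipe.

```python
# An illustrative learning-rate schedule that re-warms after a resampling step: ramp the
# learning rate back up from a fraction of its base value over a fixed number of steps.
# All numbers here are placeholders rather than the post's recommendations.
def lr_with_rewarm(step: int, base_lr: float = 3e-4,
                   last_resample_step: int | None = None,
                   warmup_steps: int = 1000) -> float:
    if last_resample_step is None or step - last_resample_step >= warmup_steps:
        return base_lr
    frac = (step - last_resample_step) / warmup_steps
    return base_lr * (0.1 + 0.9 * frac)  # ramp from 10% of base_lr back to full
```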

Open Source Libraries

SAE Library, Eleuther: Belrose et al (2024) code

A highly readable library for distributed training of sparse autoencoders with a great API. It mostly uses standard methods and isn't as customisable as some alternatives - more of a training library than a research one.

SAELens: Joseph Bloom, Curt Tigges and David Chanin (2024) code

A library designed to help researchers train sparse autoencoders, analyze them with a focus on mechanistic interpretability, and generate insights to aid in developing safe and aligned AI systems.

Scaling Up & Scaling Laws

🌟 Scaling Monosemanticity to Claude Sonnet, Anthropic: Templeton et al (2024) website blog

Anthropic show that the SAE approach scales to frontier-scale models. A useful proof of concept with some interesting phenomenological results.

There have also been a few criticisms of the paper in the community.

Trained Autoencoders

Pythia Features, Northeastern: Marks et al (2023) blog

One of the first open source SAE dictionaries available. Released alongside training code.

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small: Bloom (2024) blog

A set of 12 SAEs for the GPT2-Small residual stream. The post gives a fairly comprehensive write-up of the specific methods used.

Other

List of Favourite Mech Interp Papers, Neel Nanda (2024) blog

Neel Nanda's list of important papers in Mechanistic Interpretability. There's a section on Sparse Autoencoders with links to many of the same papers as above, plus Neel's short takes on them. Other adjacent MechInterp areas are also covered and may be worth reading.

How To Report Better SAE Performance, Bostock (2024) blog

The author suggests that to present the SAE Pareto frontier more clearly (i.e. the trade-off between sparsity and reconstruction error), it's beneficial to fit a Hill curve.
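A hedged sketch of what fitting such a curve might look like: the Hill functional form is standard, but the fitting code below is an assumption about how one might implement the suggestion, not the author's own code.

```python
# An illustrative Hill-curve fit to (sparsity, reconstruction) points on an SAE Pareto
# frontier. The functional form and the use of scipy.optimize.curve_fit are assumptions
# about how to implement the post's suggestion, not taken from the post itself.
import numpy as np
from scipy.optimize import curve_fit


def hill(x, y_max, k, n):
    # Standard Hill equation: a saturating curve with half-maximum at k and steepness n.
    return y_max * x ** n / (k ** n + x ** n)


def fit_pareto_frontier(l0: np.ndarray, variance_explained: np.ndarray):
    params, _ = curve_fit(hill, l0, variance_explained, p0=[1.0, 10.0, 1.0], maxfev=10_000)
    return params  # (y_max, k, n)
```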

Not All Language Model Features Are Linear, MIT: Engels et al (2024) pdf

SAEs generally lean on the Linear Representation Hypothesis. This paper shows that it does not always hold: some language model features are not linear. They give examples of features which live on low-dimensional manifolds rather than being strictly linear, and hypothesise that this could happen more widely. Very interesting work: it suggests that for the last few 9s of reliability we may need to think beyond strict linearity. They use SAEs to find the non-linear features.

🌟 Automated Interpretability, OpenAI: Bills et al (2023) website blog code forked code

Once you've trained your SAE, you typically want to understand what it has learned by checking how interpretable the features are to humans or LLMs. This library allows you to do that, and this work introduced the basic framework for automated interpretability.
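A rough sketch of the core loop follows, with a hypothetical `query_llm` helper standing in for a real model API; the prompt wording is illustrative, not taken from the paper.

```python
# An illustrative sketch of the basic automated-interpretability loop: show an LLM a
# feature's top-activating text snippets (with activation strengths) and ask for a short
# natural-language explanation. `query_llm` is a hypothetical stand-in for whatever
# model API you use; the prompt wording is not the paper's.
def build_explanation_prompt(snippets_with_activations: list[tuple[str, float]]) -> str:
    lines = [
        "Below are text excerpts where a neural network feature activates,",
        "with the activation strength of the most active token in brackets.",
        "",
    ]
    for text, activation in snippets_with_activations:
        lines.append(f"[{activation:.2f}] {text}")
    lines += ["", "In one sentence, what concept does this feature appear to represent?"]
    return "\n".join(lines)


# explanation = query_llm(build_explanation_prompt(top_snippets))  # hypothetical API call
```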

Using AI to Augment Human Intelligence, Google Brain: Carter & Nielsen (2017) website

A paper which thinks about how representations that can be manipulated are useful for humans. It's interesting for thinking about applications of SAEs beyond enumerative safety.




Thanks for reading! If you have any suggestions or corrections, please submit a pull request. And please hit the star button to show your appreciation.

Citing This Post

If you'd like to cite this article, please use:

@misc{ayonrinde_2023_awesome_sparse_autoencoders,
  author = "Kola Ayonrinde",
  title = "Awesome Sparse Autoencoders",
  year = 2024,
  publisher = "GitHub",
  url = "https://github.com/koayon/awesome-sparse-autoencoders/"
}
