- Re-learning the introduction to Transformers: finished the three videos in "What is a Transformer" (along with the code)
- proof: Clean_Transformer_Demo_Template.ipynb
- Time: ~4h, including finishing the code (I have to say some of the bugs were really annoying)
- The Exploratory Analysis Demo tutorial
- proof: Exploratory_Analysis_Demo.ipynb
- Time: ~2h, since no coding was needed
- General paper reading on mech interp (since I already did a survey on it about a year ago, this was quite easy)
- papers: see papers.md
- Time: alas, I didn't keep track, because I went through them in many small chunks of time
- Early stage reading and thinking: paper.md
- Hands-on and experiments: hands_on.md
Note that these two markdowns also serve as notes to myself, so they may contain some blunt words and a bit of rambling.
I believe I have explained every file, what I did with it, and what I got from it, in these three markdowns (including this one), so the write-up should be fairly self-contained.
Training accuracy (win rate per checkpoint):
ckpt8, win rate 38.0%
ckpt16, win rate 46.0%
ckpt24, win rate 52.0%
ckpt32, win rate 54.0%
ckpt40, win rate 58.0%
ckpt48, win rate 54.0%
ckpt56, win rate 58.0%
ckpt64, win rate 64.0%
ckpt72, win rate 64.0%
ckpt80, win rate 66.0%
ckpt88, win rate 66.0%
ckpt96, win rate 68.0%
ckpt104, win rate 70.0%
ckpt112, win rate 74.0%
ckpt120, win rate 78.0%
ckpt128, win rate 84.0%
ckpt136, win rate 88.0%
ckpt144, win rate 90.0%
ckpt152, win rate 90.0%
ckpt160, win rate 90.0%
ckpt168, win rate 90.0%
ckpt176, win rate 92.0%
ckpt184, win rate 94.0%
Note that ckpt0 (the original GPT-2) also has a win rate of 38.0%.
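Roughly, the evaluation behind this log looks like the sketch below, using TransformerLens; the checkpoint paths and the single eval prompt here are placeholders for illustration, not the exact ones used in the notebook.

```python
# Minimal sketch of the per-checkpoint win-rate evaluation.
# The checkpoint paths and the eval prompt below are placeholders.
import torch
from transformer_lens import HookedTransformer

eval_prompts = ["When Jim and John went to the store, John gave a drink to"]
eval_answers = [" Jim"]

def win_rate(model: HookedTransformer, prompts, answers) -> float:
    """Fraction of prompts where the top-1 next-token prediction matches the answer."""
    hits = 0
    for prompt, answer in zip(prompts, answers):
        logits = model(prompt, return_type="logits")  # [batch, pos, d_vocab]
        if logits[0, -1].argmax().item() == model.to_single_token(answer):
            hits += 1
    return hits / len(prompts)

for step in range(0, 192, 8):  # ckpt0 is the original gpt2
    model = HookedTransformer.from_pretrained("gpt2")
    if step > 0:
        # Placeholder path: load the fine-tuned weights for this checkpoint.
        state = torch.load(f"checkpoints/ckpt{step}.pt", map_location="cpu")
        model.load_state_dict(state, strict=False)
    print(f"ckpt{step}, win rate {win_rate(model, eval_prompts, eval_answers):.1%}")
```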
The logit diff shows that layer 11 is crucial to the new circuit.
The Logit Diff in Patched Head Pattern plot shows a phase-transition-style change over the course of fine-tuning.
The other GIFs in the gif folder show that there is indeed a new circuit replacing the old one. They also suggest that the formation of the new circuit is quite complicated: the model seems to explore and try several configurations before settling on an answer.
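For context, the measurement behind these observations is roughly the following activation-patching sketch (the prompts, names, and exact metric here are illustrative assumptions; the notebook's setup may differ): each head's attention pattern from the clean run is patched into the corrupted run, and the restored logit diff is recorded per head.

```python
# Sketch of per-head attention-pattern patching with a logit-diff metric.
# Prompts and answer tokens are illustrative, not the exact ones from the notebook.
import torch
from functools import partial
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "When Jim and John went to the store, John gave a drink to"
corrupt_prompt = "When Jim and John went to the store, Jim gave a drink to"
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

answer_tok = model.to_single_token(" Jim")
wrong_tok = model.to_single_token(" John")

def logit_diff(logits: torch.Tensor) -> torch.Tensor:
    # Correct-minus-wrong logit at the final position.
    return logits[0, -1, answer_tok] - logits[0, -1, wrong_tok]

_, clean_cache = model.run_with_cache(clean_tokens)

def patch_head_pattern(pattern, hook, head):
    # Overwrite one head's attention pattern with its value from the clean run.
    pattern[:, head] = clean_cache[hook.name][:, head]
    return pattern

results = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        patched_logits = model.run_with_hooks(
            corrupt_tokens,
            fwd_hooks=[(utils.get_act_name("pattern", layer),
                        partial(patch_head_pattern, head=head))],
        )
        results[layer, head] = logit_diff(patched_logits)

# The row results[11] standing out is what "layer 11 is crucial" refers to;
# re-running this per checkpoint and animating the heatmap gives the GIFs.
```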
- Honestly, I'm reluctant to write this section because there is so much left to explore, but alas, time is up: I have already spent more than 10 hours on this experiment and its coding (not counting the paper reading and writing this conclusion), and the application deadline is coming up, so I will stop here and draw a conclusion.
- Both hypotheses are partly correct: the fine-tuning process does transform the original circuit, and it adds new heads (as well as mechanisms that call the MLP layers more often) as the task becomes more complicated within a similar context. Also, the new circuit forms through a kind of exploration, where a number of heads heuristically try to take part in the new circuit before the process converges; this can be seen from both the GIFs and the accuracy log above.
- The task is really easy. I don't actually think that is a bad thing: the larger the language model, the harder it is to design a task that the model handles in its easy form but fails at in harder ones (and the task itself has to be simple enough to interpret!). Moreover, I think this easy-task selection helped me get slightly different results from the paper Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking. For large models, most circuits are already developed during pretraining, so it is quite useful to use small models to look at this process.
- Regrets, open questions, and future work
- A big flaw in the experiment: the corruption only ever targets the first of the two subjects (if the subjects are Jim and John, the answer is always Jim, and the name in the second position never appears as the answer under corruption), so the mechanism for the second position is not well studied (see the sketch after this list).
- The mechanism by which the new circuit is built is still open. How exactly does the new circuit form? From my experiments it looks like there is a bit of chaos in the model, as if it were adding noise and exploring different strategies, but more ablations and deeper study are needed to pin this down.
- The work lacks theory and mathematical derivations. I simply didn't have the time and energy to learn such things within 10 hours... and it was too late when I realized that might be fun too.
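As a concrete illustration of what a fix for the first flaw could look like, here is a small sketch of a more symmetric prompt set in which the answer can sit in either subject position; the template and names are made up for illustration, not the ones used in the experiment.

```python
# Hypothetical symmetric corruption set: the answer alternates between the
# first and second subject position, so both positions get exercised.
import itertools

TEMPLATE = "When {a} and {b} went to the store, {giver} gave a drink to"
names = ["Jim", "John", "Mary", "Tom"]

pairs = []
for a, b in itertools.permutations(names, 2):
    clean = TEMPLATE.format(a=a, b=b, giver=b)    # answer is the first name, a
    flipped = TEMPLATE.format(a=a, b=b, giver=a)  # answer is the second name, b
    pairs.append({
        "clean": clean, "clean_answer": f" {a}",
        "corrupt": flipped, "corrupt_answer": f" {b}",
    })

# With both orderings present, the patching analysis above can be repeated for
# answers in the second position as well as the first.
```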
- What I'm proud of
- Coming up with this question (even though it has been explored several times before); my brain is now full of training dynamics. What if we combined these dynamics with algorithm design? I'm quite confident it would help performance compared to black-box algorithms.
- Finishing all of this in ~10h, especially the part about fine-tuning the models and using TransformerLens to watch the dynamics. Such a sense of accomplishment!
- Not giving up, even though I reached the edge of giving up several times. LOL
That's all of my research, roughly 12-13 hours in total. There are surely flaws, errors, and many unsatisfying places, but so far I'm quite happy with it!