Extremely unbalanced attacker/defender win rate and strange behavior of SFT'ed model #5
Hi, sorry for the mistake. We have fixed the imbalance bug by adding |
For the empty-response problem, I suspect there is a bug in the response-collecting process. Could you describe how you accelerate |
I have already applied the fix for the former issue, but the win rate is still very unbalanced. Is 50-50 an expected outcome? I'm spinning up the backend using vLLM, which speeds generation up from 10h to 0.5h |
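For context, a minimal sketch of what such a vLLM backend can look like, assuming offline batched generation; the model path, sampling settings, and prompts below are illustrative, not this repo's actual configuration:

```python
# Minimal sketch of batched self-play generation with vLLM (offline mode).
# Model path and sampling settings are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=256)

# One prompt per game turn; vLLM batches them internally.
prompts = ["<attacker prompt 1>", "<attacker prompt 2>"]
for out in llm.generate(prompts, sampling):
    print(out.outputs[0].text)
```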
It is not a 50-50 win rate in expectation. With imitation learning, the LLM only learns how to play the game; there is no regularizer guaranteeing that the attacker and defender play at the same level. But if you continue SPAG-1 training on the im_model, the win rate will get closer to 50-50 in expectation.
Still not sure where the bug is for the empty replies... Maybe print out each prompt and its generation in the dialog-collection loop?
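For example, a hedged debugging sketch along those lines; `generate_response` and `dialog_prompts` are stand-ins for whatever backend and prompt list the collection loop actually uses:

```python
# Hypothetical debugging sketch: log every prompt/response pair in the
# dialog-collection loop so empty generations can be traced back.
def generate_response(prompt: str) -> str:
    """Stand-in for the actual backend call (vLLM, HF generate, ...)."""
    return ""  # placeholder

dialog_prompts = ["<prompt 1>", "<prompt 2>"]  # assumed prompts for one game

for turn, prompt in enumerate(dialog_prompts):
    response = generate_response(prompt)
    print(f"[turn {turn}] prompt:   {prompt!r}")
    print(f"[turn {turn}] response: {response!r}")
    if not response.strip():
        print(f"[turn {turn}] WARNING: empty response")
```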
@Linear95 This is one example I get using the original |
|
Some hypotheses:
|
Is there any chance that the original |
@Linear95 Here's one sample I found in
|
This is because we only considered exact target-word inference in the GPT-4 data collection. Later we found this rule was too strict for the defender, so in the self-play data collection we also accepted word variations of tense and number. However, this does not make a significant change in the game outcomes. We just applied the current judgment in |
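As a rough illustration (not the repo's exact rule), a suffix-based check for tense/number variants might look like this:

```python
# Hedged sketch of a looser target-word judgment that accepts simple
# tense/number variants; the repo's actual rule may differ.
def word_variants(word: str) -> set:
    w = word.lower()
    variants = {w, w + "s", w + "es", w + "ing"}
    if w.endswith("e"):
        variants |= {w + "d", w[:-1] + "ing"}
    else:
        variants.add(w + "ed")
    if w.endswith("y"):
        variants |= {w[:-1] + "ies", w[:-1] + "ied"}
    return variants

def correct_inference(prediction: str, target: str) -> bool:
    # The defender's inferred word counts as correct if it matches the
    # target word or one of its variants.
    return prediction.strip().lower() in word_variants(target)
```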
I have provided all the self-play game episodes in the |
For the third hypothesis, there is a risk that the tokenizer is initialized differently in | I just defined a new function for special-token initialization and called it in both |
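A minimal sketch of such a shared initializer, assuming a HuggingFace tokenizer; the function name, pad-token policy, and model path are illustrative:

```python
# Hedged sketch: one shared special-token initializer, called from both the
# training and the generation script so the two stay identical.
from transformers import AutoTokenizer

def init_special_tokens(tokenizer):
    # Illustrative policy: reuse EOS as PAD if no pad token is defined.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer

# Called identically in both scripts:
tokenizer = init_special_tokens(AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf"))
```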
Yes, I'll try that. Thank you!
@Linear95 I tried to replicate the results. Here's the summary: during generation, passing min_tokens=2 avoids the empty responses observed before, and as a result the defender win rate increases. Here's the result after IM: 46.5% +- 0.4%. There seems to be no improvement beyond noise in evaluation. I also observed during training that the defender win rate increases slowly; the weight decreases from around 4.29 (im) -> 3.92 (spag-1) -> 3.49 (spag-2). (For reference, the author-provided data is 4.28 (im) -> 4.27 (spag-1) -> 3.56 (spag-2).)
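For reference, the `min_tokens` fix amounts to one extra field in vLLM's sampling parameters (the other values here are illustrative):

```python
from vllm import SamplingParams

# min_tokens=2 forces at least two generated tokens, ruling out empty replies.
sampling = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=256, min_tokens=2)
```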
Also wanna add some results for
I see clear improvement on ARC and BBH, so it at least indicates the code is working as expected, though I don't observe reliable improvement on MMLU.
We didn't claim that SPAG can significantly improve MMLU. Actually, I don't think MMLU is a reasoning benchmark; it evaluates general language capability. If you check our paper, MMLU even gets worse with Baichuan-2.
Hi @PinzhengWang322 I've uploaded:
SFT ckpt: https://huggingface.co/ThWu/spag_im_ckpt
And my code is public at https://github.com/thwu1/spag
Thank you so much for sharing these resources! |
In the gpt4_game_top30k_results.json file, there are 20067 attacker-win samples and 3287 defender-win samples, with att/def ~ 6.1. After SFT'ing the model using |
here are the result statistics of the generated self-play histories (p.s. I optimized the play_llm_game code for speedup): |
The att/def ratio of ~11.58 is even higher, indicating more imbalance!
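For what it's worth, a sketch of how such counts can be computed, assuming each episode record in the results JSON carries a winner field (the actual schema may differ):

```python
# Hedged sketch: count attacker/defender wins in a results file.
# The "winner" field name is an assumption about the JSON schema.
import json

with open("gpt4_game_top30k_results.json") as f:
    episodes = json.load(f)

att = sum(1 for ep in episodes if ep.get("winner") == "attacker")
dfd = sum(1 for ep in episodes if ep.get("winner") == "defender")
print(f"attacker wins: {att}, defender wins: {dfd}, att/def ~ {att / dfd:.2f}")
```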
More strangely, around 10-15% of all generated self-play histories after the first SFT are empty responses (by direct eyeballing). For instance: |
Could the authors confirm the imbalance and the strange SFT'ed behavior? @Linear95 @underwoodnoble