Is there any code for preparing the dataset on which StarCoder was originally trained? #56
Comments
What's interesting is that after finetuning it still seems to work with FIM, so finetuning did not make the model completely "forget" FIM :)
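For anyone who wants to verify this on their own checkpoint, a quick sanity check is to send the model a PSM-style FIM prompt and see whether it fills the hole. A minimal sketch using `transformers`; the checkpoint path and generation settings here are placeholders:

```python
# Quick check that a (finetuned) checkpoint still handles FIM: build a
# PSM-style prompt and see if the completion fills the hole.
# The checkpoint path is a placeholder; loading needs the usual GPU/VRAM,
# and device_map="auto" requires the accelerate package.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"  # or the path to your finetuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

prompt = (
    "<fim_prefix>def fibonacci(n):\n    "
    "<fim_suffix>\n    return fibonacci(n - 1) + fibonacci(n - 2)<fim_middle>"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
# Print only the newly generated middle part.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
```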
May I ask what hardware you used for finetuning?
8× A100. I adapted the train scripts from the

P.S. Regarding my original question: I've found the FIM preparation code in the Megatron-LM codebase. It seems that's what they used to train StarCoder.
Yes, you can use the FIM preparation code in Megatron; there's also a FIM implementation here that could be easier to integrate with the current codebase. As for the data preparation, we have the code at bigcode-dataset, including how we added the special code tokens.
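For reference, the core of the FIM transform is small. Below is a minimal character-level sketch loosely modeled on the Megatron-LM-style implementation, assuming StarCoder's `<fim_prefix>`/`<fim_middle>`/`<fim_suffix>` tokens; parameter names like `fim_rate` and `spm_rate` follow the Megatron convention, and the real code operates on token arrays rather than strings:

```python
import numpy as np

FIM_PREFIX = "<fim_prefix>"
FIM_MIDDLE = "<fim_middle>"
FIM_SUFFIX = "<fim_suffix>"

def fim_transform(sample: str, fim_rate: float = 0.5, spm_rate: float = 0.5,
                  rng: np.random.Generator | None = None) -> str:
    """Rewrite a training sample into FIM format with probability fim_rate."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() > fim_rate:
        return sample  # leave the sample in plain left-to-right form

    # Two random cut points split the document into prefix/middle/suffix.
    lo, hi = sorted(rng.integers(0, len(sample) + 1, size=2))
    prefix, middle, suffix = sample[:lo], sample[lo:hi], sample[hi:]

    if rng.random() < spm_rate:
        # SPM ordering: suffix shown first, then prefix, then middle.
        return f"{FIM_PREFIX}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{prefix}{middle}"
    # PSM ordering: prefix, suffix, then the middle the model must produce.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```

Applied per document with `fim_rate` around 0.5, this keeps roughly half the corpus in ordinary left-to-right form, so plain completion ability is preserved alongside infilling.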
May I ask if there are any relevant scripts or tutorials for reference?
So how much time did you spend?
I need to know how to use `<filename>`, `<fim_*>`, and the other special tokens listed in the tokenizer's special_tokens_map when preparing the dataset. I've been able to finetune StarCoder on my own code, but I haven't specially prepared the dataset for FIM, so I suspect the result is inferior, since the VSCode extension uses FIM.
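Regarding `<filename>` and the other metadata tokens: the preprocessing prepends them to each file before tokenization. A hypothetical sketch of that layout is below; the exact field order, separators, and any probability of dropping metadata should be verified against the actual code in bigcode-dataset:

```python
# Hypothetical sketch of how a raw file could be serialized with the
# metadata special tokens before tokenization. Field order, separators,
# and the chance of dropping metadata should be checked against the
# actual preprocessing in bigcode-dataset.
def format_sample(repo: str, path: str, stars: int, code: str) -> str:
    return (
        f"<reponame>{repo}"
        f"<filename>{path}"
        f"<gh_stars>{stars}\n"
        f"{code}<|endoftext|>"
    )

print(format_sample("octocat/hello-world", "src/hello.py", 42,
                    "def hello():\n    print('hi')\n"))
```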