
Is there any code for preparation of the dataset on which Starcoder has been originally trained? #56

xpl opened this issue May 30, 2023 · 6 comments


xpl commented May 30, 2023

I need to know how to use <filename>, <fim_*> and the other special tokens listed in the tokenizer's special_tokens_map when preparing the dataset.

I've been successfully able to finetune Starcoder on my own code, but I haven't specifically prepared the dataset for FIM, so I suspect the result could be inferior, since the VSCode extension uses FIM.
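
For context, this is roughly the shape of FIM prompt I assume the VSCode extension builds (prefix-suffix-middle ordering, using the <fim_*> tokens from special_tokens_map; this is my guess, not something I've verified):

```python
# Assumed PSM-style FIM prompt built from the <fim_*> tokens in
# special_tokens_map; the model is asked to generate the missing middle.
prompt = (
    "<fim_prefix>def add(a, b):\n    "
    "<fim_suffix>\n    return result\n"
    "<fim_middle>"
)
# Expected completion (the "middle"): something like "result = a + b"
```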

xpl commented May 30, 2023

What's interesting is that after finetuning it still seems to work with FIM, so finetuning at least didn't make the model completely "forget" FIM :)

@seyyedaliayati commented:

> I've been successfully able to finetune Starcoder on my own code

May I ask what hardware you used for finetuning?

xpl commented May 31, 2023

> May I ask what hardware you used for finetuning?

8× A100

I adapted the training scripts from the chat folder; that script doesn't use LoRA and eats a lot of memory. I had to decrease the batch size, otherwise it OOMed. It also OOMs when trying to resume from a checkpoint (there must be some memory inefficiencies in the restore code). Otherwise it runs pretty fast on a 330 MB dataset.

P.S. Regarding my original question: I've found the FIM preparation code in the Megatron-LM codebase. It seems that's what they used to train Starcoder.
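
For anyone else digging into that code, here is a rough sketch of what I think the document-level FIM transform does (the fim_rate/spm_rate names and the character-level split are my paraphrase, not a copy of the Megatron-LM implementation, so please double-check against the original):

```python
import numpy as np

def fim_permute(sample: str, rng: np.random.RandomState,
                fim_rate: float = 0.5, spm_rate: float = 0.5) -> str:
    # With probability (1 - fim_rate), leave the document as plain left-to-right text.
    if rng.rand() > fim_rate or len(sample) < 2:
        return sample
    # Pick two cut points and split into prefix / middle / suffix.
    lo, hi = sorted(rng.randint(0, len(sample) + 1, size=2))
    prefix, middle, suffix = sample[:lo], sample[lo:hi], sample[hi:]
    if rng.rand() < spm_rate:
        # SPM ordering: suffix first, then prefix and middle are generated together.
        return f"<fim_prefix><fim_suffix>{suffix}<fim_middle>{prefix}{middle}"
    # PSM ordering: prefix, suffix, then the middle as the part to predict.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
```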

loubnabnl (Contributor) commented Jun 1, 2023

Yes, you can use the FIM preparation code in Megatron; there's also a FIM implementation here that could be easier to integrate with the current codebase. As for the data preparation, we have the code at bigcode-dataset, including how we added the special code tokens.
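
For reference, a simplified sketch of the file-level formatting with the metadata tokens might look like this (the token names come from the tokenizer's special tokens; the exact layout and how star counts are bucketed should be checked against the bigcode-dataset code):

```python
# Simplified sketch of prepending repository metadata with special tokens;
# verify the exact ordering and star bucketing against bigcode-dataset.
def format_file(repo_name: str, file_path: str, stars: str, content: str) -> str:
    return (
        f"<reponame>{repo_name}"
        f"<filename>{file_path}"
        f"<gh_stars>{stars}\n"
        f"{content}<|endoftext|>"
    )

example = format_file("org/repo", "src/utils.py", "100-1000", "print('hello')\n")
```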

@1920853199 commented:

> May I ask what hardware you used for finetuning?
>
> 8× A100
>
> I adapted the training scripts from the chat folder; that script doesn't use LoRA and eats a lot of memory. I had to decrease the batch size, otherwise it OOMed. It also OOMs when trying to resume from a checkpoint (there must be some memory inefficiencies in the restore code). Otherwise it runs pretty fast on a 330 MB dataset.
>
> P.S. Regarding my original question: I've found the FIM preparation code in the Megatron-LM codebase. It seems that's what they used to train Starcoder.

May I ask if there are any relevant scripts and tutorials for reference?

@1920853199 commented:

So how much time did you spend?
