
Is there any code for preparation of the dataset on which Starcoder has been originally trained? #56

xpl opened this issue May 30, 2023 · 6 comments


xpl commented May 30, 2023

I need to know how to use <filename>, <fim_*> and the other special tokens listed in the tokenizer's special_tokens_map when preparing the dataset.

I've been successfully able to finetune Starcoder on my own code, but I haven't specifically prepared the dataset for FIM, so I suspect the result could be inferior, since the VSCode extension uses FIM.
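
For context, this is roughly the shape of FIM prompt I assume the VSCode extension builds (prefix-suffix-middle ordering, using the <fim_*> tokens from special_tokens_map; this is my guess, not something I've verified):

```python
# Assumed PSM-style FIM prompt built from the <fim_*> tokens in
# special_tokens_map; the model is asked to generate the missing middle.
prompt = (
    "<fim_prefix>def add(a, b):\n    "
    "<fim_suffix>\n    return result\n"
    "<fim_middle>"
)
# Expected completion (the "middle"): something like "result = a + b"
```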

xpl commented May 30, 2023

What's interesting is that after finetuning it still seems to work with FIM, so finetuning at least didn't make the model completely "forget" FIM :)

@seyyedaliayati commented:

> I've been successfully able to finetune Starcoder on my own code

May I ask what hardware you used for finetuning?

xpl commented May 31, 2023

> May I ask what hardware you used for finetuning?

8× A100

I adapted the training scripts from the chat folder; that script doesn't use LoRA and eats a lot of memory. I had to decrease the batch size, otherwise it OOMed. It also OOMs when trying to resume from a checkpoint (there must be some memory inefficiencies in the restore code). Otherwise it runs pretty fast on a 330 MB dataset.

P.S. Regarding my original question: I've found the FIM preparation code in the Megatron-LM codebase. It seems that's what they used to train Starcoder.
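
For anyone else digging into that code, here is a rough sketch of what I think the document-level FIM transform does (the fim_rate/spm_rate names and the character-level split are my paraphrase, not a copy of the Megatron-LM implementation, so please double-check against the original):

```python
import numpy as np

def fim_permute(sample: str, rng: np.random.RandomState,
                fim_rate: float = 0.5, spm_rate: float = 0.5) -> str:
    # With probability (1 - fim_rate), leave the document as plain left-to-right text.
    if rng.rand() > fim_rate or len(sample) < 2:
        return sample
    # Pick two cut points and split into prefix / middle / suffix.
    lo, hi = sorted(rng.randint(0, len(sample) + 1, size=2))
    prefix, middle, suffix = sample[:lo], sample[lo:hi], sample[hi:]
    if rng.rand() < spm_rate:
        # SPM ordering: suffix first, then prefix and middle are generated together.
        return f"<fim_prefix><fim_suffix>{suffix}<fim_middle>{prefix}{middle}"
    # PSM ordering: prefix, suffix, then the middle as the part to predict.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
```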

loubnabnl (Contributor) commented Jun 1, 2023

Yes, you can use the FIM preparation code in Megatron; there's also a FIM implementation here that could be easier to integrate with the current codebase. As for the data preparation, we have the code at bigcode-dataset, including how we added the special code tokens.
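
For reference, a simplified sketch of the file-level formatting with the metadata tokens might look like this (the token names come from the tokenizer's special tokens; the exact layout and how star counts are bucketed should be checked against the bigcode-dataset code):

```python
# Simplified sketch of prepending repository metadata with special tokens;
# verify the exact ordering and star bucketing against bigcode-dataset.
def format_file(repo_name: str, file_path: str, stars: str, content: str) -> str:
    return (
        f"<reponame>{repo_name}"
        f"<filename>{file_path}"
        f"<gh_stars>{stars}\n"
        f"{content}<|endoftext|>"
    )

example = format_file("org/repo", "src/utils.py", "100-1000", "print('hello')\n")
```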

@1920853199 commented:

> May I ask what hardware you used for finetuning?
>
> 8× A100
>
> I adapted the training scripts from the chat folder; that script doesn't use LoRA and eats a lot of memory. I had to decrease the batch size, otherwise it OOMed. It also OOMs when trying to resume from a checkpoint (there must be some memory inefficiencies in the restore code). Otherwise it runs pretty fast on a 330 MB dataset.
>
> P.S. Regarding my original question: I've found the FIM preparation code in the Megatron-LM codebase. It seems that's what they used to train Starcoder.

May I ask if there are any relevant scripts and tutorials for reference?

@1920853199 commented:

So how much time did you spend?
