We use the nanotron library to train the SmolLM and SmolLM2 base models.
The scripts for training SmolLM v1 can be found in the smollm1 folder. SmolLM2 has a similar architecture and setup but uses an improved data mixture that we curated and is trained for significantly longer (11 trillion tokens for the 1.7B, 4 trillion for the 360M, and 2 trillion for the 135M). We will upload the SmolLM2 configs soon.
Please refer to nanotron for detailed instructions on setting up your training environment and launching jobs.
After setting up the environment and tokenizing the training datasets with datatrove (instructions available here), you can modify the configurations to match your number of nodes and local paths.
Below is an example of launching SmolLM1 135M training on 1 node; change the DP value to 8 in the config and adjust the batch size accordingly, then run:
git clone https://github.com/huggingface/nanotron
cd nanotron
# follow the installation instructions in the nanotron README
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file smollm1/config_smollm1_135M.yaml
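For a multi-node launch without Slurm, the usual torchrun rendezvous flags apply. The sketch below assumes 2 nodes with 8 GPUs each and placeholder NODE_RANK/MASTER_ADDR values; nanotron expects dp × tp × pp in the config to equal the total number of processes, so DP would need to be raised to 16 in this case:
# run on every node, with NODE_RANK=0 on the master node and MASTER_ADDR pointing to it (placeholders)
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nnodes=2 --nproc_per_node=8 \
    --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=29500 \
    run_train.py --config-file smollm1/config_smollm1_135M.yaml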
If you are working on a Slurm cluster, you can modify the launch.slurm script and launch the training with:
sbatch launch.slurm
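The fields you typically need to adjust in launch.slurm are the standard Slurm resource and output directives; the values below are placeholders rather than the repo's defaults:
#SBATCH --job-name=smollm1-135M   # placeholder job name
#SBATCH --nodes=1                 # match the node count your config's parallelism assumes
#SBATCH --gres=gpu:8              # GPUs per node
#SBATCH --output=logs/%x-%j.out   # log path, see the note below about creating this directory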
Note
Don't forget to create the logs directory before launching the job; the exact path should match the log output path used by your launch script, for example:
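mkdir -p logs   # "logs" is an assumption, use whatever directory launch.slurm writes its output to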