We use the nanotron library to train the SmolLM and SmolLM2 base models.
The scripts for training SmolLM v1 can be found in the smollm1 folder. SmolLM2 has a similar architecture and setup but uses an improved data mixture that we curated and is trained for significantly longer (11 trillion tokens for the 1.7B, 4 trillion for the 360M, and 2 trillion for the 135M). We will upload the SmolLM2 configs soon.
Please refer to nanotron for detailed instructions on setting up your training environment and launching jobs.
After setting up the environment and tokenizing the training datasets with datatrove (instructions available here), you can modify the configurations to match your number of nodes and local paths.
Below is an example of launching SmolLM1 135M training on 1 node; change the DP value to 8 in the config and adjust the batch size accordingly, then run:
git clone https://github.com/huggingface/nanotron
cd nanotron
# follow the installation instructions in the nanotron README
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file smollm1/config_smollm1_135M.yaml
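For a multi-node launch without Slurm, the usual torchrun rendezvous flags apply. The sketch below assumes 2 nodes with 8 GPUs each and placeholder NODE_RANK/MASTER_ADDR values; nanotron expects dp × tp × pp in the config to equal the total number of processes, so DP would need to be raised to 16 in this case:
# run on every node, with NODE_RANK=0 on the master node and MASTER_ADDR pointing to it (placeholders)
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nnodes=2 --nproc_per_node=8 \
    --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=29500 \
    run_train.py --config-file smollm1/config_smollm1_135M.yaml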
If you are working on a Slurm cluster, you can modify the launch.slurm script and launch the training with:
sbatch launch.slurm
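The fields you typically need to adjust in launch.slurm are the standard Slurm resource and output directives; the values below are placeholders rather than the repo's defaults:
#SBATCH --job-name=smollm1-135M   # placeholder job name
#SBATCH --nodes=1                 # match the node count your config's parallelism assumes
#SBATCH --gres=gpu:8              # GPUs per node
#SBATCH --output=logs/%x-%j.out   # log path, see the note below about creating this directory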
Note
Don't forget to create the logs directory before launching the job; the exact path should match the log output path used by your launch script, for example:
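mkdir -p logs   # "logs" is an assumption, use whatever directory launch.slurm writes its output to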