This is lingua refactored for the Lanta HPC cluster. The original README of this repository is HERE
- Removed the `ulimit` command from the code, because Lanta does not permit changing shell resource limits.
- Created `setup/create_env_lanta.sh`, a script that builds a conda environment specifically for Lanta.
- Created scripts to download the training data and tokenizer: `setup/download_prepare_hf_data.sh` and `setup/prepare_tokenizer_lanta.sh`.
- Modified the code so it does not download the eval datasets at run time, because Lanta GPU nodes have no internet connection; `setup/download_eval_datasets_lanta.py` and `setup/prepare_eval_data_lanta.sh` were written specifically for Lanta.
- Replaced `lm-eval` with a custom version of lm-eval that supports loading datasets from disk.
- Added a fine-tuning tutorial with additional support scripts: `setup/convert_hf_checkpoint.sh` and `setup/download_hf_model.sh`.
- Added a 1B Llama3 fine-tuning and a pre-training configuration example.
- Fixed a memory leak in the evaluation steps.
- Fixed multi-node training on Lanta.
- Added the custom dataset `dummy_zhth` to the original data-processing script.
- Added support for the Qwen2 and Qwen2.5 model families.
```
sh setup/create_env_lanta.sh <path_to_store_conda_environment>
```

- `<path_to_store_conda_environment>`: we recommend putting the environment path outside the lingua folder so that `stool.py` does not index it when running the training script.
- `<conda_path>` will be created at `<path_to_store_conda_environment>/lingua_conda`.
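For concreteness, a quick sketch of the path convention (the paths here are hypothetical, not real Lanta paths):

```shell
# Hypothetical example: an environment root outside the lingua folder
env_root=/scratch/myproject/envs
# create_env_lanta.sh places the environment under <path>/lingua_conda,
# so the <conda_path> used by all later steps would be:
conda_path="${env_root}/lingua_conda"
echo "$conda_path"
```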
```
sh setup/download_prepare_hf_data.sh <conda_path> <data_repo>
```

- `<data_repo>` can be one of `fineweb_edu`, `fineweb_edu_10bt`, `dclm_baseline_1.0`, `dclm_baseline_1.0_10prct`, `dummy_zhth`. Please choose `fineweb_edu_10bt` or `dummy_zhth` for this tutorial because they use less disk space.
- The training dataset will be downloaded to `<current_directory>/data/<data_repo>_shuffled`.
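As a sketch of the resulting layout (the clone location below is hypothetical), choosing `fineweb_edu_10bt` produces:

```shell
# Hypothetical current directory for the lingua checkout
lingua_dir=/scratch/myproject/lingua
data_repo=fineweb_edu_10bt
# The download script writes the shuffled dataset here:
echo "${lingua_dir}/data/${data_repo}_shuffled"
```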
```
sh setup/prepare_tokenizer_lanta.sh <conda_path> <huggingface_privatekey> <tokenizer_name>
```

- Get a Hugging Face private key from this link.
- `<tokenizer_name>` can be any of the names listed in `setup/download_tokenizer.py`: `llama2`, `llama3`, `gemma`, `qwen2`.
- The tokenizer will be saved to `<tokenizer_path>` = `<current_directory>/tokenizer_file_<tokenizer_name>`.
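A small sketch of validating `<tokenizer_name>` against the supported list above before launching the script:

```shell
tokenizer_name=llama3   # one of: llama2, llama3, gemma, qwen2
case "$tokenizer_name" in
  llama2|llama3|gemma|qwen2)
    # folder name the tokenizer files will land in
    echo "tokenizer_file_${tokenizer_name}" ;;
  *)
    echo "unsupported tokenizer: ${tokenizer_name}" >&2
    exit 1 ;;
esac
```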
```
sh setup/prepare_eval_data_lanta.sh <conda_path>
```

- Eval datasets will be downloaded to `<current_directory>/data/<dataset_name>`, one folder per dataset.
```
ml purge
ml Mamba/23.11.0-0
conda deactivate
conda activate <conda_path>
```
Edit `lanta_pretrain.yaml` and run the Slurm job:

```
python -m lingua.stool script=apps.main.train config=apps/main/configs/lanta_pretrain.yaml nodes=<num_nodes> partition=gpu project_name=<project_name> time=02:00:00
```
```
sh setup/download_hf_model.sh <conda_path> <REPO_ID> <DOWNLOAD_PATH>
```

- `<REPO_ID>` should be a variant of the Llama3 model family. In this demo we will use `Llama-3.2-1B`.
- `<DOWNLOAD_PATH>` is where the model will be downloaded.
Convert Checkpoint to DCP Format

```
sh setup/convert_hf_checkpoint.sh <conda_path> <DOWNLOAD_PATH> <MODEL_FAMILY>
```

- `<DOWNLOAD_PATH>` is the same path the Hugging Face checkpoint was downloaded to.
- `<MODEL_FAMILY>` must be one of `llama3` or `qwen2`.
- Output will be at `<DOWNLOAD_PATH>-converted`.
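A sketch of where the converted checkpoint lands, using a hypothetical download path:

```shell
DOWNLOAD_PATH=/scratch/models/Llama-3.2-1B   # hypothetical download location
# convert_hf_checkpoint.sh writes the DCP checkpoint next to the original:
echo "${DOWNLOAD_PATH}-converted"
```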
Edit `lanta_finetune_1B.yaml` (Llama3.2 1B) or `lanta_finetune_1B_qwen.yaml` (Qwen2.5 1.5B) and run the Slurm job:

```
python -m lingua.stool script=apps.main.train config=apps/main/configs/lanta_finetune_1B.yaml nodes=<num_nodes> partition=gpu project_name=<project_name> time=02:00:00
```

or

```
python -m lingua.stool script=apps.main.train config=apps/main/configs/lanta_finetune_1B_qwen.yaml nodes=<num_nodes> partition=gpu project_name=<project_name> time=02:00:00
```
We need to run the checkpoint conversion through SLURM to avoid OOM.

- Edit `<PROJECT_NAME>` in `setup/convert_dcp_checkpoint_llama.sh`.
- Run:

```
sbatch setup/convert_dcp_checkpoint_llama.sh \
    <CONDA_PATH> \
    <LINGUA_CHECKPOINT_PATH> \
    <TOKENIZER_DIR>
```

- `<LINGUA_CHECKPOINT_PATH>` example: `<full_path>/checkpoints/0000000300`
- `<TOKENIZER_DIR>` example: `<full_path>/tokenizer_file_llama3/original/`
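Consistent with the example above, the checkpoint directory name appears to be the training step zero-padded to ten digits; a quick sketch of deriving it from a step number:

```shell
step=300
# lingua checkpoint directories are named after the zero-padded step number
printf 'checkpoints/%010d\n' "$step"
```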
We need to run the checkpoint conversion through SLURM to avoid OOM.

- Edit `<PROJECT_NAME>` in `setup/convert_dcp_checkpoint_qwen.sh`.
- Run:

```
sbatch setup/convert_dcp_checkpoint_qwen.sh \
    <CONDA_PATH> \
    <LINGUA_CHECKPOINT_PATH> \
    <TOKENIZER_DIR> \
    <ORIGINAL_QWEN_HF_PATH>
```

- `<LINGUA_CHECKPOINT_PATH>` example: `<full_path>/checkpoints/0000000300`
- `<TOKENIZER_DIR>` example: `<full_path>/tokenizer_file_qwen2`
- `<ORIGINAL_QWEN_HF_PATH>`: full path to the original Qwen Hugging Face checkpoint, example: `<full_path>/Qwen2.5-1.5B`
Prerequisite: Convert Checkpoint to Hugging Face Format

```
huggingface-cli upload --token <HF_TOKEN> <PUSH_HF_REPO> <LINGUA_CHECKPOINT_PATH>/hf .
```

- `<LINGUA_CHECKPOINT_PATH>` example: `<full_path>/checkpoints/0000000300`
- `<PUSH_HF_REPO>`: Hugging Face repository, example: `lst-nectec/llama-1b-finetuned`