This is lingua refactored for the Lanta HPC cluster. The original README of this repository is HERE
- Removed the `ulimit` command from the code, because Lanta does not permit changing shell resource limits.
- Created `setup/create_env_lanta.sh`, a script that builds a conda environment specifically for Lanta.
- Created scripts to download the training data and tokenizer: `setup/download_prepare_hf_data.sh` and `setup/prepare_tokenizer_lanta.sh`.
- Modified the code so it does not download the eval datasets at run time, because Lanta GPU nodes have no internet connection; `setup/download_eval_datasets_lanta.py` and `setup/prepare_eval_data_lanta.sh` were written specifically for Lanta.
- Replaced `lm-eval` with a custom version of lm-eval that supports loading datasets from disk.
- Added a fine-tuning tutorial with additional support scripts: `setup/convert_hf_checkpoint.sh` and `setup/download_hf_model.sh`.
- Added a 1B Llama3 fine-tuning and a pre-training configuration example.
- Fixed a memory leak in the evaluation steps.
- Fixed multi-node training on Lanta.
- Added the custom dataset `dummy_zhth` to the original data-processing script.
- Added support for the Qwen2 and Qwen2.5 model families.
```
sh setup/create_env_lanta.sh <path_to_store_conda_environment>
```

- `<path_to_store_conda_environment>`: we recommend putting the environment path outside the lingua folder so that `stool.py` does not index it when running the training script.
- `<conda_path>` will be created at `<path_to_store_conda_environment>/lingua_conda`.
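For concreteness, a quick sketch of the path convention (the paths here are hypothetical, not real Lanta paths):

```shell
# Hypothetical example: an environment root outside the lingua folder
env_root=/scratch/myproject/envs
# create_env_lanta.sh places the environment under <path>/lingua_conda,
# so the <conda_path> used by all later steps would be:
conda_path="${env_root}/lingua_conda"
echo "$conda_path"
```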
```
sh setup/download_prepare_hf_data.sh <conda_path> <data_repo>
```

- `<data_repo>` can be one of `fineweb_edu`, `fineweb_edu_10bt`, `dclm_baseline_1.0`, `dclm_baseline_1.0_10prct`, `dummy_zhth`. Please choose `fineweb_edu_10bt` or `dummy_zhth` for this tutorial because they use less disk space.
- The training dataset will be downloaded to `<current_directory>/data/<data_repo>_shuffled`.
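As a sketch of the resulting layout (the clone location below is hypothetical), choosing `fineweb_edu_10bt` produces:

```shell
# Hypothetical current directory for the lingua checkout
lingua_dir=/scratch/myproject/lingua
data_repo=fineweb_edu_10bt
# The download script writes the shuffled dataset here:
echo "${lingua_dir}/data/${data_repo}_shuffled"
```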
```
sh setup/prepare_tokenizer_lanta.sh <conda_path> <huggingface_privatekey> <tokenizer_name>
```

- Get a Hugging Face private key from this link.
- `<tokenizer_name>` can be any of the names listed in `setup/download_tokenizer.py`: `llama2`, `llama3`, `gemma`, `qwen2`.
- The tokenizer will be saved to `<tokenizer_path>` = `<current_directory>/tokenizer_file_<tokenizer_name>`.
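A small sketch of validating `<tokenizer_name>` against the supported list above before launching the script:

```shell
tokenizer_name=llama3   # one of: llama2, llama3, gemma, qwen2
case "$tokenizer_name" in
  llama2|llama3|gemma|qwen2)
    # folder name the tokenizer files will land in
    echo "tokenizer_file_${tokenizer_name}" ;;
  *)
    echo "unsupported tokenizer: ${tokenizer_name}" >&2
    exit 1 ;;
esac
```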
```
sh setup/prepare_eval_data_lanta.sh <conda_path>
```

- Eval datasets will be downloaded to `<current_directory>/data/<dataset_name>`, one folder per dataset.
```
ml purge
ml Mamba/23.11.0-0
conda deactivate
conda activate <conda_path>
```
Edit `lanta_pretrain.yaml` and run the Slurm job:

```
python -m lingua.stool script=apps.main.train config=apps/main/configs/lanta_pretrain.yaml nodes=<num_nodes> partition=gpu project_name=<project_name> time=02:00:00
```
```
sh setup/download_hf_model.sh <conda_path> <REPO_ID> <DOWNLOAD_PATH>
```

- `<REPO_ID>` should be a variant of the Llama3 model family. In this demo we will use `Llama-3.2-1B`.
- `<DOWNLOAD_PATH>` is where the model will be downloaded.
Convert Checkpoint to DCP Format

```
sh setup/convert_hf_checkpoint.sh <conda_path> <DOWNLOAD_PATH> <MODEL_FAMILY>
```

- `<DOWNLOAD_PATH>` is the same path the Hugging Face checkpoint was downloaded to.
- `<MODEL_FAMILY>` must be one of `llama3` or `qwen2`.
- Output will be at `<DOWNLOAD_PATH>-converted`.
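A sketch of where the converted checkpoint lands, using a hypothetical download path:

```shell
DOWNLOAD_PATH=/scratch/models/Llama-3.2-1B   # hypothetical download location
# convert_hf_checkpoint.sh writes the DCP checkpoint next to the original:
echo "${DOWNLOAD_PATH}-converted"
```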
Edit `lanta_finetune_1B.yaml` (Llama3.2 1B) or `lanta_finetune_1B_qwen.yaml` (Qwen2.5 1.5B) and run the Slurm job:

```
python -m lingua.stool script=apps.main.train config=apps/main/configs/lanta_finetune_1B.yaml nodes=<num_nodes> partition=gpu project_name=<project_name> time=02:00:00
```

or

```
python -m lingua.stool script=apps.main.train config=apps/main/configs/lanta_finetune_1B_qwen.yaml nodes=<num_nodes> partition=gpu project_name=<project_name> time=02:00:00
```
We need to run the checkpoint conversion through SLURM to avoid OOM.

- Edit `<PROJECT_NAME>` in `setup/convert_dcp_checkpoint_llama.sh`.
- Run:

```
sbatch setup/convert_dcp_checkpoint_llama.sh \
    <CONDA_PATH> \
    <LINGUA_CHECKPOINT_PATH> \
    <TOKENIZER_DIR>
```

- `<LINGUA_CHECKPOINT_PATH>` example: `<full_path>/checkpoints/0000000300`
- `<TOKENIZER_DIR>` example: `<full_path>/tokenizer_file_llama3/original/`
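Consistent with the example above, the checkpoint directory name appears to be the training step zero-padded to ten digits; a quick sketch of deriving it from a step number:

```shell
step=300
# lingua checkpoint directories are named after the zero-padded step number
printf 'checkpoints/%010d\n' "$step"
```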
We need to run the checkpoint conversion through SLURM to avoid OOM.

- Edit `<PROJECT_NAME>` in `setup/convert_dcp_checkpoint_qwen.sh`.
- Run:

```
sbatch setup/convert_dcp_checkpoint_qwen.sh \
    <CONDA_PATH> \
    <LINGUA_CHECKPOINT_PATH> \
    <TOKENIZER_DIR> \
    <ORIGINAL_QWEN_HF_PATH>
```

- `<LINGUA_CHECKPOINT_PATH>` example: `<full_path>/checkpoints/0000000300`
- `<TOKENIZER_DIR>` example: `<full_path>/tokenizer_file_qwen2`
- `<ORIGINAL_QWEN_HF_PATH>`: full path to the original Qwen Hugging Face checkpoint, example: `<full_path>/Qwen2.5-1.5B`
Prerequisite: Convert Checkpoint to Hugging Face Format

```
huggingface-cli upload --token <HF_TOKEN> <PUSH_HF_REPO> <LINGUA_CHECKPOINT_PATH>/hf .
```

- `<LINGUA_CHECKPOINT_PATH>` example: `<full_path>/checkpoints/0000000300`
- `<PUSH_HF_REPO>`: Hugging Face repository, example: `lst-nectec/llama-1b-finetuned`