diff --git a/docs/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md b/docs/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md
index e0a169b73e1..d123f3ab03c 100644
--- a/docs/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md
+++ b/docs/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md
@@ -29,7 +29,7 @@ This guide details the steps for going from a pre-trained, unoptimized Llama2 7B
 ## Prerequisites
-- Training Environment: A system that meets the minimum hardware and software requirements as outlined in the [Install Guide](/get-started/install/sparseml#prerequisites). To replicate the setup used for fine-tuning in this guide, use 4 NVIDIA A100 GPUs for both dense and sparse fine-tuning steps.
+- Training Environment: A system that meets the minimum hardware and software requirements as outlined in the [Install Guide](/get-started/install/sparseml#prerequisites). To replicate the setup used for fine-tuning in this guide, use 4 NVIDIA A100 80 GB GPUs for both the dense and sparse fine-tuning steps, and a system with at least 16 GB of memory.
 - SparseML LLM Installation: An environment with SparseML for LLMs installed as outlined in the [Install Guide](/get-started/install/sparseml#generative-ai-hugging-face).
 - Background: Familiarity with Generative AI and working with large language models is recommended.
@@ -42,28 +42,28 @@ zoo:llama2-7b-llama2_pretrain-base
 ## Dense fine-tuning
 We then fine-tune the above pre-trained dense model on the GSM8K dataset to obtain a model that we can later optimize using sparsification.
 ```bash
-accelerate launch
-  --config_file example_fsdp_config.yaml
-  --no_python sparseml.transformers.text_generation.finetune
-  --model PATH_TO_MODEL or ZOO_STUB
-  --dataset "gsm8k"
-  --dataset_config_name "main"
-  --output_dir PATH_TO_OUTPUT
-  --splits "train"
-  --num_train_epochs 2
-  --precision "bfloat16"
-  --gradient_checkpointing True
-  --bf16 True
-  --learning_rate 0.00005
-  --lr_scheduler_type "linear"
-  --max_seq_length 1024
-  --per_device_train_batch_size 32
-  --max_grad_norm 2
+accelerate launch \
+  --config_file example_fsdp_config.yaml \
+  --no_python sparseml.transformers.text_generation.finetune \
+  --model PATH_TO_MODEL or ZOO_STUB \
+  --dataset "gsm8k" \
+  --dataset_config_name "main" \
+  --output_dir PATH_TO_OUTPUT \
+  --splits "train" \
+  --num_train_epochs 2 \
+  --precision "bfloat16" \
+  --gradient_checkpointing True \
+  --bf16 True \
+  --learning_rate 0.00005 \
+  --lr_scheduler_type "linear" \
+  --max_seq_length 1024 \
+  --per_device_train_batch_size 32 \
+  --max_grad_norm 2 \
   --warmup_steps 20
 ```
 Note: *Some of these hyper-parameters may need further tuning to enhance the overall accuracy of the fine-tuned model. The values mentioned above were obtained through a quick hyper-parameter search. Parameters that could have a significant impact and are worth considering for tuning include: `learning_rate`, `max_grad_norm`, `warmup_steps`, `max_seq_length`.*
-The example_fsdp_config.yaml used above contains the following setup for FSDP. Set the `num_processes` to the number of GPUs available. For our setup, we used 4 NVIDIA A100 GPUs so we set `num_processes` to `4`.
+The example_fsdp_config.yaml used above contains the following setup for FSDP. Set the `num_processes` to the number of GPUs available. For our setup, we used 4 NVIDIA A100 80 GB GPUs, so we set `num_processes` to `4`.
 ```yaml
 compute_environment: LOCAL_MACHINE
 debug: false
@@ -92,7 +92,7 @@ tpu_use_sudo: false
 use_cpu: false
 ```
-### Dense fine-tuned model accuracy
+### Dense finetuned model accuracy
 [Evaluating](#evaluation-setup) the dense fine-tuned model on the `gsm8k 0-shot` task, results in a baseline accuracy of `37.52%`. We'll consider this accuracy as our baseline for calculating recovery for the oneshot sparse and sparse fine-tuned models we'll get later. Detailed results are provided below:
 ```json
 {
   "results": {
@@ -125,19 +125,19 @@ Use the dense fine-tuned model obtained above and sparsify it to 50% in a onesho
 Command:
 ```bash
-accelerate launch
-  --config_file example_fsdp_config.yaml
-  --no_python sparseml.transformers.text_generation.oneshot
-  --model PATH_TO_MODEL
-  --dataset "gsm8k"
-  --dataset_config_name "main"
-  --concatenate_data OPTIONAL
-  --recipe PATH_TO_RECIPE
-  --output_dir PATH_TO_OUTPUT
-  --splits "train"
-  --pad_to_max_length False
-  --oneshot_device DEVICE
-  --num_calibration_samples 1024
+accelerate launch \
+  --config_file example_fsdp_config.yaml \
+  --no_python sparseml.transformers.text_generation.oneshot \
+  --model PATH_TO_MODEL \
+  --dataset "gsm8k" \
+  --dataset_config_name "main" \
+  --concatenate_data OPTIONAL \
+  --recipe PATH_TO_RECIPE \
+  --output_dir PATH_TO_OUTPUT \
+  --splits "train" \
+  --pad_to_max_length False \
+  --oneshot_device DEVICE \
+  --num_calibration_samples 1024 \
   --max_seq_len 4096
 ```
 Note: *You may wish to tweak the `num_calibration_samples` above to obtain better accuracy.*
@@ -177,7 +177,7 @@ pruning_stage:
 To learn more about the OWL non-uniform sparsity profile method, visit [this link](https://github.com/luuyin/OWL/tree/main?tab=readme-ov-file#script-example-of-pruning-llama-7b-using-owl-sparsegpt).
 ### Oneshot 50% sparse model accuracy
-[Evaluating](#evaluation-setup) the oneshot 50% sparse model on the `gsm8k 0-shot` task, results in an accuracy of `33.81%` and translates to a `90.11%` recovery over our [dense baseline](#Dense fine-tuned model accuracy). In the next step we'll see how to improve the recovery of this model using sparse fine-tuning. Detailed results for the oneshot 50% sparse model are provided below:
+[Evaluating](#evaluation-setup) the oneshot 50% sparse model on the `gsm8k 0-shot` task results in an accuracy of `33.81%`, which translates to a `90.11%` recovery over our [dense baseline](#dense-finetuned-model-accuracy). In the next step we'll see how to improve the recovery of this model using sparse fine-tuning. Detailed results for the oneshot 50% sparse model are provided below:
 ```json
 {
   "results": {
@@ -209,26 +209,26 @@ The one-shot sparse model generated previously can undergo further sparse fine-t
 Command:
 ```bash
-accelerate launch
-  --config_file example_fsdp_config.yaml
-  --no_python sparseml.transformers.text_generation.finetune
-  --model PATH_TO_MODEL
-  --dataset "gsm8k"
-  --dataset_config_name "main"
-  --output_dir PATH_TO_OUTPUT
-  --splits "train"
-  --num_train_epochs 2
-  --precision "bfloat16"
-  --gradient_checkpointing True
-  --bf16 True
-  --learning_rate 0.00005
-  --lr_scheduler_type "linear"
-  --max_seq_length 1024
-  --per_device_train_batch_size 32
-  --max_grad_norm None
-  --warmup_steps 20
-  --distill_teacher PATH_TO_TEACHER
-  --recipe PATH_TO_RECIPE
+accelerate launch \
+  --config_file example_fsdp_config.yaml \
+  --no_python sparseml.transformers.text_generation.finetune \
+  --model PATH_TO_MODEL \
+  --dataset "gsm8k" \
+  --dataset_config_name "main" \
+  --output_dir PATH_TO_OUTPUT \
+  --splits "train" \
+  --num_train_epochs 2 \
+  --precision "bfloat16" \
+  --gradient_checkpointing True \
+  --bf16 True \
+  --learning_rate 0.00005 \
+  --lr_scheduler_type "linear" \
+  --max_seq_length 1024 \
+  --per_device_train_batch_size 32 \
+  --max_grad_norm None \
+  --warmup_steps 20 \
+  --distill_teacher PATH_TO_TEACHER \
+  --recipe PATH_TO_RECIPE
 ```
 Recipe:
@@ -290,7 +290,7 @@ test_stage:
 Note: *Some of these hyper-parameters may need further tuning to enhance the overall accuracy of the fine-tuned model. The values mentioned above were obtained through a quick hyper-parameter search. Parameters that could have a significant impact and are worth considering for tuning include: `learning_rate`, `max_grad_norm`, `warmup_steps`, `max_seq_length`.*
 ### Fine-tuned 50% sparse model accuracy
-[Evaluating](#evaluation-setup) the fine-tuned 50% sparse model on the `gsm8k 0-shot` task, results in an accuracy of `38.59%` and shows clear improvement over the [oneshot accuracy](#Oneshot 50% sparse model accuracy). The sparse fine-tuning step not only helped improve over the oneshot accuracy but even surpassed the dense baseline model. Detailed results for the oneshot 50% sparse model are provided below:
+[Evaluating](#evaluation-setup) the fine-tuned 50% sparse model on the `gsm8k 0-shot` task results in an accuracy of `38.59%` and shows clear improvement over the [oneshot accuracy](#oneshot-50-sparse-model-accuracy). The sparse fine-tuning step not only improved over the oneshot accuracy but even surpassed the dense baseline model. Detailed results for the fine-tuned 50% sparse model are provided below:
 ```json
 {
   "results": {
@@ -331,7 +331,7 @@ MODEL_PATH=
 TASK=gsm8k
 python main.py \
   --model sparseml \
-  --model_args pretrained=MODEL_PATH,trust_remote_code=True \
+  --model_args pretrained=${MODEL_PATH},trust_remote_code=True \
   --tasks $TASK \
   --batch_size 48 \
   --no_cache \
@@ -339,4 +339,22 @@ python main.py \
   --output_path "${MODEL_PATH}/${TASK}.json" \
   --device "cuda:0" \
   --num_fewshot 0
-```
\ No newline at end of file
+```
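+
+Note: *Recovery is simply the sparse model's accuracy divided by the dense fine-tuned baseline accuracy (for example, `33.81 / 37.52 ≈ 90.11%` for the oneshot 50% sparse model). The snippet below is a minimal sketch rather than part of the guide's tested tooling: it reads the `gsm8k.json` written by the command above and prints accuracy and recovery, assuming the lm-evaluation-harness result layout shown earlier (`"results" -> "gsm8k" -> "acc"`); adjust the keys if your harness version nests results differently.*
+
+```python
+import json
+import sys
+
+# Dense fine-tuned baseline accuracy reported earlier in this guide.
+DENSE_BASELINE_ACC = 0.3752
+
+# Usage (file name is illustrative): python compute_recovery.py "${MODEL_PATH}/gsm8k.json"
+with open(sys.argv[1]) as f:
+    results = json.load(f)
+
+acc = results["results"]["gsm8k"]["acc"]
+print(f"gsm8k 0-shot accuracy: {acc * 100:.2f}%")
+print(f"recovery vs. dense baseline: {acc / DENSE_BASELINE_ACC * 100:.2f}%")
+```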
diff --git a/versioned_docs/version-1.7.0/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md b/versioned_docs/version-1.7.0/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md
index e0a169b73e1..d123f3ab03c 100644
--- a/versioned_docs/version-1.7.0/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md
+++ b/versioned_docs/version-1.7.0/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md
@@ -29,7 +29,7 @@ This guide details the steps for going from a pre-trained, unoptimized Llama2 7B
 ## Prerequisites
-- Training Environment: A system that meets the minimum hardware and software requirements as outlined in the [Install Guide](/get-started/install/sparseml#prerequisites). To replicate the setup used for fine-tuning in this guide, use 4 NVIDIA A100 GPUs for both dense and sparse fine-tuning steps.
+- Training Environment: A system that meets the minimum hardware and software requirements as outlined in the [Install Guide](/get-started/install/sparseml#prerequisites). To replicate the setup used for fine-tuning in this guide, use 4 NVIDIA A100 80 GB GPUs for both the dense and sparse fine-tuning steps, and a system with at least 16 GB of memory.
 - SparseML LLM Installation: An environment with SparseML for LLMs installed as outlined in the [Install Guide](/get-started/install/sparseml#generative-ai-hugging-face).
 - Background: Familiarity with Generative AI and working with large language models is recommended.
@@ -42,28 +42,28 @@ zoo:llama2-7b-llama2_pretrain-base
 ## Dense fine-tuning
 We then fine-tune the above pre-trained dense model on the GSM8K dataset to obtain a model that we can later optimize using sparsification.
 ```bash
-accelerate launch
-  --config_file example_fsdp_config.yaml
-  --no_python sparseml.transformers.text_generation.finetune
-  --model PATH_TO_MODEL or ZOO_STUB
-  --dataset "gsm8k"
-  --dataset_config_name "main"
-  --output_dir PATH_TO_OUTPUT
-  --splits "train"
-  --num_train_epochs 2
-  --precision "bfloat16"
-  --gradient_checkpointing True
-  --bf16 True
-  --learning_rate 0.00005
-  --lr_scheduler_type "linear"
-  --max_seq_length 1024
-  --per_device_train_batch_size 32
-  --max_grad_norm 2
+accelerate launch \
+  --config_file example_fsdp_config.yaml \
+  --no_python sparseml.transformers.text_generation.finetune \
+  --model PATH_TO_MODEL or ZOO_STUB \
+  --dataset "gsm8k" \
+  --dataset_config_name "main" \
+  --output_dir PATH_TO_OUTPUT \
+  --splits "train" \
+  --num_train_epochs 2 \
+  --precision "bfloat16" \
+  --gradient_checkpointing True \
+  --bf16 True \
+  --learning_rate 0.00005 \
+  --lr_scheduler_type "linear" \
+  --max_seq_length 1024 \
+  --per_device_train_batch_size 32 \
+  --max_grad_norm 2 \
   --warmup_steps 20
 ```
 Note: *Some of these hyper-parameters may need further tuning to enhance the overall accuracy of the fine-tuned model. The values mentioned above were obtained through a quick hyper-parameter search. Parameters that could have a significant impact and are worth considering for tuning include: `learning_rate`, `max_grad_norm`, `warmup_steps`, `max_seq_length`.*
-The example_fsdp_config.yaml used above contains the following setup for FSDP. Set the `num_processes` to the number of GPUs available. For our setup, we used 4 NVIDIA A100 GPUs so we set `num_processes` to `4`.
+The example_fsdp_config.yaml used above contains the following setup for FSDP. Set the `num_processes` to the number of GPUs available. For our setup, we used 4 NVIDIA A100 80 GB GPUs, so we set `num_processes` to `4`.
 ```yaml
 compute_environment: LOCAL_MACHINE
 debug: false
@@ -92,7 +92,7 @@ tpu_use_sudo: false
 use_cpu: false
 ```
-### Dense fine-tuned model accuracy
+### Dense finetuned model accuracy
 [Evaluating](#evaluation-setup) the dense fine-tuned model on the `gsm8k 0-shot` task, results in a baseline accuracy of `37.52%`. We'll consider this accuracy as our baseline for calculating recovery for the oneshot sparse and sparse fine-tuned models we'll get later. Detailed results are provided below:
 ```json
 {
   "results": {
@@ -125,19 +125,19 @@ Use the dense fine-tuned model obtained above and sparsify it to 50% in a onesho
 Command:
 ```bash
-accelerate launch
-  --config_file example_fsdp_config.yaml
-  --no_python sparseml.transformers.text_generation.oneshot
-  --model PATH_TO_MODEL
-  --dataset "gsm8k"
-  --dataset_config_name "main"
-  --concatenate_data OPTIONAL
-  --recipe PATH_TO_RECIPE
-  --output_dir PATH_TO_OUTPUT
-  --splits "train"
-  --pad_to_max_length False
-  --oneshot_device DEVICE
-  --num_calibration_samples 1024
+accelerate launch \
+  --config_file example_fsdp_config.yaml \
+  --no_python sparseml.transformers.text_generation.oneshot \
+  --model PATH_TO_MODEL \
+  --dataset "gsm8k" \
+  --dataset_config_name "main" \
+  --concatenate_data OPTIONAL \
+  --recipe PATH_TO_RECIPE \
+  --output_dir PATH_TO_OUTPUT \
+  --splits "train" \
+  --pad_to_max_length False \
+  --oneshot_device DEVICE \
+  --num_calibration_samples 1024 \
   --max_seq_len 4096
 ```
 Note: *You may wish to tweak the `num_calibration_samples` above to obtain better accuracy.*
@@ -177,7 +177,7 @@ pruning_stage:
 To learn more about the OWL non-uniform sparsity profile method, visit [this link](https://github.com/luuyin/OWL/tree/main?tab=readme-ov-file#script-example-of-pruning-llama-7b-using-owl-sparsegpt).
 ### Oneshot 50% sparse model accuracy
-[Evaluating](#evaluation-setup) the oneshot 50% sparse model on the `gsm8k 0-shot` task, results in an accuracy of `33.81%` and translates to a `90.11%` recovery over our [dense baseline](#Dense fine-tuned model accuracy). In the next step we'll see how to improve the recovery of this model using sparse fine-tuning. Detailed results for the oneshot 50% sparse model are provided below:
+[Evaluating](#evaluation-setup) the oneshot 50% sparse model on the `gsm8k 0-shot` task results in an accuracy of `33.81%`, which translates to a `90.11%` recovery over our [dense baseline](#dense-finetuned-model-accuracy). In the next step we'll see how to improve the recovery of this model using sparse fine-tuning. Detailed results for the oneshot 50% sparse model are provided below:
 ```json
 {
   "results": {
@@ -209,26 +209,26 @@ The one-shot sparse model generated previously can undergo further sparse fine-t
 Command:
 ```bash
-accelerate launch
-  --config_file example_fsdp_config.yaml
-  --no_python sparseml.transformers.text_generation.finetune
-  --model PATH_TO_MODEL
-  --dataset "gsm8k"
-  --dataset_config_name "main"
-  --output_dir PATH_TO_OUTPUT
-  --splits "train"
-  --num_train_epochs 2
-  --precision "bfloat16"
-  --gradient_checkpointing True
-  --bf16 True
-  --learning_rate 0.00005
-  --lr_scheduler_type "linear"
-  --max_seq_length 1024
-  --per_device_train_batch_size 32
-  --max_grad_norm None
-  --warmup_steps 20
-  --distill_teacher PATH_TO_TEACHER
-  --recipe PATH_TO_RECIPE
+accelerate launch \
+  --config_file example_fsdp_config.yaml \
+  --no_python sparseml.transformers.text_generation.finetune \
+  --model PATH_TO_MODEL \
+  --dataset "gsm8k" \
+  --dataset_config_name "main" \
+  --output_dir PATH_TO_OUTPUT \
+  --splits "train" \
+  --num_train_epochs 2 \
+  --precision "bfloat16" \
+  --gradient_checkpointing True \
+  --bf16 True \
+  --learning_rate 0.00005 \
+  --lr_scheduler_type "linear" \
+  --max_seq_length 1024 \
+  --per_device_train_batch_size 32 \
+  --max_grad_norm None \
+  --warmup_steps 20 \
+  --distill_teacher PATH_TO_TEACHER \
+  --recipe PATH_TO_RECIPE
 ```
 Recipe:
@@ -290,7 +290,7 @@ test_stage:
 Note: *Some of these hyper-parameters may need further tuning to enhance the overall accuracy of the fine-tuned model. The values mentioned above were obtained through a quick hyper-parameter search. Parameters that could have a significant impact and are worth considering for tuning include: `learning_rate`, `max_grad_norm`, `warmup_steps`, `max_seq_length`.*
 ### Fine-tuned 50% sparse model accuracy
-[Evaluating](#evaluation-setup) the fine-tuned 50% sparse model on the `gsm8k 0-shot` task, results in an accuracy of `38.59%` and shows clear improvement over the [oneshot accuracy](#Oneshot 50% sparse model accuracy). The sparse fine-tuning step not only helped improve over the oneshot accuracy but even surpassed the dense baseline model. Detailed results for the oneshot 50% sparse model are provided below:
+[Evaluating](#evaluation-setup) the fine-tuned 50% sparse model on the `gsm8k 0-shot` task results in an accuracy of `38.59%` and shows clear improvement over the [oneshot accuracy](#oneshot-50-sparse-model-accuracy). The sparse fine-tuning step not only improved over the oneshot accuracy but even surpassed the dense baseline model. Detailed results for the fine-tuned 50% sparse model are provided below:
 ```json
 {
   "results": {
@@ -331,7 +331,7 @@ MODEL_PATH=
 TASK=gsm8k
 python main.py \
   --model sparseml \
-  --model_args pretrained=MODEL_PATH,trust_remote_code=True \
+  --model_args pretrained=${MODEL_PATH},trust_remote_code=True \
   --tasks $TASK \
   --batch_size 48 \
   --no_cache \
@@ -339,4 +339,22 @@ python main.py \
   --output_path "${MODEL_PATH}/${TASK}.json" \
   --device "cuda:0" \
   --num_fewshot 0
-```
\ No newline at end of file
+```
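+
+Note: *Recovery is simply the sparse model's accuracy divided by the dense fine-tuned baseline accuracy (for example, `33.81 / 37.52 ≈ 90.11%` for the oneshot 50% sparse model). The snippet below is a minimal sketch rather than part of the guide's tested tooling: it reads the `gsm8k.json` written by the command above and prints accuracy and recovery, assuming the lm-evaluation-harness result layout shown earlier (`"results" -> "gsm8k" -> "acc"`); adjust the keys if your harness version nests results differently.*
+
+```python
+import json
+import sys
+
+# Dense fine-tuned baseline accuracy reported earlier in this guide.
+DENSE_BASELINE_ACC = 0.3752
+
+# Usage (file name is illustrative): python compute_recovery.py "${MODEL_PATH}/gsm8k.json"
+with open(sys.argv[1]) as f:
+    results = json.load(f)
+
+acc = results["results"]["gsm8k"]["acc"]
+print(f"gsm8k 0-shot accuracy: {acc * 100:.2f}%")
+print(f"recovery vs. dense baseline: {acc / DENSE_BASELINE_ACC * 100:.2f}%")
+```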