diff --git a/docs/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md b/docs/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md
index e0a169b73e1..d123f3ab03c 100644
--- a/docs/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md
+++ b/docs/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md
@@ -29,7 +29,7 @@ This guide details the steps for going from a pre-trained, unoptimized Llama2 7B
## Prerequisites
-- Training Environment: A system that meets the minimum hardware and software requirements as outlined in the [Install Guide](/get-started/install/sparseml#prerequisites). To replicate the setup used for fine-tuning in this guide, use 4 NVIDIA A100 GPUs for both dense and sparse fine-tuning steps.
+- Training Environment: A system that meets the minimum hardware and software requirements as outlined in the [Install Guide](/get-started/install/sparseml#prerequisites). To replicate the setup used for fine-tuning in this guide, use 4 NVIDIA A100 80 GB GPUs for both the dense and sparse fine-tuning steps, on a system with at least 16 GB of memory; a quick way to verify the setup is shown after this list.
- SparseML LLM Installation: An environment with SparseML for LLMs installed as outlined in the [Install Guide](/get-started/install/sparseml#generative-ai-hugging-face).
- Background: Familiarity with Generative AI and working with large language models is recommended.
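+Before launching any of the commands below, it helps to quickly verify that all four GPUs are visible and that SparseML's transformers integration imports cleanly. A minimal check, assuming the environment from the Install Guide above:
+```bash
+# List the visible GPUs and their memory; this guide assumes 4x NVIDIA A100 80 GB
+nvidia-smi --query-gpu=index,name,memory.total --format=csv
+# Confirm SparseML and its transformers integration import without errors
+python -c "import sparseml, sparseml.transformers; print('SparseML import OK')"
+```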
@@ -42,28 +42,28 @@ zoo:llama2-7b-llama2_pretrain-base
## Dense fine-tuning
We then fine-tune the above pre-trained dense model on the GSM8K dataset to obtain a model that we can later optimize using sparsification.
```bash
-accelerate launch
- --config_file example_fsdp_config.yaml
- --no_python sparseml.transformers.text_generation.finetune
- --model PATH_TO_MODEL or ZOO_STUB
- --dataset "gsm8k"
- --dataset_config_name "main"
- --output_dir PATH_TO_OUTPUT
- --splits "train"
- --num_train_epochs 2
- --precision "bfloat16"
- --gradient_checkpointing True
- --bf16 True
- --learning_rate 0.00005
- --lr_scheduler_type "linear"
- --max_seq_length 1024
- --per_device_train_batch_size 32
- --max_grad_norm 2
+accelerate launch \
+ --config_file example_fsdp_config.yaml \
+ --no_python sparseml.transformers.text_generation.finetune \
+ --model PATH_TO_MODEL or ZOO_STUB \
+ --dataset "gsm8k" \
+ --dataset_config_name "main" \
+ --output_dir PATH_TO_OUTPUT \
+ --splits "train" \
+ --num_train_epochs 2 \
+ --precision "bfloat16" \
+ --gradient_checkpointing True \
+ --bf16 True \
+ --learning_rate 0.00005 \
+ --lr_scheduler_type "linear" \
+ --max_seq_length 1024 \
+ --per_device_train_batch_size 32 \
+ --max_grad_norm 2 \
--warmup_steps 20
```
Note: *Some of these hyper-parameters may need further tuning to enhance the overall accuracy of the fine-tuned model. The values mentioned above were obtained through a quick hyper-parameter search. Parameters that could have a significant impact and are worth considering for tuning include: `learning_rate`, `max_grad_norm`, `warmup_steps`, `max_seq_length`.*
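+For example, a simple way to compare a few values of `learning_rate` is to loop over the fine-tuning command, writing each run to its own output directory. The sketch below keeps every other flag at the values shown above; the output directory names are illustrative:
+```bash
+# Hypothetical learning-rate sweep over the dense fine-tuning command
+for LR in 0.00002 0.00005 0.0001; do
+  accelerate launch \
+    --config_file example_fsdp_config.yaml \
+    --no_python sparseml.transformers.text_generation.finetune \
+    --model PATH_TO_MODEL \
+    --dataset "gsm8k" \
+    --dataset_config_name "main" \
+    --output_dir "dense_finetuned_lr_${LR}" \
+    --splits "train" \
+    --num_train_epochs 2 \
+    --precision "bfloat16" \
+    --gradient_checkpointing True \
+    --bf16 True \
+    --learning_rate "${LR}" \
+    --lr_scheduler_type "linear" \
+    --max_seq_length 1024 \
+    --per_device_train_batch_size 32 \
+    --max_grad_norm 2 \
+    --warmup_steps 20
+done
+```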
-The example_fsdp_config.yaml used above contains the following setup for FSDP. Set the `num_processes` to the number of GPUs available. For our setup, we used 4 NVIDIA A100 GPUs so we set `num_processes` to `4`.
+The `example_fsdp_config.yaml` used above contains the following FSDP setup. Set `num_processes` to the number of GPUs available. For our setup, we used 4 NVIDIA A100 80 GB GPUs, so we set `num_processes` to `4`.
```yaml
compute_environment: LOCAL_MACHINE
debug: false
@@ -92,7 +92,7 @@ tpu_use_sudo: false
use_cpu: false
```
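+If you run on a machine with a different number of GPUs and prefer not to edit the YAML, `accelerate launch` also accepts `--num_processes` directly, and command-line flags take precedence over the config file. This is a sketch of the override only; in practice keep the full set of fine-tuning flags from the command above:
+```bash
+# Example: launch on a 2-GPU machine without editing example_fsdp_config.yaml
+accelerate launch \
+  --config_file example_fsdp_config.yaml \
+  --num_processes 2 \
+  --no_python sparseml.transformers.text_generation.finetune \
+  --model PATH_TO_MODEL \
+  --dataset "gsm8k" \
+  --dataset_config_name "main" \
+  --output_dir PATH_TO_OUTPUT
+```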
### Dense fine-tuned model accuracy
[Evaluating](#evaluation-setup) the dense fine-tuned model on the `gsm8k 0-shot` task, results in a baseline accuracy of `37.52%`. We'll consider this accuracy as our baseline for calculating recovery for the oneshot sparse and sparse fine-tuned models we'll get later. Detailed results are provided below:
```json
{
@@ -125,19 +125,19 @@ Use the dense fine-tuned model obtained above and sparsify it to 50% in a onesho
Command:
```bash
-accelerate launch
- --config_file example_fsdp_config.yaml
- --no_python sparseml.transformers.text_generation.oneshot
- --model PATH_TO_MODEL
- --dataset "gsm8k"
- --dataset_config_name "main"
- --concatenate_data OPTIONAL
- --recipe PATH_TO_RECIPE
- --output_dir PATH_TO_OUTPUT
- --splits "train"
- --pad_to_max_length False
- --oneshot_device DEVICE
- --num_calibration_samples 1024
+accelerate launch \
+ --config_file example_fsdp_config.yaml \
+ --no_python sparseml.transformers.text_generation.oneshot \
+ --model PATH_TO_MODEL \
+ --dataset "gsm8k" \
+ --dataset_config_name "main" \
+ --concatenate_data OPTIONAL \
+ --recipe PATH_TO_RECIPE \
+ --output_dir PATH_TO_OUTPUT \
+ --splits "train" \
+ --pad_to_max_length False \
+ --oneshot_device DEVICE \
+ --num_calibration_samples 1024 \
--max_seq_len 4096
```
Note: *You may wish to tweak the `num_calibration_samples` above to obtain better accuracy.*
@@ -177,7 +177,7 @@ pruning_stage:
To learn more about the OWL non-uniform sparsity profile method, visit [this link](https://github.com/luuyin/OWL/tree/main?tab=readme-ov-file#script-example-of-pruning-llama-7b-using-owl-sparsegpt).
### Oneshot 50% sparse model accuracy
-[Evaluating](#evaluation-setup) the oneshot 50% sparse model on the `gsm8k 0-shot` task, results in an accuracy of `33.81%` and translates to a `90.11%` recovery over our [dense baseline](#Dense fine-tuned model accuracy). In the next step we'll see how to improve the recovery of this model using sparse fine-tuning. Detailed results for the oneshot 50% sparse model are provided below:
+[Evaluating](#evaluation-setup) the oneshot 50% sparse model on the `gsm8k 0-shot` task results in an accuracy of `33.81%`, which translates to a `90.11%` recovery of our [dense baseline](#dense-fine-tuned-model-accuracy). In the next step, we'll see how to improve the recovery of this model using sparse fine-tuning. Detailed results for the oneshot 50% sparse model are provided below:
```json
{
"results": {
@@ -209,26 +209,26 @@ The one-shot sparse model generated previously can undergo further sparse fine-t
Command:
```bash
-accelerate launch
- --config_file example_fsdp_config.yaml
- --no_python sparseml.transformers.text_generation.finetune
- --model PATH_TO_MODEL
- --dataset "gsm8k"
- --dataset_config_name "main"
- --output_dir PATH_TO_OUTPUT
- --splits "train"
- --num_train_epochs 2
- --precision "bfloat16"
- --gradient_checkpointing True
- --bf16 True
- --learning_rate 0.00005
- --lr_scheduler_type "linear"
- --max_seq_length 1024
- --per_device_train_batch_size 32
- --max_grad_norm None
- --warmup_steps 20
- --distill_teacher PATH_TO_TEACHER
- --recipe PATH_TO_RECIPE
+accelerate launch \
+ --config_file example_fsdp_config.yaml \
+ --no_python sparseml.transformers.text_generation.finetune \
+ --model PATH_TO_MODEL \
+ --dataset "gsm8k" \
+ --dataset_config_name "main" \
+ --output_dir PATH_TO_OUTPUT \
+ --splits "train" \
+ --num_train_epochs 2 \
+ --precision "bfloat16" \
+ --gradient_checkpointing True \
+ --bf16 True \
+ --learning_rate 0.00005 \
+ --lr_scheduler_type "linear" \
+ --max_seq_length 1024 \
+ --per_device_train_batch_size 32 \
+ --max_grad_norm None \
+ --warmup_steps 20 \
+ --distill_teacher PATH_TO_TEACHER \
+ --recipe PATH_TO_RECIPE
```
Recipe:
@@ -290,7 +290,7 @@ test_stage:
Note: *Some of these hyper-parameters may need further tuning to enhance the overall accuracy of the fine-tuned model. The values mentioned above were obtained through a quick hyper-parameter search. Parameters that could have a significant impact and are worth considering for tuning include: `learning_rate`, `max_grad_norm`, `warmup_steps`, `max_seq_length`.*
### Fine-tuned 50% sparse model accuracy
-[Evaluating](#evaluation-setup) the fine-tuned 50% sparse model on the `gsm8k 0-shot` task, results in an accuracy of `38.59%` and shows clear improvement over the [oneshot accuracy](#Oneshot 50% sparse model accuracy). The sparse fine-tuning step not only helped improve over the oneshot accuracy but even surpassed the dense baseline model. Detailed results for the oneshot 50% sparse model are provided below:
+[Evaluating](#evaluation-setup) the fine-tuned 50% sparse model on the `gsm8k 0-shot` task results in an accuracy of `38.59%`, a clear improvement over the [oneshot accuracy](#oneshot-50-sparse-model-accuracy). The sparse fine-tuning step not only improved on the oneshot accuracy but even surpassed the dense baseline model. Detailed results for the fine-tuned 50% sparse model are provided below:
```json
{
"results": {
@@ -331,7 +331,7 @@ MODEL_PATH=
TASK=gsm8k
python main.py \
--model sparseml \
- --model_args pretrained=MODEL_PATH,trust_remote_code=True \
+ --model_args pretrained=${MODEL_PATH},trust_remote_code=True \
--tasks $TASK \
--batch_size 48 \
--no_cache \
@@ -339,4 +339,4 @@ python main.py \
--output_path "${MODEL_PATH}/${TASK}.json" \
--device "cuda:0" \
--num_fewshot 0
-```
\ No newline at end of file
+```
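+The harness writes its results to `${MODEL_PATH}/${TASK}.json`. To pull the `gsm8k` metrics back out of that file afterwards, something like the following works; this is a sketch that assumes the same layout as the detailed results blocks shown earlier in this guide:
+```bash
+# Print the gsm8k metrics recorded by lm-evaluation-harness
+python -c "
+import json, sys
+with open(sys.argv[1]) as f:
+    data = json.load(f)
+print(json.dumps(data['results']['gsm8k'], indent=2))
+" "${MODEL_PATH}/${TASK}.json"
+```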
diff --git a/versioned_docs/version-1.7.0/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md b/versioned_docs/version-1.7.0/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md
index e0a169b73e1..d123f3ab03c 100644
--- a/versioned_docs/version-1.7.0/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md
+++ b/versioned_docs/version-1.7.0/llms/guides/sparse-finetuning-llm-gsm8k-with-sparseml.md
@@ -29,7 +29,7 @@ This guide details the steps for going from a pre-trained, unoptimized Llama2 7B
## Prerequisites
-- Training Environment: A system that meets the minimum hardware and software requirements as outlined in the [Install Guide](/get-started/install/sparseml#prerequisites). To replicate the setup used for fine-tuning in this guide, use 4 NVIDIA A100 GPUs for both dense and sparse fine-tuning steps.
+- Training Environment: A system that meets the minimum hardware and software requirements as outlined in the [Install Guide](/get-started/install/sparseml#prerequisites). To replicate the setup used for fine-tuning in this guide, use 4 NVIDIA A100 80 GB GPUs for both the dense and sparse fine-tuning steps, on a system with at least 16 GB of memory; a quick way to verify the setup is shown after this list.
- SparseML LLM Installation: An environment with SparseML for LLMs installed as outlined in the [Install Guide](/get-started/install/sparseml#generative-ai-hugging-face).
- Background: Familiarity with Generative AI and working with large language models is recommended.
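+Before launching any of the commands below, it helps to quickly verify that all four GPUs are visible and that SparseML's transformers integration imports cleanly. A minimal check, assuming the environment from the Install Guide above:
+```bash
+# List the visible GPUs and their memory; this guide assumes 4x NVIDIA A100 80 GB
+nvidia-smi --query-gpu=index,name,memory.total --format=csv
+# Confirm SparseML and its transformers integration import without errors
+python -c "import sparseml, sparseml.transformers; print('SparseML import OK')"
+```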
@@ -42,28 +42,28 @@ zoo:llama2-7b-llama2_pretrain-base
## Dense fine-tuning
We then fine-tune the above pre-trained dense model on the GSM8K dataset to obtain a model that we can later optimize using sparsification.
```bash
-accelerate launch
- --config_file example_fsdp_config.yaml
- --no_python sparseml.transformers.text_generation.finetune
- --model PATH_TO_MODEL or ZOO_STUB
- --dataset "gsm8k"
- --dataset_config_name "main"
- --output_dir PATH_TO_OUTPUT
- --splits "train"
- --num_train_epochs 2
- --precision "bfloat16"
- --gradient_checkpointing True
- --bf16 True
- --learning_rate 0.00005
- --lr_scheduler_type "linear"
- --max_seq_length 1024
- --per_device_train_batch_size 32
- --max_grad_norm 2
+accelerate launch \
+ --config_file example_fsdp_config.yaml \
+ --no_python sparseml.transformers.text_generation.finetune \
+ --model PATH_TO_MODEL or ZOO_STUB \
+ --dataset "gsm8k" \
+ --dataset_config_name "main" \
+ --output_dir PATH_TO_OUTPUT \
+ --splits "train" \
+ --num_train_epochs 2 \
+ --precision "bfloat16" \
+ --gradient_checkpointing True \
+ --bf16 True \
+ --learning_rate 0.00005 \
+ --lr_scheduler_type "linear" \
+ --max_seq_length 1024 \
+ --per_device_train_batch_size 32 \
+ --max_grad_norm 2 \
--warmup_steps 20
```
Note: *Some of these hyper-parameters may need further tuning to enhance the overall accuracy of the fine-tuned model. The values mentioned above were obtained through a quick hyper-parameter search. Parameters that could have a significant impact and are worth considering for tuning include: `learning_rate`, `max_grad_norm`, `warmup_steps`, `max_seq_length`.*
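+For example, a simple way to compare a few values of `learning_rate` is to loop over the fine-tuning command, writing each run to its own output directory. The sketch below keeps every other flag at the values shown above; the output directory names are illustrative:
+```bash
+# Hypothetical learning-rate sweep over the dense fine-tuning command
+for LR in 0.00002 0.00005 0.0001; do
+  accelerate launch \
+    --config_file example_fsdp_config.yaml \
+    --no_python sparseml.transformers.text_generation.finetune \
+    --model PATH_TO_MODEL \
+    --dataset "gsm8k" \
+    --dataset_config_name "main" \
+    --output_dir "dense_finetuned_lr_${LR}" \
+    --splits "train" \
+    --num_train_epochs 2 \
+    --precision "bfloat16" \
+    --gradient_checkpointing True \
+    --bf16 True \
+    --learning_rate "${LR}" \
+    --lr_scheduler_type "linear" \
+    --max_seq_length 1024 \
+    --per_device_train_batch_size 32 \
+    --max_grad_norm 2 \
+    --warmup_steps 20
+done
+```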
-The example_fsdp_config.yaml used above contains the following setup for FSDP. Set the `num_processes` to the number of GPUs available. For our setup, we used 4 NVIDIA A100 GPUs so we set `num_processes` to `4`.
+The `example_fsdp_config.yaml` used above contains the following FSDP setup. Set `num_processes` to the number of GPUs available. For our setup, we used 4 NVIDIA A100 80 GB GPUs, so we set `num_processes` to `4`.
```yaml
compute_environment: LOCAL_MACHINE
debug: false
@@ -92,7 +92,7 @@ tpu_use_sudo: false
use_cpu: false
```
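+If you run on a machine with a different number of GPUs and prefer not to edit the YAML, `accelerate launch` also accepts `--num_processes` directly, and command-line flags take precedence over the config file. This is a sketch of the override only; in practice keep the full set of fine-tuning flags from the command above:
+```bash
+# Example: launch on a 2-GPU machine without editing example_fsdp_config.yaml
+accelerate launch \
+  --config_file example_fsdp_config.yaml \
+  --num_processes 2 \
+  --no_python sparseml.transformers.text_generation.finetune \
+  --model PATH_TO_MODEL \
+  --dataset "gsm8k" \
+  --dataset_config_name "main" \
+  --output_dir PATH_TO_OUTPUT
+```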
### Dense fine-tuned model accuracy
[Evaluating](#evaluation-setup) the dense fine-tuned model on the `gsm8k 0-shot` task, results in a baseline accuracy of `37.52%`. We'll consider this accuracy as our baseline for calculating recovery for the oneshot sparse and sparse fine-tuned models we'll get later. Detailed results are provided below:
```json
{
@@ -125,19 +125,19 @@ Use the dense fine-tuned model obtained above and sparsify it to 50% in a onesho
Command:
```bash
-accelerate launch
- --config_file example_fsdp_config.yaml
- --no_python sparseml.transformers.text_generation.oneshot
- --model PATH_TO_MODEL
- --dataset "gsm8k"
- --dataset_config_name "main"
- --concatenate_data OPTIONAL
- --recipe PATH_TO_RECIPE
- --output_dir PATH_TO_OUTPUT
- --splits "train"
- --pad_to_max_length False
- --oneshot_device DEVICE
- --num_calibration_samples 1024
+accelerate launch \
+ --config_file example_fsdp_config.yaml \
+ --no_python sparseml.transformers.text_generation.oneshot \
+ --model PATH_TO_MODEL \
+ --dataset "gsm8k" \
+ --dataset_config_name "main" \
+ --concatenate_data OPTIONAL \
+ --recipe PATH_TO_RECIPE \
+ --output_dir PATH_TO_OUTPUT \
+ --splits "train" \
+ --pad_to_max_length False \
+ --oneshot_device DEVICE \
+ --num_calibration_samples 1024 \
--max_seq_len 4096
```
Note: *You may wish to tweak the `num_calibration_samples` above to obtain better accuracy.*
@@ -177,7 +177,7 @@ pruning_stage:
To learn more about the OWL non-uniform sparsity profile method, visit [this link](https://github.com/luuyin/OWL/tree/main?tab=readme-ov-file#script-example-of-pruning-llama-7b-using-owl-sparsegpt).
### Oneshot 50% sparse model accuracy
-[Evaluating](#evaluation-setup) the oneshot 50% sparse model on the `gsm8k 0-shot` task, results in an accuracy of `33.81%` and translates to a `90.11%` recovery over our [dense baseline](#Dense fine-tuned model accuracy). In the next step we'll see how to improve the recovery of this model using sparse fine-tuning. Detailed results for the oneshot 50% sparse model are provided below:
+[Evaluating](#evaluation-setup) the oneshot 50% sparse model on the `gsm8k 0-shot` task results in an accuracy of `33.81%`, which translates to a `90.11%` recovery of our [dense baseline](#dense-fine-tuned-model-accuracy). In the next step, we'll see how to improve the recovery of this model using sparse fine-tuning. Detailed results for the oneshot 50% sparse model are provided below:
```json
{
"results": {
@@ -209,26 +209,26 @@ The one-shot sparse model generated previously can undergo further sparse fine-t
Command:
```bash
-accelerate launch
- --config_file example_fsdp_config.yaml
- --no_python sparseml.transformers.text_generation.finetune
- --model PATH_TO_MODEL
- --dataset "gsm8k"
- --dataset_config_name "main"
- --output_dir PATH_TO_OUTPUT
- --splits "train"
- --num_train_epochs 2
- --precision "bfloat16"
- --gradient_checkpointing True
- --bf16 True
- --learning_rate 0.00005
- --lr_scheduler_type "linear"
- --max_seq_length 1024
- --per_device_train_batch_size 32
- --max_grad_norm None
- --warmup_steps 20
- --distill_teacher PATH_TO_TEACHER
- --recipe PATH_TO_RECIPE
+accelerate launch \
+ --config_file example_fsdp_config.yaml \
+ --no_python sparseml.transformers.text_generation.finetune \
+ --model PATH_TO_MODEL \
+ --dataset "gsm8k" \
+ --dataset_config_name "main" \
+ --output_dir PATH_TO_OUTPUT \
+ --splits "train" \
+ --num_train_epochs 2 \
+ --precision "bfloat16" \
+ --gradient_checkpointing True \
+ --bf16 True \
+ --learning_rate 0.00005 \
+ --lr_scheduler_type "linear" \
+ --max_seq_length 1024 \
+ --per_device_train_batch_size 32 \
+ --max_grad_norm None \
+ --warmup_steps 20 \
+ --distill_teacher PATH_TO_TEACHER \
+ --recipe PATH_TO_RECIPE
```
Recipe:
@@ -290,7 +290,7 @@ test_stage:
Note: *Some of these hyper-parameters may need further tuning to enhance the overall accuracy of the fine-tuned model. The values mentioned above were obtained through a quick hyper-parameter search. Parameters that could have a significant impact and are worth considering for tuning include: `learning_rate`, `max_grad_norm`, `warmup_steps`, `max_seq_length`.*
### Fine-tuned 50% sparse model accuracy
-[Evaluating](#evaluation-setup) the fine-tuned 50% sparse model on the `gsm8k 0-shot` task, results in an accuracy of `38.59%` and shows clear improvement over the [oneshot accuracy](#Oneshot 50% sparse model accuracy). The sparse fine-tuning step not only helped improve over the oneshot accuracy but even surpassed the dense baseline model. Detailed results for the oneshot 50% sparse model are provided below:
+[Evaluating](#evaluation-setup) the fine-tuned 50% sparse model on the `gsm8k 0-shot` task results in an accuracy of `38.59%`, a clear improvement over the [oneshot accuracy](#oneshot-50-sparse-model-accuracy). The sparse fine-tuning step not only improved on the oneshot accuracy but even surpassed the dense baseline model. Detailed results for the fine-tuned 50% sparse model are provided below:
```json
{
"results": {
@@ -331,7 +331,7 @@ MODEL_PATH=
TASK=gsm8k
python main.py \
--model sparseml \
- --model_args pretrained=MODEL_PATH,trust_remote_code=True \
+ --model_args pretrained=${MODEL_PATH},trust_remote_code=True \
--tasks $TASK \
--batch_size 48 \
--no_cache \
@@ -339,4 +339,4 @@ python main.py \
--output_path "${MODEL_PATH}/${TASK}.json" \
--device "cuda:0" \
--num_fewshot 0
-```
\ No newline at end of file
+```
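+The harness writes its results to `${MODEL_PATH}/${TASK}.json`. To pull the `gsm8k` metrics back out of that file afterwards, something like the following works; this is a sketch that assumes the same layout as the detailed results blocks shown earlier in this guide:
+```bash
+# Print the gsm8k metrics recorded by lm-evaluation-harness
+python -c "
+import json, sys
+with open(sys.argv[1]) as f:
+    data = json.load(f)
+print(json.dumps(data['results']['gsm8k'], indent=2))
+" "${MODEL_PATH}/${TASK}.json"
+```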