Clean up (hao-ai-lab#84)
Co-authored-by: rlsu9 <[email protected]>
jzhang38 and rlsu9 authored Dec 16, 2024
1 parent 58cfd71 commit 285635e
Showing 9 changed files with 93 additions and 57 deletions.
12 changes: 5 additions & 7 deletions README.md
@@ -22,7 +22,7 @@ FastVideo is an open framework for distilling, training, and inferencing large v

As state-of-the-art video diffusion models grow in size and sequence length, they become prohibitive to use. For instance, sampling a 5-second 720P video with Hunyuan takes 13 minutes on 4 X A100. FastVideo aims to make large video diffusion models fast to infer and efficient to train, and thus more **accessible**.

We introduce FastMochi and FastHunyuan, distilled versions of the Mochi and Hunyuan video diffusion models. FastMochi achieves high-quality sampling with just 8 inference steps. FastHunyuan maintains sampling quality with only 4 inference steps.
We introduce FastMochi and FastHunyuan, distilled versions of the Mochi and Hunyuan video diffusion models. The distilled models are 8X faster to sample.



@@ -31,15 +31,15 @@ Other than the distilled weight, FastVideo provides a pipeline for training, dis

- **Scalable**: FastVideo supports FSDP, sequence parallelism, and selective gradient checkpointing. Our code seamlessly scales to 64 GPUs in our test.
- **Memory Efficient**: FastVideo supports LoRA finetuning coupled with precomputed latents and text embeddings for minimal memory usage.
- **Variable Sequence length**: You can finetuning with both image and videos.
- **Variable Sequence length**: You can finetune with both images and videos.

## Change Log

- ```2024/12/16```: `FastVideo` v0.1 is released.


## 🔧 Installation
The code is tested on Python 3.10.0 and CUDA 12.1.
The code is tested on Python 3.10.0, CUDA 12.1 and H100.

```
./env_setup.sh fastvideo
@@ -55,7 +55,7 @@ python scripts/huggingface/download_hf.py --repo_id=FastVideo/FastHunyuan --loca
# change the gpu count inside the script
sh scripts/inference/inference_hunyuan.sh
```

You can also run inference with FastHunyuan in the [official Hunyuan github](https://github.com/Tencent/HunyuanVideo).
### FastMochi
You can use FastMochi

@@ -64,12 +64,10 @@ You can use FastMochi
python scripts/huggingface/download_hf.py --repo_id=FastVideo/FastMochi-diffusers --local_dir=data/FastMochi-diffusers --repo_type=model
# CLI inference
bash scripts/inference/inference_mochi_sp.sh
# Gradio web demo
python demo/gradio_web_demo.py --model_path data/FastMochi-diffusers --guidance_scale 1.5 --num_frames 163
```

## Distillation
Please refer to the [distillation guide](docs/distilation.md).
Please refer to the [distillation guide](docs/distillation.md).

## Finetuning
Please refer to the [finetuning guide](docs/finetuning.md).
3 changes: 1 addition & 2 deletions docs/distillation.md
@@ -1,11 +1,10 @@
## 🎯 Distill


Our distillation recipe is based on [Phased Consistency Model](https://github.com/G-U-N/Phased-Consistency-Model). We did not find a significant improvement from multi-phase distillation, so we keep a one-phase setup similar to the original latent consistency model's recipe.
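
For reference, the one-phase objective is essentially the standard consistency-distillation loss. The formulation below is a generic sketch of that objective (notation borrowed from the consistency model literature), not a transcription of FastVideo's training code:

$$
\mathcal{L}_{\mathrm{CD}}(\theta) \;=\; \mathbb{E}\left[\, d\Big( f_\theta(x_{t_{n+1}}, t_{n+1}),\; f_{\theta^{-}}\big(\hat{x}^{\phi}_{t_n}, t_n\big) \Big) \right],
$$

where $f_\theta$ is the student consistency function, $\hat{x}^{\phi}_{t_n}$ is obtained from $x_{t_{n+1}}$ by one ODE-solver step under the frozen teacher $\phi$, $\theta^{-}$ is an EMA copy of the student weights, and $d(\cdot,\cdot)$ is a distance such as Huber or $\ell_2$. Multi-phase PCM splits the sampling trajectory into several segments and enforces this loss within each segment; the one-phase setup treats the whole trajectory as a single segment.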

We use the [MixKit](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0/tree/main/all_mixkit) dataset for distillation. To avoid running the text encoder and VAE during training, we preprocess all data to generate text embeddings and VAE latents.
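
As a concrete illustration of this precompute-then-train setup, below is a minimal sketch of a dataset that reads cached latents and text embeddings from disk. The JSON layout, file names, and tensor shapes are assumptions made for illustration, not FastVideo's exact on-disk format:

```python
# Minimal sketch: train from precomputed VAE latents and text embeddings,
# so neither the VAE nor the text encoder runs inside the training loop.
# The JSON schema and tensor keys below are illustrative assumptions.
import json
import os

import torch
from torch.utils.data import DataLoader, Dataset


class PrecomputedLatentDataset(Dataset):
    def __init__(self, data_json_path: str):
        with open(data_json_path) as f:
            # e.g. [{"latent_path": "latents/0001.pt", "prompt_embed_path": "prompt_embed/0001.pt"}, ...]
            self.items = json.load(f)
        self.root = os.path.dirname(data_json_path)

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int):
        item = self.items[idx]
        latent = torch.load(os.path.join(self.root, item["latent_path"]))              # video latent, e.g. [C, T, H, W]
        prompt_embed = torch.load(os.path.join(self.root, item["prompt_embed_path"]))  # text embedding, e.g. [L, D]
        return latent, prompt_embed


# Usage sketch: the training loop consumes cached tensors directly.
# dataset = PrecomputedLatentDataset("data/HD-Mixkit-Finetune-Hunyuan/videos2caption.json")
# loader = DataLoader(dataset, batch_size=1, num_workers=1)
# for latent, prompt_embed in loader:
#     ...  # add noise to `latent`, run the diffusion transformer, compute the loss
```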

Preprocessing instructions can be found [data_preprocess.md](#-data-preprocess). For convenience, we also provide preprocessed data that can be downloaded directly using the following command:
Preprocessing instructions can be found in [data_preprocess.md](docs/data_preprocess.md). For convenience, we also provide preprocessed data that can be downloaded directly using the following command:

```bash
python scripts/huggingface/download_hf.py --repo_id=FastVideo/HD-Mixkit-Finetune-Hunyuan --local_dir=data/HD-Mixkit-Finetune-Hunyuan --repo_type=dataset
35 changes: 21 additions & 14 deletions docs/finetuning.md
@@ -1,34 +1,41 @@

## ⚡ Finetune
## Full Finetune

We support full fine-tuning for both the Mochi and Hunyuan models. Additionally, we provide Image-Video Mix finetuning.


Ensure your data is prepared and preprocessed in the format specified in the [Data Preprocess](#-data-preprocess).
Ensure your data is prepared and preprocessed in the format specified in [data_preprocess.md](docs/data_preprocess.md). For convenience, we also provide preprocessed Black Myth Wukong data for Mochi that can be downloaded directly:
```bash
python scripts/huggingface/download_hf.py --repo_id=FastVideo/Mochi-Black-Myth --local_dir=data/Mochi-Black-Myth --repo_type=dataset
```
Download the original model weights with:
```bash
python scripts/huggingface/download_hf.py --repo_id=genmo/mochi-1-preview --local_dir=data/mochi --repo_type=model
python scripts/huggingface/download_hf.py --repo_id=FastVideo/hunyuan --local_dir=data/hunyuan --repo_type=model
```


FastVideo/BLACK-MYTH-YQ
Then run the finetune with:
Then you can run finetuning with:
```
bash scripts/finetune/finetune_mochi.sh # for mochi
bash scripts/finetune/finetune_hunyuan.sh # for hunyuan
```
For Image-Video Mixture Fine-tuning, make sure to enable the --group_frame option in your script.

**Note that we did not tune the hyperparameters in the provided script.**

## Lora Finetune
## Lora Finetune

Currently, we only provide LoRA finetuning for the Mochi model. The command for LoRA finetuning is:
```
bash scripts/finetune/finetune_mochi_lora.sh
```
### Minimum Hardware Requirement
- 40 GB GPU memory each for 2 GPUs with LoRA
- 30 GB GPU memory each for 2 GPUs with CPU offload and LoRA.

## Finetune with Both Image and Video
Our codebase supports finetuning with both images and videos.

```bash
bash scripts/finetune/finetune_hunyuan.sh
bash scripts/finetune/finetune_mochi_lora_mix.sh
```
For Image-Video Mixture Fine-tuning, make sure to enable the --group_frame option in your script.


### 💰 Hardware requirement

- 72 GB of VRAM is required for finetuning the 10B Mochi model.

3 changes: 2 additions & 1 deletion fastvideo/sample/sample_t2v_hunyuan.py
@@ -84,7 +84,8 @@ def main(args):
x = x.transpose(0, 1).transpose(1, 2).squeeze(-1)
outputs.append((x * 255).numpy().astype(np.uint8))
os.makedirs(os.path.dirname(args.output_path), exist_ok=True)
imageio.mimsave(args.output_path + f"{prompt[:100]}.mp4", outputs, fps=args.fps)
imageio.mimsave(os.path.join(args.output_path, f"{prompt[:100]}.mp4"), outputs, fps=args.fps)




19 changes: 7 additions & 12 deletions fastvideo/sample/sample_t2v_mochi.py
@@ -107,7 +107,6 @@ def main(args):
encoder_attention_mask = None

if prompts is not None:
videos = []
with torch.autocast("cuda", dtype=torch.bfloat16):
for prompt in prompts:
video = pipe(
@@ -119,7 +118,12 @@ guidance_scale=args.guidance_scale,
guidance_scale=args.guidance_scale,
generator=generator,
).frames
videos.append(video[0])
if nccl_info.global_rank <= 0:
os.makedirs(args.output_path, exist_ok=True)
suffix = prompt.split(".")[0]
export_to_video(
video[0], os.path.join(args.output_path, f"{suffix}.mp4"), fps=30
)
else:
with torch.autocast("cuda", dtype=torch.bfloat16):
videos = pipe(
@@ -133,16 +137,7 @@ generator=generator,
generator=generator,
).frames

if nccl_info.global_rank <= 0:
if prompts is not None:
# mkdir
os.makedirs(args.output_path, exist_ok=True)
for video, prompt in zip(videos, prompts):
suffix = prompt.split(".")[0]
export_to_video(
video, os.path.join(args.output_path, f"{suffix}.mp4"), fps=30
)
else:
if nccl_info.global_rank <= 0:
export_to_video(videos[0], args.output_path + ".mp4", fps=30)


13 changes: 6 additions & 7 deletions scripts/finetune/finetune_mochi.sh
@@ -1,13 +1,13 @@
export WANDB_BASE_URL="https://api.wandb.ai"
export WANDB_MODE=online

torchrun --nnodes 1 --nproc_per_node 1 \
torchrun --nnodes 1 --nproc_per_node 4 \
fastvideo/train.py \
--seed 42 \
--pretrained_model_name_or_path data/FastMochi-diffusers \
--cache_dir data/.cache \
--data_json_path data/Image-Vid-Finetune-Mochi/videos2caption.json \
--validation_prompt_dir data/Image-Vid-Finetune-Mochi/validation \
--data_json_path data/Mochi-Black-Myth/videos2caption.json \
--validation_prompt_dir data/Mochi-Black-Myth/validation \
--gradient_checkpointing \
--train_batch_size=1 \
--num_latent_t 16 \
@@ -27,7 +27,6 @@ torchrun --nnodes 1 --nproc_per_node 1 \
--cfg 0.0 \
--ema_decay 0.999 \
--log_validation \
--output_dir=data/outputs/HSH-Taylor-Finetune \
--tracker_project_name HSH-Taylor-Finetune \
--num_frames 93 \
--group_frame
--output_dir=data/outputs/Black-Myth-Finetune \
--tracker_project_name Black-Myth-Finetune \
--num_frames 93
18 changes: 9 additions & 9 deletions scripts/finetune/finetune_mochi_lora.sh
@@ -1,20 +1,20 @@
export WANDB_BASE_URL="https://api.wandb.ai"
export WANDB_MODE=online

CUDA_VISIBLE_DEVICES=5 torchrun --nnodes 1 --nproc_per_node 1 \
torchrun --nnodes 1 --nproc_per_node 2 \
fastvideo/train.py \
--seed 42 \
--pretrained_model_name_or_path data/mochi \
--cache_dir data/.cache \
--data_json_path data/Image-Vid-Finetune-Mochi/videos2caption.json \
--validation_prompt_dir data/Image-Vid-Finetune-Mochi/validation \
--data_json_path data/Mochi-Black-Myth/videos2caption.json \
--validation_prompt_dir data/Mochi-Black-Myth/validation \
--gradient_checkpointing \
--train_batch_size=1 \
--num_latent_t 14 \
--sp_size 1 \
--sp_size 2 \
--train_sp_batch_size 1 \
--dataloader_num_workers 1 \
--gradient_accumulation_steps=1 \
--gradient_accumulation_steps=2 \
--max_train_steps=2000 \
--learning_rate=5e-6 \
--mixed_precision=bf16 \
@@ -27,11 +27,11 @@ CUDA_VISIBLE_DEVICES=5 torchrun --nnodes 1 --nproc_per_node 1 \
--cfg 0.0 \
--ema_decay 0.999 \
--log_validation \
--output_dir=data/outputs/HSH-Taylor-Finetune-Lora \
--tracker_project_name HSH-Taylor-Finetune-Lora \
--output_dir=data/outputs/Black-Myth-Lora-FT \
--tracker_project_name Black-Myth-Lora-Finetune \
--num_frames 91 \
--group_frame \
--lora_rank 128 \
--lora_alpha 256 \
--master_weight_type "bf16" \
--use_lora
--use_lora \
--use_cpu_offload
37 changes: 37 additions & 0 deletions scripts/finetune/finetune_mochi_lora_mix.sh
@@ -0,0 +1,37 @@
export WANDB_BASE_URL="https://api.wandb.ai"
export WANDB_MODE=online

CUDA_VISIBLE_DEVICES=5 torchrun --nnodes 1 --nproc_per_node 1 \
fastvideo/train.py \
--seed 42 \
--pretrained_model_name_or_path data/mochi \
--cache_dir data/.cache \
--data_json_path data/Image-Vid-Finetune-Mochi/videos2caption.json \
--validation_prompt_dir data/Image-Vid-Finetune-Mochi/validation \
--gradient_checkpointing \
--train_batch_size=1 \
--num_latent_t 14 \
--sp_size 1 \
--train_sp_batch_size 1 \
--dataloader_num_workers 1 \
--gradient_accumulation_steps=1 \
--max_train_steps=2000 \
--learning_rate=5e-6 \
--mixed_precision=bf16 \
--checkpointing_steps=200 \
--validation_steps 100 \
--validation_sampling_steps 64 \
--checkpoints_total_limit 3 \
--allow_tf32 \
--ema_start_step 0 \
--cfg 0.0 \
--ema_decay 0.999 \
--log_validation \
--output_dir=data/outputs/HSH-Taylor-Finetune-Lora \
--tracker_project_name HSH-Taylor-Finetune-Lora \
--num_frames 91 \
--group_frame \
--lora_rank 128 \
--lora_alpha 256 \
--master_weight_type "bf16" \
--use_lora
10 changes: 5 additions & 5 deletions scripts/inference/inference_hunyuan.sh
@@ -1,7 +1,7 @@
#!/bin/bash

num_gpus=[Your GPU Count]

num_gpus=4
export MODEL_BASE=data/FastHunyuan
torchrun --nnodes=1 --nproc_per_node=$num_gpus --master_port 29503 \
fastvideo/sample/sample_t2v_hunyuan.py \
--height 720 \
@@ -14,6 +14,6 @@ torchrun --nnodes=1 --nproc_per_node=$num_gpus --master_port 29503 \
--flow-reverse \
--prompt ./assets/prompt.txt \
--seed 12345 \
--output_path outputs_video/hunyuan/ \
--model_path data/FastHunyuan \
--dit-weight data/FastHunyuan/hunyuan-video-t2v-720p/transformers/diffusion_pytorch_model.safetensors
--output_path outputs_video/hunyuan/cfg6/ \
--model_path $MODEL_BASE \
--dit-weight ${MODEL_BASE}/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt
