Clean up (hao-ai-lab#84)
Co-authored-by: rlsu9 <[email protected]>
jzhang38 and rlsu9 authored Dec 16, 2024
1 parent 58cfd71 commit 285635e
Showing 9 changed files with 93 additions and 57 deletions.
12 changes: 5 additions & 7 deletions README.md
@@ -22,7 +22,7 @@ FastVideo is an open framework for distilling, training, and inferencing large v

As state-of-the-art video diffusion models grow in size and sequence length, they become prohibitive to use. For instance, sampling a 5-second 720P video with Hunyuan takes 13 minutes on 4 X A100. FastVideo aims to make large video diffusion models fast to infer and efficient to train, and thus more **accessible**.

We introduce FastMochi and FastHunyuan, distilled versions of the Mochi and Hunyuan video diffusion models. FastMochi achieves high-quality sampling with just 8 inference steps. FastHunyuan maintains sampling quality with only 4 inference steps.
We introduce FastMochi and FastHunyuan, distilled versions of the Mochi and Hunyuan video diffusion models. The distilled models are 8X faster to sample.



@@ -31,15 +31,15 @@ Other than the distilled weight, FastVideo provides a pipeline for training, dis

- **Scalable**: FastVideo supports FSDP, sequence parallelism, and selective gradient checkpointing. Our code seamlessly scales to 64 GPUs in our test.
- **Memory Efficient**: FastVideo supports LoRA finetuning coupled with precomputed latents and text embeddings for minimal memory usage.
- **Variable Sequence length**: You can finetuning with both image and videos.
- **Variable Sequence length**: You can finetune with both images and videos.

## Change Log

- ```2024/12/16```: `FastVideo` v0.1 is released.


## 🔧 Installation
The code is tested on Python 3.10.0 and CUDA 12.1.
The code is tested on Python 3.10.0, CUDA 12.1 and H100.

```
./env_setup.sh fastvideo
@@ -55,7 +55,7 @@ python scripts/huggingface/download_hf.py --repo_id=FastVideo/FastHunyuan --loca
# change the gpu count inside the script
sh scripts/inference/inference_hunyuan.sh
```

You can also run inference with FastHunyuan in the [official Hunyuan github](https://github.com/Tencent/HunyuanVideo).
### FastMochi
You can use FastMochi

@@ -64,12 +64,10 @@ You can use FastMochi
python scripts/huggingface/download_hf.py --repo_id=FastVideo/FastMochi-diffusers --local_dir=data/FastMochi-diffusers --repo_type=model
# CLI inference
bash scripts/inference/inference_mochi_sp.sh
# Gradio web demo
python demo/gradio_web_demo.py --model_path data/FastMochi-diffusers --guidance_scale 1.5 --num_frames 163
```

## Distillation
Please refer to the [distillation guide](docs/distilation.md).
Please refer to the [distillation guide](docs/distillation.md).

## Finetuning
Please refer to the [finetuning guide](docs/finetuning.md).
3 changes: 1 addition & 2 deletions docs/distillation.md
@@ -1,11 +1,10 @@
## 🎯 Distill


Our distillation recipe is based on [Phased Consistency Model](https://github.com/G-U-N/Phased-Consistency-Model). We did not find a significant improvement from multi-phase distillation, so we keep a one-phase setup similar to the original latent consistency model's recipe.
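
For reference, the one-phase objective is essentially the standard consistency-distillation loss. The formulation below is a generic sketch of that objective (notation borrowed from the consistency model literature), not a transcription of FastVideo's training code:

$$
\mathcal{L}_{\mathrm{CD}}(\theta) \;=\; \mathbb{E}\left[\, d\Big( f_\theta(x_{t_{n+1}}, t_{n+1}),\; f_{\theta^{-}}\big(\hat{x}^{\phi}_{t_n}, t_n\big) \Big) \right],
$$

where $f_\theta$ is the student consistency function, $\hat{x}^{\phi}_{t_n}$ is obtained from $x_{t_{n+1}}$ by one ODE-solver step under the frozen teacher $\phi$, $\theta^{-}$ is an EMA copy of the student weights, and $d(\cdot,\cdot)$ is a distance such as Huber or $\ell_2$. Multi-phase PCM splits the sampling trajectory into several segments and enforces this loss within each segment; the one-phase setup treats the whole trajectory as a single segment.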

We use the [MixKit](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0/tree/main/all_mixkit) dataset for distillation. To avoid running the text encoder and VAE during training, we preprocess all data to generate text embeddings and VAE latents.
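
As a concrete illustration of this precompute-then-train setup, below is a minimal sketch of a dataset that reads cached latents and text embeddings from disk. The JSON layout, file names, and tensor shapes are assumptions made for illustration, not FastVideo's exact on-disk format:

```python
# Minimal sketch: train from precomputed VAE latents and text embeddings,
# so neither the VAE nor the text encoder runs inside the training loop.
# The JSON schema and tensor keys below are illustrative assumptions.
import json
import os

import torch
from torch.utils.data import DataLoader, Dataset


class PrecomputedLatentDataset(Dataset):
    def __init__(self, data_json_path: str):
        with open(data_json_path) as f:
            # e.g. [{"latent_path": "latents/0001.pt", "prompt_embed_path": "prompt_embed/0001.pt"}, ...]
            self.items = json.load(f)
        self.root = os.path.dirname(data_json_path)

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int):
        item = self.items[idx]
        latent = torch.load(os.path.join(self.root, item["latent_path"]))              # video latent, e.g. [C, T, H, W]
        prompt_embed = torch.load(os.path.join(self.root, item["prompt_embed_path"]))  # text embedding, e.g. [L, D]
        return latent, prompt_embed


# Usage sketch: the training loop consumes cached tensors directly.
# dataset = PrecomputedLatentDataset("data/HD-Mixkit-Finetune-Hunyuan/videos2caption.json")
# loader = DataLoader(dataset, batch_size=1, num_workers=1)
# for latent, prompt_embed in loader:
#     ...  # add noise to `latent`, run the diffusion transformer, compute the loss
```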

Preprocessing instructions can be found [data_preprocess.md](#-data-preprocess). For convenience, we also provide preprocessed data that can be downloaded directly using the following command:
Preprocessing instructions can be found in [data_preprocess.md](docs/data_preprocess.md). For convenience, we also provide preprocessed data that can be downloaded directly using the following command:

```bash
python scripts/huggingface/download_hf.py --repo_id=FastVideo/HD-Mixkit-Finetune-Hunyuan --local_dir=data/HD-Mixkit-Finetune-Hunyuan --repo_type=dataset
35 changes: 21 additions & 14 deletions docs/finetuning.md
@@ -1,34 +1,41 @@

## ⚡ Finetune
## Full Finetune

We support full fine-tuning for both the Mochi and Hunyuan models. Additionally, we provide Image-Video Mix finetuning.


Ensure your data is prepared and preprocessed in the format specified in the [Data Preprocess](#-data-preprocess).
Ensure your data is prepared and preprocessed in the format specified in [data_preprocess.md](docs/data_preprocess.md). For convenience, we also provide preprocessed Black Myth Wukong data for Mochi that can be downloaded directly:
```bash
python scripts/huggingface/download_hf.py --repo_id=FastVideo/Mochi-Black-Myth --local_dir=data/Mochi-Black-Myth --repo_type=dataset
```
Download the original model weights with:
```bash
python scripts/huggingface/download_hf.py --repo_id=genmo/mochi-1-preview --local_dir=data/mochi --repo_type=model
python scripts/huggingface/download_hf.py --repo_id=FastVideo/hunyuan --local_dir=data/hunyuan --repo_type=model
```


FastVideo/BLACK-MYTH-YQ
Then run the finetune with:
Then you can run finetuning with:
```
bash scripts/finetune/finetune_mochi.sh # for mochi
bash scripts/finetune/finetune_hunyuan.sh # for hunyuan
```
For Image-Video Mixture Fine-tuning, make sure to enable the --group_frame option in your script.

**Note that we did not tune the hyperparameters in the provided script.**

## Lora Finetune
## Lora Finetune

Currently, we only provide LoRA finetuning for the Mochi model. The command for LoRA finetuning is:
```
bash scripts/finetune/finetune_mochi_lora.sh
```
### Minimum Hardware Requirement
- 40 GB GPU memory each for 2 GPUs with LoRA
- 30 GB GPU memory each for 2 GPUs with CPU offload and LoRA.

## Finetune with Both Image and Video
Our codebase supports finetuning with both images and videos.

```bash
bash scripts/finetune/finetune_hunyuan.sh
bash scripts/finetune/finetune_mochi_lora_mix.sh
```
For Image-Video Mixture Fine-tuning, make sure to enable the --group_frame option in your script.


### 💰 Hardware requirement

- 72 GB of VRAM is required for finetuning the 10B Mochi model.

3 changes: 2 additions & 1 deletion fastvideo/sample/sample_t2v_hunyuan.py
@@ -84,7 +84,8 @@ def main(args):
x = x.transpose(0, 1).transpose(1, 2).squeeze(-1)
outputs.append((x * 255).numpy().astype(np.uint8))
os.makedirs(os.path.dirname(args.output_path), exist_ok=True)
imageio.mimsave(args.output_path + f"{prompt[:100]}.mp4", outputs, fps=args.fps)
imageio.mimsave(os.path.join(args.output_path, f"{prompt[:100]}.mp4"), outputs, fps=args.fps)




19 changes: 7 additions & 12 deletions fastvideo/sample/sample_t2v_mochi.py
@@ -107,7 +107,6 @@ def main(args):
encoder_attention_mask = None

if prompts is not None:
videos = []
with torch.autocast("cuda", dtype=torch.bfloat16):
for prompt in prompts:
video = pipe(
@@ -119,7 +118,12 @@ guidance_scale=args.guidance_scale,
guidance_scale=args.guidance_scale,
generator=generator,
).frames
videos.append(video[0])
if nccl_info.global_rank <= 0:
os.makedirs(args.output_path, exist_ok=True)
suffix = prompt.split(".")[0]
export_to_video(
video[0], os.path.join(args.output_path, f"{suffix}.mp4"), fps=30
)
else:
with torch.autocast("cuda", dtype=torch.bfloat16):
videos = pipe(
@@ -133,16 +137,7 @@ generator=generator,
generator=generator,
).frames

if nccl_info.global_rank <= 0:
if prompts is not None:
# mkdir
os.makedirs(args.output_path, exist_ok=True)
for video, prompt in zip(videos, prompts):
suffix = prompt.split(".")[0]
export_to_video(
video, os.path.join(args.output_path, f"{suffix}.mp4"), fps=30
)
else:
if nccl_info.global_rank <= 0:
export_to_video(videos[0], args.output_path + ".mp4", fps=30)


13 changes: 6 additions & 7 deletions scripts/finetune/finetune_mochi.sh
@@ -1,13 +1,13 @@
export WANDB_BASE_URL="https://api.wandb.ai"
export WANDB_MODE=online

torchrun --nnodes 1 --nproc_per_node 1 \
torchrun --nnodes 1 --nproc_per_node 4 \
fastvideo/train.py \
--seed 42 \
--pretrained_model_name_or_path data/FastMochi-diffusers \
--cache_dir data/.cache \
--data_json_path data/Image-Vid-Finetune-Mochi/videos2caption.json \
--validation_prompt_dir data/Image-Vid-Finetune-Mochi/validation \
--data_json_path data/Mochi-Black-Myth/videos2caption.json \
--validation_prompt_dir data/Mochi-Black-Myth/validation \
--gradient_checkpointing \
--train_batch_size=1 \
--num_latent_t 16 \
@@ -27,7 +27,6 @@ torchrun --nnodes 1 --nproc_per_node 1 \
--cfg 0.0 \
--ema_decay 0.999 \
--log_validation \
--output_dir=data/outputs/HSH-Taylor-Finetune \
--tracker_project_name HSH-Taylor-Finetune \
--num_frames 93 \
--group_frame
--output_dir=data/outputs/Black-Myth-Finetune \
--tracker_project_name Black-Myth-Finetune \
--num_frames 93
18 changes: 9 additions & 9 deletions scripts/finetune/finetune_mochi_lora.sh
@@ -1,20 +1,20 @@
export WANDB_BASE_URL="https://api.wandb.ai"
export WANDB_MODE=online

CUDA_VISIBLE_DEVICES=5 torchrun --nnodes 1 --nproc_per_node 1 \
torchrun --nnodes 1 --nproc_per_node 2 \
fastvideo/train.py \
--seed 42 \
--pretrained_model_name_or_path data/mochi \
--cache_dir data/.cache \
--data_json_path data/Image-Vid-Finetune-Mochi/videos2caption.json \
--validation_prompt_dir data/Image-Vid-Finetune-Mochi/validation \
--data_json_path data/Mochi-Black-Myth/videos2caption.json \
--validation_prompt_dir data/Mochi-Black-Myth/validation \
--gradient_checkpointing \
--train_batch_size=1 \
--num_latent_t 14 \
--sp_size 1 \
--sp_size 2 \
--train_sp_batch_size 1 \
--dataloader_num_workers 1 \
--gradient_accumulation_steps=1 \
--gradient_accumulation_steps=2 \
--max_train_steps=2000 \
--learning_rate=5e-6 \
--mixed_precision=bf16 \
@@ -27,11 +27,11 @@ CUDA_VISIBLE_DEVICES=5 torchrun --nnodes 1 --nproc_per_node 1 \
--cfg 0.0 \
--ema_decay 0.999 \
--log_validation \
--output_dir=data/outputs/HSH-Taylor-Finetune-Lora \
--tracker_project_name HSH-Taylor-Finetune-Lora \
--output_dir=data/outputs/Black-Myth-Lora-FT \
--tracker_project_name Black-Myth-Lora-Finetune \
--num_frames 91 \
--group_frame \
--lora_rank 128 \
--lora_alpha 256 \
--master_weight_type "bf16" \
--use_lora
--use_lora \
--use_cpu_offload
37 changes: 37 additions & 0 deletions scripts/finetune/finetune_mochi_lora_mix.sh
@@ -0,0 +1,37 @@
export WANDB_BASE_URL="https://api.wandb.ai"
export WANDB_MODE=online

CUDA_VISIBLE_DEVICES=5 torchrun --nnodes 1 --nproc_per_node 1 \
fastvideo/train.py \
--seed 42 \
--pretrained_model_name_or_path data/mochi \
--cache_dir data/.cache \
--data_json_path data/Image-Vid-Finetune-Mochi/videos2caption.json \
--validation_prompt_dir data/Image-Vid-Finetune-Mochi/validation \
--gradient_checkpointing \
--train_batch_size=1 \
--num_latent_t 14 \
--sp_size 1 \
--train_sp_batch_size 1 \
--dataloader_num_workers 1 \
--gradient_accumulation_steps=1 \
--max_train_steps=2000 \
--learning_rate=5e-6 \
--mixed_precision=bf16 \
--checkpointing_steps=200 \
--validation_steps 100 \
--validation_sampling_steps 64 \
--checkpoints_total_limit 3 \
--allow_tf32 \
--ema_start_step 0 \
--cfg 0.0 \
--ema_decay 0.999 \
--log_validation \
--output_dir=data/outputs/HSH-Taylor-Finetune-Lora \
--tracker_project_name HSH-Taylor-Finetune-Lora \
--num_frames 91 \
--group_frame \
--lora_rank 128 \
--lora_alpha 256 \
--master_weight_type "bf16" \
--use_lora
10 changes: 5 additions & 5 deletions scripts/inference/inference_hunyuan.sh
@@ -1,7 +1,7 @@
#!/bin/bash

num_gpus=[Your GPU Count]

num_gpus=4
export MODEL_BASE=data/FastHunyuan
torchrun --nnodes=1 --nproc_per_node=$num_gpus --master_port 29503 \
fastvideo/sample/sample_t2v_hunyuan.py \
--height 720 \
@@ -14,6 +14,6 @@ torchrun --nnodes=1 --nproc_per_node=$num_gpus --master_port 29503 \
--flow-reverse \
--prompt ./assets/prompt.txt \
--seed 12345 \
--output_path outputs_video/hunyuan/ \
--model_path data/FastHunyuan \
--dit-weight data/FastHunyuan/hunyuan-video-t2v-720p/transformers/diffusion_pytorch_model.safetensors
--output_path outputs_video/hunyuan/cfg6/ \
--model_path $MODEL_BASE \
--dit-weight ${MODEL_BASE}/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt
