
Cleanup
Co-authored-by: rlsu9 <[email protected]>
jzhang38 and rlsu9 authored Dec 16, 2024
1 parent 3bf892b commit 58cfd71
Showing 21 changed files with 279 additions and 438 deletions.
194 changes: 40 additions & 154 deletions README.md
@@ -1,26 +1,9 @@
# FastVideo

<div align="center">
<img src=assets/logo.jpg width="50%"/>
<img src=assets/logo.jpg width="30%"/>
</div>

FastVideo is a scalable framework for post-training video diffusion models, addressing the growing challenges of fine-tuning, distillation, and inference as model sizes and sequence lengths increase. As a first step, it provides an efficient script for distilling and fine-tuning the 10B Mochi model, with plans to expand features and support for more models.

### Features

- FastMochi, a distilled Mochi model that can generate videos with just 8 sampling steps.
- Finetuning with FSDP (both master and EMA weights), sequence parallelism, and selective gradient checkpointing.
- LoRA coupled with precomputed latents and text embeddings for minimum memory consumption.
- Finetuning with both images and videos.

## Change Log


- ```2024/12/17```: `FastVideo` v0.1 is released.


## Fast and High-Quality Text-to-video Generation

FastVideo is an open framework for distilling, training, and running inference on large video diffusion models.
<div align="center">
<table style="margin-left: auto; margin-right: auto; border: none;">
<tr>
<td>
@@ -33,163 +16,66 @@ FastVideo is a scalable framework for post-training video diffusion models, addr
</td>
</tr>
</table>
</div>

## Table of Contents

Jump to a specific section:

- [🔧 Installation](#-installation)
- [🚀 Inference](#-inference)
- [🧱 Data Preprocess](#-data-preprocess)
- [🎯 Distill](#-distill)
- [⚡ Finetune](#-finetune)


## 🔧 Installation

- Python >= 3.10.0
- CUDA >= 12.1

```
git clone https://github.com/hao-ai-lab/FastVideo.git
cd FastVideo
./env_setup.sh fastvideo
# or you can install the working environment step by step following env_setup.sh
```


### What is this?

## 🚀 Inference

Use [scripts/huggingface/download_hf.py](scripts/huggingface/download_hf.py) to download a Hugging Face-style model to a local directory. Use it like this:
```bash
python scripts/huggingface/download_hf.py --repo_id=FastVideo/FastMochi --local_dir=data/FastMochi --repo_type=model
```
As state-of-the-art video diffusion models grow in size and sequence length, they become prohibitively expensive to use. For instance, sampling a 5-second 720P video with Hunyuan takes 13 minutes on 4×A100 GPUs. FastVideo aims to make large video diffusion models fast to infer and efficient to train, and thus more **accessible**.

We introduce FastMochi and FastHunyuan, distilled versions of the Mochi and Hunyuan video diffusion models. FastMochi achieves high-quality sampling with just 8 inference steps. FastHunyuan maintains sampling quality with only 4 inference steps.

### 🔛 Quick Start with Gradio UI

```
python demo/gradio_web_demo.py --model_path data/FastMochi
```

### 🔛 CLI Inference with Sequence Parallelism
### What can I do with FastVideo?
Beyond the distilled weights, FastVideo provides a pipeline for training, distilling, and running inference on video diffusion models. Key capabilities include:

We also provide a CLI inference script with sequence parallelism support in [scripts/inference](scripts/inference).

```
# bash scripts/inference/inference_mochi_sp.sh
num_gpus=4
torchrun --nnodes=1 --nproc_per_node=$num_gpus --master_port 29503 \
fastvideo/sample/sample_t2v_mochi.py \
--model_path data/FastMochi \
--prompt_path "assets/prompt.txt" \
--num_frames 93 \
--height 480 \
--width 848 \
--num_inference_steps 8 \
--guidance_scale 4.5 \
--output_path outputs_video/mochi_sp/ \
--shift 8 \
--seed 12345 \
--scheduler_type "pcm_linear_quadratic"
- **Scalable**: FastVideo supports FSDP, sequence parallelism, and selective gradient checkpointing. In our tests, the code scales seamlessly to 64 GPUs (see the FSDP sketch below).
- **Memory Efficient**: FastVideo supports LoRA finetuning coupled with precomputed latents and text embeddings for minimal memory usage.
- **Variable Sequence Length**: You can finetune with both images and videos.

```
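
The snippet below is a minimal, illustrative sketch of what FSDP wrapping with bf16 compute and fp32 gradient reduction can look like for the diffusion transformer; the function name and wrap-policy threshold are assumptions, not the repository's actual training code.

```python
# Illustrative only: shard a video DiT with PyTorch FSDP and bf16 compute.
# Assumes torch.distributed is already initialized (e.g. via torchrun) and a
# CUDA device has been set for this rank.
from functools import partial

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


def wrap_transformer(transformer: torch.nn.Module) -> FSDP:
    mixed_precision = MixedPrecision(
        param_dtype=torch.bfloat16,   # compute in bf16
        reduce_dtype=torch.float32,   # gradient all-reduce in fp32
        buffer_dtype=torch.bfloat16,
    )
    return FSDP(
        transformer,
        auto_wrap_policy=partial(size_based_auto_wrap_policy, min_num_params=int(1e7)),
        mixed_precision=mixed_precision,
        device_id=torch.cuda.current_device(),
        use_orig_params=True,         # keeps original parameter names for optimizers
    )
```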

## 🧱 Data Preprocess

To reduce the memory cost and time consumption caused by the VAE and T5 during distillation and finetuning, we offload VAE and T5 media preprocessing to this Data Preprocess step.
For data preprocessing, prepare a source folder containing the media you wish to use and a JSON file describing those media.

### Sample for Data Preprocess

We provide a small sample dataset to get you started; download the source media with the following command:
```
python scripts/huggingface/download_hf.py --repo_id=FastVideo/Image-Vid-Finetune-Src --local_dir=data/Image-Vid-Finetune-Src --repo_type=dataset
```
To preprocess the dataset for finetuning or distillation, run:

```
bash scripts/preprocess/preprocess_mochi_data.sh # for mochi
bash scripts/preprocess/preprocess_hunyuan_data.sh # for hunyuan
```

The preprocessed dataset will be stored in `Image-Vid-Finetune-Mochi` or `Image-Vid-Finetune-HunYuan` correspondingly.

### Create Custom Dataset
## Change Log

If you wish to create your own dataset for finetuning or distillation, please pay attention to the following format:
- ```2024/12/16```: `FastVideo` v0.1 is released.

Use a txt file to list the media source folder and the corresponding JSON metadata file:

```
path_to_media_source_folder,path_to_json_file
```
The content of the JSON file is a list with each item corresponding to a media source.
## 🔧 Installation
The code is tested on Python 3.10.0 and CUDA 12.1.

For image media, the JSON item needs to follow this format:
```
{
"path": "0.jpg",
"cap": ["captions"]
}
```
For video media, the JSON item needs to follow this format:
```
{
"path": "1.mp4",
"resolution": {
"width": 848,
"height": 480
},
"fps": 30.0,
"duration": 6.033333333333333,
"cap": [
"caption"
]
}
```
Adjust the `DATA_MERGE_PATH` and `OUTPUT_DIR` in `scripts/preprocess/preprocess_****_data.sh` accordingly and run:
```
bash scripts/preprocess/preprocess_****_data.sh
./env_setup.sh fastvideo
conda activate fastvideo
```
The preprocessed data will be placed in `OUTPUT_DIR`, and the resulting `videos2caption.json` can be used in the finetune and distill scripts.

## 🎯 Distill

We provide an example dataset here. First, download the test data to a local directory using [scripts/huggingface/download_hf.py](scripts/huggingface/download_hf.py):
## 🚀 Inference
We recommend using a GPU with 80GB of memory. To run inference, use the following commands:
### FastHunyuan
```bash
python scripts/huggingface/download_hf.py --repo_id=FastVideo/Mochi-425-Data --local_dir=data/Mochi-425-Data --repo_type=dataset
python scripts/huggingface/download_hf.py --repo_id=FastVideo/validation_embeddings --local_dir=data/validation_embeddings --repo_type=dataset
# Download the model weights
python scripts/huggingface/download_hf.py --repo_id=FastVideo/FastHunyuan --local_dir=data/FastHunyuan --repo_type=model
# Change the GPU count inside the script
sh scripts/inference/inference_hunyuan.sh
```

Then the distillation can be launched by:
### FastMochi
You can run FastMochi with the following commands:

```bash
# Download the model weights
python scripts/huggingface/download_hf.py --repo_id=FastVideo/FastMochi-diffusers --local_dir=data/FastMochi-diffusers --repo_type=model
# CLI inference
bash scripts/inference/inference_mochi_sp.sh
# Gradio web demo
python demo/gradio_web_demo.py --model_path data/FastMochi-diffusers --guidance_scale 1.5 --num_frames 163
```
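
If you prefer to call the pipeline directly from Python rather than through the scripts above, the sketch below shows one way this might look, assuming the `FastMochi-diffusers` checkpoint loads as a standard diffusers `MochiPipeline`; the prompt and sampling settings are illustrative and mirror the demo defaults above.

```python
# Minimal sketch, assuming data/FastMochi-diffusers is a diffusers-format
# checkpoint compatible with MochiPipeline (requires a recent diffusers release).
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained("data/FastMochi-diffusers", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # trade some speed for lower peak VRAM
pipe.enable_vae_tiling()

frames = pipe(
    prompt="A hand with delicate fingers picks up a bright yellow lemon.",  # example prompt
    num_inference_steps=8,   # FastMochi is distilled for few-step sampling
    guidance_scale=1.5,
    num_frames=163,
    height=480,
    width=848,
    generator=torch.Generator("cpu").manual_seed(12345),
).frames[0]
export_to_video(frames, "fastmochi_sample.mp4", fps=30)
```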
bash scripts/distill/distill_mochi.sh # for mochi
bash scripts/distill/distill_hunyuan.sh # for hunyuan
```


## ⚡ Finetune


### 💰Hardware requirement
## Distillation
Please refer to the [distillation guide](docs/distillation.md).

- 72GB of VRAM is required for finetuning the 10B Mochi model.
## Finetuning
Please refer to the [finetuning guide](docs/finetuning.md).

To launch finetuning, you will need to prepare data according to the formats described in the [Data Preprocess](#-data-preprocess) section.
## Development Plan

If you are doing image-video mixture finetuning, make sure `--group_frame` is in your script.

Then run the finetune with:
```
bash scripts/finetune/finetune_mochi.sh # for mochi
bash scripts/finetune/finetune_hunyuan.sh # for hunyuan
```

## Acknowledgement
We learned from and reused code from the following projects: [PCM](https://github.com/G-U-N/Phased-Consistency-Model), [diffusers](https://github.com/huggingface/diffusers), and [OpenSoraPlan](https://github.com/PKU-YuanGroup/Open-Sora-Plan).
We learned and reused code from the following projects: [PCM](https://github.com/G-U-N/Phased-Consistency-Model), [diffusers](https://github.com/huggingface/diffusers), and [OpenSoraPlan](https://github.com/PKU-YuanGroup/Open-Sora-Plan).
21 changes: 10 additions & 11 deletions demo/gradio_web_demo.py
@@ -19,14 +19,14 @@ def init_args():
parser.add_argument("--num_inference_steps", type=int, default=8)
parser.add_argument("--guidance_scale", type=float, default=4.5)
parser.add_argument("--model_path", type=str, default="data/mochi")
parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--seed", type=int, default=12345)
parser.add_argument("--transformer_path", type=str, default=None)
parser.add_argument("--scheduler_type", type=str, default="euler")
parser.add_argument("--scheduler_type", type=str, default="pcm_linear_quadratic")
parser.add_argument("--lora_checkpoint_dir", type=str, default=None)
parser.add_argument("--shift", type=float, default=8.0)
parser.add_argument("--num_euler_timesteps", type=int, default=100)
parser.add_argument("--linear_threshold", type=float, default=0.025)
parser.add_argument("--linear_range", type=float, default=0.5)
parser.add_argument("--num_euler_timesteps", type=int, default=50)
parser.add_argument("--linear_threshold", type=float, default=0.1)
parser.add_argument("--linear_range", type=float, default=0.75)
parser.add_argument("--cpu_offload", action="store_true")
return parser.parse_args()

@@ -36,11 +36,12 @@ def load_model(args):
if args.scheduler_type == "euler":
scheduler = FlowMatchEulerDiscreteScheduler()
else:
linear_quadratic = True if "linear_quadratic" in args.scheduler_type else False
scheduler = PCMFMScheduler(
1000,
args.shift,
args.num_euler_timesteps,
False,
linear_quadratic,
args.linear_threshold,
args.linear_range,
)
@@ -58,10 +59,9 @@ def load_model(args):
pipe.enable_vae_tiling()
#pipe.to(device)
#if args.cpu_offload:
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
return pipe


def generate_video(
prompt,
negative_prompt,
@@ -77,8 +77,6 @@ def generate_video(
if randomize_seed:
seed = torch.randint(0, 1000000, (1,)).item()

pipe = load_model(args)
print("load model successfully")
generator = torch.Generator(device="cuda").manual_seed(seed)

if not use_negative_prompt:
@@ -108,7 +106,8 @@ ]
]

args = init_args()

pipe = load_model(args)
print("load model successfully")
with gr.Blocks() as demo:
gr.Markdown("# Fastvideo Mochi Video Generation Demo")

68 changes: 68 additions & 0 deletions docs/data_preprocess.md
@@ -0,0 +1,68 @@



## 🧱 Data Preprocess

To save GPU memory, we precompute text embeddings and VAE latents to eliminate the need to load the text encoder and VAE during training.
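
The sketch below illustrates what this precompute step amounts to; the text-encoder checkpoint name, helper names, and tensor shapes are assumptions for illustration, not the repository's actual preprocessing code.

```python
# Illustrative sketch: encode captions with a T5 encoder and videos with the
# model's VAE once, then save the tensors so training never touches T5/VAE.
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Load the text encoder once, outside the per-sample loop.
NAME = "google/t5-v1_1-xxl"  # assumption; use whatever text encoder the model ships with
tokenizer = AutoTokenizer.from_pretrained(NAME)
text_encoder = T5EncoderModel.from_pretrained(NAME, torch_dtype=torch.bfloat16).to("cuda")


@torch.no_grad()
def precompute_text_embedding(caption: str) -> torch.Tensor:
    tokens = tokenizer(caption, padding="max_length", max_length=256,
                       truncation=True, return_tensors="pt").to("cuda")
    return text_encoder(**tokens).last_hidden_state.cpu()  # save to disk, reload at train time


@torch.no_grad()
def precompute_video_latents(vae, pixel_values: torch.Tensor) -> torch.Tensor:
    # pixel_values: [1, C, T, H, W] scaled to [-1, 1]; vae is the model's video VAE
    # (a diffusers-style autoencoder exposing .encode(...).latent_dist).
    return vae.encode(pixel_values).latent_dist.sample().cpu()
```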


We provide a sample dataset to help you get started. Download the source media using the following command:
```bash
python scripts/huggingface/download_hf.py --repo_id=FastVideo/Image-Vid-Finetune-Src --local_dir=data/Image-Vid-Finetune-Src --repo_type=dataset
```
To preprocess the dataset for fine-tuning or distillation, run:
```
bash scripts/preprocess/preprocess_mochi_data.sh # for mochi
bash scripts/preprocess/preprocess_hunyuan_data.sh # for hunyuan
```

The preprocessed dataset will be stored in `Image-Vid-Finetune-Mochi` or `Image-Vid-Finetune-HunYuan` correspondingly.

### Process your own dataset

If you wish to create your own dataset for finetuning or distillation, please structure your video dataset in the following format:

path_to_dataset_folder/
├── media/
│ ├── 0.jpg
│ ├── 1.mp4
│ ├── 2.jpg
├── video2caption.json
└── merge.txt

Format the JSON file as a list, where each item represents a media source:

For image media,
```
{
"path": "0.jpg",
"cap": ["captions"]
}
```
For video media,
```
{
"path": "1.mp4",
"resolution": {
"width": 848,
"height": 480
},
"fps": 30.0,
"duration": 6.033333333333333,
"cap": [
"caption"
]
}
```

Use a txt file (merge.txt) to list the media source folder and the corresponding JSON metadata file:

```
path_to_media_source_folder,path_to_json_file
```
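
If you are assembling the metadata yourself, a small helper along the following lines can generate the video entries in the format above; this is a hypothetical sketch that uses OpenCV and assumes you supply the captions separately.

```python
# Hypothetical helper: build video metadata entries in the format described above.
# Requires opencv-python; captions must be provided by the caller.
import json
from pathlib import Path

import cv2


def build_video_entries(media_dir: str, captions: dict[str, list[str]]) -> list[dict]:
    entries = []
    for path in sorted(Path(media_dir).glob("*.mp4")):
        video = cv2.VideoCapture(str(path))
        fps = video.get(cv2.CAP_PROP_FPS)
        n_frames = video.get(cv2.CAP_PROP_FRAME_COUNT)
        entries.append({
            "path": path.name,
            "resolution": {
                "width": int(video.get(cv2.CAP_PROP_FRAME_WIDTH)),
                "height": int(video.get(cv2.CAP_PROP_FRAME_HEIGHT)),
            },
            "fps": fps,
            "duration": n_frames / fps if fps else 0.0,
            "cap": captions.get(path.name, []),
        })
        video.release()
    return entries


# Placeholder paths matching the directory layout above.
with open("path_to_dataset_folder/video2caption.json", "w") as f:
    json.dump(build_video_entries("path_to_dataset_folder/media", captions={}), f, indent=2)
```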

Adjust the `DATA_MERGE_PATH` and `OUTPUT_DIR` in `scripts/preprocess/preprocess_****_data.sh` accordingly and run:
```
bash scripts/preprocess/preprocess_****_data.sh
```
The preprocessed data will be placed in `OUTPUT_DIR`, and the resulting `videos2caption.json` can be used in the finetune and distill scripts.
5 changes: 0 additions & 5 deletions docs/distill_hunyuan.md

This file was deleted.

24 changes: 24 additions & 0 deletions docs/distillation.md
@@ -0,0 +1,24 @@
## 🎯 Distill


Our distillation recipe is based on [Phased Consistency Model](https://github.com/G-U-N/Phased-Consistency-Model). We did not find a significant improvement from multi-phase distillation, so we keep a one-phase setup similar to the original latent consistency model recipe.
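
For intuition, the core of a one-phase consistency-distillation step looks roughly like the sketch below; every argument is a hypothetical callable standing in for the actual student, EMA student, and teacher/ODE solver, so this is not the repository's training loop.

```python
# Heavily simplified, illustrative only: the student at time t and the EMA
# student at an earlier time s (reached via one teacher/solver step) should
# predict the same clean latent; their mismatch is the distillation loss.
import torch
import torch.nn.functional as F


def consistency_distill_step(student, ema_student, teacher_ode_step,
                             latents, text_emb, t, s):
    # student/ema_student: (latents, timestep, text_emb) -> predicted clean latents
    # teacher_ode_step:    (latents, t, s, text_emb)     -> latents advanced from t to s
    with torch.no_grad():
        latents_s = teacher_ode_step(latents, t, s, text_emb)
        target = ema_student(latents_s, s, text_emb)
    pred = student(latents, t, text_emb)
    return F.huber_loss(pred, target)
```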

We use the [MixKit](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0/tree/main/all_mixkit) dataset for distillation. To avoid running the text encoder and VAE during training, we preprocess all data to generate text embeddings and VAE latents.

Preprocessing instructions can be found in [data_preprocess.md](#-data-preprocess). For convenience, we also provide preprocessed data that can be downloaded directly using the following command:

```bash
python scripts/huggingface/download_hf.py --repo_id=FastVideo/HD-Mixkit-Finetune-Hunyuan --local_dir=data/HD-Mixkit-Finetune-Hunyuan --repo_type=dataset
```
Next, download the original model weights with:

```bash
python scripts/huggingface/download_hf.py --repo_id=FastVideo/hunyuan --local_dir=data/hunyuan --repo_type=model
```
To launch the distillation process, use the following commands:

```
bash scripts/distill/distill_mochi.sh # for mochi
bash scripts/distill/distill_hunyuan.sh # for hunyuan
```
We also provide an optional script for distillation with adversarial loss, located at `fastvideo/distill_adv.py`. Although we tried adversarial loss, we did not observe significant improvements.
