
Cleanup
Co-authored-by: rlsu9 <[email protected]>
jzhang38 and rlsu9 authored Dec 16, 2024
1 parent 3bf892b commit 58cfd71
Showing 21 changed files with 279 additions and 438 deletions.
194 changes: 40 additions & 154 deletions README.md
@@ -1,26 +1,9 @@
# FastVideo

<div align="center">
<img src=assets/logo.jpg width="50%"/>
<img src=assets/logo.jpg width="30%"/>
</div>

FastVideo is a scalable framework for post-training video diffusion models, addressing the growing challenges of fine-tuning, distillation, and inference as model sizes and sequence lengths increase. As a first step, it provides an efficient script for distilling and fine-tuning the 10B Mochi model, with plans to expand features and support for more models.

### Features

- FastMochi, a distilled Mochi model that can generate videos with just 8 sampling steps.
- Finetuning with FSDP (both master and EMA weights), sequence parallelism, and selective gradient checkpointing.
- LoRA coupled with precomputed latents and text embeddings for minimum memory consumption.
- Finetuning with both images and videos.

## Change Log


- ```2024/12/17```: `FastVideo` v0.1 is released.


## Fast and High-Quality Text-to-video Generation

FastVideo is an open framework for distilling, training, and running inference on large video diffusion models.
<div align="center">
<table style="margin-left: auto; margin-right: auto; border: none;">
<tr>
<td>
@@ -33,163 +16,66 @@ FastVideo is a scalable framework for post-training video diffusion models, addr
</td>
</tr>
</table>
</div>

## Table of Contents

Jump to a specific section:

- [🔧 Installation](#-installation)
- [🚀 Inference](#-inference)
- [🧱 Data Preprocess](#-data-preprocess)
- [🎯 Distill](#-distill)
- [⚡ Finetune](#-finetune)


## 🔧 Installation

- Python >= 3.10.0
- CUDA >= 12.1

```
git clone https://github.com/hao-ai-lab/FastVideo.git
cd FastVideo
./env_setup.sh fastvideo
# or you can install the working environment step by step following env_setup.sh
```


### What is this?

## 🚀 Inference

Use [scripts/huggingface/download_hf.py](scripts/huggingface/download_hf.py) to download a Hugging Face-style model to a local directory. Use it like this:
```bash
python scripts/huggingface/download_hf.py --repo_id=FastVideo/FastMochi --local_dir=data/FastMochi --repo_type=model
```
As state-of-the-art video diffusion models grow in size and sequence length, they become prohibitively expensive to use. For instance, sampling a 5-second 720P video with Hunyuan takes 13 minutes on 4×A100 GPUs. FastVideo aims to make large video diffusion models fast to infer and efficient to train, and thus more **accessible**.

We introduce FastMochi and FastHunyuan, distilled versions of the Mochi and Hunyuan video diffusion models. FastMochi achieves high-quality sampling with just 8 inference steps. FastHunyuan maintains sampling quality with only 4 inference steps.

### 🔛 Quick Start with Gradio UI

```
python demo/gradio_web_demo.py --model_path data/FastMochi
```

### 🔛 CLI Inference with Sequence Parallelism
### What can I do with FastVideo?
Beyond the distilled weights, FastVideo provides a pipeline for training, distilling, and running inference on video diffusion models. Key capabilities include:

We also provide a CLI inference script with sequence parallelism support in [scripts/inference](scripts/inference).

```
# bash scripts/inference/inference_mochi_sp.sh
num_gpus=4
torchrun --nnodes=1 --nproc_per_node=$num_gpus --master_port 29503 \
fastvideo/sample/sample_t2v_mochi.py \
--model_path data/FastMochi \
--prompt_path "assets/prompt.txt" \
--num_frames 93 \
--height 480 \
--width 848 \
--num_inference_steps 8 \
--guidance_scale 4.5 \
--output_path outputs_video/mochi_sp/ \
--shift 8 \
--seed 12345 \
--scheduler_type "pcm_linear_quadratic"
- **Scalable**: FastVideo supports FSDP, sequence parallelism, and selective gradient checkpointing. In our tests, the code scales seamlessly to 64 GPUs (see the FSDP sketch below).
- **Memory Efficient**: FastVideo supports LoRA finetuning coupled with precomputed latents and text embeddings for minimal memory usage.
- **Variable Sequence Length**: You can finetune with both images and videos.

```
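
The snippet below is a minimal, illustrative sketch of what FSDP wrapping with bf16 compute and fp32 gradient reduction can look like for the diffusion transformer; the function name and wrap-policy threshold are assumptions, not the repository's actual training code.

```python
# Illustrative only: shard a video DiT with PyTorch FSDP and bf16 compute.
# Assumes torch.distributed is already initialized (e.g. via torchrun) and a
# CUDA device has been set for this rank.
from functools import partial

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


def wrap_transformer(transformer: torch.nn.Module) -> FSDP:
    mixed_precision = MixedPrecision(
        param_dtype=torch.bfloat16,   # compute in bf16
        reduce_dtype=torch.float32,   # gradient all-reduce in fp32
        buffer_dtype=torch.bfloat16,
    )
    return FSDP(
        transformer,
        auto_wrap_policy=partial(size_based_auto_wrap_policy, min_num_params=int(1e7)),
        mixed_precision=mixed_precision,
        device_id=torch.cuda.current_device(),
        use_orig_params=True,         # keeps original parameter names for optimizers
    )
```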

## 🧱 Data Preprocess

To reduce the memory cost and time consumption caused by the VAE and T5 during distillation and finetuning, we offload VAE and T5 media preprocessing to this Data Preprocess step.
For data preprocessing, prepare a source folder containing the media you wish to use and a JSON file describing those media.

### Sample for Data Preprocess

We provide a small sample dataset to get you started; download the source media with the following command:
```
python scripts/huggingface/download_hf.py --repo_id=FastVideo/Image-Vid-Finetune-Src --local_dir=data/Image-Vid-Finetune-Src --repo_type=dataset
```
To preprocess the dataset for finetuning or distillation, run:

```
bash scripts/preprocess/preprocess_mochi_data.sh # for mochi
bash scripts/preprocess/preprocess_hunyuan_data.sh # for hunyuan
```

The preprocessed dataset will be stored in `Image-Vid-Finetune-Mochi` or `Image-Vid-Finetune-HunYuan` correspondingly.

### Create Custom Dataset
## Change Log

If you wish to create your own dataset for finetuning or distillation, please pay attention to the following format:
- ```2024/12/16```: `FastVideo` v0.1 is released.

Use a txt file to list the media source folder and the corresponding JSON metadata file:

```
path_to_media_source_folder,path_to_json_file
```
The content of the JSON file is a list with each item corresponding to a media source.
## 🔧 Installation
The code is tested on Python 3.10.0 and CUDA 12.1.

For image media, the JSON item needs to follow this format:
```
{
"path": "0.jpg",
"cap": ["captions"]
}
```
For video media, the JSON item needs to follow this format:
```
{
"path": "1.mp4",
"resolution": {
"width": 848,
"height": 480
},
"fps": 30.0,
"duration": 6.033333333333333,
"cap": [
"caption"
]
}
```
Adjust the `DATA_MERGE_PATH` and `OUTPUT_DIR` in `scripts/preprocess/preprocess_****_data.sh` accordingly and run:
```
bash scripts/preprocess/preprocess_****_data.sh
./env_setup.sh fastvideo
conda activate fastvideo
```
The preprocessed data will be placed in `OUTPUT_DIR`, and the resulting `videos2caption.json` can be used in the finetune and distill scripts.

## 🎯 Distill

We provide an example dataset here. First, download the test data to a local directory using [scripts/huggingface/download_hf.py](scripts/huggingface/download_hf.py):
## 🚀 Inference
We recommend using a GPU with 80GB of memory. To run inference, use the following commands:
### FastHunyuan
```bash
python scripts/huggingface/download_hf.py --repo_id=FastVideo/Mochi-425-Data --local_dir=data/Mochi-425-Data --repo_type=dataset
python scripts/huggingface/download_hf.py --repo_id=FastVideo/validation_embeddings --local_dir=data/validation_embeddings --repo_type=dataset
# Download the model weights
python scripts/huggingface/download_hf.py --repo_id=FastVideo/FastHunyuan --local_dir=data/FastHunyuan --repo_type=model
# Change the GPU count inside the script
sh scripts/inference/inference_hunyuan.sh
```

Then the distillation can be launched by:
### FastMochi
You can run FastMochi with the following commands:

```bash
# Download the model weights
python scripts/huggingface/download_hf.py --repo_id=FastVideo/FastMochi-diffusers --local_dir=data/FastMochi-diffusers --repo_type=model
# CLI inference
bash scripts/inference/inference_mochi_sp.sh
# Gradio web demo
python demo/gradio_web_demo.py --model_path data/FastMochi-diffusers --guidance_scale 1.5 --num_frames 163
```
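
If you prefer to call the pipeline directly from Python rather than through the scripts above, the sketch below shows one way this might look, assuming the `FastMochi-diffusers` checkpoint loads as a standard diffusers `MochiPipeline`; the prompt and sampling settings are illustrative and mirror the demo defaults above.

```python
# Minimal sketch, assuming data/FastMochi-diffusers is a diffusers-format
# checkpoint compatible with MochiPipeline (requires a recent diffusers release).
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained("data/FastMochi-diffusers", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # trade some speed for lower peak VRAM
pipe.enable_vae_tiling()

frames = pipe(
    prompt="A hand with delicate fingers picks up a bright yellow lemon.",  # example prompt
    num_inference_steps=8,   # FastMochi is distilled for few-step sampling
    guidance_scale=1.5,
    num_frames=163,
    height=480,
    width=848,
    generator=torch.Generator("cpu").manual_seed(12345),
).frames[0]
export_to_video(frames, "fastmochi_sample.mp4", fps=30)
```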
bash scripts/distill/distill_mochi.sh # for mochi
bash scripts/distill/distill_hunyuan.sh # for hunyuan
```


## ⚡ Finetune


### 💰Hardware requirement
## Distillation
Please refer to the [distillation guide](docs/distillation.md).

- 72GB of VRAM is required for finetuning the 10B Mochi model.
## Finetuning
Please refer to the [finetuning guide](docs/finetuning.md).

To launch finetuning, you will need to prepare data according to the formats described in the [Data Preprocess](#-data-preprocess) section.
## Development Plan

If you are doing image-video mixture finetuning, make sure `--group_frame` is in your script.

Then run the finetune with:
```
bash scripts/finetune/finetune_mochi.sh # for mochi
bash scripts/finetune/finetune_hunyuan.sh # for hunyuan
```

## Acknowledgement
We learned from and reused code from the following projects: [PCM](https://github.com/G-U-N/Phased-Consistency-Model), [diffusers](https://github.com/huggingface/diffusers), and [OpenSoraPlan](https://github.com/PKU-YuanGroup/Open-Sora-Plan).
We learned and reused code from the following projects: [PCM](https://github.com/G-U-N/Phased-Consistency-Model), [diffusers](https://github.com/huggingface/diffusers), and [OpenSoraPlan](https://github.com/PKU-YuanGroup/Open-Sora-Plan).
21 changes: 10 additions & 11 deletions demo/gradio_web_demo.py
@@ -19,14 +19,14 @@ def init_args():
parser.add_argument("--num_inference_steps", type=int, default=8)
parser.add_argument("--guidance_scale", type=float, default=4.5)
parser.add_argument("--model_path", type=str, default="data/mochi")
parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--seed", type=int, default=12345)
parser.add_argument("--transformer_path", type=str, default=None)
parser.add_argument("--scheduler_type", type=str, default="euler")
parser.add_argument("--scheduler_type", type=str, default="pcm_linear_quadratic")
parser.add_argument("--lora_checkpoint_dir", type=str, default=None)
parser.add_argument("--shift", type=float, default=8.0)
parser.add_argument("--num_euler_timesteps", type=int, default=100)
parser.add_argument("--linear_threshold", type=float, default=0.025)
parser.add_argument("--linear_range", type=float, default=0.5)
parser.add_argument("--num_euler_timesteps", type=int, default=50)
parser.add_argument("--linear_threshold", type=float, default=0.1)
parser.add_argument("--linear_range", type=float, default=0.75)
parser.add_argument("--cpu_offload", action="store_true")
return parser.parse_args()

@@ -36,11 +36,12 @@ def load_model(args):
if args.scheduler_type == "euler":
scheduler = FlowMatchEulerDiscreteScheduler()
else:
linear_quadratic = True if "linear_quadratic" in args.scheduler_type else False
scheduler = PCMFMScheduler(
1000,
args.shift,
args.num_euler_timesteps,
False,
linear_quadratic,
args.linear_threshold,
args.linear_range,
)
@@ -58,10 +59,9 @@ def load_model(args):
pipe.enable_vae_tiling()
#pipe.to(device)
#if args.cpu_offload:
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
return pipe


def generate_video(
prompt,
negative_prompt,
@@ -77,8 +77,6 @@ def generate_video(
if randomize_seed:
seed = torch.randint(0, 1000000, (1,)).item()

pipe = load_model(args)
print("load model successfully")
generator = torch.Generator(device="cuda").manual_seed(seed)

if not use_negative_prompt:
@@ -108,7 +106,8 @@ ]
]

args = init_args()

pipe = load_model(args)
print("load model successfully")
with gr.Blocks() as demo:
gr.Markdown("# Fastvideo Mochi Video Generation Demo")

68 changes: 68 additions & 0 deletions docs/data_preprocess.md
@@ -0,0 +1,68 @@



## 🧱 Data Preprocess

To save GPU memory, we precompute text embeddings and VAE latents to eliminate the need to load the text encoder and VAE during training.
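
The sketch below illustrates what this precompute step amounts to; the text-encoder checkpoint name, helper names, and tensor shapes are assumptions for illustration, not the repository's actual preprocessing code.

```python
# Illustrative sketch: encode captions with a T5 encoder and videos with the
# model's VAE once, then save the tensors so training never touches T5/VAE.
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Load the text encoder once, outside the per-sample loop.
NAME = "google/t5-v1_1-xxl"  # assumption; use whatever text encoder the model ships with
tokenizer = AutoTokenizer.from_pretrained(NAME)
text_encoder = T5EncoderModel.from_pretrained(NAME, torch_dtype=torch.bfloat16).to("cuda")


@torch.no_grad()
def precompute_text_embedding(caption: str) -> torch.Tensor:
    tokens = tokenizer(caption, padding="max_length", max_length=256,
                       truncation=True, return_tensors="pt").to("cuda")
    return text_encoder(**tokens).last_hidden_state.cpu()  # save to disk, reload at train time


@torch.no_grad()
def precompute_video_latents(vae, pixel_values: torch.Tensor) -> torch.Tensor:
    # pixel_values: [1, C, T, H, W] scaled to [-1, 1]; vae is the model's video VAE
    # (a diffusers-style autoencoder exposing .encode(...).latent_dist).
    return vae.encode(pixel_values).latent_dist.sample().cpu()
```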


We provide a sample dataset to help you get started. Download the source media using the following command:
```bash
python scripts/huggingface/download_hf.py --repo_id=FastVideo/Image-Vid-Finetune-Src --local_dir=data/Image-Vid-Finetune-Src --repo_type=dataset
```
To preprocess the dataset for fine-tuning or distillation, run:
```
bash scripts/preprocess/preprocess_mochi_data.sh # for mochi
bash scripts/preprocess/preprocess_hunyuan_data.sh # for hunyuan
```

The preprocessed dataset will be stored in `Image-Vid-Finetune-Mochi` or `Image-Vid-Finetune-HunYuan` correspondingly.

### Process your own dataset

If you wish to create your own dataset for finetuning or distillation, please structure your video dataset in the following format:

path_to_dataset_folder/
├── media/
│ ├── 0.jpg
│ ├── 1.mp4
│ ├── 2.jpg
├── video2caption.json
└── merge.txt

Format the JSON file as a list, where each item represents a media source:

For image media,
```
{
"path": "0.jpg",
"cap": ["captions"]
}
```
For video media,
```
{
"path": "1.mp4",
"resolution": {
"width": 848,
"height": 480
},
"fps": 30.0,
"duration": 6.033333333333333,
"cap": [
"caption"
]
}
```

Use a txt file (merge.txt) to list the media source folder and the corresponding JSON metadata file:

```
path_to_media_source_folder,path_to_json_file
```
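
If you are assembling the metadata yourself, a small helper along the following lines can generate the video entries in the format above; this is a hypothetical sketch that uses OpenCV and assumes you supply the captions separately.

```python
# Hypothetical helper: build video metadata entries in the format described above.
# Requires opencv-python; captions must be provided by the caller.
import json
from pathlib import Path

import cv2


def build_video_entries(media_dir: str, captions: dict[str, list[str]]) -> list[dict]:
    entries = []
    for path in sorted(Path(media_dir).glob("*.mp4")):
        video = cv2.VideoCapture(str(path))
        fps = video.get(cv2.CAP_PROP_FPS)
        n_frames = video.get(cv2.CAP_PROP_FRAME_COUNT)
        entries.append({
            "path": path.name,
            "resolution": {
                "width": int(video.get(cv2.CAP_PROP_FRAME_WIDTH)),
                "height": int(video.get(cv2.CAP_PROP_FRAME_HEIGHT)),
            },
            "fps": fps,
            "duration": n_frames / fps if fps else 0.0,
            "cap": captions.get(path.name, []),
        })
        video.release()
    return entries


# Placeholder paths matching the directory layout above.
with open("path_to_dataset_folder/video2caption.json", "w") as f:
    json.dump(build_video_entries("path_to_dataset_folder/media", captions={}), f, indent=2)
```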

Adjust the `DATA_MERGE_PATH` and `OUTPUT_DIR` in `scripts/preprocess/preprocess_****_data.sh` accordingly and run:
```
bash scripts/preprocess/preprocess_****_data.sh
```
The preprocessed data will be placed in `OUTPUT_DIR`, and the resulting `videos2caption.json` can be used in the finetune and distill scripts.
5 changes: 0 additions & 5 deletions docs/distill_hunyuan.md

This file was deleted.

24 changes: 24 additions & 0 deletions docs/distillation.md
@@ -0,0 +1,24 @@
## 🎯 Distill


Our distillation recipe is based on [Phased Consistency Model](https://github.com/G-U-N/Phased-Consistency-Model). We did not find a significant improvement from multi-phase distillation, so we keep a one-phase setup similar to the original latent consistency model recipe.
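
For intuition, the core of a one-phase consistency-distillation step looks roughly like the sketch below; every argument is a hypothetical callable standing in for the actual student, EMA student, and teacher/ODE solver, so this is not the repository's training loop.

```python
# Heavily simplified, illustrative only: the student at time t and the EMA
# student at an earlier time s (reached via one teacher/solver step) should
# predict the same clean latent; their mismatch is the distillation loss.
import torch
import torch.nn.functional as F


def consistency_distill_step(student, ema_student, teacher_ode_step,
                             latents, text_emb, t, s):
    # student/ema_student: (latents, timestep, text_emb) -> predicted clean latents
    # teacher_ode_step:    (latents, t, s, text_emb)     -> latents advanced from t to s
    with torch.no_grad():
        latents_s = teacher_ode_step(latents, t, s, text_emb)
        target = ema_student(latents_s, s, text_emb)
    pred = student(latents, t, text_emb)
    return F.huber_loss(pred, target)
```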

We use the [MixKit](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0/tree/main/all_mixkit) dataset for distillation. To avoid running the text encoder and VAE during training, we preprocess all data to generate text embeddings and VAE latents.

Preprocessing instructions can be found in [data_preprocess.md](#-data-preprocess). For convenience, we also provide preprocessed data that can be downloaded directly using the following command:

```bash
python scripts/huggingface/download_hf.py --repo_id=FastVideo/HD-Mixkit-Finetune-Hunyuan --local_dir=data/HD-Mixkit-Finetune-Hunyuan --repo_type=dataset
```
Next, download the original model weights with:

```bash
python scripts/huggingface/download_hf.py --repo_id=FastVideo/hunyuan --local_dir=data/hunyuan --repo_type=model
```
To launch the distillation process, use the following commands:

```
bash scripts/distill/distill_mochi.sh # for mochi
bash scripts/distill/distill_hunyuan.sh # for hunyuan
```
We also provide an optional script for distillation with adversarial loss, located at `fastvideo/distill_adv.py`. Although we tried adversarial loss, we did not observe significant improvements.
