Update README (hao-ai-lab#85)
Co-authored-by: rlsu9 <[email protected]>
jzhang38 and rlsu9 authored Dec 17, 2024
1 parent 285635e commit b393570
Showing 7 changed files with 25 additions and 41 deletions.
28 changes: 11 additions & 17 deletions README.md
@@ -2,7 +2,7 @@
<img src=assets/logo.jpg width="30%"/>
</div>

FastVideo is an open framework for distilling, training, and inferencing large video diffusion model.
FastVideo is an open-source framework for accelerating large video diffusion models.
<div align="center">
<table style="margin-left: auto; margin-right: auto; border: none;">
<tr>
@@ -18,32 +18,27 @@ FastVideo is an open framework for distilling, training, and inferencing large v
</table>
</div>

### What is this?
<p align="center">
🤗 <a href="https://huggingface.co/FastVideo/FastMochi-diffuser" target="_blank">FastMochi</a> | 🤗 <a href="https://huggingface.co/FastVideo/FastHunyuan" target="_blank">FastHunyuan</a>
</p>

As state-of-the-art video diffusion models grow in size and sequence length, they become prohibitively expensive to use. For instance, sampling a 5-second 720P video with Hunyuan takes 13 minutes on 4×A100 GPUs. FastVideo aims to make large video diffusion models fast to infer and efficient to train, and thus more **accessible**.
FastVideo currently offers (with more to come):

We introduce FastMochi and FastHunyuan, distilled versions of the Mochi and Hunyuan video diffusion models. The distilled models are 8x faster to sample.
- FastHunyuan and FastMochi: consistency-distilled video diffusion models for 8x inference speedup.
- First open video DiT distillation recipes based on [PCM](https://github.com/G-U-N/Phased-Consistency-Model).
- Scalable training with FSDP, sequence parallelism, and selective activation checkpointing, with near linear scaling to 64 GPUs.
- Memory efficient finetuning with LoRA, precomputed latents, and precomputed text embeddings.



### What can I do with FastVideo?
Beyond the distilled weights, FastVideo provides a pipeline for training, distilling, and running inference on video diffusion models. Key capabilities include:

- **Scalable**: FastVideo supports FSDP, sequence parallelism, and selective gradient checkpointing. In our tests, the code scales seamlessly to 64 GPUs.
- **Memory Efficient**: FastVideo supports LoRA finetuning coupled with precomputed latents and text embeddings for minimal memory usage.
- **Variable sequence length**: You can finetune with both images and videos.

## Change Log

- ```2024/12/16```: `FastVideo` v0.1 is released.
- ```2024/12/17```: `FastVideo` v0.1 is released.


## 🔧 Installation
The code is tested with Python 3.10.0 and CUDA 12.1 on H100 GPUs.

```bash
./env_setup.sh fastvideo
conda activate fastvideo
```
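After setup, a quick sanity check can confirm the interpreter matches the tested configuration (a minimal sketch; assumes the `fastvideo` conda env created by `env_setup.sh` is active):

```python
import sys

# env_setup.sh pins Python 3.10.0; flag any other interpreter version.
version = f"{sys.version_info.major}.{sys.version_info.minor}"
if version == "3.10":
    print(f"Python {version}: matches the tested configuration")
else:
    print(f"Python {version}: untested version, proceed with caution")
```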

## 🚀 Inference
Expand All @@ -52,12 +47,11 @@ We recommend using a GPU with 80GB of memory. To run the inference, use the foll
```bash
# Download the model weight
python scripts/huggingface/download_hf.py --repo_id=FastVideo/FastHunyuan --local_dir=data/FastHunyuan --repo_type=model
# change the gpu count inside the script
# CLI inference
sh scripts/inference/inference_hunyuan.sh
```
You can also run FastHunyuan inference via the [official Hunyuan repository](https://github.com/Tencent/HunyuanVideo).
### FastMochi
You can run FastMochi inference with the following commands:

```bash
# Download the model weight
Binary file modified assets/8steps/mochi-demo.gif
17 changes: 8 additions & 9 deletions assets/prompt.txt
@@ -1,9 +1,8 @@
A hand enters the frame, pulling a sheet of plastic wrap over three balls of dough placed on a wooden surface. The plastic wrap is stretched to cover the dough more securely. The hand adjusts the wrap, ensuring that it is tight and smooth over the dough. The scene focuses on the hand's movements as it secures the edges of the plastic wrap. No new objects appear, and the camera remains stationary, focusing on the action of covering the dough.
A vintage train snakes through the mountains, its plume of white steam rising dramatically against the jagged peaks. The cars glint in the late afternoon sun, their deep crimson and gold accents lending a touch of elegance. The tracks carve a precarious path along the cliffside, revealing glimpses of a roaring river far below. Inside, passengers peer out the large windows, their faces lit with awe as the landscape unfolds.
A crowded rooftop bar buzzes with energy, the city skyline twinkling like a field of stars in the background. Strings of fairy lights hang above, casting a warm, golden glow over the scene. Groups of people gather around high tables, their laughter blending with the soft rhythm of live jazz. The aroma of freshly mixed cocktails and charred appetizers wafts through the air, mingling with the cool night breeze.
In "The Matrix", Neo, played by Keanu Reeves, embodies the struggle against an oppressive system through his iconic look, which includes dark glasses. These lenses are not merely a fashion accessory; they represent a barrier between reality and perception. By wearing them, Neo immerses himself in a world where truth is hidden behind illusions and deceptions. The darkness of the lenses symbolizes the ignorance and control the machines hold over humanity, while his own search for truth leads him to discover his authentic powers. The scene in which he puts them on becomes a pivotal moment, marking his transformation from a simple programmer into "The One". This image has become a cultural icon, encapsulating the message that, by confronting the darkness, we can find the light that guides us toward freedom. Thus, Neo's glasses become a symbol of resistance and self-knowledge in a manipulated world.
Medium close up. Low-angle shot. A woman in a 1950s retro dress sits in a diner bathed in neon light, surrounded by classic decor and lively chatter. The camera starts with a medium shot of her sitting at the counter, then slowly zooms in as she blows a shiny pink bubblegum bubble. The bubble swells dramatically before popping with a soft, playful burst. The scene is vibrant and nostalgic, evoking the fun and carefree spirit of the 1950s.
Will Smith eats noodles.
A short clip of the blonde woman taking a sip from her whiskey glass, her eyes locking with the camera as she smirks playfully. The background shows a group of people laughing and enjoying the party, with vibrant neon signs illuminating the space. The shot is taken in a way that conveys the feeling of a tipsy, carefree night out. The camera then zooms in on her face as she winks, creating a cheeky, flirtatious vibe.
A superintelligent humanoid robot waking up. The robot has a sleek metallic body with futuristic design features. Its glowing red eyes are the focal point, emanating a sharp, intense light as it powers on. The scene is set in a dimly lit, high-tech laboratory filled with glowing control panels, robotic arms, and holographic screens. The setting emphasizes advanced technology and an atmosphere of mystery. The ambiance is eerie and dramatic, highlighting the moment of awakening and the robot's immense intelligence. Photorealistic style with a cinematic, dark sci-fi aesthetic. Aspect ratio: 16:9 --v 6.1
A chimpanzee lead vocalist singing into a microphone on stage. The camera zooms in to show him singing. There is a spotlight on him.
Will Smith casually eats noodles, his relaxed demeanor contrasting with the energetic background of a bustling street food market. The scene captures a mix of humor and authenticity. Mid-shot framing, vibrant lighting.
A lone hiker stands atop a towering cliff, silhouetted against the vast horizon. The rugged landscape stretches endlessly beneath, its earthy tones blending into the soft blues of the sky. The scene captures the spirit of exploration and human resilience. High angle, dynamic framing, with soft natural lighting emphasizing the grandeur of nature.
A hand with delicate fingers picks up a bright yellow lemon from a wooden bowl filled with lemons and sprigs of mint against a peach-colored background. The hand gently tosses the lemon up and catches it, showcasing its smooth texture. A beige string bag sits beside the bowl, adding a rustic touch to the scene. Additional lemons, one halved, are scattered around the base of the bowl. The even lighting enhances the vibrant colors and creates a fresh, inviting atmosphere.
A curious raccoon peers through a vibrant field of yellow sunflowers, its eyes wide with interest. The playful yet serene atmosphere is complemented by soft natural light filtering through the petals. Mid-shot, warm and cheerful tones.
A superintelligent humanoid robot waking up. The robot has a sleek metallic body with futuristic design features. Its glowing red eyes are the focal point, emanating a sharp, intense light as it powers on. The scene is set in a dimly lit, high-tech laboratory filled with glowing control panels, robotic arms, and holographic screens. The setting emphasizes advanced technology and an atmosphere of mystery. The ambiance is eerie and dramatic, highlighting the moment of awakening and the robot's immense intelligence. Photorealistic style with a cinematic, dark sci-fi aesthetic. Aspect ratio: 16:9 --v 6.1
fox in the forest close-up quickly turned its head to the left
Man walking his dog in the woods on a hot sunny day
A majestic lion strides across the golden savanna, its powerful frame glistening under the warm afternoon sun. The tall grass ripples gently in the breeze, enhancing the lion's commanding presence. The tone is vibrant, embodying the raw energy of the wild. Low angle, steady tracking shot, cinematic.
12 changes: 0 additions & 12 deletions env_setup.sh
@@ -1,16 +1,4 @@
#!/bin/bash
set -e

CONDA_ENV=${1:-""}
if [ -n "$CONDA_ENV" ]; then
# This is required to activate conda environment
eval "$(conda shell.bash hook)"

conda create -n $CONDA_ENV python=3.10.0 -y
conda activate $CONDA_ENV
else
echo "Skipping conda environment creation. Make sure you have the correct environment activated."
fi

# install torch
pip install torch==2.5.0 torchvision --index-url https://download.pytorch.org/whl/cu121
3 changes: 2 additions & 1 deletion fastvideo/models/hunyuan/inference.py
@@ -421,7 +421,8 @@ def predict(
raise ValueError(
f"Seed must be an integer, a list of integers, or None, got {seed}."
)
generator = [torch.Generator(self.device).manual_seed(seed) for seed in seeds]
# Peiyuan: using GPU seed will cause A100 and H100 to generate different results...
generator = [torch.Generator("cpu").manual_seed(seed) for seed in seeds]
out_dict["seeds"] = seeds

# ========================================================================
4 changes: 3 additions & 1 deletion fastvideo/sample/sample_t2v_mochi.py
@@ -39,7 +39,7 @@ def main(args):
initialize_distributed()
print(nccl_info.sp_size)
device = torch.cuda.current_device()
generator = torch.Generator(device).manual_seed(args.seed)
# Peiyuan: GPU seed will cause A100 and H100 to produce different results .....
weight_dtype = torch.bfloat16
if args.scheduler_type == "euler":
scheduler = FlowMatchEulerDiscreteScheduler()
@@ -109,6 +109,7 @@ def main(args):
if prompts is not None:
with torch.autocast("cuda", dtype=torch.bfloat16):
for prompt in prompts:
generator = torch.Generator("cpu").manual_seed(args.seed)
video = pipe(
prompt=[prompt],
height=args.height,
@@ -126,6 +127,7 @@
)
else:
with torch.autocast("cuda", dtype=torch.bfloat16):
generator = torch.Generator("cpu").manual_seed(args.seed)
videos = pipe(
prompt_embeds=prompt_embeds,
prompt_attention_mask=encoder_attention_mask,
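The change above builds a fresh CPU-seeded `torch.Generator` inside the sampling loop, so the noise for each prompt depends only on `args.seed`, not on device-specific RNG state (the A100/H100 mismatch noted in the comment) or on earlier sampling in the run. A minimal stdlib sketch of the pattern, with `random.Random` standing in for `torch.Generator` (the names here are illustrative):

```python
import random

def sample_noise(seed: int, n: int = 8) -> list[int]:
    # A fresh, explicitly seeded RNG per prompt: the draws depend only
    # on the seed, not on how much sampling happened earlier.
    gen = random.Random(seed)
    return [gen.randint(0, 9) for _ in range(n)]

first = sample_noise(1024)
random.random()            # unrelated global RNG activity in between
second = sample_noise(1024)
assert first == second     # identical draws for the same seed
```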
2 changes: 1 addition & 1 deletion scripts/inference/inference_hunyuan.sh
@@ -13,7 +13,7 @@ torchrun --nnodes=1 --nproc_per_node=$num_gpus --master_port 29503 \
--flow_shift 17 \
--flow-reverse \
--prompt ./assets/prompt.txt \
--seed 12345 \
--seed 1024 \
--output_path outputs_video/hunyuan/cfg6/ \
--model_path $MODEL_BASE \
--dit-weight ${MODEL_BASE}/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt
