mochi-xdit: Faster Parallel Inference of Mochi-preview Video Generation Model with xDiT

This repository provides an accelerated way to deploy the video generation model Mochi 1 using Unified Sequence Parallelism (USP) provided by xDiT.

Mochi-1 originally ran on 4x H100 (80GB) GPUs; we made it run on a single L40 (48GB) GPU with no accuracy loss.

Moreover, by applying xDiT, we successfully reduced the latency of generating a 49-frame 848x480 resolution video from 398 seconds (6 minutes 38 seconds) to 74 seconds (1 minute 14 seconds) on 6xL40 GPUs.

Metric	1x L40 GPU	2x L40 GPU (uly=2)	2x L40 GPU (cfg=2)	6x L40 GPU (cfg=2, ring=3)
Performance	398.00s	216.50s (1.8x)	199.07s (2.0x)	74.06s (5.4x)
Memory	30.83 GB	35.05 GB	36.69 GB	30.94 GB
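The speedup factors in the table follow directly from the measured latencies. The sketch below recomputes them and additionally derives a parallel efficiency (speedup divided by GPU count), which the table does not report. The dictionary layout and the function name are illustrative only.

```python
# Latencies in seconds, copied from the table above.
# Keys are (number of L40 GPUs, configuration label).
latencies = {
    (1, "baseline"): 398.00,
    (2, "uly=2"): 216.50,
    (2, "cfg=2"): 199.07,
    (6, "cfg=2, ring=3"): 74.06,
}

baseline = latencies[(1, "baseline")]


def speedup(latency_s: float) -> float:
    """Speedup relative to the single-GPU run."""
    return baseline / latency_s


results = {}
for (gpus, label), latency in latencies.items():
    s = speedup(latency)
    # Store (rounded speedup, rounded efficiency) for later inspection.
    results[(gpus, label)] = (round(s, 1), round(s / gpus, 2))
    print(f"{gpus}x L40 ({label}): {s:.1f}x speedup, {s / gpus:.0%} efficiency")
```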
Preview

The prompt of the video is: "Witness a grand space battle between starships, with lasers cutting through the darkness of space and explosions illuminating the void".

Highlights

Memory optimization enables Mochi to generate video on a single 48GB L40 GPU with no accuracy loss.
A tiled VAE decoder enables correct video generation at any resolution and also reduces the memory footprint.
Unified Sequence Parallelism (USP) for AsymmetricAttention using xDiT: hybrid 2D sequence parallelism with Ring-Attention and DeepSpeed-Ulysses.
CFG parallelism from xDiT is applied to Mochi-1 in a simple way.
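CFG parallelism is possible because classifier-free guidance evaluates the model twice per denoising step, once with the prompt and once without, and the two evaluations do not depend on each other. The sketch below shows only the standard guidance combination on plain Python lists. `cfg_combine` and the sample values are illustrative and are not taken from this repository's code.

```python
def cfg_combine(cond, uncond, cfg_scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


# Hypothetical per-element model outputs. With CFG parallelism, each
# prediction could be computed on a different group of GPUs and only
# the combination step needs both results.
cond_pred = [1.0, 2.0, 3.0]
uncond_pred = [0.5, 1.0, 1.5]

guided = cfg_combine(cond_pred, uncond_pred, cfg_scale=2.0)
print(guided)
```

Because the two predictions are independent, splitting them across two GPU groups roughly halves the per-step work of each group, which is in line with the 2.0x figure reported for the cfg=2 configuration.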

Usage

This repository provides an accelerated inference version of Mochi 1 using Unified Sequence Parallelism provided by xDiT. The table below compares it with the original version.

Feature	xDiT Version	Original Version
Attention Parallel	Ulysses+Ring+CFG	Ulysses
VAE	Variable Size	Fixed Size
Model Loading	Replicated/FSDP	FSDP

1. Install from source

pip install xfuser
sudo apt install ffmpeg
pip install .
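After installing, a quick sanity check can confirm that the dependencies are discoverable. This is a minimal sketch using only the Python standard library. It assumes that the `xfuser` package is importable under the module name `xfuser` and that `ffmpeg` is expected on `PATH`; `check_install` is a hypothetical helper, not part of this repository.

```python
import importlib.util
import shutil


def check_install():
    """Report whether the pieces installed above can be found."""
    return {
        # Assumption: the xfuser package exposes a top-level module named "xfuser".
        "xfuser": importlib.util.find_spec("xfuser") is not None,
        # shutil.which returns None when the executable is not on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


status = check_install()
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```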

2. Install from docker

docker pull thufeifeibear/mochi-dev:0.1

3. Run

Running mochi with a single GPU

CUDA_VISIBLE_DEVICES=0 python3 ./demos/cli.py --model_dir "<path_to_downloaded_directory>" --prompt "prompt"

Running mochi with multiple GPUs using Unified Sequence Parallelism provided by xDiT.

world_size is the total number of GPUs used for video generation. It is controlled by the number of GPUs listed in CUDA_VISIBLE_DEVICES.
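As an illustration of how world_size relates to CUDA_VISIBLE_DEVICES, the sketch below counts the device IDs in the variable. `visible_gpu_count` is a hypothetical helper; how the repository itself derives world_size is not shown here.

```python
import os


def visible_gpu_count(value):
    """Count the device IDs in a CUDA_VISIBLE_DEVICES-style string."""
    return len([d for d in value.split(",") if d.strip()])


# Note: when the variable is unset, CUDA exposes every GPU on the machine.
# This sketch does not query the hardware and simply reports 0 in that case.
world_size = visible_gpu_count(os.environ.get("CUDA_VISIBLE_DEVICES", ""))
print(f"world_size from CUDA_VISIBLE_DEVICES: {world_size}")
```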

Adjust ulysses_degree, ring_degree, and the CFG parallel degree to achieve optimal performance. If cfg_parallel is enabled, ulysses_degree x ring_degree x 2 = world_size. Otherwise, ulysses_degree x ring_degree = world_size.

E.g.,

export CUDA_VISIBLE_DEVICES=0,1,2,3
python3 ./demos/cli.py --model_dir "<path_to_downloaded_directory>" --prompt "prompt" \
 --use_xdit --ulysses_degree 2 --ring_degree 2

or

export CUDA_VISIBLE_DEVICES=0,1,2,4,5,6
python3 ./demos/cli.py --model_dir "<path_to_downloaded_directory>" --prompt "prompt" \
 --use_xdit --ulysses_degree 3 --ring_degree 1 --cfg_parallel
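The two commands above use 4 GPUs with 2 x 2 sequence parallelism and 6 GPUs with 3 x 1 sequence parallelism plus CFG parallelism (a factor of 2). The following sketch encodes that relationship and enumerates the flag combinations that fill a given number of GPUs. Both function names are hypothetical, and any further constraints the repository may impose on the degrees are not modeled.

```python
def required_world_size(ulysses_degree, ring_degree, cfg_parallel):
    """GPUs needed for a given parallel configuration."""
    cfg_degree = 2 if cfg_parallel else 1
    return ulysses_degree * ring_degree * cfg_degree


def valid_configs(world_size):
    """List (ulysses_degree, ring_degree, cfg_parallel) tuples that use exactly world_size GPUs."""
    configs = []
    for cfg_parallel in (False, True):
        cfg_degree = 2 if cfg_parallel else 1
        if world_size % cfg_degree != 0:
            continue  # CFG parallelism needs an even number of GPUs
        sp_size = world_size // cfg_degree  # GPUs left for sequence parallelism
        for ulysses in range(1, sp_size + 1):
            if sp_size % ulysses == 0:
                configs.append((ulysses, sp_size // ulysses, cfg_parallel))
    return configs


for ulysses, ring, cfg in valid_configs(6):
    flags = f"--use_xdit --ulysses_degree {ulysses} --ring_degree {ring}"
    print(flags + (" --cfg_parallel" if cfg else ""))
```

Which of the valid combinations is fastest depends on the hardware interconnect, so it is worth timing several of them on the target machine.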

4. Performance

References

xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

@misc{fang2024xditinferenceenginediffusion,
      title={xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism}, 
      author={Jiarui Fang and Jinzhe Pan and Xibo Sun and Aoyu Li and Jiannan Wang},
      year={2024},
      eprint={2411.01738},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2411.01738}, 
}

USP: A Unified Sequence Parallelism Approach for Long Context Generative AI

@misc{fang2024uspunifiedsequenceparallelism,
      title={USP: A Unified Sequence Parallelism Approach for Long Context Generative AI}, 
      author={Jiarui Fang and Shangchun Zhao},
      year={2024},
      eprint={2405.07719},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2405.07719}, 
}
