|
## LAVITA: Latent Video Diffusion Models with Spatio-temporal Transformers<br><sub>Official PyTorch Implementation</sub>

### [Paper](https://maxin-cn.github.io/lavita_project/) | [Project Page](https://maxin-cn.github.io/lavita_project/)

This repo contains PyTorch model definitions, pre-trained weights, and training/sampling code for our paper exploring
latent video diffusion models with spatio-temporal transformers (LAVITA). You can find more visualizations on our [project page](https://maxin-cn.github.io/lavita_project/).

> [**LAVITA: Latent Video Diffusion Models with Spatio-temporal Transformers**](https://maxin-cn.github.io/lavita_project/)<br>
> [Xin Ma](https://maxin-cn.github.io/), [Yaohui Wang](https://wyhsirius.github.io/), [Xinyuan Chen](https://scholar.google.com/citations?user=3fWSC8YAAAAJ), [Yuan-Fang Li](https://users.monash.edu/~yli/), [Cunjian Chen](https://cunjian.github.io/), [Ziwei Liu](https://liuziwei7.github.io/), [Yu Qiao](https://scholar.google.com.hk/citations?user=gFtI-8QAAAAJ&hl=zh-CN)
> <br>Department of Data Science \& AI, Faculty of Information Technology, Monash University <br> Shanghai Artificial Intelligence Laboratory, S-Lab, Nanyang Technological University<br>

We propose a novel architecture, the latent video diffusion model with spatio-temporal transformers, referred to as LAVITA, which integrates the Transformer architecture into diffusion models for the first time within the realm of video generation. Conceptually, LAVITA models spatial and temporal information separately to accommodate their inherent disparities and to reduce computational complexity. Following this design strategy, we devise several Transformer-based model variants that integrate spatial and temporal information harmoniously. Moreover, we identify best practices in architectural choices and learning strategies for LAVITA through rigorous empirical analysis. Our comprehensive evaluation demonstrates that LAVITA achieves state-of-the-art performance across several standard video generation benchmarks, including FaceForensics, SkyTimelapse, UCF101, and Taichi-HD, outperforming the current best models.

![The architecture of LAVITA.]()
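
For readers who want a concrete picture of the factorized design described above, the snippet below sketches a spatio-temporal Transformer block that attends over patch tokens within each frame and then over frames at each patch location. It is an illustrative sketch of the general idea only, not the code in [`models/lavita.py`](models/lavita.py); the class name, tensor shapes, and layer choices are assumptions.

```python
# Illustrative sketch only -- not the actual models/lavita.py implementation.
# Tokens of shape (batch, frames, patches, dim) are attended first within each
# frame (spatial attention), then across frames at each patch location
# (temporal attention), followed by a shared feed-forward layer.
import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, P, D) -- batch, frames, patch tokens per frame, hidden dim
        B, T, P, D = x.shape

        # Spatial attention: attend among the P patch tokens of each frame.
        s = x.reshape(B * T, P, D)
        s_norm = self.norm1(s)
        s = s + self.spatial_attn(s_norm, s_norm, s_norm, need_weights=False)[0]

        # Temporal attention: attend across the T frames at each patch location.
        t = s.reshape(B, T, P, D).permute(0, 2, 1, 3).reshape(B * P, T, D)
        t_norm = self.norm2(t)
        t = t + self.temporal_attn(t_norm, t_norm, t_norm, need_weights=False)[0]

        # Feed-forward over every token, then restore (B, T, P, D).
        t = t + self.mlp(self.norm3(t))
        return t.reshape(B, P, T, D).permute(0, 2, 1, 3)
```

Stacking blocks like this over patch embeddings of the latent video frames, with timestep and class conditioning added, gives the overall flavor of the factorized architecture in the figure.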

This repository contains:

* 🪐 A simple PyTorch [implementation](models/lavita.py) of LAVITA
* ⚡️ Pre-trained LAVITA models trained on FaceForensics, SkyTimelapse, Taichi-HD, and UCF101 (256x256)
* 🛸 A LAVITA [training script](train.py) using PyTorch DDP

## Setup

First, download and set up the repo:

```bash
git clone https://github.com/maxin-cn/LAVITA.git
cd LAVITA
```

We provide an [`environment.yml`](environment.yml) file that can be used to create a Conda environment. If you only want
to run pre-trained models locally on CPU, you can remove the `cudatoolkit` and `pytorch-cuda` requirements from the file.

```bash
conda env create -f environment.yml
conda activate lavita
```

## Sampling

**Pre-trained LAVITA checkpoints.** You can sample from our pre-trained LAVITA models with [`sample.py`](sample/sample.py). Weights for our pre-trained LAVITA models can be found [here](https://huggingface.co/maxin-cn/LAVITA). The script has various arguments to adjust the number of sampling steps, change the classifier-free guidance scale, etc. For example, to sample from our model on FaceForensics, you can use:

```bash
bash sample/ffs.sh
```

or, if you want to sample hundreds of videos, you can use the following script with PyTorch DDP:

```bash
bash sample/ffs_ddp.sh
```
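
For intuition about the classifier-free guidance scale mentioned above: at each denoising step the conditional and unconditional noise predictions are blended using that scale. The snippet below is a minimal sketch of this standard combination, not the exact code in [`sample.py`](sample/sample.py); the function name and shapes are illustrative assumptions.

```python
# Standard classifier-free guidance blend (illustrative sketch; the function
# name and tensor shapes are assumptions, not the repo's actual sampling code).
import torch


def apply_cfg(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    """Blend conditional and unconditional noise predictions for one denoising step.

    A scale of 1.0 disables guidance; larger values push samples toward the
    class condition at the cost of sample diversity.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```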

## Training LAVITA

We provide a training script for LAVITA in [`train.py`](train.py). This script can be used to train class-conditional and unconditional LAVITA models. To launch LAVITA (256x256) training with `N` GPUs on the FaceForensics dataset:

```bash
torchrun --nnodes=1 --nproc_per_node=N train.py --config ./configs/ffs/ffs_train.yaml
```
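
If you are new to `torchrun`, a training script launched this way generally follows the standard PyTorch DDP pattern sketched below. This is a generic skeleton assuming the usual `DistributedDataParallel` setup (the README states training uses PyTorch DDP); the model, data, and loss are placeholders, not the repo's actual code, which takes its settings from the YAML config.

```python
# Generic PyTorch DDP skeleton of the kind launched by `torchrun` above.
# Illustrative only: the model, data, and loss are placeholders, and the real
# train.py reads its settings from the YAML config passed via --config.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and LOCAL_RANK per process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(16, 16).cuda(local_rank)   # placeholder for the video diffusion model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):                             # placeholder training loop
        x = torch.randn(8, 16, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()                  # placeholder for the diffusion loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```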

or, if you have a cluster that uses Slurm, you can also train LAVITA using the following script:

```bash
sbatch slurm_scripts/ffs.slurm
```

We also provide a video-image joint training script, [`train_with_img.py`](train_with_img.py). Similar to [`train.py`](train.py), this script can also be used to train class-conditional and unconditional LAVITA models. For example, if you want to train a LAVITA model on the FaceForensics dataset, you can use:

```bash
torchrun --nnodes=1 --nproc_per_node=N train_with_img.py --config ./configs/ffs/ffs_img_train.yaml
```
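
The core idea of video-image joint training is to let the same model learn from still images as well as video clips. One common recipe, sketched below as an assumption about the general approach rather than a description of what `train_with_img.py` actually does, is to append independently sampled images to each clip as extra frames and mark them so they can be excluded from temporal attention.

```python
# Minimal sketch of one common video-image joint-training batch layout.
# Illustrative only; check train_with_img.py for the repo's actual recipe.
# `videos` and `images` are assumed pre-batched tensors.
import torch


def build_joint_batch(videos: torch.Tensor, images: torch.Tensor):
    """videos: (B, T, C, H, W) clips; images: (B, M, C, H, W) independently sampled stills."""
    x = torch.cat([videos, images], dim=1)  # (B, T + M, C, H, W): images appended as extra "frames"
    T, M = videos.shape[1], images.shape[1]
    # Boolean mask marking real video frames, so temporal attention can skip the appended images.
    temporal_mask = torch.cat([torch.ones(T, dtype=torch.bool),
                               torch.zeros(M, dtype=torch.bool)])
    return x, temporal_mask
```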

<!-- ## BibTeX

```bibtex
@article{Peebles2022DiT,
  title={Scalable Diffusion Models with Transformers},
  author={William Peebles and Saining Xie},
  year={2022},
  journal={arXiv preprint arXiv:2212.09748},
}
``` -->

## Acknowledgments
Video generation models are improving quickly, and the development of LAVITA has been greatly inspired by the following amazing works and teams: [DiT](https://github.com/facebookresearch/DiT), [U-ViT](https://github.com/baofff/U-ViT), and [Tune-A-Video](https://github.com/showlab/Tune-A-Video).

## License
The code and model weights are licensed under [CC-BY-NC](license_for_usage.txt).