This repo is based on MagicDrive3D for 3D scene generation, extended to multi-view video generation with first-person-view (FPV) depth maps and semantic maps as control inputs.
We have made substantial changes to the original architecture in order to generate consistent multi-view videos from the new input conditions, including:
- A new nuScenes data-processing pipeline that generates a depth map with Depth-Anything-V2 and an FPV semantic map with the nuScenes map API for each camera and each frame of a scene. It then iterates over the available scenes and collects the relevant metadata into a CSV file for further processing (see the depth/CSV sketch after this list).
- A new DatasetFromCSV class that loads scene data (RGB, depth, semantic map, text description, etc.) into PyTorch tensors (sketched below).
- A newly designed fpv_runner that wraps a MultiControlNet and the Stable Diffusion UNet into a single model and feeds it 6D data of shape (b, c, f, n, h, w) for diffusion training, where f is the number of frames and n the number of camera views (see the reshaping sketch below).
- Model architecture changes: the ControlNet in the original MagicDrive3D is replaced by a MultiControlNet (as defined in the diffusers library) that accepts depth and semantic_map as conditioning inputs. The overall structure of the UNet is unchanged, but every BasicTransformerBlock is replaced by a BasicMultiviewVideoTransformerBlock (sketched below).
- The BasicMultiviewVideoTransformerBlock adds three types of attention: (1) SparseCausalAttention, where frame i attends to frame 0 and frame i-1; (2) cross-view attention between neighboring camera views; (3) temporal attention along the frame axis of the video data. A sketch of the sparse-causal key/value gathering follows this list.
- A new pipeline for video generation with depth and map inputs, as detailed in pipeline_fpv_controlnet.
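
The snippets below are minimal sketches of the pieces listed above, not the repo's actual code; file paths, column names, and model ids are assumptions for illustration.

First, the per-frame depth extraction and CSV collection. This sketch assumes the Hugging Face transformers depth-estimation pipeline with a hosted Depth-Anything-V2 checkpoint; the repo may invoke the model differently, and the CSV schema shown here is illustrative only.

```python
import csv
from pathlib import Path

from PIL import Image
from transformers import pipeline  # generic depth-estimation pipeline

# Assumption: the small Depth-Anything-V2 checkpoint hosted on the Hugging Face hub.
depth_estimator = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",
)

def save_depth_map(rgb_path: Path, out_dir: Path) -> Path:
    """Run Depth-Anything-V2 on one camera frame and save the depth map."""
    depth = depth_estimator(Image.open(rgb_path))["depth"]  # PIL image
    out_path = out_dir / f"{rgb_path.stem}_depth.png"
    depth.save(out_path)
    return out_path

def collect_scene_csv(samples: list[dict], csv_path: Path) -> None:
    """Write one row per (scene, frame, camera) for downstream data loading.

    Each `sample` dict is assumed to carry file paths and the scene description;
    the column names are placeholders, not the repo's actual schema.
    """
    fields = ["scene", "frame", "camera", "rgb", "depth", "semantic", "description"]
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(samples)
```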
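Next, a simplified single-frame version of what DatasetFromCSV does, assuming the CSV schema from the sketch above; the real class additionally stacks frames and camera views into video clips and handles captions/augmentation.

```python
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class DatasetFromCSV(Dataset):
    """Load per-frame RGB / depth / semantic-map tensors plus the text prompt."""

    def __init__(self, csv_path: str, image_size: int = 256):
        self.df = pd.read_csv(csv_path)
        self.to_tensor = transforms.Compose([
            transforms.Resize((image_size, image_size)),
            transforms.ToTensor(),
        ])

    def __len__(self) -> int:
        return len(self.df)

    def __getitem__(self, idx: int) -> dict:
        row = self.df.iloc[idx]
        return {
            "rgb": self.to_tensor(Image.open(row["rgb"]).convert("RGB")),
            "depth": self.to_tensor(Image.open(row["depth"]).convert("L")),
            "semantic": self.to_tensor(Image.open(row["semantic"]).convert("RGB")),
            "text": row["description"],
        }
```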
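In the fpv_runner, the 6D batches are flattened so the frame and view axes fold into the batch dimension before reaching the 2D UNet/ControlNet, and the multi-view/temporal attention blocks unfold them internally. A sketch of that bookkeeping (the actual runner also handles noise scheduling, conditioning, and the loss):

```python
import torch
from einops import rearrange

def fold_views_and_frames(latents: torch.Tensor) -> torch.Tensor:
    """(b, c, f, n, h, w) -> (b*f*n, c, h, w) so 2D layers apply per image."""
    return rearrange(latents, "b c f n h w -> (b f n) c h w")

def unfold_views_and_frames(x: torch.Tensor, f: int, n: int) -> torch.Tensor:
    """Inverse of fold_views_and_frames, recovering the frame and view axes."""
    return rearrange(x, "(b f n) c h w -> b c f n h w", f=f, n=n)

# Example shapes: 1 scene, 4 latent channels, 6 frames, 6 cameras, 32x32 latents.
latents = torch.randn(1, 4, 6, 6, 32, 32)
flat = fold_views_and_frames(latents)          # (36, 4, 32, 32)
restored = unfold_views_and_frames(flat, 6, 6)
assert restored.shape == latents.shape
```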
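The MultiControlNet is built with one ControlNet branch per condition (depth and semantic map). The import path for MultiControlNetModel varies across diffusers versions, and the pretrained checkpoint id below is an assumption, not necessarily what this repo trains from.

```python
from diffusers import ControlNetModel, UNet2DConditionModel
from diffusers.pipelines.controlnet import MultiControlNetModel  # path may differ by diffusers version

# Assumption: base weights from a public SD 1.5 checkpoint.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# One ControlNet branch per condition, both initialized from the UNet encoder.
controlnet_depth = ControlNetModel.from_unet(unet)
controlnet_semantic = ControlNetModel.from_unet(unet)
multi_controlnet = MultiControlNetModel([controlnet_depth, controlnet_semantic])

# At each denoising step the residuals from both branches are summed and injected
# into the UNet; the conditions are passed as a list in the same order, e.g.:
# down_res, mid_res = multi_controlnet(
#     latents, t, encoder_hidden_states=text_emb,
#     controlnet_cond=[depth_cond, semantic_cond],
#     conditioning_scale=[1.0, 1.0],
# )
```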
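Finally, the sparse-causal scheme lets each frame i attend only to frame 0 and frame i-1 instead of the full clip. A minimal sketch of the key/value gathering, assuming hidden states laid out as (b*f, seq, dim) with f frames per clip (function and variable names are illustrative):

```python
import torch

def sparse_causal_kv(hidden_states: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Build the key/value context for sparse-causal attention.

    hidden_states: (b * num_frames, seq, dim), frames contiguous per batch item.
    Returns (b * num_frames, 2 * seq, dim): features of frame 0 concatenated
    with features of frame i-1 (frame 0 uses itself twice).
    """
    bf, seq, dim = hidden_states.shape
    b = bf // num_frames
    x = hidden_states.view(b, num_frames, seq, dim)

    first = x[:, [0] * num_frames]                        # frame 0 for every i
    prev_idx = [max(i - 1, 0) for i in range(num_frames)]
    prev = x[:, prev_idx]                                 # frame i-1 (frame 0 -> itself)

    kv = torch.cat([first, prev], dim=2)                  # (b, f, 2*seq, dim)
    return kv.view(b * num_frames, 2 * seq, dim)

# Queries still come from frame i itself; keys/values come from this context.
ctx = sparse_causal_kv(torch.randn(2 * 6, 1024, 320), num_frames=6)
assert ctx.shape == (12, 2048, 320)
```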
The default branch is now alanxu/fpv, which contains all of these changes; the main branch is an unmodified copy of the MagicDrive3D repo.
Here we show some preliminary results after training on 142 scenes (6 key frames per scene) with 2 A800-80G GPUs for about 8 hours. The scene descriptions are from nuScenes. Each image below is divided into four quadrants:
- upper-left for original frames
- lower-left for generated frames
- upper-right for depth frames
- lower-right for semantic map frames
These results show that our model generates consistent multi-view video frames.
A driving scene image at boston-seaport. Wait at intersection, truck, peds, turn right, parking lot.
A driving scene image at singapore-onenorth. Intersection, peds, waiting vehicle, parked motorcycle at parking lot.
A driving scene image at singapore-onenorth. Many peds, parked buses, parked cars.
A driving scene image at singapore-onenorth. Turn left, motorcycle driving, ped sitting.