init commit

arkboy1224 · Oct 3, 2023 · commit c6a4d39
Showing 29 changed files with 4,750 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,155 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+*/__pycache__/
+
+# Dataset-related files and pre-trained models
+vae_models/vqgan
+vae_models/*.gz
+vae_models/*.pt
+vae_models/*vqgan
+*.pt
+*.pth
+
+# log files
+log/*.log
+out*
+test_results
+err*
+
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+*.zip
+*.pkl
+*.csv
+*.ckpt
+*.parquet
+
+*.whl
+*.th
+*.onnx
diff --git a/LICENSE-CODE b/LICENSE-CODE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 ByteDance
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,87 @@
+# MVDream
+Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, Xiao Yang
+
+| [Project Page](https://mv-dream.github.io/) | [3D Generation](https://github.com/bytedance/MVDream-threestudio) | [Paper](https://arxiv.org/abs/2308.16512) | [HuggingFace Demo (Coming)]() |
+
+
+- **This repo includes the diffusion model and 2D image generation code of the [MVDream](https://mv-dream.github.io/index.html) paper.**
+- **For 3D Generation, please check [MVDream-threestudio](https://github.com/bytedance/MVDream-threestudio).**
+
+## Requirements
+You can use the same environment as [Stable-Diffusion](https://github.com/Stability-AI/stablediffusion) for this repo. Alternatively, set up the environment by installing the given requirements:
+
+``` bash
+pip3 install -r requirements.txt
+```
+
+## Model Download
+We currently provide two checkpoints: one fine-tuned from SD 1.5 and one from the SD 2.1 base (512x512) model.
+| Model      | Base Model | Resolution |
+| ----------- | ----------- | ----------- |
+| sd-v2.1-base-4view   | [Stable Diffusion 2.1 Base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base) | 4x256x256 |
+| sd-v1.5-4view        | [Stable Diffusion 1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5)             | 4x256x256 |
+
+By default, we use the SD-2.1-base model in our experiments.
+
+
+## Text-to-Image
+
+You can simply generate multi-view images by running the following command:
+
+``` bash
+python3 scripts/t2i.py --text "an astronaut riding a horse"
+```
+We also provide a Gradio script for trying the model in a GUI:
+
+``` bash
+python3 scripts/gradio_app.py
+```
+
+## Usage
+#### Load the Model
+We provide two ways to load the models of MVDream:
+- **Automatic**: look up the model config by model name and load the weights from Hugging Face.
+``` python
+from mvdream.model_zoo import build_model
+model = build_model("sd-v2.1-base-4view")
+```
+- **Manual**: load the model with a config file and a checkpoint file.
+``` python
+import torch
+from omegaconf import OmegaConf
+from mvdream.ldm.util import instantiate_from_config
+config = OmegaConf.load("mvdream/configs/sd-v2-base.yaml")
+model = instantiate_from_config(config.model)
+model.load_state_dict(torch.load("path/to/sd-v2.1-base-4view.th", map_location='cpu'))
+```
+
+#### Inference
+Here is a simple example of model inference, assuming `model` was loaded as shown above:
+``` python
+import torch
+from mvdream.camera_utils import get_camera
+model.eval()
+model.cuda()
+with torch.no_grad():
+    noise = torch.randn(4,4,32,32, device="cuda") # batch of 4 (one per view), 4 latent channels, latent size 32=256/8
+    t = torch.tensor([999]*4, dtype=torch.long, device="cuda") # last timestep index (of 1000) for each view
+    cond = {
+        "context": model.get_learned_conditioning([""]*4).cuda(), # text embeddings
+        "camera": get_camera(4).cuda(), # flattened 4x4 camera matrices, shape (4, 16)
+        "num_frames": 4,
+    }
+    eps = model.apply_model(noise, t, cond=cond)
+```
+
+## Acknowledgement
+This repository is heavily based on [Stable Diffusion](https://huggingface.co/stabilityai/stable-diffusion-2-1-base). We would like to thank the authors of this work for publicly releasing their code.
+
+## Citation
+``` bibtex
+@article{shi2023MVDream,
+  author = {Shi, Yichun and Wang, Peng and Ye, Jianglong and Mai, Long and Li, Kejie and Yang, Xiao},
+  title = {MVDream: Multi-view Diffusion for 3D Generation},
+  journal = {arXiv:2308.16512},
+  year = {2023},
+}
+```
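Review note on the README's inference example: the tensor shapes it uses follow from a few numbers stated in this commit (4 views, 4 latent channels, a 256-pixel image with the 8x reduction noted in the README comment, and 16-element flattened camera matrices). The sketch below only restates that arithmetic. `expected_shapes` is a hypothetical helper, not part of the repository, and its defaults are assumptions copied from the README and config values.

```python
def expected_shapes(num_views=4, image_size=256, vae_downsample=8,
                    latent_channels=4, camera_dim=16):
    """Shapes the README inference example passes to the model (illustrative only)."""
    latent_size = image_size // vae_downsample  # 256 / 8 = 32
    return {
        # one latent per view, stacked along the batch dimension
        "noise": (num_views, latent_channels, latent_size, latent_size),
        # one timestep index per view
        "timesteps": (num_views,),
        # one flattened 4x4 camera matrix per view
        "camera": (num_views, camera_dim),
    }


for name, shape in expected_shapes().items():
    print(name, shape)
```

Changing `num_views` or `image_size` here shows which dimensions would have to change together if a different view count or resolution were tried.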
diff --git a/mvdream/__init__.py b/mvdream/__init__.py
@@ -0,0 +1 @@
+from .model_zoo import build_model
diff --git a/mvdream/camera_utils.py b/mvdream/camera_utils.py
@@ -0,0 +1,68 @@
+import numpy as np
+import torch
+
+
+def create_camera_to_world_matrix(elevation, azimuth):
+    elevation = np.radians(elevation)
+    azimuth = np.radians(azimuth)
+    # Convert elevation and azimuth angles to Cartesian coordinates on a unit sphere
+    x = np.cos(elevation) * np.sin(azimuth)
+    y = np.sin(elevation)
+    z = np.cos(elevation) * np.cos(azimuth)
+
+    # Calculate camera position, target, and up vectors
+    camera_pos = np.array([x, y, z])
+    target = np.array([0, 0, 0])
+    up = np.array([0, 1, 0])
+
+    # Construct view matrix
+    forward = target - camera_pos
+    forward /= np.linalg.norm(forward)
+    right = np.cross(forward, up)
+    right /= np.linalg.norm(right)
+    new_up = np.cross(right, forward)
+    new_up /= np.linalg.norm(new_up)
+    cam2world = np.eye(4)
+    cam2world[:3, :3] = np.array([right, new_up, -forward]).T
+    cam2world[:3, 3] = camera_pos
+    return cam2world
+
+
+def convert_opengl_to_blender(camera_matrix):
+    if isinstance(camera_matrix, np.ndarray):
+        # Construct transformation matrix to convert from OpenGL space to Blender space
+        flip_yz = np.array([[1, 0, 0, 0], [0, 0, -1, 0], [0, 1, 0, 0], [0, 0, 0, 1]])
+        camera_matrix_blender = np.dot(flip_yz, camera_matrix)
+    else:
+        # Construct transformation matrix to convert from OpenGL space to Blender space
+        flip_yz = torch.tensor([[1, 0, 0, 0], [0, 0, -1, 0], [0, 1, 0, 0], [0, 0, 0, 1]])
+        if camera_matrix.ndim == 3:
+            flip_yz = flip_yz.unsqueeze(0)
+        camera_matrix_blender = torch.matmul(flip_yz.to(camera_matrix), camera_matrix)
+    return camera_matrix_blender
+
+
+def normalize_camera(camera_matrix):
+    '''Normalize each camera location onto the unit sphere; returns the matrices flattened to shape (N, 16).'''
+    if isinstance(camera_matrix, np.ndarray):
+        camera_matrix = camera_matrix.reshape(-1,4,4)
+        translation = camera_matrix[:,:3,3]
+        translation = translation / (np.linalg.norm(translation, axis=1, keepdims=True) + 1e-8)
+        camera_matrix[:,:3,3] = translation
+    else:
+        camera_matrix = camera_matrix.reshape(-1,4,4)
+        translation = camera_matrix[:,:3,3]
+        translation = translation / (torch.norm(translation, dim=1, keepdim=True) + 1e-8)
+        camera_matrix[:,:3,3] = translation
+    return camera_matrix.reshape(-1,16)
+
+
+def get_camera(num_frames, elevation=15, azimuth_start=0, azimuth_span=360, blender_coord=True):
+    # endpoint=False yields exactly num_frames evenly spaced azimuths (np.arange with a float step may not)
+    cameras = []
+    for azimuth in np.linspace(azimuth_start, azimuth_start + azimuth_span, num_frames, endpoint=False):
+        camera_matrix = create_camera_to_world_matrix(elevation, azimuth)
+        if blender_coord:
+            camera_matrix = convert_opengl_to_blender(camera_matrix)
+        cameras.append(camera_matrix.flatten())
+    return torch.tensor(np.stack(cameras, 0)).float()
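Review note on `mvdream/camera_utils.py`: the look-at construction in `create_camera_to_world_matrix` can be sanity-checked without PyTorch. The NumPy-only sketch below re-derives the same matrix and verifies four properties one would expect from it: an orthonormal rotation, a right-handed basis, a camera on the unit sphere, and a viewing direction (local -z) that points at the origin. `look_at_cam2world` and `check_camera` are hypothetical names introduced here for illustration; they are not functions in the repository.

```python
import numpy as np


def look_at_cam2world(elevation_deg, azimuth_deg):
    """Camera-to-world matrix for a unit-sphere camera looking at the origin (y-up)."""
    elev, azim = np.radians(elevation_deg), np.radians(azimuth_deg)
    pos = np.array([np.cos(elev) * np.sin(azim),
                    np.sin(elev),
                    np.cos(elev) * np.cos(azim)])
    forward = -pos / np.linalg.norm(pos)          # from the camera toward the origin
    right = np.cross(forward, np.array([0.0, 1.0, 0.0]))
    right /= np.linalg.norm(right)                # undefined at elevation = +/-90 degrees
    up = np.cross(right, forward)
    mat = np.eye(4)
    mat[:3, :3] = np.stack([right, up, -forward], axis=1)  # basis vectors as columns
    mat[:3, 3] = pos
    return mat


def check_camera(mat, atol=1e-8):
    """Return True if mat is a valid look-at pose on the unit sphere."""
    rot, pos = mat[:3, :3], mat[:3, 3]
    orthonormal = np.allclose(rot.T @ rot, np.eye(3), atol=atol)
    right_handed = np.isclose(np.linalg.det(rot), 1.0, atol=atol)
    on_sphere = np.isclose(np.linalg.norm(pos), 1.0, atol=atol)
    looks_at_origin = np.allclose(-rot[:, 2], -pos, atol=atol)
    return bool(orthonormal and right_handed and on_sphere and looks_at_origin)


# Four views at 15 degrees elevation, evenly spaced in azimuth.
azimuths = np.linspace(0, 360, 4, endpoint=False)
print([check_camera(look_at_cam2world(15, az)) for az in azimuths])
```

The check deliberately skips the OpenGL-to-Blender axis swap; that step is a fixed rotation applied afterward, so it would not change orthonormality or the distance from the origin.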
diff --git a/mvdream/configs/sd-v1.yaml b/mvdream/configs/sd-v1.yaml
@@ -0,0 +1,52 @@
+model:
+  target: mvdream.ldm.interface.LatentDiffusionInterface
+  params:
+    linear_start: 0.00085
+    linear_end: 0.0120
+    timesteps: 1000
+    scale_factor: 0.18215
+    parameterization: "eps"
+
+    unet_config:
+      target: mvdream.ldm.modules.diffusionmodules.openaimodel.MultiViewUNetModel
+      params:
+        image_size: 32 # unused
+        in_channels: 4
+        out_channels: 4
+        model_channels: 320
+        attention_resolutions: [ 4, 2, 1 ]
+        num_res_blocks: 2
+        channel_mult: [ 1, 2, 4, 4 ]
+        num_heads: 8
+        use_spatial_transformer: True
+        transformer_depth: 1
+        context_dim: 768
+        use_checkpoint: False
+        legacy: False
+        camera_dim: 16
+
+    first_stage_config:
+      target: mvdream.ldm.models.autoencoder.AutoencoderKL
+      params:
+        embed_dim: 4
+        monitor: val/rec_loss
+        ddconfig:
+          double_z: true
+          z_channels: 4
+          resolution: 256
+          in_channels: 3
+          out_ch: 3
+          ch: 128
+          ch_mult:
+          - 1
+          - 2
+          - 4
+          - 4
+          num_res_blocks: 2
+          attn_resolutions: []
+          dropout: 0.0
+        lossconfig:
+          target: torch.nn.Identity
+
+    cond_stage_config:
+      target: mvdream.ldm.modules.encoders.modules.FrozenCLIPEmbedder
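Review note on `mvdream/configs/sd-v1.yaml`: the `linear_start`, `linear_end`, and `timesteps` values define the diffusion noise schedule. The code that consumes them is not shown in this part of the commit, so the sketch below is an assumption: it follows the upstream Stable Diffusion convention, in which the "linear" schedule is linear in the square root of beta. `sqrt_linear_betas` is a hypothetical name used only for illustration.

```python
import numpy as np


def sqrt_linear_betas(linear_start=0.00085, linear_end=0.0120, timesteps=1000):
    """Betas spaced linearly in sqrt(beta), then squared (assumed upstream convention)."""
    return np.linspace(linear_start ** 0.5, linear_end ** 0.5, timesteps,
                       dtype=np.float64) ** 2


betas = sqrt_linear_betas()
# Cumulative product of (1 - beta): the fraction of signal variance left at each step.
alphas_cumprod = np.cumprod(1.0 - betas)

print("first beta:", betas[0])
print("last beta:", betas[-1])
print("signal fraction at t=999:", alphas_cumprod[-1])
```

Under this assumption, the timestep `t = 999` used in the README inference example corresponds to the end of the schedule, where very little of the original signal remains.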