3D-LLM: Injecting the 3D World into Large Language Models

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan

Preliminary Code.

Data

All data will be released gradually on Google Drive.

Pretraining Data

We are still cleaning the grounding part. All other pre-training data are released.

Object Data

Language annotations of the object data are released here.

To download the Objaverse data, please refer to the Objaverse website.

To get 3D features and point clouds of the Objaverse data, please refer to Step1 and Step3 of 3DLanguage Data Generation - ChatCaptioner based.

TODO: We will also release a small set (or probably the whole set) of Objaverse 3D features

Scene Data

Language data are released here.

3D features and point clouds (~250G) are released here. If you want to generate the features yourself, please refer to the Three-step 3D Feature Extraction (Scene) part below.

Finetuning Data

TODO.

3DLanguage Data Generation

ChatCaptioner based / Three-step 3D Feature Extraction (Objaverse)

Step1: render images from different views of a scene

Follow the instructions in 3DLanguage_data/ChatCaptioner_based/objaverse_render/README.md for installation.

The following commands render images of an Objaverse scene (e.g., f6e9ec5953854dff94176c36b877c519). The rendered images are saved to 3DLanguage_data/ChatCaptioner_based/objaverse_render/output. (Please refer to 3DLanguage_data/ChatCaptioner_based/objaverse_render/README.md for more details about the command.)

$ cd ./3DLanguage_data/ChatCaptioner_based/objaverse_render

$ {path/to/blender} -b -P render.py -noaudio --disable-crash-handler -- --uid f6e9ec5953854dff94176c36b877c519
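To render several objects in one go, the command above can be wrapped in a small Python driver. The following is a minimal sketch and not part of the repository: `build_render_command` and `render_all` are hypothetical helper names, and the Blender path is a placeholder you must supply. It reuses only the flags shown in the command above.

```python
import subprocess

# Directory that contains render.py, as used in the `cd` command above.
RENDER_DIR = "./3DLanguage_data/ChatCaptioner_based/objaverse_render"


def build_render_command(blender_path, uid):
    """Return the argument list for rendering one Objaverse uid.

    The flags are copied from the command above; nothing else is assumed
    about render.py.
    """
    return [
        blender_path, "-b", "-P", "render.py",
        "-noaudio", "--disable-crash-handler",
        "--", "--uid", uid,
    ]


def render_all(blender_path, uids, cwd=RENDER_DIR):
    """Render each uid in turn and return the uids whose render failed."""
    failed = []
    for uid in uids:
        result = subprocess.run(build_render_command(blender_path, uid), cwd=cwd)
        if result.returncode != 0:
            failed.append(uid)
    return failed
```

A call such as `render_all("/path/to/blender", ["f6e9ec5953854dff94176c36b877c519"])` would run the same command as above once per uid and report which ones exited with a non-zero status.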

Step2: generate caption for this objaverse scene

Installation:

Please follow ChatCaptioner to install the environment.

The following commands read the rendered images of an Objaverse scene (e.g., f6e9ec5953854dff94176c36b877c519) and generate a scene caption at 3DLanguage_data/ChatCaptioner_based/output.

$ cd ./3DLanguage_data/ChatCaptioner_based

$ python chatcaption.py --specific_scene f6e9ec5953854dff94176c36b877c519

Step3: 3D feature construction from rendered images

Follow the instructions in 3DLanguage_data/ChatCaptioner_based/gen_features/README.md to extract 3D features from the rendered images.

$ cd ./3DLanguage_data/ChatCaptioner_based/gen_features

Box-Demonstration-Instruction based

TODO

Revision based

TODO

Three-step 3D Feature Extraction (Scene)

This section describes how to construct 3D features for scene data. If you have already downloaded our released scene data, please skip this section.

First step

Installation:

If extracting the masks with Mask2Former, please follow Mask2Former to install the environment and download the pretrained weights to the current directory.

If extracting the masks with SAM, please follow Segment Anything to install the environment and download the pretrained weights to the current directory.

Extract masks with Mask2Former:

$ cd ./three_steps_3d_feature/first_step

$ python maskformer_mask.py --scene_dir_path DATA_DIR_WITH_RGB_IMAGES --save_dir_path DIR_YOU_WANT_TO_SAVE_THE_MASKS

Extract masks with Segment Anything:

$ cd ./three_steps_3d_feature/first_step

$ python sam_mask.py --scene_dir_path DATA_DIR_WITH_RGB_IMAGES --save_dir_path DIR_YOU_WANT_TO_SAVE_THE_MASKS

After the first step, you should have a directory of masks (specified by --save_dir_path) that contains the extracted masks for the multi-view images of the scenes.

Second step

Note: BLIP features are used with LAVIS (BLIP2); CLIP features are used with open-flamingo.

Installation: Install salesforce-lavis as described in the 3D-LLM_BLIP2-based section below.

There are four options: (1) extract CLIP features with Mask2Former masks; (2) extract CLIP features with SAM masks; (3) extract BLIP features with Mask2Former masks; (4) extract BLIP features with SAM masks.

Extract 2D CLIP features with Mask2Former masks:

$ cd ./three_steps_3d_feature/second_step/

$ python clip_maskformer.py --scene_dir_path DATA_DIR_WITH_RGB_IMAGES --mask_dir_path MASK_DIR_FROM_1ST_STEP --save_dir_path DIR_YOU_WANT_TO_SAVE_THE_FEAT

For the other options, the scripts take arguments in a similar format.

After the second step, you should have a directory of features (specified by --save_dir_path) that contains the 2D features for the multi-view images of the scenes.

Third step

Direct Reconstruction

Installation:

Please install the Habitat environment.

Reconstruct 3D features from multi-view 2D features:

$ cd ./three_steps_3d_feature/third_step/

$ python sam_mask.py --data_dir_path DATA_DIR_WITH_RGB_IMAGES --depth_dir_path DATA_DIR_WITH_DEPTH_IMAGES --feat_dir_path FEATURE_DIR_FROM_2ND_STEP

After the third step, you should have two files (pcd_pos.pt and pcd_feat.pt) for each room inside the corresponding RGB directory. pcd_pos.pt contains the point positions of the 3D point cloud (shape: N × 3). pcd_feat.pt contains the point features of the 3D point cloud (shape: N × n_dim). N is the number of sampled points in the point cloud (default: 300000) and n_dim is the feature dimension (1024 for CLIP features, 1408 for BLIP features).
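A quick sanity check on the two output files can catch a mismatched feature type early. The sketch below is illustrative only: `validate_shapes` and `load_room` are hypothetical helpers, and it assumes the `.pt` files hold plain tensors saved with `torch.save`, which the text above does not state explicitly. The expected dimensions are taken from the description above.

```python
from pathlib import Path

# Feature dimensions as stated above: 1024 for CLIP, 1408 for BLIP.
EXPECTED_FEAT_DIMS = {"clip": 1024, "blip": 1408}


def validate_shapes(pos_shape, feat_shape, feature_type):
    """Check an (N, 3) position shape against an (N, n_dim) feature shape.

    Returns N on success and raises ValueError otherwise.
    """
    n_dim = EXPECTED_FEAT_DIMS[feature_type]
    if len(pos_shape) != 2 or pos_shape[1] != 3:
        raise ValueError(f"pcd_pos.pt should be (N, 3), got {pos_shape}")
    if len(feat_shape) != 2 or feat_shape[1] != n_dim:
        raise ValueError(f"pcd_feat.pt should be (N, {n_dim}), got {feat_shape}")
    if pos_shape[0] != feat_shape[0]:
        raise ValueError("point count differs between positions and features")
    return pos_shape[0]


def load_room(room_dir, feature_type="blip"):
    """Load both files for one room and validate their shapes.

    Assumption: each file contains a single tensor.
    """
    import torch  # imported lazily so validate_shapes works without PyTorch

    room_dir = Path(room_dir)
    pos = torch.load(room_dir / "pcd_pos.pt", map_location="cpu")
    feat = torch.load(room_dir / "pcd_feat.pt", map_location="cpu")
    validate_shapes(tuple(pos.shape), tuple(feat.shape), feature_type)
    return pos, feat
```

With the default sampling, `validate_shapes((300000, 3), (300000, 1408), "blip")` returns 300000, while passing `"clip"` for the same shapes raises a `ValueError`.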

GradSLAM (Feature Fusion)

Refer to Concept Fusion.

We will also release our reproduced version of Concept Fusion for our feature generation (we reproduced the paper before their official release).

Neural Field

Please refer to the 3D-CLR repository.

3D-LLM_BLIP2-based

Installation

Install salesforce-lavis

$ conda create -n lavis python=3.8
$ conda activate lavis

$ git clone https://github.com/salesforce/LAVIS.git SalesForce-LAVIS
$ cd SalesForce-LAVIS
$ pip install -e .

$ pip install positional_encodings

Training

$ cd 3DLLM_BLIP2-base

$ conda activate lavis
# use facebook/opt-2.7b:
$ python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip2/train/3dvqa_ft.yaml
# use flant5
$ python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip2/train/3dvqa_flant5_ft.yaml
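The two launch commands differ only in the config file, and `--nproc_per_node` sets the number of processes (typically one per GPU). The sketch below shows one way to build the command for a different GPU count. `build_train_command` and the short backbone keys are hypothetical names introduced here; the source does not say whether the configs work unchanged with fewer than 8 GPUs (batch size or learning rate may need adjusting).

```python
import sys

# Config paths copied from the commands above; the dictionary keys are
# illustrative labels, not names used by the repository.
CONFIGS = {
    "opt": "lavis/projects/blip2/train/3dvqa_ft.yaml",
    "flant5": "lavis/projects/blip2/train/3dvqa_flant5_ft.yaml",
}


def build_train_command(backbone, num_gpus=8):
    """Return the argument list for launching train.py on `num_gpus` processes."""
    if backbone not in CONFIGS:
        raise ValueError(f"unknown backbone {backbone!r}; expected one of {sorted(CONFIGS)}")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        sys.executable, "-m", "torch.distributed.run",
        f"--nproc_per_node={num_gpus}",
        "train.py", "--cfg-path", CONFIGS[backbone],
    ]
```

The resulting list can be passed to `subprocess.run(..., cwd="3DLLM_BLIP2-base")` from an activated `lavis` environment.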

3D-LLM_flamingo-based

TODO.

Citation

If you find our work useful, please consider citing:

@article{3dllm,
 author = {Hong, Yining and Zhen, Haoyu and Chen, Peihao and Zheng, Shuhong and Du, Yilun and Chen, Zhenfang and Gan, Chuang},
 title = {3D-LLM: Injecting the 3D World into Large Language Models},
 journal = {arXiv},
 year = {2023},
}

Acknowledgements

https://github.com/salesforce/LAVIS

https://github.com/facebookresearch/Mask2Former

https://github.com/facebookresearch/segment-anything

https://github.com/mlfoundations/open_flamingo

https://github.com/concept-fusion/concept-fusion

https://github.com/evelinehong/3D-CLR-Official
