
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

Zuyan Liu*,1,2   Yuhao Dong*,2,3   Ziwei Liu3   Winston Hu2   Jiwen Lu1,✉   Yongming Rao2,1,✉

1Tsinghua University   2Tencent  3S-Lab, NTU 

* Equal Contribution  ✉ Corresponding Author


Project Page: oryx-project-page

arXiv Paper: arXiv:2409.12961

Model Checkpoints: oryx-checkpoints

Oryx SFT Data: Collected from open-source datasets; prepared data coming soon

📢 News

  • [20/09/2024] 🔥 🚀 Introducing Oryx! The Oryx models (7B/34B) support on-demand visual perception and achieve new state-of-the-art performance across image, video and 3D benchmarks, even surpassing advanced commercial models on some of them.
    • [Paper]: Detailed introduction of on-demand visual perception, including native-resolution perception and the dynamic compressor!
    • [Checkpoints]: Try our advanced model on your own.
    • [Scripts]: Start training models with customized data.

Introducing Oryx

Oryx is a unified multimodal architecture for the spatial-temporal understanding of images, videos, and multi-view 3D scenes. Oryx offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths. Our model achieves strong capabilities in image, video, and 3D multimodal understanding simultaneously.

Main idea of On-Demand Multimodal Understanding (figure: teaser.png)

Overview of Oryx Architecture (figure: method.png)
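
To make the on-demand idea concrete, here is a minimal Python sketch of the data flow the figures describe: visual inputs keep their native spatial size and temporal length, and a dynamic compressor reduces the visual token count on demand before the tokens reach the language model. Every name below (encode_native_resolution, dynamic_compress, the compression ratios) is a hypothetical illustration, not the actual Oryx implementation.

# Hypothetical sketch of on-demand visual processing (not the actual Oryx code).
# Inputs keep their native resolution; a dynamic compressor trades detail for length.
import torch
import torch.nn.functional as F

def encode_native_resolution(frames: torch.Tensor, patch: int = 14) -> torch.Tensor:
    # frames: (T, C, H, W) at arbitrary H x W (assumed multiples of `patch` here).
    # Returns (T * num_patches, C * patch * patch) patch tokens without any resizing.
    t, c, h, w = frames.shape
    tokens = F.unfold(frames, kernel_size=patch, stride=patch)  # (T, C*p*p, L)
    return tokens.permute(0, 2, 1).reshape(-1, c * patch * patch)

def dynamic_compress(tokens: torch.Tensor, ratio: int) -> torch.Tensor:
    # Reduce the number of visual tokens by `ratio` via simple average pooling.
    if ratio <= 1:
        return tokens  # keep full detail, e.g. for a single high-resolution image
    n = (tokens.shape[0] // ratio) * ratio
    return tokens[:n].reshape(-1, ratio, tokens.shape[1]).mean(dim=1)

# The same pipeline serves a high-resolution image and a long video clip,
# just with different on-demand compression ratios.
image = torch.randn(1, 3, 448, 672)    # one frame at an arbitrary resolution
video = torch.randn(64, 3, 224, 224)   # long clip with many frames
image_tokens = dynamic_compress(encode_native_resolution(image), ratio=1)
video_tokens = dynamic_compress(encode_native_resolution(video), ratio=4)
print(image_tokens.shape, video_tokens.shape)

In the real model, the unfold and pooling above are replaced by OryxViT and the learned dynamic compressor described in the paper; the sketch only illustrates how a single pipeline can serve inputs of very different sizes.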

TODO List

  • Release all the model weights.
  • Release OryxViT model.
  • Demo code for generation.
  • All the training and inference code.
  • Evaluation code for image, video and 3D multi-modal benchmarks.
  • Oryx SFT Data.
  • Oryx chatbox.
  • Enhanced Oryx model with latest LLM base models and better SFT data.
  • Introducing our explorations for OryxViT.

Generation Demo

You can try the generation results of our strong Oryx model with the following steps:

1. Download the Oryx model from our Hugging Face collections.

2. Download the Oryx-ViT vision encoder.

3. Replace the "mm_vision_tower" path in config.json with your local path to Oryx-ViT; a scripted version of this step is sketched after the inference command below. (We will automate steps 1-3 soon.)

4. Modify the model path and run the inference script with your own video to test our model.

python inference.py
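
If you prefer, step 3 can be scripted. The snippet below is a small sketch that rewrites the "mm_vision_tower" entry in the downloaded checkpoint's config.json; both paths are placeholders, and the exact value format should match whatever the original config.json entry uses (for example, a plain path or an oryx_vit:-prefixed one as in the training scripts).

# Point the downloaded Oryx checkpoint at a local Oryx-ViT (step 3 above).
# Both paths below are placeholders for your own download locations.
import json
from pathlib import Path

checkpoint_dir = Path("/PATH/TO/Oryx-7B")   # downloaded Oryx model directory
vision_tower = "/PATH/TO/Oryx-ViT"          # downloaded Oryx-ViT encoder

config_path = checkpoint_dir / "config.json"
config = json.loads(config_path.read_text())
config["mm_vision_tower"] = vision_tower     # rewrite the vision tower path
config_path.write_text(json.dumps(config, indent=2))
print("mm_vision_tower ->", config["mm_vision_tower"])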

Training Instructions

Installation

1. Clone this repository:

git clone https://github.com/Oryx-mllm/oryx
cd oryx

2. Install the required packages:

conda create -n oryx python=3.10 -y
conda activate oryx
pip install --upgrade pip
pip install -e .

Preparation

3. Prepare training data:

🚧 We will release the instructions for collecting training data soon. We will also release our prepared data in batches. A guess at the expected format is sketched below.
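
The official data format has not been released yet. Because the codebase builds on LLaVA (see Acknowledgement), the training JSON most likely follows a LLaVA-style conversation format; the entry below is only an illustrative guess, not the official specification.

# Illustrative guess at a LLaVA-style entry for DATA.json (not the official format).
import json

sample = {
    "id": "example_000",
    "video": "videos/example.mp4",   # or "image": "images/example.jpg"
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is happening in this video?"},
        {"from": "gpt", "value": "A person is assembling a wooden bookshelf."},
    ],
}

with open("DATA.json", "w") as f:
    json.dump([sample], f, indent=2)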

Training

4. Train your own model:

Modify the following lines in the training scripts to match your environment:

export PYTHONPATH=/PATH/TO/oryx:$PYTHONPATH
VISION_TOWER='oryx_vit:PATH/TO/oryx_vit_new.pth'
DATA="PATH/TO/DATA.json"
MODEL_NAME_OR_PATH="PATH/TO/7B_MODEL"

Script for training Oryx-7B:

bash scripts/train_oryx_7b.sh

Script for training Oryx-34B:

bash scripts/train_oryx_34b.sh

Citation

If you find our work useful for your research and applications, please cite our paper using this BibTeX:

@article{liu2024oryx,
  title={Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution},
  author={Liu, Zuyan and Dong, Yuhao and Liu, Ziwei and Hu, Winston and Lu, Jiwen and Rao, Yongming},
  journal={arXiv preprint arXiv:2409.12961},
  year={2024}
}

Acknowledgement

  • Our codebase is built on LLaVA.

  • Thanks to the lmms-eval team for building such a useful evaluation system!
