Official implementation of Continuous 3D Perception Model with Persistent State
QianqianWang*, Yifei Zhang*, Aleksander Holynski, Alexei A Efros, Angjoo Kanazawa
(*: equal contribution)
- Release multi-view stereo results on the DL3DV dataset.
- Online demo integrated with a webcam.
- Clone CUT3R.
git clone https://github.com/CUT3R/CUT3R.git
cd CUT3R
- Create the environment.
conda create -n cut3r python=3.11 cmake=3.14.0
conda activate cut3r
conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia # use the correct version of cuda for your system
pip install -r requirements.txt
# issues with pytorch dataloader, see https://github.com/pytorch/pytorch/issues/99625
conda install 'llvm-openmp<16'
# for training logging
pip install git+https://github.com/nerfstudio-project/gsplat.git
# for evaluation
pip install evo
pip install open3d
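After installation, a quick sanity check that the core dependencies import and CUDA is visible can save debugging time later (plain PyTorch/Open3D calls, nothing repo-specific):

```python
# Quick sanity check of the freshly created environment.
import torch
import open3d

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("open3d", open3d.__version__)
```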
- Compile the CUDA kernels for RoPE (as in CroCo v2).
cd src/croco/models/curope/
python setup.py build_ext --inplace
cd ../../../../
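To confirm the build succeeded, a minimal import check can help. This sketch assumes it is run from `src/croco/` so that `models.curope` is importable, and that the extension exports `cuRoPE2D` as in CroCo v2; both are assumptions to adjust to your setup:

```python
# Sanity-check sketch: confirm the compiled RoPE extension imports.
# Assumes the working directory is src/croco/ and that the extension
# exports cuRoPE2D as in CroCo v2 -- both assumptions, not guarantees.
try:
    from models.curope import cuRoPE2D
    print("cuRoPE2D CUDA kernels available:", cuRoPE2D)
except ImportError as e:
    print("CUDA RoPE unavailable, falling back to the PyTorch implementation:", e)
```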
We currently provide checkpoints on Google Drive:
Modelname | Training resolutions | #Views | Head |
---|---|---|---|
cut3r_224_linear_4.pth | 224x224 | 16 | Linear |
cut3r_512_dpt_4_64.pth | 512x384, 512x336, 512x288, 512x256, 512x160, 384x512, 336x512, 288x512, 256x512, 160x512 | 4-64 | DPT |
`cut3r_224_linear_4.pth` is our intermediate checkpoint and `cut3r_512_dpt_4_64.pth` is our final checkpoint.
To download the weights, run the following commands:
cd src
# for 224 linear ckpt
gdown --fuzzy https://drive.google.com/file/d/11dAgFkWHpaOHsR6iuitlB_v4NFFBrWjy/view?usp=drive_link
# for 512 dpt ckpt
gdown --fuzzy https://drive.google.com/file/d/1Asz-ZB3FfpzZYwunhQvNPZEUA8XUNAYD/view?usp=drive_link
cd ..
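If you want to inspect a downloaded checkpoint before running inference, a minimal sketch using standard `torch.load` (the key names printed depend entirely on how the checkpoint was saved; depending on your PyTorch version you may need `weights_only=False` if the file stores metadata alongside raw tensors):

```python
# Sketch: peek at a downloaded checkpoint's top-level structure.
import torch

# Some PyTorch versions default to weights_only=True, which can reject
# checkpoints containing pickled training metadata.
ckpt = torch.load("src/cut3r_512_dpt_4_64.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])  # top-level entries; names depend on how it was saved
```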
To run the inference code, you can use the following command:
# the following script will run inference offline and visualize the output with viser on port 8080
python demo.py --model_path MODEL_PATH --seq_path SEQ_PATH --size SIZE --vis_threshold VIS_THRESHOLD --output_dir OUT_DIR # input can be a folder or a video
# Example:
# python demo.py --model_path src/cut3r_512_dpt_4_64.pth --size 512 \
# --seq_path examples/001 --vis_threshold 1.5 --output_dir tmp
#
# python demo.py --model_path src/cut3r_224_linear_4.pth --size 224 \
# --seq_path examples/001 --vis_threshold 1.5 --output_dir tmp
Output results will be saved to `output_dir`.
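Since `--seq_path` accepts either a folder of images or a video, you may want to pre-extract (and subsample) frames yourself. A minimal sketch using OpenCV (`opencv-python` is an extra dependency here, not implied by the repo; path names are illustrative):

```python
# Helper sketch: dump every `stride`-th frame of a video into a folder
# that can then be passed to demo.py via --seq_path.
import os
import cv2

def video_to_frames(video_path: str, out_dir: str, stride: int = 1) -> None:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()

video_to_frames("examples/video.mp4", "examples/video_frames", stride=2)
```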
Currently, we accelerate the feedforward process by processing inputs in parallel within the encoder, which results in linear memory consumption as the number of frames increases.
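As a rough illustration of why memory grows linearly with the number of frames (a toy stand-in, not the actual CUT3R encoder or its API):

```python
# Illustrative sketch (not the repo's API): batching all frames through an
# encoder in one forward pass parallelizes compute, but one feature map is
# kept per frame, so activation memory grows linearly with frame count N.
import torch
import torch.nn as nn

encoder = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # stand-in for the ViT encoder
frames = torch.randn(32, 3, 224, 224)                 # N = 32 input frames

with torch.no_grad():
    feats = encoder(frames)  # all frames encoded in parallel
print(feats.shape)           # torch.Size([32, 64, 224, 224]) -- O(N) memory
```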
Our training data includes 32 datasets listed below. We provide processing scripts for all of them. Please download the datasets from their official sources, and refer to preprocess.md for processing scripts and more information about the datasets.
- ARKitScenes
- BlendedMVS
- CO3Dv2
- MegaDepth
- ScanNet++
- ScanNet
- Waymo Open Dataset
- WildRGB-D
- Map-free
- TartanAir
- UnrealStereo4K
- Virtual KITTI 2
- 3D Ken Burns
- BEDLAM
- COP3D
- DL3DV
- Dynamic Replica
- EDEN
- Hypersim
- IRS
- Matterport3D
- MVImgNet
- MVS-Synth
- OmniObject3D
- PointOdyssey
- RealEstate10K
- SmartPortraits
- Spring
- Synscapes
- UASOL
- UrbanSyn
- HOI4D
Please follow MonST3R and Spann3R to prepare Sintel, Bonn, KITTI, NYU-v2, TUM-dynamics, ScanNet, 7scenes and Neural-RGBD datasets.
The datasets should be organized as follows:
data/
├── 7scenes
├── bonn
├── kitti
├── neural_rgbd
├── nyu-v2
├── scannetv2
├── sintel
└── tum
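A small helper to verify this layout before running evaluation (directory names taken from the tree above):

```python
# Quick check that the evaluation data matches the expected layout.
from pathlib import Path

expected = ["7scenes", "bonn", "kitti", "neural_rgbd",
            "nyu-v2", "scannetv2", "sintel", "tum"]
missing = [d for d in expected if not (Path("data") / d).is_dir()]
print("missing datasets:", missing or "none")
```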
Please refer to eval.md for more details.
To fine-tune the released checkpoints, you can use the two provided config files as a starting point. Note that these configs correspond to the final stage of training, where the goal is to train the model to handle long sequences; accordingly, the encoders are frozen and single-view datasets are removed (see the sketch after the commands below). You may adjust the configurations to suit your requirements.
# Remember to replace the dataset paths with your own paths
# the script has been tested on an 8xA100 (80G) machine
cd src
# finetune 512 checkpoint
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=TRACE TORCH_DISTRIBUTED_DEBUG=DETAIL HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu train.py --config-name dpt_512_vary_4_64
# finetune 224 checkpoint
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=TRACE TORCH_DISTRIBUTED_DEBUG=DETAIL HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu train.py --config-name linear_224_fixed_16
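For reference, "frozen encoders" means the encoder parameters receive no gradient updates during fine-tuning. A generic PyTorch sketch of the idea (a stand-in module, not the actual model class):

```python
# Generic sketch of what "frozen encoder" means in these configs:
# gradients are disabled so the optimizer never updates those weights.
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    for p in module.parameters():
        p.requires_grad_(False)

encoder = nn.Linear(768, 768)  # stand-in; the real model uses a ViT encoder
freeze(encoder)
print(all(not p.requires_grad for p in encoder.parameters()))  # True
```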
Our code is based on several awesome repositories, and we thank the authors for releasing their code!
If you find our work useful, please cite:
@article{wang2025continuous,
  title={Continuous 3D Perception Model with Persistent State},
  author={Wang, Qianqian and Zhang, Yifei and Holynski, Aleksander and Efros, Alexei A and Kanazawa, Angjoo},
  journal={arXiv preprint arXiv:2501.12387},
  year={2025}
}