Jinwoo Shin1 · Pieter Abbeel2 · Younggyo Seo1,3
1 KAIST 2UC Berkeley 3Dyson Robot Learning Lab

- We note that torch versions >2.0 may work, but installing the versions below is recommended:

```bash
conda create -n rsp python=3.9.12 -y
conda activate rsp
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
```
- Download and extract Kinetics-400:

```bash
sh data_preprocessing/download.sh
sh data_preprocessing/extract.sh
```
- We assume the root directory for the data is `$DATA_ROOT = /data/kinetics400`. If you want to change the root directory, please change `root_dl` in `download.sh` and `extract.sh`.
- We resize the data to 256x256 for efficient loading during training:

```bash
python data_preprocessing/make_256scale.py --datadir $DATA_ROOT
```
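To get a feel for why the resize helps loading, here is a back-of-the-envelope comparison of decoded RGB frame sizes; the 1280x720 source resolution is purely an illustrative assumption, not a property of the dataset:

```python
def rgb_frame_bytes(width, height):
    # An uncompressed RGB frame uses 3 bytes per pixel.
    return width * height * 3

original = rgb_frame_bytes(1280, 720)  # assumed source resolution
resized = rgb_frame_bytes(256, 256)    # resolution after preprocessing
print(f"{original / resized:.1f}x fewer bytes per decoded frame")
# -> 14.1x fewer bytes per decoded frame
```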
- We additionally provide code to filter out several non-working videos:

```bash
python data_preprocessing/make_labels.py --datadir $DATA_ROOT --filedir train2
```
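The exact logic of `make_labels.py` is not shown here; as a rough sketch of what such a filtering pass does (`build_label_list` and the injected `is_readable` check are hypothetical names, not the repo's API):

```python
from pathlib import Path

def build_label_list(datadir, is_readable):
    """Collect (video_path, class_name) pairs, skipping videos that
    fail the readability check. `is_readable` is injected so any
    decoder backend (cv2, decord, ffprobe, ...) can be plugged in."""
    samples = []
    for class_dir in sorted(Path(datadir).iterdir()):
        if not class_dir.is_dir():
            continue
        for video in sorted(class_dir.glob("*.mp4")):
            if is_readable(video):
                samples.append((str(video), class_dir.name))
    return samples
```

The surviving list would then be pickled (e.g. as `label_full_1.0.pickle`) so training never touches broken files.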
```
/data/kinetics400
|-- train2
|   |-- abseiling
|   |   |-- xx.mp4
|   |   |-- ...
|   |-- air_drumming
|   |   |-- xx.mp4
|   |   |-- ...
|   |-- ...
|-- labels
|   |-- label_full_1.0.pickle
```
- Note that `[N_NODE] x [BATCH_SIZE_PER_GPU] x [ACCUM_ITER]` must equal 1536 to reproduce our results.
- Default: `[DATA_PATH]=/data/kinetics400`
```bash
python -m torch.distributed.launch --nproc_per_node=[N_NODE] main_pretrain_rsp.py \
    --batch_size [BATCH_SIZE_PER_GPU] \
    --accum_iter [ACCUM_ITER] \
    --model rsp_vit_small_patch16 \
    --epochs 400 \
    --warmup_epochs 40 \
    --data_path [DATA_PATH] \
    --log_dir [LOG_DIR] \
    --output_dir [LOG_DIR] \
    --norm_pix_loss \
    --repeated_sampling 2
```
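The global-batch constraint can be sanity-checked with a quick computation; the 8 x 96 x 2 split below is a hypothetical example, and any factorization that multiplies to 1536 works:

```python
# Hypothetical configuration; only the product matters.
n_node = 8                # --nproc_per_node
batch_size_per_gpu = 96   # --batch_size
accum_iter = 2            # --accum_iter

effective_batch = n_node * batch_size_per_gpu * accum_iter
assert effective_batch == 1536, "must equal 1536 to reproduce the results"
```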
We provide the pre-trained checkpoint below:
The evaluation code is mainly built upon DINO.
- Step 1: Dataset preparation

We note that the default root path is `[DATA_ROOT]=/data`. Additionally, we resize DAVIS from 480x(?) to 480x880 for a natural evaluation with patches.
```bash
sh data_preprocessing/eval/davis_download.sh
python data_preprocessing/eval/davis_preprocessing.py --data_root [DATA_ROOT]
```
```
[DATA_ROOT]/DAVIS_480_880
|-- Annotations/480p
|   |-- bear
|   |   |-- 00000.png
|   |   |-- ...
|   |-- ...
|-- ImageSets/2017/val.txt
|-- JPEGImages/480p
|   |-- bear
|   |   |-- 00000.jpg
|   |   |-- ...
|   |-- ...
```
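One likely reason for the 480x880 shape (our reading of "natural evaluation with patches", not stated explicitly in the source): both sides divide evenly by the ViT-S/16 patch size, so every frame maps onto a whole patch grid with no leftover pixels:

```python
patch_size = 16            # ViT-S/16 patch size
height, width = 480, 880   # preprocessed DAVIS resolution

assert height % patch_size == 0 and width % patch_size == 0
grid = (height // patch_size, width // patch_size)
print(grid)  # -> (30, 55): a 30x55 patch grid
```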
- Step 2: Video object segmentation

```bash
python eval_video_segmentation_davis.py \
    --finetune [LOG_DIR]/checkpoint-199.pth \
    --output_dir [LOG_DIR]/davis_seg \
    --data_path [DATA_ROOT]/DAVIS_480_880 \
    --topk 7 --size_mask_neighborhood 30 --n_last_frames 30 \
    --model vit_small
```
- Step 3: Evaluating the obtained segmentation
```bash
git clone https://github.com/davisvideochallenge/davis2017-evaluation
python ./davis2017-evaluation/evaluation_method.py \
    --task semi-supervised \
    --results_path [LOG_DIR]/davis_seg \
    --davis_path [DATA_ROOT]/DAVIS_480_880
```
We provide the evaluation code at https://github.com/huiwon-jang/RSP/tree/eval_cortexbench.
- [ ] Evaluation codes: JHMDB, VIP, RLBench, Franka Kitchen
This code may not exactly replicate the results reported in the paper due to potential human errors while preparing and cleaning the code for release. If you have difficulty reproducing our findings, please let us know. We will also make an effort to run sanity-check experiments in the near future.