Jinwoo Shin1 · Pieter Abbeel2 · Younggyo Seo1,3
1 KAIST 2UC Berkeley 3Dyson Robot Learning Lab

- We note that torch versions >2.0 may work, but installing the versions below is recommended:

```bash
conda create -n rsp python=3.9.12 -y
conda activate rsp
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
```
- Download and extract Kinetics-400:

```bash
sh data_preprocessing/download.sh
sh data_preprocessing/extract.sh
```
- We assume the root directory for the data is `$DATA_ROOT = /data/kinetics400`. If you want to change the root directory, please change `root_dl` in `download.sh` and `extract.sh`.
- We resize the data to 256x256 for efficient loading during training:

```bash
python data_preprocessing/make_256scale.py --datadir $DATA_ROOT
```
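To get a feel for why the resize helps loading, here is a back-of-the-envelope comparison of decoded RGB frame sizes; the 1280x720 source resolution is purely an illustrative assumption, not a property of the dataset:

```python
def rgb_frame_bytes(width, height):
    # An uncompressed RGB frame uses 3 bytes per pixel.
    return width * height * 3

original = rgb_frame_bytes(1280, 720)  # assumed source resolution
resized = rgb_frame_bytes(256, 256)    # resolution after preprocessing
print(f"{original / resized:.1f}x fewer bytes per decoded frame")
# -> 14.1x fewer bytes per decoded frame
```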
- We additionally provide code to filter out several non-working videos:

```bash
python data_preprocessing/make_labels.py --datadir $DATA_ROOT --filedir train2
```
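The exact logic of `make_labels.py` is not shown here; as a rough sketch of what such a filtering pass does (`build_label_list` and the injected `is_readable` check are hypothetical names, not the repo's API):

```python
from pathlib import Path

def build_label_list(datadir, is_readable):
    """Collect (video_path, class_name) pairs, skipping videos that
    fail the readability check. `is_readable` is injected so any
    decoder backend (cv2, decord, ffprobe, ...) can be plugged in."""
    samples = []
    for class_dir in sorted(Path(datadir).iterdir()):
        if not class_dir.is_dir():
            continue
        for video in sorted(class_dir.glob("*.mp4")):
            if is_readable(video):
                samples.append((str(video), class_dir.name))
    return samples
```

The surviving list would then be pickled (e.g. as `label_full_1.0.pickle`) so training never touches broken files.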
```
/data/kinetics400
|-- train2
|   |-- abseiling
|   |   |-- xx.mp4
|   |   |-- ...
|   |-- air_drumming
|   |   |-- xx.mp4
|   |   |-- ...
|   |-- ...
|-- labels
|   |-- label_full_1.0.pickle
```
- Note that `[N_NODE] x [BATCH_SIZE_PER_GPU] x [ACCUM_ITER]` must equal 1536 to reproduce our results.
- Default: `[DATA_PATH]=/data/kinetics400`
```bash
python -m torch.distributed.launch --nproc_per_node=[N_NODE] main_pretrain_rsp.py \
    --batch_size [BATCH_SIZE_PER_GPU] \
    --accum_iter [ACCUM_ITER] \
    --model rsp_vit_small_patch16 \
    --epochs 400 \
    --warmup_epochs 40 \
    --data_path [DATA_PATH] \
    --log_dir [LOG_DIR] \
    --output_dir [LOG_DIR] \
    --norm_pix_loss \
    --repeated_sampling 2
```
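The global-batch constraint can be sanity-checked with a quick computation; the 8 x 96 x 2 split below is a hypothetical example, and any factorization that multiplies to 1536 works:

```python
# Hypothetical configuration; only the product matters.
n_node = 8                # --nproc_per_node
batch_size_per_gpu = 96   # --batch_size
accum_iter = 2            # --accum_iter

effective_batch = n_node * batch_size_per_gpu * accum_iter
assert effective_batch == 1536, "must equal 1536 to reproduce the results"
```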
We provide the pre-trained checkpoint below:
The evaluation code is mainly built upon DINO.
- Step 1: Dataset preparation

We note that the default root path is `[DATA_ROOT]=/data`. Additionally, we resize DAVIS from 480x(?) to 480x880 for a natural evaluation with patches.
```bash
sh data_preprocessing/eval/davis_download.sh
python data_preprocessing/eval/davis_preprocessing.py --data_root [DATA_ROOT]
```
```
[DATA_ROOT]/DAVIS_480_880
|-- Annotations/480p
|   |-- bear
|   |   |-- 00000.png
|   |   |-- ...
|   |-- ...
|-- ImageSets/2017/val.txt
|-- JPEGImages/480p
|   |-- bear
|   |   |-- 00000.jpg
|   |   |-- ...
|   |-- ...
```
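One likely reason for the 480x880 shape (our reading of "natural evaluation with patches", not stated explicitly in the source): both sides divide evenly by the ViT-S/16 patch size, so every frame maps onto a whole patch grid with no leftover pixels:

```python
patch_size = 16            # ViT-S/16 patch size
height, width = 480, 880   # preprocessed DAVIS resolution

assert height % patch_size == 0 and width % patch_size == 0
grid = (height // patch_size, width // patch_size)
print(grid)  # -> (30, 55): a 30x55 patch grid
```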
- Step 2: Video object segmentation

```bash
python eval_video_segmentation_davis.py \
    --finetune [LOG_DIR]/checkpoint-199.pth \
    --output_dir [LOG_DIR]/davis_seg \
    --data_path [DATA_ROOT]/DAVIS_480_880 \
    --topk 7 --size_mask_neighborhood 30 --n_last_frames 30 \
    --model vit_small
```
- Step 3: Evaluating the obtained segmentation
```bash
git clone https://github.com/davisvideochallenge/davis2017-evaluation
python ./davis2017-evaluation/evaluation_method.py \
    --task semi-supervised \
    --results_path [LOG_DIR]/davis_seg \
    --davis_path [DATA_ROOT]/DAVIS_480_880
```
We provide the evaluation code at https://github.com/huiwon-jang/RSP/tree/eval_cortexbench.
- [ ] Evaluation codes: JHMDB, VIP, RLBench, Franka Kitchen
This code may not exactly replicate the results reported in the paper due to potential human errors while preparing and cleaning the code for release. If you have difficulty reproducing our findings, please let us know. We will also make an effort to run sanity-check experiments in the near future.