We consider the problem of segmenting objects in videos based on their motion and no other forms of supervision. Prior work has often approached this problem by using the principle of common fate, namely the fact that the motion of points that belong to the same object is strongly correlated. However, most authors have only considered instantaneous motion from optical flow. In this work, we present a way to train a segmentation network using long-term point trajectories as a supervisory signal to complement optical flow. The key difficulty is that long-term motion, unlike instantaneous motion, is difficult to model -- any parametric approximation is unlikely to capture complex motion patterns over long periods of time. We instead draw inspiration from subspace clustering approaches, proposing a loss function that seeks to group the trajectories into low-rank matrices where the motion of object points can be approximately explained as a linear combination of other point tracks. Our method outperforms the prior art on motion-based segmentation, which shows the utility of long-term motion and the effectiveness of our formulation.
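For intuition, here is one way such an objective can be written down (a sketch only, not necessarily the exact loss used in the paper): stack the $N$ point trajectories over $T$ frames into a matrix $P \in \mathbb{R}^{N \times 2T}$, let $m_k \in [0,1]^N$ denote the predicted soft mask of group $k$ sampled at the tracked points, and penalize a convex surrogate of the rank of each masked trajectory matrix,

$$
\mathcal{L}(m, P) = \sum_{k} \lVert \operatorname{diag}(m_k)\, P \rVert_{*},
$$

where $\lVert \cdot \rVert_{*}$ is the nuclear norm. This term is small when the trajectories assigned to a group span a low-dimensional subspace, i.e. when each track in the group can be approximately written as a linear combination of the other tracks in that group.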
The following packages are required to run the code:
- opencv-python (imported as cv2)
- numpy
- torch==2.0.1
- torchvision==0.15.2
- einops
- timm
- wandb
- tqdm
- scikit-learn
- scipy
- Pillow (imported as PIL)
- detectron2
See environment.yaml for exact versions and the full list of dependencies (a snapshot of the environment state).
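One typical way to set this up, assuming environment.yaml is a conda environment specification (the environment name is whatever its name: field defines):

```bash
# Create the environment from the provided spec file and activate it.
# Replace <env-name> with the name defined in environment.yaml.
conda env create -f environment.yaml
conda activate <env-name>
```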
Datasets should be placed under data/<dataset_name>, e.g. data/DAVIS2016.
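With the datasets used below, the expected layout is:

```
data/
├── DAVIS2016/
├── SegTrackv2/
└── FBMS_clean/
```

The Tracks/ subdirectories referenced below are created by the trajectory extraction commands.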
For video segmentation, we follow the dataset preparation steps of MotionGrouping, including obtaining optical flow.
For trajectories, we use CoTrackerV2. To generate trajectories, run the following commands:
```bash
# DAVIS
python extract_trajectories.py data/DAVIS2016/ data/DAVIS2016/Tracks/cotrackerv2_rel_stride4_aux2 --grid_step 1 --height 480 --width 854 --max_frames 100 --grid_stride 4 --precheck

# SegTrackV2
python extract_trajectories.py data/SegTrackv2/ data/SegTrackv2/Tracks/cotrackerv2_rel_stride4_aux2 --grid_step 1 --height 480 --width 854 --grid_stride 4 --max_frames 100 --seq-search-path JPEGImages --precheck

# FBMS
python extract_trajectories.py data/FBMS_clean/ data/FBMS_clean/Tracks/ --grid_step 1 --height 480 --width 854 --grid_stride 4 --max_frames 100 --seq-search-path JPEGImages --precheck
```
Note that calculating trajectories takes a long time and requires a lot of memory, because we track very many points (we observed that this led to more accurate trajectories with CoTracker). We used SLURM arrays to distribute the workload across many GPUs, on machines with at least 64 GB of RAM and 48 GB of GPU memory. The script has additional options to resume, checkpoint, and skip already-processed sequences, as well as options for debugging.
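As a rough sketch, a SLURM array job for this could look like the following; it assumes the script's resume/skip-already-processed behaviour lets concurrent workers avoid redoing each other's sequences, and the resource requests should be adapted to your cluster:

```bash
#!/bin/bash
#SBATCH --array=0-7          # number of parallel workers; adjust as needed
#SBATCH --gres=gpu:1         # one GPU per worker (>= 48 GB recommended)
#SBATCH --mem=64G            # >= 64 GB of RAM per worker
#SBATCH --cpus-per-task=8

# Every array task runs the same extraction command and relies on the script's
# resume / skip-already-processed functionality, so the sequences end up being
# split between the workers.
python extract_trajectories.py data/DAVIS2016/ \
    data/DAVIS2016/Tracks/cotrackerv2_rel_stride4_aux2 \
    --grid_step 1 --height 480 --width 854 --max_frames 100 --grid_stride 4 --precheck
```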
Experiments are controlled through a mix of config files and command-line arguments. See the config files and src/config.py for a list of all available options. For example, to train a model on the DAVIS dataset:

```bash
python main.py GWM.DATASET DAVIS LOG_ID davis_training
```
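The same pattern applies to the other datasets; for example (the exact dataset identifiers accepted by GWM.DATASET are defined in src/config.py, so FBMS below is only an illustrative assumption):

```bash
# Illustrative only: confirm the dataset name against src/config.py
python main.py GWM.DATASET FBMS LOG_ID fbms_training
```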
We provide trained checkpoints for the main experiments in the paper. These can be downloaded from the following links:
This repository builds on MaskFormer, MotionGrouping, guess-what-moves, and dino-vit-features.
```bibtex
@inproceedings{karazija24learning,
  title={Learning segmentation from point trajectories},
  author={Karazija, Laurynas and Laina, Iro and Rupprecht, Christian and Vedaldi, Andrea},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}
```