Visual Geometry Group, University of Oxford
We propose a new approach to learning to segment multiple image objects without manual supervision. The method extracts objects from still images, but uses videos for supervision. While prior works have considered motion for segmentation, a key insight is that, although motion can be used to identify objects, not all objects are necessarily in motion: the absence of motion does not imply the absence of objects. Hence, our model learns to predict image regions that are likely to contain motion patterns characteristic of objects moving rigidly. It does not predict a specific motion, which cannot be done unambiguously from a still image, but a distribution of possible motions, which includes the possibility that an object does not move at all. We demonstrate the advantage of this approach over its deterministic counterpart and show state-of-the-art unsupervised object segmentation performance on simulated and real-world benchmarks, surpassing methods that use motion even at test time. As our approach is applicable to a variety of network architectures that segment scenes, we also apply it to existing image-reconstruction-based models, showing drastic improvements.
This repository builds on Mask2Former.
Create and name a conda environment of your choosing, e.g. ppmp:
conda create -n ppmp python=3.9
conda activate ppmp
then install the requirements using this one-liner:
conda install -y pytorch=1.12.1 torchvision=0.13.1 cudatoolkit=11.3 -c pytorch && \
conda install -y kornia jupyter tensorboard timm einops scikit-learn scikit-image openexr-python tqdm -c conda-forge && \
conda install -y gcc_linux-64=7 gxx_linux-64=7 fontconfig && \
yes | pip install cvbase opencv-python filelock && \
yes | python -m pip install 'git+https://github.com/facebookresearch/detectron2.git' && \
cd mask2former/modeling/pixel_decoder/ops && \
sh make.sh
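To sanity-check the installation, a quick import test (a minimal sketch, not part of the original setup instructions) confirms that PyTorch, torchvision, CUDA and Detectron2 are visible from the new environment:

# quick environment check (illustrative only)
import torch, torchvision, detectron2
print("torch:", torch.__version__)              # expect 1.12.1
print("torchvision:", torchvision.__version__)  # expect 0.13.1
print("CUDA available:", torch.cuda.is_available())
print("detectron2:", detectron2.__version__)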
Datasets should be placed under data/<dataset_name>, e.g. data/movi_a or data/moving_clevrtex.
For MovingClevrTex, download and place the tar files under data/moving_clevrtex/tar; see instructions here. The dataloader is set up to build an index into the tar files and read the required information on the fly.
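For illustration, such an index can be thought of as a mapping from member names to byte ranges inside the tar files; below is a minimal sketch using Python's standard tarfile module (the function names are hypothetical and do not reflect the repository's actual dataloader):

import tarfile
from pathlib import Path

def build_tar_index(tar_dir):
    # Map member name -> (tar path, data offset, size) so individual frames
    # can be read on the fly without extracting the archives.
    index = {}
    for tar_path in sorted(Path(tar_dir).glob("*.tar")):
        with tarfile.open(tar_path) as tf:
            for member in tf.getmembers():
                if member.isfile():
                    index[member.name] = (str(tar_path), member.offset_data, member.size)
    return index

def read_member(index, name):
    # Return the raw bytes of one archived file, e.g. a PNG frame.
    tar_path, offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)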
For the MOVi datasets, the files should be extracted to data/<dataset_name>/<train or validation>/<seq name>/, using <seq name>_rgb_<frame num>.jpg for RGB, <seq name>_ano_<frame num>.png for masks, and <seq name>_fwd_<frame num>.npz or <seq name>_bwd_<frame num>.npz for the forward/backward optical flow, respectively. For example:
data/movi_a/train/movi_a_5995/movi_a_5995_ano_017.png
data/movi_a/train/movi_a_5995/movi_a_5995_rgb_017.jpg
data/movi_a/train/movi_a_5995/movi_a_5995_fwd_017.npz
data/movi_a/train/movi_a_5995/movi_a_5995_bwd_017.npz
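As an illustration of this layout, a small hypothetical helper (not part of the repository) could resolve the four files for a given sequence and frame:

import numpy as np
from pathlib import Path
from PIL import Image

def frame_paths(root, split, seq, frame):
    # root e.g. "data/movi_a"; split is "train" or "validation";
    # frame numbers are zero-padded to three digits.
    d = Path(root) / split / seq
    return {
        "rgb":  d / f"{seq}_rgb_{frame:03d}.jpg",
        "mask": d / f"{seq}_ano_{frame:03d}.png",
        "fwd":  d / f"{seq}_fwd_{frame:03d}.npz",
        "bwd":  d / f"{seq}_bwd_{frame:03d}.npz",
    }

paths = frame_paths("data/movi_a", "train", "movi_a_5995", 17)
rgb = np.array(Image.open(paths["rgb"]))
fwd = np.load(paths["fwd"])  # the array key inside the .npz depends on how the data was exported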
See this notebook for details on how to (down)load and normalise the Kubric datasets.
For KITTI, RAFT flow is required. We followed the processing from here, with appropriate filepath changes for the KITTI dataset structure.
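As a rough, unofficial alternative to that pipeline, forward flow can also be estimated with the pretrained RAFT model shipped in torchvision 0.13; the frame paths, resize resolution, and output handling below are assumptions and may differ from what the repository expects:

import torch
import torchvision.transforms.functional as TF
from torchvision.io import read_image
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval().cuda()
preprocess = weights.transforms()  # converts to float and normalises as RAFT expects

# Placeholder frame paths; RAFT needs spatial dims divisible by 8, hence the resize.
img1 = TF.resize(read_image("kitti_frame_0000.png").unsqueeze(0), [368, 1232])
img2 = TF.resize(read_image("kitti_frame_0001.png").unsqueeze(0), [368, 1232])
img1, img2 = preprocess(img1, img2)

with torch.no_grad():
    flows = model(img1.cuda(), img2.cuda())  # list of iterative flow refinements
flow = flows[-1]  # final forward flow, shape (1, 2, H, W)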
Experiments are controlled through a mix of config files and command-line arguments. See the config files and config.py for a list of all available options. For example, for the MOVi C dataset:
python main.py --config config_sacnn.yaml UNSUPVIDSEG.DATASET MOVi_C
or for MOVi D:
# Note the switch to 24 object queries (slots)
python main.py --config config_sacnn.yaml UNSUPVIDSEG.DATASET MOVi_D MODEL.MASK_FORMER.NUM_OBJECT_QUERIES 24
See here for available checkpoints.
@inproceedings{karazija22unsupervised,
author = {Karazija, Laurynas and Choudhury, Subhabrata and Laina, Iro and Rupprecht, Christian and Vedaldi, Andrea},
booktitle = {Advances in Neural Information Processing Systems},
title = {{U}nsupervised {M}ulti-object {S}egmentation by {P}redicting {P}robable {M}otion {P}atterns},
volume = {35},
year = {2022}
}