Detic: A Detector with image classes that can use image-level labels to easily train detectors.
Detecting Twenty-thousand Classes using Image-level Supervision, Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra, ECCV 2022 (arXiv 2201.02605)
Detic requires CLIP to be installed:
pip install git+https://github.com/openai/CLIP.git
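To verify the installation, a quick sanity check (not part of the official instructions) is to import CLIP and list the available weights:

```python
# Quick sanity check that the CLIP package is importable.
import clip

# Lists the model names shipped with OpenAI's CLIP, e.g. 'ViT-B/32'.
print(clip.available_models())
```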
It is recommended to download and extract the dataset somewhere outside the project directory and symlink the dataset root to $MMDETECTION/data
as below. If your folder structure is different, you may need to change the corresponding paths in config files.
The LVIS dataset is adopted as the box-labeled data. LVIS is available from the official website or a mirror. You need to generate `lvis_v1_train_norare.json` according to the official "prepare datasets" instructions for open-vocabulary LVIS, which removes the labels of the 337 rare classes from training. You can also download `lvis_v1_train_norare.json` from our backup; a minimal sketch of the conversion is given after the directory listing below. The directory should look like this.
mmdetection
├── data
│ ├── lvis
│ │ ├── annotations
│ │ │ ├── lvis_v1_train.json
│ │ │ ├── lvis_v1_val.json
│ │ │ ├── lvis_v1_train_norare.json
│ │ ├── train2017
│ │ ├── val2017
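If you prefer not to download the backup, the following is a minimal sketch of how `lvis_v1_train_norare.json` could be produced. It assumes the standard LVIS v1 annotation format, where each category carries a `frequency` field (`'r'` for the 337 rare classes); the official conversion script remains the reference:

```python
# Sketch: drop the annotations of the rare ('r') categories from LVIS v1 train.
# Assumes the standard LVIS v1 format; the official script is authoritative.
import json

with open('data/lvis/annotations/lvis_v1_train.json') as f:
    lvis = json.load(f)

rare_ids = {c['id'] for c in lvis['categories'] if c['frequency'] == 'r'}
lvis['annotations'] = [
    a for a in lvis['annotations'] if a['category_id'] not in rare_ids
]

with open('data/lvis/annotations/lvis_v1_train_norare.json', 'w') as f:
    json.dump(lvis, f)
```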
ImageNet-LVIS is adopted as the image-labeled data. You can download the ImageNet-21K dataset from the official website. Then you need to unzip the classes that overlap with LVIS and convert them into the LVIS annotation format according to the official "prepare datasets" instructions; a rough sketch of the idea is given after the directory listing below. The directory should look like this.
mmdetection
├── data
│ ├── imagenet
│ │ ├── annotations
│ │ │ ├── imagenet_lvis_image_info.json
│ │ ├── ImageNet-21K
│ │ │ ├── n00007846
│ │ │ ├── n01318894
│ │ │ ├── ...
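For orientation, the conversion roughly amounts to indexing the images of the overlapping synsets into a COCO/LVIS-style image list with image-level category ids. The sketch below only illustrates the idea; the synset-to-category mapping and the field names are placeholders, and the exact output format is defined by the official script:

```python
# Rough sketch: index ImageNet-21K images of LVIS-overlapping synsets into a
# COCO/LVIS-style image list. The mapping and field names are illustrative
# placeholders, not the exact format produced by the official script.
import json
import os

# Hypothetical mapping from WordNet synset id to LVIS category id; in practice
# it is derived from the synset ids stored in the LVIS category metadata.
wnid_to_lvis = {'n00007846': 1, 'n01318894': 2}  # illustrative entries only

images, img_id = [], 0
for wnid, cat_id in wnid_to_lvis.items():
    folder = os.path.join('data/imagenet/ImageNet-21K', wnid)
    for name in sorted(os.listdir(folder)):
        img_id += 1
        images.append(dict(
            id=img_id,
            file_name=os.path.join(wnid, name),
            pos_category_ids=[cat_id]))  # image-level labels, no boxes

info = dict(images=images, annotations=[], categories=[])
with open('data/imagenet/annotations/imagenet_lvis_image_info.json', 'w') as f:
    json.dump(info, f)
```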
`data/metadata/` holds the preprocessed metadata (included in the repo). Please follow the official instructions to pre-process the LVIS dataset. You will generate `lvis_v1_train_cat_info.json` for the Federated loss, which contains the frequency of each category in the LVIS training set. In addition, `lvis_v1_clip_a+cname.npy` stores the pre-computed CLIP embeddings for each LVIS category. You can also directly download `lvis_v1_train_cat_info.json` and `lvis_v1_clip_a+cname.npy` from our backup; a sketch of how both files can be reproduced is given after the directory listing below. The directory should look like this.
mmdetection
├── data
│ ├── metadata
│ │ ├── lvis_v1_train_cat_info.json
│ │ ├── lvis_v1_clip_a+cname.npy
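For reference, both files can be reproduced along the following lines: `lvis_v1_train_cat_info.json` counts how often each category appears in the training annotations, and `lvis_v1_clip_a+cname.npy` stores CLIP text embeddings of the prompt "a {category name}". This is only a sketch of the idea; the exact fields, prompt template and CLIP variant are defined by the official pre-processing scripts:

```python
# Sketch: per-category counts for the Federated loss, plus CLIP text
# embeddings with the prompt 'a <category name>'. The exact output formats
# follow the official Detic pre-processing scripts.
import json
from collections import Counter

import clip
import numpy as np
import torch

with open('data/lvis/annotations/lvis_v1_train.json') as f:
    lvis = json.load(f)

# Frequency of each category in the LVIS training set; the official file may
# use different field names.
counts = Counter(a['category_id'] for a in lvis['annotations'])
cat_info = [dict(c, instance_count=counts.get(c['id'], 0))
            for c in sorted(lvis['categories'], key=lambda c: c['id'])]
with open('data/metadata/lvis_v1_train_cat_info.json', 'w') as f:
    json.dump(cat_info, f)

# CLIP text embeddings for 'a <category name>'; ViT-B/32 is an assumption.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, _ = clip.load('ViT-B/32', device=device)
prompts = ['a ' + c['name'].replace('_', ' ') for c in cat_info]
with torch.no_grad():
    emb = model.encode_text(clip.tokenize(prompts).to(device))
    emb = torch.nn.functional.normalize(emb, dim=1)
np.save('data/metadata/lvis_v1_clip_a+cname.npy', emb.cpu().numpy())
```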
Here we provide the Detic model for the open-vocabulary demo. This model is trained on the combined LVIS-COCO and ImageNet-21K data for better demo results. LVIS models do not detect persons well due to the federated annotation protocol of LVIS; LVIS+COCO models give better visual results.
Backbone | Training data | Config | Download |
---|---|---|---|
Swin-B | LVIS & COCO & ImageNet-21K | config | model |
You can also download other models from the official model zoo and convert them to the MMDetection format by running:
python tools/model_converters/detic_to_mmdet.py --src /path/to/detic_weight.pth --dst /path/to/mmdet_weight.pth
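After conversion, a quick way to sanity-check the result is to load it with plain PyTorch and inspect the parameter names:

```python
# Sanity check for a converted checkpoint: it should load with torch and
# contain a state_dict whose keys follow the MMDetection naming scheme.
import torch

ckpt = torch.load('/path/to/mmdet_weight.pth', map_location='cpu')
state = ckpt.get('state_dict', ckpt)
print(len(state), 'tensors')
print(list(state)[:5])  # a few parameter names for a quick eyeball check
```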
You can detect the classes of an existing dataset with the `--texts` argument:
python demo/image_demo.py \
${IMAGE_PATH} \
${CONFIG_PATH} \
${MODEL_PATH} \
--texts lvis \
--pred-score-thr 0.5 \
--palette 'random'
By using CLIP, Detic can detect any class given only its name. You can detect customized classes with the `--texts` argument:
python demo/image_demo.py \
${IMAGE_PATH} \
${CONFIG_PATH} \
${MODEL_PATH} \
--texts 'headphone . webcam . paper . coffe.' \
--pred-score-thr 0.3 \
--palette 'random'
Note that `headphone`, `paper` and `coffe` (typo intended) are not LVIS classes. Despite the misspelled class name, Detic can still produce a reasonable detection for `coffe`.
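The demo can also be driven from Python through MMDetection's `DetInferencer`, which is what `demo/image_demo.py` wraps. The snippet below is a sketch; the `texts` keyword assumes the multimodal inferencer interface of recent MMDetection releases, and all paths are placeholders:

```python
# Sketch: Python-API equivalent of the image_demo.py command above. The
# `texts` keyword assumes the multimodal DetInferencer of recent MMDetection
# releases; all paths are placeholders.
from mmdet.apis import DetInferencer

inferencer = DetInferencer(
    model='path/to/config.py',    # ${CONFIG_PATH}
    weights='path/to/model.pth',  # ${MODEL_PATH}
    palette='random')

inferencer(
    'path/to/image.jpg',          # ${IMAGE_PATH}
    texts='headphone . webcam . paper . coffe.',
    pred_score_thr=0.3,
    out_dir='outputs/')
```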
There are two stages in the whole training process. The first stage trains a baseline model using images with box labels. The second stage finetunes from the baseline model and leverages image-labeled data.
To train the baseline with box supervision, run:
bash ./tools/dist_train.sh projects/Detic_new/detic_centernet2_r50_fpn_4x_lvis_boxsup.py 8
Model (Config) | mask mAP | mask mAP (official) | mask mAP_rare | mask mAP_rare (official) |
---|---|---|---|---|
detic_centernet2_r50_fpn_4x_lvis_boxsup | 31.6 | 31.5 | 26.6 | 25.6 |
The second stage uses both object-detection and image-classification datasets. We provide an improved dataset wrapper `ConcatDataset` to concatenate multiple datasets; the datasets may have different annotation types and different pipelines (e.g., different image sizes). You can also obtain the index of the source dataset for each sample through `get_dataset_source`. We provide the sampler `MultiDataSampler` to customize the ratios of the different datasets. Besides, we provide the batch sampler `MultiDataAspectRatioBatchSampler` to enable different datasets to have different batch sizes. The config of multiple datasets is as follows:
dataset_det = dict(
    type='ClassBalancedDataset',
    oversample_thr=1e-3,
    dataset=dict(
        type='LVISV1Dataset',
        data_root='data/lvis/',
        ann_file='annotations/lvis_v1_train.json',
        data_prefix=dict(img=''),
        filter_cfg=dict(filter_empty_gt=True, min_size=32),
        pipeline=train_pipeline_det,
        backend_args=backend_args))

dataset_cls = dict(
    type='ImageNetLVISV1Dataset',
    data_root='data/imagenet',
    ann_file='annotations/imagenet_lvis_image_info.json',
    data_prefix=dict(img='ImageNet-LVIS/'),
    pipeline=train_pipeline_cls,
    backend_args=backend_args)

train_dataloader = dict(
    batch_size=[8, 32],
    num_workers=2,
    persistent_workers=True,
    sampler=dict(
        type='MultiDataSampler',
        dataset_ratio=[1, 4]),
    batch_sampler=dict(
        type='MultiDataAspectRatioBatchSampler',
        num_datasets=2),
    dataset=dict(
        type='ConcatDataset',
        datasets=[dataset_det, dataset_cls]))
- If one of the multiple datasets is itself a `ConcatDataset`, it is still counted as a single dataset for `num_datasets` in `MultiDataAspectRatioBatchSampler`.
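To make the interplay between the sampler and the batch sampler concrete, here is a small self-contained toy (plain Python, not the mmdet implementation): indices are drawn from the two datasets in roughly the configured 1:4 ratio, and every emitted batch contains samples from a single dataset with that dataset's own batch size.

```python
# Toy illustration (not the mmdet implementation): draw sample sources with
# per-dataset weights like MultiDataSampler, then flush single-source batches
# with per-dataset batch sizes like MultiDataAspectRatioBatchSampler.
import random
from collections import Counter

sizes = [100, 400]     # toy lengths for [dataset_det, dataset_cls]
ratios = [1, 4]        # dataset_ratio from the config above
batch_sizes = [8, 32]  # batch_size per dataset from the config above

# Weight each sample so that dataset i is drawn in proportion to ratios[i].
sources = [i for i in (0, 1) for _ in range(sizes[i])]
weights = [ratios[i] / sizes[i] for i in (0, 1) for _ in range(sizes[i])]

buckets, batches = {0: [], 1: []}, Counter()
for src in random.choices(sources, weights, k=4000):
    buckets[src].append(src)
    if len(buckets[src]) == batch_sizes[src]:  # a full single-source batch
        batches[src] += 1
        buckets[src].clear()

# Samples arrive ~1:4, but batches come out ~1:1 because a classification
# batch holds 32 samples while a detection batch holds only 8.
print(batches)
```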
To finetune the baseline model with image-labeled data, run:
bash ./tools/dist_train.sh projects/Detic_new/detic_centernet2_r50_fpn_4x_lvis_in21k-lvis.py 8
Model (Config) | mask mAP | mask mAP (official) | mask mAP_rare | mask mAP_rare (official) |
---|---|---|---|---|
detic_centernet2_r50_fpn_4x_lvis_in21k-lvis | 32.9 | 33.2 | 30.9 | 29.7 |
Model (Config) | mask mAP | mask mAP (official) | mask mAP_rare | mask mAP_rare (official) | Download |
---|---|---|---|---|---|
detic_centernet2_r50_fpn_4x_lvis_boxsup | 31.6 | 31.5 | 26.6 | 25.6 | model | log |
detic_centernet2_r50_fpn_4x_lvis_in21k-lvis | 32.9 | 33.2 | 30.9 | 29.7 | model | log |
detic_centernet2_swin-b_fpn_4x_lvis_boxsup | 40.7 | 40.7 | 38.0 | 35.9 | model | log |
detic_centernet2_swin-b_fpn_4x_lvis_in21k-lvis | 41.7 | 41.7 | 41.7 | 41.7 | model | log |
Model (Config) | mask mAP | mask mAP (official) | mask mAP_rare | mask mAP_rare (official) | Download |
---|---|---|---|---|---|
detic_centernet2_r50_fpn_4x_lvis-base_boxsup | 30.4 | 30.2 | 16.2 | 16.4 | model | log |
detic_centernet2_r50_fpn_4x_lvis-base_in21k-lvis | 32.6 | 32.4 | 27.4 | 24.9 | model | log |
To evaluate a trained model, run:
python ./tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE}
The following models are converted from the official model zoo.
Model (Config) | mask mAP | mask mAP_novel | Download |
---|---|---|---|
detic_centernet2_swin-b_fpn_4x_lvis-base_boxsup | 38.4 | 21.9 | model |
detic_centernet2_swin-b_fpn_4x_lvis-base_in21k-lvis | 40.7 | 34.0 | model |
- The open-vocabulary LVIS setup is LVIS without the rare-class annotations during training, termed `lvis-base`. The rare classes are evaluated as novel classes at test time. `in21k-lvis` denotes that the model uses the classes that overlap between ImageNet-21K and LVIS as image-labeled data.
If you find Detic useful in your research or applications, please consider giving a star 🌟 to the official repository and citing Detic with the following BibTeX entry.
@inproceedings{zhou2022detecting,
  title={Detecting Twenty-thousand Classes using Image-level Supervision},
  author={Zhou, Xingyi and Girdhar, Rohit and Joulin, Armand and Kr{\"a}henb{\"u}hl, Philipp and Misra, Ishan},
  booktitle={ECCV},
  year={2022}
}