Applying ViT-Adapter to Object Detection

Our detection code is developed on top of MMDetection v2.23.0.

For details see Vision Transformer Adapter for Dense Predictions.

If you use this code for a paper, please cite:

```bibtex
@article{chen2021vitadapter,
  title={Vision Transformer Adapter for Dense Predictions},
  author={Chen, Zhe and Duan, Yuchen and Wang, Wenhai and He, Junjun and Lu, Tong and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.08534},
  year={2022}
}
```

Usage

Install MMDetection v2.23.0.

```shell
cd ops && sh make.sh  # compile deformable attention
pip install timm==0.4.12
pip install mmdet==2.23.0
# recommended environment: torch1.9 + cuda11.1
pip install mmcv-full==1.4.2 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install instaboostfast # for htc++
```
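Because the stack above is tightly version-pinned, a mismatched install is a common source of import errors. The following is a minimal sketch (not part of this repo) that compares the installed versions of the pinned packages against the expected ones; the `PINNED` mapping simply mirrors the pip commands above.

```python
# Sanity-check sketch: verify the pinned dependencies installed cleanly.
# The package pins mirror the install commands above; adjust if you deviate.
from importlib.metadata import version, PackageNotFoundError

PINNED = {"timm": "0.4.12", "mmdet": "2.23.0", "mmcv-full": "1.4.2"}

def check_pins(pins):
    """Return {package: (installed_or_None, expected)} for every mismatch."""
    problems = {}
    for pkg, expected in pins.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            installed = None  # not installed at all
        if installed != expected:
            problems[pkg] = (installed, expected)
    return problems

if __name__ == "__main__":
    for pkg, (got, want) in check_pins(PINNED).items():
        print(f"{pkg}: found {got}, expected {want}")
```

An empty result means every pinned package matches; anything reported should be reinstalled before compiling the deformable attention ops.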

Data preparation

Prepare COCO according to the guidelines in MMDetection v2.23.0.

Results and models

ViT-Adapter on COCO test-dev

HTC++

| Method | Backbone | Pre-train | Lr schd | box AP | mask AP | #Param | Config | Download |
|--------|----------|-----------|---------|--------|---------|--------|--------|----------|
| HTC++ | ViT-Adapter-L | BEiT-L | 3x | 58.5 | 50.8 | 401M | config | model |
| HTC++ | ViT-Adapter-L (MS) | BEiT-L | 3x | 60.1 | 52.1 | 401M | TODO | - |

ViT-Adapter on COCO minival

HTC++

| Method | Backbone | Pre-train | Lr schd | box AP | mask AP | #Param | Config | Download |
|--------|----------|-----------|---------|--------|---------|--------|--------|----------|
| HTC++ | ViT-Adapter-L | BEiT-L | 3x | 57.9 | 50.2 | 401M | config | model |
| HTC++ | ViT-Adapter-L (MS) | BEiT-L | 3x | 59.8 | 51.7 | 401M | TODO | - |

Baseline Detectors

| Method | Backbone | Pre-train | Lr schd | Aug | box AP | mask AP | #Param | Config | Download |
|--------|----------|-----------|---------|-----|--------|---------|--------|--------|----------|
| Mask R-CNN | ViT-Adapter-T | DeiT-T | 3x | Yes | 46.0 | 41.0 | 28M | config | model |
| Mask R-CNN | ViT-Adapter-S | DeiT-S | 3x | Yes | 48.2 | 42.8 | 48M | config | model |
| Mask R-CNN | ViT-Adapter-B | DeiT-B | 3x | Yes | 49.6 | 43.6 | 120M | config | model |
| Mask R-CNN | ViT-Adapter-B | Uni-Perceiver | 3x | Yes | 50.7 | 44.9 | 120M | config | model |
| Mask R-CNN | ViT-Adapter-L | AugReg | 3x | Yes | 50.9 | 44.8 | 348M | config | model |

Advanced Detectors

| Method | Framework | Pre-train | Lr schd | Aug | box AP | mask AP | #Param | Config | Download |
|--------|-----------|-----------|---------|-----|--------|---------|--------|--------|----------|
| ViT-Adapter-S | Cascade Mask R-CNN | DeiT-S | 3x | Yes | 51.5 | 44.5 | 86M | config | model |
| ViT-Adapter-S | ATSS | DeiT-S | 3x | Yes | 49.6 | - | 36M | config | model |
| ViT-Adapter-S | GFL | DeiT-S | 3x | Yes | 50.0 | - | 36M | config | model |
| ViT-Adapter-S | Sparse R-CNN | DeiT-S | 3x | Yes | 48.1 | - | 110M | config | model |
| ViT-Adapter-B | Upgraded Mask R-CNN | MAE | 25ep | LSJ | 50.3 | 44.7 | 122M | config | model |
| ViT-Adapter-B | Upgraded Mask R-CNN | MAE | 50ep | LSJ | 50.8 | 45.1 | 122M | config | model |

Evaluation

To evaluate ViT-Adapter-L + HTC++ on COCO val2017 on a single node with 8 GPUs, run:

```shell
sh dist_test.sh configs/htc++/htc++_beit_adapter_large_fpn_3x_coco.py /path/to/checkpoint_file 8 --eval bbox segm
```

This should give:

```
Evaluate annotation type *bbox*
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.579
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.766
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.635
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.436
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.616
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.726
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.736
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.736
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.736
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.608
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.768
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.863

Evaluate annotation type *segm*
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.502
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.744
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.549
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.328
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.533
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.683
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.638
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.638
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.638
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.499
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.669
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.776
```
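If you want to log or compare runs, the summary printout can be scraped into a dictionary with a small regex. This is an illustrative sketch (the function name is ours, not part of this repo); it keys each value by metric, IoU range, area, and maxDets:

```python
# Sketch: parse COCO-style "Average Precision/Recall" summary lines
# (as printed by the evaluation above) into a lookup table.
import re

LINE = re.compile(
    r"Average (Precision|Recall)\s+\((AP|AR)\) @\[ IoU=([\d.:]+)\s*\|"
    r" area=\s*(\w+) \| maxDets=\s*(\d+) \] = ([\d.]+)"
)

def parse_coco_summary(text):
    """Map (metric, iou_range, area, max_dets) -> value for each summary line."""
    out = {}
    for match in LINE.finditer(text):
        _, metric, iou, area, maxdets, val = match.groups()
        out[(metric, iou, area, int(maxdets))] = float(val)
    return out
```

For example, feeding it the bbox block above yields `("AP", "0.50:0.95", "all", 100) -> 0.579`, the headline box AP.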

Training

To train ViT-Adapter-T + Mask R-CNN on COCO train2017 on a single node with 8 GPUs for 36 epochs, run:

```shell
sh dist_train.sh configs/mask_rcnn/mask_rcnn_deit_adapter_tiny_fpn_3x_coco.py 8
```
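If you train with a different GPU count or per-GPU batch size than the config assumes, the linear scaling rule is a common starting point for adjusting the learning rate. The sketch below assumes a reference setup of 8 GPUs x 2 images per GPU (MMDetection's usual default); check the learning rate and `samples_per_gpu` in the config you actually launch.

```python
# Linear LR scaling sketch: scale the learning rate in proportion to the
# total batch size relative to the reference batch the config was tuned for.
# (Reference values here are assumptions -- read them from your config.)
def scaled_lr(base_lr, base_batch, gpus, samples_per_gpu):
    """Return base_lr scaled by (actual total batch) / (reference batch)."""
    return base_lr * (gpus * samples_per_gpu) / base_batch

# e.g. dropping from 8 GPUs to 4 at 2 images each halves the total batch,
# so the learning rate is halved as well.
print(scaled_lr(0.02, 16, 4, 2))
```

This is a heuristic, not a guarantee; for large changes in batch size a short warmup or a small LR sweep is still advisable.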