Our detection code is developed on top of MMDetection v2.23.0.
For details, see [Vision Transformer Adapter for Dense Predictions](https://arxiv.org/abs/2205.08534).
If you use this code for a paper, please cite:

```
@article{chen2022vitadapter,
  title={Vision Transformer Adapter for Dense Predictions},
  author={Chen, Zhe and Duan, Yuchen and Wang, Wenhai and He, Junjun and Lu, Tong and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.08534},
  year={2022}
}
```
Install MMDetection v2.23.0 and its dependencies (PyTorch must be installed before `mmcv-full` and before compiling the ops):

```shell
# recommended environment: torch1.9 + cuda11.1
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.4.2 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
pip install timm==0.4.12
pip install mmdet==2.23.0
pip install instaboostfast  # for HTC++
cd ops && sh make.sh  # compile deformable attention
```
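Before moving on, it can help to confirm that the environment resolved to the pinned versions (a minimal sanity check; the `+cu111` suffix depends on your CUDA install):

```shell
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# expected: 1.9.0+cu111 11.1 True
python -c "import mmcv, mmdet, timm; print(mmcv.__version__, mmdet.__version__, timm.__version__)"
# expected: 1.4.2 2.23.0 0.4.12
```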
Prepare the COCO dataset according to the guidelines in MMDetection v2.23.0; the expected layout is sketched below.
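With the default configs, MMDetection looks for the dataset under `data/coco` relative to the repository root. A sketch of the standard layout (`/path/to/coco` is a placeholder for wherever your COCO copy lives; alternatively, change `data_root` in the config instead of symlinking):

```shell
# expected layout:
#   data/coco/annotations/instances_train2017.json
#   data/coco/annotations/instances_val2017.json
#   data/coco/train2017/  # training images
#   data/coco/val2017/    # validation images
mkdir -p data
ln -s /path/to/coco data/coco  # hypothetical source path
```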
HTC++ (COCO test-dev)
Method | Backbone | Pre-train | Lr schd | box AP | mask AP | #Param | Config | Download |
---|---|---|---|---|---|---|---|---|
HTC++ | ViT-Adapter-L | BEiT-L | 3x | 58.5 | 50.8 | 401M | config | model |
HTC++ | ViT-Adapter-L (MS) | BEiT-L | 3x | 60.1 | 52.1 | 401M | TODO | - |
HTC++ (COCO val2017)
Method | Backbone | Pre-train | Lr schd | box AP | mask AP | #Param | Config | Download |
---|---|---|---|---|---|---|---|---|
HTC++ | ViT-Adapter-L | BEiT-L | 3x | 57.9 | 50.2 | 401M | config | model |
HTC++ | ViT-Adapter-L (MS) | BEiT-L | 3x | 59.8 | 51.7 | 401M | TODO | - |
Baseline Detectors
Method | Backbone | Pre-train | Lr schd | Aug | box AP | mask AP | #Param | Config | Download |
---|---|---|---|---|---|---|---|---|---|
Mask R-CNN | ViT-Adapter-T | DeiT-T | 3x | Yes | 46.0 | 41.0 | 28M | config | model |
Mask R-CNN | ViT-Adapter-S | DeiT-S | 3x | Yes | 48.2 | 42.8 | 48M | config | model |
Mask R-CNN | ViT-Adapter-B | DeiT-B | 3x | Yes | 49.6 | 43.6 | 120M | config | model |
Mask R-CNN | ViT-Adapter-B | Uni-Perceiver | 3x | Yes | 50.7 | 44.9 | 120M | config | model |
Mask R-CNN | ViT-Adapter-L | AugReg | 3x | Yes | 50.9 | 44.8 | 348M | config | model |
Advanced Detectors
Backbone | Framework | Pre-train | Lr schd | Aug | box AP | mask AP | #Param | Config | Download |
---|---|---|---|---|---|---|---|---|---|
ViT-Adapter-S | Cascade Mask R-CNN | DeiT-S | 3x | Yes | 51.5 | 44.5 | 86M | config | model |
ViT-Adapter-S | ATSS | DeiT-S | 3x | Yes | 49.6 | - | 36M | config | model |
ViT-Adapter-S | GFL | DeiT-S | 3x | Yes | 50.0 | - | 36M | config | model |
ViT-Adapter-S | Sparse R-CNN | DeiT-S | 3x | Yes | 48.1 | - | 110M | config | model |
ViT-Adapter-B | Upgraded Mask R-CNN | MAE | 25ep | LSJ | 50.3 | 44.7 | 122M | config | model |
ViT-Adapter-B | Upgraded Mask R-CNN | MAE | 50ep | LSJ | 50.8 | 45.1 | 122M | config | model |
To evaluate ViT-Adapter-L + HTC++ on COCO val2017 on a single node with 8 GPUs, run:

```shell
sh dist_test.sh configs/htc++/htc++_beit_adapter_large_fpn_3x_coco.py /path/to/checkpoint_file 8 --eval bbox segm
```

This should give:
```
Evaluate annotation type *bbox*
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.579
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.766
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.635
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.436
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.616
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.726
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.736
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.736
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.736
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.608
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.768
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.863

Evaluate annotation type *segm*
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.502
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.744
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.549
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.328
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.533
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.683
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.638
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.638
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.638
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.499
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.669
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.776
```
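`dist_test.sh` is a thin wrapper around MMDetection's `tools/test.py`, so for a quick single-GPU check the underlying script can be called directly (a sketch, assuming the standard MMDetection v2.x entry point; `vis/` is an arbitrary output directory):

```shell
python tools/test.py \
    configs/htc++/htc++_beit_adapter_large_fpn_3x_coco.py \
    /path/to/checkpoint_file \
    --eval bbox segm \
    --show-dir vis/  # optional: save qualitative predictions
```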
To train ViT-Adapter-T + Mask R-CNN on COCO train2017 on a single node with 8 GPUs for 36 epochs, run:

```shell
sh dist_train.sh configs/mask_rcnn/mask_rcnn_deit_adapter_tiny_fpn_3x_coco.py 8
```
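`dist_train.sh` forwards any extra arguments to `tools/train.py`, so an interrupted run can be resumed with the standard MMDetection v2.x flag (a sketch; the `work_dirs/...` path is MMDetection's default output directory for this config):

```shell
sh dist_train.sh configs/mask_rcnn/mask_rcnn_deit_adapter_tiny_fpn_3x_coco.py 8 \
    --resume-from work_dirs/mask_rcnn_deit_adapter_tiny_fpn_3x_coco/latest.pth
```

Note that these schedules assume the full 8-GPU batch size; if you train on fewer GPUs, the usual remedy is MMDetection's linear scaling rule for the learning rate.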