The folder includes the implementation of ViTamin for open-vocabulary segmentation.
We propose Sliding FC-CLIP which adapts ViTamin within the FC-CLIP framework. Thanks, FC-CLIP!
Please follow FC-CLIP's installation instructions and Getting Started with FC-CLIP.
The configuration for ViTamin is provided in `./configs/coco/panoptic-segmentation/fcclip/fcclip_vitamin_l_eval_ade20k.yaml`.
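Evaluation typically follows the detectron2/Mask2Former workflow that FC-CLIP builds on; the sketch below is illustrative only — the `train_net.py` entry point and the checkpoint path are assumptions, so check Getting Started with FC-CLIP for the exact command.

```shell
# Sketch only: script name and weight path are assumptions, not verified
# against the repo -- see FC-CLIP's Getting Started for the exact command.
python train_net.py \
  --config-file configs/coco/panoptic-segmentation/fcclip/fcclip_vitamin_l_eval_ade20k.yaml \
  --eval-only \
  MODEL.WEIGHTS /path/to/vitamin_l_checkpoint.pth
```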
image encoder | ADE20K (A-150) PQ | Cityscapes PQ | Mapillary Vistas PQ | ADE20K (A-150) mIoU | ADE20K-Full (A-847) mIoU | Pascal Context 459 (PC-459) mIoU | Pascal Context 59 (PC-59) mIoU | Pascal VOC 21 (PAS-21) mIoU | download
---|---|---|---|---|---|---|---|---|---
ConvNeXt-L | 26.8 | 44.0 | 18.3 | 34.1 | 14.8 | 18.2 | 58.4 | 81.8 | checkpoint
ViT-L/14 | 24.6 | 40.7 | 16.5 | 31.8 | 14.3 | 18.3 | 55.1 | 81.5 |
ViTamin-L | 27.3 | 44.0 | 18.2 | 35.6 | 16.1 | 20.4 | 58.4 | 83.4 | checkpoint
@inproceedings{chen2024vitamin,
title={ViTamin: Designing Scalable Vision Models in the Vision-language Era},
author={Chen, Jieneng and Yu, Qihang and Shen, Xiaohui and Yuille, Alan and Chen, Liang-Chieh},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}
This repo contains the code for our paper Convolutions Die Hard: Open-Vocabulary Panoptic Segmentation with Single Frozen Convolutional CLIP.
FC-CLIP is a universal model for open-vocabulary image segmentation, consisting of a class-agnostic segmenter, an in-vocabulary classifier, and an out-of-vocabulary classifier. With everything built upon a single shared frozen convolutional CLIP model, FC-CLIP not only achieves state-of-the-art performance on various open-vocabulary segmentation benchmarks, but also enjoys much lower training (3.2 days with 8 V100 GPUs) and testing costs compared to prior art.
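The in-vocabulary and out-of-vocabulary classifier scores are fused per category. Below is a minimal sketch of such a geometric ensemble; the function name and the `alpha`/`beta` weights are illustrative assumptions, not FC-CLIP's exact implementation or values.

```python
def ensemble_scores(in_vocab, out_vocab, is_base, alpha=0.4, beta=0.8):
    """Geometric ensemble of two classifiers' per-category probabilities.

    in_vocab / out_vocab: probabilities from the learned in-vocabulary
    classifier and the frozen CLIP out-of-vocabulary classifier.
    is_base: True for categories seen during training, False for novel ones.
    alpha / beta: illustrative weights -- base classes lean on the learned
    classifier, novel classes lean more on the frozen CLIP classifier.
    """
    fused = []
    for p_in, p_out, base in zip(in_vocab, out_vocab, is_base):
        w = alpha if base else beta
        fused.append((p_in ** (1.0 - w)) * (p_out ** w))
    return fused
```

With `w = 0` the fused score reduces to the in-vocabulary probability, and with `w = 1` to the out-of-vocabulary one, so the weight interpolates between the two classifiers on a log scale.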
See installation instructions.
See Preparing Datasets for FC-CLIP.
See Getting Started with FC-CLIP.
We also provide a HuggingFace 🤗 demo for FC-CLIP.
method | ADE20K (A-150) PQ | ADE20K (A-150) mAP | ADE20K (A-150) mIoU | Cityscapes PQ | Cityscapes mAP | Cityscapes mIoU | Mapillary Vistas PQ | Mapillary Vistas mIoU | ADE20K-Full (A-847) mIoU | Pascal Context 59 (PC-59) mIoU | Pascal Context 459 (PC-459) mIoU | Pascal VOC 21 (PAS-21) mIoU | Pascal VOC 20 (PAS-20) mIoU | COCO (training dataset) PQ | COCO (training dataset) mAP | COCO (training dataset) mIoU | download
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
FC-CLIP | 26.8 | 16.8 | 34.1 | 44.0 | 26.8 | 56.2 | 18.3 | 27.8 | 14.8 | 58.4 | 18.2 | 81.8 | 95.4 | 54.4 | 44.6 | 63.7 | checkpoint
If you use FC-CLIP in your research, please use the following BibTeX entry.
@article{yu2023fcclip,
  title={Convolutions Die Hard: Open-Vocabulary Panoptic Segmentation with Single Frozen Convolutional CLIP},
  author={Qihang Yu and Ju He and Xueqing Deng and Xiaohui Shen and Liang-Chieh Chen},
  journal={arXiv},
  year={2023}
}
Mask2Former (https://github.com/facebookresearch/Mask2Former)
ODISE (https://github.com/NVlabs/ODISE)