Abstract: Despite the significant progress in deep learning for dense visual recognition problems, such as semantic segmentation, traditional methods are constrained by fixed class sets. Meanwhile, vision-language foundation models, such as CLIP, have showcased remarkable effectiveness in numerous zero-shot image-level tasks, owing to their robust generalizability. Recently, a body of work has investigated utilizing these models in open-vocabulary semantic segmentation (OVSS). However, existing approaches often rely on impractical supervised pre-training or access to additional pre-trained networks. In this work, we propose a strong baseline for training-free OVSS, termed Neighbour-Aware CLIP (NACLIP), representing a straightforward adaptation of CLIP tailored for this scenario. Our method enforces localization of patches in the self-attention of CLIP's vision transformer which, despite being crucial for dense prediction tasks, has been overlooked in the OVSS literature. By incorporating design choices favouring segmentation, our approach significantly improves performance without requiring additional data, auxiliary pre-trained networks, or extensive hyperparameter tuning, making it highly practical for real-world applications. Experiments are performed on 8 popular semantic segmentation benchmarks, yielding state-of-the-art performance on most scenarios.
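To make the idea of "enforcing localization of patches" in self-attention more concrete, here is a minimal, dependency-free sketch. It is an illustration under stated assumptions, not the repository's implementation: it assumes locality is encouraged by adding a Gaussian bias, centred on each patch, to the attention logits before the softmax. The function names `gaussian_neighbour_bias` and `neighbour_aware_attention` are hypothetical, and the `std` argument is only presumed to play a role similar to the `{gaussian_std}` option of `test_all.sh`.

```python
import math


def gaussian_neighbour_bias(grid_h, grid_w, std):
    """Build an N x N bias matrix (N = grid_h * grid_w).

    bias[i][j] = exp(-d^2 / (2 * std^2)), where d is the Euclidean
    distance between patches i and j on the patch grid.
    ASSUMPTION: this form of bias is illustrative only.
    """
    coords = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    bias = []
    for r1, c1 in coords:
        row = [
            math.exp(-((r1 - r2) ** 2 + (c1 - c2) ** 2) / (2.0 * std ** 2))
            for r2, c2 in coords
        ]
        bias.append(row)
    return bias


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def neighbour_aware_attention(logits, bias):
    """Add the spatial bias to the raw attention logits, then normalize each row."""
    return [
        softmax([l + b for l, b in zip(logit_row, bias_row)])
        for logit_row, bias_row in zip(logits, bias)
    ]
```

With uniform logits, each patch then attends most strongly to itself and its immediate neighbours, which is the behaviour the abstract describes as favourable for dense prediction.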
To run NACLIP, please install the following packages. We used Python 3.9 in our experiments.
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 -f https://download.pytorch.org/whl/cu111/torch_stable.html
pip install openmim
mim install mmcv==2.0.1 mmengine==0.8.4 mmsegmentation==1.1.1
pip install ftfy regex yapf==0.40.1
We include the following dataset configurations in this repo, following SCLIP: PASCAL VOC (with and without the background category), PASCAL Context (with and without the background category), Cityscapes, ADE20K, COCO-Stuff164k, and COCO-Object.
Please follow the MMSeg data preparation document to download and pre-process the datasets. The COCO-Object dataset can be converted from COCO-Stuff164k by executing the following command:
python ./datasets/cvt_coco_object.py PATH_TO_COCO_STUFF164K -o PATH_TO_COCO_OBJECT
Remember to modify the dataset paths (`data_root`) in the config files in `./configs/`.
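If there are many config files to update, the edit can be scripted. The sketch below assumes each config contains an assignment of the form `data_root = '...'` with a quoted string value, as is common in MMSegmentation-style configs; the helper name `set_data_root` is hypothetical, and configs that define the path differently would need manual editing.

```python
import re

# Matches `data_root = '...'` or `data_root="..."` at the start of a line
# (optionally indented). ASSUMPTION: the value is a single quoted string.
_DATA_ROOT = re.compile(r"^([ \t]*data_root[ \t]*=[ \t]*)(['\"]).*?\2", re.MULTILINE)


def set_data_root(config_text, new_root):
    """Return config_text with every quoted data_root value replaced by new_root."""
    # A function replacement avoids backslash-escape issues in new_root.
    return _DATA_ROOT.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{new_root}{m.group(2)}",
        config_text,
    )


if __name__ == "__main__":
    example = "dataset_type = 'Example'\ndata_root = ''\n"
    print(set_data_root(example, "/path/to/dataset"))
```

Applying it to a file amounts to reading the text, calling `set_data_root`, and writing the result back; keeping a backup of the original config is advisable.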
To evaluate our approach on a single benchmark, run the following command:
python eval.py --config ./configs/cfg_{benchmark_name}.py
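The `{benchmark_name}` placeholder corresponds to a config file named `cfg_{benchmark_name}.py`. As a convenience, the available names can be discovered from the config directory and turned into evaluation commands. This is a sketch that relies only on the file-naming pattern shown above; the helper name `build_eval_commands` is hypothetical and not part of the repository.

```python
import subprocess
from pathlib import Path


def build_eval_commands(config_dir="./configs"):
    """Return {benchmark_name: command} for every cfg_*.py file in config_dir."""
    commands = {}
    for cfg in sorted(Path(config_dir).glob("cfg_*.py")):
        name = cfg.stem[len("cfg_"):]  # strip the "cfg_" prefix
        commands[name] = ["python", "eval.py", "--config", cfg.as_posix()]
    return commands


if __name__ == "__main__":
    for name, cmd in build_eval_commands().items():
        print(f"Evaluating {name}: {' '.join(cmd)}")
        subprocess.run(cmd, check=True)  # stops at the first failing run
```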
You can also run the evaluation on all the benchmarks using the `test_all.sh` script, whose general syntax is:
bash test_all.sh {arch} {attn} {gaussian_std} {pamr} {gpu} {log_path}
Values of `reduced` for `{arch}` and `naclip` for `{attn}` represent our method.
For example, to reproduce the main results, run:
bash test_all.sh reduced naclip 5 on {gpu} {log_path}
With the default setup in this repo, the following results (mIoU) should be achieved:
VOC21 | PC60 | COCO Obj | VOC20 | Cityscapes | PC59 | ADE20K | COCO Stuff | Avg |
---|---|---|---|---|---|---|---|---|
62.36 | 34.99 | 36.19 | 80.60 | 38.27 | 38.35 | 19.05 | 25.18 | 41.87 |
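
As a quick sanity check when comparing your own runs against the table, the `Avg` column is consistent with the unweighted mean of the eight per-benchmark mIoU values. The snippet below simply recomputes it from the numbers listed above; the dictionary name `miou` is illustrative.

```python
# Per-benchmark mIoU values copied from the table above.
miou = {
    "VOC21": 62.36,
    "PC60": 34.99,
    "COCO Obj": 36.19,
    "VOC20": 80.60,
    "Cityscapes": 38.27,
    "PC59": 38.35,
    "ADE20K": 19.05,
    "COCO Stuff": 25.18,
}

# Unweighted mean over the eight benchmarks.
average = sum(miou.values()) / len(miou)
print(f"{average:.2f}")  # prints 41.87
```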
We gratefully thank the authors of SCLIP, CLIP, and MMSegmentation, on which our code is based.