This repository contains the official PyTorch implementation for the paper
Frederic Z. Zhang, Yuhui Yuan, Dylan Campbell, Zhuoyao Zhong, Stephen Gould; Exploring Predicate Visual Context for Detecting Human-Object Interactions; In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 10411-10421.
Recently, the DETR framework has emerged as the dominant approach for human–object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost.
- Use the package management tool of your choice and run the following commands after creating your environment.
# Say you are using Conda conda create --name pvic python=3.8 conda activate pvic # Required dependencies pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html pip install matplotlib==3.6.3 scipy==1.10.0 tqdm==4.64.1 pip install numpy==1.24.1 timm==0.6.12 pip install wandb==0.13.9 seaborn==0.13.0 # Clone the repo and submodules git clone https://github.com/fredzzhang/pvic.git cd pvic git submodule init git submodule update pip install -e pocket # Build CUDA operator for MultiScaleDeformableAttention cd h_detr/models/ops python setup.py build install
- Prepare the HICO-DET dataset.
- If you have not downloaded the dataset before, run the following script.
cd /path/to/pvic/hicodet bash download.sh
- If you have previously downloaded the dataset, simply create a soft link.
cd /path/to/pvic/hicodet ln -s /path/to/hicodet_20160224_det ./hico_20160224_det
- If you have not downloaded the dataset before, run the following script.
- Prepare the V-COCO dataset (contained in MS COCO).
- If you have not downloaded the dataset before, run the following script
cd /path/to/pvic/vcoco bash download.sh
- If you have previously downloaded the dataset, simply create a soft link
cd /path/to/pvic/vcoco ln -s /path/to/coco ./mscoco2014
- If you have not downloaded the dataset before, run the following script
Visualisation utilities are implemented to run inference on a single image and visualise the cross-attention weights. A reference model is provided for demonstration purpose if you don't want to train a model yourself. Download the model and save it to ./checkpoints/
. Use the argument --index
to select images and --action
to specify the action index. Refer to the lookup table for action indices.
DETR=base python inference.py --resume checkpoints/pvic-detr-r50-hicodet.pth --index 4050 --action 111
The detected human-object pairs with scores overlayed are saved to fig.png
, while the attention weights are saved to pair_xx_attn_head_x.png
. Below are some sample outputs.
In addition, the argument --image-path
enables inference on custom images.
Refer to the documentation for model checkpoints and training/testing commands.
PViC is released under the BSD-3-Clause License.
If you find our work useful for your research, please consider citing us:
@inproceedings{zhang2023pvic,
author = {Zhang, Frederic Z. and Yuan, Yuhui and Campbell, Dylan and Zhong, Zhuoyao and Gould, Stephen},
title = {Exploring Predicate Visual Context in Detecting Human–Object Interactions},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {10411-10421},
}
@inproceedings{zhang2022upt,
author = {Zhang, Frederic Z. and Campbell, Dylan and Gould, Stephen},
title = {Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022},
pages = {20104-20112}
}
@inproceedings{zhang2021scg,
author = {Zhang, Frederic Z. and Campbell, Dylan and Gould, Stephen},
title = {Spatially Conditioned Graphs for Detecting Human–Object Interactions},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021},
pages = {13319-13327}
}