LTrack: Generalizing Multiple Object Tracking to Unseen Domains by Introducing Natural Language Representation
This repository is an official implementation of the AAAI-2023 accepted paper LTrack: Generalizing Multiple Object Tracking to Unseen Domains by Introducing Natural Language Representation.
TL;DR. LTrack is a fully end-to-end multiple-object tracking framework based on Transformer. It introduces natural language representation from the vision-language model CLIP into an MOT tracker for the first time. We hope this work can shed light on how to develop MOT trackers with promising generalization ability by combining knowledge from images and language.
Abstract. Although existing multi-object tracking (MOT) algorithms have obtained competitive performance on various benchmarks, almost all of them train and validate models on the same domain. The domain generalization problem of MOT is hardly studied. To bridge this gap, we first make the observation that the high-level information contained in natural language is domain invariant across different tracking domains. Based on this observation, we propose to introduce natural language representation into visual MOT models to boost their domain generalization ability. However, it is infeasible to label every tracking target with a textual description. To tackle this problem, we design two modules, namely visual context prompting (VCP) and visual-language mixing (VLM). Specifically, VCP generates visual prompts based on the input frames. VLM combines the information in the generated visual prompts with the textual prompts from a pre-defined Trackbook to obtain instance-level pseudo textual descriptions, which are domain invariant across different tracking scenes. By training models on MOT17 and validating them on MOT20, we observe that the pseudo textual descriptions generated by our proposed modules improve the generalization performance of query-based trackers by large margins.
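For intuition, below is a minimal, hypothetical sketch of the VLM mixing idea: per-instance visual prompts attend over a pre-defined Trackbook of text embeddings to form pseudo textual descriptions. The module name, tensor shapes, and the attention-style mixing are illustrative assumptions, not the exact implementation in this repository.

```python
# Hypothetical sketch of visual-language mixing (VLM): instance-level visual
# prompts attend over a fixed Trackbook of text embeddings to produce pseudo
# textual descriptions. Shapes and the attention formulation are assumptions.
import torch
import torch.nn as nn


class VisualLanguageMixing(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # project visual prompts to queries
        self.key_proj = nn.Linear(dim, dim)    # project Trackbook text embeddings to keys

    def forward(self, visual_prompts: torch.Tensor, trackbook: torch.Tensor) -> torch.Tensor:
        # visual_prompts: [N, dim] per-instance visual prompts (e.g. from VCP)
        # trackbook:      [K, dim] text embeddings of pre-defined prompts
        q = self.query_proj(visual_prompts)                               # [N, dim]
        k = self.key_proj(trackbook)                                      # [K, dim]
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)      # [N, K]
        # weighted mix of text embeddings -> instance-level pseudo textual descriptions
        return attn @ trackbook                                           # [N, dim]


# Usage: 10 instances and a Trackbook of 32 textual prompts (illustrative sizes).
vlm = VisualLanguageMixing(dim=512)
pseudo_text = vlm(torch.randn(10, 512), torch.randn(32, 512))
print(pseudo_text.shape)  # torch.Size([10, 512])
```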
- (2023/07/26) Code is released.
Method | Dataset | Train Data | HOTA | DetA | AssA | MOTA | IDF1 | IDS | URL |
---|---|---|---|---|---|---|---|---|---|
LTrack | MOT17 | MOT17+CrowdHuman Val | 57.5 | 59.4 | 56.1 | 72.1 | 69.1 | 2100 | model |
Method | Dataset | Train Data | HOTA | DetA | AssA | MOTA | IDF1 | URL |
---|---|---|---|---|---|---|---|---|
LTrack | MOT20 | MOT17+CrowdHuman Val | 46.8 | 45.4 | 48.4 | 57.8 | 61.1 | model |
Method | Dataset | Train Data | HOTA-p | AssA-p | IDF1-p | URL |
---|---|---|---|---|---|---|
MOTR | BDD100K | MOT17+CrowdHuman Val | 33.7 | 39.3 | 40.6 | model |
Note:
- LTrack is trained on MOT17 and CrowdHuman with 8 NVIDIA Tesla V100 GPUs.
- The training time for MOT17 is about 2.5 days on V100.
- The inference speed is about 7.0 FPS at a resolution of 1536x800.
- All LTrack models use a ResNet50 backbone with COCO pre-trained weights.
The codebase is built on top of Deformable DETR. We use the CLIP text encoder to extract language embedding.
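As an illustration of the language-embedding step, the following sketch extracts text embeddings with the official openai/CLIP package. The prompt strings are placeholders and may differ from the Trackbook prompts used in this repository.

```python
# Minimal sketch: extract language embeddings with the CLIP text encoder
# (official openai/CLIP package). Prompt strings are illustrative only.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50", device=device)

prompts = ["a photo of a pedestrian", "a photo of a walking person"]  # example prompts
tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    text_embeddings = model.encode_text(tokens)  # [len(prompts), embed_dim] (1024 for RN50)
print(text_embeddings.shape)
```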
- Linux, CUDA>=9.2, GCC>=5.4
- Python>=3.7

  We recommend using Anaconda to create a conda environment:

  conda create -n deformable_detr python=3.7 pip

  Then, activate the environment:

  conda activate deformable_detr

- PyTorch>=1.5.1, torchvision>=0.6.1 (following the instructions here)

  For example, if your CUDA version is 9.2, you could install PyTorch and torchvision as follows:

  conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=9.2 -c pytorch

- Other requirements

  pip install -r requirements.txt

- Build MultiScaleDeformableAttention

  cd ./models/ops
  sh ./make.sh
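After building, a quick Python sanity check can confirm the environment. This is a sketch; it assumes the extension module name produced by Deformable DETR's build scripts, so adjust the import if your build differs.

```python
# Sanity check: verify the PyTorch/CUDA setup and that the CUDA extension built.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import MultiScaleDeformableAttention  # built by ./models/ops/make.sh (assumed name)
    print("MultiScaleDeformableAttention extension found.")
except ImportError as e:
    print("Extension not built yet:", e)
```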
- Download the pre-trained CLIP models (we use RN50.pt) and save them to the pre_trained folder.
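If you prefer to fetch the weights from Python, the following sketch caches RN50.pt under ./pre_trained via the official openai/CLIP package; adjust the path if the training scripts expect a different filename or location.

```python
# Sketch: download/cache the CLIP RN50 checkpoint into ./pre_trained.
# download_root controls where RN50.pt is stored by the official CLIP package.
import clip

model, preprocess = clip.load("RN50", device="cpu", download_root="./pre_trained")
print(type(model))  # RN50.pt is now cached under ./pre_trained
```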
- Please download the MOT17, MOT20, CrowdHuman, and BDD100K datasets and organize them like FairMOT, as follows:
.
├── crowdhuman
│   ├── images
│   └── labels_with_ids
├── MOT17
│   ├── images
│   └── labels_with_ids
├── MOT20
│   ├── images
│   └── labels_with_ids
└── bdd100k
    ├── images
    │   └── track
    │       ├── train
    │       └── val
    └── labels
        └── track
            ├── train
            └── val
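The labels_with_ids files follow the FairMOT convention, where each line stores a class id, a track identity, and a normalized box. The sketch below parses such a file; the exact path and layout are assumptions here and should be verified against your generated labels.

```python
# Sketch: read a FairMOT-style labels_with_ids file. Assumed line layout:
# "class identity x_center y_center width height", with box values normalized to [0, 1].
from pathlib import Path


def read_labels_with_ids(path: str):
    boxes = []
    for line in Path(path).read_text().splitlines():
        cls, track_id, cx, cy, w, h = line.split()
        boxes.append({
            "class": int(cls),
            "track_id": int(track_id),
            "bbox_norm": (float(cx), float(cy), float(w), float(h)),  # normalized cx, cy, w, h
        })
    return boxes


# Example (hypothetical path):
# labels = read_labels_with_ids("MOT17/labels_with_ids/train/MOT17-02-SDP/img1/000001.txt")
```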
- For the BDD100K dataset, you can use the following script to generate the txt files:
cd datasets/data_path
python3 generate_bdd100k_mot.py
cd ../../
You can download COCO pre-trained weights from Deformable DETR. Then train LTrack on 8 GPUs as follows:
sh configs/r50_clip_motr_train.sh
You can download the trained LTrack model (the link is in the "Main Results" section), then run the following command to evaluate it on the MOT17 test set (results are submitted to the evaluation server):

sh configs/r50_motr_submit_mot17.sh

To evaluate on MOT20 and BDD100K, run:

sh configs/r50_motr_eval_mot20.sh
sh configs/r50_motr_eval_bdd100k.sh
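Before submitting, you can sanity-check that the result files follow the MOTChallenge comma-separated layout (frame, id, x, y, w, h, conf, -1, -1, -1). The sketch below is illustrative and the result path is hypothetical.

```python
# Sketch: basic validation of a MOTChallenge-format tracking result file.
import csv


def check_mot_results(path: str) -> None:
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f), start=1):
            assert len(row) >= 7, f"line {i}: expected at least 7 comma-separated fields"
            frame, track_id = int(row[0]), int(row[1])
            assert frame > 0 and track_id > 0, f"line {i}: frame and id must be positive"


# check_mot_results("results/MOT17-01-SDP.txt")  # hypothetical output path
```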
If you find LTrack useful in your research, please consider citing:
@inproceedings{yu2023generalizing,
title={Generalizing multiple object tracking to unseen domains by introducing natural language representation},
author={Yu, En and Liu, Songtao and Li, Zhuoling and Yang, Jinrong and Li, Zeming and Han, Shoudong and Tao, Wenbing},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={37},
number={3},
pages={3304--3312},
year={2023}
}