LTrack: Generalizing Multiple Object Tracking to Unseen Domains by Introducing Natural Language Representation
This repository is an official implementation of the AAAI-2023 accepted paper LTrack: Generalizing Multiple Object Tracking to Unseen Domains by Introducing Natural Language Representation.
TL;DR. LTrack is a fully end-to-end multiple-object tracking framework based on Transformer. It introduces natural language representation from the vision-language model CLIP into an MOT tracker for the first time. We hope this work can shed light on how to develop MOT trackers with promising generalization ability by combining knowledge from images and language.
Abstract. Although existing multi-object tracking (MOT) algorithms have obtained competitive performance on various benchmarks, almost all of them train and validate models on the same domain. The domain generalization problem of MOT is hardly studied. To bridge this gap, we first make the observation that the high-level information contained in natural language is domain invariant across different tracking domains. Based on this observation, we propose to introduce natural language representation into visual MOT models to boost their domain generalization ability. However, it is infeasible to label every tracking target with a textual description. To tackle this problem, we design two modules, namely visual context prompting (VCP) and visual-language mixing (VLM). Specifically, VCP generates visual prompts based on the input frames. VLM combines the information in the generated visual prompts with the textual prompts from a pre-defined Trackbook to obtain instance-level pseudo textual descriptions, which are domain invariant across different tracking scenes. By training models on MOT17 and validating them on MOT20, we observe that the pseudo textual descriptions generated by our proposed modules improve the generalization performance of query-based trackers by large margins.
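For intuition, below is a minimal, hypothetical sketch of the VLM mixing idea: per-instance visual prompts attend over a pre-defined Trackbook of text embeddings to form pseudo textual descriptions. The module name, tensor shapes, and the attention-style mixing are illustrative assumptions, not the exact implementation in this repository.

```python
# Hypothetical sketch of visual-language mixing (VLM): instance-level visual
# prompts attend over a fixed Trackbook of text embeddings to produce pseudo
# textual descriptions. Shapes and the attention formulation are assumptions.
import torch
import torch.nn as nn


class VisualLanguageMixing(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # project visual prompts to queries
        self.key_proj = nn.Linear(dim, dim)    # project Trackbook text embeddings to keys

    def forward(self, visual_prompts: torch.Tensor, trackbook: torch.Tensor) -> torch.Tensor:
        # visual_prompts: [N, dim] per-instance visual prompts (e.g. from VCP)
        # trackbook:      [K, dim] text embeddings of pre-defined prompts
        q = self.query_proj(visual_prompts)                               # [N, dim]
        k = self.key_proj(trackbook)                                      # [K, dim]
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)      # [N, K]
        # weighted mix of text embeddings -> instance-level pseudo textual descriptions
        return attn @ trackbook                                           # [N, dim]


# Usage: 10 instances and a Trackbook of 32 textual prompts (illustrative sizes).
vlm = VisualLanguageMixing(dim=512)
pseudo_text = vlm(torch.randn(10, 512), torch.randn(32, 512))
print(pseudo_text.shape)  # torch.Size([10, 512])
```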
- (2023/07/26) Code is released.
Method | Dataset | Train Data | HOTA | DetA | AssA | MOTA | IDF1 | IDS | URL |
---|---|---|---|---|---|---|---|---|---|
LTrack | MOT17 | MOT17+CrowdHuman Val | 57.5 | 59.4 | 56.1 | 72.1 | 69.1 | 2100 | model |
Method | Dataset | Train Data | HOTA | DetA | AssA | MOTA | IDF1 | URL |
---|---|---|---|---|---|---|---|---|
LTrack | MOT20 | MOT17+CrowdHuman Val | 46.8 | 45.4 | 48.4 | 57.8 | 61.1 | model |
Method | Dataset | Train Data | HOTA-p | AssA-p | IDF1-p | URL |
---|---|---|---|---|---|---|
MOTR | BDD100K | MOT17+CrowdHuman Val | 33.7 | 39.3 | 40.6 | model |
Note:
- LTrack is trained on MOT17 and CrowdHuman with 8 NVIDIA Tesla V100 GPUs.
- The training time for MOT17 is about 2.5 days on V100.
- The inference speed is about 7.0 FPS at a resolution of 1536x800.
- All LTrack models use a ResNet50 backbone with COCO pre-trained weights.
The codebase is built on top of Deformable DETR. We use the CLIP text encoder to extract language embedding.
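As an illustration of the language-embedding step, the following sketch extracts text embeddings with the official openai/CLIP package. The prompt strings are placeholders and may differ from the Trackbook prompts used in this repository.

```python
# Minimal sketch: extract language embeddings with the CLIP text encoder
# (official openai/CLIP package). Prompt strings are illustrative only.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50", device=device)

prompts = ["a photo of a pedestrian", "a photo of a walking person"]  # example prompts
tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    text_embeddings = model.encode_text(tokens)  # [len(prompts), embed_dim] (1024 for RN50)
print(text_embeddings.shape)
```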
- Linux, CUDA>=9.2, GCC>=5.4
- Python>=3.7

  We recommend using Anaconda to create a conda environment:

  conda create -n deformable_detr python=3.7 pip

  Then, activate the environment:

  conda activate deformable_detr

- PyTorch>=1.5.1, torchvision>=0.6.1 (following the instructions here)

  For example, if your CUDA version is 9.2, you could install PyTorch and torchvision as follows:

  conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=9.2 -c pytorch

- Other requirements

  pip install -r requirements.txt

- Build MultiScaleDeformableAttention

  cd ./models/ops
  sh ./make.sh
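After building, a quick Python sanity check can confirm the environment. This is a sketch; it assumes the extension module name produced by Deformable DETR's build scripts, so adjust the import if your build differs.

```python
# Sanity check: verify the PyTorch/CUDA setup and that the CUDA extension built.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import MultiScaleDeformableAttention  # built by ./models/ops/make.sh (assumed name)
    print("MultiScaleDeformableAttention extension found.")
except ImportError as e:
    print("Extension not built yet:", e)
```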
- Download the pre-trained CLIP models (we use RN50.pt) and save them to the pre_trained folder.
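If you prefer to fetch the weights from Python, the following sketch caches RN50.pt under ./pre_trained via the official openai/CLIP package; adjust the path if the training scripts expect a different filename or location.

```python
# Sketch: download/cache the CLIP RN50 checkpoint into ./pre_trained.
# download_root controls where RN50.pt is stored by the official CLIP package.
import clip

model, preprocess = clip.load("RN50", device="cpu", download_root="./pre_trained")
print(type(model))  # RN50.pt is now cached under ./pre_trained
```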
- Please download the MOT17, MOT20, CrowdHuman, and BDD100K datasets and organize them like FairMOT, as follows:
.
├── crowdhuman
│   ├── images
│   └── labels_with_ids
├── MOT17
│   ├── images
│   └── labels_with_ids
├── MOT20
│   ├── images
│   └── labels_with_ids
└── bdd100k
    ├── images
    │   └── track
    │       ├── train
    │       └── val
    └── labels
        └── track
            ├── train
            └── val
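The labels_with_ids files follow the FairMOT convention, where each line stores a class id, a track identity, and a normalized box. The sketch below parses such a file; the exact path and layout are assumptions here and should be verified against your generated labels.

```python
# Sketch: read a FairMOT-style labels_with_ids file. Assumed line layout:
# "class identity x_center y_center width height", with box values normalized to [0, 1].
from pathlib import Path


def read_labels_with_ids(path: str):
    boxes = []
    for line in Path(path).read_text().splitlines():
        cls, track_id, cx, cy, w, h = line.split()
        boxes.append({
            "class": int(cls),
            "track_id": int(track_id),
            "bbox_norm": (float(cx), float(cy), float(w), float(h)),  # normalized cx, cy, w, h
        })
    return boxes


# Example (hypothetical path):
# labels = read_labels_with_ids("MOT17/labels_with_ids/train/MOT17-02-SDP/img1/000001.txt")
```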
- For the BDD100K dataset, you can use the following script to generate the txt files:
cd datasets/data_path
python3 generate_bdd100k_mot.py
cd ../../
You can download COCO pre-trained weights from Deformable DETR. Then train LTrack on 8 GPUs as follows:
sh configs/r50_clip_motr_train.sh
You can download the trained LTrack model (the link is in the "Main Results" section), then run the following command to evaluate it on the MOT17 test set (results are submitted to the evaluation server):

sh configs/r50_motr_submit_mot17.sh

To evaluate on MOT20 and BDD100K, run:

sh configs/r50_motr_eval_mot20.sh
sh configs/r50_motr_eval_bdd100k.sh
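Before submitting, you can sanity-check that the result files follow the MOTChallenge comma-separated layout (frame, id, x, y, w, h, conf, -1, -1, -1). The sketch below is illustrative and the result path is hypothetical.

```python
# Sketch: basic validation of a MOTChallenge-format tracking result file.
import csv


def check_mot_results(path: str) -> None:
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f), start=1):
            assert len(row) >= 7, f"line {i}: expected at least 7 comma-separated fields"
            frame, track_id = int(row[0]), int(row[1])
            assert frame > 0 and track_id > 0, f"line {i}: frame and id must be positive"


# check_mot_results("results/MOT17-01-SDP.txt")  # hypothetical output path
```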
If you find LTrack useful in your research, please consider citing:
@inproceedings{yu2023generalizing,
title={Generalizing multiple object tracking to unseen domains by introducing natural language representation},
author={Yu, En and Liu, Songtao and Li, Zhuoling and Yang, Jinrong and Li, Zeming and Han, Shoudong and Tao, Wenbing},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={37},
number={3},
pages={3304--3312},
year={2023}
}