Do Different Tracking Tasks Require Different Appearance Models?

Zhongdao Wang1, Hengshuang Zhao2, Ya-Li Li1, Shengjin Wang1, Philip H.S. Torr2, Luca Bertinetto3
1Tsinghua University 2University of Oxford 3FIVE AI

NeurIPS 2021

We experiment with a large number of self-supervised representations on five different tracking tasks and present a comprehensive comparison. A polygon with a larger area indicates better performance across the board. The dashed gray line indicates the ImageNet pre-trained representation.

Abstract

Tracking objects of interest in a video is one of the most popular and widely applicable problems in computer vision. However, over the years, a Cambrian explosion of use cases and benchmarks has fragmented the problem into a multitude of different experimental setups. As a consequence, the literature has fragmented too, and the novel approaches proposed by the community are now usually specialised to fit only one specific setup.

To understand to what extent this specialisation is actually necessary, in this work we present UniTrack, a solution that addresses five different tasks within the same framework. UniTrack consists of a single, task-agnostic appearance model, which can be learned in a supervised or self-supervised fashion, and multiple “heads” that address the individual tasks and do not require training. We show how most tracking tasks can be solved within this framework, and that the same appearance model can be used to obtain performance that is competitive with specialised methods on all five tasks considered. The framework also allows us to analyse appearance models obtained with the most recent self-supervised methods, thus significantly extending their evaluation and comparison to a larger variety of important problems.

Proliferation of tracking setups

Various use cases and benchmarks have fragmented the tracking problem into multiple setups.


The UniTrack framework

To investigate what constitutes a good representation of objects in videos, we propose UniTrack, a unified tracking framework with a single shared appearance model. UniTrack is divided into three "levels":

  • Level-1: The shared, base appearance model. We are particularly interested in self-supervised representations such as MoCo and SimCLR.
  • Level-2: Two parameter-free algorithmic primitives, propagation and association, which operate on the deep features extracted by the appearance model (see the sketch after this list).
  • Level-3: Task-specific solutions that adapt the two level-2 primitives to produce task-specific outputs.
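
As a rough illustration of these two primitives (not the actual UniTrack code; all function names here are hypothetical), the sketch below implements propagation as a top-k readout over feature-space affinities and association as Hungarian matching on a cosine-distance cost matrix.

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def propagate(feat_prev, feat_curr, labels_prev, topk=10):
    # Propagate per-pixel labels (e.g. a segmentation mask) from the
    # previous frame to the current one via feature-space affinity.
    C, H, W = feat_prev.shape
    a = F.normalize(feat_prev.reshape(C, -1), dim=0)   # (C, HW)
    b = F.normalize(feat_curr.reshape(C, -1), dim=0)   # (C, HW)
    affinity = b.t() @ a                               # (HW_curr, HW_prev)
    vals, idx = affinity.topk(topk, dim=1)             # keep strongest links only
    weights = F.softmax(vals, dim=1)
    votes = labels_prev.reshape(-1)[idx].float()       # labels of the top-k matches
    return (weights * votes).sum(dim=1).reshape(H, W)

def associate(emb_tracks, emb_dets):
    # Match existing tracks to new detections with the Hungarian
    # algorithm on cosine distance between appearance embeddings.
    cost = 1 - F.normalize(emb_tracks, dim=1) @ F.normalize(emb_dets, dim=1).t()
    rows, cols = linear_sum_assignment(cost.cpu().numpy())
    return list(zip(rows.tolist(), cols.tolist()))

Both primitives are parameter-free: they only read features produced by the level-1 model, which is why no task-specific training is needed.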


Results visualization

Below we show qualitative results of UniTrack on five different tracking tasks. The appearance model used is an ImageNet pre-trained ResNet-18, and no task-specific training is performed. Note that we use this appearance model only for simplicity and consistency of comparison; many self-supervised representations perform better on all tasks.

An evaluation platform for SSL models

The "decoupled" characterazition of appearance models in UniTrack allows us to use this framework as an evaluation platform for benchmarking self-supervised learning (SSL) models on multiple tracking tasks. And the evaluation does not requrire any training or finetuning!

On the right we show an example: the SSL model MoCo-v1 is tested on the five tracking tasks. Each vertex shows its ranking against the other models on a specific tracking task. The larger the area of the polygon, the better the learned representation performs across the board.

Results are shown at the top of the page: the recently proposed VFS (bottom right of the panel) dominates on all tasks except SOT. We expect newly proposed representations to further improve performance across tasks.


Open-world applications

In UniTrack, the appearance model is a general-purpose representation, which means we can perform class-agnostic tracking in an open-world context. Below we show a demo obtained by combining YOLOX with UniTrack. YOLOX is trained on the COCO dataset, so in this demo we can perform Multi-Object Tracking for the 80 COCO classes.

The YOLOX detector can be replaced with any other detector to track objects of other classes.
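
A hedged sketch of this combination is shown below. The detector callable and the tracks.embeddings attribute are hypothetical stand-ins, not part of UniTrack's actual API; any detector returning (N, 4) boxes, such as YOLOX, could fill that role.

import torch
import torchvision.ops as ops

@torch.no_grad()
def track_step(frame, feat_map, tracks, detector, stride=8):
    boxes = detector(frame)  # (N, 4) boxes in image coordinates
    # Pool one appearance embedding per detection from the shared feature map.
    embs = ops.roi_align(feat_map.unsqueeze(0), [boxes], output_size=1,
                         spatial_scale=1.0 / stride).flatten(1)  # (N, C)
    # Reuse the parameter-free association primitive sketched earlier.
    return associate(tracks.embeddings, embs)

Because the appearance model never sees class labels, the association step itself is class-agnostic; only the detector limits which categories are tracked.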

BibTeX

@article{wang2021different,
  author    = {Wang, Zhongdao and Zhao, Hengshuang and Li, Ya-Li and Wang, Shengjin and Torr, Philip and Bertinetto, Luca},
  title     = {Do different tracking tasks require different appearance models?},
  journal   = {Thirty-Fifth Conference on Neural Information Processing Systems},
  year      = {2021},
}