Tracking objects of interest in a video is one of the most popular and widely applicable problems
in computer vision. However, over the years, a Cambrian explosion of use cases and benchmarks
has fragmented the problem into a multitude of different experimental setups. As a consequence, the
literature has fragmented too, and the novel approaches proposed by the community are now usually
specialised to fit only one specific setup.
To understand to what extent this specialisation is actually
necessary, in this work we present UniTrack, a solution that addresses five different tasks within the same
framework. UniTrack consists of a single, task-agnostic appearance model, which can be learned
in a supervised or self-supervised fashion, and multiple “heads” that address individual tasks and do
not require training. We show how most tracking tasks can be solved within this framework, and that
the same appearance model can be used to obtain performance that is competitive with specialised
methods on all five tasks considered. The framework also allows us to analyse appearance models
obtained with the most recent self-supervised methods, thus significantly extending their evaluation
and comparison to a larger variety of important problems.
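
As a rough illustration of the design described above, and not UniTrack's actual implementation, the sketch below shows one frozen appearance model feeding two heads that involve no training. Everything in it is an assumption made for illustration: the names `appearance_model`, `association_head` and `propagation_head` are hypothetical, the appearance model is replaced by a fixed random linear map, and the two heads (nearest-neighbour matching and softmax label propagation) are simple stand-ins for whatever the real heads compute.

```python
import numpy as np

def appearance_model(patches, weights):
    """Hypothetical stand-in for the shared appearance model:
    a fixed linear map followed by L2 normalisation."""
    feats = patches.reshape(len(patches), -1) @ weights
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def association_head(prev_feats, curr_feats):
    """Training-free head (assumed): match each current object to the
    most similar previous object by cosine similarity."""
    similarity = curr_feats @ prev_feats.T
    return similarity.argmax(axis=1)

def propagation_head(ref_feats, ref_labels, query_feats, temperature=0.1):
    """Training-free head (assumed): propagate labels from reference to
    query features through a softmax-normalised affinity matrix."""
    logits = query_feats @ ref_feats.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    affinity = np.exp(logits)
    affinity = affinity / affinity.sum(axis=1, keepdims=True)
    return affinity @ ref_labels

# Toy data: three 8x8 "patches" and a shuffled, slightly noisy copy of them.
rng = np.random.default_rng(0)
weights = rng.normal(size=(8 * 8, 16))  # frozen, shared by both heads
prev_patches = rng.normal(size=(3, 8, 8))
curr_patches = prev_patches[[2, 0, 1]] + 0.01 * rng.normal(size=(3, 8, 8))

# The same features are computed once and reused by every head.
prev_feats = appearance_model(prev_patches, weights)
curr_feats = appearance_model(curr_patches, weights)

matches = association_head(prev_feats, curr_feats)
labels = propagation_head(prev_feats, np.eye(3), curr_feats)
print(matches)
print(labels.round(2))
```

The point of the sketch is only the division of labour: the features are the sole learned (here, fixed) component, while each head is a parameter-free function of those features, so swapping in a different appearance model changes nothing else in the pipeline.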