CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

Most state-of-the-art point trackers are trained on synthetic data due to the difficulty of annotating real videos for this task. However, this can result in suboptimal performance due to the statistical gap between synthetic and real videos. To understand these issues better, we introduce CoTracker3, comprising a new tracking model and a new semi-supervised training recipe.

The training recipe allows real videos without annotations to be used during training by generating pseudo-labels with off-the-shelf teachers. The new model eliminates or simplifies components from previous trackers, resulting in a simpler and often smaller architecture. The training scheme is much simpler than prior work and achieves better results using 1,000 times less data.
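As a rough illustration of the pseudo-labelling idea, the sketch below lets a frozen teacher produce tracks for an unlabelled clip and scores a student against them. Everything named here is hypothetical: `static_teacher` and `drifting_teacher` are toy stand-ins that ignore the pixels entirely, picking one teacher at random per clip is an assumption, and the mean absolute error is a placeholder rather than the loss used by the actual model.

```python
import random

def static_teacher(num_frames, queries):
    """Hypothetical teacher: predicts that every query point stays still."""
    return [[(x, y) for (x, y) in queries] for _ in range(num_frames)]

def drifting_teacher(num_frames, queries):
    """Hypothetical teacher: predicts a drift of one pixel per frame in x."""
    return [[(x + t, y) for (x, y) in queries] for t in range(num_frames)]

def make_pseudo_labels(num_frames, queries, teachers, rng):
    """Pick one frozen teacher at random and use its tracks as labels."""
    teacher = rng.choice(teachers)
    return teacher(num_frames, queries)

def track_loss(pred, target):
    """Mean absolute position error (x and y summed) over frames and points."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for (px, py), (tx, ty) in zip(pred_frame, target_frame):
            total += abs(px - tx) + abs(py - ty)
            count += 1
    return total / count

rng = random.Random(0)
num_frames = 8                            # length of the unlabelled clip
queries = [(4.0, 5.0), (10.0, 20.0)]      # query points as (x, y)
teachers = [static_teacher, drifting_teacher]

# No human annotation is involved: the labels come from a teacher's output.
pseudo_tracks = make_pseudo_labels(num_frames, queries, teachers, rng)

# Stand-in for a student forward pass; a real student would be a trainable model.
student_tracks = static_teacher(num_frames, queries)
loss = track_loss(student_tracks, pseudo_tracks)
```

In a real setup the loss would be backpropagated through the student only, while the teachers stay fixed.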

We further study the scaling behaviour to understand the impact of using more real unsupervised data in point tracking. The model is available in online and offline variants and reliably tracks visible and occluded points. We demonstrate qualitatively impressive tracking results, where points can be tracked for a long time even when they are occluded or leave the field of view. Quantitatively, CoTracker3 outperforms all recent trackers on standard benchmarks, often by a substantial margin.

Related Links

CoTracker is the first iteration of this work. In this follow-up, we significantly simplified the architecture and showed that point trackers can be improved by training them on real data with pseudo-labels produced by other synthetic-trained models. Check out CoTracker as well!

BootsTAPIR is a bootstrapped version of TAPIR, a feed-forward point tracker with a matching stage inspired by TAP-Vid and a refinement stage inspired by PIPs. The model demonstrates accurate tracking of visible points, which improves after scaling. However, it struggles to predict the positions of occluded points.

LocoTrack is an efficient point-tracking model that accurately tracks visible points. It introduces 4D correlations that we also use in the architecture of CoTracker3.

Shape of Motion is a test-time optimisation method for dynamic reconstruction that proposes a new motion representation. It produces accurate 3D tracks and visually convincing dynamic reconstructions.

VGGSfM is the first end-to-end differentiable structure-from-motion pipeline that outperforms traditional algorithms. Point tracking is one of its components.

BibTeX

@InProceedings{karaev2024cotracker3,
    author    = {Nikita Karaev and Iurii Makarov and Jianyuan Wang and Natalia Neverova and Andrea Vedaldi and Christian Rupprecht},
    title     = {{CoTracker3}: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos},
    journal   = {arxiv},
    year      = {2024}
}
