TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement | Explained

Akruti Acharya
June 22, 2023

Get ready to witness another step in the rapid evolution of artificial intelligence! Just when advances in Large Language Models and Generative AI seemed groundbreaking enough, computer vision has taken a leap forward of its own.

Google DeepMind, University College London, and the University of Oxford have come together to develop a revolutionary model for object tracking in video sequences. 

Object tracking in videos plays a vital role in the field of computer vision. Accurately tracking objects' positions, shapes, and attributes over consecutive frames is essential for extracting meaningful insights and enabling advanced applications such as visual surveillance, autonomous vehicles, augmented reality, human-computer interaction, video editing, special effects, and sports analysis.


TAPIR’s approach addresses the limitations of traditional object-tracking methods. It focuses on the "Track Any Point" (TAP) task, where the goal is to track specific points of interest within videos, regardless of their appearance or motion characteristics. By leveraging point-level correspondence, robust occlusion handling, and long-term tracking capabilities, TAPIR provides superior accuracy and robustness in object-tracking scenarios.



Before we dive into the details of Tracking Any Point with per-frame Initialization and temporal Refinement (TAPIR), let’s discuss object tracking in videos.

Object Tracking in a Video

Object tracking in a video sequence refers to the process of locating and following a specific object of interest across consecutive frames of a video. The goal is to accurately track the object's position, size, shape, and other relevant attributes throughout the video.

Applications of Object Tracking in a Video

Object tracking is a powerful technique used across a wide range of computer vision applications.

Visual Surveillance

Object tracking is widely used in surveillance systems for monitoring and analyzing the movement of objects, such as people or vehicles, in a video sequence. It enables tasks like action recognition, anomaly detection, and object behavior analysis.

Autonomous Vehicles

Object tracking plays a crucial role in autonomous driving systems. It helps identify and track vehicles, pedestrians, and other objects in real time, allowing the vehicle to navigate safely and make informed decisions.

Augmented Reality

Object tracking is essential in augmented reality applications, where virtual objects need to be anchored or overlaid onto real-world objects. Tracking objects in the video feed helps align virtual content with the physical environment, creating a seamless augmented reality experience.

Human-Computer Interaction

Object tracking is used in gesture recognition and tracking systems, enabling natural and intuitive interactions between humans and computers. It allows for hand or body tracking, enabling gestures to be recognized and used as input in applications like gaming or virtual reality.

Video Editing and Special Effects

Object tracking is used in video editing and post-production workflows. It enables precise object segmentation and tracking, which can be used for tasks like object removal, object replacement, or applying visual effects to specific objects in the video.

Sports Analysis

Object tracking is employed in sports analytics to track players, balls, and other objects during a game. It provides valuable insights into player movements, ball trajectories, and performance analysis for activities like soccer, basketball, or tennis.

What is ‘Tracking Any Point’ (TAP)?

“Track Any Point” (TAP) refers to a tracking task where the goal is to accurately follow and track a specific point of interest within a video sequence. The term “Track Any Point” implies that the tracking algorithm should be capable of successfully tracking arbitrary points in the video, regardless of their appearance, motion, or other characteristics.



The Track Any Point approach is designed to address several limitations of traditional object tracking, in the following areas:

Point-Level Correspondence

Traditional object tracking methods often rely on bounding boxes or predefined features to track objects. However, these methods may not accurately capture the correspondence between pixels in different frames, especially in scenarios with occlusions, appearance changes, or non-rigid deformations. TAP focuses on point-level correspondence, which provides a more accurate and detailed understanding of object motion and shape.

Occlusion Handling

Object occlusions pose a significant challenge in tracking algorithms. When objects are partially or completely occluded, traditional tracking methods may fail to maintain accurate tracking. TAP tackles this challenge by robustly estimating occlusions and recovering when the points reappear, incorporating search mechanisms to locate the points and maintain their correspondence.


The x marks indicate the predictions when the ground truth is occluded. Source

Long-Term Tracking

Some objects or points of interest may remain visible for an extended period in a video sequence. Traditional tracking methods might struggle to leverage the appearance and motion information across multiple frames optimally. TAP addresses this by integrating information from multiple consecutive frames, allowing for more accurate and robust long-term tracking.

Limited Real-World Ground Truth

The availability of real-world ground truth data for object tracking is often limited or challenging to obtain. This poses difficulties for supervised-learning algorithms that typically require extensive labeled data. TAP tackles this limitation by leveraging synthetic data, allowing the model to learn from simulated environments without overfitting to specific data distributions.

By addressing these inherent limitations, the "Track Any Point" (TAP) approach offers a highly effective solution for object tracking. This approach provides a more refined and comprehensive methodology, enabling precise tracking at the point level. To delve deeper into the workings of TAP and understand its remarkable capabilities, let's explore the model in detail.

TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement

TAPIR (TAP with per-frame Initialization and temporal Refinement) is an advanced object tracking model that combines the strengths of two existing architectures, TAP-Net and Persistent Independent Particles (PIPs). 

TAPIR addresses the challenges of object tracking by employing a coarse-to-fine approach. First, it performs occlusion-robust matching independently on each frame using low-resolution features, producing a coarse track. Then, in a fine refinement stage, it iteratively uses local spatio-temporal information at a higher resolution, with a neural network trading off motion smoothness against appearance cues for accurate tracking.


A key aspect of TAPIR is its fully convolutional architecture, which allows efficient mapping onto modern GPU and TPU hardware. The model primarily consists of feature comparisons, spatial convolutions, and temporal convolutions, facilitating fast and parallel processing.

TAPIR also estimates its own uncertainty in position estimation through self-supervised learning. This enables the suppression of low-confidence predictions, improving benchmark scores and benefitting downstream algorithms that rely on precision.

TAPIR combines TAP-Net and PIPs to leverage the complementary strengths of both architectures. By integrating the occlusion-robust matching of TAP-Net with the refinement capabilities of PIPs, TAPIR achieves highly accurate and continuous tracking results. Its parallelizable design also ensures fast and efficient processing, overcoming the limitations of sequential processing in PIPs.

💡 The TAPIR paper can be found on arXiv.

TAPIR Architecture

TAPIR builds on TAP-Net for trajectory initialization and incorporates an architecture inspired by PIPs to refine the estimate. TAP-Net replaces PIPs' slow "Chaining" process, while a fully-convolutional network replaces MLP-Mixer, improving performance and eliminating complex chunking procedures. The model estimates its own uncertainty, enhancing performance and preventing downstream algorithm failures.



This multi-step architecture combines efficient initialization, refined estimation, and self-estimated uncertainty, resulting in a robust object tracking framework.

Track Initialization

Track initialization refers to the initial estimation of the position and characteristics of an object or point of interest in a video sequence. TAPIR conducts a global comparison between the features of the query point and the features of every other frame. This comparison yields an initial track estimate along with an uncertainty estimate.

By utilizing the query point’s features and comparing them with features from multiple frames, TAPIR establishes a starting point for subsequent tracking algorithms to accurately track the object’s movement throughout the video.
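To make the global comparison concrete, here is a minimal Python sketch of per-frame initialization. It assumes a dot-product similarity between the query feature and each location of a low-resolution feature grid, followed by a soft argmax over the grid. The function names (cost_volume, soft_argmax, initialize_track) and the temperature parameter are illustrative and are not taken from the TAPIR codebase. The real model is learned end to end and also outputs an uncertainty estimate, which this sketch omits. Plain Python lists are used so the example runs without dependencies.

```python
import math


def cost_volume(query_feat, frame_feats):
    """Dot product between the query feature and every grid location.

    frame_feats is a grid [H][W] of feature vectors for a single frame.
    """
    return [[sum(q * f for q, f in zip(query_feat, cell)) for cell in row]
            for row in frame_feats]


def soft_argmax(scores, temperature=1.0):
    """Softmax-weighted average of grid coordinates, returned as (x, y)."""
    flat = [(s / temperature, x, y)
            for y, row in enumerate(scores) for x, s in enumerate(row)]
    peak = max(s for s, _, _ in flat)  # subtract the max for numerical stability
    weights = [(math.exp(s - peak), x, y) for s, x, y in flat]
    total = sum(w for w, _, _ in weights)
    x_hat = sum(w * x for w, x, _ in weights) / total
    y_hat = sum(w * y for w, _, y in weights) / total
    return x_hat, y_hat


def initialize_track(query_feat, video_feats, temperature=1.0):
    """One coarse (x, y) estimate per frame, each computed independently."""
    return [soft_argmax(cost_volume(query_feat, frame), temperature)
            for frame in video_feats]
```

Because every frame is handled independently, a point that disappears behind an occluder can be found again as soon as a good match reappears, without depending on the estimate from the previous frame.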

Position Uncertainty Estimates

Position uncertainty estimates address the problem of predictions that land far from the true location. Alongside each position, TAPIR predicts how uncertain it is about that position, so that predictions likely to be badly wrong can be suppressed.

The algorithm computes an uncertainty estimate which indicates the likelihood of the prediction being significantly far from the ground truth. This uncertainty estimate is integrated into the loss function, along with occlusion and point location predictions, and contributes to improving the Average Jaccard metric. By considering the algorithm’s confidence in position prediction, TAPIR provides a more robust framework for object tracking.
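The sketch below illustrates one way such an estimate could be trained and used. It rests on two assumptions that the article does not spell out: the training target is 1 whenever the prediction is more than max_error_px pixels from the ground truth, and a point is reported as visible only when the product of "not occluded" and "not uncertain" probabilities exceeds 0.5. Both function names and both thresholds are hypothetical.

```python
import math


def uncertainty_target(pred_xy, gt_xy, max_error_px):
    """Training label (assumed form): 1.0 if the prediction is far from the truth."""
    return 1.0 if math.dist(pred_xy, gt_xy) > max_error_px else 0.0


def visible_mask(occlusion_probs, uncertainty_probs, threshold=0.5):
    """Keep a point only if it is likely visible AND its position is likely accurate.

    The product rule and the 0.5 threshold are assumptions made for this sketch.
    """
    return [(1.0 - o) * (1.0 - u) > threshold
            for o, u in zip(occlusion_probs, uncertainty_probs)]
```

With this rule, a point that the model believes is visible but poorly localized is dropped in the same way as an occluded point, which avoids counting a confident but wrong location as a hit.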

Iterative Refinement

The refinement process runs for a fixed number of iterations. In each iteration, the updated position estimate from the previous iteration is fed back into the refinement process, allowing for continuous improvement of the track estimate.
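The feedback loop can be sketched in a few lines of Python. Here refine_step stands in for the refinement network: it receives the current track and returns a per-frame position update. The default of four iterations and the toy update rule (make_toy_step, which halves the remaining error each time) are assumptions chosen only so that the example runs; they are not the model's actual behavior.

```python
def iterative_refinement(initial_track, refine_step, num_iterations=4):
    """Repeatedly feed the current track back in and apply the predicted update."""
    track = list(initial_track)
    for _ in range(num_iterations):
        deltas = refine_step(track)  # one (dx, dy) per frame
        track = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(track, deltas)]
    return track


def make_toy_step(target_track):
    """Hypothetical stand-in for the network: move halfway to a known target."""
    def step(track):
        return [(0.5 * (tx - x), 0.5 * (ty - y))
                for (x, y), (tx, ty) in zip(track, target_track)]
    return step
```

For example, starting from [(0.0, 0.0), (8.0, 8.0)] with a target of [(16.0, 0.0), (8.0, 24.0)], four iterations of the toy step leave one sixteenth of the initial error in each coordinate.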

Training Dataset

TAPIR aims to bridge the gap between simulated and real-world data to improve performance. One difference between the Kubric MOVi-E dataset and real-world scenarios is the lack of panning in Kubric’s fixed camera setup. Real-world videos often involve panning, which poses a challenge for the models.


Example from the Kubric MOVi-E dataset. Source

To address this, the MOVi-E dataset is modified by introducing a random linear trajectory for the camera’s “look at” point. This trajectory allows the camera to move along a path, mimicking real-world panning. By incorporating panning scenarios into the dataset, TAPIR can better handle real-world videos and improve its tracking performance.
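A random linear trajectory of this kind amounts to interpolating between two randomly sampled 3D points over the length of the clip. The sketch below shows the idea; the function name random_linear_look_at, the sampling bounds, and the use of uniform sampling are assumptions for illustration, not details of the actual dataset generation code.

```python
import random


def random_linear_look_at(num_frames, bounds=(-1.0, 1.0), rng=None):
    """Sample start and end look-at points, then interpolate linearly between them."""
    if rng is None:
        rng = random.Random()
    start = tuple(rng.uniform(*bounds) for _ in range(3))
    end = tuple(rng.uniform(*bounds) for _ in range(3))
    path = []
    for t in range(num_frames):
        alpha = t / (num_frames - 1) if num_frames > 1 else 0.0
        path.append(tuple(s + alpha * (e - s) for s, e in zip(start, end)))
    return path
```

Calling random_linear_look_at(24, rng=random.Random(0)) returns one look-at point per frame of a 24-frame clip, with a constant step between consecutive frames.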

TAPIR Implementation

TAPIR is trained on the Kubric MOVi-E dataset for 50,000 steps with a cosine learning rate schedule. Training uses 64 TPU-v3 cores, with each core processing four 24-frame videos per step. An Adam optimizer is used, and cross-replica batch normalization is applied within the ResNet backbone.
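As a quick check on these numbers, the sketch below computes the global batch size implied by the setup and shows the shape of a cosine learning rate schedule over 50,000 steps. The peak learning rate of 1e-3 is a placeholder, since the article does not state one, and the schedule omits any warmup phase.

```python
import math


def cosine_learning_rate(step, total_steps=50_000, peak_lr=1e-3):
    """Cosine decay from peak_lr at step 0 to zero at total_steps.

    peak_lr is a placeholder value, not a figure from the article.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Batch arithmetic from the training setup described above.
NUM_CORES = 64
VIDEOS_PER_CORE = 4
FRAMES_PER_VIDEO = 24

videos_per_step = NUM_CORES * VIDEOS_PER_CORE          # 256 videos
frames_per_step = videos_per_step * FRAMES_PER_VIDEO   # 6,144 frames
```

Each optimization step therefore sees 256 videos, or 6,144 frames, across all cores.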

💡TAPIR implementation can be found on GitHub

How does TAPIR Compare to Baseline Models?

The TAPIR model is evaluated against baseline models such as TAP-Net and PIPs on the TAP-Vid benchmark. The TAP-Vid benchmark consists of annotated real and synthetic videos with point tracks. The evaluation metric used is Average Jaccard, which assesses the accuracy of both position estimation and occlusion prediction.
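A simplified sketch of how Average Jaccard can be computed is given below. It assumes the usual TAP-Vid convention: a prediction is a true positive when the point is visible in the ground truth, predicted visible, and within a pixel threshold; the Jaccard score is averaged over thresholds of 1, 2, 4, 8, and 16 pixels. The thresholds and function names are assumptions here, and the official evaluation code should be used for reported numbers.

```python
import math


def jaccard_at_threshold(pred_xy, pred_visible, gt_xy, gt_visible, threshold_px):
    """TP / (ground-truth visible + false positives) for one distance threshold."""
    true_pos = false_pos = gt_pos = 0
    for p, pv, g, gv in zip(pred_xy, pred_visible, gt_xy, gt_visible):
        within = math.dist(p, g) < threshold_px
        if gv:
            gt_pos += 1  # every visible ground-truth point is a TP or a miss
        if pv and gv and within:
            true_pos += 1
        elif pv:
            false_pos += 1  # predicted visible, but occluded or too far away
    denom = gt_pos + false_pos
    # With no visible points and no predictions the score is undefined;
    # returning 0.0 is an arbitrary choice for this sketch.
    return true_pos / denom if denom else 0.0


def average_jaccard(pred_xy, pred_visible, gt_xy, gt_visible,
                    thresholds=(1, 2, 4, 8, 16)):
    """Mean Jaccard score over the distance thresholds."""
    scores = [jaccard_at_threshold(pred_xy, pred_visible, gt_xy, gt_visible, t)
              for t in thresholds]
    return sum(scores) / len(scores)
```

Note that a point predicted visible but placed too far from the ground truth is penalized twice, once as a false positive and once as a missed ground-truth point, which is why suppressing uncertain predictions can raise the score.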



Results are reported on the DAVIS, RGB-Stacking, and Kinetics datasets. TAPIR improves by 10.6% over the best previous model on Kinetics (TAP-Net) and by 19.3% on DAVIS (PIPs). On the RGB-Stacking dataset, the improvement is 6.3%.

💡 Two online Colab demos let you try TAPIR on your own videos without installation: the first runs the authors' best-performing TAPIR model, and the second runs a model in an online fashion.

TAPIR: Key Takeaways

  • TAPIR (Tracking Any Point with per-frame Initialization and temporal Refinement) is a model by Google DeepMind for object tracking, specifically designed to track a query point in a video sequence.
  • The model consists of two stages: a matching stage that finds suitable candidate point matches for the query point in each frame, and a refinement stage that updates the trajectory and query features based on local correlations.
  • TAPIR outperforms baseline methods by a significant margin on the TAP-Vid benchmark, achieving an approximately 20% absolute Average Jaccard (AJ) improvement on DAVIS.
  • TAPIR offers fast parallel inference on long video sequences, enabling real-time tracking of multiple points.
  • TAPIR is open-source. Colab notebooks are also available.

💡 The TAPIR paper, titled "Tracking Any Point with per-frame Initialization and temporal Refinement," is authored by Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. The authors are affiliated with Google DeepMind, and Andrew Zisserman is also associated with the VGG (Visual Geometry Group) at the Department of Engineering Science, University of Oxford.


Written by Akruti Acharya
Akruti is a data scientist and technical content writer with a M.Sc. in Machine Learning & Artificial Intelligence from the University of Birmingham. She enjoys exploring new things and applying her technical and analytical skills to solve challenging problems and sharing her knowledge and... see more