Contents
What is CoTracker3?
Evolution of Point Tracking: From TAPIR to CoTracker3
How CoTracker3 Works
Pseudo-Labeling: Key to CoTracker3’s Efficiency
Performance and Benchmarking
Applications of CoTracker3
CoTracker3’s Failure Cases
How to Implement CoTracker3?
Conclusion
CoTracker3: Simplified Point Tracking with Pseudo-Labeling by Meta AI
Point tracking is essential in computer vision for tasks like 3D reconstruction, video editing, and robotics. However, tracking points accurately across video frames, especially with occlusions or fast motion, remains challenging. Traditional models rely on complex architectures and large synthetic datasets, limiting their performance in real-world videos.
Meta AI’s CoTracker3 addresses these issues by simplifying previous trackers and improving data efficiency. It introduces a pseudo-labeling approach, allowing the model to train on real videos without annotations, making it more scalable and effective for real-world use.
This blog provides a detailed overview of CoTracker3, explaining how it works, its innovations over previous models, and the impact it has on point tracking.
What is CoTracker3?
CoTracker3 is the latest point tracking model from Meta AI, designed to track multiple points in videos even when those points are temporarily occluded. It builds on the foundation laid by earlier models such as TAPIR and CoTracker while significantly simplifying their architecture.
Figure from the paper CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos.
The key innovation in CoTracker3 is its ability to leverage pseudo-labeling, a semi-supervised learning approach that uses real, unannotated videos for training. This approach, combined with a leaner architecture, enables CoTracker3 to outperform previous models like BootsTAPIR and LocoTrack, all while using 1,000 times fewer training videos.
Evolution of Point Tracking: From TAPIR to CoTracker3
Early Developments in Point Tracking
Key advances in neural point tracking came from models like TAPIR and CoTracker, which used deep networks to track points across video frames. TAPIR introduced a global matching stage to improve tracking accuracy over long video sequences, refining point tracks by comparing candidate locations across multiple frames. This led to significant improvements in tracking, but the model's reliance on complex components and synthetic training data limited its scalability.
CoTracker’s Joint Tracking
CoTracker introduced a transformer-based architecture capable of tracking multiple points jointly. This allowed the model to handle occlusions more effectively by using the correlations between tracks to infer the position of hidden points. CoTracker’s joint tracking approach made it a robust model, particularly in cases where points temporarily disappeared from view.
For more information on CoTracker, read the blog Meta AI’s CoTracker: It is Better to Track Together for Video Motion Prediction.
Simplification with CoTracker3
While TAPIR and CoTracker were both effective, their reliance on synthetic data and complex architecture left room for improvement. CoTracker3 simplifies the architecture by removing unnecessary components—such as the global matching stage—and streamlining the tracking process. These changes make CoTracker3 not only faster but also more efficient in its use of data.
How CoTracker3 Works
Simplified Architecture
Unlike previous models that used complex modules for processing correlations between points, CoTracker3 employs a multi-layer perceptron (MLP) to handle 4D correlation features. This change reduces computational overhead while maintaining high performance.
Another key feature is CoTracker3’s iterative update mechanism, which refines point tracks over several iterations. Initially, the model makes a rough estimate of the points’ positions and then uses a transformer to refine these estimates by updating the tracks iteratively across frames. This allows the model to progressively improve its predictions.
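The sketch below illustrates this coarse-to-fine loop in PyTorch. It is a hypothetical, simplified rendering of the idea rather than Meta AI's implementation: `CorrelationMLP`, `sample_correlation`, and the linear `head` (which stands in for the refinement transformer) are all illustrative names, and the correlation sampling is stubbed out so the example runs end to end.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of CoTracker3-style iterative refinement; the module
# and function names are illustrative, not Meta AI's actual code.

class CorrelationMLP(nn.Module):
    """MLP over flattened 4D correlation features, replacing the heavier
    correlation-processing modules used by earlier trackers."""
    def __init__(self, corr_dim, hidden=256, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(corr_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, corr):                 # corr: (B, T, N, corr_dim)
        return self.net(corr)

def sample_correlation(feats, tracks, radius=3):
    """Stub for correlation sampling: the real model compares each point's
    feature with a local neighborhood around its current track position.
    Random features are returned here so the sketch runs end to end."""
    B, T, N, _ = tracks.shape
    return torch.randn(B, T, N, (2 * radius + 1) ** 2)

def refine_tracks(tracks, feats, corr_mlp, head, num_iters=4):
    """Start from a rough estimate and refine the tracks iteratively."""
    for _ in range(num_iters):
        corr_emb = corr_mlp(sample_correlation(feats, tracks))
        delta = head(corr_emb)               # per-point (dx, dy) update
        tracks = tracks + delta              # nudge tracks toward the target
    return tracks

B, T, N = 1, 8, 16                           # batch, frames, tracked points
tracks = torch.zeros(B, T, N, 2)             # rough initial (x, y) estimates
corr_mlp = CorrelationMLP(corr_dim=49)       # 49 = (2 * 3 + 1) ** 2
head = nn.Linear(128, 2)                     # stands in for the refinement transformer
print(refine_tracks(tracks, None, corr_mlp, head).shape)  # torch.Size([1, 8, 16, 2])
```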
Handling Occlusions with Joint Tracking
One of CoTracker3's standout features is its ability to track points that become occluded or temporarily hidden in the video. This is made possible through cross-track attention, which allows the model to track multiple points simultaneously. By using the information from visible points, CoTracker3 can infer the positions of occluded points, making it particularly effective in challenging real-world scenarios where objects move in and out of view.
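Conceptually, cross-track attention is standard attention applied across the set of tracked points rather than across time. Here is a minimal, hypothetical sketch using PyTorch's built-in `nn.MultiheadAttention` (not Meta AI's actual module):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: attention applied across the *track* dimension, so
# each point can borrow evidence from the other tracked points.
B, T, N, D = 1, 8, 16, 128
track_tokens = torch.randn(B * T, N, D)      # one token per point, per frame

attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
mixed, _ = attn(track_tokens, track_tokens, track_tokens)
# Each output token now depends on every other point in the same frame,
# which is how evidence from visible points can propagate to occluded ones.
print(mixed.shape)                           # torch.Size([8, 16, 128])
```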
Two Modes: Online and Offline
CoTracker3 can operate in two modes: online and offline. In the online mode, the model processes video frames in real time using a sliding window, making it suitable for applications that require immediate feedback, such as robotics. In offline mode, CoTracker3 processes the entire video sequence in one go, tracking points both forward and backward in time. This mode is particularly useful for long-term tracking or when occlusions last for several frames, as it allows the model to reconstruct the trajectory of points that disappear and reappear in the video.
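Assuming the torch.hub entry points documented in the facebookresearch/co-tracker GitHub repository at the time of writing (`cotracker3_offline` and `cotracker3_online`; check the README for current names and signatures), choosing between the two modes looks roughly like this:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
video = torch.randn(1, 48, 3, 384, 512, device=device)  # placeholder (B, T, C, H, W) clip

# Offline mode: one call over the whole clip, refined forward and backward.
offline = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline").to(device)
tracks, visibility = offline(video, grid_size=10)        # track a 10x10 grid of points

# Online mode: initialize once, then feed overlapping windows of frames,
# as a live stream would deliver them.
online = torch.hub.load("facebookresearch/co-tracker", "cotracker3_online").to(device)
online(video_chunk=video, is_first_step=True, grid_size=10)
for ind in range(0, video.shape[1] - online.step, online.step):
    tracks, visibility = online(video_chunk=video[:, ind : ind + online.step * 2])
```

The overlapping two-step windows give the online model enough temporal context to keep tracks consistent across window boundaries.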
Pseudo-Labeling: Key to CoTracker3’s Efficiency
The Challenge of Annotating Real Videos
One of the biggest hurdles in point tracking is the difficulty of annotating real-world videos. Unlike synthetic datasets, where points can be labeled automatically, real videos require manual annotation, which is time-consuming and prone to errors. As a result, many previous tracking models have relied on synthetic data for training. However, these models often perform poorly when applied to real-world scenarios due to the gap between synthetic and real-world environments.
CoTracker3’s Pseudo-Labeling Approach
CoTracker3 addresses this issue with a pseudo-labeling approach. Instead of relying on manually annotated real videos, CoTracker3 generates pseudo-labels using existing tracking models trained on synthetic data. These pseudo-labels are then used to fine-tune CoTracker3 on real-world videos. By leveraging real data in this way, CoTracker3 can learn from the complexities and variations of real-world scenes without the need for expensive manual annotation.
This approach significantly reduces the amount of data required for training. For example, CoTracker3 uses 1,000 times fewer real videos than models like BootsTAPIR, yet it achieves superior performance.
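In outline, this fine-tuning stage looks like a standard teacher-student setup. The sketch below is a loose, hypothetical rendering: the paper uses an ensemble of frozen teacher trackers and also supervises visibility, whereas this collapses everything to a single teacher and an L1 loss on track positions.

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(teacher, student, optimizer, real_video, queries):
    """One hypothetical fine-tuning step on an unannotated real video.

    teacher: a frozen tracker trained on synthetic data (the paper uses an
             ensemble of such teachers); student: the model being fine-tuned.
    queries: (B, N, 3) tensor of (frame, x, y) points to track.
    """
    with torch.no_grad():                         # teacher output = pseudo-label
        target_tracks, _ = teacher(real_video, queries=queries)
    pred_tracks, _ = student(real_video, queries=queries)
    loss = F.l1_loss(pred_tracks, target_tracks)  # paper also supervises visibility
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```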
Read the paper on arXiv, authored by Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht: CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos.
Performance and Benchmarking
Tracking Accuracy
CoTracker3 has been tested on several point tracking benchmarks, including TAP-Vid and Dynamic Replica, and consistently outperforms previous models like TAPIR and LocoTrack. One of its key strengths is its ability to accurately track occluded points, thanks to its cross-track attention and iterative update mechanism.
In particular, CoTracker3 excels at handling occlusions, which have traditionally been a challenge for point trackers. This makes it an ideal solution for real-world applications where objects frequently move in and out of view.
Efficiency and Speed
In addition to its accuracy, CoTracker3 is also faster and more efficient than many of its predecessors. Its simplified architecture allows it to run 27% faster than LocoTrack, one of the fastest point trackers to date. Despite its increased speed, CoTracker3 maintains high performance across a range of video analysis tasks, from short clips to long sequences with complex motion and occlusions.
Data Efficiency
CoTracker3’s most notable advantage is its data efficiency. By using pseudo-labels from a relatively small number of real videos, CoTracker3 can outperform models like BootsTAPIR, which require millions of videos for training. This makes CoTracker3 a more scalable solution for point tracking, especially in scenarios where access to large annotated datasets is limited.
Applications of CoTracker3
The ability of CoTracker3 to track points accurately, even in challenging conditions, opens up new possibilities in several fields.
3D Reconstruction
CoTracker3 can be used for high-precision 3D reconstruction of objects or scenes from video footage. Its ability to track points across frames with minimal drift makes it particularly useful for industries like architecture, animation, and virtual reality, where accurate 3D models are essential.
Robotics
In robotics, CoTracker3’s online mode allows robots to track objects in real time, even as they move through dynamic environments. This is critical for tasks such as object manipulation and autonomous navigation, where accurate and immediate feedback is necessary.
Video Editing and Special Effects
For video editors and visual effects artists, CoTracker3 offers a powerful tool for tasks such as motion tracking and video stabilization. Its ability to track points through occlusions ensures that effects like digital overlays or camera stabilization remain consistent throughout the video, even when the tracked object moves in and out of view.
CoTracker3’s Failure Cases
While CoTracker3 excels in tracking points across a variety of scenarios, it faces challenges when dealing with featureless surfaces. For instance, the model struggles to track points on surfaces like the sky or bodies of water, where there are few distinct visual features for the algorithm to lock onto.
Figure from the paper CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos.
In such cases, the lack of texture or identifiable points leads to unreliable or lost tracks. These limitations highlight scenarios where CoTracker3's performance may degrade, especially in environments with minimal visual detail.
How to Implement CoTracker3?
CoTracker3 is available on GitHub and can be explored through a Colab demo or on Hugging Face Spaces. Getting started is simple: upload your .mp4 video or select one of the provided example videos.
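For local use, a minimal end-to-end script looks like the sketch below. It assumes the `cotracker3_offline` torch.hub entry point from the facebookresearch/co-tracker repository and a clip named `input.mp4`; both should be checked against the current README.

```python
import torch
from torchvision.io import read_video

device = "cuda" if torch.cuda.is_available() else "cpu"

# read_video returns uint8 frames of shape (T, H, W, C); the predictor
# expects a float (B, T, C, H, W) tensor with values in [0, 255].
frames, _, _ = read_video("input.mp4", pts_unit="sec")
video = frames.permute(0, 3, 1, 2)[None].float().to(device)

model = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline").to(device)
tracks, visibility = model(video, grid_size=10)
print(tracks.shape, visibility.shape)  # per-frame (x, y) positions and visibility flags
```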
Conclusion
CoTracker3 marks a significant advancement in point tracking, offering a simplified architecture that delivers both improved performance and greater efficiency. By eliminating unnecessary components and introducing a pseudo-labeling approach for training on real videos, CoTracker3 sets a new standard for point tracking models.
Its ability to track points accurately through occlusions, combined with its speed and data efficiency, makes it a versatile tool for a wide range of applications, from 3D reconstruction to robotics and video editing. Meta AI’s CoTracker3 demonstrates that simpler models can achieve superior results, paving the way for more scalable and effective point tracking solutions in the future.
Written by Eric Landau