Data Versioning
Data versioning is the practice of tracking changes to datasets over time, allowing teams to manage, compare, and reproduce different versions of data used in machine learning workflows. Much like version control in software development (e.g., Git), data versioning helps prevent inconsistencies, supports collaboration, and improves the reproducibility of AI models.
In the context of AI data pipelines, especially in supervised learning, versioning is critical. Training datasets evolve: annotations are corrected, outliers removed, edge cases added, and formats standardized. Without version control, teams risk training models on mismatched data or losing track of the dataset that led to the best-performing model.
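At its core, a dataset version can be pinned by content rather than by filename. The sketch below is a minimal illustration, not any particular tool's API; the `data/train` path is a placeholder. It hashes every file in a dataset directory into a manifest, then hashes the manifest itself, so that editing any file produces a new version ID.

```python
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(data_dir: str) -> str:
    """Hash every file in the dataset and combine into a single version ID."""
    manifest = {}
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            manifest[str(path.relative_to(data_dir))] = hashlib.sha256(
                path.read_bytes()
            ).hexdigest()
    # Hash the manifest itself so any change to any file yields a new ID.
    blob = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

print(dataset_fingerprint("data/train"))  # record this ID with each training run
```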
Key benefits of data versioning:
- Reproducibility: Ensures that experiments can be repeated using the exact dataset version (see the logging sketch after this list).
- Traceability: Tracks how data was collected, cleaned, and annotated.
- Collaboration: Allows distributed teams to contribute without overwriting each other’s work.
- Auditability: Maintains a history of who made changes and why.
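To make the reproducibility and auditability points concrete, a training run can be logged together with the exact dataset version and code commit it used. This is a hedged sketch: the `runs.jsonl` file, the run ID, and the dataset-version string are placeholders, and the Git call assumes the script runs inside a Git repository.

```python
import json
import subprocess
from datetime import datetime, timezone

def record_run(run_id: str, dataset_version: str, path: str = "runs.jsonl") -> None:
    """Append an audit record tying a run to its exact data and code versions."""
    record = {
        "run_id": run_id,
        "dataset_version": dataset_version,  # e.g. the fingerprint computed above
        "code_version": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),  # assumes the script runs inside a Git repo
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

record_run("exp-042", dataset_version="9f2c...")  # placeholder version ID
```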
Tools commonly used for data versioning:
- DVC (Data Version Control) – integrates with Git for dataset tracking (a usage sketch follows this list)
- Pachyderm – versioned data pipelines for ML
- Weights & Biases – experiment tracking with dataset lineage
- LakeFS / Delta Lake – versioning for big data in data lakes
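As a concrete illustration of the first tool above, DVC's Python API can open a tracked file exactly as it existed at a given Git revision. This is a minimal sketch; the repo URL, file path, and tag name are placeholder values.

```python
import dvc.api  # pip install dvc

# Open an annotation file as it existed at the Git tag "v1.0".
# Repo URL, path, and tag are placeholders for illustration.
with dvc.api.open(
    "data/annotations.json",
    repo="https://github.com/example/dataset-repo",
    rev="v1.0",
) as f:
    annotations_v1 = f.read()
```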
In geospatial AI and image annotation pipelines, where datasets can be hundreds of GBs, versioning allows teams to:
- Compare model performance on v1 vs. v2 of a satellite image dataset (see the sketch after this list)
- Roll back annotations that introduced label noise
- Track dataset changes during active learning loops
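For example, diffing two tagged revisions of an annotation file makes relabeling between dataset versions visible. The following is a hedged sketch using DVC's read API; the file path and tag names are placeholders.

```python
import json
import dvc.api

def load_labels(rev: str) -> dict:
    """Read labels/tiles.json as it existed at the given Git revision."""
    return json.loads(dvc.api.read("labels/tiles.json", repo=".", rev=rev))

v1, v2 = load_labels("v1"), load_labels("v2")
changed = {tile for tile in v1.keys() & v2.keys() if v1[tile] != v2[tile]}
print(f"{len(changed)} tiles were relabeled between v1 and v2")
```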
Data versioning is a best practice for scalable, maintainable AI development—critical to ensuring the integrity and reliability of your models.