Data Versioning
Data versioning is the practice of tracking changes to datasets over time, allowing teams to manage, compare, and reproduce different versions of data used in machine learning workflows. Much like version control in software development (e.g., Git), data versioning helps prevent inconsistencies, supports collaboration, and improves the reproducibility of AI models.
In the context of AI data pipelines, especially in supervised learning, versioning is critical. Training datasets evolve: annotations are corrected, outliers removed, edge cases added, and formats standardized. Without version control, teams risk training models on mismatched data or losing track of the dataset that led to the best-performing model.
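At its core, a dataset version can be pinned by content rather than by filename. The sketch below is a minimal illustration, not any particular tool's API; the `data/train` path is a placeholder. It hashes every file in a dataset directory into a manifest, then hashes the manifest itself, so that editing any file produces a new version ID.

```python
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(data_dir: str) -> str:
    """Hash every file in the dataset and combine into a single version ID."""
    manifest = {}
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            manifest[str(path.relative_to(data_dir))] = hashlib.sha256(
                path.read_bytes()
            ).hexdigest()
    # Hash the manifest itself so any change to any file yields a new ID.
    blob = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

print(dataset_fingerprint("data/train"))  # record this ID with each training run
```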
Key benefits of data versioning:
- Reproducibility: Ensures that experiments can be repeated using the exact dataset version (see the logging sketch after this list).
- Traceability: Tracks how data was collected, cleaned, and annotated.
- Collaboration: Allows distributed teams to contribute without overwriting each other’s work.
- Auditability: Maintains a history of who made changes and why.
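To make the reproducibility and auditability points concrete, a training run can be logged together with the exact dataset version and code commit it used. This is a hedged sketch: the `runs.jsonl` file, the run ID, and the dataset-version string are placeholders, and the Git call assumes the script runs inside a Git repository.

```python
import json
import subprocess
from datetime import datetime, timezone

def record_run(run_id: str, dataset_version: str, path: str = "runs.jsonl") -> None:
    """Append an audit record tying a run to its exact data and code versions."""
    record = {
        "run_id": run_id,
        "dataset_version": dataset_version,  # e.g. the fingerprint computed above
        "code_version": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),  # assumes the script runs inside a Git repo
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

record_run("exp-042", dataset_version="9f2c...")  # placeholder version ID
```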
Tools commonly used for data versioning:
- DVC (Data Version Control) – integrates with Git for dataset tracking (a usage sketch follows this list)
- Pachyderm – versioned data pipelines for ML
- Weights & Biases – experiment tracking with dataset lineage
- LakeFS / Delta Lake – versioning for big data in data lakes
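As a concrete illustration of the first tool above, DVC's Python API can open a tracked file exactly as it existed at a given Git revision. This is a minimal sketch; the repo URL, file path, and tag name are placeholder values.

```python
import dvc.api  # pip install dvc

# Open an annotation file as it existed at the Git tag "v1.0".
# Repo URL, path, and tag are placeholders for illustration.
with dvc.api.open(
    "data/annotations.json",
    repo="https://github.com/example/dataset-repo",
    rev="v1.0",
) as f:
    annotations_v1 = f.read()
```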
In geospatial AI and image annotation pipelines, where datasets can be hundreds of GBs, versioning allows teams to:
- Compare model performance on v1 vs. v2 of a satellite image dataset (see the sketch after this list)
- Roll back annotations that introduced label noise
- Track dataset changes during active learning loops
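For example, diffing two tagged revisions of an annotation file makes relabeling between dataset versions visible. The following is a hedged sketch using DVC's read API; the file path and tag names are placeholders.

```python
import json
import dvc.api

def load_labels(rev: str) -> dict:
    """Read labels/tiles.json as it existed at the given Git revision."""
    return json.loads(dvc.api.read("labels/tiles.json", repo=".", rev=rev))

v1, v2 = load_labels("v1"), load_labels("v2")
changed = {tile for tile in v1.keys() & v2.keys() if v1[tile] != v2[tile]}
print(f"{len(changed)} tiles were relabeled between v1 and v2")
```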
Data versioning is a best practice for scalable, maintainable AI development—critical to ensuring the integrity and reliability of your models.