Data Drift
Encord Computer Vision Glossary
Data drift in machine learning refers to the situation where the statistical properties of the data a model encounters in production diverge over time from those of the data it was trained on, which often leads to a decrease in model performance. When the model is deployed in the real world, it may encounter new data that is significantly different from the data used to train it. This could be due to changes in the underlying distribution of the data, changes in the data collection process, or changes in the population being sampled.
If the machine learning model is not designed to handle data drift, its performance may deteriorate over time. For example, if a model is trained on data from one region, but is deployed in a different region where the characteristics of the data are different, the model may perform poorly. Similarly, if a model is trained on data from a particular time period, but is used to make predictions on new data that is significantly different, its performance may suffer.
To address data drift, machine learning models need to be designed with methods to detect and adapt to changes in the data distribution. This may involve continuously monitoring the performance of the model and retraining it on new data as needed, or developing algorithms that can adapt to changes in the data distribution in real time.
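As a rough illustration of the "monitor and retrain" approach, the sketch below tracks accuracy over a sliding window of recent labeled predictions and flags when it falls below a baseline. The class name `AccuracyMonitor`, the window size, and the tolerance are hypothetical choices for this example, not part of any particular library, and it assumes ground-truth labels eventually become available for production predictions.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over the most recent labeled predictions (hypothetical helper)."""

    def __init__(self, baseline_accuracy, window_size=100, tolerance=0.05):
        self.baseline_accuracy = baseline_accuracy  # accuracy measured at training time
        self.tolerance = tolerance                  # allowed drop before flagging
        self.outcomes = deque(maxlen=window_size)   # 1 = correct, 0 = wrong

    def record(self, prediction, label):
        """Store whether a single prediction matched its ground-truth label."""
        self.outcomes.append(1 if prediction == label else 0)

    def rolling_accuracy(self):
        """Return accuracy over the current window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Flag retraining once a full window falls below baseline minus tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return self.rolling_accuracy() < self.baseline_accuracy - self.tolerance
```

In practice, `record` would be called whenever a label arrives for a past prediction, and a `True` result from `needs_retraining` would trigger a retraining job or an alert. The threshold values here are placeholders and would need tuning for a real workload.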
What are the types of data drift?
There are several types of data drift, including:
- Target drift: This occurs when the distribution or the definition of the model's output (target) variable changes over time. For example, if a machine learning model is trained to predict customer churn, but the definition of churn changes, the model may become less accurate.
- Concept drift: Concept drift refers to a situation where the relationship between the input variables and the target changes over time, so the patterns the model learned no longer hold. This can occur when new categories or trends emerge, or when the environment in which the data was collected changes.
- Covariate shift: Covariate shift occurs when the distribution of the input variables (also known as features) changes over time, even though the relationship between the features and the target stays the same. This can be caused by changes in the data collection process or changes in the population being sampled.
- Label drift: Label drift occurs when the ground truth labels for a dataset change over time. This can happen when the criteria for labeling data change, or when errors in labeling are introduced.
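Covariate shift in a single numeric feature can be checked without labels by comparing its distribution in the training data against recent production data. The sketch below uses the Population Stability Index (PSI), one common choice among several (a two-sample Kolmogorov-Smirnov test is another). The function name, the equal-width binning, and the `eps` floor are assumptions made for this example.

```python
import math


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """Compare two samples of one numeric feature; larger values mean more shift."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 if all values match.
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        # Floor each fraction at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


# Usage: a feature whose production values have shifted upward.
training_values = [i / 100 for i in range(1000)]
production_values = [v + 5 for v in training_values]
psi = population_stability_index(training_values, production_values)
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but that cutoff is a convention rather than a guarantee and should be validated against the model's actual performance.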