Dominic Tarn • December 21, 2022
Computer Vision Data Operations Best Practice Guide
Outsiders to the world of computer vision technology can be forgiven for thinking that data scientists perform a type of magic with machine learning algorithms. And yet outsiders, and many on the inside too, overlook the role of data operations teams in this magic.
In most cases, the reality of making computer vision models work, of getting them to produce the results project leaders want, comes down to a lot of hard work, trial and error, and the unsung heroes of the profession: data operations teams.
Data operations play an integral role in creating computer vision artificial intelligence models that extract insights from image or video-based datasets. And the work they do is distinct from that of machine learning operations (MLOps).
Computer vision applications generate enormous value across dozens of sectors, from automotive insurance to medical and healthcare, satellite imagery, manufacturing, and retail. Making all of this work are tireless data operations teams.
Without high-quality data, ML models won’t generate results, and it’s data operations and annotation teams that ensure the right data is being fed into CV models and the process for doing so runs smoothly and efficiently.
In this article, we review the importance of data operations in computer vision projects, the role data ops teams play, and best practice guidelines for effective data operations teams.
What are Data Operations for Computer Vision Projects?
Data operations for computer vision systems cover a wide range of roles and responsibilities. We can sum up the work of data operations teams in several ways:
- Sourcing the video data and image data. Depending on the project and sector, these could be free open-source datasets, or proprietary data that is purchased specifically for the project.
- Data cleaning tasks. Although this might be done by a sub-team or an outsourced provider for the data ops team, it’s data ops who are ultimately responsible for ensuring the datasets are “clean” for computer vision models. Clean visual data must be available before annotation and labeling work can start. Data cleaning involves removing corrupted or duplicate images, and fixing numerous problems with video datasets, such as corrupted files, duplicate frames, ghost frames, variable frame rates, and other sometimes unknown and unexpected problems.
- Implementing and overseeing the annotation and labeling of large-scale image or video datasets. For this task, most organizations either have an in-house team or outsource data labeling. Labeling often involves large amounts of data, so it is time-consuming and labor-intensive. As a result, making it as cost-effective as possible is essential.
- Once the basic frameworks of a data pipeline are established (sourcing the data, data cleaning, annotation, and labeling), a data operations team manages this pipeline. Ensuring the right quality control (QC), quality assurance (QA), and compliance processes are in place is vital to maintain the highest levels of data quality and optimize the experimentation and training stages of building a CV model.
- During the training stage, maintaining a high-quality, clean, and efficient data pipeline is essential. Data ops teams also need to ensure the right tools are being used (e.g. open-source annotation software, or proprietary platforms, ideally with API access), and that storage solutions are scalable to handle the volume of data the project requires.
- Data operations teams also check models for bias, monitor which models are performing in line with or above expectations, apply data approximation or augmentation where needed, and help prepare a model to go from the training stage into production.
Annotator and data management in Encord
How Do I Know I Need a Data Operations Team?
In most cases, any computer vision project with a commercial outcome or application needs a data operations team.
Any project that’s going to handle large volumes of data, and will involve an extensive amount of cleaning, annotation, and testing, would benefit from a data ops team handling everything ML engineers and data scientists can’t manage. Remember, data scientists and ML engineers are specialists.
Project managers don’t want expensive, highly-trained specialists invoicing their time (or requesting overtime) because the project lacks the resources to take care of everything that should be done before they get involved. The high-performance model that data science teams are training and putting into production is only as effective as the data it’s fed. That data will be of poor quality without a team to manage, clean, annotate, and perfect it.
Managing the pipeline and processes to ensure new training data can be sourced and fed into the model is essential to the smooth running of a computer vision project, and for that, you need a data operations team.
How Can Data Operations Help Model Development?
Data operations play a mission-critical role in model development because they take labor-intensive, manual, and semi-automatic tasks off the team responsible for developing computer vision algorithms.
Otherwise, data admin and operations would fall on the shoulders of the machine learning team. In turn, that would reduce the efficiency and bandwidth of that team because they’d be too busy with data admin, cleaning, annotations, and operational tasks.
What Does a Good Data Operations Process/Team Look Like?
A good data operations process and team keeps data flowing like a river: a smooth, unending flow of high-quality data that keeps a CV model working efficiently and helps the ML team train that model to achieve the desired outcomes.
How Do I Know If My Data Operations Team is Performing Well?
A team that’s performing well is organized, operationally efficient, and focused on producing high-quality and accurate image or video-based datasets. Integral to that are the labels and annotations applied to the videos or images.
To achieve the best results, data ops teams need to ensure those doing the annotation work have the right tools. Software that comes with a range of tools for applying labels and annotations, a collaborative dashboard to oversee the work, and an efficient data audit, security, and compliance framework is essential too.
Now, let’s look at the 6 best practices for an effective data operations team.
6 Best Practices for an Effective Data Operations Team
Data ops and annotation tools: Buy, Don't Build
When it comes to data operations and annotation tools, building is several orders of magnitude more expensive than buying, especially when there are so many good options on the market. Some of those options are open-source; however, many don’t do everything that commercial data ops teams require.
Commercial tools are far more cost-effective than building your own in-house software, and they typically meet the needs of commercial data ops teams better than open-source alternatives.
It’s also worth noting that several are tailored to the needs of specific use cases, such as collaborative annotation tooling for clinical data ops teams and radiologists.
Treat data as an integral part of your company and project's intellectual property (IP)
Data ops, machine learning, and computer vision teams need to treat datasets as an integral part of your company and project's intellectual property (IP), rather than as an afterthought or simply part of the process.
The datasets you use, and the annotations and labels applied to the images and videos, make them unique and integral to the success of your project. Take every step to protect this IP: safeguard it from data theft, and ensure data integrity and compliance are maintained throughout.
Have a clear data audit trail, so you know who has worked on every image or video, with timestamps and metadata. An audit trail also makes regulatory compliance easier to achieve.
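As an illustration of what one audit-trail entry might capture, here is a minimal sketch. The schema (field names like `asset_id` and `annotator`) is hypothetical and chosen for this example; in practice, the annotation platform usually emits an equivalent record for you.

```python
import json
from datetime import datetime, timezone


def audit_record(asset_id: str, annotator: str, action: str, **metadata) -> str:
    """Build one append-only audit-log line: who did what, to which asset, when.

    Hypothetical schema for illustration. Each call produces a single
    JSON line suitable for appending to a log file or event stream.
    """
    entry = {
        "asset_id": asset_id,
        "annotator": annotator,
        "action": action,
        # UTC timestamps avoid ambiguity when annotators work across time zones.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "metadata": metadata,
    }
    return json.dumps(entry, sort_keys=True)
```

Append-only JSON lines like this make it straightforward to answer "who touched this image, and when?" during a compliance review.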
Implement data workflow processes before a project starts
It’s important to implement effective data workflow processes before a project starts, not during. Otherwise, you risk having a data pipeline that falls apart as soon as data starts flowing through it.
Have clear processes in place. Leverage the right tools. Ensure you’ve got the team(s), assets, budget, senior leadership support, and resources ready to handle the project starting.
Think clearly about the problems you are trying to solve
Again, before a project starts, make sure you’re clear on the questions or problems it’s trying to solve. Ask yourself the following questions:
- What are your project objectives?
- What are the metrics you’re trying to achieve?
- What level of accuracy does the model need so those objectives can be achieved?
- How much data will this project need?
- How much time do you have to achieve these results?
- What outcomes do senior leaders and project managers expect?
- How will a data ops team help you achieve those outcomes, and solve the problems the project has been set?
Take a data-centric approach to data ops
Data ops teams need to take a data-centric approach to help a computer vision project achieve the model performance needed. The success of the whole project relies on the data. Therefore, you need to approach solving this problem from the perspective of how to get the most out of the data provided, and how to get more data if that’s what each iteration of the model needs.
With that approach in mind, your data ops team will be more successful and better equipped to give the ML or CV teams what they need to solve the challenges they’ve been set.
Pick the most powerful labeling and annotation tools
Picking the most powerful labeling and annotation tools is integral to the success of data ops teams, and therefore, the whole project. In some cases, the tool you use depends on the use case. However, in others, the best tools are use case agnostic, and accelerate the success of projects with extensive and powerful automation features.
Image annotation in Encord
Encord and Encord Active are one such solution. Encord improves the efficiency of labeling data and managing a team of annotators. Encord Active is an open-source active learning framework for computer vision: a test suite for your labels, data, and models.
Having the right tools is a big asset for data operations teams. It’s the best way to ensure everything runs smoothly and the right results are achieved in the timescale that project leaders require.
Experience Encord in action. Dramatically reduce manual video annotation tasks, generating massive savings and efficiencies. Try it for Free Today.
Dominic has over 10 years' experience writing content for high growth AI and SaaS startups. His writing covers a wide range of topics, including machine learning, artificial intelligence and computer vision. Dominic is the founder & CEO of Inbound Sales Content (ISC), an SEO growth-focused B2B content marketing agency. He has a History BA from UCL, has lived in three countries in the last decade, and is now happily settled with a family and cat in the North East of England. https://www.linkedin.com/in/dominicntarn-inboundsalescontent/ https://www.inboundsalescontent.com/