4 Questions to Ask When Evaluating Training Data Pipelines

Frederik Hvilshøj
November 11, 2022
5 min read

Building a scalable and secure data pipeline involves a lot of decision-making. For many machine learning and data science teams, however, the first step in the process is deciding where to store their datasets.

While the major cloud providers such as Google and AWS offer a lot of benefits, a computer vision project’s particular privacy and security requirements will determine the best storage solution for it. Whereas a medical artificial intelligence company operating in the United States might use a major cloud provider, a medical AI company working with EU patient data may be restricted to EU-based cloud providers for storing that data. Likewise, companies that work with highly sensitive data, such as defense contractors, will have more specific storage requirements, often needing to keep data on their own hard drives on the premises.

At the same time, all data pipelines need to be secure, with data encrypted to a high standard. Storing data on-premises comes with its own challenges: while the major cloud storage providers have best-in-class teams dedicated to security, a company with an on-premises system needs a top-notch, in-house IT team that stays up to date on security and system maintenance. Otherwise, the company’s in-house storage system could be vulnerable to cyber attacks.

It’s a tough decision, influenced by many factors such as cost and compliance. And because it comes first in building a secure and compliant data pipeline and its related workflows, deciding where to store your data has implications for many of the data-related decisions that follow, including which data products a company can use.

Here are four questions that data science and machine learning teams should ask when determining whether a data product fits the data pipeline for their particular use case.


Is the product agnostic about where data is stored?

For machine learning teams working with sensitive datasets, data storage remains top-of-mind throughout the entire model development process. As the teams put together a data pipeline to feed their algorithms and train their models, there are a lot of off-the-shelf data products that can make the process easier. However, teams need to know that a data product can work seamlessly with their datasets regardless of where the data is stored.

Encord’s customers often ask, “What do you do with our data? Where do you store it?”


They want to ensure that the data remains stored in the location of their choice while they use our product. That’s not a problem because our product is storage agnostic.

A storage-agnostic data product can integrate with any storage facility, enabling machine learning teams to have the same seamless experience as if the data were stored in the product’s own cloud.

With a storage-agnostic product, a company can use the same product with multiple data storage providers. It doesn’t matter if a company is storing data on niche or regional cloud providers, such as Germany’s Open Telekom, or on a global provider such as AWS. Similarly, it doesn’t matter if a computer vision company working with healthcare images stores some of those datasets in a PACS (picture archiving and communication system) and some at an on-prem facility. A storage-agnostic data product can integrate with all of these systems.

For most computer vision companies, building a multi-region, multi-cloud strategy is essential for long-term business growth. Working across different regions can give companies access to more clients and give their machine learning teams access to more, and more varied, data. When a model trains on more data, it learns to make better predictions. In a similar vein, a model intended for deployment in different regions will generalize better to those regions when trained on the appropriate regional datasets. Of course, gaining access to such datasets requires maintaining compliance with the data privacy laws and regulations of the governing jurisdiction, including those regarding data storage.

That’s why storage-agnostic products are so important.

Storage-agnostic products make implementing multi-region, multi-cloud strategies possible. With these products, a company that works in multiple locations with multiple different storage buckets can build integrations for each storage location and maintain granular access to the data. By enabling a company to use the same product across multiple teams and localities, these products also save companies time, effort, and money by eliminating the need to search for new tools or train staff members on multiple tools.
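To make the idea concrete, here is a minimal sketch of what a storage-agnostic design can look like: the pipeline codes against one small interface, and each storage provider sits behind a thin adapter. The class and method names are hypothetical, invented for illustration; they are not Encord’s internals.

```python
from typing import Protocol


class StorageBackend(Protocol):
    def fetch(self, key: str) -> bytes:
        """Return the raw bytes stored under `key`."""
        ...


class S3Backend:
    """Adapter for an AWS S3 bucket (requires boto3 and AWS credentials)."""

    def __init__(self, bucket: str):
        import boto3

        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def fetch(self, key: str) -> bytes:
        response = self._s3.get_object(Bucket=self._bucket, Key=key)
        return response["Body"].read()


class LocalBackend:
    """Adapter for on-premises storage mounted on the local filesystem."""

    def __init__(self, root: str):
        self._root = root

    def fetch(self, key: str) -> bytes:
        with open(f"{self._root}/{key}", "rb") as f:
            return f.read()


def load_item(backend: StorageBackend, key: str) -> bytes:
    # The rest of the pipeline never needs to know where the data lives.
    return backend.fetch(key)
```

Supporting a new provider then means writing one more small adapter rather than reworking the rest of the pipeline.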

What if the product doesn’t already have integrations with one of my data storage providers?

Once you’ve found a storage-agnostic data product, the most important question a company can ask next is: “How quickly can the tool be integrated with new kinds of clouds or storage facilities?”

One benefit of storing data with a major cloud provider is that it’s easy for products to integrate with those platforms because so many companies use them. However, if your company has opted for a regional or on-prem solution, an integration may not already exist.

Building end-to-end integrations can be difficult and time consuming. In general, the more unusual the storage facility, the more complex the deployment and the more engineering hours required to build the integration. And the more hours needed, the greater the cost.

The aim of any data product should be to integrate as seamlessly as possible with every place a company might store data for its computer vision applications. If an integration doesn’t already exist, then it should be easy for the data product’s team to build one and add it to the repertoire of integrations that the product offers.

We designed Encord to be storage agnostic, and we also architected the system so that we can build new integrations quickly and at low cost. For instance, to stay compliant with data privacy laws, one of our customers needed to store their data with the German cloud provider Open Telekom. Our developers built the integration for that provider within a couple of days, so Encord fit seamlessly into the customer’s existing data pipeline while their machine learning team took full advantage of our platform and its features.

Having a storage-agnostic product that can be adapted quickly to integrate with multiple storage providers allows companies to build an expandable data pipeline. As their security and privacy needs change, they can continue to collect and store data at multiple locations, running the spectrum from Big Cloud to on-premises, without having to worry about whether the data product will work with new datasets stored in new locations.

How does the product securely access my datasets for my computer vision model?

Nothing is more important than data pipeline security. The data needs to be encrypted to a high standard and inaccessible except to authorized users.

When companies pick a data tool, they need to know how they can grant the tool access to their data in a secure manner. A good solution to this problem is a signed URL. With a signed URL, a company can keep public access to the data shut off while allowing specific, approved external users to access and temporarily render the data without actually storing it.
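As an illustration, here is a minimal sketch of the signed-URL pattern using AWS S3 and boto3; the bucket and object names are made up, and other providers offer equivalent mechanisms.

```python
import boto3

s3 = boto3.client("s3")

# Generate a URL that grants read access to one object for 15 minutes.
# The bucket itself stays private; only holders of this URL can read
# the object, and only until the URL expires.
signed_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "hospital-a-scans", "Key": "scans/patient-001.dcm"},
    ExpiresIn=900,  # seconds
)
print(signed_url)
```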

If a customer uses their own private cloud storage, our product never actually has to store the data, which means our customers remain compliant with data privacy laws and their data remains secure.

Another benefit of granular data access control is that it grants a data product access only to the specific data items it needs. For instance, if a computer vision company works across multiple hospitals but currently only needs to label images from patients at one hospital, it can grant Encord’s product access to images from that one hospital alone, rather than granting blanket access to every hospital it works with. Granting permissions to datasets at this level of specificity helps further ensure data compliance and protection.
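For example, on AWS S3 this kind of scoping can be expressed as a bucket policy. The sketch below is hypothetical, with invented account IDs, role names, and bucket names; it grants a labeling tool read access only to objects under one hospital’s prefix.

```python
import json

import boto3

# Allow one external role to read only objects under "hospital-a/".
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "LabelingToolReadHospitalAOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/labeling-tool"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::medical-images/hospital-a/*",
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket="medical-images", Policy=json.dumps(policy))
```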

Does the product allow the machine learning team to work with the datasets in a granular manner?

Whenever possible, companies should buy off-the-shelf data tools rather than build them internally.

However, an off-the-shelf data product must work as well with a company’s data as if the company had built the product internally. That means a flexible API that lets teams working on ML models interact with the data just as they would if the tool had been built in-house for a custom purpose and the data were stored internally. Users need to be able to perform all the basic CRUD (create, read, update, delete) operations, manipulating the data while it continues to flow seamlessly through the pipeline. A flexible API that allows you to work with your data pipeline in this granular manner is an essential component of any data product.
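As a rough illustration, the sketch below shows the kind of CRUD surface a flexible data-product API might expose over HTTP. The endpoint, payloads, and fields are invented for the example and are not Encord’s actual API.

```python
import requests

BASE = "https://api.example-data-product.com/v1"  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <api-key>"}

# Create: register a new dataset that points at your own storage bucket.
created = requests.post(
    f"{BASE}/datasets",
    headers=HEADERS,
    json={"title": "hospital-a-scans", "storage": "s3://medical-images/hospital-a/"},
).json()
dataset_id = created["id"]

# Read: fetch the dataset's metadata.
dataset = requests.get(f"{BASE}/datasets/{dataset_id}", headers=HEADERS).json()

# Update: rename the dataset.
requests.patch(
    f"{BASE}/datasets/{dataset_id}",
    headers=HEADERS,
    json={"title": "hospital-a-ct-scans"},
)

# Delete: remove the entry (the underlying data stays in your own bucket).
requests.delete(f"{BASE}/datasets/{dataset_id}", headers=HEADERS)
```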

In addition to having a flexible API, Encord also has a Python SDK. By wrapping the Python SDK around the API, we’ve made certain operations easier for Python developers. By providing an open source SDK, Encord enables developers to customize the tool until it fits perfectly with their machine learning and data pipeline needs.
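For instance, here is a short sketch of that SDK pattern, based on the Encord Python SDK’s documented authentication flow; exact method names can differ between SDK versions, and the key path and project hash below are placeholders.

```python
from encord import EncordUserClient

# Read the SSH private key registered with your Encord account.
with open("/path/to/encord-ssh-key", "r") as key_file:
    private_key = key_file.read()

# Authenticate and fetch a project, just as an in-house tool would.
user_client = EncordUserClient.create_with_ssh_private_key(private_key)
project = user_client.get_project("<project-hash>")  # placeholder hash
print(project.title)
```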

With the right data products in place, data will flow fluidly through your data pipeline, and with a strong pipeline you can train your deep learning models more efficiently, evaluate data quality, automate labeling, and set up active learning pipelines, all of which decreases the time needed to build and deploy your models, getting you to production AI faster.

Get in touch to see Encord in action and try it out for yourself!

Where to next?

"I want to start annotating" - Get a free trial of Encord here.

"I want to get started right away" - You can find Encord Active on Github here or try the quickstart Python command from our documentation.

"Can you show me an example first?" - Check out this Colab Notebook.

If you want to support the project, you can help us out by giving us a star on GitHub.

Want to stay updated?

  • Follow us on Twitter and LinkedIn for more content on computer vision, training data, and active learning.
  • Join the Slack community to chat and connect.

Written by Frederik Hvilshøj
Frederik is the Machine Learning Lead at Encord. He has an extensive computer vision and deep learning background, completed a Ph.D. in Explainable Deep Learning and Generative Models at Aarhus University, and has published research on Efficient Counterfactuals from Invertible Neural Networks.

