Emily Langhorne November 22, 2022

Helping a Digital Identity Verification Start-Up Build High-Quality Facial Recognition Datasets

Vida, a full-service verified digital identity platform, serves customers throughout Southeast Asia. While facial recognition is a mature technology, most open source facial recognition datasets aren’t reflective of the region’s populations. Models trained on these datasets perform poorly, so Vida needs to build and manage its own new datasets. Using Encord’s platform, Vida can oversee a large labeling team and annotate tens of thousands of images quickly.

Customer: Meet Vida

Vida uses optical character recognition and computer vision technology to provide a full-service verified digital identity platform. Digital verification empowers people to participate in the economy. For instance, financial institutions require identity verification to reduce fraud and ensure that assets arrive in the correct accounts, and ride hailing services require drivers to verify their identity. 

Operating mainly in Indonesia, Vida’s services enable banks, fintechs, online trading platforms, ride hailing companies, and other companies to verify the identity of users online. 

Indonesia is the biggest archipelago in the world, and traveling long distances can be challenging. With digital verification, customers no longer have to spend time waiting or traveling to get their identities verified. Vida users take a photo of themselves and a photo of their identification document. Vida’s technology then confirms the authenticity of the document and compares it to the photo and to an authoritative source to verify the user’s identity.

Throughout Indonesia, a digital signature must be accompanied by identity verification. Being able to sign documents and open bank accounts online is incredibly beneficial, especially for micro-entrepreneurs and SMEs in rural areas. Vida’s platform reduces barriers for these populations when accessing financial products like loans and savings accounts.

Problem: Large Datasets, Large Labeling Teams 

Vida trains its computer vision models to predict the liveness of an image. The models learn to determine whether the image contains a physically present person or whether the image contains a fake representation of a person, such as a pre-taken photo or a 2D mask. 

Although facial recognition is a mature technology, most of the open source facial verification datasets contain faces from the Western Hemisphere or East Asia. When models train on these datasets, they don’t perform well on Southeast Asian demographics. Indonesia is also a majority Muslim country, so many women wear a hijab, an attribute rarely encountered by models that train on these open source datasets.

To improve model performance, Vida began collecting and annotating new datasets that reflect Southeast Asian populations. The company needed a platform that could help them label and manage the tens of thousands of new images collected.

Vida’s team tried using some open source tools, but none of them allowed for managing a labeling team. Furthermore, facial verification data contains sensitive Personally Identifiable Information (PII), and Vida struggled to find a tool that gave them strong access control and the ability to keep customer data on their own servers.

Solution: Flexible Platform, Iterative Process

With Encord’s platform, Vida could easily set up a system for managing their 20-person labeling team. They have key managers who oversee the other annotators as well as reviewers. When a new annotator comes on board, Vida uses Encord’s tools to evaluate the new annotations and ensure that all labels are high quality.

Reviewing annotators and label quality in Encord

Vida’s work requires managing a lot of images, about 60,000 in a project. At first, Encord’s interface showed all 60,000 at once, which slowed it down. After Vida gave feedback, Encord’s team quickly changed the UI so that Vida could scale up the number of images in each project.

“I’ve been very impressed with how Encord iterates on the SDK, listens to feedback, and constantly improves the product,” says Jeffrey Siaw, VP of Data Science at Vida.

With Encord, Vida can keep the data in their own Amazon S3 buckets, alleviating data privacy concerns about access and storage. Rather than require that data be stored on its own servers, Encord’s platform facilitates the use of a signed URL allowing it to access and retrieve the data from a customer’s preferred storage facility without storing data locally. 
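The general idea behind a signed URL can be sketched in a few lines of Python. This is only an illustration of the concept under stated assumptions: it is not Amazon S3's actual signing scheme and not Encord's implementation. The secret key, the example URL, and the `sign_url` and `verify_url` helpers are all hypothetical.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical secret held only by the storage owner (an assumption for this sketch).
SECRET_KEY = b"hypothetical-secret"


def sign_url(base_url, expires_at):
    """Append an expiry timestamp and an HMAC signature to a URL."""
    message = f"{base_url}|{expires_at}".encode()
    signature = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires_at, "signature": signature})
    return f"{base_url}?{query}"


def verify_url(signed_url, now=None):
    """Return True only if the signature matches and the URL has not expired."""
    now = time.time() if now is None else now
    parsed = urlparse(signed_url)
    params = parse_qs(parsed.query)
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    try:
        expires_at = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    message = f"{base_url}|{expires_at}".encode()
    expected = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(signature, expected) and now < expires_at


# Hypothetical image location; access is granted until timestamp 1000.
url = sign_url("https://storage.example.com/images/face-001.jpg", 1000)
```

In this sketch, whoever holds the URL can fetch that one object until it expires, without ever receiving the storage owner's credentials and without the image being copied elsewhere.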

Results: Increased Labeling Speed, Decreased False Acceptance Rate 

In the first month of using Encord, Vida’s team labeled 70,000 images at a rate much faster than they expected. 

Vida’s previous models, trained on the old datasets, had a false acceptance rate of six percent. A false acceptance occurs when a model predicts that an image shows a physically present person when it does not.
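As a rough illustration of the metric, a false acceptance rate is commonly computed as the share of spoof images that a model accepts as live. The sketch below assumes that definition; the article does not say exactly how Vida computes it. The function name and the tiny evaluation set are made up for the example.

```python
def false_acceptance_rate(labels, predictions):
    """Share of spoof images (label False) that the model accepted as live (prediction True)."""
    spoof_predictions = [pred for label, pred in zip(labels, predictions) if not label]
    if not spoof_predictions:
        raise ValueError("need at least one spoof image to compute a false acceptance rate")
    return sum(spoof_predictions) / len(spoof_predictions)


# Made-up evaluation set: True = physically present person, False = spoof.
labels = [True, True, False, False, False, False]
predictions = [True, True, False, True, False, False]

far = false_acceptance_rate(labels, predictions)
print(f"{far:.0%}")  # 25%: one of the four spoof images was accepted
```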

False acceptances can have serious implications for Vida’s customers. For banks, a false verification could result in the opening of a fraudulent account. For ride hailing companies, it can increase the risk of robbery if a driver with a criminal record is onboarded under a false identity.

With Encord, Vida improved the quality of their datasets, and the new models had a false acceptance rate of only one percent.

“Using Encord’s platform, we were able to train our new models on much better datasets with much higher quality labels, reducing our false acceptance rate to only one percent,” says Jeffrey Siaw, VP of Data Science.

As Vida continues to grow, the data science team will begin taking a more granular approach in their data management and labeling. They’ll try labeling faces differently and label attributes such as religious headdresses to better track how their models perform across more specific demographic features. 

Using Encord, they can label these datasets at speed while managing multiple projects with different types of labels and ontologies, all in one platform.