Software To Help You Turn Your Data Into AI
Forget fragmented workflows, annotation tools, and Notebooks for building AI applications. Encord Data Engine accelerates every step of taking your model into production.
All this and more! Read on to make your day.
We are thrilled to announce that our highly-anticipated feature, Workflows, has officially transitioned from beta to general availability! This milestone could not have been achieved without the invaluable feedback from our dedicated users throughout the beta phase.
Workflows are designed to give you full control of the annotation process, ensuring a blend of high performance, usability, and extensibility that scales with the rapid pace and change of the AI industry. Some of the major improvements are:
Workflow scalability means more tasks and more labels in your annotation projects. We're also juicing up the editor to be more performant -- which means more labels per task, faster. Backend improvements mean your data will save faster and more seamlessly, and we're raising the limits on labels per task so you can benefit from those improvements as well -- contact us to work with more data per task! Arriving soon are further performance improvements to enhance the user experience when dealing with many objects and complex label timelines. This all adds up to a more natural pre-processing and editing experience, even on long, label-intense video annotation workloads. Exciting!
We understand that searching our documentation isn’t always your first thought when you need to learn about the platform. To address this, we've integrated AI support directly into our platform, ensuring you have quick access to assistance precisely when you need it.
Whether you're onboarding for the first time, looking for a quick refresher on using the Label Editor, or need help understanding terminology, our AI assistant is here to help. It is regularly trained on all of our platform and SDK documentation, enabling it to provide intelligent and up-to-date responses to any questions you may have about our application!
We know that curating the best images, frames from a video, or slices from a scan is a daunting, difficult, and time-intensive task. Ensuring that your dataset is free of outliers, duplicates, and irrelevant images, and then selecting the best samples, is crucial for building robust and performant models.
Encord is your trusted partner along your journey and based on your feedback we have designed Active's new Explorer to simplify this process, incorporating best practices into intuitive user journeys:
After your data is annotated and your model is trained, Encord Active simplifies the shift to evaluation. Simply import your model predictions and access a detailed analysis of your model’s performance, with options to break it down by class, data collections, and splits such as train, test, and validation.
You can also use the Explorer to investigate your prediction types following a series of best-practice workflows:
Labeling training data is, like the model training process it supports, an iterative process. You’ve asked for ways to snapshot your progress — whether it’s to save a checkpoint before re-labeling, check in on progress as you work through a large project, or name different subsets for purposes such as training, testing, and validation. We’ve listened, and we’re happy to introduce label versioning for Workflow projects. Navigate to the labels tab, select your tasks, and press ‘Save new version’ — you can refer to these snapshots by name and time.
Initially, we’re rolling out support for exporting labels from saved checkpoints, but look out for coming improvements such as restoring to different projects. As always, let us know how it helps and what more we can do to enhance your AI initiatives with labels and label set management tools!
Many of you have shown interest in working closely with our product development team and helping us create the best features — as such, we’re very happy to introduce Encord Labs! Encord Labs gives you access to features at the bleeding edge while giving you control over how those features appear in the platform. This means you get all the power of rapidly evolving technology with none of the risks. Getting in on the ground floor means you can shape how features evolve faster, helping us ensure we build with tight customer feedback in mind. Encord Labs will be rolling out several select features in Q4 — contact us if you’re interested or would like to join our collaborative early tester program!
Thanks for reading, feel free to email product@encord.com with any questions or suggestions, and let us know if you're attending RSNA 2023!
Join the Encord Developers community to discuss the latest in computer vision, machine learning, and data-centric AI
Related Blogs
On October 30, 2023, the White House announced an Executive Order issued by President Joe Biden aimed at fostering a balanced approach toward the development and deployment of Artificial Intelligence (AI) to ensure it's safe, secure, and trustworthy. It acknowledges the potential of AI technologies in solving urgent societal challenges and enhancing prosperity, productivity, innovation, and security. However, the Executive Order also highlights the potential adverse effects of irresponsible use of artificial intelligence, such as fraud, discrimination, bias, misinformation, and threats to national security, and underscores the need for guardrails. The Order calls for a collective effort from the federal government (including the Department of Homeland Security, the Department of Health and Human Services, the Department of Energy, the Department of Commerce, and more), the private sector, academia, and civil society to mitigate these harms while maximizing the benefits of AI.

Here are the three main guiding principles behind this Executive Order:

- Safety and security: The Order emphasizes the need for robust, reliable, repeatable, and standardized evaluations of AI systems. It mandates addressing security risks, including those related to biotechnology, cybersecurity, and critical infrastructure. The document also highlights the importance of testing, post-deployment monitoring, and effective labeling to ensure that AI systems are ethically developed, securely operated, and compliant with federal laws.
- Responsible innovation: The Order encourages promoting responsible innovation, competition, and collaboration to maintain U.S. leadership in AI. It calls for investments in AI-related education, training, development, and research, and for tackling intellectual property issues. It also emphasizes creating a fair, open, and competitive AI ecosystem and marketplace, supporting small developers, and addressing potential risks from dominant firms' control over critical assets like semiconductors, computing power, cloud storage, and data.
- Supporting American workers: As AI creates new jobs and industries, the Order stresses adapting job training and education to support a diverse workforce. It advises against deploying AI in ways that undermine rights, worsen job quality, or cause harmful labor-force disruptions. The Order encourages building the next steps in AI development on the views of workers, labor unions, educators, and employers to support responsible AI uses that improve workers' lives and augment human work.

In subsequent sections of this article, we will examine the actions behind the AI directives in this Executive Order. In the meantime, let’s explore how we got here.

How Did We Get Here? The History of AI Regulation in the United States

President Biden's Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence is the result of years of developing insights and responses to emerging technologies in the field of AI. To show how we arrived at this important turning point, this section walks you through the path of AI regulation in the United States.

Early Engagement: Regulating Open- and Closed-Source LLMs

Navigating the spectrum between open and closed LLM systems is critical for effective AI policy. Striking the right balance will promote innovation and competition while managing the potential risks of AI. By 2024, the National Institute of Standards and Technology (NIST), under the U.S. Department of Commerce, will determine whether to allow the release of open model weights under public licenses. This, of course, is bound to stir up discussions about treating open model weights as free speech, along with accusations that big tech companies are lobbying to protect their moat.

As these LLM systems began permeating various sectors, the need for a regulatory framework became apparent. Policymakers grappling with the rapid advancements in AI models and tools started the conversation about how to balance promoting U.S. global leadership in AI against the risks to individuals, businesses, and national security.

Legislative Efforts

The early engagement translated into legislative action, with U.S. House and Senate committees holding numerous hearings on AI. The hearings included big names like Elon Musk, CEO of SpaceX, Tesla, and X (formerly known as Twitter); Mark Zuckerberg, CEO of Meta; Microsoft co-founder Bill Gates; and Sam Altman, CEO of OpenAI, the company behind the AI chatbot ChatGPT.

Biden Administration’s Early Steps

In October 2022, the Biden administration issued a non-binding AI Bill of Rights, marking an early step towards delineating the government’s stance on governing automated systems, with a focus on civil rights protection. Later, on September 12, 2023, several tech companies signed voluntary agreements to follow the rules President Biden set out for AI. This was the first step toward encouraging responsible AI use through partnerships with the private sector.

SAFE Innovation: A Values-Based Framework and New Legislative Process

Despite strong bipartisan interest, passing comprehensive AI legislation remained a challenge, paving the way for the SAFE Innovation Framework proposed by Senate Majority Leader Chuck Schumer.

The Executive Order

The culmination of these efforts and the evolving understanding of AI's impact led to the issuance of the Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence. This Executive Order embodies a more structured approach to AI governance, reflecting the administration’s commitment to promoting responsible AI development and deployment while addressing the potential risks of AI.

What Are the Executive Order Directives?

We have summarized the Executive Order directives below so you can easily skim through and find the directives and corresponding actions relevant to you.

Directive 1: New Standards for AI Safety and Security
Actions:
- Require developers to share safety test results with the U.S. government.
- Develop standards and tools to ensure AI systems are safe and secure.
- Protect against AI-enabled risks to national security and public health.
- Establish strong standards for biological synthesis screening.

Directive 2: Protecting Americans’ Privacy
Actions:
- Prioritize federal support for privacy-preserving techniques in AI.
- Strengthen privacy-preserving research and technologies.
- Evaluate how agencies collect and use commercially available data.
- Develop guidelines for federal agencies to evaluate privacy-preserving techniques.

Directive 3: Advancing Equity and Civil Rights
Actions:
- Provide guidance to keep AI programs from worsening discrimination.
- Address algorithmic discrimination through training and coordination.
- Ensure fairness in the criminal justice system's use of AI.

Directive 4: Standing Up for Consumers, Patients, and Students
Actions:
- Advance the responsible use of AI in healthcare.
- Shape AI’s potential in education.
- Protect consumers and patients while ensuring AI benefits them.

Directive 5: Promoting Innovation and Competition
Actions:
- Catalyze AI research and provide grants in vital areas.
- Promote a fair and competitive AI ecosystem.
- Streamline visa criteria for skilled immigrants.

Directive 6: Supporting Workers
Actions:
- Develop principles and best practices for worker protection.
- Produce a report on AI’s labor-market impacts.

Directive 7: Advancing American Leadership Abroad
Actions:
- Expand collaborations on AI at bilateral, multilateral, and multistakeholder levels.
- Accelerate the development of AI standards with international partners.
- Promote responsible AI development abroad.

Directive 8: Ensuring Responsible and Effective Government Use of AI
Actions:
- Issue guidance for agencies’ AI use.
- Streamline AI product and service acquisition.
- Accelerate the hiring of AI professionals in government.

Now that we've discussed the key directives of the US Executive Order on AI, let's compare and contrast them with the European Union's approach to AI regulation, known as the EU Artificial Intelligence Act (AI Act).

US Executive Order on Safe, Secure, and Trustworthy AI vs. the European Union AI Act

In the table below, we present a comparative overview of the key aspects and focus areas of the US Executive Order on Safe, Secure, and Trustworthy AI and the EU Artificial Intelligence Act (AI Act).

{{gray_callout_start}} Read more about the takes on “Proposed AI Regulation: EU AI Act, UK's Pro-Innovation, US AI Bill of Rights” from Encord’s co-founder and president. {{gray_callout_end}}

As the comparison shows, while both regulations aim to foster a safe and responsible AI ecosystem, they approach AI governance from slightly different vantage points, reflecting the distinct priorities and regulatory philosophies of the US and the EU.

{{light_callout_start}} What does the European AI Act mean for you, an AI developer? Learn more from this article by Ulrik Stig Hansen, Encord’s co-founder and president. {{light_callout_end}}

Conclusion

Increased involvement from policymakers, legislative efforts, and joint initiatives between the public and private sectors have all contributed to the current AI regulatory landscape. The issuance of the Executive Order represents a significant milestone in the ongoing journey towards establishing a robust framework for AI governance in the U.S., aimed at harnessing the benefits of AI while mitigating its potential perils. But will regulations stifle the efforts of open-source AI? Or will they encourage an ecosystem of open innovation while regulating the risks at the application layer?

In this article, you learned about the evolution of AI regulation in the U.S., focusing on key legislative efforts, the Biden administration's early steps towards AI governance, and the collaborative initiatives that marked the journey towards the recent Executive Order. We covered the regulatory developments that led to the Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence, including actions taken by lawmakers, voluntary commitments from tech companies, and the release of values-based frameworks such as the SAFE Innovation Framework. Finally, we compared the directives with the proposed European Union AI Act, where you saw clearly different priorities and regulatory philosophies between the United States and the European Union.
{{gray_callout_start}} Get access to our new AI Act Learning Pack, which includes all the key resources you need to ensure forward compatibility. {{gray_callout_end}} {{try_encord}}
November 1
For quite some time, the idea that artificial intelligence (AI) could understand visual and textual cues as effectively as humans seemed far-fetched and unimaginable. However, with the emergence of multimodal AI, we are seeing a revolution where AI can simultaneously comprehend various modalities, such as text, images, speech, facial expressions, and physiological gestures, to make sense of the world around us. The ability to process multiple modalities has opened up various avenues for AI applications.

One such exciting application of multimodal AI is Vision-Language Models (VLMs). They can process and understand the modalities of language (text) and vision (image) simultaneously to perform advanced vision-language tasks, such as Visual Question Answering (VQA), image captioning, and text-to-image search.

In this article, you will learn about:
- VLM architectures
- VLM evaluation strategies
- Mainstream datasets used for developing vision-language models
- Key challenges, primary applications, and future trends of VLMs

Let’s start by understanding what vision-language models are.

What Are Vision Language Models?

A vision-language model is a fusion of vision and natural language models. It ingests images and their respective textual descriptions as inputs and learns to associate the knowledge from the two modalities. The vision part of the model captures spatial features from the images, while the language model encodes information from the text. The data from both modalities, including detected objects, the spatial layout of the image, and text embeddings, are mapped to each other. For example, if the image contains a bird, the model will learn to associate it with a similar keyword in the text descriptions. This way, the model learns to understand images and transforms the knowledge into natural language (text) and vice versa.

Training VLMs

Techniques for building VLMs include pre-training foundation models and zero-shot learning. You can use transfer learning techniques such as knowledge distillation to fine-tune the models for more specific downstream tasks. These are simpler techniques that require smaller datasets and less training time while maintaining decent results. Modern frameworks, on the other hand, use various techniques to get better results, such as:
- Contrastive learning
- Masked language-image modeling
- Encoder-decoder modules with transformers, and more

These architectures can learn complex relations between the various modalities and provide state-of-the-art results. Let’s discuss these in detail.

{{RLHF_CTA}}

Vision Language Models: Architectures and Popular Models

Let’s look at some VLM architectures and learning techniques that mainstream models such as CLIP, Flamingo, and VisualBERT, among others, use.

Contrastive Learning

Contrastive learning is a technique that learns data points by understanding their differences. The method computes a similarity score between data instances and aims to minimize the contrastive loss. It’s most useful in semi-supervised learning, where only a few labeled samples guide the optimization process to label unseen data points.

Contrastive Learning

For example, one way of understanding what a cat looks like is to place it beside a similar cat image and a dog image. Contrastive learning models learn to distinguish between a cat and a dog by identifying several features, such as facial structure, body size, and the presence of fur. The models can determine which image is closer to the original, called the “anchor,” and predict its class.
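To make the contrastive idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive objective in PyTorch. The batch size, embedding width, and temperature are illustrative assumptions rather than values from any particular paper.

```python
import torch
import torch.nn.functional as F

# Toy batch: 8 image embeddings and 8 matching text embeddings. In a real
# model these come from the vision and text encoder branches; here they are
# random placeholders, L2-normalized as is common for contrastive training.
image_emb = F.normalize(torch.randn(8, 512), dim=-1)
text_emb = F.normalize(torch.randn(8, 512), dim=-1)

temperature = 0.07  # illustrative value
logits = image_emb @ text_emb.T / temperature  # pairwise similarity matrix

# The i-th image should match the i-th text, so the "class" of row i is i.
targets = torch.arange(logits.size(0))

# Symmetric loss: images-to-texts plus texts-to-images.
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```

Minimizing a loss of this shape pulls matching image-text pairs together in the shared embedding space and pushes mismatched pairs apart, which is the property CLIP exploits for zero-shot prediction.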
CLIP is an example of a model that uses contrastive learning by computing the similarity between text and image embeddings using textual and visual encoders. It follows a three-step process to enable zero-shot predictions:
1. Trains a text and image encoder during pre-training to learn the image-text pairs.
2. Converts the training dataset classes into captions.
3. Estimates the best caption for the given input image for zero-shot prediction.

CLIP Architecture

ALIGN is another example that uses image and textual encoders to minimize the distance between similar embeddings using a contrastive loss function.

{{light_callout_start}} Want to know how to evaluate CLIP? Head onto our blog and read Evaluating Foundation Models (CLIP) using Encord Active. {{light_callout_end}}

PrefixLM

PrefixLM is an NLP learning technique mostly used for model pre-training. It inputs part of the text (a prefix) and learns to predict the next word in the sequence. In vision-language models, PrefixLM enables the model to predict the next sequence of words based on an image and its respective prefix text. It leverages a Vision Transformer (ViT) that divides an image into a one-dimensional sequence of patches, where each patch represents a local image region. Then, the model applies convolution or linear projection over the processed patches to generate contextualized visual embeddings. For the text modality, the model converts the text prefix relative to the patch into a token embedding. The encoder-decoder blocks of the transformer receive both the visual embeddings and the token embeddings, and that is where the model learns the relationships between them.

SimVLM is a popular architecture utilizing the PrefixLM learning methodology. It has a simpler transformer architecture than its predecessors, surpassing their results on various benchmarks. It uses a transformer encoder to learn image-prefix pairs and a transformer decoder to generate an output sequence. The model also demonstrates good generalization and zero-shot learning capabilities.

SimVLM Architecture

Similarly, VirTex uses a convolutional neural network to extract image features and a textual head with transformers to manage text prefixes. You can train the model end-to-end to predict the correct image captions by feeding image-text pairs to the textual head.

VirTex Architecture

Frozen PrefixLM

While PrefixLM techniques require you to train visual and textual encoders from scratch, Frozen PrefixLM allows you to use pre-trained networks and only update the parameters of the image encoders. For instance, the architecture below shows how Frozen works using a pre-trained language model and visual encoder. The text encoder can belong to any large language model (LLM), and the visual encoder can also be a pre-trained visual foundation model. You can fine-tune the image encoder so its image representations align with the textual embeddings, allowing the model to make better predictions.

Frozen Architecture

A more state-of-the-art (SOTA) approach is Flamingo’s architecture, which uses a CLIP-like vision encoder and an LLM called Chinchilla. Keeping the LLM fixed, you can train the visual encoder on images interleaved with text. The visual encoders process the image through a Perceiver Resampler. The technique results in faster inference and makes Flamingo ideal for few-shot learning.

Flamingo Architecture

Multimodal Fusing with Cross-Attention

This method utilizes the encoders of a pre-trained LLM for visual representation learning by adding cross-attention layers.
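As a rough illustration of what this kind of fusion looks like (a generic sketch, not any specific model's implementation), the snippet below lets text-token states attend over image features through a single PyTorch cross-attention layer; all shapes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Illustrative shapes: a batch of 2 samples, 16 text tokens, 49 image patches.
text_states = torch.randn(2, 16, 512)   # hidden states from a language model
image_feats = torch.randn(2, 49, 512)   # features from a vision encoder

# Cross-attention: text tokens act as queries, image features as keys/values.
cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
fused, attn_weights = cross_attn(query=text_states, key=image_feats, value=image_feats)

# `fused` now carries visually grounded text representations that a decoder
# can condition on when generating captions or answers.
print(fused.shape)  # torch.Size([2, 16, 512])
```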
VisualGPT is a primary example that allows quick adaptation of an LLM’s pre-trained encoder weights for visual tasks.

VisualGPT Architecture

Practitioners extract relevant objects from an image input and feed them to a visual encoder. They feed the resulting visual representations to a decoder and initialize its weights according to the pre-trained LLM. The decoder module balances the visual and textual information through a self-resurrecting activation unit (SRAU). The SRAU method avoids the issue of vanishing gradients, a common problem in deep learning where model weights fail to update due to small gradients. As such, VisualGPT outperforms several baseline models, such as the plain transformer, the Attention-on-Attention (AoA) transformer, and the X-Transformer.

Masked Language Modeling (MLM) & Image-Text Matching (ITM)

MLM works in language models like BERT by masking or hiding a portion of a textual sequence and training the model to predict the missing text. ITM involves predicting whether sentence Y follows sentence X. You can adapt the MLM and ITM techniques for visual tasks. For instance, the diagram below illustrates the architecture of VisualBERT, trained on the COCO dataset.

VisualBERT Architecture

It augments the MLM procedure by introducing image sequences alongside a masked textual description. The objective is to predict the missing text based on the visual embeddings. Similarly, ITM predicts whether or not a caption matches the image.

No Training

You can directly use large-scale pre-trained vision-language models without any fine-tuning. For example, MAGIC and ASIF are training-free frameworks that aim to predict text descriptions that align closely with the input image.

MAGIC uses a specialized score based on CLIP-generated image embeddings to guide the output of language models. Using this score, an LLM generates textual embeddings that align closely with the image semantics, enabling the model to perform multimodal tasks in a zero-shot manner.

ASIF uses the idea that similar images have similar captions. The model computes the similarities between the query and candidate images in the training dataset. Next, it compares the query image embeddings with the text embeddings of the corresponding candidate images. Then, it predicts a description whose embeddings have the highest similarity to the embeddings of the query image, resulting in zero-shot performance comparable to models like CLIP and LiT.

ASIF Prediction Strategy

Knowledge Distillation

This technique involves transferring knowledge from a large, well-trained teacher model to a lighter student model with fewer parameters. This methodology allows researchers to train VLMs from larger pre-trained models. For instance, ViLD is a popular VLM developed using the knowledge distillation methodology. The model uses a pre-trained open-vocabulary image classification model as the teacher to train a two-stage detector (student). The model matches textual embeddings from a textual encoder with image embeddings.

ViLD Architecture

You can use knowledge distillation to transfer knowledge from the image encoder to the backbone model to generate regional embeddings automatically. Only the backbone model generates regional embeddings during inference, and the model matches them with unseen textual embeddings. The objective is to draw correct bounding boxes around objects in an image based on textual descriptions.
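To ground the idea, here is a minimal sketch of a generic response-based distillation objective: a temperature-scaled KL divergence between teacher and student outputs. This is the classic pattern rather than ViLD's exact loss, and the shapes and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soften both distributions and penalize the student for diverging
    from the teacher (classic knowledge-distillation objective)."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradients keep a comparable magnitude across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Toy example: a batch of 4 region proposals scored over 10 classes.
teacher_logits = torch.randn(4, 10)  # e.g. from a large pre-trained classifier
student_logits = torch.randn(4, 10)  # e.g. from a lightweight detector head
print(distillation_loss(student_logits, teacher_logits).item())
```

In embedding-based setups like the one described above, a distance between student and teacher embeddings often takes the place of the KL term, but the teacher-supervises-student pattern is the same.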
Evaluating Vision Language Models

VLM validation involves assessing the quality of the relationships between the image and text data. For example, for an image captioning model, this would mean comparing the generated captions to the ground-truth descriptions. You can use various automated n-gram-based evaluation strategies to compare the predicted labels in terms of accuracy, semantics, and information precision. A few of the key VLM evaluation metrics are listed below, followed by a short worked example.

- BLEU: The Bilingual Evaluation Understudy (BLEU) metric was originally proposed to evaluate machine translation tasks. It computes the precision of the target text compared to a reference (ground truth) by considering how many words in the candidate sentence appear in the reference.
- ROUGE: Recall-Oriented Understudy for Gisting Evaluation (ROUGE) computes recall by considering how many words in the reference sentence appear in the candidate.
- METEOR: Metric for Evaluation of Translation with Explicit Ordering (METEOR) computes the harmonic mean of precision and recall, giving more weight to recall and multiplying it by a penalty term. The metric is an improvement over others that work with either precision or recall alone, as it combines information from both to give a better evaluation.
- CIDEr: Consensus-based Image Description Evaluation (CIDEr) compares a target sentence to a set of human sentences by computing the average similarity between reference and target sentences using TF-IDF scores.
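As a quick illustration of how one of these n-gram metrics behaves in practice, the sketch below scores a made-up caption against two made-up references with NLTK's BLEU implementation; a smoothing function is used because short captions often have zero higher-order n-gram overlap.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical generated caption and two reference (ground-truth) captions.
candidate = "a dog is running on the grass".split()
references = [
    "a dog runs across the grass".split(),
    "a brown dog is running outside on grass".split(),
]

smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```

Swapping in ROUGE or METEOR follows the same pattern with their respective implementations.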
Now that you have learned the evaluation metrics pertinent to Vision-Language Models (VLMs), it's essential to know how to curate datasets for these models. The right dataset provides fertile ground for training and validating VLMs and is pivotal in determining the models' performance across diverse tasks.

Datasets for Vision Language Models

Collecting training data for VLMs is more challenging than for traditional AI models since it involves the collection and quality assurance of multiple data modalities. Below is a list of several datasets combining image and text data for multimodal training.

- LAION-5B: Practitioners use the LAION-5B dataset for building large pre-trained VLMs. The dataset contains more than five billion image-text pairs generated from CLIP, with descriptions in English and foreign languages, thereby catering to a multilingual domain.
- PMD: The Public Model Dataset (PMD) originally appeared in the FLAVA paper and contains 70 billion image-text pairs. The data is a collection from other large-scale datasets, such as COCO, Conceptual Captions, RedCaps, etc. This dataset is a reservoir of multimodal data that fosters robust model training.
- VQA: Experts use the VQA dataset to fine-tune pre-trained VLMs for downstream VQA and visual reasoning tasks. The dataset contains over 200,000 images, with five questions per image, ten ground-truth answers, and three incorrect answers per question.
- ImageNet: ImageNet contains over 14 million images with annotations categorized according to the WordNet hierarchy. It’s helpful for building models for simple downstream tasks, such as image classification and object recognition.

Despite the availability of high-quality multimodal datasets, VLMs can face significant challenges during the model development process. Let’s discuss them below.

{{Training_data_CTA}}

Limitations of Vision Language Models

Although VLMs are powerful in understanding visual and textual modalities to process information, they face three primary challenges:
- Model complexity
- Dataset bias
- Evaluation difficulties

Model Complexity

Language and vision models are quite complex on their own, and a combination of the two only worsens the problem. The complexity of these models raises additional challenges, including acquiring powerful computing resources for training, collecting large datasets, and deploying on weak hardware such as IoT devices.

Dataset Bias

Dataset biases occur when VLMs memorize deep patterns within the training and test sets without solving anything. For instance, training a VLM on images curated from the internet can cause the model to memorize specific patterns and not learn the conceptual differences between various images.

Evaluation Strategies

The evaluation strategies discussed above only compare a candidate sentence with reference sentences. The approach assumes that the reference sentences are the only ground truths. However, there can be several ground-truth descriptions for a particular image. Although consensus-based metrics like CIDEr account for the issue, using them becomes challenging when consensus is low for particular images. Another challenge arises when a generic description applies to several images.

Spurious Correlation

As the illustration shows, a VLM can annotate or retrieve several relevant images that match a generic caption. However, in reality, the model is nothing more than a bag of words. All it’s doing is considering words such as ‘city,’ ‘bus,’ and ‘lights’ to describe the image, instead of actually understanding the caption's sequential order and true contextual meaning.

Furthermore, VLMs used for VQA can generate highly confident answers to nonsensical questions. For instance, asking a VLM, “What color is the car?” for an image that contains a white horse will generate the answer “white” instead of pointing out that there isn’t a car in the picture.

Lastly, VLMs lack compositional generalization, meaning their performance decreases when processing novel concepts. For example, a VLM can fail to recognize a yellow horse as a category since it’s rare to associate the color yellow with horses.

Despite many development and deployment challenges, researchers and practitioners have made significant progress in adopting VLMs for solving real problems. Let’s discuss them briefly below.

Applications of Vision Language Models

While most VLMs discussed earlier are helpful for captioning images, their utility extends to a variety of other domains that leverage their capability to bridge visual and linguistic modalities. Here are some additional applications:

- Image Retrieval: Models such as FLAVA allow users to navigate through image repositories by helping them find relevant photos based on linguistic queries. A relevant example is an e-commerce site: visitors can describe what they’re looking for in a search bar, and a VLM will show the suitable options on the screen. This application is also popular on smartphones, where users can type in keywords (landscapes, buildings, etc.) to retrieve associated images from the gallery.
- Generative AI: Image generation through textual prompts is a growing domain where models like DALL-E allow users to create art or photos based on their descriptions. The application is practical in businesses where designers and inventors want to visualize different product ideas. It also helps create content for websites and blogs and aids in storytelling.
- Segmentation: VLMs like SegGPT help with segmentation tasks, such as instance, panoptic, and semantic segmentation. SegGPT segments an image by understanding user prompts and exploits a distinct coloring scheme to segment objects in context.
For instance, users can ask to segment a rainbow from several images, and SegGPT will annotate all the rainbows efficiently.

{{light_callout_start}} Read our detailed article on SegGPT: Segmenting everything in context [Explained] to learn more about how the model works. {{light_callout_end}}

Future Research

The following are a few crucial future research directions in the VLM domain:

Better Datasets

The research community is working on building better training and test datasets to help VLMs with compositional understanding. CLEVR is one example of this effort.

CLEVR Dataset

As the illustration shows, it contains images of novel shapes and colors, along with corresponding questions that allow experts to test a VLM’s visual reasoning capacity.

Better Evaluation Methods

Evaluation challenges warrant in-depth research into better evaluation methods for building more robust VLMs. One alternative is to test VLMs for individual skills through the ARO benchmark. Attribute identification, relational reasoning, and word-order sensitivity (ARO) are three skills that VLMs must master.

ARO Dataset

The illustration above explains what ARO entails in different contexts. Using such a dataset, experts can analyze what VLMs learn and how to improve the outcomes.

Robotics

Researchers are also using VLMs to build purpose-specific robots. Such robots can help navigate environments, improve warehouse operations in manufacturing by monitoring items, and enhance human-machine interaction by allowing robots to understand human gestures, such as facial expressions, body language, voice tones, etc.

Medical VQA

VLMs’ ability to annotate images and recognize complex objects can help healthcare professionals with medical diagnoses. For example, they can ask VLMs critical questions about X-rays or MRI scans to determine potential problems early.

Vision-Language Models: Key Takeaways

Visual language modeling is an evolving field that holds great promise for the AI industry. Below are a few critical points regarding VLMs:
- Vision-language models are multimodal architectures that simultaneously comprehend image and text data modalities.
- They use CV and NLP models to correlate information (embeddings) from the two modalities.
- Several VLM architectures exist that aim to relate visual semantics to textual representations.
- Although users can evaluate VLMs using automated scores, better evaluation strategies are crucial to building more reliable models.
- VLMs have many industrial use cases, such as robotics, medical diagnosis, chatbots, etc.

{{try_encord}}
November 3
Computer vision engineers, data scientists, and machine learning engineers face a pervasive issue: the prevalence of low-quality images within datasets. You have likely encountered this problem through incorrect labels, varied image resolutions, noise, and other distortions. Poor data quality can lead to models learning incorrect features, misclassifications, and unreliable or incorrect outputs. In a domain where accuracy and reliability are paramount, this issue can significantly impede the progress and success of projects, resulting in wasted resources and extended project timelines.

Take a look at the following image collage of Chihuahuas and muffins, for example:

Chihuahua or muffin? My search for the best computer vision API

How fast could you tell which images are Chihuahuas vs. muffins? Fast? Slow? Were you correct in 100% of the images? I passed the collage to GPT-4V because, why not? 😂 And as you can see, even the best-in-class foundation model misclassified some muffins as Chihuahuas! (I pointed out a few.)

So, how do you make your models perform better? The sauce lies in a systematic approach to exploring, evaluating, and fixing the quality of images. Enter Encord Active! It provides a platform to identify and tag problematic images and use features to improve the dataset's quality.

This article will show you how to use Encord Active to explore images, visualize potential issues, and take the next steps to rectify low-quality images. In particular, you will:
- Use a dog-food dataset from the Hugging Face Datasets library.
- Delve into the steps of creating an Encord Active project.
- Define and run quality metrics on the dataset.
- Visualize the quality metrics.
- Identify strategies to fix the issues you find.

Ready? Let’s delve right in! 🚀

Using Encord Active to Explore the Quality of Your Images

The Encord Active toolkit helps you find and fix wrong labels through data exploration, model-assisted quality metrics, and one-click labeling integration. It takes a data-centric approach to improving model performance. With Encord Active, you can:
- Slice your visual data across metrics functions to identify data slices with low performance.
- Flag poor-performing slices and send them for review.
- Export your new dataset and labels.
- Visually explore your data through interactive embeddings, precision/recall curves, and other advanced visualizations.

Check out the project on GitHub, and hey, if you like it, leave a 🌟🫡.

Demo: Explore the quality of 'dog' and 'food' images for ML models

In this article, you will use Encord Active to explore the quality of the `sasha/dog-food` images. You’ll access the dataset through the Hugging Face Datasets library. You can use this dataset to build a binary classifier that categorizes images into the "dog" and "food" classes. The 'dog' class has images of canines that resemble fried chicken and some that resemble muffins, and the 'food' class has images of, you guessed it, fried chicken and muffins.

The complete code is hosted on Colab. Open the Colab notebook side by side with this blog post.

{{light_callout_start}} Interested in more computer vision, visual foundation models, active learning, and data quality notebooks?
Check out the Encord Notebook repository. {{light_callout_end}}

Use Hugging Face Datasets to Download and Generate the Dataset

Whatever machine learning, deep learning, or AI task you are working on, the Hugging Face Datasets library provides easy access to, sharing of, and processing for datasets, particularly those catering to the audio, computer vision, and natural language processing (NLP) domains. The 🤗 Datasets library backs the datasets with an on-disk cache that is memory-mapped for quick lookups.

Explore the Hugging Face Hub for the datasets directory

You can browse and explore over 20,000 datasets housed in the library on the Hugging Face Hub. The Hub is a centralized platform for discovering and choosing datasets pertinent to your projects. In the search bar at the top, enter keywords related to the dataset you're interested in, e.g., "sentiment analysis," "image classification," etc. You should be able to:
- Filter datasets by domain, license, language, and so on.
- Find information such as the size, download count, and download link on the dataset card.
- Engage with the community by contributing to discussions, providing feedback, or suggesting improvements to the dataset.

Load the ‘sasha/dog-food’ dataset

Loading the `sasha/dog-food` dataset is pretty straightforward: install the 🤗 Datasets library and download the dataset. To install Hugging Face Datasets, run the following command:

```
pip install datasets
```

Use the `load_dataset` function to load the 'sasha/dog-food' dataset from Hugging Face:

```python
from datasets import load_dataset

dataset_dict = load_dataset('sasha/dog-food')
```

`load_dataset` returns a dictionary object (`DatasetDict`). You can iterate through the train and test dataset split keys in the `DatasetDict` object. The keys map to a `Dataset` object containing the images for that particular split.

You will explore the entire dataset rather than the separate splits. This should provide a comprehensive understanding of the data distribution, characteristics, and potential issues. To do that, merge the different splits into a single dataset using the `concatenate_datasets` function:

```python
from datasets import concatenate_datasets

dataset = concatenate_datasets([d for d in dataset_dict.values()])
```

Perfect! Now you have an entire dataset to explore with Encord Active in the subsequent sections. If you have not done so already, create a dataset directory to store the downloaded images:

```python
from pathlib import Path
import shutil

# Create a new directory "huggingface_dataset" in the current working dir
huggingface_dataset_path = Path.cwd() / "huggingface_dataset"

# Delete the dir if it already exists and recreate it
if huggingface_dataset_path.exists():
    shutil.rmtree(huggingface_dataset_path)
huggingface_dataset_path.mkdir()
```

Use a loop to iterate through the images from the ‘sasha/dog-food’ dataset and save them to the directory you created:

```python
from tqdm import tqdm

for counter, item in tqdm(enumerate(dataset)):
    image = item['image']
    image.save(f'./huggingface_dataset/{counter}.{image.format}')
```

If your code throws errors, run the cells in the Colab notebook in the correct order.

Super! You have prepared the groundwork for exploring your dataset with Encord Active.
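Before wiring the folder into Encord Active, it can help to sanity-check the export. The optional sketch below assumes the `huggingface_dataset` folder created in the previous step; it simply counts the saved files and opens the first image to confirm it decodes correctly:

```python
from pathlib import Path
from PIL import Image

export_dir = Path.cwd() / "huggingface_dataset"
files = sorted(export_dir.glob("*"))
print(f"Saved {len(files)} files")

# Open the first image to verify it loads and inspect its size and mode.
with Image.open(files[0]) as img:
    print(files[0].name, img.size, img.mode)
```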
Create an Encord Active Project

You must specify the directory containing your datasets when using Encord Active for exploration. You will initialize a local project with the image files; there are different ways to import and work with projects in Encord. Encord Active provides functions and utilities to load all your images, compute embeddings, and, based on that, evaluate the embeddings using pre-defined metrics. The metrics will help you search and find images with errors or quality issues.

Before initializing the Encord Active project, define a function, `collect_all_images`, that takes a root folder path (here, the `huggingface_dataset_path` directory) as input and returns a list of `Path` objects representing the image files within it:

```python
def collect_all_images(root_folder: Path) -> list[Path]:
    image_extensions = {".jpg", ".jpeg", ".png", ".bmp"}
    image_paths = []
    for file_path in root_folder.glob("**/*"):
        if file_path.suffix.lower() in image_extensions:
            image_paths.append(file_path)
    return image_paths
```

Remember to access and run the complete code in this cell.

Initialize the Encord Active project

Next, initialize a local project using Encord Active's `init_local_project` function. This function provides the same functionality as running the `init` command in the CLI. If you prefer using the CLI, please refer to the “Quick import data & labels” guide.

```python
# `image_files` and `projects_dir` are defined in the full notebook, e.g.
# image_files = collect_all_images(huggingface_dataset_path)
try:
    project_path: Path = init_local_project(
        files=image_files,
        target=projects_dir,
        project_name="sample_ea_project",
        symlinks=False,
    )
except ProjectExistsError as e:
    project_path = Path("./sample_ea_project")
    print(e)  # A project already exists with that name at the given path.
```

Compute image embeddings and analyze them with metrics

Analyzing raw image data directly in computer vision can often be impractical due to the high dimensionality of images. A common practice is to compute embeddings for the images to compress the dimensions, then run metrics on these embeddings to glean insights and evaluate the images. Ideally, you compute the embeddings using pre-trained (convolutional neural network) models. The pre-trained models capture the essential features of the images while reducing the data dimensionality. Once you obtain the embeddings, run similarity, clustering, and classification metrics to analyze different aspects of the dataset.

Computing embeddings and running metrics on them can take quite a bit of manual effort. Enter Encord Active! Encord Active provides utility functions to run predefined subsets of metrics, or you can import your own sets of metrics. It computes the image embeddings and runs the metrics by the type of embeddings. Encord Active has three different types of embeddings:
- Image embeddings: general embeddings for each image or frame in the dataset.
- Classification embeddings: associated with specific frame-level classifications.
- Object embeddings: associated with specific objects, like polygons or bounding boxes.

Use the `run_metrics_by_embedding_type` function to execute quality metrics on the images, specifying the embedding type as `IMAGE`:

```python
run_metrics_by_embedding_type(
    EmbeddingType.IMAGE,
    data_dir=project_path,
    use_cache_only=True
)
```

The `use_cache_only=True` parameter tells Encord Active to use only cached data when executing the metrics rather than recomputing values or fetching fresh data. This can be a useful feature for saving computational resources and time, especially when working with large datasets or expensive computations.

Create a `Project` object using the `project_path`; you will use this for further interactions with the project:

```python
ea_project = Project(project_path)
```

Exploring the Quality of Images from the Hugging Face Datasets Library

Now that you have set up your project, it’s time to explore the images! There are typically two ways you could visualize images with Encord Active (EA):
- Through the web application (Encord Active UI).
- Combining EA with visualization libraries to display the embeddings based on the metrics.

We’ll use the latter in this article.
You will import helper functions and modules from Encord Active along with visualization libraries (`matplotlib` and `plotly`). This code cell contains the list of the modules and helper functions.

Pre-defined subset of metrics in Encord Active

Next, iterate through the data quality metrics in Encord Active to see the list of available metrics, access the name attribute of each metric object, and construct a list of these names:

```python
[metric.name for metric in available_metrics]
```

You should get a similar output:

There are several quality metrics to explore, so let’s define and use helper functions that let you visualize the embeddings.

Helper functions for displaying images and visualizing the metrics

Define the `plot_top_k_images` function to plot the top k images for a metric:

```python
def plot_top_k_images(metric_name: str, metrics_data_summary: MetricsSeverity, project: Project,
                      k: int, show_description: bool = False, ascending: bool = True):
    metric_df = metrics_data_summary.metrics[metric_name].df
    metric_df.sort_values(by='score', ascending=ascending, inplace=True)

    for _, row in metric_df.head(k).iterrows():
        image = load_or_fill_image(row, project.file_structure)
        plt.imshow(image)
        plt.show()
        print(f"{metric_name} score: {row['score']}")
        if show_description:
            print(f"{row['description']}")
```

The function sorts the DataFrame of metric scores, iterates through the top `k` images in your dataset, loads each image, and plots it using Matplotlib. It also prints the metric score and, optionally, the description of each image. You will use this function to plot images based on the metrics you define.

Next, define a `plot_metric_distribution` function that creates a histogram of the specified metric scores using Plotly:

```python
def plot_metric_distribution(metric_name: str, metrics_data_summary: MetricsSeverity):
    fig = px.histogram(metrics_data_summary.metrics[metric_name].df, x="score", nbins=50)
    fig.update_layout(title=f"{metric_name} score distribution", bargap=0.2)
    fig.show()
```

Run the function to visualize the score distribution based on the “Aspect Ratio” metric:

```python
plot_metric_distribution("Aspect Ratio", metrics_data_summary)
```

Most images in the dataset have aspect ratios close to 1.5, in a roughly normal distribution. The set has only a few images with extremely small or large proportions.

Use EA’s `create_image_size_distribution_chart` function to plot the size distribution of your images:

```python
image_sizes = get_all_image_sizes(ea_project.file_structure)
fig = create_image_size_distribution_chart(image_sizes)
fig.show()
```

As you probably expected for an open-source dataset for computer vision applications, there is a dense cluster of points in the lower-left corner of the graph, indicating that many images have smaller resolutions, mostly below 2000 pixels in width and height. A few points are scattered further to the right, indicating images with a much larger width but not necessarily a proportional increase in height. These could represent panoramic images or images with unique aspect ratios. You’ll identify such images in subsequent sections.

Inspect the Problematic Images

What are the severe and moderate outliers in the image set?

You might also need insights into the distribution and severity of outliers across various imaging attributes. The attributes include metrics such as green values, blue values, area, etc. Use the `create_outlier_distribution_chart` utility to plot image outliers based on all the available metrics in EA.
The outliers are categorized into two levels: "severe outliers" (represented in red, "tomato") and "moderate outliers" (represented in orange):

```python
available_metrics = load_available_metrics(ea_project.file_structure.metrics)
metrics_data_summary = get_metric_summary(available_metrics)
all_metrics_outliers = get_all_metrics_outliers(metrics_data_summary)

fig = create_outlier_distribution_chart(all_metrics_outliers, "tomato", "orange")
fig.show()
```

Here’s the result:

"Green Values," "Blue Values," and "Area" appear to be the most susceptible to outliers, while attributes like "Random Values on Images" have the fewest in the 'sasha/dog-food' dataset. This primarily means there are lots of images that have abnormally high values of green and blue tints. This could be due to the white balance settings of the cameras that captured the images or low-quality sensors. If your model trains on this set, it’s likely that more balanced images may perturb the performance.

What are the blurry images in the image set?

Depending on your use case, you might discover that blurry images can sometimes deter your model. A model trained on clear images and then tested or used on blurry ones may not perform well. If the blur could lead to misinterpretations and errors with significant consequences, you might want to explore the blurry images to remove or enhance them.

```python
plot_top_k_images('Blur', metrics_data_summary, ea_project, k=5, ascending=False)
```

Based on a blur score of -9.473 calculated by Encord Active, here is the output with one of the five blurriest images:

What are the darkest images in the image set?

Next, surface images with poor lighting or low visibility. Dark images can indicate issues with quality. These could result from poor lighting during capture, incorrect exposure settings, or equipment malfunctions. Also, a model might struggle to recognize patterns in such images, which could reduce accuracy. Identify and correct these images to improve the overall training data quality.

```python
plot_top_k_images('Brightness', metrics_data_summary, ea_project, k=5, ascending=True)
```

The resulting image reflects a low brightness score of 0.164:

What are the duplicate or nearly similar images in the set?

Image singularity, in the context of image quality, refers to images with unique or atypical characteristics compared to most images in a dataset. Duplicate images can highlight potential issues in the data collection or processing pipeline. For instance, they could result from artifacts from a malfunctioning sensor or a flawed image-processing step. In computer vision tasks, duplicate images can disproportionately influence the trained model, especially if the dataset is small. Identify and address these images to improve the robustness of your model.

Use the “Image Singularity” metric to determine the score and the images that are near duplicates:

```python
plot_top_k_images('Image Singularity', metrics_data_summary, ea_project, k=15, show_description=True)
```

Here, you can see two nearly identical images with similar “Image Singularity” scores: the tiny difference between the singularity scores of the two images, 0.01299857 for the left and 0.012998693 for the right, shows how similar they are. Check out other similar or duplicate images by running this code cell.

Awesome! You have played with a few pre-defined quality metrics. See the complete code to run other data quality metrics on the images.

Next Steps: Fixing Data Quality Issues

Identifying problematic images is half the battle.
Ideally, the next step is to act on those insights and fix the issues. Encord Active (EA) can help you tag problematic images, which may otherwise skew model performance downstream. Post-identification, you can employ various strategies to rectify these issues. Below are some ways to fix problematic image issues.

Tagging and annotation

Once you identify the problematic images, you can tag them within EA. One of the most common workflows we see from our users at Encord is identifying image quality issues at scale with Encord Active, tagging problematic images, and sending them upstream for annotation with Annotate.

Re-labeling

Incorrect labels can significantly hamper model performance. EA facilitates the re-labeling process by exporting the incorrectly labeled images to an annotation platform like Encord Annotate, where you can correct the labels.

Active learning

Use active learning techniques to iteratively improve the quality of the dataset. You can establish a continuous improvement cycle by training the model on good-quality datasets and then evaluating it on low-quality datasets to suggest which data to improve.

Active learning (encord.com)

{{light_callout_start}} Check out our practical guide to active learning for computer vision to learn more about active learning, its tradeoffs, alternatives, and a comprehensive explanation of active learning pipelines. {{light_callout_end}}

Image augmentation and correction

Image augmentation techniques enhance the diversity and size of the dataset to improve model robustness. Consider augmenting the data using techniques like rotation, scaling, cropping, and flipping. Some images may require corrections like brightness adjustment, noise reduction, or other image-processing techniques to meet the desired quality standards.

Image quality is not a one-time task but a continuous process. Regularly monitoring and evaluating your image quality will help maintain a high-quality dataset, which is pivotal for achieving superior model performance.

Key Takeaways

In this article, you defined the objective of training a binary classification model for your use case. Technically, you “gathered” human labels since the open 'sasha/dog-food' dataset was already labeled on Hugging Face. Finally, using Encord Active, you computed image embeddings, ran metrics on those embeddings, and inspected the problematic images by exploring the dataset based on objective quality metrics. Identifying and fixing the errors in the dataset will set up your downstream model training and ML application for success.

If you are interested in exploring this topic further, there’s an excellent article from Aliaksei Mikhailiuk that perfectly describes the task of image quality assessment in three stages:
- Define an objective
- Gather the human labels for your dataset
- Train objective quality metrics on the data

{{Active_CTA}}
October 19