Software To Help You Turn Your Data Into AI
Forget fragmented workflows, annotation tools, and Notebooks for building AI applications. Encord Data Engine accelerates every step of taking your model into production.
All this and more! Read on to make your day.
We are thrilled to announce that our highly-anticipated feature, Workflows, has officially transitioned from beta to general availability! This milestone could not have been achieved without the invaluable feedback from our dedicated users throughout the beta phase.
Workflows are designed to give you full control of the annotation process, ensuring a blend of high performance, usability, and extensibility that scales with the rapid pace and change of the AI industry. Some of the major improvements are:
Workflow scalability means more tasks and more labels in your annotation projects. We're also juicing up the editor to be more performant -- which means more labels per task, faster. Backend improvements mean your data will save faster and more seamlessly, and we're raising the limits on labels per task so you can benefit from those improvements as well -- contact us to work with more data per task! Arriving soon are further performance improvements to enhance the user experience when dealing with many objects and complex label timelines. This all adds up to a more natural pre-processing and editing experience, even on long, label-intense video annotation workloads. Exciting!
We understand that searching our documentation isn’t always your first thought when you need to learn about the platform. To address this, we've integrated AI support directly into our platform, ensuring you have quick access to assistance precisely when you need it.
Whether you're onboarding for the first time, looking for a quick refresher on using the Label Editor, or need help understanding terminology, our AI assistant is here to help. It is regularly trained on all of our platform and SDK documentation, enabling it to provide intelligent and up-to-date responses to any questions you may have about our application!
We know that curating the best images, frames from a video, or slices from a scan is a daunting, difficult, and time-intensive task. Ensuring that your dataset is free of outliers, duplicates, and irrelevant images, and then selecting the best samples, is crucial for building robust and performant models.
Encord is your trusted partner along your journey and based on your feedback we have designed Active's new Explorer to simplify this process, incorporating best practices into intuitive user journeys:
After your data is annotated and your model is trained, Encord Active simplifies the shift to evaluation. Simply import your model predictions and access a detailed analysis of your model’s performance, with options to break it down by class, data collections, and splits such as train, test, and validation.
You can also use the Explorer to investigate your prediction types following a series of best-practice workflows:
Labeling training data is, like the model training process it supports, an iterative process. You’ve asked for ways to snapshot your progress — whether it’s to save a checkpoint before re-labeling, check in on progress as you work through a large project, or name different subsets for purposes such as training, testing, and validation. We’ve listened, and we’re happy to introduce label versioning for Workflow projects. Navigate to the labels tab, select your tasks, and press ‘Save new version’ — you can refer to these snapshots by name and time.
Initially, we’re rolling out support for exporting labels from saved checkpoints, but look out for coming improvements such as restoring to different projects. As always, let us know how it helps and what more we can do to enhance your AI initiatives with labels and label set management tools!
Many of you have shown interest in working closely with our product development team and helping us create the best features — as such, we’re very happy to introduce Encord Labs! Encord Labs gives you access to features at the bleeding edge while giving you control over how those features appear in the platform. This means you get all the power of rapidly evolving technology with none of the risks. Getting in on the ground floor means you can shape how features evolve faster, helping us ensure we build with tight customer feedback in mind. Encord Labs will be rolling out several select features in Q4 — contact us if you’re interested or would like to join our collaborative early tester program!
Thanks for reading, feel free to email product@encord.com with any questions or suggestions, and let us know if you're attending RSNA 2023!
Join the Encord Developers community to discuss the latest in computer vision, machine learning, and data-centric AI
Related Blogs
On October 30, 2023, the White House announced an Executive Order issued by President Joe Biden aimed at fostering a balanced approach toward the development and deployment of Artificial Intelligence (AI) to ensure it's safe, secure, and trustworthy. It acknowledges the potential of AI technologies in solving urgent societal challenges and enhancing prosperity, productivity, innovation, and security. However, the Executive Order also highlights the potential adverse effects of irresponsible use of artificial intelligence, such as fraud, discrimination, bias, misinformation, and threats to national security, and underscores the need for guardrails. The Order calls for a collective effort from the federal government (including the Department of Homeland Security, the Department of Health and Human Services, the Department of Energy, the Department of Commerce, and more), the private sector, academia, and civil society to mitigate these harms while maximizing the benefits of AI.

Here are the three main guiding principles behind this Executive Order:

- Safety and security: The Order emphasizes the need for robust, reliable, repeatable, and standardized evaluations of AI systems. It mandates addressing security risks, including those related to biotechnology, cybersecurity, and critical infrastructure. The document also highlights the importance of testing, post-deployment monitoring, and effective labeling to ensure that AI systems are ethically developed, securely operated, and compliant with federal laws.
- Responsible innovation: The Order encourages promoting responsible innovation, competition, and collaboration to maintain U.S. leadership in AI. It calls for investments in AI-related education, training, development, and research, and for tackling intellectual property issues. It also emphasizes creating a fair, open, and competitive AI ecosystem and marketplace, supporting small developers, and addressing potential risks from dominant firms' control over critical assets like semiconductors, computing power, cloud storage, and data.
- Supporting American workers: As AI creates new jobs and industries, the Order stresses adapting job training and education to support a diverse workforce. It advises against deploying AI in ways that undermine rights, worsen job quality, or cause harmful labor-force disruptions. The Order encourages building the next steps in AI development on the views of workers, labor unions, educators, and employers to support responsible AI uses that improve workers' lives and augment human work.

In subsequent sections of this article, we will examine the actions behind the AI directives in this Executive Order. In the meantime, let’s explore how we got here.

How Did We Get Here? The History of AI Regulation in the United States

President Biden's Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence is the result of years of developing insights and responses to emerging technologies in the field of AI. To show how we arrived at this important turning point, this section walks you through the path of AI regulation in the United States.

Early Engagement: Regulating Open- and Closed-Source LLMs

Navigating the spectrum between open and closed LLM systems is critical for effective AI policy. Striking the right balance will promote innovation and competition while managing the potential risks of AI. By 2024, the National Institute of Standards and Technology (NIST), under the U.S. Department of Commerce, will determine whether to allow the release of open model weights under public licenses. This, of course, is bound to stir up discussions about treating open model weights as free speech, along with accusations that big tech companies are lobbying to protect their moat.

As these LLM systems began permeating various sectors, the need for a regulatory framework became apparent. Policymakers grappling with the rapid advancements in AI models and tools started the conversation about how to balance promoting U.S. global leadership in AI against the risks to individuals, businesses, and national security.

Legislative Efforts

The early engagement translated into legislative action, with U.S. House and Senate committees holding numerous hearings on AI. The hearings included big names like Elon Musk, CEO of SpaceX, Tesla, and X (formerly known as Twitter); Mark Zuckerberg, CEO of Meta; Microsoft co-founder Bill Gates; and Sam Altman, CEO of OpenAI, the company behind the AI chatbot ChatGPT.

Biden Administration’s Early Steps

In October 2022, the Biden administration issued a non-binding AI Bill of Rights, marking an early step towards delineating the government’s stance on governing automated systems, with a focus on civil rights protection. Later, on September 12, 2023, several tech companies signed voluntary agreements to follow the rules President Biden set out for AI. This was the first step toward encouraging responsible AI use through partnerships with the private sector.

SAFE Innovation: A Values-Based Framework and New Legislative Process

Despite strong bipartisan interest, passing comprehensive AI legislation remained a challenge, paving the way for the SAFE Innovation Framework proposed by Senate Majority Leader Chuck Schumer.

The Executive Order

The culmination of these efforts and the evolving understanding of AI's impact led to the issuance of the Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence. This Executive Order embodies a more structured approach to AI governance, reflecting the administration’s commitment to promoting responsible AI development and deployment while addressing the potential risks of AI.

What Are the Executive Order Directives?

We have summarized the Executive Order directives below so you can easily skim through and find the directives and corresponding actions relevant to you.

Directive 1: New Standards for AI Safety and Security
Actions:
- Require developers to share safety test results with the U.S. government.
- Develop standards and tools to ensure AI systems are safe and secure.
- Protect against AI-enabled risks to national security and public health.
- Establish strong standards for biological synthesis screening.

Directive 2: Protecting Americans’ Privacy
Actions:
- Prioritize federal support for privacy-preserving techniques in AI.
- Strengthen privacy-preserving research and technologies.
- Evaluate how agencies collect and use commercially available data.
- Develop guidelines for federal agencies to evaluate privacy-preserving techniques.

Directive 3: Advancing Equity and Civil Rights
Actions:
- Provide guidance to keep AI programs from worsening discrimination.
- Address algorithmic discrimination through training and coordination.
- Ensure fairness in the criminal justice system's use of AI.

Directive 4: Standing Up for Consumers, Patients, and Students
Actions:
- Advance the responsible use of AI in healthcare.
- Shape AI’s potential in education.
- Protect consumers and patients while ensuring AI benefits them.

Directive 5: Promoting Innovation and Competition
Actions:
- Catalyze AI research and provide grants in vital areas.
- Promote a fair and competitive AI ecosystem.
- Streamline visa criteria for skilled immigrants.

Directive 6: Supporting Workers
Actions:
- Develop principles and best practices for worker protection.
- Produce a report on AI’s labor-market impacts.

Directive 7: Advancing American Leadership Abroad
Actions:
- Expand collaborations on AI at bilateral, multilateral, and multistakeholder levels.
- Accelerate the development of AI standards with international partners.
- Promote responsible AI development abroad.

Directive 8: Ensuring Responsible and Effective Government Use of AI
Actions:
- Issue guidance for agencies’ AI use.
- Streamline AI product and service acquisition.
- Accelerate the hiring of AI professionals in government.

Now that we've discussed the key directives of the US Executive Order on AI, let's compare and contrast them with the European Union's approach to AI regulation, known as the EU Artificial Intelligence Act (AI Act).

US Executive Order on Safe, Secure, and Trustworthy AI vs. the European Union AI Act

In the table below, we present a comparative overview of the key aspects and focus areas of the US Executive Order on Safe, Secure, and Trustworthy AI and the EU Artificial Intelligence Act (AI Act).

{{gray_callout_start}} Read more about the takes on “Proposed AI Regulation: EU AI Act, UK's Pro-Innovation, US AI Bill of Rights” from Encord’s co-founder and president. {{gray_callout_end}}

As the comparison shows, while both regulations aim to foster a safe and responsible AI ecosystem, they approach AI governance from slightly different vantage points, reflecting the distinct priorities and regulatory philosophies of the US and the EU.

{{light_callout_start}} What does the European AI Act mean for you, an AI developer? Learn more from this article by Ulrik Stig Hansen, Encord’s co-founder and president. {{light_callout_end}}

Conclusion

Increased involvement from policymakers, legislative efforts, and joint initiatives between the public and private sectors have all contributed to the current AI regulatory landscape. The issuance of the Executive Order represents a significant milestone in the ongoing journey towards establishing a robust framework for AI governance in the U.S., aimed at harnessing the benefits of AI while mitigating its potential perils. But will regulations stifle the efforts of open-source AI? Or will they encourage an ecosystem of open innovation while regulating the risks at the application layer?

In this article, you learned about the evolution of AI regulation in the U.S., focusing on key legislative efforts, the Biden administration's early steps towards AI governance, and the collaborative initiatives that marked the journey towards the recent Executive Order. We covered the regulatory developments that led to the Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence, including actions taken by lawmakers, voluntary commitments from tech companies, and the release of values-based frameworks such as the SAFE Innovation Framework. Finally, we compared the directives with the proposed European Union AI Act, where you saw clearly different priorities and regulatory philosophies between the United States and the European Union.
{{gray_callout_start}} Get access to our new AI Act Learning Pack, which includes all the key resources you need to ensure forward compatibility. {{gray_callout_end}} {{try_encord}}
November 1
For quite some time, the idea that artificial intelligence (AI) could understand visual and textual cues as effectively as humans seemed far-fetched and unimaginable. However, with the emergence of multimodal AI, we are seeing a revolution where AI can simultaneously comprehend various modalities, such as text, images, speech, facial expressions, and physiological gestures, to make sense of the world around us. The ability to process multiple modalities has opened up various avenues for AI applications.

One such exciting application of multimodal AI is Vision-Language Models (VLMs). They can process and understand the modalities of language (text) and vision (image) simultaneously to perform advanced vision-language tasks, such as Visual Question Answering (VQA), image captioning, and text-to-image search.

In this article, you will learn about:
- VLM architectures
- VLM evaluation strategies
- Mainstream datasets used for developing vision-language models
- Key challenges, primary applications, and future trends of VLMs

Let’s start by understanding what vision-language models are.

What Are Vision Language Models?

A vision-language model is a fusion of vision and natural language models. It ingests images and their respective textual descriptions as inputs and learns to associate the knowledge from the two modalities. The vision part of the model captures spatial features from the images, while the language model encodes information from the text. The data from both modalities, including detected objects, the spatial layout of the image, and text embeddings, are mapped to each other. For example, if the image contains a bird, the model will learn to associate it with a similar keyword in the text descriptions. This way, the model learns to understand images and transforms the knowledge into natural language (text) and vice versa.

Training VLMs

Techniques for building VLMs include pre-training foundation models and zero-shot learning. You can use transfer learning techniques such as knowledge distillation to fine-tune the models for more specific downstream tasks. These are simpler techniques that require smaller datasets and less training time while maintaining decent results. Modern frameworks, on the other hand, use various techniques to get better results, such as:
- Contrastive learning
- Masked language-image modeling
- Encoder-decoder modules with transformers, and more

These architectures can learn complex relations between the various modalities and provide state-of-the-art results. Let’s discuss these in detail.

{{RLHF_CTA}}

Vision Language Models: Architectures and Popular Models

Let’s look at some VLM architectures and learning techniques that mainstream models such as CLIP, Flamingo, and VisualBERT, among others, use.

Contrastive Learning

Contrastive learning is a technique that learns data points by understanding their differences. The method computes a similarity score between data instances and aims to minimize the contrastive loss. It’s most useful in semi-supervised learning, where only a few labeled samples guide the optimization process to label unseen data points.

Contrastive Learning

For example, one way of understanding what a cat looks like is to place it beside a similar cat image and a dog image. Contrastive learning models learn to distinguish between a cat and a dog by identifying several features, such as facial structure, body size, and the presence of fur. The models can determine which image is closer to the original, called the “anchor,” and predict its class.
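To make the contrastive idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive objective in PyTorch. The batch size, embedding width, and temperature are illustrative assumptions rather than values from any particular paper.

```python
import torch
import torch.nn.functional as F

# Toy batch: 8 image embeddings and 8 matching text embeddings. In a real
# model these come from the vision and text encoder branches; here they are
# random placeholders, L2-normalized as is common for contrastive training.
image_emb = F.normalize(torch.randn(8, 512), dim=-1)
text_emb = F.normalize(torch.randn(8, 512), dim=-1)

temperature = 0.07  # illustrative value
logits = image_emb @ text_emb.T / temperature  # pairwise similarity matrix

# The i-th image should match the i-th text, so the "class" of row i is i.
targets = torch.arange(logits.size(0))

# Symmetric loss: images-to-texts plus texts-to-images.
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```

Minimizing a loss of this shape pulls matching image-text pairs together in the shared embedding space and pushes mismatched pairs apart, which is the property CLIP exploits for zero-shot prediction.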
CLIP is an example of a model that uses contrastive learning by computing the similarity between text and image embeddings using textual and visual encoders. It follows a three-step process to enable zero-shot predictions:
1. Trains a text and image encoder during pre-training to learn the image-text pairs.
2. Converts the training dataset classes into captions.
3. Estimates the best caption for the given input image for zero-shot prediction.

CLIP Architecture

ALIGN is another example that uses image and textual encoders to minimize the distance between similar embeddings using a contrastive loss function.

{{light_callout_start}} Want to know how to evaluate CLIP? Head onto our blog and read Evaluating Foundation Models (CLIP) using Encord Active. {{light_callout_end}}

PrefixLM

PrefixLM is an NLP learning technique mostly used for model pre-training. It inputs part of the text (a prefix) and learns to predict the next word in the sequence. In vision-language models, PrefixLM enables the model to predict the next sequence of words based on an image and its respective prefix text. It leverages a Vision Transformer (ViT) that divides an image into a one-dimensional sequence of patches, where each patch represents a local image region. Then, the model applies convolution or linear projection over the processed patches to generate contextualized visual embeddings. For the text modality, the model converts the text prefix relative to the patch into a token embedding. The encoder-decoder blocks of the transformer receive both the visual embeddings and the token embeddings, and that is where the model learns the relationships between them.

SimVLM is a popular architecture utilizing the PrefixLM learning methodology. It has a simpler transformer architecture than its predecessors, surpassing their results on various benchmarks. It uses a transformer encoder to learn image-prefix pairs and a transformer decoder to generate an output sequence. The model also demonstrates good generalization and zero-shot learning capabilities.

SimVLM Architecture

Similarly, VirTex uses a convolutional neural network to extract image features and a textual head with transformers to manage text prefixes. You can train the model end-to-end to predict the correct image captions by feeding image-text pairs to the textual head.

VirTex Architecture

Frozen PrefixLM

While PrefixLM techniques require you to train visual and textual encoders from scratch, Frozen PrefixLM allows you to use pre-trained networks and only update the parameters of the image encoders. For instance, the architecture below shows how Frozen works using a pre-trained language model and visual encoder. The text encoder can belong to any large language model (LLM), and the visual encoder can also be a pre-trained visual foundation model. You can fine-tune the image encoder so its image representations align with the textual embeddings, allowing the model to make better predictions.

Frozen Architecture

A more state-of-the-art (SOTA) approach is Flamingo’s architecture, which uses a CLIP-like vision encoder and an LLM called Chinchilla. Keeping the LLM fixed, you can train the visual encoder on images interleaved with text. The visual encoders process the image through a Perceiver Resampler. The technique results in faster inference and makes Flamingo ideal for few-shot learning.

Flamingo Architecture

Multimodal Fusing with Cross-Attention

This method utilizes the encoders of a pre-trained LLM for visual representation learning by adding cross-attention layers.
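As a rough illustration of what this kind of fusion looks like (a generic sketch, not any specific model's implementation), the snippet below lets text-token states attend over image features through a single PyTorch cross-attention layer; all shapes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Illustrative shapes: a batch of 2 samples, 16 text tokens, 49 image patches.
text_states = torch.randn(2, 16, 512)   # hidden states from a language model
image_feats = torch.randn(2, 49, 512)   # features from a vision encoder

# Cross-attention: text tokens act as queries, image features as keys/values.
cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
fused, attn_weights = cross_attn(query=text_states, key=image_feats, value=image_feats)

# `fused` now carries visually grounded text representations that a decoder
# can condition on when generating captions or answers.
print(fused.shape)  # torch.Size([2, 16, 512])
```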
VisualGPT is a primary example that allows quick adaptation of an LLM’s pre-trained encoder weights for visual tasks.

VisualGPT Architecture

Practitioners extract relevant objects from an image input and feed them to a visual encoder. They feed the resulting visual representations to a decoder and initialize its weights according to the pre-trained LLM. The decoder module balances the visual and textual information through a self-resurrecting activation unit (SRAU). The SRAU method avoids the issue of vanishing gradients, a common problem in deep learning where model weights fail to update due to small gradients. As such, VisualGPT outperforms several baseline models, such as the plain transformer, the Attention-on-Attention (AoA) transformer, and the X-Transformer.

Masked Language Modeling (MLM) & Image-Text Matching (ITM)

MLM works in language models like BERT by masking or hiding a portion of a textual sequence and training the model to predict the missing text. ITM involves predicting whether sentence Y follows sentence X. You can adapt the MLM and ITM techniques for visual tasks. For instance, the diagram below illustrates the architecture of VisualBERT, trained on the COCO dataset.

VisualBERT Architecture

It augments the MLM procedure by introducing image sequences alongside a masked textual description. The objective is to predict the missing text based on the visual embeddings. Similarly, ITM predicts whether or not a caption matches the image.

No Training

You can directly use large-scale pre-trained vision-language models without any fine-tuning. For example, MAGIC and ASIF are training-free frameworks that aim to predict text descriptions that align closely with the input image.

MAGIC uses a specialized score based on CLIP-generated image embeddings to guide the output of language models. Using this score, an LLM generates textual embeddings that align closely with the image semantics, enabling the model to perform multimodal tasks in a zero-shot manner.

ASIF uses the idea that similar images have similar captions. The model computes the similarities between the query and candidate images in the training dataset. Next, it compares the query image embeddings with the text embeddings of the corresponding candidate images. Then, it predicts a description whose embeddings have the highest similarity to the embeddings of the query image, resulting in zero-shot performance comparable to models like CLIP and LiT.

ASIF Prediction Strategy

Knowledge Distillation

This technique involves transferring knowledge from a large, well-trained teacher model to a lighter student model with fewer parameters. This methodology allows researchers to train VLMs from larger pre-trained models. For instance, ViLD is a popular VLM developed using the knowledge distillation methodology. The model uses a pre-trained open-vocabulary image classification model as the teacher to train a two-stage detector (student). The model matches textual embeddings from a textual encoder with image embeddings.

ViLD Architecture

You can use knowledge distillation to transfer knowledge from the image encoder to the backbone model to generate regional embeddings automatically. Only the backbone model generates regional embeddings during inference, and the model matches them with unseen textual embeddings. The objective is to draw correct bounding boxes around objects in an image based on textual descriptions.
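To ground the idea, here is a minimal sketch of a generic response-based distillation objective: a temperature-scaled KL divergence between teacher and student outputs. This is the classic pattern rather than ViLD's exact loss, and the shapes and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soften both distributions and penalize the student for diverging
    from the teacher (classic knowledge-distillation objective)."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradients keep a comparable magnitude across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Toy example: a batch of 4 region proposals scored over 10 classes.
teacher_logits = torch.randn(4, 10)  # e.g. from a large pre-trained classifier
student_logits = torch.randn(4, 10)  # e.g. from a lightweight detector head
print(distillation_loss(student_logits, teacher_logits).item())
```

In embedding-based setups like the one described above, a distance between student and teacher embeddings often takes the place of the KL term, but the teacher-supervises-student pattern is the same.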
Evaluating Vision Language Models

VLM validation involves assessing the quality of the relationships between the image and text data. For example, for an image captioning model, this would mean comparing the generated captions to the ground-truth descriptions. You can use various automated n-gram-based evaluation strategies to compare the predicted labels in terms of accuracy, semantics, and information precision. A few of the key VLM evaluation metrics are listed below, followed by a short worked example.

- BLEU: The Bilingual Evaluation Understudy (BLEU) metric was originally proposed to evaluate machine translation tasks. It computes the precision of the target text compared to a reference (ground truth) by considering how many words in the candidate sentence appear in the reference.
- ROUGE: Recall-Oriented Understudy for Gisting Evaluation (ROUGE) computes recall by considering how many words in the reference sentence appear in the candidate.
- METEOR: Metric for Evaluation of Translation with Explicit Ordering (METEOR) computes the harmonic mean of precision and recall, giving more weight to recall and multiplying it by a penalty term. The metric is an improvement over others that work with either precision or recall alone, as it combines information from both to give a better evaluation.
- CIDEr: Consensus-based Image Description Evaluation (CIDEr) compares a target sentence to a set of human sentences by computing the average similarity between reference and target sentences using TF-IDF scores.
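As a quick illustration of how one of these n-gram metrics behaves in practice, the sketch below scores a made-up caption against two made-up references with NLTK's BLEU implementation; a smoothing function is used because short captions often have zero higher-order n-gram overlap.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical generated caption and two reference (ground-truth) captions.
candidate = "a dog is running on the grass".split()
references = [
    "a dog runs across the grass".split(),
    "a brown dog is running outside on grass".split(),
]

smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```

Swapping in ROUGE or METEOR follows the same pattern with their respective implementations.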
Now that you have learned the evaluation metrics pertinent to Vision-Language Models (VLMs), it's essential to know how to curate datasets for these models. The right dataset provides fertile ground for training and validating VLMs and is pivotal in determining the models' performance across diverse tasks.

Datasets for Vision Language Models

Collecting training data for VLMs is more challenging than for traditional AI models since it involves the collection and quality assurance of multiple data modalities. Below is a list of several datasets combining image and text data for multimodal training.

- LAION-5B: Practitioners use the LAION-5B dataset for building large pre-trained VLMs. The dataset contains more than five billion image-text pairs generated from CLIP, with descriptions in English and foreign languages, thereby catering to a multilingual domain.
- PMD: The Public Model Dataset (PMD) originally appeared in the FLAVA paper and contains 70 billion image-text pairs. The data is a collection from other large-scale datasets, such as COCO, Conceptual Captions, RedCaps, etc. This dataset is a reservoir of multimodal data that fosters robust model training.
- VQA: Experts use the VQA dataset to fine-tune pre-trained VLMs for downstream VQA and visual reasoning tasks. The dataset contains over 200,000 images, with five questions per image, ten ground-truth answers, and three incorrect answers per question.
- ImageNet: ImageNet contains over 14 million images with annotations categorized according to the WordNet hierarchy. It’s helpful for building models for simple downstream tasks, such as image classification and object recognition.

Despite the availability of high-quality multimodal datasets, VLMs can face significant challenges during the model development process. Let’s discuss them below.

{{Training_data_CTA}}

Limitations of Vision Language Models

Although VLMs are powerful in understanding visual and textual modalities to process information, they face three primary challenges:
- Model complexity
- Dataset bias
- Evaluation difficulties

Model Complexity

Language and vision models are quite complex on their own, and a combination of the two only worsens the problem. The complexity of these models raises additional challenges, including acquiring powerful computing resources for training, collecting large datasets, and deploying on weak hardware such as IoT devices.

Dataset Bias

Dataset biases occur when VLMs memorize deep patterns within the training and test sets without solving anything. For instance, training a VLM on images curated from the internet can cause the model to memorize specific patterns and not learn the conceptual differences between various images.

Evaluation Strategies

The evaluation strategies discussed above only compare a candidate sentence with reference sentences. The approach assumes that the reference sentences are the only ground truths. However, there can be several ground-truth descriptions for a particular image. Although consensus-based metrics like CIDEr account for the issue, using them becomes challenging when consensus is low for particular images. Another challenge arises when a generic description applies to several images.

Spurious Correlation

As the illustration shows, a VLM can annotate or retrieve several relevant images that match a generic caption. However, in reality, the model is nothing more than a bag of words. All it’s doing is considering words such as ‘city,’ ‘bus,’ and ‘lights’ to describe the image, instead of actually understanding the caption's sequential order and true contextual meaning.

Furthermore, VLMs used for VQA can generate highly confident answers to nonsensical questions. For instance, asking a VLM, “What color is the car?” for an image that contains a white horse will generate the answer “white” instead of pointing out that there isn’t a car in the picture.

Lastly, VLMs lack compositional generalization, meaning their performance decreases when processing novel concepts. For example, a VLM can fail to recognize a yellow horse as a category since it’s rare to associate the color yellow with horses.

Despite many development and deployment challenges, researchers and practitioners have made significant progress in adopting VLMs for solving real problems. Let’s discuss them briefly below.

Applications of Vision Language Models

While most VLMs discussed earlier are helpful for captioning images, their utility extends to a variety of other domains that leverage their capability to bridge visual and linguistic modalities. Here are some additional applications:

- Image Retrieval: Models such as FLAVA allow users to navigate through image repositories by helping them find relevant photos based on linguistic queries. A relevant example is an e-commerce site: visitors can describe what they’re looking for in a search bar, and a VLM will show the suitable options on the screen. This application is also popular on smartphones, where users can type in keywords (landscapes, buildings, etc.) to retrieve associated images from the gallery.
- Generative AI: Image generation through textual prompts is a growing domain where models like DALL-E allow users to create art or photos based on their descriptions. The application is practical in businesses where designers and inventors want to visualize different product ideas. It also helps create content for websites and blogs and aids in storytelling.
- Segmentation: VLMs like SegGPT help with segmentation tasks, such as instance, panoptic, and semantic segmentation. SegGPT segments an image by understanding user prompts and exploits a distinct coloring scheme to segment objects in context.
For instance, users can ask to segment a rainbow from several images, and SegGPT will annotate all the rainbows efficiently.

{{light_callout_start}} Read our detailed article on SegGPT: Segmenting everything in context [Explained] to learn more about how the model works. {{light_callout_end}}

Future Research

The following are a few crucial future research directions in the VLM domain:

Better Datasets

The research community is working on building better training and test datasets to help VLMs with compositional understanding. CLEVR is one example of this effort.

CLEVR Dataset

As the illustration shows, it contains images of novel shapes and colors, along with corresponding questions that allow experts to test a VLM’s visual reasoning capacity.

Better Evaluation Methods

Evaluation challenges warrant in-depth research into better evaluation methods for building more robust VLMs. One alternative is to test VLMs for individual skills through the ARO benchmark. Attribute identification, relational reasoning, and word-order sensitivity (ARO) are three skills that VLMs must master.

ARO Dataset

The illustration above explains what ARO entails in different contexts. Using such a dataset, experts can analyze what VLMs learn and how to improve the outcomes.

Robotics

Researchers are also using VLMs to build purpose-specific robots. Such robots can help navigate environments, improve warehouse operations in manufacturing by monitoring items, and enhance human-machine interaction by allowing robots to understand human gestures, such as facial expressions, body language, voice tones, etc.

Medical VQA

VLMs’ ability to annotate images and recognize complex objects can help healthcare professionals with medical diagnoses. For example, they can ask VLMs critical questions about X-rays or MRI scans to determine potential problems early.

Vision-Language Models: Key Takeaways

Visual language modeling is an evolving field that holds great promise for the AI industry. Below are a few critical points regarding VLMs:
- Vision-language models are multimodal architectures that simultaneously comprehend image and text data modalities.
- They use CV and NLP models to correlate information (embeddings) from the two modalities.
- Several VLM architectures exist that aim to relate visual semantics to textual representations.
- Although users can evaluate VLMs using automated scores, better evaluation strategies are crucial to building more reliable models.
- VLMs have many industrial use cases, such as robotics, medical diagnosis, chatbots, etc.

{{try_encord}}
November 3
Computer vision engineers, data scientists, and machine learning engineers face a pervasive issue: the prevalence of low-quality images within datasets. You have likely encountered this problem through incorrect labels, varied image resolutions, noise, and other distortions. Poor data quality can lead to models learning incorrect features, misclassifications, and unreliable or incorrect outputs. In a domain where accuracy and reliability are paramount, this issue can significantly impede the progress and success of projects, resulting in wasted resources and extended project timelines.

Take a look at the following image collage of Chihuahuas and muffins, for example:

Chihuahua or muffin? My search for the best computer vision API

How fast could you tell which images are Chihuahuas vs. muffins? Fast? Slow? Were you correct in 100% of the images? I passed the collage to GPT-4V because, why not? 😂 And as you can see, even the best-in-class foundation model misclassified some muffins as Chihuahuas! (I pointed out a few.)

So, how do you make your models perform better? The sauce lies in a systematic approach to exploring, evaluating, and fixing the quality of images. Enter Encord Active! It provides a platform to identify and tag problematic images and use features to improve the dataset's quality.

This article will show you how to use Encord Active to explore images, visualize potential issues, and take the next steps to rectify low-quality images. In particular, you will:
- Use a dog-food dataset from the Hugging Face Datasets library.
- Delve into the steps of creating an Encord Active project.
- Define and run quality metrics on the dataset.
- Visualize the quality metrics.
- Identify strategies to fix the issues you find.

Ready? Let’s delve right in! 🚀

Using Encord Active to Explore the Quality of Your Images

The Encord Active toolkit helps you find and fix wrong labels through data exploration, model-assisted quality metrics, and one-click labeling integration. It takes a data-centric approach to improving model performance. With Encord Active, you can:
- Slice your visual data across metrics functions to identify data slices with low performance.
- Flag poor-performing slices and send them for review.
- Export your new dataset and labels.
- Visually explore your data through interactive embeddings, precision/recall curves, and other advanced visualizations.

Check out the project on GitHub, and hey, if you like it, leave a 🌟🫡.

Demo: Explore the quality of 'dog' and 'food' images for ML models

In this article, you will use Encord Active to explore the quality of the `sasha/dog-food` images. You’ll access the dataset through the Hugging Face Datasets library. You can use this dataset to build a binary classifier that categorizes images into the "dog" and "food" classes. The 'dog' class has images of canines that resemble fried chicken and some that resemble muffins, and the 'food' class has images of, you guessed it, fried chicken and muffins.

The complete code is hosted on Colab. Open the Colab notebook side by side with this blog post.

{{light_callout_start}} Interested in more computer vision, visual foundation models, active learning, and data quality notebooks?
Check out the Encord Notebook repository. {{light_callout_end}}

Use Hugging Face Datasets to Download and Generate the Dataset

Whatever machine learning, deep learning, or AI task you are working on, the Hugging Face Datasets library provides easy access to, sharing of, and processing for datasets, particularly those catering to the audio, computer vision, and natural language processing (NLP) domains. The 🤗 Datasets library backs the datasets with an on-disk cache that is memory-mapped for quick lookups.

Explore the Hugging Face Hub for the datasets directory

You can browse and explore over 20,000 datasets housed in the library on the Hugging Face Hub. The Hub is a centralized platform for discovering and choosing datasets pertinent to your projects. In the search bar at the top, enter keywords related to the dataset you're interested in, e.g., "sentiment analysis," "image classification," etc. You should be able to:
- Filter datasets by domain, license, language, and so on.
- Find information such as the size, download count, and download link on the dataset card.
- Engage with the community by contributing to discussions, providing feedback, or suggesting improvements to the dataset.

Load the ‘sasha/dog-food’ dataset

Loading the `sasha/dog-food` dataset is pretty straightforward: install the 🤗 Datasets library and download the dataset. To install Hugging Face Datasets, run the following command:

```
pip install datasets
```

Use the `load_dataset` function to load the 'sasha/dog-food' dataset from Hugging Face:

```python
from datasets import load_dataset

dataset_dict = load_dataset('sasha/dog-food')
```

`load_dataset` returns a dictionary object (`DatasetDict`). You can iterate through the train and test dataset split keys in the `DatasetDict` object. The keys map to a `Dataset` object containing the images for that particular split.

You will explore the entire dataset rather than the separate splits. This should provide a comprehensive understanding of the data distribution, characteristics, and potential issues. To do that, merge the different splits into a single dataset using the `concatenate_datasets` function:

```python
from datasets import concatenate_datasets

dataset = concatenate_datasets([d for d in dataset_dict.values()])
```

Perfect! Now you have an entire dataset to explore with Encord Active in the subsequent sections. If you have not done so already, create a dataset directory to store the downloaded images:

```python
from pathlib import Path
import shutil

# Create a new directory "huggingface_dataset" in the current working dir
huggingface_dataset_path = Path.cwd() / "huggingface_dataset"

# Delete the dir if it already exists and recreate it
if huggingface_dataset_path.exists():
    shutil.rmtree(huggingface_dataset_path)
huggingface_dataset_path.mkdir()
```

Use a loop to iterate through the images from the ‘sasha/dog-food’ dataset and save them to the directory you created:

```python
from tqdm import tqdm

for counter, item in tqdm(enumerate(dataset)):
    image = item['image']
    image.save(f'./huggingface_dataset/{counter}.{image.format}')
```

If your code throws errors, run the cells in the Colab notebook in the correct order.

Super! You have prepared the groundwork for exploring your dataset with Encord Active.
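Before wiring the folder into Encord Active, it can help to sanity-check the export. The optional sketch below assumes the `huggingface_dataset` folder created in the previous step; it simply counts the saved files and opens the first image to confirm it decodes correctly:

```python
from pathlib import Path
from PIL import Image

export_dir = Path.cwd() / "huggingface_dataset"
files = sorted(export_dir.glob("*"))
print(f"Saved {len(files)} files")

# Open the first image to verify it loads and inspect its size and mode.
with Image.open(files[0]) as img:
    print(files[0].name, img.size, img.mode)
```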
Create an Encord Active Project

You must specify the directory containing your datasets when using Encord Active for exploration. You will initialize a local project with the image files; there are different ways to import and work with projects in Encord. Encord Active provides functions and utilities to load all your images, compute embeddings, and, based on that, evaluate the embeddings using pre-defined metrics. The metrics will help you search and find images with errors or quality issues.

Before initializing the Encord Active project, define a function, `collect_all_images`, that takes a root folder path (here, the `huggingface_dataset_path` directory) as input and returns a list of `Path` objects representing the image files within it:

```python
def collect_all_images(root_folder: Path) -> list[Path]:
    image_extensions = {".jpg", ".jpeg", ".png", ".bmp"}
    image_paths = []
    for file_path in root_folder.glob("**/*"):
        if file_path.suffix.lower() in image_extensions:
            image_paths.append(file_path)
    return image_paths
```

Remember to access and run the complete code in this cell.

Initialize the Encord Active project

Next, initialize a local project using Encord Active's `init_local_project` function. This function provides the same functionality as running the `init` command in the CLI. If you prefer using the CLI, please refer to the “Quick import data & labels” guide.

```python
# `image_files` and `projects_dir` are defined in the full notebook, e.g.
# image_files = collect_all_images(huggingface_dataset_path)
try:
    project_path: Path = init_local_project(
        files=image_files,
        target=projects_dir,
        project_name="sample_ea_project",
        symlinks=False,
    )
except ProjectExistsError as e:
    project_path = Path("./sample_ea_project")
    print(e)  # A project already exists with that name at the given path.
```

Compute image embeddings and analyze them with metrics

Analyzing raw image data directly in computer vision can often be impractical due to the high dimensionality of images. A common practice is to compute embeddings for the images to compress the dimensions, then run metrics on these embeddings to glean insights and evaluate the images. Ideally, you compute the embeddings using pre-trained (convolutional neural network) models. The pre-trained models capture the essential features of the images while reducing the data dimensionality. Once you obtain the embeddings, run similarity, clustering, and classification metrics to analyze different aspects of the dataset.

Computing embeddings and running metrics on them can take quite a bit of manual effort. Enter Encord Active! Encord Active provides utility functions to run predefined subsets of metrics, or you can import your own sets of metrics. It computes the image embeddings and runs the metrics by the type of embeddings. Encord Active has three different types of embeddings:
- Image embeddings: general embeddings for each image or frame in the dataset.
- Classification embeddings: associated with specific frame-level classifications.
- Object embeddings: associated with specific objects, like polygons or bounding boxes.

Use the `run_metrics_by_embedding_type` function to execute quality metrics on the images, specifying the embedding type as `IMAGE`:

```python
run_metrics_by_embedding_type(
    EmbeddingType.IMAGE,
    data_dir=project_path,
    use_cache_only=True
)
```

The `use_cache_only=True` parameter tells Encord Active to use only cached data when executing the metrics rather than recomputing values or fetching fresh data. This can be a useful feature for saving computational resources and time, especially when working with large datasets or expensive computations.

Create a `Project` object using the `project_path`; you will use this for further interactions with the project:

```python
ea_project = Project(project_path)
```

Exploring the Quality of Images from the Hugging Face Datasets Library

Now that you have set up your project, it’s time to explore the images! There are typically two ways you could visualize images with Encord Active (EA):
- Through the web application (Encord Active UI).
- Combining EA with visualization libraries to display the embeddings based on the metrics.

We’ll use the latter in this article.
You will import helper functions and modules from Encord Active along with visualization libraries (`matplotlib` and `plotly`). This code cell contains the list of the modules and helper functions.

Pre-defined subset of metrics in Encord Active

Next, iterate through the data quality metrics in Encord Active to see the list of available metrics, access the name attribute of each metric object, and construct a list of these names:

```python
[metric.name for metric in available_metrics]
```

You should get a similar output:

There are several quality metrics to explore, so let’s define and use helper functions that let you visualize the embeddings.

Helper functions for displaying images and visualizing the metrics

Define the `plot_top_k_images` function to plot the top k images for a metric:

```python
def plot_top_k_images(metric_name: str, metrics_data_summary: MetricsSeverity, project: Project,
                      k: int, show_description: bool = False, ascending: bool = True):
    metric_df = metrics_data_summary.metrics[metric_name].df
    metric_df.sort_values(by='score', ascending=ascending, inplace=True)

    for _, row in metric_df.head(k).iterrows():
        image = load_or_fill_image(row, project.file_structure)
        plt.imshow(image)
        plt.show()
        print(f"{metric_name} score: {row['score']}")
        if show_description:
            print(f"{row['description']}")
```

The function sorts the DataFrame of metric scores, iterates through the top `k` images in your dataset, loads each image, and plots it using Matplotlib. It also prints the metric score and, optionally, the description of each image. You will use this function to plot images based on the metrics you define.

Next, define a `plot_metric_distribution` function that creates a histogram of the specified metric scores using Plotly:

```python
def plot_metric_distribution(metric_name: str, metrics_data_summary: MetricsSeverity):
    fig = px.histogram(metrics_data_summary.metrics[metric_name].df, x="score", nbins=50)
    fig.update_layout(title=f"{metric_name} score distribution", bargap=0.2)
    fig.show()
```

Run the function to visualize the score distribution based on the “Aspect Ratio” metric:

```python
plot_metric_distribution("Aspect Ratio", metrics_data_summary)
```

Most images in the dataset have aspect ratios close to 1.5, in a roughly normal distribution. The set has only a few images with extremely small or large proportions.

Use EA’s `create_image_size_distribution_chart` function to plot the size distribution of your images:

```python
image_sizes = get_all_image_sizes(ea_project.file_structure)
fig = create_image_size_distribution_chart(image_sizes)
fig.show()
```

As you probably expected for an open-source dataset for computer vision applications, there is a dense cluster of points in the lower-left corner of the graph, indicating that many images have smaller resolutions, mostly below 2000 pixels in width and height. A few points are scattered further to the right, indicating images with a much larger width but not necessarily a proportional increase in height. These could represent panoramic images or images with unique aspect ratios. You’ll identify such images in subsequent sections.

Inspect the Problematic Images

What are the severe and moderate outliers in the image set?

You might also need insights into the distribution and severity of outliers across various imaging attributes. The attributes include metrics such as green values, blue values, area, etc. Use the `create_outlier_distribution_chart` utility to plot image outliers based on all the available metrics in EA.
The outliers are categorized into two levels: "severe outliers" (represented in red, "tomato") and "moderate outliers" (represented in orange):

```python
available_metrics = load_available_metrics(ea_project.file_structure.metrics)
metrics_data_summary = get_metric_summary(available_metrics)
all_metrics_outliers = get_all_metrics_outliers(metrics_data_summary)

fig = create_outlier_distribution_chart(all_metrics_outliers, "tomato", "orange")
fig.show()
```

Here’s the result:

"Green Values," "Blue Values," and "Area" appear to be the most susceptible to outliers, while attributes like "Random Values on Images" have the fewest in the 'sasha/dog-food' dataset. This primarily means there are lots of images that have abnormally high values of green and blue tints. This could be due to the white balance settings of the cameras that captured the images or low-quality sensors. If your model trains on this set, it’s likely that more balanced images may perturb the performance.

What are the blurry images in the image set?

Depending on your use case, you might discover that blurry images can sometimes deter your model. A model trained on clear images and then tested or used on blurry ones may not perform well. If the blur could lead to misinterpretations and errors with significant consequences, you might want to explore the blurry images to remove or enhance them.

```python
plot_top_k_images('Blur', metrics_data_summary, ea_project, k=5, ascending=False)
```

Based on a blur score of -9.473 calculated by Encord Active, here is the output with one of the five blurriest images:

What are the darkest images in the image set?

Next, surface images with poor lighting or low visibility. Dark images can indicate issues with quality. These could result from poor lighting during capture, incorrect exposure settings, or equipment malfunctions. Also, a model might struggle to recognize patterns in such images, which could reduce accuracy. Identify and correct these images to improve the overall training data quality.

```python
plot_top_k_images('Brightness', metrics_data_summary, ea_project, k=5, ascending=True)
```

The resulting image reflects a low brightness score of 0.164:

What are the duplicate or nearly similar images in the set?

Image singularity, in the context of image quality, refers to images with unique or atypical characteristics compared to most images in a dataset. Duplicate images can highlight potential issues in the data collection or processing pipeline. For instance, they could result from artifacts from a malfunctioning sensor or a flawed image-processing step. In computer vision tasks, duplicate images can disproportionately influence the trained model, especially if the dataset is small. Identify and address these images to improve the robustness of your model.

Use the “Image Singularity” metric to determine the score and the images that are near duplicates:

```python
plot_top_k_images('Image Singularity', metrics_data_summary, ea_project, k=15, show_description=True)
```

Here, you can see two nearly identical images with similar “Image Singularity” scores: the tiny difference between the singularity scores of the two images, 0.01299857 for the left and 0.012998693 for the right, shows how similar they are. Check out other similar or duplicate images by running this code cell.

Awesome! You have played with a few pre-defined quality metrics. See the complete code to run other data quality metrics on the images.

Next Steps: Fixing Data Quality Issues

Identifying problematic images is half the battle.
Ideally, the next step is to act on those insights and fix the issues. Encord Active (EA) can help you tag problematic images, which may otherwise skew model performance downstream. Post-identification, you can employ various strategies to rectify these issues. Below are some ways to fix problematic image issues.

Tagging and annotation

Once you identify the problematic images, you can tag them within EA. One of the most common workflows we see from our users at Encord is identifying image quality issues at scale with Encord Active, tagging problematic images, and sending them upstream for annotation with Annotate.

Re-labeling

Incorrect labels can significantly hamper model performance. EA facilitates the re-labeling process by exporting the incorrectly labeled images to an annotation platform like Encord Annotate, where you can correct the labels.

Active learning

Use active learning techniques to iteratively improve the quality of the dataset. You can establish a continuous improvement cycle by training the model on good-quality datasets and then evaluating it on low-quality datasets to suggest which data to improve.

Active learning (encord.com)

{{light_callout_start}} Check out our practical guide to active learning for computer vision to learn more about active learning, its tradeoffs, alternatives, and a comprehensive explanation of active learning pipelines. {{light_callout_end}}

Image augmentation and correction

Image augmentation techniques enhance the diversity and size of the dataset to improve model robustness. Consider augmenting the data using techniques like rotation, scaling, cropping, and flipping. Some images may require corrections like brightness adjustment, noise reduction, or other image-processing techniques to meet the desired quality standards.

Image quality is not a one-time task but a continuous process. Regularly monitoring and evaluating your image quality will help maintain a high-quality dataset, which is pivotal for achieving superior model performance.

Key Takeaways

In this article, you defined the objective of training a binary classification model for your use case. Technically, you “gathered” human labels since the open 'sasha/dog-food' dataset was already labeled on Hugging Face. Finally, using Encord Active, you computed image embeddings, ran metrics on those embeddings, and inspected the problematic images by exploring the dataset based on objective quality metrics. Identifying and fixing the errors in the dataset will set up your downstream model training and ML application for success.

If you are interested in exploring this topic further, there’s an excellent article from Aliaksei Mikhailiuk that perfectly describes the task of image quality assessment in three stages:
- Define an objective
- Gather the human labels for your dataset
- Train objective quality metrics on the data

{{Active_CTA}}
October 19