Encord Blog

Stay up to date with the latest in DataOps, Computer Vision, Machine
Learning, and Data-Centric AI.

Featured Blog

How To Fine-Tune Segment Anything

Computer vision is having its ChatGPT moment with the release of the Segment Anything Model (SAM) by Meta last week. Trained on over 1 billion segmentation masks, SAM is a foundation model for predictive AI use cases rather than generative AI. While it has shown remarkable flexibility in segmenting a wide range of image modalities and problem spaces, it was released without “fine-tuning” functionality. This tutorial outlines the key steps to fine-tune SAM using the mask decoder, in particular which functions from SAM to use to pre- and post-process the data so that it is in good shape for fine-tuning.

{{light_callout_start}} In our upcoming webinar, we unpack how to fine-tune foundation models for auto-labeling, with a live demo of how to fine-tune SAM. Sign up here. {{light_callout_end}}

What is the Segment Anything Model (SAM)?

The Segment Anything Model (SAM) is a segmentation model developed by Meta AI. It is considered the first foundation model for computer vision. SAM was trained on a huge corpus of data containing millions of images and over a billion masks, making it extremely powerful. As its name suggests, SAM is able to produce accurate segmentation masks for a wide variety of images. SAM’s design allows it to take human prompts into account, making it particularly powerful for human-in-the-loop annotation. These prompts can be multi-modal: they can be points on the area to be segmented, a bounding box around the object to be segmented, or a text prompt about what should be segmented.

The model is structured into 3 components: an image encoder, a prompt encoder and a mask decoder. The image encoder generates an embedding for the image being segmented, whilst the prompt encoder generates an embedding for the prompts. The image encoder is a particularly large component in the model.
This is in contrast to the lightweight mask decoder, which predicts segmentation masks based on the embeddings. Meta AI has made the weights and biases of the model trained on the Segment Anything 1 Billion Mask (SA-1B) dataset available as a model checkpoint.

{{light_callout_start}} Learn more about how Segment Anything works in our explainer blog post Segment Anything Model (SAM) Explained. {{light_callout_end}}

What is Model Fine-Tuning?

Publicly available state-of-the-art models have a custom architecture and are typically supplied with pre-trained model weights. If these architectures were supplied without weights, the models would need to be trained from scratch by the users, who would need massive datasets to obtain state-of-the-art performance.

Model fine-tuning is the process of taking a pre-trained model (architecture + weights) and showing it data for a particular use case. This will typically be data that the model hasn’t seen before, or that is underrepresented in its original training dataset.

The difference between fine-tuning the model and starting from scratch is the starting value of the weights and biases. If we were training from scratch, these would be randomly initialised according to some strategy. In such a starting configuration, the model would ‘know nothing’ of the task at hand and perform poorly. By using pre-existing weights and biases as a starting point, we can ‘fine-tune’ them so that our model works better on our custom dataset. For example, the information learnt to recognise cats (edge detection, counting paws) will be useful for recognising dogs.

Why Would I Fine-Tune a Model?

The purpose of fine-tuning a model is to obtain higher performance on data which the pre-trained model has not seen before. For example, an image segmentation model trained on a broad corpus of data gathered from phone cameras will have mostly seen images from a horizontal perspective.
If we tried to use this model for satellite imagery taken from a vertical perspective, it may not perform as well. If we were trying to segment rooftops, the model may not yield the best results. The pre-training is useful because the model will have learnt how to segment objects in general, so we want to take advantage of this starting point to build a model which can accurately segment rooftops. Furthermore, it is likely that our custom dataset would not have millions of examples, so we want to fine-tune instead of training the model from scratch.

Fine-tuning is desirable so that we can obtain better performance on our specific use case, without having to incur the computational cost of training a model from scratch.

How to Fine-Tune Segment Anything Model [With Code]

Background & Architecture

We gave an overview of the SAM architecture in the introduction section. The image encoder has a complex architecture with many parameters. To fine-tune the model, it makes sense for us to focus on the mask decoder, which is lightweight and therefore easier, faster and more memory efficient to fine-tune.

To fine-tune SAM, we need to extract the underlying pieces of its architecture (image and prompt encoders, mask decoder). We cannot use SamPredictor.predict for two reasons:

1. We want to fine-tune only the mask decoder.
2. This function calls SamPredictor.predict_torch, which has the @torch.no_grad() decorator that prevents us from computing gradients.

Thus, we need to examine the SamPredictor.predict function and call the appropriate functions with gradient calculation enabled on the part we want to fine-tune (the mask decoder). Doing this is also a good way to learn more about how SAM works.
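The effect of torch.no_grad() on gradient flow can be seen with a toy example. The sketch below is not SAM code: the two nn.Linear layers are hypothetical stand-ins for a frozen image encoder and a trainable mask decoder, and it assumes PyTorch is installed.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins, NOT SAM modules: a "frozen" encoder and a trainable decoder.
encoder = nn.Linear(4, 3)
decoder = nn.Linear(3, 1)

x = torch.randn(2, 4)

# Anything computed under no_grad is left out of the autograd graph.
with torch.no_grad():
    embedding = encoder(x)

# The decoder runs with gradient tracking enabled.
output = decoder(embedding)
loss = output.pow(2).mean()
loss.backward()

encoder_has_grad = encoder.weight.grad is not None
decoder_has_grad = decoder.weight.grad is not None
print(encoder_has_grad, decoder_has_grad)  # prints: False True
```

Only the decoder receives gradients, which is the behaviour we want when tuning the mask decoder alone. It also shows why a function decorated with @torch.no_grad() cannot be used for training.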
Creating a Custom Dataset

We need three things to fine-tune our model:

- Images on which to draw segmentations
- Segmentation ground truth masks
- Prompts to feed into the model

We chose the stamp verification dataset since it has data which SAM may not have seen in its training (i.e., stamps on documents). We can verify that it performs well, but not perfectly, on this dataset by running inference with the pre-trained weights. The ground truth masks are also extremely precise, which will allow us to calculate accurate losses. Finally, this dataset contains bounding boxes around the segmentation masks, which we can use as prompts to SAM. An example image is shown below. These bounding boxes align well with the workflow that a human annotator would go through when looking to generate segmentations.

Input Data Preprocessing

We need to preprocess the scans from numpy arrays to pytorch tensors. To do this, we can follow what happens inside SamPredictor.set_image and SamPredictor.set_torch_image, which preprocess the image. First, we can use utils.transform.ResizeLongestSide to resize the image, as this is the transform used inside the predictor. We can then convert the image to a pytorch tensor and use the SAM preprocess method to finish preprocessing.

Training Setup

We download the model checkpoint for the vit_b model and load it in:

    sam_model = sam_model_registry['vit_b'](checkpoint='sam_vit_b_01ec64.pth')

We can set up an Adam optimizer with defaults and specify that the parameters to tune are those of the mask decoder:

    optimizer = torch.optim.Adam(sam_model.mask_decoder.parameters())

At the same time, we can set up our loss function, for example Mean Squared Error:

    loss_fn = torch.nn.MSELoss()

Training Loop

In the main training loop, we will be iterating through our data items, generating masks and comparing them to our ground truth masks so that we can optimise the model parameters based on the loss function.
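One detail from the preprocessing step matters inside the loop: the longest-side resize changes the image shape, so box prompts given in original pixel coordinates need the same scaling. The sketch below shows that arithmetic in plain Python. The function names are hypothetical and the 1024-pixel target is an assumption (it is the input size commonly used with SAM). In practice the library's own ResizeLongestSide should do this work.

```python
def longest_side_shape(old_h: int, old_w: int, target: int = 1024) -> tuple:
    """Return (new_h, new_w) so the longer side equals `target`, keeping aspect ratio."""
    scale = target / max(old_h, old_w)
    # Round to the nearest pixel.
    return int(old_h * scale + 0.5), int(old_w * scale + 0.5)


def scale_box(box, old_hw, new_hw):
    """Scale an (x0, y0, x1, y1) box from the original image to the resized image."""
    old_h, old_w = old_hw
    new_h, new_w = new_hw
    x0, y0, x1, y1 = box
    return (x0 * new_w / old_w, y0 * new_h / old_h,
            x1 * new_w / old_w, y1 * new_h / old_h)


original_hw = (600, 800)  # (height, width) of a hypothetical scan
resized_hw = longest_side_shape(*original_hw)
box = scale_box((100, 50, 300, 200), original_hw, resized_hw)
print(resized_hw, box)  # prints: (768, 1024) (128.0, 64.0, 384.0, 256.0)
```

Keeping the original and resized shapes around is useful later, since the predicted masks are upscaled back to the original image size.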
In this example we used a GPU for training since it is much faster than using a CPU. It is important to use .to(device) on the appropriate tensors to make sure that we don’t have certain tensors on the CPU and others on the GPU.

We want to embed images by wrapping the encoder in the torch.no_grad() context manager, since otherwise we will have memory issues, along with the fact that we are not looking to fine-tune the image encoder.

    with torch.no_grad():
        image_embedding = sam_model.image_encoder(input_image)

We can also generate the prompt embeddings within the no_grad context manager. We use our bounding box coordinates, converted to pytorch tensors.

    with torch.no_grad():
        sparse_embeddings, dense_embeddings = sam_model.prompt_encoder(
            points=None,
            boxes=box_torch,
            masks=None,
        )

Finally, we can generate the masks. Note that here we are in single mask generation mode (in contrast to the 3 masks that are normally output).

    low_res_masks, iou_predictions = sam_model.mask_decoder(
        image_embeddings=image_embedding,
        image_pe=sam_model.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse_embeddings,
        dense_prompt_embeddings=dense_embeddings,
        multimask_output=False,
    )

The final step here is to upscale the masks back to the original image size, since they are low resolution. We can use Sam.postprocess_masks to achieve this. We will also want to generate binary masks from the predicted masks so that we can compare these to our ground truths. It is important to use torch functionals in order to not break backpropagation.
    upscaled_masks = sam_model.postprocess_masks(low_res_masks, input_size, original_image_size).to(device)

    from torch.nn.functional import threshold, normalize

    binary_mask = normalize(threshold(upscaled_masks, 0.0, 0)).to(device)

Finally we can calculate the loss and run an optimisation step:

    loss = loss_fn(binary_mask, gt_binary_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

By repeating this over a number of epochs and batches we can fine-tune the SAM decoder.

Saving Checkpoints and Starting a Model from it

Once we are done with training and satisfied by the performance uplift, we can save the state dict of the tuned model using:

    torch.save(sam_model.state_dict(), PATH)

We can then load this state dict when we want to perform inference on data that is similar to the data we used to fine-tune the model.

{{light_callout_start}} You can find the Colab Notebook with all the code you need to fine-tune SAM here. Keep reading if you want a fully working solution out of the box! {{light_callout_end}}

Fine-Tuning for Downstream Applications

While SAM does not currently offer fine-tuning out of the box, we are building a custom fine tuner integrated with the Encord platform. As shown in this post, we fine-tune the decoder in order to achieve this. This is available as an out of the box one click procedure in the web app, where the hyperparameters are automatically set.

Original vanilla SAM mask:

Mask generated by fine-tuned version of the model:

We can see that this mask is tighter than the original mask. This was the result of fine-tuning on a small subset of images from the stamp verification dataset, and then running the tuned model on a previously unseen example. With further training and more examples we could obtain even better results.

Conclusion

That's all, folks! You have now learned how to fine-tune the Segment Anything Model (SAM).
If you're looking to fine-tune SAM out of the box, you might also be interested to learn that we have recently released the Segment Anything Model in Encord, allowing you to fine-tune the model without writing any code.
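As a footnote to the mask comparison in this post: one common way to put a number on how "tight" a mask is against the ground truth is intersection over union (IoU). The original post compares the masks visually and does not report this metric, so the sketch below is an illustrative assumption, with small hand-made masks rather than real data.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; this also avoids dividing by zero.
    return 1.0 if union == 0 else intersection / union


# Hypothetical 3x4 masks, made up for illustration.
ground_truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
loose_mask = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
tight_mask = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

loose_iou = mask_iou(loose_mask, ground_truth)
tight_iou = mask_iou(tight_mask, ground_truth)
print(loose_iou, tight_iou)  # prints: 0.5 0.75
```

A higher IoU on held-out images after fine-tuning would support the visual impression that the tuned masks are tighter.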

OpenAI’s DALL-E 3 Explained: Generate Images with ChatGPT

In the field of image generation, OpenAI continues to push the boundaries of what’s possible. On September 20th, 2023, Sam Altman announced DALL-E 3, which is set to revolutionize the world of text-to-image generation. Fueled by Microsoft's support, the firm is strategically harnessing ChatGPT's surging popularity to maintain its leadership in generative AI, a critical move given the escalating competition from industry titans like Google and emerging disruptors like Bard, Midjourney, and Stability AI.

{{light_callout_start}} Interested in fine-tuning vision foundation models? Try Encord! {{light_callout_end}}

DALL-E 3: What We Know So Far

DALL-E 3 is a text-to-image model built upon DALL-E 2 and ChatGPT. It excels in understanding and translating textual descriptions into highly detailed and accurate images.

{{light_callout_start}} Watch the demo video for DALL-E 3! {{light_callout_end}}

While this powerful AI model is still in research preview, there's already a lot to be excited about. Here's a glimpse into what we know so far about DALL-E 3:

Eliminating Prompt Engineering

DALL-E 3 is set to redefine how we think about generating images from text. Modern text-to-image systems often fall short by ignoring words or descriptions, thereby requiring users to master the art of prompt engineering. In contrast, DALL-E 3 represents a remarkable leap forward in our ability to generate images that precisely adhere to the text provided, eliminating the complexities of prompt engineering. Integrated seamlessly with ChatGPT, DALL-E 3 acts as a creative partner, allowing users to effortlessly bring their ideas to life by generating tailored and visually stunning images from simple sentences to detailed paragraphs.

DALL-E 3 Improved Precision

DALL-E 3 is set to redefine how we think about generating images from text prompts.
Previously, DALL-E, like other generative AI models, had trouble interpreting complex text prompts and often mixed up two concepts while generating images. Unlike its predecessors, this model is designed to understand text prompts with remarkable precision, capturing nuance and detail like never before.

Focus on Ethical AI

OpenAI is acutely aware of the ethical considerations that come with image generation models. To address these concerns, DALL-E 3 incorporates safety measures that restrict the generation of violent, adult, or hateful content. Moreover, it has mitigations in place to avoid generating images of public figures by name, thereby safeguarding privacy and reducing the risk of misinformation.

OpenAI's commitment to ethical AI is further underscored by its collaboration with red teamers and domain experts. These partnerships aim to rigorously test the model and identify and mitigate potential biases, ensuring that DALL-E 3 is a responsible and reliable tool.

Just this week, OpenAI unveiled the "OpenAI Red Teaming Network," a program designed to seek out experts across diverse domains. The aim is to engage these experts in evaluating OpenAI's models, thereby contributing to the informed assessment of risks and the implementation of mitigation strategies throughout the entire lifecycle of model and product development.

Transparency

As AI-generated content becomes more prevalent, the need for transparency in identifying such content grows. OpenAI is actively researching ways to help people distinguish AI-generated images from those created by humans. They are experimenting with a provenance classifier, an internal tool designed to determine whether an image was generated by DALL-E 3. This initiative reflects OpenAI's dedication to transparency and responsible AI usage.
DALL-E 3

This latest iteration of DALL-E is scheduled for an initial release in early October, starting with ChatGPT Plus and ChatGPT Enterprise customers, with subsequent availability in research labs and through its API service in the autumn. OpenAI intends to roll out DALL-E 3 in phases but has not yet confirmed a specific date for a free public release.

{{light_callout_start}} When DALL-E 3 is launched, you'll discover an in-depth explanation article about it on Encord! Stay tuned! {{light_callout_end}}

Recommended Topics for Pre-Release Reading

To brace yourself for the release and help you dive right into it, here are some suggested topics you can explore:

Transformers

Transformers are foundational architectures in the field of artificial intelligence, revolutionizing the way machines process and understand sequential data. Unlike traditional models that operate sequentially, Transformers employ parallel processing, making them exceptionally efficient. They use mechanisms like attention to weigh the importance of different elements in a sequence, enabling tasks such as language translation, sentiment analysis, and image generation. Transformers have become the cornerstone of modern AI, underpinning advanced models such as DALL-E and ChatGPT.

{{light_callout_start}} For more information about Vision Transformers, read Introduction to Vision Transformers (ViT) {{light_callout_end}}

Foundation Models

Foundation models are the bedrock of contemporary artificial intelligence, representing a transformative breakthrough in machine learning. These models are pre-trained on vast datasets, equipping them with a broad understanding of language and knowledge. GPT-3 and DALL-E, for instance, are prominent foundation models developed by OpenAI. These models serve as versatile building blocks upon which more specialized AI systems can be constructed.
After pre-training on extensive text data from the internet, they can be fine-tuned for specific tasks, including natural language understanding, text generation, and even text-to-image conversion, as seen in DALL-E 3. Their ability to generalize knowledge and adapt to diverse applications underscores their significance in AI's rapid advancement.

Foundation models have become instrumental in numerous fields, including large language models, AI chatbots, content generation, and more. Their capacity to grasp context, generate coherent responses, and perform diverse language-related tasks makes them invaluable tools for developers and researchers. Moreover, the flexibility of foundation models opens doors to creative and practical applications across various industries.

{{light_callout_start}} For more information about foundation models, read The Full Guide to Foundation Models {{light_callout_end}}

Text-to-Image Generation

Text-to-image generation is a cutting-edge field in artificial intelligence that bridges the gap between textual descriptions and visual content creation. In this remarkable domain, AI models use neural networks to translate written text into vivid, pixel-perfect images. These models understand and interpret textual input, capturing intricate details, colors, and context to produce striking visual representations. Text-to-image generation finds applications in art, design, content creation, and more, offering a powerful tool for bringing creative ideas to life. As AI in this field continues to advance, it holds the promise of revolutionizing how we communicate and create visual content, offering exciting possibilities for artists, designers, and storytellers.

{{light_callout_start}} Read the paper Zero-Shot Text-to-Image Generation by A. Ramesh, et al. from OpenAI to understand how DALL-E generates images! {{light_callout_end}}

September 21

5 min

Webinar: How to Fine Tune Foundation Models to Auto-Label Training Data

Foundation models, like Meta’s Segment Anything Model (SAM), have provided a host of benefits for data and ML teams looking to expedite the production of training data whilst improving the quality. This webinar walks you through how to go one step further and fine-tune foundation models, in particular Meta AI's SAM, to maximize relevance to your specific use case.

Here are the key resources from the webinar:

- Encord Active GitHub: our open source tool that allowed us to conduct our research
- The Google Colab Notebook used in yesterday’s session
- Our ML Solutions Engineer’s Fine-Tune SAM blog post

September 21

60 min

What to Expect From OpenAI’s GPT-Vision vs. Google’s Gemini

With Google gearing up to release Gemini this fall to rival OpenAI’s GPT-Vision, it is going to be the Oppenheimer vs. Barbie of generative AI. OpenAI and Google have been teasing their ground-breaking advancements in multimodal learning. Let's discuss what we know so far.

Google’s Gemini: What We Know So Far

At the May 2023 Google I/O developer conference, CEO Sundar Pichai unveiled Google's upcoming artificial intelligence (AI) system, codenamed Gemini. Developed by Google DeepMind, the division formed by combining the Brain Team and DeepMind, Gemini represents a groundbreaking advancement in AI. While detailed information remains confidential, recent interviews and reports have provided intriguing insights into the power and potential of Google's Gemini.

{{light_callout_start}} Interested in fine-tuning foundation models? Contact sales to discuss your use case. {{light_callout_end}}

Gemini’s Multimodal Integration

Google CEO Sundar Pichai emphasized that Gemini combines DeepMind's AlphaGo strengths with extensive language modeling capabilities. With a multimodal design, Gemini seamlessly integrates text, images, and other data types, enabling more natural conversational abilities. Pichai also hinted at the potential for memory and planning features, which opens doors for tasks requiring advanced reasoning.

Diverse Sizes and Capabilities

Demis Hassabis, the CEO of DeepMind, provides insight into the versatility of Gemini. Drawing inspiration from AlphaGo's techniques such as reinforcement learning and tree search, Gemini is poised to acquire reasoning and problem-solving abilities. This "series of models" will be available in various sizes and capabilities, making it adaptable to a wide range of applications.

Enhancing Accuracy and Content Quality

Hassabis suggested that Gemini may employ techniques like fact-checking against sources such as Google Search and improved reinforcement learning.
These measures are aimed at ensuring higher accuracy and reducing the generation of problematic or inaccurate content.

Universal Personal Assistant

In a recent interview, Sundar Pichai discussed Gemini's place in Google's product roadmap. He made it clear that conversational AI systems like Bard represent mere waypoints, not the ultimate goal. Pichai envisions Gemini and its future iterations as "incredible universal personal assistants," seamlessly integrated into people's daily lives, spanning various domains such as travel, work, and entertainment. He even suggests that today's chatbots will appear "trivial" compared to Gemini's capabilities within a few years.

GPT-Vision: What We Know So Far

OpenAI recently introduced GPT-4, a multimodal model that has the ability to process both textual and visual inputs, and in turn, generate text-based outputs. GPT-4, which was unveiled in March, was initially made available to the public through a subscription-based API with limited usage. It is speculated that the full potential of GPT-4 will be revealed in the autumn as GPT-Vision, coinciding with the launch of Google’s Gemini.

GPT-4 Technical Report

According to the paper published by OpenAI, the following is the current information available on GPT-Vision:

Transformer-Based Architecture

At its core, GPT-Vision utilizes a Transformer-based architecture that is pre-trained to predict the next token in a document, similar to its predecessors. Post-training alignment processes have further improved the model's performance, particularly in terms of factuality and adherence to desired behavior.

Human-Level Performance

GPT-4's capabilities are exemplified by its human-level performance on a range of professional and academic assessments. For instance, it achieves remarkable success in a simulated bar exam, with scores that rank among the top 10% of test takers.
This accomplishment marks a significant improvement over its predecessor, GPT-3.5, which scored in the bottom 10% on the same test. GPT-Vision is expected to show similar performance, if not better.

Reliable Scaling and Infrastructure

A crucial aspect of GPT-4's development involved establishing robust infrastructure and optimization methods that behave predictably across a wide range of scales. This predictability allowed OpenAI to accurately anticipate certain aspects of GPT-Vision's performance, even based on models trained with a mere fraction of the computational resources.

Test-Time Techniques

GPT-4 effectively leverages well-established test-time techniques developed for language models, such as few-shot prompting and chain-of-thought. These techniques enhance its adaptability and performance when handling both images and text.

GPT-4 Technical Report

Recommended Pre-release Reading

Multimodal Learning

Multimodal learning is a fascinating field within artificial intelligence that focuses on training models to understand and generate content across multiple modalities. These modalities encompass text, images, audio, and more. The main goal of multimodal learning is to empower AI systems to comprehend and generate information from various sensory inputs simultaneously. Multimodal learning demonstrates tremendous potential across numerous domains, including natural language processing, computer vision, speech recognition, and other areas where information is presented in diverse formats.

{{light_callout_start}} Interested in multimodal learning? Read Introduction to Multimodal Deep Learning {{light_callout_end}}

Generative AI

Generative AI refers to the development of algorithms and models that have the capacity to generate new content, such as text, images, music, or even video, based on patterns and data they've learned during training.
These models are not only fascinating but also incredibly powerful, as they have the ability to create content that closely resembles human-produced work. Generative AI encompasses a range of techniques, including generative adversarial networks (GANs), autoencoders, and transformer-based models. It has wide-ranging applications, from creative content generation to data augmentation and synthesis.

Transformers

Transformers are a class of neural network architectures that have significantly reshaped the field of deep learning. Introduced in the landmark paper "Attention Is All You Need" by Vaswani et al. in 2017, Transformers excel at processing sequential data. They employ self-attention mechanisms to capture relationships and dependencies between elements in a sequence, making them highly adaptable for various tasks. Transformers have revolutionized natural language processing, enabling state-of-the-art performance in tasks like machine translation and text generation. Their versatility extends to other domains, including computer vision, audio processing, and reinforcement learning, making them a cornerstone in modern AI research.

{{light_callout_start}} Interested in Vision Transformers? Read Introduction to Vision Transformers (ViT) {{light_callout_end}}

Future Advancements in Multimodal Learning

Recent Advances and Trends in Multimodal Deep Learning: A Review

Multimodal Image Description

- Enhanced language generation models for accurate and grammatically correct captions.
- Advanced attention-based image captioning mechanisms.
- Incorporation of external knowledge for context-aware image descriptions.
- Multimodal models for auto video subtitling.

Multimodal Video Description

- Advancements in video dialogue systems for human-like interactions with AI.
- Exploration of audio feature extraction to improve video description in the absence of visual cues.
- Leveraging real-world event data for more accurate video descriptions.
- Research on combining video description with machine translation for efficient subtitling.
- Focus on making video subtitling processes cost-effective.

Multimodal Visual Question Answering (VQA)

- Design of goal-oriented datasets to support real-time applications and specific use cases.
- Exploration of evaluation methods for open-ended VQA frameworks.
- Integration of context or linguistic information to enhance VQA performance.
- Adoption of context-aware image feature extraction techniques.

Multimodal Speech Synthesis

- Enhancement of data efficiency for training End-to-End (E2E) DLTTS (Deep Learning Text-to-Speech) models.
- Utilization of specific context or linguistic information to bridge the gap between text and speech synthesis.
- Implementation of parallelization techniques to improve efficiency in DLTTS models.
- Integration of unpaired text and speech recordings for data-efficient training.
- Exploration of new feature learning techniques to address the "curse of dimensionality" in DLTTS.
- Research on the application of speech synthesis for voice conversion, translation, and cross-lingual speech conversion.

Multimodal Emotion Recognition

- Development of advanced modeling and recognition techniques for non-invasive emotion analysis.
- Expansion of multimodal emotion recognition datasets for better representation.
- Investigation into the preprocessing of complex physiological signals for emotion detection.
- Research on the application of automated emotion recognition in real-world scenarios.

Multimodal Event Detection

- Advancements in feature learning techniques to address the "curse of dimensionality" issue.
- Integration of textual data with audio and video media for comprehensive event detection.
- Synthesizing information from multiple social platforms using transfer learning strategies.
- Development of event detection models that consider real-time applications and user interactions.
- Designing goal-oriented datasets for event detection in specific domains and applications.
- Exploration of new evaluation methods for open-ended event detection frameworks.

September 20

5 min

A Complete Guide to Text Annotation

Have you ever considered the sources from which AI models acquire language? Or the extensive effort required to curate high-quality data to power today's sophisticated language systems?

By the end of the guide, you will be able to answer the following questions:

- What is text annotation?
- What are some types of text annotation?
- How is text annotated?

What is text annotation?

Traditionally, text annotation involves adding comments, notes, or footnotes to a body of text. This practice is commonly seen when editors review a draft, adding notes or useful comments (i.e. annotations) before passing it on for corrections.

In the context of machine learning, the term takes on a slightly different meaning. It refers to the systematic process of labeling pieces of text to generate a ground truth. The labeled data ensures that a supervised machine learning algorithm can accurately interpret and understand the data.

What does it mean to annotate text?

In the data science world, annotating text is a process that requires a deep understanding of both the problem at hand and the data itself to identify relevant features and label them accordingly. This can be likened to the task of labeling cats and dogs in several images for image classification.

In text classification, annotating text would mean looking at sentences and putting each in a predefined category, like labeling online reviews as positive or negative, or news clippings as fake or real.

Other tasks, such as labeling parts of speech (like nouns, verbs, subjects, etc.), labeling key phrases or words in a text for named entity recognition (NER), or summarizing a long article or research paper in a few hundred words, also come under annotating text.

A Comprehensive Guide to Named Entity Recognition (NER) (Turing.com)

What are the benefits of text annotation?
Doing what we described above enables a machine learning algorithm to identify different categories and use the data corresponding to these labels to learn what the data from each category typically looks like. This speeds up the learning task and improves the algorithm’s performance in the real world.

Learning without labels, while common today in NLP, is challenging because it is left to the algorithm to identify the nuances of the English language without any additional help, and to recognize them when the model is put out in the real world. In text classification, for instance, a negative piece of text might be veiled in sarcasm. A human reader would instantly recognize this, but an algorithm might see the sarcastically positive words as just positive! Text annotations and labels are invaluable in these cases.

Large companies developing powerful language models today also rely on text annotation for a number of important use cases. Social media companies use it to flag inappropriate comments or posts, online forums to flag bots and spammy content, and news websites to remove fake or low-quality pieces. Even apps for basic search engines and chatbots can be trained to extract information from their queries.

Image by Author

What are some types of annotation styles?

Since there are several tasks of varying nature for language interpretation in natural language processing, annotating and preparing the training data for each of them has a different objective. However, there are some standard approaches that cover the basic NLP tasks like classifying text and parts of text. While these may not cover generative text tasks like text summarization, they are important in understanding the different approaches to label a text.

Text Classification

Just as it sounds, a text classification model is meant to take a piece of text (sentence, phrase or paragraph) and determine what category it belongs to.
Document classification involves the categorization of long texts, often with multiple pages. This annotation process involves the annotators reading every text sample and determining which one of the context-dependent predefined categories each sample belongs to. Typical examples are binning news clippings into various topics, sorting documents based on their contents, or as simple as looking at movie plot summaries and mapping them to a genre (as shown in some examples below). Genre Classification Dataset IMDb | Kaggle Sentiment Annotation Similar to text classification in process and strategy, the annotator plays a larger role in labeling a dataset for sentiment-related tasks. This task requires the annotator to interpret the text and look for the emotion and implicit context behind it—something that is not readily apparent to humans or machines when looking at the text. Typical examples include sentiment analysis of a subject from social media data, analyzing customer feedback or product reviews, or gauging the shift in public opinion over a period of time by tracking historical texts. Entity Annotation Often understanding natural language extends to recalling or extracting important information from a given text, such as names, various numbers, topics of interest, etc. Annotating such information (in the form of words or phrases) is called entity annotation. Annotators look for terms in a text of interest and classify them into predefined categories such as dates, countries, topics, names, addresses, zip codes, etc. A user can look up or extract only the pertinent information from large documents by using models trained on such a dataset to quickly label portions of the text. Semantic annotation involves a similar process, but the tags are often concepts and topics. 
Keyphrase tagging (looking for topic-dependent keywords), named entity recognition (NER) (covering a more extensive set of entities), and part-of-speech tagging (understanding grammatical structure) come under entity annotation. Intent Annotation Another approach to annotating text is to direct the interpretation of a sentence towards an action. Typically used for chatbots, intent annotation helps create datasets that can train machine learning models to determine what the writer of the text wants. In the context of a virtual assistant, a message might be a greeting, an inquiry for information, or an actionable request. A model trained on a dataset where the text is labeled using intent annotation can classify each incoming message into a fixed category and simplify the conversation ahead. Linguistic Annotation This kind of text annotation focuses on how humans engage with the language: pronunciation, phonetic sound, parts of speech, word meanings, and structure. Some of these are important in building a text-to-speech converter that creates human-sounding voices with different accents. FLORS - Part-of-Speech Tagger How is text annotated? Now that we have established the various perspectives from which an annotator can look at their task, we can look at what a standard process of text annotation would be and how to annotate text for a machine learning problem. There is no all-encompassing playbook, but a well-defined workflow that goes through the process step by step and a clear annotation guideline help a ton. What are annotation guidelines? Text annotation guidelines are a set of rules and suggestions that act as a reference guide for annotators. An annotator must be able to read them and understand the modeling objective and the purpose the labels would serve to that end. Since these guidelines dictate what is required of the final annotations, they must be set by the team that is familiar with the data and will use the annotations. 
These guidelines can begin with one of the annotation techniques, or something customized that defines the problem and what to look for in the data. They must also define the various cases, both common and potentially ambiguous, that the annotator might face in the data and the action to take in each case. For that purpose, they must also cover common examples found in the data and guidelines for dealing with outliers, out-of-distribution samples, or other cases that might induce ambiguity while annotating. You can create an annotation workflow by beginning with a skeleton process, as shown below. Curate Annotation Guidelines Selecting a Labeling Tool Defining an Annotation Process Review and Quality Control Curate Annotation Guidelines First, define the modeling problem (classification, generation, clustering, etc.) that the team is trying to tackle with the data, and the expected outcome of the annotation process, such as the fixed set of labels/categories, data format, and exporting instructions. This can be extended to curating the actual guidelines that are comprehensive yet easy to revisit. Selecting a Labeling Tool Getting the right text annotation tools can make all the difference between a laborious and menial task and a long but efficient process. Given the prevalence of text modeling, there are several open-source labeling tools available. Below is an illustration of doccano that shows how straightforward annotating intent detection and NER is! Open Source Annotation Tool for Machine Learning Practitioners Defining an Annotation Process Once the logistics are in place, it is important to have a reproducible and error-free workflow that can accommodate multiple annotators and a uniform collection of labeled samples. 
Defining an annotation process includes organizing the data source and labeled data, defining the usage of the guidelines and the annotation tool, a step-by-step guide to performing the actual text annotation, the format for saving and exporting the annotations, and the review of every labeled sample. Given the large sizes of text data that teams usually work with, ensuring a streamlined flow of incoming samples and outgoing labels and reviewing each sample (which might get challenging as one sample can be as big as a multi-page document) is essential. Review and Quality Control Along with on-the-fly review, have a collective look at the labeled data periodically to avoid generic label errors or any bias in labeling that might have come in over time. It is also common to have multiple annotators label the same sample for consistency and to avoid any bias in interpretation, especially in cases where sentiment or contextual interpretation is crucial. To check for the bias and reliability of multiple human annotators, there are statistical measures that can be used to highlight undesirable trends. Comparison metrics such as Cohen’s kappa statistic measure how often two annotators agree with each other on the same set of samples, corrected for how often they would be expected to agree by chance. An example of interpreting Cohen’s kappa is shown below. Monitoring such metrics would flag disagreement and expose potential caveats in understanding the data and the problem. Understanding Interobserver Agreement: The Kappa Statistic Text Annotation: Key Takeaways This article underlines the roles text annotation plays for natural language processing use cases and details how you can get started with data annotation for text. You saw how: high-quality data can significantly impact the training process for a machine learning model. 
different tasks require different approaches and perspectives to annotating a text corpus; some require understanding the meaning of the text, while others require grammar and structure. guidelines and choosing the right text annotation tool can simplify large-scale data annotation and improve reliability. using strategies such as multiple annotators, quality metrics, and more can help generate high-quality labels. {{Training_data_CTA}}
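To make the inter-annotator agreement check from the Review and Quality Control step more concrete, here is a minimal sketch of Cohen's kappa for two annotators. The function name, the sample labels, and the choice to return 1.0 when chance agreement is already total are illustrative assumptions, not part of any particular labeling tool.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of samples that received the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both annotators pick the same label
    # independently, based on each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Assumption: kappa is undefined here (0/0); treat it as full agreement.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical sentiment labels from two annotators on eight reviews.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the annotators agree on 6 of 8 samples (0.75), while chance agreement is 0.5, so kappa is 0.5. A value near 1 indicates strong agreement beyond chance, and a value near 0 suggests the guidelines or the labels themselves need another look.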

September 19

5 min

Introduction to Multimodal Deep Learning

Humans perceive the world using the five senses (vision, hearing, taste, smell, and touch). Our brain uses a combination of two, three, or all five senses to perform conscious intellectual activities like reading, thinking, and reasoning. These are our sensory modalities. In computing terminology, the equivalent of these senses are various data modalities, like text, images, audio, and videos, which are the basis for building intelligent systems. If artificial intelligence (AI) is to truly imitate human intelligence, it needs to combine multiple modalities to solve a problem.  Multimodal learning is a multi-disciplinary approach that can handle the heterogeneity of data sources to build computer agents with intelligent capabilities.  This article will introduce multimodal learning, discuss its implementation, and list some prominent use cases. We will discuss popular multimodal learning techniques, applications, and relevant datasets. {{product_sam_cta}} What is Multimodal Learning in Deep Learning? Multimodal deep learning trains AI models that combine information from several types of data simultaneously to learn their unified data representations and provide contextualized results with higher predictive accuracy for complex AI tasks. Today, modern AI architectures can learn cross-modal relationships and semantics from diverse data types to solve problems like image captioning, image and text-based document classification, multi-sensor object recognition, autonomous driving, video summarization, multimodal sentiment analysis, etc. For instance, in multimodal autonomous driving, AI models can process data from multiple input sensors and cameras to improve vehicle navigation and maneuverability. The Significance of Multimodal Data in the Real World Real-world objects generate data in multiple formats and structures, such as text, image, audio, video, etc. For example, when identifying a bird, we start by looking at the creature itself (visual information). 
Our understanding grows if it’s sitting on a tree (context). The identification is further solidified if we hear the bird chirping (audio input). Our brain can process this real-world information and quickly identify relationships between sensory inputs to generate an outcome. However, present-day machine learning models are nowhere near as complex and intricate as the human brain. Hence, one of the biggest challenges in building multimodal deep learning models is processing different input modalities simultaneously. Each type of data has a different representation, e.g., images consist of pixels, textual data is represented as a set of characters or words, and audio is represented using sound waves. Hence, a multimodal learning architecture requires specialized data transformations or representations for fusing multiple inputs and a complex deep network to understand patterns from the multi-faceted training data. Let’s talk more about how a multimodal model is built. Dissecting Multimodal Machine Learning Although the multimodal learning approach has only become popular recently, there were a few experiments in the past. Srivastava and Salakhutdinov demonstrated multimodal learning with Deep Boltzmann Machines back in 2012. Their network created representations or embeddings for images and text data and fused the layers to create a single model that was tested for classification and retrieval tasks. Although the approach was not popular at the time, it formed the basis of many modern architectures. Modern state-of-the-art (SOTA) multimodal architectures consist of distinct components responsible for transforming data into a unified or common representation. Let’s talk about such components in more detail. How Multimodal Learning Works in Deep Learning? The first step in any deep learning project is to transform raw data into a format understood by the model. 
While this is easier for numerical data, which can be fed directly to the model, other data modalities, like text, must be transformed into word embeddings, i.e., words are represented as real-valued numerical vectors that the model can process easily, with similar words mapped to nearby vectors. With multimodal data, the various modalities have to be individually processed to generate embeddings and then fused. The final representation is an amalgamation of the information from all data modalities. During the training phase, multimodal AI models use this representation to learn the relationship and predict the outcomes for relevant AI tasks. There are multiple ways to generate embeddings for multimodal data. Let’s talk about these in detail. Input Embeddings The traditional method of generating data embeddings uses unimodal encoders to map data to a relevant space. This approach uses embedding techniques like Word2Vec for natural language processing tasks and Convolutional Neural Networks (CNNs) to encode images. These individual encodings are passed via a fusion module to form an aggregation of the original information, which is then fed to the prediction model. Hence, understanding each modality individually requires algorithms that function differently. Also, they need a lot of computational power to learn representations separately. Today, many state-of-the-art architectures utilize specialized embeddings designed to handle multimodal data and create a singular representation. These embeddings include: Data2vec 2.0: The original Data2vec model was proposed by Meta AI’s Baevski, Hsu, et al. They proposed a self-supervised embedding model that can handle multiple modalities of speech, vision, and text. It uses the regular encoder-decoder architecture combined with a student-teacher approach. The student-encoder learns to predict masked data points while the teacher is exposed to the entire data. 
In December 2022, Meta AI proposed version 2.0 of the original framework, matching the accuracy of the original while training up to 16x faster. JAMIE: The Joint Variational Autoencoder for MultiModal Imputations and Embeddings is an open-source framework for embedding molecular structures. JAMIE solves the challenge of generating multi-modal data by taking partially matched samples across different cellular modalities. The information missing from certain samples is imputed by learning similar representations from other samples. ImageBind: ImageBind is a breakthrough model from Meta that can simultaneously fuse information from six modalities. It processes image and video data with added information such as text descriptions, depth, and audio input from the image scene. It binds the entire sensory experience for the model by generating a single embedding consisting of contextual information from all six modalities. ViLBERT: The Vision-and-Language BERT model is an upgrade over the original BERT architecture. The model consists of two parallel streams to process the two modalities (text and image) individually. The two streams interact via a co-attention transformer layer, i.e., one encoder transformer block for generating visual embeddings and another for linguistic embeddings. While these techniques can process multimodal data, each data modality usually creates an individual embedding that must be combined through a fusion module. {{light_callout_start}} If you want to learn more about embeddings, read our detailed blog on The Full Guide to Embeddings in Machine Learning. {{light_callout_end}} Fusion Module After feature extraction (or generating embeddings), the next step in a multimodal learning pipeline is multimodal fusion. This step combines the embeddings of different modalities into a single representation. Fusion can be achieved with simple operations such as concatenation or weighted summation of the unimodal embeddings. 
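As a rough illustration of the two simple fusion operations just mentioned, the sketch below fuses a toy image embedding and a toy text embedding by concatenation and by a weighted element-wise sum. The function names, the three-dimensional vectors, and the equal weights are all hypothetical; real pipelines would operate on tensors produced by trained encoders.

```python
def concat_fusion(embeddings):
    """Concatenate unimodal embeddings into one longer vector."""
    fused = []
    for vec in embeddings:
        fused.extend(vec)
    return fused


def weighted_sum_fusion(embeddings, weights):
    """Element-wise weighted sum of embeddings that share one dimension."""
    if len(weights) != len(embeddings):
        raise ValueError("need exactly one weight per embedding")
    dim = len(embeddings[0])
    if any(len(vec) != dim for vec in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    return [
        sum(w * vec[i] for w, vec in zip(weights, embeddings))
        for i in range(dim)
    ]


# Toy embeddings standing in for encoder outputs (values are made up).
image_emb = [0.5, 1.0, -1.0]
text_emb = [1.5, 0.0, 2.0]

fused_concat = concat_fusion([image_emb, text_emb])  # dimension 3 + 3 = 6
fused_sum = weighted_sum_fusion([image_emb, text_emb], [0.5, 0.5])  # dimension 3

print(fused_concat)
print(fused_sum)
```

Concatenation keeps every feature from every modality but grows the input size of the downstream model, whereas summation keeps the dimension fixed but requires the embeddings to live in spaces of the same size.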
However, the simpler approaches do not yield appreciable results. Advanced architectures use complex modules like the cross-attention transformer. With its attention mechanism, the transformer module has the advantage of selecting relevant modalities at each step of the process. Regardless of the approach, the optimal selection of the fusion method is an iterative process. Different approaches can work better in different cases depending on the problem and data type. Early, Intermediate, & Late Fusion Another key aspect of the multimodal architecture design is deciding between early, intermediate, and late fusion. Early fusion combines data from various modalities early on in the training pipeline. The single modalities are processed individually for feature extraction and then fused together. Intermediate fusion, also known as feature-level fusion, concatenates the feature representations from each modality before making predictions. This enables joint or shared representation learning for the AI model, resulting in improved performance. Late fusion processes each modality through the model independently and returns individual outputs. The independent predictions are then fused at a later stage using averaging or voting. This technique is less computationally expensive than early fusion but does not capture the relationships between the various modalities effectively. Popular Multimodal Datasets Piano Skills Assessment Dataset Sample A multimodal dataset consists of multiple data types, such as text, speech, and image. Some datasets may contain multiple input modalities, such as images or videos and their background sounds or textual descriptions. Others may contain different modalities in the input and output space, such as images (input) and their text captions (output) for image captioning tasks. 
Some popular multimodal datasets include: LJ Speech Dataset: A dataset of 13,100 short audio clips of a single speaker reading passages from public-domain books published between 1884 and 1964, each paired with its transcription. The audio was recorded in 2016-17 and has a total length of about 24 hours. The LJ Speech dataset can be used for audio transcription tasks or speech recognition. HowTo100M: A dataset consisting of 136M narrated video clips sourced from 1.2M YouTube videos and their related text descriptions (subtitles). The descriptions cover over 23K activities or domains, such as education, health, handcrafting, cooking, etc. This dataset is more suitable for building video captioning models or video localization tasks. MultiModal PISA: Introduced in the Piano Skills Assessment paper, the MultiModal PISA dataset consists of images of the piano being played and relevant annotations regarding the pianist’s skill level and tune difficulty. It also contains processed audio and videos of 61 piano performances. It is suitable for audio-video classification and skill assessment tasks. LAION-400M: A dataset containing 413M image-text pairs extracted from the Common Crawl web data dump. The dataset contains images with 256, 512, and 1024 dimensions, and images are filtered using OpenAI’s CLIP. The dataset also contains a KNN index that clusters similar images to extract specialized datasets. Popular Multimodal Deep Learning Models Many popular multimodal architectures have provided ground-breaking results in tasks like sentiment analysis, visual question-answering, and text-to-image generation. Let’s discuss some popular model architectures that are used with multimodal datasets. Stable Diffusion Stable Diffusion (SD) is a widely popular open-source text-to-image model developed by Stability AI. It is categorized under a class of generative models called Diffusion Models. 
The model consists of a pre-trained Variational AutoEncoder (VAE) combined with a U-Net architecture based on a cross-attention mechanism to handle various input modalities (text and images). The encoder block of the VAE transforms the input image from pixel space to a latent representation, which downsamples the image to reduce its complexity. The image is denoised using the U-Net architecture iteratively to reverse the diffusion steps and reconstruct a sharp image using the VAE decoder block, as illustrated in the image below.  Stable Diffusion Architecture SD can create realistic visuals using short input prompts. For instance, if a user asks the model to create “A painting of the last supper by Picasso”, the model would create the following image or similar variations. Image Created By Stable Diffusion Using Input Prompt “A painting of the last supper by Picasso.” Or if the user enters the following input prompt: “A sunset over a mountain range, vector image.” The SD model would create the following image. Image Created By Stable Diffusion Using Input Prompt “A sunset over a mountain range, vector image.” Since SD is an open-source model, multiple variations of the SD architecture exist with different sizes and performances that fit different use cases. {{light_callout_start}} If you want to learn more about diffusion models, read our detailed blog on An Introduction to Diffusion Models for Machine Learning. {{light_callout_end}} Flamingo Flamingo is a few-shot learning Visual Language Model (VLM) developed by DeepMind. It can perform various image and video understanding tasks such as scene description, scene understanding QA, visual dialog, meme classification, action classification, etc. Since the model supports few-shot learning, it can adapt to various tasks by learning from a few task-specific input-output samples.  The model consists of blocks of a pre-trained NFNet-F6 Vision Encoder that outputs a flattened 1D image representation. 
The 1D representation is passed to a Perceiver Resampler that maps these features to a fixed number of output visual tokens, as illustrated in the image below. The Flamingo model comes in three size variants: Flamingo-3B, Flamingo-9B, and Flamingo-80B, and displays ground-breaking performance compared to similar SOTA models. Overview of Flamingo Architecture Meshed-Memory Transformer The Meshed-Memory Transformer is an image captioning model based on encoder-decoder architecture. The architecture comprises memory-augmented encoding layers responsible for processing multi-level visual information and a meshed decoding layer for generating text tokens. The proposed model produced state-of-the-art results, topping the MS-COCO online leaderboard and beating SOTA models, including Up-Down and RFNet. Architecture of Meshed Memory Transformer {{light_callout_start}} If you want to learn more about multimodal learning architectures, read our detailed blog on Meta-Transformer: Framework for Multimodal Learning. {{light_callout_end}} Applications of Multimodal Learning Multimodal deep neural networks have several prominent industry applications by automating media generation and analysis tasks. Let’s discuss some of them below. Image Captioning Image captioning is an AI model’s ability to comprehend visual information in an image and describe it in textual form. Such models are trained on image and text data and usually consist of an encoder-decoder infrastructure. The encoder processes the image to generate an intermediate representation, and the decoder maps this representation to the relevant text tokens. Social media platforms use image captioning models to segregate images into categories and similar clusters. One notable benefit of image captioning models is that people with visual impairment can use them to generate descriptions of images and scenes. 
Results of an Image Captioning Model Image Retrieval Multimodal learning models can combine computer vision and NLP to link text descriptions to respective images. This ability helps with image retrieval in large databases, where users can input text prompts and retrieve matching images. For instance, OpenAI’s CLIP model can perform a wide variety of image classification tasks because it was trained on natural language text available on the internet. As a real-world example, many modern smartphones provide this feature where users can type prompts like “Trees” or “Landscape” to pull up matching images from the gallery. Visual Question Answering (VQA) Visual QA improves upon the image captioning models and allows the model to learn additional details regarding an image or scenario. Instead of generating a single description, the model can answer questions regarding the image iteratively. VQA has several helpful applications, such as allowing doctors to better understand medical scans via cross-questioning or acting as a virtual instructor that supports visual learning for students. Text-to-Image Models Image generation from text prompts is a popular generative AI application that has already found several use cases in the real world. Models like DALL·E 2, Stable Diffusion, and Midjourney can generate excellent images from carefully curated text prompts. Social media creators, influencers, and marketers are extensively utilizing text-to-image models to generate unique and royalty-free visuals for their content. These models have enhanced the speed and efficiency of the content and art generation process. Today, digital artists can create highly accurate visuals within seconds instead of hours. Images Generated Using Stable Diffusion Using Various Input Prompts Text-to-Sound Generation Text-to-sound generation models can be categorized into speech and music synthesis. 
While the former creates human speech that reads out the input text prompt, the latter treats the prompt as a descriptor and generates a musical tune. Both auditory models work on similar principles but have distinctly different applications. Speech synthesis is already being used to generate audio for social media video content. It can also help people with speech impairment. Moreover, artists are using text-to-sound models for AI music generation. They can generate music snippets quickly to add to their creative projects or create complete songs. For instance, an anonymous artist named Ghostwriter977 on Twitter recently submitted his AI-generated track “Heart on My Sleeve” for the Grammy Awards. The song sparked controversy for resembling the creative work of two real artists, Drake and The Weeknd. Overall, such models can speed up the content generation process significantly and improve the time to market for various creative projects. Emotion Recognition A multimodal emotion recognition AI model grasps various audiovisual cues and contextual information to categorize a person’s emotions. These models analyze features like facial expressions, body language, voice tone, spoken words, and any other contextual information, such as the description of any event. All this knowledge helps the model understand the subject’s emotions and categorize them accordingly. Emotion recognition has several key applications, such as identifying anxiety and depression in patients, conducting customer analysis, and recognizing whether a customer is enjoying the product. Furthermore, it can also be a key component for building empathetic AI robots, helping them understand human emotions and take necessary action. Different Emotions of Speakers in a Dialogue Multimodal Learning: Challenges & Future Research While we have seen many breakthroughs in multimodal learning, it is still nascent. Several challenges remain to be solved. 
Some of these key challenges are: Training time: Conventional deep learning models are already computationally expensive and take several hours to train. With multimodal, the model complexity is taken up a notch with various data types and fusion techniques. Reportedly, it can take up to 13 days to train a Stable Diffusion model using 256 A100 GPUs. Future research will primarily focus on generating efficient models requiring less training and cost. Optimal Fusion Techniques: Selecting the correct fusion technique is an iterative and time-consuming process. Many popular techniques cannot capture modality-specific information and fully replicate the complex relationships between the various modalities. Researchers are engaged in creating advanced fusion techniques to comprehend the complexity of the multimodal data. Interpretability: Lack of interpretation plagues all deep learning models. With multiple complex hidden layers capturing data from various modalities, the confusion only grows. Explaining how a model can comprehend various modalities and generate accurate results is challenging. Though researchers have developed various explainable multimodal techniques, numerous open challenges exist, such as insufficient evaluation metrics, lack of ground truth, and generalizability issues that must be addressed to apply multimodal AI in critical scenarios. Multimodal Learning: Key Takeaways Multimodal deep learning brings AI closer to human-like behavior by processing various modalities simultaneously. AI models can generate more accurate outcomes by integrating relevant contextual information from various data sources (text, audio, image). A multimodal model requires specialized embeddings and fusion modules to create representations of the different modalities. As multimodal learning gains traction, many specialized datasets and model architectures are being introduced. Notable multimodal learning models include Flamingo and Stable Diffusion. 
Multimodal learning has various practical applications, including text-to-image generation, emotion recognition, and image captioning. This AI field has yet to overcome certain challenges, such as building simple yet effective architectures for achieving reduced training times and improved accuracy. {{try_encord}}

September 19

5 min

What is Out-of-Distribution (OOD) Detection?

Imagine teaching a child about animals using only a book on farm animals. Now, what happens when this child encounters a picture of a lion or a penguin? Confusion, right? In the realm of deep neural networks, there's a similar story unfolding. It's called the closed-world assumption. Deep within the intricate layers of neural networks, there's a foundational belief we often overlook: the network will only ever meet data it's familiar with, data it was trained on. The true challenge isn't just about recognizing cows or chickens. It's about understanding the unfamiliar, the unexpected. It's about the lion in a world of farm animals. The real essence? The test data distribution. The test data should mirror the training data distribution for a machine learning model to perform optimally. However, in real-world scenarios, this is only sometimes the case. This divergence can lead to significant challenges, emphasizing the importance of detecting out-of-distribution (OOD) data.  As we delve deeper, we'll explore the intricacies of OOD detection and its pivotal role in ensuring the robustness and reliability of artificial intelligence systems. Out of Distribution Samples The Importance of OOD Detection Out-of-Distribution (OOD) detection refers to a model's ability to recognize and appropriately handle data that deviates significantly from its training set.  The closed-world assumption rests on believing that a neural network will predominantly encounter data that mirrors its training set. But in the vast and unpredictable landscape of real-world data, what happens when it stumbles upon these uncharted territories? That's where the significance of OOD detection comes into play. Real-world Implications of Ignoring OOD When neural networks confront out-of-distribution (OOD) data, the results can be less than ideal. A significant performance drop in real-world tasks is one of the immediate consequences. 
Think of it as a seasoned sailor suddenly finding themselves in uncharted waters, unsure how to navigate.  Moreover, the repercussions can be severe in critical domains. For instance, an AI system with OOD brittleness in medicine might misdiagnose a patient, leading to incorrect treatments. Similarly, in home robotics, a robot might misinterpret an object or a command, resulting in unintended actions. The dangers are real, highlighting the importance of detecting and handling OOD data effectively. The Ideal AI System Deep neural networks, the backbone of many modern AI systems, are typically trained under the closed-world assumption. This assumption presumes that the test data distribution closely mirrors the training data distribution. However, the real world seldom adheres to such neat confines.  When these networks face unfamiliar, out-of-distribution (OOD) data, their performance can wane dramatically. While such a dip might be tolerable in applications like product recommendations, it becomes a grave concern in critical sectors like medicine and home robotics. Even a minor misstep due to OOD brittleness can lead to catastrophic outcomes. An ideal AI system should be more adaptable. It should generalize to OOD examples and possess the acumen to flag instances that stretch beyond its understanding. This proactive approach ensures that when the system encounters data, it can't confidently process, it seeks human intervention rather than making a potentially erroneous decision. {{light_callout_start}} For a deeper dive into the intricacies of deep learning and its foundational concepts, check out this comprehensive guide on Demystifying Deep Learning. {{light_callout_end}} Understanding OOD Brittleness Deep neural networks, the linchpin of many AI systems, are trained with the closed-world assumption. This assumption presumes that the test data distribution closely resembles the training data distribution. However, the real world often defies such neat confines.  
When these networks encounter unfamiliar, out-of-distribution (OOD) data, their performance can deteriorate significantly. While such a decline might be tolerable in applications like product recommendations, it becomes a grave concern in critical sectors like medicine and home robotics. Even a minor misstep due to OOD brittleness can lead to catastrophic outcomes. Why Models Exhibit OOD Brittleness The brittleness of models, especially deep neural networks, to OOD data is multifaceted. Let's delve deeper into the reasons: Model Complexity: Deep neural networks are highly parameterized, allowing them to fit complex patterns in the training data. While this complexity enables them to achieve high accuracy on in-distribution data, it can also make them susceptible to OOD data. The model might respond confidently to OOD inputs, even if they are nonsensical or far from the training distribution. Lack of Regularization: Regularization techniques, like dropout or weight decay, can improve a model's generalization. However, if these techniques are not applied or are poorly tuned, models can still overfit the training data, making them brittle to OOD inputs. Dataset Shift: The data distribution can change over time in real-world applications. This phenomenon, known as dataset shift, can lead to situations where the model encounters OOD data after deployment even though no such data was present during training. Model Assumptions: Many models, especially traditional statistical models, make certain assumptions about the data. If OOD data violate these assumptions, the model's performance can degrade. High Dimensionality: The curse of dimensionality can also play a role. Most of the volume in high-dimensional spaces is near the surface, making it easy for OOD data to lie far from the training data, causing models to extrapolate unpredictably. Adversarial Inputs: OOD data can sometimes be adversarial, crafted explicitly to deceive the model. 
Such inputs can exploit the model's vulnerabilities, causing it to make incorrect predictions with high confidence. Absence of OOD Training Samples: If a model has never seen examples of OOD data during training, it won't have learned to handle them. This is especially true for supervised learning models, which rely on labeled examples. Model's Objective Function: The objective function optimized during training (e.g., cross-entropy loss for classification tasks) might not penalize confident predictions on OOD data. This can lead to models that are overly confident even when they shouldn't be. Incorporating techniques to detect and handle OOD data is crucial, especially as AI systems are increasingly deployed in real-world, safety-critical applications. Types of Generalizations Models generalize in various ways, each with its implications for OOD detection. Some models might have a broad generalization, making them more adaptable to diverse data but potentially less accurate. Others might have a narrow focus, excelling in specific tasks, but could struggle when faced with unfamiliar data. Understanding the type of generalization a model employs is crucial for anticipating its behavior with OOD data and implementing appropriate detection mechanisms. Pre-trained Models vs. Traditional Models Pre-trained models, like BERT, have gained traction in recent years for their impressive performance across a range of tasks. One reason for their robustness against OOD data is their extensive training on diverse datasets. This broad exposure allows them to recognize and handle a wider range of inputs than traditional models that might be trained on more limited datasets. For instance, a research paper titled "Using Pre-Training Can Improve Model Robustness and Uncertainty" highlighted that while pre-training might not always enhance performance on traditional classification metrics, it significantly bolsters model robustness and uncertainty estimates. 
This suggests that the extensive and diverse training data used in pre-training these models equips them with a broader understanding, making them more resilient to OOD data. However, even pre-trained models are not immune to OOD brittleness, emphasizing the need for continuous research and refinement in this domain. {{product_sam_cta}} Approaches to Detect OOD Instances Detecting out-of-distribution (OOD) instances is crucial for ensuring the robustness and reliability of machine learning models, especially deep neural networks. Several approaches have been proposed to address this challenge, each with advantages and nuances. Here, we delve into some of the prominent techniques. Maximum Softmax Probability Softmax probabilities can serve as a straightforward metric for OOD detection. Typically, a neural network model would output higher softmax probabilities for in-distribution data and lower probabilities for OOD data. By setting a threshold on these probabilities, one can flag instances below the threshold as potential OOD instances. Ensembling of Multiple Models Ensembling involves leveraging multiple models to make predictions. For OOD detection, the idea is that while individual models might be uncertain about an OOD instance, their collective decision can be more reliable. By comparing the outputs of different models, one can identify prediction discrepancies, which can indicate OOD data. Temperature Scaling Temperature scaling is a post-processing technique that calibrates the softmax outputs of a model. By adjusting the "temperature" parameter, one can modify the confidence of the model's predictions. Properly calibrated models can provide more accurate uncertainty estimates, aiding OOD detection. Training a Binary Classification Model as a Calibrator Another approach is to train a separate binary classification model that acts as a calibrator. This model is trained to distinguish between the in-distribution and OOD data. 
By feeding the outputs of the primary model into this calibrator, one can obtain a binary decision on whether the instance is in distribution or OOD. Monte-Carlo Dropout Dropout is a regularization technique commonly used in neural networks. Monte-Carlo Dropout involves performing dropout at inference time and running the model multiple times. The variance in the model's outputs across these runs can provide an estimate of the model's uncertainty, which can be used to detect OOD instances. Research in OOD Detection Deep learning models, particularly neural networks, have performed remarkably well in various tasks. However, their vulnerability to out-of-distribution (OOD) data remains a significant concern. Recent research has delved deeper into understanding this vulnerability and devising methods to detect OOD instances effectively. Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness (Liu et al., 2020): This paper emphasizes the need for AI systems to detect OOD instances beyond their capability and proposes a method for uncertainty estimation. Detecting Out-of-Distribution Examples with In-distribution Examples and Gram Matrices (Sastry & Oore, 2019): The study presents a method for detecting OOD examples using in-distribution examples and Gram matrices, demonstrating its effectiveness in detecting far-from-distribution OOD examples. Energy-based Out-of-distribution Detection (Liu et al., NeurIPS 2020): Proposing a unified framework for OOD detection, this research uses an energy score to detect anomalies. Learning Confidence for Out-of-Distribution Detection in Neural Networks (DeVries & Taylor, 2018): The paper highlights that modern neural networks, despite their power, often fail to recognize when their predictions might be incorrect. The research addresses this by having the network learn confidence estimates that can then be used for OOD detection. 
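As a rough illustration of the score-based detectors described in this article, the sketch below computes a maximum softmax probability score with an optional temperature and flags low-scoring inputs as potential OOD instances. The function names, the example logits, and the 0.7 threshold are illustrative assumptions, not values taken from any of the cited papers; in practice both the threshold and the temperature would be tuned on held-out data.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; temperature > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def msp_score(logits, temperature=1.0):
    """Maximum softmax probability: the confidence of the top class."""
    return max(softmax(logits, temperature))


def is_ood(logits, threshold=0.7, temperature=1.0):
    """Flag an input as potentially OOD when its top-class confidence is low.

    The default threshold is a hypothetical value chosen for this example.
    """
    return msp_score(logits, temperature) < threshold


# Hypothetical logits from a 3-class classifier.
confident_logits = [8.0, 1.0, 0.5]   # one class clearly dominates
ambiguous_logits = [1.2, 1.0, 0.9]   # no class stands out

print(is_ood(confident_logits))  # treated as in-distribution
print(is_ood(ambiguous_logits))  # flagged for review
```

Raising the temperature lowers the top-class probability for the same logits, which is why the threshold and the temperature need to be chosen together.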
Datasets and Benchmark Numbers In the realm of Out-of-Distribution (OOD) detection, several datasets have emerged as the gold standard for evaluating the performance of various detection methods. Here are some of the most popular datasets and methods that have been benchmarked on them: CIFAR-10 and CIFAR-100: These are staple datasets in the computer vision community, often used to benchmark OOD detection methods. For instance, the DHM method has been tested with CIFAR-10 as the in-distribution set and CIFAR-100 as the OOD set, and vice versa, showcasing its robustness. STL-10: Another dataset in the computer vision domain, to which the Mixup (Gaussian) method has been applied, demonstrating its effectiveness in OOD detection. MS-1M vs. IJB-C: This dataset pairing has seen the application of the ResNeXt50 + FSSD method, further emphasizing the importance of robust OOD detection techniques in diverse datasets. Fashion-MNIST: A dataset that's become increasingly popular for OOD detection, with methods like PAE showcasing their prowess. 20 Newsgroups: This text dataset has seen the application of the 2-Layered GRU method, highlighting the versatility of OOD detection across different data types. It's crucial to note that the benchmark scores of methods can vary based on the dataset, emphasizing the need for comprehensive testing across multiple datasets to ensure the robustness of OOD detection methods. OOD Detector Future Direction The field of OOD detection is rapidly evolving, with new methodologies and techniques emerging regularly. As AI systems become more integrated into real-world applications, the importance of robust OOD detection will only grow. Future research is likely to focus on: Enhanced Generalization: As models become more complex, it will be paramount to ensure they can generalize well to unseen data. This will involve developing techniques that can handle the vast diversity of real-world data. 
Integration with Other AI Domains: OOD detection will likely see integration with other AI domains, like transfer learning, few-shot learning, and more, to create holistic systems that are both robust and adaptable. Real-time OOD Detection: Real-time OOD detection will be crucial for applications like autonomous driving or medical diagnostics. Research will focus on making OOD detection methods faster without compromising on accuracy. Ethical Considerations: As with all AI advancements, the ethical implications of OOD detection will come to the fore. Ensuring that these systems are fair, transparent, and don't perpetuate biases will be a significant area of focus. With the pace of advancements in the field, the next few years promise to be exciting for OOD detection, with groundbreaking research and applications on the horizon. Out-of-Distribution Detection: Key Takeaways Out-of-distribution (OOD) detection, a pivotal algorithm in the AI landscape, is a cornerstone in modern AI systems.  As AI continues to permeate diverse sectors, from image classification in healthcare to pattern recognition in finance, identifying and handling out-of-distribution samples deviating from the input data the model was trained on becomes paramount.  Here are the pivotal takeaways from our exploration: Significance of OOD Detection: AI models, especially convolutional neural networks, are optimized for their training data. When faced with out-of-distribution data, their activations can misfire, and their performance can drastically plummet, leading to unreliable or even hazardous outcomes in real-world applications. Model Vulnerability: Despite their prowess and intricate loss function designs, models exhibit OOD brittleness primarily due to their training regimen. Their hyper-fine-tuning can make them less adaptable to unfamiliar inputs, emphasizing the need for novelty detection. 
Diverse Approaches: Researchers are exploring many techniques to enhance OOD detection, from leveraging generative models like variational autoencoders (VAE) to the ensembling of multiple models and uncertainty estimation using Monte-Carlo dropout. Research Landscape: Recent years have seen groundbreaking research in OOD detection, with methods like DHM and PAE leading the charge. Platforms like arXiv and GitHub have been instrumental in disseminating this knowledge. Datasets like CIFAR-10 serve as baselines for evaluating these novel techniques, and international conferences like ICML have been platforms for such discussions. Future Trajectory: The AI community, with contributions from researchers like Hendrycks, Ren, and Chen, is gearing towards enhanced model generalization, real-time OOD detection using self-supervised and unsupervised techniques, and integrating ethical considerations into OOD methodologies. In essence, OOD detection is not only a technical challenge but also a necessity for ensuring that AI systems, whether they employ classifier systems or delve into outlier detection, remain reliable, safe, and effective in diverse real-world scenarios. {{Training_data_CTA}}
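The ensembling and Monte-Carlo dropout approaches covered in this article both come down to the same measurement: run several predictors (or several stochastic forward passes) on one input and check how much their outputs disagree. The sketch below is a minimal version of that idea using only the Python standard library. The probability vectors are made-up examples, and `predictive_disagreement` is a hypothetical helper name, not part of any library.

```python
from statistics import mean, pvariance


def mean_prediction(prob_runs):
    """Average the class probabilities over all runs (models or dropout passes)."""
    n_classes = len(prob_runs[0])
    return [mean(run[c] for run in prob_runs) for c in range(n_classes)]


def predictive_disagreement(prob_runs):
    """Mean per-class variance across runs; larger values mean less agreement."""
    n_classes = len(prob_runs[0])
    per_class = [pvariance([run[c] for run in prob_runs]) for c in range(n_classes)]
    return mean(per_class)


# Hypothetical softmax outputs from three ensemble members (or dropout passes).
agreeing_runs = [
    [0.90, 0.05, 0.05],
    [0.88, 0.07, 0.05],
    [0.92, 0.04, 0.04],
]
disagreeing_runs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.30, 0.20, 0.50],
]

print(predictive_disagreement(agreeing_runs))     # close to zero
print(predictive_disagreement(disagreeing_runs))  # noticeably larger
```

An input would be flagged as potentially OOD when the disagreement exceeds a cutoff chosen on validation data; that cutoff is application-specific and is not prescribed here.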

September 15

4 min

Guide to Panoptic Segmentation

The term "panoptic" is derived from two words: "pan," meaning "all," and "optic," signifying "vision."  Panoptic segmentation, a pivotal concept in computer vision, offers a comprehensive approach to image segmentation. It stands out by simultaneously segmenting objects and classifying them. Thus, panoptic segmentation can be interpreted as viewing everything within a given visual field. This technique is a hybrid, merging semantic and instance segmentation strengths.  Introduced by Alexander Kirillov and his team in 2018, panoptic segmentation aims to provide a holistic view of image segmentation rather than relying on separate methodologies. A key distinction of panoptic segmentation is its ability to classify objects into two broad categories: "things" and "stuff." In computer vision, "things" refer to countable objects with a defined geometry, such as cars or animals. On the other hand, "stuff" pertains to objects identified primarily by texture and material, like the sky or roads. {{product_sam_cta}} Understanding Image Segmentation What is Image Segmentation? Image segmentation, a pivotal concept in computer vision, involves partitioning a digital image into multiple segments, often called image regions or objects. This process transforms an image into a more meaningful and easier-to-analyze representation. Image segmentation assigns labels to pixels so those with the same label share specific characteristics. This technique is instrumental in locating objects and boundaries within images. For instance, in medical imaging, segmentation can create 3D reconstructions from CT scans using geometry reconstruction algorithms. Types of Image Segmentation Semantic Segmentation: This approach identifies the class each pixel belongs to. For instance, in an image with multiple people, all pixels associated with persons will have the same class label, while the background pixels will be classified differently. 
{{light_callout_start}} For a deeper dive into a related topic, check out this comprehensive Guide to Semantic Segmentation on Encord's blog. {{light_callout_end}} Instance Segmentation: Each pixel is assigned to the specific object instance it belongs to. It's about detecting distinct objects of interest in the image. For example, in an image with multiple people, each person would be segmented as a unique object. Panoptic Segmentation: A combination of semantic and instance segmentation, panoptic segmentation identifies the class each pixel belongs to while distinguishing between different instances of the same class. What is Panoptic Segmentation? The term "panoptic" conveys the idea of encompassing everything visible in a single view. In computer vision, panoptic segmentation offers a unified approach to segmentation, seamlessly merging the capabilities of both instance and semantic segmentation. Panoptic segmentation is not just a mere combination of its counterparts but a sophisticated technique that classifies every pixel in an image based on its class label while identifying the specific instance of that class it belongs to. For instance, in an image with multiple cars, panoptic segmentation would identify each car and distinguish between them, providing a unique instance ID for each. This technique stands out from other segmentation tasks in its comprehensive nature. While semantic segmentation assigns pixels to their respective classes without distinguishing between individual instances, and instance segmentation identifies distinct objects without necessarily classifying every pixel, panoptic segmentation does both. Every pixel in an image processed using panoptic segmentation would have two associated values: a label indicating its class and an instance number. Pixels that belong to "stuff" regions, which are harder to quantify (like the sky or pavement), might have an instance number reflecting that categorization or none at all. 
In contrast, pixels belonging to "things" (countable objects like cars or people) would have unique instance IDs. This advanced segmentation technique has potential applications in various fields, including medical imaging, autonomous vehicles, and digital image processing. Its ability to provide a detailed understanding of images makes it a valuable tool in the evolving landscape of computer vision. Working Mechanism Panoptic segmentation has emerged as a groundbreaking technique in computer vision. It's a hybrid approach that beautifully marries the strengths of semantic and instance segmentation. While semantic segmentation classifies each pixel into a category, instance segmentation identifies individual object instances. On the other hand, panoptic segmentation does both: it classifies every pixel and assigns a unique instance ID to distinguishable objects. One of the state-of-the-art methods in panoptic segmentation is the Efficient Panoptic Segmentation (EfficientPS) method. This technique leverages deep learning and neural networks to achieve high-quality segmentation results. EfficientPS is designed to be both efficient in terms of computational resources and effective in terms of segmentation quality. It employs feature pyramid networks and convolutional layers to process input images and produce segmentation masks. The method also utilizes the COCO dataset for training and validation, ensuring that the models are exposed to diverse images and scenarios. The beauty of panoptic segmentation, especially methods like EfficientPS, lies in their ability to provide a detailed, pixel-level understanding of images. This is invaluable in real-world applications such as autonomous vehicles, where understanding the category (road, pedestrian, vehicle) and the individual instances (specific cars or people) is crucial for safe navigation. 
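To make the "two values per pixel" idea concrete, here is a minimal sketch that fuses a semantic map with a set of instance masks into a panoptic map of `(class_id, instance_id)` pairs. The class IDs, the tiny 3x4 image, and the `fuse_panoptic` helper are all illustrative assumptions; this is not the fusion module used by EfficientPS or any other published method, which resolve overlaps and confidences in more sophisticated ways.

```python
# Hypothetical class IDs for this example only.
SKY, ROAD, CAR = 0, 1, 2


def fuse_panoptic(semantic, instances):
    """Combine a semantic map and instance masks into a panoptic map.

    semantic:  2D list of class IDs, one per pixel.
    instances: list of (class_id, mask) pairs, where mask is a 2D list of 0/1.
    Returns a 2D list of (class_id, instance_id) tuples. "Stuff" pixels keep
    instance_id None; "thing" pixels get a unique positive instance ID.
    If two masks overlap, the earlier instance in the list keeps the pixel.
    """
    height, width = len(semantic), len(semantic[0])
    panoptic = [[(semantic[r][c], None) for c in range(width)] for r in range(height)]
    for instance_id, (class_id, mask) in enumerate(instances, start=1):
        for r in range(height):
            for c in range(width):
                if mask[r][c] and panoptic[r][c][1] is None:
                    panoptic[r][c] = (class_id, instance_id)
    return panoptic


# A 3x4 scene: sky on top, two cars side by side, road at the bottom.
semantic_map = [
    [SKY, SKY, SKY, SKY],
    [CAR, CAR, CAR, CAR],
    [ROAD, ROAD, ROAD, ROAD],
]
car_a = (CAR, [[0, 0, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]])
car_b = (CAR, [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]])

panoptic_map = fuse_panoptic(semantic_map, [car_a, car_b])
for row in panoptic_map:
    print(row)
```

Both cars share the same class ID but receive different instance IDs, while sky and road pixels carry a class label only, which mirrors the "things" versus "stuff" distinction described above.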
Key Components of Panoptic Segmentation Imagine a painter who not only recognizes every object in a scene but also meticulously colors within the lines, ensuring each detail stands out. That's the magic of panoptic segmentation in the world of computer vision. By understanding its key components, we can grasp how it effectively delineates and classifies every pixel in an image, ensuring both coherence and distinction. Fully Convolutional Network (FCN) and Mask R-CNN Fully Convolutional Networks (FCN) have emerged as a pivotal component in panoptic segmentation. FCN's strength lies in its ability to process images of varying sizes and produce correspondingly-sized outputs. This network captures patterns from uncountable objects, such as the sky or roads, by classifying each pixel into a semantic label. It's designed to operate end-to-end, from pixel to pixel, offering a detailed, spatially dense prediction. [Figure: Fully Convolutional Neural Networks] Mask R-CNN, an extension of Faster R-CNN, in turn plays a crucial role in recognizing countable objects. While Faster R-CNN is adept at bounding box recognition, Mask R-CNN adds a parallel branch for predicting an object mask. This means that for every detected object, Mask R-CNN identifies it and generates a high-quality segmentation mask for each instance. This dual functionality makes it an invaluable tool for tasks requiring object detection and pixel-level segmentation, such as identifying and distinguishing between individual cars in a traffic scene. [Figure: Mask R-CNN Architecture] FCN and Mask R-CNN form the backbone of panoptic segmentation, ensuring that every pixel in an image is accurately classified and, if applicable, associated with a unique instance ID. EfficientPS Architecture One of the foundational elements of the Efficient Panoptic Segmentation (EfficientPS) architecture is EfficientNet, a model designed to systematically scale the network depth, width, and resolution. 
This ensures the model achieves optimal performance across various tasks without consuming excessive computational resources. A significant aspect of the EfficientPS architecture is the two-way Feature Pyramid Network (FPN). The FPN is adept at handling different scales in an image, making it invaluable for tasks that require understanding both the broader scene and the finer details. This two-way FPN ensures that features from both low-level and high-level layers of the network are utilized, providing a rich set of features for the segmentation task. Fusing outputs from semantic and instance segmentation is another hallmark of the EfficientPS architecture. While semantic segmentation provides a class label for each pixel, instance segmentation identifies individual object instances. EfficientPS combines these outputs, ensuring that every pixel in an image is classified and associated with a unique instance ID if it belongs to a countable object. What makes EfficientPS truly special is its loss function. The architecture employs a compound loss that combines the losses from semantic and instance segmentation tasks. This ensures the model is trained to perform optimally on both tasks simultaneously. The EfficientPS architecture, integrating EfficientNet, two-way FPN, and a compound loss function, set a new benchmark in panoptic segmentation, delivering state-of-the-art results across various datasets.  Prediction using EfficientPS Practical Applications of Panoptic Segmentation Medical Imaging Panoptic segmentation has made significant strides in medical imaging. Panoptic segmentation offers a detailed and comprehensive view of medical images by leveraging the power of both semantic and instance segmentation. This is particularly beneficial in tumor cell detection, where the model identifies the presence of tumor cells and differentiates between individual cells. 
Such precision is crucial for accurate diagnoses, enabling medical professionals to devise more effective treatment plans. Using datasets like COCO and Cityscapes, combined with deep learning algorithms, ensures that the segmentation models are trained on high-quality data, further enhancing their accuracy in medical diagnoses. Autonomous Vehicles The world of autonomous vehicles is another domain where panoptic segmentation proves its mettle. For self-driving cars, understanding the environment is paramount. Panoptic segmentation aids in this by providing a pixel-level understanding of the surroundings. It plays a pivotal role in distance-to-object estimation, ensuring the vehicle can make informed decisions in real-time. By distinguishing between countable objects (like pedestrians and other vehicles) and uncountable objects (like roads and skies), panoptic segmentation ensures safer navigation for autonomous vehicles. Digital Image Processing Modern smartphone cameras are a marvel of technology, and panoptic segmentation enhances their capabilities. Features like portrait mode, bokeh mode, and auto-focus leverage the power of image segmentation to differentiate between the subject and the background. This allows for the creation of professional-quality photos with depth effects. The fusion of semantic and instance segmentation ensures that the camera can identify and focus on the subject while blurring out the background, resulting in stunning photographs. Research in Panoptic Segmentation With the integration of advanced algorithms and neural networks, the research community has been pushing the boundaries of what's possible in computer vision. One notable model that has emerged as a leader in this space is OneFormer (ConvNeXt-L, single-scale, 512x1024), which has set new benchmarks, especially on the Cityscapes val dataset. 
Important Papers Several influential papers have shaped the trajectory of panoptic segmentation. Panoptic Feature Pyramid Networks: This paper delves into a minimally extended version of Mask R-CNN with FPN, referred to as Panoptic FPN. The study showcases how this model is a robust and accurate baseline for semantic and instance segmentation tasks. Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation: The work introduces Panoptic-DeepLab, a system designed for panoptic segmentation. The paper emphasizes its simplicity, strength, and speed, aiming to establish a solid baseline for bottom-up methods. OneFormer: This model has emerged as a leader in panoptic segmentation in 2023, setting new benchmarks, especially on the Cityscapes val dataset. Panoptic Segmentation: Key Takeaways Panoptic segmentation has emerged as a pivotal technique in computer vision, offering a comprehensive approach to image segmentation. This method seamlessly integrates the strengths of semantic and instance segmentation, providing a holistic view of images. Let's recap the significant insights and applications of panoptic segmentation: Unified Approach: Panoptic segmentation is a hybrid technique that combines the best of semantic and instance segmentation. It assigns every pixel in an image a class label while distinguishing between individual object instances. This unified approach ensures every pixel has a clear, singular label, eliminating ambiguities. Diverse Applications: The applications of panoptic segmentation are vast and varied. In the medical field, it aids in precise tumor cell detection, enhancing the accuracy of diagnoses. For autonomous vehicles, it plays a crucial role in distance-to-object estimation, ensuring safer navigation. Additionally, in digital image processing, panoptic segmentation enhances smartphone camera capabilities, enabling features like portrait mode and auto-focus. 
Innovative Research: The field has witnessed rapid advancements, with state-of-the-art models like EfficientPS pushing the boundaries of what's possible. These models leverage architectures like EfficientNet and Feature Pyramid Networks to deliver high-quality segmentation results efficiently. Datasets and Benchmarks: Research in panoptic segmentation is supported by many datasets, with Cityscapes being notable. The benchmark scores on these datasets provide a clear metric to gauge the performance of various models, guiding further research and development. Future Trajectory: The future of panoptic segmentation looks promising. With continuous research and integration of deep learning techniques, we can expect even more accurate and efficient models. These advancements will further expand the applications of panoptic segmentation, from healthcare to autonomous driving and beyond. Panoptic segmentation stands at the intersection of technology and innovation, offering solutions to complex computer vision challenges. As research progresses and technology evolves, its potential applications and impact on various industries will only grow. {{Training_data_CTA}}

September 13

5 min

5 Recent AI Research Papers

3D Gaussian Splatting for Real-Time Radiance Field Rendering The paper presents a novel real-time radiance field rendering technique using 3D Gaussian splatting, addressing the challenge of efficient and high-quality rendering. Objective: Develop a real-time radiance field rendering technique Problem: Achieving real-time display rates for rendering unbounded and complete scenes at 1080p resolution using Radiance Field methods. Existing approaches often involve costly neural network training and rendering or sacrifice quality for speed, making it difficult to attain both high visual quality and real-time performance for such scenes.  Solution Anisotropic 3D Gaussians as a high-quality, unstructured representation of radiance fields An optimization technique for 3D Gaussian properties, coupled with adaptive density control, to generate top-tier representations for captured scenes A fast, differentiable GPU-based rendering approach that incorporates visibility awareness, enables anisotropic splatting and supports swift backpropagation to accomplish exceptional quality view synthesis. Methodology  Scene Representation with 3D Gaussians: Begin with sparse points obtained during camera calibration. Utilize 3D Gaussians to represent the scene. Preserve key characteristics of continuous volumetric radiance fields. Avoid unnecessary computations in empty areas of the scene. Optimization and Density Control of 3D Gaussians: Implement interleaved optimization and density control for the 3D Gaussians. Focus on optimizing the anisotropic covariance to achieve precise scene representation. Fine-tune Gaussian properties to enhance accuracy. Fast Visibility-Aware Rendering Algorithm. Develop a rapid rendering algorithm designed for GPUs: Ensure visibility awareness in the rendering process. Enable anisotropic splatting for improved rendering quality. Accelerate training processes. Facilitate real-time rendering for efficient visualization of the radiance field. 
{{light_callout_start}} Find the code implementation on GitHub. {{light_callout_end}} Results 3D Gaussian Splatting for Real-Time Radiance Field Rendering Achieved real-time rendering of complex radiance fields, allowing for interactive and immersive experiences. Demonstrated significant improvements in rendering quality and performance compared to previous methods like InstantNGP and Plenoxels. Showcased the adaptability of the system through dynamic level-of-detail adjustments, maintaining visual fidelity while optimizing resource usage. Validated the effectiveness of 3D Gaussian splatting in handling radiance field rendering challenges. {{light_callout_start}} Read the original paper by Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis: 3D Gaussian Splatting for Real-Time Radiance Field Rendering {{light_callout_end}} Nougat: Neural Optical Understanding for Academic Documents The paper aims to enhance the accessibility of scientific knowledge stored in digital documents, especially PDFs, by proposing Nougat, a model that performs OCR tasks. It is an academic document PDF parser that understands LaTeX math and tables. Objective: Enhance the accessibility of scientific knowledge stored in digital documents, particularly in PDF format. Problem: Effectively preserving semantic information, particularly mathematical expressions, while converting PDF-based documents into a machine-readable format (LaTeX). Solution: Nougat is a vision transformer that enables end-to-end training for the task at hand. This architecture builds upon the Donut architecture and does not require any OCR-related inputs or modules, as the text is recognized implicitly by the network. Nougat: Neural Optical Understanding for Academic Documents Methodology Encoder: Receives document image. Crops margins and resizes the image to a fixed rectangle. Utilizes a Swin Transformer, splitting the image into windows and applying self-attention layers. Outputs a sequence of embedded patches. 
Decoder: Inputs the encoded image. Uses a transformer decoder architecture with cross-attention. Generates tokens in an auto-regressive manner. Projects the output to match the vocabulary size. Implementation: Adopts mBART decoder from BART. Utilizes a specialized tokenizer for scientific text, similar to Galactica’s approach. {{light_callout_start}} Find the code implementation on GitHub. {{light_callout_end}} Results Mathematical expressions had the lowest agreement with the ground truth, mainly due to missed formulas by GROBID and challenges in equation prediction accuracy stemming from bounding box quality. Nougat, both in its small and base versions, consistently outperformed the alternative approach across all metrics, demonstrating its effectiveness in converting document images to compatible markup text. {{light_callout_start}} Read the original paper from Meta AI by Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic: Nougat: Neural Optical Understanding for Academic Documents. {{light_callout_end}} Scaling up GANs for Text-to-Image Synthesis The paper introduces GigaGAN, a highly scalable GAN-based generative model for text-to-image synthesis, achieving exceptional scale, speed, and controllability compared to previous models. Objective: Provide an alternative to auto-regressive and diffusion models for text-to-image synthesis. Problem: Making GANs more scalable and efficient in handling large datasets and generating high-quality, high-resolution images while maintaining stability and enabling fine-grained control over the generative process. Solution: GigaGAN reintroduces a multi-scale training scheme for GANs, which aims to improve the alignment between images and text descriptions and enhance the generation of low-frequency details in the output images. Methodology The GigaGAN architecture consists of the following - Generator: Text Encoding branch: utilizes a pre-trained CLIP model to extract text embeddings and a learned attention layer. 
Style mapping network: produces a style vector similar to StyleGAN Synthesis Network: uses style vector as modulation and text embeddings as attention to create an image pyramid Sample-adaptive kernel selection: chooses convolution kernels based on input text conditioning Discriminator: The image branch of the discriminator makes independent predictions for each scale within the image pyramid. The text branch handles text in a manner similar to the generator, while the image branch operates on an image pyramid, providing predictions at multiple scales. {{light_callout_start}} Find the code for evaluation on GitHub. {{light_callout_end}}  Results Scale Advancement: GigaGAN is 36 times larger in terms of parameter count than StyleGAN2. It is 6 times larger than StyleGAN-XL and XMC-GAN. Quality Performance: Despite its impressive scale, GigaGAN does not show quality saturation concerning model size. Achieves a zero-shot FID (Fréchet Inception Distance) of 9.09 on the COCO2014 dataset, which is lower than DALL·E 2, Parti-750M, and Stable Diffusion. Efficiency: GigaGAN is orders of magnitude faster at image generation, taking only 0.13 seconds to generate a 512px image. High-Resolution Synthesis: It can synthesize ultra-high-resolution images at 4k resolution in just 3.66 seconds. Latent Vector Control: GigaGAN offers a controllable latent vector space, enabling various well-studied controllable image synthesis applications, including style mixing, prompt interpolation, and prompt mixing. {{light_callout_start}} Read the original paper by Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park: Scaling up GANs for Text-to-Image Synthesis.  {{light_callout_end}}  Code Llama: Open Foundation Models for Code Code Llama is a cutting-edge code-specialized language model, forged through extended training on code-specific datasets, delivering enhanced coding capabilities and support for a range of programming languages. 
Objective: Build a large language model (LLM) that can use text prompts to generate and discuss code. Problem: Building a specialized language model for code generation and understanding, with a focus on performance, context handling, infilling, and instruction following. Solution: The proposed solution is Code Llama, which is available in three variants: Code Llama: the foundational code model. Code Llama - Python: specialized for Python. Code Llama - Instruct: fine-tuned to understand natural language instructions. Methodology Code Llama is a specialized model built upon Llama 2. It was developed by extended training on code-specific datasets, including increased data sampling and longer training. {{light_callout_start}} Find the code for implementation on GitHub. {{light_callout_end}}  Results Code Llama achieves state-of-the-art performance among open models on several code benchmarks: Scores of up to 53% on HumanEval and scores of up to 55% on MBPP. Code Llama - Python 7B outperforms Llama 2 70B on both HumanEval and MBPP benchmarks. All variants of Code Llama models outperform every other publicly available model on the MultiPL-E benchmark. {{light_callout_start}} Read the original paper by Meta AI: Code Llama: Open Foundation Models for Code. {{light_callout_end}}  FaceChain: A Playground for Identity-Preserving Portrait Generation FaceChain is a personalized portrait generation framework that combines advanced LoRA models and perceptual understanding techniques to create your Digital-Twin. Objective: A personalized portrait generation framework that generates images from a limited set of input images. Problem: The limitations of existing personalized image generation solutions, including the inability to accurately capture key identity characteristics and the presence of defects like warping, blurring, or corruption in the generated images. 
Solution: FaceChain is a framework designed to preserve the unique characteristics of faces while offering versatile control over stylistic elements in image generation. FaceChain is the integration of two LoRA models into the Stable Diffusion model. This integration endows the model with the capability to simultaneously incorporate personalized style and identity information, addressing a critical challenge in image generation. Methodology Integration of LoRA Models: FaceChain incorporates LoRA models to improve the stability of stylistic elements and maintain consistency in preserving identity during text-to-image generation. Style and Identity Learning: Two LoRA models are used, namely the style-LoRA model and face-LoRA model. The style-LoRA model focuses on learning information related to portrait style, while the face-LoRA model focuses on preserving human identities. Separate Training: These two models are trained separately. The style-LoRA model is trained offline, while the face-LoRA model is trained online using user-uploaded images of the same human identity. Quality Control: To ensure the quality of input images for training the face-LoRA model, FaceChain employs a set of face-related perceptual understanding models. These models normalize the uploaded images, ensuring they meet specific quality standards such as appropriate size, good skin quality, correct orientation, and accurate tags. Weighted Model Integration: During inference, the weights of multiple LoRA models are merged into the Stable Diffusion model to generate personalized portraits. Post-Processing: The generated portraits undergo a series of post-processing steps to further enhance their details and overall quality. {{light_callout_start}} Find the code implementation on GitHub. 
{{light_callout_end}}  Results FaceChain: A Playground for Identity-Preserving Portrait Generation {{light_callout_start}} Read the original paper by Alibaba Group: FaceChain: A Playground for Identity-Preserving Portrait Generation. {{light_callout_end}} 

September 12

5 min

Image Thresholding in Image Processing

In digital image processing, thresholding is the simplest method of segmenting images. It plays a crucial role in image processing as it allows for the segmentation and extraction of important information from an image. By dividing an image into distinct regions based on pixel intensity or pixel value, thresholding helps distinguish objects or features of interest from the background. This technique is widely used in various applications such as object detection, image segmentation, and character recognition, enabling efficient analysis and interpretation of digital images. Additionally, image thresholding can enhance image quality by reducing noise and improving overall visual clarity.  Thresholding - Image Processing The choice of thresholding technique is a critical determinant of the accuracy and effectiveness of image analysis. Different thresholding techniques have their own strengths and limitations. Selecting the appropriate technique depends on factors such as image complexity, noise levels, and the desired outcome. Therefore, it is essential to choose carefully and to experiment to ensure optimal results in image processing tasks.  In this article, we will cover the following: What is Image Thresholding?  Image Thresholding Techniques Applications of Image Thresholding  Practical Implementation and Considerations Challenges with Image Thresholding Future Developments in Image Thresholding Image Thresholding: Key Takeaways What is Image Thresholding? Image thresholding involves dividing an image into two or more regions based on intensity levels, allowing for easy analysis and extraction of desired features. By setting a threshold value, pixels with intensities above or below the threshold can be classified accordingly. This technique aids in tasks such as object detection, segmentation, and image enhancement.  
Image thresholding is a technique that simplifies a grayscale image into a binary image by classifying each pixel as either black or white based on how its intensity (gray level) compares to the threshold value. This technique reduces the image to only two levels of intensity, making it easier to identify and isolate objects of interest. Binary image conversion allows for efficient processing and analysis of images, enabling various computer vision applications such as edge detection and pattern recognition.  In image processing algorithms, the principle of pixel classification based on an intensity threshold is widely used. By setting a specific threshold value, pixels with intensity levels above the threshold are classified as white, while those below the threshold are classified as black. This principle forms the foundation for various image enhancement techniques that help to extract important features from an image for further analysis.  In data science and image processing, an entropy-based approach to image thresholding is used to optimize the process of segmenting specific types of images, often those with intricate textures or diverse patterns. By analyzing the entropy, which measures information randomness, this technique seeks the threshold value that maximizes the information gained when converting the image into a binary form. This approach is especially beneficial for images with complex backgrounds or varying lighting conditions. Through this technique, the binary thresholding process becomes finely tuned, resulting in more accurate segmentation and enhanced feature extraction, which is vital for applications in image analysis and computer vision tasks. Image Thresholding Techniques Image thresholding techniques are widely used in various fields such as medical imaging, computer vision, and remote sensing. These techniques are essential for accurate image processing and interpretation. 
They help to convert grayscale or color images into binary images, separating the foreground from the background, allowing for better segmentation and extraction of features from an image, which is crucial for various applications in computer vision and pattern recognition. Global Thresholding Global Thresholding is a widely used technique where a single threshold value is applied to an entire image. However, this technique  may not be suitable for images with varying lighting conditions or complex backgrounds. To overcome this limitation, adaptive thresholding techniques may be employed, which adjust the threshold value locally based on the characteristics of each pixel's neighborhood. These techniques are particularly useful in scenarios where there is significant variation in illumination across different regions of the image.  Thresholding-Based Image Segmentation Simple thresholding is a basic technique that assigns a binary value to each pixel based on a global threshold value. It is effective when the image has consistent lighting conditions and a clear foreground-background separation. However, when images contain varying lighting conditions or complex backgrounds, adaptive thresholding techniques are more suitable. These techniques dynamically adjust the threshold value for each pixel based on its local neighborhood, allowing for better segmentation and accurate object detection.  Otsu's Method for Automatic Threshold Determination is a widely used technique for automatically determining the optimal threshold value in image segmentation. It calculates the threshold by maximizing the between-class variance of pixel value, which effectively separates foreground and background regions. This method is particularly useful when dealing with images that have bimodal or multimodal intensity distributions, as it can accurately identify the threshold that best separates different objects or regions in the image.  
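To make the between-class variance criterion concrete, here is a minimal NumPy sketch of Otsu's method computed directly from the gray-level histogram. The function name `otsu_threshold` and the tiny two-level test image are assumptions made for illustration; in practice `cv2.threshold` with `cv2.THRESH_OTSU` does this work for you.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    Expects an 8-bit (uint8) grayscale image.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()                  # normalized histogram
    levels = np.arange(256)

    omega0 = np.cumsum(prob)                  # weight of class 0 for each candidate t
    mu_cum = np.cumsum(prob * levels)         # cumulative first moment up to t
    mu_total = mu_cum[-1]                     # mean gray level of the whole image
    omega1 = 1.0 - omega0                     # weight of class 1

    # Between-class variance for every t: (mu_total*omega0 - mu_cum)^2 / (omega0*omega1)
    numer = (mu_total * omega0 - mu_cum) ** 2
    denom = omega0 * omega1
    sigma_b = np.zeros(256)
    valid = denom > 0                         # skip thresholds that leave a class empty
    sigma_b[valid] = numer[valid] / denom[valid]
    return int(np.argmax(sigma_b))


# Hypothetical bimodal image: left half dark (50), right half bright (200)
gray = np.zeros((10, 10), dtype=np.uint8)
gray[:, :5] = 50
gray[:, 5:] = 200

t = otsu_threshold(gray)
mask = gray > t                               # True for the bright class
print("threshold:", t, "foreground pixels:", int(mask.sum()))
```

Only the cumulative weight and the cumulative mean of the histogram are needed, so all 256 candidate thresholds are evaluated in a single pass.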
Otsu's method - Wikipedia {{light_callout_start}} “A nonparametric and unsupervised method of automatic threshold selection for picture segmentation. An optimal threshold is selected by the discriminant criterion, so as to maximize the separability of the resultant classes in gray levels. The procedure utilizes only the zeroth- and the first-order cumulative moments of the gray-level histogram.” - Nobuyuki Otsu  {{light_callout_end}} Pros and Cons of Global Thresholding  Global thresholding offers several advantages, including its simplicity and efficiency in determining a single threshold value for the entire image. It is particularly effective in scenarios where the foreground and background regions have distinct intensity distributions. However, global thresholding may not be suitable for images with complex intensity distributions or when there is significant variation in lighting conditions across the image. Additionally, it may not accurately segment objects or regions that have overlapping intensity values.  Local (Adaptive) Thresholding  Local thresholding addresses the limitations of global thresholding by considering smaller regions within the image. It calculates a threshold value for each region based on its local characteristics, such as mean or median intensity. This approach allows for better adaptability to varying lighting conditions and complex intensity distributions, resulting in more accurate segmentation of objects or regions with overlapping intensity values. However, local thresholding may require more computational resources and can be sensitive to noise or uneven illumination within the image, which can affect the overall performance of the segmentation algorithm. Adaptive Thresholds for Different Image Regions are needed to overcome the challenges of variations in lighting conditions and contrast within an image. These adaptive thresholds help improve the accuracy and clarity of object or region detection. 
This approach involves dividing the image into smaller sub-regions and calculating a threshold value for each sub-region based on its local characteristics. By doing so, the algorithm can better account for these variations and mitigate the effects of noise or uneven illumination, as each sub-region is treated independently.  {{light_callout_start}} The simplest method to segment an image is thresholding. Using the thresholding method, segmentation of an image is done by fixing all pixels whose intensity values are more than the threshold to a foreground value.  {{light_callout_end}} Mean and Gaussian Adaptive Thresholding  Two commonly used methods in image processing are Mean and Gaussian Adaptive Thresholding. Mean adaptive thresholding calculates the threshold value for each sub-region by taking the average intensity of all pixels within that region. On the other hand, Gaussian adaptive thresholding uses a weighted average of pixel intensities, giving more importance to pixels closer to the center of the sub-region. These methods are effective in enhancing image quality and improving accuracy in tasks such as object detection or segmentation.   Advantages over Global Thresholding  Adaptive Thresholding has advantages over global thresholding. One advantage is that it can handle images with varying lighting conditions or uneven illumination. This is because adaptive thresholding calculates the threshold value locally, taking into account the specific characteristics of each sub-region. Additionally, adaptive thresholding can help preserve important details and fine textures in an image, as it adjusts the threshold value based on the local pixel intensities.  Applications of Image Thresholding  Image thresholding is a technique used in computer vision that has a variety of applications, including image segmentation, object detection, and character recognition. 
By separating objects from their background in an image, image thresholding makes it easier to analyze and extract relevant information. Optical character recognition (OCR) systems, for example, use image thresholding to distinguish between foreground (text) and background pixels in scanned documents, making them editable. Real-world applications  Object Detection: By setting a threshold value, objects can be separated from the background, allowing for more accurate and efficient object detection.  Medical Images: Image thresholding can be used to segment different structures or abnormalities for diagnosis and analysis in medical imaging. Quality Control: Image thresholding plays a crucial role in quality control processes, such as inspecting manufactured products for defects or ensuring consistency in color and texture of a color image. Object Segmentation: Image thresholding is also commonly used in computer vision tasks such as object segmentation, where it helps to separate foreground objects from the background. This enables more accurate and efficient detection of objects within an image. Noise Reduction: Thresholding can be utilized for noise reduction, as it can help to eliminate unwanted artifacts or disturbances in an image.   Edge Detection: Image thresholding aids in identifying and highlighting the boundaries between different objects or regions within an image with edge detection algorithms.  A step by step guide to Image Segmentation in Computer Vision can be read here. Thresholding Practical Implementation and Considerations  When implementing thresholding techniques, it is important to carefully select the appropriate threshold value based on the specific image and desired outcome. This can be achieved through experimentation or through the use of adaptive thresholding methods that automatically adjust the threshold based on local image characteristics. 
Furthermore, it is essential to consider the potential trade-off between noise reduction and preserving important details in the image, as aggressive thresholding may lead to the loss of valuable information.  Steps for implementing thresholding algorithms (Python) Here are step-by-step guides for implementing image thresholding algorithms using Python. You will implement global thresholding and Otsu's thresholding, two commonly used thresholding techniques. Implementing Image Thresholding Algorithms in Python Global Thresholding Let us put what we have covered so far into practice. You can use Google Colab to run the code below.

    # Install the required library
    !pip install opencv-python
    # Get the image file
    !wget -O /content/sunflower.jpg 'https://images.unsplash.com/photo-1503936380431-4a4ce0fc296c?ixlib=rb4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2970&q=80'

After acquiring the image file, right-click on it to copy the image path, and paste it in place of "ADD YOUR FILE PATH HERE." If you are using a different IDE, enter the path on your local system instead. 
    import cv2
    from google.colab.patches import cv2_imshow

    # Read the image in grayscale
    # image = cv2.imread('ADD YOUR FILE PATH HERE', cv2.IMREAD_GRAYSCALE)
    image = cv2.imread('/content/sunflower.jpg', cv2.IMREAD_GRAYSCALE)

    # Apply global thresholding: pixels above 127 become 255, all others 0
    _, binary_image = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)

    # Display the results
    cv2_imshow(image)
    cv2_imshow(binary_image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

Output: Grayscale Image Binary Image Otsu's Thresholding

    import cv2
    from google.colab.patches import cv2_imshow

    # Read the image in grayscale
    image = cv2.imread('/content/sunflower.jpg', cv2.IMREAD_GRAYSCALE)

    # Define the desired width and height for the resized image
    desired_width = 640   # Change this to your desired width
    desired_height = 480  # Change this to your desired height

    # Resize the image to the desired size
    resized_image = cv2.resize(image, (desired_width, desired_height))

    # Apply Otsu's thresholding (the threshold argument 0 is ignored when THRESH_OTSU is set)
    _, binary_image = cv2.threshold(resized_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Display the results
    cv2_imshow(resized_image)
    cv2_imshow(binary_image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

Output: Otsu’s Thresholding Image The code above applies Otsu's thresholding to the image and displays the resized grayscale image alongside the thresholded binary image. Remember to replace '/content/sunflower.jpg' with the actual path of your image file. These examples demonstrate the basic implementation of global thresholding and Otsu's thresholding in Python. You can further customize this code to suit your specific image processing needs, including pre-processing steps, visualization enhancements, and additional algorithm parameters. 
Global Thresholding Value in Python using Otsu’s Method

    import cv2
    from google.colab.patches import cv2_imshow

    # Read the image in grayscale
    image = cv2.imread('/content/sunflower.jpg', cv2.IMREAD_GRAYSCALE)

    # Define the desired width and height for the resized image
    desired_width = 640   # Change this to your desired width
    desired_height = 480  # Change this to your desired height

    # Resize the image to the desired size
    resized_image = cv2.resize(image, (desired_width, desired_height))

    # cv2.threshold returns both the threshold chosen by Otsu's method and the binary image
    otsu_threshold_value, global_thresholded = cv2.threshold(resized_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Display the results
    cv2_imshow(global_thresholded)
    print("Global Threshold Value:", otsu_threshold_value)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

Output: Global Threshold Value: 168.0 This code displays the image after global thresholding using Otsu's method and prints the threshold value determined by Otsu's algorithm. Pre-processing and post-processing impact  Pre-processing and post-processing techniques play a crucial role in achieving accurate and meaningful results in image thresholding. Employing a range of techniques before and after thresholding can significantly enhance the accuracy of segmentation and the usability of the final binary image. Pre-processing techniques such as noise reduction, image enhancement, and morphological operations, applied before thresholding, can improve segmentation results. Similarly, post-processing techniques like connected component analysis and contour smoothing can further refine the binary image and remove any artifacts or imperfections.  
Let's delve deeper into how pre-processing and post-processing impact image thresholding. Pre-processing Impact Noise Reduction techniques like Gaussian smoothing or median filtering help suppress noise while preserving important edges and details. Contrast Enhancement of an image before thresholding, using histogram equalization or adaptive histogram equalization, can lead to better separation between object and background intensities. The image histogram changes substantially once the image is thresholded, as shown in the figure below. Histogram transformations Illumination Correction through background subtraction or morphological operations normalizes illumination across the image, especially in cases where lighting conditions are non-uniform or uneven. Edge Detection techniques can be applied as a pre-processing step to identify significant edges in the image. This can assist in defining regions of interest and guide the thresholding process, especially when the boundaries between objects and background are not well-defined. Image Smoothing with filters like Gaussian blur or mean filtering can reduce fine details and minor variations in the image, simplifying the thresholding process and leading to more coherent segmentation results. Post-processing Impact Connected Component Analysis identifies and labels separate regions in the binary image, distinguishing individual objects and eliminating isolated noise pixels. Morphological Operations like erosion and dilation fine-tune the binary image by removing small noise regions and filling in gaps between segmented objects. Object Size Filtering removes small objects or regions that are unlikely to be relevant, especially when dealing with noise or artifacts that may have been segmented as objects during thresholding. 
Edge Smoothing with smoothing filters applied to the binary image can result in cleaner and more natural-looking object boundaries. Object Feature Extraction computes features such as area, perimeter, centroid, and orientation, which can be used for further analysis or classification. Object Merging and Splitting techniques can be applied to merge nearby objects or split overly large ones in cases where thresholding results in objects that are too fragmented or merged. Pre-processing and post-processing steps are integral to obtaining accurate and meaningful results in image thresholding. The selection of appropriate techniques and their parameters should be guided by the specific characteristics of the image and the goals of the analysis. By thoughtfully combining pre-processing and post-processing techniques, it is possible to transform raw images into segmented binary images that provide valuable insights for various applications. Challenges with Image Thresholding  There are several challenges with image thresholding. Some of the main challenges are determining an appropriate threshold value, handling noise and variations in lighting conditions, and dealing with complex image backgrounds. Furthermore, selecting the right pre-processing and post-processing techniques can be difficult, as it requires a deep understanding of the image content and the desired outcome. Overcoming these challenges requires careful consideration and experimentation.  The challenge of thresholding continuous antibody measures {{light_callout_start}} Some of the challenges of image thresholding include high computational cost, insufficient performance, lack of generalization and flexibility, lack of capacity to capture various image degradations, and many more. {{light_callout_end}} Image thresholding presents distinct challenges when dealing with complex images and varying lighting conditions. 
These challenges can impact the accuracy of segmentation results and require careful consideration to achieve reliable outcomes. Let's delve into the specific challenges posed by complex images and varying lighting conditions: Complex Images Complex Intensity Distributions: Images with complex intensity distributions, such as multi-modal or non-uniform distributions, can make selecting an appropriate threshold value difficult. Traditional thresholding methods that assume a bi-modal distribution might struggle to accurately segment objects when intensity values are spread across multiple peaks. Gradual Intensity Transitions: Objects with gradual intensity changes or subtle edges can be challenging to segment accurately. Traditional thresholding methods are designed to work best with well-defined edges, and they might lead to fragmented or imprecise segmentation when applied to images with gradual transitions. Overlapping Objects: Objects that overlap or occlude each other in the image can cause difficulties for thresholding. In such cases, a single threshold might segment a merged object as multiple objects, or vice versa. This can lead to inaccurate object separation and hinder subsequent analysis. Texture and Pattern Variability: Images with intricate textures or complex patterns can be tough to segment accurately. Traditional thresholding, which relies on intensity values alone, might not effectively capture the variations in textures, leading to under-segmentation or over-segmentation. Partial Occlusion: When an object is only partially visible due to occlusion or truncation, thresholding methods can struggle to define the boundaries accurately. Incomplete segmentation can lead to errors in size, shape, and feature measurements. Multiple Object Types: Images containing multiple types of objects with varying shapes, sizes, and intensities pose a challenge for uniform thresholding. Adapting the threshold value to cater to these diverse objects can be complex. 
Varying Lighting Conditions Uneven Illumination: Images captured under uneven or non-uniform lighting conditions can result in inaccurate segmentation using global thresholding. Objects illuminated differently might not be accurately separated from the background, leading to segmentation errors. Shadows and Highlights: Varying lighting conditions can create shadows and highlights, altering the perceived intensity values of objects. Shadows can cause objects to be under-segmented, while highlights can lead to over-segmentation. Local Intensity Variations: In the presence of varying lighting, the assumption of consistent intensity values across an object might not hold true. Adaptive thresholding methods that consider local intensity characteristics are better suited to handle such scenarios. Dynamic Scenes: Images captured in dynamic environments with changing lighting conditions, such as outdoor scenes or real-time video feeds, require continuous adjustment of threshold values to account for the evolving illumination. Static thresholding might result in poor segmentation. Reflections and Glare: Reflective surfaces or glare can cause spikes in intensity values, complicating the thresholding process. These spikes can be misleading and result in the misclassification of pixels. Addressing these challenges requires a combination of techniques, including adaptive thresholding methods, pre-processing steps, and post-processing refinements. Adaptive thresholding takes into account local intensity variations and is particularly effective in dealing with varying lighting conditions. Pre-processing steps, such as contrast enhancement and illumination normalization, can help mitigate the effects of uneven lighting. Post-processing techniques, like morphological operations and edge smoothing, can refine the segmentation results and eliminate artifacts. 
Image Thresholding in varying Lighting Conditions Furthermore, the integration of machine learning techniques, like convolutional neural networks (CNNs), can enhance segmentation accuracy for complex images and varying lighting conditions. These approaches learn from data and can adapt to the intricacies of the image content. Overall, understanding the unique challenges presented by complex images and varying lighting conditions and applying appropriate techniques is crucial for successful image thresholding in these scenarios.  Future Developments in Image Thresholding  Upcoming advancements in image processing include the integration of deep learning algorithms, which can further enhance segmentation accuracy by automatically learning and extracting features from complex images. Furthermore, advancements in hardware technology, such as the development of specialized processors for image processing tasks, may also contribute to faster and more efficient image thresholding in the future.  The potential impact of emerging technologies in image thresholding is significant. With the integration of deep learning algorithms, we can expect more accurate and precise segmentation results, leading to improved applications in fields like medical imaging, autonomous vehicles, and object recognition. Furthermore, advancements in hardware technology can significantly enhance the speed and efficiency of image thresholding algorithms, enabling real-time processing and analysis of large-scale image datasets.  Image Thresholding: Key Takeaways Crucial Role of Thresholding: Image thresholding is vital for segmenting images, extracting features, and enhancing image quality. It's used in object detection, segmentation, and character recognition, aiding efficient image analysis. Technique Selection Importance: Choosing the right thresholding technique is crucial. Different methods have strengths and limitations, based on image complexity, noise, and goals. 
Careful consideration is essential for optimal results. Binary Conversion: Image thresholding simplifies images by converting them to binary form (black and white). This simplification aids in isolating objects and features of interest. Global and Adaptive Thresholding: Global thresholding is straightforward but not suitable for complex backgrounds. Adaptive thresholding adjusts locally, making it effective for varying lighting conditions. Otsu's Method and Applications: Otsu's method automatically determines optimal thresholds, especially useful for complex images. Thresholding finds applications in object detection, segmentation, edge detection, and quality control. Implementation and Challenges: Implementing thresholding involves selecting thresholds, pre-processing, and post-processing. Challenges include noise, lighting variations, complex backgrounds, and overlapping objects.

September 12

5 min

Barlow Twins: Self-Supervised Learning

Self-supervised learning (SSL) has emerged as a transformative paradigm in machine learning, particularly in computer vision applications. Unlike traditional supervised learning, where labeled data is a prerequisite, SSL leverages unlabeled data, making it a valuable approach when labeled datasets are scarce. The essence of SSL lies in its ability to process data of lower quality without compromising the ultimate outcomes. This approach mirrors how humans learn to classify objects more closely, extracting patterns and correlations from the data autonomously. However, a significant challenge in SSL is the potential for trivial, constant solutions. A naive SSL method trivially classifies every example as positive in binary classification, leading to a constant and uninformative solution.  This challenge underscores the importance of designing robust algorithms, such as the Barlow Twins, that can effectively leverage the power of SSL while avoiding pitfalls like trivial solutions. In the subsequent sections, we will delve deeper into the Barlow Twins approach to SSL, a new approach developed by Yann LeCun and the team at Facebook. We will also explore its unique features, benefits, and contribution to the ever-evolving landscape of self-supervised learning in machine learning. The Barlow Twins Approach The Barlow Twins method, named in homage to neuroscientist H. Barlow's redundancy-reduction principle, presents a novel approach to self-supervised learning (SSL). This method is particularly significant in computer vision, where SSL has rapidly bridged the performance gap with supervised methods. Central to the Barlow Twins approach is its unique objective function, designed to naturally prevent the collapse often observed in other SSL methods. This collapse typically results in trivial, constant solutions, a challenge many SSL algorithms grapple with. 
The Barlow Twins method addresses this by measuring the cross-correlation matrix between the outputs of two identical neural networks. These networks are fed distorted versions of a sample, and the objective is to make this matrix as close to the identity matrix as possible.

The role of the cross-correlation matrix is pivotal. Driving its diagonal toward 1 makes the embedding vectors of distorted versions of a sample similar, while driving its off-diagonal entries toward 0 minimizes redundancy between the components of these vectors. This enhances the quality of the embeddings and makes the learned representations robust and invariant to the applied distortions.

Image Classification with Barlow Twins

[Figure: Barlow Twins architecture]

Imagine you have many images of cats and dogs, but none of them are labeled. You want to train a machine learning model to distinguish between cats and dogs using this unlabeled dataset. Here is the Barlow Twins approach:

Data Augmentation: Create two distorted versions of each image in the dataset. For instance, one version might be a cropped section of the original image, and the other might be the same cropped section but with altered brightness or color.
Twin Neural Networks: Use two identical neural networks (the "twins"). Feed one distorted version of the image into the first network and the other distorted version into the second network.
Objective Function: The goal is to make the outputs (embeddings) of the two networks as similar as possible for the same input image, so that the networks recognize the two distorted versions as being of the same class (either cat or dog). At the same time, the Barlow Twins method aims to reduce redundancy in the embeddings. This is achieved by pushing the cross-correlation matrix of the embeddings from the two networks toward the identity matrix. In simpler terms, the method makes each embedding component as uncorrelated as possible with the other components.
Training: The twin networks are trained using the above objective. Over time, the networks learn to produce similar embeddings for distorted versions of the same image and different embeddings for images of different classes (cats vs. dogs).
Representation Learning: Once trained, you can use one of the twin networks (or both) to extract meaningful representations (embeddings) from new images. These representations can then be used with a simple linear classifier for various tasks, such as classification.

Barlow Twins Loss Function

The primary objective of the Barlow Twins method is to reduce redundancy in the representations learned by neural networks. To achieve this, the method uses two identical neural networks (often called "twins") that process two distorted versions of the same input sample. The goal is to make the outputs (or embeddings) of these networks as similar as possible for the same input while ensuring that the individual components of these embeddings are not redundant. The Barlow Twins loss function is designed to achieve this objective. It is formulated based on the cross-correlation matrix of the outputs from the two networks.

Cross-Correlation Matrix Calculation: Suppose the outputs (embeddings) of the two networks for a batch of samples are Y1 and Y2. The cross-correlation matrix C is computed as the matrix product of the mean-centered outputs of the two networks, normalized along the batch dimension so that each entry lies between -1 and 1.
Loss Function: The diagonal elements of C measure how strongly each embedding component from one network correlates with the same component from the other network. The method aims to make these diagonal elements equal to 1, so that the embeddings from the two networks are similar. The off-diagonal elements measure the correlation between different components. The method aims to make these off-diagonal elements equal to 0, so that the components of the embeddings are not redundant.
The loss is then computed as the sum of the squared differences between the diagonal elements and 1, plus a weighted sum of the squared off-diagonal elements.

[Figure: Pseudocode for Barlow Twins]

The Barlow Twins approach can be applied to more complex datasets and tasks beyond simple image classification. The key idea is to leverage the structure in unlabeled data by ensuring that the learned representations are consistent across distortions and non-redundant.

Redundancy Reduction Principle

Horace Basil Barlow, a renowned British vision scientist, significantly contributed to our understanding of the visual system. One of his most influential concepts was the redundancy reduction principle. Barlow posited that one of the primary computational aims of the visual system is to reduce redundancy, leading to the efficient coding hypothesis. In simpler terms, while adjacent points in images often have similar brightness levels, the retina minimizes this redundancy, so that the information processed is as concise and non-redundant as possible. The Barlow Twins method in self-supervised learning draws inspiration from this principle. By reducing redundancy, the Barlow Twins approach aims to create embeddings that are invariant to distortions and whose components are decorrelated from one another. As a result, neural networks trained with this method produce representations that capture the essential features of the data while discarding superfluous information. In machine learning and computer vision, applying Barlow's redundancy reduction principle through the Barlow Twins method offers a promising avenue for achieving state-of-the-art results in various tasks, from image classification to segmentation.

Key Features of Barlow Twins

Independence from Large Batches

One of the standout features of the Barlow Twins method is its independence from large batches. In deep learning, especially with extensive datasets, large batch sizes are often employed to expedite training.
However, this can lead to challenges, including the need for significant GPU memory and potential generalization issues. The Barlow Twins approach, in contrast, does not necessitate large batches. This independence is particularly advantageous for those without access to extensive computational resources. The method's design, which emphasizes measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, ensures that the embeddings produced are invariant to these distortions. By aiming to make this matrix as close to the identity matrix as possible, the Barlow Twins method effectively minimizes redundancy between the components of the embedding vectors, irrespective of the batch size. Another noteworthy aspect is the method's resilience to overfitting. Since it doesn't rely on large batches, the risk of the model memorizing the training data, a common pitfall in machine learning, is substantially reduced. This ensures the trained models are more robust and can generalize to unseen data. The Barlow Twins approach's design, emphasizing redundancy reduction and independence from large batches, sets it apart in self-supervised learning methods. Its unique features make it resource-efficient and ensure its applicability and effectiveness across various tasks and computational settings. Symmetry in Network Twins The Barlow Twins approach is distinctive in its utilization of two identical neural networks, often called "twins". This symmetry departs from many other self-supervised learning methods that rely on predictor networks, gradient stopping, or moving averages to achieve their objectives. The beauty of this symmetric design lies in its simplicity and efficiency. By feeding distorted versions of a sample into these twin networks and then comparing their outputs, the Barlow Twins method ensures that the produced embeddings are invariant to the distortions. 
This symmetry eliminates the need for additional complexities like predictor networks, often used to map representations from one network to another. The absence of gradient stopping and moving averages in the Barlow Twins approach means that the training process is more straightforward and less prone to potential pitfalls associated with these techniques. Gradient stopping, for instance, can sometimes hinder the optimization process, leading to suboptimal results. In essence, the symmetric design of the Barlow Twins method not only simplifies the training process but also enhances the robustness and effectiveness of the learned representations. By focusing on redundancy reduction and leveraging the power of symmetric network twins, the Barlow Twins approach offers a fresh perspective in the ever-evolving landscape of self-supervised learning. Benefits of High-Dimensional Output Vectors The Barlow Twins approach has garnered attention for its unique take on self-supervised learning, particularly in its use of high-dimensional output vectors. But why does this matter? High-dimensional vectors allow for a richer data representation in neural networks. The Barlow Twins method can capture intricate patterns and nuances in the data that might be missed with lower-dimensional representations when using very high-dimensional vectors. This depth of representation is crucial for tasks like image recognition in computer vision, where subtle differences can be the key to accurate classification. Moreover, the Barlow Twins method leverages these high-dimensional vectors to ensure that the embeddings produced by the twin networks are both similar (due to the distorted versions of a sample) and minimally redundant. This balance between similarity and non-redundancy is achieved through the redundancy reduction principle, inspired by neuroscientist H. Barlow. To illustrate, imagine describing a complex painting using only a few colors. 
While you might capture the general theme, many details would be lost. Now, imagine having a vast palette of colors at your disposal. The richness and depth of your description would be incomparably better. Similarly, high-dimensional vectors offer a richer "palette" for neural networks to represent data. Using very high-dimensional vectors in the Barlow Twins method allows for a more detailed and nuanced understanding of data, paving the way for more accurate and robust machine learning models.

Performance and Comparisons

The Barlow Twins approach has been a significant step forward in self-supervised learning, particularly when benchmarked against the ImageNet dataset. ImageNet is a large-scale dataset pivotal for computer vision tasks and is a rigorous testing ground for novel algorithms and methodologies. In semi-supervised classification, especially in scenarios where data is limited, the Barlow Twins method has performed well. The method outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime. This is particularly noteworthy because limited data makes it hard to train robust models. With its unique approach to redundancy reduction and high-dimensional output vectors, the Barlow Twins method captures intricate patterns in the data, leading to improved classification results. Moreover, using a linear classifier head, the Barlow Twins approach is on par with the state of the art for ImageNet classification. It also holds its ground in transfer tasks of classification and object detection. These results underscore the potential of the Barlow Twins method in pushing the boundaries of self-supervised learning, especially in computer vision tasks.

[Figure: ImageNet numbers for the Barlow Twins SSL approach]

The Barlow Twins approach to SSL focuses on learning embeddings that remain invariant to input sample distortions.
A significant challenge in this domain has been the emergence of trivial, constant solutions. While most contemporary methods have circumvented this issue through meticulous implementation nuances, the Barlow Twins approach introduces an objective function that inherently prevents such collapses. The Barlow Twins algorithm differs in notable ways when compared with other SSL methods. For instance, SimCLR and BYOL, two state-of-the-art SSL baselines, rely heavily on negative samples and data augmentations, respectively. In contrast, the Barlow Twins method sidesteps the need for negative samples, focusing instead on minimizing the redundancy between embeddings. This approach, combined with large batches and a tailored learning rate, has been instrumental in its success. Furthermore, the Barlow Twins algorithm has been tested on the ImageNet dataset, a large-scale computer vision benchmark. The results were compelling. Using a ResNet-50 encoder and a projector network, the Barlow Twins achieved a 67.9% top-1 accuracy after 100 epochs. This performance is particularly noteworthy when considering the algorithm's simplicity and the projector network's absence of batch normalization or ReLU. It's worth noting that the Barlow Twins' performance and comparisons are actively discussed on various platforms, including GitHub, where developers and researchers share their insights and modifications to the algorithm. As the field of SSL continues to grow, it will be intriguing to see how the Barlow Twins method evolves and where it stands among other SSL methods.

Barlow Twins: Key Takeaways

The Importance of Redundancy Reduction

The Barlow Twins method has been recognized for its innovative application of the redundancy reduction principle in self-supervised learning (SSL). This principle, inspired by neuroscientist H. Barlow, emphasizes the significance of reducing redundant information while retaining essential features.
In the context of the Barlow Twins, this means creating embeddings invariant to distortions of the input sample while avoiding trivial constant solutions. The method achieves this by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample and ensuring it remains close to the identity matrix. This intricate balance ensures that the embeddings of distorted versions of a sample are alike, yet the redundancy between the components of these vectors is minimized. Advantages Over Other SSL Methods The Barlow Twins approach offers several unique advantages over other SSL methods. One of its standout features is its ability to naturally avoid the collapse of embeddings without needing large batches or asymmetry between the twin networks. This is achieved without using techniques like gradient stopping, predictor networks, or moving averages on weight updates. Furthermore, the method benefits from high-dimensional output vectors, allowing for richer data representation and improved performance in tasks like image recognition. The future looks promising as SSL narrows the gap with supervised methods, especially in large computer vision benchmarks. With its unique approach and advantages, the Barlow Twins method is poised to play a pivotal role in developing SSL methods. The potential for further research lies in refining the method, exploring its application in diverse domains, and integrating it with other advanced techniques to push the boundaries in SSL.
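
The loss described in the Barlow Twins Loss Function section above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: embeddings are lists of rows (one row per sample), each component is standardized along the batch before the matrix product, the weight `lambda_offdiag` on the off-diagonal term is a hyperparameter with an illustrative default, and the small `eps` is added only to avoid division by zero. All function names are hypothetical.

```python
from statistics import fmean, pstdev


def standardize_columns(embeddings, eps=1e-12):
    """Center and scale each embedding component across the batch."""
    columns = list(zip(*embeddings))
    means = [fmean(column) for column in columns]
    stds = [pstdev(column) + eps for column in columns]  # eps avoids division by zero
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in embeddings
    ]


def cross_correlation(z_a, z_b):
    """Cross-correlation matrix C between two batches of embeddings."""
    a, b = standardize_columns(z_a), standardize_columns(z_b)
    batch_size, dim = len(a), len(a[0])
    return [
        [sum(a[n][i] * b[n][j] for n in range(batch_size)) / batch_size
         for j in range(dim)]
        for i in range(dim)
    ]


def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3):
    """Push the diagonal of C toward 1 and the off-diagonal toward 0."""
    c = cross_correlation(z_a, z_b)
    dim = len(c)
    on_diagonal = sum((c[i][i] - 1.0) ** 2 for i in range(dim))
    off_diagonal = sum(
        c[i][j] ** 2 for i in range(dim) for j in range(dim) if i != j
    )
    return on_diagonal + lambda_offdiag * off_diagonal


# Two toy batches of 2-dimensional embeddings (4 samples each).
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
redundant = [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]]
collapsed = [[1.0, 2.0]] * 4

print(barlow_twins_loss(decorrelated, decorrelated))  # close to 0
print(barlow_twins_loss(redundant, redundant))        # small penalty for redundancy
print(barlow_twins_loss(collapsed, collapsed))        # large penalty for collapse
```

The three toy batches show the behavior the article describes: matching, decorrelated embeddings give a loss near zero, duplicated components are penalized through the off-diagonal term, and a constant (collapsed) output is penalized through the diagonal term.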

September 11

5 min

Introduction to Vision Transformers (ViT)

In the rapidly evolving landscape of artificial intelligence, a paradigm shift is underway in the field of computer vision.  Vision Transformers, or ViTs, are transformative models that bridge the worlds of image analysis and self-attention-based architectures. These models have shown remarkable promise in various computer vision tasks, inspired by the success of Transformers in natural language processing. In this article, we will explore Vision Transformers, how they work, and their diverse real-world applications. Whether you are a seasoned AI enthusiast or just beginning in this exciting field, join us on this journey to understand the future of computer vision. What is a Vision Transformer? The Vision Transformers, or ViTs for short, combine two influential fields in artificial intelligence: computer vision and natural language processing (NLP).  The Transformer model, originally proposed in the paper titled "Attention Is All You Need" by Vaswani et al. in 2017, serves as the foundation for ViTs. Transformers were designed as a neural network architecture that excels in handling sequential data, making them ideal for NLP tasks. ViTs bring the innovative architecture of Transformers to the world of computer vision.  {{light_callout_start}} The state-of-the-art large language models GPT by OpenAI and BERT by Google leverage transformers to model contextual information in text. BERT focuses on bidirectional representations and GPT on autoregressive generation. {{light_callout_end}} Vision Transformers vs Convolutional Neural Networks In computer vision, Convolutional Neural Networks (CNNs) have traditionally been the preferred models for processing and understanding visual data. However, a significant shift has occurred in recent years with the emergence of Vision Transformers (ViTs). These models, inspired by the success of Transformers in natural language processing, have shown remarkable potential in various computer vision tasks.  
CNN Dominance

For decades, Convolutional Neural Networks (CNNs) have been the dominant models used in computer vision. Inspired by the human visual system, these networks excel at processing visual data by leveraging convolutional operations and pooling layers. CNNs have achieved impressive results in various image-related tasks, earning their status as the go-to models for image classification, object detection, and image segmentation.

[Figure: Application of Convolutional Neural Network Method in Brain Computer Interface]

A convolutional network comprises layers of learnable filters that convolve over the input image. These filters learn to detect specific features, such as edges, textures, or more complex patterns. Additionally, pooling layers downsample the feature maps, gradually reducing the spatial dimensions while retaining essential information. This layered approach allows CNNs to learn hierarchical features, capturing increasingly complex patterns as data progresses through the network.

{{light_callout_start}} Read Convolutional Neural Networks (CNN) Overview for more information. {{light_callout_end}}

Vision Transformer Revolution

While CNNs have been instrumental in computer vision, a paradigm shift has emerged with the introduction of Vision Transformers (ViTs). ViTs leverage the innovative Transformer architecture, originally designed for sequential data, and apply it to image understanding. CNNs operate directly on pixel-level data, exploiting spatial hierarchies and local patterns. In contrast, ViTs treat images as sequences of patches, borrowing a page from NLP where words are treated as tokens. This fundamental difference in data processing, coupled with the power of self-attention, enables ViTs to learn intricate patterns and relationships within images and gives them a unique advantage.
{{light_callout_start}} The Vision Transformer (ViT) model was proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy et al. This represented a significant breakthrough in the field, as it is the first time a Transformer encoder has been trained on ImageNet with superior performance to conventional convolutional architectures. {{light_callout_end}}

How do Vision Transformers Work?

Transformer Foundation

To gain an understanding of how Vision Transformers operate, it is essential to understand the foundational concepts of the Transformer architecture like self-attention. Self-attention is a mechanism that allows the model to weigh the importance of different elements in a sequence when making predictions, leading to impressive results in various sequence-based tasks.

[Figure: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale]

Adapting the Transformer for Images

The concept of self-attention has been adapted for processing images with the use of Vision Transformers. Unlike text data, images are inherently two-dimensional, comprising pixels arranged in rows and columns. To address this challenge, ViTs convert images into sequences that can be processed by the Transformer.

Split an image into patches: The first step in processing an image with a Vision Transformer is to divide it into smaller, fixed-size patches. Each patch represents a local region of the image.
Flatten the patches: Within each patch, the pixel values are flattened into a single vector. This flattening process allows the model to treat image patches as sequential data.
Produce lower-dimensional linear embeddings: These flattened patch vectors are then projected into a lower-dimensional space using trainable linear transformations. This step reduces the dimensionality of the data while preserving important features.
Add positional encodings: To retain information about the spatial arrangement of the patches, positional encodings are added. These encodings help the model understand the relative positions of different patches in the image.
Feed the sequence into a Transformer encoder: The input to a standard Transformer encoder is the sequence of patch embeddings with the positional embeddings added. The encoder is composed of multiple layers, each containing two critical components: a multi-head self-attention (MSA) block, which computes the attention weights that determine how much each element of the input sequence contributes to the others, and a multi-layer perceptron (MLP) block. Before each block, layer normalization (LN) is applied to scale and center the data within the layer, which keeps training stable and efficient. During training, an optimizer adjusts the model's parameters (weights) in response to the loss computed at each training iteration.
Classification Token: To enable image classification, a special "classification token" is prepended to the sequence of patch embeddings. This token's state at the output of the Transformer encoder serves as the representation of the entire image.

Inductive Bias and ViT

It's important to note that Vision Transformers exhibit less image-specific inductive bias compared to CNNs. In CNNs, concepts such as locality, two-dimensional neighborhood structure, and translation equivariance are embedded into each layer throughout the model. However, ViTs rely on self-attention layers for global context and only use a two-dimensional neighborhood structure in the initial stages for patch extraction. This means that ViTs rely more on learning spatial relations from scratch, offering a different perspective on image understanding.

Hybrid Architecture

In addition to the use of raw image patches, ViTs also provide the option for a hybrid architecture.
With this approach, input sequences can be generated from feature maps extracted by a CNN. This level of flexibility allows practitioners to combine the strengths of CNNs and Transformers in a single model, offering further possibilities for optimizing performance.

{{light_callout_start}} The code for the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" and related projects is accessible on GitHub. The official implementation is written in JAX/Flax, and PyTorch and TensorFlow implementations are also available from the community. {{light_callout_end}}

Real-World Applications of Vision Transformers

Now that we have a solid understanding of what Vision Transformers are and how they work, let's explore their machine learning applications. These models have proven to be highly adaptable, thereby potentially transforming various computer vision tasks.

Image Classification

A primary application of Vision Transformers is image classification, where ViTs serve as powerful classifiers. They excel in categorizing images into predefined classes by learning intricate patterns and relationships within the image, driven by their self-attention mechanisms.

Object Detection

Object detection is another domain where Vision Transformers are making a significant impact. Detecting objects within an image involves not only classifying them but also precisely localizing their positions. ViTs, with their ability to preserve spatial information, are well-suited for this task. These models can identify objects and provide their coordinates, contributing to advancements in areas like autonomous driving and surveillance.

{{light_callout_start}} Read Object Detection: Models, Use Cases, Examples for more information. {{light_callout_end}}

Image Segmentation

Image segmentation, which involves dividing an image into meaningful segments or regions, benefits greatly from the capabilities of ViTs. These models can discern fine-grained details within an image and accurately delineate object boundaries.
This is particularly valuable in medical imaging, where precise segmentation can aid in diagnosing diseases and conditions. Action Recognition Vision Transformers are also making strides in action recognition, where the goal is to understand and classify human actions in videos. Their ability to capture temporal dependencies, coupled with their strong image processing capabilities, positions ViTs as contenders in this field. They can recognize complex actions in video sequences, impacting areas such as video surveillance and human-computer interaction. Multi-Modal Tasks ViTs are not limited to images alone. They are also applied in multi-modal tasks that involve combining visual and textual information. These models excel in tasks like visual grounding, where they link textual descriptions to corresponding image regions, as well as visual question answering and visual reasoning, where they interpret and respond to questions based on visual content. Transfer Learning One of the remarkable features of Vision Transformers is their ability to leverage pre-trained models for transfer learning. By pre-training on large datasets, ViT models learn rich visual representations that can be fine-tuned for specific tasks with relatively small datasets. This transfer learning capability significantly reduces the need for extensive labeled data, making ViTs practical for a wide range of applications. Vision Transformers: Key Takeaways Vision Transformers (ViTs) represent a transformative shift in computer vision, leveraging the power of self-attention from natural language processing to image understanding. Unlike traditional Convolutional Neural Networks (CNNs), ViTs process images by splitting them into patches, flattening those patches, and then applying a Transformer architecture to learn complex patterns and relationships. 
ViTs rely on self-attention mechanisms, enabling them to capture long-range dependencies and global context within images, a feature not typically found in CNNs. Vision Transformers have applications in various real-world tasks, including image classification tasks, object detection, image segmentation, action recognition, generative modeling, and multi-modal tasks. {{Training_data_CTA}}
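
The input pipeline described in "Adapting the Transformer for Images" (split into patches, flatten, project, prepend a classification token, add positional encodings) can be sketched in plain Python. This is a simplified illustration, not the paper's code: it assumes a single-channel image stored as a list of rows, uses randomly initialized weights as stand-ins for parameters that would be learned during training, and stops before the Transformer encoder itself. All function names are hypothetical.

```python
import random


def split_into_patches(image, patch_size):
    """Split a 2-D image (list of rows) into flattened patches, row by row."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ])
    return patches


def linear_projection(vector, weights):
    """Multiply a vector by a (len(vector) x embed_dim) weight matrix."""
    return [sum(x * w for x, w in zip(vector, column)) for column in zip(*weights)]


def build_token_sequence(image, patch_size, embed_dim, seed=0):
    """Turn an image into the token sequence a Transformer encoder would consume."""
    rng = random.Random(seed)
    patches = split_into_patches(image, patch_size)
    patch_dim = patch_size * patch_size

    # Random stand-ins for parameters that are learned in a real model.
    weights = [[rng.gauss(0, 0.02) for _ in range(embed_dim)] for _ in range(patch_dim)]
    class_token = [0.0] * embed_dim
    positions = [
        [rng.gauss(0, 0.02) for _ in range(embed_dim)]
        for _ in range(len(patches) + 1)
    ]

    # Prepend the classification token, then add one positional vector per token.
    tokens = [class_token] + [linear_projection(patch, weights) for patch in patches]
    return [
        [value + offset for value, offset in zip(token, position)]
        for token, position in zip(tokens, positions)
    ]


# An 8x8 toy image split into four 4x4 patches (16 values each), embedded in 6 dimensions.
image = [[row * 8 + col for col in range(8)] for row in range(8)]
sequence = build_token_sequence(image, patch_size=4, embed_dim=6)
print(len(sequence), len(sequence[0]))  # 5 6
```

The sequence has one more element than the number of patches because of the classification token. After the encoder, that token's output would be the vector passed to the classifier.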

September 11

5 min

What is Retrieval Augmented Generation (RAG)?

The large-scale adoption of Artificial Intelligence continues to have a transformative effect on the world. Foundation models, especially Large Language Models (LLMs) like OpenAI's GPT, have gained widespread attention and captivated the general public's imagination. Trained on a vast corpus of online data and possessing the ability to understand and output natural language, LLMs are challenging the very nature of intelligence and creativity.  Yet the precise mechanisms that make state-of-the-art LLMs so effective are also the source of their biggest flaw - their tendency to provide inaccurate, as well as out of date information. LLMs are prone to making things up, and as generative models, they don’t cite sources in their responses. {{light_callout_start}} “Language models are not search engines or databases. Hallucinations are unavoidable. What is annoying is that the models generate text with mistakes that is [sic] hard to spot.” - Adrian Tam, A Gentle Introduction to Hallucinations in Large Language Models {{light_callout_end}}  What is Retrieval Augmented Generation (RAG)?  Enter Retrieval Augmented Generation, known as RAG,  a framework promising to optimize generative AI and ensure its responses are up-to-date, relevant to the prompt, and most importantly, true. How does RAG work?  The main idea behind RAG is surprisingly simple; combining LLMs with a separate store of content outside of the language model containing sourced and up-to-date information for the LLM to consult before generating a response for its users. In other words, this approach merges information retrieval with text generation. To truly appreciate how this works, it's essential to delve into the realm of deep learning and understand how language models process our prompts and produce responses in natural language. LLMs generate responses based purely on the user’s input and skillful prompt engineering is vital for maximizing the accuracy of the generated responses. 
This input is turned into embeddings, which are numerical representations of concepts that allow the AI to compute the semantics of what the user is asking.

In the RAG framework, the system identifies relevant information in an external dataset after computing the embedding of the user's query. It performs a similarity search between the query and the external dataset, then augments the user's prompt with the relevant information it retrieved. Only then is the prompt sent to the LLM to generate an output for the user.

[Figure: Classic LLM (left) vs one using the RAG framework (right)]

What makes this framework so effective is that "external dataset" can mean any number of things. For example, these could be APIs, databases that are updated in real-time, or even open domains such as Wikipedia or GitHub.

Benefits of Retrieval Augmented Generation (RAG)

Combining the user's prompt with a separate store of information before generating an output has multiple benefits, not least that it allows the LLM to provide sources for the responses it provides.

"Classic" LLMs can only obtain new information during retraining, which is a very expensive and time-consuming process. However, the RAG framework overcomes this challenge by enabling real-time updates and new sources to be incorporated into the external dataset without having to re-train the entire model. This provides LLMs with valuable and specialized knowledge in addition to what's included in their initial training data. Studies have demonstrated that RAG models surpass non-RAG models across various metrics, including reduced susceptibility to hallucinations and increased accuracy in responses. They are also less likely to leak sensitive personal information.

Applications for Retrieval Augmented Generation (RAG)

RAG has a wide range of applications across all domains that require specialized on-demand knowledge.
Its applications include, but are not limited to:  Chatbots and AI assistants: RAG models can be leveraged to build advanced question-answering systems superior to classic retrieval-based chatbots. They can retrieve relevant information from a knowledge base and generate detailed, context-aware answers to user queries. The AI assistant found in our documentation is a perfect example of this. Education tools: RAG can be employed to develop educational tools that provide students with answers to questions, explanations, and additional context based on textbooks and reference materials. Legal Research and document review: Legal professionals can use RAG models to quickly search and summarize legal documents, statutes, and case law to aid in legal research and document review. Medical diagnosis and healthcare: In the healthcare domain, RAG models can help doctors and other medical professionals access the latest medical literature and clinical guidelines to assist in diagnosis and treatment recommendations. Language translation (with context): By considering the context from a knowledge base, RAG can assist in language translation tasks, resulting in more accurate translations that account for specific terminology or domain knowledge. Retrieval Augmented Generation (RAG): Summary RAG principles have been shown to reduce the frequency and severity of issues related to LLMs across a host of different metrics. The external knowledge sources that LLMs are given access to can vary and easily be kept up-to-date, providing the language models with sources as well as much-needed context for specific tasks and use cases. The retrieved information is then combined with the user's input to generate more accurate responses.
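The retrieve-then-augment flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular RAG system is implemented: the `DOCUMENTS` list, the bag-of-words `embed` function (a stand-in for a learned embedding model), and the prompt template are all hypothetical, and the final call to an LLM is left out.

```python
import math
import re
from collections import Counter

# Hypothetical external knowledge store; in practice this could be an API or a database.
DOCUMENTS = [
    "Retrieval Augmented Generation consults an external store before generating.",
    "Embeddings are numerical representations of concepts.",
    "Transfer learning reuses weights from a pre-trained model.",
]


def embed(text):
    """Toy embedding: word counts (a stand-in for a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=1):
    """Similarity search: return the top_k documents closest to the query."""
    query_vec = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, embed(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_augmented_prompt(query, documents, top_k=1):
    """Augment the user's prompt with retrieved, numbered sources."""
    context = retrieve(query, documents, top_k)
    sources = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(context))
    return (
        "Answer using only the sources below and cite them.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )
```

Calling `build_augmented_prompt("What does transfer learning reuse?", DOCUMENTS)` produces a prompt containing the third document as its only source. Because the sources are numbered inside the prompt, the model has something concrete to cite, which is the benefit the article highlights.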
Maintaining objectivity and accuracy in an online space rife with misinformation is extremely challenging, and since hallucinations are baked into the very fabric of how generative models work, it currently seems impossible to imagine a generative AI model that is 100% accurate. However, RAG reminds us that improvements in AI depend as much on well-designed frameworks as they do on advancements in technology, a point worth keeping in mind as we work on the next generation of deep learning technologies. {{Training_data_CTA}}

September 11

5 min

Encord Active 0.1.75 released: Kill Streamlit, Faster UI, and a Smoother Experience

At the Active Community, we are elated to announce the release of Encord Active 0.1.75, marking a significant milestone in our ongoing commitment to delivering unparalleled user experiences. This isn't just any update; we've made changes to redefine how you interact with our platform. Gone is Streamlit, paving the way for a more agile, quicker, and responsive UI.  As always, our primary objective is to ensure that you have the smoothest experience possible, and with this latest release, we've achieved just that. Discover the transformative features and improvements we've meticulously integrated into Encord Active 0.1.75! {{light_callout_start}} Encord Active provides a data-centric approach for improving model performance by helping you discover and correct erroneous labels through data exploration, model-assisted quality metrics, and one-click labeling integration. With Encord Active you can: Slice your visual data across metrics functions to identify data slices with low performance. Flag poor-performing slices and send them for review. Export your new data set and labels. Visually explore your data through interactive embeddings, precision/recall curves, and other advanced visualizations. Check out the project on GitHub, and hey, if you like it, leave us a 🌟🫡. {{light_callout_end}} Highlights of Major Features and Changes No more streamlit: New native UI At the heart of the Encord Active 0.1.75 release is the evolution of our user interface. While Streamlit served us well as the primary UI in our initial stages, we recognized its limitations, particularly for an open-source tool designed for scalability and production-level performance. From constraints like its numerous dependencies and limited potential for custom frontend components to a lack of Google  Colab integration, Streamlit posed challenges that hindered our vision. We took this as a cue to redesign and introduce a new native UI that's faster and offers a significantly smoother experience. 
By transitioning to a dedicated backend-frontend setup, we've eradicated previous complications and set the stage for a more performant Encord Active in future iterations. You'll now experience custom frontend components, seamless integration with Google Colab, a more responsive Explorer interface for delving deep into image datasets, enhanced usability, and swift loading times, a direct response to feedback from our community, who voiced concerns about sluggish interfaces with large datasets. By cutting ties with Streamlit and its inherent limitations, we have ushered in an era of increased speed and responsiveness, which is vital for effectively handling large computer vision datasets. With this release, Encord Active gets a completely new look and feel. We think that it is fresh enough to get a brand new command: `encord-active start`. The `start` command has now replaced the previous `visualize` command. Prediction import We’ve streamlined the prediction imports via the SDK. They follow the same fundamental structure, and the documentation should be clearer. 10x improvement when tagging large datasets We have supercharged data tagging efficiency, achieving a remarkable 10x performance boost when tagging large amounts of data at once. Encord Active can now handle large data batches in a single operation, which speeds up your workflow and makes data tagging lightning-fast. Deep Dive into Key Features Native UI While Streamlit was instrumental during our inception, its inherent challenges limited our scalability and adaptability. The all-new native UI in Encord Active 0.1.75 presents a clear, intuitive, responsive design built to serve our users' evolving needs. Direct Google Colab integration A significant advantage of moving away from Streamlit is the seamless integration with Google Colab. This feature paves the way for smoother workflows, especially for those using Google Colab for their data and ML tasks. No more `ngrok` or `nginx` integrations are required!
We have put together a notebook for you to test this out. Run it directly from this notebook. Responsive Explorer interface and a button to hide annotations Exploring large image datasets? Our revamped Explorer is designed to ensure you navigate your datasets with unparalleled ease and speed. We have also added a button you can toggle under the Explorer tab to show or hide annotations in your images. Custom frontend components These allow for a more tailored user experience, giving you the tools and views you need without the fluff. Bug Fixes Video predictions Importing predictions for videos had a bug that assigned predictions to the wrong frames in videos (and image groups). This is now resolved. Classification predictions We have also addressed a crucial issue in our latest release concerning classification predictions. You can now trust that your classification predictions will be imported accurately and seamlessly. Optimized data migrations We have optimized data migration processes to be more efficient. We've addressed the issue where object embeddings, a compute-intensive task, were unnecessarily calculated in certain scenarios. With this release, expect more streamlined migrations and reduced computational overhead. Docker file release and include `liggeos` In our previous releases, the Docker file was wrong, so the Docker version did not get released. We've rectified this oversight. With this fix, this release is now fully Docker-ready for smoother installations and deployments. We have also included `liggeos` in the Docker image during build when trying to set up a project. That fixes issue #598.  Got rid of the ` encord-active-components` package In our commitment to streamlining and simplifying, we've made a pivotal change in this release. We've eliminated the separate `encord-active-components` package, opting instead to directly distribute the build bundled with its essential components. 
This move ensures a more integrated and efficient deployment for you. Explorer: signed URLs from AWS displayed "empty" cards We've rectified an issue where signed URLs from AWS displayed "empty" cards in the explorer. Expect consistent and accurate data representation for your AWS-stored content. On Our Radar Big video projects We've seen the import process crash when importing projects with many/long videos (more than an hour of video in total). The issue is typically a lack of disk space from inflating videos into separate frames. We suggest using smaller projects with shorter videos for now. With one of the following releases, video support will be much more reliable and eliminate the need for inflating videos into frames. Project subsetting Project subsetting is slow. We’re working to make this work much faster. We’ve also noticed complications when projects came from a local import (via the `init` command or `import --coco` command). We’re working on fixing this before the next release. Filtering the “Explorer” by tags If you have added a filter on the Explorer that includes Data or Label tags and then remove tags from some of the shown items, the Explorer won’t remove the items immediately. A page refresh will, however, show the correct results. What's No Longer Available? Most of the features in previous versions of Encord Active are still there. Below, we’ve listed the features that are no longer available. Export to CSV and COCO file formats Prediction confusion matrix We plan to bring back the confusion matrix, and if you’re missing the export features, please let us know in the Active community. Community Contributions This release wouldn't have been possible without the feedback and contributions from our community. We'd like to extend our heartfelt gratitude to everyone who played a part, especially those who highlighted the challenges with Streamlit and pushed for improved UI responsiveness. Your voices were instrumental in shaping this release.  
{{light_callout_start}} Join our Active community for support, share your thoughts, and request features. {{light_callout_end}} Get the update now 🚀 `pip install --upgrade encord-active` See the releases (0.1.70 - 0.1.75) for more information. Check the documentation for a quick start guide. ⚠️ Remember to run `encord-active start` and not `encord-active visualize` in your project directory.

September 8

5 min

Product Updates [August 2023]

Improved Performance and Onboarding Focused on getting you up and running as quickly as possible, a simplified home page points out the essentials.  When navigating through annotation tasks, we are now loading image annotation tasks ahead of time to get one step closer to a native application experience. All the convenience of the cloud and the responsiveness of a desktop application showing that you can have your Encord cake and eat it too. A crucial early step, before you examine, annotate, or otherwise review your data, is connecting your cloud storage solution to Encord. Between multi-cloud, permissions and CORS it can be an arduous process. So — we’ve revamped the integration process into a guided step-by-step process in the app, and added the ability to confirm against storage resource URLs immediately after to give you complete and immediate feedback that your setup is in working order. Once you’ve onboarded your data integration, we’ve also made getting started with the SDK a one-step process. Simply generate your API key on the platform, and supply the file path of the downloaded private key when initiating your Encord SDK client. No copy-pasting and no fuss.  Require Annotations in Labeling Tasks Annotation tasks can be numerous and complex. In order to remove some of the burden of annotation and review, it can often be helpful to enforce that at least one of a particular annotation is present in a task. In addition to ongoing support for required specific nested attributes on objects and classifications, Encord has added support for ontology objects and classes to be required as well — enforcing that at least one of the indicated instance labels is present in a labeling task. We’ve paired this strict requirement enforcement with an improved issues drawer so annotators can quickly resolve outstanding issues and submit high quality annotation work.  
Enhanced Workflow Collaboration with Collaborator Router and Simplified Queues We’ve simplified and separated the workflows task queue interface from the data inspection features. For team members: move through your annotate and review tasks with greater clarity. For admins: cleaner separation between data review and task control interfaces will keep you focused on the goal. We’re also adding a collaborator router so that you can route annotation tasks based on who made the most recent annotation submission or review judgement. This is perfect for training newer annotation or review team members, or for setting up partnerships and collaborations within wider projects and annotation workflows.  DICOM Updates Improvements in the label editor functionality are often especially relevant for DICOM annotation workloads. For example, the polygon tool’s new ability to show measurement vertex angles in the label editor with DICOM annotations can add value to your labelling workloads. Head over to the DICOM Update Blog to see the details, and other DICOM updates for this month. Around the web and around the world Encord is here to keep you up-to-date on the rapidly evolving world of AI. Check out these explainers for a deep-dive into the inner-workings of the latest in AI. Meta AI's CoTracker: It is Better to Track Together FastViT: Hybrid Vision Transformer with Structural Reparameterization Meta AI’s Photorealistic Unreal Graphics (PUG) Thanks for reading. As always, the inbox at product@encord.com is open, and we’d love to hear your thoughts and feedback on the above! Talk to you soon,

September 6

3 min

DICOM Updates [August 2023]

Require top-level ontology categories This month, we are introducing the highly-anticipated top-level required feature, designed to empower you with greater control and precision in your annotation projects. With this feature, you can now define and prioritize the most critical requirements of your DICOM annotation tasks, ensuring that your team's efforts are focused on capturing the key insights that matter most. Have a look at our documentation for further guidance on how to use it.  📊 Introducing Advanced Measurement Features! 📏 Say hello to the brand-new measurement feature, designed to precisely quantify angles and areas. Whether it's evaluating the size of a tumor, determining the angle of a joint, or gauging the extent of a lesion, DICOM's new measurement feature empowers medical experts to make more informed decisions. Upgraded DICOM de-identification With the increasing importance of data privacy and compliance in healthcare, we understand the challenges you face when handling medical imaging data. Our upgraded DICOM de-identification service offers a comprehensive solution to de-identify DICOM files swiftly and securely, and offers integration with customisable reviewer workflows. Protect patient privacy and ensure regulatory compliance effortlessly with our state-of-the-art de-identification technology. Seamlessly remove all sensitive information from DICOM metadata as well as from pixel data while preserving the integrity of the data. We are excited as ever to receive your feedback on the latest and upcoming updates. Please feel free to contact us at product@encord.com if you have any thoughts or ideas on how we can enhance your experience with us. We eagerly await hearing from you.

September 6

2 min

Guide to Transfer Learning

Transfer learning has become an essential technique in the artificial intelligence (AI) domain due to the emergence of deep learning and the availability of large-scale datasets. This comprehensive guide will discuss the fundamentals of transfer learning, explore its various types, and provide step-by-step instructions for implementing it. We’ll also address the challenges and practical applications of transfer learning. {{product_sam_cta}} What is Transfer Learning? In machine learning, a model's knowledge resides in its trained weights and biases. These weights are generated after extensive training over a comprehensive training dataset and help the model understand data patterns for the targeted problem. Transfer learning is a technique in which the weights of a model pre-trained on an upstream AI task are reused in another AI model to achieve strong performance on a similar downstream task using a smaller task-specific dataset. In other words, it leverages knowledge gained from solving one task to improve the performance of a related but different task. Since the model already has some knowledge related to the new task, it can learn well from a smaller dataset using fewer training epochs. Intuitive Examples Of Transfer Learning Transfer learning has applications in numerous deep learning projects, such as computer vision tasks like object detection or natural language processing tasks like sentiment analysis. For example, an image classification model trained to recognize cats can be fine-tuned to classify dogs. Since both animals have similar features, the weights from the cat classifier can be fine-tuned to create a high-performing dog classifier. Pre-trained Models Rather than starting a new task from scratch, pre-trained models capture patterns and representations from the training data, providing a foundation that can be leveraged for various tasks.
Usually, these models are deep neural networks trained on large datasets, such as the ImageNet dataset for image-related tasks or TriviaQA for natural language processing tasks. Through training, the model acquires a thorough understanding of features, feature representations, hierarchies, and relationships within the data. The Spectrum of Pre-training Methods Several popular pre-trained architectures have epitomized the essence of transfer learning across domains. These include: VGG (Visual Geometry Group), a convolutional neural network architecture widely recognized for its straightforward design and remarkable effectiveness in image classification. Its architecture is defined by stacking convolutional layers with small filters. Its best-known variants are VGG16 and VGG19, named for their 16 and 19 weight layers. ResNet (Residual Network), a convolutional neural network architecture that addresses the vanishing gradient problem using skip connections, enabling the training of very deep networks. It excels in image classification and object detection tasks. BERT (Bidirectional Encoder Representations from Transformers), a pre-trained NLP model that can understand the context from both directions in a text sequence. Its proficiency in contextual understanding is used in various language-related tasks, such as text classification, sentiment analysis, and more. InceptionV3, a deep learning model based on the CNN architecture. It is widely used for image classification and computer vision tasks. It is a variant of the original GoogLeNet architecture known for its "inception" modules that allow it to capture information at multiple scales and levels of abstraction. Using prior knowledge of images gained during pre-training, InceptionV3's features can be adapted to perform well on narrower, more specialized tasks.
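The skip connections mentioned for ResNet above can be illustrated with a deliberately tiny sketch. It assumes small dense layers written in plain Python instead of the convolutions and batch normalization used in real ResNets, and every name in it (`dense`, `residual_block`) is hypothetical. The point is only to show the input being added back onto the transformed signal.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def dense(weights, bias, x):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def residual_block(x, w1, b1, w2, b2):
    """Return relu(F(x) + x), where F is two dense layers with a ReLU in between."""
    hidden = relu(dense(w1, b1, x))
    transformed = dense(w2, b2, hidden)
    # Skip connection: add the unchanged input back onto the transformed signal.
    return relu([t + xi for t, xi in zip(transformed, x)])
```

If all weights and biases are zero, the block returns its (non-negative) input unchanged, so the signal and its gradient still have a direct path through the layer. That identity path is what makes very deep stacks of such blocks trainable.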
Transferable Knowledge In transfer learning, transferable knowledge serves as the foundation that enables a model's expertise in one area to enhance its performance in another. Throughout the training process, a model accumulates insights that are either domain-specific or generic. Domain-specific knowledge is relevant to a particular field, like medical imaging. Conversely, generic knowledge covers more universal patterns that apply across domains, such as recognizing shapes or sentiments. Transferable knowledge can be categorized into two types: low-level features and high-level semantics. Low-level features encompass basic patterns like edges or textures, which are useful across many tasks. High-level semantics, on the other hand, delve into the meaning behind patterns and relationships, making them valuable for tasks requiring context-understanding. Task Similarity & Domains Understanding task similarity is critical to choosing an effective transfer learning approach (fine-tuning or feature extraction) and to deciding whether to transfer knowledge within the same domain or bridge gaps across diverse domains. Fine-tuning vs. Feature Extraction: When reusing pre-trained models, there are two main strategies to enhance model performance: fine-tuning and feature extraction. Fine-tuning involves adjusting the pre-trained model's parameters while retaining its learned features. For specific fine-tuning tasks, a dense layer is added to the pre-trained layers to customize the model's outputs and minimize the loss on the new task, aligning them with the specific outcomes needed for the target task. On the other hand, feature extraction involves extracting the embeddings from the final layer or multiple layers of a pre-trained model. The extracted features are fed into a new model designed for the specific task to achieve better results. Usually, feature extraction does not modify the original network structure.
It simply computes features from the training data that are leveraged for downstream tasks. Same-domain vs. Cross-domain Transfer: Transfer learning can work within the same domain or across different domains. In same-domain transfer, the source and target tasks are closely related, like recognizing different car models within the automotive domain. Cross-domain transfer involves applying knowledge from a source domain to an unrelated target domain, such as using image recognition expertise from art to enhance medical image analysis. Types of Transfer Learning  Transfer learning can be categorized into different types based on the context in which knowledge is transferred. These types offer insights into how models reuse their learned features to excel in new situations. Categorizations of Transfer Learning Let’s discuss two common types of transfer learning. Inductive Transfer Learning Inductive transfer learning is a technique used when  labeled data is consistent across the source and target domains, but the tasks undertaken by the models are distinct. It involves transferring knowledge across tasks or domains. When transferring across tasks, a model's understanding from one task aids in solving a different yet related task. For instance, using a model trained on image classification improves object detection performance. Transferring across domains extends this concept to different datasets. For instance, a model initially trained on photos of animals can be fine-tuned for medical image analysis. Transductive Transfer Learning In transductive learning, the model has encountered training and testing data beforehand.  Learning from the familiar training dataset, transductive learning makes predictions on the testing dataset. While the labels for the testing dataset might be unknown, the model uses its learned patterns to navigate the prediction process. 
Transductive transfer learning is applied to scenarios where the domains of the source and target tasks share a strong resemblance but are not precisely the same. Consider a model trained to classify different types of flowers from labeled images (source domain). The target task is identifying flowers in artistic paintings without labels (target domain). Here, the model's learned flower recognition abilities from labeled images are used to predict the types of flowers depicted in the paintings. How to Implement Transfer Learning Transfer learning is a nuanced process that requires deliberate planning, strategic choices, and meticulous adjustments. By piecing together the appropriate strategy and components, practitioners can effectively harness the power of transfer learning. Given a pre-trained model, here are detailed steps for transfer learning implementation. Learning Process of Transfer Learning Dataset Preparation In transfer learning, dataset preparation includes data collection and preprocessing for the target domain. Practitioners acquire labeled data for the target domain. Even though the tasks may differ, the fine-tuning training data should have similar characteristics to the source domain. During data preprocessing, employing techniques like data augmentation can significantly enhance the model's performance. {{light_callout_start}} If you want to learn more about data preprocessing, read our detailed blog on Mastering Data Cleaning & Data Preprocessing. {{light_callout_end}}  Model Selection & Architecture The process of model selection and architecture design sets the foundation for successful transfer learning. It involves choosing a suitable pre-trained model and intricately adjusting it to align with the downstream task. Deep learning models like VGG, ResNet, and BERT offer a solid foundation to build upon. 
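As a rough illustration of the feature-extraction strategy described earlier, the sketch below keeps a "pre-trained" base fixed and trains only a new classification head on top of it. Everything here is an assumption made for illustration: `FROZEN_BASE_WEIGHTS` stands in for weights that would normally be loaded from a real checkpoint such as VGG or ResNet, and the head is a plain logistic regression trained with stochastic gradient descent.

```python
import math

# Stand-in for pre-trained weights; a real base would be loaded from a checkpoint.
FROZEN_BASE_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def frozen_features(x):
    """Frozen base: a fixed linear layer followed by ReLU. Never updated."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)))
        for row in FROZEN_BASE_WEIGHTS
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(inputs, labels, learning_rate=0.5, epochs=200):
    """Train only a new logistic-regression head on the frozen features."""
    features = [frozen_features(x) for x in inputs]
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            pred = sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias)
            error = pred - y  # gradient of the log loss with respect to the logit
            weights = [w - learning_rate * error * fi for w, fi in zip(weights, f)]
            bias -= learning_rate * error
    return weights, bias


def predict(x, weights, bias):
    """Classify an input by passing it through the frozen base, then the head."""
    f = frozen_features(x)
    score = sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias)
    return 1 if score >= 0.5 else 0
```

For example, training on `[[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]` with labels `[1, 1, 0, 0]` updates only the head's weights and bias; the base weights are left untouched. Full fine-tuning would instead also update the base, which needs more target data to avoid overfitting.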
Freeze the pre-trained layers of the chosen model to build a base model for the downstream task that captures the general features of the source domain. Then, add new layers on top of the base model to learn task-specific features. Transfer Strategy Transfer learning requires finding the right path to adapt a model's knowledge. Here are three distinct strategies to consider, tailored to different scenarios and data availability. Full Fine-tuning: This approach uses the target data to conduct fine-tuning across the entire model. It's effective when a considerable amount of labeled training data is available for the target task. Layer-wise Fine-tuning: It involves fine-tuning only specific layers to adapt the pre-trained model's expertise. This strategy is appropriate when target data is limited. Feature Extraction: It involves holding the pre-trained layers constant and extracting their learned features. A new model is then trained on these features for the downstream task. This method works well when the target dataset is small, because the new model capitalizes on the pre-trained layers' general knowledge. Hyperparameter Tuning Hyperparameter tuning refines the model's performance. These adjustable settings are pivotal in how the model learns and generalizes from data. Here are the key hyperparameters to focus on during transfer learning: Learning Rate: Tune the learning rate for the fine-tuning stage to determine how quickly the model updates its weights by learning from the downstream training data. Batch Size: Adjust the batch size to balance fast convergence and memory efficiency. Experiment to find the sweet spot. Regularization Techniques: Apply regularization methods like dropout or weight decay to prevent overfitting and improve model generalization. {{light_callout_start}} If you want to learn more about fine-tuning, read our detailed guide on Fine-tuning Models: Hyperparameter Optimization.
{{light_callout_end}}  Training & Evaluation Modify the output layer according to the chosen transfer strategy, compile the downstream model, and then train it on the target data. Keep a watchful eye on loss and accuracy as the model learns. Select evaluation metrics that align with the downstream task's objectives. For instance, model accuracy is the usual go-to metric for classification tasks, while the F1 score is preferred for imbalanced datasets. Ensure the model's capabilities are validated on a held-out validation set, providing a fair assessment of its readiness for real-world challenges. Practical Applications of Transfer Learning Transfer learning offers practical applications in many industries, fueling innovation across AI tasks. Let's delve into some real-world applications where transfer learning has made a tangible difference: Autonomous Vehicles The autonomous vehicles industry benefits immensely from transfer learning. Models trained to recognize objects, pedestrians, and road signs from vast datasets can be fine-tuned to suit specific driving environments. For instance, a model originally developed for urban settings can be adapted to navigate rural roads with minimal data. Waymo, a prominent player in autonomous vehicles, uses transfer learning to enhance its vehicles' perception capabilities across various conditions. Healthcare Diagnostics AI applications in the healthcare domain use transfer learning to streamline medical processes and enhance patient care. One notable use is interpreting medical images such as X-rays, MRIs, and CT scans. Pre-trained models can be fine-tuned to detect anomalies or specific conditions, expediting diagnoses. By leveraging knowledge from existing patient data, models can forecast disease progression and tailor treatment plans. This proves especially valuable in personalized medicine.
Moreover, transfer learning aids in extracting insights from vast medical texts, helping researchers stay updated with the latest findings and enabling faster discoveries. The importance of transfer learning is evident in a recent study regarding its use in COVID-19 detection from chest X-ray images. The experiment proposed using a pre-trained network (ResNet50) to identify COVID-19 cases. By repurposing the network's expertise, the model provided swift COVID diagnosis with 96% performance accuracy, demonstrating how transfer learning algorithms accelerate medical advancements. {{medical_CTA}} Gaming In game development, pre-trained models can be repurposed to generate characters, landscapes, or animations. Reinforcement learning models can use transfer learning capabilities to initialize agents with pre-trained policies, accelerating the learning process. For example, OpenAI's Dota 2 bot, OpenAI Five, blends reinforcement and transfer learning to master complex real-time gaming scenarios. System Overview of Dota 2 with Large-Scale Deep Reinforcement Learning E-commerce In e-commerce, recommendations based on user behavior and preferences can be optimized using transfer learning from similar user interactions. Models trained on extensive purchasing patterns can be fine-tuned to adapt to specific user segments. Moreover, NLP techniques like Word2Vec's pre-trained word embeddings enable e-commerce platforms to transfer knowledge from large text corpora effectively. This enhances their understanding of customer feedback and enables them to tailor strategies that enhance the shopping experience. Amazon, for instance, tailors product recommendations to individual customers through the transfer learning technique. Cross-lingual Translations The availability of extensive training data predominantly biased toward the English language creates a disparity in translation capabilities across languages. 
Transfer learning bridges this gap and enables effective cross-lingual translations. Large-scale pre-trained language models can be fine-tuned to other languages with limited training data. Transfer learning mitigates the need for vast language-specific datasets by transferring language characteristics from English language datasets. For example, Google's Multilingual Neural Machine Translation system, Google Translate, leverages transfer learning to provide cross-lingual translations. This system employs a shared encoder for multiple languages, utilizing pre-trained models on extensive English language datasets. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Limitations of Transfer Learning  While transfer learning enables knowledge sharing, it's essential to acknowledge its limitations. These challenges offer deeper insights to data scientists about areas that demand further attention and innovation. Here are several areas where transfer learning shows limitations: Dataset Bias & Mismatch Transfer learning's effectiveness hinges on the similarity between the source and target domains. If the source data doesn't adequately represent the target domain, models might struggle to adapt accurately. This dataset mismatch can lead to degraded performance, as the model inherits biases or assumptions from the source domain that do not apply to the target domain. {{light_callout_start}} If you want to learn more about reducing bias in machine learning, read our detailed blog on How To Mitigate Bias in Machine Learning Models. {{light_callout_end}} Overfitting & Generalization Despite its prowess, transfer learning is not immune to overfitting. When transferring knowledge from a vastly different domain, models might over-adapt to the nuances of the source data, resulting in poor generalization to the target task. 
Striking the right balance using learned features and not overemphasizing source domain characteristics is a persistent challenge. Catastrophic Forgetting Models mastering a new task may inadvertently lose proficiency in the original task. This phenomenon, known as catastrophic forgetting, occurs when sequential retraining for a new task overrides previously acquired knowledge. The new data changes the knowledge-heavy, pre-trained weights of the model, causing the model to lose prior knowledge. Balancing the preservation of existing expertise while acquiring new skills is crucial, particularly in continual learning scenarios. Ethical & Privacy Concerns The emergence of transfer learning has raised ethical questions regarding the origin and fairness of the source data. Fine-tuned models inheriting biases or sensitive information from source domains might perpetuate inequalities or breach privacy boundaries. Ensuring models are ethically trained and the transfer process adheres to privacy norms is an ongoing challenge. Advanced Topics in Transfer Learning As transfer learning advances, it ventures into uncharted territories with various advanced techniques that redefine its capabilities. These innovative methods revolutionize the process of transferring knowledge across domains, enriching model performance and adaptability. Here's a glimpse into some of the advanced topics in transfer learning: Domain Adaptation Techniques Domain adaptation is a critical aspect of transfer learning that addresses the challenge of applying models trained on one domain to perform well in another related domain. Here are two domain adaptation techniques: Self-training: Self-training iteratively labels unlabeled target domain data using the model's predictions. For example, training a sentiment analysis model using labeled data for positive and negative sentiment but unlabeled data for neutral sentiment. 
The model starts by making predictions on the neutral data and then uses them as "pseudo-labels" to fine-tune itself on the neutral sentiment, gradually improving its performance in this class. Basic Iterative Self-training Pipeline Adversarial Training: Adversarial training pits two models against each other – one adapts to the target domain, while the other attempts to distinguish between source and target data. This sharpens the model's skills in adapting to new domains. Adversarial training also plays a crucial role in strengthening models against adversarial attacks. Exposing the model to these adversarial inputs during training teaches them to recognize and resist such attacks in real-world scenarios. Zero-shot & Few-shot Learning Zero-shot learning involves training a model to recognize classes it has never seen during training, making predictions with no direct examples of those classes. Conversely, few-shot learning empowers a model to generalize from a few examples per class, allowing it to learn and make accurate predictions with minimal training data. Other learning strategies include one-shot learning and meta-learning. With one example per class, one-shot learning replicates the human ability to learn from a single instance. For example, training a model to identify rare plant species using just one image of each species. On the other hand, meta-learning involves training the model on a range of tasks, facilitating its swift transition to novel tasks with minimal data. Consider a model trained on various tasks, such as classifying animals, objects, and text sentiments. When given a new task, like identifying different types of trees, the model adapts swiftly due to its exposure to diverse tasks during meta-training. Multi-modal Transfer Learning Multi-modal transfer learning involves training models to process and understand information from different modalities, such as text, images, audio, and more. 
These techniques elevate models to become versatile communicators across different sensory domains. Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models Two prominent types of multi-modal transfer learning are: Image-Text Transfer: This type of transfer learning uses text and visual information to generate outcomes. It is most appropriate for image captioning tasks. Audio-Visual Transfer: Audio-visual transfer learning enables tasks like recognizing objects through sound. This multi-sensory approach enriches the model's understanding and proficiency in decoding complex audio information. Future Trends in Transfer Learning The transfer learning landscape is transformative, with trends set to redefine how models adapt and specialize across various domains. These new directions offer a glimpse into the exciting future of knowledge transfer. Continual Learning & Lifelong Adaptation The future of transfer learning lies in models that continuously evolve to tackle new challenges. Continual learning involves training models on tasks over time, allowing them to retain knowledge and adapt to new tasks without forgetting what they've learned before. This lifelong adaptation reflects how humans learn and specialize over their lifetimes. As models become more sophisticated, the ability to learn from a constant stream of tasks promises to make them even more intelligent and versatile. Federated Transfer Learning Imagine a decentralized network of models collaborating to enhance each other's knowledge. Federated transfer learning envisions models distributed across different devices and locations, collectively learning from their local data while sharing global knowledge. This approach respects privacy, as sensitive data remains local while still benefiting from the network's collective intelligence. 
Federated learning's synergy with transfer learning can democratize AI by enabling models to improve without centralizing data. Improved Pre-training Strategies Pre-training, a key element of transfer learning, is expected to become even more effective and efficient. Models will likely become adept at learning from fewer examples and faster convergence. Innovations in unsupervised pre-training can unlock latent patterns in data, leading to better transfer performance.  Techniques like self-supervised learning, where models learn from the data without human-labeled annotations, can further refine pre-training strategies, enabling models to grasp complex features from raw data. Ethical & Fair Transfer Learning The ethical dimension of transfer learning gains importance as models become more integral to decision-making. Future trends will focus on developing fair and unbiased transfer learning methods, ensuring that models don't perpetuate biases in the source data. Techniques that enable models to adapt while preserving fairness and avoiding discrimination will be crucial in building AI systems that are ethical, transparent, and accountable. Transfer Learning: Key Takeaways  Transfer learning is a dynamic ML technique that leverages pre-trained models to develop new models, saving time and resources while boosting performance. Transfer learning has proven its versatility, from its role in accelerating model training, enhancing performance, and reducing data requirements to its practical applications across industries like healthcare, gaming, and language translation. In transfer learning, it is vital to carefully select pre-trained models, understand the nuances of different transfer strategies, and navigate the limitations and ethical considerations of this approach. Techniques like domain adaptation, zero-shot learning, meta-learning, and multi-modal transfer learning offer more depth in the transfer learning domain. 
The future of transfer learning promises advanced federated techniques, continual learning, fair adaptation, and improved pre-training strategies. {{try_encord}}

September 5

7 min

Inter-rater Reliability: Definition, Examples, Calculation

Inter-rater reliability is crucial in research and clinical settings. It measures the agreement between two or more raters or observers when assessing subjects. This metric ensures that the data collected is consistent and reliable, regardless of who collects or analyzes it. The significance of inter-rater reliability cannot be overstated, especially when the consistency between observers, raters, or coders is paramount to the validity of the study or assessment. Inter-rater reliability refers to the extent to which different raters or observers give consistent estimates of the same phenomenon. It is a measure of consistency or agreement between two or more raters. On the other hand, intra-rater reliability measures the consistency of ratings given by a single rater over different instances or over time. In research, inter-rater reliability is pivotal in ensuring the validity and reliability of study results. In qualitative research, where subjective judgments are often required, having a high degree of inter-rater reliability ensures that the findings are not merely the result of one individual's perspective or bias. Instead, it confirms that multiple experts view the data or results similarly, adding credibility to the findings.1 Moreover, in studies where multiple observers are involved, inter-rater reliability helps standardize the observations, ensuring that the study's outcomes are not skewed due to the variability in observations. Methods to Measure Inter-rater Reliability Inter-rater reliability, often called IRR, is a crucial statistical measure in research, especially when multiple raters or observers are involved. It assesses the degree of agreement among raters, ensuring consistency and reliability in the data collected. Various statistical methods have been developed to measure it, each with unique advantages and applications.1 Cohen's Kappa Cohen's Kappa is a widely recognized statistical method used to measure the agreement between two raters. 
It considers the possibility of the agreement occurring by chance, providing a more accurate measure than a simple percentage agreement. The Kappa statistic ranges from -1 to 1, where 1 indicates perfect agreement, 0 suggests no better agreement than chance, and -1 indicates complete disagreement.2 The formula for calculating Cohen's Kappa is: \( \kappa = \frac{p_o - p_e}{1 - p_e} \) Where: \( p_o \) is the observed proportion of agreement and \( p_e \) is the proportion of agreement expected by chance. Using Cohen's Kappa is essential when the data is categorical and raters may agree by chance. It provides a more nuanced understanding of the reliability of raters. Intraclass Correlation Coefficient (ICC) The Intraclass Correlation Coefficient, commonly known as ICC, is another method used to measure the reliability of measurements made by different raters. It is particularly useful when the measurements are continuous rather than categorical. ICC values range between 0 and 1, with values closer to 1 indicating higher reliability. One of the main differences between ICC and Cohen's Kappa is their application. While Cohen's Kappa is best suited for categorical data, ICC is ideal for continuous data. Additionally, ICC can be used for more than two raters, making it versatile in various research settings. Percentage Agreement Percentage agreement is the simplest method to measure inter-rater reliability. It calculates the proportion of times the raters agree without considering the possibility of chance agreement. While it's straightforward to compute, it doesn't provide as nuanced a picture as methods like Cohen's Kappa or ICC. For instance, if two raters agree 85% of the time, the percentage agreement is 85%. However, this method doesn't account for agreements that might have occurred by chance, making it less robust than other methods. Despite its simplicity, it is essential to be cautious when using percentage agreement, especially when the stakes are high, as it might provide an inflated sense of reliability. 
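To make the contrast between percentage agreement and Cohen's Kappa concrete, here is a minimal Python sketch. The function names and the ten example ratings are hypothetical and chosen only for illustration; the code assumes two raters who each assign exactly one categorical label per item, and it estimates chance agreement from each rater's own label frequencies.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of items on which the two raters gave the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same non-empty set of items.")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)  # observed agreement

    # Chance agreement: for each category, multiply the share of items
    # each rater assigned to it, then sum over categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical category for every item,
        # so the statistic is undefined (division by zero).
        raise ValueError("Kappa is undefined when chance agreement equals 1.")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of ten items by two raters.
rater_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"]
rater_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"Percentage agreement: {agreement:.2f}")
print(f"Cohen's Kappa: {kappa:.2f}")
```

In this made-up example the raters agree on 8 of 10 items, but because each says "yes" 60% of the time, about half of that agreement would be expected by chance, so the Kappa value comes out noticeably lower than the raw percentage.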
Factors Affecting Inter-rater Reliability Inter-rater reliability (IRR) is a crucial metric in research methodologies, especially when data collection involves multiple raters. It quantifies the degree of agreement among raters, ensuring that the data set remains consistent across different individuals. However, achieving a high IRR, let alone perfect agreement, is difficult. Several factors can influence the consistency between raters, and comprehending these can aid in enhancing the reliability measures of the data. Rater Training One of the most important factors affecting IRR is the training of raters. Proper training can significantly reduce variability and increase the coefficient of inter-rater agreement. For instance, in Krippendorff's (2011) study, raters trained using a specific methodology exhibited a Cohen’s Kappa value of 0.85, indicating a high level of agreement, compared to untrained raters with a kappa value of just 0.5.4 Training ensures that all raters understand the rating scale and the criteria they are evaluating against. For example, in clinical diagnoses, raters can be trained using mock sessions where they are presented with sample patient data. Feedback sessions after these mock ratings can pinpoint areas of disagreement, offering a chance to elucidate and refine the methodology. {{light_callout_start}} Training and clear guidelines are not just best practices; they're essential. They bridge the gap between subjective judgments and objective evaluations, ensuring research remains unbiased and true to its purpose. {{light_callout_end}} Clarity of Definitions The clarity of definitions in the rating process is pivotal. Providing raters with unambiguous definitions, such as elucidating the difference between intra-rater and inter-rater reliability or explaining terms like "percent agreement" versus "chance agreement," ensures consistency. 
For example, in a research method involving the assessment of academic papers, if "originality" isn't clearly defined, raters might have divergent interpretations. A clear definition of terms in a study involving Krippendorff’s alpha as a reliability measure increased the alpha value from 0.6 to 0.9, indicating a higher degree of agreement.5 Similarly, defining the time frame between tests can lead to more consistent results in test-retest reliability assessments. Subjectivity in Ratings Subjectivity, especially in ordinal data, can significantly impede achieving a high IRR. For instance, in a data collection process involving movie reviews, two raters might have different thresholds for what constitutes a "good" film, leading to varied ratings. A Pearson correlation study found that when raters were given a clear guideline, the coefficient increased by 20%.6 To curtail subjectivity, it's imperative to have explicit guidelines. Tools like Excel for data analysis can help visualize areas of high variability. Moreover, employing reliability estimates like Fleiss Kappa or Cronbach's alpha can provide a clearer picture of the degree of agreement. For instance, a Fleiss Kappa value closer to 1 indicates high inter-rater reliability. While tools like the kappa statistic, intra-class correlation coefficient, and observed agreement offer quantifiable metrics, the foundation of high IRR lies in rigorous training, precise definitions, and minimizing subjectivity. Practical Applications and Examples of Inter-rater Reliability Inter-rater reliability (IRR) is used in various research methods to ensure that multiple raters or observers maintain consistency in their assessments. This measure, often quantified using metrics such as Cohen’s Kappa or the intra-class correlation coefficient, is paramount when subjective judgments are involved. Let's explore the tangible applications of inter-rater reliability across diverse domains. Clinical Settings In clinical research, IRR is indispensable. 
Consider a scenario where a large-scale clinical trial is underway. Multiple clinicians collect data, assessing patient responses to a new drug. Here, the level of agreement among raters becomes critical. The trial's integrity is compromised if one clinician records a side effect while another overlooks it. In such settings, metrics like Fleiss Kappa or Pearson's correlation can quantify the degree of agreement among raters, ensuring that the data set remains consistent.7 Furthermore, in diagnoses, the stakes are even higher. A study revealed that when two radiologists interpreted the same X-rays without a standardized rating scale, their diagnoses had a variability of 15%. However, clear guidelines and training reduced the variability to just 3%, showcasing the power of high inter-rater reliability in clinical settings.8 {{medical_CTA}} Social Sciences Social sciences, with their inherent subjectivity, lean heavily on IRR. In a study exploring workplace dynamics in English corporate culture, multiple researchers conducted observational studies. Using tools like Excel for data analysis, the researchers found that the observed agreement among raters was a mere 60% without established guidelines. However, post-training and with clear definitions, the agreement soared to 90%, as measured by Krippendorff’s alpha.9 Education Education, a sector shaping future generations, cannot afford inconsistencies. Consider grading, a process fraught with subjectivity. In a study involving multiple teachers grading the same set of papers, the initial score variability was 20%. However, after a rigorous training session and with a standardized rating scale, the variability plummeted to just 5%.10 Standardized tests, the gateways to numerous opportunities, rely especially heavily on IRR. A disparity in grading can alter a student's future. 
For instance, a test-retest reliability study found that scores varied by as much as 15 points on a 100-point scale without ensuring inter-rater agreement. Such inconsistencies can be the difference between a student getting their dream opportunity and missing out.10 Inter-rater reliability, quantified using metrics like the kappa statistic, Cronbach's alpha, or the intra-rater reliability measure, is non-negotiable across domains. Whether it's clinical trials, anthropological studies, or educational assessments, ensuring consistency among raters is not just a statistical necessity; it's an ethical one. Inter-rater Reliability: Key Takeaways Inter-rater reliability (IRR) is a cornerstone in various research domains, ensuring that evaluations, whether from clinical diagnoses, academic assessments, or qualitative studies, are consistent across different raters. Its significance cannot be overstated, as it safeguards the integrity of research findings and ensures that subjective judgments don't skew results. IRR is a litmus test for data reliability, especially when multiple observers or raters are involved. The call to action for researchers is clear: rigorous training and comprehensive guidelines for raters are non-negotiable. Ensuring that raters are well-equipped, both in terms of knowledge and tools, is paramount. It's not just about achieving consistent results; it's about upholding the sanctity of the research process and ensuring that findings are valid and reliable. Future Directions As we look ahead, the landscape of inter-rater reliability is poised for evolution. With technological advancements, there's potential for more sophisticated methods to measure and ensure IRR. Software solutions equipped with artificial intelligence and machine learning capabilities might soon offer tools that can assist in training raters, providing real-time feedback, and even predicting areas of potential disagreement. 
Moreover, as research methodologies become more intricate, the role of technology in aiding the process of ensuring IRR will undoubtedly grow. The future holds promise, from virtual reality-based training modules for raters to advanced statistical tools that can analyze inter-rater discrepancies in real time. For researchers and professionals alike, staying abreast of these advancements will ensure their work remains at the forefront of reliability and validity. In conclusion, while the principles of inter-rater reliability remain steadfast, the tools and methods to achieve it are ever-evolving, promising a future where consistency in evaluations is not just hoped for but assured. {{try_encord}}

September 1

5 min

Meta AI's CoTracker: It is Better to Track Together for Video Motion Prediction

In deep learning, establishing point correspondences in videos is a fundamental challenge with broad applications. Accurate video motion prediction is crucial for various downstream machine learning tasks, such as object tracking, action recognition, and scene understanding. To address the complexities associated with this task, Meta AI introduces "CoTracker," a cutting-edge architecture designed to revolutionize video motion estimation. CoTracker: It is Better to Track Together Video Motion Estimation Video motion estimation involves predicting the movement of points across frames in a video sequence. Traditionally, two main approaches have been used: optical flow and tracking algorithms. Optical flow estimates the velocity of points within a video frame, while tracking focuses on estimating the motion of individual points over an extended period. While both approaches have their strengths, they often overlook the strong correlations between points, particularly when points belong to the same physical object. These correlations are crucial for accurate motion prediction, especially when dealing with occlusions and complex scene dynamics. Video motion estimation has many practical applications in artificial intelligence, enabling enhanced visual understanding and interaction. In surveillance, it aids in object detection and anomaly detection. In filmmaking and entertainment, it drives special effects and scene transitions. In robotics and automation, it enhances robotic movement and task execution. Autonomous vehicles utilize it for environment perception and navigation. Medical imaging can benefit from motion-compensated diagnostics. Virtual reality benefits from realistic movement portrayal. Video compression and streaming utilize motion estimation for efficient data transmission. 
Co-Tracker: Architecture Meta AI has introduced "CoTracker," an innovative architecture that enhances video motion prediction by jointly tracking multiple points throughout an entire video sequence. CoTracker is built on the foundation of the transformer network, a powerful and flexible neural architecture that has demonstrated success in various natural language processing and computer vision tasks. The key innovation of CoTracker is its ability to leverage both time and group attention blocks within the transformer architecture. By interleaving these attention blocks, CoTracker achieves a more comprehensive understanding of motion dynamics and correlations between points. This design enables CoTracker to overcome the limitations of traditional methods that focus on tracking points independently, thus unlocking a new era of accuracy and performance in video motion prediction. CoTracker: It is Better to Track Together Transformer Formulation The Co-Tracker architecture utilizes a transformer network with a CNN-based foundation, a versatile and powerful neural network architecture. This network, denoted as Ψ : G → O, is tailored to enhance the accuracy of track estimates. Tracks are represented as input tokens Gi, encoding essential information like image features, visibility, appearance, correlation vectors, and positional encodings. The transformer processes these tokens iteratively to refine track predictions, ensuring context assimilation. The optimization of visibility is achieved through learned weights and strategically initialized quantities. Windowed Inference Co-Tracker supports windowed inference, allowing it to efficiently handle long videos. In scenarios where the video length T' exceeds the maximum window size supported by the architecture, the video is split into windows with an overlap. 
The transformer is then applied iteratively across these windows, allowing the model to process extended video sequences while preserving accuracy. Unrolled Learning Unrolled learning is a vital component of Co-Tracker's training process. This mechanism enables the model to handle semi-overlapping windows, which is essential for maintaining accuracy across longer videos. The model is trained in an unrolled fashion, effectively preparing it to handle videos of varying lengths during evaluation. Transformer Operation and Components Co-Tracker's transformer operates using interleaved time and group attention blocks. This unique approach allows the model to consider temporal and correlated group-based information simultaneously. Time attention captures the evolution of individual tracks over time, while group attention captures correlations between different tracks. This enhances the model's ability to reason about complex motion patterns and occlusions. Point Selection A crucial aspect of Co-Tracker's success lies in its approach to point selection. To ensure a fair comparison with existing methods and to maintain robustness in performance, the model is evaluated using two point selection strategies: global and local. In the global strategy, points are selected on a regular grid across the entire image. In the local strategy, points are chosen in proximity to the target point. Point selection enhances the model's ability to focus on relevant points and regions, contributing to its accuracy in motion prediction. CoTracker: It is Better to Track Together Co-Tracker: Implementation Co-Tracker's implementation involves rendering 11,000 pre-generated 24-frame sequences from TAP-Vid-Kubric, each annotated with 2,000 tracked points. These points are preferentially sampled on objects. During training, 256 points are randomly selected per sequence, visible either in the first or middle frames. 
Co-Tracker is trained as a baseline on TAP-Vid-Kubric sequences of size 24 frames using sliding windows of size 8 frames, iterated 50,000 times on 32 NVIDIA TESLA Volta V100 32GB GPUs. This scalable approach ensures efficient learning and flexibility to adapt the batch size according to available GPU memory, resulting in high-quality tracking performance and achieving a stable frame rate (fps). Ground truth annotations enhance the training process, contributing to the model's robustness and accuracy in capturing complex motion patterns. {{light_callout_start}} To access the model on GitHub, visit: Co-Tracker. {{light_callout_end}} Co-Tracker: Experiments and Benchmarks Co-Tracker's efficacy in video motion prediction and point tracking was evaluated in a series of experiments and benchmark assessments. The performance of the architecture was rigorously tested using a combination of synthetic and real-world datasets, each carefully chosen to represent a spectrum of challenges. The synthetic dataset, TAP-Vid-Kubric, played a pivotal role in training the architecture and simulating dynamic scenarios with object interactions. Benchmark datasets like TAP-Vid-DAVIS, TAP-Vid-Kinetics, BADJA, and FastCapture provided real-world videos with annotated trajectories to facilitate the assessment of Co-Tracker's predictive prowess. These evaluations adhered to predefined protocols tailored to the intricacies of each dataset. The "queried strided" protocol was adopted, requiring precise tracking in both forward and backward directions to address varying motion complexities. Evaluation metrics such as Occlusion Accuracy (OA), Average Jaccard (AJ), and Average Positional Accuracy (\( < \delta^x_{avg} \)) were used to gauge the architecture's performance. Co-Tracker: Results CoTracker: It is Better to Track Together The paper explores the impact of joint tracking and support grids, an essential element of Co-Tracker's design. 
By evaluating different support grids and employing the "uncorrelated single target point" protocol, it demonstrated that the architecture's ability to collectively reason about tracks and their trajectories (group attention and time attention) led to improved outcomes. The best results were achieved when the correct contextual points were considered, highlighting the effectiveness of combining local and global grids. The potential for even better performance was seen when using the "all target points" protocol, indicating that correlated points are indeed influential. Although this protocol was not directly compared to prior work for fairness, it aligns with real-world scenarios where segmentation models could automatically select correlated points. When compared to prior state-of-the-art AI models like RAFT and PIPs, Co-Tracker exhibited remarkable accuracy in tracking points and their visibility across various benchmark datasets. The architecture's capacity for long-term tracking of points in groups was especially beneficial. This approach was different from traditional single-point models and short-term optical flow methods that often grapple with accumulated drift issues. The meticulous evaluation protocol further solidified Co-Tracker's superior predictive capabilities. CoTracker: It is Better to Track Together During the exploration of the importance of training data, TAP-Vid-Kubric emerged as the superior choice vs.  FlyingThings++. The latter's short sequences clashed with Co-Tracker's reliance on sliding windows for training. On the other hand, Kubric's realistic scenes and occluded objects aligned seamlessly with the architecture's design. The significance of unrolled learning in the sliding window scheme was demonstrated through evaluations. Given that evaluation sequences often exceeded training video lengths, Co-Tracker's ability to propagate information between windows emerged as a crucial factor in its exceptional performance. 
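The sliding-window scheme discussed above can be illustrated with a small Python sketch. This is not code from the CoTracker repository: the function name `split_into_windows` is hypothetical, and the default stride of half the window size is an assumption based on the article's description of "semi-overlapping" windows. It only shows how a long video's frame indices could be split into overlapping windows of 8 frames.

```python
def split_into_windows(num_frames, window_size=8, stride=None):
    """Split frame indices [0, num_frames) into overlapping windows.

    Returns a list of (start, end) pairs with `end` exclusive.
    The half-window default stride is an assumption for illustration.
    """
    if stride is None:
        stride = window_size // 2  # "semi-overlapping" windows (assumed)
    if num_frames <= 0 or window_size <= 0:
        raise ValueError("num_frames and window_size must be positive.")
    if not 0 < stride <= window_size:
        raise ValueError("stride must be in the range 1..window_size.")

    # A short video fits in a single window.
    if num_frames <= window_size:
        return [(0, num_frames)]

    windows = []
    start = 0
    while True:
        end = min(start + window_size, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += stride
    return windows


# A 24-frame sequence processed with windows of 8 frames.
for start, end in split_into_windows(24, window_size=8):
    print(f"window covers frames {start}..{end - 1}")
```

In a scheme like this, the frames shared by two consecutive windows are where estimates from the earlier window could be carried over to initialize the later one, which is the kind of information propagation the unrolled training is described as preparing the model for.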
{{light_callout_start}} Read the original paper by Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht on Arxiv: CoTracker: It is Better to Track Together. {{light_callout_end}}  Co-Tracker: Key Takeaways CoTracker: It is Better to Track Together Group Tracking Boosts Accuracy: Co-Tracker's simultaneous tracking of multiple points improves accuracy by considering correlations between them, surpassing single-point models. Contextual Points Matter: Co-Tracker's success depends on choosing contextual points effectively within support grids, highlighting the importance of context in accurate tracking. Long-Term Group Tracking Prevails: Co-Tracker's long-term group tracking surpasses single-point models and short-term optical flow methods, ensuring better predictive accuracy and mitigating drift issues. Training Data's Influence: TAP-Vid-Kubric's training data is superior, aligning well with Co-Tracker's approach and offering more realistic scenes than FlyingThings++. Efficient Unrolled Learning: Co-Tracker's unrolled learning for sliding windows efficiently propagates information, proving vital for maintaining accuracy on longer sequences. Co-Tracker's success hinges on correlation utilization, context consideration, and real-world adaptability, solidifying its role as a transformative solution for video motion prediction and point tracking.

August 30

5 min

Software To Help You Turn Your Data Into AI

Forget fragmented workflows, annotation tools, and Notebooks for building AI applications. Encord Data Engine accelerates every step of taking your model into production.