Encord Blog


How To Fine-Tune Segment Anything

Computer vision is having its ChatGPT moment with the release of the Segment Anything Model (SAM) by Meta last week. Trained on over 1 billion segmentation masks, SAM is a foundation model for predictive AI use cases rather than generative AI. While it has shown remarkable flexibility in segmenting a wide range of image modalities and problem spaces, it was released without “fine-tuning” functionality.

This tutorial outlines the key steps to fine-tune SAM's mask decoder, in particular describing which functions from SAM to use to pre- and post-process the data so that it is in good shape for fine-tuning.

What is the Segment Anything Model (SAM)?

The Segment Anything Model (SAM) is a segmentation model developed by Meta AI. It is considered the first foundational model for Computer Vision. SAM was trained on a huge corpus of data containing millions of images and over a billion masks, making it extremely powerful. As its name suggests, SAM is able to produce accurate segmentation masks for a wide variety of images. SAM’s design allows it to take human prompts into account, making it particularly powerful for Human In The Loop annotation. These prompts can be multi-modal: they can be points on the area to be segmented, a bounding box around the object to be segmented or a text prompt about what should be segmented.

The model is structured into 3 components: an image encoder, a prompt encoder and a mask decoder.

The image encoder generates an embedding for the image being segmented, whilst the prompt encoder generates an embedding for the prompts. The image encoder is a particularly large component of the model. This is in contrast to the lightweight mask decoder, which predicts segmentation masks based on the embeddings.
Meta AI has made the weights and biases of the model trained on the Segment Anything 1 Billion Mask (SA-1B) dataset available as a model checkpoint.

Learn more about how Segment Anything works in our explainer blog post Segment Anything Model (SAM) Explained.

What is Model Fine-Tuning?

Publicly available state-of-the-art models have a custom architecture and are typically supplied with pre-trained model weights. If these architectures were supplied without weights, users would need to train the models from scratch on massive datasets to obtain state-of-the-art performance.

Model fine-tuning is the process of taking a pre-trained model (architecture + weights) and showing it data for a particular use case. This will typically be data that the model hasn’t seen before, or that is underrepresented in its original training dataset.

The difference between fine-tuning a model and training it from scratch is the starting value of the weights and biases. If we were training from scratch, these would be randomly initialised according to some strategy. In such a starting configuration, the model would ‘know nothing’ of the task at hand and perform poorly. By using pre-existing weights and biases as a starting point, we can ‘fine-tune’ them so that our model works better on our custom dataset. For example, the information learnt to recognise cats (edge detection, counting paws) will be useful for recognising dogs.

Why Would I Fine-Tune a Model?

The purpose of fine-tuning a model is to obtain higher performance on data which the pre-trained model has not seen before. For example, an image segmentation model trained on a broad corpus of data gathered from phone cameras will have mostly seen images from a horizontal perspective. If we tried to use this model for satellite imagery taken from a vertical perspective, it may not perform as well.
If we were trying to segment rooftops, the model may not yield the best results. The pre-training is useful because the model will have learnt how to segment objects in general, so we want to take advantage of this starting point to build a model which can accurately segment rooftops. Furthermore, it is likely that our custom dataset would not have millions of examples, so we want to fine-tune instead of training the model from scratch.

Fine-tuning is desirable because it lets us obtain better performance on our specific use case without incurring the computational cost of training a model from scratch.

How to Fine-Tune Segment Anything Model [With Code]

Background & Architecture

We gave an overview of the SAM architecture in the introduction section. The image encoder has a complex architecture with many parameters. To fine-tune the model, it makes sense to focus on the mask decoder, which is lightweight and therefore easier, faster and more memory efficient to fine-tune.

In order to fine-tune SAM, we need to extract the underlying pieces of its architecture (image and prompt encoders, mask decoder). We cannot use SamPredictor.predict for two reasons:

- We want to fine-tune only the mask decoder.
- This function calls SamPredictor.predict_torch, which has the @torch.no_grad() decorator, which prevents us from computing gradients.

Thus, we need to examine the SamPredictor.predict function and call the appropriate functions with gradient calculation enabled on the part we want to fine-tune (the mask decoder). Doing this is also a good way to learn more about how SAM works.

Creating a Custom Dataset

We need three things to fine-tune our model:

- Images on which to draw segmentations
- Segmentation ground truth masks
- Prompts to feed into the model

We chose the stamp verification dataset since it has data which SAM may not have seen in its training (i.e., stamps on documents).
We can verify that it performs well, but not perfectly, on this dataset by running inference with the pre-trained weights. The ground truth masks are also extremely precise, which will allow us to calculate accurate losses. Finally, this dataset contains bounding boxes around the segmentation masks, which we can use as prompts to SAM. An example image is shown below. These bounding boxes align well with the workflow that a human annotator would go through when looking to generate segmentations.

Input Data Preprocessing

We need to preprocess the scans from numpy arrays to PyTorch tensors. To do this, we can follow what happens inside SamPredictor.set_image and SamPredictor.set_torch_image, which preprocess the image. First, we can use utils.transform.ResizeLongestSide to resize the image, as this is the transform used inside the predictor. We can then convert the image to a PyTorch tensor and use the SAM preprocess method to finish preprocessing.

Training Setup

We download the model checkpoint for the vit_b model and load it in:

    import torch
    from segment_anything import sam_model_registry

    sam_model = sam_model_registry['vit_b'](checkpoint='sam_vit_b_01ec64.pth')

We can set up an Adam optimizer with defaults and specify that the parameters to tune are those of the mask decoder:

    optimizer = torch.optim.Adam(sam_model.mask_decoder.parameters())

At the same time, we can set up our loss function, for example Mean Squared Error:

    loss_fn = torch.nn.MSELoss()

Training Loop

In the main training loop, we will be iterating through our data items, generating masks and comparing them to our ground truth masks so that we can optimise the model parameters based on the loss function.

In this example we used a GPU for training since it is much faster than using a CPU. It is important to use .to(device) on the appropriate tensors to make sure that we don’t have certain tensors on the CPU and others on the GPU.
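The resizing step described under Input Data Preprocessing is easier to follow with the arithmetic written out. The sketch below is plain Python and is not the library's code: it mirrors, as an assumption, how a longest-side resize to 1024 pixels (SAM's image encoder input size) changes the image shape, how a bounding-box prompt would have to be scaled into the resized frame, and how much bottom/right padding is needed to reach a square input. The function names are hypothetical and chosen for illustration only.

```python
def get_resized_shape(old_h, old_w, long_side=1024):
    """Shape after scaling so the longest side equals `long_side` (rounded to nearest pixel)."""
    scale = long_side / max(old_h, old_w)
    return int(old_h * scale + 0.5), int(old_w * scale + 0.5)


def scale_box(box_xyxy, original_size, long_side=1024):
    """Map an (x0, y0, x1, y1) box from original pixel coordinates to the resized frame."""
    old_h, old_w = original_size
    new_h, new_w = get_resized_shape(old_h, old_w, long_side)
    sx, sy = new_w / old_w, new_h / old_h
    x0, y0, x1, y1 = box_xyxy
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


def padding_to_square(resized_size, long_side=1024):
    """Rows and columns of padding (bottom, right) needed to reach long_side x long_side."""
    new_h, new_w = resized_size
    return long_side - new_h, long_side - new_w


# Hypothetical scan of 1500 x 2000 pixels (height x width) with one box prompt.
resized = get_resized_shape(1500, 2000)
print(resized)                     # (768, 1024)
print(padding_to_square(resized))  # (256, 0)
print(scale_box((100, 200, 500, 600), (1500, 2000)))
```

In practice the library's own transform and preprocess functions should be used. The point of the sketch is that the image and its box prompt must go through the same scaling, and that the resized shape (here 768 x 1024) is the kind of value the later post-processing step needs in order to crop the padding away again.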
We want to embed images by wrapping the encoder in the torch.no_grad() context manager. Otherwise we will run into memory issues, and in any case we are not looking to fine-tune the image encoder.

    with torch.no_grad():
        image_embedding = sam_model.image_encoder(input_image)

We can also generate the prompt embeddings within the no_grad context manager. We use our bounding box coordinates, converted to PyTorch tensors.

    with torch.no_grad():
        sparse_embeddings, dense_embeddings = sam_model.prompt_encoder(
            points=None,
            boxes=box_torch,
            masks=None,
        )

Finally, we can generate the masks. Note that here we are in single mask generation mode (in contrast to the 3 masks that are normally output).

    low_res_masks, iou_predictions = sam_model.mask_decoder(
        image_embeddings=image_embedding,
        image_pe=sam_model.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse_embeddings,
        dense_prompt_embeddings=dense_embeddings,
        multimask_output=False,
    )

The final step here is to upscale the masks back to the original image size, since they are low resolution. We can use Sam.postprocess_masks to achieve this. We will also want to generate binary masks from the predicted masks so that we can compare these to our ground truths. It is important to use torch functionals in order not to break backpropagation.

    from torch.nn.functional import threshold, normalize

    upscaled_masks = sam_model.postprocess_masks(low_res_masks, input_size, original_image_size).to(device)
    binary_mask = normalize(threshold(upscaled_masks, 0.0, 0)).to(device)

Finally, we can calculate the loss and run an optimisation step:

    loss = loss_fn(binary_mask, gt_binary_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

By repeating this over a number of epochs and batches we can fine-tune the SAM decoder.
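To make the binarisation step more concrete, here is a small plain-Python illustration of what thresholding followed by normalisation computes for a single mask channel. It rests on two assumptions that are not stated in the post: that the upscaled mask has a channel dimension of size one (single mask mode), and that normalize is applied with its default settings (L2 norm over that dimension, with a small epsilon). The helper names are hypothetical; they imitate the elementwise rules rather than call PyTorch.

```python
def threshold(x, thresh=0.0, value=0.0):
    """Elementwise threshold rule: keep x if it is above `thresh`, else replace it by `value`."""
    return x if x > thresh else value


def normalize_single_channel(x, eps=1e-12):
    """L2 normalisation over a dimension of size one reduces to x / max(|x|, eps)."""
    return x / max(abs(x), eps)


def binarize(logits):
    """Apply threshold then normalisation to every logit in a flat list."""
    return [normalize_single_channel(threshold(v)) for v in logits]


def mse(pred, target):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


logits = [-3.2, -0.1, 0.0, 0.4, 7.5]   # hypothetical mask logits
pred = binarize(logits)
print(pred)                             # [0.0, 0.0, 0.0, 1.0, 1.0]
print(mse(pred, [0, 0, 1, 1, 1]))       # 0.2
```

Under these assumptions every positive logit maps to 1 and everything else to 0, which is what makes the result comparable to a binary ground truth mask. Because that output stays constant as long as a logit does not change sign, it may be worth confirming in your own run that the decoder parameters receive non-zero gradients. A common alternative, which is not what this post uses, is to compute the loss on sigmoid probabilities instead.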
Saving Checkpoints and Starting a Model from it

Once we are done with training and satisfied with the performance uplift, we can save the state dict of the tuned model using:

    torch.save(sam_model.state_dict(), PATH)

We can then load this state dict when we want to perform inference on data that is similar to the data we used to fine-tune the model.

You can find the Colab Notebook with all the code you need to fine-tune SAM here. Keep reading if you want a fully working solution out of the box!

Fine-Tuning for Downstream Applications

While SAM does not currently offer fine-tuning out of the box, we are building a custom fine-tuner integrated with the Encord platform. As shown in this post, we fine-tune the decoder in order to achieve this. This is available as an out-of-the-box, one-click procedure in the web app, where the hyperparameters are automatically set.

Original vanilla SAM mask:

Mask generated by fine-tuned version of the model:

We can see that this mask is tighter than the original mask. This was the result of fine-tuning on a small subset of images from the stamp verification dataset, and then running the tuned model on a previously unseen example. With further training and more examples we could obtain even better results.

Conclusion

That's all, folks! You have now learned how to fine-tune the Segment Anything Model (SAM). If you're looking to fine-tune SAM out of the box, you might also be interested to learn that we have recently released the Segment Anything Model in Encord, allowing you to fine-tune the model without writing any code.

Expert Review with Workflows

Introduction Expert review workflows are crucial for accurate and successful annotation projects ensuring high data quality, efficient task allocation, and time savings. In this walkthrough, you’ll learn how to customize workflows to facilitate expert review flows and improve collaboration. As the AI and computer vision landscapes evolve, expert review workflows help you maintain data integrity, ensure optimal model performance, and maintain flexibility for future unknown labeling demands.  Understanding Workflows Workflows are systematic processes (or graphs) that define how tasks are organized, assigned, routed, and automated within an annotation project. They provide a structured way of handling various stages of a project, ensuring that each step is completed efficiently and in the correct order while tracking performance at each step. Expert Review With the importance of training data ever-increasing, expert review workflows ensure the highest quality of annotations, in turn leading to improved model performance. The expert review ensures data meets the required standard through subject matter experts thoroughly checking and validating a subset of the annotations created.  Benefits of Expert Review Workflows Expert review workflows offer a range of benefits that contribute to the success of data-centric projects: Improved Data Quality: Expert review ensures that data is accurate and error-free, leading to more reliable models and results. Efficient Task Allocation: Workflows help allocate tasks to the right experts, ensuring that each annotation or review is handled by the most qualified individuals. Error Detection and Correction: Issues can be identified and addressed promptly during the review process, preventing them from propagating further in the project. 
Time and Resource Savings: Automation within workflows streamlines the process, reducing the time and effort required for manual coordination and ensuring experts aren’t wasting their time on menial tasks.

Setting up Expert Review Workflows with Encord

Create a New Workflow Template

First, navigate to the "Workflow Templates" and click on the "+ New workflow template" button. For this walkthrough, we will create a workflow for an object detection model.

Configuring the Workflow Template

In the center, you will find the edit button; clicking it opens the workflow library on the right-hand side of the screen. This library contains components to build your workflow. Let’s look at each of these components as we add them to our insect detection project.

Start Stage

This is where your project begins, offering a clear overview of the project's foundation and helping team members understand the data they'll be working with.

Annotate Stage

This stage is the heart of the workflow, where data is annotated. The stage initially includes all annotators by default. To choose specific annotators, click the Annotate component, go to the Selective tab, enter the user's email, and select from the list. Only collaborators added via Project-level Manage collaborators will be available. The optional Webhook feature adds a layer of real-time notifications, enhancing project monitoring.

Review Stage

Multiple review stages can be included within a project, each with its own set of reviewers and routing conditions, helping to establish a structured process where subject matter experts validate annotations and detect errors.

Strict Review

With strict review, tasks stay put after label approval or rejection, giving reviewers time for adjustments and the ability to add comments for missing annotations. This provides reviewers with an additional opportunity to evaluate and, potentially, revise their judgments.
This added layer of scrutiny helps to maintain accuracy and quality.

Router

A Router divides the pathway that annotation and review tasks follow within the workflow. You have the choice between two router types for your project:

Percentage Router

Allocates annotations based on defined percentages, which is useful for the precise distribution of tasks, for example an equal workload split between different stages or groups of reviewers.

Collaborator Router

Customizes annotation workflows based on collaborators to assign tasks strategically, ensuring alignment with expertise and responsibilities, and providing flexibility for diverse collaborators. For instance, a new annotator, Chris, may have his tasks automatically routed to an expert review queue, while pathology annotations are assigned to Dr. Smith and radiology annotations to Dr. Johnson. This approach optimizes the workflow, maintains quality through expert review, and allows flexibility for exceptions, enhancing collaboration in diverse teams.

Now that we've covered each element of the workflow, let's explore an instance of a workflow designed for object detection.

Using Workflows in Annotation Projects

To understand the integration of workflows in annotation projects, let's create an annotation project for an insect detection model with the following steps:

- Select and name the annotation project.
- Add the insect dataset. You can create a new dataset here as well.
- Add the ontology for the annotation project.
- For quality assurance, opt for a workflow, either by creating a new one or utilizing an existing setup.

And you are ready to start annotating!

Select your annotation project and open the summary. The Summary provides an overview of your workflow project, displaying the status of tasks in each workflow stage, and offering a high-level visual representation of project progress.

Navigate to the Queue for task management and labeling initiation, with options tailored to user permissions.
It encompasses the Annotator's Queue, Reviewer's Queue, and Annotator & Reviewer, Admin, and Team Manager Queue. Users can filter, initiate, and assign tasks as needed, and this functionality varies based on the user's role. Admins and Task Managers can assign and release tasks, ensuring efficient task management within the project. Select the Start Labeling button to annotate your dataset. Label your dataset! Once the data has been annotated, reviewers find the labeled data to be reviewed in Queue as well.  The reviewer has the option to exert bulk action on multiple reviews at once. Once the review is complete, any rejected images can again be found in the Annotator’s queue. The reason for rejection can also be specified and the annotator must resolve the issue to submit the re-annotated data. The approved images are found in the expert review queues. Once all the reviews are accepted the annotation is complete! The multiple review stages process in the annotation project contributes to the refinement of the dataset, aligning it with the desired standards and objectives of the project. The flexibility to perform bulk actions on multiple reviews simultaneously streamlines the review workflow and the ability to specify reasons for rejection provides valuable feedback to annotators.  Wrapping Up In conclusion, expert review workflows play a pivotal role in ensuring the accuracy and success of data-centric projects like annotating an insect detection model. These workflows offer benefits such as improved data quality, efficient task allocation, and time savings. As technology advances, the importance of expert review workflows in maintaining data integrity becomes increasingly evident. They are an essential component in the evolving landscape of data-driven projects, ensuring optimal model performance. {{Training_data_CTA:: Optimize your annotation project with expert review workflows}}

December 5

5 min

One Year of ChatGPT - Here’s What’s Coming Next

Before OpenAI was a producer of the most scintillating boardroom corporate drama outside of an episode of Succession, it was the creator of the universally known AI application ChatGPT. On the eve of the one-year anniversary of its launch (a whirlwind year of progress, innovations, and twists), it is worth revisiting the state of AI post-ChatGPT with a view towards looking forward.

A year ago, ChatGPT took the world by storm, smashing even OpenAI’s greatest expectations of adoption by becoming the world’s fastest-growing consumer app of all time. While the last year has been filled with a panoply of new models, hundreds of freshly minted startups, and gripping drama, it still very much feels like only the early days of the technology.

As the cofounder of an AI company, and having been steeped in the ecosystem for years, the difference this last year has made has been nothing short of remarkable, not just in technological progress or academic research (although the strides here have been dizzying), but more in unlocking the public imagination and discourse around AI.

YCombinator, a leading barometer of the directionality of technological trends, has recently churned out batches where, for the first time, most companies are focused on AI. ChatGPT is now being used as a zinger in political debates. Once exotic terms like Retrieval Augmented Generation are making their way into the vernacular of upper management in Fortune 500 companies. We have entered not a technological, but a societal change, where AI is palatable for mainstream digestion. So what’s next?

Seeing the future is easy, if you know where to look in the present: technological progress does not move as a uniform front where adoption of innovation propagates equally across all facets of society. Instead, it moves like waves crashing the jagged rocks of a coastline, splashing chaotically forward, soaking some while leaving others dry.
Observing where the water hits first lets you guess what happens when it splashes on others later. It takes one visit to San Francisco to notice the eerily empty vehicles traversing the city in a silent yet conspicuous manner to preview what the future looks like for municipalities around the world: a world with the elimination of Uber driver small talk.

While making firm predictions in a space arguably moving forward faster than any other technological movement in history is a fool’s game, clear themes are emerging that are worth paying attention to by looking at the water spots of those closest to the waves. We are only one year into this “new normal,” and the future will have much more to bring along the following themes:

Dive Into Complexity

One of the most exciting aspects of artificial intelligence as a technology is that it falls into a category few technologies do: “unbounded potential.” Moore’s Law in the ‘60s gave Silicon Valley a self-fulfilling prophecy of computational progress to follow. The steady march of development cycles has paved the way from room-sized machines with the power of a home calculator to all the marvellous wonders we take for granted in society today.

Similar to computation, there are no limits in principle for the cognitive power of computers across the full range of human capabilities. This can stoke the terrors of a world-conquering AGI, but it also brings up a key principle worth considering: ever-increasing intellectual power.

The AIs of today that are drawing boxes over cars and running segmentations over people will be considered crude antiquities in a few years. They are sub-component solutions used only as intermediate steps to tackle more advanced problems (such as diagnosing cancer, counting cars for parking tickets, etc). We must walk before we can run, but it is not difficult to imagine an ability to tackle harder and harder questions over time.
In the future, AI will be able to handle problems of increasing complexity and nuance, ones that are currently limitations for existing systems.  While ChatGPT and other equivalent LLMs of today are conversant (and hallucinatory) in wide-ranging topics, they still cannot handle niche topics with reliability. Companies, however, have already begun tailoring these models with specialized datasets and techniques to handle more domain-specific use cases. With improved training and prompting, the emergence of AI professionals - such as doctors, paralegals, and claims adjusters - is on the horizon. We’re also approaching an era where these specialized applications, like a FashionGPT trained on the latest trends, can provide personalized advice and recommendations according to individual preferences. We should expect a world where the complexity and nuance of problems, ones that are only available for particular domain experts of today, will be well within the scope of AI capabilities. Topics like advanced pathology, negotiating geopolitical situations, and company building will be problems within AI capacity. If the history of computers is any beacon, complexity is the direction forward.  Multi-modality  Right now, there are categorical boxes classifying different types of problems that AI systems can solve. We have “computer vision”, “NLP”, “reinforcement learning”, etc. We also have separations between “Predictive” and “Generative AI” (with a corresponding hype cycle accompanying the rise of the term). These categories are useful, but they are mostly in place because models can, by and large, solve one type of problem at a time. Whenever the categorizations are functions of technological limitations, you should not expect permanence; you should expect redefinitions. Humans are predictive and generative. You can ask me if a picture is of a cat or a dog, and I can give a pretty confident answer. But I can also draw a cat (albeit badly). Humans are also multi-modal. 
I can listen to the soundtrack of a movie and take in the sensory details of facial expressions, body language, and voice in both semantic content as well as tonal and volume variations. We are performing complex feats of sensor fusion across a spectrum of inputs, and we can perform rather complex inferences from these considerations. Given that we can do this adeptly, we shouldn’t expect any of these abilities to be outside the purview of sufficiently advanced models. The first inklings of this multi-modal direction are already upon us. ChatGPT has opened up to vision and can impressively discuss input images. Open-source models like LLaVA now reason over both text and vision. CLIP combines text and vision into a unified embedding structure and can be integrated with various types of applications. Other multimodal embedding agents are also becoming commonplace.  {{gray_callout_start}} Check out my webinar with Frederik Hvilshøj, Lead ML Engineer at Encord, on “How to build Semantic Visual Search with ChatGPT & CLIP”. {{gray_callout_end}} While these multimodal models haven’t found use in many practical applications yet, it is only a matter of time before they are integrated into commonplace workflows and products. Tied to the point above on complexity, multimodal models will start to replace their narrower counterparts to solve more sophisticated problems. Today's models can, by and large, see, hear, read, plan, move, etc. The models of the future will do all of these simultaneously. The Many Faces of Alignment The future themes poised to gain prominence in AI not only encompass technological advancements but also their societal impacts. Among the onslaught of buzzy terms borne out of the conversations in San Francisco coffee shops, alignment has stood out among the rest as the catch-all for all the surrounding non-technical considerations of the broader implications of AI. 
According to ChatGPT: AI alignment refers to the process and goal of ensuring that artificial intelligence (AI) systems' goals, decisions, and behaviors are in harmony with human values and intentions.

There are cascading conceptual circles of alignment dependent on the broadness of its application. As of now, the primary focus of laboratories and companies has been to align models to what is called a “loss function.” A loss function is a mathematical expression of how far away a model is from getting an answer “right.” At the end of the day, AI models are just very complicated functions, and all the surrounding infrastructure is a very powerful function optimization tool. A model behaving as it should, as of now, just means a function has been properly optimized to have “a low loss.”

This raises the question of how you choose the right loss function in the first place. Is the loss function itself aligned with the broader goal of the researcher building it? Then there is the question: if the researcher is getting what they want, does the institution the researcher is sitting in get what it wants? The incentives of a research team might not necessarily be aligned with those of the company. There is the question of how all of this is aligned with the interests of the broader public, and so on.

Dall-E’s interpretation of the main concentric circles of alignment

The clear direction here is that infrastructure for disentangling multilevel alignment seems inevitable (and necessary). Research in “superalignment” by institutions such as OpenAI, before their board debacle, is getting heavy focus in the community. It will likely lead to tools and best practices to help calibrate AI to human intention even as AI becomes increasingly powerful.

At the coarse-grained societal level, this means broad regulation imposed by politicians who need help finding the Google toolbar. Broad-brushed regulations, similar to what we see in the EU AI Act, are very likely to follow worldwide.
Tech companies will get better at aligning models to their loss, researchers and alignment advocates at a loss to human goals, and regulators at the technology to the law. Regulation, self-regulation, and corrective mechanisms are bound to come—their effectiveness is still uncertain.  The AI Internet A question in VC meetings all around the world is whether a small number of powerful foundation models will end up controlling all intelligence operations in the future or whether there will be a proliferation of smaller fine-tuned models floating around unmoored from centralized control. My guess is the answer is both.  Clearly, centralized foundation models perform quite well on generalized questions and use cases, but it will be difficult for foundation model providers to get access to proprietary datasets housed in companies and institutions to solve finer-grained, domain-specific problems. Larger models are also constrained by their size and much more difficult to embed in edge devices for common workflows. For these issues, corporations will likely use alternatives to control their own fine-tuned models. Rather than having one model control everything, the future is likely to have many more AI models than today.  The proliferation of AI models to come harkens back to the early proliferation of personal computing devices. The rise of the internet over the last 30 years has taught us a key lesson: things like to be connected. Intelligent models/agents will be no exception to this.  AI agents, another buzz term on the rise, are according to ChatGPT: Systems or entities that act autonomously in an environment to achieve specific goals or perform certain tasks.  We are seeing an uptake now on AI agents powered by various models tasked with specific responsibilities. Perhaps this will come down even to the individual level, where each person has their own personal AI completing the routine monotonous tasks for them on a daily basis. 
Whether this occurs or not, it is only a matter of time before these agents start to connect and communicate with each other. My scheduling assistant AI will need to talk to your scheduling assistant. AI will be social! My guess is that a type of AI communication protocol will emerge in which daisy-chaining models of different skills and occupations will exponentiate their individual usefulness. These communication protocols are still some ways from being established or formalized, but if the days of regular old computation mean much, they will not be far away. We are seeing the first GitHub repos showcasing orchestration systems of various models. While still crude, if you squint, you can see a world where this type of “AI internet” integrates into systems and workflows worldwide for everyday users.

Paywalling

The early internet provided a cornucopia of free content and usage powered by VC largesse with the mandate of growth at all costs. It took a few years before the paywalls started, in news sites around the world, in walled-off premium features, and in jacked-up Uber rates. After proving the viability of a technology, the next logical step tends to be monetization.

For AI, the days of open papers, datasets, and sharing in communities are numbered as the profit engine picks up. We have already seen this in the increasingly, almost comically, vague descriptions OpenAI releases about their models. By the time GPT-5 rolls around, the expected release won’t be much less guarded than OpenAI just admitting, “we used GPUs for this.” Even non-tech companies are realising that the data they possess has tremendous value and will be much more savvy before letting it loose. AI is still only a small portion of the economy at the moment, but its generality and unbounded potential, as described above, lead to the expectation that it can have an absolutely enormous economic impact.
Ironically, the value created by the early openness of technology will result in the end of technological sharing and a more closed mentality.  The last generation of tech growth has been fueled by social media and “attention.” Any barriers to engagement, such as putting a credit card upfront, were discouraged, and the expectation that “everything is free” became commonplace in using many internet services. OpenAI, in contrast, rather than starting with a traditional ad-based approach for monetization, opened up a premium subscription service and is now charging hefty sums for tailored models for corporations. The value of AI technology in its own right obviates the middle step of funding through advertising. Data and intelligence will likely not come for free. As we shift from an attention economy to an intelligence economy, where automation becomes a core driver of growth, expect the credit cards to start coming out. Dall-E’s interpretation of the coming AI paywall paving the transition from an attention economy to an intelligence economy Expect the Unexpected As a usual mealy-mouthed hedge in any predictive article, the requisite disclaimer of the unimaginable items must be established. In this case, this is also a genuine belief. Even natural extrapolations of AI technology moving forward can leave us in heady disbelief of possible future states. Even much smaller questions, like if OpenAI itself will survive in a year, are extremely difficult to predict. If you asked someone 50 years ago about capturing some of the most magnificent imagery in the world, of items big or small, wonders of the world captured within a device in the palm of your hand and served in an endless scroll among other wonders, it would seem possible and yet inconceivable. Now, we are bored by seeing some of the world's most magnificent, spectacular images and events. Our demand for stimulating content is being overtaken by supply. 
Analogously, with AI, we might be in a world where scientific progress is accelerated beyond our wildest dreams, where we have more answers than questions, and where we cannot even process the set of answers available to us.

Using AI, deep mathematical puzzles like the Riemann Hypothesis may be laid bare as a trivial exercise. Yet, the formulation of interesting questions might be bottlenecked by our own ability and appetite to answer them. A machine to push forward mathematical progress beyond our dreams might seem too much to imagine, but it’s only one of many surreal potential futures.

If you let yourself daydream of infinite personal assistants, where you have movies of arbitrary storylines created on the fly for individual consumption, where you can have long and insightful conversations with a cast of AI friends, where most manual and cognitive work of the day has completely transformed, you start to realize that it will be difficult to precisely chart out where AI is going.

There are of course both utopian and dystopian branches of these possibilities. The technology is agnostic to moral consequence; it is only the people using it and the responsibility they incur that can be considered in these calculations. The only thing to expect is that we won’t expect what’s coming.

Conclusion

Is ChatGPT to AI what the iPhone was to the app wave of the early 2010s? Possibly, and that is probably why OpenAI ran a very Apple-like keynote before Sam Altman’s shocking dismissal and return. But what is clear is that once items have permeated into public consciousness, they cannot be revoked. People understand the potential now. Just 3 years ago a company struggling to raise a seed round had to compete for attention against crypto companies, payments processors, and fitness software. AI companies today are a hot ticket item and have huge expectations baked into this potential. It was only 9 months ago that I wrote about “bridging the gap” to production AI.
Amidst all the frenzy around AI, it is easy to forget that most models today are still only in the “POC” (Proof of Concept) state, not having proved sufficient value to be integrated with real-world applications. ChatGPT really showed us a world beyond just production, to “post-production” AI, where AI's broader societal interactions and implications become more of the story than the technological components that it’s made of. We are now at the dawn of the “Post-Production” era. Where this will go exactly is of course impossible to say. But if you look at the past, and at the present, the themes to watch for are: complexity, multi-modality, connectivity, alignment, commercialization, and surprise. I am certainly ready to be surprised.

November 29

5 min

Logistic Regression: Definition, Use Cases, Implementation

Logistic regression is a statistical model used to predict the probability of a binary outcome based on independent variables. It is commonly used in machine learning and data analysis for classification tasks. Unlike linear regression, logistic regression uses a logistic function to model the relationship between independent variables and outcome probability.

It has various applications, such as predicting customer purchasing likelihood, patient disease probability, online advertisement click probability, and, in the social sciences, the effect of various factors on binary outcomes. Mastering logistic regression allows you to uncover valuable insights, optimize strategies, and enhance your ability to accurately classify and predict outcomes of interest. This article takes a detailed look at logistic regression. The structure of the article is as follows:

What is logistic regression?
Data processing and implementation
Model training and evaluation
Challenges in logistic regression
Real-world applications of Logistic Regression
Implementation of logistic regression in Python
Logistic regression: key takeaways
Frequently Asked Questions (FAQs)

What is Logistic Regression?

Logistic regression is a statistical model used to predict the probability of a binary outcome based on one or more independent variables. Its primary purpose in machine learning is to classify data into different categories and understand the relationship between the independent and outcome variables.

The fundamental difference between linear and logistic regression lies in the outcome variable. Linear regression is used when the outcome variable is continuous, while logistic regression is used when the outcome variable is binary or categorical.

{{light_callout_start}} Linear regression models a linear relationship between the independent (predictor) variable, plotted on the X-axis, and the dependent (output) variable, plotted on the Y-axis.
If there is a single input variable (an independent variable), such linear regression is called simple linear regression. {{light_callout_end}} Types of logistic regressions Binary, ordinal, and multinomial systems are the three categories of logistic regressions. Let's quickly examine each of these in more detail. Binary regression Binary logistic regression is used when the outcome variable has only two categories, and the goal is to predict the probability of an observation belonging to one of the two categories based on the independent variables. Multinomial regression Multinomial logistic regression is used when the outcome variable has more than two categories that are not ordered. In this case, the logistic regression model will estimate the probabilities of an observation belonging to each category relative to a reference category based on the independent variables. Ordinal regression Ordinal logistic regression is used when the outcome variable has more than two categories that are ordered. Each type of logistic regression has its own specific assumptions and interpretation methods. Ordinal logistic regression is useful when the outcome variable's categories are arranged in a certain way. It lets you look at which independent variables affect the chance that an observation will be in a higher or lower category on the ordinal scale.   Logistic Regression Curve Logistic Regression Equation The logistic regression equation The logistic regression equation is represented as: P(Y=1) = 1 / (1 + e^-(β0 + β1X1 + β2X2 + ... + βnXn)), where P(Y=1) is the probability of the outcome variable being 1, e is the base of the natural logarithm, β0 is the intercept, and β1 to βn are the coefficients for the independent variables X1 to Xn, respectively. The sigmoid function The sigmoid function, represented as: 1 / (1 + e^- (β0 + β1*X1 + β2*X2 + ... + βn*Xn)), is used in logistic regression to transform the linear combination of the independent variables into a probability. 
This sigmoid function ensures that the probability values predicted by the logistic regression equation always fall between 0 and 1.

By adjusting the coefficients (β values) of the independent variables, logistic regression can estimate the impact of each variable on the probability of the outcome variable being 1.

{{light_callout_start}} A sigmoid function is a bounded, differentiable, real function that is defined for all real input values and has a non-negative derivative at each point and exactly one inflection point. A sigmoid "function" and a sigmoid "curve" refer to the same object. {{light_callout_end}}

Breakdown of the key components of the equation

In logistic regression, the dependent variable is the binary outcome being predicted or explained, represented as 0 and 1. Independent variables, or predictor variables, influence the dependent variable and can be either continuous or categorical.

The coefficients, or β values, represent the strength and direction of the relationship between each independent variable and the probability that the outcome variable is 1. Changing a coefficient changes the impact of the corresponding independent variable on the predicted outcome. A coefficient with a larger absolute value indicates a stronger influence on the outcome variable.

A simple example to illustrate the application of the equation: Let's consider a simple linear regression equation that predicts the sales of a product based on its price. The equation may look like this: Sales = 1000 - 50 * Price. In this equation, the coefficient of -50 indicates that for every unit increase in price, sales decrease by 50 units. So, if the price is $10, the predicted sales would be 1000 - 50 * 10 = 500 units.

By manipulating the coefficient and the variables in the equation, we can analyze how different factors impact the sales of the product. If we increase the price to $15, the predicted sales would decrease to 1000 - 50 * 15 = 250 units.
Conversely, if we decrease the price to $5, the predicted sales would increase to 1000 - 50 * 5 = 750 units.

This equation provides us with a simple way to estimate the product's sales based on its price, allowing businesses to make informed pricing decisions. In logistic regression, coefficients are read in the same way, except that each one shifts the log odds of the outcome rather than the outcome itself.

Assumptions of logistic regression

In this section, you will learn the critical assumptions associated with logistic regression, such as linearity and independence, and see why these assumptions are essential for the model's accuracy and reliability.

Understand Linear Regression Assumptions

Critical assumptions of logistic regression

In logistic regression analysis, the assumptions of linearity and independence are important because they ensure that the relationships between the independent and dependent variables are consistent. This lets you make accurate predictions. Violating these assumptions can compromise the validity of the analysis and its usefulness in making informed decisions.

Assumptions impacting model accuracy and reliability in statistical analysis

The model's accuracy and reliability are based on assumptions like linearity and independence. Linearity, meaning a linear relationship between the independent variables and the log odds, allows for accurate interpretation of each variable's impact on the log odds, while independence ensures that each observation contributes unique information. The log odds, also known as the logit, are a mathematical transformation used in logistic regression to model the relationship between independent variables (predictors) and the probability of a binary outcome. Violations of these assumptions can introduce bias and confounding factors, leading to inaccurate results. Therefore, it's crucial to assess these assumptions during statistical analysis to ensure the validity and reliability of the results.
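To make the relationship between the log odds (logit) and the sigmoid function concrete, here is a minimal sketch in plain Python. The intercept, coefficients, and feature values are made-up numbers chosen for illustration; they do not come from a fitted model.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Return the log odds of probability p (the inverse of the sigmoid)."""
    return math.log(p / (1.0 - p))


def predict_probability(intercept, coefficients, features):
    """Compute P(Y=1) = sigmoid(b0 + b1*x1 + ... + bn*xn)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical values for illustration only (not from a fitted model).
intercept = -1.0
coefficients = [0.8, -0.5]
features = [2.0, 1.0]

probability = predict_probability(intercept, coefficients, features)
log_odds = logit(probability)  # recovers the linear score -1.0 + 1.6 - 0.5

# exp(coefficient) is the odds ratio for a one-unit increase in that feature.
odds_ratio_x1 = math.exp(coefficients[0])

print(f"P(Y=1) = {probability:.3f}, log odds = {log_odds:.3f}")
print(f"Odds ratio for X1 = {odds_ratio_x1:.3f}")
```

Because the sigmoid is the inverse of the logit, a model that is linear in the log odds always produces probabilities between 0 and 1.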
Data Processing and Implementation

In logistic regression, data processing plays an important role in ensuring the accuracy of the results, with steps like handling missing values, dealing with outliers, and transforming variables if necessary.

To ensure the analysis is reliable, using logistic regression also requires careful thought about several factors, such as model selection, goodness-of-fit tests, and validation techniques.

Orange Data Mining - Preprocess

Data preparation for logistic regression

Data preprocessing for logistic regression involves several steps. Firstly, handling missing values is crucial, as they can affect the model's accuracy. You can do this by removing the corresponding observations or imputing the missing values.

Next, dealing with outliers is important, as they can significantly impact the model's performance. Outliers can be detected using various statistical techniques and then either treated or removed depending on their relevance to the analysis.

Additionally, transforming variables may be necessary to meet logistic regression assumptions. This can include applying logarithmic functions, square roots, or other mathematical transformations to the variables. Transforming variables can help satisfy the assumption of a linear relationship between the predictors and the log odds.

Finally, consider the multicollinearity issue, which occurs when independent variables in a logistic regression model are highly correlated. Addressing multicollinearity can be done through various techniques, such as removing one of the correlated variables or using dimension reduction methods like principal component analysis (PCA).

Overall, handling missing values and outliers, transforming variables, and addressing multicollinearity are all essential steps in preparing data for logistic regression analysis.

Techniques for handling missing data and dealing with categorical variables

Missing data can be addressed by removing observations with missing values or using imputation methods.
Categorical variables must be transformed into numerical representations using one-hot encoding or dummy coding techniques. One-hot encoding creates one binary column for each category, while dummy coding creates one column fewer by dropping a reference category, which avoids perfect multicollinearity among the encoded columns.

These techniques help the model capture patterns and relationships within categorical variables, enabling more informed predictions. These methods ensure accurate interpretation and utilization of categorical information in the model.

Significance of data scaling and normalization

Data scaling and normalization are essential preprocessing steps in machine learning. Scaling transforms data to a specific range, ensuring all features contribute equally to the model's training process. Normalization, on the other hand, transforms data to a mean of 0 and a standard deviation of 1 (a transformation often called standardization), bringing all variables to the same scale. This makes variables easier to compare, keeps features with large ranges from dominating the model, and improves the convergence of gradient-based optimization. Overall, scaling and normalization are crucial for ensuring reliable and accurate results in machine learning models.

Model Training and Evaluation

Machine learning involves model training and evaluation. During training, the algorithm learns from input data to make predictions or classifications. Techniques like gradient descent or random search are used to optimize parameters.

After training, the model is evaluated using separate data to assess its performance and generalization. Metrics like accuracy, precision, recall, and F1 score are calculated. The model is then deployed in real-world scenarios to make predictions. Regularization techniques can prevent overfitting, and cross-validation ensures robustness by testing the model on multiple subsets of the data. The goal is to develop a logistic regression model that generalizes well to new, unseen data.
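Returning to the encoding and scaling steps described above, the following plain-Python sketch shows what each transformation does to a small column of data. The toy values and the helper names (`one_hot_encode`, `dummy_encode`, `min_max_scale`, `standardize`) are invented for this illustration; in practice you would more likely reach for scikit-learn's `OneHotEncoder`, `MinMaxScaler`, and `StandardScaler`.

```python
from statistics import mean, pstdev


def one_hot_encode(values):
    """One binary column per category (k categories -> k columns)."""
    categories = sorted(set(values))
    rows = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, rows


def dummy_encode(values):
    """Drop the first category as the reference (k categories -> k-1 columns)."""
    kept = sorted(set(values))[1:]
    rows = [[1 if value == c else 0 for c in kept] for value in values]
    return kept, rows


def min_max_scale(xs):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standardize(xs):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


# Toy data, invented for this example.
ports = ["S", "C", "Q", "S"]
fares = [10.0, 20.0, 30.0, 40.0]

one_hot_columns, one_hot_rows = one_hot_encode(ports)
dummy_columns, dummy_rows = dummy_encode(ports)
scaled_fares = min_max_scale(fares)
standardized_fares = standardize(fares)

print(one_hot_columns, one_hot_rows)
print(dummy_columns, dummy_rows)
print(scaled_fares)
print(standardized_fares)
```

With three categories, one-hot encoding yields three columns and dummy coding yields two; a row of all zeros in the dummy-coded output stands for the dropped reference category.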
Process of training logistic regression models Training a logistic regression model involves several steps. Initially, the dataset is prepared, dividing it into training and validation/test sets. The model is then initialized with random coefficients and fitted to the training data. During training, the model iteratively adjusts these coefficients using an optimization algorithm (like gradient descent) to minimize the chosen cost function, often the binary cross-entropy.  At each iteration, the algorithm evaluates the model's performance on the training data, updating the coefficients to improve predictions. Regularization techniques may be employed to prevent overfitting by penalizing complex models. This process continues until the model converges or reaches a predefined stopping criterion. Finally, the trained model's performance is assessed using a separate validation or test set to ensure it generalizes well to unseen data, providing reliable predictions for new observations. Cost functions and their role in model training In logistic regression, the cost function plays a crucial role in model training by quantifying the error between predicted probabilities and actual outcomes. The most common cost function used is the binary cross-entropy (or log loss) function. It measures the difference between predicted probabilities and true binary outcomes. The aim during training is to minimize this cost function by adjusting the model's parameters (coefficients) iteratively through techniques like gradient descent. As the model learns from the data, it seeks to find the parameter values that minimize the overall cost, leading to better predictions. The cost function guides the optimization process, steering the model towards better fitting the data and improving its ability to make accurate predictions. 
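The training loop and cost function described above can be sketched in a few lines of plain Python. This is an illustrative, single-feature version under stated assumptions: the data is a toy set invented for the example, the coefficients start at zero rather than at random values so the run is reproducible, and the learning rate and epoch count are arbitrary choices. Libraries such as scikit-learn use more sophisticated solvers.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average log loss; eps keeps log() away from exactly 0."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def fit_logistic_regression(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit a one-feature model with batch gradient descent."""
    weight, bias = 0.0, 0.0  # zero start for reproducibility (an assumption)
    n = len(xs)
    loss_history = []
    for _ in range(epochs):
        probs = [sigmoid(weight * x + bias) for x in xs]
        loss_history.append(binary_cross_entropy(ys, probs))
        # Gradients of the average log loss with respect to weight and bias.
        grad_w = sum((p - y) * x for p, y, x in zip(probs, ys, xs)) / n
        grad_b = sum(p - y for p, y in zip(probs, ys)) / n
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias, loss_history


# Toy data invented for this example: hours studied vs. pass (1) / fail (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

weight, bias, loss_history = fit_logistic_regression(hours, passed)
print(f"weight={weight:.3f}, bias={bias:.3f}")
print(f"loss: {loss_history[0]:.4f} -> {loss_history[-1]:.4f}")
```

With all coefficients at zero, every predicted probability is 0.5, so the first recorded loss equals ln 2; each gradient step then lowers the loss as the coefficients move toward values that fit the data.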
Evaluation metrics for logistic regression Precision: Precision evaluates the proportion of true positive predictions out of all positive predictions made by the model, indicating the model's ability to avoid false positives. Recall: Recall (or sensitivity) calculates the proportion of true positive predictions from all actual positives in the dataset, emphasizing the model's ability to identify all relevant instances. F1-score: The F1-score combines precision and recall into a single metric, balancing both metrics to provide a harmonic mean, ideal for imbalanced datasets. It assesses a model's accuracy by considering false positives and negatives in classification tasks. Accuracy: Accuracy measures the proportion of correctly classified predictions out of the total predictions made by the model, making it a simple and intuitive evaluation metric for overall model performance. These metrics help assess the efficiency and dependability of a logistic regression model for binary classification tasks, particularly in scenarios requiring high precision and recall, such as medical diagnoses or fraud detection. Challenges in Logistic Regression Logistic regression faces challenges such as multicollinearity, overfitting, and assuming a linear relationship between predictors and outcome log-odds. These issues can lead to unstable coefficient estimates, overfitting, and difficulty generalizing the model to new data. Additionally, the assumption may not always be true in practice.  Common challenges faced in logistic regression Imbalanced datasets Imbalanced datasets lead to biased predictions towards the majority class and result in inaccurate evaluations for the minority class. This disparity in class representation hampers the model's ability to properly account for the less-represented group, affecting its overall predictive performance. 
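The metrics above, and the way an imbalanced dataset can make accuracy misleading, can be illustrated with a small worked example. The labels below are synthetic (5 positives out of 100) and the two sets of predictions are hypothetical; the function names are chosen for this sketch only.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 (0.0 when undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Synthetic, imbalanced labels: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Hypothetical model A: always predicts the majority (negative) class.
always_negative = [0] * 100

# Hypothetical model B: finds 4 of the 5 positives but raises 6 false alarms.
detector = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

baseline = classification_metrics(y_true, always_negative)
better = classification_metrics(y_true, detector)
print("always negative:", baseline)
print("detector:       ", better)
```

The majority-class model scores higher on accuracy even though it never identifies a positive case, which is why precision, recall, and F1 are reported alongside accuracy when classes are imbalanced.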
Multicollinearity Multicollinearity arises from highly correlated predictor variables, making it difficult to determine the individual effects of each variable on the outcome. The strong interdependence among predictors further complicates the modeling process, impacting the reliability of the logistic regression analysis.  {{light_callout_start}} Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model. You might be unable to trust the p-values to identify statistically significant independent variables. {{light_callout_end}} Overfitting Overfitting occurs when the model becomes overly complex and starts fitting noise in the data rather than capturing the underlying patterns. This complexity reduces the model's ability to generalize well to new data, resulting in a decrease in overall performance. Mitigation strategies and techniques Mitigation strategies, such as regularization and feature engineering, are crucial in addressing these challenges and improving the logistic regression model's predictive accuracy and reliability. Regularization techniques address overfitting in machine learning models. It involves adding a penalty term to the model's cost function, discouraging complex or extreme parameter values. This helps prevent the model from fitting the training data too closely and improves generalization.  Polynomial terms raise predictor variables to higher powers, allowing for curved relationships between predictors and the target variable. This can capture more complex patterns that cannot be captured by a simple linear relationship. Interaction terms involve multiplying different predictor variables, allowing for the possibility that the relationship between predictors and the target variable differs based on the combination of predictor values. 
By including these non-linear terms, logistic regression can capture more nuanced and complex relationships, improving its predictive performance. Real-World Applications of Logistic Regression The real-world applications listed below highlight the versatility and potency of logistic regression in modeling complex relationships and making accurate predictions in various domains.  Healthcare The healthcare industry has greatly benefited from logistic regression, which is used to predict the likelihood of a patient having a certain disease based on their medical history and demographic factors. It predicts patient readmissions based on age, medical history, and comorbidities. It is commonly employed in healthcare research to identify risk factors for various health conditions and inform public health interventions and policies.  Banking and Finance Logistic regression is a statistical method used in banking and finance to predict loan defaults. It analyzes the relationship between income, credit score, and employment status variables. This helps institutions assess risk, make informed decisions, and develop strategies to mitigate losses. It also helps banks identify factors contributing to default risk and tailor marketing strategies.   Remote Sensing In remote sensing, logistic regression is used to analyze satellite imagery to classify land cover types like forest, agriculture, urban areas, and water bodies. This information is crucial for urban planning, environmental monitoring, and natural resource management. It also helps predict vegetation indices, assess plant health, and aid irrigation and crop management decisions.  {{gray_callout_start}} Explore inspiring customer stories ranging from cutting-edge startups to enterprise and international research organizations. Witness how tools and infrastructure are accelerating the development of groundbreaking AI applications. Dive into these inspiring narratives at Encord for a glimpse into the future of AI. 
{{gray_callout_end}}

Implementation of Logistic Regression in Python

Implementation of logistic regression in Python involves the following steps while using the sklearn library:

Import the necessary libraries, such as NumPy, Pandas, Matplotlib, Seaborn, and Scikit-Learn.
Load and preprocess the dataset by handling missing values and encoding categorical variables.
Split the data into training and testing sets.
Train the logistic regression model using the fit() function on the training set.
Make predictions on the testing set using the predict() function.
Evaluate the model's accuracy by comparing the predicted values with the actual labels in the testing set. This can be done using evaluation metrics such as accuracy score, confusion matrix, and classification report.

Additionally, the model can be fine-tuned by adjusting hyperparameters, such as regularization strength, through grid search or cross-validation techniques.

The final step is to interpret and visualize the results to gain insights and make informed decisions based on the regression analysis.

Simple Logistic Regression in Python

Logistic regression predicts the probability of a binary outcome (0 or 1, yes or no, true or false) based on one or more input features.

Here's a step-by-step explanation of implementing logistic regression in Python using the scikit-learn library:

# Import all the necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Load Titanic dataset from seaborn
titanic_data = sns.load_dataset('titanic')
titanic_data.drop('deck', axis=1, inplace=True)
titanic_data.dropna(inplace=True)

# Import label encoder
from sklearn import preprocessing

# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()

# Encode labels in column 'sex'. LabelEncoder assigns codes in sorted
# (alphabetical) order, so 'female' becomes 0 and 'male' becomes 1.
titanic_data['sex'] = label_encoder.fit_transform(titanic_data['sex'])
print(titanic_data.head())

# Select features and target variable
X = titanic_data[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare']]
y = titanic_data['survived']

# Split the dataset into training and test sets (e.g., 80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the logistic regression model
logistic_reg = LogisticRegression()
logistic_reg.fit(X_train, y_train)

# Make predictions on the test set
predictions = logistic_reg.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

# Generate classification report
print("Classification Report:")
print(classification_report(y_test, predictions))

# Compute ROC curve and AUC
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, logistic_reg.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()

Outputs:

Accuracy: 0.7902097902097902

ROC-AUC curve

Interpretation

Accuracy

Our accuracy score is 0.79 (or 79.02%), which means that the model correctly predicted approximately 79% of the instances in the test dataset.

Summary of classification report

This classification report evaluates a model's performance in predicting survival outcomes (survived or not) based on various passenger attributes.

Precision

For passengers who did not survive (class 0): The precision is 77%. When the model predicts a passenger didn't survive, it is accurate 77% of the time.
For passengers who survived (class 1): The precision is 84%. When the model predicts a passenger survived, it is accurate 84% of the time.

Recall

For passengers who did not survive (class 0): The recall is 90%. The model correctly identifies 90% of all actual non-survivors.
For passengers who survived (class 1): The recall is 65%. The model captures 65% of all actual survivors.

F1-score

For passengers who did not survive (class 0): The F1-score is 83%.
For passengers who survived (class 1): The F1-score is 73%.

There were 80 instances of passengers who did not survive and 63 instances of passengers who survived in the test set.

ROC Curve (Receiver Operating Characteristic)

The ROC curve shows the trade-off between sensitivity (recall) and specificity (1 - FPR) at various thresholds. A curve closer to the top-left corner represents better performance.

AUC (Area Under the Curve)

Definition: AUC represents the area under the ROC curve. It quantifies the model's ability to distinguish between the positive and negative classes. A higher AUC value (closer to 1.0) indicates better discrimination; the model has better predictive performance.

View the entire code here.

Logistic Regression in Machine Learning

{{gray_callout_start}} 🎯 Recommended: Accuracy vs. Precision vs. Recall in Machine Learning: What's the Difference? {{gray_callout_end}}

Logistic Regression: Key Takeaways

Logistic regression is a popular algorithm used for binary classification tasks.
It estimates the probability of an event occurring based on input variables.
It uses a sigmoid function to map a linear combination of the input variables to a probability between 0 and 1, which is then thresholded to produce a binary outcome.
Regularization can be applied to prevent overfitting and improve generalization.
Logistic regression can be interpreted using coefficients, odds ratios, and p-values.
Logistic regression is widely used in various fields, such as medicine, finance, and marketing, due to its simplicity and interpretability.
With imbalanced datasets, the decision threshold can be adjusted to compensate for the imbalance.
Logistic regression assumes a linear relationship between the input variables and the log odds of the outcome, which can be a limitation in cases where the relationship is non-linear.
Despite its limitations, logistic regression remains a powerful tool for understanding the relationship between input variables and the probability of an event occurring.

{{Active_CTA}}

November 27

8 min

What is Ensemble Learning?

Imagine you are watching a football match. The sports analysts provide you with detailed statistics and expert opinions. At the same time, you also take into account the opinions of fellow enthusiasts who may have witnessed previous matches. This approach helps overcome the limitations of relying solely on one model and increases overall accuracy. Similarly, in ensemble learning, combining multiple models or algorithms can improve prediction accuracy.  In both cases, the power of collective knowledge and multiple viewpoints is harnessed to make more informed and reliable predictions, overcoming the limitations of relying solely on one model. Let us take a deeper dive into what Ensemble Learning actually is. Ensemble learning is a machine learning technique that improves the performance of machine learning models by combining predictions from multiple models. By leveraging the strengths of diverse algorithms, ensemble methods aim to reduce both bias and variance, resulting in more reliable predictions. It also increases the model’s robustness to errors and uncertainties, especially in critical applications like healthcare or finance.  Ensemble learning techniques like bagging, boosting, and stacking enhance performance and reliability, making them valuable for teams that want to build reliable ML systems. Ensemble Learning This article highlights the benefits of ensemble learning for reducing bias and improving predictive model accuracy. It highlights techniques to identify and manage uncertainties, leading to more reliable risk assessments, and provides guidance on applying ensemble learning to predictive modeling tasks. Here, we will address the following topics: Brief overview  Ensemble learning techniques Benefits of ensemble learning Challenges and considerations Applications of ensemble learning Types of Ensemble Learning Ensemble learning differs from deep learning; the latter focuses on complex pattern recognition tasks through hierarchical feature learning. 
Ensemble techniques, such as bagging, boosting, stacking, and voting, address different aspects of model training to enhance prediction accuracy and robustness.  These techniques aim to reduce bias and variance in individual models and to improve prediction accuracy by learning from previous errors, ultimately leading to a consensus prediction that is often more reliable than any single model.  {{light_callout_start}}  The main challenge is not to obtain highly accurate base models but to obtain base models that make different kinds of errors. If ensembles are used for classification, high accuracies can be achieved if different base models misclassify different training examples, even if the base classifier accuracy is low. {{light_callout_end}}  Bagging: Bootstrap aggregating Bootstrap aggregation, or bagging, is a technique that improves prediction accuracy by combining predictions from multiple models. It involves creating random subsets of the data, training an individual model on each subset, and combining their predictions. For regression tasks, the predictions are typically averaged; for classification tasks, the majority vote is typically used. Bagging applies bootstrap sampling (sampling with replacement) to obtain the data subsets for training the base learners.  Random forest The Random Forest algorithm is a prime example of bagging. It creates an ensemble of decision trees, each trained on a bootstrap sample of the dataset. This ensemble effectively handles complex features and captures nuanced patterns, resulting in more reliable predictions. However, the interpretability of ensemble models may be compromised because multiple decision trees are combined. Ensemble models can provide more accurate predictions than individual decision trees, but understanding the reasoning behind each prediction becomes challenging. Bagging helps reduce overfitting by generating multiple subsets of the training data and training individual decision trees on each subset.  
It also helps reduce the impact of outliers or noisy data points by averaging the predictions of multiple decision trees. Ensemble Learning: Bagging & Boosting | Towards Data Science Boosting: Iterative learning Boosting is a technique in ensemble learning that converts a collection of weak learners into a strong one by focusing on the errors of previous iterations. The process involves incrementally increasing the weight of misclassified data points, so subsequent models focus more on difficult cases. The final model is created by combining these weak learners and prioritizing those that perform better.  Gradient boosting Gradient Boosting (GB) trains each model to minimize the errors of previous models by training each new model on the remaining errors. This iterative process effectively handles numerical and categorical data and can outperform other machine learning algorithms, making it versatile for various applications.  For example, you can apply Gradient Boosting in healthcare to predict disease likelihood accurately. Iteratively combining weak learners to build a strong learner can improve prediction accuracy, which could be valuable in providing insights for early intervention and personalized treatment plans based on demographic and medical factors such as age, gender, family history, and biomarkers. One potential challenge of gradient boosting in healthcare is its lack of interpretability. While it excels at accurately predicting disease likelihood, the complex nature of the algorithm makes it difficult to understand and interpret the underlying factors driving those predictions.  This can pose challenges for healthcare professionals who must explain the reasoning behind a particular prediction or treatment recommendation to patients. However, efforts are being made to develop techniques that enhance the interpretability of GB models in healthcare, ensuring transparency and trust in their use for decision-making. 
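The residual-fitting loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: it uses a single numeric feature, squared-error loss, decision stumps as the weak learners, and made-up toy data. The names `fit_stump`, `gradient_boost`, `n_rounds`, and `learning_rate` are chosen here for illustration only.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Weak learner: the one-threshold split that best fits the residuals."""
    best = None
    # Skip the largest x as a threshold so the right side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the remaining errors."""
    base = mean(ys)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return mean((y - model(x)) ** 2 for x, y in zip(xs, ys))

# Toy data (made up for illustration).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

baseline = gradient_boost(xs, ys, n_rounds=0)   # predicts the mean only
boosted = gradient_boost(xs, ys, n_rounds=20)
print(mse(baseline, xs, ys), mse(boosted, xs, ys))
```

Each round only has to correct what the previous rounds left behind, which is why the training error of the boosted model falls below that of the mean-only baseline. Real libraries add regularization, deeper trees, and other loss functions on top of this basic loop.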
{{light_callout_start}} Boosting is an ensemble method that seeks to change the training data to focus attention on examples that previous fit models on the training dataset have gotten wrong. {{light_callout_end}} Boosting in Machine Learning | Boosting and AdaBoost - GeeksforGeeks In the clinical literature, gradient boosting has been successfully used to predict, among other things, cardiovascular events, the development of sepsis, delirium, and hospital readmissions following lumbar laminectomy. Stacking: Meta-learning Stacking, or stacked generalization, is a model-ensembling technique that improves predictive performance by combining predictions from multiple models. It involves training a meta-model that takes the outputs of the base-level models as its inputs. The meta-model can be a linear regression, a neural network, a support vector machine, or any other algorithm, and its output is the final prediction. This technique leverages the collective knowledge of different models to generate more accurate and robust predictions. Overfitting occurs when a model becomes too closely fitted to the training data and performs poorly on new, unseen data. Stacking helps mitigate overfitting by combining multiple models with different strengths and weaknesses, thereby reducing the risk of relying too heavily on a single model’s biases or idiosyncrasies.  For example, in financial forecasting, stacking can combine models like regression, random forest, and gradient boosting to improve stock market predictions. This ensemble approach mitigates the biases of individual models and allows easy incorporation of new models or the removal of underperforming ones, enhancing prediction performance over time. 
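To make the meta-model idea concrete, here is a small sketch in plain Python. It is a simplified hold-out variant of stacking (sometimes called blending), not the full cross-validated procedure: two base regressors are trained on one half of the data, and a very small meta-model learns how to weight their predictions on the other half. The toy data, the two base models, and the single-weight meta-model are all assumptions made for illustration.

```python
from statistics import mean

def fit_mean_model(xs, ys):
    """Base model A: always predicts the mean of the training targets."""
    target_mean = mean(ys)
    return lambda x: target_mean

def fit_line_model(xs, ys):
    """Base model B: ordinary least-squares line through the training points."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return lambda x: y_bar + slope * (x - x_bar)

def fit_meta_weight(preds_a, preds_b, ys):
    """Meta-model: least-squares weight w for the blend w * a + (1 - w) * b."""
    numerator = sum((y - b) * (a - b) for a, b, y in zip(preds_a, preds_b, ys))
    denominator = sum((a - b) ** 2 for a, b in zip(preds_a, preds_b))
    return numerator / denominator if denominator else 0.5

def sse(preds, ys):
    """Sum of squared errors."""
    return sum((y - p) ** 2 for p, y in zip(preds, ys))

# Toy data (made up): roughly linear with a little noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 19.9]

# Even positions train the base models; odd positions are held out for the meta-model.
base_x, base_y = xs[0::2], ys[0::2]
meta_x, meta_y = xs[1::2], ys[1::2]

model_a = fit_mean_model(base_x, base_y)
model_b = fit_line_model(base_x, base_y)

# The meta-model sees only base-model predictions, never the raw feature.
preds_a = [model_a(x) for x in meta_x]
preds_b = [model_b(x) for x in meta_x]
w = fit_meta_weight(preds_a, preds_b, meta_y)

def stacked_predict(x):
    """Final prediction: the meta-model's blend of the two base models."""
    return w * model_a(x) + (1 - w) * model_b(x)

stacked_preds = [stacked_predict(x) for x in meta_x]
print(w, sse(stacked_preds, meta_y))
```

Training the meta-model on predictions for data the base models have not seen is the important design choice: it stops the meta-model from simply rewarding whichever base model memorized the training set best. A fair estimate of the stacked model's quality would need a further test set that neither level has seen.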
Voting Voting is a popular technique used in ensemble learning, where multiple models are combined to make predictions. Majority voting, or max voting, involves selecting the class label that receives the majority of votes from the individual models. On the other hand, weighted voting assigns different weights to each model's prediction and combines them to make a final decision. Both majority and weighted voting aggregate the predictions of multiple models through a voting mechanism, and the choice between them strongly influences the final decision. Random forests are a well-known example of an algorithm that uses voting. Gradient boosting is sometimes grouped with them, although strictly speaking it is an additive model that sums weighted contributions rather than counting votes. A random forest uses decision tree models trained on different data subsets, and a majority vote over the individual trees' predictions determines the final prediction.  For instance, in a random forest applied to credit scoring, each decision tree might decide whether an individual is a credit risk. The final credit risk classification is based on the majority vote of all trees in the forest. This process typically improves predictive performance by harnessing the collective decision-making power of multiple models. {{light_callout_start}} The application of either bagging or boosting requires the selection of a base learner algorithm first. For example, if one chooses a classification tree, then boosting and bagging would be a pool of trees with a size equal to the user’s preference. {{light_callout_end}} Benefits of Ensemble Learning Improved accuracy and stability Ensemble methods combine the strengths of individual models by leveraging their diverse perspectives on the data. Each model may excel in different aspects, such as capturing different patterns or handling specific types of noise. By combining their predictions through voting or weighted averaging, ensemble methods can improve overall accuracy by capturing a more comprehensive understanding of the data. 
This helps to mitigate the weaknesses and biases that may be present in any single model. Ensemble learning, which improves model accuracy in the classification model while lowering mean absolute error in the regression model, can make a stable model less prone to overfitting. Ensemble methods also have the advantage of handling large datasets efficiently, making them suitable for big data applications. Additionally, ensemble methods provide a way to incorporate diverse perspectives and expertise from multiple models, leading to more robust and reliable predictions. Robustness Ensemble learning enhances robustness by considering multiple models' opinions and making consensus-based predictions. This mitigates the impact of outliers or errors in a single model, ensuring more accurate results. Combining diverse models reduces the risk of biases or inaccuracies from individual models, enhancing the overall reliability and performance of the ensemble learning approach. However, combining multiple models can increase the computational complexity compared to using a single model. Furthermore, as ensemble models incorporate different algorithms or variations of the same algorithm, their interpretability may be somewhat compromised.  Reducing overfitting Ensemble learning reduces overfitting by using random data subsets for training each model. Bagging introduces randomness and diversity, improving generalization performance. Boosting assigns higher weights to difficult-to-classify instances, focusing on challenging cases and improving accuracy. Iteratively adjusting weights allows boosting to learn from mistakes and build models sequentially, resulting in a strong ensemble capable of handling complex data patterns. Both approaches help improve generalization performance and accuracy in ensemble learning. 
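The bagging half of this idea, training each model on its own random subset and letting the models vote, can be sketched with the standard library alone. This is a hedged illustration: the base learner is a deliberately simple 1-nearest-neighbour classifier on one feature, the data is made up, and `bagging_fit`, `bagging_predict`, `n_models`, and `seed` are names chosen here rather than any library's API.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (a bootstrap sample)."""
    return [rng.choice(data) for _ in data]

def nearest_neighbor(sample):
    """Base learner: 1-nearest-neighbour classifier on a single numeric feature."""
    def predict(x):
        _, label = min(sample, key=lambda item: abs(item[0] - x))
        return label
    return predict

def bagging_fit(data, n_models=25, seed=0):
    """Train one base learner per bootstrap sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [nearest_neighbor(bootstrap_sample(data, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Classification: return the label that most base learners voted for."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy (feature, label) pairs, made up for illustration.
data = [(0.5, "low"), (1.0, "low"), (1.5, "low"), (2.0, "low"),
        (8.0, "high"), (8.5, "high"), (9.0, "high"), (9.5, "high")]

models = bagging_fit(data)
print(bagging_predict(models, 0.0), bagging_predict(models, 10.0))
```

Because every base learner sees a slightly different sample, a single noisy point influences only some of them, and the vote smooths out its effect. For regression, the vote in `bagging_predict` would be replaced by an average.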
Benefits of using Ensemble Learning on Land Use Data Challenges and Considerations in Ensemble Learning  Model selection and weighting The main challenges here are selecting the right combination of models to include in the ensemble, determining the optimal weighting of each model's predictions, and managing the computational resources required to train and evaluate multiple models. Additionally, ensemble learning may not always improve performance if the individual models are too similar or if the training data has a high degree of noise. The diversity of the models (in terms of algorithms, feature processing, and data perspectives) is vital to covering a broader spectrum of data patterns. Optimal weighting of each model's contribution, often based on performance metrics, is crucial to harnessing their collective predictive power. Therefore, careful consideration and experimentation are necessary to achieve the desired results with ensemble learning. Computational complexity Ensemble learning, involving multiple algorithms and feature sets, requires more computational resources than individual models. While parallel processing offers a solution, orchestrating an ensemble of models across multiple processors can introduce complexity in both implementation and maintenance. Also, more computation might not always lead to better performance, especially if the ensemble is not set up correctly or if the models amplify each other's errors in noisy datasets. Diversity and overfitting Ensemble learning requires diverse models to avoid bias and enhance accuracy. By incorporating different algorithms, feature sets, and training data, an ensemble captures a wider range of patterns, which reduces the risk of overfitting and helps it make accurate predictions in different contexts. Strategies such as cross-validation help in evaluating the ensemble's consistency and reliability, ensuring the ensemble is robust against different data scenarios. 
Interpretability Ensemble learning models prioritize accuracy over interpretability, resulting in highly accurate predictions. However, this trade-off makes the ensemble model more challenging to interpret. Techniques like feature importance analysis and model introspection can shed light on the factors contributing to an ensemble's decisions, which eases the interpretability challenge, but they may not fully demystify the predictions of complex ensembles. Real-World Applications of Ensemble Learning Healthcare  Ensemble learning is utilized in healthcare for disease diagnosis and drug discovery. It combines predictions from multiple machine learning models trained on different features and algorithms, providing more accurate diagnoses. Ensemble methods also improve classification accuracy, especially in complex datasets or when models have complementary strengths and weaknesses. Ensemble classifiers like random forests are used in healthcare to achieve higher performance than individual models, enhancing the accuracy of these tasks. {{light_callout_start}} Here’s an article worth a read that discusses using AI & ML for detecting medical conditions. {{light_callout_end}}  Agriculture  Ensemble models combine multiple base models to reduce the impact of outliers and noise, resulting in more accurate predictions. This is particularly useful in sales forecasting, stock market analysis, and weather prediction. In agriculture, ensemble learning can be applied to crop yield prediction. By combining the predictions of multiple models trained on different environmental factors, such as temperature, rainfall, and soil quality, ensemble methods can provide more accurate forecasts of crop yields. Ensemble learning techniques, such as stacking and bagging, improve performance and reliability.  Take a peek at this wonderful article on Encord that shows how to accurately measure carbon content in forests and elevate carbon credits with Treeconomy. 
Insurance Insurance companies can also benefit from ensemble methods in assessing risk and determining premiums. By combining the predictions of multiple models trained on various factors such as demographics, historical data, and market trends, insurance companies can better understand potential risks and make more accurate predictions of claim probabilities. This can help them set appropriate premiums for their customers and ensure a fair and sustainable insurance business.  Remote sensing  Ensemble learning techniques, like isolation forests and SVM ensembles, detect data anomalies by comparing multiple models' outputs. They increase detection accuracy and reduce false positives, making them useful for identifying fraudulent transactions, network intrusions, or unexpected behavior. These methods can be applied in remote sensing by combining multiple models or algorithms, training on different data subsets, and combining predictions through majority voting or weighted averaging. One practical use of remote sensing can be seen in this article; it’s worth a read. Sports  Ensemble learning in sports involves using multiple predictive models or algorithms to make more accurate predictions and decisions in various aspects of the sports industry. Common ensemble methods include model stacking and weighted averaging, which improve the accuracy and effectiveness of recommendation systems. By combining predictions from different models, such as machine learning algorithms or statistical models, ensemble learning helps sports teams, coaches, and analysts gain a better understanding of player performance, game outcomes, and strategic decision-making. This approach can also be applied to other sports areas, such as injury prediction, talent scouting, and fan engagement strategies.  By the way, you may be surprised to hear that a sports analytics company found that their ML team was unable to iterate and create new features due to a slow internal annotation tool. 
As a result, the team turned to Encord, which allowed them to annotate quickly and create new ontologies. Read the full story here. {{light_callout_start}} Ensemble models' outcomes can easily be explained using explainable AI algorithms. Hence, ensemble learning is extensively used in applications where an explanation is necessary. {{light_callout_end}} Pseudocode for implementing ensemble learning models Pseudocode is a high-level and informal description of a computer program or algorithm that uses a mix of natural language and some programming language-like constructs. It's not tied to any specific programming language syntax. It is used to represent the logic or steps of an algorithm in a readable and understandable format, aiding in planning and designing algorithms before actual coding. How do you build an ensemble of models? Here is some pseudocode to show you how:

    Algorithm: Ensemble Learning with Majority Voting

    Input:
      - Training dataset (X_train, y_train)
      - Test dataset (X_test)
      - List of base models (models[])

    Output:
      - Ensemble predictions for the test dataset

    Procedure Ensemble_Learning:
        # Train individual base models
        for each model in models:
            model.fit(X_train, y_train)

        # Make predictions using individual models
        for each model in models:
            predictions[model] = model.predict(X_test)

        # Combine predictions using majority voting
        for each instance in X_test:
            for each model in models:
                combined_predictions[instance][model] = predictions[model][instance]
            # Determine the most frequent prediction among models for each instance
            ensemble_prediction[instance] = majority_vote(combined_predictions[instance])

        return ensemble_prediction

What does it do? It takes input of training data, test data, and a list of base models. The base models are trained on the training dataset. Predictions are made using each individual model on the test dataset. 
For each instance in the test data, the pseudocode uses a function majority_vote() (not explicitly defined here) to perform majority voting and determine the ensemble prediction based on the predictions of the base models. Here's an illustration with pseudocode on how to implement different ensemble models: Pseudo Code of Ensemble Learning Ensemble Learning: Key Takeaways  Ensemble learning is a powerful technique that combines the predictions of multiple models to improve the accuracy and performance of recommendation systems. It can overcome the limitations of single models by considering the diverse preferences and tastes of different users. Ensemble techniques like bagging, boosting, and stacking enhance prediction accuracy and robustness by combining multiple models. Bagging reduces overfitting by combining the predictions of models trained on different data subsets. Boosting trains weak models sequentially, giving more weight to misclassified instances. Lastly, stacking combines predictions from multiple models, using another model to make the final prediction. Combining multiple models reduces the impact of individual model errors and biases, leading to more reliable and consistent recommendations. 
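The majority-voting pseudocode in this article can be turned into runnable Python fairly directly. The sketch below is one possible reading of it, not a reference implementation: `ThresholdModel` is a hypothetical stand-in for any base model that offers `fit` and `predict`, the data is made up, and `majority_vote` is defined here because the pseudocode leaves it undefined.

```python
from collections import Counter

class ThresholdModel:
    """Hypothetical base model: predicts 1 when the feature exceeds a learned threshold."""

    def __init__(self, offset=0.0):
        self.offset = offset      # shifts the threshold so the base models disagree a little
        self.threshold = None

    def fit(self, X_train, y_train):
        positives = [x for x, y in zip(X_train, y_train) if y == 1]
        negatives = [x for x, y in zip(X_train, y_train) if y == 0]
        # Midpoint between the two classes, nudged by this model's offset.
        self.threshold = (min(positives) + max(negatives)) / 2 + self.offset
        return self

    def predict(self, X_test):
        return [1 if x > self.threshold else 0 for x in X_test]

def majority_vote(labels):
    """Return the most frequent label among the base-model predictions."""
    return Counter(labels).most_common(1)[0][0]

def ensemble_predict(models, X_train, y_train, X_test):
    # Train individual base models.
    for model in models:
        model.fit(X_train, y_train)
    # Make predictions using individual models.
    predictions = [model.predict(X_test) for model in models]
    # zip(*predictions) groups the votes per test instance.
    return [majority_vote(votes) for votes in zip(*predictions)]

# Toy data (made up for illustration).
X_train = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y_train = [0, 0, 0, 1, 1, 1]
models = [ThresholdModel(-1.0), ThresholdModel(0.0), ThresholdModel(1.0)]

result = ensemble_predict(models, X_train, y_train, [2.5, 4.0, 5.2, 7.5])
print(result)  # [0, 0, 1, 1]
```

The three models learn thresholds of 3.5, 4.5, and 5.5, so they disagree on the two middle test points and the vote settles each case. Using an odd number of models avoids ties in a two-class problem; with an even number, a tie-breaking rule would need to be chosen.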

November 24

8 min

Accuracy vs. Precision vs. Recall in Machine Learning: What is the Difference?

In Machine Learning, the efficacy of a model is not just about its ability to make predictions but also to make the right ones. Practitioners use evaluation metrics to understand how well a model performs its intended task. They serve as a compass in the complex landscape of model performance. Accuracy, precision, and recall are important metrics that describe the model's predictive capabilities. Accuracy is the measure of a model's overall correctness across all classes. It is the most intuitive metric: the proportion of true results in the total pool. True results include true positives and true negatives. Accuracy may be insufficient in situations with imbalanced classes or different error costs. Precision and recall address this gap. Precision measures how often predictions for the positive class are correct. Recall measures how well the model finds all positive instances in the dataset. To make informed decisions about improving and using a model, it's important to understand these metrics. This is especially true for binary classification. We may need to adapt these metrics to fully understand how well a model performs in multi-class problems. Understanding the difference between accuracy, precision, and recall is important in real-life situations. Each metric shows a different aspect of the model's performance. Classification Metrics Classification problems in machine learning revolve around categorizing data points into predefined classes or groups. For instance, determining whether an email is spam is a classic example of a binary classification problem. As the complexity of the data and the number of classes increases, so does the intricacy of the model. However, building a model is only half the battle. Key metrics like accuracy, precision, and recall from the confusion matrix are essential to assess its performance. Metrics provide insights into how well the model achieves its classification goals. 
They help identify areas for improvement and show whether the model aligns with the desired outcomes. Among these metrics, accuracy, precision, and recall are foundational. The Confusion Matrix The confusion matrix is important for evaluating classification models. It shows how well the model performs. With this visual representation, data scientists and machine learning practitioners can assess their models' accuracy and areas for improvement. Significance At its core, the confusion matrix is a table that compares the actual outcomes with the predicted outcomes of a classification model. It is pivotal in understanding the nuances of a model's performance, especially in scenarios where class imbalances exist or where the cost of different types of errors varies. Breaking down predictions into specific categories provides a granular view that supports more informed decisions when optimizing models. Elements of Confusion Matrix True Positive (TP): These are the instances where the model correctly predicted the positive class. For example, correctly identifying a fraudulent transaction as fraudulent. True Negative (TN): The model accurately predicted the negative class. Using the same example, it would be correctly identifying a legitimate transaction as legitimate. False Positive (FP): These are instances where the model incorrectly predicted the positive class. In our example, it would wrongly flag a legitimate transaction as fraudulent. False Negative (FN): This is when the model fails to identify the positive class, marking it as negative instead. In the context of our example, it would mean missing a fraudulent transaction and deeming it legitimate. Visual Representation and Interpretation The diagonal from the top-left to the bottom-right represents correct predictions (TP and TN), while the other diagonal represents incorrect predictions (FP and FN). You can analyze this matrix to calculate different performance metrics. 
These metrics include accuracy, precision, recall, and F1 score. Each metric gives you different information about the model's strengths and weaknesses. What is Accuracy in Machine Learning? Accuracy is a fundamental metric in classification, providing a straightforward measure of how well a model performs its intended task. Accuracy represents the ratio of correctly predicted instances to the total number of instances in the dataset. In simpler terms, it answers the question: "Out of all the predictions made, how many were correct?" Mathematical Formula Accuracy = (TP + TN) / (TP + TN + FP + FN) Where: TP = True Positives TN = True Negatives FP = False Positives FN = False Negatives Significance Accuracy is often the first metric to consider when evaluating classification models. It's easy to understand and provides a quick snapshot of the model's performance. For instance, if a model has an accuracy of 90%, it makes correct predictions for 90 of every 100 instances. However, while accuracy is valuable, it's essential to understand when to use it. In scenarios where the classes are relatively balanced, and the misclassification cost is the same for each class, accuracy can be a reliable metric. Limitations Accuracy can be misleading when classes are imbalanced, because a model can score well simply by predicting the majority class. Moreover, in real-world scenarios, the cost of different types of errors might vary. For instance, a false negative (failing to identify a disease) might have more severe consequences than a false positive in a medical diagnosis. Diving into Precision Precision is a pivotal metric in classification tasks, especially in scenarios with a high cost of false positives. It provides insights into the model's ability to correctly predict positive instances while minimizing the risk of false alarms. Precision, often referred to as the positive predictive value, quantifies the proportion of true positive predictions among all positive predictions made by the model. It answers the question: "Of all the instances predicted as positive, how many were actually positive?" 
Mathematical Formula Precision = TP / (TP + FP) Where: TP = True Positives FP = False Positives Significance Precision is important when false positives are costly.  In certain applications, the consequences of false positives can be severe, making precision an essential metric. For instance, in financial fraud detection, falsely flagging a legitimate transaction as fraudulent (a false positive) can lead to unnecessary investigations, customer dissatisfaction, and potential loss of business. Here, high precision ensures that most flagged transactions are indeed fraudulent, minimizing the number of false alarms. Limitations Precision looks only at the cases predicted as positive and ignores false negatives. As a result, a model can achieve high precision by making very few positive predictions, potentially missing out on many actual positive cases. This narrow focus can be misleading, especially when false negatives have significant consequences. Understanding Recall Recall, also known as sensitivity or true positive rate, is a crucial metric in classification that emphasizes the model's ability to identify all relevant instances. Recall measures the proportion of actual positive cases correctly identified by the model. It answers the question: "Of all the actual positive instances, how many were correctly predicted by the model?" Mathematical Formula Recall = TP / (TP + FN) Where: TP = True Positives FN = False Negatives Significance Recall is important in scenarios where false negatives are costly. Example: In a security system designed to detect potential threats, a high recall ensures that most threats are identified and addressed. While this might lead to some false alarms (false positives), the cost of missing a genuine threat (a false negative) could be catastrophic. The example emphasizes minimizing the risk of overlooking actual positive cases, even if it means accepting some false positives. 
This underscores the importance of recall in scenarios where the implications of false negatives are significant. Limitations The recall metric is about finding all positive cases, even at the cost of more false positives. A model may predict most instances as positive to achieve a high recall. This leads to many incorrect positive predictions, which can reduce the model's precision and result in unnecessary actions or interventions based on these false alarms. {{gray_callout_start}} 💡 Recommended: The 10 Computer Vision Quality Assurance Metrics Your Team Should be Tracking.  {{gray_callout_end}} The Balancing Act: Precision and Recall Precision and recall, two commonly used metrics in classification, often present a trade-off that requires careful consideration based on the specific application and its requirements. The Trade-off Between Precision and Recall There's an inherent trade-off between precision and recall. Improving precision often comes at the expense of recall and vice versa. For instance, a model that predicts only the most certain positive cases will have high precision but may miss out on many actual positive cases, leading to low recall. This balance is crucial in fraud detection, where missing a fraudulent transaction (low recall) is as critical as incorrectly flagging a legitimate one (low precision). Precision vs. Recall The Significance of the Precision-Recall Curve The precision-recall curve is a graphical representation that showcases the relationship between precision and recall for different threshold settings. It helps visualize the trade-off and select an optimal threshold that balances both metrics.  It is especially valuable for imbalanced datasets where one class is significantly underrepresented compared to others. In these scenarios, traditional metrics like accuracy can be misleading, as they might reflect the predominance of the majority class rather than the model's ability to identify the minority class correctly.  
The precision-recall curve measures how well the minority class is predicted. The measurement checks how accurately we make positive predictions and detect actual positives. The curve is an important tool for assessing model performance in imbalanced datasets. It helps choose an optimal threshold that balances precision and recall effectively. The closer this curve approaches the top-right corner of the graph, the more capable the model is at achieving high precision and recall simultaneously, indicating a robust performance in distinguishing between classes, regardless of their frequency in the dataset. Precision Recall Curve Importance of Setting the Right Threshold for Classification Adjusting the classification threshold directly impacts the shape and position of the precision-recall curve. A lower threshold typically increases recall but reduces precision, shifting the curve towards higher recall values. Conversely, a higher threshold improves precision at the expense of recall, moving the curve towards higher precision values.  The precision-recall curve shows how changing thresholds affect precision and recall balance. This helps us choose the best threshold for the application's specific needs. Precision vs. Recall: Which Metric Should You Choose? The choice between precision and recall often hinges on the specific application and the associated costs of errors. Both metrics offer unique insights, but their importance varies based on the problem. Scenarios Where Precision is More Important Than Recall Precision becomes paramount when the cost of false positives is high. For instance, consider an email marketing campaign. If a company has many email addresses and pays a high cost for each email, it is important to ensure that the recipients are likely to respond. High precision ensures that most emails are sent to potential customers, minimizing wasted resources on those unlikely to engage. 
Scenarios Where Recall is More Important Than Precision Recall takes precedence when the cost of missing a positive instance (false negatives) is substantial. A classic example is in healthcare, specifically in administering flu shots. If you don't give a flu shot to someone who needs it, it could have serious health consequences. By contrast, giving a flu shot to someone who doesn't need it has only a small cost. In such a scenario, healthcare providers might offer the flu shot to a broader audience, prioritizing recall over precision. Real-World Examples Illustrating the Choice Between Precision and Recall Consider a website that receives thousands of free registrations every week. The goal is to identify potential buyers among these registrants. While calling a non-buyer (false positive) isn't detrimental, missing out on a genuine buyer (false negative) could mean lost revenue. Here, high recall is desired, even if it compromises precision. In another scenario, imagine a store with 100 apples, of which 10 are bad. A method with a 20% recall might identify only 18 of the 90 good apples, but if a shopper only wants 5 apples, the missed opportunities (false negatives) are inconsequential. However, a higher recall becomes essential for the store aiming to sell as many apples as possible. Classification Metrics: Key Takeaways Evaluation Metrics: Accuracy, precision, and recall remain foundational in assessing a machine learning model's predictive capabilities. These metrics are especially relevant in binary and multi-class classification scenarios, often involving imbalanced datasets. Accuracy: Provides a straightforward measure of a model's overall correctness across all classes but can be misleading in imbalanced datasets, where one class (the majority class) might dominate. Precision vs. 
Recall: Precision, highlighting the true positives and minimizing false positives, contrasts with recall, which focuses on capturing all positive instances and minimizing false negatives. The choice depends on the application's specific needs and the cost of errors. Confusion Matrix: Categorizes predictions into True Positives, True Negatives, False Positives, and False Negatives, offering a detailed view of a model's performance. This is essential in evaluating classifiers and their effectiveness. Precision-Recall Curve: Showcases the relationship between precision and recall for different threshold settings, which is crucial for understanding the trade-off in a classifier's performance. Classification Threshold: Adjusting this threshold in a machine learning model can help balance precision and recall, directly impacting the true positive rate and precision score. Context is Key: The relevance of precision, recall, and accuracy varies based on the nature of the problem, such as in a regression task or when high precision is critical for the positive class. {{Active_CTA}}
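To make the threshold trade-off described above concrete, here is a minimal sketch in plain Python. The labels and scores are made-up illustrative data, and `precision_recall` is a hypothetical helper written for this example rather than a function from any particular library.

```python
def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold are predicted positive."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)  # true positives
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)  # false positives
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)   # false negatives
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative data only: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

for threshold in (0.25, 0.5, 0.75):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold raises recall and lowers precision, while raising it does the opposite. Sweeping many thresholds and plotting the resulting pairs would trace out a precision-recall curve.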

November 23

10 min

From Data to Diamonds: Unearth the True Value of Quality Data

Bridging the chasm between ‘Just AI’ and ‘Useful AI’ can be challenging, but it’s apparent that leveraging valuable data is crucial to doing so. As access to data increases, computer vision teams need to prioritize producing informative and reliable training data. One approach is to develop active learning pipelines. From data curation to annotation and beyond, this webinar will provide you with the tools to implement active learning pipelines and level up your computer vision models. Here are the key resources from the webinar: [Guide] How to curate your data [Case Study] How one customer improved per-class performance by 67%

November 17

60 min

Florence-2: Microsoft's New Foundation Model Explained

In the world of Artificial General Intelligence (AGI) systems, a significant shift is underway toward leveraging versatile, pre-trained representations that exhibit task-agnostic adaptability across diverse applications. This shift started in the field of natural language processing (NLP), and now it’s making its way into computer vision too. That’s where Florence-2 comes in: a vision foundation model designed to address the challenges of task diversity in computer vision and vision-language tasks. Background Artificial General Intelligence aims to create systems that can perform well across various tasks, much like how humans demonstrate diverse capabilities. Recent successes with versatile, pre-trained models in the field of NLP have inspired a similar approach in the realm of computer vision. While existing large vision models excel in transfer learning, they often struggle when faced with various tasks and simple instructions. The challenge lies in handling the spatial hierarchy and semantic granularity inherent in diverse vision-related tasks. Key challenges include the limited availability of comprehensive visual annotations and the absence of a unified pretraining framework with a single neural network architecture that seamlessly integrates spatial hierarchy and semantic granularity. Existing datasets tailored for specialized applications rely heavily on human labeling, which limits the development of foundational models capable of capturing the intricacies of vision-related tasks. {{light_callout_start}} Read the blog Visual Foundation Models (VFMs) Explained to know more about large vision models.{{light_callout_end}}  Florence-2: An Overview To tackle these challenges head-on, the Florence-2 model emerges as a universal backbone achieved through multitask learning with extensive visual annotations. 
This results in a unified, prompt-based representation for diverse vision tasks, effectively addressing the challenges of limited comprehensive training data and the absence of a unified architecture. Built by Microsoft, the Florence-2 model adopts a sequence-to-sequence architecture, integrating an image encoder and a multi-modality encoder-decoder. This design accommodates a spectrum of vision tasks without the need for task-specific architectural modifications, aligning with the ethos of the NLP community for versatile model development with a consistent underlying structure. Florence-2 stands out through its unprecedented zero-shot and fine-tuning capabilities, achieving new state-of-the-art results in tasks such as captioning, object detection, visual grounding, and referring expression comprehension. Even after fine-tuning with public human-annotated data, Florence-2 competes with larger specialist models, establishing new benchmarks.  {{Training_data_CTA::Fine-tune Visual Foundation Models for your specific use case}} Technical Deep Dive Carefully designed to overcome the limitations of traditional single-task frameworks, Florence-2 employs a sequence-to-sequence learning paradigm, integrating various tasks under a common language modeling objective. Florence-2’s model architecture. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks Let's dive into the key components that make up this innovative model architecture. Task Formulation  Florence-2 adopts a sequence-to-sequence framework to address a wide range of vision tasks in a unified manner. Each task is treated as a translation problem, where the model takes an input image and a task-specific prompt and generates the corresponding output response.  Tasks can involve either text or region information, and the model adapts its processing based on the nature of the task. 
For region-specific tasks, location tokens are introduced to the tokenizer's vocabulary list, accommodating various formats like box representation, quad box representation, and polygon representation. Vision Encoder The vision encoder plays a pivotal role in processing input images. To accomplish this, Florence-2 incorporates DaViT (Dual Attention Vision Transformer) as its vision encoder. DaViT transforms input images into flattened visual token embeddings, capturing both spatial and semantic information. The resulting visual token embeddings are concatenated with text embeddings for further processing. Multi-Modality Encoder-Decoder Transformer The heart of Florence-2 lies in its transformer-based multi-modal encoder-decoder. This architecture processes both visual and language token embeddings, enabling a seamless fusion of textual and visual information. The multi-modality encoder-decoder is instrumental in generating responses that reflect a comprehensive understanding of the input image and task prompt. Optimization Objective To train Florence-2 effectively, a standard language modeling objective is employed. Given the input (combined image and prompt) and the target output, the model utilizes cross-entropy loss for all tasks. This optimization objective ensures that the model learns to generate accurate responses across a spectrum of vision-related tasks. The Florence-2 architecture stands as a testament to the power of multi-task learning and the seamless integration of textual and visual information. Let’s discuss the multi-task learning setup briefly. Multi-Task Learning Setup Multitask learning is at the core of Florence-2's capabilities, necessitating large-scale, high-quality annotated data. To supply it, a data engine autonomously generates FLD-5B, a comprehensive visual dataset with 5.4 billion annotations for 126 million images. 
This engine employs an iterative strategy of automated image annotation and model refinement, moving away from traditional single and manual annotation approaches. The multitask learning approach incorporates three distinct learning objectives, each addressing a different level of granularity and semantic understanding:  Image-level Understanding Tasks: Florence-2 excels in comprehending the overall context of images through linguistic descriptions. Tasks include image classification, captioning, and visual question answering (VQA). Region/Pixel-level Recognition Tasks: The model facilitates detailed object and entity localization within images, capturing relationships between objects and their spatial context. This encompasses tasks like object detection, segmentation, and referring expression comprehension. Fine-Grained Visual-Semantic Alignment Tasks: Florence-2 addresses the intricate task of aligning fine-grained details between text and image. This involves locating image regions corresponding to text phrases, such as objects, attributes, or relations. By incorporating these learning objectives within a multitask framework, Florence-2 becomes adept at handling various spatial details, distinguishing levels of understanding, and achieving universal representation for vision tasks. {{light_callout_start}} Read the original research paper by Azure AI, Microsoft, authored by Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan available on Arxiv: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks {{light_callout_end}}  Performance and Evaluation Zero-Shot and Fine-Tuning Capabilities Florence-2 impresses with its zero-shot performance, excelling in diverse tasks without task-specific fine-tuning. For instance, Florence-2-L achieves a CIDEr score of 135.6 on COCO caption, surpassing models like Flamingo with 80 billion parameters. 
In fine-tuning, Florence-2 demonstrates efficiency and effectiveness. Its simple design outperforms models with specialized architectures in tasks like RefCOCO and TextVQA. Florence-2-L showcases competitive state-of-the-art performance across various tasks, emphasizing its versatile capabilities. Comparison with SOTA Models Florence-2-L stands out among vision models, delivering strong performance and efficiency. Compared to models like PolyFormer and UNINEXT, Florence-2-L excels in tasks like RefCOCO REC and RES, showcasing its generalization across task levels. In image-level tasks, Florence-2 achieves a CIDEr score of 140.0 on the COCO Caption Karpathy test split, outperforming models like Flamingo with more parameters. Downstream tasks, including object detection and segmentation, highlight Florence-2's superior pre-training. It maintains competitive performance even with frozen model stages, emphasizing its effectiveness. Florence-2's performance in semantic segmentation tasks on the ADE20k dataset also stands out, outperforming previous state-of-the-art models like the BEiT pre-trained model on ViT-B. Qualitative Evaluation and Visualization Results Florence-2 is qualitatively evaluated on the following tasks: Detailed Image Caption, Visual Grounding, Open Vocabulary Detection, OCR, and Region to Segmentation (figures from Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks). Comparison with SOTA LMMs Florence-2 is evaluated against other Large Multimodal Models (LMMs) like GPT-4V, LLaVA, and MiniGPT-4 on detailed caption tasks. 
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks Conclusion In conclusion, Florence-2 emerges as a groundbreaking vision foundation model, showcasing the immense potential of multi-task learning and the fusion of textual and visual information. It offers an efficient solution for various tasks without the need for extensive fine-tuning. The model's ability to handle tasks from image-level understanding to fine-grained visual-semantic alignment marks a significant stride towards a unified vision foundation. Florence-2's architecture, exemplifying the power of sequence-to-sequence learning, sets a new standard for comprehensive representation learning. Looking ahead, Florence-2 paves the way for the future of vision foundation models. Its success underscores the importance of considering diverse tasks and levels of granularity in training, promising more adaptable and robust machine learning models. As we navigate the evolving landscape of artificial intelligence, Florence-2's achievements open avenues for exploration, urging researchers to delve deeper into the realms of multi-task learning and cross-modal understanding. Read More Guide to Vision-Language Models (VLMs) MiniGPT-v2 Explained Top Multimodal Annotation Tools
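As a rough illustration of the location tokens mentioned in the Task Formulation section, the sketch below quantizes a pixel-space bounding box into discrete tokens that could sit alongside ordinary text tokens. The bin count of 1000, the `<loc_N>` token spelling, and the helper name `box_to_location_tokens` are assumptions made for this example, not details confirmed by the article.

```python
def box_to_location_tokens(box, image_width, image_height, num_bins=1000):
    """Quantize a pixel-space box (x0, y0, x1, y1) into discrete location tokens.

    num_bins and the <loc_N> token format are illustrative assumptions.
    """
    x0, y0, x1, y1 = box

    def quantize(value, size):
        # Map a coordinate in [0, size] onto bin indices 0..num_bins-1.
        bin_index = int(value * num_bins / size)
        return min(max(bin_index, 0), num_bins - 1)

    bins = [
        quantize(x0, image_width),
        quantize(y0, image_height),
        quantize(x1, image_width),
        quantize(y1, image_height),
    ]
    return "".join(f"<loc_{b}>" for b in bins)


# A box on a 640x480 image becomes four tokens, one per coordinate.
tokens = box_to_location_tokens((64, 48, 320, 240), image_width=640, image_height=480)
print(tokens)  # <loc_100><loc_100><loc_500><loc_500>
```

Because boxes become token sequences, region tasks such as detection can be trained with the same cross-entropy language modeling objective as captioning, which is the point of the unified sequence-to-sequence formulation.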

November 14

5 min

Product Updates [October 2023]

TLDR Workflows leaves Beta! Label snapshot versioning AI support is now in the platform - instant-access to support Encord Active improves Data Curation, Model Observability Introducing Encord Labs Encord @ RSNA - Come and see us at Booth #3772! All this and more! Read on to make your day. Workflows Leaves Beta We are thrilled to announce that our highly-anticipated feature, Workflows, has officially transitioned from beta to general availability! This milestone could not have been achieved without the invaluable feedback from our dedicated users throughout the beta phase. Workflows are designed to give you full control of the annotation process, ensuring a blend of high performance, usability, and extensibility that scales with the rapid pace and change of the AI industry. Some of the major improvements are: Performance: Handle larger projects efficiently (a tenfold increase from the previous benchmark), with significant speed enhancements across the platform. Usability: A new drag-and-drop UI simplifies workflow creation and the updated queue gives you full insight into the progress of your project. Extensibility: Advanced routing, better review functionality, and integration with Encord Active tailored to evolving AI demands. Editor Power Ups Workflow scalability means more tasks and more labels in your annotation projects. We're also juicing up the editor to be more performant -- which means more labels per task, faster. Backend improvements mean your data will save faster and more seamlessly, and we're introducing raised limits on labels per task to benefit from those improvements as well -- contact us to work with more data per task! Arriving soon are further performance improvements to enhance the user experience when dealing with many objects and complex label timelines. This all adds up to create a more natural pre-processing and editing experience, even on long, label intense, video annotation workloads. Exciting! 
AI Support We understand that searching our documentation isn’t always your first thought when you need to learn about the platform. To address this, we've integrated AI support directly into our platform, ensuring you have quick access to the assistance you need, precisely when needed.  Whether you're onboarding for the first time, looking for a quick refresher on using the Label Editor, or need help understanding terminology, our AI assistant is here to help. It is regularly trained on all our platforms & SDK documentation, enabling it to provide intelligent and up-to-date responses to any questions you may have about our application! Active Improves Data Curation and Model Evaluation We know that curating the best images, frames from a video, or slices from a scan is a daunting, difficult, and time-intensive task. First, ensuring that your dataset is free of outliers, duplicates, and irrelevant images, and second, selecting the best samples is crucial for building robust and performant models. Encord is your trusted partner along your journey and based on your feedback we have designed Active's new Explorer to simplify this process, incorporating best practices into intuitive user journeys: Automated data quality checks: Active automatically identifies potential issues in your datasets, such as duplicates, blurry images, or corrupted frames. By filtering out these problematic frames, you can reduce annotation costs and prevent detrimental effects on your model's performance. Intelligent curation: Use Active to curate a balanced and diverse dataset. Whether you're establishing a dataset for an initial model run or curating targeted data for critical edge cases or blind spots, Active has a tailored workflow ready for you. After your data is annotated and your model is trained, Encord Active simplifies the shift to evaluation. 
Simply import your model predictions and access a detailed analysis of your model’s performance, with options to break it down by class, data collections, and splits such as train, test, and validation. You can also use the Explorer to investigate your prediction types following a series of best-practice workflows: Prediction inspection: Use the Explorer to delve into the types of model predictions – True Positives (TP), False Positives (FP), and False Negatives (FN), to understand your model's accuracy and behavior. Spot and address blind spots: When an edge case or a blind spot is detected, Active's similarity search allows you to surface and curate additional samples from your unlabeled data pool that resemble these critical cases. Continuous improvement cycle: Integrate these new samples into your annotation workflow, retrain your model, and directly compare performance improvements against previously identified edge cases. Label Snapshot Versioning Labeling training data is, like the model training process it supports, an iterative process. You’ve asked for ways to snapshot your progress — whether it’s to save a checkpoint before re-labeling, check-in progress as you work through a large project, or name different subsets for purposes such as training, testing, and validation. We’ve listened, and are happy to introduce label versioning for workflow projects. Navigate to the labels tab, select your tasks, and press ‘Save new version’ — you can refer to these snapshots by name and time. Initially, we’re rolling out support for exporting labels from saved checkpoints, but look out for coming improvements such as restoring to different projects. As always, let us know how it helps and what more we can do to enhance your AI initiatives with labels and label set management tools!  
Opt-in to Beta Features Faster with Encord Labs Many of you have shown interest in working closely with our product development team and helping us create the best features — as such, we’re very happy to be introducing Encord Labs! Encord Labs will give you access to features at the bleeding edge, but give you control over how features appear in the platform. This means you will get all the power of rapidly evolving technology with none of the risks. Getting in on the ground floor means you can shape how features evolve faster, helping us ensure we build with tight customer feedback in mind. Encord Labs will be rolling out several select features in Q4 — contact us if you’re interested or would like to join our collaborative early tester program! Thanks for reading, feel free to email product@encord.com with any questions or suggestions, and let us know if you're attending RSNA 2023!

November 10

5 min

Data Clustering: Intro, Methods, Applications

Data clustering involves grouping data based on inherent similarities without predefined categories. The main benefits of data clustering include simplifying complex data, revealing hidden structures, and aiding in decision-making. Let’s understand more with the help of an example. It might seem intuitive that data clustering means clustering data into different groups. But why do we need this concept of data clustering? Data analysis using data clustering is a particularly interesting approach where you look at the entities or items by their general notion and not by their value. For example, over-the-top platforms like Netflix group movies and web series into categories such as “thriller,” “animation,” “documentaries,” “drama,” and so on for ease of user recommendation and access. Consider a problem where a retail company wants to segment its customer base for targeted marketing campaigns. They can analyze the buying patterns of the customers to create tailored discounts. If one customer is a frequent buyer of high-end clothing and the other likes to purchase electronics, then the company can provide special offers on clothing for the first customer and discounts on electronics for the second customer. This can result in increased sales and greater customer satisfaction. If you like to watch thriller movies, instead of searching for the next one yourself, the platform can easily suggest other movies with the same genre. This creates a win-win situation for the user and the platform.  In this article, we will discuss three major types of data clustering techniques - partition-based, hierarchical-based, and density-based along with some of their real-world applications across industries such as anomaly detection, healthcare, retail, image segmentation, data mining, and other applications. {{Training_data_CTA}} What is Data Clustering? 
In machine learning, tasks fall into two main categories: supervised learning, where data comes with explicit labels, and unsupervised learning, where data lacks these labels. Data clustering is a technique for analyzing unsupervised machine learning problems to find hidden patterns and traits within the data. It's a powerful method for pattern recognition that provides useful insights about the data that may not be evident from inspecting the raw data.  At the end of the clustering process, the dataset is segmented into different clusters. Each cluster contains data points with similar characteristics, while data points in different clusters are distinctly different from one another. K-Means Clustering Types of Data Clustering Techniques There are three main data clustering methods:  Partitioning clustering Hierarchical clustering Density clustering Partitioning Clustering In partitioning clustering, each data point belongs to only one cluster. You must specify the number of clusters in advance. Common applications include image compression, document categorization, or customer segmentation. The K-means algorithm is one commonly used partition-based clustering algorithm in data science and machine learning. The main strength of this technique is that it is simple and efficient, and its results are easy to deploy in real-world applications. Hierarchical Clustering Hierarchical clustering builds a tree-like structure of clusters within the dataset. In this tree-like structure, represented by a dendrogram, each node represents a cluster of data points. This representation does not require predefining the number of clusters, as opposed to partition-based clustering, making it more versatile to implement and extract insights from.  By providing a multi-resolution view, i.e., a view of the same data at several levels of detail, this technique makes it easier to explore and understand the links between the smaller clusters at different levels of granularity. 
Additionally, by cutting the dendrogram at a desired height, you can extract clusters at different levels. Common applications include genetic clustering, document clustering, or image processing for image segmentation. Because it repeatedly compares clusters pairwise, this method can get very computationally intensive for large datasets with high dimensionality, as the time and memory requirements increase significantly.  Density-based Clustering The density-based clustering approach identifies clusters based on the density of data points in a feature space. The feature space is the space spanned by the features or attributes used to describe the data points.  Clusters are dense regions of data points separated from one another by sparser regions. The clusters can be of any arbitrary shape and not just standard spherical or elliptical shapes, making this technique robust to noise in the data and suitable for high-dimensional datasets. DBSCAN is a notable density-based algorithm. It is popular among applications such as Geographical Information Systems (GIS) for providing location-based services by clustering GPS data and for intrusion detection by detecting cyber threats based on anomalies in network traffic data. Data Clustering Algorithms We will deep dive into three popular data clustering algorithms: K-means, hierarchical clustering, and DBSCAN, each of which falls under one of the three categories you learned above. K-means Clustering The k-means clustering algorithm aims to maximize the inter-cluster variance and minimize the intra-cluster variance. This ensures that similar points are closer within the same cluster, whereas dissimilar points in different clusters remain farther apart. Steps for K-means clustering: First, the number of clusters is pre-defined with a hyperparameter ‘k’, and each cluster is assigned an initial center at random.  
Next, it assigns each data point to the cluster whose centroid is nearest. The distance is usually the Euclidean distance. It then updates each cluster centroid to the mean of all the data points within the cluster. It repeats steps 2 and 3 until a certain stopping criterion is met. The algorithm halts when the clusters stop changing, i.e., all points belong to those clusters whose centers are closest to them, or after a set number of iterations.  Finally, when the algorithm converges, each data point ultimately belongs to its closest cluster.  Although this algorithm seems pretty straightforward, certain aspects need to be carefully considered so that it does not converge to a suboptimal solution. Choose the number of clusters with a systematic technique (such as the elbow method), and initialize the cluster centers carefully rather than purely at random. Alternatively, run the algorithm multiple times with different initializations to avoid bias towards any single one. Additionally, K-means assumes by default that cluster shapes are spherical and have equal size, which might not always be suitable.  Hierarchical Clustering Hierarchical clustering provides a multi-level view of data clusters. As discussed previously, this method does not require pre-specifying the number of clusters. There are two approaches to using this algorithm:  Agglomerative clustering (bottom-up approach) Divisive clustering (top-down approach)    Hierarchical Clustering Agglomerative clustering initializes each data point as a cluster at the beginning. Next, the pairwise distance between the clusters is computed to check their similarity using linkage criteria such as single linkage, average linkage, or complete linkage.  Based on the distance determined by the criteria, the two nearest clusters are merged, and this repeats until only one cluster remains. During the cluster merging process, a dendrogram is created that captures the hierarchical relationships of clusters. 
The desired number of clusters is obtained by cutting the dendrogram at a certain height, considering that clusters on the top are more general than the bottom ones, which are more specific. Divisive clustering: all the data points start in one cluster, unlike agglomerative clustering, where each data point starts as its own cluster. Pairwise dissimilarities are calculated, and the most dissimilar cluster is split into two clusters at each step. Finally, a dendrogram with a top-down view is created, which can be cut at a certain height based on requirements.  DBSCAN Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-centric clustering algorithm that identifies clusters of arbitrary shapes. Unlike centroid-based clustering methods, DBSCAN looks for regions where data points are densely packed and separates them from sparser regions or noise. Here's a step-by-step breakdown of what the algorithm does: Core points: The algorithm identifies core points (which play a role loosely similar to a cluster centroid) by looking at the neighboring data points. A data point becomes a core point if at least ‘z’ points lie within a radius ‘r’ of it. Density-reachable points: All the points present within a radius ‘r’ from the core data point. Upon repeating the selection process for all data points, clusters of density-reachable points from core points will emerge. Border points: These points aren't dense enough to be core points but belong to a cluster, typically found at the cluster's edges. Noise points: Points that aren't core or border points are treated as noise. They're outliers, typically residing in low-density regions. DBSCAN Clustering This algorithm is beneficial for obtaining clusters of varying densities with no specific shape or size. The final results depend greatly on the choice of hyperparameters, such as the radius ‘r’ and the minimum number of points ‘z’. Optimally tuning them is essential. 
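The core, border, and noise point definitions above can be sketched in a few lines of plain Python. This is a simplified illustration rather than a production implementation: it uses a brute-force neighbor search, the names `dbscan` and `region_query` are hypothetical, and it assumes that the minimum count ‘z’ includes the point itself.

```python
import math

NOISE = -1


def region_query(points, index, r):
    """Indices of all points within distance r of points[index] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= r]


def dbscan(points, r, z):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, r)
        if len(neighbors) < z:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        labels[i] = cluster_id  # i is a core point, so it starts a new cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, r)
            if len(j_neighbors) >= z:
                queue.extend(j_neighbors)  # j is also core, so keep expanding
        cluster_id += 1
    return labels


# Two dense squares and one isolated point (illustrative data only).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, r=1.5, z=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Changing `r` or `z` on the same data changes which points count as core points, which is why the results are so sensitive to these two hyperparameters.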
Overall, it is an excellent technique for data exploration and analysis, especially for real-world applications where clusters are defined by density. Real-World Applications of Data Clustering Biomedical Domain  Clustering algorithms play a crucial role in patient analysis and advancements in medicine. One important example is gene expression analysis for cancer subtype classification. For breast cancer subtypes, for instance, clustering enables the grouping of patients by similar gene expression, leading to targeted therapies and facilitating biomarker discovery. {{medical_CTA}} Social Network Analysis Clustering algorithms identify online user communities for targeted advertising campaigns through social network analysis. By categorizing users as "travel enthusiasts," "techies," and the like, advertising content can be tailored to specific clusters, increasing click-through rates. Customer and Market Segmentation  In e-commerce, an online retailer aiming to enhance personalization can use clustering techniques to categorize customers into “occasional buyers” or “frequent buyers” based on previous purchases and browsing history. Several benefits are associated with this segmentation, including exclusive offers for specific groups or personalized product recommendations. Using these algorithms for customer segmentation creates a win-win situation for customers and retailers. Customers get reasonable recommendations tailored to their preferences, whereas retailers get more orders, an increased repurchase rate, and, ultimately, customer satisfaction.  Recommendation Engine  Streaming platforms like Amazon Prime and Netflix use clustering algorithms to group users with similar viewing habits and preferences, such as “action movie enthusiasts” or “animation lovers,” to recommend content and increase user engagement.  Image Segmentation Image segmentation tasks are prominent in medical imaging. Such tasks require clustering algorithms for problem analysis. 
Given some MRI brain scans, you can apply density-based clustering techniques to group pixels corresponding to different tissue types, such as gray matter, white matter, etc. This can aid radiologists in detecting and precisely locating abnormalities such as tumors or lesions.  In summary, clustering algorithms not only assist in the procedure of medical diagnosis but also save a lot of time and effort to detect anomalies within complex images manually,  ultimately providing improved healthcare services for patients.  Data Clustering: Key takeaways Data clustering algorithms are an essential tool to understand and derive actionable insights from the plethora of data available on the web. There are mainly three types of clustering: partitioning clustering, hierarchical clustering, and density clustering. The K-means algorithm, a partition-based technique,  requires defining the number of clusters beforehand, and each data point is ultimately assigned to one cluster.  Hierarchical clustering, represented using a dendrogram, offers two methods: agglomerative (bottom-up) and divisive (top-down), providing a detailed view of clusters at various levels. Density-based clustering algorithms such as DBSCAN focus on data point density to create clusters of any arbitrary shape and size.  There are various real-world applications for clustering, ranging from recommendation engines and biomedical engineering to social network analysis and image segmentation.  {{Training_data_CTA}}
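For readers who want to see the K-means steps from this article in code, here is a minimal sketch in plain Python. It assumes random initialization from the data points and Euclidean distance; the function name `kmeans` and its parameters are illustrative, not taken from a specific library.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 1: pick k initial cluster centers at random from the data.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments


# Two well-separated groups (illustrative data only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

On data this well separated, the first three points end up in one cluster and the last three in the other, although which cluster is numbered 0 depends on the random initialization. On harder data, different seeds can give different results, which is why running the algorithm several times is a common precaution.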

November 8

10 min


Mastering Supervised Learning: A Comprehensive Guide

Artificial Intelligence (AI) has witnessed remarkable advancements in recent years, revolutionizing industries and reshaping how we interact with technology. At the core of these developments lies supervised learning, a fundamental concept in machine learning.

In this comprehensive guide, we will delve deep into the world of supervised learning, exploring its significance, its processes, and facets such as training a model on labeled data, the relationship between input features and output labels, generalizing knowledge, and making accurate predictions.

By the end of this article, you'll have a firm grasp of what supervised learning is and how it can be applied to solve real-world problems.

Definition and Brief Explanation of Supervised Learning

Supervised learning is a type of machine learning where algorithms learn from labeled data to make predictions. In simpler terms, it's like teaching a machine to recognize patterns or relationships in data based on the examples you provide. These examples, also known as training data, consist of input features and their corresponding target labels. The objective is to build a model that learns from this training data and makes accurate predictions or classifications on new, unseen data.

Figure: Supervised Learning

In machine learning, four main learning paradigms are commonly recognized: supervised, self-supervised, unsupervised, and reinforcement learning. As opposed to supervised learning, unsupervised learning deals with unlabeled data within a dataset; self-supervised learning is where the model derives its own training signal from the data, without human-provided labels; and in reinforcement learning, an agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or punishments.

{{light_callout_start}} Interested in learning more about self-supervised learning (SSL) and how it compares to supervised and unsupervised learning?
Read our explainer article, “Self-supervised Learning Explained.” {{light_callout_end}} Importance and Relevance of Supervised Learning in AI Supervised learning is the foundation of many AI applications that impact our daily lives, from spam email detection to recommendation systems on streaming platforms. From medical diagnosis to autonomous driving, supervised learning plays a pivotal role. Its ability to learn from historical data and make predictions makes it versatile for progress in AI. As AI continues to evolve, supervised learning remains an indispensable part. It powers applications in natural language processing, computer vision, and speech recognition, making it vital for developing intelligent systems. Understanding how supervised learning works is essential for anyone interested in AI and machine learning. Overview  This article can prove to be a beginner’s guide to supervised learning, and here we will take a structured approach to understanding supervised learning: What is Supervised Learning: We'll start by breaking down the basic concept of supervised learning and examining the critical components involved. Types of Supervised Learning Algorithms: We will explore the different supervised learning algorithms and their characteristics, including classification and regression. You’ll learn examples of popular algorithms within each category. Data Preparation for Supervised Learning: Labeled data is the lifeblood of supervised learning, and we'll discuss the essential steps involved in preparing and cleaning data. We will also explain feature engineering, a crucial aspect of data preparation. Model Evaluation and Validation: Once a model is trained, it must be evaluated and validated to ensure its accuracy and reliability. We'll delve into various evaluation metrics and techniques used in this phase. 
Challenges and Future Directions: We'll discuss some of the difficulties in supervised learning and glimpse into the future, considering emerging trends and areas of research.
Key Takeaways: Finally, we’ll quickly go through the main ingredients of the whole recipe for supervised learning.

Now, let's embark on our journey to understand supervised learning.

What is Supervised Learning?

Supervised learning is a type of machine learning where an algorithm learns from labeled datasets to make predictions or decisions. It involves training a model on a dataset that contains input features and corresponding output labels, allowing the model to learn the relationship between the inputs and outputs.

Basic Concept

Supervised learning operates under the assumption that there is a relationship or pattern hidden within the data that the model can learn and then apply to new, unseen data. In this context, "supervised" refers to providing guidance or supervision to the algorithm. Think of it as a teacher guiding a student through a textbook. The teacher knows the correct answers (the target labels), and the student learns by comparing their answers (predictions) to the teacher's.

Main Components: Input Features and Target Labels

In supervised learning, labeled data is used to train a model, where each data point is associated with a corresponding target or output value. The model learns from this labeled data to make predictions or classify new, unseen data accurately. Additionally, supervised learning requires the selection of an appropriate algorithm and the evaluation of the model's performance using metrics such as accuracy or precision.

To understand supervised learning fully, it's crucial to grasp the two main components: input features and target labels.

Input Features: These are the variables or attributes that describe the data.
For instance, in a spam email detection system, the input features might include the sender's email address, subject line, and the content of the email. The algorithm uses these features to make predictions. Target Labels: Target labels are the values we want the algorithm to predict or classify. In the case of spam email detection, the target labels would be binary: “spam” (1) or “not spam” (0). These labels are provided as part of the training data. {{gray_callout_start}} ⚡Learn more:  The Full Guide to Training Datasets for Machine Learning. {{gray_callout_end}} Training a Supervised Learning Model  Training a supervised learning model involves iteratively adjusting its parameters to minimize the difference between its predictions and the target values in the labeled data. This process is commonly known as optimization. During training, the model learns the underlying patterns and relationships in the data, allowing it to generalize and make accurate predictions on unseen data. However, it is important to note that the performance of a supervised learning model depends on the quality and representativeness of the labeled data used for training.  Supervised Learning Flowchart Supervised Learning.drawio - draw.io (diagrams.net) Training a supervised learning model involves several key steps: Data Collection: The first step is to gather labeled data, which typically consists of input features and their corresponding target labels. This data should be representative of the problem you want to solve. Data curation: The process of cleaning and organizing the collected data to ensure its quality and reliability. This step involves removing any outliers or inconsistencies, handling missing values, and transforming the data into a suitable format for training the model. Data Splitting: The collected data is usually divided into two subsets: the training dataset and the test data. 
The model is trained on the training dataset, while the test data is reserved for evaluating its performance.
Model Selection: Depending on the problem at hand, you choose an appropriate supervised learning algorithm. For example, if you're working on a classification task, you might opt for algorithms like logistic regression, support vector machines, or decision trees.
Training the Model: This step involves feeding the training data into the chosen algorithm, allowing the model to learn the patterns and relationships in the data. Training iteratively adjusts the model's parameters to minimize prediction errors.
Model Evaluation: After training, you evaluate the model's performance using the test set. Standard evaluation metrics include accuracy, precision, recall, and F1-score.
Fine-tuning: If the model's performance is unsatisfactory, you may need to fine-tune its hyperparameters or consider more advanced algorithms. This step is crucial for improving the model's accuracy.
Deployment: Once you're satisfied with the model's performance, you can deploy it to make predictions on new, unseen data in real-world applications.

Now that we've covered the fundamentals of supervised learning, let's explore the different types of supervised learning algorithms.

{{Training_data_CTA}}

Types of Supervised Learning Algorithms

Supervised learning algorithms include linear regression, logistic regression, decision trees, random forests, support vector machines, and neural networks. Each algorithm has its strengths and weaknesses, and the choice of algorithm depends on the specific problem and data at hand. It is also important to consider factors such as interpretability, computational efficiency, and scalability when selecting a supervised learning algorithm. Additionally, ensemble methods such as bagging and boosting can combine multiple models to improve prediction accuracy.
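The split, train, and evaluate steps of the workflow described earlier can be sketched in a few lines of plain Python. This is a minimal illustration rather than a recommended implementation: the toy dataset, the nearest-centroid classifier, and all function names (`make_dataset`, `train_test_split`, `fit_centroids`, `predict`, `accuracy`) are assumptions introduced here, not something prescribed by the article.

```python
import random

def make_dataset():
    """Hypothetical toy data: two well-separated 2-D blobs labeled 0 and 1."""
    rng = random.Random(0)
    data = []
    for _ in range(20):
        data.append(((rng.uniform(0, 1), rng.uniform(0, 1)), 0))
        data.append(((rng.uniform(5, 6), rng.uniform(5, 6)), 1))
    return data

def train_test_split(data, test_fraction=0.2, seed=0):
    """Data splitting: shuffle a copy, then hold out a fraction for testing."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_centroids(train):
    """Training: compute the mean point (centroid) of each class."""
    sums, counts = {}, {}
    for (x, y), label in train:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {lab: (sx / counts[lab], sy / counts[lab]) for lab, (sx, sy) in sums.items()}

def predict(centroids, point):
    """Prediction: return the label of the closest class centroid."""
    px, py = point
    return min(
        centroids,
        key=lambda lab: (px - centroids[lab][0]) ** 2 + (py - centroids[lab][1]) ** 2,
    )

def accuracy(centroids, dataset):
    """Evaluation: fraction of points whose predicted label matches the true label."""
    correct = sum(1 for point, label in dataset if predict(centroids, point) == label)
    return correct / len(dataset)

train_set, test_set = train_test_split(make_dataset())
model = fit_centroids(train_set)
test_accuracy = accuracy(model, test_set)
print(f"train size: {len(train_set)}, test size: {len(test_set)}")
print(f"test accuracy: {test_accuracy:.2f}")
```

Because the two toy classes are far apart, the held-out accuracy is perfect here; on real data the same structure applies, but the test score is what tells you whether the model generalizes.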
Supervised learning can be categorized into two main types:  Classification  Regression Each type has its own characteristics and is suited to specific use cases. Supervised Learning Algorithms Classification Classification is a type of supervised learning where the goal is to assign data points to predefined categories or classes. In classification tasks, the target labels are discrete and represent different classes or groups. Naive Bayes is a classification algorithm commonly used in supervised learning. It is particularly useful for solving classification problems, spam email detection, and sentiment analysis, where it learns the probability of different classes based on the input features.  Here are some key points about classification: Binary Classification: In binary classification, there are only two possible classes, such as spam or not spam, fraud or not fraud, and so on. Multiclass Classification: Multiclass classification involves more than two classes. For example, classifying emails as spam, promotional, social, and primary. Examples of Classification Algorithms: Popular classification algorithms include logistic regression, support vector machines, decision trees, random forests, and neural networks. Use Cases: Classification is used in various applications, including sentiment analysis, image recognition, fraud detection, document categorization, and disease diagnosis. Regression Regression, on the other hand, is a type of supervised learning where the goal is to predict continuous values or numerical quantities. In regression tasks, the target labels are real numbers, and the model learns to map input features to a continuous output. Here are some key points about regression: Examples of Regression Algorithms: Common regression algorithms include linear regression, polynomial regression, ridge regression, and support vector regression. 
Use Cases: Regression is applied in scenarios like stock price prediction, real estate price estimation, and weather forecasting, where the goal is to make numerical predictions.

Examples of Popular Algorithms within Each Category

Logistic Regression (Classification): Despite its name, logistic regression is used for classification, most commonly binary classification. It models the probability of a data point belonging to one of the two classes, making it a fundamental algorithm in classification tasks.
Decision Trees (Classification and Regression): Decision trees can be used for both classification and regression tasks. They break down a dataset into smaller subsets based on input features and create a tree-like structure to make predictions.
Linear Regression (Regression): Linear regression is a simple yet powerful algorithm for regression tasks. It assumes a linear relationship between the input features and the target variable and tries to fit a straight line to the data.
Random Forests (Classification and Regression): Random forests are an ensemble method that combines multiple decision trees to improve accuracy. They can be used for classification and regression problems and are known for their robustness.

K-Nearest Neighbors (KNN) is another supervised algorithm used for both classification and regression: it predicts the label of a new point from the labels of its closest training examples, enabling applications like spam email detection and sales forecasting. It should not be confused with K-Means, which is an unsupervised clustering algorithm that works on unlabeled data. Another algorithm that is used for both regression and classification problems is Support Vector Machines (SVM). SVM aims to create the best line or decision boundary to segregate n-dimensional space into classes.

Now that we've explored the types of supervised learning algorithms, let's move on to another stage of the workflow: data preparation.

Data Preprocessing for Supervised Learning

Data preprocessing is an essential step in supervised learning.
It involves cleaning and transforming raw data into a format suitable for training a model. Common techniques used in data preprocessing include handling missing values, encoding categorical variables, and scaling numerical features. Additionally, you can perform feature selection or extraction to reduce the dimensionality of the dataset, which often improves model performance.

Figure: Data Preprocessing in Machine Learning

Data Cleaning

Data cleaning is a crucial part of data preprocessing. It involves removing or correcting any errors, inconsistencies, or outliers in the dataset. Data cleaning techniques include removing duplicate entries, correcting typos or spelling errors, and handling noisy or irrelevant data.

Missing Data is a common issue in datasets. It can be addressed by deleting rows with missing values, imputing values, or using advanced imputation methods; the most appropriate method depends on the dataset and research objectives.

Noisy Data contains errors or inconsistencies from measurement, data entry, or transmission. It can be addressed through techniques like smoothing, filtering, and outlier detection and removal.

{{grey_callout_start}} Data cleaning, also known as data cleansing, is one part of data preprocessing. Learn more about data cleaning and preprocessing through our detailed guide. {{grey_callout_end}}

Data Transformation

Data transformation converts the data into a different form or scale, for example through logarithmic or exponential transformations, to make it more suitable for analysis; this can also dampen the effect of noisy values. Another approach is to impute missing values using statistical methods, which can help fill in gaps in the data and reduce the impact of missing information on the analysis.
Normalization standardizes the data range, allowing fair comparisons between variables measured in different units or scales and preventing features with large numeric ranges from dominating the analysis.

Attribute Selection picks the most relevant and informative attributes from a dataset, reducing dimensionality, improving efficiency, helping to avoid overfitting, and enhancing interpretability.

Discretization converts continuous variables into discrete categories or intervals, simplifying the analysis process and making results easier to interpret.

Concept Hierarchy Generation sorts data into hierarchical structures based on connections and similarities. This helps us understand both discrete and continuous variables better. It also makes it easier to interpret data and make decisions.

Data Reduction

Data reduction is a crucial technique in data analysis. It reduces dataset complexity by transforming variables and removing redundant or irrelevant ones, which simplifies the analysis process and improves computational efficiency.

Data Cube Aggregation summarizes data across multiple dimensions, providing a higher-level view for analysis. This technique aids quick and efficient decision-making when analyzing large volumes of data.

Attribute Subset Selection reduces data size, allowing you to focus on key factors contributing to patterns and insights, resulting in more accurate and efficient analysis results. Four methods are used to determine the most relevant attributes for analysis by evaluating their significance and contribution to the overall pattern.

Numerosity Reduction reduces the data size without losing essential information, improving computational efficiency and speeding up analysis processes, particularly for large datasets.

Dimensionality Reduction reduces the number of variables while retaining relevant information.
It's especially useful for high-dimensional data, eliminating noise and redundancy for better analysis. Introduction to the Concept of Feature Engineering and Its Impact on Model Performance Feature engineering is both an art and a science in machine learning. It involves creating new features from the existing ones or transforming features to better represent the underlying patterns in the data. Effective feature engineering can significantly boost a model's performance, while poor feature engineering can hinder it. Feature Engineering for Machine Learning Here are some examples of feature engineering: Feature Scaling: As mentioned earlier, feature scaling can be considered a form of feature engineering. It ensures that all features have a similar scale and can contribute equally to the model's predictions. Feature Extraction: In some cases, you may want to reduce the dimensionality of your data. Feature extraction techniques like Principal Component Analysis (PCA) can help identify the most critical features while reducing noise (irrelevant features). Text Data Transformation: When working with text data, techniques like TF-IDF (Term Frequency-Inverse Document Frequency) and word embeddings (e.g., Word2Vec) can convert text into numerical representations that machine learning models can process. Feature engineering is a creative process that requires a deep understanding of the data and the problem. It involves experimentation and iteration to find the most informative features for your model. With our data prepared and our model trained, the next critical step is evaluating and validating the supervised learning model. Model Evaluation and Validation Model evaluation and validation help you assess the performance of your model and ensure that it generalizes well to unseen data. Proper evaluation and validation help you identify any issues with your model, such as underfitting or overfitting, and make the necessary adjustments to improve its performance. 
Model Validation and Evaluation The Importance of Evaluating and Validating Supervised Learning Models Evaluating and validating supervised learning models is crucial to ensure they perform as expected in real-world scenarios. Without proper evaluation, a model might not generalize effectively to unseen data, leading to inaccurate predictions and potentially costly errors. Here's why model evaluation and validation are essential: Generalization Assessment: The goal of supervised learning is to create models that can make accurate predictions on new, unseen data. Model evaluation helps assess how well a model generalizes beyond the training data. Comparison of Models: In many cases, you might experiment with multiple algorithms or variations of a model. Model evaluation provides a basis for comparing these models and selecting the best-performing one. Tuning Hyperparameters: Model evaluation guides the fine-tuning of hyperparameters. By analyzing a model's performance on validation data, you can adjust hyperparameters to improve performance. Overview of Common Evaluation Metrics There are several evaluation metrics used in supervised learning, each suited to different types of problems. Here are some of the most common evaluation metrics: Accuracy: Accuracy measures the proportion of correctly classified instances out of all instances in the test set. It's a suitable metric for balanced datasets but can be misleading when dealing with imbalanced data. Precision: Precision measures the ratio of true positive predictions to the total positive predictions. It is particularly useful when the cost of false positives is high. Recall: Recall (or sensitivity) measures the ratio of true positives to all actual positives. It is essential when it's crucial to identify all positive instances, even if it means having some false positives. F1-Score: The F1-score is the harmonic mean of precision and recall. 
It provides a balanced measure of a model's performance, especially when dealing with imbalanced datasets. Confusion Matrix: A confusion matrix is a table that summarizes the model's predictions and actual class labels. It provides a more detailed view of a model's performance, showing true positives, true negatives, false positives, and false negatives. Model Evaluation Techniques  To evaluate and validate supervised learning models effectively, you can employ various techniques: Cross-Validation: Cross-validation involves splitting the data into multiple subsets and training and testing the model on different subsets. This helps assess how well the model generalizes to other data partitions. Learning Curves: Learning curves visualize how a model's performance changes as the size of the training data increases. They can reveal whether the model is underfitting or overfitting. ROC Curves and AUC: Receiver Operating Characteristic (ROC) curves show the trade-off between true positive rate and false positive rate at different classification thresholds. The Area Under the Curve (AUC) quantifies the overall performance of a binary classification model. Validation Sets: Besides the training and test sets, a validation set is often used to fine-tune models and avoid overfitting. The validation set helps make decisions about hyperparameters and model selection. By diligently applying these evaluation techniques and metrics, you can ensure that your supervised learning model is robust, accurate, and ready for deployment in real-world scenarios. {{light_callout_start}} Through predictive analytics and predictive modeling, supervised learning empowers teams to make data-driven decisions by learning from historical data. {{light_callout_end}} {{Active_CTA}} Challenges and Future Directions While supervised learning has achieved remarkable success in various domains, it has its challenges. 
Some of the key challenges in supervised learning include: Data Quality: The quality of the training data heavily influences model performance. Noisy, biased, or incomplete data can lead to inaccurate predictions. Overfitting: Overfitting occurs when a model learns to memorize the training data rather than generalize from it. Techniques like regularization and cross-validation can mitigate this issue. Imbalanced Data: Imbalanced datasets can lead to biased models that perform poorly for underrepresented classes. Resampling techniques and specialized algorithms can address this challenge. Curse of Dimensionality: As the dimensionality of the feature space increases, the amount of data required for effective modeling also increases. Dimensionality reduction techniques can help manage this issue. Interpretability: Deep learning models, such as neural networks, are often considered "black boxes" due to their complexity. Ensuring model interpretability is an ongoing challenge. Looking ahead, the field of supervised learning continues to evolve. Some promising directions include: Transfer Learning: Transfer learning allows models trained on one task to be adapted for use on another, reducing the need for massive amounts of labeled data. Pre-Trained Models: These allow practitioners to leverage the knowledge and feature representations learned from vast, general datasets, making it easier and more efficient to develop specialized models for specific tasks. AutoML: Automated Machine Learning (AutoML) tools are becoming more accessible, allowing individuals and organizations to build and deploy models with minimal manual intervention. Responsible AI: Responsible AI ensures ethical, fair, and accountable AI systems, considering societal impacts, mitigating harm, and promoting transparency and explainability for clear decision-making.  
{{light_callout_start}} Bias in Machine Learning refers to systematic errors introduced by algorithms or training data that lead to unfair or disproportionate predictions for specific groups or individuals. Learn how to mitigate model bias in Machine Learning. {{light_callout_end}}

Key Takeaways

Supervised learning is a foundational concept in data science, where data scientists leverage various techniques, including Naive Bayes, to build predictive models.
It plays a pivotal role in various AI applications, including spam email detection, recommendation systems, medical diagnosis, and autonomous driving, making it essential to develop intelligent systems.
The structured approach to understanding supervised learning includes input features, target labels, data preparation, model training, evaluation, and deployment.
There are two main types of supervised learning algorithms: classification (for assigning data points to predefined categories) and regression (for predicting continuous values).
Data scientists select appropriate algorithms, such as K-Nearest Neighbors (KNN), to classify or regress data points, enabling applications like spam email detection or sales forecasting.
Common techniques for data preparation include data cleaning, feature scaling, feature engineering, one-hot encoding, and handling imbalanced data.
Model evaluation and validation are crucial for assessing performance, generalization, and fine-tuning hyperparameters in supervised learning, despite challenges like data quality and interpretability.

{{try_encord}}

November 8

8 min

An Introduction to Cross-Entropy Loss Functions

Loss functions are widely used in machine learning tasks for optimizing models. The cross-entropy loss stands out among the many loss functions available, especially in classification tasks. But why is it so significant? Cross-entropy loss is invaluable in certain scenarios, particularly when interpreting the outputs of neural networks that utilize the softmax function, a common practice in deep learning models. This loss function measures the difference between two probability distributions, reflecting how well the model predicts the actual outcomes. The term "surrogate loss" refers to an alternative loss function used instead of the actual loss function, which might be difficult to compute or optimize. In this context, cross-entropy can be considered a surrogate for other more complex loss functions, providing a practical approach for model optimization. In the broader theoretical landscape of machine learning, there's an extensive analysis of a category of loss functions, often referred to in research as "composite loss" or "sum of losses." This category includes cross-entropy (also known as logistic loss), generalized cross-entropy, mean absolute error, and others. These loss functions are integral to providing non-asymptotic guarantees and placing an upper boundary on the estimation error of the actual loss based on the error values derived from the surrogate loss. Such guarantees are crucial as they influence the selection of models or hypotheses during the learning process. Researchers have been delving into novel loss functions designed for more complex, often adversarial, machine learning environments. For instance, certain innovative loss functions have been crafted by incorporating smoothing terms into traditional forms. These "smoothed" functions enhance model robustness, especially in adversarial settings where data alterations can mislead learning processes. 
These advancements are paving the way for new algorithms that can withstand adversarial attacks while preserving their predictive accuracy.

Foundations of Loss Functions

Loss functions are the backbone of machine learning optimization, serving as critical navigational tools that guide the improvement of models during the training process. These functions present a measure that models strive to minimize, representing the difference or 'loss' between predicted and actual known values. While the concept of maximizing a function, often referred to as a "reward function," exists, particularly in reinforcement learning scenarios, the predominant focus in most machine learning contexts is minimizing the loss function.

Role in Model Optimization

Central to model optimization is the gradient descent process, which adjusts model parameters iteratively to minimize the loss function. This iterative optimization is further powered by backpropagation, an algorithm that calculates the gradient of the loss function with respect to the model parameters. However, the optimization landscape is fraught with challenges. One of the primary concerns is convergence to a local minimum instead of the global minimum. In simple terms, while the model might think it has found the optimal solution (local minimum), there might be a better overall solution (global minimum) that remains unexplored.

Figure: Explanation of minima/maxima

The choice and design of loss functions are crucial for optimal training of ML tasks. For instance, cross-entropy loss, commonly used in classification tasks, is convex in the model's output logits (and therefore in the parameters of linear models such as logistic regression) and provides a clear signal for model updates, making it particularly suitable for such problems. Understanding the nuances of different loss functions, including cross-entropy loss, and their impact on model optimization is essential for developing effective machine learning models.
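To make the gradient descent loop described in this section concrete, here is a minimal sketch in plain Python. It fits a one-parameter model y = w·x by minimizing a mean squared error loss; the data, the learning rate, and the function names (`mse_loss`, `mse_gradient`, `gradient_descent`) are illustrative assumptions. For a model this small the gradient is written out by hand; backpropagation is what automates that step for multi-layer networks.

```python
def mse_loss(w, xs, ys):
    """Mean squared error of the model y = w * x on the data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, xs, ys):
    """Derivative of the loss with respect to w: mean of 2 * x * (w * x - y)."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, w=0.0, learning_rate=0.05, steps=50):
    """Repeatedly step against the gradient and record the loss after each step."""
    history = [mse_loss(w, xs, ys)]
    for _ in range(steps):
        w -= learning_rate * mse_gradient(w, xs, ys)
        history.append(mse_loss(w, xs, ys))
    return w, history

# Hypothetical data generated by y = 2x, so the best parameter is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_fit, loss_history = gradient_descent(xs, ys)
print(f"fitted w: {w_fit:.4f}")
print(f"loss went from {loss_history[0]:.2f} to {loss_history[-1]:.6f}")
```

This loss is a simple convex bowl in w, so there is only one minimum and the procedure converges to it. The local-minimum concern raised above arises for non-convex losses, where the end point can depend on the starting value of the parameters.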
Common Loss Functions in Machine Learning

Several loss functions have been developed and refined, each tailored to specific use cases.

Mean Squared Error (MSE)

The mean squared error (or MSE) is a quadratic loss function that measures the average squared difference between the estimated values (predictions) and the actual values. For n samples with actual values y_i and predictions \hat{y}_i, it is mathematically represented as

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

MSE loss is widely used in regression problems, for instance predicting house prices based on features like area, number of rooms, and location. A lower MSE indicates a better fit of the model to the data.

Hinge Loss

Hinge loss, or max-margin loss, is used for binary classification tasks. It is defined as

$$L(y, f(x)) = \max(0,\ 1 - y \cdot f(x))$$

Here, y is the true label, encoded as -1 or +1, and f(x) is the model's raw output score. The hinge loss is zero if the prediction is correct with a substantial margin from the decision boundary (high confidence). The loss increases when the prediction is wrong, or when it is correct but with only a slim margin from the decision boundary. Hinge loss is commonly associated with Support Vector Machines (SVM). It's used in scenarios where a clear margin of separation between classes is desired, such as in image classification or text categorization.

Log Loss (Logistic Loss)

Log loss quantifies the performance of a classification model whose output is a probability value between 0 and 1. For n samples with true labels y_i in {0, 1} and predicted probabilities p_i, it is defined as:

$$L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

Log loss penalizes both kinds of error (false positives and false negatives), and it penalizes confidently wrong predictions most severely. Log loss is used in logistic regression and neural networks for binary classification problems. It's suitable for scenarios like email spam detection, where you want to assign a probability of an email being spam.

Each loss function has unique characteristics and is chosen based on the problem's nature and the desired output type.
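The three losses described in this section can be written directly in plain Python. The sketch below assumes the standard textbook forms: mean of squared differences for MSE, max(0, 1 - y·score) with labels in {-1, +1} for hinge loss, and the negative mean log-likelihood with labels in {0, 1} for log loss. The function names and the small clipping constant `eps` are choices made for this illustration.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def hinge_loss(y_true, scores):
    """Average hinge loss; labels must be -1 or +1, scores are raw model outputs."""
    return sum(max(0.0, 1.0 - t * s) for t, s in zip(y_true, scores)) / len(y_true)

def log_loss(y_true, probs, eps=1e-12):
    """Average binary log loss; labels are 0 or 1, probs are predicted P(label = 1)."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) is never evaluated
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))  # 2.5
print(hinge_loss([1, -1], [2.0, 0.5]))             # 0.75

# A confident wrong prediction costs far more than a confident right one.
confident_right = log_loss([1], [0.9])
confident_wrong = log_loss([1], [0.1])
print(confident_right < confident_wrong)           # True
```

In the hinge example, the first sample is correct with a margin above 1 and contributes nothing, while the second is on the wrong side of the boundary and contributes 1.5, which is why the average is 0.75.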
How to select a loss function

Regression: In regression tasks, where the goal is to predict a continuous value, the difference between the predicted and actual values is of primary concern. Common loss functions for regression include:

Mean Squared Error (MSE): Suitable for problems where large errors are particularly undesirable, since squaring amplifies larger errors and gives them a disproportionately large impact.

Mean Absolute Error (MAE): Useful when all errors should be weighted in proportion to their magnitude, without extra emphasis on large ones.

Classification: In classification tasks, where the goal is to categorize inputs into classes, the focus is on the discrepancy between the predicted class probabilities and the actual class labels. Common loss functions for classification include:

Log Loss (Logistic Loss): Used when the model outputs a probability for each class, especially in binary classification.

Hinge Loss: Used for binary classification tasks, especially with Support Vector Machines, focusing on maximizing the margin.

Cross-Entropy Loss: An extension of log loss to multi-class classification problems.

The selection of a loss function is not one-size-fits-all. It requires a deep understanding of the problem, the nature of the data, the distribution of the target variable, and the specific goals of the analysis.

{{Active_CTA}}

Entropy in Information Theory

Entropy in information theory measures the amount of uncertainty or disorder in a set of probabilities. It quantifies the expected value of the information contained in a message and is foundational for data compression and encryption.

Shannon's Entropy

Shannon's entropy, attributed to Claude Shannon, quantifies the uncertainty in predicting a random variable's value. For a random variable X with outcomes x and probabilities P(x), it is defined as:

\[ H(X) = -\sum_{x} P(x) \log_2 P(x) \]

With the logarithm taken base 2, entropy is measured in bits. Shannon's entropy is closely related to data compression.
It gives a lower bound on the average number of bits needed to encode the information contained in a message, which is crucial for lossless data compression algorithms. When the entropy is low (i.e., less uncertainty), fewer bits are required to encode the information, leading to more efficient compression. Shannon's entropy is foundational for designing efficient telecommunications coding schemes and developing compression algorithms like Huffman coding.

Kullback-Leibler Divergence

Kullback-Leibler (KL) Divergence measures how one probability distribution diverges from a second, expected probability distribution. It is defined as

\[ D_{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \]

Here are the parameters and their meanings:

P: The true probability distribution, which serves as the reference.

Q: The approximate probability distribution that is being compared to P.

x: The event or outcome for which the probabilities are defined.

P(x): The probability of event x according to the true distribution P.

Q(x): The probability of event x according to the distribution Q.

\( D_{KL}(P \parallel Q) \): The KL Divergence, which quantifies the difference between the two distributions.

KL Divergence is used in model evaluation to measure the difference between predicted and true probability distributions. It is especially useful in scenarios like neural network training, where the goal is to minimize the divergence between the predicted and true distributions. KL Divergence is often used for model comparison, anomaly detection, and variational inference methods that approximate complex probability distributions.

Cross-Entropy: From Theory to Application

Mathematical Derivation

Cross-entropy is a fundamental concept in information theory that quantifies the difference between two probability distributions. It builds upon the foundational idea of entropy, which measures the uncertainty or randomness of a distribution.
The cross-entropy between two distributions, P and Q, is defined as:

\[ H(P, Q) = -\sum_{x} P(x) \log Q(x) \]

P(x) is the probability of event x in distribution P, and Q(x) is the probability of event x in distribution Q.

1. Log-likelihood function and maximization: The log-likelihood measures how well a statistical model predicts a sample. In machine learning, maximizing the log-likelihood is equivalent to minimizing the cross-entropy between the true data distribution and the model's predictions.

2. Relationship with Kullback-Leibler divergence: The Kullback-Leibler (KL) divergence is another measure of how one probability distribution differs from a second reference distribution. Cross-entropy can be expressed in terms of KL divergence and the entropy of the true distribution:

\[ H(P, Q) = H(P) + D_{KL}(P \parallel Q) \]

where \( H(P) \) is the entropy of distribution P, and \( D_{KL}(P \parallel Q) \) is the KL divergence between distributions P and Q.

Binary vs. Multi-Class Cross-Entropy

Cross-entropy is a pivotal loss function in classification tasks, measuring the difference between two probability distributions. Its formulation varies depending on the nature of the classification task: binary or multi-class.

Binary Cross-Entropy: This is tailored for binary classification tasks with only two possible outcomes. Given \( y \) as the actual label (either 0 or 1) and \( \hat{y} \) as the predicted probability of the label being 1, the binary cross-entropy loss is:

\[ L(y, \hat{y}) = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right] \]

This formulation captures the divergence of the predicted probability from the actual label.

Categorical Cross-Entropy: Suited for multi-class classification tasks, this formulation is slightly more intricate. If \( P \) represents the true distribution over classes and \( Q \) is the predicted distribution, the categorical cross-entropy over classes c is given by:

\[ L(P, Q) = -\sum_{c} P(c) \log Q(c) \]

Here, the loss is computed over all classes, emphasizing the divergence of the predicted class probabilities from the true class distribution.
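The relationship between entropy, KL divergence, and cross-entropy described above can be checked numerically. The following is a small sketch in plain Python using natural logarithms; the function names and the two example distributions `p` and `q` are assumptions for illustration, and the code assumes Q(x) is positive wherever P(x) is positive.

```python
import math


def entropy(p):
    # H(P) = -sum P(x) log P(x); terms with P(x) = 0 contribute nothing.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    # H(P, Q) = -sum P(x) log Q(x)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    # D_KL(P || Q) = sum P(x) log(P(x) / Q(x))
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def binary_cross_entropy(y, y_hat):
    # Special case with two classes: y is 0 or 1, y_hat is P(label = 1).
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


p = [0.7, 0.2, 0.1]  # "true" distribution (illustrative)
q = [0.5, 0.3, 0.2]  # predicted distribution (illustrative)

print(cross_entropy(p, q))
print(entropy(p) + kl_divergence(p, q))  # matches the line above up to rounding

# With a one-hot true label, categorical cross-entropy reduces to -log of the
# probability assigned to the correct class, which is what the binary form computes.
print(cross_entropy([0, 1], [0.2, 0.8]), binary_cross_entropy(1, 0.8))
```

Because the entropy of the true distribution does not depend on the model, minimizing cross-entropy with respect to Q is the same as minimizing the KL divergence.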
Challenges in Multi-Class Scenarios: The complexity of multi-class cross-entropy grows with the number of classes. A fundamental requirement is that the predicted probabilities across all classes sum to one. This normalization is typically achieved using the softmax function, which exponentiates each class score and then normalizes these values to yield a valid probability distribution. While binary and multi-class cross-entropy both measure the divergence between true and predicted distributions, their formulations and associated challenges differ based on the nature of the classification task.

Practical Implications of Cross-Entropy Loss

Cross-entropy loss is pivotal in optimizing models, especially in classification tasks. Its implications are wide-ranging, affecting both the speed of model convergence and how regularization is applied to mitigate overfitting.

Impact on Model Convergence

Speed of Convergence: Cross-entropy loss is preferred in many deep learning tasks because it often leads to faster convergence than other loss functions. It amplifies the gradient when the predicted probability diverges significantly from the actual label, providing a stronger signal for the model to update its weights and thus encouraging faster learning.

Avoiding Local Minima: The shape of the cross-entropy loss can help models avoid stalling in poor regions of the loss surface. Because it penalizes confident incorrect predictions heavily, the model keeps receiving large updates until those predictions improve, rather than settling early for a suboptimal fit. This does not guarantee that training reaches the global minimum.

Figure: Local Minima

Regularization and Overfitting

L1 and L2 Regularization: You can combine regularization techniques like L1 (Lasso) and L2 (Ridge) with cross-entropy loss to prevent overfitting.
L1 regularization tends to drive some feature weights to zero, promoting sparsity, while L2 shrinks weights, preventing any single feature from overshadowing others. These techniques add penalty terms to the loss function, discouraging the model from assigning too much importance to any feature.

Dropout and its effect on cross-entropy: Dropout is a regularization technique where random subsets of neurons are turned off during training. This prevents the model from becoming overly reliant on any single neuron. When combined with cross-entropy loss, dropout can help the model generalize better to unseen data.

Implementing Cross-Entropy in Modern Frameworks

PyTorch

In PyTorch, the `nn.CrossEntropyLoss()` class is used to compute the cross-entropy loss. It's important to note that the input to this loss should be raw scores (logits) and not the output of a softmax function, because it combines the log-softmax function and the negative log-likelihood loss in one class.

TensorFlow

In TensorFlow, `tf.keras.losses.CategoricalCrossentropy()` computes the cross-entropy loss for one-hot encoded multi-class labels. By default it expects probabilities rather than logits:

    import tensorflow as tf
    loss_fn = tf.keras.losses.CategoricalCrossentropy()

For binary classification tasks, `tf.keras.losses.BinaryCrossentropy()` is more appropriate:

    loss_fn_binary = tf.keras.losses.BinaryCrossentropy()

Custom Loss Functions: TensorFlow and Keras provide flexibility in defining custom loss functions. This can be useful when the standard cross-entropy loss needs to be modified or combined with another loss function for specific applications.

Advanced Topics in Cross-Entropy

Label Smoothing

Label smoothing is a regularization technique that prevents the model from becoming too confident about its predictions. Instead of using hard labels (e.g., [0, 1]), it uses soft labels (e.g., [0.1, 0.9]) to encourage the model to be less certain, distributing certainty between classes.

Improving model generalization: Label smoothing can improve the generalization capability of models by preventing overfitting.
Overfitting occurs when a model becomes too confident about its predictions based on the training data, leading to poor performance on unseen data. By using soft labels, label smoothing encourages the model to be less certain, which can lead to better generalization.

Implementation and results: Most deep learning frameworks provide built-in label smoothing. For instance, in TensorFlow, `tf.keras.losses.CategoricalCrossentropy` accepts a `label_smoothing` argument. Conceptually, smoothing subtracts a small constant from the true label and spreads it across the labels of the other classes. With a smoothing factor \( \epsilon \) and K classes, each one-hot label \( y_k \) becomes \( y_k^{LS} = (1 - \epsilon)\, y_k + \epsilon / K \), and the cross-entropy loss is computed against these soft labels: \( L = -\sum_{k} y_k^{LS} \log q_k \), where \( q_k \) is the predicted probability of class k. The results of using label smoothing vary depending on the dataset and model architecture. Still, it can generally lead to improved performance, especially in cases where the training data is noisy or imbalanced.

Focal Loss and Class Imbalance

Focal loss is a modification of the standard cross-entropy loss designed to address the class imbalance problem. In datasets with imbalanced classes, the majority class can dominate the loss, leading to poor performance for the minority class. Writing \( p_t \) for the model's predicted probability of the true class, cross-entropy is \( \mathrm{CE}(p_t) = -\log(p_t) \), and focal loss scales it by a modulating factor with \( \gamma \geq 0 \):

\[ \mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t) \]

Origins and Context: The paper "Focal Loss for Dense Object Detection" examines the challenges faced by one-stage object detectors, which have historically lagged behind the accuracy of two-stage detectors despite their potential for speed and simplicity. The authors identify the extreme foreground-background class imbalance during the training of dense detectors as the primary culprit. The core idea behind Focal Loss is to reshape the standard cross-entropy loss in a way that down-weights the loss assigned to well-classified examples. This ensures that the training focuses more on a sparse set of hard-to-classify examples, preventing the overwhelming influence of easy negatives.
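The down-weighting of well-classified examples can be sketched for the binary case as follows. This is an illustrative plain-Python version, not the reference implementation from the paper; the function names are hypothetical, and the defaults `gamma=2.0` and `alpha=0.25` are assumptions based on commonly cited settings.

```python
import math


def binary_cross_entropy(p, y):
    # p is the predicted probability of class 1, y is the true label (0 or 1).
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    # Cross-entropy scaled by (1 - p_t)^gamma and a class-balancing weight alpha_t.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = 0.9  # positive example the model already classifies well
hard = 0.1  # positive example the model gets badly wrong

for p in (easy, hard):
    ce = binary_cross_entropy(p, 1)
    fl = focal_loss(p, 1)
    print(f"p_t={p}: cross-entropy={ce:.4f}, focal={fl:.4f}, ratio={fl / ce:.4f}")
```

The printed ratio shows how much of the original cross-entropy survives for each example: the easy example keeps only a small fraction of its loss, while the hard example keeps most of it. Setting `gamma=0` removes the modulating factor and leaves an alpha-weighted cross-entropy.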
Addressing the class imbalance problem: Focal loss adds a modulating factor to the cross-entropy loss, which down-weights the loss contribution from easy examples (typically examples from the majority class) and, in relative terms, up-weights the loss contribution from hard examples (often examples from the minority class). This helps the model focus more on the minority class, leading to better performance on imbalanced datasets.

Performance Implications: By focusing more on the minority class, focal loss can improve performance on minority classes, often without sacrificing performance on the majority class. This makes it a valuable tool for tasks where the minority class is particularly important, such as medical diagnosis or fraud detection.

The full focal loss formula is:

\[ \mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \]

The parameters are:

\( p_t \): The model's estimated probability for the class with the true label t.

alpha (\( \alpha_t \)): A balancing factor, typically between 0 and 1, which can be set differently for each class.

gamma (\( \gamma \)): A focusing parameter, typically greater than 0, that reduces the relative loss for well-classified examples, focusing more on hard, misclassified examples.

Cross Entropy: Key Takeaways

Cross-Entropy Loss as a Performance Measure: Cross-entropy loss is crucial in classification tasks because it quantifies the difference between the predicted probability distribution of the model and the actual distribution of the labels. It is particularly effective when combined with the softmax function in neural networks, providing a clear gradient signal that aids in faster and more efficient model training.

Role of Loss Functions in Optimization: Loss functions like cross-entropy guide the training of machine learning models by providing a metric to minimize. The design of these functions matters: the convexity of cross-entropy in the model's predicted probabilities, for example, gives well-behaved gradients that help the optimizer move toward good parameters, although deep networks can still have local minima.
Handling Class Imbalance with Focal Loss: Focal loss is an adaptation of cross-entropy that addresses class imbalance by focusing training on hard-to-classify examples. It modifies the standard cross-entropy loss by adding a factor that reduces the contribution of easy-to-classify examples, thus preventing the majority class from overwhelming the learning process. Regularization Techniques to Prevent Overfitting: Combining cross-entropy loss with regularization techniques like L1 and L2 regularization, or dropout, can prevent overfitting. These methods add penalty terms to the loss function or randomly deactivate neurons during training, encouraging the model to generalize to new, unseen data. Label Smoothing for Improved Generalization: Label smoothing is a technique that uses soft labels instead of hard labels during training, which prevents the model from becoming overly confident about its predictions. This can lead to better generalization to unseen data by encouraging the model to distribute its certainty among the possible classes rather than focusing too narrowly on the classes observed in the training set. {{Active_CTA}}

November 7

10 min

Training vs. Fine-tuning: What is the Difference?

Training and fine-tuning are crucial stages in the machine learning model development lifecycle, serving distinct purposes. This article explains both methodologies, highlighting their differences and their importance in ensuring optimal model performance.

Training, in the context of deep learning and neural networks, refers to the phase where a new model learns from a dataset. During this phase, the model adjusts its weights based on the input data and the corresponding output. Components such as embeddings and activation functions play significant roles in some model architectures and tasks, but they are not used by every deep learning model, so their relevance depends on the specific architecture. The objective is to reduce the discrepancy between the predicted and actual output, commonly termed error or loss. This is predominantly achieved using backpropagation together with optimization techniques like gradient descent.

Fine-tuning, conversely, follows the initial training: a pre-trained model (previously trained on a vast dataset like ImageNet) is trained further on a smaller, task-specific dataset. The rationale is to leverage the knowledge the model has acquired from the initial training process and tailor it to a more specific task. This becomes invaluable when the dataset for the new task is limited, as training from scratch might lead to overfitting.

When training starts, the neural network's weights are randomly initialized, for example using methods like He or Xavier initialization. These weights are fundamental in determining the model's predictions. As training progresses, the weights are adjusted to minimize the error, with the size of each update controlled by the learning rate.
Conversely, during fine-tuning, the model starts with pre-trained weights from the initial training, which are then fine-tuned to suit the new task better, often involving techniques like unfreezing certain layers or adjusting the batch size. The training aims to discern patterns and features from the data, creating a base model that excels on unseen data and is often validated using validation sets. Fine-tuning, however, zeroes in on adapting a generalized model for a specific task, often leveraging transfer learning to achieve this. While training focuses on generalizing models, fine-tuning refines this knowledge to cater to specific tasks, making it a crucial topic in NLP with models like BERT, computer vision tasks like image classification, and, more recently, the proliferation of foundation models. {{light_callout_start}} Learn more: Visual Foundation Models (VFMs) by Lead ML Engineer at Encord, Frederik Hvilshøj. {{light_callout_end}} The Training Process Initialization of Weights Random Initialization In deep learning, initializing the weights of neural networks is crucial for the training process. Random initialization is a common method where weights are assigned random values. This method ensures a break in symmetry among neurons, preventing them from updating similarly during backpropagation. However, random initialization can sometimes lead to slow convergence or the vanishing gradient problem. He or Xavier Initialization Specific strategies, like He or Xavier initialization, have been proposed to address the challenges of random initialization. He initialization, designed for ReLU activation functions, initializes weights based on the size of the previous layer, ensuring that the variance remains consistent across layers. On the other hand, Xavier initialization, suitable for tanh activation functions, considers the sizes of the current and previous layers. These methods help with faster and more stable convergence. 
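The two initialization schemes can be sketched in plain Python as follows. This uses the normal-distribution variants (variance 2 / fan_in for He, and 2 / (fan_in + fan_out) for Xavier); the function names, layer sizes, and seed are assumptions for illustration, and deep learning frameworks ship their own implementations.

```python
import math
import random


def he_normal(fan_in, fan_out, rng):
    # He initialization: zero-mean Gaussian with variance 2 / fan_in,
    # i.e. scaled by the size of the previous layer only.
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def xavier_normal(fan_in, fan_out, rng):
    # Xavier (Glorot) initialization: variance 2 / (fan_in + fan_out),
    # i.e. scaled by the sizes of both the previous and the current layer.
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def variance(weights):
    # Empirical variance over all entries of a weight matrix.
    flat = [w for row in weights for w in row]
    mean = sum(flat) / len(flat)
    return sum((w - mean) ** 2 for w in flat) / len(flat)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
he = he_normal(256, 128, rng)
xavier = xavier_normal(256, 128, rng)

print("He:     empirical", variance(he), "target", 2.0 / 256)
print("Xavier: empirical", variance(xavier), "target", 2.0 / (256 + 128))
```

Both schemes are still random; they differ from naive random initialization only in how the spread of the weights is tied to the layer sizes.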
Backpropagation and Weight Updates

Gradient Descent Variants

Backpropagation computes the gradient of the loss function with respect to each weight by applying the chain rule. Various gradient descent algorithms then use these gradients to update the weights and minimize the loss. The most basic form is Batch Gradient Descent. Other variants, like Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent, have been introduced to improve efficiency and convergence.

Role of Learning Rate

The learning rate is a hyperparameter that dictates the step size during weight updates. A high learning rate might overshoot the optimal point, while a low learning rate might result in slow convergence. Adaptive learning rate methods like Adam, RMSprop, and Adagrad adjust the learning rate during training, facilitating faster convergence with less manual tuning.

Regularization Techniques

Dropout

Overfitting is a common pitfall in deep learning, where the model performs exceptionally well on the training data but poorly on unseen data. Dropout is a regularization technique that mitigates overfitting. During training, random neurons are "dropped out" or deactivated at each iteration, ensuring the model does not rely heavily on any specific neuron.

Figure: Dropout Neural Networks

L1 and L2 Regularization

L1 and L2 are other regularization techniques that add a penalty to the loss function. L1 regularization adds a penalty proportional to the absolute value of the weights, which aids feature selection. L2 regularization adds a penalty based on the squared magnitude of the weights, preventing weights from reaching extremely high values. Both methods help prevent overfitting by penalizing complex models, producing a more generalized model.

Figure: L1 and L2 Regularization

The Fine-tuning Process

Transfer Learning: The Backbone of Fine-tuning

Transfer learning is a technique where a model developed for a task is adapted for a second related task.
It is a popular approach in deep learning where pre-trained models are used as the starting point for computer vision and natural language processing tasks due to the extensive computational resources and time required to train models from scratch. Pre-trained models save the time and resources needed to train a model from scratch. They have already learned features from large datasets, which can be leveraged for a new task with a smaller dataset. This is especially useful when acquiring labeled data is challenging or costly. When fine-tuning, it's common to adjust the deeper layers of the model while keeping the initial layers fixed. The rationale is that the initial layers capture generic features (like edges or textures), while the deeper layers capture more task-specific patterns. However, the extent to which layers are fine-tuned can vary based on the similarity between the new task and the original task. Strategies for Fine-tuning One of the key strategies in fine-tuning is adjusting the learning rates. A lower learning rate is often preferred because it makes the fine-tuning process more stable. This ensures the model retains the previously learned features without drastic alterations.  Another common strategy is freezing the initial layers of the model during the fine-tuning process. This means that these layers won't be updated during training. As mentioned, the initial layers capture more generic features, so fixing them is often beneficial. Applications and Use Cases Domain Adaptation Domain adaptation refers to the scenario where the source and target tasks are the same, but the data distributions differ. Fine-tuning can be used to adapt a model trained on source data to perform well on target data. Domain Adaptation Data Augmentation Data augmentation involves creating new training samples by applying transformations (like rotations, scaling, and cropping) to the existing data. 
Combined with fine-tuning, it can improve the model's performance, especially when the available labeled data is limited.

Figure: Data Augmentation

Comparative Analysis

Benefits of Training from Scratch

Customization: Training a model from scratch allows complete control over its architecture, making it tailored specifically for the task.

No Prior Biases: Starting from scratch ensures the model doesn't inherit any biases or unwanted features from pre-existing datasets.

Deep Understanding: Training a model from the ground up can provide deeper insights into the data's features and patterns, leading to a more robust model for specific datasets.

Optimal for Unique Datasets: For datasets significantly different from existing ones, training from scratch might yield better results as the model learns features unique to that dataset.

Limitations of Training from Scratch

This approach requires more time, as the model learns features from the ground up, and it requires a large, diverse dataset for optimal performance. Without the right data and regularization, models can easily overfit.

Extended Training Time: Starting from the basics means the model has to learn every feature, leading to prolonged training durations.

Data Dependency: Achieving optimal performance requires access to a vast and varied dataset, which is not always feasible.

Risk of Overfitting: Without adequate data and proper regularization techniques, models can overfit, limiting their generalization capabilities on unseen data.

Advantages of Fine-Tuning

Efficiency in Training: Utilizing pre-trained models can expedite the training process, as they have already learned foundational features from extensive datasets.

Data Economy: Since the model has already been trained on vast datasets, fine-tuning typically demands a smaller amount of data, making it ideal for tasks with limited datasets.
Limitations of Fine-Tuning

Compatibility Issues: Ensuring that the input and output formats, as well as the architectures and frameworks of the pre-trained model, align with the new task can be challenging.

Overfitting: Fine-tuning on a small dataset can lead to overfitting, which reduces the model's ability to generalize to new, unseen data.

Knowledge Degradation: There's a risk that the model might forget some of the features and knowledge acquired during its initial training, a phenomenon often referred to as "catastrophic forgetting."

Bias Propagation: Pre-trained models might carry inherent biases. When fine-tuned, these biases can be exacerbated, especially in applications that require high sensitivity, such as facial recognition.

{{light_callout_start}} Optimizing your hyperparameters is a key process for getting your pre-trained models to learn the dataset during fine-tuning. Interested in learning more about hyperparameter optimization while fine-tuning models? Check out our article. {{light_callout_end}}

Research Breakthroughs Achieved Through Fine-tuning

Fine-tuning in NLP

BERT (Bidirectional Encoder Representations from Transformers) has been a cornerstone in the NLP community. Its architecture allows for capturing context from both directions (left-to-right and right-to-left) in a text, making it highly effective for various NLP tasks.

In 2023, work building on pre-trained language models has continued to advance. One such development is "Ferret: Refer and Ground Anything Anywhere at Any Granularity." This Multimodal Large Language Model (MLLM) can understand spatial references of any shape or granularity within an image and accurately ground open-vocabulary descriptions. Such advancements highlight the potential of fine-tuning pre-trained models to achieve specific tasks with high precision.

Fine-tuning in Computer Vision

Models like ResNet and VGG have been foundational in computer vision.
These architectures, with their deep layers, have been pivotal in achieving state-of-the-art results on various image classification tasks. In 2023, a significant breakthrough, "Improved Baselines with Visual Instruction Tuning," was introduced. This research emphasized the progress of large multimodal models (LMM) with visual instruction tuning. Such advancements underscore the importance of fine-tuning in adapting pre-trained models to specific tasks or datasets, enhancing their performance and utility. {{Training_data_CTA}} Training vs Fine-tuning: Key Takeaways Training and fine-tuning are pivotal processes in deep learning and machine learning. While training involves initializing model weights and building a new model from scratch using a dataset, fine-tuning leverages pre-trained models and tailors them to a specific task.  Opting for training from scratch is ideal when you have a large dataset vastly different from available pre-trained models like those on Imagenet. It's also the preferred strategy when there's an absence of pre-existing models on platforms like TensorFlow Hub, PyTorch Zoo, or Keras that align with the task. On the flip side, fine-tuning is advantageous when the dataset at hand is smaller or when the new task mirrors the objectives of the pre-trained model. This approach, backed by optimization techniques like adjusting the learning rate, allows for swifter convergence and frequently culminates in superior performance, especially in scenarios with limited training data. Future Trends and Predictions: The deep learning community, including platforms like OpenAI, is progressively gravitating towards fine-tuning, especially with the advent of large language models and transformers. This inclination is anticipated to persist, especially with the ascent of transfer learning and the triumph of models like BERT in NLP and ResNet in computer vision. 
As neural networks evolve and datasets expand, hybrid methodologies that amalgamate the strengths of both training and fine-tuning paradigms may emerge, potentially blurring the demarcation between the two.

November 7

5 min

Mean Average Precision in Object Detection

Object detection is a fascinating field in computer vision. It is tasked with locating and classifying objects within an image or video frame. The challenge lies in the model's ability to identify objects of varying shapes, sizes, and appearances, especially when they are partially occluded or set against cluttered backgrounds. Deep learning has proven highly effective in object detection. Through training, deep learning models extract features like shape, size, and texture from images to facilitate object detection. They can also learn to classify objects based on the extracted features. One widely used deep learning model for object detection is YOLO (You Only Look Once). YOLO is a single-shot object detection algorithm, meaning it detects objects in a single pass through the image. This makes YOLO very fast, but it can be less accurate than two-stage object detection algorithms. Another renowned deep learning model for object detection is SSD (Single Shot MultiBox Detector). SSD is similar to YOLO but uses a distinct approach to detecting objects. SSD partitions the image into a grid of cells, and each cell predicts potential bounding boxes for objects that may be present in the cell. This makes SSD more accurate than YOLO, but it is also slower. Object detection typically involves two primary components: Object Classification: Assigning labels or categories, such as "car", "person", or "cat", to detected objects. Object Localization: Identifying the object's position within the image, typically represented by a bounding box. These bounding boxes are described using coordinates (x, y) for the top-left corner, along with their dimensions (width, height). Evaluation Metrics for Object Detection Assessing the performance, effectiveness, and limitations of object detection models is pivotal. 
You can employ several evaluation metrics to assess the accuracy and robustness of these models:

Mean Average Precision (mAP) computes the average precision (the area under the precision-recall curve) for each object class and averages these values across classes to summarize the overall accuracy of the object detector.

Intersection over Union (IoU) measures the overlap between the predicted bounding box and the ground-truth bounding box. A score of 1.0 signifies a perfect overlap, whereas a score of 0.0 denotes no overlap between the predicted and the ground-truth bounding boxes.

False Positive Rate (FPR) measures the ratio of incorrect positive predictions to the total number of actual negatives. In simpler terms, it quantifies how often the model mistakenly predicts the presence of an object within a bounding box when there isn't one.

False Negative Rate (FNR) measures the ratio of missed detections to the total number of actual objects. Essentially, it evaluates how often the model fails to detect an object when it is indeed present in the image.

The choice of evaluation metric must align with the goals and nuances of the specific application. For instance, in traffic monitoring applications, mAP and IoU might be prioritized. Conversely, in medical imaging, where false alarms and missed detections can have serious implications, metrics such as FPR and FNR become highly significant.

Importance of Evaluating Object Detection Models

The evaluation of object detection models is critically important for several reasons:

Performance Assessment: Object detection models operate in complex real-world scenarios, with factors like diverse lighting conditions, occlusions, and varying object sizes, so it's essential to determine how well they cope with such challenges.

Model Selection and Tuning: Not all object detection models perform equally well. Evaluating different models helps in selecting the most suitable one for a specific application.
By comparing their performance metrics, you can make informed decisions about which model to use and whether any fine-tuning is necessary.

Benchmarking: Object detection is a rapidly evolving field with new algorithms and architectures being developed regularly, and consistent evaluation makes it possible to compare them.

Understanding Limitations: Object detection models might perform well on some object classes but struggle with others. Evaluation helps identify which classes are challenging for the model and whether its performance is consistent across different object categories.

Safety and Reliability: In critical applications such as autonomous driving, surveillance, and medical imaging, the accuracy of object detection directly impacts safety outcomes.

Quality Control: Faulty object detection in industrial settings can cause production mishaps or equipment malfunctions. Periodic evaluation ensures models remain reliable.

User Confidence: For users and stakeholders to trust object detection systems, you need to consistently validate their capabilities.

Iterative Improvement: Evaluation feedback is crucial for iterative model improvement. Understanding where a model fails or performs poorly provides insights into areas that need further research, feature engineering, or data augmentation.

Legal and Ethical Considerations: Biased or flawed object detection can sometimes lead to legal and ethical ramifications, underscoring the importance of thorough evaluation.

Resource Allocation: In resource-limited settings, evaluations guide the efficient distribution of computational resources, ensuring the best model performance.

{{gray_callout_start}} New to object detection? Check out this short article on object detection, the models, use cases, and real-world applications. {{gray_callout_end}}

Overview of mAP

Mean average precision (mAP) is a metric used to evaluate the performance of object detection models. It is calculated by averaging, across all object classes, the area under each class's precision-recall curve.
Precision quantifies the fraction of true positives out of all detected objects, while recall measures the fraction of true positives out of all actual objects in the image. Plotting precision against recall across confidence thresholds gives a precision-recall curve for each class. The area under that curve (AUC) condenses the curve into one number, the average precision (AP) for the class, which reflects both precision and recall. Averaging these areas across all classes gives mAP.

mAP is a popular metric for evaluating object detection models because it is easy to understand and interpret. It is also relatively insensitive to the number of objects in the image. A high mAP score indicates that the model can detect objects with both high precision and recall, which is critical in applications like autonomous driving where reliable object detection is pivotal to avoiding collisions. A perfect mAP score of 1.0 suggests that the model has achieved flawless detection across all classes and recall thresholds. Conversely, a lower mAP score signifies potential areas of improvement in the model's precision and/or recall.

How to Calculate Mean Average Precision (mAP)

1. Generate the prediction scores using the model.
2. Convert the prediction scores to class labels by applying a confidence threshold.
3. Calculate the confusion matrix.
4. Calculate the precision and recall metrics, repeating steps 2 to 4 for each threshold.
5. Calculate the area under the precision-recall curve (AUC) for each class.
6. Average the AUCs to get the mAP score.

Practical Applications

mAP is a widely used metric for evaluating object detection models in a variety of applications, such as:

Self-driving Cars

Self-driving cars are one of the most promising applications of object detection technology. To safely navigate the road, self-driving cars need to be able to detect and track various objects, including pedestrians, cyclists, other vehicles, and traffic signs.
mAP is a valuable metric for evaluating the performance of object detection models for self-driving cars because it takes into account both precision and recall. Precision is the fraction of detected objects that are actually present in the image or video, i.e., correct detections. Recall, on the other hand, measures how many of the actual objects in the image were successfully detected by the model. High precision indicates fewer false positives, ensuring that the model isn't mistakenly identifying objects that aren't there. Conversely, high recall ensures the model detects most of the real objects in the scene. For self-driving cars, a high mAP is essential for ensuring safety. If the model is not able to detect objects accurately, it could lead to accidents.

Visual Search

Visual search is a type of information retrieval that allows users to find images or videos that contain specific objects or scenes. It is a practical application of mean average precision (mAP) because mAP can be used to evaluate the performance and reliability of visual search algorithms. In visual search, the primary objective is to retrieve images or videos that are relevant to the user's query. This can be a challenging task, as there may be millions or even billions of images or video sequences available. To address this challenge, visual search algorithms use object detection models to identify the objects in the query image or video. Object detection models play a pivotal role by identifying potential matches and generating a list of candidate images or videos that seem to contain the queried objects. The mAP metric can be used to evaluate the performance of the object detection models by measuring the accuracy and completeness of the candidate lists. {{light_callout_start}} Interested in building visual search applications? Learn how to build semantic visual search with ChatGPT and CLIP in this webinar.
{{light_callout_end}}

Medical Image Analysis

mAP is used to evaluate the performance of object detection models in medical image analysis. It is calculated by taking the average of the precision-recall curves for all classes. The higher the mAP, the better the performance of the model.

How to Calculate mAP

The following code shows how to calculate mAP in Python:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import sklearn.metrics

The code above imports the libraries used in this walkthrough: numerical operations (`numpy`), data manipulation (`pandas`), model evaluation (`precision_score` and `recall_score` from `sklearn.metrics`), and plotting (`matplotlib.pyplot`).

Create two different datasets containing binary data. The code below defines two sets of data for binary classification model evaluations. Each set consists of ground truth labels and predicted scores.

    y_true_01 = ["positive", "negative", "positive", "negative", "positive",
                 "positive", "positive", "negative", "positive", "negative"]
    pred_scores_01 = [0.7, 0.3, 0.5, 0.6, 0.55, 0.9, 0.75, 0.2, 0.8, 0.3]

`y_true_01` is a list of ground truth labels for a set of instances. In this case, the labels are either "positive" or "negative," representing the two classes in a binary classification problem. `pred_scores_01` is a list of predicted scores or probabilities assigned by a classification model to the instances in `y_true_01`. These scores represent the model's confidence in its predictions.

    y_true_02 = ["negative", "positive", "positive", "negative", "negative",
                 "positive", "positive", "positive", "negative", "positive"]
    pred_scores_02 = [0.32, 0.9, 0.5, 0.1, 0.25, 0.9, 0.55, 0.3, 0.35, 0.85]

`y_true_02` is another list of ground truth labels for a different set of instances.
`pred_scores_02` is a list of predicted scores or probabilities assigned by a classification model to the instances in `y_true_02`.

Set the thresholds to range from 0.2 to 0.9 in steps of 0.05. Sweeping the threshold in this way is a good practice for calculating mean average precision (mAP) because it lets you see how the model performs at different levels of confidence. Note that `np.arange` treats `stop` as exclusive, although floating-point rounding can affect the final element.

    thresholds = np.arange(start=0.2, stop=0.9, step=0.05)

The `precision_recall_curve()` function computes precision and recall scores for various binary classification thresholds. It accepts the ground truth labels, the predicted scores, and the threshold values as inputs (`y_true`, `pred_scores`, and `thresholds`). For each threshold, it generates the predicted labels, calculates the precision and recall scores, and appends them to the result lists. Finally, it returns the lists of precision and recall values.

    def precision_recall_curve(y_true, pred_scores, thresholds):
        precisions = []
        recalls = []

        for threshold in thresholds:
            # Label an instance "positive" when its score reaches the threshold
            y_pred = ["positive" if score >= threshold else "negative"
                      for score in pred_scores]

            precision = sklearn.metrics.precision_score(
                y_true=y_true, y_pred=y_pred, pos_label="positive")
            recall = sklearn.metrics.recall_score(
                y_true=y_true, y_pred=y_pred, pos_label="positive")

            precisions.append(precision)
            recalls.append(recall)

        return precisions, recalls

Calculate the average precision scores for the first dataset (`y_true_01`) and plot out the result.
    # Calculate precision and recall values for different thresholds
    precisions, recalls = precision_recall_curve(y_true=y_true_01,
                                                 pred_scores=pred_scores_01,
                                                 thresholds=thresholds)

    # Plot the precision-recall curve
    plt.plot(recalls, precisions, linewidth=4, color="red", zorder=0)

    # Set the labels and the title for the precision-recall curve plot
    plt.xlabel("Recall", fontsize=12, fontweight='bold')
    plt.ylabel("Precision", fontsize=12, fontweight='bold')
    plt.title("Precision-Recall Curve", fontsize=15, fontweight="bold")
    plt.show()

    # Append the end point (recall 0, precision 1) to calculate the area under the curve (AUC)
    precisions.append(1)
    recalls.append(0)

    # Convert precision and recall lists to NumPy arrays for computation
    precisions = np.array(precisions)
    recalls = np.array(recalls)

    # Calculate the AP
    avg_precision_class01 = np.sum((recalls[:-1] - recalls[1:]) * precisions[:-1])
    print('============================================')
    print('Average precision score:', np.round(avg_precision_class01, 2))

The AP score of 0.95 is a good score; it indicates that the model performs relatively well in terms of precision when varying the classification threshold and measuring the trade-off between precision and recall.

Now, let’s calculate the average precision scores for the second dataset (`y_true_02`) and plot out the result.
    # Calculate precision and recall values for different thresholds
    precisions, recalls = precision_recall_curve(y_true=y_true_02,
                                                 pred_scores=pred_scores_02,
                                                 thresholds=thresholds)

    # Plot the precision-recall curve
    plt.plot(recalls, precisions, linewidth=4, color="blue", zorder=0)

    # Set the labels and the title for the precision-recall curve plot
    plt.xlabel("Recall", fontsize=12, fontweight='bold')
    plt.ylabel("Precision", fontsize=12, fontweight='bold')
    plt.title("Precision-Recall Curve", fontsize=15, fontweight="bold")
    plt.show()

    # Append the end point (recall 0, precision 1) to calculate the area under the curve (AUC)
    precisions.append(1)
    recalls.append(0)

    # Convert precision and recall lists to NumPy arrays for computation
    precisions = np.array(precisions)
    recalls = np.array(recalls)

    # Calculate the AP
    avg_precision_class02 = np.sum((recalls[:-1] - recalls[1:]) * precisions[:-1])
    print('============================================')
    print('Average precision score:', np.round(avg_precision_class02, 2))

For the second dataset, the AP score was 0.96, which is also a good score. It indicates that the model is able to identify positive samples with high precision and high recall.

Calculating the Mean Average Precision (mAP)

The mean Average Precision or mAP score is calculated by taking the mean AP over all classes and/or over all IoU thresholds, depending on the detection challenge. The formula for mAP over $n$ classes, where $\text{AP}_k$ is the average precision of class $k$, is $\text{mAP} = \frac{1}{n}\sum_{k=1}^{n} \text{AP}_k$.

    # Number of classes or labels (in this case, 2 classes)
    num_labels = 2

    # Calculate the Mean Average Precision (mAP) by averaging the AP scores for both classes
    mAP = (avg_precision_class01 + avg_precision_class02) / num_labels

    # Print the Mean Average Precision score
    print('Mean average Precision score:', np.round(mAP, 3))

For class 1, you calculated an Average Precision (AP) score of 0.89, which indicates how well your model performs in terms of precision and recall for class 1.
For class 2, you calculated an Average Precision (AP) score of 0.81, which indicates the performance of your model for class 2. You calculate the mAP score by averaging these AP scores for all classes. In this specific scenario, you averaged the AP scores for classes 1 and 2. Challenges and Limitations of mAP mAP is a widely used metric for evaluating the performance of object detection and instance segmentation algorithms. However, it has its own set of challenges and limitations that should be considered when interpreting its results: Sensitivity to IoU Threshold: The mAP calculation is sensitive to the chosen IoU threshold for matching ground truth and predicted boxes. Different applications might require different IoU thresholds, and using a single threshold might not be appropriate for all scenarios. Uneven Distribution of Object Sizes: mAP treats all object instances equally, regardless of their sizes. Algorithms might perform well on larger objects but struggle with smaller ones, leading to an imbalance in the evaluation. You can check out this helpful resource.  Ignoring Object Categories: mAP treats all object categories with the same importance. In real-world applications, some categories might be more critical than others, and this factor isn't reflected in mAP. Handling Multiple Object Instances: mAP focuses on evaluating the detection of individual instances of objects. It might not accurately reflect an algorithm's performance when multiple instances of the same object are closely packed together. Difficulty in Handling Overlapping Objects: When objects overlap significantly, it can be challenging to determine whether the predicted bounding boxes match the ground truth. This situation can lead to inaccuracies in mAP calculations. Doesn't Account for Execution Speed: mAP doesn't consider the computational efficiency or execution speed of an algorithm. In real-time applications, the speed of detection might be as crucial as its accuracy. 
Complexity of Calculations: The mAP calculation involves multiple steps, including sorting, precision-recall calculations, and interpolation. These steps can be complex and time-consuming to implement correctly.

Mean Average Precision (mAP): Key Takeaways

Mean Average Precision (mAP) is an essential metric for evaluating object detection models' performance. Calculated through precision and recall values, mAP provides a comprehensive assessment of detection accuracy, aiding model selection, improvement, and benchmarking. mAP is a good metric to use for applications where it is important to both detect objects and avoid false positives. A high mAP score is important for ensuring that the model can reliably detect objects. It has applications in self-driving cars, visual search, medical image analysis, and many other areas. Deep learning techniques, exemplified by architectures like YOLO (You Only Look Once), aim to improve object detection performance, potentially leading to higher mAP scores in evaluations and contributing to advancements in various domains. Throughout this article, we've explored the inner workings of mAP, uncovering its mathematical underpinnings and its significance in assessing object detection performance. With this knowledge, you are better equipped to navigate the complex landscape of object detection and to make informed decisions when designing, training, and selecting models for specific applications. {{Active_CTA}}
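The article describes IoU as the overlap between a predicted and a ground-truth bounding box, and notes that mAP is sensitive to the IoU threshold used for matching. As a minimal sketch of that idea, assuming boxes are given as `(x_min, y_min, x_max, y_max)` tuples (the function names `box_iou` and `is_match` and the default threshold of 0.5 are illustrative choices, not part of the walkthrough above), IoU could be computed like this:

```python
def box_iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes.

    Each box is assumed to be (x_min, y_min, x_max, y_max).
    """
    # Corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


def is_match(pred_box, gt_box, iou_threshold=0.5):
    """Treat a prediction as a true positive when IoU reaches the threshold."""
    return box_iou(pred_box, gt_box) >= iou_threshold
```

Raising `iou_threshold` makes matching stricter, which turns some true positives into false positives and lowers the resulting AP; this is the sensitivity discussed in the limitations above.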

November 5

5 min

Guide to Vision-Language Models (VLMs)

For quite some time, the idea that artificial intelligence (AI) could understand visual and textual cues as effectively as humans seemed far-fetched and unimaginable.  However, with the emergence of multimodal AI, we are seeing a revolution where AI can simultaneously comprehend various modalities, such as text, image, speech, facial expressions, physiological gestures, etc., to make sense of the world around us. The ability to process multiple modalities has opened up various avenues for AI applications. One such exciting application of multimodal AI is Vision-Language Models (VLMs). They can process and understand the modalities of language (text) and vision (image) simultaneously to perform advanced vision-language tasks, such as Visual Question Answering (VQA), image captioning, and Text-To-Image search. In this article, you will learn about:  VLM architectures VLM evaluation strategies Mainstream datasets used for developing vision-language models Key challenges, primary applications, and future trends of VLMs Let’s start by understanding what vision-language models are. What Are Vision Language Models? A vision-language model is a fusion of vision and natural language models. It ingests images and their respective textual descriptions as inputs and learns to associate the knowledge from the two modalities. The vision part of the model captures spatial features from the images, while the language model encodes information from the text. The data from both modalities, including detected objects, spatial layout of the image, and text embeddings, are mapped to each other. For example, if the image contains a bird, the model will learn to associate it with a similar keyword in the text descriptions. This way, the model learns to understand images and transforms the knowledge into natural language (text) and vice-versa. Training VLMs Techniques for building VLMs include pre-training foundation models and zero-shot learning.  
You can use transfer learning techniques such as knowledge distillation to fine-tune the models for more specific downstream tasks. These are simpler techniques that require smaller datasets and less training time while maintaining decent results. Modern frameworks, on the other hand, use various techniques to get better results, such as Contrastive learning. Masked language-image modeling. Encoder-decoder modules with transformers and more. These architectures can learn complex relations between the various modalities and provide state-of-the-art results. Let’s discuss these in detail. {{RLHF_CTA}} Vision Language Models: Architectures and Popular Models Let’s look at some VLM architectures and learning techniques that mainstream models such as CLIP, Flamingo, and VisualBert, among others, use. Contrastive Learning Contrastive learning is a technique that learns data points by understanding their differences. The method computes a similarity score between data instances and aims to minimize contrastive loss. It’s most useful in semi-supervised learning, where only a few labeled samples guide the optimization process to label unseen data points. Contrastive Learning For example, one way of understanding what a cat looks like is to place it beside a similar cat image and a dog image. Contrastive learning models learn to distinguish between a cat and a dog by identifying several features, such as facial structure, body size, and the presence of fur. The models can determine which image is closer to the original, called the “anchor,” and predict its class. CLIP is an example of a model that uses contrastive learning by computing the similarity between text and image embeddings using textual and visual encoders. It follows a three-step process to enable zero-shot predictions. Trains a text and image encoder during pretraining to learn the image-text pairs. Converts training dataset classes into captions. 
Estimates the best caption for the given input image for zero-shot prediction. CLIP Architecture ALIGN is another example that uses image and textual encoders to minimize the distance between similar embeddings using a contrastive loss function. {{light_callout_start}} Want to know how to evaluate CLIP? Head onto our blog and read Evaluating Foundation Models (CLIP) using Encord Active. {{light_callout_end}} PrefixLM PrefixLM is an NLP learning technique mostly used for model pre-training. It inputs part of the text (a prefix) and learns to predict the next word in the sequence. In Visual Language Models, PrefixLM enables the model to predict the next sequence of words based on an image and its respective prefix text. It leverages a Vision Transformer (ViT) that divides an image into a one-dimensional sequence of patches, where each patch represents a local image region. Then, the model applies convolution or linear projection over the processed patches to generate contextualized visual embeddings. For text modality, the model converts the text prefix relative to the patch into a token embedding. The encoder-decoder blocks of the transformer receive both visual embedding and token embedding. It is there that the model learns the relationships between the embeddings. SimVLM is a popular architecture utilizing the PrefixLM learning methodology. It has a simpler Transformer architecture than its predecessors, surpassing their results in various benchmarks. It uses a transformer encoder to learn image-prefix pairs and a transformer decoder to generate an output sequence. The model also demonstrates good generalization and zero-shot learning capabilities. SimVLM Architecture Similarly, VirTex uses a convolutional neural network to extract image features and a textual head with transformers to manage text prefixes. You can train the model end-to-end to predict the correct image captions by feeding image-text pairs to the textual head. 
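To make the patch step described above more concrete, here is a minimal NumPy sketch of how an image might be split into a one-dimensional sequence of patches and linearly projected. The helper names `image_to_patches` and `embed_patches`, the patch size, and the randomly initialized projection are assumptions for illustration; a real ViT learns the projection and adds position embeddings.

```python
import numpy as np


def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened square patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    rows = height // patch_size
    cols = width // patch_size

    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


def embed_patches(patches, embed_dim, seed=0):
    """Project each flattened patch with a random (untrained) linear map."""
    rng = np.random.default_rng(seed)
    projection = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
    return patches @ projection


# Toy 8x8 RGB image split into four 4x4 patches
image = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
patches = image_to_patches(image, patch_size=4)
visual_tokens = embed_patches(patches, embed_dim=16)
```

Each row of `visual_tokens` plays the role of one visual embedding that the transformer would receive alongside the token embeddings of the text prefix.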
VirTex Architecture Frozen PrefixLM While PrefixLM techniques require you to train visual and textual encoders from scratch, Frozen PrefixLM allows you to use pre-trained networks and only update the parameters of the image encoders. For instance, the architecture below shows how Frozen works using a pre-trained language model and visual encoder. The text encoder can belong to any large language model (LLM), and the visual encoder can also be a pre-trained visual foundation model. You can fine-tune the image encoder so its image representations align with textual embeddings, allowing the model to make better predictions. Frozen Architecture A more recent, state-of-the-art (SOTA) approach is Flamingo’s architecture, which uses a CLIP-like vision encoder and an LLM called Chinchilla. Flamingo keeps the LLM fixed and trains on images interleaved with text. A Perceiver Resampler condenses the vision encoder's output into a fixed number of visual tokens, which the language model then attends to. The technique results in faster inference and makes Flamingo ideal for few-shot learning. Flamingo Architecture Multimodal Fusing with Cross-Attention This method utilizes the encoders of a pre-trained LLM for visual representation learning by adding cross-attention layers. VisualGPT is a primary example that allows quick adaptation of an LLM’s pre-trained encoder weights for visual tasks. VisualGPT Architecture Practitioners extract relevant objects from an image input and feed them to a visual encoder. They feed the resulting visual representations to a decoder and initialize its weights from a pre-trained LLM. The decoder module balances the visual and textual information through a self-resurrecting activation unit (SRAU). The SRAU method avoids the issue of vanishing gradients, a common problem in deep learning where model weights fail to update due to small gradients. As such, VisualGPT outperforms several baseline models, such as the plain transformer, the Attention-on-Attention (AoA) transformer, the X-transformer, etc.
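As a rough illustration of the cross-attention idea described above, the sketch below lets text-side hidden states attend over image-side features with a single attention head in NumPy. It is not VisualGPT's or Flamingo's actual layer (it omits gating, multiple heads, and the SRAU), and the shapes, weight matrices, and function names are assumptions chosen for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(text_states, image_states, w_q, w_k, w_v):
    """Single-head cross-attention: text queries attend to image keys/values."""
    queries = text_states @ w_q    # (num_text_tokens, d_head)
    keys = image_states @ w_k      # (num_image_tokens, d_head)
    values = image_states @ w_v    # (num_image_tokens, d_head)

    # Scaled dot-product scores between every text token and every image token
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)

    # Each text token receives a weighted mix of image features
    return weights @ values, weights


rng = np.random.default_rng(0)
text_states = rng.normal(size=(5, 32))    # 5 text tokens, hidden size 32
image_states = rng.normal(size=(9, 32))   # 9 visual tokens, hidden size 32
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(32, 16)) for _ in range(3))

attended, attention_weights = cross_attention(text_states, image_states, w_q, w_k, w_v)
```

Because the queries come from the language side and the keys and values from the vision side, the pre-trained language weights can stay largely untouched while the added layer learns how to pull in visual information.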
Masked-language Modeling (MLM) & Image-Text Matching (ITM) MLM works in language models like BERT by masking or hiding a portion of a textual sequence and training the model to predict the missing text. BERT's second pre-training objective, next-sentence prediction, involves predicting whether sentence Y follows sentence X; ITM is its multimodal counterpart. You can adapt these techniques for visual tasks. For instance, the diagram below illustrates the architecture of VisualBERT, trained on the COCO dataset. VisualBERT Architecture It augments the MLM procedure by introducing image sequences and a masked textual description. The objective is to predict the missing text based on visual embeddings. Similarly, ITM predicts whether or not a caption matches the image. No Training You can directly use large-scale pre-trained vision-language models without any fine-tuning. For example, MAGIC and ASIF are training-free frameworks that aim to predict text descriptions that align closely with the input image. MAGIC uses a specialized score based on CLIP-generated image embeddings to guide the output of language models. Using this score, an LLM generates textual embeddings that align closely with the image semantics, enabling the model to perform multimodal tasks in a zero-shot manner. ASIF uses the idea that similar images have similar captions. The model computes the similarities between the training dataset's query and candidate images. Next, it compares the query image embeddings with the text embeddings of the corresponding candidate images. Then, it predicts a description whose embeddings have the highest similarity to the embeddings of the query image, resulting in comparable zero-shot performance to models like CLIP and LiT. ASIF Prediction Strategy Knowledge Distillation This technique involves transferring knowledge from a large, well-trained teacher model to a lighter student model with fewer parameters. This methodology allows researchers to train VLMs from larger pre-trained models.
For instance, ViLD is a popular VLM developed using the knowledge distillation methodology. The model uses a pre-trained open-vocabulary image classification model as the teacher to train a two-stage detector (student). The model matches textual embeddings from a textual encoder with image embeddings. ViLD Architecture You can use knowledge distillation to transfer knowledge from the image encoder to the backbone model to generate regional embeddings automatically. Only the backbone model generates regional embeddings during inference, and the model matches them with unseen textual embeddings. The objective is to draw correct bounding boxes around objects in an image based on textual descriptions. Evaluating Vision Language Models VLM validation involves assessing the quality of the relationships between the image and text data. For example, for an image captioning model, this would mean comparing the generated captions to the ground truth description. You can use various automated n-gram-based evaluation strategies to compare the predicted labels in terms of accuracy, semantics, and information precision. A few of the key VLM evaluation metrics are mentioned below. BLEU: The Bilingual Evaluation Understudy (BLEU) metric was originally proposed to evaluate machine translation tasks. It computes the precision of the target text compared to a reference (ground truth) by considering how many words in the candidate sentence appear in the reference.  ROUGE: Recall-Oriented Understudy for Gisting Evaluation (ROUGE) computes recall by considering how many words in the reference sentence appear in the candidate. METEOR: Metric for Evaluation of Translation with Explicit Ordering (METEOR) computes the harmonic mean of precision and recall, giving more weight to recall and multiplying it with a penalty term. The metric is an improvement over others that work with either Precision or Recall, as it combines information from both to give a better evaluation. 
CIDEr: Consensus-based Image Description Evaluation (CIDEr) compares a target sentence to a set of human sentences by computing the average similarity between reference and target sentences using TF-IDF scores. Now that you have learned the evaluation metrics pertinent to Vision-Language Models (VLMs), it's essential to know how to curate datasets for these models. The right dataset provides fertile ground for training and validating VLMs and is pivotal in determining the models' performance across diverse tasks. Datasets for Vision Language Models Collecting training data for VLMs is more challenging than for traditional AI models since it involves the collection and quality assurance of multiple data modalities. Below is a list of several datasets combining image and text data for multimodal training. LAION-5B: Practitioners use the LAION-5B dataset for building large pre-trained VLMs. The dataset contains more than five billion image-text pairs filtered using CLIP, with descriptions in English and other languages, thereby catering to a multilingual domain. PMD: Public Multimodal Datasets (PMD) originally appeared in the FLAVA paper and contains 70 million image-text pairs. The data is a collection from other large-scale datasets, such as COCO, Conceptual Captions, RedCaps, etc. This dataset is a reservoir of multimodal data that fosters robust model training. VQA: Experts use the VQA dataset to fine-tune pre-trained VLMs for downstream VQA and visual reasoning tasks. The dataset contains over 200,000 images with five questions per image, ten ground-truth answers, and three incorrect answers per question. ImageNet: ImageNet contains over 14 million images with annotations categorized according to the WordNet hierarchy. It’s helpful in building models for simple downstream tasks, such as image classification and object recognition. Despite the availability of high-quality multimodal datasets, VLMs can face significant challenges during the model development process.
Let’s discuss them below. {{Training_data_CTA}} Limitations of Vision Language Models Although VLMs are powerful in understanding visual and textual modalities to process information, they face three primary challenges: Model complexity Dataset bias Evaluation difficulties Model Complexity Language and vision models are quite complex on their own, and a combination of the two only worsens the problem. The complexity of these models raises additional challenges in acquiring powerful computing resources for training, the collection of large datasets, and deployment on weak hardware such as IoT devices. Dataset Bias Dataset biases occur when VLMs memorize deep patterns within training and test sets without solving anything. For instance, training a VLM on images curated from the internet can cause the model to memorize specific patterns and not learn the conceptual differences between various images. Evaluation Strategies The evaluation strategies discussed above only compare a candidate sentence with reference sentences. The approach assumes that the reference sentences are the only ground truths. However, there can be several ground-truth descriptions for a particular image. Although consensus-based metrics like CIDEr account for the issue, using them becomes challenging when consensus is low for particular images. Another challenge is when a generic description applies to several images. Spurious Correlation As the illustration shows, a VLM can annotate or retrieve several relevant images that match the generic caption. However, in reality, the model is nothing more than a bag-of-words. All it’s doing is considering words, such as ‘city,’ ‘bus,’ ‘lights,’ etc., to describe the image instead of actually understanding the caption's sequential order and true contextual meaning. Furthermore, VLMs used for VQA can generate highly confident answers to nonsensical questions. 
For instance, asking a VLM, “What color is the car?” for an image that contains a white horse will generate the answer as “white” instead of pointing out that there isn’t a car in the picture. Lastly, VLMs lack compositional generalization. It means their performance decreases when processing novel concepts. For example, a VLM can fail to recognize a yellow horse as a category since it’s rare to associate the color yellow with horses. Despite many development and deployment challenges, researchers and practitioners have made significant progress in adopting VLMs for solving real problems. Let’s discuss them briefly below. Applications of Vision Language Models While most VLMs discussed earlier are helpful in captioning images, their utility extends to a variety of other domains that leverage the capability to bridge visual and linguistic modalities. Here are some additional applications: Image Retrieval: Models such as FLAVA allow users to navigate through image repositories by helping them find relevant photos based on linguistic queries. A relevant example is an e-commerce site. Visitors can describe what they’re looking for in a search bar, and a VLM will show the suitable options on the screen. This application is also popular on smartphones, where users can type in keywords (landscapes, buildings, etc.) to retrieve associated images from the gallery. Generative AI: Image generation through textual prompts is a growing domain where models like DALL-E allow users to create art or photos based on their descriptions. The application is practical in businesses where designers and inventors want to visualize different product ideas. It also helps create content for websites and blogs and aids in storytelling. Segmentation: VLMs like SegGPT help with segmentation tasks, such as instance, panoptic, semantic segmentation, etc. SegGPT segments an image by understanding user prompts and exploits a distinct coloring scheme to segment objects in context. 
For instance, users can ask to segment a rainbow from several images, and SegGPT will annotate all rainbows efficiently. {{light_callout_start}} Read our detailed article on SegGPT: Segmenting everything in context [Explained] to learn more about how the model works. {{light_callout_end}} Future Research The following are a few crucial future research directions in the VLM domain: Better Datasets The research community is working on building better training and test datasets to help VLMs with compositional understanding. CLEVR is one example of this effort. CLEVR Dataset As the illustration shows, it contains images of novel shapes, colors, and corresponding questions that allow experts to test a VLM’s visual reasoning capacity. Better Evaluation Methods Evaluation challenges warrant in-depth research into better evaluation methods for building more robust VLMs. One alternative is to test VLMs for individual skills through the ARO benchmark. Attribute identification, relational reasoning, and word-order sensitivity (ARO) are three skills that VLMs must master. ARO Dataset The illustration above explains what ARO entails in different contexts. Using such a dataset, experts can analyze what VLMs learn and how to improve the outcomes. Robotics Researchers are also using VLMs to build purpose-specific robots. Such robots can help navigate environments, improve warehouse operations in manufacturing by monitoring items, and enhance human-machine interaction by allowing robots to understand human gestures, such as facial expressions, body language, voice tones, etc. Medical VQA VLMs’ ability to annotate images and recognize complex objects can help healthcare professionals with medical diagnoses. For example, they can ask VLMs critical questions about X-rays or MRI scans to determine potential problems early. Vision-Language Models: Key Takeaways Visual language modeling is an evolving field that holds great promise for the AI industry. 
Below are a few critical points regarding VLMs: Vision-language models are a multimodal architecture that simultaneously comprehends image and text data modalities. They use CV and NLP models to correlate information (embeddings) from the two modalities. Several VLM architectures exist that aim to relate visual semantics to textual representations. Although users can evaluate VLMs using automated scores, better evaluation strategies are crucial to building more reliable models. VLMs have many industrial use cases, such as robotics, medical diagnoses, chatbots, etc. {{try_encord}}

November 3

5 min

Understanding the United States Executive Order on Safe, Secure, and Trustworthy AI

On October 30, 2023, the White House announced an Executive Order issued by President Joe Biden aimed at fostering a balanced approach toward the development and deployment of Artificial Intelligence (AI) to ensure it's safe, secure, and trustworthy. It acknowledges the potential of AI technologies in solving urgent societal challenges and enhancing prosperity, productivity, innovation, and security. However, the Executive Order also highlights the potential adverse effects that irresponsible use of artificial intelligence could have, such as fraud, discrimination, bias, misinformation, and threats to national security, and stresses the need for guardrails. The Order calls for a collective effort from the federal government (including the Department of Homeland Security, the Department of Health and Human Services, the Department of Energy, the Department of Commerce, and more), the private sector, academia, and civil society to mitigate these harms while maximizing the benefits of AI. Here are the three main guiding principles behind this Executive Order: Safety and security: The Order emphasizes the need for robust, reliable, repeatable, and standardized evaluations of AI systems. It mandates addressing security risks, including those related to biotechnology, cybersecurity, and critical infrastructure. The document also highlights the importance of testing, post-deployment monitoring, and effective labeling to ensure that AI systems are ethically developed, securely operated, and compliant with federal laws. Responsible innovation: It encourages promoting responsible innovation, competition, and collaboration to maintain U.S. leadership in AI. The Order calls for investments in AI-related education, training, development, research, and tackling intellectual property issues. 
It also emphasizes creating a fair, open, and competitive AI ecosystem and marketplace, supporting small developers, and addressing potential risks from dominant firms' control over critical assets like semiconductors, computing power, cloud storage, and data. Supporting American workers: As AI creates new jobs and industries, the Order stresses adapting job training and education to support a diverse workforce. It advises against deploying AI in ways that undermine rights, worsen job quality, or cause harmful labor-force disruptions. The Order encourages building the next steps in AI development based on the views of workers, labor unions, educators, and employers to support responsible AI uses that improve workers' lives and augment human work. In subsequent sections of this article, we will examine the actions among the AI directives in this Executive Order. In the meantime, let’s explore how we got here. How did we get here? The History of AI Regulation in the United States of America President Biden's Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence is the result of years of developing insights and responses to emerging technologies in the field of AI. To show how we came to this important turning point, this section will walk you through the path of AI regulation in the United States. Early Engagement, Regulating Open- and Closed-Source LLMs Navigating the spectrum between open and closed LLM systems is critical for effective AI policy. Striking the right balance will promote innovation and competition while managing the potential risks of AI. By 2024, the National Institute of Standards and Technology (NIST) under the U.S. Department of Commerce will determine whether they will allow the release of open model weights under public licenses. This, of course, is bound to stir up discussions about treating open model weights as free speech and accusations of lobbying from big tech companies to protect their moat. 
As these LLM systems began permeating various sectors, the need for a regulatory framework became apparent. Policymakers grappling with the rapid advancements in AI models and tools started the conversation about balancing the promotion of US global leadership in AI against the risks to individuals, businesses, and national security. Legislative Efforts The early engagement translated into legislative action, with the U.S. House and Senate committees holding numerous hearings on AI. The hearings included big names like Elon Musk, CEO of SpaceX, Tesla, and X (formerly known as Twitter); Mark Zuckerberg, CEO of Meta; Microsoft co-founder Bill Gates; and Sam Altman, CEO of OpenAI, the company behind the AI chatbot ChatGPT. Biden Administration’s Early Steps In October 2022, the Biden administration issued a non-binding AI Bill of Rights, marking an early step towards delineating the government’s stance on governing automated systems, focusing on civil rights protection. Later, on September 12, 2023, several tech companies signed voluntary agreements to follow the rules President Biden set out for AI. This was the first step toward encouraging responsible AI use through partnerships with the private sector. SAFE Innovation—A Values-Based Framework and New Legislative Process Despite strong bipartisan interest, the challenge of passing comprehensive AI legislation continued, paving the way for the SAFE Innovation Framework proposal by Senate Majority Leader Chuck Schumer. The Executive Order The culmination of these efforts and the evolving understanding of AI's impact led to the issuance of the Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence. This Executive Order embodies a more structured approach to AI governance, reflecting the administration’s commitment to promoting responsible AI development and deployment while addressing the associated potential risks of AI. What are the Executive Order Directives? 
We have summarized the Executive Order Directives below so you can easily skim through and find the directives and the corresponding actions relevant to you. Directive 1: New Standards for AI Safety and Security Actions: Require developers to share safety test results with the U.S. government. Develop standards and tools to ensure AI systems are safe and secure. Protect against AI-enabled risks to national security and public health. Establish strong standards for biological synthesis screening. Directive 2: Protecting Americans’ Privacy Actions: Prioritize federal support for privacy-preserving techniques in AI. Strengthen privacy-preserving research and technologies. Evaluate how agencies collect and use commercially available data. Develop guidelines for federal agencies to evaluate privacy-preserving techniques. Directive 3: Advancing Equity and Civil Rights Actions: Offer advice to stop AI programs from making discrimination worse. Address algorithmic discrimination through training and coordination. Ensure fairness in the criminal justice system's use of AI. Directive 4: Standing Up for Consumers, Patients, and Students Actions: Make advances in the responsible use of AI in healthcare. Shape AI’s potential in education. Protect consumers and patients while ensuring AI benefits. Directive 5: Promoting Innovation and Competition Actions: Catalyze AI research and provide grants in vital areas. Promote a fair and competitive AI ecosystem. Streamline visa criteria for skilled immigrants. Directive 6: Supporting Workers Actions: Develop principles and best practices for worker protection. Produce a report on AI’s labor-market impacts. Directive 7: Advancing American Leadership Abroad Actions: Expand collaborations on AI at bilateral, multilateral, and multistakeholder levels. Accelerate the development of AI standards with international partners. Promote responsible AI development abroad. 
Directive 8: Ensuring Responsible and Effective Government Use of AI Actions: Issue guidance for agencies’ AI use. Streamline AI product and service acquisition. Accelerate the hiring of AI professionals in government. Now that we've discussed the key directives of the US Executive Order on AI, let's compare and contrast them with the European Union's approach to AI regulation, known as the EU Artificial Intelligence Act (AI Act). US Executive Order on Safe, Secure, and Trustworthy AI vs European Union AI Act In the table below, we present a comparative overview of the key aspects and focus areas of the US Executive Order on Safe, Secure, and Trustworthy AI and the EU Artificial Intelligence Act (AI Act). {{gray_callout_start}} Read more about the takes on “Proposed AI Regulation: EU AI Act, UK's Pro-Innovation, US AI Bill of Rights” from Encord’s co-founder and president. {{gray_callout_end}} As you saw in the comparison, while both regulations aim to foster a safe and responsible AI ecosystem, they approach AI governance from slightly different vantage points, reflecting the distinct priorities and regulatory philosophies of the US and the EU. {{light_callout_start}} What does the European AI Act mean for you, an AI developer? Learn more from this article by Ulrik Stig Hansen, Encord’s co-founder and president. {{light_callout_end}} Conclusion Increased involvement from policymakers, legislative efforts, and joint initiatives between the public and private sectors have all contributed to the current AI regulatory landscape. The issuance of the Executive Order represents a significant milestone in the ongoing journey towards establishing a robust framework for AI governance in the U.S. aimed at harnessing the benefits of AI while mitigating its potential perils. But will regulations stifle the efforts of open-source AI? Or would it encourage an ecosystem of open innovation while regulating the risks at the application layer? 
In this article, you learned about the evolution of AI regulation in the U.S., focusing on key legislative efforts, the Biden Administration's early steps towards AI governance, and the collaborative initiatives that marked the journey towards the recent Executive Order. We traced the regulatory steps that led to the Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence: actions taken by lawmakers, voluntary commitments from tech companies, and the release of values-based frameworks like the SAFE Innovation Framework. Finally, we compared different aspects of the directives to the proposed European Union AI Act, where you saw clearly different priorities and regulatory philosophies between the United States and the European Union. {{gray_callout_start}} Get access to our new AI Act Learning Pack, which includes all the key resources you need to ensure forward compatibility. {{gray_callout_end}} {{try_encord}}

November 1

5 min

MiniGPT-v2 Explained

Meta has made an impressive foray into multimodal models through the launch of MiniGPT-v2. This model is capable of efficiently handling various vision-language tasks using straightforward multi-modal instructions. The performance of MiniGPT-v2 is remarkable, demonstrating its prowess across numerous vision-language tasks. The results rival both OpenAI's multimodal GPT-4 and Microsoft’s LLaVA, thereby establishing a new standard in terms of state-of-the-art accuracy, especially when compared to other generalist models in the vision-language domain. The fusion of natural language processing and computer vision has given rise to a new breed of machine learning models with remarkable capabilities. MiniGPT-v2 is one such model that seeks to serve as a unified interface for a diverse set of vision-language tasks. In this blog, we'll explore the world of MiniGPT-v2, understanding its architecture, core concepts, applications, and how it compares to its predecessor. But first, let's take a step back and appreciate the journey of multimodal models like MiniGPT. A Brief Look Back: The Rise of MiniGPT MiniGPT-v2 builds upon the success of its predecessor. Earlier versions of GPT (Generative Pre-Trained Transformer) and large language models (LLMs) like BERT laid the foundation for natural language understanding. These models achieved groundbreaking results in various language-related applications. With MiniGPT-v2, the focus shifts to integrating visual information into the mix. The vision-language multi-task learning landscape poses unique challenges. Imagine a scenario where you ask a model, "Identify the number of pedestrians in the image of a street." Depending on the context, the answer could involve describing the person's spatial location, identifying the bounding box around them, or providing a detailed image caption. The complexities inherent in these tasks require a versatile approach. 
In this context, large language models have shown their mettle in various language-related applications, including logical reasoning and common-sense understanding. Their success in natural language processing (NLP) has motivated AI researchers to extend their capabilities to the vision-language tasks, giving rise to models like MiniGPT-v2. {{RLHF_CTA}} Core Concepts of MiniGPT-v2 MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning At the heart of MiniGPT-v2 lies a well-thought-out architecture. This model comprises three main components: Visual Backbone At the foundation of MiniGPT-v2 is its visual backbone, inspired by the Vision Transformer (ViT). The visual backbone serves as the model's vision encoder. It processes the visual information contained within images. This component is responsible for understanding and encoding the visual context of the input, enabling the model to "see" the content of images. One distinctive aspect is that the visual backbone is frozen during training. This means that the model's vision encoder doesn't get updated as the model learns from the dataset. It remains constant, allowing the model to focus on refining its language understanding capabilities. Linear Projection Layer The linear projection layer in MiniGPT-v2 plays a crucial role in enabling the model to efficiently process high-quality images. As image resolution increases, the number of visual tokens also grows significantly. Handling a large number of tokens can be computationally expensive and resource-intensive. To address this, MiniGPT-v2 employs the linear projection layer as a practical solution. The key idea here is to concatenate multiple adjacent visual tokens in the embedding space. By grouping these tokens together, they can be projected as a single entity into the same feature space as the large language model. This operation effectively reduces the number of visual input tokens by a significant factor. 
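The token-merging idea described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the group size of 4, and the toy dimensions are all hypothetical, and "adjacent" here simply means neighbouring positions in a flat token sequence.

```python
import random


def merge_adjacent_tokens(tokens, group_size):
    """Concatenate every `group_size` adjacent token vectors into one longer vector."""
    if len(tokens) % group_size != 0:
        raise ValueError("token count must be divisible by group_size")
    merged = []
    for start in range(0, len(tokens), group_size):
        group = tokens[start:start + group_size]
        # Flatten the group so its vectors sit side by side in one list.
        merged.append([value for token in group for value in token])
    return merged


def linear_project(vectors, weight, bias):
    """Apply y = xW + b to each vector; `weight` has shape (in_dim, out_dim)."""
    out_dim = len(bias)
    return [
        [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j] for j in range(out_dim)]
        for x in vectors
    ]


# Toy sizes (hypothetical, far smaller than a real model).
random.seed(0)
num_tokens, vis_dim, llm_dim, group_size = 16, 8, 12, 4
visual_tokens = [[random.gauss(0, 1) for _ in range(vis_dim)] for _ in range(num_tokens)]
weight = [[random.gauss(0, 0.02) for _ in range(llm_dim)] for _ in range(vis_dim * group_size)]
bias = [0.0] * llm_dim

merged = merge_adjacent_tokens(visual_tokens, group_size)  # fewer, wider vectors
llm_inputs = linear_project(merged, weight, bias)          # mapped to the LLM width
print(len(visual_tokens), "->", len(llm_inputs), "tokens of width", len(llm_inputs[0]))
```

In this sketch, 16 visual tokens of width 8 become 4 inputs of width 12, so the language model sees a quarter as many visual tokens.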
As a result, MiniGPT-v2 can process high-quality images more efficiently during the training and inference stages. Large Language Model  The main language model in MiniGPT-v2 comes from LLaMA-2 and works as a single interface for different vision language inputs. This pre-trained model acts as the bridge between visual and textual information, enabling MiniGPT-v2 to perform a wide range of vision-language tasks. The advanced large language model is not specialized for a single task but is designed to handle diverse instructions, questions, and prompts from users. This versatility is achieved using task-specific tokens, a key innovation in MiniGPT-v2. These tokens provide task context to the model, allowing it to understand the image-text pair and the nature of the task at hand. This adaptability extends to tasks that require spatial understanding, such as visual grounding. For instance, when the model needs to provide the spatial location of an object, it can generate textual representations of bounding boxes to denote the object's spatial position within an image. The use of task-specific tokens greatly enhances MiniGPT-v2's multi-task understanding during training. By providing a clear context for different tasks, it reduces ambiguity and makes each task easily distinguishable, improving learning efficiency. {{light_callout_start}} Read the original research paper by Meta, authored by Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny available on Arxiv: MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. {{light_callout_end}}  Demo of MiniGPT-v2 It's one thing to talk about it, but it's another to see MiniGPT-v2 in action. Here's a sneak peek at how this model handles various vision-language tasks. Grounding The MiniGPT-v2 works well on image descriptions. 
When prompted to “Describe the above image”, the model not only describes the image but performs object detection as well. Visual Question Answering (VQA) When prompted to find a place in the room to hide in the game of hide and seek, MiniGPT-v2 can understand the prompt, assess the image well, and provide a suitable answer. Detection In the case of object detection, MiniGPT-v2 can identify large objects. But in the case of small objects, it resorts to describing the environment or the image. Applications of MiniGPT-v2 MiniGPT-v2's versatility shines through in its applications. It's not just about understanding the theory; it's about what it can do in the real world. Here are some of the key applications: Image Description: MiniGPT-v2 can generate detailed image descriptions. Visual Question Answering: It excels at answering complex visual questions. Visual Grounding: The model can pinpoint the locations of objects in images. Referring Expression Comprehension: It accurately understands and responds to referring expressions. Referring Expression Generation: It can generate referring expressions for objects in images. Object Parsing and Grounding: MiniGPT-v2 can extract objects from text and determine their bounding box locations. {{light_callout_start}} The open-source code for MiniGPT-4 and MiniGPT-v2 is available on GitHub.{{light_callout_end}}  Comparison with Predecessor, MiniGPT-4 To gauge MiniGPT-v2's progress, it's important to compare it with its predecessor, MiniGPT-4. The key distinction between the two lies in their performance and capabilities within the domain of vision-language multi-task learning. MiniGPT-v2, designed as an evolution of MiniGPT-4, surpassed its predecessor in several important aspects: Performance: Across a spectrum of visual question-answering (VQA) benchmarks, MiniGPT-v2 consistently outperformed MiniGPT-4. For instance, on OKVQA, MiniGPT-v2 exhibited a remarkable 20.3% increase in top-1 accuracy compared to its predecessor. 
Referring Expression Comprehension: MiniGPT-v2 demonstrated superior performance on referring expression comprehension (REC) benchmarks, including RefCOCO, RefCOCO+, and RefCOCOg. Adaptability: MiniGPT-v2, particularly the "chat" variant trained in the third stage, showed higher performance compared to MiniGPT-4. The third-stage training's focus on improving language skills translated into a substantial 20.7% boost in top-1 accuracy on challenging benchmarks like VizWiz. Comparison with SOTA The authors extensively evaluated the performance of the model, setting it against the backdrop of established state-of-the-art (SOTA) vision-language models. They conducted a rigorous series of experiments across diverse tasks, encompassing detailed image/grounded captioning, visual question answering, and visual grounding. MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning MiniGPT-v2 showcased consistent performance that firmly established its position at the forefront. In comparison with previous vision-language generalist models such as MiniGPT-4, InstructBLIP, BLIP-2, LLaVA, and Shikra, MiniGPT-v2 emerges as a strong performer, setting new standards in this domain. Visual Spatial Reasoning (VSR) serves as an exemplary case where MiniGPT-v2 not only outperforms MiniGPT-4 but does so with a substantial 21.3% lead. In the VSR benchmark, MiniGPT-v2 surpasses InstructBLIP by 11.3% and leaves LLaVA trailing by 11.7%. These results underscore MiniGPT-v2's prowess in complex visual question-answering tasks. {{light_callout_start}} Interested in evaluating foundation models? Check out our 2-part series on evaluating foundation models using Encord Active. {{light_callout_end}}  Conclusion The emergence of vision language models like MiniGPT-v2 represents a significant step forward in computer vision and natural language processing. 
It's a testament to the power of large language models and their adaptability to diverse tasks. As we continue to explore the capabilities of MiniGPT-v2, the possibilities for vision-language tasks are expanding. References Official website: https://minigpt-v2.github.io/ GitHub Code: https://github.com/Vision-CAIR/MiniGPT-4 HuggingFace Space: https://huggingface.co/spaces/Vision-CAIR/MiniGPT-v2 Dataset: https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align {{RLHF_CTA}}

October 30

5 min

Top Multimodal Annotation Tools

Building powerful artificial intelligence (AI) models requires a robust data processing workflow. Most importantly, it includes accurate data annotation to build high-quality datasets suitable for training. It’s a crucial step for supervised machine learning, where models learn complex patterns from annotated data and use the knowledge to predict new, unseen samples. However, data annotation is a challenging task. Due to ever-increasing data volume, diverse data modalities, and the scarcity of high-quality domain-specific datasets, it is difficult for organizations and practitioners to streamline their machine learning (ML) development process.  In multimodal learning, annotation involves labeling several data types, such as images, text, and audio. Multimodal data annotation tools enable practitioners to conduct auto-annotation for various objects, from 2D images, videos, Digital Imaging and Communications in Medicine (DICOM) data, and geospatial information to 3D point clouds.  This article discusses the significance of multimodal annotation, different data annotation types, techniques, and challenges, and introduces some popular multimodal annotation tools. It also explains the factors businesses must consider before investing in an annotation tool and concludes with a few key takeaways. Significance of Multimodal Annotation Tools Global data production is increasing. Numerous public and proprietary data sources are curated every day. We need tools to organize and transform this data at scale to make it suitable for downstream AI tasks. This is why data annotation tools are in high demand. In fact, the global data annotation tools market is projected to grow to USD 14 billion by 2035, compared to USD 1 billion in 2022.  Since real-world data is multimodal, practitioners use multimodal annotation tools to curate datasets. Let’s look at a few factors that make these tools so important. 
Efficient Model Training The primary reason to use a multimodal annotation tool is to automate annotation processes for different modalities since manual data labeling is prone to human error and time-consuming. Such errors diminish training data quality and can lead to poor downstream models with low predictive potential. {{Training_data_CTA}} High-quality Data Curation A robust data curation workflow allows labelers to create, organize, and manage data. With multimodal annotation tools, labelers can quickly label data containing multiple modalities and add relevant categorizations. Data teams can readily feed such datasets into models and build high-quality ML pipelines. Fine-tuning Foundation Models For AI to address specific challenges, general-purpose multimodal foundation models often need to be fine-tuned. This requires carefully curated data that represents the particular business problem. Multimodal annotation tools help labelers label domain-specific data to train AI models for downstream tasks. Flexibility in Model Applications Multimodal annotation enables models to understand information from diverse modalities. This allows practitioners to use AI to the fullest extent in domains such as autonomous driving, medical diagnosis, emotion recognition, etc. However, annotating datasets for such diverse domains is challenging. Automated multimodal annotation tools provide various labeling solutions, such as labeling and bounding box annotation for object detection and frame classification and complex polygon annotations for segmentation tasks. Now that we have set the premise for multimodal annotation, let’s explore the technical aspects of this topic by discussing various data annotation types in the next section. Data Annotation Types and Techniques Before moving on to the list of the top multimodal annotation tools, let’s first consider the types of annotation techniques labelers use to categorize, classify, and annotate different modalities. 
Image Annotation Practitioners annotate images to help machine learning models learn to recognize different objects or entities they see in an image. This process helps with several AI tasks, such as image classification, image segmentation, and facial recognition.  Common image annotation techniques are: Bounding Box Annotation A bounding box is a rectangular shape drawn around the object of interest we want a model to recognize. For instance, labelers can draw rectangles around vehicles and people to train a model for classifying objects on the road. It is useful in cases where precise segmentation is not required, e.g., human detection in surveillance footage. Bounding Box Annotation Example in Encord Annotate Semantic Segmentation It’s a more granular approach where practitioners label each pixel in an image to classify different regions. They draw a closed boundary around the object. Every pixel within the boundary is assigned a single label. For instance, in the illustration below, semantic segmentation draws boundaries for person, bicycle, and background. Example of Image Segmentation in Encord Annotate {{light_callout_start}} Want to learn more about semantic segmentation? Read our detailed guide on Introduction to Semantic Segmentation. {{light_callout_end}} 3D Cuboid Annotation Cuboids are similar to bounding boxes, but instead of two-dimensional rectangles, they wrap three-dimensional cuboids around the object of interest. 3D Cuboid Annotation Example Polygon Annotation This image annotation technique involves drawing 2D shapes or polylines around the edges of the objects of interest for more pixel-perfect precision. Small lines trace a series of x-y coordinates along the edges of objects, making this annotation fine-tuned to various shapes. Polygon and Polyline Annotation Examples in Encord Annotate Keypoint Annotation This annotation involves labeling particular anchor points or landmarks on an object at key locations. 
These anchor points can track changes in an object's shape, size, and position. It is helpful in tasks like human posture detection. Primitive (Keypoint or Skeleton) Annotation in Encord Annotate Text Annotation Text annotation helps practitioners extract relevant information from text snippets by adding tags or assigning labels to documents or different sections or phrases in a document. Prominent text annotation techniques include: Sentiment Annotation Sentiment annotation is used in tasks like user review classification. Text documents are labeled based on the sentiment they represent, such as positive, negative, or neutral. The labels can be more granular depending on the task requirement, e.g., “Angry,” “Disgust,” “Happy,” “Elated,” “Sad,” etc. Text Classification It categorizes documents or longer texts into sub-topics or classes. It suits various domains, like legal, finance, and healthcare, to organize and filter documents or smaller pieces of text. Entity Annotation It concerns labeling entities within the text, such as names of people, organizations, countries, etc. Natural language processing (NLP) models use labeled entities to learn patterns in text and perform various tasks. Named Entity Recognition Example Parts-of-speech (POS) Tagging POS tagging involves labeling grammatical aspects, such as nouns, verbs, adjectives, etc., within a sentence. Audio Annotation Like text and image annotation, audio annotation refers to labeling audio clips, verbal speech, etc., for training models to interpret audio data. Below are a few common methods. Speech Transcription Transcription annotation involves converting the entire speech in an audio or video into text and applying different tags and notes to the text. The annotated text helps train ML models to accurately convert speech to text. Audio Classification Audio classification involves assigning a single label to an audio clip. 
For instance, for emotion classification, these labels could be sentiments like happy, sad, angry, etc.; for music classification, these labels can be genres like Jazz, Rock, etc. Sound Segment Labeling This involves labeling different segments of an audio wave according to the task at hand. For instance, these labels could differentiate between different instruments and human vocals in a song or tag noise, silence, and sound segments in a clip. Video Annotation Video annotation is similar to image annotation since videos are a sequence of static frames. It involves labeling actions, tracking objects, identifying locations, etc., across video frames. Below are a few ways to annotate videos. Object Tracking Practitioners label moving objects by tracking their positions in each frame. This is usually done by drawing bounding boxes around the object in each frame. Video Segmentation Practitioners categorize the video into short clips based on what’s in each scene or changing camera angles, marking separate boundaries for background and foreground objects across video frames. Location Video annotators identify and tag the coordinates of objects of interest in video clips for training models that can recognize particular locations. {{light_callout_start}} Want to know more about video annotation? Then head to our blog to read The Full Guide to Video Annotation for Computer Vision. {{light_callout_end}} Challenges of Multimodal Annotation Several challenges make multimodal annotation a difficult task for organizations and practitioners. Below are a few problems that teams face when labeling multimodal datasets: Data Complexity: As data variety and volume increase, it becomes difficult to segregate different data categories and identify the correct annotation technique. In particular, the emergence of new data types, such as LiDAR and geospatial data, makes identifying the appropriate labeling method challenging. 
Need for Specialized Skills: Expert intervention is necessary when labeling highly domain-specific data, such as medical images. The expertise required can sometimes be niche, making the process even more intricate. Absence of Universal Tools: Organizations must find different tools to perform various annotation tasks, as no single tool can meet all objectives. Correlations between Modalities: Practitioners must identify relationships between different modalities for correct labeling. For instance, labeling an image may require listening to an audio clip to understand the context. Individually annotating the two modalities can lead to low-quality training data. Compliance and Data Privacy: Strict data regulations mean annotators must be careful when analyzing sensitive data elements, such as faces, names, locations, emotions, etc. Incorrect labeling can introduce serious biases and potentially lead to unexpected data leakages. Cost: Developing an efficient labeling process is costly. It demands investment in robust storage platforms, specialized teams, and advanced tools. These challenges make it difficult for AI teams to curate multimodal datasets and build high-quality AI models. However, there are several multimodal annotation tools available on the market that can make annotation workflows more efficient and productive for AI teams. Let’s explore them in the next section. Top Multimodal Annotation Tools Let’s discuss the top annotation tools businesses can use to streamline their ML workflow. Encord Annotate Encord Annotate is a labeling platform that supports several multimodal annotation techniques for images, videos, Digital Imaging and Communications in Medicine (DICOM), Neuroimaging Informatics Technology Initiative (NIfTI), Electrocardiograms (ECGs), and geospatial data. 
Encord Annotate Dashboard Benefits and Key Features Supports object detection, keypoint skeleton pose, hanging protocols, instance segmentation, action recognition, frame classifications, polygons, and polyline annotation. Helps experts develop high-quality labeling workflows through human-in-the-loop integration. Allows teams to build sophisticated labeling ontologies – a structured framework for categorizing data by enabling nested classification to create granular datasets with precise ground truth labels. Boosts the annotation workflow through AI-assisted automation. You can train few-shot models and use only a small labeled dataset to annotate the rest of the samples. Provides an integrated platform for managing all training data through an easy-to-use interface. Features performance analytics that let teams assess annotation quality and optimize where necessary. Supports several file formats for different data types. Allows teams to use the Segment Anything Model (SAM) for auto segmentation to annotate domain-specific data instantly. Best For Teams that want to build large-scale computer vision models. Teams looking for expert support can use Encord, especially for labeling complex data. AI developers who work on diverse visual datasets and want a comprehensive annotation tool. Pricing Encord offers a free version for individuals and small teams. It also offers a team version for medium-sized enterprises that wish to scale their AI operations and an enterprise version for large-scale projects. Contact Encord’s sales team to purchase the Team or Enterprise version. LabelBox LabelBox is a multimodal platform that allows practitioners to annotate images, videos, geospatial data, text, audio, and HTML files. LabelBox Interface Benefits and Key Features Offers a single efficiency metric to measure label quality. Allows organizations to develop custom workflows according to data type. 
Provides data analytics functionality to reduce labeling costs and monitor performance. Has collaboration tools to enhance workflows across different teams. Best For Teams that want to economize on labeling costs. Teams working on visual and natural language processing (NLP) models, which require annotating images and textual data. Pricing LabelBox offers multiple tiers at different prices, including the Free, Starter, Standard, and Enterprise versions. SuperAnnotate SuperAnnotate helps teams speed up multimodal learning projects by providing comprehensive functions for annotating text, audio, images, videos, and point cloud data. SuperAnnotate Dashboard Benefits and Key Features Offers annotation tools for building training data for large language models (LLMs), such as image captions, question answers, instructions, etc. Teams can annotate audio clips to identify speech and sound. Best For Teams struggling with project management due to poor collaboration among teammates. Building efficient multimodal data curation pipelines. Price SuperAnnotate offers a Free, Pro, and Enterprise version. Users must reach out to sales to get a price quote. Computer Vision Annotation Tool (CVAT) CVAT is a multimodal labeling tool primarily for computer vision tasks in healthcare, manufacturing, retail, automotive, etc. CVAT Interface Benefits and Key Features It supports several image annotation techniques, such as 3D cuboids, object detection, semantic segmentation, etc. Features intelligent algorithms for boosting annotation efficiency. It offers integration with the cloud for data storage. Best For Teams that want cloud-based data storage solutions. Teams looking for a specialized image annotation platform. Pricing CVAT, with cloud support, comes in three variants - Free, Solo, and Team. The Solo and Team versions cost USD 33 per month. 
VGG Image Annotation (VIA) VIA is a web-based manual annotation tool for image, audio, and video data, requiring no initial setup or configuration. VIA Interface Benefits & Key Features Supports basic image annotation methods, including bounding boxes and polygons. Also features audio, face, and video annotation techniques. Offers a list annotation capability that allows experts to label a list of images. Best For Teams that want a low-cost annotation solution for computer vision. Pricing VIA is a free, open-source tool. Basic.ai Basic.ai offers a multimodal data annotation platform for 3D LiDAR point clouds, images, and videos. Basic.ai Data Annotation Dashboard  Benefits and Key Features Offers auto-annotation, segmentation, and tracking. Offers an AI-enabled quality control process to facilitate annotation review. Best For Teams looking to streamline data annotation workflows across industries like automotive, smart city, and robotics. Pricing Basic.ai offers free and team pricing plans. Label Studio Label Studio is an open-source multimodal data annotation tool for audio, text, images, HTML, videos, and time series data. Label Studio Dashboard Benefits and Key Features Teams can perform text classification, audio segmentation, audio transcription, emotion recognition, named entity recognition, video classification, etc. Offers labeling templates for a variety of use cases. Offers imports via different formats like JSON, CSV, TSV, RAR, ZIP archives, S3, and Google Cloud Storage. Best For Large and small teams experimenting with data labeling tools for building fine-tuned AI models. Price Label Studio offers a paid enterprise edition. It also offers a free and open-source community edition. Dataloop Dataloop, the “data engine for AI,” offers a multimodal annotation platform for video, LiDAR, and sensor fusion labeling. Dataloop Platform Interface Benefits and Key Features Offers AI-assisted tools to automate labeling workflows. 
Offers tools to support internal and external labeling teams. Offers quality and validation tools to streamline annotation issues. Best For Large and small teams working in retail, agriculture, drones, and the medical industries. Price The vendor does not provide pricing. Reportedly, it starts from $85/month for 150 annotation tool hours. Supervisely Supervisely is a multimodal annotation solution with AI-assisted labeling. Supervisely Dashboard Benefits and Key Features Provides precise annotation for images and video. Features annotation tools for point cloud, LiDAR, and DICOM datasets. Uses AI to assist in custom labeling. Best for Teams looking for a labeling solution for domain-specific data. Price Supervisely offers a free community version and a paid enterprise edition. KeyLabs KeyLabs is a multimodal annotation tool with an interactive user interface for labeling graphical and video data. KeyLabs Interface Benefits and Key Features Offers a user-friendly interface for selecting appropriate annotation techniques like segmentation, classification, shape interpolation, etc. Allows users to convert data to JSON. Best For Teams looking for a tool with efficient collaboration and access control features. Price KeyLabs offers paid versions only. It has Startup, Business, Pro, and Enterprise editions. Key Factors to Consider Before Selecting the Best Annotation Tool Choosing a suitable annotation software is daunting as several options exist with different features. Here are the critical factors to prioritize: Data Modality Support: Teams must choose tools that support the data modalities they want to work on with all the suitable annotation techniques. Ease of Use: Tools with an interactive UI help teams learn new features quickly and reduce the chances of error. Security and Compliance: Organizations should opt for solutions that have robust access management and privacy protocols to ensure compliance. 
Scalability: Organizations planning to expand operations should select scalable solutions with minimal dependency on the vendor. Integration: A tool seamlessly integrating with other systems helps businesses by lowering setup time and costs. Output Format: Tools that readily convert annotated data into an appropriate format are best for teams that don’t want to spend time writing custom code for format conversion. Collaboration and Productivity Features: Businesses with large teams working on a single project should select a tool with robust project management features. Smart Annotation Techniques: Support for intelligent annotation techniques, such as active learning and overlapping labels, can help teams that want to label extensive datasets. Quality Control: Tools should allow AI experts to review annotation samples to avoid errors and maintain a high-quality dataset. Community Support: A platform with sufficient community support, helper guides, documentation, etc., is beneficial as it helps teams resolve issues quickly. Pricing: Beyond affordability, ensure the tool offers a balance of cost and functionality to give the best return on investment. Multimodal Annotation Tools: Key Takeaways Multimodal annotation refers to labeling different data modalities, such as text, video, audio, and image. The need for multimodal annotation tools will increase with rising data volumes and complexity. Automated multimodal annotation tools boost labeling speed and improve training data quality. Annotating multimodal datasets is challenging due to high costs, hidden correlations between data modalities, and the need for specialized skills. Teams must select tools to mitigate such challenges and speed up the ML model development lifecycle. There are many annotation tools available. Teams must select the one that suits them best according to price and usability. {{try_encord}}

October 27

5 min

Top Alternatives to Labelbox

Labelbox is a popular data labeling platform, offering tools for various industries and use cases.  Labelbox labels data like images, text, and documents, making it a good choice for AI and machine learning projects. Key features include data labeling, quality assurance, integration with machine learning frameworks and data management tools, and an intuitive interface.  Yet, Labelbox does come with its own set of constraints, including issues with native video rendering, restricted DICOM compatibility, and a pricing structure that may not adapt effectively to scalability. For these reasons, we will explore alternatives to Labelbox.  {{Training_data_CTA::Label data 10x faster and gain control of your training data}} Encord Encord is a leading alternative platform to build annotation workflows, curate visual data, find and fix data errors, and monitor model performance. Key Features and Benefits of Encord: Encord is a state-of-the-art AI-assisted labeling and workflow tooling platform enriched by micro-models, ideal for various annotation and labeling use cases, QA workflows, and training computer vision models. Specifically designed for computer vision applications, Encord offers native support for a wide array of annotation types, such as bounding box, polygon, polyline, instance segmentation, keypoints, classification, and much more. Encord provides use-case-specific annotations, ranging from native DICOM and NIfTI annotations for medical imaging to specialized features catering to SAR (Synthetic Aperture Radar) data in geospatial applications. Integrated MLOps workflows for computer vision and machine learning teams — to detect edge cases and gaps in your training data and generate augmented data to improve label quality. Streamlined collaboration, annotator management, and quality assurance workflows facilitate precise tracking of annotator performance and elevate label quality. 
Robust security functionality — label audit trails, encryption, FDA, CE Compliance, and HIPAA compliance. An advanced Python SDK and API access, coupled with effortless export capabilities in JSON and COCO formats, enhance flexibility and integration with external systems. Auto-find and fix dataset biases and errors like outliers, duplication, and labeling mistakes. Integrated tagging for data and labels, including outlier tagging. Employs quality metrics (data, label, and model) to assess and improve ML pipeline performance across data curation, data labeling, and model training. iMerit iMerit is a data labeling service provider known for its annotations and management solutions. Unlike traditional labeling platforms, iMerit offers a service-based approach to data annotation. iMerit Key Features and Benefits Customizable solution for annotation, analysis, categorization, and segmentation needs. Get insights from metrics such as the annotator's working hours, the number of objects per hour, and more. iMerit also provides a free trial for its users, but there is no mention of its pricing plan on its website. iMerit’s user interface may be less intuitive and user-friendly for beginners. TELUS International TELUS International, formerly Playment, is a Labelbox alternative that focuses on specialized data labeling services, offering features tailored to specific use cases, ensuring user comfort. TELUS International Key Features and Benefits TELUS International allows the creation of custom data labeling workflows, ensuring that even the most specialized projects can be accommodated. The platform has review and feedback loops to maintain the accuracy of annotations. CX support in 50+ languages across all traditional and digital channels. Integration with other tools and platforms allows workflow management and collaboration. These features accommodate the growing needs of businesses, ensuring that the platform can handle increasing data volumes and complexity.
There are limited integration options with other third-party software and systems, which may hinder the ability to streamline processes across different platforms. Potential challenges in adapting to the training data platform's interface and functionalities, requiring additional training datasets and support for users to fully utilize its capabilities. CVAT CVAT, or Computer Vision Annotation Tool, is an open-source platform tailored for data annotation, particularly in the field of computer vision. It stands out as a community-driven solution for data labeling. CVAT's Key Features and Benefits It's a fantastic choice for startups, research projects, and academic initiatives, thanks to its open-source nature. CVAT is a cost-effective and highly adaptable alternative to Labelbox. Being open-source, CVAT encourages community contributions and customization. It's a collaborative tool, making it accessible for a wide range of users, from newbies to pros. The process of dataset curation, annotation, training, and dataset improvement is the heart of data-centric AI. CVAT has capabilities for bounding boxes, polygons, and keypoint labeling. Users can adapt CVAT to their specific needs through custom plugins, tailored workflows, or support for new data types. While CVAT offers a wide range of annotation tools, it does not have all the advanced features that some users may require for their specific annotation tasks.

October 26

5 min


Software To Help You Turn Your Data Into AI

Forget fragmented workflows, annotation tools, and Notebooks for building AI applications. Encord Data Engine accelerates every step of taking your model into production.