Encord Blog

Featured Blog

How To Fine-Tune Segment Anything

Computer vision is having its ChatGPT moment with the release of the Segment Anything Model (SAM) by Meta last week. Trained on over 1 billion segmentation masks, SAM is a foundation model for predictive AI use cases rather than generative AI. While it has shown an incredible amount of flexibility in its ability to segment across wide-ranging image modalities and problem spaces, it was released without "fine-tuning" functionality. This tutorial outlines the key steps to fine-tune SAM via its mask decoder, in particular describing which functions from SAM to use to pre/post-process the data so that it's in good shape for fine-tuning.

What is the Segment Anything Model (SAM)?

The Segment Anything Model (SAM) is a segmentation model developed by Meta AI. It is considered the first foundation model for computer vision. SAM was trained on a huge corpus of data containing 11 million images and over 1 billion masks, making it extremely powerful. As its name suggests, SAM is able to produce accurate segmentation masks for a wide variety of images.

SAM's design allows it to take human prompts into account, making it particularly powerful for human-in-the-loop annotation. These prompts can be multi-modal: points on the area to be segmented, a bounding box around the object to be segmented, or a text prompt about what should be segmented.

The model is structured into three components: an image encoder, a prompt encoder, and a mask decoder. The image encoder generates an embedding for the image being segmented, whilst the prompt encoder generates an embedding for the prompts. The image encoder is a particularly large component of the model. This is in contrast to the lightweight mask decoder, which predicts segmentation masks based on the embeddings. Meta AI has made the weights and biases of the model trained on the Segment Anything 1 Billion Mask (SA-1B) dataset available as a model checkpoint.

Learn more about how Segment Anything works in our explainer blog post Segment Anything Model (SAM) Explained.

What is Model Fine-Tuning?

Publicly available state-of-the-art models have a custom architecture and are typically supplied with pre-trained model weights. If these architectures were supplied without weights, the models would need to be trained from scratch by the users, who would need massive datasets to obtain state-of-the-art performance.

Model fine-tuning is the process of taking a pre-trained model (architecture + weights) and showing it data for a particular use case. This will typically be data that the model hasn't seen before, or that is underrepresented in its original training dataset. The difference between fine-tuning the model and starting from scratch is the starting value of the weights and biases. If we were training from scratch, these would be randomly initialized according to some strategy. In such a starting configuration, the model would 'know nothing' of the task at hand and perform poorly. By using pre-existing weights and biases as a starting point, we can 'fine-tune' the weights and biases so that our model works better on our custom dataset. For example, the information learned to recognize cats (edge detection, counting paws) will be useful for recognizing dogs.
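To make the idea concrete before we turn to SAM, here is a minimal, generic PyTorch sketch of fine-tuning (not SAM-specific): load a pre-trained backbone, freeze it, and train only a small task head on custom data. The model choice and the custom_loader variable are illustrative placeholders, not part of the tutorial that follows.

import torch
import torch.nn as nn
from torchvision import models

# Start from pre-trained weights rather than a random initialization.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the backbone so only the new head is updated during fine-tuning.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for the custom task (e.g., 2 classes).
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# custom_loader is assumed to yield (image_batch, label_batch) pairs.
for images, labels in custom_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()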
Why Would I Fine-Tune a Model?

The purpose of fine-tuning a model is to obtain higher performance on data that the pre-trained model has not seen before. For example, an image segmentation model trained on a broad corpus of data gathered from phone cameras will have mostly seen images from a horizontal perspective. If we tried to use this model for satellite imagery taken from a vertical perspective, it may not perform as well. If we were trying to segment rooftops, the model may not yield the best results. The pre-training is useful because the model has learned how to segment objects in general, so we want to take advantage of this starting point to build a model that can accurately segment rooftops. Furthermore, it is likely that our custom dataset would not have millions of examples, so we want to fine-tune instead of training the model from scratch. In short, fine-tuning lets us obtain better performance on our specific use case without incurring the computational cost of training a model from scratch.

How to Fine-Tune Segment Anything Model [With Code]

Background & Architecture

We gave an overview of the SAM architecture in the introduction. The image encoder has a complex architecture with many parameters. In order to fine-tune the model, it makes sense for us to focus on the mask decoder, which is lightweight and therefore easier, faster, and more memory efficient to fine-tune.

In order to fine-tune SAM, we need to extract the underlying pieces of its architecture (image and prompt encoders, mask decoder). We cannot use SamPredictor.predict (link) for two reasons:

- We want to fine-tune only the mask decoder.
- This function calls SamPredictor.predict_torch, which has the @torch.no_grad() decorator (link) and therefore prevents us from computing gradients.

Thus, we need to examine the SamPredictor.predict function and call the appropriate functions with gradient calculation enabled on the part we want to fine-tune (the mask decoder). Doing this is also a good way to learn more about how SAM works.

Creating a Custom Dataset

We need three things to fine-tune our model:

- Images on which to draw segmentations
- Segmentation ground truth masks
- Prompts to feed into the model

We chose the stamp verification dataset (link) since it has data that SAM may not have seen in its training (i.e., stamps on documents). We can verify that it performs well, but not perfectly, on this dataset by running inference with the pre-trained weights. The ground truth masks are also extremely precise, which will allow us to calculate accurate losses. Finally, this dataset contains bounding boxes around the segmentation masks, which we can use as prompts to SAM. An example image is shown below. These bounding boxes align well with the workflow that a human annotator would go through when generating segmentations.

Input Data Preprocessing

We need to preprocess the scans from numpy arrays to pytorch tensors. To do this, we can follow what happens inside SamPredictor.set_image (link) and SamPredictor.set_torch_image (link), which preprocess the image. First, we can use utils.transforms.ResizeLongestSide to resize the image, as this is the transform used inside the predictor (link). We can then convert the image to a pytorch tensor and use the SAM preprocess method (link) to finish preprocessing.
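Here is a minimal sketch of that preprocessing, assuming sam_model is the loaded SAM model (set up in the next section), image is an RGB numpy array loaded elsewhere, and device is your torch device; variable names are illustrative.

import torch
from segment_anything.utils.transforms import ResizeLongestSide

transform = ResizeLongestSide(sam_model.image_encoder.img_size)

# Resize so the longest side matches the encoder's expected input size.
input_image = transform.apply_image(image)
input_image_torch = torch.as_tensor(input_image, device=device)

# HWC uint8 -> 1xCxHxW float tensor, as the model expects.
transformed_image = input_image_torch.permute(2, 0, 1).contiguous()[None, :, :, :]

# Normalize pixel values and pad to a square input using SAM's own method.
input_image = sam_model.preprocess(transformed_image)

# Keep both sizes around for post-processing the predicted masks later.
original_image_size = image.shape[:2]
input_size = tuple(transformed_image.shape[-2:])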
Training Setup

We download the model checkpoint for the vit_b model and load it in:

sam_model = sam_model_registry['vit_b'](checkpoint='sam_vit_b_01ec64.pth')

We can set up an Adam optimizer with defaults and specify that the parameters to tune are those of the mask decoder:

optimizer = torch.optim.Adam(sam_model.mask_decoder.parameters())

At the same time, we can set up our loss function, for example Mean Squared Error:

loss_fn = torch.nn.MSELoss()

Training Loop

In the main training loop, we iterate through our data items, generate masks, and compare them to our ground truth masks so that we can optimize the model parameters based on the loss function.

In this example, we used a GPU for training since it is much faster than using a CPU. It is important to use .to(device) on the appropriate tensors to make sure that we don't have certain tensors on the CPU and others on the GPU.

We want to embed images by wrapping the encoder in the torch.no_grad() context manager, since otherwise we will have memory issues, along with the fact that we are not looking to fine-tune the image encoder.

with torch.no_grad():
    image_embedding = sam_model.image_encoder(input_image)

We can also generate the prompt embeddings within the no_grad context manager. We use our bounding box coordinates, converted to pytorch tensors.

with torch.no_grad():
    sparse_embeddings, dense_embeddings = sam_model.prompt_encoder(
        points=None,
        boxes=box_torch,
        masks=None,
    )

Finally, we can generate the masks. Note that here we are in single mask generation mode (in contrast to the three masks that are normally output).

low_res_masks, iou_predictions = sam_model.mask_decoder(
    image_embeddings=image_embedding,
    image_pe=sam_model.prompt_encoder.get_dense_pe(),
    sparse_prompt_embeddings=sparse_embeddings,
    dense_prompt_embeddings=dense_embeddings,
    multimask_output=False,
)

The final step here is to upscale the masks back to the original image size, since they are low resolution. We can use Sam.postprocess_masks to achieve this. We will also want to generate binary masks from the predicted masks so that we can compare these to our ground truths. It is important to use torch functionals in order not to break backpropagation.

upscaled_masks = sam_model.postprocess_masks(low_res_masks, input_size, original_image_size).to(device)

from torch.nn.functional import threshold, normalize

binary_mask = normalize(threshold(upscaled_masks, 0.0, 0)).to(device)

Finally, we can calculate the loss and run an optimization step:

loss = loss_fn(binary_mask, gt_binary_mask)
optimizer.zero_grad()
loss.backward()
optimizer.step()

By repeating this over a number of epochs and batches, we can fine-tune the SAM decoder.

Saving Checkpoints and Starting a Model from It

Once we are done with training and satisfied with the performance uplift, we can save the state dict of the tuned model using:

torch.save(model.state_dict(), PATH)

We can then load this state dict when we want to perform inference on data that is similar to the data we used to fine-tune the model.
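As a minimal sketch of that loading step, assuming the same vit_b architecture and that PATH and device are defined as above:

from segment_anything import sam_model_registry
import torch

# Rebuild the same vit_b architecture, then load the fine-tuned state dict.
tuned_model = sam_model_registry['vit_b'](checkpoint=None)
tuned_model.load_state_dict(torch.load(PATH))
tuned_model.to(device)
tuned_model.eval()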
You can find the Colab Notebook with all the code you need to fine-tune SAM here. Keep reading if you want a fully working solution out of the box!

Fine-Tuning for Downstream Applications

While SAM does not currently offer fine-tuning out of the box, we are building a custom fine-tuner integrated with the Encord platform. As shown in this post, we fine-tune the decoder in order to achieve this. This is available as an out-of-the-box one-click procedure in the web app, where the hyperparameters are automatically set.

Original vanilla SAM mask:

Mask generated by the fine-tuned version of the model:

We can see that this mask is tighter than the original mask. This was the result of fine-tuning on a small subset of images from the stamp verification dataset and then running the tuned model on a previously unseen example. With further training and more examples, we could obtain even better results.

Conclusion

That's all, folks! You have now learned how to fine-tune the Segment Anything Model (SAM). If you're looking to fine-tune SAM out of the box, you might also be interested to learn that we have recently released the Segment Anything Model in Encord, allowing you to fine-tune the model without writing any code.

What is Robotic Process Automation (RPA)?

Robotic process automation (RPA) promotes data-driven automation and digital transformation in modern industries, often referred to as "Industry 4.0." Data-driven automation primarily uses insights from data to program software to improve productivity on various tasks. Digital transformation, on the other hand, creates or modifies existing products and services, reshapes business models, and improves efficiency, customer experience, and overall competitiveness.

Modern industries, such as finance, healthcare, manufacturing, and retail, depend on RPA for many automation processes. It is estimated that RPA will take over approximately 40% of accounting tasks by 2025, indicating a significant shift within the industry. This prediction suggests industries need to adopt RPA to streamline their workflows.

Introduction to Robotic Process Automation

RPA is an automation technology that uses software robots, or robotic actors, to automate repetitive manual tasks. It implements a rigid set of predefined rules and actions to streamline tasks that don't require human effort. It also leverages technologies like artificial intelligence (AI), the Internet of Things (IoT), and even robotics to achieve automation with intelligence and efficiency.

RPA, coupled with data-driven AI approaches, aims to reduce human workload in today's industries. A straightforward example of RPA in a banking institution is automating repetitive tasks such as data entry for customer transactions, updating customer records, and transaction validation. These processes are well structured and follow clear steps and guidelines. Using RPA for such tasks is appropriate as it streamlines the process, reduces processing time, and minimizes errors.

RPA Workflow

Likewise, RPA can be seamlessly integrated with other technologies like blockchain, cloud computing, AR, and VR. This improves their capabilities and enables greater productivity, cost savings, and scalability. The traditional way of automating, which involved heavy coding, macro recording and playback, integrating APIs, etc., was slow, complex, and required intensive programming. RPA, by contrast, addresses those issues and makes automation accessible to the masses with its low-code functionality, shallow learning curve, and adaptability.

How Does Robotic Process Automation (RPA) Work?

Implementing RPA typically follows a structured, four-step process:

1. Understanding the process requires reading the documentation, observing the process, conducting interviews with stakeholders, and conducting user testing. These provide a list of requirements that adhere to the task and the factors affecting the process.
2. Defining workflow automation requires designing the process according to the specific requirements and the complexity of the tasks. Depending on the available tools, this may require using low-code platforms with intuitive drag-and-drop interfaces or more advanced systems incorporating machine learning to process unstructured data like text from emails or documents.
3. Integrating with existing systems or processes ensures that RPA bots have the necessary access to perform tasks by interacting with databases, applications, and other digital platforms. Effective integration enables data flow and task execution within the automated workflow.
4. Workflow monitoring and optimization are essential, as they involve overseeing the execution of RPA bots, tracking performance metrics, and identifying any anomalies or issues that may arise during operation.
Proactive monitoring enables timely intervention and optimization, ensuring smooth and reliable automation processes. With these steps, you can effectively implement RPA in your workflow.

So far, we have seen how RPA benefits repetitive and mundane tasks governed by a given set of rules. But there are instances where automation requires more than just defining workflows: sometimes, RPA must reason and make decisions based on the circumstances or data provided. In the next section, we will explore the different types of RPA that address this need.

Types of Robotic Process Automation (RPA)

Let us briefly explore how RPA has evolved from a traditional rule-based automation system to a more intelligent and dynamic data-driven automation technology.

Traditional RPA

Traditional RPA is designed to automate structured, rule-based tasks that do not require human judgment or decision-making. This approach uses predefined steps and workflows to execute repetitive tasks such as data entry, extraction, form filling, and transaction processing. Traditional RPA is highly effective at streamlining operations that follow a consistent pattern, reducing manual effort and error rates in tasks like invoice processing and routine data management.

Applications and Implications of Traditional RPA

Automate Logical and Straightforward Tasks: Traditional RPA is ideal for businesses that automate straightforward, high-volume tasks to increase efficiency and accuracy. For example, automating the invoice data entry process can significantly speed up accounts payable operations.

Cognitive RPA

Cognitive RPA extends the capabilities of traditional automation by integrating artificial intelligence (AI) and machine learning (ML) technologies. This advanced form of RPA can process structured and unstructured data, enabling it to perform tasks that require contextual understanding, learning from patterns, and making decisions.

RPA Revolution in the Healthcare Industry During COVID-19

Cognitive RPA applications include natural language processing (NLP) and large language models (LLMs) for interpreting human language, sentiment analysis for gauging customer feedback, and image recognition for analyzing visual data.

Applications and Implications

Managing Complex Processes: Cognitive RPA is adept at handling complex processes, such as customer service inquiries and analyzing large volumes of diverse data for insights, because it adapts to changes and makes informed decisions.

Context-aware Automation: It is suited to more complex challenges like automated customer support, where it can analyze inquiries, understand context, and provide personalized responses.

Attended Automation

Attended automation involves human collaboration, as it works on cues given by an operator. It is essentially a virtual assistant that aims to boost an individual's productivity on repetitive tasks, and it is also considered a front-end automation tool. It is quite useful for tasks that require human input and judgment to execute a process.

Applications and Implications

Human + RPA: It is effective for scheduling appointments, customer service interactions, and data validation, where human expertise complements automated processes.

Front-office Tasks: It is primarily preferred for tasks such as reception duties, flight booking, check-in automation, etc.

Unattended Automation

Unattended automation provides an end-to-end automated solution with no human involvement. The bots are independent and automate the entire workflow.
In this case, the RPA bot is provided with a clear, sequential set of steps to execute. This type of automation is suitable for executing long processes and runs on dedicated machines. An orchestrator allows you to manage tasks by scheduling the entire workflow; you can trigger, monitor, and track your bots with it.

Applications and Implications

Unattended bots are suitable for backend processes. They can handle complex tasks like data processing, orchestrating various virtual machines, high-volume transaction processing, data migration between systems, etc.

Hybrid Automation

Hybrid automation combines attended and unattended automation. In this type of RPA, communication happens between both processes, combining human involvement and backend operations. The "attended bots" receive instructions from the human worker and initiate the process. If the process requires triggering unattended bots, the attended bots can do so. Upon triggering, the unattended bots do what they are best at: providing an end-to-end automated service. Once the task is completed, the unattended bots send the data or output back to the attended bot, which notifies the human worker for further input. Unattended robots handle tasks like data processing and report generation that don't require human involvement, while attended robots handle tasks that require human attention, like gathering data.

Applications and Implications

Handling Complex Tasks: Hybrid automation excels in airport security check-in, order/delivery routing, inventory management, candidate screening, and interview scheduling.

Robotic Process Automation (RPA) and Artificial Intelligence (AI)

In the previous section, we discussed how powerful cognitive RPA is and how it can handle complex tasks using tools like neural networks and other ML approaches. RPA and AI are powerful individually, but combined they can achieve much more. This section will discuss how AI can improve RPA's capabilities and functionality.

Integrating RPA with Computer Vision

Let's discuss in detail how AI can enhance the automation capabilities of RPA via computer vision (CV). To begin with, we must understand the complexities associated with an image dataset. Image data contains a lot of detail and variability. Variability is one of the biggest concerns, as images can portray diverse visual and content characteristics, including differences in size, shape, lighting, etc.

Useful: Struggling with detecting and fixing image quality issues for your applications? Use our open-source toolkit, Encord Active OS, to detect image quality issues in this technical tutorial.

The same object captured from different distances can portray different information. However, that same variability contains rich information that, if leveraged properly, can help us better understand the data.

Example: Suppose you want to analyze thousands of images containing only cars and trucks for autonomous vehicles. You apply a segmentation mask and label each object with a respective class. You can use AI approaches such as CV to apply segmentation masks and assign labels. The segmentation process can also represent cars and trucks with different colors for visualization.

Once the segmentation masks are applied to each image, you can use RPA to automate various tasks (see the sketch after this list). For example:

- It can automate the task of segregating cars and trucks into separate folders.
- It can extract and log individual images into a database or a spreadsheet.
- RPA can trigger actions that initiate other required workflows or notifications based on the extracted data.
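As an illustration only, here is a hypothetical Python sketch of such a pipeline; classify_image, the folder names, and the CSV log are made-up stand-ins for whatever CV model and downstream systems an actual deployment would use.

import csv
import shutil
from pathlib import Path

def classify_image(image_path):
    # Hypothetical stand-in for a CV model; always returns "car" in this sketch.
    # Replace with a real segmentation/classification model.
    return "car"

inbox = Path("incoming_images")
log_path = Path("image_log.csv")

with open(log_path, "a", newline="") as f:
    writer = csv.writer(f)
    for image_path in inbox.glob("*.jpg"):
        label = classify_image(image_path)          # CV step: assign a class
        target_dir = Path("sorted") / label         # RPA step: segregate by class
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(image_path), target_dir / image_path.name)
        writer.writerow([image_path.name, label])   # RPA step: log to a spreadsheet
        # A real RPA tool could also trigger downstream workflows or notifications here.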
You can see how versatile and beneficial RPA and AI become when they are combined: AI performs complex tasks like image segmentation and annotation, while RPA builds an automated pipeline around the segmented and annotated images.

Useful Read: What are the most prominent use cases of computer vision in robotics? Learn how machine vision powers eight use cases in robotics and automation from this resource.

Now, let's look at the additional advantages that RPA offers.

Benefits of Robotic Process Automation (RPA)

In this section, we will briefly discuss some of RPA's advantages. These insights will help you make informed decisions about implementing RPA in your workflows and businesses.

Low-code Development

RPA software is configurable through a drag-and-drop UI for defining the automation process. This allows users to place the suitable automation components correctly, logically, and sequentially. It also facilitates rapid prototyping, a shallow learning curve, quicker deployment, and improved collaboration.

Increased Efficiency and Productivity

RPA reduces human intervention and friction, allowing organizations to automate tasks consistently. This yields an efficient and streamlined workflow, which increases productivity; examples include automating invoice processing, payroll management, data migration, and report generation.

Cost Savings through Automation

RPA reduces the cost of human input and workload. Routine work can be done cheaply, and human effort can be redirected to other important areas. By automating repetitive tasks, RPA can save companies 30 to 50% in processing costs compared to manual work and traditional methods, often leading to a positive ROI within a year.

Improved Accuracy and Compliance

Because RPA bots are configured with specific predefined rules, they are constrained to the task at hand. RPA can therefore improve accuracy on repetitive tasks with well-defined rules by eliminating human error caused by fatigue and distraction.

RPA software is easy to learn and deploy, and it offers the additional advantages of scalability, efficiency, economic friendliness, and workload reduction. However, it also has challenges, which the following section explores.

Challenges of RPA

We have seen how RPA benefits repetitive, tedious, and mundane tasks. However, RPA can fail if the task is not correctly defined, and issues can also arise when working with data, among other things. Let us now look at four common challenges that RPA usually faces.

Complexity of Process Identification

When automating a workflow, it is essential to understand the process, because automating the wrong tasks can be detrimental. Carefully analyzing workflows and selecting well-defined, repetitive processes with clear inputs and outputs is essential for success.

Integration with Legacy Systems

Many organizations rely on older systems that were not designed for seamless integration with modern automation tools. Overcoming these compatibility issues can require technical expertise and adaptation.

Security and Compliance Concerns

Integrating RPA introduces new access points and data flows.
Robust security measures, including data encryption and access controls, are vital to ensure compliance and safeguard sensitive information.

Resistance to Change and Organizational Culture

Embracing automation often requires organizational shifts and employee training. Addressing concerns about job displacement, upskilling human workers, and fostering a culture of innovation are key to smooth adoption.

These challenges often act as roadblocks that may hinder many workflow processes. But if they are carefully addressed, they can help break barriers and open up new solutions. Despite the challenges described in this section, many industries have pressed ahead with implementing RPA in their workflows. You will learn about some of these in the next section.

Use Cases

This section discusses three primary industries that use RPA to streamline operations. The industries mentioned here have one thing in common: supply and demand. Because of this, freeing up human workload and automating repetitive and exhausting processes is essential.

Healthcare

Healthcare organizations are among the most demanding environments, with many processes that can be automated. Because of ongoing patient visits, especially in hospitals, attending to patients remains the vital obligation, ahead of other mundane and repetitive tasks. Some of the areas that can be automated using RPA are:

Claims Processing: Automating tasks like eligibility verification, data entry, and claims submission can save time, increase accuracy, and improve reimbursement cycles.

Patient Scheduling and Registration: Automating appointment scheduling via an RPA app can reduce administrative burden.

Medical Report Generation: Extracting high-volume data from various sources, such as imaging technologies, and generating standardized reports reduces the workload of doctors and clinicians, freeing them for patient care.

Fraud Detection and Red-Teaming: Analyzing claims data to identify and flag potentially fraudulent activity improves healthcare system security and integrity. As patient data requires high security, RPA can also automate various infiltration tests on the healthcare system to check its reliability and security.

Retail

With the rise of e-commerce and growing consumer demand, modern retail has expanded its footprint. Here are three ways the retail sector is using RPA:

Order Processing and Fulfillment: Receiving customer orders and delivering them is one of retail's critical jobs. These steps can be automated using RPA, and customers can be notified at each phase, such as order processing and shipping. This enhances order accuracy and expedites delivery.

Customer Service: Chatbots powered by RPA can handle routine inquiries, freeing up human agents for complex issues and improving customer experience.

Price Management and Promotions: Automating tasks like price comparisons, discounts based on customer engagement, and campaign execution enables dynamic pricing strategies and targeted promotions.

Supply Chain Management

RPA has an even larger impact on the supply chain, essentially orchestrating the exchange between various networks. This includes managing and storing raw materials, manufacturing, moving, delivering, and storing finished products in a warehouse. Here is how RPA implementation enhances the supply chain:

Purchase Order Processing: RPA automates vendor communication, purchase order generation, and approval cycles, streamlining procurement processes.
Improving Supply Chain Planning: RPA can automate data analysis for forecasting and for tracking recent market and product trends, which promotes better demand planning and inventory management.

Logistics and Transportation: Using RPA to automate shipment tracking and route optimization improves logistics efficiency and reduces delays.

Case Study: Role of Computer Vision in Enhancing RPA Capabilities in Healthcare

A large part of healthcare revolves around imaging technology and visual data. For instance, radiology depends on X-rays, CT scans, and other imaging technologies to diagnose and treat patients. Several challenges surround this type of data:

Image Analysis: Analyzing such images is hard and time-consuming. On average, a radiologist takes about 8 to 10 minutes per image, sometimes more if the image needs clarification.

Workload Management: Because interpreting these images takes so much time, it can be exhausting for radiologists to read them continuously while managing other obligations such as attending to patients and counseling. Mental exhaustion can cause them to lose focus and make errors in diagnosis and treatment.

Report Generation: This is another phase where radiologists struggle, as producing a correct and precise patient report from the scan demands sustained focus.

Overcoming RPA Challenges by Using Computer Vision

Traditional RPA can address the challenges above with a predefined script, but it can be inefficient. Automating routine tasks like fetching and organizing images saves radiologists time, yet it is of little help for complex tasks, because the automation script mostly encodes general steps. The software can mishandle anomalies and unclear images and produce the wrong output: it may fail to analyze the image and interpret the data correctly, miss anomalies, increase the rate of false positives and negatives, or misclassify the image. In either case, considerable errors in report generation could lead to the wrong diagnosis and treatment.

Computer vision (CV) can be coupled with RPA to address these issues. CV extracts rich representations from visual data; using them, the RPA software can interpret the images and make the right decision. With this combination of AI and RPA, radiologists can quickly receive and review accurate image analysis. This reduces their workload, allowing them to attend to patients or take on complex cases. Additionally, such a system can generate reports that the radiologist can review and approve. In a nutshell, systems like this can improve radiologists' accuracy, efficiency, and workload management.

Relevant Read: Viz.ai is a San Francisco-based health tech company. Learn how they accelerated the time from diagnosis to treatment using a data-centric CV platform to develop high-quality datasets in this case study.

On the downside, these AI systems need to be trained on large datasets, which generally takes a lot of time.

What's Next: Cognitive Automation with Machine Vision?

Cognitive automation has shown great potential, as it can efficiently handle complex tasks, and it holds particular significance for machine vision, a related subfield that uses cameras and sensors to capture input data. Modern industrial practices rely on vision systems to manufacture products and deliver services.
Cognitive automation with machine vision can help industries make data-driven decisions, optimize operations, predict challenges, and improve efficiency across various activities, such as scaling capacity up and down based on requirements and strategic planning. For instance:

Companies developing autonomous vehicles use cameras and sensors to capture environmental data. Cognitive automation processes this data for decision-making, such as updating ML models with anomalies or new insights and integrating them into training simulations. It can also analyze accumulated data, aiding predictive analytics. In the future, cognitive automation may facilitate vehicle-to-vehicle communication, enhancing safety.

In manufacturing, vision systems are pivotal for product analysis and robot navigation. When combined with cognitive automation, new opportunities arise. For instance, it can identify bottlenecks like raw material shortages and automate orders. Furthermore, it can monitor product quality, gather user feedback, and suggest design improvements for future development.

These technologies can promote human-machine collaboration, creating new space for innovation and engineering. This can ultimately lead to new and better product designs and services and reduced waste.

Robotic Process Automation: Key Takeaways

Robotic process automation, as automation software and a set of solutions, is rapidly transforming how we work across different fields and processes. With advancements in AI, RPA implementations can be significantly enhanced to boost industrial productivity in much smarter ways.

As automation technology continues to evolve with RPA, the impact of automation solutions will only grow. They will reshape workflows and open doors for even greater automation possibilities, driving research and development in many areas and promoting the betterment of human lives. While challenges exist, RPA's potential for increased efficiency, reduced human error, improved accuracy, and cost savings is undeniable. Organizations can resolve these challenges by proactively adopting responsible development practices, using RPA to navigate the future of work effectively and unlock its full potential.

Top 10 Open Source Computer Vision Repositories

In this article, you will learn about the top 10 open-source Computer Vision repositories on GitHub. We discuss each repository's format, content, key learnings, and the proficiency levels it caters to. The goal is to guide researchers, practitioners, and enthusiasts interested in exploring the latest advancements in Computer Vision. You will gain insights into the most influential open-source CV repositories so you can stay up-to-date with cutting-edge technology and potentially incorporate these resources into your projects.

Readers can expect a comprehensive overview of the top Computer Vision repositories, including detailed descriptions of their features and functionalities. The article also highlights key trends and developments in the field, offering valuable insights for those looking to enhance their knowledge and skills in Computer Vision.

Here's the list of repositories we're going to discuss:

1. Awesome Computer Vision
2. Segment Anything Model (SAM)
3. Visual Instruction Tuning (LLaVA)
4. LearnOpenCV
5. Papers With Code
6. Microsoft ComputerVision recipes
7. Awesome-Deep-Vision
8. Awesome transformer with ComputerVision
9. CVPR 2023 Papers with Code
10. Face Recognition

What is GitHub?

GitHub provides developers with a shared environment in which they can contribute code, collaborate on projects, and monitor changes. It also serves as a host for open-source projects, allowing easy access to code libraries and resources created by the global developer community.

Factors to Evaluate a GitHub Repository's Health

Before we list the top repositories for Computer Vision (CV), it is essential to understand how to assess a GitHub repository's health. The list below highlights a few factors you should consider when judging a repository's reliability and sustainability:

- Level of Activity: Assess the frequency of updates by checking the number of commits, issues resolved, and pull requests.
- Contribution: Check the number of developers contributing to the repository. A large number of contributors signifies diverse community support.
- Documentation: Determine documentation quality by checking the availability of detailed readme files, support documents, tutorials, and links to relevant external research papers.
- New Releases: Examine the frequency of new releases. A higher frequency indicates continuous development.
- Responsiveness: Review how often the repository authors respond to issues raised by users. High responsiveness implies that the authors actively monitor the repository to identify and fix problems.
- Stars Received: Stars on GitHub indicate a repository's popularity and credibility within the developer community. Active contributors often attract more stars, showcasing the repository's value and impact.

Top 10 GitHub Repositories for Computer Vision (CV)

Open-source repositories play a crucial role in CV by providing a platform for researchers and developers to collaborate, share, and improve upon existing algorithms and models. These repositories host codebases, datasets, and documentation, making them valuable resources for enthusiasts, developers, engineers, and researchers. Let us delve into the top 10 repositories available on GitHub for use in Computer Vision.

Disclaimer: Some of the numbers below may have changed after we published this blog post. Check the repository links to get a sense of the most recent numbers.
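One way to check the latest numbers yourself is GitHub's public REST API. Below is a small illustrative Python sketch (using the requests library) that pulls a few of the health signals discussed above for any owner/repo pair; the fields shown (stargazers_count, forks_count, open_issues_count, pushed_at) are standard fields of the GET /repos/{owner}/{repo} endpoint.

import requests

def repo_health(owner, repo):
    # Public endpoint; unauthenticated requests are rate-limited by GitHub.
    response = requests.get(f"https://api.github.com/repos/{owner}/{repo}", timeout=10)
    response.raise_for_status()
    data = response.json()
    return {
        "stars": data["stargazers_count"],
        "forks": data["forks_count"],
        "open_issues": data["open_issues_count"],
        "last_push": data["pushed_at"],
    }

print(repo_health("facebookresearch", "segment-anything"))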
#1 Awesome Computer Vision

The awesome-php project inspired the Awesome Computer Vision repository, which aims to provide a carefully curated list of significant content related to open-source Computer Vision tools.

Awesome Computer Vision Repository

Repository Format

You can expect to find resources on image recognition, object detection, semantic segmentation, and feature extraction. The repository also includes materials related to specific Computer Vision applications like facial recognition, autonomous vehicles, and medical image analysis.

Repository Contents

The repository is organized into various sections, each focusing on a specific aspect of Computer Vision:

- Books and Courses: Classic Computer Vision textbooks and courses covering foundational principles of object recognition, computational photography, convex optimization, statistical learning, and visual recognition.
- Research Papers and Conferences: Research from conferences published by CVPapers, SIGGRAPH Papers, NIPS papers, and survey papers from Visionbib.
- Tools: Annotation tools such as LabelMe and specialized libraries for feature detection, semantic segmentation, contour detection, nearest-neighbor search, image captioning, and visual tracking.
- Datasets: PASCAL VOC dataset, Ground Truth Stixel dataset, MPI-Sintel Optical Flow dataset, HOLLYWOOD2 Dataset, UCF Sports Action Data Set, Image Deblurring, etc.
- Pre-trained Models: CV models used to build applications involving license plate detection, fire, face, and mask detectors, among others.
- Blogs: OpenCV, Learn OpenCV, Tombone's Computer Vision Blog, Computer Vision for Dummies, Andrej Karpathy's blog, and Computer Vision Basics with Python Keras and OpenCV.

Key Learnings

- Visual Computing: Use the repo to understand the core techniques and applications of visual computing across various industries.
- Convex Optimization: Grasp this critical mathematical framework to enhance your algorithmic efficiency and accuracy in CV tasks.
- Simultaneous Localization and Mapping (SLAM): Explore the integration of SLAM in robotics and AR/VR to map and interact with dynamic environments.
- Single-view Spatial Understanding: Learn about deriving 3D insights from 2D imagery to advance AR and spatial analysis applications.
- Efficient Data Searching: Leverage nearest neighbor search for enhanced image categorization and pattern recognition performance.
- Aerial Image Analysis: Apply segmentation techniques to aerial imagery for detailed environmental and urban assessment.

Proficiency Level

Aimed at individuals with an intermediate to advanced understanding of Computer Vision.

Commits: 206 | Stars: 19.8k | Forks: 4.1k | Author: Jia-Bin Huang | Repository Link.

#2 Segment Anything Model (SAM)

segment-anything is maintained by Meta AI. The Segment Anything Model (SAM) is designed to produce high-quality object masks from input prompts such as points or boxes. Trained on an extensive dataset of 11 million images and 1.1 billion masks, SAM exhibits strong zero-shot performance on various segmentation tasks.

segment-anything repository

Repository Format

The ReadMe.md file clearly documents how to install the dependencies and run the model from prompts. Running SAM from this repo requires Python 3.8 or higher, PyTorch 1.7 or higher, and TorchVision 0.8 or higher.

Repository Content

The segment-anything repository provides code, links, datasets, etc. for running inference with the Segment Anything Model (SAM).
Here's a concise summary of what the repository provides:

- Code for running inference with SAM.
- Links to download trained model checkpoints.
- A downloadable dataset of the images and masks used to train the model.
- Example notebooks demonstrating SAM usage.
- A lightweight mask decoder exportable to the ONNX format for specialized environments.

Key Learnings

Some of the key learnings you can gain from the segment-anything repository are:

- Understanding Object Segmentation: Learn about object segmentation techniques and how to generate high-quality masks for objects in images, and explore using input prompts (such as points or boxes) to guide mask generation.
- Practical Usage of SAM: Install and use the Segment Anything Model (SAM) for zero-shot segmentation tasks, and explore the provided example notebooks to apply SAM to real-world images.
- Advanced Techniques: For more experienced users, explore exporting SAM's lightweight mask decoder to ONNX format for specialized environments.

Learn how to fine-tune the Segment Anything Model (SAM) through our comprehensive guide.

Proficiency Level

The Segment Anything Model (SAM) is accessible to users with intermediate to advanced proficiency in Python, PyTorch, and TorchVision. Here's a concise breakdown for users of different proficiency levels:

- Beginner | Install and Run: If you're new to SAM, follow the installation instructions, download a model checkpoint, and use the provided code snippets to generate masks from input prompts or entire images.
- Intermediate | Explore Notebooks: Dive into the example notebooks to understand advanced usage, experiment with prompts, and explore SAM's capabilities.
- Advanced | ONNX Export: For advanced users, consider exporting SAM's lightweight mask decoder to ONNX format for specialized environments supporting the ONNX runtime.

Commits: 46 | Stars: 42.4k | Forks: 5k | Author: Meta AI Research | Repository Link.
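To illustrate the beginner workflow described above (install, load a checkpoint, prompt with a box), here is a minimal sketch; the image path, checkpoint filename, and box coordinates are placeholders.

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the vit_b checkpoint downloaded from the repository.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB uint8 image of shape HxWx3.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a bounding box (x0, y0, x1, y1) around the object of interest.
masks, scores, logits = predictor.predict(
    box=np.array([100, 100, 400, 400]),
    multimask_output=False,
)
print(masks.shape)  # (1, H, W) boolean mask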
#3 Visual Instruction Tuning (LLaVA)

The LLaVA (Large Language and Vision Assistant) repository, developed by Haotian Liu, focuses on visual instruction tuning. It aims to enhance large language and vision models, reaching capabilities comparable to GPT-4V and beyond. LLaVA demonstrates impressive multimodal chat abilities, sometimes even exhibiting behaviors similar to multimodal GPT-4 on unseen images and instructions. The project has seen several releases with unique features and applications, including LLaVA-NeXT, LLaVA-Plus, and LLaVA-Interactive.

Visual Instruction Tuning (LLaVA)

Repository Format

The content in the LLaVA repository is primarily Python-based. The repository contains code, models, and other resources related to visual instruction tuning. The Python files (*.py) implement, train, and evaluate the models. There are also other formats, such as Markdown for documentation, JSON for configuration files, and text files for logs and instructions.

Repository Content

LLaVA is a project focusing on visual instruction tuning for large language and vision models with GPT-4-level capabilities. The repository contains the following:

- LLaVA-NeXT: The latest release, LLaVA-NeXT (LLaVA-1.6), adds additional scaling to LLaVA-1.5 and outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications.
- LLaVA-Plus: This version of LLaVA can plug in and learn to use skills.
- LLaVA-Interactive: This release provides an all-in-one demo for Image Chat, Segmentation, and Generation.
- LLaVA-1.5: This version of LLaVA achieved state-of-the-art results on 11 benchmarks with simple modifications to the original LLaVA.
- Reinforcement Learning from Human Feedback (RLHF): LLaVA has been improved with RLHF to improve fact grounding and reduce hallucination.

Key Learnings

The LLaVA repository offers valuable insights into visual instruction tuning. Some key takeaways include:

- Enhancing Multimodal Models: LLaVA focuses on improving large language and vision models to achieve capabilities comparable to GPT-4V and beyond.
- Impressive Multimodal Chat Abilities: LLaVA demonstrates remarkable performance, even on unseen images and instructions, showcasing its potential for multimodal tasks.
- Release Variants: The project has seen several releases, including LLaVA-NeXT, LLaVA-Plus, and LLaVA-Interactive, each introducing unique features and applications.

Proficiency Level

Catered towards intermediate and advanced Computer Vision engineers building vision-language applications.

Commits: 446 | Stars: 14k | Forks: 1.5k | Author: Haotian Liu | Repository Link.

#4 LearnOpenCV

Satya Mallick maintains the LearnOpenCV repository on GitHub. It contains a collection of C++ and Python code related to Computer Vision, Deep Learning, and Artificial Intelligence. The code accompanies articles shared on the LearnOpenCV.com blog.

LearnOpenCV Repository

Repository Format

The repository contains the code for the blog's articles. Whether you prefer hands-on coding or reading in-depth explanations, it has diverse resources to cater to your learning style.

Repository Contents

This repo contains code for the Computer Vision, deep learning, and AI articles shared on OpenCV's blog, LearnOpenCV.com. You can choose the format that best suits your learning style and interests. Here are some popular topics from the LearnOpenCV repository:

- Face Detection and Recognition: Learn how to detect and recognize faces in images and videos using OpenCV and deep learning techniques.
- Object Tracking: Explore methods for tracking objects across video frames, such as the Mean-Shift algorithm or correlation-based tracking.
- Image Stitching: Discover how to combine multiple images to create panoramic views or mosaics.
- Camera Calibration: Understand camera calibration techniques to correct lens distortion and obtain accurate measurements from images with OpenCV.
- Deep Learning Models: Use pre-trained deep learning models for tasks like image classification, object detection, and semantic segmentation.
- Augmented Reality (AR): Learn to overlay virtual objects onto real-world scenes using techniques such as marker-based AR.

These examples provide practical insights into Computer Vision and AI, making them valuable resources for anyone interested in these fields (a brief face detection sketch in this spirit appears at the end of this section).

Key Learnings

- Apply OpenCV techniques confidently across varied industry contexts.
- Undertake hands-on OpenCV projects that solidify your skills and theoretical understanding, preparing you for real-world Computer Vision challenges.

Proficiency Level

This repo caters to a wide audience:

- Beginner: Gain your footing in Computer Vision and AI with introductory blogs and simple projects.
- Intermediate: Elevate your understanding with more complex algorithms and applications.
- Advanced: Challenge yourself with cutting-edge research implementations and in-depth blog posts.

Commits: 2,333 | Stars: 20.1k | Forks: 11.5k | Author: Satya Mallick | Repository Link.
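As a taste of the kind of example the LearnOpenCV articles cover, here is a minimal face detection sketch using OpenCV's bundled Haar cascade; the image path is a placeholder, and this is an illustration rather than code taken from the repository.

import cv2

# Load the Haar cascade that ships with OpenCV for frontal face detection.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("people.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect faces at multiple scales; returns a list of (x, y, w, h) boxes.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("people_with_faces.jpg", image)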
#5 Papers with Code

Researchers from Meta AI maintain Papers with Code as a community project, and no data is shared with any Meta Platforms product.

Papers with Code Repository

Repository Format

The platform indexes a wide range of Computer Vision research organized around methods and architectures, for example:

- ResNet: A powerful convolutional neural network architecture with 2,052 papers with code.
- Vision Transformer: Leveraging self-attention mechanisms, this model has 1,229 papers with code.
- VGG: The classic VGG architecture boasts 478 papers with code.
- DenseNet: Known for its dense connectivity, it has 385 papers with code.
- VGG-16: A variant of VGG, it appears in 352 papers with code.

Repository Contents

The platform contains datasets, research papers with code, tasks, and Computer Vision research material covering almost every segment and aspect of CV. The contents are organized into classified lists as follows:

- State-of-the-Art Benchmarks: Access to 4,443 benchmarks related to Computer Vision. These benchmarks serve as performance standards for various tasks and models.
- Diverse Tasks: With 1,364 tasks, Papers With Code covers a wide spectrum of Computer Vision challenges. Whether you're looking for image classification, object tracking, or depth estimation, you'll find it here.
- Rich Dataset Collection: Explore 2,842 datasets curated for Computer Vision research. These datasets fuel advancements in ML and allow researchers to evaluate their models effectively.
- Massive Paper Repository: The platform hosts an impressive collection of 42,212 papers with code, which contribute to cutting-edge research in Computer Vision.

Key Learnings

Here are some key learnings from the Computer Vision section of Papers With Code:

- Semantic Segmentation: This task involves segmenting an image into regions corresponding to different object classes. There are 287 benchmarks and 4,977 papers with code related to semantic segmentation.
- Object Detection: Object detection aims to locate and classify objects within an image. The section covers 333 benchmarks and 3,561 papers with code related to this task.
- Image Classification: Image classification involves assigning a label to an entire image. It features 464 benchmarks and 3,642 papers with code.
- Representation Learning: This area focuses on learning useful representations from data. There are 15 benchmarks and 3,542 papers with code related to representation learning.
- Reinforcement Learning (RL): While not specific to Computer Vision, there is 1 benchmark and 3,826 papers with code related to RL.
- Image Generation: This task involves creating new images. It includes 221 benchmarks and 1,824 papers with code.

These insights provide a glimpse into the diverse research landscape within Computer Vision. Researchers can explore the platform to stay updated on the latest advancements and contribute to the field.

Proficiency Levels

A solid understanding of Computer Vision concepts and familiarity with machine learning and deep learning techniques are essential to make the best use of the Computer Vision section on Papers With Code. The recommended proficiency levels are:

- Intermediate: Proficient in Python, understands neural networks, can read research papers, and can explore datasets.
- Advanced: Strong programming skills, deep domain knowledge, and the ability to contribute to research and stay updated.
Benchmarks: 4,443 | Tasks: 1,364 | Datasets: 2,842 | Papers with Code: 42,212

#6 Microsoft / ComputerVision-Recipes

The Microsoft GitHub organization hosts open-source projects and samples across many domains. Among them, the Computer Vision Recipes repository is a valuable resource for developers and enthusiasts interested in using Computer Vision technologies.

Microsoft's Repositories

Repository Format

One key strength of Microsoft's Computer Vision Recipes repository is its focus on simplicity and usability. The recipes are well documented and include detailed explanations, code snippets, and sample outputs.

- Languages: The recipes span a range of programming languages, primarily Python (with some Jupyter Notebook examples), along with C#, C++, TypeScript, and JavaScript, so developers can use the language of their choice.
- Operating Systems: The recipes are compatible with various operating systems, including Windows, Linux, and macOS.

Repository Content

- Guidelines: The repository includes guidelines and recommendations for implementing Computer Vision solutions effectively.
- Code Samples: You'll find practical code snippets and examples covering a wide range of Computer Vision tasks.
- Documentation: Detailed explanations, tutorials, and documentation accompany the code samples.
- Supported Scenarios: Image tagging (assigning relevant tags to images), face recognition (identifying and verifying faces in images), OCR (optical character recognition, extracting text from images), and video analytics (analyzing videos for objects, motion, and events).
- Highlights | Multi-Object Tracking: Added state-of-the-art support for multi-object tracking based on the FairMOT approach described in the 2020 paper "A Simple Baseline for Multi-Object Tracking."

Key Learnings

The Computer Vision Recipes repository from Microsoft offers valuable insights and practical knowledge in computer vision. Here are some key learnings you can expect:

- Best Practices: The repository provides examples and guidelines for building computer vision systems using best practices. You'll learn about efficient data preprocessing, model selection, and evaluation techniques.
- Task-Specific Implementations: It covers a variety of computer vision tasks, such as image classification, object detection, and image similarity. By studying these implementations, you'll better understand how to approach real-world vision problems.
- Deep Learning with PyTorch: The recipes leverage PyTorch, a popular deep learning library. You'll learn how to create and train neural networks for vision tasks and explore architectures and techniques specific to computer vision.

Proficiency Level

The Computer Vision Recipes repository caters to a wide range of proficiency levels, from beginners to experienced practitioners. Whether you're just starting in computer vision or looking to enhance your existing knowledge, it provides practical examples and insights that can benefit anyone interested in building robust computer vision systems.

Commits: 906 | Stars: 9.3k | Forks: 1.2k | Author: Microsoft | Repository Link.

#7 Awesome-Deep-Vision

The Awesome Deep Vision repository, curated by Jiwon Kim, Heesoo Myeong, Myungsub Choi, Jung Kwon Lee, and Taeksoo Kim, is a comprehensive collection of deep learning resources designed specifically for Computer Vision.
This repository offers a well-organized collection of research papers, frameworks, tutorials, and other useful materials relating to Computer Vision and deep learning.

Awesome-Deep-Vision Repository

Repository Format

The Awesome Deep Vision repository organizes its resources as a curated list. The list spans various categories related to Computer Vision and deep learning, such as research papers, courses, books, videos, software, frameworks, applications, tutorials, and blogs. It is a valuable resource for anyone interested in advancing their knowledge in this field.

Repository Content

Here's a closer look at the content of the Awesome Deep Vision repository and its sub-sections:

- Papers: Seminal research papers related to Computer Vision. Notable topics include:
  - ImageNet Classification: Papers like Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton's work on image classification using deep convolutional neural networks.
  - Object Detection: Research on real-time object detection, including Faster R-CNN and PVANET.
  - Low-Level Vision: Papers on edge detection, semantic segmentation, and visual attention.
- Other resources: Computer Vision course lists, books, video lectures, frameworks, applications, tutorials, and insightful blog posts.

Key Learnings

The Awesome Deep Vision repository offers several valuable learnings for those interested in Computer Vision and deep learning:

- Stay Updated: The repository provides a curated list of research papers, frameworks, and tutorials. By exploring these resources, you can stay informed about the latest advancements in Computer Vision.
- Explore Frameworks: Discover various deep learning frameworks and libraries. Understanding their features and capabilities can enhance your ability to work with Computer Vision models.
- Learn from Research Papers: Dive into research papers related to Computer Vision. These papers often introduce novel techniques, architectures, and approaches; studying them can broaden your knowledge and inspire your own work.
- Community Collaboration: The repository is a collaborative effort by multiple contributors. Engaging with the community and sharing insights can lead to valuable discussions and learning opportunities.

While the repository doesn't directly provide model implementations, it is a valuable reference point for anyone passionate about advancing their Computer Vision and deep learning skills.

Proficiency Level

The proficiency levels this repository caters to are:

- Intermediate: Proficiency in Python programming and awareness of deep learning frameworks.
- Advanced: In-depth knowledge of CV principles, mastery of frameworks, and the ability to contribute to the community.

Commits: 207 | Stars: 10.8k | Forks: 2.8k | Author: Jiwon Kim | Repository Link.

#8 Awesome Transformer with Computer Vision (CV)

The Awesome Visual Transformer repository, maintained by dk-liang, is a curated collection of articles and resources on transformer models in Computer Vision (CV). It is a valuable resource for anyone interested in visual transformers.

Awesome-visual-transformer Repository

Repository Format

This repository is a collection of research papers about transformers applied to vision. It contains surveys, arXiv papers, papers with code from CVPR, and papers on many other Computer Vision topics. It does not contain code itself.
Repository Content This is a valuable resource for anyone interested in transformer models within the context of Computer Vision (CV). Here’s a brief overview of its content: Papers: The repository collects research papers related to visual transformers. Notable papers include: “Transformers in Vision”: A technical blog discussing vision transformers. “Multimodal learning with transformers: A survey”: An IEEE TPAMI paper. ArXiv Papers: The repository includes various arXiv papers, such as: “Understanding Gaussian Attention Bias of Vision Transformers” “TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation” Transformer for Classification: - Visual Transformer Stand-Alone Self-Attention in Vision Models: designed for image recognition by Ramachandran et al. (2019). - Transformers for Image Recognition at Scale: Dosovitskiy et al. explore transformers for large-scale image recognition (2021). Other Topics: The repository covers task-aware active learning, robustness against adversarial attacks, and person re-identification using locally aware transformers. Key Learnings Here are some key learnings from the Awesome Visual Transformer repository: Understanding Visual Transformers: The repository provides a comprehensive overview of visual transformers, including their architecture, attention mechanisms, and applications in Computer Vision. You’ll learn how transformers differ from traditional convolutional neural networks (CNNs) and their advantages. Research Papers and Surveys: Explore curated research papers and surveys on visual transformers. These cover topics like self-attention, positional encodings, and transformer-based models for image classification, object detection, and segmentation. Practical Implementations: The repository links out to code releases for many of the papers it lists. Studying these implementations will give you insights into how to build and fine-tune transformer-based models for specific vision tasks. Proficiency Level Aimed at Computer Vision researchers and engineers with a practical understanding of the foundational concepts of transformers. Commits: 259 | Stars: 3.2k | Forks: 390 | Author: Dingkang Liang | Repository Link. #9 Papers-with-Code: CVPR 2024 Repository The CVPR2024-Papers-with-Code repository, maintained by Amusi, is a comprehensive collection of research papers and associated open-source projects related to Computer Vision. It covers many topics, including machine learning, deep learning, image processing, and specific areas like object detection, image segmentation, and visual tracking. CVPR2024 Papers with Code Repository Repository Format The repository is an extensive collection of research papers and relevant code organized according to different topics, including machine learning, deep learning, image processing, and specific areas like object detection, image segmentation, and visual tracking. Repository Content CVPR 2023 Papers: The repository contains a collection of papers presented at the CVPR 2023 conference. In 2023, the conference received a record 9,155 submissions, a 12% increase over CVPR 2022, and accepted 2,360 papers for a 25.78% acceptance rate. Open-Source Projects: Along with the papers, the repository also includes links to the corresponding open-source projects.
Organized by Topics: The papers and projects in the repository are organized by various topics such as Backbone, CLIP, MAE, GAN, OCR, Diffusion Models, Vision Transformer, Vision-Language, Self-supervised Learning, Data Augmentation, Object Detection, Visual Tracking, and numerous other related topics. Past Conferences: The repository also contains links to papers and projects from past CVPR conferences. Key Learnings Here are some key takeaways from the repository: Cutting-Edge Research: The repository provides access to the latest research papers presented at CVPR 2024. Researchers can explore novel techniques, algorithms, and approaches in Computer Vision. Practical Implementations: The associated open-source code allows practitioners to experiment with and implement state-of-the-art methods alongside research papers. This practical aspect bridges the gap between theory and application. Diverse Topics: The repository covers many topics, including machine learning, deep learning, image processing, and specific areas like object detection, image segmentation, and visual tracking. This diversity enables users to delve into various aspects of Computer Vision. In short, the repository is a valuable resource for staying informed about advancements in Computer Vision and gaining theoretical knowledge and practical skills. Proficiency Level While beginners may find the content challenging, readers with a solid foundation in Computer Vision can benefit significantly from this repository's theoretical insights and practical implementations. Commits: 642 | Stars: 15.2k | Forks: 2.4k | Author: Amusi | Repository Link. #10 Face Recognition This repository on GitHub provides a simple and powerful facial recognition API for Python. It lets you recognize and manipulate faces from Python code or the command line. Built using dlib’s state-of-the-art face recognition, this library achieves an impressive 99.38% accuracy on the Labeled Faces in the Wild benchmark. Face Recognition Repository Repository Format The content of the face_recognition repository on GitHub is primarily in Python. It provides a simple and powerful facial recognition API that allows you to recognize and manipulate faces from Python code or the command line. You can use this library to find faces in pictures, identify facial features, and even perform real-time face recognition with other Python libraries. Repository Content Here’s a concise list of the content within the face_recognition repository: Python Code Files: The repository contains Python code files that implement various facial recognition functionalities. These files include functions for finding faces in pictures, manipulating facial features, and performing face identification. Example Snippets: The repository provides example code snippets demonstrating how to use the library. These snippets cover tasks such as locating faces in images and comparing face encodings (a short illustrative sketch appears at the end of this article). Dependencies: The library relies on the dlib library for its deep learning-based face recognition. To use this library, you need to have Python 3.3+ (or Python 2.7), macOS or Linux, and dlib with Python bindings installed. Key Learnings Some of the key learnings from the face_recognition repository are: Facial Recognition in Python: It provides functions for locating faces in images, manipulating facial features, and identifying individuals. Deep Learning with dlib: You can benefit from the state-of-the-art face recognition model within dlib.
Real-World Applications: By exploring the code and examples, you can understand how facial recognition can be applied in real-world scenarios. Applications include security, user authentication, and personalized experiences. Practical Usage: The repository offers practical code snippets that you can integrate into your projects. It’s a valuable resource for anyone interested in using facial data in Python. Proficiency Level Caters to users with a moderate-to-advanced proficiency level in Python. It provides practical tools and examples for facial recognition, making it suitable for those who are comfortable with Python programming and want to explore face-related tasks. Commits: 238 | Stars: 51.3k | Forks: 13.2k | Author: Adam Geitgey | Repository Link. Key Takeaways Open-source Computer Vision tools and resources greatly benefit researchers and developers in the CV field. The contributions from these repositories advance Computer Vision knowledge and capabilities.  Here are the highlights of this article: Benefits of Code, Research Papers, and Applications: Code, research papers, and applications are important sources of knowledge and understanding. Code provides instructions for computers and devices, research papers offer insights and analysis, and applications are practical tools that users interact with. Wide Range of Topics: Computer Vision encompasses various tasks related to understanding and interpreting visual information, including image classification, object detection, facial recognition, and semantic segmentation. It finds applications in image search, self-driving cars, medical diagnosis, and other fields.
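To make the face_recognition workflow from #10 concrete, here is a minimal sketch of the library's documented API; the image file names are placeholders, and the library (plus dlib) is assumed to be installed.

```python
# Minimal sketch of the face_recognition API discussed in #10 above.
# Assumes `pip install face_recognition` (which requires dlib); the two image
# paths below are placeholders you would replace with your own files.
import face_recognition

known_image = face_recognition.load_image_file("known_person.jpg")
unknown_image = face_recognition.load_image_file("group_photo.jpg")

# Locate faces: each location is a (top, right, bottom, left) pixel box.
face_locations = face_recognition.face_locations(unknown_image)
print(f"Found {len(face_locations)} face(s) in the group photo")

# Encode faces as 128-dimensional vectors and compare them.
known_encoding = face_recognition.face_encodings(known_image)[0]
unknown_encodings = face_recognition.face_encodings(unknown_image)

for encoding in unknown_encodings:
    match = face_recognition.compare_faces([known_encoding], encoding)[0]
    distance = face_recognition.face_distance([known_encoding], encoding)[0]
    print(f"Match: {match} (distance: {distance:.3f})")
```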

March 15

8 min

sampleImage_github-repositories-image-segmentation
15 Interesting Github Repositories for Image Segmentation

A survey of image segmentation GitHub repositories shows how the field is rapidly advancing as computing power increases and diverse benchmark datasets emerge to evaluate model performance across various industrial domains. Additionally, with the advent of Transformer-based architectures and few-shot learning methods, the artificial intelligence (AI) community uses Vision Transformers (ViT) to enhance segmentation accuracy. The techniques involve state-of-the-art (SOTA) algorithms that only need a few labeled data samples for model training. With around 100 million developers contributing to GitHub globally, the platform is popular for exploring some of the most modern segmentation models currently available. This article explores the exciting world of segmentation by delving into the top 15 GitHub repositories, which showcase different approaches to segmenting complex images. But first, let’s understand a few things about image segmentation. What is Image Segmentation? Image segmentation is a computer vision (CV) task that involves classifying each pixel in an image. The technique works by clustering similar pixels and assigning them a relevant label. The method can be categorized into: Semantic segmentation: assigns each pixel a class label, grouping similar pixels into object categories without distinguishing individual instances. Instance segmentation: distinguishes different instances of the same object category. For example, instance segmentation will recognize multiple individuals in an image as separate entities, labeling each person as “person 1”, “person 2”, “person 3”, etc. Semantic Segmentation (Left) and Instance Segmentation (Right) The primary applications of image segmentation include autonomous driving and medical imaging. In autonomous driving, segmentation allows the model to classify objects on the road. In medical imaging, segmentation enables healthcare professionals to detect anomalies in X-rays, MRIs, and CT scans. (A short code sketch illustrating pixel-wise classification appears just before the repository list below.) Want to know about best practices for image segmentation? Read our Guide to Image Segmentation in Computer Vision: Best Practices. Factors to Validate a GitHub Repository’s Health Before we list the top repositories for image segmentation, it is essential to understand how to determine a GitHub repository's health. The list below highlights a few factors you should consider to assess a repository’s reliability and sustainability: Level of Activity: Assess the frequency of updates by checking the number of commits, issues resolved, and pull requests. Contribution: Check the number of developers contributing to the repository. A large number of contributors signifies diverse community support. Documentation: Determine documentation quality by checking the availability of detailed readme files, support documents, tutorials, and links to relevant external research papers. New Releases: Examine the frequency of new releases. A higher frequency indicates continuous development. Responsiveness: Review how often the repository authors respond to issues raised by users. High responsiveness implies that the authors actively monitor the repository to identify and fix problems. Stars Received: Stars on GitHub indicate a repository's popularity and credibility within the developer community. Active contributors often attract more stars, showcasing their value and impact. Top GitHub Repositories for Image Segmentation Due to image segmentation’s ability to perform advanced detection tasks, the AI community offers multiple open-source GitHub repositories comprising the latest algorithms, research papers, and implementation details.
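To see what “classifying each pixel” looks like in practice, here is a minimal sketch that runs a pretrained semantic segmentation model from torchvision; it is a generic illustration rather than code from any repository below, and the image path is a placeholder.

```python
# Minimal sketch of pixel-wise classification (semantic segmentation) using a
# pretrained DeepLabV3 model from torchvision. "street.jpg" is a placeholder.
import torch
from PIL import Image
from torchvision.models.segmentation import deeplabv3_resnet50, DeepLabV3_ResNet50_Weights

weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()

image = Image.open("street.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)["out"]          # shape: (1, num_classes, H, W)

# Every pixel gets the class with the highest score - that is the segmentation mask.
mask = logits.argmax(dim=1).squeeze(0)    # shape: (H, W), integer class IDs
print(mask.shape, mask.unique())
```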
The following sections will overview the fifteen most interesting public repositories, describing their resource format and content, topics covered, key learnings, and difficulty level. #1. Awesome Referring Image Segmentation Referring image segmentation involves segmenting objects based on a natural language query. For example, the user can provide a phrase such as “a brown bag” to segment the relevant object within an image containing multiple objects. Referring image segmentation Resource Format The repository is a collection of benchmark datasets, research papers, and their respective code implementations. Repository Contents The repo comprises ten datasets, including ReferIt, Google-Ref, UNC, and UNC+, and 72 SOTA models for different referring image segmentation tasks. Topics Covered Traditional Referring Image Segmentation: In the repo, you will find frameworks for traditional referring image segmentation, such as LISA, which performs segmentation through large language models (LLMs). Interactive Referring Image Segmentation: Includes the interactive PhraseClick referring image segmentation model. Referring Video Object Segmentation: Consists of 18 models to segment objects within videos. Referring 3D Instance Segmentation: There are two models for referring 3D instance segmentation tasks for segmenting point-cloud data. Key Learnings Different Types of Referring Image Segmentation: Exploring this repo will allow you to understand how referring interactive, 3D instance, and video segmentation differ from traditional referring image segmentation tasks. Code Implementations: The code demonstrations will help you apply different frameworks to real-world scenarios. Proficiency Level The repo is for expert-level users with a robust understanding of image segmentation concepts. Commits: 71 | Stars: 501 | Forks: 54 | Author: Haoran MO | Repository Link. #2. Transformer-based Visual Segmentation Transformer-based visual segmentation uses the transformer architecture with the self-attention mechanism to segment objects. Transformer-based Visual Segmentation Resource Format The repo contains research papers and code implementations. Resource Contents It has several segmentation frameworks based on convolutional neural networks (CNNs), multi-head and cross-attention architectures, and query-based models. Topics Covered Detection Transformer (DETR): The repository includes models built on the DETR architecture that Meta introduced. Attention Mechanism: Multiple models use the attention mechanism for segmenting objects. Pre-trained Foundation Model Tuning: Covers techniques for tuning pre-trained models. Key Learnings Applications of Transformers in Segmentation: The repo will allow you to explore the latest research on using transformers to segment images in multiple ways. Self-supervised Learning: You will learn how to apply self-supervised learning methods to transformer-based visual segmentation. Proficiency Level This is an expert-level repository requiring an understanding of the transformer architecture. Commits: 13 | Stars: 549 | Forks: 40 | Author: Xiangtai Li | Repository Link. #3. Segment Anything The Segment Anything Model (SAM) is a robust segmentation framework by Meta AI that generates object masks through user prompts. Segment Anything Model Resource Format The repo contains the research paper and an implementation guide.
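The implementation guide boils down to a few lines of Python. The sketch below follows the repository's documented SamPredictor interface; the checkpoint file, image path, and click coordinates are placeholders you would replace with your own.

```python
# Minimal sketch of prompt-based mask prediction with SAM, based on the
# repository's documented SamPredictor interface. The checkpoint path, image
# path, and click coordinates are placeholders; download a checkpoint from the
# repo before running this.
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # smallest backbone
predictor = SamPredictor(sam)

image = np.array(Image.open("photo.jpg").convert("RGB"))
predictor.set_image(image)                    # runs the heavy image encoder once

# A single foreground click (label 1) at pixel (x=500, y=375) as the prompt.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,                    # return three candidate masks
)
print(masks.shape, scores)                    # (3, H, W) boolean masks and their scores
```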
Resource Contents It consists of Jupyter notebooks and scripts with sample code for implementing SAM and has three model checkpoints, each with a different backbone size. It also links to Meta’s own SA-1B dataset for training object segmentation models. Topics Covered How SAM Works: The paper explains how Meta developed the SAM framework. Getting Started Tutorial: The Getting Started guide helps you generate object masks using SAM. Key Learnings How to Use SAM: The repo teaches you how to create segmentation masks with different model checkpoints. Proficiency Level This is a beginner-level repo that teaches you about SAM from scratch. Commits: 46 | Stars: 42.8k | Forks: 5k | Author: Hanzi Mao | Repository Link. #4. Awesome Segment Anything The Awesome Segment Anything repository is a comprehensive survey of models using SAM as the foundation to segment anything. SAM mapping image features and prompt embeddings set for a segmentation mask Resource Format The repo is a list of papers and code. Resource Content It consists of SAM’s applications, historical development, and research trends. Topics Covered SAM-based Models: The repo explores the research on SAM-based frameworks. Open-source Projects: It also covers open-source models on platforms like HuggingFace and Colab. Key Learnings SAM Applications: Studying the repo will help you learn about use cases where SAM is relevant. Contemporary Segmentation Methods: It introduces the latest segmentation methods based on SAM. Proficiency Level This is an expert-level repo containing advanced research papers on SAM. Commits: 273 | Stars: 513 | Forks: 39 | Author: Chunhui Zhang | Repository Link. #5. Image Segmentation Keras The repository is a Keras implementation of multiple deep learning image segmentation models. Resource Format Code implementations of segmentation models. Resource Content The repo consists of implementations for SegNet, FCN, U-Net, ResNet, PSPNet, and VGG-based segmentation models. Topics Covered Colab Examples: The repo demonstrates implementations through a Python interface. Installation: There is an installation guide to run the relevant modules. Key Learnings How to Use Keras: The repo will help you learn how to implement segmentation models in Keras. Fine-tuning and Knowledge Distillation: The repo contains sections that explain how to fine-tune pre-trained models and use knowledge distillation to develop simpler models. Proficiency Level The repo is an intermediate-level resource for those familiar with Python. Commits: 256 | Stars: 2.8k | Forks: 1.2k | Author: Divam Gupta | Repository Link. #6. Image Segmentation The repository is a PyTorch implementation of multiple segmentation models. R2U-Net Resource Format It consists of code and research papers. Resource Content The models covered include U-Net, R2U-Net, Attention U-Net, and Attention R2U-Net. Topics Covered Architectures: The repo explains the models’ architectures and how they work. Evaluation Strategies: It tests the performance of all models using various evaluation metrics. Key Learnings PyTorch: The repo will help you learn about the PyTorch library. U-Net: It will familiarize you with the U-Net model, a popular framework for medical image segmentation. Proficiency Level This is an intermediate-level repo for those familiar with deep neural networks and evaluation methods in machine learning. Commits: 13 | Stars: 2.4k | Forks: 584 | Author: LeeJunHyun | Repository Link.
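As a rough illustration of what such PyTorch implementations involve, the toy sketch below (not code from the repository) wires up a heavily simplified encoder-decoder and scores its output with a Dice coefficient, one of the evaluation metrics commonly reported for these models.

```python
# Toy PyTorch sketch (not code from the repository): a heavily simplified
# U-Net-style encoder-decoder that scores every pixel, evaluated with a Dice
# coefficient - a common segmentation metric.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Minimal encoder-decoder for 1-channel (e.g. medical) images."""
    def __init__(self, num_classes: int = 1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.Conv2d(16, num_classes, 1),           # per-pixel class scores
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def dice_coefficient(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Overlap metric in [0, 1]; 1 means a perfect match with the ground truth."""
    pred = (torch.sigmoid(pred) > 0.5).float()
    intersection = (pred * target).sum()
    return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)

model = TinySegNet()
images = torch.randn(2, 1, 64, 64)                   # fake batch of grayscale scans
masks = torch.randint(0, 2, (2, 1, 64, 64)).float()  # fake ground-truth masks
print(dice_coefficient(model(images), masks))
```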
#7. Portrait Segmentation The repository contains implementations of portrait segmentation models for mobile devices. Portrait Segmentation Resource Format The repo contains code and a detailed tutorial. Resource Content It consists of checkpoints, datasets, dependencies, and demo files. Topics Covered Model Architecture: The repo explains the architecture for Mobile-Unet, Deeplab V3+, Prisma-net, Portrait-net, Slim-net, and SINet. Evaluation: It reports the performance results of all the models. Key Learnings Portrait Segmentation Techniques: The repo will teach you about portrait segmentation frameworks. Model Development Workflow: It gives tips and tricks for training and validating models. Proficiency Level This is an expert-level repo. It requires knowledge of TensorFlow, Keras, and OpenCV. Commits: 405 | Stars: 624 | Forks: 135 | Author: Anilsathyan | Repository Link. #8. BCDU-Net The repository implements the Bi-Directional Convolutional LSTM with U-Net (BCDU-Net) for medical segmentation tasks, including lung, skin lesion, and retinal blood vessel segmentation. BCDU-Net Architecture Resource Format The repo contains code and an overview of the model. Resource Content It contains links to the research paper, updates, and a list of medical datasets for training. It also provides pre-trained weights for lung, skin lesion, and blood vessel segmentation models. Topics Covered BCDU-Net Architecture: The repo explains the model architecture in detail. Performance Results: It reports the model's performance statistics against other SOTA frameworks. Key Learnings Medical Image Analysis: Exploring the repo will familiarize you with medical image formats and how to detect anomalies using deep learning models. BCDU-Net Development Principles: It explains how the BCDU-Net model works based on the U-Net architecture. You will also learn about the Bi-directional LSTM component fused with convolutional layers. Proficiency Level This is an intermediate-level repo. It requires knowledge of LSTMs and CNNs. Commits: 166 | Stars: 656 | Forks: 259 | Author: Reza Azad | Repository Link. #9. MedSegDiff The repository demonstrates the use of diffusion techniques for medical image segmentation. Diffusion Technique Resource Format It contains code implementations and a research paper. Resource Contents It overviews the model architecture and contains the brain tumor segmentation dataset. Topics Covered Model Structure: The repo explains the application of the diffusion method to segmentation problems. Examples: It contains examples for training the model on tumor and melanoma datasets. Key Learnings The Diffusion Mechanism: You will learn how the diffusion technique works. Hyperparameter Tuning: The repo demonstrates a few hyperparameters you can adjust to fine-tune the model. Proficiency Level This is an intermediate-level repo requiring knowledge of diffusion methods. Commits: 116 | Stars: 868 | Forks: 130 | Author: Junde Wu | Repository Link. #10. U-Net The repository is a Keras-based implementation of the U-Net architecture. U-Net Architecture Resource Format It contains the original training dataset, code, and a brief tutorial. Resource Contents The repo provides the link to the U-Net paper and contains a section that lists the dependencies and results. Topics Covered U-Net Architecture: The research paper in the repo explains how the U-Net model works. Keras: The topic page has a section that gives an overview of the Keras library.
Key Learnings Data Augmentation: The primary feature of the U-Net model is its use of data augmentation techniques. The repo will help you learn how the framework augments medical data for enhanced training. Proficiency Level This is a beginner-level repo requiring basic knowledge of Python. Commits: 17 | Stars: 4.4k | Forks: 2k | Author: Zhixuhao | Repository Link. #11. SOTA-MedSeg The repository is a detailed record of medical image segmentation challenges and winning models. Medical Imaging Segmentation Methods Resource Format The repo comprises research papers, code, and segmentation challenges based on different anatomical structures. Resource Contents It mentions the winning models for each year from 2018 to 2023 and provides their performance results on multiple segmentation tasks. Topics Covered Medical Image Segmentation: The repo explores models for segmenting brain, head, kidney, and neck tumors. Past Challenges: It lists older medical segmentation challenges. Key Learnings Latest Trends in Medical Image Processing: The repo will help you learn about the latest AI models for segmenting anomalies in multiple anatomical regions. Proficiency Level This is an expert-level repo requiring in-depth medical knowledge. Commits: 70 | Stars: 1.3k | Forks: 185 | Author: JunMa | Repository Link. #12. UniverSeg The repository introduces the Universal Medical Image Segmentation (UniverSeg) model that requires no fine-tuning for novel segmentation tasks (e.g., a new biomedical domain, image type, or region of interest). UniverSeg Method Resource Format It contains the research paper and code for implementing the model. Resource Contents The research paper provides details of the model architecture and Python code with an example dataset. Topics Covered UniverSeg Development: The repo illustrates the inner workings of the UniverSeg model. Implementation Guidelines: A ‘Getting Started’ section will guide you through the implementation process. Key Learnings Few-shot Learning: The model employs few-shot learning methods for quick adaptation to new tasks. Proficiency Level This is a beginner-level repo requiring basic knowledge of few-shot learning. Commits: 31 | Stars: 441 | Forks: 41 | Author: Jose Javier | Repository Link. #13. Medical SAM Adapter The repository introduces the Medical SAM Adapter (Med-SA), which fine-tunes the SAM architecture for medical-specific domains. Med-SA Architecture Resource Format The repo contains a research paper, example datasets, and code for implementing Med-SA. Resource Contents The paper explains the architecture in detail, and the datasets relate to melanoma, abdominal, and optic-disc segmentation. Topics Covered Model Architecture: The research paper in the repo covers a detailed explanation of how the model works. News: It shares a list of updates related to the model. Key Learnings Vision Transformers (ViT): The model uses the ViT framework for image adaptation. Interactive Segmentation: You will learn how the model incorporates click prompts for model training. Proficiency Level The repo is an expert-level resource requiring an understanding of transformers. Commits: 95 | Stars: 759 | Forks: 58 | Author: Junde Wu (via Kids with Tokens) | Repository Link. #14. TotalSegmentator The repository introduces TotalSegmentator, a domain-specific medical segmentation model for segmenting CT images. Subtasks with Classes Resource Format The repo provides a short installation guide, code files, and links to the research paper.
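Basic usage is short: after a pip install, a single command segments a CT volume. The sketch below drives the documented command-line entry point from Python; the file paths are placeholders, and the commented-out Python API import is an assumption based on the project's README rather than something we have verified here.

```python
# Minimal sketch of running TotalSegmentator on a CT volume.
# Assumes `pip install TotalSegmentator`; "ct_scan.nii.gz" and "segmentations/"
# are placeholder paths.
import subprocess

# Option 1: call the installed command-line tool.
subprocess.run(
    ["TotalSegmentator", "-i", "ct_scan.nii.gz", "-o", "segmentations/"],
    check=True,
)

# Option 2 (assumed from the README, left commented out): the Python API.
# from totalsegmentator.python_api import totalsegmentator
# totalsegmentator("ct_scan.nii.gz", "segmentations/")
```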
Resource Contents The topic page lists suitable use cases, advanced settings, training validation details, a Python API, and a table with all the class names. Topics Covered Total Segmentation Development: The paper discusses how the model works. Usage: It explains the sub-tasks the model can perform. Key Learnings Implementation Using Custom Datasets: The repo teaches you how to apply the model to unique medical datasets. nnU-Net: The model uses nnU-Net, a semantic segmentation model that automatically adjusts parameters based on input data. Proficiency Level The repo is an intermediate-level resource requiring an understanding of the U-Net architecture. Commits: 560 | Stars: 1.1k | Forks: 171 | Author: Jakob Wasserthal | Repository Link.   #15. Medical Zoo Pytorch The repository implements a Pytorch-based library for 3D multi-modal medical image segmentation. Implementing Image Segmentation in PyTorch Resource Format It contains the implementation code and research papers for the models featured in the library. Resource Contents The repo lists the implemented architectures and has a Quick Start guide with a demo in Colab. Topics Covered 3D Segmentation Models: The library contains multiple models, including U-Net3D, V-net, U-Net, and MED3D. Image Data-loaders: It consists of data-loaders for fetching standard medical datasets. Key Learnings Brain Segmentation Performance: The research paper compares the performance of implemented architectures on brain sub-region segmentation. This will help you identify the best model for brain segmentation. COVID-19 Segmentation: The library has a custom model for detecting COVID-19 cases. The implementation will help you classify COVID-19 patients through radiography chest images. Proficiency Level This is an expert-level repo requiring knowledge of several 3D segmentation models. Commits: 122 | Stars: 1.6k | Forks: 288 | Author: Adaloglou Nikolas | Repository Link. GitHub Repositories for Image Segmentation: Key Takeaways While object detection and image classification models dominate the CV space, the recent rise in segmentation frameworks signals a new era for AI in various applications.  Below are a few points to remember regarding image segmentation: Medical Segmentation is the most significant use case. Most segmentation models discussed above aim to segment complex medical images to detect anomalies. Few-shot Learning: Few-shot learning methods make it easier for experts to develop models for segmenting novel images. Transformer-based Architectures: The transformer architecture is becoming a popular framework for segmentation tasks due to its simplicity and higher processing speeds than traditional methods.

March 15

10 min

sampleImage_google-deepmind-sima-ai-agent
Google’s Video Gaming Companion: Scalable Instructable Multiworld Agent [SIMA]

What is DeepMind SIMA? SIMA can follow natural language instructions to perform tasks in various video game environments. It can also generalize across games, picking up skills learned in one game and transferring them to different games. How do you train an AI agent to be a generalist? Google DeepMind’s latest AI agent, SIMA, short for Scalable Instructable Multiworld Agent, helps us understand precisely how. Both NVIDIA and DeepMind have been working toward a single agent that can operate across many worlds. The idea is that if you can develop one agent that can generalize across different domains (for example, different video games), it would probably be quite useful in the real world—for piloting a robot, learning from a physical environment, etc. In this article, you will learn about: What SIMA is and how it interacts with the environment in real-time using a generic human-like interface. Different methods for training an AI agent. SIMA’s training process, including the environments, data, models, and evaluation methods. How SIMA generalizes knowledge across tasks and environments with really impressive zero-shot capabilities. How useful such systems are as embodied AI agents. DeepMind’s Gaming Legacy: AlphaGo to Scalable Instructable Multiworld Agent (SIMA) DeepMind has consistently been at the forefront of advancing artificial intelligence (AI) through gaming. This tradition dates back to its groundbreaking success with AlphaGo, famous for beating the world’s best Go players. To understand how the team arrived at SIMA, let’s explore the evolution from DeepMind's early work on reinforcement learning in Atari video games to Scalable Instructable Multiworld Agent (SIMA), focusing on… wait for it… Goat Simulator 3, with some of the funniest game actions. The evolution shows how models go from mastering structured board games to navigating complex, rich, interactive 3D simulations and virtual environments. First off… Atari games. Reinforcement Learning on Atari Video Games DeepMind's first major success in applying AI to games came from training deep reinforcement learning (RL) agents on Atari games. The goal was to get the highest scores in several classic games using only pixel data and game scores. These games provided a diverse platform for testing and improving RL algorithms, which learn optimal behaviors through trial and error, guided by rewards. Here, DeepMind's RL agents mastered dozens of Atari games, often outperforming human players; the same research program went on to produce the well-known AlphaGo, AlphaGo Zero, and MuZero. This work showed how RL can solve difficult, dynamic, and visually varied problems. It also set a new standard in AI by showing how AI agents can learn and adapt to new environments without having much pre-programmed information. DeepMind's deep Q-network (DQN) was key to this success. It combined deep neural networks with a Q-learning framework to process high-dimensional sensory input and learn successful strategies directly from raw pixels. This approach enabled AI to understand and interact meaningfully with the gaming environment, paving the way for more sophisticated AI applications in gaming and beyond. Scalable Instructable Multiworld Agent (SIMA) on Goat Simulator 3 SIMA builds on its predecessors. The AI agent can move around and interact in a wide range of 3D virtual worlds, not just the 2D worlds of Atari games. SIMA is built to understand and follow natural language instructions within these environments.
This is a first step toward creating general AI that can understand the world and its complexities. SIMA learned from different gaming environments, and one interesting one is Goat Simulator 3. If you have played this game before, you will surely know how unpredictable and chaotic the actions are. It is uniquely challenging due to its open-ended gameplay and humorous, physics-defying mechanics. This, of course, is very different from the structured worlds of Go and classic Atari games! To teach SIMA how to operate in Goat Simulator 3, the researchers had to collect a lot of human gameplay from which it could learn. The gameplay ranged from simple navigation to performing specific actions described by open-ended language instructions (e.g., “jump the fence”). This process checks the agent's ability to understand and follow directions and adapt to an environment where nothing is ever the same. Agent Training Methods DeepMind's technical report discusses new ways to train AI agents that use the complexity of simulated environments to help them learn and adapt. These methods are crucial for creating agents like those in the SIMA project that can interact intelligently with various 3D environments. AI Agent Simulator-based Training The method uses reinforcement learning—agents learn the best way to execute a task by trying things out and seeing what works best, with help from reward signals in their environment. In this context, the game environment serves as both the playground and the teacher. Here are the components of this training approach: Reinforcement Learning: The core of this method is an algorithm that adjusts the agent's policy based on the rewards it receives for its actions. The agent learns to connect actions with results, which helps it improve its plan to maximize cumulative rewards. Reward Signals: These signals guide the agent's learning process within game environments. They can be explicit, like points scored in a game, or more nuanced, reflecting progress toward a game's objective or successful interaction within the environment. Environment Flexibility: This training method is flexible because you can use it in any setting that provides useful feedback. The agent learns by engaging directly with the environment, navigating a maze, solving puzzles, or interacting with dynamic elements. Examples: Using RL in places like Atari games, where the agent learns different strategies for each game, shows how well this method works. This can also be seen when training agents in more complicated situations, like those in Goat Simulator 3, where the AI must adapt to and understand complex, nuanced situations. Traditional Simulator-based Agent Training This method involves unsupervised learning, where the agent explores the environment and learns its dynamics without explicit instruction or reinforcement. The goal is for the agent to develop an intuitive understanding of the rules and mechanics governing the environment. The techniques in this approach are: Unsupervised Model: By interacting with the environment without predefined objectives or rewards, the agent builds a model of the world that reflects its inherent rules and structures. This model helps agents predict outcomes and plan actions, even in unfamiliar scenarios. Learn the Rules Intuitively: The agent notices patterns and regularities in its surroundings by observing and interacting with them. This is the same as "learning the rules of the game."
This process helps the agent gain a deep, implicit understanding that shapes how it acts and what it chooses to do in the future. Less Need for Annotation: One big benefit of this method is that it does not require as much detailed annotation or guidance. The agent learns from experiences, so it does not need large datasets with labels or manual instructions. Example: Scenarios where agents must infer objectives or navigate environments with sparse or delayed feedback. For example, an agent might learn to identify edible vs. poisonous items in a survival game or deduce the mechanics of object interaction within a physics-driven simulation. Scalable Instructable Multiworld Agent (SIMA) Training Process SIMA's training approach includes several key components, detailed as follows: Scaling Instructable Agents Across Many Simulated Worlds Environment SIMA's training leverages diverse 3D environments, ranging from commercial video games to bespoke research simulations. It was important to the researchers that these environments offer a range of challenges and chances to learn so that agents could become more flexible and generalize to various settings and situations. Key requirements of these environments include: Diversity: Using open-world games and controlled research environments ensures that agents encounter various scenarios, from dynamic, unpredictable game worlds to more structured, task-focused settings. Rich Interactions: The researchers chose the environments because they allowed agents to interact with different objects, characters, and terrain features in many ways, helping them learn a wide range of skills. Realism and Complexity: Some environments have physics and graphics close to reality. This lets agents learn in conditions that approach real-world complexity. 💡Learn more about these environments in the technical report. Two categories of environments meet these requirements: Commercial Video Games: The researchers trained the agents on games, including Goat Simulator 3, Hydroneer, No Man’s Sky, Satisfactory, Teardown, Valheim, and Wobbly Life. Research Environments: These are more controlled environments, such as controlled lab settings and procedurally generated rooms with realistic contents (ProcTHOR). SIMA is capable of performing many actions from language-instructed tasks. Data An extensive and varied set of gameplay data from various environments forms the basis of SIMA's training. This dataset includes: Multimodal Inputs: The multimodal data includes visual observations, spoken instructions, and the matching actions taken by human players. This gives agents a lot of information to learn from. Human Gameplay: The dataset ensures that agents learn from nuanced, contextually appropriate behavior by capturing gameplay and interaction sequences from human players. Annotated Instructions: Language instructions are paired with game sequences to give agents clear examples of using natural language to guide them in doing tasks. Agents SIMA agents are designed to interpret language instructions and execute relevant actions within 3D virtual environments. Key aspects of their design include: Language-Driven Generality: Agents are taught to follow instructions that use open-ended language. This lets them change their actions based on verbal cues to complete many tasks. Human-Like Interaction: The agents work through the same generic interface a human player would use: they take in on-screen images and text instructions and respond with keyboard and mouse actions, just as a person would.
Pre-trained Models: SIMA uses pre-trained models, like video models, to process textual and visual data. These models were mostly trained using instruction-conditioned behavioral cloning (see this note) and classifier-free guidance. This makes it easier for the agents to understand complicated instructions and their surroundings. 💡Learn how to go from big to intelligent visual data in our expert-led webinar. Instructions Across SIMA Data Evaluation Methods Assessing the performance of SIMA agents involves a variety of evaluation methods tailored to the different environments and tasks: Ground-truth Evaluation: In research environments, clear success criteria are set for each task, so it is easy to judge an agent's performance by whether certain goals are met.  Human Judgments: When the tasks are more open-ended or subjective, human evaluators watch how the agents act and give feedback on how well they can follow directions and reach their goals while acting like humans. Automated Metrics: In some cases, particularly within commercial games, automated metrics such as in-game scores or task completion indicators provide quantitative measures of agent success. Optical Character Recognition (OCR): Applied in commercial video games where task completion might not be as straightforward to assess. OCR is used to detect on-screen text indicating task completion. Action Log-probabilities and Static Visual Input Tests: These are more simplistic methods assessing the agent's ability to predict actions based on held-out data or to respond to static visual inputs with correct actions. 💡Interested in understanding metrics for computer vision models? Check out our comprehensive article on quality metrics in AI. SIMA Agent Features Scalable Instructable Multiworld Agent (SIMA) incorporates sophisticated features that enable it to interact effectively within various simulated 3D environments. These features are integral to its design, allowing it to understand and execute various natural language instructions and perform many actions across different virtual settings. SIMA agent receives instructions from a user and image observations from the environment Here's a breakdown of these crucial features: Multi-environment Transfer A key feature of SIMA is that it can use the knowledge and skills it has gained in one environment to perform well in another without starting from scratch each time. This ability to transfer between environments is very important for the agent's flexibility and efficiency; it lets it use what it has learned in a wide range of situations instead of just one. For instance, if the agent learns the concept of 'opening a door' in one game, it can apply this knowledge when encountering a door in another unrelated game. The agent's sophisticated perception and action systems facilitate mapping shared concepts by abstracting underlying similarities in interactions across environments and accelerating its adaptation. Understands Natural Language instructions SIMA is engineered to understand a wide range of language instructions, interpreting them within the context of its current environment and objectives. This comprehension extends to complex commands and instruction sequences, enabling SIMA to engage in sophisticated interactions and complete intricate tasks in accordance with human-like language inputs. Performs 600+ Actions Due to the variety of its training environments and the difficulty of the tasks it can handle, SIMA can perform more than 600 different actions. 
Thanks to its large action repertoire, it can respond correctly to various situations and instructions, which shows how well it has learned to adapt. Average success rate of the SIMA Agent by skill category From basic movements and interactions to more intricate and context-specific actions, SIMA's broad range of capabilities enables it to tackle diverse challenges and objectives. Generalization Rather than mastering a single task or environment, SIMA is developed to generalize its learning and problem-solving capabilities across contexts. This generalization ensures that the agent can apply its learned skills and knowledge to new, unseen challenges, adapting its strategies based on prior experiences and the specific demands of each new setting. Results Highlighting SIMA's Generalization Ability DeepMind's SIMA demonstrates impressive generalization capabilities across various environments, as showcased through several key findings: Zero-Shot Learning Abilities: SIMA effectively applies learned skills to new, unseen environments without additional training, which indicates robust internalized knowledge and skill transferability. No Pre-Training Ablation: Removing pre-trained components affects SIMA's performance, emphasizing the importance of pre-training for generalization. Despite this, some generalization capacity persists, highlighting the robustness of SIMA's core architecture. Language Ablation: Taking out natural language inputs worsens task performance. This shows how important language comprehension is to SIMA's ability to work in diverse environments. Environment-Specialized Performance: SIMA matches or outperforms environment-specialized agents, showcasing its broader applicability and efficient learning across different virtual worlds. Ethical AI Guidelines DeepMind's commitment to ethical AI practices is evident in developing and training SIMA. As part of these ethical guidelines, the AI should only be trained in carefully chosen environments that encourage good values and behavior. Here are the key guidelines they used to avoid violent content: Content Curation: In aligning with ethical AI practices, SIMA's training explicitly avoids video games or environments that feature violent actions or themes. This careful curation ensures that the agent is not exposed to, nor does it learn from, any content that could be considered harmful or contrary to societal norms and values. Promotes Positive Interaction: The training focused on problem-solving, navigation, and constructive interaction, choosing environments without violence. This created an AI agent that can be used in many positive situations. Risk Mitigation: This approach also serves as a risk mitigation strategy, reducing the potential for the AI to develop or replicate aggressive behaviors, which is crucial for maintaining trust and safety in AI deployments. Modeling Safe and Respectful Behaviors: The training program reinforces safe and respectful behaviors and decisions in the agent, ensuring that their actions align with the principles of avoiding harm and promoting well-being. SIMA's training on nonviolent content shows how important it is to ensure that AI research and development align with societal values and that we only create AI that is helpful, safe, and respectful of human rights. Challenges of Developing SIMA The DeepMind SIMA research team faced many difficult problems when developing the agent. 
These problems arise when training AI agents in different and changing 3D environments, and they show how difficult it is to use AI in situations similar to the complicated and unpredictable real world. Real-time Environments Not Designed for Agents Unpredictable Dynamics: Many real-time environments SIMA is trained in, especially commercial video games, are inherently unpredictable and not specifically designed for AI agents. These environments are crafted for human players and feature nuances and dynamics that can be challenging for AI to navigate and understand. Complex Interactions: The multifaceted interaction possibilities within these environments add another layer of complexity. Agents must learn how to handle various possible events and outcomes, which can change from one moment to the next, just like in real life. Evaluation Without API Access to Environment States Limited Information: Evaluating SIMA's performance without API access means the agent cannot rely on explicit environment states or underlying game mechanics that would typically be available to developers. This limitation necessitates reliance on visual and textual cues alone, which mirrors the human gameplay experience but introduces significant challenges in interpreting and responding to the environment accurately. Assessment Accuracy: The lack of direct environment state access complicates the evaluation process, making it harder to ascertain whether the AI has successfully understood and executed a given task, particularly in complex or ambiguous situations. SIMA’s Current Limitations Although the Scalable Instructable Multiworld Agent (SIMA) has made significant progress, it still has some problems worth mentioning. These constraints highlight areas for future research and development to improve AI agents' capabilities and applications in complex environments. Limited Environmental Availability Diversity of Games: SIMA was trained and tested on four research-based 3D simulations and seven commercial video games. This shows that the model can work in various settings but is still not very broad, considering all the different game types and settings. Adding more types of environments could help test and improve the agent's ability to adapt to new ones. Breadth of 3D Simulations: The four 3D simulations provide controlled settings to test specific agent capabilities. However, increasing the number and diversity of these simulations could offer more nuanced insights into the agent's adaptability and learning efficiency across varied contexts. Restricted Data Pipeline Scalability The current data pipeline, crucial for training SIMA through behavioral cloning, might not be scalable or diverse enough to cover the full spectrum of potential interactions and scenarios an agent could encounter. Improving the scalability and diversity of the data pipeline would be essential for training more robust and versatile AI agents. Short Action Horizon Action Duration: SIMA's training has primarily focused on short-horizon tasks, generally capped at around 10 seconds. This limitation restricts the agent's ability to learn and execute longer and potentially more complex sequences of actions, which are common in real-world scenarios or more intricate game levels. Reliability and Performance Agent Reliability: Although SIMA has shown promise in following instructions and performing actions across various environments, it is often unreliable compared to human performance. 
The agent's inconsistency in accurately interpreting and executing instructions poses challenges for its deployment in scenarios requiring high precision or critical decision-making. Comparison with Human Performance: Some tasks made for SIMA are naturally hard and require advanced problem-solving and strategic planning, but the agent still does not follow instructions as well as a human would. This shows how hard the environments are and how high the bar was set for the agent since even skilled human players do not get perfect scores on these tasks. Addressing these limitations will be crucial for the next stages of SIMA's development. To make the field of AI agents that can navigate and interact in complex, changing virtual worlds even better, we must improve environmental diversity, data pipeline scalability, action horizon, and overall reliability. Key Takeaways: Google’s Video Gaming Companion—Scalable Instructable Multiworld Agent (SIMA). Here are the key ideas from this article: SIMA interacts with the environment in real-time using a generic human-like interface. It receives image observations and language instructions as inputs and generates keyboard and mouse actions as outputs. SIMA is trained on a dataset of video games, including Satisfactory, No Man's Sky, Goat Simulator 3, and Valheim. The researchers evaluated SIMA’s ability to perform basic skills in these games, such as driving, placing objects, and using tools. On average, SIMA's performance is around 50%, but it is far from perfect. The researchers believe that training AI agents on a broad variety of video games is an effective way to make progress in general AI. These results support SIMA's strong generalization skills and show that it can work well in various situations and tasks. It is a big step forward in developing AI agents with strong, flexible, and transferable skill sets because it shows strong zero-shot learning abilities and resilience against ablation impacts.
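For readers who think in code, here is an illustrative-only sketch of the generic, human-like interface described above: image observations and a language instruction in, keyboard and mouse actions out. It is not DeepMind's SIMA code, which has not been released; every name and type here is an assumption made purely for illustration.

```python
# Illustrative-only sketch of the generic agent interface described in this article:
# image observations plus a language instruction in, keyboard and mouse actions out.
# This is NOT DeepMind's SIMA code; all names and types are assumptions.
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class Observation:
    frame: np.ndarray        # RGB screenshot of the game, e.g. shape (720, 1280, 3)
    instruction: str         # open-ended language command, e.g. "jump the fence"

@dataclass
class Action:
    keys: List[str]          # keyboard keys to press this step
    mouse_delta: Tuple[int, int]  # relative mouse movement (dx, dy)

class InstructableAgent:
    """Placeholder: a real agent would map (frame, instruction) to actions with a learned policy."""
    def act(self, obs: Observation) -> Action:
        return Action(keys=["w"], mouse_delta=(0, 0))   # dummy policy: walk forward

agent = InstructableAgent()
obs = Observation(frame=np.zeros((720, 1280, 3), dtype=np.uint8), instruction="jump the fence")
print(agent.act(obs))
```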

March 15

8 min

sampleImage_9-best-image-annotation-tools-for-computer-vision
Best Image Annotation Tools for Computer Vision [Updated 2024]

Guide to the most popular image annotation tools that you need to know about in 2024. Compare the features and pricing, and choose the best image annotation tool for your use case. It’s 2024—annotating images is still one of the most time-consuming steps in bringing a computer vision project to market. To help you out, we put together a list of the most popular image labeling tools out there. Whether you are: A computer vision team building unmanned drones with your own in-house annotation tool. A team of data scientists working on an autonomous driving project looking for large-scale labeling services. Or a data operations team working in healthcare looking for the right platform for your radiologists to accurately label CT scans. This guide will help you compare the top AI annotation tools and find the right one for you. We will compare each based on key factors - including image annotation service, support for different data types and use cases, QA/QC capabilities, security and data privacy, integration with the machine learning pipeline, and customer support. But first, let's explore the process of selecting an image annotation tool from the available providers. Choosing the right image annotation tool is a critical decision that can significantly impact the quality and efficiency of the annotation process. To make an informed choice, it's essential to consider several factors and evaluate the suitability of an image annotation tool for specific needs. Evaluating Image Annotation Tools for Computer Vision Projects Selecting the perfect image annotation tool is like choosing the perfect brush for your painting. Different projects have specific annotation needs that dictate how the downstream components of your pipeline perform. When evaluating an annotation tool that fits your project specifications, there are a few key factors you have to consider. In this section, we will explore those key factors and practical considerations to help you navigate the selection process and find the most fitting AI annotation tool for your computer vision applications. Annotation Types: An effective labeling tool should support various annotation types, such as bounding boxes (ideal for object localization), polygons (useful for detailed object outlines), keypoints (for pose estimation), and semantic segmentation (for scene understanding). The tool must be adaptable to different annotation requirements, allowing users to annotate images with precision and specificity based on the task at hand. User Interface (UI) and User Experience (UX): The user interface plays a crucial role in the efficiency and accuracy of the annotation process. A good annotation tool should have an intuitive interface that is easy to navigate, reducing the learning curve for users. Clear instructions, user-friendly controls, and efficient workflows contribute to a smoother annotation experience. Scalability: Consider the tool's ability to scale with the growing volume of data. A tool that efficiently handles large datasets and multiple annotators is crucial for projects with evolving requirements. Automation and AI Integration: Look for image labeling tools that offer automation features, such as automated annotation capabilities, to accelerate the annotation process. Integration with artificial intelligence (AI) algorithms can further enhance efficiency by automating repetitive tasks, reducing manual effort, and improving annotation accuracy.
Collaboration and Workflow Management: Assess the data annotation tool's collaboration features, including version control, user roles, and workflow management. Collaboration tools are essential for teams working on complex annotation projects. Data Security and Privacy: Ensure that the tool adheres to data security and privacy standards like GDPR. Evaluate encryption methods, access controls, and policies regarding the handling of sensitive data. Pricing: Consider various pricing models, such as per-user, per-project, or subscription models. Also factor in scalability costs, and potential additional fees, ensuring transparency in the pricing structure. Once you've identified which factors are most important for you to evaluate image annotating tools, the next step is understanding how to assess their suitability for your specific use case.  Let's compare the features offered by the best image annotation companies such as Encord, Scale AI, Label Studio, SuperAnnotate, CVAT, and Amazon SageMaker Ground Truth, and understand how they assist in annotating images. Most Popular Image Annotation Tools This article discusses the top 17 image annotation tools in 2024 to help you choose the right image annotation software for your use case. Encord Scale CVAT Label Studio Labelbox Playment Appen Dataloop SuperAnnotate V7 Labs Hive COCO Annotator Make Sense VGG Image Annotator LabelMe Amazon SageMaker Ground Truth VOTT Encord Encord is an automated annotation platform for AI-assisted image annotation, video annotation, and dataset management.  Key Features Data Management: Compile your raw data into curated datasets, organize datasets into folders, and send datasets for labeling.  AI-assisted Labeling: Automate 97% of your annotations with 99% accuracy using auto-annotation features powered by Meta's Segment Anything Model or GPT-4’s LLaVA. Collaboration: Integrate human-in-the-loop seamlessly with customized Workflows - create workflows with the no-code drag and drop builder to fit your data ops & ML pipelines. Quality Assurance: Robust annotator management & QA workflows to track annotator performance and increase label quality.  Integrated Data Labeling Services for all Industries: outsource your labeling tasks to an expert workforce of vetted, trained and specialized annotators to help you scale. Video Labeling Tool: provides the same support for video annotation. One of the leading video annotation tools with positive customer reviews, providing automated video annotations without frame rate errors. Robust Security Functionality: label audit trails, encryption, FDA, CE Compliance, and HIPAA compliance. Integrations: Advanced Python SDK and API access (+ easy export into JSON and COCO formats). Best for Commercial teams graduating from an in-house solution or open-source tool that need a robust, secure, and collaborative enterprise-grade platform to scale your annotation workflows. Teams working on complex or unique use cases that require an advanced annotation tool and/or functionality, including complex nested ontologies or rendering DICOM formats natively. Pricing Simple per-user pricing – no need to track annotation hours, label consumption or data usage.    Curious? Try it out Scale Scale AI, now Scale, is a data and labeling services platform that supports computer vision use cases but specializes in RLHF, user experience optimization, large language models, and synthetic data. 
Scale AI's Image Annotation Tool Key Features Customizable Workflows: Offers customizable labeling workflows tailored to specific project requirements and use cases. Data labeling services: Provides high-quality data labeling services for various data types, including images, text, audio, and video. Scalability: Capable of handling large-scale annotation projects and accommodating growing datasets and annotation needs. Best for Teams looking for a labeling service should know… Scale is a very popular option for data labeling services. Teams looking for annotation tools for Autonomous Vehicle vision should know… Scale is one of the earliest platforms on the market to support 3D Sensor Fusion annotation for RADAR and LiDAR use cases. Teams looking for medical imaging annotation tools should know… Platforms like Scale will usually not support DICOM or NIfTI data types nor allow companies to work with their data annotators on the platform. Pricing On a per-image basis CVAT (Computer Vision Annotation Tool) CVAT is an open-source, web-based image annotation toolkit originally built by Intel. For image labeling, CVAT supports four types of annotations: points, polygons, bounding boxes, and polylines, as well as a subset of computer vision tasks: image segmentation, object detection, and image classification. In 2022, CVAT's data, content, and GitHub repository were migrated over to OpenCV, where CVAT continues to be open-source. Furthermore, CVAT can also be utilized to annotate QR codes within images, facilitating the integration of QR code recognition into computer vision pipelines and applications. CVAT Label Editor Key Features Open-source: Easy and free to get started labeling images. Manual Annotation Tools: Supports a wide range of annotation types including bounding boxes, polygons, polylines, points, and cuboids, catering to diverse annotation needs. Multi-platform Compatibility: Works on various operating systems such as Windows, Linux, and macOS, providing flexibility for users. Export Formats: CVAT offers support for various data formats including JSON, COCO, and XML-based formats like Pascal VOC, ensuring annotation compatibility with diverse tools and platforms. Best for Students, researchers, and academics testing the waters with image annotation (perhaps with a few images or a small dataset). Not preferable for commercial teams as it lacks scalability, collaborative features, and robust security. Pricing Free 💡 More insights on image labeling with CVAT: If your team is looking for a free annotation tool, you should know… CVAT is one of the most popular open-source tools in the space, with over 1 million downloads since 2021. Other popular free image annotation alternatives to CVAT are 3D Slicer, LabelImg, VoTT (Visual Object Tagging Tool, developed by Microsoft), VIA (VGG Image Annotator), LabelMe, and Label Studio. If data security is a requirement for your annotation project… Commercial labeling tools will most likely be a better fit: key security features like audit trails, encryption, SSO, and generally required vendor certifications (like SOC 2, HIPAA, FDA, and GDPR) are usually not available in open-source tools. Further reading: Overview of open source annotation tools for computer vision Complete guide to image annotation for computer vision Label Studio Label Studio is another popular open-source data labeling platform. It provides a versatile platform for annotating various data types, including images, text, audio, and video.
Label Studio supports collaborative labeling, custom labeling interfaces, and integration with machine learning pipelines for data annotation tasks. Label Studio Image Annotation Tool Key Features Customizable Labeling Interfaces: Flexible configuration to tailor annotation interfaces to specific tasks. Collaboration Tools: Real-time annotation and project sharing capabilities for seamless collaboration among annotators. Extensible: Easily connects to cloud object storage so you can label data there directly. Export Formats: Label Studio supports multiple export formats, including JSON, CSV, TSV, and Pascal VOC XML, facilitating integration with diverse machine learning workflows. Best for Data scientists, machine learning engineers, and researchers or teams requiring versatile data labeling for images. Not suitable for teams with limited technical expertise or resources for managing an open-source tool. Price Free with enterprise plan available Labelbox Labelbox is a US-based data annotation platform founded in 2017. Like most of the other platforms mentioned in this guide, Labelbox offers both an image labeling platform, as well as labeling services. Labelbox Image Editor Key Features Data management: QA workflows and data annotator performance tracking. Customizable labeling interface: Configurable editors, plus access to 3rd-party labeling services through Labelbox Boost. Automation: Integration with AI models for automatic data labeling to accelerate the annotation process. Data type support: Support for multiple data types beyond images, especially text. Best for Teams looking for a platform to quickly annotate documents and text. Teams carrying out annotation projects that are use-case specific should know that… As generalist tools, platforms like Labelbox are great at handling a broad variety of data types. If you're working on a unique use-case-specific annotation project (like scans in DICOM formats or high-resolution images that require pixel-perfect annotations), other commercial AI labeling tools will be a better fit: check out our blog exploring Best DICOM Labeling Tools. Pricing Varies based on the volume of data, percent of the total volume needing to be labeled, number of seats, number of projects, and percent of data used in model training. For larger commercial teams, this pricing may get expensive as your project scales. Playment Playment is a fully-managed data annotation platform. The workforce labeling company was acquired by Telus in 2021 and provides computer vision teams with training data for various use cases, supported by manual labelers and a machine learning platform. Playment Image Annotation Tool Key Features Data Labeling Services: Provides high-quality data labeling services for various data types including images, videos, text, and sensor data. Support: Global workforce of contractors and data labelers. Scalability: Capable of handling large-scale annotation projects and accommodating growing datasets and annotation needs. Audio labeling tool: Speech recognition training platform (handles all data types across 500+ languages and dialects). Best for Teams looking for a fully managed solution who do not need visibility into the process. Pricing Enterprise plan Appen Appen is a data labeling services platform founded in 1996, making it one of the first and oldest solutions in the market.
The company offers data labeling services for a wide range of industries and, in 2019, acquired Figure Eight to build out its software capabilities and help businesses train and improve their computer vision models. Appen Image Annotation Tool Key Features Data labeling services: Support for multiple annotation types (bounding boxes, polygons, and image segmentation). Data collection: Data sourcing (pre-labeled datasets), data preparation, and real-world model evaluation. Natural language processing: Supports natural language processing tasks such as sentiment analysis, entity recognition, and text classification. Image and Video Analysis: Analyzes images and videos for tasks such as object detection, image classification, and video segmentation. Best for Teams looking for image data sourcing and collection alongside annotation services. Pricing Enterprise plan Dataloop Dataloop is an Israel-based data labeling platform that provides a comprehensive solution for data management and annotation projects. The tool offers data labeling capabilities across images, text, audio, and video annotation, helping businesses train and improve their machine learning models. Dataloop Image Annotation Tool Key Features Data Annotation: Features for image annotation tasks, including classification, detection, and semantic segmentation. Video Annotation Tool: Support for video annotations. Collaboration Tool: Features for real-time collaboration among annotators, project sharing, and version control for efficient teamwork. Data Management: Offers data management capabilities including data versioning, tracking, and organization for streamlined workflows. Best for Teams looking for a generalist annotation tool for various data annotation needs. Teams carrying out image and video annotation projects that are use-case specific should know that… As generalist tools, platforms like Dataloop are built to support a wide variety of simple use cases, so other commercial platforms are a better fit if you're trying to label use-case-specific annotation projects (like high-resolution images that require pixel-perfect annotations in satellite imaging or DICOM files for medical teams). Pricing Free trial and an enterprise plan. SuperAnnotate SuperAnnotate provides enterprise solutions for image and video annotation, catering primarily to the needs of the computer vision community. It provides powerful annotation tools and features tailored for machine learning and AI applications, offering efficient labeling solutions to enhance model training and accuracy. SuperAnnotate - Image Annotation Tool Key Features Multi-data type support: Versatile annotation tool for image, video, text, and audio. AI Assistance: Integrates AI-assisted annotation to accelerate the annotation process and improve efficiency. Customization: Provides customizable annotation interfaces and workflows to tailor annotation tasks according to specific project requirements. Integration: Seamlessly integrates with machine learning pipelines and workflows for efficient model training and deployment. Scalability: Capable of handling large-scale annotation projects and accommodating growing datasets and annotation needs. Export Formats: SuperAnnotate supports multiple data formats, including popular ones like JSON, COCO, and Pascal VOC. Best for Larger teams working on various machine learning solutions looking for a versatile annotation tool.
Pricing Free for early stage startups and academics for team size up to 3. Enterprise plan V7 Labs V7 is a UK-based data annotation platform founded in 2018. The company enables teams to annotate training data, support the human-in-the-loop processes, and also connect with annotation services. V7 offers annotation of a wide range of data types alongside image annotation tooling, including documents and videos. V7 Labs Image Annotation Tool Key Features Collaboration capabilities: Project management and automation workflow functionality, with real-time collaboration and tagging. Data labeling services: Provides labeling services for images and videos. AI assistance: Model-assisted annotation of multiple annotation types (segmentation, detection, and more). Best for Students or teams looking for a generalist platform to easily annotate different data types in one place (like documents, images, and short videos). Limited functionalities for use-case specific annotations. Pricing Various options, including academic, business, and pro. Hive Hive was founded in 2013 and provides cloud-based AI solutions for companies wanting to label content across a wide range of data types, including images, video, audio, text, and more. Hive Image Annotation Tool Key Features Image annotation tool: Offers annotation tools and workflows for labeling images along with support for unique image annotation use cases (ad targeting, semi-automated logo detection). Ease of access: Flexible access to model predictions with a single API call. Integration: Seamlessly integrates with machine learning pipelines and workflows for AI model training and deployment. Best for Teams labeling images and other data types for the purpose of content moderation. Pricing Enterprise plan COCO Annotator COCO Annotator is a web-based image annotation tool, crafted by Justin Brooks under the MIT license. Specifically designed to streamline the process of labeling images for object detection, localization, and keypoints detection models, this tool offers a range of features that cater to the diverse needs of machine learning practitioners and researchers.  COCO Annotator - Image Annotation Tool Key Features Image annotation: Supports annotation of images for object detection, instance segmentation, keypoint detection, and captioning tasks. Export formats: To facilitate large-scale object detection, the tool exports and stores annotations in the COCO format.  Automations: The tool makes annotating an image easier by incorporating semi-trained models. Additionally, it provides access to advanced selection tools, including the MaskRCNN, Magic Wand and DEXTR. Best For COCO Annotator is a good choice for ML researchers, preferable for image annotation for tasks like object detection and keypoints detection. Price Free Make Sense Make Sense AI is a user-friendly and open-source annotation tool, available under the GPLv3 license. Accessible through a web browser without the need for advanced installations, this tool simplifies the annotation process for various image types. Make Sense - Image Annotation Tool Key Features Open Sourced: Make Sense AI stands out as an open-source tool, freely available under the GPLv3 license, fostering collaboration and community engagement for its ongoing development. Accessibility: It ensures web-based accessibility, operating seamlessly in a web browser without complex installations, promoting ease of use across various devices. 
Export Formats: It facilitates exporting annotations in multiple formats (YOLO, Pascal VOC XML, VGG JSON, and CSV), ensuring compatibility with diverse machine learning algorithms and seamless integration into various workflows. Best For Small teams seeking an efficient solution to annotate images. Price Free VGG Image Annotator VGG Image Annotator (VIA) is a versatile open-source tool crafted by the Visual Geometry Group (VGG) for the manual annotation of both image and video data. Released under the permissive BSD-2-Clause license, VIA serves the needs of both academic and commercial users, offering a lightweight and accessible solution for annotation tasks. VGG Image Annotator - Image Annotation Tool Key Features Lightweight and user-friendly: VIA is a lightweight, self-contained annotation tool, utilizing HTML, JavaScript, and CSS without external libraries, enabling offline usage in modern web browsers without setup or installation. Offline capability: The tool is designed to be used offline, providing a full application experience within a single HTML file of size less than 200 KB. Multi-user collaboration: Facilitates collaboration among multiple annotators with features such as project sharing, real-time annotation, and version control. Best For VGG Image Annotator (VIA) is ideal for individuals and small teams working on academic research projects. Price Free LabelMe LabelMe is an open-source web-based tool developed by the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) that allows users to label and annotate images for computer vision research. It provides a user-friendly interface for drawing bounding boxes, polygons, and semantic segmentation masks to label objects within images. LabelMe Image Annotation Tool Key Features Web-Based: Accessible through a web-based interface, allowing annotation tasks to be performed in any modern web browser without requiring software installation. Customizable Interface: Provides a customizable annotation interface with options to adjust settings, colors, and layout preferences to suit specific project requirements. Best for Academic and research purposes Pricing Free Amazon SageMaker Ground Truth Amazon SageMaker Ground Truth is a fully managed data labeling service provided by Amazon Web Services (AWS). It offers a platform for efficiently labeling large datasets to train machine learning models. Ground Truth supports various annotation tasks, including image classification, object detection, semantic segmentation, and more. Amazon SageMaker Ground Truth - Image Annotation Tool Key Features Managed service: Fully managed by AWS, eliminating the need for infrastructure setup and management. Human-in-the-Loop Labeling: Harnesses the power of human feedback across the ML lifecycle to improve the accuracy and relevancy of models. Scalability: Capable of handling large-scale annotation projects and accommodating growing datasets and annotation needs. Integration with Amazon SageMaker: Seamlessly integrates with Amazon SageMaker for model training and deployment, providing a streamlined end-to-end machine learning workflow. Best for Teams requiring large-scale data labeling. Pricing Varies based on labeling task and type of data. VOTT VOTT (Visual Object Tagging Tool) is an open-source tool developed by Microsoft for annotating images and videos to create training datasets for computer vision models.
VOTT provides an intuitive interface for drawing bounding boxes around objects of interest and labeling them with corresponding class names. VOTT Image Annotation Tool Key Features Versatile annotation tool: Supports a wide range of annotation types including bounding boxes, polygons, polylines, points, and segmentation masks for precise labeling. Video annotation: Enables annotation of videos frame by frame, with support for object tracking and interpolation to streamline the annotation process. Multi-platform compatibility: Works across various operating systems such as Windows, Linux, and macOS, ensuring flexibility for users. Best for Teams requiring a lightweight and customizable annotation tool for object detection. Pricing Free Image Annotation Tool: Key Takeaways There you have it! The 17 most popular image annotation tools for computer vision in 2024. For further reading, you might also want to check out a few 2024 honorable mentions, both paid and free annotation tools: Supervisely - a commercial data labeling platform praised for its quality control functionality and basic interpolation feature. LabelImg - an open-source graphical image annotation tool, now maintained as part of the Label Studio community. MarkUp - a free web annotation tool for annotating images or PDFs.
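Whichever tool you choose, most of the platforms above can export labels as COCO-style JSON, so it is worth sanity-checking an export before training. Below is a minimal, tool-agnostic sketch that loads such a file and prints the class distribution; the file name annotations.json is a placeholder, and the structure assumed is the standard COCO layout (images, annotations, categories).

```python
import json
from collections import Counter

# Load a COCO-style export (the file name is a placeholder).
with open("annotations.json") as f:
    coco = json.load(f)

# Map category IDs to human-readable class names.
id_to_name = {c["id"]: c["name"] for c in coco["categories"]}

# Count labels per class to spot class imbalance before training.
label_counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
print(f"{len(coco['images'])} images, {len(coco['annotations'])} annotations")
for name, count in label_counts.most_common():
    print(f"{name}: {count}")

# In COCO, each bbox is [x, y, width, height] in pixel coordinates.
if coco["annotations"]:
    first = coco["annotations"][0]
    print("Example bbox:", first["bbox"], "on image", first["image_id"])
```

A quick pass like this catches empty categories, duplicated image IDs, or heavily skewed class counts before they silently degrade model training.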

March 15

10 min

YOLO World Zero-shot Object Detection Model Explained

YOLO-World Zero-shot Real-Time Open-Vocabulary Object Detection is a machine learning model built on the YOLOv8 backbone that excels in identifying a wide array of objects without prior training on specific categories. It achieves high efficiency and real-time performance by integrating vision-language modeling, pre-training on large-scale datasets, and a novel Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN). Object Detection with YOLO Series The YOLO series of detectors, initially introduced by Joseph Redmon, revolutionized object detection with their real-time performance and straightforward architecture. These detectors operate by dividing the input image into a grid and predicting bounding boxes and class probabilities for each grid cell. Despite their efficiency, traditional YOLO detectors are trained on datasets with fixed categories, limiting their ability to detect objects beyond these predefined classes without retraining on custom datasets. Read the blog on the latest of the YOLO series: YOLOv9: SOTA Object Detection Model Explained. Object Detection with Other Vision Language Models Recently, with the introduction of vision foundation models, there has been a surge in research exploring the integration of vision and large language models (LLMs) to enhance object detection capabilities. Models like CLIP (Contrastive Language-Image Pre-training) and F-VLM (a detector built on Frozen Vision and Language Models) have demonstrated the potential of vision-language modeling in various computer vision tasks, including object detection. Grounding DINO Grounding DINO is a method aimed at improving open-set object detection in computer vision. Open-set object detection is a task where models are required to identify and localize objects within images, including those from classes not seen during training, also known as "unknown" or "unseen" object classes. To tackle this challenge, Grounding DINO combines the Transformer-based DINO detector with grounded pre-training, which incorporates both visual and textual information. This hybrid approach enhances the model's capability to detect and recognize previously unseen objects in real-world scenarios by leveraging textual descriptions in addition to visual features. For more information, read the paper: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. CLIP CLIP is a neural network trained on a diverse range of images and natural language supervision sourced abundantly from the internet. Unlike traditional models, CLIP can perform various classification tasks instructed in natural language without direct optimization for specific benchmarks. This approach, similar to zero-shot capabilities seen in GPT-2 and GPT-3, enhances the model's robustness and performance, closing the robustness gap by up to 75%. CLIP achieves comparable performance to ResNet-50 on ImageNet zero-shot, without using any of the original labeled examples. For more information, read the paper: Learning Transferable Visual Models From Natural Language Supervision. F-VLM F-VLM is a simplified open-vocabulary object detection method that leverages Frozen Vision and Language Models (VLM). It eliminates the need for complex multi-stage training pipelines involving knowledge distillation or specialized pretraining for detection. F-VLM demonstrates that a frozen VLM can retain locality-sensitive features crucial for detection and serves as a strong region classifier.
The method fine-tunes only the detector head and combines detector and VLM outputs during inference. F-VLM exhibits scaling behavior and achieves a significant improvement of +6.5 mask AP over the previous state-of-the-art on novel categories of the LVIS open-vocabulary detection benchmark. For more information, read the paper: F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models. Open Vocabulary Object Detection in Real-time YOLO-World addresses limitations of traditional object detection methods by enabling open-vocabulary detection beyond fixed categories, offering adaptability to new tasks, reducing computational burden, and simplifying deployment on edge devices. Real-Time Performance YOLO-World retains the real-time performance characteristic of the YOLO architecture. This is crucial for applications where timely detection of objects is required, such as in autonomous vehicles or surveillance systems. Open-Vocabulary Capability YOLO-World has the capability to detect objects beyond the fixed categories on which the YOLO series is trained. This open-vocabulary approach allows YOLO-World to identify a broader range of objects, making it highly adaptable to diverse real-world scenarios. YOLO-World also presents the "prompt-then-detect" approach, which eliminates the necessity for real-time text encoding. Instead, users can generate prompts, which are subsequently encoded into an offline vocabulary. Integration of Vision-Language Modeling YOLO-World integrates vision-language modeling techniques to enhance its object detection capabilities. By leveraging pre-trained models like CLIP, YOLO-World gains access to semantic information embedded in textual descriptions, which significantly improves its ability to understand and detect objects in images. Efficiency and Practicality Despite its advanced capabilities, YOLO-World remains highly efficient and practical for real-world applications. Its streamlined architecture and efficient implementation ensure that object detection can be performed in real time without sacrificing accuracy or computational resources. This makes YOLO-World suitable for deployment in a wide range of applications, from robotics to image understanding systems. Open-vocabulary Instance Segmentation Feature In addition to its remarkable object detection capabilities, the pre-trained YOLO-World model also excels in open-vocabulary instance segmentation, demonstrating strong zero-shot performance on large-scale datasets. The open-vocabulary instance segmentation feature of YOLO-World enables it to delineate and segment individual objects within images, regardless of whether they belong to predefined categories or not. By using its comprehensive understanding of visual and textual information, YOLO-World can accurately identify and segment objects based on their contextual descriptions, providing valuable insights into the composition and layout of scenes captured in images. YOLO-World achieves 35.4 Average Precision (AP) on the LVIS dataset while maintaining a high inference speed of 52.0 frames per second (FPS). This underscores the model's ability to accurately segment instances across a wide range of object categories, even without specific prior training on those categories. YOLO-World Framework YOLO-World: Real-Time Open-Vocabulary Object Detection Frozen CLIP-based Text Encoder The frozen CLIP-based text encoder plays a fundamental role in processing textual descriptions associated with objects in images.
This text encoder is based on the CLIP (Contrastive Language-Image Pre-training) model, which has been pre-trained on large-scale datasets to understand the relationship between images and corresponding textual descriptions. By leveraging the semantic embeddings generated by the CLIP text encoder, YOLO-World gets access to contextual information about objects, enhancing its ability to interpret visual content accurately. Re-parameterizable Vision-Language Path Aggregation Network The vision-language path aggregation network (RepVL-PAN) serves as the bridge between visual and linguistic information, facilitating the fusion of features extracted from images and textual embeddings derived from the CLIP text encoder. By incorporating cross-modality fusion techniques, RepVL-PAN enhances both the visual and semantic representations of objects. Region-Text Contrastive Loss Region-text contrastive loss involves constructing pairs of regions and their associated textual descriptions, and then calculating the loss using cross-entropy between the predicted object-text similarity and the assigned text indices. YOLO-World incorporates region-text contrastive loss alongside other loss functions such as IoU loss and distributed focal loss for bounding box regression, ensuring comprehensive training and improved performance. This loss function helps YOLO-World learn to accurately associate objects with their corresponding textual descriptions, enhancing the model's object detection capabilities. For more information, read the YOLO-world paper: YOLO-World: Real-Time Open-Vocabulary Object Detection.   YOLO-World Performance Zero-Shot Evaluation on LVIS The YOLO-World model was tested in a zero-shot setting on the Large Vocabulary Instance Segmentation (LVIS) dataset. Despite not being trained in LVIS categories, it performed well, particularly in rare categories. This suggests that the model is effective at generalizing its learned knowledge to new categories. However, it’s important to note that these results are based on internal evaluations and actual performance may vary. YOLO-World: Real-Time Open-Vocabulary Object Detection Speed and Accuracy YOLO-World addresses the limitation of speed in zero-shot object detection models that rely on transformer architectures by applying a faster CNN based YOLO framework. On the challenging LVIS dataset, YOLO-World achieves an impressive 35.4 Average Precision (AP) while maintaining a high inference speed of 52.0 frames per second (FPS) on the V100 platform. This performance surpasses many state-of-the-art methods, highlighting the efficacy of the approach in efficiently detecting a wide range of objects in a zero-shot manner. After fine-tuning, YOLO-World demonstrates remarkable performance across various downstream tasks, including object detection and open-vocabulary instance segmentation, underscoring its versatility and robustness for real-world applications. YOLO-World: Real-Time Open-Vocabulary Object Detection Visualization In visualizations, YOLO-World’s performance is evaluated across three settings: Zero-shot Inference on LVIS: YOLO-World-L detects numerous objects effectively, showcasing its robust transfer capabilities. Inference with User's Vocabulary: YOLO-World-L displays fine-grained detection and classification abilities, distinguishing between sub-categories and even detecting parts of objects. 
Referring Object Detection: YOLO-World accurately locates regions or objects based on descriptive noun phrases, showcasing its referring or grounding capability. YOLO-World: Real-Time Open-Vocabulary Object Detection Performance Evaluation of YOLO-World, GLIP, and Grounding DINO In comparing performance on LVIS object detection, YOLO-World demonstrates superiority over recent state-of-the-art methods such as GLIP, GLIPv2, and Grounding DINO in a zero-shot manner. Performance Comparison: GLIP, GLIPv2, and Grounding DINO in a Zero-shot Manner YOLO-World outperforms these methods in terms of both zero-shot performance and inference speed, particularly when considering lighter backbones like Swin-T. Even when compared to models like GLIP, GLIPv2, and Grounding DINO, which utilize additional data sources such as Cap4M, YOLO-World pre-trained on O365 & GoldG achieves better performance despite having fewer model parameters. The Python code for implementing YOLO-World is available on GitHub, and you can try out the demo of the object detector on their official site or Hugging Face. GPU Optimization By efficiently utilizing GPU resources and memory, YOLO-World achieves remarkable speed and accuracy on a single NVIDIA V100 GPU. Leveraging parallel processing capabilities, optimized memory usage, and GPU-accelerated libraries, YOLO-World ensures high-performance execution for both training and inference. YOLO-World Highlights Open-vocabulary detection capability, surpassing fixed category limitations. Efficient adaptation to new tasks without heavy computation burdens. Simplified deployment, making it practical for real-world applications and edge devices. Incorporation of the innovative RepVL-PAN for enhanced performance in object detection. Strong zero-shot performance, achieving significant improvements in accuracy and speed on challenging datasets like LVIS. Easy adaptation to downstream tasks such as instance segmentation and referring object detection. Pre-trained weights and code made open-source for broader practical use cases. YOLO-World: What's Next With open-vocabulary object detection, YOLO-World has shown improvement in performance against traditional methods. Moving forward, there are different areas for further research: Efficiency Enhancements: Efforts can be directed towards improving the efficiency of YOLO-World, particularly in terms of inference speed and resource utilization. This involves optimizing model architectures, leveraging hardware acceleration, and exploring novel algorithms for faster computation. Fine-grained Object Detection: YOLO-World could undergo refinement to enhance its capability in detecting fine-grained objects and distinguishing between subtle object categories. This involves exploring advanced feature representation techniques and incorporating higher-resolution image inputs. Semantic Understanding: Future developments could focus on enhancing YOLO-World's semantic understanding capabilities, enabling it to grasp contextual information and relationships between objects within a scene. This involves integrating advanced natural language processing (NLP) techniques and multi-modal fusion strategies. A tutorial on evaluating YOLO-World model predictions on Encord is coming up soon!
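To get a feel for the prompt-then-detect workflow described above, here is a minimal inference sketch using the Ultralytics port of YOLO-World. The weight file name, class list, and image path are illustrative assumptions rather than details from the original post, so adjust them to your setup.

```python
# Minimal "prompt-then-detect" sketch with the Ultralytics YOLO-World port.
# Assumes `pip install ultralytics`; weight name and image path are illustrative.
from ultralytics import YOLO

# Load a pre-trained YOLO-World checkpoint.
model = YOLO("yolov8s-world.pt")

# Define an offline vocabulary once: prompts are encoded ahead of time,
# so no text encoder needs to run at detection time.
model.set_classes(["person", "bicycle", "traffic light"])

# Run detection; only objects matching the prompt vocabulary are returned.
results = model.predict("street_scene.jpg", conf=0.25)
for box in results[0].boxes:
    class_name = results[0].names[int(box.cls)]
    print(class_name, float(box.conf), box.xyxy[0].tolist())
```

Because the vocabulary is embedded offline, the same checkpoint can be re-prompted for a different set of classes without any retraining, which is the core of YOLO-World's adaptability.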

March 11

10 min

5 Questions to Ask When Evaluating a Video Annotation Tool

With image and video data fueling advancements across various industries, the video and image annotation tool market is witnessing rapid expansion, projected to grow at a compound annual growth rate (CAGR) of 30% between 2023 and 2032. This growth is particularly pronounced in sectors such as autonomous vehicles, healthcare, and retail, where precise and accurate data annotation is crucial. The increased demand for these tools results from the need for robust quality assurance processes, automation for efficiency, collaboration features for team-based annotation, and streamlined labeling workflows that produce high-quality training data. However, choosing a platform that suits your requirements is challenging given the plethora of available options, each with varying features, scalability, and pricing models. This article will guide you through this tooling landscape. It highlights five critical questions you must ask before investing in a video annotation tool to ensure it aligns with your project requirements and goals. Key Factors that Hinder Efficient Annotation Project Management A robust video annotation tool helps improve annotation workflows, but selecting an appropriate solution requires you to consider the tool's ability to render videos natively, track objects using advanced algorithms, and perform frame-by-frame analysis, as well as its scalability, quality, integrability, and cost. Below are a few factors that can become potential bottlenecks in your CV project. Native Video Rendering Annotating long-form videos can be challenging if the annotation tool lacks features for rendering videos natively. The operating costs can be prohibitive if you use external tools to render multiple videos, limiting your budget for the annotation project. Object Tracking and Frame-by-Frame Analysis Another obstacle to video annotation is sub-optimal object tracking algorithms that cannot address issues such as occlusion, camera shift, and image blur. Additionally, traditional tracking algorithms use a detection framework to identify the object within separate video frames. However, detecting and tracking objects frame by frame can cause annotation inconsistency and increase data transfer volume. The result will be inaccurate labels, delays in processing, and high storage costs if you are using a cloud platform that charges based on data usage. Scalability Handling large and complex video data is essential for providing a high-quality user experience. However, maintaining quality requires error-free training data with accurate labels to build robust computer vision models that can process video feeds efficiently. Finding a tool that you can quickly scale to rising demands is difficult due to the constantly evolving data landscape. Tools with limited scalability can soon become a bottleneck as you start labeling extensive datasets for training large-scale CV applications. For instance, pipelines can break as you feed in more data. This can result in missed deadlines, delays in deployment, and budget overruns as you hire more annotators to compensate for the tool's shortcomings. Quality of Annotation Annotation quality directly affects the performance of supervised learning models, which rely heavily on accurately labeled data for training. Consider developing a machine learning model for a surveillance system to detect abnormal behavior and alert relevant authorities to prevent accidents.
If the model's training set includes video feeds with erroneous labels, it cannot reliably recognize security threats. The result would be false alarms and missed targets, leading to adverse security incidents. Deploying such models in crowded areas can be even more detrimental, as the system will not flag suspicious actions in time. Mitigating these problems requires the annotation tool to have quality assurance and collaboration features that help human annotators verify labeling accuracy and fix errors proactively. Integrability with Existing Infrastructure Developing robust artificial intelligence (AI) models requires more than the best algorithms and evaluation strategies. Instead, the emphasis should be on an integrated infrastructure that seamlessly handles data collection, storage, preprocessing, and curation. As annotation is a vital element of a data curation pipeline, a tool that quickly integrates with your existing machinery can significantly boost productivity and quality. Businesses that fail to build an integrated system operate multiple disparate systems with no synchronization. The result is increased manual effort to organize data assets. This can lead to sub-optimal workflows and poor deployment procedures. Cost A data annotation tool that provides flexible pricing options to upgrade or downgrade your plans according to project needs makes financing decisions easier, paving the way for a faster return on investment (ROI). A cost-effective tool also helps with executive buy-in, as it becomes easier for management to convince the executive team to undertake innovative projects and continue the development process without budgetary hurdles. Learn how to automate video annotation by reading our guide on video annotation automation. How to Select a Video Annotation Tool Due to the challenges discussed above, choosing a tool that meets your required standards becomes time-consuming and delays the launch of your CV application. The following sections explain the primary factors you should consider when investing in a labeling platform. They will help you quickly filter out the desired features to speed up your annotation processes. What are Your Annotation Needs? Understanding the exact annotation requirements should be the first step in selecting a tool and must include the following factors. The Type of Computer Vision (CV) Application CV models for applications like autonomous driving and real-time surveillance call for a scalable annotation platform to label large amounts of real-time video feeds. The type of application will also determine what category of annotation is necessary and whether a particular tool offers the required functionality. Critical applications like medical imaging require pixel-level segmentation masks, while bounding boxes will suffice for security surveillance. Automation for Video-specific Complexities Videos with a higher frame rate (FPS) can take longer to label since annotators must classify objects within each frame. Additionally, videos with fast motion can contain blurred frames. This is especially true for action recognition CV models, where labeling frequently changing human actions becomes challenging. The solution is a tool with automated labeling techniques: pre-trained models (AI-assisted annotation) that label samples in real time, combined with data pipelines that use interpolation algorithms to handle blurry frames and fill in labels between keyframes, as in the sketch below.
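To make the interpolation idea concrete, here is a minimal, tool-agnostic sketch of linearly interpolating bounding boxes between two manually labeled keyframes. The box coordinates and frame indices are illustrative assumptions, not values from any specific product.

```python
# Illustrative sketch: linearly interpolating bounding boxes between two
# manually labeled keyframes, the basic idea behind interpolation-assisted
# video labeling. Boxes are (x_min, y_min, x_max, y_max); values are made up.
def interpolate_boxes(box_start, box_end, start_frame, end_frame):
    """Yield (frame_index, box) for every frame between two keyframes."""
    span = end_frame - start_frame
    for frame in range(start_frame, end_frame + 1):
        t = (frame - start_frame) / span
        box = tuple(s + t * (e - s) for s, e in zip(box_start, box_end))
        yield frame, box

# The annotator labels frame 10 and frame 20; frames 11-19 are filled in automatically.
for frame, box in interpolate_boxes((100, 80, 220, 240), (140, 90, 260, 250), 10, 20):
    print(frame, [round(v, 1) for v in box])
```

Production tools layer object tracking and occlusion handling on top of this, but even plain linear interpolation cuts the number of frames an annotator has to touch by hand.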
Platform Compatibility and User Interface (UI) A tool compatible with several operating systems and environments can improve integrability and prevent disruptions to annotation projects. Similarly, the tool's UI must be intuitive so annotators can quickly learn to use the platform, reducing the time required for staff training. Video Format Compatibility Annotation tools must support multiple video formats, such as MP4, AVI, and FLV, for optimal data processing. They should also provide features to convert annotations into suitable formats to train CV models quickly. Video Annotation Tool: Must-have Functionalities Based on the above considerations, a video annotation tool must have: Features to natively label video datasets frame by frame for advanced object tracking so that minimal downsampling is required. Basic annotation types, such as keypoint annotation for pose estimation, 2D bounding boxes, cuboids, polylines, and polygons for labeling objects within a single video frame. Advanced annotation techniques, such as semantic segmentation, object tracking algorithms, and temporal annotation. Suitable APIs and SDKs to programmatically integrate with existing data pipelines. While these factors are essential for a video annotation tool, it is also advisable to have a manual review process to assess annotation accuracy for high-precision tasks, such as medical imaging, surgical videos, and autonomous navigation. Encord Annotate addresses all the above concerns by offering scalable features and algorithms to handle project complexities, while offering extensive labeling techniques and automation to speed up the annotation process. How Do You Evaluate Annotation Efficiency? The annotation tool should allow you to compute annotation speed and accuracy through intuitive metrics that reflect actual annotation performance. The list below mentions a few popular metrics for measuring the two factors. Metrics for Measuring Annotation Speed Annotations per hour: Measures overall annotator productivity; compare it against industry norms or project expectations. Frames per minute: Captures annotator throughput in video contexts, taking video complexity into account. Time per annotation: Assesses the efficiency of individual annotation tasks; adjust expectations based on the level of annotation detail required. Metrics for Measuring Annotation Accuracy F1-score: Balances precision and recall. In video contexts, precision and recall are typically computed by matching predicted and ground-truth annotations frame by frame using an Intersection over Union (IoU) threshold (a minimal sketch of this computation appears later in this section). Cohen's Kappa and Fleiss' Kappa: Measure inter-annotator agreement; Cohen's Kappa applies to two annotators, while Fleiss' Kappa extends to three or more. Krippendorff's Alpha: Handles diverse or incomplete datasets and helps ensure consistent annotation quality across annotators. Ability to Process Complex Annotation Scenarios Ensure the tool can effectively manage challenges like object occlusion, multiple object tracking, and variable backgrounds, and that its features facilitate accurate labeling across scenarios of varying complexity. Customization and Integrations Customization and integrability with ML models are valuable capabilities that can help you tailor a tool's annotation features to address use-case-specific needs.
Know if they allow you to use open-source annotation libraries to improve existing functionality. Encord Annotate offers multiple quality metrics to analyze annotation quality and ensures high efficiency that meets current industry standards. How Flexible Do You Want the Features to Be? While the features mentioned above directly relate to annotation functionality, a video annotation tool should also offer other advanced capabilities to streamline the video annotation process for computer vision projects. These include tools for managing ontology, handling long-form video footage, quality control, and AI-based labeling. Ontology Management Ontologies are high-level concepts that specify what and how to label and whether additional information is necessary for model training. Users can define hierarchical structures to relate multiple concepts and create a richer annotated dataset for training CV models. For instance, an ontology for autonomous driving applications will specify that the labeler must annotate a car with 2D bounding boxes and provide information regarding its model, color, type, etc. These ontologies allow annotators to correctly identify objects of interest in complex videos and include additional information relevant to scene understanding. The ability to adapt these ontologies across various project types demonstrates the tool's suitability for diverse research and industry needs. Features to Manage Long-form Videos Long-form videos pose unique challenges, as annotators must keep track of longer video sequences and manage labels in more video frames. Tools that allow you to move back and forth between frames and timelines simplify video analysis, as you can easily navigate through the footage to examine objects and scenes. Segmentation: Segmentation is also a valuable feature to look out for, as it allows you to break long videos into smaller segments to create manageable annotation tasks. For instance, automated checks that monitor labels across segments help you identify discrepancies and ensure identical objects have consistent labeling within each segment. Version Control: Finally, version control features let you save and reload previous annotation work to help you keep track of your progress and synchronize tasks across multiple annotators. For example, tools that allow annotators to store annotation revision history and tag particular versions help maintain a clear audit trail. These functionalities improve user experience by reducing fatigue and mitigating errors, as annotators can label long-form videos in separate stages. They also help with quick recovery in case a particular version becomes corrupt. Customizable Workflows and Performance Monitoring Annotation tools that let you customize workflows and guidelines based on project requirements can improve annotation speed by removing redundancies and building processes that match existing annotators' expertise. Further, intuitive dashboards that display relevant performance metrics regarding annotation progress and quality allow management to track issues and make data-driven decisions to boost operational efficiency. Inter-annotator agreement (IAA), annotation speed, and feedback metrics that signify revision cycles are most useful in monitoring annotation efficiency. For instance, an increasing trend in the number of revisions denotes inconsistencies and calls for a root-cause analysis to identify fundamental issues.
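To complement the accuracy metrics discussed above, here is a minimal sketch of how an IoU-based F1-score can be computed for bounding-box annotations on a single frame. It assumes axis-aligned boxes in (x_min, y_min, x_max, y_max) format, a fixed IoU threshold, and simple greedy matching; the example boxes are made up for illustration.

```python
# Minimal sketch of IoU-based precision/recall/F1 for one frame's boxes.
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def f1_score(predicted, ground_truth, threshold=0.5):
    """Greedily match predicted boxes to ground truth at the IoU threshold."""
    matched, used = 0, set()
    for p in predicted:
        best = max(
            ((iou(p, g), i) for i, g in enumerate(ground_truth) if i not in used),
            default=(0.0, None),
        )
        if best[0] >= threshold:
            matched += 1
            used.add(best[1])
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(ground_truth) if ground_truth else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

preds = [(48, 52, 150, 148), (200, 200, 260, 260)]
truth = [(50, 50, 150, 150)]
print(round(f1_score(preds, truth), 2))  # one true positive, one false positive -> 0.67
```

Running the same computation per frame and averaging over a video gives a simple, transparent accuracy number you can compare across annotators or against a tool's built-in dashboard.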
AI-assisted Labeling AI-assisted labeling that involves developing models for domain-specific annotation tasks can be costly, as the process requires manual effort to label sufficient samples for pre-training the labeling algorithms. An alternative approach is to use techniques like interpolation, semantic and instance segmentation, object tracking, and detection to label video frames without developing a custom model. For example, video annotation tools with object-tracking algorithms can automatically identify objects of interest and fill in the gaps using only a small set of manually labeled data. This enhances annotation efficiency, as annotators do not have to train a separate model from scratch and only label a few items while leaving the rest to AI. Quality Assurance and Access Control Regardless of the level of automation, labeling is error-prone, as it is challenging to correctly annotate each object in all video frames. This limitation calls for a tool with quality assurance features, including feedback cycles, progress trackers, and commenting protocols. These help human annotators collaborate with experts to identify and fix errors. Efficient access control features also become crucial for managing access across different teams and assigning relevant roles to multiple members within a project. The Encord platform features robust AI-based annotation algorithms while allowing you to integrate custom models, build tailored workflows, and create detailed ontologies to manage long-form videos. What Type of Vendor Are You Looking for? The next vital step in evaluating a tool is assessing different vendors and comparing their annotation services and platforms against standard benchmarks while factoring in upfront and ongoing costs. A straightforward strategy is to list the required features for your annotation project and draw a comparison table to determine which platforms offer these features and at what cost. Here are a few points you should address: Managed Service vs. Standalone Platform: Decide whether you require a managed service or a standalone application. While a managed service frees you from annotating the data in-house, a standalone tool offers more security and transparency in the annotation process. A side-by-side comparison detailing each model's implications for your workflow and data governance practices can guide your decision. Onboarding Costs: Analyze all costs associated with adopting and using the tool, distinguishing between one-time onboarding fees, recurring licensing costs, and any potential hidden fees. Consider creating a multi-year cost projection to understand the total cost of ownership and how it compares to the projected ROI. Ecosystem Strength: A vendor with a robust community and ecosystem offers additional resources for maximizing the value of your tool investment, with access to a broader range of insights, support, and potential integrations. Long-term Suitability: Other relevant factors in evaluating vendors include customer reviews, the vendor's track record in providing regular updates, support for innovative projects, long-term clients, and customer support quality. Analyzing these will help you assess whether the vendor is a suitable long-run strategic partner who will proactively support your company's mission and vision.
What is the Standard of Post-purchase Services? Investing in a video annotation tool is a long-term strategic action involving repeated interactions with the vendor to ensure a smooth transition process and continuous improvements. Below are a few essential services that vendors must offer post-purchase to provide greater value and meet changing demands as project requirements evolve. Training Resources: The vendor must provide easy access to relevant training materials such as detailed documentation, video tutorials, and on-site support to help users take full advantage of the tool's feature set right from the start. Data Security Protocols: While compliance with established security standards, including GDPR, HIPAA, ISO, and SOC, is crucial, the vendor must continuously update its encryption protocols to address the dynamic nature of data and rising privacy concerns. Post-purchase, the vendor must ensure robust security measures by following ethical practices and analyzing sensitive information in your project to implement suitable safeguards against breaches and data misuse. Customer Support: The vendor must offer 24/7 customer support for bug resolution and workflow assistance. Want to know the most crucial features of a video annotation tool? Read our article on the five features of video annotation. Encord complies with HIPAA, FDA, and CE standards, making it an ideal tool for sensitive annotation tasks, especially for medical use cases. Evaluating a Video Annotation Tool: Key Takeaways As CV models permeate multiple domains, such as healthcare, retail, and manufacturing, video annotation tools will be critical determinants of the success of modern CV projects. Below are a few key factors you should consider when evaluating a video annotation platform. Annotation Requirements: Defining these will allow you to filter out the desired feature set and scalability demands. Evaluation of Annotation Efficiency: Understanding evaluation methodologies will help you select a tool that offers suitable metrics to assess annotation speed and accuracy. Feature Flexibility: Ontology management, AI-assisted labeling, and options to customize workflows are crucial features that let you tailor the tool's feature set to your requirements. Strategic Vendor Evaluation: Analyzing upfront and ongoing costs helps you determine the total cost of ownership and whether the vendor is a suitable long-term strategic partner. Quality of Post-purchase Services: With the ever-changing data landscape, you need a vendor that constantly updates its security and training protocols to keep pace with ongoing developments.

March 8

8 min

Encord Monthly Wrap: February Industry Newsletter

Hi there, Welcome to The Computer Vision Monthly Wrap. Here's what you should expect: 📦 YOLOv9 release with an explainer and code walkthrough on creating custom datasets. 📸 Meta's V-JEPA for predicting video features. 📽️ Understanding Sora, OpenAI's text-to-video model. ⚒️ Developer resources to learn how to analyze object detection model errors. ☁️ Computer vision case study from NVIDIA and Oracle. 🚀 Lessons from working with computer vision operations (CVOps) at scale. Let's dive in! Top Picks for Computer Vision Papers This Month YOLOv9: Better than SoTA with Cutting-edge Real-time Object Detection If you haven't heard yet, YOLOv9 is out, and, wow, it's a high-performing model! YOLOv9 builds upon previous versions, using advancements in deep learning techniques and architectural design to beat state-of-the-art (SoTA) models on object detection tasks. What's impressive? 🤯 It achieves top performance in object detection tasks on benchmark datasets like MS COCO. It surpasses existing real-time object detectors (YOLOv6, YOLOv8) in terms of accuracy, speed, and overall performance. It is much more adaptable to different scenarios and use cases. We have started seeing various applications, including surveillance, autonomous vehicles, robotics, and more. It is better than SoTA methods that use depth-wise convolution because it uses both the Programmable Gradient Information (PGI) and GELAN (Generalized Efficient Layer Aggregation Network) architectures. Read the paper on arXiv. If that's a lot, we also put out an explainer to help get to the important bits quickly with a walkthrough on using the open-source YOLOv9 release to create custom datasets. There's also an accompanying repository for the implementation of the paper. Meta's V-JEPA: Video Joint Embedding Predictive Architecture Explained In February, Meta released V-JEPA, a vision model exclusively trained using a feature prediction objective. In contrast to conventional machine learning methods, which rely on pre-trained image encoders, text, or human annotations, V-JEPA learns directly from video data without external supervision. What's impressive? 👀 Instead of reconstructing images or relying on pixel-level predictions, V-JEPA prioritizes video feature prediction. This approach leads to more efficient training and superior performance in downstream tasks. V-JEPA requires shorter training schedules than traditional pixel prediction methods (VideoMAE, Hiera, and OmniMAE) while maintaining high-performance levels. We wrote a comprehensive explainer of V-JEPA, including the architecture, key features, and performance details, in this blog post. Here is the accompanying repository on the implementation of V-JEPA. OpenAI Releases New Text-to-Video Model, Sora OpenAI responded to the recent debut of Google's Lumiere, a space-time diffusion model for video generation, by unveiling its own creation: Sora. The diffusion model can transform text descriptions into high-definition video clips of up to one minute. In this comprehensive explainer, you will learn: How Sora works Capabilities and limitations Safety considerations Other text-to-video generative models. Gemini 1.5: Google's Generative AI Model with 1 Million-Token Context Length and MoE Architecture Gemini 1.5 is a sparse mixture-of-experts (MoE) multimodal model with a context window of up to 1 million tokens in production and 10 million tokens in research.
It excels at long-term recall and retrieval and generalizes zero-shot to long instructions, like analyzing 3 hours of video with near-perfect recall. Here is an explainer blog that distils the technical report with the necessary information. Developer Resources You’d Find Useful Multi-LoRA Composition for Image Generation → The space is moving so fast that it’s hard to miss out on gems like Multi-LoRA! The Multi-LoRA composition implementation integrates diverse elements like characters & clothing into a unified image to avoid the detail loss and distortion seen in traditional LoRA Merge. Check out the repo and try it yourself. Scaling MLOps for Computer Vision by MLOps.Community → In this panel conversation, experienced engineers talk about their experience, challenges, and best practices for working with computer vision operations (CVOps) at scale. How to Analyze Failure Modes of Object Detection Models for Debugging → This guide showcases how to use Encord Active to automatically identify and analyze the failure modes of a computer vision model to understand how well or poorly it performs in challenging real-world scenarios. NVIDIA Triton Server Serving at Oracle [Case Study] → I really liked this short case study by the Oracle Cloud team that discussed how their computer vision and data science services accelerate AI predictions using the NVIDIA Triton Inference Server. Some learnings in terms of cost savings and performance optimization are valuable. Here are other quick finds if you 💓 Encord and computer vision data stuff ⚡: Join the Encord Community to discuss this newsletter. Data-centric computer vision blog.

March 8

10 min

Top 10 Video Object Tracking Algorithms in 2024

Object tracking has become a fundamental part of the computer vision ecosystem. It powers various modern artificial intelligence applications and is behind several revolutionary technologies, such as self-driving cars, surveillance, and action recognition systems. Tracking algorithms combine object detection with frame-to-frame association to detect and localize entities within a video. These algorithms range from basic machine learning to complex deep learning networks. Each of these has different implementations and use cases. This article will discuss the top 10 most popular video object-tracking algorithms. It will go over video object-tracking algorithms' back-end implementations, advantages, and disadvantages. We will also explore popular computer vision applications for object tracking. What is Video Object Tracking? Video object tracking refers to detecting an object within a video frame and tracking its position throughout the video. The concept of object tracking stems from object detection, a popular computer vision (CV) technique used for identifying and localizing different objects in images. While object detection works on still images (single frames), video object tracking applies this concept to every frame in the video. It analyzes each frame to identify the object in question and draw a bounding box around it. The object is effectively tracked throughout the video by performing this operation on all frames. In addition, complex machine learning and deep learning algorithms apply techniques such as region proposal and trajectory prediction for real-time object inference. Object tracking algorithms have revolutionized several industries. They have enabled businesses to implement analytics and automation in various domains and led to applications like: Autonomous Vehicles: Tracking surrounding elements like pedestrians, roads, curbs. Automated Surveillance: Tracking people or illegal objects like guns and knives. Sports Analytics: Tracking the ball or players to create match strategies. Augmented Reality Applications: Tracking all objects in the visual field to superimpose the virtual elements. Customer Analysis in Retail: Tracking retail store customers to understand movement patterns and optimize shelf placement. Over the years, object tracking algorithms have undergone various improvements in terms of accuracy and performance. Let’s discuss these in detail. Single-stage Object Detectors Vs. Two-stage Object Detectors Object detection is a crucial part of tracking algorithms. Hence, it is vital to understand them in detail. There are two main categories of object detectors: Single-stage and two-stage. Both methodologies have proven to provide exceptional results. However, each offers different benefits, with the former having a lower inference time and the latter having better accuracy. Single-stage detectors perform faster since they rely on a single network to produce annotations. These models skip intermediate feature extraction steps, such as region proposal. They use the raw input image to identify objects and generate bounding box coordinates. One example of a single-stage detector is You Only Look Once (YOLO). YOLO can generate annotations with a single pass of the image. Single Stage Vs. Two-Stage Detection Two-stage detectors, such as Faster R-CNN, comprise two networks. The first is a region proposal network (RPN) that analyzes the image and extracts potential regions containing the desired objects. 
The second network is a CNN-based feature extractor that analyzes the proposed regions, identifies the objects present, and outputs their bounding box coordinates. Two-stage object detectors are computationally expensive compared to their single-stage counterparts. However, they produce more accurate results. Object Tracking Approaches Object tracking algorithms work on two granularity levels. These include: Single Object Tracking (SOT) SOT is used to track the location of a single object throughout the video feed. These detection-free algorithms depend on the user to provide a bounding box around the target object on the first frame. The algorithm learns to track the position and movement of the object present within the box. It localizes the object's shape, posture, and trajectory in every subsequent frame. Single object tracking is useful when the focus must be kept on a particular entity. Some examples include tracking suspicious activity in surveillance footage or ball-tracking in sports analytics. Popular SOT algorithms include Particle Filters and Siamese Networks. However, one downside of traditional SOT algorithms is that they are unsuitable for context-aware applications where tracking multiple objects is necessary. Multiple Object Tracking (MOT) MOT works on the same concept as SOT. However, multi-object tracking identifies and tracks multiple objects throughout a video instead of a single object. MOT algorithms use extensive training datasets to understand moving objects. Once trained, they can identify and track multiple objects within each frame. Modern deep-learning MOT algorithms, like DeepSORT, can even detect new objects mid-video and create new tracks for them while keeping existing tracks intact. Multiple-object tracking is useful when various objects must be analyzed simultaneously. For example, in augmented reality (AR) applications, the algorithm must keep track of all objects in the frame to superimpose the virtual elements. However, these algorithms are computationally expensive and require lengthy training time. Phases of Object Tracking Process Visual object tracking is a challenging process comprising several phases. Target Initialization: The first step is to define all the objects of interest using labels and bounding boxes. The annotations, which include the names and locations of all the objects to be tracked, are specified in the first video frame. The algorithm then learns to identify these objects in all the subsequent images or video sequences. Appearance Modelling: An object may undergo visual transformation throughout the video due to varying lighting conditions, motion blur, image noise, or physical augmentations. This phase of the object-tracking process aims to capture these various transformations to improve the model’s robustness. It includes constructing object descriptions and mathematical models to identify objects with different appearances. Motion Estimation: Once the object features are defined, motion estimation predicts the object’s position based on the previous frame data. This is achieved by leveraging linear regression techniques or Particle Filters. Target Positioning: Motion estimation provides an estimate of the object's position. The next step is to pinpoint the exact coordinates within the predicted region. This is accomplished using either a greedy search over candidate locations or maximum a posteriori (MAP) estimation, which picks the most likely position using visual cues. 
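To make these phases concrete, the sketch below runs a minimal single-object tracking loop with OpenCV's built-in KCF tracker (KCF is covered later in this article). It is only an illustration: it assumes the opencv-contrib-python package is installed, the video path and initial bounding box are hypothetical, and the tracker factory name varies slightly between OpenCV versions (for example, cv2.legacy.TrackerKCF_create in some 4.x builds).

```python
import cv2

# Target initialization: open a (hypothetical) video and read the first frame.
cap = cv2.VideoCapture("input_video.mp4")  # replace with your own video path
ok, frame = cap.read()
if not ok:
    raise RuntimeError("Could not read the first frame")

# (x, y, width, height) of the object to track; in practice you would draw this
# interactively, e.g. with cv2.selectROI(frame).
initial_bbox = (150, 80, 60, 120)

# The tracker builds its internal appearance model from the first frame.
tracker = cv2.TrackerKCF_create()  # may be cv2.legacy.TrackerKCF_create() in some builds
tracker.init(frame, initial_bbox)

# Motion estimation and target positioning happen inside tracker.update() for each new frame.
while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of video
    found, bbox = tracker.update(frame)
    if found:
        x, y, w, h = map(int, bbox)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("Tracking", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # press Esc to stop
        break

cap.release()
cv2.destroyAllWindows()
```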
Criteria for Selecting a Video Object Tracking Algorithm The two primary criteria to evaluate object tracking methods are accuracy and inference time. These help determine the best algorithm for particular use cases. Let’s discuss these criteria in detail. Accuracy Tracking algorithms output two main predictions: object identity (label) and location (bounding box coordinates). The accuracy of these models is determined by evaluating both predictions and analyzing how well the model identifies and localizes the object. Metrics like Accuracy, Precision, Recall, and F1-score help evaluate the model's ability to classify the found object. While accuracy provides a generic picture, precision and recall judge the model based on occurrences of false positives and false negatives. Metrics like intersection-over-union (IoU) are used for localization accuracy. IoU calculates how much the predicted bounding box coincides with its ground truth value. A higher value means greater overlap and, hence, higher localization accuracy. Intersection Over Union (IoU) Inference Time The second judgment criterion is the speed of inference. Inference time determines how quickly the algorithm processes a video frame and predicts the object label and location. It is often measured in frames-per-second (FPS). This refers to the number of frames the algorithm can process and output every second. A higher FPS value indicates faster inference. Challenges in Object Tracking Object tracking techniques carry various benefits for different industries. However, implementing a robust tracking algorithm is quite challenging. Some key challenges include: Object Variety: The real world comes with countless objects. Training a generic tracking algorithm would require an extensive dataset containing millions of objects. For this reason, object tracking models are generally domain-specific, with even the largest models covering only a few thousand object classes. Varying Conditions: Besides the object variety, the training data must also cover objects in different conditions. A single object must be captured in different lighting conditions, seasons, times of day, and from different camera angles. Varying Image Quality: Images from different lenses produce varying information in terms of color reproduction, saturation, etc. A robust model must incorporate these variations to cover all real-world scenarios. Computation Costs: Handling large image or video datasets requires considerable expertise and computational power. Developers need access to top-notch GPUs and data-handling tools, which can be expensive. Training deep-learning-based tracking algorithms can also increase operational costs if you use paid platforms that charge based on data units processed. Scalability: Training general-purpose object tracking models requires extensive datasets. The growing data volumes introduce scalability challenges, as developers require platforms that can handle increasingly large datasets and scale up compute to train larger, more complex models. Top Algorithms for Video Object Tracking Here is a list of popular object tracking algorithms, ranging from simple mathematical models to complex deep learning architectures. Kalman Filter Kalman filters estimate an object’s position and predict its motion in subsequent frames. They maintain an internal representation of the object's state, including its position, velocity, and sometimes acceleration. 
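As a rough illustration of that state representation, here is a minimal constant-velocity Kalman filter written with NumPy. It tracks a 2D position and velocity from noisy position measurements; the measurements and noise settings below are made up for the example.

```python
import numpy as np

dt = 1.0  # time step between frames

# State: [x, y, vx, vy]; constant-velocity transition model.
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],      # we only measure position (e.g. a detection's box centre)
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 0.01             # process noise (motion uncertainty)
R = np.eye(2) * 1.0              # measurement noise (detector jitter)

x = np.zeros(4)                  # initial state estimate
P = np.eye(4) * 10.0             # initial state covariance

def kalman_step(x, P, z):
    """One predict/update cycle given a new position measurement z = [mx, my]."""
    # Predict the next state from the motion model.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Correct the prediction with the measurement.
    y = z - H @ x_pred                       # innovation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(4) - K @ H) @ P_pred
    return x_new, P_new

# Hypothetical noisy centre positions of a detected object over five frames.
for z in [np.array([10.2, 5.1]), np.array([11.9, 6.2]), np.array([14.1, 7.0]),
          np.array([15.8, 8.1]), np.array([18.2, 9.0])]:
    x, P = kalman_step(x, P, z)
    print("estimated position:", x[:2], "estimated velocity:", x[2:])
```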
The filters use information from the object’s previous state and a mathematical model of the object’s motion to predict a future state. The model accounts for any uncertainty in the object's motion (noise). It incorporates all the discussed factors and estimates the object’s current state to create a future representation. Advantages It is a mathematical model that does not require any training. It is computationally efficient. Disadvantages Subpar performance and capabilities compared to modern deep learning algorithms. The model works on various assumptions, such as constant object acceleration. The algorithm does not perform well in random motion scenarios. KCF (Kernelized Correlation Filters) KCF is a mathematical model that understands object features and learns to distinguish them from their background. It starts with the user providing a bounding box around the object in the first frame. Once feature understanding is complete, it uses correlation filters based on the kernel trick to construct a high-dimensional relationship between the features and the true object. It uses the correlation features in subsequent frames to scan around the object's last known location. The area with the highest correlation is predicted to contain the object. Advantages Fast computation. Low memory requirements. Competitive results in general cases. Disadvantages Traditional KCF faces challenges in conditions such as varying object scales or objects touching frame boundaries. DeepSORT The Deep Simple Online Realtime Tracking (DeepSORT) algorithm extends the original SORT algorithm. The original SORT algorithm used Kalman filters to predict object motion and the Hungarian algorithm for frame-by-frame data association (a short sketch of this association step appears after the FairMOT section below). However, SORT struggles with occlusions and varying camera angles and can lose track of objects in such complex scenarios. DeepSORT Architecture DeepSORT uses an additional convolutional neural network (CNN) as a feature extractor. The extracted appearance features capture the object's identity (appearance) in different scenarios and allow the algorithm to distinguish between moving objects. DeepSORT combines the information from the Kalman filter and the CNN to create a deep association metric for accurate tracking. Advantages DeepSORT’s simple yet efficient implementation provides real-time performance. The model is modular. It can support any detection network of the user's choice, such as YOLO or SSD. It maintains tracks through occlusions and can distinguish between different objects in complex scenarios. Disadvantages Offline training of a separate detection network can be challenging and requires an extensive dataset for high accuracy. FairMOT The fair multi-object tracking (FairMOT) algorithm uses a pre-trained model like Faster R-CNN for detecting objects in the video sequence. It then uses a neural network to extract features from the detected objects. FairMOT Architecture These features are used to track the object across other frames. The detection and re-identification (feature extraction) branches share the same underlying architecture and receive equal weightage during training. The FairMOT algorithm treats both tasks fairly and provides a balanced performance between detection and tracking. Advantages Provides balanced performance between tracking and detection. Improved tracking accuracy due to the re-identification branch (feature extraction branch). Disadvantages Computationally expensive due to the two neural network branches being trained. 
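As referenced in the DeepSORT section above, here is a small sketch of the IoU-based data association step used by SORT-style trackers, matching predicted track boxes to new detections with the Hungarian algorithm via scipy.optimize.linear_sum_assignment. All boxes are hypothetical and given as (x1, y1, x2, y2).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_threshold=0.3):
    """Return (track_idx, detection_idx) pairs whose IoU clears the threshold."""
    cost = np.zeros((len(tracks), len(detections)))
    for t, trk in enumerate(tracks):
        for d, det in enumerate(detections):
            cost[t, d] = 1.0 - iou(trk, det)        # the Hungarian algorithm minimises cost
    row_idx, col_idx = linear_sum_assignment(cost)
    return [(t, d) for t, d in zip(row_idx, col_idx)
            if 1.0 - cost[t, d] >= iou_threshold]

# Hypothetical predicted track boxes and new-frame detections.
tracks = [(100, 100, 150, 180), (300, 220, 360, 300)]
detections = [(305, 225, 362, 298), (98, 104, 149, 178), (500, 50, 540, 90)]
print(associate(tracks, detections))  # [(0, 1), (1, 0)]; detection 2 is unmatched and would start a new track
```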
MDNet The multi-domain network (MDNet) is popular for learning across different domains. It consists of two modules. The first is a CNN architecture shared amongst all the video sequences, i.e., it is domain-independent and learns from the entire dataset. This consists of CNN layers followed by a few fully connected layers. MDNet Architecture The second part comprises parallel fully connected (FC) layers, each processing domain-specific information. If the data captures information from 5 domains, the second portion will have 5 FC layers. Each of these layers is independently updated during back-propagation depending on the domain of the target image. Advantages Excellent performance across different domains. The domain-specific branches can be fine-tuned on the fly if significant domain shifts are detected. Disadvantages If data is imbalanced, the model will display uneven performance across the different domains. YOLOv8 (You Only Look Once) YOLOv8 is a single-stage detector that ranks among the most popular object tracking algorithms. The YOLO family of models is based on a CNN architecture that learns to predict object labels and positions with a single pass of the image. YOLOv8 Tasks Catalog YOLOv8 follows a similar architecture to its predecessors and consists of various convolutional and fully connected layers. It is an anchor-free algorithm, which directly predicts the object’s center rather than an offset from a predefined anchor. Moreover, the algorithm can be used for classification, segmentation, pose estimation, object detection, and tracking. YOLOv8 extends its detection capabilities by providing a range of trackers. Two popular options amongst these are Bot-SORT and ByteTrack (a short usage sketch appears after the Siamese network section below). All the trackers are customizable, and users can fine-tune parameters like confidence threshold and tracking area. Advantages The model covers various use cases, including tracking and segmentation. High accuracy and performance. Easy Python interface. Disadvantages Trouble detecting small objects. YOLOv8 comes in various model sizes, each trading off accuracy against speed. Here’s all you need to know about the YOLO family of models. Read more about YOLO models for Object Detection Explained [YOLOv8 Updated] Siamese Neural Networks (SNNs) Siamese-based tracking algorithms consist of two parallel branches of neural networks. One is the template branch, which takes the template image (the object defined by the first-frame bounding box) and the next frame where the object is to be found. This branch consists of CNN and pooling layers and extracts features, such as edges, texture, and shape, from both images. A fully convolutional siamese network for object tracking The other is the similarity branch, which takes the features from the template and search images and computes how similar they are; such networks are commonly trained with objectives like contrastive loss. The output of this network is the likelihood of the object being present at different positions in the search image. The Siamese network has had various advancements over the years. The modern architectures include attention mechanisms and RPNs for improved performance. Advantages Multiple advancements, including SiamFC, SiamRPN, etc. Disadvantages Training two parallel networks leads to long training times. 
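Here is the YOLOv8 tracking sketch promised above. It assumes the ultralytics package is installed; the video path is hypothetical, the pretrained weights are downloaded on first use, and argument names may differ slightly between ultralytics releases.

```python
from ultralytics import YOLO

# Load a small pretrained detection model; the weights are downloaded on first use.
model = YOLO("yolov8n.pt")

# Run multi-object tracking on a hypothetical video with the ByteTrack tracker.
# Switch to tracker="botsort.yaml" to use Bot-SORT instead.
results = model.track(source="traffic.mp4", tracker="bytetrack.yaml", conf=0.25)

# Each result corresponds to one frame; boxes carry persistent track IDs.
for frame_result in results:
    if frame_result.boxes.id is None:
        continue  # no tracked objects in this frame
    for box, track_id in zip(frame_result.boxes.xyxy, frame_result.boxes.id):
        print(f"track {int(track_id)}: box {box.tolist()}")
```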
GOTURN (Generic Object Tracking Using Regression Networks) GOTURN is a deep learning-based offline learning algorithm. The framework accepts two images, a previous frame and a current frame. The previous frame contains the object at its center, and the image is cropped to 2 times the bounding box size. The current frame is cropped in the same location, but the object is off-center as it has supposedly moved from its position. GOTURN High-level Architecture The internal structure of the model consists of convolutional layers taken from the CaffeNet architecture. Each of the two frames is passed through these layers, and the output is concatenated and processed through a series of fully connected layers. The objective of the network is to learn features from the previous frame and predict the bounding box in the current frame. Advantages Excellent performance, even on CPU. Disadvantages Struggles in scenarios where only part of the object is visible. Object tracking is highly affected by imbalanced training data. Want to know more about creating a fair dataset? Read more about Balanced and Imbalanced Datasets in Machine Learning [Full Introduction] TLD (Tracking, Learning, and Detection) TLD is a framework designed for long-term tracking of an unknown object in a video sequence. The three components serve the following purposes: Tracker: Predicts the object location in the next frame using information in the current frame. This module uses techniques like mean-shift or correlation filtering. Detector: Scans the input frame-by-frame for potential objects using previously learned object appearances. Learning: Observes the tracker's and the detector's performance and identifies their errors. It further generates training samples to teach the detector to avoid mistakes in the future. Tracking-Learning-Detection Advantages Real-time performance. Disadvantages Sensitive to illumination changes. Can lose track of the object if it is completely occluded in any frame. Can fail if the object appearance changes mid-video. Median Flow Tracker The Median Flow Tracker predicts object movement in videos by analyzing feature points across frames. It estimates optical flow, filters out unreliable measurements, and uses the remaining data to update the object's bounding box. Tracking using Median Flow Internally, it tracks motion in both forward and backward directions and compares the two trajectories. Advantages Works well for predictable motion. Disadvantages Fails in scenarios of abrupt and random motion. Applications of Video Object Tracking Video object tracking has important use cases in various industries. These use cases automate laborious tasks and provide critical analytics. Let's discuss some key applications. Autonomous Vehicles Market leaders like Tesla, Waymo, and Baidu are constantly enhancing their AI infrastructure with state-of-the-art algorithms and hardware for improved tracking. Modern autonomous vehicles use different cameras and robust neural processing engines to track the objects surrounding them. Video object tracking plays a vital role in mapping the car's surroundings. This feature map helps the vehicle distinguish between elements such as trees, roads, pedestrians, etc. Autonomous Harvesting Robots Object tracking algorithms also benefit the agriculture industry by allowing autonomous detection and harvesting of ready crops. Agri-based companies like Four Growers use detection and tracking algorithms to identify harvestable tomatoes and provide yield forecasting. They use the Encord annotation tool and a team of professional annotators to label millions of objects simultaneously. Using AI-assisted tools has allowed them to cut the data processing time by half. 
Sports Analytics Sports analysts use computer vision algorithms to track player and ball movement to build strategies. Video tracking algorithms allow the analysts to understand player weaknesses and generate AI-based analytics. The tracking algorithms can also be used to correct player posture to improve performance and mitigate injury risks. Traffic Congestion & Emission Monitoring System Computer vision is used to track traffic activity on roads and at airports. The data is also used to manage traffic density and ensure smooth flow. Companies like Automotus use object tracking models to monitor curb activity and reduce carbon emissions. Their solution automatically captures the time a car spends on the curb, detects any traffic violations, and analyzes driver behavior. Vascular Ultrasound Analysis Object detection has various use cases in the healthcare domain. One of the more prominent applications is ultrasound analysis for diagnosing and managing vascular diseases like Popliteal Artery Aneurysms (PAAs). CV algorithms help medical practitioners detect anomalous entities in medical imaging. The automated detection enables further AI analysis, such as classification, and surfaces minute irregularities that might otherwise be missed. Professional Video Editing Professional tools like Adobe Premiere Pro use object tracking to aid content creators. It allows creators to apply advanced special effects to various elements and save time creating professional edits. Customer Analysis in Retail Stores Tracking algorithms are applied in retail stores via surveillance cameras. They are used to detect and track customer movement throughout the store premises. The tracking data helps the store owner understand hot spots where the customers spend the most time. It also gives insights into customer movement patterns that help optimize product placement on shelves. Want to know the latest computer vision use cases? Learn more about the ten most exciting applications of computer vision in 2024 Video Object Tracking: Key Takeaways The computer vision domain has come a long way, and tasks like classification, segmentation, and object tracking have seen significant improvements. ML researchers have developed various algorithms for video object tracking, each with its own strengths and trade-offs. In this article, we discussed some of the most popular architectures. Here are a few takeaways: Object Tracking vs. Object Detection: Video object tracking is an extension of object detection and applies the same principles to video sequences. Multiple Categories of Object Tracking: Object tracking comprises various sub-categories, such as single object tracking, multiple object tracking, single-stage detection, and two-stage detection. Object Tracking Metrics: Object tracking algorithms are primarily judged on their inference time (frames-per-second) and tracking accuracy. Popular Frameworks: Popular tracking frameworks include YOLOv8, DeepSORT, GOTURN, and MDNet. Applications: Object tracking is used across various domains, including healthcare, autonomous vehicles, customer analysis, and sports analytics.

March 8

10 min


Get Your Models Into Production Faster
Encord is transforming how businesses are getting their computer vision models into production. We can do the same for you. Talk to us to find out how.

Top 9 Tools for Generative AI Model Validation in Computer Vision

The integrity, diversity, and reliability of the content that AI systems generate depend on generative AI model validation. It involves using tools to test, evaluate, and improve these models. Validation is important for detecting biases, errors, and potential risks in AI-generated outputs and for facilitating their rectification to adhere to ethical and legal guidelines. The demand for robust validation tools is increasing with the adoption of generative AI models. This article presents the top 9 tools for generative AI model validation. These tools help identify and correct discrepancies in generated content to improve model reliability and transparency in AI applications. The significance of model validation tools cannot be overstated, especially as generative AI continues to become mainstream. These tools are critical to the responsible and sustainable advancement of generative AI because they ensure the quality and integrity of AI-generated content. Here’s the list of tools we will cover in this article: Encord Active DeepChecks HoneyHive Arthur Bench Galileo LLM Studio TruLens Arize Weights and Biases HumanLoop Now that we understand the importance of optimizing performance in generative AI models, let's delve into the guidelines or criteria that can help us evaluate different tools and help us achieve these goals.  Criteria for Evaluating Generative AI Tools In recent years, generative AI has witnessed significant advancements, with pre-trained models as a cornerstone for many breakthroughs. Evaluating generative AI tools involves comprehensively assessing their quality, robustness, and ethical considerations.  Let’s delve into the key criteria for evaluating the generative AI tools: Scalability and Performance: Assess how well the tool handles increased workloads. Can it scale efficiently without compromising performance? Scalability is crucial for widespread adoption. Model Evaluation Metrics: Consider relevant metrics such as perplexity, BLEU score, or domain-specific measures. These metrics help quantify the quality of the generated content. Support for Different Data Types: Generative AI tools should handle various data types (text, images, videos, etc.). Ensure compatibility with your specific use case. Built-in Metrics to Assess Sample Quality: Tools with built-in quality assessment metrics are valuable. These metrics help measure the relevance, coherence, and fluency of the generated content. Interpretability and Explainability: Understand how the model makes decisions. Transparent models are easier to trust and debug. Experiment Tracking: Effective experiment tracking allows you to manage and compare different model versions. It's essential for iterative improvements. Usage Metrics: Understand how real users interact with the model over time. Usage metrics provide insights into adoption, engagement, and user satisfaction. Remember that generative AI is unique, and traditional evaluation methods may need adaptation. By focusing on these criteria, organizations can fine-tune their generative AI projects and drive successful results both now and in the future. Encord Active Encord Active is a data-centric model validation platform that allows you to test your models and deploy into production with confidence. Inspect model predictions and compare to your Ground Truth, surface common issue types and failure environments, and easily communicate errors back to your labeling team in order to validate your labels for better model performance. 
By emphasizing real data for accuracy and efficiency, Encord Active ensures foundation models are optimized and free from biases, errors, and risks. The Model Evaluation & Data Curation Toolkit to Build Better Models Key Features Let’s evaluate Encord Active based on the specified criteria: Scalability and Performance: Encord Active ensures robust model performance and adaptability as data landscapes evolve. Model Evaluation Metrics: The tool provides robust model evaluation capabilities, uncovering failure modes and issues. Built-in Metrics to Assess Sample Quality: It automatically surfaces label errors and validates labels for better model performance. Interpretability and Explainability: Encord Active offers explainability reports for model decisions. Experiment Tracking: While not explicitly mentioned, it likely supports experiment tracking. Usage Metrics: Encord Active helps track usage metrics related to data curation and model evaluation. Semantic Search: Encord Active is a data-centric AI platform that uses a built-in CLIP to index images from Annotate. The indexing process involves analyzing images and textual data to create a searchable representation that aligns images with potential textual queries. This provides an in-depth analysis of your data quality.Semantic search with Encord Active can be performed in two ways. Either through text-based queries by searching your images with natural language, or through Reference or anchor image by searching your images using a reference or anchor image. The guide recommends using Encord Annotate to create a project and import the dataset, and Encord Active to search data with natural language.  Best for Encord Active is best suited for ML practitioners deploying production-ready AI applications, offering data curation, labeling, model evaluation, and semantic search capabilities all in one. Learn about how Automotus increased mAP 20% while labeling 35% less of their dataset with Encord Active.   Pricing Encord Active OS is an open-source toolkit for local installation. Encord Active Cloud (an advanced and hosted version) has a pay-per-user model. Get started here.  Deepchecks Deepchecks is an open-source tool designed to support a wide array of language models, including ChatGPT, Falcon, LLaMA, and Cohere.  DeepChecks Dashboard Key Features and Functionalities Scalability and Performance: Deepchecks ensures validation for data and models across various phases, from research to production. Model Evaluation Metrics: Deepchecks provides response time and throughput metrics to assess model accuracy and effectiveness. Interpretability and Explainability: Deepchecks focuses on making model predictions understandable by associating inputs with consistent outputs. Usage Metrics: Deepchecks continuously monitors models and data throughout their lifecycle, customizable based on specific needs. Open-Source Synergy: Deepchecks supports both proprietary and open-source models, making it accessible for various use cases. Best for Deepchecks is best suited for NLP practitioners, researchers, and organizations seeking comprehensive validation, monitoring, and continuous improvement of their NLP models and data.  Pricing The pricing model for Deepchecks is based on the application count, seats, daily estimates and support options. The plans are categorized into Startup, Scale and Dedicated. HoneyHive HoneyHive is a platform with a suite of features designed to ensure model accuracy and reliability across text, images, audio, and video outputs. 
It adheres to NIST's AI Risk Management Framework, which provides a structured approach to managing the risks inherent in non-deterministic AI systems, from development to deployment. HoneyHive - Evaluation and Observability for AI Applications Key Features and Functionalities Scalability and Performance: HoneyHive enables teams to deploy and continuously improve LLM-powered products, working with any model, framework, or environment. Model Evaluation Metrics: It provides evaluation tools for assessing prompts and models, ensuring robust performance across the application lifecycle. Built-in Metrics for Sample Quality: HoneyHive includes built-in sample quality assessment, allowing teams to monitor and debug failures in production. Interpretability and Explainability: While not explicitly mentioned, HoneyHive’s focus on evaluation and debugging likely involves interpretability and explainability features. Experiment Tracking: HoneyHive offers workspaces for prompt templates and model configurations, facilitating versioning and management. Usage Metrics: No explicit insights into usage patterns and performance metrics. Additional Features Model Fairness Assessment: Incorporate tools to evaluate model fairness and bias, ensuring ethical and equitable AI outcomes. Automated Hyperparameter Tuning: Integrate hyperparameter optimization techniques to fine-tune models automatically. Best for HoneyHive.ai is best suited for small teams building Generative AI applications, providing critical evaluation and observability tools for model performance, debugging, and collaboration. Pricing HoneyHive.ai offers a free plan for individual developers. Arthur Bench Arthur Bench is an open-source evaluation tool for comparing LLMs, prompts, and hyperparameters for generative text models. It enables businesses to evaluate how different LLMs will perform in real-world scenarios so they can make informed decisions when integrating the latest AI technologies into their operations. Arthur Bench’s comparison of the hedging tendencies in various LLM responses  Key Features and Functionalities Scalability and Performance: Arthur Bench evaluates large language models (LLMs) and allows comparison of different LLM options. Model Evaluation Metrics: Bench provides a full suite of scoring metrics, including summarization quality and hallucinations. Built-in Metrics to Assess Sample Quality: Arthur Bench offers metrics for assessing accuracy, readability, and other criteria. Interpretability and Explainability: Not explicitly mentioned. Experiment Tracking: Bench allows teams to compare test runs. Usage Metrics: Bench is available as both a local version (via GitHub) and a cloud-based SaaS offering, completely open source. Additional Features Customizable Scoring Metrics: Users can create and add their own custom scoring metrics. Standardized Prompts for Comparison: Bench provides standardized prompts designed for business applications, ensuring fair evaluations. Best for The Arthur Bench tool is best suited for data scientists, machine learning researchers, and teams comparing large language models (LLMs) using standardized prompts and customizable scoring metrics. Pricing Arthur Bench is an open-source AI model evaluator, freely available for use and contribution, with opportunities for monetization through team dashboards. Galileo LLM Studio Galileo LLM Studio is a platform designed for building production-grade Large Language Model (LLM) applications, providing tools for ensuring that LLM-powered applications meet standards. 
The tool supports local and cloud testing. Galileo LLM Studio  Key Features and Functionalities  Scalability and Performance: Galileo LLM Studio is a platform for building Large Language Model (LLM) applications. Model Evaluation Metrics: Evaluate, part of LLM Studio, offers out-of-the-box evaluation metrics to measure LLM performance and curb unwanted behavior or hallucinations. Built-in Metrics to Assess Sample Quality: LLM Studio’s Evaluate module includes metrics to assess sample quality. Interpretability and Explainability: Not explicitly mentioned. Experiment Tracking: LLM Studio allows prompt building, version tracking, and result collaboration. Usage Metrics: LLM Studio’s Observe module monitors productionized LLMs. Additional Features Here are some additional features of Galileo LLM Studio: Generative AI Studio: Users build, experiment and test prompts to fine-tune model behavior, to improve the relevance and model efficiency by exploring the capabilities of generative AI NLP Studio: Galileo supports natural language processing (NLP) tasks, allowing users to analyze language data, develop models, and work on NLP tasks. This integration provides a unified environment for both generative AI and NLP workloads. Best for Galileo LLM Studio, is a specialized platform tailored for individuals working with Large Language Models (LLMs) because it provides necessary tools specifically designed for LLM development, optimization and validation.  Pricing The pricing model for Galileo GenAI Studio is based on two predominant models: Consumption: This pricing model is usually measured per thousand tokens used. It allows users to pay based on their actual usage of the platform. Subscription: In this model, pricing is typically measured per user per month. Users pay a fixed subscription fee to access the platform’s features and services. TruLens TruLens enables the comparison of generated outputs to desired outcomes to identify discrepancies. Advanced visualization capabilities provide insights into model behavior, strengths, and weaknesses. TruLens for LLMs Key Features and Functionalities Scalability and Performance: TruLens evaluates large language models (LLMs) and scales up experiment assessment. Model Evaluation Metrics: TruLens provides feedback functions to assess LLM app quality, including context relevance, groundedness, and answer relevance. Built-in Metrics to Assess Sample Quality: TruLens offers an extensible library of built-in feedback functions for identifying LLM weaknesses. Interpretability and Explainability: Not explicitly emphasized Experiment Tracking: TruLens allows tracking and comparison of different LLM apps using a metrics leaderboard. Usage Metrics: TruLens is versatile for various LLM-based applications, including retrieval augmented generation (RAG), summarization, and co-pilots. Additional Features Customizable Feedback Functions: TruLens allows you to define your custom feedback functions to tailor the evaluation process to your specific LLM application. Automated Experiment Iteration: TruLens streamlines the feedback loop by automatically assessing LLM performance, enabling faster iteration and model improvement.  Best for  TruLens for LLMs is suited for natural language processing (NLP) researchers, and developers who work with large language models (LLMs) and want to rigorously evaluate their LLM-based applications.  Pricing TruLens is an open-source model and is thus free and available for download.  
Arize Arize AI is designed for model observability and large language model (LLM) evaluation. It helps monitor and assess machine learning models, track experiments, and ensure model performance and reliability, offering automatic insights, heatmap tracing, cohort analysis, and A/B comparisons. Arize Dashboard Key Features and Functionalities Scalability and Performance: Arize AI handles large-scale deployments and provides real-time monitoring for performance optimization. Model Evaluation Metrics: Arize AI offers a comprehensive set of evaluation metrics, including custom-defined ones. Sample Quality Assessment: It monitors data drift and concept drift to assess sample quality. Interpretability and Explainability: Arize AI supports model interpretability through visualizations. Experiment Tracking: Users can track model experiments and compare performance. Usage Metrics: Arize AI provides insights into model usage patterns. Additional Features ML Observability: Arize AI surfaces worst-performing slices, monitors embedding drift, and offers dynamic dashboards for model health. Task-Based LLM Evaluations: Arize AI evaluates task performance dimensions and troubleshoots LLM traces and spans. Best for Arize AI helps business leaders pinpoint and resolve model issues quickly. Arize AI is for anyone who needs model observability, evaluation, and performance tracking. Pricing Arize AI offers three pricing plans: Free Plan: Basic features for individuals and small teams. Pro Plan: Suitable for small teams, includes more models and enhanced monitoring features. Enterprise Plan: Customizable for larger organizations with advanced features and tailored support. Weights and Biases Weights and Biases enables ML professionals to track experiments, visualize performance, and collaborate effectively. Logging metrics, hyperparameters, and training data facilitates comparison and analysis. Using this tool, ML practitioners gain insights, identify improvements, and iterate for better performance. Weights & Biases: The AI Developer Platform Key Features and Functionalities Scalability and Performance: W&B helps AI developers build better models faster by streamlining the entire ML workflow, from tracking experiments to managing datasets and model versions. Model Evaluation Metrics: W&B provides a flexible and tokenization-agnostic interface for evaluating auto-regressive language models on various Natural Language Understanding (NLU) tasks, supporting models like GPT-2, T5, GPT-J, GPT-Neo, and Flan-T5. Built-in Metrics to Assess Sample Quality: While not explicitly mentioned, W&B’s evaluation capabilities likely include metrics to assess sample quality, given its focus on NLU tasks. Interpretability and Explainability: W&B does not directly provide interpretability or explainability features, but it integrates with other libraries and tools (such as Fastai) that may offer such capabilities. Experiment Tracking: W&B allows experiment tracking, versioning, and visualization with just a few lines of code (a minimal logging sketch appears after this feature list). It supports various ML frameworks, including PyTorch, TensorFlow, Keras, and Scikit-learn. Usage Metrics: W&B monitors CPU and GPU usage in real-time during model training, providing insights into resource utilization. Additional Features Panels: W&B provides visualizations called “panels” to explore logged data and understand relationships between hyperparameters and metrics. Custom Charts: W&B enables the creation of custom visualizations for analyzing and interpreting experiment results. 
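Here is the minimal W&B logging sketch referenced above. The project name and metric values are hypothetical; it assumes the wandb package is installed and that you are logged in to a W&B account.

```python
import random
import wandb

# Start a run; config values are versioned alongside the logged metrics.
run = wandb.init(project="generative-model-validation", config={"lr": 3e-4, "epochs": 5})

for epoch in range(run.config.epochs):
    # Stand-in metrics; in practice these come from your training/evaluation loop.
    train_loss = 1.0 / (epoch + 1) + random.random() * 0.05
    val_score = 0.6 + 0.05 * epoch
    wandb.log({"epoch": epoch, "train_loss": train_loss, "val_score": val_score})

run.finish()
```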
Best for Weights & Biases (W&B) is best suited for machine learning practitioners and researchers who need comprehensive experiment tracking, visualization, and resource monitoring for their ML workflows. Pricing The Weights & Biases (W&B) AI platform offers the following pricing plans: Personal Free: Unlimited experiments, 100 GB storage, and no corporate use allowed. Teams: Suitable for teams, includes free tracked hours, additional hours billed separately. Enterprise: Custom plans with flexible deployment options, unlimited tracked hours, and dedicated support. HumanLoop HumanLoop uses HITL (Human In The Loop), allowing collaboration between human experts and AI systems for accurate and quality outputs. By facilitating iterative validation, models improve with real-time feedback. With expertise from leading AI companies, HumanLoop offers a comprehensive solution for validating generative AI models. Humanloop: Collaboration and evaluation for LLM applications Key Features and Functionalities Scalability and Performance: Humanloop provides a collaborative playground for managing and iterating on prompts across your organization, ensuring scalability while maintaining performance. Model Evaluation Metrics: It offers an evaluation and monitoring suite, allowing you to debug prompts, chains, or agents before deploying them to production. Built-in Metrics to Assess Sample Quality: Humanloop enables you to define custom metrics, manage test data, and integrate them into your CI/CD workflows for assessing sample quality. Interpretability and Explainability: While Humanloop emphasizes interpretability by allowing you to understand cause and effect, it also ensures explainability by revealing hidden parameters in deep neural networks. Experiment Tracking: Humanloop facilitates backtesting changes and confidently updating models, capturing feedback, and running quantitative experiments. Usage Metrics: It provides insights into testers’ productivity and application quality, helping you make informed decisions about model selection and parameter tuning. Additional Features Best-in-class Playground: Humanloop helps developers manage and improve prompts across an organization, fostering collaboration and ensuring consistency. Data Privacy and Security: Humanloop emphasizes data privacy and security, allowing confident work with private data while complying with regulations. Best for The Humanloop tool is particularly well-suited for organizations and teams that require collaborative AI validation, model evaluation, and experiment tracking, making it an ideal choice for managing and iterating on prompts across different projects. Its features cater to both technical and non-technical users, ensuring effective collaboration and informed decision-making in the AI development and evaluation process.  Pricing  Free Plan allows for Humanloop AI product prototyping for 2 members with 1,000 logs monthly and community support.  Enterprise Plan includes enterprise-scale deployment features and priority assistance. Generative AI Model Validation Tools: Key Takeaways Model validation tools ensure reliable and accurate AI-generated outputs, enhancing user experience, and fostering trust in AI technology.  Adaptation of these tools to evolving technologies is needed to provide real-time feedback, prioritizing - transparency, accountability, and fairness to address bias and ethical implications in AI-generated content.  
The choice of a tool should consider scalability, performance, model evaluation metrics, sample quality assessment, interpretability, experiment tracking, and usage metrics.  Generative AI Validation Importance:  The pivotal role of generative AI model validation ensures content integrity, diversity, and reliability, emphasizing its significance in adhering to ethical and legal guidelines. Top Tools for Model Validation: Different tools are available catering to diverse needs, helping identify and rectify biases, errors, and discrepancies in AI-generated content, essential for model transparency and reliability. Criteria for Tool Evaluation: The key criteria for evaluating generative AI tools are focusing on scalability, model evaluation metrics, sample quality assessment, interpretability, and experiment tracking to guide organizations in choosing effective validation solutions. Adaptation for Generative AI: Recognizing the uniqueness of generative AI, the article emphasizes the need for adapting traditional evaluation methods. By adhering to outlined criteria, organizations can fine-tune generative AI projects for sustained success, coherence, and reliability.

March 6

10 min

Panoptic Segmentation Updates in Encord

Panoptic Segmentation Updates in Encord Over the past 6 months, we have updated and built new features within Encord with a strong focus on improving your panoptic segmentation workflows across data, labeling, and model evaluation. Here are some updates we’ll cover in this article: Bitmask lock. SAM + Bitmask lock + Brush for AI-assisted precision labeling. Fast and performant rendering of fully bitmask-segmented images and videos. Panoptic Quality model evaluation metrics. Bitmask Lock within Encord Annotate to Manage Segmentation Overlap Our Bitmask Lock feature introduces a way to prevent segmentation masks from overlapping, providing pixel-perfect accuracy for your object segmentation tasks. By simply toggling the “Bitmask cannot be drawn over” button, you can prevent any part of a bitmask label from being included in another label. This feature is crucial for applications requiring precise object boundaries and pixel-perfect annotations, eliminating the risk of overlapping segmentations. Let’s see how to do this within Encord Annotate: Step 1: Create your first Bitmask Initiating your labeling process with the Bitmask is essential for creating precise object boundaries. If you are new to the Bitmask option, check out our quickstart video walkthrough on creating your first Bitmask using brush tools for labeling. Step 2: Set Bitmask Overlapping Behavior  Managing how bitmasks overlap is vital for ensuring accurate segmentation, especially when dealing with multiple objects that are close to each other or overlapping. After creating your first bitmask, adjust the overlapping behavior settings to dictate how subsequent bitmasks interact with existing ones. This feature is crucial for delineating separate objects without merging their labels—perfect for panoptic segmentation. This prevents any part of this bitmask label from being included in another label. This is invaluable for creating high-quality datasets for training panoptic segmentation models. Step 3: Lock Bitmasks When Labeling Multiple Instances Different images require different approaches. Beyond HSV, you can use intensity values for grayscale images (like DICOM) or RGB for color-specific labeling. This flexibility allows for tailored labeling strategies that match the unique attributes of your dataset. Experiment with the different settings (HSV, intensity, and RGB) to select the best approach for your specific labeling task. Adjust the criteria to capture the elements you need precisely. Step 4: Using the Eraser Tool Even with careful labeling, adjustments may be necessary. The eraser tool can remove unwanted parts of a bitmask label before finalizing it, providing an extra layer of precision. If you've applied a label inaccurately, use the eraser tool to correct any errors by removing unwanted areas of the bitmask. See our documentation to learn more. Bitmask-Segmented Images and Videos Got a Serious Performance Lift (At Least 5x) Encord's commitment to enhancing user experience and efficiency is evident in the significant performance improvements made to the Bitmask-segmented annotation within the Label Editor. Our Engineering team has achieved a performance lift of at least 5x by directly addressing user feedback and pinpointing critical bottlenecks. This improves how fast the editor loads for your panoptic segmentation labeling instances.  
Here's a closer look at the differences between the "before" and "after" scenarios, highlighting the advancements: Before the Performance Improvements: Performance Lag on Zoom: Users experienced small delays when attempting to zoom in on images, with many instances (over 100) that impacted the precision and speed of their labeling process. Slow Response to Commands: Basic functionalities like deselecting tools or simply navigating through the label editor were met with sluggish responses. Operational Delays: Every action, from image loading to applying labels, was hindered by "a few milliseconds" of delay, which accumulated significant time overheads across projects. After the Performance Enhancements: Quicker Image Load Time: The initial step of image loading has seen a noticeable speed increase! This sets a good pace for the entire labeling task. Responsiveness: The entire label editor interface, from navigating between tasks to adjusting image views, is now remarkably more responsive. This change eradicates previous lag-related frustrations and allows for a smoother user experience. Improved Zoom Functionality: Zooming in and out has become significantly more fluid and precise. This improvement is precious for detailed labeling work, where accuracy is paramount. The positive changes directly result from the Engineering team's responsiveness to user feedback. Our users have renewed confidence in handling future projects with the Label Editor. We are dedicated to improving Encord based on actual user experiences. Use Segment Anything Model (SAM) and Bitmask Lock for High Annotation Precision Starting your annotation process can be time-consuming, especially for complex images. Our Segment Anything Model (SAM) integration offers a one-click solution to create initial annotations. SAM identifies and segments objects in your image, significantly speeding up the annotation process while ensuring high accuracy. Step 1: Select the SAM tool from the toolbar with the Bitmask Lock enabled.  Step 2: Click on the object you wish to segment in your image. SAM will automatically generate a precise bitmask for the object. Step 3: Use the bitmask brush to refine the edges for pixel-perfect segmentation if needed. See how to use the Segment Anything Model (SAM) within Encord in our documentation.   Validate Segmentation with Panoptic Quality Metrics You can easily evaluate your segmentation model’s panoptic mask quality with new metrics:  mSQ (mean Segmentation Quality) mRQ (mean Recognition Quality) mPQ (mean Panoptic Quality) The platform will calculate mSQ, mRQ, and mPQ for your predictions, labels, and dataset to clearly understand the segmentation performance and areas for improvement. Navigate to Active → Under the Model Evaluation tab, choose the panoptic model you want to evaluate. Under Display, toggle the Panoptic Quality Metrics (still in beta) option to see the model's mSQ, mRQ, and mPQ scores. Fast Rendering of Fully Bitmask-Segmented Images within Encord Active The performance improvement within the Label Editor also translates to how you view and load panoptic segmentation within Active.  Try it yourself: Key Takeaways: Panoptic Segmentation Updates in Encord Here’s a recap of the key features and improvements within Encord that can improve your Panoptic Segmentation workflows across data and models: Bitmask Lock: This feature prevents overlaps in segmentation. 
it guarantees the integrity of each label, enhancing the quality of the training data and, consequently, the accuracy of machine learning models. This feature is crucial for projects requiring meticulous detail and precision. SAM + Bitmask Lock + Brush: The Lock feature allows you to apply Bitmasks to various objects within an image, which reduces manual effort and significantly speeds up your annotation process. The integration of SAM within Encord's platform, using Lock to manage Bitmask overlaps, and the generic brush tool empower you to achieve precise, pixel-perfect labels with minimal effort. Fast and Performant Rendering of Fully Bitmask-segmented Images and Videos: We have made at least 5x improvements to how Encord quickly renders fully Bitmask-segmented images and videos across Annotate Label Editor and Active. Panoptic Quality Model Evaluation Metrics: The Panoptic Quality Metrics—comprising mean Segmentation Quality (mSQ), mean Recognition Quality (mRQ), and mean Panoptic Quality (mPQ)—provide a comprehensive framework for evaluating the effectiveness of segmentation models.
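For readers who want the definitions behind these scores: panoptic quality decomposes into segmentation quality and recognition quality (PQ = SQ × RQ, following Kirillov et al.'s panoptic segmentation formulation). The sketch below computes all three for a single class from hypothetical matched-segment IoUs and error counts; Encord Active computes mSQ, mRQ, and mPQ for you, so this is only to build intuition.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ, SQ, RQ for a single class.

    matched_ious: IoU of each true-positive match between a predicted and a
                  ground-truth segment (matches are defined by IoU > 0.5).
    num_fp:       predicted segments with no matching ground-truth segment.
    num_fn:       ground-truth segments with no matching prediction.
    """
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp                    # average IoU of matched segments
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)   # an F1-style detection term
    return sq * rq, sq, rq

# Hypothetical example: 3 matched segments, 1 false positive, 2 false negatives.
pq, sq, rq = panoptic_quality([0.92, 0.81, 0.77], num_fp=1, num_fn=2)
print(f"PQ={pq:.3f} SQ={sq:.3f} RQ={rq:.3f}")
# The mean scores (mPQ, mSQ, mRQ) simply average these per-class values over all classes.
```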

March 6

7 min

Claude 3 | AI Model Suite: Introducing Opus, Sonnet, and Haiku

What is Claude 3? Claude 3 is a family of large multimodal models by Anthropic: Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku. Opus excels in various domains. Sonnet balances skills and speed. Haiku prioritizes speed and affordability. All models process text and images, offer improved multilingual fluency, and undergo comprehensive evaluations for safety. Joining the race to build AI chatbots alongside OpenAI’s ChatGPT, Google’s Gemini 1.5, and Mistral AI’s Le Chat, Anthropic has introduced Claude, an AI assistant that helps organizations manage tasks at any scale. The Claude 3 model family shows better performance than other SOTA models. Claude 3 sets new benchmarks across reasoning, math, coding, multilingual understanding, and vision quality. It leverages unsupervised learning and Constitutional AI and was trained on AWS and GCP hardware using the PyTorch, JAX, and Triton frameworks. Claude 3’s AI Model Suite Each large language model within the Claude 3 family is tailored to offer different combinations of capabilities, speed, and cost-effectiveness. Claude 3 Opus It is the most capable offering, achieving state-of-the-art results on benchmark evaluations across various domains such as reasoning, math, and coding. It sets new standards in performance and is suitable for applications requiring high levels of intelligence and processing power. Claude 3 Sonnet It provides a balance between skills and speed, offering strong performance in cognitive tasks while being more efficient in terms of processing time compared to Opus. Claude 3 Haiku It is the fastest and least expensive model in the family, suitable for applications where speed and cost-effectiveness are prioritized over absolute performance. Intelligence Benchmark Scores Vs. Cost Comparison of Claude 3 Model Family All models in the Claude 3 family come with vision capabilities for processing image data and exhibit improved fluency in non-English languages, making them versatile for a global audience. Model Training Data and Process The Claude 3 models are trained using a blend of publicly available internet data as of August 2023, along with public data from data labeling services and synthetic data generated internally. The training process involves several data cleaning and filtering methods, including deduplication and classification. The models are not trained on any user-submitted prompt or output data. Anthropic follows industry practices when obtaining data from public web pages, respecting robots.txt instructions and other signals indicating whether crawling is permitted. The crawling system operates transparently, allowing website operators to identify Anthropic visits and signal their preferences. The training of Claude 3 models emphasizes being helpful, harmless, and honest. Techniques include pretraining on diverse data sets for language capabilities and incorporating human feedback to elicit desirable responses. Constitutional AI, including principles from sources like the UN Declaration of Human Rights, ensures alignment with human values. A principle promoting respect for disability rights is integrated into Claude's constitution. Human feedback data, including publicly available sources, is used for finetuning. For more information on RLHF, read the blog Guide to Reinforcement Learning from Human Feedback (RLHF) for Computer Vision.   
Performance Benchmark: Claude 3, GPT-4, GPT-3.5, Gemini Ultra, and Gemini Pro Claude 3, particularly the Opus model, surpasses other state-of-the-art models in various evaluation benchmarks for AI tools. It excels in domains such as undergraduate and graduate-level expert knowledge (MMLU, GPQA), basic mathematics (GSM8K), and more. Opus demonstrates near-human levels of comprehension and fluency, positioning itself at the forefront of general intelligence. Compared to other models like OpenAI’s GPT-4, GPT-3.5, Gemini Ultra, and Gemini Pro, Claude 3 models showcase enhanced capabilities in diverse areas. These include analysis and forecasting, nuanced content creation, code generation, and multilingual conversation proficiency in languages such as Spanish, Japanese, and French. Performance Benchmark Scores of Claude 3 Model Family: Opus, Sonnet, Haiku Claude 3 Capabilities Vision Capabilities: Photos, Charts, Graphs and Technical Diagrams The Claude 3 models are equipped to process and interpret visual information along with text inputs. The vision capabilities are particularly showcased in tasks like the AI2D science diagram benchmark and visual question answering. They excel in parsing scientific diagrams and achieving high accuracy rates in both zero-shot and few-shot settings. Evaluation Results on Multimodal Tasks Trained on diverse visual data, Claude 3 models effectively interpret and analyze various visual content, enhancing their overall problem-solving capabilities for applications in fields like image understanding and multimodal reasoning. Near Instant Model Results Claude 3 models deliver near-instant results, ideal for live customer chats, auto-completions, and data extraction tasks. Haiku is the fastest and most cost-effective, processing dense research papers in under three seconds. Sonnet is twice as fast as previous versions, suitable for rapid tasks like knowledge retrieval. Opus matches previous speeds but with higher intelligence levels. Multimodal Claude 3 shows impressive multimodal capabilities, adept at processing diverse types of data. Claude 3 excels in visual question answering, demonstrating its capacity to understand and respond to queries based on images. It showcases strong quantitative reasoning skills by analyzing and deriving insights from visual data, enhancing its overall versatility across various tasks. Multilingual Understanding Claude 3 showcases robust multilingual capabilities, important for global accessibility. Evaluations highlight Claude 3 Opus's state-of-the-art performance in the Multilingual Math MGSM benchmark, achieving over 90% accuracy in a zero-shot setting. Human feedback shows significant improvement in Claude 3 Sonnet, indicating enhanced multilingual reasoning capabilities compared to previous versions. The Claude 3 Model Family: Multilingual Capabilities Factual Accuracy Claude 3 prioritizes factual accuracy through rigorous evaluations, including 100Q Hard and Multi-factual datasets. Tracking correctness, incorrect, and unsure responses, Claude 3 Opus significantly improves accuracy over previous versions. Factual Accuracy of Claude 3 Models Vs Claude 2.1 Reasoning and Mathematical Problem Solving Claude 3 exhibits remarkable reasoning and mathematical problem-solving abilities, surpassing previous models in various benchmarks. In evaluations such as GPQA and MATH, Claude 3 Opus achieves significant improvements, although falling slightly short of expert-level accuracy. 
Leveraging techniques like chain-of-thought reasoning and majority voting further enhances performance, with Opus demonstrating impressive scores in both reasoning and mathematical problem-solving tasks, showcasing its advanced capabilities in these domains. Near-human Comprehension Claude 3 Sonnet outperforms its predecessors, Claude 2 and Claude Instant, in various core tasks, as assessed through direct comparisons by human raters. It excels in writing, coding, long document Q&A, non-English conversation, and instruction following. Domain experts across finance, law, medicine, STEM, and philosophy prefer Sonnet in 60-80% of cases. Human feedback, although noisy, provides insights into user preferences that industry benchmarks may overlook. Using Elo scores, Sonnet shows a significant improvement of roughly 50-200 points over Claude 2 models in various subject areas.  Claude models exhibit high proficiency in open-ended conversation, coding tasks, and text-related operations like searching, writing, and summarizing. They also interpret visual input for enhanced productivity and maintain a helpful, conversational tone, described as steerable, adaptive, and engaging by users. Claude's prediction mechanism constructs responses sequentially based on the input and past conversation, unable to edit previous responses or access external information beyond its context window, achieving near-human comprehension in various tasks. Contextual Understanding and Fewer Refusals Unlike previous versions, Claude 3 models are less likely to refuse to answer prompts that are within their capabilities and ethical boundaries. This improvement indicates a more refined understanding of context and a reduction in unnecessary refusals, enhancing their overall performance and usability. Comparison of Incorrect Refusals: Claude 3 Model Family Vs. Claude 2.1 Information Recall from Long Context Claude 3's capability for information recall from long contexts is impressive, expanding from 100K to 200K tokens and supporting contexts up to 1M tokens. Despite challenges in reliable recall within long contexts, Claude 3 models, particularly Claude Opus, exhibit significant improvements in accurately retrieving specific information. In evaluations like Needle In A Haystack (NIAH), Claude Opus consistently achieves over 99% recall in documents of up to 200K tokens, highlighting its enhanced performance in information retrieval tasks. Information Recall: Claude 3 Model Family (Opus, Sonnet, Haiku) Vs. Claude 2 Improved Accuracy Improved accuracy in Claude 3 models is important for businesses relying on them to serve customers at scale. Evaluation involves a large set of complex, factual questions targeting known weaknesses in previous models.  Accuracy Comparison: Claude 3 Model Family (Opus, Sonnet, Haiku) Vs. Claude 2 Claude 3 Opus demonstrates a twofold improvement in accuracy, reducing incorrect answers and admitting uncertainty when necessary. The upcoming features like citations will enhance trustworthiness by enabling precise verification of answers from reference material. For more information, read the model card:The Claude 3 Model Family: Opus, Sonnet, Haiku   Model Details Claude 3: Model Availability Opus and Sonnet are currently available for use in the Anthropic API, enabling developers to sign up and start using these models immediately. Haiku will be available soon. Sonnet powers the free experience on claude.ai, while Opus is available for Claude Pro subscribers. 
Sonnet is available through Amazon Bedrock, with Opus and Haiku coming soon to both Amazon Bedrock and Google Cloud's Vertex AI Model Garden in a private preview. Model Costs Claude 3 Opus Claude 3 Opus stands out as the most intelligent model, offering unparalleled performance on complex tasks. It excels in handling open-ended prompts and navigating sight-unseen scenarios with remarkable fluency and human-like understanding, showcasing the outer limits of generative AI. However, this high intelligence comes at a higher cost of $15 per million input tokens and $75 per million output tokens. The context window for Opus is 200K tokens, and it is suitable for tasks such as task automation, research and development, and advanced strategic analysis. Claude 3 Sonnet Claude 3 Sonnet, on the other hand, strikes a balance between intelligence and speed, making it ideal for enterprise workloads. It offers strong performance at a lower cost compared to its peers, with rates of $3 per million input tokens and $15 per million output tokens. Sonnet's context window is also 200K tokens, and it is suitable for data processing, sales tasks, and time-saving operations like code generation. Claude 3 Haiku Claude 3 Haiku is the fastest and most compact model, designed for near-instant responsiveness. It excels in handling simple queries and requests with unmatched speed and affordability, costing $0.25 per million input tokens and $1.25 per million output tokens. Haiku's context window is also 200K tokens, and it is suitable for tasks like customer interactions, content moderation, and cost-saving operations.  The Claude 3 Haiku model is now accessible via Amazon Bedrock on Amazon Web Services.                                                               Responsible Design Risk Mitigation Dedicated teams continuously track and mitigate various risks, including misinformation, harmful content, and potential misuse in areas such as biological information, election integrity, and autonomous replication. Bias Reduction Ongoing efforts focus on reducing biases in model outputs, with Claude 3 demonstrating decreased biases compared to previous models, as measured by the Bias Benchmark for Question Answering (BBQ). Model Neutrality Advanced methods such as Constitutional AI enhance model transparency and neutrality, guaranteeing that results are not biased toward any one political position. Responsible Scaling Policy Claude 3 models are classified at AI Safety Level 2 (ASL-2) under the Responsible Scaling Policy, with rigorous evaluations affirming minimal potential for catastrophic risks at present. Future models will be closely monitored to assess their proximity to ASL-3.  Claude 3: What’s Next Here is what to expect from the new models of Anthropic’s Claude:  Feature Updates for Enterprise Use Case Tool Use or Function Calling: Development is underway to enable Claude 3 to utilize functions, allowing for more advanced task automation and data processing. REPL or Interactive Coding: Claude 3 will soon support an interactive coding environment, providing users with the ability to engage in real-time code execution and debugging. Advanced Agentic Capabilities: Explorations are ongoing to equip Claude 3 with more advanced agentic capabilities, facilitating seamless interaction with users and autonomous execution of complex tasks. 
Large-scale Deployments: Optimization efforts are being made to ensure Claude 3 is suitable for large-scale deployments, enabling it to handle high volumes of requests while maintaining performance and reliability in enterprise settings. Safety Guardrails with Feature Advancements: In line with feature updates, Claude 3 is also working on its safety protocols to mitigate risks and promote responsible usage. At the same time, the focus remains on leveraging these advancements to foster positive societal outcomes, allowing users to achieve their goals ethically and efficiently while upholding principles of fairness, transparency, and accountability in artificial intelligence.
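Since Opus and Sonnet are already exposed through the Anthropic API (see the availability section above), here is a minimal sketch of querying Claude 3 with the Anthropic Python SDK. It assumes the anthropic package is installed and ANTHROPIC_API_KEY is set in your environment; the prompt and token budget are illustrative.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-opus-20240229",  # or "claude-3-sonnet-20240229"
    max_tokens=512,                  # illustrative output budget
    messages=[
        {"role": "user", "content": "Summarize the trade-offs between Opus, Sonnet, and Haiku."}
    ],
)
print(response.content[0].text)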

March 5

10 min

sampleImage_stable-diffusion-3-text-to-image-model
Stable Diffusion 3: Multimodal Diffusion Transformer Model Explained

What is Stable Diffusion 3? Stable Diffusion 3 (SD3) is an advanced text-to-image generation model developed by Stability AI. Leveraging a latent diffusion approach and a Multimodal Diffusion Transformer architecture, SD3 generates high-quality images from textual descriptions. SD3 demonstrates superior performance compared to state-of-the-art text-to-image generation systems such as DALL·E 3, Midjourney v6, and Ideogram v1. On human preference evaluations, SD3 has shown advancements in typography and prompt adherence, setting a new standard in text-to-image generation. Stable Diffusion 3 is the latest version of the Stable Diffusion models. Stable Diffusion is built for text-to-image generation, leveraging a latent diffusion model trained on 512x512 images from a subset of the LAION-5B database. Supported by a generous compute donation from Stability AI and backing from LAION, this model combines a latent diffusion approach with a frozen CLIP ViT-L/14 text encoder for conditioning on text prompts. Exploring Stable Diffusion 3: Text-to-Image Model One of the notable features of SD3 is its architecture, which includes a Multimodal Diffusion Transformer (MMDiT). This architecture utilizes separate sets of weights for image and language representations, leading to improved text understanding and spelling capabilities compared to previous versions of Stable Diffusion. The core architecture of Stable Diffusion 3 is based on a diffusion transformer architecture combined with flow matching techniques. This combination allows for the efficient and effective generation of high-quality images conditioned on textual input. Stable Diffusion 3 models vary in size, ranging from 800 million to 8 billion parameters, to cater to different needs for scalability and quality in generating images from text prompts. The goal of Stable Diffusion 3 is to align with the core values of the development team, including democratizing access to AI technologies. By offering open-source models of varying sizes and capabilities, Stable Diffusion 3 aims to provide users with a range of options to meet their creative needs, whether they require faster processing times or higher image quality. Let’s dive into the two core concepts of Stable Diffusion 3: Diffusion Transformer (DiT) Diffusion Transformers, or DiTs, are a class of diffusion models that utilize the transformer architecture for image generation. Unlike traditional approaches that rely on the U-Net backbone, DiTs operate on latent patches, offering improved scalability and performance. Images were generated using Diffusion Transformer Through an analysis of scalability using Gflops (billions of floating-point operations per forward pass), it has been observed that diffusion transformers (DiTs) with higher Gflops, achieved through increased transformer depth/width or a higher number of input tokens, consistently exhibit lower Frechet Inception Distance (FID). This implies improved performance in terms of image quality. For more information on Diffusion Transformers, read the paper: Scalable Diffusion Models with Transformers. While transformers have gained popularity in fields like natural language processing (NLP) and computer vision, their use in image-level generative models has been limited. This tendency is reflected in the general preference for the convolutional U-Net architecture in diffusion models.
But U-Net's inductive bias doesn’t necessarily make it the best choice for diffusion models, prompting researchers to explore alternative architectures such as transformers. Inspired by Vision Transformers, DiTs ensure scalability, efficiency, and high-quality sample generation, making them a good option for generative modeling. OpenAI’s recent text-to-video model uses Diffusion Transformers in its architecture. For more information, read the blog: OpenAI Releases New Text-to-Video Model, Sora. Flow Matching: A Model Training Technique The core concept of Flow Matching (FM) redefines Continuous Normalizing Flows (CNFs) by focusing on regressing vector fields of fixed conditional probability paths, eliminating the need for simulations. FM is versatile and can accommodate various types of Gaussian probability paths, including traditional diffusion paths used in diffusion models. It provides a robust and stable alternative for training diffusion models, which are commonly used in generative modeling tasks. Empirical evaluations on ImageNet, a widely used dataset for image classification tasks, demonstrate that FM consistently outperforms traditional diffusion-based methods in terms of both likelihood (how probable the generated samples are) and sample quality. Moreover, FM enables fast and reliable sample generation using existing numerical Ordinary Differential Equation (ODE) solvers. For more information on FM, read the paper: Flow Matching for Generative Modeling. Stable Diffusion 3 Architecture Overview of Stable Diffusion 3’s architecture The architecture of Stable Diffusion 3 incorporates both text and image modalities, leveraging pretrained models to derive suitable representations for each. Here's a breakdown of the key components and mechanisms involved: General Setup SD3 follows the framework of Latent Diffusion Models (LDM) for training text-to-image models in the latent space of a pretrained autoencoder. Text conditioning is encoded using pretrained, frozen text models, similar to previous approaches. Multi-Modal Diffusion Transformer (MMDiT) SD3's architecture builds upon the DiT (Diffusion Transformer) architecture, which focuses on class conditional image generation. In SD3, embeddings of the timestep and text conditioning are used as inputs to the modulation mechanism, enabling conditional generation. To address the coarse-grained nature of pooled text representations, SD3 incorporates information from the sequence representation of text inputs. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis Sequence Construction SD3 constructs a sequence comprising embeddings of both text and image inputs. This sequence includes positional encodings and flattened patches of the latent pixel representation. After embedding and concatenating the patch encoding and text encoding to a common dimensionality, SD3 applies a sequence of modulated attention and Multi-Layer Perceptrons (MLPs). Weights of Each Modality Given the conceptual differences between text and image embeddings, SD3 employs separate sets of weights for each modality. While using two independent transformers for each modality, SD3 combines the sequences of both modalities for the attention operation, enabling both representations to work in their respective spaces while considering each other. Experiments on SD3 to Improve Performance Improving Rectified Flows by Reweighting Stable Diffusion 3 adopts a Rectified Flow (RF) formulation, connecting data and noise on a linear trajectory during training. 
This approach results in straighter inference paths, enabling sampling with fewer steps.  SD3 introduces a trajectory sampling schedule, assigning more weight to the middle parts of the trajectory to tackle more challenging prediction tasks. Comparative tests against 60 other diffusion trajectories, including LDM, EDM, and ADM, across multiple datasets, metrics, and sampler settings, demonstrate the consistent performance improvement of the re-weighted RF variant. Scaling Rectified Flow Transformer Models A scaling study is conducted for text-to-image synthesis using the reweighted Rectified Flow formulation and MMDiT backbone. Models ranging from 15 blocks with 450M parameters to 38 blocks with 8B parameters exhibit a smooth decrease in validation loss with increasing model size and training steps. Evaluation using automatic image-alignment metrics (GenEval) and human preference scores (ELO) demonstrates a strong correlation between these metrics and validation loss, suggesting the latter as a robust predictor of overall model performance. The scaling trend shows no signs of saturation, indicating potential for further performance improvement in the future. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis Flexible Text Encoders Stable Diffusion 3 optimizes memory usage by removing the memory-intensive 4.7B parameter T5 text encoder for inference, resulting in significantly reduced memory requirements with minimal performance loss. The removal of the text encoder does not impact visual aesthetics, with a win rate of 50%, but slightly reduces text adherence with a win rate of 46%. However, it is recommended to include T5 for full power in generating written text, as typography generation experiences larger performance drops without it, with a win rate of 38%. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. Capabilities of Stable Diffusion 3 (SD3) Though we know very little about the capabilities of stable diffusion 3, here is what we can interpret based on the sample results shared: Multi-Subject Prompt Handling In text-to-image generation, multi-subject prompts include detailed descriptions of scenes, compositions, or scenarios involving more than one object, person, or concept. These prompts provide rich and complex information for the model to generate corresponding images that accurately represent the described scene or scenario. Handling multi-subject prompts effectively requires the text-to-image model to understand and interpret the relationships between different subjects mentioned in the prompt to generate coherent and realistic images. Prompt A painting of an astronaut riding a pig wearing a tutu holding a pink umbrella, on the ground next to the pig is a robin bird wearing a top hat, and in the corner are the words "stable diffusion" SD3 Output Text Rendering SD3 works well in accurately rendering text within generated images, ensuring that textual elements such as fonts, styles, and sizes are represented properly. This capability enhances the integration of text-based descriptions into the generated imagery, contributing to a seamless and cohesive visual narrative. Prompt Graffiti on the wall with the text "When SD3?" SD3 Output Fine Detail Representation SD3 delivers superior image quality compared to previous models. This improvement ensures that the generated images are more detailed, realistic, and visually appealing. 
Prompt Studio photograph closeup of a chameleon over a black background SD3 Output Prompt Adherence SD3 demonstrates strong adherence to provided prompts, ensuring that the generated images accurately reflect the details and specifications outlined in the input text. This enhances the creation of desired visual content with minimal deviation from the intended concept or scene. Prompt Night photo of a sports car with the text "SD3" on the side, the car is on a race track at high speed, a huge road sign with the text "faster" SD3 Output Photorealism SD3 excels in producing images with high fidelity and photorealism, surpassing previous iterations in capturing fine details and textures. Its generated images closely resemble real-world photographs or hand-drawn artwork, imbuing them with a sense of authenticity. Prompt Fisheye lens photo where waves hit a lighthouse in Scotland, black waves. SD3 Output Performance of Stable Diffusion 3 Based on comprehensive evaluations comparing Stable Diffusion 3 with various open and closed-source text-to-image generation models, including SDXL, SDXL Turbo, Stable Cascade, Playground v2.5, Pixart-α, DALL·E 3, Midjourney v6, and Ideogram v1, SD3 emerges as a standout performer across multiple criteria. Human evaluators assessed output images from each model based on prompt following, typography quality, and visual aesthetics. In all these areas, Stable Diffusion 3 either matches or surpasses current state-of-the-art text-to-image generation systems. Comparison of baseline SD3 against other SOTA text-to-image generation models Even in early, unoptimized inference tests on consumer hardware, the largest SD3 model with 8B parameters demonstrates impressive performance, states Stability AI. It fits within the 24GB VRAM of an RTX 4090 and generates a 1024x1024 resolution image in just 34 seconds using 50 sampling steps. Stability AI also states that the initial release of Stable Diffusion 3 will offer multiple variations, ranging from 800 million to 8 billion parameter models, to ensure accessibility and eliminate hardware barriers for users. Click here to join the waitlist!   Comparative Performance Analysis: Stable Diffusion 3, Dalle-3, and Midjourney Here are the few experiments we carried out to compare the three popular text-to-image generation models based on the results shared by Stability AI. Text Generation Prompt Epic anime artwork of a wizard atop a mountain at night casting a cosmic spell into the dark sky that says "Stable Diffusion 3" made out of colorful energy Text Generation Output - Stable Diffusion 3 (SD 3) Text Generation Output - Dalle-3 Text Generation Output - Midjourney Multi-Subject Prompt Resting on the kitchen table is an embroidered cloth with the text 'good night' and an embroidered baby tiger. Next to the cloth, there is a lit candle. The lighting is dim and dramatic. Multi-Subject Text Prompt Output - Stable Diffusion 3 (SD 3) Multi-Subject Prompt Output - Dalle-3 Multi-Subject Prompt Output - Midjourney Text Stylization Prompt Photo of a 90's desktop computer on a work desk, on the computer screen it says "welcome". On the wall in the background we see beautiful graffiti with the text "SD3" very large on the wall. Text Stylization Prompt Output - Stable Diffusion 3 Dalle-3 Midjourney SD3: Responsible AI Practices As Stable Diffusion plans on releasing the model weights and training procedure as open source shortly, it commits to safe and responsible AI practices at every stage. 
From the model’s initial training to its testing, evaluation, and eventual release, Stability AI aims to prevent misuse of SD3 by bad actors. To uphold these standards, Stability AI has implemented various safeguards in preparation for the early preview of Stable Diffusion 3. These measures include continuous collaboration with researchers, experts, and the community to innovate further with integrity. Through this ongoing collaboration, Stability AI aims to ensure that its generative AI remains open, safe, and universally accessible. Potential Drawbacks The Stable Diffusion 3 models have made significant advancements, but they could still have some limitations. The paper doesn’t mention any limitations of the models, but here are some possible limitations that are common in text-to-image generation models: Fidelity and Realism Generated images may lack fidelity and realism compared to real-world photographs or hand-drawn artwork. Fine details and textures may not be accurately represented, resulting in images that appear artificial or "uncanny." For example, the image below lacks fine details like the shadow underneath the bus suggesting light coming from behind it, and the shadow of a building on the street indicating light coming from the left of the image. Ambiguity Text descriptions can sometimes be ambiguous or subjective, leading to varied interpretations by the model. This ambiguity can result in generated images that may not fully capture the intended scene or elements described in the text. Contextual Understanding Text-to-image models may struggle with understanding contextual nuances and cultural references, leading to inaccuracies or misinterpretations in the generated images. For example, understanding metaphors or abstract concepts described in the text may pose challenges for the model. Resource Intensiveness Training and running text-to-image generation models can be computationally intensive and require significant computational resources, including high-performance GPUs or TPUs. This limitation can impact the scalability and accessibility of these models for widespread use. TripoSR: 3D Object Generation from a Single Image Along with their SOTA text-to-image generation model, Stability AI also released TripoSR, a fast 3D object reconstruction model. TripoSR: Fast 3D Object Reconstruction from a Single Image TripoSR generates high-quality 3D models from a single image in under a second, making it incredibly fast and practical for various applications. Unlike other models, TripoSR operates efficiently even without a GPU, ensuring accessibility for a wide range of users. The model weights and source code are available for download under the MIT license, allowing for commercial, personal, and research use. For more information, read the official research paper available on arXiv: TripoSR: Fast 3D Object Reconstruction from a Single Image. Inspired by the Large Reconstruction Model For Single Image to 3D (LRM), TripoSR caters to the needs of professionals in entertainment, gaming, industrial design, and architecture. It offers responsive outputs for visualizing detailed 3D objects, creating detailed models in a fraction of the time of other models. Tested on an Nvidia A100, TripoSR generates draft-quality 3D outputs (textured meshes) in around 0.5 seconds, outperforming other open image-to-3D models like OpenLRM. For more information on Stable Diffusion 3, read the official research paper available on arXiv: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.
Stable Diffusion 3: Key Highlights Multimodal Diffusion Transformer Architecture: SD3's innovative architecture incorporates separate sets of weights for image and language representations, resulting in improved text understanding and spelling capabilities compared to previous versions. Superior Performance: In comparative evaluations, SD3 has demonstrated superior performance when compared to state-of-the-art text-to-image generation systems such as DALL·E 3, Midjourney v6, and Ideogram v1. Human preference evaluations have highlighted advancements in typography and prompt adherence, setting a new standard in this field. Scalability and Flexibility: SD3 offers models of varying sizes, ranging from 800 million to 8 billion parameters, to cater to different needs for scalability and image quality. This flexibility ensures that users can select models that best suit their creative requirements. Open-Source Models: SD3 offers different choices and improvements in creating images from text.  This openness fosters collaboration and innovation within the AI community while promoting transparency and accessibility in AI technologies.
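To make the rectified flow / flow matching objective described above more concrete, here is a minimal PyTorch sketch of the plain, unweighted training step: data and noise are connected by a straight line, and the network regresses the constant velocity along that line. The denoiser interface and tensor shapes are hypothetical, and SD3's timestep reweighting is deliberately omitted.

import torch
import torch.nn.functional as F

def rectified_flow_loss(denoiser, latents, text_emb):
    noise = torch.randn_like(latents)                       # Gaussian endpoint of the trajectory
    t = torch.rand(latents.size(0), device=latents.device)  # uniform timesteps in [0, 1]
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t_) * latents + t_ * noise                 # straight-line interpolation between data and noise
    target_velocity = noise - latents                       # constant velocity along the linear path
    pred_velocity = denoiser(x_t, t, text_emb)              # model predicts the velocity field
    return F.mse_loss(pred_velocity, target_velocity)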

March 5

10 min

sampleImage_performanceyolov9-vs-yolov8-custom-dataset
Comparative Analysis of YOLOv9 and YOLOv8 Using Custom Dataset on Encord Active

Even as foundation models gain popularity, advancements in object detection models remain significant. YOLO has consistently been the preferred choice in machine learning for object detection. Let’s train the latest iterations of the YOLO series, YOLOv9, and YOLOV8 on a custom dataset and compare their model performance.  In this blog, we will train YOLOv9 and YOLOv8 on the xView3 dataset. The xView3 dataset contains aerial imagery with annotations for maritime object detection, making it an ideal choice for evaluating the robustness and generalization capabilities of object detection models. If you wish to curate and annotate your own dataset for a direct comparison between the two models, you have the option to create the dataset using Encord Annotate. Once annotated, you can seamlessly follow the provided code to train and evaluate both YOLOv9 and YOLOv8 on your custom dataset. Read the Encord Annotate Documentation to get started with your annotation project.   Prerequisites We are going to run our experiment on Google Colab. So if you are doing it on your local system, please bear in mind that the instructions and the code was made to run on Colab Notebook. Make sure you have access to GPU. You can either run the command below or navigate to Edit → Notebook settings → Hardware accelerator, set it to GPU, and the click Save. !nvidia-smi To make it easier to manage datasets, images, and models we create a HOME constant. import os HOME = os.getcwd() print(HOME) Train YOLOv9 on Encord Dataset Install YOLOv9 !git clone https://github.com/SkalskiP/yolov9.git  %cd yolov9 !pip install -r requirements.txt -q !pip install -q roboflow encord av # This is a convenience class that holds the info about Encord projects and makes everything easier. # The class supports bounding boxes and polygons across both images, image groups, and videos. !wget 'https://gist.githubusercontent.com/frederik-encord/e3e469d4062a24589fcab4b816b0d6ec/raw/fa0bfb0f1c47db3497d281bd90dd2b8b471230d9/encord_to_roboflow_v1.py' -O encord_to_roboflow_v1.py Imports from typing import Literal from pathlib import Path from IPython.display import Image import roboflow from encord import EncordUserClient from encord_to_roboflow_v1 import ProjectConverter Data Preparation Set up access to the Encord platform by creating and using an SSH key. # Create ssh-key-path key_path = Path("../colab_key.pub") if not key_path.is_file(): !ssh-keygen -t ed25519 -f ../colab_key -N "" -q key_content = key_path.read_text() We will now retrieve the data from Encord, converting it to the format required by Yolo and storing it on disk. It's important to note that for larger projects, this process may encounter difficulties related to disk space. The converter will automatically split your dataset into training, validation, and testing sets based on the specified sizes.  # Directory for images data_path = Path("../data") data_path.mkdir(exist_ok=True) client = EncordUserClient.create_with_ssh_private_key( Path("../colab_key").resolve().read_text() ) project_hash = "9ca5fc34-d26f-450f-b657-89ccb4fe2027" # xView3 tiny encord_project = client.get_project(project_hash) converter = ProjectConverter( encord_project, data_path, ) dataset_yaml_file = converter.do_it(batch_size=500, splits={"train": 0.5, "val": 0.1, "test": 0.4}) encord_project_title = converter.title Download Model Weight We will download the YOLOv9-e and the gelan-c weights. 
Although the YOLOv9 paper mentions versions yolov9-s and yolov9-m, it's worth noting that weights for these models are currently unavailable in the YOLOv9 repository. !mkdir -p {HOME}/weights !wget -q https://github.com/WongKinYiu/yolov9/releases/download/v0.1/yolov9-e-converted.pt -O {HOME}/weights/yolov9-e.pt !wget -P {HOME}/weights -q https://github.com/WongKinYiu/yolov9/releases/download/v0.1/gelan-c.pt You can predict and evaluate the results of object detection with the  YOLOv9 weights pre-trained on COCO model.  Check out the blog YOLOv9 Explained and How to Run it if you want to run object detection on pre-trained YOLOv9 weights.  Train Custom YOLOv9 Model for Object Detection We train a custom YOLOv9 model from a pre-trained gelan-c model. !python train.py \ --batch 8 --epochs 20 --img 640 --device 0 --min-items 0 --close-mosaic 15 \ --data $dataset_yaml_file \ --weights {HOME}/weights/gelan-c.pt \ --cfg models/detect/gelan-c.yaml \ --hyp hyp.scratch-high.yaml You can examine and validate your training results. The code for validation and inference with the custom model is available on Colab Notebook. Here we will focus on comparing the model performances. Converting Custom YOLOv9 Model Predictions to Encord Active Format pth = converter.create_encord_json_predictions(get_latest_exp("detect") / "labels", Path.cwd().parent) print(f"Predictions exported to {pth}") Download the predictions on your local computer and upload them via the UI to Encord Active for analysis of your results. Moving on to training YOLOv8! Train YOLOv8 on Encord Dataset Install YOLOv8 !pip install ultralytics==8.0.196 from IPython import display display.clear_output() import ultralytics ultralytics.checks() Dataset Preparation As we are doing a comparative analysis of two models, we will use the same dataset to train YOLOv8. Train Custom YOLOv8 Model for Object Detection from ultralytics import YOLO model = YOLO('yolov8n.pt')  # load a pretrained YOLOv8n detection model model.train(data=dataset_yaml_file.as_posix(), epochs=20)  # train the model model.predict() The code for running inference on the test dataset is available on the Colab Notebook shared below. Converting Custom YOLOv8 Model Predictions to Encord Active Format pth = converter.create_encord_json_predictions(get_latest_exp("detect", ext="predict") / "labels", Path.cwd().parent) print(f"Predictions exported to {pth}") Download this JSON file and upload it to Encord Active via UI. Comparative Analysis on Encord Active On Encord Active under the tab Model Evaluation, you can compare both the model’s predictions.  You can conveniently navigate to the Model Summary tab to view the Mean Average Precision (mAP), Mean Average Recall (mAR), and F1 score for both models. Additionally, you can compare the differences in predictions between YOLOv8 and YOLOv9. Precision  YOLOv8 may excel in correctly identifying objects (high true positive count) but at the risk of also detecting objects that aren't present (high false positive count). On the other hand, YOLOv9 may be more conservative in its detections (lower false positive count) but could potentially miss some instances of objects (higher false negative count).  Recall In terms of recall, YOLOv8 exhibits superior performance with a higher true positive count (101) compared to YOLOv9 (43), indicating its ability to correctly identify more instances of objects present in the dataset. 
Both models, however, show an equal count of false positives (643), suggesting similar levels of incorrect identifications of non-existent objects. YOLOv8 demonstrates a lower false negative count (1261) compared to YOLOv9 (1315), implying that YOLOv8 misses fewer instances of actual objects, highlighting its advantage in recall performance. Precision-Recall Curve Based on the observed precision-recall curves, it appears that YOLOv8 achieves a higher Area Under the Curve (AUC-PR) value compared to YOLOv9. This indicates that YOLOv8 generally performs better in terms of both precision and recall across different threshold values, capturing a higher proportion of true positives while minimizing false positives more effectively than YOLOv9. Precision-Recall Curve is not the only metric to evaluate the performance of models. There are other metrics like F1 score, IOU distribution, etc.  For more information on different quality metrics, read the blog Data, Label, & Model Quality Metrics in Encord.   Metric Correlation The metric impact on performance in Encord refers to how specific metrics influence the performance of your model. Encord allows you to figure out which metrics have the most influence on your model's performance. This metric tells us whether a positive change in a metric will lead to a positive change (positive correlation) or a negative change (negative correlation) in model performance. The dimensions of the labeled objects significantly influence the performance of both models. This underscores the importance of the size of objects in the dataset. It's possible that the YOLOv9 model's performance is adversely affected by the presence of smaller objects in the dataset, leading to its comparatively poorer performance. Metric Performance The Metric Performance in model evaluation in Encord provides a detailed view of how a specific metric affects the performance of your model. It allows you to understand the relationship between a particular metric and the model's performance. In conclusion, the comparison between YOLOv8 and YOLOv9 on Encord Active highlights distinct performance characteristics in terms of precision and recall. While YOLOv8 excels in correctly identifying objects with a higher true positive count, it also exhibits a higher false positive count, indicating a potential for over-detection. On the other hand, YOLOv9 demonstrates a lower false positive count but may miss some instances of objects due to its higher false negative count. If you want to improve your object detection model, read the blog How to Analyze Failure Modes of Object Detection Models for Debugging for more information.   The precision-recall curve analysis suggests that YOLOv8 generally outperforms YOLOv9, capturing a higher proportion of true positives while minimizing false positives more effectively. However, it's important to consider other metrics like F1 score and IOU distribution for a comprehensive evaluation of model performance. Moreover, understanding the impact of labeled object dimensions and specific metric correlations can provide valuable insights into improving model performance on Encord Active.
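Before uploading predictions to Encord Active, it can also be useful to sanity-check the aggregate numbers locally. For the YOLOv8 model, one quick way is ultralytics' built-in validation on the same dataset_yaml_file used for training; the weights path below is the default run location and may need adjusting, and the attribute names reflect the ultralytics 8.x API.

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # adjust to your training run
metrics = model.val(data=dataset_yaml_file.as_posix())

print(f"mAP50-95:  {metrics.box.map:.3f}")   # mean AP averaged over IoU 0.50:0.95
print(f"mAP50:     {metrics.box.map50:.3f}") # mean AP at IoU 0.50
print(f"precision: {metrics.box.mp:.3f}  recall: {metrics.box.mr:.3f}")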

March 1

8 min

sampleImage_vlm-webinar-recording
Vision Language Models: Powering the next chapter in AI

Webinar Recording In this webinar, we delve into the rapidly evolving field of data annotation and explore the groundbreaking role of Vision Language Models (VLMs). As organizations seek more efficient, accurate, and scalable methods to process vast amounts of data, VLMs are emerging as a pivotal technology in automating and enhancing data annotation tasks. Here are the key resources from the webinar: [Guide] Guide to Vision-Language Models [Case Study] See how one customer increased mAP by 20% through reducing their dataset size by 35% with visual data curation

March 1

60 min

sampleImage_evaluating-model-performance
Validating Model Performance Using Encord Active

Model validation is a key machine learning (ML) lifecycle stage, ensuring models generalize well to new, unseen data. This process is critical for evaluating a model's predictions independently from its training dataset, thus testing its ability to perform reliably in the real world.  Model validation helps identify overfitting—where a model learns noise rather than the signal in its training data—and underfitting, where a model is too simplistic to capture complex data patterns. Both are detrimental to model performance. Techniques like the holdout method, cross-validation, and bootstrapping are pivotal in validating model performance, offering insights into how models might perform on unseen data. These methods are integral to deploying AI and machine learning models that are both reliable and accurate. This article delves into two parts: Key model validation techniques, the advantages of a data-centric approach, and how to select the most appropriate validation method for your project. How to validate a Mask R-CNN pre-trained model that segments instances in COVID-19 scans using Encord Active, a data-centric platform for evaluating and validating computer vision (CV) models. Ready to dive deeper into model validation and discover how Encord Active can enhance your ML projects? Let’s dive in! The Vital Role of a Data-Centric Approach in Model Validation A data-centric approach to model validation places importance on the quality of data in training and deploying computer vision (CV) and artificial intelligence (AI) models. The approach recognizes that the foundation of any robust AI system lies not in the complexity of its algorithms but in the quality of the data it learns from. High-quality, accurately labeled data (with ground truth) ensures that models can truly understand and interpret the nuances of the tasks they are designed to perform, from predictive analytics to real-time decision-making processes. Why Data Quality is Paramount The quality of training data is directly proportional to a model's ability to generalize from training to real-world applications. Poor data quality—including inaccuracies, biases, label errors, and incompleteness—leads to models that are unreliable, biased, or incapable of making accurate predictions. A data-centric approach prioritizes meticulous data preparation, including thorough data annotation, cleaning, and validation. This ensures the data distribution truly reflects the real world it aims to model and reduces label errors.  Improving Your Model’s Reliability Through Data Quality The reliability of CV models—and even more recently, foundation models—in critical applications—such as healthcare imaging and autonomous driving—cannot be overstated.  A data-centric approach mitigates the risks associated with model failure by ensuring the data has high fidelity. It involves rigorous validation checks and balances, using your expertise and automated data quality tools to continually improve your label quality and datasets. Adopt a data-centric approach to your AI project and unlock its potential by downloading our whitepaper.   Key Computer Vision Model Validation Techniques A data-centric approach is needed to validate computer vision models after model training that looks at more than just performance and generalizability. They also need to consider the unique problems of visual data, like how image quality, lighting, and perspectives can vary. 
Tailoring the common validation techniques specifically for computer vision is about robustly evaluating the model's ability to analyze visual information and embeddings across diverse scenarios: Out-of-Sample Validation: Essential for verifying that a CV model can generalize from its training data to new, unseen images or video streams. This approach tests the model's ability to handle variations in image quality, lighting, and subject positioning that it hasn't encountered during training. Cross-Validation and Stratified K-Fold: Particularly valuable in computer vision is ensuring that every aspect of the visual data is represented in both training and validation sets. Stratified K-Fold is beneficial when dealing with imbalanced datasets, common in computer vision tasks, to maintain an equal representation of classes across folds. Leave-One-Out Cross-Validation (LOOCV): While computationally intensive, LOOCV can be particularly insightful for small image datasets where every data point's inclusion is crucial for assessing the model's performance on highly nuanced visual tasks. Bootstrapping: Offers insights into the stability of model predictions across different visual contexts. This method helps understand how training data subset changes can affect the model's performance, which is particularly relevant for models expected to operate in highly variable visual environments. Adversarial Testing: Tests the model's resilience against slight, often invisible, image changes. This technique is critical to ensuring models are not easily perturbed by minor alterations that would not affect human perception. Domain-Specific Benchmarks: Participating in domain-specific challenges offered by ImageNet, COCO, or PASCAL VOC can be a reliable validation technique. These benchmarks provide standardized datasets and metrics, allowing for evaluating a model's performance against a wide range of visual tasks and conditions, ensuring it meets industry standards. Human-in-the-Loop: Involving domain experts in the validation process is invaluable, especially for tasks requiring fine-grained visual distinctions (e.g., medical imaging or facial recognition). This approach helps ensure that the model's interpretations align with human expertise and can handle the subtleties of real-world visual data. Ensuring a model can reliably interpret and analyze visual information across various conditions requires a careful balance between automated validation methods and human expertise.  Choosing the right validation techniques for CV models involves considering the dataset's diversity, the computational resources available, and the application's specific requirements. Luckily, there are model validation tools that can help you focus on validating the model. At the same time, they do the heavy lifting of providing the insights necessary to validate your CV model’s performance, including providing AI-assisted evaluation features.  But before walking through Encord Active, let’s understand the factors you need to consider for choosing the right tool. How to Choose the Right Computer Vision Model Validation Tool When choosing the right model validation tool for computer vision projects, several key factors come into play, each addressing the unique challenges and requirements of working with image data.  These considerations ensure that the selected tool accurately evaluates the model's performance and aligns with the project's specific demands. 
Here's a streamlined guide to making an informed choice: Data Specificity and Complexity: Opt for tools that cater to the variability and complexity inherent in image data. This means capabilities for handling image-specific metrics such as Intersection over Union (IoU) for object detection and Mean Absolute Error (MAE) for tasks like classification and segmentation are crucial. Robust Data Validation: The tool should adeptly manage image data peculiarities, including potential discrepancies between image annotations and the actual images. Look for features that support comprehensive data validation across various stages of the model development cycle, including pre-training checks and ongoing training validations. Comprehensive Evaluation Metrics: Essential for thoroughly assessing a computer vision model's performance. The tool should offer a wide array of metrics, including precision-recall curves, ROC curves, and confusion matrices for classification, alongside task-specific metrics like IoU for object detection. It should also support quality metrics for a more holistic, real-world evaluation. Versatile Performance Evaluation: It should support a broad spectrum of evaluation techniques for deep insights into accuracy, the balance between precision and recall, and the model’s ability to distinguish between different classes. Dataset Management: The validation tool should help with efficient dataset handling for proper training-validation splits. For the sake of performance and scale, it should be able to manage large datasets. Flexibility and Customization: The fast-paced nature of computer vision demands tools that allow for customization and flexibility. This includes introducing custom metrics, supporting various data types and model architectures, and adapting to specific preprocessing and integration needs. Considering those factors, you can select a validation tool (open-source toolkits, platforms, etc.) that meets your project's requirements and contributes to developing reliable models. Using Encord Active to Validate the Performance of Your Computer Vision Model Encord Active (EA) is a data-centric model validation solution that enables you to curate valuable data that can truly validate your model’s real-world generalizability through quality metrics. In this section, you will see how you can analyze the performance of a pre-trained Mask R-CNN object detection model with Encord Active on COVID-19 predictions. From the analysis results, you will be able to validate and, if necessary, debug your model's performance. This walkthrough uses  Encord Annotate to create a project and import the dataset. We use Encord Active Cloud to analyze the model’s failure modes. We recommend you sign up for an Encord account to follow this guide.   Import Predictions Import your predictions onto the platform. Learn how to import Predictions in the documentation. Select the Prediction Set you just uploaded, and Encord Active will use quality data, label, and model quality metrics to evaluate the performance of your model: Visualize Model Performance Summary on the Validation Set Evaluate the model’s performance by inspecting the Model Summary dashboard to get an overview of your model’s performance on the validation set with details error categorization (true positive vs. false positive vs. 
false negative), the F1 score, and mean average precision/recall based on a confidence (IoU) threshold: Manually Inspect the Model Results Beyond visualizing a summary of the model’s performance, using a tool that allows you to manually dig in and inspect how your model works on real-world samples is more than helpful. Encord Active provides an Explorer tab that enables you to filter models by metrics to observe the impact of metrics on real-world samples. EA’s data-centric build also lets you see how your model correctly or incorrectly makes predictions (detects, classifies, or segments) on the training, validation, and production samples.  Let’s see how you can achieve this: On the Model Summary dashboard, → Click True Positive Count metric to inspect the predictions your model got right: Click on one of the images using the expansion icon to see how well the model detects the class, the confidence score with which it predicts the object, other scores on performance metrics, and metadata. Still under the Explorer tab → Click on Overview (the tab on the right) → Click on False Positive Count to inspect instances that the model failed to detect correctly It seems most classes flagged as False Positives are due to poor object classification quality (the annotations are not 100% accurate). Let’s look closely at an instance: In that instance, the model correctly predicts that the object is ‘Cardiomediastinum’. Still, the second overlapping annotation has a broken track for some reason, so Encord Active classifies its prediction as false positive using a combination of Broken Object Track and other relevant quality metrics. Under Filter → Add filter, you will see parameters and attributes to filter your model’s performance. For example, if you added your validation set to Active through Annotate, you can validate your model’s performance on that set and, likewise, on the production set. Visualize the Impact of Metrics on Model Performance Evaluate the model outcome count to understand the distribution of the correct and incorrect results for each class. Under the Model Evaluation tab → Click on Outcome to see the distribution chart: Now, you should see the count for the number of predictions the model gets wrong. Using this chart, you can get a high-level perspective on the issues with your model. In this case, the model fails to segment the ‘Airways’ object in the instances correctly. The Intersection-of-Union (IoU) Threshold is 0.5, the threshold for the model’s confidence in its predictions. Use the IOU Threshold slider under the Overview tab to see the outcome count based on a higher or lower threshold. You can also select specific classes you want to inspect under the Classes option. Dig Deeper into the Metrics Once you understand the model outcome count, you can dig deeper into specific metrics like precision, recall, and F1 scores if they are relevant to your targets. Notice the low precision, recall, and F1 scores per class! Also, group the scores by the model outcome count to understand how the model performs in each class. You could also use the precision-recall curve to analyze and highlight the classes harder for the model to detect with high confidence. Also break down the model’s precision and recall values for the predictions of each object over the relevant metrics you want to investigate. 
For example, if you want to see the precision and recall by the Object Classification Quality metric, under Metric Performance → Select the Metric dropdown menu, and then the metric you want to investigate the model’s precision by: Validate the Model’s Performance on Business Criteria Now it’s time to see which metrics impact the model’s performance the most and determine, based on your information, whether that is good or bad (needs debugging) for the business. For instance, if the Confidence scores are among the worst-performing metrics, you might be worried that your vision model is naive in its predictions, given the previous consensus on the outcome count (false positives and negatives). Here is the case for this model under the Metric Performance dashboard (remember, you can use the IoU Threshold slider to check the metric impact at different confidence intervals): The Relative Area (the object's size) significantly influences our model’s performance. Considering the business environment in which you want to deploy the model, would this be a good or bad outcome? This is up to you to decide based on your technical and business requirements. If the model does not work, you can run more experiments and train more models until you find the optimal one. Awesome! You have seen how Encord Active plays a key role in providing features for validating your model’s performance with built-in metrics. In addition, it natively integrates with Encord Annotate, an annotation tool, to facilitate data quality improvements that can enhance the performance of your models. Conclusion Selecting the right model validation tools ensures that models perform accurately and efficiently. Validation involves assessing a model's performance through quantitative metrics such as IoU, mAP (mean Average Precision), and MAE (Mean Absolute Error), or qualitatively, through review by subject matter experts. The choice of evaluation metric should align with the business objectives the model aims to achieve. Furthermore, model selection hinges on comparing various models using these metrics within a carefully chosen evaluation schema, emphasizing the importance of a proper validation strategy to ensure robust model performance before deployment. Validating model performance is particularly vital in sectors where inaccurate predictions could compromise safety. Check out our customer stories to learn from large and small teams that have improved their data quality and model performance with the help of Encord. Platforms like Encord, which specialize in improving data and model quality, are instrumental in this context. Encord Active, among others, provides features designed to refine data quality and bolster model accuracy, mitigating the risks associated with erroneous predictions or data analysis.
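As a small companion to the IoU thresholds used throughout this walkthrough, here is a minimal sketch of Intersection over Union for two axis-aligned bounding boxes in (x1, y1, x2, y2) format; the example boxes are illustrative.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # overlap width (0 if the boxes are disjoint)
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))  # overlap height
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# a prediction vs. a ground-truth box; at a 0.5 threshold this pair would not count as a match
print(iou((10, 10, 60, 60), (20, 20, 70, 70)))  # ~0.47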

March 1

8 min

Qwen-VL and Qwen-VL-Chat: Introduction to Alibaba’s AI Models

Qwen-VL is a series of open-source large vision-language models (LVLMs), offering a potent combination of advanced capabilities and accessibility. As an open-source project, Qwen-VL not only democratizes access to cutting-edge AI technology but also positions itself as a formidable competitor to established models from tech giants, such as OpenAI’s GPT-4V and Google’s Gemini.

In the competitive landscape of LVLMs, Qwen-VL has quickly risen to the forefront, securing its place as a leader on the OpenVLM leaderboard. This leaderboard, which encompasses 38 different VLMs including GPT-4V, Gemini, Qwen-VL-Plus, LLaVA, and others, serves as a comprehensive benchmark for evaluating model performance across 13 distinct multimodal tasks.

OpenVLM Leaderboard

Qwen-VL's performance across these benchmarks underscores its versatility and robustness in handling various vision-language tasks with high accuracy and efficiency. By leading the charge on the OpenVLM leaderboard, Qwen-VL sets a new standard for excellence in the field, pushing the boundaries of what is possible with LVLMs and paving the way for future advancements in multimodal AI research.

Introduction to Large-scale Vision Language Models (LVLMs)

Large Language Models (LLMs) have attracted attention in recent years for their remarkable text generation and comprehension capabilities in the field of generative AI. However, their limitation to processing text alone has constrained their utility in many applications. In response, a new class of models known as Large Vision Language Models (LVLMs) has emerged, aiming to integrate visual data with textual information to address vision-centric tasks. LVLMs extend conventional LLMs with vision-language learning, broadening their applicability to image data. Despite their promising potential, however, open-source LVLM implementations still face hurdles such as inadequate training and optimization compared to proprietary models, and understanding visual content remains a significant challenge for existing LVLM frameworks.

Overview of Qwen-VL

The Qwen-VL series represents a significant advancement in Large Vision Language Models (LVLMs), designed to overcome the limitations of existing models and equip LLMs with visual processing capabilities. Built upon Alibaba Cloud’s 7-billion-parameter Qwen-7B language model, the Qwen-VL series introduces a visual receptor architecture comprising a language-aligned visual encoder and a position-aware adapter. This architecture enables Qwen-VL models to effectively process visual inputs, generate responses based on prompts, and perform various vision-language tasks such as image recognition, image captioning, visual question answering, and visual grounding. Qwen-VL models demonstrate leading performance on vision-centric benchmarks and support multiple languages, including English and Chinese.

For more information on VLMs, read the blog Guide to Vision-Language Models (VLMs)

Key Features of Qwen-VL

Qwen-VL models demonstrate strong accuracy on a wide range of vision-centric understanding benchmarks, surpassing other SOTA models of similar scale. They excel not only in conventional benchmarks such as captioning and question answering but also in recently introduced dialogue benchmarks. Here are the key features of Qwen-VL:

Multi-lingual Support: Similar to Qwen-LM, Qwen-VLs are trained on multilingual image-text data, with a substantial corpus in English and Chinese.
This enables Qwen-VLs to naturally support English, Chinese, and other multilingual instructions.

Multi-image Capability: During training, Qwen-VLs handle arbitrary interleaved image-text data as inputs, allowing them to compare, understand, and analyze context when multiple images are provided.

Fine-grained Visual Understanding: Qwen-VLs exhibit highly competitive fine-grained visual understanding, thanks to the higher-resolution input size and fine-grained corpus used during training. Compared to existing vision-language generalists, Qwen-VLs demonstrate superior performance in tasks such as grounding, text reading, text-oriented question answering, and fine-grained dialogue comprehension.

Vision-centric Understanding: This allows the model to comprehensively interpret and process visual information. With an architecture integrating a language-aligned visual encoder and a position-aware adapter, Qwen-VL excels in tasks like image captioning, question answering, and visual grounding. Its fine-grained analysis ensures precise interpretation of visual content, making Qwen-VL highly effective in vision-language tasks and real-world applications.

Design Structure of Qwen-VL

Beginning with the foundation of Qwen-LM, the model is enhanced with visual capacity through several key components:

Visual Receptor: Qwen-VL incorporates a carefully designed visual receptor, consisting of a visual encoder and an adapter. This component processes image inputs and extracts fixed-length sequences of image features.

Input-Output Interface: The model's input-output interface is optimized to differentiate between image and text feature inputs. Special tokens delineate image feature input, ensuring seamless integration of both modalities.

3-stage Training Pipeline: Qwen-VL employs a 3-stage training pipeline that progressively optimizes the model's parameters and its ability to comprehend and generate responses for both text and image inputs.

Multilingual Multimodal Cleaned Corpus: Qwen-VL is trained on a diverse, cleaned multilingual multimodal corpus encompassing both textual and visual information. This corpus enables the model to understand and generate responses in multiple languages while effectively processing various types of visual content.

Model Architecture of Qwen-VL

The architecture of Qwen-VL comprises three key components, each contributing to the model's robustness in processing both text and visual inputs.

Large Language Model

Qwen-VL leverages a large language model as its foundational component. It is initialized with pre-trained weights from Qwen-7B, ensuring a strong linguistic foundation for the model's language processing capabilities.

Visual Encoder

Qwen-VL employs the Vision Transformer (ViT) architecture, using pre-trained weights from OpenCLIP's ViT-bigG. During both training and inference, input images are resized to a fixed resolution. The visual encoder processes these images by dividing them into patches with a stride of 14, generating a set of image features that encapsulate the visual information.

Position-aware Vision-Language Adapter

To address efficiency concerns arising from long sequences of image features, Qwen-VL introduces a vision-language adapter.
This adapter is designed to compress the image features and improve computational efficiency. It consists of a randomly initialized single-layer cross-attention module that uses a group of trainable embeddings as query vectors and the image features from the visual encoder as keys. Through this mechanism, the visual feature sequence is compressed to a fixed length of 256. To preserve the positional information that is crucial for fine-grained image comprehension, 2D absolute positional encodings are incorporated into the query-key pairs of the cross-attention mechanism, so positional details are retained during compression (a minimal illustrative sketch of this mechanism appears at the end of this article). The compressed image feature sequence of length 256 is then fed into the large language model, enabling Qwen-VL to process both textual and visual inputs and perform a wide range of vision-language tasks with high accuracy and efficiency.

Training Pipeline of the Qwen-VL series

For more information, read the official paper released on arXiv: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond.

Performance of Qwen-VL against State-of-the-Art LVLMs

The Qwen-VL models, particularly Qwen-VL-Max, match or surpass SOTA models such as Gemini Ultra and GPT-4V on a range of text-image multimodal tasks. These upgraded models achieve results comparable to Gemini Ultra and GPT-4V while significantly outperforming the previous best results from open-source models, including the open-source version of Qwen-VL.

Performance of Qwen-VL-Plus and Qwen-VL-Max against other LVLMs

In particular, Qwen-VL-Max demonstrates superior performance over OpenAI's GPT-4V and Google's Gemini in tasks related to Chinese question answering and Chinese text comprehension. This achievement highlights the advanced capabilities of Qwen-VL-Max and its potential to establish new benchmarks in multimodal AI research and application. It should also be noted that most SOTA models are not extensively trained on Chinese-language data.

Capabilities of Qwen-VL

Qwen-VL exhibits a diverse range of capabilities that enable it to comprehend and interact with visual and textual information, as well as reason and learn from its environment. These capabilities include:

Basic Recognition Capabilities

Qwen-VL demonstrates strong basic recognition capabilities, accurately identifying and describing various elements within images, including common objects, celebrities, landmarks, and intricate details.

Recognition capabilities of Qwen-VL

Visual Agent Capability

As a visual agent, Qwen-VL is capable of providing detailed background information, answering questions, and analyzing complex visual content. It can also compose poetry in multiple languages inspired by visual stimuli and analyze everyday screenshots.

Visual Agent Capabilities of Qwen-VL

Visual Reasoning Capability

Qwen-VL possesses advanced visual reasoning capabilities, extending beyond content description to comprehending and interpreting intricate representations such as flowcharts, diagrams, and other symbolic systems. It excels in problem-solving and reasoning tasks, including mathematical problem-solving and nuanced interpretation of charts and graphs.

Qwen-VL has advanced visual reasoning capabilities
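You can probe these capabilities yourself: the released Qwen-VL-Chat checkpoint exposes a chat-style interface through Hugging Face Transformers. The sketch below follows the usage pattern published in the Qwen-VL repository; the chat and from_list_format helpers are loaded via trust_remote_code and may change between releases, and the image path is purely illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Interleaved image-text input: the tokenizer wraps the image reference
# in the special tokens the model expects.
query = tokenizer.from_list_format([
    {"image": "chart.png"},  # hypothetical local image
    {"text": "Summarize the trend shown in this chart."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

# A follow-up turn reusing the dialogue history (multi-round QA).
response, history = model.chat(tokenizer, "Which category grows fastest?", history=history)
print(response)
```

Interleaved image-text inputs like this are what power the multi-image and dialogue capabilities described in this article.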
Text Information Recognition and Processing

Qwen-VL exhibits enhanced text information recognition and processing abilities, efficiently extracting information from tables and documents, reformatting it to meet customized output requirements, and effectively identifying and converting dense text. It also supports images with extreme aspect ratios, ensuring flexibility in processing diverse visual content.

Advanced text information recognition and processing abilities of Qwen-VL

Few-shot Learning on Vision-Language Tasks

Qwen-VL demonstrates strong in-context (few-shot) learning ability, achieving superior performance on vision-language tasks such as question answering and image captioning compared to models with similar numbers of parameters. Its performance rivals even larger models, showcasing its adaptability and efficiency in learning from limited data.

For more information on few-shot learning, read the blog Few Shot Learning in Computer Vision: Approaches & Uses

Qwen-VL Availability

Qwen-VL, including Qwen-VL-Plus and Qwen-VL-Max, is readily accessible through various platforms, offering researchers and developers convenient access to its capabilities:

Hugging Face: Users can access Qwen-VL-Plus and Qwen-VL-Max through Hugging Face Spaces and the Qwen website, enabling seamless integration into their projects and workflows.

DashScope APIs: The APIs of Qwen-VL-Plus and Qwen-VL-Max are available through the DashScope platform, giving developers the flexibility to leverage the models in their AI applications. Detailed documentation and quick-start guides are available on the DashScope platform for easy integration.

QianWen Web Portal: By logging into the Tongyi QianWen web portal and switching to "Image Understanding" mode, users can harness the latest Qwen-VL-Max capabilities for image understanding tasks. This mode offers additional functionality tailored specifically to image processing and understanding.

ModelScope: The Qwen-VL-Chat demo is available on ModelScope.

GitHub Repository: The code and model weights of both Qwen-VL and Qwen-VL-Chat are openly available to download on GitHub, allowing researchers and developers to explore, modify, and utilize them freely (see the usage sketch earlier in this article). Commercial use of these resources is permitted, enabling their integration into commercial projects and applications.

Qwen-VL-Chat

Qwen-VL-Chat, as a generalist multimodal LLM-based AI assistant, supports complex interactions, including multiple image inputs, multi-round question answering, and creative tasks. Unlike traditional vision-language chatbots, Qwen-VL-Chat's alignment techniques enable it to comprehend and respond to complex visual and textual inputs with superior accuracy and flexibility. Here's how Qwen-VL-Chat stands out in real-world dialog benchmarks and compares with existing models:

Qwen-VL-Chat vs. Vision-Language Chat

Performance of Qwen-VL against other generalist models across various tasks

Qwen-VL-Chat's advanced capabilities are evaluated using the TouchStone benchmark, which assesses overall text-image dialogue capability and alignment with humans. Unlike conventional models such as ChatGPT or Bard, Qwen-VL-Chat excels in handling direct image input, thanks to fine-grained image annotations provided by human labeling.
Covering 300+ images, 800+ questions, and 27 categories, including attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, and math problem solving, Qwen-VL-Chat achieves superior performance in understanding and responding to complex visual and textual inputs. You can find the official tutorial for running Qwen-VL-Chat yourself on GitHub.

Real-world Dialog Benchmarks

Qwen-VL-Chat's strong results on other multimodal benchmarks, such as the MME Benchmark and SEED-Bench, show that its performance extends beyond the TouchStone benchmark. Qwen-VL-Chat obtains state-of-the-art scores on both the perception and cognition tracks of the MME Benchmark, an extensive evaluation of multimodal large language models. The Qwen series, which includes Qwen-VL-Chat, also achieves state-of-the-art performance on SEED-Bench, a benchmark consisting of 19K multiple-choice questions with precise human annotations.

Qwen-VL: What’s Next?

The release of the Qwen-VL series represents a significant stride forward in large-scale multilingual vision-language models, with the goal of advancing multimodal research. Qwen-VL has demonstrated its strength against comparable AI models across various benchmarks, supporting complex multilingual conversations, multi-image interleaved conversations, grounding in Chinese, and fine-grained recognition. Looking ahead, the focus is on further enhancing Qwen-VL's capabilities along several key dimensions:

Additional Modalities

The team plans to integrate Qwen-VL with more modalities, including speech and video. By expanding its scope to encompass these inputs, Qwen-VL will improve its ability to understand and generate content across a wider range of signals.

Multi-modal Generation

The model will be further developed to excel in multi-modal generation, particularly in generating high-fidelity images and fluent speech. Enhancing its ability to generate content across multiple modalities with high fidelity and fluency will advance the state of the art in multimodal AI systems.

Scaling Model Size and Training Data

Efforts are underway to scale up Qwen-VL's model size, training data, and input resolution. This aims to enable Qwen-VL to handle more complex and intricate relationships within multimodal data, leading to more nuanced and comprehensive understanding and generation of content.
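Finally, for readers who want a more concrete picture of the position-aware vision-language adapter described in the architecture section, here is a minimal PyTorch-style sketch of cross-attention with a fixed set of learnable queries and additive 2D positional encodings. It illustrates the idea only; it is not the released implementation, and all dimensions and names are assumptions:

```python
import torch
import torch.nn as nn


class PositionAwareAdapter(nn.Module):
    """Illustrative sketch: compress a variable-length sequence of image
    features into a fixed number of tokens with single-layer cross-attention.
    Not the official Qwen-VL code; shapes and names are assumptions."""

    def __init__(self, num_queries=256, feat_dim=1664, llm_dim=4096, grid=16):
        super().__init__()
        # Trainable query embeddings act as the fixed-length "summary" slots.
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)
        # 2D absolute positional encodings for the image-feature grid (the paper
        # incorporates them into the query-key pairs; here they are simply
        # added to the keys for brevity).
        self.pos_embed = nn.Parameter(torch.randn(grid * grid, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(feat_dim, llm_dim)  # map to the LLM's hidden size

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, feat_dim) from the ViT encoder
        batch = image_feats.size(0)
        keys = image_feats + self.pos_embed[: image_feats.size(1)]
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.cross_attn(queries, keys, keys)  # (batch, 256, feat_dim)
        return self.proj(compressed)  # fed into the large language model


adapter = PositionAwareAdapter()
print(adapter(torch.randn(1, 256, 1664)).shape)  # torch.Size([1, 256, 4096])
```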

February 29

8 min

Mistral Large Explained

Mistral AI made headlines with the release of Mistral 7B, an open-source model competing with offerings from tech giants like OpenAI and Meta and surpassing several state-of-the-art large language models such as LLaMA 2. Now, in collaboration with Microsoft, the French AI startup introduces Mistral Large, marking a significant advancement in language model development and distribution.

What Is Mistral Large?

Mistral Large, developed by Mistral AI, is an advanced language model known for robust reasoning capabilities tailored to intricate multilingual tasks. Fluent in English, French, Spanish, German, and Italian, it exhibits a nuanced grasp of each language. With a 32K-token context window, Mistral Large supports precise information retrieval from extensive documents, facilitating accurate and contextually relevant text generation. Through retrieval augmented generation (RAG), it can access facts from external knowledge bases, enhancing comprehension and precision. Mistral Large also excels in instruction following and function calling, enabling tailored moderation policies and application development. Its performance in coding, mathematical, and reasoning tasks makes it a notable solution in natural language processing.

Key Attributes of Mistral Large

Reasoning Capabilities: Mistral Large showcases powerful reasoning capabilities, enabling it to excel in complex multilingual reasoning tasks. It stands out for its ability to understand, transform, and generate text with exceptional precision.

Native Multilingual Proficiency: With native fluency in English, French, Spanish, German, and Italian, Mistral Large demonstrates a nuanced understanding of grammar and cultural context across multiple languages.

Enhanced Contextual Understanding: Featuring a 32K-token context window, Mistral Large offers precise information recall from large documents, facilitating accurate and contextually relevant text generation. Unlike Mistral 7B, the open-source LLM that provided stiff competition to state-of-the-art (SOTA) large language models, Mistral Large is equipped with retrieval augmented generation (RAG). This enables the LLM to retrieve facts from an external knowledge base, grounding its understanding and improving the accuracy and contextuality of its text generation.

Instruction Following: Mistral Large's instruction-following capabilities allow developers to design customized moderation policies and system-level moderation, exemplified by its use in moderating platforms like le Chat.

Function Calling: Mistral Large can call functions directly, making it easier to build and update applications and modernize tech stacks at scale. Combined with a constrained output mode, this lets developers add advanced features and build smoother interactions with little friction (a hedged request sketch appears near the end of this article).

For more information, read the blog What is Retrieval Augmented Generation (RAG)?

Performance Benchmarks

The performance of Mistral Large is compared on various tasks against other state-of-the-art LLMs that are commonly used as benchmarks.

Reasoning and Knowledge

These benchmarks assess various aspects of language understanding and reasoning, including massive multitask language understanding (MMLU), completing tasks with limited information (e.g., 5-shot and 10-shot scenarios), and answering questions from different datasets (e.g., TriviaQA and TruthfulQA).
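As a quick illustration of what "5-shot" means in these benchmark names, here is a minimal sketch (plain Python, with made-up exemplars) of how a k-shot prompt is assembled before being sent to a model:

```python
# Illustrative only: assembling a k-shot (here 3-shot) prompt in the style
# used for benchmark evaluation. The exemplars are made up.
EXEMPLARS = [
    ("What is the capital of France?", "Paris"),
    ("2 + 2 = ?", "4"),
    ("Who wrote 'Les Miserables'?", "Victor Hugo"),
]


def build_k_shot_prompt(question: str, k: int = 3) -> str:
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in EXEMPLARS[:k])
    return f"{shots}\n\nQ: {question}\nA:"


print(build_k_shot_prompt("What is the boiling point of water at sea level?"))
```

The more exemplars (shots) a benchmark provides, the more in-context guidance the model receives before answering.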
Multi-lingual Capacities

The multilingual capability of Mistral Large is benchmarked on HellaSwag, Arc Challenge, and MMLU across French, German, Spanish, and Italian, and its performance is compared to Mistral 7B and LLaMA 2. Notably, Mistral Large hasn't been tested against the GPT series or Gemini, as those language models have not disclosed their performance metrics on these four languages.

To learn more about Mistral 7B, read the blog Mistral 7B: Mistral AI's Open Source Model.

Maths and Coding

Mistral Large excels across coding and math benchmarks, showcasing strong problem-solving abilities. With high pass rates on HumanEval and MBPP, it demonstrates strong code-generation ability. Achieving strong accuracy with majority voting on the MATH benchmark and maintaining accuracy in few-shot settings on GSM8K, Mistral Large proves its effectiveness across diverse mathematical and coding challenges.

Comparison of Mistral Large with other SOTA Models

Mistral Large demonstrates impressive performance on widely recognized benchmarks, securing its position as the second-ranked model generally available via API, just behind GPT-4. Detailed comparisons against other state-of-the-art (SOTA) models like Claude 2, Gemini Pro 1.0, GPT-3.5, and LLaMA 2 70B are provided on benchmarks such as MMLU (Measuring Massive Multitask Language Understanding), showcasing Mistral Large's competitive edge and advanced capabilities in natural language processing tasks.

Mistral Large: Platform Availability

La Plateforme

Hosted securely on Mistral's infrastructure in Europe, La Plateforme offers developers access to a comprehensive array of models for developing applications and services, with a wide range of tools and resources to support different use cases.

Le Chat

Le Chat serves as a conversational interface for interacting with Mistral AI's models, providing users with a pedagogical and enjoyable way to explore the company's technology. It can utilize Mistral Large or Mistral Small, as well as a prototype model called Mistral Next, offering brief and concise interactions.

Microsoft Azure

Mistral AI has announced its partnership with Microsoft and made Mistral Large available in Azure AI Studio, providing users with a user-friendly experience similar to Mistral's APIs. Beta customers have already experienced notable success using Mistral Large on the Azure platform, benefiting from its advanced features and robust performance.

Self-deployment

For sensitive use cases, Mistral Large can be deployed directly into the user's environment, granting access to model weights for enhanced control and customization.

Mistral Large on Microsoft Azure

Mistral Large is set to benefit significantly from Microsoft's multi-year partnership with Mistral AI in three key areas:

Supercomputing Infrastructure: Microsoft Azure will provide Mistral AI with supercomputing infrastructure tailored for AI training and inference workloads, ensuring best-in-class performance and scalability for Mistral AI's flagship models like Mistral Large. This infrastructure will enable Mistral AI to handle complex AI tasks efficiently and effectively.

Scale to Market: Through Models as a Service (MaaS) in Azure AI Studio and the Azure Machine Learning model catalog, Mistral AI's premium models, including Mistral Large, will be made available to customers.
This platform offers a diverse selection of both open-source and commercial models, providing users with access to cutting-edge AI capabilities. Additionally, customers can use the Microsoft Azure Consumption Commitment (MACC) to purchase Mistral AI's models, enhancing accessibility and affordability for users worldwide.

AI Research and Development: Microsoft and Mistral AI will collaborate on AI research and development initiatives, including exploring the training of purpose-specific models for select customers. This collaboration extends to European public sector workloads, highlighting the potential for Mistral Large and other models to address specific customer needs and industry requirements effectively.

Mistral Small

Mistral Small, introduced alongside Mistral Large, is a new optimized model specifically designed to prioritize low latency and cost-effectiveness. It surpasses Mixtral 8x7B, the sparse mixture-of-experts network, in performance while offering lower latency, positioning it as a refined intermediary between Mistral's open-weight offering and its flagship model. Mistral Small inherits the same features as Mistral Large, including RAG-enablement and function calling, ensuring consistent capabilities across both models.

To streamline its endpoint offering, Mistral is introducing two main categories:

Open-weight Endpoints: These endpoints, named open-mistral-7b and open-mixtral-8x7b, offer competitive pricing and provide access to Mistral's models with open weights, catering to users seeking cost-effective solutions.

New Optimized Model Endpoints: Mistral is introducing new optimized model endpoints, namely mistral-small-2402 and mistral-large-2402, designed for use cases requiring optimized performance and cost efficiency. The existing mistral-medium endpoint will be maintained without updates at this time.

To learn more about Mistral AI's models and how to access them, read the documentation.

Mistral Large: What’s Next?

Multi-currency Pricing: Moving forward, Mistral AI is introducing multi-currency pricing for organizational management, giving users the flexibility to transact in their preferred currency. This enhancement aims to streamline payment processes and improve accessibility for users worldwide.

Reduced Endpoint Latency: Mistral AI states that it is working to reduce the latency of all of its endpoints. This improvement ensures faster response times, enabling smoother interactions and improved efficiency across various applications.

La Plateforme Service Tier Updates: Mistral AI has also updated the service tiers on La Plateforme. These updates aim to improve performance, reliability, and user satisfaction for those building projects and applications on Mistral AI's platform.
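To tie together the endpoint names above and the function-calling capability discussed earlier, here is a hedged sketch of a chat-completions request against Mistral's REST API. The endpoint path and field names follow Mistral's public documentation at the time of writing and should be verified against the current docs; the tool definition and API key handling are placeholders:

```python
import os

import requests

payload = {
    # Swap in "mistral-small-2402", "open-mistral-7b", or "open-mixtral-8x7b"
    # to target the other endpoints described above.
    "model": "mistral-large-2402",
    "messages": [
        {"role": "user", "content": "What is the weather in Paris today?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

response = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json=payload,
    timeout=30,
)
# When the model decides to call the tool, the reply contains the function
# name and JSON-formatted arguments instead of plain text.
print(response.json()["choices"][0]["message"])
```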

February 28

5 min


Software To Help You Turn Your Data Into AI

Forget fragmented workflows, annotation tools, and Notebooks for building AI applications. Encord Data Engine accelerates every step of taking your model into production.