Encord Blog

Featured Blog

How To Fine-Tune Segment Anything

Computer vision is having its ChatGPT moment with the release of the Segment Anything Model (SAM) by Meta last week. Trained on over 1 billion segmentation masks across 11 million images, SAM is a foundation model for predictive AI use cases rather than generative AI. While it has shown an incredible amount of flexibility in its ability to segment across wide-ranging image modalities and problem spaces, it was released without "fine-tuning" functionality.

This tutorial outlines the key steps to fine-tune SAM via the mask decoder, in particular describing which functions from SAM to use to pre/post-process the data so that it's in good shape for fine-tuning.

What is the Segment Anything Model (SAM)?

The Segment Anything Model (SAM) is a segmentation model developed by Meta AI. It is considered the first foundation model for computer vision. SAM was trained on a huge corpus of data containing millions of images and over a billion masks, making it extremely powerful. As its name suggests, SAM is able to produce accurate segmentation masks for a wide variety of images.

SAM's design allows it to take human prompts into account, making it particularly powerful for human-in-the-loop annotation. These prompts can be multi-modal: points on the area to be segmented, a bounding box around the object to be segmented, or a text prompt about what should be segmented.

The model is structured into three components: an image encoder, a prompt encoder, and a mask decoder. The image encoder generates an embedding for the image being segmented, whilst the prompt encoder generates an embedding for the prompts. The image encoder is a particularly large component of the model. This is in contrast to the lightweight mask decoder, which predicts segmentation masks based on the embeddings. Meta AI has made the weights and biases of the model trained on the Segment Anything 1 Billion Mask (SA-1B) dataset available as a model checkpoint.

Learn more about how Segment Anything works in our explainer blog post Segment Anything Model (SAM) Explained.

What is Model Fine-Tuning?

Publicly available state-of-the-art models have a custom architecture and are typically supplied with pre-trained model weights. If these architectures were supplied without weights, the models would need to be trained from scratch by the users, who would need massive datasets to obtain state-of-the-art performance.

Model fine-tuning is the process of taking a pre-trained model (architecture + weights) and showing it data for a particular use case. This will typically be data that the model hasn't seen before, or that is underrepresented in its original training dataset.

The difference between fine-tuning the model and starting from scratch is the starting value of the weights and biases. If we were training from scratch, these would be randomly initialized according to some strategy. In such a starting configuration, the model would 'know nothing' of the task at hand and perform poorly. By using pre-existing weights and biases as a starting point, we can 'fine-tune' the weights and biases so that our model works better on our custom dataset. For example, the information learned to recognize cats (edge detection, counting paws) will be useful for recognizing dogs.

Why Would I Fine-Tune a Model?
The purpose of fine-tuning a model is to obtain higher performance on data that the pre-trained model has not seen before. For example, an image segmentation model trained on a broad corpus of data gathered from phone cameras will have mostly seen images from a horizontal perspective. If we tried to use this model for satellite imagery taken from a vertical perspective, it may not perform as well. If we were trying to segment rooftops, the model may not yield the best results. The pre-training is useful because the model will have learned how to segment objects in general, so we want to take advantage of this starting point to build a model that can accurately segment rooftops. Furthermore, it is likely that our custom dataset would not have millions of examples, so we want to fine-tune instead of training the model from scratch. Fine-tuning is desirable so that we can obtain better performance on our specific use case, without having to incur the computational cost of training a model from scratch.

How to Fine-Tune Segment Anything Model [With Code]

Background & Architecture

We gave an overview of the SAM architecture in the introduction section. The image encoder has a complex architecture with many parameters. In order to fine-tune the model, it makes sense for us to focus on the mask decoder, which is lightweight and therefore easier, faster, and more memory efficient to fine-tune.

In order to fine-tune SAM, we need to extract the underlying pieces of its architecture (image and prompt encoders, mask decoder). We cannot use SamPredictor.predict (link) for two reasons:

We want to fine-tune only the mask decoder.
This function calls SamPredictor.predict_torch, which has the @torch.no_grad() decorator (link) and therefore prevents us from computing gradients.

Thus, we need to examine the SamPredictor.predict function and call the appropriate functions with gradient calculation enabled on the part we want to fine-tune (the mask decoder). Doing this is also a good way to learn more about how SAM works.

Creating a Custom Dataset

We need three things to fine-tune our model:

Images on which to draw segmentations
Segmentation ground truth masks
Prompts to feed into the model

We chose the stamp verification dataset (link) since it has data that SAM may not have seen in its training (i.e., stamps on documents). We can verify that it performs well, but not perfectly, on this dataset by running inference with the pre-trained weights. The ground truth masks are also extremely precise, which will allow us to calculate accurate losses. Finally, this dataset contains bounding boxes around the segmentation masks, which we can use as prompts to SAM. An example image is shown below. These bounding boxes align well with the workflow that a human annotator would go through when looking to generate segmentations.

Input Data Preprocessing

We need to preprocess the scans from numpy arrays to pytorch tensors. To do this, we can follow what happens inside SamPredictor.set_image (link) and SamPredictor.set_torch_image (link), which preprocess the image. First, we can use utils.transform.ResizeLongestSide to resize the image, as this is the transformer used inside the predictor (link). We can then convert the image to a pytorch tensor and use the SAM preprocess method (link) to finish preprocessing.
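To make this preprocessing concrete, here is a minimal sketch. It assumes sam_model has already been loaded (see Training Setup below), image is an HxWx3 RGB numpy array, and device is your torch device; the helpers used are the SAM utilities referenced above.

import torch
from segment_anything.utils.transforms import ResizeLongestSide

transform = ResizeLongestSide(sam_model.image_encoder.img_size)
input_image = transform.apply_image(image)  # resize so the longest side matches the encoder's expected input
input_image_torch = torch.as_tensor(input_image, device=device)
# HWC -> CHW and add a batch dimension: 1 x 3 x H x W
transformed_image = input_image_torch.permute(2, 0, 1).contiguous()[None, :, :, :]
input_image = sam_model.preprocess(transformed_image)  # normalize pixel values and pad to a square input
original_image_size = image.shape[:2]
input_size = tuple(transformed_image.shape[-2:])

The input_size and original_image_size values are the ones passed to Sam.postprocess_masks later in the training loop.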
Training Setup

We download the model checkpoint for the vit_b model and load it in:

sam_model = sam_model_registry['vit_b'](checkpoint='sam_vit_b_01ec64.pth')

We can set up an Adam optimizer with defaults and specify that the parameters to tune are those of the mask decoder:

optimizer = torch.optim.Adam(sam_model.mask_decoder.parameters())

At the same time, we can set up our loss function, for example Mean Squared Error:

loss_fn = torch.nn.MSELoss()

Training Loop

In the main training loop, we will be iterating through our data items, generating masks, and comparing them to our ground truth masks so that we can optimize the model parameters based on the loss function.

In this example, we used a GPU for training since it is much faster than using a CPU. It is important to use .to(device) on the appropriate tensors to make sure that we don't have certain tensors on the CPU and others on the GPU.

We want to embed images by wrapping the encoder in the torch.no_grad() context manager; we are not looking to fine-tune the image encoder, and running it with gradients enabled would cause memory issues.

with torch.no_grad():
    image_embedding = sam_model.image_encoder(input_image)

We can also generate the prompt embeddings within the no_grad context manager. We use our bounding box coordinates, converted to pytorch tensors.

with torch.no_grad():
    sparse_embeddings, dense_embeddings = sam_model.prompt_encoder(
        points=None,
        boxes=box_torch,
        masks=None,
    )

Finally, we can generate the masks. Note that here we are in single mask generation mode (in contrast to the 3 masks that are normally output).

low_res_masks, iou_predictions = sam_model.mask_decoder(
    image_embeddings=image_embedding,
    image_pe=sam_model.prompt_encoder.get_dense_pe(),
    sparse_prompt_embeddings=sparse_embeddings,
    dense_prompt_embeddings=dense_embeddings,
    multimask_output=False,
)

The final step here is to upscale the masks back to the original image size, since they are low resolution. We can use Sam.postprocess_masks to achieve this. We will also want to generate binary masks from the predicted masks so that we can compare these to our ground truths. It is important to use torch functionals in order to not break backpropagation.

upscaled_masks = sam_model.postprocess_masks(low_res_masks, input_size, original_image_size).to(device)

from torch.nn.functional import threshold, normalize
binary_mask = normalize(threshold(upscaled_masks, 0.0, 0)).to(device)

Finally, we can calculate the loss and run an optimization step:

loss = loss_fn(binary_mask, gt_binary_mask)
optimizer.zero_grad()
loss.backward()
optimizer.step()

By repeating this over a number of epochs and batches, we can fine-tune the SAM decoder.

Saving Checkpoints and Starting a Model from It

Once we are done with training and satisfied with the performance uplift, we can save the state dict of the tuned model using:

torch.save(model.state_dict(), PATH)

We can then load this state dict when we want to perform inference on data that is similar to the data we used to fine-tune the model.

You can find the Colab Notebook with all the code you need to fine-tune SAM here. Keep reading if you want a fully working solution out of the box!

Fine-Tuning for Downstream Applications

While SAM does not currently offer fine-tuning out of the box, we are building a custom fine-tuner integrated with the Encord platform. As shown in this post, we fine-tune the decoder in order to achieve this.
This is available as an out-of-the-box one-click procedure in the web app, where the hyperparameters are automatically set.

Original vanilla SAM mask:

Mask generated by the fine-tuned version of the model:

We can see that this mask is tighter than the original mask. This was the result of fine-tuning on a small subset of images from the stamp verification dataset, and then running the tuned model on a previously unseen example. With further training and more examples, we could obtain even better results.

Conclusion

That's all, folks! You have now learned how to fine-tune the Segment Anything Model (SAM). If you're looking to fine-tune SAM out of the box, you might also be interested to learn that we have recently released the Segment Anything Model in Encord, allowing you to fine-tune the model without writing any code.
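As a complement to the checkpoint-saving step above, here is a minimal sketch of reloading the tuned weights for inference; the checkpoint filename and the use of SamPredictor are assumptions for illustration.

import torch
from segment_anything import sam_model_registry, SamPredictor

# Rebuild the architecture from the original SAM checkpoint, then overwrite it with the fine-tuned weights.
tuned_model = sam_model_registry['vit_b'](checkpoint='sam_vit_b_01ec64.pth')
tuned_model.load_state_dict(torch.load('sam_vit_b_finetuned.pth'))  # path to your saved state dict (assumed name)
tuned_model.eval()

# The tuned model can then be used with SamPredictor for prompt-based inference, exactly like the vanilla model.
predictor = SamPredictor(tuned_model)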

Comparative Analysis of YOLOv9 and YOLOv8 Using Custom Dataset on Encord Active

Even as foundation models gain popularity, advancements in object detection models remain significant. YOLO has consistently been the preferred choice in machine learning for object detection. Let's train the latest iterations of the YOLO series, YOLOv9 and YOLOv8, on a custom dataset and compare their model performance.

In this blog, we will train YOLOv9 and YOLOv8 on the xView3 dataset. The xView3 dataset contains aerial imagery with annotations for maritime object detection, making it an ideal choice for evaluating the robustness and generalization capabilities of object detection models.

If you wish to curate and annotate your own dataset for a direct comparison between the two models, you can create the dataset using Encord Annotate. Once annotated, you can seamlessly follow the provided code to train and evaluate both YOLOv9 and YOLOv8 on your custom dataset. Read the Encord Annotate Documentation to get started with your annotation project.

Prerequisites

We are going to run our experiment on Google Colab. If you are doing it on your local system, please bear in mind that the instructions and the code were written to run in a Colab notebook.

Make sure you have access to a GPU. You can either run the command below or navigate to Edit → Notebook settings → Hardware accelerator, set it to GPU, and then click Save.

!nvidia-smi

To make it easier to manage datasets, images, and models, we create a HOME constant.

import os
HOME = os.getcwd()
print(HOME)

Train YOLOv9 on Encord Dataset

Install YOLOv9

!git clone https://github.com/SkalskiP/yolov9.git
%cd yolov9
!pip install -r requirements.txt -q
!pip install -q roboflow encord av

# This is a convenience class that holds the info about Encord projects and makes everything easier.
# The class supports bounding boxes and polygons across both images, image groups, and videos.
!wget 'https://gist.githubusercontent.com/frederik-encord/e3e469d4062a24589fcab4b816b0d6ec/raw/fa0bfb0f1c47db3497d281bd90dd2b8b471230d9/encord_to_roboflow_v1.py' -O encord_to_roboflow_v1.py

Imports

from typing import Literal
from pathlib import Path
from IPython.display import Image
import roboflow
from encord import EncordUserClient
from encord_to_roboflow_v1 import ProjectConverter

Data Preparation

Set up access to the Encord platform by creating and using an SSH key.

# Create ssh-key-path
key_path = Path("../colab_key.pub")
if not key_path.is_file():
    !ssh-keygen -t ed25519 -f ../colab_key -N "" -q
key_content = key_path.read_text()

We will now retrieve the data from Encord, converting it to the format required by YOLO and storing it on disk. It's important to note that for larger projects, this process may run into disk-space limitations. The converter will automatically split your dataset into training, validation, and testing sets based on the specified sizes.

# Directory for images
data_path = Path("../data")
data_path.mkdir(exist_ok=True)

client = EncordUserClient.create_with_ssh_private_key(
    Path("../colab_key").resolve().read_text()
)
project_hash = "9ca5fc34-d26f-450f-b657-89ccb4fe2027"  # xView3 tiny
encord_project = client.get_project(project_hash)

converter = ProjectConverter(
    encord_project,
    data_path,
)
dataset_yaml_file = converter.do_it(batch_size=500, splits={"train": 0.5, "val": 0.1, "test": 0.4})
encord_project_title = converter.title

Download Model Weights

We will download the YOLOv9-e and the gelan-c weights.
Although the YOLOv9 paper mentions the yolov9-s and yolov9-m versions, it's worth noting that weights for these models are currently unavailable in the YOLOv9 repository.

!mkdir -p {HOME}/weights
!wget -q https://github.com/WongKinYiu/yolov9/releases/download/v0.1/yolov9-e-converted.pt -O {HOME}/weights/yolov9-e.pt
!wget -P {HOME}/weights -q https://github.com/WongKinYiu/yolov9/releases/download/v0.1/gelan-c.pt

You can run and evaluate object detection with the YOLOv9 weights pre-trained on COCO. Check out the blog YOLOv9 Explained and How to Run it if you want to run object detection on pre-trained YOLOv9 weights.

Train Custom YOLOv9 Model for Object Detection

We train a custom YOLOv9 model from a pre-trained gelan-c model.

!python train.py \
--batch 8 --epochs 20 --img 640 --device 0 --min-items 0 --close-mosaic 15 \
--data $dataset_yaml_file \
--weights {HOME}/weights/gelan-c.pt \
--cfg models/detect/gelan-c.yaml \
--hyp hyp.scratch-high.yaml

You can examine and validate your training results. The code for validation and inference with the custom model is available on the Colab Notebook. Here we will focus on comparing the model performances.

Converting Custom YOLOv9 Model Predictions to Encord Active Format

pth = converter.create_encord_json_predictions(get_latest_exp("detect") / "labels", Path.cwd().parent)
print(f"Predictions exported to {pth}")

Download the predictions to your local computer and upload them via the UI to Encord Active for analysis of your results. Moving on to training YOLOv8!

Train YOLOv8 on Encord Dataset

Install YOLOv8

!pip install ultralytics==8.0.196

from IPython import display
display.clear_output()

import ultralytics
ultralytics.checks()

Dataset Preparation

As we are doing a comparative analysis of two models, we will use the same dataset to train YOLOv8.

Train Custom YOLOv8 Model for Object Detection

from ultralytics import YOLO

model = YOLO('yolov8n.pt')  # load a pretrained YOLOv8n detection model
model.train(data=dataset_yaml_file.as_posix(), epochs=20)  # train the model
model.predict()

The code for running inference on the test dataset is available on the Colab Notebook shared below.

Converting Custom YOLOv8 Model Predictions to Encord Active Format

pth = converter.create_encord_json_predictions(get_latest_exp("detect", ext="predict") / "labels", Path.cwd().parent)
print(f"Predictions exported to {pth}")

Download this JSON file and upload it to Encord Active via the UI.

Comparative Analysis on Encord Active

On Encord Active, under the Model Evaluation tab, you can compare both models' predictions. You can conveniently navigate to the Model Summary tab to view the Mean Average Precision (mAP), Mean Average Recall (mAR), and F1 score for both models. Additionally, you can compare the differences in predictions between YOLOv8 and YOLOv9.

Precision

YOLOv8 may excel in correctly identifying objects (high true positive count) but at the risk of also detecting objects that aren't present (high false positive count). On the other hand, YOLOv9 may be more conservative in its detections (lower false positive count) but could potentially miss some instances of objects (higher false negative count).

Recall

In terms of recall, YOLOv8 exhibits superior performance with a higher true positive count (101) compared to YOLOv9 (43), indicating its ability to correctly identify more instances of objects present in the dataset.
Both models, however, show an equal count of false positives (643), suggesting similar levels of incorrect identifications of non-existent objects. YOLOv8 demonstrates a lower false negative count (1261) compared to YOLOv9 (1315), implying that YOLOv8 misses fewer instances of actual objects, highlighting its advantage in recall performance. (A short sketch at the end of this article shows how these counts translate into precision, recall, and F1.)

Precision-Recall Curve

Based on the observed precision-recall curves, it appears that YOLOv8 achieves a higher Area Under the Curve (AUC-PR) value compared to YOLOv9. This indicates that YOLOv8 generally performs better in terms of both precision and recall across different threshold values, capturing a higher proportion of true positives while minimizing false positives more effectively than YOLOv9.

The precision-recall curve is not the only way to evaluate model performance; there are other metrics, such as the F1 score and IoU distribution. For more information on different quality metrics, read the blog Data, Label, & Model Quality Metrics in Encord.

Metric Correlation

The metric impact on performance in Encord refers to how specific metrics influence the performance of your model. Encord lets you identify which metrics have the most influence on your model's performance, and whether a change in a metric will lead to a positive change (positive correlation) or a negative change (negative correlation) in model performance.

The dimensions of the labeled objects significantly influence the performance of both models. This underscores the importance of object size in the dataset. It's possible that the YOLOv9 model's performance is adversely affected by the presence of smaller objects in the dataset, leading to its comparatively poorer performance.

Metric Performance

The Metric Performance view in Encord's model evaluation provides a detailed view of how a specific metric affects the performance of your model. It allows you to understand the relationship between a particular metric and the model's performance.

In conclusion, the comparison between YOLOv8 and YOLOv9 on Encord Active highlights distinct performance characteristics in terms of precision and recall. While YOLOv8 excels in correctly identifying objects with a higher true positive count, it also exhibits a higher false positive count, indicating a potential for over-detection. On the other hand, YOLOv9 demonstrates a lower false positive count but may miss some instances of objects due to its higher false negative count.

If you want to improve your object detection model, read the blog How to Analyze Failure Modes of Object Detection Models for Debugging for more information.

The precision-recall curve analysis suggests that YOLOv8 generally outperforms YOLOv9, capturing a higher proportion of true positives while minimizing false positives more effectively. However, it's important to consider other metrics like the F1 score and IoU distribution for a comprehensive evaluation of model performance. Moreover, understanding the impact of labeled object dimensions and specific metric correlations can provide valuable insights into improving model performance on Encord Active.
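To ground the outcome counts quoted above, here is a small illustrative sketch of how true positive, false positive, and false negative totals translate into precision, recall, and F1. The helper is purely illustrative and is not Encord Active code; the counts are the ones reported in this comparison.

def detection_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 from detection outcome counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Outcome counts quoted above for the xView3 comparison.
print("YOLOv8:", detection_metrics(tp=101, fp=643, fn=1261))
print("YOLOv9:", detection_metrics(tp=43, fp=643, fn=1315))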

March 1

8 min

Vision Language Models: Powering the next chapter in AI

Webinar Recording

In this webinar, we delve into the rapidly evolving field of data annotation and explore the groundbreaking role of Vision Language Models (VLMs). As organizations seek more efficient, accurate, and scalable methods to process vast amounts of data, VLMs are emerging as a pivotal technology for automating and enhancing data annotation tasks.

Here are the key resources from the webinar:

[Guide] Guide to Vision-Language Models
[Case Study] See how one customer increased mAP by 20% by reducing their dataset size by 35% with visual data curation

March 1

60 min

Validating Model Performance Using Encord Active

Model validation is a key machine learning (ML) lifecycle stage, ensuring models generalize well to new, unseen data. This process is critical for evaluating a model's predictions independently from its training dataset, thus testing its ability to perform reliably in the real world.

Model validation helps identify overfitting (where a model learns noise rather than the signal in its training data) and underfitting (where a model is too simplistic to capture complex data patterns). Both are detrimental to model performance. Techniques like the holdout method, cross-validation, and bootstrapping are pivotal in validating model performance, offering insights into how models might perform on unseen data. These methods are integral to deploying AI and machine learning models that are both reliable and accurate.

This article covers two parts:

Key model validation techniques, the advantages of a data-centric approach, and how to select the most appropriate validation method for your project.
How to validate a Mask R-CNN pre-trained model that segments instances in COVID-19 scans using Encord Active, a data-centric platform for evaluating and validating computer vision (CV) models.

Ready to dive deeper into model validation and discover how Encord Active can enhance your ML projects? Let's dive in!

The Vital Role of a Data-Centric Approach in Model Validation

A data-centric approach to model validation places importance on the quality of the data used to train and deploy computer vision (CV) and artificial intelligence (AI) models. The approach recognizes that the foundation of any robust AI system lies not in the complexity of its algorithms but in the quality of the data it learns from. High-quality, accurately labeled data (with ground truth) ensures that models can truly understand and interpret the nuances of the tasks they are designed to perform, from predictive analytics to real-time decision-making processes.

Why Data Quality is Paramount

The quality of training data is directly proportional to a model's ability to generalize from training to real-world applications. Poor data quality, including inaccuracies, biases, label errors, and incompleteness, leads to models that are unreliable, biased, or incapable of making accurate predictions. A data-centric approach prioritizes meticulous data preparation, including thorough data annotation, cleaning, and validation. This ensures the data distribution truly reflects the real world it aims to model and reduces label errors.

Improving Your Model's Reliability Through Data Quality

The reliability of CV models, and more recently foundation models, in critical applications such as healthcare imaging and autonomous driving cannot be overstated. A data-centric approach mitigates the risks associated with model failure by ensuring the data has high fidelity. It involves rigorous validation checks and balances, using your expertise and automated data quality tools to continually improve your label quality and datasets.

Adopt a data-centric approach to your AI project and unlock its potential by downloading our whitepaper.

Key Computer Vision Model Validation Techniques

Validating computer vision models after training calls for a data-centric approach that looks at more than performance and generalizability alone. It also needs to account for the unique problems of visual data, such as variation in image quality, lighting, and perspective.
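Before tailoring these techniques to vision data, here is a minimal sketch of the cross-validation idea mentioned above, using scikit-learn's StratifiedKFold on synthetic stand-in features; the data, classifier choice, and fold count are assumptions for illustration only.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-ins for image feature vectors (e.g. embeddings) and their class labels.
rng = np.random.default_rng(42)
features = rng.normal(size=(600, 32))
labels = rng.integers(0, 3, size=600)

# Stratified K-Fold keeps each fold's class balance close to the full dataset's,
# which matters for the imbalanced datasets common in computer vision.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), features, labels, cv=cv)
print("Per-fold accuracy:", np.round(scores, 3))
print("Mean cross-validated accuracy:", scores.mean())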
Tailoring the common validation techniques specifically for computer vision is about robustly evaluating the model's ability to analyze visual information and embeddings across diverse scenarios:

Out-of-Sample Validation: Essential for verifying that a CV model can generalize from its training data to new, unseen images or video streams. This approach tests the model's ability to handle variations in image quality, lighting, and subject positioning that it hasn't encountered during training.

Cross-Validation and Stratified K-Fold: Particularly valuable in computer vision for ensuring that every aspect of the visual data is represented in both training and validation sets. Stratified K-Fold is beneficial when dealing with imbalanced datasets, common in computer vision tasks, to maintain an equal representation of classes across folds.

Leave-One-Out Cross-Validation (LOOCV): While computationally intensive, LOOCV can be particularly insightful for small image datasets where every data point's inclusion is crucial for assessing the model's performance on highly nuanced visual tasks.

Bootstrapping: Offers insights into the stability of model predictions across different visual contexts. This method helps understand how training data subset changes can affect the model's performance, which is particularly relevant for models expected to operate in highly variable visual environments.

Adversarial Testing: Tests the model's resilience against slight, often invisible, image changes. This technique is critical to ensuring models are not easily perturbed by minor alterations that would not affect human perception.

Domain-Specific Benchmarks: Participating in domain-specific challenges offered by ImageNet, COCO, or PASCAL VOC can be a reliable validation technique. These benchmarks provide standardized datasets and metrics, allowing for evaluation of a model's performance against a wide range of visual tasks and conditions, ensuring it meets industry standards.

Human-in-the-Loop: Involving domain experts in the validation process is invaluable, especially for tasks requiring fine-grained visual distinctions (e.g., medical imaging or facial recognition). This approach helps ensure that the model's interpretations align with human expertise and can handle the subtleties of real-world visual data.

Ensuring a model can reliably interpret and analyze visual information across various conditions requires a careful balance between automated validation methods and human expertise.

Choosing the right validation techniques for CV models involves considering the dataset's diversity, the computational resources available, and the application's specific requirements. Luckily, there are model validation tools that let you focus on validating the model while they do the heavy lifting of providing the insights necessary to assess your CV model's performance, including AI-assisted evaluation features.

But before walking through Encord Active, let's understand the factors you need to consider when choosing the right tool.

How to Choose the Right Computer Vision Model Validation Tool

When choosing the right model validation tool for computer vision projects, several key factors come into play, each addressing the unique challenges and requirements of working with image data. These considerations ensure that the selected tool accurately evaluates the model's performance and aligns with the project's specific demands.
Here's a streamlined guide to making an informed choice:

Data Specificity and Complexity: Opt for tools that cater to the variability and complexity inherent in image data. This means capabilities for handling image-specific metrics such as Intersection over Union (IoU) for object detection and Mean Absolute Error (MAE) for tasks like classification and segmentation are crucial.

Robust Data Validation: The tool should adeptly manage image data peculiarities, including potential discrepancies between image annotations and the actual images. Look for features that support comprehensive data validation across various stages of the model development cycle, including pre-training checks and ongoing training validations.

Comprehensive Evaluation Metrics: Essential for thoroughly assessing a computer vision model's performance. The tool should offer a wide array of metrics, including precision-recall curves, ROC curves, and confusion matrices for classification, alongside task-specific metrics like IoU for object detection. It should also support quality metrics for a more holistic, real-world evaluation.

Versatile Performance Evaluation: It should support a broad spectrum of evaluation techniques for deep insights into accuracy, the balance between precision and recall, and the model's ability to distinguish between different classes.

Dataset Management: The validation tool should help with efficient dataset handling for proper training-validation splits. For the sake of performance and scale, it should be able to manage large datasets.

Flexibility and Customization: The fast-paced nature of computer vision demands tools that allow for customization and flexibility. This includes introducing custom metrics, supporting various data types and model architectures, and adapting to specific preprocessing and integration needs.

Considering those factors, you can select a validation tool (open-source toolkit, platform, etc.) that meets your project's requirements and contributes to developing reliable models.

Using Encord Active to Validate the Performance of Your Computer Vision Model

Encord Active (EA) is a data-centric model validation solution that enables you to curate valuable data that can truly validate your model's real-world generalizability through quality metrics. In this section, you will see how to analyze the performance of a pre-trained Mask R-CNN object detection model with Encord Active on COVID-19 predictions. From the analysis results, you will be able to validate and, if necessary, debug your model's performance.

This walkthrough uses Encord Annotate to create a project and import the dataset, and Encord Active Cloud to analyze the model's failure modes. We recommend you sign up for an Encord account to follow this guide.

Import Predictions

Import your predictions onto the platform. Learn how to import Predictions in the documentation. Select the Prediction Set you just uploaded, and Encord Active will use data, label, and model quality metrics to evaluate the performance of your model.

Visualize Model Performance Summary on the Validation Set

Evaluate the model's performance by inspecting the Model Summary dashboard to get an overview of your model's performance on the validation set, with detailed error categorization (true positive vs. false positive vs. false negative), the F1 score, and mean average precision/recall based on an IoU threshold.

Manually Inspect the Model Results

Beyond visualizing a summary of the model's performance, a tool that allows you to manually dig in and inspect how your model works on real-world samples is more than helpful. Encord Active provides an Explorer tab that enables you to filter models by metrics to observe the impact of metrics on real-world samples. EA's data-centric build also lets you see how your model correctly or incorrectly makes predictions (detects, classifies, or segments) on the training, validation, and production samples.

Let's see how you can achieve this:

On the Model Summary dashboard → Click the True Positive Count metric to inspect the predictions your model got right.

Click on one of the images using the expansion icon to see how well the model detects the class, the confidence score with which it predicts the object, other scores on performance metrics, and metadata.

Still under the Explorer tab → Click on Overview (the tab on the right) → Click on False Positive Count to inspect instances that the model failed to detect correctly.

It seems most classes flagged as False Positives are due to poor object classification quality (the annotations are not 100% accurate). Let's look closely at an instance:

In that instance, the model correctly predicts that the object is 'Cardiomediastinum'. Still, the second overlapping annotation has a broken track for some reason, so Encord Active classifies its prediction as a false positive using a combination of Broken Object Track and other relevant quality metrics.

Under Filter → Add filter, you will see parameters and attributes to filter your model's performance by. For example, if you added your validation set to Active through Annotate, you can validate your model's performance on that set and, likewise, on the production set.

Visualize the Impact of Metrics on Model Performance

Evaluate the model outcome count to understand the distribution of the correct and incorrect results for each class. Under the Model Evaluation tab → Click on Outcome to see the distribution chart.

Now, you should see the count for the number of predictions the model gets wrong. Using this chart, you can get a high-level perspective on the issues with your model. In this case, the model fails to correctly segment the 'Airways' object in the instances. The Intersection over Union (IoU) threshold is 0.5, i.e. the minimum overlap required between a prediction and the ground truth for the prediction to count as correct. Use the IoU Threshold slider under the Overview tab to see the outcome count based on a higher or lower threshold. You can also select specific classes you want to inspect under the Classes option.

Dig Deeper into the Metrics

Once you understand the model outcome count, you can dig deeper into specific metrics like precision, recall, and F1 scores if they are relevant to your targets. Notice the low precision, recall, and F1 scores per class! Also, group the scores by the model outcome count to understand how the model performs in each class. You could also use the precision-recall curve to analyze and highlight the classes that are harder for the model to detect with high confidence. Also break down the model's precision and recall values for the predictions of each object over the relevant metrics you want to investigate.
For example, if you want to see precision and recall by the Object Classification Quality metric, go to Metric Performance → select the Metric dropdown menu, and then choose the metric by which you want to investigate the model's precision.

Validate the Model's Performance on Business Criteria

Now it's time to see the metrics impacting the model's performance the most and determine, based on your information, whether the model is good for the business or needs debugging.

For instance, if the confidence scores are the least performing metrics, you might be worried that your vision model is naive in its predictions, given the previous consensus on the outcome count (false positives and negatives). Here is the case for this model under the Metric Performance dashboard (remember, you can use the IoU Threshold slider to check the metric impact at different thresholds):

The Relative Area (the object's size) significantly influences our model's performance. Considering the business environment in which you want to deploy the model, would this be acceptable? This is up to you to decide based on your technical and business requirements. If the model does not work, you can run more experiments and train more models until you find the optimal one.

Awesome! You have seen how Encord Active plays a key role in providing features for validating your model's performance with built-in metrics. In addition, it natively integrates with Encord Annotate, an annotation tool, to facilitate data quality improvements that can enhance the performance of your models.

Conclusion

Selecting the right model validation tools ensures that models perform accurately and efficiently. Validation involves assessing a model's performance through quantitative metrics such as IoU, mAP (mean Average Precision), and MAE, or qualitatively, by subject matter experts.

The choice of evaluation metric should align with the business objectives the model aims to achieve. Furthermore, model selection hinges on comparing various models using these metrics within a carefully chosen evaluation schema, emphasizing the importance of a proper validation strategy to ensure robust model performance before deployment.

Validating model performance is particularly vital in sectors where prediction errors could compromise safety. Check out our customer stories to learn from large and small teams that have improved their data quality and model performance with the help of Encord. Platforms like Encord, which specialize in improving data and model quality, are instrumental in this context. Encord Active, among others, provides features designed to refine data quality and bolster model accuracy, mitigating the risks associated with erroneous predictions or data analysis.
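To ground the IoU metric referenced throughout this walkthrough and in the conclusion, here is a small illustrative sketch of computing IoU for two axis-aligned bounding boxes in (x1, y1, x2, y2) format; it is purely illustrative and not Encord Active code.

def box_iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of the two box areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction overlapping a ground-truth box; at an IoU threshold of 0.5 this would not count as a match.
print(box_iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39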

March 1

8 min

Qwen-VL and Qwen-VL-Chat: Introduction to Alibaba’s AI Models

Qwen-VL is a series of open-source large vision-language models (LVLMs), offering a potent combination of advanced capabilities and accessibility. As an open-source project, Qwen-VL not only democratizes access to cutting-edge AI technology but also positions itself as a formidable competitor to established models from tech giants like OpenAI's GPT-4V and Google's Gemini.

In the competitive landscape of LVLMs, Qwen-VL has quickly risen to the forefront, securing its place as a leader on the OpenVLM leaderboard. This leaderboard, which encompasses 38 different VLMs including GPT-4V, Gemini, QwenVLPlus, and LLaVA, serves as a comprehensive benchmark for evaluating model performance across 13 distinct multimodal tasks.

OpenVLM Leaderboard

Qwen-VL's performance across these benchmarks underscores its versatility and robustness in handling various vision-language tasks with high accuracy and efficiency. By leading the OpenVLM leaderboard, Qwen-VL sets a new standard for excellence in the field, pushing the boundaries of what is possible with LVLMs and paving the way for future advancements in multimodal AI research.

Introduction to Large-scale Vision Language Models (LVLMs)

Large Language Models (LLMs) have attracted attention in recent years for their remarkable text generation and comprehension capabilities in the field of generative AI. However, their limitation to processing text alone has constrained their utility in various applications. In response, a new class of models known as Large Vision Language Models (LVLMs) has emerged, aiming to integrate visual data with textual information to address vision-centric tasks. LVLMs extend conventional LLMs with vision-language learning, broadening their applicability to image data. However, despite their promising potential, open-source LVLM implementations encounter hurdles such as inadequate training and optimization compared to proprietary models, and understanding visual content remains a significant challenge for existing LVLM frameworks.

Overview of Qwen-VL

The Qwen-VL series represents a significant advancement in Large Vision Language Models (LVLMs), designed to overcome the limitations of existing models and equip LLMs with visual processing capabilities. Built upon Alibaba Cloud's 7-billion-parameter Qwen-7B language model, the Qwen-VL series introduces a visual receptor architecture comprising a language-aligned visual encoder and a position-aware adapter. This architecture enables Qwen-VL models to effectively process visual inputs, generate responses based on prompts, and perform various vision-language tasks such as image recognition, image captioning, visual question answering, and visual grounding. Qwen-VL models demonstrate leading performance on vision-centric benchmarks and support multiple languages, including English and Chinese.

For more information on VLMs, read the blog Guide to Vision-Language Models (VLMs).

Key Features of Qwen-VL

Qwen-VL models demonstrate strong accuracy on a wide range of vision-centric understanding benchmarks, surpassing other SOTA models of similar scale. They excel not only in conventional benchmarks such as captioning and question answering but also in recently introduced dialogue benchmarks. Here are the key features of Qwen-VL:

Multi-lingual Support: Similar to Qwen-LM, Qwen-VLs are trained on multilingual image-text data, with a substantial corpus in English and Chinese.
This enables Qwen-VLs to naturally support English, Chinese, and other multilingual instructions.

Multi-image Capability: During training, Qwen-VLs can handle arbitrary interleaved image-text data as inputs, allowing them to compare, understand, and analyze context when multiple images are provided.

Fine-grained Visual Understanding: Qwen-VLs exhibit highly competitive fine-grained visual understanding abilities, thanks to their higher-resolution input size and the fine-grained corpus used during training. Compared to existing vision-language generalists, Qwen-VLs demonstrate superior performance in tasks such as grounding, text reading, text-oriented question answering, and fine-grained dialogue comprehension.

Vision-centric Understanding: This allows the model to comprehensively interpret and process visual information. With an advanced architecture integrating a language-aligned visual encoder and position-aware adapter, Qwen-VL excels in tasks like image captioning, question answering, and visual grounding. Its fine-grained analysis ensures precise interpretation of visual content, making Qwen-VL highly effective in vision-language tasks and real-world applications.

Design Structure of Qwen-VL

Beginning with the foundation of Qwen-LM, the model is enhanced with visual capacity through several key components:

Visual Receptor: Qwen-VL incorporates a carefully designed visual receptor, which includes a visual encoder and adapter. This component is responsible for processing image inputs and extracting fixed-length sequences of image features.

Input-Output Interface: The model's input-output interface is optimized to differentiate between image and text feature inputs. Special tokens are utilized to delineate image feature input, ensuring seamless integration of both modalities.

3-stage Training Pipeline: Qwen-VL employs a sophisticated 3-stage training pipeline to optimize model performance. This pipeline encompasses comprehensive training stages aimed at fine-tuning the model's parameters and enhancing its ability to comprehend and generate responses for both text and image inputs.

Multilingual Multimodal Cleaned Corpus: Qwen-VL is trained on a diverse multilingual multimodal corpus, which includes cleaned data encompassing both textual and visual information. This corpus facilitates the model's ability to understand and generate responses in multiple languages while effectively processing various types of visual content.

Model Architecture of Qwen-VL

The architecture of Qwen-VL comprises three key components, each contributing to the model's robustness in processing both text and visual inputs.

Large Language Model

Qwen-VL leverages a large language model as its foundational component. This model is initialized with pre-trained weights obtained from Qwen-7B, ensuring a strong linguistic foundation for its language processing capabilities.

Visual Encoder

Qwen-VL employs the Vision Transformer (ViT) architecture, utilizing pre-trained weights from OpenCLIP's ViT-bigG. During both training and inference, input images are resized to a specific resolution. The visual encoder processes these images by dividing them into patches with a stride of 14, thereby generating a set of image features that encapsulate visual information.

Position-aware Vision-Language Adapter

To address efficiency concerns arising from long sequences of image features, Qwen-VL introduces a vision-language adapter.
This adapter is designed to compress the image features, enhancing computational efficiency. It consists of a single-layer cross-attention module initialized randomly. This module utilizes a group of trainable embeddings as query vectors and the image features from the visual encoder as keys for cross-attention operations. By employing this mechanism, the visual feature sequence is compressed to a fixed length of 256. To preserve positional information crucial for fine-grained image comprehension, 2D absolute positional encodings are incorporated into the query-key pairs of the cross-attention mechanism. This ensures that positional details are retained during the compression process. The compressed image feature sequence of length 256 is then fed into the large language model, enabling Qwen-VL to effectively process both textual and visual inputs and perform a wide range of vision-language tasks with high accuracy and efficiency.

Training Pipeline of the Qwen-VL Series

For more information, read the official paper released on arXiv: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond.

Performance of Qwen-VL against State-of-The-Art LVLMs

The performance of Qwen-VL models, particularly Qwen-VL-Max, surpasses SOTA models such as Gemini Ultra and GPT-4V in various text-image multimodal tasks. These upgraded models achieve results comparable to Gemini Ultra and GPT-4V, while significantly outperforming the previous best results from open-source models, including the open-source version of Qwen-VL.

Performance of Qwen-VL-Plus and Qwen-VL-Max against other LVLMs

In particular, Qwen-VL-Max demonstrates superior performance over GPT-4V from OpenAI and Gemini from Google in tasks related to Chinese question answering and Chinese text comprehension. This achievement highlights the advanced capabilities of Qwen-VL-Max and its potential to establish new benchmarks in multimodal AI research and application. It should also be noted that most SOTA models are not trained primarily on Chinese-language data.

Capabilities of Qwen-VL

Qwen-VL exhibits a diverse range of capabilities that enable it to effectively comprehend and interact with visual and textual information, as well as reason and learn from its environment. These capabilities include:

Basic Recognition Capabilities

Qwen-VL demonstrates strong basic recognition capabilities, accurately identifying and describing various elements within images, including common objects, celebrities, landmarks, and intricate details.

Recognition capabilities of Qwen-VL

Visual Agent Capability

As a visual agent, Qwen-VL is capable of providing detailed background information, answering questions, and analyzing complex visual content. It can also compose poetry in multiple languages inspired by visual stimuli and analyze everyday screenshots.

Visual Agent Capabilities of Qwen-VL

Visual Reasoning Capability

Qwen-VL possesses advanced visual reasoning capabilities, extending beyond content description to comprehend and interpret intricate representations such as flowcharts, diagrams, and other symbolic systems. It excels in problem-solving and reasoning tasks, including mathematical problem-solving and profound interpretations of charts and graphs.
Qwen-VL has advanced visual reasoning capabilities

Text Information Recognition and Processing

Qwen-VL exhibits enhanced text information recognition and processing abilities, efficiently extracting information from tables and documents, reformatting it to meet customized output requirements, and effectively identifying and converting dense text. It also supports images with extreme aspect ratios, ensuring flexibility in processing diverse visual content.

Advanced text information recognition and processing abilities of Qwen-VL

Few-shot Learning on Vision-Language Tasks

Qwen-VL demonstrates satisfactory in-context learning (few-shot learning) ability, achieving superior performance on vision-language tasks such as question answering and image captioning compared to models with similar numbers of parameters. Its performance rivals even larger models, showcasing its adaptability and efficiency in learning from limited data.

For more information on few-shot learning, read the blog Few Shot Learning in Computer Vision: Approaches & Uses.

Qwen-VL Availability

Qwen-VL, including Qwen-VL-Plus and Qwen-VL-Max, is now readily accessible through various platforms, offering researchers and developers convenient access to its powerful capabilities (a minimal usage sketch appears at the end of this post):

HuggingFace: Users can access Qwen-VL-Plus and Qwen-VL-Max through Hugging Face Spaces and the Qwen website, enabling seamless integration into their projects and workflows.

Dashscope APIs: The APIs of Qwen-VL-Plus and Qwen-VL-Max are available through the Dashscope platform, providing developers with the flexibility to leverage its capabilities for their AI applications. Detailed documentation and quick-start guides are available on the Dashscope platform for easy integration.

QianWen Web Portal: By logging into the Tongyi QianWen web portal and switching to "Image Understanding" mode, users can harness the latest Qwen-VL-Max capabilities for image understanding tasks. This mode offers additional functionalities tailored specifically for image processing and understanding.

ModelScope: The Qwen-VL-Chat demo is available on ModelScope.

GitHub Repository: The code and model weights of both Qwen-VL and Qwen-VL-Chat are openly available to download on GitHub, allowing researchers and developers to explore, modify, and utilize them freely. Commercial use of these resources is permitted, enabling their integration into commercial projects and applications.

Qwen-VL-Chat

Qwen-VL-Chat, as a generalist multimodal LLM-based AI assistant, supports complex interactions, including multiple image inputs, multi-round question answering, and creative capabilities. Unlike traditional vision-language chatbots, Qwen-VL-Chat's alignment techniques enable it to comprehend and respond to complex visual and textual inputs with superior accuracy and flexibility. Here's how Qwen-VL-Chat stands out in real-world dialog benchmarks and compares with existing models:

Qwen-VL-Chat Vs. Vision-Language Chat

Performance of Qwen-VL against other generalist models across various tasks

Qwen-VL-Chat's advanced capabilities are evaluated using the TouchStone benchmark, which assesses overall text-image dialogue capability and alignment with humans. Unlike conventional models like ChatGPT or Bard, Qwen-VL-Chat excels in handling direct image input, thanks to fine-grained image annotations provided by human labeling.
With comprehensive coverage of 300+ images, 800+ questions, and 27 categories, including attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, and math problem solving, Qwen-VL-Chat achieves superior performance in understanding and responding to complex visual and textual inputs.

You can find the official tutorial to implement Qwen-VL-Chat on your own on GitHub.

Real-world Dialog Benchmark

Qwen-VL-Chat's strong results in other multimodal benchmarks, such as the MME Benchmark and SEED-Bench, demonstrate that its performance evaluation extends beyond the TouchStone benchmark. Qwen-VL-Chat obtains state-of-the-art scores in both the perception and cognition tracks of the MME Benchmark, an extensive evaluation of multimodal large language models. The Qwen series, which includes Qwen-VL-Chat, also achieves state-of-the-art performance on SEED-Bench, a benchmark consisting of 19K multiple-choice questions with precise human annotations.

Qwen-VL: What's Next?

The release of the Qwen-VL series represents a significant stride forward in large-scale multilingual vision-language models, with the goal of advancing multimodal research. Qwen-VL has demonstrated its superiority over comparable artificial intelligence models across various benchmarks, facilitating multilingual complex conversations, multi-image interleaved conversations, grounding in Chinese, and fine-grained recognition. Looking ahead, the focus is on further enhancing Qwen-VL's capabilities in several key dimensions:

Integration of More Modalities

The team plans to integrate Qwen-VL with more modalities, including speech and video. By expanding its scope to encompass these modalities, Qwen-VL will enhance its ability to understand and generate content across a wider range of inputs.

Multi-modal Generation

The model will be further developed to excel in multi-modal generation, particularly in generating high-fidelity images and fluent speech. By enhancing its ability to generate content across multiple modalities with high fidelity and fluency, Qwen-VL will advance the state of the art in multimodal AI systems.

Augmentation of Model Size and Training Data

Efforts are underway to scale up the model size, training data, and resolution of Qwen-VL. This enhancement aims to enable Qwen-VL to handle more complex and intricate relationships within multimodal data, leading to more nuanced and comprehensive understanding and generation of content.
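To make the availability notes above concrete, here is a minimal, hedged sketch of loading Qwen-VL-Chat from Hugging Face. The model ID and the from_list_format/chat helpers follow the usage documented in the Qwen-VL repository, but treat the exact names, arguments, and the example image URL as assumptions and check the official README.

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required because Qwen-VL-Chat ships custom modeling and tokenization code.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Interleave an image reference and a text instruction, as in the multi-image capability described above.
query = tokenizer.from_list_format([
    {"image": "https://example.com/demo.jpg"},  # placeholder URL
    {"text": "Describe this image and read any text it contains."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)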

February 29

8 min

Mistral Large Explained

Mistral AI made headlines with the release of Mistral 7B, an open-source model competing with tech giants like OpenAI and Meta and surpassing several state-of-the-art large language models such as LLaMA 2. Now, in collaboration with Microsoft, the French AI startup introduces Mistral Large, marking a significant advancement in language model development and distribution.

What Is Mistral Large?

Mistral Large, developed by Mistral AI, is an advanced language model renowned for its robust reasoning capabilities tailored for intricate multilingual tasks. Fluent in English, French, Spanish, German, and Italian, it exhibits a nuanced grasp of various languages. With a 32K-token context window, Mistral Large ensures precise information retrieval from extensive documents, facilitating accurate and contextually relevant text generation. With the incorporation of retrieval augmented generation (RAG), it can access facts from external knowledge bases, thereby enhancing comprehension and precision. Mistral Large also excels in instruction-following and function-calling functionalities, enabling tailored moderation policies and application development. Its performance in coding, mathematical, and reasoning tasks makes it a notable solution in natural language processing.

Key Attributes of Mistral Large

Reasoning Capabilities: Mistral Large showcases powerful reasoning capabilities, enabling it to excel in complex multilingual reasoning tasks. It stands out for its ability to understand, transform, and generate text with exceptional precision.

Native Multilingual Proficiency: With native fluency in English, French, Spanish, German, and Italian, Mistral Large demonstrates a nuanced understanding of grammar and cultural context across multiple languages.

Enhanced Contextual Understanding: Featuring a 32K-token context window, Mistral Large offers precise information recall from large documents, facilitating accurate and contextually relevant text generation.

Mistral Large, unlike Mistral 7B, the open-source LLM that provided stiff competition to state-of-the-art (SOTA) large language models, is equipped with retrieval augmented generation (RAG). This feature enables the LLM to retrieve facts from an external knowledge base, grounding its understanding and enhancing the accuracy and contextuality of its text-generation capabilities.

For more information, read the blog What is Retrieval Augmented Generation (RAG)?

Instruction-Following

Mistral Large's instruction-following capabilities allow developers to design customized moderation policies and system-level moderation, exemplified by its use in moderating platforms like le Chat.

Function Calling Capability

Mistral Large can directly call functions, making it easier to build and update apps and modernize tech stacks at scale. With this feature and limited output mode, developers can add advanced functionality and make interactions smoother without any hassle.

Performance Benchmark

The performance of Mistral Large is compared on various tasks against other state-of-the-art LLMs which are commonly used as benchmarks.

Reasoning and Knowledge

These benchmarks assess various aspects of language understanding and reasoning, including massive multitask language understanding (MMLU), completing tasks with limited information (e.g., 5-shot and 10-shot scenarios), and answering questions based on different datasets (e.g., TriviaQA and TruthfulQA).
Multi-lingual Capacities The multilingual capability of Mistral Large undergoes benchmarking on the HellaSwag, ARC Challenge, and MMLU benchmarks across French, German, Spanish, and Italian. Its performance is compared to Mistral 7B and LLaMA 2. Notably, Mistral Large hasn't been tested against the GPT series or Gemini, as these language models have not disclosed their performance metrics on these four languages. To learn more about Mistral 7B, read the blog Mistral 7B: Mistral AI's Open Source Model. Maths and Coding Mistral Large excels across coding and math benchmarks, showcasing strong problem-solving abilities. With high pass rates on HumanEval and MBPP, it demonstrates proficiency in code generation tasks. Achieving strong accuracy with four-sample majority voting (maj@4) on the MATH benchmark and maintaining accuracy in few-shot settings on the GSM8K benchmark, Mistral Large proves its effectiveness in diverse mathematical and coding challenges. Comparison of Mistral Large with other SOTA Models Mistral Large demonstrates impressive performance on widely recognized benchmarks, securing its position as the second-ranked model available via API globally, just behind GPT-4. Detailed comparisons against other state-of-the-art (SOTA) models like Claude 2, Gemini Pro 1.0, GPT 3.5, and LLaMA 2 70B are provided on benchmarks such as MMLU (Measuring Massive Multitask Language Understanding), showcasing Mistral Large's competitive edge and advanced capabilities in natural language processing tasks. Mistral Large: Platform Availability La Plateforme Hosted securely on Mistral's infrastructure in Europe, La Plateforme offers developers access to a comprehensive array of models for developing applications and services. This platform provides a wide range of tools and resources to support different use cases. Le Chat Le Chat serves as a conversational interface for interacting with Mistral AI's models, providing users with a pedagogical and enjoyable experience to explore the company's technology. It can utilize Mistral Large or Mistral Small, as well as a prototype model called Mistral Next, offering brief and concise interactions. Microsoft Azure Mistral AI has announced its partnership with Microsoft and made Mistral Large available in Azure AI Studio, providing users with a user-friendly experience similar to Mistral's APIs. Beta customers have already experienced notable success utilizing Mistral Large on the Azure platform, benefiting from its advanced features and robust performance. Self-deployment For sensitive use cases, Mistral Large can be deployed directly into the user's environment, granting access to model weights for enhanced control and customization. Mistral Large on Microsoft Azure Mistral Large is set to benefit significantly from the multi-year partnership between Microsoft and Mistral AI on three key aspects: Supercomputing Infrastructure: Microsoft Azure will provide Mistral AI with supercomputing infrastructure tailored for AI training and inference workloads, ensuring best-in-class performance and scalability for Mistral AI's flagship models like Mistral Large. This infrastructure will enable Mistral AI to handle complex AI tasks efficiently and effectively. Scale to Market: Through Models as a Service (MaaS) in Azure AI Studio and the Azure Machine Learning model catalog, Mistral AI's premium models, including Mistral Large, will be made available to customers.
This platform offers a diverse selection of both open-source and commercial models, providing users with access to cutting-edge AI capabilities. Additionally, customers can utilize the Microsoft Azure Consumption Commitment (MACC) for purchasing Mistral AI's models, enhancing accessibility and affordability for users worldwide. AI Research and Development: Microsoft and Mistral AI will collaborate on AI research and development initiatives, including the exploration of training purpose-specific models for select customers. This collaboration extends to European public sector workloads, highlighting the potential for Mistral Large and other models to address specific customer needs and industry requirements effectively. Mistral Small Mistral Small, introduced alongside Mistral Large, represents a new optimized model specifically designed to prioritize low latency and cost-effectiveness. This model surpasses Mixtral 8x7B, the sparse mixture-of-experts network, in performance while boasting lower latency, positioning it as a refined intermediary solution between Mistral's open-weight offering and its flagship model. Mistral Small inherits the same innovative features as Mistral Large, including RAG-enablement and function calling capabilities, ensuring consistent performance across both models. To streamline their endpoint offering, Mistral is introducing two main categories: Open-weight Endpoints: These endpoints, named open-mistral-7b and open-mixtral-8x7b, offer competitive pricing and provide access to Mistral's models with open weights, catering to users seeking cost-effective solutions. New Optimized Model Endpoints: Mistral is introducing new optimized model endpoints, namely mistral-small-2402 and mistral-large-2402. These endpoints are designed to accommodate specific use cases requiring optimized performance and cost efficiency. Also, mistral-medium will be maintained without updates at this time. To learn more about the Mistral AI models and how to access them, read the documentation. What’s Next? Multi-currency Pricing Moving forward, Mistral AI is introducing multi-currency pricing for organizational management, providing users with the flexibility to transact in their preferred currency. This enhancement aims to streamline payment processes and improve accessibility for users worldwide. Reduced Endpoint Latency Mistral AI states that it is working to reduce the latency of all of its endpoints. This improvement ensures faster response times, enabling smoother interactions and improved efficiency for users across various applications. La Plateforme Service Tier Updates To improve its services, Mistral AI has updated the service tiers on La Plateforme. These updates aim to improve performance, reliability, and user satisfaction for those using Mistral AI's platform for their projects and applications.
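As a concrete illustration of the endpoint naming above, here is a minimal sketch of calling the mistral-large-2402 endpoint on La Plateforme through its HTTP chat completions API. The request shape follows Mistral's public API conventions at the time of writing; the API key environment variable and the prompt are illustrative assumptions.

```python
import os

import requests

# Hypothetical example: query the mistral-large-2402 endpoint on La Plateforme.
# Assumes MISTRAL_API_KEY is set in the environment; the prompt is illustrative.
API_URL = "https://api.mistral.ai/v1/chat/completions"

payload = {
    "model": "mistral-large-2402",
    "messages": [
        {
            "role": "user",
            "content": "Summarize the benefits of retrieval augmented generation in two sentences.",
        }
    ],
    "temperature": 0.3,
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

Swapping the model field to mistral-small-2402 or one of the open-weight endpoints is all it takes to trade accuracy for latency and cost, which is the trade-off the endpoint categories above are designed around.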

February 28

5 min

sampleImage_machine-learning-lifecycle
An Overview of the Machine Learning Lifecycle

Machine learning (ML) is a transformative technology that has recently witnessed widespread adoption and business impact across industries. However, realizing the true business potential of machine learning is challenging due to the intricate processes involved in building an ML product across various stages, from raw data management and preparation to model development and deployment. Therefore, organizations, especially emerging AI startups, must understand the entire ML lifecycle and implement best practices to build machine learning products in an efficient, reliable, secure, and scalable manner. In this article, I will provide an overview of the various stages of the ML lifecycle and share hard-won practical insights and advice from working in the AI industry across big technology companies and startups. Whether you're a data scientist, an ML engineer, or a business leader, this overview will equip you with the knowledge to navigate the complexities of building and deploying ML products effectively. Here are the stages we will look at: Stage 1: Define the business problem and objectives. Stage 2: Translate the business problem into an ML problem. Stage 3: Prepare the dataset. Stage 4: Train the ML model. Stage 5: Deploy the ML model. Stage 6: Monitor and maintain the ML model in production. Stage 1. Define the Business Problem and Objectives In industry, it is important to understand that machine learning is a tool, and a powerful one at that, for improving business products and processes and, ultimately, commercial outcomes. In some companies, ML is core to the business; in most organizations, it serves to amplify business outcomes. The first stage in the machine learning lifecycle involves the conception and iterative refinement of a business problem aligned with the company’s short-term or long-term strategic goals. You must continuously iterate on the problem until its scope and objectives are finalized through the underlying market or customer research and discussions amongst the relevant stakeholders (including business leaders, product managers, ML engineers, and domain experts). Using machine learning is a given for some business problems, such as reducing the cost of manually annotating images. However, for other problems, the potential of machine learning needs to be explored further by conducting research and analyzing relevant datasets. Only once you form a clear definition and understanding of the business problem, goals, and the necessity of machine learning should you move forward to the next stage: translating the business problem into a machine learning problem statement. Although this first stage involves little machine learning per se, it is actually a critical prerequisite for the success of any machine learning project, ensuring that the solution is not only technically viable but also strategically aligned with business goals. Stage 2. Translate the Business Problem into an ML Problem Translating a business problem, such as reducing the cost of manual image annotation, into a machine learning problem is a critical step that requires careful consideration of the specific objectives and the available data. However, the particular choice of modeling techniques or datasets to focus on, or whether to use classical machine learning or more advanced deep learning models, must be analyzed before proceeding.
Several approaches exist to solve the problem, and these need to be considered and prioritized based on the experience and intuition of the ML leadership. For instance, one might start with clustering algorithms to group similar images for image annotation. Computer vision models that determine whether two images are similar can achieve this. The next step might involve using pre-trained computer vision models like ResNet or Vision Transformers to annotate the images for the use case of interest, for instance, image segmentation. Finally, the model-based annotations need human review to validate the utility of this ML-based method. In this way, you may propose a high-level problem statement and finalize it by considering the inputs and experience of the relevant machine learning team. Once the machine learning-based approaches are finalized, the business project can be better managed regarding requirements, milestones, timelines, stakeholders, the number of machine learning resources to allocate, budget, success criteria, etc. Stage 3. Data Preparation With a well-defined business problem and its corresponding machine learning formulation, the team has a clear roadmap to build, train, and deploy models. Data preparation and engineering are the next steps in building the ML solution. This involves multiple processes, such as setting up the overall data ecosystem, including a data lake and feature store, data acquisition and procurement as required, data annotation, data cleaning, data management, governance, and data feature processing. How do you clean and preprocess datasets? Our guide on ‘Mastering Data Cleaning & Data Preprocessing’ will delve into the techniques and best data cleaning and data preprocessing practices. You will learn their importance in machine learning, common techniques, and practical tips to improve your data science pipeline. In large ML organizations, there is typically a dedicated team for all the above aspects of data preparation. For the particular business problem, the team needs to ensure that carefully cleaned, labeled, and curated datasets of high quality are made available to the machine learning scientists who will train models on these datasets. Curating datasets by area metric using Encord Active Sometimes, you may need to acquire data externally by purchasing it from data brokers, web scraping, or using machine learning-based approaches for generating synthetic data. After that, you may need to label data for supervised machine learning problems and subsequently clean and process it to create features relevant to the use case. These features must be stored in a feature store so that data scientists can efficiently access and retrieve them. In addition to the actual data preparation and processing work, most organizations must also invest in establishing a data governance and management strategy. Adopting a data-centric approach to your ML projects has become increasingly crucial. As organizations strive to develop more robust and effective deep learning models, the spotlight has shifted toward understanding and optimizing the data that fuels these systems. As you prepare your datasets, understanding the significance, key principles, and tools to implement this approach will set your project up for success. Encord is a data-centric AI platform that streamlines your labeling and workflow management, intelligently cleans and curates data, easily validates label quality, and evaluates model performance.
Check out our whitepaper on 'How to Adopt a Data-Centric AI.' Stage 4. Model Training With a training dataset ready, you can now build models to solve the particular use case. Conduct preliminary research on choosing relevant models based on a literature review, including papers, conference proceedings, and technical blogs, before you start training the models. It is also crucial to carefully define the relevant set of model metrics. The metrics should be relevant to the use case and not just include accuracy as a metric by default. For instance, IoU (Intersection over Union) is a more appropriate metric for object detection and segmentation models. Similarly, the BLEU score is a relevant metric for measuring the performance of neural machine translation models. It's critical to consider multiple metrics to capture different performance aspects, ensuring they align with the technical and business objectives. Interested in learning more about segmentation? Check out our guide to image segmentation for computer vision projects. Model training is typically done in Python notebooks such as Jupyter or Google Colaboratory with GPUs for handling large neural network models. However, conducting model training experiments using platforms that enable experiment tracking and visualization of results is helpful in promoting reproducible research and effective stakeholder collaboration. Apart from versioning the underlying code, it is important to version the dataset, the model, and the associated hyperparameters. In some cases, a single model may not achieve the required performance levels, and it makes sense to ensemble different models to attain the required performance. Analyze the model's results carefully on the validation dataset, ideally reflecting the distribution of the real-world data. One approach you could take is to use tools that help you explore and understand the distribution of your validation datasets and how the model performs on them. Also, consider quality metrics that give you a nuanced, detailed evaluation of your model’s performance on the training and validation sets. Understanding how to evaluate and choose the right metric for your project is a development hack that successful ML teams do not neglect. See our comprehensive guide to learn how. Encord Active uses a data-centric approach to evaluate how well your model will generalize to real-world scenarios using built-in metrics, custom quality metrics, and ML-assisted model evaluation features. Prediction Issues and Types View in Encord Active The final step here is to seek feedback from domain knowledge experts to confirm whether the model performance is robust, potentially guiding adjustments in features, model architecture, or problem framing. Stage 5. Model Deployment In the next stage, the trained model is prepared for deployment. Before deploying, it is crucial to ensure that the model size and latency meet the required criteria. Models can be compressed through techniques such as knowledge distillation or pruning without significantly impacting accuracy (a minimal pruning sketch follows below). The choice of deployment environment depends on the use case and can vary from deployment in the cloud to deployment on the edge for devices such as a smartphone or a smart speaker. The final choice of the deployment platform is based on several factors, including computational resources, data privacy, latency requirements, and cost.
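Here is the promised sketch of the pruning approach, using PyTorch's torch.nn.utils.prune on a toy classifier; the network, the layers pruned, and the 30% sparsity level are illustrative assumptions rather than a recommendation for any particular model.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy classifier standing in for a trained model (architecture and sparsity are illustrative).
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

linear_layers = [m for m in model.modules() if isinstance(m, nn.Linear)]

# Zero out the 30% smallest-magnitude weights in each Linear layer (L1 unstructured pruning),
# then make the pruning permanent by baking the mask into the weight tensor.
for module in linear_layers:
    prune.l1_unstructured(module, name="weight", amount=0.3)
    prune.remove(module, "weight")

# Report the resulting sparsity.
zeros = sum((m.weight == 0).sum().item() for m in linear_layers)
total = sum(m.weight.numel() for m in linear_layers)
print(f"Linear-layer weight sparsity: {zeros / total:.1%}")
```

In practice you would fine-tune for a few epochs after pruning and re-check the validation metrics before exporting the compressed model.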
You should also consider if the use case requires real-time or batch predictions so that the appropriate infrastructure for monitoring and logging is set up accordingly. Before the model is put into production at scale, A/B testing is recommended to validate the model's performance and impact on the user experience. Implementing a phased rollout strategy can further mitigate risk, allowing for incremental adjustments based on real-world feedback and performance data. It is also important to remain cognizant of ethical and legal considerations, ensuring the model's deployment aligns with data protection regulations and ethical standards. Stage 6. Model Monitoring and Maintenance Once the model is deployed, you proceed to the final, but no less important, stage of model monitoring and maintenance. Continuous monitoring of the model performance metrics, the underlying hardware infrastructure, and the user experience is essential. Monitoring is an automated, continuous process: all events are logged, and alerting systems are set up to flag and visualize errors and anomalies. Once a model is deployed, its performance may degrade over time, especially in cases of data drift or model drift. In such scenarios, the model may need to be retrained to address the change in the data distribution. Model monitoring and observability help you understand how a model reaches an outcome using different techniques. We wrote a comprehensive guide on model observability that you should read 👀. Most machine learning models in production are retrained regularly, whether hourly, daily, weekly, or monthly. By capturing the latest real-world data in the updated training set, you can continuously improve the machine learning model and adapt its performance to the dynamic real-world data distributions. Conclusion Building machine learning models from scratch and taking them into production to provide a delightful customer experience is an arduous yet rewarding and commercially viable journey. However, building machine learning systems is not straightforward, as they comprise multiple building blocks, including code, data, and models. To create a significant business impact with machine learning, you have to overcome the high odds of failure that commercial machine learning projects face. You can maximize the likelihood of success by going through this process systematically, with continuous iterations and feedback loops, as you navigate the various stages of the ML lifecycle.

February 26

8 min

sampleImage_data-annotation-companies-for-computer-vision
Top 10 Data Annotation and Data Labeling Companies [2024]

With increasing reliance on computer vision (CV) systems in multiple industrial domains, the demand for robust data annotation solutions is rising exponentially. The most recent reports project the data annotation tools market to have a compound annual growth rate (CAGR) of 21.8% from 2024 to 2032. However, as several companies emerge offering annotation platforms and services, finding a cost-effective provider is challenging. While many platforms offer advanced annotation features, only a few meet the scalability and security requirements essential for enterprise-level CV applications. This article discusses the twelve best video and image annotation companies in 2024 to help you with your search. The following lists the companies we think are driving the data annotation space: Encord iMerit Appen Label Your Data Keymakr TrainingData SuperbAI Kili Technology Telus International SuperAnnotate Cogito Labelbox Top 12 Data Annotation and Data Labeling Companies Data annotation companies offering labeling solutions must meet stringent security and scalability requirements to match the high standards of the modern artificial intelligence (AI) space. Below are the twelve top companies, ranked based on the following factors: Data security protocols: Compliance with data security regulations and use of encryption algorithms. Scalability: The solution’s ability to handle large data volumes and variety. Collaboration: Tools allowing different team members to collaborate on projects. Ease of use: A user-friendly interface that is intuitive and easy to navigate. Supported data types: Support for different modalities such as video, image, audio, and text. Automation: AI-based labeling for speeding up annotation processes. Other functionalities for streamlining the annotation workflow include integration with cloud services and advanced annotation methods for complex scenarios. Let’s explore each company's annotation platforms or services and see the key features based on the above factors to help you determine the most suitable option. Encord Encord is an end-to-end data platform that enables you to annotate, curate, and manage computer vision datasets through AI-assisted annotation features. It also provides intuitive dashboards to view insights on key metrics, such as label quality and annotator performance, to optimize workforce efficiency and ensure you build production-ready models faster. State-of-the-art model-assisted labeling and customizable workflows to accelerate labeling projects with Encord Annotate. Key Features Data security: Encord complies with the General Data Protection Regulation (GDPR), System and Organization Controls 2 (SOC 2), and Health Insurance Portability and Accountability Act (HIPAA) standards. It uses advanced encryption protocols to ensure data security and privacy. Scalability: The platform allows you to upload up to 500,000 images (recommended), 100 GB in size, and 5 million labels per project. You can also upload up to 200,000 frames per video (2 hours at 30 frames per second) for each project. See more guidelines for scalability in the documentation. Collaboration: You can create workflows and assign roles to relevant team members to manage tasks at different stages. User roles include admin, team member, reviewer, and annotator. Ease-of-use: Encord Annotate offers an intuitive user interface (UI) and an SDK to label and manage annotation projects.
Supported data types: The platform lets you annotate images, videos (and image sequences), DICOM, and Mammography data. Supported annotation methods: Encord supports multiple annotation methods, including classification, bounding box, keypoint, polylines, and polygons. Automated labeling: The platform speeds up the annotation with automation features, including: - Segment Anything Model (SAM) to automatically create labels around distinct features in all supported file formats. - Interpolation to auto-create instance labels by estimating where labels should be created in videos and image sequences. - Object tracking to follow entities within images based on pixel information enclosed within the label boundary. Integration: Integrate popular cloud storage platforms, such as AWS, Google Cloud, Azure, and Open Telekom Cloud OSS, to import datasets. Best for Teams looking for an enterprise-grade image and video annotation solution to produce high-quality data for computer vision models. Pricing Encord has a pay-per-user pricing model with Starter, Team, and Enterprise options. Learn more about automated data annotation by reading our guide to automated data annotation. iMerit iMerit offers Ango Hub, a data annotation solution built on a generative AI framework that lets you build use-case-specific applications for autonomous vehicles, agriculture, and healthcare industries. iMerit Key Features Collaboration: The Ango Hub solution lets you add labelers and reviewers to customized workflows for managing annotation projects. Ease-of-use: The platform offers an intuitive UI to label items, requiring no coding expertise. Supported data types: Ango Hub supports audio, image, video, DICOM, text, and markdown data types. Supported labeling methods: The solution supports bounding boxes, polygons, polylines, segmentation, and tools for natural language processing (NLP). Integration: The platform features integrated plugins for automated labeling and machine learning models for AI-assisted annotations. Best for Teams searching for an integrated labeling platform for annotating text, video, and image data. Pricing Pricing information is not publicly available. Contact the team to get a quote. Appen Appen offers data annotation solutions for building large language models (LLMs) by providing a standalone labeling platform and data labeling services through expert linguists. Appen Key Features Workforce capacity: Appen’s managed services include more than a million specialists speaking over 200 languages across 170 countries. With the option to combine its platform with its services, the solution becomes highly scalable. Supported data types: Appen’s platform lets you label documents, images, videos, audio, text, and point-cloud data. Supported annotation methods: Labeling methods include bounding boxes, cuboids, lines, points, polygons, ellipses, segmentation, and classification. Instruction datasets: The company also offers domain-specific instruction datasets for training LLMs. Best for Teams looking for a hybrid solution for building multi-modal models for text and vision applications. Pricing Pricing is not publicly available. Label Your Data Label Your Data is a data annotation service provider offering video and image annotation services for CV and NLP applications. Label Your Data Key Features Data security: The company complies with ISO 27001, GDPR, and CCPA standards. Workforce capacity: Label Your Data builds a remote team of over 500 data annotators to speed up the annotation process. 
Supported data types: The solution supports image, video, point-cloud, text, and audio data. Supported labeling methods: CV methods include semantic segmentation, bounding boxes, polygons, cuboids, and key points. NLP methods include named entity recognition (NER), sentiment analysis, audio transcription, and text annotation. Best for Teams looking for a secure annotation service provider for completely outsourcing their labeling efforts. Pricing Label Your Data provides on-demand, short- and long-term plans. Keymakr Keymakr is an image and video annotation service provider that manages labeling processes through its in-house professional experts. Keymakr Key Features Labeling capacity: You can label up to 100,000 data items. Supported data types: The platform supports image, video, and point-cloud data. Supported labeling methods: Keymakr offers annotations that include bounding boxes, cuboids, polygons, semantic segmentation, key points, bitmasks, and instance segmentation. Smart assignment: The solution features smart task distribution to match relevant annotators with suitable tasks based on skill set. Performance tracking: Keymakr provides performance analytics to track progress and alert managers in case of issues. Data collection and creation: The company also offers services to create relevant data for your projects or collect it from reliable sources. Best for Beginner-level teams working on CV projects, requiring data creation and annotation services. Pricing Pricing is not publicly available. TrainingData TrainingData is a Software-as-a-Service (SaaS) data labeling application for CV projects, featuring pixel-level annotation tools for accurate labeling. TrainingData Key Features Data security: The company provides a Docker image to run on your local network through a secure virtual private network (VPN) connection. Scalability: You can label up to 100,000 images. Collaboration: TrainingData’s platform lets you create projects and add relevant collaborators with suitable roles, including reviewer, annotator, and admin. Supported labeling methods: The platform offers multiple labeling tools, including a brush and eraser for pixel-accurate segmentation, bounding boxes, polygons, key points, and a freehand drawer for freeform contours. Integration: TrainingData integrates with any cloud storage service that complies with the cross-origin resource sharing (CORS) policy. Best for Teams looking for an on-premises image annotation platform for segmentation tasks. Pricing TrainingData offers free, pro, and enterprise packages.
Kili Technology Kili Technology offers an intuitive labeling platform to annotate data for LLMs, generative AI, and CV models with quality assurance features to produce error-free datasets. Kili Technology Key Features Collaboration: The platform lets you assign multiple roles to team members, including reviewer, admin, manager, and labeler, to collaborate on projects through instructions and feedback. Ease-of-use: Kili offers a user-friendly UI for managing workflows, requiring minimal code. Supported labeling methods: The tool supports bounding boxes, optical character recognition (OCR), named entity recognition (NER), pose estimation, and semantic segmentation. Automation: Kili supports automated labeling through active learning and pre-annotations using ChatGPT and SAM. Best for Data scientists looking for a lightweight annotation solution for building generative AI applications. Pricing Pricing depends on the number of items you need to label. Telus International Telus International’s Ground Truth (GT) studio offers three platforms as part of a managed service to build training datasets for ML models. GT Manage helps with people and project management; GT Annotate lets you annotate image and video data. GT Data is a data creation and collection tool supporting multiple data types. Telus International Key Features Data security: GT Annotate complies with SOC 2 standards and implements two-factor authentication with firewall applications and intrusion detection for data security. Collaboration: GT Manage features workforce management tools for optimal task distribution and quality control. Supported data types: You can collect image, video, audio, text, and geo-location data using GT Data. Supported labeling methods: GT Annotate supports bounding boxes, cuboids, polylines, and landmarks. Best for Teams looking for a complete AI solution for collecting, labeling, and managing raw data. Pricing Pricing information is not publicly available. SuperAnnotate SuperAnnotate offers a data labeling tool that lets you manage AI data through collaboration tools and annotation workflows while providing quality assurance features to ensure labeling accuracy. SuperAnnotate Key Features Collaboration: SuperAnnotate lets you create teams and assign relevant roles such as admin, annotator, and reviewer. Ease-of-use: The platform has an easy-to-use UI. Supported data types: SuperAnnotate supports image, video, text, and audio data. Supported labeling methods: The platform has tools for categorization, segmentation, pose estimation, object tracking, sentiment analysis, and speech recognition. Best for Teams looking for an annotation solution to build generative AI applications. Pricing The platform offers free, pro, and enterprise versions. Cogito Cogito is a data labeling service provider that employs a large pool of human annotators to deliver annotations for generative AI, CV, content moderation, NLP, and data processing. Cogito Key Features Data security: Cogito complies with GDPR, SOC 2, HIPAA, CCPA, and ISO 27001 standards. Supported data types: The platform supports image, video, audio, text, and point-cloud data. Automation: Cogito uses AI-based algorithms to label large data volumes. Best for Startups looking for a company to outsource their AI operations. Pricing Pricing is not publicly available. Labelbox Labelbox offers multiple products for managing AI projects. Its data labeling platform allows you to annotate various data types for building vision and LLM applications.
Labelbox Key Features Data security: Labelbox complies with several regulatory standards, including GDPR, CCPA, SOC 2, and ISO 27001. Collaboration: Users can create projects and invite in-house labeling team members with relevant roles to manage the annotation workflow. Ease-of-use: Labelbox has a user-friendly interface with a customizable labeling editor. Automation: The platform supports model-assisted labeling (MAL) to import AI-based classifications for your data. Integrability: Labelbox integrates with AWS, Azure, and Google Cloud to access data repositories quickly. Best for Teams looking for labeling solutions to build applications for the e-commerce, healthcare, and financial services industries. Pricing Labelbox offers free, starter, and enterprise versions. Still confused about whether to buy a tool or go for open-source solutions? Read some lessons from practitioners regarding build vs. buy decisions. Data Annotation Companies: Key Takeaways CV applications are driving the current industrial landscape by innovating fields like medical imaging, robotics, retail, etc. However, CV’s rapid expansion into these domains calls for robust data annotation tools and services to build high-quality training data. Below are a few key points regarding data annotation companies in 2024. Security is key: With data privacy regulations becoming stricter globally, companies offering annotation solutions must have compliance certifications to ensure data protection. Scalability: Annotation companies should offer scalable tools to handle the ever-increasing data volume and variety. Top annotation companies in 2024: SuperAnnotate, Encord, and Kili are the top 3 companies that provide robust labeling platforms and services.

February 23

8 min

sampleImage_yolov9-sota-machine-learning-object-dection-model
YOLOv9 Explained and How to Run it

What is YOLOv9? YOLOv9, the latest in the YOLO series, is a real-time object detection model. It shows better performance through advanced deep learning techniques and architectural design, including the Generalized ELAN (GELAN) and Programmable Gradient Information (PGI). The YOLO series has long been revolutionizing the world of object detection by introducing groundbreaking concepts in computer vision, like processing entire images in a single pass through a convolutional neural network (CNN). With each iteration, from YOLOv1 to the latest YOLOv9, it has continuously refined and integrated advanced techniques to enhance accuracy, speed, and efficiency, making it the go-to solution for real-time object detection across various domains and scenarios. Let’s go through an overview of YOLOv9 and learn about its new features. YOLOv9 Overview YOLOv9 is the latest iteration in the YOLO (You Only Look Once) series of real-time object detection systems. It builds upon previous versions, incorporating advancements in deep learning techniques and architectural design to achieve superior performance in object detection tasks. Developed by combining the Programmable Gradient Information (PGI) concept with the Generalized ELAN (GELAN) architecture, YOLOv9 represents a significant leap forward in terms of accuracy, speed, and efficiency. Evolution of YOLO The evolution of the YOLO series of real-time object detectors has been characterized by continuous refinement and integration of advanced algorithms to enhance performance and efficiency. Initially, YOLO introduced the concept of processing entire images in a single pass through a convolutional neural network (CNN). Subsequent iterations, including YOLOv2 and YOLOv3, introduced improvements in accuracy and speed by incorporating techniques like batch normalization, anchor boxes, and feature pyramid networks (FPN). These enhancements were further refined in models like YOLOv4 and YOLOv5, which introduced novel techniques such as CSPDarknet and PANet to improve both speed and accuracy. Alongside these advancements, YOLO has also integrated various computing units like CSPNet and ELAN, along with their variants, to enhance computational efficiency. In addition, improved prediction heads like the YOLOv3 head or the FCOS head have been utilized for precise object detection. Despite the emergence of alternative real-time object detectors like RT-DETR, based on the DETR architecture, the YOLO series remains widely adopted due to its versatility and applicability across different domains and scenarios. The latest iteration, YOLOv9, builds upon the foundation of YOLOv7, leveraging the Generalized ELAN (GELAN) architecture and Programmable Gradient Information (PGI) to further enhance its capabilities, solidifying its position as the top real-time object detector of the new generation. The evolution of YOLO demonstrates a continuous commitment to innovation and improvement, resulting in state-of-the-art performance in real-time object detection tasks. YOLOv9 Key Features Object Detection in Real-Time: YOLOv9 maintains the hallmark feature of the YOLO series by providing real-time object detection capabilities. This means it can swiftly process input images or video streams and accurately detect objects within them without compromising on speed. PGI Integration: YOLOv9 incorporates the Programmable Gradient Information (PGI) concept, which facilitates the generation of reliable gradients through an auxiliary reversible branch.
This ensures that deep features retain crucial characteristics necessary for executing target tasks, addressing the issue of information loss during the feedforward process in deep neural networks. GELAN Architecture: YOLOv9 utilizes the Generalized ELAN (GELAN) architecture, which is designed to optimize parameters, computational complexity, accuracy, and inference speed. By allowing users to select appropriate computational blocks for different inference devices, GELAN enhances the flexibility and efficiency of YOLOv9. Improved Performance: Experimental results demonstrate that YOLOv9 achieves top performance in object detection tasks on benchmark datasets like MS COCO. It surpasses existing real-time object detectors in terms of accuracy, speed, and overall performance, making it a state-of-the-art solution for various applications requiring object detection capabilities. Flexibility and Adaptability: YOLOv9 is designed to be adaptable to different scenarios and use cases. Its architecture allows for easy integration into various systems and environments, making it suitable for a wide range of applications, including surveillance, autonomous vehicles, robotics, and more. The paper is authored by Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao and is available on arXiv: YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. Updates on YOLOv9 Architecture Integrating Programmable Gradient Information (PGI) and the GELAN (Generalized Efficient Layer Aggregation Network) architecture into YOLOv9 can enhance its performance in object detection tasks. Here's how these components can be integrated into the YOLOv9 architecture to enhance performance: PGI Integration YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information Main Branch Integration: The main branch of PGI, which represents the primary pathway of the network during inference, can be seamlessly integrated into the YOLOv9 architecture. This integration ensures that the inference process remains efficient without incurring additional computational costs. Auxiliary Reversible Branch: YOLOv9, like many deep neural networks, may encounter information bottleneck issues as the network deepens. The auxiliary reversible branch of PGI can be incorporated to address this problem by providing additional pathways for gradient flow, thereby ensuring more reliable gradients for the loss function. Multi-level Auxiliary Information: YOLOv9 typically employs feature pyramids to detect objects of different sizes. By integrating multi-level auxiliary information from PGI, YOLOv9 can effectively handle error accumulation issues associated with deep supervision, especially in architectures with multiple prediction branches. This integration ensures that the model can learn from auxiliary information at multiple levels, leading to improved object detection performance across different scales. GELAN Architecture YOLOv9: GELAN Architecture Generalized Efficient Layer Aggregation Network, or GELAN, is a novel architecture that combines CSPNet and ELAN principles for gradient path planning. It prioritizes lightweight design, fast inference, and accuracy. GELAN extends ELAN's layer aggregation by allowing any computational blocks, ensuring flexibility. The architecture aims for efficient feature aggregation while maintaining competitive performance in terms of speed and accuracy.
GELAN's overall design integrates CSPNet's cross-stage partial connections and ELAN's efficient layer aggregation for effective gradient propagation and feature aggregation. YOLOv9 Results The performance of YOLOv9, as verified on the MS COCO dataset for object detection tasks, showcases the effectiveness of the integrated GELAN and PGI components: Parameter Utilization YOLOv9 leverages the Generalized ELAN (GELAN) architecture, which exclusively employs conventional convolution operators. Despite this, YOLOv9 achieves superior parameter utilization compared to state-of-the-art methods that rely on depth-wise convolution. This highlights the efficiency and effectiveness of YOLOv9 in optimizing model parameters while maintaining high performance in object detection. Flexibility and Scalability The Programmable Gradient Information (PGI) component integrated into YOLOv9 enhances its versatility. PGI allows YOLOv9 to be adaptable across a wide spectrum of models, ranging from light to large-scale architectures. This flexibility enables YOLOv9 to accommodate various computational requirements and model complexities, making it suitable for diverse deployment scenarios. Information Retention By utilizing PGI, YOLOv9 mitigates information loss at every layer, ensuring the retention of complete information during the training process. This capability is particularly beneficial for train-from-scratch models, as it enables them to achieve superior results compared to models pre-trained using large datasets. YOLOv9's ability to preserve crucial information throughout training contributes to its high accuracy and robust performance in object detection tasks. Comparison of YOLOv9 with SOTA Models The comparison between YOLOv9 and state-of-the-art (SOTA) models reveals significant improvements across various metrics. YOLOv9 outperforms existing methods in parameter utilization, requiring fewer parameters while maintaining or even improving accuracy. YOLOv9 demonstrates superior computational efficiency compared to train-from-scratch methods, models based on depth-wise convolution, and ImageNet-pretrained models. Creating a Custom Dataset for YOLOv9 on Encord for Object Detection With Encord you can either curate and create your custom dataset or use the sandbox datasets already created on the Encord Active platform. Select New Dataset to Upload Data You can name the dataset and add a description to provide information about the dataset. Annotate Custom Dataset Create an annotation project and attach the dataset and the ontology to the project to start annotation with a workflow. You can choose manual annotation if the dataset is simple, small, and doesn’t require a review process. Automated annotation is also available and is very helpful in speeding up the annotation process. For more information on automated annotation, read the blog The Full Guide to Automated Data Annotation. Start Labeling The summary page shows the progress of the annotation project. The information regarding the annotators and the performance of the annotators can be found under the Labels and Performance tabs. Export the Annotation Once the annotation has been reviewed, export the annotation in the required format. For more information on exploring the quality of your custom dataset, read the blog Exploring the Quality of Hugging Face Image Datasets with Encord Active.
Object Detection using YOLOv9 on Custom Dataset You can use the custom dataset curated using Encord Annotate for training an object detection model. For testing YOLOv9, we are going to use an image from one of the sandbox projects on Encord Active. Copy and run the code below to run YOLOv9 for object detection. The code for using YOLOv9 for panoptic segmentation has also been made available on the original GitHub repository.

Installing YOLOv9

```python
!git clone https://github.com/WongKinYiu/yolov9.git
```

Installing YOLOv9 Requirements

```python
!python -m pip install -r yolov9/requirements.txt
```

Download YOLOv9 Model Weights YOLOv9 is available as four models, ordered by parameter count: YOLOv9-S, YOLOv9-M, YOLOv9-C, and YOLOv9-E. Here we will be using YOLOv9-E, but the same process applies to the other models.

```python
from pathlib import Path

weights_dir = Path("/content/weights")
weights_dir.mkdir(exist_ok=True)

!wget 'https://github.com/WongKinYiu/yolov9/releases/download/v0.1/yolov9-e.pt' -O /content/weights/yolov9-e.pt
```

Define the Model

```python
# Make the cloned repository importable, then load the downloaded checkpoint.
p_to_add = "/content/yolov9"

import sys

if p_to_add not in sys.path:
    sys.path.insert(0, p_to_add)

from models.common import DetectMultiBackend

weights = "/content/weights/yolov9-e.pt"
model = DetectMultiBackend(weights)
```

Download Test Image from Custom Dataset

```python
images_dir = Path("/content/images")
images_dir.mkdir(exist_ok=True)

!wget 'https://storage.googleapis.com/encord-active-sandbox-projects/f2140a72-c644-4c31-be66-3ef80b3718e5/a0241c5f-457d-4979-b951-e75f36d0ff2d.jpeg' -O '/content/images/example_1.jpeg'
```

This is the sample image we will be using for testing YOLOv9 for object detection.

Dataloader

```python
from utils.torch_utils import select_device, smart_inference_mode
from utils.dataloaders import IMG_FORMATS, VID_FORMATS, LoadImages, LoadScreenshots, LoadStreams
from utils.general import (LOGGER, Profile, check_file, check_img_size, check_imshow, check_requirements, colorstr, cv2,
                           increment_path, non_max_suppression, print_args, scale_boxes, strip_optimizer, xyxy2xywh)

image_size = (640, 640)
img_path = Path("/content/images/example_1.jpeg")
device = select_device("cpu")
vid_stride = 1

stride, names, pt = model.stride, model.names, model.pt
imgsz = check_img_size(image_size, s=stride)  # check image size

# Dataloader
bs = 1  # batch_size
dataset = LoadImages(img_path, img_size=image_size, stride=stride, auto=pt, vid_stride=vid_stride)
model.warmup(imgsz=(1 if pt or model.triton else bs, 3, *imgsz))  # warmup
```

Run Prediction

```python
import torch

augment = False
visualize = False
conf_threshold = 0.25
nms_iou_thres = 0.45
max_det = 1000

seen, windows, dt = 0, [], (Profile(), Profile(), Profile())
for path, im, im0s, vid_cap, s in dataset:
    with dt[0]:
        im = torch.from_numpy(im).to(model.device).float()
        im /= 255  # 0 - 255 to 0.0 - 1.0
        if len(im.shape) == 3:
            im = im[None]  # expand for batch dim

    # Inference (no gradients needed at test time)
    with dt[1], torch.no_grad():
        pred = model(im, augment=augment, visualize=visualize)[0]

    # NMS
    with dt[2]:
        filtered_pred = non_max_suppression(pred, conf_threshold, nms_iou_thres, None, False, max_det=max_det)

    # Rescale the (x1, y1, x2, y2) boxes from the 640x640 letterboxed input back to the original image size
    if len(filtered_pred[0]):
        filtered_pred[0][:, :4] = scale_boxes(im.shape[2:], filtered_pred[0][:, :4], im0s.shape).round()

    print(pred, filtered_pred)
    break
```

Generate YOLOv9 Prediction on Custom Data

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from PIL import Image

img = Image.open("/content/images/example_1.jpeg")

fig, ax = plt.subplots()
ax.imshow(img)
ax.axis("off")

for p, c in zip(filtered_pred[0], ["r", "b", "g", "cyan"]):
    # Each detection is (x1, y1, x2, y2, confidence, class index)
    x1, y1, x2, y2, score, cls = p.detach().cpu().numpy().tolist()
    ax.add_patch(Rectangle((x1, y1), x2 - x1, y2 - y1, color=c, alpha=0.2))
    ax.text((x1 + x2) / 2, (y1 + y2) / 2, model.names[int(cls)], ha="center", va="center", color=c)

fig.savefig("/content/predictions.jpg")
```

Here is the result! Training YOLOv9 on Custom Dataset The YOLOv9 GitHub repository contains the code to train on both single and multiple GPUs. You can check it out for more information. The YOLOv9 source code is open-sourced in Python and available on GitHub. Stay tuned for the training code for YOLOv9 on a custom dataset and a comparison analysis of YOLOv8 vs YOLOv9! YOLOv9 Key Takeaways Cutting-edge real-time object detection model. Advanced Architectural Design: Incorporates the Generalized ELAN (GELAN) architecture and Programmable Gradient Information (PGI) for enhanced efficiency and accuracy. Unparalleled Speed and Efficiency Compared to SOTA: Achieves top performance in object detection tasks with remarkable speed and efficiency.

February 23

5 min

sampleImage_vision-radiology-apple-vision-pro-application
Apple Vision PRO - Extending Reality to Radiology

After the historic introduction of the iPhone in 2007, Apple has come up with yet another technological advancement: the Apple Vision Pro, which is set to once again revolutionize how we consume digital content. Let’s get into the details of the mixed reality headset and see its potential application in radiology. Introduction to Vision Pro The Apple Vision Pro represents a significant milestone in mixed reality (MR) technologies. Building upon decades of research and development in virtual reality (VR) and augmented reality (AR), the Vision Pro is the culmination of advancements in display technology, processing capabilities, eye-tracking systems, and artificial intelligence (AI). It marks an important moment because it brings us closer to a future where the virtual and real worlds blend together, offering new experiences in areas like education, entertainment, and healthcare. Historical Evolution of Apple Vision Pro Apple has a rich history of introducing groundbreaking technological advancements over the years, starting with the revolutionary iPhone in 2007, followed by the iPad and the development of iOS. The App Store further expanded their ecosystem, offering a vast array of digital content. Their focus on technological advancements continued with the integration of voice commands, advancements in digital content delivery, and innovations in electronic health records (EHR). Additionally, their acquisition of Visage Imaging strengthened their presence in the medical imaging field. Building on this legacy of innovation, Apple now introduces the next leap forward in technology with the Apple Vision Pro. Let's explore its features and how this mixed reality headset is poised to redefine our interaction with digital content in the healthcare landscape and beyond. Features of Apple Vision Pro Display Technology: The Vision Pro incorporates dual micro-OLED displays with over 23 million pixels across the two panels, surpassing the resolution of current VR standards. This high pixel density reduces the screen-door effect and enhances visual fidelity, offering crisp and clear images. Processing and Performance: Engineered with a dual-chip architecture comprising Apple’s M2 chip and a custom R1 chip, the Vision Pro delivers unparalleled processing power and efficiency. Its low-latency design ensures a fluid and responsive MR environment, setting new industry standards. Eye Tracking Technology: Advanced eye-tracking technology integrated into the Vision Pro utilizes high-speed cameras and LEDs to capture and interpret eye movements accurately. This enables intuitive, gaze-based interaction within the MR environment, revolutionizing user experience. AI and Machine Learning Integration: Leveraging AI algorithms, the Vision Pro achieves real-time spatial awareness and environmental mapping, enhancing personalized and adaptive MR experiences. Machine learning models adapt interfaces and interactions based on individual user behaviors, optimizing engagement over time. Spatial Computing and visionOS: Apple uses spatial computing to allow users to interact with extended reality, and it created a dedicated operating system for the headset called visionOS. Spatial computing in visionOS allows users to interact with the Apple Vision Pro using their eyes, hands, and voice, creating intuitive and magical experiences. Apps in visionOS can fill the space around users, scale to the perfect size, react to room lighting, cast shadows, and be moved anywhere.
Apple Vision Pro: User Experience The user experience with the Apple Vision Pro is characterized by its seamless integration of advanced technologies to deliver an immersive, intuitive, and personalized MR journey. Visual Immersion: The Vision Pro's high-resolution micro-OLED displays and reduced latency provide users with unparalleled visual immersion, minimizing distractions and enhancing presence within the virtual environment. Intuitive Interaction: Advanced eye-tracking technology enables natural, gaze-based interaction, reducing reliance on hand controllers and offering more intuitive control mechanisms. This hands-free approach enhances user comfort and engagement. Personalization and Adaptation: Leveraging AI and machine learning, the Vision Pro tailors experiences to individual user preferences and behaviors, creating a highly personalized MR journey. Adaptive interfaces and content delivery optimize engagement and learning outcomes. Interface Design of Vision Pro The interface design of the Apple Vision Pro prioritizes simplicity, intuitiveness, and accessibility to ensure a seamless user experience. Minimalist Interface: The interface design emphasizes simplicity, presenting users with a clean and intuitive layout that minimizes distractions and maximizes focus on the MR content. Gaze-Based Controls: Leveraging advanced eye-tracking technology, the interface incorporates gaze-based controls, allowing users to navigate menus, select options, and interact with objects effortlessly using their gaze. Adaptive Interfaces: Machine learning algorithms adapt interfaces based on user behavior and preferences, customizing the MR experience to optimize engagement and usability for each individual user. Extended Reality (XR) in Radiology Extended Reality (XR) technologies, including virtual reality (VR) and augmented reality (AR), have revolutionized the field of radiology by offering innovative solutions for intervention guidance, medical training, and teaching. These technologies provide radiologists with advanced tools to analyze complex medical images in three-dimensional (3D) formats, offering a deeper understanding of human anatomy and facilitating diagnostic radiology. Spatial Computing Spatial computing, a key component of XR technologies, enables radiologists to interact with virtual images of tissues, organs, vessels, and abnormalities in 3D formats. This immersive experience allows for a comprehensive exploration of medical imaging datasets, providing precise information and enhancing diagnostic accuracy. By transforming imaging datasets into holographic-like virtual images, spatial computing facilitates a better understanding of medical conditions and supports evidence-based planning for medical procedures. Vision Pro Headsets The introduction of Vision Pro headsets could enhance the visualization capabilities of radiologists, offering holographic displays of real-time 3D images of vascular anatomy. These extended reality headsets provide an immersive experience that surpasses traditional 2D imaging tools, allowing radiologists to view the internal structures of the body in three dimensions. This advanced visualization technology would improve the accuracy of physical therapy methods, support simulation-based training for medical procedures, and foster collaboration among medical professionals. 
Virtual Reality Surgical Visualization Virtual reality surgical visualization is a groundbreaking application of XR in radiology, empowering surgeons to enhance the efficiency and precision of surgical procedures. By collaborating with colleagues and visualizing complex 3D images in VR environments, surgeons can develop successful surgical plans for ophthalmology, microsurgeries, and neurosurgery. VR technology enables researchers to analyze and present medical images more effectively than traditional 2D scans, facilitating highly accurate measurements and enhancing patient care. 3D DICOM Image Visualizations DICOM (Digital Imaging and Communications in Medicine) images, commonly used in radiology, provide essential visual data for diagnosing medical conditions. The integration of VR headsets into radiology practices enhances the visualization of DICOM images by combining traditional 2D annotations with immersive 3D formats. The combination of these technologies would help radiologists understand medical images better, enhancing diagnostic capabilities and improving patient care. Multi-Modality Support Modern DICOM image visualization tools offer support for multiple imaging modalities, including X-ray, MRI (Magnetic Resonance Imaging), CT (Computed Tomography), and ultrasound. This multi-modality support allows radiologists to seamlessly integrate data from various imaging techniques, providing a holistic view of the patient's medical condition and improving diagnostic accuracy. 2D and 3D Augmented DICOM Viewer Augmented DICOM viewers combine traditional two-dimensional (2D) image viewing with advanced three-dimensional (3D) visualization capabilities. These viewers enable radiologists to switch between 2D and 3D views seamlessly, allowing for a more detailed analysis of DICOM images. Augmented DICOM viewers also offer interactive features, such as zooming, panning, and rotation, enhancing the radiologist's ability to examine medical images from different perspectives. 4K Rendering 4K rendering technology enhances the quality of DICOM image visualizations by providing ultra-high-definition images with exceptional clarity and detail. This high-resolution rendering allows radiologists to identify subtle anatomical features and abnormalities that may not be visible with lower-resolution imaging techniques. By improving image quality, 4K rendering enhances diagnostic accuracy and facilitates more precise medical interventions. In addition to traditional 4K rendering technology, the utilization of HTJ2K (High Throughput JPEG 2000) further enhances the quality of DICOM image visualizations. HTJ2K is a cutting-edge compression standard specifically designed for high-resolution medical imaging. By efficiently compressing large DICOM image datasets without compromising image quality, HTJ2K enables radiologists to visualize ultra-high-definition images with exceptional clarity and detail. By combining 4K rendering with HTJ2K compression, radiologists can leverage ultra-high-definition DICOM image visualizations to improve diagnostic accuracy and facilitate more precise medical interventions. This integration of advanced rendering and compression technologies represents a significant advancement in medical imaging, ultimately enhancing patient care and outcomes in radiology. 
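If you are experimenting with DICOM data yourself, the open-source pydicom library is a common starting point for inspecting a study before visualizing it. Below is a minimal sketch under stated assumptions: the file path is a placeholder, and decoding compressed pixel data (including HTJ2K transfer syntaxes) generally requires an extra plugin such as pylibjpeg or GDCM.

```python
# A minimal sketch: inspecting and displaying a DICOM slice with pydicom.
# "ct_slice.dcm" is a placeholder path; compressed pixel data (e.g. HTJ2K)
# generally needs an additional decoding plugin such as pylibjpeg or GDCM.
import pydicom
import matplotlib.pyplot as plt

ds = pydicom.dcmread("ct_slice.dcm")

# Basic metadata: modality (CT, MR, US, ...) and how the pixel data is encoded.
print("Modality:", ds.Modality)
print("Transfer syntax:", ds.file_meta.TransferSyntaxUID.name)
print("Rows x Columns:", ds.Rows, "x", ds.Columns)

# Decode the pixel data into a NumPy array and display it in grayscale.
pixels = ds.pixel_array
plt.imshow(pixels, cmap="gray")
plt.title(f"{ds.Modality} slice ({ds.Rows}x{ds.Columns})")
plt.axis("off")
plt.show()
```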
For more information, read Announcing HTJ2K Support for DICOM Files in Encord   Enhanced Diagnostic Accuracy The combination of 3D DICOM image visualizations, augmented DICOM viewers, and 4K rendering technology contributes to enhanced diagnostic accuracy in radiology. Radiologists can visualize medical images with unprecedented detail and clarity, leading to more accurate diagnoses and treatment plans. By leveraging advanced visualization tools, radiologists can improve patient outcomes and provide better quality of care. Use Cases of Mixed Reality in Healthcare Augmented Reality in Surgical Systems Augmented reality (AR) technology is revolutionizing surgical systems by providing real-time visual guidance to surgeons during operations. Traditional surgical visualization methods, such as ultrasound, magnetic resonance imaging (MRI), and computed tomography (CT) scans, have limitations in integrating pre-operative and intra-operative information, leading to a mismatch and potentially extending operation durations. However, AR-based navigation systems address these challenges by overlaying virtual images onto the surgical field, allowing surgeons to visualize 3D models derived from CT and MRI data and providing real-time surgical guidance. Integration of Eye Motion and Gesture Controls One of the key features of AR-based surgical systems is the integration of eye motion and gesture controls. Surgeons can navigate through the augmented reality interface using eye movements and gestures, minimizing the need to switch attention between the surgical scene and auxiliary displays. This intuitive interaction method enhances surgical precision and efficiency by enabling surgeons to access critical information and manipulate virtual images with minimal disruption. Surgeons can maintain focus on the surgical scene while simultaneously navigating through the AR interface, resulting in smoother workflows and improved patient care.  Building Collaborative Healthcare Space Augmented Reality (AR) technology helps surgical teams work together better. Through the overlay of virtual images onto the surgical field, AR-based systems facilitate a shared visualization of anatomical structures and surgical targets among all surgical team members. This shared visualization enhances communication and collaboration, allowing surgical team members to effectively coordinate their actions and make informed decisions in real-time. By providing a common understanding of the surgical procedure, AR technologies make the healthcare team work together smoothly. The collaborative environment facilitated by AR technology not only improves communication within the surgical team but also promotes interdisciplinary collaboration across different specialties. Surgeons, nurses, anesthesiologists, and other healthcare professionals can seamlessly share information and insights. FDA Clearance FDA Approved AI/ML Medical Technologies The adoption of AR technology in surgical systems has gained momentum, with many AR-based navigation systems receiving clearance from the U.S. Food and Drug Administration (FDA). FDA clearance ensures that these systems meet regulatory standards for safety and effectiveness, giving surgeons confidence in utilizing AR technology for surgical procedures. This regulatory clearance underscores the reliability and efficacy of AR-based surgical systems, further driving their adoption and integration into mainstream surgical practices. 
For more information, read the blog The Step-by-Step Guide to Getting Your AI Models Through FDA Approval   Vision Pro in Emergency Medicine The Apple Vision Pro would be instrumental in emergency medicine training by offering immersive and realistic simulations for medical trainees. These simulations provide a safe environment for trainees to practice various clinical scenarios, ranging from routine patient interactions to emergency response situations. Surgery Projections Medical trainees using Vision Pro or other MR headsets can benefit from lifelike simulations of surgical procedures, allowing them to practice and refine their surgical skills in a risk-free environment. The high-resolution displays and spatial audio of the Vision Pro enhance the realism of these simulations, providing trainees with valuable hands-on experience in emergency settings. Surgical Planning The Vision Pro or other MR headset would allow medical trainees to engage in surgical planning exercises, where they can visualize and simulate surgical procedures before performing them on actual patients. This preoperative planning would help trainees develop effective surgical strategies and enhance their understanding of complex surgical techniques. Increased Patient Safety By providing realistic simulations and training scenarios, the Vision Pro could contribute to increased patient safety in emergency settings. Medical trainees who have undergone training with the MR headsets are better prepared to handle real-world emergency environments, reducing the risk of errors and improving overall patient care outcomes. Advancements in Radiology in Oncology Cancer Diagnostic Imaging Advanced computer vision for radiology techniques has improved cancer diagnostic imaging by providing detailed insights into the structure, function, and behavior of tumors. Modalities such as computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), and ultrasound, combined with advanced imaging protocols and image analysis algorithms, allow clinicians to visualize tumors with unprecedented clarity and accuracy. These imaging modalities play an important role in cancer diagnosis by: Early Detection: Advanced imaging techniques enable the detection of small tumors and precancerous lesions that may not be visible with conventional imaging methods, facilitating early intervention and improving patient outcomes. Accurate Staging: By precisely delineating the extent of tumor involvement and assessing the presence of metastases, vision radiology aids in the accurate staging of cancer, guiding treatment decisions and prognosis estimation. Treatment Planning: Detailed imaging provides valuable anatomical and functional information essential for treatment planning, including surgical resection, radiation therapy, and systemic therapies. Imaging-based tumor characterization helps tailor treatment strategies to individual patients, optimizing therapeutic efficacy while minimizing adverse effects. Monitoring Treatment Response: Serial imaging assessments enable the objective evaluation of treatment response, allowing clinicians to adapt therapy based on tumor dynamics and patient-specific factors. Response criteria such as RECIST (Response Evaluation Criteria in Solid Tumors) and PERCIST (PET Response Criteria in Solid Tumors) provide standardized metrics for assessing treatment efficacy and guiding clinical management. 
Cancer Treatment Using Vision Radiology In addition to its diagnostic utility, advanced vision radiology has various applications in guiding cancer treatment across various modalities: Image-Guided Interventions: Minimally invasive procedures such as biopsy, ablation, and radiofrequency thermal ablation rely on real-time imaging guidance to precisely target and treat tumors while preserving surrounding healthy tissues. Techniques such as CT-guided biopsy and MRI-guided focused ultrasound offer unparalleled accuracy in tumor localization and treatment delivery. Radiation Therapy Planning: Vision radiology facilitates precise radiation therapy planning through techniques such as intensity-modulated radiation therapy (IMRT), stereotactic body radiation therapy (SBRT), and proton therapy. Advanced imaging modalities enable the delineation of target volumes and critical structures, allowing for highly conformal radiation dose delivery while minimizing toxicity to adjacent organs. Image-Based Monitoring: Serial imaging assessments during and after treatment enable the longitudinal monitoring of treatment response, disease progression, and the emergence of treatment-related complications. Functional imaging techniques such as diffusion-weighted MRI and dynamic contrast-enhanced MRI offer insights into tumor microenvironment changes, vascular perfusion, and treatment-induced effects, facilitating early detection of treatment response or recurrence. With the introduction of the Vision Pro headset, research and development in vision radiology is going to be fast-forwarded in the field of oncology. By providing detailed anatomical, functional, and molecular information, vision radiology enables precise cancer diagnosis, treatment planning, and monitoring. Pediatric Considerations in Vision Radiology When applying vision radiology techniques, special considerations must be taken into account for pediatric patients due to their unique physiological and developmental characteristics. The use of the Apple Vision Pro in pediatric radiology requires a tailored approach to ensure safe and effective imaging procedures for children. Minimizing Radiation Exposure: Pediatric patients are more sensitive to radiation than adults, making it crucial to minimize radiation exposure during imaging procedures. The Apple Vision Pro would enable the use of low-dose radiation protocols and advanced imaging algorithms to reduce the amount of radiation required for pediatric scans while maintaining diagnostic quality. Patient Comfort: Children may experience anxiety or discomfort during imaging procedures, leading to motion artifacts or suboptimal image quality. The immersive and engaging nature of the Apple Vision Pro can help reduce anxiety in pediatric patients by providing a distraction during imaging exams.  Size-Adapted Imaging Protocols: Pediatric patients have smaller body sizes and anatomical structures compared to adults, necessitating size-adapted imaging protocols. Vision radiology could offer customizable imaging parameters and protocols specifically designed for pediatric patients, ensuring optimal image quality and diagnostic accuracy while minimizing radiation exposure. Apple Vision Pro for Tele-Radiology The Apple Vision Pro could be instrumental for teleradiology and not just radiology. Radiology involves the interpretation of medical imaging studies in a clinical setting, whereas teleradiology enables remote interpretation and consultation using digital technology. 
With the Apple Vision Pro in medical use, radiologists and other healthcare providers could access medical imaging studies from anywhere, allowing for timely interpretation and diagnosis without the need for physical access to imaging equipment. Challenges in Computer Vision Radiology Implementing computer vision radiology presents various challenges that need to be addressed to ensure its effective and ethical use in healthcare settings. Ethical Considerations One of the primary challenges in computer vision radiology is navigating the ethical implications of using artificial intelligence (AI) algorithms to interpret medical images. Ethical considerations include ensuring patient autonomy, informed consent, and transparency in the use of AI algorithms in radiological diagnosis. Privacy Concerns Computer vision radiology systems rely on vast amounts of patient data for training AI algorithms and interpreting medical images. Privacy concerns arise regarding the collection, storage, and sharing of sensitive patient information, highlighting the importance of robust data protection measures and compliance with privacy regulations, such as HIPAA (Health Insurance Portability and Accountability Act). Data Security Data security is a critical challenge in computer vision radiology, as medical imaging data is highly sensitive and must be protected from unauthorized access, tampering, or breaches. Ensuring data security involves implementing encryption protocols, access controls, and secure data storage solutions to safeguard patient information and maintain confidentiality. Addressing these challenges requires a multidisciplinary approach involving radiologists, data scientists, ethicists, and cybersecurity experts to develop ethical guidelines, privacy policies, and data security measures that uphold patient rights and ensure the responsible use of computer vision technology in radiology. Future Trends in Computer Vision for Radiology Artificial Intelligence (AI) Integration: AI-powered computer vision algorithms are increasingly being integrated into radiology workflows to automate image interpretation, assist in diagnosis, and improve workflow efficiency. Future advancements in AI algorithms will enable more accurate and personalized diagnostic insights, leading to enhanced patient care outcomes. 3D Imaging and Reconstruction: The adoption of advanced computer vision techniques for 3D imaging and reconstruction is revolutionizing radiological visualization. Future trends in this area include the development of real-time 3D reconstruction algorithms, volumetric rendering techniques, and virtual reality (VR) visualization tools, enabling immersive and interactive exploration of medical imaging data. Multi-Modal Fusion: Future advancements in computer vision for radiology will involve the fusion of multiple imaging modalities, such as MRI, CT, PET, and ultrasound, to provide comprehensive and complementary information for diagnosis and treatment planning. Explainable AI (XAI): As AI algorithms become increasingly prevalent in radiology, there is a growing need for explainable AI (XAI) techniques that can provide transparent and interpretable insights into algorithmic decision-making. Future trends in XAI for radiology will focus on developing interpretable AI models that can elucidate the underlying rationale behind diagnostic predictions, enhancing trust and acceptance among radiologists and clinicians. 
Augmented Reality (AR) and Mixed Reality (MR): The integration of AR and MR technologies with computer vision in radiology holds immense potential for enhancing surgical planning, interventional procedures, and medical education. Future trends in this area will focus on developing affordable AR/MR-based visualization tools, surgical navigation systems, and immersive educational platforms that leverage computer vision to provide real-time guidance and enhance the surgical and educational experience. Vision Radiology: Key Takeaways Introduction to Apple’s Vision Pro: The Apple Vision Pro merges virtual and real worlds for transformative experiences in healthcare. Advanced Features of Vision Pro: Equipped with dual micro-OLED displays, powerful processing, eye-tracking, and AI integration, the Vision Pro delivers unparalleled experiences. Extended Reality (XR) in Radiology: XR technologies like VR and AR are transforming radiology, enhancing diagnostics and patient care. Use Cases of Mixed Reality in Healthcare: AR and MR support surgery, oncology, pediatrics, medical education, and emergency medicine training. Challenges in Computer Vision Radiology: Implementing computer vision presents challenges such as ethical considerations, privacy concerns, and data security. Future Trends: Anticipated developments include AI integration, 3D imaging, multi-modal fusion, explainable AI, and AR/MR integration for enhanced medical applications.

February 22

8 min


Get Your Models Into Production Faster
Encord is transforming how businesses are getting their computer vision models into production. We can do the same for you. Talk to us to find out how.

How to Analyze Failure Modes of Object Detection Models for Debugging

Developing models that consistently perform well across various real-world scenarios is a formidable challenge in computer vision. These models often encounter errors and inaccuracies that significantly impact their effectiveness and reliability. Identifying and debugging these errors is a complex, time-consuming process that requires a nuanced understanding of the model's behavior. Computer vision engineers and researchers grapple with errors that degrade performance, facing a labor-intensive debugging process that demands a deep dive into model behavior. The stakes are high: inefficiencies in this process can impede applications critical to safety and decision-making. The inability to efficiently debug and analyze model errors can lead to suboptimal model performance, impacting critical applications such as autonomous driving, security surveillance, and medical diagnostics. This not only diminishes the value of the technology but can also have serious implications for safety and the decision-making processes based on these models. Encord Active is a debugging toolkit designed to solve these challenges. It allows a more focused and effective approach to model evaluation and debugging in the computer vision domain. It gives insights into model behavior and makes finding and fixing errors easier through an intuitive and comprehensive set of features. In this article, you will learn how to use Encord Active to automatically identify and analyze the failure modes of computer vision models. By incorporating Encord Active into your model development process, you gain a more efficient debugging workflow that can help you build robust computer vision models capable of performing well in diverse and challenging real-world scenarios. Why are Traditional Model Metrics no Longer Enough? One critical concern we see with data and ML teams is figuring out how to detect where their model is struggling and how to fix the failure patterns to shore up performance problems. They often train a model and compute standard metrics (such as recall, precision, F1-score, and accuracy). However, these metrics alone cannot detect edge cases or test the model’s robustness for real-world applications. Taking a more data-centric approach to debugging model performance has enabled many computer vision teams to fix errors and deploy robust models. Let’s see how to find and fix model errors in a data-centric way using Encord Active. Analyzing Model Errors with Encord Active Encord Active helps you understand and improve your data, labels, and models at all stages of your computer vision journey. Beyond finding and fixing label errors through data exploration, it lets you visualize the important performance metrics for your computer vision model. Encord Active is available in two versions: Encord Active Cloud and Encord Active OS. Active Cloud and Encord Annotate are natively integrated, with Encord hosting both services. Encord Active OS is an open-source toolkit that you can install on a local computer or server. This walkthrough uses Encord Annotate to create a project and import the dataset. We use Encord Active Cloud to analyze the model’s failure modes. We recommend you sign up for an Encord account to follow this guide. Here are the steps we will practically walk through to analyze the model errors in this guide. 
Step 1: Import your Project and Predictions Step 2: Evaluate Model Performance Step 3: Error Categorization Step 4: Quantitative Analysis Step 5: Data Distribution Analysis Step 1: Import your Project and Predictions The dataset we use for this guide is COCO 2017. A good place to start using Active is Importing an Annotate Project. From Annotate, you can configure the dataset and ontology for your Project. Once that is done, move to Active. We have pre-loaded the dataset into Annotate for this walkthrough and imported the Project to Active. Select your project to head to the Explorer. Great! The next step is to import your model predictions into Encord Active to analyze model errors. It will automatically provide visualizations to evaluate your model, detect label errors, and provide valuable insights to improve the overall performance of your model. You must import Predictions into Active before using the Predictions feature on the Explorer and Model Evaluation pages. Learn how to do it in this documentation. We have trained a Mask R-CNN model for this guide and imported the Predictions file (JSON format). Let’s see how to use the Model Evaluation page to navigate the model's performance and analyze the errors. A pro tip here: use your object detection model to make predictions on a validation subset of the dataset. Ensure this subset is diverse and covers various categories, object sizes, and scenes to get a comprehensive view of the model's performance. Step 2: Evaluate Model Performance The next step is to evaluate the model's performance using standard object detection metrics such as mean Average Precision (mAP) and Intersection over Union (IoU). These metrics will help us identify overall performance issues but not specific errors. On the Explorer, navigate to the Model Evaluation tab → Model Performance → Performance to see the standard metrics: [Optional] Interpreting the Standard Metrics See those metrics and model results? Let’s interpret them: A mean Average Precision (mAP) of 0.5927 (or 59.27%) at an IoU threshold of 0.5 indicates the model has moderate accuracy in detecting objects on this dataset. Although the model is relatively effective at identifying objects, it has room for improvement, especially in achieving higher precision and recall across all classes. We’ll look at the precision and recall scores in the next section. A mean Average Recall (mAR) of 0.7347 (or 73.47%) at the same IoU threshold of 0.5 measures how large a proportion of all relevant instances the model detects across all classes. A higher mAR suggests the model is quite good at identifying most of the objects it should, but there may still be missed detections (false negatives) affecting its overall recall. This should prompt you to revisit the goal of your real-world application: is it acceptable to miss some detections as long as most objects are identified? Remember that mAP says the model is only moderately effective at identifying objects. This is reflected in the F1 score. An F1 score of 53.62% indicates that, despite the model's relatively strong recall (as suggested by mAR), the precision of its detections is lower, which drags the harmonic mean of the two down. The model might be generating too many false positives. We’ll confirm that in the error categorization section. 
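If you want to sanity-check these standard metrics outside the platform, you can compute mAP and mAR directly from your predictions and ground truth. Below is a minimal sketch using the open-source torchmetrics library; the boxes, scores, and labels are toy placeholders standing in for your Mask R-CNN outputs and COCO annotations.

```python
# A minimal sketch: computing detection mAP/mAR with torchmetrics (toy data).
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# One prediction dict and one ground-truth dict per image (default xyxy box format).
preds = [{
    "boxes": torch.tensor([[50.0, 40.0, 200.0, 180.0]]),
    "scores": torch.tensor([0.87]),
    "labels": torch.tensor([1]),   # e.g. the class id for "person"
}]
targets = [{
    "boxes": torch.tensor([[55.0, 45.0, 195.0, 175.0]]),
    "labels": torch.tensor([1]),
}]

# Restrict to IoU=0.5 to mirror the threshold discussed above.
metric = MeanAveragePrecision(iou_thresholds=[0.5])
metric.update(preds, targets)
results = metric.compute()
print("mAP:", results["map"].item())
print("mAR@100:", results["mar_100"].item())
```

Because only a single IoU threshold is supplied, the reported map value is the mAP at IoU 0.5, the same definition used in the interpretation above.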
Potential Fix: Improving the model could involve reducing false positives through better background discrimination, refining anchor box sizes, or incorporating more varied examples in the training data to improve precision without significantly sacrificing recall. Step 3: Error Categorization Based on your observations, classify errors into categories such as: False Positive and Negative Counts Still on the Model Performance tab, notice that the False Positive Count for this model is 38,595 predictions. No wonder the precision was that low! This implies that our model may have learned incorrect features as indicative of certain classes. This could be due to noise in the training data or an imbalance leading the model to err on the side of making detections too readily. Under Model Outcome Count, let’s see the classes in which the model incurs the most false positive predictions: Encord Active suggests that the model incurs the most false positives with the “person” class, but this is expected given the proportion of “person” predictions compared to other classes. So, if you expect your model to perform better on a certain class, you might have to zoom in on that class and inspect it. Under Model Performance → Click False Positive Count → Predictions tab → Add filter → Choose the Class and Prediction IOU (adjust to 0.5) filters: Great! Manually inspect the predictions alongside their ground truth annotations. Take note of the incorrect classifications (wrong annotations). Click Select Class under the Filter tab to choose a class to inspect specifically. Use the quality metrics to filter the dataset further: Under Prediction → Choose a quality metric to filter the predictions with: The goal here is to look for error patterns. Are certain object sizes (Absolute Area), classes, or scenarios (e.g., occlusions, varying lighting conditions) more prone to errors? What about duplicates? Does the model detect the same object more than once in an image? This analysis can help pinpoint specific weaknesses in the model. When you are done, repeat the same process with the False Negative Count metric to identify objects that are present in the ground truth but that the model failed to detect. Filter by the Missing Objects metric. Potential Fix: Addressing the false positives may require cleaning the dataset to remove inaccuracies, using regularization techniques to prevent overfitting, or adjusting the detection threshold to make the model more conservative in its predictions. Localization Errors (Correct class but incorrect bounding box position or size) Here, you are looking for predictions where the model correctly identifies the presence of an object but inaccurately predicts its bounding box. This can mean the box is too large, too small, poorly positioned, or improperly shaped relative to the object; a simple programmatic check for this is sketched after the list below. There are metrics to check for localization errors within EA: Border Proximity: Rank annotations by how close they are to image borders. Broken Object Tracks: Identify broken object tracks based on object overlaps. Inconsistent Object Classification: Looks for overlapping objects with different classes (across frames). Label Duplicates: Rank labels by how likely they are to represent the same object. Object Classification Quality: Compare object annotations against similar image crops. Polygon Shape Anomaly: Calculate potential outliers by polygon shape. 
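Before moving on, here is a small, self-contained sketch of how the categories above could be flagged programmatically: a prediction whose class matches a ground-truth box but whose IoU falls below the threshold is a candidate localization error. The boxes, labels, and threshold are illustrative placeholders, not Encord Active's internal logic.

```python
# A minimal sketch: categorizing a prediction against a ground-truth box by IoU.
# Boxes use (x1, y1, x2, y2) format; values are illustrative placeholders.

def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def categorize(pred, gt, iou_thresh=0.5):
    """Return a coarse error category for one prediction vs. one ground-truth box."""
    overlap = iou(pred["box"], gt["box"])
    if pred["label"] != gt["label"]:
        return "classification_error" if overlap >= iou_thresh else "false_positive"
    if overlap >= iou_thresh:
        return "true_positive"
    return "localization_error" if overlap > 0 else "false_positive"

pred = {"box": [50, 40, 230, 210], "label": "person"}
gt = {"box": [55, 45, 195, 175], "label": "person"}
print(categorize(pred, gt))  # prints the category for this pair
```

A full evaluation loop would match each prediction to at most one unmatched ground-truth box per image (greedily, by descending confidence) before applying this categorization.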
Let’s use two of these metrics: Under Prediction → Choose Inconsistent Object Classification and Track → Sort by Descending Order → Expand one of the instances to inspect the error: Such errors suggest issues with how the model perceives object boundaries. This might be due to inadequate feature extraction that fails to capture the precise contours of objects, or a mismatch between the scale of objects in the training data and those in the validation or real-world scenarios. Potential Fix: Improving localization might involve adjusting the anchor box configurations to better match the typical object sizes and shapes in the COCO dataset or improving the spatial resolution at which the model processes images. Also, check for: Classification Errors: Errors that arise when the model detects an object but assigns it an incorrect label. Confusion with Similar Classes: Errors observed when the model confuses objects between similar classes, mislabeling them in the process. Step 4: Quantitative Analysis If you need more help identifying which classes are most problematic, use error analysis features such as the metric impact on the model’s performance, the precision and recall by metric charts, and precision-recall curves for each class within EA. First off, the metric impact on model performance. Visualizing the Impact of Metrics on Model Performance Encord Active applies quality metrics to your data, labels, and model predictions to assess their quality and rank them accordingly, and it uses these metrics to analyze and decompose your data, labels, and predictions. It ships with pre-built quality metrics, but you can also define custom quality metrics for indexing your data, labels, and predictions. Navigate to the Model Evaluation tab → Metric Performance: The longest orange bar indicates that the confidence metric impacts the model's performance the most. This means the confidence scores the model assigns to its predictions are strongly correlated with the precision of those predictions. Zoom into a specific metric (say, height) you want to assess using the Precision By Metric chart. Under Metric → Label/Height and Label/Width: The visualization shows a drop in precision once the model starts to predict instances with heights and widths greater than 500 pixels. Toggle on the Show Model Outcome Count option for a more nuanced view with error categorizations (True Positive Count, False Negative Count, and False Positive Count). Potential Fix: To improve the model, consider augmenting the dataset with more examples of objects at the widths where precision is low. Alternatively, you might adjust the model architecture or training process to better handle objects of these sizes, such as using different anchor box sizes in the case of object detection models like YOLO or SSD. Switch to Recall to analyze the average recall of the model by the quality metrics. Precision and Outcome Count Chart Under the Model Evaluation tab → Toggle back to the Model Performance tab → Precision. We looked at the mean Average Precision (mAP) earlier, but here you can inspect the average precision per class and the model outcome count: There is a notable variance in precision across classes, which indicates that the model's performance is not uniform. This is likely due, in part, to class imbalance in the training data (some classes have far fewer examples than others) and the varying difficulty of distinguishing certain classes. 
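To reproduce this kind of per-class breakdown outside the UI, you only need the matched outcomes of your predictions. The sketch below assumes you have already flagged each prediction as a true or false positive (for example, with IoU matching as shown earlier) and counted ground-truth objects per class; the numbers are illustrative.

```python
# A minimal sketch: per-class precision and recall from matched prediction outcomes.
from collections import defaultdict

outcomes = [  # (class_name, is_true_positive) for each prediction (toy data)
    ("person", True), ("person", True), ("person", False),
    ("scissors", True), ("scissors", False),
]
gt_counts = {"person": 3, "scissors": 4}  # ground-truth objects per class (toy data)

tp, fp = defaultdict(int), defaultdict(int)
for cls, is_tp in outcomes:
    (tp if is_tp else fp)[cls] += 1

for cls in gt_counts:
    precision = tp[cls] / max(tp[cls] + fp[cls], 1)
    recall = tp[cls] / max(gt_counts[cls], 1)
    print(f"{cls}: precision={precision:.2f}, recall={recall:.2f}")
```

Classes with high precision but low recall are being detected too conservatively, while the opposite pattern points to over-prediction; both show up clearly in the per-class charts above.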
Check the same analysis under the Recall tab: The blue line is close to 1 for many classes, indicating that the recall across most classes appears quite high. This suggests that the model is generally effective at detecting instances of these classes when they are present in the data. The blue line dipping near the bottom of the chart indicates that "scissors," "toaster," and "hairdryer" are some of the classes with a very low recall (close to 0). This means the model is missing most instances of these classes, a significant concern for model performance. Precision-Recall Curves Precision-recall curves for each class reveal the trade-off between precision and recall at various IOU threshold settings. Analyzing these curves class by class can highlight which classes are harder for the model to detect with high confidence. Under Model Performance tab → Precision-Recall Curve → Use the slide to set the IOU Threshold value: You can also analyze the precision-recall trade-off for all the classes. Toggle the Decompose by Class option on and adjust the slider to balance between leniency and strictness on spatial accuracy: If there are specific classes you want to analyze, you can select them under the Classes dropdown box → Toggle on Decompose by Class: Use the insights from these curves to guide how you optimize the class-specific detection thresholds. This way, you can improve overall model performance by balancing precision and recall according to the specific requirements of each class. Step 5: Data Distribution Analysis You can also analyze the distribution of classes, object sizes, and scenarios in your training data. Imbalances or insufficient representation of certain cases can lead to higher error rates for those categories. Navigate to the Explorer tab → Data view → Filter → Add filter → Class: Here are other analyses you could carry out here: Size Inclusion in Training: Ensuring the training data includes a variety of object sizes and implementing multi-scale training techniques can improve detection across different object sizes. Variety in Object Conditions: Objects can appear in various conditions, such as different lighting, occlusions, and backgrounds. A lack of diversity in these scenarios can lead to a model that performs well only under the specific conditions it was trained on. Understanding Context: The context in which objects appear can provide additional cues for accurate detection. For example, certain objects are more likely to be found indoors versus outdoors, and some objects are typically found in proximity to others. A holistic data distribution analysis can uncover biases and gaps in the training dataset that may lead to certain prediction errors your model makes. Next Steps: Troubleshooting the Errors Now that you have diagnosed the model issues with the potential fixes we discussed earlier, what do you do next? Here are some steps: Focus on Problem Areas: Once you’ve identified error patterns, focus on these areas for further analysis. For example, if small objects are often missed, inspect how such objects are represented in the dataset and whether the model’s architecture can detect small features. Formulate Hypotheses: Based on your analysis, formulate hypotheses on why certain errors occur. This could involve feature representation, anchor box sizes, or training data quality. 
Experiment: Make targeted changes based on your hypotheses, such as adjusting the model architecture, re-balancing the training data, re-training the model on the data segment it performs poorly on, or using specific data augmentation techniques. Validate Changes: Re-evaluate the model on the validation subset to see if the changes have reduced the specific errors you identified. Iterative Improvement: Finding and fixing errors in object detection models is an iterative process. Repeating the steps of error identification, hypothesis testing, and validation can improve model performance. Debugging with Encord Active Encord Active (EA) is a data-centric debugging platform that offers a wide range of features to assist computer vision teams with efficiently debugging their models. Here are some key capabilities of EA: Interactive Debugging: Encord Active provides visualization tools, an explorer, and model-assisted features for interactive debugging where you can explore model predictions step by step. You can inspect the model's errors, see how it performs on data segments, and use quality metrics to assess how well the models generalize at the class level. Explainability: EA computes metric performance and importance for model predictions, so you can get insights into why the model made a specific decision Anomaly Detection: Encord Active can automatically detect anomalies and outliers in your data, helping you identify data quality issues that may affect your model's performance. Collaboration: Encord Active supports collaboration among team members, enabling multiple stakeholders to collaborate on identifying model errors and getting insights to improve machine learning models. Integration with Encord Annotate: EA natively integrates with Encord Annotate, the computer vision data annotation tool many computer vision teams use. Easily sync your project to Annotate and directly edit images within Annotate without moving across platforms or tools. Key Takeaways: In this guide, we identified and analyzed the failure modes of an object detection model based on its prediction errors. We trained a Mask R-CNN model on the COCO 2017 dataset and validated it on a subset of that dataset. We went through five analysis steps and potential fixes to the model errors. Debugging in machine learning is an intricate process, much like detective work. It's critical for making your models work correctly and for various strategic and technical reasons.  Here's a recap of the error analysis steps we discussed in this guide: Step 1: Import your Project and Predictions Step 2: Evaluate Model Performance Step 3: Error Categorization Step 4: Quantitative Analysis Step 5: Data Distribution Analysis

February 19

10 min

How to Use Semantic Search to Curate Images of Products with Encord Active

Finding contextually relevant images in large datasets remains a substantial challenge in computer vision. Traditional search methodologies often struggle to grasp the semantic nuances of user queries because they rely on image metadata. This usually leads to inefficient searches and inaccurate results. If you are working with visual datasets, traditional search methods cause significant workflow inefficiencies and potential data oversight, particularly in critical fields such as healthcare and autonomous vehicle development. How do we move beyond searching by metadata? Enter semantic search! It uses a deep understanding of your search intent and contextual meaning to deliver accurate and semantically relevant results. So, how do you implement semantic search? Encord Active is a platform for running semantic search on your dataset. It uses OpenAI’s CLIP (Contrastive Language–Image Pre-Training) under the hood to understand and match the user's intent with contextually relevant images. This approach allows for a more nuanced, intent-driven search process. This guide will teach you how to implement semantic search with Encord Active to curate datasets for upstream or downstream tasks. More specifically, you will curate datasets for annotators to label products for an online fashion retailer. At the end of this guide, you should be able to: Perform semantic search on your images within Encord. Curate datasets to send over for annotation or downstream cleaning. See an increase in search efficiency, reduced manual effort, and more intuitive interaction with your data. What are Semantic Image Search and Embeddings? When building AI applications, semantic search and embeddings are pivotal. Semantic search uses a deep understanding of user intent and the contextual meanings of queries to deliver search results that are accurate and semantically relevant to the user's needs. Semantic search bridges the gap between the complex, nuanced language of queries and the visual content of images. It interprets queries not as mere strings of keywords but as expressions of concepts and intentions. This approach allows for a more natural and intuitive interaction with image databases. This way, you can find images that closely match your search intent, even when the explicit keywords are not in the image metadata. Image Embeddings to Improve Model Performance Embeddings transform data (whether text, images, or sounds) into a high-dimensional numerical vector, capturing the essence of the data's meaning. For instance, text embeddings convert sentences into vectors that reflect their semantic content, enabling machines to process and understand them. OpenAI's Contrastive Language–Image Pre-Training (CLIP) is good at understanding and interpreting images in the context of natural language descriptions, and you can use it with just a few lines of code. CLIP is a powerful backbone for semantic search, but teams running large-scale image search systems regularly tell us how resource-intensive it is to run CLIP on their own platform or servers. Check out our workshop on building a semantic visual search engine with ChatGPT & CLIP. Encord Active runs CLIP under the hood to help you perform semantic search at scale. Semantic Search with Encord Active Encord Active uses its built-in CLIP to index your images when integrating them from Annotate. This indexing process involves analyzing the images and textual data to create a searchable representation that aligns images with potential textual queries. 
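Encord Active handles this indexing for you, but the underlying idea is easy to see in isolation. The minimal sketch below, which assumes the Hugging Face transformers and Pillow packages, the publicly available openai/clip-vit-base-patch32 checkpoint, and placeholder image paths, embeds a text query and a few images into the same space and ranks the images by cosine similarity.

```python
# A minimal sketch of CLIP-based semantic image search (paths and query are placeholders).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["product_001.jpg", "product_002.jpg", "product_003.jpg"]  # placeholders
images = [Image.open(path) for path in image_paths]
query = "white sneakers"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Normalize and rank images by cosine similarity to the query, highest first.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(-1)
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```

At scale, the image embeddings are computed once and stored in an index, so each new query only needs a single text encoding and a nearest-neighbor lookup, which is what the indexing step described above makes possible.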
You get this and an in–depth analysis of your data quality on an end-to-end data-centric AI platform. You can perform semantic search with Encord Active in two ways: Searching your images with natural language (text-based queries). Searching your images using a reference or anchor image. The walkthrough below uses Encord Annotate to create a project and import the dataset. We use Encord Active to search data with natural language. We recommend you sign up for an Encord account to follow this guide.   Import your Project A good place to start using Active is Importing an Annotate Project. From Annotate you can configure the Dataset, Ontology, and workflow for your Project. Once you complete that, move to Active. Explore your Dataset Before using semantic search in Encord Active, you must thoroughly understand and explore your dataset. Start by taking an inventory of your dataset. Know the volume, variety, and visual content of your images. Understanding the scope and diversity of your dataset is fundamental to effectively using semantic search. One of the most exciting features of Encord Active is that it intuitively shows you all the stats you need to explore the data. Within the Explorer, you can already see the total number of images, probable issues within your dataset, and metrics (for example, “Area”) to explore the images: Play around with the Explorer features to inspect the dataset and understand the visual quality. You can also examine any available metadata associated with your images. Metadata can provide valuable insights into the content, source, and characteristics of your images, which can be instrumental in refining your search. You can also explore your image embeddings by switching from Grid View to Embeddings View in the Explorer tab: It’s time to search smarter 🔎🧠. Search your Images with Natural Language You’ve explored the dataset; you are now ready to curate your fashion products for the listings with semantic searches using natural language queries. For example, if you want images for a marketing campaign themed around "innovation and technology," enter this phrase into the search bar. Encord Active then processes this query, considering the conceptual and contextual meanings behind the terms, and returns relevant images. Within the Explorer, locate the “Natural Search” bar.  There are two ways to perform semantic search within Encord: through natural language or a reference image. We’ll use natural language for the rest of this guide.   Say our online platform does not have enough product listings for white sneakers. What can we do? In this case, enter “white sneakers” in the search bar: Awesome! Now you see most images of models in white sneakers and a few that aren’t. Although the total images in our set are 2,701, imagine if the product listings were an order of magnitude larger. This would be consequential. Before curating the relevant images, let’s see how you can search by similar images to fine-tune your search even further. Search your Images with Similar Images Select one of the images and click the expansion button. Click the “Find similar” button to use the anchor image for the semantic similarity search: Sweet! You just found another way to tailor your search. Next, click the “x” (or cancel) icon to return to the natural language search explorer. It’s time to curate the product images. Curate Dataset in Collections Within the natural language search explorer, select all the visible images of models with white sneakers. 
One way to quickly comb through the data and ensure you don’t include false positives in your collections is to use the Select Visible option. This option only selects images you can see, and you can then deselect images of models in non-white sneakers. In this case, we will quickly select a few accurate options we can see: In the top right-hand corner, click Add to a Collection, and if you are not there already, navigate to New Collection. Name the Collection and add an appropriate and descriptive comment in the Collection Description box. Click Submit when you are done. Once you complete curating the images, you can take several actions: Create a new dataset slice. Send the collection to Encord Annotate for labeling. Bulk classify the images for pre-labeling. Export the dataset for downstream tasks. Here, we will send the collection to Annotate to label the product listings. We named it “White Sneakers” and included comments to describe the collection. Send Dataset to Encord Annotate Navigate to the Collections tab and find the new collection (in our case, “White Sneakers”). Hover over the new collection and click the menu on the right to show the list of next action options. Click Send to Annotate and leave a descriptive comment for your annotator: Export Dataset for Downstream Tasks Say you are building computer vision applications for an online retailer or want to develop an image search engine. You can export the dataset metadata for those tasks if you need to use the data downstream for preprocessing or training. Back in your new collection, click the menu and find the Generate CSV option to export the dataset metadata in your collection: Nice! Now you have a CSV with your dataset metadata and direct links to the images on Encord so that you can access them securely for downstream tasks. Next Steps: Labeling Your Products on Annotate After successfully using Encord Active to curate your image datasets through semantic search, the next critical step involves accurately labeling the data. Encord Annotate is a robust and intuitive platform for annotating images with high precision. This section explores how transitioning from dataset curation to dataset labeling within the Encord ecosystem can significantly improve the value and utility of the “White Sneakers” product listing. Encord Annotate supports various annotation types, including bounding boxes, polygons, segmentation masks, and keypoints for detailed representation of objects and features within images. Features of Encord Annotate Integrated Workflow: Encord Annotate integrates natively with Encord Active. This means a simple transition between searching, curating, and labeling. Collaborative Annotation: The platform supports collaborative annotation efforts that enable teams to work together efficiently, assign tasks, and manage project progress. Customizable Labeling Tools: You can customize the annotation features and labels to fit the specific needs of your projects, ensuring that the data is labeled in the most relevant and informative way. Quality Assurance Mechanisms: Built-in quality assurance features, such as annotation review and validation processes, ensure the accuracy and consistency of the labeled data. AI-Assisted Labeling: Encord Annotate uses AI models like the Segment Anything Model (SAM) and LLaVA to suggest annotations, significantly speeding up the labeling process while maintaining high levels of precision. 
Getting Started with Encord Annotate To begin labeling the curated dataset with Encord Annotate, ensure it is properly organized and appears in Annotate.  New to Annotate? Sign up on this page to get started. From there, accessing Encord Annotate is straightforward: Navigate to the Encord Annotate section within the platform. Select the dataset you wish to annotate. Choose or customize the annotation tools and labels to your project requirements. Start the annotation process individually or by assigning tasks to team members. Key Takeaways: Semantic Search with Encord Active Phew! Glad you were able to join me and stick through the entire guide. Let’s recap what we have learned in this article: Encord Active uses CLIP for Image Retrieval: Encord Active runs CLIP under the hood to enable semantic search. This approach allows for more intuitive, accurate, and contextually relevant image retrieval than traditional keyword- or metadata-based searches. Natural Language Improves Image Exploration Experience: Searching by natural language within Encord Active makes searching for images as simple as describing what you're looking for in your own words. This feature significantly improves the search experience, making it accessible and efficient across various domains. Native Transition from Curation to Annotation: Encord Active and Encord Annotate provide an integrated workflow for curating and labeling datasets. Use this integration to curate data for upstream annotation tasks or downstream tasks (data cleaning or training computer vision models).

February 16

10 min

Meta’s V-JEPA: Video Joint Embedding Predictive Architecture Explained

Following the launch of I-JEPA last year, Meta has now rolled out V-JEPA as it accelerates efforts to realize Yann LeCun’s vision for Advanced Machine Intelligence. Yann LeCun, Vice President & Chief AI Scientist at Meta, asserts that "V-JEPA is a step toward a more grounded understanding of the world so machines can achieve more generalized reasoning and planning." This statement reiterates the broader goal of advancing machine intelligence to emulate human learning processes, where internal models of the world are constructed to facilitate learning, adaptation, and efficient planning in complex tasks. What is V-JEPA? V-JEPA is a vision model that is trained exclusively with a feature prediction objective. In contrast to conventional machine learning methods that rely on pre-trained image encoders, text, or human annotations, V-JEPA learns directly from video data without the need for external supervision. Key Features of V-JEPA Self-supervised Learning V-JEPA employs self-supervised learning techniques, enhancing its adaptability and versatility across various tasks without requiring labeled data during the training phase. Feature Prediction Objective Instead of reconstructing images or relying on pixel-level predictions, V-JEPA prioritizes video feature prediction. This approach leads to more efficient training and superior performance in downstream tasks. Efficiency With V-JEPA, Meta has achieved significant efficiency gains, requiring shorter training schedules than traditional pixel prediction methods while maintaining high performance levels. Versatile Representations V-JEPA produces versatile visual representations that excel in both motion- and appearance-based tasks, showcasing its effectiveness in capturing complex interactions within video data. V-JEPA Methodology Revisiting Feature Prediction for Learning Visual Representations from Video The model is trained on the VideoMix2M dataset, where it passively observes video pixels without explicit guidance. Through an unsupervised feature prediction objective, V-JEPA learns to predict features within the videos without relying on external labels or annotations, setting it apart from traditional approaches. The model does not utilize pretrained image encoders, text, negative examples, human annotations, or pixel-level reconstruction during its training process. Instead of directly decoding pixel-level information, V-JEPA makes predictions in latent space, distinguishing it from generative methods. A conditional diffusion model is then trained to decode these feature-space predictions into interpretable pixels, with the pre-trained V-JEPA encoder and predictor networks remaining frozen throughout this process. Importantly, the decoder is only provided with representations predicted for the missing regions of the video and does not access unmasked regions. This methodology ensures that the feature predictions made by V-JEPA exhibit spatio-temporal consistency with the unmasked regions of the video, contributing to its ability to produce versatile visual representations that perform well on downstream video and image tasks without the need to adapt the model's parameters. Advantages over Pixel Prediction V-JEPA makes predictions in an abstract representation space, allowing it to focus on higher-level conceptual information in videos without getting bogged down by irrelevant details. 
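To make "predicting in latent space" concrete, here is a deliberately simplified PyTorch sketch of a JEPA-style training step: a target encoder embeds the full clip, a context encoder sees only the unmasked patches, and a predictor regresses the features of the masked patches. This is a conceptual illustration under toy assumptions, not Meta's implementation; the real model uses large transformer encoders over spatio-temporal video patches, an EMA-updated target encoder, and a carefully designed masking strategy.

```python
# A simplified, conceptual JEPA-style training step (toy modules, not Meta's code).
import torch
import torch.nn as nn

dim, num_patches, mask_ratio = 64, 16, 0.5

context_encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
target_encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
optimizer = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

patches = torch.randn(num_patches, dim)  # stand-in for one video clip's patch tokens
# Randomly hide half of the patches from the context encoder.
masked = torch.rand(num_patches).argsort() < int(num_patches * mask_ratio)

with torch.no_grad():                    # target features carry no gradient
    targets = target_encoder(patches)    # (in practice an EMA copy of the context encoder)

context = context_encoder(patches * (~masked).unsqueeze(-1).float())
predicted = predictor(context)

# Regress the features of the masked patches only (an L1 feature-prediction loss).
loss = (predicted[masked] - targets[masked]).abs().mean()
loss.backward()
optimizer.step()
print(f"feature-prediction loss: {loss.item():.4f}")
```

The point the sketch illustrates is that the loss compares predicted and target features, never pixels, which is what lets the model ignore low-level detail that is irrelevant for downstream reasoning.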
V-JEPA is also the first video model adept at "frozen evaluations," where pre-training of the encoder and predictor is done once and then left untouched. This means adapting the model to new tasks only requires training a lightweight specialized layer on top, making the process efficient and quick. Unlike previous methods that required full fine-tuning for each task, V-JEPA's approach enables reusing the same model components for multiple tasks without specialized training each time, demonstrating its versatility in tasks like action classification and recognizing object interactions. Revisiting Feature Prediction for Learning Visual Representations from Video V-JEPA Performance V-JEPA was trained on a vast dataset comprising 2 million videos sourced from public datasets. The model was then evaluated on a range of downstream image and video tasks, demonstrating impressive performance across the board. Comparison with Pixel Prediction V-JEPA was assessed against video approaches relying on pixel prediction, ensuring a consistent architecture across all baselines. Models such as VideoMAE, Hiera, and OmniMAE were evaluated using either a ViT-L/16 encoder or a Hiera-L encoder, which have a comparable number of parameters. The evaluation encompassed frozen evaluation with an attentive probe on downstream video and image tasks, as well as end-to-end fine-tuning. Revisiting Feature Prediction for Learning Visual Representations from Video V-JEPA exhibited superior performance across all downstream tasks in frozen evaluation, with the exception of ImageNet, where it achieved 74.8% accuracy, comparable to the 75.1% attained by an OmniMAE model trained directly on ImageNet. Under the fine-tuning protocol, V-JEPA surpassed other models trained with a ViT-L/16 and matched the performance of Hiera-L while using significantly fewer samples during pretraining, underscoring the efficiency of feature prediction as a learning principle. Comparison with State-of-the-Art Models The performance of V-JEPA models, pre-trained on video, was compared against the largest state-of-the-art self-supervised image and video models. This comparison included various baselines, such as OpenCLIP, DINOv2, and I-JEPA for image-pretrained models, and VideoMAE, OmniMAE, Hiera, VideoMAEv2, and MVD for video-pretrained models. Revisiting Feature Prediction for Learning Visual Representations from Video The evaluation involved frozen evaluation with an attentive probe on downstream image and video tasks, showing V-JEPA's consistent improvement across all tasks and particularly strong results on tasks requiring motion understanding. It also effectively reduced the gap between video and image models on tasks requiring static, appearance-based features. V-JEPA Use Cases Video Understanding V-JEPA excels at understanding the content of various video streams, making it invaluable for computer vision tasks such as video classification, action recognition, and spatio-temporal action detection. Its ability to capture detailed object interactions and distinguish fine-grained actions sets it apart in the field of video understanding. Contextual AI Assistance The contextual understanding provided by V-JEPA lays the groundwork for developing AI assistants with a deeper understanding of their surroundings. Whether it's providing context-aware recommendations or assisting users in navigating complex environments, V-JEPA can enhance the capabilities of AI assistants in diverse scenarios. 
Augmented Reality (AR) Experiences V-JEPA's contextual understanding of video content can enrich AR experiences by providing relevant contextual information overlaid on the user's surroundings. Whether it's enhancing gaming experiences or providing real-time information overlays, V-JEPA can contribute to the development of immersive AR applications. With the release of Apple's Vision Pro, this technology could play a crucial role in enhancing mixed reality experiences. JEPA for Advanced Machine Intelligence (AMI) The primary focus of V-JEPA's development has centered on perception—grasping the contents of various video streams to gain an immediate contextual understanding of the world around us. The predictor within the Joint Embedding Predictive Architecture serves as an early physical world model, capable of conceptualizing what's happening within a video frame without needing to analyze every detail. Looking ahead, Meta's aim is to leverage this predictive model for planning and sequential decision-making tasks, expanding its utility beyond mere perception. Read the paper by Yann LeCun A Path Towards Autonomous Machine Intelligence for more information.   As a research model, V-JEPA holds promise for various future applications. Its contextual understanding could prove invaluable for embodied AI endeavors and the development of contextual AI assistants for future augmented reality (AR) glasses. Emphasizing responsible open science, Meta has released the V-JEPA model under the CC BY-NC license, encouraging collaboration and further extension of this groundbreaking work in the AI research community. You can find V-JEPA’s open source code on Meta AI’s GitHub.  
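To make the latent-space feature prediction idea above a little more concrete, here is a minimal PyTorch-style sketch of a JEPA-style training step. The module names (`context_encoder`, `target_encoder`, `predictor`), their signatures, and the use of an L1 regression loss on masked positions are illustrative assumptions for the sketch, not Meta's actual implementation.

```python
import torch
import torch.nn.functional as F

def jepa_training_step(video_patches, mask, context_encoder, target_encoder, predictor, optimizer):
    """One illustrative JEPA-style step: predict features of masked patches in latent space.

    video_patches: (batch, num_patches, patch_dim) flattened spatio-temporal patches
    mask:          (batch, num_patches) boolean, True where patches are masked out
    """
    # Target features come from a frozen (e.g. EMA) encoder that sees the full video;
    # no gradients flow through the target branch.
    with torch.no_grad():
        target_features = target_encoder(video_patches)          # (B, N, D)

    # The context encoder only sees the unmasked patches.
    context_features = context_encoder(video_patches, ~mask)     # (B, N, D)

    # The predictor fills in latent features for the masked locations.
    predicted_features = predictor(context_features, mask)       # (B, N, D)

    # Regression loss in feature space, computed only on masked positions
    # (an L1 loss is used here for illustration; the exact loss and weighting may differ).
    loss = F.l1_loss(predicted_features[mask], target_features[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key point the sketch tries to capture is that the loss is computed between predicted and target features, never between pixels, which is what distinguishes this family of models from pixel-reconstruction approaches.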

February 16

8 min

sampleImage_few-shot-learning-in-computer-vision
Few Shot Learning in Computer Vision: Approaches & Uses

Supervised learning once dominated the artificial intelligence (AI) space, where the only way to train a deep neural network was to use an extensive amount of labeled data. However, this approach encounters significant hurdles in complex industrial and research domains, such as advanced computer vision (CV) and natural language processing (NLP) tasks.  The primary challenges include the scarcity of labeled data, the high cost of annotating complex available datasets, and the emergence of new data categories in specific domains like healthcare, where data on new diseases makes traditional CV models obsolete.  To overcome these challenges, the AI community has pivoted towards innovative frameworks allowing effective model training with limited data. Few-shot learning (FSL) emerges as a pivotal solution, creating scalable CV systems that learn from only a handful of samples. This revolutionary change leverages prior knowledge and meta-learning techniques to achieve robust performance, even in data-constrained environments. This article will discuss the approaches to few-shot learning (FSL) and its wide-ranging applications, highlighting its critical role in advancing AI capabilities with minimal data. You will learn about: Different few-shot learning variations. Few-shot learning classification algorithms. Few-shot detection algorithms. Before getting into the details, let’s first discuss a few fundamental concepts regarding what FSL is, how it works, its relationship with meta-learning, and essential terminology used in the AI community to describe FSL frameworks. What is Few-shot Learning? FSL is an approach for developing machine learning (ML) algorithms with only a few samples in the training datasets. This approach is distinct from traditional supervised learning, which relies on large volumes of data, by focusing on the ability to generalize from very few examples using advanced techniques and prior knowledge. Key Terminology in Few-shot Learning FSL involves a few technical terms requiring explanation before describing how it works. These standard terms include support and query sets, k-way, n-shot, meta-learner, and base-learner. Support set A support set in FSL consists of data samples from a particular class for training an FSL model. It acts as the backbone for the FSL framework by exposing the model to real-world scenarios and allowing it to capture intricate data patterns from a few samples. For instance, the support set for two classes—dogs and cats—can contain six training data points with three samples per class.  Query set The query set contains different samples from the same classes as the support set. It challenges the model with new examples to ensure it has learned the concept, not just memorized specifics. For instance, the query set can have images of dogs and cats with other breeds, colors, shapes, and backgrounds. The number of examples per class in the query set should be the same as in the support set. N-way N refers to the number of classes involved in the learning task (in the support and query sets). This means a setting where the support and query sets have two classes - cats and dogs - will be a 2-way classification problem (the model learns to distinguish between the two classes). K-shot K is the number of samples per class. An FSL model with three samples per class will be a 3-shot classification task (the model learns from three examples of each class). The usual term is N-way K-shot. 
So, a situation where you have three samples per class with two classes is a 2-way 3-shot problem. Meta-learner and Base-learner In FSL, the meta-learner optimizes across tasks to improve the base learner's ability to adapt to new tasks quickly. A base learner, starting from a random initialization, focuses on specific tasks, with its performance feedback used to update the meta-learner. Overall, FSL is not just about dealing with less data; it's about smartly leveraging what's available to make significant leaps in learning efficiency. Understanding these foundational concepts equips you to learn about FSL algorithms and their diverse applications. But first off, how does it work? How Does Few-shot Learning Work? Few-shot Learning (FSL) operates through a structured process known as an 'episode,' which simulates multiple training tasks. Each episode comprises a support set and a query set, representing a small sample from the overall dataset designed to teach and then test the model within a narrowly defined scope. Episode - An episode consists of multiple training tasks The FSL workflow begins with constructing a series of training tasks, each encapsulated in an episode. For a '3-way 1-shot' problem, each task is built around learning from one example of each of three different classes. The model uses the support set to learn the distinctive features of each class from these single examples. Then, it attempts to classify new examples in the query set, which are variations of the same classes not seen during training. Next, we evaluate the model through several test tasks. The essence of FSL is its ability to validate this learning in new, unseen classes during the evaluation phase. Each test task consists of a query and a support set. However, the sets contain samples of novel or unseen classes not present during training. Training and Test Tasks containing different classes Key to this process is the iterative exposure to varied episodes, each presenting unique classes and examples. This approach encourages the model to develop a flexible understanding of class characteristics and apply this knowledge to new classes it faces in test tasks. Often, the FSL problem is synonymous with meta-learning, as the FSL model understands patterns in datasets from diverse domains to label unseen classes based on prior knowledge. This makes FSL a meta-learning problem where the model learns how to learn. Few-shot Learning Approaches FSL adopts multiple approaches to address the challenge of learning from limited data, incorporating data-level, parameter-level, meta-learning, generative, and cross-modal techniques. Each strategy brings unique strengths to FSL, enabling models to generalize effectively across diverse scenarios. Data-Level FSL Approach The data-level approach is a straightforward concept that says to add more data in cases of insufficiently labeled examples. The premise is to use extensive, diverse datasets as a base for pre-training your model.  The samples in the base dataset will differ slightly from the support and query sets. The model learns general patterns from the base dataset during the training stage. You can then fine-tune the pre-trained model for novel classes with a few examples. For instance, we can train a model on a base dataset containing multiple labeled images of generic anatomical structures. We can then fine-tune the model on specific medical images with limited labeled samples. 
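Before moving on to the next approach, here is a small sketch of the episodic N-way K-shot sampling described in the "How Does Few-shot Learning Work" section above. The toy dataset and class names are purely illustrative.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=2, k_shot=3, query_per_class=3):
    """Sample one N-way K-shot episode: a support set and a query set.

    dataset: iterable of (example, label) pairs covering many classes.
    Returns (support, query), each a list of (example, label) pairs.
    """
    by_class = defaultdict(list)
    for example, label in dataset:
        by_class[label].append(example)

    # Pick N classes that have enough examples for both support and query.
    eligible = [c for c, xs in by_class.items() if len(xs) >= k_shot + query_per_class]
    classes = random.sample(eligible, n_way)

    support, query = [], []
    for c in classes:
        examples = random.sample(by_class[c], k_shot + query_per_class)
        support += [(x, c) for x in examples[:k_shot]]
        query += [(x, c) for x in examples[k_shot:]]
    return support, query

# Toy example: 40 labeled points across 4 classes, sampled as a 2-way 3-shot episode.
toy = [((i, i + 1), f"class_{i % 4}") for i in range(40)]
support_set, query_set = sample_episode(toy, n_way=2, k_shot=3)
```

Training then loops over many such episodes, each drawing different classes, which is what forces the model to learn how to learn rather than memorize any particular class.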
Parameter-Level FSL Approach This approach involves finding a set of model parameters that quickly converge to the most optimal parameter space for a specific problem. The objective is to reach a parameter space where the model will require only a few training steps to generalize to the new dataset without needing extensive labeled data. For instance, training an FSL model to classify a rare bird species will be slower and more prone to overfitting if we use random parameters for initialization. Instead, we can initialize the model with pre-trained parameters that already have prior knowledge regarding generic bird species. Techniques such as Bayesian optimization or specialized embedding spaces prepare the model with a knowledge base that facilitates quick adaptation (i.e., classifying rare bird species), minimizing the risk of overfitting despite the sparse data. DINOv2 models are good few-shot learnings with many applications, including image classification, object detection, and video understanding. Learn how they are pre-trained to handle many tasks out-of-the-box in this guide.   Meta-learning This approach is subdivided into metric-learning and gradient-based approaches. Metric-learning employs distance-based metrics to assess class similarity so that models can classify new examples by comparing them to known classes within embedding spaces. Gradient-based meta-learning, exemplified by algorithms like MAML, optimizes the model's ability to learn efficiently from a few examples by adjusting its parameters based on a meta-learner's feedback, bridging the gap between different tasks. Generative Methods Generative methods relate to data-level approaches that use synthetic data to augment the support and query sets in FSL. Data augmentation techniques, generative adversarial networks (GANs), and vision transformers (ViT) are standard methods that you can use to create fake data. This approach increases the quantity of data and introduces variability, challenging the model to learn more generalized representations. Cross-modal Few-shot Learning Cross-modal techniques use data from different modalities, such as text and audio, for FSL. For instance, you can combine text and image data to have a richer dataset instead of using images only. A straightforward method employed by recent research combines text and visual embeddings to compute richer prototypes for measuring similarity with the query image. This extends the traditional prototypical network, which only uses image embeddings for class prototypes. Few-shot learning approaches vary depending on the problem’s context. However, their distinction can be hazy, as you can combine one approach to develop a new FSL framework.  Categorizing FSL based on its types can be more helpful. So, let’s discuss FSL’s variations. Here is a table summary of the approaches, their primary objective, and instances where they are the best approach to implement Few-shot Learning Variations FSL encompasses a range of learning frameworks tailored to the scarcity of data, classified into n-shot, one-shot, and zero-shot learning. Each variation addresses unique challenges in machine learning with minimal examples. N-shot Learning: N-shot learning is a generalization of FSL models where ‘N’ refers to the number of training examples per class. For instance, training a model with only four samples per class is called 4-shot learning. This adaptable variation allows models to be tailored to the specific constraints and complexities of various tasks. 
N-shot learning shines in scenarios where acquiring a handful of examples is feasible, balancing learning efficiency and performance. One-shot Learning: One-shot learning (OSL) occurs when only one sample exists per class in the training set. OSL algorithms are helpful in facial recognition applications where you only have a single training instance for each individual, and gathering multiple instances may be challenging. They use feature extraction and comparison to recognize patterns from a single instance and avoid overfitting. Zero-shot Learning: Zero-shot learning (ZSL) is an extreme variation of FSL, where the model classifies items with no direct training examples. The method involves training a model on seen classes and corresponding auxiliary information, such as detailed descriptions, labels, and definitions of each class. The model learns to use the auxiliary information to predict labels for the seen classes correctly. Once trained, we ask the model to classify unseen or new classes based on their auxiliary information during inference. This approach is particularly valuable in domains where the class spectrum is vast and continually expanding. Few-shot Learning Classification Algorithms Let’s now turn to several classification algorithms based on the approaches and variations described above. The following will briefly overview six mainstream FSL algorithms: model-agnostic meta-learning (MAML), matching networks, prototypical networks, relation networks, Siamese networks, and memory-augmented neural networks. Model-agnostic Meta-learning (MAML) MAML is a parameter-level, gradient-based meta-learning (GBML) approach that involves a two-step optimization process to prepare models for quick adaptation to new tasks. In the first step, we initialize a model and train it on multiple tasks. We use the errors generated from this step to compute adapted parameters through gradient descent. Next, we fine-tune the model, adjusting its parameters based on the errors, through stochastic gradient descent using a loss function. The result is a generic pre-trained parameter set that can quickly adapt to new tasks in a few training steps. MAML - Model Agnostic Meta Learning Once we have the pre-trained parameters, we can adapt them by re-training under a few-shot setting. The pre-trained parameter theta will approach the true parameter theta-star of a new task with only a few gradient steps, making the learning process efficient. Matching Networks Matching networks (MNs) are a metric-based meta-learning approach that uses convolutional neural networks (CNNs) to generate embeddings for both support and query images. Matching Network Architecture The model classifies the query image based on its similarity with the support set embeddings. The approach dynamically adjusts to new tasks, using a contrastive loss function to backpropagate errors and optimize the model for better task-specific performance. Prototypical Networks Prototypical networks (PNs) are a metric-based approach that computes an average embedding for each class in the support set using the embeddings of that class's examples. These averages are called prototypes. Prototypical Network The model compares the embedding of a query (input) image x with the prototype c for class k and classifies the image based on a similarity score (its proximity to these prototypes). Cross-modal approaches also use prototypical networks, computing the prototype for each class by combining its text and image embeddings. Relation Networks Relation networks (RNs) combine the methods of matching and prototypical networks.
The framework computes prototypes for each class and concatenates the query image embeddings with the prototypes. Relation Network A relation module classifies the query image based on the similarity between the query embeddings and class prototypes. This method allows for a more nuanced assessment of class membership and can capture complex relations. Siamese Networks Siamese networks are also metric-based frameworks adept at one-shot learning. They are designed for comparison, using twin networks to process pairs of inputs and assess their similarity. Siamese Network They use a contrastive loss function to fine-tune the model's sensitivity to subtle differences and similarities. Contrastive learning allows models to extract meaningful representations from unlabeled data. Learn how it works in our ‘Full Guide to Contrastive Learning’. Memory-augmented Neural Networks Memory-augmented neural networks (MANNs) use memory modules to store data-related information such as vectors, entity relationships, and context. This enables the model to draw on this repository when encountering new tasks. MANN Architecture The architecture consists of a controller, read-write heads, and a memory module. The read head fetches relevant information from memory when the controller receives a query and provides it back to the controller for classification. The write head stores new information in the memory module when the controller receives new data. Few-shot Object Detection Algorithms Like few-shot classification, we can also use few-shot approaches for object detection. The method involves a support set containing N class labels for the objects within an image and K examples per class. Annotating an N-class-label image using Encord Annotate More generally, a single image can contain more than one instance of the same object, and there can be multiple images. This can result in class imbalance, as the support set may contain more examples for some classes and fewer for others. Two algorithms that address these issues and detect objects with only a few examples are: YOLOMAML DeFRCN YOLOMAML YOLOMAML combines a variation of the YOLO algorithm with the MAML technique for few-shot object detection. The architecture consists of a customized version of YOLOv3 with Tiny Darknet as the backbone and two additional output blocks. The backbone is initialized with parameters pre-trained on the ImageNet dataset, and its layers are frozen, leaving only five convolutional layers to be trained. This speeds up the learning process on a standard GPU. YOLOMAML Algorithm Pseudocode Like standard MAML, the algorithm samples several detection tasks from the support set. For each task, it updates the initial parameters based on the loss function defined over the query set. This results in updated parameters for each task. Finally, it updates the initial parameter set through stochastic gradient descent using the aggregate of the loss functions defined over the updated parameters. Once we have the updated parameters, we can initialize the network with this new set of parameters and provide novel images for detection. The pre-trained parameters will quickly adapt to detect the relevant objects based on limited samples. DeFRCN Decoupled Faster R-CNN (DeFRCN) is a variant of the Faster R-CNN framework, which consists of a region proposal network (RPN), a region-based CNN (R-CNN) module, and two heads for box classification and regression.
Together, the box classifier and regressor help detect relevant objects within an image. In traditional Faster R-CNN, the RPN proposes regions of interest (where to look), and the R-CNN module predicts bounding boxes and classes (what to look at). However, the two modules share the same feature extractor (the backbone). This results in misalignment, as the objectives of the RPN and the R-CNN module are fundamentally different. DeFRCN overcomes these limitations by introducing separate gradient decoupled layers (GDL) for the RPN and the R-CNN module to control the effect of each on the backbone’s update process. The network is trained on a large base dataset with many labeled samples. The architecture uses a Prototypical Calibration Network (PCN) for few-shot detection, which consists of a feature extractor that captures relevant features of novel classes in the support set. DeFRCN The PCN computes prototypes for each class and outputs a similarity score against the query image. The query image is also passed to the box classifier, which generates its own classification score. The network backpropagates a loss based on the two scores to optimize the backbone further. In this way, the DeFRCN architecture jointly trains the model on base and novel datasets for optimal detection. Few Shot Learning: Use Cases Since FSL requires only a few labeled samples for training machine learning models, it has widespread uses in multiple industrial applications where data is limited. The list below mentions a few popular FSL use cases. Robotics: FSL models can help robots recognize novel objects in unknown environments without requiring extensive prior knowledge. Medical imaging: Due to insufficient labeled images for rare diseases, FSL models are valuable for medical diagnosis as they can detect new diseases and anomalies with minimal training data. Facial recognition: Facial recognition systems mostly use OSL models like the Siamese network to authenticate users. The models compare the input image with a reference photo and detect similarity. Autonomous vehicles: CV models for autonomous vehicles require FSL object detection models to recognize new objects on the road for efficient navigation. Quality assurance: FSL frameworks can help detect new product anomalies and defects on the assembly line. Gesture and emotion recognition: Classifying gestures and emotions in real-time is challenging since training a model using traditional methods would require data on all kinds of emotional and physical cues. Instead, training FSL models on a few relevant images is optimal, as they can recognize anomalous behavior using minimal labeled samples. Video Scene Classification: FSL approaches can analyze and classify novel video scenes using the knowledge gained from a few training samples. Want to know the latest computer vision use cases? Learn more about the ten most exciting applications of computer vision in 2024   Few-shot Learning: Key Takeaways With FSL overtaking the traditional learning paradigms in computer vision, the approaches, algorithms, and frameworks will likely grow exponentially in the coming years. Below are a few key points to remember regarding FSL: Significance of FSL: FSL is crucial in the modern AI ecosystem. It can help you build models with minimal training data, making it suitable for applications where data is limited. Few-shot classification approaches: The primary FSL approaches for image classification include data-level, parameter-level, metric-based, gradient-based meta-learning, generative, and cross-modal methods.
Few-shot object detection: Few-shot object detection is an emerging field where we aim to detect multiple objects within a single image using FSL approaches. YOLOMAML and DeFRCN are two of the main algorithms that address this problem.
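To ground the metric-based approaches covered above, here is a compact PyTorch sketch of the prototypical-network idea: embed the support examples, average them into per-class prototypes, and classify queries by distance. The embedding network below is a stand-in for illustration, not a specific published architecture.

```python
import torch
import torch.nn as nn

class ProtoNet(nn.Module):
    """Minimal prototypical-network classifier for an N-way K-shot episode."""

    def __init__(self, in_dim=64, embed_dim=32):
        super().__init__()
        # Stand-in embedding network; a CNN would typically be used for images.
        self.embed = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))

    def forward(self, support_x, support_y, query_x, n_way):
        # support_x: (n_way * k_shot, in_dim), support_y: class indices in [0, n_way)
        z_support = self.embed(support_x)
        z_query = self.embed(query_x)

        # Prototype = mean embedding of each class's support examples.
        prototypes = torch.stack([z_support[support_y == c].mean(dim=0) for c in range(n_way)])

        # Classify queries by (negative) Euclidean distance to each prototype.
        dists = torch.cdist(z_query, prototypes)          # (num_query, n_way)
        return (-dists).log_softmax(dim=-1)               # log-probabilities per class

# Toy 2-way 3-shot episode with random features.
model = ProtoNet()
support_x, support_y = torch.randn(6, 64), torch.tensor([0, 0, 0, 1, 1, 1])
query_x = torch.randn(4, 64)
log_probs = model(support_x, support_y, query_x, n_way=2)
```

The same skeleton extends naturally to cross-modal variants, where each prototype is computed from a combination of image and text embeddings rather than image embeddings alone.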

February 16

8 min

sampleImage_htj2k-support-for-dicom-files-encord
Announcing HTJ2K Support for DICOM Files in Encord

Announcing HTJ2K Support for DICOM Files in Encord We are thrilled to announce that Encord now supports the High Throughput JPEG 2000 (HTJ2K) transfer syntaxes for DICOM files. 🎉 Now, I know what you're thinking. "What's all the buzz about HTJ2K?" Well, let me break it down for you in plain English. What is HTJ2K? HTJ2K, or High Throughput JPEG 2000, is a variant of the JPEG 2000 image compression standard. It is designed to overcome the limitations of traditional JPEG 2000 by offering higher encoding and decoding throughput, making it particularly suitable for applications where speed is critical. Think of it as the turbocharged version of JPEG 2000, designed to handle massive DICOM datasets with lightning-fast speed. ⚡ So, why is this a big deal? Picture this: you're a busy medical professional trying to access critical imaging data. Every second counts, right? With HTJ2K support in Encord, you'll experience faster rendering and import times, meaning you can get to that crucial image quicker than ever before. Importance of HTJ2K for DICOM Files In medical imaging, DICOM transfer syntax dictates how pixels are encoded within datasets. Despite significant advances in imaging technology, the DICOM standard has seen limited transfer syntax updates over the past two decades. Enter HTJ2K, a variant of JPEG 2000 that combines the efficiency of JPEG 2000 compression with high throughput capabilities. Unlike its predecessor, which struggled due to its heavyweight nature, HTJ2K offers streamlined compression without sacrificing image quality. The recent approval of HTJ2K transfer syntaxes by the DICOM Standard Committee marks a significant milestone. This new syntax brings unprecedented efficiency and speed to image compression, reducing file sizes and transmission times while improving workflow efficiency for medical practitioners. At Encord, we're thrilled to be at the forefront of this advancement in DICOM labeling. 🎉 Why is it Exciting for Medical Professionals? One of the few to support HTJ2K: With only a handful of applications currently supporting HTJ2K, Encord is among the pioneers in utilizing this advanced compression format. Time to First Image: With HTJ2K, medical professionals can expect significant efficiency gains, as compression reduces file sizes by 40%. The time-to-first-image is dramatically reduced, meaning that practitioners can access critical imaging data faster, enabling quicker diagnoses and treatment decisions. Performance Improvements: As the image takes up less storage space, the reduction in both presentation time (loading time) and transmission time means that medical professionals can spend less time waiting for images to render and import, leading to improved productivity and workflow efficiency. Industry Adoption: The fact that AWS has already built a medical imaging platform based on HTJ2K underscores its potential to revolutionize the field. By aligning with this industry trend, Encord is empowering medical professionals with tools that are not only cutting-edge but also endorsed by industry leaders. So, the news that Encord now supports HTJ2K is nothing short of groundbreaking for medical professionals. It represents a significant leap forward in imaging technology, promising enhanced efficiency, performance improvements, and alignment with industry trends. With HTJ2K, medical practitioners can expect to save both space and time, ultimately leading to improved patient care and outcomes.
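For readers who work with DICOM programmatically, here is a small sketch of how you might check whether a file uses one of the HTJ2K transfer syntaxes with pydicom. The UID values below are the HTJ2K syntaxes registered in the DICOM standard (worth double-checking against the current standard), and actually decoding the pixel data additionally requires a pixel-data handler with HTJ2K support installed, which is an assumption about your environment.

```python
import pydicom

# HTJ2K transfer syntax UIDs registered in the DICOM standard.
HTJ2K_UIDS = {
    "1.2.840.10008.1.2.4.201": "HTJ2K Lossless",
    "1.2.840.10008.1.2.4.202": "HTJ2K Lossless RPCL",
    "1.2.840.10008.1.2.4.203": "HTJ2K",
}

def describe_transfer_syntax(path):
    """Report whether a DICOM file is encoded with an HTJ2K transfer syntax."""
    ds = pydicom.dcmread(path, stop_before_pixels=True)  # header only, no pixel decoding
    uid = str(ds.file_meta.TransferSyntaxUID)
    if uid in HTJ2K_UIDS:
        print(f"{path}: {HTJ2K_UIDS[uid]} ({uid})")
    else:
        print(f"{path}: not HTJ2K ({uid})")

# Example (hypothetical file path):
# describe_transfer_syntax("study/series/slice_0001.dcm")
```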
Don't miss out on this exciting opportunity to elevate your medical imaging workflow to new heights with Encord. Contact us today to learn more and experience the power of HTJ2K for yourself!

February 16

5 min

sampleImage_google-gemini-1-5-generative-ai-model-with-mixture-of-experts
Gemini 1.5: Google's Generative AI Model with Mixture of Experts Architecture

In December 2023, Google launched the Gemini 1.0 family of models that outperformed state-of-the-art (SoTA) models in multimodal AI capabilities. Fast-forward to February 2024, and the Google DeepMind research team has launched Gemini 1.5 Pro with a context window of up to 10 million tokens! Not only that, it maintains near-perfect recall across the entire context and uses a mixture-of-experts (MoE) architecture for more efficient training & higher-quality responses. In this article, you will learn about: The superior performance benchmarks of Gemini 1.5 Why it performs better than SoTA models on textual, visual, and audio tasks How well it handles long-context tasks, especially with MoE as its architectural backbone How you can get started using it Before we jump into it, let’s set the tone with an overview of the MoE architecture that backs Gemini 1.5. TL;DR Gemini 1.5 is a sparse mixture-of-experts (MoE) multimodal model with a context window of up to 10 million tokens. It excels at long-term recall and retrieval; it generalizes zero-shot to long instructions, like analyzing 3 hours of video or 22 hours of audio with near-perfect recall. It performs better than Gemini 1.0 Pro and 1.0 Ultra on most benchmarks, but falls short of 1.0 Ultra on some audio and vision tasks. Although there are no detailed insights on the model size, architectural experiments, or the number of experts, the model performs well at in-context memorization and generalization. Mixture-of-Experts (MoE) Architecture Gemini 1.5 Pro uses a mixture-of-experts (MoE) architecture for efficient training & higher-quality responses, building on a long line of Google research efforts on sparse models. At its core, MoE diverges from traditional deep learning and Transformer architectures by introducing a dynamic routing mechanism that selectively activates different subsets of parameters (referred to as "experts") depending on the input data. It learns to selectively activate only the most relevant expert pathways in its neural network for nuanced and contextually aware outputs. This approach enables the model to scale more effectively in terms of computational efficiency and capacity without a linear increase in computational demands. In the context of Gemini 1.5, the MoE architecture contributes to efficient training and serving. Concentrating computational resources on the most relevant parts of the model for each input allows for faster convergence and improved performance without necessitating the proportional increase in computational power typically associated with scaling up the model size. Gemini 1.5 - Model Functionalities Gemini 1.5 drops with some impressive functionalities that beat SoTA models: Huge context window that spans up to 10 million tokens Reduced training compute with the mixture-of-experts architecture Superior performance compared to Gemini 1.0 models, GPT-4, and other SoTA models Huge Context Window A model’s “context window” comprises tokens, the building blocks for processing a user’s query. Tokens can be entire words or parts of words, images, videos, audio, or code. The bigger a model’s context window, the more information it can take in and process at a given prompt. Gemini 1.5 is a highly capable multimodal model with context lengths ranging from 128K to 1 million tokens for production applications and up to 10 million tokens for research.
This unlocks a lot of use cases: reasoning about long text documents of up to 700,000 words, making sense of an hour of video (full movies), processing 11 hours of audio or entire podcast series, and analyzing 30,000 lines of code simultaneously. These capabilities are several times greater than those of other AI models, including OpenAI’s GPT-4, which powers ChatGPT. Context lengths of foundation models with Gemini 1.5 scaling up to 10 million tokens in research Reduced Training Compute Gemini 1.5 was trained on TPUv4 accelerators spread across multiple 4096-chip pods. This underscores the model's reliance on high-performance computing resources, but training-efficiency techniques, alongside the MoE architecture, were also needed to make training optimal. Gemini 1.5 significantly reduced compute requirements for training despite the larger context windows. This achievement is pivotal in the progress of AI model training efficiency, addressing one of the most pressing challenges in the field: the environmental and economic costs associated with training large-scale AI models. The reduction in training compute is primarily down to the Mixture-of-Experts (MoE) architectural backbone, which Gemini 1.5 uses to optimize computational resources. Beyond that, Gemini 1.5 incorporates state-of-the-art techniques such as sparsity in the model's parameters, which means that only a subset of the model's weights is updated during each training step. This approach reduces the computational load, leading to faster training times and lower energy consumption. According to the technical report, combining those processes to train the model led to remarkable performance without the proportional increase in resource consumption typically seen in less advanced models. Recalling and Reasoning Google Gemini 1.5 Pro sets a new standard in AI's ability to recall and reason across extensive multimodal contexts. The ten million-token context window, the largest of any foundation model so far, enables Gemini 1.5 Pro to demonstrate unparalleled proficiency in synthesizing and interpreting vast amounts of information. Gemini 1.5 Pro achieves near-perfect recall in complex retrieval tasks across long text documents, videos, and audio, which shows its understanding of the input. In tests from the report, Gemini 1.5 Pro learned new languages from sparse instructional materials 🤯. This model's proficiency in recalling specific details from large datasets and its capability to apply this knowledge in reasoning tasks usher in a new era in AI applications, ranging from academic research and comprehensive code analysis to nuanced content creation. Superior Performance Benchmark Gemini 1.5 Pro demonstrates remarkable improvements over state-of-the-art (SotA) models, including GPT-4V, in tasks spanning text, code, vision, and audio. Some of the benchmarks for which Gemini 1.5 Pro achieves SotA accuracy include 1H-VideoQA and EgoSchema. This indicates Gemini 1.5 Pro's advanced long-context multimodal understanding. Learn more about how OpenAI’s GPT-Vision is expected to compare to the Gemini family of models in our explainer blog post. In core text evaluations, Gemini 1.5 Pro consistently outperforms its predecessors (Gemini 1.0 Pro and Ultra) in various domains such as Math, Science & Reasoning, Coding, Multilinguality, and Instruction Following. The model shows substantial improvements, particularly in Math and Science Reasoning, where it outperforms Gemini 1.0 Ultra, and in Coding tasks; it also sets a new SotA accuracy benchmark on the EgoSchema video question-answering task.
Gemini 1.5 Pro's performance in multilingual evaluations highlights its enhanced ability to process and understand multiple languages. It shows significant improvements over both Gemini 1.0 models and other specialist models like USM and Whisper in speech understanding tasks. Needle In A Haystack (NIAH) Evaluation The Needle In A Haystack (NIAH) evaluation showcases Gemini 1.5 Pro's capability to retrieve specific information ("needle") from a massive amount of data ("haystack") across different modalities. This evaluation underscores the model's efficiency in long-context understanding and recall accuracy. Gemini 1.5 Pro achieves near-perfect “needle” recall (>99.7%) up to 1M tokens of “haystack” in all modalities (i.e., text, video audio) and maintains this recall performance when extending to 10 M tokens across modalities Context Window - Text Modality: Recall to Token Count Gemini 1.5 Pro excels in the text modality, with the model achieving over 99% recall for up to 10 million tokens, or approximately 7 million words. This capacity for deep, nuanced understanding and recall from vast quantities of text sets a new benchmark for AI performance in natural language processing. It can sift through large volumes of text to find specific information. Text needle-in-a-haystack task comparison between Gemini 1.5 Pro and GPT-4 Turbo The model demonstrates high recall rates for identifying exact text segments within extensive documents. Context Window - Audio Modality: Recall to Token Count Gemini 1.5 Pro demonstrates an exceptional ability to recall information from audio data, achieving near-perfect recall (>99.7%) up to 2 million tokens, equivalent to approximately 22 hours of audio content. It was able to recall and identify specific audio segments ("needles") embedded within long audio streams ("haystacks").  Audio version of the needle-in-a-haystack experiment comparing Gemini 1.5 Pro and a combination of Whisper and GPT-4 Turbo This represents a significant advancement over combining two SoTA models like Whisper + GPT-4 Turbo in a recall-to-token count comparison, which struggles with long-context audio processing. Context Window - Video Modality: Recall to Token Count Gemini 1.5 Pro maintains high recall performance in the video modality, successfully retrieving information from video data up to 2.8 million tokens, correlating to around 3 hours of video content. The "Video Needle In A Haystack" task tested the model's performance in recalling specific video frames from lengthy videos. This is critical for tasks requiring detailed understanding and analysis of long-duration video sequences. It can accurately pinpoint and recall specific moments or information from extensive video sequences. Multineedle in Haystack Test The researchers created a generalized version of the needle in a haystack test, where the model must retrieve 100 different needles hidden in the context window.  The results? Gemini 1.5 Pro’s performance was above that of GPT-4 Turbo at small context lengths and remains relatively steady across the entire 1M context window. At the same time, the GPT-4 Turbo model drops off more quickly (and cannot go past 128k tokens). Multineedle in Haystack Test Textual Capabilities of Gemini 1.5 Mathematical and Scientific Textual Reasoning Gemini 1.5 Pro shows a +28.9% improvement over Gemini 1.0 Pro and a +5.2% improvement over Gemini 1.0 Ultra. This indicates a substantial increase in its ability to handle complex reasoning and problem-solving tasks. 
This proficiency is attributed to its extensive training dataset, which includes a wide array of scientific literature and mathematical problems, so the model can grasp and apply complex concepts accurately. Coding In Coding tasks, Gemini 1.5 Pro marked a +8.9% improvement over 1.0 Pro and +0.2% over 1.0 Ultra, showcasing its superior algorithmic understanding and code generation capabilities. The model can accurately analyze an entire code library in a single prompt, without the need to fine-tune the model, including understanding and reasoning over small details that a developer might easily miss. Problem Solving Capability across 100,633 lines of code Instructional Understanding Gemini 1.5 Pro excels in Instruction Following, surpassing the 1.0 series in comprehending and executing complex, multi-step instructions (+9.2% over 1.0 Pro and +2.5% over 1.0 Ultra) across various data formats and tasks. This indicates its advanced natural language understanding and ability to process and apply knowledge in a contextually relevant manner. Multilinguality The model also shows improvements in handling multiple languages, with a +22.3% improvement over 1.0 Pro and a slight +6.7% improvement over 1.0 Ultra. This highlights its capacity for language understanding and translation across diverse linguistic datasets. This makes it an invaluable tool for global communication and preserving and revitalizing endangered languages. Kalamang has almost no online presence. Machine Translation from One Book (MTOB: arxiv.org/abs/2309.16575) is a recently introduced benchmark evaluating the ability of a learning system to learn to translate Kalamang from just a single book. Gemini 1.5 Pro still translates the user prompt with astonishing accuracy. Visual Capabilities of Gemini 1.5 The model's multimodal understanding is outstanding in Image and Video Understanding tasks. Gemini 1.5 Pro's performance in these areas reflects its ability to interpret and analyze visual data, making it an indispensable tool for tasks requiring a nuanced understanding of text and media. Image and Video Understanding For image understanding, there's a +6.5% improvement over 1.0 Pro but a -4.1% difference compared to 1.0 Ultra. In video understanding, however, Gemini 1.5 Pro shows a significant +16.9% improvement over 1.0 Pro and +3.8% over 1.0 Ultra, indicating robust enhancements in processing and understanding visual content. Here are some areas where Gemini 1.5 performs well: Contextual Understanding: Gemini 1.5 integrates visual data with textual descriptions, enabling it to understand the context and significance of visual elements in a comprehensive manner. This allows for nuanced interpretations that go beyond mere object recognition. Video Analysis: For video content, Gemini 1.5 demonstrates an advanced ability to track changes over time, recognize patterns, and predict outcomes. This includes understanding actions, events, and even the emotional tone of scenes and providing detailed analyses of video data. Image Processing: In image understanding, Gemini 1.5 utilizes state-of-the-art techniques to analyze and interpret images. This includes recognizing and categorizing objects, understanding spatial relationships, and extracting meaningful information from still visuals.
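Before moving on to the audio results, here is a minimal, model-agnostic sketch of the needle-in-a-haystack protocol described above. The `query_model` function, the needle text, and the filler text are all illustrative placeholders, not the exact setup used in the Gemini report.

```python
NEEDLE = "The secret passphrase for the evaluation is 'violet-kangaroo-42'."
FILLER = "The quick brown fox jumps over the lazy dog. " * 50  # stand-in haystack text

def build_haystack(num_chunks, needle_depth):
    """Insert the needle at a given relative depth (0.0 = start, 1.0 = end)."""
    chunks = [FILLER] * num_chunks
    chunks.insert(int(needle_depth * num_chunks), NEEDLE)
    return "\n".join(chunks)

def query_model(prompt):
    """Placeholder for a real long-context model call."""
    raise NotImplementedError("Plug in the model/API you want to evaluate.")

def run_niah(num_chunks=1000, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Check whether the model recalls the needle at different depths in the context."""
    results = {}
    for depth in depths:
        context = build_haystack(num_chunks, depth)
        prompt = context + "\n\nWhat is the secret passphrase mentioned above?"
        answer = query_model(prompt)
        results[depth] = "violet-kangaroo-42" in answer  # did the model recall the needle?
    return results
```

Sweeping both the needle depth and the total context length produces the recall-versus-token-count grids reported for Gemini 1.5 Pro and GPT-4 Turbo.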
Audio Capabilities of Gemini 1.5 Speech Recognition and Translation In an internal YouTube video-based benchmark, Gemini 1.5 Pro was evaluated on 15-minute segments, showing a remarkable ability to understand and transcribe speech with a word error rate (WER) significantly lower than that of its predecessors and other contemporary models.  This capability is especially notable given the challenges posed by long audio segments, where the model maintains high accuracy without the need for segmentation or additional preprocessing. Gemini 1.5 Pro also performed well at translating spoken language from one language to another, maintaining the meaning and context of the original speech. This is particularly important for applications that require real-time or near-real-time translation. Overall, there are mixed results in the audio domain, with a +1.2% improvement in speech recognition over 1.0 Pro but a -5.0% change compared to 1.0 Ultra. In speech translation, Gemini 1.5 Pro shows a slight +0.3% improvement over 1.0 Pro but a -2.2% difference compared to 1.0 Ultra. Gemini 1.5 Core capabilities performance over its predecessor, Gemini 1.0 series of models, Gemini 1.0 Pro and Gemini 1.0 Ultra Long Context Understanding Gemini 1.5 Pro significantly expands the context length to multiple millions of tokens, enabling the model to process larger inputs effectively. This is a substantial improvement over models like Claude 2.1, which has a 200k token context window. Gemini 1.5 Pro maintains a 100% recall at 200k tokens and shows minimal reduction in recall up to 10 million tokens, highlighting its superior ability to manage and analyze extensive data sets. In one example, the model analyzed long, complex text documents, like Victor Hugo’s five-volume novel “Les Misérables” (1382 pages, 732k tokens). The researchers demonstrated multimodal capabilities by coarsely sketching a scene and saying, “Look at the event in this drawing. What page is this on?” With the entire text of Les Misérables in the prompt (1382 pages, 732k tokens), Gemini 1.5 Pro can identify and locate a famous scene from a hand-drawn sketch In another example, Gemini 1.5 Pro analyzed and summarized the 402-page transcripts from Apollo 11’s mission to the moon. “One small step for man, one giant leap for mankind.” Demo of Long Context Understanding Prompt In-Context Learning and the Machine Translation from One Book (MTOB) Benchmark Gemini 1.5 Pro can adapt and generate accurate responses based on minimal instruction. This capability is especially evident in complex tasks requiring understanding nuanced instructions or learning new concepts from a limited amount of information in the prompt. Gemini 1.5 Pro's in-context learning capabilities show its performance on the challenging Machine Translation from One Book (MTOB) benchmark. This benchmark tests the model's ability to learn to translate a new language from a single source of instructional material.  In the MTOB benchmark, Gemini 1.5 Pro was tasked with translating between English and Kalamang, a language with a limited online presence and fewer than 200 speakers. Despite these challenges, the report showed that Gemini 1.5 Pro achieved translation quality comparable to that of human learners with the same instructional materials.  This underscores the model's potential to support language learning and translation for underrepresented languages, opening new avenues for research and application in linguistics and beyond. Gemini 1.5 Pro Vs. 
Gemini Ultra While Gemini 1.5 Pro (2024) and Gemini Ultra (2023) are at the forefront of AI research and application, Gemini Pro 1.5 introduces several key advancements that differentiate it from Gemini Ultra. The table below provides an overview and comparison of both models. Use Cases  Analyzing Lengthy Videos Analyzing videos is another great capability brought by the fact that Gemini models are naturally multimodal, and this becomes even more compelling with long contexts. In the technical report, Gemini 1.5 Pro was able to analyze movies, like Buster Keaton’s silent 45-minute “Sherlock Jr.” movie. Using one frame per second, the researchers turned the movie into an input context of 684k tokens.  The model can then answer fairly complex questions about the video content, such as: “Tell me some key information from the piece of paper that is removed from the person’s pocket and the timecode of that moment.” Or, a very cursory line drawing of something that happened, combined with “What is the timecode when this happens?” Gemini 1.5 analyzing and reasoning over the 45-minute “Sherlock Jr.” movie You can see this interaction here: Multimodal prompting with a 44-minute movie Navigating Large and Unfamiliar Codebases As another code-related example, imagine you’re unfamiliar with a large codebase and want the model to help you understand the code or find where a particular functionality is implemented. In another example, the model can ingest an entire 116-file JAX code base (746k tokens) and help users identify the specific spot in the code that implements the backward pass for auto differentiation. It’s easy to see how the long context capabilities can be invaluable when diving into an unfamiliar code base or working with one you use daily. According to a technical lead, many Gemini team members have been finding it very useful to use Gemini 1.5 Pro’s long context capabilities on our Gemini code base. Gemini 1.5 navigating large and unfamiliar codebases What’s Next? According to a Google blog post, Gemini 1.5 Pro is currently in private preview, and its general availability with a standard 128,000-token context window will come later. Developers and enterprise customers can sign up to try Gemini 1.5 Pro with a context window of up to an experimental 1 million tokens via AI Studio and Google Vertex AI to upload hundreds of pages of text, entire code repos, and long videos and let Gemini reason across them. Try Gemini 1.5 Pro with a context window of up to an experimental 1 million tokens via AI Studio and Google Vertex AI That’s all for now. In the meantime, check out our resources on multimodal AI: Introduction to Multimodal Deep Learning GPT-4 Vision Alternatives Top Multimodal Annotation Tools
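If you sign up for the preview mentioned above, a minimal sketch with the google-generativeai Python SDK might look like the following. The model name string, the availability of a long-context variant, and the example file are assumptions that depend on what Google exposes to your account.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # key from Google AI Studio

# Model identifier is an assumption; check the model list exposed to your account.
model = genai.GenerativeModel("gemini-1.5-pro-latest")

with open("long_report.txt") as f:       # hypothetical long document
    document = f.read()

response = model.generate_content(
    f"Here is a long document:\n\n{document}\n\nSummarize the key findings in five bullet points."
)
print(response.text)
```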

February 16

10 min

sampleImage_open-ai-sora
OpenAI Releases New Text-to-Video Model, Sora

OpenAI has responded to the recent debut of Google's Lumiere, a space-time diffusion model for video generation, by unveiling its own creation: Sora. The diffusion model can transform short text descriptions into high-definition video clips up to one minute long. How Does Sora Work? Sora is a diffusion model that starts with a video that resembles static noise. Over many steps, the output gradually transforms by removing the noise. By providing the model with the foresight of multiple frames concurrently, OpenAI has resolved the complex issue of maintaining subject consistency, even when a subject momentarily disappears from view. OpenAI Sora - AI Video Output Similar to GPT models, Sora uses a transformer architecture. Images and videos are represented as patches, collections of smaller units of data. By representing the data in the same manner, OpenAI was able to train diffusion transformers on a wide range of data of different durations, resolutions, and aspect ratios. Sora leverages the recaptioning techniques from DALL·E 3, and as such, the model follows the user’s text instructions closely. Technical overview of OpenAI’s Sora OpenAI has released a few technical details on how its state-of-the-art diffusion model for video generation works. Here are the key methodologies and features employed in Sora’s architecture. Video Generated by OpenAI's Sora Unified Representation for Large-Scale Training Sora focuses on transforming visual data into a unified representation conducive to large-scale training of generative models. Unlike previous approaches that often concentrate on specific types of visual data or fixed-size videos, Sora embraces the variability inherent in real-world visual content. By training on videos and images of diverse durations, resolutions, and aspect ratios, Sora becomes a generalist model capable of generating high-quality videos and images spanning a wide range of characteristics. Patch-Based Representations Inspired by the use of tokens in large language models (LLMs), Sora adopts a patch-based representation of visual data. This approach effectively unifies diverse modalities of visual data, facilitating scalable and efficient training of generative models. Patches have demonstrated their effectiveness in modeling visual data, enabling Sora to handle diverse types of videos and images with ease. Turning Visual Data into Patches Video Compression Network To convert videos into patches, Sora first compresses the input videos into a lower-dimensional latent space, preserving both temporal and spatial information. This compression is facilitated by a specialized video compression network, which reduces the dimensionality of visual data while maintaining its essential features. The compressed representation is subsequently decomposed into spacetime patches, which serve as transformer tokens for Sora's diffusion transformer architecture. Diffusion Transformer Sora leverages a diffusion transformer architecture, demonstrating remarkable scalability as a video model. Diffusion transformers have proven effective across various domains, including language modeling, computer vision, and image generation. Sora's diffusion transformer architecture enables it to effectively handle video generation tasks, with sample quality improving significantly as training compute increases.
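The patch-based representation described above can be illustrated with a short sketch that cuts a video tensor into non-overlapping spacetime patches. The patch sizes below are arbitrary choices for illustration and are not Sora's actual configuration, which OpenAI has not disclosed.

```python
import torch

def spacetime_patchify(video, patch_t=4, patch_h=16, patch_w=16):
    """Split a video tensor into flattened spacetime patches (illustrative only).

    video: (channels, frames, height, width), with dimensions divisible by the patch sizes.
    Returns: (num_patches, patch_t * patch_h * patch_w * channels)
    """
    c, t, h, w = video.shape
    patches = (
        video.reshape(c, t // patch_t, patch_t, h // patch_h, patch_h, w // patch_w, patch_w)
             .permute(1, 3, 5, 2, 4, 6, 0)      # (nt, nh, nw, patch_t, patch_h, patch_w, c)
             .reshape(-1, patch_t * patch_h * patch_w * c)
    )
    return patches

video = torch.randn(3, 16, 256, 256)             # a 16-frame RGB clip
tokens = spacetime_patchify(video)               # (4 * 16 * 16, 3072) patch tokens
```

In Sora's case this patchification happens in the compressed latent space produced by the video compression network rather than on raw pixels, but the idea of turning a video into a sequence of transformer tokens is the same.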
Scaling Transformers for Video Generation Native Size Training for High-Quality Video Generation Sora benefits from training on data at its native size, rather than resizing, cropping, or trimming videos to standardized dimensions. This approach offers several advantages, including sampling flexibility, improved framing and composition, and enhanced language understanding. By training on videos at their native aspect ratios, Sora achieves superior composition and framing, resulting in high-quality video generation. Language Understanding and Text-to-Video Generation Training Sora for text-to-video generation involves leveraging advanced language understanding techniques, including re-captioning and prompt generation using models like DALL·E and GPT. Highly descriptive video captions improve text fidelity and overall video quality, enabling Sora to generate high-quality videos accurately aligned with user prompts. Capabilities of Sora OpenAI’s Sora can generate intricate scenes encompassing numerous characters, distinct forms of motion, and precise delineations of subject and background. As OpenAI states “The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.” Capabilities of OpenAI Sora Here is an extensive list of capabilities of Sora that OpenAI demonstrated. This definitely says a lot about how powerful it is as a text-to-video tool for creating content generation and simulation tasks. Prompting with Images and Videos Sora's flexibility extends to accepting inputs beyond text prompts, including pre-existing images or videos. Glimpse of Prompt Generated Artwork of an Art Gallery by OpenAI's Sora Animating DALL-E Images Sora can generate videos from static images produced by DALL·E, showcasing its ability to seamlessly animate still images and bring them to life through dynamic video sequences.  Current techniques for animating images utilize neural-based rendering methods to produce lifelike animations. However, despite these advancements, achieving precise and controllable image animation guided by text remains a challenge, especially for open-domain images taken in diverse real-world environments. Models like AnimateDiff, AnimateAnything, etc have also demonstrated promising results for animating static images. Extending Generated Videos Sora is adept at extending videos, whether forward or backward in time, to create seamless transitions or produce infinite loops. This capability enables Sora to generate videos with varying starting points while converging to a consistent ending, enhancing its utility in video editing tasks. Video-to-Video Editing Leveraging diffusion models like SDEdit, Sora enables zero-shot style and environment transformation of input videos, showcasing its capability to manipulate video content based on text prompts and editing techniques. Connecting Videos Sora facilitates gradual interpolation between two input videos, facilitating seamless transitions between videos with different subjects and scene compositions. This feature enhances Sora's ability to create cohesive video sequences with diverse visual content. Image Generation Sora is proficient in generating images by arranging patches of Gaussian noise in spatial grids with a temporal extent of one frame, offering flexibility in generating images of variable sizes up to 2048 x 2048 resolution. 
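The "start from noise and remove it step by step" idea described at the top of this post can be sketched in a few lines. The `denoiser` callable and the noise schedule here are generic placeholders for a standard DDPM-style sampler, not Sora's actual components.

```python
import torch

@torch.no_grad()
def sample(denoiser, shape, num_steps=50):
    """Generic DDPM-style sampling loop: start from Gaussian noise, denoise iteratively.

    denoiser(x, t) is assumed to predict the noise present in x at timestep t.
    """
    x = torch.randn(shape)                                   # pure noise, e.g. latent video patches
    betas = torch.linspace(1e-4, 0.02, num_steps)            # illustrative noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    for t in reversed(range(num_steps)):
        predicted_noise = denoiser(x, t)
        # Remove the predicted noise contribution for this step (DDPM posterior mean).
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * predicted_noise) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # re-inject a little noise
    return x
```

In a video diffusion transformer the `denoiser` operates on the sequence of spacetime patch tokens shown earlier, conditioned on the text prompt, and the final latents are decoded back into frames.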
Photorealistic Image Generation Capability of OpenAI Sora Simulation Capabilities At scale, Sora exhibits amazing simulation capabilities, enabling it to simulate aspects of people, animals, environments, and digital worlds without explicit inductive biases. These capabilities include: 3D Consistency: Generating videos with dynamic camera motion, ensuring consistent movement of people and scene elements through three-dimensional space. Long-Range Coherence and Object Permanence: Effectively modeling short- and long-range dependencies, maintaining temporal consistency even when objects are occluded or leave the frame. Interacting with the World: Simulating actions that affect the state of the world, such as leaving strokes on a canvas or eating a burger with persistent bite marks. Simulating Digital Worlds: Simulating artificial processes, including controlling players in video games like Minecraft while rendering high-fidelity worlds and dynamics. Limitations of Sora Limitation of OpenAI's Sora - Glass Shattering Effect OpenAI acknowledged that the current AI model has known weaknesses, including: struggling to accurately simulate the physics of complex scenes, understanding some instances of cause and effect, confusing spatial details in a prompt, and precisely describing events that unfold over time. Safety Considerations of Sora OpenAI is currently working with a team of red teamers to test the AI model prior to making Sora available to OpenAI users. These red teamers consist of domain experts familiar with misinformation, hateful content, and bias. In their release, OpenAI has stated that they will not only leverage the existing safety methods developed for the release of DALL·E 3 but also go one step further and build tools to detect misleading content, including a detection classifier that can identify a video generated by Sora. Once the model is released in OpenAI’s products, outputs will include C2PA metadata and be monitored by OpenAI's text and image classifiers: input prompts that violate the usage policy will be rejected, and video outputs will be reviewed frame by frame. In addition to all these safety precautions, OpenAI has also stated they will engage policymakers, educators, and artists to understand concerns and identify use cases for the model. Text-to-video synthesis with Sora Noteworthy Text to Video Generation Models Google’s Lumiere Google’s recent introduction of its text-to-video diffusion model, Lumiere, is truly remarkable as well. It is designed to generate realistic, diverse, and coherent motion in videos. Lumiere’s capabilities include: text-to-video generation, image-to-video generation, stylized generation, text-based video editing, animating the content of an image within a user-provided region, and video inpainting. Unlike traditional approaches that rely on cascaded designs involving distant keyframe generation and subsequent temporal super-resolution, Lumiere introduces a Space-Time U-Net architecture. This architecture allows Lumiere to generate the entire temporal duration of the video at once, streamlining the synthesis process and improving global temporal consistency. Google Lumiere's Prompt Generated AI Video By incorporating spatial and temporal down- and up-sampling techniques and leveraging pre-trained text-to-image diffusion models, Lumiere achieves remarkable results in generating full-frame-rate, low-resolution videos.
This approach not only enhances the overall visual quality of the synthesized videos but also facilitates a wide range of content creation and video editing applications, including image-to-video conversion, video inpainting, and stylized generation. For more information, read the paper Lumiere: A Space-Time Diffusion Model for Video Generation.

Stability AI's Stable Video Diffusion

Stability AI introduced Stable Video Diffusion, a latent video diffusion model designed for state-of-the-art text-to-video and image-to-video generation tasks. Leveraging recent advancements in latent diffusion models (LDMs) initially trained for 2D image synthesis, Stability AI extends their capabilities to generate high-resolution videos by incorporating temporal layers and fine-tuning them on specialized video datasets.

Stable Video Diffusion

Stability AI addresses the lack of standardized training methods by proposing and evaluating three key stages for successfully training video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Emphasizing the importance of a meticulously curated pretraining dataset for high-quality video synthesis, Stability AI presents a systematic curation process, including strategies for captioning and data filtering, to train a robust base model.

The Stable Video Diffusion model demonstrates the effectiveness of finetuning the base model on high-quality data, resulting in a text-to-video model that competes favorably with closed-source video generation methods. The base model not only provides a powerful motion representation for downstream tasks such as image-to-video generation but also adapts to camera-motion-specific LoRA modules. Stability AI also showcases the model's strong multi-view 3D prior by using it as a foundation for fine-tuning a multi-view diffusion model that generates multiple views of objects in a feedforward manner. This approach outperforms image-based methods while requiring a fraction of their compute budget, highlighting the efficiency and effectiveness of Stable Video Diffusion in generating high-quality videos across various applications. For more information, read the paper Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets.

Meta's Make-A-Video

Meta introduced Make-A-Video two years ago. Make-A-Video leverages paired text-image data to learn representations of the visual world and uses unsupervised learning on unpaired video data to capture realistic motion. This approach offers several advantages:

It expedites the training of text-to-video models by leveraging pre-existing visual and multimodal representations
It eliminates the need for paired text-video data
It inherits the vast diversity of aesthetic and fantastical depictions from state-of-the-art image generation models

Meta's Make-A-Video Generated Graphic

Make-A-Video is a simple yet effective architecture that builds on text-to-image models with novel spatial-temporal modules. First, the full temporal U-Net and attention tensors are decomposed and approximated in space and time (see the sketch below). Then, a spatial-temporal pipeline generates high-resolution, high-frame-rate videos, incorporating a video decoder, an interpolation model, and two super-resolution models to enable applications beyond text-to-video synthesis. Despite the limitations of text descriptions of images, Make-A-Video demonstrates surprising effectiveness in generating short videos.
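The space/time decomposition mentioned above can be sketched as two cheaper attention passes: one over the spatial positions within each frame, then one over the frames at each spatial position. The module below is a minimal illustration of that factorization under assumed dimensions, not Meta's implementation.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeAttention(nn.Module):
    """Approximates full spatio-temporal attention with two cheaper passes:
    spatial attention within each frame, then temporal attention across frames."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, frames, positions, dim)
        b, t, p, d = tokens.shape

        # Spatial pass: attend over the positions of each frame independently.
        x = tokens.reshape(b * t, p, d)
        x, _ = self.spatial(x, x, x)
        x = x.reshape(b, t, p, d)

        # Temporal pass: attend over frames at each spatial position.
        x = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        x, _ = self.temporal(x, x, x)
        return x.reshape(b, p, t, d).permute(0, 2, 1, 3)

tokens = torch.randn(2, 8, 16, 64)   # 8 frames, 16 patch positions per frame
print(FactorizedSpaceTimeAttention()(tokens).shape)   # torch.Size([2, 8, 16, 64])
```

Factorizing this way keeps the cost close to that of a text-to-image model while still letting information flow across frames.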
By extending spatial layers to include temporal information and incorporating new attention modules, Make-A-Video accelerates text-to-video (T2V) training and enhances visual quality.

Sora: Key Highlights

With a state-of-the-art diffusion model, Sora empowers users to effortlessly transform text descriptions into captivating high-definition video clips, changing the way we bring ideas to life. Here are the key highlights of Sora:

Architecture: Uses a diffusion model with a transformer backbone for efficient training.
Methodologies: Relies on a unified, patch-based representation, a video compression network, and a diffusion transformer.
Capabilities: Includes image and video prompting, DALL·E image animation, video extension and editing, and image generation.
Limitations: Weaknesses in simulating complex scene physics and understanding causality.
Safety considerations: Emphasizes red team testing, content detection, and engagement with stakeholders.
Other significant text-to-video models: Lumiere, Stable Video Diffusion, and Make-A-Video.

February 15

3 min

Product Updates [January 2024]

Happy New Year 2024! We put our heads down in January for a fashionably late new year's welcome party. Check out below to see everything we've prepared, including:

New home page and onboarding
A new way to manage your data
Workflows usability enhancements
DICOM MPR annotations and other improvements
Simple and intuitive Encord Active imports

Welcome to Encord

We're opening the new year with a new home page that points directly to Projects, Datasets, and Ontologies, so you're never more than one click away from creating any of these entities. Of course, you can continue to do so from the relevant list pages as well. We've also introduced in-context documentation and explainers after key actions to give you the information you need when you need it. Welcome to the new year, and welcome to Encord!

Interact With and Manage Your Data Directly

Registering and indexing your data into your annotation platform is a critical step on your journey to a first-class AI application. Keeping your cloud integrations front and center makes it easy to manage all your data sources. Read about it here. You'll notice Datasets and Integrations appearing under a new section: Index! We're working with select customers on the new Storage application to manage your data directly. Storage lets you step past Datasets and interact with every file registered on Encord directly. Global file search, tracking and assigning dataset membership, and the ability to organize files into folders to suit your purpose are some of the initial features we're rolling out. Contact us at support@encord.com to be part of the early cohort helping us develop and iterate on how you manage your data! Get up to speed with the Storage explainer, here.

Workflows Usability Enhancements

We upgraded Workflows to the default task management system at the end of last year, and have continued to add functionality and scale. Creating workflow projects now happens in one simple interface: from 1, 2, 3, 4 to 1 and done! We've also added several popular features. You can now change task routing on percentage routers after project creation, and annotators can skip difficult tasks and leave a comment so project managers understand what is happening. We've also made it possible to update priorities and attach comments in bulk via the SDK, so these operations scale to many thousands of tasks quickly.

Smooth Integration of Encord Active and Annotate

The import process has been split into four separate stages to give you quicker access to useful features and let you start working with Active within minutes instead of hours. Choose between Data, Labels, Embeddings, and Quality metrics to start your curation or label validation workflow. We've also enhanced the integration going from Active to Annotate. You can now adjust task priorities and comments from Active in bulk to optimize your annotation workflows, ensuring the highest-value data gets annotated first. Use priorities and comments to surface the most important data and instructions to your annotation team, making the most of your resources and your team's time. Finally, bulk classification allows you to classify multiple images or frames simultaneously, accelerating some classification workloads from hours to minutes. Combined with a foundation model such as CLIP, this feature is powerful for classifying the simpler classes in your ontology before your annotation teams handle the details (see the sketch below).
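As a hedged illustration of how a foundation model such as CLIP can pre-classify the simpler classes in an ontology, the sketch below scores a single exported frame against a few candidate class names using the public Hugging Face CLIP checkpoint. The file name, class list, and confidence threshold are hypothetical, and this is not Encord's internal implementation.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Publicly available CLIP checkpoint; swap in whichever foundation model you prefer.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

candidate_classes = ["daytime scene", "night scene", "indoor scene"]  # simple ontology classes
image = Image.open("frame_0001.jpg")  # hypothetical frame exported from your dataset

# Zero-shot classification: compare the image embedding against each class prompt.
inputs = processor(text=candidate_classes, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

best = probs.argmax().item()
score = probs[best].item()
if score > 0.8:  # only trust confident predictions; leave the rest to annotators
    print(f"Pre-classify as '{candidate_classes[best]}' ({score:.2f})")
else:
    print("Low confidence - route to human annotation")
```

Confident predictions like these could be applied in bulk, while low-confidence frames stay with your annotation team for review.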
Encord Active: Search Anything and Look Good Doing It

Take your dataset querying to the next level with Search Anything! Building on the success of our natural language search feature, we're excited to introduce a new dimension to your search capabilities. Our latest feature, Search Anything, uses embeddings-based search to find images in your dataset that closely resemble an external image you provide. Simply drag and drop your chosen external image into Encord Active, and our algorithms will match it against your dataset, quickly and accurately identifying the most similar images for your review. For when you can't describe an image, let the image describe itself!

Coming Soon: We are improving the model predictions import process in the next month to make it even easier to get your models into Encord for evaluation, comparison, or pre-labeling. If you would like to provide input or feedback on the process, please reach out to product@encord.com.

DICOM Improvements

We've redoubled our efforts to make the DICOM annotation tool fit your workload. To start off the year, we're pleased to be rolling out increased access to not one but two foundational improvements to our DICOM tool. Regardless of your modality (X-ray, CT, MRI, etc.), we've updated our recently introduced rendering engine. We've seen performance increases surpassing 3x loading-time improvements with our recent changes and are very excited to be making it available. We've also added functionality to the MPR brush capabilities: the default mode is to annotate one slice at a time using the 2D brush, but you can also select the 3D brush mode to annotate multiple slices at once. Refer to the reconstructed view documentation for the details and best practices. Contact us if you're interested in hearing more and would like to be part of the roll-out period.

You can now also use our custom overlay feature to make your annotation experience as convenient as possible. Custom overlays let you control which DICOM information is surfaced and always visible in the editor interface. This means you won't have to browse the metadata to find the information you need to annotate with accuracy! Full instructions and illustrations are available in the documentation.

Thanks for reading. As always, the inbox at product@encord.com is open, and we'd love to hear your thoughts and feedback on the above! Talk to you soon!

February 15

5 min

Multiplanar Reconstruction (MPR) in the DICOM Editor

MPR transforms cross-sectional scans into 2D orthogonal images (coronal, sagittal, and axial views) and is crucial for a comprehensive understanding of human anatomy.

MPR in Encord

Our latest updates to MPR allow you to create Bitmask annotations directly on reconstructed views, in any annotation window within the Label Editor. Annotations made in one view automatically appear in all other views, and cross-reference lines are projected across all three views, assisting with precise anatomy annotation. All reconstructed views can also be transformed into detailed 3D renderings that prominently display your annotations.

Labeling in Encord's MPR: A How-To Guide

Now, let's get into the practical aspects of using MPR in Encord's Label Editor:

Step 1: Open a DICOM annotation task in the Label Editor. The primary view is your main workspace, positioned on the left in the Label Editor. Reconstructed views are conveniently placed on the right.

Step 2: Select a Bitmask object and apply a label in any window. Use the 'Annotate from this tile' button to change the tile you're working on.

By incorporating these steps into your workflow, you can leverage the full capabilities of MPR in Encord. Explore the Label Editor, experiment with Bitmask labels, and observe your annotations synchronize seamlessly across different views. For a concrete picture of what reconstruction means, see the short sketch below.
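To illustrate the reconstruction idea itself, the sketch below stacks an axially acquired DICOM series into a 3D volume with pydicom and NumPy and re-slices it along the three orthogonal planes. The folder path and the assumption of evenly spaced axial slices are illustrative; this is not Encord's rendering engine.

```python
import glob
import numpy as np
import pydicom

# Load an axially acquired series (one file per slice) and sort by slice position.
slices = [pydicom.dcmread(path) for path in glob.glob("series/*.dcm")]  # hypothetical folder
slices.sort(key=lambda s: float(s.ImagePositionPatient[2]))

# Stack into a (depth, height, width) volume.
volume = np.stack([s.pixel_array for s in slices]).astype(np.int16)
d, h, w = volume.shape

# Multiplanar reconstruction: re-slice the same volume along the three orthogonal planes.
axial    = volume[d // 2, :, :]   # the plane the scanner acquired
coronal  = volume[:, h // 2, :]   # front-to-back view
sagittal = volume[:, :, w // 2]   # left-to-right view

print(axial.shape, coronal.shape, sagittal.shape)
```

Because all three planes are views of the same underlying volume, a label painted on one plane can be projected into the others, which is exactly what the synchronized annotations in the Label Editor rely on.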

February 12

3 min

