Encord Monthly Wrap: January Industry Newsletter
![blog image](https://images.prismic.io/encord/642398b5-fa00-4be8-8a0c-5444392d5bbc_Newsletter+-+Banner+-+January.png?auto=compress%2Cformat&fit=max&w=906&h=638)
Contents
Top Picks for Computer Vision Papers You Should See
Want to get hands-on? Check Out These Computer Vision Tutorials
Developer Resources You’d Find Useful
Practical Computer Vision Use Cases
Top 3 Resources by Encord in January
Our Power Tip of the Month
Written by
![author-avatar-url](https://images.prismic.io/encord/5a0a7232-18e4-4603-addd-1559845af259_Stephen+bio+pic.jpeg?auto=compress%2Cformat&fit=max&w=80&h=80)
Stephen Oladele
Welcome to the January 2024 edition of Encord's Monthly Wrap.
It’s also our chance to wish you a belated happy new year!
Here’s what you should expect:
- Two interesting computer vision papers we reckon you should check out.
- Hands-on tutorials you can work on during weekends.
- Developer resources you should bookmark, including Colab Notebooks.
- Computer vision use cases in manufacturing and robotics.
- A power tip for computer vision data explorers.
Let’s dive in!
Top Picks for Computer Vision Papers You Should See
Segment Anything in Medical Images (MedSAM)
This paper presents MedSAM, a novel adaptation of the Segment Anything Model (SAM) specifically for medical images.
What’s impressive? 🤯
- It introduces a large-scale medical image dataset with over 200,000 masks across 11 modalities and utilizes a fine-tuning method to adapt SAM for general medical image segmentation.
- It demonstrates superior performance over the original SAM, significantly improving the Dice Similarity Coefficient on 3D and 2D segmentation tasks.
There’s also an accompanying repository with a shoutout to one of our pieces on fine-tuning SAM 😉.
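If you want a quick intuition for the Dice Similarity Coefficient the paper reports, here is a minimal NumPy sketch for comparing a predicted binary mask against ground truth (variable names are illustrative, not from the MedSAM codebase):

```python
import numpy as np

def dice_coefficient(pred_mask: np.ndarray, gt_mask: np.ndarray, eps: float = 1e-7) -> float:
    """Dice Similarity Coefficient between two binary masks: 2*|A ∩ B| / (|A| + |B|)."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)

# Toy example: two overlapping square masks
pred = np.zeros((64, 64), dtype=np.uint8)
pred[10:40, 10:40] = 1
gt = np.zeros((64, 64), dtype=np.uint8)
gt[15:45, 15:45] = 1
print(f"DSC: {dice_coefficient(pred, gt):.3f}")  # ~0.69
```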
CLIP in Medical Imaging: A Comprehensive Survey
This survey explores the application of Contrastive Language-Image Pre-Training (CLIP) in the medical imaging domain. It delves into how CLIP is adapted for image-text alignment and implemented across various clinical tasks.
What’s impressive? 👀
- It provides an in-depth analysis of CLIP's utility in medical imaging, covering the challenges of adapting it to the specific requirements of medical images.
- It shows how well CLIP generalizes to tasks like 2D and 3D medical image segmentation, medical visual question answering (MedVQA), and medical report generation.
Illustration of CLIP’s generalizability via domain identification
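To get a feel for the image-text alignment the survey builds on, here is a minimal zero-shot sketch using the Hugging Face transformers implementation of the general-purpose openai/clip-vit-base-patch32 checkpoint. The candidate labels only mimic the domain-identification idea above and are purely illustrative; a medically fine-tuned CLIP would be needed for real use.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative "domain identification": score one image against modality descriptions.
labels = ["a chest X-ray", "a brain MRI scan", "an abdominal CT scan", "a photo of a cat"]
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # placeholder image
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1).squeeze()

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```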
Medical professionals use Encord’s DICOM & NIfTI Editor to quickly label large training datasets across modalities such as CT, X-ray, ultrasound, mammography, and MRI.
- How Harvard Medical School and MGH Cut Down Annotation Time and Model Errors with Encord
- Stanford Medicine reduced experiment times by 80%.
- Floy reduced label times by 50% for CT and 20% for MRI scans.
Want to get hands-on? Check Out These Computer Vision Tutorials
- [COLAB NOTEBOOK] How to Use the Depth Anything Model → The Depth Anything model is trained jointly on 1.5 million labeled images and 62 million+ unlabeled images, yielding highly capable Monocular Depth Estimation (MDE) foundation models. This notebook shows you how to use the pipeline API to perform inference with any of the models (see the inference sketch after this list). Here is the original paper.
- How to Detect Data Quality Issues in Torchvision Dataset using Encord Active → This article shows you how to use Encord Active to explore images you have preloaded with Torchvision, identify and visualize potential issues, and take the next steps to rectify low-quality images.
- How to Use OpenCV With Tesseract for Real-Time Text Detection → This is a code walkthrough on building an app that performs real-time text detection from a webcam feed (a rough sketch of the core loop follows this list).
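Before opening the Depth Anything notebook, this is roughly what the pipeline API flow looks like; the checkpoint name below is an assumption, so swap in whichever Depth Anything checkpoint you prefer from the Hub:

```python
import requests
from PIL import Image
from transformers import pipeline

# Checkpoint name is an assumption; any Depth Anything checkpoint on the Hub works.
depth_estimator = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-small-hf")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # any RGB image
image = Image.open(requests.get(url, stream=True).raw)

result = depth_estimator(image)
result["depth"].save("depth_map.png")  # PIL image of the predicted relative depth map
```

And for the OpenCV + Tesseract tutorial, a rough sketch of the core loop looks like this, assuming Tesseract is installed locally and pytesseract can find it:

```python
import cv2
import pytesseract

cap = cv2.VideoCapture(0)  # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    data = pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)
    for i, text in enumerate(data["text"]):
        if text.strip() and float(data["conf"][i]) > 60:  # keep confident detections only
            x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(frame, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.imshow("Real-time text detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to quit
        break
cap.release()
cv2.destroyAllWindows()
```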
Developer Resources You’d Find Useful
- How to Pre-Label Data at Speed with Bulk Classifications → If you're working with large unlabeled datasets and want to quickly classify and curate them for labeling, you’ll find our tutorial on pre-labeling data at warp speed with bulk classifications useful.
- Best Image Annotation Tools for Computer Vision [Updated 2024] → Choosing the right image annotation tool is a critical decision that can significantly impact the quality and efficiency of the annotation process. To help you make an informed choice, this article weighs several factors and evaluates image annotation tools suited to your business needs.
- Generate Synthetic Data for Deep Object Pose Estimation Training with NVIDIA Isaac ROS → NVIDIA developed Deep Object Pose Estimation (DOPE) to find the six degrees of freedom (6-DOF) pose of an object. This article illustrates how to generate synthetic data to train a DOPE model for a given object.
- Best Computer Vision Projects With Source Code And Dataset → An article with 16 computer vision project ideas, complete with source code and datasets, to get beginners building.
Practical Computer Vision Use Cases
- Top 8 Use Cases of Computer Vision in Manufacturing → This article discusses the diverse applications of computer vision across various manufacturing industries, detailing their benefits and challenges, from product design and prototyping to operational safety and security.
- Top 8 Applications of Computer Vision in Robotics → This article explores computer vision applications in the robotics domain and mentions key challenges the industry faces today, from autonomous navigation and mapping to agricultural robotics.
Top 3 Resources by Encord in January
- How to Adopt a Data-Centric AI → For data teams to succeed in the long term, they must use high-quality data to build successful AI applications. But what is the secret sauce for building successful and sustainable AI based on high-quality data? A data-centric AI approach! We released this whitepaper to guide you on developing an effective data-centric AI strategy.
- Top 15 DICOM Viewers for Medical Imaging → In the market for a DICOM viewer? We published a comparison article that discusses what to look for in an ideal viewer and the options in the market so you can make the optimal choice.
- Instance Segmentation in Computer Vision: A Comprehensive Guide → We published an all-you-need-to-know guide on instance segmentation, including details on techniques like single-shot instance segmentation and transformer- and detection-based methods. We also cover the U-Net and Mask R-CNN architectures, practical applications of instance segmentation in medical imaging, and the key challenges (a minimal Mask R-CNN inference sketch follows this list).
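If you want to tinker with one of the architectures covered in the instance segmentation guide, here is a minimal Mask R-CNN inference sketch using torchvision's pretrained weights (not Encord-specific; the image path is a placeholder):

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights

weights = MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = maskrcnn_resnet50_fpn(weights=weights).eval()
preprocess = weights.transforms()

image = read_image("example.jpg")  # placeholder path; any RGB image
with torch.no_grad():
    prediction = model([preprocess(image)])[0]

# Keep confident instances; each comes with a label, box, score, and soft mask.
keep = prediction["scores"] > 0.7
for label, score in zip(prediction["labels"][keep], prediction["scores"][keep]):
    print(weights.meta["categories"][label], f"{score:.2f}")
```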
Our Power Tip of the Month
If you are trying to become a computer vision data power user, I’ve got a tip to help supercharge your exploration gauntlet (I see you, Thanos 😉).
Within Encord Active, you can see the metric distribution of your data to identify potential data gaps that could influence model performance on outliers or edge cases. Here’s how to do it in 3 steps on the platform: Analytics >> Scroll down to Metric Distribution >> Choose a pre-built or custom Metric, and observe!
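If you want a quick, notebook-style approximation of that check outside the platform, a rough sketch like the one below plots the distribution of a single quality metric (mean brightness) across a folder of images. This is not the Encord Active API, just an illustrative stand-in, and the folder path is a placeholder:

```python
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

image_dir = Path("my_dataset/images")  # placeholder path
brightness = [
    np.asarray(Image.open(path).convert("L")).mean()
    for path in sorted(image_dir.glob("*.jpg"))
]

plt.hist(brightness, bins=30)
plt.xlabel("Mean brightness (0-255)")
plt.ylabel("Number of images")
plt.title("Metric distribution: brightness")
plt.show()
# Images in the tails of the distribution are candidates for a closer look (data gaps / edge cases).
```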
Good stuff 🤩. I hope you find it useful. Here are other quick finds if you 💓 Encord and computer vision data stuff ⚡:
- Data-centric computer vision blog
- Join the Encord Community to discuss the resources
- GitHub repo
- The Docs
Till next month, have a super-sparkly time!