Encord Blog

Immerse yourself in vision

Trends, Tech, and beyond


Announcing Encord’s $30 million Series B funding

Today, we are excited to announce that Encord has raised $30 million in Series B funding to invest fully in the future of multimodal AI data development.

It’s been a little over three years since we launched our product during Y Combinator’s winter 2021 batch, where, as a two-person company, we were sitting in a California dining room, sending cold emails during the day and struggling with AI model library dependencies at night. While the company has grown significantly and evolved to meet the seismic movements in the AI space, we have not lost the two core convictions we’ve had since the early days of YC: data is the most important factor in AI, and the path to building a great company is to delight customers.

Currently the youngest AI company from Y Combinator to raise a Series B, we have grown to become the leading data development platform for multimodal AI. Our goal is to be the final AI data platform a company ever needs. We have already helped over 200 of the world’s top AI teams, including those at Philips, Synthesia, Zeitview, Northwell Health, and Standard AI, strengthen their data infrastructure for AI development. Our focus on creating high-quality AI data for training, fine-tuning, and validation has helped our customers produce better models, faster.

We’re thrilled to have Next47 lead the round, with participation from our existing investors, including CRV, Crane Venture Partners, and Y Combinator. The continued support from our existing investors is a testament to the importance of our mission and recognition that the AI department of the future is the IT department of the past.

It’s all about the data

The technological platform shift driven by the proliferation of AI will solve problems previously thought unsolvable by technology and, much like the rise of the internet in the previous generation, will touch every person on the planet. This shift has been driven by rapid innovation in the compute and model layers, two of the three main pillars for building AI applications. Innovation in the data layer, however, arguably the most important, most proprietary, and most defensible ingredient, has been stagnant. The data layer has fallen victim to a concoction of hastily built in-house tools and ad-hoc orchestration of distributed workforces for annotation and validation, hurting data quality and ultimately model performance.

Powering the models of many of the world’s top AI teams at world-leading research labs and enterprises, we’ve witnessed firsthand the importance of having clean, traceable, governed, and secure data. Knowing what data to put into your model and what data to take out is a prerequisite to true production-level applications of generative and predictive AI.

At Encord, we think and talk a lot about how we can continue to support our users in their AI adoption and acceleration journey as they cross the chasm from prototype to production. That’s why we have broken down the data problem into first principles and continue to build best-in-class solutions for each of the core components of an AI data development platform: data management & curation, annotation, and model evaluation. We seek to tie solutions to these core components together in a single, seamlessly integrated solution that works with the petabyte-scale datasets our clients leverage in their journey to monetize their data and turn it into AI. Some call it a data engine. Some call it a data foundry. We call it a data development platform.

The future of data is the future of us

We’re especially excited and proud of our product momentum. In the last three months alone we have added an agentic data workflow system, a consensus quality control protocol, support for audio, world-leading segmentation tracking, and many other features. We’ve also continued to make high-quality data annotation smoother and faster with the latest automation and foundation models, integrating Meta’s new Segment Anything Model into our platform less than a day after it was released, and the vision-language models LLaVA and GPT-4o in the same week they each became publicly available. We plan to leverage additional capital to accelerate our product roadmap so that we can support our users—existing and new—in even more ways than we have before.

With this commitment to continued innovation of the data layer, we’re proud to publicly launch Encord Index to bring ease to multimodal data management and curation. Index is an end-to-end data management platform allowing our users to visualize, search, sort, and manage their internal data at scale. Index gives AI teams full control and governance over their data, helping them understand and operationalize large private datasets in a collaborative and secure way. It integrates seamlessly with data storage such as AWS S3, GCP Cloud Storage, Azure Blob, and others to automate curation of the best data and remove uninformative or biased data. As a result, our customers have achieved a 35% reduction in dataset size by curating the best data, seen upwards of 20% improvement in model performance, and saved hundreds of thousands of dollars in compute and human annotation costs.

“Successful state-of-the-art models, like our recently released Expressive Avatar foundation model EXPRESS-1, require highly sophisticated infrastructure. Encord Index is a high-performance system for our AI data, enabling us to sort and search at any level of complexity. This supports our continuous efforts to push the boundaries of AI avatar technology and meet customer needs,” said Victor Riparbelli, Co-Founder and CEO of Synthesia, the billion-dollar generative AI company.

We’re in the early innings of building a generational company that will play a key role in the AI revolution. Thank you to our users, investors, Encordians, and partners, who make all of this possible every day. We are very excited for what’s to come.

Aug 13 2024

6 min read

Trending Articles

1. Announcing the launch of Consensus in Encord Workflows
2. The Step-by-Step Guide to Getting Your AI Models Through FDA Approval
3. Best Image Annotation Tools for Computer Vision [Updated 2024]
4. Top 8 Use Cases of Computer Vision in Manufacturing
5. YOLO Object Detection Explained: Evolution, Algorithm, and Applications
6. Active Learning in Machine Learning: Guide & Strategies [2024]
7. Training, Validation, Test Split for Machine Learning Datasets


From Vision to Edge: Meta’s Llama 3.2 Explained

Meta just released Llama 3.2, the next era of its open-source AI models, building on Llama 3.1. It introduces high-performance lightweight models optimized for mobile devices and vision models capable of advanced image reasoning, making it ideal for tasks like summarization, instruction following, and image analysis across a range of environments.

Key Features of Llama 3.2

- Expanded Model Variants: Llama 3.2 offers both lightweight large language models (LLMs) (1B and 3B) and medium-sized vision models (11B and 90B). This variety allows developers to select models tailored to their specific use cases, ensuring optimal performance whether running on edge devices or more powerful systems.
- Context Length Support: The lightweight 1B and 3B models support an impressive context length of up to 128K tokens, making them state-of-the-art for on-device applications such as summarization, instruction following, and rewriting tasks.
- Hardware Compatibility: These models are optimized for deployment on Qualcomm and MediaTek hardware, as well as Arm processors, enabling efficient use in mobile and edge environments right from day one.

Llama 3.2: Vision Models

Llama 3.2 is the first of the Llama series to incorporate vision capabilities. Here are its key vision features:

- Image Understanding: The 11B and 90B vision models can perform complex tasks such as document-level understanding (including charts and graphs), image captioning, and visual grounding. For example, these models can analyze a sales graph and provide insights on business performance or interpret maps to answer geographical queries.
- Integration of Vision and Language: These multimodal models can extract details from images, understand context, and generate coherent text, making them ideal for AI applications that require comprehensive understanding and reasoning across different modalities.
- Drop-In Compatibility: The vision models serve as drop-in replacements for their corresponding text model equivalents, allowing developers to easily transition between text and vision tasks without extensive modifications to their existing applications.

Model card of the vision models.

Llama 3.2: Lightweight Models

The lightweight LLMs in the Llama 3.2 family (1B and 3B) are engineered for high efficiency and performance, particularly in constrained environments. Here’s what sets them apart:

- Multilingual Capabilities: These models support multilingual text generation and are equipped with tool-calling functionalities.
- On-Device Privacy: A standout feature of the lightweight models is their ability to run locally on devices. This offers significant advantages, such as instantaneous processing of prompts and responses, as well as enhanced privacy by keeping sensitive data on the device and minimizing cloud interactions.
- Training Techniques: The lightweight models benefit from advanced training methods, including pruning and knowledge distillation. Pruning reduces model size while retaining performance, and knowledge distillation allows smaller models to learn from larger, more powerful models, ensuring they deliver high-quality results even in limited environments.

Model card of the lightweight models.

Technical Overview of Llama 3.2

Model Architecture

- Vision Models (11B and 90B): Designed for image reasoning tasks, integrating an image encoder with the pre-trained language model. This allows them to excel in tasks like image captioning and visual grounding.
- Lightweight Models (1B and 3B): Optimized for edge and mobile devices, these models utilize pruning and knowledge distillation to retain performance while reducing size, making them suitable for on-device applications.

Training Process

Llama 3.2’s training involves several stages:

- Pre-training: It begins with the Llama 3.1 text models, adding image adapters and pre-training on large-scale image-text pair data.
- Adapter Training: This stage aligns image and text representations while maintaining existing language capabilities.
- Post-training: The models undergo supervised fine-tuning and synthetic data generation to optimize performance and safety.

The lightweight models support a context length of 128K tokens and outperform competitors in summarization, instruction following, and multilingual generation.

Llama 3.2 vs Llama 3.1

- Model Variants: Llama 3.2 offers multiple models (1B, 3B, 11B, and 90B) with multimodal capabilities (text + vision), while Llama 3.1 primarily features a large 405B-parameter text-only model.
- Multimodal Capabilities: Llama 3.2 introduces vision models that can process images and text, enabling tasks like visual reasoning and image captioning, whereas previous Llama models are limited to text inputs.
- Efficiency and Deployment: Llama 3.2’s smaller models are optimized for edge and mobile devices, while Llama 3.1 requires significant computational resources for deployment.
- Context Length: Both models support an extensive context length of 128K tokens (approximately 96,240 words), allowing for detailed input processing.
- Language Support: Llama 3.2 officially supports eight languages and can be fine-tuned for more, while Llama 3.1 has multilingual capabilities but with less specificity on language support.

Performance of Llama 3.2

The performance of Llama 3.2 has been evaluated against leading foundation models such as Anthropic’s Claude 3 Haiku and OpenAI’s GPT-4o mini in image recognition and a variety of visual understanding tasks. Specifically, the 3B model has shown superior performance over models like Gemma 2 (2.6B) and Phi 3.5-mini in critical areas such as following instructions, summarization, prompt rewriting, and tool usage. Meanwhile, the 1B model remains competitive with Gemma in several benchmarks.

Llama Stack Distributions

The Llama Stack distributions offer developers a standardized interface through the Llama Stack API. This API provides essential toolchain components for fine-tuning and synthetic data generation, facilitating the creation of agentic applications. The Llama Stack offers several key components to enhance usability and accessibility. The Llama CLI (command line interface) enables users to build, configure, and run Llama Stack distributions seamlessly. Client code is available in multiple programming languages, including Python, Node.js, Kotlin, and Swift, ensuring broad compatibility for developers. Docker containers are provided for the Llama Stack Distribution Server and Agents API Provider, facilitating easy deployment across different environments.

The distribution options include:

- Single-node Llama Stack Distribution: Available through Meta’s internal implementation and Ollama.
- Cloud Llama Stack Distributions: Supported by platforms such as AWS, Databricks, Fireworks, and Together.
- On-device Llama Stack Distribution: Designed for iOS using PyTorch ExecuTorch.
- On-prem Llama Stack Distribution: Supported by Dell for localized deployments.

Llama 3.2 Safety

Llama 3.2 incorporates Llama Guard 3 to enhance safety and responsibility. This update includes Llama Guard 3 11B Vision, which filters text and image prompts to support the new image understanding capabilities. The 1B and 3B models have also been optimized for lower deployment costs, with Llama Guard 3 1B reduced from 2,858 MB to 438 MB for greater efficiency in constrained environments. These safeguards are integrated into reference implementations and are readily available for the open-source community. Read the documentation for more information on Llama Guard 3.

Real-World Applications of Llama 3.2

The availability of Llama 3.2 on edge devices is going to open up many generative AI applications.

On-Device Personal Assistants
The lightweight 1B and 3B models facilitate the development of personalized, privacy-focused applications. For example, users can create agents that summarize recent messages, extract action items from conversations, and even send calendar invites—all while ensuring that data remains local and secure.

Image Understanding and Analysis
The 11B and 90B vision models excel in tasks requiring image reasoning, such as analyzing charts and graphs or providing image captions. Businesses can leverage these models for document-level understanding, enabling users to quickly retrieve insights from visual data—like identifying trends in sales reports or navigating maps based on user queries.

Education and Training
Llama 3.2 can be integrated into educational apps to support adaptive learning experiences. It can assist students in summarizing lecture notes or following instructional content, providing tailored support based on individual learning needs.

Data Retrieval and Analysis
The retrieval-augmented generation (RAG) capabilities of Llama 3.2 allow organizations to build applications that can fetch relevant information from large datasets. This is particularly useful in research and business intelligence, where quick access to data insights is crucial for decision-making.

How to Access Llama 3.2?

There are several options to access Llama 3.2:

- Direct Downloads: You can download the Llama 3.2 models directly from Meta’s official website or Hugging Face.
- Cloud Platforms: Llama 3.2 is also available on various cloud platforms like Amazon Bedrock, IBM watsonx, Google Cloud, Microsoft Azure, NVIDIA, Snowflake, etc.
- Edge and Mobile Deployment: Meta is working with partners like Arm, MediaTek, and Qualcomm to offer a broad range of services at launch, ensuring optimized on-device distribution via PyTorch ExecuTorch.
- Single-node Distribution: This is facilitated through Ollama, allowing for versatile deployment options across various environments.
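For teams that want to try the models locally, the sketch below shows one way to run the lightweight 3B Instruct variant with Hugging Face transformers. It is a minimal sketch rather than an official recipe: it assumes you have accepted Meta’s licence on Hugging Face, that the `meta-llama/Llama-3.2-3B-Instruct` model ID matches the checkpoint you were granted access to, and that `transformers` and `accelerate` are installed.

```python
# Minimal sketch: text generation with the lightweight Llama 3.2 3B Instruct model.
# Assumes gated-model access has been granted on Hugging Face; adjust the model ID
# below to match the model card visible to your account.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",  # assumed model ID
    device_map="auto",  # uses a GPU if available, otherwise falls back to CPU
)

prompt = "Summarize the key features of Llama 3.2 in two sentences."
output = generator(prompt, max_new_tokens=128, do_sample=False)
print(output[0]["generated_text"])
```

For the single-node route mentioned above, Ollama exposes the same family through its CLI (for example, `ollama run llama3.2`), which can be a quicker way to experiment before wiring the model into an application.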

Sep 30 2024

5 min read

Key Insights from the Inaugural AI After Hours

On September 24, Encord hosted its first AI After Hours in San Francisco, featuring five talks from AI practitioners working on applications across industries. Here are the key takeaways from each presentation.

Leveraging LLM-based Marketing Analytics Copilots and Agents
Presenter: Sai Kumar Arava, Machine Learning Manager at Adobe

- LLMs and their use in marketing analytics: LLMs, like GPT-4 and Llama, are transforming marketing analytics by allowing natural language queries and automating tasks such as SQL generation, tabular analysis, and document summarization. This shift reduces reliance on technical teams and speeds up workflows.
- LLM challenges: Key issues include hallucination (inaccurate data generation), context length limitations, poor mathematical accuracy (especially for numerical tasks), and interpretability issues, all of which need to be addressed for enterprise-grade applications like marketing analytics.
- Fine-tuning and specialized agents: Fine-tuning LLMs for specific domains (e.g., marketing, legal, healthcare) is critical for improving performance and accuracy. Techniques like LoRA (Low-Rank Adaptation) are popular for efficient fine-tuning.
- AI agents for end-to-end automation: AI agents are evolving to automate entire marketing processes, from dynamic dashboard generation to customer service. They leverage planning and tool orchestration techniques to complete tasks with minimal human intervention.
- Sophisticated agent architectures: AI agent architectures are increasingly sophisticated, incorporating long-term memory, personalization APIs, and orchestration layers for tool and agent management. These advanced architectures help agents handle complex workflows across various sectors.
- Performance and scalability advancements: LLMs have made significant strides in performance and scalability, particularly in multi-agent and tool orchestration environments.
- Ethical and safety considerations: As AI agents become more prevalent, ensuring transparency, safety, and alignment with ethical guidelines is crucial to prevent unintended consequences. Human-AI collaboration remains necessary for critical decision-making.

Unleashing Potential: POCs to a Promising Future in Agentic AI
Presenter: Meghana Puvvadi, Director of Engineering for AI/ML Enterprise Assistants at NVIDIA

- Four pillars of agentic systems: Memory (retaining user preferences), tools (API access for actions like code generation and search), planning (LLMs handling complex tasks), and reasoning (breaking down tasks into logical steps).
- Key use cases: Simple, deterministic scenarios such as new hire onboarding and code assistance serve as ideal starting points for agentic AI implementation. More complex tasks, like supply chain management and customer interaction, benefit from agents employing multi-path reasoning.
- Decoupled architecture: Building AI applications with decoupled components—memory, tool invocation, planning, and reasoning—allows flexibility in swapping LLMs and adapting to new models as they emerge.
- Challenges and considerations: Key challenges include managing costs and latency due to frequent LLM calls, securing data with proper access controls, and continuously updating data to keep AI models relevant.
- Security and permissions: With easier access to information through agents, companies need to ensure strong permission management and avoid exposing sensitive information unintentionally.
- Multi-agent architectures: These architectures are evolving rapidly, with different models such as layered, decentralized, and centralized systems, each suited for varying levels of interaction and autonomy in tasks.

Multimodal RAG for Video
Presenter: Anup Gosavi, CEO and Co-founder of VideoDB

- Multimodal RAG overview: Current multimodal RAG (retrieval-augmented generation) primarily supports unimodal output (text) despite inputs being multimodal (video, images, text). There is a growing need for more comprehensive video outputs beyond short clips.
- Video query challenges: Video retrieval demands complex processing, including transforming videos into images, identifying objects and actions, and managing multiple steps to compile meaningful results.
- Limitations of multimodal models: Existing multimodal models often require manual compilation and editing, leading to high costs and latency in video processing. Additionally, the large token requirements for processing video data can quickly become unmanageable and expensive.
- RAG benefits: RAG enables pre-processing of video content tailored to specific use cases, resulting in improved indexing, retrieval, and lower latency. By optimizing the retrieval process, developers can manage costs more effectively as video data scales.
- Video RAG architecture: The proposed architecture involves a systematic approach to handling video inputs, including audio processing, image extraction, and text transcription, leading to efficient data storage and retrieval. The emphasis is on effective ranking of search results to ensure relevance and efficiency (a minimal sketch of this embed-and-retrieve step appears after this recap).
- Use cases: Potential applications include generating video answers in chatbots, real-time content modification, and personalized highlights from events. Video content should be treated as data to facilitate dynamic access to various modalities.

AI-Powered Fundraising for Conservation: Transforming Grant Discovery
Presenter: Prajakta Pardeshi, Senior Machine Learning Scientist at Walmart Global Tech

- AI chatbot for fundraising: The chatbot employs a fine-tuned BERT model to analyze user inputs, extracting key features such as funding amounts and project details to identify relevant donors or grants.
- Data pipeline and embedding: A sophisticated data pipeline leverages web-scraped donor information to create embeddings for both user queries and potential grants, enabling efficient donor matching.
- Grant diversification: The system incorporates a diversification strategy to ensure a varied selection of grants based on factors such as funding amount and geographical relevance, enhancing the breadth of available options.
- Future enhancements: Plans include transitioning to vector indexing for improved data storage and querying, as well as exploring advanced algorithms for similarity matching to boost the overall efficiency of donor discovery.

Personalized Video Ads at Scale with GenAI
Presenter: Shradha Agrawal, Engineering Manager for GenAI and Computer Vision at Adobe

- The need for hyper-personalization: In today’s marketing landscape, the demand for hyper-personalized video ads is crucial, as traditional one-size-fits-all approaches no longer effectively target specific audiences.
- GenAI tool development: Adobe’s generative AI tool empowers marketers to create personalized video ads efficiently by adapting a single marketing video to match individual customer preferences through target images and text prompts.
- Tool overview: The system utilizes stable diffusion models enhanced with a temporal attention module to create personalized video content. The model allows for fine-tuning with just one source video, streamlining the creation process.
- Inverted mask and feature flow: The system incorporates an inverted mask to specify target object placement in the video and employs feature flow for temporal consistency across frames, improving the visual coherence of generated videos.
- Performance metrics: Generated videos are evaluated using CLIP and DINO scores, measuring alignment with target text and images, respectively. Results indicate that the tool outperforms existing state-of-the-art methods, particularly in scenarios requiring shape changes.
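Several of these talks, the grant-matching pipeline and the video RAG architecture in particular, revolve around the same core step: embed the content, embed the query, and rank by similarity. The sketch below illustrates that step with the sentence-transformers library; the encoder name and toy documents are illustrative assumptions rather than anything the presenters disclosed.

```python
# Minimal embed-and-retrieve sketch behind the grant-matching and video-RAG talks above.
# The encoder and the toy "documents" are illustrative assumptions; in practice the
# documents would be grant descriptions or video-segment transcripts held in a vector index.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose text encoder

documents = [
    "Grant of up to $50,000 for coastal wetland restoration projects.",
    "Funding programme for AI-driven wildlife monitoring in national parks.",
    "Transcript: the reviewer praises the car's battery range and infotainment system.",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)

query = "Donors who fund conservation technology for wildlife."
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity and return the best match.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(f"Top match (score {scores[best].item():.2f}): {documents[best]}")
```

In a production system the same idea scales by swapping the in-memory list for a vector database and re-ranking the top hits, which is essentially the "vector indexing" and "ranking of search results" work the presenters described.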

Sep 27 2024

5 min read

OpenAI o1: A New Era of AI Reasoning

While September is typically associated with Apple’s releases, OpenAI made waves of its own. The OpenAI o1 series is a new class of large language models (LLMs) designed to enhance reasoning capabilities through a "thinking before responding" approach. With its advanced reasoning capabilities, o1 is setting a new standard for how artificial intelligence can solve complex reasoning tasks in math, coding, science, and beyond. This explainer will walk you through what OpenAI o1 is, how it functions, and why it’s such an exciting breakthrough for anyone tackling difficult, technical challenges.

What is OpenAI o1?

OpenAI o1 is a new generation of AI models focused on advanced reasoning. Unlike many AI models that provide immediate answers based on broad knowledge, o1 distinguishes itself by taking time to think through complex tasks. It employs a "chain of thought" approach, which allows it to carefully analyze and break down complex problems, solving them step by step for more accurate and insightful results. This approach enables o1 to surpass previous models—like GPT-4o—on tasks demanding deep understanding and logical problem-solving. By scoring in the 89th percentile in coding competitions and ranking among the top 500 U.S. high school students in a prestigious math exam, o1 demonstrates its prowess as a powerful AI tool for tackling complex STEM challenges.

Learn how to use GPT-4o in your data annotation pipeline. Read the blog: How to Pre-Label Your Data with GPT-4o.

The o1 Model Series: Preview and Mini Versions

OpenAI o1 is available in two versions, o1-preview and o1-mini, each tailored for different use cases. OpenAI o1-preview is the advanced version, excelling in reasoning-heavy tasks such as coding, math, and science. It surpasses human experts on academic benchmarks in physics, chemistry, and biology. OpenAI o1-mini is a faster, more cost-effective option for developers needing reasoning power without extensive general knowledge. While specializing in STEM fields, it remains highly capable in competitive programming and advanced math. Both new models are accessible via ChatGPT and the API. o1-mini offers an 80% cost reduction compared to o1-preview, making it an attractive choice for users prioritizing speed and affordability.

How the o1 Models Work: Learning to Reason

A standout feature of the o1 series is its ability to reason through problems using a chain of thought. This approach mimics human problem-solving: breaking down complex questions, recognizing mistakes, and refining approaches over time.

Reinforcement Learning

OpenAI o1 uses reinforcement learning (RL) to develop its reasoning skills. Through this process, the AI model is trained to improve its problem-solving strategies. For example, if o1 makes an error in a complex math equation, it learns to correct that mistake by trying new approaches and refining its solution process. The more time it spends thinking, the more accurate it becomes. This RL approach leads to higher performance on reasoning-heavy tasks, which require more than just factual recall. It enables the model to improve with test-time compute—the more thinking time it gets, the better it performs.

Performance Highlights and Benchmarks

OpenAI o1 has been tested on a variety of benchmarks, and its performance is nothing short of impressive:

- Math Performance: In the prestigious AIME (American Invitational Mathematics Examination), a qualifying round for the USA Math Olympiad, o1-preview scored 83%, solving some of the toughest problems high school students face. By comparison, GPT-4o only managed 12%.
- Coding: On Codeforces, a competitive coding platform, o1-mini achieved an Elo rating of 1650, placing it among the top 14% of all competitors, while o1-preview reached the 89th percentile. This makes it an invaluable tool for developers working on complex algorithms.
- STEM Reasoning: In academic benchmarks like GPQA (science) and MATH-500, o1-mini outperformed GPT-4o, showcasing its superior reasoning abilities.

These benchmarks demonstrate that o1 is not just faster or more powerful—it is also far more capable of tackling problems that require deep thought than the previous GPT family, including GPT-4o.

Model Speed and Cost Efficiency

While the o1-preview model is an advanced, highly capable reasoning tool, OpenAI also offers a more efficient alternative: o1-mini. This smaller version of the model delivers comparable performance in many STEM tasks but at a fraction of the cost. For example, o1-mini is up to 5x faster than o1-preview in some tasks, making it ideal for developers who need quick responses without sacrificing accuracy. Its reduced price—80% cheaper than o1-preview—makes it an attractive option for cost-conscious users.

Safety and Alignment in o1 Models

One of the critical advancements in the o1 series is its improved safety and alignment capabilities. Using the chain of thought, the reasoning model is trained to reason through safety guidelines, ensuring it adheres to ethical boundaries even in tricky or edge-case scenarios. For example, o1-preview was tested on some of the hardest jailbreaking evaluations, where users try to bypass safety restrictions. It significantly outperformed previous models, scoring 84 out of 100 on safety adherence, compared to GPT-4o’s score of 22. By teaching the model to integrate safety principles into its reasoning process, OpenAI has made o1 not only smarter but also safer and more aligned with human values. For more information, read the OpenAI o1 System Card.

Hiding the Chain of Thought

Though the o1 models internally use a chain of thought to arrive at answers, this process remains hidden from the user. OpenAI made the decision to summarize the reasoning process rather than showing the raw chain of thought, balancing user experience and competitive advantage. This hidden chain of thought allows OpenAI to monitor the model’s internal reasoning, offering insights into how it arrives at decisions while protecting users from potentially confusing or overly complex intermediate steps.

Limitations and What’s Next

While the o1 models excel in STEM reasoning, they do have some limitations:

- Limited Non-STEM Knowledge: o1-mini, in particular, is not as strong in general knowledge or language-focused tasks, such as answering trivia or writing biographies.
- Room for Improvement: Future versions of the o1 series aim to address these gaps, expanding capabilities to other modalities like browsing the web and uploading files or images.

OpenAI continues to iterate on the o1 series, with plans to release improved models and add more functionality in future updates.

How to Use OpenAI o1

- ChatGPT Plus and Team Users: You can access both o1-preview and o1-mini in ChatGPT starting today. Use the model picker to select between the two. Initially, you will have a weekly message limit of 30 for o1-preview and 50 for o1-mini.
- ChatGPT Enterprise and Edu Users: You will be able to access both o1 models beginning next week. This access will come with higher rate limits and additional features.
- API Tier 5 Developers: If you qualify for API usage tier 5, you can start using both o1-preview and o1-mini today with a rate limit of 20 RPM. Note that the API currently does not support function calling, streaming, or system messages.
- ChatGPT Free Users: OpenAI plans to extend access to o1-mini to all ChatGPT Free users in the future.
- Cost and Performance: o1-mini is available now to API Tier 5 users at a cost 80% lower than o1-preview, offering a more affordable option with impressive performance.

Key Highlights: OpenAI o1

- Advanced Reasoning: Uses a chain of thought to tackle complex STEM problems.
- STEM Performance: Excels in math (83% on AIME), coding (89th percentile on Codeforces), and science (outperforms PhDs on GPQA).
- Two Versions: Full-featured o1-preview and cost-effective o1-mini (80% cheaper).
- Reinforcement Learning: Trained to improve problem-solving and correct mistakes.
- Hidden Chain of Thought: Internally monitors reasoning for improved safety without exposing the raw thought process.
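For developers with the API access described above, a request looks much like any other chat completion call in the OpenAI Python SDK. The snippet below is a minimal sketch, not an official example: it assumes an `OPENAI_API_KEY` environment variable is set, uses the model names from this post, and reflects the launch-time restrictions noted above (no system messages, streaming, or function calling). The `max_completion_tokens` parameter follows OpenAI's guidance for o1 models at launch.

```python
# Minimal sketch of calling an o1 model through the OpenAI Python SDK (v1.x).
# Per the post, the API at launch accepts a plain user message only: no system
# messages, no streaming, and no function calling.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-mini",  # or "o1-preview" for the larger reasoning model
    messages=[
        {
            "role": "user",
            "content": "How many distinct ways can 8 rooks be placed on a chessboard "
                       "so that no two attack each other? Explain briefly.",
        }
    ],
    # o1 models budget hidden reasoning plus visible output together, so leave headroom.
    max_completion_tokens=2048,
)

print(response.choices[0].message.content)
```

Because the hidden chain of thought consumes part of that token budget, setting the completion limit too low can return an empty visible answer even though reasoning tokens were billed.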

Sep 13 2024

5 min read

Machine Learning Trends & Stats for 2024

The world is witnessing exponential advancements in artificial intelligence and machine learning technologies. These developments are introducing advanced tools and frameworks that are revolutionizing human-machine interactions like never before. Businesses are quickly integrating AI to boost productivity and reduce costs. In fact, 83% of companies consider AI a top strategic priority. In this article, we will explore key trends and statistics to help you learn more about recent developments in AI.

AI & ML Market Statistics

- The global AI market was worth $196.63 billion as of 2024 and is projected to grow at a CAGR of 28.46% between 2024 and 2030.
- Estimates suggest that by 2030, AI will contribute around $15.7 trillion to the global economy - more than India and China’s current GDP.
- The computer vision market was worth $20.31 billion in 2023, with a projected CAGR of 27.3% between 2023 and 2032.
- The natural language processing (NLP) market is projected to reach $31.76 billion by the end of 2024, with a CAGR of 23.97% between 2024 and 2029.
- The large language model (LLM) market is currently valued at $6.4 billion and is expected to reach $36.1 billion by 2030, growing at a CAGR of 33.2% between 2023 and 2030.

AI & ML Adoption

- Over 50% of companies in the U.S. with more than 5,000 employees use AI.
- Chinese and Indian companies report the highest use of AI compared to other developed countries, with 60% of IT professionals saying they already use AI applications.
- Over 20% of American content creators used AI to generate videos and images in 2023.
- Due to labor shortages, around 25% of companies are adopting AI to enhance business operations globally.
- According to the latest online survey by Gartner, which covered CIOs from multiple geographies and industries, 34% say they have already adopted AI, while 22% say they will do so by the end of 2024.

AI and Gen AI Adoption

- Around three in four CIOs across multiple industries globally intend to boost investment in IT services and AI-powered and AI-augmented applications.
- Insurance companies have the highest AI adoption rate of 49%, followed by U.S. healthcare companies, which have an adoption rate of 48%. U.S. healthcare companies also have the most aggressive spending budget for AI.
- 48% of businesses use deep learning, NLP, and ML models to manage large datasets. Organizations surveyed worked in software, consulting, finance, healthcare, government, higher education, and telecommunications verticals, among others.

AI & ML Benefits

- According to a global survey by McKinsey, generative AI increased revenue by 5% in 2023 in supply chain and inventory management.
- 38% of businesses achieved cost reduction through machine learning technologies.
- According to research by Gartner, quick product development, enhanced customer experience, and greater workforce productivity are the most significant benefits of Gen AI.
- Netflix’s AI-based recommendation algorithm saves the company $1 billion annually.
- Companies that lead in AI functionalities produce total shareholder returns (TSR) four to six times higher than organizations that lag in AI investments. This trend is consistent across various industries, such as insurance, banking, and retail.

TSR Returns - Leaders versus Laggards

- Amazon’s revenue from AWS increased by 17% to $25 billion in 2024. The company’s chief executive believes the performance is a result of continued focus on AI.

AI & ML Sentiments

- 44% of business owners feel AI helps them with better decision-making.
- 64% of business owners believe AI will improve customer relationships.
- Globally, only 54% of consumers think that AI-based products have more benefits than drawbacks.
- Only 49% of consumers surveyed across 31 countries say AI changed their lives in the past 3 to 5 years.
- 57% of workers believe AI will change how they do their jobs, and 36% feel AI will replace them.

AI & ML Impact on the Workforce

- AI automation is projected to replace around 400 million workers by 2030, resulting in a 15% job loss globally. On the other hand, AI will create around 97 million new jobs. These jobs include developers and engineers who work on LLMs, UX designers, and content creators. Demand for LLM developers will increase as more technologies become dependent on tools such as ChatGPT. Demand for UX designers will increase to create intuitive interfaces that help users interact with AI. Demand for content creators will increase due to the need for relevant prompt engineering to generate the desired content.
- AI models will increase labor productivity by 40% across sixteen industries by 2035.
- A recent survey shows that 39% of chief data officers have implemented AI literacy programs to fill these crucial roles. Must-have AI roles include prompt, data, and machine learning engineers, data scientists, AI ethicists, heads of AI, and data and analytics translators.

Must-have AI Roles

- According to Gartner, Gen AI will augment the human workforce in 90% of companies globally by 2025.
- Significant workforce reskilling is necessary to increase AI adoption. A recent report by Statista suggests that around 20% or more of enterprise employees will need reskilling.

AI & ML Tools

- End-user spending on Robotic Process Automation (RPA) tools reached $3.35 million in 2023, up 17.5% from the previous year.

Gartner Magic Quadrant

- According to the 2024 Gartner Magic Quadrant, Microsoft, Google, and Oracle are the leaders in analytics and business intelligence (BI) platforms.
- 48% of businesses in software and professional services use ML, data analysis, and other AI tools to ensure accurate and error-free data.
- Listings of AI software providers on Gartner Digital Markets doubled in 2023, with AI product reviews increasing by 2.5 times.
- 92% of organizations plan to invest in AI tools such as chatbots in 2024, believing such solutions offer significant productivity and time-saving benefits.

Read more about the best image annotation tools.

Top AI Trends

Now, let's dive into the key trends that will shape the future of AI in the coming years.

Cloud Systems

Due to the scalability and flexibility of cloud-based AI ecosystems, many organizations are moving from stand-alone software applications to cloud-native solutions. According to Gartner, 50% of new system deployments will occur in the cloud instead of separate point solutions requiring manual integration. The shift toward hybrid and cloud-based solutions will make AI more accessible to startups and small-to-medium enterprises (SMEs) that lack sufficient funds to build in-house platforms. Organizations will benefit from the cloud’s low latency and high throughput, as cloud-based platforms have integrated GPUs to optimize AI models.

Edge AI

The need to process data at the point of generation is increasing as organizations prioritize real-time insights and compliance with data privacy regulations. Gartner predicts that over 55% of deep neural networks will analyze data at the source by 2025. Businesses can identify deeper patterns by processing data at the source and deploying AI algorithms on local devices, including sensors, cameras, and other Internet-of-Things (IoT) devices. In addition, edge AI allows businesses to streamline AI deployment through seamless integration and orchestration of AI workflows, helping them create more advanced AI models.

Generative AI

Gen AI is reshaping how personal and business users interact with AI to perform multiple tasks. According to the latest McKinsey survey, 65% of organizations use Gen AI, and 75% of respondents expect the technology to result in disruptive change.

Respondent’s Use of Gen AI by Function

The survey reports that sales and marketing, service development, and IT functions are the most prominent areas where organizations deploy Gen AI technologies. Further, Gartner reports that 38% of executives invest in Gen AI to improve customer experience and retention, increase revenue, and reduce costs. As Gen AI models become open source, their adoption will likely soar in the near future, encouraging more businesses to implement Gen AI frameworks to solve complex issues and boost operational efficiency.

Responsible AI

As AI spreads across multiple aspects of human life, concern for ethical AI deployment is rising. The trend involves addressing user sentiments regarding transparency, accountability, risk, trust, and societal value associated with AI initiatives. With only 1% of AI vendors owning large pre-trained models, the need for responsible AI is rapidly increasing. As such, organizations should adopt best practices for managing risk and preventing bias when building frameworks using these pre-trained models.

Data-centric AI

Data-centric AI focuses on building frameworks that ensure data is clean, accurate, and consistent, rather than just improving algorithms. This approach helps create platforms that enable users to quickly pre-process, curate, and label datasets with AI-based models, automating the entire workflow. For instance, the Encord platform is an end-to-end solution that offers AI tools to curate and annotate image, video, and medical data. It provides micro-models that users can train on a few data samples and then automatically label the remaining data points for better results. Data-centric AI also involves generating synthetic data using Gen AI tools to make up for the lack of real-world data needed for training complex CV and NLP models. According to reports, about 60% of data will be synthetic by the end of 2024.

Accelerated AI Investments

AI investments are increasing with the rise of foundation models. Gartner forecasts that around $10 billion will be spent on AI startups relying on such models. Also, around 70% of organizations are exploring Gen AI solutions, while 19% are already in the pilot phase. The increased investment will likely result in more AI innovations in real-world applications and stronger collaboration between research institutes, corporations, and government.

Popular Use Cases

With the increasing prevalence of mobile technology, the use of AI-based voice assistants and chatbots is increasing. Reports predict there will be around 8 billion voice assistants by the end of 2024. Moreover, the telecommunications industry is one of the most significant users of chatbots, with 52% of telecom businesses using the technology to improve productivity.

Popular Consumer Use Cases

For consumers, the most popular AI use cases include responding to messages and emails, answering financial questions, planning travel itineraries, crafting emails, preparing for job interviews, writing social media posts, and summarizing long copy.

Learn more about computer vision use cases in our detailed blog.

Top Picks

- Machine learning’s projected market size will reach $79.29 billion by the end of 2024.
- Global AI software spending will reach $297.9 billion by 2027.
- The ML Operations (MLOps) market is predicted to have a compound annual growth rate (CAGR) of 43.2% between 2024 and 2033.
- According to Bloomberg, generative AI (Gen AI) will be worth $1.3 trillion by 2032.
- The Financial Times reports that OpenAI’s revenue surpassed $2 billion as demand for ChatGPT explodes.
- Around 92% of Fortune 500 companies are using OpenAI products.
- 80% of business leaders think that Gen AI will increase efficiency.

AI & ML Trends: Key Takeaways

The above statistics and trends clearly show that organizations wishing to stay ahead of the competition must invest in AI and ML technologies to boost revenue, improve customer experience, and reduce costs. Below are a few critical points to remember regarding AI.

- Generative AI has the Highest Adoption: Organizations are rushing to implement generative AI technologies in multiple business functions to improve operational efficiency.
- Customer Experience is Key: The motive behind significant investments in AI is to improve customer experience and relationships.
- Workforce Reskilling: Significant workforce reskilling is necessary to counter AI’s job displacement effects.
- Key Trends: The most prominent trends in 2024 are the shift to cloud systems, edge AI, and a focus on building data-centric frameworks.

References

BIS Research, Bloomberg, Deloitte, Exploding Topics, Financial Times, Forbes Advisor, Fortune Business Insights, Gartner, Ipsos, Market.us Scoop, McKinsey, Statista, The Guardian

Aug 16 2024

5 min read

Top 10 Multimodal Datasets

Multimodal datasets are like the digital equivalent of our senses. Just as we use sight, sound, and touch to interpret the world, these datasets combine various data formats—text, images, audio, and video—to offer a richer understanding of content. Think of it this way: if you tried to understand a movie just by reading the script, you'd miss out on the visual and auditory elements that make the story come alive. Multimodal datasets provide those missing pieces, allowing AI to catch subtleties and context that would be lost if it were limited to a single type of data.  Another example is analyzing medical images alongside patient records. This approach can reveal patterns that might be missed if each type of data were examined separately, leading to breakthroughs in diagnosing diseases. It's like assembling multiple puzzle pieces to create a clearer, more comprehensive picture. In this blog, we've gathered the best multimodal datasets with links to these data sources. These datasets are crucial for Multimodal Deep Learning, which requires integrating multiple data sources to enhance performance in tasks such as image captioning, sentiment analysis, medical diagnostics, video analysis, speech recognition, emotion recognition, autonomous vehicles, and cross-modal retrieval. What is Multimodal Deep Learning? Multimodal deep learning, a subfield of Machine Learning, involves using deep learning techniques to analyze and integrate data from multiple data sources and modalities such as text, images, audio, and video simultaneously. This approach uses the complementary information from different types of data to improve model performance, enabling tasks like enhanced image captioning, audio-visual speech recognition, and cross-modal retrieval. Next-GPT: A Multimodal LLM Benefits of Multimodal Datasets in Computer Vision Multimodal datasets significantly enhance computer vision applications by providing richer and more contextual information. Here's how:    By combining visual data with other modalities and data sources like text, audio, or depth information, models can achieve higher accuracy in tasks such as object detection, image classification, and image segmentation.    Multimodal models are less susceptible to noise or variations in a single modality. For instance, combining visual and textual data can help in overcoming challenges like occlusions or ambiguous image content. Multimodal datasets allow models to learn deeper semantic relationships between objects and their context. This enables more sophisticated tasks like visual question answering (VQA) and image generation.    Multimodal dataset opens up possibilities for novel applications in computer vision, large language models, augmented reality, robotics, text-to-image generation, VQA, NLP and medical image analysis. By integrating information from data sources of different modalities, models can better understand the context of visual data, leading to more intelligent and human-like large language models. Top 10 Multimodal Datasets Flickr30K Entities Dataset The Flickr30K Entities dataset is an extension of the popular Flickr30K dataset, specifically designed to improve research in automatic image description and understand how language refers to objects in images. It provides more detailed annotations for image-text understanding tasks.  Flickr30K Entities dataset built upon the Flickr30k dataset, which contains 31K+ images collected from Flickr. 
Each image in Flickr30k Entities is associated with five crowd-sourced captions describing the image content. The dataset adds bounding box annotations for all entities (people, objects, etc.) mentioned in the image captions.  Flickr30K allows to develop better large language models with vision capabilities for image captioning, where the model can not only describe the image content but also pinpoint the location of the entities being described. It also allows the development of an improved grounded language understanding, which refers to a machine's ability to understand language in relation to the physical world. Research Paper: Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models Authors: Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik Dataset Size: 31,783 real-world images, 158,915 captions (5 per image), approximately 275,000 bounding boxes, 44,518 unique entity instances. Licence: The dataset typically follows the original Flickr30k dataset licence, which allows for research and academic use on non-commercial projects. However, you should verify the current licensing terms as they may have changed. Access Links: Bryan A. Plummer Website Visual Genome The Visual Genome dataset is a multimodal dataset, bridging the gap between image content and textual descriptions. It offers a rich resource for researchers working in areas like image understanding, VQA, and multimodal learning.  Visual Genome combines two modalities, first is Visual, containing over 108,000 images from the MSCOCO dataset are used as the visual component, and second is Textual, where images are extensively annotated with textual information (i.e. objects, relationships, region captions, question-answer pairs). The multimodal nature of this dataset offers advantages like deeper image understanding to allow identify meaning and relationships between objects in a scene beyond simple object detection, VQA to understand the context and answer questions that require reasoning about the visual content, and multimodal learning that can learn from both visual and textual data. Research Paper: Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations Authors: Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Fei-Fei Li Dataset Size: 108,077 real-world image, 5.4 Million Region Descriptions, 1.7 Million VQA, 3.8 Million Object Instances, 2.8 Million Attributes, 2.3 Million Relationships Licence: Visual Genome by Ranjay Krishna is licensed under a Creative Commons Attribution 4.0 International License. Access Links: Visual Gnome Dataset at Hugging Face MuSe-CaR  MuSe-CaR (Multimodal Sentiment Analysis in Car Reviews) is a multimodal dataset specifically designed for studying sentiment analysis in the "in-the-wild" context of user-generated video reviews.  MuSe-CaR combines three modalities (i.e. text, audio, video) to understand sentiment in car reviews. The text reviews are presented as spoken language, captured in the video recordings, audio consists of vocal qualities (like tone, pitch, and emphasis) to reveal emotional aspects of the review beyond just the spoken words, and video consists of facial expressions, gestures, and overall body language provide additional cues to the reviewer's sentiment. 
MuSe-CaR aims to advance research in multimodal sentiment analysis by providing a rich dataset for training and evaluating models capable of understanding complex human emotions and opinions expressed through various modalities. Research Paper: The Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) Dataset: Collection, Insights and Improvements Authors: Lukas Stappen, Alice Baird, Lea Schumann, Björn Schuller Dataset Size: 40 hours of user-generated video material with more than 350 reviews and 70 host speakers (as well as 20 overdubbed narrators) from YouTube. Licence: End User Licence Agreement (EULA) Access Links: Muse Challenge Website CLEVR CLEVR, which stands for Compositional Language and Elementary Visual Reasoning, is a multimodal dataset designed to evaluate a machine learning model's ability to reason about the physical world using both visual information and natural language. It is a synthetic multimodal dataset created to test AI systems' ability to perform complex reasoning about visual scenes.  CLEVR combines two modalities, visual and textual. Visual modality comprises rendered 3D scenes containing various objects. Each scene features a simple background and a set of objects with distinct properties like shape (cube, sphere, cylinder), size (large, small), color (gray, red, blue, etc.), and material (rubber, metal).  Textual modality consists of questions posed in natural language about the scene. These questions challenge models to not only "see" the objects but also understand their relationships and properties to answer accurately. CLEVR is used in applications like visual reasoning in robotics and other domains to understand the spatial relationships between objects in real-time (e.g., "Which object is in front of the blue rubber cube?"), counting and comparison to enumerate objects with specific properties (e.g., "How many small spheres are there?"), and  logical reasoning to understand the scene and the question to arrive at the correct answer, even if the answer isn't directly visible (e.g., "The rubber object is entirely behind a cube. What color is it?"). Research Paper: CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning Authors: Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Fei-Fei Li, Larry Zitnick, Ross Girshick Dataset Size: 100,000  images, 864986 questions, 849,980 answers, 85,000 scene graph annotations and functional program representations. Licence: Creative Commons CC BY 4.0 licence. Access Links: Stanford University CLEVR Page InternVid  InternVid is a relatively new multimodal dataset specifically designed for tasks related to video understanding and generation using generative models. InternVid focuses on the video-text modality, combining a large collection of videos containing everyday scenes and activities accompanied by detailed captions describing the content, actions, and objects present in the video. InternVid aims to support various video-related tasks such as video captioning, video understanding, video retrieval and video generation. 
Research Paper: InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation Authors: Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, LiMin Wang, Yu Qiao Dataset Size: The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of total 4.1B words. Licence: The InternVid dataset is licensed under the Apache License 2.0 Access Links: InternVid Dataset at Huggingface MovieQA MovieQA is a multimodal dataset designed specifically for the task of video question answering (VideoQA) using text and video information. MovieQA combines three modalities i.e. video, text and question and answer pairs. The dataset consists of video clips from various movie clips that are accompanied by subtitles or transcripts, providing textual descriptions of the spoken dialogue and on-screen actions. Each video clip is paired with multiple questions that require understanding both the visual content of the video and the textual information from the subtitles/transcript to answer accurately. MovieQA aims to evaluate how well a model can understand the actions, interactions, and events happening within the video clip. It can utilize textual information such as  subtitles/transcript to complement the visual understanding and answer questions that might require information from both modalities and provide informative answers. Research Paper: MovieQA: Understanding Stories in Movies through Question-Answering Authors: Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler Dataset Size: This dataset consists of 15,000 questions about 400 movies with high semantic diversity. Licence: Unknown Access Links: Dataset at Metatext MSR-VTT MSR-VTT, which stands for Microsoft Research Video to Text, is a large-scale multimodal dataset designed for training and evaluating models on the task of automatic video captioning. The primary focus of MSR-VTT is to train models that can automatically generate captions for unseen videos based on their visual content. MSR-VTT combines two modalities, videos and text descriptions. Video is a collection of web videos covering a diverse range of categories and activities and each video is paired with multiple natural language captions describing the content, actions, and objects present in the video. MSR-VTT helps in large-scale learning using vast amounts of data which allows models to learn robust video representations and generate more accurate and descriptive captions. Videos from various categories help models generalize well to unseen video content and multiple captions per video provides a richer understanding of the content. Research Paper: MSR-VTT: A Large Video Description Dataset for Bridging Video and Language Authors: Jun Xu , Tao Mei , Ting Yao, Yong Rui Dataset Size: Large video captioning dataset with 10,000 clips (38.7 hours) and 200,000 descriptions. It covers diverse categories and has the most sentences/vocabulary compared to other similar datasets. Each clip has around 20 captions written by human annotators. Licence: Unknown Access Links: Dataset at Kaggle VoxCeleb2  VoxCeleb2 is a large-scale multimodal dataset designed for tasks related to speaker recognition and other audio-visual analysis. VoxCeleb2 combines two modalities, audio and video. 
The audio modality consists of speech recordings from various individuals, while the video modality provides the corresponding clips of the speakers, allowing for the extraction of visual features. VoxCeleb2 primarily focuses on speaker recognition, which involves identifying or verifying a speaker based on their voice. However, the audio-visual nature of the dataset also allows for face recognition and speaker verification.

Research Paper: VoxCeleb2: Deep Speaker Recognition
Authors: Joon Son Chung, Arsha Nagrani, Andrew Zisserman
Dataset Size: VoxCeleb2 is a large-scale dataset containing over 1 million utterances for 6,112 celebrities, extracted from videos uploaded to YouTube.
Licence: VoxCeleb2 metadata is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Access Links: The VoxCeleb2 Dataset

VaTeX

VaTeX (VAriational Text and video) is a multimodal dataset designed specifically for research on video-and-language tasks.

Modalities: VaTeX combines two modalities: a collection of videos depicting various activities and scenes, and text descriptions for each video covering the content in both English and Chinese. Some caption pairs are parallel translations, allowing for video-guided machine translation research.

VaTeX supports several research areas related to video and language, such as multilingual video captioning to generate captions for videos in multiple languages, video-guided machine translation to improve the accuracy of machine translation, and video understanding to analyze and understand the meaning of video content beyond simple object recognition.

Research Paper: VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
Authors: Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, William Yang Wang
Dataset Size: The dataset contains over 41,250 videos and 825,000 captions in both English and Chinese.
Licence: The dataset is under a Creative Commons Attribution 4.0 International License.
Access Links: VATEX Dataset

WIT

WIT, which stands for Wikipedia-based Image Text, is a state-of-the-art large-scale dataset designed for tasks related to image-text retrieval and other multimedia learning applications.

Modalities: WIT combines two modalities: a massive collection of unique images from Wikipedia, and text descriptions for each image extracted from the corresponding Wikipedia article. These descriptions provide information about the content depicted in the image.

WIT primarily focuses on tasks involving the relationship between images and their textual descriptions. Some key applications are image-text retrieval to retrieve images using a text query, image captioning to generate captions for unseen images, and multilingual learning to understand and connect images to text descriptions in various languages.

Research Paper: WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Authors: Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, Marc Najork
Dataset Size: WIT contains a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages.
Licence: This data is available under the Creative Commons Attribution-ShareAlike 3.0 Unported licence.
Access Links: Google Research dataset on GitHub

Key Takeaways: Multimodal Datasets

Multimodal datasets, which blend information from diverse data sources such as text, images, audio, and video, provide a more comprehensive representation of the world.
This fusion allows AI models to decipher complex patterns and relationships, enhancing performance in tasks like image captioning, video understanding, and sentiment analysis. By encompassing diverse data aspects, multimodal datasets push the boundaries of artificial intelligence, fostering more human-like understanding and interaction with the world. These datasets drive significant advancements across fields, from superior image and video analysis to more effective human-computer interaction. As technology continues to advance, multimodal datasets will undoubtedly play a crucial role in shaping the future of AI. Embracing this evolution, we can look forward to smarter, more intuitive AI systems that better understand and interact with our multifaceted world.
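To make the pairing of modalities concrete, here is a minimal sketch that walks CLEVR's question annotations and pairs each natural-language question with its rendered image. The directory layout and JSON fields (questions, question, answer, image_filename) reflect our assumptions about the official CLEVR v1.0 download; adapt them to your local copy of the data.

import json
from pathlib import Path

# Assumed paths inside the official CLEVR v1.0 download
CLEVR_ROOT = Path("CLEVR_v1.0")
questions_file = CLEVR_ROOT / "questions" / "CLEVR_val_questions.json"
images_dir = CLEVR_ROOT / "images" / "val"

with open(questions_file) as f:
    questions = json.load(f)["questions"]

# Each record links the textual modality (question/answer) to the visual one (image file)
for q in questions[:5]:
    image_path = images_dir / q["image_filename"]
    print(f"Image:    {image_path}")
    print(f"Question: {q['question']}")
    print(f"Answer:   {q.get('answer', 'n/a')}")  # the test split ships without answers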

Aug 15 2024

5 min read

ONNX Standardized Format: The Universal Translator for AI Models

Modern artificial intelligence (AI) is moving beyond traditional machine learning (ML) models involving straightforward statistical calculations. With the emergence of advanced computational resources and big data, AI frameworks are now more sophisticated, performing complex inference on extensive datasets. However, as model complexity increases, so does the need for interoperability, since developers use multiple frameworks to build, test, and deploy AI systems. In addition, using AI-based solutions alongside legacy infrastructure calls for tools that allow businesses to seamlessly integrate AI with their existing tech stack. This lack of interoperability often results in time-consuming and error-prone conversion processes, creating significant obstacles to the smooth deployment of AI solutions.

Enter the Open Neural Network Exchange (ONNX) framework. ONNX addresses this interoperability challenge by offering a standardized, open-source format for representing AI models. With ONNX, developers can build, share, and run models across various platforms without worrying about compatibility issues, thereby streamlining the entire model development and deployment lifecycle. In this article, we will discuss in detail what ONNX is, along with its key features, benefits, challenges, and best practices, to help you understand how to use ONNX optimally to streamline your model development lifecycle.

What is ONNX?

ONNX is a unified open-source format designed to enable interoperability between different AI and ML frameworks. A standard format allows users to move and execute models across multiple frameworks without implementing complex conversion pipelines. Originally developed by Facebook and Microsoft in 2017, ONNX has gained support from numerous tech giants, including IBM, Intel, and Qualcomm.

Traditionally, developers used the HDF5 format to save a model in Keras, the SavedModel format to store a model in TensorFlow, and Pickle for Scikit-Learn. These formats are framework-specific, and their support in other development environments is limited. ONNX lets you overcome these limitations through the following key features:

Open-source: ONNX is an open-source project on GitHub, with a large and active community that contributes to the development and enhancement of the framework's ecosystem.
Standardized Format: Standardization allows developers to use an ONNX-based model with any supported framework, providing smooth cross-platform integration.
Conversion Tools: ONNX includes extensive tools and APIs that enhance the ML lifecycle. For instance, it supports multiple libraries that convert models built in popular frameworks such as TensorFlow, Keras, and PyTorch to ONNX.
Visualization and Optimization Libraries: ONNX offers tools to visualize models and provides optimization utilities that remove redundant nodes and improve performance. Users can also deploy ONNX models using runtime libraries that support multiple hardware targets such as CPUs, GPUs, and accelerators.
Interoperability: ONNX enables seamless import and export of models across multiple ML frameworks. This lets developers leverage the strengths of a particular framework during model development, convert the model to ONNX, and export it to a suitable lightweight, low-latency runtime environment.
Focus on Inference: ONNX Runtime is a tool for efficiently deploying machine learning models in production with faster inferencing and broad compatibility across hardware platforms.
Format Flexibility: The ONNX standard supports traditional and state-of-the-art (SOTA) deep learning models, including complex computer vision (CV) and natural language processing (NLP) architectures.
Performance Optimizations: ONNX Runtime supports multiple performance-enhancing graph optimizations through node elimination and fusion techniques, which help improve model execution efficiency.

Popular Frameworks Compatible with ONNX

ONNX supports multiple frameworks that let developers build a wide range of deep learning and traditional machine learning models more flexibly and efficiently. The following list highlights a few of the popular frameworks that are compatible with ONNX.

PyTorch: Meta's PyTorch is a Python-based library that offers robust GPU-accelerated tensor computation to build complex CV and NLP models. PyTorch is particularly favored for its dynamic computational graph (built on tape-based reverse-mode auto-differentiation), which allows developers to modify the graph on the fly.
TensorFlow: Google's TensorFlow is an end-to-end ML framework that offers intuitive APIs to build AI applications. It provides tools for developing and deploying models across various environments, including edge devices, the web, and mobile platforms. TensorFlow also includes utilities for creating input pipelines for data preprocessing.
Scikit-Learn: Scikit-Learn is a Python-based platform for building traditional ML models for classification, regression, and clustering. It also offers tools for dimensionality reduction, model selection, and data preprocessing, making it a comprehensive framework for standard ML tasks.
Keras: Keras is a high-level API for developing ML-powered apps with straightforward code that is quick to debug, deploy, and maintain.
Microsoft Cognitive Toolkit (CNTK): CNTK is an open-source deep-learning library that represents neural networks through directed computational graphs. This makes CNTK suitable for building architectures involving feed-forward, convolutional, and recurrent neural nets.

Converting Models to ONNX

ONNX offers libraries to convert models in different frameworks to ONNX format. The format consists of an ONNX graph that describes the ML model through mathematical operations. The operations transform input features to generate relevant predictions. For instance, a developer may create a linear regression model in Python and convert it to an ONNX graph. The model is a function of three input variables and consists of a multiplication and an addition operation.

Linear Regression Model

Converting it to ONNX means using ONNX operators to represent the model in a standard graph that the developer can run on any platform. The conversion involves writing the linear regression model in the ONNX language to declare the variables, define nodes, create the graph, and add relevant metadata.

ONNX Graph for Linear Regression Model

Although developers can manually write models in the ONNX language, a more convenient alternative is using pre-built ONNX conversion libraries. These libraries automatically convert Python-based models in supported frameworks to ONNX format. The following is a brief list of conversion libraries (a minimal end-to-end example follows the list):

sklearn-onnx: Helps convert scikit-learn models to ONNX.
tf2onnx: Enables developers to transform TensorFlow, Keras, TensorFlow.js, and TFLite models to ONNX.
onnx-coreml: Facilitates the conversion of ONNX models to CoreML format.
torch.onnx: Supports converting PyTorch-based models to ONNX.
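As a hedged illustration of this workflow (not from the original article), the sketch below trains the three-variable linear regression discussed above with scikit-learn, converts it with the sklearn-onnx library (published as the skl2onnx Python package), and runs the exported graph with ONNX Runtime. The file name and tensor name "input" are arbitrary choices.

import numpy as np
from sklearn.linear_model import LinearRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort

# Train a simple linear regression on three input variables
X = np.random.rand(200, 3).astype(np.float32)
y = X @ np.array([1.5, -2.0, 0.7], dtype=np.float32) + 0.3
model = LinearRegression().fit(X, y)

# Convert to an ONNX graph; the input signature is declared explicitly
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 3]))])
with open("linear_regression.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Run the exported graph with ONNX Runtime, independently of scikit-learn
session = ort.InferenceSession("linear_regression.onnx", providers=["CPUExecutionProvider"])
predictions = session.run(None, {"input": X[:5]})[0]
print(predictions)

The exported .onnx file can now be loaded by any runtime or framework that understands the ONNX graph, which is exactly the interoperability benefit described above.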
YOLOv8 to ONNX Example

YOLOv8 is an open-source, PyTorch-based CV model by Ultralytics. It helps you with object detection and tracking, image classification, pose estimation, and instance segmentation tasks.

CV Tasks in YOLOv8

Converting the model to ONNX format is straightforward. The following code snippet from Ultralytics demonstrates how you can quickly export and run YOLOv8 in ONNX format.

from ultralytics import YOLO

# Load the YOLOv8 model
model = YOLO("yolov8n.pt")

# Export the model to ONNX format
model.export(format="onnx")  # creates 'yolov8n.onnx'

# Load the exported ONNX model
onnx_model = YOLO("yolov8n.onnx")

# Run inference
results = onnx_model("https://ultralytics.com/images/bus.jpg")

Curious how YOLO works? Learn more about the algorithm in our detailed guide on YOLO Object Detection.

Pre-Trained Models in ONNX

In addition to converting your own models, you can draw on the ONNX Model Zoo, a GitHub repository that offers multiple pretrained CV, NLP, generative AI, and graph ML models in ONNX format. The models are sourced from open-source repositories such as transformers, torchvision, timm, and torch_hub. Vision models cover image classification, object detection, image segmentation, pose estimation, and image manipulation tasks. Language models include machine translation and comprehension algorithms. Lastly, it offers models for speech recognition and visual question answering (VQA) tasks.

Optimizing ONNX Models

Developers can optimize ONNX models for better performance using ONNX Optimizer, an open-source C++ library. The framework helps developers perform arbitrary optimizations that require custom backend information. In addition, ONNX Runtime is the official ONNX production engine, which lets you tailor ONNX-based deployments to specific hardware across multiple platforms. The framework applies relevant optimizations to run models efficiently on CPUs, GPUs, and accelerators. Developers can use ONNX Runtime to deploy models in web, mobile, edge, and cloud-based environments. The library also allows them to boost training speed and accuracy for large language models (LLMs) and perform on-device model learning to protect user privacy.

Applications of ONNX Models in Computer Vision

The ONNX format is versatile and flexible, helping you build and deploy models for multiple use cases. Below is a list of ONNX-based CV applications in various domains.

ONNX for Image Classification: ONNX models can perform complex image classification tasks, such as classifying medical images to diagnose diseases.

Medical Image Classification

ONNX for Object Detection: CV applications often require object detection models with high inference speeds. For instance, models for self-driving cars must recognize objects in real time without delay. Developers can achieve such performance by deploying ONNX models tailored to specific hardware limitations.

Object Detection in self-driving cars

ONNX for Segmentation: Authorities can use ONNX models to perform segmentation tasks for urban planning. These models may require deployment on satellites and drones, where ONNX can optimize inference performance through on-device processing.

a) Urban Planning Map, b) Semantic Segmentation of Map

ONNX for Facial Recognition: Robust security systems require powerful facial recognition algorithms to verify identities and restrict access. ONNX models can help developers optimize model deployment on edge devices such as cameras and sensors.
Facial Recognition

Find out about the top 10 computer vision applications in 2024.

Benefits and Challenges of ONNX Models

Through a unified standard, ONNX offers an efficient cross-platform framework to deploy ML models. However, despite its advantages, ONNX has a few challenges. Users should understand these benefits, challenges, and mitigation strategies to use ONNX to its full potential.

ONNX Benefits

Interoperability Across Frameworks: The key advantage of using ONNX is that it allows developers to build, export, and run models across various platforms.
Flexibility in Deployment: The framework facilitates quick deployment and inference across different hardware types, and ONNX models can run in cloud, edge, mobile, and on-premises environments, making it easier to integrate AI models into diverse production settings.
Optimization for Performance: ONNX Runtime applies performance-enhancing graph optimizations, such as node elimination and fusion, so that exported models execute efficiently wherever they are deployed.
Hardware Agnostic: Developers can run ONNX models on multiple hardware types, including CPUs, GPUs, and accelerators. Relevant libraries tailor ONNX models to specific hardware requirements for streamlined development.
No Vendor Lock-in: Dependency on a single vendor's ecosystem limits the functionality a model can deliver. ONNX frees developers from these restrictions and allows them to use the most suitable platform for a specific use case.

ONNX Challenges

Model Conversion Complexity: While ONNX offers multiple libraries for model conversion, transforming complex architectures into graphs can still be challenging. Certain features or layers in a model might not be fully supported, leading to potential loss of fidelity or requiring additional manual adjustments.
Performance Degradation: Converting models to ONNX may result in performance loss compared to models natively built and run in their original frameworks. This can be particularly noticeable in highly optimized environments or when using specialized hardware. However, using the ONNX Runtime library to apply hardware-specific optimizations can address these performance issues.
Difficulty in Debugging and Troubleshooting: Because tooling and expertise for the ONNX representation are more limited, debugging is often easier in a model's native framework. However, visualization tools with detailed logging can help developers find issues more effectively.
Dependency on Third-Party Tools: ONNX relies on various third-party tools and libraries for conversion and optimization. Compatibility issues between these tools, or lack of support for specific model features, can create additional hurdles for developers.
Evolving Community: As the ONNX platform evolves, instability and backward incompatibility may be recurring issues. Developers must keep track of the latest developments and follow community forums to stay updated on new releases.
Model Size: After converting to ONNX, a model's size may increase. Model compression and regularization techniques can help reduce model sizes.

Best Practices for Deploying ONNX Models

Effective ONNX model deployment requires appropriate strategies to maximize the benefits of the ONNX ecosystem. Below are a few best practices that will help you streamline ONNX deployments for more efficient results.

Select the Right Runtime: Use the ONNX Runtime library to optimize a model's performance before deployment through quantization, pruning, and fusion techniques. Consider hardware-specific runtimes like TensorRT or OpenVINO for additional optimizations.
Performance Tuning: Use ONNX Optimizer to streamline your model by removing redundant nodes and applying performance-enhancing techniques. Regularly profile and benchmark the model to ensure it meets performance goals.
Testing and Validation: Perform thorough unit testing and check relevant metrics such as latency, throughput, and resource utilization. Compare these against benchmarks and previous model versions to identify potential issues before deployment.
Deploying on Different Platforms: Ensure the deployment environment matches model requirements. Use cloud-based resources to scale operations, or use edge devices for fast inferencing while protecting user privacy.
Monitoring and Maintenance: Build continuous monitoring pipelines with real-time alerts that send instant notifications when performance metrics fall below targets. Also, keep an eye on the latest ONNX releases so you can take advantage of new optimizations and fixes.
Ensure Security and Compliance: Secure your deployment environment and ensure compliance with relevant data protection regulations. For edge deployments, optimize model size and efficiency, and test on target devices.
Documentation: Maintaining comprehensive documentation of model structure, version updates, and metadata can streamline ONNX adoption and help new team members get up to speed without much hassle.

ONNX Model: Key Takeaways

Organizations with AI models running on multiple platforms can significantly benefit from ONNX interoperability. The framework enables users to streamline model development and deployment workflows through cross-platform support. Below are a few critical points regarding ONNX.

Standard Format: ONNX is an open-source format for storing and executing models across different platforms.
Compatibility: ONNX supports popular ML frameworks such as PyTorch, Keras, and TensorFlow, with pre-built libraries to convert models in these environments to ONNX.
ONNX Optimization: ONNX lets you optimize models and leverage hardware-specific capabilities through the ONNX Runtime library.
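As a hedged illustration of the runtime-selection and quantization practices above, the sketch below loads an exported model with an explicit list of execution providers and then applies ONNX Runtime's dynamic quantization. The file names are placeholders; onnxruntime.InferenceSession and onnxruntime.quantization.quantize_dynamic are the standard entry points of the onnxruntime Python package.

import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# List execution providers in priority order; ONNX Runtime falls back to the
# next provider if the preferred one (e.g., CUDA) is unavailable on the host.
session = ort.InferenceSession(
    "yolov8n.onnx",  # placeholder: any exported ONNX model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print("Active providers:", session.get_providers())

# Dynamic quantization stores weights as 8-bit integers, shrinking the model
# and often speeding up CPU inference at a small accuracy cost.
quantize_dynamic(
    model_input="yolov8n.onnx",
    model_output="yolov8n.int8.onnx",
    weight_type=QuantType.QInt8,
)

Whether the quantized model meets your accuracy budget should be verified with the same testing and validation metrics (latency, throughput, task accuracy) recommended in the best practices above.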

Aug 15 2024

5 min read

Announcing Encord’s $30 million Series B funding

Today, we are excited to announce that Encord has raised $30 million in Series B funding to invest fully in the future of multimodal AI data development. It’s been a little over three years since we launched our product during Y Combinator’s winter 2021 batch, where, as a two-person company, we were sitting in a California dining room, sending cold emails during the day and struggling with AI model library dependencies at night. While the company has grown significantly and evolved to meet the seismic movements in the AI space, we have not lost the two core convictions we’ve had since the early days of YC: data is the most important factor in AI and the path to building a great company is to delight customers. Currently, the youngest AI company from Y Combinator to raise a Series B, we have grown to become the leading data development platform for multimodal AI. Our goal is to be the final AI data platform a company ever needs. We have already assisted over 200 of the world’s top AI teams, including those at Philips, Synthesia, Zeitview, Northwell Health, and Standard AI, in strengthening their data infrastructure for AI development. Our focus on creating high-quality AI data for training, fine-tuning, and validation has led to the production of better models, faster for our customers. We’re thrilled to have Next47 lead the round, with participation from our existing investors, including CRV, Crane Venture Partners, and Y Combinator. The continued support from our existing investors is a testament to the importance of our mission and recognition that the AI department of the future is the IT department of the past. It’s all about the data The technological platform shift driven by the proliferation of AI will solve problems previously thought unsolvable by technology and, much like the rise of the internet in the previous generation, will touch every person on the planet. This shift has been driven by rapid innovation in the compute and model layers, two of the three main pillars for building AI applications. However, innovation, for data, arguably the most important, most proprietary, and most defensible ingredient, has been stagnant. The data layer has fallen victim to a concoction of hastily built in-house tools and ad-hoc orchestration of distributed workforces for annotation and validation, hurting data quality and ultimately model performance. Powering the models of many of the world’s top AI teams at world-leading research labs and enterprises, we’ve witnessed firsthand the importance of having clean, traceable, governed and secure data. Knowing what data to put into your model and what data to take out is a prerequisite to true production level applications of generative and predictive AI.  At Encord, we think and talk a lot about how we can continue to support our users in their AI adoption and acceleration journey as they cross the chasm from prototype to production. That’s why we have broken down the data problem into first principles and continue to build best-in-class solutions for each of the core components of an AI data development platform: data management & curation, annotation, and model evaluation. We seek to tie solutions to these core components together in a single, seamlessly integrated solution that works with the petabyte-scale datasets that our clients leverage in their journey to monetize their data and turn it into AI. Some call it a data engine. Some call it a data foundry. We call it a data development platform. 
The future of data is the future of us

We're especially excited and proud of our product momentum. In the last three months alone we have added an agentic data workflow system, a consensus quality control protocol, support for audio, world-leading segmentation tracking, and many other features. We've also continued to make high-quality data annotation smoother and faster with the latest automation and foundation models, integrating Meta's new Segment Anything Model into our platform less than a day after it was released, and the vision-language models LLaVA and GPT-4o in the same week they each became publicly available. We plan to leverage the additional capital to accelerate our product roadmap so that we can support our users, existing and new, in even more ways than we have before.

With this commitment to continued innovation of the data layer, we're proud to publicly launch Encord Index to bring ease to multimodal data management and curation. Index is an end-to-end data management platform allowing our users to visualize, search, sort, and manage their internal data at scale. Index gives AI teams full control and governance over their data, helping them understand and operationalize large private datasets in a collaborative and secure way. It integrates seamlessly with data storage such as AWS S3, GCP Cloud Storage, Azure Blob, and others to automate the curation of the best data and remove uninformative or biased data. As a result, our customers have achieved a 35% reduction in dataset size by curating the best data, seen upwards of 20% improvement in model performance, and saved hundreds of thousands of dollars in compute and human annotation costs.

"Successful state-of-the-art models, like our recently released Expressive Avatar foundation model EXPRESS-1, require highly sophisticated infrastructure. Encord Index is a high-performance system for our AI data, enabling us to sort and search at any level of complexity. This supports our continuous efforts to push the boundaries of AI avatar technology and meet customer needs," said Victor Riparbelli, Co-Founder and CEO of Synthesia, the billion-dollar generative AI company.

We're in the early innings of building a generational company that will play a key role in the AI revolution. Thank you to our users, investors, Encordians, and partners, who make all of this possible every day. We are very excited for what's to come.

Aug 13 2024

6 min read

How SAM 2 and Encord Transform Video Annotation

The future of video annotation is here! In this blog, we'll explore how Meta's Segment Anything Model 2 (SAM 2) revolutionizes dataset creation and annotation. Whether you're working on image or video data, SAM 2's enhanced capabilities promise to streamline and elevate your annotation tasks. We'll dive into the powerful features of SAM 2 and the creation of the extensive SA-V dataset, and show you how to harness these tools through Encord to curate your own datasets efficiently. Ready to transform your annotation workflows? Let's get started!

Overview of SAM 2

SAM 2 is Meta's new foundation model that extends the capabilities of the original Segment Anything Model (SAM) into the video domain. Designed to handle both static images and dynamic video content, SAM 2 integrates advanced segmentation and tracking functionalities within a single, efficient framework. Key features include:

Real-Time Object Tracking: SAM 2 can track objects consistently across video frames in real time, ensuring accurate segmentation even when objects temporarily disappear from view.
Memory Module: The model includes a per-session memory module that maintains and tracks selected objects across all frames of a video. This memory system enables dynamic corrections and adjustments based on supplementary prompts, enhancing the accuracy and reliability of segmentation.
Improved Performance and Efficiency: Building on the original SAM, SAM 2 offers enhanced performance with faster inference speeds, making it highly accessible and practical for a wide range of applications, from video editing to mixed reality experiences.

For more information, read the blog Segment Anything Model 2 (SAM 2) & SA-V Dataset from Meta AI.

Overview of SA-V Dataset

The SA-V dataset is a large and diverse collection of video annotations created using the SAM 2 data engine. This dataset is designed to support the development and evaluation of advanced video segmentation models. Key aspects of the SA-V dataset include:

Extensive Coverage: The SA-V dataset includes a vast array of objects and object parts, ensuring comprehensive coverage that facilitates the training of models capable of "segmenting anything" in videos.
High-Quality Annotations: The dataset was created through a meticulous annotation process involving human annotators and the SAM 2 model. This process ensures high-quality, precise segmentations across all frames of the videos.
Interactive Annotation Workflow: The data engine employs an interactive model-in-the-loop setup, allowing annotators to refine and correct mask predictions dynamically. This approach significantly speeds up the annotation process while maintaining accuracy.
Verification and Quality Control: To uphold annotation quality, a verification step is incorporated where separate annotators assess and confirm the accuracy of each masklet. Unsatisfactory annotations are refined or rejected to ensure consistency and reliability.

SAM 2: Segment Anything in Images and Videos

Understanding the SAM 2 Data Engine

When starting a project that requires comprehensive video segmentation, having a large and diverse dataset is important. However, starting from scratch can be challenging due to the initial lack of data needed to train a robust model. That's why the SAM 2 data engine was introduced. It is designed to address this problem by progressively building up a high-quality dataset and improving annotation efficiency over time.
This engine employs an interactive model-in-the-loop setup with human annotators and imposes no semantic constraints on the annotated masklets. The focus is on both whole objects (e.g., a person) and parts (e.g., a person's hat). The data engine operates in three distinct phases, each categorized by the level of model assistance provided to annotators. Here are the phases and the resulting SA-V dataset:

Phase 1: SAM Per Frame

The initial phase used the image-based interactive SAM to assist human annotation. Annotators were tasked with annotating the mask of a target object in every frame of the video at 6 frames per second (FPS) using SAM, along with pixel-precise manual editing tools such as a "brush" and "eraser." No tracking model was involved to assist with the temporal propagation of masks to other frames. Because this is a per-frame method, all frames required mask annotation from scratch, making the process slow, with an average annotation time of 37.8 seconds per frame. However, this method yielded high-quality spatial annotations per frame. During this phase, 16K masklets were collected across 1.4K videos. This approach was also used to annotate the SA-V validation and test sets to mitigate potential biases of SAM 2 during evaluation.

Phase 2: SAM + SAM 2 Mask

The second phase introduced SAM 2 into the loop in a version that only accepted masks as prompts, referred to as SAM 2 Mask. Annotators used SAM and other tools as in Phase 1 to generate spatial masks in the first frame, then used SAM 2 Mask to temporally propagate the annotated mask to other frames, creating full spatio-temporal masklets. At any subsequent video frame, annotators could spatially modify the predictions made by SAM 2 Mask by annotating a mask from scratch with SAM, a "brush," and/or "eraser," and re-propagate with SAM 2 Mask, repeating this process until the masklet was correct. SAM 2 Mask was initially trained on Phase 1 data and publicly available datasets. During Phase 2, SAM 2 Mask was re-trained and updated twice using the collected data. This phase resulted in 63.5K masklets, reducing the annotation time to 7.4 seconds per frame, a ~5.1x speedup over Phase 1. However, this decoupled approach still required annotating masks in intermediate frames from scratch, without previous memory.

Phase 3: SAM 2

In the final phase, the fully featured SAM 2 was utilized, accepting various types of prompts, including points and masks. SAM 2 benefited from memories of objects across the temporal dimension to generate mask predictions, allowing annotators to provide occasional refinement clicks to edit the predicted masklets in intermediate frames, rather than annotating from scratch. SAM 2 was re-trained and updated using the collected annotations five times during this phase. With SAM 2 in the loop, the annotation time per frame decreased to 4.5 seconds, an ~8.4x speedup over Phase 1. This phase produced 197.0K masklets.

Annotation guideline overview. SAM 2: Segment Anything in Images and Videos

Ensuring Quality and Diversity: Verification, Auto Masklet Generation, and Analysis

To maintain high annotation standards, a verification step was introduced. A separate set of annotators assessed each masklet's quality, categorizing it as "satisfactory" (correctly tracking the target object across all frames) or "unsatisfactory" (a well-defined boundary but inconsistent tracking). Unsatisfactory masklets were sent back for refinement, while those tracking poorly defined objects were rejected entirely.
To ensure diverse annotations, automatically generated masklets (referred to as "Auto") were added. SAM 2 was prompted with a grid of points in the first frame to create candidate masklets, which were then verified. Satisfactory auto masklets were included in the SA-V dataset, while unsatisfactory ones were refined by annotators in Phase 3. These masklets covered both prominent central objects and varying background objects.

Comparative analysis across data engine phases showed increased efficiency with maintained quality. Phase 3, using SAM 2, was 8.4x faster than Phase 1, had fewer edited frames per masklet, and required fewer clicks per frame, resulting in better alignment. Performance comparisons of SAM 2, trained on data from each phase, revealed consistent improvements. Evaluations on the SA-V validation set and nine zero-shot benchmarks demonstrated the benefits of iterative data collection and model refinement, with performance measured using the J&F accuracy metric showing significant gains, as shown in the table below.

SAM 2: Segment Anything in Images and Videos

Creating Your Own Dataset with SAM 2 and Encord

If you want to use SAM 2 right away to curate your own dataset, you are in luck. Meta's latest and greatest, SAM 2, is now part of Encord's automated labeling suite, only one day after its official release! ⚡ Whether you're working on images or videos, SAM 2 promises to enhance both accuracy and efficiency, making your annotation tasks smoother and more effective. Ready to dive in? Here's how you can start creating your own dataset using SAM 2 and Encord:

Enable SAM 2: Head over to Encord Labs in your settings and flip the switch to activate SAM 2, as shown in our documentation. Look for the new magic wand icon in the editor; it means you're all set to use SAM 2's latest features.

Use Encord Agents for Image or Video Segmentation: Begin your annotation projects with SAM 2 for image or video segmentation, and use SAM 2 to further enhance video annotation. You'll notice a significant boost in speed and precision, streamlining your workflow and improving annotation quality.

Want to master SAM 2? Join our webinar, "How to Fine-Tune SAM 2," for expert tips on customizing and optimizing your model. Register now to secure your spot and take your SAM 2 skills to the next level!
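Outside of the Encord editor, you can also experiment with SAM 2 programmatically. The sketch below is a minimal, hedged example based on the image-predictor interface published in Meta's open-source sam2 repository; the checkpoint path, config name, image file, and click coordinates are placeholders you would replace with your own.

import numpy as np
from PIL import Image

# Assumes Meta's open-source "sam2" package is installed and a checkpoint downloaded
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "checkpoints/sam2_hiera_large.pt"  # placeholder checkpoint path
model_cfg = "sam2_hiera_l.yaml"                 # placeholder config name

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# Load a single frame and set it as the prediction context
image = np.array(Image.open("frame_0001.jpg").convert("RGB"))  # placeholder frame
predictor.set_image(image)

# A single positive click prompt at (x, y); label 1 marks foreground
masks, scores, _ = predictor.predict(
    point_coords=np.array([[450, 300]]),
    point_labels=np.array([1]),
)
print(masks.shape, scores)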

Aug 01 2024

5 min read

Announcing the launch of SAM 2 in Encord

In April 2023, we introduced the original SAM model to our platform a mere few days after its initial release. Today, we are excited to announce the integration of Meta’s new Segment Anything Model, SAM 2, into our automated labelling suite, just one day after its official release. This rapid integration underscores our commitment to providing our customers with access to cutting-edge machine learning techniques faster than ever before. Integrating SAM 2 brings enhanced accuracy and speed to your automated segmentation workflows, enhancing both throughput and user experience. We’re starting today by bringing SAM 2 into image segmentation tasks, where it’s been benchmarked to perform up to 6x faster than SAM. We are also looking forward to introducing the VOS capabilities of SAM 2, enhancing performance on automating video segmentation technologies already in Encord, such as SAM + Cutie. As an extremely new piece of technology, SAM 2 is being made available to all our customers via Encord Labs. To enable SAM 2, navigate to Encord Labs in your settings and enable the switch for SAM 2, as illustrated in our documentation. When you return to the editor, you’ll know SAM 2 is enabled by the enhanced magic wand icon in the editor, signalling that you are using the latest and most powerful tools for your annotation tasks. We are eager for our customers to try out SAM 2 and experience its benefits firsthand. We believe that this integration will significantly enhance the capabilities of our platform and provide unparalleled accuracy and speed in data annotation. We invite all users to send their feedback to product@encord.com. Your insights are invaluable as we continue to push the boundaries of what’s possible in machine learning annotation and evaluation. Thank you for being a part of this exciting journey with Encord. We look forward to continuing to deliver world-leading technology at a rapid pace to meet the needs of our innovative customers.

Jul 31 2024

2 min read

