
How to automate annotation with GPT-4 Vision’s rival, LLaVA

December 11, 2023
|
2 mins

In April, we announced the integration of the Segment Anything Model (SAM) into our annotation platform, Encord Annotate, the first such integration in our industry. Over the past few months, we have been proud to see the benefits that automated labeling with SAM has brought to our customers, moving us closer to our goal of enabling more efficient annotation workflows.

With the release of new multimodal vision-language models (VLMs) like Gemini and GPT-4 Vision, visual labeling is about to transform fundamentally. Today, we are excited to introduce another first: automated labeling powered by the open-source VLM LLaVA. As highlighted in our recent post, GPT-4 Vision vs. LLaVA, LLaVA has shown impressive performance in combining the power of large language models with visual tasks. Read on to learn how you can leverage this with Encord.

What is LLaVA?

Large Language and Vision Assistant (LLaVA) is one of the pioneering multimodal models. Although LLaVA was trained on a small dataset, it demonstrates remarkable skills in image understanding and answering image-related questions. It shines when it comes to tasks that require a high level of visual interpretation and following instructions. It is worth mentioning that LLaVA exhibits behaviors similar to those of GPT-4 and other multimodal models, even when given instructions and images that it has never seen before.

Introducing LLaVA-Powered Classification on Encord: Automatically Label Images with Natural Language

With our latest release, our LLaVA-powered classification tool automatically labels your image based on nothing more than natural language comprehension of your labeling ontology. Give the tool an image and an arbitrarily complex ontology, and let it auto-label for you!
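
To make the idea concrete, here is a minimal, illustrative sketch of the underlying pattern: prompting an open-source LLaVA checkpoint to pick one class from a simple ontology using plain language. It uses the publicly available llava-hf/llava-1.5-7b-hf weights via Hugging Face Transformers purely as an example; it is not Encord's internal implementation, and the ontology, image path, and prompt wording are assumptions for illustration.

```python
# Illustrative sketch only: classify an image against a simple ontology with an
# open-source LLaVA checkpoint. This is NOT Encord's internal implementation.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # public LLaVA 1.5 weights on Hugging Face

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Hypothetical ontology and image path, for illustration only.
ontology_classes = ["cat", "dog", "bird", "none of the above"]
image = Image.open("example.jpg")

# LLaVA 1.5 conversation format: the <image> token marks where the image goes.
prompt = (
    "USER: <image>\n"
    f"Classify this image as exactly one of: {', '.join(ontology_classes)}. "
    "Answer with the class name only.\n"
    "ASSISTANT:"
)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
decoded = processor.decode(output_ids[0], skip_special_tokens=True)

# The decoded string contains the prompt followed by the model's answer.
predicted_class = decoded.split("ASSISTANT:")[-1].strip()
print(predicted_class)
```

In the product, this mapping from the model's answer back onto your labeling ontology is handled for you; the sketch just shows why an arbitrarily complex ontology can be expressed as natural language in a prompt.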

How to get started: Annotate with our LLaVA-powered multimodal automated labeling tool in four simple steps

Step 1: Set up your classification project by attaching your dataset and ontology

Step 2: Open your annotation task and click ‘Automated labeling’

Step 3: Go to the LLM prediction section and press the magic ‘Predict’ button.

Step 4: Within a few moments, LLaVA will populate your tasks with labels based on your ontology. You can adjust and correct the labels if you deem them inaccurate for your task.

With the integration of SAM, you could segment anything in your images to aid annotation. Now, with LLaVA, our platform can also understand labeling instructions through natural language and the context of your images. We have already seen customers use LLaVA as a labeling assistant that suggests corrections to potentially incorrect annotations, which is faster than going through a full review cycle.
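
As an aside, this labeling-assistant pattern can be approximated with the same open-source model by asking it to verify an existing label rather than produce one. The prompt below is a hypothetical illustration that reuses the model, processor, and image from the sketch above; it is not the prompt Encord uses.

```python
# Hypothetical prompt for suggesting corrections to an existing annotation,
# reusing `model`, `processor`, `image`, and `torch` from the sketch above.
existing_label = "dog"  # an annotation we want sanity-checked

verify_prompt = (
    "USER: <image>\n"
    f"This image is currently labeled as '{existing_label}'. "
    "Is that label correct? Answer 'yes' or 'no', and if 'no', suggest a better label.\n"
    "ASSISTANT:"
)

inputs = processor(text=verify_prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=False)
suggestion = processor.decode(output_ids[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()
print(suggestion)
```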

Why LLaVA?

We chose to integrate LLaVA for automated labeling due to the following reasons:

  • Comparative strengths with GPT-4: When compared to GPT-4, we found that LLaVA showed comparable performance in interpreting images and excelled in certain scenarios. Although it struggled with tasks requiring Optical Character Recognition (OCR), its overall performance, especially considering its training on a smaller dataset, is noteworthy.
  • Improved chat capabilities and Science QA: LLaVA demonstrates impressive chat capabilities and sets new standards in areas like Science Question Answering (QA). This indicates its strong capacity to grasp and interpret labeled diagrams and complex instructions involving text and images.
  • Open source accessibility: One of the key advantages of LLaVA is its availability as an open-source project. This accessibility is a significant factor, allowing for wider adoption and customization based on specific annotation needs.
  • Multimodal functionality: LLaVA works well for comprehensive data annotation because it was trained to be a general-purpose multimodal assistant capable of handling textual and visual data.
  • Data privacy: Incorporating LLaVA into our platform aligns with our unwavering commitment to data privacy and ethical AI practices. Because LLaVA runs entirely within our own infrastructure, no data is ever sent to external services.

The essential advantage of integrating LLaVA is its ability to process and understand complex multimodal datasets. With LLaVA, our annotation platform uses advanced AI to analyze and label large volumes of data with far greater speed and consistency, improving both the efficiency of your annotation process and the accuracy of its output.

What’s Next?

With this advancement, we can accelerate annotation efforts across various domains. Over the coming months, we will continue improving the accuracy and output quality of this feature to ensure it can better understand the relationships in your ontology. LLaVA, and other VLMs, will be able to improve training data workflows in many more ways, and our team is very excited for what's to come. 👀✨

This feature will be available for select customers, and we will continue sharing case studies and use cases from customers whose annotation workflows are no longer bottlenecked by long review times. If you want to try our automated labeling with natural language, please contact our team to get started.

Written by Frederik Hvilshøj