
Encord Monthly Wrap: May Industry Newsletter

May 29, 2024 | 4 mins

Hi there,

Welcome to the Computer Vision Monthly Wrap for May 2024!

Here’s what you should expect:

  • 🤖 An Introduction to Vision-Language Modeling (VLM)
  • 📽️ PaliGemma – Google's Open Source Vision Language Model (VLM)
  • ⚔️ GPT-4o vs. Gemini 1.5 Pro vs. Claude 3 Opus: Multimodal AI Model Comparison
  • ⚒️ Developer resources to use for your next vision AI application.
  • 🔎 TTI-Eval – Open-source library to evaluate the performance of fine-tuned CLIP models and other VLMs.

Let’s go! 🚀

Check out Text-to-image-eval (TTI-Eval) → TTI-Eval is an open-source library for evaluating zero-shot classification models like CLIP and domain-specific ones like BioCLIP against your own (or Hugging Face) datasets to estimate how well the model will perform. The metrics include zero-shot accuracy, linear-probe accuracy, image retrieval, and KNN accuracy.
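To make the zero-shot accuracy metric concrete, here is a minimal, illustrative sketch of zero-shot classification with CLIP via Hugging Face transformers. This is not the TTI-Eval API itself; the checkpoint, label prompts, and image path are placeholder assumptions. TTI-Eval automates this kind of evaluation (plus linear probing, KNN, and retrieval) across whole datasets.

```python
# Illustrative only: a single zero-shot classification check with CLIP.
# TTI-Eval runs this kind of evaluation over full datasets and reports metrics.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
image = Image.open("sample.jpg")  # hypothetical local image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity of the image to each text prompt, turned into probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(labels[probs.argmax().item()])
```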

📜 Top Picks for Computer Vision Papers This Month

An Introduction to Vision-Language Modeling

Researchers at Meta AI released a paper that covers how VLMs work, how to train them, and approaches to evaluation.

This paper provides a comprehensive introduction to Vision-Language Models (VLMs), which extend Large Language Models (LLMs) to the visual domain. VLMs have the potential to revolutionize how we interact with technology, from visual assistants to generative models that create images from text descriptions. 

The paper aims to help anyone enter the field by explaining VLMs, how they work, how to train them, and how to evaluate them.
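As a concrete anchor for the "how to train them" part: one family of objectives surveyed in introductions like this is contrastive image-text alignment in the style of CLIP. Below is a minimal sketch of the symmetric contrastive loss over a batch of paired embeddings; the batch size, embedding dimension, and temperature are arbitrary assumptions for illustration.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss.
# The embedding tensors here are random stand-ins for encoder outputs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so the dot product becomes cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarities: row i should match column i (its paired caption).
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(image_emb))

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings for a batch of 8 image-text pairs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```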

What’s impressive? 🤯

  • VLMs can enable visual assistants that guide users through unfamiliar environments.
  • Generative VLMs can produce images from high-level text descriptions alone.
  • The paper provides a clear introduction to VLMs for anyone wanting to enter the field.
  • While focusing primarily on image-to-language mapping, the paper also discusses extending VLMs to videos.

An Introduction to Vision-Language Modeling by FAIR.

How can you apply it? ⚒️

  • Researchers can use this paper as a starting point for their research on VLMs.
  • Developers can leverage the information in this paper to build and deploy VLM applications.
  • Business stakeholders can gain a better understanding of the potential of VLMs and how they can be used to create value.
  • Enthusiasts can learn about the latest developments in this exciting field and explore the possibilities of VLMs.


📜 Read the publication.

PaliGemma – Google's Open Source Vision Language Model (VLM)

Alongside introducing Project Astra, Gemini 1.5 Flash, and updates to Gemini 1.5 Pro, Google open-sourced PaliGemma-3B, a state-of-the-art Vision-Language Model (VLM) inspired by the PaLI-3 recipe. It fuses the SigLIP visual encoder and the Gemma 2B language model (as the decoder) to process and generate language based on visual inputs.

What’s impressive? 👀

  • PaliGemma uses the state-of-the-art SigLIP visual encoder (SigLIP-So400m/14) to convert images into "soft tokens" for the model to understand and process visual information.
  • Integrating the Gemma 2B language model, PaliGemma can generate coherent and contextually relevant text based on the input images and text prompts (see the inference sketch after this list) 🤯.
  • The model's architecture concatenates image and prefix tokens before passing them to the Gemma decoder. This allows for seamless interaction between visual and textual information for more accurate and meaningful outputs.
  • Its ability to handle multiple input images and generate auto-regressive text with masked attention shows its versatility and potential for complex multimodal tasks.
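Here is a minimal inference sketch using the transformers integration of PaliGemma (the "mix" checkpoint fine-tuned on a mixture of tasks). The image path and prompt are placeholders, and the generation settings are assumptions rather than recommended values.

```python
# Minimal sketch: image captioning with PaliGemma via Hugging Face transformers.
# The checkpoint, image path, and prompt are illustrative placeholders.
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("street_scene.jpg")  # hypothetical local image
prompt = "caption en"                   # PaliGemma task-prefix prompt for captioning

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)

# The Gemma decoder echoes the prompt tokens, so decode only the new ones.
input_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```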

PaliGemma – Google's Open Source Vision Language Model (VLM) Hugging Face Space.

How can you apply it? ⚒️

  • PaliGemma can automatically generate descriptive captions for images. This could improve accessibility and user experience in applications such as social media platforms or e-commerce websites.
  • The model can answer questions about input images for interactive and engaging user experiences in educational, entertainment, or customer support settings.
  • PaliGemma can extract and understand text present in images, which is valuable for applications like document processing, OCR, or scene understanding.
  • PaliGemma can be applied in fields such as autonomous vehicles, surveillance systems, or medical image analysis by identifying and localizing objects within images.
  • Code on GitHub.
  • Hugging Face Spaces Demo.


📜 Read the Hugging Face blog post to learn more.

GPT-4o vs. Gemini 1.5 Pro vs. Claude 3 Opus: Multimodal AI Model Comparison

This month, the multimodal AI wars reached an all-time high. OpenAI led the way with the announcement of GPT-4o, which offers real-time multimodality, followed by Google’s major updates to its Gemini models. Oh, and let’s not forget Anthropic’s Claude 3 Opus, too.


This article reviews each model's capabilities, strengths, and weaknesses, comparing their performance across various benchmarks and real-world applications.
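For a sense of what "multimodal" means in practice, here is a small illustrative sketch of sending an image plus a question to GPT-4o with the OpenAI Python SDK. The image URL and prompt are placeholders; Gemini 1.5 Pro and Claude 3 Opus expose comparable multimodal message APIs.

```python
# Illustrative sketch: one multimodal (image + text) request to GPT-4o.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is happening in this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)
```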

GPT-4o vs. Gemini 1.5 Pro vs. Claude 3 Opus: Multimodal AI Model Comparison.

🧑‍💻 Developer Resources You’d Find Useful

📰 In the News

  • Spoor Uses AI to Save Birds from Wind Turbines → Spoor is software that uses computer vision to detect birds in video, track their movements, and predict their flight patterns. The Oslo, Norway-based company raised a $4 million seed round from investors.

Till next month, have a super-sparkly time!

Written by Stephen Oladele