
Encord Monthly Computer Vision Wrap: April Industry Newsletter

April 30, 2024
|
5 mins

Hi there,

Welcome to the Computer Vision Monthly Wrap for April 2024!

Here’s what you should expect:

  • Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation.
  • 🖌️ HQ-Edit: A High-Quality Dataset for Instruction-Based Image Editing.
  • 🧑‍🏫 Instruction-tuning Llama 3 for performance gains on vision-language models (VLMs).
  • ⚒️ Developer resources to use for your next vision AI application.
  • 🔎 TTI-Eval to evaluate the performance of fine-tuned CLIP models and other VLMs.
  • 🤖 Grok-1.5V from Elon Musk’s xAI.

Let’s go! 🚀

📜 Top Picks for Computer Vision Papers This Month

Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation

Researchers at Meta AI released Imagine Flash, an image generation model synthesizing images and animations in real time as you prompt it.

Here are the three main parts of the researchers' approach:

  • Backward Distillation, which reduces the mismatch between training and inference by calibrating the student on its own backward trajectory (a toy sketch follows this list);
  • Shifted Reconstruction Loss, which dynamically adapts how knowledge is transferred from the teacher based on the current time step;
  • Noise Correction, an inference-time technique that improves sample quality by addressing singularities in noise prediction during the first sampling steps.
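
To make the backward-distillation idea concrete, here is a minimal, hypothetical PyTorch sketch: a tiny stand-in denoiser (`TinyDenoiser`) is rolled out from pure noise along the student's own trajectory, and the teacher's predictions at those student-generated states provide the distillation targets. All names, dimensions, and the schedule are assumptions for illustration, not Meta's Emu / Imagine Flash implementation, and the shifted reconstruction loss and noise correction are omitted.

```python
# Toy backward-distillation sketch (assumed names and schedule), NOT Meta's code.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for a diffusion denoiser eps_theta(x_t, t) on 64-dim 'images'."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        t_emb = t.float().unsqueeze(-1) / 1000.0          # crude time-step embedding
        return self.net(torch.cat([x_t, t_emb], dim=-1))

def ddim_step(model, x_t, t, t_next, alphas_cumprod):
    """One deterministic DDIM-style denoising step from t to t_next."""
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
    eps = model(x_t, torch.full((x_t.shape[0],), t))
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps

teacher, student = TinyDenoiser(), TinyDenoiser()
student.load_state_dict(teacher.state_dict())              # initialize student from teacher
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)          # toy noise schedule
student_steps = [999, 666, 333, 0]                          # three denoising steps

for _ in range(10):                                         # toy training loop
    x = torch.randn(8, 64)                                  # start from pure noise, no real data
    for t, t_next in zip(student_steps[:-1], student_steps[1:]):
        t_batch = torch.full((x.shape[0],), t)
        with torch.no_grad():                               # teacher target at the STUDENT's state
            target_eps = teacher(x, t_batch)
        loss = ((student(x, t_batch) - target_eps) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():                               # advance along the student's own trajectory
            x = ddim_step(student, x, t, t_next, alphas_cumprod)
```

The point of the rollout is that the student only ever sees states it produced itself from noise, which is exactly the condition it faces at inference time.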

What’s impressive? 🤯

  • It’s powering new image generation features in Meta AI and WhatsApp that generate images in real time as you type prompts.
  • It performs comparably to the teacher model, using only three denoising steps for efficient, high-quality generation.
  • Imagine Flash’s distillation acceleration method outperforms existing competitors in quantitative metrics and human evaluations when applied to the Emu baseline. The noise correction step also performs well on color synthesis.

Imagine Flash vs State-of-the-Art (SoTA)

How can you apply it? ⚒️

📜 Read the publication.

HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing

The paper introduces a new large-scale dataset called HQ-Edit for training models to perform image editing based on natural language instructions. The dataset contains around 200,000 high-resolution image editing examples, each consisting of an input image, an output edited image, and detailed editing instructions.

The researchers developed a pipeline leveraging advanced AI models like GPT-4 and DALL-E 3 to create HQ-Edit. It starts by collecting and expanding various image descriptions and editing instructions. These are then used to generate "diptychs": side-by-side input and output images.

Finally, the diptychs undergo post-processing to split them into input-output pairs, refine their alignment, and enhance the editing instructions.
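
As a simplified illustration of that splitting step, here is a minimal Pillow sketch that cuts a side-by-side diptych into its input and edited halves. The file names and the even left/right split are assumptions for illustration, not details from the paper (whose post-processing also refines alignment and instructions).

```python
# Minimal Pillow sketch of the diptych-splitting step; file names are illustrative.
from PIL import Image

def split_diptych(path: str) -> tuple[Image.Image, Image.Image]:
    """Split a side-by-side diptych into (input_image, edited_image)."""
    diptych = Image.open(path)
    w, h = diptych.size
    left = diptych.crop((0, 0, w // 2, h))     # the original input image
    right = diptych.crop((w // 2, 0, w, h))    # the edited output image
    return left, right

source, edited = split_diptych("diptych_0001.png")
source.save("input_0001.png")
edited.save("output_0001.png")
```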


What’s impressive? 👀

  • HQ-Edit contains 200,000 high-resolution (around 900x900) image editing examples with detailed instructions. This is substantially larger and of higher quality than prior datasets.
  • Introducing Alignment and Coherence metrics computed with GPT-4 provides a more comprehensive way to assess the quality of image editing examples than simpler metrics like CLIP similarity (a minimal example of that baseline follows this list).
  • Models trained on HQ-Edit achieve impressive results, showcasing the dataset's value and overall approach. The gains over human-annotated data are especially noteworthy.
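
For reference, the simpler baseline the authors compare against, CLIP similarity between an image and its description, can be computed in a few lines with Hugging Face Transformers. This is a generic sketch of that baseline (model choice, file name, and caption are assumptions), not the paper's GPT-4-based Alignment and Coherence evaluation.

```python
# Generic CLIP-similarity sketch with Hugging Face Transformers; model choice
# and file/caption text are assumptions, not details from the HQ-Edit paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("output_0001.png")                 # an edited image
text = "a red sports car parked next to the beach"    # its target description

inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Cosine similarity between the (normalized) image and text embeddings.
img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
print(f"CLIP similarity: {(img_emb @ txt_emb.T).item():.3f}")
```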

HQ-Edit results

How can you apply it? ⚒️

📜 Read the paper on arXiv.

Meta’s Llama 3 and Its Multimodal Capabilities

This month, Meta released Llama 3 pre-trained and instruction-fine-tuned language models with 8 billion (8B) and 70 billion (70B) parameters. These models bring better reasoning, coding, and math-solving capabilities, and they set a new state of the art (SoTA) for open-source models of their size that you can use today.

Now, we know this is primarily a language model, but as this video explained, it also has benefits for vision-language models (VLMs). Training a VLM typically involves training an LLM and an image encoder separately, then training a projection component to align the outputs of the two.

You can reuse Llama 3 as the LLM component in a VLM (e.g., LLaVA) instead of training a separate LLM from scratch, because only a small portion of the LLM needs to be fine-tuned for specific tasks. You can also use instruction tuning to improve the LLM's performance on those tasks. A minimal sketch of the projection idea follows.
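
Here is a hypothetical LLaVA-style sketch of that projection component in PyTorch: a small MLP (`VisionProjector`, an assumed name) maps frozen image-encoder patch features into the LLM's embedding space so they can be concatenated with the text-token embeddings. The dimensions (1024 for a CLIP-style encoder, 4096 for Llama 3 8B) and the two-layer MLP are illustrative assumptions, not the exact LLaVA or Llama 3 configuration.

```python
# Hypothetical LLaVA-style projector sketch; dimensions and the two-layer MLP
# are illustrative assumptions, not Llama 3 or LLaVA internals.
import torch
import torch.nn as nn

VISION_DIM = 1024   # assumed patch-feature size of a frozen CLIP-style image encoder
LLM_DIM = 4096      # hidden size of Llama 3 8B (used here only as a number)

class VisionProjector(nn.Module):
    """Maps frozen image-encoder patch features into the LLM's embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_features)

# Toy forward pass: 576 image patches become 576 "visual tokens" that are
# concatenated with the text-token embeddings before being fed to the LLM.
projector = VisionProjector(VISION_DIM, LLM_DIM)
patch_features = torch.randn(1, 576, VISION_DIM)   # output of the frozen image encoder
visual_tokens = projector(patch_features)          # (1, 576, LLM_DIM)
text_embeds = torch.randn(1, 32, LLM_DIM)          # embeddings of the text prompt
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_inputs.shape)                            # torch.Size([1, 608, 4096])
```

Because only the projector (and optionally a small slice of the LLM) is trained, swapping in a stronger base model like Llama 3 is comparatively cheap.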

Here is an explainer post that distills the most important bits of the technical report you need to know.

Grok-1.5 Vision: First Multimodal Model from Elon Musk’s xAI

Researchers at Elon Musk’s xAI released Grok-1.5V, a multimodal model that expands the capabilities of traditional text-based LLMs to include visual understanding. It interprets language and can process various image types, with impressive performance on complex reasoning tasks such as spatial understanding.

Here are the highlights:

1️⃣ It can draw insights from various domains, combining visual and textual information to arrive at complex conclusions.

2️⃣ It builds upon the strong language foundation of Grok-1, extending its abilities with visual understanding.

3️⃣ xAI introduced the RealWorldQA benchmark to measure the model’s ability to understand and reason about spatial relationships within the physical world.

4️⃣ It is in a preview stage and accessible to a limited group of early testers, including existing Grok users and subscribers to X’s Premium+ service.

Here is an explainer post that distills the most important bits of the technical paper you need to know.

🧑‍💻 Developer Resources You’d Find Useful

  • Imgsys.org (Like Chatbot Arena but for Images) → imgsys.org is a generative arena for text-guided open-source image generation models, similar to lmsys.org (Chatbot Arena).
  • Text-to-image-eval (TTI-Eval) → TTI-Eval is an open-source library for evaluating zero-shot classification models like CLIP, and domain-specific ones like BioCLIP, against your own (or Hugging Face) datasets to estimate how well a model will perform. The metrics include zero-shot accuracy, linear probe, image retrieval, and KNN accuracy (a hand-rolled sketch of this kind of evaluation appears below).

🔥 NEW RELEASE: We released TTI-Eval (text-to-image evaluation) this month. Get started with it on GitHub, and do ⭐️ the repo if it's awesome. 🔥
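
If you want to see what a zero-shot accuracy number like the one TTI-Eval reports boils down to, here is a hand-rolled sketch using Hugging Face's CLIP on a small CIFAR-10 slice. This is an independent illustration of the metric, not TTI-Eval's actual API; the dataset choice and prompt template are assumptions.

```python
# Hand-rolled zero-shot accuracy sketch; this is NOT TTI-Eval's API. The dataset
# (a small CIFAR-10 slice) and the prompt template are assumptions.
import torch
from datasets import load_dataset
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

ds = load_dataset("cifar10", split="test[:200]")          # small sample for speed
class_names = ds.features["label"].names
prompts = [f"a photo of a {name}" for name in class_names]

correct = 0
for example in ds:
    inputs = processor(text=prompts, images=example["img"], return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image         # shape (1, num_classes)
    correct += int(logits.argmax(dim=-1).item() == example["label"])

print(f"Zero-shot accuracy: {correct / len(ds):.2%}")
```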

📰 In the News

  • AI and Computer Vision to Detect Brain Abnormalities → Researchers at Al Buraimi University College analyzed features of MRI images, such as color and texture, to detect abnormalities accurately. By examining the symmetry between the brain's lobes, they developed an algorithm with precision, recall, and accuracy of 95.3%, 94.7%, and 95%, respectively.

Here are other quick finds if you 💓Encord and computer vision data stuff ⚡:

Till next month, have a super-sparkly time!

Written by Stephen Oladele