Encord Monthly Computer Vision Wrap: April Industry Newsletter
Hi there,
Welcome to the Computer Vision Monthly Wrap for April 2024!
Here’s what you should expect:
- ⚡ Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation.
- 🖌️ HQ-Edit: A High-Quality Dataset for Instruction-Based Image Editing.
- 🧑‍🏫 Instruction-tuning Llama 3 for performance gains on vision-language models (VLMs).
- ⚒️ Developer resources to use for your next vision AI application.
- 🔎 TTI-Eval to evaluate the performance of fine-tuned CLIP models and other VLMs.
- 🤖 Grok 1.5V from Elon Musk’s xAI.
Let’s go! 🚀
📜 Top Picks for Computer Vision Papers This Month
Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation
Researchers at Meta AI released Imagine Flash, an image generation model that synthesizes images and animations in real time as you type a prompt.
Here are the three main parts of the researchers' approach:
- Backward Distillation, which reduces the mismatch between training and inference by calibrating the student on its own backward trajectory;
- Shifted Reconstruction Loss, which adapts how knowledge is transferred from the teacher based on the current time step;
- Noise Correction, an inference-time technique that improves sample quality by addressing singularities in noise prediction.
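To make the first idea more concrete, here is a toy sketch of a backward-distillation-style loss. It is not Meta's implementation: the scalar "images", the linear noise schedule `x_t = (1 - t) * x0 + t * eps`, the hand-written `teacher` and `student` functions, and all function names are assumptions made for illustration. The only point it shows is that the student is compared with the teacher at a state reached by the student's own sampling steps from pure noise, rather than at a noised version of a clean training image.

```python
import random


def step(x_t, x0_hat, t, s):
    """Move from noise level t down to level s under the toy schedule
    x_t = (1 - t) * x0 + t * eps (an assumption, not the paper's schedule)."""
    eps_hat = (x_t - (1 - t) * x0_hat) / t
    return (1 - s) * x0_hat + s * eps_hat


def student_backward(student, x_T, levels, stop_index):
    """Run the student's own sampling steps from levels[0] to levels[stop_index]."""
    x = x_T
    for i in range(stop_index):
        t, s = levels[i], levels[i + 1]
        x = step(x, student(x, t), t, s)
    return x


def backward_distillation_loss(student, teacher, levels, stop_index, rng):
    """Squared gap between student and teacher predictions, evaluated at a
    state produced by the student's own backward trajectory."""
    x_T = rng.gauss(0.0, 1.0)  # start from pure noise, as at inference time
    x_t = student_backward(student, x_T, levels, stop_index)
    t = levels[stop_index]
    return (student(x_t, t) - teacher(x_t, t)) ** 2


def teacher(x, t):
    """Hypothetical teacher: predicts the clean sample from a noisy one."""
    return 0.5 * x


def student(x, t):
    """Hypothetical student that slightly disagrees with the teacher."""
    return 0.4 * x


levels = [1.0, 0.66, 0.33]  # three denoising steps, highest noise first
rng = random.Random(0)
losses = [
    backward_distillation_loss(student, teacher, levels, i, rng)
    for i in range(len(levels))
]
print(losses)
```

In a real training loop the loss would be averaged over many noise samples and used to update the student's weights; the sketch only computes it.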
What’s impressive? 🤯
- It’s powering new image generation features in Meta AI and WhatsApp that generate images in real time as you type in prompts.
- It performs comparably to the teacher model while using only three denoising steps for efficient, high-quality generation.
- Imagine Flash’s distillation acceleration method outperforms existing competitors in quantitative metrics and human evaluations (when applied to the Emu baseline). The noise correction also helps with color synthesis.
How can you apply it? ⚒️
- Although the model is not open-source, you can start testing it on Meta AI’s website.
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
The paper introduces a new large-scale dataset called HQ-Edit for training models to perform image editing based on natural language instructions. The dataset contains around 200,000 high-resolution image editing examples, each consisting of an input image, an output edited image, and detailed editing instructions.
The researchers developed a pipeline that uses advanced AI models like GPT-4 and DALL-E 3 to create HQ-Edit. It starts by collecting and expanding various image descriptions and editing instructions. These are then used to generate "diptychs": side-by-side input and output images.
Finally, the diptychs undergo post-processing to split them into input-output pairs, refine their alignment, and enhance the editing instructions.
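As a rough illustration of the splitting step only, the sketch below cuts a side-by-side image into its left (input) and right (output) halves. It models an image as a nested list of pixel values to stay dependency-free, and `split_diptych` is a hypothetical helper name, not a function from the HQ-Edit codebase. The alignment refinement and instruction enhancement described above are not covered.

```python
def split_diptych(image):
    """Split a side-by-side image (a list of pixel rows) into left and right halves."""
    width = len(image[0])
    if any(len(row) != width for row in image):
        raise ValueError("all rows must have the same width")
    half = width // 2
    # With an odd width, the middle column is dropped so both halves match in size.
    left = [row[:half] for row in image]
    right = [row[width - half:] for row in image]
    return left, right


# A tiny 2x4 "diptych": the left half is all 1s, the right half is all 9s.
diptych = [
    [1, 1, 9, 9],
    [1, 1, 9, 9],
]
input_image, output_image = split_diptych(diptych)
print(input_image)   # [[1, 1], [1, 1]]
print(output_image)  # [[9, 9], [9, 9]]
```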
What’s impressive? 👀
- HQ-Edit contains 200,000 high-resolution (around 900x900) image editing examples with detailed instructions. This is substantially larger and of higher quality than prior datasets.
- The new Alignment and Coherence metrics, computed with GPT-4, provide a more comprehensive way to assess the quality of image editing examples than simpler metrics like CLIP similarity.
- Models trained on HQ-Edit achieve impressive results, showcasing the dataset's value and overall approach. The gains over human-annotated data are especially noteworthy.
How can you apply it? ⚒️
- The most direct application is using HQ-Edit to train your own instruction-based image editing models. The dataset is publicly available, so it is a valuable resource to build on.
- Code on GitHub: https://github.com/UCSC-VLAA/HQ-Edit
- Dataset: https://huggingface.co/datasets/UCSC-VLAA/HQ-Edit
- HuggingFace Spaces Demo: https://huggingface.co/spaces/LAOS-Y/HQEdit
Meta’s Llama 3 and the Multimodal Capabilities
This month, Meta released the Llama 3 pre-trained and instruction-fine-tuned language models with 8 billion (8B) and 70 billion (70B) parameters. These models bring improvements such as better reasoning, coding, and math-solving capabilities. They set a new state-of-the-art (SoTA) among open-source models of their sizes, and they are available for you to use.
Now, we know this is primarily a language model, but as this video explained, it also benefits vision-language models (VLMs). Training a VLM typically involves training an LLM and an image encoder separately, then training a projection component to align the outputs of the other two.
You can reuse Llama 3 as the LLM component in a VLM (e.g., LLaVA) instead of training a separate LLM from scratch. This is because VLMs only require a small portion of the LLM to be fine-tuned for specific tasks. You can then use instruction-tuning to improve the LLM's performance on those tasks.
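The role of the projection component can be sketched in a few lines. This is a simplified, dependency-free illustration, not LLaVA's or Meta's code: the dimensions are toy values, the random numbers stand in for real encoder outputs and token embeddings, the projection is a single linear map (real projectors can be small multi-layer networks), and `linear` and `build_llm_inputs` are hypothetical helper names.

```python
import random


def linear(vec, weights):
    """Multiply a vector of length d_in by a d_in x d_out weight matrix."""
    d_out = len(weights[0])
    return [sum(v * row[j] for v, row in zip(vec, weights)) for j in range(d_out)]


def build_llm_inputs(patch_features, projection, text_embeddings):
    """Project image patch features into the LLM embedding space and
    prepend them to the text token embeddings."""
    visual_tokens = [linear(patch, projection) for patch in patch_features]
    return visual_tokens + text_embeddings


rng = random.Random(0)
VISION_DIM, LLM_DIM = 4, 6  # toy sizes, far smaller than real models

# Stand-ins for the image encoder output (3 patches) and the LLM's
# token embeddings for a 5-token prompt.
patch_features = [[rng.random() for _ in range(VISION_DIM)] for _ in range(3)]
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(5)]

# The projection is the piece that is trained to align the two components.
projection = [[rng.random() for _ in range(LLM_DIM)] for _ in range(VISION_DIM)]

sequence = build_llm_inputs(patch_features, projection, text_embeddings)
print(len(sequence), len(sequence[0]))  # 8 6
```

After projection, the image patches look like ordinary token embeddings to the LLM, which is why an existing model such as Llama 3 can be dropped in as the language component.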
Grok-1.5 Vision: First Multimodal Model from Elon Musk’s xAI
Researchers at Elon Musk’s xAI released Grok-1.5V, a multimodal model that expands the capabilities of traditional text-based LLMs to include visual understanding. It interprets language and can process various image types, with impressive performance on complex reasoning tasks such as spatial understanding.
Here are the highlights:
1️⃣ It can draw insights from various domains, combining visual and textual information to arrive at complex conclusions.
2️⃣ It builds upon the strong language foundation of Grok-1, extending its abilities with visual understanding.
3️⃣ xAI introduced the RealWorldQA benchmark to measure the model’s ability to understand and reason about spatial relationships within the physical world.
4️⃣ It is in a preview stage and accessible to a limited group of early testers. This includes existing Grok users and subscribers to X's Premium+ service.
🧑‍💻 Developer Resources You’d Find Useful
- Imgsys.org (Like Chatbot Arena but for Images) → imgsys.org is a generative arena for text-guided open-source image generation models, similar to lmsys.org (Chatbot Arena).
- Text-to-image-eval (TTI-Eval) → TTI-Eval is an open-source library for evaluating zero-shot classification models like CLIP and domain-specific ones like BioCLIP against your own (or Hugging Face) datasets to estimate how well the model will perform. The metrics include zero-shot accuracy, linear probe, image retrieval, and KNN accuracy.
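To show what a zero-shot accuracy metric measures, here is a minimal sketch that does not use the TTI-Eval API. Each image is assigned to the class whose text embedding is most similar to the image embedding, and accuracy is the share of images assigned to their true class. The 2-D vectors are made-up stand-ins for real CLIP embeddings, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(image_embedding, class_text_embeddings):
    """Return the index of the class whose text embedding best matches the image."""
    scores = [cosine(image_embedding, text) for text in class_text_embeddings]
    return scores.index(max(scores))


def zero_shot_accuracy(image_embeddings, labels, class_text_embeddings):
    """Fraction of images whose predicted class equals the true label."""
    correct = sum(
        zero_shot_predict(img, class_text_embeddings) == label
        for img, label in zip(image_embeddings, labels)
    )
    return correct / len(labels)


# Made-up embeddings: one text embedding per class, four images with labels.
class_text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
image_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
labels = [0, 1, 1, 0]

print(zero_shot_accuracy(image_embeddings, labels, class_text_embeddings))  # 0.75
```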
📰 In the News
- AI and Computer Vision to Detect Brain Abnormalities → Researchers at the Al Buraimi University College analyzed different features of MRI images, such as color and texture, to detect abnormalities accurately. By looking at the symmetry between the brain's lobes, they developed an algorithm with precision, recall, and accuracy rates of 95.3%, 94.7%, and 95%, respectively.
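For readers less familiar with these three metrics, the sketch below computes them from the four cells of a confusion matrix. The counts are hypothetical and are not the study's data; they only show how the numbers relate to each other.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # of the flagged scans, how many were truly abnormal
    recall = tp / (tp + fn)                     # of the truly abnormal scans, how many were flagged
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all scans classified correctly
    return precision, recall, accuracy


# Hypothetical counts for 200 scans (not taken from the study).
precision, recall, accuracy = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(f"{precision:.3f} {recall:.3f} {accuracy:.3f}")  # 0.900 0.947 0.925
```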
Here are other quick finds if you 💓 Encord and computer vision data stuff ⚡:
- Join the Encord Community to discuss this newsletter.
- Data-centric computer vision blog.
Till next month, have a super-sparkly time!
Written by
Stephen Oladele