
Meta Imagine AI Just got an Impressive GIF Update

May 10, 2024
|
8 mins

Diffusion models enable most generative AI applications today to create highly realistic and diverse images. However, their sequential denoising process leads to expensive inference times. Meta AI researchers have introduced Imagine Flash, a new distillation framework that accelerates diffusion models like Emu while maintaining high-quality, diverse image generation.

Imagine Flash enables faster image generation using just one to three denoising steps, an improvement over existing methods. The approach combines three key components: Backward Distillation, Shifted Reconstruction Loss, and Noise Correction.

In this article, you will learn about the innovative techniques behind Imagine Flash and how it achieves efficient, high-quality image generation. We will explore the challenges of accelerating diffusion models, the limitations of existing approaches, and how Imagine Flash addresses these issues to push the boundaries of generative AI.


TL;DR

The Imagine Flash distillation framework introduces three key components: 

1. Backward Distillation: Calibrates student on its own backward trajectory, reducing train-test discrepancy.

2. Shifted Reconstruction Loss: Dynamically adapts knowledge transfer from the teacher based on the timestep, distilling global structure early and fine details later.

3. Noise Correction: Fixes bias in the first step for noise prediction models, enhancing color and contrast.

- Experiments show Imagine Flash matches teacher performance in just 3 steps, outperforming prior methods like Adversarial Diffusion Distillation (ADD) and SDXL-Lightning.

- Qualitative results demonstrate improved realism, sharpness, and details compared to competitors.

- Human evaluation shows a strong preference for Imagine Flash over state-of-the-art models.

Image Generation in Emu Diffusion Models

Emu diffusion models learn to reverse a gradual noising process, allowing them to generate diverse and realistic samples. These models learn to map random noise to realistic images through an iterative denoising process. However, the sequential nature of this process leads to expensive inference times, which hinders real-time applications.

 

Recent research has focused on accelerating Emu diffusion models to enable faster image generation without compromising quality. Imagine Flash is a new distillation framework from researchers at Meta AI that generates high-fidelity and diverse images using just one to three denoising steps: 

  • Backward Distillation calibrates the student model on its own backward diffusion trajectory to reduce training-inference discrepancies.
  • Shifted Reconstruction Loss dynamically adapts knowledge transfer from the teacher model based on the timestep.
  • Noise Correction fixes a bias in the first step of noise prediction models, enhancing sample quality (image color and contrast).

The significant reduction in inference time opens up new possibilities for efficient, real-time image generation applications.

Visual demonstration of the effect of Noise Correction.
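The intuition behind Noise Correction can be sketched in a few lines. This is a toy illustration under an assumption, not Meta's implementation: at the very first step t = T the latent is pure Gaussian noise, so for an ε-prediction model the true noise equals x_T itself, and substituting it for the network's (biased) prediction removes the first-step bias.

```python
import numpy as np

def eps_model(x, t):
    # Toy noise-prediction network whose first-step estimate is biased.
    return x + 0.1

def denoise_step(x, eps_hat, alpha=0.95):
    # Simplified epsilon-parameterised update (illustrative only).
    return (x - (1.0 - alpha) * eps_hat) / np.sqrt(alpha)

def first_step(x_T, T, noise_correction=True):
    """At t = T the latent IS pure noise, so the true epsilon equals x_T.
    Noise Correction replaces the model's biased prediction with x_T."""
    eps_hat = x_T if noise_correction else eps_model(x_T, T)
    return denoise_step(x_T, eps_hat)

x_T = np.random.default_rng(0).standard_normal((4, 4))
corrected = first_step(x_T, T=1000, noise_correction=True)
uncorrected = first_step(x_T, T=1000, noise_correction=False)
```

Because the correction is applied only at inference time, it adds no training cost.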


Meta AI Imagine: Backward Distillation Technique

The core innovation within Meta AI's Imagine Flash framework is the Backward Distillation technique, which accelerates diffusion models while maintaining high image quality. The key idea is to train a smaller, faster student model to learn from a larger, more complex teacher model.

In traditional forward distillation, the student model attempts to mimic the teacher's denoising process. Training starts from a forward-noised latent code x_t, which can lead to information leakage and inconsistencies between training and inference.

Mimicking the teacher becomes especially challenging when the student takes significantly fewer denoising steps, and it can degrade sample quality, particularly for photorealistic images and complex text conditioning (generating images from a text prompt).


Images generated with the proposed model.

Backward Distillation addresses this issue by using the student model's own backward trajectory (backward iterations) to obtain the starting latent code for training. During training, the student performs denoising steps from pure noise x_T down to a latent code x_t. This latent code is then fed as input to both the student and teacher models.

From this point, the teacher takes additional denoising steps, and the student learns to match the teacher's output. This approach ensures consistency between training and inference for the student and eliminates the reliance on ground-truth signals during training.
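The training target above can be sketched as follows. The denoisers here are toy stand-ins (not the Emu networks), but the structure mirrors the description: the student generates x_t on its own backward trajectory, the teacher refines it further, and the student is trained to match that refined output.

```python
import numpy as np

rng = np.random.default_rng(0)

def student_step(x, t):
    # Toy stand-in for one student denoising step.
    return 0.9 * x

def teacher_step(x, t):
    # Toy stand-in for one teacher denoising step.
    return 0.8 * x

def backward_distillation_loss(x_T, t, teacher_steps=2):
    """Calibrate the student on its OWN backward trajectory: no
    forward-noised ground truth is used, so there is no leakage."""
    # 1. Student denoises from pure noise down to x_t.
    x_t = student_step(x_T, t)
    # 2. Teacher refines the SAME x_t with additional denoising steps.
    target = x_t
    for _ in range(teacher_steps):
        target = teacher_step(target, t)
    # 3. Student learns to match the teacher's refined output.
    loss = float(np.mean((student_step(x_t, t) - target) ** 2))
    return loss

x_T = rng.standard_normal((4, 4))  # pure-noise starting latent
loss = backward_distillation_loss(x_T, t=1)
```

Because x_t comes from the student itself, training and inference see the same input distribution.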

 

Imagine Flash: Technical Improvements

Imagine Flash introduces several technical improvements that significantly boost image generation speed and quality. Let’s take a look at some of these improvements.

Faster Image Generation with Reduced Iterations

One of the most notable improvements is the drastic reduction in the number of iterations required for high-quality image synthesis. While the baseline Emu model requires around 25 iterations, Imagine Flash achieves comparable results with just 3. Essentially, it matches the teacher's performance in just 3 steps, outperforming prior methods like Adversarial Diffusion Distillation (ADD) and SDXL-Lightning.

Imagine Flash vs. Adversarial Diffusion Distillation (ADD) and SDXL-Lightning.

This has led to real-time and substantially faster image generation without compromising the quality of the images.

Imagine Flash significantly reduces the baseline's inference time while generating high-quality, complex images.
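To make the speed-up concrete, here is a minimal sketch (with a toy denoiser standing in for the expensive network forward pass) counting how many evaluations a 25-step baseline spends versus a 3-step distilled sampler:

```python
import numpy as np

calls = {"baseline": 0, "distilled": 0}

def denoiser(x, name):
    # Toy denoiser; each call stands in for one network forward pass.
    calls[name] += 1
    return 0.9 * x

def sample(num_steps, name, shape=(8, 8), seed=0):
    x = np.random.default_rng(seed).standard_normal(shape)
    for _ in range(num_steps):
        x = denoiser(x, name)
    return x

sample(25, "baseline")   # baseline Emu: ~25 denoising iterations
sample(3, "distilled")   # Imagine Flash: 3 iterations
# The distilled sampler needs roughly 8x fewer network evaluations per image.
```

Since each network evaluation dominates inference cost, the step count translates almost directly into wall-clock latency.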

Extended Context Capability: From 8K to 128K

Imagine Flash can handle an extended context from 8K to 128K, allowing for more detailed and complex image generations. This expanded context capacity enables the model to capture more details and nuances, making it particularly effective for generating large-scale and high-resolution images.

Reduced Computational Cost with Fast-sampling Approach

Runs at 800+ Tokens per Second

Imagine Flash reduces the computational cost and processing power required for image generation without compromising the output quality. It can generate images at over 800 tokens per second using a fast-sampling approach that optimizes hardware usage.

The reduced computational requirements also contribute to a notable decrease in processing time. This improvement is particularly valuable for resource-constrained devices and scenarios for real-time applications, on-the-fly editing, and interactive systems.

Advanced Training Techniques

The Imagine Flash framework's three components (Backward Distillation, Shifted Reconstruction Loss, and Noise Correction) are paired with fine-tuning techniques that refine the model's ability to produce more accurate and visually appealing images. 

These training techniques improve the model's understanding and execution, enabling it to specialize in specific domains or styles. They also improve versatility and applicability across different use cases.

Benefits of Imagine Flash

Real-time Image Generation Previews and Creative Exploration 

One key benefit is generating real-time image previews, enabling rapid iteration and feedback. Meta is deploying this to its products, including WhatsApp and Meta AI.

Imagine Flash on WhatsApp: Meta AI.

This real-time capability empowers artists and designers to creatively explore and experiment with various graphics and visual concepts for innovation in generative AI.

Performance Comparison: Meta’s Imagine Flash vs. Stability AI’s SDXL-Lightning

Meta's Imagine Flash and Stability AI's SDXL-Lightning are state-of-the-art few-shot diffusion models. While both achieve impressive results, their performance has some key differences. Imagine Flash retains a similar text alignment capacity to SDXL-Lightning but shows more favorable FID scores, especially for two and three steps. 

Imagine Flash vs. public SOTA - Quantitative.

Human evaluations also demonstrate a clear preference for Imagine Flash over SDXL-Lightning, with Imagine Flash winning in 60.6% of comparisons. However, it's important to note that these models start from different base models, so comparisons should be interpreted cautiously.

Nonetheless, Imagine Flash's superior performance in key metrics and human evaluations highlights its effectiveness in efficient, high-quality image generation.

Imagine Flash’s Image Creation Process: How does it work?

Imagine Flash’s image creation process begins with a user-provided text prompt describing the desired image. As an exciting feature upgrade, Imagine AI offers a live preview, allowing users to see the generated image in real-time. This interactive experience enables users to refine their prompts on the fly.

Once the prompt is finalized, Imagine AI's advanced diffusion models, accelerated by the novel Imagine Flash framework, generate the image in just a few steps. The result is a high-quality, diverse image that closely aligns with the given prompt.

As another feature upgrade, Imagine AI introduces animation capabilities, bringing static images to life with fluid movements and transitions. This addition opens up new possibilities for creative expression and storytelling.

Meta’s Imagine AI Image Generator: Feature Updates

Meta's Imagine AI has introduced several exciting feature updates to enhance the user experience and creative possibilities. One notable addition is the live generation feature, which allows users to witness the image creation process in real-time as the model iteratively refines the output based on the provided text prompt.

Another significant update is the ability to create animated GIFs. Users can now convert their generated static images into short, looping animations with just a few clicks. This feature opens up new avenues for creative expression and adds an extra dimension of dynamism to the generated visuals.

These updates demonstrate Meta's commitment to continuously improving Imagine AI and providing users with a more engaging and versatile image-generation tool.

Meta’s Imagine AI: Limitations

Image Resolution

The generated images are currently restricted to a square format, which may not suit all use cases. Additionally, the model relies solely on text-based prompts, limiting users' input options.

Key Takeaways: Imagine Flash Real-Time AI Image Generator

In this work, the researchers introduced Imagine Flash, a distillation framework that enables high-fidelity, few-step image generation with diffusion models.

Their approach combines three key components:

  • Backward Distillation, which reduces discrepancies between training and inference by calibrating the student model on its own backward trajectory.
  • Shifted Reconstruction Loss (SRL), which dynamically adapts knowledge transfer from the teacher model based on the current time step, focusing on global structure early on and fine details later.
  • Noise Correction, an inference-time technique that enhances sample quality by addressing singularities in noise prediction during the initial sampling step.
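The SRL idea can be sketched with a hypothetical linear schedule; the box blur standing in for "global structure" and the weighting function are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def lowpass(img):
    # Crude 3x3 box blur as a stand-in for a global-structure projection.
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def shifted_reconstruction_loss(student_out, teacher_out, t, T):
    """Early in denoising (t near T), weight global structure; late
    (t near 0), weight fine detail. Linear schedule is hypothetical."""
    w_structure = t / T
    w_detail = 1.0 - w_structure
    structure = np.mean((lowpass(student_out) - lowpass(teacher_out)) ** 2)
    detail = np.mean((student_out - teacher_out) ** 2)
    return w_structure * structure + w_detail * detail

rng = np.random.default_rng(0)
s, te = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
early = shifted_reconstruction_loss(s, te, t=1000, T=1000)  # structure only
late = shifted_reconstruction_loss(s, te, t=0, T=1000)      # detail only
```

Shifting the loss this way lets a few-step student spend its limited capacity on what matters at each stage of generation.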


Through extensive experiments and human evaluations, they demonstrated that Imagine Flash outperforms existing methods in both quantitative metrics and perceptual quality. Remarkably, it achieves performance comparable to the teacher model using only three denoising steps, enabling efficient, high-quality image generation.

It's important to note that Imagine AI is still in beta testing, and some features may be subject to change or improvement. For example, animation capabilities are currently limited to generating short GIFs with a reduced frame count compared to full-length videos.

Written by Stephen Oladele
Frequently asked questions
  • The GIF upgrade in Imagine AI allows converting generated images into short animated GIFs with improved resolution and quality compared to typical GIF exports. It produces a series of frames that enable smoother animations.

  • Imagine AI is Meta's text-to-image generator that creates images based on textual descriptions. It is built on Meta's Emu image model and trained on over 1.1 billion public images from Facebook and Instagram.

  • Users input text descriptions of the desired image. Imagine AI processes this and generates four thumbnail image options based on the prompt. Users can then customize and edit the generated images further.

  • Key new features include converting a generated image into an animated GIF, live image generation that updates the image in real-time as the user types, and an efficient built-in image editing interface.

  • Benefits include quickly generating customized images based on text prompts, the ability to produce various styles from photorealistic to artistic, fast performance, and user-friendly editing capabilities.

  • The generated GIFs have a limited number of frames rather than full videos, so there are some length and file size constraints. Specific processing time and file size limits are not provided.

  • Some limitations include generated images having a watermark, fixed 1024x1024 resolution currently, only available in certain regions, and still being in a beta phase compared to other AI image generation tools.