
Understanding Meta’s Movie Gen Bench: New Generative AI Models for Video and Audio

October 21, 2024
5 mins

Generative AI has made incredible strides in producing high-quality images and text, with models like GPT-4, DALL-E, and Stable Diffusion dominating their respective domains. However, generating high-quality videos with synchronized audio remains one of the most challenging frontiers in artificial intelligence. Meta AI’s Movie Gen offers a comprehensive solution with its foundation models. Alongside these groundbreaking models comes the Movie Gen Bench, a detailed benchmark designed to assess the capabilities of models in AI video and audio generation.

In this blog, we will explore the ins and outs of Meta AI’s Movie Gen Bench, including the Movie Gen foundation models, the benchmarks themselves, and how these tools enable fair comparison of generative models on video and audio tasks.

What is Movie Gen?

Meta’s Movie Gen is an advanced set of generative AI models that can produce high-definition videos and synchronized audio, representing a significant leap in generative media. These models excel in various tasks, including text-to-video synthesis, video personalization based on individual facial features, and precise video editing. By leveraging a transformer-based architecture and training on extensive datasets, Movie Gen achieves high-quality outputs that outperform prior state-of-the-art systems, including OpenAI’s Sora, LumaLabs, and Runway Gen3, while offering capabilities not found in current commercial video generation systems.

These capabilities can be found in two foundation models: 

Movie Gen Video

The Movie Gen Video model is a 30 billion parameter model trained on a vast collection of images and videos. It generates HD videos of up to 16 seconds at 16 frames per second, handling complex prompts like human activities, dynamic environments, and even unusual or fantastical subjects. Movie Gen Video uses a simplified architecture, incorporating Flow Matching to generate videos with realistic motion, camera angles, and physical interactions. This allows it to handle a range of video content, from fluid dynamics to human emotions and facial expressions.
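To make the Flow Matching idea concrete, here is a minimal PyTorch sketch of the training objective: the model learns to predict the velocity that transports a noise sample toward a clean sample along a simple interpolation path. The latent shapes and the `model` interface are placeholders; Movie Gen actually operates on compressed latents from a Temporal Autoencoder and conditions on text embeddings, which are omitted here.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, sigma_min=1e-5):
    """One flow-matching training step on a batch of clean video latents.

    x1: clean latents, shape (B, C, T, H, W); model(x_t, t) is assumed
    to predict the velocity field. Text conditioning is omitted here.
    """
    x0 = torch.randn_like(x1)                      # Gaussian noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform timestep per sample
    t_ = t.view(-1, 1, 1, 1, 1)
    # Interpolate along the conditional path from noise toward data
    x_t = t_ * x1 + (1.0 - (1.0 - sigma_min) * t_) * x0
    # Ground-truth velocity of that path, d(x_t)/dt
    v_target = x1 - (1.0 - sigma_min) * x0
    return F.mse_loss(model(x_t, t), v_target)
```

At inference time, generation amounts to integrating the learned velocity field from pure noise to a clean latent with an ODE solver, which is part of why Flow Matching is attractive for efficiency.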

One of the standout features of Movie Gen Video is its capability to personalize videos using a reference image of a specific individual, generating new content that maintains visual consistency with the provided image. This enables applications in personalized media creation, such as marketing or interactive experiences. It also includes instruction-guided video editing, which allows precise changes to both real and generated videos, offering a user-friendly approach to video editing through text prompts.

Movie Gen Audio

The Movie Gen Audio model focuses on generating cinematic sound effects and background music synchronized with video content. With 13 billion parameters, the model can create high-quality sound at 48 kHz and handle long audio sequences by extending audio in synchronization with the corresponding visual content.
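To illustrate the bookkeeping behind extending audio across a long video, the sketch below plans overlapping generation windows at 48 kHz, with each window nominally conditioned on the audio already produced. The window and overlap lengths here are illustrative assumptions, not Meta’s actual values.

```python
import math

SAMPLE_RATE = 48_000  # Movie Gen Audio generates 48 kHz audio

def plan_audio_segments(video_seconds: float, window_s: float = 10.0,
                        overlap_s: float = 2.0):
    """Plan overlapping generation windows for long-form audio extension.

    Returns (start_sample, end_sample) pairs at 48 kHz. Window and
    overlap sizes are illustrative placeholders.
    """
    step = window_s - overlap_s
    n_windows = max(1, math.ceil((video_seconds - overlap_s) / step))
    segments = []
    for i in range(n_windows):
        start = i * step
        end = min(start + window_s, video_seconds)
        segments.append((int(start * SAMPLE_RATE), int(end * SAMPLE_RATE)))
    return segments

# e.g. a 30-second video yields four overlapping windows:
# (0, 10s), (8s, 18s), (16s, 26s), (24s, 30s) in sample indices
print(plan_audio_segments(30.0))
```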

Movie Gen Audio not only generates diegetic sounds, like footsteps or environmental noises, but also non-diegetic music that supports the mood of the video. It can handle both sound effect and music generation, blending them seamlessly to create a professional-level audiovisual experience. This model can also generate video-to-audio content, meaning it can produce soundtracks based on the actions and scenes within a video. This capability expands its use in film production, game design, and interactive media.

Together, Movie Gen Video and Audio provide a comprehensive set of tools for media generation, pushing the boundaries of what generative AI can achieve in both video and audio production.

Movie Gen Bench

While Movie Gen's technical capabilities are impressive, equally important is the Movie Gen Bench, which Meta introduced as an evaluation suite for future generative models. Movie Gen Bench consists of two main components: Movie Gen Video Bench and Movie Gen Audio Bench, both designed to test different aspects of video and audio generation.

These new benchmarks allow researchers and developers to evaluate the performance of generative models across a broad spectrum of scenarios. By providing a structured, standardized framework, Movie Gen Bench enables consistent comparisons across models, ensuring fair and objective assessment of performance.

Movie Gen Video Bench

Movie Gen Video Bench consists of 1003 prompts covering a wide variety of testing categories:

  • Human Activity: Captures complex motions such as limb movements and facial expressions.
  • Animals: Includes realistic depictions of animals in various environments.
  • Nature and Scenery: Tests the model's ability to generate natural landscapes and dynamic environmental scenes.
  • Physics: Focuses on fluid dynamics, gravity, and other elements of physical interaction within the environment, such as explosions or collisions.
  • Unusual Subjects and Activities: Pushes the boundaries of generative AI by prompting the model with fantastical or rare scenarios.

What makes this evaluation benchmark especially valuable is its thorough coverage of motion levels, spanning high-, medium-, and low-motion categories. This ensures the video generation model is tested on both fast-paced sequences and slower, more deliberate actions, which are often challenging to handle in text-to-video generation.
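If you want to inspect how the 1003 prompts are distributed across these categories and motion levels, a few lines of pandas are enough. The file name and column names below are hypothetical; check the released prompt files in the Movie Gen Bench repository for the actual schema.

```python
import pandas as pd

# Hypothetical file and column names -- confirm the real schema in the
# Movie Gen Bench repository before running this.
prompts = pd.read_csv("movie_gen_video_bench_prompts.csv")

# Distribution of prompts across testing categories
print(prompts["category"].value_counts())

# Cross-tabulate category against motion level (high / medium / low)
print(pd.crosstab(prompts["category"], prompts["motion_level"]))
```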

Figure: summary visualisation of evaluation prompts in Movie Gen Video Bench (from the paper “Movie Gen: A Cast of Media Foundation Models”)

Movie Gen Audio Bench

Movie Gen Audio Bench is designed to test audio generation capabilities across 527 prompts. These prompts are paired with generated videos to assess how well the model can synchronize sound effects and music with visual content. The benchmark covers various soundscapes, including indoor, urban, nature, and transportation environments, as well as different types of sound effects—ranging from human and animal sounds to object-related noises.

Audio generation is particularly challenging, especially when it involves synchronizing sound effects with complex motion or creating background music that matches the mood of the scene. The Audio Bench enables evaluation in two critical areas:

  • Sound Effect Generation: Testing the model’s ability to generate realistic sound effects that match the visual action.
  • Joint Sound Effect and Background Music Generation: Assessing the generation of both sound effects and background music in harmony with one another.

How Movie Gen Bench Enables Fair Comparisons

A standout feature of Meta's Movie Gen Bench is its ability to support fair and transparent comparisons of generative models across different tasks. Many models showcase their best results, but Movie Gen Bench commits to a more rigorous approach by releasing non-cherry-picked data—ensuring all generated outputs are available for evaluation, not just select high-quality samples. This transparency prevents misleading or biased evaluations.

Moreover, Meta provides access to the generated videos and corresponding prompts for both the Video Bench and Audio Bench. This allows developers and researchers to test their models against the same dataset, ensuring that any improvements or shortcomings are accurately reflected and not obscured by variations in data quality or prompt difficulty.

Movie Gen Bench Evaluation and Metrics

Evaluating generative video and audio models is complex, particularly because it involves the added temporal dimension and synchronization between multiple modalities. Movie Gen Bench uses a combination of human evaluations and automated metrics to rigorously assess the quality of its models.

Human Evaluation

For generative video and audio models, human evaluations are considered essential due to the subjective nature of concepts like realism, visual quality, and aesthetic appeal. Automated metrics alone often fail to capture the nuances of visual or auditory content, especially when dealing with motion consistency, fluidity, and sound synchronization. Human evaluators are tasked with assessing generated content on three core axes:

  • Text Alignment: Evaluates how well the generated video or audio matches the provided text prompt. This includes two key aspects: Subject Match, which examines whether the scene’s appearance, background, and lighting align with the prompt, and Motion Match, which verifies that the actions of objects and characters correspond to the described actions.
  • Visual Quality: Focuses on the motion quality and frame consistency of the video. It includes Frame Consistency, ensuring stability of objects and environments across frames; Motion Completeness, which assesses whether there is enough motion for dynamic prompts; Motion Naturalness, evaluating the realism of movements, particularly in human actions; and an Overall Quality judgment that weighs motion and consistency together.
  • Realness and Aesthetics: Addresses the realism and aesthetic appeal of the generated content. This includes Realness, which measures how convincingly the video mimics real-life footage or cinematic scenes, and Aesthetics, which evaluates visual appeal through factors like composition, lighting, color schemes, and overall style.
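Human judgments along these axes are typically collected as pairwise A/B comparisons against a baseline model and summarized per axis as a net win rate, the win percentage minus the loss percentage. The sketch below assumes that convention; the exact aggregation Meta uses is described in the Movie Gen paper.

```python
from collections import Counter

def net_win_rate(judgments):
    """Aggregate pairwise A/B human judgments into a net win rate.

    judgments: iterable of "win" / "tie" / "loss" verdicts for model A
    against a baseline on one axis (e.g. Text Alignment).
    Returns win% - loss%, in [-100, 100]; positive means A is preferred.
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    return 100.0 * (counts["win"] - counts["loss"]) / total

# e.g. 540 wins, 260 ties, 203 losses over 1003 comparisons -> ~33.6
print(net_win_rate(["win"] * 540 + ["tie"] * 260 + ["loss"] * 203))
```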

Automated Metrics

Although human evaluation is crucial, automated metrics are used to complement these assessments. In generative video models, metrics such as Fréchet Video Distance (FVD) and Inception Score (IS) have been commonly applied. However, Meta acknowledges the limitations of these metrics, particularly their weak correlation with human judgment in video generation tasks.

  • Fréchet Video Distance (FVD): FVD measures how close a generated video is to real videos in terms of distribution, but it often falls short when evaluating motion and visual consistency across frames. While useful for large-scale comparisons, FVD lacks sensitivity to subtleties in realness and text alignment.
  • Inception Score (IS): IS evaluates image generation models by measuring the diversity and quality of images, but it struggles with temporal coherence in video generation, making it a less reliable metric for this task.
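For the curious, FVD is the Fréchet distance between two Gaussians fitted to features of real and generated clips, where the features typically come from a pretrained video classifier such as I3D. The sketch below computes that distance from precomputed feature statistics; extracting the features themselves is the expensive part and is omitted here.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians fitted to video features.

    mu/sigma: mean vector and covariance matrix of features (e.g. from
    a pretrained I3D network) over real vs. generated video clips.
    """
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)  # matrix square root of the product
    if np.iscomplexobj(covmean):      # numerical noise can introduce a
        covmean = covmean.real        # tiny imaginary component
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```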

Necessity of Human Evaluations

One of the main findings in evaluating generative video models is that human judgment remains the most reliable and nuanced way to assess video quality. Temporal coherence, motion quality, and alignment to complex prompts require a level of understanding that automated metrics have yet to fully capture. For this reason, human evaluations are a significant component of Movie Gen Bench’s evaluation system, ensuring that feedback is as reliable and comprehensive as possible.

 

Movie Gen Bench Use Cases

The Movie Gen Bench is an important tool for advancing research in generative AI, particularly in the media generation domain. With its release, future models will be held to a higher standard, requiring them to perform well across a broad range of categories, not just in isolated cases.

 

Potential applications of the Movie Gen models and benchmarks include:

  • Film and Animation: Generating realistic video sequences and synchronized audio for movies, short films, or animations.
  • Video Editing Tools: Enabling precise video edits through simple textual instructions.
  • Personalized Video: Creating customized videos based on user input, whether for personal use or marketing.
  • Interactive Storytelling: Generating immersive video experiences for virtual or augmented reality platforms.

By providing a benchmark for both video and audio generation, Meta is paving the way for a new era of AI-generated media that is more realistic, coherent, and customizable than ever before.

Accessing Movie Gen Bench

To access Movie Gen Bench, interested users and researchers can find the resources on its official GitHub repository. This repository includes essential components for evaluating the Movie Gen models, specifically the Movie Gen Video Bench and Movie Gen Audio Bench.

The Movie Gen Video Bench is also available on Hugging Face.
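A quick way to pull the prompts and reference videos locally is via the `datasets` library. Note that the dataset identifier below is an assumption for illustration; confirm the exact name on the official GitHub repository or the Hugging Face page before using it.

```python
from datasets import load_dataset

# Hypothetical dataset ID -- check the official repository or the
# Hugging Face page for the exact identifier before running this.
video_bench = load_dataset("meta-ai/MovieGenVideoBench")

print(video_bench)              # split and feature summary
print(video_bench["train"][0])  # first prompt/record (split name assumed)
```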

Movie Gen Bench: Key Takeaways

  • Comprehensive Evaluation Framework: Meta’s Movie Gen Bench sets a new standard for evaluating generative AI models in video and audio, emphasizing fair and rigorous assessments through comprehensive benchmarks.
  • Advanced Capabilities: The Movie Gen models feature a 30-billion-parameter model for video and a 13-billion-parameter model for audio. By using Flow Matching for efficient training and a Temporal Autoencoder (TAE) for capturing temporal dynamics, these models generate realistic videos that closely align with text prompts, marking significant advancements in visual fidelity and coherence.
  • Robust Evaluation Metrics: Human evaluations, combined with automated metrics, provide a nuanced understanding of generated content, ensuring thorough assessments of text alignment, visual quality, and aesthetic appeal.
Written by Justin Sharps