Apple’s MM1.5 Explained

October 7, 2024
5 mins

The MM1 model was an impressive milestone in the development of multimodal large language models (MLLMs), demonstrating robust multimodal capabilities with a general-purpose architecture. However, as demand grew for more specialized tasks, so did the need for multimodal models that could scale efficiently while excelling at fine-grained image and text understanding. Enter MM1.5, a comprehensive upgrade.

MM1.5 scales from smaller model sizes (1B and 3B parameters) up to 30B parameters, introducing both dense and mixture-of-experts (MoE) variants. While MM1 was successful, MM1.5’s architecture and training improvements allow it to perform better even at smaller scales, making it highly efficient and versatile for deployment across a variety of devices.

But the real story of MM1.5 is its data-centric approach. By carefully curating data throughout the training lifecycle, MM1.5 improves in areas like OCR (optical character recognition), image comprehension, image captioning, and video processing, making it a strong fit for both general and niche multimodal tasks.

Key Features of MM1.5

OCR and Text-Rich Image Understanding

MM1.5 builds on the latest trends in high-resolution image comprehension, supporting arbitrary image aspect ratios and resolutions of up to 4 megapixels. This flexibility allows it to extract and understand text embedded in images with high fidelity.

The model leverages a carefully selected dataset of OCR data throughout the training process to ensure it can handle a wide range of text-rich images. By including both public OCR datasets and high-quality synthetic captions, MM1.5 optimizes its ability to read and interpret complex visual text inputs, something that remains a challenge for many open-source models.
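
The dynamic image splitting MM1.5 relies on is only described at a high level here, so the snippet below is a minimal sketch of the general idea: pick a sub-image grid that roughly matches the input's aspect ratio, crop fixed-size tiles for the vision encoder, and keep a low-resolution overview alongside them. The tile size, tile budget, and grid-selection rule are placeholder assumptions, not MM1.5's actual settings.

```python
from PIL import Image

TILE = 448       # assumed encoder input size; MM1.5's real value may differ
MAX_TILES = 16   # assumed budget so a ~4 MP image stays manageable

def pick_grid(width, height, max_tiles=MAX_TILES):
    """Choose a (cols, rows) grid whose aspect ratio best matches the image."""
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue
            diff = abs(cols / rows - width / height)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best

def split_image(img):
    """Return a low-res overview plus fixed-size tiles covering the full image."""
    cols, rows = pick_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
    return [img.resize((TILE, TILE))] + tiles

# Example: a 2048x1536 (~3 MP) document photo becomes 1 overview + 12 tiles.
print(len(split_image(Image.new("RGB", (2048, 1536)))))
```

Each sub-image would then be encoded separately and its features joined before reaching the language model, which is how arbitrary resolutions can be served by a fixed-size encoder.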

Figure from MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-Tuning

Visual Referring and Grounding

Another major upgrade in MM1.5 is its capability for visual referring and grounding. This involves not just recognizing objects or regions in an image but linking them to specific references in the text. For instance, when presented with an image and a set of instructions like "click the red button," MM1.5 can accurately locate and highlight the specific button within the image.
This fine-grained image understanding is achieved by training the model to handle both text prompts and visual inputs, such as bounding boxes or points in an image. It can generate grounded responses by referencing specific regions in the visual data.
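
The post doesn't show the exact prompt format, but one common way to make regions readable by a language model is to normalize box coordinates and write them into the text as plain tokens. The sketch below is illustrative only: the bracketed [x1,y1,x2,y2] format and 0-999 binning are assumptions, not MM1.5's documented scheme.

```python
def box_to_tokens(box, image_size, bins=1000):
    """Turn an (x1, y1, x2, y2) pixel box into normalized integer coordinate text."""
    w, h = image_size
    x1, y1, x2, y2 = box
    norm = [int(x1 / w * (bins - 1)), int(y1 / h * (bins - 1)),
            int(x2 / w * (bins - 1)), int(y2 / h * (bins - 1))]
    return "[{},{},{},{}]".format(*norm)

# Referring: the user (or an upstream detector) supplies a region to ask about.
region = box_to_tokens((120, 40, 260, 110), image_size=(1170, 2532))
prompt = f"<image> What does the element at {region} do?"
print(prompt)

# Grounding: the model's answer can itself contain coordinate tokens,
# e.g. "The red button is at [412,880,560,940]", which the application
# parses back into pixel coordinates to highlight on screen.
```

Because the coordinates are ordinary text in this style of setup, the same decoder that writes captions can also write boxes, with no separate detection head required.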

Figure from MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-Tuning

While multimodal AI models like OpenAI’s GPT-4, LLaVA-OneVision, and Microsoft’s Phi-3-Vision rely on predefined prompts to refer to image regions, MM1.5 takes a more flexible approach by dynamically linking text output to image regions during inference. This is particularly useful for applications like interactive image editing, mobile UI analysis, and augmented reality (AR) experiences.

Multi-image Reasoning and In-Context Learning

MM1.5 is designed to handle multiple images simultaneously and reason across them. This capability, known as multi-image reasoning, allows the model to perform tasks that involve comparing, contrasting, or synthesizing information from several visual inputs.

For example, given two images of a city skyline taken at different times of the day, MM1.5 can describe the differences, reason about lighting changes, and even anticipate how the scene is likely to change next. This ability is a result of large-scale interleaved pre-training, where the model learns to process sequences of multiple images and corresponding text data, ultimately improving its in-context learning performance.
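
As a concrete (and simplified) picture of what an interleaved sequence looks like, the sketch below assembles a few-shot prompt from alternating image placeholders and text. The `<image>` marker and the surrounding template are assumptions standing in for the visual tokens the encoder would actually produce.

```python
from dataclasses import dataclass

@dataclass
class Shot:
    image_path: str   # replaced by visual tokens from the image encoder at runtime
    text: str

def build_interleaved_prompt(shots, query_image, question):
    """Interleave few-shot (image, text) pairs with a final query image."""
    parts = [f"<image> {shot.text}" for shot in shots]
    parts.append(f"<image> {question}")
    images = [shot.image_path for shot in shots] + [query_image]
    return "\n".join(parts), images

shots = [
    Shot("skyline_noon.jpg", "The skyline at midday under flat, even light."),
    Shot("skyline_dusk.jpg", "The same skyline at dusk, backlit with warm tones."),
]
prompt, images = build_interleaved_prompt(
    shots, "skyline_night.jpg", "How does the lighting differ from the examples above?"
)
print(prompt)
```

Pre-training on documents that already alternate images and text in this way is what lets the model treat earlier image-text pairs as in-context examples at inference time.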

Figure from MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-Tuning

MM1.5 also demonstrates strong few-shot learning capabilities, meaning it can perform well on new tasks with minimal training data. This makes it adaptable to a wide range of applications, from social media content moderation to medical imaging analysis.

MM1.5 Model Architecture and Variants

MM1.5 retains the same core architecture as MM1 but with improvements in scalability and specialization. The models range from 1B to 30B parameters, with both dense and mixture-of-experts (MoE) versions. Dense models, available in 1B and 3B sizes, are particularly optimized for deployment on mobile devices while maintaining strong performance on multimodal tasks.


The key components of MM1.5 include a CLIP-based image encoder that processes high-resolution images (up to 4 megapixels) and uses dynamic image splitting for efficient handling of various aspect ratios. The vision-language connector (C-Abstractor) integrates visual and textual data, ensuring smooth alignment between the two modalities. Finally, the decoder-only Transformer architecture serves as the backbone, processing multimodal inputs and generating language outputs.
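
To make the data flow concrete, here is a minimal, hedged sketch of how those three pieces fit together. The dimensions, token counts, and the simple pooling-plus-projection stand-in for the C-Abstractor are illustrative assumptions, not MM1.5's actual configuration.

```python
import torch
import torch.nn as nn

class ConnectorSketch(nn.Module):
    """Toy stand-in for the C-Abstractor: pool the encoder's patch features
    down to a fixed number of visual tokens, then project them into the
    language model's embedding space."""
    def __init__(self, vision_dim=1024, lm_dim=2048, num_tokens=144):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, patch_feats):                      # (B, num_patches, vision_dim)
        pooled = self.pool(patch_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)                         # (B, num_tokens, lm_dim)

B, num_patches, vision_dim, lm_dim = 1, 576, 1024, 2048
patch_feats = torch.randn(B, num_patches, vision_dim)     # CLIP-style image encoder output
visual_tokens = ConnectorSketch(vision_dim, lm_dim)(patch_feats)

text_embeds = torch.randn(B, 32, lm_dim)                  # embedded prompt tokens
lm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
# `lm_inputs` is the sequence the decoder-only Transformer backbone would consume
# to autoregressively generate the text response.
print(lm_inputs.shape)                                    # torch.Size([1, 176, 2048])
```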

Figure from MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-Tuning

For more specialized applications, MM1.5 offers two variants:

  • MM1.5-Video: Designed for video understanding, this variant can process video either through training-free methods that reuse the image-trained model or through supervised fine-tuning on video-specific datasets (a minimal frame-sampling sketch follows this list). This makes MM1.5-Video ideal for applications like video surveillance, action recognition, and media analysis.
  • MM1.5-UI: Tailored for understanding mobile user interfaces (UIs), MM1.5-UI excels at tasks like app screen comprehension, button detection, and visual grounding in mobile environments. With the increasing complexity of mobile apps, MM1.5-UI offers a robust solution for developers seeking to enhance app interaction and accessibility features.
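
A simple way to picture the training-free route for MM1.5-Video: sample a handful of frames so the clip becomes an ordinary multi-image input for the model. In the sketch below, the frame budget and uniform sampling rule are placeholder assumptions, not the variant's actual configuration.

```python
import numpy as np

def sample_frames(total_frames, budget=8):
    """Uniformly pick `budget` frame indices so a video becomes a short image sequence."""
    idx = np.linspace(0, total_frames - 1, budget)
    return idx.round().astype(int).tolist()

# A 10-second clip at 30 fps (300 frames) reduced to 8 representative frames.
frames = sample_frames(300, budget=8)
print(frames)   # [0, 43, 85, 128, 171, 214, 256, 299]

# The sampled frames are then fed exactly like a multi-image prompt.
prompt = " ".join("<image>" for _ in frames) + " What action is being performed in this clip?"
```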

To learn more about the architecture of MM1, read the paper MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training, or for a brief overview, read the blog MM1: Apple’s Multimodal Large Language Models (MLLMs).
 

MM1.5 Training Method: Data-Centric Approach

One of the key differentiators of MM1.5 from its predecessor is its data-centric training approach. This strategy focuses on carefully curating data mixtures at every stage of training to optimize the model’s performance across a variety of tasks. The training pipeline consists of three major stages: large-scale pre-training, continual pre-training, and supervised fine-tuning (SFT).

Figure from MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-Tuning

Large-Scale Pre-Training

At this stage, MM1.5 is trained on massive datasets consisting of 2 billion image-text pairs, interleaved image-text documents, and text-only data. By optimizing the ratio of these data types (with a heavier focus on text-only data), MM1.5 improves its language understanding and multimodal reasoning capabilities.
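
The post doesn't state the exact mixing ratio, so the weights in the sketch below are placeholders; the point is the mechanism, with each training example drawn from one of the three data streams according to a fixed mixture.

```python
import itertools
import random

# Placeholder mixture weights -- illustrative only, not MM1.5's actual ratios.
MIXTURE = {
    "image_text_pairs": 0.4,
    "interleaved_docs": 0.4,
    "text_only": 0.2,
}

def sample_batch(streams, batch_size=8):
    """Draw each example from one of the data streams according to MIXTURE."""
    names, weights = zip(*MIXTURE.items())
    batch = []
    for _ in range(batch_size):
        source = random.choices(names, weights=weights, k=1)[0]
        batch.append(next(streams[source]))
    return batch

# Toy infinite streams standing in for the real data loaders.
streams = {name: itertools.cycle([f"<{name} example>"]) for name in MIXTURE}
print(sample_batch(streams))
```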

Continual Pre-Training

This stage introduces high-resolution OCR data and synthetic captions to further enhance MM1.5’s ability to interpret text-rich images. Through ablation studies, the MM1.5 team discovered that combining OCR data with carefully generated captions boosts the model’s performance on tasks like document understanding and infographic analysis.

Supervised Fine-Tuning (SFT)

Finally, MM1.5 undergoes supervised fine-tuning using a diverse mixture of public datasets tailored to different multimodal capabilities. These datasets are categorized into groups like general, text-rich, visual referring, and multi-image data, ensuring balanced performance across all tasks. Through dynamic high-resolution image encoding, MM1.5 efficiently processes images of varying sizes and aspect ratios, further enhancing its ability to handle real-world inputs.

For more information, read the research paper published by Apple researchers on arXiv: MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-Tuning.
 

Performance of MM1.5

MM1.5 demonstrates outstanding performance across several benchmarks, particularly excelling in text-rich image understanding and visual referring and grounding tasks. With high-quality OCR data and a dynamic image-splitting mechanism, it outperforms many models in tasks like DocVQA and InfoVQA, which involve complex document and infographic comprehension.


Its multi-image reasoning capability allows MM1.5 to compare and analyze multiple images, performing well on tasks like NLVR2. In visual referring tasks, such as RefCOCO and Flickr30k, MM1.5 accurately grounds text to specific image regions, making it highly effective for real-world applications like AR and interactive media.

Figure from MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-Tuning

MM1.5 achieves competitive performance even with smaller model scales (1B and 3B parameters), which enhances its suitability for resource-constrained environments without sacrificing capability.

Real-World Applications of MM1.5

The model’s ability to understand and reason with text-rich images is particularly strong, making it an excellent choice for document analysis, web page comprehension, and educational tools.

MM1.5’s multi-image reasoning and video understanding capabilities open up exciting possibilities in fields like media production, entertainment, and remote sensing. The model can also be deployed in mobile applications, offering developers powerful tools to improve user interfaces, enhance accessibility, and streamline interaction with digital content.

MM1.5: Key Takeaways

  • Enhanced Multimodal Capabilities: MM1.5 excels in text-rich image understanding, visual grounding, and multi-image reasoning.
  • Scalable Model Variants: Offers dense and mixture-of-experts models from 1B to 30B parameters, with strong performance even at smaller scales.
  • Data-Centric Training: Uses optimized OCR and synthetic caption data for continual pre-training, boosting accuracy in text-heavy tasks.
  • Specialized Variants: Includes MM1.5-Video for video understanding and MM1.5-UI for mobile UI analysis, making it adaptable for various use cases.
  • Efficient and Versatile: Performs competitively across a wide range of benchmarks, suitable for diverse applications from document processing to augmented reality.

Written by Akruti Acharya
