
Pixtral Large Explained

November 20, 2024
5 mins

Mistral AI first gained attention with Mistral 7B, an open-source large language model (LLM) that outperformed models from industry giants like OpenAI and Meta. Now, with Pixtral Large, the French AI startup takes a major step forward.

This blog explores the architectural innovations, benchmark results, and applications of Pixtral Large.

What is Pixtral Large?

Pixtral Large is a 124B-parameter multimodal language model that combines a 123B-parameter text decoder with a 1B-parameter vision encoder. It is built on the foundation of Mistral Large 2, a leading text-only model, while introducing robust capabilities for understanding and reasoning over both text and visual data.
With a context window of 128,000 tokens, Pixtral Large can process up to 30 high-resolution images per input, roughly the equivalent of a 300-page book, putting it on par with leading OpenAI GPT-series models.
It follows its predecessor, Mistral Large 2, unveiled in the summer of 2024, as well as Mistral's first multimodal model, Pixtral 12B, released that September.

Key Features of Pixtral Large

Large Context Window

Pixtral Large supports a context window of 128K tokens, allowing it to process a substantial amount of data in a single inference. This capability helps researchers analyze documents containing both text and high-resolution images without segmenting inputs.

Multi-Resolution Vision Processing

The vision encoder processes images of varying resolutions, providing flexibility for latency-sensitive tasks and high-precision analysis.

Unified Evaluation Protocols

Mistral AI has developed MM-MT-Bench, a multimodal benchmark designed to standardize evaluation for vision-language models. This open-source resource provides consistent comparisons across different AI models and tasks.

Instruction-Tuned Multimodal Reasoning

Pixtral Large is fine-tuned on interleaved text and image datasets, enabling it to follow complex instructions involving both modalities. This includes multi-turn interactions that require sequential reasoning and context retention.

Seamless Integration

Pixtral Large integrates into existing machine learning workflows for tasks like retrieval-augmented generation (RAG), document analysis, and automated reasoning.

Scalability

The model’s architecture supports both small-scale applications (e.g., OCR for a single document) and large-scale deployments (e.g., multimodal search engines).

Figure: Pixtral outperforms all open models within its weight class on multimodal tasks (source: Pixtral 12B paper).

Architecture

Multimodal Decoder

The decoder in Pixtral Large is based on the architecture of Mistral Large 2. It uses a transformer-based design capable of performing high-level reasoning across text and visual modalities. The decoder seamlessly processes long contexts of up to 128K tokens, which is particularly useful for combining large amounts of textual and visual data in a single inference.

Vision Encoder

The vision encoder, Pixtral-ViT, is a 1B-parameter module designed to handle diverse visual data.

Figure: The Pixtral vision encoder (source: Pixtral 12B paper).


Some of the key features are:

  • Aspect Ratio Preservation: Unlike standard encoders that require fixed resolutions, Pixtral-ViT processes images in their native dimensions. This minimizes preprocessing and preserves critical details.
  • Block-Diagonal Attention Masks: This technique supports efficient processing of multiple images by isolating each image's attention computations; a short sketch follows this list.
  • RoPE-2D Encoding: Rotary position embeddings extended to two dimensions improve spatial representation for image patches, making the encoder adaptive to different resolutions and aspect ratios; a second sketch below illustrates the idea.
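
To make the block-diagonal mask concrete, here is a minimal illustrative sketch in NumPy. It is not Mistral's implementation; it simply assumes each image contributes a known number of patch tokens to one flattened sequence:

```python
import numpy as np

def block_diagonal_mask(tokens_per_image):
    """Additive attention mask for several images flattened into one
    sequence: each position may attend only within its own image.

    tokens_per_image: patch-token count per image.
    Returns a (total, total) matrix with 0.0 inside each image's block
    and -inf elsewhere (added to the attention logits before softmax).
    """
    total = sum(tokens_per_image)
    mask = np.full((total, total), -np.inf, dtype=np.float32)
    start = 0
    for n in tokens_per_image:
        mask[start:start + n, start:start + n] = 0.0  # intra-image block
        start += n
    return mask

# Three images of different resolutions -> different token counts.
mask = block_diagonal_mask([4, 6, 3])
print(mask.shape)              # (13, 13)
print(int((mask == 0).sum()))  # 4*4 + 6*6 + 3*3 = 61 allowed pairs
```

Because cross-image attention is masked out, several variable-resolution images can share a single flattened sequence without interfering with one another, which is what makes batched processing efficient.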

The vision encoder transforms images into token representations compatible with the multimodal decoder, enabling unified processing of text and images.
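
For the RoPE-2D bullet above, the sketch below shows one common way to realize rotary embeddings in two dimensions: rotate half of each patch's features by its row index and the other half by its column index. This is a hypothetical illustration of the general technique, not Mistral's exact code:

```python
import numpy as np

def rope_1d(x, positions, base=10000.0):
    """Standard 1D rotary embedding over the last dimension.
    x: (num_patches, dim) with dim even; positions: (num_patches,)."""
    dim = x.shape[-1]
    freqs = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = positions[:, None] * freqs[None, :]          # (n, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                       # paired features
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    """RoPE-2D: rotate half the features by row index and half by
    column index, so each patch carries its (row, col) grid position."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[:, :half], rows), rope_1d(x[:, half:], cols)], axis=-1
    )

# A 2x3 grid of patches with 8-dim features (toy numbers).
rows, cols = np.meshgrid(np.arange(2), np.arange(3), indexing="ij")
x = np.random.randn(6, 8).astype(np.float32)
encoded = rope_2d(x, rows.ravel(), cols.ravel())
print(encoded.shape)  # (6, 8)
```

Since position enters only through these rotations, the encoder is not tied to a fixed grid size, which is what allows native-resolution, arbitrary-aspect-ratio inputs.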

Read the technical report available on arXiv: Pixtral 12B.

Performance Benchmarks

Pixtral Large has been evaluated on leading multimodal and text-only benchmarks, demonstrating competitive or superior results across tasks.

  • MathVista: This benchmark evaluates mathematical reasoning over visual data; Pixtral Large leads multimodal models such as Claude 3.5 Sonnet and Llama 3.2.
  • ChartQA and DocVQA: Pixtral surpasses GPT-4o and Gemini-1.5 Pro in reasoning about charts, tables, and scanned documents.
  • MM-MT-Bench: Designed for real-world multimodal applications, this benchmark evaluates multi-turn conversations combining text and images. Pixtral achieves the highest scores in its class.
  • Text-Only Benchmarks: Pixtral retains top-tier performance on MATH, HumanEval, and other text benchmarks, ensuring it is not limited to multimodal tasks.

Pixtral Large sets a new standard for open-weight models by excelling in visual tasks without sacrificing text processing.


Real-World Applications of Pixtral Large 

Document Understanding

Pixtral Large is well-suited for analyzing documents, including PDFs, invoices, and scanned materials. It supports tasks like the following (a short API sketch follows the list):

  • Optical Character Recognition (OCR): Extracts text from multilingual documents.
  • Content Summarization: Produces concise summaries of large documents with embedded images.
  • Semantic Search: Searches for relevant information across mixed media datasets.
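
As a minimal sketch of how such a document task might be invoked through Mistral's API, assuming the mistralai Python client's v1-style interface (the image URL and prompt are placeholders):

```python
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Ask Pixtral Large to transcribe and summarize a scanned invoice.
# The image URL is a placeholder; any publicly reachable image works.
response = client.chat.complete(
    model="pixtral-large-latest",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract all text from this invoice, then "
                         "summarize the line items in two sentences."},
                {"type": "image_url",
                 "image_url": "https://example.com/invoice.png"},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The same call pattern covers OCR, summarization, and retrieval-style queries; only the prompt changes.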

Figure: Pixtral Large example prompt (source: Mistral AI).

Data Visualization Analysis

Pixtral Large can interpret and reason about visual data, including:

  • Charts and Graphs: Identifies trends and anomalies in visualized datasets.
  • Training Loss Curves: Analyzes machine learning models' performance over time to pinpoint instability or overfitting.

Figure: Reasoning over complex figures (source: Pixtral 12B paper).

Multimodal Assistants

The model supports multi-turn conversational systems that combine text and images. Use cases include:

  • Customer Support: Resolves user queries by analyzing screenshots and associated text.
  • Knowledge Exploration: Provides detailed explanations of documents, diagrams, or charts during interactive sessions.

Image-to-Code Conversion

Pixtral Large converts images of hand-drawn or designed interfaces into executable HTML or code snippets. This functionality bridges design and development workflows, improving efficiency in prototyping.
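
A hedged sketch of how this could look through the API, again assuming the mistralai Python client; local images are passed as base64 data URIs, and the file name here is a placeholder:

```python
import base64
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Encode a local hand-drawn layout as a base64 data URI.
with open("ui_sketch.png", "rb") as f:  # placeholder file name
    encoded = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.complete(
    model="pixtral-large-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Convert this hand-drawn layout into a single "
                     "self-contained HTML file with inline CSS."},
            {"type": "image_url",
             "image_url": f"data:image/png;base64,{encoded}"},
        ],
    }],
)
print(response.choices[0].message.content)  # the generated HTML
```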

Figure: Image-to-code conversion example (source: Pixtral 12B paper).

Challenges Addressed

Pixtral Large resolves several limitations in existing multimodal AI systems:

  1. Input Size Restrictions: Its 128K token context window eliminates the need to truncate large inputs.
  2. Fixed-Resolution Encoders: The vision encoder processes variable-resolution images natively, improving flexibility and performance.
  3. Non-Standardized Evaluation: MM-MT-Bench ensures fair comparisons across different multimodal models.

These improvements make Pixtral Large a versatile tool for research and applied AI.

How to Access Pixtral Large?

Pixtral Large is available under two licenses:

  • Mistral Research License: For academic research and educational use.
  • Mistral Commercial License: For experimentation, testing, and production in commercial settings.

Access options:

  • le Chat: Mistral's chatbot, for experimenting with the model directly.
  • API: Available as pixtral-large-latest for integration into applications.
  • HuggingFace Repository: Open weights are provided for self-hosting and further fine-tuning; a rough serving sketch follows this list.
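
For the self-hosting route, a rough sketch with vLLM is below. The repository id, loading flags, and GPU count are assumptions rather than verified deployment instructions, and a 124B-parameter model requires a multi-GPU node:

```python
# Hypothetical self-hosting sketch with vLLM; the HF repo id, tokenizer
# flag, and GPU count are assumptions, not verified instructions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Pixtral-Large-Instruct-2411",  # assumed repo id
    tokenizer_mode="mistral",       # Mistral-format tokenizer
    tensor_parallel_size=8,         # 124B params won't fit on one GPU
)

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this chart."},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/chart.png"}},
    ],
}]
out = llm.chat(messages, sampling_params=SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)
```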

Read more about Pixtral Large on Mistral's official webpage.

Mistral’s Le Chat

Following the release of Pixtral Large, Mistral AI has introduced an upgraded Le Chat, a free, advanced platform designed to leverage the capabilities of its multimodal model. With features powered by Pixtral Large, Le Chat brings together text, vision, and interactivity into a seamless productivity tool suitable for a variety of use cases, from research and ideation to task automation.

Key Features of Le Chat

  • Web Search with Citations: Le Chat enhances its knowledge base by performing real-time web searches, providing source citations to ensure transparency and credibility in the information it delivers.
  • Canvas for Ideation: A new interface, Le Chat Canvas, enables users to create, modify, and collaborate on documents, presentations, and designs. This tool is ideal for brainstorming sessions and creative work, offering a dynamic space to visualize ideas and workflows.
  • Advanced Document and Image Analysis: Leveraging Pixtral Large, Le Chat can now process and summarize complex documents, including PDFs. It extracts insights from graphs, tables, equations, and other visual content, making it a powerful tool for researchers and professionals dealing with dense material.
  • Image Generation: In partnership with Black Forest Labs, Le Chat incorporates image generation capabilities powered by the Flux Pro model. This feature allows users to create high-quality visuals directly in the chat interface, similar to OpenAI’s DALL-E 3 integration, enhancing the creative potential of users.
  • Task Agents for Automation: Le Chat offers customizable task agents that automate repetitive activities like summarizing meeting minutes, processing invoices, or scanning receipts. This automation helps users save time and increase productivity.

Pixtral Large: Key Takeaways

  • Multimodal Strength: Combines text and image reasoning without performance trade-offs.
  • High Capacity: Processes up to 128K tokens, including multiple high-resolution images.
  • Vision Encoder: Handles variable image sizes with native resolution support.
  • Benchmark Leader: Outperforms leading models on MathVista, DocVQA, and MM-MT-Bench.
  • Accessible and Scalable: Open weights available for research and commercial use via multiple deployment options.

Written by

Justin Sharps
