LLaVA and LLaVA-1.5 Explained

Akruti Acharya
October 17, 2023
5 min read
blog image

Microsoft has recently entered the realm of multimodal models with the introduction of LLaVA, a groundbreaking solution that combines a vision encoder and Vicuna to enable visual and language comprehension. LLaVA showcases impressive chat capabilities, rivaling Open AI’s multimodal GPT-4, and sets a new benchmark for state-of-the-art accuracy in Science QA.

The convergence of natural language and computer vision has led to significant advancements in the field of artificial intelligence. While fine-tuning techniques have greatly improved the performance of large language models (LLMs) in handling new tasks, the application of these methods to multimodal models remains relatively unexplored. The research paper titled "Visual Instruction Tuning" introduces an innovative approach called LLAVA (Large Language and Vision Assistant) that leverages the power of GPT-4, initially designed for text-based tasks, to create a new paradigm of multimodal instruction-following data that seamlessly integrates textual and visual components.

In this blog, we will delve into the evolution of visual instruction tuning and explore the specifics of LLaVA, along with its more recent iteration, LLaVA-1.5. By examining these advancements, we can gain valuable insights into the continuous progress of LLMs in the field of AI. 

Power the next generation of LLMs & VLMs with Reinforcement Learning from Human Feedback
medical banner

What is Visual Instruction Tuning?

Visual instruction tuning is a technique that involves fine-tuning a large language model (LLM) to understand and execute instructions based on visual cues. 

This approach aims to establish a connection between language and vision, enabling AI systems to comprehend and act upon human instructions that involve both modalities. 

For instance, imagine asking a machine learning model to describe an image, perform an action in a virtual environment, or answer questions about a scene in a photograph. Visual instruction tuning equips the model with the ability to perform these tasks effectively.


LLaVA vs. LLaVA-1.5


LLaVA, short for Large Language and Vision Assistant, is one of the pioneering multimodal models. Despite being trained on a relatively small dataset, LLaVA showcases exceptional abilities in understanding images and responding to questions about them. Its performance on tasks that demand deep visual comprehension and instruction-following is particularly impressive. Notably, LLaVA demonstrates behaviors akin to multimodal models like GPT-4, even when presented with unseen images and instructions.

LLaVA Architecture


LLaVA Architecture

LLaVA utilizes the LLaMA model,  renowned for its efficacy in open-source language-only instruction-tuning projects. For visual content processing, LLaVA relies on the pre-trained CLIP visual encoder ViT-L/14, which excels in the realm of visual comprehension. The encoder extracts visual features from input images and connects them to language embeddings through a trainable projection matrix. This projection effectively translates visual features into language embedding tokens, thereby bridging the gap between text and images.

light-callout-cta Read the original paper by Microsoft, authored by Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee available on Arxiv: Visual Instruction Tuning.

LLaVA Training

LLaVA's training encompasses two essential stages that enhance its capacity to comprehend user instructions, understand both language and visual content, and generate accurate responses: 

  • Pre-training for Feature Alignment: In this initial stage, LLaVA aligns visual and language features to ensure compatibility. 
  • Fine-tuning End-to-End: The second training stage focuses on fine-tuning the entire model, end-to-end. While the visual encoder's weights remain unchanged, both the projection layer's pre-trained weights and the LLM's parameters become subjects of adaptation. This fine-tuning can be tailored to different application scenarios, yielding versatile capabilities.


In LLaVA-1.5, there are two significant improvements. Firstly, the addition of an MLP vision-language connector enhances the system's capabilities. Secondly, the integration of academic task-oriented data further enhances its performance and effectiveness.

MLP Vision-Language Connector

LLaVA-1.5 builds upon the success of MLPs in self-supervised learning and incorporates a design change to enhance its representation power. The transition from a linear projection to a two-layer MLP significantly enhances LLaVA-1.5's multimodal capabilities. This modification has profound implications, enabling the model to effectively understand and interact with both language and visual elements.

Academic Task Oriented Data

LLaVA-1.5 goes beyond its predecessor by integrating VQA datasets that are designed for academic tasks. These datasets focus on specific tasks related to VQA, Optical Character Recognition (OCR), and region-level perception. This enhancement equips LLaVA-1.5 with the capability to excel in a wide range of applications, including text recognition and precise localization of fine-grained visual details.


Improved Baselines with Visual Instruction Tuning 

The development from LLaVA to LLaVA-1.5 signifies Microsoft’s continuous pursuit to refine and expand the capabilities of large multimodal models. LLaVA-1.5 signifies a significant progression towards the development of more sophisticated and adaptable AI assistants, aligning with their commitment to advancing the field of artificial intelligence. 

light-callout-cta The codebase on LLaVA’s Github contains the model and the dataset (available on HuggingFace) used for training.

Comparison with SOTA

Multimodal AI has witnessed significant advancements, and the competition among different models is fierce. Evaluating the performance of LLaVA and LLaVA-1.5 in comparison to state-of-the-art (SOTA) models offers valuable insights into their capabilities.

LLaVA's ability to fine-tune LLaMA using machine-generated instruction-following data has shown promising results on various benchmarks. In tasks such as ScienceQA, LLaVA achieved an accuracy that closely aligns with the SOTA model's performance. ability to handle out-of-domain questions highlights its strong proficiency in comprehending visual content and delivering effective question answering.

However, LLaVA demonstrates exceptional proficiency in comprehending and adhering to instructions within a conversational context. It's capable of reasoning and responding to queries in a manner that aligns with human intent, outperforming other models like BLIP-2 and OpenFlamingo.


Visual Instruction Tuning

The introduction of LLaVA-1.5 and its potential improvements indicate promising advancements in the field.  The collaboration between LLaVA and GPT-4 through model ensembling holds the potential for enhanced accuracy and underscores the collaborative nature of AI model development.

Recent Developments


LLaVA-Med, the Large Language and Vision Assistant for BioMedicine, is a groundbreaking multimodal assistant designed specifically for the healthcare field. This innovative model aims to support biomedical practitioners in their pursuit of knowledge and insights by effectively addressing open-ended research inquiries related to biomedical images. What sets LLaVA-Med apart is its cost-effective approach, leveraging a comprehensive dataset of biomedical figure-caption pairs sourced from PubMed Central. 

Through self-guided learning facilitated by GPT-4, it excels in capturing the nuances of open-ended conversational semantics and aligning them with the specialized vocabulary of the biomedical domain. Remarkably, LLaVA-Med can be trained in less than 15 hours and exhibits exceptional capabilities in multimodal conversation. This represents a significant advancement in enhancing the comprehension and communication of biomedical images.


LLaVA-Interactive is an all-in-one demo that showcases the visual interaction and generation capabilities of multimodal models beyond language interaction. Powered by LLaVA, SEEM, and GLIGEN, this interactive experience offers a profound demonstration of the boundless versatility inherent in multimodal models.

Multimodal Foundation Models

Multimodal Foundation Models: From Specialists to General-Purpose Assistants is a comprehensive 118-page survey that explores the evolution and trends in multimodal foundation models. This survey provides insights into the current state of multimodal AI and its potential applications. It is based on the tutorial in CVPR 2023 by Microsoft and the members of the LLaVA project. 

Instruction Tuning with GPT-4 Vision

The paper Instruction Tuning with GPT-4 discusses the attempt to use GPT-4 data for LLM self-instruct tuning. This project is an exploration of the capabilities of GPT-4 and its potential in enhancing large language models.

While LLaVA represents a significant step forward in the world of large multimodal models, the journey is far from over, and there are promising directions to explore for its future development:

  • Data Scale: LLaVA's pre-training data is currently based on a subset of CC3M, and its fine-tuning data draws from a subset of COCO. To enhance its concept coverage, especially with regard to entities and OCR, one direction for improvement is to consider pre-training on even larger image-text datasets. 
  • Integrating with more computer vision models: LLaVA has shown promising results, even approaching the capabilities of the new ChatGPT in some scenarios. To advance further, one interesting avenue is the integration of powerful vision models, such as SAM. 

LLaVA: Key Takeaways

  • LLaVA Challenges GPT-4: Microsoft's LLaVA is a powerful multimodal model rivaling GPT-4, excelling in chat capabilities and setting new standards for Science QA.
  • Visual Instruction Tuning Advances AI: LLaVA's visual instruction tuning enables AI to understand and execute complex instructions involving both text and images.
  • LLaVA-1.5 Enhancements: LLaVA-1.5 introduces an MLP vision-language connector and academic task-oriented data, boosting its ability to interact with language and visual content.
  • Bridging Language and Vision: LLaVA's architecture combines LLaMA for language tasks and CLIP visual encoder ViT-L/14 for visual understanding, enhancing multimodal interactions.

Power the next generation of LLMs & VLMs with Reinforcement Learning from Human Feedback
medical banner

Written by Akruti Acharya
Akruti is a data scientist and technical content writer with a M.Sc. in Machine Learning & Artificial Intelligence from the University of Birmingham. She enjoys exploring new things and applying her technical and analytical skills to solve challenging problems and sharing her knowledge and... see more
View more posts
cta banner

Discuss this blog on Slack

Join the Encord Developers community to discuss the latest in computer vision, machine learning, and data-centric AI

Join the community

Software To Help You Turn Your Data Into AI

Forget fragmented workflows, annotation tools, and Notebooks for building AI applications. Encord Data Engine accelerates every step of taking your model into production.