Frederik Hvilshøj
Published April 24, 2023 | Edited April 25, 2023

Visual Foundation Models (VFMs) Explained


Visual Foundation Models (VFMs) take all of the advantages of foundation models, such as LLMs that make ChatGPT possible, and apply them to the creation or analysis of images and videos. VFMs also make auto-segmentation possible using models such as SAM and SegGPT.

The machine learning (ML), artificial intelligence (AI), and computer vision (CV) sectors are now witnessing a rising reliance on foundation models across a fast-growing number of use cases and industries.

Large language models (LLMs), such as OpenAI’s ChatGPT, are transforming sectors and the ways people work. 

New applications for these foundation models are now going live, including Microsoft’s Visual ChatGPT, Meta’s Segment Anything Model (SAM), and SegGPT. 

In this article, we will explore Visual Foundation Models (VFMs) in context, giving you an understanding of how they work and how you can apply them to computer vision projects.

What are Visual Foundation Models (VFMs)? 

There are numerous types of foundation models, such as generative adversarial networks (GANs), variational auto-encoders (VAEs), transformer-based large language models (LLMs), multimodal models, and several others.

There are also computer vision foundation models, such as Florence: “[Applying] universal visual-language representations that [can] be adapted to various computer vision tasks.”

While Florence is useful for image description and labeling, using an image-text contrastive learning approach, VFMs, such as Microsoft’s Visual ChatGPT, use foundation models to create images or videos using prompts. 
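To make the idea of image-text contrastive learning more concrete, here is a minimal sketch of a symmetric contrastive loss in plain Python. It is not Florence's actual training objective or code; the function names, the toy embeddings, and the temperature value of 0.07 are all assumptions chosen for illustration. The loss is low when each image embedding is most similar to its own caption embedding and high when it is not.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy(logits, target_index):
    """Cross-entropy of a softmax over `logits` against one correct index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_index]


def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric image-text contrastive loss (illustrative sketch).

    Pair i of the batch is (image_embeddings[i], text_embeddings[i]);
    every other combination in the batch is treated as a negative.
    """
    images = [normalize(v) for v in image_embeddings]
    texts = [normalize(v) for v in text_embeddings]
    n = len(images)
    # sim[i][j] = cosine similarity of image i and text j, scaled by temperature
    sim = [[dot(img, txt) / temperature for txt in texts] for img in images]
    # Each image should pick out its own text (rows of the matrix) ...
    image_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # ... and each text should pick out its own image (columns of the matrix).
    text_to_image = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


# Toy 2-D embeddings: aligned pairs give a far smaller loss than swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In a real system the embeddings would come from trained image and text encoders, and the loss would be minimized with gradient descent over very large batches of image-caption pairs.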

An example of an image created this way is Boris Eldagsen’s recent award-winning AI-generated image, “PSEUDOMNESIA: The Electrician.” It won the creative category of the Sony World Photography Awards 2023 open competition before Eldagsen revealed that the image was AI-generated and declined the prize.

An award-winning AI-generated image


Introducing Microsoft’s Visual ChatGPT

The excitement around use cases for OpenAI’s ChatGPT (including the newest underlying model, GPT-4) has been high.

Microsoft was so impressed by ChatGPT that it made a significant investment in OpenAI, and alongside integrating it with Bing, another exciting use case has been developed: creating images.

In March 2023, a team of Microsoft researchers, including Chenfei Wu and Nan Duan, released a paper on Visual ChatGPT and, soon after, the source code via GitHub.

According to the published paper, the researchers and engineers incorporated “different Visual Foundation Models, to enable the user to interact with ChatGPT by:

  1. sending and receiving not only languages but also images 
  2. providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models with multi-steps. 
  3. providing feedback and asking for corrected results.” 
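The multi-step collaboration described in the quote can be pictured as a chain of tool calls. The sketch below is not taken from the Visual ChatGPT source code: the tool functions, the `TOOLS` registry, and `run_plan` are hypothetical stand-ins that only build strings, and the plan is hard-coded here, whereas in the real system the language model decides which visual model to call next.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical stand-ins for visual foundation models. Each takes an image
# reference and an instruction, and returns a new image reference or text.
def caption_image(image: str, instruction: str) -> str:
    return f"caption({image})"


def edit_image(image: str, instruction: str) -> str:
    return f"edited({image}, '{instruction}')"


# Registry mapping a tool name to the function that implements it.
TOOLS: Dict[str, Callable[[str, str], str]] = {
    "caption": caption_image,
    "edit": edit_image,
}


def run_plan(image: str, plan: List[Tuple[str, str]]) -> List[str]:
    """Run each (tool_name, instruction) step on the output of the previous one.

    Returns the intermediate results so a user could inspect any step,
    give feedback, and ask for a corrected result.
    """
    history: List[str] = []
    current = image
    for tool_name, instruction in plan:
        if tool_name not in TOOLS:
            raise KeyError(f"unknown tool: {tool_name}")
        current = TOOLS[tool_name](current, instruction)
        history.append(current)
    return history


# A two-step request: edit the image first, then describe the edited result.
steps = run_plan(
    "room.png",
    [("edit", "replace the sofa with a bench"), ("caption", "")],
)
for step in steps:
    print(step)
```

Keeping every intermediate result in `history` is a design choice made for this sketch: it mirrors the third point in the quote, where the user can review an output and request a correction.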

Because the source code is publicly available, any individual or organization can integrate Visual ChatGPT, which pairs the ChatGPT (Generative Pre-trained Transformer) Large Language Model (LLM) with visual models, into AI-based applications and models, including computer vision projects.

Foundation model in Action: Segment Anything Model in Encord

Find out more about LLMs and other foundation models in our full guide

What is the goal of a VFM?

The goal of VFMs is to merge the capabilities of text-based foundation models (such as LLMs) with visual foundation models (such as Visual Transformers and Stable Diffusion). 

As the researchers explained in the paper launching Visual ChatGPT: “Since ChatGPT is trained with languages, it is currently not capable of processing or generating images from the visual world. At the same time, Visual Foundation Models, such as Visual Transformers or Stable Diffusion, although showing great visual understanding and generation capabilities, they are only experts on specific tasks with one-round fixed inputs and outputs.” 

In other words, the aim is to combine language and visual foundation models so that users can create, edit, and analyze images through conversational prompts.

How do Visual Foundation Models work? 

Visual foundation models follow the same five core principles as other foundation models.

Pre-trained on vast datasets. Foundation models — whether they’re open, closed, or have been fine-tuned — are pre-trained on quantities of data that far exceed most ML and AI-based models. 

GPT-3, for example, drew on a training corpus of roughly 500 billion tokens (words and word pieces). Reading that much text continuously would take dozens of human lifespans. At the same time, the number of parameters included in foundation models is enormous too. GPT-3 includes 175 billion parameters, more than 100x as many as its predecessor GPT-2 and around 10x more than other comparable LLMs of its time.

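Self-supervised learning. Unlike conventional supervised computer vision models, which need human-labeled examples, foundation models derive their training signal from the raw data itself, for example by predicting hidden or missing parts of the input. This lets them learn from far more data than anyone could label by hand.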

Overfitting. Similar to Encord's computer vision micro-models, overfitting in the pre-training and parameter development stages is an important part of the process.

Adaptable. Thanks to fine-tuning and prompt engineering, visual foundation models can be adapted to new tasks and improved using the prompts and inputs of users and ML engineers.

Generalized. In most cases, foundation models are generalized. Visual foundation models, however, adapt that generality to a narrower aim: creating or auto-segmenting images and videos from user prompts.
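The self-supervised principle above can be illustrated with a small masking example. This is only a sketch under stated assumptions: the function name `make_masked_example`, the 50% mask ratio, and the representation of an image as a list of patch labels are all hypothetical, and real models operate on pixel or feature tensors. The point is that the training targets come from the data itself, with no human labels involved.

```python
import random


def make_masked_example(patches, mask_ratio=0.5, seed=0, mask_token=None):
    """Turn unlabeled patches into a (model_input, targets) training pair.

    A random subset of patches is hidden behind `mask_token`; the hidden
    originals become the targets the model is asked to reconstruct.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    n_masked = max(1, int(len(patches) * mask_ratio))
    masked_positions = set(rng.sample(range(len(patches)), n_masked))
    model_input = [
        mask_token if i in masked_positions else patch
        for i, patch in enumerate(patches)
    ]
    targets = {i: patches[i] for i in sorted(masked_positions)}
    return model_input, targets


# An "image" split into four patches; two are hidden and become the labels.
patches = ["sky", "tree", "road", "car"]
model_input, targets = make_masked_example(patches)
print(model_input)
print(targets)
```

A model trained on many such pairs has to learn what typically surrounds a hidden region, which is the kind of general visual knowledge that later transfers to tasks such as segmentation.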

Auto-segmentation using Encord

How can you use VFMs in computer vision? 

You can use VFMs in computer vision not only through Microsoft’s Visual ChatGPT but also in several exciting and time-saving applications, such as auto-segmentation.

Do you want to eliminate manual segmentation? Learn how to use foundation models, like SegGPT and Meta’s Segment Anything Model (SAM), to reduce labeling costs with Encord! Read the product announcement, how to fine-tune SAM or go straight to getting started!
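To show what prompt-driven segmentation looks like from the outside, here is a toy example in which a single clicked point is turned into a mask. It is a deliberately simple stand-in: SAM and SegGPT use large learned models, while this sketch uses classic region growing (flood fill) on a tiny grayscale grid. The function name `segment_from_point` and the tolerance value are assumptions made for illustration.

```python
from collections import deque


def segment_from_point(image, point, tolerance=10):
    """Grow a mask outward from a clicked (row, col) point.

    A neighbouring pixel joins the mask when its intensity is within
    `tolerance` of the clicked pixel. This is plain region growing,
    not a learned segmentation model.
    """
    rows, cols = len(image), len(image[0])
    start_row, start_col = point
    seed_value = image[start_row][start_col]
    mask = [[False] * cols for _ in range(rows)]
    mask[start_row][start_col] = True
    queue = deque([(start_row, start_col)])
    while queue:
        r, c = queue.popleft()
        # Visit the four direct neighbours (down, up, right, left).
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and not mask[nr][nc]:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    mask[nr][nc] = True
                    queue.append((nr, nc))
    return mask


# A 3x3 grayscale image with a bright object on a dark background.
image = [
    [0, 0, 200],
    [0, 200, 200],
    [0, 0, 200],
]
mask = segment_from_point(image, (1, 1))  # "click" on the bright object
for row in mask:
    print(row)
```

The interface is the part that carries over to foundation models: the annotator supplies a light prompt such as a point, and the model returns a full mask that would otherwise have to be drawn by hand.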

Computer vision model in action

Conclusion & Key Takeaways 

Real-world use cases and applications of foundation models are multiplying rapidly across dozens of sectors.

As we’ve seen with LLM tools such as ChatGPT, this is encouraging enterprise organizations to try out more innovative uses of other AI models, such as those used in computer vision.

Now, alongside GANs, Visual Foundation Models (VFMs), such as Microsoft’s Visual ChatGPT, can be used to create images using simple prompts. This is another AI-based leap forward, and one such use case is the creation of synthetic images where there aren’t enough real-world examples for particular projects. 

Other examples of visual foundation models being applied to computer vision include Meta’s SAM and SegGPT, for implementing numerous segmentation tasks more effectively. 

Over time, we expect there will be many more ways to use VFMs for computer vision and other projects, and we look forward to seeing further developments in this field. 

Ready to take your computer vision model, training, and development to the next level? 

Get started with Encord - the Active Learning Platform for Computer Vision, used by the world’s leading computer vision teams. 

AI-assisted labeling, model training & diagnostics, find & fix dataset errors and biases, all in one collaborative active learning platform, to get to production AI faster.
