Meta AI’s I-JEPA, Image-based Joint-Embedding Predictive Architecture, Explained

Akruti Acharya
June 14, 2023
5 min read

If you thought the AI landscape could not move any faster with Large Language Models and Generative AI, then think again! The next innovation in artificial intelligence is already here.

This week, Meta AI unveiled the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a computer vision model that learns more like humans do.

This release is a meaningful first step toward Chief AI Scientist Yann LeCun's vision of machines that can learn internal models of how the world works, so that they can accomplish difficult tasks and adapt to unfamiliar situations.

Before diving into the details of I-JEPA, let’s discuss self-supervised learning. 


What is Self-Supervised Learning?

Self-supervised learning is an approach to machine learning that enables models to learn from unlabeled data. By leveraging self-generated labels through pretext tasks, such as contrastive learning or image inpainting, self-supervised learning unlocks the potential to discover meaningful patterns and representations. This technique enhances model performance and scalability across various domains.
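
As a toy illustration, the sketch below trains on a self-generated pretext task (rotation prediction): the "labels" are rotation angles produced from the data itself, so no human annotation is involved. The tiny backbone and tensor shapes are placeholders, not any particular published model.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(                      # stand-in for any vision backbone
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(16, 4)                       # classify one of 4 rotations

images = torch.randn(8, 3, 32, 32)            # an unlabeled batch of images
k = torch.randint(0, 4, (8,))                 # self-generated "labels": 0/90/180/270 degrees
rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                       for img, r in zip(images, k)])

logits = head(encoder(rotated))
loss = nn.functional.cross_entropy(logits, k)  # supervision comes from the data itself
loss.backward()
```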

There are two common approaches to self-supervised learning in computer vision: invariance-based methods and generative methods.

Invariance-based Methods

Invariance-based pre-training methods train models to produce similar representations for different views of the same image, typically created through data augmentation and optimized with techniques such as contrastive learning.

While these methods can produce semantically rich representations, they may introduce biases. Generalizing these biases across tasks requiring different levels of abstraction, such as image classification and instance segmentation, can be challenging. 
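
For reference, a simplified contrastive objective of this kind (loosely in the spirit of SimCLR's loss, reduced to one direction) might look like the sketch below; the embeddings here are random placeholders rather than outputs of a trained encoder.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Simplified contrastive loss: z1[i] and z2[i] are two augmented views of the same image."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # cosine similarity between every pair of views
    targets = torch.arange(z1.size(0))   # the matching view sits on the diagonal
    return F.cross_entropy(logits, targets)

# In practice z1 and z2 come from encoding two random augmentations of the same batch
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(info_nce(z1, z2))
```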

Generative Methods

Generative methods train models to produce realistic samples from a given distribution, indirectly learning meaningful representations in the process. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) are two examples, where the models learn to generate images that resemble the training data distribution.

Generative methods may prioritize capturing low-level details and pixel-level accuracy at the expense of higher-level semantic information. This drawback can make the learned representations less effective for tasks that require understanding complex semantics or reasoning about objects and scenes.

💡 To learn more about self-supervised learning, read Self-supervised Learning Explained.

Joint-Embedding Architecture

Joint Embedding Architecture (JEA) is an architecture that learns to produce similar embeddings for compatible inputs and dissimilar embeddings for incompatible inputs. By mapping related inputs close together, it creates a shared representation space and facilitates tasks such as similarity comparison or retrieval.

Joint-embedding architectures (Source)

The main challenge with JEAs is representation collapse, where the encoder produces a constant output regardless of the input.
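
A bare-bones sketch of that setup is shown below, using a margin-based loss purely for illustration (the linear encoders, margin value, and random inputs are made up, not taken from any specific JEA method).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc_x = nn.Linear(64, 32)   # encoder for one input (placeholder)
enc_y = nn.Linear(64, 32)   # encoder for the other input (placeholder)

x, y_pos, y_neg = torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64)
sx, sy_pos, sy_neg = enc_x(x), enc_y(y_pos), enc_y(y_neg)

d_pos = F.pairwise_distance(sx, sy_pos)       # compatible pair: pull together
d_neg = F.pairwise_distance(sx, sy_neg)       # incompatible pair: push beyond a margin
loss = (d_pos + F.relu(1.0 - d_neg)).mean()
loss.backward()

# Collapse failure mode: if both encoders ignore their inputs and emit a constant
# vector, d_pos becomes zero "for free" -- which is why JEAs need negatives,
# regularization, or other tricks to keep the embeddings informative.
```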

Generative Architecture

Generative architecture is a type of machine learning model or neural network that can generate new samples resembling the training data distribution, such as images or text. It uses a decoder network and learns by removing or distorting parts of the input data, for example erasing regions of an image, and then predicting the corrupted or missing pixels, capturing the underlying patterns needed to produce realistic outputs.

Generative architecture (Source)

Generative methods aim to fill in all missing information and disregard the inherent unpredictability of the world. Consequently, these methods may make mistakes that a human would not, as they prioritize irrelevant details over capturing high-level predictable concepts.
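
A minimal sketch of such a pixel-space reconstruction objective is shown below, with a deliberately tiny network standing in for the encoder and decoder (this illustrates the general masked-reconstruction recipe, not the exact MAE or GAN/VAE architectures).

```python
import torch
import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),   # "encoder"
    nn.Conv2d(32, 3, 3, padding=1),              # "decoder" back to pixel space
)

images = torch.randn(4, 3, 64, 64)
mask = (torch.rand(4, 1, 64, 64) > 0.5).float()  # 1 = visible, 0 = erased
corrupted = images * mask

recon = autoencoder(corrupted)
# Loss is computed in pixel space, only on the erased locations
loss = ((recon - images) ** 2 * (1 - mask)).mean()
loss.backward()
```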

Joint-Embedding Predictive Architecture

Joint Embedding Predictive Architecture (JEPA) is an architecture that learns to predict representations of different target blocks in an image from a single context block, using a masking strategy to guide the model toward producing semantic representations.

Joint-embedding predictive architecture (Source)

JEPAs focus on learning representations that predict each other when additional information is provided, rather than seeking invariance to data augmentations like JEAs. Unlike generative methods that predict in pixel space, JEPAs utilize abstract prediction targets, allowing the model to prioritize semantic features over unnecessary pixel-level details. This approach encourages the learning of more meaningful and high-level representations.
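
A minimal sketch of that difference is given below, with toy linear modules standing in for the real encoders and predictor; the key point is that the loss is computed between embeddings, never between pixels.

```python
import torch
import torch.nn as nn

context_encoder = nn.Linear(256, 128)   # encodes the context view (placeholder)
target_encoder = nn.Linear(256, 128)    # encodes the target view (placeholder)
predictor = nn.Linear(128, 128)         # context embedding -> predicted target embedding

context_view, target_view = torch.randn(8, 256), torch.randn(8, 256)

pred = predictor(context_encoder(context_view))
with torch.no_grad():                    # no gradients flow through the targets
    target = target_encoder(target_view)

loss = nn.functional.mse_loss(pred, target)  # the loss lives in representation space
loss.backward()
```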

Watch the lecture by Yann LeCun on A Path Towards Autonomous Machine Intelligence, where he explains the Joint Embedding Predictive Architecture.

I-JEPA: Image-based Joint-Embedding Predictive Architecture

Image-based Joint Embedding Predictive Architecture (I-JEPA) is a non-generative approach to self-supervised learning from images. It aims to predict missing information in an abstract representation space: given a single context block, the goal is to predict the representations of different target blocks within the same image. The target representations are computed using a learned target-encoder network, enabling the model to capture and predict meaningful features and structures.

I-JEPA (Source)

💡Use your embeddings live in Encord Active. Try it here or get in touch for a demo

I-JEPA Core Design

The framework of Image-based Joint Embedding Predictive Architecture (I-JEPA) consists of three components: a context block, a target block, and a predictor.

I-JEPA Core Design (Source)

I-JEPA incorporates a multi-block masking strategy to guide the model toward producing semantic representations. This strategy emphasizes predicting sufficiently large target blocks within the image and using an informative context block that is spatially distributed.

I-JEPA (Source)
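
A rough sketch of how such a multi-block mask could be sampled over a grid of image patches is shown below; the block counts and sizes are illustrative, not the exact scale ranges used in the paper.

```python
import random

GRID = 14  # e.g. a 224x224 image with 16x16 patches gives a 14x14 patch grid

def sample_block(h, w):
    """Return the set of (row, col) patch indices covered by an h x w block."""
    top, left = random.randint(0, GRID - h), random.randint(0, GRID - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}

# Several reasonably large target blocks...
targets = [sample_block(5, 5) for _ in range(4)]

# ...and one large, spatially distributed context block,
context = sample_block(12, 12)
# with every target patch removed so the context never "sees" the targets.
context -= set().union(*targets)

print(f"context patches: {len(context)}, target block sizes: {[len(t) for t in targets]}")
```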

Context Block

I-JEPA uses a single context block to predict the representations of multiple target blocks within the same image. The context encoder is based on a Vision Transformer (ViT) and focuses on processing the visible context patches to generate meaningful representations.

Target Block

A target block is the representation of a block of image patches and is predicted from the single context block. These representations are produced by the target encoder, whose weights are updated at each iteration as an exponential moving average of the context-encoder weights. To obtain the target blocks, masking is applied to the output of the target encoder rather than to its input.
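
A minimal sketch of that exponential-moving-average update is shown below, assuming PyTorch modules and an illustrative momentum value.

```python
import torch

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum=0.996):
    """Target weights drift slowly toward the context-encoder weights; no gradients flow here."""
    for t_param, c_param in zip(target_encoder.parameters(),
                                context_encoder.parameters()):
        t_param.mul_(momentum).add_(c_param, alpha=1.0 - momentum)

# Typically called once per training iteration, after the gradient step
# that updates the context encoder and predictor.
```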

I-JEPA (Source)

Prediction

The predictor in I-JEPA is a narrower version of the Vision Transformer (ViT). It takes the output from the context encoder and predicts the representations of a target block located at a specific position with the guidance of positional tokens.

The loss is the average L2 distance between the predicted patch-level representations and the target patch-level representations. The parameters of the predictor and the context encoder are learned through gradient-based optimization, while the parameters of the target encoder are updated as an exponential moving average of the context-encoder parameters.
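
Sketched numerically, with random tensors standing in for the predictor and target-encoder outputs of shape (batch, number of target patches, embedding dimension), the loss could be computed as follows.

```python
import torch

predicted = torch.randn(8, 45, 768, requires_grad=True)  # stand-in for predictor output
with torch.no_grad():
    target = torch.randn(8, 45, 768)                      # stand-in for target-encoder output

# Average L2 distance between predicted and target patch-level representations
loss = torch.linalg.vector_norm(predicted - target, dim=-1).mean()
loss.backward()
```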

💡If you want to test it yourself, access the code and model checkpoints on GitHub.

How does I-JEPA perform?

Predictor Visualizations

The predictor in I-JEPA captures spatial uncertainty within a static image, using only a partially observable context. It focuses on predicting high-level information rather than pixel-level details about unseen regions in the image.

A stochastic decoder is trained to convert the predicted representations from I-JEPA into pixel space. This enables visualizing how the model’s predictions translate into tangible pixel-level outputs. This evaluation demonstrates the model's ability to accurately capture positional uncertainty and generate high-level object parts with correct poses (e.g., the head of a dog or the front legs of a wolf) when making predictions within the specified blue box.

I-JEPA (Source)

As illustrated in the above image, I-JEPA successfully learns high-level representations of object parts while preserving their localized positional information in the image.

Performance Evaluation

I-JEPA sets itself apart by efficiently learning semantic representations without using view augmentation. It outperforms pixel-reconstruction methods such as Masked Autoencoders (MAE) on ImageNet-1K linear probing and semi-supervised evaluation.

ImageNet-1K linear evaluation of I-JEPA (Source)

Notably, I-JEPA is competitive with pre-training approaches that rely heavily on data augmentations for semantic tasks, and exhibits stronger performance on low-level vision tasks such as object counting and depth prediction. By employing a simpler model with a more flexible inductive bias, I-JEPA showcases versatility across a wider range of tasks.

I-JEPA also impresses with its scalability and efficiency. Pre-training a ViT-H/14 model on ImageNet requires under 1,200 GPU hours, making it over 2.5x faster than a ViT-S/16 pre-trained with iBOT. Compared to a ViT-H/14 pre-trained with MAE, I-JEPA proves to be more than 10x more efficient. Leveraging representation-space predictions significantly reduces the computation required for self-supervised pre-training.

I-JEPA (Source)

I-JEPA: Key Takeaways

  • Image-based Joint Embedding Predictive Architecture (I-JEPA) is an approach for self-supervised learning from images without relying on data augmentations.
  • The concept behind I-JEPA: predict the representations of various target blocks in the same image from a single context block.
  • This method improves the semantic level of self-supervised representations without using extra prior knowledge encoded through image transformations.
  • It is scalable and efficient.

💡Read the original paper here.

Recommended Articles

Read more on other recent releases from Meta AI:

Supercharge Your Annotations with the Segment Anything Model