
How We Built the World's Largest Multimodal Dataset

October 15, 2025 | 5 min read

Over the past few months, our machine learning team set out to build a single, trustworthy foundation for multimodal models that operate across text, images, video, audio, and 3D point clouds. Not a demo-ready sample or a narrowly scoped corpus, but a resource large and clean enough to accelerate the open-source development of multimodal systems that span more than just a few modalities.

This has led us to build what we believe is the world’s largest open-source multimodal dataset. Below is an in-depth description of how we built it, along with some of the decisions and learnings made along the way. We believe that sharing our methodology will help others construct similar datasets in the future and push multimodal AI forward.

Why we built it

You’re probably familiar with OpenAI’s CLIP model. In short, it is an embedding model: CLIP lets you embed images and text and then measure the similarity between the two.

If the image and the text are related, e.g., the image contains construction workers and the text is “Construction workers at a construction site”, the two will have a high computed similarity, while an image of just a construction site (but without workers) will score lower.
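For concreteness, here is a minimal sketch of that similarity computation, assuming the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint (the image path is illustrative):

```python
# A minimal sketch of CLIP image-text similarity, assuming the Hugging Face
# `transformers` library; the image path is a hypothetical local file.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("construction_workers.jpg")  # hypothetical file
text = "Construction workers at a construction site"

with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=[text], return_tensors="pt", padding=True))

# Cosine similarity between the two embeddings: higher means a closer semantic match.
similarity = torch.nn.functional.cosine_similarity(img_emb, txt_emb)
print(similarity.item())
```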

Since most ML engineers already know this, I’ll keep it brief and name a few example use cases:

  • Embed the names of the 10 classes from your computer vision classification problem and suddenly you have an image classifier (this is called zero-shot classification; see the sketch after this list).
  • Embed all your images and store them in a vector DB. Embed a new text (or image) query and find the most similar vectors (and thus images) in your DB (this is called natural language search or retrieval).
  • Want to generate images from text? You may remember Stable Diffusion, a model that uses CLIP embeddings to condition a diffusion model.
  • The image embeddings of CLIP have been used to caption new, unseen images.
  • You can easily fine-tune CLIP to, e.g., evaluate the quality/aesthetics of images to filter datasets down to only high-quality data.
  • You can train a large language model (LLM) to see via images by fine-tuning it on image embeddings from CLIP (the LLaVA paper was, to my knowledge, the pioneer on that front).
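To make the first bullet concrete, here is a minimal zero-shot classification sketch, again assuming transformers and the same public CLIP checkpoint; the class names and image path are purely illustrative:

```python
# A minimal sketch of zero-shot classification with CLIP; class names and
# the input image are illustrative, not from our dataset.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["cat", "dog", "construction worker"]
prompts = [f"a photo of a {c}" for c in class_names]

image = Image.open("query.jpg")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_classes)

# The class whose text embedding is most similar to the image embedding wins.
print(class_names[logits.argmax(dim=-1).item()])
```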

One model powers all of these use cases (I’m aware that many of them are a few years old and that the field has come a long way since then). Today, generative models like Sora 2 or the open-source Wan2.5 produce astonishing videos from text prompts, and you can dream your way through imaginary worlds with Google’s Genie 3.

However, apart from a few cool embedding models, like TwelveLabs Marengo 2.7 and the open-source VAST model, most remain bi-modal. Imagine what would be possible if there were a model, like CLIP, that digested more than just text and vision.

What if it could also hear audio and sense its surroundings? What if you could embed point clouds and audio with the same model, with everything in one coherent embedding space?

That would open up a lot of new applications, like LLMs or robots that hear things happening near them and observe objects moving, or generative models that can be conditioned on modalities other than text.

What we built

We asked ourselves a simple question:

What data would be required to build a “CLIP model” that does image, video, audio, 3D point clouds, and text?

To answer this question, we need to look at what the data for CLIP (and all its cousins) looks like: pairs of images and related captions. Images and text together are easy to come by; in a nutshell, you scrape the internet for every image and its surrounding text. A good example is LAION-5B (5B images and captions sourced from the internet).

But how would that look for a five-way multimodal model? The cleanest way would be to have a lot of 5-tuples, all matching each other on semantic content. However, there is no natural source for a dataset like that. You cannot simply scrape the internet and hope to find point clouds next to the four other modalities.

Instead, we decided on another strategy: source open-source data for each modality individually and use retrieval models to line up the data.

Let’s dive a bit deeper into what that actually means, in terms of (1) sourcing the data and (2) lining it up.

1. Sourcing data

We had to source data from all five modalities. To do so, we first identified well-captioned bi-modal datasets. That is, datasets of image-text, audio-text, video-text, and point cloud-text pairs. Perhaps not surprisingly, downloading all that data took a few weeks.

After initial screenings for missing URLs, corrupted files, variable frame rates, etc., here is a breakdown of the data that we used to assemble the 100M dataset.


Table 1: Text captions sourced from respective datasets


Table 2: Non-text items sourced from public datasets

It was a conscious choice to source captions native to the different modalities in order to cover the “caption space” as well as possible.

2. Lining up the data (100M)

With a lot of data at hand, we decided to use state-of-the-art bi-modal retrieval models to find audio, images, videos, and point clouds for the 6.7M captions. Below are the models we employed for embedding all the data.


Table 3: Embedding models used to embed the data.

We built nearest-neighbor indices from the embeddings and ran a k = 16 nearest-neighbor search for each caption. Based on the retrieval results, we paired caption i with its top-j result, j = 1, …, k, from each modality into 5-tuples (T_i, A_{i,j}, I_{i,j}, V_{i,j}, PC_{i,j}).
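Concretely, the pairing step can be sketched as follows, assuming precomputed, L2-normalized caption and item embeddings from the bi-modal models in Table 3, and using FAISS as one possible nearest-neighbor library (file names and shapes are illustrative):

```python
# A minimal sketch of the pairing step. Assumption: for each modality, captions and
# items are embedded with the corresponding bi-modal model from Table 3 and
# L2-normalized, so inner product equals cosine similarity.
import faiss
import numpy as np

k = 16
modalities = ["audio", "image", "video", "point_cloud"]

neighbours = {}
for name in modalities:
    caption_embs = np.load(f"caption_embs_{name}.npy").astype("float32")  # (6.7M, d)
    item_embs = np.load(f"item_embs_{name}.npy").astype("float32")        # (N_m, d)

    index = faiss.IndexFlatIP(item_embs.shape[1])  # inner-product (cosine) index
    index.add(item_embs)
    _, ids = index.search(caption_embs, k)         # (6.7M, k) nearest-item ids
    neighbours[name] = ids

# Zip the j-th retrieval result from each modality into a 5-tuple
# (T_i, A_{i,j}, I_{i,j}, V_{i,j}, PC_{i,j}) for j = 1..k.
n_captions = neighbours["image"].shape[0]
tuples = [
    (i, *(neighbours[m][i, j] for m in modalities))
    for i in range(n_captions)
    for j in range(k)
]
```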

Here is an example:


Figure 1: A 5-tuple sourced via the caption “soft wave of the sea on the sandy beach” (Caption sourced from Google Conceptual Captions). Notice how the sea waves are represented in all five modalities.

Sourcing 16 groups per caption yields a total of 6.7M ⋅ 16 ≈ 107M groups.

💡 If you have tried the demo, you may have noticed that the dataset there is much smaller. The reason is licensing: there is a lot of data that Encord is not allowed to “redistribute” in its raw form. As such, we can only use license-permissive examples to demonstrate the value of the dataset.

How we elevated the quality (1M)

Arguably, such a dataset is not necessarily of the highest quality. Why? Because retrieval models are sometimes wrong. As a consequence, we also committed to leveling up the quality of the data.

We selected a diverse set of captions, first distributing the selection across the five caption-modality sources. Next, we applied a (very cool) sampling approach from Meta that allowed us to pick a diverse subset of captions.
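The exact sampling approach is the one from Meta detailed in our paper; purely as an illustration of the general idea, here is a minimal k-center-greedy-style sketch of diversity sampling over caption embeddings (not our actual method; the file name and sample count are hypothetical):

```python
# Not the sampling approach we actually used; a rough stand-in showing the general
# idea of diversity sampling over caption embeddings via k-center greedy selection.
import numpy as np

def k_center_greedy(embs: np.ndarray, n_samples: int) -> list[int]:
    """Greedily pick points that are far from everything selected so far."""
    selected = [0]                                    # seed with an arbitrary caption
    min_dist = np.linalg.norm(embs - embs[0], axis=1)
    for _ in range(n_samples - 1):
        nxt = int(min_dist.argmax())                  # farthest point from the current set
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(embs - embs[nxt], axis=1))
    return selected

caption_embs = np.load("caption_embs.npy")            # hypothetical (N, d) array
diverse_ids = k_center_greedy(caption_embs, n_samples=10_000)
```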

These captions were then filtered to exclude NSFW content before being presented to a human annotator, who rated each caption against three examples found in a similar manner as above, but with a greedy algorithm on top to mitigate heavy hitters (it’s all detailed in our paper).


Figure 2: An example task where an annotator is given a caption from a video dataset that was automatically matched against three point cloud objects (renders are displayed for faster annotation) and is asked to rate whether each object is a good, partial, or bad match.

We collected a total of 976,863 ratings from a little over 6,000 human work hours. The figure below displays how the ratings are distributed across source and target modalities. Note that we refrained from annotating audio and video together, as those two modalities come naturally paired.


Figure 3: The distribution of ratings between source and target modalities.

We like to think of these ratings as fuel for what could resemble a “post-training” phase, where only high-quality pairs (negative or not) are employed. In our experiments, this was particularly effective for improving model performance on audio-point cloud and image-point cloud evaluations.

The evaluation has to be even better

We also built a retrieval model that can embed all modalities into a common embedding space. To build the model, we of course ran a bunch of evaluations on public benchmarks, 13 to be exact. Needless to say, we made sure there is no leakage between the training set we are publishing and the standard retrieval evaluations for the five modalities.

💡 If you haven’t been to the demo environment already, you can go there to get a feel for how the embeddings work. The similarity search and natural language search there are powered by that model.

While setting up these evaluations, we identified that there are no publicly available evaluation datasets for audio-point cloud embedding models. Therefore, we had to build one.

We sampled data for both modalities from the evaluation splits of public datasets: audio from AudioSet and point clouds from Objaverse-LVIS. As with the creation of our training sets, we automatically paired the two modalities through text captions using SOTA retrieval models. We then ran every pair through a human consensus check requiring three individual annotators to agree on each pair.


Figure 4: A consensus workflow requiring three annotators to agree.

This resulted in 1,775 unique audio items and 1,763 unique point cloud items.
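As a trivial illustration, the consensus filter amounts to keeping only the pairs on which all three annotators agreed (the data structure below is purely hypothetical):

```python
# Purely hypothetical data structure: per-pair votes from three annotators.
ratings = {
    ("audio_0001", "pc_0042"): ["match", "match", "match"],
    ("audio_0002", "pc_0007"): ["match", "no_match", "match"],
}

# Keep only pairs where all three annotators agreed on "match".
consensus_pairs = [
    pair for pair, votes in ratings.items()
    if len(votes) == 3 and set(votes) == {"match"}
]
print(consensus_pairs)  # [("audio_0001", "pc_0042")]
```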

Based on the annotations, we set up a zero-shot classification task for model evaluation. We do this by deriving 112 classes from the point clouds’ Objaverse-LVIS categories (an evaluation dataset where every point cloud object has a human-verified class).

We then define a zero-shot classification task as follows: we group the audio items by class and take the mean embedding of each class as the class representative. Each point cloud is then classified based on its embedding’s similarity to these class representatives.
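Here is a minimal sketch of that procedure in the point-cloud-from-audio direction, assuming precomputed, L2-normalized embeddings for the evaluation items plus a class id per item (file names are illustrative, and every class is assumed to have at least one audio item):

```python
# A minimal sketch of the zero-shot evaluation described above.
# Assumptions: embeddings are L2-normalized; class ids are integers in [0, 112).
import numpy as np

audio_embs = np.load("eval_audio_embs.npy")        # (N_audio, d) hypothetical file
audio_classes = np.load("eval_audio_classes.npy")  # (N_audio,)
pc_embs = np.load("eval_pc_embs.npy")              # (N_pc, d)
pc_classes = np.load("eval_pc_classes.npy")        # (N_pc,)

n_classes = 112

# Class representatives: the mean audio embedding of each class, re-normalized.
prototypes = np.stack([audio_embs[audio_classes == c].mean(axis=0) for c in range(n_classes)])
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

# Classify each point cloud by cosine similarity to the audio class representatives.
preds = (pc_embs @ prototypes.T).argmax(axis=1)
accuracy = (preds == pc_classes).mean()
print(f"point-cloud-from-audio zero-shot accuracy: {accuracy:.3f}")
```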

Swapping the roles of the modalities yields a classification task for audio. This way, we know whether our model can categorize audio based on point cloud references and vice versa. Both the evaluation code and the evaluation data can be downloaded with the rest of the dataset.

A simple baseline (and the headroom we see)

We trained a baseline retrieval model in two phases that mirror the dataset’s structure:

  • Pre-training: We optimized a standard InfoNCE (contrastive) loss over the 100M pairs (see the sketch after this list). Even with a conservative recipe, performance surpassed much of the related work on several cross-modal retrieval tasks.
  • Post-training: We fine-tuned on the human-rated subset, which delivered a meaningful jump on tasks like audio-point cloud and image-point cloud retrieval, the combinations where data does not naturally co-occur.
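For reference, here is a minimal PyTorch sketch of a symmetric InfoNCE loss of the kind used in pre-training, assuming a batch of already L2-normalized caption and item embeddings (the temperature value is illustrative):

```python
# A minimal sketch of a symmetric InfoNCE (contrastive) loss.
# Assumption: caption_embs and item_embs are L2-normalized and row i of each
# tensor belongs to the same matched pair.
import torch
import torch.nn.functional as F

def info_nce(caption_embs: torch.Tensor, item_embs: torch.Tensor, temperature: float = 0.07):
    # Pairwise cosine similarities; matched pairs sit on the diagonal.
    logits = caption_embs @ item_embs.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: caption-to-item and item-to-caption.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```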

Here is a summary of what the model achieves. We are quite proud that, in many cases, good data actually beats parameter scale:


Figure 5: Average model performance over 13 retrieval and zero-shot benchmarks. Notice how our model (in purple) is comparable in performance to a 4× larger model.

This baseline deliberately leaves well-known tricks on the table: we didn’t use attention over full token sequences (e.g., patch tokens for images, frame tokens for audio/video), quality-weighted losses, or extensive unfreezing of backbones. That’s good news for practitioners. The headroom is real.

Final thoughts

I hope this breakdown of how we actually built the dataset is interesting and helpful in your own work. The Encord ML team has spent months building this, and we’re excited to finally release it to ML practitioners around the world.

If you’d like to play around with the dataset in a demo environment, you can access it here (this is also where you can download the dataset).

And if you’d like more information on the dataset and the opportunity to ask questions, I’m running a live walkthrough on October 22, which you can sign up for here.
