THE ENCORD ML TEAM PRESENTS

The World’s Largest Multimodal Dataset - for Generative AI

Encord has built a new, open-source dataset of images, video, text, audio, and point cloud embeddings for Generative AI teams to use – more than 10x the size of previous multimodal datasets.

6000 hours of human annotation

1 million annotations added by humans

About the E-MM1 dataset

We have sourced, curated and paired 107 million groups of data across five modalities, available open-source with Natural Language Search to help you find relevant data.

Explore five modalities of AI data in Encord

Watch: How we built the world’s largest multimodal AI dataset

Frederik Hvilshøj, Machine Learning Lead at Encord, explains the stages of dataset training and model development in this video series.

Frequently asked questions
  • You can download the E-MM1 dataset using the email form at the top of this page. This will take you to an Encord environment, in which you can use Natural Language Search to explore a subset of the data. You can then download the entire dataset via the Download button in the Encord environment.

  • The E-MM1 dataset features 107 million groups of data in five modalities: images, video, audio, text, and 3D point clouds. This includes:
    • A pre-training pool (>100M 5-tuples) automatically assembled via state-of-the-art language retrieval across four modalities: audio, image, video, and 3D point clouds.

    • A 1M post-training subset with diversity-aware sampling and human-rated match quality.

    • A gold-standard evaluation set with five-rater consensus for zero-shot classification of 3D point clouds against audio files.

  • You can download the partitions you need (pre-training / post-training / evaluation) from GitHub.

  • To make the dataset as easy to use as possible, we take several measures to clean the data:

    Integrity checks: Corruption detection, duration/shape sanity, and deduplication.

    Responsible content: NSFW filtering of data shown to human annotators.

    Licensing: We report the licenses of all underlying datasets to give AI developers full visibility.

    Leakage controls: Known public eval items are excluded from training partitions to protect benchmark validity.

  • The best way to spend less on labeling is to use purpose-built annotation software, automation features, and active learning techniques.

    Encord's platform provides several automation techniques, including model-assisted labeling and auto-segmentation. High-complexity use cases have seen a 60-80% reduction in labeling costs.
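
For readers who want a concrete picture of the integrity checks and leakage controls listed in the data-cleaning answer above, here is a minimal Python sketch. It is illustrative only and does not describe Encord's actual pipeline: the function names, the use of exact SHA-256 content hashes for deduplication, and the rule that an empty payload counts as corrupt are all assumptions made for this example.

```python
import hashlib


def content_hash(payload: bytes) -> str:
    """Hypothetical helper: fingerprint an item by its raw bytes."""
    return hashlib.sha256(payload).hexdigest()


def clean_training_pool(items, eval_hashes):
    """Return the ids of items that survive three simple checks.

    items: iterable of (item_id, payload_bytes) pairs.
    eval_hashes: set of content hashes of known public eval items.
    Assumption: exact-byte hashing stands in for real deduplication,
    which in practice would likely also catch near-duplicates.
    """
    seen = set()
    kept = []
    for item_id, payload in items:
        if not payload:
            # Integrity check (assumed rule): empty payload is corrupt.
            continue
        digest = content_hash(payload)
        if digest in eval_hashes:
            # Leakage control: keep eval items out of training data.
            continue
        if digest in seen:
            # Deduplication: drop exact repeats of an earlier item.
            continue
        seen.add(digest)
        kept.append(item_id)
    return kept


items = [
    ("a", b"cat"),
    ("b", b"cat"),   # duplicate of "a"
    ("c", b""),      # empty, treated as corrupt
    ("d", b"dog"),   # matches a known eval item
    ("e", b"bird"),
]
eval_hashes = {content_hash(b"dog")}
kept_ids = clean_training_pool(items, eval_hashes)
```

In this toy run only items "a" and "e" are kept. A production pipeline would add modality-specific checks such as duration and shape sanity, which are omitted here.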

Encord is the unified data layer for AI development

200+ AI teams deploy production AI faster with Encord. Transform petabytes of unstructured multimodal data into high-quality data for training, fine-tuning, and aligning AI models.