THE ENCORD ML TEAM PRESENTS

E-MM1: The World’s Largest Multimodal AI Dataset

Encord has built a new, open-source dataset of images, video, text, audio, and point cloud embeddings for AI teams to use – more than 10x the size of previous multimodal datasets.

100M data groups of 5 modalities
1B data pairs (modality-to-modality associations)
1M annotations by human annotators

About the E-MM1 dataset

We have sourced, curated, and paired 107 million groups of data across five modalities. The full dataset is now available as open source for AI teams to use.
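One way to relate the group count to the headline figure of roughly 1B data pairs is to count unordered modality pairs within each group. This is an assumption for illustration only; the page does not state how pairs are counted, and the modality labels and function name below are hypothetical.

```python
from itertools import combinations

# Hypothetical labels for the five modalities named on this page.
MODALITIES = ["image", "video", "audio", "text", "point_cloud"]


def modality_pairs(modalities):
    """Return every unordered pair of distinct modalities."""
    return list(combinations(modalities, 2))


pairs = modality_pairs(MODALITIES)
pairs_per_group = len(pairs)  # 5 choose 2

# Assumption: each group contributes one pair per modality combination.
groups = 107_000_000
total_pairs = groups * pairs_per_group

print(pairs_per_group, total_pairs)
```

Under that assumption, each five-modality group yields 10 pairs, so 107 million groups give about 1.07 billion pairs, which is consistent with the rounded 1B figure.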

Frequently asked questions
  • You can download the E-MM1 dataset using the email form at the top of this page. This will take you to an Encord environment, in which you can use Natural Language Search to explore a subset of the data. You can then download the entire dataset via the Download button in the Encord environment.

  • The E-MM1 dataset features 107 million groups of data in five modalities: images, video, audio, text, and 3D point clouds. This includes:
    • A pre-training pool (>100M 5-tuples) automatically assembled via state-of-the-art language retrieval across four modalities: audio, image, video, and 3D point clouds.

    • A 1M post-training subset with diversity-aware sampling and human-rated match quality.

    • A gold-standard evaluation set with five-rater consensus for zero-shot classification of 3D point clouds against audio files.

  • You can download the partitions you need (pre-training / post-training / evaluation) from GitHub.

  • To make the dataset as easily usable as possible, we take multiple measures to clean the data.

    Integrity checks: Corruption detection, duration/shape sanity, and deduplication.

    Responsible content: NSFW filtering of data shown to human annotators.

    Licensing: We report the licenses of all underlying datasets to give AI developers full visibility.

    Leakage controls: Known public eval items are excluded from training partitions to protect benchmark validity.

  • The best way to spend less on labeling is to use purpose-built annotation software, automation features, and active learning techniques.

    Encord's platform provides several automation techniques, including model-assisted labeling and auto-segmentation. High-complexity use cases have seen a 60-80% reduction in labeling costs.
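For teams that want to apply similar hygiene to their own data, the integrity checks and leakage controls listed in the data-cleaning answer above could look roughly like the sketch below. It is not Encord's pipeline: it swaps in exact SHA-256 content hashing for deduplication, uses an empty-payload check as a stand-in for real corruption detection, and all function and variable names are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def clean_training_items(items, eval_hashes):
    """Filter (item_id, payload) pairs for a training partition.

    Drops empty payloads, exact duplicates, and any item whose hash
    appears in eval_hashes (known evaluation items).
    """
    seen = set()
    kept = []
    for item_id, payload in items:
        if not payload:
            # Stand-in for real corruption and duration/shape checks.
            continue
        digest = content_hash(payload)
        if digest in seen:
            # Exact duplicate of an item already kept.
            continue
        if digest in eval_hashes:
            # Leakage control: never train on known eval items.
            continue
        seen.add(digest)
        kept.append(item_id)
    return kept


# Toy data, made up for illustration.
eval_hashes = {content_hash(b"benchmark clip")}
items = [
    ("a", b"dog barking"),
    ("b", b"dog barking"),     # duplicate of "a"
    ("c", b""),                # empty payload
    ("d", b"benchmark clip"),  # known eval item
    ("e", b"street scene"),
]
kept = clean_training_items(items, eval_hashes)
print(kept)
```

Exact hashing only catches byte-identical files. Re-encoded or resized copies would need perceptual hashing or embedding-based near-duplicate detection, which this sketch does not attempt.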

Take control of your AI data pipeline with Encord

200+ AI teams deploy production AI faster with Encord. Transform petabytes of unstructured multimodal data into high quality data for training, fine-tuning, and aligning AI models.