Frequently Asked Questions

Question 1

Where can I download the E-MM1 dataset?

Accepted Answer

You can download the E-MM1 dataset using the email form at the top of this page. This will take you to an Encord environment, in which you can use Natural Language Search to explore a subset of the data. You can then download the entire dataset via the Download button in the Encord environment.

Question 2

What data is included in the E-MM1 dataset?

Accepted Answer

The E-MM1 dataset features 107M groups of data in five modalities: images, video, audio, text, and 3D point clouds. This includes:
• A pre-training pool (>100M 5-tuples) automatically assembled via state-of-the-art (SOTA) language retrieval across four modalities: audio, image, video, and 3D point clouds.
• A 1M post-training subset with diversity-aware sampling and human-rated match quality.
• A gold-standard evaluation set with five-rater consensus for zero-shot classification of 3D point clouds against audio files.

Question 3

How can I access the E-MM1 model?

Accepted Answer

You can download the partitions you need (pre-training / post-training / evaluation) from GitHub.

Question 4

What steps did you take to clean the data?

Accepted Answer

To make the dataset as easy to use as possible, we take several measures to clean the data:

Integrity checks: Corruption detection, duration/shape sanity, and deduplication.

Responsible content: NSFW filtering of data shown to human annotators.

Licensing: We report the licenses of all underlying datasets to give AI developers full visibility.

Leakage controls: Known public eval items are excluded from training partitions to protect benchmark validity.
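As an illustration of two of these measures (deduplication and leakage control), here is a minimal Python sketch. It is not the pipeline used to build E-MM1: the function names `content_hash`, `deduplicate`, and `exclude_eval_items` are hypothetical, and it assumes exact, byte-level matching via SHA-256, which will not catch near-duplicates such as re-encoded or resized files.

```python
import hashlib


def content_hash(payload: bytes) -> str:
    """Return a SHA-256 hex digest identifying a byte-identical payload."""
    return hashlib.sha256(payload).hexdigest()


def deduplicate(items):
    """Keep the first occurrence of each byte-identical payload.

    `items` is a list of (item_id, payload_bytes) pairs (a hypothetical layout).
    """
    seen = set()
    unique = []
    for item_id, payload in items:
        digest = content_hash(payload)
        if digest not in seen:
            seen.add(digest)
            unique.append((item_id, payload))
    return unique


def exclude_eval_items(train_items, eval_payloads):
    """Drop training items whose payload also appears in a known eval set."""
    eval_hashes = {content_hash(p) for p in eval_payloads}
    return [
        (item_id, payload)
        for item_id, payload in train_items
        if content_hash(payload) not in eval_hashes
    ]


if __name__ == "__main__":
    raw = [("a", b"clip-1"), ("b", b"clip-1"), ("c", b"clip-2")]
    train = exclude_eval_items(deduplicate(raw), eval_payloads=[b"clip-2"])
    print([item_id for item_id, _ in train])
```

Hashing the raw bytes keeps the check cheap and deterministic. A production pipeline would likely add perceptual or embedding-based matching on top of it, which this sketch does not attempt.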

Question 5

How does Encord help reduce the time & cost of labeling?

Accepted Answer

The best way to spend less on labeling is to use purpose-built annotation software, automation features, and active learning techniques.

Encord's platform provides several automation techniques, including model-assisted labeling and auto-segmentation. High-complexity use cases have seen a 60-80% reduction in labeling costs.

E-MM1: The World’s Largest Multimodal AI Dataset

