AI Data Chats | Series: Practitioners

Tracking the Multimodal AI Data Supply Chain

In this episode of AI Data Chats, we are joined by Shayne Longpre of MIT, lead of the Data Provenance Initiative. Shayne and his team have audited thousands of pre-training datasets to trace how multimodal datasets are accessed and reused across the web. In our chat, Shayne explains why it matters to understand the origins of the “organic” human data that seeds the synthetic data proliferating across emerging datasets. He also shares results from the Data Provenance Initiative’s audits showing that the sources of video and audio datasets are far more concentrated in a handful of platforms than the sources of text datasets, and discusses what this concentration may mean for researchers’ and civil society’s access to video and audio data.

Speakers

Jennifer Ding

Solutions Engineer @ Encord

Shayne Longpre

MIT, lead of the Data Provenance Initiative

About AI Data Chats

Watch Encord’s ML Solutions Engineer Jennifer Ding interview key thought leaders in the AI data space.

AI Data Chat with Martine Wauben

Series: Practitioners

Announcing… the Data for London Library!

AI Data Chat with Hirsh Pithadia

Series: Practitioners

From “Context Stuffing” to Context Grounding for AI Systems

AI Data Chat with Shayne Longpre

Series: Practitioners

Tracking the Multimodal AI Data Supply Chain

AI Data Chat with Justin Zhao

Series: Researchers

When LLM as a Judge is not Good Enough—use a Language Model Council

AI Data Chat with Layla Hosseini-Gerami

Series: Researchers

Exploring the expanse of chemical data space to accelerate drug discovery

AI Data Chat with Andrea Cortoni

Series: Practitioners

Centering human-AI interaction in agentic workflows

AI Data Chat with Ben Burtenshaw

Series: Practitioners

Data-centric vibe testing for high performing models

AI Data Chat with Elena Simperl

Series: Researchers

Documenting ML Datasets for Responsible Reuse

AI Data Chat with Frederik Hvilshøj

Series: Researchers EP.1

Mixing datasets and data modalities from Robotics to Deepseek-R1