AI Data Chats | Series: Practitioners

Tracking the Multimodal AI Data Supply Chain

In this episode of AI Data Chats, we are joined by Shayne Longpre of MIT, who leads the Data Provenance Initiative. Shayne and his team have audited thousands of pre-training datasets to trace how multimodal datasets are accessed and reused across the web. In our chat, Shayne explains why it matters to understand the origins of the “organic” human data that seeds the synthetic data proliferating across emerging datasets. He also shares results from the Data Provenance Initiative’s audits showing that the sources of video and audio datasets are far more concentrated in a handful of platforms than the sources of text datasets, and he discusses what this may mean for the access that researchers and civil society have to video and audio data.

Speakers

Jennifer Ding

Solutions Engineer @ Encord

Shayne Longpre

MIT, lead of the Data Provenance Initiative

About AI Data Chats

Watch Encord’s ML Solutions Engineer Jennifer Ding interview key thought leaders in the AI data space.

AI Data Chats

AI Data Chat with Martine Wauben

Series: Practitioners

Announcing… the Data for London Library!

AI Data Chat with Hirsh Pithadia

Series: Practitioners

From “Context Stuffing” to Context Grounding for AI Systems

AI Data Chat with Shayne Longpre

Series: Practitioners

Tracking the Multimodal AI Data Supply Chain

AI Data Chat with Justin Zhao

Series: Researchers

When LLM as a Judge is not Good Enough—use a Language Model Council

AI Data Chat with Layla Hosseini-Gerami

Series: Researchers

Exploring the expanse of chemical data space to accelerate drug discovery

AI Data Chat with Andrea Cortoni

Series: Practitioners

Centering human-AI interaction in agentic workflows

AI Data Chat with Ben Burtenshaw

Series: Practitioners

Data-centric vibe testing for high performing models

AI Data Chat with Elena Simperl

Series: Researchers

Documenting ML Datasets for Responsible Reuse

AI Data Chat with Frederik Hvilshøj

Series: Researchers EP.1

Mixing datasets and data modalities from Robotics to Deepseek-R1

Featured series

Explore the future of AI through expert-led conversations on data, deep learning, and real-world impact.

Deep Learning Leaders

Eric leads in-depth conversations with top AI leaders on the future of deep learning.

AI After Hours

Live, in-person talks with experts diving into the real-world impact and future of AI.

Research Reports

Explore expert guides and frameworks that help top AI teams tame data complexity and accelerate model innovation at scale.
