
AI Data Chats | Series: Practitioners
Tracking the Multimodal AI Data Supply Chain
In this episode of AI Data Chats, we are joined by Shayne Longpre of MIT, lead of the Data Provenance Initiative. Shayne and his team have audited thousands of pre-training datasets to trace the access and reuse of multimodal datasets across the web. In our chat, Shayne shares his thoughts on why it is important to understand the origins of the “organic” human data that seeds the synthetic data proliferating across emerging datasets. He also shares results from the Data Provenance Initiative’s audits, which show that the sources of video and audio datasets are concentrated in a handful of platforms to a much greater degree than the sources of text datasets, and discusses what this may imply for limits on researchers’ and civil society’s access to video and audio data.
Speakers

Jennifer Ding
Solutions Engineer @ Encord

Shayne Longpre
MIT, lead of the Data Provenance Initiative
About AI Data Chats
Watch Encord’s ML Solutions Engineer Jennifer Ding interview key thought leaders in the AI data space.







