Contents

What are Open-Source Datasets?
How do Machine Learning and Computer Vision Projects Benefit from Open-source Datasets?
10 Open-Source Datasets for Machine Learning
Use Open-Source Datasets for Computer Vision Projects with Encord

Encord Blog

Top 10 Open Source Datasets for Machine Learning

June 27, 2023

5 mins

Written by

Akruti Acharya

Searching for a suitable dataset for your machine learning project can be time-consuming.

To save you time and help you get started, we have compiled a list of the top 10 open-source datasets, spanning tasks from image recognition to natural language processing.

Whether you are a beginner or a professional, this diverse list of datasets will support your machine learning work.

💡To learn more, read the Complete Guide to Open Source Data Annotation.

What are Open-Source Datasets?

Open-source datasets are publicly available datasets shared with few restrictions on usage or distribution. They are typically released under open licenses, which allow researchers, developers, and enthusiasts to freely access, use, and contribute to the data, although some licenses limit use to research or non-commercial purposes. This fosters collaboration, innovation, and the advancement of research and development in fields such as machine learning, computer vision, and natural language processing.

How do Machine Learning and Computer Vision Projects Benefit from Open-source Datasets?

Open-source datasets are valuable resources for machine learning and computer vision projects:

Easy access to a vast amount of data without financial constraints
Diverse samples for training robust and generalized models
Standardized benchmarks for fair evaluations
Promotion of collaboration and reproducibility among researchers
Encouragement of ethical considerations and community contributions

These benefits support researchers, expedite development, and foster creativity in the fields of machine learning and computer vision.

Scale your annotation workflows and power your model performance with data-driven insights

Try Encord today

10 Open-Source Datasets for Machine Learning

SA-1B Dataset

The SA-1B dataset consists of 11 million varied, high-resolution images along with 1.1 billion pixel-level mask annotations, making it suitable for training and evaluating advanced computer vision models.

This dataset was collected by Meta for their Segment Anything project, and the masks in the dataset were generated automatically by the Segment Anything Model (SAM). With the SA-1B dataset, researchers and developers can explore a wide range of applications in computer vision, including scene understanding, object recognition, instance segmentation, and image parsing. The dataset's rich annotations allow for detailed analysis and modeling of object boundaries, semantic regions, and fine-grained object attributes.

Research Paper: Segment Anything
Authors: Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick
Dataset Size: 11 million images, 1.1 billion masks, 1500×2250 image resolution
License: Limited; research purposes only
Access Links: Official webpage

💡To learn more about the Segment Anything project, read our explainer post.

VisualQA

The Visual Question Answering (VQA) dataset consists of roughly 265,000 images, drawn from COCO and from abstract scenes, with multiple questions and answers per image and an automatic evaluation metric. It challenges machine learning models to comprehend images and answer open-ended questions by combining vision, language, and common-sense knowledge.

This comprehensive dataset offers a wide range of applications, including image understanding, question generation, and multi-modal reasoning. With its large-scale and diverse nature, the VisualQA dataset provides a rich source of training and evaluation data for developing sophisticated AI algorithms. Researchers can leverage this dataset to enhance the capabilities of visual question-answering systems, allowing them to interpret images and accurately respond to human-generated questions.

Research Paper: VQA: Visual Question Answering
Authors: Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh
Dataset Size: 265,016 images
License: CC BY 4.0
Access Links: Official webpage, PyTorch Dataset Loader
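
The automatic evaluation metric mentioned above can be illustrated with a short sketch. It assumes the commonly cited simplified form of VQA accuracy, where an answer counts as fully correct if at least three human annotators gave it; the function name and the sample answers are hypothetical, and the official evaluation script applies additional answer normalization and averaging that are omitted here.

```python
from collections import Counter


def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(number of matching human answers / 3, 1).

    Assumption: only lowercasing and whitespace stripping are applied;
    the official script normalizes answers more thoroughly.
    """
    normalized = [answer.strip().lower() for answer in human_answers]
    matches = Counter(normalized)[predicted.strip().lower()]
    return min(matches / 3.0, 1.0)


# Hypothetical example: ten human answers collected for one question.
human = ["yes"] * 2 + ["no"] * 8
score_yes = vqa_accuracy("yes", human)  # two of ten annotators agree
score_no = vqa_accuracy("no", human)    # eight of ten annotators agree
print(score_yes, score_no)
```

Capping the score at 1 means a prediction is not rewarded further once it matches three annotators, which makes the metric tolerant of disagreement among humans.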

💡Read our explainer on ImageBIND, a new breakthrough for multimodal learning introduced by Meta.

ADE20K

The ADE20K dataset provides more than 20,000 diverse, densely annotated images that serve as a benchmark for developing computer vision models for semantic segmentation.

The dataset offers high-quality annotations and covers both object and stuff classes, providing a comprehensive representation of scenes. It was created by the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and is available for research and non-commercial use. The significance of the ADE20K dataset lies in its capacity to drive research in scene understanding, object recognition, and image parsing, leading to advancements in computer vision techniques and benefiting diverse applications like autonomous driving, object detection, and image analysis.

Research Paper: Scene Parsing Through ADE20K Dataset
Authors: Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, Antonio Torralba
Dataset Size: 25,574 training images, 2,000 validation images
License: BSD 3-Clause License
Access Links: Official webpage, PyTorch Dataset, Hugging Face
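
Semantic segmentation benchmarks are commonly scored with mean intersection-over-union (mIoU). The sketch below shows the idea on flat lists of per-pixel class ids; the function name and toy labels are hypothetical, and the exact protocol used by the ADE20K benchmark (for example, how ignored pixels are handled) may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target.

    pred and target are equal-length sequences of integer class ids,
    one per pixel (a flattened segmentation map).
    """
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(score)
```

For real images, the same computation is usually done with array operations over a confusion matrix rather than Python loops.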

YouTube-8M

YouTube-8M is a large-scale video dataset containing 7 million YouTube videos annotated with a wide range of visual and audio labels for various machine learning tasks.

YouTube-8M is a valuable resource for machine learning tasks, allowing researchers and developers to train and assess models for video understanding, action recognition, video summarization, and visual feature extraction.

Research Paper: YouTube-8M: A Large-Scale Video Classification Benchmark
Authors: Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan
Dataset Size: 7 million videos with 4,716 classes
License: CC BY 4.0
Access Links: Official webpage

Google’s Open Images

Google's Open Images is a publicly accessible dataset of roughly 9 million labeled images, offering a valuable resource for various computer vision tasks and research.

Google's Open Images is used for purposes such as object detection, image classification, and visual recognition. The dataset's importance lies in its comprehensive coverage and large scale, which enable researchers and developers to train and evaluate advanced AI models.

Research Paper: The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale
Authors: Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, Vittorio Ferrari
Dataset Size: Roughly 9 million images
License: CC BY 4.0
Access Links: Official webpage

MS COCO

MS COCO (Common Objects in Context) is a widely used large-scale dataset that contains 330,000 diverse images with rich annotations for tasks like object detection, segmentation, and captioning.

MS COCO (Common Objects in Context) is specifically designed for object detection, segmentation, and captioning tasks, offering detailed annotations for a wide range of objects in various real-world contexts. The dataset has become a benchmark for evaluating and advancing state-of-the-art models in visual understanding and has played a significant role in driving progress in the field of computer vision.

Research Paper: Microsoft COCO: Common Objects in Context
Authors: Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollar
Dataset Size: 330,000 images, 1.5 million object instances, 80 object categories, and 91 stuff categories
License: CC BY 4.0
Access Links: Official webpage, PyTorch, TensorFlow
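
COCO detection annotations are distributed as JSON with `images`, `annotations`, and `categories` lists, and bounding boxes stored as `[x, y, width, height]`. The sketch below groups boxes by image and converts them to corner coordinates. The tiny in-memory sample and the helper name are hypothetical stand-ins for a real annotation file, which would normally be loaded with `json.load`.

```python
from collections import defaultdict

# Hypothetical miniature annotation structure in the COCO layout.
coco = {
    "images": [
        {"id": 1, "file_name": "example.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "person"},
        {"id": 18, "name": "dog"},
    ],
    "annotations": [
        {"id": 100, "image_id": 1, "category_id": 18, "bbox": [10.0, 20.0, 200.0, 150.0]},
        {"id": 101, "image_id": 1, "category_id": 1, "bbox": [300.0, 50.0, 100.0, 300.0]},
    ],
}


def boxes_by_image(dataset):
    """Map image id -> list of (category name, (x_min, y_min, x_max, y_max))."""
    names = {cat["id"]: cat["name"] for cat in dataset["categories"]}
    grouped = defaultdict(list)
    for ann in dataset["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        grouped[ann["image_id"]].append((names[ann["category_id"]], (x, y, x + w, y + h)))
    return dict(grouped)


boxes = boxes_by_image(coco)
print(boxes[1])
```

Converting to corner coordinates up front is a common choice because many detection libraries expect that layout, but which format you need depends on your training framework.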

CT Medical Images

The CT Medical Images dataset is a small sample extracted from The Cancer Imaging Archive (TCIA). It consists of the middle slice of CT images that meet specific criteria regarding age, modality, and contrast tags.

The dataset is designed to train models to recognize image textures, statistical patterns, and highly correlated features associated with these characteristics. This can enable the development of straightforward tools for automatically flagging mislabeled images and identifying outliers that may indicate suspicious cases, inaccurate measurements, or inadequately calibrated machines.

Research Paper: The Cancer Genome Atlas Lung Adenocarcinoma Collection
Authors: Justin Kirby
Dataset Size: 475 series of images collected from 69 unique patients.
License: CC BY 3.0
Access Links: Kaggle

💡To find more healthcare datasets, read Top 10 Free Healthcare Datasets for Computer Vision.

Aff-Wild

The Aff-Wild dataset consists of 564 videos of around 2.8 million frames with 554 subjects and is designed for the task of emotion recognition using facial images.

Aff-Wild provides a diverse collection of facial images captured under various conditions, including different head poses, illumination conditions, and occlusions. The dataset serves as a valuable resource for developing and evaluating algorithms and models for emotion recognition, gesture recognition, and action unit detection in computer vision.

Research Paper: Deep Affect Prediction in-the-wild: Aff-Wild Database and Challenge, Deep Architectures, and Beyond
Authors: Dimitrios Kollias, Panagiotis Tzirakis, Mihalis A. Nicolaou, Athanasios Papaioannou, Guoying Zhao, Bjorn Schuller, Irene Kotsia, Stefanos Zafeiriou
Dataset Size: 564 videos of around 2.8 million frames with 554 subjects (326 of which are male and 228 female)
License: Non-commercial research purposes only
Access Links: Official Webpage

DensePose-COCO

DensePose-COCO consists of 50,000 images from the COCO dataset with dense human pose annotations for the people they contain, enabling a detailed understanding of the human body's pose and shape.

DensePose-COCO is an extension of the COCO dataset that provides dense human pose annotations, allowing precise mapping of body landmarks and segmentation. It serves as a benchmark for pose estimation and shape understanding in computer vision research.

Research Paper: DensePose: Dense Human Pose Estimation In The Wild
Authors: Rıza Alp Güler, Natalia Neverova, Iasonas Kokkinos
Dataset Size: 50,000 images from the COCO dataset, with annotations for more than 200,000 human instances
License: CC BY 4.0
Access Links: Official Webpage

💡To find more, read Top 8 Free Datasets for Human Pose Estimation in Computer Vision.

BDD100K

The BDD100K dataset is a large-scale diverse driving video dataset that contains over 100,000 videos.

The BDD100K dataset is a valuable asset for advancing autonomous driving research, computer vision algorithms, and robotics. It plays a crucial role in improving perception systems and facilitating the development of intelligent transportation systems. With its diverse and comprehensive annotations, BDD100K is utilized for various computer vision tasks, including object detection, instance segmentation, and scene understanding, empowering researchers and developers to push the boundaries of computer vision technology in autonomous driving and transportation.

Research Paper: BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning
Authors: Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, Trevor Darrell
Dataset Size: Over 100,000 driving videos (40 seconds each) collected from more than 50,000 rides, covering New York, the San Francisco Bay Area, and other regions
License: Mixed license
Access Links: Official Webpage

Use Open-Source Datasets for Computer Vision Projects with Encord

Encord enables easy one-command downloads of these open-source datasets and provides the flexibility to explore, analyze, and curate datasets tailored to your project's specific requirements. By utilizing the platform, you will streamline your data collection process and enhance the efficiency and effectiveness of your machine learning workflows.

To use open-source datasets in the Encord platform, simply follow these steps:

Download the open-source dataset through the access links listed above.
Install Encord Active using the following commands. For more information, refer to the documentation.

python3.9 -m venv ea-venv
source ea-venv/bin/activate
# within venv
pip install encord-active

Download a dataset with Encord Active using the following command.

# within venv
encord-active download

With those simple steps, you now have your dataset!

With Encord, you can accelerate the image and video labeling process of your machine learning project while also facilitating the analysis of your models.

Encord Annotate gives annotators a diverse set of annotation types tailored for various computer vision applications. Meanwhile, Encord Active equips machine learning practitioners with a comprehensive toolset for data analysis, labeling, and assessing model quality.

Sign up for a free trial of Encord: The Data Engine for AI Model Development, used by the world’s pioneering computer vision teams.
