
Top 10 Multimodal Datasets

August 15, 2024
5 mins

Multimodal datasets are like the digital equivalent of our senses. Just as we use sight, sound, and touch to interpret the world, these datasets combine various data formats—text, images, audio, and video—to offer a richer understanding of content.

Think of it this way: if you tried to understand a movie just by reading the script, you'd miss out on the visual and auditory elements that make the story come alive. Multimodal datasets provide those missing pieces, allowing AI to catch subtleties and context that would be lost if it were limited to a single type of data. 

Another example is analyzing medical images alongside patient records. This approach can reveal patterns that might be missed if each type of data were examined separately, leading to breakthroughs in diagnosing diseases. It's like assembling multiple puzzle pieces to create a clearer, more comprehensive picture.

In this blog, we've gathered the best multimodal datasets with links to these data sources. These datasets are crucial for Multimodal Deep Learning, which requires integrating multiple data sources to enhance performance in tasks such as image captioning, sentiment analysis, medical diagnostics, video analysis, speech recognition, emotion recognition, autonomous vehicles, and cross-modal retrieval.

What is Multimodal Deep Learning?

Multimodal deep learning, a subfield of Machine Learning, involves using deep learning techniques to analyze and integrate data from multiple data sources and modalities such as text, images, audio, and video simultaneously. This approach uses the complementary information from different types of data to improve model performance, enabling tasks like enhanced image captioning, audio-visual speech recognition, and cross-modal retrieval.
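To make this concrete, here is a minimal late-fusion sketch in PyTorch: two small encoders map precomputed image and text features into a shared space, and a joint head makes the prediction. All dimensions and the class count are illustrative placeholders rather than values from any specific dataset or paper.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy late-fusion model: encode each modality separately, then fuse."""

    def __init__(self, img_dim=512, txt_dim=300, hidden=256, num_classes=5):
        super().__init__()
        self.img_encoder = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.txt_encoder = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU())
        # The fused (concatenated) representation feeds one shared head.
        self.head = nn.Linear(hidden * 2, num_classes)

    def forward(self, img_feats, txt_feats):
        fused = torch.cat([self.img_encoder(img_feats), self.txt_encoder(txt_feats)], dim=-1)
        return self.head(fused)

model = LateFusionClassifier()
logits = model(torch.randn(8, 512), torch.randn(8, 300))  # a batch of 8 image/text feature pairs
print(logits.shape)  # torch.Size([8, 5])
```

Late fusion is only one option; early fusion, cross-attention, or contrastive objectives (as in CLIP-style models) are common alternatives.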

Next-GPT: A Multimodal LLM

Benefits of Multimodal Datasets in Computer Vision

Multimodal datasets significantly enhance computer vision applications by providing richer and more contextual information. Here's how:   

  • By combining visual data with other modalities and data sources like text, audio, or depth information, models can achieve higher accuracy in tasks such as object detection, image classification, and image segmentation.   
  • Multimodal models are less susceptible to noise or variations in a single modality. For instance, combining visual and textual data can help in overcoming challenges like occlusions or ambiguous image content.
  • Multimodal datasets allow models to learn deeper semantic relationships between objects and their context. This enables more sophisticated tasks like visual question answering (VQA) and image generation.   
  • Multimodal datasets open up possibilities for novel applications in computer vision, large language models, augmented reality, robotics, text-to-image generation, VQA, NLP, and medical image analysis.
  • By integrating information from data sources of different modalities, models can better understand the context of visual data, leading to more intelligent and human-like large language models.

Top 10 Multimodal Datasets

Flickr30K Entities Dataset

The Flickr30K Entities dataset is an extension of the popular Flickr30K dataset, specifically designed to improve research in automatic image description and understand how language refers to objects in images. It provides more detailed annotations for image-text understanding tasks. 

The Flickr30K Entities dataset is built upon the Flickr30k dataset, which contains 31K+ images collected from Flickr. Each image is associated with five crowd-sourced captions describing its content. The dataset adds bounding box annotations for all entities (people, objects, etc.) mentioned in the image captions. 

Flickr30K Entities makes it possible to develop better vision-capable language models for image captioning, where the model can not only describe the image content but also pinpoint the location of the entities being described. It also supports improved grounded language understanding, a machine's ability to understand language in relation to the physical world.

Flickr30K Entities dataset

  • Research Paper: Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
  • Authors: Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik
  • Dataset Size: 31,783 real-world images, 158,915 captions (5 per image), approximately 275,000 bounding boxes, 44,518 unique entity instances.
  • Licence: The dataset typically follows the original Flickr30k dataset licence, which allows for research and academic use on non-commercial projects. However, you should verify the current licensing terms as they may have changed.
  • Access Links: Bryan A. Plummer Website
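The sentence annotations in Flickr30k Entities link caption phrases to entity IDs and types with a bracketed markup. Below is a rough parsing sketch under that assumption; the example line and regex are illustrative, so verify the exact format against the official release before relying on it.

```python
import re

# Illustrative caption line in the Flickr30k Entities "Sentences" format
# (assumed markup: [/EN#<entity_id>/<type> <phrase>] -- verify against the release).
line = "[/EN#283585/people Two young guys] with [/EN#283586/bodyparts shaggy hair] look at their hands."

PHRASE_RE = re.compile(r"\[/EN#(\d+)/(\S+)\s+([^\]]+)\]")

for entity_id, entity_type, phrase in PHRASE_RE.findall(line):
    print(entity_id, entity_type, phrase)
# 283585 people Two young guys
# 283586 bodyparts shaggy hair
```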

Visual Genome

The Visual Genome dataset is a multimodal dataset, bridging the gap between image content and textual descriptions. It offers a rich resource for researchers working in areas like image understanding, VQA, and multimodal learning. 

Visual Genome combines two modalities: visual, with over 108,000 images drawn from the MSCOCO dataset, and textual, where each image is extensively annotated with objects, relationships, region captions, and question-answer pairs.

The multimodal nature of this dataset enables deeper image understanding, identifying the meaning of and relationships between objects in a scene beyond simple object detection; VQA, answering questions that require reasoning about the visual content; and multimodal learning, training models on both visual and textual data.
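As a rough sketch of how the textual side can be consumed, the snippet below tallies object names from Visual Genome's objects.json file. The file name and field names ("objects", "names", and the x/y/w/h box fields) follow the commonly distributed release, but treat them as assumptions and check the version you download.

```python
import json
from collections import Counter

# objects.json is assumed to be a list of images, each carrying an "objects" list
# whose entries have one or more "names" plus a bounding box (x, y, w, h).
with open("objects.json") as f:
    images = json.load(f)

name_counts = Counter()
for image in images:
    for obj in image.get("objects", []):
        for name in obj.get("names", []):
            name_counts[name] += 1

print(name_counts.most_common(10))  # the most frequently annotated object names
```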

Visual Genome Dataset

MuSe-CaR 

MuSe-CaR (Multimodal Sentiment Analysis in Car Reviews) is a multimodal dataset specifically designed for studying sentiment analysis in the "in-the-wild" context of user-generated video reviews. 

MuSe-CaR combines three modalities (text, audio, and video) to understand sentiment in car reviews. The text modality is the spoken review, transcribed from the video recordings; the audio modality captures vocal qualities (like tone, pitch, and emphasis) that reveal emotional aspects beyond the words themselves; and the video modality provides facial expressions, gestures, and overall body language as additional cues to the reviewer's sentiment.

MuSe-CaR aims to advance research in multimodal sentiment analysis by providing a rich dataset for training and evaluating models capable of understanding complex human emotions and opinions expressed through various modalities.

MuSe-CaR Dataset

CLEVR

CLEVR, which stands for Compositional Language and Elementary Visual Reasoning, is a synthetic multimodal dataset designed to evaluate a machine learning model's ability to reason about the physical world using both visual information and natural language. It was created specifically to test AI systems on complex reasoning about visual scenes.

CLEVR combines two modalities, visual and textual. The visual modality comprises rendered 3D scenes containing various objects; each scene features a simple background and a set of objects with distinct properties like shape (cube, sphere, cylinder), size (large, small), color (gray, red, blue, etc.), and material (rubber, metal). The textual modality consists of natural language questions about the scene. These questions challenge models to not only "see" the objects but also understand their relationships and properties to answer accurately.

CLEVR is used for applications like visual reasoning in robotics and other domains that require understanding spatial relationships between objects in real time (e.g., "Which object is in front of the blue rubber cube?"), counting and comparison of objects with specific properties (e.g., "How many small spheres are there?"), and logical reasoning that combines the scene and the question to arrive at the correct answer, even when the answer isn't directly visible (e.g., "The rubber object is entirely behind a cube. What color is it?").
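To ground the counting example, here is a small sketch that answers "How many small spheres are there?" directly from a CLEVR scene annotation. The file name and field names follow the published scenes JSON, but treat them as assumptions to verify against your copy of the dataset.

```python
import json

# Each scene in the (assumed) CLEVR_val_scenes.json lists its objects with
# "shape", "size", "color", and "material" attributes.
with open("CLEVR_val_scenes.json") as f:
    scenes = json.load(f)["scenes"]

scene = scenes[0]
small_spheres = [
    obj for obj in scene["objects"]
    if obj["size"] == "small" and obj["shape"] == "sphere"
]
print(f"How many small spheres are there? {len(small_spheres)}")
```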

CLEVR Dataset

InternVid 

InternVid is a relatively new multimodal dataset designed for video understanding and video generation with generative models. It focuses on the video-text pairing, combining a large collection of videos of everyday scenes and activities with detailed captions describing the content, actions, and objects present in each video.

InternVid aims to support various video-related tasks such as video captioning, video understanding, video retrieval and video generation.

InternVid Dataset

MovieQA

MovieQA is a multimodal dataset designed specifically for the task of video question answering (VideoQA) using text and video information.

MovieQA combines three modalities: video, text, and question-answer pairs. The dataset consists of clips from various movies accompanied by subtitles or transcripts, providing textual descriptions of the spoken dialogue and on-screen actions.

Each video clip is paired with multiple questions that require understanding both the visual content of the video and the textual information from the subtitles/transcript to answer accurately.

MovieQA aims to evaluate how well a model can understand the actions, interactions, and events happening within a video clip. Models can use the subtitles or transcripts to complement their visual understanding and answer questions that require information from both modalities.

MovieQA Dataset

MSR-VTT

MSR-VTT, which stands for Microsoft Research Video to Text, is a large-scale multimodal dataset designed for training and evaluating models on the task of automatic video captioning. The primary focus of MSR-VTT is to train models that can automatically generate captions for unseen videos based on their visual content.

MSR-VTT combines two modalities: videos and text descriptions. The videos are a collection of web clips covering a diverse range of categories and activities, and each video is paired with multiple natural language captions describing the content, actions, and objects it contains.

MSR-VTT supports large-scale learning: the sheer volume of data allows models to learn robust video representations and generate more accurate, descriptive captions; videos from varied categories help models generalize to unseen content; and multiple captions per video provide a richer understanding of each clip.
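A typical first step with MSR-VTT is to group the reference captions by video ID from the annotation file. The sketch below assumes the widely circulated videodatainfo-style JSON with a "sentences" list of video_id/caption entries; check the file name and keys against your copy of the dataset.

```python
import json
from collections import defaultdict

# Assumed annotation layout: a "sentences" list whose entries pair a
# "video_id" with one reference "caption" (verify against your download).
with open("train_val_videodatainfo.json") as f:
    anno = json.load(f)

captions_by_video = defaultdict(list)
for sent in anno["sentences"]:
    captions_by_video[sent["video_id"]].append(sent["caption"])

print(len(captions_by_video), "videos with captions")
print(captions_by_video["video0"][:3])  # a few reference captions for one clip
```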

VoxCeleb2 

VoxCeleb2 is a large-scale multimodal dataset designed for speaker recognition and other audio-visual analysis tasks. It combines two modalities: audio, consisting of speech recordings from a wide range of individuals, and video, the corresponding clips of the speakers, which allow for the extraction of visual features.

VoxCeleb2 primarily focuses on speaker recognition, which involves identifying or verifying a speaker based on their voice. However, the audio-visual nature of the dataset also supports face recognition and audio-visual speaker verification.
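Speaker verification usually reduces to comparing fixed-length speaker embeddings. The sketch below shows only the cosine-similarity decision step; embed() is a hypothetical placeholder for whatever embedding model you train or download, and the threshold is illustrative rather than a recommended value.

```python
import numpy as np

def embed(wav_path: str) -> np.ndarray:
    """Hypothetical placeholder: return a fixed-length speaker embedding
    for the utterance at wav_path (e.g. from a model trained on VoxCeleb2)."""
    raise NotImplementedError

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(wav_a: str, wav_b: str, threshold: float = 0.7) -> bool:
    # Accept the pair as the same speaker when the embeddings are close enough;
    # in practice the threshold is tuned on a held-out verification set.
    return cosine_similarity(embed(wav_a), embed(wav_b)) >= threshold
```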


VoxCeleb2 Dataset

VaTeX 

VaTeX is a large-scale multimodal dataset designed specifically for research on video-and-language tasks. 

VaTeX combines two modalities: a collection of videos depicting various activities and scenes, and text descriptions of each video's content in both English and Chinese. Some caption pairs are parallel translations, allowing for video-guided machine translation research. 

VaTeX supports several research areas related to video and language such as multilingual video captioning to generate captions for videos in multiple languages, video-guided machine translation to improve the accuracy of machine translation, and  video understanding to analyze and understand the meaning of video content beyond simple object recognition.

VaTeX Dataset

WIT

WIT, which stands for Wikipedia-based Image Text, is a state-of-the-art, large-scale dataset designed for image-text retrieval and other multimedia learning tasks. 


WIT combines two modalities: images, a massive collection of unique images from Wikipedia, and text descriptions of each image extracted from the corresponding Wikipedia article. These descriptions provide information about the content depicted in the image.


WIT primarily focuses on tasks involving the relationship between images and their textual descriptions. Key applications include image-text retrieval (retrieving images from a text query), image captioning (generating captions for unseen images), and multilingual learning (understanding and connecting images to text descriptions in various languages).
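Image-text retrieval of the kind WIT is built for is often prototyped with a pretrained dual encoder such as CLIP. The sketch below scores a text query against a few candidate images using the Hugging Face transformers CLIP checkpoint; the image file names are hypothetical stand-ins for WIT images, and WIT itself is a dataset rather than a model.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Pretrained dual encoder; any CLIP-style checkpoint works for this sketch.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical local files standing in for WIT images.
images = [Image.open(p) for p in ["golden_gate.jpg", "eiffel_tower.jpg", "taj_mahal.jpg"]]
query = "a photograph of the Eiffel Tower at night"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text: similarity of the query against each candidate image.
scores = outputs.logits_per_text.softmax(dim=-1)
best = scores.argmax().item()
print(f"Best match: image {best} with score {scores[0, best]:.3f}")
```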

WIT Dataset example


Key Takeaways: Multimodal Datasets 

Multimodal datasets, which blend information from diverse data sources such as text, images, audio, and video, provide a more comprehensive representation of the world. This fusion allows AI models to decipher complex patterns and relationships, enhancing performance in tasks like image captioning, video understanding, and sentiment analysis. By encompassing diverse data aspects, multimodal datasets push the boundaries of artificial intelligence, fostering more human-like understanding and interaction with the world.

These datasets drive significant advancements across many fields, from superior image and video analysis to more effective human-computer interaction. As technology continues to advance, multimodal datasets will undoubtedly play a crucial role in shaping the future of AI. Embracing this evolution, we can look forward to smarter, more intuitive AI systems that better understand and interact with our multifaceted world.

Written by Nikolaj Buhl