
Google’s MediaPipe Framework: Deploy Computer Vision Pipelines with Ease [2024]

June 21, 2024
5 mins

In today's era of fierce competition, 89% of enterprise executives believe that machine learning (ML) and artificial intelligence (AI) are crucial for success. Consequently, the search for the most effective ML tools is more intense than ever.

One such tool is MediaPipe - an open-source ML solution by Google with multiple libraries and tools to help you quickly develop advanced ML models. You can use MediaPipe to build applications in various domains, such as:

  • Computer Vision (CV): Enabling machines to interpret and understand visual information from the world
  • Natural Language Processing (NLP): Allowing machines to understand, interpret, and generate human language
  • Generative Artificial Intelligence (Gen AI): Enabling machines to create new content, such as images, videos, or text

In this article, we will describe what MediaPipe is, explain its benefits, walk through its core functionalities, and explore its customization and integration capabilities to help you get started with the platform.

We will also compare MediaPipe with the OpenCV and TensorFlow.js frameworks to help you understand each platform's use case.

Curate Data for Computer Vision Pipelines with Encord

What is MediaPipe?

MediaPipe is an open-source platform developed by Google for rapidly building complex deep-learning models across various domains, including computer vision (CV), text, and audio processing. It offers two primary components:

1. MediaPipe Solutions

A higher-level set of tools designed to simplify the integration of on-device machine learning solutions into your applications. MediaPipe Solutions consists of:

  • MediaPipe Tasks: Pre-built libraries and APIs that enable easy deployment of specific machine learning models (e.g., face detection, object tracking).
  • MediaPipe Models: A collection of pre-trained and ready-to-run models for various tasks, providing a starting point for your projects.
  • MediaPipe Model Maker: A framework for customizing existing models or training new ones based on your specific data and requirements.
  • MediaPipe Studio: A web-based tool for evaluating and fine-tuning model performance, making it easier to optimize your ML solutions.

2. MediaPipe Framework 

A lower-level toolkit for building custom machine learning pipelines. It provides building blocks for constructing your own models and processing pipelines, offering more flexibility and control than MediaPipe Solutions. The source code for MediaPipe Framework is available in the MediaPipe GitHub repository.

If you're looking for a tool to evaluate your computer vision (CV) models, consider Encord Active. It offers a comprehensive suite of features to assess model performance, identify areas for improvement, and streamline your model development workflow.

Benefits of Using MediaPipe for Deployment

Only 22% of data scientists say their ideas make it to production, which makes the deployment outlook bleak. MediaPipe, however, streamlines deployment workflows through its easy-to-use libraries and frameworks.

Below is a list of the benefits MediaPipe Solutions offers developers building advanced machine learning (ML) models.

Accelerated Development

Instead of building models from scratch, MediaPipe's built-in solutions help experts develop complex models faster. It also lets you speed up processing with GPUs (Graphics Processing Units) and combine GPU- and CPU (Central Processing Unit)-based nodes, the computational units in a MediaPipe pipeline that process data in parallel.
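As an illustration, the MediaPipe Tasks Python API exposes a delegate option for running inference on the GPU. The snippet below is a minimal sketch, assuming a GPU-enabled build and the hand_landmarker.task model file used later in this article:

from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Request the GPU delegate; fall back to Delegate.CPU if no GPU is available.
base_options = python.BaseOptions(
    model_asset_path='hand_landmarker.task',
    delegate=python.BaseOptions.Delegate.GPU)
detector = vision.HandLandmarker.create_from_options(
    vision.HandLandmarkerOptions(base_options=base_options))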

Versatility Across Domains

Google MediaPipe's diverse solution range allows users to build models for multiple tasks and domains, including:

  • Healthcare: Pose estimation for patient monitoring and rehabilitation.
  • Augmented Reality: Face detection and tracking for interactive experiences.
  • Content Creation: Image and video segmentation for special effects.

MediaPipe can also handle text and audio classification tasks, build image generation frameworks, and perform large language model (LLM) inference using state-of-the-art Gemma 2B and 7B models.

Learn how Teton AI uses computer vision (pose estimation, detection, etc.) to prevent falls in care homes and hospitals in this case study.

Efficient On-Device Performance

MediaPipe allows you to develop low-latency, real-time ML models that run on local hardware. It also offers cross-platform compatibility, supporting Linux, macOS, iOS, and Android.
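For low-latency video applications, the Tasks API also supports a live-stream running mode that returns results through a callback instead of blocking the main loop. The snippet below is a minimal sketch, assuming the hand_landmarker.task model file used later in this article and a camera loop that supplies RGB frames:

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# The callback receives results asynchronously as frames are processed.
def print_result(result, output_image, timestamp_ms):
    print(f"{timestamp_ms} ms: {len(result.hand_landmarks)} hand(s) detected")

options = vision.HandLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path='hand_landmarker.task'),
    running_mode=vision.RunningMode.LIVE_STREAM,
    result_callback=print_result)

with vision.HandLandmarker.create_from_options(options) as landmarker:
    # Inside a camera loop, wrap each RGB frame and send it with a
    # monotonically increasing timestamp:
    # frame = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)
    # landmarker.detect_async(frame, timestamp_ms)
    pass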

Open-Source Community

Being an open-source framework, MediaPipe Solutions is a cost-effective option for small to medium enterprises (SMEs) that cannot afford large-scale platforms to develop and deploy ML applications.

Guide to Getting Started with MediaPipe

Users can build models with MediaPipe Solutions for Android and iOS mobile devices and ML-based web applications in JavaScript. They can also develop models in Python for their specific use cases.

The guide below will demonstrate how to install MediaPipe and develop a few models for computer vision applications using Python.

How to Install MediaPipe?

Installing MediaPipe through Python is straightforward. Run the following command in a command prompt to install the framework.

python -m pip install mediapipe
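You can then verify the installation by importing the package:

python -c "import mediapipe as mp; print('MediaPipe imported successfully')"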

MediaPipe Tutorials

Detailed tutorials for implementing multiple CV tasks are included in MediaPipe's comprehensive documentation.

The following sections briefly show two demos for implementing Hand Tracking and Pose Estimation models using MediaPipe.

Hand Tracking

Hand tracking, or hand landmarking, detects key points on an image of a human hand. The task helps render visual effects and detect hand gestures for multiple use cases. Users can run the model on a static image or a video stream.

[Image: Hand Landmarks]

MediaPipe uses the Hand Landmarker model to perform the detection task and outputs image and world coordinates for the detected hand landmarks.

The model contains a palm detection component and a landmark detector based on the Generative 3D Human Shape and Articulated Pose Model (GHUM). 

The palm detection component detects whether a hand is present in an image or video, and the landmark detector identifies relevant landmarks.

The following steps show how to use MediaPipe to implement a Hand Tracking model to detect hand landmarks in an image.

Step 1: First, import all the necessary libraries and packages:

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

Step 2: Next, download the Hand Landmarker model by running the following code:

!wget -q https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/1/hand_landmarker.task

Step 3: Now, create the Hand Landmarker object:

base_options = python.BaseOptions(model_asset_path='hand_landmarker.task')
options = vision.HandLandmarkerOptions(base_options=base_options, num_hands=2)
detector = vision.HandLandmarker.create_from_options(options)

Step 4: Load the image you want to use for detection. This example assumes a sample image saved as image.jpg in the working directory:

image = mp.Image.create_from_file("image.jpg")

Step 5: Apply the detector to detect landmarks in the image:

detection_result = detector.detect(image)
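The returned detection_result holds one entry per detected hand: a list of 21 normalized landmarks and a handedness category. A quick way to inspect it before plotting:

# Print the handedness label and wrist landmark (index 0) for each detected hand.
for hand_landmarks, handedness in zip(detection_result.hand_landmarks,
                                      detection_result.handedness):
    print(handedness[0].category_name, len(hand_landmarks), "landmarks")
    print(hand_landmarks[0])  # normalized x, y, z of the wrist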

Step 6: Run the following code to create a function that visualizes the results:

from mediapipe import solutions
from mediapipe.framework.formats import landmark_pb2
import numpy as np
import cv2  # needed for drawing the handedness label

MARGIN = 10  # pixels
FONT_SIZE = 1
FONT_THICKNESS = 1
HANDEDNESS_TEXT_COLOR = (88, 205, 54)  # vibrant green


def draw_landmarks_on_image(rgb_image, detection_result):
  hand_landmarks_list = detection_result.hand_landmarks
  handedness_list = detection_result.handedness
  annotated_image = np.copy(rgb_image)

  # Loop through the detected hands to visualize.
  for idx in range(len(hand_landmarks_list)):
    hand_landmarks = hand_landmarks_list[idx]
    handedness = handedness_list[idx]

    # Draw the hand landmarks.
    hand_landmarks_proto = landmark_pb2.NormalizedLandmarkList()
    hand_landmarks_proto.landmark.extend([
      landmark_pb2.NormalizedLandmark(x=landmark.x, y=landmark.y, z=landmark.z) for landmark in hand_landmarks])
    solutions.drawing_utils.draw_landmarks(
      annotated_image,
      hand_landmarks_proto,
      solutions.hands.HAND_CONNECTIONS,
      solutions.drawing_styles.get_default_hand_landmarks_style(),
      solutions.drawing_styles.get_default_hand_connections_style())

    # Get the top left corner of the detected hand's bounding box.
    height, width, _ = annotated_image.shape
    x_coordinates = [landmark.x for landmark in hand_landmarks]
    y_coordinates = [landmark.y for landmark in hand_landmarks]
    text_x = int(min(x_coordinates) * width)
    text_y = int(min(y_coordinates) * height) - MARGIN

    # Draw handedness (left or right hand) on the image.
    cv2.putText(annotated_image, f"{handedness[0].category_name}",
                (text_x, text_y), cv2.FONT_HERSHEY_DUPLEX,
                FONT_SIZE, HANDEDNESS_TEXT_COLOR, FONT_THICKNESS, cv2.LINE_AA)

  return annotated_image

Step 7: Visualize the results:

from google.colab.patches import cv2_imshow  # Colab-friendly replacement for cv2.imshow

annotated_image = draw_landmarks_on_image(image.numpy_view(), detection_result)
cv2_imshow(cv2.cvtColor(annotated_image, cv2.COLOR_RGB2BGR))

Step 8: You should see the following image:

[Image: Hand Landmark Detection Result]

Pose Estimation

Pose estimation involves landmarking multiple human poses in an image or a video. The task uses models that can track body locations and label particular movements.

[Image: Pose Estimation]

Like Hand Landmarker, pose estimation uses a Pose LandMarker model bundle consisting of a detection and a landmarker module. The base models include a convolutional neural network (CNN) similar to MobileNetV2 and the GHUM algorithm to estimate the body pose in 3D.

The pipeline to implement the pose estimation model includes steps similar to those for the Hand Landmarker model.

Step 1: Load the relevant model:

!wget -O pose_landmarker.task -q https://storage.googleapis.com/mediapipe-models/pose_landmarker/pose_landmarker_heavy/float16/1/pose_landmarker_heavy.task

Step 2: Load and view the test image:

!wget -q -O image.jpg https://cdn.pixabay.com/photo/2019/03/12/20/39/girl-4051811_960_720.jpg

import cv2
from google.colab.patches import cv2_imshow
img = cv2.imread("image.jpg")
cv2_imshow(img)

Step 3: Import the relevant libraries and create the PoseLandmarker object:

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

base_options = python.BaseOptions(model_asset_path='pose_landmarker.task')
options = vision.PoseLandmarkerOptions(
    base_options=base_options,
    output_segmentation_masks=True)
detector = vision.PoseLandmarker.create_from_options(options)

Step 4: Load the image and run the detection model:

image = mp.Image.create_from_file("image.jpg")
detection_result = detector.detect(image)

Step 5: Create the visualization function and run it to view results:

from mediapipe import solutions
from mediapipe.framework.formats import landmark_pb2
import numpy as np

def draw_landmarks_on_image(rgb_image, detection_result):
  pose_landmarks_list = detection_result.pose_landmarks
  annotated_image = np.copy(rgb_image)

  # Loop through the detected poses to visualize.
  for idx in range(len(pose_landmarks_list)):
    pose_landmarks = pose_landmarks_list[idx]

    # Draw the pose landmarks.
    pose_landmarks_proto = landmark_pb2.NormalizedLandmarkList()
    pose_landmarks_proto.landmark.extend([
      landmark_pb2.NormalizedLandmark(x=landmark.x, y=landmark.y, z=landmark.z) for landmark in pose_landmarks])
    solutions.drawing_utils.draw_landmarks(
      annotated_image,
      pose_landmarks_proto,
      solutions.pose.POSE_CONNECTIONS,
      solutions.drawing_styles.get_default_pose_landmarks_style())

  return annotated_image


annotated_image = draw_landmarks_on_image(image.numpy_view(), detection_result)
cv2_imshow(cv2.cvtColor(annotated_image, cv2.COLOR_RGB2BGR))
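Because output_segmentation_masks=True was set when creating the detector in Step 3, detection_result also carries a person segmentation mask. A minimal sketch to visualize it with the same Colab helpers:

# The mask is a single-channel float image; repeat it across three channels
# so cv2_imshow can display it.
segmentation_mask = detection_result.segmentation_masks[0].numpy_view()
visualized_mask = np.repeat(segmentation_mask[:, :, np.newaxis], 3, axis=2) * 255
cv2_imshow(visualized_mask)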

[Image: Output image]

Recommended: Want to know more about pose estimation? Learn more in our complete guide to human pose estimation for computer vision.
 

Pre-built Building Blocks for Common Computer Vision Tasks

The examples above show how to run hand tracking and pose estimation pipelines. However, MediaPipe includes additional templates and models for other useful CV tasks.

The following sections briefly review the tasks you can perform using MediaPipe.

Image Classification

Image classification generates labels describing what an image contains. MediaPipe's classification models include EfficientNet-Lite0 and EfficientNet-Lite2, both trained on 1,000 classes from ImageNet.

EfficientNet-Lite2 is heavier than EfficientNet-Lite0 and suitable for tasks requiring higher accuracy.

[Image: Image classification example]

Users can control the regions of interest for classification and configure the language for labels. They can also specify the classes for categorization and the number of classification results.
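Running one of these classifiers follows the same pattern as the hand tracking tutorial above. The snippet below is a minimal sketch, assuming a downloaded EfficientNet-Lite0 model file (named efficientnet_lite0.tflite here for illustration) and a local image.jpg:

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

base_options = python.BaseOptions(model_asset_path='efficientnet_lite0.tflite')
options = vision.ImageClassifierOptions(base_options=base_options, max_results=3)
classifier = vision.ImageClassifier.create_from_options(options)

# Classify a single image and print the top categories with their scores.
classification_result = classifier.classify(mp.Image.create_from_file("image.jpg"))
for category in classification_result.classifications[0].categories:
    print(category.category_name, round(category.score, 3))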

Object Detection

Object detection locates and identifies objects from multiple classes within a single image.

[Image: Object Detection example]

MediaPipe offers three pre-trained detection models: EfficientDet-Lite0, EfficientDet-Lite2, and a Single Shot Detector (SSD) MobileNetV2 model, all trained on the COCO image dataset.
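You can run any of these detectors through the object detector task. A minimal sketch, assuming a downloaded EfficientDet-Lite0 model file (named efficientdet_lite0.tflite here for illustration) and a local image.jpg:

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

base_options = python.BaseOptions(model_asset_path='efficientdet_lite0.tflite')
options = vision.ObjectDetectorOptions(base_options=base_options, score_threshold=0.5)
detector = vision.ObjectDetector.create_from_options(options)

# Each detection carries a bounding box and one or more scored categories.
detection_result = detector.detect(mp.Image.create_from_file("image.jpg"))
for detection in detection_result.detections:
    bbox = detection.bounding_box
    print(detection.categories[0].category_name,
          (bbox.origin_x, bbox.origin_y, bbox.width, bbox.height))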

Image Segmentation

Image segmentation divides an image into regions based on specific criteria.

[Image: Segmentation example]

MediaPipe’s segmentation models allow you to segment a person’s face, background, hair, clothing, skin, and accessories.

The framework offers four models:

  • Selfie Segmentation Model for segmenting a person from the background.
  • Hair Segmentation Model for segmenting a person’s hair from the background.
  • Multi-class Selfie Segmentation Model for segmenting a person’s hair, clothes, skin, accessories, and background.
  • DeepLab-v3 Model for segmenting other items, including cats, dogs, and potted plants.
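For example, the multi-class selfie segmentation model above can be run with the image segmenter task. The snippet below is a minimal sketch, assuming a downloaded model file (named selfie_multiclass.tflite here for illustration) and a local image.jpg:

import numpy as np
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

base_options = python.BaseOptions(model_asset_path='selfie_multiclass.tflite')
options = vision.ImageSegmenterOptions(base_options=base_options,
                                       output_category_mask=True)
segmenter = vision.ImageSegmenter.create_from_options(options)

# The category mask assigns one class index (background, hair, skin, ...) per pixel.
segmentation_result = segmenter.segment(mp.Image.create_from_file("image.jpg"))
category_mask = segmentation_result.category_mask.numpy_view()
print(np.unique(category_mask))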

Face Mesh

The face landmark detector in MediaPipe lets you identify facial expressions and landmarks.

[Image: Facial Landmark Detection example]

The models produce a face mesh with corresponding blendshape scores: the mesh provides the landmark coordinates, while the scores describe facial expressions.

The model bundle includes a face detection model, a face mesh model, and a blendshape prediction model. The blendshape model predicts 52 scores for different facial expressions.

Gesture Recognition

Gesture recognition involves identifying hand gestures, such as a thumbs up or down, in an image or video.

[Image: Gesture Recognition example]

MediaPipe uses a pre-built gesture recognizer and hand landmarker model to identify gestures and detect landmarks.

Image Embedding

Image embeddings are vector representations of images. These representations make it possible to compute similarity metrics that measure how alike two images are.

[Image: Image embedding example]

MediaPipe generates embeddings using a MobileNetV3 model trained on ImageNet data. The model offers an acceptable accuracy-latency trade-off, and the task computes cosine similarity between embeddings.
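A minimal sketch of comparing two images with the image embedder task, assuming a downloaded MobileNetV3 embedder model (named mobilenet_v3_small.tflite here for illustration) and two local images:

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

base_options = python.BaseOptions(model_asset_path='mobilenet_v3_small.tflite')
options = vision.ImageEmbedderOptions(base_options=base_options, l2_normalize=True)
embedder = vision.ImageEmbedder.create_from_options(options)

# Embed both images and compare them; values closer to 1.0 mean more similar images.
first = embedder.embed(mp.Image.create_from_file("image_1.jpg"))
second = embedder.embed(mp.Image.create_from_file("image_2.jpg"))
similarity = vision.ImageEmbedder.cosine_similarity(first.embeddings[0],
                                                    second.embeddings[0])
print(similarity)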

MediaPipe Framework: Deployment and Integration

MediaPipe integrates with various programming languages and frameworks, which makes it adaptable for different deployment scenarios. Key integrations include:

  • Java: Integrate MediaPipe into Android applications for on-device machine learning.
  • C++: Utilize MediaPipe's core functionalities for high-performance and customizable pipelines.
  • JavaScript: Deploy MediaPipe models in web browsers for interactive experiences.
  • TensorFlow Lite: Leverage the power of TensorFlow Lite for optimized on-device inference.
  • OpenCV: Access a vast library of computer vision algorithms for tasks like image preprocessing and feature extraction.

For instance, you can use TensorFlow Lite to run a MediaPipe hand-tracking model on a smartphone (Android or iOS) or integrate MediaPipe with OpenCV in a C++ application for real-time object detection.

MediaPipe with Custom Model Pipeline

MediaPipe offers Model Maker, a tool that simplifies customizing pre-trained models to your specific needs. It employs transfer learning, a technique where a model trained on a large dataset is fine-tuned on a smaller, more specific dataset. This allows you to adapt existing models to your unique tasks without training a model from scratch.
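As an illustration, the Model Maker workflow for customizing an image classifier looks roughly like the sketch below. It assumes the mediapipe-model-maker package is installed and that your training images sit in a folder with one sub-folder per label; names and paths are placeholders:

from mediapipe_model_maker import image_classifier

# Load labeled images (one sub-folder per class) and hold out part for validation.
data = image_classifier.Dataset.from_folder("path/to/labeled_images")
train_data, validation_data = data.split(0.8)

options = image_classifier.ImageClassifierOptions(
    supported_model=image_classifier.SupportedModels.MOBILENET_V2,
    hparams=image_classifier.HParams(export_dir="exported_model"))

# Transfer learning: fine-tune the pre-trained backbone on the new data.
model = image_classifier.ImageClassifier.create(
    train_data=train_data, validation_data=validation_data, options=options)
loss, accuracy = model.evaluate(validation_data)
model.export_model()  # writes a .tflite model you can load with MediaPipe Tasks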

Limitations of Customization

  • Task-Specific Fine-Tuning: While you can customize a model with new data, it's important to note that this customization is limited to the original task the model was designed for. For instance, you can fine-tune a face detection model to recognize specific individuals, but you cannot transform it into a model that detects cars.
  • Performance on New vs. Old Data: After fine-tuning, the model will perform best on the new data it was trained on. It may not generalize as well to the original data used for pre-training.

Custom Pipelines with MediaPipe Framework

For greater flexibility and control, MediaPipe Framework allows you to build custom machine learning pipelines from scratch. This is particularly useful for complex tasks or when you need to integrate custom models or algorithms. 

You can develop custom Android, iOS, C++, and Python pipelines, giving you various deployment options.

Evaluate your models and build active learning pipelines with Encord

MediaPipe Framework Comparison: MediaPipe, OpenCV, TensorFlow.js

While MediaPipe is a robust platform for quickly building ML models, other frameworks offer comprehensive features for creating CV solutions.

The following sections compare MediaPipe with other popular platforms: OpenCV and TensorFlow.js.

MediaPipe

  • Benefits: MediaPipe is an easy-to-use ML framework that lets you quickly build models for basic CV tasks.
  • Limitations: Customizability is limited, as it only allows you to fine-tune existing models on new datasets for domain-specific use cases.
  • Best for: Beginners who want to develop and integrate ML models in mobile and web applications.

OpenCV

  • Benefits: OpenCV is a large-scale CV library with more than 2,500 algorithms that let you build complex CV systems. These include object tracking, extraction of 3D object models, and generation of point-cloud data from stereo cameras.
  • Limitations: OpenCV has a steep learning curve because its APIs are lower-level and less beginner-friendly than MediaPipe's. Unlike MediaPipe, it does not offer built-in LLM inference or text and audio classification.
  • Best for: Large businesses that want to build scalable CV frameworks for industrial use cases.

TensorFlow.js

  • Benefits: TensorFlow.js is a web ML library that allows developers to build models in the JavaScript environment. It offers versatile APIs for developing models in JavaScript and for re-training or running existing models.
  • Limitations: Since TensorFlow.js runs within the JavaScript ecosystem, it has limited computational power compared to environments with direct GPU access. It is also unsuitable for training large-scale models that require extensive datasets.
  • Best for: Developers who want to add lightweight ML functionality into web-based applications.

Google MediaPipe: Key Takeaways

Google MediaPipe is a versatile tool for quick ML model development and deployment. Below are a few key points to remember regarding MediaPipe.

  1. MediaPipe Solutions: MediaPipe Solutions is a platform that offers ready-to-use models for performing multiple computer vision (CV) tasks. It also helps with text and audio classification.
  2. MediaPipe Benefits: MediaPipe's most significant benefit is that it lets users quickly deploy and integrate ML models into applications. Additionally, it is open source, which makes it cost-effective for small businesses.
  3. MediaPipe CV Tasks: MediaPipe supports image classification, object detection, hand and gesture recognition, face detection, image embeddings, and pose estimation.
Written by Nikolaj Buhl
Frequently asked questions
  • MediaPipe is Google’s cross-platform machine learning (ML) development and deployment framework.

  • MediaPipe allows users to build models for multiple CV tasks and customize, test, and evaluate ML solutions.

  • MediaPipe offers ready-to-use models that users can quickly deploy in their applications.

  • MediaPipe’s pre-built solutions include hand and gesture recognition models, face detection, pose estimation, image classification, and object detection. It also offers frameworks for text and audio classification.

  • OpenCV is a low-level framework designed specifically for building advanced CV systems. MediaPipe is a lightweight solution that can implement basic CV models quickly.

  • MediaPipe supports Android, iOS, Linux, and macOS.

  • MediaPipe's model customizability is limited. Users can only fine-tune a pre-trained model on new data to perform the same task. It is not suitable for building complex CV models for entirely different domains.