Back to Blogs

How to Use GPT-4o for Model Development with Encord

May 17, 2024
5 mins
blog image

How to Use GPT-4o for Model Development with Encord

This spring there has been a wave of releases of multimodal models from the tech giants. OpenAI didn’t shy away from the challenge and released GPT-4o in the same week Google and Anthropic released their recent models. 

In this blog, we will look at how GPT-4o can be used in the model development process and discuss how Encord uses GPT-4o in the data curation pipeline.

Whether you're a developer, business owner, or researcher, GPT-4o offers unparalleled opportunities to accelerate your AI projects.

From scaling to enhancing your model development with data-driven insights
medical banner

GPT-4o Highlights

With the release of GPT-4o or GPT-omni, OpenAI calls this a step towards natural human-computer interaction. Here are the highlights of the recent model of the GPT family:

Multimodal Capabilites

GPT-4o accepts any combinations of text, audio, and image inputs. It can generate corresponding outputs in any of the three modalities. It also shows improvement in its vision and audio comprehension compared to the previous models of the GPT family.  Its advanced processing capabilities enable it to better interpret and generate content from images and audio, making it a powerful tool for applications requiring high-quality visual and auditory analysis.

Increased Speed and Efficiency

GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average response time of 320 milliseconds, closely mirroring human conversational speed. This significant enhancement over previous models ensures more fluid and interactive experiences. GPT-4o not only surpasses its predecessors in speed but also reduces API costs by 50%, making it a highly economical option for developers and businesses.

Enhanced Language and Code Proficiency

GPT-4o performs on par with GPT-4 Turbo in English text and code comprehension. However, it truly shines in non-English languages. Developers can leverage GPT-4o across more than 50 languages using the API. Whether you’re building applications in English, Spanish, Mandarin, or any other language, GPT-4o has you covered.

End-to-End Model Integration

Unlike previous iterations that relied on separate models for transcription (Whisper), processing (GPT-3.5 or GPT-4), and audio output, GPT-4o is trained end-to-end across all modalities. This integration allows the model to maintain context, recognizing tone, multiple speakers, and background noises while also being capable of outputting laughter, singing, and emotional expressions.

GPT-4o vs other SOTA Multimodals

blog image

Text Evaluation. Source

The native multimodality of GPT-4o allows for a more comprehensive and natural interaction between the user and the AI. GPT-4o is not just an incremental upgrade over its predecessors; it introduces several new features that enhance its performance, particularly in multilingual, audio, and vision capabilities. It also boasts faster response times, with an architecture optimized for generating tokens up to twice as fast as the previous model, GPT-4 Turbo.

When compared to other state-of-the-art (SOTA) multimodal models like Gemini 1.5 Pro and Claude 3 Opus, GPT-4o stands out for its high intelligence and ability to handle complex tasks. For instance, it matches the performance of GPT-4 Turbo in text, reasoning, and coding intelligence but sets new benchmarks in its ability to process and integrate information from different modalities. 

blog image

Vision Understanding Evals. Source

However, its reach may be limited due to its availability only on platforms like Google Cloud’s Vertex AI in a limited preview. The performance comparison across various evaluation sets shows that GPT-4o considerably outperforms existing large language models and most SOTA models, which may include benchmark-specific crafting or additional training protocols.


Usecases of GPT-4o for Model Development

The demo by OpenAI showed many real-world applications of GPT-4o like tutoring, assisting developers in collaborative coding, voice-driven data visualization, and many more. By providing free access to GPT-4o, we can certainly see its integration in many industries. 

Here are a few ways you can use GPT-4o in your model development pipeline:

Automated Code Generation

GPT-4o’s high performance in code-related tasks can assist in automated code generation, bug detection, and code optimization. It can serve as an intelligent assistant for developers, enhancing productivity. For example, creating an integrated development environment (IDE) plugin that provides real-time code suggestions, debugging tips, and documentation generation.

Integrating with Existing Models

GPT-4o can be integrated with platforms like Encord to streamline workflows and enhance collaboration. Its multimodal capabilities complement Encord’s features, enabling more efficient model development and deployment. For example, while using Encord to curate datasets, you can use GPT-4o to preprocess data, and tweak or validate labels, resulting in a more efficient development pipeline.

Image Analysis

GPT-4o’s proficiency in handling image data can be utilized for image recognition, classification, and analysis tasks. This is particularly useful in fields such as healthcare for medical imaging or security for facial recognition. For example, developing a diagnostic tool that can analyze medical images to detect conditions like tumors or fractures.

Using GPT-4o for Data Curation with Encord

With release of GPT-4o, Encord has integrated it in, and it's ready to use in its annotation platform. With the feature of custom agents of Encord, you can add any other multimodal model like Gemini or Claude into your data curation pipeline. Whether it's generating contextual labels, enriching metadata, or automating annotation tasks, the integration of multimodals through Encord's custom agents opens up possibilities for streamlining your data curation efforts.

Encord AI Annotation Agents test with GPT4o results.

Here are few ways you can use the GPT-4o or other multimodal models for your data curation process:


You can automatically classify the data into predefined categories or labels based on the ontology. By analyzing text, images, and other modalities, these models can intelligently categorize data, reducing manual effort and ensuring consistency in classification.

Image Transcription

If there is text in the image, you can either transcribe the image or use the multimodal model to generate additional information, context, or metadata.  This enrichment enhances the value of curated datasets, providing deeper insights and facilitating more comprehensive analyses.

Data Cleaning

GPT-4o's natural language processing capabilities enable it to sift through vast datasets, identifying and rectifying inconsistencies, errors, and outliers. This automated data cleaning process not only saves time but also ensures the integrity and quality of the curated data.

light-callout-cta Watch the webinar Using GPT4o to accelerate your model development. It shows how to integrate GPT-4o into your data pipelines.

sideBlogCtaBannerMobileBGencord logo

Power your AI models with the right data

Automate your data curation, annotation and label validation workflows.

Try Encord for Free
Written by

Akruti Acharya

View more posts
Frequently asked questions
  • GPT-4o, or GPT-omni, is OpenAI's latest multimodal model, excelling in natural language understanding and processing text, audio, and images.

  • GPT-4o is invaluable for automated customer support, content generation, personalized recommendations, and enhancing data analysis in industries like e-commerce, healthcare, and finance.