
Encord Monthly Wrap: February Industry Newsletter

March 8, 2024
10 mins

Hi there,

Welcome to The Computer Vision Monthly Wrap

Here’s what you should expect:

📦 YOLOv9 release with an explainer and code walkthrough on creating custom datasets.

📸 Meta’s V-JEPA for predicting video features.

📽️ Understanding Sora, OpenAI’s text-to-video model.

⚒️ Developer resources to learn how to analyze object detection model errors.

☁️ Computer vision case study from NVIDIA and Oracle.

🚀 Lessons from working with computer vision operations (CVOps) at scale.

Let’s dive in!

Top Picks for Computer Vision Papers This Month

YOLOv9: Better than SoTA with Cutting-edge Real-time Object Detection

If you haven’t heard yet, YOLOv9 is out, and, wow, it’s a high-performing model! YOLOv9 builds upon previous versions, using advancements in deep learning techniques and architectural design to beat state-of-the-art (SoTA) models on object detection tasks.

What’s impressive? 🤯

  • It achieves top performance in object detection tasks on benchmark datasets like MS COCO. It surpasses existing real-time object detectors (YOLOv6, YOLOv8) in terms of accuracy, speed, and overall performance.
  • It is much more adaptable to different scenarios and use cases. We have started seeing various applications, including surveillance, autonomous vehicles, robotics, and more.
  • It outperforms SoTA methods that rely on depth-wise convolution by combining Programmable Gradient Information (PGI) with the Generalized Efficient Layer Aggregation Network (GELAN) architecture.

Read the paper on arXiv. If that’s a lot, we also put out an explainer that gets you to the important bits quickly, with a walkthrough on using the open-source YOLOv9 release to create custom datasets.

Comparison of YOLOv9 with SOTA Model

There’s also an accompanying repository for the implementation of the paper.

Meta’s V-JEPA: Video Joint Embedding Predictive Architecture Explained

In February, Meta released V-JEPA, a vision model exclusively trained using a feature prediction objective. In contrast to conventional machine learning methods, which rely on pre-trained image encoders, text, or human annotations, V-JEPA learns directly from video data without external supervision.

What’s impressive? 👀

  • Instead of reconstructing images or relying on pixel-level predictions, V-JEPA prioritizes video feature prediction. This approach leads to more efficient training and superior performance in downstream tasks.
  • V-JEPA requires shorter training schedules than traditional pixel prediction methods (VideoMAE, Hiera, and OmniMAE) while maintaining high performance.
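
To make the difference between pixel prediction and feature prediction concrete, here is a minimal toy sketch in plain Python. It is not Meta's implementation: the linear `encode` function, the weights, and all the numbers are hypothetical stand-ins, chosen only to show that the loss is computed against encoded features of the masked region rather than against raw pixel values.

```python
# Toy illustration of a feature-prediction objective (not Meta's code).

def encode(patch, weights):
    """Hypothetical linear encoder: maps pixel values to a shorter feature vector."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]


def l1_loss(pred, target):
    """Mean absolute error between two equal-length vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


# Made-up 4-pixel masked patch and a fixed 2x4 "target encoder".
masked_patch = [0.2, 0.4, 0.6, 0.8]
target_weights = [
    [0.25, 0.25, 0.25, 0.25],
    [0.5, -0.5, 0.5, -0.5],
]

# Pixel-level objective: the prediction is scored against every pixel value.
pixel_prediction = [0.25, 0.35, 0.6, 0.7]
pixel_loss = l1_loss(pixel_prediction, masked_patch)

# Feature-level objective: the prediction is scored against the encoded target.
target_features = encode(masked_patch, target_weights)
feature_prediction = [0.45, -0.2]
feature_loss = l1_loss(feature_prediction, target_features)

print(f"pixel loss: {pixel_loss:.3f}, feature loss: {feature_loss:.3f}")
```

In this sketch the feature target has two values where the pixel target has four, which hints at why predicting in feature space can be cheaper than reconstructing every pixel.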


We wrote a comprehensive explainer of V-JEPA, including the architecture, key features, and performance details, in this blog post. Here is the accompanying repository on the implementation of V-JEPA.

OpenAI Releases New Text-to-Video Model, Sora

OpenAI responded to the recent debut of Google's Lumiere, a space-time diffusion model for video generation, by unveiling its own creation: Sora. The diffusion model can transform text descriptions into high-definition video clips of up to one minute. In this comprehensive explainer, you will learn:

  • How Sora works
  • Capabilities and limitations
  • Safety considerations
  • Other text-to-video generative models.

Gemini 1.5: Google's Generative AI Model with 1 Million-Token Context Length and MoE Architecture

Gemini 1.5 is a sparse mixture-of-experts (MoE) multimodal model with a context window of up to 1 million tokens in production and 10 million tokens in research. It excels at long-term recall and retrieval and generalizes zero-shot to long instructions, like analyzing 3 hours of video with near-perfect recall.

Here is an explainer blog that distils the technical report down to the essential information.
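
If "sparse mixture-of-experts" is new to you, the sketch below shows generic top-k expert routing in plain Python. It is an assumption-laden illustration, not Gemini 1.5's actual routing, which is not public: the toy experts, the hard-coded router logits, and the choice of k=2 are all hypothetical. In a real model, a learned layer would produce the router logits from the input.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Renormalize so the selected experts' weights sum to 1.
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum over the selected experts only; the rest are never called."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Four hypothetical "experts", each just a simple function of the input.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
router_logits = [2.0, 0.1, 2.0, -1.0]  # made-up scores for a single token

selected = route_top_k(router_logits, k=2)
output = moe_forward(3.0, experts, router_logits, k=2)
print(selected, output)
```

The "sparse" part is that only k of the experts run for a given input, so the compute per token stays far below what running every expert would cost.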

Developer Resources You’d Find Useful

  • Multi-LoRA Composition for Image Generation → The space is moving so fast that it’s easy to miss out on gems like Multi-LoRA! The Multi-LoRA composition implementation integrates diverse elements like characters & clothing into a unified image to avoid the detail loss and distortion seen in traditional LoRA Merge. Check out the repo and try it yourself.
  • Scaling MLOps for Computer Vision by MLOps.Community → In this panel conversation, experienced engineers talk about their experience, challenges, and best practices for working with computer vision operations (CVOps) at scale.
  • How to Analyze Failure Modes of Object Detection Models for Debugging → This guide showcases how to use Encord Active to automatically identify and analyze the failure modes of a computer vision model to understand how well or poorly it performs in challenging real-world scenarios.
  • NVIDIA Triton Server Serving at Oracle [Case Study] → I really liked this short case study by the Oracle Cloud team that discussed how their computer vision and data science services accelerate AI predictions using the NVIDIA Triton Inference Server. Some learnings in terms of cost savings and performance optimization are valuable.
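
For a feel of what analyzing object detection failure modes involves, here is a minimal sketch that buckets a predicted box by its overlap (IoU) with the ground truth. This is not Encord Active's API. The function names, the three categories, and the 0.5 threshold are hypothetical simplifications for illustration only.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def classify_prediction(pred_box, gt_boxes, threshold=0.5):
    """Bucket one prediction by its best overlap with any ground-truth box."""
    best = max((iou(pred_box, gt) for gt in gt_boxes), default=0.0)
    if best >= threshold:
        return "true_positive"
    if best > 0:
        return "localization_error"  # overlaps an object, but too loosely
    return "background_false_positive"  # overlaps no object at all


ground_truth = [(0, 0, 10, 10)]
print(classify_prediction((1, 1, 10, 10), ground_truth))
print(classify_prediction((5, 5, 15, 15), ground_truth))
print(classify_prediction((20, 20, 30, 30), ground_truth))
```

Counting how many predictions land in each bucket is one simple way to see whether a model mostly struggles with localization or with hallucinating objects in the background.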

Here are other quick finds if you 💓 Encord and computer vision data stuff ⚡:

Written by

Stephen Oladele
