DeepSeek AI: Open-Source Models Revolutionizing Language, Reasoning, and Multimodal AI
Open-source AI models are rapidly closing the gap with proprietary systems, and DeepSeek AI is at the forefront of this shift. DeepSeek is a Chinese AI company founded by Liang Wenfeng that focuses on building open-source large language models (LLMs). With models like DeepSeek V3, Janus for image generation, and DeepSeek R1 for reasoning, DeepSeek has built a suite of AI tools that rival, and in some cases outperform, closed models like OpenAI’s GPT-4 and Google’s Gemini, as well as open-source models like Meta’s Llama and Qwen.
This blog explains DeepSeek’s key models, their features, what makes them stand out, and how they compare to other top AI systems.
DeepSeek V3
DeepSeek V3 is a Mixture of Experts (MoE) language model. Unlike dense models such as GPT-4, where every parameter is used for every token, MoE models selectively activate a subset of the network for each token. DeepSeek V3 has 671 billion parameters in total but activates only 37 billion of them per token during inference, so the full model never needs to be active at once. This makes it far more computationally efficient than a fully dense model of the same size.
Model Architecture
DeepSeek V3 is based on a Mixture of Experts (MoE) transformer architecture, which selectively activates different subsets of parameters for different inputs.
Basic architecture of DeepSeek V3. Source.
Key components of its architecture include:
Mixture of Experts (MoE) Framework
- DeepSeek V3 follows an MoE-based architecture, where different "expert" subnetworks handle different parts of the computation.
- Instead of using all parameters for every token (as in dense models), DeepSeek V3 dynamically selects a subset of experts per token, running at a fraction of the compute cost of a fully dense model (see the sketch after this list).
- This design allows the model to scale efficiently while keeping inference more resource-efficient.
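To make the routing idea concrete, here is a minimal top-k MoE layer in PyTorch. It is an illustrative sketch of the general technique, not DeepSeek V3’s actual implementation (which uses the DeepSeekMoE design with fine-grained and shared experts); the layer sizes and the per-expert loop are simplified for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy MoE layer: each token is routed to its top-k experts."""
    def __init__(self, dim, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                          # x: (tokens, dim)
        scores = self.router(x)                    # token-to-expert affinities
        weights, idx = scores.topk(self.k, dim=-1) # keep only the k best experts
        weights = F.softmax(weights, dim=-1)       # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # only selected experts run;
            for e, expert in enumerate(self.experts):  # the rest stay inactive
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

x = torch.randn(16, 64)                            # 16 tokens, hidden size 64
print(TopKMoE(64)(x).shape)                        # torch.Size([16, 64])
```

With 8 experts and k = 2, each token touches only a quarter of the expert parameters; the same mechanism is what lets DeepSeek V3 activate 37B of its 671B parameters per token.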
Multi-Head Latent Attention (MLA)
The model incorporates Multi-Head Latent Attention (MLA), an approach introduced in DeepSeek V2. MLA compresses the attention keys and values into a small latent vector, which shrinks the KV cache and makes inference faster and more memory-efficient.
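The core trick can be sketched in a few lines: cache one small latent per token and reconstruct keys and values from it on the fly. This is a toy version with made-up sizes; the real design adds details such as decoupled rotary position embeddings, and causal masking is omitted here.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Toy MLA-style attention: keys/values come from a cached low-rank latent."""
    def __init__(self, dim=512, n_heads=8, latent_dim=64):
        super().__init__()
        self.h, self.d = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim)
        self.kv_down = nn.Linear(dim, latent_dim)  # compress token -> latent (cached)
        self.k_up = nn.Linear(latent_dim, dim)     # expand latent -> keys
        self.v_up = nn.Linear(latent_dim, dim)     # expand latent -> values
        self.out = nn.Linear(dim, dim)

    def forward(self, x, latent_cache=None):       # x: (batch, seq, dim)
        b, t, _ = x.shape
        latent = self.kv_down(x)                   # (batch, seq, latent_dim)
        if latent_cache is not None:               # during decoding, reuse old latents
            latent = torch.cat([latent_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.h, self.d).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.h, self.d).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.h, self.d).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.d ** 0.5
        y = (attn.softmax(-1) @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y), latent                 # caller caches the latent, not k/v

y, latent = LatentKVAttention()(torch.randn(2, 10, 512))
print(y.shape, latent.shape)                       # (2, 10, 512) and (2, 10, 64)
```

Caching a 64-dimensional latent instead of full 512-dimensional keys and values is what cuts KV-cache memory, and that saving is why MLA speeds up long-context inference.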
DeepSeekMoE for Training Optimization
DeepSeekMoE, introduced in earlier versions, is used to train the MoE layers efficiently. It helps distribute workload across experts, reducing imbalances that could affect model performance.
Load Balancing Strategy
MoE models often struggle with uneven expert utilization, which can slow down training. DeepSeek V3 introduces an auxiliary-loss-free load balancing strategy: rather than adding a balancing loss term that can hurt model quality, it steers routing with per-expert bias terms, reducing the usual trade-off between performance and even expert activation (sketched below).
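A minimal sketch of the idea, with illustrative shapes and step size: each expert gets a bias that is added to its routing score only when selecting the top-k experts (not when computing the mixing weights), and the bias is nudged down for overloaded experts and up for underloaded ones.

```python
import torch

def route_with_bias(scores, bias, k=2, gamma=0.001):
    """Auxiliary-loss-free balancing: bias steers selection, not mixing weights."""
    _, idx = (scores + bias).topk(k, dim=-1)             # biased scores pick the experts
    weights = torch.gather(scores, -1, idx).softmax(-1)  # unbiased scores mix outputs

    # Update biases from the observed load: overloaded experts get pushed down,
    # underloaded experts get pulled up, by a fixed step gamma.
    load = torch.zeros_like(bias).scatter_add_(0, idx.flatten(),
                                               torch.ones(idx.numel()))
    bias = bias - gamma * torch.sign(load - load.mean())
    return idx, weights, bias

scores = torch.randn(1024, 8)    # routing scores: 1024 tokens, 8 experts
bias = torch.zeros(8)
idx, weights, bias = route_with_bias(scores, bias)
```

Because the bias never enters the mixing weights, it rebalances traffic without distorting the gradient signal the way an auxiliary balancing loss would.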
Multi-Token Prediction (MTP) Training
Instead of predicting only the next token at each position, DeepSeek V3 uses Multi-Token Prediction (MTP): during training, extra heads also predict tokens further ahead. This densifies the training signal, and the extra heads can later be used for speculative decoding to speed up inference.
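A simplified illustration of the objective, assuming a single extra head that looks one token further ahead (the real design chains sequential MTP modules, and the loss weight is a hyperparameter):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mtp_loss(hidden, head_next, head_ahead, targets, mtp_weight=0.3):
    """Next-token loss plus a second head predicting one token further ahead."""
    logits1 = head_next(hidden[:, :-2])            # predict token t+1 from position t
    logits2 = head_ahead(hidden[:, :-2])           # predict token t+2 from position t
    loss1 = F.cross_entropy(logits1.transpose(1, 2), targets[:, 1:-1])
    loss2 = F.cross_entropy(logits2.transpose(1, 2), targets[:, 2:])
    return loss1 + mtp_weight * loss2              # extra heads densify the signal

vocab, dim = 100, 32
hidden = torch.randn(2, 16, dim)                   # stand-in transformer hidden states
targets = torch.randint(0, vocab, (2, 16))         # token ids
print(mtp_loss(hidden, nn.Linear(dim, vocab), nn.Linear(dim, vocab), targets))
```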
Memory Optimization for Large-Scale Training
DeepSeek V3 is designed to be trained without tensor parallelism, which typically adds memory and communication overhead. This keeps large-scale training on GPUs efficient and low-cost, making it more accessible for large-scale deployments.
These optimizations enable DeepSeek V3 to achieve strong performance with lower training and inference costs, making it a competitive open-source alternative to closed-source models like GPT-4o and Claude-3.5.
Key Capabilities
- Computational Efficiency – The MoE structure reduces the number of active parameters per token, improving efficiency while maintaining strong performance.
- Extended Context Handling – Supports 128,000 tokens, allowing better processing of long documents and multi-turn conversations.
- Training Data and Fine-Tuning – Pretrained on 14.8 trillion tokens across multiple languages, with a focus on math and programming tasks. The model is then fine-tuned using Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) for better reasoning and instruction following.
Performance
DeepSeek V3 achieves state-of-the-art performance among open-source models on knowledge, reasoning, coding, and math benchmarks. It scores 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA, surpassing other open models and approaching GPT-4o and Claude-3.5 performance. It excels in math, outperforming OpenAI’s o1-preview on MATH-500, and in coding, ranking highest on LiveCodeBench.
Its 128K token context length enables better long-form understanding. While closed models still lead in some areas, DeepSeek V3 offers a strong open-source alternative with competitive performance across multiple domains.
DeepSeek Multimodal Understanding and Generation
Janus
Janus is an autoregressive framework designed for multimodal tasks, combining both understanding and generation in a single generative AI model. It introduces a decoupled visual encoding approach, where separate pathways handle different aspects of visual processing while maintaining a unified transformer-based architecture. This design resolves the conflict between understanding and generation, making Janus more flexible than previous unified models. As a result, it matches or surpasses task-specific models on various multimodal benchmarks, demonstrating its effectiveness in vision-language tasks.
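The decoupling can be pictured as two input pathways feeding one transformer: a continuous vision encoder produces embeddings for understanding, while a discrete tokenizer’s codebook supplies image codes for generation. The skeleton below illustrates the idea only; the component names and sizes are invented stand-ins, not Janus’s actual API.

```python
import torch
import torch.nn as nn

DIM, CODEBOOK = 256, 1024

# Understanding pathway: a continuous vision encoder (stand-in for a SigLIP-style model).
understand_encoder = nn.Sequential(nn.Flatten(2), nn.Linear(32 * 32, DIM))
# Generation pathway: discrete image codes from a VQ-style tokenizer's codebook.
code_embed = nn.Embedding(CODEBOOK, DIM)
# One shared transformer stands in for the unified autoregressive model.
lm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)

image = torch.randn(1, 3, 32, 32)                  # toy image
text = torch.randn(1, 8, DIM)                      # toy text embeddings

# Understanding: continuous image features and text share one sequence.
understanding_out = lm(torch.cat([understand_encoder(image), text], dim=1))
# Generation: the model predicts the next discrete image code from text + codes so far.
codes = torch.randint(0, CODEBOOK, (1, 16))
generation_out = lm(torch.cat([text, code_embed(codes)], dim=1))
print(understanding_out.shape, generation_out.shape)
```

Keeping the two visual pathways separate lets one set of features serve semantic understanding while another serves pixel-level generation, without the two objectives competing for a single encoder.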
Janus-Pro
Janus-Pro builds on Janus with larger model scaling, improved training strategies, and expanded training data, leading to better multimodal understanding and more reliable text-to-image generation. These enhancements improve instruction-following capabilities for text-to-image tasks while increasing overall model stability. With these refinements, Janus-Pro pushes the performance of unified multimodal models further, offering a scalable and efficient solution for complex vision-language interactions.
Key Capabilities
Janus
- Unified Multimodal Model: Janus integrates both multimodal understanding and generation into a single model, addressing limitations of previous approaches.
- Decoupled Visual Encoding: By separating visual encoding into distinct pathways, Janus improves flexibility and performance for both understanding and generation tasks.
- Autoregressive Framework: Janus uses an autoregressive framework that leverages a unified transformer architecture for multimodal processing.
Janus-Pro
- Enhanced Text-to-Image Instruction-Following: Janus-Pro significantly improves performance in generating images based on text instructions, achieving high scores on the GenEval leaderboard.
- Expanded Training Data and Larger Model Size: By scaling up the model size and increasing the dataset, Janus-Pro enhances stability and quality in text-to-image generation.
- Optimized Training Strategy: Janus-Pro incorporates a more refined training strategy for better performance on diverse multimodal tasks.
- Scalability: Janus-Pro supports multiple model sizes (1B and 7B parameters), showcasing its scalability in handling more complex tasks.
Performance
Janus-Pro significantly improves multimodal understanding and text-to-image generation over its predecessor, Janus. The Janus-Pro-7B model achieves a 79.2 score on MMBench, outperforming Janus (69.4), TokenFlow (68.9), and MetaMorph (75.2), demonstrating its superior multimodal reasoning capabilities. In text-to-image instruction-following, Janus-Pro-7B scores 0.80 on GenEval, surpassing Janus (0.61), DALL-E 3 (0.67), and Stable Diffusion 3 Medium (0.74).
These improvements result from enhanced training strategies, expanded datasets, and increased model scale, making Janus-Pro a state-of-the-art unified multimodal model with strong generalization across tasks.
DeepSeek R1
DeepSeek-R1 is an open-source reasoning model that matches OpenAI-o1 on math, reasoning, and code tasks. It takes a novel approach to reasoning: reinforcement learning (RL) lets the model’s reasoning abilities evolve on their own, while still delivering high-performance solutions.
Model Architecture
It operates on the framework of the DeepSeek V3 base model and uses RL for training without relying on supervised fine-tuning (SFT). Training starts with DeepSeek-R1-Zero, a model trained purely through RL, which naturally develops powerful reasoning behaviors like self-verification, reflection, and chain-of-thought (CoT) solutions. The model is then refined through a multi-stage training pipeline that incorporates cold-start data and SFT data from domains like writing and factual QA. This iterative process improves the model’s performance and resolves challenges such as readability and language mixing found in the initial RL phase.
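The R1-Zero stage relies on simple rule-based rewards rather than a learned reward model, combined with Group Relative Policy Optimization (GRPO), which scores each sampled answer against its group’s average instead of training a separate critic. Below is a hedged sketch of those two pieces; the tags and reward constants are invented for illustration, not DeepSeek’s exact rules.

```python
import re
import torch

def rule_based_reward(completion, ground_truth):
    """Rule-based reward in the spirit of R1-Zero: accuracy plus format."""
    reward = 0.0
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m and m.group(1).strip() == ground_truth:    # verifiable accuracy check
        reward += 1.0
    if "<think>" in completion and "</think>" in completion:
        reward += 0.1                               # reward for following the CoT format
    return reward

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize rewards within a group of samples,
    so no learned value function (critic) is needed."""
    r = torch.tensor(rewards)
    return (r - r.mean()) / (r.std() + 1e-6)

group = ["<think>2 + 2 = 4</think><answer>4</answer>", "<answer>5</answer>"]
rewards = [rule_based_reward(c, "4") for c in group]
print(group_relative_advantages(rewards))           # correct answer gets the higher advantage
```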
Key Capabilities
- Pure RL Training: Unlike most artificial intelligence models that rely on supervised fine-tuning, DeepSeek-R1 is primarily trained through RL. This means that the model self-evolves its reasoning capabilities.
- Self-Verification and Chain-of-Thought: The R1 model naturally develops advanced reasoning behaviors such as self-verification, reflection, and chain-of-thought solutions, improving its ability to solve complex tasks.
- Distilled Models: DeepSeek-R1 also includes distilled versions, such as DeepSeek-R1-Distill-Qwen-32B, offering competitive performance with reduced resource requirements (see the loading example below).
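Because the distilled checkpoints are standard causal language models, they can be loaded with Hugging Face Transformers. The snippet below assumes the public deepseek-ai/DeepSeek-R1-Distill-Qwen-32B checkpoint and enough GPU memory; a smaller distill works the same way.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")

# R1-style models emit their chain of thought before the final answer.
messages = [{"role": "user", "content": "What is 17 * 24? Reason step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```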
Performance
DeepSeek-R1 matches or exceeds the performance of many SOTA models across a range of math, reasoning, and code tasks. The model achieves impressive results on reasoning benchmarks, setting new records for dense models, particularly with the distilled Qwen and Llama-based versions. For example, the DeepSeek-R1-Distill-Qwen-32B model surpasses OpenAI-o1-mini in various benchmarks.
Key Takeaways
- DeepSeek models are fully open-source, fostering innovation and offering scalable, cost-effective solutions for diverse AI applications.
- DeepSeek V3: Utilizes a Mixture of Experts (MoE) architecture for computational efficiency, offering strong performance with reduced resource usage.
- Janus and Janus-Pro: Multimodal models that excel in both understanding and text-to-image generation with improved instruction-following capabilities and superior multimodal reasoning, making them powerful for AI assistants and chatbot applications.
- DeepSeek R1: A reasoning-focused model trained through reinforcement learning, achieving top performance in math, reasoning, and coding tasks.
- DeepSeek’s models offer performance competitive with closed-source systems like GPT-4 and Claude-3.5, providing a high-performance alternative with more efficient use of compute.
Written by Eric Landau