The Complete Guide to DeepSeek Models: From V3 to R1 and Beyond

March 6, 2025 • Written By Sherlock Xu

DeepSeek has emerged as a major player in AI, drawing attention not just for its massive 671B models, V3 and R1, but also for its suite of distilled versions. As interest in these models grows, so does the confusion about their differences, capabilities, and ideal use cases.

  • “Which DeepSeek model should I use?”
  • “What’s the difference between R1 and V3?”
  • “Is R1-Zero better than R1?”
  • “Do I really need a distilled model?”

These questions echo across developer forums, Discord channels, and GitHub discussions. And honestly, the confusion makes sense. DeepSeek’s lineup has expanded rapidly, and without a clear roadmap, it’s easy to get lost in technical jargon and benchmark scores.

In this post, we’ll break down the key differences and help you choose the right model for your needs.

DeepSeek-V3

Let’s rewind to December 2024, when DeepSeek dropped V3: a Mixture-of-Experts (MoE) model with 671 billion total parameters, of which 37 billion are activated per token.

If you’re wondering what Mixture-of-Experts means, it’s actually a cool concept. Instead of running the entire network for every token, the model routes each token to a small subset of specialized “experts” and activates only those. It “picks” the right experts for the job, which makes it not just powerful but efficient.
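To make the idea concrete, here is a minimal sketch of top-k expert routing. It is purely illustrative: DeepSeek-V3's actual MoE layer uses many more experts (including shared experts) and its own load-balancing scheme, and the names and sizes below are made up.

```python
# A toy top-k MoE layer: a router scores the experts, each token is processed
# by only its top-k experts, and their outputs are combined by routing weight.
# Illustrative only; not DeepSeek's implementation.
import torch
import torch.nn as nn


class TinyMoE(nn.Module):
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)  # scores every expert per token
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, dim)
        scores = self.router(x).softmax(dim=-1)          # expert affinity per token
        weights, ids = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):                      # loop for clarity, not speed
            for w, i in zip(weights[t], ids[t]):
                out[t] += w * self.experts[int(i)](x[t])
        return out


tokens = torch.randn(4, 64)       # 4 tokens with hidden size 64
print(TinyMoE()(tokens).shape)    # torch.Size([4, 64])
```

Only 2 of the 8 experts run per token here, which is exactly why a 671B-parameter model can get away with activating just 37B parameters at a time.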

What's perhaps most remarkable about DeepSeek-V3 is its training efficiency. Despite its size, the model required only 2.788 million H800 GPU hours, which translates to around $5.6 million in training costs (the technical report assumes a rental price of $2 per GPU hour). To put that in perspective, training GPT-4 is estimated to have cost between $50 and $100 million.

DeepSeek-V3 Base vs. Chat model

DeepSeek-V3 comes in two versions: a Base and a Chat model.

  • The Base model is exactly what it sounds like: the foundation. During pre-training, it learns to predict what comes next in massive amounts of text. DeepSeek researchers then took this Base model through two different post-training regimes, which produced two further models: the DeepSeek-V3 Chat model and R1.
  • The Chat model (aka DeepSeek-V3, and yes, the naming can be confusing) underwent additional instruction tuning and reinforcement learning from human feedback (RLHF) to make it more helpful, harmless, and honest in conversation. It is highly performant in tasks like coding and math, and even compares favorably to the likes of GPT-4o and Llama 3.1 405B.

You can check their detailed benchmark performance in the evaluation results.

Deploying DeepSeek-V3

Both DeepSeek-V3 Base and Chat models are open-source and commercially usable. You can self-host them to build your own ChatGPT-level application. If you’re looking to deploy DeepSeek-V3, check out our example project using BentoML and vLLM.
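Once deployed, both BentoML and vLLM can expose an OpenAI-compatible endpoint, so querying your self-hosted model looks something like the sketch below. The base URL, API key, and model name are placeholders for your own deployment:

```python
# A minimal sketch of calling a self-hosted DeepSeek-V3 deployment through an
# OpenAI-compatible API. Replace base_url, api_key, and model with the values
# from your own deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Explain Mixture-of-Experts in one sentence."}],
)
print(response.choices[0].message.content)
```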


DeepSeek-R1

DeepSeek didn’t stop with V3. Just weeks later, they introduced two new models built on DeepSeek-V3-Base: DeepSeek-R1-Zero and DeepSeek-R1.

DeepSeek-R1-Zero: Learning without supervision

DeepSeek-R1-Zero was trained using large-scale reinforcement learning (RL) without the usual step of supervised fine-tuning (SFT). In simple terms, it learned reasoning patterns entirely on its own, refining its abilities through trial and error rather than structured instruction.

While the results were remarkable, there were also trade-offs. R1-Zero occasionally struggled with endless repetition, poor readability, and even language mixing.

DeepSeek-R1: A more refined reasoning model

To smooth out these rough edges, DeepSeek developed DeepSeek-R1 using a more sophisticated multi-stage training pipeline. This included incorporating thousands of "cold-start" data points to fine-tune the V3-Base model before applying reinforcement learning. The result was R1, a model that not only keeps the reasoning power of R1-Zero but significantly improves accuracy, readability, and coherence.

Unlike V3, which is optimized for general tasks, R1 is a true reasoning model. It doesn’t just give you an answer; it explains how it got there. Before responding, R1 generates a step-by-step chain of thought (wrapped in <think> tags; a sketch for separating the two follows the list), making it especially useful for:

  • Complex mathematical problem-solving
  • Coding challenges
  • Scientific reasoning
  • Multi-step planning for agent workflows
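Because the reasoning arrives inside <think>...</think> tags before the final answer, applications typically split the two. Here is a minimal sketch (the helper name is our own):

```python
# A minimal sketch for splitting an R1 response into its chain of thought and
# its final answer. R1 emits reasoning between <think> and </think>; whatever
# follows the closing tag is the answer.
import re


def split_r1_output(text: str) -> tuple[str, str]:
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:  # the model occasionally skips its reasoning step
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


raw = "<think>2 + 2 = 4, times 10 is 40, plus 2 is 42.</think>\nThe answer is 42."
reasoning, answer = split_r1_output(raw)
print(reasoning)  # 2 + 2 = 4, times 10 is 40, plus 2 is 42.
print(answer)     # The answer is 42.
```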

Performance-wise, R1 rivals or even surpasses OpenAI o1 (also a reasoning model, though unlike R1 it does not fully disclose its thinking tokens) in math, coding, and reasoning benchmarks. This makes it the most powerful open-source reasoning model available today.

Deploying DeepSeek-R1

R1 is the engine behind the DeepSeek chat application, and many developers have begun using it for private deployments. If you’re looking to run it yourself, check out our example project using BentoML and vLLM.


Keep these tips in mind when using R1 (a prompt sketch applying them follows the list):

  1. Avoid system prompts and make sure all instructions are included directly in the user prompt.
  2. For math problems, add a directive like Please reason step by step, and put your final answer within \boxed{}.
  3. Be aware that R1 may sometimes skip its reasoning process (i.e., outputting <think>\n\n</think>). To encourage thorough reasoning, tell the model to start the response with <think>\n in your prompt.
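Putting the three tips together, a request might look like this. The endpoint, API key, and model name are again placeholders for your own deployment:

```python
# A minimal sketch applying the three R1 prompting tips above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

question = "What is the sum of the first 100 positive integers?"
prompt = (
    f"{question}\n"
    # Tip 2: the step-by-step directive for math problems.
    "Please reason step by step, and put your final answer within \\boxed{}."
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    # Tip 1: no system message; all instructions live in the user prompt.
    # Tip 3: if the model skips reasoning, also instruct it in the prompt to
    # begin its response with "<think>\n".
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```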

See more recommendations in the DeepSeek-R1 repository.

DeepSeek-V3 vs. DeepSeek-R1: Which one should you choose?

DeepSeek-V3 and DeepSeek-R1 are the go-to models for many engineers today, but they serve different purposes. If you’re unsure which model fits your needs, here’s a quick comparison to help you decide:

| Item | DeepSeek-V3 | DeepSeek-R1 |
| --- | --- | --- |
| Base model | DeepSeek-V3-Base | DeepSeek-V3-Base |
| Type | General-purpose language model | Reasoning model |
| Response style | Direct answers (e.g., "The answer is 42") | Step-by-step reasoning (e.g., "First, calculate X… then Y… so the answer is 42") |
| Parameters | 671B (37B activated) | 671B (37B activated) |
| Architecture | MoE | MoE |
| Context length | 128K | 128K |
| License | MIT & Model License | MIT |
| Best for | Content creation, writing, translation, general Q&A | Complex math, coding, research, logical reasoning, agentic workflows |
| Standard API price* (input, UTC 00:30-16:30) | $0.07/M tokens (cache hit), $0.27/M tokens (cache miss) | $0.14/M tokens (cache hit), $0.55/M tokens (cache miss) |
| Standard API price* (output, UTC 00:30-16:30) | $1.10/M tokens | $2.19/M tokens |

* Pricing is as of March 6, 2025 and may change over time. DeepSeek offers off-peak pricing discounts from 16:30-00:30 UTC daily. Check their API documentation for details.
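To estimate what a workload would cost at these rates, a quick back-of-the-envelope calculation helps. The token counts below are made-up examples:

```python
# A quick cost estimate at the standard R1 rates above ($0.55/M input tokens
# on a cache miss, $2.19/M output tokens). Token counts are made-up examples.
R1_INPUT_MISS = 0.55   # USD per million input tokens (cache miss)
R1_OUTPUT = 2.19       # USD per million output tokens

input_tokens = 10_000   # a long prompt
output_tokens = 4_000   # reasoning models produce many output tokens

cost = (input_tokens * R1_INPUT_MISS + output_tokens * R1_OUTPUT) / 1_000_000
print(f"${cost:.4f} per request")  # $0.0143 per request
```

Note how the chain-of-thought tokens count as output, so R1 requests tend to cost more than the per-token prices alone suggest.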

In short, I suggest you:

  • Choose DeepSeek-V3 if you need a fast, general-purpose model for tasks like writing, summarization, translation, or casual Q&A.
  • Choose DeepSeek-R1 if you need a reasoning-focused model for math, coding, research, or any task that requires step-by-step logical thinking.

Distilled DeepSeek models: Bringing reasoning to smaller models

While V3 and R1 are impressive, running them isn’t practical for everyone. The full 671B models typically require a node of 8 NVIDIA H200 GPUs, each with 141GB of memory.

That’s where distilled DeepSeek models come in. These smaller, more efficient models bring the reasoning power of R1 to a more accessible scale. Instead of training new models from scratch, DeepSeek took a smart shortcut:

  1. Started with 6 open-source models from Llama 3.1/3.3 and Qwen 2.5
  2. Generated 800,000 high-quality reasoning samples using R1
  3. Fine-tuned the smaller models on this synthetic reasoning data

Unlike R1, these distilled models rely solely on SFT; they do not include an RL stage.
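Conceptually, the recipe looks like the sketch below: the teacher (R1) generates reasoning traces, and each trace becomes an ordinary supervised training example for the student. All names here are illustrative, not DeepSeek's actual pipeline code:

```python
# A conceptual sketch of distillation via SFT: the teacher model generates
# reasoning traces, which become supervised training examples for a smaller
# student. Function names and record format are illustrative.
def generate_trace(teacher_client, question: str) -> str:
    """Step 2: ask the teacher (e.g. a hosted R1 endpoint) for a full
    reasoning trace, <think>...</think> included."""
    response = teacher_client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content


def to_sft_record(question: str, trace: str) -> dict:
    """Step 3: package the trace as a prompt/completion pair, the shape
    a standard SFT trainer (e.g. TRL's SFTTrainer) can consume."""
    return {"prompt": question, "completion": trace}
```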

Despite their smaller size, these models perform remarkably well on reasoning tasks, proving that large-scale AI reasoning can be efficiently distilled. DeepSeek has open-sourced all six distilled models, ranging from 1.5B to 70B parameters.

[Figure: Distilled Model Evaluation. Image Source.]

But before I dive into the individual models, let’s take a quick look at the technique that makes this possible: distillation.

What is distillation?

Distillation is a technique that transfers knowledge from a large, powerful model to a smaller, more efficient one. Instead of training on raw data, the smaller model learns to mimic the larger model’s behavior.

A great analogy comes from the research paper Distilling the Knowledge in a Neural Network by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. They compare distillation to how insects evolve from larvae to adults:

  • The larval stage is optimized for growth, consuming as many nutrients as possible.
  • The adult stage is built for efficiency: faster, leaner, and adapted for survival.

Similarly, large AI models are trained with huge datasets and high computational power to extract deep knowledge. But deploying them at scale requires something faster and lighter. That’s where distillation comes in. It compresses the intelligence of a large model into a smaller model, making it more practical for real-world applications.
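For a flavor of the classic technique from that paper, here is a minimal sketch of the soft-target distillation loss. Note that DeepSeek's distilled models were instead produced via SFT on R1-generated text; this shows the original logit-matching formulation for intuition:

```python
# A minimal sketch of the soft-target distillation loss from Hinton et al.:
# the student is trained to match the teacher's softened output distribution.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student, scaled by T^2 as in the paper
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2


student = torch.randn(4, 10)   # (batch, vocab) logits from the small model
teacher = torch.randn(4, 10)   # logits from the large model
print(distillation_loss(student, teacher))
```

Raising the temperature softens both distributions, exposing the teacher's "dark knowledge" about which wrong answers are nearly right.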

Distilled models from Llama and Qwen

Now, let’s take a closer look at each distilled model.

DeepSeek-R1-Distill-Qwen-1.5B

This is the smallest model in the lineup with decent math and reasoning ability. It outperforms GPT-4o and Claude-3.5-Sonnet on AIME and MATH-500, making it a good choice for lightweight problem-solving. However, it struggles with coding tasks, scoring only 16.9 on LiveCodeBench, meaning it's not ideal for programming applications.

DeepSeek-R1-Distill-Qwen-7B

A step up from the 1.5B model, this version offers stronger performance in mathematical reasoning and general problem-solving. It scores well on AIME (55.5) and MATH-500 (92.8), but still lags behind in coding benchmarks (37.6 on LiveCodeBench).

DeepSeek-R1-Distill-Llama-8B

Based on the Llama 3.1 architecture, this model shows strong mathematical reasoning. It not only surpasses GPT-4o and Claude-3.5-Sonnet on AIME and MATH-500, but also comes very close to o1-mini and QwQ-32B-Preview on MATH-500. Its coding benchmarks suggest decent competitive programming skills, though still not on par with the larger models.

DeepSeek-R1-Distill-Qwen-14B

This is a balanced model that offers strong reasoning, math, and general logic capabilities. It’s a great middle ground for those needing better accuracy without high computational costs. While the coding performance is not the best, it’s very close to o1-mini.

DeepSeek-R1-Distill-Qwen-32B

This is one of the best-performing distilled models. With top-tier reasoning (72.6 on AIME, 94.3 on MATH-500) and a strong CodeForces rating (1691), it's a great option for math-heavy applications, competitive problem-solving, and advanced AI research.

An interesting research insight: DeepSeek used this model to compare distillation and RL for reasoning tasks. They tested whether a smaller model trained through large-scale RL could match the performance of a distilled model.

To explore this, they trained Qwen-32B-Base with math, coding, and STEM data for over 10,000 RL steps, resulting in DeepSeek-R1-Zero-Qwen-32B.

[Figure: Distilled and RL Models on Reasoning-Related Benchmarks. Image Source.]

The conclusion: distilling a powerful model into a smaller one works better. Training smaller models directly with large-scale RL requires massive computing power and may still underperform distillation.

DeepSeek-R1-Distill-Llama-70B

This is the most powerful distilled model, based on Llama-3.3-70B-Instruct (chosen for its better reasoning capability than Llama 3.1). With a 94.5 score on MATH-500, it closely rivals DeepSeek-R1 itself. It also achieves the highest coding score (57.5 on LiveCodeBench) among all distilled models.


Here is a high-level comparison of the six distilled models:

| Model | Base model | Best for | Reasoning strength | Compute cost |
| --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | Entry-level reasoning, basic math | 💪💪 | Low |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | Mid-tier math & logic tasks | 💪💪💪 | Medium |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | Mid-tier math & logic tasks, coding assistance | 💪💪💪 | Medium |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | Advanced math & logic tasks, problem-solving, coding assistance | 💪💪💪💪 | Medium-High |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | Complex math, logic, & coding tasks, problem-solving, research | 💪💪💪💪💪 | High |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | Complex math, logic, & coding tasks, problem-solving, research | 💪💪💪💪💪 | High |

Explore our example projects to deploy the 6 distilled models using BentoML and vLLM.

Beyond DeepSeek: Community-driven innovations

DeepSeek has ignited a wave of open-source innovation, with researchers and developers extending their models in creative ways. Here are two examples:

  • DeepScaleR-1.5B-Preview is a model fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B using distributed RL to scale up long-context capabilities. It improves Pass@1 accuracy on AIME 2024 by 15% (43.1% vs. 28.8%), even surpassing OpenAI o1-preview, all with just 1.5B parameters.
  • Jiayi Pan from Berkeley AI Research successfully reproduced the reasoning techniques of DeepSeek R1-Zero for under $30. This is a huge step in making advanced AI research more accessible.

These community efforts demonstrate how open source enables researchers and developers to create more accessible and powerful AI solutions.

What’s next

Now that you’re familiar with the DeepSeek models, you may be thinking about building your own AI applications with them. At first glance, calling the official DeepSeek API might seem like the easiest solution; it offers fast time to market with no infrastructure burden.

However, this convenience comes with trade-offs, including:

  • Data privacy and security risks
  • Limited customization (no inference optimization or fine-tuning with proprietary data, etc.)
  • Unpredictable behavior (rate limiting, outages, API restrictions, etc.)

As organizations weigh their options, many are turning to private deployment to maintain control, security, and flexibility.

At BentoML, we help companies build and scale AI applications securely using any model on any cloud. Our AI inference platform BentoCloud lets you deploy any DeepSeek variant on any cloud provider or on-premises infrastructure, offering:

  • Access to a variety of GPUs in the most cost-effective regions
  • Flexibility to deploy in your private VPC
  • Advanced autoscaling with fast cold start time
  • Built-in observability with LLM-specific metrics

Check out the following resources to learn more: