DeepSeek has emerged as a major player in AI, drawing attention not just for its massive 671B models like V3.1 and R1, but also for its suite of distilled versions. As interest in these models grows, so does the confusion about their differences, capabilities, and ideal use cases.
These questions echo across developer forums, Discord channels, and GitHub discussions. And honestly, the confusion makes sense. DeepSeek’s lineup has expanded rapidly, and without a clear roadmap, it’s easy to get lost in technical jargon and benchmark scores.
In this post, we’ll break down the key differences and help you choose the right model for your needs.
Let’s rewind to December 2024 when DeepSeek dropped V3. It's a Mixture-of-Experts (MoE) model with 671 billion parameters and 37 billion activated for each token.
If you’re wondering what Mixture-of-Experts means, it’s actually a cool concept. Essentially, it means the model can activate different parts of itself depending on the task at hand. Instead of using the entire model all the time, it “picks” the right experts for the job. This makes it not just powerful but efficient.
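To make the idea concrete, here is a toy routing sketch in PyTorch. It is a minimal illustration of top-k expert routing, not DeepSeek's actual architecture (V3 additionally uses shared experts, load balancing, and other refinements); all names here are made up for the example.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Toy Mixture-of-Experts layer: each token is routed to its top-k experts only."""
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # scores every expert for every token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (num_tokens, dim)
        scores = self.router(x).softmax(dim=-1)            # (num_tokens, num_experts)
        weights, picked = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = picked[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Only top_k of the 8 experts run for any given token, which is how a 671B-parameter
# MoE model can activate just ~37B parameters per token.
layer = TinyMoELayer(dim=64)
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```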
What's perhaps most remarkable about DeepSeek-V3 is the training efficiency. Despite its size, the model required only 2.788 million H800 GPU hours, which translates to around $5.6 million in training costs. To put that in perspective, training GPT-4 is estimated to cost between $50–100 million.
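The arithmetic behind that figure is simple; DeepSeek's technical report prices H800 time at roughly $2 per GPU hour:

```python
gpu_hours = 2_788_000   # H800 GPU hours reported for DeepSeek-V3 training
usd_per_hour = 2.0      # approximate rental price per H800 GPU hour assumed in the report
print(f"${gpu_hours * usd_per_hour / 1e6:.2f}M")  # -> $5.58M, i.e. roughly $5.6 million
```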
DeepSeek-V3 comes in two versions: a Base and a Chat model.
You can check their detailed benchmark performance in the evaluation results.
Both DeepSeek-V3 Base and Chat models are open-source and commercially usable. You can self-host them to build your own ChatGPT-level application. If you’re looking to deploy DeepSeek-V3, check out our example project using BentoML and vLLM.
Deploy DeepSeek-V3
DeepSeek didn’t stop with V3. Just weeks later, they introduced two new models built on DeepSeek-V3-Base: DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero was trained using large-scale reinforcement learning (RL) without the usual step of supervised fine-tuning (SFT). In simple terms, it learned reasoning patterns entirely on its own, refining its abilities through trial and error rather than structured instruction.
While the results were remarkable, there were also trade-offs. R1-Zero occasionally struggled with endless repetition, poor readability, and even language mixing.
To smooth out these rough edges, DeepSeek developed DeepSeek-R1 using a more sophisticated multi-stage training pipeline. This included incorporating thousands of "cold-start" data points to fine-tune the V3-Base model before applying reinforcement learning. The result was R1, a model that not only keeps the reasoning power of R1-Zero but significantly improves accuracy, readability, and coherence.
Unlike V3, which is optimized for general tasks, R1 is a true reasoning model. That means it doesn’t just give you an answer; it explains how it got there. Before responding, R1 generates a step-by-step chain of thought, making it especially useful for:
Performance-wise, R1 rivals or even surpasses OpenAI o1 (also a reasoning model, though unlike R1 it does not fully disclose its thinking tokens) in math, coding, and reasoning benchmarks. This makes it the most powerful open-source reasoning model available today.
R1 is the engine behind the DeepSeek chat application, and many developers have begun using it for private deployments. If you’re looking to run it yourself, check out our example project using BentoML and vLLM.
Deploy DeepSeek-R1
Keep these tips in mind when using R1:

- Avoid adding a system prompt; put all instructions in the user prompt.
- For math problems, include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
- R1 sometimes skips its thinking pattern and outputs an empty <think>\n\n</think> block. To encourage thorough reasoning, tell the model to start its response with <think>\n in your prompt.

See more recommendations in the DeepSeek-R1 repository.
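Here is a minimal sketch of those tips in practice, assuming a self-hosted R1 endpoint that speaks the OpenAI-compatible API (the URL and model name below are placeholders for your own deployment):

```python
from openai import OpenAI

# Placeholder endpoint: point this at your own vLLM/BentoML deployment of DeepSeek-R1.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")

question = "What is the sum of the first 100 positive integers?"
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    # No system prompt: all instructions go into the user message.
    messages=[{
        "role": "user",
        "content": f"{question}\n"
                   "Please reason step by step, and put your final answer within \\boxed{}.",
    }],
    temperature=0.6,  # the R1 repository recommends 0.5-0.7 to reduce repetition
)
print(response.choices[0].message.content)
```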
Note: Compared with R1, DeepSeek-R1-0528 supports system prompts and you don’t need to use <think> to force reasoning output. See details about DeepSeek-R1-0528 below.
DeepSeek-V3 and DeepSeek-R1 are the go-to models for many engineers today, but they serve different purposes. If you’re unsure which model fits your needs, here’s a quick comparison to help you decide:
Item | DeepSeek-V3 | DeepSeek-R1 |
---|---|---|
Base model | DeepSeek-V3-Base | DeepSeek-V3-Base |
Type | General-purpose language model | Reasoning model |
Response style | Direct answers (e.g., "The answer is 42") | Step-by-step reasoning (e.g., "First, calculate X… then Y… so the answer is 42") |
Parameters | 671B (37B activated) | 671B (37B activated) |
Architecture | MoE | MoE |
Context length | 128K | 128K |
License | MIT & Model License | MIT |
Best for | Content creation, writing, translation, general Q&A | Complex math, coding, research, logical reasoning, agentic workflows |
Note that DeepSeek continues to actively update its models. Below are the latest versions of V3 and R1:
In March 2025, DeepSeek released a powerful new update: DeepSeek-V3-0324. While it uses the same Base model as DeepSeek-V3, the post-training pipeline has been improved, drawing lessons from the RL techniques used for DeepSeek-R1. This gives the new model better reasoning performance, coding skills, and tool-use capabilities. In math and coding evaluations, DeepSeek-V3-0324 even outperforms GPT-4.5.
Deploy DeepSeek-V3-0324
If you don’t need fully detailed reasoning chains, or your reasoning tasks are relatively simple, I recommend this newer version: it’s faster and more capable than the original V3.
In May 2025, DeepSeek released DeepSeek-R1-0528, a significant upgrade to the original R1 model. While built on the same V3 Base model, this version pushes reasoning and inference capabilities further by leveraging more compute and advanced post-training optimizations.
Deploy DeepSeek-R1-0528
What’s new in DeepSeek-R1-0528:

- Deeper reasoning and stronger inference, with reduced hallucinations.
- The DeepSeek API has been upgraded accordingly and now supports function calling and structured JSON outputs. The max_tokens parameter now defaults to 32K, with a maximum of 64K (including chain-of-thought tokens). A brief usage sketch follows below.
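As a rough illustration, a call against the upgraded API might look like the sketch below. It assumes the OpenAI-compatible endpoint at api.deepseek.com and the deepseek-reasoner model name; check the official API docs for the current details.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_DEEPSEEK_API_KEY")

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "How many prime numbers are there below 100?"}],
    max_tokens=32_000,  # defaults to 32K; can be raised to 64K, chain-of-thought tokens included
)

message = response.choices[0].message
print(message.reasoning_content)  # chain-of-thought, returned as a separate field
print(message.content)            # the final answer
```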
Additional updates over the original R1 include:

- System prompts are now supported.
- You no longer need to prepend <think> to force reasoning behavior.

In August 2025, DeepSeek released DeepSeek-V3.1, a major update that combines the strengths of V3 and R1 into a single hybrid model. It features a total of 671B parameters (37B activated) and supports context lengths up to 128K.
Key takeaways:
Hybrid thinking mode: V3.1 can switch between “thinking” (chain-of-thought reasoning like R1) and “non-thinking” (direct answers like V3) just by changing the chat template. This means one model can cover both general-purpose and reasoning-heavy use cases.
Extended training: Built on DeepSeek-V3.1-Base, V3.1 went through an expanded long-context training process (630B tokens for the 32K extension phase and 209B tokens for the 128K phase).
Smarter tool calling: Thanks to post-training optimization, V3.1 is much stronger in tool usage and agentic workflows. It outperforms both DeepSeek-V3-0324 and DeepSeek-R1-0528 in code agent and search agent benchmarks.
Faster reasoning: DeepSeek-V3.1-Think achieves quality comparable to DeepSeek-R1-0528, but responds more quickly. Their internal tests show that after chain-of-thought compression training, V3.1-Think reduces output tokens by 20–50% while maintaining almost the same average performance.
The table below compares the latest models from the DeepSeek lineup:
Item | DeepSeek-V3.1 | DeepSeek-V3-0324 | DeepSeek-R1-0528 |
---|---|---|---|
Base model | V3.1-Base | V3-Base | V3-Base |
Parameters | 671B | 660B | 685B |
Context length | 128K | 128K | 128K |
Mode | Hybrid: Thinking (CoT) & Non-Thinking (direct) | Non-thinking (general-purpose) | Thinking (CoT reasoning) |
Tool and agent use | Strongest among the three; best code & search agent results | Good, stronger than original V3 | Improved function calling; search/tool calling not supported under thinking mode |
Response style | Flexible — fast direct answers or step-by-step reasoning | Direct answers | Detailed reasoning chains with higher token usage |
Performance highlights | Comparable reasoning to R1-0528 but faster; reduced CoT tokens by 20–50% | Better coding & math than original V3 and GPT-4.5 | Strongest step-by-step reasoning; reduced hallucinations |
License | MIT | MIT | MIT |
Best for | Teams needing both speed & reasoning in one model | General-purpose workloads like content creation and Q&A, with stronger reasoning ability for tasks like coding/math | Complex math, coding, and reasoning tasks which require deep step-by-step logic |
In short, DeepSeek-V3.1 is the most versatile DeepSeek model yet. It is capable of acting like V3 when you want fast, direct outputs, or like R1 when you need step-by-step reasoning. If you need both speed and reasoning power in one model, or your workloads have heavy tool usage and agent tasks, DeepSeek-V3.1 is an ideal choice.
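Switching between the two modes is a matter of how you render the chat template. Below is a rough sketch using the Hugging Face tokenizer; the thinking keyword is an assumption based on the model card's description of V3.1's template, so double-check the exact argument name for the tokenizer version you download.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.1")
messages = [{"role": "user", "content": "Is 1013 a prime number?"}]

# Non-thinking mode: direct answers, like V3.
direct_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, thinking=False
)

# Thinking mode: chain-of-thought reasoning, like R1.
reasoning_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, thinking=True
)

print(direct_prompt)
print(reasoning_prompt)
```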
While V3 and R1 are impressive, running them isn’t practical for everyone. They require 8 NVIDIA H200 GPUs with 141GB of memory each.
That’s where distilled DeepSeek models come in. These smaller, more efficient models bring the reasoning power of R1 to a more accessible scale. Instead of training new models from scratch, DeepSeek took a smart shortcut: they fine-tuned existing open-source models from the Qwen and Llama families on roughly 800K reasoning samples generated by DeepSeek-R1.
Unlike R1, these distilled models rely solely on SFT and they do not include an RL stage.
Despite their smaller size, these models perform remarkably well on reasoning tasks, proving that large-scale AI reasoning can be efficiently distilled. DeepSeek has open-sourced all six distilled models, ranging from 1.5B to 70B parameters.
But before I dive into the individual models, let’s take a quick look at the technique that makes this possible: distillation.
Distillation is a technique that transfers knowledge from a large, powerful model to a smaller, more efficient one. Instead of training on raw data, the smaller model learns to mimic the larger model’s behavior.
A great analogy comes from the research paper Distilling the Knowledge in a Neural Network by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. They compare distillation to how insects evolve from larvae to adults: a larva is optimized for extracting energy and nutrients from its environment, while the adult form is optimized for the very different requirements of traveling and reproduction.
Similarly, large AI models are trained with huge datasets and high computational power to extract deep knowledge. But deploying them at scale requires something faster and lighter. That’s where distillation comes in. It compresses the intelligence of a large model into a smaller model, making it more practical for real-world applications.
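For intuition, here is a toy sketch of the classic logit-matching flavor of distillation, where a small student learns to imitate a large teacher's output distribution. Note that DeepSeek's distilled models were produced differently, by supervised fine-tuning on reasoning samples generated by R1, but the underlying "student mimics teacher" principle is the same; everything below is illustrative.

```python
import torch
import torch.nn.functional as F

vocab, dim = 1000, 64
teacher = torch.nn.Sequential(torch.nn.Linear(dim, 512), torch.nn.ReLU(), torch.nn.Linear(512, vocab))
student = torch.nn.Sequential(torch.nn.Linear(dim, 128), torch.nn.ReLU(), torch.nn.Linear(128, vocab))
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)
temperature = 2.0  # softens the teacher's distribution so the student sees more of its "dark knowledge"

for step in range(100):
    x = torch.randn(32, dim)  # stand-in for token representations
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x) / temperature, dim=-1)
    student_log_probs = F.log_softmax(student(x) / temperature, dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")  # imitate the teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```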
Now, let’s take a closer look at each distilled model.
This is the smallest model in the lineup with decent math and reasoning ability. It outperforms GPT-4o and Claude-3.5-Sonnet on AIME and MATH-500, making it a good choice for lightweight problem-solving. However, it struggles with coding tasks, scoring only 16.9 on LiveCodeBench, meaning it's not ideal for programming applications.
A step up from the 1.5B model, this version offers stronger performance in mathematical reasoning and general problem-solving. It scores well on AIME (55.5) and MATH-500 (92.8), but still lags behind in coding benchmarks (37.6 on LiveCodeBench).
Based on the Llama 3.1 architecture, this model shows strong mathematical reasoning. It not only surpasses GPT-4o and Claude-3.5-Sonnet in AIME and MATH-500, but also performs very close to o1-mini and QwQ-32B-Preview in MATH-500. Its coding performance suggests better competitive coding skills, but it's still not on par with larger models.
This is a balanced model that offers strong reasoning, math, and general logic capabilities. It’s a great middle ground for those needing better accuracy without high computational costs. While the coding performance is not the best, it’s very close to o1-mini.
This is one of the best-performing distilled models. With top-tier reasoning (72.6 on AIME, 94.3 on MATH-500) and a strong CodeForces rating (1691), it's a great option for math-heavy applications, competitive problem-solving, and advanced AI research.
An interesting research insight: DeepSeek used this model to compare distillation and RL for reasoning tasks. They tested whether a smaller model trained through large-scale RL could match the performance of a distilled model.
To explore this, they trained Qwen-32B-Base with math, coding, and STEM data for over 10,000 RL steps, resulting in DeepSeek-R1-Zero-Qwen-32B.
The conclusion is that distilling powerful models into smaller ones works better. In contrast, smaller models using large-scale RL need massive computing power and may still underperform compared to distillation.
This is the most powerful distilled model, based on Llama-3.3-70B-Instruct (chosen for its better reasoning capability than Llama 3.1). With a 94.5 score on MATH-500, it closely rivals DeepSeek-R1 itself. It also achieves the highest coding score (57.5 on LiveCodeBench) among all distilled models.
Here is a high-level comparison of the six distilled models:
Model | Base model | Best for | Reasoning strength | Compute cost |
---|---|---|---|---|
DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | Entry-level reasoning, basic math | 💪💪 | Low |
DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | Mid-tier math & logic tasks | 💪💪💪 | Medium |
DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | Mid-tier math & logic tasks, coding assistance | 💪💪💪 | Medium |
DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | Advanced math & logic tasks, problem-solving, coding assistance | 💪💪💪💪 | Medium-High |
DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | Complex math, logic, & coding tasks, problem-solving, research | 💪💪💪💪💪 | High |
DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | Complex math, logic, & coding tasks, problem-solving, research | 💪💪💪💪💪 | High |
Explore our example projects to deploy the 6 distilled models using BentoML and vLLM.
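If you just want to try one of them locally, a bare-bones vLLM script looks roughly like this (it assumes a GPU with enough memory for the 7B model; the BentoML example projects wrap the same idea in a production-ready service):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
params = SamplingParams(temperature=0.6, max_tokens=2048)

messages = [{"role": "user", "content": "How many positive divisors does 360 have? Please reason step by step."}]
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```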
When DeepSeek released R1-0528, it also announced DeepSeek-R1-0528-Qwen3-8B. By fine-tuning Qwen3-8B on R1-0528's chain-of-thought outputs, this model delivers strong results. It surpasses both Qwen3-8B and Qwen3-32B on AIME benchmarks, and even Qwen3-235B-A22B on AIME 24.
Beyond V3 and R1, DeepSeek has been building domain-focused models to tackle narrower areas.
DeepSeek-Prover-V2-671B: An open-source LLM designed for formal theorem proving in Lean 4. It is able to decompose complex theorems into subgoals and produce verified Lean 4 proofs with chain-of-thought clarity. If you’re working in formal mathematics, automated theorem proving, or symbolic reasoning, DeepSeek-Prover-V2 is the model to explore. Note that DeepSeek-Prover-V2 provides two model sizes: 7B and 671B.
Check out the example to deploy DeepSeek-Prover-V2-671B.
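To give a feel for the target domain, here is a tiny Lean 4 statement of the kind Prover-V2 works with; the proof below is hand-written purely for illustration, not model output.

```lean
-- A toy formal statement plus a machine-checkable proof.
theorem add_comm_nat (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```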
Janus-Series: A family of unified multimodal models for both understanding and generation, including Janus, JanusFlow, and Janus-Pro.
You can adopt the Janus line for applications that need vision + language integration, like multimodal assistants.
DeepSeek has ignited a wave of open-source innovation, with researchers and developers extending their models in creative ways. Here are two examples:
These community efforts demonstrate how open source enables researchers and developers to create more accessible and powerful AI solutions.
Now that you’re familiar with the DeepSeek models, you may be thinking about building your own AI applications with them. At first glance, calling the official DeepSeek API might seem like the easiest solution; it offers fast time to market with no infrastructure burden.
However, this convenience comes with trade-offs, including limited control over where your data goes, potential security and compliance concerns, and little room to customize or optimize inference for your workloads.
As organizations weigh their options, many are turning to private deployment to maintain control, security, and flexibility.
At Bento, we help companies build and scale AI applications securely using any model on any cloud. Our inference platform lets you deploy any DeepSeek variant on any cloud provider or on-premises infrastructure, offering:
Check out the following resources to learn more: