March 6, 2025 • Written By Sherlock Xu
DeepSeek has emerged as a major player in AI, drawing attention not just for its massive 671B models, V3 and R1, but also for its suite of distilled versions. As interest in these models grows, so does the confusion about their differences, capabilities, and ideal use cases.
These questions echo across developer forums, Discord channels, and GitHub discussions. And honestly, the confusion makes sense. DeepSeek’s lineup has expanded rapidly, and without a clear roadmap, it’s easy to get lost in technical jargon and benchmark scores.
In this post, we’ll break down the key differences and help you choose the right model for your needs.
Let’s rewind to December 2024, when DeepSeek dropped V3. It’s a Mixture-of-Experts (MoE) model with 671 billion total parameters, of which 37 billion are activated for each token.
If you’re wondering what Mixture-of-Experts means, it’s actually a cool concept. Essentially, it means the model can activate different parts of itself depending on the task at hand. Instead of using the entire model all the time, it “picks” the right experts for the job. This makes it not just powerful but efficient.
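The routing idea can be sketched in a few lines. This toy example (sizes and weights are made up for illustration; real models like DeepSeek-V3 use far more experts and learned, trained routing) shows a token activating only its top-2 of 8 experts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MoE layer: 8 experts, but only the top 2 run per token.
n_experts, d_model, top_k = 8, 16, 2
router_w = rng.normal(size=(d_model, n_experts))           # router weights
expert_w = rng.normal(size=(n_experts, d_model, d_model))  # one weight matrix per expert

def moe_forward(x):
    """Route a single token vector to its top-k experts."""
    logits = x @ router_w              # score every expert for this token
    top = np.argsort(logits)[-top_k:]  # pick the k highest-scoring experts
    gates = np.exp(logits[top])
    gates /= gates.sum()               # normalize gate weights over the chosen experts
    # Only the selected experts compute anything; the other 6 stay idle.
    return sum(g * (x @ expert_w[i]) for g, i in zip(gates, top)), top

token = rng.normal(size=d_model)
out, used = moe_forward(token)
print(f"activated experts: {sorted(used.tolist())} of {n_experts}")
```

The key point is the last line of `moe_forward`: compute only flows through the selected experts, which is why a 671B-parameter model can run with just 37B parameters active per token.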
What's perhaps most remarkable about DeepSeek-V3 is the training efficiency. Despite its size, the model required only 2.788 million H800 GPU hours, which translates to around $5.6 million in training costs. To put that in perspective, training GPT-4 is estimated to cost between $50–100 million.
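The $5.6 million figure is simple arithmetic, assuming the roughly $2 per H800 GPU-hour rental rate used in DeepSeek's own report:

```python
gpu_hours = 2_788_000    # H800 GPU hours reported for DeepSeek-V3 training
cost_per_gpu_hour = 2.0  # USD; the rental rate assumed in DeepSeek's report

total = gpu_hours * cost_per_gpu_hour
print(f"${total / 1e6:.2f}M")  # ≈ $5.58M, reported as ~$5.6M
```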
DeepSeek-V3 comes in two versions: a Base and a Chat model.
You can check their detailed benchmark performance in the evaluation results.
Both DeepSeek-V3 Base and Chat models are open-source and commercially usable. You can self-host them to build your own ChatGPT-level application. If you’re looking to deploy DeepSeek-V3, check out our example project using BentoML and vLLM.
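Self-hosted deployments built with vLLM typically expose an OpenAI-compatible `/v1/chat/completions` endpoint. Here's a minimal sketch of the request you'd send to such an endpoint; the URL and model name are placeholders for your own deployment:

```python
import json
import urllib.request

# Placeholder endpoint; replace with your own BentoML/vLLM deployment URL.
API_URL = "http://localhost:3000/v1/chat/completions"

payload = {
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [
        {"role": "user", "content": "Summarize what a Mixture-of-Experts model is."}
    ],
    "max_tokens": 256,
}

request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(request) would send it; omitted here so the
# sketch runs without a live server.
print(json.dumps(payload, indent=2))
```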
Deploy DeepSeek-V3
DeepSeek didn’t stop with V3. Just weeks later, they introduced two new models built on DeepSeek-V3-Base: DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero was trained using large-scale reinforcement learning (RL) without the usual step of supervised fine-tuning (SFT). In simple terms, it learned reasoning patterns entirely on its own, refining its abilities through trial and error rather than structured instruction.
While the results were remarkable, there were also trade-offs. R1-Zero occasionally struggled with endless repetition, poor readability, and even language mixing.
To smooth out these rough edges, DeepSeek developed DeepSeek-R1 using a more sophisticated multi-stage training pipeline. This included incorporating thousands of "cold-start" data points to fine-tune the V3-Base model before applying reinforcement learning. The result was R1, a model that not only keeps the reasoning power of R1-Zero but significantly improves accuracy, readability, and coherence.
Unlike V3, which is optimized for general tasks, R1 is a true reasoning model. That means it doesn’t just give you an answer; it explains how it got there. Before responding, R1 generates a step-by-step chain of thought, making it especially useful for complex math, coding, research, and agentic workflows that demand logical reasoning.
Performance-wise, R1 rivals or even surpasses OpenAI o1 (also a reasoning model, though unlike R1 it does not fully disclose its thinking tokens) on math, coding, and reasoning benchmarks. This makes it the most powerful open-source reasoning model available today.
R1 is the engine behind the DeepSeek chat application, and many developers have begun using it for private deployments. If you’re looking to run it yourself, check out our example project using BentoML and vLLM.
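R1 wraps its chain of thought in `<think>…</think>` tags before the final answer, so applications typically split the two before showing anything to users. A minimal sketch of that parsing step (the sample response text is made up for illustration):

```python
import re

# Illustrative R1-style response; real output would come from your deployment.
response = (
    "<think>\nThe user asks for 6 * 7. Multiplying gives 42.\n</think>\n"
    "The answer is 42."
)

# Separate the chain of thought from the final answer.
match = re.search(r"<think>(.*?)</think>\s*(.*)", response, re.DOTALL)
reasoning, answer = match.group(1).strip(), match.group(2).strip()

print("reasoning:", reasoning)
print("answer:", answer)
```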
Deploy DeepSeek-R1
Keep these tips in mind when using R1:

- For math problems, include a directive like: "Please reason step by step, and put your final answer within \boxed{}".
- R1 occasionally skips its thinking pattern and outputs an empty `<think>\n\n</think>` block, which can hurt performance. To encourage thorough reasoning, tell the model to start its response with `<think>\n` in your prompt.

See more recommendations in the DeepSeek-R1 repository.
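The tips above can be sketched as a small helper that builds a math prompt following the recommendations and pulls the final answer out of the `\boxed{}` wrapper (the sample model output is illustrative, not a real response):

```python
import re

def build_math_prompt(question: str) -> str:
    """Follow the R1 guidance: ask for step-by-step reasoning and a boxed answer."""
    return (
        f"{question}\n"
        "Please reason step by step, and put your final answer within \\boxed{}."
    )

def extract_boxed_answer(text: str):
    """Return the content of the last \\boxed{...} in the model's output, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1] if matches else None

prompt = build_math_prompt("What is 12 * 12?")
# Illustrative model output; a real response would come from your deployment.
sample_output = "First, 12 * 12 = 144. So the answer is \\boxed{144}."
print(extract_boxed_answer(sample_output))
```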
DeepSeek-V3 and DeepSeek-R1 are the go-to models for many engineers today, but they serve different purposes. If you’re unsure which model fits your needs, here’s a quick comparison to help you decide:
Item | DeepSeek-V3 | DeepSeek-R1 |
---|---|---|
Base model | DeepSeek-V3-Base | DeepSeek-V3-Base |
Type | General-purpose language model | Reasoning model |
Response style | Direct answers (e.g., "The answer is 42") | Step-by-step reasoning (e.g., "First, calculate X… then Y… so the answer is 42") |
Parameters | 671B (37B activated) | 671B (37B activated) |
Architecture | MoE | MoE |
Context length | 128K | 128K |
License | MIT & Model License | MIT |
Best for | Content creation, writing, translation, general Q&A | Complex math, coding, research, logical reasoning, agentic workflows |
Standard API price* (UTC 00:30-16:30) | $1.10 / million output tokens | $2.19 / million output tokens |
* Pricing is based on data as of March 6, 2025, and may change over time. DeepSeek offers off-peak pricing discounts from 16:30-00:30 UTC daily. Check their API documentation for details.
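As a quick sanity check on what the table's output prices mean in practice, here's a simple estimate using only the output-token rates quoted above (input-token pricing is omitted, and the token counts are hypothetical):

```python
# Standard output-token prices from the table (USD per million tokens, March 2025).
PRICE_PER_M_OUTPUT = {"deepseek-v3": 1.10, "deepseek-r1": 2.19}

def estimate_output_cost(model: str, output_tokens: int) -> float:
    """Estimate spend on output tokens alone."""
    return output_tokens / 1_000_000 * PRICE_PER_M_OUTPUT[model]

# R1 emits thinking tokens too, so its responses are often much longer.
for model, tokens in [("deepseek-v3", 500_000), ("deepseek-r1", 2_000_000)]:
    print(f"{model}: ${estimate_output_cost(model, tokens):.2f}")
```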
In short, I suggest you use V3 for general-purpose work like writing, translation, and Q&A, and reach for R1 when you need step-by-step reasoning for math, coding, or research.
While V3 and R1 are impressive, running them isn’t practical for everyone. They require 8 NVIDIA H200 GPUs with 141GB of memory each.
That’s where distilled DeepSeek models come in. These smaller, more efficient models bring the reasoning power of R1 to a more accessible scale. Instead of training new models from scratch, DeepSeek took a smart shortcut: they fine-tuned strong existing open-source models (Qwen and Llama variants) on roughly 800K reasoning samples curated with DeepSeek-R1.
Unlike R1, these distilled models rely solely on SFT and they do not include an RL stage.
Despite their smaller size, these models perform remarkably well on reasoning tasks, proving that large-scale AI reasoning can be efficiently distilled. DeepSeek has open-sourced all six distilled models, ranging from 1.5B to 70B parameters.
But before I dive into the individual models, let’s take a quick look at the technique that makes this possible: distillation.
Distillation is a technique that transfers knowledge from a large, powerful model to a smaller, more efficient one. Instead of training on raw data, the smaller model learns to mimic the larger model’s behavior.
A great analogy comes from the research paper Distilling the Knowledge in a Neural Network by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. They compare distillation to how insects develop: the larval form is optimized for extracting energy and nutrients from its environment, while the adult form is optimized for entirely different demands, like traveling and reproduction.
Similarly, large AI models are trained with huge datasets and high computational power to extract deep knowledge. But deploying them at scale requires something faster and lighter. That’s where distillation comes in. It compresses the intelligence of a large model into a smaller model, making it more practical for real-world applications.
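To make the idea concrete, here's a toy sketch of classic soft-label distillation from the Hinton et al. paper: the student is trained to match the teacher's temperature-softened output distribution. (The logits are made up, and note that DeepSeek's actual distilled models were produced differently, via SFT on R1-generated text.)

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T produces softer targets."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up logits over 4 classes for one training example.
teacher_logits = np.array([8.0, 2.0, 1.0, 0.5])
student_logits = np.array([3.0, 2.5, 1.0, 0.8])

T = 4.0  # temperature softens the teacher's distribution
teacher_soft = softmax(teacher_logits, T)
student_soft = softmax(student_logits, T)

# Distillation loss: KL divergence from the student to the teacher's
# soft targets. Training drives this toward zero.
kl = float(np.sum(teacher_soft * np.log(teacher_soft / student_soft)))
print(f"KL(teacher || student) = {kl:.4f}")
```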
Now, let’s take a closer look at each distilled model.
This is the smallest model in the lineup with decent math and reasoning ability. It outperforms GPT-4o and Claude-3.5-Sonnet on AIME and MATH-500, making it a good choice for lightweight problem-solving. However, it struggles with coding tasks, scoring only 16.9 on LiveCodeBench, meaning it's not ideal for programming applications.
A step up from the 1.5B model, this version offers stronger performance in mathematical reasoning and general problem-solving. It scores well on AIME (55.5) and MATH-500 (92.8), but still lags behind in coding benchmarks (37.6 on LiveCodeBench).
Based on the Llama 3.1 architecture, this model shows strong mathematical reasoning. It not only surpasses GPT-4o and Claude-3.5-Sonnet in AIME and MATH-500, but also performs very close to o1-mini and QwQ-32B-Preview in MATH-500. Its coding benchmarks suggest solid competitive-programming ability, but it's still not on par with larger models.
This is a balanced model that offers strong reasoning, math, and general logic capabilities. It’s a great middle ground for those needing better accuracy without high computational costs. While the coding performance is not the best, it’s very close to o1-mini.
This is one of the best-performing distilled models. With top-tier reasoning (72.6 on AIME, 94.3 on MATH-500) and a strong CodeForces rating (1691), it's a great option for math-heavy applications, competitive problem-solving, and advanced AI research.
An interesting research insight: DeepSeek used this model to compare distillation and RL for reasoning tasks. They tested whether a smaller model trained through large-scale RL could match the performance of a distilled model.
To explore this, they trained Qwen-32B-Base with math, coding, and STEM data for over 10,000 RL steps, resulting in DeepSeek-R1-Zero-Qwen-32B.
The conclusion is that distilling powerful models into smaller ones works better. In contrast, smaller models using large-scale RL need massive computing power and may still underperform compared to distillation.
This is the most powerful distilled model, based on Llama-3.3-70B-Instruct (chosen for its better reasoning capability than Llama 3.1). With a 94.5 score on MATH-500, it closely rivals DeepSeek-R1 itself. It also achieves the highest coding score (57.5 on LiveCodeBench) among all distilled models.
Here is a high-level comparison of the six distilled models:
Model | Base model | Best for | Reasoning strength | Compute cost |
---|---|---|---|---|
DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | Entry-level reasoning, basic math | 💪💪 | Low |
DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | Mid-tier math & logic tasks | 💪💪💪 | Medium |
DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | Mid-tier math & logic tasks, coding assistance | 💪💪💪 | Medium |
DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | Advanced math & logic tasks, problem-solving, coding assistance | 💪💪💪💪 | Medium-High |
DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | Complex math, logic, & coding tasks, problem-solving, research | 💪💪💪💪💪 | High |
DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | Complex math, logic, & coding tasks, problem-solving, research | 💪💪💪💪💪 | High |
Explore our example projects to deploy all six distilled models using BentoML and vLLM.
DeepSeek has ignited a wave of open-source innovation, with researchers and developers extending their models in creative ways.
These community efforts demonstrate how open source enables researchers and developers to create more accessible and powerful AI solutions.
Now that you’re familiar with the DeepSeek models, you may be thinking about building your own AI applications with them. At first glance, calling the official DeepSeek API might seem like the easiest solution; it offers fast time to market with no infrastructure burden.
However, this convenience comes with trade-offs, including data privacy concerns, rate limits, unpredictable costs at scale, and limited control over model versions and customization.
As organizations weigh their options, many are turning to private deployment to maintain control, security, and flexibility.
At BentoML, we help companies build and scale AI applications securely using any model on any cloud. Our AI inference platform BentoCloud lets you deploy any DeepSeek variant on any cloud provider or on-premises infrastructure.
Check out the following resources to learn more: