DeepSeek-V3

The base model underlying the R1 series – optimized for tool use

Focus on quality and breadth

The DeepSeek-V3 model at a glance

DeepSeek-V3 is a powerful open-source language model from DeepSeek-AI, released in December 2024. It combines carefully curated training data with a modular architectural approach that delivers both high-quality answers and a strong knowledge base – all with high efficiency. The V3 series is aimed at developers and companies who want to rely on a robust, versatile LLM with transparent license terms.

Name:

DeepSeek-V3

Developer:

DeepSeek-AI

Publication:

December 2024 (technical report; revised February 2025)

License:

The model checkpoints are available via the GitHub repository. The exact license conditions for commercial use are specified in the repository.

Model type:

Mixture-of-Experts (MoE) language model, optimized for high performance with efficient training and inference.

Parameters:

Total: 671 billion, activated per token: 37 billion

Special features of the DeepSeek-V3 model

Multi-head Latent Attention (MLA)

Drastically reduces the key-value (KV) cache during inference through low-rank compression, which increases efficiency for long contexts.
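To make the savings concrete, here is a rough back-of-the-envelope comparison of KV-cache size with and without MLA-style low-rank compression. The layer/head counts and the compressed latent width (512 + 64 dims) follow the published V3 configuration, but treat the numbers as illustrative, not an exact memory model.

```python
# Rough KV-cache comparison: standard multi-head attention vs. MLA-style
# low-rank compression. Illustrative numbers, not an exact memory model.

def kv_cache_bytes(num_layers, seq_len, per_token_dim, dtype_bytes=2):
    """Bytes of KV cache for one sequence at the given per-token width."""
    return num_layers * seq_len * per_token_dim * dtype_bytes

num_layers, seq_len = 61, 128_000
n_heads, head_dim = 128, 128

# Standard MHA caches full keys and values: 2 * n_heads * head_dim per token.
mha = kv_cache_bytes(num_layers, seq_len, 2 * n_heads * head_dim)

# MLA caches a compressed latent plus a small decoupled RoPE key,
# e.g. 512 + 64 dims per token instead of 32768.
mla = kv_cache_bytes(num_layers, seq_len, 512 + 64)

print(f"MHA : {mha / 2**30:.1f} GiB")
print(f"MLA : {mla / 2**30:.1f} GiB")
print(f"ratio: {mha / mla:.0f}x smaller")
```

At a 128K context, the cache shrinks by roughly 57x in this sketch, which is what makes long-context serving tractable.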

DeepSeekMoE

A MoE architecture that relies on “fine-grained” experts (256 routed + 1 shared expert per MoE layer) and enables cost-efficient scaling.
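The routing idea can be sketched in a few lines: each token always passes through the shared expert, and additionally through the top-k of many small routed experts. The toy sizes and stand-in expert functions below are illustrative (V3 itself uses 256 routed experts with top-8 routing per token).

```python
# Minimal sketch of DeepSeekMoE-style routing: one always-on shared expert
# plus the top-k of many small routed experts. Toy sizes; V3 uses
# 256 routed + 1 shared expert with top-8 routing per token.
import math
import random

N_ROUTED, TOP_K = 8, 2  # tiny illustrative sizes

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def route(token_scores):
    """Pick the top-k routed experts and their normalized gate weights."""
    top = sorted(range(N_ROUTED), key=lambda i: token_scores[i], reverse=True)[:TOP_K]
    gates = softmax([token_scores[i] for i in top])
    return list(zip(top, gates))

def moe_layer(x, shared_expert, routed_experts, scores):
    out = shared_expert(x)                    # shared expert sees every token
    for idx, gate in route(scores):
        out += gate * routed_experts[idx](x)  # sparse routed contribution
    return out

random.seed(0)
experts = [lambda x, w=random.random(): w * x for _ in range(N_ROUTED)]
shared = lambda x: 0.5 * x
y = moe_layer(2.0, shared, experts, [random.random() for _ in range(N_ROUTED)])
print(y)
```

Only k of the many routed experts run per token, which is why total parameters (671B) can vastly exceed activated parameters (37B).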

Auxiliary-Loss-Free Load Balancing

An innovative strategy that balances expert load via a learned bias term instead of an auxiliary loss, avoiding the performance degradation caused by conventional balancing losses.
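The core mechanism can be sketched as follows: a per-expert bias is added to the routing scores only for top-k selection, and is nudged after each batch so that overloaded experts become less attractive. The update rule and step size below are simplified assumptions for illustration.

```python
# Sketch of the auxiliary-loss-free balancing idea: a per-expert bias enters
# only the top-k selection (not the gate values) and is nudged after each
# batch. Update rule and step size (GAMMA) are simplified assumptions.

N_EXPERTS, TOP_K, GAMMA = 4, 1, 0.1
bias = [0.0] * N_EXPERTS

def select(scores):
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(N_EXPERTS), key=lambda i: biased[i], reverse=True)[:TOP_K]

def update_bias(load):
    avg = sum(load) / len(load)
    for i, l in enumerate(load):
        if l > avg:
            bias[i] -= GAMMA   # overloaded expert becomes less attractive
        elif l < avg:
            bias[i] += GAMMA   # underloaded expert becomes more attractive

batch = [[0.9, 0.5, 0.4, 0.3]] * 8   # skewed: every token prefers expert 0
total_load = [0] * N_EXPERTS
for _ in range(10):
    load = [0] * N_EXPERTS
    for scores in batch:
        for e in select(scores):
            load[e] += 1
    update_bias(load)
    for i, l in enumerate(load):
        total_load[i] += l

print(total_load)  # load has spread beyond expert 0
```

Because no balancing term is added to the training loss, the gradient signal stays purely about answer quality while the bias quietly evens out expert utilization.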

Multi-Token Prediction (MTP)

The model is trained to predict not only the next token, but several future tokens. This improves overall performance and can be used for speculative decoding to speed up inference.

Extreme training efficiency

Through a co-optimization of algorithms (FP8 training), framework (DualPipe) and hardware, the model was trained at very low cost (only 2.788 million H800 GPU hours).
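The often-quoted headline cost follows directly from the GPU-hour figure and the rental price assumed in the technical report ($2 per H800 GPU hour):

```python
# Back-of-the-envelope training cost, using the rental price assumed
# in the DeepSeek-V3 technical report ($2 per H800 GPU hour).
gpu_hours = 2.788e6
price_per_hour = 2.0
total_cost = gpu_hours * price_per_hour
print(f"${total_cost / 1e6:.3f}M")  # ≈ $5.576M
```

Note that this covers the final training run only, not research, ablations, or data preparation.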

Knowledge distillation

The chat version of the model was refined by distilling reasoning capabilities from the specialized DeepSeek R1 model suite to strike a balance between high accuracy and concise responses.

Long context

After a context-extension phase in training, the model supports contexts of up to 128,000 tokens.

Individual AI consulting

Is DeepSeek-V3 the right model for you?

We would be happy to advise you individually on which AI model suits your requirements. Arrange a no-obligation initial consultation with our AI experts and exploit the full potential of AI for your project!

The post-training pipeline for DeepSeek-V3

Training data & training process

1

Pre-training data

Trained on 14.8 trillion high-quality and diverse tokens. The dataset was enriched with a higher proportion of math and programming data as well as extended multilingual coverage.

2

Pre-training process

  • FP8 Mixed Precision Training: DeepSeek-V3 is one of the first models of this size to be successfully trained with 8-bit floating point numbers (FP8), which significantly accelerates training and reduces memory requirements.
  • Fill-in-Middle (FIM): 10% of the training data was structured in FIM format to optimize the model for code completion tasks.
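The FIM data layout described above can be illustrated with the Prefix-Suffix-Middle (PSM) framing: the file is split into prefix, middle, and suffix, and the model learns to emit the middle after seeing prefix and suffix. The sentinel token names below follow the technical report; treat the exact strings as illustrative.

```python
# Prefix-Suffix-Middle (PSM) layout for fill-in-middle training data.
# Sentinel token names follow the technical report; treat them as illustrative.

def to_fim_psm(prefix, middle, suffix):
    """Arrange a code sample so the model predicts the middle last."""
    return f"<|fim_begin|>{prefix}<|fim_hole|>{suffix}<|fim_end|>{middle}"

sample = to_fim_psm(
    prefix="def add(a, b):\n    ",
    middle="return a + b",
    suffix="\n\nprint(add(1, 2))",
)
print(sample)
```

Training on such samples is what lets the model complete code in the middle of a file, not just at the end.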
3

Post-training (SFT & RL)

  • Supervised Fine-Tuning (SFT): Fine-tuning on a dataset of 1.5 million instances, which includes reasoning data from the DeepSeek R1 model and non-reasoning data.
  • Reinforcement Learning (RL): Using Group Relative Policy Optimization (GRPO) with a combination of rule-based rewards (for math/code) and model-based rewards (for general tasks) to adapt the model to human preferences.
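The distinguishing idea of GRPO can be sketched briefly: instead of a learned value baseline (critic), each sampled answer is scored relative to the mean and spread of its own group of rollouts for the same prompt. This is a simplified sketch of the advantage computation, not the full algorithm.

```python
# Sketch of the group-relative advantage at the heart of GRPO: rewards for a
# group of sampled answers are normalized against their own mean and spread,
# replacing a learned critic. Simplified; not the full GRPO objective.
import statistics

def group_relative_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# e.g. rule-based 0/1 rewards for 4 sampled answers to one math prompt
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)  # → [1.0, -1.0, -1.0, 1.0]
```

Correct answers get a positive advantage, incorrect ones a negative advantage, without ever training a separate value model.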

Hardware requirements (inference)

  • Inference requires a substantial GPU infrastructure. The recommended minimum unit for prefilling consists of 32 GPUs on 4 nodes.
  • The inference for decoding is designed for 320 GPUs on 40 nodes to ensure low latency and high throughput.
  • These requirements make the model primarily suitable for companies and research institutions with large clusters.

Powerful & efficient

Recommended use cases for DeepSeek-V3

Is DeepSeek-V3 the right AI model for your individual use case? We will be happy to advise you comprehensively and personally.

Highly complex math and programming tasks
Sets new standards for non-reasoning models.
Knowledge-intensive tasks
Outperforms other open source models in benchmarks such as MMLU-Pro and GPQA-Diamond.
Processing and analyzing very long documents
Up to 128K tokens can be processed.
Factual question-answer systems
Particular strength in Chinese.
High performance with high efficiency
Development of AI systems that require a balance between high performance and efficiency.
DeepSeek-V3

Strengths & weaknesses of the DeepSeek-V3 model

Strengths

Strongest open source model: Outperforms other open source models at the time of release and is competitive with leading closed models such as GPT-4o and Claude-3.5-Sonnet.

Outstanding efficiency: The combination of MLA, DeepSeekMoE and FP8 training results in extremely low training costs for a model of this size.

Innovative architecture: The auxiliary-loss-free load-balancing strategy and multi-token prediction are novel contributions to LLM development.

Excellent coding and math skills: Leading among all comparable models in these domains.

Very stable training dynamics: The entire pre-training was completed without a single crash or rollback.

Weaknesses & limitations

High inference requirements: Running the model requires a large and complex GPU infrastructure, which limits accessibility for smaller teams or individuals.

Inference speed: Although improved, there is still potential for further optimization of latency at the decoding stage.

Tokenizer bias: The tokenizer used can lead to a “token boundary bias” with certain prompt structures (e.g. multi-line prompts without a trailing line break), even though countermeasures were taken.

Maximize results with the right model

Ready for powerful open source AI?

Use DeepSeek-V3 for productive language processing, prototyping or your own model development – powerful, open and ready for immediate use. Our experts will advise you on the best way to use it and help with hosting, customization or integration into your systems.