DeepSeek-V3 Technical Report
by Community
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-eff
OSS
Added 1 June 2026
Overview
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated per token. It adopts the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, and introduces an auxiliary-loss-free load-balancing strategy and a multi-token prediction training objective. The model is pre-trained on 14.8 trillion tokens, followed by supervised fine-tuning and reinforcement learning stages.
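The auxiliary-loss-free idea can be illustrated with a toy router. The sketch below assumes the scheme works roughly as follows: each expert carries a bias that is added to its affinity score only when picking the top-k experts, the gate weights are still computed from the unbiased scores, and after each step the bias is nudged down for overloaded experts and up for underloaded ones. The function names, the skewed random scores, and the step size `gamma` are all hypothetical choices for illustration, not values from the report.

```python
import random

def route_top_k(scores, bias, k):
    """Pick k experts by biased score; gate weights use the unbiased scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

def update_bias(bias, load, gamma):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)
        elif n < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias

def simulate(num_experts=8, k=2, tokens_per_step=512, steps=200, gamma=0.01, seed=0):
    """Return the per-step imbalance (max load / mean load) of a toy router."""
    rng = random.Random(seed)
    # Hypothetical skew: lower-index experts get systematically higher affinity.
    skew = [0.5 - 0.05 * i for i in range(num_experts)]
    bias = [0.0] * num_experts
    imbalance = []
    for _ in range(steps):
        load = [0] * num_experts
        for _ in range(tokens_per_step):
            scores = [rng.random() + skew[i] for i in range(num_experts)]
            for expert in route_top_k(scores, bias, k):
                load[expert] += 1
        mean_load = sum(load) / num_experts
        imbalance.append(max(load) / mean_load)
        bias = update_bias(bias, load, gamma)
    return imbalance

if __name__ == "__main__":
    history = simulate()
    print("first-step imbalance:", round(history[0], 3))
    print("mean imbalance over last 20 steps:", round(sum(history[-20:]) / 20, 3))
```

No balancing term enters the loss here. Only the selection step sees the bias, which is the property the "auxiliary-loss-free" label refers to.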
Best for
Researchers and engineers working on large-scale MoE language models.
Use cases
- Researching efficient MoE architectures for large language models
- Implementing load balancing strategies without auxiliary losses
- Applying multi-token prediction training to improve model performance
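For the multi-token prediction use case, a small sketch can show what the extra training targets look like. It assumes that the main model predicts the next token and that each additional prediction depth k looks k tokens further ahead, with the per-depth losses averaged and scaled by a weight before being added to the main loss. The helper names `mtp_targets` and `combined_loss` and the weight `lam` are hypothetical, and the sketch covers only target construction and loss weighting, not the prediction modules themselves.

```python
def mtp_targets(tokens, depth):
    """Return {k: [(position, target_token), ...]} for k = 0..depth.

    k = 0 is ordinary next-token prediction; depth k >= 1 targets the token
    k positions beyond the next one.
    """
    targets = {}
    for k in range(depth + 1):
        offset = k + 1
        targets[k] = [(i, tokens[i + offset]) for i in range(len(tokens) - offset)]
    return targets

def combined_loss(main_loss, mtp_losses, lam):
    """Main loss plus lam times the mean of the per-depth MTP losses."""
    if not mtp_losses:
        return main_loss
    return main_loss + lam * sum(mtp_losses) / len(mtp_losses)

if __name__ == "__main__":
    for k, pairs in mtp_targets(["a", "b", "c", "d"], depth=1).items():
        print(k, pairs)
```

Each extra depth has one fewer usable position at the end of the sequence, which is why the target lists shrink as k grows.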
Notes
Pros
- Activates only 37B parameters per token for efficient inference
- Auxiliary-loss-free load balancing avoids the performance degradation that auxiliary balancing losses can introduce
- Strong performance from training on 14.8 trillion high-quality tokens
Cons
- Very large total parameter count (671B) demands significant hardware resources
- Technical report may lack accessible implementation details and code
- Pre-training on 14.8T tokens is resource- and time-intensive (the report cites 2.788M H800 GPU hours for the full training run)
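The hardware trade-off behind these points can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes weights-only storage (no KV cache, activations, or optimizer state), decimal gigabytes, and illustrative precisions of 1 and 2 bytes per parameter. The function names are hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in GB (10**9 bytes), weights only."""
    return num_params * bytes_per_param / 1e9

def active_fraction(active_params, total_params):
    """Share of parameters touched per token in a sparse MoE model."""
    return active_params / total_params

if __name__ == "__main__":
    total, active = 671e9, 37e9
    print("1 byte/param:", weight_memory_gb(total, 1), "GB")
    print("2 bytes/param:", weight_memory_gb(total, 2), "GB")
    print("active share:", round(active_fraction(active, total) * 100, 1), "%")
```

All 671B parameters must be resident to serve the model, while only the 37B activated parameters contribute to per-token compute. That is why the same model shows up under both Pros and Cons.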
Indexed from awesome-llm and enriched against its public facts.