DeepSeek-v2-236B-MoE
by Community
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated per token.
OSS
Added 1 June 2026
Overview
DeepSeek-V2 is a Mixture-of-Experts language model with 236B total parameters, activating 21B per token. It uses Multi-head Latent Attention to compress the KV cache into a latent vector for efficient inference, and DeepSeekMoE for economical training via sparse computation. It supports a context length of 128K tokens.
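The sparse activation described above can be illustrated with a small sketch of generic top-k expert routing. This is not DeepSeekMoE's actual routing scheme, only a minimal toy under stated assumptions: the expert count (8), the number of active experts (2), the scalar "experts", and the names `route_top_k` and `moe_forward` are all hypothetical. The only figures taken from this entry are the 236B total and 21B active parameter counts.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so the selected experts sum to 1.
    """
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_scores, k):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))


# Hypothetical toy setup: 8 scalar "experts", 2 active per token.
experts = [lambda x, scale=s: scale * x for s in range(1, 9)]
random.seed(0)
router_scores = [random.gauss(0.0, 1.0) for _ in experts]

selected = route_top_k(router_scores, k=2)
output = moe_forward(3.0, experts, router_scores, k=2)

# Share of parameters used per token, from the figures quoted in this entry.
active_fraction = 21 / 236  # roughly 9% of the total
```

Only the two selected experts are evaluated for the token, which is why per-token compute tracks the active parameter count rather than the total.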
Best for
Developers needing a cost-efficient large language model with long context and sparse activation
Use cases
- Running large-scale language model inference with reduced memory footprint
- Training large models with lower computational cost via sparse activation
- Handling long-context tasks up to 128K tokens
Notes
Pros
- Efficient inference due to KV cache compression with Multi-head Latent Attention
- Economical training through sparse Mixture-of-Experts (only 21B activated per token)
- Supports very long context length of 128K tokens
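To give a feel for why caching a compressed latent vector helps at long context, here is a rough back-of-the-envelope sketch. Every dimension below (layer count, head count, head size, latent size) is a hypothetical placeholder, not DeepSeek-V2's published configuration, and the calculation ignores any extra per-token state a real implementation might cache. Only the 128K context length comes from this entry.

```python
def kv_cache_bytes(tokens, layers, values_per_token_per_layer, bytes_per_value=2):
    """Cache size for one sequence, assuming 16-bit (2-byte) values by default."""
    return tokens * layers * values_per_token_per_layer * bytes_per_value


# Hypothetical dimensions chosen for illustration only.
LAYERS = 32
HEADS = 32
HEAD_DIM = 128
LATENT_DIM = 512

CONTEXT = 128 * 1024  # 128K tokens, as stated in this entry

# Conventional multi-head attention caches a key and a value per head.
full_cache = kv_cache_bytes(CONTEXT, LAYERS, 2 * HEADS * HEAD_DIM)

# A latent-vector cache stores one compressed vector per token per layer.
latent_cache = kv_cache_bytes(CONTEXT, LAYERS, LATENT_DIM)

GIB = 1024 ** 3
print(f"full KV cache:   {full_cache / GIB:.1f} GiB")
print(f"latent KV cache: {latent_cache / GIB:.1f} GiB")
print(f"reduction:       {full_cache / latent_cache:.0f}x")
```

With these placeholder sizes the ratio is simply `2 * HEADS * HEAD_DIM / LATENT_DIM`; the real saving depends on the model's actual dimensions.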
Cons
- Large total parameter count (236B) requires substantial hardware for full model storage
- Community model may lack commercial support or polished documentation
- MoE architectures can introduce load balancing challenges and inference complexity
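The storage cost in the first point can be estimated with simple arithmetic. The sketch below counts weight bytes only, for a few assumed numeric precisions; it leaves out activations, KV cache, and runtime overhead, and the helper name `weight_gigabytes` is made up for this example. Note that all 236B parameters must be stored even though only 21B are used per token.

```python
TOTAL_PARAMS = 236e9   # total parameters, from this entry
ACTIVE_PARAMS = 21e9   # parameters activated per token, from this entry


def weight_gigabytes(params, bits_per_param):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


# Assumed precisions; which ones are usable depends on the serving stack.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    total_gb = weight_gigabytes(TOTAL_PARAMS, bits)
    active_gb = weight_gigabytes(ACTIVE_PARAMS, bits)
    print(f"{label}: {total_gb:.0f} GB stored, {active_gb:.1f} GB read per token")
```

At 16-bit precision this comes to about 472 GB of weights, which is why the full model typically has to be spread across several accelerators.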
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
PyTorch
Community
Tensors and Dynamic neural networks in Python with strong GPU acceleration
DeepSpeed
Community
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
vLLM
Community
A high-throughput and memory-efficient inference and serving engine for LLMs
ollama
Community
Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
DeepSeek-R1
Community
First-generation reasoning models from DeepSeek.