DeepSeek-v2-236B-MoE

by Community

Added 1 June 2026

Overview

DeepSeek-V2 is a Mixture-of-Experts (MoE) language model with 236B total parameters, of which 21B are activated per token. It uses Multi-head Latent Attention (MLA), which compresses the KV cache into a latent vector for efficient inference, and the DeepSeekMoE architecture, which keeps training economical through sparse computation. It supports a context length of 128K tokens.
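
To make "activated per token" concrete, here is a minimal sketch of generic top-k expert routing, the general idea behind sparse MoE layers. It is not the DeepSeekMoE design itself: the function names, the four toy experts, the router scores, and k=2 are all hypothetical values chosen for illustration.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and normalise their weights."""
    # Rank expert indices by router score, highest first.
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over the chosen scores only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of the k selected experts only."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Four toy "experts": hypothetical scalar functions standing in for FFN blocks.
experts = [lambda x, s=s: s * x for s in (0.0, 1.0, 2.0, 3.0)]
scores = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

routes = top_k_route(scores, k=2)
y = moe_forward(3.0, experts, scores, k=2)
```

Only the two selected experts are evaluated for this token; the other two contribute no computation. That is why the active parameter count can be far smaller than the total.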

Best for

Developers needing a cost-efficient large language model with long context and sparse activation

Use cases

  • Running large-scale language model inference with reduced memory footprint
  • Training large models with lower computational cost via sparse activation
  • Handling long-context tasks up to 128K tokens
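
The memory benefit of caching one latent vector per token, rather than full keys and values, can be estimated with simple arithmetic. The sketch below is a rough illustration only: the layer count, head count, head size, and latent size are hypothetical placeholders, not DeepSeek-V2's configuration. It reads 128K as 128 * 1024 tokens and ignores any extra per-token state a real implementation may cache.

```python
def full_kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_value=2):
    """Cache size when every layer stores full keys and values per token."""
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_value

def latent_kv_cache_bytes(n_layers, latent_dim, seq_len, bytes_per_value=2):
    """Cache size when every layer stores one compressed latent vector per token."""
    return n_layers * latent_dim * seq_len * bytes_per_value

SEQ_LEN = 128 * 1024   # assumption: "128K tokens" read as 131,072
N_LAYERS = 4           # hypothetical toy value
N_HEADS = 8            # hypothetical toy value
HEAD_DIM = 64          # hypothetical toy value
LATENT_DIM = 128       # hypothetical toy value

full = full_kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)
latent = latent_kv_cache_bytes(N_LAYERS, LATENT_DIM, SEQ_LEN)
ratio = full / latent

print(f"full KV cache:   {full / 2**20:.0f} MiB")
print(f"latent KV cache: {latent / 2**20:.0f} MiB")
print(f"reduction:       {ratio:.1f}x")
```

With these toy values the ratio is 2 * N_HEADS * HEAD_DIM / LATENT_DIM, so the saving grows as the latent vector shrinks relative to the combined head dimensions.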

Pros

  • Efficient inference due to KV cache compression with Multi-head Latent Attention
  • Economical training through sparse Mixture-of-Experts (only 21B activated per token)
  • Supports very long context length of 128K tokens

Cons

  • All 236B parameters must be stored even though only 21B are active per token, so hosting the full model requires substantial hardware
  • Community model may lack commercial support or polished documentation
  • MoE architectures can introduce load balancing challenges and inference complexity
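
As a rough guide to the storage point above, the following back-of-the-envelope sketch estimates weight memory for 236B parameters at a few storage precisions. The precisions listed are illustrative assumptions, not formats this listing says are published, and the figures cover weights only (no KV cache, activations, or runtime overhead).

```python
TOTAL_PARAMS = 236e9    # total parameters, from the listing
ACTIVE_PARAMS = 21e9    # parameters activated per token, from the listing

def weight_gb(n_params, bytes_per_param):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

# Assumed storage precisions, chosen for illustration only.
BYTES_PER_PARAM = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}

estimates = {name: weight_gb(TOTAL_PARAMS, size)
             for name, size in BYTES_PER_PARAM.items()}
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.0f} GB of weights")
print(f"active per token: {active_fraction:.1%} of all parameters")
```

Sparse activation lowers compute per token, but the storage estimate is driven by the total parameter count, since any expert may be selected for a given token.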

Indexed from awesome-llm and enriched against its public facts.
