ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
by Community
Microsoft
OSS
Added 1 June 2026
Overview
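ZeRO (Zero Redundancy Optimizer) is a memory optimization technique for distributed training of large deep learning models. Standard data parallelism keeps a full copy of the model states (optimizer states, gradients, and parameters) on every process. ZeRO instead partitions those states across the data-parallel processes, so per-device memory shrinks as more devices are added. The paper argues that this could allow models with trillions of parameters to be trained on existing hardware.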
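The effect of partitioning can be estimated with the memory model from the ZeRO paper. The sketch below assumes mixed-precision Adam training, where each parameter costs 2 bytes for the fp16 weight, 2 bytes for the fp16 gradient, and K = 12 bytes of fp32 optimizer state. The function name `zero_memory_gb` and the stage labels are illustrative, not part of any library, and activations and temporary buffers are ignored.

```python
def zero_memory_gb(params, n_gpus, k=12):
    """Estimate per-GPU model-state memory in GB (1 GB = 1e9 bytes) per ZeRO stage.

    Assumes mixed-precision Adam: 2 bytes/param for fp16 weights, 2 bytes/param
    for fp16 gradients, and k bytes/param of fp32 optimizer state (k = 12).
    """
    weights = 2 * params  # fp16 parameters
    grads = 2 * params    # fp16 gradients
    optim = k * params    # fp32 master weights, momentum, and variance

    bytes_per_stage = {
        # Plain data parallelism: everything replicated on every GPU.
        "baseline": weights + grads + optim,
        # ZeRO-1: optimizer states partitioned.
        "stage1": weights + grads + optim / n_gpus,
        # ZeRO-2: optimizer states and gradients partitioned.
        "stage2": weights + (grads + optim) / n_gpus,
        # ZeRO-3: optimizer states, gradients, and parameters partitioned.
        "stage3": (weights + grads + optim) / n_gpus,
    }
    return {name: b / 1e9 for name, b in bytes_per_stage.items()}


# Example: a 7.5B-parameter model on 64 data-parallel GPUs.
for stage, gb in zero_memory_gb(7.5e9, 64).items():
    print(f"{stage}: {gb:.1f} GB")
```

For this example the estimate falls from 120 GB per GPU with plain data parallelism to about 31.4 GB, 16.6 GB, and 1.9 GB for stages 1, 2, and 3.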
Best for
Researchers and engineers training very large models on distributed GPU clusters
Use cases
- Training large language models with billions of parameters
- Fine-tuning very large pretrained models with limited GPU memory
- Scaling distributed training across many GPUs efficiently
Notes
Pros
- Reduces per-device memory for model states, by up to a factor equal to the number of data-parallel processes when all three stages are enabled
- Enables training of models that would otherwise exceed GPU memory
- Compatible with existing data-parallel training frameworks
Cons
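- Requires choosing a partitioning stage (ZeRO-1, ZeRO-2, or ZeRO-3); higher stages save more memory but partition more state
- Partitioning parameters (ZeRO-3) increases communication, about 1.5x the volume of plain data parallelism by the paper's analysis, which can reduce training throughput
- Not a standalone tool; it must be used through a training framework such as DeepSpeed or PyTorch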
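As a rough illustration of how a stage is selected in practice, the sketch below builds a DeepSpeed-style configuration as a Python dictionary. The key names (`train_micro_batch_size_per_gpu`, `fp16`, `zero_optimization`, `stage`, `overlap_comm`) follow DeepSpeed's JSON configuration format, but they should be checked against the DeepSpeed version in use. The helper `make_zero_config` and the table `COMM_VOLUME_FACTOR` are hypothetical names introduced here; the factors restate the ZeRO paper's communication analysis rather than measured values.

```python
import json

# Communication volume relative to plain data parallelism, per the ZeRO
# paper's analysis: stages 1 and 2 match the baseline, stage 3 is about 1.5x.
# Stage 0 means ZeRO is disabled.
COMM_VOLUME_FACTOR = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.5}


def make_zero_config(stage, micro_batch_size=4):
    """Return a DeepSpeed-style config dict for the given ZeRO stage (hypothetical helper)."""
    if stage not in COMM_VOLUME_FACTOR:
        raise ValueError(f"ZeRO stage must be 0, 1, 2, or 3, got {stage!r}")
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": stage,
            # Overlap gradient communication with the backward pass.
            "overlap_comm": True,
        },
    }


config = make_zero_config(stage=2)
print(json.dumps(config, indent=2))
print("relative communication volume:", COMM_VOLUME_FACTOR[2])
```

A common approach is to start at the lowest stage that fits the model in memory, since stage 3 pays the extra communication cost only when parameter partitioning is actually needed.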
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
Megatron-LM
Community
Ongoing research training transformer models at scale
NeMo Framework
Community
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)