GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
by Community
2021-12
OSS
Added 1 June 2026
Overview
GLaM (Generalist Language Model) is a family of language models that uses sparsely activated mixture-of-experts (MoE) layers to scale efficiently. A gating network routes each input token to a small subset of experts, so only a fraction of the total parameters is active per token. This reduces compute cost per token while maintaining strong performance. The model was introduced in a December 2021 paper.
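To make the per-token activation concrete, here is a minimal sketch of top-k expert routing in plain Python. It is an illustration under stated assumptions, not GLaM's implementation: the router logits are supplied directly (a real model computes them from the token with a learned gating matrix), the experts are toy scaling functions, and renormalizing the gate weights over the selected experts is one common convention. The names `top_k_route` and `moe_layer` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are renormalized over the selected experts (an assumption;
    implementations differ on this detail).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts are never called."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Toy setup: expert i multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]
token = [1.0, 2.0]
router_logits = [0.1, 2.0, 0.3, 2.0]  # hypothetical gating scores

routes = top_k_route(router_logits)
output = moe_layer(token, router_logits, experts)
```

With these toy values only experts 1 and 3 run for this token; experts 0 and 2 contribute no compute, which is the source of the savings described above.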
Best for
Researchers and engineers exploring efficient scaling of language models
Use cases
- Building large language models with lower training and inference cost
- Experimenting with sparse MoE architectures for natural language tasks
- Scaling model capacity beyond dense transformer limits
Notes
Pros
- Significantly lower FLOPs per token compared to dense models of equivalent size
- Supports scaling total parameter count past one trillion without a proportional increase in compute per token
- Competitive benchmark results relative to dense alternatives
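The gap between total and active parameters can be estimated with simple arithmetic. The sketch below is a rough back-of-the-envelope count, assuming a transformer where every `moe_every`-th feed-forward block is replaced by an MoE block, and ignoring embeddings, router weights, biases, and normalization parameters. The function name and the configuration values are hypothetical and chosen only to keep the numbers small.

```python
def moe_param_counts(num_layers, d_model, d_ff, num_experts, k, moe_every=2):
    """Return (total_params, active_params_per_token) for a simplified MoE transformer.

    Assumptions: each feed-forward block has two projection matrices,
    attention has four d_model x d_model matrices (Q, K, V, output),
    and every `moe_every`-th layer uses an MoE feed-forward block.
    """
    ffn = 2 * d_model * d_ff
    attn = 4 * d_model * d_model
    moe_layers = num_layers // moe_every
    dense_layers = num_layers - moe_layers

    shared = num_layers * attn + dense_layers * ffn
    total = shared + moe_layers * num_experts * ffn   # all experts are stored
    active = shared + moe_layers * k * ffn            # only k experts run per token
    return total, active

# Hypothetical toy configuration.
total, active = moe_param_counts(
    num_layers=4, d_model=8, d_ff=32, num_experts=4, k=2
)
active_fraction = active / total
```

Raising `num_experts` grows `total` but leaves `active` unchanged, which is why capacity can scale faster than per-token compute.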
Cons
- Requires careful load balancing to avoid expert collapse
- Memory footprint increases due to storing multiple expert modules
- Routing overhead can add latency during inference
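The load-balancing concern is usually handled with an auxiliary loss that penalizes uneven expert usage. The sketch below shows one form that is common in the sparse MoE literature: the number of experts times the sum, over experts, of the fraction of tokens dispatched to an expert multiplied by the mean router probability for that expert. It is offered as an illustration only; whether GLaM uses exactly this formulation is not stated in this entry, and `load_balance_loss` is a hypothetical name.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Auxiliary balance term: num_experts * sum_i(frac_i * mean_prob_i).

    router_probs: one probability list per token (each sums to 1).
    assignments:  the expert index each token was dispatched to.
    The value is 1.0 when usage is perfectly uniform and grows as
    tokens concentrate on a few experts.
    """
    n = len(assignments)
    frac = [0.0] * num_experts
    for a in assignments:
        frac[a] += 1.0 / n
    mean_prob = [
        sum(p[i] for p in router_probs) / n for i in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))

# Balanced routing: tokens alternate between two experts.
balanced = load_balance_loss(
    router_probs=[[0.5, 0.5]] * 4, assignments=[0, 1, 0, 1], num_experts=2
)

# Collapsed routing: every token goes to expert 0.
collapsed = load_balance_loss(
    router_probs=[[0.9, 0.1]] * 4, assignments=[0, 0, 0, 0], num_experts=2
)
```

In training, a term like this would typically be added to the main loss with a small coefficient, so the router is nudged toward even usage without overriding the language modeling objective.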
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
DeepSpeed
Community
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Megatron-LM
Community
Ongoing research training transformer models at scale
Colossal-AI
Community
Making large AI models cheaper, faster and more accessible