Megatron-LM
by Community
Ongoing research training transformer models at scale
OSS
Added 1 June 2026
Overview
Megatron-LM is a Python framework for training large transformer models at scale, developed and maintained by NVIDIA. It provides distributed training optimizations and memory-efficient techniques to handle models that exceed single-GPU capacity.
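As a rough illustration of why such models exceed single-GPU capacity, the sketch below estimates the memory held by weights, gradients, and optimizer state alone. It assumes mixed-precision training with Adam at about 16 bytes per parameter (the accounting used in the ZeRO paper), ignores activations entirely, and uses a hypothetical 80 GiB device. The function names are illustrative and are not part of Megatron-LM.

```python
import math

# Back-of-the-envelope estimate; not a Megatron-LM API.
# Assumption: fp16 weights (2 bytes) + fp16 gradients (2 bytes)
# + fp32 master weights, momentum, and variance for Adam (12 bytes).
BYTES_PER_PARAM = 16


def training_state_gib(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory for weights, gradients, and optimizer state in GiB (activations excluded)."""
    return num_params * bytes_per_param / 2**30


def min_gpus(num_params: int, gpu_gib: float = 80.0) -> int:
    """Fewest devices whose combined memory holds the training state,
    assuming the state is sharded evenly across them (a lower bound)."""
    return max(1, math.ceil(training_state_gib(num_params) / gpu_gib))


if __name__ == "__main__":
    for n in (1_000_000_000, 10_000_000_000, 100_000_000_000):
        print(f"{n:>15,d} params: {training_state_gib(n):8.1f} GiB, at least {min_gpus(n)} GPU(s)")
```

Real requirements are higher once activations, communication buffers, and fragmentation are counted, so the device count here should be read as a floor rather than a sizing recommendation.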
Best for
ML engineers training large transformer models who need production-grade distributed training infrastructure
Use cases
- Training billion-parameter language models across multiple GPUs
- Reducing memory footprint and training time for large transformers
- Implementing pipeline parallelism and tensor parallelism strategies
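The tensor parallelism named in the list above can be sketched on a two-layer MLP: the first weight matrix is split by columns, the second by rows, and the per-rank partial outputs are summed. This is a toy, single-process illustration on plain Python lists, not Megatron-LM code. The "ranks" are simulated, the sum stands in for an all-reduce, and ReLU is swapped in for the usual GeLU so the arithmetic stays exact on integers. All function names are hypothetical.

```python
# Toy tensor parallelism; ranks are simulated inside one process.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise ReLU (stand-in for GeLU in this sketch)."""
    return [[max(0, v) for v in row] for row in m]


def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split_columns(w, parts):
    """Split matrix w into `parts` equal column blocks."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def split_rows(w, parts):
    """Split matrix w into `parts` equal row blocks."""
    step = len(w) // parts
    return [w[i * step:(i + 1) * step] for i in range(parts)]


def mlp_reference(x, w1, w2):
    """Unsharded MLP: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)


def mlp_tensor_parallel(x, w1, w2, ranks=2):
    """Same MLP with w1 sharded by columns and w2 by rows across `ranks`."""
    if len(w1[0]) % ranks != 0:
        raise ValueError("hidden width must be divisible by the number of ranks")
    w1_shards = split_columns(w1, ranks)  # each rank holds a column block of w1
    w2_shards = split_rows(w2, ranks)     # and the matching row block of w2
    # Each rank computes its partial output independently of the others.
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    out = partials[0]
    for p in partials[1:]:
        out = add(out, p)  # stands in for an all-reduce across ranks
    return out


if __name__ == "__main__":
    x = [[1, -2, 3], [0, 4, -1]]
    w1 = [[1, 0, -1, 2], [2, -1, 0, 1], [0, 3, 1, -2]]
    w2 = [[1, 2], [0, -1], [3, 0], [-2, 1]]
    print(mlp_tensor_parallel(x, w1, w2) == mlp_reference(x, w1, w2))
```

The column-then-row split works because the activation is elementwise, so each rank can apply it to its own column block without seeing the others; only the final sum requires communication.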
Notes
16,545 stars on GitHub. Last updated 2026-06-01.
Pros
- Production-grade distributed training infrastructure from NVIDIA
- Significant memory and compute optimizations for large models
- Active research codebase with ongoing improvements
Cons
- Steep learning curve for distributed training concepts
- Requires multi-GPU or multi-node setup to be practical
- Open-source research code with less formal support than commercial alternatives
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
DeepSpeed
Community
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Colossal-AI
Community
Making large AI models cheaper, faster and more accessible
GPT-NeoX
Community
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Community
Megatron-Turing NLG
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Community
BigScience
BLOOMZ&mT0
Community
Megatron-DeepSpeed
Community
Ongoing research training transformer language models at scale, including: BERT & GPT-2
MPT-7B
Community
Introducing MPT-7B, the first entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It is open source, available fo
Nemotron-4-340B
Community
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Community
2021-12
Large Language Model Training in 2023
Community
Learn about large language model training with insights on large language model examples, model architecture, and model training guide.
ModelEditingPapers
Community
Must-read Papers on Knowledge Editing for Large Language Models.
Scaling Instruction-Finetuned Language Models
Community
Flan-T5/PaLM
Scaling Laws for Neural Language Models
Community
Scaling Law
Training Compute-Optimal Large Language Models
Community
Chinchilla
Transformer Engine
Community
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide b
Unifying Language Learning Paradigms
Community
Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-trai
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Community
Microsoft
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Community
The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LL
BMTrain
Community
Efficient Training (including pre-training and fine-tuning) for Big Models
maxtext
Community
A simple, performant and scalable Jax LLM!
nanotron
Community
Minimalistic large language model 3D-parallelism training
OLMo: Accelerating the Science of Language Models
Community
Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have be
torchtitan
Community
A PyTorch native platform for training generative AI models