Transformer Engine
by Community
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide b
OSS
Added 1 June 2026
Overview
Transformer Engine is a Python library that accelerates Transformer models on NVIDIA GPUs by leveraging low-precision floating point formats (FP8 and FP4). It targets Hopper, Ada, and Blackwell architectures to improve performance and reduce memory usage during both training and inference.
Best for
Developers training or deploying large transformer models on modern NVIDIA GPUs who need to maximize performance and minimize memory usage
Use cases
- Training large language models with reduced memory footprint
- Running inference on transformer models with higher throughput
- Fine-tuning transformers on GPU clusters with limited VRAM
Notes
3,374 stars on GitHub. Last updated 2026-06-01. Licensed Apache-2.0.
Pros
- Lowers memory use for tensors held in FP8 or FP4 relative to FP16 or FP32 storage
- Optimized for the latest NVIDIA GPU families (Hopper, Ada, Blackwell)
- Supports both training and inference for transformer architectures
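The memory point above comes down to bits per stored value. The sketch below is back-of-the-envelope arithmetic only, not a measurement of Transformer Engine: the 7-billion-parameter count is a hypothetical example, the helper names are made up for illustration, and the figures ignore scaling factors, optimizer state, activations, and any higher-precision copies a training setup may keep.

```python
# Bits used to store one value in each format.
BITS_PER_VALUE = {"FP32": 32, "FP16": 16, "FP8": 8, "FP4": 4}


def tensor_bytes(num_values: int, fmt: str) -> int:
    """Bytes needed to store num_values values in fmt, rounded up to whole bytes."""
    bits = num_values * BITS_PER_VALUE[fmt]
    return (bits + 7) // 8


def footprint_gib(num_values: int, fmt: str) -> float:
    """Same quantity expressed in GiB (2**30 bytes)."""
    return tensor_bytes(num_values, fmt) / 2**30


# Hypothetical model size, chosen only to make the ratios concrete.
NUM_PARAMS = 7_000_000_000

for fmt in BITS_PER_VALUE:
    print(f"{fmt}: {footprint_gib(NUM_PARAMS, fmt):.2f} GiB for weights alone")
```

The ratios are the useful part: the same tensor takes half the space in FP8 as in FP16, and a quarter of the space in FP4.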
Cons
- Requires a compatible NVIDIA GPU to use the low-precision formats: Hopper, Ada, or Blackwell for FP8, with FP4 tied to Blackwell-class hardware
- Limited to specific precision formats; not a general-purpose optimization library
- May need code modifications to integrate into existing PyTorch workflows
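To give a sense of what the code modifications in the last point can look like, here is a minimal sketch modeled on the project's published PyTorch quickstart: a `torch.nn.Linear`-style layer is swapped for `te.Linear` and the forward pass is wrapped in an FP8 autocast context. The wrapper functions `te_available` and `run_fp8_linear_demo` and the tensor sizes are illustrative assumptions, and names such as `fp8_autocast` may differ between releases, so check the documentation for the installed version. Running the demo requires an FP8-capable GPU plus the `torch` and `transformer_engine` packages.

```python
import importlib.util


def te_available() -> bool:
    """Return True if the transformer_engine package can be found (hypothetical helper)."""
    return importlib.util.find_spec("transformer_engine") is not None


def run_fp8_linear_demo():
    """One forward/backward pass through an FP8-enabled linear layer.

    Imports are local so this file can be loaded on machines without
    a GPU or without Transformer Engine installed.
    """
    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe

    # Illustrative sizes, not recommendations.
    in_features, out_features, batch = 768, 3072, 2048

    # Drop-in replacement for a standard linear layer.
    model = te.Linear(in_features, out_features, bias=True)
    inp = torch.randn(batch, in_features, device="cuda")

    # The recipe controls how FP8 scaling factors are chosen.
    fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

    # Only the forward pass sits inside the autocast context.
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        out = model(inp)

    out.sum().backward()
    return tuple(out.shape)
```

A caller would typically check `te_available()` first and call `run_fp8_linear_demo()` only on supported hardware. The rest of an existing training loop (optimizer, data loading) can stay as it is in this pattern.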
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
DeepSpeed
Community
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Megatron-LM
Community
Ongoing research training transformer models at scale
vLLM
Community
A high-throughput and memory-efficient inference and serving engine for LLMs