FasterTransformer
by Community
Transformer related optimization, including BERT, GPT
OSS
Added 1 June 2026
Overview
FasterTransformer is an open-source framework that accelerates transformer model inference. It implements optimized kernels and memory management for models like BERT and GPT. Written in C++, it provides high-performance execution on NVIDIA GPUs.
Best for
Developers seeking maximum inference performance for transformer models on NVIDIA hardware
Use cases
- Deploying large BERT models for low-latency inference
- Running GPT-based text generation with higher throughput
- Optimizing transformer inference on NVIDIA GPUs
Notes
6,418 stars on GitHub. Last updated 2024-03-27. Licensed Apache-2.0.
Pros
- Delivers highly optimized inference speed for supported transformer models such as BERT and GPT
- Strong community adoption (6,418 stars on GitHub), though the repository was last updated 2024-03-27
- Tuned specifically for NVIDIA GPU architectures
Cons
- Limited to NVIDIA GPUs, no CPU or other hardware support
- C++ codebase requires compilation and integration effort
- Does not offer a high-level API; manual configuration needed
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
TensorRT-LLM
Community
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NV
vLLM
Community
A high-throughput and memory-efficient inference and serving engine for LLMs
DeepSpeed
Community
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.