TensorRT-LLM
by Community
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NV
OSS
Added 1 June 2026
Overview
TensorRT-LLM is a Python framework for defining and optimizing large language model inference on NVIDIA GPUs. It provides a high-level API to build LLM architectures and applies state-of-the-art optimizations like quantization and kernel fusion, then generates Python and C++ runtimes to execute inference efficiently.
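As a rough illustration of the high-level API described above, the sketch below uses the `LLM` and `SamplingParams` classes from the `tensorrt_llm` package. The helper name `generate_with_trtllm`, the prompts, and the sampling values are assumptions chosen for illustration, not part of this entry. Running it requires the library and a supported NVIDIA GPU.

```python
# Sketch only: needs the tensorrt_llm package and a supported NVIDIA GPU to run.
# PROMPTS and generate_with_trtllm are illustrative names, not library symbols.
PROMPTS = [
    "Hello, my name is",
    "The capital of France is",
]


def generate_with_trtllm(model_id, prompts, temperature=0.8, top_p=0.95):
    """Return one generated continuation per prompt."""
    # Imported lazily so this file still loads on machines without the library.
    from tensorrt_llm import LLM, SamplingParams

    sampling = SamplingParams(temperature=temperature, top_p=top_p)
    # model_id is a Hugging Face model ID or a local checkpoint path.
    llm = LLM(model=model_id)
    results = llm.generate(prompts, sampling)
    # Each result carries one or more candidate outputs; keep the first.
    return [result.outputs[0].text for result in results]
```

Calling `generate_with_trtllm("TinyLlama/TinyLlama-1.1B-Chat-v1.0", PROMPTS)` would be one way to try it; that model ID is only an example of a small checkpoint, and any supported model could be substituted.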
Best for
Teams deploying LLMs at scale on NVIDIA infrastructure who need maximum inference performance.
Use cases
- Deploying LLMs with low latency on NVIDIA hardware
- Optimizing inference throughput for production serving
- Building custom inference pipelines with fine-grained control
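Latency and throughput, the two goals named in the use cases above, can be measured with a small backend-agnostic harness. The sketch below is a minimal example under stated assumptions: `benchmark` and `fake_generate` are hypothetical helpers, and the stand-in backend simply splits prompts into words so the script runs without a GPU. A real measurement would pass in a function that calls the actual inference engine.

```python
import time


def benchmark(generate, prompts, count_tokens=len):
    """Time one batched call; report latency and token throughput."""
    start = time.perf_counter()
    outputs = generate(prompts)
    elapsed = time.perf_counter() - start
    total_tokens = sum(count_tokens(output) for output in outputs)
    return {
        "latency_s": elapsed,
        "requests": len(outputs),
        "tokens": total_tokens,
        "tokens_per_s": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }


def fake_generate(prompts):
    # Hypothetical stand-in for a real engine: echoes each prompt's words
    # plus an end-of-sequence marker after a short simulated delay.
    time.sleep(0.01)
    return [prompt.split() + ["<eos>"] for prompt in prompts]


stats = benchmark(fake_generate, ["a b c", "d e"])
print(stats["requests"], "requests,", stats["tokens"], "tokens")
```

A single batched call like this reports end-to-end latency for the whole batch. Per-request latency under concurrent load needs a different setup, such as timing individual requests against a running server.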
Notes
13,781 stars on GitHub. Last updated 2026-06-01.
Pros
- Deep NVIDIA GPU optimization built in, not bolted on
- Supports both Python and C++ runtime generation for flexibility
- Active community project with 13k+ stars and regular updates
Cons
- Locked to NVIDIA GPUs, no portability to other accelerators
- Steeper learning curve than higher-level inference frameworks
- Requires understanding of LLM architecture and optimization techniques
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
Awesome-LLM-Compression
Community
Awesome LLM compression research papers and tools.
Awesome-LLM-Inference
Community
📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉
Nemotron-4-340B
Community
NVIDIA's Nemotron-4 340B family of open large language models.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Community
Megatron-LM
FasterTransformer
Community
Transformer related optimization, including BERT, GPT
LMDeploy
Community
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
SGLang
Community
SGLang is a high-performance serving framework for large language models and multimodal models.
vLLM
Community
A high-throughput and memory-efficient inference and serving engine for LLMs