Open Source Frameworks medium

TensorRT-LLM

by Community

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NV

Added 1 June 2026

#blackwell #cuda #llm-serving #moe #pytorch

Overview

TensorRT-LLM is a Python framework for defining large language models and optimizing their inference on NVIDIA GPUs. It provides a high-level API for building LLM architectures, applies optimizations such as quantization and kernel fusion, and includes components for creating Python and C++ runtimes that execute inference efficiently.
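
To make the term "quantization" in the overview concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is a generic illustration of the idea, not TensorRT-LLM's implementation or API; the function names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical and chosen only for this example.

```python
# Generic illustration of symmetric INT8 quantization.
# NOT TensorRT-LLM code: all names below are hypothetical.

def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.0, 0.8]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# The round-trip error per value is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scale, max_error)
```

Storing 8-bit codes plus a single scale instead of 16- or 32-bit floats reduces memory traffic, which is one reason quantization tends to help inference throughput. Production frameworks typically use finer-grained scales (per channel or per block) and calibrated ranges rather than this per-tensor scheme.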

Best for

Teams deploying LLMs at scale on NVIDIA infrastructure who need maximum inference performance.

Use cases

Deploying LLMs with low latency on NVIDIA hardware
Optimizing inference throughput for production serving
Building custom inference pipelines with fine-grained control

Notes

13,781 stars on GitHub. Last updated 2026-06-01.

Pros

Deep NVIDIA GPU optimization built in, not bolted on
Supports both Python and C++ runtime generation for flexibility
Active community project with 13k+ stars and regular updates

Cons

Locked to NVIDIA GPUs, no portability to other accelerators
Steeper learning curve than higher-level inference frameworks
Requires understanding of LLM architecture and optimization techniques

Indexed from awesome-llm and enriched against its public facts.

Pairs with

Other entries in the index that connect to this one.

Alternative to (1 entry)

OSS Framework medium

vLLM

Community

A high-throughput and memory-efficient inference and serving engine for LLMs

★ 81,619 updated 1mo ago

Pairs with (4 entries)

OSS Framework medium

Awesome-LLM-Inference

Community

📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉

★ 16 updated 1y ago

OSS Framework medium

NeMo Framework

Community

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech

★ 17,285 updated 1mo ago

OSS Framework medium

SkyPilot

Community

Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, Slurm, 20+ clouds, on-prem).

★ 10,051 updated 1mo ago

OSS Framework medium

Transformer Engine

Community

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide b

★ 3,374 updated 1mo ago

Alternatives (4 entries)

OSS Framework medium

FasterTransformer

Community

Transformer related optimization, including BERT, GPT

★ 6,418 updated 2y ago

OSS Framework medium

LMDeploy

Community

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

★ 7,876 updated 1mo ago

OSS Framework medium

SGLang

Community

SGLang is a high-performance serving framework for large language models and multimodal models.

★ 28,885 updated 1mo ago

OSS Framework medium

vLLM

Community

A high-throughput and memory-efficient inference and serving engine for LLMs

★ 81,619 updated 1mo ago
