Awesome-LLM-Inference
by Community
A curated list of awesome LLM/VLM inference papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, parallelism, etc.
OSS
Added 1 June 2026
Overview
A community-curated GitHub repository that lists papers and code for large language model (LLM) and vision-language model (VLM) inference optimization. It covers techniques such as weight-only INT8/INT4 (WINT8/4) quantization, FlashAttention, PagedAttention, Multi-head Latent Attention (MLA), and parallelism. For each technique, the repo links to the original paper and an implementation.
Best for
Researchers and engineers seeking a concise overview of recent LLM inference optimization techniques and their implementations.
Use cases
- Finding reference implementations of inference optimization techniques like FlashAttention or PagedAttention.
- Exploring quantization methods (e.g., WINT8/4) to reduce model size and speed up inference.
- Learning about parallelism strategies for deploying LLMs at scale.
Notes
16 stars on GitHub. Last updated 2025-03-30. Licensed GPL-3.0.
Pros
- Curated collection saves time by aggregating relevant papers and code.
- Covers a broad range of modern inference optimization methods.
- Provides direct links to resources for quick exploration.
Cons
- Merely a list, not an executable tool or library.
- Limited community validation with only 16 stars.
- May lack detailed tutorials or integration guides.
Indexed from awesome-llm and enriched with the repository's public metadata.
Pairs with
Other entries in the index that connect to this one.
vLLM
Community
A high-throughput and memory-efficient inference and serving engine for LLMs
llama.cpp
Community
LLM inference in C/C++
TensorRT-LLM
Community
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NV
DeepSpeed
Community
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
SGLang
Community
SGLang is a high-performance serving framework for large language models and multimodal models.