vLLM
by Community
A high-throughput and memory-efficient inference and serving engine for LLMs
OSS
Added 1 June 2026
Overview
vLLM is a Python framework for serving large language models with optimized throughput and memory efficiency. It uses techniques such as paged attention and continuous batching to reduce latency and increase request throughput compared with standard inference servers. It is designed for production deployments that need to handle multiple concurrent requests.
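To make the paging idea behind paged attention concrete, here is a minimal sketch in plain Python. It is not vLLM code: `BlockAllocator`, `BLOCK_SIZE`, and `blocks_needed` are hypothetical names chosen for illustration, and the block size of 4 tokens is an assumption. It only shows how a key-value cache can be split into fixed-size blocks that are handed out on demand, so a request reserves memory as it grows rather than up front.

```python
# Toy model of block-based KV-cache allocation (illustrative only, not vLLM's API).

BLOCK_SIZE = 4  # tokens per cache block; an assumed value for this sketch


class BlockAllocator:
    """Hands out fixed-size cache blocks from a shared pool."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))      # ids of unused blocks
        self.tables: dict[str, list[int]] = {}   # request id -> block ids it owns
        self.lengths: dict[str, int] = {}        # request id -> tokens stored

    def append_token(self, req: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        n = self.lengths.get(req, 0)
        if n % BLOCK_SIZE == 0:  # no block yet, or the last block is full
            if not self.free:
                raise MemoryError("no free KV-cache blocks")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req: str) -> None:
        """Return a finished request's blocks to the pool."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)


def blocks_needed(num_tokens: int) -> int:
    """Blocks required to hold num_tokens (ceiling division)."""
    return -(-num_tokens // BLOCK_SIZE)
```

In this toy model a request holding 5 tokens occupies 2 blocks, and those blocks return to the pool as soon as the request is released, which is the property that lets many requests share one cache.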
Best for
Teams building production LLM APIs and services that need to maximize throughput and minimize latency under concurrent load.
Use cases
- Running inference servers that handle high request volume with low latency
- Reducing GPU memory footprint when serving large models
- Batching and scheduling inference requests efficiently
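The batching and scheduling use case can be sketched with a small simulation. This is a simplified model, not vLLM's scheduler: both function names are hypothetical, every request is assumed to need a known positive number of decode steps, and each step is assumed to yield one token per running request. It compares static batching, where a batch runs until its longest request ends, with continuous batching, where a finished request's slot is refilled on the next step.

```python
from collections import deque


def continuous_batching_steps(lengths: list[int], max_batch: int) -> int:
    """Count decode steps when freed batch slots are refilled immediately."""
    waiting = deque(lengths)   # tokens still to generate, per queued request
    running: list[int] = []    # tokens still to generate, per active request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free batch slots.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step: each running request emits a token; finished ones leave.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


def static_batching_steps(lengths: list[int], max_batch: int) -> int:
    """Count decode steps when each batch runs until its longest request ends."""
    return sum(
        max(lengths[i:i + max_batch]) for i in range(0, len(lengths), max_batch)
    )
```

With request lengths `[8, 2, 2, 2]` and a batch size of 2, the static version takes 10 steps and the continuous version takes 8, because the short requests no longer wait for the long one to finish. How large the gap is depends on how uneven the request lengths are.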
Notes
81,619 stars on GitHub. Last updated 2026-06-01. Licensed Apache-2.0.
Pros
- Significantly higher throughput than standard LLM serving approaches
- Lower memory consumption enables serving larger models on the same hardware
- Active community with 81k+ GitHub stars and ongoing development
Cons
- Built around Python and GPU infrastructure; CPU-only deployments are not its primary target
- Steeper learning curve than simple inference libraries for basic use cases
- Performance gains depend on workload characteristics and batch patterns
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
LangChain
Community
The agent engineering platform.
LiteLLM 🚅
Community
Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, loadbalancing and logging. [Bedrock, Azure, OpenAI, Vertex
TensorRT-LLM
Community
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NV
SGLang
Community
SGLang is a high-performance serving framework for large language models and multimodal models.
LMDeploy
Community
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
OpenLLM
Community
Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI compatible API endpoint in the cloud.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Community
BigScience
distilabel
Community
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
GPUStack
Community
A GPU cluster manager that configures and orchestrates inference engines like vLLM and SGLang for high-performance AI model deployment.
lighteval
Community
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
LLMKube
Community
Kubernetes operator for local LLM inference with llama.cpp, vLLM, TGI, and mlx-server — multi-GPU NVIDIA + Apple Silicon Metal, autoscaling, air-gapped, production-ready
lm-evaluation-harness
Community
A framework for few-shot evaluation of language models.
OpenModelZ
Community
Autoscale LLM (vLLM, SGLang, LMDeploy) inferences on Kubernetes (and others)
OpenRLHF
Community
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Async RL)
Outlines
Community
Structured Outputs
Qwen2-Math-1.5B|7B|72B
Community
This model mainly supports English. We will release bilingual (English and Chinese) math models soon. Over the past year, w
Tune Studio
Community
Playground for devs to finetune & deploy LLMs
DeepSeek
Various
Org profile for DeepSeek on Hugging Face, the AI community building the future.
Forefront
Various
Forefront is a platform to fine-tune and inference open-source-language-models.
Mistral
Various
The most powerful AI platform for enterprises. Customize, fine-tune, and deploy AI assistants, autonomous agents, and multimodal AI with open models.
Vicuna-13B
Various
We introduce Vicuna-13B, an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge s
Jwrede/llmprobe
Various
Synthetic monitoring and CI smoke tests for LLM inference endpoints.
Awesome-LLM-Compression
Community
Awesome LLM compression research papers and tools.
Awesome-LLM-Inference
Community
📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉
awesome-llm-webapps
Community
A collection of open source, actively maintained web apps for LLM applications
Axolotl
Community
Go ahead and axolotl questions
Baichuan-7|13B
Community
AGI Large Language Models
Bifrost
Community
Fastest enterprise AI gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.
CodeQwen1.5-7B
Community
The advent of advanced programming tools, which harness the power of large language models (LLMs), has significantly en
Codestral-7|22B
Community
The most powerful AI platform for enterprises. Customize, fine-tune, and deploy AI assistants, autonomous agents, and multimodal AI with open models.
DeepSeek-Math-7B
Community
DeepSeek Math series
DeepSeek-R1
Community
First-generation reasoning models from DeepSeek.
DeepSeek-v2-236B-MoE
Community
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of whic
DeepSeek-V2.5
Community
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
DeepSeek-VL-1.3|7B
Community
DeepSeek-VL model series
Falcon 40B
Community
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
Fiddler AI
Community
Fiddler Auditor is a tool to evaluate language models.
Flyflow
Community
Open source, high performance fine tuning as a service for GPT4 quality models with 5x lower latency and 3x lower cost
Gemma
Community
Gemma2-9|27B
Community
Gemma 2, our next generation of open models, is now available globally for researchers and developers.
GLM-130B: An Open Bilingual Pre-trained Model
Community
GLM-130B
GLM-2|6|10|13|70B
Community
Org profile for THUDM on Hugging Face, the AI community building the future.
Grok-1-314B-MoE
Community
Grok-1-314B-MoE — indexed from awesome-llm
Haystack
Community
Create agentic, context engineered AI systems using Haystack’s modular and customizable building blocks, built for real-world, production-ready applications.
Improving language models by retrieving from trillions of tokens
Community
Publications — Google DeepMind
Infinity
Community
Infinity is a high-throughput, low-latency serving engine for text-embeddings, reranking models, clip, clap and colpali
InternLM2-1.8|7|20B
Community
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
KubeAI
Community
AI Inference Operator for Kubernetes. The easiest way to serve ML models in production. Supports VLMs, LLMs, embeddings, and speech-to-text.
Llama 1-7|13|33|65B
Community
[OPT-1.3 6.7 13 30 66B](https://arxiv.org/abs/2205.01068)
Llama 2: Open Foundation and Fine-Tuned Chat Models
Community
2023-07
Llama 3.2-1|3|11|90B
Community
[Llama 3.1-8 70 405B](https://llama.meta.com/)
Llama 3-8|70B
Community
[Llama 2-7 13 70B](https://llama.meta.com/llama2/)
LLaMA: Open and Efficient Foundation Language Models
Community
2023-02
maxtext
Community
A simple, performant and scalable Jax LLM!
Meta Lingua
Community
Meta Lingua: a lean, efficient, and easy-to-hack codebase to research LLMs.
MiniCPM-2B
Community
The MiniCPM family of LLMs and VLLMs.
Mistral 7B
Community
Mistral 7B
Mixtral-8x7B
Community
The most powerful AI platform for enterprises. Customize, fine-tune, and deploy AI assistants, autonomous agents, and multimodal AI with open models.
Moonlight-A3B
Community
Moonshot's Compute-efficient MoE LLM, first Scaling Up of Muon Optimizer
MPT-7B
Community
Introducing MPT-7B, the first entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It is open source, available fo
Nemotron-4-340B
Community
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
OLMo-7B
Community
Artifacts for the first set of OLMo models.
OLMoE: Open Mixture-of-Experts Language Models
Community
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input
OLMO-eval
Community
Evaluation suite for LLMs
OpenELM-1.1|3B
Community
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
Phi1-1.3B
Community
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
Qwen-1.8B|7B|14B|72B
Community
Qwen - a Qwen Collection
Qwen2-0.5B|1.5B|7B|57B-A14B-MoE|72B
Community
After months of efforts, we are pleased to announce the evolution from Qwen1.5 to Qwen2. This time, we bring to you: Pret
Qwen2.5-1M-7|14B
Community
Two months after upgrading Qwen2.5-Turbo to support context length up to one mi
Qwen2.5 Technical Report
Community
In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been si
Qwen2.5-Max
Community
It is widely recognized that continuously scaling both data size and model size can lead to significant improvements in model intelligence. However, th
ray-llm
Community
RayLLM - LLMs on Ray (Archived). Read README for more info.
Semantic Kernel
Microsoft
Microsoft's enterprise-flavoured framework for AI agents. .NET-first, with Python and Java siblings.
SkyPilot
Community
Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, Slurm, 20+ clouds, on-prem).
StableLM-3B
Community
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
StableLM-v2-12B
Community
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
StarCoder-1|3|7B
Community
All models, datasets, and demos related to StarCoder!
The Llama 3 Herd of Models
Community
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models
torchtitan
Community
A PyTorch native platform for training generative AI models
Transformer Engine
Community
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide b
veRL
Community
verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework
Yi-34B
Community
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Community
Megatron-LM
Qwen
Various
Qwickly forging AGI, enhancing intelligence.
FasterTransformer
Community
Transformer related optimization, including BERT, GPT
FlexGen
Community
Running large language models on a single GPU for throughput-oriented scenarios.
IntelliServer
Community
AI models as scalable microservices, enabling evaluation of LLMs and offering end-to-end functions such as chatbot, semantic search, image generation and beyond.
llama.cpp
Community
LLM inference in C/C++
mistral.rs
Community
Fast, flexible LLM inference
ollama
Community
Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
text-generation-inference
Community
Large Language Model Text Generation Inference
TGI
Community
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
Triton Server (TRTIS)
Community
The Triton Inference Server provides an optimized cloud and edge inferencing solution.