FlexGen
by Community
Running large language models on a single GPU for throughput-oriented scenarios.
OSS
Added 1 June 2026
Overview
FlexGen is an open-source Python library for running large language models on a single GPU, optimized for throughput-oriented inference. It offloads model data across GPU memory, CPU memory, and disk so that models larger than GPU memory can still run, and it relies on large batches to maximize the number of tokens generated per second on constrained hardware.
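To make "tokens generated per second" concrete, here is a minimal sketch of how generation throughput is commonly computed and why a large batch can raise throughput even when each request takes longer. The function name and all numbers are hypothetical illustrations, not FlexGen's API or measured results.

```python
def generation_throughput(batch_size: int, tokens_per_sequence: int,
                          elapsed_seconds: float) -> float:
    """Return generated tokens per second for one batched generation run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return batch_size * tokens_per_sequence / elapsed_seconds


# Hypothetical numbers, chosen only to illustrate the trade-off.
# A single request finishes quickly but yields few tokens overall.
small_batch = generation_throughput(batch_size=1, tokens_per_sequence=32,
                                    elapsed_seconds=4.0)

# A large batch takes far longer end to end, yet produces more tokens per second.
large_batch = generation_throughput(batch_size=64, tokens_per_sequence=32,
                                    elapsed_seconds=64.0)

print(f"small batch: {small_batch:.1f} tokens/s")
print(f"large batch: {large_batch:.1f} tokens/s")
```

In this made-up example the large batch waits 64 seconds instead of 4, but delivers four times the tokens per second, which is the kind of trade a throughput-oriented tool is designed to make.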
Best for
Developers who need to run large language models at high throughput on a single GPU, especially in budget-constrained or research environments
Use cases
- Serving high-throughput LLM applications on a single GPU
- Benchmarking throughput limits of LLM inference on consumer hardware
- Prototyping resource-efficient LLM deployments
Notes
9,365 stars on GitHub. Last updated 2024-10-28. Licensed Apache-2.0.
Pros
- Open source with a strong community following (over 9,300 stars)
- Designed specifically for maximizing throughput on a single GPU
- Written in Python, easy to integrate into existing workflows
Cons
- Focused on single-GPU setups; multi-GPU scaling is not its primary target
- Trades latency for throughput, making it less suited to real-time or interactive applications
- Community-maintained, with potential for slower updates or documentation gaps
Indexed from awesome-llmops and enriched with the project's public metadata.