Enterprise DNA
O Open Source Observability medium

FlexGen

by Community

Running large language models on a single GPU for throughput-oriented scenarios.

F

OSS

FlexGen

Added 1 June 2026

#deep-learning #gpt-3 #high-throughput #large-language-models #machine-learning #offloading #opt

Overview

FlexGen is an open-source Python library for running large language models on a single GPU, optimized for throughput-oriented inference scenarios. It leverages memory and computation management techniques to maximize the number of tokens generated per second on constrained hardware.

Best for

Best for
Developers who need to run large language models at high throughput on a single GPU, especially in budget-constrained or research environments

Use cases

  • Serving high-throughput LLM applications on a single GPU
  • Benchmarking throughput limits of LLM inference on consumer hardware
  • Prototyping resource-efficient LLM deployments

Notes

FlexGen is an open-source Python library for running large language models on a single GPU, optimized for throughput-oriented inference scenarios. It leverages memory and computation management techniques to maximize the number of tokens generated per second on constrained hardware.

9,365 stars on GitHub. Last updated 2024-10-28. Licensed Apache-2.0.

Use cases

  • Serving high-throughput LLM applications on a single GPU
  • Benchmarking throughput limits of LLM inference on consumer hardware
  • Prototyping resource-efficient LLM deployments

Pros

  • Open source with a strong community following (over 9,300 stars)
  • Designed specifically for maximizing throughput on a single GPU
  • Written in Python, easy to integrate into existing workflows

Cons

  • Limited to single-GPU setups, not suitable for multi-GPU scaling
  • May not prioritize latency, making it less ideal for real-time applications
  • Community-maintained, with potential for slower updates or documentation gaps

Indexed from awesome-llmops and enriched against its public facts.

Pros

  • Open source with a strong community following (over 9,300 stars)
  • Designed specifically for maximizing throughput on a single GPU
  • Written in Python, easy to integrate into existing workflows

Cons

  • Limited to single-GPU setups, not suitable for multi-GPU scaling
  • May not prioritize latency, making it less ideal for real-time applications
  • Community-maintained, with potential for slower updates or documentation gaps