FlexGen

by Community

Running large language models on a single GPU for throughput-oriented scenarios.


Added 1 June 2026

#deep-learning #gpt-3 #high-throughput #large-language-models #machine-learning #offloading #opt

Overview

FlexGen is an open-source Python library for running large language models on a single GPU, optimized for throughput-oriented inference. It offloads model data from GPU memory to CPU memory and disk, and schedules computation around those transfers, to maximize the number of tokens generated per second on memory-constrained hardware.
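To make "tokens generated per second" concrete, here is a minimal sketch of how throughput and per-request latency can be computed for a batched generation run. The `RunStats` class, the helper functions, and every number below are hypothetical illustrations, not FlexGen's API and not measurements of it.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Timing for one batched generation run (hypothetical inputs)."""
    batch_size: int       # sequences generated together
    new_tokens: int       # tokens generated per sequence
    wall_seconds: float   # total wall-clock time for the whole batch


def throughput(stats: RunStats) -> float:
    """Generated tokens per second, summed over the whole batch."""
    return stats.batch_size * stats.new_tokens / stats.wall_seconds


def per_request_latency(stats: RunStats) -> float:
    """Seconds one request waits for its full completion.

    Assumes every sequence in the batch finishes when the batch finishes.
    """
    return stats.wall_seconds


# Made-up numbers: the larger batch takes longer but produces far more tokens.
small = RunStats(batch_size=1, new_tokens=32, wall_seconds=8.0)
large = RunStats(batch_size=64, new_tokens=32, wall_seconds=128.0)

for name, run in (("small batch", small), ("large batch", large)):
    print(f"{name}: {throughput(run):.1f} tokens/s, "
          f"{per_request_latency(run):.0f} s per request")
```

With these illustrative inputs the larger batch has four times the throughput but sixteen times the wait per request, which is the kind of trade-off a throughput-oriented tool accepts.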

Best for

Developers who need to run large language models at high throughput on a single GPU, especially in budget-constrained or research environments

Use cases

Serving high-throughput LLM applications on a single GPU
Benchmarking throughput limits of LLM inference on consumer hardware
Prototyping resource-efficient LLM deployments

Notes

9,365 stars on GitHub. Last updated 2024-10-28. Licensed Apache-2.0.

Pros

Open source with a strong community following (over 9,300 stars)
Designed specifically for maximizing throughput on a single GPU
Written in Python, easy to integrate into existing workflows

Cons

Focused on single-GPU setups rather than multi-GPU scaling
Prioritizes throughput over latency, making it less suited to real-time applications
Community-maintained, with potential for slower updates or documentation gaps

Indexed from awesome-llmops and enriched against its public facts.

Pairs with

Other entries in the index that connect to this one.

Uses (1 entry), Built with (1 entry)

PyTorch

Community

Tensors and dynamic neural networks in Python with strong GPU acceleration

★ 100,318, updated 1mo ago
