MInference

by Community

[NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference by computing attention approximately with dynamic sparsity, which reduces inference latency by up to…

Added 1 June 2026

Overview

MInference is a framework that speeds up long-context large language model inference by approximating the attention computation with dynamic sparse patterns. It targets the pre-filling phase and can reduce latency by up to 10x on an A100 GPU while preserving accuracy. The tool is implemented in Python and open source under the Microsoft organization.
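To make the idea of dynamic sparse attention during pre-filling concrete, here is a toy sketch in pure Python. It is not MInference's implementation or API. The function names, the selection heuristic (a local window plus a few key columns ranked by the last few queries), and all parameter values are assumptions chosen for illustration only.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values, positions):
    """Attention output for one query, restricted to the given key positions."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, keys[j]) * scale for j in positions]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * values[j][d] for w, j in zip(weights, positions)) / total
        for d in range(len(values[0]))
    ]


def pick_columns(queries, keys, n_probe, top_k):
    """Hypothetical heuristic: rank key columns by their summed raw dot product
    with the last n_probe queries and keep the top_k. Causality inside the
    probe block is ignored for brevity."""
    probes = queries[-n_probe:]
    scores = [sum(dot(p, k) for p in probes) for k in keys]
    ranked = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:top_k])


def full_prefill(queries, keys, values):
    """Dense causal attention: query i reads every key j <= i."""
    outputs = [attend(q, keys, values, list(range(i + 1))) for i, q in enumerate(queries)]
    return outputs, len(queries) * (len(queries) + 1) // 2


def sparse_prefill(queries, keys, values, window=4, top_k=4, n_probe=2):
    """Sparse causal attention: query i reads a local window plus the selected
    columns that are not in its future. The pattern depends on the input."""
    columns = pick_columns(queries, keys, n_probe, top_k)
    outputs, scored_pairs = [], 0
    for i, q in enumerate(queries):
        local = set(range(max(0, i - window + 1), i + 1))
        positions = sorted(local | {j for j in columns if j <= i})
        scored_pairs += len(positions)
        outputs.append(attend(q, keys, values, positions))
    return outputs, scored_pairs


random.seed(0)
n, dim = 32, 8
Q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

full_out, full_pairs = full_prefill(Q, K, V)
sparse_out, sparse_pairs = sparse_prefill(Q, K, V)
worst = max(abs(a - b) for ra, rb in zip(full_out, sparse_out) for a, b in zip(ra, rb))
print(f"scored {sparse_pairs} of {full_pairs} query-key pairs, max abs output difference {worst:.3f}")
```

On random inputs like these the approximation error can be large, since random attention has no structure to exploit. The premise of sparse pre-filling is that attention in trained long-context models tends to concentrate on a small share of keys, so a well-chosen pattern loses little.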

Best for

Developers optimizing long-context LLM inference on NVIDIA GPUs

Use cases

Accelerating pre-filling for long-context LLM inference pipelines
Reducing latency in applications that process large input sequences
Evaluating sparse attention as a drop-in replacement for full attention
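For the last use case, one way to evaluate a candidate as a drop-in replacement is to run it and a dense reference on identical inputs and compare outputs. The harness below is a minimal sketch under that assumption. The names `masked_attention`, `sliding_window`, and `compare` are hypothetical, and the sliding-window candidate is a stand-in rather than any pattern MInference ships.

```python
import math
import random
import time


def masked_attention(queries, keys, values, allowed):
    """Single-head attention where allowed(i, j) says whether query i may read key j.
    allowed must admit at least one key per query."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for i, q in enumerate(queries):
        cols = [j for j in range(len(keys)) if allowed(i, j)]
        scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in cols]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, cols)) / total
            for d in range(len(values[0]))
        ])
    return outputs


def full_causal(i, j):
    return j <= i


def sliding_window(width):
    """Candidate pattern: each query reads only its `width` most recent keys."""
    return lambda i, j: 0 <= i - j < width


def compare(reference_fn, candidate_fn, queries, keys, values):
    """Run both callables on the same inputs; report relative L2 error and wall time."""
    start = time.perf_counter()
    reference = reference_fn(queries, keys, values)
    middle = time.perf_counter()
    candidate = candidate_fn(queries, keys, values)
    end = time.perf_counter()
    diff = sum((r - c) ** 2 for rr, cr in zip(reference, candidate) for r, c in zip(rr, cr))
    norm = sum(r ** 2 for rr in reference for r in rr)
    return {
        "relative_l2_error": math.sqrt(diff / norm),
        "reference_seconds": middle - start,
        "candidate_seconds": end - middle,
    }


def full(q, k, v):
    return masked_attention(q, k, v, full_causal)


def windowed(q, k, v):
    return masked_attention(q, k, v, sliding_window(8))


random.seed(0)
n, dim = 48, 8
Q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
print(compare(full, windowed, Q, K, V))
```

Because both functions share one signature, swapping the candidate is a one-line change. Timings from pure Python say nothing about GPU kernels, and output error on random tensors is only a smoke test. A real evaluation would use the target model, its actual prompts, and task-level accuracy.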

Notes

1,217 stars on GitHub. Last updated 2026-04-08. Licensed MIT.

Pros

Up to 10x latency reduction for pre-filling on A100 hardware
Maintains accuracy despite sparse approximation
Open source with a published NeurIPS spotlight paper

Cons

Optimizations are limited to the pre-filling phase, not generation
Requires integration into existing inference codebases
Performance gains depend on specific model architectures and hardware
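The first limitation follows from where the quadratic cost sits. The arithmetic below is a rough sketch that counts query-key score evaluations per head and per layer, assuming dense causal attention and a key-value cache during generation. The function names and the `keys_per_query` budget are illustrative, and the counts are not a prediction of measured latency.

```python
def prefill_pairs(prompt_len):
    """Query-key pairs scored by dense causal attention over a whole prompt."""
    return prompt_len * (prompt_len + 1) // 2


def decode_pairs(context_len):
    """Pairs scored for one generated token when earlier keys are cached.
    context_len counts the new token itself."""
    return context_len


def sparse_prefill_pairs(prompt_len, keys_per_query):
    """Pairs scored if each query reads at most keys_per_query keys (hypothetical budget)."""
    return sum(min(i + 1, keys_per_query) for i in range(prompt_len))


prompt_len = 100_000
print("dense prefill pairs: ", prefill_pairs(prompt_len))
print("sparse prefill pairs:", sparse_prefill_pairs(prompt_len, 4_096))
print("one decode step:     ", decode_pairs(prompt_len + 1))
```

Pre-filling grows quadratically with prompt length while each generated token grows linearly, which is why a sparse pattern has the most to save in the pre-filling phase. Real speedups also depend on kernel efficiency, memory bandwidth, and the cost of choosing the pattern.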

Indexed from awesome-llm and enriched against its public facts.

Pairs with

Three other entries in the index connect to this one.

vLLM

Community

A high-throughput and memory-efficient inference and serving engine for LLMs

★ 81,619 updated 1mo ago

SGLang

Community

SGLang is a high-performance serving framework for large language models and multimodal models.

★ 28,885 updated 1mo ago

DeepSpeed

Community

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

★ 42,436 updated 1mo ago
