MInference
by Community
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, MInference computes attention approximately using dynamic sparsity, which reduces inference latency by up to
OSS
Added 1 June 2026
Overview
MInference is a framework that speeds up long-context large language model (LLM) inference by approximating the attention computation with dynamic sparse patterns. It targets the pre-filling phase, where the model processes the input prompt before generating output, and can reduce pre-filling latency by up to 10x on an A100 GPU while preserving accuracy. The tool is implemented in Python and is open source under the Microsoft organization.
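To give a rough sense of why sparse attention can cut pre-filling cost, the sketch below counts how many query-key pairs a causal attention pass evaluates with and without a sparsity pattern. It is a toy illustration only: the `sink_plus_window` pattern and its `n_sink` and `window` parameters are hypothetical and fixed in advance, whereas MInference selects patterns dynamically and runs them with GPU kernels that are not shown here.

```python
def full_causal(i):
    """Key positions a query at position i attends to under full causal attention."""
    return list(range(i + 1))


def sink_plus_window(i, n_sink=4, window=16):
    """Hypothetical static sparse pattern (an assumption for illustration):
    the first n_sink tokens plus the most recent `window` tokens."""
    sinks = list(range(min(n_sink, i + 1)))
    start = max(n_sink, i - window + 1)
    recent = list(range(start, i + 1))
    return sinks + recent


def count_pairs(n, pattern):
    """Total query-key pairs scored when pre-filling a prompt of n tokens."""
    return sum(len(pattern(i)) for i in range(n))


n = 1024
full_pairs = count_pairs(n, full_causal)
sparse_pairs = count_pairs(n, sink_plus_window)
print(full_pairs, sparse_pairs)  # prints 524800 20290
```

Full causal attention grows quadratically with prompt length, while this fixed pattern grows linearly. Fewer scored pairs does not translate directly into wall-clock speedup, which also depends on kernels and hardware.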
Best for
Developers optimizing long-context LLM inference on NVIDIA GPUs
Use cases
- Accelerating pre-filling for long-context LLM inference pipelines
- Reducing latency in applications that process large input sequences
- Evaluating sparse attention as a drop-in replacement for full attention
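For the last use case, one simple way to evaluate a sparse approximation is to compare its output against full attention on the same inputs. The sketch below does this for a single query in pure Python, using random data and a top-k selection of keys. The helper names (`softmax_attention`, `top_k_indices`, `max_abs_error`) are hypothetical, and top-k over exact scores is an assumption chosen for clarity, not MInference's selection method. It scores every key to pick the top k, so it measures accuracy only and saves no compute.

```python
import math
import random


def softmax_attention(q, keys, values, indices):
    """Scaled dot-product attention for one query, restricted to `indices`."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in indices]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * values[j][d] for w, j in zip(weights, indices)) / total
        for d in range(len(values[0]))
    ]


def top_k_indices(q, keys, k):
    """Positions of the k keys with the largest dot product with q."""
    scores = [sum(a * b for a, b in zip(q, key)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def max_abs_error(a, b):
    """Largest element-wise difference between two output vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed so runs are repeatable
n, dim = 256, 8
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
q = [rng.gauss(0, 1) for _ in range(dim)]

full = softmax_attention(q, keys, values, list(range(n)))
for k in (16, 64, 256):
    sparse = softmax_attention(q, keys, values, top_k_indices(q, keys, k))
    print(k, max_abs_error(full, sparse))
```

With `k` equal to the number of keys, the sparse result matches full attention. For a real model, task-level accuracy on long-context benchmarks is a more meaningful check than this per-query error.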
Notes
1,217 stars on GitHub. Last updated 2026-04-08. Licensed MIT.
Pros
- Up to 10x latency reduction for pre-filling on A100 hardware
- Maintains accuracy despite sparse approximation
- Open source with a published NeurIPS spotlight paper
Cons
- Optimizations are limited to the pre-filling phase, not generation
- Requires integration into existing inference codebases
- Performance gains depend on specific model architectures and hardware
Indexed from awesome-llm and enriched against its public facts.