MInference

by Community

[NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference by computing attention approximately with dynamic sparsity, which reduces inference latency by up to

Added 1 June 2026

Overview

MInference is a framework that speeds up long-context large language model (LLM) inference by approximating the attention computation with dynamic sparse patterns. It targets the pre-filling phase, where it can reduce latency by up to 10x on an A100 GPU while preserving accuracy. It is implemented in Python and released as open source under the Microsoft organization.
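
To make the idea of sparse attention concrete, here is a minimal NumPy sketch. It is not MInference's kernels or API. It compares full causal attention with a version restricted by a fixed mask (a few initial "sink" tokens plus a local window). The function names, the mask shape, and the parameters `n_sink` and `window` are assumptions chosen for illustration. MInference selects its sparse patterns dynamically, which this static mask does not attempt to reproduce.

```python
import numpy as np


def causal_attention(q, k, v, mask=None):
    """Single-head causal attention. `mask` is an optional (n, n) boolean
    array marking which key positions each query may attend to."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    allowed = np.tril(np.ones((n, n), dtype=bool))  # causal: keys j <= query i
    if mask is not None:
        allowed &= mask
    scores = np.where(allowed, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def sink_plus_window_mask(n, n_sink=4, window=64):
    """Illustrative static pattern (an assumption, not MInference's):
    keep the first `n_sink` keys plus the `window` most recent keys."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j < n_sink) | (i - j < window)


rng = np.random.default_rng(0)
n, d = 256, 32
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

dense = causal_attention(q, k, v)
sparse = causal_attention(q, k, v, sink_plus_window_mask(n))

# Fraction of causal query-key pairs that the sparse version still computes.
causal = np.tril(np.ones((n, n), dtype=bool))
density = (sink_plus_window_mask(n) & causal).sum() / causal.sum()

print(f"pairs kept: {density:.1%}")
print(f"max abs error vs dense: {np.abs(dense - sparse).max():.4f}")
```

Raising `window` to the sequence length recovers the dense result exactly, which is a useful sanity check when comparing a sparse approximation against full attention. Random inputs like these say nothing about accuracy on a real model.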

Best for
Developers optimizing long-context LLM inference on NVIDIA GPUs

Use cases

  • Accelerating pre-filling for long-context LLM inference pipelines
  • Reducing latency in applications that process large input sequences
  • Evaluating sparse attention as a drop-in replacement for full attention

Notes

1,217 stars on GitHub. Last updated 2026-04-08. Licensed MIT.

Pros

  • Up to 10x latency reduction for pre-filling on A100 hardware
  • Maintains accuracy despite sparse approximation
  • Open source with a published NeurIPS spotlight paper

Cons

  • Optimizations are limited to the pre-filling phase, not generation
  • Requires integration into existing inference codebases
  • Performance gains depend on specific model architectures and hardware
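
Because the speedup applies only to pre-filling, the end-to-end gain depends on how much of a request's time is spent in that phase. The sketch below applies Amdahl's law to make that concrete. The function name and the timings are hypothetical and are not measurements of MInference.

```python
def end_to_end_speedup(prefill_s, decode_s, prefill_speedup):
    """Overall speedup when only the pre-filling phase gets faster."""
    if prefill_speedup <= 0:
        raise ValueError("prefill_speedup must be positive")
    before = prefill_s + decode_s
    after = prefill_s / prefill_speedup + decode_s
    return before / after


# Hypothetical request: 30 s of pre-filling, 10 s of decoding,
# and a 10x faster pre-fill. Total time goes from 40 s to 13 s.
print(round(end_to_end_speedup(30.0, 10.0, 10.0), 2))  # 3.08
```

A workload dominated by generation rather than long prompts would see a much smaller overall improvement, even with the same pre-fill speedup.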

Indexed from awesome-llm and enriched against its public facts.
