exllama
by Community
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
OSS
exllama
Added 1 June 2026
Overview
ExLlama is a memory-efficient reimplementation of Hugging Face Transformers' Llama model, built for quantized weights such as 4-bit GPTQ. It lowers memory use during inference, so larger models fit on consumer GPUs. It is written in Python with C++/CUDA extensions and is maintained by the open-source community.
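To give a rough sense of why quantized weights matter on consumer GPUs, here is a minimal back-of-the-envelope sketch. It does not use ExLlama itself; `weight_memory_gib` is a hypothetical helper, the parameter counts are nominal model sizes, and the estimate covers weights only (no KV cache, activations, or the per-group scales that real quantized formats also store).

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for model weights alone, in GiB.

    Hypothetical helper for illustration; real usage is higher because of
    the KV cache, activations, and quantization metadata.
    """
    return n_params * bits_per_weight / 8 / 2**30


# Nominal parameter counts (assumed round numbers, not exact model sizes).
for name, n_params in [("7B", 7e9), ("13B", 13e9), ("33B", 33e9)]:
    fp16 = weight_memory_gib(n_params, 16)
    q4 = weight_memory_gib(n_params, 4)
    print(f"{name}: fp16 ~{fp16:.1f} GiB, 4-bit ~{q4:.1f} GiB")
```

Under these assumptions, 4-bit weights take about a quarter of the space of 16-bit weights, which is the gap that lets a larger model fit in a fixed amount of VRAM.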
Best for
Developers running quantized Llama models on resource-constrained hardware
Use cases
- Running quantized Llama models on limited VRAM
- Local inference of Llama-based chatbots or text generators
- Benchmarking memory-optimized transformer inference
Notes
2,922 stars on GitHub. Last updated 2023-09-30. Licensed MIT.
Pros
- Significantly lower memory footprint than Hugging Face Transformers
- Fast inference with quantized weights
- Community-developed and MIT-licensed (the repository was last updated 2023-09-30)
Cons
- Only supports Llama architecture, not other transformer models
- Requires specific quantization formats (e.g., GPTQ)
- Less feature-rich than full Hugging Face Transformers ecosystem
Indexed from awesome-llm and enriched against its public facts.