exllama

by Community

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

Added 1 June 2026

Overview

ExLlama is a memory-efficient reimplementation of the Hugging Face Transformers Llama model, built for inference with quantized weights. Its lower memory use lets larger models fit on consumer GPUs. The project is written in Python, with C++/CUDA extensions for the GPU kernels, and is maintained by the open-source community.

Best for

Developers running quantized Llama models on resource-constrained hardware

Use cases

Running quantized Llama models on limited VRAM
Local inference of Llama-based chatbots or text generators
Benchmarking memory-optimized transformer inference
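
To make the "limited VRAM" use case concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone would need at different precisions. It does not call ExLlama; the helper name `weight_memory_gib` and the nominal parameter counts (7e9, 13e9) are assumptions for illustration, and the estimate ignores the KV cache, activations, and quantization metadata such as scales and zero points.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB.

    Hypothetical helper: n_params * bits / 8 gives bytes, divided by 2**30 for GiB.
    """
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    # Nominal parameter counts, assumed for illustration only.
    for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
        fp16 = weight_memory_gib(n_params, 16)
        int4 = weight_memory_gib(n_params, 4)
        print(f"{name}: 16-bit ~ {fp16:.1f} GiB, 4-bit ~ {int4:.1f} GiB")
```

Under these assumptions, 4-bit weights take about a quarter of the space of 16-bit weights, which is the main reason a quantized model may fit on a GPU where the full-precision one would not. Real usage will be higher once the cache and runtime overhead are included.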

Notes

2,922 stars on GitHub. Last updated 2023-09-30. Licensed MIT.

Pros

Significantly lower memory footprint than Hugging Face Transformers
Fast inference with quantized weights
Community-driven development (last recorded update: 2023-09-30)

Cons

Only supports Llama architecture, not other transformer models
Requires specific quantization formats (e.g., GPTQ)
Less feature-rich than full Hugging Face Transformers ecosystem

Indexed from awesome-llm and enriched against its public facts.

Pairs with

Other entries in the index that connect to this one.

Uses: 1 entry

PyTorch

Community

Tensors and Dynamic neural networks in Python with strong GPU acceleration

★ 100,318 updated 1mo ago

Alternative to: 1 entry

llama.cpp

Community

LLM inference in C/C++

★ 114,160 updated 1mo ago
