llama.cpp
by Community
LLM inference in C/C++
OSS
Added 1 June 2026
Overview
llama.cpp is a C++ inference framework that runs large language models locally on consumer hardware. It provides optimized tensor operations and quantization support to reduce model size and memory footprint, enabling fast inference without cloud dependencies.
Best for
Developers building privacy-first or offline-capable applications with constrained hardware
Use cases
- Running open-source LLMs on laptops or edge devices
- Building offline AI applications with minimal latency
- Quantizing and deploying models with reduced VRAM requirements
Notes
114,160 stars on GitHub. Last updated 2026-06-01. Licensed MIT.
Pros
- Minimal dependencies and fast startup; runs on CPU and GPU
- Quantization options (including 4-bit and 8-bit) substantially reduce model size and memory use
- Active community with broad hardware support including Apple Silicon
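As a rough illustration of the quantization point above, the sketch below estimates weight storage as parameter count × bits per weight ÷ 8. This is an approximation under stated assumptions: it ignores the KV cache, activations, and the per-block scale metadata that real quantized formats carry, so actual files are somewhat larger than these figures. The 7-billion-parameter model and the helper name `estimate_weight_size_gb` are hypothetical, chosen only for the example.

```python
def estimate_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    Assumption: every weight is stored at exactly `bits_per_weight` bits,
    with no quantization metadata, KV cache, or activation memory counted.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7-billion-parameter model
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        size = estimate_weight_size_gb(n_params, bits)
        print(f"{label}: ~{size:.1f} GB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts storage to a quarter, which is what makes larger models fit on laptops and edge devices.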
Cons
- Requires manual model conversion and quantization workflows
- Performance varies significantly by hardware; CPU inference is slower than GPU alternatives
- Limited built-in abstractions for complex multi-model pipelines
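The manual conversion and quantization workflow mentioned in the cons typically has two steps: convert the original checkpoint to GGUF, then quantize the GGUF file. The sketch below only builds the two command lines and does not run them. It assumes the `convert_hf_to_gguf.py` script and the `llama-quantize` binary from the llama.cpp repository are available; tool names and flags can change between versions, so check them against your checkout. The paths, the `Q4_K_M` default, and the helper name `build_conversion_commands` are illustrative assumptions.

```python
from pathlib import Path


def build_conversion_commands(model_dir, out_dir, quant_type="Q4_K_M"):
    """Return (convert_cmd, quantize_cmd) as argument lists.

    Assumptions: convert_hf_to_gguf.py and llama-quantize come from a
    llama.cpp checkout; all file names here are hypothetical.
    """
    out = Path(out_dir)
    f16_path = out / "model-f16.gguf"
    quant_path = out / f"model-{quant_type}.gguf"

    # Step 1: convert a Hugging Face checkpoint directory to a 16-bit GGUF file.
    convert_cmd = [
        "python", "convert_hf_to_gguf.py", str(model_dir),
        "--outfile", str(f16_path),
        "--outtype", "f16",
    ]
    # Step 2: quantize the 16-bit GGUF file to the requested type.
    quantize_cmd = ["llama-quantize", str(f16_path), str(quant_path), quant_type]
    return convert_cmd, quantize_cmd


if __name__ == "__main__":
    convert_cmd, quantize_cmd = build_conversion_commands("my-model", "out")
    print(" ".join(convert_cmd))
    print(" ".join(quantize_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed; keeping the commands as lists avoids shell-quoting problems with paths.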
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
ollama
Community
Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
gpt4all
Various
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
Continue
Continue.dev
Open-source AI code assistant for VS Code and JetBrains. Customisable, BYO model, built for enterprise.
ShipItAndPray/mcp-turboquant
Various
MCP server for LLM quantization. Compress any model to GGUF/GPTQ/AWQ in one tool call. First MCP server for model compression.
Anything LLM
Community
The all-in-one AI productivity accelerator. On device and privacy first with no annoying setup or configuration.
Continue
Community
⏩ Source-controlled AI checks, enforceable in CI. Powered by the open-source Continue CLI
deploy-llms-with-ansible
Community
Easily deploy LLMs with Ansible. Uses Docker with llama.cpp or ollama. Secured with whitelisted IPs.
LangChain
Community
The agent engineering platform.
lighteval
Community
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
LLama Cpp Agent
Community
The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs). Allowing users to chat with LLM models, execute structured function calls a
LLMKube
Community
Kubernetes operator for local LLM inference with llama.cpp, vLLM, TGI, and mlx-server — multi-GPU NVIDIA + Apple Silicon Metal, autoscaling, air-gapped, production-ready
lm-evaluation-harness
Community
A framework for few-shot evaluation of language models.
Off Grid
Community
The Swiss Army Knife of Offline AI. Chat, Speak, and Generate Images - Privacy First, Zero Internet. Download an LLM and use it on your mobile device. No data ever leaves your phon
OpenDAN
Community
OpenDAN is an open-source personal AI OS that consolidates various AI modules in one place for your personal use.
Outlines
Community
Structured Outputs
Phidata
Community
Build, run, and manage agent platforms.
prima.cpp
Community
A distributed implementation of llama.cpp that lets you run 70B-level LLMs on your everyday devices.
Private GPT
Community
Interact with your documents using the power of GPT, 100% privately, no data leaks
Qwen2-Math-1.5B|7B|72B
Community
This model mainly supports English. We will release bilingual (English and Chinese) math models soon.
Serge
Community
A web interface for chatting with Alpaca through llama.cpp. Fully dockerized, with an easy to use API.
Wllama
Community
WebAssembly binding for llama.cpp - Enabling on-browser LLM inference
Jan
Various
Jan is an open-source alternative to ChatGPT. Run open-source AI models locally or connect to cloud models like GPT, Claude and others.
LibreChat
Various
LibreChat brings together all your AI conversations in one unified, customizable interface.
LLM
Various
LLM: A CLI utility and Python library for interacting with Large Language Models
Local Deep Research
Various
~95% on SimpleQA (e.g. Qwen3.6-27B on a 3090). Supports all local and cloud LLMs (llama.cpp, Ollama, Google, ...). 10+ search engines - arXiv, PubMed, your private documents. Every
LM Studio
Various
Run local AI models like gpt-oss, Llama, Gemma, Qwen, and DeepSeek privately on your computer.
privateGPT
Various
Interact with your documents using the power of GPT, 100% privately, no data leaks
PyGPT
Various
PyGPT is an open‑source desktop AI assistant for Windows, macOS and Linux. Chat, agents, web search, run Python, TTS/STT, plugins, long‑term memory.
Vicuna-13B
Various
We introduce Vicuna-13B, an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge s
Jwrede/llmprobe
Various
Synthetic monitoring and CI smoke tests for LLM inference endpoints.
awesome-japanese-llm
Community
Overview of Japanese LLMs
Awesome-LLM-Compression
Community
Awesome LLM compression research papers and tools.
Awesome-LLM-Inference
Community
📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉
awesome-llm-webapps
Community
A collection of open source, actively maintained web apps for LLM applications
Baichuan-7|13B
Community
AGI Large Language Models
Chroma
Community
Search infrastructure for AI
CodeQwen1.5-7B
Community
The advent of advanced programming tools, which harness the power of large language models (LLMs), has significantly en
Codestral-7|22B
Community
The most powerful AI platform for enterprises. Customize, fine-tune, and deploy AI assistants, autonomous agents, and multimodal AI with open models.
DeepSeek-R1
Community
First-generation reasoning models from DeepSeek.
femtoGPT
Community
Pure Rust implementation of a minimal Generative Pretrained Transformer
Flock
Community
A multi agent desktop application built with Rust and Tauri.
Gemma
Community
Gemma2-9|27B
Community
Gemma 2, our next generation of open models, is now available globally for researchers and developers.
Grok-1-314B-MoE
Community
Grok-1-314B-MoE — indexed from awesome-llm
Haystack
Community
Create agentic, context engineered AI systems using Haystack’s modular and customizable building blocks, built for real-world, production-ready applications.
InternLM2-1.8|7|20B
Community
Lagent
Community
A lightweight framework for building LLM-based agents
Langchain-Chatchat
Community
Langchain-Chatchat (formerly Langchain-ChatGLM): RAG and Agent applications built on Langchain and language models such as ChatGLM, Qwen, and Llama.
LiteLLM 🚅
Community
Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, loadbalancing and logging. [Bedrock, Azure, OpenAI, Vertex
Llama 1-7|13|33|65B
Community
[OPT-1.3 6.7 13 30 66B](https://arxiv.org/abs/2205.01068)
Llama 2: Open Foundation and Fine-Tuned Chat Models
Community
2023-07
Llama 3.2-1|3|11|90B
Community
[Llama 3.1-8 70 405B](https://llama.meta.com/)
Llama 3-8|70B
Community
[Llama 2-7 13 70B](https://llama.meta.com/llama2/)
LLaMA Cult and More
Community
Large Language Models for All, 🦙 Cult and More, Stay in touch !
LLaMA: Open and Efficient Foundation Language Models
Community
2023-02
MiniCPM-2B
Community
The MiniCPM family of LLMs and VLLMs.
Mistral 7B
Community
Mistral 7B
Moonlight-A3B
Community
Moonshot's Compute-efficient MoE LLM, first Scaling Up of Muon Optimizer
OLMO-eval
Community
Evaluation suite for LLMs
OpenELM-1.1|3B
Community
Phi1-1.3B
Community
Qwen-1.8B|7B|14B|72B
Community
Qwen - a Qwen Collection
Qwen2-0.5B|1.5B|7B|57B-A14B-MoE|72B
Community
After months of effort, we are pleased to announce the evolution from Qwen1.5 to Qwen2.
RWKV: Reinventing RNNs for the Transformer Era
Community
Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence le
Semantic Kernel
Microsoft
Microsoft's enterprise-flavoured framework for AI agents. .NET-first, with Python and Java siblings.
Shell-Pilot
Community
A simple, lightweight shell script to interact with OpenAI or Ollama or Mistral AI or LocalAI or ZhipuAI from the terminal, and enhancing intelligent system management without any
StableLM-3B
Community
StableLM-v2-12B
Community
The Llama 3 Herd of Models
Community
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models
unslothai
Community
Unsloth Studio is a web UI for training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally.
Vercel AI SDK
Vercel
The de facto TypeScript SDK for AI apps. Streaming, tools, multi-model, and now an agent loop.
Yi-34B
Community
Auto-GPT
Various
AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
LLaMA
Various
Llama LLM, a foundational, 65-billion-parameter large language model by Meta. Meta, February 23rd, 2023. #opensource
Qwen
Various
Qwickly forging AGI, enhancing intelligence.
RunThisLLM
Various
Find out exactly what hardware you need to run any local LLM, image, video, or audio AI model. 275+ models with full build specs and performance estimates.
TurboPilot
Various
Turbopilot is an open source large-language-model based code completion engine that runs locally on CPU
exllama
Community
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
mistral.rs
Community
Fast, flexible LLM inference
MNN-LLM
Community
MNN: A blazing-fast, lightweight inference engine battle-tested by Alibaba, powering high-performance on-device LLMs and Edge AI.
Rapid-MLX
Community
The fastest local AI engine for Apple Silicon. 4.2x faster than Ollama, 0.08s cached TTFT, 100% tool calling. 17 tool parsers, prompt cache, reasoning separation, cloud routing. Dr
Shimmy
Community
⚡ Python-free Rust inference server — OpenAI-API compatible. GGUF + SafeTensors, hot model swap, auto-discovery, single binary. FREE now, FREE forever.
TensorRT-LLM
Community
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NV
bitnet.cpp
Various
Official inference framework for 1-bit LLMs
ChatGPT
OpenAI
General-purpose AI assistant for writing, coding, analysis, and conversation. The most widely deployed consumer AI product.
OpenAI API
Various
Announcement of the OpenAI API for text-to-text general-purpose AI models based on GPT-3. OpenAI blog, June 11, 2020.