LLMKube
by Community
Kubernetes operator for local LLM inference with llama.cpp, vLLM, TGI, and mlx-server — multi-GPU NVIDIA + Apple Silicon Metal, autoscaling, air-gapped, production-ready
OSS
Added 1 June 2026
Overview
LLMKube is a Kubernetes operator for running LLM inference workloads locally using llama.cpp, vLLM, TGI, and mlx-server. It supports multi-GPU configurations on NVIDIA and Apple Silicon Metal, provides autoscaling, and can operate in air-gapped environments.
Best for
Teams needing a Kubernetes-native way to self-host LLM inference with flexible GPU support
Use cases
- Deploy and scale local LLM inference on a private Kubernetes cluster
- Run production LLM workloads with multiple GPU types (NVIDIA and Apple Silicon)
- Manage LLM serving in air-gapped or restricted-network environments
Notes
118 stars on GitHub. Last updated 2026-06-01. Licensed Apache-2.0.
Pros
- Supports multiple inference engines (llama.cpp, vLLM, TGI, mlx-server)
- Works with both NVIDIA and Apple Silicon Metal GPUs
- Designed for air-gapped and production deployments
Cons
- Small community project (118 GitHub stars at the time of indexing)
- Written in Go, which may narrow the pool of contributors to those familiar with the language
- Requires Kubernetes expertise to operate
Indexed from awesome-llmops and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
llama.cpp
Community
LLM inference in C/C++
vLLM
Community
A high-throughput and memory-efficient inference and serving engine for LLMs
Docker
Community
The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems