Enterprise DNA

LLMKube

by Community

Kubernetes operator for local LLM inference with llama.cpp, vLLM, TGI, and mlx-server — multi-GPU NVIDIA + Apple Silicon Metal, autoscaling, air-gapped, production-ready

Added 1 June 2026

#ai #apple-silicon #autoscaling #edge-computing #gguf #gpu #homelab #inference

Overview

LLMKube is a Kubernetes operator for running LLM inference workloads locally using llama.cpp, vLLM, TGI, and mlx-server. It supports multi-GPU configurations on NVIDIA and Apple Silicon Metal, provides autoscaling, and can operate in air-gapped environments.
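The listing does not say how LLMKube implements autoscaling. As a rough illustration only, the sketch below follows the replica calculation documented for Kubernetes' Horizontal Pod Autoscaler (desired = ceil(current replicas × current metric / target metric)), which is one common way inference replicas are scaled on Kubernetes. The function name, the choice of metric (for example, queued requests per replica), and the default bounds and tolerance are assumptions made for this example and are not taken from LLMKube.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,    # assumed lower bound, for illustration
    max_replicas: int = 10,   # assumed upper bound, for illustration
    tolerance: float = 0.1,   # skip scaling when within 10% of target
) -> int:
    """Return a replica count using the HPA-style ratio rule.

    This is a hypothetical helper, not part of LLMKube.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Close enough to the target: keep the current count to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Two replicas, each seeing 90 queued requests against a target of 60.
    print(desired_replicas(2, current_metric=90, target_metric=60))
```

With two replicas at 1.5 times the target load, the rule asks for three replicas. The tolerance band and the min/max clamp keep small fluctuations from triggering constant rescaling, which matters when each new replica has to load a large model onto a GPU.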

Best for

Teams needing a Kubernetes-native way to self-host LLM inference with flexible GPU support

Use cases

  • Deploy and scale local LLM inference on a private Kubernetes cluster
  • Run production LLM workloads with multiple GPU types (NVIDIA and Apple Silicon)
  • Manage LLM serving in air-gapped or restricted-network environments

Notes

118 stars on GitHub. Last updated 2026-06-01. Licensed Apache-2.0.

Pros

  • Supports multiple inference engines (llama.cpp, vLLM, TGI, mlx-server)
  • Works with both NVIDIA and Apple Silicon Metal GPUs
  • Designed for air-gapped, production-ready deployment

Cons

  • Community project with a small following (118 GitHub stars as of 2026-06-01)
  • Written in Go, which may narrow the contributor pool to developers familiar with the language
  • Requires Kubernetes expertise to operate

Indexed from the awesome-llmops list and enriched with publicly available facts about the project.
