promptfoo
by Community
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative config
OSS
Added 1 June 2026
Overview
promptfoo is a testing framework for evaluating prompts, agents, and RAG systems across multiple LLM providers including GPT, Claude, Gemini, and DeepSeek. It runs comparative benchmarks, red team tests, and vulnerability scans using declarative YAML configs with CLI and CI/CD support.
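To make the idea of a declarative, multi-provider evaluation concrete, here is a minimal Python sketch of the general pattern: the prompts, providers, and test cases are described as data, and a small runner evaluates every combination. This is not promptfoo's API or config schema. The names `CONFIG`, `run_eval`, `assert_contains`, `echo_model`, and `shouting_model` are hypothetical, and the two "providers" are local stub functions standing in for real LLM calls.

```python
from itertools import product
from typing import Callable, Dict, List


# Hypothetical stand-ins for LLM providers; a real run would call external APIs.
def echo_model(prompt: str) -> str:
    return f"Echo: {prompt}"


def shouting_model(prompt: str) -> str:
    return prompt.upper()


# The evaluation is described as data (prompts x providers x tests),
# which is what makes it easy to version-control and rerun.
CONFIG: Dict[str, object] = {
    "prompts": ["Translate to French: {text}", "Say in French: {text}"],
    "providers": {"echo": echo_model, "shout": shouting_model},
    "tests": [
        # Hypothetical assertion: the output must contain this substring.
        {"vars": {"text": "hello"}, "assert_contains": "hello"},
    ],
}


def run_eval(config: Dict[str, object]) -> List[dict]:
    """Run every prompt against every provider for every test case."""
    providers: Dict[str, Callable[[str], str]] = config["providers"]
    results = []
    for template, (name, provider), test in product(
        config["prompts"], providers.items(), config["tests"]
    ):
        prompt = template.format(**test["vars"])
        output = provider(prompt)
        results.append(
            {
                "provider": name,
                "prompt": prompt,
                "passed": test["assert_contains"] in output,
            }
        )
    return results


if __name__ == "__main__":
    for row in run_eval(CONFIG):
        print(row)
```

Because the substring check is case-sensitive, the uppercase stub fails the same test the echo stub passes, which is the kind of per-provider difference a comparison matrix is meant to surface.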
Best for
Teams building LLM applications who need systematic prompt validation and security testing before deployment
Use cases
- Compare prompt performance across different LLM models before production
- Automate security testing and adversarial input scanning for AI applications
- Integrate prompt evaluation into CI/CD pipelines for continuous quality checks
Notes
21,784 stars on GitHub. Last updated 2026-06-01. Licensed MIT.
Pros
- Multi-model comparison built in, reducing vendor lock-in risk
- Red teaming and vulnerability scanning included, not bolted on
- Declarative config approach makes tests reproducible and version-controllable
Cons
- Requires familiarity with YAML config syntax and CLI tooling
- Testing scope limited to prompt and agent behavior, not full application integration
- Costs scale with API calls to external LLM providers during test runs
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
LangChain
Community
The agent engineering platform.
LiteLLM 🚅
Community
Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, loadbalancing and logging. [Bedrock, Azure, OpenAI, Vertex
Arthur Shield
Community
Open-source toolkit for building, testing, and monitoring AI agents. Version prompts, run experiments, trace workflows, and catch issues before users do.
Awesome ChatGPT Prompts
Community
f.k.a. Awesome ChatGPT Prompts. Share, discover, and collect prompts from the community. Free and open source — self-host for your organization with complete privacy.
awesome-hallucination-detection
Community
List of papers on hallucination detection in LLMs.
Awesome LLM Security
Community
A curation of awesome tools, documents and projects about LLM Security.
Chinese Large Model Leaderboard
Community
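NoneLinear (非线智能) - ReLE evaluation: capability evaluation of Chinese AI large models (continuously updated). Currently covers 374 large models, including commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as st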
DSPy
Stanford NLP
Programming, not prompting. Declare what you want, compile prompts and weights against an objective.
Giskard
Community
🐢 Open-Source Evaluation & Testing library for LLM Agents
Prompt Engineering
Community
Prompt Engineering, also known as In-Context Prompting, refers to methods for how to communicate with LLM to steer its behavior for desired outcomes without updating the model we
Ragas
Community
Supercharge Your LLM Application Evaluations 🚀
Agenta
Community
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
FELM
Community
FELM: Benchmarking Factuality Evaluation of Large Language Models
LangSmith
Community
Complete AI agent and LLM observability platform with tracing and real-time monitoring. Debug agents, find failures fast, and track costs and latency.
LangWatch
Community
The platform for LLM evaluations and AI agent testing
OpenAI Evals
Community
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Opik
Community
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Promptify
Community
Prompt Engineering | Prompt Versioning | Use GPT or other prompt based models to get structured output. Join our discord for Prompt-Engineering, LLMs and other latest research
PromptPerfect
Community
PromptPerfect - AI Prompt Generator and Optimizer