Ragas
by Community
Supercharge Your LLM Application Evaluations
OSS
Added 1 June 2026
Overview
Ragas is a Python framework for evaluating LLM applications through automated metrics and test generation. It measures retrieval quality, generation accuracy, and end-to-end performance without requiring manual ground truth labels. Designed for RAG systems and LLM pipelines, it provides quantitative feedback on application behavior.
Best for
Teams building RAG systems who need continuous evaluation without manual labeling
Use cases
- Measuring retrieval quality in RAG systems
- Benchmarking LLM output accuracy and relevance
- Automated test generation for prompt chains
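As a rough illustration of the first use case, the sketch below scores a retrieval ranking with an average-precision-style measure, the kind of rank-aware calculation that retrieval-quality metrics are built on. It is not Ragas's API or implementation: the function name `context_precision` and the hard-coded relevance verdicts are assumptions made for this example. In an automated setup those verdicts would typically come from an LLM judge rather than from hand labels.

```python
from typing import Sequence


def context_precision(relevant: Sequence[bool]) -> float:
    """Rank-aware precision over retrieved chunks (illustrative only).

    relevant[i] is True when the chunk at rank i + 1 was judged useful.
    The score averages precision@k over the ranks k that hold a
    relevant chunk, so useful chunks near the top score higher.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k, counted only at relevant ranks
    return total / hits if hits else 0.0


# Hypothetical verdicts for two retrievers answering the same question.
good_ranking = [True, True, False, False]   # useful chunks ranked first
poor_ranking = [False, False, True, True]   # useful chunks ranked last

score_good = context_precision(good_ranking)
score_poor = context_precision(poor_ranking)
print(f"good ranking: {score_good:.3f}")
print(f"poor ranking: {score_poor:.3f}")
```

Both rankings contain the same two useful chunks, but the first one scores higher because it places them at the top. A rank-aware retrieval metric is meant to surface exactly that difference.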
Notes
14,186 stars on GitHub. Last updated 2026-02-24. Licensed Apache-2.0.
Pros
- Reduces evaluation overhead by automating metric computation
- Works without pre-built ground truth datasets
- Active open source community with 14k+ stars
Cons
- Many metrics use an LLM as the judge, so scores depend on the judge model's quality and the evaluation becomes circular
- Python-only, so it must be integrated into existing workflows, with extra work for non-Python stacks
- Automated metrics may not capture domain-specific correctness
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
promptfoo
Community
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative config
OpenAI Evals
Community
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
lm-evaluation-harness
Community
A framework for few-shot evaluation of language models.
AutoRAG
Community
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
awesome-hallucination-detection
Community
List of papers on hallucination detection in LLMs.
Awesome-LLM-hallucination
Community
LLM hallucination paper list
CompMix
Community
CompMix: A Benchmark for Heterogeneous Question Answering.
Evidently
Community
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
Giskard
Community
Open-Source Evaluation & Testing library for LLM Agents
InfiBench
Community
InfiBench: Evaluating the Question-Answering Capabilities of Code LLMs
LawBench
Community
LawBench
LLMEval
Community
LLMEval is a research series dedicated to building comprehensive, fair, and robust evaluation frameworks for large language models.
MMToM-QA
Community
Leaderboard for the MMToM-QA benchmark (Jin et al., ACL 2024).
PubMedQA
Community
PubMedQA Homepage
TAT-DQA
Community
TAT-DQA: A Document Visual Question Answering (VQA) Dataset, aiming to answer questions over visually-rich documents with a hybrid of Tabular and Textual Content in Finance
FELM
Community
FELM: Benchmarking Factuality Evaluation of Large Language Models
LangWatch
Community
The platform for LLM evaluations and AI agent testing
Opik
Community
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.