OpenAI Evals
by Community
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
OSS
Added 1 June 2026
Overview
OpenAI Evals is a Python framework for systematically evaluating language models and LLM-based systems against benchmarks. It provides a registry of pre-built evaluation tasks and a structure for defining custom evaluation logic, enabling developers to measure model performance on specific capabilities.
Best for
Teams building LLM applications who need systematic, reproducible evaluation workflows
Use cases
- Comparing model outputs across different LLM versions or providers
- Measuring performance on domain-specific tasks before deployment
- Building custom evaluation suites for proprietary use cases
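The first use case, comparing outputs across model versions or providers, can be sketched in plain Python. This is only an illustration and does not use the Evals library: the sample format, the `exact_match_accuracy` and `compare` helpers, and the two stub models are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List

# Hypothetical sample format: a prompt plus the single accepted answer.
SAMPLES: List[Dict[str, str]] = [
    {"input": "2 + 2", "ideal": "4"},
    {"input": "3 * 3", "ideal": "9"},
    {"input": "10 - 7", "ideal": "3"},
    {"input": "8 / 2", "ideal": "4"},
]


def model_a(prompt: str) -> str:
    """Stub standing in for one model version; a real run would call an LLM."""
    answers = {"2 + 2": "4", "3 * 3": "9", "10 - 7": "3", "8 / 2": "4"}
    return answers[prompt]


def model_b(prompt: str) -> str:
    """Stub standing in for a second model version with two imperfect answers."""
    answers = {"2 + 2": "4", "3 * 3": "6", "10 - 7": "3", "8 / 2": "4.0"}
    return answers[prompt]


def exact_match_accuracy(
    model: Callable[[str], str], samples: List[Dict[str, str]]
) -> float:
    """Fraction of samples whose stripped output equals the ideal answer."""
    if not samples:
        return 0.0
    correct = sum(1 for s in samples if model(s["input"]).strip() == s["ideal"])
    return correct / len(samples)


def compare(
    models: Dict[str, Callable[[str], str]], samples: List[Dict[str, str]]
) -> Dict[str, float]:
    """Score every model on the same samples so the numbers are comparable."""
    return {name: exact_match_accuracy(fn, samples) for name, fn in models.items()}


RESULTS = compare({"model_a": model_a, "model_b": model_b}, SAMPLES)

if __name__ == "__main__":
    for name, accuracy in RESULTS.items():
        print(f"{name}: {accuracy:.2f}")
```

Running every model against the same fixed sample set is what makes the comparison reproducible. In a real setup the stubs would be replaced by calls to the models under test.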
Notes
18,584 stars on GitHub. Last updated 2026-04-14.
Pros
- Open-source with active community contributions and 18k+ GitHub stars
- Extensible framework for defining custom evaluation logic beyond built-in benchmarks
- Direct integration path with OpenAI models
Cons
- Requires manual setup and Python expertise to implement evaluations
- Registry of benchmarks may not cover all specialized domains
- Evaluation design quality depends on how well you define success criteria
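The last point can be made concrete: the same outputs receive different scores depending on how "correct" is defined. The sketch below is plain Python and is not taken from the Evals library. The example pairs, the `normalize` helper, and the three matching rules are hypothetical and serve only to illustrate the idea.

```python
import string
from typing import Callable, List, Tuple

# Hypothetical (model output, reference answer) pairs.
PAIRS: List[Tuple[str, str]] = [
    ("Paris", "Paris"),
    ("paris.", "Paris"),
    ("The capital is Paris", "Paris"),
    ("Lyon", "Paris"),
]


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def strict_match(output: str, ideal: str) -> bool:
    """Correct only if the output is character-for-character identical."""
    return output == ideal


def normalized_match(output: str, ideal: str) -> bool:
    """Correct if output and reference agree after normalization."""
    return normalize(output) == normalize(ideal)


def contains_match(output: str, ideal: str) -> bool:
    """Correct if the normalized reference appears anywhere in the output."""
    return normalize(ideal) in normalize(output)


def score(rule: Callable[[str, str], bool], pairs: List[Tuple[str, str]]) -> float:
    """Fraction of pairs that the given rule accepts."""
    return sum(rule(output, ideal) for output, ideal in pairs) / len(pairs)


SCORES = {
    "strict": score(strict_match, PAIRS),
    "normalized": score(normalized_match, PAIRS),
    "contains": score(contains_match, PAIRS),
}

if __name__ == "__main__":
    for rule_name, value in SCORES.items():
        print(f"{rule_name}: {value:.2f}")
```

The three rules give 0.25, 0.50, and 0.75 on identical outputs, so the choice of success criterion should be fixed and documented before models are compared.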
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one. Click through to see the chain.
lm-evaluation-harness
Community
A framework for few-shot evaluation of language models.
promptfoo
Community
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative config
Awesome-Align-LLM-Human
Community
Aligning Large Language Models with Human: A Survey
Awesome ChatGPT Prompts
Community
f.k.a. Awesome ChatGPT Prompts. Share, discover, and collect prompts from the community. Free and open source — self-host for your organization with complete privacy.
awesome-hallucination-detection
Community
List of papers on hallucination detection in LLMs.
Awesome LLM Security
Community
A curation of awesome tools, documents and projects about LLM Security.
Emergent Abilities of Large Language Models
Community
Emergent Abilities
Evaluating Large Language Models Trained on Code
Community
2021-08
GPT-4 Technical Report
Community
2023-03
InfiBench
Community
InfiBench: Evaluating the Question-Answering Capabilities of Code LLMs
LawBench
Community
LawBench
LLMEval
Community
LLMEval is a research series dedicated to building comprehensive, fair, and robust evaluation frameworks for large language models.
MMToM-QA
Community
Leaderboard for the MMToM-QA benchmark (Jin et al., ACL 2024).
Neurips2022-Foundational Robustness of Foundation Models
Community
NeurIPS Tutorial Foundational Robustness of Foundation Models
On the Opportunities and Risks of Foundation Models
Community
Foundation Models
OpenAI o3-mini
Community
Pushing the frontier of cost-effective reasoning.
Ragas
Community
Supercharge Your LLM Application Evaluations 🚀
WHOOPS!
Community
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images
Berkeley Function-Calling Leaderboard
Community
Explore The Berkeley Function Calling Leaderboard (also called The Berkeley Tool Calling Leaderboard) to see the LLM
Chain-of-Thought Hub
Community
Benchmarking large language models' complex reasoning ability with chain-of-thought prompting
CompassRank
Community
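The leaderboard aims to provide comprehensive, objective, and neutral scores and rankings for large language models and multimodal models, along with score references across multiple capability dimensions, so that users can gain a fuller understanding of the capability levels of large models.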
FELM
Community
FELM: Benchmarking Factuality Evaluation of Large Language Models
Giskard
Community
🐢 Open-Source Evaluation & Testing library for LLM Agents
HELM
Community
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducib
instruct-eval
Community
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
LangWatch
Community
The platform for LLM evaluations and AI agent testing
OLMO-eval
Community
Evaluation suite for LLMs
simple-evals
Community
Eval tools by OpenAI.
OlympicArena
Community
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
SciBench
Community
Evaluating scientific problems
SuperBench
Community
A benchmark platform designed for evaluating large language models (LLMs) on a range of tasks, particularly focusing on their performance in different aspects such as natural langu