Enterprise DNA

OpenAI Evals

by Community

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Added 1 June 2026

Overview

OpenAI Evals is a Python framework for systematically evaluating language models and LLM-based systems against benchmarks. It provides a registry of pre-built evaluation tasks and a structure for defining custom evaluation logic, enabling developers to measure model performance on specific capabilities.
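The registry-plus-custom-logic idea described above can be illustrated with a small, library-independent sketch. This is not the Evals package API: `REGISTRY`, `register`, `run_eval`, `toy_model`, and the `input`/`ideal` field names are hypothetical, chosen only to show how named evaluations can be looked up and run against a model callable.

```python
from typing import Callable, Dict, List

# Hypothetical types for illustration; none of these come from the Evals package.
Sample = Dict[str, str]
Model = Callable[[str], str]
EvalFn = Callable[[Model, List[Sample]], float]

# A registry maps an evaluation name to the function that implements it.
REGISTRY: Dict[str, EvalFn] = {}


def register(name: str) -> Callable[[EvalFn], EvalFn]:
    """Decorator that stores an evaluation function under a name."""
    def decorator(fn: EvalFn) -> EvalFn:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("exact-match")
def exact_match(model: Model, samples: List[Sample]) -> float:
    """Custom evaluation logic: fraction of outputs equal to the ideal answer."""
    if not samples:
        return 0.0
    hits = sum(model(s["input"]).strip() == s["ideal"] for s in samples)
    return hits / len(samples)


def run_eval(name: str, model: Model, samples: List[Sample]) -> float:
    """Look up an evaluation by name and run it against a model."""
    return REGISTRY[name](model, samples)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; answers only one prompt correctly."""
    return "4" if prompt == "2 + 2 =" else "unknown"


SAMPLES: List[Sample] = [
    {"input": "2 + 2 =", "ideal": "4"},
    {"input": "3 + 5 =", "ideal": "8"},
]

if __name__ == "__main__":
    print(run_eval("exact-match", toy_model, SAMPLES))
```

Swapping `toy_model` for a function that calls a real model would let the same samples score different models or providers side by side.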

Best for
Teams building LLM applications who need systematic, reproducible evaluation workflows

Use cases

  • Comparing model outputs across different LLM versions or providers
  • Measuring performance on domain-specific tasks before deployment
  • Building custom evaluation suites for proprietary use cases

Notes

18,584 stars on GitHub. Last updated 2026-04-14.

Pros

  • Open-source with active community contributions and 18k+ GitHub stars
  • Extensible framework for defining custom evaluation logic beyond built-in benchmarks
  • Direct integration path with OpenAI models

Cons

  • Requires manual setup and Python expertise to implement evaluations
  • Registry of benchmarks may not cover all specialized domains
  • Evaluation design quality depends on how well you define success criteria
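The last point, that results depend on how success is defined, can be made concrete with a short sketch. It scores the same outputs under two criteria; `strict_match`, `normalized_match`, `accuracy`, and the sample pairs are hypothetical and not part of the Evals package.

```python
import re
from typing import Callable, List, Tuple

# Each pair is (model output, ideal answer); the data is invented for illustration.
PAIRS: List[Tuple[str, str]] = [
    ("Paris", "Paris"),
    ("paris.", "Paris"),
    ("The capital is Paris", "Paris"),
]


def strict_match(output: str, ideal: str) -> bool:
    """Success only when the output equals the ideal answer exactly."""
    return output == ideal


def normalized_match(output: str, ideal: str) -> bool:
    """Success when the strings agree after ignoring case, punctuation, and spacing."""
    def norm(text: str) -> str:
        text = re.sub(r"[^\w\s]", "", text.lower())
        return " ".join(text.split())
    return norm(output) == norm(ideal)


def accuracy(pairs: List[Tuple[str, str]],
             criterion: Callable[[str, str], bool]) -> float:
    """Fraction of pairs that satisfy the chosen success criterion."""
    if not pairs:
        return 0.0
    return sum(criterion(out, ideal) for out, ideal in pairs) / len(pairs)


if __name__ == "__main__":
    print("strict:", accuracy(PAIRS, strict_match))
    print("normalized:", accuracy(PAIRS, normalized_match))
```

The strict criterion accepts one of the three outputs and the normalized criterion accepts two, so the reported accuracy changes with the criterion even though the model outputs do not.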

Indexed from awesome-llm and enriched against its public facts.

Pairs with

Other entries in the index that connect to this one.

16 entries

Awesome-Align-LLM-Human

Community

Aligning Large Language Models with Human: A Survey

★ 742 updated 2y ago

Awesome ChatGPT Prompts

Community

Formerly known as Awesome ChatGPT Prompts. Share, discover, and collect prompts from the community. Free and open source; self-host for your organization with complete privacy.

★ 163,161 updated 2d ago

awesome-hallucination-detection

Community

List of papers on hallucination detection in LLMs.

★ 1,096 updated 9d ago

Awesome LLM Security

Community

A curation of awesome tools, documents and projects about LLM Security.

★ 1,599 updated 9mo ago

Emergent Abilities of Large Language Models

Community

Emergent Abilities

Evaluating Large Language Models Trained on Code

Community

2021-08

GPT-4 Technical Report

Community

2023-03

InfiBench

Community

InfiBench: Evaluating the Question-Answering Capabilities of Code LLMs

LawBench

Community

LawBench

LLMEval

Community

LLMEval is a research series dedicated to building comprehensive, fair, and robust evaluation frameworks for large language models.

MMToM-QA

Community

Leaderboard for the MMToM-QA benchmark (Jin et al., ACL 2024).

Neurips2022-Foundational Robustness of Foundation Models

Community

NeurIPS Tutorial Foundational Robustness of Foundation Models

On the Opportunities and Risks of Foundation Models

Community

Foundation Models

OpenAI o3-mini

Community

Pushing the frontier of cost-effective reasoning.

Ragas

Community

Supercharge Your LLM Application Evaluations 🚀

★ 14,186 updated 3mo ago

WHOOPS!

Community

Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Alternatives

16 entries

Berkeley Function-Calling Leaderboard

Community

Explore The Berkeley Function Calling Leaderboard (also called The Berkeley Tool Calling Leaderboard) to see the LLM

Chain-of-Thought Hub

Community

Benchmarking large language models' complex reasoning ability with chain-of-thought prompting

★ 2,773 updated 1y ago

CompassRank

Community

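The leaderboard aims to provide comprehensive, objective, and neutral scores and rankings for large language models and multimodal models, along with scoring references across multiple capability dimensions, so that users can gain a fuller understanding of the capabilities of large models.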

FELM

Community

FELM: Benchmarking Factuality Evaluation of Large Language Models

Giskard

Community

🐢 Open-Source Evaluation & Testing library for LLM Agents

★ 5,414 updated 5d ago

HELM

Community

Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducib

★ 2,811 updated 2d ago

instruct-eval

Community

This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.

★ 553 updated 2y ago

LawBench

Community

LawBench

LangWatch

Community

The platform for LLM evaluations and AI agent testing

★ 3,275 updated 2d ago

LLMEval

Community

LLMEval is a research series dedicated to building comprehensive, fair, and robust evaluation frameworks for large language models.

OLMO-eval

Community

Evaluation suite for LLMs

★ 379 updated 10mo ago

promptfoo

Community

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative config

★ 21,784 updated 2d ago

simple-evals

Community

Eval tools by OpenAI.

★ 4,508 updated 1mo ago

OlympicArena

Community

OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

SciBench

Community

Evaluating scientific problems

SuperBench

Community

a benchmark platform designed for evaluating large language models (LLMs) on a range of tasks, particularly focusing on their performance in different aspects such as natural langu