Open Source Alternatives

Open source alternatives to OpenAI Evals

17 open-source alternatives in the index, ranked by GitHub stars and freshness.

promptfoo

Community

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative config

★ 21,784 updated 1mo ago

open-source

Best for: Teams building LLM applications who need systematic prompt validation and security testing before deployment

OSS Framework medium

Ragas

Community

Supercharge Your LLM Application Evaluations 🚀

★ 14,186 updated 4mo ago

open-source

Best for: Teams building RAG systems who need continuous evaluation without manual labeling

OSS Framework medium

lm-evaluation-harness

Community

A framework for few-shot evaluation of language models.

★ 12,772 updated 2mo ago

open-source

Best for: Researchers and engineers benchmarking LLM performance against established academic standards

OSS Framework medium

Chinese Large Model Leaderboard

Community

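NoneLinear (非线智能) ReLE evaluation: capability evaluation of Chinese AI large models (continuously updated). It currently covers 374 large models, including commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as st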

★ 6,103 updated 1mo ago

open-source

Best for: Developers and researchers evaluating Chinese large language models.

OSS Framework medium

Giskard

Community

🐢 Open-Source Evaluation & Testing library for LLM Agents

★ 5,414 updated 1mo ago

open-source

Best for: Python developers building LLM agents who need automated safety and quality testing.

OSS Framework medium

simple-evals

Community

Eval tools by OpenAI.

★ 4,508 updated 2mo ago

open-source

Best for: Developers who need a straightforward, OpenAI-aligned evaluation toolkit for LLM outputs

OSS Framework medium

HELM

Community

Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducib

★ 2,811 updated 1mo ago

open-source

Best for: Researchers and developers who need rigorous, multi-dimensional evaluation of foundation models

OSS Framework medium

Chain-of-Thought Hub

Community

Benchmarking large language models' complex reasoning ability with chain-of-thought prompting

★ 2,773 updated 1y ago

open-source

Best for: Researchers and developers evaluating LLM reasoning capabilities with chain-of-thought prompting

OSS Framework medium

instruct-eval

Community

This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.

★ 553 updated 2y ago

open-source

Best for: Researchers and developers who need a simple, standardized way to evaluate instruction-tuned language models

OSS Framework medium

OLMo-Eval

Community

Evaluation suite for LLMs

★ 379 updated 1y ago

open-source

Best for: Researchers and developers evaluating OLMo or compatible LLMs with reproducible benchmarks

OSS Framework medium

ACLUE

Community

Official GitHub repo for ACLUE, an evaluation benchmark focused on ancient Chinese language comprehension.

★ 34 updated 2y ago

open-source

Best for: Researchers and developers working on classical Chinese NLP models

OSS Framework medium

AlpacaEval

Community

AlpacaEval Leaderboard

open-source

Best for: Researchers and developers benchmarking instruction-tuned language models

OSS Framework medium

CompassRank

Community

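The leaderboard aims to provide comprehensive, objective, and neutral scores and rankings for large language models and multimodal models, along with score references across multiple capability dimensions, so that users can understand the capability levels of large models more fully.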

open-source

Best for: Developers evaluating and comparing open-source LLMs and multimodal models

OSS Framework medium

M3CoT

Community

Leaderboard | M3CoT

open-source

Best for: Researchers and developers evaluating multi-modal chain-of-thought reasoning in AI models

OSS Framework medium

MathEval

Community

A comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.

open-source

Best for: Researchers and developers benchmarking mathematical reasoning in large models.

OSS Framework medium

Open LLM Leaderboard

Community

Track, rank and evaluate open LLMs and chatbots

open-source

Best for: Developers and researchers evaluating open LLMs for general-purpose language tasks

OSS Framework medium

TensorZero

Community

TensorZero builds open-source tools for production-grade LLM applications: LLM gateway, observability, optimization, evaluations, and experimentation.

open-source

Best for: Teams needing an open-source, end-to-end LLM toolchain for production.