Open source alternatives to OpenAI Evals
16 open-source alternatives to OpenAI Evals in the index, ranked by GitHub stars and freshness.
promptfoo
Community
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative config
Best for: Teams building LLM applications who need systematic prompt validation and security testing before deployment
Giskard
Community
🐢 Open-Source Evaluation & Testing library for LLM Agents
Best for: Python developers building LLM agents who need automated safety and quality testing
simple-evals
Community
Eval tools by OpenAI.
Best for: Developers who need a straightforward, OpenAI-aligned evaluation toolkit for LLM outputs
LangWatch
Community
The platform for LLM evaluations and AI agent testing
Best for: Developers building and testing LLM-based agents in TypeScript who need a lightweight evaluation framework
HELM
Community
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible…
Best for: Researchers and developers who need rigorous, multi-dimensional evaluation of foundation models
Chain-of-Thought Hub
Community
Benchmarking large language models' complex reasoning ability with chain-of-thought prompting
Best for: Researchers and developers evaluating LLM reasoning capabilities with chain-of-thought prompting
instruct-eval
Community
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
Best for: Researchers and developers who need a simple, standardized way to evaluate instruction-tuned language models
OLMo-Eval
Community
Evaluation suite for LLMs
Best for: Researchers and developers evaluating OLMo or compatible LLMs with reproducible benchmarks
Berkeley Function-Calling Leaderboard
Community
Explore the Berkeley Function Calling Leaderboard (also called the Berkeley Tool Calling Leaderboard) to see the LLM…
Best for: Developers and researchers evaluating LLMs for tool-use and function-calling applications
CompassRank
Community
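The leaderboard aims to provide comprehensive, objective, and neutral scores and rankings for large language models and multimodal models, along with scores across multiple capability dimensions, so that users can gain a fuller understanding of what these models can do.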
Best for: Developers evaluating and comparing open-source LLMs and multimodal models
FELM
Community
FELM: Benchmarking Factuality Evaluation of Large Language Models
Best for: Researchers and developers needing a standardized way to measure LLM factuality
LawBench
Community
LawBench
Best for: Researchers and engineers evaluating or selecting LLMs for legal applications
LLMEval
Community
LLMEval is a research series dedicated to building comprehensive, fair, and robust evaluation frameworks for large language models.
Best for: Researchers and developers building or using LLM evaluation benchmarks
OlympicArena
Community
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
Best for: Researchers and developers evaluating reasoning capabilities of AI models across multiple disciplines
SciBench
Community
Evaluating scientific problems
Best for: Researchers and developers evaluating AI systems on scientific reasoning tasks
SuperBench
Community
A benchmark platform designed for evaluating large language models (LLMs) on a range of tasks, particularly focusing on their performance in different aspects such as natural language…
Best for: Researchers and developers who need a standardized platform to compare LLM performance across common tasks.