Open Source Alternatives
Open source alternatives to LM Eval Harness
20 open-source alternatives to LM Eval Harness in the index, ranked by GitHub stars and freshness.
OpenAI Evals
Community
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Best for: Teams building LLM applications who need systematic, reproducible evaluation workflows
Giskard
Community
🐢 Open-Source Evaluation & Testing library for LLM Agents
Best for: Python developers building LLM agents who need automated safety and quality testing.
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Community
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
Best for: Researchers and engineers studying language model capabilities and scaling behavior
HELM
Community
Holistic Evaluation of Language Models (HELM) is an open-source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible…
Best for: Researchers and developers who need rigorous, multi-dimensional evaluation of foundation models
Chain-of-Thought Hub
Community
Benchmarking large language models' complex reasoning ability with chain-of-thought prompting
Best for: Researchers and developers evaluating LLM reasoning capabilities with chain-of-thought prompting
lighteval
Community
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Best for: Researchers and developers who need a unified way to evaluate and compare LLMs from different sources
instruct-eval
Community
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
Best for: Researchers and developers who need a simple, standardized way to evaluate instruction-tuned language models
OLMO-eval
Community
Evaluation suite for LLMs
Best for: Researchers and developers evaluating OLMo or compatible LLMs with reproducible benchmarks
AlpacaEval
Community
AlpacaEval Leaderboard
Best for: Researchers and developers benchmarking instruction-tuned language models
CompassRank
Community
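The leaderboard aims to provide comprehensive, objective, and neutral scores and rankings for large language models and multimodal models, along with scores across multiple capability dimensions, so that users can gain a fuller understanding of what large models can do.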
Best for: Developers evaluating and comparing open-source LLMs and multimodal models
FELM
Community
FELM: Benchmarking Factuality Evaluation of Large Language Models
Best for: Researchers and developers needing a standardized way to measure LLM factuality
Holistic Evaluation of Language Models
Community
Stanford
Best for: Researchers and engineers who need a rigorous, standardized way to assess and compare language model capabilities and limitations.
LawBench
Community
LawBench
Best for: Researchers and engineers evaluating or selecting LLMs for legal applications
LLMEval
Community
LLMEval is a research series dedicated to building comprehensive, fair, and robust evaluation frameworks for large language models.
Best for: Researchers and developers building or using LLM evaluation benchmarks
M3CoT
Community
Leaderboard | M3CoT
Best for: Researchers and developers evaluating multi-modal chain-of-thought reasoning in AI models
MathEval
Community
A comprehensive benchmarking platform designed to evaluate large models' mathematical abilities, spanning 20 fields and nearly 30,000 math problems.
Best for: Researchers and developers benchmarking mathematical reasoning in large models.
DreamBench++
Community
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
Best for: Researchers and developers working on personalized image generation models
OlympicArena
Community
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
Best for: Researchers and developers evaluating reasoning capabilities of AI models across multiple disciplines.
SciBench
Community
Evaluating scientific problems
Best for: Researchers and developers evaluating AI systems on scientific reasoning tasks
SuperBench
Community
A benchmark platform for evaluating large language models (LLMs) on a range of tasks, focusing on their performance in different aspects such as natural language…
Best for: Researchers and developers who need a standardized platform to compare LLM performance across common tasks.