Open Source Alternatives

Open source alternatives to LM Eval Harness

20 open-source alternatives in the index, ranked by GitHub stars and freshness.

O OSS Framework medium

OpenAI Evals

Community

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

★ 18,584 updated 1mo ago
open-source

Best for: Teams building LLM applications who need systematic, reproducible evaluation workflows

O OSS Framework medium

Giskard

Community

🐢 Open-Source Evaluation & Testing library for LLM Agents

★ 5,414 updated 5d ago
open-source

Best for: Python developers building LLM agents who need automated safety and quality testing.

O OSS Framework medium

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Community

Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models

★ 3,244 updated 1y ago
open-source

Best for: Researchers and engineers studying language model capabilities and scaling behavior

O OSS Framework medium

HELM

Community

Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducib

★ 2,811 updated 2d ago
open-source

Best for: Researchers and developers who need rigorous, multi-dimensional evaluation of foundation models

O OSS Framework medium

Chain-of-Thought Hub

Community

Benchmarking large language models' complex reasoning ability with chain-of-thought prompting

★ 2,773 updated 1y ago
open-source

Best for: Researchers and developers evaluating LLM reasoning capabilities with chain-of-thought prompting

O OSS Framework medium

lighteval

Community

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

★ 2,430 updated 5d ago
open-source

Best for: Researchers and developers who need a unified way to evaluate and compare LLMs from different sources

O OSS Framework medium

instruct-eval

Community

This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.

★ 553 updated 2y ago
open-source

Best for: Researchers and developers who need a simple, standardized way to evaluate instruction-tuned language models

O OSS Framework medium

OLMO-eval

Community

Evaluation suite for LLMs

★ 379 updated 10mo ago
open-source

Best for: Researchers and developers evaluating OLMo or compatible LLMs with reproducible benchmarks

O OSS Framework medium

AlpacaEval

Community

AlpacaEval Leaderboard

open-source

Best for: Researchers and developers benchmarking instruction-tuned language models

O OSS Framework medium

CompassRank

Community

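The leaderboard aims to provide comprehensive, objective, and neutral scores and rankings for large language models and multimodal models, along with scoring references across multiple capability dimensions, so that users can gain a fuller understanding of what large models can do.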

open-source

Best for: Developers evaluating and comparing open-source LLMs and multimodal models

O OSS Framework medium

FELM

Community

FELM: Benchmarking Factuality Evaluation of Large Language Models

open-source

Best for: Researchers and developers needing a standardized way to measure LLM factuality

O OSS Framework medium

Holistic Evaluation of Language Models

Community

Stanford

open-source

Best for: Researchers and engineers who need a rigorous, standardized way to assess and compare language model capabilities and limitations.

O OSS Framework medium

LawBench

Community

LawBench

open-source

Best for: Researchers and engineers evaluating or selecting LLMs for legal applications

O OSS Framework medium

LLMEval

Community

LLMEval is a research series dedicated to building comprehensive, fair, and robust evaluation frameworks for large language models.

open-source

Best for: Researchers and developers building or using LLM evaluation benchmarks

O OSS Framework medium

M3CoT

Community

Leaderboard | M³CoT

open-source

Best for: Researchers and developers evaluating multi-modal chain-of-thought reasoning in AI models

O OSS Framework medium

MathEval

Community

A comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.

open-source

Best for: Researchers and developers benchmarking mathematical reasoning in large models.

O OSS Framework medium

DreamBench++

Community

DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

open-source

Best for: Researchers and developers working on personalized image generation models

O OSS Framework medium

OlympicArena

Community

OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

open-source

Best for: Researchers and developers evaluating reasoning capabilities of AI models across multiple disciplines.

O OSS Framework medium

SciBench

Community

Evaluating scientific problems

open-source

Best for: Researchers and developers evaluating AI systems on scientific reasoning tasks

O OSS Framework medium

SuperBench

Community

A benchmark platform designed for evaluating large language models (LLMs) on a range of tasks, particularly focusing on their performance in different aspects such as natural langu

open-source

Best for: Researchers and developers who need a standardized platform to compare LLM performance across common tasks.