lm-evaluation-harness

by Community

A framework for few-shot evaluation of language models.

Added 1 June 2026

#evaluation-framework #language-model #transformer

Overview

Python framework for evaluating language models on standardized benchmarks using few-shot prompting. It supports multiple model backends and task definitions, so performance can be measured reproducibly against established datasets such as MMLU and HellaSwag.
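To make "few-shot prompting" concrete, here is a minimal sketch of how a k-shot prompt can be assembled: a few solved examples are placed ahead of the unanswered question. The function name, the "Question:/Answer:" layout, and the toy examples are all assumptions made for illustration; they are not the harness's actual prompt templates.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]], question: str, k: int
) -> str:
    """Place the first k solved examples ahead of an unanswered question.

    Hypothetical helper for illustration only; real benchmark tasks
    define their own prompt formats.
    """
    shots = examples[:k]
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    # The final block is left open so the model completes the answer.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Made-up toy examples, not drawn from any benchmark.
EXAMPLES = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
    ("What color is the sky on a clear day?", "Blue"),
]

prompt = build_few_shot_prompt(EXAMPLES, "What is 3 + 5?", k=2)
print(prompt)
```

Setting k to 0 gives a zero-shot prompt containing only the open question.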

Best for
Researchers and engineers benchmarking LLM performance against established academic standards

Use cases

  • Comparing performance across different LLM architectures on standard benchmarks
  • Measuring model degradation or improvement after fine-tuning or quantization
  • Validating model behavior on specific task categories before deployment
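
The fine-tuning and quantization use case above comes down to comparing per-task scores from two runs. The sketch below assumes each run has already been reduced to a plain dictionary of task name to accuracy; the dictionary shape, the helper names, the tolerance, and the scores are all hypothetical and are not the harness's output format.

```python
from typing import Dict, List


def accuracy_deltas(
    baseline: Dict[str, float], candidate: Dict[str, float]
) -> Dict[str, float]:
    """Return candidate minus baseline accuracy for tasks present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    return {task: round(candidate[task] - baseline[task], 4) for task in shared}


def regressions(deltas: Dict[str, float], tolerance: float = 0.01) -> List[str]:
    """List tasks whose accuracy dropped by more than the tolerance."""
    return [task for task, delta in deltas.items() if delta < -tolerance]


# Made-up scores for illustration only; not real benchmark results.
BASELINE = {"mmlu": 0.62, "hellaswag": 0.79}
QUANTIZED = {"mmlu": 0.60, "hellaswag": 0.785}

deltas = accuracy_deltas(BASELINE, QUANTIZED)
flagged = regressions(deltas, tolerance=0.01)
print(deltas)
print(flagged)
```

The tolerance is a judgment call: a drop smaller than the benchmark's sampling noise is not evidence of degradation.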

Notes

12,772 stars on GitHub. Last updated 2026-05-11. Licensed MIT.

Pros

  • Extensive built-in benchmark library reduces setup time for common evaluations
  • Supports multiple model backends (local, API-based, custom implementations)
  • Active community maintenance with 12k+ stars and regular benchmark additions

Cons

  • Steep learning curve for custom task definition and evaluation logic
  • Evaluation runs can be computationally expensive and time-consuming at scale
  • Limited guidance on interpreting results or statistical significance testing
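
As a rough aid for the significance question raised in the last point, the sketch below computes a binomial standard error for an accuracy score and an unpooled two-proportion z statistic for the gap between two models. It assumes accuracy is a simple fraction of correct items and treats the two runs as independent samples. That is a simplification when both models answer the same items, where a paired test such as McNemar's would be more appropriate. The function names and the numbers are illustrative, not part of the harness.

```python
import math


def accuracy_stderr(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def two_proportion_z(acc_a: float, n_a: int, acc_b: float, n_b: int) -> float:
    """Unpooled z statistic for the difference between two accuracies.

    Assumes the two runs are independent samples (a simplification).
    """
    se = math.sqrt(accuracy_stderr(acc_a, n_a) ** 2 + accuracy_stderr(acc_b, n_b) ** 2)
    return (acc_a - acc_b) / se


# Made-up numbers: 60% vs 50% accuracy on 100 items each.
z = two_proportion_z(0.60, 100, 0.50, 100)
significant_at_5_percent = abs(z) > 1.96  # two-sided normal threshold
print(round(z, 3), significant_at_5_percent)
```

With these illustrative inputs a ten-point gap on 100 items does not clear the 5% threshold, which is why small benchmark subsets need cautious reading.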

Indexed from awesome-llm and enriched against its public facts.

Pairs with

Other entries in the index that connect to this one (33 entries).

O OSS Framework medium

ACLUE

Community

Official github repo for ACLUE, an evaluation benchmark focused on ancient Chinese language comprehension

★ 34 updated 2y ago
O OSS Framework medium

Awesome-Align-LLM-Human

Community

Aligning Large Language Models with Human: A Survey

★ 742 updated 2y ago
O OSS Framework medium

Awesome-Code-LLM

Community

👨‍💻 An awesome and curated list of best code-LLM for research.

★ 1,287 updated 1y ago
O OSS Framework medium

awesome-hallucination-detection

Community

List of papers on hallucination detection in LLMs.

★ 1,096 updated 9d ago
O OSS Framework medium

awesome-language-model-analysis

Community

This paper list focuses on the theoretical and empirical analysis of language models, especially large language models (LLMs). The papers in this list investigate the learning beha

★ 100 updated 1y ago
O OSS Framework medium

Awesome-LLM-hallucination

Community

LLM hallucination paper list

★ 335 updated 2y ago
O OSS Framework medium

Awesome LLM Human Preference Datasets

Community

A curated list of Human Preference Datasets for LLM fine-tuning, RLHF, and eval.

★ 391 updated 2y ago
O OSS Framework medium

Chinese Large Model Leaderboard

Community

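NoneLinear (非线智能) ReLE evaluation: a capability evaluation of Chinese AI large models (continuously updated). It currently covers 374 large models, including commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as st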

★ 6,103 updated 4d ago
O OSS Framework medium

CompMix

Community

CompMix: A Benchmark for Heterogeneous Question Answering.

O OSS Framework medium

Emergent Abilities of Large Language Models

Community

Emergent Abilities

O OSS Framework medium

Evaluating Large Language Models Trained on Code

Community

2021-08

O OSS Framework medium

Finetuned Language Models are Zero-Shot Learners

Community

This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning—finetuning language models on a collection

O OSS Framework medium

InfiBench

Community

InfiBench: Evaluating the Question-Answering Capabilities of Code LLMs

O OSS Framework medium

LawBench

Community

LawBench

O OSS Framework medium

LLMEval

Community

LLMEval is a research series dedicated to building comprehensive, fair, and robust evaluation frameworks for large language models.

O OSS Framework medium

Meta Lingua

Community

Meta Lingua: a lean, efficient, and easy-to-hack codebase to research LLMs.

★ 4,760 updated 10mo ago
O OSS Framework medium

MMedBench

Community

Medical Multilingual Benchmark

O OSS Framework medium

MMToM-QA

Community

Leaderboard for the MMToM-QA benchmark (Jin et al., ACL 2024).

O OSS Framework medium

Multitask Prompted Training Enables Zero-Shot Task Generalization

Community

Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is

O OSS Framework medium

Neurips2022-Foundational Robustness of Foundation Models

Community

NeurIPS Tutorial Foundational Robustness of Foundation Models

O OSS Framework medium

PubMedQA

Community

PubMedQA Homepage

O OSS Framework medium

Qwen2-Math-1.5B|7B|72B

Community

🚨 This model mainly supports English. We will release bilingual (English and Chinese) math models soon. Introduction: Over the past year, w

O OSS Framework medium

Ragas

Community

Supercharge Your LLM Application Evaluations 🚀

★ 14,186 updated 3mo ago
O OSS Framework medium

Solving Quantitative Reasoning Problems with Language Models

Community

Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally st

O OSS Framework medium

SuperLim

Community

A Swedish language understanding benchmark that evaluates natural language processing (NLP) models on various tasks such as argumentation analysis, semantic similarity, and textual

O OSS Framework medium

TAT-DQA

Community

TAT-DQA: A Document Visual Question Answering (VQA) Dataset, aiming to answer questions over visually-rich documents with a hybrid of Tabular and Textual Content in Finance

O OSS Framework medium

WHOOPS!

Community

Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

O OSS Framework medium

MixEval

Community

Deriving Wisdom of the Crowd from LLM Benchmark Mixtures

O OSS Framework medium

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

Community

2022-12

O OSS Framework medium

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

Community

Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to al

O OSS Framework medium

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Community

How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce

O OSS Framework medium

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

Community

Flan 2022 Collection

O OSS Framework medium

We-Math

Community

Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Alternatives (20 entries)
O OSS Framework medium

AlpacaEval

Community

AlpacaEval Leaderboard

O OSS Framework medium

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Community

Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models

★ 3,244 updated 1y ago
O OSS Framework medium

Chain-of-Thought Hub

Community

Benchmarking large language models' complex reasoning ability with chain-of-thought prompting

★ 2,773 updated 1y ago
O OSS Framework medium

CompassRank

Community

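The leaderboard aims to provide comprehensive, objective, and neutral scores and rankings for large language models and multimodal models, along with score references across multiple capability dimensions, so that users can gain a fuller understanding of the capability levels of large models.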

O OSS Framework medium

FELM

Community

FELM: Benchmarking Factuality Evaluation of Large Language Models

O OSS Framework medium

Giskard

Community

🐢 Open-Source Evaluation & Testing library for LLM Agents

★ 5,414 updated 5d ago
O OSS Framework medium

HELM

Community

Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducib

★ 2,811 updated 2d ago
O OSS Framework medium

Holistic Evaluation of Language Models

Community

Stanford

O OSS Framework medium

instruct-eval

Community

This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.

★ 553 updated 2y ago
O OSS Framework medium

LawBench

Community

LawBench

O OSS Framework medium

lighteval

Community

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

★ 2,430 updated 5d ago
O OSS Framework medium

LLMEval

Community

LLMEval is a research series dedicated to building comprehensive, fair, and robust evaluation frameworks for large language models.

O OSS Framework medium

M3CoT

Community

Leaderboard | M3CoT

O OSS Framework medium

MathEval

Community

A comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.

O OSS Framework medium

OLMO-eval

Community

Evaluation suite for LLMs

★ 379 updated 10mo ago
O OSS Framework medium

OpenAI Evals

Community

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

★ 18,584 updated 1mo ago
O OSS Framework medium

DreamBench++

Community

DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

O OSS Framework medium

OlympicArena

Community

OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

O OSS Framework medium

SciBench

Community

Evaluating scientific problems

O OSS Framework medium

SuperBench

Community

A benchmark platform designed for evaluating large language models (LLMs) on a range of tasks, particularly focusing on their performance in different aspects such as natural langu