lm-evaluation-harness
by Community
A framework for few-shot evaluation of language models.
OSS
Added 1 June 2026
Overview
Python framework for evaluating language models across standardized benchmarks using few-shot prompting. Supports multiple model backends and task definitions, enabling reproducible performance measurement against established datasets like MMLU, HellaSwag, and others.
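As a rough illustration of what few-shot prompting means here, the sketch below prepends k solved examples to an unanswered question and scores predictions by exact match. The function names `build_few_shot_prompt` and `accuracy` and the "Question:/Answer:" template are hypothetical, chosen for illustration only; they are not the harness's own API or prompt format.

```python
from typing import List, Sequence, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str, k: int) -> str:
    """Prepend the first k solved (question, answer) pairs to the open question.

    Hypothetical helper: the template is an assumption, not the harness's format.
    """
    shots = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    # The final block leaves the answer blank for the model to complete.
    shots.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(shots)


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy after stripping surrounding whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one reference is required")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)
```

Fixing the examples, their order, and k across runs is what makes scores comparable between models under this kind of setup.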
Best for
Researchers and engineers benchmarking LLM performance against established academic standards
Use cases
- Comparing performance across different LLM architectures on standard benchmarks
- Measuring model degradation or improvement after fine-tuning or quantization
- Validating model behavior on specific task categories before deployment
Notes
12,772 stars on GitHub. Last updated 2026-05-11. Licensed MIT.
Pros
- Extensive built-in benchmark library reduces setup time for common evaluations
- Supports multiple model backends (local, API-based, custom implementations)
- Active community maintenance with 12k+ stars and regular benchmark additions
Cons
- Steep learning curve for custom task definition and evaluation logic
- Evaluation runs can be computationally expensive and time-consuming at scale
- Limited guidance on interpreting results or statistical significance testing
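On the last point, one common rough check is to attach a binomial standard error to each accuracy score and ask whether the gap between two runs exceeds the combined uncertainty. The sketch below assumes a normal approximation and treats the two runs as independent samples, ignoring the pairing that exists when both models answer the same items. The function names are hypothetical and are not part of the harness.

```python
import math
from typing import Tuple


def accuracy_with_stderr(correct: int, total: int) -> Tuple[float, float]:
    """Return (accuracy, binomial standard error) for one evaluation run."""
    if total <= 0:
        raise ValueError("total must be positive")
    acc = correct / total
    stderr = math.sqrt(acc * (1.0 - acc) / total)
    return acc, stderr


def difference_is_notable(
    run_a: Tuple[float, float], run_b: Tuple[float, float], z: float = 1.96
) -> bool:
    """True if the accuracy gap exceeds z combined standard errors.

    Assumes independent runs and a normal approximation; z=1.96 corresponds
    to roughly a 95% two-sided interval.
    """
    acc_a, se_a = run_a
    acc_b, se_b = run_b
    se_diff = math.sqrt(se_a**2 + se_b**2)
    return abs(acc_a - acc_b) > z * se_diff
```

For example, a two-point accuracy drop after quantization on a 100-item task falls well inside the noise under these assumptions, while the same drop on a much larger task may not.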
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
PyTorch
Community
Tensors and Dynamic neural networks in Python with strong GPU acceleration
vLLM
Community
A high-throughput and memory-efficient inference and serving engine for LLMs
llama.cpp
Community
LLM inference in C/C++
ACLUE
Community
Official github repo for ACLUE, an evaluation benchmark focused on ancient Chinese language comprehension
Awesome-Align-LLM-Human
Community
Aligning Large Language Models with Human: A Survey
Awesome-Code-LLM
Community
👨‍💻 An awesome and curated list of best code-LLM for research.
awesome-hallucination-detection
Community
List of papers on hallucination detection in LLMs.
awesome-language-model-analysis
Community
This paper list focuses on the theoretical and empirical analysis of language models, especially large language models (LLMs). The papers in this list investigate the learning beha
Awesome-LLM-hallucination
Community
LLM hallucination paper list
Awesome LLM Human Preference Datasets
Community
A curated list of Human Preference Datasets for LLM fine-tuning, RLHF, and eval.
Chinese Large Model Leaderboard
Community
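NoneLinear (非线智能) ReLE evaluation: a continuously updated capability evaluation of Chinese AI large models. It currently covers 374 large models, including commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as st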
CompMix
Community
CompMix: A Benchmark for Heterogeneous Question Answering.
Emergent Abilities of Large Language Models
Community
Emergent Abilities
Evaluating Large Language Models Trained on Code
Community
2021-08
Finetuned Language Models are Zero-Shot Learners
Community
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning—finetuning language models on a collection
InfiBench
Community
InfiBench: Evaluating the Question-Answering Capabilities of Code LLMs
LawBench
Community
LawBench
LLMEval
Community
LLMEval is a research series dedicated to building comprehensive, fair, and robust evaluation frameworks for large language models.
Meta Lingua
Community
Meta Lingua: a lean, efficient, and easy-to-hack codebase to research LLMs.
MMedBench
Community
Medical Multilingual Benchmark
MMToM-QA
Community
Leaderboard for the MMToM-QA benchmark (Jin et al., ACL 2024).
Multitask Prompted Training Enables Zero-Shot Task Generalization
Community
Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is
Neurips2022-Foundational Robustness of Foundation Models
Community
NeurIPS Tutorial Foundational Robustness of Foundation Models
PubMedQA
Community
PubMedQA Homepage
Qwen2-Math-1.5B|7B|72B
Community
🚨 This model mainly supports English. We will release bilingual (English and Chinese) math models soon. Introduction Over the past year, w
Ragas
Community
Supercharge Your LLM Application Evaluations 🚀
Solving Quantitative Reasoning Problems with Language Models
Community
Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally st
SuperLim
Community
A Swedish language understanding benchmark that evaluates natural language processing (NLP) models on various tasks such as argumentation analysis, semantic similarity, and textual
TAT-DQA
Community
TAT-DQA: A Document Visual Question Answering (VQA) Dataset, aiming to answer questions over visually-rich documents with a hybrid of Tabular and Textual Content in Finance
WHOOPS!
Community
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images
MixEval
Community
Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
Community
2022-12
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
Community
Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to al
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Community
How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce \textit{
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
Community
Flan 2022 Collection
We-Math
Community
Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
AlpacaEval
Community
AlpacaEval Leaderboard
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Community
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
Chain-of-Thought Hub
Community
Benchmarking large language models' complex reasoning ability with chain-of-thought prompting
CompassRank
Community
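This leaderboard aims to provide comprehensive, objective, and neutral scores and rankings for large language models and multimodal models, along with scores across multiple capability dimensions, so that users can gain a fuller understanding of large-model capabilities.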
FELM
Community
FELM: Benchmarking Factuality Evaluation of Large Language Models
Giskard
Community
🐢 Open-Source Evaluation & Testing library for LLM Agents
HELM
Community
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducib
Holistic Evaluation of Language Models
Community
Stanford
instruct-eval
Community
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
lighteval
Community
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
M3CoT
Community
M3CoT leaderboard
MathEval
Community
A comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.
OLMO-eval
Community
Evaluation suite for LLMs
OpenAI Evals
Community
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
DreamBench++
Community
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
OlympicArena
Community
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
SciBench
Community
Evaluating scientific problems
SuperBench
Community
A benchmark platform designed for evaluating large language models (LLMs) on a range of tasks, particularly focusing on their performance in different aspects such as natural langu