lm-evaluation-harness

by Community

A framework for few-shot evaluation of language models.

OSS

Added 1 June 2026

#evaluation-framework #language-model #transformer

Overview

Python framework for evaluating language models on standardized benchmarks using few-shot prompting. It supports multiple model backends and task definitions, enabling reproducible performance measurement against established datasets such as MMLU and HellaSwag.
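
As a rough illustration of what a few-shot benchmark run might look like, the sketch below assembles a command line for the harness's `lm_eval` entry point without executing it. The flag names follow the project's 0.4.x command-line interface as best understood here and should be checked against `lm_eval --help` for the installed version. The helper `build_lm_eval_command`, the model name, and the default values are assumptions chosen for the example, not part of this entry.

```python
import shlex


def build_lm_eval_command(pretrained, tasks, num_fewshot=5, batch_size=8,
                          output_path="results"):
    """Assemble an lm_eval invocation as an argument list (hypothetical helper).

    Nothing is executed here; the list can be passed to subprocess.run
    once the flags have been verified against the installed version.
    """
    return [
        "lm_eval",
        "--model", "hf",                              # Hugging Face backend
        "--model_args", f"pretrained={pretrained}",   # which checkpoint to load
        "--tasks", ",".join(tasks),                   # comma-separated task names
        "--num_fewshot", str(num_fewshot),            # examples placed in the prompt
        "--batch_size", str(batch_size),
        "--output_path", output_path,                 # where result files are written
    ]


# Example model and tasks are illustrative choices only.
command = build_lm_eval_command("EleutherAI/pythia-160m", ["hellaswag", "mmlu"])
print(shlex.join(command))
```

Building the argument list separately from running it keeps the exact settings of a run easy to log, which helps when results need to be reproduced later.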

Best for

Researchers and engineers benchmarking LLM performance against established academic standards

Use cases

Comparing performance across different LLM architectures on standard benchmarks
Measuring model degradation or improvement after fine-tuning or quantization
Validating model behavior on specific task categories before deployment
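
The before-and-after comparison in the second use case can be sketched as a per-task accuracy difference between two runs. This is a minimal sketch, assuming each run has already been reduced to a plain mapping from task name to accuracy; the function `accuracy_deltas` is hypothetical and the scores are placeholder values for illustration, not real measurements.

```python
def accuracy_deltas(baseline, candidate):
    """Return candidate-minus-baseline accuracy for tasks present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    return {task: round(candidate[task] - baseline[task], 4) for task in shared}


# Placeholder numbers for illustration only; not real benchmark results.
baseline_scores = {"hellaswag": 0.60, "mmlu": 0.45}
quantized_scores = {"hellaswag": 0.58, "mmlu": 0.45, "arc_easy": 0.70}

deltas = accuracy_deltas(baseline_scores, quantized_scores)
for task, delta in deltas.items():
    # A negative delta means the candidate scored lower than the baseline.
    print(f"{task}: {delta:+.4f}")
```

Tasks that appear in only one run are skipped, so a comparison never silently treats a missing score as zero.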

Notes

12,772 stars on GitHub. Last updated 2026-05-11. Licensed MIT.

Pros

Extensive built-in benchmark library reduces setup time for common evaluations
Supports multiple model backends (local, API-based, custom implementations)
Active community maintenance with 12k+ stars and regular benchmark additions

Cons

Steep learning curve for custom task definition and evaluation logic
Evaluation runs can be computationally expensive and time-consuming at scale
Limited guidance on interpreting results or statistical significance testing
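
On the last point, one common way to judge whether a gap between two accuracy scores is larger than sampling noise is a two-proportion z-test. The sketch below is an assumption-laden illustration, not a method prescribed by the harness: it treats each example as an independent trial and the two runs as independent samples, and the function names and numbers are hypothetical placeholders. When both models are scored on the same examples, a paired test would usually be more appropriate.

```python
import math


def accuracy_stderr(accuracy, n_examples):
    """Binomial standard error of an accuracy estimate over n independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


def z_score_for_difference(acc_a, n_a, acc_b, n_b):
    """Unpooled two-proportion z-score for acc_a - acc_b (runs assumed independent)."""
    combined_se = math.sqrt(
        accuracy_stderr(acc_a, n_a) ** 2 + accuracy_stderr(acc_b, n_b) ** 2
    )
    return (acc_a - acc_b) / combined_se


# Placeholder scores and sample sizes for illustration only.
z = z_score_for_difference(0.60, 1000, 0.58, 1000)
significant = abs(z) > 1.96  # two-sided threshold at roughly the 5% level
print(f"z = {z:.3f}, significant at ~5%: {significant}")
```

With these placeholder inputs the two-point gap falls inside the noise band, which shows why small differences on a benchmark of about a thousand examples deserve caution.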

Indexed from awesome-llm and enriched against its public facts.

Pairs with

Other entries in the index that connect to this one.

Uses (3 entries)

PyTorch

Community

Tensors and Dynamic neural networks in Python with strong GPU acceleration

PyTorch

vLLM

llama.cpp

Open LLM Leaderboard

ACLUE

Awesome-Align-LLM-Human

Awesome-Code-LLM

awesome-hallucination-detection

awesome-language-model-analysis

Awesome-LLM-hallucination

Awesome LLM Human Preference Datasets

Chinese Large Model Leaderboard

CompMix

Emergent Abilities of Large Language Models

Evaluating Large Language Models Trained on Code

Finetuned Language Models are Zero-Shot Learners

InfiBench

LawBench

LLMEval

Meta Lingua

MMedBench

MMToM-QA

Multitask Prompted Training Enables Zero-Shot Task Generalization

Neurips2022-Foundational Robustness of Foundation Models

PubMedQA

Qwen2-Math-1.5B|7B|72B

Ragas

Solving Quantitative Reasoning Problems with Language Models

SuperLim

TAT-DQA

WHOOPS!

MixEval

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

We-Math

AlpacaEval

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Chain-of-Thought Hub

CompassRank

FELM

Giskard

HELM

Holistic Evaluation of Language Models

instruct-eval

LawBench

lighteval

LLMEval

M3CoT

MathEval

OLMO-eval

OpenAI Evals

DreamBench++

OlympicArena

SciBench

SuperBench