OLMO-eval
by Community
Evaluation suite for LLMs
OSS
Added 1 June 2026
Overview
OLMO-eval is a Python-based evaluation suite for large language models (LLMs). It provides standardized benchmarks and metrics to assess model performance across multiple tasks.
Best for
Researchers and developers evaluating OLMo or compatible LLMs with reproducible benchmarks
Use cases
- Running reproducible evaluations on LLMs using established benchmarks
- Comparing performance of different model versions or configurations
- Integrating evaluation pipelines into model training workflows
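As a rough illustration of the comparison use case, the sketch below diffs per-task scores from two evaluation runs. It assumes each run has already been reduced to a plain mapping from task name to score; the function name `compare_runs`, the task names, and the scores are hypothetical placeholders, not part of OLMO-eval's API or output format.

```python
from typing import Dict


def compare_runs(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Return candidate-minus-baseline score for every task present in both runs."""
    # Only tasks evaluated in both runs can be compared fairly.
    shared = sorted(set(baseline) & set(candidate))
    return {task: candidate[task] - baseline[task] for task in shared}


# Placeholder scores, invented purely for illustration (not real results).
baseline_scores = {"task_a": 0.50, "task_b": 0.75}
candidate_scores = {"task_a": 0.625, "task_b": 0.75, "task_c": 0.25}

deltas = compare_runs(baseline_scores, candidate_scores)
for task, delta in deltas.items():
    print(f"{task}: {delta:+.3f}")
```

Restricting the comparison to shared tasks keeps a newly added benchmark from skewing the picture; how scores are actually loaded would depend on the output format of whichever evaluation tool produced them.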
Notes
379 stars on GitHub. Last updated 2025-07-11. Licensed Apache-2.0.
Pros
- Open-source and community-maintained under the Allen AI umbrella
- Simplifies running standard LLM evaluations with a single Python framework
Cons
- Modest star count (379) may indicate limited community adoption and support
- Primarily designed for OLMo models; may require adaptation for other architectures
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
OpenAI Evals
Community
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
lm-evaluation-harness
Community
A framework for few-shot evaluation of language models.