instruct-eval
by Community
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
OSS
Added 1 June 2026
Overview
Community framework for quantitative evaluation of instruction-tuned models (e.g., Alpaca, Flan-T5) on held-out tasks. It provides a standardized benchmarking setup to measure model performance on unseen instructions.
Best for
Researchers and developers who need a simple, standardized way to evaluate instruction-tuned language models.
Use cases
- Evaluate instruction-tuned models on a held-out task set
- Benchmark custom instruction-tuned models against baselines
- Compare output quality across different instruction-tuned architectures
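As a rough illustration of the use cases above, the sketch below scores two stand-in models on a tiny held-out set using exact-match accuracy. It does not use instruct-eval's own API. The names `HELD_OUT`, `exact_match_accuracy`, `baseline_model`, `custom_model`, and `compare` are hypothetical, and the example data is made up for demonstration only.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical held-out examples: (instruction, reference answer) pairs.
HELD_OUT: List[Tuple[str, str]] = [
    ("What is 2 + 2? Answer with a number.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
    ("Is water wet? Answer yes or no.", "yes"),
]


def exact_match_accuracy(
    model: Callable[[str], str], examples: List[Tuple[str, str]]
) -> float:
    """Fraction of examples whose normalized output equals the reference."""
    if not examples:
        return 0.0
    hits = sum(
        model(instruction).strip().lower() == answer.strip().lower()
        for instruction, answer in examples
    )
    return hits / len(examples)


# Stand-in "models": plain functions from prompt to text (assumption, not real LLMs).
def baseline_model(prompt: str) -> str:
    return "yes"  # always gives the same answer


def custom_model(prompt: str) -> str:
    canned = {
        "What is 2 + 2? Answer with a number.": "4",
        "What is the capital of France? Answer with one word.": " paris ",
    }
    return canned.get(prompt, "")  # empty string when it has no answer


def compare(
    models: Dict[str, Callable[[str], str]], examples: List[Tuple[str, str]]
) -> Dict[str, float]:
    """Score every model on the same examples so results are comparable."""
    return {name: exact_match_accuracy(fn, examples) for name, fn in models.items()}


scores = compare({"baseline": baseline_model, "custom": custom_model}, HELD_OUT)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Running every model against the same fixed example set with the same metric is what makes the comparison reproducible. A real evaluation would replace the stand-in functions with calls to actual models and use an established benchmark rather than hand-written examples.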
Notes
553 stars on GitHub. Last updated 2024-03-10. Licensed Apache-2.0.
Pros
- Lightweight and focused solely on evaluation
- Open source with community support
- Provides a consistent, reproducible evaluation pipeline
Cons
- Limited to instruction-tuned models only
- May not cover all evaluation metrics needed for production
- Requires manual integration with specific model formats
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
lm-evaluation-harness
Community
A framework for few-shot evaluation of language models.
OpenAI Evals
Community
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.