Chain-of-Thought Hub
by Community
Benchmarking large language models' complex reasoning ability with chain-of-thought prompting
OSS
Added 1 June 2026
Overview
Chain-of-Thought Hub is a community-maintained benchmarking framework for evaluating large language models on complex reasoning tasks using chain-of-thought prompting. It provides datasets, prompts, and evaluation scripts in Jupyter Notebook format to measure and compare model performance.
Best for
Researchers and developers evaluating LLM reasoning capabilities with chain-of-thought prompting
Use cases
- Benchmark LLM reasoning abilities with chain-of-thought prompts
- Compare multiple models on standardized reasoning tasks
- Reproduce and extend research on chain-of-thought prompting
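As a rough illustration of the benchmarking use case above, a chain-of-thought evaluation loop typically builds a few-shot prompt with worked reasoning, asks the model to continue, extracts the final answer from the completion, and scores it against the gold label. The sketch below is a minimal, hypothetical version of that loop; the names `FEW_SHOT`, `build_prompt`, `extract_answer`, and `accuracy`, the exemplar question, and the "The answer is" convention are assumptions for illustration and are not taken from the Chain-of-Thought Hub repository.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical few-shot exemplar with worked reasoning (not from the project).
FEW_SHOT = (
    "Question: Tom has 3 apples and buys 2 more. How many apples does he have?\n"
    "Let's think step by step. Tom starts with 3 apples. "
    "He buys 2 more, so 3 + 2 = 5.\n"
    "The answer is 5.\n\n"
)


def build_prompt(question: str) -> str:
    """Prepend the exemplar and cue the model to reason step by step."""
    return f"{FEW_SHOT}Question: {question}\nLet's think step by step."


def extract_answer(completion: str) -> Optional[str]:
    """Return the last number following 'answer is', or None if absent."""
    matches = re.findall(r"answer is\s*\$?(-?[\d,]*\.?\d+)", completion)
    return matches[-1].replace(",", "") if matches else None


def accuracy(
    model: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of (question, gold) pairs the model answers correctly.

    `model` is any callable mapping a prompt string to a completion string,
    for example a thin wrapper around an API client or a local model.
    """
    correct = 0
    total = 0
    for question, gold in examples:
        total += 1
        predicted = extract_answer(model(build_prompt(question)))
        if predicted is not None and float(predicted) == float(gold):
            correct += 1
    return correct / total if total else 0.0
```

Running `accuracy` with the same examples and a different `model` callable per system gives directly comparable scores, which is the basic idea behind comparing multiple models on a standardized task.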
Notes
2,773 stars on GitHub. Last updated 2024-08-04. Licensed MIT.
Pros
- Open source with a focused, well-defined scope
- Community-driven, with 2,773 GitHub stars (last updated 2024-08-04)
- Provides ready-to-use datasets and evaluation code
Cons
- Evaluation scripts ship as Jupyter Notebooks, which are awkward to run in automated or production pipelines
- Primarily a benchmarking tool, not a runtime or inference framework
- Requires manual setup and model API keys or local models
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
lm-evaluation-harness
Community
A framework for few-shot evaluation of language models.
OpenAI Evals
Community
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.