SciBench
by Community
Evaluating scientific problems
OSS
Added 2 June 2026
Overview
SciBench is a community-maintained benchmark for evaluating AI systems on scientific problem solving. It provides a standardized set of tasks across scientific domains and maintains a public leaderboard for comparing model performance.
Best for
Researchers and developers evaluating AI systems on scientific reasoning tasks
Use cases
- Benchmark scientific reasoning capabilities of language models
- Compare model performance on standardized scientific tasks
- Track progress in scientific problem solving across AI systems
Notes
Pros
- Open source and community-driven, encouraging broad participation
- Focuses on rigorous scientific reasoning rather than general language tasks
- Public leaderboard enables transparent comparison
Cons
- Limited to the scientific domains covered by the benchmark tasks
- May not reflect real-world scientific problem complexity
- Leaderboard updates depend on community contributions
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
OpenAI Evals
Community
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
lm-evaluation-harness
Community
A framework for few-shot evaluation of language models.