InfiBench
by Community
InfiBench: Evaluating the Question-Answering Capabilities of Code LLMs
OSS
Added 1 June 2026
Overview
InfiBench is a community-driven benchmark for evaluating the question-answering capabilities of code-focused large language models. It provides a standardized set of tasks and metrics to measure how well these models understand and respond to code-related queries.
Best for
Researchers and developers evaluating or comparing code LLMs on question-answering tasks
Use cases
- Comparing the QA performance of different code LLMs on a common benchmark
- Identifying strengths and weaknesses of a code LLM in answering programming questions
- Validating improvements in a code LLM's question-answering abilities during development
Pros
- Provides a focused, standardized evaluation for code LLM QA tasks
- Community-driven, allowing for broad input and relevance
- Helps developers and researchers make informed model comparisons
Cons
- Limited to question answering; does not cover other code generation or code understanding tasks
- As a community project, may have less frequent updates or support than commercial benchmarks
- Requires familiarity with the benchmark setup to interpret results correctly
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
lm-evaluation-harness
Community
A framework for few-shot evaluation of language models.
OpenAI Evals
Community
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Ragas
Community
Supercharge Your LLM Application Evaluations 🚀