Evaluating Large Language Models Trained on Code
by Community
2021-08
OSS
Added 1 June 2026
Overview
A research paper introducing Codex, a GPT model fine-tuned on public GitHub code, and HumanEval, a benchmark of 164 hand-written programming problems. The paper measures functional correctness: the model generates code from a docstring, and the result counts as correct only if it passes the problem's unit tests.
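The docstring-to-code check described above can be illustrated with a minimal sketch. It is not the paper's evaluation harness. The function name `passes_tests`, the toy `add` problem, and the convention of a `check(candidate)` test function are assumptions made for illustration. Generated code is untrusted, so a real harness would run it in a sandbox with a timeout rather than calling `exec` in-process.

```python
def passes_tests(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes every assertion in `test`.

    Assumed layout (hypothetical, for illustration only):
      prompt      - function signature plus docstring
      completion  - model-generated function body
      test        - source defining check(candidate) with assertions
      entry_point - name of the function under test
    """
    program = f"{prompt}{completion}\n{test}\ncheck({entry_point})\n"
    namespace: dict = {}
    try:
        # WARNING: executes untrusted code in-process; sandbox this in practice.
        exec(program, namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures.
        return False
    return True


# Toy problem, invented for this example.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
TEST = (
    "def check(candidate):\n"
    "    assert candidate(2, 3) == 5\n"
    "    assert candidate(-1, 1) == 0\n"
)
GOOD = "    return a + b\n"  # a correct completion
BAD = "    return a - b\n"   # an incorrect completion

if __name__ == "__main__":
    print(passes_tests(PROMPT, GOOD, TEST, "add"))
    print(passes_tests(PROMPT, BAD, TEST, "add"))
```

This sketch does not guard against infinite loops or side effects, which is one reason real evaluation setups isolate execution.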
Best for
Researchers and engineers evaluating code generation models
Use cases
- Benchmarking code generation models on functional correctness
- Designing prompts for docstring-to-code synthesis
- Evaluating model safety and bias in code generation
Notes
Pros
- Established a widely adopted benchmark (HumanEval) for code generation
- Introduced an unbiased estimator for the pass@k functional-correctness metric
- Described its methodology transparently and released the HumanEval dataset openly
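For readers unfamiliar with pass@k, here is a minimal sketch of the estimator for a single problem, assuming n samples were generated and c of them passed the unit tests. It computes 1 - C(n-c, k) / C(n, k), the probability that at least one of k samples drawn without replacement from the n is correct. The function name and the argument checks are illustrative choices, not taken from any released code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem
    c: number of those samples that passed the unit tests
    k: number of samples the metric allows (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k includes a passing one.
        return 1.0
    # Probability that all k drawn samples fail, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples with 3 correct: pass@1 is the plain success rate of 3/10.
    print(pass_at_k(10, 3, 1))
    # Allowing 5 attempts from the same 10 samples gives a higher estimate.
    print(pass_at_k(10, 3, 5))
```

A benchmark-level score would then be the mean of this value over all problems. Exact integer binomials keep the sketch simple; for large n a product form that avoids huge intermediate values may be preferable.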
Cons
- Benchmark limited to Python and simple algorithmic tasks
- Codex model weights and training data were not publicly released, limiting reproducibility
- Does not address real-world software engineering workflows
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
lm-evaluation-harness
Community
A framework for few-shot evaluation of language models.
OpenAI Evals
Community
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.