Evaluating Large Language Models Trained on Code

by Community

2021-08

Added 1 June 2026

Overview

A research paper introducing Codex, a GPT model fine-tuned on public GitHub code, and HumanEval, a benchmark of 164 hand-written programming problems. The paper measures functional correctness: the model generates a function body from a signature and docstring, and a sample counts as correct only if it passes the problem's unit tests.
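The docstring-to-code evaluation described above can be sketched in a few lines. This is a minimal illustration, not the paper's harness: the problem, the candidate completions, and the passes helper are all hypothetical, and the prompt / test / entry-point layout is only assumed to resemble a HumanEval problem. It calls exec directly for brevity; generated code is untrusted, so a real harness should run it in a sandbox with timeouts.

```python
# Hypothetical problem in a HumanEval-like layout (not taken from the benchmark).
PROMPT = '''def add(a, b):
    """Return the sum of a and b."""
'''

# Stand-ins for model samples: function bodies that continue the prompt.
CANDIDATES = [
    "    return a + b\n",  # correct completion
    "    return a - b\n",  # incorrect completion
]

# Unit tests that receive the generated function as `candidate`.
TEST = '''
def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes(prompt, completion, test, entry_point):
    """Return True if prompt + completion passes every unit test.

    WARNING: exec runs arbitrary code. This sketch assumes trusted input;
    sandbox the execution before using it on real model output.
    """
    namespace = {}
    try:
        # Define the candidate function and the check() test function.
        exec(prompt + completion + test, namespace)
        # Run the tests; a failed assert raises AssertionError.
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False


results = [passes(PROMPT, c, TEST, "add") for c in CANDIDATES]
print(results)  # [True, False]
```

A sample either passes all tests or counts as a failure; there is no partial credit for output that merely looks similar to a reference solution.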

Best for
Researchers and engineers evaluating code generation models

Use cases

  • Benchmarking code generation models for functional correctness
  • Designing prompts for docstring-to-code synthesis
  • Evaluating model safety and bias in code generation

Pros

  • Established a widely adopted benchmark (HumanEval) for code generation
  • Popularized the pass@k metric for functional correctness and introduced an unbiased estimator for it
  • Described the evaluation methodology in detail and openly released the HumanEval dataset
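The pass@k metric named above can be estimated without bias from n samples per problem, of which c pass the unit tests: pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below is one way to compute it; the function names are illustrative, and it uses exact integer binomials from math.comb rather than the product form a NumPy implementation would typically use.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    Equals the probability that at least one of k samples, drawn without
    replacement from the n generated, is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every draw of k includes a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts, k: int) -> float:
    """Average pass@k over problems; counts is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


# Hypothetical per-problem results: (samples generated, samples passing).
example_counts = [(10, 3), (10, 0), (10, 10)]
print(round(mean_pass_at_k(example_counts, k=1), 4))  # 0.4333
```

With k = 1 the estimate reduces to the fraction of passing samples, c / n, which is a quick sanity check on any implementation.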

Cons

  • Benchmark limited to Python and simple algorithmic tasks
  • Codex model weights and training data not publicly released, limiting reproducibility
  • Does not address real-world software engineering workflows

Indexed from awesome-llm and enriched against its public facts.
