Evaluating Large Language Models Trained on Code

by Community

2021-08

Added 1 June 2026

Overview

A research paper introducing Codex, a GPT language model fine-tuned on publicly available code from GitHub, and HumanEval, a benchmark of 164 hand-written programming problems. The paper measures functional correctness: the model generates a function body from a docstring prompt, and the result counts as correct only if it passes the problem's unit tests.
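
The functional-correctness check described above can be sketched roughly as follows. This is a minimal illustration, not the paper's evaluation harness: the function name `passes_tests` and the toy `add` problem are hypothetical, and the problem layout (a prompt, a completion, and test code that defines `check(candidate)`) is assumed for the example. It runs code in-process with `exec`, so it is only suitable for trusted inputs; untrusted model output needs a sandbox with timeouts.

```python
def passes_tests(prompt: str, completion: str, test_code: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes the unit tests in test_code.

    Assumes test_code defines check(candidate), which raises on failure.
    """
    # Assemble one program: function signature and docstring, generated body,
    # the tests, and a final call that runs the tests on the entry point.
    program = f"{prompt}{completion}\n{test_code}\ncheck({entry_point})\n"
    namespace: dict = {}
    try:
        exec(program, namespace)  # unsafe for untrusted code; illustration only
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


# Hypothetical toy problem, for illustration only.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
TESTS = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)

correct = passes_tests(PROMPT, "    return a + b\n", TESTS, "add")
incorrect = passes_tests(PROMPT, "    return a - b\n", TESTS, "add")
```

Here `correct` is the result for a completion that satisfies the tests and `incorrect` is the result for one that does not. Judging by test execution rather than text similarity is what lets several different but valid solutions all count as correct.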

Best for

Researchers and engineers evaluating code generation models

Use cases

Benchmarking code generation models against functional correctness
Designing prompts for docstring-to-code synthesis
Evaluating model safety and bias in code generation

Notes

Indexed from the awesome-llm list and enriched with publicly available facts about the paper.

Pros

Established a widely adopted benchmark (HumanEval) for code generation
Popularized the pass@k metric for functional correctness and introduced an unbiased estimator for it
Provided a transparent methodology and openly released the HumanEval dataset
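
The pass@k metric mentioned above can be computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The sketch below is a minimal version using `math.comb`; the function name `pass_at_k` and the example counts are illustrative, and this direct form is chosen for readability rather than numerical stability at very large n.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples generated for the problem
    c: number of those samples that passed the unit tests
    k: sample budget being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets made up only of failing samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts (not results from the paper): 10 samples, 3 correct.
single_try = pass_at_k(n=10, c=3, k=1)
five_tries = pass_at_k(n=10, c=3, k=5)
```

A benchmark-level score is the mean of this value over all problems. Estimating from n > k samples gives a lower-variance result than drawing exactly k samples and checking whether any of them passes.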

Cons

Benchmark limited to Python and simple algorithmic tasks
Codex model weights and training data not publicly released, limiting reproducibility
Does not address real-world software engineering workflows

Pairs with

Other entries in the index that connect to this one (2 entries).

OSS Framework, medium

lm-evaluation-harness

Community

A framework for few-shot evaluation of language models.

★ 12,772 updated 1mo ago

OSS Framework, medium

OpenAI Evals

Community

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

★ 18,584 updated 2mo ago
