Open Source Frameworks medium

Berkeley Function-Calling Leaderboard

by Community

OSS

Added 1 June 2026

Overview

The Berkeley Function-Calling Leaderboard (also called the Berkeley Tool Calling Leaderboard) is a community-driven benchmark that evaluates and ranks large language models based on their ability to correctly invoke functions and use tools. It provides a standardized set of API-calling tasks to compare model performance across diverse real-world scenarios.
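
To make the idea of checking a function call for correctness concrete, here is a minimal sketch in Python. It is not the leaderboard's actual scoring method: the exact-match rule, the `call_matches` and `accuracy` helpers, and the sample cases are all hypothetical, and the model is assumed to emit a JSON object with `name` and `arguments` keys.

```python
import json
from typing import Any


def call_matches(expected: dict[str, Any], raw_model_output: str) -> bool:
    """Return True if the model's JSON call has the expected name and arguments.

    Hypothetical exact-match rule for illustration only.
    """
    try:
        actual = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return False  # output that is not valid JSON counts as a miss
    if not isinstance(actual, dict):
        return False
    # Dict comparison ignores key order, so argument order does not matter.
    return (
        actual.get("name") == expected["name"]
        and actual.get("arguments") == expected["arguments"]
    )


def accuracy(cases: list[tuple[dict[str, Any], str]]) -> float:
    """Fraction of (expected call, raw model output) pairs that match."""
    if not cases:
        return 0.0
    hits = sum(call_matches(expected, output) for expected, output in cases)
    return hits / len(cases)


# Hypothetical cases: one correct call, one missing argument, one plain-text reply.
cases = [
    (
        {"name": "get_weather", "arguments": {"city": "Berkeley", "unit": "celsius"}},
        '{"name": "get_weather", "arguments": {"unit": "celsius", "city": "Berkeley"}}',
    ),
    (
        {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}},
        '{"name": "get_weather", "arguments": {"city": "Paris"}}',
    ),
    (
        {"name": "send_email", "arguments": {"to": "a@example.com"}},
        "Sure, I will send that email.",
    ),
]

print(f"accuracy: {accuracy(cases):.2f}")
```

Exact matching is the simplest possible rule. A real benchmark may accept several valid argument values or execute the call instead, so treat this only as an illustration of the kind of check involved.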

Best for

Developers and researchers evaluating LLMs for tool-use and function-calling applications

Use cases

Comparing LLMs for function-calling accuracy in tool-use applications
Selecting a model that best handles structured API calls and multi-step tool usage
Benchmarking custom models against state-of-the-art results on function-calling tasks

Notes

Indexed from awesome-llm and enriched against its public facts.

Pros

Open and transparent benchmark with community-contributed data
Covers a wide range of function categories and realistic API patterns
Regularly updated with new models and tasks

Cons

Leaderboard performance may not fully translate to every production environment
Models can be fine-tuned to overfit specific benchmark tasks
Limited to function-calling evaluation; does not assess other model capabilities

Pairs with

Other entries in the index that connect to this one.

OSS Framework medium

lm-evaluation-harness

Community

A framework for few-shot evaluation of language models.

★ 12,772 updated 2mo ago

OSS Framework medium

OpenAI Evals

Community

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

★ 18,584 updated 3mo ago
