Berkeley Function-Calling Leaderboard

by Community

Added 1 June 2026

Overview

The Berkeley Function-Calling Leaderboard (also called the Berkeley Tool Calling Leaderboard) is a community-driven benchmark that evaluates and ranks large language models (LLMs) on how accurately they invoke functions and use tools. It provides a standardized set of API-calling tasks so that model performance can be compared across diverse real-world scenarios.
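To make "correctly invoke functions" concrete, here is a minimal sketch of one way a function-calling check could be scored. It is not the leaderboard's actual evaluation method: the JSON call format, the helper names `call_matches` and `accuracy`, and the sample cases are all hypothetical and chosen for illustration only.

```python
import json


def call_matches(expected: dict, model_output: str) -> bool:
    """Return True if the model's JSON output names the expected
    function and supplies exactly the expected arguments.

    Assumption: the model emits a JSON object shaped like
    {"name": ..., "arguments": {...}}. Real benchmarks may use
    other formats and more lenient matching.
    """
    try:
        actual = json.loads(model_output)
    except json.JSONDecodeError:
        return False  # malformed output counts as a miss
    if not isinstance(actual, dict):
        return False
    return (
        actual.get("name") == expected["name"]
        and actual.get("arguments") == expected["arguments"]
    )


def accuracy(cases: list[tuple[dict, str]]) -> float:
    """Fraction of (expected_call, model_output) pairs that match."""
    if not cases:
        return 0.0
    hits = sum(call_matches(expected, output) for expected, output in cases)
    return hits / len(cases)


# Hypothetical test cases: (expected call, raw model output).
CASES = [
    # Correct call; argument order does not matter for dict equality.
    ({"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}},
     '{"name": "get_weather", "arguments": {"unit": "celsius", "city": "Paris"}}'),
    # Missing a required argument.
    ({"name": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"}},
     '{"name": "get_weather", "arguments": {"city": "Oslo"}}'),
    # Output is not valid JSON.
    ({"name": "send_email", "arguments": {"to": "a@example.com"}},
     "send_email(to=a@example.com)"),
    # Correct call.
    ({"name": "add", "arguments": {"a": 1, "b": 2}},
     '{"name": "add", "arguments": {"a": 1, "b": 2}}'),
]

if __name__ == "__main__":
    print(f"accuracy: {accuracy(CASES):.2f}")
```

Exact matching is the strictest possible rule and is used here only to keep the sketch short. A production harness would likely also need type coercion, optional arguments, and multi-step call sequences.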

Best for
Developers and researchers evaluating LLMs for tool-use and function-calling applications

Use cases

  • Comparing LLMs for function-calling accuracy in tool-use applications
  • Selecting a model that best handles structured API calls and multi-step tool usage
  • Benchmarking custom models against state-of-the-art results on function-calling tasks

Pros

  • Open and transparent benchmark with community-contributed data
  • Covers a wide range of function categories and realistic API patterns
  • Regularly updated with new models and tasks

Cons

  • Leaderboard performance may not fully translate to every production environment
  • Models can be fine-tuned to overfit specific benchmark tasks
  • Limited to function-calling evaluation; does not assess other model capabilities

Indexed from awesome-llm and enriched against its public facts.
