Berkeley Function-Calling Leaderboard
by Community
OSS
Added 1 June 2026
Overview
The Berkeley Function-Calling Leaderboard (also called the Berkeley Tool Calling Leaderboard) is a community-driven benchmark that evaluates and ranks large language models based on their ability to correctly invoke functions and use tools. It provides a standardized set of API-calling tasks to compare model performance across diverse real-world scenarios.
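To make "correctly invoke functions" concrete, here is a minimal sketch of how a function-calling task could be scored. It is an illustration only, not the leaderboard's actual scorer: the exact-match rule, the helper names `call_matches` and `accuracy`, and the sample tasks (`get_weather`, `convert_currency`) are all assumptions invented for this example.

```python
from typing import Any


def call_matches(predicted: dict[str, Any], expected: dict[str, Any]) -> bool:
    """Return True when the function name and all arguments match exactly.

    Assumption: exact matching; a real benchmark may accept several valid
    argument values or compare calls in a more tolerant way.
    """
    return (
        predicted.get("name") == expected.get("name")
        and predicted.get("arguments") == expected.get("arguments")
    )


def accuracy(predictions: list[dict[str, Any]], references: list[dict[str, Any]]) -> float:
    """Fraction of tasks where the predicted call matches the reference call."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(call_matches(p, r) for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical reference calls for two tasks (invented for illustration).
references = [
    {"name": "get_weather", "arguments": {"city": "Berkeley", "unit": "celsius"}},
    {"name": "convert_currency", "arguments": {"amount": 10, "from": "USD", "to": "EUR"}},
]

# Hypothetical model output: the first call is correct (argument order does
# not matter for dict equality); the second has one wrong argument value.
predictions = [
    {"name": "get_weather", "arguments": {"unit": "celsius", "city": "Berkeley"}},
    {"name": "convert_currency", "arguments": {"amount": 10, "from": "USD", "to": "GBP"}},
]

score = accuracy(predictions, references)
print(f"accuracy = {score:.2f}")  # prints "accuracy = 0.50"
```

Running the same scoring function over each candidate model's outputs on a shared task set gives numbers that can be compared side by side, which is the basic idea behind a leaderboard ranking.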
Best for
Developers and researchers evaluating LLMs for tool-use and function-calling applications
Use cases
- Comparing LLMs for function-calling accuracy in tool-use applications
- Selecting a model that best handles structured API calls and multi-step tool usage
- Benchmarking custom models against state-of-the-art results on function-calling tasks
Notes
Pros
- Open and transparent benchmark with community-contributed data
- Covers a wide range of function categories and realistic API patterns
- Regularly updated with new models and tasks
Cons
- Leaderboard performance may not fully translate to every production environment
- Models can be fine-tuned to overfit specific benchmark tasks
- Limited to function-calling evaluation; does not assess other model capabilities
Indexed from awesome-llm and enriched against its public facts.