What Function Calling Actually Is
Function calling is a mechanism that lets a language model return structured JSON describing which external tool to invoke and with what arguments, rather than returning plain text. The model does not execute the function. It produces a description of what should be called, your code does the actual work, then you feed the result back to the model for a final natural language response.
Strip away the marketing and you have three moving parts. A function schema you define in your prompt or API request. A model that has been trained to recognize when a function fits the user’s intent and to emit arguments matching the schema. A loop in your application that handles the execution and feeds results back.
Most major providers now ship this as a first-class feature. OpenAI calls it function calling or tool use. Anthropic calls it tool use. Google bundles it into Gemini’s structured output mode. Open-source runtimes like vLLM and Ollama support it through compatible schemas. The underlying idea is consistent across vendors, which is the main reason it has stuck around as the standard way to wire LLMs into real systems.
The thing most newcomers miss is that the model is not calling a function. It is producing a structured prediction about which function to call. The actual call happens in your code. This distinction matters because it changes where failures originate and how you debug them.
A typical interaction looks like this. The user asks “what is the weather in Sydney”. The model sees the available function schemas, decides weather_lookup is a fit, and returns JSON describing the call. Your code runs the weather function, gets a result, and sends it back. The model then writes a final answer grounded in that real data. That is the entire mechanism. Everything else is configuration around it.
Setup and Authentication
For OpenAI, you need an API key and a recent SDK. The current version of the Python SDK is in the 1.x line. Install it and set your key as an environment variable.
pip install openai export OPENAI_API_KEY=sk-…
For Anthropic, the setup is the same pattern.
pip install anthropic export ANTHROPIC_API_KEY=sk-ant-…
For local models, install Ollama, pull a model that supports tool use, and start the server. The current Ollama version exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, so the same client code works with a base URL change.
Authentication is the easy part. The harder part is making sure your dev environment, staging, and production all have the key available without leaking it. Standard practice is environment variables for local work and a secret manager for production. Never commit a key to git and never paste one into a frontend bundle.
If you are calling the API directly with curl, include the key in the Authorization header as Bearer plus the key. If you are using the SDK, the client picks it up from the environment automatically.
First Working Example
Here is a runnable example using the OpenAI Python SDK. It defines a single function, sends a user message, executes the function the model requests, and returns the final answer.
import json from openai import OpenAI
client = OpenAI()
tools = [ { “type”: “function”, “function”: { “name”: “get_weather”, “description”: “Get current weather for a city”, “parameters”: { “type”: “object”, “properties”: { “city”: {“type”: “string”, “description”: “City name”} }, “required”: [“city”] } } } ]
def get_weather(city): return {“city”: city, “temp_c”: 22, “condition”: “clear”}
messages = [{“role”: “user”, “content”: “What is the weather in Tokyo”}]
response = client.chat.completions.create( model=“gpt-4o-mini”, messages=messages, tools=tools )
msg = response.choices[0].message
if msg.tool_calls: call = msg.tool_calls[0] args = json.loads(call.function.arguments) result = get_weather(args[“city”])
messages.append(msg)
messages.append({
"role": "tool",
"tool_call_id": call.id,
"content": json.dumps(result)
})
final = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages
)
print(final.choices[0].message.content)
The two-call pattern is the part most tutorials skip and the part that actually matters. The first call returns a tool_calls structure on the message. Your code parses the arguments, runs the function, and appends the result as a tool role message along with the original assistant message. The second call gives the model everything it needs to write a grounded final answer. Skip the second call and the model never sees the result. Append the result without the original assistant message and the API rejects the conversation for being malformed.
Key Settings That Matter
The dials that actually affect production behavior are easy to miss.
Tool choice. You can set tool_choice to “auto” (let the model decide), “none” (force no tool use), or a specific function reference to force that one tool. Forcing is useful when you have one tool that must always be used, but be careful, because forcing the wrong tool on ambiguous inputs creates brittle behavior.
Parallel tool calls. Most providers allow the model to request multiple function calls in a single turn. Set parallel_tool_calls to true on OpenAI to enable. This is genuinely useful when the user asks something that benefits from multiple lookups, like comparing the weather in Tokyo and Sydney. Your dispatcher needs to handle the array, not just a single call.
Temperature. Lower temperatures in the 0 to 0.2 range make function argument selection more deterministic, which is what you want for production. Higher temperatures encourage the model to try unusual argument combinations, which is fine for chat but bad for reliable function dispatch.
Model choice. Smaller models like gpt-4o-mini handle simple, well-defined tools fine. Complex workflows with 10 or more tools and nested arguments benefit from larger models. The current frontier models are noticeably better at picking the right tool from a large set without hallucinating arguments.
Schema strictness. OpenAI supports strict on function definitions, which forces the model to return arguments that exactly match the schema. This dramatically reduces malformed JSON, which is one of the biggest sources of bugs. If your provider supports it, turn it on. Always set additionalProperties to false and define every property.
Token limits. Every function schema you pass counts against your context window. With 5 to 10 tools, you might use 500 to 1000 tokens on schemas alone. If you have dozens of tools, this becomes a real cost. The standard fix is dynamic tool selection, where you only include the relevant tools for the current request based on the user’s intent.
Where It Shines
Function calling is genuinely excellent for a handful of patterns.
Structured data extraction. Point the model at a schema describing your target structure and let it pull fields out of messy text. Invoice processing, resume parsing, form filling all fit this pattern. The schema doubles as validation. If the model cannot produce matching JSON, you know the input was ambiguous.
Database queries. Let users ask questions in natural language, have the model emit SQL, run it against a read-only database, return the results. With proper safety controls like read-only credentials, query timeouts, and row limits, this is one of the highest-ROI applications of LLMs in business operations.
API orchestration. The model becomes a router that picks between internal services. “Add this to my calendar” routes to Google Calendar. “Send a message to the team” routes to Slack. This works because the tool schemas are themselves documentation the model reads at inference time.
Multi-step workflows. With careful prompt design, the model can chain function calls across multiple turns to complete a task. Book a flight, then add it to the calendar, then send a confirmation email. This is fragile at the edges but works well for narrow, well-scoped domains.
Code execution and calculation. Tools that evaluate Python, run shell commands, or hit a calculator are reliable because the math is done by a real interpreter, not the model guessing at arithmetic.
Where It Fails
The honest list of limitations you will hit in production.
Argument hallucination. The model sometimes invents parameter values that look plausible but are wrong. “What did John say in the last meeting” might return a fake timestamp or a made-up quote. The fix is strict schemas, post-call validation against allowed values, and a human-in-the-loop for any high-stakes call.
Tool selection errors. With too many tools or overlapping tools, the model picks the wrong one. Stick to 5 to 10 active tools per request. If you need more, route to a tool-selection step first that picks the right subset before the main call.
Latency. A single function call round trip typically adds 500ms to 2s. Multi-step workflows can stretch into 5 to 10s. For real-time applications, batch operations or pre-compute anything you can.
Cost. Every tool round trip is another full API call. A three-step workflow costs roughly 3x a single chat. For high-volume use, prompt caching and shorter system prompts matter, as does reusing context where possible.
Context drift. The model sometimes forgets constraints set in the system prompt by step 3 or 4. For long chains, re-inject constraints in each turn or use a smaller number of larger tools that bundle related operations.
Security. Any tool that takes user input as an argument is a prompt injection surface. A user could craft a query that causes the model to call send_email with attacker-chosen content. Mitigations include argument allowlists, confirmation steps for destructive actions, and treating the model’s tool calls as untrusted input that your code must re-validate before executing.
Practical Workflow Pattern
How to slot function calling into a real work setup.
Step 1: Define the smallest viable tool set. Start--- title: “Function Calling Tutorial: Real Setup and Working Examples” description: “A working LLM function calling tutorial covering setup, schemas, the execution loop, settings that matter, and where the pattern actually breaks in production.” publishDate: “2026-06-25” author: “Sam McKay” difficulty: “intermediate” service: “general” tags:
- ai-tools
- tutorial draft: false
What Function Calling Actually Is
Function calling is a mechanism that lets a language model return structured JSON describing which external tool to invoke and with what arguments, rather than returning plain text. The model does not execute the function. It produces a description of what should be called, your code does the actual work, then you feed the result back to the model for a final natural language response.
Strip away the marketing and you have three moving parts. A function schema you define in your prompt or API request. A model that has been trained to recognize when a function fits the user’s intent and to emit arguments matching the schema. A loop in your application that handles the execution and feeds results back.
Most major providers now ship this as a first-class feature. OpenAI calls it function calling or tool use. Anthropic calls it tool use. Google bundles it into Gemini’s structured output mode. Open-source runtimes like vLLM and Ollama support it through compatible schemas. The underlying idea is consistent across vendors, which is the main reason it has stuck around as the standard way to wire LLMs into real systems.
The thing most newcomers miss is that the model is not calling a function. It is producing a structured prediction about which function to call. The actual call happens in your code. This distinction matters because it changes where failures originate and how you debug them.
A typical interaction looks like this. The user asks “what is the weather in Sydney”. The model sees the available function schemas, decides weather_lookup is a fit, and returns JSON describing the call. Your code runs the weather function, gets a result, and sends it back. The model then writes a final answer grounded in that real data. That is the entire mechanism. Everything else is configuration around it.
Setup and Authentication
For OpenAI, you need an API key and a recent SDK. The current version of the Python SDK is in the 1.x line. Install it and set your key as an environment variable.
pip install openai export OPENAI_API_KEY=sk-…
For Anthropic, the setup is the same pattern.
pip install anthropic export ANTHROPIC_API_KEY=sk-ant-…
For local models, install Ollama, pull a model that supports tool use, and start the server. The current Ollama version exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, so the same client code works with a base URL change.
Authentication is the easy part. The harder part is making sure your dev environment, staging, and production all have the key available without leaking it. Standard practice is environment variables for local work and a secret manager for production. Never commit a key to git and never paste one into a frontend bundle.
If you are calling the API directly with curl, include the key in the Authorization header as Bearer plus the key. If you are using the SDK, the client picks it up from the environment automatically.
First Working Example
Here is a runnable example using the OpenAI Python SDK. It defines a single function, sends a user message, executes the function the model requests, and returns the final answer.
import json from openai import OpenAI
client = OpenAI()
tools = [ { “type”: “function”, “function”: { “name”: “get_weather”, “description”: “Get current weather for a city”, “parameters”: { “type”: “object”, “properties”: { “city”: {“type”: “string”, “description”: “City name”} }, “required”: [“city”] } } } ]
def get_weather(city): return {“city”: city, “temp_c”: 22, “condition”: “clear”}
messages = [{“role”: “user”, “content”: “What is the weather in Tokyo”}]
response = client.chat.completions.create( model=“gpt-4o-mini”, messages=messages, tools=tools )
msg = response.choices[0].message
if msg.tool_calls: call = msg.tool_calls[0] args = json.loads(call.function.arguments) result = get_weather(args[“city”])
messages.append(msg)
messages.append({
"role": "tool",
"tool_call_id": call.id,
"content": json.dumps(result)
})
final = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages
)
print(final.choices[0].message.content)
The two-call pattern is the part most tutorials skip and the part that actually matters. The first call returns a tool_calls structure on the message. Your code parses the arguments, runs the function, and appends the result as a tool role message along with the original assistant message. The second call gives the model everything it needs to write a grounded final answer. Skip the second call and the model never sees the result. Append the result without the original assistant message and the API rejects the conversation for being malformed.
Key Settings That Matter
The dials that actually affect production behavior are easy to miss.
Tool choice. You can set tool_choice to “auto” (let the model decide), “none” (force no tool use), or a specific function reference to force that one tool. Forcing is useful when you have one tool that must always be used, but be careful, because forcing the wrong tool on ambiguous inputs creates brittle behavior.
Parallel tool calls. Most providers allow the model to request multiple function calls in a single turn. Set parallel_tool_calls to true on OpenAI to enable. This is genuinely useful when the user asks something that benefits from multiple lookups, like comparing the weather in Tokyo and Sydney. Your dispatcher needs to handle the array, not just a single call.
Temperature. Lower temperatures in the 0 to 0.2 range make function argument selection more deterministic, which is what you want for production. Higher temperatures encourage the model to try unusual argument combinations, which is fine for chat but bad for reliable function dispatch.
Model choice. Smaller models like gpt-4o-mini handle simple, well-defined tools fine. Complex workflows with 10 or more tools and nested arguments benefit from larger models. The current frontier models are noticeably better at picking the right tool from a large set without hallucinating arguments.
Schema strictness. OpenAI supports strict on function definitions, which forces the model to return arguments that exactly match the schema. This dramatically reduces malformed JSON, which is one of the biggest sources of bugs. If your provider supports it, turn it on. Always set additionalProperties to false and define every property.
Token limits. Every function schema you pass counts against your context window. With 5 to 10 tools, you might use 500 to 1000 tokens on schemas alone. If you have dozens of tools, this becomes a real cost. The standard fix is dynamic tool selection, where you only include the relevant tools for the current request based on the user’s intent.
Where It Shines
Function calling is genuinely excellent for a handful of patterns.
Structured data extraction. Point the model at a schema describing your target structure and let it pull fields out of messy text. Invoice processing, resume parsing, form filling all fit this pattern. The schema doubles as validation. If the model cannot produce matching JSON, you know the input was ambiguous.
Database queries. Let users ask questions in natural language, have the model emit SQL, run it against a read-only database, return the results. With proper safety controls like read-only credentials, query timeouts, and row limits, this is one of the highest-ROI applications of LLMs in business operations.
API orchestration. The model becomes a router that picks between internal services. “Add this to my calendar” routes to Google Calendar. “Send a message to the team” routes to Slack. This works because the tool schemas are themselves documentation the model reads at inference time.
Multi-step workflows. With careful prompt design, the model can chain function calls across multiple turns to complete a task. Book a flight, then add it to the calendar, then send a confirmation email. This is fragile at the edges but works well for narrow, well-scoped domains.
Code execution and calculation. Tools that evaluate Python, run shell commands, or hit a calculator are reliable because the math is done by a real interpreter, not the model guessing at arithmetic.
Where It Fails
The honest list of limitations you will hit in production.
Argument hallucination. The model sometimes invents parameter values that look plausible but are wrong. “What did John say in the last meeting” might return a fake timestamp or a made-up quote. The fix is strict schemas, post-call validation against allowed values, and a human-in-the-loop for any high-stakes call.
Tool selection errors. With too many tools or overlapping tools, the model picks the wrong one. Stick to 5 to 10 active tools per request. If you need more, route to a tool-selection step first that picks the right subset before the main call.
Latency. A single function call round trip typically adds 500ms to 2s. Multi-step workflows can stretch into 5 to 10s. For real-time applications, batch operations or pre-compute anything you can.
Cost. Every tool round trip is another full API call. A three-step workflow costs roughly 3x a single chat. For high-volume use, prompt caching and shorter system prompts matter, as does reusing context where possible.
Context drift. The model sometimes forgets constraints set in the system prompt by step 3 or 4. For long chains, re-inject constraints in each turn or use a smaller number of larger tools that bundle related operations.
Security. Any tool that takes user input as an argument is a prompt injection surface. A user could craft a query that causes the model to call send_email with attacker-chosen content. Mitigations include argument allowlists, confirmation steps for destructive actions, and treating the model’s tool calls as untrusted input that your code must re-validate before executing.
Practical Workflow Pattern
How to slot function calling into a real work setup.
Step 1: Define the smallest viable tool set. Start