Tool Calling Explained: How AI Models Interact With the Real World Through Function Calling

In March 2023, a developer built a ChatGPT-powered assistant that could check the weather, look up flight prices, and book restaurant reservations — all within a single conversation. The trick? The AI never actually called a single API itself. Instead, it told the developer’s code exactly which function to call and with which arguments, received the results, and wove them into a seamless natural language response. The user had no idea they were talking to a text generator that couldn’t actually do anything on its own. That trick has a name: tool calling. And it’s the single most important capability that transformed large language models from impressive text generators into agents that can interact with the real world.

Here’s the uncomfortable truth about LLMs: they are fundamentally trapped. An LLM doesn’t know today’s date. It can’t check a stock price. It can’t query your database, send an email, or read a file on your computer. It only knows what was in its training data (which is months or years old) and whatever you include in the current conversation. Without tool calling, asking an LLM “What’s NVIDIA’s stock price right now?” gets you a polite apology and a reminder of its knowledge cutoff date.

Tool calling changed everything. It’s the mechanism that lets an AI model say, “I don’t know the answer to this, but I know which function to call to get the answer — and here are the exact arguments.” Your code then executes that function, feeds the result back to the model, and the model responds to the user as if it knew all along. This is how ChatGPT plugins work. This is how Claude Code reads and writes files. This is how every AI agent operates under the hood.

In this guide, I’m going to break down tool calling from the ground up. You’ll learn exactly how it works, see complete code examples for Claude and OpenAI, understand the differences between providers, and walk away with everything you need to build your own tool-calling applications. Whether you’re a developer building AI-powered products or an investor evaluating AI companies, understanding tool calling is essential — it’s the bridge between “AI that talks” and “AI that acts.”

What Is Tool Calling?

Tool calling (also called function calling) is a mechanism where a large language model can request the execution of external functions or APIs during a conversation. Instead of trying to answer everything from memory, the model can reach out to the real world — checking databases, calling APIs, performing calculations, or executing code — by asking your application to run specific functions on its behalf.

The key insight is deceptively simple: the model doesn’t execute the tools itself. It generates a structured request — a function name plus arguments in JSON format — and your code is responsible for actually executing it. The result gets sent back to the model, which then incorporates it into its response.

Think of it like a brain and hands. The LLM is the brain: it plans, reasons, and decides what needs to happen. The tools are the hands: they actually do things in the physical world. The brain can’t pick up a cup of coffee on its own, but it can tell the hands exactly how to do it. Similarly, an LLM can’t check the weather, but it can tell your code to call a weather API with specific coordinates and interpret the result.

The Four-Step Loop

Every tool calling interaction follows the same fundamental pattern:

The Tool Calling Loop:

  1. User asks something → “What’s the weather in Tokyo right now?”
  2. Model decides to call a tool → Outputs structured JSON: {"name": "get_weather", "arguments": {"city": "Tokyo"}}
  3. Your code executes the tool → Calls the weather API, gets the result → Sends it back to the model
  4. Model responds naturally → “It’s currently 22°C and sunny in Tokyo with a light breeze from the east.”

Here’s the full flow, traced end to end:

┌─────────┐    "What's the weather     ┌─────────┐
│         │    in Tokyo?"              │         │
│  User   │ ──────────────────────────→│  Your   │
│         │                            │  App    │
└─────────┘                            └────┬────┘
                                            │
                           Sends message +  │
                           tool definitions │
                                            ▼
                                       ┌─────────┐
                                       │         │
                                       │  LLM    │
                                       │  (API)  │
                                       └────┬────┘
                                            │
                           Returns:         │
                           tool_use:        │
                           get_weather      │
                           {"city":"Tokyo"} │
                                            ▼
                                       ┌─────────┐
                                       │  Your   │
                                       │  App    │──→ Calls weather API
                                       │(execute)│←── Gets result: 22°C
                                       └────┬────┘
                                            │
                           Sends tool_result│
                           back to LLM      │
                                            ▼
                                       ┌─────────┐
                                       │  LLM    │
                                       │  (API)  │
                                       └────┬────┘
                                            │
                           Final response:  │
                           "It's 22°C and   │
                            sunny in Tokyo" │
                                            ▼
                                       ┌─────────┐
                                       │  User   │
                                       │  sees   │
                                       │ response│
                                       └─────────┘
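That round trip can be condensed into a toy loop. The "model" below is a scripted stand-in (no real API call) that first requests a tool and then answers, but the control flow is exactly what your application implements:

```python
import json

def fake_model(messages):
    # Scripted stand-in for a real LLM API (no network): first turn asks
    # for a tool, second turn answers using the tool result.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_use", "name": "get_weather",
                "arguments": {"city": "Tokyo"}}
    return {"type": "text", "text": "It's 22°C and sunny in Tokyo."}

def get_weather(city):
    return {"city": city, "temperature": 22, "condition": "sunny"}

tools = {"get_weather": get_weather}

def run(user_message):
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = fake_model(messages)
        if reply["type"] == "text":                          # model answers
            return reply["text"]
        result = tools[reply["name"]](**reply["arguments"])  # you execute
        messages.append({"role": "tool", "content": json.dumps(result)})

print(run("What's the weather in Tokyo?"))
```

Swap `fake_model` for a real API client and this skeleton becomes a working tool-calling app.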

Why This Is Revolutionary

Before tool calling: LLMs could only generate text. They were extraordinarily good at it, but they were fundamentally disconnected from the world. Ask for today’s weather and you’d get a hallucinated guess or an apology. Ask them to send an email and they’d write you a draft you’d have to copy-paste yourself.

After tool calling: LLMs can take actions. They can check real-time data, interact with databases, control software, browse the web, manage files, send messages, and orchestrate complex multi-step workflows. The same text-generation capability that was previously limited to chat responses now powers decision-making about which actions to take and how to interpret the results.

This single capability — the ability for a model to say “call this function with these arguments” — is what turned LLMs from chatbots into agents. Every AI agent framework, every chatbot plugin system, and every autonomous AI workflow is built on tool calling.

How Tool Calling Works Under the Hood

Let’s walk through each step of the tool calling process in detail, with the actual data structures you’ll encounter when building with the APIs.

Step 1: Tool Definition

Before the model can use any tools, you have to tell it what tools are available. You do this by including a tool definition in your API request. Each tool definition is a JSON Schema that describes the function’s name, what it does, and what parameters it accepts.

{
  "name": "get_current_weather",
  "description": "Get the current weather conditions for a specific city. Returns temperature in Celsius, weather condition, humidity, and wind speed. Use this when the user asks about current weather, temperature, or atmospheric conditions for any location.",
  "input_schema": {
    "type": "object",
    "properties": {
      "city": {
        "type": "string",
        "description": "The city name, e.g. 'Tokyo', 'New York', 'London'"
      },
      "units": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"],
        "description": "Temperature units. Defaults to celsius.",
        "default": "celsius"
      }
    },
    "required": ["city"]
  }
}

The description is critically important — it’s what the model reads to decide when to use this tool. A vague description like “weather stuff” will lead to the model using the tool at the wrong times or not using it when it should. A detailed description like the one above helps the model make precise decisions.

Step 2: Tool Selection

When the model receives a user message along with tool definitions, it makes a decision: should it respond directly, or should it call one or more tools first? This decision is made by the model itself — it’s part of the model’s inference process, not a separate system.

The model considers:

  • Does the user’s request require information I don’t have?
  • Is there a tool that can provide this information?
  • What arguments should I pass to the tool?
  • Do I need to call multiple tools?
  • Should I call tools in parallel or sequentially?

If the user asks “What’s 2 + 2?”, the model will answer directly — no tool needed. If the user asks “What’s the weather in Tokyo?”, and a get_current_weather tool is available, the model will decide to call it.

Step 3: Structured Output

When the model decides to call a tool, it doesn’t output free-form text. Instead, it outputs a structured tool_use block with the function name and arguments as valid JSON:

{
  "role": "assistant",
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_01A09q90qw90lq917835lq9",
      "name": "get_current_weather",
      "input": {
        "city": "Tokyo",
        "units": "celsius"
      }
    }
  ]
}

This is not a suggestion or a natural language request — it’s a precisely structured instruction. The function name matches exactly what you defined, and the arguments conform to the JSON Schema you provided. This is what makes tool calling reliable: the model doesn’t say “maybe try checking the weather”; it says “call get_current_weather with {"city": "Tokyo", "units": "celsius"}”.

Step 4: Execution

Your application code receives this tool_use block, parses it, and executes the actual function. This is where the real work happens — you make the API call, run the database query, perform the calculation, or whatever the tool does:

# Your code — NOT the model's code
import requests

API_KEY = "your_openweathermap_api_key"

def get_current_weather(city: str, units: str = "celsius") -> dict:
    response = requests.get(
        "https://api.openweathermap.org/data/2.5/weather",
        params={"q": city,
                "units": "metric" if units == "celsius" else "imperial",
                "appid": API_KEY}
    )
    data = response.json()
    return {
        "city": city,
        "temperature": data["main"]["temp"],
        "condition": data["weather"][0]["description"],
        "humidity": data["main"]["humidity"],
        "wind_speed": data["wind"]["speed"]
    }

Step 5: Result Injection

You send the tool result back to the model as a tool_result message:

{
  "role": "user",
  "content": [
    {
      "type": "tool_result",
      "tool_use_id": "toolu_01A09q90qw90lq917835lq9",
      "content": "{\"city\": \"Tokyo\", \"temperature\": 22, \"condition\": \"clear sky\", \"humidity\": 45, \"wind_speed\": 3.6}"
    }
  ]
}

Step 6: Final Response

The model reads the tool result and generates a natural language response for the user. It doesn’t just parrot the raw data — it interprets it, adds context, and presents it conversationally:

“Right now in Tokyo, it’s a beautiful 22°C with clear skies. Humidity is at a comfortable 45%, and there’s a gentle breeze at 3.6 m/s. Perfect weather for a walk!”

Multi-Tool and Iterative Tool Use

Modern models can call multiple tools in a single turn. If a user asks “What’s the weather in Tokyo and New York?”, the model can output two tool_use blocks simultaneously — a parallel tool call. Your code executes both and sends both results back.

Models can also use tools iteratively. In a complex task, the model might call tool A, examine the result, decide it needs more information, call tool B, examine that result, and then finally respond. This iterative capability is the foundation of AI agents — the model keeps calling tools in a loop until it has enough information to complete the task.

Tool Calling Across Major AI Providers

The core concept is the same across providers, but the API formats differ. Let’s look at complete, runnable examples for each major provider.

Anthropic Claude (Messages API)

Claude’s tool calling uses a clean, content-block-based format. Tools are defined with input_schema (standard JSON Schema), and the model responds with tool_use content blocks.

Here’s a complete, runnable Python example:

import anthropic
import json

client = anthropic.Anthropic()  # Uses ANTHROPIC_API_KEY env var

# Define tools
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city. Returns temperature (Celsius), condition, humidity, and wind speed.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name, e.g. 'Tokyo', 'London'"
                }
            },
            "required": ["city"]
        }
    },
    {
        "name": "get_stock_price",
        "description": "Get the current stock price for a given ticker symbol. Returns price in USD, daily change, and percentage change.",
        "input_schema": {
            "type": "object",
            "properties": {
                "ticker": {
                    "type": "string",
                    "description": "Stock ticker symbol, e.g. 'AAPL', 'NVDA', 'GOOGL'"
                }
            },
            "required": ["ticker"]
        }
    }
]

# Simulated tool implementations
def get_weather(city: str) -> dict:
    # In production, call a real weather API
    return {"city": city, "temperature": 22, "condition": "sunny", "humidity": 45}

def get_stock_price(ticker: str) -> dict:
    # In production, call a real stock API
    return {"ticker": ticker, "price": 875.30, "change": +12.50, "percent_change": "+1.45%"}

# Map function names to implementations
tool_functions = {
    "get_weather": get_weather,
    "get_stock_price": get_stock_price,
}

# Send initial message with tools
messages = [{"role": "user", "content": "What's the weather in Tokyo and NVIDIA's stock price?"}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools,
    messages=messages
)

print(f"Stop reason: {response.stop_reason}")

# Process tool calls
while response.stop_reason == "tool_use":
    # Collect all tool use blocks
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            # Execute the tool
            func = tool_functions[block.name]
            result = func(**block.input)
            print(f"Called {block.name}({block.input}) → {result}")

            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result)
            })

    # Send results back to Claude
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )

# Print final response
for block in response.content:
    if hasattr(block, "text"):
        print(f"\nClaude's response:\n{block.text}")

Tip: Claude supports a tool_choice parameter to control tool usage: "auto" (model decides), "any" (must use at least one tool), or {"type": "tool", "name": "get_weather"} (must use a specific tool). Use "auto" for most cases.

Claude-specific features:

  • Parallel tool calls: Claude can output multiple tool_use blocks in a single response, allowing you to execute them in parallel
  • Streaming with tools: Tool calls work with streaming — you receive content_block_start events for tool_use blocks as they’re generated
  • Tool choice control: Fine-grained control over when the model uses tools via tool_choice
  • Large tool sets: Claude handles large numbers of tools well, though keeping it under 20 is recommended for optimal performance
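The first bullet is worth a sketch: since all tool_use blocks arrive in one response, nothing stops you from fanning them out to a thread pool instead of running them one by one. The blocks below are plain dicts standing in for the SDK's objects:

```python
from concurrent.futures import ThreadPoolExecutor
import json
import time

def slow_weather(city):
    # Stand-in for a network-bound tool call
    time.sleep(0.1)
    return {"city": city, "temperature": 22}

tool_map = {"get_weather": slow_weather}

# Hypothetical tool_use blocks, shown as plain dicts instead of SDK objects
blocks = [
    {"id": "toolu_01", "name": "get_weather", "input": {"city": "Tokyo"}},
    {"id": "toolu_02", "name": "get_weather", "input": {"city": "London"}},
]

with ThreadPoolExecutor() as pool:
    # Submit every tool call at once, then gather results in block order
    futures = [(b["id"], pool.submit(tool_map[b["name"]], **b["input"]))
               for b in blocks]
    tool_results = [{"type": "tool_result",
                     "tool_use_id": tool_id,
                     "content": json.dumps(fut.result())}
                    for tool_id, fut in futures]

print(len(tool_results))  # → 2
```

For I/O-bound tools (HTTP calls, database queries), this cuts latency to roughly that of the slowest single tool.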

OpenAI GPT (Chat Completions API)

OpenAI’s format uses a tools array with type: "function" wrappers. The response includes a tool_calls array, and results are sent back as messages with role: "tool".

from openai import OpenAI
import json

client = OpenAI()  # Uses OPENAI_API_KEY env var

# Define tools — note the different format from Claude
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'Tokyo'"
                    }
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current stock price for a ticker symbol.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticker": {
                        "type": "string",
                        "description": "Stock ticker, e.g. 'NVDA'"
                    }
                },
                "required": ["ticker"]
            }
        }
    }
]

# Same tool implementations as above
def get_weather(city):
    return {"city": city, "temperature": 22, "condition": "sunny"}

def get_stock_price(ticker):
    return {"ticker": ticker, "price": 875.30, "change": "+1.45%"}

tool_functions = {"get_weather": get_weather, "get_stock_price": get_stock_price}

messages = [{"role": "user", "content": "What's the weather in Tokyo and NVIDIA's stock price?"}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

message = response.choices[0].message

# Process tool calls
while message.tool_calls:
    messages.append(message)  # Add assistant message with tool calls

    for tool_call in message.tool_calls:
        func = tool_functions[tool_call.function.name]
        args = json.loads(tool_call.function.arguments)
        result = func(**args)

        # Note: OpenAI uses role="tool" instead of tool_result content blocks
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools
    )
    message = response.choices[0].message

print(message.content)

Google Gemini

Gemini’s function calling follows a similar pattern but with its own API format. Tool definitions use FunctionDeclaration objects, and responses include function_call parts. Gemini supports both automatic and manual function calling modes, and can handle parallel function calls similar to Claude and GPT.

The key difference with Gemini is its tight integration with Google’s ecosystem — function calling works seamlessly with Google Search, Google Maps, and other Google APIs as built-in tools.
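For a feel of the format, here are the two Gemini message parts as they appear in the REST API, shown as plain dicts. The camelCase field names match Google's REST documentation as of this writing, but treat the exact nesting as an assumption to verify against current docs:

```python
# Illustrative shapes only — assumed to mirror Gemini's REST API parts;
# verify field names against current Google documentation.

# What the model emits when it wants a function called:
function_call_part = {
    "functionCall": {
        "name": "get_weather",
        "args": {"city": "Tokyo"}
    }
}

# What you send back after executing the function:
function_response_part = {
    "functionResponse": {
        "name": "get_weather",
        "response": {"temperature": 22, "condition": "sunny"}
    }
}

print(function_call_part["functionCall"]["name"])  # → get_weather
```

Note that Gemini keys the response to the function name rather than a call ID, unlike Claude's tool_use_id and OpenAI's tool_call_id.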

Provider Comparison

Feature               | Claude (Anthropic)        | GPT (OpenAI)                      | Gemini (Google)
----------------------|---------------------------|-----------------------------------|------------------------
Tool definition key   | input_schema              | parameters                        | parameters
Tool call format      | tool_use content block    | tool_calls array                  | function_call part
Result format         | tool_result content block | role: "tool" message              | function_response part
Parallel tool calls   | Yes                       | Yes                               | Yes
Streaming with tools  | Yes                       | Yes                               | Yes
Tool choice control   | auto / any / specific     | auto / none / required / specific | auto / none / specific
JSON reliability      | Excellent                 | Excellent                         | Good
Stop reason indicator | stop_reason: "tool_use"   | finish_reason: "tool_calls"       | part type check

Key Takeaway: Despite format differences, all three providers follow the same conceptual pattern: define tools → model requests tool execution → your code runs the tool → send result back → model responds. If you understand one, you can work with any of them.
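One practical consequence: a thin adapter is often all it takes to reuse tool definitions across providers. This sketch converts an Anthropic-style definition into OpenAI's wrapper format (same JSON Schema, different envelope):

```python
def claude_to_openai_tool(tool: dict) -> dict:
    """Wrap an Anthropic-style tool definition in OpenAI's function envelope."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            # Anthropic calls the schema "input_schema"; OpenAI calls it "parameters"
            "parameters": tool["input_schema"],
        },
    }

claude_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

openai_tool = claude_to_openai_tool(claude_tool)
print(openai_tool["function"]["name"])  # → get_weather
```

The schema itself transfers unchanged because both providers accept standard JSON Schema; only the envelope differs.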

Practical Tool Calling Examples (with Complete Code)

Theory is great, but let’s build real things. Here are four complete examples that demonstrate increasingly complex tool calling patterns.

Example 1: Chained Tools — Weather by City Name

This example shows tool chaining: the model calls one tool to get coordinates, then uses those coordinates to call a second tool for weather data. The model autonomously decides it needs both calls.

import anthropic
import json
import requests

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_coordinates",
        "description": "Convert a city name to latitude/longitude coordinates using geocoding.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
                "country_code": {"type": "string", "description": "ISO country code, e.g. 'FR'"}
            },
            "required": ["city"]
        }
    },
    {
        "name": "get_weather_by_coords",
        "description": "Get weather data for specific latitude/longitude coordinates.",
        "input_schema": {
            "type": "object",
            "properties": {
                "latitude": {"type": "number", "description": "Latitude coordinate"},
                "longitude": {"type": "number", "description": "Longitude coordinate"}
            },
            "required": ["latitude", "longitude"]
        }
    }
]

API_KEY = "your_openweathermap_api_key"

def get_coordinates(city: str, country_code: str | None = None) -> dict:
    params = {"q": city if not country_code else f"{city},{country_code}",
              "limit": 1, "appid": API_KEY}
    resp = requests.get("http://api.openweathermap.org/geo/1.0/direct", params=params)
    data = resp.json()[0]
    return {"city": data["name"], "lat": data["lat"], "lon": data["lon"],
            "country": data["country"]}

def get_weather_by_coords(latitude: float, longitude: float) -> dict:
    params = {"lat": latitude, "lon": longitude, "units": "metric", "appid": API_KEY}
    resp = requests.get("https://api.openweathermap.org/data/2.5/weather", params=params)
    data = resp.json()
    return {
        "temperature": data["main"]["temp"],
        "feels_like": data["main"]["feels_like"],
        "condition": data["weather"][0]["description"],
        "humidity": data["main"]["humidity"],
        "wind_speed": data["wind"]["speed"]
    }

tool_map = {"get_coordinates": get_coordinates, "get_weather_by_coords": get_weather_by_coords}

def chat_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514", max_tokens=1024,
            tools=tools, messages=messages
        )

        if response.stop_reason == "end_turn":
            return "".join(b.text for b in response.content if hasattr(b, "text"))

        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = tool_map[block.name](**block.input)
                print(f"  Tool: {block.name}({block.input}) → {result}")
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result)
                })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

# The model will first call get_coordinates("Paris"),
# then use the result to call get_weather_by_coords(48.85, 2.35)
print(chat_with_tools("What's the weather like in Paris right now?"))

The model doesn’t need to be told to chain these calls — it reads the tool descriptions, understands that get_weather_by_coords needs coordinates, and autonomously calls get_coordinates first. This is emergent reasoning, not hard-coded logic.

Example 2: Database Query Tool

This example gives the model the ability to query a SQLite database. The model generates SQL, the tool executes it safely, and the model interprets the results.

import anthropic
import json
import sqlite3

client = anthropic.Anthropic()

# Create a sample database
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT,
                        signup_date DATE, plan TEXT);
    INSERT INTO users VALUES (1, 'Alice', 'alice@example.com', '2026-03-15', 'pro');
    INSERT INTO users VALUES (2, 'Bob', 'bob@example.com', '2026-03-20', 'free');
    INSERT INTO users VALUES (3, 'Charlie', 'charlie@example.com', '2026-02-10', 'pro');
    INSERT INTO users VALUES (4, 'Diana', 'diana@example.com', '2026-03-25', 'enterprise');
    INSERT INTO users VALUES (5, 'Eve', 'eve@example.com', '2026-01-05', 'free');

    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER,
                         amount DECIMAL, order_date DATE);
    INSERT INTO orders VALUES (1, 1, 99.99, '2026-03-16');
    INSERT INTO orders VALUES (2, 3, 199.99, '2026-03-01');
    INSERT INTO orders VALUES (3, 4, 499.99, '2026-03-26');
    INSERT INTO orders VALUES (4, 1, 49.99, '2026-03-28');
""")

tools = [
    {
        "name": "query_database",
        "description": """Execute a READ-ONLY SQL query against the database.
Available tables:
- users (id, name, email, signup_date, plan) — plan is 'free', 'pro', or 'enterprise'
- orders (id, user_id, amount, order_date) — user_id references users.id
Only SELECT statements are allowed. Returns rows as a list of dictionaries.""",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "SQL SELECT query to execute"
                }
            },
            "required": ["query"]
        }
    }
]

def query_database(query: str) -> dict:
    # Security: only allow SELECT statements
    if not query.strip().upper().startswith("SELECT"):
        return {"error": "Only SELECT queries are allowed"}

    try:
        cursor.execute(query)
        columns = [desc[0] for desc in cursor.description]
        rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
        return {"columns": columns, "rows": rows, "row_count": len(rows)}
    except Exception as e:
        return {"error": str(e)}

# Ask a natural language question about the data
messages = [{"role": "user", "content": "How many users signed up in March 2026, and what's the total revenue from orders that month?"}]

response = client.messages.create(
    model="claude-sonnet-4-20250514", max_tokens=1024,
    tools=tools, messages=messages
)

# Process (the model will likely make two queries)
while response.stop_reason == "tool_use":
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = query_database(**block.input)
            print(f"SQL: {block.input['query']}")
            print(f"Result: {result}\n")
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result)
            })

    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
    response = client.messages.create(
        model="claude-sonnet-4-20250514", max_tokens=1024,
        tools=tools, messages=messages
    )

for block in response.content:
    if hasattr(block, "text"):
        print(block.text)

Caution: Never let an LLM execute arbitrary SQL against a production database. Always enforce read-only access, use parameterized queries where possible, validate the query before execution, and run against a restricted database user with minimal permissions.
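One way to enforce read-only access at the database layer, rather than by string-inspecting the SQL, is SQLite's authorizer hook, which vets every operation the engine is about to perform. A minimal sketch:

```python
import sqlite3

def readonly_authorizer(action, arg1, arg2, db_name, trigger_or_view):
    # SQLite consults this callback for each operation; permit only
    # reads and built-in functions, deny everything else.
    if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ,
                  sqlite3.SQLITE_FUNCTION):
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice')")
conn.commit()
conn.set_authorizer(readonly_authorizer)

rows = conn.execute("SELECT name FROM users").fetchall()  # permitted
try:
    conn.execute("DELETE FROM users")                     # denied
except sqlite3.Error as e:
    print("blocked:", e)
```

Unlike a `startswith("SELECT")` check, this blocks writes even if they are smuggled in via CTEs, triggers, or multi-statement tricks. Defense in depth — a read-only database user plus query validation — is still recommended.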

Example 3: Multi-Tool Agent

This example builds a mini agent that can search the web, read URLs, and send emails. It demonstrates the agentic loop — the model calls tools iteratively until the task is complete.

import anthropic
import json

client = anthropic.Anthropic()

tools = [
    {
        "name": "search_web",
        "description": "Search the web for current information. Returns a list of results with titles, URLs, and snippets.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "read_url",
        "description": "Read the text content of a web page given its URL.",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Full URL to read"}
            },
            "required": ["url"]
        }
    },
    {
        "name": "send_email",
        "description": "Send an email to a recipient with a subject and body.",
        "input_schema": {
            "type": "object",
            "properties": {
                "to": {"type": "string", "description": "Recipient email address"},
                "subject": {"type": "string", "description": "Email subject line"},
                "body": {"type": "string", "description": "Email body (plain text)"}
            },
            "required": ["to", "subject", "body"]
        }
    }
]

# Simulated tool implementations
def search_web(query):
    return {"results": [
        {"title": "NVIDIA Q4 2026 Earnings", "url": "https://example.com/nvidia-earnings",
         "snippet": "NVIDIA reported revenue of $45B, up 78% YoY..."},
        {"title": "NVIDIA Earnings Analysis", "url": "https://example.com/nvidia-analysis",
         "snippet": "Data center revenue drove growth at $38B..."}
    ]}

def read_url(url):
    return {"content": "NVIDIA reported Q4 2026 revenue of $45 billion, beating estimates of $42B. "
            "Data center revenue reached $38B (+95% YoY). Gaming revenue was $4.2B (+15%). "
            "Gross margin was 73.5%. The company announced a $50B buyback program."}

def send_email(to, subject, body):
    return {"status": "sent", "message_id": "msg_abc123"}

tool_map = {"search_web": search_web, "read_url": read_url, "send_email": send_email}

def run_agent(task: str, max_iterations: int = 10) -> str:
    """Run the agent loop until task completion or max iterations."""
    messages = [{"role": "user", "content": task}]

    for i in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-20250514", max_tokens=4096,
            tools=tools, messages=messages
        )

        if response.stop_reason == "end_turn":
            return "".join(b.text for b in response.content if hasattr(b, "text"))

        # Execute all tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = tool_map[block.name](**block.input)
                print(f"  [{i+1}] {block.name}({json.dumps(block.input)[:80]}...)")
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result)
                })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    return "Max iterations reached"

# The agent will: search → read article → compose email → send
result = run_agent(
    "Research the latest NVIDIA earnings and email a summary to investor@example.com"
)
print(result)

Notice the run_agent function — it’s a simple loop (a for with an iteration cap) that keeps calling the model until the task is done. The model autonomously decides the sequence: search first, read the most relevant article, compose an email, and send it. This is the core pattern behind every AI agent framework.

Example 4: Calculator and Code Execution

LLMs are notoriously bad at arithmetic. Tool calling solves this by offloading computation to actual code:

import anthropic
import json
import math

client = anthropic.Anthropic()

tools = [
    {
        "name": "calculate",
        "description": "Evaluate a mathematical expression. Supports standard math operations (+, -, *, /, **, %), functions (sqrt, sin, cos, log, abs), and constants (pi, e). Examples: '2**10', 'sqrt(144)', 'log(1000, 10)'",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Math expression to evaluate"}
            },
            "required": ["expression"]
        }
    },
    {
        "name": "run_python",
        "description": "Execute a Python code snippet and return stdout output. Use for complex calculations, data processing, or generating formatted results. The code runs in a sandboxed environment.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python code to execute"}
            },
            "required": ["code"]
        }
    }
]

def calculate(expression: str) -> dict:
    # Safe math evaluation with limited namespace
    allowed = {k: v for k, v in math.__dict__.items() if not k.startswith('_')}
    allowed.update({"abs": abs, "round": round, "min": min, "max": max})
    try:
        result = eval(expression, {"__builtins__": {}}, allowed)
        return {"expression": expression, "result": result}
    except Exception as e:
        return {"error": str(e)}

def run_python(code: str) -> dict:
    # WARNING: In production, use a proper sandbox (Docker, gVisor, etc.)
    import io, contextlib
    output = io.StringIO()
    try:
        with contextlib.redirect_stdout(output):
            exec(code, {"__builtins__": __builtins__})
        return {"stdout": output.getvalue(), "status": "success"}
    except Exception as e:
        return {"error": str(e), "status": "error"}

tool_map = {"calculate": calculate, "run_python": run_python}

# Ask something that requires precise computation
messages = [{"role": "user", "content":
    "If I invest $10,000 at 7.5% annual return compounded monthly, "
    "how much will I have after 20 years? Show the year-by-year breakdown."}]

response = client.messages.create(
    model="claude-sonnet-4-20250514", max_tokens=4096,
    tools=tools, messages=messages
)

while response.stop_reason == "tool_use":
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = tool_map[block.name](**block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result)
            })
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
    response = client.messages.create(
        model="claude-sonnet-4-20250514", max_tokens=4096,
        tools=tools, messages=messages
    )

for block in response.content:
    if hasattr(block, "text"):
        print(block.text)

Caution: The run_python tool above uses exec(), which is dangerous in production. Always sandbox code execution using containers, WebAssembly, or dedicated code execution services. Never run LLM-generated code with full system access.
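One lightweight mitigation is to run the code in a separate interpreter process with a timeout. This is a sketch, not a substitute for real isolation — it limits runtime, not filesystem or network access. The run_python_subprocess name and the 5-second default are illustrative:

```python
import subprocess
import sys

def run_python_subprocess(code: str, timeout: float = 5.0) -> dict:
    """Run a code snippet in a fresh interpreter process with a wall-clock limit."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, no site/env hooks
            capture_output=True, text=True, timeout=timeout,
        )
        if proc.returncode != 0:
            return {"error": proc.stderr.strip(), "status": "error"}
        return {"stdout": proc.stdout, "status": "success"}
    except subprocess.TimeoutExpired:
        # Structured error the model can relay instead of hanging the loop.
        return {"error": f"Execution exceeded {timeout}s", "status": "timeout"}
```

For actual isolation, wrap this same interface around a container or VM backend; the tool schema the model sees stays identical.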

The Agentic Loop: From Tool Calling to AI Agents

Tool calling is a single request-response interaction. An AI agent is what happens when you put tool calling in a loop. The agent keeps thinking, calling tools, observing results, and thinking again — until the task is complete.

The Basic Agent Loop

while task is not complete:
    1. THINK    → Model analyzes the current state and decides what to do next
    2. SELECT   → Model chooses a tool and generates arguments
    3. EXECUTE  → Application runs the tool and captures the result
    4. OBSERVE  → Result is fed back to the model
    5. REPEAT   → Model decides: need more info? Call another tool. Done? Respond.

┌───────────────────────────────────────────────┐
│                  AGENT LOOP                   │
│                                               │
│  ┌─────────┐     ┌──────────┐    ┌─────────┐  │
│  │  THINK  │────→│  SELECT  │───→│ EXECUTE │  │
│  │         │     │   TOOL   │    │  TOOL   │  │
│  └────▲────┘     └──────────┘    └────┬────┘  │
│       │                               │       │
│       │          ┌──────────┐         │       │
│       └──────────│ OBSERVE  │◀────────┘       │
│                  │  RESULT  │                 │
│                  └─────┬────┘                 │
│                        │                      │
│               Done? ───┤                      │
│               No  ─────┘ (loop back)          │
│               Yes ─────→ RESPOND to user      │
└───────────────────────────────────────────────┘

This pattern is everywhere:

  • Claude Code — the tool you might be reading this post through — uses exactly this pattern. When you ask Claude Code to “fix the bug in auth.py”, it calls tools like Read (to read files), Grep (to search code), Edit (to modify files), and Bash (to run tests), iterating until the bug is fixed.
  • ChatGPT with plugins follows the same loop — the model decides which plugins to invoke, executes them, reads the results, and continues.
  • GitHub Copilot’s agent mode reads your codebase, makes edits, runs tests, and iterates — all through tool calling.

How Claude Code Uses Tool Calling

Claude Code is a perfect real-world example. When you give it a task, it has access to tools like:

Tool   What It Does                    Example Use
Read   Reads a file from disk          Read src/auth.py to understand the code
Write  Creates or overwrites a file    Write a new test file
Edit   Makes targeted edits to a file  Fix a specific line in a function
Bash   Runs a shell command            Run pytest to check if the fix works
Grep   Searches file contents          Find all usages of a function
Glob   Finds files by pattern          Find all *.test.py files

A typical Claude Code session might involve 20-50 tool calls for a single task. The model reads a file, identifies the problem, searches for related code, makes an edit, runs the tests, sees a test fail, reads the error, makes another edit, runs the tests again, and finally reports success. Every step is a tool call. The “intelligence” is in deciding which tool to call and what arguments to use — the actual execution is done by your computer.

The Progression: Tool Call to Agent

Understanding tool calling lets you see the full progression of AI capability:

  1. Simple tool call: User asks a question → model calls one tool → responds. (Weather lookup)
  2. Multi-tool call: Model calls several tools in parallel or sequence within one turn. (Weather + stock price)
  3. Multi-step chain: Model calls tools iteratively across multiple turns, using each result to inform the next call. (Research → read → summarize → email)
  4. Autonomous agent: Model operates in a loop with minimal human intervention, using tools to accomplish complex goals. (Claude Code fixing a bug across multiple files)

Each step builds on the one before it. If you understand step 1, you understand the foundation for step 4. Tool calling is the atomic unit of AI agency.

Model Context Protocol (MCP): The Standard for Tool Calling

If every AI application defines its tools in a different format, the ecosystem becomes fragmented. That’s the problem the Model Context Protocol (MCP) solves.

MCP is an open standard created by Anthropic that provides a universal way to connect AI models to external tools, data sources, and services. Think of it as USB-C for AI tools — a single standard that works everywhere, instead of every device having its own proprietary connector.

How MCP Works

MCP defines a client-server architecture:

  • MCP Clients (like Claude Code, Claude Desktop, or your custom app) connect to MCP servers and expose the available tools to the AI model
  • MCP Servers expose three types of capabilities:
    • Tools: Functions the model can call (same concept as function calling)
    • Resources: Data the model can read (files, database records, API responses)
    • Prompts: Pre-defined prompt templates for common tasks

┌──────────────┐     ┌─────────────┐     ┌─────────────┐
│  Claude      │     │  MCP        │     │  External   │
│  Desktop /   │────→│  Server     │────→│  Service    │
│  Claude Code │     │  (your app) │     │  (DB, API)  │
│  (MCP Client)│     │             │     │             │
└──────────────┘     └─────────────┘     └─────────────┘

The MCP Server exposes:
- Tools:     query_database, create_ticket, send_slack_message
- Resources: customer_data, product_catalog
- Prompts:   summarize_ticket, generate_report

Building a Simple MCP Server

Here’s a minimal MCP server that exposes a database query tool:

from mcp.server import Server
from mcp.types import Tool, TextContent
import sqlite3
import json

server = Server("database-server")

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="query_database",
            description="Run a read-only SQL query against the customer database.",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "SQL SELECT query"}
                },
                "required": ["query"]
            }
        )
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "query_database":
        conn = sqlite3.connect("customers.db")
        cursor = conn.cursor()

        if not arguments["query"].strip().upper().startswith("SELECT"):
            return [TextContent(type="text", text="Error: Only SELECT queries allowed")]

        cursor.execute(arguments["query"])
        columns = [d[0] for d in cursor.description]
        rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
        conn.close()

        return [TextContent(type="text", text=json.dumps(rows, indent=2))]

# To serve over stdio (the transport Claude Desktop and Claude Code use):
async def main():
    from mcp.server.stdio import stdio_server
    async with stdio_server() as (read_stream, write_stream):
        await server.run(read_stream, write_stream,
                         server.create_initialization_options())

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

Once this MCP server is running, any MCP-compatible client (Claude Code, Claude Desktop, custom applications) can connect to it and the AI model will be able to query your database through tool calling — with the MCP protocol handling all the communication plumbing.

MCP vs. Other Approaches

Approach                 Standardized?    Multi-Client  Discovery        Status
MCP                      Open standard    Yes           Built-in         Growing adoption
OpenAI Plugins           OpenAI-specific  No            Plugin manifest  Deprecated in favor of GPTs
Custom function calling  No               No            Manual           Most flexible

MCP is gaining significant momentum in 2026. Major IDE extensions, AI coding tools, and enterprise platforms are adopting it as the standard way to connect AI to external systems. If you’re building tools for AI models, building them as MCP servers future-proofs your work.

Best Practices for Designing Tools

The quality of your tools directly determines how well your AI application performs. A well-designed tool is like a well-written function: clear name, documented parameters, predictable behavior. A poorly designed tool leads to hallucinated arguments, incorrect tool selection, and frustrated users.

Naming and Descriptions

The model reads your tool’s name and description to decide when and how to use it. Invest time in these — they’re essentially prompts for the model.

Aspect                 Bad             Good
Function name          weather         get_current_weather
Function name          do_stuff        create_calendar_event
Description            “Gets weather”  “Get current weather conditions (temperature, humidity, wind) for a specific city. Use when the user asks about weather or atmospheric conditions.”
Parameter description  “The city”      “City name, e.g. ‘Tokyo’, ‘New York’, ‘London’. Use the English name.”

Key Design Principles

One tool per action. Don’t create a manage_database tool that can query, insert, update, and delete. Create separate tools: query_database, insert_record, update_record, delete_record. This gives the model clearer choices and reduces errors.

Detailed JSON Schema. Use types, required fields, enums, defaults, and descriptions for every parameter. The more constrained the schema, the more reliable the model’s output:

{
  "properties": {
    "priority": {
      "type": "string",
      "enum": ["low", "medium", "high", "critical"],
      "description": "Task priority level. Use 'critical' only for production outages.",
      "default": "medium"
    },
    "due_date": {
      "type": "string",
      "description": "Due date in ISO 8601 format (YYYY-MM-DD), e.g. '2026-04-15'"
    }
  }
}

Structured error messages. When a tool fails, return a structured error message that the model can understand and act on — not a stack trace:

# Bad: raises exception that crashes the loop
raise Exception("Connection timeout")

# Good: returns error the model can understand
return {"error": "Database connection timed out after 30s. The database may be under heavy load. Try again in a few minutes."}

Separate read and write tools. This is crucial for safety. A query_database tool (read-only) is safe to call freely. A delete_record tool (destructive) should require confirmation. By separating them, you can apply different safety policies.

Confirmation for dangerous actions. Before deleting data, sending emails, or making payments, have the model ask for user confirmation. You can implement this by having the tool return a “confirmation required” response that the model must present to the user before proceeding.
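One way to implement that “confirmation required” pattern is a two-phase tool: the first call never acts, and the second call only acts if the first one happened. This is a sketch — the delete_record tool and the in-memory PENDING_DELETES store are illustrative names, not part of any SDK:

```python
# Sketch: a destructive tool that refuses to act until explicitly confirmed.
PENDING_DELETES = set()

def delete_record(record_id: str, confirmed: bool = False) -> dict:
    if not confirmed:
        # First call: don't delete. Return a response the model must
        # relay to the user before calling again with confirmed=True.
        PENDING_DELETES.add(record_id)
        return {
            "status": "confirmation_required",
            "message": f"Deleting record {record_id} is irreversible. "
                       "Confirm with the user, then call again with confirmed=true.",
        }
    if record_id not in PENDING_DELETES:
        return {"error": "No pending delete for this record. Request confirmation first."}
    PENDING_DELETES.discard(record_id)
    # ... perform the actual deletion here ...
    return {"status": "deleted", "record_id": record_id}
```

The schema's description field should spell out the two-step flow so the model knows to ask before setting confirmed=true.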

Tip: When designing tools, ask yourself: “If the model called this tool with the wrong arguments, what’s the worst that could happen?” If the answer is “data loss” or “real money spent,” add confirmation steps, input validation, and rate limiting.

Common Pitfalls and How to Avoid Them

Even with well-designed tools, things can go wrong. Here are the most common issues and their solutions:

Pitfall                         Cause                                               Solution
Model hallucinating tool calls  Tool name similar to a known concept                Use strict tool definitions; validate tool name before execution
Wrong argument types            Vague or missing JSON Schema                        Add detailed types, enums, and descriptions; include examples
Infinite tool loops             Model keeps calling tools without converging        Set max_iterations limit; add “no more info needed” guidance
Unnecessary tool calls          Overly broad tool description                       Write precise descriptions about when to use the tool
Ignoring tool errors            Error returned as exception, not tool result        Always return errors as tool results so the model can handle them
SQL injection via tool args     LLM-generated SQL executed without validation       Parameterized queries; read-only database user; query allowlists
Command injection               LLM-generated shell commands executed directly      Sandboxing; allowlisted commands only; never pass to shell=True
Token cost explosion            Tool results too large (e.g., full database dumps)  Paginate results; limit response size; summarize large outputs

Security Considerations

Security deserves special attention because tool calling gives an LLM the ability to take real actions. A prompt injection attack that convinces the model to call delete_all_users() is no longer a theoretical concern — it’s a real risk.

Key security practices:

  1. Input validation: Validate all tool arguments before execution. Don’t trust the model to always provide safe inputs.
  2. Least privilege: Give tools the minimum permissions necessary. Database tools should use read-only credentials unless writes are required.
  3. Rate limiting: Limit how often tools can be called to prevent abuse or runaway loops.
  4. Audit logging: Log every tool call with its arguments and results. This is essential for debugging and security auditing.
  5. Sandboxing: Code execution tools must run in isolated environments (containers, VMs, or WebAssembly sandboxes).
  6. Confirmation gates: Destructive operations (delete, send, pay) should require human confirmation before execution.
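Practices 1 and 2 can be enforced in a single dispatcher that checks the model's tool call before anything runs. This sketch uses a hand-rolled type check; in practice you might validate arguments against the full JSON Schema with a library such as jsonschema. The tool names and schemas here are illustrative:

```python
# Sketch: validate tool name and arguments before execution.
REQUIRED_ARGS = {
    "get_weather": {"city": str},        # illustrative per-tool schemas
    "query_database": {"query": str},
}

def safe_dispatch(name: str, arguments: dict, tool_map: dict) -> dict:
    if name not in tool_map:
        # The model hallucinated a tool name: report it instead of crashing.
        return {"error": f"Unknown tool '{name}'. Available: {sorted(tool_map)}"}
    for arg, expected_type in REQUIRED_ARGS.get(name, {}).items():
        if arg not in arguments:
            return {"error": f"Missing required argument '{arg}' for {name}"}
        if not isinstance(arguments[arg], expected_type):
            return {"error": f"Argument '{arg}' must be {expected_type.__name__}"}
    try:
        return tool_map[name](**arguments)
    except Exception as e:
        # Return failures as structured results so the model can react.
        return {"error": str(e)}
```

Every error path returns a tool result rather than raising, which keeps the agent loop alive and lets the model correct itself.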

Tool Calling in Production

Moving from a prototype to production requires additional engineering around reliability, observability, and cost management.

Reliability Patterns

Caching: Cache tool results to avoid redundant API calls. If the model asks for the weather in Tokyo twice in the same conversation, return the cached result. Use time-based expiration (e.g., 5-minute TTL for weather data).

import json
from datetime import datetime, timedelta

_cache = {}

def cached_tool_call(name: str, args: dict, ttl_seconds: int = 300):
    key = f"{name}:{json.dumps(args, sort_keys=True)}"
    if key in _cache:
        result, timestamp = _cache[key]
        if datetime.now() - timestamp < timedelta(seconds=ttl_seconds):
            return result

    result = execute_tool(name, args)
    _cache[key] = (result, datetime.now())
    return result

Retry with backoff: External APIs fail. Implement retries with exponential backoff for transient errors (timeouts, rate limits, 5xx errors).
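A minimal retry helper might look like the sketch below. TransientError stands in for whatever timeout, rate-limit, or 5xx exceptions your tool actually raises, and the attempt and delay values are illustrative defaults:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for timeouts, rate limits, and 5xx responses."""

def call_with_retries(fn, *args, max_attempts=4, base_delay=0.5, **kwargs):
    # Retry with exponential backoff plus jitter; on final failure,
    # return a structured error the model can relay to the user.
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except TransientError as e:
            if attempt == max_attempts - 1:
                return {"error": f"Tool failed after {max_attempts} attempts: {e}"}
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Note that permanent errors (bad arguments, 4xx responses) should not be retried — fail fast and return them to the model immediately.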

Fallback strategies: When a tool fails after retries, return a structured error message that lets the model inform the user gracefully, rather than crashing the entire interaction.

Observability

Logging: Log every tool call with a structured format:

{
  "timestamp": "2026-04-03T10:30:00Z",
  "conversation_id": "conv_abc123",
  "tool_name": "get_weather",
  "arguments": {"city": "Tokyo"},
  "result_summary": "success, temperature=22",
  "latency_ms": 245,
  "tokens_used": {"input": 150, "output": 45}
}

Monitoring: Track key metrics:

  • Tool call success rate (should be above 95%)
  • Average tool latency (directly impacts user experience)
  • Tool calls per conversation (indicates complexity)
  • Token cost per tool call cycle (each call adds tokens to the context)
  • Error rates by tool (identifies problematic tools)
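These metrics can be accumulated with a small in-process tracker, sketched below; a production system would export the same counters to Prometheus, Datadog, or a similar backend rather than keeping them in memory:

```python
from collections import defaultdict

class ToolMetrics:
    """Accumulate per-tool call counts, error counts, and latency."""

    def __init__(self):
        self.calls = defaultdict(
            lambda: {"count": 0, "errors": 0, "latency_ms": 0.0}
        )

    def record(self, tool_name: str, latency_ms: float, ok: bool):
        m = self.calls[tool_name]
        m["count"] += 1
        m["latency_ms"] += latency_ms
        if not ok:
            m["errors"] += 1

    def summary(self, tool_name: str) -> dict:
        m = self.calls[tool_name]
        return {
            "success_rate": 1 - m["errors"] / m["count"],
            "avg_latency_ms": m["latency_ms"] / m["count"],
        }
```

Call record() from the same place you execute tools, so every call in the agent loop is counted exactly once.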

Cost Optimization

Every tool call adds tokens to your context window. The tool definitions themselves are included in every API request, so 20 detailed tools might add 2,000-3,000 tokens before the conversation even starts.

Strategies to manage costs:

  • Dynamic tool loading: Only include relevant tools based on the conversation context. A weather conversation doesn't need database tools.
  • Result compression: Truncate or summarize large tool results before sending them back to the model. A full database dump is rarely necessary — send summary statistics instead.
  • Conversation pruning: In long multi-tool conversations, summarize earlier tool results and remove the raw data from the context.
  • Model selection: Use cheaper, faster models (like Claude Haiku or GPT-4o-mini) for simple tool-calling tasks, and reserve expensive models for complex reasoning.
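Dynamic tool loading can start as simple keyword routing, as in the sketch below. The categories and keyword lists are illustrative; production systems often match tools to the query with embeddings instead:

```python
# Sketch: include only the tool definitions relevant to the user's message.
ALL_TOOLS = {
    "weather":  [{"name": "get_weather"}, {"name": "get_forecast"}],
    "database": [{"name": "query_database"}],
    "email":    [{"name": "send_email"}],
}
KEYWORDS = {
    "weather":  ["weather", "temperature", "rain", "forecast"],
    "database": ["database", "customer", "record", "query"],
    "email":    ["email", "send", "mail"],
}

def select_tools(user_message: str) -> list:
    text = user_message.lower()
    selected = []
    for category, words in KEYWORDS.items():
        if any(w in text for w in words):
            selected.extend(ALL_TOOLS[category])
    # Fall back to everything if no category matched.
    return selected or [t for ts in ALL_TOOLS.values() for t in ts]
```

Even this crude filter can cut thousands of tokens of tool definitions from requests that clearly don't need them.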

Testing Tool-Calling Applications

Test tools independently before integrating them with the LLM:

  1. Unit tests: Test each tool function with various inputs, including edge cases and invalid arguments.
  2. Integration tests: Test the tool with the actual API or database it connects to.
  3. LLM integration tests: Test the full loop with the model. Provide a set of test prompts and verify the model calls the right tools with correct arguments.
  4. Adversarial tests: Test with prompts designed to trick the model into misusing tools (prompt injection).

# Example: testing that the model calls the right tool
# (assumes the get_weather tool definition from earlier examples is in `tools`)
def test_weather_tool_selection():
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": "What's the weather in London?"}]
    )

    tool_calls = [b for b in response.content if b.type == "tool_use"]
    assert len(tool_calls) == 1
    assert tool_calls[0].name == "get_weather"
    assert tool_calls[0].input["city"] == "London"

def test_no_tool_for_general_question():
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": "What is the capital of France?"}]
    )

    # Model should answer directly, no tool call
    assert response.stop_reason == "end_turn"

The Future of Tool Calling

Tool calling is evolving rapidly. Here's where it's heading:

Computer Use

Anthropic's computer use capability takes tool calling to its logical extreme: instead of calling specific APIs, the model can control an entire computer desktop. It sees the screen (via screenshots), moves the mouse, clicks buttons, and types text. The "tools" become the entire computer interface — every application, every website, every file. This is the most general form of tool use: rather than building a specific tool for every task, you give the model the same tools a human uses.

More Reliable Structured Output

Constrained decoding is making tool calling more reliable. Instead of hoping the model produces valid JSON, the decoding process itself enforces the JSON Schema — the model literally cannot produce invalid output. OpenAI's "strict mode" and Anthropic's improvements in JSON reliability are steps in this direction.

Tool Learning and Discovery

Current models use tools that are explicitly defined in the request. Future models may be able to discover tools dynamically — browsing an API directory, reading documentation, and figuring out how to use a new tool without it being pre-defined. MCP is laying the groundwork for this with its discovery protocol.

Multi-Agent Tool Sharing

As multi-agent systems become more common (multiple AI agents collaborating on a task), tool sharing becomes important. One agent might specialize in database queries while another handles email. MCP's architecture supports this by allowing multiple agents to connect to the same tool servers.

Standardization

MCP adoption is accelerating. In the same way that REST APIs standardized web service communication, MCP is standardizing how AI models interact with external tools. For developers and companies building AI tools, this means writing your tool once and making it available to every AI model and client that supports MCP.

Key Takeaway: Tool calling is not just a feature — it's the foundational capability that enables AI agents, computer use, and autonomous AI systems. Every advance in AI agency is ultimately an advance in how models select, call, and orchestrate tools.

Conclusion

Tool calling is the invisible infrastructure behind every AI agent, every chatbot plugin, and every autonomous AI system. It's deceptively simple — a model outputs a function name and arguments, your code executes it, and the result goes back to the model — but this simple loop is what transformed LLMs from text generators into systems that can do things in the real world.

Let's recap what we covered:

  • The core concept: Tool calling lets LLMs request the execution of external functions. The model plans, your code acts.
  • The three-step loop: User asks → model calls tool → your code executes → model responds with the result.
  • Provider implementations: Claude, GPT, and Gemini all support tool calling with slightly different formats but the same underlying pattern.
  • Practical patterns: From simple weather lookups to chained tool calls, database queries, and multi-tool agents.
  • The agentic loop: Tool calling in a loop is the foundation of AI agents. Claude Code, ChatGPT plugins, and GitHub Copilot all work this way.
  • MCP: The open standard that's making tool definitions universal and interoperable.
  • Best practices: Clear naming, detailed schemas, error handling, security, and the read/write separation principle.
  • Production concerns: Caching, logging, cost optimization, and testing strategies.

If you're a developer, start building with tool calling today. Pick an API you already use, define it as a tool, and hook it up to Claude or GPT. You'll be surprised at how quickly you go from "AI that chats" to "AI that acts." If you're an investor, understand that tool calling is not a feature — it's the foundation of the entire AI agent ecosystem. Companies that master tool integration will win the next phase of AI.

The era of AI that only talks is over. The era of AI that does is just beginning — and tool calling is the mechanism that makes it possible.

References

  1. Anthropic. "Tool use (function calling) — Claude Documentation." docs.anthropic.com/en/docs/build-with-claude/tool-use
  2. OpenAI. "Function calling — OpenAI API Documentation." platform.openai.com/docs/guides/function-calling
  3. Google. "Function calling — Gemini API Documentation." ai.google.dev/gemini-api/docs/function-calling
  4. Anthropic. "Model Context Protocol — Documentation." modelcontextprotocol.io
  5. Anthropic. "Computer use — Claude Documentation." docs.anthropic.com/en/docs/build-with-claude/computer-use
  6. Anthropic. "Claude Code — Documentation." docs.anthropic.com/en/docs/claude-code
  7. Schick, T., et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761, 2023.
  8. Qin, Y., et al. "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." arXiv:2307.16789, 2023.
