
The Evolution of AI Agents: How Top Developers Build Efficient Tools for Claude

October 9, 2025
9 min read

Does your AI Agent feel a bit clumsy and unable to reach its full potential? The problem might not be the AI itself, but the “tools” you give it. This article reveals Anthropic’s internal methods for building, evaluating, and optimizing AI tools, and shows how to let Claude assist with all of it, significantly boosting the performance of your AI applications.


Have you ever had this feeling? You have a powerful Large Language Model (LLM) like Claude, which in theory should be able to handle complex tasks automatically, but in practice, it always feels a bit clunky and not smart enough. It’s like you’ve hired a Michelin-starred chef but only given them a dull knife and a few less-than-fresh ingredients.

The root of the problem is often not the chef’s ability, but the tools we provide them.

The performance of an AI Agent is most directly tied to the tools we give it. This article shares the lessons Anthropic has distilled from countless internal experiments: how to build high-quality tools, how to evaluate them comprehensively, and, most interestingly, how to collaborate with an AI like Claude so it can optimize its own tools.

So, What Exactly Are AI “Tools”?

Before we dive in, we need to clarify a concept. Traditional software development is like writing a precise recipe: if you use the same ingredients and follow every step exactly, the final dish (the output) is always the same. This is what’s called a “deterministic system.”

But AI Agents are different. They are more like creative chefs who, even with the same ingredients, might make slightly different variations based on their inspiration at the moment. It is a “non-deterministic system,” full of variables and possibilities.

Therefore, the “tools” designed for AI are a new kind of software. They are no longer rigid instruction sets, but more like a “contract” established between a deterministic system and a non-deterministic agent. When a user asks, “Should I bring an umbrella when I go out today?”, the agent might call a weather tool, answer from its own knowledge, or even ask for the location. It might make mistakes, or it might not find a suitable tool.

This means we must completely change our thinking. We are no longer designing APIs for other developers, but tools for a “digital brain” full of uncertainty that needs guidance.

How to Build Efficient Tools? A Continuous Development Cycle

Building tools that AI can use smoothly is not something that can be achieved overnight. It is a continuous cycle of “building, evaluating, and learning.”

Step 1: Don’t Overthink It, Just Build a Prototype

It’s useless to just imagine which tools an AI will find “handy” and which will “confuse” it. The best way is to just get started.

You can use tools like Claude Code to quickly generate your tool prototype. A little trick is to provide it with relevant software libraries, APIs, or SDK documentation, especially those LLM-friendly plain text files (many open-source projects provide files like llms.txt), which will make it much more efficient.
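A prototype at this stage can be as simple as a function plus a tool definition in the JSON-schema style used for LLM tool calling. The sketch below is a minimal, illustrative example; the weather tool, its field values, and the stub data are all assumptions, not part of the original article.

```python
# A hypothetical prototype tool: a JSON-schema-style definition that an
# LLM reads, paired with the function the harness actually executes.
get_weather_tool = {
    "name": "get_weather",
    "description": (
        "Get the current weather for a city. Returns temperature in "
        "Celsius and a short condition summary (e.g. 'light rain')."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name, e.g. 'Taipei' or 'San Francisco'",
            }
        },
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    """Stub implementation; a real tool would call a weather API."""
    fake_data = {"Taipei": "28°C, light rain", "San Francisco": "17°C, foggy"}
    return fake_data.get(city, "unknown city")
```

Keeping the schema and the implementation side by side like this makes it easy to hand both to Claude Code for iteration later.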

Once the prototype is written, you can package it as a local Model Context Protocol (MCP) server or a Desktop Extension (DXT) to test it in the Claude Code or Claude desktop application. You can also test it programmatically through the Anthropic API.

Test your tools yourself, feel whether the process is smooth, and collect user feedback. This will help you build an intuition for the use cases.

Step 2: It’s Time for a Rigorous “Final Exam”

With the prototype in hand, you now need to measure how well Claude performs using these tools. This requires a comprehensive evaluation mechanism.

Forget about those overly simplistic “sandbox” environments! What you need are evaluation tasks that originate from the real world and have sufficient complexity. A good evaluation task may require the AI to call multiple, or even dozens of, tools in succession to complete.

Look at the difference between these two sets of tasks:

  • Good evaluation task examples:

    • “Help me schedule a meeting with Jane for next week to discuss the latest Acme Inc. project. Attach notes from the last project planning meeting and book a conference room.”
    • “Customer ID 9182 reported that they were charged three times for a single purchase. Find all relevant log records and determine if other customers were also affected.”
  • Weaker evaluation task examples:

    • “Schedule a meeting with jane@acme.corp for next week.”
    • “Search for payment logs with customer_id=9182.”

See the difference? Good tasks are closer to real workflows.

Each evaluation task should have a verifiable result. The simplest check is string comparison; a more sophisticated one is to have another Claude instance judge whether the result is correct. You can also use the system prompt to require the AI to output its “reasoning” and “feedback” before calling a tool. This triggers its “Chain-of-Thought” behavior and improves its problem-solving ability.
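The simple end of that spectrum can be sketched in a few lines. The harness below is an illustrative assumption: the task format, the toy agent, and the normalization rule are mine, not from the original article, and a real setup would replace the exact-match grader with an LLM judge for open-ended answers.

```python
# A minimal, hypothetical evaluation harness: each task pairs a prompt
# with an expected answer, and grading is normalized string comparison.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences do not fail an otherwise correct answer."""
    return " ".join(text.lower().split())

def grade(expected: str, actual: str) -> bool:
    """Simplest verifiable check. A stricter setup could instead ask a
    second Claude instance to judge semantic equivalence."""
    return normalize(expected) == normalize(actual)

def run_eval(tasks, agent) -> float:
    """Run each task through the agent and return the pass rate."""
    passed = sum(grade(t["expected"], agent(t["prompt"])) for t in tasks)
    return passed / len(tasks)

# Toy agent standing in for a real tool-calling loop.
tasks = [
    {"prompt": "Find duplicate charges for customer 9182",
     "expected": "3 duplicate charges found"},
]
print(run_eval(tasks, lambda p: "3 Duplicate  charges found"))  # → 1.0
```

The point is not the grader itself but that every task has a machine-checkable outcome, so you can re-run the whole suite after each tool change.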

Step 3: Let the AI Be Your Best Analyst

After the evaluation is run and a pile of data is in front of you, what’s next?

At this point, the AI agent itself is your best partner. It can help you discover all kinds of problems, from contradictory tool descriptions to inefficient tool implementations. But remember one key point: Large Language Models are not always straightforward. What they don’t say is often more important than what they do say.

Carefully observe where your AI gets stuck or confused. Read its reasoning process (CoT) to find the rough spots. You can even paste the entire evaluation process script (including tool calls and returns) directly into Claude Code. It is an expert at analyzing scripts and refactoring tools, ensuring that after your modifications, the tool’s implementation and description remain consistent.

In fact, most of the suggestions in this article come from our internal practice of continuously optimizing tools with Claude Code. Through this method, we have found that the performance improvement even surpasses that of tools manually written by expert researchers.

The Five Golden Rules for Building Efficient Tools

After countless iteration cycles, we have distilled several key design principles.

Rule 1: Less is More, Don’t Give Your AI Choice Paralysis

A common misconception is to think that the more tools you give an AI, the better. But the opposite is true. Simply wrapping existing API functions one-to-one into tools often has a counterproductive effect.

The “context” of an AI agent is limited, just like human short-term memory. In contrast, the memory of a traditional computer is almost infinite. Imagine finding a person in a contact list. Traditional software can quickly traverse the entire list. But if a tool returns “all” contacts and lets the AI read them one by one, it is undoubtedly wasting its precious context space.

A smarter, more natural way is to, like a human, jump directly to the relevant page (e.g., by looking up alphabetically).

Therefore, you should design tools for specific, high-impact workflows. For example, instead of providing three separate tools (list_users, list_events, and create_event), provide a single schedule_event tool that finds a free slot and books the event in one step.
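The consolidation idea can be sketched concretely. The in-memory calendar, user table, and helper names below are all illustrative assumptions; the point is that the orchestration the agent would otherwise do across three tool calls happens inside one deterministic function.

```python
# Hypothetical consolidated tool: one call replaces the
# list_users -> list_events -> create_event chain.
from datetime import date

USERS = {"jane": "jane@example.com"}          # toy user directory
EVENTS = {"jane": [date(2025, 10, 13)]}       # days already booked

def schedule_event(attendee: str, candidate_days: list[date], title: str) -> dict:
    """Find the attendee's first free candidate day and book the event,
    so the agent spends one tool call instead of three."""
    if attendee not in USERS:
        return {"error": f"Unknown attendee '{attendee}'. Known: {list(USERS)}"}
    busy = set(EVENTS.get(attendee, []))
    for day in candidate_days:
        if day not in busy:
            EVENTS.setdefault(attendee, []).append(day)
            return {"booked": True, "day": day.isoformat(), "title": title}
    return {"booked": False, "reason": "No free day among candidates."}

result = schedule_event(
    "jane", [date(2025, 10, 13), date(2025, 10, 14)], "Acme sync"
)
```

Because the free/busy check runs in ordinary code, none of the intermediate event lists ever consume the agent’s context.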

Rule 2: Organize Your Toolbox, Naming is an Art

When your AI can access dozens or even hundreds of tools, chaos ensues. If tools have overlapping functions or ambiguous purposes, the AI can easily use the wrong one.

Namespacing is a simple yet effective solution. By grouping related tools with a common prefix, you can help the AI choose the right tool at the right time. For example:

  • By service: asana_search, jira_search
  • By resource: asana_projects_search, asana_users_search

This not only reduces the number of tools that need to be loaded into the AI’s context, but also shifts some of the computational burden from the AI’s “brain” to the tools themselves, thereby reducing the risk of errors.
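One way to picture the benefit: a prefixed registry lets you load only one service’s tools into the agent’s context. The registry and tool names below are illustrative assumptions, not a real API.

```python
# Hypothetical namespaced tool registry, grouped by service prefix.
TOOLS = {
    "asana_projects_search": "Search Asana projects by keyword.",
    "asana_users_search": "Search Asana users by name or email.",
    "jira_issues_search": "Search Jira issues by project and status.",
}

def tools_for_service(prefix: str) -> dict:
    """Select only one service's tools to expose to the agent,
    instead of presenting the entire catalog at once."""
    return {name: desc for name, desc in TOOLS.items()
            if name.startswith(prefix + "_")}
```

With consistent prefixes, this filtering is a one-liner, and ambiguous names like a bare `search` simply cannot exist.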

Rule 3: Only Say What’s Important, the AI’s “Attention” is Precious

The return content of a tool is equally important. Be sure to only return high-value, highly context-relevant information.

AI is better at processing natural-language names and terms than opaque technical identifiers like UUIDs. We found that simply resolving long, meaningless alphanumeric IDs into semantically meaningful names can significantly improve Claude’s accuracy on retrieval tasks and reduce hallucinations.

In some cases, you can also provide flexibility. For example, add a response_format parameter that allows the AI to choose between returning a “concise” or “detailed” result. The concise version may only contain the core content, while the detailed version includes various IDs for subsequent tool calls.
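A `response_format` parameter like that might look as follows. The record fields and their values are illustrative assumptions; the pattern is what matters: the concise form carries only human-readable content, while the detailed form keeps the opaque IDs needed for follow-up tool calls.

```python
# Hypothetical tool with a response_format switch.
def get_customer(customer_id: str, response_format: str = "concise") -> dict:
    """Fetch a customer record; 'concise' saves context tokens,
    'detailed' includes IDs needed by subsequent tool calls."""
    record = {
        "id": customer_id,
        "uuid": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
        "name": "Jane Doe",
        "plan": "pro",
        "billing_account_id": "ba_192829814",
    }
    if response_format == "concise":
        # Only the semantically meaningful fields the model usually needs.
        return {"name": record["name"], "plan": record["plan"]}
    return record  # "detailed": keep the opaque identifiers too.
```

Defaulting to `"concise"` means the agent pays the token cost of IDs only when it explicitly asks for them.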

Rule 4: Be Economical, Teach Your AI to Save “Brain Capacity”

Context quality is important, but “quantity” also needs to be managed. The agent’s context length is limited, so your tools should implement features like pagination, range selection, and filtering.

If the result returned by your tool is truncated, be sure to give a clear prompt to guide the AI to adopt a more token-saving strategy, such as performing multiple small-range, precise searches instead of one large-range, vague search.
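Pagination plus an explicit truncation hint can be combined in one return value. The sketch below is an illustrative assumption (the log format, page size, and hint wording are mine); the key detail is the `note` field that tells the agent the output was cut and how to proceed.

```python
# Hypothetical paginated search that tells the agent when results
# are truncated, instead of silently dropping rows.
def search_logs(query: str, logs: list[str],
                page: int = 0, page_size: int = 3) -> dict:
    matches = [line for line in logs if query in line]
    start = page * page_size
    chunk = matches[start:start + page_size]
    result = {"results": chunk, "page": page, "total_matches": len(matches)}
    if start + page_size < len(matches):
        # Guide the agent toward a token-saving strategy: fetch the
        # next page, or narrow the query instead of reading everything.
        result["note"] = (
            f"Showing {len(chunk)} of {len(matches)} matches. "
            f"Call again with page={page + 1}, or narrow the query."
        )
    return result
```

A hint like this nudges the agent toward several small, precise searches rather than one huge, vague one.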

Similarly, error messages are crucial. Instead of returning a cold error code, provide a helpful response that clearly explains the problem and gives suggestions for correction.

Look at this comparison:

  • Useless error: {"error": {"code": "RESOURCE_NOT_FOUND"}}
  • Useful error: “# Resource Not Found: Invalid userId. Your request failed because the userId ‘jane@acme.corp’ does not exist or is incorrectly formatted. An example of a valid userId is: ‘192829814…’. You can try calling user_search() to resolve this issue.”

The latter is clearly better at guiding the AI onto the right path.
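Centralizing this kind of message in a small helper keeps every tool’s errors equally instructive. The function below is an illustrative assumption; the userId format and the suggested `user_search()` follow-up mirror the example above.

```python
# Hypothetical error builder: name the problem, show a valid example,
# and point the agent at a tool that can fix it.
def user_not_found_error(bad_user_id: str) -> str:
    return (
        "# Resource Not Found: Invalid userId.\n"
        f"Your request failed because the userId '{bad_user_id}' does not "
        "exist or is incorrectly formatted. A valid userId looks like "
        "'192829814'. You can try calling user_search() to resolve this."
    )
```

Returning this string as the tool result (rather than raising an exception the harness swallows) is what puts the guidance in front of the model.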

Rule 5: The Most Powerful Lever—A Good Description is Worth a Thousand Lines of Code

Finally, we come to the most effective and most often overlooked part: prompt-engineering your tool descriptions.

The description and specifications of a tool are loaded into the AI’s context and directly affect its behavior. When writing, imagine you are explaining the tool to a new team member. Write out all the background knowledge you might take for granted—specific query formats, definitions of technical terms, relationships between resources—explicitly.

Avoid ambiguity, especially in parameter naming. Don’t use a vague user; use a clear user_id.
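Putting those two ideas together, a prompt-engineered tool description might look like the sketch below. The billing domain, field names, and wording are illustrative assumptions; note how the description spells out a domain term (“invoice”), the date format, and the result limit, and how `user_id` leaves no doubt about what value is expected.

```python
# Hypothetical tool spec whose description front-loads the background
# knowledge a new team member would need.
search_invoices = {
    "name": "billing_invoices_search",
    "description": (
        "Search invoices in the billing system. An 'invoice' is a single "
        "charge attempt; a retried purchase appears as multiple invoices "
        "sharing the same order_id. Dates use ISO 8601 (YYYY-MM-DD). "
        "Returns at most 20 results; narrow the date range if you hit "
        "the limit."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            # 'user_id', not a vague 'user': the expected value is explicit.
            "user_id": {"type": "string",
                        "description": "Numeric customer ID, e.g. '9182'."},
            "start_date": {"type": "string",
                           "description": "Earliest date, ISO 8601."},
        },
        "required": ["user_id"],
    },
}
```

Every sentence in that description is context the model would otherwise have to guess.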

Small changes can bring huge performance improvements. For example, Claude Sonnet 3.5 achieved state-of-the-art performance on the SWE-bench Verified evaluation precisely because we made precise refinements to the tool descriptions, significantly reducing the error rate.

Looking to the Future: A New Development Model of Co-evolution with AI

Building tools for AI agents requires us to shift our software development mindset from a predictable, deterministic world to a non-deterministic world full of change.

Through the iterative, evaluation-driven development process we have described, you will find that efficient tools share some common characteristics: they are goal-oriented, make good use of the AI’s context, can be flexibly combined, and allow the AI to intuitively solve real-world problems.

In the future, as LLMs themselves and interactive protocols like MCP continue to upgrade, the way AI interacts with the world will also continue to evolve. But as long as we adhere to this systematic optimization method, we can ensure that the tools in our hands can grow and evolve alongside the increasingly powerful AI.

Article Source

https://www.anthropic.com/engineering/writing-tools-for-agents


© 2026 Communeify. All rights reserved.