How we built our multi-agent research system
Original: https://www.anthropic.com/engineering/built-multi-agent-research-system
Authors: Jeremy Hadfield, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, and Daniel Ford
Translator: Gemini 2.5 Pro
Our Research feature uses multiple Claude agents to explore complex topics more effectively. We share the engineering challenges and the lessons we learned from building this system.
Claude now has Research capabilities that allow it to search across the web, Google Workspace, and any integrations to accomplish complex tasks.
The journey of this multi-agent system from prototype to production taught us critical lessons about system architecture, tool design, and prompt engineering. A multi-agent system consists of multiple agents (LLMs autonomously using tools in a loop) working together. Our Research feature involves an agent that plans a research process based on user queries, and then uses tools to create parallel agents that search for information simultaneously. Systems with multiple agents introduce new challenges in agent coordination, evaluation, and reliability.
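To ground the definition of an agent as an LLM autonomously using tools in a loop, here is a minimal single-agent sketch using the Anthropic Python SDK. The `web_search` tool schema and the `run_tool` dispatcher are illustrative assumptions, not the production Research system:

```python
# Minimal "LLM using tools in a loop" sketch (Anthropic Python SDK).
# The web_search tool and run_tool dispatcher are illustrative stand-ins.
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "web_search",
    "description": "Search the web and return result snippets.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def run_tool(name: str, tool_input: dict) -> str:
    """Dispatch a tool call; a real system would hit a search API here."""
    raise NotImplementedError

def agent_loop(task: str, model: str = "claude-sonnet-4-20250514") -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model=model, max_tokens=4096, tools=TOOLS, messages=messages)
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # No further tool calls: return the final text answer.
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute every requested tool call, feed results back, and loop.
        messages.append({"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": run_tool(b.name, b.input)}
            for b in response.content if b.type == "tool_use"]})
```

A multi-agent system composes several of these loops, with one orchestrating loop delegating to others.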
This post breaks down the principles that worked for us—we hope you’ll find them useful to apply when building your own multi-agent systems.
Benefits of a multi-agent system
Research work involves open-ended problems where it’s very difficult to predict the required steps in advance. You can’t hardcode a fixed path for exploring complex topics, as the process is inherently dynamic and path-dependent. When people conduct research, they tend to continuously update their approach based on discoveries, following leads that emerge during investigation.
This unpredictability makes AI agents particularly well-suited for research tasks. Research demands the flexibility to pivot or explore tangential connections as the investigation unfolds. The model must operate autonomously for many turns, making decisions about which directions to pursue based on intermediate findings. A linear, one-shot pipeline cannot handle these tasks.
The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent. Each subagent also provides separation of concerns—distinct tools, prompts, and exploration trajectories—which reduces path dependency and enables thorough, independent investigations.
Once intelligence reaches a threshold, multi-agent systems become a vital way to scale performance. For instance, although individual humans have become more intelligent in the last 100,000 years, human societies have become exponentially more capable in the information age because of our collective intelligence and ability to coordinate. Even generally-intelligent agents face limits when operating as individuals; groups of agents can accomplish far more.
Our internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously. We found that a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval. For example, when asked to identify all the board members of the companies in the Information Technology S&P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single agent system failed to find the answer with slow, sequential searches.
Multi-agent systems work mainly because they help spend enough tokens to solve the problem. In our analysis, three factors explained 95% of the performance variance in the BrowseComp evaluation (which tests the ability of browsing agents to locate hard-to-find information). We found that token usage by itself explains 80% of the variance, with the number of tool calls and the model choice as the two other explanatory factors. This finding validates our architecture that distributes work across agents with separate context windows to add more capacity for parallel reasoning. The latest Claude models act as large efficiency multipliers on token use, as upgrading to Claude Sonnet 4 is a larger performance gain than doubling the token budget on Claude Sonnet 3.7. Multi-agent architectures effectively scale token usage for tasks that exceed the limits of single agents.
There is a downside: in practice, these architectures burn through tokens fast. In our data, agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats. For economic viability, multi-agent systems require tasks valuable enough to pay for the increased performance. Further, some domains that require all agents to share the same context or involve many dependencies between agents are not a good fit for multi-agent systems today. For instance, most coding tasks involve fewer truly parallelizable tasks than research, and LLM agents are not yet great at coordinating and delegating to other agents in real time. We’ve found that multi-agent systems excel at valuable tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools.
Architecture overview for Research
Our Research system uses a multi-agent architecture with an orchestrator-worker pattern, where a lead agent coordinates the process while delegating to specialized subagents that operate in parallel.
The multi-agent architecture in action: user queries flow through a lead agent that creates specialized subagents to search for different aspects in parallel.
When a user submits a query, the lead agent analyzes it, develops a strategy, and spawns subagents to explore different aspects simultaneously. As shown in the diagram above, the subagents act as intelligent filters by iteratively using search tools to gather information, in this case on AI agent companies in 2025, and then returning a list of companies to the lead agent so it can compile a final answer.
Traditional approaches using Retrieval Augmented Generation (RAG) use static retrieval. That is, they fetch some set of chunks that are most similar to an input query and use these chunks to generate a response. In contrast, our architecture uses a multi-step search that dynamically finds relevant information, adapts to new findings, and analyzes results to formulate high-quality answers.
Process diagram showing the complete workflow of our multi-agent Research system. When a user submits a query, the system creates a LeadResearcher agent that enters an iterative research process. The LeadResearcher begins by thinking through the approach and saving its plan to Memory to persist the context, since if the context window exceeds 200,000 tokens it will be truncated and it is important to retain the plan. It then creates specialized Subagents (two are shown here, but it can be any number) with specific research tasks. Each Subagent independently performs web searches, evaluates tool results using interleaved thinking, and returns findings to the LeadResearcher. The LeadResearcher synthesizes these results and decides whether more research is needed—if so, it can create additional subagents or refine its strategy. Once sufficient information is gathered, the system exits the research loop and passes all findings to a CitationAgent, which processes the documents and research report to identify specific locations for citations. This ensures all claims are properly attributed to their sources. The final research results, complete with citations, are then returned to the user.
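A condensed skeleton of that loop might look like the sketch below. Every interface here (`lead.plan`, `run_subagent`, `citations.add_citations`, and so on) is a hypothetical stand-in; the real LeadResearcher, Memory, and CitationAgent components are not public:

```python
# Hypothetical skeleton of the research loop in the diagram above.
import asyncio

async def research(query: str, lead, memory, citations, run_subagent) -> str:
    plan = await lead.plan(query)
    memory.save("research_plan", plan)    # persist the plan before truncation
    findings: list = []
    while True:
        tasks = await lead.delegate(plan, findings)   # subtask descriptions
        # Subagents run in parallel, each with its own context window.
        findings += await asyncio.gather(*(run_subagent(t) for t in tasks))
        if await lead.enough_information(findings):
            break
        plan = await lead.refine_strategy(plan, findings)
    report = await lead.synthesize(query, findings)
    return await citations.add_citations(report, findings)  # CitationAgent pass
```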
Prompt engineering and evaluations for research agents
Multi-agent systems have key differences from single-agent systems, including a rapid growth in coordination complexity. Early agents made errors like spawning 50 subagents for simple queries, scouring the web endlessly for nonexistent sources, and distracting each other with excessive updates. Since each agent is steered by a prompt, prompt engineering was our primary lever for improving these behaviors. Below are some principles we learned for prompting agents:
- Think like your agents. To iterate on prompts, you must understand their effects. To help us do this, we built simulations using our Console with the exact prompts and tools from our system, then watched agents work step-by-step. This immediately revealed failure modes: agents continuing when they already had sufficient results, using overly verbose search queries, or selecting incorrect tools. Effective prompting relies on developing an accurate mental model of the agent, which can make the most impactful changes obvious.
- Teach the orchestrator how to delegate. In our system, the lead agent decomposes queries into subtasks and describes them to subagents. Each subagent needs an objective, an output format, guidance on the tools and sources to use, and clear task boundaries. Without detailed task descriptions, agents duplicate work, leave gaps, or fail to find necessary information. We started by allowing the lead agent to give simple, short instructions like ‘research the semiconductor shortage,’ but found these instructions were often so vague that subagents misinterpreted the task or performed the exact same searches as other agents. For instance, one subagent explored the 2021 automotive chip crisis while two others duplicated work investigating current 2025 supply chains, with no effective division of labor. (A sketch of a structured task description follows after this list.)
- Scale effort to query complexity. Agents struggle to judge appropriate effort for different tasks, so we embedded scaling rules in the prompts. Simple fact-finding requires just 1 agent with 3-10 tool calls, direct comparisons might need 2-4 subagents with 10-15 calls each, and complex research might use more than 10 subagents with clearly divided responsibilities. These explicit guidelines help the lead agent allocate resources efficiently and prevent overinvestment in simple queries, which was a common failure mode in our early versions. (An illustrative prompt fragment encoding these rules appears after this list.)
- Tool design and selection are critical. Agent-tool interfaces are as critical as human-computer interfaces. Using the right tool is efficient—often, it’s strictly necessary. For instance, an agent searching the web for context that only exists in Slack is doomed from the start. With MCP servers that give the model access to external tools, this problem compounds, as agents encounter unseen tools with descriptions of wildly varying quality. We gave our agents explicit heuristics: for example, examine all available tools first, match tool usage to user intent, search the web for broad external exploration, or prefer specialized tools over generic ones. Bad tool descriptions can send agents down completely wrong paths, so each tool needs a distinct purpose and a clear description.
- Let agents improve themselves. We found that the Claude 4 models can be excellent prompt engineers. When given a prompt and a failure mode, they are able to diagnose why the agent is failing and suggest improvements. We even created a tool-testing agent—when given a flawed MCP tool, it attempts to use the tool and then rewrites the tool description to avoid failures. By testing the tool dozens of times, this agent found key nuances and bugs. This process for improving tool ergonomics resulted in a 40% decrease in task completion time for future agents using the new description, because they were able to avoid most mistakes. (A sketch of this tool-testing loop follows after this list.)
- Start wide, then narrow down. Search strategy should mirror expert human research: explore the landscape before drilling into specifics. Agents often default to overly long, specific queries that return few results. We counteracted this tendency by prompting agents to start with short, broad queries, evaluate what’s available, then progressively narrow focus.
- Guide the thinking process. Extended thinking mode, which leads Claude to output additional tokens in a visible thinking process, can serve as a controllable scratchpad. The lead agent uses thinking to plan its approach, assessing which tools fit the task, determining query complexity and subagent count, and defining each subagent’s role. Our testing showed that extended thinking improved instruction-following, reasoning, and efficiency. Subagents also plan, then use interleaved thinking after tool results to evaluate quality, identify gaps, and refine their next query. This makes subagents more effective in adapting to any task.
- Parallel tool calling transforms speed and performance. Complex research tasks naturally involve exploring many sources. Our early agents executed sequential searches, which was painfully slow. For speed, we introduced two kinds of parallelization: (1) the lead agent spins up 3-5 subagents in parallel rather than serially; (2) the subagents use 3+ tools in parallel. These changes cut research time by up to 90% for complex queries, allowing Research to do more work in minutes instead of hours while covering more information than other systems. (A sketch of the second kind of parallelism appears after this list.)
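Picking up the delegation bullet above: one way to keep task descriptions detailed is to make the delegation format explicit rather than free-form. A minimal sketch, with illustrative field names that are not the production schema:

```python
# Structured task description the lead agent fills in per subagent
# (field names are illustrative, not the production schema).
from dataclasses import dataclass, field

@dataclass
class SubagentTask:
    objective: str            # e.g. "List 2025 fab capacity announcements"
    output_format: str        # e.g. "Markdown table: company, fab, date"
    tool_guidance: list[str] = field(default_factory=list)  # tools/sources to prefer
    boundaries: str = ""      # what NOT to research, to avoid overlapping work
    tool_call_budget: int = 10  # rough effort cap for this subtask
```

The boundaries field is what guards against the duplicated-searches failure described above.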
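The effort-scaling rules live in the prompt rather than in code. A hedged example of what such prompt text might look like, using the numbers from the scaling bullet above:

```python
# Illustrative prompt fragment encoding the effort-scaling guidelines.
SCALING_RULES = """\
Scale your research effort to query complexity:
- Simple fact-finding: 1 subagent with 3-10 tool calls.
- Direct comparisons: 2-4 subagents with 10-15 tool calls each.
- Complex research: 10+ subagents with clearly divided responsibilities.
Never spawn more subagents than the task requires."""
```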
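The tool-testing agent can be sketched as a simple loop: exercise the flawed tool repeatedly, collect the failures, and ask the model to rewrite the description. The `call_tool` and `ask_model` helpers here are assumptions:

```python
# Sketch of a tool-testing agent that rewrites flawed tool descriptions.
# call_tool and ask_model are assumed callables; this is not the real agent.
def improve_tool_description(tool: dict, test_inputs: list[dict],
                             call_tool, ask_model) -> str:
    failures = []
    for test_input in test_inputs:    # dozens of trial invocations
        try:
            call_tool(tool["name"], test_input)
        except Exception as exc:
            failures.append({"input": test_input, "error": str(exc)})
    return ask_model(
        f"Tool description:\n{tool['description']}\n\n"
        f"Failed invocations: {failures}\n\n"
        "Rewrite the description so future agents avoid these failure modes.")
```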
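And the second kind of parallelization, concretely: gather all tool calls from a single model turn instead of awaiting them one at a time. `async_run_tool` is an assumed async wrapper around the real tool implementations:

```python
# Run every tool call from one model turn concurrently rather than serially.
import asyncio

async def execute_tool_calls(tool_blocks, async_run_tool) -> list[dict]:
    outputs = await asyncio.gather(
        *(async_run_tool(block.name, block.input) for block in tool_blocks))
    return [{"type": "tool_result", "tool_use_id": block.id, "content": output}
            for block, output in zip(tool_blocks, outputs)]
```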
Our prompting strategy focuses on instilling good heuristics rather than rigid rules. We studied how skilled humans approach research tasks and encoded these strategies in our prompts—strategies like decomposing difficult questions into smaller tasks, carefully evaluating the quality of sources, adjusting search approaches based on new information, and recognizing when to focus on depth (investigating one topic in detail) vs. breadth (exploring many topics in parallel). We also proactively mitigated unintended side effects by setting explicit guardrails to prevent the agents from spiraling out of control. Finally, we focused on a fast iteration loop with observability and test cases.
Effective evaluation of agents
Good evaluations are essential for building reliable AI applications, and agents are no different. However, evaluating multi-agent systems presents unique challenges. Traditional evaluations often assume that the AI follows the same steps each time: given input X, the system should follow path Y to produce output Z. But multi-agent systems don’t work this way. Even with identical starting points, agents might take completely different valid paths to reach their goal. One agent might search three sources while another searches ten, or they might use different tools to find the same answer. Because we don’t always know what the right steps are, we usually can’t just check if agents followed the “correct” steps we prescribed in advance. Instead, we need flexible evaluation methods that judge whether agents achieved the right outcomes while also following a reasonable process.
Start evaluating immediately with small samples. In early agent development, changes tend to have dramatic impacts because there is abundant low-hanging fruit. A prompt tweak might boost success rates from 30% to 80%. With effect sizes this large, you can spot changes with just a few test cases. We started with a set of about 20 queries representing real usage patterns. Testing these queries often allowed us to clearly see the impact of changes. We often hear that AI developer teams delay creating evals because they believe that only large evals with hundreds of test cases are useful. However, it’s best to start with small-scale testing right away with a few examples, rather than delaying until you can build more thorough evals.
LLM-as-judge evaluation scales when done well. Research outputs are difficult to evaluate programmatically, since they are free-form text and rarely have a single correct answer. LLMs are a natural fit for grading outputs. We used an LLM judge that evaluated each output against criteria in a rubric: factual accuracy (do claims match sources?), citation accuracy (do the cited sources match the claims?), completeness (are all requested aspects covered?), source quality (did it use primary sources over lower-quality secondary sources?), and tool efficiency (did it use the right tools a reasonable number of times?). We experimented with multiple judges to evaluate each component, but found that a single LLM call with a single prompt outputting scores from 0.0-1.0 and a pass-fail grade was the most consistent and aligned with human judgements. This method was especially effective when the eval test cases did have a clear answer, and we could use the LLM judge to simply check if the answer was correct (i.e. did it accurately list the pharma companies with the top 3 largest R&D budgets?). Using an LLM as a judge allowed us to scalably evaluate hundreds of outputs.
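A single-call judge along these lines might look like the sketch below; the rubric wording, JSON schema, and model id are assumptions, not the internal grading prompt:

```python
# Single-call LLM judge with a rubric; wording and schema are illustrative.
import json
import anthropic

client = anthropic.Anthropic()

RUBRIC = (
    "Grade the research report against its sources. Score each criterion "
    "from 0.0 to 1.0: factual_accuracy, citation_accuracy, completeness, "
    "source_quality, tool_efficiency. Finish with \"verdict\": \"pass\" or "
    "\"fail\". Respond with a single JSON object and nothing else."
)

def judge(query: str, report: str, sources: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # model id is an assumption
        max_tokens=1024,
        messages=[{"role": "user", "content":
                   f"{RUBRIC}\n\nQuery: {query}\nReport: {report}\n"
                   f"Sources: {sources}"}],
    )
    return json.loads(response.content[0].text)
```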
Human evaluation catches what automation misses. People testing agents find edge cases that evals miss. These include hallucinated answers on unusual queries, system failures, or subtle source selection biases. In our case, human testers noticed that our early agents consistently chose SEO-optimized content farms over authoritative but less highly-ranked sources like academic PDFs or personal blogs. Adding source quality heuristics to our prompts helped resolve this issue. Even in a world of automated evaluations, manual testing remains essential.
Multi-agent systems have emergent behaviors, which arise without specific programming. For instance, small changes to the lead agent can unpredictably change how subagents behave. Success requires understanding interaction patterns, not just individual agent behavior. Therefore, the best prompts for these agents are not just strict instructions, but frameworks for collaboration that define the division of labor, problem-solving approaches, and effort budgets. Getting this right relies on careful prompting and tool design, solid heuristics, observability, and tight feedback loops. See the open-source prompts in our Cookbook for example prompts from our system.
Production reliability and engineering challenges
In traditional software, a bug might break a feature, degrade performance, or cause outages. In agentic systems, minor changes cascade into large behavioral changes, which makes it remarkably difficult to write code for complex agents that must maintain state in a long-running process.
Agents are stateful and errors compound. Agents can run for long periods of time, maintaining state across many tool calls. This means we need to durably execute code and handle errors along the way. Without effective mitigations, minor system failures can be catastrophic for agents. When errors occur, we can’t just restart from the beginning: restarts are expensive and frustrating for users. Instead, we built systems that can resume from where the agent was when the errors occurred. We also use the model’s intelligence to handle issues gracefully: for instance, letting the agent know when a tool is failing and letting it adapt works surprisingly well. We combine the adaptability of AI agents built on Claude with deterministic safeguards like retry logic and regular checkpoints.
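The two safeguards can be sketched side by side: deterministic retries with durable checkpoints, and surfacing persistent failures to the model so it can adapt. `TransientToolError` and the `run_tool`/`store` interfaces are illustrative:

```python
# Retry + checkpoint + graceful degradation for a single tool call.
class TransientToolError(Exception):
    """Illustrative stand-in for a retryable tool failure."""

def call_tool_with_safeguards(run_tool, store, agent_id: str, step: int,
                              name: str, args: dict, retries: int = 3) -> dict:
    for _ in range(retries):                      # deterministic retry logic
        try:
            result = run_tool(name, args)
            store.save(agent_id, step, result)    # resume point, not a restart
            return {"type": "tool_result", "content": result}
        except TransientToolError:
            continue
    # Tell the agent the tool is failing instead of crashing the run;
    # the model can then switch tools or rephrase its request.
    return {"type": "tool_result", "is_error": True,
            "content": f"Tool {name} is failing; try a different approach."}
```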
Debugging benefits from new approaches. Agents make dynamic decisions and are non-deterministic between runs, even with identical prompts. This makes debugging harder. For instance, users would report agents “not finding obvious information,” but we couldn’t see why. Were the agents using bad search queries? Choosing poor sources? Hitting tool failures? Adding full production tracing let us diagnose why agents failed and fix issues systematically. Beyond standard observability, we monitor agent decision patterns and interaction structures—all without monitoring the contents of individual conversations, to maintain user privacy. This high-level observability helped us diagnose root causes, discover unexpected behaviors, and fix common failures.
Deployment needs careful coordination. Agent systems are highly stateful webs of prompts, tools, and execution logic that run almost continuously. This means that whenever we deploy updates, agents might be anywhere in their process. We therefore need to prevent our well-meaning code changes from breaking existing agents. We can’t update every agent to the new version at the same time. Instead, we use rainbow deployments to avoid disrupting running agents, by gradually shifting traffic from old to new versions while keeping both running simultaneously.
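At its core, the pattern pins in-flight agents to the version they started on while new sessions gradually shift to the new release. A toy sketch of that routing decision, under those assumptions:

```python
# Rainbow-deployment sketch: running agents stay pinned to their version;
# only new sessions are gradually shifted to the new release.
import random
from dataclasses import dataclass

@dataclass
class Session:
    version: str | None = None    # set when the agent first starts

def pick_version(session: Session, rollout_fraction: float,
                 old: str = "v1", new: str = "v2") -> str:
    if session.version is None:   # never switch an in-flight agent mid-task
        session.version = new if random.random() < rollout_fraction else old
    return session.version
```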
Synchronous execution creates bottlenecks. Currently, our lead agents execute subagents synchronously, waiting for each set of subagents to complete before proceeding. This simplifies coordination, but creates bottlenecks in the information flow between agents. For instance, the lead agent can’t steer subagents, subagents can’t coordinate, and the entire system can be blocked while waiting for a single subagent to finish searching. Asynchronous execution would enable additional parallelism: agents working concurrently and creating new subagents when needed. But this asynchronicity adds challenges in result coordination, state consistency, and error propagation across the subagents. As models can handle longer and more complex research tasks, we expect the performance gains will justify the complexity.
Conclusion
When building AI agents, the last mile often becomes most of the journey. Codebases that work on developer machines require significant engineering to become reliable production systems. The compound nature of errors in agentic systems means that minor issues for traditional software can derail agents entirely. One step failing can cause agents to explore entirely different trajectories, leading to unpredictable outcomes. For all the reasons described in this post, the gap between prototype and production is often wider than anticipated.
Despite these challenges, multi-agent systems have proven valuable for open-ended research tasks. Users have said that Claude helped them find business opportunities they hadn’t considered, navigate complex healthcare options, resolve thorny technical bugs, and save up to days of work by uncovering research connections they wouldn’t have found alone. Multi-agent research systems can operate reliably at scale with careful engineering, comprehensive testing, detail-oriented prompt and tool design, robust operational practices, and tight collaboration between research, product, and engineering teams who have a strong understanding of current agent capabilities. We’re already seeing these systems transform how people solve complex problems.
A Clio embedding plot showing the most common ways people are using the Research feature today. The top use case categories are developing software systems across specialized domains (10%), developing and optimizing professional and technical content (8%), developing business growth and revenue generation strategies (8%), assisting with academic research and educational material development (7%), and researching and verifying information about people, places, or organizations (5%).
Acknowledgements
Written by Jeremy Hadfield, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, and Daniel Ford. This work reflects the collective efforts of several teams across Anthropic who made the Research feature possible. Special thanks go to the Anthropic apps engineering team, whose dedication brought this complex multi-agent system to production. We’re also grateful to our early users for their excellent feedback.
Appendix
Below are some additional miscellaneous tips for multi-agent systems.
End-state evaluation of agents that mutate state over many turns. Evaluating agents that modify persistent state across multi-turn conversations presents unique challenges. Unlike read-only research tasks, each action can change the environment for subsequent steps, creating dependencies that traditional evaluation methods struggle to handle. We found success focusing on end-state evaluation rather than turn-by-turn analysis. Instead of judging whether the agent followed a specific process, evaluate whether it achieved the correct final state. This approach acknowledges that agents may find alternative paths to the same goal while still ensuring they deliver the intended outcome. For complex workflows, break evaluation into discrete checkpoints where specific state changes should have occurred, rather than attempting to validate every intermediate step.
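In test-harness form, end-state evaluation reduces to comparing a snapshot of the final environment against an expectation. `make_environment` and `run_agent` are assumed helpers:

```python
# End-state evaluation: check the final environment state, not the path taken.
def evaluate_end_state(task: str, expected_state: dict,
                       make_environment, run_agent) -> bool:
    env = make_environment()
    run_agent(task, env)               # any valid sequence of actions is fine
    final = env.snapshot()             # e.g. database rows, files, tickets
    return all(final.get(key) == value
               for key, value in expected_state.items())
```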
Long-horizon conversation management. Production agents often engage in conversations spanning hundreds of turns, requiring careful context management strategies. As conversations extend, standard context windows become insufficient, necessitating intelligent compression and memory mechanisms. We implemented patterns where agents summarize completed work phases and store essential information in external memory before proceeding to new tasks. When context limits approach, agents can spawn fresh subagents with clean contexts while maintaining continuity through careful handoffs. Further, they can retrieve stored context like the research plan from their memory rather than losing previous work when reaching the context limit. This distributed approach prevents context overflow while preserving conversation coherence across extended interactions.
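A sketch of the compaction pattern, assuming hypothetical `count_tokens`, `summarize`, and `memory` helpers:

```python
# Compaction pattern: near the context limit, summarize the finished phase
# into external memory and continue with a clean context carrying the plan.
CONTEXT_LIMIT = 200_000   # tokens, per the truncation threshold noted above

def maybe_compact(messages: list[dict], agent_id: str,
                  count_tokens, summarize, memory) -> list[dict]:
    if count_tokens(messages) < 0.9 * CONTEXT_LIMIT:
        return messages
    summary = summarize(messages)
    memory.store(agent_id, "phase_summary", summary)
    plan = memory.load(agent_id, "research_plan")   # don't lose earlier work
    return [{"role": "user", "content":
             f"Research plan:\n{plan}\n\nCompleted work so far:\n{summary}"}]
```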
Subagent output to a filesystem to minimize the ‘game of telephone.’ Direct subagent outputs can bypass the main coordinator for certain types of results, improving both fidelity and performance. Rather than requiring subagents to communicate everything through the lead agent, implement artifact systems where specialized agents can create outputs that persist independently. Subagents call tools to store their work in external systems, then pass lightweight references back to the coordinator. This prevents information loss during multi-stage processing and reduces token overhead from copying large outputs through conversation history. The pattern works particularly well for structured outputs like code, reports, or data visualizations where the subagent’s specialized prompt produces better results than filtering through a general coordinator.
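A minimal version of the artifact pattern, using the local filesystem as a stand-in for whatever shared store a production system would use:

```python
# Minimal artifact store: the subagent persists its full output and hands the
# lead agent only a short reference, avoiding the "game of telephone."
import hashlib
from pathlib import Path

ARTIFACT_DIR = Path("artifacts")      # any shared store would do

def store_artifact(content: str) -> str:
    ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(content.encode()).hexdigest()[:12]
    path = ARTIFACT_DIR / f"{key}.md"
    path.write_text(content)
    return str(path)                  # lightweight reference for the coordinator

def load_artifact(reference: str) -> str:
    return Path(reference).read_text()
```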