译:关于 AI Evals 的常见问题(及解答)

发布于 2025年7月10日

原文: https://hamel.dev/blog/posts/evals-faq/
作者: Hamel Husain, Shreya Shankar
译者: Gemini 2.5 Pro

This document curates the most common questions Shreya and I received while teaching 700+ engineers & PMs AI Evals. Warning: These are sharp opinions about what works in most cases. They are not universal truths. Use your judgment.

这份文档整理了 Shreya 和我在向 700 多名工程师和产品经理教授 AI Evals 时收到的最常见问题。警告:这些是关于在大多数情况下什么方法有效的尖锐观点。它们并非普适真理。请自行判断。


👉 We are teaching our last and final cohort of our AI Evals course next month (we have to get back to building). Here is a 35% discount code for readers. 👈

👉 下个月我们将教授我们 AI Evals 课程的最后一期学员 (我们得回去继续搞开发了)。这里是为读者准备的 35% 折扣码 👈


问:什么是 LLM Evals?

If you are completely new to product-specific LLM evals (not foundation model benchmarks), see these posts: part 1, part 2, part 3. Otherwise, keep reading.

如果你对面向特定产品的 LLM 评估(而非基础模型基准测试)完全陌生,请看这几篇文章:第一部分、第二部分、第三部分。否则,请继续阅读。

问:RAG 已死吗?

Question: Should I avoid using RAG for my AI application after reading that “RAG is dead” for coding agents?

Many developers are confused about when and how to use RAG after reading articles claiming “RAG is dead.” Understanding what RAG actually means versus the narrow marketing definitions will help you make better architectural decisions for your AI applications.

问:在读到一篇关于编码 agent 的文章说 “RAG 已死” 之后,我应该在我的 AI 应用中避免使用 RAG 吗?

许多开发者在读了那些声称“RAG 已死”的文章后,对何时以及如何使用 RAG 感到困惑。理解 RAG 的真正含义,而不是那些狭隘的营销定义,将帮助你为你的 AI 应用做出更好的架构决策。

The viral article claiming RAG is dead specifically argues against using naive vector database retrieval for autonomous coding agents, not RAG as a whole. This is a crucial distinction that many developers miss due to misleading marketing.

那篇宣称 RAG 已死的爆款文章,特指反对为自主编码 agent 使用朴素的(naive)向量数据库检索,而不是反对整个 RAG。这是一个关键的区别,许多开发者因为误导性的营销而忽略了这一点。

RAG simply means Retrieval-Augmented Generation - using retrieval to provide relevant context that improves your model’s output. The core principle remains essential: your LLM needs the right context to generate accurate answers. The question isn’t whether to use retrieval, but how to retrieve effectively.

RAG 的意思很简单,就是检索增强生成(Retrieval-Augmented Generation)——利用检索来提供相关上下文,以改善模型的输出。其核心原则始终至关重要:你的 LLM 需要正确的上下文才能生成准确的答案。问题不在于是否使用检索,而在于如何有效地检索。

For coding applications, naive vector similarity search often fails because code relationships are complex and contextual. Instead of abandoning retrieval entirely, modern coding assistants like Claude Code still use retrieval—they just employ agentic search instead of relying solely on vector databases, similar to how human developers work.

对于编码应用,朴素的向量相似性搜索常常失败,因为代码关系复杂且依赖上下文。现代的编码助手,如 Claude Code,并没有完全放弃检索,它们仍然在使用检索——只是它们采用了 agentic search,而不是仅仅依赖向量数据库,这与人类开发者的工作方式相似。

You have multiple retrieval strategies available, ranging from simple keyword matching to embedding similarity to LLM-powered relevance filtering. The optimal approach depends on your specific use case, data characteristics, and performance requirements. Many production systems combine multiple strategies or use multi-hop retrieval guided by LLM agents.

你有多种检索策略可选,从简单的关键词匹配,到 embedding 相似度,再到由 LLM 驱动的相关性过滤。最佳方法取决于你的具体用例、数据特性和性能要求。许多生产系统会结合多种策略,或使用由 LLM agent 指导的多跳检索。

Unfortunately, “RAG” has become a buzzword with no shared definition. Some people use it to mean any retrieval system, others restrict it to vector databases. Focus on the ultimate goal: getting your LLM the context it needs to succeed. Whether that’s through vector search, agentic exploration, or hybrid approaches is a product and engineering decision.

不幸的是,“RAG”已经成了一个没有共同定义的流行词。有些人用它指代任何检索系统,另一些人则将其限定于向量数据库。你应该专注于最终目标:为你的 LLM 提供成功所需的上下文。至于通过向量搜索、agentic 探索还是混合方法来实现,则是一个产品和工程决策。

Rather than following categorical advice to avoid or embrace RAG, experiment with different retrieval approaches and measure what works best for your application.

与其盲从那些要么避免要么拥抱 RAG 的绝对建议,不如去试验不同的检索方法,并衡量哪种最适合你的应用。

问:我能用同一个模型来处理主任务和评估吗?

For LLM-as-Judge selection, using the same model is usually fine because the judge is doing a different task than your main LLM pipeline. The judges we recommend building do scoped binary classification tasks. Focus on achieving high True Positive Rate (TPR) and True Negative Rate (TNR) with your judge on a held out labeled test set rather than avoiding the same model family. You can use these metrics on the test set to understand how well your judge is doing.

对于 LLM-as-Judge 的选择,使用相同的模型通常没问题,因为评判模型执行的任务与你的主 LLM 流水线不同。我们推荐构建的评判模型执行的是限定范围的二元分类任务。你应该专注于在留存的已标注测试集上,让你的评判模型获得较高的真正例率(TPR)和真负例率(TNR),而不是刻意避免使用同系列的模型。你可以用这些在测试集上的指标来了解你的评判模型表现如何。
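
下面是在留存的人工标注测试集上计算评判模型 TPR 和 TNR 的一个最小示意;`human_labels` 与 `judge_labels` 是假设的数据,True 表示"通过"。

```python
def judge_agreement(human_labels, judge_labels):
    """在留存测试集上计算评判模型的 TPR 与 TNR(两个列表等长,True 表示"通过")。"""
    tp = sum(h and j for h, j in zip(human_labels, judge_labels))
    tn = sum((not h) and (not j) for h, j in zip(human_labels, judge_labels))
    fn = sum(h and (not j) for h, j in zip(human_labels, judge_labels))
    fp = sum((not h) and j for h, j in zip(human_labels, judge_labels))
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")  # 真正例率
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")  # 真负例率
    return tpr, tnr

# 假设的数据:人工标注 vs. 评判模型的判定
human_labels = [True, True, False, True, False, False, True, False]
judge_labels = [True, False, False, True, False, True, True, False]
tpr, tnr = judge_agreement(human_labels, judge_labels)
print(f"TPR={tpr:.2f}, TNR={tnr:.2f}")
```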

When selecting judge models, start with the most capable models available to establish strong alignment with human judgments. You can optimize for cost later once you’ve established reliable evaluation criteria. We do not recommend using the same model for open ended preferences or response quality (but we don’t recommend building judges this way in the first place!).

在选择评判模型时,先从最强大的模型开始,以建立与人类判断的高度一致性。一旦你建立了可靠的评估标准,之后再优化成本。我们不推荐将同一模型用于开放式偏好或响应质量的评估(但我们首先就不推荐以这种方式构建评判模型!)。

问:我应该花多少时间在模型选择上?

Many developers fixate on model selection as the primary way to improve their LLM applications. Start with error analysis to understand your failure modes before considering model switching. As Hamel noted in office hours, “I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence. Does error analysis suggest that your model is the problem?”

许多开发者执着于将模型选择作为提升其 LLM 应用的主要方式。在考虑更换模型之前,先从错误分析入手,了解你的失败模式。正如 Hamel 在答疑时间指出的:“我建议,在没有证据的情况下,不要一开始就把更换模型当作提升系统的主要途径。错误分析是否表明你的模型是问题所在?”

问:我应该自建标注工具还是用现成的?

Build a custom annotation tool. This is the single most impactful investment you can make for your AI evaluation workflow. With AI-assisted development tools like Cursor or Lovable, you can build a tailored interface in hours. I often find that teams with custom annotation tools iterate ~10x faster.

自建一个定制的标注工具。 这是你能为你的 AI 评估工作流做的最具影响力的单项投资。借助像 Cursor 或 Lovable 这样的 AI 辅助开发工具,你可以在几小时内构建出一个量身定制的界面。我常常发现,拥有定制标注工具的团队迭代速度快了约 10 倍。

Custom tools excel because:

  • They show all your context from multiple systems in one place
  • They can render your data in a product specific way (images, widgets, markdown, buttons, etc.)
  • They’re designed for your specific workflow (custom filters, sorting, progress bars, etc.)

定制工具之所以出色,是因为:

  • 它们能将来自多个系统的所有上下文集中展示在一处。
  • 它们能以产品特定的方式渲染你的数据(图片、小部件、Markdown、按钮等)。
  • 它们是为你特定的工作流程设计的(自定义筛选、排序、进度条等)。

Off-the-shelf tools may be justified when you need to coordinate dozens of distributed annotators with enterprise access controls. Even then, many teams find the configuration overhead and limitations aren’t worth it.

只有当你需要协调数十个分布式标注员并需要企业级访问控制时,使用现成工具或许才算合理。即便如此,许多团队也发现其配置开销和功能限制得不偿失。

Isaac’s Anki flashcard annotation app shows the power of custom tools—handling 400+ results per query with keyboard navigation and domain-specific evaluation criteria that would be nearly impossible to configure in a generic tool.

Isaac 的 Anki 抽认卡标注应用展示了定制工具的力量——它能处理每个查询超过 400 条结果,支持键盘导航,并包含领域特定的评估标准,这些在通用工具中几乎不可能配置出来。

问:为什么你推荐二元(通过/失败)评估,而不是 1-5 分制(李克特量表)?

Engineers often believe that Likert scales (1-5 ratings) provide more information than binary evaluations, allowing them to track gradual improvements. However, this added complexity often creates more problems than it solves in practice.

工程师们常常认为,李克特量表(1-5 分制)比二元评估提供更多信息,使他们能够追踪渐进式的改进。然而,在实践中,这种增加的复杂性往往弊大于利。

Binary evaluations force clearer thinking and more consistent labeling. Likert scales introduce significant challenges: the difference between adjacent points (like 3 vs 4) is subjective and inconsistent across annotators, detecting statistical differences requires larger sample sizes, and annotators often default to middle values to avoid making hard decisions.

二元评估迫使你进行更清晰的思考和更一致的标注。李克特量表则带来了显著的挑战:相邻分数(如 3 分与 4 分)之间的差异是主观的,在不同标注员之间不一致;检测统计上的差异需要更大的样本量;而且标注员常常为了回避艰难的抉择而默认选择中间值。

Having binary options forces people to make a decision rather than hiding uncertainty in middle values. Binary decisions are also faster to make during error analysis - you don’t waste time debating whether something is a 3 or 4.

二元选项迫使人们做出决定,而不是将不确定性隐藏在中间值里。在错误分析时,做二元决策也更快——你不用浪费时间去争论某个东西到底是 3 分还是 4 分。

For tracking gradual improvements, consider measuring specific sub-components with their own binary checks rather than using a scale. For example, instead of rating factual accuracy 1-5, you could track “4 out of 5 expected facts included” as separate binary checks. This preserves the ability to measure progress while maintaining clear, objective criteria.

为了追踪渐进式的改进,可以考虑用各自的二元检查来衡量特定的子组件,而不是使用量表。例如,与其给事实准确性打 1-5 分,你可以将“包含 5 个预期事实中的 4 个”作为独立的二元检查来追踪。这既保留了衡量进展的能力,又维持了清晰、客观的标准。
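
一个可能的实现示意:不打 1-5 分,而是把每条"预期事实"变成独立的二元检查。`expected_facts` 与回复内容均为假设的例子,简单的字符串包含检查也可以换成 LLM 判断。

```python
# 把"事实准确性 1-5 分"换成逐条的二元检查:每个预期事实要么出现,要么没出现
expected_facts = ["退货窗口为 30 天", "需保留原始发票", "运费由买家承担"]  # 假设的预期事实
response = "我们的退货窗口为 30 天,请保留原始发票。"

checks = {fact: fact in response for fact in expected_facts}  # 粗糙的包含检查,仅作示意
for fact, passed in checks.items():
    print(f"[{'通过' if passed else '失败'}] {fact}")
print(f"{sum(checks.values())}/{len(checks)} 个预期事实被包含")
```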

Start with binary labels to understand what ‘bad’ looks like. Numeric labels are advanced and usually not necessary.

从二元标签开始,以理解“差”是什么样的。数字标签是高级用法,通常没有必要。

问:我该如何调试多轮对话的 trace?

Start simple. Check if the whole conversation met the user’s goal with a pass/fail judgment. Look at the entire trace and focus on the first upstream failure. Read the user-visible parts first to understand if something went wrong. Only then dig into the technical details like tool calls and intermediate steps.

从简单的开始。用一个“通过/失败”的判断来检查整个对话是否达成了用户的目标。审视整个 trace,并专注于第一个上游的失败点。首先阅读用户可见的部分,看是否出了问题。之后再深入研究技术细节,如工具调用和中间步骤。

When you find a failure, reproduce it with the simplest possible test case. Here’s an example: suppose a shopping bot gives the wrong return policy on turn 4 of a conversation. Before diving into the full multi-turn complexity, simplify it to a single turn: “What is the return window for product X1000?” If it still fails, you’ve proven the error isn’t about conversation context - it’s likely a basic retrieval or knowledge issue you can debug more easily.

当你发现一个失败时,用最简单的测试用例来复现它。举个例子:假设一个购物机器人在对话的第四轮给出了错误的退货政策。在深入研究复杂的多轮对话之前,先把它简化为单轮对话:“产品 X1000 的退货窗口是多久?”如果它仍然失败,你就证明了这个错误与对话上下文无关——它很可能是一个基本的检索或知识问题,这样调试起来就容易多了。

For generating test cases, you have two main approaches. First, you can simulate users with another LLM to create realistic multi-turn conversations. Second, use “N-1 testing” where you provide the first N-1 turns of a real conversation and test what happens next. The N-1 approach often works better since it uses actual conversation prefixes rather than fully synthetic interactions (but is less flexible and doesn’t test the full conversation). User simulation is getting better as models improve. Keep an eye on this space.

在生成测试用例方面,你有两种主要方法。第一,你可以用另一个 LLM 来模拟用户,以创建真实的多轮对话。第二,使用“N-1 测试”,即你提供真实对话的前 N-1 轮,然后测试接下来会发生什么。N-1 方法通常效果更好,因为它使用的是真实的对话前缀,而不是完全合成的交互(但它灵活性较差,且不能测试完整的对话)。随着模型的进步,用户模拟的效果正在变好。请持续关注这个领域。
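
下面是 "N-1 测试"思路的一个最小示意:取真实对话的前 N-1 轮作前缀,只测试下一轮的回应。`call_model` 是占位函数,"30 天"这一正确答案也是假设的。

```python
def call_model(messages: list[dict]) -> str:
    # 占位实现:实际使用时替换为你的 LLM 流水线调用
    return "X1000 的退货窗口是 30 天,需保留原始包装。"

real_conversation = [
    {"role": "user", "content": "帮我找一下 X1000 的介绍"},
    {"role": "assistant", "content": "X1000 是一款便携式咖啡机……"},
    {"role": "user", "content": "它的退货窗口是多久?"},  # 第 N 轮:要测试的用户输入
]

prefix = real_conversation[:-1]          # 前 N-1 轮真实对话作为前缀
next_user_turn = real_conversation[-1]   # 第 N 轮用户输入
reply = call_model(prefix + [next_user_turn])

# 用简单断言检查关键事实(假设正确答案是 30 天)
assert "30 天" in reply, f"退货窗口回答有误: {reply}"
```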

The key is balancing thoroughness with efficiency. Not every multi-turn failure requires multi-turn analysis.

关键是在彻底性和效率之间取得平衡。并非每一个多轮对话的失败都需要进行多轮分析。

问:我应该为发现的每一种失败模式都构建自动化评估器吗?

Focus automated evaluators on failures that persist after fixing your prompts. Many teams discover their LLM doesn’t meet preferences they never actually specified - like wanting short responses, specific formatting, or step-by-step reasoning. Fix these obvious gaps first before building complex evaluation infrastructure.

把自动化评估器集中在那些修复了 prompt 之后依然存在的失败上。许多团队发现他们的 LLM 不符合他们从未明确指定过的偏好——比如想要简短的回答、特定的格式或分步推理。在构建复杂的评估基础设施之前,先修复这些明显的缺口。

Consider the cost hierarchy of different evaluator types. Simple assertions and reference-based checks (comparing against known correct answers) are cheap to build and maintain. LLM-as-Judge evaluators require 100+ labeled examples, ongoing weekly maintenance, and coordination between developers, PMs, and domain experts. This cost difference should shape your evaluation strategy.

考虑不同类型评估器的成本层级。简单的断言和基于引用的检查(与已知的正确答案进行比较)构建和维护成本低廉。而 LLM-as-Judge 评估器则需要 100 多个标注样本、持续的每周维护,以及开发人员、产品经理和领域专家之间的协调。这种成本差异应该影响你的评估策略。

Only build expensive evaluators for problems you’ll iterate on repeatedly. Since LLM-as-Judge comes with significant overhead, save it for persistent generalization failures - not issues you can fix trivially. Start with cheap code-based checks where possible: regex patterns, structural validation, or execution tests. Reserve complex evaluation for subjective qualities that can’t be captured by simple rules.

只为那些你需要反复迭代的问题构建昂贵的评估器。由于 LLM-as-Judge 带有巨大的开销,所以把它留给那些持续存在的泛化失败——而不是那些你可以轻易修复的问题。尽可能从成本低廉的基于代码的检查开始:正则表达式、结构验证或执行测试。将复杂的评估留给那些无法用简单规则捕捉的主观质量。
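
下面给出几个成本低廉的基于代码的检查示意(正则、结构验证);其中的规则和字段名只是假设的例子,应换成你自己的约定。

```python
import json
import re

def check_no_email_leak(output: str) -> bool:
    """正则检查:输出中不应出现邮箱地址(假设这是你的一条隐私规则)。"""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", output) is None

def check_valid_json_with_fields(output: str, required_fields=("answer", "sources")) -> bool:
    """结构验证:输出必须是包含指定字段的合法 JSON(字段名为假设)。"""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(field in data for field in required_fields)

sample = '{"answer": "退货窗口为 30 天", "sources": ["policy.md"]}'
print(check_no_email_leak(sample), check_valid_json_with_fields(sample))
```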

问:应该有多少人来标注我的 LLM 输出?

For most small to medium-sized companies, appointing a single domain expert as a “benevolent dictator” is the most effective approach. This person—whether it’s a psychologist for a mental health chatbot, a lawyer for legal document analysis, or a customer service director for support automation—becomes the definitive voice on quality standards.

对于大多数中小型公司而言,任命一位领域专家作为“仁慈的独裁者”是最有效的方法。这个人——无论是心理健康聊天机器人的心理学家、法律文件分析的律师,还是支持自动化的客服总监——将成为质量标准的最终权威。

A single expert eliminates annotation conflicts and prevents the paralysis that comes from “too many cooks in the kitchen”. The benevolent dictator can incorporate input and feedback from others, but they drive the process. If you feel like you need five subject matter experts to judge a single interaction, it’s a sign your product scope might be too broad.

单一专家可以消除标注冲突,避免“厨子多了烧坏汤”所带来的瘫痪。这位“仁慈的独裁者”可以采纳他人的意见和反馈,但由他们来主导整个过程。如果你觉得你需要五位主题专家来评判一次交互,这可能表明你的产品范围太广了。

However, larger organizations or those operating across multiple domains (like a multinational company with different cultural contexts) may need multiple annotators. When you do use multiple people, you’ll need to measure their agreement using metrics like Cohen’s Kappa, which accounts for agreement beyond chance. However, use your judgment. Even in larger companies, a single expert is often enough.

然而,大型组织或跨多个领域运营的组织(如具有不同文化背景的跨国公司)可能需要多位标注员。当你确实使用多人时,你需要使用像 Cohen’s Kappa 这样的指标来衡量他们的一致性,该指标考虑了超出偶然的一致性。不过,还是要运用你的判断。即使在较大的公司里,一位专家也常常足够了。
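
如果确实需要多位标注员,下面是计算两位标注员在二元标签上 Cohen's Kappa 的一个最小示意(数据为假设;也可以直接用 `sklearn.metrics.cohen_kappa_score`)。

```python
def cohens_kappa(labels_a, labels_b):
    """两位标注员、二元标签的 Cohen's Kappa。"""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # 观察到的一致率
    p_a = sum(labels_a) / n  # 标注员 A 打"通过"的比例
    p_b = sum(labels_b) / n  # 标注员 B 打"通过"的比例
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # 纯属偶然达成一致的期望比例
    return (observed - expected) / (1 - expected)

# 假设的标注数据:True 表示"通过"
annotator_a = [True, True, False, True, False, True, False, False]
annotator_b = [True, False, False, True, False, True, True, False]
print(f"Cohen's Kappa = {cohens_kappa(annotator_a, annotator_b):.2f}")
```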

Start with a benevolent dictator whenever feasible. Only add complexity when your domain demands it.

只要可行,就从一位“仁慈的独裁者”开始。只有当你的领域确实需要时,才增加复杂性。

问:我应该准备好自己填补评估工具中的哪些空白?

Most eval tools handle the basics well: logging complete traces, tracking metrics, prompt playgrounds, and annotation queues. These are table stakes. Here are four areas where you’ll likely need to supplement existing tools.

大多数评估工具都能很好地处理基础功能:记录完整的 trace、追踪指标、prompt 游乐场和标注队列。这些都是基本要求。以下是四个你很可能需要自己补充现有工具的领域。

Watch for vendors addressing these gaps—it’s a strong signal they understand practitioner needs.

留意那些正在解决这些空白的服务商——这是一个强烈的信号,表明他们理解从业者的需求。

1. Error Analysis and Pattern Discovery

1. 错误分析与模式发现

After reviewing traces where your AI fails, can your tooling automatically cluster similar issues? For instance, if multiple traces show the assistant using casual language for luxury clients, you need something that recognizes this broader “persona-tone mismatch” pattern. We recommend building capabilities that use AI to suggest groupings, rewrite your observations into clearer failure taxonomies, help find similar cases through semantic search, etc.

在审查 AI 失败的 trace 后,你的工具能否自动聚类相似的问题?例如,如果多个 trace 显示助手对奢侈品客户使用了随意的语言,你需要一个能识别出这种更广泛的“人设-语调不匹配”模式的工具。我们建议构建一些能力,利用 AI 建议分组、将你的观察重写为更清晰的失败分类法、通过语义搜索帮助找到相似案例等。

2. AI-Powered Assistance Throughout the Workflow

2. 贯穿工作流的 AI 辅助

The most effective workflows use AI to accelerate every stage of evaluation. During error analysis, you want an LLM helping categorize your open-ended observations into coherent failure modes. For example, you might annotate several traces with notes like “wrong tone for investor,” “too casual for luxury buyer,” etc. Your tooling should recognize these as the same underlying pattern and suggest a unified “persona-tone mismatch” category.

最高效的工作流会利用 AI 来加速评估的每一个阶段。在错误分析期间,你需要一个 LLM 帮助你将开放式的观察归类为连贯的失败模式。例如,你可能用“对投资者的语调不对”、“对奢侈品买家太随意”等笔记标注了几个 trace。你的工具应该能识别出这些是同一个潜在模式,并建议一个统一的“人设-语调不匹配”类别。
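
一个简化的示意:把开放式标注笔记自动归组成候选失败模式。实践中更常见的是用 embedding 模型或直接让 LLM 归类,这里用 TF-IDF 加 KMeans 做一个轻量的替代;笔记内容是假设的例子。

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# 假设的开放式标注笔记(实践中来自错误分析阶段的人工记录)
notes = [
    "wrong tone for investor",
    "too casual for luxury buyer",
    "missing pet policy in summary",
    "summary omits pet restrictions",
    "overly informal greeting for VIP client",
    "forgot to mention pets not allowed",
]

vectors = TfidfVectorizer().fit_transform(notes)  # 轻量向量化;可替换为 embedding 模型
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for cluster_id in sorted(set(labels)):
    print(f"候选失败模式 #{cluster_id}:")
    for note, label in zip(notes, labels):
        if label == cluster_id:
            print("  -", note)
```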

You’ll also want AI assistance in proposing fixes. After identifying 20 cases where your assistant omits pet policies from property summaries, can your workflow analyze these failures and suggest specific prompt modifications? Can it draft refinements to your SQL generation instructions when it notices patterns of missing WHERE clauses?

你还需要 AI 辅助来提出修复建议。在识别出 20 个助手在房产摘要中遗漏宠物政策的案例后,你的工作流能否分析这些失败并建议具体的 prompt 修改?当它注意到缺少 WHERE 子句的模式时,能否起草对 SQL 生成指令的改进方案?

Additionally, good workflows help you conduct data analysis of your annotations and traces. I like using notebooks with AI in-the-loop like Julius, Hex, or SolveIt. These help me discover insights like “location ambiguity errors spike 3x when users mention neighborhood names” or “tone mismatches occur 80% more often in email generation than other modalities.”

此外,好的工作流还能帮助你对标注和 trace 进行数据分析。我喜欢使用带有 AI 参与(AI in-the-loop)的 notebook,比如 Julius、Hex 或 SolveIt。这些工具帮助我发现像“当用户提到社区名称时,位置模糊错误激增 3 倍”或“语调不匹配在邮件生成中比其他模式多出现 80%”这样的洞见。

3. Custom Evaluators Over Generic Metrics

3. 定制评估器优于通用指标

Be prepared to build most of your evaluators from scratch. Generic metrics like “hallucination score” or “helpfulness rating” rarely capture what actually matters for your application—like proposing unavailable showing times or omitting budget constraints from emails. In our experience, successful teams spend most of their effort on application-specific metrics.

准备好从零开始构建你的大部分评估器。像“幻觉分数”或“有用性评级”这样的通用指标,很少能捕捉到对你的应用真正重要的事情——比如提议了无法安排的看房时间,或在邮件中遗漏了预算限制。根据我们的经验,成功的团队大部分精力都花在了应用专属的指标上。

4. APIs That Support Custom Annotation Apps

4. 支持定制标注应用的 API

Custom annotation interfaces work best for most teams. This requires observability platforms with thoughtful APIs. I often have to build my own libraries and abstractions just to make bulk data export manageable. You shouldn’t have to paginate through thousands of requests or handle timeout-prone endpoints just to get your data. Look for platforms that provide true bulk export capabilities and, crucially, APIs that let you write annotations back efficiently.

定制的标注界面对大多数团队来说效果最好。这需要可观测性平台提供设计周到的 API。我常常不得不构建自己的库和抽象层,才能让批量数据导出变得可管理。你不应该为了获取数据而需要对成千上万的请求进行分页,或者处理容易超时的端点。寻找那些提供真正批量导出功能,以及——至关重要的——能让你高效写回标注的 API 的平台。

问:生成合成数据的最佳方法是什么?

A common mistake is prompting an LLM to "give me test queries" without structure, resulting in generic, repetitive outputs. A structured approach using dimensions produces far better synthetic data for testing LLM applications.

一个常见的错误是无结构地提示 LLM“给我一些测试查询”,这会导致泛泛而重复的输出。使用维度的结构化方法,可以为测试 LLM 应用生成质量好得多的合成数据。

Start by defining dimensions: categories that describe different aspects of user queries. Each dimension captures one type of variation in user behavior. For example:

  • For a recipe app, dimensions might include Dietary Restriction (vegan, gluten-free, none), Cuisine Type (Italian, Asian, comfort food), and Query Complexity (simple request, multi-step, edge case).
  • For a customer support bot, dimensions could be Issue Type (billing, technical, general), Customer Mood (frustrated, neutral, happy), and Prior Context (new issue, follow-up, resolved).

从定义维度开始:这些类别描述了用户查询的不同方面。每个维度捕捉一种用户行为的变化。例如:

  • 对于一个食谱应用,维度可能包括饮食限制(素食、无麸质、无限制)、菜系类型(意大利菜、亚洲菜、家常菜)和查询复杂度(简单请求、多步骤、边缘案例)。
  • 对于一个客服机器人,维度可以是问题类型(账单、技术、一般)、客户情绪(沮丧、中性、开心)和先前背景(新问题、跟进、已解决)。

Choose dimensions that target likely failure modes. If you suspect your recipe app struggles with scaling ingredients for large groups or your support bot mishandles angry customers, make those dimensions. Use your application first—you need hypotheses about where failures occur. Without this, you’ll generate useless test data.

选择那些针对可能失败模式的维度。 如果你怀疑你的食谱应用在为大团体调整食材用量时遇到困难,或者你的客服机器人处理不好愤怒的客户,那就把这些设为维度。先亲自使用你的应用——你需要对失败可能发生在哪里有假设。没有这个,你生成的测试数据将毫无用处。

Once you have dimensions, create tuples: specific combinations selecting one value from each dimension. A tuple like (Vegan, Italian, Multi-step) represents a particular use case. Write 20 tuples manually to understand your problem space, then use an LLM to scale up.

有了维度后,创建元组: 从每个维度中选一个值组成的特定组合。一个像(素食、意大利菜、多步骤)这样的元组代表一个特定的用例。先手动写 20 个元组来理解你的问题空间,然后用 LLM 来扩大规模。

The two-step generation process is important. First, have the LLM generate structured tuples. Then, in a separate prompt, convert each tuple to a natural language query. This separation prevents repetitive phrasing. For the vegan Italian tuple above, you might get "I need a dairy-free lasagna recipe that I can prep the day before."

两步生成过程很重要。首先,让 LLM 生成结构化的元组。然后,在另一个 prompt 中,将每个元组转换为自然语言查询。这种分离可以防止措辞重复。对于上面那个素食意大利菜的元组,你可能会得到“我需要一个不含乳制品的千层面食谱,可以提前一天准备。”
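
下面是这个两步流程的一个最小示意:先从维度组合出结构化元组,再为每个元组单独构造 prompt 让 LLM 生成自然语言查询。`call_llm` 是占位函数,维度取值沿用上文食谱应用的例子。

```python
import itertools
import random

# 第一步:定义维度并组合出结构化元组
dimensions = {
    "dietary": ["素食", "无麸质", "无限制"],
    "cuisine": ["意大利菜", "亚洲菜", "家常菜"],
    "complexity": ["简单请求", "多步骤", "边缘案例"],
}
random.seed(0)
all_tuples = list(itertools.product(*dimensions.values()))
sampled = random.sample(all_tuples, k=5)  # 实践中先手写约 20 个,再用 LLM 扩大规模

def call_llm(prompt: str) -> str:
    # 占位实现:替换为你的实际 LLM 调用
    return "我需要一个不含乳制品的千层面食谱,可以提前一天准备。"

# 第二步:每个元组单独转成自然语言查询,避免批量生成导致措辞重复
for dietary, cuisine, complexity in sampled:
    prompt = (
        f"为一个食谱应用写一条真实用户可能输入的查询。\n"
        f"饮食限制:{dietary};菜系:{cuisine};复杂度:{complexity}。\n"
        f"只输出查询本身。"
    )
    print(call_llm(prompt))
```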

Don’t generate synthetic data for problems you can fix immediately. If your prompt never mentions handling dietary restrictions, fix the prompt rather than generating hundreds of specialized queries. Save synthetic data for complex issues requiring iteration—like an LLM consistently failing at ingredient scaling math or misinterpreting ambiguous requests.

不要为那些你可以立即修复的问题生成合成数据。 如果你的 prompt 从未提及处理饮食限制,那就去修复 prompt,而不是生成数百个专门的查询。把合成数据留给那些需要迭代的复杂问题——比如 LLM 总是算错食材缩放的数学题,或者误解了模棱两可的请求。

After iterating on your tuples and prompts, run these synthetic queries through your actual system to capture full traces. Sample 100 traces for error analysis. This number provides enough traces to manually review and identify failure patterns without being overwhelming. Rather than generating thousands of similar queries, ensure your 100 traces cover diverse combinations across your dimensions—this variety will reveal more failure modes than sheer volume.

在迭代了你的元组和 prompt 之后,将这些合成查询在你的实际系统中运行,以捕获完整的 trace。抽取 100 个 trace 用于错误分析。这个数量足以让你手动审查并识别失败模式,而又不会不堪重负。与其生成数千个相似的查询,不如确保你的 100 个 trace 覆盖了你各个维度下的多样化组合——这种多样性比纯粹的数量更能揭示失败模式。

问:当我的系统处理多样化的用户查询时,我该如何进行评估?

Complex applications often support vastly different query patterns—from “What’s the return policy?” to “Compare pricing trends across regions for products matching these criteria.” Each query type exercises different system capabilities, leading to confusion on how to design eval criteria.

复杂的应用常常支持截然不同的查询模式——从“退货政策是什么?”到“比较符合这些标准的产品在不同区域的价格趋势。”每种查询类型都会调用不同的系统能力,这导致在设计评估标准时感到困惑。

Error Analysis is all you need. Your evaluation strategy should emerge from observed failure patterns (e.g. error analysis), not predetermined query classifications. Rather than creating a massive evaluation matrix covering every query type you can imagine, let your system’s actual behavior guide where you invest evaluation effort.

你所需要的只是错误分析。你的评估策略应该源于观察到的失败模式(即错误分析),而不是预先设定的查询分类。与其创建一个覆盖你能想到的每一种查询类型的庞大评估矩阵,不如让你系统的实际行为来指导你将评估精力投向何处。

During error analysis, you’ll likely discover that certain query categories share failure patterns. For instance, all queries requiring temporal reasoning might struggle regardless of whether they’re simple lookups or complex aggregations. Similarly, queries that need to combine information from multiple sources might fail in consistent ways. These patterns discovered through error analysis should drive your evaluation priorities. It could be that query category is a fine way to group failures, but you don’t know that until you’ve analyzed your data.

在错误分析期间,你可能会发现某些查询类别共享失败模式。例如,所有需要时间推理的查询都可能遇到困难,无论它们是简单的查找还是复杂的聚合。同样,需要整合来自多个信息源的查询可能会以一致的方式失败。这些通过错误分析发现的模式应该驱动你的评估优先级。查询类别可能是一种很好的对失败进行分组的方式,但在你分析数据之前,你并不知道这一点。

To see an example of basic error analysis in action, see this video.

要看一个基础错误分析的实例,请观看这个视频

问:如何为我的文档处理任务选择合适的 chunk size?

Unlike RAG, where chunks are optimized for retrieval, document processing assumes the model will see every chunk. The goal is to split text so the model can reason effectively without being overwhelmed. Even if a document fits within the context window, it might be better to break it up. Long inputs can degrade performance due to attention bottlenecks, especially in the middle of the context. Two task types require different strategies:

不同于 RAG 中 chunk 是为检索而优化的,文档处理假设模型会看到每一个 chunk。其目标是切分文本,以便模型能有效推理而不会被信息淹没。即使一份文档能装进上下文窗口,把它拆开可能效果更好。由于注意力瓶颈,尤其是在上下文的中间部分,长输入会降低性能。两种任务类型需要不同的策略:

1. 固定输出任务 → 大块 (Large Chunks)

These are tasks where the output length doesn’t grow with input: extracting a number, answering a specific question, classifying a section. For example:

  • “What’s the penalty clause in this contract?”
  • “What was the CEO’s salary in 2023?”

这些是输出长度不随输入增长的任务:提取一个数字、回答一个具体问题、对一个章节进行分类。例如:

  • “这份合同里的罚则条款是什么?”
  • “2023 年 CEO 的薪水是多少?”

Use the largest chunk (with caveats) that likely contains the answer. This reduces the number of queries and avoids context fragmentation. However, avoid adding irrelevant text. Models are sensitive to distraction, especially with large inputs. The middle parts of a long input might be under-attended. Furthermore, if cost and latency are a bottleneck, you should consider preprocessing or filtering the document (via keyword search or a lightweight retriever) to isolate relevant sections before feeding a huge chunk.

使用可能包含答案的最大的 chunk(但有附加条件)。这可以减少查询次数并避免上下文碎片化。然而,要避免加入不相关的文本。模型对干扰很敏感,尤其是在输入很长的情况下。长输入的中间部分可能会被忽略。此外,如果成本和延迟是瓶颈,你应该考虑在喂入一个巨大的 chunk 之前,对文档进行预处理或过滤(通过关键词搜索或一个轻量级的检索器)来分离出相关部分。

2. 扩展性输出任务 → 小块 (Smaller Chunks)

These include summarization, exhaustive extraction, or any task where output grows with input. For example:

  • “Summarize each section”
  • “List all customer complaints”

这些包括摘要、详尽提取,或任何输出随输入增长的任务。例如:

  • “总结每个章节”
  • “列出所有客户投诉”

In these cases, smaller chunks help preserve reasoning quality and output completeness. The standard approach is to process each chunk independently, then aggregate results (e.g., map-reduce). When sizing your chunks, try to respect content boundaries like paragraphs, sections, or chapters. Chunking also helps mitigate output limits. By breaking the task into pieces, each piece’s output can stay within limits.

在这些情况下,更小的 chunk 有助于保持推理质量和输出的完整性。标准方法是独立处理每个 chunk,然后聚合结果(例如,map-reduce)。在确定 chunk 大小时,尽量尊重内容边界,如段落、小节或章节。分块也有助于缓解输出长度限制。通过将任务分解成小块,每块的输出都可以保持在限制之内。
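
一个按段落边界切块、再做 map-reduce 式聚合的最小示意;`summarize_chunk` 是占位函数,`max_chars` 的取值和示例文档都是假设的。

```python
def split_by_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """按段落边界切块,尽量不在段落中间截断(单个超长段落会自成一块)。"""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def summarize_chunk(chunk: str) -> str:
    # 占位实现:替换为对你的 LLM 的调用,例如"总结这一节的要点"
    return chunk[:50] + "…"

document = "第一章 介绍……\n\n第二章 方法……\n\n第三章 结果……"
partial = [summarize_chunk(c) for c in split_by_paragraphs(document)]  # map:逐块处理
final_summary = "\n".join(partial)                                      # reduce:也可再交给 LLM 汇总
print(final_summary)
```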

一般性指导

It’s important to recognize why chunk size affects results. A larger chunk means the model has to reason over more information in one go – essentially, a heavier cognitive load. LLMs have limited capacity to retain and correlate details across a long text. If too much is packed in, the model might prioritize certain parts (commonly the beginning or end) and overlook or “forget” details in the middle. This can lead to overly coarse summaries or missed facts. In contrast, a smaller chunk bounds the problem: the model can pay full attention to that section. You are trading off global context for local focus.

重要的是要认识到为什么 chunk size 会影响结果。更大的 chunk 意味着模型需要一次性推理更多的信息——实质上是更重的认知负荷。LLM 在长文本中保留和关联细节的能力是有限的。如果塞入太多信息,模型可能会优先处理某些部分(通常是开头或结尾),而忽略或“忘记”中间的细节。这可能导致过于粗糙的摘要或遗漏事实。相比之下,更小的 chunk 限定了问题范围:模型可以全神贯注于那一部分。你是在用全局上下文换取局部焦点

No rule of thumb can perfectly determine the best chunk size for your use case – you should validate with experiments. The optimal chunk size can vary by domain and model. I treat chunk size as a hyperparameter to tune.

没有任何经验法则能完美地确定你用例的最佳 chunk size——你应该通过实验来验证。最优的 chunk size 可能因领域和模型而异。我把 chunk size 当作一个需要调整的超参数。

问:我应该如何评估我的 RAG 系统?

RAG systems have two distinct components that require different evaluation approaches: retrieval and generation.

RAG 系统有两个截然不同的组件,需要不同的评估方法:检索和生成。

The retrieval component is a search problem. Evaluate it using traditional information retrieval (IR) metrics. Common examples include Recall@k (of all relevant documents, how many did you retrieve in the top k?), Precision@k (of the k documents retrieved, how many were relevant?), or MRR (how high up was the first relevant document?). The specific metrics you choose depend on your use case. These metrics are pure search metrics that measure whether you’re finding the right documents (more on this below).

检索组件是一个搜索问题。使用传统的信息检索(IR)指标来评估它。常见的例子包括 Recall@k(在所有相关文档中,你在前 k 个结果中检索到了多少?)、Precision@k(在你检索的 k 个文档中,有多少是相关的?),或 MRR(第一个相关文档排在第几位?)。你选择的具体指标取决于你的用例。这些是纯粹的搜索指标,衡量你是否找到了正确的文档(下文将详述)。
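
下面是 Recall@k、Precision@k 和 MRR 的最小实现示意;假设每条查询对应一个相关文档 id 集合和一个按相关性排序的检索结果列表,示例数据为虚构。

```python
def recall_at_k(relevant: set, retrieved: list, k: int) -> float:
    """前 k 个结果覆盖了多少相关文档。"""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(relevant: set, retrieved: list, k: int) -> float:
    """前 k 个结果中有多少是相关的。"""
    return len(set(retrieved[:k]) & relevant) / k

def mrr(relevant: set, retrieved: list) -> float:
    """第一个相关文档排名的倒数。"""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# 假设的评估数据:一条查询的相关文档与检索结果
relevant_docs = {"doc_7", "doc_42"}
retrieved_docs = ["doc_3", "doc_42", "doc_9", "doc_7", "doc_1"]
print(recall_at_k(relevant_docs, retrieved_docs, k=3))     # 0.5
print(precision_at_k(relevant_docs, retrieved_docs, k=3))  # 约 0.33
print(mrr(relevant_docs, retrieved_docs))                  # 0.5
```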

To evaluate retrieval, create a dataset of queries paired with their relevant documents. Generate this synthetically by taking documents from your corpus, extracting key facts, then generating questions those facts would answer. This reverse process gives you query-document pairs for measuring retrieval performance without manual annotation.

要评估检索,需要创建一个包含查询及其相关文档的数据集。你可以通过从你的语料库中提取文档,抽取出关键事实,然后生成这些事实可以回答的问题,来合成这个数据集。这个逆向过程为你提供了用于衡量检索性能的查询-文档对,而无需手动标注。

For the generation component—how well the LLM uses retrieved context, whether it hallucinates, whether it answers the question—use the same evaluation procedures covered throughout this course: error analysis to identify failure modes, collecting human labels, building LLM-as-judge evaluators, and validating those judges against human annotations.

对于生成组件——LLM 如何利用检索到的上下文、是否产生幻觉、是否回答了问题——使用本课程中介绍的相同评估流程:通过错误分析识别失败模式,收集人工标签,构建 LLM-as-judge 评估器,并根据人工标注来验证这些评判模型。

Jason Liu’s “There Are Only 6 RAG Evals” provides a framework that maps well to this separation. His Tier 1 covers traditional IR metrics for retrieval. Tiers 2 and 3 evaluate relationships between Question, Context, and Answer—like whether the context is relevant (C|Q), whether the answer is faithful to context (A|C), and whether the answer addresses the question (A|Q).

Jason Liu 的《只有 6 种 RAG 评估》提供了一个与这种分离很好地对应的框架。他的第一层级(Tier 1)涵盖了用于检索的传统 IR 指标。第二和第三层级(Tiers 2 and 3)评估问题、上下文和答案之间的关系——比如上下文是否相关(C|Q),答案是否忠实于上下文(A|C),以及答案是否解决了问题(A|Q)。

In addition to Jason’s six evals, error analysis on your specific data may reveal domain-specific failure modes that warrant their own metrics. For example, a medical RAG system might consistently fail to distinguish between drug dosages for adults versus children, or a legal RAG might confuse jurisdictional boundaries. These patterns emerge only through systematic review of actual failures. Once identified, you can create targeted evaluators for these specific issues beyond the general framework.

除了 Jason 的六种评估之外,对你特定数据的错误分析可能会揭示出领域特定的失败模式,这些模式需要有自己的指标。例如,一个医疗 RAG 系统可能总是无法区分成人和儿童的药物剂量,或者一个法律 RAG 可能会混淆司法管辖权的界限。这些模式只有通过对实际失败的系统性审查才能浮现。一旦识别出来,你就可以在通用框架之外,为这些具体问题创建有针对性的评估器。

Finally, when implementing Jason’s Tier 2 and 3 metrics, don’t just use prompts off the shelf. The standard LLM-as-judge process requires several steps: error analysis, prompt iteration, creating labeled examples, and measuring your judge’s accuracy against human labels. Once you know your judge’s True Positive and True Negative rates, you can correct its estimates to determine the actual failure rate in your system. Skip this validation and your judges may not reflect your actual quality criteria.

最后,在实施 Jason 的第二和第三层级指标时,不要只是直接使用现成的 prompt。标准的 LLM-as-judge 流程需要几个步骤:错误分析、prompt 迭代、创建标注样本,以及对照人工标签来衡量你的评判模型的准确性。一旦你知道了你的评判模型的真正例率和真负例率,你就可以校正它的估计值,以确定你系统中的实际失败率。如果跳过这个验证步骤,你的评判模型可能无法反映你真正的质量标准。
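
一个校正评判模型估计值的最小示意:依据"观测通过率 = 真实通过率 × TPR + (1 - 真实通过率) × (1 - TNR)"反解出真实通过率(失败率即 1 减去它)。前提假设是测试集上测得的 TPR/TNR 能代表生产数据;示例数值为假设。

```python
def corrected_pass_rate(observed_pass_rate: float, tpr: float, tnr: float) -> float:
    """由评判模型的观测通过率反推真实通过率。"""
    denom = tpr + tnr - 1
    if denom <= 0:
        raise ValueError("评判模型不比随机猜测更好,无法校正")
    corrected = (observed_pass_rate + tnr - 1) / denom
    return min(1.0, max(0.0, corrected))  # 截断到 [0, 1]

# 假设:评判模型在留存测试集上 TPR=0.92、TNR=0.85,在生产抽样上报告 80% 通过
print(f"校正后的真实通过率约为 {corrected_pass_rate(0.80, tpr=0.92, tnr=0.85):.2%}")
```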

In summary, debug retrieval first using IR metrics, then tackle generation quality using properly validated LLM judges.

总结一下,首先使用 IR 指标来调试检索部分,然后使用经过适当验证的 LLM 评判模型来解决生成质量问题。

Q: What makes a good custom interface for reviewing LLM outputs?

问:一个好的用于审查 LLM 输出的自定义界面是怎样的?

Great interfaces make human review fast, clear, and motivating. We recommend building your own annotation tool customized to your domain. The following features are possible enhancements we’ve seen work well, but you don’t need all of them. The screenshots shown are illustrative examples to clarify concepts. In practice, I rarely implement all these features in a single app. It’s ultimately a judgment call based on your specific needs and constraints.

好的界面能使人工审查变得快速、清晰且富有激励性。我们建议构建你自己的、针对你所在领域定制的标注工具。以下是一些我们见过效果不错的可能增强功能,但你并不需要全部实现它们。所展示的截图是为阐明概念的示例。在实践中,我很少在一个应用中实现所有这些功能。这最终是一个基于你具体需求和限制的判断。

1. Render Traces Intelligently, Not Generically: Present the trace in a way that’s intuitive for the domain. If you’re evaluating generated emails, render them to look like emails. If the output is code, use syntax highlighting. Allow the reviewer to see the full trace (user input, tool calls, and LLM reasoning), but keep less important details in collapsed sections that can be expanded. Here is an example of a custom annotation tool for reviewing real estate assistant emails:

1. 智能地渲染 Trace,而非通用地渲染:以对领域而言直观的方式呈现 trace。如果你在评估生成的邮件,就让它们看起来像邮件。如果输出是代码,就使用语法高亮。允许审查者看到完整的 trace(用户输入、工具调用和 LLM 推理),但将不太重要的细节放在可展开的折叠部分中。这是一个用于审查房地产助理邮件的自定义标注工具示例:

A custom interface for reviewing emails for a real estate assistant.

一个用于审查房地产助理邮件的自定义界面。

2. Show Progress and Support Keyboard Navigation: Keep reviewers in a state of flow by minimizing friction and motivating completion. Include progress indicators (e.g., “Trace 45 of 100”) to keep the review session bounded and encourage completion. Enable hotkeys for navigating between traces (e.g., N for next), applying labels, and saving notes quickly. Below is an illustration of these features:

2. 显示进度并支持键盘导航:通过减少阻力并激励完成,让审查者保持心流状态。包含进度指示器(例如,“Trace 45 of 100”)来限定审查会话的范围并鼓励完成。启用快捷键来在 trace 之间导航(例如,N 代表下一个)、应用标签和快速保存笔记。下面是这些功能的图示:

An annotation interface with a progress bar and hotkey guide

一个带有进度条和快捷键指南的标注界面

3. Trace navigation through clustering, filtering, and search: Allow reviewers to filter traces by metadata or search by keywords. Semantic search helps find conceptually similar problems. Clustering similar traces (like grouping by user persona) lets reviewers spot recurring issues and explore hypotheses. Below is an illustration of these features:

3. 通过聚类、过滤和搜索进行 Trace 导航:允许审查者按元数据过滤 trace 或按关键词搜索。语义搜索有助于找到概念上相似的问题。对相似的 trace 进行聚类(比如按用户画像分组)能让审查者发现重复出现的问题并探索假设。下面是这些功能的图示:

Cluster view showing groups of emails, such as property-focused or client-focused examples. Reviewers can drill into a group to see individual traces.

聚类视图显示了邮件分组,例如以房产为中心的或以客户为中心的例子。审查者可以深入一个分组查看单个 trace。

4. Prioritize labeling traces you think might be problematic: Surface traces flagged by guardrails, CI failures, or automated evaluators for review. Provide buttons to take actions like adding to datasets, filing bugs, or re-running pipeline tests. Display relevant context (pipeline version, eval scores, reviewer info) directly in the interface to minimize context switching. Below is an illustration of these ideas:

4. 优先标注你认为可能有问题的 trace:将被 guardrails、CI 失败或自动化评估器标记的 trace 浮现出来以供审查。提供按钮来执行操作,如添加到数据集、提交 bug 或重新运行流水线测试。在界面中直接显示相关上下文(流水线版本、评估分数、审查者信息),以最大限度地减少上下文切换。下面是这些想法的图示:

A trace view that allows you to quickly see auto-evaluator verdict, add traces to dataset or open issues. Also shows metadata like pipeline version, reviewer info, and more.

一个 trace 视图,可以让你快速查看自动评估器的结论、将 trace 添加到数据集或开启 issue。同时还显示了如流水线版本、审查者信息等元数据。

General Principle: Keep it minimal

一般原则:保持简约

Keep your annotation interface minimal. Only incorporate these ideas if they provide a benefit that outweighs the additional complexity and maintenance overhead.

保持你的标注界面简约。只有当这些想法带来的好处超过了额外的复杂性和维护开销时,才将它们融入进来。

问:我应该将多少开发预算分配给评估?

It’s important to recognize that evaluation is part of the development process rather than a distinct line item, similar to how debugging is part of software development.

重要的是要认识到,评估是开发过程的一部分,而不是一个独立的预算项目,就像调试是软件开发的一部分一样。

You should always be doing error analysis. When you discover issues through error analysis, many will be straightforward bugs you’ll fix immediately. These fixes don’t require separate evaluation infrastructure as they’re just part of development.

你应该一直在做错误分析。当你通过错误分析发现问题时,许多将是你会立即修复的直接的 bug。这些修复不需要独立的评估基础设施,因为它们只是开发的一部分。

The decision to build automated evaluators comes down to cost-benefit analysis. If you can catch an error with a simple assertion or regex check, the cost is minimal and probably worth it. But if you need to align an LLM-as-judge evaluator, consider whether the failure mode warrants that investment.

构建自动化评估器的决定归结于成本效益分析。如果你能用一个简单的断言或正则表达式检查来捕获一个错误,成本是最小的,并且可能值得。但如果你需要校准一个 LLM-as-judge 评估器,就要考虑这种失败模式是否值得那份投资。

In the projects we’ve worked on, we’ve spent 60-80% of our development time on error analysis and evaluation. Expect most of your effort to go toward understanding failures (i.e. looking at data) rather than building automated checks.

在我们参与的项目中,我们花费了 60-80% 的开发时间在错误分析和评估上。预计你的大部分精力将用于理解失败(即查看数据),而不是构建自动化检查。

Be wary of optimizing for high eval pass rates. If you’re passing 100% of your evals, you’re likely not challenging your system enough. A 70% pass rate might indicate a more meaningful evaluation that’s actually stress-testing your application. Focus on evals that help you catch real issues, not ones that make your metrics look good.

警惕为高评估通过率而优化。如果你的评估通过率是 100%,你很可能没有给你的系统足够的挑战。70% 的通过率可能意味着一个更有意义的评估,它真正在对你的应用进行压力测试。专注于那些能帮助你捕捉真实问题的评估,而不是那些让你的指标看起来好看的评估。

问:为什么“错误分析”在 LLM 评估中如此重要,以及如何进行?

Error analysis is the most important activity in evals. Error analysis helps you decide what evals to write in the first place. It allows you to identify failure modes unique to your application and data. The process involves:

错误分析是评估中最重要的一环。错误分析能帮助你决定首先应该编写哪些评估。它让你能够识别出特定于你的应用和数据的失败模式。这个过程包括:

  1. Creating a Dataset: Gathering representative traces of user interactions with the LLM. If you do not have any data, you can generate synthetic data to get started.
  2. Open Coding: Human annotator(s) (ideally a benevolent dictator) review and write open-ended notes about traces, noting any issues. This process is akin to “journaling” and is adapted from qualitative research methodologies. When beginning, it is recommended to focus on noting the first failure observed in a trace, as upstream errors can cause downstream issues, though you can also tag all independent failures if feasible. A domain expert should be performing this step.
  3. Axial Coding: Categorize the open-ended notes into a “failure taxonomy”. In other words, group similar failures into distinct categories. This is the most important step. At the end, count the number of failures in each category. You can use an LLM to help with this step.
  4. Iterative Refinement: Keep iterating on more traces until you reach theoretical saturation, meaning new traces do not seem to reveal new failure modes or information to you. As a rule of thumb, you should aim to review at least 100 traces.
  1. 创建数据集:收集与 LLM 用户交互的有代表性的 trace。如果你没有任何数据,可以生成合成数据来起步。
  2. 开放式编码:由人类标注员(理想情况下是一位“仁慈的独裁者”)审查 trace 并写下开放式笔记,记录任何问题。这个过程类似于“写日记”,改编自定性研究方法。开始时,建议专注于记录在 trace 中观察到的第一个失败,因为上游的错误可能导致下游的问题,不过如果可行,你也可以标记所有独立的失败。这一步应该由领域专家来执行。
  3. 主轴编码:将开放式笔记归类到一个“失败分类法”中。换句话说,将相似的失败分组成不同的类别。这是最重要的一步。最后,计算每个类别中失败的数量。你可以用 LLM 来帮助完成这一步。
  4. 迭代优化:持续迭代更多的 trace,直到达到理论饱和,即新的 trace 似乎不再为你揭示新的失败模式或信息。根据经验,你应该至少审查 100 个 trace。

You should frequently revisit this process. There are advanced ways to sample data more efficiently, like clustering, sorting by user feedback, and sorting by high probability failure patterns. Over time, you’ll develop a “nose” for where to look for failures in your data.

你应该经常回顾这个过程。有一些更高级的方法可以更有效地抽样数据,比如聚类、按用户反馈排序,以及按高概率失败模式排序。随着时间的推移,你会在数据中培养出一种寻找失败的“直觉”。

Do not skip error analysis. It ensures that the evaluation metrics you develop are supported by real application behaviors instead of counter-productive generic metrics (which most platforms nudge you to use). For examples of how error analysis can be helpful, see this video, or this blog post.

不要跳过错误分析。它能确保你制定的评估指标是由真实的应用行为支持的,而不是那些适得其反的通用指标(大多数平台都鼓励你使用这些指标)。想看错误分析如何有用的例子,可以看这个视频,或者这篇博客文章

问:Guardrails 和 Evaluators 有什么区别?

Guardrails are inline safety checks that sit directly in the request/response path. They validate inputs or outputs before anything reaches a user, so they typically are:

  • Fast and deterministic – typically a few milliseconds of latency budget.
  • Simple and explainable – regexes, keyword block-lists, schema or type validators, lightweight classifiers.
  • Targeted at clear-cut, high-impact failures – PII leaks, profanity, disallowed instructions, SQL injection, malformed JSON, invalid code syntax, etc.

Guardrails 是直接位于请求/响应路径中的内联安全检查。它们在任何内容到达用户之前验证输入或输出,因此它们通常是:

  • 快速且确定性的 – 通常只有几毫秒的延迟预算。
  • 简单且可解释的 – 正则表达式、关键词黑名单、模式或类型验证器、轻量级分类器。
  • 针对明确、高影响的失败 – 个人身份信息(PII)泄露、脏话、不允许的指令、SQL 注入、格式错误的 JSON、无效的代码语法等。

If a guardrail triggers, the system can redact, refuse, or regenerate the response. Because these checks are user-visible when they fire, false positives are treated as production bugs; teams version guardrail rules, log every trigger, and monitor rates to keep them conservative.

如果一个 guardrail 被触发,系统可以对响应进行脱敏(redact)、拒绝或重新生成。因为这些检查一旦触发对用户是可见的,所以误报会被当作生产环境的 bug 来处理;团队会对 guardrail 规则进行版本控制,记录每次触发,并监控触发率以保持其保守性。

On the other hand, evaluators typically run after a response is produced. Evaluators measure qualities that simple rules cannot, such as factual correctness, completeness, etc. Their verdicts feed dashboards, regression tests, and model-improvement loops, but they do not block the original answer.

另一方面,evaluators 通常在响应生成之后运行。Evaluators 衡量简单规则无法衡量的质量,如事实正确性、完整性等。它们的结论会输入到仪表盘、回归测试和模型改进循环中,但它们不会阻止原始答案的发出。

Evaluators are usually run asynchronously or in batch to afford heavier computation such as a LLM-as-a-Judge. Inline use of an LLM-as-Judge is possible only when the latency budget and reliability targets allow it. Slow LLM judges might be feasible in a cascade that runs on the minority of borderline cases.

Evaluators 通常是异步或批量运行的,以便进行更重的计算,比如 LLM-as-a-Judge。只有在延迟预算和可靠性目标允许的情况下,才可能内联使用 LLM-as-a-Judge。对于少数处于边界情况的案例,可以在一个级联系统中运行较慢的 LLM 评判模型,这或许是可行的。

Apply guardrails for immediate protection against objective failures requiring intervention. Use evaluators for monitoring and improving subjective or nuanced criteria. Together, they create layered protection.

使用 guardrails 来即时防范需要干预的客观失败。使用 evaluators 来监控和改进主观或细微的标准。两者结合,构成了分层保护。

Word of caution: Do not use LLM guardrails off the shelf blindly. Always look at the prompt.

提醒一句:不要盲目地使用现成的 LLM guardrails。一定要看看它的 prompt

问:最小可行的评估配置是怎样的?

Start with error analysis, not infrastructure. Spend 30 minutes manually reviewing 20-50 LLM outputs whenever you make significant changes. Use one domain expert who understands your users as your quality decision maker (a “benevolent dictator”).

错误分析开始,而不是基础设施。每当你做出重大改动时,花 30 分钟手动审查 20-50 个 LLM 输出。让一位了解你用户的领域专家作为你的质量决策者(一位“仁慈的独裁者”)。

If possible, use notebooks to help you review traces and analyze data. In our opinion, this is the single most effective tool for evals because you can write arbitrary code, visualize data, and iterate quickly. You can even build your own custom annotation interface right inside notebooks, as shown in this video.

如果可能的话,使用 notebook 来帮助你审查 trace 和分析数据。在我们看来,这是进行评估最有效的单一工具,因为你可以编写任意代码、可视化数据并快速迭代。你甚至可以在 notebook 内部构建自己的自定义标注界面,如这个视频所示。

问:我该如何评估 Agentic 工作流?

We recommend evaluating agentic workflows in two phases:

我们建议分两个阶段来评估 agentic 工作流:

1. End-to-end task success. Treat the agent as a black box and ask “did we meet the user’s goal?”. Define a precise success rule per task (exact answer, correct side-effect, etc.) and measure with human or aligned LLM judges. Take note of the first upstream failure when conducting error analysis.

1. 端到端的任务成功。 将 agent 视为一个黑盒,然后问“我们是否达成了用户的目标?”。为每个任务定义一个精确的成功规则(准确的答案、正确的副作用等),并用人工或校准过的 LLM 评判模型来衡量。在进行错误分析时,记下第一个上游的失败。

Once error analysis reveals which workflows fail most often, move to step-level diagnostics to understand why they’re failing.

一旦错误分析揭示了哪些工作流最常失败,就转向步骤级别的诊断,以理解它们失败的原因。

2. Step-level diagnostics. Assuming that you have sufficiently instrumented your system with details of tool calls and responses, you can score individual components such as:

  • Tool choice: was the selected tool appropriate?
  • Parameter extraction: were inputs complete and well-formed?
  • Error handling: did the agent recover from empty results or API failures?
  • Context retention: did it preserve earlier constraints?
  • Efficiency: how many steps, seconds, and tokens were spent?
  • Goal checkpoints: for long workflows verify key milestones.

2. 步骤级别的诊断。 假设你已经为系统做了充分的插桩,记录了工具调用和响应的详细信息,你就可以对单个组件进行评分,例如:

  • 工具选择:选择的工具是否合适?
  • 参数提取:输入是否完整且格式正确?
  • 错误处理:agent 是否从空结果或 API 失败中恢复?
  • 上下文保留:它是否保留了早前的约束?
  • 效率:花费了多少步骤、秒数和 token?
  • 目标检查点:对于长工作流,验证关键的里程碑。

Example: “Find Berkeley homes under $1M and schedule viewings” breaks into: parameters extracted correctly, relevant listings retrieved, availability checked, and calendar invites sent. Each checkpoint can pass or fail independently, making debugging tractable.

例如:“寻找伯克利 100 万美元以下的房屋并安排看房”可以分解为:参数提取正确、检索到相关房源、检查了可用性,以及发送了日历邀请。每个检查点都可以独立通过或失败,这使得调试变得易于处理。

Use transition failure matrices to understand error patterns. Create a matrix where rows represent the last successful state and columns represent where the first failure occurred. This is a great way to understand where the most failures occur.

使用转换失败矩阵来理解错误模式。 创建一个矩阵,其中行代表上一个成功的状态,列代表第一次失败发生的地方。这是理解失败最常发生在哪里的好方法。

Transition failure matrix showing hotspots in text-to-SQL agent workflow

转换失败矩阵显示了 text-to-SQL agent 工作流中的热点区域

Transition matrices transform overwhelming agent complexity into actionable insights. Instead of drowning in individual trace reviews, you can immediately see that GenSQL → ExecSQL transitions cause 12 failures while DecideTool → PlanCal causes only 2. This data-driven approach guides where to invest debugging effort. Here is another example from Bryan Bischof, that is also a text-to-SQL agent:

转换矩阵将令人不知所措的 agent 复杂性转化为可行的洞见。你不再需要淹没在单个 trace 的审查中,而是可以立即看到 GenSQL → ExecSQL 的转换导致了 12 次失败,而 DecideTool → PlanCal 只导致了 2 次。这种数据驱动的方法指导你将调试精力投向何处。这是 Bryan Bischof 的另一个例子,同样是一个 text-to-SQL agent:

Bischof, Bryan “Failure is A Funnel - Data Council, 2025”

Bischof, Bryan “失败是一个漏斗 - Data Council, 2025”

In this example, Bryan shows variation in transition matrices across experiments. How you organize your transition matrix depends on the specifics of your application. For example, Bryan’s text-to-SQL agent has an inherent sequential workflow which he exploits for further analytical insight. You can watch his full talk for more details.

在这个例子中,Bryan 展示了不同实验中转换矩阵的变化。你如何组织你的转换矩阵取决于你应用的具体情况。例如,Bryan 的 text-to-SQL agent 有一个固有的顺序工作流,他利用这一点来获得更深入的分析洞见。你可以观看他的完整演讲以获取更多细节。
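
构建转换失败矩阵的一个最小示意:假设错误分析时为每条失败 trace 记录了"最后一个成功状态"和"第一个失败状态",用 pandas 的 crosstab 即可得到矩阵;状态名沿用上文 text-to-SQL 的例子,数据为虚构。

```python
import pandas as pd

# 假设的错误分析记录:每条失败 trace 的 (最后成功状态, 第一个失败状态)
failures = pd.DataFrame(
    [
        ("GenSQL", "ExecSQL"), ("GenSQL", "ExecSQL"), ("GenSQL", "ExecSQL"),
        ("DecideTool", "PlanCal"), ("ParseQuery", "GenSQL"),
        ("ExecSQL", "FormatAnswer"), ("GenSQL", "ExecSQL"),
    ],
    columns=["last_success", "first_failure"],
)

# 行:最后成功的状态;列:首次失败的状态;值:失败次数
matrix = pd.crosstab(failures["last_success"], failures["first_failure"])
print(matrix)
```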

Creating Test Cases for Agent Failures

为 Agent 失败创建测试用例

Creating test cases for agent failures follows the same principles as our previous FAQ on debugging multi-turn conversation traces (i.e. try to reproduce the error in the simplest way possible, only use multi-turn tests when the failure actually requires conversation context, etc.).

为 agent 失败创建测试用例遵循与我们之前关于调试多轮对话 trace 的常见问题相同的原则(即,尝试用最简单的方式复现错误,仅在失败确实需要对话上下文时才使用多轮测试等)。

问:说真的,Hamel,别扯淡了。你最喜欢的评估服务商是哪家?

Eval tools are in an intensely competitive space. It would be futile to compare their features. If I tried to do such an analysis, it would be invalidated in a week! Vendors I encounter the most organically in my work are: Langsmith, Arize and Braintrust.

评估工具处于一个竞争异常激烈的领域。比较它们的功能是徒劳的。如果我尝试做这样的分析,一周之内就会过时!在我的工作中,我最常自然接触到的服务商是:Langsmith、Arize 和 Braintrust。

When I help clients with vendor selection, the decision weighs heavily towards who can offer the best support, as opposed to purely features. This changes depending on size of client, use case, etc. Yes - it’s mainly the human factor that matters, and dare I say, vibes.

当我帮助客户选择服务商时,决策的重心严重偏向于谁能提供最好的支持,而不是纯粹的功能。这取决于客户的规模、用例等。是的——主要起作用的是人的因素,恕我直言,还有“感觉”(vibes)。

I have no favorite vendor. At the core, their features are very similar - and I often build custom tools on top of them to fit my needs.

我没有最喜欢的服务商。它们的核心功能非常相似——而且我常常在它们之上构建自定义工具来满足我的需求。

My suggestion is to explore the vendors and see which one you like the most.

我的建议是,去探索一下这些服务商,看看你最喜欢哪一家。

问:评估在 CI/CD 和生产监控中的使用有何不同?

The most important difference between CI vs. production evaluation is the data used for testing.

CI 评估和生产评估之间最重要的区别是用于测试的数据。

Test datasets for CI are small (in many cases 100+ examples) and purpose-built. Examples cover core features, regression tests for past bugs, and known edge cases. Since CI tests are run frequently, the cost of each test has to be carefully considered (that’s why you carefully curate the dataset). Favor assertions or other deterministic checks over LLM-as-judge evaluators.

用于 CI 的测试数据集很小(很多情况下是 100 多个例子)并且是专门构建的。例子涵盖了核心功能、针对过去 bug 的回归测试以及已知的边缘案例。由于 CI 测试运行频繁,必须仔细考虑每次测试的成本(这就是为什么你要精心策划数据集)。优先使用断言或其他确定性检查,而不是 LLM-as-judge 评估器。
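
CI 中的确定性检查可以直接写成普通的 pytest 测试。下面是一个示意:`generate_reply` 是占位函数,断言的字段名和关键事实都是假设的例子。

```python
import json

def generate_reply(query: str) -> str:
    # 占位实现:替换为对你的 LLM 流水线的调用
    return json.dumps({"answer": "退货窗口为 30 天", "sources": ["policy.md"]})

def test_return_policy_regression():
    """针对历史 bug 的回归测试:答案必须是合法 JSON 且包含关键事实。"""
    reply = generate_reply("产品 X1000 的退货窗口是多久?")
    data = json.loads(reply)          # 结构检查:必须是合法 JSON
    assert "sources" in data          # 必须附带引用来源
    assert "30 天" in data["answer"]  # 关键事实必须出现
```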

For evaluating production traffic, you can sample live traces and run evaluators against them asynchronously. Since you usually lack reference outputs on production data, you might rely more on more expensive reference-free evaluators like LLM-as-judge. Additionally, track confidence intervals for production metrics. If the lower bound crosses your threshold, investigate further.

对于评估生产流量,你可以抽样实时 trace 并异步地对其运行评估器。由于你通常在生产数据上缺少参考输出,你可能更依赖于像 LLM-as-judge 这样更昂贵的无参考评估器。此外,要追踪生产指标的置信区间。如果下限越过了你的阈值,就要进一步调查。
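
追踪生产指标置信区间的一个示意:对"通过率"这类二项比例,可以用 Wilson 区间,下限低于阈值就进一步调查。85% 的阈值和 178/200 的抽样数据都是假设的。

```python
import math

def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """二项比例的 Wilson 置信区间(z=1.96 约对应 95% 置信水平)。"""
    if total == 0:
        return 0.0, 1.0
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - margin, center + margin

# 假设:本周抽样 200 条生产 trace,评估器判定 178 条通过,阈值为 85%
low, high = wilson_interval(178, 200)
if low < 0.85:
    print(f"通过率区间 [{low:.2%}, {high:.2%}] 下限低于阈值,需要进一步调查")
```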

These two systems are complementary: when production monitoring reveals new failure patterns through error analysis and evals, add representative examples to your CI dataset. This mitigates regressions on new issues.

这两个系统是互补的:当生产监控通过错误分析和评估揭示了新的失败模式时,将有代表性的例子添加到你的 CI 数据集中。这可以减轻在新问题上发生回归的风险。

问:相似度指标(BERTScore、ROUGE 等)对评估 LLM 输出有用吗?

Generic metrics like BERTScore, ROUGE, cosine similarity, etc. are not useful for evaluating LLM outputs in most AI applications. Instead, we recommend using error analysis to identify metrics specific to your application’s behavior. We recommend designing binary pass/fail evals (using LLM-as-judge) or code-based assertions.

像 BERTScore、ROUGE、余弦相似度等通用指标,在大多数 AI 应用中对评估 LLM 输出并无用处。相反,我们建议使用错误分析来识别特定于你应用行为的指标。我们建议设计二元通过/失败的评估(使用 LLM-as-judge)或基于代码的断言。

As an example, consider a real estate CRM assistant. Suggesting showings that aren’t available (can be tested with an assertion) or confusing client personas (can be tested with an LLM-as-judge) is problematic. Generic metrics like similarity or verbosity won’t catch this. A relevant quote from the course:

举个例子,考虑一个房地产 CRM 助手。建议无法安排的看房(可以用断言测试)或混淆客户画像(可以用 LLM-as-judge 测试)都是问题。像相似度或冗长度这样的通用指标是无法捕捉到这些的。课程中的一段相关引述:

“The abuse of generic metrics is endemic. Many eval vendors promote off the shelf metrics, which ensnare engineers into superfluous tasks.”

“滥用通用指标的现象非常普遍。许多评估服务商推广现成的指标,这让工程师陷入了多余的任务中。”

Similarity metrics aren’t always useless. They have utility in domains like search and recommendation (and therefore can be useful for optimizing and debugging retrieval for RAG). For example, cosine similarity between embeddings can measure semantic closeness in retrieval systems, and average pairwise similarity can assess output diversity (where lower similarity indicates higher diversity).

相似度指标并非总是无用。它们在搜索和推荐等领域有其用处(因此对于 RAG 的检索优化和调试也可能有用)。例如,embedding 之间的余弦相似度可以衡量检索系统中的语义接近度,而平均成对相似度可以评估输出的多样性(其中较低的相似度表示较高的多样性)。
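
一个用平均成对余弦相似度衡量输出多样性的示意。`embed` 是占位函数(这里用字符频率充当示意向量),实际应替换为真实的 embedding 模型;示例输出为虚构。

```python
import math

def embed(text: str) -> list[float]:
    # 占位实现:用字符频率充当示意向量,实际应替换为 embedding 模型
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

outputs = [
    "Try the classic lasagna recipe.",
    "A quick vegan pasta works too.",
    "Lasagna is a classic choice.",
]
vectors = [embed(o) for o in outputs]
pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
avg_sim = sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
print(f"平均成对相似度 = {avg_sim:.2f}(越低表示输出越多样)")
```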

问:我应该使用“开箱即用”的评估指标吗?

No. Generic evaluations waste time and create false confidence. (Unless you’re using them for exploration).

不。通用的评估浪费时间并制造虚假的信心。(除非你用它们来做探索)。

One instructor noted:

“All you get from using these prefab evals is you don’t know what they actually do and in the best case they waste your time and in the worst case they create an illusion of confidence that is unjustified.”1

一位讲师指出:

“使用这些预制评估,你得到的只是你不知道它们到底在做什么,最好的情况是它们浪费你的时间,最坏的情况是它们制造了一种毫无根据的信心幻觉。”1

Generic evaluation metrics are everywhere. Eval libraries contain scores like helpfulness, coherence, quality, etc. promising easy evaluation. These metrics measure abstract qualities that may not matter for your use case. Good scores on them don’t mean your system works.

通用的评估指标无处不在。评估库里包含了像“有用性”、“连贯性”、“质量”等分数,承诺能轻松进行评估。这些指标衡量的是抽象的品质,可能对你的用例并不重要。在这些指标上得分高并不意味着你的系统能用。

Instead, conduct error analysis to understand failures. Define binary failure modes based on real problems. Create custom evaluators for those failures and validate them against human judgment. Essentially, the entire evals process.

相反,应该进行错误分析来理解失败。基于真实问题定义二元失败模式。为那些失败创建自定义评估器,并对照人类判断进行验证。本质上,就是整个评估流程。

Experienced practitioners may still use these metrics, just not how you’d expect. As Picasso said: “Learn the rules like a pro, so you can break them like an artist.” Once you understand why generic metrics fail as evaluations, you can repurpose them as exploration tools to find interesting traces (explained in the next FAQ).

有经验的从业者可能仍会使用这些指标,但方式和你预想的不同。正如毕加索所说:“像专家一样学习规则,才能像艺术家一样打破规则。”一旦你理解了为什么通用指标作为评估会失败,你就可以将它们重新用作探索工具,来找到有趣的 trace(在下一个 FAQ 中解释)。

问:我如何能有效地从生产 trace 中抽样进行审查?

It can be cumbersome to review traces randomly, especially when most traces don’t have an error. These sampling strategies help you find traces more likely to reveal problems:

随机审查 trace 可能很麻烦,尤其是当大多数 trace 都没有错误时。这些抽样策略可以帮助你找到更有可能揭示问题的 trace:

  • Outlier detection: Sort by any metric (response length, latency, tool calls) and review extremes.
  • User feedback signals: Prioritize traces with negative feedback, support tickets, or escalations.
  • Metric-based sorting: Generic metrics can serve as exploration signals to find interesting traces. Review both high and low scores and treat them as exploration clues. Based on what you learn, you can build custom evaluators for the failure modes you find.
  • Stratified sampling: Group traces by key dimensions (user type, feature, query category) and sample from each group.
  • 异常值检测: 按任何指标(响应长度、延迟、工具调用)排序,并审查极端值。
  • 用户反馈信号: 优先处理带有负面反馈、支持工单或升级的 trace。
  • 基于指标的排序: 通用指标可以作为探索信号来找到有趣的 trace。审查高分和低分,并将其视为探索线索。根据你学到的东西,你可以为你发现的失败模式构建自定义评估器。
  • 分层抽样: 按关键维度(用户类型、功能、查询类别)对 trace 进行分组,并从每个组中抽样。

As you get more sophisticated with how you sample, you can incorporate these tactics into the design of your annotation tools.

随着你抽样方法的日趋成熟,你可以将这些策略融入到你的标注工具设计中。
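
下面是把上述几种抽样策略落到代码上的一个示意。假设每条 trace 是带有 latency、user_type、feedback 等字段的字典,这些字段名和数据都只是示例。

```python
import random

random.seed(0)
# 假设的生产 trace 元数据
traces = [
    {"id": i, "latency": random.uniform(0.5, 12.0),
     "user_type": random.choice(["free", "pro"]),
     "feedback": random.choice([None, None, None, "negative"])}
    for i in range(500)
]

# 1) 异常值检测:按延迟排序,审查最慢的 10 条
outliers = sorted(traces, key=lambda t: t["latency"], reverse=True)[:10]

# 2) 用户反馈信号:优先审查带负面反馈的 trace
flagged = [t for t in traces if t["feedback"] == "negative"]

# 3) 分层抽样:按用户类型分组,每组抽 20 条
by_type: dict[str, list] = {}
for t in traces:
    by_type.setdefault(t["user_type"], []).append(t)
stratified = [t for group in by_type.values() for t in random.sample(group, k=min(20, len(group)))]

review_queue = outliers + flagged + stratified
print(f"本次审查队列共 {len(review_queue)} 条 trace")
```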

脚注

  1. Eleanor Berger,我们出色的助教。↩︎
