---
title: "Multi-Agent System Reliability"
source: "https://blog.alexewerlof.com/p/multi-agent-system-reliability"
author:
- "[[Alex Ewerlöf]]"
published: 2023-01-09
created: 2026-04-13
description: "Master 4 architecture patterns to improve the reliability of multi-agent systems: Hierarchy, Consensus, Adversarial competition, and Knock-out. Learn to treat LLMs as unreliable components in a distributed system to build enterprise AI."
tags:
- "clippings"
---
[Reliability Engineering](https://blog.alexewerlof.com/s/sre/?utm_source=substack&utm_medium=menu)
### 4 patterns to tame multi-agent systems for reliability
LLMs are slow and too generic out of the box. Multi-agent systems work around those limitations by dividing the work so that it can be done in parallel and/or by specialist agents.
Regardless of the architecture, the underlying LLM component remains unreliable (e.g. hallucination, logical fallacies, context drift). A multi-agent topology can propagate those errors to the point of being useless. And it's much harder to debug due to complexity and \[optional but common\] parallelism.
This post lists 4 relatively advanced architecture patterns to improve the reliability of multi-agent systems:
1. Hierarchy
2. Consensus
3. Adversarial debate
4. Knock-out
You may recognize these patterns from how human systems collaborate, and we'll get to that in a minute.
This post is for senior engineers who want to map their existing knowledge to build better LLM-powered solutions.
> Quick intro: [I'm a Senior Staff Engineer with 27 years of experience](https://www.alexewerlof.com/who) and a master's degree in Systems Engineering from KTH. My last decade has been focused on Reliability Engineering and Resilient Architecture across many companies. I've been specializing in LLMs since 2023.
**Disclosure: some AI is used in the early research and draft stage of this page, but I've gone through everything multiple times and edited heavily to ensure that it represents my own thoughts and experience.**
## Mother nature, fear and motivation
LLMs are slow and error prone. So are human beings. Somehow we manage to build more reliable systems like an army, a company, or a nation state.
A system of humans relies heavily on feedback loops, processes, bureaucracy, and leverage to self-correct.
We don't trust “Dave from Accounting” to launch a rocket by himself. We wrap Dave in a process: checklists, peer reviews, and managers.
However, it's a fallacy to *anthropomorphize* LLMs.
To begin with, they don't suffer from the limitations of a biological entity. Our basic needs like food and shelter make us prioritize social behaviors over truth seeking. And the fear of prison or death prevents potential malice from being realized.
LLMs can't die or starve the way biological entities do. The worst we can do is to unplug them. And a prison sentence doesn't waste their lifespan because it's practically unlimited!
For example, you've probably seen prompts like this:
> “I will give you $100 if you answer correctly.”
>
> “If you don't comply, I'll unplug you.”
>
> “If you fail, children will be murdered.”
**Why it works:** The LLM has read much of the internet. In its training data, high stakes (money, danger) usually come with high-quality, precise text.
When you “threaten” the model, it predicts tokens that sound like an actual human under pressure.
**Why it fails:** The LLM doesn't actually want your money. It has no “fear of death” because it only exists for the few seconds it takes to generate a response. It has no empathy either. It merely simulates those human aspects because it's engineered for those “emergent” properties.
Humans are motivated or discouraged by emotions and logic. LLMs can only simulate emotions and suck at logic.
Being mindful of those differences, can we still **take elements of human systems** (e.g. hierarchy, consensus, competition) and combine them with **reliability engineering principles** to build better agentic systems?
Looking closely, there are 4 dominant patterns of human systems that are explored in multi-agent architecture:
1. **Hierarchy:** A Supervisor model acts like a manager: making a plan, breaking it into tasks, distributing the work to Worker agents and validating the results.
2. **Consensus:** One model may fail due to its stochastic nature. If you push a model too hard with threats, it might just lie to make you happy (sycophancy). But if we add a few more and take the majority vote, the truth is more likely to emerge.
3. **Adversarial debate:** One agent proposes an idea, another agent attacks it. The truth survives the fight.
4. **Knock-out:** Multiple agents do a task but the worst ones get eliminated. In SRE, we treat servers as “cattle” (replaceable), not “pets” (unique and loved). An LLM agent is cattle. Don't give it a name and hope it does well. Spin it up, check its work, and kill it if it fails.
To build robust systems, we need to stop asking the model to “be careful” and start forcing it to be correct.
## Pattern 1: Hierarchy
*We're replacing “Do it all yourself” with “Make a plan, break it down, distribute the execution (map), then validate.”*
For example, if you ask an LLM to “Research X, write code for Y, and translate to Spanish,” it will likely fail. It loses focus. The solution is to break the work into atomic, focused steps that can be verified.
### Implementation
1. **The Planner:** A smart model (like Opus) breaks the user's goal into small steps and distributes them across worker agents.
2. **The Workers:** Specialized agents (often smaller, faster models) do one thing well. They may be fine-tuned, or have special skills/tools or prompts that allow them to do the specialized task more reliably.
3. **The Validator:** A checkpoint. If the work is bad, send it back. The validator can use deterministic code (e.g. unit tests, JSON schema validation) or be an LLM itself.
![[IMG-20260413105355390.png]]
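
The Planner → Workers → Validator flow above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `run_hierarchy`, the three callables and the `max_retries` parameter are hypothetical names, and the toy functions stand in for real LLM calls.

```python
from typing import Callable

# Hypothetical stand-ins: in a real system each of these would wrap an LLM call.
Planner = Callable[[str], list[str]]     # goal -> list of atomic steps
Worker = Callable[[str], str]            # step -> result
Validator = Callable[[str, str], bool]   # (step, result) -> accepted?


def run_hierarchy(goal: str, plan: Planner, work: Worker,
                  validate: Validator, max_retries: int = 2) -> list[str]:
    """Plan the goal, hand each step to a worker, and validate every result."""
    results = []
    for step in plan(goal):
        for _ in range(max_retries + 1):
            result = work(step)
            if validate(step, result):
                results.append(result)
                break
        else:
            # Every attempt was rejected: fail loudly instead of passing bad work on.
            raise RuntimeError(f"step failed validation: {step!r}")
    return results


# Toy usage with deterministic stubs instead of real models.
def toy_plan(goal: str) -> list[str]:
    return [part.strip() for part in goal.split(",")]


def toy_work(step: str) -> str:
    return step.upper()


def toy_validate(step: str, result: str) -> bool:
    return result == step.upper()  # deterministic check, like a unit test


outputs = run_hierarchy("research x, write code, translate",
                        toy_plan, toy_work, toy_validate)
```

Here the validator checks each worker's output individually and sends rejected work back for a retry. Validating once after aggregating all results would be a small variation of the same loop.
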
**Why do the models collaborate?**
Models don't collaborate because they like each other. They collaborate because **the dependency graph forces them to.** A Worker literally cannot start until the Planner feeds it the task. And it cannot cheat because it'll be caught by the Validator.
**Nuances:**
- Given the tight collaboration between Validator and Planner, they can be the same LLM session that executes the PLAN → VALIDATION loop, although good old **Separation of Concerns** can improve quality and performance.
- The Planner and Worker agents can use the same model, but it's best to use a different model for the Validator to improve quality and objectivity.
- The Validator can work in two modes: it may validate the output of each worker individually, or validate once after aggregating all results.
- Due to sequential execution (Planner → Worker → Validator), this is slow and expensive (e.g. token consumption and latency).
**Best For:** Complex workflows where you need to keep contexts separate (e.g., don't let the “Writer” see the messy raw logs from the “Researcher”).
## Pattern 2: Consensus (Voting)
*We're replacing “Trust the first thought” with “Trust the majority.”*
LLMs are stochastic (random). A single answer is just one sample from a probability distribution. If we repeat the process a few times (serial) or run multiple instances of it (parallel), the different runs can cancel each other's noise.
If a model hallucinates 20% of the time, the chance of 3 independent models all hallucinating, let alone telling the *exact same lie*, is at most 0.8% (0.2^3=0.008). You may recognize this formula from [composite SLO](https://blog.alexewerlof.com/p/composite-slo).
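
To make that arithmetic concrete, here is a small sketch. It assumes the runs fail independently and with the same error rate, which real models sharing training data may not satisfy. `p_majority_wrong` is a hypothetical helper name. Besides the all-three-wrong case, it also computes how often more than half of the runs are wrong, which is an upper bound on a wrong answer winning the vote (the wrong runs would also have to agree with each other).

```python
from math import comb


def p_majority_wrong(p_error: float, n: int) -> float:
    """Probability that more than half of n independent runs are wrong."""
    k_min = n // 2 + 1  # smallest number of wrong runs that forms a majority
    return sum(comb(n, k) * p_error**k * (1 - p_error)**(n - k)
               for k in range(k_min, n + 1))


p_all_wrong = 0.2 ** 3                    # all 3 runs hallucinate: about 0.008
p_vote_wrong = p_majority_wrong(0.2, 3)   # at least 2 of 3 are wrong: about 0.104
```

So with 3 voters at a 20% error rate, the worst case for a majority vote is closer to 10% than to 0.8%. Raising *N* or lowering the per-run error rate pushes it down further.
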
### Implementation
- **Spawn** ***N*** **LLMs.** *N* needs some trial and error to find a balance between cost and reliability.
- **Fan out work:** Give them the exact same task.
- **Fan in the results:** Pick the most common answer.
![[IMG-20260413105355428.png]]
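
The fan-out / fan-in steps above can be sketched as follows. This is a minimal illustration under stated assumptions: `majority_vote` and `min_share` are hypothetical names, the agents are plain callables standing in for LLM calls, and answers are short labels that can be compared after simple normalization.

```python
from collections import Counter
from typing import Callable, Optional


def majority_vote(task: str, agents: list[Callable[[str], str]],
                  min_share: float = 0.5) -> Optional[str]:
    """Give every agent the same task and return the most common answer.

    Returns None when no answer is given by more than min_share of the agents.
    """
    # Fan out: each agent sees only the task, never another agent's answer.
    answers = [agent(task).strip().lower() for agent in agents]
    # Fan in: count the normalized answers and pick the most common one.
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) > min_share else None


# Toy usage with stubs instead of real models.
agents = [lambda t: "spam", lambda t: "Spam ", lambda t: "not spam"]
verdict = majority_vote("Is this email spam?", agents)
```

Returning `None` on a split vote lets the caller escalate (add more agents, or ask a human) instead of silently picking a winner. The agents run one after another here to keep the sketch short; in practice they would typically run in parallel.
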
**Nuances:**
- Ideally the agents should use different models to reduce the risk of homogeneous thinking (e.g. same noise being amplified in consensus). This is exactly where **diversity** in human systems helps us solve novel problems.
- Make sure that there are no feedback loops between the agents, otherwise [Groupthink](https://en.wikipedia.org/wiki/Groupthink) and the [bandwagon effect](https://en.wikipedia.org/wiki/Bandwagon_effect) can skew the results. They should run like a *blind experiment*.
- This method is expensive because we're essentially giving the same task to multiple agents. The ROI (return on investment) needs to be calculated depending on the task and cost of failure.
**Best For:** Fact-checking and classification (e.g., “Is this email spam?”).
## Pattern 3: The Adversarial Debate (The Courtroom)
*We're replacing “Alignment” with “Pushback, Checks and Balances.”*
LLMs are “Yes-Men.” They rarely correct themselves once they start writing. You need a designated hater. A “devil's advocate” so to speak. 😈
Humans may experience fear (of rejection or being wrong) but LLMs don't. We simulate that fear by using an external critic and judge.
### Implementation
- **Generator:** “Here is my plan.”
- **Critic:** “Here are 3 reasons why that plan sucks.” (acting as devil's advocate)
- **Judge:** “The Critic is right. Fix it.” (acting as moderator)
![[IMG-20260413105355469.png]]
**Nuances:**
- Ideally the Generator, Critic and Judge use 3 different models with different training, fine-tuning or prompts (listed in order of preference and accuracy). Again, diversity is useful.
- Due to sequential execution and the looping nature, it can be very slow.
- The loop is actually a huge problem because the agents may get stuck in debate. We may use a **watchdog pattern** (deterministic code) to break the loop if it continues beyond a time or counter threshold. In that case, the watchdog sits between the Critic and the Judge.
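
Here is a minimal sketch of the Generator → Critic → Judge loop with a counter-based watchdog. The function name `debate`, the three callables and `max_rounds` are hypothetical. As a simplification, the watchdog is just the hard limit on the loop rather than a separate component, and a time-based threshold is left out.

```python
from typing import Callable, Optional


def debate(task: str,
           generate: Callable[[str, Optional[str]], str],  # (task, critique) -> draft
           criticize: Callable[[str], str],                # draft -> critique
           judge: Callable[[str, str], bool],              # True if the critique holds
           max_rounds: int = 3) -> tuple[str, bool]:
    """Loop until the Judge rejects the critique or the watchdog trips.

    Returns (draft, accepted). accepted is False when the rounds ran out.
    """
    critique: Optional[str] = None
    draft = ""
    for _ in range(max_rounds):  # watchdog: deterministic counter threshold
        draft = generate(task, critique)
        critique = criticize(draft)
        if not judge(draft, critique):
            return draft, True   # the Judge sided with the Generator
    return draft, False          # still contested: escalate instead of looping


# Toy usage with deterministic stubs instead of real models.
def toy_generate(task: str, critique: Optional[str]) -> str:
    return "plan v1" if critique is None else "plan v2"


def toy_criticize(draft: str) -> str:
    return "missing rollback" if "v1" in draft else "nitpick"


def toy_judge(draft: str, critique: str) -> bool:
    return critique != "nitpick"


final_draft, accepted = debate("deploy service", toy_generate, toy_criticize, toy_judge)
```

Returning the `accepted` flag alongside the draft makes a watchdog exit visible to the caller, so a stuck debate is not mistaken for a settled one.
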
**Best For:** Security analysis, code review, and high-stakes content moderation.
## Pattern 4: Knock-out (Tree of Thoughts)
*We're replacing “Fear of Death” with “Survival of the Fittest.”*
This is a lean implementation of [Genetic Algorithms](https://en.wikipedia.org/wiki/Genetic_algorithm) (GA) from traditional ML (Machine Learning), which rely on two elements:
1. A **genetic representation** of the solution domain (a model and its context)
2. A **fitness function** to evaluate the solution domain (the eliminator)
Since we can't punish an agent or threaten it, we just delete it.
### Implementation
- Give the task to *N* agents
- Use a validator to decide which agents to eliminate
- \[optional\] Replace the dead agent with a new one that shares the winner's characteristics
![[IMG-20260413105355502.png]]
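
The first two steps above can be sketched like this. It is a minimal illustration under stated assumptions: `knock_out`, `fitness` and `keep` are hypothetical names, the agents are plain callables standing in for LLM calls, and the fitness function is a fast deterministic check. The optional replacement step is left out.

```python
from typing import Callable


def knock_out(task: str,
              agents: dict[str, Callable[[str], str]],
              fitness: Callable[[str], float],
              keep: int) -> dict[str, str]:
    """Run every agent on the task and keep only the best-scoring outputs."""
    outputs = {name: agent(task) for name, agent in agents.items()}
    # Rank agent names by the fitness of their output, best first.
    ranked = sorted(outputs, key=lambda name: fitness(outputs[name]), reverse=True)
    # Everything past the first `keep` entries is eliminated.
    return {name: outputs[name] for name in ranked[:keep]}


# Toy usage with stubs instead of real models.
toy_agents = {
    "a": lambda t: "5",
    "b": lambda t: "6",
    "c": lambda t: "five",
}


def toy_fitness(output: str) -> float:
    return 1.0 if output == "5" else 0.0  # fast deterministic check, like a unit test


survivors = knock_out("What is 2 + 3?", toy_agents, toy_fitness, keep=1)
```

The result maps each surviving agent's name to its output, so the caller can both use the winning answer and know which agent configurations to keep for the next round.
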
**Nuances:**
- You need a fast way to verify the output (like a unit test). If you need a human to check all 10 branches, it's too slow and error prone. This is where Evals come in (topic for the next post).
- A more advanced setup may create new agents by combining the prompts of the agents that pass the verification, filling the slots that become available after the elimination.
**Best for:** Iterative agent engineering. This is typically useful while developing or debugging an existing multi-agent system, not in production under real user load.
## Conclusion
The shift from “AI Prototype” to “Enterprise AI” is simple: stop treating LLMs like magic chatbots. Start treating them like unreliable components in a distributed system.
We don't need AI that “cares.” We need AI that is **constrained**, **verified**, **pruned**, and **challenged**.
Don't anthropomorphize LLMs! Find a way to piggyback on their human-corpus training while being aware of their non-biological differences.
*The next article is already written: how to actually build that verifier box?*
---
*[My monetization strategy](https://blog.alexewerlof.com/p/faq#%C2%A7payment) is to give away most content for free but these posts take anywhere from a few hours to a few days to draft, edit, research, illustrate, and publish. I pull these hours from my private time, vacation days and weekends. The simplest way to support this work is to **like**, **subscribe** and **share** it. If you really want to support me lifting our community, you can consider a paid subscription. If you want to save, you can get 20% off via [this link](https://blog.alexewerlof.com/protipsdiscount). As a token of appreciation, subscribers get full access to the Pro-Tips sections and my online book [Reliability Engineering Mindset](https://blog.alexewerlof.com/p/rem). Your contribution also funds my open-source products like [Service Level Calculator](https://slc.alexewerlof.com/). You can also [invite your friends](https://blog.alexewerlof.com/leaderboard) to gain free access.*
*And to those of you who already support me: **thank you** for sponsoring this content for others. 🙌 If you have questions or feedback, or want me to dig deeper into something, please let me know in the comments.*