nexus/wiki/sources/Your-AI-Isn-t-Stupid---It-Just-Needs-a-Better-Harness--Lychee-Technology-Engineering-Blog.md at b40abbcd473a7093d8261e212e3d6de97c1e516a

weishen 111bc65b7b Update nexus wiki content

2026-05-03 05:42:12 +08:00

Source File

Agent/Your AI Isn't Stupid — It Just Needs a Better Harness Lychee Technology Engineering Blog

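Summary

Core topic: Harness Engineering, the systems discipline of designing a "harness" for AI agents by embedding the LLM in strict code scaffolding so that multi-step autonomous execution becomes reliable.
Problem domain: agent breakdowns in long-horizon autonomous tasks (10-step collapse, context overflow, schema drift, state loss).
Method/mechanism: a 7-layer Harness Stack (Cognition → Tools → Contracts → Orchestration → Memory → Evaluation → Constraints & Recovery), with input and output validation at every boundary, idempotent retries for every tool call, and state externalized to persistent storage.
Conclusion/value: agents fail not because the model is weak but because system design is missing; the most successful builders are not those who write the best code but those who design the best harness.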

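Key Claims

An LLM by itself is not an agent: Agent = LLM + code scaffolding (state management + recovery workflows).
Prompt Engineering → Context Engineering → Harness Engineering is an evolution, not a replacement; Harness Engineering absorbs the first two and adds an execution-control layer.
Constrain rather than instruct: programmatic limits are more reliable than natural-language prompts.
Every LLM output must pass through a schema validator that rejects malformed output, rather than relying on the model to correct itself.
State must be externalized: the context window is volatile; files on disk are persistent.
Every tool call must be idempotent: when a single step fails, retry only that step instead of restarting the whole pipeline (fail locally, not globally).
Context Reset: when token usage exceeds the 70% threshold, save state, terminate the current instance, and start a fresh agent.
Self-Grading Illusion: an LLM cannot effectively evaluate its own output; having the same weights both generate and judge the output is a structural flaw.
Sprint Contract: the Generator and the Evaluator must be fully independent; the Evaluator runs in a clean context, receives only the output and the success criteria, and does not read the Generator's chain of thought.
Memory Consolidation: periodically compress the agent's accumulated work logs (in a measured case, 32K of noisy logs compressed to 7K) to prevent memory bloat and contradictions.
Minimum viable harness (buildable on day 1): state.json + retry wrapper + schema validator + tool output truncation.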
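
The four pieces of the minimum viable harness listed above can be sketched in a few dozen lines. This is an illustrative sketch, not the article's implementation: the function names (`save_state`, `load_state`, `validate`, `truncate`, `run_step`), the `{"completed": {...}}` layout of `state.json`, and the flat `field -> type` schema format are all assumptions made for the example.

```python
import json
import os
import tempfile
import time
from typing import Callable

STATE_PATH = "state.json"  # file name from the note; the layout below is an assumption


def save_state(state: dict, path: str = STATE_PATH) -> None:
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename over the old file


def load_state(path: str = STATE_PATH) -> dict:
    """Load persisted state, or start empty on the first run."""
    if not os.path.exists(path):
        return {"completed": {}}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def validate(output: str, schema: dict[str, type]) -> dict:
    """Reject malformed JSON and wrong field types instead of hoping for the best."""
    data = json.loads(output)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in schema.items():
        if name not in data:
            raise ValueError(f"missing field {name!r}")
        if not isinstance(data[name], expected):
            raise ValueError(f"field {name!r} is not {expected.__name__}")
    return data


def truncate(text: str, limit: int = 2000) -> str:
    """Cap tool output so one noisy tool cannot flood the context window."""
    if len(text) <= limit:
        return text
    return text[:limit] + f"\n[truncated {len(text) - limit} chars]"


def run_step(step_id: str, call: Callable[[], str], schema: dict[str, type],
             state: dict, retries: int = 3, delay: float = 0.0,
             path: str = STATE_PATH) -> dict:
    """Run one step idempotently: skip it if already done, retry only this step."""
    if step_id in state["completed"]:
        return state["completed"][step_id]
    last_error = None
    for _ in range(retries):
        try:
            result = validate(call(), schema)
            state["completed"][step_id] = result
            save_state(state, path)  # persist before moving on
            return result
        except ValueError as exc:
            last_error = exc
            time.sleep(delay)  # a real harness would likely back off here
    raise RuntimeError(f"step {step_id!r} failed after {retries} attempts") from last_error
```

With this shape, a model call that returns `{"price": "12.5"}` for a schema of `{"price": float}` is rejected and retried on its own, and a restarted process that reloads `state.json` skips every step already recorded under `completed`.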

Key Quotes

"The problem usually isn't the horse. It's the reins." (Core metaphor: the model is the horse, the harness is the reins.)
"A prompt that says 'always respond in valid JSON' is a hope. A schema validator that rejects malformed output is a guarantee." (Constraints beat instructions.)
"The model speaks in probabilities. The harness must speak in types." (The root of schema drift.)
"Fail locally, not globally." (When a single step fails, retry only that step; do not restart the whole pipeline.)

Key Concepts

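Harness-Engineering: the discipline of building system scaffolding around an LLM so that an agent executes reliably in long-horizon autonomous tasks. Its four design principles are constraint, externalization, validation, and recovery.
Agent-Collapse (10-Step Collapse): the phenomenon where an agent starts hallucinating or producing broken output partway through a multi-step task. The root cause is a silently truncated context window or missing schema validation.
Context-Anxiety: when context-window usage exceeds 70% or latency rises, the model shows "rushed" behavior, skipping steps or finishing the task prematurely.
Context-Reset: the programmatic operation the harness performs when Context Anxiety is triggered, namely saving state to disk, terminating the current instance, and starting a fresh agent.
Schema-Drift: a silent error in which the same LLM emits different data types for the same field across calls (for example, price as a string in one call and a float in another).
Sprint-Contract: a testable definition of "done" agreed between the Generator agent and an independent Evaluator agent before work starts; the Evaluator runs in a clean context and receives only the output and the criteria.
Self-Grading-Illusion: the structural flaw that an LLM cannot effectively evaluate its own output, because the weights that generated the output are the same ones judging it.
Memory-Consolidation: a mechanism that periodically compresses accumulated work logs while the agent is idle (deduplicate, resolve contradictions, write a condensed state file).
State-Externalization: the practice of writing task state (pending/in-progress/completed) to files on disk rather than keeping it only in the context window.
Idempotency: the guarantee that a tool call can be retried safely; on failure, exactly that step is retried, without polluting global state or repeating completed work.
7-Layer-Harness-Stack: Cognition / Tools / Contracts & Interfaces / Orchestration / Memory & State / Evaluation & Observation / Constraints & Recovery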
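
The Context-Reset concept above comes down to a threshold check in the orchestration loop. The sketch below is a simplified illustration under stated assumptions: `AgentInstance` and `run_tasks` are hypothetical names, token costs are supplied by the caller rather than measured, and the in-memory `state` dict stands in for the on-disk state file that a real harness would write before terminating the instance.

```python
from dataclasses import dataclass

RESET_THRESHOLD = 0.70  # the 70% usage threshold named in the note


@dataclass
class AgentInstance:
    """Hypothetical stand-in for one LLM session with a bounded context window."""
    context_limit: int
    tokens_used: int = 0

    def usage(self) -> float:
        return self.tokens_used / self.context_limit


def needs_reset(agent: AgentInstance, threshold: float = RESET_THRESHOLD) -> bool:
    """True once context usage strictly exceeds the threshold."""
    return agent.usage() > threshold


def run_tasks(tasks: list[tuple[str, int]], context_limit: int) -> tuple[dict, int]:
    """Run (name, token_cost) tasks in order; return (state, number_of_resets)."""
    # `state` lives outside the agent, so it survives every reset.
    state = {"pending": [name for name, _ in tasks], "completed": []}
    costs = dict(tasks)
    agent = AgentInstance(context_limit)
    resets = 0
    while state["pending"]:
        if needs_reset(agent):
            # Drop the old instance and start a fresh one with an empty context.
            agent = AgentInstance(context_limit)
            resets += 1
        name = state["pending"].pop(0)
        agent.tokens_used += costs[name]  # pretend the task consumed this many tokens
        state["completed"].append(name)
    return state, resets
```

The design point is that the decision to reset is made by code comparing a number to a threshold, not by asking the model whether it feels short on context.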

Key Entities

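Lychee-Technology: the company that published this engineering blog; it focuses on production-grade AI system design.
Anthropic: its research on steering vectors and internal model representations is cited in the article to support the Self-Grading Illusion argument.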

Connections

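Harness-Engineering ← extends ← Agent-Collapse (analyzes the root cause and provides a systematic solution)
7-Layer-Harness-Stack ← is_a ← Harness-Engineering (concrete implementation framework)
Context-Reset ← is_a ← Context-Anxiety (programmatic solution to the problem)
Sprint-Contract ← resolves ← Self-Grading-Illusion (breaks the structural flaw through role separation)
Memory-Consolidation ← addresses ← State Management (addresses memory bloat in long-running agents)
Hermes-Agent ← implements ← 7-Layer-Harness-Stack (if Hermes Agent implements the full harness architecture, it can be linked to this article)
Designing-for-Agentic-AI ← related ← Harness-Engineering (can serve as supplementary external design-principles documentation)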
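
The role separation behind the Sprint-Contract → Self-Grading-Illusion link can be illustrated with a small sketch. Everything here is an assumption made for the example: `SprintContract`, `evaluate`, and `generate` are hypothetical names, and plain Python predicates stand in for the independent Evaluator agent, which in the article runs as a separate LLM in a clean context.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SprintContract:
    """Testable definition of "done", agreed before work starts."""
    criteria: dict[str, Callable[[str], bool]]


def evaluate(output: str, contract: SprintContract) -> dict[str, bool]:
    """The evaluator receives only the output and the criteria, never the reasoning."""
    return {name: check(output) for name, check in contract.criteria.items()}


def generate() -> tuple[str, str]:
    """Hypothetical generator returning (chain_of_thought, output)."""
    reasoning = "I believe this summary is complete and excellent."
    output = "Harness engineering wraps an LLM in validated, recoverable code."
    return reasoning, output


contract = SprintContract(criteria={
    "mentions_harness": lambda text: "harness" in text.lower(),
    "under_100_chars": lambda text: len(text) < 100,
})

_reasoning, output = generate()        # the reasoning is deliberately discarded
verdict = evaluate(output, contract)   # the evaluator never sees _reasoning
accepted = all(verdict.values())
```

Because `evaluate` takes no reasoning argument at all, the generator's self-assessment cannot leak into the verdict; the separation is enforced by the function signature rather than by instruction.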

Contradictions

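No direct conflicts with other wiki pages. This article complements Designing-for-Agentic-AI: that page focuses on design principles, while this one focuses on the engineering implementation level.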
