Update nexus wiki content

2026-05-03 05:42:06 +08:00
parent 90f3811b83
commit 111bc65b7b
707 changed files with 32306 additions and 7289 deletions
--- a/wiki/concepts/LLM-as-a-Judge.md
+++ b/wiki/concepts/LLM-as-a-Judge.md
@@ -0,0 +1,46 @@
+---
+title: "LLM-as-a-Judge"
+type: concept
+tags: []
+sources: [engineering-autonomous-optimization-architect]
+last_updated: 2026-05-01
+---
+
+# LLM-as-a-Judge
+
+## Definition
+
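+LLM-as-a-Judge: using one LLM as an automated evaluator that continuously assigns quantitative quality scores to the output of another LLM (or of a different configuration of the same LLM).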
+
+## Evaluation Framework
+
+Explicit numeric scoring criteria must be defined before a dark-launch experiment begins:
+
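+| Dimension | Points | Description |
+|-----------|--------|-------------|
+| JSON format correctness | 5 | Whether the output is structured and parseable |
+| Latency | 3 | Whether the response time is within the SLA |
+| Accuracy | 5 | Whether the content meets the requirements |
+| Hallucination detection | -10 | Penalty applied when the output contains factual errors |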
+
+## Why It Matters
+
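+- **Scalable**: needs no human labeling and can automatically evaluate 1000+ experiments
+- **Consistent**: the same criteria are applied every time, avoiding the subjective drift of human review
+- **Fast**: can run asynchronously and in parallel without blocking production traffic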
+
+## Example Prompt
+
+```
+You are evaluating the output of Model B against the output of Model A.
+Award up to 10 points in total:
+1. Factual accuracy (up to 5 points)
+2. JSON structure validity (up to 3 points)
+3. Completeness (up to 2 points)
+Deduct 10 points if any hallucination is detected; the final score may be zero or negative.
+```
+
+## Related
+
+- [[Autonomous-Optimization-Architect]]: the core agent that implements LLM-as-a-Judge
+- [[Shadow-Traffic]]: LLM-as-a-Judge evaluates the experimental model's performance during shadow testing
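
The scoring table under "Evaluation Framework" could be applied in code roughly as follows. This is a minimal sketch, not the page author's implementation: the names `JudgeVerdict`, `is_valid_json`, and `score_output` are hypothetical, and it assumes the judge LLM's opinion on accuracy and hallucination has already been reduced to two booleans. Only the point values (5, 3, 5, -10) come from the table.

```python
import json
from dataclasses import dataclass

# Point values copied from the Evaluation Framework table.
JSON_POINTS = 5
LATENCY_POINTS = 3
ACCURACY_POINTS = 5
HALLUCINATION_PENALTY = -10


@dataclass
class JudgeVerdict:
    """Hypothetical container for what the judge LLM reported about one output."""
    accurate: bool
    hallucinated: bool


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def score_output(raw_output: str, latency_ms: float, sla_ms: float,
                 verdict: JudgeVerdict) -> int:
    """Sum the rubric points for one model output (all-or-nothing per dimension)."""
    score = 0
    if is_valid_json(raw_output):
        score += JSON_POINTS
    if latency_ms <= sla_ms:
        score += LATENCY_POINTS
    if verdict.accurate:
        score += ACCURACY_POINTS
    if verdict.hallucinated:
        score += HALLUCINATION_PENALTY
    return score
```

For example, `score_output('{"a": 1}', 120, 500, JudgeVerdict(accurate=True, hallucinated=False))` earns the full 13 points, while the same output with `hallucinated=True` drops to 3. Treating each dimension as all-or-nothing is a simplification; a graded judge would return partial points per dimension.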