nexus/wiki/concepts/LLM-as-a-Judge.md
2026-05-03 05:42:12 +08:00

---
title: "LLM-as-a-Judge"
type: concept
tags: []
sources: [engineering-autonomous-optimization-architect]
last_updated: 2026-05-01
---
# LLM-as-a-Judge
## Definition
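LLM-as-a-Judge uses one LLM as an automated evaluator that continuously assigns quantitative scores to the output quality of another LLM (or of a different configuration of the same LLM).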
## Evaluation Framework
Before running a dark-launch experiment, define an explicit numerical scoring rubric:
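| Dimension | Points | Description |
|------|------|------|
| JSON format correctness | 5 | Whether the output is structured and parseable |
| Latency | 3 | Whether the response time is within the SLA |
| Accuracy | 5 | Whether the content meets the requirements |
| Hallucination detection | -10 | Whether the output contains factual errors |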
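
The rubric above can be turned into a small scoring function. The following is a minimal sketch, assuming each dimension is awarded all-or-nothing and that the accuracy and hallucination verdicts come from the judge model; the `Observation` type, the `rubric_score` helper, and the `sla_ms` threshold are hypothetical names introduced for illustration, not part of the source.

```python
import json
from dataclasses import dataclass

# Point values taken from the rubric table.
JSON_POINTS = 5
LATENCY_POINTS = 3
ACCURACY_POINTS = 5
HALLUCINATION_PENALTY = -10


@dataclass
class Observation:
    """One evaluated response (hypothetical container for illustration)."""
    raw_output: str     # text returned by the model under test
    latency_ms: float   # measured response time
    accurate: bool      # judge verdict: content meets the requirements
    hallucinated: bool  # judge verdict: a factual error was detected


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def rubric_score(obs: Observation, sla_ms: float) -> int:
    """Sum the rubric points; assumes all-or-nothing credit per dimension."""
    score = 0
    if is_valid_json(obs.raw_output):
        score += JSON_POINTS
    if obs.latency_ms <= sla_ms:
        score += LATENCY_POINTS
    if obs.accurate:
        score += ACCURACY_POINTS
    if obs.hallucinated:
        score += HALLUCINATION_PENALTY
    return score
```

Under these assumptions the best possible score is 13, and a hallucination on an otherwise perfect response drops it to 3, so the penalty dominates the ranking of candidate configurations.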
## Why It Matters
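- **Scalable**: no manual labeling is required; 1,000+ experiments can be evaluated automatically
- **Consistent**: the same criteria are applied every time, avoiding the subjective variation of human review
- **Fast**: evaluations can run asynchronously and in parallel without blocking production traffic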
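
The asynchronous, parallel execution mentioned above might look like the sketch below. It assumes Python's `asyncio`; `judge_one` is a hypothetical stand-in that returns a dummy score instead of calling a real judge model, and `max_concurrency` is an illustrative parameter rather than a value from the source.

```python
import asyncio


async def judge_one(sample: str) -> int:
    """Hypothetical stand-in for a real judge-model call."""
    await asyncio.sleep(0)   # placeholder for network I/O
    return len(sample) % 11  # dummy score, not a real evaluation


async def judge_all(samples: list[str], max_concurrency: int = 8) -> list[int]:
    """Score all samples concurrently, capping the number of in-flight calls."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(sample: str) -> int:
        async with semaphore:
            return await judge_one(sample)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(s) for s in samples))


if __name__ == "__main__":
    print(asyncio.run(judge_all(["a", "bb", "ccc"])))
```

Because the judging runs in its own event loop over logged samples, it stays off the request path of production traffic; the semaphore only limits how many judge calls are outstanding at once.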
## Example Prompt
```
You are evaluating the output of Model B against Model A.
Score out of 10 points on:
1. Factual accuracy (5 points)
2. JSON structure validity (3 points)
3. Completeness (2 points)
Deduct 10 points if hallucination is detected.
```
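
The example prompt does not say how the judge should format its reply. As a sketch, assume the judge is additionally instructed to answer with a JSON object such as `{"accuracy": 4, "json_validity": 3, "completeness": 2, "hallucination": false}`; the key names and the `total_from_verdict` helper are hypothetical. The reply can then be validated and totalled like this:

```python
import json

# Maximum points per criterion, copied from the example prompt.
MAX_POINTS = {"accuracy": 5, "json_validity": 3, "completeness": 2}
HALLUCINATION_DEDUCTION = 10


def total_from_verdict(reply: str) -> int:
    """Parse a judge reply (assumed JSON format) and return the total score."""
    verdict = json.loads(reply)
    total = 0
    for name, cap in MAX_POINTS.items():
        points = verdict[name]
        # Reject booleans, floats, and out-of-range values.
        if type(points) is not int or not 0 <= points <= cap:
            raise ValueError(f"{name} must be an integer in [0, {cap}], got {points!r}")
        total += points
    if verdict.get("hallucination", False):
        total -= HALLUCINATION_DEDUCTION
    return total
```

Validating each sub-score against its cap catches a judge that ignores the rubric, and the total is deliberately not clamped, so a hallucination can push it below zero.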
## Related
- [[Autonomous-Optimization-Architect]]: the core agent that implements LLM-as-a-Judge
- [[Shadow-Traffic]]: LLM-as-a-Judge evaluates the experimental model's performance during shadow testing