Update nexus wiki content
wiki/concepts/LLM-as-a-Judge.md
@@ -0,0 +1,46 @@
---
title: "LLM-as-a-Judge"
type: concept
tags: []
sources: [engineering-autonomous-optimization-architect]
last_updated: 2026-05-01
---

# LLM-as-a-Judge

## Definition

LLM-as-a-Judge uses one LLM as an automated evaluator that continuously assigns quantitative quality scores to the outputs of another LLM (or of a different configuration of the same LLM).
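
## Evaluation Framework

Before running a dark-launch experiment, define an explicit numerical scoring rubric:

| Dimension | Points | Description |
|-----------|--------|-------------|
| JSON format correctness | 5 | Whether the output is structured and parseable |
| Latency | 3 | Whether the response time is within the SLA |
| Accuracy | 5 | Whether the content meets the requirements |
| Hallucination detection | -10 | Penalty applied when a factual error appears |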
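
The rubric can be turned into a small scoring function. The following is a minimal sketch, not the implementation used by any particular agent. It assumes each dimension is all-or-nothing (full points or zero), which the table does not state, and the names `Verdict`, `is_parseable_json`, and `rubric_score` are hypothetical.

```python
import json
from dataclasses import dataclass

# Point values taken from the rubric table.
JSON_POINTS = 5
LATENCY_POINTS = 3
ACCURACY_POINTS = 5
HALLUCINATION_PENALTY = -10


@dataclass
class Verdict:
    """Hypothetical record of the checks made on one model response."""
    json_valid: bool    # output parsed as JSON
    within_sla: bool    # response time met the SLA
    accurate: bool      # content met the requirements
    hallucinated: bool  # a factual error was detected


def is_parseable_json(text: str) -> bool:
    """Return True if the text can be parsed as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def rubric_score(v: Verdict) -> int:
    """Sum the points earned per dimension, then apply the penalty.

    Assumption: every dimension is binary (full points or none).
    """
    score = 0
    if v.json_valid:
        score += JSON_POINTS
    if v.within_sla:
        score += LATENCY_POINTS
    if v.accurate:
        score += ACCURACY_POINTS
    if v.hallucinated:
        score += HALLUCINATION_PENALTY
    return score
```

Under these assumptions the score ranges from -10 to 13, so a single hallucination outweighs most of the points a response can earn.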
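
## Why It Matters

- **Scalable**: no manual labeling is needed; 1000+ experiments can be evaluated automatically
- **Consistent**: the same criteria are applied every time, avoiding the subjective drift of human review
- **Fast**: evaluations can run asynchronously and in parallel without blocking production traffic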

## Example Prompt

```
You are evaluating the output of Model B against Model A.
Score out of 10 points on:
1. Factual accuracy (5 points)
2. JSON structure validity (3 points)
3. Completeness (2 points)
Deduct 10 points if hallucination is detected.
```
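
To use such a prompt in an automated pipeline, the judge's reply has to be converted into a number. The sketch below shows one way this might look. The prompt above does not specify a reply format, so the JSON shape, the key names, and the function `total_from_judge_reply` are all assumptions made for illustration.

```python
import json

# Maximum points per criterion, as listed in the example prompt.
MAX_POINTS = {"factual_accuracy": 5, "json_validity": 3, "completeness": 2}
HALLUCINATION_DEDUCTION = 10


def total_from_judge_reply(reply: str) -> int:
    """Parse a judge reply and return the total score.

    Assumption (hypothetical format): the judge answers with a JSON object
    such as {"factual_accuracy": 4, "json_validity": 3,
    "completeness": 2, "hallucination": false}.
    """
    data = json.loads(reply)
    total = 0
    for name, cap in MAX_POINTS.items():
        raw = int(data.get(name, 0))
        # Clamp each sub-score to its allowed range so a sloppy judge
        # reply cannot inflate the total.
        total += max(0, min(cap, raw))
    if data.get("hallucination", False):
        total -= HALLUCINATION_DEDUCTION
    return total
```

Clamping is a defensive choice: judge models occasionally return values outside the requested range, and bounding each sub-score keeps totals comparable across experiments. Note that the deduction can push the total below zero.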

## Related

- [[Autonomous-Optimization-Architect]]: the core agent that implements LLM-as-a-Judge
- [[Shadow-Traffic]]: LLM-as-a-Judge evaluates the experimental model's performance during shadow testing