LLM-as-a-Judge

Definition

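LLM-as-a-Judge: using one LLM as an automated evaluator that continuously assigns quantitative scores to the output quality of another LLM (or of a different configuration of the same LLM).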

Evaluation Framework

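A clear numerical scoring rubric must be established before any dark-launch experiment:

Dimension	Points	Description
JSON format correctness	5	Whether the output is structured and parseable
Latency	3	Whether response time is within the SLA
Accuracy	5	Whether the content meets the requirements
Hallucination detection	-10	Whether factual errors appear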
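
As a minimal sketch of how the rubric above could be turned into a single number: the point values come from the table, but everything else is an assumption made for illustration. In particular, the function name `rubric_score`, the all-or-nothing awarding of the JSON and latency points, and the idea that the judge supplies `accuracy_points` and a `hallucination` flag are hypothetical, not something the rubric specifies.

```python
import json

# Point values taken from the rubric table.
JSON_POINTS = 5
LATENCY_POINTS = 3
ACCURACY_POINTS = 5
HALLUCINATION_PENALTY = -10


def is_valid_json(text):
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def rubric_score(output_text, latency_ms, sla_ms, accuracy_points, hallucination):
    """Combine the four rubric dimensions into one total.

    Assumption: JSON and latency points are awarded all-or-nothing, while
    accuracy_points (0..5) and hallucination come from the judge model.
    """
    score = 0
    if is_valid_json(output_text):
        score += JSON_POINTS
    if latency_ms <= sla_ms:
        score += LATENCY_POINTS
    # Clamp the judge's accuracy score to the range the rubric allows.
    score += max(0, min(ACCURACY_POINTS, accuracy_points))
    if hallucination:
        score += HALLUCINATION_PENALTY
    return score
```

Under these assumptions the total ranges from -10 to 13, so a detected hallucination outweighs a perfect result on any single positive dimension.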

Why It Matters

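Scalable: no human labeling is needed; 1000+ experiment runs can be evaluated automatically
Consistent: the same rubric is applied every time, avoiding the subjective variation of human review
Fast: evaluation can run asynchronously and in parallel without blocking production traffic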
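
The asynchronous, parallel execution mentioned above might look roughly like the following sketch. It uses only the standard-library `asyncio` module; `judge_one` is a hypothetical placeholder for a real judge-model call, and its return value is a dummy, not an actual evaluation. The `max_concurrency` limit is likewise an assumption, added to avoid overloading the judge endpoint.

```python
import asyncio


async def judge_one(sample):
    """Hypothetical stand-in for a network call to the judge model."""
    await asyncio.sleep(0)   # yields control, as a real I/O call would
    return len(sample) % 11  # dummy score, NOT a real evaluation


async def judge_all(samples, max_concurrency=8):
    """Score all samples concurrently, at most max_concurrency at a time."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(sample):
        async with sem:
            return await judge_one(sample)

    # gather returns results in the same order as the input samples.
    return await asyncio.gather(*(bounded(s) for s in samples))


def run_batch(samples):
    """Synchronous entry point, e.g. for an offline evaluation job."""
    return asyncio.run(judge_all(samples))
```

Because the batch runs in its own job rather than in the request path, production traffic is never waiting on the judge.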

Example Prompt

You are evaluating the output of Model B against Model A.
Score out of 10 points on:
1. Factual accuracy (5 points)
2. JSON structure validity (3 points)
3. Completeness (2 points)
Deduct 10 points if hallucination is detected.
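
The prompt above does not say how the judge should format its answer, so a harness has to choose a reply format. The sketch below assumes a JSON reply with the hypothetical keys `factual_accuracy`, `json_validity`, `completeness`, and `hallucination`; the function names are also illustrative. The point caps (5, 3, 2) and the 10-point deduction are taken from the prompt.

```python
import json

# Per-criterion caps from the example prompt; the key names are assumptions.
CAPS = {"factual_accuracy": 5, "json_validity": 3, "completeness": 2}
HALLUCINATION_DEDUCTION = 10

REPLY_FORMAT = (
    'Reply with JSON only: {"factual_accuracy": int, "json_validity": int, '
    '"completeness": int, "hallucination": bool}'
)


def build_judge_prompt(rubric_prompt, output_a, output_b):
    """Append both model outputs and the assumed reply format to the rubric."""
    return (
        f"{rubric_prompt}\n\n"
        f"Model A output:\n{output_a}\n\n"
        f"Model B output:\n{output_b}\n\n"
        f"{REPLY_FORMAT}"
    )


def parse_judge_reply(reply):
    """Turn the judge's JSON reply into a total, clamping each criterion."""
    data = json.loads(reply)
    total = sum(max(0, min(cap, int(data[key]))) for key, cap in CAPS.items())
    if data["hallucination"]:
        total -= HALLUCINATION_DEDUCTION
    return total
```

Clamping each criterion guards against a judge that returns values outside the rubric. A reply that is not valid JSON raises an exception, which the caller can treat as a failed evaluation rather than a score.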

Related

Autonomous-Optimization-Architect: the core agent that implements LLM-as-a-Judge
Shadow-Traffic: LLM-as-a-Judge evaluates the experimental model's performance during shadow testing
