nexus/wiki/concepts/LLMasJudge.md at b40abbcd473a7093d8261e212e3d6de97c1e516a

Shen Wei 5854781fa8 chore: sync local project changes

2026-04-27 16:26:34 +08:00

Aliases

LLM as a Judge
LLM-as-Judge
LLM-as-a-Judge Grading

Definition

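LLM-as-a-Judge is the scoring mechanism of AutonomousOptimizationArchitect: an independent LLM (for example, Claude Opus) acts as a "judge" and objectively scores the outputs of the experimental model and the production model, which avoids the subjective bias of human review. Scoring dimensions include JSON format correctness (5 points), latency (3 points), and hallucination detection (-10 points), among others.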
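A minimal sketch of how the three scoring dimensions named above could be combined into one score. Only the point values (5, 3, -10) come from this note. The `JudgedOutput` type, the `score` function, the 1000 ms latency budget, and the idea that the judge LLM returns a boolean hallucination verdict are assumptions made for illustration.

```python
import json
from dataclasses import dataclass

# Point values taken from the Definition section of this note.
JSON_POINTS = 5
LATENCY_POINTS = 3
HALLUCINATION_PENALTY = -10

# Hypothetical latency budget; the note does not state a threshold.
LATENCY_BUDGET_MS = 1000.0


@dataclass
class JudgedOutput:
    """One model response plus the measurements the rubric needs (hypothetical type)."""
    text: str
    latency_ms: float
    hallucinated: bool  # assumed to be the judge LLM's verdict


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def score(output: JudgedOutput, latency_budget_ms: float = LATENCY_BUDGET_MS) -> int:
    """Sum the rubric points earned by a single output."""
    total = 0
    if is_valid_json(output.text):
        total += JSON_POINTS
    if output.latency_ms <= latency_budget_ms:
        total += LATENCY_POINTS
    if output.hallucinated:
        total += HALLUCINATION_PENALTY
    return total
```

Under these assumptions, a valid JSON reply that arrives within the budget and is not flagged as a hallucination scores 8, and a flagged non-JSON reply within the budget scores -7.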

Mechanism

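Scoring criteria defined up front: before ShadowTraffic testing begins, AutonomousOptimizationArchitect explicitly establishes a mathematical scoring rubric
Asynchronous evaluation: the experimental model and the production model process tasks in parallel, and the judge LLM scores both outputs blind
Statistical analysis: once enough samples have accumulated, a statistical significance test is run
Autonomous decision: when the experimental model significantly outperforms the baseline, the routing weights are updated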
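The statistical analysis and decision steps can be sketched as follows. This note does not say which significance test is used, so the one-sided paired sign test here is a stand-in chosen for simplicity. The function names, the `alpha` of 0.05, and the `min_samples` of 10 are hypothetical.

```python
from math import comb
from typing import Sequence


def sign_test_p_value(experimental: Sequence[float], production: Sequence[float]) -> float:
    """One-sided paired sign test: P(at least this many wins) if both models were equal.

    Each index is one task scored by the judge for both models. Ties are dropped.
    """
    wins = sum(e > p for e, p in zip(experimental, production))
    losses = sum(e < p for e, p in zip(experimental, production))
    n = wins + losses
    if n == 0:
        return 1.0  # only ties: no evidence either way
    # Upper tail of a Binomial(n, 0.5) distribution.
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def should_promote(
    experimental: Sequence[float],
    production: Sequence[float],
    alpha: float = 0.05,    # assumed significance level
    min_samples: int = 10,  # assumed minimum number of paired samples
) -> bool:
    """Decide whether the experimental model significantly beats the baseline."""
    if len(experimental) != len(production):
        raise ValueError("scores must be paired per task")
    if len(experimental) < min_samples:
        return False  # not enough samples accumulated yet
    return sign_test_p_value(experimental, production) < alpha
```

For example, if the experimental model wins 9 of 10 paired comparisons with no ties, the p-value is 11/1024 (about 0.011), which is below the assumed 0.05 threshold, so `should_promote` returns `True`.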

Key Properties

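Objective: removes the subjective bias of human scoring
Scalable: can evaluate the outputs of multiple Providers at the same time
Data-driven: scoring results directly drive SemanticRouting decisions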
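As an illustration of the data-driven property, here is one way a routing weight update could look once a provider has been judged significantly better. The update rule, the `shift_weight` name, and the `step` size are assumptions; this note does not describe how SemanticRouting stores or adjusts its weights.

```python
from typing import Dict


def shift_weight(weights: Dict[str, float], promoted: str, step: float = 0.1) -> Dict[str, float]:
    """Move a fraction `step` of total routing weight toward the promoted provider.

    Every weight is scaled by (1 - step), then `step` is added to the promoted
    provider, so weights that summed to 1 still sum to 1 afterwards.
    """
    if promoted not in weights:
        raise KeyError(f"unknown provider: {promoted}")
    if not 0.0 <= step <= 1.0:
        raise ValueError("step must be between 0 and 1")
    updated = {name: w * (1.0 - step) for name, w in weights.items()}
    updated[promoted] += step
    return updated
```

Because the rule works on a mapping of any size, the same call applies when several Providers are being compared at once. A small `step` keeps each promotion gradual rather than switching all traffic after a single significant result.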

Connections

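AutonomousOptimizationArchitect: LLM-as-Judge is its core evaluation tool
ShadowTraffic: provides the traffic environment in which the experiment and the baseline run in parallel
SemanticRouting: scoring results update its routing weights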
