---
title: "SpeakerDiarization"
type: concept
tags: ["voice-ai", "speech-processing", "speaker-attribution"]
last_updated: 2026-05-02
---

# SpeakerDiarization (Speaker Diarization)

## Definition

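Speaker diarization is the task of automatically determining "who spoke when" in an audio recording. It clusters voiceprint (speaker-embedding) features to partition continuous audio into segments belonging to different speakers, and tags each segment with a speaker label (`SPEAKER_00`, `SPEAKER_01`, and so on).
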
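The clustering step can be illustrated with a minimal sketch. It assumes each audio segment has already been mapped to a speaker-embedding vector by some model (not shown), and groups those vectors with greedy centroid-linkage agglomerative clustering on cosine similarity. This is a generic technique chosen for illustration, not the algorithm `pyannote.audio` or any cloud service actually uses; the `cluster_speakers` function and its `threshold=0.7` default are hypothetical.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_speakers(
    embeddings: list[list[float]],
    num_speakers: Optional[int] = None,
    threshold: float = 0.7,  # hypothetical similarity cutoff for the unknown-count case
) -> list[str]:
    """Assign a SPEAKER_XX label to each per-segment embedding.

    If num_speakers is known, merge until exactly that many clusters remain;
    otherwise stop once no pair of clusters is more similar than `threshold`.
    """
    # Start with one cluster (a list of segment indices) per segment.
    clusters: list[list[int]] = [[i] for i in range(len(embeddings))]

    def centroid(cluster: list[int]) -> list[float]:
        dim = len(embeddings[cluster[0]])
        return [sum(embeddings[i][d] for i in cluster) / len(cluster) for d in range(dim)]

    while len(clusters) > 1:
        if num_speakers is not None and len(clusters) <= num_speakers:
            break
        # Find the pair of clusters whose centroids are most similar.
        centroids = [centroid(c) for c in clusters]
        best_sim, best_pair = -math.inf, (0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine_similarity(centroids[i], centroids[j])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if num_speakers is None and best_sim < threshold:
            break
        i, j = best_pair  # j > i, so deleting j leaves index i valid
        clusters[i].extend(clusters[j])
        del clusters[j]

    # Number speakers by first appearance so SPEAKER_00 is whoever speaks first.
    clusters.sort(key=min)
    labels = [""] * len(embeddings)
    for k, cluster in enumerate(clusters):
        for idx in cluster:
            labels[idx] = f"SPEAKER_{k:02d}"
    return labels
```

Passing `num_speakers` replaces the similarity threshold with a fixed stopping point, which is one way to see why a known speaker count tends to help: the clusterer no longer has to guess when to stop merging.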
## Key Properties

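- **Common tools**: `pyannote.audio` (open source); diarization built into AssemblyAI and Deepgram
- **Input**: raw audio (or audio that has already been chunked)
- **Output**: a list of speaker segments in `[{start, end, speaker}]` format
- **Factors affecting accuracy**: whether the number of speakers is known in advance (knowing it can significantly improve accuracy), audio quality, and overlapping speech
- **Integration with transcription**: diarization results are merged with transcript segments by time overlap, producing `TranscriptSegment` entries that carry a speaker label
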
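Overlapping speech, listed above as an accuracy factor, shows up in the `[{start, end, speaker}]` output as turns from different speakers that share a time range, assuming the diarizer emits concurrent turns rather than forcing one speaker at a time. A minimal sketch, using a hypothetical `find_overlaps` helper, locates those regions with a sweep over segment boundaries; transcript text falling inside them is a likely place for attribution errors.

```python
def find_overlaps(segments: list[dict]) -> list[tuple[float, float, frozenset[str]]]:
    """Return (start, end, speakers) spans where two or more speakers are active at once."""
    # Each segment contributes a +1 event at its start and a -1 event at its end.
    events = []
    for seg in segments:
        events.append((seg["start"], 1, seg["speaker"]))
        events.append((seg["end"], -1, seg["speaker"]))
    # Sort by time; at equal times, ends (-1) come before starts (+1),
    # so back-to-back segments are not reported as overlap.
    events.sort(key=lambda e: (e[0], e[1]))

    active: dict[str, int] = {}  # speaker -> number of currently open segments
    overlaps: list[tuple[float, float, frozenset[str]]] = []
    prev_time = None
    for time, delta, speaker in events:
        if prev_time is not None and time > prev_time:
            speakers = frozenset(s for s, n in active.items() if n > 0)
            if len(speakers) >= 2:
                if overlaps and overlaps[-1][1] == prev_time and overlaps[-1][2] == speakers:
                    # Same speakers continue: extend the previous span instead of fragmenting it.
                    overlaps[-1] = (overlaps[-1][0], time, speakers)
                else:
                    overlaps.append((prev_time, time, speakers))
        active[speaker] = active.get(speaker, 0) + delta
        prev_time = time
    return overlaps
```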
## Pipeline Role

```
Audio → pyannote.audio → speaker segments
                ↓ merge (time-overlap matching)
FasterWhisper → transcript segments
                ↓
Speaker-attributed transcript segments
```

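The time-overlap merge in the diagram can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the `SpeakerTurn` and `TranscriptSegment` field layouts and the `assign_speakers` helper are hypothetical, and the matching rule (give each transcript segment the speaker whose turns overlap it for the longest total duration) is one plausible choice among several.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class SpeakerTurn:
    """One diarization output entry, i.e. {start, end, speaker}; times in seconds."""
    start: float
    end: float
    speaker: str


@dataclass
class TranscriptSegment:
    """Hypothetical transcript segment; `speaker` stays None until diarization is merged in."""
    start: float
    end: float
    text: str
    speaker: Optional[str] = None


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals (0.0 if they are disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    transcript: list[TranscriptSegment], turns: list[SpeakerTurn]
) -> list[TranscriptSegment]:
    """Attach to each transcript segment the speaker with the most overlapping time."""
    merged = []
    for seg in transcript:
        # Sum overlap per speaker: one speaker may have several turns inside a segment.
        totals: dict[str, float] = {}
        for turn in turns:
            ov = overlap(seg.start, seg.end, turn.start, turn.end)
            if ov > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + ov
        best = max(totals, key=lambda s: totals[s]) if totals else None
        # Return a copy rather than mutating the caller's segment.
        merged.append(replace(seg, speaker=best))
    return merged
```

Segments with no overlapping turn keep `speaker=None`; a real pipeline might instead fall back to the nearest turn. The nested loop is O(segments × turns), which is usually acceptable for a single recording but could be replaced by a two-pointer sweep over time-sorted lists.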
## Related Concepts

- [[VoiceActivityDetection]] — preprocessing step that runs before diarization
- [[StructuredTranscriptJSON]] — output target for diarization results
- [[LLMHandoff]] — passes the speaker-attributed transcript downstream to an LLM

## Related Entities

- [[pyannote.audio]] — the main open-source diarization tool
- [[AssemblyAI]] — cloud ASR alternative with built-in diarization
- [[Deepgram]] — another cloud ASR option with built-in diarization

## Related Sources

- [[engineering-voice-ai-integration-engineer]]