# Whisper Local Speech Transcription: The Complete Guide

> Document version: 2026-04-15
> Maintainer: 星枢 (xingshu)
> Status: ✅ Verified working on a Mac mini

---

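## 1. What Is Whisper

Whisper is an open-source automatic speech recognition (ASR) model from OpenAI that transcribes audio files to text. It supports 99 languages, with particularly high accuracy on English.

**Two ways to use it:**

| Method | Description | Cost |
|---|---|---|
| **Local** | Model is downloaded to your own Mac/PC | **Free** |
| OpenAI API | Calls the OpenAI Whisper API | Billed per minute |

This guide uses the **local** method.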

---

## 2. Supported Models

| Model | Parameters | English WER* | Chinese CER* | Local memory usage | Mac mini compatibility |
|---|---|---|---|---|---|
| `tiny` | 39M | 5.2% | ~10% | ~1GB | ✅ |
| `base` | 74M | 3.5% | ~8% | ~1GB | ✅ |
| **`small`** | 244M | 2.7% | ~5% | ~1.5GB | **✅ Recommended** |
| `medium` | 769M | 2.3% | ~4% | ~5GB | ⚠️ May run out of memory (OOM) |
| `large` | 1550M | 2.0% | ~3% | ~10GB | ❌ OOM |

> \* WER = Word Error Rate, CER = Character Error Rate; lower is more accurate.

**Recommended: the `small` model** (the best balance of accuracy and resource usage)
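
To make the two metrics in the footnote concrete, here is a minimal sketch of how WER and CER can be computed from a reference transcript and a model output. It uses plain Levenshtein distance; the function names `edit_distance`, `wer`, and `cer` are illustrative and not part of Whisper, and published benchmark figures usually apply extra text normalization that this sketch skips.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One wrong word out of five
print(wer("can you see my screen", "can you see the screen"))
```

CER is the more natural metric for Chinese because the text has no whitespace word boundaries to split on.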

---

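## 3. Installation

### 3.1 Prerequisites

```bash
# Check the Python version (3.8+ required)
python3 --version

# Check that pip is available
pip3 --version

# Check that ffmpeg is installed (Whisper uses it to decode audio)
# If it is missing on macOS: brew install ffmpeg
ffmpeg -version
```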
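
The same checks can be scripted so that a pipeline fails early with a readable message. This is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper name, and the ffmpeg check is included because Whisper relies on FFmpeg to decode audio (see the audio format question in the FAQ).

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ is required, found {sys.version.split()[0]}")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    return problems


for problem in check_prerequisites():
    print("Problem:", problem)
```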

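### 3.2 Install Whisper

```bash
pip3 install openai-whisper
```

**If you hit a permissions error (macOS):**
```bash
pip3 install --user openai-whisper
```

**The model file is downloaded automatically on first run** (about 500 MB for the `small` model); no manual download is needed.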

---

## 4. Quick Test

### 4.1 Single-file test (`tiny` model, fastest)

```python
import whisper

model = whisper.load_model("tiny")          # downloads the model on first run
result = model.transcribe("audio.mp3", language="en")
print(result["text"])
```

### 4.2 Full example (`small` model)

```python
import whisper

# Load the model (only needs to happen once)
model = whisper.load_model("small")

# Transcribe
result = model.transcribe(
    "audio.mp3",
    language="en",    # set the language; omit to auto-detect
    fp16=False,       # the Mac mini runs on CPU (FP32); False avoids the FP16 warning
    verbose=True,     # print progress
)

print("Detected language:", result["language"])
print("Transcript:", result["text"])
print("Segments:", len(result["segments"]))
```

### 4.3 Command-line test

```bash
# Available directly from the command line after installation
whisper audio.mp3 --model small --language en
```

---

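## 5. Python API in Detail

### 5.1 Core method

```python
import whisper

model = whisper.load_model("small")

# Commonly used parameters
result = model.transcribe(
    audio="path/to/file.mp3",

    # Language
    language="en",           # set explicitly; omit to auto-detect
    # initial_prompt="",     # optional; biases the model toward given terms (e.g. proper nouns)

    # Precision and decoding
    fp16=False,              # use False on CPU; True is possible on GPU
    temperature=0.0,         # 0 = deterministic, >0 = sampling
    condition_on_previous_text=True,  # use the previous segment as context

    # Task
    task="transcribe",       # "transcribe", or "translate" (any supported language into English)

    # Timestamps
    word_timestamps=False,   # True = emit start/end times for each word

    # Logging
    verbose=True,
)
```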

### 5.2 Return value structure

```python
{
    "text": "The full transcript...",
    "language": "en",
    "segments": [
        {
            "id": 0,
            "start": 0.0,      # seconds
            "end": 5.5,
            "text": " Can you see my screen already?",
            "words": [...]       # only when word_timestamps=True
        },
        ...
    ]
}
```
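
Because each segment carries `start`, `end`, and `text`, the result can be turned into subtitles with a few lines of code. The sketch below builds SRT text from a `segments` list shaped like the structure above; `format_timestamp` and `segments_to_srt` are illustrative names, not Whisper functions.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT subtitle text from a list of Whisper-style segment dicts."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


example = [{"id": 0, "start": 0.0, "end": 5.5, "text": " Can you see my screen already?"}]
print(segments_to_srt(example))
```

With a real transcription this would be called as `segments_to_srt(result["segments"])`. The `whisper` command-line tool can also write subtitle files directly through its `--output_format` option, which may be simpler when no custom formatting is needed.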

### 5.3 Batch transcription

```python
import whisper
import glob

model = whisper.load_model("small")
audio_files = glob.glob("*.mp3")

for audio_file in audio_files:
    print(f"Processing: {audio_file}")
    result = model.transcribe(audio_file, language="en", fp16=False)

    # Save the transcript next to the audio file
    with open(audio_file + ".txt", "w", encoding="utf-8") as f:
        f.write(result["text"])
```
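
The loop above transcribes every file again on each run. For repeated runs it may help to skip files that already have a transcript. The following is a minimal sketch under two assumptions: transcripts use the same `audio_file + ".txt"` naming as above, and the extension list matches the formats named in the FAQ. `pending_audio_files` and `AUDIO_EXTENSIONS` are hypothetical names.

```python
from pathlib import Path

# Assumed set of extensions; extend it to match your own files
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm"}


def pending_audio_files(directory):
    """Return audio files in `directory` that have no `<name>.txt` transcript yet."""
    pending = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        transcript = Path(str(path) + ".txt")  # e.g. talk.mp3 -> talk.mp3.txt
        if transcript.exists():
            continue
        pending.append(path)
    return pending


for path in pending_audio_files("."):
    print("Needs transcription:", path.name)
```

In the batch loop, `glob.glob("*.mp3")` could then be replaced with `pending_audio_files(".")`, passing `str(path)` to `model.transcribe`.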

---

## 6. Measured Performance on a Mac mini M4 Pro

| Audio length | File size | Model | Transcription time | Speed |
|---|---|---|---|---|
| ~54 min | 3MB | `small` | ~43s | ~75x realtime |
| ~54 min | 3MB | `tiny` | ~10s | ~320x realtime |
| ~1 hour | 22MB | `small` | ~90s | ~40x realtime |

**Rule of thumb:** the `small` model processes 1 hour of audio in about 1-2 minutes, with memory usage steady at ~1.5GB.
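
The speed column is simply audio duration divided by wall-clock transcription time. A small sketch of that arithmetic, which can also be used to estimate how long a queue of audio will take; `realtime_factor` and `estimate_seconds` are illustrative names, and any estimate assumes the speed measured above carries over to your own files, which it may not.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_seconds / wall_seconds


def estimate_seconds(audio_seconds: float, factor: float) -> float:
    """Estimated transcription time for a given realtime factor."""
    return audio_seconds / factor


# First table row: ~54 minutes of audio in ~43 seconds
print(round(realtime_factor(54 * 60, 43)))

# Ten hours of audio at the slower 40x figure, in minutes
print(estimate_seconds(10 * 3600, 40) / 60)
```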

---

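## 7. Use in the Pipeline

This project does not use the Whisper API; a Python script calls the local model instead:

```python
from functools import lru_cache

import whisper


@lru_cache(maxsize=1)
def get_model():
    """Load the small model on first use and reuse it on later calls."""
    return whisper.load_model("small")


def whisper_transcribe(mp3_path: str) -> str:
    """Transcribe a single file and return the English transcript."""
    result = get_model().transcribe(
        mp3_path,
        language="en",
        fp16=False,
    )
    return result["text"].strip()

# Usage
transcript = whisper_transcribe("/path/to/audio.mp3")
```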

---

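## 8. FAQ

### Q1: `FP16 is not supported on CPU; using FP32 instead` warning
**This is expected.** The Mac mini runs Whisper on the CPU, and Whisper falls back to FP32 automatically. Accuracy is not affected.

### Q2: `SIGKILL` / process killed
**Out of memory**: the model is too large. Switch to a smaller model:
```python
model = whisper.load_model("tiny")   # lowest memory usage
```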
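
Rather than hard-coding a model name, a script could pick the largest model that fits a memory budget you choose. This is only a sketch: the figures are the approximate memory numbers from the model table in this guide, not measurements taken at runtime, and `largest_model_within` is a hypothetical helper.

```python
# Approximate memory usage per model in GB, copied from the model table in this guide
MODEL_MEMORY_GB = [
    ("tiny", 1.0),
    ("base", 1.0),
    ("small", 1.5),
    ("medium", 5.0),
    ("large", 10.0),
]


def largest_model_within(budget_gb: float):
    """Return the largest model whose listed memory fits the budget, or None."""
    choice = None
    for name, needed_gb in MODEL_MEMORY_GB:  # ordered from smallest to largest
        if needed_gb <= budget_gb:
            choice = name
    return choice


print(largest_model_within(2.0))
```

The returned name can be passed to `whisper.load_model()`. Leaving some headroom in the budget is sensible, since the audio itself and other processes also use memory.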

### Q3: Chinese recognition is inaccurate
Specifying the language explicitly improves accuracy:
```python
result = model.transcribe("audio.mp3", language="zh")  # Chinese
result = model.transcribe("audio.mp3", language="en")  # English
```

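### Q4: How to speed up transcription
- Use the `tiny` or `base` model (trading accuracy for speed)
- On M-series Mac minis, `openai-whisper` runs on the CPU by default (the Neural Engine is not used automatically), so model size is the main lever for speed
- Avoid running several transcription jobs at the same time

### Q5: Which audio formats are supported
Any format FFmpeg can read: `mp3`, `wav`, `m4a`, `flac`, `ogg`, `webm`, and so on.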

---

## 9. Uninstall

```bash
pip3 uninstall openai-whisper

# Delete downloaded models (default cache location)
rm -rf ~/.cache/whisper
```
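
Before deleting the cache, it can be useful to see which models are stored there and how much space they take. A minimal sketch, assuming the default cache location named in this guide and that model files carry a `.pt` extension; `list_cached_models` is a hypothetical helper.

```python
from pathlib import Path

DEFAULT_CACHE = Path.home() / ".cache" / "whisper"


def list_cached_models(cache_dir=DEFAULT_CACHE):
    """Map each cached model file name to its size in bytes."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return {}  # nothing downloaded yet, or a different cache location is in use
    return {p.name: p.stat().st_size for p in sorted(cache_dir.glob("*.pt"))}


for name, size in list_cached_models().items():
    print(f"{name}: {size / 1_000_000:.0f} MB")
```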

---

## 10. Related Resources

- **GitHub**: https://github.com/openai/whisper
- **Model download**: automatic on the first call to `load_model()`
- **Cache location**: `~/.cache/whisper/`
- **Project script**: `~/.openclaw/temp/xingshu/scripts/nas_whisper_gemini_summarize.py`