---
title: "Blocking"
type: concept
tags: ["identity-resolution", "performance", "algorithm", "entity-matching"]
sources: ["identity-graph-operator"]
last_updated: 2026-04-25
---
# Blocking
## Definition
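A candidate-pair filtering technique used in identity resolution. Precomputed **blocking keys** reduce the full O(n²) comparison of all record pairs to roughly O(n×k) comparisons over a manageable candidate set, where k is the typical number of records sharing a block. It is the key to performance in large-scale entity resolution.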
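To make the reduction concrete, the sketch below counts unordered pairs with and without blocking. The data set (10,000 records spread evenly over 100 blocks) and both function names are hypothetical, chosen only for illustration.

```python
from collections import Counter


def all_pairs_count(n: int) -> int:
    """Unordered record pairs without blocking: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocked_pairs_count(block_keys: list[str]) -> int:
    """Unordered pairs when only records sharing a blocking key are compared."""
    sizes = Counter(block_keys).values()
    return sum(size * (size - 1) // 2 for size in sizes)


# Hypothetical data: 10,000 records spread evenly over 100 blocks.
keys = [f"block-{i % 100}" for i in range(10_000)]

full = all_pairs_count(len(keys))    # every record against every other record
blocked = blocked_pairs_count(keys)  # only records inside the same block

print(f"without blocking: {full:,} pairs")
print(f"with blocking:    {blocked:,} pairs")
```

Real blocks are rarely this uniform; a few oversized blocks can dominate the pair count, so the distribution of block sizes matters as much as the number of blocks.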
## Blocking Key Types
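| Type | Example | Use case |
|------|------|----------|
| Email Domain | `acme.com` | Accounts from the same company |
| Phone Prefix | `+1555` | Numbers from the same region |
| Name Soundex | `S530` | Phonetically similar names (e.g., Williams → W452) |
| Postal Code | `94105` | Same geographic area |
| Composite | email_domain + name_soundex | Combined blocking to reduce false positives |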
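The key types in the table could be derived roughly as follows. This is a minimal sketch: the function names, the five-character phone and postal prefixes, and the `|` separator in the composite key are assumptions, and `soundex` is a plain implementation of American Soundex rather than a library call.

```python
_SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        # H and W do not separate letters that share a code; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


def email_domain_key(email: str) -> str:
    """Lower-cased domain part of an address, or '' if there is no '@'."""
    _, sep, domain = email.rpartition("@")
    return domain.strip().lower() if sep else ""


def phone_prefix_key(phone: str, length: int = 5) -> str:
    """Leading characters of the number after stripping punctuation."""
    digits = "".join(c for c in phone if c.isdigit())
    normalized = ("+" if phone.strip().startswith("+") else "") + digits
    return normalized[:length]


def postal_code_key(postal_code: str) -> str:
    """First five characters, so '94105-1234' and '94105' share a block."""
    return postal_code.strip().upper()[:5]


def composite_key(email: str, name: str) -> str:
    """Stricter key: records must agree on both domain and name sound."""
    return f"{email_domain_key(email)}|{soundex(name)}"


print(soundex("Smith"), soundex("Williams"))
print(composite_key("w.smith@acme.com", "Smith"))
```

A composite key only places two records in the same block when both parts agree, which is why it produces fewer candidate pairs than either key used alone.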
## Workflow
```
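All records
↓
Generate blocking key(s) for each record
↓
Group records by blocking key (form blocks)
↓
Run pairwise scoring only on record pairs within the same block
↓
Cross-block record pairs are blocked (not compared)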
```
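The workflow above can be sketched in a few lines of Python. The record layout, the two key functions, and the name `candidate_pairs` are hypothetical. A record may yield several keys, and a pair becomes a candidate if the two records share at least one of them.

```python
from collections import defaultdict
from itertools import combinations


def domain_key(record: dict) -> str:
    """Hypothetical key: lower-cased email domain."""
    return record["email"].rpartition("@")[2].lower()


def postal_key(record: dict) -> str:
    """Hypothetical key: postal code as-is."""
    return record["postal_code"]


def candidate_pairs(records: dict, key_funcs: list) -> set:
    """Group records by blocking key and emit only same-block pairs."""
    blocks = defaultdict(set)
    for record_id, record in records.items():
        for index, key_func in enumerate(key_funcs):
            key = key_func(record)
            if key:
                # Tag the key with its function index so that values from
                # different key types never collide in one block.
                blocks[(index, key)].add(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    1: {"email": "alice@acme.com", "postal_code": "94105"},
    2: {"email": "bob@acme.com", "postal_code": "10001"},
    3: {"email": "carol@other.org", "postal_code": "94105"},
    4: {"email": "dave@example.net", "postal_code": "60601"},
}

pairs = candidate_pairs(records, [domain_key, postal_key])
print(sorted(pairs))  # record 4 shares no key, so it is never compared
```

Collecting pairs in a set means a pair that lands in two blocks is still scored only once.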
## Design Considerations
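- **Recall vs. performance**: a looser blocking key yields more candidate pairs, which raises recall but slows processing; a stricter key yields fewer candidate pairs but may miss true matches.
- **False-negative risk**: two records for the same entity with different blocking keys (e.g., "gmail.com" vs. "googlemail.com") fall into different blocks and are never compared.
- **False-positive cost**: different entities that land in the same block (e.g., two different people both named "John Smith") must be ruled out by the scoring layer.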
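The trade-off can be seen by measuring recall for two variants of an email-domain key. In this sketch the alias table, the sample records, and the ground-truth match set are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical alias table mapping equivalent domains to one canonical form.
DOMAIN_ALIASES = {"googlemail.com": "gmail.com"}


def raw_domain(email: str) -> str:
    return email.rpartition("@")[2].lower()


def canonical_domain(email: str) -> str:
    domain = raw_domain(email)
    return DOMAIN_ALIASES.get(domain, domain)


def blocked_pairs(emails: dict, key_func) -> set:
    """Candidate pairs produced by blocking on key_func(email)."""
    blocks = defaultdict(list)
    for record_id, email in emails.items():
        blocks[key_func(email)].append(record_id)
    return {
        pair
        for ids in blocks.values()
        for pair in combinations(sorted(ids), 2)
    }


def recall(candidates: set, true_matches: set) -> float:
    """Share of known true matches that survive blocking."""
    return len(candidates & true_matches) / len(true_matches)


emails = {
    "a": "j.smith@gmail.com",
    "b": "j.smith@googlemail.com",  # same person as "a"
    "c": "jane.doe@gmail.com",      # a different person
    "d": "mary@acme.com",
}
true_matches = {("a", "b")}  # hypothetical ground truth

strict = blocked_pairs(emails, raw_domain)
loose = blocked_pairs(emails, canonical_domain)

print(len(strict), recall(strict, true_matches))
print(len(loose), recall(loose, true_matches))
```

With the raw domain the true match `("a", "b")` is lost across blocks. Canonicalizing the domain recovers it, at the price of extra candidate pairs that the scoring layer then has to reject.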
## Relationship to Related Concepts
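- [[Blocking]] is the performance-optimization component of [[Identity Resolution]]: it trades a small amount of recall for a computational cost that is acceptable at large scale.
- [[Fuzzy-Matching]] performs fine-grained scoring on the candidate pairs that Blocking lets through.
- [[Confidence-Score]] combines the Blocking and scoring results to produce the final merge decision.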
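How the three pieces might fit together is sketched below. This is an assumption-laden illustration, not the pipeline the linked pages describe: `difflib.SequenceMatcher` stands in for the fuzzy matcher, and the `0.85` merge threshold, the record layout, and the `resolve` function are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

MERGE_THRESHOLD = 0.85  # hypothetical confidence cut-off


def name_similarity(left: str, right: str) -> float:
    """Stand-in fuzzy score in [0, 1] based on matching character runs."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


def resolve(records: dict, threshold: float = MERGE_THRESHOLD) -> dict:
    """Block on email domain, score same-block pairs, apply the threshold."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        domain = record["email"].rpartition("@")[2].lower()
        blocks[domain].append(record_id)

    decisions = {}
    for ids in blocks.values():
        for left, right in combinations(sorted(ids), 2):
            score = name_similarity(
                records[left]["name"], records[right]["name"]
            )
            decisions[(left, right)] = (score, score >= threshold)
    return decisions


records = {
    1: {"name": "John Smith", "email": "john@acme.com"},
    2: {"name": "Jon Smith", "email": "jsmith@acme.com"},
    3: {"name": "Mary Jones", "email": "mary@acme.com"},
    4: {"name": "John Smith", "email": "john@other.org"},
}

decisions = resolve(records)
for pair, (score, merge) in sorted(decisions.items()):
    print(pair, round(score, 2), "merge" if merge else "keep separate")
```

Records 1 and 4 have identical names but different domains, so the pair is never scored: this is the recall that blocking gives up in exchange for fewer comparisons.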