Update nexus: fix conflicts and sync local changes
@@ -1,46 +1,46 @@
---
title: "Blocking"
type: concept
tags: ["identity-resolution", "performance", "algorithm", "entity-matching"]
sources: ["identity-graph-operator"]
last_updated: 2026-04-25
---

# Blocking
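
## Definition
A candidate-pair filtering technique in identity resolution: a precomputed **blocking key** reduces the full O(n²) comparison of all record pairs to roughly O(n×k) comparisons over a manageable candidate set, with k being the typical number of records that share a key. This makes blocking critical to the performance of large-scale entity resolution.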
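
To make the O(n²) versus O(n×k) claim concrete, the following is a minimal sketch that counts the comparisons with and without blocking. The helper names and the synthetic data (10,000 records spread evenly over 100 blocks) are assumptions for illustration, not part of the original note.

```python
from collections import Counter
from math import comb


def full_pair_count(n: int) -> int:
    """Unordered record pairs when every record is compared with every other."""
    return comb(n, 2)


def blocked_pair_count(keys: list[str]) -> int:
    """Unordered record pairs that share a blocking key."""
    block_sizes = Counter(keys)
    return sum(comb(size, 2) for size in block_sizes.values())


# Hypothetical data: 10,000 records, 100 blocks of 100 records each.
keys = [f"block-{i % 100}" for i in range(10_000)]

print(full_pair_count(len(keys)))  # 49995000
print(blocked_pair_count(keys))    # 495000
```

In this evenly distributed case blocking cuts the work by about a factor of 100. Real data is usually skewed, so a few oversized blocks can dominate the cost.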

## Blocking Key Types

| Type | Example | Use case |
|------|---------|----------|
| Email Domain | `acme.com` | Accounts at the same company |
| Phone Prefix | `+1555` | Phone numbers in the same region |
| Name Soundex | `S530` | Phonetically similar names (e.g. Williams→W452) |
| Postal Code | `94105` | Same geographic area |
| Composite | email_domain + name_soundex | Combined blocking that reduces false positives |
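
The key types in the table could be derived along the following lines. This is a sketch under stated assumptions: the record field names (`last_name`, `email`, `phone`, `postal_code`), the four-digit phone prefix, and the `|` separator in the composite key are hypothetical choices, and `soundex` is a plain implementation of American Soundex rather than a call into any particular library.

```python
def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
    codes = {ch: digit for letters, digit in groups.items() for ch in letters}
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code is emitted again
    return (result + "000")[:4]


def email_domain(email: str) -> str:
    """Lower-cased part after the last '@', or '' if there is no '@'."""
    _, sep, domain = email.rpartition("@")
    return domain.strip().lower() if sep else ""


def phone_prefix(phone: str, digits: int = 4) -> str:
    """'+' plus the leading digits; assumes the number includes a country code."""
    only_digits = "".join(c for c in phone if c.isdigit())
    return "+" + only_digits[:digits] if only_digits else ""


def make_keys(record: dict) -> dict:
    """Build one blocking key per type for a record (field names are assumed)."""
    domain = email_domain(record["email"])
    name_code = soundex(record["last_name"])
    return {
        "email_domain": domain,
        "phone_prefix": phone_prefix(record["phone"]),
        "name_soundex": name_code,
        "postal_code": record["postal_code"].strip(),
        "composite": f"{domain}|{name_code}",
    }


record = {
    "last_name": "Williams",
    "email": "A.Williams@Acme.com",
    "phone": "+1 (555) 010-1234",
    "postal_code": "94105",
}
print(make_keys(record))
print(soundex("Smith"), soundex("Smyth"))  # S530 S530
```

The composite key is stricter than either of its parts, which is why it produces fewer false positives and, correspondingly, more risk of missed matches.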

## Workflow

```
All records
↓
Generate blocking key(s) for each record
↓
Group records by blocking key (blocks)
↓
Run pairwise scoring only on record pairs within the same block
↓
Cross-block record pairs are blocked (not compared)
```
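
The workflow above can be sketched in a few lines. This assumes a single key function per record and records held in an in-memory dictionary; `candidate_pairs`, `domain_key`, and the sample data are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records: dict, key_fn) -> set:
    """Group record ids by blocking key and return only the within-block pairs."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        key = key_fn(record)
        if key:  # records without a usable key are left out of every block
            blocks[key].append(record_id)

    pairs = set()
    for ids in blocks.values():
        # Pairs across different blocks are never generated, so never compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def domain_key(record: dict) -> str:
    """Blocking key: lower-cased email domain."""
    return record["email"].rpartition("@")[2].lower()


records = {
    1: {"email": "ann@acme.com"},
    2: {"email": "bob@acme.com"},
    3: {"email": "carol@example.org"},
    4: {"email": "dave@example.org"},
    5: {"email": "erin@solo.example"},
}

print(sorted(candidate_pairs(records, domain_key)))  # [(1, 2), (3, 4)]
```

Five records would need 10 comparisons without blocking; here only 2 pairs reach the scoring step. Skipping records with an empty key is one possible policy; routing them to a fallback key is another.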
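
## Design Considerations
- **Recall vs. performance**: a looser blocking key yields more candidate pairs, so recall is higher but processing is slower; a stricter key yields fewer candidate pairs but may miss true matches.
- **False-negative risk**: two records for the same entity with different blocking keys (e.g. "gmail.com" vs "googlemail.com") land in different blocks and are never compared.
- **False-positive cost**: different entities in the same block (e.g. different people who share the name "John Smith") must be ruled out by the scoring layer.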
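
One way to reduce the false-negative risk from the "gmail.com" vs "googlemail.com" example is to canonicalize the key before grouping. The sketch below assumes a hand-maintained alias table; `DOMAIN_ALIASES` and both function names are hypothetical.

```python
# Hypothetical alias table; a real deployment would maintain its own list.
DOMAIN_ALIASES = {"googlemail.com": "gmail.com"}


def raw_domain(email: str) -> str:
    """Blocking key without normalization: the lower-cased email domain."""
    _, sep, domain = email.rpartition("@")
    return domain.strip().lower() if sep else ""


def canonical_domain(email: str) -> str:
    """Blocking key with known aliases mapped to one canonical domain."""
    domain = raw_domain(email)
    return DOMAIN_ALIASES.get(domain, domain)


a = "j.smith@gmail.com"
b = "j.smith@googlemail.com"

print(raw_domain(a) == raw_domain(b))              # False: different blocks
print(canonical_domain(a) == canonical_domain(b))  # True: same block
```

Normalization only covers aliases that are known in advance. Variations that cannot be enumerated still depend on a looser key, with the recall and cost trade-off described in the first bullet.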
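
## Relationship to Related Concepts
- [[Blocking]] is the performance-optimization component of [[Identity Resolution]]: it trades a small amount of recall for a computational cost that is acceptable at large scale.
- [[Fuzzy-Matching]] performs fine-grained scoring on the candidate pairs that Blocking selects.
- [[Confidence-Score]] combines the results of Blocking and Scoring to produce the final merge decision.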
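
As a rough illustration of how these pieces fit together, the sketch below scores only the pairs that blocking let through and applies a threshold to decide on a merge. The name-only similarity from the standard library's `difflib`, the `0.85` threshold, and the function names are assumptions; the note does not specify how the scoring or confidence layers are implemented.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def decide(pairs, records: dict, threshold: float = 0.85) -> dict:
    """Score each candidate pair and flag it for merging at or above the threshold."""
    decisions = {}
    for left, right in pairs:
        score = name_similarity(records[left]["name"], records[right]["name"])
        decisions[(left, right)] = (score, score >= threshold)
    return decisions


# Hypothetical records that ended up in the same block.
records = {
    1: {"name": "Jon Smith"},
    2: {"name": "John Smith"},
    3: {"name": "Mary Jones"},
}
pairs = {(1, 2), (1, 3)}

for pair, (score, merge) in sorted(decide(pairs, records).items()):
    print(pair, round(score, 2), merge)
```

The pair (1, 2) clears the threshold while (1, 3) does not, which is the scoring layer filtering out a false positive that blocking alone let through.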