---
title: "Fuzzy Matching"
type: concept
tags: ["identity-resolution", "string-similarity", "normalization", "entity-matching"]
sources: ["identity-graph-operator"]
last_updated: 2026-04-25
---
# Fuzzy Matching
## Definition
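The ability to handle records that refer to the same entity but express it with different text: normalization and similarity algorithms identify superficially different records as one entity. It is one of the core challenges of identity resolution.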
## Core Techniques
### 1. Nickname Normalization
```python
# Maps common English nicknames to their canonical given names.
nicknames = {
    "bill": "william", "bob": "robert", "jim": "james",
    "mike": "michael", "dave": "david", "joe": "joseph",
    "tom": "thomas", "dick": "richard", "jack": "john",
}
# After lowercasing and expansion: "Bill Smith" → "william smith"
```
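The lookup table above can be applied token by token. The following is a minimal sketch, not the source's implementation: the function name `normalize_name` and the whitespace tokenization are assumptions, and the table is repeated in shortened form so the block runs on its own.

```python
# Shortened copy of the nickname table so this block is self-contained.
NICKNAMES = {
    "bill": "william", "bob": "robert", "jim": "james",
    "mike": "michael", "dave": "david", "joe": "joseph",
}


def normalize_name(full_name: str) -> str:
    """Lowercase a name and expand known nicknames (hypothetical helper)."""
    # split() with no arguments also collapses repeated whitespace.
    tokens = full_name.lower().split()
    # Tokens without an entry in the table are kept unchanged.
    return " ".join(NICKNAMES.get(token, token) for token in tokens)


print(normalize_name("Bill Smith"))  # william smith
```

This sketch expands every token, so a surname that happens to match a nickname would also be rewritten; restricting the lookup to the first token is one possible refinement.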
### 2. String Similarity
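| Algorithm | Use case |
|-----------|----------|
| Levenshtein Distance | Character-level edit distance |
| Jaro-Winkler | Personal names; gives extra weight to matching prefixes |
| Soundex / Metaphone | Phonetic similarity ("Jon" = "John") |
| Token-based (TF-IDF) | Multi-word phrases |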
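To make two of the rows concrete, here is a standard-library sketch of Levenshtein distance and American Soundex. The function names and the normalization of distance into a 0..1 similarity are assumptions made for illustration; production systems typically rely on a dedicated string-matching library instead.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                  # deletion
                current[j - 1] + 1,               # insertion
                previous[j - 1] + (ch_a != ch_b)  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


SOUNDEX_CODES = {
    ch: digit
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6"))
    for ch in letters
}


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous_code = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous_code:
            result += code
        previous_code = code  # vowels reset the code, allowing repeats
    return (result + "000")[:4]


print(levenshtein("jon", "john"))       # 1
print(soundex("Jon"), soundex("John"))  # J500 J500
```

The two measures answer different questions: the edit distance sees "Jon" and "John" as one insertion apart, while Soundex treats them as phonetically identical.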
### 3. Field-specific Normalization
| Field type | Normalization rule |
|------------|--------------------|
| Email | `lower().strip()` |
| Phone | `re.sub(r"[^\d+]", "", value)` → E.164 format |
| Name | Nickname expansion + lowercase |
| Address | Street abbreviations (St → Street), directionals (NE → Northeast) |
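The rules in the table could be sketched as small per-field functions. Everything beyond the table's own expressions is an assumption here: the function names, the `default_country_code` parameter used when a number has no leading `+`, and the abbreviation entries other than `St` and `NE`.

```python
import re

# Illustrative entries only; "st" and "ne" come from the table, the rest are assumed.
STREET_ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}
DIRECTIONALS = {"ne": "northeast", "nw": "northwest",
                "se": "southeast", "sw": "southwest"}


def normalize_email(value: str) -> str:
    return value.lower().strip()


def normalize_phone(value: str, default_country_code: str = "1") -> str:
    """Strip formatting and return a '+<digits>' string in E.164 style."""
    cleaned = re.sub(r"[^\d+]", "", value)
    digits = cleaned.replace("+", "")
    if cleaned.startswith("+"):
        return "+" + digits
    # Assumption: numbers without a country code belong to the default one.
    return "+" + default_country_code + digits


def normalize_address(value: str) -> str:
    """Lowercase and expand street abbreviations and directionals."""
    tokens = value.lower().replace(",", " ").split()
    expanded = []
    for token in tokens:
        token = token.rstrip(".")
        token = STREET_ABBREVIATIONS.get(token, token)
        token = DIRECTIONALS.get(token, token)
        expanded.append(token)
    return " ".join(expanded)


print(normalize_phone("+1-555-0142"))       # +15550142
print(normalize_address("12 Main St. NE"))  # 12 main street northeast
```

Note that stripping non-digit characters alone does not validate a number; a full E.164 conversion also needs country-specific length and prefix checks, which this sketch omits.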
## Example
```
Record A: "Bill Smith", wsmith@acme.com, +1-555-0142
Record B: "William Smith", wsmith@acme.com, +15550142
        ↓ Normalize + Score
Email: 1.0  (exact match)
Name:  0.82 (Bill→William nickname expansion)
Phone: 1.0  (E.164 normalized)
────────────────────────────────
Total: 0.94 confidence → auto-merge trigger (close to the > 0.95 threshold)
```
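The page does not say how the total is derived from the field scores. As one possible reading, an unweighted mean of the three scores reproduces the 0.94 shown above; the sketch below assumes that aggregation and a strict `> 0.95` comparison, and the names `combine_scores` and `AUTO_MERGE_THRESHOLD` are hypothetical.

```python
AUTO_MERGE_THRESHOLD = 0.95  # taken from the example; the comparison rule is assumed


def combine_scores(field_scores: dict[str, float]) -> float:
    """Unweighted mean of per-field scores (an assumed aggregation)."""
    return sum(field_scores.values()) / len(field_scores)


scores = {"email": 1.0, "name": 0.82, "phone": 1.0}
total = round(combine_scores(scores), 2)
auto_merge = total > AUTO_MERGE_THRESHOLD

print(total)       # 0.94
print(auto_merge)  # False
```

Under a strict `> 0.95` rule, a total of 0.94 falls just short of the threshold, so whether this pair merges automatically depends on how the real system defines and weights the comparison.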
## Relationship to Related Concepts
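- [[Fuzzy-Matching]] is the core technique of the scoring layer in [[Identity-Resolution]]
- After [[Blocking]] narrows down the candidate pairs, [[Fuzzy-Matching]] performs the fine-grained field comparison
- [[Confidence-Score]] combines the fuzzy match scores of all fields to reach the final decision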
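The hand-off from blocking to fuzzy matching can be illustrated with a small sketch. The blocking key (email domain), the sample records, and the use of `difflib.SequenceMatcher` as the name comparison are all assumptions chosen for brevity, not the method described in the source.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(record: dict) -> str:
    # Hypothetical blocking key: the email domain.
    return record["email"].lower().strip().rsplit("@", 1)[-1]


def candidate_pairs(records: list[dict]):
    """Yield only pairs that share a block, instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for group in blocks.values():
        yield from combinations(group, 2)


def name_similarity(a: str, b: str) -> float:
    # Stand-in comparison; any string similarity measure could be used here.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Made-up sample data for illustration only.
records = [
    {"id": 1, "name": "William Smith", "email": "wsmith@acme.com"},
    {"id": 2, "name": "Wiliam Smith", "email": "w.smith@acme.com"},
    {"id": 3, "name": "Jane Doe", "email": "jane@example.org"},
]

pairs = [
    (a["id"], b["id"], name_similarity(a["name"], b["name"]))
    for a, b in candidate_pairs(records)
]
print(pairs)  # one candidate pair: records 1 and 2
```

Blocking keeps the expensive field-level comparison off pairs that cannot plausibly match, at the cost of missing true matches that land in different blocks.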