Update nexus: fix conflicts and sync local changes

Shen Wei
2026-04-26 12:06:50 +08:00
parent 191797c01b
commit f09834b5a5
2443 changed files with 254323 additions and 255154 deletions


@@ -1,57 +1,57 @@
---
title: "Fuzzy Matching"
type: concept
tags: ["identity-resolution", "string-similarity", "normalization", "entity-matching"]
sources: ["identity-graph-operator"]
last_updated: 2026-04-25
---
# Fuzzy Matching
## Definition
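The ability to handle records that refer to the same entity but express it with different text: normalization and similarity algorithms identify superficially different records as one entity. It is one of the core challenges of identity resolution.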
## Core Techniques
### 1. Nickname Normalization
```python
nicknames = {
"bill": "william", "bob": "robert", "jim": "james",
"mike": "michael", "dave": "david", "joe": "joseph",
"tom": "thomas", "dick": "richard", "jack": "john",
}
# "Bill Smith" → "william smith"
```
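A minimal sketch of how such a lookup table might be applied, assuming names are whitespace-separated and only the first token is a given name. The names `NICKNAMES` and `normalize_name` are hypothetical and do not come from the source.

```python
# Hypothetical helper: expand a leading nickname and lowercase the whole name.
NICKNAMES = {
    "bill": "william", "bob": "robert", "jim": "james",
    "mike": "michael", "dave": "david", "joe": "joseph",
    "tom": "thomas", "dick": "richard", "jack": "john",
}


def normalize_name(name: str) -> str:
    """Lowercase a name and expand the first token if it is a known nickname."""
    tokens = name.lower().split()
    if not tokens:
        return ""
    # Assumption: only the first token is treated as a given name.
    tokens[0] = NICKNAMES.get(tokens[0], tokens[0])
    return " ".join(tokens)


print(normalize_name("Bill Smith"))  # william smith
```

Restricting expansion to the first token avoids rewriting surnames that happen to coincide with a nickname; a real pipeline would likely need a fuller table and locale-aware rules.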
### 2. String Similarity
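| Algorithm | Best suited for |
|-----------|-----------------|
| Levenshtein Distance | Character-level edit distance |
| Jaro-Winkler | Personal names; weights matching prefixes more heavily |
| Soundex / Metaphone | Phonetic similarity ("Jon" = "John") |
| Token-based (TF-IDF) | Multi-word phrases |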
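To make two rows of the table concrete, here is a dependency-free sketch of Levenshtein distance and American Soundex. The function names and the `1 - distance / max_len` similarity scaling are illustrative assumptions; production systems usually rely on a tested library rather than hand-written versions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed scaling: 1.0 for identical strings, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


_SOUNDEX_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                   ("l", "4"), ("mn", "5"), ("r", "6"))
_SOUNDEX_CODES = {ch: digit for letters, digit in _SOUNDEX_GROUPS for ch in letters}


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return (result + "000")[:4]


print(levenshtein("jon", "john"))       # 1
print(soundex("Jon"), soundex("John"))  # J500 J500
```

Edit distance sees "Jon" and "John" as one insertion apart, while Soundex maps both to the same code, which is why the two families of measures are often combined.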
### 3. Field-specific Normalization
| Field type | Normalization rule |
|------------|--------------------|
| Email | `lower().strip()` |
| Phone | `re.sub(r"[^\d+]", "", value)` → E.164 format |
| Name | Nickname expansion + lowercase |
| Address | Street abbreviations (St→Street), directionals (NE→Northeast) |
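The rules in the table could be wrapped as small functions along these lines. This is a sketch: the function names and the two-entry `ADDRESS_EXPANSIONS` table are hypothetical, and the phone helper only strips formatting characters; it does not validate country codes or lengths as a full E.164 implementation would.

```python
import re

# Hypothetical, deliberately tiny expansion table for illustration only.
ADDRESS_EXPANSIONS = {"st": "street", "ne": "northeast"}


def normalize_email(value: str) -> str:
    """Lowercase and trim surrounding whitespace."""
    return value.lower().strip()


def normalize_phone(value: str) -> str:
    """Keep only digits and '+', as in the table's regex (no E.164 validation)."""
    return re.sub(r"[^\d+]", "", value)


def normalize_address(value: str) -> str:
    """Lowercase, drop trailing punctuation on tokens, expand known abbreviations."""
    tokens = [t.strip(".,").lower() for t in value.split()]
    return " ".join(ADDRESS_EXPANSIONS.get(t, t) for t in tokens if t)


print(normalize_phone("+1-555-0142"))       # +15550142
print(normalize_address("12 NE Main St."))  # 12 northeast main street
```

Note that a blanket "St" to "Street" rule would mis-expand names such as "St Louis"; a real address normalizer would need positional or dictionary-based context.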
## Example
```
Record A: "Bill Smith", wsmith@acme.com, +1-555-0142
Record B: "William Smith", wsmith@acme.com, +15550142
↓ Normalize + Score
Email: 1.0 (exact match)
Name: 0.82 (Bill→William nickname expansion)
Phone: 1.0 (E.164 normalized)
────────────────────────────────
Total: 0.94 confidence → close to the > 0.95 auto-merge threshold
```
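The source does not say how the field scores are combined. One reading that reproduces the 0.94 total is an unweighted mean of the three field scores, sketched below; the equal weighting, `combine_scores`, and `AUTO_MERGE_THRESHOLD` are assumptions for illustration.

```python
from statistics import mean

# Threshold value taken from the example; the strict ">" mirrors its "> 0.95".
AUTO_MERGE_THRESHOLD = 0.95

# Field scores copied from the example above.
field_scores = {"email": 1.0, "name": 0.82, "phone": 1.0}


def combine_scores(scores: dict) -> float:
    """Assumption: every field carries equal weight (plain arithmetic mean)."""
    return mean(list(scores.values()))


total = combine_scores(field_scores)
auto_merge = total > AUTO_MERGE_THRESHOLD

print(round(total, 2))  # 0.94
print(auto_merge)       # False
```

With a strict `> 0.95` comparison, a total of 0.94 would fall just below the threshold, so under these assumptions the pair would go to whatever handling the system applies to near-threshold matches.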
## Relationship to Related Concepts
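- [[Fuzzy-Matching]] is the core technique of the [[Identity-Resolution]] scoring layer
- After [[Blocking]] narrows down the candidate pairs, [[Fuzzy-Matching]] performs the fine-grained field comparison
- [[Confidence-Score]] combines the fuzzy match scores of all fields to reach the final decision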