---
title: Fuzzy Matching
type: concept
tags: [identity-resolution, string-similarity, normalization, entity-matching]
sources: [identity-graph-operator]
last_updated: 2026-04-25
---

Fuzzy Matching

Definition

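The ability to handle records that refer to the same entity but express it with different text. Normalization and similarity algorithms are used to recognize superficially different records as the same entity. It is one of the core challenges of identity resolution.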

Core Techniques

1. Nickname Normalization

nicknames = {
    "bill": "william", "bob": "robert", "jim": "james",
    "mike": "michael", "dave": "david", "joe": "joseph",
    "tom": "thomas", "dick": "richard", "jack": "john",
}
# "Bill Smith" → "william smith"
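
The lookup table above can be applied token by token. The following is a minimal sketch, assuming whitespace-separated name tokens; the function name `normalize_name` is hypothetical and not taken from the source.

```python
# Nickname table copied from the section above.
NICKNAMES = {
    "bill": "william", "bob": "robert", "jim": "james",
    "mike": "michael", "dave": "david", "joe": "joseph",
    "tom": "thomas", "dick": "richard", "jack": "john",
}


def normalize_name(name: str) -> str:
    """Lowercase the name and expand every token found in NICKNAMES."""
    tokens = name.lower().split()
    return " ".join(NICKNAMES.get(token, token) for token in tokens)


print(normalize_name("Bill Smith"))     # william smith
print(normalize_name("William Smith"))  # william smith
```

After this step both spellings compare equal, so any remaining difference between two name fields has to be handled by a similarity score.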

2. String Similarity

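| Algorithm | Use case |
| --- | --- |
| Levenshtein Distance | Character-level edit distance |
| Jaro-Winkler | Personal names; gives extra weight to a matching prefix |
| Soundex / Metaphone | Phonetic similarity ("Jon" = "John") |
| Token-based (TF-IDF) | Multi-word phrases |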
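
To make two of these measures concrete, here is a sketch of Levenshtein distance and classic American Soundex using only the standard library. The function names, and the choice to scale the distance into a 0 to 1 similarity by dividing by the longer string's length, are assumptions for illustration; Jaro-Winkler and TF-IDF are not covered here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed scaling: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


_SOUNDEX_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                   "4": "l", "5": "mn", "6": "r"}
_SOUNDEX_CODES = {ch: digit
                  for digit, letters in _SOUNDEX_GROUPS.items()
                  for ch in letters}


def soundex(word: str) -> str:
    """Classic four-character Soundex code (first letter plus three digits)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    last = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != last:
            code += digit
        if ch not in "hw":  # h and w do not separate repeated codes
            last = digit
    return (code + "000")[:4]


print(levenshtein("jon", "john"))      # 1
print(edit_similarity("jon", "john"))  # 0.75
print(soundex("Jon"), soundex("John")) # J500 J500
```

Edit distance treats "jon" and "john" as close but not equal, while Soundex maps both to the same code, which is why phonetic keys suit spelling variants of the same spoken name.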

3. Field-specific Normalization

| Field type | Normalization rule |
| --- | --- |
| Email | `lower().strip()` |
| Phone | `re.sub(r"[^\d+]", "", value)` → E.164 format |
| Name | Nickname expansion + lowercase |
| Address | Street abbreviations (St → Street), directionals (NE → Northeast) |
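
The rules in this table can be sketched as small per-field functions. The function names and the two-entry `ADDRESS_EXPANSIONS` table are hypothetical; the phone helper only strips formatting characters as the table's regex does, and does not validate that the result is a well-formed E.164 number.

```python
import re


def normalize_email(value: str) -> str:
    """Lowercase and trim surrounding whitespace."""
    return value.lower().strip()


def normalize_phone(value: str) -> str:
    """Drop everything except digits and '+' (no country-code inference)."""
    return re.sub(r"[^\d+]", "", value)


# Hypothetical table holding only the two expansions named in the text;
# a real deployment would need a much larger list.
ADDRESS_EXPANSIONS = {"st": "street", "ne": "northeast"}


def normalize_address(value: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(ADDRESS_EXPANSIONS.get(token, token) for token in tokens)


print(normalize_email("  WSmith@Acme.com "))  # wsmith@acme.com
print(normalize_phone("+1-555-0142"))         # +15550142
print(normalize_address("123 Main St. NE"))   # 123 main street northeast
```

Note that a blind token lookup will also expand "St" where it means "Saint", so an address normalizer usually needs position-aware rules.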

Example

Record A: "Bill Smith", wsmith@acme.com, +1-555-0142
Record B: "William Smith", wsmith@acme.com, +15550142
        ↓ Normalize + Score
Email:     1.0 (exact match)
Name:      0.82 (Bill→William nickname expansion)
Phone:     1.0 (E.164 normalized)
────────────────────────────────
Total:     0.94 confidence → triggers automatic merge (close to the > 0.95 threshold)
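
The source does not say how the per-field scores are combined. One reading that reproduces the 0.94 total is an unweighted mean of the three field scores; the sketch below assumes that, and the function name `overall_confidence` is hypothetical.

```python
from typing import Mapping


def overall_confidence(field_scores: Mapping[str, float]) -> float:
    """Assumed combination rule: unweighted mean of the per-field scores."""
    return sum(field_scores.values()) / len(field_scores)


# Per-field scores taken from the example above.
scores = {"email": 1.0, "name": 0.82, "phone": 1.0}
confidence = overall_confidence(scores)
print(round(confidence, 2))  # 0.94
```

A production scorer would more likely weight fields by how identifying they are (an exact email match usually says more than a similar name), which would change the total.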