---
title: "Fuzzy Matching"
type: concept
tags: ["identity-resolution", "string-similarity", "normalization", "entity-matching"]
sources: ["identity-graph-operator"]
last_updated: 2026-04-25
---

# Fuzzy Matching

## Definition

The ability to handle records that refer to the same entity but express it with different text. Normalization and similarity algorithms identify superficially different records as the same entity. It is one of the core challenges of identity resolution.

## Core Techniques

### 1. Nickname Normalization

```python
# Maps common English nicknames to their canonical given names.
nicknames = {
    "bill": "william",
    "bob": "robert",
    "jim": "james",
    "mike": "michael",
    "dave": "david",
    "joe": "joseph",
    "tom": "thomas",
    "dick": "richard",
    "jack": "john",
}
# "Bill Smith" → "william smith"
```

### 2. String Similarity

| Algorithm | Use case |
|-----------|----------|
| Levenshtein Distance | Character-level edit distance |
| Jaro-Winkler | Personal names; gives extra weight to matching prefixes |
| Soundex / Metaphone | Phonetic similarity ("Jon" = "John") |
| Token-based (TF-IDF) | Multi-word phrases |

### 3. Field-specific Normalization

| Field type | Normalization rule |
|------------|--------------------|
| Email | `lower().strip()` |
| Phone | `re.sub(r"[^\d+]", "", value)` → E.164 format |
| Name | Nickname expansion + lowercase |
| Address | Street abbreviations (St→Street), directionals (NE→Northeast) |

## Example

```
Record A: "Bill Smith", wsmith@acme.com, +1-555-0142
Record B: "William Smith", wsmith@acme.com, +15550142
           ↓ Normalize + Score
Email: 1.0 (exact match)
Name:  0.82 (Bill→William nickname expansion)
Phone: 1.0 (E.164 normalized)
────────────────────────────────
Total: 0.94 confidence → just under the > 0.95 auto-merge threshold
```

## Relationship to Related Concepts

- [[Fuzzy-Matching]] is the core technique of the [[Identity-Resolution]] scoring layer
- After [[Blocking]] filters candidate pairs, [[Fuzzy-Matching]] performs fine-grained field comparisons
- [[Confidence-Score]] combines the fuzzy match scores of all fields to reach the final decision
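
## Implementation Sketch

The normalization rules and field-level comparison described under Core Techniques can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the source system's implementation: the function names (`normalize_email`, `normalize_phone`, `normalize_name`, `similarity`, `field_scores`) and the record layout are hypothetical, the nickname table is a subset of the one above, and a length-normalized Levenshtein distance stands in for whichever similarity algorithm a real pipeline would choose.

```python
import re

# Subset of the nickname table from the Nickname Normalization section.
NICKNAMES = {"bill": "william", "bob": "robert", "jim": "james"}


def normalize_email(value: str) -> str:
    return value.lower().strip()


def normalize_phone(value: str) -> str:
    # Keeps digits and '+', as in the table's regex.
    # Assumption: this only strips separators; it does not validate E.164.
    return re.sub(r"[^\d+]", "", value)


def normalize_name(value: str) -> str:
    # Lowercase, then expand each token that appears in the nickname table.
    return " ".join(NICKNAMES.get(tok, tok) for tok in value.lower().split())


def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, keeping one previous row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    # Edit distance scaled into [0, 1]; 1.0 means identical strings.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def field_scores(rec_a: dict, rec_b: dict) -> dict:
    # Normalize each field with its own rule, then compare the results.
    normalizers = {
        "email": normalize_email,
        "name": normalize_name,
        "phone": normalize_phone,
    }
    return {
        field: similarity(norm(rec_a[field]), norm(rec_b[field]))
        for field, norm in normalizers.items()
    }


record_a = {"name": "Bill Smith", "email": "wsmith@acme.com", "phone": "+1-555-0142"}
record_b = {"name": "William Smith", "email": "wsmith@acme.com", "phone": "+15550142"}
scores = field_scores(record_a, record_b)
print(scores)
```

In this sketch the name field scores 1.0 once "Bill" is expanded to "william", whereas the Example section reports 0.82. The source does not say how that figure or the 0.94 total is derived, so the sketch stops at per-field scores and does not assume any weighting or nickname discount.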