---
title: "Hybrid Fingerprinting"
type: concept
tags: []
last_updated: 2026-05-01
---
## Definition
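A hybrid fingerprinting mechanism that combines two signals, exact matching (a SHA-256 hash of the primary key) and fuzzy matching (vector semantic similarity), so that distinct records are not merged by mistake just because they look alike on the surface.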
## The Problem
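Pure semantic similarity is fuzzy:
- `"John Doe ID:101"` and `"Jon Doe ID:102"` are semantically very similar
- But their primary keys differ (ID:101 ≠ ID:102), so they are in fact two distinct records
- Relying on semantic similarity alone could wrongly cluster and merge them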
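
The mismatch between the two signals can be seen in a small sketch. It uses `difflib.SequenceMatcher` from the standard library as a rough stand-in for embedding similarity (an assumption, not the embedding model this page refers to), and `pk_of` is a hypothetical helper that pulls the key out of the example strings.

```python
import hashlib
from difflib import SequenceMatcher

a = "John Doe ID:101"
b = "Jon Doe ID:102"

# Stand-in for semantic similarity: character-level ratio in [0, 1].
# A real system would compare embedding vectors instead.
surface_similarity = SequenceMatcher(None, a, b).ratio()


def pk_of(record: str) -> str:
    """Hypothetical helper: take everything after the last 'ID:' as the primary key."""
    return record.rsplit("ID:", 1)[1]


# Exact signal: SHA-256 of the primary key.
hash_a = hashlib.sha256(pk_of(a).encode("utf-8")).hexdigest()
hash_b = hashlib.sha256(pk_of(b).encode("utf-8")).hexdigest()

print(f"surface similarity: {surface_similarity:.2f}")
print("PK hashes equal:", hash_a == hash_b)
```

The fuzzy score is high while the hashes differ, which is exactly the case a similarity-only clusterer would get wrong.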
## Solution
```
Hybrid Match(a, b) = [SHA-256(pk_a) == SHA-256(pk_b)] AND [Vector_Similarity(emb_a, emb_b) > threshold]
```
- **PK Hash differs** → force the records into separate clusters; merging is not allowed
- **PK Hash matches** → only then is vector similarity considered for clustering
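
The two rules can be written as a single decision function. The following is a minimal sketch under stated assumptions: the `Decision` enum, the `decide` function, the cosine-similarity measure, the toy two-dimensional embeddings, and the `0.9` threshold are all illustrative and do not come from the original design.

```python
import hashlib
import math
from enum import Enum


class Decision(Enum):
    FORCE_SEPARATE = "force_separate"  # PK hashes differ
    MERGE = "merge"                    # PK hashes match and embeddings are similar
    KEEP_APART = "keep_apart"          # PK hashes match but similarity is too low


def pk_hash(pk: str) -> str:
    """Hex SHA-256 digest of a primary key."""
    return hashlib.sha256(pk.encode("utf-8")).hexdigest()


def cosine_similarity(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def decide(pk_a, emb_a, pk_b, emb_b, threshold=0.9) -> Decision:
    """Exact signal first; the fuzzy signal is consulted only when hashes match."""
    if pk_hash(pk_a) != pk_hash(pk_b):
        return Decision.FORCE_SEPARATE
    if cosine_similarity(emb_a, emb_b) > threshold:
        return Decision.MERGE
    return Decision.KEEP_APART


# Identical embeddings, different keys: still separated.
print(decide("101", [0.9, 0.1], "102", [0.9, 0.1]))
# Same key, near-identical embeddings: merged.
print(decide("101", [0.9, 0.1], "101", [0.88, 0.12]))
# Same key, dissimilar embeddings: not merged.
print(decide("101", [0.9, 0.1], "101", [0.1, 0.9]))
```

Checking the hash first means the embedding comparison never runs for pairs that cannot merge anyway.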
## Implementation
```python
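# Sketch: candidate_pairs and the callables are placeholders kept from the original pseudocode.
import hashlib

def sha256(pk: str) -> str:
    """Hex SHA-256 digest of a primary key."""
    return hashlib.sha256(pk.encode("utf-8")).hexdigest()

def process_pairs(candidate_pairs, vector_similarity, threshold,
                  merge_clusters, force_separate_clusters):
    for (pk1, embedding1), (pk2, embedding2) in candidate_pairs:
        if sha256(pk1) != sha256(pk2):
            force_separate_clusters(pk1, pk2)  # PKs differ: force separation
        elif vector_similarity(embedding1, embedding2) > threshold:
            merge_clusters(pk1, pk2)  # same PK and semantically similar: merge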
```
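
Because pairs with different PK hashes can never merge, one way to apply the pairwise rule to a whole dataset is to bucket records by PK hash first and compare embeddings only inside each bucket. The sketch below is one possible arrangement, not the method this page prescribes: the `cluster` function, the `(pk, embedding)` record layout, the `0.9` threshold, and the transitive (union-find) merging are all assumptions.

```python
import hashlib
import math
from collections import defaultdict


def pk_hash(pk: str) -> str:
    return hashlib.sha256(pk.encode("utf-8")).hexdigest()


def cosine_similarity(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def cluster(records, threshold=0.9):
    """records: list of (pk, embedding). Returns clusters as lists of record indices."""
    # Step 1: exact signal. Records with different PK hashes land in different buckets.
    buckets = defaultdict(list)
    for idx, (pk, _) in enumerate(records):
        buckets[pk_hash(pk)].append(idx)

    clusters = []
    for indices in buckets.values():
        # Step 2: fuzzy signal, applied only within a bucket (union-find).
        parent = {i: i for i in indices}

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i

        for pos, a in enumerate(indices):
            for b in indices[pos + 1:]:
                if cosine_similarity(records[a][1], records[b][1]) > threshold:
                    parent[find(a)] = find(b)

        groups = defaultdict(list)
        for i in indices:
            groups[find(i)].append(i)
        clusters.extend(groups.values())
    return clusters


records = [
    ("101", [0.9, 0.1]),
    ("102", [0.9, 0.1]),    # same embedding as record 0, different PK
    ("101", [0.88, 0.12]),  # same PK as record 0, similar embedding
    ("101", [0.1, 0.9]),    # same PK as record 0, dissimilar embedding
]
print(cluster(records))
```

Bucketing gives the same result as checking every pair, but it skips embedding comparisons across buckets, which matters once the number of records grows.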
## Related
- [[Semantic Anomaly Compression]]
- [[Air-Gapped SLM Fix Generation]]