---
title: "Hybrid Fingerprinting"
type: concept
tags: []
last_updated: 2026-05-01
---

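## Definition

A hybrid fingerprinting mechanism that combines two signals, exact matching (a SHA-256 hash of the primary key) and fuzzy matching (vector-based semantic similarity), so that distinct records are not merged by mistake merely because they look alike on the surface.
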
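## The Problem

Semantic similarity on its own is fuzzy:
- `"John Doe ID:101"` and `"Jon Doe ID:102"` are semantically very similar
- But their primary keys differ (ID:101 ≠ ID:102), so they are in fact two distinct records
- If clustering relies on semantic similarity alone, the two may be wrongly merged
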
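The mismatch described above can be reproduced with a small sketch. It assumes a character-level ratio from `difflib.SequenceMatcher` as a stand-in for embedding similarity (a real pipeline would compare vectors), and `extract_pk` is a hypothetical helper that treats the last whitespace-separated token as the primary key.

```python
import hashlib
from difflib import SequenceMatcher

a = "John Doe ID:101"
b = "Jon Doe ID:102"

# Assumption: a character-level ratio in [0, 1] stands in for vector similarity.
surface_similarity = SequenceMatcher(None, a, b).ratio()

def extract_pk(record: str) -> str:
    # Hypothetical parser: the token after the last space, e.g. "ID:101".
    return record.rsplit(" ", 1)[-1]

# Exact signal: SHA-256 digest of each record's primary key.
hash_a = hashlib.sha256(extract_pk(a).encode("utf-8")).hexdigest()
hash_b = hashlib.sha256(extract_pk(b).encode("utf-8")).hexdigest()

print(f"surface similarity: {surface_similarity:.2f}")
print("PK hashes equal:", hash_a == hash_b)
```

The two strings score high on the fuzzy signal while their PK hashes differ, which is the case a similarity-only clusterer would get wrong.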
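## Solution

```
Hybrid Score = SHA-256(PK_hash) + Vector_Similarity(embedding)
```

Here `+` denotes combining the two signals, not numeric addition: the PK hash acts as a gate, and vector similarity is consulted only once the gate passes.

- **PK Hash differs** → force separate clusters; merging is not allowed
- **PK Hash matches** → only then is vector similarity considered for clustering
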
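The two rules above might be expressed as a single decision function. This is a minimal sketch, not a definitive implementation: `Decision`, `hybrid_decision`, and the default `threshold` of 0.9 are illustrative names and values, and the `NO_MERGE` outcome (hashes match but similarity is low) is an assumption, since the rules only say when a merge is considered.

```python
import hashlib
from enum import Enum

class Decision(Enum):
    SEPARATE = "separate"   # PK hashes differ: clusters are forced apart
    MERGE = "merge"         # PK hashes match and similarity exceeds the threshold
    NO_MERGE = "no_merge"   # Assumption: hashes match but similarity is too low

def pk_hash(pk: str) -> str:
    """Hex SHA-256 digest of a primary key."""
    return hashlib.sha256(pk.encode("utf-8")).hexdigest()

def hybrid_decision(pk_a: str, pk_b: str, similarity: float,
                    threshold: float = 0.9) -> Decision:
    # Gate on the exact signal first; similarity is ignored when hashes differ.
    if pk_hash(pk_a) != pk_hash(pk_b):
        return Decision.SEPARATE
    return Decision.MERGE if similarity > threshold else Decision.NO_MERGE
```

For example, `hybrid_decision("ID:101", "ID:102", 0.99)` returns `Decision.SEPARATE` even though the similarity is very high, because the fuzzy signal is never consulted once the hashes disagree.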
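## Implementation

```python
import hashlib

def sha256(pk: str) -> str:
    """Hex SHA-256 digest of a primary key."""
    return hashlib.sha256(pk.encode("utf-8")).hexdigest()

def process_pairs(candidate_pairs, vector_similarity, threshold,
                  merge_clusters, force_separate_clusters):
    """Sketch: candidate_pairs yields ((pk1, embedding1), (pk2, embedding2))."""
    for (pk1, embedding1), (pk2, embedding2) in candidate_pairs:
        if sha256(pk1) != sha256(pk2):
            force_separate_clusters(pk1, pk2)  # PKs differ: force separation
        elif vector_similarity(embedding1, embedding2) > threshold:
            merge_clusters(pk1, pk2)  # PKs match and semantically similar: merge
```
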
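One way to run the rule end to end is sketched below, under several assumptions: embeddings are plain lists of floats, similarity is cosine similarity, clusters are tracked with a small union-find, and the `threshold` of 0.95 and the toy vectors are invented for illustration. The names `cluster`, `pk_hash`, and `cosine_similarity` are hypothetical.

```python
import hashlib
import math
from itertools import combinations

def pk_hash(pk: str) -> str:
    """Hex SHA-256 digest of a primary key."""
    return hashlib.sha256(pk.encode("utf-8")).hexdigest()

def cosine_similarity(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster(records, threshold=0.95):
    """records: list of (pk, embedding). Returns clusters as lists of indices."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        (pk_i, emb_i), (pk_j, emb_j) = records[i], records[j]
        if pk_hash(pk_i) != pk_hash(pk_j):
            continue  # PK hashes differ: never merge this pair
        if cosine_similarity(emb_i, emb_j) > threshold:
            parent[find(j)] = find(i)  # PK hashes match and vectors are close

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy data (invented): record 2 has the same vector as record 0 but another PK;
# record 3 shares the PK of record 0 but points in a very different direction.
records = [
    ("ID:101", [0.90, 0.10, 0.00]),
    ("ID:101", [0.88, 0.12, 0.01]),
    ("ID:102", [0.90, 0.10, 0.00]),
    ("ID:101", [0.00, 0.20, 0.95]),
]
clusters = cluster(records)
print(clusters)
```

Because a union only ever happens between records whose PK hashes match, transitive merging through the union-find cannot join records with different primary keys.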
## Related

- [[Semantic Anomaly Compression]]
- [[Air-Gapped SLM Fix Generation]]