Update nexus wiki content
43
wiki/concepts/HybridFingerprinting.md
Normal file
@@ -0,0 +1,43 @@
---
title: "Hybrid Fingerprinting"
type: concept
tags: []
last_updated: 2026-05-01
---
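
## Definition

A hybrid fingerprinting mechanism that combines two signals, exact matching (a SHA-256 hash of the primary key) and fuzzy matching (vector semantic similarity), so that distinct records are not wrongly merged because they look alike on the surface.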
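
## The Problem

Semantic similarity on its own is fuzzy:
- `"John Doe ID:101"` and `"Jon Doe ID:102"` are highly similar semantically
- But their primary keys differ (ID:101 ≠ ID:102), so they are in fact two distinct records
- If clustering relies on semantic similarity alone, they may be wrongly merged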
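The failure mode above can be reproduced with a toy example. This is a minimal sketch, not the project's code: it assumes character-trigram counts as a stand-in for a real embedding model, and the `rec_a` / `rec_b` dictionaries and helper names are hypothetical.

```python
import math
from collections import Counter


def char_trigrams(text: str) -> Counter:
    """Toy 'embedding' (assumption): counts of overlapping 3-character substrings."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical records mirroring the example in the text.
rec_a = {"pk": "101", "text": "John Doe ID:101"}
rec_b = {"pk": "102", "text": "Jon Doe ID:102"}

similarity = cosine(char_trigrams(rec_a["text"]), char_trigrams(rec_b["text"]))
same_pk = rec_a["pk"] == rec_b["pk"]
print(f"similarity={similarity:.2f}, same_pk={same_pk}")
```

Even this crude measure scores the pair above 0.7, so a similarity-only rule with a permissive threshold would merge two records whose primary keys differ.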
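
## Solution

```
Hybrid Score = SHA-256(PK_hash) + Vector_Similarity(embedding)
```

Here `+` denotes combining the two signals, not arithmetic addition. The PK hash acts as a gate in front of the similarity score:

- **PK Hash differs** → force the records into separate clusters; merging is not allowed
- **PK Hash matches** → only then is vector similarity considered for clustering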
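The exact-match half of the rule can be sketched with the standard-library `hashlib`. The function names `pk_fingerprint` and `may_cluster` are hypothetical, and converting the key to a stripped string before hashing is an assumption about normalization that the page does not specify.

```python
import hashlib


def pk_fingerprint(pk) -> str:
    """Hex SHA-256 digest of a primary key.

    Assumption: keys are normalized to a stripped string before hashing.
    """
    return hashlib.sha256(str(pk).strip().encode("utf-8")).hexdigest()


def may_cluster(pk_a, pk_b) -> bool:
    """Gate: vector similarity is only consulted when this returns True."""
    return pk_fingerprint(pk_a) == pk_fingerprint(pk_b)


print(may_cluster("101", "102"))  # different keys
print(may_cluster(101, " 101 "))  # same key after normalization
```

Comparing digests rather than raw keys gives every key a fixed-length fingerprint, which is convenient to store next to the embedding.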
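
## Implementation

```python
import hashlib

def sha256(pk):
    return hashlib.sha256(str(pk).encode("utf-8")).hexdigest()

# Sketch (originally pseudocode): each record is assumed to be a
# (pk, embedding) pair, and the three callables are supplied by the caller.
def cluster_pairs(candidate_pairs, vector_similarity, threshold,
                  force_separate_clusters, merge_clusters):
    for rec1, rec2 in candidate_pairs:
        (pk1, embedding1), (pk2, embedding2) = rec1, rec2
        if sha256(pk1) != sha256(pk2):
            force_separate_clusters(rec1, rec2)  # PKs differ: force separation
        elif vector_similarity(embedding1, embedding2) > threshold:
            merge_clusters(rec1, rec2)  # same PK and semantically similar: merge
```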
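For an end-to-end view, the gate can be combined with a small union-find so that merges accumulate into clusters. This is a hedged sketch under stated assumptions: the two-dimensional embeddings are made up, the `0.9` threshold is arbitrary, and `cluster`, `pk_hash`, and `cosine` are hypothetical names rather than part of any existing module.

```python
import hashlib
import math
from itertools import combinations


def pk_hash(pk: str) -> str:
    return hashlib.sha256(pk.encode("utf-8")).hexdigest()


def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def cluster(records, threshold=0.9):
    """records: list of (pk, embedding). Returns sorted lists of record indices."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        (pk_i, emb_i), (pk_j, emb_j) = records[i], records[j]
        if pk_hash(pk_i) != pk_hash(pk_j):
            continue  # PKs differ: never merge, whatever the similarity
        if cosine(emb_i, emb_j) > threshold:
            parent[find(i)] = find(j)  # same PK and similar: merge clusters

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up data for illustration only.
records = [
    ("101", [1.0, 0.0]),    # 0
    ("102", [1.0, 0.01]),   # 1: near-identical vector, different PK
    ("101", [0.99, 0.05]),  # 2: same PK as 0, similar vector
    ("101", [0.0, 1.0]),    # 3: same PK as 0, dissimilar vector
]
print(cluster(records))  # [[0, 2], [1], [3]]
```

Because unions only ever happen between records whose PK hashes match, two records with different primary keys cannot end up in the same cluster, even transitively.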

## Related

- [[Semantic Anomaly Compression]]
- [[Air-Gapped SLM Fix Generation]]