---
title: "Content Hashing (Incremental Indexing)"
type: concept
tags: [indexing, optimization, hash, incremental]
sources: [semantic-memory-search]
last_updated: 2026-04-22
---
## Aliases
- Content Hashing
- 增量索引
- Incremental Indexing
- 内容哈希
## Definition
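Content hashing identifies content by computing a SHA-256 hash of each document chunk. As long as a chunk's content is unchanged, its hash stays the same, so the system can skip chunks that are already indexed and process only new or changed ones. This gives incremental indexing and avoids repeated Embedding API calls.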
## How It Works
```
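document chunk → SHA-256 hash → content fingerprint
content fingerprint vs. indexed fingerprints → comparison result:
- exact match → skip (already indexed, no re-embedding needed)
- changed / new → run Embedding and store the vector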
```
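The flow above can be sketched in a few lines of Python. This is a minimal illustration, not memsearch's actual implementation: the names `content_hash`, `index_chunks`, and `fake_embed` are hypothetical, the fingerprint store is a plain in-memory dict, and `fake_embed` stands in for a real Embedding API call.

```python
import hashlib
from typing import Callable, Dict, Iterable, List

Vector = List[float]


def content_hash(chunk: str) -> str:
    """Return the SHA-256 hex digest of the chunk's UTF-8 bytes."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def index_chunks(
    chunks: Iterable[str],
    store: Dict[str, Vector],
    embed: Callable[[str], Vector],
) -> int:
    """Embed only chunks whose fingerprint is not yet in `store`.

    Returns the number of embedding calls that were actually made.
    """
    calls = 0
    for chunk in chunks:
        fingerprint = content_hash(chunk)
        if fingerprint in store:
            continue  # exact match: already indexed, skip re-embedding
        store[fingerprint] = embed(chunk)  # new or changed: embed and store
        calls += 1
    return calls


def fake_embed(chunk: str) -> Vector:
    """Hypothetical stand-in for an Embedding API call (assumption)."""
    return [float(len(chunk))]


store: Dict[str, Vector] = {}
first_run = index_chunks(["alpha", "beta"], store, fake_embed)
second_run = index_chunks(["alpha", "beta"], store, fake_embed)
third_run = index_chunks(["alpha", "beta (edited)"], store, fake_embed)
print(first_run, second_run, third_run)
```

The second run makes no embedding calls because both fingerprints are already present; the third run embeds only the edited chunk. Note that this sketch never removes fingerprints of chunks that no longer exist, which a real index would likely need to handle.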
## Why SHA-256?
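- **Deterministic**: identical content always produces the same hash, so an unchanged chunk is never mistaken for a changed one
- **Collision-resistant**: SHA-256's 256-bit output space makes the probability of a collision negligible
- **Fast**: computing a hash is several orders of magnitude faster than computing an Embedding, which suits frequent incremental checks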
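To put a rough number on "negligible", the birthday bound gives an upper estimate of the chance that any two distinct chunks share a hash. The sketch below assumes SHA-256 behaves like a uniformly random function; the function name and the chunk count of one billion are illustrative choices, not figures from the source.

```python
def collision_upper_bound(num_chunks: int, hash_bits: int = 256) -> float:
    """Birthday-bound upper estimate of the probability of any collision.

    There are n(n-1)/2 pairs of chunks, and each pair collides with
    probability 2**-hash_bits under the uniform-hash assumption.
    """
    pairs = num_chunks * (num_chunks - 1) // 2
    return pairs / 2 ** hash_bits


# Even with one billion distinct chunks the bound is vanishingly small.
print(collision_upper_bound(10**9))
```

For one billion chunks the bound comes out on the order of 1e-60, far below any practical failure rate elsewhere in an indexing pipeline.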
## Key Insight
> "Smart dedup saves money. Each chunk is identified by a SHA-256 content hash. Re-running `index` only embeds new or changed content, so you can run it as often as you like without wasting embedding API calls." — memsearch
## Benefits
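| Benefit | Description |
|------|------|
| **Cost savings** | Avoids repeated Embedding API calls, saving tokens and fees |
| **Faster indexing** | Only incremental changes are processed, so re-indexing takes far less time |
| **Idempotency** | Re-indexing any number of times yields the same result |
| **Atomicity** | Each chunk is handled independently, so there is no overhead from rewriting the whole index |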
## Connections
- [[semantic-memory-search]]: memsearch uses SHA-256 content hashing to implement incremental indexing
- [[memsearch]]: content hashing is the core mechanism behind memsearch's incremental indexing