Update nexus: fix conflicts and sync local changes
@@ -1,50 +1,50 @@
---
title: "Content Hashing (Incremental Indexing)"
type: concept
tags: [indexing, optimization, hash, incremental]
sources: [semantic-memory-search]
last_updated: 2026-04-22
---

## Aliases
- Content Hashing
- 增量索引
- Incremental Indexing
- 内容哈希

## Definition
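
Content hashing identifies each chunk of document content by its SHA-256 hash. As long as a chunk's content is unchanged, its hash stays the same, so the system can skip content that is already indexed and process only new or changed chunks. This makes indexing incremental and avoids repeated Embedding API calls.
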
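A minimal sketch of the fingerprinting step, assuming chunks are plain strings encoded as UTF-8. The helper name `chunk_fingerprint` is hypothetical and is not taken from memsearch:

```python
import hashlib


def chunk_fingerprint(chunk: str) -> str:
    """Return the SHA-256 hex digest of a chunk (assumes UTF-8 encoding)."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


original = chunk_fingerprint("Content hashing enables incremental indexing.")
unchanged = chunk_fingerprint("Content hashing enables incremental indexing.")
edited = chunk_fingerprint("Content hashing enables incremental indexing!")

print(original == unchanged)  # identical content gives an identical fingerprint
print(original == edited)     # a one-character edit gives a different fingerprint
```

Because the fingerprint depends only on the chunk's bytes, it can serve as a stable key for deciding whether a chunk has been seen before.
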
## How It Works

```
Content chunk → SHA-256 hash → content fingerprint
        ↓
Content fingerprint vs. indexed fingerprints → comparison result:
  - Exact match → skip (already indexed, no re-embedding needed)
  - Changed or new → compute the embedding and store the vector
```

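The flow above can be sketched as follows. This is an illustration under stated assumptions, not memsearch's implementation: the index is modeled as an in-memory dict keyed by fingerprint, and `index_chunks` and `fake_embed` are hypothetical names, with `fake_embed` standing in for a real Embedding API call.

```python
import hashlib
from typing import Callable


def index_chunks(
    chunks: list[str],
    store: dict[str, list[float]],
    embed: Callable[[str], list[float]],
) -> int:
    """Embed only chunks whose fingerprint is not in `store`; return how many were embedded."""
    embedded = 0
    for chunk in chunks:
        fingerprint = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if fingerprint in store:
            continue  # exact match: already indexed, skip re-embedding
        store[fingerprint] = embed(chunk)  # new or changed: embed and store the vector
        embedded += 1
    return embedded


def fake_embed(text: str) -> list[float]:
    """Hypothetical placeholder for a real Embedding API call."""
    return [float(len(text))]


store: dict[str, list[float]] = {}
first = index_chunks(["alpha", "beta"], store, fake_embed)      # both chunks are new
second = index_chunks(["alpha", "beta"], store, fake_embed)     # nothing changed
third = index_chunks(["alpha", "beta v2"], store, fake_embed)   # one chunk changed
print(first, second, third)  # 2 0 1
```

The second run makes no embedding calls at all, which is the behavior the quote under Key Insight describes. This sketch never deletes fingerprints for chunks that no longer exist; a real index would also need some cleanup step for stale entries.
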
## Why SHA-256?
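
- **Deterministic**: identical content always produces the identical hash, so an unchanged chunk is never mistaken for a changed one
- **Collision-resistant**: SHA-256's 256-bit output space makes the probability of a collision negligible
- **Fast**: computing a hash is orders of magnitude faster than computing an embedding, which suits frequent incremental checks
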
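To put a rough number on "negligible", the birthday bound estimates the probability that any two of n distinct chunks share a digest as at most n(n-1)/2^257, treating SHA-256 as a uniformly random function. The chunk count below is an illustrative assumption, and `collision_bound` is a hypothetical helper:

```python
def collision_bound(n_chunks: int, bits: int = 256) -> float:
    """Birthday-bound estimate: n(n-1)/2 pairs, each colliding with probability 2**-bits."""
    return n_chunks * (n_chunks - 1) / 2 ** (bits + 1)


p = collision_bound(10**9)  # one billion chunks, an illustrative figure
print(f"{p:.1e}")
```

Even at a billion chunks the estimate is on the order of 10^-60, so accidental collisions can be ignored for deduplication purposes.
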
## Key Insight

> "Smart dedup saves money. Each chunk is identified by a SHA-256 content hash. Re-running `index` only embeds new or changed content, so you can run it as often as you like without wasting embedding API calls." — memsearch

## Benefits
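
| Benefit | Description |
|---------|-------------|
| **Cost savings** | Avoids repeated Embedding API calls, saving tokens and fees |
| **Faster indexing** | Only incremental changes are processed, so re-indexing takes far less time |
| **Idempotency** | Re-indexing any number of times yields the same result |
| **Atomicity** | Each chunk is handled independently, with no cost of rewriting the whole index |
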
## Connections
- [[semantic-memory-search]]: memsearch uses SHA-256 content hashes to implement incremental indexing
- [[memsearch]]: content hashing is the core mechanism behind memsearch's incremental indexing