Workspace sync: auto commit 2026-04-23 12:02:11

2026-04-23 12:02:11 +08:00
parent 6a8362bb5a
commit c59cc07327
57 changed files with 3427 additions and 30 deletions
--- a/wiki/concepts/Indexing.md
+++ b/wiki/concepts/Indexing.md
@@ -0,0 +1,29 @@
+---
+title: "Indexing"
+type: concept
+tags: [rag, indexing, document-processing, embedding]
+last_updated: 2025-01-16
+---
+
+## Definition
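+Indexing is the first stage of the RAG pipeline. It converts external documents into a retrievable vector representation in four steps: document loading → text splitting → embedding → storage in a vector database.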
+
+## Process
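+1. **Document Loading**: load raw documents from a variety of sources (web pages, PDFs, databases, APIs, and so on).
+2. **Text Splitting**: split long documents into text chunks (Splits) that fit within the embedding model's context window.
+3. **Embedding**: use an embedding model to convert each Split into a fixed-length semantic vector.
+4. **Storage**: store each vector together with its original text chunk in a Vector Store (vector database).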
+
+## Why Splitting is Necessary
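+An embedding model has a limited context window (typically 512 to 8192 tokens) and cannot process a whole long document in one pass, so the document must be split. The splitting strategy directly affects retrieval quality: chunks that are too small lose semantic completeness, while chunks that are too large introduce noise.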
+
+## In RAG Pipeline
+- **Upstream stage**: the output of Indexing (the vector database) is the input to the Retrieval stage.
+- **Tooling**: LangChain's DocumentLoader, TextSplitter, Embedding, and VectorStore components wrap the whole flow.
+
+## Related Concepts
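+- [[RAG]]: Indexing is the first stage of the RAG pipeline
+- [[Split]]: a document chunk produced by splitting
+- [[Embedding]]: the technique used for vectorization
+- [[Vector Store]]: the database that stores the vectors
+- [[Retrieval]]: the stage that follows Indexing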
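
The four steps listed under Process in `wiki/concepts/Indexing.md` (load, split, embed, store) can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration and is not part of the commit: the names `load_documents`, `split_text`, `embed`, `build_index`, and `Record` are hypothetical, the splitter counts characters rather than tokens, the "embedding" is a toy word-hashing vector rather than a real embedding model, and the "vector store" is a plain Python list. It does not use LangChain.

```python
import hashlib
import math
from dataclasses import dataclass

DIM = 16  # toy vector length; real embedding models use far larger dimensions


def load_documents(sources):
    """Step 1 (Document Loading): here the sources are already name -> text."""
    return list(sources.items())


def split_text(text, chunk_size=40, overlap=10):
    """Step 2 (Text Splitting): fixed-size character windows with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def embed(text):
    """Step 3 (Embedding): toy stand-in that hashes words into DIM buckets."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # L2-normalize so every non-empty chunk maps to a unit-length vector.
    return [v / norm for v in vec] if norm else vec


@dataclass
class Record:
    source: str    # which document the chunk came from
    text: str      # the original text chunk (the Split)
    vector: list   # its fixed-length vector


def build_index(sources, chunk_size=40, overlap=10):
    """Step 4 (Storage): keep each vector next to its original chunk."""
    store = []
    for name, text in load_documents(sources):
        for chunk in split_text(text, chunk_size, overlap):
            store.append(Record(name, chunk, embed(chunk)))
    return store


docs = {"note.txt": "Indexing turns documents into vectors. " * 3}
index = build_index(docs)
print(len(index), "chunks, vector length", len(index[0].vector))
```

Storing the chunk text alongside its vector mirrors step 4: the Retrieval stage needs the original text, not only the vector, to hand back to the model. The overlap between neighbouring windows is one common way to soften the "too small loses semantics" problem noted under Why Splitting is Necessary, at the cost of some duplicated storage.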