Auto-sync: update nexus workspace

2026-04-28 07:26:52 +08:00
parent b83b4e3105
commit 3224ec4787
436 changed files with 17107 additions and 15920 deletions
--- a/wiki/concepts/Indexing.md
+++ b/wiki/concepts/Indexing.md
@@ -1,29 +1,47 @@
---
-title: "Indexing"
-type: concept
-tags: [rag, indexing, document-processing, embedding]
-last_updated: 2025-01-16
---
-
-## Definition
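-Indexing (the indexing stage) is the first step of the RAG pipeline. It converts external documents into retrievable vector representations: document loading → text splitting → embedding → storage in a vector database.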
-
-## Process
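-1. **Document Loading**: load raw documents from a variety of sources (web pages, PDFs, databases, APIs, etc.)
-2. **Text Splitting**: split long documents into text fragments (Splits) that fit within the Embedding Model's Context Window
-3. **Embedding**: use an Embedding Model to convert each Split into a fixed-length semantic vector
-4. **Storage**: store the vectors together with the original text chunks in a Vector Store (vector database)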
-
-## Why Splitting is Necessary
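-An Embedding Model has a limited Context Window (typically 512~8192 tokens) and cannot process a whole long document at once, so the document must be split. The splitting strategy directly affects retrieval quality: chunks that are too small lose semantic completeness, while chunks that are too large introduce noise.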
-
-## In RAG Pipeline
- **Upstream stage**: the output of Indexing (the vector database) is the input to the Retrieval stage
- **Tooling**: LangChain's DocumentLoader, TextSplitter, Embedding, and VectorStore components wrap the whole flow
-
-## Related Concepts
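- [[RAG]]: Indexing is the first stage of the RAG pipeline
- [[Split]]: a document fragment produced by splitting
- [[Embedding]]: the vectorization technique
- [[Vector Store]]: the database that stores the vectors
- [[Retrieval]]: the stage that follows Indexing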
+---
+title: "Indexing"
+type: concept
+tags: [RAG, vector-database, document-processing]
+sources: [rag从入门到精通系列1-基础rag]
+last_updated: 2025-01-16
+---
+
+## Definition
+
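+Indexing (the indexing stage) is the first stage of the RAG (Retrieval-Augmented Generation) pipeline. It converts external documents into retrievable vector representations and stores them in a vector database.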
+
+## Core Process
+
+```
+Raw documents → Document Loader → Text splitting (Split) → Embedding → Store in Vector Store
+```
+
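+1. **Loading**: use the Document Loaders of a framework such as LangChain to load raw documents from a variety of sources (web pages, local files, databases, etc.)
+2. **Splitting**: split long documents into small chunks (Splits) that fit within the Embedding Model's Context Window, typically 512~4096 tokens per chunk
+3. **Embedding**: use an Embedding Model (e.g. the BAAI/bge series) to convert each text chunk into a fixed-length vector representation
+4. **Storing in the vector database**: write the Embedding Vectors to a Vector Store (e.g. Qdrant, Chroma, Milvus)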
+
+## Key Parameters
+
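+- **Chunk Size**: the number of tokens in each Split; it has to balance contextual completeness against the model's limits
+- **Chunk Overlap**: the number of tokens shared by adjacent Splits, which keeps information from being lost at chunk boundaries
+- **Embedding Model**: the model that determines vector quality and retrieval effectiveness (e.g. BAAI's BGE series, OpenAI text-embedding-3)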
+
+## Tools
+
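+- **LangChain**: provides 160+ document loaders and vector store integrations
+- **LlamaIndex**: an LLM application framework focused on data connection and indexing
+- **Qdrant**: an open-source vector database written in Rust, with support for filtering and hybrid search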
+
+## Connections
+
+- [[Indexing]] → part_of → [[RAG]]
+- [[Indexing]] → uses → [[Embedding]]
+- [[Indexing]] → produces → [[Vector-Store]]
+- [[Indexing]] → depends_on → [[Context-Window]]
+
+## Aliases
+
+- Document Indexing
+- Chunking
+- 文档索引
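
The four Core Process steps and the Chunk Size / Chunk Overlap parameters in the added version can be illustrated with a minimal, standard-library-only sketch. None of this is LangChain, LlamaIndex, or Qdrant code: `split_text`, `toy_embed`, `InMemoryVectorStore`, and `index_document` are hypothetical names, the whitespace tokenizer is an assumption, and the hashed bag-of-words vector is only a stand-in for a real Embedding Model such as BGE.

```python
import hashlib
import math
from dataclasses import dataclass, field


def split_text(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def toy_embed(tokens, dim=64):
    """Hashed bag-of-words vector, L2-normalised (stand-in for a real embedding model)."""
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


@dataclass
class InMemoryVectorStore:
    """Keeps vectors next to their source text; search is brute-force dot product."""
    vectors: list = field(default_factory=list)
    texts: list = field(default_factory=list)

    def add(self, vector, text):
        self.vectors.append(vector)
        self.texts.append(text)

    def search(self, query_vector, k=1):
        scored = [
            (sum(a * b for a, b in zip(vec, query_vector)), text)
            for vec, text in zip(self.vectors, self.texts)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


def index_document(text, store, chunk_size=8, chunk_overlap=2):
    """Load -> split -> embed -> store. Returns the number of chunks written."""
    tokens = text.split()  # assumption: whitespace tokens instead of model tokens
    chunks = split_text(tokens, chunk_size, chunk_overlap)
    for chunk in chunks:
        store.add(toy_embed(chunk), " ".join(chunk))
    return len(chunks)


if __name__ == "__main__":
    store = InMemoryVectorStore()
    doc = "Indexing loads documents, splits them into chunks, embeds each chunk, and stores the vectors"
    print(index_document(doc, store, chunk_size=6, chunk_overlap=2))
    print(store.search(toy_embed("stores the vectors".split()), k=1))
```

With `chunk_size=4` and `chunk_overlap=1`, each window advances by three tokens, so a ten-token document yields three chunks whose boundary tokens appear twice. That duplication is the cost of keeping context across chunk boundaries.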