Update nexus: fix conflicts and sync local changes
@@ -1,51 +1,51 @@
---
title: "Hybrid Search"
type: concept
tags: [search, vector, bm25, retrieval]
sources: [semantic-memory-search, knowledge-base-rag]
last_updated: 2026-04-22
---
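
## Definition

Hybrid search combines two or more retrieval strategies, typically dense vector retrieval (semantic similarity) and sparse keyword retrieval (BM25), and merges their results with a rank-fusion algorithm, covering both semantic understanding and exact matching. It is currently the mainstream way to improve recall in RAG systems.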

## How It Works

```
Query → [Vector retrieval (ANN)] ──┐
      → [BM25 keyword retrieval] ──┼─→ Reciprocal Rank Fusion (RRF) → fused ranking
      → [Other retrievers] ────────┘
```
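
1. **Vector retrieval**: An embedding model encodes the query as a vector, and an ANN index (e.g. HNSW) finds semantically similar document chunks.
2. **BM25 retrieval**: Classic keyword retrieval that scores chunks from term frequency and document frequency and returns the chunks that match the query terms literally.
3. **RRF fusion**: Each retriever's ranked list contributes `1/(k+rank)` per document, and the contributions are summed into a fused score; `k` is a smoothing constant (commonly k=60).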
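
The fusion step above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular library: the function name `reciprocal_rank_fusion` and the document IDs are made up for the example, and ranks are assumed to start at 1.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Documents missing from a list add nothing.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from two retrievers.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Because RRF only looks at rank positions, the raw scores of the two retrievers (cosine similarity and BM25) never need to be put on a common scale.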
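
## Why Not Pure Vector Search?

Limitations of pure vector search:
- **Incomplete synonym coverage**: The embedding space does not capture every synonym pair (e.g. "缓存" vs. "cache").
- **Low precision on proper nouns**: Vector representations of rare words, new words, and numeric entities are not precise enough.
- **High compute cost**: The cost of vector retrieval grows with vector dimensionality.

Hybrid search uses BM25 to add exact keyword matching while keeping the semantic understanding of vector search.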
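
To make the exact-match side concrete, here is a small self-contained BM25 scorer. It is a sketch under stated assumptions: whitespace tokenization, the common Okapi BM25 form with `k1=1.5` and `b=0.75`, and invented documents containing a made-up error code `E4012`. A production system would normally rely on a search engine or an existing BM25 library rather than this function.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Invented corpus: only the second chunk contains the rare token "E4012".
docs = [
    "restart the cache service after deploy".split(),
    "error E4012 raised by the cache layer".split(),
    "general notes on memory and storage".split(),
]
scores = bm25_scores("E4012 cache".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The rare token carries a high inverse document frequency, so the chunk that contains it literally ranks first, which is the kind of query where a dense retriever alone may be imprecise.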

## Key Insight

> "Hybrid search beats pure vector search. Combining semantic similarity (dense vectors) with keyword matching (BM25) via Reciprocal Rank Fusion catches both meaning-based and exact-match queries." (memsearch documentation)

## Implementation

| Component | Options |
|------|------|
| Vector retriever | Milvus / Pinecone / FAISS / Qdrant |
| BM25 | Elasticsearch / OpenSearch / rank_bm25 |
| RRF fusion | Custom implementation, or built into the vector database |
| Embedding | OpenAI text-embedding-3 / BGE / Sentence-BERT |
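
One way to keep the components in the table swappable is to treat every retriever as a function from a query to a ranked list of chunk IDs and fuse whatever lists come back. The sketch below assumes exactly that; `hybrid_search`, the `Retriever` alias, and the two stub retrievers are hypothetical names for illustration and do not call any of the products listed above.

```python
from typing import Callable, Dict, List

# A retriever is any callable mapping a query string to a ranked list of IDs.
Retriever = Callable[[str], List[str]]


def hybrid_search(query: str, retrievers: List[Retriever],
                  k: int = 60, top_n: int = 5) -> List[str]:
    """Run every retriever and fuse the ranked lists with 1/(k+rank) scores."""
    scores: Dict[str, float] = {}
    for retrieve in retrievers:
        for rank, doc_id in enumerate(retrieve(query), start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ranked[:top_n]


# Stubs standing in for a real vector store and a real BM25 index.
def fake_vector_retriever(query: str) -> List[str]:
    return ["chunk_2", "chunk_7", "chunk_1"]


def fake_bm25_retriever(query: str) -> List[str]:
    return ["chunk_7", "chunk_9"]


results = hybrid_search(
    "how to clear the cache",
    [fake_vector_retriever, fake_bm25_retriever],
    top_n=3,
)
print(results)
```

With this shape, replacing a stub with a real backend only requires a function that returns IDs in rank order; the fusion code does not change.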

## Connections
- [[semantic-memory-search]]: memsearch uses a hybrid search strategy
- [[Knowledge-Base-RAG]]: hybrid search is key to improving recall in knowledge-base RAG
- [[Semantic-Search]]: hybrid search is an enhanced form of pure semantic search
- [[Reciprocal Rank Fusion]]: the fusion algorithm used by hybrid search