Auto-sync: 2026-04-27 20:02
27
wiki/concepts/Content-Deduplication.md
Normal file
@@ -0,0 +1,27 @@
---
title: "Content-Deduplication"
type: concept
tags: [Data-Processing, NLP, Similarity-Matching]
sources: [multi-source-tech-news-digest.md]
last_updated: 2026-04-27
---
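# Content-Deduplication

Content deduplication is the technique of identifying and merging duplicate or near-duplicate content. It addresses the redundancy that arises when the same content arrives through multiple channels.

## Definition

Compute the similarity between titles or summaries (for example Jaccard similarity, cosine similarity, or edit distance) to decide whether two items refer to the same information, then merge the duplicates.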
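
As an illustration of one of the similarity measures named above, here is a minimal sketch of Jaccard similarity over title tokens. The function names (`tokenize`, `jaccard`, `is_duplicate`) and the `0.6` threshold are assumptions made for this example, not part of the source note. The regex tokenizer assumes whitespace-separated languages; Chinese titles would need a proper word segmenter.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets: |A & B| / |A | B|."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty titles are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_duplicate(a: str, b: str, threshold: float = 0.6) -> bool:
    """Hypothetical decision rule: duplicate if similarity reaches the threshold."""
    return jaccard(a, b) >= threshold
```

The threshold trades recall against false merges and would normally be tuned on real data from the aggregated feeds.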
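
## Approaches

- **Exact matching**: deduplicate by URL or unique ID (suited to content within a single platform)
- **Fuzzy matching**: deduplicate by semantic or string similarity of titles/summaries (suited to cross-platform aggregation)
- **Cluster-based deduplication**: cluster similar items and keep only one representative per cluster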
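
The three approaches can be combined in a single pass. The sketch below is one possible arrangement, not the method used by the source: it drops exact URL repeats first, then greedily groups items whose titles are similar, keeping the first item seen as each cluster's representative. The `Item` class, `deduplicate` function, and `0.8` threshold are hypothetical; `difflib.SequenceMatcher` from the Python standard library stands in for whichever string-similarity measure is preferred.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Item:
    url: str
    title: str


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def deduplicate(items: list[Item], threshold: float = 0.8) -> list[Item]:
    """Return one representative per group of duplicate items."""
    seen_urls: set[str] = set()
    representatives: list[Item] = []
    for item in items:
        # Exact matching: skip items whose URL was already seen.
        if item.url in seen_urls:
            continue
        seen_urls.add(item.url)
        # Fuzzy matching + greedy clustering: skip items whose title is
        # close to an existing representative; otherwise start a new cluster.
        if any(title_similarity(item.title, rep.title) >= threshold
               for rep in representatives):
            continue
        representatives.append(item)
    return representatives


# Made-up sample data for illustration only.
items = [
    Item("https://a.example/1", "Acme ships widget 2.0"),
    Item("https://a.example/1", "Acme ships widget 2.0 today"),
    Item("https://b.example/x", "Acme Ships Widget 2.0!"),
    Item("https://c.example/y", "New compiler backend announced"),
]
unique = deduplicate(items)
```

Greedy clustering compares each item against every representative so far, which is quadratic in the worst case; for large feeds a blocking or hashing step (for example MinHash) would usually come first.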
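
## Related Concepts

- [[Content-Aggregation]]: deduplication is a key step in the content aggregation pipeline
- [[Quality-Scoring]]: after deduplication, the representative item of each cluster is scored
- [[Semantic-Search]]: semantic similarity techniques can also be used for deduplication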