Auto-sync: 2026-04-16 17:30

2026-04-16 17:30:41 +08:00
parent b2250c60b2
commit c999498de4
662 changed files with 3797 additions and 21340 deletions

@@ -1,41 +0,0 @@
---
title: Scrapy
type: entity
tags: [web-crawling, Python, open-source, data-collection]
sources: ["https://scrapy.org"]
last_updated: 2026-04-15
---
## Basic Information
- **Type**: Open-source Python web-crawling framework
- **Website**: https://scrapy.org
- **Stars**: 55,000+ (GitHub)
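## Core Mechanisms
- **Asynchronous crawling**: Built on the Twisted asynchronous networking framework, which supports high concurrency
- **Spiders**: Define the crawl logic and support CSS/XPath selectors
- **Item Pipeline**: A pipeline for cleaning, validating, and storing data
- **Middleware**: Downloader middleware for customizing the User-Agent, proxies, and cookies
- **Feed Exports**: Supports multiple output formats, including JSON, CSV, XML, and JSONL
- **scrapy-playwright**: A plugin that integrates Playwright to handle JS-rendered dynamic pages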
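To make the Item Pipeline stage concrete, here is a minimal sketch of a cleaning and validation step. The class name `CleanProductPipeline` and the `name`/`price` fields are hypothetical, chosen only for illustration. The `DropItem` fallback is also an assumption, added so the sketch runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback (assumption) so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        pass


class CleanProductPipeline:
    """Hypothetical pipeline: trims string fields and drops items lacking a price."""

    def process_item(self, item, spider):
        # Strip surrounding whitespace from every string field.
        cleaned = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in dict(item).items()
        }
        # Validation: discard items that have no usable price.
        if not cleaned.get("price"):
            raise DropItem("missing price")
        # Normalization: "1,299.50" -> 1299.5
        cleaned["price"] = float(str(cleaned["price"]).replace(",", ""))
        return cleaned
```

In a real project, a pipeline like this would be enabled through the `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.CleanProductPipeline": 300}`, where the module path is a placeholder.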
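## Key Configuration
```python
# scrapy-playwright integration (settings.py)
# Route both HTTP and HTTPS requests through the Playwright download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
# Run the browser headless; --no-sandbox is often needed in containers.
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True, "args": ["--no-sandbox"]}
```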
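With these settings in place, scrapy-playwright renders only the requests that opt in through the `playwright` key in `Request.meta`. The helper below is a small sketch of how a spider might build such a request. The function name `playwright_request_kwargs` and its `include_page` parameter are hypothetical. The `playwright` and `playwright_include_page` meta keys come from scrapy-playwright itself.

```python
def playwright_request_kwargs(url, include_page=False):
    """Build keyword arguments for scrapy.Request so the URL is rendered by Playwright.

    Hypothetical helper for illustration; it only assembles a plain dict.
    """
    meta = {"playwright": True}  # opt this request in to browser rendering
    if include_page:
        # Ask scrapy-playwright to expose the Playwright page object to the callback.
        meta["playwright_include_page"] = True
    return {"url": url, "meta": meta}
```

Inside a spider, this could be used as `yield scrapy.Request(**playwright_request_kwargs("https://example.com"))`. Requests without the `playwright` meta key continue to use Scrapy's regular downloader.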
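## Role in the Wiki
- [[可自动化可扩展AI增强的电商数据采集与处理系统]] (an automatable, scalable, AI-enhanced e-commerce data collection and processing system): core of the crawler layer
- [[Playwright]] provides JS rendering, while Scrapy handles scheduling and structured output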
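## Anti-Blocking Strategies
- `ROBOTSTXT_OBEY = False` (decide per target site)
- `DOWNLOAD_DELAY` sets the delay between requests
- `RANDOMIZE_DOWNLOAD_DELAY` randomizes that delay
- The scrapy-user-agents middleware rotates the User-Agent
- Pair with a proxy pool (BrightData/ScraperAPI)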
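The settings listed above might be combined as in the following `settings.py` sketch. The delay value is an illustrative assumption, not a recommendation from the original note, and the middleware paths follow the scrapy-user-agents README as I understand it. The `delay_bounds` helper is hypothetical and only illustrates the documented behaviour of `RANDOMIZE_DOWNLOAD_DELAY`, which waits between 0.5x and 1.5x of `DOWNLOAD_DELAY`.

```python
# settings.py fragment (values are illustrative assumptions)
ROBOTSTXT_OBEY = False           # decide per target site
DOWNLOAD_DELAY = 2.0             # base delay in seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the actual wait around the base delay

DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in User-Agent middleware...
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # ...and let scrapy-user-agents pick a random User-Agent per request.
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
}


def delay_bounds(download_delay):
    """Return the (min, max) wait in seconds when the delay is randomized."""
    return (0.5 * download_delay, 1.5 * download_delay)
```

For the proxy pool, Scrapy's built-in `HttpProxyMiddleware` reads a per-request proxy from `request.meta["proxy"]`. How the proxy URL is obtained from a provider such as BrightData or ScraperAPI depends on that provider and is not shown here.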