---
title: Scrapy
type: entity
tags: [web scraping, Python, open source, data collection]
sources: ["https://scrapy.org"]
last_updated: 2026-04-15
---
## Basic Information
- **Type**: Open-source Python web-crawling framework
- **Website**: https://scrapy.org
- **Stars**: 55,000+ (GitHub)
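## Core Mechanisms
- **Asynchronous crawling**: built on the Twisted asynchronous networking framework; supports high concurrency
- **Spiders**: define the crawling logic; support CSS and XPath selectors
- **Item Pipeline**: pipeline for cleaning, validating, and storing data
- **Middleware**: downloader middleware for customizing the User-Agent, proxies, and cookies
- **Feed Exports**: output in multiple formats, including JSON, CSV, XML, and JSONL
- **scrapy-playwright**: plugin that integrates Playwright to handle pages rendered dynamically with JavaScript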
## Key Configuration
```python
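# scrapy-playwright integration (settings.py)
# Register the handler for both schemes; with only "http", HTTPS pages
# would bypass Playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
# Options passed to the Playwright browser launch call.
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True, "args": ["--no-sandbox"]}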
```
## Role in the Wiki
- [[可自动化可扩展AI增强的电商数据采集与处理系统]]: core of the crawler layer
- [[Playwright]] provides JavaScript rendering; Scrapy handles scheduling and structured output
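## Anti-Blocking Strategies
- `ROBOTSTXT_OBEY = False` (decide per target site)
- `DOWNLOAD_DELAY` sets the delay between requests
- `RANDOMIZE_DOWNLOAD_DELAY` randomizes that delay
- The scrapy-user-agents middleware rotates the User-Agent
- Pair with a proxy pool (BrightData/ScraperAPI)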
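
As a rough sketch, the settings named above might be combined in `settings.py` as follows. The numeric delay is an illustrative assumption, and the middleware paths follow the scrapy-user-agents package's documented setup; verify them against the installed version.

```python
# settings.py (sketch; values are illustrative, not recommendations)

ROBOTSTXT_OBEY = False  # decide per target site

# Base delay in seconds between requests to the same site.
DOWNLOAD_DELAY = 2.0
# When enabled, Scrapy waits a random time between 0.5x and 1.5x DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True

DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in static User-Agent middleware...
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # ...and let scrapy-user-agents pick a User-Agent per request.
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
}
```

Proxy pools are not configured here. Scrapy's built-in `HttpProxyMiddleware` honors a per-request `request.meta["proxy"]` value, which is the usual hook for plugging in a provider's endpoint.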