title: Scrapy
type: entity
tags: web crawler, Python, open source, data collection
sources: https://scrapy.org
last_updated: 2026-04-15

Basic information

  • Type: open-source Python web crawling framework
  • Website: https://scrapy.org
  • Stars: 55,000+ (GitHub)

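Core mechanisms

  • Asynchronous crawling: built on the Twisted asynchronous networking framework; supports high concurrency
  • Spiders: define the crawl logic; support CSS/XPath selectors
  • Item Pipeline: a pipeline for cleaning, validating, and storing data
  • Middleware: downloader middleware for customizing User-Agent, proxies, and cookies
  • Feed Exports: output in JSON, CSV, XML, or JSONL
  • scrapy-playwright: plugin that integrates Playwright to handle JavaScript-rendered pages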
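To make the Item Pipeline bullet concrete, here is a minimal sketch of a cleaning and validating pipeline. The class name `CleanPricePipeline` and the `title` and `price` fields are hypothetical, chosen only for illustration. The `try`/`except` around the import is a convenience so the sketch also runs without Scrapy installed; in a real project `DropItem` is imported directly from `scrapy.exceptions`.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so this sketch runs without Scrapy installed (assumption for
    # illustration only; real projects use scrapy.exceptions.DropItem).
    class DropItem(Exception):
        pass


class CleanPricePipeline:
    """Hypothetical pipeline: strips the title and parses the price as a float."""

    def process_item(self, item, spider):
        # Validation: an item without a usable title is dropped.
        title = (item.get("title") or "").strip()
        if not title:
            raise DropItem("missing title")

        # Cleaning: remove currency symbol and thousands separators.
        raw_price = str(item.get("price", "")).replace("$", "").replace(",", "").strip()
        try:
            price = float(raw_price)
        except ValueError:
            raise DropItem(f"unparseable price: {item.get('price')!r}")

        item["title"] = title
        item["price"] = price
        return item  # returned items continue to the next pipeline stage
```

A pipeline like this would be enabled through the `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.CleanPricePipeline": 300}`, where the module path is again a placeholder.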

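Key configuration

# scrapy-playwright integration
# Register the handler for both schemes; with only "http", https:// requests
# would bypass Playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright requires the asyncio-based Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True, "args": ["--no-sandbox"]}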
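With the settings above in place, scrapy-playwright renders only the requests that opt in through `meta={"playwright": True}`; all other requests go through the regular downloader. The helper below is a small sketch of that opt-in. The function name `playwright_request_kwargs` is hypothetical, and it returns plain keyword arguments so that it can be read and run without Scrapy installed.

```python
def playwright_request_kwargs(url: str, render_js: bool = True) -> dict:
    """Build keyword arguments intended for scrapy.Request(**kwargs).

    When render_js is True, the request is flagged for scrapy-playwright;
    otherwise it is left for Scrapy's default download handler.
    """
    meta = {"playwright": True} if render_js else {}
    return {"url": url, "meta": meta}


# Intended use inside a spider (requires Scrapy, shown as a comment only):
#
#     def start_requests(self):
#         yield scrapy.Request(**playwright_request_kwargs("https://example.com"))
```

Flagging requests individually keeps the browser out of the path for static pages, which are cheaper to fetch with the default downloader.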

Role in the wiki

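Anti-blocking strategies

  • ROBOTSTXT_OBEY = False: decide per target site
  • DOWNLOAD_DELAY: sets the delay between requests
  • RANDOMIZE_DOWNLOAD_DELAY: randomizes that delay
  • scrapy-user-agents: middleware that rotates the User-Agent
  • Pair with a proxy pool (BrightData/ScraperAPI)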
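The strategies above might be combined in `settings.py` roughly as follows. This is a sketch under assumptions, not a recommended configuration: the delay value, the `PROXY_POOL` setting, the proxy URLs, the `myproject` module path, and the `RotatingProxyMiddleware` class are all hypothetical. The scrapy-user-agents middleware path is the one documented in that package's README, to the best of my knowledge, and should be checked against the installed version.

```python
import itertools

ROBOTSTXT_OBEY = False           # decide per target site
DOWNLOAD_DELAY = 2.0             # base delay in seconds (illustrative value)
RANDOMIZE_DOWNLOAD_DELAY = True  # actual wait is 0.5x to 1.5x DOWNLOAD_DELAY

# Hypothetical custom setting read by the middleware below.
PROXY_POOL = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]

DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in User-Agent middleware and rotate agents instead.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
    # Hypothetical module path for the class defined below.
    "myproject.middlewares.RotatingProxyMiddleware": 350,
}


class RotatingProxyMiddleware:
    """Hypothetical downloader middleware: assigns proxies round-robin."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("proxy pool is empty")
        self._pool = itertools.cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the middleware from project settings.
        return cls(crawler.settings.getlist("PROXY_POOL"))

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honors request.meta["proxy"].
        request.meta["proxy"] = next(self._pool)
        return None  # continue normal processing of the request
```

Commercial services such as BrightData or ScraperAPI usually provide their own endpoint or credentials; those would replace the placeholder URLs in `PROXY_POOL`.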