Batch ingest: Multi-Agent Team / DevOps Maturity / 一语点醒梦中人 / NodeWarden
Sources:
- Agent-usecases-multi-Agent-Team.md
- DevOps-Maturity-Model-From-Traditional-IT-to-Advanced-DevOps.md
- AI-一语点醒梦中人.md
- Home-Office-NodeWarden-把-Bitwarden-搬上-Cloudflare-Workers彻底告别服务器.md

Entities: Trebuh, Cloudflare
Concepts: DevOps成熟度模型 (DevOps maturity model), 共享内存模式 (shared-memory pattern), 空性智慧 (wisdom of emptiness), 绝处逢生 (survival against the odds)
41
wiki/entities/Scrapy.md
Normal file
@@ -0,0 +1,41 @@
---
title: Scrapy
type: entity
tags: [web crawling, Python, open source, data collection]
sources: ["https://scrapy.org"]
last_updated: 2026-04-15
---

## Basic Information
- **Type**: Open-source Python web crawling framework
- **Website**: https://scrapy.org
- **Stars**: 55,000+ (GitHub)

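## Core Mechanisms
- **Asynchronous crawling**: built on the Twisted asynchronous networking framework; supports high concurrency
- **Spiders**: define the crawling logic; support CSS/XPath selectors
- **Item Pipeline**: pipeline for data cleaning, validation, and storage
- **Middleware**: downloader middleware; can customize User-Agent, proxies, and cookies
- **Feed Exports**: supports multiple output formats (JSON/CSV/XML/JSONL)
- **scrapy-playwright**: plugin that integrates Playwright to handle JavaScript-rendered dynamic pages
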
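To make the Item Pipeline idea concrete, here is a minimal sketch of a cleaning pipeline. Scrapy calls `process_item(item, spider)` on each enabled pipeline class, so the class needs no base class. The class name, the `title` and `price` fields, and the `myproject.pipelines` module path are hypothetical and chosen only for illustration.

```python
class PriceCleaningPipeline:
    """Hypothetical pipeline: trims the title and converts the price to a float.

    Assumes items are plain dicts with "title" and "price" keys (illustrative
    field names, not taken from any real project).
    """

    def process_item(self, item, spider):
        # Strip surrounding whitespace from the title.
        item["title"] = item.get("title", "").strip()

        # Remove a currency symbol and thousands separators, then parse.
        raw = str(item.get("price", "")).replace("$", "").replace(",", "").strip()
        item["price"] = float(raw) if raw else None

        # Returning the item passes it on to the next pipeline stage.
        return item


# Quick check outside of Scrapy; the spider argument is unused in this sketch.
cleaned = PriceCleaningPipeline().process_item(
    {"title": "  Widget ", "price": "$1,299.50"}, spider=None
)
```

In a real project the pipeline would be enabled in `settings.py` through `ITEM_PIPELINES`, for example `{"myproject.pipelines.PriceCleaningPipeline": 300}`, where the number sets the execution order.
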
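## Key Configuration
```python
# scrapy-playwright integration
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    # Register https as well; otherwise HTTPS requests bypass Playwright.
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
# Options passed to the Playwright browser launch.
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True, "args": ["--no-sandbox"]}
```
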
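The settings above only install the download handler. With scrapy-playwright, a request is rendered in the browser only when its `meta` contains `"playwright": True`. A minimal sketch of opting a request in follows; the helper name and the URL are hypothetical.

```python
def playwright_request_kwargs(url: str) -> dict:
    """Build keyword arguments for scrapy.Request that opt in to Playwright.

    Hypothetical helper for illustration; "playwright": True is the meta key
    that scrapy-playwright checks before rendering a request in the browser.
    """
    return {"url": url, "meta": {"playwright": True}}


# Inside a spider this would be used roughly as:
#     yield scrapy.Request(**playwright_request_kwargs(url), callback=self.parse)
kwargs = playwright_request_kwargs("https://example.com/products")
```

Requests without the meta key keep going through the regular Scrapy downloader, so only pages that need JavaScript rendering pay the browser overhead.
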
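## Role in the Wiki
- [[可自动化可扩展AI增强的电商数据采集与处理系统]] (automatable, scalable, AI-enhanced e-commerce data collection and processing system): core of the crawler layer
- [[Playwright]] provides JS rendering; Scrapy handles scheduling and structured output
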
## Anti-Blocking Strategies
- ROBOTSTXT_OBEY = False (decide per target site)
- DOWNLOAD_DELAY sets the delay between requests
- RANDOMIZE_DOWNLOAD_DELAY randomizes that delay
- scrapy-user-agents middleware rotates the User-Agent
- Combine with a proxy pool (BrightData/ScraperAPI)
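
The strategies above could be combined roughly as follows. This is a sketch under stated assumptions: the delay value, the proxy URLs, the `myproject.middlewares` path, and the `RandomProxyMiddleware` and `delay_bounds` names are hypothetical. The settings names and the scrapy-user-agents middleware path are the real ones. In a project, the settings would live in `settings.py` and the middleware class in its own module.

```python
import random

# --- settings.py fragment (placeholder values, tune per target site) ---
ROBOTSTXT_OBEY = False            # decide per target site
DOWNLOAD_DELAY = 2.0              # base delay in seconds (assumed value)
RANDOMIZE_DOWNLOAD_DELAY = True   # wait between 0.5x and 1.5x of DOWNLOAD_DELAY

DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in User-Agent middleware and let scrapy-user-agents rotate it.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
    # Hypothetical project middleware defined below.
    "myproject.middlewares.RandomProxyMiddleware": 350,
}


def delay_bounds(base_delay):
    """Return the (min, max) wait Scrapy uses when randomization is enabled."""
    return (0.5 * base_delay, 1.5 * base_delay)


# --- middlewares.py fragment ---
class RandomProxyMiddleware:
    """Hypothetical downloader middleware that picks a proxy per request."""

    # Placeholder endpoints; a real pool would come from the proxy provider.
    PROXIES = [
        "http://proxy1.example.com:8000",
        "http://proxy2.example.com:8000",
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.PROXIES)
        return None  # continue normal request processing
```

With `DOWNLOAD_DELAY = 2.0`, randomization makes each wait fall between 1 and 3 seconds, which avoids a perfectly regular request rhythm.
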