---
title: "Scrapy"
type: concept
tags: [python, scraping, crawling, data-collection]
date: 2025-09-29
---
## Definition
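Scrapy is an open-source Python web-crawling framework. It provides asynchronous request scheduling, structured output through Item Pipelines, and extensibility via downloader middleware, which makes it well suited to large-scale collection of structured web data.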
## Key Properties
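- **Asynchronous architecture**: built on the Twisted asynchronous networking library; supports highly concurrent requests
- **Item Pipeline**: a programmable pipeline for cleaning, validating, and persisting data (JSON/CSV/database)
- **Selectors**: both CSS selectors and XPath, with regular-expression (`re`) extraction on the selected content
- **Spider**: a custom crawler class that defines `start_urls`, the parsing rules, and the Item output
- **scrapy-playwright integration**: plugs Playwright's headless browser into Scrapy's download layer (scrapy-playwright registers itself as a download handler) to handle pages rendered dynamically by JavaScript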
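
The Item Pipeline property above can be illustrated with a minimal sketch. It assumes items are plain dicts with `title` and `price` fields and uses the classic `process_item(self, item, spider)` hook signature. The class name `CleanAndStorePipeline` and the output path `items.jsonl` are hypothetical, chosen only for this example. Validation that discards bad items would normally raise `scrapy.exceptions.DropItem`; that step is left out so the sketch runs with the standard library alone.

```python
import json


class CleanAndStorePipeline:
    """Hypothetical pipeline: normalizes product fields and writes JSON Lines."""

    def __init__(self, path="items.jsonl"):
        # The output path is an assumption made for this sketch.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        if self.file is not None:
            self.file.close()

    def process_item(self, item, spider):
        # Cleaning: trim surrounding whitespace from the title.
        item["title"] = item.get("title", "").strip()
        # Cleaning: turn a price string such as "$1,299.00" into a float.
        raw_price = str(item.get("price", "")).replace("$", "").replace(",", "").strip()
        item["price"] = float(raw_price) if raw_price else None
        # Persistence: one JSON object per line.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        # Returning the item hands it to the next pipeline stage.
        return item
```

In a real project a class like this would be enabled through the `ITEM_PIPELINES` setting, which maps the class's import path to an integer priority.
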
## Use Cases
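- Structured e-commerce data collection (product titles, prices, ratings, reviews)
- News content aggregation (headline, summary, source, timestamp)
- Competitor price monitoring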
## Related Concepts
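- [[Playwright]]: browser automation tool; Scrapy integrates with it through scrapy-playwright
- [[电商数据采集]]: Scrapy is one of the mainstream technology stacks for e-commerce data collection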
- [[Scrapy]]: Entity (tool developer)
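
The scrapy-playwright integration is wired up mostly through project settings. The fragment below is a sketch of a `settings.py` excerpt based on the setting names documented in the scrapy-playwright README; the rest of the project is assumed to exist, and `PLAYWRIGHT_LAUNCH_OPTIONS` is shown only to make the headless default explicit.

```python
# settings.py (fragment): route HTTP and HTTPS downloads through Playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright needs the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Options forwarded to the browser launch call; headless is already the default.
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}

# Rendering is opt-in per request. Inside a spider this would look like:
#   yield scrapy.Request(url, meta={"playwright": True})
```

Requests that do not set `meta={"playwright": True}` still go through Scrapy's regular downloader, so the browser is only used for pages that need JavaScript rendering.
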
## Source
[[Scrapy-Playwright-抓取TikTok-Shop-Data]]