Auto-sync: 2026-04-16 17:30
@@ -1,29 +0,0 @@
---
title: "Scrapy"
type: concept
tags: [python, scraping, crawling, data-collection]
date: 2025-09-29
---
## Definition
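Scrapy is an open-source Python web crawling framework. It provides asynchronous request scheduling, structured output through Item Pipelines, and extensibility through downloader middleware, which makes it well suited to large-scale collection of structured web data.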
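
The Item Pipeline stage mentioned above can be sketched as a plain Python class. This is a minimal illustration, not Scrapy's own code: the class name `JsonLinesPipeline`, the default output path, and the assumption that items are dict-like are all hypothetical, and it uses the long-standing `open_spider` / `close_spider` / `process_item(item, spider)` hook names.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: clean each item, then append it to a JSON Lines file."""

    def __init__(self, path="items.jsonl"):
        self.path = path  # assumed output location
        self.file = None

    def open_spider(self, spider):
        # Hook called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Hook called once when the spider finishes: release the file.
        self.file.close()

    def process_item(self, item, spider):
        # Cleaning step: strip surrounding whitespace from string fields.
        cleaned = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in dict(item).items()
        }
        # Persistence step: write one JSON object per line.
        self.file.write(json.dumps(cleaned, ensure_ascii=False) + "\n")
        # Return the item so that any later pipeline stage can process it.
        return cleaned
```

In a real project, a class like this would typically be enabled through the `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.JsonLinesPipeline": 300}`, where the module path is a placeholder.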
## Key Properties
- **Asynchronous architecture**: built on the Twisted asynchronous networking library, so many requests can be in flight concurrently
- **Item Pipeline**: a programmable pipeline for cleaning, validating, and persisting data (JSON, CSV, or a database)
- **Selectors**: both CSS selectors and XPath are available, with regular-expression extraction through `.re()`
- **Spider**: a user-defined crawler class that declares `start_urls`, parsing rules, and the items to output
- **scrapy-playwright integration**: runs a headless Playwright browser as a Scrapy download handler, so pages rendered dynamically by JavaScript can be scraped
## Use Cases
- Structured e-commerce data collection (product titles, prices, ratings, reviews)
- News content aggregation (headlines, summaries, sources, timestamps)
- Competitor price monitoring
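
For the e-commerce use case above, the collected fields could be modelled as a typed item. Scrapy accepts dataclass instances as items; the class name `ProductItem` and its field names and types below are illustrative assumptions, not a schema taken from the source.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ProductItem:
    """Hypothetical item for one product listing."""

    title: str
    price: Optional[float] = None       # assumed to be parsed into a number upstream
    rating: Optional[float] = None      # e.g. an average star rating
    review_count: Optional[int] = None  # number of reviews, if the page shows it


# Example: build an item and view it as a plain dict for export.
item = ProductItem(title="Widget", price=19.99, rating=4.5)
record = asdict(item)
```

A typed item makes missing fields explicit (`None`) instead of silently absent, which tends to simplify validation in a later pipeline stage.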
## Related Concepts
- [[Playwright]]: a browser automation tool; Scrapy integrates with it through scrapy-playwright
- [[电商数据采集|E-commerce data collection]]: Scrapy is one of the mainstream technology stacks for e-commerce data collection
- [[Scrapy]] (Entity): the developer of the tool
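
The Playwright integration noted above is normally switched on in the project's `settings.py`. The fragment below is a sketch based on the setting names documented by scrapy-playwright; the browser type and launch options are example values, not settings taken from the source.

```python
# settings.py (fragment): route HTTP(S) downloads through scrapy-playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Example values (assumptions): headless Chromium.
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
```

With these settings in place, only requests that opt in with `meta={"playwright": True}` are rendered in the browser; all other requests go through Scrapy's regular downloader.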
## Source
[[Scrapy-Playwright-抓取TikTok-Shop-Data]]