Batch 9: Obsidian plugins / open-source AI alternatives / Coze training / TK shipping labels / Ubuntu proxy access
- Sources: 5 new documents
- Concepts: ProxyChains, SOCKS5 proxy, Docker Daemon proxy
- Index: updated to Batch 9
- Cumulative sources: 108/182
40
wiki/sources/Scrapy-Playwright-抓取TikTok-Shop-Data.md
Normal file
@@ -0,0 +1,40 @@
---
title: "Scraping TikTok Shop Data with Scrapy + Playwright"
type: source
tags: [scrapy, playwright, tiktok, data-collection, python]
date: 2025-09-29
---
## Source File
- [[raw/跨境电商/Scrapy + Playwright 抓取TikTok Shop Data.md]]
## Summary
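- Core topic: scraping TikTok Shop store data with Scrapy + scrapy-playwright
- Problem domain: TikTok Shop pages are rendered dynamically with JavaScript, so plain HTTP requests cannot retrieve the data
- Method/mechanism: a Python venv isolates dependencies; scrapy-playwright drives Chromium to render the dynamic content; the spider is run with the `scrapy runspider` CLI
- Conclusion/value: provides a Docker deployment configuration (venv + `PATH` environment variable); Playwright's Chromium replaces the requests + Selenium combination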
## Key Claims
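- A Python venv is the recommended way to manage Scrapy/Playwright dependencies, since it avoids polluting the global environment
- The `scrapy-playwright` integration package plugs Playwright's headless browser into Scrapy as a download handler (configured through `DOWNLOAD_HANDLERS`)
- `playwright install chromium` installs the Chromium build that Playwright runs headless, which enables JavaScript rendering
- For Docker deployment, the Dockerfile must create the venv in advance and add it to `PATH`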
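The venv and `PATH` claims above can be illustrated with Python's standard library. The sketch below assumes a hypothetical venv location such as `/opt/venv`; the source does not state the path it uses. `path_with_venv_first` mirrors what a Dockerfile line like `ENV PATH="/opt/venv/bin:$PATH"` achieves: the venv's executables (`scrapy`, `playwright`) are found before any global ones.

```python
import os
import sys
import venv
from pathlib import Path


def create_env(env_dir: Path) -> None:
    """Create a virtual environment at env_dir (pip is skipped to keep the demo offline)."""
    venv.create(env_dir, with_pip=False)


def venv_bin_dir(env_dir: Path) -> Path:
    """Return the directory that holds the venv's executables."""
    return env_dir / ("Scripts" if os.name == "nt" else "bin")


def path_with_venv_first(env_dir: Path, current_path: str) -> str:
    """Prepend the venv's executable directory to an existing PATH string."""
    return os.pathsep.join([str(venv_bin_dir(env_dir)), current_path])


def running_in_venv() -> bool:
    """True when the current interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix
```

For example, `path_with_venv_first(Path("/opt/venv"), os.environ["PATH"])` yields the value a container would export, and `running_in_venv()` is a quick check that the spider is not running against the global interpreter.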
## Key Concepts
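- [[Scrapy]]: open-source Python crawling framework; asynchronous, structured scraping with Item Pipeline support
- [[Playwright]]: Microsoft's browser automation tool; supports Chromium/Firefox/WebKit
- [[电商数据采集|E-commerce data collection]]: the technology stack for collecting TikTok Shop data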
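To make the Item Pipeline concept concrete, here is a small sketch of a pipeline class. It is not taken from the source: the class name and the `price` / `price_value` fields are hypothetical. It relies only on the fact that a Scrapy item pipeline is a plain class exposing `process_item(item, spider)`.

```python
import re


class PriceNormalizationPipeline:
    """Hypothetical pipeline: add a numeric price parsed from the scraped price text."""

    _number = re.compile(r"\d+(?:\.\d+)?")

    def process_item(self, item, spider):
        raw = item.get("price")
        # Drop thousands separators, then take the first number in the string.
        match = self._number.search(raw.replace(",", "")) if raw else None
        item["price_value"] = float(match.group()) if match else None
        return item
```

In a Scrapy project it would be enabled through the `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.PriceNormalizationPipeline": 300}`, where the module path is an assumption.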
## Key Entities
- [[TikTok Shop]]: ByteDance's e-commerce platform; the data-collection target
## Connections
- [[Scrapy]] ← integrated as a download handler via `scrapy-playwright` ← [[Playwright]]
- [[Scrapy]] → outputs structured data → [[电商数据采集|E-commerce data collection]]
## Contradictions
- None
## Metadata
- Source: personal hands-on notes
- Tags: scrapy, playwright, tiktok