---
title: "An Automatable, Scalable, AI-Enhanced E-commerce Data Collection and Processing System"
type: source
tags: [e-commerce, scraper, automation, n8n, ai, docker]
date: 2025-11-11
---
## Source File
- [[raw/Home Office/可自动化、可扩展、AI增强的电商数据采集与处理系统.md]]
## Summary
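- Core topic: building an automated e-commerce data collection and AI processing pipeline with Docker + Scrapy + Playwright + n8n
- Problem domain: how to collect product data efficiently from multiple e-commerce platforms and use AI to clean, classify, summarize, and structure it
- Method/mechanism: Scrapy handles structured crawling and pagination scheduling; Playwright handles JS-rendered pages; n8n triggers the crawler on a schedule, reads the results, calls an AI model (OpenAI/Ollama), writes to a database or file, and sends notifications
- Conclusion/value: provides a complete Docker Compose architecture, a Scrapy project template, and an n8n Workflow JSON template, automating the full chain from crawling to AI analysis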
## Key Claims
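- Scrapy + Playwright combination: Scrapy handles structured crawling, pagination scheduling, and media downloads; Playwright renders JS-driven pages; the scrapy-playwright plugin integrates the two directly
- docker-compose multi-container architecture: scraper (Scrapy + Playwright) and n8n (automation and scheduling); data is passed through the shared ./data directory
- n8n Workflow automation pipeline: Cron Trigger → Execute Command (run the crawler) → Read File → AI processing (OpenAI/Ollama) → Database/File → notification
- Local AI processing: Ollama (Mistral/Llama3) called through an HTTP Request to http://localhost:11434/api/generate, with no dependency on external APIs
- Anti-blocking strategies: User-Agent rotation, proxy pools (BrightData/ScraperAPI), download delays with randomized access, distributed scheduling (Scrapyd)
- Scrapy writes crawl results as JSON/CSV for n8n to consume
- Suggested fields for collected data: title, price, rating, image_urls, product_url
- Long-term extension path: FastAPI service layer + LangChain + Qdrant vector database + Grafana/Metabase visualization
- Playwright requires its browsers to be installed (playwright install) and supports headless mode and viewport configuration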
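
The suggested fields above could be captured in a small record type. This is a minimal sketch, not the source's own schema: the `Product` class, the optional typing of `price` and `rating`, and the `parse_price` helper are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Product:
    """One scraped product, using the field names suggested in the source."""
    title: str
    product_url: str
    price: Optional[float] = None    # numeric price; currency is not tracked here
    rating: Optional[float] = None   # e.g. 4.5
    image_urls: List[str] = field(default_factory=list)


def parse_price(raw: str) -> Optional[float]:
    """Pull the first number out of a price string such as '$1,299.00'.

    Assumes ',' is a thousands separator and '.' the decimal point;
    other locales would need their own handling.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))
```

Normalizing prices to numbers at scrape time keeps the downstream AI and database steps from having to re-parse display strings.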
## Key Quotes
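> "Scrapy handles structured crawling, pagination scheduling, and media downloads; Playwright loads dynamic pages; both can be containerized with Docker Compose" (recommended technology stack)
> "You can run Ollama (Mistral, Llama3) models locally and call http://localhost:11434/api/generate through n8n's HTTP Request node" (local AI processing)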
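
To make the local Ollama call concrete, here is a hedged sketch of the HTTP request that an n8n HTTP Request node would be sending, written with the Python standard library. The endpoint is the one quoted in the source; the model tag `llama3`, the function names, and the timeout are assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint quoted in the source


def build_generate_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a POST request for Ollama's generate endpoint (model tag is an assumption)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body: bytes) -> str:
    """With "stream": false, the reply is one JSON object; its "response" field holds the text."""
    return json.loads(response_body)["response"]


def generate(prompt: str, model: str = "llama3", timeout: float = 120.0) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_text(resp.read())
```

If n8n runs in its own container, `localhost` refers to that container, so the URL would need to point at wherever Ollama is actually reachable.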
## Key Concepts
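- [[Scrapy]]: open-source Python crawling framework with asynchronous fetching, middleware extensions, and Item Pipelines; suited to large-scale structured data collection
- [[Playwright]]: Microsoft's open-source browser automation tool; supports Chromium/Firefox/WebKit and can simulate real user interaction
- [[scrapy-playwright]]: plugin that integrates Scrapy with Playwright so Scrapy spiders can render JS-driven pages directly
- [[n8n Workflow自动化]] (n8n workflow automation): visual workflow engine; a Cron trigger drives the full flow of crawler execution, file reading, AI processing, and data storage
- [[Ollama]]: local LLM inference service supporting Llama3, Mistral, and other models, called through a REST API
- [[电商数据采集]] (e-commerce data collection): collecting structured information such as product title, price, rating, and images from e-commerce platforms
- [[AI数据处理]] (AI data processing): using an LLM to summarize, classify, extract features from, and detect anomalies in the collected data
- [[防封策略]] (anti-blocking strategies): User-Agent rotation, proxy pools, access delays, distributed scheduling, and other countermeasures against anti-bot defenses
- [[Docker容器化爬虫]] (Dockerized crawler): packaging Scrapy + Playwright as a Docker image for consistent deployment environments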
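
Two of the anti-blocking measures listed above, User-Agent rotation and randomized delays, can be sketched as a Scrapy downloader middleware plus a few settings. This is an illustrative sketch only: the User-Agent strings, the class name, and the `myproject.middlewares` module path are hypothetical, and proxy pools and Scrapyd are not covered.

```python
import random

# Hypothetical pool; a real deployment would keep a larger, up-to-date list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents=USER_AGENTS, rng=None):
        self.user_agents = list(user_agents)
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        return None  # None tells Scrapy to keep processing the request


# settings.py additions (standard Scrapy setting names)
DOWNLOAD_DELAY = 2               # base delay in seconds between requests to one site
RANDOMIZE_DOWNLOAD_DELAY = True  # wait between 0.5x and 1.5x DOWNLOAD_DELAY
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path; 400 runs before the built-in
    # UserAgentMiddleware (500), which only sets a default header.
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
}
```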
## Key Entities
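- [[Scrapy]]: Python crawling framework
- [[Playwright]]: Microsoft browser automation tool
- [[n8n]]: open-source workflow automation platform
- [[Ollama]]: local LLM inference engine
- [[BrightData]]: commercial proxy pool service
- [[ScraperAPI]]: scraping API service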
## Connections
- [[Scrapy]] ← dynamic rendering ← [[Playwright]] (via scrapy-playwright)
- [[n8n Workflow自动化]] ← Cron Trigger ← [[Scrapy]] (runs the crawl command)
- [[n8n Workflow自动化]] ← AI processing ← [[Ollama]] (local model calls)
- [[n8n Workflow自动化]] ← data writes ← PostgreSQL/SQLite
- [[Scrapy]] ← output format ← JSON/CSV (data/ directory)
- [[电商数据采集]] ← tools ← [[Scrapy]] + [[Playwright]]
- [[AI数据处理]] ← tools ← [[n8n Workflow自动化]] + [[Ollama]]
## Contradictions
- No obvious conflicts
## Core Architecture Code
### docker-compose.yml
```yaml
services:
  scraper:
    build: ./scrapy
    volumes:
      - ./data:/app/data
    depends_on:
      - playwright
    environment:
      - PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
  playwright:
    image: mcr.microsoft.com/playwright/python:v1.48.0-jammy
    shm_size: 2gb
```
### Scrapy settings.py (key configuration)
```python
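# Let scrapy-playwright handle HTTP and HTTPS downloads
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright requires the asyncio-based Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
# Options passed to Playwright when it launches the browser
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,
    "args": ["--no-sandbox", "--disable-setuid-sandbox"],
}
# Export scraped items as JSON, overwriting the file on each run
FEEDS = {"/app/data/amazon.json": {"format": "json", "overwrite": True}}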
```
### n8n Workflow node chain
1. Cron Trigger (daily at 02:00)
2. Execute Command: docker exec scraper scrapy crawl amazon
3. Read Binary File: reads /data/products.json
4. Function Node: parses the JSON
5. OpenAI / HTTP Request (local Ollama call)
6. Write Binary File: writes products_summary.json
7. Email / Telegram: sends the daily report
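
For testing the logic outside n8n, steps 3 to 6 of the node chain can be approximated in plain Python. This is a sketch under assumptions, not the workflow itself: n8n's Function node is normally written in JavaScript, the prompt wording and function names here are hypothetical, and the AI call is injected as a callable so that either OpenAI or Ollama could sit behind it.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def build_prompt(product: Dict) -> str:
    """Turn one scraped record into a summarization prompt (wording is illustrative)."""
    return (
        "Summarize this product in one sentence.\n"
        f"Title: {product.get('title', '')}\n"
        f"Price: {product.get('price', 'unknown')}\n"
        f"Rating: {product.get('rating', 'unknown')}"
    )


def summarize_products(
    in_path: Path,
    out_path: Path,
    summarize: Callable[[str], str],
) -> List[Dict]:
    """Read the crawler output, add a summary per product, write the result."""
    # Steps 3-4: read the file and parse it (Scrapy's "json" feed is a JSON array)
    products = json.loads(in_path.read_text(encoding="utf-8"))
    results = []
    for product in products:
        # Step 5: delegate to whichever model backend the caller supplies
        summary = summarize(build_prompt(product))
        results.append({**product, "summary": summary})
    # Step 6: write the enriched records
    out_path.write_text(
        json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return results
```

Passing the model call in as a parameter keeps the file handling testable without a running model server.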