nexus/raw/Technical/Home Office/可自动化、可扩展、AI增强的电商数据采集与处理系统.md
2026-04-14 12:19:28 +08:00

---
title: Install Playwright Browsers
source:
author: shenwei
published:
created: 2025-11-11
description: An automatable, scalable, AI-enhanced e-commerce data collection and processing system
tags: []
link:
---
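Great idea. What you want is an **automatable, scalable, AI-enhanced data collection and processing system** built on Docker + Ubuntu + n8n. Below is a detailed design and a set of tool recommendations covering the whole architecture, from crawling to analysis.
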
---
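## 🧩 1. Overall System Architecture
Your target system can be divided into three layers:
|Layer|Components|Description|
|---|---|---|
|**Data collection layer (crawlers)**|Scrapy / Playwright / Selenium / Apify|Collects structured information (titles, descriptions, images, videos, etc.) from major e-commerce sites|
|**Data processing layer (automation pipeline)**|n8n + LLM API (e.g., OpenAI, Ollama, LM Studio)|Applies AI processing to the collected data: cleaning, classification, summarization, translation, attribute extraction|
|**Storage and presentation layer**|PostgreSQL / SQLite + MinIO / NAS + Grafana / Metabase|Stores text plus image and video metadata, and visualizes the results|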
---
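## 🕷️ 2. Crawler Tools: Recommendations and Comparison
|Tool|Use case|Pros|Cons|
|---|---|---|---|
|**Scrapy**|Static pages, e-commerce product information|Lightweight and efficient, rich plugin ecosystem, easy to deploy with Docker|Weak support for JS-rendered pages (needs Splash or Playwright)|
|**Playwright (Python/Node.js)**|Dynamically rendered pages, scroll-triggered loading, lazy-loaded videos and images|Drives a real browser, supports headless mode, highly reliable|Relatively heavy; best suited to in-depth crawling of a single site|
|**Apify (Open Source SDK)**|General web crawling + API access + scheduling|Built-in anti-blocking strategies, Docker support|Somewhat steep learning curve; a fully local deployment needs customization|
|**Colly (Go)**|High-performance crawling services, lightweight API scraping|Fast; compiles to a single binary for small Docker images|Weak JS support (not suited to complex e-commerce pages)|
|**Crawlee (Node.js)**|Apify's open-source core framework; supports Playwright and Puppeteer|Easy to integrate with n8n and LangChain|Requires JS/TS development experience|
**Recommended combination:**
> ✅ **Scrapy + Playwright, or Crawlee + Playwright**
- Scrapy handles structured extraction, pagination scheduling, and media downloads;
- Playwright loads dynamic pages;
- both can be containerized with Docker Compose;
- the output is JSON or CSV files for n8n to consume.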
---
## ⚙️ 3. Docker Architecture Example
The data crawled by Scrapy (JSON/CSV) is stored in `/data`, where n8n reads and processes it on a schedule.

---
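## 🤖 4. AI Integration and Automation in n8n
In n8n, a workflow can automate the entire pipeline:
**Workflow example:**
1. **Cron Trigger** → starts the run on a schedule
2. **Execute Command Node** → runs `docker exec scraper scrapy crawl amazon`
3. **Read File Node** → reads the crawled JSON file
4. **OpenAI / Ollama Node** → calls an LLM to extract attributes from product descriptions (brand, model, price range, keywords, etc.)
5. **Database Node** → writes to PostgreSQL / SQLite
6. **Webhook / Email Node** → generates a report or sends a notification

You can also use n8n's "HTTP Request Node" to fetch API pages (suitable for public e-commerce APIs such as Alibaba and Shopee).
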
---
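## 🧠 5. AI Processing Suggestions
Use an LLM for tasks such as:
- **Content summarization and classification** (distilling product features)
- **Multilingual translation**
- **Feature extraction** (brand, price, category)
- **Anomaly detection** (automatically flagging abnormal prices or products with missing images)
- **Structured JSON output**, which is easy to store in a database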
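
As a rough illustration of the anomaly-detection item, the sketch below flags products that have no images or whose price lies far from the median. The field names `price` and `image_urls` follow the `ProductItem` fields defined later in this document; the function name `find_anomalies` and the threshold `max_ratio` are hypothetical choices for this example, and a simple rule like this is only a starting point next to an LLM-based check.

```python
from statistics import median


def find_anomalies(products, max_ratio=5.0):
    """Return (index, reason) pairs for suspicious products.

    A product is flagged when it has no image URLs, no positive numeric price,
    or a price more than `max_ratio` times above or below the median price.
    """
    prices = [
        p["price"] for p in products
        if isinstance(p.get("price"), (int, float)) and p["price"] > 0
    ]
    mid = median(prices) if prices else None
    flagged = []
    for i, p in enumerate(products):
        if not p.get("image_urls"):
            flagged.append((i, "missing images"))
        price = p.get("price")
        if not isinstance(price, (int, float)) or price <= 0:
            flagged.append((i, "missing price"))
        elif mid and (price > mid * max_ratio or price < mid / max_ratio):
            flagged.append((i, "price outlier"))
    return flagged


# Made-up sample data for illustration only
sample = [
    {"title": "A", "price": 20.0, "image_urls": ["a.jpg"]},
    {"title": "B", "price": 25.0, "image_urls": []},
    {"title": "C", "price": 900.0, "image_urls": ["c.jpg"]},
    {"title": "D", "price": 22.0, "image_urls": ["d.jpg"]},
]
print(find_anomalies(sample))  # [(1, 'missing images'), (2, 'price outlier')]
```
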
You can run **Ollama** models (Mistral, Llama3) locally and call them from n8n's HTTP Request node at the local endpoint `http://localhost:11434/api/generate`.

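
To make the request shape concrete, here is a minimal Python sketch of the same call outside n8n. It assumes an Ollama server is listening on the endpoint above and that a model named `llama3` has been pulled. The helper names (`build_payload`, `parse_attributes`, `extract_attributes`) and the prompt wording are hypothetical; `stream` and `format` are fields of Ollama's generate API, and the generated text comes back in the `response` field.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(description, model="llama3"):
    """Build a non-streaming generate request that asks for JSON output."""
    prompt = (
        "Extract brand, model, and keywords from this product description. "
        'Answer with a JSON object with keys "brand", "model", "keywords".\n\n'
        + description
    )
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_attributes(api_body):
    """Decode the model's answer; return {} if it is missing or not a JSON object."""
    try:
        attrs = json.loads(api_body["response"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return {}
    return attrs if isinstance(attrs, dict) else {}


def extract_attributes(description, url=OLLAMA_URL):
    """Send one description to a running Ollama server (requires the server to be up)."""
    data = json.dumps(build_payload(description)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return parse_attributes(json.load(resp))
```
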
---
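## 🗄️ 6. Data Storage and Management
|Data type|Recommended storage|Notes|
|---|---|---|
|Text / structured data|PostgreSQL / SQLite|Well suited to AI analysis and visualization|
|Images / videos|MinIO / local NAS|S3-style object storage|
|Dashboard|Grafana / Metabase|Generates e-commerce trend and analysis reports|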
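
For the structured-data row, a small sketch using Python's built-in `sqlite3` module shows one way to store crawled products. The table layout, the choice of `product_url` as the primary key, and the function name `save_products` are assumptions made for this example; the same idea would carry over to PostgreSQL with a different driver.

```python
import json
import sqlite3


def save_products(conn, products):
    """Insert or update products keyed by URL; image URLs are stored as JSON text."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS products (
               product_url TEXT PRIMARY KEY,
               title TEXT,
               price REAL,
               rating REAL,
               image_urls TEXT
           )"""
    )
    rows = [
        (
            p["product_url"],
            p.get("title"),
            p.get("price"),
            p.get("rating"),
            json.dumps(p.get("image_urls", [])),
        )
        for p in products
    ]
    # Re-crawling the same URL replaces the old row instead of duplicating it
    conn.executemany("INSERT OR REPLACE INTO products VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # pass a file path such as "products.db" to persist
save_products(conn, [
    {"product_url": "https://example.com/p/1", "title": "Demo", "price": 19.9,
     "rating": 4.5, "image_urls": ["https://example.com/1.jpg"]},
])
print(conn.execute("SELECT title, price FROM products").fetchall())
```
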
---
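## 🛡️ 7. Anti-Blocking and Safety
- Use **User-Agent rotation** (for example via the third-party `scrapy-user-agents` package; it is not built into Scrapy)
- Use a **proxy pool** (BrightData / ScraperAPI / free proxies)
- Set a **download delay plus randomized access** for each site
- For **distributed scheduling**, use `Scrapyd` or `Archetype` (Scrapy cluster)
- Do not collect personal data or infringing content (public page information only)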
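
The User-Agent and delay items above could look roughly like this in a Scrapy project. This is a sketch, not a hardened setup: the `USER_AGENTS` pool and the class name `RandomUserAgentMiddleware` are hypothetical, and the module path in `DOWNLOADER_MIDDLEWARES` assumes the class lives in `ecommerce/middlewares.py`. The settings themselves (`DOWNLOAD_DELAY`, `RANDOMIZE_DOWNLOAD_DELAY`, `AUTOTHROTTLE_ENABLED`, `CONCURRENT_REQUESTS_PER_DOMAIN`) are standard Scrapy settings.

```python
import random

# Hypothetical pool; in practice keep it filled with current, realistic browser strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # let Scrapy continue processing the request


# settings.py fragment
DOWNLOAD_DELAY = 2                  # base delay in seconds between requests to a site
RANDOMIZE_DOWNLOAD_DELAY = True     # wait between 0.5x and 1.5x DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True         # adapt the delay to the server's response times
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOADER_MIDDLEWARES = {
    # Priority below 500 so it runs before Scrapy's default UserAgentMiddleware
    "ecommerce.middlewares.RandomUserAgentMiddleware": 400,
}
```
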
---
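## 🚀 8. Future Extensions
- Add a **FastAPI service layer** that exposes a REST API to front ends or BI tools;
- Use **LangChain** with a vector DB (e.g. Qdrant / Milvus) to store semantic information about products;
- Have n8n trigger automatic market trend reports (AI Summarizer);
- Visualize e-commerce category trends (Grafana + Postgres).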
---
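## ✅ Recommended Starting Path
1. **Install Docker + Docker Compose on Ubuntu**
2. Start the base environment: `scrapy + playwright + n8n`
3. Pick 1-2 e-commerce sites (e.g. Amazon / JD / Taobao)
4. Build a Scrapy spider template
5. Process the data with n8n and test the AI workflow
6. Gradually extend to a fully automated pipeline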
---
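Great ✅ I will generate a **ready-to-deploy skeleton for an automated crawler system**, designed around the following goal:
> **Goal:** Use Docker Compose on Ubuntu to deploy an automated e-commerce crawler system that collects product information from e-commerce sites → saves it to local files → lets n8n read it automatically and process it with AI.

The complete structure and code examples follow.
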
---
## 🧩 1. Project Directory Structure
```bash
ecommerce-crawler/
├── docker-compose.yml
├── scrapy/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── scrapy.cfg
│   └── ecommerce/
│       ├── __init__.py
│       ├── items.py
│       ├── middlewares.py
│       ├── pipelines.py
│       ├── settings.py
│       └── spiders/
│           ├── __init__.py
│           └── amazon_spider.py
├── n8n_data/             # n8n persistent data
├── data/                 # output directory for crawl results
└── README.md
```
---
## 🧱 2. `docker-compose.yml`
```yaml
version: '3.8'
services:
  # Scrapy crawler container
  scraper:
    build: ./scrapy
    container_name: ecommerce-scraper
    working_dir: /app   # directory that holds scrapy.cfg inside the image
    volumes:
      - ./data:/app/data
    depends_on:
      - playwright
    environment:
      - PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
    networks:
      - crawler-net
  # Playwright browser support container
  playwright:
    image: mcr.microsoft.com/playwright/python:v1.48.0-jammy
    shm_size: 2gb
    networks:
      - crawler-net
  # n8n automation platform (commented out; uncomment to enable)
  #n8n:
  #  image: n8nio/n8n:latest
  #  container_name: n8n
  #  ports:
  #    - 5678:5678
  #  environment:
  #    - N8N_BASIC_AUTH_ACTIVE=true
  #    - N8N_BASIC_AUTH_USER=admin
  #    - N8N_BASIC_AUTH_PASSWORD=changeme
  #    - N8N_PATH=/workflows
  #  volumes:
  #    - ./n8n_data:/home/node/.n8n
  #    - ./data:/data
  #  networks:
  #    - crawler-net
networks:
  crawler-net:
```
---
## 🐍 3. The Scrapy Part
### `scrapy/Dockerfile`
```dockerfile
FROM mcr.microsoft.com/playwright/python:v1.48.0-jammy
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Install the Playwright browsers
RUN playwright install
CMD ["scrapy", "crawl", "amazon"]
```
---
### `scrapy/requirements.txt`
```txt
scrapy==2.13.3
playwright==1.48.0
scrapy-playwright==0.0.44
```
> Note: the `scrapy-playwright` plugin lets Scrapy call Playwright directly to render dynamic pages, which suits e-commerce sites well.
---
### `scrapy/scrapy.cfg`
```ini
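[settings]
# Module path of settings.py, relative to the directory that holds scrapy.cfg
default = ecommerce.settings
[deploy]
# If you later deploy with scrapyd, define the target here (optional)
# url = http://localhost:6800/
# project = crawler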
```
---
### `scrapy/ecommerce/settings.py`
```python
BOT_NAME = "ecommerce"
SPIDER_MODULES = ["ecommerce.spiders"]   # package that contains the spiders
NEWSPIDER_MODULE = "ecommerce.spiders"   # where `scrapy genspider` puts new spiders
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 2
# Let scrapy-playwright handle downloads; Playwright is only used for
# requests that set meta["playwright"] = True
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,
    "args": ["--no-sandbox", "--disable-setuid-sandbox"],
}
PLAYWRIGHT_BROWSER_TYPE = "chromium"
# Export items to the file that the run and n8n steps read (data/products.json)
FEEDS = {
    "/app/data/products.json": {"format": "json", "overwrite": True},
}
```
---
### `scrapy/ecommerce/items.py`
```python
import scrapy


class ProductItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    product_url = scrapy.Field()
```
---
### `scrapy/ecommerce/pipelines.py`
```python
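import json


class JsonWriterPipeline:
    """Write items to data/products.json as one valid JSON array.

    Only active when registered in ITEM_PIPELINES; do not combine it with a
    FEEDS export that targets the same file.
    """

    def open_spider(self, spider):
        self.file = open("data/products.json", "w", encoding="utf-8")
        self.file.write("[\n")
        self.first = True

    def close_spider(self, spider):
        self.file.write("\n]\n")
        self.file.close()

    def process_item(self, item, spider):
        # Write the comma before every item except the first, so the array
        # has no trailing comma (which would make the JSON invalid)
        if not self.first:
            self.file.write(",\n")
        self.first = False
        self.file.write(json.dumps(dict(item), ensure_ascii=False))
        return item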
```
---
### `scrapy/ecommerce/spiders/amazon_spider.py`
> ⚠️ For demonstration only: the spider crawls a generic public page (here the quotes.toscrape.com practice site) and involves no login or infringing content.
```python
import scrapy
from scrapy_playwright.page import PageMethod


class QuotesSpider(scrapy.Spider):
    name = "amazon"  # the name used by `scrapy crawl amazon`
    start_urls = ["https://quotes.toscrape.com/js/"]
    custom_settings = {
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True},
        # Keyword arguments for the default browser context
        "PLAYWRIGHT_CONTEXTS": {
            "default": {"viewport": {"width": 1280, "height": 720}},
        },
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    # Wait until the JS-rendered quotes are in the DOM
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", "div.quote")
                    ],
                },
            )

    async def parse(self, response):
        quotes = response.css("div.quote")
        for quote in quotes:
            text = quote.css("span.text::text").get()
            author = quote.css("small.author::text").get()
            yield {"text": text, "author": author}
        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            next_url = response.urljoin(next_page)
            yield scrapy.Request(
                next_url,
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", "div.quote")
                    ],
                },
                callback=self.parse,
            )
```
### `scrapy/ecommerce/spiders/__init__.py`
```python
#empty
```
---
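## ⚙️ 4. How to Run
```bash
# 1. Build and start all containers
docker compose up --build -d
# 2. Run the spider inside the Scrapy container
#    (docker exec needs a running container; if it has already exited,
#    use: docker compose run --rm scraper scrapy crawl amazon)
docker exec -it ecommerce-scraper scrapy crawl amazon
# 3. Inspect the data
cat data/products.json
```
### Docker Commands
```bash
# Stop and remove the containers and volumes
docker compose down -v
# Rebuild and start in the background
docker compose up --build -d
# Open a shell in the running scraper container
docker exec -it ecommerce-scraper bash
# Start a throwaway scraper container with an interactive shell
docker compose run --rm --entrypoint bash scraper
```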
---
## 🤖 5. n8n Automation Flow (Example)
In the n8n Web UI (http://localhost:5678) you can import the following logic:
**Example workflow logic:**
1. **Cron Node**: triggers every day in the early morning
2. **Execute Command Node**: runs the command
   `docker exec ecommerce-scraper scrapy crawl amazon`
3. **Read File Node**: reads `/data/products.json`
4. **OpenAI Node** (or an HTTP Request to local Ollama)
   - Example prompt:
     ```text
     From the following JSON, extract a short summary (at most 30 characters) of each product and classify it.
     ```
5. **Write to Database Node** (or Google Sheet / CSV)
   Saves the distilled results
6. **Telegram / Email Node**
   Sends the daily report notification
---
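## 💡 6. Suggested Enhancements
|Feature|Tool|Description|
|---|---|---|
|Image / video download|Scrapy's built-in `ImagesPipeline` or `requests`|Downloads files and stores them locally or in MinIO|
|Parallel multi-site crawling|Multiple Scrapy spiders + parallel scheduling in n8n|Cycles through different sites automatically|
|Data visualization|Metabase / Grafana|Connects to PostgreSQL to show product trends in real time|
|Vector search|Qdrant / Weaviate|For AI semantic retrieval|
|Local AI processing|Ollama + Llama3|Replaces external APIs|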
---
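Would you like me to also generate a **complete, importable n8n workflow JSON template** (including Cron, running Scrapy, reading JSON, calling AI, and saving to a database)?
That way you only need to import it into the n8n UI to automate the whole flow right away.

Great 👍
Below is a **complete, importable n8n Workflow JSON template** that automatically:
1. Runs the Scrapy spider on a schedule (via a Docker command)
2. Reads the crawled data (`products.json`)
3. Calls an AI model (OpenAI or a local Ollama) for summarization and classification
4. Saves the results to a database or file
5. Sends a report notification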
---
## 🧩 1. Workflow Overview
|Step|Node type|Function|
|---|---|---|
|1|**Cron Trigger**|Triggers automatically every day at 2:00 AM|
|2|**Execute Command**|Runs `docker exec ecommerce-scraper scrapy crawl amazon`|
|3|**Read Binary File**|Reads `/data/products.json`|
|4|**OpenAI (or HTTP Request)**|Produces summaries and categories (can be switched to Ollama)|
|5|**Write Binary File**|Writes `/data/products_summary.json`|
|6|**Email (or Telegram)**|Sends the daily report notification|
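
For readers who want to test steps 3 to 5 outside n8n, the same read, summarize, write sequence can be sketched in plain Python. The function names `build_prompt` and `summarize_file` are hypothetical, and `ask_llm` stands for any callable that takes a prompt string and returns the model's text (an OpenAI or Ollama client wrapper, or a stub when testing).

```python
import json
from pathlib import Path


def build_prompt(product):
    """Format one product roughly the way the workflow's prompt does."""
    return (
        "Summarize this product in at most 30 characters and assign a category.\n"
        f"Title: {product.get('title')}, price: {product.get('price')}, "
        f"rating: {product.get('rating')}"
    )


def summarize_file(src, dst, ask_llm):
    """Read products from `src`, call `ask_llm(prompt)` per item, write results to `dst`."""
    products = json.loads(Path(src).read_text(encoding="utf-8"))
    results = [
        {"title": p.get("title"), "llm_output": ask_llm(build_prompt(p))}
        for p in products
    ]
    Path(dst).write_text(
        json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(results)
```

A call such as `summarize_file("data/products.json", "data/products_summary.json", ask_llm)` would then mirror the workflow's file paths on the host.
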
---
## 📦 2. Workflow JSON Template (Ready to Import)
Save the following JSON as
👉 `workflow_ecommerce_automation.json`
then import it in the n8n Web UI → **Import from file**.
```json
{
  "name": "Ecommerce Crawler + AI Summary",
  "nodes": [
    {
      "parameters": {
        "triggerTimes": {
          "item": [
            {
              "mode": "everyDay",
              "hour": 2
            }
          ]
        }
      },
      "id": "1",
      "name": "Cron Trigger",
      "type": "n8n-nodes-base.cron",
      "typeVersion": 1,
      "position": [250, 250]
    },
    {
      "parameters": {
        "command": "docker exec ecommerce-scraper scrapy crawl amazon"
      },
      "id": "2",
      "name": "Run Scrapy Crawler",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [500, 250]
    },
    {
      "parameters": {
        "path": "/data/products.json",
        "options": {}
      },
      "id": "3",
      "name": "Read Products JSON",
      "type": "n8n-nodes-base.readBinaryFile",
      "typeVersion": 1,
      "position": [750, 250]
    },
    {
      "parameters": {
        "functionCode": "const data = JSON.parse(Buffer.from(items[0].binary.data.data, 'base64').toString());\nreturn data.map(p => ({ json: p }));"
      },
      "id": "4",
      "name": "Parse JSON",
      "type": "n8n-nodes-base.function",
      "typeVersion": 1,
      "position": [1000, 250]
    },
    {
      "parameters": {
        "model": "gpt-4-turbo",
        "prompt": "You are an e-commerce product analysis assistant. From the product information below, extract a short summary (at most 30 characters) for each product and assign it to the appropriate product category.\n\nInput data: {{$json[\"title\"]}}, price: {{$json[\"price\"]}}, rating: {{$json[\"rating\"]}}.\n\nOutput format: {\"title\":\"...\",\"summary\":\"...\",\"category\":\"...\"}"
      },
      "id": "5",
      "name": "AI Summarize & Categorize",
      "type": "n8n-nodes-base.openAi",
      "typeVersion": 2,
      "position": [1250, 250],
      "credentials": {
        "openAIApi": {
          "id": "YOUR-OPENAI-CREDENTIAL-ID",
          "name": "OpenAI API"
        }
      }
    },
    {
      "parameters": {
        "fileName": "/data/products_summary.json",
        "options": {},
        "dataPropertyName": "json"
      },
      "id": "6",
      "name": "Write Summary JSON",
      "type": "n8n-nodes-base.writeBinaryFile",
      "typeVersion": 1,
      "position": [1500, 250]
    },
    {
      "parameters": {
        "fromEmail": "noreply@yourdomain.com",
        "toEmail": "your@email.com",
        "subject": "Daily Product Summary Report",
        "text": "Today's e-commerce product summary has been generated. See /data/products_summary.json."
      },
      "id": "7",
      "name": "Send Email Notification",
      "type": "n8n-nodes-base.emailSend",
      "typeVersion": 1,
      "position": [1750, 250]
    }
  ],
  "connections": {
    "Cron Trigger": { "main": [[{ "node": "Run Scrapy Crawler", "type": "main", "index": 0 }]] },
    "Run Scrapy Crawler": { "main": [[{ "node": "Read Products JSON", "type": "main", "index": 0 }]] },
    "Read Products JSON": { "main": [[{ "node": "Parse JSON", "type": "main", "index": 0 }]] },
    "Parse JSON": { "main": [[{ "node": "AI Summarize & Categorize", "type": "main", "index": 0 }]] },
    "AI Summarize & Categorize": { "main": [[{ "node": "Write Summary JSON", "type": "main", "index": 0 }]] },
    "Write Summary JSON": { "main": [[{ "node": "Send Email Notification", "type": "main", "index": 0 }]] }
  },
  "active": true
}
```
---
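## 🧠 3. Usage Instructions
1. **Import the workflow**
   - Open `http://<your-server-IP>:5678`
   - Click "Import from File" in the top-right corner → select the JSON file above
2. **Configure the OpenAI credential**
   - n8n → Credentials → create a new OpenAI API key
   - Replace `YOUR-OPENAI-CREDENTIAL-ID` with the corresponding ID
   - If you use a local Ollama instead:
     - Replace the `AI Summarize & Categorize` node with an "HTTP Request" node
     - Set the URL to `http://ollama:11434/api/generate` (this assumes an Ollama container reachable under the hostname `ollama`)
     - Example POST body (`"stream": false` makes Ollama return a single JSON object instead of a stream):
       ```json
       {
         "model": "llama3",
         "prompt": "Analyze the following product information: {{$json[\"title\"]}} ...",
         "stream": false
       }
       ```
3. **Set up the email credential**
   - n8n → Credentials → Email Send (SMTP)
   - Works with Gmail, Outlook, AWS SES, etc.
4. **Test run**
   - Click ▶️ to run the workflow manually
   - Check `/data/products_summary.json`
   - The daily report notification arrives in your inbox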
---
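## 🚀 4. Optional Extensions
|Goal|How to implement|
|---|---|
|Crawl multiple sites|Add more spiders (e.g. `jd_spider.py`, `ebay_spider.py`) and chain them in n8n|
|Add database writes|Add a "PostgreSQL Node" in n8n to store products and AI summaries|
|Generate chart reports|Connect the database to Grafana / Metabase|
|Vector search enhancement|Call Qdrant from n8n to insert semantic vectors|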
---
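Would you like me to generate, as the next step, a workflow version adapted to **offline AI processing with a local Ollama model**?
It would not depend on any external API and would run the analysis entirely locally.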