Auto-sync: 2026-04-17 08:37
@@ -1,60 +0,0 @@
---
title: Why Obsidian Got Me to Quit Fragmented Note-Taking
source: https://mp.weixin.qq.com/s?__biz=MzI3NzcwOTY4MQ==&mid=2247486972&idx=1&sn=e61477c9f8628c7f534fc2183d87e2d3&scene=21#wechat_redirect
author: shenwei
published:
created: 2025-03-13
description:
tags: []
---
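
Original *March 5, 2025 23:02*

### My old notes might as well have fallen into a black hole

Honestly, my old note-taking habits were terrible: I wrote down whatever came to mind and never reviewed any of it. Evernote (印象笔记) was stuffed with "flashes of inspiration" that I almost never reopened; my WeChat favorites held hundreds of "read later" items that I never got around to reading; and even in my early days with Obsidian I treated it as a fancy Markdown notepad, so I ended up with yet another pile of notes that were stored but never read.

Put simply, **I was only collecting information, not putting it to work**.

I suspect many people have the same problem: after taking so many notes, why does your mind still go blank when you actually need them?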
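
### Obsidian's core appeal: connections, not accumulation

If Obsidian changed anything for me, it was making me truly realize that **the value of a note lies not in storing it but in linking it**.

📌 **Bidirectional links (backlinks)** are a remarkable feature. At first the idea sounded far too abstract to me, but once I tried connecting **"scattered records" with "existing knowledge"**, everything changed.

• One day I wrote a note on "how to boost writing inspiration" and unexpectedly found that it was related to the "input-output model" I had noted down three months earlier.

• While organizing an article on the "Pomodoro Technique", I found that it could be combined with "immersive deep work".

These point-to-point connections gradually grew my scattered notes into a network, and I began to digest knowledge in my own way instead of merely archiving it.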
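
### How do you make notes "come alive" in Obsidian?

Theory alone is not much use, so here are a few methods I actually use that you can try:

✅ **Use a daily note to string your ideas together**

Don't let your notes turn into "dead notes". Write a few lines every day summarizing what you learned, and while you are at it, check whether any older notes can be linked.

✅ **Try "map notes" to organize core topics**

Pick a few topics you care about (such as "writing techniques") and build an index page so that you can navigate quickly to all the related notes.

✅ **Review regularly, deleting or merging notes that are no longer useful**

Using Obsidian does not mean that "written down" equals "useful". Go through old notes regularly, delete the meaningless ones, or reorganize them into more logical knowledge modules.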
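
### Have your notes been swallowed by the "information black hole" too?

If you also struggle with notes you take but never use, give the methods above a try. The ultimate purpose of note-taking is to make information work for you, not to drown you in it.

📢 **How do you manage your notes? Have you run into similar problems?** Feel free to share your note-taking approach in the comments, or any questions you still have about Obsidian! 🎯

If you are interested in Obsidian, knowledge management, and productivity tools, follow **赫点茶** 🍵! I will keep sharing **practical digital products, growth mindsets, and efficiency tips for work and life**.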
@@ -1,28 +0,0 @@
---
title: Why Are Your Notes Always a Mess? Try This Method and Say Goodbye to Information Chaos!
source: https://mp.weixin.qq.com/s?__biz=MzI3NzcwOTY4MQ==&mid=2247486984&idx=1&sn=51232deb29cb0a2ed81fac0daa972217&scene=21&poc_token=HDC7RGmjnpWk-8uvZfBeR3Ky26-5B19_c0nN7BR7
author: shenwei
published:
created: 2025-12-19
description:
tags: []
---



Original 赫点茶 [赫点茶](https://mp.weixin.qq.com/) *March 6, 2025 21:02*
695
raw/Others/可自动化、可扩展、AI增强的电商数据采集与处理系统.md
Normal file
@@ -0,0 +1,695 @@
---
title: Install Playwright browsers
source:
author: shenwei
published:
created: 2025-11-11
description: An automatable, extensible, AI-enhanced e-commerce data collection and processing system
tags: []
link:
---
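
Great idea. What you want is an **automatable, extensible, AI-enhanced data collection and processing system** built on Docker + Ubuntu + n8n. Below is a detailed design with tool recommendations, covering the whole architecture from crawling to analysis.

---

## 🧩 1. Overall system architecture

The target system can be divided into three layers:

|Layer|Components|Description|
|---|---|---|
|**Data collection layer (crawlers)**|Scrapy / Playwright / Selenium / Apify|Collects structured information (titles, descriptions, images, videos, etc.) from major e-commerce sites|
|**Data processing layer (automation pipeline)**|n8n + LLM API (e.g., OpenAI, Ollama, LM Studio)|Applies AI processing to the collected data: cleaning, classification, summarization, translation, attribute extraction, and so on|
|**Storage and presentation layer**|PostgreSQL / SQLite + MinIO / NAS + Grafana / Metabase|Stores text plus image and video metadata, and visualizes the results|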
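
---

## 🕷️ 2. Crawler tools: recommendations and comparison

|Tool|Best suited for|Strengths|Weaknesses|
|---|---|---|---|
|**Scrapy**|Static pages, e-commerce product information|Lightweight and efficient, rich plugin ecosystem, easy to deploy with Docker|Weak support for JS-rendered pages; needs Splash or Playwright alongside it|
|**Playwright (Python/Node.js)**|Dynamically rendered pages, scroll-to-load content, lazily loaded videos and images|Simulates a real browser, supports headless mode, highly reliable|Relatively heavy; best for deep crawling of a single site|
|**Apify (Open Source SDK)**|General web crawling + API access + scheduling|Built-in anti-blocking strategies, Docker support|Slightly steep learning curve; a purely local deployment needs customization|
|**Colly (Go)**|High-performance crawler services, lightweight API crawling|Fast; compiles to a binary that fits in a small Docker image|Weak JS support; not suitable for complex e-commerce pages|
|**Crawlee (Node.js)**|Apify's open-source core framework, with Playwright/Puppeteer support|Easy to integrate with n8n and LangChain|Requires JS/TS development experience|

**Recommended combination:**

> ✅ **Scrapy + Playwright (or Crawlee + Playwright)**

- Scrapy handles structured extraction, pagination scheduling, and media downloads;
- Playwright handles loading dynamic pages;
- both can be containerized with Docker Compose;
- the output is JSON or CSV files for n8n to consume.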

---

## ⚙️ 3. Docker architecture example

The data Scrapy crawls (JSON/CSV) is stored in `/data`, where n8n reads and processes it on a schedule.

---

## 🤖 4. AI integration and automation in n8n

In n8n, a workflow can automate the entire pipeline:

**Example workflow:**

1. **Cron Trigger** → starts the run on a schedule
2. **Execute Command Node** → runs `docker exec ecommerce-scraper scrapy crawl amazon`
3. **Read File Node** → reads the crawled JSON file
4. **OpenAI / Ollama Node** → calls an LLM to extract attributes from the product descriptions (brand, model, price range, keywords, etc.)
5. **Database Node** → writes to PostgreSQL / SQLite
6. **Webhook / Email Node** → generates a report or sends a notification

You can also use n8n's "HTTP Request Node" to fetch API endpoints (suitable for public e-commerce APIs such as those of Alibaba and Shopee).
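
---

## 🧠 5. AI processing suggestions

An LLM can take on the following tasks:

- **Content summarization and classification** (distilling product features)
- **Multilingual translation**
- **Feature extraction** (brand, price, category)
- **Anomaly detection** (automatically flagging abnormal prices or products with missing images)
- **Structured JSON output**, which makes database storage easier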
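
As one way to picture the anomaly-detection item above, here is a minimal rule-based sketch in Python. It is an illustration only, not part of the original design: the field names `title`, `price`, and `image_urls` are assumptions, `price` is assumed to be already parsed into a number, and the `ratio` threshold is an arbitrary starting point. An LLM could replace or complement rules like these.

```python
from statistics import median


def flag_anomalies(products, ratio=5.0):
    """Return (title, reasons) pairs for products that look suspicious."""
    # Only positive numeric prices contribute to the "typical" price.
    valid_prices = [
        p["price"] for p in products
        if isinstance(p.get("price"), (int, float)) and p["price"] > 0
    ]
    typical = median(valid_prices) if valid_prices else None

    flagged = []
    for p in products:
        reasons = []
        if not p.get("image_urls"):
            reasons.append("missing_image")
        price = p.get("price")
        if not isinstance(price, (int, float)) or price <= 0:
            reasons.append("invalid_price")
        elif typical and not (typical / ratio <= price <= typical * ratio):
            # Far above or far below the median price of this batch.
            reasons.append("price_outlier")
        if reasons:
            flagged.append((p.get("title"), reasons))
    return flagged


# Hypothetical sample data, for illustration only.
sample = [
    {"title": "A", "price": 10, "image_urls": ["a.jpg"]},
    {"title": "B", "price": 12, "image_urls": []},
    {"title": "C", "price": 1000, "image_urls": ["c.jpg"]},
    {"title": "D", "price": 11, "image_urls": ["d.jpg"]},
]
print(flag_anomalies(sample))
```

Comparing against the batch median rather than the mean keeps a single extreme price from shifting the baseline it is judged against.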
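
You can run **Ollama** models (Mistral, Llama3) locally and call them from n8n's HTTP Request node at `http://localhost:11434/api/generate`. If n8n itself runs in a container, `localhost` refers to that container, so use the address of the host or of the Ollama service instead.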
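
To make the call concrete, the sketch below shows roughly what such a request could look like from Python, using only the standard library. Treat it as a hedged illustration: the helper names `build_payload` and `extract_attributes` are hypothetical, the prompt wording and the attribute keys (`brand`, `category`, `price`) are assumptions, and it presumes an Ollama server is running with the `llama3` model already pulled. The `stream` and `format` fields are options of Ollama's generate endpoint.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint named in the text above


def build_payload(product, model="llama3"):
    """Build the JSON body for a non-streaming generate request."""
    prompt = (
        "Extract the brand, category and price from this product as a JSON "
        "object with the keys brand, category, price.\n\n"
        f"Title: {product.get('title', '')}\n"
        f"Description: {product.get('description', '')}"
    )
    # stream=False asks for one JSON reply; format="json" requests JSON text.
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def extract_attributes(product, url=OLLAMA_URL, timeout=60):
    """Send the request and parse the model's JSON answer (needs a live server)."""
    body = json.dumps(build_payload(product)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    # The generated text is returned in the "response" field of the reply.
    return json.loads(reply["response"])
```

In n8n, the same body would go into the HTTP Request node as a JSON POST payload; `extract_attributes` is only there to show the round trip end to end.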

---

## 🗄️ 6. Data storage and management suggestions

|Data type|Recommended storage|Notes|
|---|---|---|
|Text / structured data|PostgreSQL / SQLite|Well suited to AI analysis and visualization|
|Images / videos|MinIO / local NAS|S3-style object storage|
|Dashboard|Grafana / Metabase|Generates e-commerce trend and analysis reports|
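
For the structured-data row, a minimal SQLite sketch might look like the following. It is an assumption-laden illustration rather than a prescribed schema: the table name `products`, the choice of `product_url` as the primary key, and the column set are all hypothetical, and every input dict is assumed to carry all four keys.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS products (
    product_url TEXT PRIMARY KEY,
    title       TEXT NOT NULL,
    price       REAL,
    rating      REAL
)
"""


def save_products(conn, products):
    """Insert products, replacing rows that share the same product_url."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR REPLACE INTO products (product_url, title, price, rating) "
        "VALUES (:product_url, :title, :price, :rating)",
        products,
    )
    conn.commit()


# Usage with an in-memory database and made-up data.
conn = sqlite3.connect(":memory:")
save_products(conn, [
    {"product_url": "https://example.com/p/1", "title": "Demo", "price": 9.9, "rating": 4.5},
])
print(conn.execute("SELECT COUNT(*) FROM products").fetchone()[0])
```

Keying on the URL means re-crawling the same product updates its row instead of duplicating it; PostgreSQL would need `INSERT ... ON CONFLICT` for the same effect.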
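
---

## 🛡️ 7. Anti-blocking and safety suggestions

- Use **User-Agent rotation** (for example with the third-party `scrapy-user-agents` package; it does not ship with Scrapy)
- Use a **proxy pool** (BrightData / ScraperAPI / free proxies)
- For any single site, set a **download delay plus randomized access**
- **Distributed scheduling** can use `Scrapyd` or `Archetype` (a Scrapy cluster)
- Do not collect personal private information or infringing content (public page information only)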
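
The first and third items above can be sketched without any extra dependency. The snippet is a minimal illustration under stated assumptions: `RandomUserAgentMiddleware` is a hypothetical class that would live in the project's `middlewares.py`, the module path `ecommerce.middlewares` and the User-Agent strings are placeholders, and the settings shown are standard Scrapy settings that would go in `settings.py`.

```python
import random

# Placeholder User-Agent strings; replace with a maintained list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # None tells Scrapy to keep processing the request


# --- settings.py fragment (standard Scrapy settings) ---
DOWNLOAD_DELAY = 2                # base delay between requests to one site
RANDOMIZE_DOWNLOAD_DELAY = True   # wait 0.5x to 1.5x of DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True       # adapt the delay to server response times
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path; 400 runs before the built-in
    # UserAgentMiddleware (500), which only fills in a missing header.
    "ecommerce.middlewares.RandomUserAgentMiddleware": 400,
}
```

Whether the rotated header also reaches the browser when requests are rendered through Playwright depends on the scrapy-playwright header settings, so it is worth verifying against a header-echo page.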

---

## 🚀 8. Directions for later expansion

- Add a **FastAPI service layer** that exposes a REST API to front ends or BI tools;
- Use **LangChain** with a vector DB (e.g. Qdrant / Milvus) to store semantic information about products;
- n8n can trigger automatic generation of market trend reports (AI Summarizer);
- Visualize e-commerce category trends (Grafana + Postgres).

---

## ✅ Recommended starting path

1. **Install Docker + Docker Compose on Ubuntu**
2. Start the base environment: `scrapy + playwright + n8n`
3. Pick 1-2 e-commerce sites (such as Amazon / JD / Taobao)
4. Build a Scrapy spider template
5. Process the data with n8n and test the AI workflow
6. Expand step by step into a fully automated pipeline
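
---

Great ✅ I will generate a **ready-to-deploy skeleton for an automated crawler system**, designed around the following goal:

> **Goal:** On Ubuntu, use Docker Compose to deploy an automated e-commerce crawler system that collects product information from e-commerce sites → saves it to local files → has n8n read it automatically and process it with AI.

The complete structure and code samples follow.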

---

## 🧩 1. Project directory structure

```bash
ecommerce-crawler/
├── docker-compose.yml
├── scrapy/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── scrapy.cfg
│   └── ecommerce/
│       ├── __init__.py
│       ├── items.py
│       ├── middlewares.py
│       ├── pipelines.py
│       ├── settings.py
│       └── spiders/
│           └── amazon_spider.py
├── n8n_data/             # n8n persistent data
├── data/                 # crawl output directory
└── README.md
```

---

## 🧱 2. `docker-compose.yml`

```yaml
version: '3.8'
services:
  # Scrapy crawler container
  scraper:
    build: ./scrapy
    container_name: ecommerce-scraper
    working_dir: /app/scrapy
    volumes:
      - ./data:/app/data
    depends_on:
      - playwright
    environment:
      - PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
    networks:
      - crawler-net

  # Playwright browser support container
  playwright:
    image: mcr.microsoft.com/playwright/python:v1.48.0-jammy
    shm_size: 2gb
    networks:
      - crawler-net

  # n8n automation platform
  #n8n:
  #  image: n8nio/n8n:latest
  #  container_name: n8n
  #  ports:
  #    - 5678:5678
  #  environment:
  #    - N8N_BASIC_AUTH_ACTIVE=true
  #    - N8N_BASIC_AUTH_USER=admin
  #    - N8N_BASIC_AUTH_PASSWORD=changeme
  #    - N8N_PATH=/workflows
  #  volumes:
  #    - ./n8n_data:/home/node/.n8n
  #    - ./data:/data
  #  networks:
  #    - crawler-net

networks:
  crawler-net:
```

---

## 🐍 3. The Scrapy part

### `scrapy/Dockerfile`

```dockerfile
FROM mcr.microsoft.com/playwright/python:v1.48.0-jammy

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Install the Playwright browsers
RUN playwright install

WORKDIR /app
CMD ["scrapy", "crawl", "amazon"]
```

---

### `scrapy/requirements.txt`

```txt
scrapy==2.13.3
playwright==1.48.0
scrapy-playwright==0.0.44
```

> Note: the `scrapy-playwright` plugin lets Scrapy call Playwright directly to render dynamic pages, which makes it a good fit for e-commerce sites.

---

### `scrapy/scrapy.cfg`

```ini
[settings]
default = settings

[deploy]
# If you later deploy with scrapyd, you can define the target here (optional)
# url = http://localhost:6800/
# project = crawler
```

---

### `scrapy/ecommerce/settings.py`

```python
BOT_NAME = "scrapy"

SPIDER_MODULES = ["spiders"]   # points to the spiders package in the current directory
NEWSPIDER_MODULE = "spiders"   # new spiders are created here by default

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 2

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,
    "args": ["--no-sandbox", "--disable-setuid-sandbox"],
}
PLAYWRIGHT_BROWSER_TYPE = "chromium"

FEEDS = {
    # Same file name that the run instructions and the n8n workflow read.
    "/app/data/products.json": {"format": "json", "overwrite": True},
}
```

---

### `scrapy/ecommerce/items.py`

```python
import scrapy


class ProductItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    product_url = scrapy.Field()
```

---

### `scrapy/ecommerce/pipelines.py`

```python
import json


class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open("data/products.json", "w", encoding="utf-8")
        self.file.write("[\n")
        self.first_item = True

    def close_spider(self, spider):
        self.file.write("\n]\n")
        self.file.close()

    def process_item(self, item, spider):
        # Write the separator before every item except the first, so the
        # closing bracket is never preceded by a trailing comma (invalid JSON).
        if not self.first_item:
            self.file.write(",\n")
        self.first_item = False
        self.file.write(json.dumps(dict(item), ensure_ascii=False))
        return item
```

---

### `scrapy/ecommerce/spiders/amazon_spider.py`

> ⚠️ For demonstration only. It crawls a generic public listing page (the code below targets the practice site `quotes.toscrape.com`) and involves no login or infringing content.

```python
import scrapy
from scrapy_playwright.page import PageMethod


class QuotesSpider(scrapy.Spider):
    name = "amazon"
    start_urls = ["https://quotes.toscrape.com/js/"]

    custom_settings = {
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True},
        "PLAYWRIGHT_CONTEXT_ARGS": {"viewport": {"width": 1280, "height": 720}},
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", "div.quote")
                    ],
                },
            )

    async def parse(self, response):
        quotes = response.css("div.quote")
        for quote in quotes:
            text = quote.css("span.text::text").get()
            author = quote.css("small.author::text").get()
            yield {"text": text, "author": author}

        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            next_url = response.urljoin(next_page)
            yield scrapy.Request(
                next_url,
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", "div.quote")
                    ],
                },
                callback=self.parse,
            )
```

### `scrapy/ecommerce/spiders/__init__.py`

```python
# empty
```

---

## ⚙️ 4. How to run

```bash
# 1. Build and start all containers
docker compose up --build -d

# 2. Run the spider inside the Scrapy container
docker exec -it ecommerce-scraper scrapy crawl amazon

# 3. Inspect the data
cat data/products.json
```
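
Beyond `cat`, a quick sanity check of the output can catch empty or half-filled items before n8n consumes them. The sketch below is a hypothetical helper, not part of the generated project: it assumes the output is a JSON array of objects and that the demo spider's `text` and `author` fields are the ones that must be present.

```python
import json
from pathlib import Path


def load_items(path):
    """Read the crawl output, assumed to be a JSON array of objects."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_items(items, required=("text", "author")):
    """Count the items and list the indexes of those missing a required field."""
    incomplete = [
        index for index, item in enumerate(items)
        if any(not item.get(field) for field in required)
    ]
    return {"count": len(items), "incomplete": incomplete}
```

Calling `summarize_items(load_items("data/products.json"))` from the project root would report the item count and any incomplete entries; a parse error at the `load_items` step means the file is not valid JSON.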

### Docker commands

```bash
docker compose down -v
docker compose up --build -d

docker exec -it ecommerce-scraper bash

# Open an interactive shell in a one-off container for the scraper service
docker compose run --rm --entrypoint bash scraper
```

---

## 🤖 5. n8n automation flow (example)

In the n8n Web UI (http://localhost:5678) you can set up the following logic:

**Example workflow logic:**

1. **Cron Node**: triggers every day in the early morning
2. **Execute Command Node**: runs the command
   `docker exec ecommerce-scraper scrapy crawl amazon`
3. **Read File Node**: reads `/data/products.json`
4. **OpenAI Node** (or an HTTP Request to a local Ollama):
   - Example prompt:

     ```text
     From the following JSON, extract a short summary (no more than 30 characters) of each product and classify it.
     ```
5. **Write to Database Node** (or Google Sheet / CSV)
   saves the extracted results
6. **Telegram / Email Node**
   sends the daily report notification
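
Between steps 4 and 5 it usually pays to validate what the model returned before saving it. The following is a minimal sketch of such a check, written in Python for illustration. The key names `title`, `summary`, and `category` and the 30-character limit are assumptions based on the prompt above, and `parse_summary` is a hypothetical helper, not an n8n feature.

```python
import json

REQUIRED_KEYS = ("title", "summary", "category")  # assumed output keys


def parse_summary(reply_text, max_len=30):
    """Parse one LLM reply, check its keys, and enforce the summary length."""
    data = json.loads(reply_text)  # raises ValueError on non-JSON replies
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    # Models do not always respect length limits, so truncate defensively.
    data["summary"] = str(data["summary"])[:max_len]
    return data
```

Rejecting malformed replies here keeps a single bad generation from breaking the database write in the next step.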

---

## 💡 6. Suggestions for later enhancement

|Feature|Tool|Notes|
|---|---|---|
|Image / video download|Scrapy's built-in `ImagesPipeline` or `requests`|Download and store locally or in MinIO|
|Parallel multi-site crawling|Multiple Scrapy spiders + parallel scheduling in n8n|Polls different sites automatically|
|Data visualization|Metabase / Grafana|Connects to PostgreSQL to show product trends in real time|
|Vector search|Qdrant / Weaviate|For AI semantic retrieval|
|Local AI processing|Ollama + Llama3|Replaces external APIs|
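
---

Would you like me to also generate a **complete, importable n8n workflow JSON template** (covering Cron, running Scrapy, reading the JSON, calling the AI, and saving to a database)?
That way you only need to import it into the n8n UI to automate the whole flow right away.

Great 👍
Below is a **complete, importable n8n workflow JSON template** that automatically:

1️⃣ Runs the Scrapy spider on a schedule (through a Docker command)
2️⃣ Reads the crawled data (`products.json`)
3️⃣ Calls an AI (OpenAI or a local Ollama) to summarize and classify
4️⃣ Saves the result to a database or file
5️⃣ Sends a report notification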

---

## 🧩 1. Workflow overview

|Step|Node type|Function|
|---|---|---|
|1|**Cron Trigger**|Triggers automatically every day at 2:00 AM|
|2|**Execute Command**|Runs `docker exec ecommerce-scraper scrapy crawl amazon`|
|3|**Read Binary File**|Reads `/data/products.json`|
|4|**OpenAI (or HTTP Request)**|Produces the summary and category (can be switched to Ollama)|
|5|**Write Binary File**|Writes `/data/products_summary.json`|
|6|**Email (or Telegram)**|Sends the daily report notification|

---

## 📦 2. Workflow JSON template (ready to import)

Save the following JSON as
👉 `workflow_ecommerce_automation.json`
and then import it in the n8n Web UI → **Import from file**.

```json
{
  "name": "Ecommerce Crawler + AI Summary",
  "nodes": [
    {
      "parameters": {
        "triggerTimes": {
          "item": [
            {
              "mode": "everyDay",
              "hour": 2
            }
          ]
        }
      },
      "id": "1",
      "name": "Cron Trigger",
      "type": "n8n-nodes-base.cron",
      "typeVersion": 1,
      "position": [250, 250]
    },
    {
      "parameters": {
        "command": "docker exec ecommerce-scraper scrapy crawl amazon"
      },
      "id": "2",
      "name": "Run Scrapy Crawler",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [500, 250]
    },
    {
      "parameters": {
        "path": "/data/products.json",
        "options": {}
      },
      "id": "3",
      "name": "Read Products JSON",
      "type": "n8n-nodes-base.readBinaryFile",
      "typeVersion": 1,
      "position": [750, 250]
    },
    {
      "parameters": {
        "functionCode": "const data = JSON.parse(Buffer.from(items[0].binary.data.data, 'base64').toString());\nreturn data.map(p => ({ json: p }));"
      },
      "id": "4",
      "name": "Parse JSON",
      "type": "n8n-nodes-base.function",
      "typeVersion": 1,
      "position": [1000, 250]
    },
    {
      "parameters": {
        "model": "gpt-4-turbo",
        "prompt": "You are an e-commerce product analysis assistant. From the product information below, extract a short summary (no more than 30 characters) for each product and assign it to the appropriate product category.\n\nInput: {{$json[\"title\"]}}, price: {{$json[\"price\"]}}, rating: {{$json[\"rating\"]}}.\n\nOutput format: {\"title\":\"...\",\"summary\":\"...\",\"category\":\"...\"}"
      },
      "id": "5",
      "name": "AI Summarize & Categorize",
      "type": "n8n-nodes-base.openAi",
      "typeVersion": 2,
      "position": [1250, 250],
      "credentials": {
        "openAIApi": {
          "id": "YOUR-OPENAI-CREDENTIAL-ID",
          "name": "OpenAI API"
        }
      }
    },
    {
      "parameters": {
        "fileName": "/data/products_summary.json",
        "options": {},
        "dataPropertyName": "json"
      },
      "id": "6",
      "name": "Write Summary JSON",
      "type": "n8n-nodes-base.writeBinaryFile",
      "typeVersion": 1,
      "position": [1500, 250]
    },
    {
      "parameters": {
        "fromEmail": "noreply@yourdomain.com",
        "toEmail": "your@email.com",
        "subject": "Daily Product Summary Report",
        "text": "Today's e-commerce product summary has been generated. See the file /data/products_summary.json."
      },
      "id": "7",
      "name": "Send Email Notification",
      "type": "n8n-nodes-base.emailSend",
      "typeVersion": 1,
      "position": [1750, 250]
    }
  ],
  "connections": {
    "Cron Trigger": { "main": [[{ "node": "Run Scrapy Crawler", "type": "main", "index": 0 }]] },
    "Run Scrapy Crawler": { "main": [[{ "node": "Read Products JSON", "type": "main", "index": 0 }]] },
    "Read Products JSON": { "main": [[{ "node": "Parse JSON", "type": "main", "index": 0 }]] },
    "Parse JSON": { "main": [[{ "node": "AI Summarize & Categorize", "type": "main", "index": 0 }]] },
    "AI Summarize & Categorize": { "main": [[{ "node": "Write Summary JSON", "type": "main", "index": 0 }]] },
    "Write Summary JSON": { "main": [[{ "node": "Send Email Notification", "type": "main", "index": 0 }]] }
  },
  "active": true
}
```
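
The one-line `functionCode` in the "Parse JSON" node is dense, so here is a Python rendering of the same logic for readers who want to follow it step by step. It is an explanatory sketch only (n8n Function nodes run JavaScript): it mirrors the item shape used in the template, where the file content sits base64-encoded under `binary.data.data`, and `parse_binary_items` is a hypothetical name.

```python
import base64
import json


def parse_binary_items(items):
    """Decode the first item's base64 file content and emit one item per product."""
    encoded = items[0]["binary"]["data"]["data"]
    raw = base64.b64decode(encoded).decode("utf-8")
    products = json.loads(raw)  # the file is expected to hold a JSON array
    # n8n passes data between nodes as a list of {"json": ...} items.
    return [{"json": product} for product in products]
```

Splitting the array into one item per product is what lets the following AI node run once for each product instead of once for the whole file.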

---

## 🧠 3. Usage

1. **Import the workflow**
   - Open `http://<your-server-IP>:5678`
   - Click "Import from File" in the top-right corner → select the JSON file above
2. **Configure the OpenAI credentials**
   - n8n → Credentials → create a new OpenAI API Key
   - Replace `YOUR-OPENAI-CREDENTIAL-ID` with the corresponding ID
   - If you use a local Ollama instead:
     - Replace the `AI Summarize & Categorize` node with an "HTTP Request" node
     - Set the URL to `http://ollama:11434/api/generate`
     - Example POST body (`"stream": false` makes Ollama return a single JSON reply instead of a stream):

       ```json
       {
         "model": "llama3",
         "prompt": "Analyze the following product information: {{$json[\"title\"]}} ...",
         "stream": false
       }
       ```
3. **Set up the email credentials**
   - n8n → Credentials → Email Send (SMTP)
   - Gmail, Outlook, AWS SES, and others are supported
4. **Test run**
   - Click ▶️ to run it manually
   - Check `/data/products_summary.json`
   - The daily report notification should arrive in your mailbox

---

## 🚀 4. Optional extensions

|Goal|How to implement it|
|---|---|
|Crawl multiple sites|Add more spiders (such as `jd_spider.py`, `ebay_spider.py`) and chain them in n8n|
|Add database writes|Add a "PostgreSQL Node" in n8n to write the products and AI summaries|
|Generate chart reports|Connect the database to Grafana / Metabase|
|Vector retrieval|Call Qdrant from n8n to insert semantic vectors|
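
For the multi-site row, chaining spiders amounts to running one crawl command per spider in sequence. The sketch below shows that idea from Python rather than n8n. It is a rough illustration under assumptions: the spider names passed in are hypothetical, and the container name `ecommerce-scraper` is taken from the Compose file earlier in this document.

```python
import subprocess


def crawl_commands(spiders, container="ecommerce-scraper"):
    """Build one `docker exec ... scrapy crawl <name>` command per spider."""
    return [
        ["docker", "exec", container, "scrapy", "crawl", name]
        for name in spiders
    ]


def run_all(spiders):
    """Run the spiders one after another, stopping at the first failure."""
    for command in crawl_commands(spiders):
        subprocess.run(command, check=True)
```

In n8n the equivalent is a series of Execute Command nodes, one for each entry that `crawl_commands(["amazon", "jd"])` would produce.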

---

Would you like me to generate, as the next step, a version of the workflow adapted to a **local Ollama model (offline AI processing)**?
It would not depend on any external API and would run the analysis entirely locally and automatically.