1290 lines
49 KiB
Python
Executable File
#!/usr/bin/env python3
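"""
Wiki ↔ Raw three-way sync tool
================================================================================

Overview
--------
This script keeps raw/ (the raw document layer) and wiki/ (the knowledge base
layer) in sync. It uses tools/manifest.json to track each raw file's hash,
ingestion status, and slug mapping, so that a coding agent can tell exactly
which files need to be ingested (or re-ingested) into the wiki.

Core features
-------------
1. Scan the .md files under raw/, compare them with the manifest, and detect
   added and deleted files (updated files are no longer detected automatically)
2. Maintain the state mapping in tools/manifest.json (hash, slug, ingested, ...)
3. Mark a single file as ingested, as a callback for the ingestion workflow
4. Normalize the slugs in the manifest in bulk (reslug)
5. Rebuild wiki/index.md from the manifest (fallback)
6. Detect orphan entity/concept pages (report only, never delete)
7. Fix the Source File link in source pages, in bulk or one at a time, so that
   it matches the raw path recorded in the manifest
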
--------------------------------------------------------------------------------
CLI usage
--------------------------------------------------------------------------------

Basic operations:
  python tools/sync.py --check
      Preview the differences between raw/ and the manifest (added/deleted
      files) without writing any file. Output is Markdown, meant for humans.

  python tools/sync.py --sync
      Run a full sync: write the changes in raw/ to the manifest and report
      orphan pages. By default only added and deleted files are handled; a
      content change in an existing file does not reset its ingested flag.

  python tools/sync.py --sync -v / --verbose
      Same as above, and also lists every added/deleted file and the orphan
      pages.

  python tools/sync.py --pending
      List every manifest entry with ingested=false (human-readable).

  python tools/sync.py --pending --json
      Print the pending list as single-line JSON for scripts and agents.

  python tools/sync.py --pending --json --limit 1
      Return only the first pending file (as a "file" field rather than a
      "files" array).

  python tools/sync.py --pending --json --limit N
      Return the first N pending files (as a "files" array).

  python tools/sync.py --sync --json
      Combine --json with --sync to emit every event as a JSON Lines stream
      that is easy to parse.

  python tools/sync.py --rebuild
      Rebuild wiki/index.md from the manifest. A fallback for when the index
      is corrupted or missing.

Source File link fixes:
  python tools/sync.py --fix-source-links
      Scan every manifest entry and fix the link under `## Source File` in the
      matching source page. The target format is: - [[raw/.../your-file.md]]

  python tools/sync.py --fix-source-links --fix-source-target "raw/dir/file.md"
      Fix only the source page that belongs to the given raw entry (useful as
      a single-file check after each ingest).

  python tools/sync.py --fix-source-links --dry-run
      Preview how many pages would change without writing any file.

Marking ingestion status:
  python tools/sync.py --mark-ingested "raw/dir/file.md" --slug my-slug
      Mark the given raw file as ingested and update slug, source_path, and
      ingested_at. This is the last step of the ingestion workflow; call it
      after wiki/sources/<slug>.md has been written.

  python tools/sync.py --mark-ingested "raw/dir/file.md" --slug my-slug --mark-json
      Same as above, but print the result as single-line JSON (for scripts).

  python tools/sync.py --reset-failed
      Reset every manifest entry that carries an error marker to
      ingested=false, putting it back in the pending queue.

Slug management:
  python tools/sync.py --reslug
      Normalize slug and source_path for every manifest entry.
      Rules: keep Chinese characters, lowercase ASCII letters, turn special
      characters into `-`, collapse consecutive `-`.

  python tools/sync.py --reslug --reslug-target "raw/dir/file.md"
      Normalize the slug of the given file only.

  python tools/sync.py --reslug --dry-run
      Preview the reslug changes without writing the manifest.

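--------------------------------------------------------------------------------
manifest.json format
--------------------------------------------------------------------------------

Path: tools/manifest.json (same directory as this script)

Top-level structure:
  {
    "version": 1,                               // format version, currently always 1
    "updated_at": "2024-01-15T08:00:00+00:00",  // last update time (UTC, ISO 8601), refreshed on every write
    "files": { ... }                            // key = raw file path relative to the repository root
  }

Structure of each record in files:
  {
    "raw/dir/my-paper.md": {
      "hash": "a3f1c2d4e5b6a7b8",                 // first 16 hex characters of the sha256 digest, recorded to detect content changes
      "modified": "2024-01-15T07:00:00+00:00",    // mtime of the raw file (UTC, ISO 8601)
      "slug": "my-paper",                         // wiki page slug, used to build source_path
      "source_path": "wiki/sources/my-paper.md",  // path of the matching wiki source page
      "ingested": true,                           // false = pending; true = ingested
      "ingested_at": "2024-01-15T08:00:00+00:00", // when ingestion finished (null if not ingested)
      "error": "..."                              // optional, error message recorded when ingestion fails
    }
  }

Timestamps are produced by datetime.isoformat() on a UTC datetime, so the
offset is written as +00:00 and fractional seconds may be present.

State transitions:
  A new file is detected by --sync
      → ingested=false, ingested_at=null
  A new file is detected by --sync but wiki/sources/<slug>.md already exists
  (recovery after the manifest was deleted)
      → ingested=true, ingested_at=<mtime of the source page>
  The ingestion workflow finishes and calls --mark-ingested
      → ingested=true, ingested_at=<current UTC time>
  The default sync strategy does not act on content changes in existing files
      → an ingested file is never reset by "updated" detection (avoids repeated ingestion)
  When ingestion fails, an external process writes the error field
      → clear it with --reset-failed to put the entry back in the pending queue
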
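A minimal sketch of the first and third transitions, operating on plain dicts
shaped like the records described in this section. The helper names
(new_entry, mark_done, pending_paths) are hypothetical and exist only for
illustration; they are not functions of this script.

```python
from datetime import datetime, timezone


def new_entry(file_hash, modified, slug):
    # Record as written when --sync first sees a raw file: not yet ingested.
    return {
        "hash": file_hash,
        "modified": modified,
        "slug": slug,
        "source_path": f"wiki/sources/{slug}.md",
        "ingested": False,
        "ingested_at": None,
    }


def mark_done(entry, now=None):
    # Transition applied once ingestion has finished (hypothetical helper).
    now = now or datetime.now(timezone.utc).isoformat()
    return {**entry, "ingested": True, "ingested_at": now}


def pending_paths(manifest):
    # Raw paths that still wait for ingestion, in manifest order.
    return [path for path, entry in manifest["files"].items() if not entry.get("ingested")]


manifest = {
    "version": 1,
    "files": {
        "raw/dir/my-paper.md": new_entry("a3f1c2d4e5b6a7b8", "2024-01-15T07:00:00+00:00", "my-paper"),
    },
}
print(pending_paths(manifest))
manifest["files"]["raw/dir/my-paper.md"] = mark_done(manifest["files"]["raw/dir/my-paper.md"])
print(pending_paths(manifest))
```

Returning a new dict from mark_done instead of mutating the entry keeps the
old record available for comparison; the script itself may well update
entries in place.
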
--------------------------------------------------------------------------------
JSON output format (--json / --mark-json / --pending --json)
--------------------------------------------------------------------------------

Each line of output is a standalone JSON object (JSON Lines). Possible event
types:

  {"event": "pending",          "rel_path": "...", "slug": "...", "action": "new"}
  {"event": "recovered",        "rel_path": "...", "slug": "...", "action": "recovered"}
  {"event": "deleted_detected", "rel_path": "..."}
  {"event": "sync_complete",    "summary": {"pending": N, "recovered": N, "deleted": N, "manifest_entries": N},
                                "pending_files": [...], "deleted_files": [...]}
  {"event": "pending_list",     "count": N, "files": [...]}   // --pending --json --limit N
  {"event": "pending_list",     "count": N, "file": {...}}    // --pending --json --limit 1
  {"event": "mark_ingested",    "rel_path": "...", "slug": "...",
                                "source_path": "...", "modified": "...", "ingested_at": "..."}
  {"event": "fix_source_links_complete", "summary": {...}, "details": [...]}
  {"event": "error",            "message": "..."}

When nothing changed, sync_complete carries only a summary with pending,
deleted, and manifest_entries.

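A short sketch of how a consumer might read this stream. It assumes the lines
were already captured (for example from the script's standard output) and only
looks at the "pending" and "deleted_detected" events; parse_events and
summarize are hypothetical names, not part of this script.

```python
import json


def parse_events(lines):
    # Each non-empty line is one standalone JSON object.
    return [json.loads(line) for line in lines if line.strip()]


def summarize(events):
    # Group rel_path values by event type; every other event is ignored here.
    out = {"pending": [], "deleted_detected": []}
    for event in events:
        kind = event.get("event")
        if kind in out:
            out[kind].append(event["rel_path"])
    return out


sample = [
    json.dumps({"event": "pending", "rel_path": "raw/a.md", "slug": "a", "action": "new"}),
    json.dumps({"event": "deleted_detected", "rel_path": "raw/old.md"}),
    json.dumps({"event": "sync_complete", "summary": {"pending": 1, "deleted": 1, "manifest_entries": 1}}),
]
print(summarize(parse_events(sample)))
```
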
--------------------------------------------------------------------------------
Internal functions
--------------------------------------------------------------------------------

sha256_file(path)
    Compute the file's sha256 and return the first 16 hex characters, used for
    quick change detection.

load_manifest() / save_manifest(manifest)
    Read and write tools/manifest.json; load_manifest() returns an empty
    manifest when the file is missing or corrupted.

scan_raw()
    Recursively scan every .md file under raw/ and return
    {rel_path: {hash, modified, size, abs_path}}.

build_slug_from_path(rel_path)
    Build a basic slug from the raw file path (keeps Chinese characters, turns
    spaces and special characters into `-`).
    Note: --reslug uses the stricter rules of _compute_normalized_slug().

check_changes(manifest, raw_files)
    Compare the manifest with the files on disk. With the current default it
    reports only added and deleted files ("updated" detection is off).

run_sync(dry_run, verbose, json_mode)
    Run the full sync logic, update the manifest, and trigger the orphan
    report.

run_check()
    Read-only comparison that prints the differences as Markdown and modifies
    no file.

run_rebuild()
    Walk every manifest entry and rebuild wiki/index.md, with tolerant path
    matching and orphan detection.

find_orphan_entity_concept(manifest)
    Scan the [[wikilinks]] in wiki/sources/*.md and find entity/concept pages
    that nothing references.

mark_ingested(rel_path, slug, json_mode)
    Mark the given raw file as ingested and update slug, source_path, hash,
    and ingested_at.
    rel_path must already be in the manifest (run --sync before
    --mark-ingested).

run_reslug(target_rel_path, dry_run)
    Normalize slug/source_path in the manifest, in bulk or for one entry,
    using the rules of _compute_normalized_slug() for special characters.

run_fix_source_links(target_rel_path, dry_run, json_mode)
    Fix the raw path link under `## Source File` in source pages based on the
    manifest; supports both full and single-file mode.

_compute_normalized_slug(rel_path)
    Core rules for a normalized slug:
      a. Chinese characters are kept as they are
      b. ASCII uppercase letters are lowercased
      c. Spaces, punctuation, and special symbols are replaced with `-`
      d. Consecutive `-` are collapsed into one, and leading/trailing `-` are
         removed

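The four slug rules can be approximated without a regular expression. The
sketch below is a simplification under stated assumptions, not the
implementation used by --reslug: it treats every character that is not a
letter, digit, or underscore as a separator, whereas the real function works
from an explicit list of symbols. normalize_slug is a hypothetical name.

```python
from pathlib import Path


def normalize_slug(rel_path):
    # Rule b: lowercase (only affects cased letters; CJK text is unchanged).
    stem = Path(rel_path).stem.lower()
    # Rules a and c: str.isalnum() is True for CJK characters, so they are
    # kept; anything that is not alphanumeric or "_" becomes "-".
    mapped = "".join(c if (c.isalnum() or c == "_") else "-" for c in stem)
    # Rule d: splitting on "-" and dropping empty parts collapses runs of "-"
    # and trims them from both ends in one step.
    slug = "-".join(part for part in mapped.split("-") if part)
    return slug or "untitled"


print(normalize_slug("raw/dir/My Paper (v2).md"))
print(normalize_slug("raw/笔记/深度学习:入门.md"))
```

Falling back to "untitled" mirrors what the script does when nothing usable
is left of the file name.
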
--------------------------------------------------------------------------------
Typical workflow (for agents)
--------------------------------------------------------------------------------

1. Check whether any file is waiting for ingestion:
     python tools/sync.py --pending --json --limit 1

2. Sync the changes in raw/ to the manifest:
     python tools/sync.py --sync

3. Mark the file once ingestion has finished:
     python tools/sync.py --mark-ingested "raw/papers/my-paper.md" --slug my-paper

4. Fix slug naming:
     python tools/sync.py --reslug --dry-run    # preview
     python tools/sync.py --reslug              # apply

5. Fix Source File links in bulk:
     python tools/sync.py --fix-source-links --dry-run
     python tools/sync.py --fix-source-links

6. Single-file check after an ingest:
     python tools/sync.py --fix-source-links --fix-source-target "raw/papers/my-paper.md"

7. Rebuild the index when it is corrupted:
     python tools/sync.py --rebuild

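For an agent that drives these steps from code, a small sketch of the glue
around steps 1, 3, and 6 might look as follows. It only builds argument lists
and picks the next item from an already parsed pending_list event; running the
commands (for example with subprocess.run) is left to the caller. The helper
names are hypothetical, and the assumption that each pending item carries a
"rel_path" key is not confirmed by this docstring.

```python
def pick_next(pending_event):
    # --limit 1 yields a single "file" object; --limit N yields a "files" list.
    if "file" in pending_event:
        return pending_event["file"]
    files = pending_event.get("files") or []
    return files[0] if files else None


def commands_after_ingest(rel_path, slug):
    # Argument lists for step 3 (mark as ingested) and step 6 (single-file
    # Source File link check), in the order they should run.
    base = ["python", "tools/sync.py"]
    return [
        base + ["--mark-ingested", rel_path, "--slug", slug, "--mark-json"],
        base + ["--fix-source-links", "--fix-source-target", rel_path],
    ]


event = {"event": "pending_list", "count": 1, "file": {"rel_path": "raw/papers/my-paper.md"}}
item = pick_next(event)
if item is not None:
    for argv in commands_after_ingest(item["rel_path"], "my-paper"):
        print(" ".join(argv))
```
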
"""

import json
import hashlib
import argparse
from pathlib import Path
from datetime import datetime, timezone


REPO_ROOT = Path(__file__).parent.parent.resolve()
WIKI_DIR = REPO_ROOT / "wiki"
MANIFEST_FILE = Path(__file__).parent / "manifest.json"


# ─── Utility functions ───────────────────────────────────────────────

def green(text):
    return f"\033[92m{text}\033[0m"


def yellow(text):
    return f"\033[93m{text}\033[0m"


def red(text):
    return f"\033[91m{text}\033[0m"


def dim(text):
    return f"\033[2m{text}\033[0m"


def bold(text):
    return f"\033[1m{text}\033[0m"


def log(msg, style="normal"):
    prefixes = {
        "normal": " ",
        "info": " ℹ ",
        "success": " ✓ ",
        "warn": " ⚠ ",
        "error": " ✗ ",
        "section": "\n── ",
    }
    print(f"{prefixes.get(style, ' ')}{msg}")


def sha256_file(path: Path) -> str:
    """Return the first 16 hex characters of the file's SHA-256 digest."""
    h = hashlib.sha256()
    h.update(path.read_bytes())
    return h.hexdigest()[:16]


def iso_now():
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def load_manifest() -> dict:
    """Load the manifest; fall back to an empty one if it is missing or unreadable."""
    if MANIFEST_FILE.exists():
        try:
            return json.loads(MANIFEST_FILE.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, IOError):
            pass
    return {"version": 1, "updated_at": iso_now(), "files": {}}


def save_manifest(manifest: dict):
    """Refresh updated_at and write the manifest back to disk."""
    manifest["updated_at"] = iso_now()
    MANIFEST_FILE.write_text(json.dumps(manifest, ensure_ascii=False, indent=2), encoding="utf-8")


def scan_raw() -> dict[str, dict]:
    """Return {relative_path: {hash, modified, size, abs_path}}."""
    raw_dir = REPO_ROOT / "raw"
    result = {}
    if not raw_dir.exists():
        return result
    for p in raw_dir.rglob("*.md"):
        # Skip directories and hidden files
        if p.is_file() and not p.name.startswith("."):
            rel = str(p.relative_to(REPO_ROOT))
            stat = p.stat()
            result[rel] = {
                "hash": sha256_file(p),
                "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
                "size": stat.st_size,
                "abs_path": str(p),
            }
    return result


def build_slug_from_path(rel_path: str) -> str:
    """Build a slug from a relative path (keeps Chinese where possible, kebab-case)."""
    name = Path(rel_path).stem
    name = name.replace(" ", "-").replace("/", "-").replace("\\", "-")
    name = "".join(c if c.isalnum() or c in ("-", "_", "·") else "-" for c in name)
    name = name.strip("-")
    return name or "untitled"


def find_orphan_entity_concept(manifest: dict) -> tuple[list, list]:
    """Find entity and concept pages that no source page references."""
    import re
    wikilink_pattern = re.compile(r"\[\[([^\]]+)\]\]")

    sources_dir = WIKI_DIR / "sources"
    referenced_entities = set()
    referenced_concepts = set()

    if sources_dir.exists():
        for src in sources_dir.glob("*.md"):
            content = src.read_text(encoding="utf-8")
            for link in wikilink_pattern.findall(content):
                name = link.strip()
                if name.startswith("entities/"):
                    referenced_entities.add(Path(name).stem)
                elif name.startswith("concepts/"):
                    referenced_concepts.add(Path(name).stem)
                elif "/" not in name:
                    # A bare link may point at either kind of page
                    referenced_entities.add(name)
                    referenced_concepts.add(name)

    orphan_entities = []
    entities_dir = WIKI_DIR / "entities"
    if entities_dir.exists():
        for f in entities_dir.glob("*.md"):
            if f.stem not in referenced_entities:
                orphan_entities.append(f.name)

    orphan_concepts = []
    concepts_dir = WIKI_DIR / "concepts"
    if concepts_dir.exists():
        for f in concepts_dir.glob("*.md"):
            if f.stem not in referenced_concepts:
                orphan_concepts.append(f.name)

    return orphan_entities, orphan_concepts


# ─── Core sync logic ───────────────────────────────────────────────

def check_changes(manifest: dict, raw_files: dict) -> dict:
    """Compare the manifest with the raw files on disk and return the changes.

    Current strategy (narrowed on request):
    - Only new / deleted files are detected
    - "updated" is no longer detected from the hash, so a file that is already
      in the manifest never triggers a repeated ingest
    """
    changes = {"new": [], "updated": [], "deleted": [], "unchanged": []}
    manifest_files = manifest.get("files", {})

    for rel_path, info in raw_files.items():
        if rel_path not in manifest_files:
            changes["new"].append({"rel_path": rel_path, **info})
        else:
            # New strategy: existing files always count as unchanged and never enter "updated"
            changes["unchanged"].append(rel_path)

    for rel_path in manifest_files:
        abs_path = REPO_ROOT / rel_path
        if not abs_path.exists():
            changes["deleted"].append({
                "rel_path": rel_path,
                "slug": manifest_files[rel_path].get("slug", build_slug_from_path(rel_path)),
                "source_path": manifest_files[rel_path].get("source_path"),
            })

    return changes


def run_sync(dry_run: bool = False, verbose: bool = False, json_mode: bool = False):
    """Run the sync while keeping the output as short as possible.

    - Default (neither verbose nor json): print a one-line change summary and
      a confirmation that the manifest was updated.
    - verbose=True: also list every new/updated/deleted file (the old behavior).
    - json_mode=True: keep the machine-friendly JSON stream output.
    """
    manifest = load_manifest()
    raw_files = scan_raw()
    changes = check_changes(manifest, raw_files)
    new = changes["new"]
    updated = changes["updated"]
    deleted = changes["deleted"]
    total_changes = len(new) + len(updated) + len(deleted)

    if total_changes == 0:
        if json_mode:
            print(json.dumps({"event": "sync_complete", "summary": {"pending": 0, "deleted": 0, "manifest_entries": len(manifest.get("files", {}))}}))
        else:
            log("No changes detected — wiki is up to date.", "success")
        return

    # Non-JSON: short summary (default) or detailed list (verbose)
    if not json_mode:
        log(f"Changes detected: +{len(new)} ~{len(updated)} -{len(deleted)}", "info")
        if verbose:
            if new:
                print("\nNew Files:")
                for f in new:
                    print(f" {f['rel_path']}")
            if updated:
                print("\nUpdated Files:")
                for f in updated:
                    old = f.get("old_hash")
                    print(f" {f['rel_path']}" + (f" (was {old})" if old else ""))
            if deleted:
                print("\nDeleted Files:")
                for f in deleted:
                    print(f" {f['rel_path']}")

    if dry_run:
        log("Dry-run complete. Run with --sync to apply.", "warn")
        return

    # Apply changes (same manifest update logic as before; per-file logs are
    # suppressed unless json_mode or verbose is set)
    updated_manifest = manifest.copy()
    updated_manifest["files"] = manifest.get("files", {}).copy()
    pending_files = []
    recovered_files = []

    for f in new:
        rel_path = f["rel_path"]
        slug = build_slug_from_path(rel_path)
        source_path = f"wiki/sources/{slug}.md"
        source_file = WIKI_DIR / "sources" / f"{slug}.md"

        # Check whether wiki/sources/<slug>.md already exists
        # (recovery scenario after the manifest was deleted)
        already_ingested = source_file.exists()
        ingested_at = None
        if already_ingested:
            # Use the source file's mtime as an approximation of ingested_at
            try:
                ingested_at = datetime.fromtimestamp(source_file.stat().st_mtime, tz=timezone.utc).isoformat()
            except Exception:
                ingested_at = iso_now()

        if json_mode:
            action = "recovered" if already_ingested else "new"
            print(json.dumps({"event": "pending" if not already_ingested else "recovered", "rel_path": rel_path, "slug": slug, "action": action}))
        if not already_ingested:
            pending_files.append({"rel_path": rel_path, "abs_path": f["abs_path"], "slug": slug, "action": "new"})
        else:
            recovered_files.append({"rel_path": rel_path, "slug": slug, "source_path": source_path})
            if verbose and not json_mode:
                print(f" ↺ Recovered (source exists): {rel_path} → {source_path}")

        updated_manifest["files"][rel_path] = {
            "hash": f["hash"],
            "modified": f.get("modified"),
            "slug": slug,
            "source_path": source_path,
            "ingested": already_ingested,
            "ingested_at": ingested_at,
        }

    # check_changes() currently never reports "updated", so this loop is
    # inactive under the default strategy
    for f in updated:
        rel_path = f["rel_path"]
        old_entry = manifest["files"].get(rel_path, {})
        slug = old_entry.get("slug") or build_slug_from_path(rel_path)
        if json_mode:
            print(json.dumps({"event": "pending", "rel_path": rel_path, "slug": slug, "action": "updated"}))
        pending_files.append({"rel_path": rel_path, "abs_path": f["abs_path"], "slug": slug, "action": "updated"})
        updated_manifest["files"][rel_path] = {
            **old_entry,
            "hash": f["hash"],
            "modified": f.get("modified"),
            "ingested": False,
            "ingested_at": None,
        }

    deleted_files = []
    for f in deleted:
        rel_path = f["rel_path"]
        source_path = f.get("source_path")
        if rel_path in updated_manifest["files"]:
            del updated_manifest["files"][rel_path]
        deleted_files.append(rel_path)
        if json_mode and deleted:
            print(json.dumps({"event": "deleted_detected", "rel_path": rel_path}))

    save_manifest(updated_manifest)

    if json_mode:
        print(json.dumps({
            "event": "sync_complete",
            "summary": {
                "pending": len(pending_files),
                "recovered": len(recovered_files),
                "deleted": len(deleted_files),
                "manifest_entries": len(updated_manifest["files"]),
            },
            "pending_files": pending_files,
            "deleted_files": deleted_files,
        }))
    else:
        log(f"manifest.json updated ({len(updated_manifest['files'])} entries)", "success")
        if recovered_files:
            log(f"Recovered (source page exists): {len(recovered_files)}", "info")
        if verbose:
            log(f"Pending files for ingestion: {len(pending_files)}", "info")

    # Short orphan report (details are listed only in verbose mode)
    orphan_entities, orphan_concepts = find_orphan_entity_concept(updated_manifest)
    if not json_mode:
        if orphan_entities or orphan_concepts:
            if verbose:
                print(f"\n{bold('--- Orphan Report (kept as requested) ---')}")
                if orphan_entities:
                    print(f"Orphan Entities ({len(orphan_entities)}):")
                    for e in sorted(orphan_entities):
                        print(f" {e}")
                if orphan_concepts:
                    print(f"Orphan Concepts ({len(orphan_concepts)}):")
                    for c in sorted(orphan_concepts):
                        print(f" {c}")
            else:
                log(f"Orphan entities: {len(orphan_entities)}; Orphan concepts: {len(orphan_concepts)}", "info")
        else:
            if verbose:
                log("No orphan entity/concept detected.", "success")

    if not json_mode:
        print("\nDone.")


def run_check():
    """Preview the changes without applying them (output is plain Markdown)."""
    manifest = load_manifest()
    raw_files = scan_raw()
    changes = check_changes(manifest, raw_files)
    total = len(changes["new"]) + len(changes["updated"]) + len(changes["deleted"])

    # Markdown header and summary
    print("# Wiki Sync Check\n")
    print(f"- Raw files: {len(raw_files)}")
    print(f"- Manifest entries: {len(manifest.get('files', {}))}")
    print(f"- New: {len(changes['new'])}")
    print(f"- Updated: {len(changes['updated'])}")
    print(f"- Deleted: {len(changes['deleted'])}\n")

    if total > 0:
        if changes["new"]:
            print("## New Files")
            for f in changes["new"]:
                print(f"- {f['rel_path']}")
            print()
        if changes["updated"]:
            print("## Updated Files")
            for f in changes["updated"]:
                print(f"- {f['rel_path']} (was {f['old_hash']}, now {f['hash']})")
            print()
        if changes["deleted"]:
            print("## Deleted Files")
            for f in changes["deleted"]:
                print(f"- {f['rel_path']}")
            print()
    else:
        print("No changes — wiki is in sync.\n")


def run_rebuild():
    """Rebuild wiki/index.md from the manifest (fallback).

    Improvements:
    - Prefer the source_path recorded in the manifest (if it is set and the
      file really exists), then try wiki/sources/<slug>.md, then look under
      wiki/sources for a case-insensitive or normalized match (fewer broken
      links caused by naming differences).
    - Parse the title field of the YAML frontmatter more robustly (tolerates a
      missing closing marker), and fall back to the first Markdown heading or
      the slug when there is no title.
    - When no source file is found, keep the original slug and flag the entry
      as (source missing) in the index so that a human can investigate.
    """
    manifest = load_manifest()
    print(f"\n{bold('=== Wiki Rebuild from Manifest')}\n")
    print(f" Manifest entries: {len(manifest.get('files', {}))}")
    print(" Rebuilding index.md ...\n")

    index_lines = [
        "# Wiki Index\n",
        "\n## Overview\n",
        "- [Overview](overview.md) — living synthesis\n",
        "\n## Sources\n",
    ]

    files = manifest.get("files", {})
    sorted_files = sorted(files.items(), key=lambda x: (x[1].get("ingested_at") or "", x[1].get("modified", "")), reverse=True)

    import re

    sources_dir = WIKI_DIR / "sources"

    def normalize(s: str) -> str:
        # For loose file name matching: drop non-alphanumeric characters and lowercase
        return ''.join(ch for ch in s.lower() if ch.isalnum())

    def find_source_file(slug: str, info: dict, rel_path: str):
        # Try manifest.source_path first
        sp = info.get('source_path')
        if sp:
            p = REPO_ROOT / sp
            if p.exists():
                return p
            # If the path is relative to wiki/ (such as "sources/foo.md"), try under WIKI_DIR
            p2 = WIKI_DIR / sp
            if p2.exists():
                return p2

        # Usual location: wiki/sources/<slug>.md
        candidate = sources_dir / f"{slug}.md"
        if candidate.exists():
            return candidate

        # Try without a stray suffix (the manifest slug may wrongly carry ".md")
        if slug.endswith('.md'):
            short = slug[:-3]
            c2 = sources_dir / f"{short}.md"
            if c2.exists():
                return c2

        # Case-insensitive or normalized match
        norm_slug = normalize(slug)
        if sources_dir.exists():
            for p in sources_dir.glob('*.md'):
                if p.stem.lower() == slug.lower():
                    return p
                if normalize(p.stem) == norm_slug:
                    return p

        # Last resort: guess the source file name from the manifest rel_path (the raw file).
        # Some repositories put source files directly under wiki/sources with a different slug rule.
        try:
            # rel_path example: 'raw/dir/name.md' -> use name as candidate
            name = Path(rel_path).stem
            p3 = sources_dir / f"{name}.md"
            if p3.exists():
                return p3
        except Exception:
            pass

        return None

    for rel_path, info in sorted_files:
        slug = info.get("slug") or build_slug_from_path(rel_path)
        # Strip a stray suffix
        if slug.endswith('.md'):
            slug = slug[:-3]

        src_file = find_source_file(slug, info, rel_path)

        # Take the date prefix (YYYY-MM-DD) from the manifest's ingested_at; empty if not ingested
        date_raw = info.get("ingested_at") or ""
        date_prefix = ""
        if date_raw:
            try:
                date_prefix = f"[{date_raw[:10]}] "
            except Exception:
                date_prefix = ""

        title = None
        if src_file and src_file.exists():
            content = src_file.read_text(encoding="utf-8")
            lines = content.splitlines()

            # Handle YAML frontmatter (tolerant: ignore it if the closing '---' is missing)
            if lines and lines[0].strip() == '---':
                end_idx = None
                for i in range(1, min(len(lines), 500)):
                    if lines[i].strip() == '---':
                        end_idx = i
                        break
                if end_idx:
                    frontmatter = '\n'.join(lines[1:end_idx])
                    # Supports title: "..." and title: > (simple first-line extraction)
                    m = re.search(r'^\s*title\s*:\s*(?:["\']?(.*?)["\']?|>\s*\n\s*(.*))\s*$', frontmatter, flags=re.MULTILINE)
                    if m:
                        title = (m.group(1) or m.group(2) or '').strip()

            # Fallback: the first line that starts with '#'
            if not title and lines:
                for line in lines:
                    s = line.strip()
                    if s.startswith('#'):
                        title = s.lstrip('#').strip()
                        break

            if not title:
                title = slug

            index_lines.append(f"- {date_prefix}[{title}](sources/{src_file.name})\n")
        else:
            # No source file found: if the manifest has a source_path, show it to help debugging
            sp = info.get('source_path')
            if sp:
                index_lines.append(f"- {date_prefix}[{slug}](sources/{slug}.md) — (expected: {sp} — source missing)\n")
            else:
                index_lines.append(f"- {date_prefix}[{slug}](sources/{slug}.md) — (source missing)\n")

    # Entities index
    index_lines.append("\n## Entities\n")
    entities_dir = WIKI_DIR / "entities"
    if entities_dir.exists():
        entity_files = sorted(entities_dir.glob("*.md"), key=lambda p: p.stem.lower())
        for ef in entity_files:
            index_lines.append(f"- [{ef.stem}](entities/{ef.name})\n")

    # Concepts index
    index_lines.append("\n## Concepts\n")
    concepts_dir = WIKI_DIR / "concepts"
    if concepts_dir.exists():
        concept_files = sorted(concepts_dir.glob("*.md"), key=lambda p: p.stem.lower())
        for cf in concept_files:
            index_lines.append(f"- [{cf.stem}](concepts/{cf.name})\n")

    index_lines.append("\n## Syntheses\n")

    index_file = WIKI_DIR / "index.md"
    index_file.write_text("".join(index_lines), encoding="utf-8")
    print(f" {green('✓')} index.md rebuilt with {len(sorted_files)} sources")

    # Orphan detection takes the manifest as its argument
    orphan_entities, orphan_concepts = find_orphan_entity_concept(manifest)
    if orphan_entities:
        print(f" {dim('?')} Orphan entities: {len(orphan_entities)}")
    if orphan_concepts:
        print(f" {dim('?')} Orphan concepts: {len(orphan_concepts)}")

    print("\nDone.")


# ─── Admin interface: fix the Source File link in source pages ─────────────────────────────────────

def _fix_source_file_link_in_content(content: str, raw_rel_path: str) -> tuple[str, bool, str]:
    """Fix the `## Source File` block of a single source page.

    Target format:
        ## Source File
        - [[raw/.../file.md]]

    Returns: (new_content, changed, action)
        action ∈ {"unchanged", "updated", "inserted_line", "inserted_section"}
    """
    expected_line = f"- [[{raw_rel_path}]]"
    lines = content.splitlines()
    had_trailing_newline = content.endswith("\n")

    # 1) Find the `## Source File` heading
    heading_idx = None
    for i, line in enumerate(lines):
        if line.strip().lower() == "## source file":
            heading_idx = i
            break

    # 2) No such block: insert a complete one (preferably right after the frontmatter)
    if heading_idx is None:
        insert_at = 0
        if lines and lines[0].strip() == "---":
            for j in range(1, len(lines)):
                if lines[j].strip() == "---":
                    insert_at = j + 1
                    while insert_at < len(lines) and lines[insert_at].strip() == "":
                        insert_at += 1
                    break

        block = ["## Source File", expected_line, ""]
        new_lines = lines[:insert_at] + block + lines[insert_at:]
        new_content = "\n".join(new_lines)
        if had_trailing_newline or new_content:
            new_content += "\n"
        return new_content, True, "inserted_section"

    # 3) Find the first list item between `## Source File` and the next level-2 heading
    section_end = len(lines)
    for j in range(heading_idx + 1, len(lines)):
        if lines[j].startswith("## "):
            section_end = j
            break

    bullet_idx = None
    for j in range(heading_idx + 1, section_end):
        if lines[j].strip().startswith("- "):
            bullet_idx = j
            break

    if bullet_idx is None:
        # No list item: insert the standard link line directly
        lines.insert(heading_idx + 1, expected_line)
        new_content = "\n".join(lines)
        if had_trailing_newline or new_content:
            new_content += "\n"
        return new_content, True, "inserted_line"

    # 4) A list item exists: replace it with the raw path from the manifest
    current = lines[bullet_idx].strip()
    if current == expected_line:
        return content, False, "unchanged"

    lines[bullet_idx] = expected_line
    new_content = "\n".join(lines)
    if had_trailing_newline or new_content:
        new_content += "\n"
    return new_content, True, "updated"


def run_fix_source_links(target_rel_path: str = None, dry_run: bool = False, json_mode: bool = False):
    """Correct the Source File link in source pages based on the manifest.

    - Without target_rel_path: scan and fix every entry
    - With target_rel_path: handle a single raw entry (useful as a single-file
      check after an ingest)
    """
    manifest = load_manifest()
    files = manifest.get("files", {})

    if target_rel_path:
        if target_rel_path not in files:
            msg = f"target not found in manifest: {target_rel_path}"
            if json_mode:
                print(json.dumps({"event": "error", "message": msg}))
            else:
                print(red(f" ✗ {msg}"))
            raise SystemExit(1)
        targets = [(target_rel_path, files[target_rel_path])]
    else:
        targets = list(files.items())

    changed = 0
    unchanged = 0
    skipped_no_source_path = 0
    skipped_source_missing = 0
    details = []

    for rel_path, info in targets:
        source_path = info.get("source_path")
        if not source_path:
            skipped_no_source_path += 1
            details.append({"rel_path": rel_path, "status": "skipped_no_source_path"})
            continue

        src_file = REPO_ROOT / source_path
        if not src_file.exists():
            skipped_source_missing += 1
            details.append({"rel_path": rel_path, "source_path": source_path, "status": "skipped_source_missing"})
            continue

        original = src_file.read_text(encoding="utf-8")
        new_content, did_change, action = _fix_source_file_link_in_content(original, rel_path)

        if did_change:
            changed += 1
            if not dry_run:
                src_file.write_text(new_content, encoding="utf-8")
            details.append({"rel_path": rel_path, "source_path": source_path, "status": "changed", "action": action})
        else:
            unchanged += 1
            details.append({"rel_path": rel_path, "source_path": source_path, "status": "unchanged"})

    summary = {
        "scanned": len(targets),
        "changed": changed,
        "unchanged": unchanged,
        "skipped_no_source_path": skipped_no_source_path,
        "skipped_source_missing": skipped_source_missing,
        "dry_run": dry_run,
    }

    if json_mode:
        print(json.dumps({"event": "fix_source_links_complete", "summary": summary, "details": details}, ensure_ascii=False))
        return

    print(f"\n{bold('=== Fix Source File Links')}\n")
    print(f" Scanned : {summary['scanned']}")
    print(f" Changed : {summary['changed']}")
    print(f" Unchanged : {summary['unchanged']}")
    print(f" Skipped (no source_path): {summary['skipped_no_source_path']}")
    print(f" Skipped (source missing): {summary['skipped_source_missing']}")
    if dry_run:
        print(f" {yellow('⚠')} Dry-run only, no file written.")
    else:
        print(f" {green('✓')} Source File links corrected.")
    print()


# ─── Admin interface: reslug (bulk normalization of manifest slugs) ──────────────────────────────────────

def _compute_normalized_slug(rel_path: str) -> str:
    """Compute a normalized slug from a raw file path.

    Rules:
      a. Chinese characters are kept as they are (no pinyin conversion)
      b. ASCII uppercase letters are lowercased
      c. Spaces and special characters (quotes, slashes, question marks,
         colons, commas, periods, exclamation marks, brackets, full-width
         symbols, ...) are replaced with `-`
      d. Consecutive `-` are collapsed into a single `-`, and leading/trailing
         `-` are removed
    """
    import re
    stem = Path(rel_path).stem

    # Lowercase (only affects ASCII letters; Chinese is unchanged)
    result = stem.lower()

    # Replace special characters with `-`.
    # Kept: Chinese characters, ASCII letters and digits, underscores.
    # The dot is part of the class below, so a version such as 0.65.0 becomes 0-65-0.
    result = re.sub(
        r'[ \t\r\n'
        r'\'"'          # single and double quotes
        r'//\\\\'      # slashes (full-width, half-width, backslash)
        r'??'          # question marks
        r'::'          # colons
        r',,'          # commas
        r'。\.'         # periods (dots in version numbers are replaced too)
        r'!!'          # exclamation marks
        r'()()'       # parentheses
        r'【】\[\]'      # square brackets
        r'《》<>'        # book-title marks / angle brackets
        r'、'           # enumeration comma
        r'—–\-'        # dashes / hyphens (re-handled uniformly)
        r'|&@#%\^*+=~`'
        r';;'          # semicolons
        r']+',
        '-',
        result,
    )

    # Collapse consecutive `-` into one
    result = re.sub(r'-{2,}', '-', result)

    # Strip leading and trailing `-`
    result = result.strip('-')

    return result or 'untitled'


def run_reslug(target_rel_path: str = None, dry_run: bool = False):
|
||
"""批量(或单条)规范化 manifest 中的 slug / source_path。
|
||
|
||
参数:
|
||
target_rel_path: 指定单个 raw 相对路径;为 None 则处理全部条目。
|
||
dry_run: 若为 True,只打印预览,不写入 manifest。
|
||
"""
|
||
manifest = load_manifest()
|
||
files = manifest.get("files", {})
|
||
|
||
if target_rel_path:
|
||
targets = [(target_rel_path, files[target_rel_path])] if target_rel_path in files else []
|
||
if not targets:
|
||
print(red(f" ✗ Not found in manifest: {target_rel_path}"))
|
||
return
|
||
else:
|
||
targets = list(files.items())
|
||
|
||
changed = []
|
||
skipped = 0
|
||
|
||
for rel_path, info in targets:
|
||
new_slug = _compute_normalized_slug(rel_path)
|
||
old_slug = info.get("slug", "")
|
||
new_source_path = f"wiki/sources/{new_slug}.md"
|
||
old_source_path = info.get("source_path", "")
|
||
|
||
if new_slug == old_slug and new_source_path == old_source_path:
|
||
skipped += 1
|
||
continue
|
||
|
||
changed.append({
|
||
"rel_path": rel_path,
|
||
"old_slug": old_slug,
|
||
"new_slug": new_slug,
|
||
"old_source_path": old_source_path,
|
||
"new_source_path": new_source_path,
|
||
})
|
||
|
||
print(f"\n{bold('=== Reslug Preview' if dry_run else '=== Reslug')}\n")
|
||
print(f" Total entries scanned : {len(targets)}")
|
||
print(f" Unchanged (skipped) : {skipped}")
|
||
print(f" To update : {len(changed)}\n")
|
||
|
||
if not changed:
|
||
print(f" {green('✓')} All slugs already normalized.\n")
|
||
return
|
||
|
||
for item in changed:
|
||
print(f" {dim(item['rel_path'])}")
|
||
if item['old_slug'] != item['new_slug']:
|
||
print(f" slug : {yellow(item['old_slug'])} → {green(item['new_slug'])}")
|
||
if item['old_source_path'] != item['new_source_path']:
|
||
print(f" src : {yellow(item['old_source_path'])} → {green(item['new_source_path'])}")
|
||
print()
|
||
|
||
if dry_run:
|
||
print(f" {yellow('⚠')} Dry-run — manifest NOT updated. Re-run without --dry-run to apply.\n")
|
||
return
|
||
|
||
# 应用变更
|
||
for item in changed:
|
||
entry = files[item["rel_path"]]
|
||
entry["slug"] = item["new_slug"]
|
||
entry["source_path"] = item["new_source_path"]
|
||
|
||
save_manifest(manifest)
|
||
print(f" {green('✓')} manifest.json updated ({len(changed)} entries changed).\n")
|
||
|
||


# ─── Management interface: mark_ingested (called by the ingest workflow) ─────────────

def mark_ingested(rel_path: str, slug: str, json_mode: bool = False):
    """Mark a raw file as ingested (updates its manifest entry).

    Behavior:
    - rel_path must already exist in the manifest (i.e. it has been scanned by --sync); otherwise exit with an error.
    - slug must be passed explicitly; otherwise exit with an error.
    - source_path is derived from slug as wiki/sources/<slug>.md.
    - hash and modified are refreshed from the raw file (modified is the file's actual mtime);
      if the file does not exist, the old values are kept and a warning is printed (non-JSON mode only).
    - ingested is set to True, ingested_at is set to the current UTC timestamp, and any previous error is cleared.

    Parameters:
        rel_path  : path relative to the repository root, e.g. "raw/dir/name.md" (required)
        slug      : wiki slug, e.g. "my-article" (required)
        json_mode : if True, print a single line of JSON for scripts to consume
    """
    if not slug or not slug.strip():
        msg = "--slug is required for --mark-ingested"
        if json_mode:
            print(json.dumps({"event": "error", "message": msg}))
        else:
            print(red(f" ✗ {msg}"))
        raise SystemExit(1)

    manifest = load_manifest()
    files = manifest.get("files", {})

    if rel_path not in files:
        msg = f"rel_path not found in manifest (run --sync first): {rel_path}"
        if json_mode:
            print(json.dumps({"event": "error", "message": msg}))
        else:
            print(red(f" ✗ {msg}"))
        raise SystemExit(1)

    entry = files[rel_path]

    # Update slug and source_path
    entry["slug"] = slug.strip()
    entry["source_path"] = f"wiki/sources/{slug.strip()}.md"

    # Refresh hash and modified from the raw file (modified is the file's actual mtime)
    abs_path = REPO_ROOT / rel_path
    if abs_path.exists():
        entry["hash"] = sha256_file(abs_path)
        entry["modified"] = datetime.fromtimestamp(abs_path.stat().st_mtime, tz=timezone.utc).isoformat()
    else:
        if not json_mode:
            print(yellow(f" ⚠ Raw file not found, modified timestamp not updated: {rel_path}"))

    # Mark as ingested
    entry["ingested"] = True
    entry["ingested_at"] = iso_now()
    entry.pop("error", None)

    files[rel_path] = entry
    manifest["files"] = files
    save_manifest(manifest)

    if json_mode:
        print(json.dumps({
            "event": "mark_ingested",
            "rel_path": rel_path,
            "slug": entry["slug"],
            "source_path": entry["source_path"],
            "modified": entry.get("modified"),
            "ingested_at": entry["ingested_at"],
        }))
    else:
        print(f" {green('✓')} Marked ingested: {rel_path}")
        print(f" slug : {entry['slug']}")
        print(f" source_path : {entry['source_path']}")
        print(f" modified : {entry.get('modified', '(unchanged)')}")
        print(f" ingested_at : {entry['ingested_at']}")
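

# Illustrative sketch (not part of the tool): one way the `hash` and `modified`
# fields refreshed by mark_ingested() could be derived from a raw file.
# Assumptions: `_demo_file_fingerprint` is a hypothetical helper, and the hash
# is taken to be a hex SHA-256 digest of the file bytes; the real code calls
# sha256_file(), whose implementation is not shown in this section.
def _demo_file_fingerprint(path):
    """Return (sha256 hex digest, mtime as a UTC ISO-8601 string) for `path`."""
    import hashlib
    from datetime import datetime, timezone
    from pathlib import Path

    p = Path(path)
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    # Same conversion as in mark_ingested(): POSIX mtime -> aware UTC datetime.
    modified = datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc).isoformat()
    return digest, modified
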


# ─── CLI entry point ───────────────────────────────────────────────

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Wiki ↔ Raw three-way sync tool",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--check",
        action="store_true",
        help="Preview changes without syncing",
    )
    parser.add_argument(
        "--sync",
        action="store_true",
        help="Run a full sync (added/deleted files + orphan detection)",
    )
    parser.add_argument(
        "--rebuild",
        action="store_true",
        help="Rebuild wiki/index.md from the manifest (fallback recovery)",
    )
    parser.add_argument(
        "--reset-failed",
        action="store_true",
        help="Reset the ingest state of all failed entries so they become pending again",
    )
    parser.add_argument(
        "--pending",
        action="store_true",
        help="List all files still pending ingestion",
    )
    parser.add_argument(
        "--verbose", "-v",
        action="store_true",
        help="Verbose output",
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="JSON lines output mode (for callers to parse)",
    )
    parser.add_argument(
        "--mark-ingested",
        metavar="REL_PATH",
        nargs=1,
        help="Mark a single raw file as ingested: pass its relative path (e.g. 'raw/dir/file.md'). Must be used with --slug.",
    )
    parser.add_argument(
        "--slug",
        help="Used with --mark-ingested (required): the wiki slug (e.g. my-article)",
    )
    parser.add_argument(
        "--mark-json",
        action="store_true",
        help="Used with --mark-ingested: print the result as a single line of JSON",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="Used with --pending (with or without --json): limit the number of entries returned (default: all)",
    )
    parser.add_argument(
        "--fix-source-links",
        action="store_true",
        help="Fix the raw path link under `## Source File` in source pages, based on the manifest",
    )
    parser.add_argument(
        "--fix-source-target",
        metavar="REL_PATH",
        help="Used with --fix-source-links: fix a single raw entry only (e.g. 'raw/AI/file.md')",
    )
    parser.add_argument(
        "--reslug",
        action="store_true",
        help="Normalize slug/source_path in the manifest in bulk (Chinese kept, ASCII special characters become -, uppercase lowered, consecutive - collapsed)",
    )
    parser.add_argument(
        "--reslug-target",
        metavar="REL_PATH",
        help="Used with --reslug: process only the given raw file (e.g. 'raw/dir/file.md')",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Used with --reslug or --fix-source-links: preview changes only, write nothing",
    )

    args = parser.parse_args()

    if args.mark_ingested:
        rel = args.mark_ingested[0]
        mark_ingested(rel, slug=args.slug, json_mode=args.mark_json)
    elif args.fix_source_links:
        run_fix_source_links(
            target_rel_path=args.fix_source_target,
            dry_run=args.dry_run,
            json_mode=args.json,
        )
    elif args.reslug:
        run_reslug(target_rel_path=args.reslug_target, dry_run=args.dry_run)
    elif args.rebuild:
        run_rebuild()
    elif args.pending:
        manifest = load_manifest()
        pending = [(k, v) for k, v in manifest["files"].items() if not v.get("ingested")]
        if args.json:
            total = len(pending)
            # No limit given -> return everything (as a "files" list)
            if args.limit is None:
                payload = {
                    "event": "pending_list",
                    "count": total,
                    "files": [
                        {
                            "rel_path": k,
                            "slug": v.get("slug", build_slug_from_path(k)),
                            "source_path": v.get("source_path"),
                            "modified": v.get("modified"),
                            "hash": v.get("hash"),
                        }
                        for k, v in pending
                    ],
                }
            elif args.limit <= 0:
                payload = {"event": "pending_list", "count": total, "files": []}
            elif args.limit == 1:
                # Single-item form: a "file" object (or None) instead of a "files" list
                first = pending[0] if pending else (None, None)
                if first[0] is None:
                    payload = {"event": "pending_list", "count": 0, "file": None}
                else:
                    k, v = first
                    payload = {
                        "event": "pending_list",
                        "count": total,
                        "file": {
                            "rel_path": k,
                            "slug": v.get("slug", build_slug_from_path(k)),
                            "source_path": v.get("source_path"),
                            "modified": v.get("modified"),
                            "hash": v.get("hash"),
                        },
                    }
            else:
                # Return the first N entries as a "files" list
                n = min(args.limit, total)
                payload = {
                    "event": "pending_list",
                    "count": total,
                    "files": [
                        {
                            "rel_path": k,
                            "slug": v.get("slug", build_slug_from_path(k)),
                            "source_path": v.get("source_path"),
                            "modified": v.get("modified"),
                            "hash": v.get("hash"),
                        }
                        for k, v in pending[:n]
                    ],
                }
            print(json.dumps(payload))
        else:
            # Console output also honors --limit
            total = len(pending)
            n = total if args.limit is None else max(0, args.limit)
            print(f"=== Pending Ingest Files ({total}) ===\n")
            if n == 0:
                print(" (no items to show)")
            else:
                for i, (path, info) in enumerate(pending[:n], 1):
                    print(f"{i:3}. {path}")
    elif args.reset_failed:
        manifest = load_manifest()
        reset_count = 0
        for k, v in manifest["files"].items():
            if v.get("error"):
                v["ingested"] = False
                v.pop("error", None)
                v.pop("ingested_at", None)
                reset_count += 1
        if reset_count > 0:
            save_manifest(manifest)
            print(f"Reset {reset_count} failed entries to pending.")
        else:
            print("No failed entries found.")
    elif args.check:
        run_check()
    elif args.sync:
        run_sync(dry_run=False, verbose=args.verbose, json_mode=args.json)
    else:
        parser.print_help()
        print("\nExamples:")
        print(" python tools/sync.py --check # preview changes")
        print(" python tools/sync.py --sync # run the sync")
        print(" python tools/sync.py --sync -v # verbose mode")
        print(" python tools/sync.py --rebuild # rebuild the index")
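

# Illustrative sketch (not part of the CLI): `--pending --json` prints a "files"
# list in general but a single "file" object (possibly null) with `--limit 1`,
# so a consumer has to accept both shapes. This shows one way a caller might
# normalize them.
# Assumption: `_demo_parse_pending` is a hypothetical consumer-side helper; it
# takes the single JSON line printed by the command above.
def _demo_parse_pending(line):
    """Return the pending entries from one output line as a list of dicts."""
    import json

    payload = json.loads(line)
    if "file" in payload:
        # `--limit 1` form: one object, or null when nothing is pending.
        return [payload["file"]] if payload["file"] else []
    return payload.get("files", [])
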