#!/usr/bin/env python3
"""
Wiki ↔ Raw three-way sync tool
================================================================================

Overview
--------
This script maintains the sync state between raw/ (the raw document layer) and
wiki/ (the knowledge base layer). It uses tools/manifest.json to track each raw
file's hash, ingestion status, and slug mapping, so that a coding agent knows
exactly which files need to be (re-)ingested into the wiki.

Core features
-------------
1. Scan the .md files under raw/, compare them with the manifest, and detect
   new/deleted files ("updated" is no longer detected automatically)
2. Maintain the tools/manifest.json state map (hash, slug, ingested, etc.)
3. Mark a single file as "ingested" (callback for the ingestion workflow)
4. Batch-normalize the slugs in the manifest (reslug)
5. Rebuild wiki/index.md from the manifest (fallback)
6. Detect orphan entity/concept pages (report only, never delete)
7. Fix the Source File link in source pages, in batch or one at a time
   (aligned with the raw path recorded in the manifest)

--------------------------------------------------------------------------------
CLI usage
--------------------------------------------------------------------------------

Basic operations:

  python tools/sync.py --check
      Preview the differences between raw/ and the manifest (new/deleted)
      without writing any file. Output is Markdown, suited to human reading.

  python tools/sync.py --sync
      Run a full sync: write the changes in raw/ to the manifest and report
      orphan pages. By default only additions and deletions are handled; a
      content change in an existing file does not reset ingested.

  python tools/sync.py --sync -v / --verbose
      Same as above, but also list each new/deleted file and the orphans.

  python tools/sync.py --pending
      List every manifest entry with ingested=false (human-readable).

  python tools/sync.py --pending --json
      Print the pending list as single-line JSON for scripts/agents.

  python tools/sync.py --pending --json --limit 1
      Return only the first pending file (a "file" field instead of a
      "files" array).

  python tools/sync.py --pending --json --limit N
      Return the first N pending files (a "files" array).

  python tools/sync.py --sync --json
      Emit all sync events as a JSON Lines stream for programmatic parsing.

  python tools/sync.py --rebuild
      Rebuild wiki/index.md from the manifest. A fallback for when the index
      is corrupted or missing.

Source File link fixes:

  python tools/sync.py --fix-source-links
      Scan all manifest entries and batch-fix the link under `## Source File`
      in each corresponding source page. The target format is:
        - [[raw/.../your-file.md]]

  python tools/sync.py --fix-source-links --fix-source-target "raw/dir/file.md"
      Fix only the source page of the given raw entry (useful as a
      single-file check after each ingest).

  python tools/sync.py --fix-source-links --dry-run
      Preview how many pages would change without writing any file.

Marking ingestion status:

  python tools/sync.py --mark-ingested "raw/dir/file.md" --slug my-slug
      Mark the given raw file as ingested and update slug, source_path, and
      ingested_at. This is the last step of the ingestion workflow and should
      be called after wiki/sources/<slug>.md has been written.

  python tools/sync.py --mark-ingested "raw/dir/file.md" --slug my-slug --mark-json
      Same as above, but print the result as single-line JSON.

  python tools/sync.py --reset-failed
      Reset every manifest entry that carries an error marker to
      ingested=false (back into the pending queue).

Slug management:

  python tools/sync.py --reslug
      Batch-normalize the slug and source_path of every manifest entry.
      Rules: Chinese characters are kept, ASCII uppercase is lowercased,
      special characters become `-`, consecutive `-` are collapsed.

  python tools/sync.py --reslug --reslug-target "raw/dir/file.md"
      Normalize only the slug of the given file.

  python tools/sync.py --reslug --dry-run
      Preview the reslug changes without writing the manifest.

--------------------------------------------------------------------------------
manifest.json format
--------------------------------------------------------------------------------

Path: tools/manifest.json (same directory as this script)

Top-level structure:

  {
    "version": 1,                          // format version, currently always 1
    "updated_at": "2024-01-15T08:00:00Z",  // last update (UTC ISO 8601), refreshed on every write
    "files": { ... }                       // key = raw file path relative to the repo root
  }

Structure of each record in files:

  {
    "raw/dir/my-paper.md": {
      "hash": "a3f1c2d4e5b6a7b8",                 // first 16 hex digits of sha256 of the file content
      "modified": "2024-01-15T07:00:00Z",         // mtime of the raw file (UTC ISO 8601)
      "slug": "my-paper",                         // wiki page slug, used to build source_path
      "source_path": "wiki/sources/my-paper.md",  // path of the matching wiki source page
      "ingested": true,                           // false = pending; true = ingested
      "ingested_at": "2024-01-15T08:00:00Z",      // ingestion time (null = not ingested)
      "error": "..."                              // optional, error message of a failed ingest
    }
  }

State transitions:

  New file detected by --sync
      → ingested=false, ingested_at=null
  Ingestion workflow finishes and calls --mark-ingested
      → ingested=true, ingested_at=<current UTC time>
  Content of an already tracked file changes
      → not handled by the default sync policy; an ingested file is not reset
        automatically (this avoids repeated ingests)
  Ingestion fails and an external process writes the error field
      → cleared with --reset-failed, which puts the entry back in the queue

--------------------------------------------------------------------------------
JSON output format (--json / --mark-json / --pending --json)
--------------------------------------------------------------------------------

Each line is one independent JSON object (JSON Lines). Possible events:

  {"event": "pending", "rel_path": "...", "slug": "...", "action": "new"}
  {"event": "recovered", "rel_path": "...", "slug": "...", "action": "recovered"}
  {"event": "deleted_detected", "rel_path": "..."}
  {"event": "sync_complete",
   "summary": {"pending": N, "recovered": N, "deleted": N, "manifest_entries": N},
   "pending_files": [...], "deleted_files": [...]}
  {"event": "pending_list", "count": N, "files": [...]}   // --pending --json --limit N
  {"event": "pending_list", "count": N, "file": {...}}    // --pending --json --limit 1
  {"event": "mark_ingested", "rel_path": "...", "slug": "...", "source_path": "...",
   "modified": "...", "ingested_at": "..."}
  {"event": "fix_source_links_complete", "summary": {...}, "details": [...]}
  {"event": "error", "message": "..."}

--------------------------------------------------------------------------------
Internal functions
--------------------------------------------------------------------------------

sha256_file(path)
    Compute the file's sha256 and return the first 16 hex digits.

load_manifest() / save_manifest(manifest)
    Read/write tools/manifest.json; a missing or corrupted file yields an
    empty manifest.

scan_raw()
    Recursively scan all .md files under raw/ and return
    {rel_path: {hash, modified, size, abs_path}}.

build_slug_from_path(rel_path)
    Build a basic slug from the raw file path (Chinese is kept, spaces and
    special characters become `-`). Note: --reslug uses the stricter
    _compute_normalized_slug() rules.

check_changes(manifest, raw_files)
    Compare the manifest with the files on disk. Currently only new/deleted
    entries are reported (updated detection is switched off).

run_sync(dry_run, verbose, json_mode)
    Run the full sync, update the manifest, and print the orphan report.

run_check()
    Read-only comparison; prints a Markdown diff report and changes no file.

run_rebuild()
    Walk all manifest entries and rebuild wiki/index.md, with tolerant path
    matching and orphan detection.

find_orphan_entity_concept(manifest)
    Scan the [[wikilinks]] in wiki/sources/*.md and find entity/concept pages
    that no source page references.

mark_ingested(rel_path, slug, json_mode)
    Mark the given raw file as ingested and update slug, source_path, hash,
    and ingested_at. rel_path must already be in the manifest (run --sync
    before --mark-ingested).

run_reslug(target_rel_path, dry_run)
    Normalize slug/source_path in the manifest, in batch or for one entry,
    using the _compute_normalized_slug() rules.

run_fix_source_links(target_rel_path, dry_run, json_mode)
    Fix the raw path link under `## Source File` in source pages based on the
    manifest; supports full and single-file modes.

_compute_normalized_slug(rel_path)
    Core slug normalization rules:
      a. Chinese characters are kept as-is
      b. ASCII uppercase letters are lowercased
      c. Spaces, punctuation, and special symbols are replaced with `-`
      d. Consecutive `-` are collapsed into one; leading/trailing `-` removed

--------------------------------------------------------------------------------
Typical workflow (for agents)
--------------------------------------------------------------------------------

1. Check for pending files:
     python tools/sync.py --pending --json --limit 1

2. Sync raw changes into the manifest:
     python tools/sync.py --sync

3. Mark a file once ingestion is done:
     python tools/sync.py --mark-ingested "raw/papers/my-paper.md" --slug my-paper

4. Fix slug naming:
     python tools/sync.py --reslug --dry-run   # preview
     python tools/sync.py --reslug             # apply

5. Batch-fix Source File links:
     python tools/sync.py --fix-source-links --dry-run
     python tools/sync.py --fix-source-links

6. Single-file check after an ingest:
     python tools/sync.py --fix-source-links --fix-source-target "raw/papers/my-paper.md"

7. Rebuild a corrupted index:
     python tools/sync.py --rebuild
"""

import argparse
import hashlib
import json
import re
from datetime import datetime, timezone
from pathlib import Path

REPO_ROOT = Path(__file__).parent.parent.resolve()
WIKI_DIR = REPO_ROOT / "wiki"
MANIFEST_FILE = Path(__file__).parent / "manifest.json"


# ─── Utility functions ───────────────────────────────────────────────

def green(text):
    return f"\033[92m{text}\033[0m"


def yellow(text):
    return f"\033[93m{text}\033[0m"


def red(text):
    return f"\033[91m{text}\033[0m"


def dim(text):
    return f"\033[2m{text}\033[0m"


def bold(text):
    return f"\033[1m{text}\033[0m"


def log(msg, style="normal"):
    prefixes = {
        "normal": " ",
        "info": " ℹ ",
        "success": " ✓ ",
        "warn": " ⚠ ",
        "error": " ✗ ",
        "section": "\n── ",
    }
    print(f"{prefixes.get(style, ' ')}{msg}")


def sha256_file(path: Path) -> str:
    """Return the first 16 hex digits of the file's sha256."""
    h = hashlib.sha256()
    h.update(path.read_bytes())
    return h.hexdigest()[:16]


def iso_now():
    return datetime.now(timezone.utc).isoformat()


def load_manifest() -> dict:
    """Load the manifest; fall back to an empty one if missing or corrupted."""
    if MANIFEST_FILE.exists():
        try:
            return json.loads(MANIFEST_FILE.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, IOError):
            pass
    return {"version": 1, "updated_at": iso_now(), "files": {}}


def save_manifest(manifest: dict):
    manifest["updated_at"] = iso_now()
    MANIFEST_FILE.write_text(
        json.dumps(manifest, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def scan_raw() -> dict[str, dict]:
    """Return {relative_path: {hash, modified, size, abs_path}}."""
    raw_dir = REPO_ROOT / "raw"
    result = {}
    if not raw_dir.exists():
        return result
    for p in raw_dir.rglob("*.md"):
        if p.is_file() and not p.name.startswith("."):
            rel = str(p.relative_to(REPO_ROOT))
            stat = p.stat()
            result[rel] = {
                "hash": sha256_file(p),
                "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
                "size": stat.st_size,
                "abs_path": str(p),
            }
    return result


def build_slug_from_path(rel_path: str) -> str:
    """Build a slug from a relative path (keeps Chinese where possible, kebab-case)."""
    name = Path(rel_path).stem
    name = name.replace(" ", "-").replace("/", "-").replace("\\", "-")
    name = "".join(c if c.isalnum() or c in ("-", "_", "·") else "-" for c in name)
    name = name.strip("-")
    return name or "untitled"


def find_orphan_entity_concept(manifest: dict) -> tuple[list, list]:
    """Detect entity and concept pages that no source page references."""
    wikilink_pattern = re.compile(r"\[\[([^\]]+)\]\]")
    sources_dir = WIKI_DIR / "sources"
    referenced_entities = set()
    referenced_concepts = set()
    if sources_dir.exists():
        for src in sources_dir.glob("*.md"):
            content = src.read_text(encoding="utf-8")
            for link in wikilink_pattern.findall(content):
                name = link.strip()
                if name.startswith("entities/"):
                    referenced_entities.add(Path(name).stem)
                elif name.startswith("concepts/"):
                    referenced_concepts.add(Path(name).stem)
                elif "/" not in name:
                    # A bare link could point at either kind of page.
                    referenced_entities.add(name)
                    referenced_concepts.add(name)

    orphan_entities = []
    entities_dir = WIKI_DIR / "entities"
    if entities_dir.exists():
        for f in entities_dir.glob("*.md"):
            if f.stem not in referenced_entities:
                orphan_entities.append(f.name)

    orphan_concepts = []
    concepts_dir = WIKI_DIR / "concepts"
    if concepts_dir.exists():
        for f in concepts_dir.glob("*.md"):
            if f.stem not in referenced_concepts:
                orphan_concepts.append(f.name)

    return orphan_entities, orphan_concepts


# ─── Core sync logic ───────────────────────────────────────────────

def check_changes(manifest: dict, raw_files: dict) -> dict:
    """Compare the manifest with the actual raw files and return the changes.

    Current policy (narrowed on request):
    - only new / deleted entries are detected
    - updated is no longer detected from the hash, so that touching a file
      cannot trigger a repeated ingest; "updated" therefore stays empty
    """
    changes = {"new": [], "updated": [], "deleted": [], "unchanged": []}
    manifest_files = manifest.get("files", {})
    for rel_path, info in raw_files.items():
        if rel_path not in manifest_files:
            changes["new"].append({"rel_path": rel_path, **info})
        else:
            # Under the new policy every known file counts as unchanged.
            changes["unchanged"].append(rel_path)
    for rel_path in manifest_files:
        abs_path = REPO_ROOT / rel_path
        if not abs_path.exists():
            changes["deleted"].append({
                "rel_path": rel_path,
                "slug": manifest_files[rel_path].get("slug", build_slug_from_path(rel_path)),
                "source_path": manifest_files[rel_path].get("source_path"),
            })
    return changes


def run_sync(dry_run: bool = False, verbose: bool = False, json_mode: bool = False):
    """Run the sync while keeping the output as short as possible.

    - Default (neither verbose nor json): one summary line plus the
      "manifest updated" message.
    - verbose=True also lists every new/updated/deleted file (old behavior).
    - json_mode=True emits the machine-friendly JSON stream.
    """
    manifest = load_manifest()
    raw_files = scan_raw()
    changes = check_changes(manifest, raw_files)
    new = changes["new"]
    updated = changes["updated"]
    deleted = changes["deleted"]
    total_changes = len(new) + len(updated) + len(deleted)

    if total_changes == 0:
        if json_mode:
            print(json.dumps({
                "event": "sync_complete",
                "summary": {
                    "pending": 0,
                    "deleted": 0,
                    "manifest_entries": len(manifest.get("files", {})),
                },
            }))
        else:
            log("No changes detected — wiki is up to date.", "success")
        return

    # Non-JSON: short summary (default) or detailed list (verbose).
    if not json_mode:
        log(f"Changes detected: +{len(new)} ~{len(updated)} -{len(deleted)}", "info")
        if verbose:
            if new:
                print("\nNew Files:")
                for f in new:
                    print(f" {f['rel_path']}")
            if updated:
                print("\nUpdated Files:")
                for f in updated:
                    old = f.get("old_hash")
                    print(f" {f['rel_path']}" + (f" (was {old})" if old else ""))
            if deleted:
                print("\nDeleted Files:")
                for f in deleted:
                    print(f" {f['rel_path']}")

    if dry_run:
        log("Dry-run complete. Run with --sync to apply.", "warn")
        return

    # Apply the changes. Per-file logging is suppressed unless json_mode or verbose.
    updated_manifest = manifest.copy()
    updated_manifest["files"] = manifest.get("files", {}).copy()
    pending_files = []
    recovered_files = []

    for f in new:
        rel_path = f["rel_path"]
        slug = build_slug_from_path(rel_path)
        source_path = f"wiki/sources/{slug}.md"
        source_file = WIKI_DIR / "sources" / f"{slug}.md"
        # If wiki/sources/<slug>.md already exists, the manifest was probably
        # deleted and is being recovered: treat the file as already ingested.
        already_ingested = source_file.exists()
        ingested_at = None
        if already_ingested:
            # Use the source page's mtime as an approximation of ingested_at.
            try:
                ingested_at = datetime.fromtimestamp(
                    source_file.stat().st_mtime, tz=timezone.utc
                ).isoformat()
            except OSError:
                ingested_at = iso_now()
        if json_mode:
            action = "recovered" if already_ingested else "new"
            print(json.dumps({
                "event": "pending" if not already_ingested else "recovered",
                "rel_path": rel_path,
                "slug": slug,
                "action": action,
            }))
        if not already_ingested:
            pending_files.append({
                "rel_path": rel_path,
                "abs_path": f["abs_path"],
                "slug": slug,
                "action": "new",
            })
        else:
            recovered_files.append({
                "rel_path": rel_path,
                "slug": slug,
                "source_path": source_path,
            })
            if verbose and not json_mode:
                print(f" ↺ Recovered (source exists): {rel_path} → {source_path}")
        updated_manifest["files"][rel_path] = {
            "hash": f["hash"],
            "modified": f.get("modified"),
            "slug": slug,
            "source_path": source_path,
            "ingested": already_ingested,
            "ingested_at": ingested_at,
        }

    # Kept for completeness: check_changes() currently never reports "updated".
    for f in updated:
        rel_path = f["rel_path"]
        old_entry = manifest["files"].get(rel_path, {})
        slug = old_entry.get("slug") or build_slug_from_path(rel_path)
        if json_mode:
            print(json.dumps({
                "event": "pending",
                "rel_path": rel_path,
                "slug": slug,
                "action": "updated",
            }))
        pending_files.append({
            "rel_path": rel_path,
            "abs_path": f["abs_path"],
            "slug": slug,
            "action": "updated",
        })
        updated_manifest["files"][rel_path] = {
            **old_entry,
            "hash": f["hash"],
            "modified": f.get("modified"),
            "ingested": False,
            "ingested_at": None,
        }

    deleted_files = []
    for f in deleted:
        rel_path = f["rel_path"]
        if rel_path in updated_manifest["files"]:
            del updated_manifest["files"][rel_path]
        deleted_files.append(rel_path)
        # Emit one event per deleted file.
        if json_mode:
            print(json.dumps({"event": "deleted_detected", "rel_path": rel_path}))

    save_manifest(updated_manifest)

    if json_mode:
        print(json.dumps({
            "event": "sync_complete",
            "summary": {
                "pending": len(pending_files),
                "recovered": len(recovered_files),
                "deleted": len(deleted_files),
                "manifest_entries": len(updated_manifest["files"]),
            },
            "pending_files": pending_files,
            "deleted_files": deleted_files,
        }))
    else:
        log(f"manifest.json updated ({len(updated_manifest['files'])} entries)", "success")
        if recovered_files:
            log(f"Recovered (source page exists): {len(recovered_files)}", "info")
        if verbose:
            log(f"Pending files for ingestion: {len(pending_files)}", "info")

    # Short orphan report (details only in verbose mode).
    orphan_entities, orphan_concepts = find_orphan_entity_concept(updated_manifest)
    if not json_mode:
        if orphan_entities or orphan_concepts:
            if verbose:
                print(f"\n{bold('--- Orphan Report (kept as requested) ---')}")
                if orphan_entities:
                    print(f"Orphan Entities ({len(orphan_entities)}):")
                    for e in sorted(orphan_entities):
                        print(f" {e}")
                if orphan_concepts:
                    print(f"Orphan Concepts ({len(orphan_concepts)}):")
                    for c in sorted(orphan_concepts):
                        print(f" {c}")
            else:
                log(
                    f"Orphan entities: {len(orphan_entities)}; "
                    f"Orphan concepts: {len(orphan_concepts)}",
                    "info",
                )
        elif verbose:
            log("No orphan entity/concept detected.", "success")
        print("\nDone.")


def run_check():
    """Preview the changes without applying them (output is plain Markdown)."""
    manifest = load_manifest()
    raw_files = scan_raw()
    changes = check_changes(manifest, raw_files)
    total = len(changes["new"]) + len(changes["updated"]) + len(changes["deleted"])

    # Markdown header and summary
    print("# Wiki Sync Check\n")
    print(f"- Raw files: {len(raw_files)}")
    print(f"- Manifest entries: {len(manifest.get('files', {}))}")
    print(f"- New: {len(changes['new'])}")
    print(f"- Updated: {len(changes['updated'])}")
    print(f"- Deleted: {len(changes['deleted'])}\n")

    if total > 0:
        if changes["new"]:
            print("## New Files")
            for f in changes["new"]:
                print(f"- {f['rel_path']}")
            print()
        if changes["updated"]:
            print("## Updated Files")
            for f in changes["updated"]:
                print(f"- {f['rel_path']} (was {f['old_hash']}, now {f['hash']})")
            print()
        if changes["deleted"]:
            print("## Deleted Files")
            for f in changes["deleted"]:
                print(f"- {f['rel_path']}")
            print()
    else:
        print("No changes — wiki is in sync.\n")


def run_rebuild():
    """Rebuild wiki/index.md from the manifest (fallback).

    Details:
    - Prefer the source_path recorded in the manifest (if set and the file
      exists), then wiki/sources/<slug>.md, then a case-insensitive or
      normalized match under wiki/sources (fewer broken links caused by
      naming differences).
    - Parse the title field of the YAML frontmatter tolerantly (a missing
      closing delimiter is accepted) and fall back to the first Markdown
      heading, then to the slug.
    - If no source file is found, keep the slug and flag the index entry
      with (source missing) for manual follow-up.
    """
    manifest = load_manifest()
    print(f"\n{bold('=== Wiki Rebuild from Manifest')}\n")
    print(f" Manifest entries: {len(manifest.get('files', {}))}")
    print(" Rebuilding index.md ...\n")

    index_lines = [
        "# Wiki Index\n",
        "\n## Overview\n",
        "- [Overview](overview.md) — living synthesis\n",
        "\n## Sources\n",
    ]
    files = manifest.get("files", {})
    sorted_files = sorted(
        files.items(),
        key=lambda x: (x[1].get("ingested_at") or "", x[1].get("modified", "")),
        reverse=True,
    )
    sources_dir = WIKI_DIR / "sources"

    def normalize(s: str) -> str:
        # Loose file-name matching: drop non-alphanumerics and lowercase.
        return ''.join(ch for ch in s.lower() if ch.isalnum())

    def find_source_file(slug: str, info: dict, rel_path: str):
        # Try manifest.source_path first.
        sp = info.get('source_path')
        if sp:
            p = REPO_ROOT / sp
            if p.exists():
                return p
            # A path relative to wiki/ (such as "sources/foo.md").
            p2 = WIKI_DIR / sp
            if p2.exists():
                return p2
        # Regular location: wiki/sources/<slug>.md
        candidate = sources_dir / f"{slug}.md"
        if candidate.exists():
            return candidate
        # Strip a stray suffix (the manifest slug wrongly ends in ".md").
        if slug.endswith('.md'):
            short = slug[:-3]
            c2 = sources_dir / f"{short}.md"
            if c2.exists():
                return c2
        # Case-insensitive or normalized match.
        norm_slug = normalize(slug)
        if sources_dir.exists():
            for p in sources_dir.glob('*.md'):
                if p.stem.lower() == slug.lower():
                    return p
                if normalize(p.stem) == norm_slug:
                    return p
        # Last resort: guess the source file name from the raw rel_path.
        # Some repos put source files straight into wiki/sources under a
        # different slug rule.
        try:
            # rel_path example: 'raw/dir/name.md' -> use name as candidate
            name = Path(rel_path).stem
            p3 = sources_dir / f"{name}.md"
            if p3.exists():
                return p3
        except Exception:
            pass
        return None

    for rel_path, info in sorted_files:
        slug = info.get("slug") or build_slug_from_path(rel_path)
        # Strip a stray suffix.
        if slug.endswith('.md'):
            slug = slug[:-3]
        src_file = find_source_file(slug, info, rel_path)

        # Date prefix (YYYY-MM-DD) from ingested_at; empty if not ingested.
        date_raw = info.get("ingested_at") or ""
        date_prefix = ""
        if date_raw:
            try:
                date_prefix = f"[{date_raw[:10]}] "
            except Exception:
                date_prefix = ""

        title = None
        if src_file and src_file.exists():
            content = src_file.read_text(encoding="utf-8")
            lines = content.splitlines()
            # YAML frontmatter (tolerant: ignored if the closing '---' is missing).
            if lines and lines[0].strip() == '---':
                end_idx = None
                for i in range(1, min(len(lines), 500)):
                    if lines[i].strip() == '---':
                        end_idx = i
                        break
                if end_idx is not None:
                    frontmatter = '\n'.join(lines[1:end_idx])
                    # Supports `title: "..."` and the folded form `title: >`
                    # (first line only). The folded alternative is tried
                    # first; otherwise the plain alternative would match the
                    # bare ">" and return it as the title.
                    m = re.search(
                        r'^\s*title\s*:\s*(?:>\s*\n\s*(.*)|["\']?(.*?)["\']?)\s*$',
                        frontmatter,
                        flags=re.MULTILINE,
                    )
                    if m:
                        title = (m.group(1) or m.group(2) or '').strip()
            # Fallback: the first line that starts with '#'.
            if not title and lines:
                for line in lines:
                    s = line.strip()
                    if s.startswith('#'):
                        title = s.lstrip('#').strip()
                        break
            if not title:
                title = slug
            index_lines.append(f"- {date_prefix}[{title}](sources/{src_file.name})\n")
        else:
            # No source file found: show the manifest's source_path, if any,
            # to make troubleshooting easier.
            sp = info.get('source_path')
            if sp:
                index_lines.append(
                    f"- {date_prefix}[{slug}](sources/{slug}.md) — (expected: {sp} — source missing)\n"
                )
            else:
                index_lines.append(
                    f"- {date_prefix}[{slug}](sources/{slug}.md) — (source missing)\n"
                )

    # Entities index
    index_lines.append("\n## Entities\n")
    entities_dir = WIKI_DIR / "entities"
    if entities_dir.exists():
        entity_files = sorted(entities_dir.glob("*.md"), key=lambda p: p.stem.lower())
        for ef in entity_files:
            index_lines.append(f"- [{ef.stem}](entities/{ef.name})\n")

    # Concepts index
    index_lines.append("\n## Concepts\n")
    concepts_dir = WIKI_DIR / "concepts"
    if concepts_dir.exists():
        concept_files = sorted(concepts_dir.glob("*.md"), key=lambda p: p.stem.lower())
        for cf in concept_files:
            index_lines.append(f"- [{cf.stem}](concepts/{cf.name})\n")

    index_lines.append("\n## Syntheses\n")

    index_file = WIKI_DIR / "index.md"
    index_file.write_text("".join(index_lines), encoding="utf-8")
    print(f" {green('✓')} index.md rebuilt with {len(sorted_files)} sources")

    # Orphan detection (scans the wiki pages on disk).
    orphan_entities, orphan_concepts = find_orphan_entity_concept(manifest)
    if orphan_entities:
        print(f" {dim('?')} Orphan entities: {len(orphan_entities)}")
    if orphan_concepts:
        print(f" {dim('?')} Orphan concepts: {len(orphan_concepts)}")
    print("\nDone.")


# ─── Admin interface: fix the Source File link in source pages ─────────────────────────

def _fix_source_file_link_in_content(content: str, raw_rel_path: str) -> tuple[str, bool, str]:
    """Fix the `## Source File` section of a single source page.

    Target format:
        ## Source File
        - [[raw/.../file.md]]

    Returns:
        (new_content, changed, action)
        action ∈ {"unchanged", "updated", "inserted_line", "inserted_section"}
    """
    expected_line = f"- [[{raw_rel_path}]]"
    lines = content.splitlines()
    had_trailing_newline = content.endswith("\n")

    # 1) Find the `## Source File` heading.
    heading_idx = None
    for i, line in enumerate(lines):
        if line.strip().lower() == "## source file":
            heading_idx = i
            break

    # 2) No section: insert a complete one (preferably right after the frontmatter).
    if heading_idx is None:
        insert_at = 0
        if lines and lines[0].strip() == "---":
            for j in range(1, len(lines)):
                if lines[j].strip() == "---":
                    insert_at = j + 1
                    while insert_at < len(lines) and lines[insert_at].strip() == "":
                        insert_at += 1
                    break
        block = ["## Source File", expected_line, ""]
        new_lines = lines[:insert_at] + block + lines[insert_at:]
        new_content = "\n".join(new_lines)
        if had_trailing_newline or new_content:
            new_content += "\n"
        return new_content, True, "inserted_section"

    # 3) Find the first list item between `## Source File` and the next
    #    second-level heading.
    section_end = len(lines)
    for j in range(heading_idx + 1, len(lines)):
        if lines[j].startswith("## "):
            section_end = j
            break
    bullet_idx = None
    for j in range(heading_idx + 1, section_end):
        if lines[j].strip().startswith("- "):
            bullet_idx = j
            break
    if bullet_idx is None:
        # No list item: insert the standard link line.
        lines.insert(heading_idx + 1, expected_line)
        new_content = "\n".join(lines)
        if had_trailing_newline or new_content:
            new_content += "\n"
        return new_content, True, "inserted_line"

    # 4) A list item exists: replace it with the raw path from the manifest.
    current = lines[bullet_idx].strip()
    if current == expected_line:
        return content, False, "unchanged"
    lines[bullet_idx] = expected_line
    new_content = "\n".join(lines)
    if had_trailing_newline or new_content:
        new_content += "\n"
    return new_content, True, "updated"


def run_fix_source_links(target_rel_path: str = None, dry_run: bool = False, json_mode: bool = False):
    """Correct the Source File link in source pages based on the manifest.

    - Without target_rel_path: scan and fix all entries
    - With target_rel_path: handle a single raw entry (useful as a
      single-file check after an ingest)
    """
    manifest = load_manifest()
    files = manifest.get("files", {})
    if target_rel_path:
        if target_rel_path not in files:
            msg = f"target not found in manifest: {target_rel_path}"
            if json_mode:
                print(json.dumps({"event": "error", "message": msg}))
            else:
                print(red(f" ✗ {msg}"))
            raise SystemExit(1)
        targets = [(target_rel_path, files[target_rel_path])]
    else:
        targets = list(files.items())

    changed = 0
    unchanged = 0
    skipped_no_source_path = 0
    skipped_source_missing = 0
    details = []

    for rel_path, info in targets:
        source_path = info.get("source_path")
        if not source_path:
            skipped_no_source_path += 1
            details.append({"rel_path": rel_path, "status": "skipped_no_source_path"})
            continue
        src_file = REPO_ROOT / source_path
        if not src_file.exists():
            skipped_source_missing += 1
            details.append({
                "rel_path": rel_path,
                "source_path": source_path,
                "status": "skipped_source_missing",
            })
            continue
        original = src_file.read_text(encoding="utf-8")
        new_content, did_change, action = _fix_source_file_link_in_content(original, rel_path)
        if did_change:
            changed += 1
            if not dry_run:
                src_file.write_text(new_content, encoding="utf-8")
            details.append({
                "rel_path": rel_path,
                "source_path": source_path,
                "status": "changed",
                "action": action,
            })
        else:
            unchanged += 1
            details.append({
                "rel_path": rel_path,
                "source_path": source_path,
                "status": "unchanged",
            })

    summary = {
        "scanned": len(targets),
        "changed": changed,
        "unchanged": unchanged,
        "skipped_no_source_path": skipped_no_source_path,
        "skipped_source_missing": skipped_source_missing,
        "dry_run": dry_run,
    }
    if json_mode:
        print(json.dumps(
            {"event": "fix_source_links_complete", "summary": summary, "details": details},
            ensure_ascii=False,
        ))
        return

    print(f"\n{bold('=== Fix Source File Links')}\n")
    print(f" Scanned : {summary['scanned']}")
    print(f" Changed : {summary['changed']}")
    print(f" Unchanged : {summary['unchanged']}")
    print(f" Skipped (no source_path): {summary['skipped_no_source_path']}")
    print(f" Skipped (source missing): {summary['skipped_source_missing']}")
    if dry_run:
        print(f" {yellow('⚠')} Dry-run only, no file written.")
    else:
        print(f" {green('✓')} Source File links corrected.")
    print()


# ─── Admin interface: reslug (batch-normalize manifest slugs) ──────────────────────────

def _compute_normalized_slug(rel_path: str) -> str:
    """Compute the normalized slug for a raw file path.

    Rules:
      a. Chinese characters are kept as-is (no pinyin conversion)
      b. ASCII uppercase letters are lowercased
      c. Spaces and special characters (quotes, slashes, question marks,
         colons, commas, periods, exclamation marks, brackets, full-width
         symbols, etc.) are replaced with `-`
      d. Consecutive `-` are collapsed into one; leading/trailing `-` removed
    """
    stem = Path(rel_path).stem
    # Lowercase (affects ASCII letters only; Chinese is unchanged).
    result = stem.lower()
    # Replace special characters with `-`.
    # Kept: Chinese characters, ASCII letters and digits, underscores.
    # Dots are replaced too, so a version such as 0.65.0 becomes 0-65-0.
    result = re.sub(
        r'[ \t\r\n'
        r'\'"'             # single and double quotes
        r'／/\\\\'         # slashes (full-width, half-width, backslash)
        r'？?'             # question marks
        r'：:'             # colons
        r'，,'             # commas
        r'。\.'            # periods (full-width and ASCII dot)
        r'！!'             # exclamation marks
        r'（）()'          # parentheses
        r'【】\[\]'        # square brackets
        r'《》<>'          # book-title marks / angle brackets
        r'、'              # enumeration comma
        r'—–\-'            # dashes / hyphens (re-normalized below)
        r'|&@#%\^*+=~`'
        r'；;'             # semicolons
        r']+',
        '-',
        result,
    )
    # Collapse consecutive `-` into one.
    result = re.sub(r'-{2,}', '-', result)
    # Strip leading/trailing `-`.
    result = result.strip('-')
    return result or 'untitled'


def run_reslug(target_rel_path: str = None, dry_run: bool = False):
    """Normalize slug / source_path in the manifest, in batch or for one entry.

    Args:
        target_rel_path: a single raw relative path; None processes all entries.
        dry_run: if True, only print a preview and do not write the manifest.
    """
    manifest = load_manifest()
    files = manifest.get("files", {})
    if target_rel_path:
        if target_rel_path not in files:
            print(red(f" ✗ Not found in manifest: {target_rel_path}"))
            return
        targets = [(target_rel_path, files[target_rel_path])]
    else:
        targets = list(files.items())

    changed = []
    skipped = 0
    for rel_path, info in targets:
        new_slug = _compute_normalized_slug(rel_path)
        old_slug = info.get("slug", "")
        new_source_path = f"wiki/sources/{new_slug}.md"
        old_source_path = info.get("source_path", "")
        if new_slug == old_slug and new_source_path == old_source_path:
            skipped += 1
            continue
        changed.append({
            "rel_path": rel_path,
            "old_slug": old_slug,
            "new_slug": new_slug,
            "old_source_path": old_source_path,
            "new_source_path": new_source_path,
        })

    print(f"\n{bold('=== Reslug Preview' if dry_run else '=== Reslug')}\n")
    print(f" Total entries scanned : {len(targets)}")
    print(f" Unchanged (skipped) : {skipped}")
    print(f" To update : {len(changed)}\n")

    if not changed:
        print(f" {green('✓')} All slugs already normalized.\n")
        return

    for item in changed:
        print(f" {dim(item['rel_path'])}")
        if item['old_slug'] != item['new_slug']:
            print(f" slug : {yellow(item['old_slug'])} → {green(item['new_slug'])}")
        if item['old_source_path'] != item['new_source_path']:
            print(f" src : {yellow(item['old_source_path'])} → {green(item['new_source_path'])}")
        print()

    if dry_run:
        print(f" {yellow('⚠')} Dry-run — manifest NOT updated. Re-run without --dry-run to apply.\n")
        return

    # Apply the changes.
    for item in changed:
        entry = files[item["rel_path"]]
        entry["slug"] = item["new_slug"]
        entry["source_path"] = item["new_source_path"]
    save_manifest(manifest)
    print(f" {green('✓')} manifest.json updated ({len(changed)} entries changed).\n")


# ─── Admin interface: mark_ingested (called by the ingestion workflow) ─────────────────

def mark_ingested(rel_path: str, slug: str, json_mode: bool = False):
    """Mark a raw file as ingested (updates its manifest entry).

    Behavior:
    - rel_path must already be in the manifest (that is, --sync has scanned
      it); otherwise exit with an error.
    - slug must be passed explicitly; otherwise exit with an error.
    - source_path is derived from the slug as wiki/sources/<slug>.md.
    - hash and modified are refreshed from the raw file on disk (if the file
      is missing, the old values are kept and a warning is printed).
    - ingested is set to True and ingested_at to the current UTC timestamp.

    Args:
        rel_path  : path relative to the repo root, e.g. "raw/dir/name.md" (required)
        slug      : wiki slug, e.g. "my-article" (required)
        json_mode : if True, print single-line JSON for scripts
    """
    if not slug or not slug.strip():
        msg = "--slug is required for --mark-ingested"
        if json_mode:
            print(json.dumps({"event": "error", "message": msg}))
        else:
            print(red(f" ✗ {msg}"))
        raise SystemExit(1)

    manifest = load_manifest()
    files = manifest.get("files", {})
    if rel_path not in files:
        msg = f"rel_path not found in manifest (run --sync first): {rel_path}"
        if json_mode:
            print(json.dumps({"event": "error", "message": msg}))
        else:
            print(red(f" ✗ {msg}"))
        raise SystemExit(1)

    entry = files[rel_path]
    # Update slug and source_path.
    entry["slug"] = slug.strip()
    entry["source_path"] = f"wiki/sources/{slug.strip()}.md"

    # Refresh hash and modified from the raw file on disk.
    abs_path = REPO_ROOT / rel_path
    if abs_path.exists():
        entry["hash"] = sha256_file(abs_path)
        entry["modified"] = datetime.fromtimestamp(
            abs_path.stat().st_mtime, tz=timezone.utc
        ).isoformat()
    elif not json_mode:
        print(yellow(f" ⚠ Raw file not found, modified timestamp not updated: {rel_path}"))

    # Mark as ingested.
    entry["ingested"] = True
    entry["ingested_at"] = iso_now()
    entry.pop("error", None)
    files[rel_path] = entry
    manifest["files"] = files
    save_manifest(manifest)

    if json_mode:
        print(json.dumps({
            "event": "mark_ingested",
            "rel_path": rel_path,
            "slug": entry["slug"],
            "source_path": entry["source_path"],
            "modified": entry.get("modified"),
            "ingested_at": entry["ingested_at"],
        }))
    else:
        print(f" {green('✓')} Marked ingested: {rel_path}")
        print(f" slug : {entry['slug']}")
        print(f" source_path : {entry['source_path']}")
        print(f" modified : {entry.get('modified', '(unchanged)')}")
        print(f" ingested_at : {entry['ingested_at']}")


def _pending_entry(rel_path: str, info: dict) -> dict:
    """Build the JSON record emitted for one pending manifest entry."""
    return {
        "rel_path": rel_path,
        "slug": info.get("slug", build_slug_from_path(rel_path)),
        "source_path": info.get("source_path"),
        "modified": info.get("modified"),
        "hash": info.get("hash"),
    }


# ─── CLI entry point ───────────────────────────────────────────────

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Wiki ↔ Raw three-way sync tool",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--check", action="store_true",
        help="Preview changes without syncing",
    )
    parser.add_argument(
        "--sync", action="store_true",
        help="Run a full sync (new/deleted files + orphan detection)",
    )
    parser.add_argument(
        "--rebuild", action="store_true",
        help="Rebuild wiki/index.md from the manifest (fallback)",
    )
    parser.add_argument(
        "--reset-failed", action="store_true",
        help="Reset every failed ingest so it becomes pending again",
    )
    parser.add_argument(
        "--pending", action="store_true",
        help="List all files pending ingestion",
    )
    parser.add_argument(
        "--verbose", "-v", action="store_true",
        help="Verbose output",
    )
    parser.add_argument(
        "--json", action="store_true",
        help="JSON Lines output mode (for callers to parse)",
    )
    parser.add_argument(
        "--mark-ingested", metavar="REL_PATH", nargs=1,
        help="Mark a single raw file as ingested: pass its relative path "
             "(e.g. 'raw/dir/file.md'). Must be combined with --slug.",
    )
    parser.add_argument(
        "--slug",
        help="With --mark-ingested (required): the wiki slug (e.g. my-article)",
    )
    parser.add_argument(
        "--mark-json", action="store_true",
        help="With --mark-ingested: print the result as single-line JSON",
    )
    parser.add_argument(
        "--limit", type=int, default=None,
        help="With --pending: limit the number of entries returned (default: all)",
    )
    parser.add_argument(
        "--fix-source-links", action="store_true",
        help="Fix the raw path link under `## Source File` in source pages, based on the manifest",
    )
    parser.add_argument(
        "--fix-source-target", metavar="REL_PATH",
        help="With --fix-source-links: fix a single raw entry only (e.g. 'raw/AI/file.md')",
    )
    parser.add_argument(
        "--reslug", action="store_true",
        help="Batch-normalize slug/source_path in the manifest (Chinese kept, special "
             "characters become -, uppercase lowercased, consecutive - collapsed)",
    )
    parser.add_argument(
        "--reslug-target", metavar="REL_PATH",
        help="With --reslug: process only the given raw file (e.g. 'raw/dir/file.md')",
    )
    parser.add_argument(
        "--dry-run", action="store_true",
        help="With --reslug or --fix-source-links: preview only, write nothing",
    )
    args = parser.parse_args()

    if args.mark_ingested:
        rel = args.mark_ingested[0]
        mark_ingested(rel, slug=args.slug, json_mode=args.mark_json)
    elif args.fix_source_links:
        run_fix_source_links(
            target_rel_path=args.fix_source_target,
            dry_run=args.dry_run,
            json_mode=args.json,
        )
    elif args.reslug:
        run_reslug(target_rel_path=args.reslug_target, dry_run=args.dry_run)
    elif args.rebuild:
        run_rebuild()
    elif args.pending:
        manifest = load_manifest()
        pending = [
            (k, v) for k, v in manifest.get("files", {}).items() if not v.get("ingested")
        ]
        if args.json:
            total = len(pending)
            if args.limit is None:
                # No limit: return everything as a "files" array.
                payload = {
                    "event": "pending_list",
                    "count": total,
                    "files": [_pending_entry(k, v) for k, v in pending],
                }
            elif args.limit <= 0:
                payload = {"event": "pending_list", "count": total, "files": []}
            elif args.limit == 1:
                # A limit of 1 returns a single "file" object instead of an array.
                if not pending:
                    payload = {"event": "pending_list", "count": 0, "file": None}
                else:
                    k, v = pending[0]
                    payload = {
                        "event": "pending_list",
                        "count": total,
                        "file": _pending_entry(k, v),
                    }
            else:
                # Return the first N entries as a "files" array.
                payload = {
                    "event": "pending_list",
                    "count": total,
                    "files": [_pending_entry(k, v) for k, v in pending[:args.limit]],
                }
            print(json.dumps(payload))
        else:
            # Console output honors --limit as well.
            total = len(pending)
            n = total if args.limit is None else max(0, args.limit)
            print(f"=== Pending Ingest Files ({total}) ===\n")
            if n == 0:
                print(" (no items to show)")
            else:
                for i, (path, info) in enumerate(pending[:n], 1):
                    print(f"{i:3}. {path}")
    elif args.reset_failed:
        manifest = load_manifest()
        reset_count = 0
        for k, v in manifest.get("files", {}).items():
            if v.get("error"):
                v["ingested"] = False
                v.pop("error", None)
                # Keep the key and store null, as the manifest format documents.
                v["ingested_at"] = None
                reset_count += 1
        if reset_count > 0:
            save_manifest(manifest)
            print(f"Reset {reset_count} failed entries to pending.")
        else:
            print("No failed entries found.")
    elif args.check:
        run_check()
    elif args.sync:
        run_sync(dry_run=False, verbose=args.verbose, json_mode=args.json)
    else:
        parser.print_help()
        print("\nExamples:")
        print(" python tools/sync.py --check      # preview changes")
        print(" python tools/sync.py --sync       # run the sync")
        print(" python tools/sync.py --sync -v    # verbose mode")
        print(" python tools/sync.py --rebuild    # rebuild the index")