llm-wiki-agent/tools/sync.py

#!/usr/bin/env python3
"""
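Wiki ↔ Raw three-way sync tool
================================================================================

Overview
--------
This script keeps raw/ (the original-document layer) and wiki/ (the knowledge-base
layer) in sync. It uses tools/manifest.json to track each raw file's hash,
ingestion status, and slug mapping, so that a coding agent knows exactly which
files need to be (re-)ingested into the wiki.

Core features
-------------
  1. Scan the .md files under raw/, compare them with the manifest, and detect
     additions and deletions (updated files are no longer detected automatically)
  2. Maintain the state map in tools/manifest.json (hash, slug, ingested, etc.)
  3. Mark a single file as ingested, as a callback for the ingestion workflow
  4. Batch-normalize the slugs in the manifest (reslug)
  5. Rebuild wiki/index.md from the manifest (fallback)
  6. Detect orphan entity/concept pages (report only, never delete)
  7. Fix the Source File link in source pages, in bulk or one page at a time,
     so that it matches the raw path recorded in the manifest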

--------------------------------------------------------------------------------
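CLI usage
--------------------------------------------------------------------------------

Basic operations:
    python tools/sync.py --check
        Preview the differences between raw/ and the manifest (additions/deletions)
        without writing any file. Output is Markdown, suited to human reading.

    python tools/sync.py --sync
        Run a full sync: write the changes in raw/ to the manifest and report orphan pages.
        By default only additions and deletions are handled; a content change in an
        existing file does not reset its ingested flag.

    python tools/sync.py --sync -v / --verbose
        Same as above, but also lists every new/deleted file and the orphan list.

    python tools/sync.py --pending
        List every file in the manifest with ingested=false (human-readable format).

    python tools/sync.py --pending --json
        Print the pending list as single-line JSON for scripts/agents.

    python tools/sync.py --pending --json --limit 1
        Return only the first pending file (as a "file" field rather than a "files" array).

    python tools/sync.py --pending --json --limit N
        Return the first N pending files (as a "files" array).

    python tools/sync.py --json
        Used together with --sync: emit every event as a JSON Lines stream for easy parsing.

    python tools/sync.py --rebuild
        Rebuild wiki/index.md from the manifest. A fallback for a corrupted or missing index.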

Source File link fixes:
    python tools/sync.py --fix-source-links
        Scan every manifest entry and fix the link under `## Source File` in the
        corresponding source page. The target format is always: - [[raw/.../your-file.md]]

    python tools/sync.py --fix-source-links --fix-source-target "raw/dir/file.md"
        Fix only the source page of the given raw entry (useful as a per-file check after each ingest).

    python tools/sync.py --fix-source-links --dry-run
        Preview how many pages would change, without writing any file.

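Marking ingestion status:
    python tools/sync.py --mark-ingested "raw/dir/file.md" --slug my-slug
        Mark the given raw file as ingested and update slug, source_path, and ingested_at.
        This is the last step of the ingestion workflow; call it after
        wiki/sources/<slug>.md has been fully written.

    python tools/sync.py --mark-ingested "raw/dir/file.md" --slug my-slug --mark-json
        Same as above, but prints the result as single-line JSON (for scripts).

    python tools/sync.py --reset-failed
        Reset every manifest entry that carries an error marker to ingested=false
        (putting it back in the pending queue).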

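Slug management:
    python tools/sync.py --reslug
        Batch-normalize the slug and source_path of every manifest entry.
        Rules: keep Chinese characters as-is, lowercase ASCII uppercase letters,
        turn special characters into `-`, and collapse consecutive `-`.

    python tools/sync.py --reslug --reslug-target "raw/dir/file.md"
        Normalize the slug of the given file only.

    python tools/sync.py --reslug --dry-run
        Preview the reslug changes without writing the manifest.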

--------------------------------------------------------------------------------
manifest.json format
--------------------------------------------------------------------------------

Path: tools/manifest.json (same directory as this script)

Top-level structure:
{
  "version": 1,                         // format version, currently fixed at 1
  "updated_at": "2024-01-15T08:00:00Z", // last update time (UTC ISO 8601), refreshed on every write
  "files": { ... }                      // key = raw file path relative to the repository root
}

Structure of each record in files:
{
  "raw/dir/my-paper.md": {
    "hash":         "a3f1c2d4e5b6a7b8",  // first 16 hex characters of the sha256 digest, used to detect content changes
    "modified":     "2024-01-15T07:00:00Z", // mtime of the raw file (UTC ISO 8601)
    "slug":         "my-paper",           // wiki page slug, used to build source_path
    "source_path":  "wiki/sources/my-paper.md", // path of the corresponding wiki source page
    "ingested":     true,                 // false = pending; true = ingested
    "ingested_at":  "2024-01-15T08:00:00Z", // ingestion completion time (null means not ingested)
    "error":        "..."                 // optional, error message recorded when ingestion fails
  }
}
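
As a minimal sketch of how a consumer might read this structure, the records can
be filtered on the `ingested` flag. The helper name `pending_entries` and the
sample data below are hypothetical and not part of this script.

```python
# Hypothetical helper, not part of sync.py: list raw paths that still need ingestion.
def pending_entries(manifest):
    files = manifest.get("files", {})
    # An entry counts as pending when "ingested" is false or missing.
    return sorted(path for path, info in files.items() if not info.get("ingested"))


# Sample manifest that follows the structure documented above (values are made up).
example_manifest = {
    "version": 1,
    "updated_at": "2024-01-15T08:00:00Z",
    "files": {
        "raw/dir/my-paper.md": {
            "hash": "a3f1c2d4e5b6a7b8", "slug": "my-paper",
            "source_path": "wiki/sources/my-paper.md",
            "ingested": True, "ingested_at": "2024-01-15T08:00:00Z",
        },
        "raw/dir/new-note.md": {
            "hash": "0011223344556677", "slug": "new-note",
            "source_path": "wiki/sources/new-note.md",
            "ingested": False, "ingested_at": None,
        },
    },
}

print(pending_entries(example_manifest))
```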

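State transitions:
  A new file is detected by --sync
      → ingested=false, ingested_at=null
  The ingestion workflow finishes and calls --mark-ingested
      → ingested=true, ingested_at=<current UTC time>
  The default sync strategy does not handle content changes in existing files
      → an ingested file is never reset by update detection (avoids repeated ingestion)
  When ingestion fails, an external process writes the error field
      → clear it with --reset-failed to put the file back in the pending queue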
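
The transitions above can be sketched as small pure functions over one manifest
record. This is an illustration under assumptions, not the script's own code: the
helper names are hypothetical, and dropping the `error` key on reset is an
assumption based on the description of --reset-failed.

```python
from datetime import datetime, timezone


# Hypothetical helpers mirroring the documented transitions; sync.py may differ in detail.
def new_entry(file_hash, slug):
    # State right after --sync detects a new raw file.
    return {
        "hash": file_hash,
        "slug": slug,
        "source_path": "wiki/sources/" + slug + ".md",
        "ingested": False,
        "ingested_at": None,
    }


def after_mark_ingested(entry):
    # State after the ingestion workflow calls --mark-ingested.
    updated = dict(entry)
    updated["ingested"] = True
    updated["ingested_at"] = datetime.now(timezone.utc).isoformat()
    return updated


def after_reset_failed(entry):
    # Assumption: a reset drops the error field and returns the entry to pending.
    updated = {key: value for key, value in entry.items() if key != "error"}
    updated["ingested"] = False
    updated["ingested_at"] = None
    return updated


entry = new_entry("a3f1c2d4e5b6a7b8", "my-paper")
done = after_mark_ingested(entry)
failed = dict(entry, error="ingest timed out")
retry = after_reset_failed(failed)
```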

--------------------------------------------------------------------------------
JSON output format (--json / --mark-json / --pending --json)
--------------------------------------------------------------------------------

Each output line is one standalone JSON object (JSON Lines). Possible event types:

  {"event": "pending",         "rel_path": "...", "slug": "...", "action": "new"}
  {"event": "recovered",       "rel_path": "...", "slug": "...", "action": "recovered"}
  {"event": "deleted_detected","rel_path": "..."}
  {"event": "sync_complete",   "summary": {"pending": N, "recovered": N, "deleted": N, "manifest_entries": N},
                                "pending_files": [...], "deleted_files": [...]}
                                // when nothing changed, summary holds only pending, deleted, and
                                // manifest_entries, and the two file lists are omitted
  {"event": "pending_list",    "count": N, "files": [...]}           // --pending --json --limit N
  {"event": "pending_list",    "count": N, "file": {...}}            // --pending --json --limit 1
  {"event": "mark_ingested",   "rel_path": "...", "slug": "...",
                                "source_path": "...", "modified": "...", "ingested_at": "..."}
  {"event": "fix_source_links_complete", "summary": {...}, "details": [...]}
  {"event": "error",           "message": "..."}
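
A consumer of this stream could group the lines by their "event" field. The
sketch below assumes the lines have already been captured as a list of strings;
`group_events` and the sample lines are hypothetical.

```python
import json


# Hypothetical consumer: group JSON Lines output by event type.
def group_events(lines):
    grouped = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        obj = json.loads(line)
        grouped.setdefault(obj.get("event", "unknown"), []).append(obj)
    return grouped


# Made-up sample in the shapes listed above.
sample_lines = [
    '{"event": "pending", "rel_path": "raw/a.md", "slug": "a", "action": "new"}',
    '{"event": "pending", "rel_path": "raw/b.md", "slug": "b", "action": "new"}',
    '{"event": "deleted_detected", "rel_path": "raw/old.md"}',
    '{"event": "sync_complete", "summary": {"pending": 2, "deleted": 1, "manifest_entries": 7}}',
]

grouped = group_events(sample_lines)
pending_paths = [event["rel_path"] for event in grouped.get("pending", [])]
```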

--------------------------------------------------------------------------------
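Internal functions
--------------------------------------------------------------------------------

  sha256_file(path)
      Compute the file's sha256 and return the first 16 hex characters, for quick change detection.

  load_manifest() / save_manifest(manifest)
      Read/write tools/manifest.json; returns an empty manifest when the file is missing or corrupted.

  scan_raw()
      Recursively scan every .md file under raw/ and return {rel_path: {hash, modified, size, abs_path}}.

  build_slug_from_path(rel_path)
      Build a basic slug from the raw file path (keeps Chinese; spaces/special characters become `-`).
      Note: --reslug uses the stricter _compute_normalized_slug() rules.

  check_changes(manifest, raw_files)
      Compare the manifest with the actual files; currently reports additions and
      deletions only (updated detection is turned off).

  run_sync(dry_run, verbose, json_mode)
      Run the full sync logic, update the manifest, and trigger the orphan report.

  run_check()
      Read-only comparison that prints the difference report as Markdown and modifies no file.

  run_rebuild()
      Walk every manifest entry and rebuild wiki/index.md, with tolerant path matching and orphan detection.

  find_orphan_entity_concept(manifest)
      Scan the [[wikilinks]] in wiki/sources/*.md and find entity/concept pages that are not referenced.

  mark_ingested(rel_path, slug, json_mode)
      Mark the given raw file as ingested and update slug, source_path, hash, and ingested_at.
      rel_path must already exist in the manifest (run --sync before --mark-ingested).

  run_reslug(target_rel_path, dry_run)
      Normalize slug/source_path in the manifest, in bulk or for a single entry,
      using the _compute_normalized_slug() rules for special characters.

  run_fix_source_links(target_rel_path, dry_run, json_mode)
      Fix the raw path link under `## Source File` in source pages, based on the manifest;
      supports full and single-file modes.

  _compute_normalized_slug(rel_path)
      Core slug normalization rules:
        a. Chinese characters are kept as-is
        b. ASCII uppercase letters are lowercased
        c. Spaces, punctuation, and special symbols are replaced with `-`
        d. Consecutive `-` are collapsed into one, and leading/trailing `-` are removed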
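
Rules a to d can be illustrated with a simplified sketch. It is not the script's
implementation: _compute_normalized_slug() replaces an explicit list of
characters via a regular expression, whereas this hypothetical `sketch_slug`
treats every character that is neither alphanumeric nor an underscore as a
separator, so the two can differ on unusual symbols.

```python
# Simplified, hypothetical illustration of rules a-d; not the regex used by sync.py.
def sketch_slug(stem):
    out = []
    for ch in stem.lower():              # rule b: lowercase ASCII letters
        if ch.isalnum() or ch == "_":    # rule a: CJK characters count as alphanumeric
            out.append(ch)
        elif not out or out[-1] != "-":  # rules c and d: one `-` per run of separators
            out.append("-")
    return "".join(out).strip("-") or "untitled"  # rule d: trim leading/trailing `-`


print(sketch_slug("My Paper (v2)"))
print(sketch_slug("深度学习 入门"))
```

Collapsing separators while building the result avoids a second pass, and the
final fallback mirrors the "untitled" default used elsewhere in this script.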

--------------------------------------------------------------------------------
Typical workflow (for agents)
--------------------------------------------------------------------------------

  1. Check whether any file is pending ingestion:
       python tools/sync.py --pending --json --limit 1

  2. Sync raw changes into the manifest:
       python tools/sync.py --sync

  3. Mark the file once ingestion is done:
       python tools/sync.py --mark-ingested "raw/papers/my-paper.md" --slug my-paper

  4. Fix slug naming:
       python tools/sync.py --reslug --dry-run   # preview
       python tools/sync.py --reslug             # apply

  5. Fix Source File links in bulk:
       python tools/sync.py --fix-source-links --dry-run
       python tools/sync.py --fix-source-links

  6. Per-file check after ingest:
       python tools/sync.py --fix-source-links --fix-source-target "raw/papers/my-paper.md"

  7. Rebuild the index when it is corrupted:
       python tools/sync.py --rebuild
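
An agent driving steps 1 and 3 might wrap them as below. This is a sketch under
assumptions: `next_pending` and `mark_ingested_argv` are hypothetical helpers,
and the "rel_path" and "slug" keys inside each pending item are assumed rather
than confirmed by this documentation. Only the "file" / "files" distinction and
the command-line flags come from the sections above.

```python
import json


# Hypothetical helper: pick the next item from a pending_list event line.
def next_pending(event_line):
    event = json.loads(event_line)
    if event.get("event") != "pending_list":
        return None
    if "file" in event:            # shape returned with --limit 1
        return event["file"]
    files = event.get("files") or []
    return files[0] if files else None   # shape returned with --limit N


# Hypothetical helper: build the argv for step 3, suitable for subprocess.run.
def mark_ingested_argv(rel_path, slug):
    return ["python", "tools/sync.py", "--mark-ingested", rel_path,
            "--slug", slug, "--mark-json"]


# Made-up event line; the item fields are assumptions.
sample = '{"event": "pending_list", "count": 1, "file": {"rel_path": "raw/papers/my-paper.md", "slug": "my-paper"}}'
item = next_pending(sample)
argv = mark_ingested_argv(item["rel_path"], item["slug"])
```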
"""

import json
import hashlib
import argparse
from pathlib import Path
from datetime import datetime, timezone


REPO_ROOT = Path(__file__).parent.parent.resolve()
WIKI_DIR = REPO_ROOT / "wiki"
MANIFEST_FILE = Path(__file__).parent / "manifest.json"


# ─── Utility functions ───────────────────────────────────────────────

def green(text):
    return f"\033[92m{text}\033[0m"

def yellow(text):
    return f"\033[93m{text}\033[0m"

def red(text):
    return f"\033[91m{text}\033[0m"

def dim(text):
    return f"\033[2m{text}\033[0m"

def bold(text):
    return f"\033[1m{text}\033[0m"


def log(msg, style="normal"):
    prefixes = {
        "normal":   "  ",
        "info":     "  ℹ ",
        "success":  "  ✓ ",
        "warn":     "  ⚠ ",
        "error":    "  ✗ ",
        "section":  "\n── ",
    }
    print(f"{prefixes.get(style, '  ')}{msg}")


def sha256_file(path: Path) -> str:
    h = hashlib.sha256()
    h.update(path.read_bytes())
    return h.hexdigest()[:16]


def iso_now():
    return datetime.now(timezone.utc).isoformat()


def load_manifest() -> dict:
    if MANIFEST_FILE.exists():
        try:
            return json.loads(MANIFEST_FILE.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, IOError):
            pass
    return {"version": 1, "updated_at": iso_now(), "files": {}}


def save_manifest(manifest: dict):
    manifest["updated_at"] = iso_now()
    MANIFEST_FILE.write_text(json.dumps(manifest, ensure_ascii=False, indent=2), encoding="utf-8")


def scan_raw() -> dict[str, dict]:
    """Return {relative_path: {hash, modified, size, abs_path}}."""
    raw_dir = REPO_ROOT / "raw"
    result = {}
    if not raw_dir.exists():
        return result
    for p in raw_dir.rglob("*.md"):
        if p.is_file() and not p.name.startswith("."):
            rel = str(p.relative_to(REPO_ROOT))
            stat = p.stat()
            result[rel] = {
                "hash": sha256_file(p),
                "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
                "size": stat.st_size,
                "abs_path": str(p),
            }
    return result


def build_slug_from_path(rel_path: str) -> str:
    """Build a slug from a relative path (keeps Chinese where possible; kebab-case)."""
    name = Path(rel_path).stem
    name = name.replace(" ", "-").replace("/", "-").replace("\\", "-")
    name = "".join(c if c.isalnum() or c in ("-", "_", "·") else "-" for c in name)
    name = name.strip("-")
    return name or "untitled"


def find_orphan_entity_concept(manifest: dict) -> tuple[list, list]:
    """Find entity and concept pages that no source page references."""
    import re
    wikilink_pattern = re.compile(r"\[\[([^\]]+)\]\]")

    sources_dir = WIKI_DIR / "sources"
    referenced_entities = set()
    referenced_concepts = set()

    if sources_dir.exists():
        for src in sources_dir.glob("*.md"):
            content = src.read_text(encoding="utf-8")
            for link in wikilink_pattern.findall(content):
                name = link.strip()
                if name.startswith("entities/"):
                    referenced_entities.add(Path(name).stem)
                elif name.startswith("concepts/"):
                    referenced_concepts.add(Path(name).stem)
                elif "/" not in name:
                    referenced_entities.add(name)
                    referenced_concepts.add(name)

    orphan_entities = []
    entities_dir = WIKI_DIR / "entities"
    if entities_dir.exists():
        for f in entities_dir.glob("*.md"):
            if f.stem not in referenced_entities:
                orphan_entities.append(f.name)

    orphan_concepts = []
    concepts_dir = WIKI_DIR / "concepts"
    if concepts_dir.exists():
        for f in concepts_dir.glob("*.md"):
            if f.stem not in referenced_concepts:
                orphan_concepts.append(f.name)

    return orphan_entities, orphan_concepts


# ─── Core sync logic ───────────────────────────────────────────────

def check_changes(manifest: dict, raw_files: dict) -> dict:
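    """Compare the manifest with the actual raw files and return the changes.

    Current strategy (narrowed by requirement):
      - Only new / deleted files are detected
      - Updated files are no longer detected by hash (avoids repeated
        ingestion triggered by mtime-only changes)
    """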
    changes = {"new": [], "updated": [], "deleted": [], "unchanged": []}
    manifest_files = manifest.get("files", {})

    for rel_path, info in raw_files.items():
        if rel_path not in manifest_files:
            changes["new"].append({"rel_path": rel_path, **info})
        else:
            # Under the new strategy every existing file counts as unchanged and never enters "updated"
            changes["unchanged"].append(rel_path)

    for rel_path in manifest_files:
        abs_path = REPO_ROOT / rel_path
        if not abs_path.exists():
            changes["deleted"].append({
                "rel_path": rel_path,
                "slug": manifest_files[rel_path].get("slug", build_slug_from_path(rel_path)),
                "source_path": manifest_files[rel_path].get("source_path"),
            })

    return changes


def run_sync(dry_run: bool = False, verbose: bool = False, json_mode: bool = False):
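    """Run the sync while keeping the output concise.

    - Default (neither verbose nor json): print a one-line change summary plus a
      manifest-updated confirmation.
    - verbose=True: also list every new/updated/deleted file (the previous behavior).
    - json_mode=True: keep the machine-friendly JSON stream output.
    """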
    manifest = load_manifest()
    raw_files = scan_raw()
    changes = check_changes(manifest, raw_files)
    new = changes["new"]
    updated = changes["updated"]
    deleted = changes["deleted"]
    total_changes = len(new) + len(updated) + len(deleted)

    if total_changes == 0:
        if json_mode:
            print(json.dumps({"event": "sync_complete", "summary": {"pending": 0, "deleted": 0, "manifest_entries": len(manifest.get("files", {}))}}))
        else:
            log("No changes detected — wiki is up to date.", "success")
        return

    # Non-JSON: short summary (default) or detailed list (verbose)
    if not json_mode:
        log(f"Changes detected: +{len(new)} ~{len(updated)} -{len(deleted)}", "info")
        if verbose:
            if new:
                print("\nNew Files:")
                for f in new:
                    print(f"  {f['rel_path']}")
            if updated:
                print("\nUpdated Files:")
                for f in updated:
                    old = f.get("old_hash")
                    print(f"  {f['rel_path']}" + (f" (was {old})" if old else ""))
            if deleted:
                print("\nDeleted Files:")
                for f in deleted:
                    print(f"  {f['rel_path']}")

    if dry_run:
        log("Dry-run complete. Run with --sync to apply.", "warn")
        return

    # Apply changes (same manifest update logic as before; per-file logs are suppressed unless json_mode or verbose)
    updated_manifest = manifest.copy()
    updated_manifest["files"] = manifest.get("files", {}).copy()
    pending_files = []
    recovered_files = []

    for f in new:
        rel_path = f["rel_path"]
        slug = build_slug_from_path(rel_path)
        source_path = f"wiki/sources/{slug}.md"
        source_file = WIKI_DIR / "sources" / f"{slug}.md"

        # Check whether wiki/sources/<slug>.md already exists (recovery after the manifest was deleted)
        already_ingested = source_file.exists()
        ingested_at = None
        if already_ingested:
            # Use the source file's mtime as an approximation of ingested_at
            try:
                ingested_at = datetime.fromtimestamp(source_file.stat().st_mtime, tz=timezone.utc).isoformat()
            except Exception:
                ingested_at = iso_now()

        if json_mode:
            action = "recovered" if already_ingested else "new"
            print(json.dumps({"event": "pending" if not already_ingested else "recovered", "rel_path": rel_path, "slug": slug, "action": action}))
        if not already_ingested:
            pending_files.append({"rel_path": rel_path, "abs_path": f["abs_path"], "slug": slug, "action": "new"})
        else:
            recovered_files.append({"rel_path": rel_path, "slug": slug, "source_path": source_path})
            if verbose and not json_mode:
                print(f"  ↺ Recovered (source exists): {rel_path} → {source_path}")

        updated_manifest["files"][rel_path] = {
            "hash": f["hash"],
            "modified": f.get("modified"),
            "slug": slug,
            "source_path": source_path,
            "ingested": already_ingested,
            "ingested_at": ingested_at,
        }

    for f in updated:
        rel_path = f["rel_path"]
        old_entry = manifest["files"].get(rel_path, {})
        slug = old_entry.get("slug") or build_slug_from_path(rel_path)
        if json_mode:
            print(json.dumps({"event": "pending", "rel_path": rel_path, "slug": slug, "action": "updated"}))
        pending_files.append({"rel_path": rel_path, "abs_path": f["abs_path"], "slug": slug, "action": "updated"})
        updated_manifest["files"][rel_path] = {
            **old_entry,
            "hash": f["hash"],
            "modified": f.get("modified"),
            "ingested": False,
            "ingested_at": None,
        }

    deleted_files = []
    for f in deleted:
        rel_path = f["rel_path"]
        # Drop the manifest entry; this loop does not touch the wiki source page.
        updated_manifest["files"].pop(rel_path, None)
        deleted_files.append(rel_path)
        if json_mode:
            print(json.dumps({"event": "deleted_detected", "rel_path": rel_path}))

    save_manifest(updated_manifest)

    if json_mode:
        print(json.dumps({
            "event": "sync_complete",
            "summary": {
                "pending": len(pending_files),
                "recovered": len(recovered_files),
                "deleted": len(deleted_files),
                "manifest_entries": len(updated_manifest["files"]),
            },
            "pending_files": pending_files,
            "deleted_files": deleted_files,
        }))
    else:
        log(f"manifest.json updated ({len(updated_manifest['files'])} entries)", "success")
        if recovered_files:
            log(f"Recovered (source page exists): {len(recovered_files)}", "info")
        if verbose:
            log(f"Pending files for ingestion: {len(pending_files)}", "info")

    # Short orphan report (details are listed only in verbose mode)
    orphan_entities, orphan_concepts = find_orphan_entity_concept(updated_manifest)
    if not json_mode:
        if orphan_entities or orphan_concepts:
            if verbose:
                print(f"\n{bold('--- Orphan Report (kept as requested) ---')}")
                if orphan_entities:
                    print(f"Orphan Entities ({len(orphan_entities)}):")
                    for e in sorted(orphan_entities):
                        print(f"  {e}")
                if orphan_concepts:
                    print(f"Orphan Concepts ({len(orphan_concepts)}):")
                    for c in sorted(orphan_concepts):
                        print(f"  {c}")
            else:
                log(f"Orphan entities: {len(orphan_entities)}; Orphan concepts: {len(orphan_concepts)}", "info")
        else:
            if verbose:
                log("No orphan entity/concept detected.", "success")

    if not json_mode:
        print("\nDone.")


def run_check():
    """Preview changes only, without applying them (output is standard Markdown)."""
    manifest = load_manifest()
    raw_files = scan_raw()
    changes = check_changes(manifest, raw_files)
    total = len(changes["new"]) + len(changes["updated"]) + len(changes["deleted"])

    # Markdown header and summary
    print("# Wiki Sync Check\n")
    print(f"- Raw files: {len(raw_files)}")
    print(f"- Manifest entries: {len(manifest.get('files', {}))}")
    print(f"- New: {len(changes['new'])}")
    print(f"- Updated: {len(changes['updated'])}")
    print(f"- Deleted: {len(changes['deleted'])}\n")

    if total > 0:
        if changes["new"]:
            print("## New Files")
            for f in changes["new"]:
                print(f"- {f['rel_path']}")
            print()
        if changes["updated"]:
            print("## Updated Files")
            for f in changes["updated"]:
                print(f"- {f['rel_path']} (was {f['old_hash']}, now {f['hash']})")
            print()
        if changes["deleted"]:
            print("## Deleted Files")
            for f in changes["deleted"]:
                print(f"- {f['rel_path']}")
            print()
    else:
        print("No changes — wiki is in sync.\n")


def run_rebuild():
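    """Rebuild wiki/index.md from the manifest (fallback).

    Improvements:
    - Prefer the source_path recorded in the manifest (if set and the file really exists),
      then try wiki/sources/<slug>.md, then fall back to a case-insensitive or
      normalized match under wiki/sources (fewer broken links from naming differences).
    - Parse the title field in the YAML frontmatter more robustly (tolerating a missing
      closing delimiter), and fall back to the first Markdown heading or the slug when
      there is no title.
    - When no source file can be found, keep the original slug and mark the entry as
      (source missing) in the index so that it can be checked manually.
    """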
    manifest = load_manifest()
    print(f"\n{bold('=== Wiki Rebuild from Manifest')}\n")
    print(f"  Manifest entries: {len(manifest.get('files', {}))}")
    print(f"  Rebuilding index.md ...\n")

    index_lines = [
        "# Wiki Index\n",
        "\n## Overview\n",
        "- [Overview](overview.md) — living synthesis\n",
        "\n## Sources\n",
    ]

    files = manifest.get("files", {})
    sorted_files = sorted(files.items(), key=lambda x: (x[1].get("ingested_at") or "", x[1].get("modified", "")), reverse=True)

    import re

    sources_dir = WIKI_DIR / "sources"

    def normalize(s: str) -> str:
        # Loose filename matching: drop non-alphanumeric characters and lowercase
        return ''.join(ch for ch in s.lower() if ch.isalnum())

    def find_source_file(slug: str, info: dict, rel_path: str):
        # Prefer the manifest's source_path
        sp = info.get('source_path')
        if sp:
            p = REPO_ROOT / sp
            if p.exists():
                return p
            # If the path is relative to wiki/ (e.g. "sources/foo.md"), try it under WIKI_DIR
            p2 = WIKI_DIR / sp
            if p2.exists():
                return p2

        # Usual location: wiki/sources/<slug>.md
        candidate = sources_dir / f"{slug}.md"
        if candidate.exists():
            return candidate

        # Strip a stray suffix (e.g. the manifest slug mistakenly ends with ".md")
        if slug.endswith('.md'):
            short = slug[:-3]
            c2 = sources_dir / f"{short}.md"
            if c2.exists():
                return c2

        # Case-insensitive or normalized match
        norm_slug = normalize(slug)
        if sources_dir.exists():
            for p in sources_dir.glob('*.md'):
                if p.stem.lower() == slug.lower():
                    return p
                if normalize(p.stem) == norm_slug:
                    return p

        # Last resort: guess the source file name from the manifest's rel_path (the original raw file).
        # Some repositories put source files directly under wiki/sources with a different slug rule.
        try:
            # rel_path 示例: 'raw/dir/name.md' -> use name as candidate
            name = Path(rel_path).stem
            p3 = sources_dir / f"{name}.md"
            if p3.exists():
                return p3
        except Exception:
            pass

        return None

    for rel_path, info in sorted_files:
        slug = info.get("slug") or build_slug_from_path(rel_path)
        # Strip a stray ".md" suffix
        if slug.endswith('.md'):
            slug = slug[:-3]

        src_file = find_source_file(slug, info, rel_path)

        # Take the date prefix (YYYY-MM-DD) from the manifest's ingested_at field; empty if not ingested
        date_raw = info.get("ingested_at") or ""
        date_prefix = ""
        if date_raw:
            try:
                date_prefix = f"[{date_raw[:10]}] "
            except Exception:
                date_prefix = ""

        title = None
        if src_file and src_file.exists():
            content = src_file.read_text(encoding="utf-8")
            lines = content.splitlines()

            # Handle YAML frontmatter (tolerant: ignore it if the closing '---' is missing)
            if lines and lines[0].strip() == '---':
                end_idx = None
                for i in range(1, min(len(lines), 500)):
                    if lines[i].strip() == '---':
                        end_idx = i
                        break
                if end_idx:
                    frontmatter = '\n'.join(lines[1:end_idx])
                    # Supports title: "..." or title: > (simply takes the first line)
                    m = re.search(r'^\s*title\s*:\s*(?:["\']?(.*?)["\']?|>\s*\n\s*(.*))\s*$', frontmatter, flags=re.MULTILINE)
                    if m:
                        title = (m.group(1) or m.group(2) or '').strip()

            # Fallback: the first line that starts with #
            if not title and lines:
                for line in lines:
                    s = line.strip()
                    if s.startswith('#'):
                        title = s.lstrip('#').strip()
                        break

            if not title:
                title = slug

            index_lines.append(f"- {date_prefix}[{title}](sources/{src_file.name})\n")
        else:
            # No source file found; if the manifest has a source_path, show it to help troubleshooting
            sp = info.get('source_path')
            if sp:
                index_lines.append(f"- {date_prefix}[{slug}](sources/{slug}.md) — (expected: {sp} — source missing)\n")
            else:
                index_lines.append(f"- {date_prefix}[{slug}](sources/{slug}.md) — (source missing)\n")

    # Entities index
    index_lines.append("\n## Entities\n")
    entities_dir = WIKI_DIR / "entities"
    if entities_dir.exists():
        entity_files = sorted(entities_dir.glob("*.md"), key=lambda p: p.stem.lower())
        for ef in entity_files:
            index_lines.append(f"- [{ef.stem}](entities/{ef.name})\n")

    # Concepts index
    index_lines.append("\n## Concepts\n")
    concepts_dir = WIKI_DIR / "concepts"
    if concepts_dir.exists():
        concept_files = sorted(concepts_dir.glob("*.md"), key=lambda p: p.stem.lower())
        for cf in concept_files:
            index_lines.append(f"- [{cf.stem}](concepts/{cf.name})\n")

    index_lines.append("\n## Syntheses\n")

    index_file = WIKI_DIR / "index.md"
    index_file.write_text("".join(index_lines), encoding="utf-8")
    print(f"  {green('✓')} index.md rebuilt with {len(sorted_files)} sources")

    # Orphan detection uses the manifest (after the rebuild it can also run against the latest manifest)
    orphan_entities, orphan_concepts = find_orphan_entity_concept(manifest)
    if orphan_entities:
        print(f"  {dim('?')} Orphan entities: {len(orphan_entities)}")
    if orphan_concepts:
        print(f"  {dim('?')} Orphan concepts: {len(orphan_concepts)}")

    print(f"\nDone.")


# ─── Admin interface: fix the Source File link in source pages ───────────────────────────

def _fix_source_file_link_in_content(content: str, raw_rel_path: str) -> tuple[str, bool, str]:
    """Fix the `## Source File` block in a single source page.

    Target format:
      ## Source File
      - [[raw/.../file.md]]

    Returns: (new_content, changed, action)
      action ∈ {"unchanged", "updated", "inserted_line", "inserted_section"}
    """
    expected_line = f"- [[{raw_rel_path}]]"
    lines = content.splitlines()
    had_trailing_newline = content.endswith("\n")

    # 1) Find the `## Source File` heading
    heading_idx = None
    for i, line in enumerate(lines):
        if line.strip().lower() == "## source file":
            heading_idx = i
            break

    # 2) No such block: insert a complete one (preferably right after the frontmatter)
    if heading_idx is None:
        insert_at = 0
        if lines and lines[0].strip() == "---":
            for j in range(1, len(lines)):
                if lines[j].strip() == "---":
                    insert_at = j + 1
                    while insert_at < len(lines) and lines[insert_at].strip() == "":
                        insert_at += 1
                    break

        block = ["## Source File", expected_line, ""]
        new_lines = lines[:insert_at] + block + lines[insert_at:]
        new_content = "\n".join(new_lines)
        if had_trailing_newline or new_content:
            new_content += "\n"
        return new_content, True, "inserted_section"

    # 3) Find the first list item between `## Source File` and the next level-2 heading
    section_end = len(lines)
    for j in range(heading_idx + 1, len(lines)):
        if lines[j].startswith("## "):
            section_end = j
            break

    bullet_idx = None
    for j in range(heading_idx + 1, section_end):
        if lines[j].strip().startswith("- "):
            bullet_idx = j
            break

    if bullet_idx is None:
        # No list item: insert the standard link line directly
        lines.insert(heading_idx + 1, expected_line)
        new_content = "\n".join(lines)
        if had_trailing_newline or new_content:
            new_content += "\n"
        return new_content, True, "inserted_line"

    # 4) A list item exists: replace it with the raw path from the manifest
    current = lines[bullet_idx].strip()
    if current == expected_line:
        return content, False, "unchanged"

    lines[bullet_idx] = expected_line
    new_content = "\n".join(lines)
    if had_trailing_newline or new_content:
        new_content += "\n"
    return new_content, True, "updated"


def run_fix_source_links(target_rel_path: str = None, dry_run: bool = False, json_mode: bool = False):
    """Fix the Source File link in source pages, based on the manifest.

    - Without target_rel_path: scan and fix every entry
    - With target_rel_path: handle a single raw entry (useful as a per-file check after ingest)
    """
    manifest = load_manifest()
    files = manifest.get("files", {})

    if target_rel_path:
        if target_rel_path not in files:
            msg = f"target not found in manifest: {target_rel_path}"
            if json_mode:
                print(json.dumps({"event": "error", "message": msg}))
            else:
                print(red(f"  ✗ {msg}"))
            raise SystemExit(1)
        targets = [(target_rel_path, files[target_rel_path])]
    else:
        targets = list(files.items())

    changed = 0
    unchanged = 0
    skipped_no_source_path = 0
    skipped_source_missing = 0
    details = []

    for rel_path, info in targets:
        source_path = info.get("source_path")
        if not source_path:
            skipped_no_source_path += 1
            details.append({"rel_path": rel_path, "status": "skipped_no_source_path"})
            continue

        src_file = REPO_ROOT / source_path
        if not src_file.exists():
            skipped_source_missing += 1
            details.append({"rel_path": rel_path, "source_path": source_path, "status": "skipped_source_missing"})
            continue

        original = src_file.read_text(encoding="utf-8")
        new_content, did_change, action = _fix_source_file_link_in_content(original, rel_path)

        if did_change:
            changed += 1
            if not dry_run:
                src_file.write_text(new_content, encoding="utf-8")
            details.append({"rel_path": rel_path, "source_path": source_path, "status": "changed", "action": action})
        else:
            unchanged += 1
            details.append({"rel_path": rel_path, "source_path": source_path, "status": "unchanged"})

    summary = {
        "scanned": len(targets),
        "changed": changed,
        "unchanged": unchanged,
        "skipped_no_source_path": skipped_no_source_path,
        "skipped_source_missing": skipped_source_missing,
        "dry_run": dry_run,
    }

    if json_mode:
        print(json.dumps({"event": "fix_source_links_complete", "summary": summary, "details": details}, ensure_ascii=False))
        return

    print(f"\n{bold('=== Fix Source File Links')}\n")
    print(f"  Scanned                 : {summary['scanned']}")
    print(f"  Changed                 : {summary['changed']}")
    print(f"  Unchanged               : {summary['unchanged']}")
    print(f"  Skipped (no source_path): {summary['skipped_no_source_path']}")
    print(f"  Skipped (source missing): {summary['skipped_source_missing']}")
    if dry_run:
        print(f"  {yellow('⚠')} Dry-run only, no file written.")
    else:
        print(f"  {green('✓')} Source File links corrected.")
    print()


# ─── Admin interface: reslug (batch-normalize manifest slugs) ────────────────────────────

def _compute_normalized_slug(rel_path: str) -> str:
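    """Compute a normalized slug from a raw file path.

    Rules:
      a. Chinese characters are kept as-is (not converted to pinyin)
      b. ASCII uppercase letters are lowercased
      c. Spaces and special characters (quotes, slashes, question marks, colons,
         commas, periods, exclamation marks, brackets, full-width symbols, etc.)
         are replaced with `-`
      d. Consecutive `-` are collapsed into a single `-`, and leading/trailing `-` are removed
    """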
    import re
    stem = Path(rel_path).stem

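    # Lowercase (Chinese characters are unaffected)
    result = stem.lower()

    # Replace special characters with `-`
    # Kept: Chinese characters, ASCII letters and digits, underscores.
    # Periods are replaced as well, so a version such as 0.65.0 becomes 0-65-0.
    result = re.sub(
        r'[ \t\r\n'
        r'\'"'  # single and double quotes
        r'／/\\\\'  # slashes (full-width, half-width, backslash)
        r'？?'  # question marks
        r'：:'  # colons
        r'，,'  # commas
        r'。\.'  # periods (the dots in a version number become `-` too)
        r'！!'  # exclamation marks
        r'（）()'  # parentheses
        r'【】\[\]'  # square brackets
        r'《》<>'  # book-title marks / angle brackets
        r'、'  # enumeration comma
        r'—–\-'  # dashes / hyphens (normalized in one pass)
        r'|&@#%\^*+=~`'
        r'；;'  # semicolons
        r']+',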
        '-',
        result,
    )

    # Collapse runs of `-` into a single `-`
    result = re.sub(r'-{2,}', '-', result)

    # Strip leading and trailing `-`
    result = result.strip('-')

    return result or 'untitled'


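def run_reslug(target_rel_path: str = None, dry_run: bool = False):
    """Normalize slug / source_path in the manifest, for all entries or a single one.

    Args:
      target_rel_path: a single raw relative path to process; None processes every entry.
      dry_run: if True, only print a preview and do not write the manifest.
    """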
    manifest = load_manifest()
    files = manifest.get("files", {})

    if target_rel_path:
        targets = [(target_rel_path, files[target_rel_path])] if target_rel_path in files else []
        if not targets:
            print(red(f"  ✗ Not found in manifest: {target_rel_path}"))
            return
    else:
        targets = list(files.items())

    changed = []
    skipped = 0

    for rel_path, info in targets:
        new_slug = _compute_normalized_slug(rel_path)
        old_slug = info.get("slug", "")
        new_source_path = f"wiki/sources/{new_slug}.md"
        old_source_path = info.get("source_path", "")

        if new_slug == old_slug and new_source_path == old_source_path:
            skipped += 1
            continue

        changed.append({
            "rel_path": rel_path,
            "old_slug": old_slug,
            "new_slug": new_slug,
            "old_source_path": old_source_path,
            "new_source_path": new_source_path,
        })

    print(f"\n{bold('=== Reslug Preview' if dry_run else '=== Reslug')}\n")
    print(f"  Total entries scanned : {len(targets)}")
    print(f"  Unchanged (skipped)   : {skipped}")
    print(f"  To update             : {len(changed)}\n")

    if not changed:
        print(f"  {green('✓')} All slugs already normalized.\n")
        return

    for item in changed:
        print(f"  {dim(item['rel_path'])}")
        if item['old_slug'] != item['new_slug']:
            print(f"    slug : {yellow(item['old_slug'])} → {green(item['new_slug'])}")
        if item['old_source_path'] != item['new_source_path']:
            print(f"    src  : {yellow(item['old_source_path'])} → {green(item['new_source_path'])}")
        print()

    if dry_run:
        print(f"  {yellow('⚠')}  Dry-run — manifest NOT updated. Re-run without --dry-run to apply.\n")
        return

    # Apply the changes
    for item in changed:
        entry = files[item["rel_path"]]
        entry["slug"] = item["new_slug"]
        entry["source_path"] = item["new_source_path"]

    save_manifest(manifest)
    print(f"  {green('✓')} manifest.json updated ({len(changed)} entries changed).\n")


# ─── Admin interface: mark_ingested (called by the ingest flow) ────────────────────────

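def mark_ingested(rel_path: str, slug: str, json_mode: bool = False):
    """Mark a raw file as ingested (updates its manifest entry).

    Behavior:
      - rel_path must already exist in the manifest (i.e. it has been scanned
        by --sync); otherwise the function reports an error and exits.
      - slug must be passed explicitly; otherwise the function reports an error and exits.
      - source_path is derived from slug as wiki/sources/<slug>.md.
      - hash and modified are refreshed from the raw file's current content and
        mtime (if the file is missing, the old values are kept and a warning is printed).
      - ingested is set to True and ingested_at to the current UTC timestamp.
      - Any existing error field is removed.

    Args:
      rel_path  : path relative to the repository root, e.g. "raw/dir/name.md" (required)
      slug      : wiki slug, e.g. "my-article" (required)
      json_mode : if True, print single-line JSON for scripts to consume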
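
    The entry update described above can be sketched as a pure function. This
    is an illustration under assumptions, not this function's code:
    `sketch_mark` is a hypothetical name, `datetime.now(timezone.utc)` stands
    in for `iso_now()`, and the hash / mtime refresh is left out.

    ```python
    from datetime import datetime, timezone

    def sketch_mark(entry: dict, slug: str) -> dict:
        # Return an updated copy so the caller's entry is left untouched.
        slug = slug.strip()
        updated = dict(entry, slug=slug, source_path=f"wiki/sources/{slug}.md")
        updated["ingested"] = True
        updated["ingested_at"] = datetime.now(timezone.utc).isoformat()
        updated.pop("error", None)  # a successful ingest clears any earlier error
        return updated

    before = {"slug": "old", "ingested": False, "error": "timeout"}
    print(sketch_mark(before, " my-article "))
    ```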
    """
    if not slug or not slug.strip():
        msg = "--slug is required for --mark-ingested"
        if json_mode:
            print(json.dumps({"event": "error", "message": msg}))
        else:
            print(red(f"  ✗ {msg}"))
        raise SystemExit(1)

    manifest = load_manifest()
    files = manifest.get("files", {})

    if rel_path not in files:
        msg = f"rel_path not found in manifest (run --sync first): {rel_path}"
        if json_mode:
            print(json.dumps({"event": "error", "message": msg}))
        else:
            print(red(f"  ✗ {msg}"))
        raise SystemExit(1)

    entry = files[rel_path]

    # Update slug and source_path
    entry["slug"] = slug.strip()
    entry["source_path"] = f"wiki/sources/{slug.strip()}.md"

    # Refresh hash and modified from the raw file's current content and mtime
    abs_path = REPO_ROOT / rel_path
    if abs_path.exists():
        entry["hash"] = sha256_file(abs_path)
        entry["modified"] = datetime.fromtimestamp(abs_path.stat().st_mtime, tz=timezone.utc).isoformat()
    else:
        if not json_mode:
            print(yellow(f"  ⚠  Raw file not found, hash and modified timestamp not updated: {rel_path}"))

    # Mark as ingested
    entry["ingested"] = True
    entry["ingested_at"] = iso_now()
    entry.pop("error", None)

    files[rel_path] = entry
    manifest["files"] = files
    save_manifest(manifest)

    if json_mode:
        print(json.dumps({
            "event": "mark_ingested",
            "rel_path": rel_path,
            "slug": entry["slug"],
            "source_path": entry["source_path"],
            "modified": entry.get("modified"),
            "ingested_at": entry["ingested_at"],
        }))
    else:
        print(f"  {green('✓')} Marked ingested: {rel_path}")
        print(f"       slug        : {entry['slug']}")
        print(f"       source_path : {entry['source_path']}")
        print(f"       modified    : {entry.get('modified', '(unchanged)')}")
        print(f"       ingested_at : {entry['ingested_at']}")


# ─── CLI entry point ────────────────────────────────────────

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Wiki ↔ Raw three-way sync tool",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--check",
        action="store_true",
        help="Preview changes without syncing",
    )
    parser.add_argument(
        "--sync",
        action="store_true",
        help="Run a full sync (added/deleted files + orphan detection)",
    )
    parser.add_argument(
        "--rebuild",
        action="store_true",
        help="Rebuild wiki/index.md from the manifest (fallback recovery)",
    )
    parser.add_argument(
        "--reset-failed",
        action="store_true",
        help="Reset all failed ingest entries so they become pending again",
    )
    parser.add_argument(
        "--pending",
        action="store_true",
        help="List all files pending ingestion",
    )
    parser.add_argument(
        "--verbose", "-v",
        action="store_true",
        help="Verbose output",
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="JSON-lines output mode (for callers to parse)",
    )
    parser.add_argument(
        "--mark-ingested",
        metavar="REL_PATH",
        nargs=1,
        help="Mark a single raw file as ingested: pass its relative path (e.g. 'raw/dir/file.md'). Must be used with --slug.",
    )
    parser.add_argument(
        "--slug",
        help="Required with --mark-ingested: the wiki slug (e.g. my-article)",
    )
    parser.add_argument(
        "--mark-json",
        action="store_true",
        help="With --mark-ingested: print the mark result as single-line JSON",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="With --pending (with or without --json): limit the number of entries returned (default: all)",
    )
    parser.add_argument(
        "--fix-source-links",
        action="store_true",
        help="Fix the raw path link under `## Source File` in source pages, based on the manifest",
    )
    parser.add_argument(
        "--fix-source-target",
        metavar="REL_PATH",
        help="With --fix-source-links: fix only a single raw entry (e.g. 'raw/AI/file.md')",
    )
    parser.add_argument(
        "--reslug",
        action="store_true",
        help="Batch-normalize slug/source_path in the manifest (keep Chinese, replace special characters with -, lowercase ASCII, collapse repeated -)",
    )
    parser.add_argument(
        "--reslug-target",
        metavar="REL_PATH",
        help="With --reslug: process only the given raw file (e.g. 'raw/dir/file.md')",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="With --reslug or --fix-source-links: preview changes without writing anything",
    )

    args = parser.parse_args()

    if args.mark_ingested:
        rel = args.mark_ingested[0]
        mark_ingested(rel, slug=args.slug, json_mode=args.mark_json)
    elif args.fix_source_links:
        run_fix_source_links(
            target_rel_path=args.fix_source_target,
            dry_run=args.dry_run,
            json_mode=args.json,
        )
    elif args.reslug:
        run_reslug(target_rel_path=args.reslug_target, dry_run=args.dry_run)
    elif args.rebuild:
        run_rebuild()
    elif args.pending:
        manifest = load_manifest()
        pending = [(k, v) for k, v in manifest["files"].items() if not v.get("ingested")]
        if args.json:
            total = len(pending)
            # No limit given -> return everything (files list)
            if args.limit is None:
                payload = {
                    "event": "pending_list",
                    "count": total,
                    "files": [
                        {
                            "rel_path": k,
                            "slug": v.get("slug", build_slug_from_path(k)),
                            "source_path": v.get("source_path"),
                            "modified": v.get("modified"),
                            "hash": v.get("hash"),
                        }
                        for k, v in pending
                    ],
                }
            elif args.limit <= 0:
                payload = {"event": "pending_list", "count": total, "files": []}
            elif args.limit == 1:
                first = pending[0] if pending else (None, None)
                if first[0] is None:
                    payload = {"event": "pending_list", "count": 0, "file": None}
                else:
                    k, v = first
                    payload = {
                        "event": "pending_list",
                        "count": total,
                        "file": {
                            "rel_path": k,
                            "slug": v.get("slug", build_slug_from_path(k)),
                            "source_path": v.get("source_path"),
                            "modified": v.get("modified"),
                            "hash": v.get("hash"),
                        },
                    }
            else:
                # Return the first N entries as a files array
                n = min(args.limit, total)
                payload = {
                    "event": "pending_list",
                    "count": total,
                    "files": [
                        {
                            "rel_path": k,
                            "slug": v.get("slug", build_slug_from_path(k)),
                            "source_path": v.get("source_path"),
                            "modified": v.get("modified"),
                            "hash": v.get("hash"),
                        }
                        for k, v in pending[:n]
                    ],
                }
            print(json.dumps(payload))
        else:
            # Console output also honors --limit
            total = len(pending)
            n = total if args.limit is None else max(0, args.limit)
            print(f"=== Pending Ingest Files ({total}) ===\n")
            if n == 0:
                print("  (no items to show)")
            else:
                for i, (path, info) in enumerate(pending[:n], 1):
                    print(f"{i:3}. {path}")
    elif args.reset_failed:
        manifest = load_manifest()
        reset_count = 0
        for k, v in manifest["files"].items():
            if v.get("error"):
                v["ingested"] = False
                v.pop("error", None)
                v.pop("ingested_at", None)
                reset_count += 1
        if reset_count > 0:
            save_manifest(manifest)
            print(f"Reset {reset_count} failed entries to pending.")
        else:
            print("No failed entries found.")
    elif args.check:
        run_check()
    elif args.sync:
        run_sync(dry_run=False, verbose=args.verbose, json_mode=args.json)
    else:
        parser.print_help()
        print("\nExamples:")
        print("  python tools/sync.py --check       # preview changes")
        print("  python tools/sync.py --sync        # run a sync")
        print("  python tools/sync.py --sync -v     # verbose mode")
        print("  python tools/sync.py --rebuild     # rebuild the index")