llm-wiki-agent/tools/sync.py

#!/usr/bin/env python3
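"""
Wiki ↔ Raw three-way sync tool
================================================================================
Overview
--------
This script keeps raw/ (the original-document layer) and wiki/ (the knowledge-base
layer) in sync. It uses tools/manifest.json to track each raw file's hash,
ingestion state, and slug mapping, so that a coding agent knows exactly which
files still need to be (re-)ingested into the wiki.

Core features
-------------
1. Scan the .md files under raw/, compare them with the manifest, and detect
   new/deleted files (updated files are no longer detected automatically).
2. Maintain the state mapping in tools/manifest.json (hash, slug, ingested, etc.).
3. Mark a single file as "ingested", as a callback for the ingestion workflow.
4. Batch-normalize the slugs in the manifest (reslug).
5. Rebuild wiki/index.md from the manifest (fallback).
6. Detect orphan entity/concept pages (report only, never delete).
7. Fix the Source File link in source pages, in batch or one at a time
   (aligning it with the raw path recorded in the manifest).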
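--------------------------------------------------------------------------------
CLI usage
--------------------------------------------------------------------------------
Basic operations:
  python tools/sync.py --check
      Preview the differences between raw/ and the manifest (new/deleted)
      without writing any file. Output is Markdown, meant for human reading.
  python tools/sync.py --sync
      Run a full sync: write the changes in raw/ to the manifest and report
      orphan pages. By default only new/deleted files are handled; a content
      change in an existing file does not reset ingested automatically.
  python tools/sync.py --sync -v / --verbose
      Same as above, but also lists every new/deleted file and the orphan list.
  python tools/sync.py --pending
      List every manifest entry with ingested=false (human-readable).
  python tools/sync.py --pending --json
      Print the pending list as single-line JSON, for scripts/agents.
  python tools/sync.py --pending --json --limit 1
      Return only the first pending file (a "file" field instead of a "files" array).
  python tools/sync.py --pending --json --limit N
      Return the first N pending files (a "files" array).
  python tools/sync.py --sync --json
      --json combined with --sync: emit all events as a JSON Lines stream
      for programmatic parsing.
  python tools/sync.py --rebuild
      Rebuild wiki/index.md from the manifest. A fallback for when the index
      is corrupted or lost.

Fixing Source File links:
  python tools/sync.py --fix-source-links
      Scan every manifest entry and batch-fix the link under `## Source File`
      in the matching source page.
      The target format is always: - [[raw/.../your-file.md]]
  python tools/sync.py --fix-source-links --fix-source-target "raw/dir/file.md"
      Fix only the source page of the given raw entry (useful as a
      single-file check after each ingest).
  python tools/sync.py --fix-source-links --dry-run
      Preview the number of changes without writing any file.

Marking ingestion state:
  python tools/sync.py --mark-ingested "raw/dir/file.md" --slug my-slug
      Mark the given raw file as ingested and update slug, source_path, and
      ingested_at. This is the last step of the ingestion workflow; call it
      after wiki/sources/<slug>.md has been fully written.
  python tools/sync.py --mark-ingested "raw/dir/file.md" --slug my-slug --mark-json
      Same as above, but prints the result as single-line JSON (for scripts).
  python tools/sync.py --reset-failed
      Reset every manifest entry that carries an error marker to
      ingested=false, putting it back into the pending queue.

Slug management:
  python tools/sync.py --reslug
      Batch-normalize slug and source_path for every manifest entry.
      Rules: CJK characters are kept, ASCII uppercase is lowercased, special
      characters become `-`, and consecutive `-` are collapsed.
  python tools/sync.py --reslug --reslug-target "raw/dir/file.md"
      Normalize the slug of the given file only.
  python tools/sync.py --reslug --dry-run
      Preview the reslug changes without writing the manifest.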
--------------------------------------------------------------------------------
manifest.json format
--------------------------------------------------------------------------------
Path: tools/manifest.json (same directory as this script)

Top-level structure:
  {
    "version": 1,                          // format version, currently always 1
    "updated_at": "2024-01-15T08:00:00Z",  // last update time (UTC, ISO 8601), refreshed on every write
    "files": { ... }                       // key = raw file path relative to the repository root
  }

Structure of each record in files:
  {
    "raw/dir/my-paper.md": {
      "hash": "a3f1c2d4e5b6a7b8",                  // first 16 hex characters of the sha256, used to detect content changes
      "modified": "2024-01-15T07:00:00Z",          // mtime of the raw file (UTC, ISO 8601)
      "slug": "my-paper",                          // wiki page slug, used to derive source_path
      "source_path": "wiki/sources/my-paper.md",   // path of the matching wiki source page
      "ingested": true,                            // false = pending, true = ingested
      "ingested_at": "2024-01-15T08:00:00Z",       // ingestion time (null means not ingested)
      "error": "..."                               // optional, error message recorded when ingestion fails
    }
  }
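
As a rough illustration of the record layout above, the sketch below builds one
files entry for a newly discovered raw file. `short_hash` and `new_entry` are
hypothetical helper names used only for this example; they are not functions of
this script, and only the field names are taken from the format described above.

```python
import hashlib
from datetime import datetime, timezone


def short_hash(data: bytes) -> str:
    # First 16 hex characters of the SHA-256 digest, as described for "hash".
    return hashlib.sha256(data).hexdigest()[:16]


def new_entry(data: bytes, mtime: float, slug: str) -> dict:
    # Hypothetical helper: builds a not-yet-ingested record for one raw file.
    return {
        "hash": short_hash(data),
        "modified": datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat(),
        "slug": slug,
        "source_path": f"wiki/sources/{slug}.md",
        "ingested": False,
        "ingested_at": None,
    }


entry = new_entry(b"# My paper", 0.0, "my-paper")
print(entry["source_path"])  # wiki/sources/my-paper.md
```

Deriving source_path from the slug keeps the two fields from drifting apart,
which is also why --mark-ingested and --reslug rewrite both together.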
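
State transitions:
  A new file is detected by --sync
      → ingested=false, ingested_at=null
  A new file is detected by --sync but wiki/sources/<slug>.md already exists
  (recovery after the manifest was deleted)
      → ingested=true, ingested_at=<mtime of the source page>
  The ingestion workflow finishes and calls --mark-ingested
      → ingested=true, ingested_at=<current UTC time>
  The current default sync policy does not handle content changes in existing files
      → an ingested file is never reset by updated-detection (avoids repeated ingests)
  On ingestion failure an external process writes the error field
      → --reset-failed clears it and puts the entry back into the pending queue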
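
The transitions above can be pictured as three small pure functions over one
manifest record. This is only a sketch under the assumption that a record is a
plain dict; the function names are hypothetical and the script itself mutates
the manifest in place rather than returning copies.

```python
def on_new_file(entry: dict) -> dict:
    # --sync found a raw file that has no source page yet.
    return {**entry, "ingested": False, "ingested_at": None}


def on_mark_ingested(entry: dict, now: str) -> dict:
    # --mark-ingested: flag the entry and clear any earlier error.
    updated = {**entry, "ingested": True, "ingested_at": now}
    updated.pop("error", None)
    return updated


def on_reset_failed(entry: dict) -> dict:
    # --reset-failed: only entries carrying an error go back to pending.
    if not entry.get("error"):
        return entry
    updated = {**entry, "ingested": False}
    updated.pop("error", None)
    updated.pop("ingested_at", None)
    return updated


record = on_new_file({"slug": "my-paper"})
record = on_mark_ingested(record, "2024-01-15T08:00:00Z")
failed = {**record, "error": "timeout"}
print(on_reset_failed(failed))
```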
--------------------------------------------------------------------------------
JSON output format (--json / --mark-json / --pending --json)
--------------------------------------------------------------------------------
Each output line is one standalone JSON object (JSON Lines). Possible event types:
  {"event": "pending", "rel_path": "...", "slug": "...", "action": "new"}
  {"event": "recovered", "rel_path": "...", "slug": "...", "action": "recovered"}
  {"event": "deleted_detected", "rel_path": "..."}
  {"event": "sync_complete", "summary": {"pending": N, "recovered": N, "deleted": N, "manifest_entries": N},
   "pending_files": [...], "deleted_files": [...]}
  {"event": "pending_list", "count": N, "files": [...]}   // --pending --json --limit N
  {"event": "pending_list", "count": N, "file": {...}}    // --pending --json --limit 1
  {"event": "mark_ingested", "rel_path": "...", "slug": "...",
   "source_path": "...", "modified": "...", "ingested_at": "..."}
  {"event": "fix_source_links_complete", "summary": {...}, "details": [...]}
  {"event": "error", "message": "..."}
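
A consumer of this stream has to cope with the two pending_list shapes ("file"
for --limit 1, "files" otherwise). The sketch below shows one way to do that;
`parse_events` and `pending_items` are hypothetical names for illustration, and
the sample line is made up to match the format listed above.

```python
import json


def parse_events(output: str) -> list[dict]:
    # One JSON object per non-empty line (JSON Lines).
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def pending_items(event: dict) -> list[dict]:
    # Normalize the two pending_list shapes: "file" (--limit 1) and "files".
    if event.get("event") != "pending_list":
        return []
    if "file" in event:
        return [event["file"]] if event["file"] else []
    return event.get("files", [])


sample = '{"event": "pending_list", "count": 2, "file": {"rel_path": "raw/a.md", "slug": "a"}}'
events = parse_events(sample)
print([item["rel_path"] for item in pending_items(events[0])])  # ['raw/a.md']
```

Returning a list in every case lets the caller loop over the result without
checking which --limit value produced it.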
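--------------------------------------------------------------------------------
Internal functions
--------------------------------------------------------------------------------
sha256_file(path)
    Compute the file's sha256 and return the first 16 hex characters, for
    fast change detection.
load_manifest() / save_manifest(manifest)
    Read/write tools/manifest.json; load_manifest() returns an empty manifest
    when the file is missing or corrupted.
scan_raw()
    Recursively scan all .md files under raw/ and return
    {rel_path: {hash, modified, size, abs_path}}.
build_slug_from_path(rel_path)
    Build a basic slug from the raw file path (CJK characters are kept,
    spaces/special characters become `-`).
    Note: --reslug uses the stricter _compute_normalized_slug() rules.
check_changes(manifest, raw_files)
    Compare the manifest with the actual files; currently reports new/deleted
    files only (updated detection is turned off).
run_sync(dry_run, verbose, json_mode)
    Run the full sync logic, update the manifest, and trigger the orphan report.
run_check()
    Read-only comparison; prints a Markdown diff report and modifies nothing.
run_rebuild()
    Walk every manifest entry and rebuild wiki/index.md, with tolerant path
    matching and orphan detection.
find_orphan_entity_concept(manifest)
    Scan the [[wikilinks]] in wiki/sources/*.md and find entity/concept pages
    that are never referenced.
mark_ingested(rel_path, slug, json_mode)
    Mark the given raw file as ingested and update slug, source_path, hash,
    and ingested_at.
    rel_path must already exist in the manifest (run --sync before --mark-ingested).
run_reslug(target_rel_path, dry_run)
    Normalize slug/source_path in the manifest, in batch or for one entry,
    using the _compute_normalized_slug() rules for special characters.
run_fix_source_links(target_rel_path, dry_run, json_mode)
    Fix the raw path link under `## Source File` in source pages, based on the
    manifest; supports full and single-file modes.
_compute_normalized_slug(rel_path)
    Core slug normalization rules:
      a. CJK characters are kept as-is
      b. ASCII uppercase letters are lowercased
      c. Spaces, punctuation, and special symbols are replaced with `-`
      d. Runs of `-` are collapsed into one; leading/trailing `-` are removed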
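
Rules a to d can be approximated without a regular expression. The sketch below
is a simplified stand-in, not the script's own pattern: it treats every
character that is not a letter, digit, or underscore as a separator, whereas
_compute_normalized_slug() only replaces an explicit list of characters.
`sketch_slug` is a hypothetical name.

```python
from pathlib import PurePosixPath


def sketch_slug(rel_path: str) -> str:
    # Rule b: lowercase the file stem (CJK characters are unaffected).
    stem = PurePosixPath(rel_path).stem.lower()
    # Rules a and c: keep letters/digits (CJK included) and "_", map the rest to "-".
    mapped = "".join(c if c.isalnum() or c == "_" else "-" for c in stem)
    # Rule d: dropping empty pieces collapses runs of "-" and trims both ends.
    return "-".join(part for part in mapped.split("-") if part) or "untitled"


print(sketch_slug("raw/notes/My Paper: Draft (v2).md"))  # my-paper-draft-v2
```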
--------------------------------------------------------------------------------
Typical workflow (for agents)
--------------------------------------------------------------------------------
1. Check whether any file is waiting to be ingested:
     python tools/sync.py --pending --json --limit 1
2. Sync the changes in raw/ to the manifest:
     python tools/sync.py --sync
3. Mark the file once ingestion is done:
     python tools/sync.py --mark-ingested "raw/papers/my-paper.md" --slug my-paper
4. Repair slug naming:
     python tools/sync.py --reslug --dry-run   # preview
     python tools/sync.py --reslug             # apply
5. Batch-fix Source File links:
     python tools/sync.py --fix-source-links --dry-run
     python tools/sync.py --fix-source-links
6. Single-file check after an ingest:
     python tools/sync.py --fix-source-links --fix-source-target "raw/papers/my-paper.md"
7. Rebuild the index when it is corrupted:
     python tools/sync.py --rebuild
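
For an agent, steps 1 to 3 and 6 can be chained into one loop iteration. The
sketch below assumes a caller-supplied `run` function that executes sync.py
with the given arguments and returns its stdout (for example a thin subprocess
wrapper) and a caller-supplied `ingest` function that writes the source page;
both, like `next_pending` and `ingest_one`, are hypothetical names. It syncs
first so that newly added raw files are visible to --pending.

```python
import json
from typing import Callable, Optional


def next_pending(run: Callable[[list[str]], str]) -> Optional[dict]:
    # --limit 1 returns a single "file" object (or null when nothing is pending).
    out = run(["--pending", "--json", "--limit", "1"])
    return json.loads(out).get("file")


def ingest_one(run: Callable[[list[str]], str], ingest: Callable[[dict], None]) -> bool:
    run(["--sync"])                      # record new/deleted raw files
    item = next_pending(run)
    if item is None:
        return False                     # nothing left to ingest
    ingest(item)                         # caller writes wiki/sources/<slug>.md
    run(["--mark-ingested", item["rel_path"], "--slug", item["slug"], "--mark-json"])
    run(["--fix-source-links", "--fix-source-target", item["rel_path"]])
    return True
```

Calling ingest_one() repeatedly until it returns False drains the pending
queue one file at a time, which keeps each ingest and its mark step together.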
"""
import json
import hashlib
import argparse
from pathlib import Path
from datetime import datetime, timezone
REPO_ROOT = Path(__file__).parent.parent.resolve()
WIKI_DIR = REPO_ROOT / "wiki"
MANIFEST_FILE = Path(__file__).parent / "manifest.json"
# ─── Utility functions ──────────────────────────────────────


def green(text):
    return f"\033[92m{text}\033[0m"


def yellow(text):
    return f"\033[93m{text}\033[0m"


def red(text):
    return f"\033[91m{text}\033[0m"


def dim(text):
    return f"\033[2m{text}\033[0m"


def bold(text):
    return f"\033[1m{text}\033[0m"


def log(msg, style="normal"):
    """Print msg with the prefix registered for the given style."""
    prefixes = {
        "normal": " ",
        "info": " ",
        "success": "",
        "warn": "",
        "error": "",
        "section": "\n── ",
    }
    print(f"{prefixes.get(style, ' ')}{msg}")


def sha256_file(path: Path) -> str:
    """Return the first 16 hex characters of the file's SHA-256 digest."""
    h = hashlib.sha256()
    h.update(path.read_bytes())
    return h.hexdigest()[:16]


def iso_now():
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def load_manifest() -> dict:
    """Load tools/manifest.json; fall back to an empty manifest if missing or corrupted."""
    if MANIFEST_FILE.exists():
        try:
            return json.loads(MANIFEST_FILE.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, IOError):
            pass
    return {"version": 1, "updated_at": iso_now(), "files": {}}


def save_manifest(manifest: dict):
    """Write the manifest to disk, refreshing updated_at first."""
    manifest["updated_at"] = iso_now()
    MANIFEST_FILE.write_text(json.dumps(manifest, ensure_ascii=False, indent=2), encoding="utf-8")


def scan_raw() -> dict[str, dict]:
    """Return {relative_path: {hash, modified, size, abs_path}}."""
    raw_dir = REPO_ROOT / "raw"
    result = {}
    if not raw_dir.exists():
        return result
    for p in raw_dir.rglob("*.md"):
        if p.is_file() and not p.name.startswith("."):
            rel = str(p.relative_to(REPO_ROOT))
            stat = p.stat()
            result[rel] = {
                "hash": sha256_file(p),
                "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
                "size": stat.st_size,
                "abs_path": str(p),
            }
    return result


def build_slug_from_path(rel_path: str) -> str:
    """Build a slug from a relative path (keeps CJK characters where possible, kebab-case)."""
    name = Path(rel_path).stem
    name = name.replace(" ", "-").replace("/", "-").replace("\\", "-")
    name = "".join(c if c.isalnum() or c in ("-", "_", "·") else "-" for c in name)
    name = name.strip("-")
    return name or "untitled"


def find_orphan_entity_concept(manifest: dict) -> tuple[list, list]:
    """Detect entity and concept pages that no source page references."""
    import re

    wikilink_pattern = re.compile(r"\[\[([^\]]+)\]\]")
    sources_dir = WIKI_DIR / "sources"
    referenced_entities = set()
    referenced_concepts = set()
    if sources_dir.exists():
        for src in sources_dir.glob("*.md"):
            content = src.read_text(encoding="utf-8")
            for link in wikilink_pattern.findall(content):
                name = link.strip()
                if name.startswith("entities/"):
                    referenced_entities.add(Path(name).stem)
                elif name.startswith("concepts/"):
                    referenced_concepts.add(Path(name).stem)
                elif "/" not in name:
                    # A bare [[name]] may point at either kind of page.
                    referenced_entities.add(name)
                    referenced_concepts.add(name)
    orphan_entities = []
    entities_dir = WIKI_DIR / "entities"
    if entities_dir.exists():
        for f in entities_dir.glob("*.md"):
            if f.stem not in referenced_entities:
                orphan_entities.append(f.name)
    orphan_concepts = []
    concepts_dir = WIKI_DIR / "concepts"
    if concepts_dir.exists():
        for f in concepts_dir.glob("*.md"):
            if f.stem not in referenced_concepts:
                orphan_concepts.append(f.name)
    return orphan_entities, orphan_concepts


# ─── Core sync logic ────────────────────────────────────────


def check_changes(manifest: dict, raw_files: dict) -> dict:
    """Compare the manifest with the actual raw files and return the changes.

    Current policy (narrowed down on request):
    - Only new / deleted files are detected.
    - updated is no longer detected from the hash, which avoids repeated
      ingests triggered by mtime-only changes.
    """
    changes = {"new": [], "updated": [], "deleted": [], "unchanged": []}
    manifest_files = manifest.get("files", {})
    for rel_path, info in raw_files.items():
        if rel_path not in manifest_files:
            changes["new"].append({"rel_path": rel_path, **info})
        else:
            # New policy: an existing file always counts as unchanged and never enters updated.
            changes["unchanged"].append(rel_path)
    for rel_path in manifest_files:
        abs_path = REPO_ROOT / rel_path
        if not abs_path.exists():
            changes["deleted"].append({
                "rel_path": rel_path,
                "slug": manifest_files[rel_path].get("slug", build_slug_from_path(rel_path)),
                "source_path": manifest_files[rel_path].get("source_path"),
            })
    return changes


def run_sync(dry_run: bool = False, verbose: bool = False, json_mode: bool = False):
    """Run the sync while keeping the output as terse as possible.

    - Default (neither verbose nor json): prints a one-line change summary plus
      a confirmation that the manifest was updated.
    - verbose=True also lists every new/updated/deleted file (the old behavior).
    - json_mode=True keeps the machine-friendly JSON stream output.
    """
    manifest = load_manifest()
    raw_files = scan_raw()
    changes = check_changes(manifest, raw_files)
    new = changes["new"]
    updated = changes["updated"]
    deleted = changes["deleted"]
    total_changes = len(new) + len(updated) + len(deleted)
    if total_changes == 0:
        if json_mode:
            print(json.dumps({"event": "sync_complete", "summary": {"pending": 0, "deleted": 0, "manifest_entries": len(manifest.get("files", {}))}}))
        else:
            log("No changes detected — wiki is up to date.", "success")
        return

    # Non-JSON output: short summary by default, detailed lists when verbose.
    if not json_mode:
        log(f"Changes detected: +{len(new)} ~{len(updated)} -{len(deleted)}", "info")
        if verbose:
            if new:
                print("\nNew Files:")
                for f in new:
                    print(f" {f['rel_path']}")
            if updated:
                print("\nUpdated Files:")
                for f in updated:
                    old = f.get("old_hash")
                    print(f" {f['rel_path']}" + (f" (was {old})" if old else ""))
            if deleted:
                print("\nDeleted Files:")
                for f in deleted:
                    print(f" {f['rel_path']}")
    if dry_run:
        log("Dry-run complete. Run with --sync to apply.", "warn")
        return

    # Apply changes (same manifest update logic as before, but per-file logs
    # are suppressed unless json_mode or verbose is set).
    updated_manifest = manifest.copy()
    updated_manifest["files"] = manifest.get("files", {}).copy()
    pending_files = []
    recovered_files = []
    for f in new:
        rel_path = f["rel_path"]
        slug = build_slug_from_path(rel_path)
        source_path = f"wiki/sources/{slug}.md"
        source_file = WIKI_DIR / "sources" / f"{slug}.md"
        # Check whether wiki/sources/<slug>.md already exists
        # (recovery scenario after the manifest was deleted).
        already_ingested = source_file.exists()
        ingested_at = None
        if already_ingested:
            # Use the source file's mtime as an approximation of ingested_at.
            try:
                ingested_at = datetime.fromtimestamp(source_file.stat().st_mtime, tz=timezone.utc).isoformat()
            except Exception:
                ingested_at = iso_now()
        if json_mode:
            action = "recovered" if already_ingested else "new"
            print(json.dumps({"event": "pending" if not already_ingested else "recovered", "rel_path": rel_path, "slug": slug, "action": action}))
        if not already_ingested:
            pending_files.append({"rel_path": rel_path, "abs_path": f["abs_path"], "slug": slug, "action": "new"})
        else:
            recovered_files.append({"rel_path": rel_path, "slug": slug, "source_path": source_path})
            if verbose and not json_mode:
                print(f" ↺ Recovered (source exists): {rel_path} → {source_path}")
        updated_manifest["files"][rel_path] = {
            "hash": f["hash"],
            "modified": f.get("modified"),
            "slug": slug,
            "source_path": source_path,
            "ingested": already_ingested,
            "ingested_at": ingested_at,
        }
    for f in updated:
        rel_path = f["rel_path"]
        old_entry = manifest["files"].get(rel_path, {})
        slug = old_entry.get("slug") or build_slug_from_path(rel_path)
        if json_mode:
            print(json.dumps({"event": "pending", "rel_path": rel_path, "slug": slug, "action": "updated"}))
        pending_files.append({"rel_path": rel_path, "abs_path": f["abs_path"], "slug": slug, "action": "updated"})
        updated_manifest["files"][rel_path] = {
            **old_entry,
            "hash": f["hash"],
            "modified": f.get("modified"),
            "ingested": False,
            "ingested_at": None,
        }
    deleted_files = []
    for f in deleted:
        rel_path = f["rel_path"]
        if rel_path in updated_manifest["files"]:
            del updated_manifest["files"][rel_path]
        deleted_files.append(rel_path)
        if json_mode:
            print(json.dumps({"event": "deleted_detected", "rel_path": rel_path}))
    save_manifest(updated_manifest)
    if json_mode:
        print(json.dumps({
            "event": "sync_complete",
            "summary": {
                "pending": len(pending_files),
                "recovered": len(recovered_files),
                "deleted": len(deleted_files),
                "manifest_entries": len(updated_manifest["files"]),
            },
            "pending_files": pending_files,
            "deleted_files": deleted_files,
        }))
    else:
        log(f"manifest.json updated ({len(updated_manifest['files'])} entries)", "success")
        if recovered_files:
            log(f"Recovered (source page exists): {len(recovered_files)}", "info")
        if verbose:
            log(f"Pending files for ingestion: {len(pending_files)}", "info")

    # Short orphan report (details are listed only in verbose mode).
    orphan_entities, orphan_concepts = find_orphan_entity_concept(updated_manifest)
    if not json_mode:
        if orphan_entities or orphan_concepts:
            if verbose:
                print(f"\n{bold('--- Orphan Report (kept as requested) ---')}")
                if orphan_entities:
                    print(f"Orphan Entities ({len(orphan_entities)}):")
                    for e in sorted(orphan_entities):
                        print(f" {e}")
                if orphan_concepts:
                    print(f"Orphan Concepts ({len(orphan_concepts)}):")
                    for c in sorted(orphan_concepts):
                        print(f" {c}")
            else:
                log(f"Orphan entities: {len(orphan_entities)}; Orphan concepts: {len(orphan_concepts)}", "info")
        else:
            if verbose:
                log("No orphan entity/concept detected.", "success")
    if not json_mode:
        print("\nDone.")


def run_check():
    """Preview the changes without applying them (output is plain Markdown)."""
    manifest = load_manifest()
    raw_files = scan_raw()
    changes = check_changes(manifest, raw_files)
    total = len(changes["new"]) + len(changes["updated"]) + len(changes["deleted"])
    # Markdown header and summary
    print("# Wiki Sync Check\n")
    print(f"- Raw files: {len(raw_files)}")
    print(f"- Manifest entries: {len(manifest.get('files', {}))}")
    print(f"- New: {len(changes['new'])}")
    print(f"- Updated: {len(changes['updated'])}")
    print(f"- Deleted: {len(changes['deleted'])}\n")
    if total > 0:
        if changes["new"]:
            print("## New Files")
            for f in changes["new"]:
                print(f"- {f['rel_path']}")
            print()
        if changes["updated"]:
            print("## Updated Files")
            for f in changes["updated"]:
                print(f"- {f['rel_path']} (was {f['old_hash']}, now {f['hash']})")
            print()
        if changes["deleted"]:
            print("## Deleted Files")
            for f in changes["deleted"]:
                print(f"- {f['rel_path']}")
            print()
    else:
        print("No changes — wiki is in sync.\n")


def run_rebuild():
    """Rebuild wiki/index.md from the manifest (fallback).

    Improvements:
    - Prefer the source_path recorded in the manifest (if set and the file
      really exists), then try wiki/sources/<slug>.md, then fall back to a
      case-insensitive or normalized match under wiki/sources (fewer broken
      links caused by naming differences).
    - Parse the title field of the YAML frontmatter more robustly (tolerating
      a missing end marker), and fall back to the first Markdown heading or
      the slug when there is no title.
    - When no source file can be found, keep the original slug and flag the
      index entry with (source missing) for manual follow-up.
    """
    manifest = load_manifest()
    print(f"\n{bold('=== Wiki Rebuild from Manifest')}\n")
    print(f" Manifest entries: {len(manifest.get('files', {}))}")
    print(" Rebuilding index.md ...\n")
    index_lines = [
        "# Wiki Index\n",
        "\n## Overview\n",
        "- [Overview](overview.md) — living synthesis\n",
        "\n## Sources\n",
    ]
    files = manifest.get("files", {})
    sorted_files = sorted(files.items(), key=lambda x: (x[1].get("ingested_at") or "", x[1].get("modified", "")), reverse=True)
    import re

    sources_dir = WIKI_DIR / "sources"

    def normalize(s: str) -> str:
        # For loose file-name matching: drop non-alphanumerics and lowercase.
        return ''.join(ch for ch in s.lower() if ch.isalnum())

    def find_source_file(slug: str, info: dict, rel_path: str):
        # Try manifest.source_path first.
        sp = info.get('source_path')
        if sp:
            p = REPO_ROOT / sp
            if p.exists():
                return p
            # If the path is relative to wiki/ (e.g. "sources/foo.md"), try under WIKI_DIR.
            p2 = WIKI_DIR / sp
            if p2.exists():
                return p2
        # Regular location: wiki/sources/<slug>.md
        candidate = sources_dir / f"{slug}.md"
        if candidate.exists():
            return candidate
        # Try stripping a stray suffix (e.g. the manifest slug wrongly ends in ".md").
        if slug.endswith('.md'):
            short = slug[:-3]
            c2 = sources_dir / f"{short}.md"
            if c2.exists():
                return c2
        # Case-insensitive or normalized match.
        norm_slug = normalize(slug)
        if sources_dir.exists():
            for p in sources_dir.glob('*.md'):
                if p.stem.lower() == slug.lower():
                    return p
                if normalize(p.stem) == norm_slug:
                    return p
        # Last resort: guess the source file name from the manifest rel_path (the raw file).
        # Some repositories place source files directly under wiki/sources with a different slug rule.
        try:
            # rel_path example: 'raw/dir/name.md' -> use name as candidate
            name = Path(rel_path).stem
            p3 = sources_dir / f"{name}.md"
            if p3.exists():
                return p3
        except Exception:
            pass
        return None

    for rel_path, info in sorted_files:
        slug = info.get("slug") or build_slug_from_path(rel_path)
        # Strip a stray suffix.
        if slug.endswith('.md'):
            slug = slug[:-3]
        src_file = find_source_file(slug, info, rel_path)
        # Take the date prefix (YYYY-MM-DD) from the manifest's ingested_at; empty if not ingested.
        date_raw = info.get("ingested_at") or ""
        date_prefix = ""
        if date_raw:
            try:
                date_prefix = f"[{date_raw[:10]}] "
            except Exception:
                date_prefix = ""
        title = None
        if src_file and src_file.exists():
            content = src_file.read_text(encoding="utf-8")
            lines = content.splitlines()
            # Handle YAML frontmatter (tolerant: ignore it if the closing '---' is missing).
            if lines and lines[0].strip() == '---':
                end_idx = None
                for i in range(1, min(len(lines), 500)):
                    if lines[i].strip() == '---':
                        end_idx = i
                        break
                if end_idx:
                    frontmatter = '\n'.join(lines[1:end_idx])
                    # Supports title: "..." or title: > (simple first-line extraction).
                    m = re.search(r'^\s*title\s*:\s*(?:["\']?(.*?)["\']?|>\s*\n\s*(.*))\s*$', frontmatter, flags=re.MULTILINE)
                    if m:
                        title = (m.group(1) or m.group(2) or '').strip()
            # Fallback: the first line starting with '#'.
            if not title and lines:
                for line in lines:
                    s = line.strip()
                    if s.startswith('#'):
                        title = s.lstrip('#').strip()
                        break
            if not title:
                title = slug
            index_lines.append(f"- {date_prefix}[{title}](sources/{src_file.name})\n")
        else:
            # No source file found: if the manifest has a source_path, show it to ease troubleshooting.
            sp = info.get('source_path')
            if sp:
                index_lines.append(f"- {date_prefix}[{slug}](sources/{slug}.md) — (expected: {sp} — source missing)\n")
            else:
                index_lines.append(f"- {date_prefix}[{slug}](sources/{slug}.md) — (source missing)\n")

    # Entities index
    index_lines.append("\n## Entities\n")
    entities_dir = WIKI_DIR / "entities"
    if entities_dir.exists():
        entity_files = sorted(entities_dir.glob("*.md"), key=lambda p: p.stem.lower())
        for ef in entity_files:
            index_lines.append(f"- [{ef.stem}](entities/{ef.name})\n")

    # Concepts index
    index_lines.append("\n## Concepts\n")
    concepts_dir = WIKI_DIR / "concepts"
    if concepts_dir.exists():
        concept_files = sorted(concepts_dir.glob("*.md"), key=lambda p: p.stem.lower())
        for cf in concept_files:
            index_lines.append(f"- [{cf.stem}](concepts/{cf.name})\n")
    index_lines.append("\n## Syntheses\n")
    index_file = WIKI_DIR / "index.md"
    index_file.write_text("".join(index_lines), encoding="utf-8")
    print(f" {green('')} index.md rebuilt with {len(sorted_files)} sources")

    # Orphan detection uses the manifest (after a rebuild it reflects the latest manifest).
    orphan_entities, orphan_concepts = find_orphan_entity_concept(manifest)
    if orphan_entities:
        print(f" {dim('?')} Orphan entities: {len(orphan_entities)}")
    if orphan_concepts:
        print(f" {dim('?')} Orphan concepts: {len(orphan_concepts)}")
    print("\nDone.")


# ─── Admin interface: fix the Source File link in source pages ──────────────


def _fix_source_file_link_in_content(content: str, raw_rel_path: str) -> tuple[str, bool, str]:
    """Fix the `## Source File` section of a single source page.

    Target format:
        ## Source File
        - [[raw/.../file.md]]

    Returns: (new_content, changed, action)
        action ∈ {"unchanged", "updated", "inserted_line", "inserted_section"}
    """
    expected_line = f"- [[{raw_rel_path}]]"
    lines = content.splitlines()
    had_trailing_newline = content.endswith("\n")

    # 1) Find the `## Source File` heading.
    heading_idx = None
    for i, line in enumerate(lines):
        if line.strip().lower() == "## source file":
            heading_idx = i
            break

    # 2) No section: insert a complete one (preferably right after the frontmatter).
    if heading_idx is None:
        insert_at = 0
        if lines and lines[0].strip() == "---":
            for j in range(1, len(lines)):
                if lines[j].strip() == "---":
                    insert_at = j + 1
                    while insert_at < len(lines) and lines[insert_at].strip() == "":
                        insert_at += 1
                    break
        block = ["## Source File", expected_line, ""]
        new_lines = lines[:insert_at] + block + lines[insert_at:]
        new_content = "\n".join(new_lines)
        if had_trailing_newline or new_content:
            new_content += "\n"
        return new_content, True, "inserted_section"

    # 3) Find the first list item between `## Source File` and the next level-2 heading.
    section_end = len(lines)
    for j in range(heading_idx + 1, len(lines)):
        if lines[j].startswith("## "):
            section_end = j
            break
    bullet_idx = None
    for j in range(heading_idx + 1, section_end):
        if lines[j].strip().startswith("- "):
            bullet_idx = j
            break
    if bullet_idx is None:
        # No list item: insert the standard link line directly.
        lines.insert(heading_idx + 1, expected_line)
        new_content = "\n".join(lines)
        if had_trailing_newline or new_content:
            new_content += "\n"
        return new_content, True, "inserted_line"

    # 4) A list item exists: replace it with the raw path from the manifest.
    current = lines[bullet_idx].strip()
    if current == expected_line:
        return content, False, "unchanged"
    lines[bullet_idx] = expected_line
    new_content = "\n".join(lines)
    if had_trailing_newline or new_content:
        new_content += "\n"
    return new_content, True, "updated"


def run_fix_source_links(target_rel_path: str = None, dry_run: bool = False, json_mode: bool = False):
    """Correct the Source File link in source pages, based on the manifest.

    - Without target_rel_path: scan and fix every entry.
    - With target_rel_path: handle a single raw entry only (useful as a
      single-file check after an ingest).
    """
    manifest = load_manifest()
    files = manifest.get("files", {})
    if target_rel_path:
        if target_rel_path not in files:
            msg = f"target not found in manifest: {target_rel_path}"
            if json_mode:
                print(json.dumps({"event": "error", "message": msg}))
            else:
                print(red(f"{msg}"))
            raise SystemExit(1)
        targets = [(target_rel_path, files[target_rel_path])]
    else:
        targets = list(files.items())
    changed = 0
    unchanged = 0
    skipped_no_source_path = 0
    skipped_source_missing = 0
    details = []
    for rel_path, info in targets:
        source_path = info.get("source_path")
        if not source_path:
            skipped_no_source_path += 1
            details.append({"rel_path": rel_path, "status": "skipped_no_source_path"})
            continue
        src_file = REPO_ROOT / source_path
        if not src_file.exists():
            skipped_source_missing += 1
            details.append({"rel_path": rel_path, "source_path": source_path, "status": "skipped_source_missing"})
            continue
        original = src_file.read_text(encoding="utf-8")
        new_content, did_change, action = _fix_source_file_link_in_content(original, rel_path)
        if did_change:
            changed += 1
            if not dry_run:
                src_file.write_text(new_content, encoding="utf-8")
            details.append({"rel_path": rel_path, "source_path": source_path, "status": "changed", "action": action})
        else:
            unchanged += 1
            details.append({"rel_path": rel_path, "source_path": source_path, "status": "unchanged"})
    summary = {
        "scanned": len(targets),
        "changed": changed,
        "unchanged": unchanged,
        "skipped_no_source_path": skipped_no_source_path,
        "skipped_source_missing": skipped_source_missing,
        "dry_run": dry_run,
    }
    if json_mode:
        print(json.dumps({"event": "fix_source_links_complete", "summary": summary, "details": details}, ensure_ascii=False))
        return
    print(f"\n{bold('=== Fix Source File Links')}\n")
    print(f" Scanned : {summary['scanned']}")
    print(f" Changed : {summary['changed']}")
    print(f" Unchanged : {summary['unchanged']}")
    print(f" Skipped (no source_path): {summary['skipped_no_source_path']}")
    print(f" Skipped (source missing): {summary['skipped_source_missing']}")
    if dry_run:
        print(f" {yellow('')} Dry-run only, no file written.")
    else:
        print(f" {green('')} Source File links corrected.")
    print()
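

# ─── Admin interface: reslug (batch-normalize manifest slugs) ───────────────


def _compute_normalized_slug(rel_path: str) -> str:
    """Compute the normalized slug from a raw file path.

    Rules:
    a. CJK characters are kept as-is (no pinyin conversion).
    b. ASCII uppercase letters are lowercased.
    c. Spaces and special characters (quotes, slashes, question marks, colons,
       commas, full stops, exclamation marks, parentheses, full-width symbols,
       etc.) are replaced with `-`.
    d. Runs of `-` are collapsed into a single `-`; leading/trailing `-` are removed.
    """
    import re

    stem = Path(rel_path).stem
    # Lowercase (only affects ASCII letters; CJK characters are unchanged).
    result = stem.lower()
    # Replace special characters with `-`.
    # CJK characters, ASCII letters/digits, and underscores are kept. Note that
    # dots are in the character class below, so a version number such as
    # 0.65.0 becomes 0-65-0.
    # The full-width variants in this class were missing from the extracted
    # text and are restored here from the comments and rule (c).
    result = re.sub(
        r'[ \t\r\n'
        r'\'"'            # single and double quotes
        r'//\\\\'        # slashes (full-width / half-width / backslash)
        r'??'            # question marks
        r'::'            # colons
        r',,'            # commas
        r'。\.'           # full stops (this also replaces dots in version numbers)
        r'!!'            # exclamation marks
        r'()()'        # parentheses
        r'【】\[\]'        # square brackets
        r'《》<>'          # title marks / angle brackets
        r'、'             # ideographic comma
        r'—–\-'           # dashes / hyphens (re-handled uniformly)
        r'|&@#%\^*+=~`'
        r';;'            # semicolons
        r']+',
        '-',
        result,
    )
    # Collapse runs of `-` into one.
    result = re.sub(r'-{2,}', '-', result)
    # Strip leading/trailing `-`.
    result = result.strip('-')
    return result or 'untitled'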


def run_reslug(target_rel_path: str = None, dry_run: bool = False):
    """Normalize slug / source_path in the manifest, in batch or for one entry.

    Args:
        target_rel_path: a single raw relative path; None processes every entry.
        dry_run: if True, only print a preview and do not write the manifest.
    """
    manifest = load_manifest()
    files = manifest.get("files", {})
    if target_rel_path:
        targets = [(target_rel_path, files[target_rel_path])] if target_rel_path in files else []
        if not targets:
            print(red(f" ✗ Not found in manifest: {target_rel_path}"))
            return
    else:
        targets = list(files.items())
    changed = []
    skipped = 0
    for rel_path, info in targets:
        new_slug = _compute_normalized_slug(rel_path)
        old_slug = info.get("slug", "")
        new_source_path = f"wiki/sources/{new_slug}.md"
        old_source_path = info.get("source_path", "")
        if new_slug == old_slug and new_source_path == old_source_path:
            skipped += 1
            continue
        changed.append({
            "rel_path": rel_path,
            "old_slug": old_slug,
            "new_slug": new_slug,
            "old_source_path": old_source_path,
            "new_source_path": new_source_path,
        })
    print(f"\n{bold('=== Reslug Preview' if dry_run else '=== Reslug')}\n")
    print(f" Total entries scanned : {len(targets)}")
    print(f" Unchanged (skipped) : {skipped}")
    print(f" To update : {len(changed)}\n")
    if not changed:
        print(f" {green('')} All slugs already normalized.\n")
        return
    for item in changed:
        print(f" {dim(item['rel_path'])}")
        if item['old_slug'] != item['new_slug']:
            print(f" slug : {yellow(item['old_slug'])} → {green(item['new_slug'])}")
        if item['old_source_path'] != item['new_source_path']:
            print(f" src : {yellow(item['old_source_path'])} → {green(item['new_source_path'])}")
        print()
    if dry_run:
        print(f" {yellow('')} Dry-run — manifest NOT updated. Re-run without --dry-run to apply.\n")
        return
    # Apply the changes.
    for item in changed:
        entry = files[item["rel_path"]]
        entry["slug"] = item["new_slug"]
        entry["source_path"] = item["new_source_path"]
    save_manifest(manifest)
    print(f" {green('')} manifest.json updated ({len(changed)} entries changed).\n")


# ─── Admin interface: mark_ingested (called by the ingestion workflow) ──────


def mark_ingested(rel_path: str, slug: str, json_mode: bool = False):
    """Mark a raw file as ingested (updates its manifest entry).

    Behavior:
    - rel_path must already exist in the manifest (i.e. it was scanned by
      --sync); otherwise the call fails and exits.
    - slug must be passed explicitly; otherwise the call fails and exits.
    - source_path is derived from the slug as wiki/sources/<slug>.md.
    - hash and modified are refreshed from the raw file on disk (if the file
      is missing, the old values are kept and a warning is printed).
    - ingested is set to True and ingested_at to the current UTC timestamp.

    Args:
        rel_path : path relative to the repository root, e.g. "raw/dir/name.md" (required)
        slug : wiki slug, e.g. "my-article" (required)
        json_mode : if True, print single-line JSON for scripts
    """
    if not slug or not slug.strip():
        msg = "--slug is required for --mark-ingested"
        if json_mode:
            print(json.dumps({"event": "error", "message": msg}))
        else:
            print(red(f"{msg}"))
        raise SystemExit(1)
    manifest = load_manifest()
    files = manifest.get("files", {})
    if rel_path not in files:
        msg = f"rel_path not found in manifest (run --sync first): {rel_path}"
        if json_mode:
            print(json.dumps({"event": "error", "message": msg}))
        else:
            print(red(f"{msg}"))
        raise SystemExit(1)
    entry = files[rel_path]
    # Update slug and source_path.
    entry["slug"] = slug.strip()
    entry["source_path"] = f"wiki/sources/{slug.strip()}.md"
    # Refresh hash and modified from the raw file's actual content and mtime.
    abs_path = REPO_ROOT / rel_path
    if abs_path.exists():
        entry["hash"] = sha256_file(abs_path)
        entry["modified"] = datetime.fromtimestamp(abs_path.stat().st_mtime, tz=timezone.utc).isoformat()
    else:
        if not json_mode:
            print(yellow(f" ⚠ Raw file not found, modified timestamp not updated: {rel_path}"))
    # Mark as ingested.
    entry["ingested"] = True
    entry["ingested_at"] = iso_now()
    entry.pop("error", None)
    files[rel_path] = entry
    manifest["files"] = files
    save_manifest(manifest)
    if json_mode:
        print(json.dumps({
            "event": "mark_ingested",
            "rel_path": rel_path,
            "slug": entry["slug"],
            "source_path": entry["source_path"],
            "modified": entry.get("modified"),
            "ingested_at": entry["ingested_at"],
        }))
    else:
        print(f" {green('')} Marked ingested: {rel_path}")
        print(f" slug : {entry['slug']}")
        print(f" source_path : {entry['source_path']}")
        print(f" modified : {entry.get('modified', '(unchanged)')}")
        print(f" ingested_at : {entry['ingested_at']}")


# ─── CLI entry point ────────────────────────────────────────

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Wiki ↔ Raw three-way sync tool",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--check",
        action="store_true",
        help="Preview changes without running a sync",
    )
    parser.add_argument(
        "--sync",
        action="store_true",
        help="Run a full sync (new/deleted files + orphan detection)",
    )
    parser.add_argument(
        "--rebuild",
        action="store_true",
        help="Rebuild wiki/index.md from the manifest (fallback)",
    )
    parser.add_argument(
        "--reset-failed",
        action="store_true",
        help="Reset the ingest state of all failed entries (puts them back into the pending queue)",
    )
    parser.add_argument(
        "--pending",
        action="store_true",
        help="List all pending files waiting to be ingested",
    )
    parser.add_argument(
        "--verbose", "-v",
        action="store_true",
        help="Verbose output",
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="JSON Lines output mode (for callers to parse)",
    )
    parser.add_argument(
        "--mark-ingested",
        metavar="REL_PATH",
        nargs=1,
        help="Mark a single raw file as ingested: pass its relative path (e.g. 'raw/dir/file.md'). Must be used with --slug.",
    )
    parser.add_argument(
        "--slug",
        help="Used with --mark-ingested (required): the wiki slug, e.g. my-article",
    )
    parser.add_argument(
        "--mark-json",
        action="store_true",
        help="Used with --mark-ingested: print the mark result as single-line JSON",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="Used with --pending: limit the number of returned entries (default: all)",
    )
    parser.add_argument(
        "--fix-source-links",
        action="store_true",
        help="Fix the raw path link under `## Source File` in source pages, based on the manifest",
    )
    parser.add_argument(
        "--fix-source-target",
        metavar="REL_PATH",
        help="Used with --fix-source-links: fix a single raw entry only (e.g. 'raw/AI/file.md')",
    )
    parser.add_argument(
        "--reslug",
        action="store_true",
        help="Batch-normalize slug/source_path in the manifest (CJK kept, special characters become -, uppercase lowercased, runs of - collapsed)",
    )
    parser.add_argument(
        "--reslug-target",
        metavar="REL_PATH",
        help="Used with --reslug: process the given raw file only (e.g. 'raw/dir/file.md')",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Used with --reslug or --fix-source-links: preview the changes without writing anything",
    )
    args = parser.parse_args()
    if args.mark_ingested:
        rel = args.mark_ingested[0]
        mark_ingested(rel, slug=args.slug, json_mode=args.mark_json)
    elif args.fix_source_links:
        run_fix_source_links(
            target_rel_path=args.fix_source_target,
            dry_run=args.dry_run,
            json_mode=args.json,
        )
    elif args.reslug:
        run_reslug(target_rel_path=args.reslug_target, dry_run=args.dry_run)
    elif args.rebuild:
        run_rebuild()
    elif args.pending:
        manifest = load_manifest()
        pending = [(k, v) for k, v in manifest["files"].items() if not v.get("ingested")]
        if args.json:
            total = len(pending)
            # No limit given -> return everything (files list).
            if args.limit is None:
                payload = {
                    "event": "pending_list",
                    "count": total,
                    "files": [
                        {
                            "rel_path": k,
                            "slug": v.get("slug", build_slug_from_path(k)),
                            "source_path": v.get("source_path"),
                            "modified": v.get("modified"),
                            "hash": v.get("hash"),
                        }
                        for k, v in pending
                    ],
                }
            elif args.limit <= 0:
                payload = {"event": "pending_list", "count": total, "files": []}
            elif args.limit == 1:
                first = pending[0] if pending else (None, None)
                if first[0] is None:
                    payload = {"event": "pending_list", "count": 0, "file": None}
                else:
                    k, v = first
                    payload = {
                        "event": "pending_list",
                        "count": total,
                        "file": {
                            "rel_path": k,
                            "slug": v.get("slug", build_slug_from_path(k)),
                            "source_path": v.get("source_path"),
                            "modified": v.get("modified"),
                            "hash": v.get("hash"),
                        },
                    }
            else:
                # Return the first N entries as a files array.
                n = min(args.limit, total)
                payload = {
                    "event": "pending_list",
                    "count": total,
                    "files": [
                        {
                            "rel_path": k,
                            "slug": v.get("slug", build_slug_from_path(k)),
                            "source_path": v.get("source_path"),
                            "modified": v.get("modified"),
                            "hash": v.get("hash"),
                        }
                        for k, v in pending[:n]
                    ],
                }
            print(json.dumps(payload))
        else:
            # Console output honors --limit as well.
            total = len(pending)
            n = total if args.limit is None else max(0, args.limit)
            print(f"=== Pending Ingest Files ({total}) ===\n")
            if n == 0:
                print(" (no items to show)")
            else:
                for i, (path, info) in enumerate(pending[:n], 1):
                    print(f"{i:3}. {path}")
    elif args.reset_failed:
        manifest = load_manifest()
        reset_count = 0
        for k, v in manifest["files"].items():
            if v.get("error"):
                v["ingested"] = False
                v.pop("error", None)
                v.pop("ingested_at", None)
                reset_count += 1
        if reset_count > 0:
            save_manifest(manifest)
            print(f"Reset {reset_count} failed entries to pending.")
        else:
            print("No failed entries found.")
    elif args.check:
        run_check()
    elif args.sync:
        run_sync(dry_run=False, verbose=args.verbose, json_mode=args.json)
    else:
        parser.print_help()
        print("\nExamples:")
        print(" python tools/sync.py --check # preview changes")
        print(" python tools/sync.py --sync # run the sync")
        print(" python tools/sync.py --sync -v # verbose mode")
        print(" python tools/sync.py --rebuild # rebuild the index")