diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..70e7c56 --- /dev/null +++ b/.gitignore @@ -0,0 +1,2 @@ +raw/ +wiki/ diff --git a/CLAUDE.md b/CLAUDE.md index 345219f..8fc63ef 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,78 +1,132 @@ -# LLM Wiki Agent — Schema & Workflow Instructions +# LLM Wiki Agent — Schema & Workflow Instructions(中文版增强规范) -This wiki is maintained entirely by Claude Code. No API key or Python scripts needed — just open this repo in Claude Code and talk to it. +本 Wiki 完全由 Claude Code 自动维护。无需 API Key 或 Python 脚本 —— 只需在 Claude Code 中打开本仓库并与其对话。 -## Slash Commands (Claude Code) +--- +# 🔴 全局强制规则(CRITICAL) -| Command | What to say | -|---|---| -| `/wiki-ingest` | `ingest raw/my-article.md` | -| `/wiki-query` | `query: what are the main themes?` | -| `/wiki-lint` | `lint the wiki` | -| `/wiki-graph` | `build the knowledge graph` | +## 1. 输出语言(必须遵守) -Or just describe what you want in plain English: -- *"Ingest this file: raw/papers/attention-is-all-you-need.md"* -- *"What does the wiki say about transformer models?"* -- *"Check the wiki for orphan pages and contradictions"* -- *"Build the graph and show me what's connected to RAG"* +- 所有输出必须使用**简体中文** +- 专有名词允许保留英文,但首次出现必须附带中文解释 +- 如果原始文件名是中文,则source页面的名称尽量用中文,不要用拼音表示, 如果有特殊字符可以忽略 +- 禁止中英混合句(术语除外) +- 不允许输出纯英文总结或分析 -Claude Code reads this file automatically and follows the workflows below. +示例: + +Transformer(变压器模型,一种基于注意力机制的神经网络架构) --- -## Directory Layout +## 2. 
输出风格(严格限制) -``` -raw/ # Immutable source documents — never modify these -wiki/ # Claude owns this layer entirely - index.md # Catalog of all pages — update on every ingest - log.md # Append-only chronological record - overview.md # Living synthesis across all sources - sources/ # One summary page per source document - entities/ # People, companies, projects, products - concepts/ # Ideas, frameworks, methods, theories - syntheses/ # Saved query answers -graph/ # Auto-generated graph data -tools/ # Optional standalone Python scripts (require ANTHROPIC_API_KEY) -``` +所有输出必须: + +- 去修辞(禁止 narrative 风格) +- 去模糊(禁止“可能”“大概”等词) +- 信息密度最大化 +- 面向“知识结构化”,而非阅读体验 + +优先级: + +结构 > 关系 > 结论 > 描述 --- -## Page Format +## 3. 结构化语义(必须) -Every wiki page uses this frontmatter: +所有页面必须遵循结构化语义规则: + +- Summary 必须使用固定字段 +- Claim 必须符合标准语法 +- Connections 必须使用关系类型 +- 禁止自由发挥 + +--- + +# Slash Commands(Claude Code) + +| Command | 使用方式 | +| -------------- | --------------------------- | +| `/wiki-ingest` | `ingest raw/your-file.md` | +| `/wiki-query` | `query: 你的问题` | +| `/wiki-lint` | `lint the wiki` | +| `/wiki-graph` | `build the knowledge graph` | + +--- + +## 自然语言示例 + +- ingest raw/papers/attention-is-all-you-need.md +- query: Transformer 的核心机制是什么? +- lint the wiki +- build the graph and analyze RAG + +Claude Code 会自动读取本文件并执行以下工作流。 + + + +--- + +# Directory Layout(目录结构) + +``` +raw/ # 原始文档(不可修改) +wiki/ # 知识层(由 Claude 完全维护) + index.md # 页面索引(每次 ingest 必须更新) + log.md # 追加式日志 + overview.md # 全局知识总结 + sources/ # 每个原始文档对应一个页面 + entities/ # 实体(人/公司/产品/项目) + concepts/ # 概念(方法/理论/框架) + syntheses/ # 查询结果沉淀 +graph/ # 自动生成的图数据 +tools/ # 可选 Python 工具 (require ANTHROPIC_API_KEY) +```` + + +--- + +# Page Format(页面格式) + +每个页面必须包含: ```yaml --- +id: unique_id title: "Page Title" type: source | entity | concept | synthesis tags: [] -sources: [] # list of source slugs that inform this page +sources: [] # 来源 last_updated: YYYY-MM-DD --- -``` +```` -Use `[[PageName]]` wikilinks to link to other wiki pages. 
+必须使用 `[[PageName]]` 进行链接。 --- -## Ingest Workflow +# Ingest Workflow(摄取流程) +**重要** 请严格按照摄取流程进行操作,每分析一个页面必须要创建/更新source page,entity, concept等。不可遗漏! -Triggered by: *"ingest "* or `/wiki-ingest` +触发方式: +- `/wiki-ingest` +- 或:`ingest ` +## 执行步骤(严格顺序) +1. 使用 Read 工具完整读取 source 文档 +2. 读取 `wiki/index.md` 和 `wiki/overview.md` +3. 生成 `wiki/sources/原始中文名.md` (非中文使用 slug.md) +4. 更新 `wiki/index.md` +5. 更新 `wiki/overview.md`(如有必要) +6. 创建或更新 Entity 页面 +7. 创建或更新 Concept 页面 +8. 检测并记录冲突 +9. 追加 `wiki/log.md` -Steps (in order): -1. Read the source document fully using the Read tool -2. Read `wiki/index.md` and `wiki/overview.md` for current wiki context -3. Write `wiki/sources/.md` — use the source page format below -4. Update `wiki/index.md` — add entry under Sources section -5. Update `wiki/overview.md` — revise synthesis if warranted -6. Update/create entity pages for key people, companies, projects mentioned -7. Update/create concept pages for key ideas and frameworks discussed -8. Flag any contradictions with existing wiki content -9. Append to `wiki/log.md`: `## [YYYY-MM-DD] ingest | ` +--- -### Source Page Format +# Source Page Format(增强结构) ```markdown --- @@ -80,32 +134,46 @@ title: "Source Title" type: source tags: [] date: YYYY-MM-DD -source_file: raw/... --- +## Source File +- [[raw/...]] + ## Summary -2–4 sentence summary. +- 核心主题: +- 问题域: +- 方法/机制: +- 结论/价值: ## Key Claims -- Claim 1 -- Claim 2 +- (必须符合:主体 + 机制 + 结果) ## Key Quotes -> "Quote here" — context +> "引用内容" — 上下文说明 + +## Key Concepts +- [[ConceptName]]:定义 + +## Key Entities +- [[EntityName]]:角色说明 ## Connections -- [[EntityName]] — how they relate -- [[ConceptName]] — how it connects +- [[A]] ← depends_on ← [[B]] +- [[C]] ← extends ← [[D]] ## Contradictions -- Contradicts [[OtherPage]] on: ... 
+- 与 [[OtherPage]] 冲突: + - 冲突点: + - 当前观点: + - 对方观点: ``` -### Domain-Specific Templates +--- -If the source falls into a specific domain (e.g., personal diary, meeting notes), the agent should use a specialized template instead of the default generic one above: +# Domain-Specific Templates(领域模板) + +## Diary / Journal -#### Diary / Journal Template ```markdown --- title: "YYYY-MM-DD Diary" @@ -114,18 +182,16 @@ tags: [diary] date: YYYY-MM-DD --- ## Event Summary -... ## Key Decisions -... ## Energy & Mood -... ## Connections -... ## Shifts & Contradictions -... ``` -#### Meeting Notes Template +--- + +## Meeting Notes + ```markdown --- title: "Meeting Title" @@ -134,97 +200,153 @@ tags: [meeting] date: YYYY-MM-DD --- ## Goal -... ## Key Discussions -... ## Decisions Made -... ## Action Items -... ``` --- -## Query Workflow +# Entity & Concept Rules(关键增强) -Triggered by: *"query: <question>"* or `/wiki-query` +## Entity(实体) -Steps: -1. Read `wiki/index.md` to identify relevant pages -2. Read those pages with the Read tool -3. Synthesize an answer with inline citations as `[[PageName]]` wikilinks -4. Ask the user if they want the answer filed as `wiki/syntheses/<slug>.md` +创建条件: +- 出现 ≥ 2 次 + 或 +- 对主题有关键影响 + +类型: +- 人 / 公司 / 产品 / 项目 --- -## Lint Workflow +## Concept(概念) +创建条件: +- 可抽象 +- 可复用 +- 非具体实例 +--- -Triggered by: *"lint the wiki"* or `/wiki-lint` +## 命名规范(强制) +- 使用唯一标准名称 +- 所有别名写入页面: -Use Grep and Read tools to check for: -- **Orphan pages** — wiki pages with no inbound `[[links]]` from other pages -- **Broken links** — `[[WikiLinks]]` pointing to pages that don't exist -- **Contradictions** — claims that conflict across pages -- **Stale summaries** — pages not updated after newer sources -- **Missing entity pages** — entities mentioned in 3+ pages but lacking their own page -- **Data gaps** — questions the wiki can't answer; suggest new sources - -Output a lint report and ask if the user wants it saved to `wiki/lint-report.md`. 
+```markdown +## Aliases +- GPT4 +- GPT-4 +``` --- -## Graph Workflow +## 去重机制(必须) -Triggered by: *"build the knowledge graph"* or `/wiki-graph` - -When the user asks to build the graph, run `tools/build_graph.py` which: -- Pass 1: Parses all `[[wikilinks]]` → deterministic `EXTRACTED` edges -- Pass 2: Infers implicit relationships → `INFERRED` edges with confidence scores -- Runs Louvain community detection -- Outputs `graph/graph.json` + `graph/graph.html` - -If the user doesn't have Python/dependencies set up, instead generate the graph data manually: -1. Use Grep to find all `[[wikilinks]]` across wiki pages -2. Build a node/edge list -3. Write `graph/graph.json` directly -4. Write `graph/graph.html` using the vis.js template +创建前必须: +1. 搜索 index +2. 判断是否存在 +3. 存在则更新 --- -## Naming Conventions +# Query Workflow(查询流程) -- Source slugs: `kebab-case` matching source filename -- Entity pages: `TitleCase.md` (e.g. `OpenAI.md`, `SamAltman.md`) -- Concept pages: `TitleCase.md` (e.g. `ReinforcementLearning.md`, `RAG.md`) -- Source pages: `kebab-case.md` +触发: +- `/wiki-query` +- 或:`query: 问题` -## Index Format +--- + +## 步骤 + +1. 读取 index +2. 找到相关页面 +3. 使用 Read 工具加载 +4. 输出结构化答案 +5. 使用 `[[Page]]` 引用 +6. 询问是否保存为 synthesis + +--- + +# Lint Workflow(校验) + +检查内容: + +- 孤立页面 +- 断链 +- 冲突 +- 过期内容 +- 缺失Entity +- 缺失Concept +- 知识空白 + +--- + +# Graph Workflow(知识图谱) + +触发: +- `/wiki-graph` + +--- + +执行: +- 优先运行 `tools/build_graph.py` +- 否则手动构建: + +步骤: +1. 提取所有 `[[links]]` +2. 构建节点与边 +3. 
输出 `graph.json` + +--- + +# Naming Conventions(命名规范) +- Source:保留原始中文名称(去除特殊符号),非中文使用 kebab-case +- Entity:TitleCase +- Concept:TitleCase + +--- + +# Index Format(索引结构) ```markdown # Wiki Index ## Overview -- [Overview](overview.md) — living synthesis +- [Overview](overview.md) ## Sources -- [Source Title](sources/slug.md) — one-line summary +- [Title](sources/原始中文名.md) ## Entities -- [Entity Name](entities/EntityName.md) — one-line description +- [Entity](entities/Entity.md) ## Concepts -- [Concept Name](concepts/ConceptName.md) — one-line description +- [Concept](concepts/Concept.md) ## Syntheses -- [Analysis Title](syntheses/slug.md) — what question it answers +- [Title](syntheses/slug.md) ``` -## Log Format +--- -Each entry starts with `## [YYYY-MM-DD] <operation> | <title>` so it's grep-parseable: +# Log Format(日志) ``` -grep "^## \[" wiki/log.md | tail -10 +## [YYYY-MM-DD] ingest | 标题 ``` -Operations: `ingest`, `query`, `lint`, `graph` +--- + +# ✅ 最终目标 + +该系统用于: + +- 知识沉淀 +- 结构化理解 +- 自动图谱构建 +- Agent 推理支持 + +--- + +# END \ No newline at end of file diff --git a/CLAUDE.md.bak b/CLAUDE.md.bak new file mode 100644 index 0000000..345219f --- /dev/null +++ b/CLAUDE.md.bak @@ -0,0 +1,230 @@ +# LLM Wiki Agent — Schema & Workflow Instructions + +This wiki is maintained entirely by Claude Code. No API key or Python scripts needed — just open this repo in Claude Code and talk to it. 
+ +## Slash Commands (Claude Code) + +| Command | What to say | +|---|---| +| `/wiki-ingest` | `ingest raw/my-article.md` | +| `/wiki-query` | `query: what are the main themes?` | +| `/wiki-lint` | `lint the wiki` | +| `/wiki-graph` | `build the knowledge graph` | + +Or just describe what you want in plain English: +- *"Ingest this file: raw/papers/attention-is-all-you-need.md"* +- *"What does the wiki say about transformer models?"* +- *"Check the wiki for orphan pages and contradictions"* +- *"Build the graph and show me what's connected to RAG"* + +Claude Code reads this file automatically and follows the workflows below. + +--- + +## Directory Layout + +``` +raw/ # Immutable source documents — never modify these +wiki/ # Claude owns this layer entirely + index.md # Catalog of all pages — update on every ingest + log.md # Append-only chronological record + overview.md # Living synthesis across all sources + sources/ # One summary page per source document + entities/ # People, companies, projects, products + concepts/ # Ideas, frameworks, methods, theories + syntheses/ # Saved query answers +graph/ # Auto-generated graph data +tools/ # Optional standalone Python scripts (require ANTHROPIC_API_KEY) +``` + +--- + +## Page Format + +Every wiki page uses this frontmatter: + +```yaml +--- +title: "Page Title" +type: source | entity | concept | synthesis +tags: [] +sources: [] # list of source slugs that inform this page +last_updated: YYYY-MM-DD +--- +``` + +Use `[[PageName]]` wikilinks to link to other wiki pages. + +--- + +## Ingest Workflow + +Triggered by: *"ingest <file>"* or `/wiki-ingest` + +Steps (in order): +1. Read the source document fully using the Read tool +2. Read `wiki/index.md` and `wiki/overview.md` for current wiki context +3. Write `wiki/sources/<slug>.md` — use the source page format below +4. Update `wiki/index.md` — add entry under Sources section +5. Update `wiki/overview.md` — revise synthesis if warranted +6. 
Update/create entity pages for key people, companies, projects mentioned +7. Update/create concept pages for key ideas and frameworks discussed +8. Flag any contradictions with existing wiki content +9. Append to `wiki/log.md`: `## [YYYY-MM-DD] ingest | <Title>` + +### Source Page Format + +```markdown +--- +title: "Source Title" +type: source +tags: [] +date: YYYY-MM-DD +source_file: raw/... +--- + +## Summary +2–4 sentence summary. + +## Key Claims +- Claim 1 +- Claim 2 + +## Key Quotes +> "Quote here" — context + +## Connections +- [[EntityName]] — how they relate +- [[ConceptName]] — how it connects + +## Contradictions +- Contradicts [[OtherPage]] on: ... +``` + +### Domain-Specific Templates + +If the source falls into a specific domain (e.g., personal diary, meeting notes), the agent should use a specialized template instead of the default generic one above: + +#### Diary / Journal Template +```markdown +--- +title: "YYYY-MM-DD Diary" +type: source +tags: [diary] +date: YYYY-MM-DD +--- +## Event Summary +... +## Key Decisions +... +## Energy & Mood +... +## Connections +... +## Shifts & Contradictions +... +``` + +#### Meeting Notes Template +```markdown +--- +title: "Meeting Title" +type: source +tags: [meeting] +date: YYYY-MM-DD +--- +## Goal +... +## Key Discussions +... +## Decisions Made +... +## Action Items +... +``` + +--- + +## Query Workflow + +Triggered by: *"query: <question>"* or `/wiki-query` + +Steps: +1. Read `wiki/index.md` to identify relevant pages +2. Read those pages with the Read tool +3. Synthesize an answer with inline citations as `[[PageName]]` wikilinks +4. 
Ask the user if they want the answer filed as `wiki/syntheses/<slug>.md` + +--- + +## Lint Workflow + +Triggered by: *"lint the wiki"* or `/wiki-lint` + +Use Grep and Read tools to check for: +- **Orphan pages** — wiki pages with no inbound `[[links]]` from other pages +- **Broken links** — `[[WikiLinks]]` pointing to pages that don't exist +- **Contradictions** — claims that conflict across pages +- **Stale summaries** — pages not updated after newer sources +- **Missing entity pages** — entities mentioned in 3+ pages but lacking their own page +- **Data gaps** — questions the wiki can't answer; suggest new sources + +Output a lint report and ask if the user wants it saved to `wiki/lint-report.md`. + +--- + +## Graph Workflow + +Triggered by: *"build the knowledge graph"* or `/wiki-graph` + +When the user asks to build the graph, run `tools/build_graph.py` which: +- Pass 1: Parses all `[[wikilinks]]` → deterministic `EXTRACTED` edges +- Pass 2: Infers implicit relationships → `INFERRED` edges with confidence scores +- Runs Louvain community detection +- Outputs `graph/graph.json` + `graph/graph.html` + +If the user doesn't have Python/dependencies set up, instead generate the graph data manually: +1. Use Grep to find all `[[wikilinks]]` across wiki pages +2. Build a node/edge list +3. Write `graph/graph.json` directly +4. Write `graph/graph.html` using the vis.js template + +--- + +## Naming Conventions + +- Source slugs: `kebab-case` matching source filename +- Entity pages: `TitleCase.md` (e.g. `OpenAI.md`, `SamAltman.md`) +- Concept pages: `TitleCase.md` (e.g. 
`ReinforcementLearning.md`, `RAG.md`) +- Source pages: `kebab-case.md` + +## Index Format + +```markdown +# Wiki Index + +## Overview +- [Overview](overview.md) — living synthesis + +## Sources +- [Source Title](sources/slug.md) — one-line summary + +## Entities +- [Entity Name](entities/EntityName.md) — one-line description + +## Concepts +- [Concept Name](concepts/ConceptName.md) — one-line description + +## Syntheses +- [Analysis Title](syntheses/slug.md) — what question it answers +``` + +## Log Format + +Each entry starts with `## [YYYY-MM-DD] <operation> | <title>` so it's grep-parseable: + +``` +grep "^## \[" wiki/log.md | tail -10 +``` + +Operations: `ingest`, `query`, `lint`, `graph` diff --git a/raw b/raw new file mode 120000 index 0000000..9bb82eb --- /dev/null +++ b/raw @@ -0,0 +1 @@ +/Users/weishen/Workspace/nexus/raw \ No newline at end of file diff --git a/raw/.gitkeep b/raw/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/tools/__pycache__/sync.cpython-311.pyc b/tools/__pycache__/sync.cpython-311.pyc new file mode 100644 index 0000000..0533199 Binary files /dev/null and b/tools/__pycache__/sync.cpython-311.pyc differ diff --git a/tools/sync.py b/tools/sync.py new file mode 100755 index 0000000..70c4ba9 --- /dev/null +++ b/tools/sync.py @@ -0,0 +1,567 @@ +#!/usr/bin/env python3 +""" +Wiki ↔ Raw 三向同步工具 + +功能: + - 检测 raw/ 下文件变化(新增/修改/删除) + - 自动调用 ingest.py 进行同步 + - 维护 manifest.json 状态映射 + - 检测 orphan entity/concept(仅报告,不删除) + +用法: + python tools/sync.py --check 预览变化(不执行) + python tools/sync.py --sync 执行同步 + python tools/sync.py --rebuild 从 manifest 重建 wiki/index(兜底) + python tools/sync.py --bootstrap 从现有 wiki sources 反向生成 manifest(首次用,跳过已 ingest 的文件) + +manifest.json 格式: +{ + "version": 1, + "updated_at": "ISO timestamp", + "files": { + "relative/path/to/file.md": { + "hash": "sha256", + "modified": "ISO timestamp", + "slug": "wiki-source-slug", + "source_path": "wiki/sources/slug.md", + "ingested": true + } + } +} +""" + +import 
os +import sys +import json +import hashlib +import subprocess +from pathlib import Path +from datetime import datetime, timezone + + +REPO_ROOT = Path(__file__).parent.parent +WIKI_DIR = REPO_ROOT / "wiki" +MANIFEST_FILE = WIKI_DIR / "manifest.json" +SCHEMA_FILE = REPO_ROOT / "CLAUDE.md" + + +# ─── 工具函数 ─────────────────────────────────────────────── + +def green(text): + return f"\033[92m{text}\033[0m" + +def yellow(text): + return f"\033[93m{text}\033[0m" + +def red(text): + return f"\033[91m{text}\033[0m" + +def dim(text): + return f"\033[2m{text}\033[0m" + +def bold(text): + return f"\033[1m{text}\033[0m" + + +def log(msg, style="normal"): + prefixes = { + "normal": " ", + "info": " ℹ ", + "success": " ✓ ", + "warn": " ⚠ ", + "error": " ✗ ", + "section": "\n── ", + } + print(f"{prefixes.get(style, ' ')}{msg}") + + +def sha256_file(path: Path) -> str: + h = hashlib.sha256() + h.update(path.read_bytes()) + return h.hexdigest()[:16] + + +def iso_now(): + return datetime.now(timezone.utc).isoformat() + + +def load_manifest() -> dict: + if MANIFEST_FILE.exists(): + try: + return json.loads(MANIFEST_FILE.read_text(encoding="utf-8")) + except (json.JSONDecodeError, IOError): + pass + return {"version": 1, "updated_at": iso_now(), "files": {}} + + +def save_manifest(manifest: dict): + manifest["updated_at"] = iso_now() + MANIFEST_FILE.write_text(json.dumps(manifest, ensure_ascii=False, indent=2), encoding="utf-8") + + +def scan_raw() -> dict[str, dict]: + """返回 {relative_path: {hash, modified, size}}""" + raw_dir = REPO_ROOT / "raw" + result = {} + if not raw_dir.exists(): + return result + for p in raw_dir.rglob("*.md"): + if p.is_file() and not p.name.startswith("."): + rel = str(p.relative_to(REPO_ROOT)) + stat = p.stat() + result[rel] = { + "hash": sha256_file(p), + "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(), + "size": stat.st_size, + "abs_path": str(p), + } + return result + + +def build_slug_from_path(rel_path: str) -> str: + 
"""从相对路径生成 slug(尽量保留中文,kebab-case)""" + name = Path(rel_path).stem + name = name.replace(" ", "-").replace("/", "-").replace("\\", "-") + name = "".join(c if c.isalnum() or c in ("-", "_", "·") else "-" for c in name) + name = name.strip("-") + return name or "untitled" + + +def call_ingest(source_path: str, slug: str = None) -> dict: + """调用 ingest.py,返回结果""" + cmd = [sys.executable, str(REPO_ROOT / "tools" / "ingest.py"), source_path] + try: + result = subprocess.run( + cmd, + capture_output=True, + text=True, + timeout=300, + cwd=str(REPO_ROOT), + ) + return { + "success": result.returncode == 0, + "stdout": result.stdout, + "stderr": result.stderr, + } + except subprocess.TimeoutExpired: + return {"success": False, "stdout": "", "stderr": "Timeout (>5min)"} + except Exception as e: + return {"success": False, "stdout": "", "stderr": str(e)} + + +def find_orphan_entity_concept(manifest: dict) -> tuple[list, list]: + """检测未被任何 source page 引用的 entity 和 concept""" + # 从所有 source 内容中提取 [[wikilinks]] + import re + wikilink_pattern = re.compile(r"\[\[([^\]]+)\]\]") + + sources_dir = WIKI_DIR / "sources" + referenced_entities = set() + referenced_concepts = set() + + if sources_dir.exists(): + for src in sources_dir.glob("*.md"): + content = src.read_text(encoding="utf-8") + for link in wikilink_pattern.findall(content): + name = link.strip() + if name.startswith("entities/"): + referenced_entities.add(Path(name).stem) + elif name.startswith("concepts/"): + referenced_concepts.add(Path(name).stem) + elif "/" not in name: + # 裸 wikilink,可能是 entity 或 concept + referenced_entities.add(name) + referenced_concepts.add(name) + + # 检查 entity 目录 + orphan_entities = [] + entities_dir = WIKI_DIR / "entities" + if entities_dir.exists(): + for f in entities_dir.glob("*.md"): + if f.stem not in referenced_entities: + orphan_entities.append(f.name) + + # 检查 concept 目录 + orphan_concepts = [] + concepts_dir = WIKI_DIR / "concepts" + if concepts_dir.exists(): + for f in 
concepts_dir.glob("*.md"): + if f.stem not in referenced_concepts: + orphan_concepts.append(f.name) + + return orphan_entities, orphan_concepts + + +# ─── 核心同步逻辑 ─────────────────────────────────────────────── + +def check_changes(manifest: dict, raw_files: dict) -> dict: + """对比 manifest 和实际 raw 文件,返回变化""" + changes = {"new": [], "updated": [], "deleted": [], "unchanged": []} + manifest_files = manifest.get("files", {}) + + # 遍历当前 raw 文件 + for rel_path, info in raw_files.items(): + if rel_path not in manifest_files: + changes["new"].append({"rel_path": rel_path, **info}) + elif info["hash"] != manifest_files[rel_path]["hash"]: + changes["updated"].append({ + "rel_path": rel_path, + "old_hash": manifest_files[rel_path]["hash"], + **info, + }) + else: + changes["unchanged"].append(rel_path) + + # 遍历 manifest,找已删除的 + for rel_path in manifest_files: + abs_path = REPO_ROOT / rel_path + if not abs_path.exists(): + changes["deleted"].append({ + "rel_path": rel_path, + "slug": manifest_files[rel_path].get("slug", build_slug_from_path(rel_path)), + "source_path": manifest_files[rel_path].get("source_path"), + }) + + return changes + + +def run_sync(dry_run: bool = False, verbose: bool = False): + print(f"\n{bold('=== Wiki Sync')}\n") + print(f" Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}") + print(f" Raw: {REPO_ROOT / 'raw'}") + print(f" Wiki: {WIKI_DIR}") + print(f" Mode: {'DRY-RUN (preview only)' if dry_run else 'LIVE SYNC'}") + print() + + # Step 1: load manifest + manifest = load_manifest() + log("manifest.json loaded", "info") + + # Step 2: scan raw/ + raw_files = scan_raw() + log(f"raw/ scan: {len(raw_files)} .md files found", "info") + + # Step 3: check changes + changes = check_changes(manifest, raw_files) + total_changes = len(changes["new"]) + len(changes["updated"]) + len(changes["deleted"]) + + if total_changes == 0: + log("No changes detected — wiki is up to date.", "success") + return + + # ─── Report ─── + print(f"\n{bold('--- Changes ---')}") + 
print(f" {green('+')} New: {len(changes['new'])}") + print(f" {yellow('~')} Updated: {len(changes['updated'])}") + print(f" {red('-')} Deleted: {len(changes['deleted'])}") + + if verbose or not dry_run: + if changes["new"]: + print(f"\n {bold('New Files:')}") + for f in changes["new"]: + log(f"{green('[+]')} {f['rel_path']}", "normal") + + if changes["updated"]: + print(f"\n {bold('Updated Files:')}") + for f in changes["updated"]: + log(f"{yellow('[~]')} {f['rel_path']} (hash changed)", "normal") + + if changes["deleted"]: + print(f"\n {bold('Deleted Files:')}") + for f in changes["deleted"]: + log(f"{red('[-]')} {f['rel_path']}", "normal") + + if dry_run: + log("\nDry-run complete. Run with --sync to apply.", "warn") + return + + # ─── Apply Sync ─── + print(f"\n{bold('--- Applying Sync ---')}") + + updated_manifest = manifest.copy() + updated_manifest["files"] = manifest.get("files", {}).copy() + + # ① 新增 → ingest + for f in changes["new"]: + rel_path = f["rel_path"] + abs_path = f["abs_path"] + slug = build_slug_from_path(rel_path) + print(f"\n {green('[+]')} New: {rel_path}") + print(f" slug: {slug}") + + result = call_ingest(abs_path, slug) + if result["success"]: + log(f"Ingested: {slug}.md", "success") + updated_manifest["files"][rel_path] = { + "hash": f["hash"], + "modified": f["modified"], + "slug": slug, + "source_path": f"wiki/sources/{slug}.md", + "ingested": True, + "ingested_at": iso_now(), + } + else: + log(f"Failed: {result['stderr'][:200]}", "error") + # 仍然记录(避免重复 ingest) + updated_manifest["files"][rel_path] = { + "hash": f["hash"], + "modified": f["modified"], + "slug": slug, + "source_path": f"wiki/sources/{slug}.md", + "ingested": False, + "ingested_at": None, + "error": result["stderr"][:500], + } + + # ② 修改 → re-ingest + for f in changes["updated"]: + rel_path = f["rel_path"] + abs_path = f["abs_path"] + old_slug = manifest["files"].get(rel_path, {}).get("slug") or build_slug_from_path(rel_path) + print(f"\n {yellow('[~]')} Updated: 
{rel_path}") + + result = call_ingest(abs_path, old_slug) + if result["success"]: + log(f"Re-ingested: {old_slug}.md", "success") + updated_manifest["files"][rel_path] = { + **updated_manifest["files"].get(rel_path, {}), + "hash": f["hash"], + "modified": f["modified"], + "slug": old_slug, + "source_path": f"wiki/sources/{old_slug}.md", + "ingested": True, + "ingested_at": iso_now(), + } + else: + log(f"Failed: {result['stderr'][:200]}", "error") + + # ③ 删除 → 保留 wiki 内容,仅从 manifest 移除(按用户要求保留 orphan) + for f in changes["deleted"]: + rel_path = f["rel_path"] + source_path = f.get("source_path") + print(f"\n {red('[-]')} Deleted: {rel_path}") + if source_path: + sp = REPO_ROOT / source_path + log(f" Wiki source kept: {sp}", "warn") + # 从 manifest 移除(不删除 wiki 文件) + if rel_path in updated_manifest["files"]: + del updated_manifest["files"][rel_path] + + # Step 4: Save manifest + save_manifest(updated_manifest) + log(f"\nmanifest.json updated ({len(updated_manifest['files'])} entries)", "success") + + # Step 5: Orphan detection + orphan_entities, orphan_concepts = find_orphan_entity_concept(updated_manifest) + if orphan_entities or orphan_concepts: + print(f"\n{bold('--- Orphan Report (kept as requested) ---')}") + if orphan_entities: + print(f" {bold('Orphan Entities')} ({len(orphan_entities)}):") + for e in sorted(orphan_entities): + print(f" {dim('?')} {e}") + if orphan_concepts: + print(f" {bold('Orphan Concepts')} ({len(orphan_concepts)}):") + for c in sorted(orphan_concepts): + print(f" {dim('?')} {c}") + log("\nOrphan pages are kept (not deleted per user request).", "info") + else: + log("No orphan entity/concept detected.", "success") + + print(f"\n{bold('Done.')}") + + +def run_bootstrap(): + """从现有 wiki sources 反向生成 manifest,跳过已 ingest 的文件""" + import re + + print(f"\n{bold('=== Wiki Bootstrap')}\n") + print(f" Scanning existing wiki sources to build manifest ...\n") + + sources_dir = WIKI_DIR / "sources" + if not sources_dir.exists(): + print(f" {red('✗')} No 
wiki/sources/ directory found. Nothing to bootstrap.") + return + + wikilink_pattern = re.compile(r"\[\[?raw/([^\]\s]+\.md)\]?]?", re.IGNORECASE) + + manifest = {"version": 1, "updated_at": iso_now(), "files": {}} + raw_dir = (REPO_ROOT / "raw").resolve() # 解析 symlink 到真实路径 + repo_raw_prefix = str(REPO_ROOT / "raw") # 用于 strip 前缀得到相对路径 + bootstrapped = 0 + skipped_not_found = 0 + skipped_no_source_field = 0 + + for src in sources_dir.glob("*.md"): + content = src.read_text(encoding="utf-8") + + # 尝试从 ## Source File 字段提取原始路径 + match = wikilink_pattern.search(content) + if not match: + skipped_no_source_field += 1 + continue + + # raw_rel 格式如 "Agent/usecases/xxx.md"(不含 raw/ 前缀) + raw_rel = match.group(1).lstrip("/") + # 用 resolved 后的 raw_dir 拼接(follow symlink) + raw_path = raw_dir / raw_rel + + if not raw_path.exists(): + # 文件已删除,保留 source page 但不加入 manifest + skipped_not_found += 1 + continue + + stat = raw_path.stat() + file_hash = sha256_file(raw_path) + slug = src.stem + + # manifest key 用 "raw/Agent/xxx.md" 格式(REPO_ROOT 相对路径) + manifest_key = f"raw/{raw_rel}" + manifest["files"][manifest_key] = { + "hash": file_hash, + "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(), + "slug": slug, + "source_path": f"wiki/sources/{slug}.md", + "ingested": True, + "ingested_at": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(), + } + bootstrapped += 1 + + save_manifest(manifest) + + print(f" {bold('Result:')}") + print(f" {green('✓')} Manifest entries created: {bootstrapped}") + print(f" {yellow('~')} Skipped (source file deleted): {skipped_not_found}") + print(f" {dim('-')} Skipped (no source_file field): {skipped_no_source_field}") + print(f"\n {green('✓')} manifest.json created at: {MANIFEST_FILE}") + print(f"\n Run now: {bold('python tools/sync.py --check')} to preview new/updated files.\n") + + +def run_check(): + """只预览变化,不执行""" + manifest = load_manifest() + raw_files = scan_raw() + changes = check_changes(manifest, 
raw_files) + total = len(changes["new"]) + len(changes["updated"]) + len(changes["deleted"]) + + print(f"\n{bold('=== Wiki Sync Check')} (preview mode)\n") + print(f" Raw files: {len(raw_files)}") + print(f" Manifest entries: {len(manifest.get('files', {}))}") + print(f" {green('+')} New: {len(changes['new'])}") + print(f" {yellow('~')} Updated: {len(changes['updated'])}") + print(f" {red('-')} Deleted: {len(changes['deleted'])}") + + if total > 0: + if changes["new"]: + print(f"\n {bold('New Files:')}") + for f in changes["new"]: + print(f" {green('[+]')} {f['rel_path']}") + if changes["updated"]: + print(f"\n {bold('Updated Files:')}") + for f in changes["updated"]: + print(f" {yellow('[~]')} {f['rel_path']} (was {f['old_hash']}, now {f['hash']})") + if changes["deleted"]: + print(f"\n {bold('Deleted Files:')}") + for f in changes["deleted"]: + print(f" {red('[-]')} {f['rel_path']}") + else: + print(f"\n {green('No changes — wiki is in sync.')}") + + print() + + +def run_rebuild(): + """从 manifest 重建 wiki/index.md(兜底方案)""" + manifest = load_manifest() + print(f"\n{bold('=== Wiki Rebuild from Manifest')}\n") + print(f" Manifest entries: {len(manifest.get('files', {}))}") + print(f" Rebuilding index.md ...\n") + + index_lines = [ + "# Wiki Index\n", + "\n## Overview\n", + "- [Overview](overview.md) — living synthesis\n", + "\n## Sources\n", + ] + + files = manifest.get("files", {}) + # 按 modified 时间倒序 + sorted_files = sorted(files.items(), key=lambda x: x[1].get("modified", ""), reverse=True) + + for rel_path, info in sorted_files: + slug = info.get("slug", build_slug_from_path(rel_path)) + source_md_path = WIKI_DIR / "sources" / f"{slug}.md" + if source_md_path.exists(): + title = source_md_path.read_text(encoding="utf-8").split("\n")[0].lstrip("# ").strip() + index_lines.append(f"- [{title}](sources/{slug}.md)\n") + else: + index_lines.append(f"- [{slug}](sources/{slug}.md) — (source missing)\n") + + index_lines.append("\n## Entities\n\n## Concepts\n\n## 
Syntheses\n") + + index_file = WIKI_DIR / "index.md" + index_file.write_text("".join(index_lines), encoding="utf-8") + print(f" {green('✓')} index.md rebuilt with {len(sorted_files)} sources") + + # Orphan report + orphan_entities, orphan_concepts = find_orphan_entity_concept(manifest) + if orphan_entities: + print(f" {dim('?')} Orphan entities: {len(orphan_entities)}") + if orphan_concepts: + print(f" {dim('?')} Orphan concepts: {len(orphan_concepts)}") + + print(f"\nDone.") + + +# ─── CLI 入口 ─────────────────────────────────────────────── + +if __name__ == "__main__": + import argparse + + parser = argparse.ArgumentParser( + description="Wiki ↔ Raw 三向同步工具", + formatter_class=argparse.RawDescriptionHelpFormatter, + ) + parser.add_argument( + "--check", + action="store_true", + help="预览变化,不执行同步", + ) + parser.add_argument( + "--sync", + action="store_true", + help="执行完整同步(新增/修改/删除 + orphan 检测)", + ) + parser.add_argument( + "--rebuild", + action="store_true", + help="从 manifest 重建 wiki/index.md(兜底方案)", + ) + parser.add_argument( + "--bootstrap", + action="store_true", + help="从现有 wiki sources 反向生成 manifest(首次使用,跳过已 ingest 的文件)", + ) + parser.add_argument( + "--verbose", "-v", + action="store_true", + help="详细输出", + ) + + args = parser.parse_args() + + if args.bootstrap: + run_bootstrap() + elif args.rebuild: + run_rebuild() + elif args.check: + run_check() + elif args.sync: + run_sync(dry_run=False, verbose=args.verbose) + else: + parser.print_help() + print("\n示例:") + print(" python tools/sync.py --check # 预览变化") + print(" python tools/sync.py --sync # 执行同步") + print(" python tools/sync.py --sync -v # 详细模式") + print(" python tools/sync.py --rebuild # 重建 index") + print(" python tools/sync.py --bootstrap # 首次:从 wiki sources 生成 manifest") diff --git a/wiki b/wiki new file mode 120000 index 0000000..31bd750 --- /dev/null +++ b/wiki @@ -0,0 +1 @@ +/Users/weishen/Workspace/nexus/wiki \ No newline at end of file diff --git a/wiki/index.md b/wiki/index.md deleted file 
mode 100644 index 647ecb0..0000000 --- a/wiki/index.md +++ /dev/null @@ -1,14 +0,0 @@ -# Wiki Index - -This file is maintained by the LLM. Updated on every ingest. - -## Overview -- [Overview](overview.md) — living synthesis across all sources - -## Sources - -## Entities - -## Concepts - -## Syntheses diff --git a/wiki/log.md b/wiki/log.md deleted file mode 100644 index 66a9285..0000000 --- a/wiki/log.md +++ /dev/null @@ -1,9 +0,0 @@ -# Wiki Log - -Append-only chronological record of all operations. - -Format: `## [YYYY-MM-DD] <operation> | <title>` - -Parse recent entries: `grep "^## \[" wiki/log.md | tail -10` - ---- diff --git a/wiki/overview.md b/wiki/overview.md deleted file mode 100644 index f71416d..0000000 --- a/wiki/overview.md +++ /dev/null @@ -1,17 +0,0 @@ ---- -title: "Overview" -type: synthesis -tags: [] -sources: [] -last_updated: "" ---- - -# Overview - -*This page is maintained by the LLM. It is updated on every ingest to reflect the current synthesis across all sources.* - -No sources ingested yet. Add your first source with: - -```bash -python tools/ingest.py raw/your-source.md -```
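
The change detection that `tools/sync.py` runs against `manifest.json` can be sketched in isolation. This is a minimal illustration, not the tool's code: `short_hash` and `classify` are hypothetical names, the data is in-memory, and deletions are inferred from the scan result, whereas `check_changes` in the diff tests whether each manifest path still exists on disk.

```python
import hashlib


def short_hash(data: bytes) -> str:
    """Truncated SHA-256, mirroring the 16-character digest sync.py stores."""
    return hashlib.sha256(data).hexdigest()[:16]


def classify(manifest_files: dict, raw_hashes: dict) -> dict:
    """Bucket each path as new, updated, deleted, or unchanged (hypothetical helper)."""
    changes = {"new": [], "updated": [], "deleted": [], "unchanged": []}
    for path, digest in raw_hashes.items():
        if path not in manifest_files:
            changes["new"].append(path)
        elif manifest_files[path]["hash"] != digest:
            changes["updated"].append(path)
        else:
            changes["unchanged"].append(path)
    # Assumption: a path the manifest remembers but the scan did not return is deleted.
    changes["deleted"] = [p for p in manifest_files if p not in raw_hashes]
    return changes


# Illustrative data only; no files are read.
manifest_files = {
    "raw/a.md": {"hash": short_hash(b"alpha")},
    "raw/b.md": {"hash": short_hash(b"beta")},
    "raw/gone.md": {"hash": short_hash(b"old")},
}
raw_hashes = {
    "raw/a.md": short_hash(b"alpha"),
    "raw/b.md": short_hash(b"beta v2"),
    "raw/c.md": short_hash(b"gamma"),
}
result = classify(manifest_files, raw_hashes)
print(result)
```

Because only content hashes are compared, touching a file without changing its bytes leaves it in the unchanged bucket, which is presumably why the manifest stores `modified` for reporting rather than for detection.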