diff --git a/.DS_Store b/.DS_Store index 70f1cf35..1a63846e 100644 Binary files a/.DS_Store and b/.DS_Store differ diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 00000000..1fe5a046 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,219 @@ +# LLM Wiki Agent — Schema & Workflow Instructions + +This wiki is maintained entirely by your coding agent. No API key or Python scripts needed — just open this repo in Codex, OpenCode, or any agent that reads this file, and talk to it. + +## How to Use + +Describe what you want in plain English: +- *"Ingest this file: raw/papers/my-paper.md"* +- *"What does the wiki say about transformer models?"* +- *"Check the wiki for orphan pages and contradictions"* +- *"Build the knowledge graph"* + +Or use shorthand triggers: +- `ingest <file>` → runs the Ingest Workflow +- `query: <question>` → runs the Query Workflow +- `lint` → runs the Lint Workflow +- `build graph` → runs the Graph Workflow + +--- + +## Directory Layout + +``` +raw/ # Immutable source documents — never modify these +wiki/ # Agent owns this layer entirely + index.md # Catalog of all pages — update on every ingest + log.md # Append-only chronological record + overview.md # Living synthesis across all sources + sources/ # One summary page per source document + entities/ # People, companies, projects, products + concepts/ # Ideas, frameworks, methods, theories + syntheses/ # Saved query answers +graph/ # Auto-generated graph data +tools/ # Optional standalone Python scripts (require ANTHROPIC_API_KEY) +``` + +--- + +## Page Format + +Every wiki page uses this frontmatter: + +```yaml +--- +title: "Page Title" +type: source | entity | concept | synthesis +tags: [] +sources: [] # list of source slugs that inform this page +last_updated: YYYY-MM-DD +--- +``` + +Use `[[PageName]]` wikilinks to link to other wiki pages. + +--- + +## Ingest Workflow + +Triggered by: *"ingest <file>"* + +Steps (in order): +1. Read the source document fully +2. 
Read `wiki/index.md` and `wiki/overview.md` for current wiki context +3. Write `wiki/sources/<slug>.md` — use the source page format below +4. Update `wiki/index.md` — add entry under Sources section +5. Update `wiki/overview.md` — revise synthesis if warranted +6. Update/create entity pages for key people, companies, projects mentioned +7. Update/create concept pages for key ideas and frameworks discussed +8. Flag any contradictions with existing wiki content +9. Append to `wiki/log.md`: `## [YYYY-MM-DD] ingest | <Title>` + +### Source Page Format + +```markdown +--- +title: "Source Title" +type: source +tags: [] +date: YYYY-MM-DD +source_file: raw/... +--- + +## Summary +2–4 sentence summary. + +## Key Claims +- Claim 1 +- Claim 2 + +## Key Quotes +> "Quote here" — context + +## Connections +- [[EntityName]] — how they relate +- [[ConceptName]] — how it connects + +## Contradictions +- Contradicts [[OtherPage]] on: ... +``` + +### Domain-Specific Templates + +If the source falls into a specific domain (e.g., personal diary, meeting notes), the agent should use a specialized template instead of the default generic one above: + +#### Diary / Journal Template +```markdown +--- +title: "YYYY-MM-DD Diary" +type: source +tags: [diary] +date: YYYY-MM-DD +--- +## Event Summary +... +## Key Decisions +... +## Energy & Mood +... +## Connections +... +## Shifts & Contradictions +... +``` + +#### Meeting Notes Template +```markdown +--- +title: "Meeting Title" +type: source +tags: [meeting] +date: YYYY-MM-DD +--- +## Goal +... +## Key Discussions +... +## Decisions Made +... +## Action Items +... +``` + +--- + +## Query Workflow + +Triggered by: *"query: <question>"* + +Steps: +1. Read `wiki/index.md` to identify relevant pages +2. Read those pages +3. Synthesize an answer with inline citations as `[[PageName]]` wikilinks +4. 
Ask the user if they want the answer filed as `wiki/syntheses/<slug>.md` + +--- + +## Lint Workflow + +Triggered by: *"lint"* + +Check for: +- **Orphan pages** — wiki pages with no inbound `[[links]]` from other pages +- **Broken links** — `[[WikiLinks]]` pointing to pages that don't exist +- **Contradictions** — claims that conflict across pages +- **Stale summaries** — pages not updated after newer sources +- **Missing entity pages** — entities mentioned in 3+ pages but lacking their own page +- **Data gaps** — questions the wiki can't answer; suggest new sources + +Output a lint report and ask if the user wants it saved to `wiki/lint-report.md`. + +--- + +## Graph Workflow + +Triggered by: *"build graph"* + +First try: `python tools/build_graph.py --open` + +If Python/deps unavailable, build manually: +1. Search for all `[[wikilinks]]` across wiki pages +2. Build nodes (one per page) and edges (one per link) +3. Infer implicit relationships not captured by wikilinks — tag `INFERRED` with confidence score; low confidence → `AMBIGUOUS` +4. Write `graph/graph.json` with `{nodes, edges, built: date}` +5. Write `graph/graph.html` as a self-contained vis.js visualization + +--- + +## Naming Conventions + +- Source slugs: `kebab-case` matching source filename +- Entity pages: `TitleCase.md` (e.g. `OpenAI.md`, `SamAltman.md`) +- Concept pages: `TitleCase.md` (e.g. 
`ReinforcementLearning.md`, `RAG.md`) + +## Index Format + +```markdown +# Wiki Index + +## Overview +- [Overview](overview.md) — living synthesis + +## Sources +- [Source Title](sources/slug.md) — one-line summary + +## Entities +- [Entity Name](entities/EntityName.md) — one-line description + +## Concepts +- [Concept Name](concepts/ConceptName.md) — one-line description + +## Syntheses +- [Analysis Title](syntheses/slug.md) — what question it answers +``` + +## Log Format + +`## [YYYY-MM-DD] <operation> | <title>` + +Operations: `ingest`, `query`, `lint`, `graph` diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 00000000..345219f7 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,230 @@ +# LLM Wiki Agent — Schema & Workflow Instructions + +This wiki is maintained entirely by Claude Code. No API key or Python scripts needed — just open this repo in Claude Code and talk to it. + +## Slash Commands (Claude Code) + +| Command | What to say | +|---|---| +| `/wiki-ingest` | `ingest raw/my-article.md` | +| `/wiki-query` | `query: what are the main themes?` | +| `/wiki-lint` | `lint the wiki` | +| `/wiki-graph` | `build the knowledge graph` | + +Or just describe what you want in plain English: +- *"Ingest this file: raw/papers/attention-is-all-you-need.md"* +- *"What does the wiki say about transformer models?"* +- *"Check the wiki for orphan pages and contradictions"* +- *"Build the graph and show me what's connected to RAG"* + +Claude Code reads this file automatically and follows the workflows below. 
+ +--- + +## Directory Layout + +``` +raw/ # Immutable source documents — never modify these +wiki/ # Claude owns this layer entirely + index.md # Catalog of all pages — update on every ingest + log.md # Append-only chronological record + overview.md # Living synthesis across all sources + sources/ # One summary page per source document + entities/ # People, companies, projects, products + concepts/ # Ideas, frameworks, methods, theories + syntheses/ # Saved query answers +graph/ # Auto-generated graph data +tools/ # Optional standalone Python scripts (require ANTHROPIC_API_KEY) +``` + +--- + +## Page Format + +Every wiki page uses this frontmatter: + +```yaml +--- +title: "Page Title" +type: source | entity | concept | synthesis +tags: [] +sources: [] # list of source slugs that inform this page +last_updated: YYYY-MM-DD +--- +``` + +Use `[[PageName]]` wikilinks to link to other wiki pages. + +--- + +## Ingest Workflow + +Triggered by: *"ingest <file>"* or `/wiki-ingest` + +Steps (in order): +1. Read the source document fully using the Read tool +2. Read `wiki/index.md` and `wiki/overview.md` for current wiki context +3. Write `wiki/sources/<slug>.md` — use the source page format below +4. Update `wiki/index.md` — add entry under Sources section +5. Update `wiki/overview.md` — revise synthesis if warranted +6. Update/create entity pages for key people, companies, projects mentioned +7. Update/create concept pages for key ideas and frameworks discussed +8. Flag any contradictions with existing wiki content +9. Append to `wiki/log.md`: `## [YYYY-MM-DD] ingest | <Title>` + +### Source Page Format + +```markdown +--- +title: "Source Title" +type: source +tags: [] +date: YYYY-MM-DD +source_file: raw/... +--- + +## Summary +2–4 sentence summary. 
+ +## Key Claims +- Claim 1 +- Claim 2 + +## Key Quotes +> "Quote here" — context + +## Connections +- [[EntityName]] — how they relate +- [[ConceptName]] — how it connects + +## Contradictions +- Contradicts [[OtherPage]] on: ... +``` + +### Domain-Specific Templates + +If the source falls into a specific domain (e.g., personal diary, meeting notes), the agent should use a specialized template instead of the default generic one above: + +#### Diary / Journal Template +```markdown +--- +title: "YYYY-MM-DD Diary" +type: source +tags: [diary] +date: YYYY-MM-DD +--- +## Event Summary +... +## Key Decisions +... +## Energy & Mood +... +## Connections +... +## Shifts & Contradictions +... +``` + +#### Meeting Notes Template +```markdown +--- +title: "Meeting Title" +type: source +tags: [meeting] +date: YYYY-MM-DD +--- +## Goal +... +## Key Discussions +... +## Decisions Made +... +## Action Items +... +``` + +--- + +## Query Workflow + +Triggered by: *"query: <question>"* or `/wiki-query` + +Steps: +1. Read `wiki/index.md` to identify relevant pages +2. Read those pages with the Read tool +3. Synthesize an answer with inline citations as `[[PageName]]` wikilinks +4. Ask the user if they want the answer filed as `wiki/syntheses/<slug>.md` + +--- + +## Lint Workflow + +Triggered by: *"lint the wiki"* or `/wiki-lint` + +Use Grep and Read tools to check for: +- **Orphan pages** — wiki pages with no inbound `[[links]]` from other pages +- **Broken links** — `[[WikiLinks]]` pointing to pages that don't exist +- **Contradictions** — claims that conflict across pages +- **Stale summaries** — pages not updated after newer sources +- **Missing entity pages** — entities mentioned in 3+ pages but lacking their own page +- **Data gaps** — questions the wiki can't answer; suggest new sources + +Output a lint report and ask if the user wants it saved to `wiki/lint-report.md`. 
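
The two mechanical checks in the list above (orphan pages and broken links) can be sketched in a few lines of Python. This is only an illustration, not part of the repo's tooling: the function name `lint_links` is hypothetical, pages are assumed to be addressed by file stem, and the skip list mirrors the one `tools/build_graph.py` uses when collecting pages.

```python
import re
from pathlib import Path

# Capture the target of [[Target]], [[Target|alias]] or [[Target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

# Assumption: bookkeeping files are neither link sources nor link targets.
SKIP = {"index.md", "log.md", "lint-report.md"}


def lint_links(wiki_dir: Path) -> dict:
    """Return orphan pages and broken wikilinks found under wiki_dir."""
    pages = {p.stem: p for p in wiki_dir.rglob("*.md") if p.name not in SKIP}
    inbound = {name: set() for name in pages}
    broken = []
    for name, path in pages.items():
        text = path.read_text(encoding="utf-8")
        for target in {m.strip() for m in WIKILINK.findall(text)}:
            if target in pages:
                if target != name:  # a self-link does not rescue an orphan
                    inbound[target].add(name)
            else:
                broken.append((name, target))
    orphans = sorted(n for n, sources in inbound.items() if not sources)
    return {"orphans": orphans, "broken": sorted(broken)}
```

Contradictions, stale summaries, and data gaps need judgment about page content, so they stay with the agent rather than a script like this.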
+ +--- + +## Graph Workflow + +Triggered by: *"build the knowledge graph"* or `/wiki-graph` + +When the user asks to build the graph, run `tools/build_graph.py` which: +- Pass 1: Parses all `[[wikilinks]]` → deterministic `EXTRACTED` edges +- Pass 2: Infers implicit relationships → `INFERRED` edges with confidence scores +- Runs Louvain community detection +- Outputs `graph/graph.json` + `graph/graph.html` + +If the user doesn't have Python/dependencies set up, instead generate the graph data manually: +1. Use Grep to find all `[[wikilinks]]` across wiki pages +2. Build a node/edge list +3. Write `graph/graph.json` directly +4. Write `graph/graph.html` using the vis.js template + +--- + +## Naming Conventions + +- Source slugs: `kebab-case` matching source filename +- Entity pages: `TitleCase.md` (e.g. `OpenAI.md`, `SamAltman.md`) +- Concept pages: `TitleCase.md` (e.g. `ReinforcementLearning.md`, `RAG.md`) +- Source pages: `kebab-case.md` + +## Index Format + +```markdown +# Wiki Index + +## Overview +- [Overview](overview.md) — living synthesis + +## Sources +- [Source Title](sources/slug.md) — one-line summary + +## Entities +- [Entity Name](entities/EntityName.md) — one-line description + +## Concepts +- [Concept Name](concepts/ConceptName.md) — one-line description + +## Syntheses +- [Analysis Title](syntheses/slug.md) — what question it answers +``` + +## Log Format + +Each entry starts with `## [YYYY-MM-DD] <operation> | <title>` so it's grep-parseable: + +``` +grep "^## \[" wiki/log.md | tail -10 +``` + +Operations: `ingest`, `query`, `lint`, `graph` diff --git a/GEMINI.md b/GEMINI.md new file mode 100644 index 00000000..3025c9d2 --- /dev/null +++ b/GEMINI.md @@ -0,0 +1,175 @@ +# LLM Wiki Agent — Schema & Workflow Instructions + +This wiki is maintained entirely by Gemini CLI. No API key or Python scripts needed — just open this repo with `gemini` and talk to it. 
+ +## How to Use + +Describe what you want in plain English: +- *"Ingest this file: raw/papers/my-paper.md"* +- *"What does the wiki say about transformer models?"* +- *"Check the wiki for orphan pages and contradictions"* +- *"Build the knowledge graph"* + +Or use shorthand triggers: +- `ingest <file>` → runs the Ingest Workflow +- `query: <question>` → runs the Query Workflow +- `lint` → runs the Lint Workflow +- `build graph` → runs the Graph Workflow + +--- + +## Directory Layout + +``` +raw/ # Immutable source documents — never modify these +wiki/ # Agent owns this layer entirely + index.md # Catalog of all pages — update on every ingest + log.md # Append-only chronological record + overview.md # Living synthesis across all sources + sources/ # One summary page per source document + entities/ # People, companies, projects, products + concepts/ # Ideas, frameworks, methods, theories + syntheses/ # Saved query answers +graph/ # Auto-generated graph data +tools/ # Optional standalone Python scripts +``` + +--- + +## Page Format + +Every wiki page uses this frontmatter: + +```yaml +--- +title: "Page Title" +type: source | entity | concept | synthesis +tags: [] +sources: [] +last_updated: YYYY-MM-DD +--- +``` + +Use `[[PageName]]` wikilinks to link to other wiki pages. + +--- + +## Ingest Workflow + +Triggered by: *"ingest <file>"* + +1. Read the source document fully +2. Read `wiki/index.md` and `wiki/overview.md` for current wiki context +3. Write `wiki/sources/<slug>.md` (source page format below) +4. Update `wiki/index.md` — add entry under Sources +5. Update `wiki/overview.md` — revise synthesis if warranted +6. Update/create entity and concept pages +7. Flag contradictions with existing wiki content +8. Append to `wiki/log.md`: `## [YYYY-MM-DD] ingest | <Title>` + +### Source Page Format + +```markdown +--- +title: "Source Title" +type: source +tags: [] +date: YYYY-MM-DD +source_file: raw/... +--- + +## Summary +2–4 sentence summary. 
+ +## Key Claims +- Claim 1 + +## Key Quotes +> "Quote here" + +## Connections +- [[EntityName]] — how they relate + +## Contradictions +- Contradicts [[OtherPage]] on: ... +``` + +### Domain-Specific Templates + +If the source falls into a specific domain (e.g., personal diary, meeting notes), the agent should use a specialized template instead of the default generic one above: + +#### Diary / Journal Template +```markdown +--- +title: "YYYY-MM-DD Diary" +type: source +tags: [diary] +date: YYYY-MM-DD +--- +## Event Summary +... +## Key Decisions +... +## Energy & Mood +... +## Connections +... +## Shifts & Contradictions +... +``` + +#### Meeting Notes Template +```markdown +--- +title: "Meeting Title" +type: source +tags: [meeting] +date: YYYY-MM-DD +--- +## Goal +... +## Key Discussions +... +## Decisions Made +... +## Action Items +... +``` + +--- + +## Query Workflow + +Triggered by: *"query: <question>"* + +1. Read `wiki/index.md` — identify relevant pages +2. Read those pages +3. Synthesize answer with `[[PageName]]` citations +4. Offer to save as `wiki/syntheses/<slug>.md` + +--- + +## Lint Workflow + +Triggered by: *"lint"* + +Check for: orphan pages, broken links, contradictions, stale content, missing entity pages, data gaps. + +--- + +## Graph Workflow + +Triggered by: *"build graph"* + +Try `python tools/build_graph.py --open` first. If unavailable, build graph.json and graph.html manually from wikilinks. 
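
A minimal sketch of the manual fallback, assuming only explicit wikilinks are turned into edges. The function names and the `from`/`to` edge keys are illustrative choices (they follow vis.js naming), not the schema that `tools/build_graph.py` is confirmed to emit; the `EXTRACTED` tag and the stem-based, case-insensitive link resolution follow the descriptions elsewhere in this diff.

```python
import json
import re
from datetime import date
from pathlib import Path

# Assumption: bookkeeping files are left out of the graph.
SKIP = {"index.md", "log.md", "lint-report.md"}


def build_graph(wiki_dir: Path) -> dict:
    """Build {nodes, edges, built} from explicit [[wikilinks]] only."""
    pages = [p for p in sorted(wiki_dir.rglob("*.md")) if p.name not in SKIP]
    # Map lower-cased file stem -> page id such as "concepts/RAG".
    ids = {p.stem.lower(): p.relative_to(wiki_dir).with_suffix("").as_posix()
           for p in pages}
    edges = set()
    for p in pages:
        src = ids[p.stem.lower()]
        text = p.read_text(encoding="utf-8")
        for link in re.findall(r"\[\[([^\]]+)\]\]", text):
            dst = ids.get(link.strip().lower())
            if dst and dst != src:  # drop unresolved links and self-links
                edges.add((src, dst))
    return {
        "nodes": [{"id": pid} for pid in ids.values()],
        "edges": [{"from": s, "to": t, "type": "EXTRACTED"}
                  for s, t in sorted(edges)],
        "built": date.today().isoformat(),
    }


def write_graph(wiki_dir: Path, out_path: Path) -> None:
    """Serialize the graph to JSON, creating the output directory if needed."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(build_graph(wiki_dir), indent=2),
                        encoding="utf-8")
```

Calling `write_graph(Path("wiki"), Path("graph/graph.json"))` from the repo root would produce the JSON half of the output; the HTML visualization is not covered here.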
+ +--- + +## Naming Conventions + +- Source slugs: `kebab-case` +- Entity/Concept pages: `TitleCase.md` + +## Log Format + +`## [YYYY-MM-DD] <operation> | <title>` diff --git a/LICENSE b/LICENSE new file mode 100644 index 00000000..7caba6cd --- /dev/null +++ b/LICENSE @@ -0,0 +1,21 @@ +MIT License + +Copyright (c) 2023 SamurAIGPT + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. diff --git a/README.md b/README.md index 48ab6fde..4397152c 100644 --- a/README.md +++ b/README.md @@ -1,12 +1,245 @@ ---- -title: nexus -source: -author: shenwei -published: -created: -description: -tags: [] +# LLM Wiki Agent + +[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE) + +**A coding agent skill.** Drop source documents into `raw/` and type `/wiki-ingest` — the agent reads them, extracts knowledge, and builds a persistent interlinked wiki. Every new source makes the wiki richer. You never write it. + +> Most knowledge tools make you search your own notes. 
This one reads everything you've collected and writes a structured wiki that compounds over time — cross-references already built, contradictions already flagged, synthesis already done. + +``` +/wiki-ingest raw/papers/attention-is-all-you-need.md +``` + +``` +wiki/ +├── index.md catalog of all pages — updated on every ingest +├── log.md append-only record of every operation +├── overview.md living synthesis across all sources +├── sources/ one summary page per source document +├── entities/ people, companies, projects — auto-created +├── concepts/ ideas, frameworks, methods — auto-created +└── syntheses/ query answers filed back as wiki pages +graph/ +├── graph.json persistent node/edge data (SHA256-cached) +└── graph.html interactive vis.js visualization — open in any browser +``` + +## Install + +**Requires:** [Claude Code](https://claude.ai/code), [Codex](https://openai.com/codex), [Gemini CLI](https://github.com/google-gemini/gemini-cli), or any agent that reads a config file. + +```bash +git clone https://github.com/SamurAIGPT/llm-wiki-agent.git +cd llm-wiki-agent +``` + +Open in your agent — no API key or Python setup needed: + +```bash +claude # reads CLAUDE.md + .claude/commands/ +codex # reads AGENTS.md +opencode # reads AGENTS.md +gemini # reads GEMINI.md +``` + +## Usage + +``` +/wiki-ingest raw/papers/my-paper.md # ingest a source into the wiki +/wiki-ingest raw/articles/my-article.md # works on any markdown file + +/wiki-query "what are the main themes?" # synthesize answer from wiki pages +/wiki-query "how does X relate to Y?" # with [[wikilink]] citations + +/wiki-lint # find orphans, contradictions, gaps +/wiki-graph # build graph.html from all wikilinks +``` + +Plain English also works with any agent: +``` +"Ingest this paper: raw/papers/llama2.md" +"What does the wiki say about attention mechanisms?" 
+"Check for contradictions across sources" +"Build the knowledge graph and tell me the most connected nodes" +``` + +Works with any markdown source — articles, papers, book chapters, meeting notes, journal entries, research summaries. + +## What You Get + +**Persistent wiki** — structured markdown pages that accumulate across sessions. Unlike chat, nothing is lost. + +**Entity pages** — auto-created for every person, company, or project mentioned across sources. Updated each time a new source references them. + +**Concept pages** — auto-created for every key idea or framework. Cross-referenced to every source that discusses them. + +**Living overview** — `wiki/overview.md` is revised on every ingest to reflect the current synthesis across everything you've read. + +**Contradiction flags** — when a new source contradicts an existing claim, it's flagged at ingest time, not buried until query time. + +**Knowledge graph** — `graph.html` shows every wiki page as a node, every `[[wikilink]]` as an edge, and Claude-inferred implicit relationships as dotted edges. Community detection clusters related topics. + +**Lint reports** — orphan pages, broken links, missing entity pages, data gaps with suggested sources to fill them. + +## Use Cases + +### Research + +Going deep on a topic over weeks — reading papers, articles, reports. + +``` +/wiki-ingest raw/papers/attention-is-all-you-need.md +/wiki-ingest raw/papers/llama2.md +/wiki-ingest raw/papers/rag-survey.md + +# Wiki builds entity pages (Meta AI, Google Brain) and +# concept pages (Attention, RLHF, Context Window) automatically. + +/wiki-query "What are the main approaches to reducing hallucination?" +/wiki-query "How has context window size evolved across models?" + +/wiki-lint +# → "No sources on mixture-of-experts — consider the Mixtral paper" +``` + +By the end you have a structured, interlinked reference — not a folder of PDFs you'll never reopen. + --- -# nexus +### Reading a Book +File each chapter as you go. 
Build out pages for characters, themes, arguments. + +``` +/wiki-ingest raw/book/chapter-01.md +/wiki-ingest raw/book/chapter-02.md + +# Wiki creates entity and theme pages automatically. + +/wiki-query "How has the protagonist's motivation evolved?" +/wiki-query "What contradictions exist in the author's argument so far?" + +/wiki-graph # → graph.html shows every character/theme and how they connect +``` + +Think fan wikis like Tolkien Gateway — built as you read, with the agent doing all the cross-referencing. + +--- + +### Personal Knowledge Base + +Track goals, health, habits, self-improvement — file journal entries, articles, podcast notes. + +``` +/wiki-ingest raw/journal/2026-01-week1.md +/wiki-ingest raw/articles/huberman-sleep-protocol.md +/wiki-ingest raw/articles/atomic-habits-summary.md + +/wiki-query "What patterns show up in my journal entries about energy?" +/wiki-query "What habits have I tried and what was the outcome?" +``` + +The wiki builds a structured picture over time. Concepts like "Sleep", "Exercise", "Deep Work" accumulate evidence from every source filed. + +--- + +### Business / Team Intelligence + +Feed in meeting transcripts, project docs, customer calls. + +``` +/wiki-ingest raw/meetings/q1-planning-transcript.md +/wiki-ingest raw/docs/product-roadmap-2026.md +/wiki-ingest raw/calls/customer-interview-acme.md + +/wiki-query "What feature requests have come up most across customer calls?" +/wiki-query "What decisions were made in Q1 and what was the rationale?" + +/wiki-lint +# → "Project X mentioned in 5 pages but no dedicated page" +# → "Roadmap contradicts customer interview on priority of feature Y" +``` + +The wiki stays current because the agent does the maintenance no one wants to do. + +--- + +### Competitive Analysis + +Track a company, market, or technology over time. 
+ +``` +/wiki-ingest raw/competitors/openai-announcements.md +/wiki-ingest raw/market/ai-funding-report-q1.md + +/wiki-query "How do OpenAI and Anthropic differ on safety approach?" +/wiki-query "Which companies announced multimodal models in the last 6 months?" +/wiki-query "Competitive landscape summary as of today" --save +``` + +## The Graph + +Two-pass build: + +1. **Deterministic** — parses all `[[wikilinks]]` across wiki pages → edges tagged `EXTRACTED` +2. **Semantic** — agent infers implicit relationships not captured by wikilinks → edges tagged `INFERRED` (with confidence score) or `AMBIGUOUS` + +Louvain community detection clusters nodes by topic. SHA256 cache means only changed pages are reprocessed. Output is a self-contained `graph.html` — no server, opens in any browser. + +## CLAUDE.md / AGENTS.md + +The schema file tells the agent how to maintain the wiki — page formats, ingest/query/lint/graph workflows, naming conventions. This is the key config file. Edit it to customize behavior for your domain. + +| Agent | Schema file | +|---|---| +| Claude Code | `CLAUDE.md` | +| Codex / OpenCode | `AGENTS.md` | +| Gemini CLI | `GEMINI.md` | + +## What Makes This Different from RAG + +| RAG | LLM Wiki Agent | +|---|---| +| Re-derives knowledge every query | Compiles once, keeps current | +| Raw chunks as retrieval unit | Structured wiki pages | +| No cross-references | Cross-references pre-built | +| Contradictions surface at query time (maybe) | Flagged at ingest time | +| No accumulation | Every source makes the wiki richer | + +## Obsidian Integration + +The wiki is designed to be browsed seamlessly in [Obsidian](https://obsidian.md). Since the agent maintains consistent `[[wikilinks]]`, you get a naturally growing knowledge graph in your vault. + +### Vault Symlink Pattern +If you want to keep the LLM Wiki Agent repository separate from your main personal vault, use symlinks: +1. Keep your working agent repository at e.g., `~/llm-wiki-agent` +2. 
Create a symlink from your main Obsidian vault: + ```bash + ln -sfn ~/llm-wiki-agent/wiki ~/your-obsidian-vault/wiki + ``` +3. Use the [Obsidian Web Clipper](https://obsidian.md/clipper) or write directly to `raw/` in the agent repo to queue items for ingestion. + +> **Note:** If you ever move your local repo directory, remember to update the symlink, otherwise the `wiki/` directory will appear missing in Obsidian. + +### Recommended .obsidian Config +- **Graph View:** Filter out `index.md` and `log.md` (e.g. `-file:index.md -file:log.md`) to avoid them becoming gravity wells in your Obsidian graph. +- **Dataview:** Use the community plugin [Dataview](https://blacksmithgu.github.io/obsidian-dataview/) to query the YAML frontmatter the agent automatically injects (e.g., `type: source`, `tags: [diary]`). + +## Tips + +- File good query answers back with `--save` — your explorations compound just like ingested sources +- The wiki is a git repo — version history for free +- Standalone Python scripts in `tools/` work without a coding agent (require `ANTHROPIC_API_KEY`) + +## Tech Stack + +NetworkX + Louvain + Claude + vis.js. No server, no database, runs entirely locally. Everything is plain markdown files. + +## Related + +- [graphify](https://github.com/safishamsi/graphify) — graph-based knowledge extraction skill (inspiration for the graph layer) +- [Vannevar Bush's Memex (1945)](https://en.wikipedia.org/wiki/Memex) — the original vision this resembles + +## License + +MIT License — see [LICENSE](LICENSE) for details. diff --git a/docs/automated-sync.md b/docs/automated-sync.md new file mode 100644 index 00000000..fc7f06ed --- /dev/null +++ b/docs/automated-sync.md @@ -0,0 +1,101 @@ +# Automated Wiki Synchronization Guide + +Managing an LLM Wiki works best when it constantly reflects your background note-taking system. Instead of manually ingesting files every time you write something new, you can orchestrate an end-to-end automation pipeline. 
+ +This guide outlines a production-grade cron/launchd strategy for local Mac/Linux environments. + +## The Two-Step Architecture + +LLM Wiki Agent ingestion is a two-step process: +1. **Syncing to `raw/`**: Getting files from your personal vault/tools into the agent's staging area. +2. **Batch Ingestion**: Triggering `tools/ingest.py` on the synchronized directories to synthesize and weave them into the graph. + +### Step 1: The Master Orchestrator Script + +Create a comprehensive shell script in your wiki root (`daily-automated-sync.sh`): + +```bash +#!/usr/bin/env bash +set -uo pipefail + +# Define variables +LAB_DIR="$HOME/projects/active/personal-wiki-lab" +LOG_FILE="$LAB_DIR/automation-cron.log" +DATE=$(date "+%Y-%m-%d %H:%M:%S") + +echo "=====================================================" >> "$LOG_FILE" +echo "[$DATE] Starting automated wiki synchronization..." >> "$LOG_FILE" + +cd "$LAB_DIR" || exit 1 + +# 1. Run your personal Vault-to-Raw symlink script here +# Example: ./sync-raw.sh >> "$LOG_FILE" 2>&1 + +# 2. Trigger LiteLLM batch ingestion using the LLM of your choice +export LLM_MODEL="gemini/gemini-3-flash-preview" +export GEMINI_API_KEY="AIzaSy..." # or export OPENAI_API_KEY + +echo "[$DATE] Batch ingesting markdown files..." >> "$LOG_FILE" +# IFS= and -r keep leading spaces and backslashes in file names intact +find raw/ -type l -name "*.md" -o -type f -name "*.md" | \ +while IFS= read -r file; do + python3 tools/ingest.py "$file" >> "$LOG_FILE" 2>&1 +done + +# 3. Heal Graph Context (Auto-resolves broken semantic links) +echo "[$DATE] Healing broken nodes..." >> "$LOG_FILE" +python3 tools/heal.py >> "$LOG_FILE" 2>&1 + +echo "[$(date "+%Y-%m-%d %H:%M:%S")] Automated sync completed." >> "$LOG_FILE" +echo "=====================================================" >> "$LOG_FILE" +``` + +Don't forget to make it executable: `chmod +x daily-automated-sync.sh`. + +### Step 2: System Scheduler (macOS launchd) + +On macOS, `launchd` is a better fit than `cron`: a calendar job missed while the machine was asleep runs when it wakes, whereas `cron` skips it. 
+ +Create a `.plist` file at `~/Library/LaunchAgents/com.personal-wiki-sync.plist`: + +```xml +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"> +<plist version="1.0"> +<dict> + <key>Label</key> + <string>com.personal-wiki-sync</string> + <key>ProgramArguments</key> + <array> + <string>/bin/bash</string> + <string>/Users/your-username/projects/active/personal-wiki-lab/daily-automated-sync.sh</string> + </array> + + <!-- Execute automatically at 2:00 AM daily --> + <key>StartCalendarInterval</key> + <dict> + <key>Hour</key> + <integer>2</integer> + <key>Minute</key> + <integer>0</integer> + </dict> + + <!-- Also run once whenever the job is loaded (e.g. at login) --> + <key>RunAtLoad</key> + <true/> + + <!-- Diagnostic Logs --> + <key>StandardOutPath</key> + <string>/Users/your-username/projects/active/personal-wiki-lab/daemon.stdout.log</string> + <key>StandardErrorPath</key> + <string>/Users/your-username/projects/active/personal-wiki-lab/daemon.stderr.log</string> +</dict> +</plist> +``` + +Load the job: +```bash +launchctl load ~/Library/LaunchAgents/com.personal-wiki-sync.plist +``` + +### Self-Healing & Health Monitoring +Since the automation runs silently at night, check the logs for API failures: the script redirects tool output to `automation-cron.log`, and anything it does not redirect lands in `daemon.stderr.log`. The orchestrator script also runs `tools/heal.py`, which is strongly recommended: it fills in pages for concepts that accumulated throughout your day but were never individually formalized. diff --git a/examples/cjk-showcase/README.md b/examples/cjk-showcase/README.md new file mode 100644 index 00000000..ef104ab0 --- /dev/null +++ b/examples/cjk-showcase/README.md @@ -0,0 +1,14 @@ +# CJK Showcase (Chinese Language Example) + +This directory demonstrates how LLM Wiki Agent performs with non-English (CJK) languages. + +The agent naturally supports processing Chinese content. 
With the CJK query bug fixed, you can ingest, query, and linguistically search across Chinese entries without any language-specific configuration. + +## Files included in this showcase: + +- `raw/2026-04-13-reflection.md`: A sample source document (a personal reflection on career transition). +- `wiki/sources/2026-04-13-reflection.md`: The parsed structured source page. +- `wiki/entities/杨帆.md`: Auto-extracted Chinese entity page. +- `wiki/concepts/AI转型.md`: Auto-extracted Chinese concept page. + +Try running `python tools/query.py "关于AI转型的建议"` from the root directory after moving these to your main knowledge base to see how semantic extraction and keyword matching behave in non-English contexts! diff --git a/examples/cjk-showcase/raw/2026-04-13-reflection.md b/examples/cjk-showcase/raw/2026-04-13-reflection.md new file mode 100644 index 00000000..3316642e --- /dev/null +++ b/examples/cjk-showcase/raw/2026-04-13-reflection.md @@ -0,0 +1,7 @@ +# 2026-04-13 关于AI转型的复盘总结 + +今天和杨帆深入讨论了土木工程转向AI产品经理的路径。他提到最大的陷阱是“工具旅游(Tool Tourism)”——很多非技术背景的人沉迷于尝试各种AI工具,却忽略了业务本质和产品交付。 + +真正的破局点在于将大模型视为一种新的计算范式,而不是魔术。我们需要关注模型稳定性、成本、并发以及长上下文的召回率。同时,我也在思考目前个人的技术栈,从玩提示词到掌握Agentic Workflow框架(如LangChain或自定义多Agent系统),这是一个质的飞跃。 + +决定下一步:减少看泛科普文章,直接深入开源社区,比如通过贡献代码或者提出架构Issue来积累实际影响力。 diff --git a/graph/.gitkeep b/graph/.gitkeep new file mode 100644 index 00000000..e69de29b diff --git a/raw/.gitkeep b/raw/.gitkeep new file mode 100644 index 00000000..e69de29b diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 00000000..a9c7b7ba --- /dev/null +++ b/requirements.txt @@ -0,0 +1,2 @@ +litellm>=1.0.0 +networkx>=3.2 diff --git a/tools/build_graph.py b/tools/build_graph.py new file mode 100644 index 00000000..73be7b49 --- /dev/null +++ b/tools/build_graph.py @@ -0,0 +1,454 @@ +#!/usr/bin/env python3 +""" +Build the knowledge graph from the wiki. 
+ +Usage: + python tools/build_graph.py # full rebuild + python tools/build_graph.py --no-infer # skip semantic inference (faster) + python tools/build_graph.py --open # open graph.html in browser after build + +Outputs: + graph/graph.json — node/edge data (cached by SHA256) + graph/graph.html — interactive vis.js visualization + +Edge types: + EXTRACTED — explicit [[wikilink]] in a page + INFERRED — Claude-detected implicit relationship + AMBIGUOUS — low-confidence inferred relationship +""" + +import re +import json +import hashlib +import argparse +import webbrowser +from pathlib import Path +from datetime import date + +import os + +try: + import networkx as nx + from networkx.algorithms import community as nx_community + HAS_NETWORKX = True +except ImportError: + HAS_NETWORKX = False + print("Warning: networkx not installed. Community detection disabled. Run: pip install networkx") + +REPO_ROOT = Path(__file__).parent.parent +WIKI_DIR = REPO_ROOT / "wiki" +GRAPH_DIR = REPO_ROOT / "graph" +GRAPH_JSON = GRAPH_DIR / "graph.json" +GRAPH_HTML = GRAPH_DIR / "graph.html" +CACHE_FILE = GRAPH_DIR / ".cache.json" +LOG_FILE = WIKI_DIR / "log.md" +SCHEMA_FILE = REPO_ROOT / "CLAUDE.md" + +# Node type → color mapping +TYPE_COLORS = { + "source": "#4CAF50", + "entity": "#2196F3", + "concept": "#FF9800", + "synthesis": "#9C27B0", + "unknown": "#9E9E9E", +} + +EDGE_COLORS = { + "EXTRACTED": "#555555", + "INFERRED": "#FF5722", + "AMBIGUOUS": "#BDBDBD", +} + + +def read_file(path: Path) -> str: + return path.read_text(encoding="utf-8") if path.exists() else "" + + +def call_llm(prompt: str, model_env: str, default_model: str, max_tokens: int = 4096) -> str: + try: + from litellm import completion + except ImportError: + print("Error: litellm not installed. 
Run: pip install litellm") + import sys + sys.exit(1) + + model = os.getenv(model_env, default_model) + response = completion( + model=model, + messages=[{"role": "user", "content": prompt}], + max_tokens=max_tokens + ) + return response.choices[0].message.content + + +def sha256(text: str) -> str: + return hashlib.sha256(text.encode()).hexdigest() + + +def all_wiki_pages() -> list[Path]: + return [p for p in WIKI_DIR.rglob("*.md") + if p.name not in ("index.md", "log.md", "lint-report.md")] + + +def extract_wikilinks(content: str) -> list[str]: + return list(set(re.findall(r'\[\[([^\]]+)\]\]', content))) + + +def extract_frontmatter_type(content: str) -> str: + match = re.search(r'^type:\s*(\S+)', content, re.MULTILINE) + return match.group(1).strip('"\'') if match else "unknown" + + +def page_id(path: Path) -> str: + return path.relative_to(WIKI_DIR).as_posix().replace(".md", "") + + +def load_cache() -> dict: + if CACHE_FILE.exists(): + try: + return json.loads(CACHE_FILE.read_text()) + except (json.JSONDecodeError, IOError): + return {} + return {} + + +def save_cache(cache: dict): + GRAPH_DIR.mkdir(parents=True, exist_ok=True) + CACHE_FILE.write_text(json.dumps(cache, indent=2)) + + +def build_nodes(pages: list[Path]) -> list[dict]: + nodes = [] + for p in pages: + content = read_file(p) + node_type = extract_frontmatter_type(content) + title_match = re.search(r'^title:\s*"?([^"\n]+)"?', content, re.MULTILINE) + label = title_match.group(1).strip() if title_match else p.stem + nodes.append({ + "id": page_id(p), + "label": label, + "type": node_type, + "color": TYPE_COLORS.get(node_type, TYPE_COLORS["unknown"]), + "path": str(p.relative_to(REPO_ROOT)), + }) + return nodes + + +def build_extracted_edges(pages: list[Path]) -> list[dict]: + """Pass 1: deterministic wikilink edges.""" + # Build a map from stem (lower) -> page_id for resolution + stem_map = {p.stem.lower(): page_id(p) for p in pages} + edges = [] + seen = set() + for p in pages: + content = 
read_file(p) + src = page_id(p) + for link in extract_wikilinks(content): + target = stem_map.get(link.lower()) + if target and target != src: + key = (src, target) + if key not in seen: + seen.add(key) + edges.append({ + "from": src, + "to": target, + "type": "EXTRACTED", + "color": EDGE_COLORS["EXTRACTED"], + "confidence": 1.0, + }) + return edges + + +def build_inferred_edges(pages: list[Path], existing_edges: list[dict], cache: dict) -> list[dict]: + """Pass 2: API-inferred semantic relationships.""" + new_edges = [] + + # Only process pages that changed since last run + changed_pages = [] + for p in pages: + content = read_file(p) + h = sha256(content) + entry = cache.get(str(p)) + + if not isinstance(entry, dict) or entry.get("hash") != h: + changed_pages.append(p) + else: + # Page unchanged: reuse its inferred edges from the cache + src = page_id(p) + for rel in entry.get("edges", []): + new_edges.append({ + "from": src, + "to": rel["to"], + "type": rel.get("type", "INFERRED"), + "title": rel.get("relationship", ""), + "label": "", + "color": EDGE_COLORS.get(rel.get("type", "INFERRED"), EDGE_COLORS["INFERRED"]), + "confidence": float(rel.get("confidence", 0.7)), + }) + + if not changed_pages: + print(" no changed pages — skipping semantic inference") + return new_edges # keep the edges restored from the cache + + print(f" inferring relationships for {len(changed_pages)} changed pages...") + + # Build a summary of existing nodes for context + node_list = "\n".join(f"- {page_id(p)} ({extract_frontmatter_type(read_file(p))})" for p in pages) + existing_edge_summary = "\n".join( + f"- {e['from']} → {e['to']} (EXTRACTED)" for e in existing_edges[:30] + ) + + for p in changed_pages: + content = read_file(p)[:2000] # truncate for context efficiency + src = page_id(p) + + prompt = f"""Analyze this wiki page and identify implicit semantic relationships to other pages in the wiki.
+ +Source page: {src} +Content: +{content} + +All available pages: +{node_list} + +Already-extracted edges from this page: +{existing_edge_summary} + +Return ONLY a JSON array of NEW relationships not already captured by explicit wikilinks: +[ + {{"to": "page-id", "relationship": "one-line description", "confidence": 0.0-1.0, "type": "INFERRED or AMBIGUOUS"}} +] + +Rules: +- Only include pages from the available list above +- Confidence >= 0.7 → INFERRED, < 0.7 → AMBIGUOUS +- Do not repeat edges already in the extracted list +- Return empty array [] if no new relationships found +""" + raw = call_llm(prompt, "LLM_MODEL_FAST", "claude-3-5-haiku-latest", max_tokens=1024) + raw = raw.strip() + raw = re.sub(r"^```(?:json)?\s*", "", raw) + raw = re.sub(r"\s*```$", "", raw) + + try: + inferred = json.loads(raw) + valid_rels = [] + for rel in inferred: + if isinstance(rel, dict) and "to" in rel: + new_edges.append({ + "from": src, + "to": rel["to"], + "type": rel.get("type", "INFERRED"), + "title": rel.get("relationship", ""), + "label": "", + "color": EDGE_COLORS.get(rel.get("type", "INFERRED"), EDGE_COLORS["INFERRED"]), + "confidence": float(rel.get("confidence", 0.7)), + }) + valid_rels.append(rel) + + # Save to cache, hashing the full page text: `content` is truncated, so its hash would never match the change check + cache[str(p)] = { + "hash": sha256(read_file(p)), + "edges": valid_rels + } + except (json.JSONDecodeError, TypeError, ValueError): + pass + + return new_edges + + +def detect_communities(nodes: list[dict], edges: list[dict]) -> dict[str, int]: + """Assign community IDs to nodes using Louvain algorithm.""" + if not HAS_NETWORKX: + return {} + + G = nx.Graph() + for n in nodes: + G.add_node(n["id"]) + for e in edges: + G.add_edge(e["from"], e["to"]) + + if G.number_of_edges() == 0: + return {} + + try: + communities = nx_community.louvain_communities(G, seed=42) + node_to_community = {} + for i, comm in enumerate(communities): + for node in comm: + node_to_community[node] = i + return node_to_community + except Exception: + return {} + +
+COMMUNITY_COLORS = [ + "#E91E63", "#00BCD4", "#8BC34A", "#FF5722", "#673AB7", + "#FFC107", "#009688", "#F44336", "#3F51B5", "#CDDC39", +] + + +def render_html(nodes: list[dict], edges: list[dict]) -> str: + """Generate self-contained vis.js HTML.""" + nodes_json = json.dumps(nodes, indent=2) + edges_json = json.dumps(edges, indent=2) + + legend_items = "".join( + f'<span style="background:{color};padding:3px 8px;margin:2px;border-radius:3px;font-size:12px">{t}</span>' + for t, color in TYPE_COLORS.items() if t != "unknown" + ) + + return f"""<!DOCTYPE html> +<html lang="en"> +<head> +<meta charset="UTF-8"> +<title>LLM Wiki — Knowledge Graph + + + + +
+

LLM Wiki Graph

+ +
{legend_items}
+
+ ── Explicit link
+ ── Inferred +
+
+
+
+
+
+ +
+
+ + +""" + + +def append_log(entry: str): + log_path = WIKI_DIR / "log.md" + existing = read_file(log_path) + log_path.write_text(entry.strip() + "\n\n" + existing, encoding="utf-8") + + +def build_graph(infer: bool = True, open_browser: bool = False): + pages = all_wiki_pages() + today = date.today().isoformat() + + if not pages: + print("Wiki is empty. Ingest some sources first.") + return + + print(f"Building graph from {len(pages)} wiki pages...") + GRAPH_DIR.mkdir(parents=True, exist_ok=True) + + cache = load_cache() + + # Pass 1: extracted edges + print(" Pass 1: extracting wikilinks...") + nodes = build_nodes(pages) + edges = build_extracted_edges(pages) + print(f" → {len(edges)} extracted edges") + + # Pass 2: inferred edges + if infer: + print(" Pass 2: inferring semantic relationships...") + inferred = build_inferred_edges(pages, edges, cache) + edges.extend(inferred) + print(f" → {len(inferred)} inferred edges") + save_cache(cache) + + # Community detection + print(" Running Louvain community detection...") + communities = detect_communities(nodes, edges) + for node in nodes: + comm_id = communities.get(node["id"], -1) + if comm_id >= 0: + node["color"] = COMMUNITY_COLORS[comm_id % len(COMMUNITY_COLORS)] + node["group"] = comm_id + + # Save graph.json + graph_data = {"nodes": nodes, "edges": edges, "built": today} + GRAPH_JSON.write_text(json.dumps(graph_data, indent=2)) + print(f" saved: graph/graph.json ({len(nodes)} nodes, {len(edges)} edges)") + + # Save graph.html + html = render_html(nodes, edges) + GRAPH_HTML.write_text(html) + print(f" saved: graph/graph.html") + + append_log(f"## [{today}] graph | Knowledge graph rebuilt\n\n{len(nodes)} nodes, {len(edges)} edges ({len([e for e in edges if e['type']=='EXTRACTED'])} extracted, {len([e for e in edges if e['type']=='INFERRED'])} inferred).") + + if open_browser: + webbrowser.open(f"file://{GRAPH_HTML.resolve()}") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Build 
LLM Wiki knowledge graph") + parser.add_argument("--no-infer", action="store_true", help="Skip semantic inference (faster)") + parser.add_argument("--open", action="store_true", help="Open graph.html in browser") + args = parser.parse_args() + build_graph(infer=not args.no_infer, open_browser=args.open) diff --git a/tools/heal.py b/tools/heal.py new file mode 100755 index 00000000..cf85a684 --- /dev/null +++ b/tools/heal.py @@ -0,0 +1,100 @@ +#!/usr/bin/env python3 +""" +Graph Self-Healing Tool + +Automatically retrieves "Missing Entity Pages" from the wiki and generates +comprehensive definition pages for them using the LLM. +It resolves broken entity links by scanning existing contexts where the entity is referenced. + +Usage: + python tools/heal.py +""" + +import os +import sys +from pathlib import Path + +try: + from litellm import completion +except ImportError: + print("Error: litellm not installed. Run: pip install litellm") + sys.exit(1) + +# Ensure tools can be imported +sys.path.insert(0, str(Path(__file__).parent.parent)) + +from tools.lint import find_missing_entities, all_wiki_pages + +REPO_ROOT = Path(__file__).parent.parent +WIKI_DIR = REPO_ROOT / "wiki" +ENTITIES_DIR = WIKI_DIR / "entities" + +def call_llm(prompt: str, max_tokens: int = 1500) -> str: + # Use litellm standard environment variables + # e.g., GEMINI_API_KEY, ANTHROPIC_API_KEY, OPENAI_API_KEY + model = os.getenv("LLM_MODEL", "claude-3-5-haiku-latest") # default to fast model + + response = completion( + model=model, + messages=[{"role": "user", "content": prompt}], + max_tokens=max_tokens + ) + return response.choices[0].message.content + +def search_sources(entity: str, pages: list[Path]) -> list[Path]: + """Find up to 15 pages where this entity is mentioned natively.""" + sources = [] + for p in pages: + if "entities" not in str(p.parent) and "concepts" not in str(p.parent): + content = p.read_text(encoding="utf-8") + if entity.lower() in content.lower(): + sources.append(p) + return 
sources[:15] + +def heal_missing_entities(): + pages = all_wiki_pages() + missing_entities = find_missing_entities(pages) + + if not missing_entities: + print("Graph is fully connected. No missing entities found!") + return + + ENTITIES_DIR.mkdir(exist_ok=True, parents=True) + print(f"Found {len(missing_entities)} missing entity nodes. Commencing auto-heal...") + + for entity in missing_entities: + print(f"Healing entity page for: {entity}") + sources = search_sources(entity, pages) + + context = "" + for s in sources: + context += f"\n\n### {s.name}\n{s.read_text(encoding='utf-8')[:800]}" + + prompt = f"""You are filling a data gap in the Personal LLM Wiki. +Create an Entity definition page for "{entity}". + +Here is how the entity appears in the current sources: +{context} + +Format: +--- +title: "{entity}" +type: entity +tags: [] +sources: {[s.name for s in sources]} +--- + +# {entity} + +Write a comprehensive paragraph defining what `{entity}` means in the context of this wiki, its main significance, and any actions or associations related to it. +""" + try: + result = call_llm(prompt) + out_path = ENTITIES_DIR / f"{entity}.md" + out_path.write_text(result, encoding="utf-8") + print(f" -> Saved to {out_path.relative_to(REPO_ROOT)}") + except Exception as e: + print(f" [!] Failed to generate {entity}: {e}") + +if __name__ == "__main__": + heal_missing_entities() diff --git a/tools/ingest.py b/tools/ingest.py new file mode 100644 index 00000000..7c0bb988 --- /dev/null +++ b/tools/ingest.py @@ -0,0 +1,239 @@ +#!/usr/bin/env python3 +""" +Ingest a source document into the LLM Wiki. 
+ +Usage: + python tools/ingest.py <path-to-source> + python tools/ingest.py raw/articles/my-article.md + +The LLM reads the source, extracts knowledge, and updates the wiki: + - Creates wiki/sources/<slug>.md + - Updates wiki/index.md + - Updates wiki/overview.md (if warranted) + - Creates/updates entity and concept pages + - Appends to wiki/log.md + - Flags contradictions +""" + +import os +import sys +import json +import hashlib +import re +from pathlib import Path +from datetime import date + +REPO_ROOT = Path(__file__).parent.parent +WIKI_DIR = REPO_ROOT / "wiki" +LOG_FILE = WIKI_DIR / "log.md" +INDEX_FILE = WIKI_DIR / "index.md" +OVERVIEW_FILE = WIKI_DIR / "overview.md" +SCHEMA_FILE = REPO_ROOT / "CLAUDE.md" + + +def sha256(text: str) -> str: + return hashlib.sha256(text.encode()).hexdigest()[:16] + + +def read_file(path: Path) -> str: + return path.read_text(encoding="utf-8") if path.exists() else "" + + +def call_llm(prompt: str, max_tokens: int = 8192) -> str: + try: + from litellm import completion + except ImportError: + print("Error: litellm not installed.
Run: pip install litellm") + sys.exit(1) + + model = os.getenv("LLM_MODEL", "claude-3-5-sonnet-latest") + response = completion( + model=model, + messages=[{"role": "user", "content": prompt}], + max_tokens=max_tokens + ) + return response.choices[0].message.content + + +def write_file(path: Path, content: str): + path.parent.mkdir(parents=True, exist_ok=True) + path.write_text(content, encoding="utf-8") + print(f" wrote: {path.relative_to(REPO_ROOT)}") + + +def build_wiki_context() -> str: + parts = [] + if INDEX_FILE.exists(): + parts.append(f"## wiki/index.md\n{read_file(INDEX_FILE)}") + if OVERVIEW_FILE.exists(): + parts.append(f"## wiki/overview.md\n{read_file(OVERVIEW_FILE)}") + # Include a few recent source pages for contradiction checking + sources_dir = WIKI_DIR / "sources" + if sources_dir.exists(): + recent = sorted(sources_dir.glob("*.md"), key=lambda p: p.stat().st_mtime, reverse=True)[:5] + for p in recent: + parts.append(f"## {p.relative_to(REPO_ROOT)}\n{p.read_text()}") + return "\n\n---\n\n".join(parts) + + +def parse_json_from_response(text: str) -> dict: + # Strip markdown code fences if present + text = re.sub(r"^```(?:json)?\s*", "", text.strip()) + text = re.sub(r"\s*```$", "", text.strip()) + # Find the outermost JSON object + match = re.search(r"\{[\s\S]*\}", text) + if not match: + raise ValueError("No JSON object found in response") + return json.loads(match.group()) + + +def update_index(new_entry: str, section: str = "Sources"): + content = read_file(INDEX_FILE) + if not content: + content = "# Wiki Index\n\n## Overview\n- [Overview](overview.md) — living synthesis\n\n## Sources\n\n## Entities\n\n## Concepts\n\n## Syntheses\n" + section_header = f"## {section}" + if section_header in content: + content = content.replace(section_header + "\n", section_header + "\n" + new_entry + "\n") + else: + content += f"\n{section_header}\n{new_entry}\n" + write_file(INDEX_FILE, content) + + +def append_log(entry: str): + existing = 
read_file(LOG_FILE) + write_file(LOG_FILE, entry.strip() + "\n\n" + existing) + + +def ingest(source_path: str): + source = Path(source_path) + if not source.exists(): + print(f"Error: file not found: {source_path}") + sys.exit(1) + + source_content = source.read_text(encoding="utf-8") + source_hash = sha256(source_content) + today = date.today().isoformat() + + print(f"\nIngesting: {source.name} (hash: {source_hash})") + + wiki_context = build_wiki_context() + schema = read_file(SCHEMA_FILE) + + prompt = f"""You are maintaining an LLM Wiki. Process this source document and integrate its knowledge into the wiki. + +Schema and conventions: +{schema} + +Current wiki state (index + recent pages): +{wiki_context if wiki_context else "(wiki is empty — this is the first source)"} + +New source to ingest (file: {source.relative_to(REPO_ROOT) if source.is_relative_to(REPO_ROOT) else source.name}): +=== SOURCE START === +{source_content} +=== SOURCE END === + +Today's date: {today} + +Return ONLY a valid JSON object with these fields (no markdown fences, no prose outside the JSON): +{{ + "title": "Human-readable title for this source", + "slug": "kebab-case-slug-for-filename", + "source_page": "full markdown content for wiki/sources/.md — use the source page format from the schema", + "index_entry": "- [Title](sources/slug.md) — one-line summary", + "overview_update": "full updated content for wiki/overview.md, or null if no update needed", + "entity_pages": [ + {{"path": "entities/EntityName.md", "content": "full markdown content"}} + ], + "concept_pages": [ + {{"path": "concepts/ConceptName.md", "content": "full markdown content"}} + ], + "contradictions": ["describe any contradiction with existing wiki content, or empty list"], + "log_entry": "## [{today}] ingest | \\n\\nAdded source. Key claims: ..."
+}} +""" + + print(f" calling API (model: ...)") + raw = call_llm(prompt, max_tokens=8192) + try: + data = parse_json_from_response(raw) + except (ValueError, json.JSONDecodeError) as e: + print(f"Error parsing API response: {e}") + print("Raw response saved to /tmp/ingest_debug.txt") + Path("/tmp/ingest_debug.txt").write_text(raw) + sys.exit(1) + + # Write source page + slug = data["slug"] + write_file(WIKI_DIR / "sources" / f"{slug}.md", data["source_page"]) + + # Write entity pages + for page in data.get("entity_pages", []): + write_file(WIKI_DIR / page["path"], page["content"]) + + # Write concept pages + for page in data.get("concept_pages", []): + write_file(WIKI_DIR / page["path"], page["content"]) + + # Update overview + if data.get("overview_update"): + write_file(OVERVIEW_FILE, data["overview_update"]) + + # Update index + update_index(data["index_entry"], section="Sources") + + # Append log + append_log(data["log_entry"]) + + # Report contradictions + contradictions = data.get("contradictions", []) + if contradictions: + print("\n ⚠️ Contradictions detected:") + for c in contradictions: + print(f" - {c}") + + print(f"\nDone. Ingested: {data['title']}") + + +if __name__ == "__main__": + if len(sys.argv) < 2: + print("Usage: python tools/ingest.py <path-to-source> [path2 ...] 
[dir1 ...]") + sys.exit(1) + + paths_to_process = [] + for arg in sys.argv[1:]: + p = Path(arg) + if p.is_file() and p.suffix == ".md": + paths_to_process.append(p) + elif p.is_dir(): + for f in p.rglob("*.md"): + if f.is_file(): + paths_to_process.append(f) + else: + import glob + for f in glob.glob(arg, recursive=True): + g_p = Path(f) + if g_p.is_file() and g_p.suffix == ".md": + paths_to_process.append(g_p) + + # Deduplicate while preserving order + unique_paths = [] + seen = set() + for p in paths_to_process: + abs_p = p.resolve() + if abs_p not in seen: + seen.add(abs_p) + unique_paths.append(p) + + if not unique_paths: + print("Error: no markdown files found to ingest.") + sys.exit(1) + + if len(unique_paths) > 1: + print(f"Batch mode: found {len(unique_paths)} files to ingest.") + + for p in unique_paths: + ingest(str(p)) diff --git a/tools/lint.py b/tools/lint.py new file mode 100644 index 00000000..c7997ee5 --- /dev/null +++ b/tools/lint.py @@ -0,0 +1,210 @@ +#!/usr/bin/env python3 +""" +Lint the LLM Wiki for health issues. 
+ +Usage: + python tools/lint.py + python tools/lint.py --save # save lint report to wiki/lint-report.md + +Checks: + - Orphan pages (no inbound wikilinks from other pages) + - Broken wikilinks (pointing to pages that don't exist) + - Missing entity pages (entities mentioned in 3+ pages but no page) + - Contradictions between pages + - Data gaps and suggested new sources +""" + +import re +import sys +import argparse +from pathlib import Path +from collections import defaultdict +from datetime import date + +import os + +REPO_ROOT = Path(__file__).parent.parent +WIKI_DIR = REPO_ROOT / "wiki" +LOG_FILE = WIKI_DIR / "log.md" +SCHEMA_FILE = REPO_ROOT / "CLAUDE.md" + + +def read_file(path: Path) -> str: + return path.read_text(encoding="utf-8") if path.exists() else "" + + +def call_llm(prompt: str, model_env: str, default_model: str, max_tokens: int = 4096) -> str: + try: + from litellm import completion + except ImportError: + print("Error: litellm not installed. Run: pip install litellm") + sys.exit(1) + + model = os.getenv(model_env, default_model) + response = completion( + model=model, + messages=[{"role": "user", "content": prompt}], + max_tokens=max_tokens + ) + return response.choices[0].message.content + + +def all_wiki_pages() -> list[Path]: + return [p for p in WIKI_DIR.rglob("*.md") + if p.name not in ("index.md", "log.md", "lint-report.md")] + + +def extract_wikilinks(content: str) -> list[str]: + return re.findall(r'\[\[([^\]]+)\]\]', content) + + +def page_name_to_path(name: str) -> list[Path]: + """Try to resolve a [[WikiLink]] to a file path.""" + candidates = [] + for p in all_wiki_pages(): + if p.stem.lower() == name.lower() or p.stem == name: + candidates.append(p) + return candidates + + +def find_orphans(pages: list[Path]) -> list[Path]: + inbound = defaultdict(int) + for p in pages: + content = read_file(p) + for link in extract_wikilinks(content): + resolved = page_name_to_path(link) + for r in resolved: + inbound[r] += 1 + return [p for p in 
pages if inbound[p] == 0 and p != WIKI_DIR / "overview.md"] + + +def find_broken_links(pages: list[Path]) -> list[tuple[Path, str]]: + broken = [] + for p in pages: + content = read_file(p) + for link in extract_wikilinks(content): + if not page_name_to_path(link): + broken.append((p, link)) + return broken + + +def find_missing_entities(pages: list[Path]) -> list[str]: + """Find entity-like names mentioned in 3+ pages but lacking their own page.""" + mention_counts: dict[str, int] = defaultdict(int) + existing_pages = {p.stem.lower() for p in pages} + for p in pages: + content = read_file(p) + links = extract_wikilinks(content) + for link in links: + if link.lower() not in existing_pages: + mention_counts[link] += 1 + return [name for name, count in mention_counts.items() if count >= 3] + + +def run_lint(): + pages = all_wiki_pages() + today = date.today().isoformat() + + if not pages: + print("Wiki is empty. Nothing to lint.") + return "" + + print(f"Linting {len(pages)} wiki pages...") + + # Deterministic checks + orphans = find_orphans(pages) + broken = find_broken_links(pages) + missing_entities = find_missing_entities(pages) + + print(f" orphans: {len(orphans)}") + print(f" broken links: {len(broken)}") + print(f" missing entity pages: {len(missing_entities)}") + + # Build context for semantic checks (contradictions, gaps) + # Use a sample of pages to stay within context limits + sample = pages[:20] + pages_context = "" + for p in sample: + rel = p.relative_to(REPO_ROOT) + pages_context += f"\n\n### {rel}\n{read_file(p)[:1500]}" # truncate long pages + + print(" running semantic lint via API...") + prompt = f"""You are linting an LLM Wiki. Review the pages below and identify: +1. Contradictions between pages (claims that conflict) +2. Stale content (summaries that newer sources have superseded) +3. Data gaps (important questions the wiki can't answer — suggest specific sources to find) +4. 
Concepts mentioned but lacking depth + +Wiki pages (sample of {len(sample)} pages): +{pages_context} + +Return a markdown lint report with these sections: +## Contradictions +## Stale Content +## Data Gaps & Suggested Sources +## Concepts Needing More Depth + +Be specific — name the exact pages and claims involved. +""" + semantic_report = call_llm(prompt, "LLM_MODEL", "claude-3-5-sonnet-latest", max_tokens=3000) + + # Compose full report + report_lines = [ + f"# Wiki Lint Report — {today}", + "", + f"Scanned {len(pages)} pages.", + "", + "## Structural Issues", + "", + ] + + if orphans: + report_lines.append("### Orphan Pages (no inbound links)") + for p in orphans: + report_lines.append(f"- `{p.relative_to(REPO_ROOT)}`") + report_lines.append("") + + if broken: + report_lines.append("### Broken Wikilinks") + for page, link in broken: + report_lines.append(f"- `{page.relative_to(REPO_ROOT)}` links to `[[{link}]]` — not found") + report_lines.append("") + + if missing_entities: + report_lines.append("### Missing Entity Pages (mentioned 3+ times but no page)") + for name in missing_entities: + report_lines.append(f"- `[[{name}]]`") + report_lines.append("") + + if not orphans and not broken and not missing_entities: + report_lines.append("No structural issues found.") + report_lines.append("") + + report_lines.append("---") + report_lines.append("") + report_lines.append(semantic_report) + + report = "\n".join(report_lines) + print("\n" + report) + return report + + +def append_log(entry: str): + existing = read_file(LOG_FILE) + LOG_FILE.write_text(entry.strip() + "\n\n" + existing, encoding="utf-8") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Lint the LLM Wiki") + parser.add_argument("--save", action="store_true", help="Save lint report to wiki/lint-report.md") + args = parser.parse_args() + + report = run_lint() + + if args.save and report: + report_path = WIKI_DIR / "lint-report.md" + report_path.write_text(report, 
encoding="utf-8") + print(f"\nSaved: {report_path.relative_to(REPO_ROOT)}") + + today = date.today().isoformat() + append_log(f"## [{today}] lint | Wiki health check\n\nRan lint. See lint-report.md for details.") diff --git a/tools/query.py b/tools/query.py new file mode 100644 index 00000000..7b5c2bb0 --- /dev/null +++ b/tools/query.py @@ -0,0 +1,192 @@ +#!/usr/bin/env python3 +""" +Query the LLM Wiki. + +Usage: + python tools/query.py "What are the main themes across all sources?" + python tools/query.py "How does ConceptA relate to ConceptB?" --save + python tools/query.py "Summarize everything about EntityName" --save syntheses/my-analysis.md + +Flags: + --save Save the answer back into the wiki (prompts for filename) + --save <path> Save to a specific wiki path +""" + +import sys +import re +import json +import argparse +from pathlib import Path +from datetime import date + +import os + +REPO_ROOT = Path(__file__).parent.parent +WIKI_DIR = REPO_ROOT / "wiki" +INDEX_FILE = WIKI_DIR / "index.md" +LOG_FILE = WIKI_DIR / "log.md" +SCHEMA_FILE = REPO_ROOT / "CLAUDE.md" + + +def read_file(path: Path) -> str: + return path.read_text(encoding="utf-8") if path.exists() else "" + + +def write_file(path: Path, content: str): + path.parent.mkdir(parents=True, exist_ok=True) + path.write_text(content, encoding="utf-8") + print(f" saved: {path.relative_to(REPO_ROOT)}") + + +def call_llm(prompt: str, model_env: str, default_model: str, max_tokens: int = 4096) -> str: + try: + from litellm import completion + except ImportError: + print("Error: litellm not installed.
Run: pip install litellm") + sys.exit(1) + + model = os.getenv(model_env, default_model) + response = completion( + model=model, + messages=[{"role": "user", "content": prompt}], + max_tokens=max_tokens + ) + return response.choices[0].message.content + + +def find_relevant_pages(question: str, index_content: str) -> list[Path]: + """Extract linked pages from index that seem relevant to the question.""" + # Pull markdown links ([title](href)) from the index; [[wikilinks]] are not matched here + md_links = re.findall(r'\[([^\]]+)\]\(([^)]+)\)', index_content) + question_lower = question.lower() + relevant = [] + + for title, href in md_links: + title_lower = title.lower() + match = False + + # 1. English/Space-separated: check words > 3 chars + if any(word in question_lower for word in title_lower.split() if len(word) > 3): + match = True + # 2. Exact substring match for the whole title (useful for short CJK titles, e.g. len=2) + elif len(title_lower) >= 2 and title_lower in question_lower: + match = True + # 3. CJK chunks: find contiguous non-ASCII characters (len >= 2) and check if in question + elif any(chunk in question_lower for chunk in re.findall(r'[^\x00-\x7F]{2,}', title_lower)): + match = True + + if match: + p = WIKI_DIR / href + if p.exists() and p not in relevant: + relevant.append(p) + + # Always include overview + overview = WIKI_DIR / "overview.md" + if overview.exists() and overview not in relevant: + relevant.insert(0, overview) + return relevant[:12] # cap to avoid context overflow + + +def append_log(entry: str): + existing = read_file(LOG_FILE) + LOG_FILE.write_text(entry.strip() + "\n\n" + existing, encoding="utf-8") + + +def query(question: str, save_path: str | None = None): + today = date.today().isoformat() + + # Step 1: Read index + index_content = read_file(INDEX_FILE) + if not index_content: + print("Wiki is empty.
Ingest some sources first with: python tools/ingest.py <source>") + sys.exit(1) + + # Step 2: Find relevant pages + relevant_pages = find_relevant_pages(question, index_content) + + # If no keyword match, ask Claude to identify relevant pages from the index + if not relevant_pages or len(relevant_pages) <= 1: + print(" selecting relevant pages via API...") + prompt = f"Given this wiki index:\n\n{index_content}\n\nWhich pages are most relevant to answering: \"{question}\"\n\nReturn ONLY a JSON array of relative file paths (as listed in the index), e.g. [\"sources/foo.md\", \"concepts/Bar.md\"]. Maximum 10 pages." + raw = call_llm(prompt, "LLM_MODEL_FAST", "claude-3-5-haiku-latest", max_tokens=512) + raw = raw.strip() + raw = re.sub(r"^```(?:json)?\s*", "", raw) + raw = re.sub(r"\s*```$", "", raw) + try: + paths = json.loads(raw) + relevant_pages = [WIKI_DIR / p for p in paths if (WIKI_DIR / p).exists()] + except (json.JSONDecodeError, TypeError): + pass + + # Step 3: Read relevant pages + pages_context = "" + for p in relevant_pages: + rel = p.relative_to(REPO_ROOT) + pages_context += f"\n\n### {rel}\n{p.read_text(encoding='utf-8')}" + + if not pages_context: + pages_context = f"\n\n### wiki/index.md\n{index_content}" + + schema = read_file(SCHEMA_FILE) + + # Step 4: Synthesize answer + print(f" synthesizing answer from {len(relevant_pages)} pages...") + prompt = f"""You are querying an LLM Wiki to answer a question. Use the wiki pages below to synthesize a thorough answer. Cite sources using [[PageName]] wikilink syntax. + +Schema: +{schema} + +Wiki pages: +{pages_context} + +Question: {question} + +Write a well-structured markdown answer with headers, bullets, and [[wikilink]] citations. At the end, add a ## Sources section listing the pages you drew from. 
+"""
+    answer = call_llm(prompt, "LLM_MODEL", "claude-3-5-sonnet-latest", max_tokens=4096)
+    print("\n" + "=" * 60)
+    print(answer)
+    print("=" * 60)
+
+    # Step 5: Optionally save answer
+    if save_path is not None:
+        if save_path == "":
+            # Prompt for filename
+            slug = input("\nSave as (slug, e.g. 'my-analysis'): ").strip()
+            if not slug:
+                print("Skipping save.")
+                return
+            save_path = f"syntheses/{slug}.md"
+
+        full_save_path = WIKI_DIR / save_path
+        frontmatter = f"""---
+title: "{question[:80]}"
+type: synthesis
+tags: []
+sources: []
+last_updated: {today}
+---
+
+"""
+        write_file(full_save_path, frontmatter + answer)
+
+        # Update index
+        index_content = read_file(INDEX_FILE)
+        entry = f"- [{question[:60]}]({save_path}) — synthesis"
+        if "## Syntheses" in index_content:
+            index_content = index_content.replace("## Syntheses\n", f"## Syntheses\n{entry}\n")
+            INDEX_FILE.write_text(index_content, encoding="utf-8")
+            print(f" indexed: {save_path}")
+
+    # Append to log
+    append_log(f"## [{today}] query | {question[:80]}\n\nSynthesized answer from {len(relevant_pages)} pages." +
+               (f" Saved to {save_path}." if save_path else ""))
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Query the LLM Wiki")
+    parser.add_argument("question", help="Question to ask the wiki")
+    parser.add_argument("--save", nargs="?", const="", default=None,
+                        help="Save answer to wiki (optionally specify path)")
+    args = parser.parse_args()
+    query(args.question, args.save)
diff --git a/wiki/index.md b/wiki/index.md
new file mode 100644
index 00000000..647ecb01
--- /dev/null
+++ b/wiki/index.md
@@ -0,0 +1,14 @@
+# Wiki Index
+
+This file is maintained by the LLM. Updated on every ingest.
+
+## Overview
+- [Overview](overview.md) — living synthesis across all sources
+
+## Sources
+
+## Entities
+
+## Concepts
+
+## Syntheses
diff --git a/wiki/log.md b/wiki/log.md
new file mode 100644
index 00000000..66a92854
--- /dev/null
+++ b/wiki/log.md
@@ -0,0 +1,9 @@
+# Wiki Log
+
+Append-only chronological record of all operations.
+
+Format: `## [YYYY-MM-DD] <operation> | <title>`
+
+Parse recent entries: `grep "^## \[" wiki/log.md | tail -10`
+
+---
diff --git a/wiki/overview.md b/wiki/overview.md
new file mode 100644
index 00000000..f71416d2
--- /dev/null
+++ b/wiki/overview.md
@@ -0,0 +1,17 @@
+---
+title: "Overview"
+type: synthesis
+tags: []
+sources: []
+last_updated: ""
+---
+
+# Overview
+
+*This page is maintained by the LLM. It is updated on every ingest to reflect the current synthesis across all sources.*
+
+No sources ingested yet. Add your first source with:
+
+```bash
+python tools/ingest.py raw/your-source.md
+```
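
The heading format that `wiki/log.md` documents, `## [YYYY-MM-DD] <operation> | <title>`, can also be read from Python instead of `grep`. The following is a minimal sketch and not part of this diff: `LogEntry`, `HEADING_RE`, and `parse_log_headings` are hypothetical names introduced for illustration, and the sketch assumes every entry heading follows the documented format exactly.

```python
import re
from dataclasses import dataclass
from datetime import date

# Matches headings such as "## [2025-01-31] ingest | My Paper".
# Assumption: the operation is a single word with no spaces.
HEADING_RE = re.compile(r"^## \[(\d{4}-\d{2}-\d{2})\] (\S+) \| (.+)$")


@dataclass
class LogEntry:
    day: date
    operation: str
    title: str


def parse_log_headings(log_text: str) -> list[LogEntry]:
    """Return one LogEntry per well-formed heading, in file order."""
    entries = []
    for line in log_text.splitlines():
        m = HEADING_RE.match(line.rstrip())
        if m is None:
            continue  # body text, the "# Wiki Log" title, or a malformed heading
        try:
            day = date.fromisoformat(m.group(1))
        except ValueError:
            continue  # right shape, but not a real calendar date
        entries.append(LogEntry(day, m.group(2), m.group(3).strip()))
    return entries
```

Sorting the result by `day`, for example `sorted(parse_log_headings(text), key=lambda e: e.day)[-10:]`, gives the most recent entries without depending on whether new entries were written at the top or the bottom of the file.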