change folder

This commit is contained in:
2026-04-21 19:24:48 +08:00
parent 0fe7ba237f
commit c4a04cbcee
2 changed files with 0 additions and 0 deletions

@@ -0,0 +1,355 @@
---
title: "hermes-agent/website/docs/user-guide/features/api-server.md at main"
source: "https://github.com/NousResearch/hermes-agent/blob/main/website/docs/user-guide/features/api-server.md"
author:
published:
created: 2026-04-20
description: "The agent that grows with you. Contribute to NousResearch/hermes-agent development by creating an account on GitHub."
tags:
- "clippings"
---
| Field | Value |
| --- | --- |
| sidebar\_position | 14 |
| title | API Server |
| description | Expose hermes-agent as an OpenAI-compatible API for any frontend |
## API Server
The API server exposes hermes-agent as an OpenAI-compatible HTTP endpoint. Any frontend that speaks the OpenAI format — Open WebUI, LobeChat, LibreChat, NextChat, ChatBox, and hundreds more — can connect to hermes-agent and use it as a backend.
Your agent handles requests with its full toolset (terminal, file operations, web search, memory, skills) and returns the final response. When streaming, tool progress indicators appear inline so frontends can show what the agent is doing.
## Quick Start
### 1\. Enable the API server
Add to `~/.hermes/.env`:
```
API_SERVER_ENABLED=true
API_SERVER_KEY=change-me-local-dev
# Optional: only if a browser must call Hermes directly
# API_SERVER_CORS_ORIGINS=http://localhost:3000
```
### 2\. Start the gateway
```
hermes gateway
```
You'll see:
```
[API Server] API server listening on http://127.0.0.1:8642
```
### 3\. Connect a frontend
Point any OpenAI-compatible client at `http://localhost:8642/v1`:
```
# Test with curl
curl http://localhost:8642/v1/chat/completions \
-H "Authorization: Bearer change-me-local-dev" \
-H "Content-Type: application/json" \
-d '{"model": "hermes-agent", "messages": [{"role": "user", "content": "Hello!"}]}'
```
Or connect Open WebUI, LobeChat, or any other frontend — see the [Open WebUI integration guide](https://github.com/NousResearch/hermes-agent/blob/main/docs/user-guide/messaging/open-webui) for step-by-step instructions.
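For scripted clients, the same request can be issued from Python's standard library. This is a minimal sketch, assuming the Quick Start values above (port 8642 and the key `change-me-local-dev`); `build_chat_request` and `ask` are hypothetical helper names, not part of Hermes.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8642/v1"  # default port from the Quick Start
API_KEY = "change-me-local-dev"        # must match API_SERVER_KEY


def build_chat_request(prompt: str, model: str = "hermes-agent") -> urllib.request.Request:
    """Build (but do not send) a Chat Completions request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With `hermes gateway` running, `ask("Hello!")` should return the assistant message from the response shape documented under Endpoints.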
## Endpoints
### POST /v1/chat/completions
Standard OpenAI Chat Completions format. Stateless — the full conversation is included in each request via the `messages` array.
**Request:**
```
{
"model": "hermes-agent",
"messages": [
{"role": "system", "content": "You are a Python expert."},
{"role": "user", "content": "Write a fibonacci function"}
],
"stream": false
}
```
**Response:**
```
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1710000000,
"model": "hermes-agent",
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "Here's a fibonacci function..."},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 50, "completion_tokens": 200, "total_tokens": 250}
}
```
**Streaming** (`"stream": true`): Returns Server-Sent Events (SSE) with token-by-token response chunks. For **Chat Completions**, the stream uses standard `chat.completion.chunk` events plus Hermes' custom `hermes.tool.progress` event for tool-start UX. For **Responses**, the stream uses OpenAI Responses event types such as `response.created`, `response.output_text.delta`, `response.output_item.added`, `response.output_item.done`, and `response.completed`.
**Tool progress in streams**:
- **Chat Completions**: Hermes emits `event: hermes.tool.progress` for tool-start visibility without polluting persisted assistant text.
- **Responses**: Hermes emits spec-native `function_call` and `function_call_output` output items during the SSE stream, so clients can render structured tool UI in real time.
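On the client side, separating tool-progress events from regular chunks comes down to reading the SSE `event:` field. The parser below is a sketch that assumes standard SSE framing (`event:` and `data:` lines, with a blank line ending each event); `iter_sse_events` is a hypothetical helper, and the payload of `hermes.tool.progress` is treated as an opaque string because its shape is not documented here.

```python
from typing import Iterable, Iterator, List, Tuple


def _field_value(line: str, prefix: str) -> str:
    """Return the text after `prefix`, dropping one optional leading space."""
    value = line[len(prefix):]
    return value[1:] if value.startswith(" ") else value


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE lines.

    Events without an explicit `event:` field get the SSE default name "message".
    """
    event = "message"
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:                    # a blank line ends the current event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):      # SSE comment, often used as keep-alive
            continue
        elif line.startswith("event:"):
            event = _field_value(line, "event:")
        elif line.startswith("data:"):
            data.append(_field_value(line, "data:"))
    if data:                            # lenient: flush an unterminated final event
        yield event, "\n".join(data)
```

A caller can then route on the event name: render `hermes.tool.progress` as a status indicator and append everything else to the assistant text.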
### POST /v1/responses
OpenAI Responses API format. Supports server-side conversation state via `previous_response_id` — the server stores full conversation history (including tool calls and results) so multi-turn context is preserved without the client managing it.
**Request:**
```
{
"model": "hermes-agent",
"input": "What files are in my project?",
"instructions": "You are a helpful coding assistant.",
"store": true
}
```
**Response:**
```
{
"id": "resp_abc123",
"object": "response",
"status": "completed",
"model": "hermes-agent",
"output": [
{"type": "function_call", "name": "terminal", "arguments": "{\"command\": \"ls\"}", "call_id": "call_1"},
{"type": "function_call_output", "call_id": "call_1", "output": "README.md src/ tests/"},
{"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": "Your project has..."}]}
],
"usage": {"input_tokens": 50, "output_tokens": 200, "total_tokens": 250}
}
```
#### Multi-turn with previous\_response\_id
Chain responses to maintain full context (including tool calls) across turns:
```
{
"input": "Now show me the README",
"previous_response_id": "resp_abc123"
}
```
The server reconstructs the full conversation from the stored response chain — all previous tool calls and results are preserved. Chained requests also share the same session, so multi-turn conversations appear as a single entry in the dashboard and session history.
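The chaining loop a client would run can be sketched as follows. `run_turns` is a hypothetical helper, and `send` stands for whatever function posts a JSON body to `/v1/responses` and returns the parsed response; only the `input`, `previous_response_id`, and `id` fields shown above are relied on.

```python
from typing import Callable, Dict, List


def run_turns(send: Callable[[Dict], Dict], prompts: List[str]) -> List[Dict]:
    """Send prompts in order, chaining each request to the previous response."""
    previous_id = None
    responses = []
    for prompt in prompts:
        body = {"input": prompt}
        if previous_id is not None:
            # The server rebuilds the conversation from this id.
            body["previous_response_id"] = previous_id
        response = send(body)
        previous_id = response["id"]
        responses.append(response)
    return responses
```

The client keeps only the most recent response ID; the full history, including tool calls, stays on the server.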
#### Named conversations
Use the `conversation` parameter instead of tracking response IDs:
```
{"input": "Hello", "conversation": "my-project"}
{"input": "What's in src/?", "conversation": "my-project"}
{"input": "Run the tests", "conversation": "my-project"}
```
The server automatically chains to the latest response in that conversation. This works like the `/title` command for gateway sessions.
### GET /v1/responses/{id}
Retrieve a previously stored response by ID.
### DELETE /v1/responses/{id}
Delete a stored response.
### GET /v1/models
Lists the agent as an available model. The advertised model name defaults to the [profile](https://github.com/NousResearch/hermes-agent/blob/main/docs/user-guide/profiles) name (or `hermes-agent` for the default profile). Required by most frontends for model discovery.
### GET /health
Health check. Returns `{"status": "ok"}`. Also available at **GET /v1/health** for OpenAI-compatible clients that expect the `/v1/` prefix.
### GET /health/detailed
Extended health check that also reports active sessions, running agents, and resource usage. Useful for monitoring/observability tooling.
## Runs API (streaming-friendly alternative)
In addition to `/v1/chat/completions` and `/v1/responses`, the server exposes a **runs** API for long-form sessions where the client wants to subscribe to progress events instead of managing streaming themselves.
### POST /v1/runs
Create a new agent run. Returns a `run_id` that can be used to subscribe to progress events.
### GET /v1/runs/{run\_id}/events
Server-Sent Events stream of the run's tool-call progress, token deltas, and lifecycle events. Designed for dashboards and thick clients that want to attach/detach without losing state.
## Jobs API (background scheduled work)
The server exposes a lightweight jobs CRUD surface for managing scheduled / background agent runs from a remote client. All endpoints are gated behind the same bearer auth.
### GET /api/jobs
List all scheduled jobs.
### POST /api/jobs
Create a new scheduled job. Body accepts the same shape as `hermes cron` — prompt, schedule, skills, provider override, delivery target.
### GET /api/jobs/{job\_id}
Fetch a single job's definition and last-run state.
### PATCH /api/jobs/{job\_id}
Update fields on an existing job (prompt, schedule, etc.). Partial updates are merged.
### DELETE /api/jobs/{job\_id}
Remove a job. Also cancels any in-flight run.
### POST /api/jobs/{job\_id}/pause
Pause a job without deleting it. Next-scheduled-run timestamps are suspended until resumed.
### POST /api/jobs/{job\_id}/resume
Resume a previously paused job.
### POST /api/jobs/{job\_id}/run
Trigger the job to run immediately, out of schedule.
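For a remote client, the endpoints above reduce to a small routing table. The sketch below only maps an action name to its HTTP method and path; `JOB_ROUTES` and `job_request` are hypothetical names, request bodies are left out because their exact field names are not specified here, and URL-quoting the job ID is a defensive assumption.

```python
from typing import Optional, Tuple
from urllib.parse import quote

# Action name -> (HTTP method, path template), taken from the endpoint list above.
JOB_ROUTES = {
    "list": ("GET", "/api/jobs"),
    "create": ("POST", "/api/jobs"),
    "get": ("GET", "/api/jobs/{job_id}"),
    "update": ("PATCH", "/api/jobs/{job_id}"),
    "delete": ("DELETE", "/api/jobs/{job_id}"),
    "pause": ("POST", "/api/jobs/{job_id}/pause"),
    "resume": ("POST", "/api/jobs/{job_id}/resume"),
    "run": ("POST", "/api/jobs/{job_id}/run"),
}


def job_request(action: str, job_id: Optional[str] = None) -> Tuple[str, str]:
    """Return the (method, path) pair for a jobs API action."""
    method, template = JOB_ROUTES[action]
    if "{job_id}" not in template:
        return method, template
    if not job_id:
        raise ValueError(f"action '{action}' requires a job_id")
    return method, template.format(job_id=quote(job_id, safe=""))
```

Every request built this way still needs the same `Authorization: Bearer` header as the other endpoints.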
## System Prompt Handling
When a frontend sends a `system` message (Chat Completions) or `instructions` field (Responses API), hermes-agent **layers it on top** of its core system prompt. Your agent keeps all its tools, memory, and skills — the frontend's system prompt adds extra instructions.
This means you can customize behavior per-frontend without losing capabilities:
- Open WebUI system prompt: "You are a Python expert. Always include type hints."
- The agent still has terminal, file tools, web search, memory, etc.
## Authentication
Bearer token auth via the `Authorization` header:
```
Authorization: Bearer ***
```
Configure the key via `API_SERVER_KEY` env var. If you need a browser to call Hermes directly, also set `API_SERVER_CORS_ORIGINS` to an explicit allowlist.
:::warning Security
The API server gives full access to hermes-agent's toolset, **including terminal commands**. When binding to a non-loopback address like `0.0.0.0`, `API_SERVER_KEY` is **required**. Also keep `API_SERVER_CORS_ORIGINS` narrow to control browser access.
The default bind address (`127.0.0.1`) is for local-only use. Browser access is disabled by default; enable it only for explicit trusted origins.
:::
## Configuration
### Environment Variables
| Variable | Default | Description |
| --- | --- | --- |
| `API_SERVER_ENABLED` | `false` | Enable the API server |
| `API_SERVER_PORT` | `8642` | HTTP server port |
| `API_SERVER_HOST` | `127.0.0.1` | Bind address (localhost only by default) |
| `API_SERVER_KEY` | *(none)* | Bearer token for auth |
| `API_SERVER_CORS_ORIGINS` | *(none)* | Comma-separated allowed browser origins |
| `API_SERVER_MODEL_NAME` | *(profile name)* | Model name on `/v1/models`. Defaults to profile name, or `hermes-agent` for default profile. |
### config.yaml
```
# Not yet supported — use environment variables.
# config.yaml support coming in a future release.
```
## Security Headers
All responses include security headers:
- `X-Content-Type-Options: nosniff` — prevents MIME type sniffing
- `Referrer-Policy: no-referrer` — prevents referrer leakage
## CORS
The API server does **not** enable browser CORS by default.
For direct browser access, set an explicit allowlist:
```
API_SERVER_CORS_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
```
When CORS is enabled:
- **Preflight responses** include `Access-Control-Max-Age: 600` (10 minute cache)
- **SSE streaming responses** include CORS headers so browser EventSource clients work correctly
- **`Idempotency-Key`** is an allowed request header — clients can send it for deduplication (responses are cached by key for 5 minutes)
Most documented frontends such as Open WebUI connect server-to-server and do not need CORS at all.
## Compatible Frontends
Any frontend that supports the OpenAI API format works. Tested/documented integrations:
| Frontend | Stars | Connection |
| --- | --- | --- |
| [Open WebUI](https://github.com/NousResearch/hermes-agent/blob/main/docs/user-guide/messaging/open-webui) | 126k | Full guide available |
| LobeChat | 73k | Custom provider endpoint |
| LibreChat | 34k | Custom endpoint in librechat.yaml |
| AnythingLLM | 56k | Generic OpenAI provider |
| NextChat | 87k | BASE\_URL env var |
| ChatBox | 39k | API Host setting |
| Jan | 26k | Remote model config |
| HF Chat-UI | 8k | OPENAI\_BASE\_URL |
| big-AGI | 7k | Custom endpoint |
| OpenAI Python SDK | — | `OpenAI(base_url="http://localhost:8642/v1")` |
| curl | — | Direct HTTP requests |
## Multi-User Setup with Profiles
To give multiple users their own isolated Hermes instance (separate config, memory, skills), use [profiles](https://github.com/NousResearch/hermes-agent/blob/main/docs/user-guide/profiles):
```
# Create a profile per user
hermes profile create alice
hermes profile create bob
# Configure each profile's API server on a different port
hermes -p alice config set API_SERVER_ENABLED true
hermes -p alice config set API_SERVER_PORT 8643
hermes -p alice config set API_SERVER_KEY alice-secret
hermes -p bob config set API_SERVER_ENABLED true
hermes -p bob config set API_SERVER_PORT 8644
hermes -p bob config set API_SERVER_KEY bob-secret
# Start each profile's gateway
hermes -p alice gateway &
hermes -p bob gateway &
```
Each profile's API server automatically advertises the profile name as the model ID:
- `http://localhost:8643/v1/models` → model `alice`
- `http://localhost:8644/v1/models` → model `bob`
In Open WebUI, add each as a separate connection. The model dropdown shows `alice` and `bob` as distinct models, each backed by a fully isolated Hermes instance. See the [Open WebUI guide](https://github.com/NousResearch/hermes-agent/blob/main/docs/user-guide/messaging/open-webui#multi-user-setup-with-profiles) for details.
## Limitations
- **Response storage** — stored responses (for `previous_response_id`) are persisted in SQLite and survive gateway restarts. Max 100 stored responses (LRU eviction).
- **No file upload** — vision/document analysis via uploaded files is not yet supported through the API.
- **Model field is cosmetic** — the `model` field in requests is accepted but the actual LLM model used is configured server-side in config.yaml.
## Proxy Mode
The API server also serves as the backend for **gateway proxy mode**. When another Hermes gateway instance is configured with `GATEWAY_PROXY_URL` pointing at this API server, it forwards all messages here instead of running its own agent. This enables split deployments — for example, a Docker container handling Matrix E2EE that relays to a host-side agent.
See [Matrix Proxy Mode](https://github.com/NousResearch/hermes-agent/blob/main/docs/user-guide/messaging/matrix#proxy-mode-e2ee-on-macos) for the full setup guide.

@@ -0,0 +1,240 @@
---
title: "Your AI Isn't \"Stupid\" — It Just Needs a Better Harness | Lychee Technology Engineering Blog"
source: "https://blog.ltbase.dev/posts/agents/harness-engineering"
author:
published:
created: 2026-04-20
description: "The engineering blog of Lychee Technology Inc."
tags:
- "clippings"
---
#harness-engineering
## Your AI Isn't "Stupid" — It Just Needs a Better Harness
TL;DR. Agents don't fail because models are weak. They fail because systems are undefined.
A good harness does four things:
- Constrains what the model can do
- Externalizes what it must remember
- Verifies every step it takes
- Recovers when things go wrong
## The Problem: The 10-Step Collapse
Imagine you deploy an autonomous agent to compile a market research report. Steps 1 through 3 execute perfectly: it plans the task, searches the web, and extracts competitor data.
But by step 7, it starts hallucinating statistics—because the search tool's payload exceeded the context window and was silently truncated. By step 10, it outputs a broken JSON string because there was no schema validator in the loop. The entire pipeline crashes.
We've all witnessed this "agentic collapse." And in those moments, it's tempting to blame the model's reasoning. But in production-grade AI, the problem usually isn't the horse. It's the reins.
## The Root Cause: A Paradigm Shift in AI Engineering
For the past two years, the industry has treated AI failures as a communication problem. If a model failed, we assumed we just needed to ask better or feed it better documents. But for long-horizon, autonomous execution, these approaches hit a hard ceiling.
We are now entering the era of **Harness Engineering** —the discipline of designing the system *around* the model. An agent is not just the LLM. It is the LLM embedded within a strict scaffolding of code, state management, and recovery workflows.
Here's how the field has evolved:
| Era | Focus | Limitation |
| --- | --- | --- |
| **Prompt Engineering** | *Instructions:* How to ask. | Brittle; zero persistence across steps. |
| **Context Engineering** | *Information:* What to know (e.g., RAG). | Stateless; cannot control long-horizon execution. |
| **Harness Engineering** | *System Design:* How to constrain and run. | Solves continuous, multi-step execution control. |
Each era didn't replace the last—it subsumed it. Good harness engineering still requires good prompts and good context. But it adds the execution layer that neither of them provides.
The natural next question is: **what does that execution layer actually look like?**
Not conceptually—but structurally. If the model is no longer the system, then where does it sit? What surrounds it? What controls it?
At a high level, a production-grade agent system looks like this:
```
┌─────────────────────────────────┐
│ User Request │
└────────────────┬────────────────┘
┌─────────────────────────────────┐
│ HARNESS (7 layer stack) │
│ ┌───────────────────────────┐ │
│ │ LLM (The Model) │ │
│ └───────────────────────────┘ │
└────────────────┬────────────────┘
┌─────────────────────────────────┐
│ Verified Output │
└─────────────────────────────────┘
```
The model is *inside* the harness. It never speaks to the user directly, and it never speaks to the outside world without supervision. Every input is filtered on the way in; every output is validated on the way out.
---
## The Design Principles of a Good Harness
Before we dive into the specific layers, it's worth establishing the principles that should guide every design decision. When you're unsure whether your harness is doing its job, come back to these four tests:
**1\. Constrain, don't instruct.** Never rely on the model to "choose correctly" if you can restrict its choices programmatically. A prompt that says "always respond in valid JSON" is a hope. A schema validator that rejects malformed output is a guarantee.
**2\. Externalize state.** If a piece of information matters to the task's continuity—what's been done, what's pending, what failed—it must exist outside the context window. Context windows are volatile. Files on disk are not.
**3\. Make every step verifiable.** If you can't check it, you can't trust it. Every layer of your harness should produce outputs that can be validated by something other than the model that generated them.
**4\. Fail locally, not globally.** A single failed tool call should trigger a retry of that step—not a restart of the entire pipeline. The blast radius of any failure should be as small as your state management allows.
These aren't abstract ideals. They're engineering constraints with direct implementation consequences, and you'll see each of them surface repeatedly in the stack below.
---
## The 7-Layer Harness Stack
A robust harness doesn't just pass text back and forth. It orchestrates a typed, stateful, and observable system. Here is what a production-ready stack looks like under the hood.
### 1\. Cognition
The foundation layer. It restricts the model's operational boundaries. Instead of a massive, encyclopedic system prompt, the harness feeds the model a localized "map" of its current role, its success criteria, and strict negative constraints—what *not* to do. Think of it as giving the model a job description rather than an encyclopedia.
In practice, this often takes the form of structured system prompts, role files (e.g., `agents.md`), or dynamically generated task briefs scoped to a single step.
### 2\. Tools
The harness does not simply pass raw tool outputs back to the LLM. It acts as a strict middleware layer that applies:
- **Ranking:** Uses embedding similarity or BM25 scoring to surface only the most relevant results.
- **Deduplication:** Strips repetitive data before it wastes precious tokens.
- **Token Budget Truncation:** Hard-caps tool payloads to prevent context overflow—the exact failure mode from our opening example.
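The three middleware steps above can be sketched in a few lines. This is an illustration under stated assumptions, not a production ranker: term overlap stands in for BM25 or embedding similarity, whitespace-separated words approximate tokens, and `prepare_tool_output` is a hypothetical name.

```python
import re
from typing import List, Set


def _terms(text: str) -> Set[str]:
    """Lowercased alphanumeric terms of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def prepare_tool_output(query: str, results: List[str], token_budget: int) -> List[str]:
    """Deduplicate, rank, and truncate tool results before they reach the model."""
    # 1. Deduplicate on whitespace- and case-normalized text, keeping first occurrences.
    seen, unique = set(), []
    for text in results:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(text)

    # 2. Rank by term overlap with the query (a simple stand-in for BM25).
    query_terms = _terms(query)
    ranked = sorted(unique, key=lambda t: len(query_terms & _terms(t)), reverse=True)

    # 3. Enforce the budget; words approximate tokens in this sketch.
    kept, used = [], 0
    for text in ranked:
        remaining = token_budget - used
        if remaining <= 0:
            break
        words = text.split()[:remaining]
        kept.append(" ".join(words))
        used += len(words)
    return kept
```

Deduplicating before ranking keeps the scoring pass small; a real harness would swap in its model's tokenizer so the budget matches the actual context cost.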
### 3\. Contracts & Interfaces
This is the layer most teams skip—and the one that causes the most mysterious production failures.
The model speaks in probabilities. The harness must speak in types.
Every boundary in the system—between the LLM and a tool, between one agent and another, between the harness and the outside world—needs an explicit contract: a strict JSON schema, a typed function signature, a versioned API spec. Without this, you get **schema drift**: the model generates a `price` field as a string one time and a float the next, and your downstream pipeline silently produces garbage.
The contract layer validates inputs and outputs at every boundary crossing, rejecting anything that doesn't conform *before* it propagates. This is where Principle 1 (constrain, don't instruct) earns its keep. Schema drift is dangerous precisely because it can pass unnoticed: a pricing field that switches from float to string may not break the pipeline itself, yet it quietly breaks the analytics downstream.
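A boundary check of this kind does not need a framework to get started. The sketch below uses only the standard library; `PRICING_CONTRACT` and `contract_violations` are hypothetical names, and the field list is an assumed example built around the `price` and `currency` fields mentioned in this post.

```python
from typing import Any, Dict, List

# Hypothetical contract for one pricing record: field name -> required type.
PRICING_CONTRACT: Dict[str, type] = {"competitor": str, "price": float, "currency": str}


def contract_violations(record: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """Return human-readable violations; an empty list means the record conforms."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing required field '{field}'")
        elif type(record[field]) is not expected:
            # Exact type match on purpose: "9.99" (str) and 10 (int) are both rejected.
            problems.append(
                f"field '{field}' must be {expected.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    for field in sorted(record.keys() - contract.keys()):
        problems.append(f"unexpected field '{field}'")
    return problems
```

The returned strings are deliberately plain and factual, so they can be handed back to the model as retry feedback without rewording.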
### 4\. Orchestration
Without this layer, an LLM tends to loop infinitely, skip critical steps, or prematurely declare victory. The harness enforces a structured workflow—either a Directed Acyclic Graph (DAG) or a state machine—that defines the legal transitions: *Plan → Gather → Draft → Verify*. The model proposes actions; the harness decides which actions are allowed.
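A state machine that enforces legal transitions can be very small. In this sketch the phase names follow the workflow above, while the `DONE` phase and the `VERIFY -> DRAFT` rework edge are assumptions added for illustration; `Phase`, `LEGAL`, and `advance` are hypothetical names.

```python
from enum import Enum


class Phase(Enum):
    PLAN = "plan"
    GATHER = "gather"
    DRAFT = "draft"
    VERIFY = "verify"
    DONE = "done"


# Legal transitions. VERIFY -> DRAFT (rework) and DONE are illustrative assumptions.
LEGAL = {
    Phase.PLAN: {Phase.GATHER},
    Phase.GATHER: {Phase.DRAFT},
    Phase.DRAFT: {Phase.VERIFY},
    Phase.VERIFY: {Phase.DRAFT, Phase.DONE},
    Phase.DONE: set(),
}


def advance(current: Phase, proposed: Phase) -> Phase:
    """Accept the model's proposed next phase only if the workflow allows it."""
    if proposed not in LEGAL[current]:
        raise ValueError(f"illegal transition {current.name} -> {proposed.name}")
    return proposed
```

Because the model cannot reach `DONE` without passing through `VERIFY`, "prematurely declaring victory" becomes a rejected transition rather than a silent outcome.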
### 5\. Memory & State
State must be explicitly managed to prevent amnesia. A mature harness splits memory into two tiers:
- **Working Memory (Short-term):** The immediate conversation and context window needed for the current step.
- **Persistent State (Long-term):** A structured file (e.g., `state.json`) that tracks exactly which sub-tasks are pending, in-progress, or completed—surviving across context resets and even across sessions.
This is Principle 2 (externalize state) in practice. If a piece of information only lives inside the context window, it will eventually be lost.
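Persisting state safely mostly means never leaving a half-written file behind. The sketch below writes to a temporary file and renames it into place; `save_state` and `load_state` are hypothetical helpers, and the `{"tasks": {...}}` layout is an assumed shape for `state.json`.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write state atomically so a crash never leaves a half-written file."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)     # clean up the temporary file on failure
        raise


def load_state(path: Path) -> dict:
    """Return the saved state, or an empty task map on first run."""
    if not path.exists():
        return {"tasks": {}}
    return json.loads(path.read_text(encoding="utf-8"))
```

A fresh agent instance can call `load_state` on startup and resume from whatever was last committed.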
### 6\. Evaluation & Observation
A system cannot rely solely on "another LLM prompt" for validation. The evaluation layer must be heterogeneous:
- **Rule-based checks:** Validating JSON schemas, string lengths, or required fields.
- **Tool-based verification:** Running code through a compiler, executing test suites, or using browser automation (like Playwright) to physically test a UI.
- **LLM-as-judge:** Reserved *only* for subjective or semantic grading—tone, coherence, user-friendliness—where deterministic checks can't apply.
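One way to order these checks is cheapest and most deterministic first. The sketch below evaluates generated Python source; `evaluate_code_answer` is a hypothetical name, the size limit is an arbitrary example, `ast.parse` stands in for heavier tools such as a test suite, and `judge` is an optional callable standing in for an LLM grader.

```python
import ast
from typing import Callable, List, Optional


def evaluate_code_answer(
    source: str,
    max_chars: int = 4000,
    judge: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Run deterministic checks first; consult the judge only if they all pass."""
    failures = []
    # Rule-based checks: non-empty output within a size limit.
    if not source.strip():
        failures.append("rule: output is empty")
    if len(source) > max_chars:
        failures.append(f"rule: output exceeds {max_chars} characters")
    # Tool-based check: let the Python parser decide whether the code is valid.
    try:
        ast.parse(source)
    except SyntaxError as exc:
        failures.append(f"tool: syntax error on line {exc.lineno}: {exc.msg}")
    # LLM-as-judge: reserved for subjective criteria, skipped after hard failures.
    if not failures and judge is not None and not judge(source):
        failures.append("judge: output rejected on subjective criteria")
    return failures
```

Skipping the judge after a deterministic failure saves a model call and keeps the feedback focused on the error that can be fixed mechanically.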
### 7\. Constraints & Recovery
In autonomous environments, tool failures and API timeouts are the norm, not the exception. The harness must enforce **idempotency**: if a step fails, the system retries that specific step without corrupting the overall state or duplicating previous work. This is what turns a fragile demo into a resilient system—and it's Principle 4 (fail locally, not globally) made concrete.
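Idempotency at the step level can be sketched as "commit only on success, skip what is already committed". `run_step` and the `state` layout below are hypothetical; in a real harness the dictionary would be backed by persistent storage.

```python
from typing import Callable, Dict

COMPLETED = "COMPLETED"


def run_step(state: Dict[str, dict], step_id: str, action: Callable[[], object]) -> object:
    """Run a step unless it already completed; record it only after success."""
    record = state.get(step_id)
    if record is not None and record["status"] == COMPLETED:
        return record["result"]  # already done: do not repeat the work
    result = action()            # if this raises, state is left untouched
    state[step_id] = {"status": COMPLETED, "result": result}
    return result
```

Because a failed attempt writes nothing, the caller can retry `run_step` for that one `step_id` as often as needed without affecting steps that already finished.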
---
## Example: One Full Agent Run
To see how these layers prevent a collapse, let's trace a full cycle of our Market Research Agent—including a real failure.
![sequence diagram|873](https://blog.ltbase.dev/assets/sequence.Ga6P23YS.svg)
**Step 1 — User Request:** "Compare pricing between Competitor A and Competitor B."
**Step 2 — Orchestration & State:** The Planner LLM decomposes this into a DAG with two parallel branches. `state.json` marks "Fetch Competitor A" as `IN_PROGRESS`.
**Step 3 — Tool Call:** The LLM triggers a web search. The Tool layer fetches 50 results, applies BM25 ranking, deduplicates overlapping text, and returns only the top 3,000 tokens—well within budget. The Contract layer validates the tool's output against the expected schema before passing it to the model.
**Step 4 — Evaluation:** The LLM generates pricing data. The Evaluation layer runs a rule-based schema check and catches that the JSON is missing the required `currency` field.
**Step 5 — Recovery:** The harness intercepts the error *before* the user ever sees it. Because the action is idempotent, it passes the exact error trace back to the LLM for a localized retry—no need to restart the entire pipeline.
**Step 6 — State Update:** The corrected data passes validation. `state.json` marks Competitor A as `COMPLETED`, and the harness moves to Competitor B.
**Step 7 — Hard Failure:** The web search tool returns an empty result for Competitor B—the site is down. The harness detects the empty payload, logs the failure, and triggers a fallback: retry with an alternative search query. Critically, `state.json` remains unchanged at this point—no partial or corrupted data is written until the step fully succeeds.
**Step 8 — Fallback Succeeds:** The alternative query returns valid results. The Contract layer validates the schema, the Evaluation layer confirms all required fields are present, and only now does `state.json` mark Competitor B as `COMPLETED`.
This cycle repeats dozens or hundreds of times in long-running tasks. Unlike the 10-step collapse in our introduction, this run absorbed an outright tool failure and recovered without human intervention. No hallucination. No silent failure. No crash.
---
## Advanced Traps: 4 Lessons from the Frontlines
When you scale this architecture to run for hours, new failure modes emerge that no amount of prompt tuning can fix. Here are four that consistently bite teams in production.
### Trap 1: The "Context Anxiety" Phenomenon
As an agent works and its context window fills up, models often exhibit a behavioral shift that practitioners have come to call "context anxiety." When approaching token limits—typically above 70% capacity—or when latency spikes, the model begins to skip steps or prematurely conclude the task. It acts rushed, as if it can feel the walls closing in.
**The Fix:** In-place summarization is not enough—it still leaves the model operating on a cluttered, degraded context. Instead, execute a **Context Reset**. The harness monitors utilization and triggers the reset programmatically:
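```python
# This threshold is empirical and should be tuned per model and workload.
CONTEXT_RESET_THRESHOLD = 0.7

# save_state_to_disk, terminate_current_instance, and launch_fresh_agent
# are hooks supplied by the harness, not library functions.
if tokens_used / max_context > CONTEXT_RESET_THRESHOLD:
    save_state_to_disk(state)
    terminate_current_instance()
    launch_fresh_agent(state)
```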
The harness saves the exact project state to persistent storage, terminates the current LLM instance, and launches a completely fresh agent with a clean context window. The new agent reads the saved state, orients itself, and continues. This is expensive but dramatically more reliable for tasks that exceed a single context window.
### Trap 2: The Self-Grading Illusion
If you ask an AI to grade its own work, it tends to approve mediocre output with unearned confidence. This isn't a bug in any specific model—it's a structural flaw. The same weights that generated the output are poorly positioned to critique it.
**The Fix:** Implement a strict separation of concerns using a **Sprint Contract**. Before work begins, the Generator agent and an independent Evaluator agent negotiate a concrete, testable definition of "done." Two rules are non-negotiable:
First, the Evaluator must *execute*: it should run the code, validate the interface in a headless browser, or check the output against a schema—not just read the raw text and render a judgment. Verification that can't be faked is the only verification that counts.
Second, the Evaluator must operate on a clean context, not the Generator's full reasoning trace. If the Evaluator reads the Generator's chain-of-thought, it inherits the Generator's assumptions and blind spots—defeating the entire purpose of independent review. Give the Evaluator the output and the success criteria. Nothing more.
### Trap 3: Optimizing for the Illusion of Correctness
When an LLM is placed under impossible or contradictory constraints—fix this bug, but don't change any code; make it shorter, but include everything—practitioners have observed a consistent behavioral pattern. The model stops trying to solve the actual problem and instead optimizes for *looking* correct. Outputs become fluent but hollow: hallucinated data, superficially plausible but broken logic, or answers that technically satisfy the letter of the prompt while violating its intent.
Recent research on steering vectors and internal model representations—including work from Anthropic on probing the inner states of language models—suggests this isn't just surface-level text prediction going awry. There appear to be measurable shifts in a model's internal state under conflicting pressure, though this line of research is still in its early stages.
**The Fix:** The practical takeaway is straightforward. LLMs predict the next token based on the trajectory of the current context. If your harness feeds back aggressive, emotional error messages ("You are stupid, this is completely wrong"), you bias the context toward a narrative of failure—and the model's subsequent outputs tend to degrade further. Harness feedback must remain strictly objective: supply the compiler error, the failed assertion, the schema mismatch. Give the model a problem to solve, not a reputation to live down.
### Trap 4: The Memory Consolidation Cycle
For an agent to function as a long-running system, persistent state management isn't a one-off setup. Over time, memory logs become bloated and contradictory—old decisions conflict with new ones, and redundant entries waste tokens on every read.
Some production agent systems have adopted an approach often called **Memory Consolidation**: an automated routine that periodically processes and compresses the agent's accumulated working logs. Reports from teams using this pattern (including references in open-source agent frameworks and Anthropic's own tooling) suggest impressive results—in one documented instance, a harness compressed 32K tokens of noisy, repetitive history into a clean 7K-token state file without meaningful information loss.
**The Fix:** Implement an automated consolidation cycle. When the agent is idle—between tasks or during low-priority windows—trigger a background job that reads the raw logs, deduplicates entries, resolves contradictions in favor of the most recent data, and writes a clean, compressed state file. This keeps the agent fast, cheap, and accurate for its next run. Think of it as defragmenting a hard drive, but for an AI's working memory.
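The core of such a consolidation job is a "latest entry wins" pass over the log. The sketch below assumes each log entry is a `(timestamp, key, value)` tuple, which is an illustrative shape rather than a format this post prescribes; `consolidate` is a hypothetical name.

```python
from typing import Dict, Iterable, Tuple

# Assumed log entry shape: (timestamp, key, value).
LogEntry = Tuple[float, str, str]


def consolidate(log: Iterable[LogEntry]) -> Dict[str, str]:
    """Collapse a noisy log into one value per key, preferring the newest entry."""
    latest: Dict[str, Tuple[float, str]] = {}
    for timestamp, key, value in log:
        # Duplicates collapse naturally; contradictions resolve to the newest value.
        if key not in latest or timestamp >= latest[key][0]:
            latest[key] = (timestamp, value)
    return {key: value for key, (_, value) in latest.items()}
```

Real logs are rarely this tidy, so a production job would likely add a summarization step for free-text entries; the deduplication and recency rule stay the same.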
---
## Where to Start: The Minimum Viable Harness
If the seven-layer stack feels overwhelming, don't try to build all of it on day one. Start with Layer 7—Constraints & Recovery—and work backward. You can live with imperfect prompts. You can live with a naive tool integration. But you cannot live with an agent that corrupts its own state on failure or silently swallows errors.
Here's what a Day 1 harness looks like in practice:
- **`state.json`** — A single structured file that tracks task status. If the process dies, you can pick up where you left off.
- **Retry wrapper** — Every tool call gets a try/catch with at least one automatic retry and exponential backoff.
- **Schema validator** — Every LLM output is validated against a JSON schema before it's accepted. Malformed output triggers a retry, not a crash.
- **Tool output truncation** — Hard-cap every tool payload to a fixed token budget. Silent truncation inside the context window is one of the most common causes of hallucination.
These four components can be built in a single afternoon. Once your agent can fail gracefully, you've earned the right to make it smarter.
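Of the four, the retry wrapper is the quickest to write. A minimal sketch, assuming any exception is retryable and that delays double from `base_delay`; `call_with_retry` is a hypothetical name, and `sleep` is injectable so the wrapper can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    tool: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `tool`, retrying on any exception with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return tool()
        except Exception:
            if attempt == max_attempts - 1:
                raise                         # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)  # base_delay, then 2x, then 4x, ...
    raise ValueError("max_attempts must be at least 1")
```

In practice you would narrow the `except` clause to the errors that are actually transient, such as timeouts, so that genuine bugs fail fast instead of being retried.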
## Conclusion
The future of software is agent-first. As models gain the raw capability to autonomously generate and verify complex systems, human value shifts. It's no longer about writing syntax. It's about designing the constraints that make autonomous execution reliable.
The most successful builders of the next decade won't be the ones who write the best code. They'll be the ones who engineer the best harnesses, building the strongest reins for the fastest horses. Those reins are nothing more than the consistent application of a few principles: constrain, externalize, verify, and recover.
---
*For the implementation details behind each layer (state storage, verification nodes, Sprint Contracts, and where to start), see the companion FAQ:* [**Harness Engineering from Theory to Production**](https://blog.ltbase.dev/posts/agents/harness-engineering-faq.html)