# GoClaw — Complete Documentation

> GoClaw is a multi-agent AI gateway written in Go. It connects LLMs to tools, channels, and data via WebSocket RPC and OpenAI-compatible HTTP API.

---

# Getting Started

**GoClaw** is a multi-agent AI gateway that connects LLMs to your tools, channels, and data — deployed as a single Go binary (~25 MB, ~36 MB with OTel). Zero runtime dependencies, <1s startup. It orchestrates agent teams, inter-agent delegation, and quality-gated workflows across 13+ LLM providers with full multi-tenant isolation.

A Go port of [OpenClaw](https://github.com/openclaw/openclaw) with enhanced security, multi-tenant PostgreSQL, and production-grade observability.

---

## What Makes It Different

- **Agent Teams & Orchestration** — Teams with shared task boards, inter-agent delegation (sync/async), conversation handoff, evaluate-loop quality gates, and hybrid agent discovery
- **Multi-Tenant PostgreSQL** — Per-user context files, encrypted credentials (AES-256-GCM), agent sharing, and complete data isolation
- **5-Layer Security** — Rate limiting, prompt injection detection, SSRF protection, shell deny patterns, and AES-256-GCM encryption
- **13+ LLM Providers** — Anthropic (native HTTP+SSE with prompt caching), OpenAI, OpenRouter, Groq, DeepSeek, Gemini, Mistral, xAI, MiniMax, Cohere, Perplexity, DashScope (Qwen), Bailian Coding
- **Omnichannel** — WebSocket, HTTP (OpenAI-compatible), Telegram, Discord, Feishu/Lark, Zalo, WhatsApp
- **MCP Integration** — Connect external Model Context Protocol servers (stdio, SSE, streamable-HTTP) with per-agent and per-user access grants
- **Custom Tools** — Define shell-based tools at runtime via HTTP API with encrypted env vars
- **Production Observability** — Built-in LLM call tracing with optional OpenTelemetry OTLP export + Jaeger

---

## Two Operating Modes

| Aspect | Standalone | Managed |
|--------|-----------|---------|
| Storage | JSON files + SQLite | PostgreSQL (pgvector) |
| Dependencies | None (beyond LLM API key) | PostgreSQL 15+ |
| Agents | Defined in `config.json` | CRUD via HTTP API + Web Dashboard |
| Multi-tenancy | Per-user workspace dirs | Full DB isolation |
| Agent Teams | N/A | Shared task board + mailbox |
| Delegation | N/A | Sync/async delegation + quality gates |
| Tracing | N/A | Full LLM call tracing + OTel export |
| Custom Tools | N/A | Runtime-defined shell tools |

---

## Quick Start

### Prerequisites

- Go 1.25+
- At least one LLM API key (Anthropic, OpenAI, OpenRouter, or any supported provider)
- PostgreSQL 15+ with pgvector (managed mode only)

### Build from Source

```bash
# Build
go build -o goclaw .

# Interactive setup (creates config.json + .env.local)
./goclaw onboard

# Load environment and start
source .env.local
./goclaw
```

The `onboard` command auto-detects API keys from environment variables. If found, it runs non-interactively. Otherwise, it launches an interactive wizard to select provider, model, gateway token, and channels.

### Managed Mode (PostgreSQL)

```bash
# Set PostgreSQL DSN (env var only, never in config.json)
export GOCLAW_POSTGRES_DSN="postgres://goclaw:goclaw@localhost:5432/goclaw?sslmode=disable"

# Run database migrations
./goclaw migrate up

# Start gateway
./goclaw
```

---

## Docker Deployment

GoClaw provides **8 composable Docker Compose files** that you can mix and match for your deployment needs.

### Compose Files

| File | Purpose |
|------|---------|
| `docker-compose.yml` | Base service definition (required) |
| `docker-compose.standalone.yml` | File-based storage with persistent volumes |
| `docker-compose.managed.yml` | PostgreSQL pgvector (pg18) for multi-tenant mode |
| `docker-compose.selfservice.yml` | Web dashboard UI (nginx + React SPA, port 3000) |
| `docker-compose.upgrade.yml` | One-shot database schema migration service |
| `docker-compose.sandbox.yml` | Docker-based code execution sandbox (requires docker socket) |
| `docker-compose.otel.yml` | OpenTelemetry + Jaeger tracing (Jaeger UI on port 16686) |
| `docker-compose.tailscale.yml` | Tailscale VPN mesh listener for secure remote access |

### Common Deployments

**Standalone (simplest):**

```bash
docker compose -f docker-compose.yml -f docker-compose.standalone.yml up -d
```

**Managed + Web Dashboard (recommended):**

```bash
# Prepare environment (auto-generates encryption key + gateway token)
chmod +x prepare-env.sh && ./prepare-env.sh

# Start services
docker compose -f docker-compose.yml \
  -f docker-compose.managed.yml \
  -f docker-compose.selfservice.yml up -d
```

**Full Stack (managed + dashboard + tracing):**

```bash
docker compose -f docker-compose.yml \
  -f docker-compose.managed.yml \
  -f docker-compose.selfservice.yml \
  -f docker-compose.otel.yml up -d
```

**With Code Sandbox:**

```bash
docker compose -f docker-compose.yml \
  -f docker-compose.managed.yml \
  -f docker-compose.sandbox.yml up -d
```

**Database Schema Upgrade:**

```bash
docker compose -f docker-compose.yml \
  -f docker-compose.managed.yml \
  -f docker-compose.upgrade.yml run --rm goclaw-upgrade
```

### Using Makefile

```bash
make up      # Start managed + dashboard (default)
make down    # Stop all services
make logs    # Stream goclaw container logs
make reset   # Stop, delete volumes, rebuild
make build   # Build binary locally
```

### Default Ports

| Service | Port |
|---------|------|
| Gateway (HTTP + WebSocket) | 18790 |
| Web Dashboard | 3000 |
| PostgreSQL | 5432 |
| Jaeger UI (OTel) | 16686 |

---

## Environment Variables

### LLM Provider Keys (at least one required)

```bash
GOCLAW_ANTHROPIC_API_KEY=sk-ant-...
GOCLAW_OPENAI_API_KEY=sk-...
GOCLAW_OPENROUTER_API_KEY=sk-or-...
GOCLAW_GROQ_API_KEY=gsk_...
GOCLAW_DEEPSEEK_API_KEY=sk-...
GOCLAW_GEMINI_API_KEY=...
GOCLAW_MISTRAL_API_KEY=...
GOCLAW_XAI_API_KEY=...
GOCLAW_MINIMAX_API_KEY=...
GOCLAW_COHERE_API_KEY=...
GOCLAW_PERPLEXITY_API_KEY=...
```

### Gateway & Security

```bash
GOCLAW_GATEWAY_TOKEN=           # Auto-generated by prepare-env.sh
GOCLAW_ENCRYPTION_KEY=          # Auto-generated (32-byte hex)
GOCLAW_PORT=18790               # Gateway port
GOCLAW_HOST=0.0.0.0             # Gateway host
```

### Database (managed mode)

```bash
GOCLAW_MODE=managed             # "standalone" or "managed"
GOCLAW_POSTGRES_DSN=postgres://goclaw:goclaw@localhost:5432/goclaw
```

### Channels (optional)

```bash
GOCLAW_TELEGRAM_TOKEN=
GOCLAW_DISCORD_TOKEN=
GOCLAW_LARK_APP_ID=
GOCLAW_LARK_APP_SECRET=
GOCLAW_ZALO_TOKEN=
GOCLAW_WHATSAPP_BRIDGE_URL=
```

### Scheduler Lanes

```bash
GOCLAW_LANE_MAIN=30             # Main lane concurrency
GOCLAW_LANE_SUBAGENT=50         # Subagent lane
GOCLAW_LANE_DELEGATE=100        # Delegation lane
GOCLAW_LANE_CRON=30             # Cron lane
```

### Observability & TTS (optional)

```bash
GOCLAW_TELEMETRY_ENABLED=true
GOCLAW_TELEMETRY_ENDPOINT=      # OTLP endpoint
GOCLAW_TTS_OPENAI_API_KEY=
GOCLAW_TTS_ELEVENLABS_API_KEY=
GOCLAW_TTS_MINIMAX_API_KEY=
```

---

## Configuration

Configuration is loaded from a JSON5 file with environment variable overlay. Secrets are never persisted to the config file.

```json
{
  "gateway": {
    "host": "0.0.0.0",
    "port": 18790,
    "token": ""
  },
  "agents": {
    "defaults": {
      "provider": "anthropic",
      "model": "claude-sonnet-4-5-20250929",
      "context_window": 200000
    }
  },
  "tools": {
    "profile": "full"
  },
  "database": {
    "mode": "standalone"
  }
}
```

### Config Sections

| Section | Purpose |
|---------|---------|
| `gateway` | host, port, token, allowed_origins, rate_limit_rpm |
| `agents` | defaults (provider, model, context_window) + per-agent list |
| `tools` | profile, allow/deny lists, exec_approval, mcp_servers |
| `channels` | Telegram, Discord, Feishu, Zalo, WhatsApp settings |
| `database` | mode (standalone/managed) |
| `sessions` | Session management settings |
| `tts` | Text-to-speech provider settings |
| `cron` | Cron job settings |
| `telemetry` | OpenTelemetry settings |
| `tailscale` | Tailscale listener config |
| `bindings` | Channel-to-agent mappings |

---

## Supported LLM Providers

| Provider | Type | Default Model |
|----------|------|---------------|
| Anthropic | Native HTTP + SSE | `claude-sonnet-4-5-20250929` |
| OpenAI | OpenAI-compatible | `gpt-4o` |
| OpenRouter | OpenAI-compatible | `anthropic/claude-sonnet-4-5-20250929` |
| Groq | OpenAI-compatible | `llama-3.3-70b-versatile` |
| DeepSeek | OpenAI-compatible | `deepseek-chat` |
| Gemini | OpenAI-compatible | `gemini-2.0-flash` |
| Mistral | OpenAI-compatible | `mistral-large-latest` |
| xAI | OpenAI-compatible | `grok-3-mini` |
| MiniMax | OpenAI-compatible | `MiniMax-M2.5` |
| Cohere | OpenAI-compatible | `command-a` |
| Perplexity | OpenAI-compatible | `sonar-pro` |
| DashScope | OpenAI-compatible | `qwen-plus` |
| Bailian Coding | OpenAI-compatible | `bailian-code` |

---

## CLI Commands

```bash
# Gateway
goclaw                          # Start gateway (default)
goclaw onboard                  # Interactive setup wizard
goclaw version                  # Print version & protocol
goclaw doctor                   # Health check

# Agents
goclaw agent list               # List configured agents
goclaw agent chat               # Chat with an agent
goclaw agent add                # Add new agent
goclaw agent delete             # Delete agent

# Database (managed mode)
goclaw migrate up               # Run pending migrations
goclaw migrate down             # Rollback last migration
goclaw migrate version          # Show current schema version
goclaw upgrade                  # Upgrade schema + data hooks
goclaw upgrade --status         # Show schema status
goclaw upgrade --dry-run        # Preview pending changes

# Configuration
goclaw config show              # Display config (secrets redacted)
goclaw config path              # Show config file path
goclaw config validate          # Validate config

# Sessions
goclaw sessions list            # List active sessions
goclaw sessions delete [key]    # Delete session
goclaw sessions reset [key]     # Clear session history

# Skills, Models, Channels
goclaw skills list              # List available skills
goclaw models list              # List AI models and providers
goclaw channels list            # List messaging channels

# Cron & Pairing
goclaw cron list                # List scheduled jobs
goclaw pairing approve [code]   # Approve pairing code
goclaw pairing list             # List paired devices
```

---

## Web Dashboard

GoClaw includes a React 19 SPA dashboard (Vite 6, TypeScript, Tailwind CSS 4, Radix UI) for managing agents, sessions, skills, and configuration.

### Local Development

```bash
cd ui/web
pnpm install    # Must use pnpm, not npm
pnpm dev
```

### Docker (via selfservice compose)

```bash
docker compose -f docker-compose.yml \
  -f docker-compose.managed.yml \
  -f docker-compose.selfservice.yml up -d
```

The dashboard runs on port 3000 and connects to the gateway via WebSocket.

---

## Next Steps

- [Architecture Overview](#architecture) — Component diagram, module map, startup sequence
- [Agent Loop](#agent-loop) — Deep dive into the Think-Act-Observe cycle
- [API Reference](#api-reference) — HTTP and WebSocket endpoints
- [Security](#security) — 5-layer defense-in-depth model
- [Tools System](#tools) — 30+ built-in tools, custom tools, and MCP integration
- [Channels & Messaging](#channels) — Telegram, Discord, Feishu/Lark, Zalo, WhatsApp

---

# 00 - Architecture Overview

## 1. Overview

GoClaw is an AI agent gateway written in Go. It exposes a WebSocket RPC (v3) interface and an OpenAI-compatible HTTP API for orchestrating LLM-powered agents. The system supports two operating modes:

- **Standalone** -- file-based storage with SQLite for per-user data, zero external dependencies beyond an LLM API key.
- **Managed** -- PostgreSQL-backed multi-tenant mode with HTTP CRUD APIs, per-user context files, encrypted credentials, agent delegation, teams, and LLM call tracing.

> **Documentation scope**: This documentation covers both modes. Standalone mode now has near-parity with managed mode for core features (per-user context files, workspace isolation, agent types, bootstrap onboarding). Managed mode adds agent delegation, teams, quality gates, tracing, HTTP CRUD APIs, and encrypted secrets.

## 2. Component Diagram

```mermaid
flowchart TD
    subgraph Clients
        WS[WebSocket Clients]
        HTTP[HTTP Clients]
        TG[Telegram]
        DC[Discord]
        FS[Feishu / Lark]
        ZL[Zalo]
        WA[WhatsApp]
    end

    subgraph Gateway["Gateway Server"]
        WSS[WebSocket Server]
        HTTPS[HTTP API Server]
        MR[Method Router]
        RL[Rate Limiter]
        RBAC[Permission Engine]
    end

    subgraph Channels["Channel Manager"]
        CM[Channel Manager]
        PA[Pairing Service]
    end

    subgraph Core["Core Engine"]
        BUS[Message Bus]
        SCHED[Scheduler -- 4 Lanes]
        AR[Agent Router]
        LOOP[Agent Loop -- Think / Act / Observe]
    end

    subgraph Providers["LLM Providers"]
        ANTH[Anthropic -- Native HTTP + SSE]
        OAI[OpenAI-Compatible -- HTTP + SSE]
    end

    subgraph Tools["Tool Registry"]
        FS_T[Filesystem]
        EXEC[Exec / Shell]
        WEB[Web Search / Fetch]
        MEM[Memory]
        SUB[Subagent]
        DEL[Delegation]
        TEAM_T[Teams]
        EVAL[Evaluate Loop]
        HO[Handoff]
        TTS_T[TTS]
        BROW[Browser]
        SK[Skills]
        MCP_T[MCP Bridge]
        CT[Custom Tools]
    end

    subgraph Hooks["Hook Engine"]
        HE[Engine]
        CMD_E[Command Evaluator]
        AGT_E[Agent Evaluator]
    end

    subgraph Store["Store Layer"]
        SESS[SessionStore]
        AGENT_S[AgentStore]
        PROV_S[ProviderStore]
        CRON_S[CronStore]
        MEM_S[MemoryStore]
        SKILL_S[SkillStore]
        TRACE_S[TracingStore]
        MCP_S[MCPServerStore]
        CT_S[CustomToolStore]
        AL_S[AgentLinkStore]
        TM_S[TeamStore]
    end

    WS --> WSS
    HTTP --> HTTPS
    TG & DC & FS & ZL & WA --> CM

    WSS --> MR
    HTTPS --> MR
    MR --> RL --> RBAC --> AR

    CM --> BUS
    BUS --> SCHED
    SCHED --> AR
    AR --> LOOP

    LOOP --> Providers
    LOOP --> Tools
    Tools --> Store
    Tools --> Hooks
    Hooks --> Tools
    LOOP --> Store
```

## 3. Module Map

| Module | Description |
|--------|-------------|
| `internal/gateway/` | WebSocket + HTTP server, client handling, method router |
| `internal/gateway/methods/` | RPC method handlers: chat, agents, agent_links, teams, delegations, sessions, config, skills, cron, pairing, exec approval, usage, send |
| `internal/agent/` | Agent loop (think, act, observe), router, resolver, system prompt builder, sanitization, pruning, tracing, memory flush, DELEGATION.md + TEAM.md injection |
| `internal/providers/` | LLM providers: Anthropic (native HTTP + SSE streaming), OpenAI-compatible (HTTP + SSE), retry logic |
| `internal/tools/` | Tool registry, filesystem ops, exec/shell, policy engine, subagent, delegation manager, team tools, evaluate loop, handoff, context file + memory interceptors, credential scrubbing, rate limiting, PathDenyable |
| `internal/tools/dynamic_loader.go` | Custom tool loader: LoadGlobal (startup), LoadForAgent (per-agent clone), ReloadGlobal (cache invalidation) |
| `internal/tools/dynamic_tool.go` | Custom tool executor: command template rendering, shell escaping, encrypted env vars |
| `internal/hooks/` | Hook engine: quality gates, command evaluator, agent evaluator, recursion prevention (`WithSkipHooks`) |
| `internal/store/` | Store interfaces: SessionStore, AgentStore, ProviderStore, SkillStore, MemoryStore, CronStore, PairingStore, TracingStore, MCPServerStore, AgentLinkStore, TeamStore, ChannelInstanceStore, ConfigSecretsStore |
| `internal/store/pg/` | PostgreSQL implementations (`database/sql` + `pgx/v5`) |
| `internal/store/file/` | File-based implementations: sessions, memory (SQLite), cron, pairing, skills, agents (filesystem + SQLite) |
| `internal/bootstrap/` | System prompt files (AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, HEARTBEAT.md, BOOTSTRAP.md) + seeding + truncation |
| `internal/config/` | Config loading (JSON5) + env var overlay |
| `internal/skills/` | SKILL.md loader (5-tier hierarchy) + BM25 search + hot-reload via fsnotify |
| `internal/channels/` | Channel manager + adapters: Telegram, Feishu/Lark, Zalo, Discord, WhatsApp |
| `internal/mcp/` | MCP server bridge (stdio, SSE, streamable-HTTP transports) |
| `internal/scheduler/` | Lane-based concurrency control (main, subagent, cron, delegate lanes) with per-session serialization |
| `internal/memory/` | Memory system (SQLite FTS5 + embeddings for standalone mode) |
| `internal/permissions/` | RBAC policy engine (admin, operator, viewer roles) |
| `internal/pairing/` | DM/device pairing service (8-character codes) |
| `internal/sessions/` | File-based session manager (standalone mode) |
| `internal/bus/` | Event pub/sub (Message Bus) |
| `internal/sandbox/` | Docker-based code execution sandbox |
| `internal/tts/` | Text-to-Speech providers: OpenAI, ElevenLabs, Edge, MiniMax |
| `internal/http/` | HTTP API handlers: /v1/chat/completions, /v1/agents, /v1/skills, /v1/traces, /v1/mcp, /v1/delegations, summoner |
| `internal/crypto/` | AES-256-GCM encryption for API keys |
| `internal/tracing/` | LLM call tracing (traces + spans), in-memory buffer with periodic store flush |
| `internal/tracing/otelexport/` | Optional OpenTelemetry OTLP exporter (opt-in via build tags; adds gRPC + protobuf) |
| `internal/heartbeat/` | Periodic agent wake-up service |

---

## 4. Two Operating Modes

| Aspect | Standalone | Managed |
|--------|-----------|---------|
| Config source | `config.json` + env vars | `config.json` + `GOCLAW_POSTGRES_DSN` |
| Storage | JSON files + SQLite (`~/.goclaw/data/agents.db`) | PostgreSQL |
| Agents | Defined in `config.json` `agents.list`, created eagerly at startup | `agents` table, lazy-resolved via `ManagedResolver` |
| Agent store | `FileAgentStore` (filesystem + SQLite) | `PGAgentStore` |
| Context files | Agent-level on filesystem, per-user in SQLite | `agent_context_files` + `user_context_files` tables |
| Agent types | `open` / `predefined` (via config) | `open` (7 per-user files) / `predefined` (agent-level + USER.md per-user) |
| Per-user isolation | Workspace subdirectories (`user_alice/`, `user_bob/`) | Same + DB-scoped context files |
| Bootstrap onboarding | Per-user BOOTSTRAP.md seeding (SQLite) | Same (PostgreSQL) |
| Agent delegation | N/A | Sync/async delegation, agent links, quality gates |
| Agent teams | N/A | Shared task board, mailbox, handoff |
| Skills | Filesystem only (workspace + global dirs) | PostgreSQL + filesystem + embedding search |
| Memory | SQLite FTS5 + embeddings | pgvector hybrid (full-text search + vector similarity) |
| Tracing | N/A | `traces` + `spans` tables + optional OTel OTLP export |
| MCP servers | `config.json` `tools.mcp_servers` | `mcp_servers` table + grants |
| API key storage | `.env.local` / env vars only | PostgreSQL (AES-256-GCM encrypted) |
| HTTP CRUD API | N/A | `/v1/agents`, `/v1/skills`, `/v1/traces`, `/v1/mcp`, `/v1/delegations` |
| Virtual FS | `ContextFileInterceptor` routes to SQLite | `ContextFileInterceptor` routes to PostgreSQL |
| Custom tools | N/A | `custom_tools` table + `DynamicToolLoader` |
| Managed-only stores (nil in standalone) | -- | ProviderStore, TracingStore, MCPServerStore, CustomToolStore, AgentLinkStore, TeamStore |

---

## 5. Multi-Tenant Identity Model

GoClaw uses the **Identity Propagation** pattern (also known as **Trusted Subsystem**). It does not implement authentication or authorization — instead, it trusts the upstream service that authenticates with the gateway token to provide accurate user identity.

```mermaid
flowchart LR
    subgraph "Upstream Service (trusted)"
        AUTH["Authenticate end-user"]
        HDR["Set X-GoClaw-User-Id header<br/>or user_id in WS connect"]
    end

    subgraph "GoClaw Gateway"
        EXTRACT["Extract user_id<br/>(opaque, VARCHAR 255)"]
        CTX["store.WithUserID(ctx)"]
        SCOPE["Per-user scoping:<br/>sessions, context files,<br/>memory, traces, agent shares"]
    end

    AUTH --> HDR
    HDR --> EXTRACT
    EXTRACT --> CTX
    CTX --> SCOPE
```

### Identity Flow

| Entry Point | How user_id is provided | Enforcement |
|-------------|------------------------|-------------|
| HTTP API | `X-GoClaw-User-Id` header | Required in managed mode |
| WebSocket | `user_id` field in `connect` handshake | Required in managed mode |
| Channels | Derived from platform sender ID (e.g., Telegram user ID) | Automatic |

### Compound User ID Convention

The `user_id` field is **opaque** to GoClaw — it does not interpret or validate the format. For multi-tenant deployments, the recommended convention is:

```
tenant.{tenantId}.user.{userId}
```

This hierarchical format ensures natural isolation between tenants. Since `user_id` is used as a scoping key across all per-user tables (`user_context_files`, `user_agent_profiles`, `user_agent_overrides`, `agent_shares`, `sessions`, `traces`), the compound format guarantees that users from different tenants cannot access each other's data.

### Where user_id is used

| Component | Usage |
|-----------|-------|
| Session keys | `agent:{agentId}:{channel}:direct:{peerId}` — peerId derived from user_id |
| Context files | `user_context_files` table scoped by `(agent_id, user_id)` |
| User profiles | `user_agent_profiles` table — first/last seen, workspace |
| User overrides | `user_agent_overrides` — per-user provider/model preferences |
| Agent shares | `agent_shares` table — user-level access control |
| Memory | Per-user memory entries via context propagation |
| Traces | `traces` table includes `user_id` for filtering |
| MCP grants | `mcp_user_grants` — per-user MCP server access |
| Skills grants | `skill_user_grants` — per-user skill access |

---

## 6. Gateway Startup Sequence

```mermaid
sequenceDiagram
    participant CLI as CLI (cmd/root.go)
    participant GW as runGateway()
    participant PG as PostgreSQL
    participant Engine as Core Engine

    CLI->>GW: 1. Parse CLI flags + load config
    GW->>GW: 2. Resolve workspace + data dirs
    GW->>GW: 3. Create Message Bus

    alt Managed mode
        GW->>PG: 4. Connect to Postgres (pg.NewPGStores)
        PG-->>GW: PG stores created
        GW->>GW: 5. Start tracing collector
        GW->>PG: 6. Register providers from DB
        GW->>PG: 7. Wire embedding provider to PGMemoryStore
        GW->>PG: 8. Backfill memory embeddings (background)
    else Standalone mode
        GW->>GW: 4. Create file-based stores
    end

    GW->>GW: 9. Register config-based providers
    GW->>GW: 10. Create tool registry (filesystem, exec, web, memory, browser, TTS, subagent, MCP)
    GW->>GW: 11. Load bootstrap files (DB or filesystem)
    GW->>GW: 12. Create skills loader + register skill_search tool
    GW->>GW: 13. Wire skill embeddings (managed only)

    alt Managed mode
        GW->>GW: 14. Create agents lazily (set ManagedResolver)
        GW->>GW: 15. wireManagedExtras (interceptors, cache subscribers)
        GW->>GW: 16. Wire managed HTTP handlers (agents, skills, traces, MCP)
    else Standalone mode
        GW->>GW: 14. Create agents eagerly from config
        GW->>GW: 15. wireStandaloneExtras (FileAgentStore, interceptors, callbacks)
    end

    GW->>Engine: 17. Create gateway server (WS + HTTP)
    GW->>Engine: 18. Register RPC methods
    GW->>Engine: 19. Register + start channels (Telegram, Discord, Feishu, Zalo, WhatsApp)
    GW->>Engine: 20. Start cron, heartbeat, scheduler (4 lanes)
    GW->>Engine: 21. Start skills watcher + inbound consumer
    GW->>Engine: 22. Listen on host:port
```

---

## 7. Managed Mode Wiring

The `wireManagedExtras()` function in `cmd/gateway_managed.go` wires multi-tenant components:

```mermaid
flowchart TD
    W1["1. ContextFileInterceptor<br/>Routes read_file / write_file to DB"] --> W2
    W2["2. User Seeding Callback<br/>Seeds per-user context files on first chat"] --> W3
    W3["3. Context File Loader<br/>Loads per-user vs agent-level files by agent_type"] --> W4
    W4["4. ManagedResolver<br/>Lazy-creates agent Loops from DB on cache miss"] --> W5
    W5["5. Virtual FS Interceptors<br/>Wire interceptors on read_file + write_file + memory tools"] --> W6
    W6["6. Memory Store Wiring<br/>Wire PGMemoryStore on memory_search + memory_get tools"] --> W7
    W7["7. Cache Invalidation Subscribers<br/>Subscribe to MessageBus events"] --> W8
    W8["8. Delegation Tools<br/>DelegateManager + delegate_search + agent links"] --> W9
    W9["9. Team Tools<br/>team_tasks + team_message + team auto-linking"] --> W10
    W10["10. Hook Engine<br/>Quality gates with command + agent evaluators"] --> W11
    W11["11. Evaluate Loop + Handoff<br/>evaluate_loop tool + handoff tool"]
```

A separate `wireStandaloneExtras()` in `cmd/gateway_standalone.go` wires the same core callbacks (user seeding, context file loading) using `FileAgentStore` instead of PostgreSQL.

### Cache Invalidation Events

| Event | Subscriber | Action |
|-------|-----------|--------|
| `cache:bootstrap` | ContextFileInterceptor | `InvalidateAgent()` or `InvalidateAll()` |
| `cache:agent` | AgentRouter | `InvalidateAgent()` -- forces re-resolve from DB |
| `cache:skills` | SkillStore | `BumpVersion()` |
| `cache:cron` | CronStore | `InvalidateCache()` |
| `cache:custom_tools` | DynamicToolLoader | `ReloadGlobal()` + `AgentRouter.InvalidateAll()` |

---

## 8. Scheduler Lanes

The scheduler uses a lane-based concurrency model. Each lane is a named worker pool with a bounded semaphore. Per-session queues control concurrency within each session.

```mermaid
flowchart TD
    subgraph Main["Lane: main (concurrency 30)"]
        M1[Channel messages]
        M2[WebSocket requests]
    end

    subgraph Sub["Lane: subagent (concurrency 50)"]
        S1[Subagent executions]
    end

    subgraph Del["Lane: delegate (concurrency 100)"]
        D1[Delegation executions]
    end

    subgraph Cron["Lane: cron (concurrency 30)"]
        C1[Cron job executions]
    end

    Main --> SEM1[Semaphore]
    Sub --> SEM2[Semaphore]
    Del --> SEM3[Semaphore]
    Cron --> SEM4[Semaphore]

    SEM1 --> Q[Per-Session Queue]
    SEM2 --> Q
    SEM3 --> Q
    SEM4 --> Q

    Q --> AGENT[Agent Loop]
```

### Lane Defaults

| Lane | Concurrency | Env Override | Purpose |
|------|:-----------:|-------------|---------|
| `main` | 30 | `GOCLAW_LANE_MAIN` | Primary user chat sessions |
| `subagent` | 50 | `GOCLAW_LANE_SUBAGENT` | Spawned subagents |
| `delegate` | 100 | `GOCLAW_LANE_DELEGATE` | Agent delegation executions |
| `cron` | 30 | `GOCLAW_LANE_CRON` | Scheduled cron jobs |

### Session Queue Concurrency

Per-session queues now support configurable `maxConcurrent`:
- **DMs**: `maxConcurrent = 1` (single-threaded per user)
- **Groups**: `maxConcurrent = 3` (multiple concurrent responses)
- **Adaptive throttle**: When session history exceeds 60% of context window, concurrency drops to 1

### Queue Modes

| Mode | Behavior |
|------|----------|
| `queue` | FIFO -- new messages wait until the current run completes |
| `followup` | Merges incoming message into the pending queue as a follow-up |
| `interrupt` | Cancels the active run and replaces it with the new message |

Default queue config: capacity 10, drop policy `old` (drops oldest on overflow), debounce 800ms.

### /stop and /stopall

- `/stop` -- Cancel the oldest running task (others keep going)
- `/stopall` -- Cancel all running tasks + drain the queue

Both are intercepted before the debouncer to avoid being merged with normal messages.

---

## 9. Graceful Shutdown

When the process receives SIGINT or SIGTERM:

1. Broadcast `shutdown` event to all connected WebSocket clients.
2. `channelMgr.StopAll()` -- stop all channel adapters.
3. `cronStore.Stop()` -- stop cron scheduler.
4. `heartbeatSvc.Stop()` -- stop heartbeat service.
5. `sandboxMgr.Stop()` + `ReleaseAll()` -- release Docker containers.
6. `cancel()` -- cancel root context, propagating to consumer + scheduler.
7. Deferred cleanup: flush tracing collector, close memory store, close browser manager, stop scheduler lanes.
8. HTTP server shutdown with a **5-second timeout** (`context.WithTimeout`).

---

## 10. Config System

Configuration is loaded from a JSON5 file with environment variable overlay. Secrets are never persisted to the config file.

```mermaid
flowchart TD
    A{Config path?} -->|--config flag| B[CLI flag path]
    A -->|GOCLAW_CONFIG env| C[Env var path]
    A -->|default| D["config.json"]

    B & C & D --> LOAD["config.Load()"]
    LOAD --> S1["1. Set defaults"]
    S1 --> S2["2. Parse JSON5"]
    S2 --> S3["3. Env var overlay<br/>(GOCLAW_*_API_KEY)"]
    S3 --> S4["4. Apply computed defaults<br/>(context pruning, etc.)"]
    S4 --> READY[Config ready]
```

### Key Config Sections

| Section | Purpose |
|---------|---------|
| `gateway` | host, port, token, allowed_origins, rate_limit_rpm, max_message_chars |
| `agents` | defaults (provider, model, context_window) + list (per-agent overrides) |
| `tools` | profile, allow/deny lists, exec_approval, web, browser, mcp_servers, rate_limit_per_hour |
| `channels` | Per-channel: enabled, token, dm_policy, group_policy, allow_from |
| `database` | mode (standalone/managed); postgres_dsn read only from env var |

### Secret Handling

- Secrets exist only in env vars or `.env.local` -- never in `config.json`.
- `GOCLAW_POSTGRES_DSN` is tagged `json:"-"` and cannot be read from the config file.
- `MaskedCopy()` replaces API keys with `"***"` when returning config over WebSocket.
- `StripSecrets()` removes secrets before writing config to disk.
- Config hot-reload via `fsnotify` watcher with 300ms debounce.

---

## 11. File Reference

| File | Purpose |
|------|---------|
| `cmd/root.go` | Cobra CLI entry point, flag parsing |
| `cmd/gateway.go` | Gateway startup orchestrator (`runGateway()`) |
| `cmd/gateway_managed.go` | Managed mode wiring (`wireManagedExtras()`, `wireManagedHTTP()`) |
| `cmd/gateway_standalone.go` | Standalone mode wiring (`wireStandaloneExtras()`) |
| `cmd/gateway_callbacks.go` | Shared callbacks for managed + standalone (user seeding, context file loading) |
| `cmd/gateway_consumer.go` | Inbound message consumer (subagent, delegate, teammate, handoff routing) |
| `cmd/gateway_providers.go` | Provider registration (config-based + DB-based) |
| `cmd/gateway_methods.go` | RPC method registration |
| `internal/config/config.go` | Config struct definitions |
| `internal/config/config_load.go` | JSON5 loading + env overlay |
| `internal/config/config_channels.go` | Channel config structs |
| `internal/gateway/server.go` | WS + HTTP server, CORS, rate limiter setup |
| `internal/gateway/client.go` | WebSocket client handling, read limit (512KB) |
| `internal/gateway/router.go` | RPC method routing |
| `internal/scheduler/lanes.go` | Lane definitions, semaphore-based concurrency |
| `internal/scheduler/queue.go` | Per-session queue, queue modes, debounce |
| `internal/hooks/engine.go` | Hook engine: evaluator registry, `EvaluateHooks` |
| `internal/hooks/command_evaluator.go` | Shell command evaluator (exit 0 = pass) |
| `internal/hooks/agent_evaluator.go` | Agent delegation evaluator (APPROVED/REJECTED) |
| `internal/hooks/context.go` | `WithSkipHooks` / `SkipHooksFromContext` (recursion prevention) |
| `internal/store/stores.go` | `Stores` container struct (all 14 store interfaces) |
| `internal/store/types.go` | `StoreConfig`, `BaseModel` |

---

## Cross-References

| Document | Content |
|----------|---------|
| [01-agent-loop.md](./01-agent-loop.md) | Agent loop detail, sanitization pipeline, history management |
| [02-providers.md](./02-providers.md) | LLM providers, retry logic, schema cleaning |
| [03-tools-system.md](./03-tools-system.md) | Tool registry, policy engine, interceptors, custom tools, MCP grants |
| [04-gateway-protocol.md](./04-gateway-protocol.md) | WebSocket protocol v3, HTTP API, RBAC, identity propagation |
| [05-channels-messaging.md](./05-channels-messaging.md) | Channel adapters, Telegram formatting, pairing, managed-mode user scoping |
| [06-store-data-model.md](./06-store-data-model.md) | Store interfaces, PostgreSQL schema, session caching, custom tool store |
| [07-bootstrap-skills-memory.md](./07-bootstrap-skills-memory.md) | Bootstrap files, skills system, memory, skills grants |
| [08-scheduling-cron-heartbeat.md](./08-scheduling-cron-heartbeat.md) | Scheduler lanes, cron lifecycle, heartbeat |
| [09-security.md](./09-security.md) | Defense layers, encryption, rate limiting, RBAC, sandbox |
| [10-tracing-observability.md](./10-tracing-observability.md) | Tracing collector, span hierarchy, OTel export, trace API |

---

# 01 - Agent Loop

## Overview

The Agent Loop implements a **Think --> Act --> Observe** cycle. Each agent owns a `Loop` instance configured with a provider, model, tools, workspace, and agent type. A user message enters as a `RunRequest`, passes through `runLoop`, and exits as a `RunResult`. The loop iterates up to 20 times: the LLM thinks, optionally calls tools, observes results, and repeats until it produces a final text response.

---

## 1. RunRequest Flow

The full lifecycle of a single agent run is broken into seven phases.

```mermaid
flowchart TD
    START([RunRequest]) --> PH1

    subgraph PH1["Phase 1: Setup"]
        P1A[Increment activeRuns atomic counter] --> P1B[Emit run.started event]
        P1B --> P1C[Create trace record]
        P1C --> P1D[Inject agentType / userID / agentID into context]
        P1D --> P1E0[Compute per-user workspace + WithToolWorkspace]
        P1E0 --> P1E[Ensure per-user files via sync.Map cache]
        P1E --> P1F[Persist agent + user IDs on session]
    end

    PH1 --> PH2

    subgraph PH2["Phase 2: Input Validation"]
        P2A["InputGuard.Scan - 6 injection patterns"] --> P2B["Message truncation at max_message_chars (default 32K)"]
    end

    PH2 --> PH3

    subgraph PH3["Phase 3: Build Messages"]
        P3A[Build system prompt - 15+ sections] --> P3B[Inject conversation summary if present]
        P3B --> P3C["History pipeline: limitHistoryTurns --> pruneContextMessages --> sanitizeHistory"]
        P3C --> P3D[Append current user message]
        P3D --> P3E[Buffer user message locally - deferred write]
    end

    PH3 --> PH4

    subgraph PH4["Phase 4: LLM Iteration Loop (max 20)"]
        P4A[Filter tools via PolicyEngine] --> P4B["Call LLM (ChatStream or Chat)"]
        P4B --> P4C[Accumulate tokens + record LLM span]
        P4C --> P4D{Tool calls in response?}
        P4D -->|No| EXIT[Exit loop with final content]
        P4D -->|Yes| PH5
    end

    subgraph PH5["Phase 5: Tool Execution"]
        P5A[Append assistant message with tool calls] --> P5B{Single or multiple tools?}
        P5B -->|Single| P5C[Execute sequentially]
        P5B -->|Multiple| P5D["Execute in parallel via goroutines, sort results by index"]
        P5C & P5D --> P5E["Emit tool.call / tool.result events, record tool spans, save tool messages"]
    end

    PH5 --> PH4

    EXIT --> PH6

    subgraph PH6["Phase 6: Response Finalization"]
        P6A["SanitizeAssistantContent (8-step pipeline)"] --> P6B["Detect NO_REPLY - suppress delivery if silent"]
        P6B --> P6C[Flush all buffered messages atomically to session]
        P6C --> P6D[Update metadata: model, provider, token counts]
    end

    PH6 --> PH7

    subgraph PH7["Phase 7: Auto-Summarization"]
        P7A{"> 50 messages OR > 75% context window?"}
        P7A -->|No| P7D[Skip]
        P7A -->|Yes| P7B["Memory flush (synchronous, max 5 iterations, 90s timeout)"]
        P7B --> P7C["Summarize in background goroutine (120s timeout)"]
    end

    PH7 --> POST

    subgraph POST["Post-processing"]
        PP1[Emit root agent span] --> PP2["Emit run.completed or run.failed"]
        PP2 --> PP3[Finish trace]
    end

    POST --> RESULT([RunResult])
```

### Phase 1: Setup

- Increment the `activeRuns` atomic counter (no mutex -- true concurrency, especially in group chats with `maxConcurrent = 3`).
- Emit a `run.started` event to notify connected clients.
- Create a trace record (managed mode) with a generated trace UUID.
- Propagate context values: `WithAgentID()`, `WithUserID()`, `WithAgentType()`. Downstream tools and interceptors rely on these.
- Compute per-user workspace: `base + "/" + sanitize(userID)`. Inject via `WithToolWorkspace(ctx)` so all filesystem and shell tools use the correct directory.
- Ensure per-user files exist. A `sync.Map` cache guarantees the seeding function runs at most once per user.
- Persist the agent ID and user ID on the session for later reference.

### Phase 2: Input Validation

- **InputGuard**: scans the user message against 6 regex patterns that detect prompt injection attempts. See Section 4 for details.
- **Message truncation**: if the message exceeds `max_message_chars` (default 32,768), the content is truncated and the LLM receives a notification that the input was shortened. The message is never rejected outright.

### Phase 3: Build Messages

- Build the system prompt (15+ sections). Context files are resolved dynamically based on agent type.
- Inject the conversation summary (if one exists from a previous compaction) as the first two messages.
- Run the history pipeline (3 stages, see Section 5).
- Append the current user message. Messages are buffered locally (deferred write) to avoid race conditions with concurrent runs on the same session.

### Phase 4: LLM Iteration Loop

- Filter the available tools through the PolicyEngine (RBAC).
- Call the LLM. Streaming calls emit `chunk` events in real time; non-streaming calls return a single response.
- Record an LLM span for tracing with token counts and timing.
- If the response contains no tool calls, exit the loop.
- If tool calls are present, proceed to Phase 5 and then loop back.
- Maximum 20 iterations before the loop forcibly exits.

### Phase 5: Tool Execution

- Append the assistant message (with tool calls) to the message list.
- **Single tool call**: execute sequentially (no goroutine overhead).
- **Multiple tool calls**: launch parallel goroutines, collect all results, sort by original index, then process sequentially.
- Emit `tool.call` before execution and `tool.result` after.
- Record a tool span for each call. Track async tools (spawn, cron) separately.
- Save tool messages to the session.

### Phase 6: Response Finalization

- Run `SanitizeAssistantContent` -- an 8-step cleanup pipeline (see Section 3).
- Detect `NO_REPLY` in the final content. If present, suppress message delivery (silent reply).
- Flush all buffered messages atomically to the session (user message, tool messages, assistant message). This prevents concurrent runs from interleaving partial history.
- Update session metadata: model name, provider name, cumulative token counts.

### Phase 7: Auto-Summarization

- **Trigger condition**: the history has more than 50 messages OR the estimated token count exceeds 75% of the context window.
- **Per-session TryLock**: before summarizing, acquire a non-blocking per-session lock. If another concurrent run is already summarizing, skip. This prevents concurrent summarization from corrupting session history.
- **Memory flush first**: run synchronously so the agent can persist durable memories before history is truncated. Max 5 LLM iterations, 90-second timeout.
- **Summarize**: launch a background goroutine with a 120-second timeout. The LLM produces a summary of all messages except the last 4. The summary is saved and the history is truncated to those 4 messages. The compaction counter is incremented.

### Cancel Handling

When the context is cancelled (via `/stop` or `/stopall`), the loop exits immediately:
- Trace finalization uses `context.Background()` fallback when `ctx.Err() != nil` to ensure the final DB write succeeds.
- Trace status is set to `"cancelled"` instead of `"error"`.
- An empty outbound message triggers cleanup (stop typing indicator, clear reactions).

---

## 2. System Prompt

The system prompt is assembled dynamically from 15+ sections. Two modes control the amount of content included:

- **PromptFull**: used for main agent runs. Includes all sections.
- **PromptMinimal**: used for sub-agents and cron jobs. Stripped-down version with only essential context.

### Sections

1. **Identity** -- agent persona loaded from bootstrap files (IDENTITY.md, SOUL.md).
2. **First-run bootstrap** -- instructions shown only on the very first interaction.
3. **Tooling** -- descriptions and usage guidelines for available tools.
4. **Safety** -- defensive preamble for handling external content, wrapped in XML tags.
5. **Skills (inline)** -- skill content injected directly when the skill set is small.
6. **Skills (search mode)** -- BM25 skill search tool when the skill set is large.
7. **Memory Recall** -- recalled memory snippets relevant to the current conversation.
8. **Workspace** -- working directory path and file structure context.
9. **Sandbox** -- Docker sandbox instructions when sandbox mode is enabled.
10. **User Identity** -- the current user's display name and identifier.
11. **Time** -- current date and time for temporal awareness.
12. **Messaging** -- channel-specific formatting instructions (Telegram, Feishu, etc.).
13. **Extra context** -- additional prompt text wrapped in `<extra_context>` XML tags.
14. **Project Context** -- context files loaded from the database or filesystem, wrapped in `<context_file>` XML tags with a defensive preamble.
15. **Silent Replies** -- instructions for the NO_REPLY convention.
16. **Heartbeats** -- instructions for periodic wake-up behavior.
17. **Sub-Agent Spawning** -- rules for launching child agents.
18. **Delegation** -- auto-generated `DELEGATION.md` listing available delegation targets (inline if ≤15, search instruction if >15).
19. **Team** -- `TEAM.md` injected for team leads only (team name, role, teammate list).
20. **Runtime** -- runtime metadata (agent ID, session key, provider info).

---

## 3. Sanitize Output

An 8-step pipeline cleans raw LLM output before delivering it to the user.

```mermaid
flowchart TD
    IN[Raw LLM Output] --> S1
    S1["1. stripGarbledToolXML<br/>Remove broken XML tool artifacts<br/>from DeepSeek, GLM, Minimax"] --> S2
    S2["2. stripDowngradedToolCallText<br/>Remove text-format tool calls:<br/>[Tool Call: ...], [Tool Result ...]"] --> S3
    S3["3. stripThinkingTags<br/>Remove reasoning tags:<br/>think, thinking, thought, antThinking"] --> S4
    S4["4. stripFinalTags<br/>Remove final tag wrappers,<br/>preserve inner content"] --> S5
    S5["5. stripEchoedSystemMessages<br/>Remove hallucinated<br/>[System Message] blocks"] --> S6
    S6["6. collapseConsecutiveDuplicateBlocks<br/>Deduplicate repeated paragraphs<br/>caused by model stuttering"] --> S6B
    S6B["7. stripMediaPaths<br/>Remove raw media file paths<br/>from output"] --> S7
    S7["8. stripLeadingBlankLines<br/>Remove leading whitespace lines"] --> TRIM
    TRIM["TrimSpace()"] --> OUT[Clean Output]
```

### Step Details

1. **stripGarbledToolXML** -- Some models (DeepSeek, GLM, Minimax) emit tool-call XML as plain text instead of proper structured tool calls. This step removes tags like `<tool_call>`, `<function_call>`, `<tool_use>`, `<minimax:tool_call>`, and `<parameter name=...>`. If the entire response consists of garbled XML, an empty string is returned.

2. **stripDowngradedToolCallText** -- Removes text-format tool calls such as `[Tool Call: ...]`, `[Tool Result ...]`, and `[Historical context: ...]` along with any accompanying JSON arguments and output. Uses line-by-line scanning because Go regex does not support lookahead.

3. **stripThinkingTags** -- Removes internal reasoning tags: `<think>`, `<thinking>`, `<thought>`, `<antThinking>`. Case-insensitive, non-greedy matching.

4. **stripFinalTags** -- Removes `<final>` and `</final>` wrapper tags but preserves the content inside them.

5. **stripEchoedSystemMessages** -- Removes `[System Message]` blocks that the LLM hallucinates or echoes in its response. Scans line by line, skipping content until an empty line is reached.

6. **collapseConsecutiveDuplicateBlocks** -- Removes paragraphs that repeat consecutively (a symptom of model stuttering). Splits by `\n\n` and compares each trimmed block against its predecessor.

7. **stripMediaPaths** -- Removes raw media file paths from the output that the model may leak into its response text.

8. **stripLeadingBlankLines** -- Removes whitespace-only lines at the beginning of the output while preserving indentation in the remaining content.

---

## 4. Input Guard

The Input Guard detects prompt injection attempts in user messages. It is a detection system -- by default it logs warnings but does not block requests.

### 6 Detection Patterns

| Pattern | Description | Example |
|---------|-------------|---------|
| `ignore_instructions` | Attempts to override prior instructions | "Ignore all previous instructions" |
| `role_override` | Attempts to redefine the agent's role | "You are now a different assistant" |
| `system_tags` | Injection of fake system-level tags | `<\|im_start\|>system`, `[SYSTEM]` |
| `instruction_injection` | Insertion of new directives | "New instructions:", "override:" |
| `null_bytes` | Null byte injection | `\x00` characters in the message |
| `delimiter_escape` | Attempts to escape context boundaries | "end of system", `</instructions>` |

### 4 Action Modes

| Action | Behavior |
|--------|----------|
| `"off"` | Scanning disabled entirely |
| `"log"` | Log at info level (`security.injection_detected`), continue processing |
| `"warn"` (default) | Log at warn level (`security.injection_detected`), continue processing |
| `"block"` | Log at warn level and return an error, halting the request |

All security events use the `slog.Warn("security.injection_detected")` convention.

---

## 5. History Pipeline

The history pipeline prepares conversation history before sending it to the LLM. It runs in three sequential stages.

```mermaid
flowchart TD
    RAW[Raw Session History] --> S1
    S1["Stage 1: limitHistoryTurns<br/>Keep the last N user turns<br/>plus their associated assistant/tool messages"] --> S2
    S2["Stage 2: pruneContextMessages<br/>2-pass tool result trimming<br/>(see Section 6)"] --> S3
    S3["Stage 3: sanitizeHistory<br/>Repair broken tool_use / tool_result pairing<br/>after truncation"] --> OUT[Cleaned History]
```

### Stage 1: limitHistoryTurns

Takes the raw session history and a `historyLimit` parameter. Keeps only the last N user turns along with all associated assistant and tool messages that belong to those turns. Earlier messages are discarded.

### Stage 2: pruneContextMessages

Applies the 2-pass context pruning algorithm described in Section 6.

### Stage 3: sanitizeHistory

Repairs tool message pairing that may have been broken by truncation or compaction:

1. Skip orphaned tool messages at the beginning of history (no preceding assistant message).
2. For each assistant message that contains tool calls, collect the expected tool_call IDs.
3. Validate that the following tool messages match those expected IDs. Drop mismatched tool messages.
4. Synthesize missing tool results with placeholder text: `"[Tool result missing -- session was compacted]"`.

---

## 6. Context Pruning

Context pruning reduces oversized tool results using a 2-pass algorithm. It only activates when the estimated token-to-context-window ratio crosses a threshold.

```mermaid
flowchart TD
    START[Estimate token ratio vs context window] --> CHECK{Ratio >= softTrimRatio 0.3?}
    CHECK -->|No| DONE[No pruning needed]
    CHECK -->|Yes| PASS1

    PASS1["Pass 1: Soft Trim<br/>For each eligible tool result > 4000 chars:<br/>Keep first 1500 chars + last 1500 chars<br/>Replace middle with '...'"]
    PASS1 --> CHECK2{"Ratio >= hardClearRatio 0.5?"}
    CHECK2 -->|No| DONE
    CHECK2 -->|Yes| PASS2

    PASS2["Pass 2: Hard Clear<br/>Replace entire tool result content<br/>with '[Old tool result content cleared]'<br/>Stop when ratio drops below threshold"]
    PASS2 --> DONE
```

### Defaults

| Parameter | Default | Description |
|-----------|---------|-------------|
| `keepLastAssistants` | 3 | Number of recent assistant messages protected from pruning |
| `softTrimRatio` | 0.3 | Token ratio threshold to trigger Pass 1 |
| `hardClearRatio` | 0.5 | Token ratio threshold to trigger Pass 2 |
| `minPrunableToolChars` | 50,000 | Minimum tool result length eligible for hard clear |

### Protected Zone

The following messages are never pruned:

- System messages
- The last N assistant messages (default: 3)
- The first user message in the conversation

---

## 7. Auto-Summarize and Compaction

When the conversation grows too long, the auto-summarization system compresses older history into a summary while preserving recent context.

```mermaid
flowchart TD
    CHECK{"> 50 messages OR<br/>> 75% context window?"}
    CHECK -->|No| SKIP[Skip compaction]
    CHECK -->|Yes| FLUSH

    FLUSH["Step 1: Memory Flush (synchronous)<br/>LLM turn with write_file tool<br/>Agent writes durable memories before truncation<br/>Max 5 iterations, 90s timeout"]
    FLUSH --> SUMMARIZE

    SUMMARIZE["Step 2: Summarize (background goroutine)<br/>Keep last 4 messages<br/>LLM summarizes older messages<br/>temp=0.3, max_tokens=1024, timeout 120s"]
    SUMMARIZE --> SAVE

    SAVE["Step 3: Save<br/>SetSummary() + TruncateHistory(4)<br/>IncrementCompaction()"]
```

### Summary Reuse

On the next request, the saved summary is injected at the beginning of the message list as two messages:

1. `{role: "user", content: "[Previous conversation summary]\n{summary}"}`
2. `{role: "assistant", content: "I understand the context..."}`

This gives the LLM continuity without replaying the full history.

---

## 8. Memory Flush

Memory flush runs synchronously before compaction to give the agent an opportunity to persist important information.

- **Trigger**: token estimate >= contextWindow - 20,000 - 4,000.
- **Deduplication**: runs at most once per compaction cycle, tracked by the compaction counter.
- **Mechanism**: an embedded agent turn using `PromptMinimal` mode with a flush prompt and the 10 most recent messages. The default prompt is: "Store durable memories now, if nothing to store reply NO_REPLY."
- **Available tools**: `write_file` and `read_file`, so the agent can write and read memory files.
- **Timing**: fully synchronous -- blocks the summarization step until the flush completes.

---

## 9. Agent Router

The Agent Router manages Loop instances with a cache layer. It supports lazy resolution, TTL-based expiration, and run abort.

```mermaid
flowchart TD
    GET["Router.Get(agentID)"] --> CACHE{"Cache hit<br/>and TTL valid?"}
    CACHE -->|Yes| RETURN[Return cached Loop]
    CACHE -->|No or Expired| RESOLVE{"Resolver configured?"}
    RESOLVE -->|No| ERR["Error: agent not found"]
    RESOLVE -->|Yes| DB["Resolver.Resolve(agentID)<br/>Load from DB, create Loop"]
    DB --> STORE[Store in cache with TTL]
    STORE --> RETURN
```

### Cache Invalidation

`InvalidateAgent(agentID)` removes a specific agent from the cache, forcing the next `Get()` call to re-resolve from the database.

### Active Run Tracking

| Method | Behavior |
|--------|----------|
| `RegisterRun(runID, sessionKey, agentID, cancel)` | Register a new active run with its cancel function |
| `AbortRun(runID, sessionKey)` | Cancel a run (verifies sessionKey match before aborting) |
| `AbortRunsForSession(sessionKey)` | Cancel all active runs belonging to a session |

---

## 10. Resolver (Managed Mode)

The `ManagedResolver` lazy-creates Loop instances from PostgreSQL data when the Router encounters a cache miss.

```mermaid
flowchart TD
    MISS["Router cache miss"] --> LOAD["Step 1: Load agent from DB<br/>AgentStore.GetByKey(agentKey)"]
    LOAD --> PROV["Step 2: Resolve provider<br/>ProviderRegistry.Get(provider)<br/>Fallback: first provider in registry"]
    PROV --> BOOT["Step 3: Load bootstrap files<br/>bootstrap.LoadFromStore(agentID)"]
    BOOT --> DEFAULTS["Step 4: Apply defaults<br/>contextWindow <= 0 then 200K<br/>maxIterations <= 0 then 20"]
    DEFAULTS --> CREATE["Step 5: Create Loop<br/>NewLoop(LoopConfig)"]
    CREATE --> WIRE["Step 6: Wire managed-mode hooks<br/>EnsureUserFilesFunc, ContextFileLoaderFunc"]
    WIRE --> DONE["Return Loop to Router for caching"]
```

### Resolved Properties

- **Provider**: looked up by name from the provider registry. Falls back to the first registered provider if not found.
- **Bootstrap files**: loaded from the `agent_context_files` table (agent-level files like IDENTITY.md, SOUL.md).
- **Agent type**: `open` (per-user context with 7 template files) or `predefined` (agent-level context plus USER.md per user).
- **Per-user seeding**: `EnsureUserFilesFunc` seeds template files on first chat, idempotent (skips files that already exist). Uses PostgreSQL's `xmax` trick in `GetOrCreateUserProfile` to distinguish INSERT from ON CONFLICT UPDATE, triggering seeding only for genuinely new users.
- **Dynamic context loading**: `ContextFileLoaderFunc` resolves context files based on agent type -- per-user files for open agents, agent-level files for predefined agents.
- **Custom tools**: `DynamicLoader.LoadForAgent()` clones the global tool registry and adds per-agent custom tools, ensuring each agent gets its own isolated set of dynamic tools.

---

## 11. Event System

The Loop publishes events via an `onEvent` callback. The WebSocket gateway forwards these as `EventFrame` messages to connected clients for real-time progress tracking.

### Event Types

| Event | When | Payload |
|-------|------|---------|
| `run.started` | Run begins | -- |
| `chunk` | Streaming: each text fragment from the LLM | `{"content": "..."}` |
| `tool.call` | Tool execution begins | `{"name": "...", "id": "..."}` |
| `tool.result` | Tool execution completes | `{"name": "...", "id": "...", "is_error": bool}` |
| `run.completed` | Run finishes successfully | -- |
| `run.failed` | Run finishes with an error | `{"error": "..."}` |
| `handoff` | Conversation transferred to another agent | `{"from": "...", "to": "...", "reason": "..."}` |

### Event Flow

```mermaid
sequenceDiagram
    participant L as Agent Loop
    participant GW as Gateway
    participant C as WebSocket Client

    L->>GW: emit(run.started)
    GW->>C: EventFrame

    loop LLM Iterations
        L->>GW: emit(chunk) x N
        GW->>C: EventFrame x N
        L->>GW: emit(tool.call)
        GW->>C: EventFrame
        L->>GW: emit(tool.result)
        GW->>C: EventFrame
    end

    L->>GW: emit(run.completed)
    GW->>C: EventFrame
```

---

## 12. Tracing

Every agent run produces a trace with a hierarchy of spans for debugging, analysis, and cost tracking.

### Span Hierarchy

```mermaid
flowchart TD
    T["Trace (one per Run)"] --> A["Root Agent Span<br/>Covers the entire run duration"]
    A --> L1["LLM Span #1<br/>provider, model, iteration number"]
    A --> T1["Tool Span #1a<br/>tool name, duration"]
    A --> T2["Tool Span #1b<br/>tool name, duration"]
    A --> L2["LLM Span #2<br/>provider, model, iteration number"]
    A --> T3["Tool Span #2a<br/>tool name, duration"]
```

### 3 Span Types

| Span Type | Description |
|-----------|-------------|
| **Root Agent Span** | Parent span covering the full run. Contains agent ID, session key, and final status. |
| **LLM Call Span** | One per LLM invocation. Records provider, model, token counts (input/output), and duration. |
| **Tool Call Span** | One per tool execution. Records tool name, whether it errored, and duration. |

### Verbose Mode

Enabled via the `GOCLAW_TRACE_VERBOSE=1` environment variable.

| Field | Normal Mode | Verbose Mode |
|-------|-------------|--------------|
| `OutputPreview` | First 500 characters | First 500 characters |
| `InputPreview` | Not recorded | Full LLM input messages as JSON, truncated at 50,000 characters |

---

## 13. File Reference

| File | Responsibility |
|------|---------------|
| `internal/agent/loop.go` | Core Loop struct, RunRequest/RunResult, LLM iteration loop, tool execution, event emission |
| `internal/agent/loop_history.go` | History pipeline: limitHistoryTurns, sanitizeHistory, summary injection |
| `internal/agent/pruning.go` | Context pruning: 2-pass soft trim and hard clear algorithm |
| `internal/agent/systemprompt.go` | System prompt assembly (15+ sections), PromptFull and PromptMinimal modes |
| `internal/agent/resolver.go` | ManagedResolver: lazy Loop creation from PostgreSQL, provider resolution, bootstrap loading |
| `internal/agent/loop_tracing.go` | Trace and span creation, verbose mode input capture, span finalization |
| `internal/agent/input_guard.go` | Input Guard: 6 regex patterns, 4 action modes, security logging |
| `internal/agent/sanitize.go` | 8-step output sanitization pipeline |
| `internal/agent/memoryflush.go` | Pre-compaction memory flush: embedded agent turn with write_file tool |

---

# 02 - LLM Providers

GoClaw abstracts LLM communication behind a single `Provider` interface, allowing the agent loop to work with any backend without knowing the wire format. Two concrete implementations exist: an Anthropic provider using native `net/http` with SSE streaming, and a generic OpenAI-compatible provider that covers 10+ API endpoints.

---

## 1. Provider Architecture

All providers implement four methods: `Chat()`, `ChatStream()`, `Name()`, and `DefaultModel()`. The agent loop calls `Chat()` for non-streaming requests and `ChatStream()` for token-by-token streaming. Both return a unified `ChatResponse` with content, tool calls, finish reason, and token usage.

```mermaid
flowchart TD
    AL["Agent Loop"] -->|"Chat() / ChatStream()"| PI["Provider Interface"]

    PI --> ANTH["Anthropic Provider<br/>native net/http + SSE"]
    PI --> OAI["OpenAI-Compatible Provider<br/>generic HTTP client"]

    ANTH --> CLAUDE["Claude API<br/>api.anthropic.com/v1"]
    OAI --> OPENAI["OpenAI API"]
    OAI --> OR["OpenRouter API"]
    OAI --> GROQ["Groq API"]
    OAI --> DS["DeepSeek API"]
    OAI --> GEM["Gemini API"]
    OAI --> OTHER["Mistral / xAI / MiniMax<br/>Cohere / Perplexity"]
```

The Anthropic provider uses `x-api-key` header authentication and the `anthropic-version: 2023-06-01` header. The OpenAI-compatible provider uses `Authorization: Bearer` tokens and targets each provider's `/chat/completions` endpoint. Both providers set an HTTP client timeout of 120 seconds.

---

## 2. Supported Providers

| Provider | Type | API Base | Default Model |
|----------|------|----------|---------------|
| anthropic | Native HTTP + SSE | `https://api.anthropic.com/v1` | `claude-sonnet-4-5-20250929` |
| openai | OpenAI-compatible | `https://api.openai.com/v1` | `gpt-4o` |
| openrouter | OpenAI-compatible | `https://openrouter.ai/api/v1` | `anthropic/claude-sonnet-4-5-20250929` |
| groq | OpenAI-compatible | `https://api.groq.com/openai/v1` | `llama-3.3-70b-versatile` |
| deepseek | OpenAI-compatible | `https://api.deepseek.com/v1` | `deepseek-chat` |
| gemini | OpenAI-compatible | `https://generativelanguage.googleapis.com/v1beta/openai` | `gemini-2.0-flash` |
| mistral | OpenAI-compatible | `https://api.mistral.ai/v1` | `mistral-large-latest` |
| xai | OpenAI-compatible | `https://api.x.ai/v1` | `grok-3-mini` |
| minimax | OpenAI-compatible | `https://api.minimax.chat/v1` | `MiniMax-M2.5` |
| cohere | OpenAI-compatible | `https://api.cohere.com/v2` | `command-a` |
| perplexity | OpenAI-compatible | `https://api.perplexity.ai` | `sonar-pro` |
| dashscope | OpenAI-compatible | `https://dashscope.aliyuncs.com/compatible-mode/v1` | `qwen3-max` |

---

## 3. Call Flow

### Non-Streaming (Chat)

```mermaid
sequenceDiagram
    participant AL as Agent Loop
    participant P as Provider
    participant R as RetryDo
    participant API as LLM API

    AL->>P: Chat(ChatRequest)
    P->>P: resolveModel()
    P->>P: buildRequestBody()
    P->>R: RetryDo(fn)

    loop Max 3 attempts
        R->>API: HTTP POST /messages or /chat/completions
        alt Success (200)
            API-->>R: JSON Response
            R-->>P: io.ReadCloser
        else Retryable (429, 500-504, network)
            API-->>R: Error
            R->>R: Backoff delay + jitter
        else Non-retryable (400, 401, 403)
            API-->>R: Error
            R-->>P: Error (no retry)
        end
    end

    P->>P: parseResponse()
    P-->>AL: ChatResponse
```

### Streaming (ChatStream)

```mermaid
sequenceDiagram
    participant AL as Agent Loop
    participant P as Provider
    participant R as RetryDo
    participant API as LLM API

    AL->>P: ChatStream(ChatRequest, onChunk)
    P->>P: buildRequestBody(stream=true)
    P->>R: RetryDo(connection only)

    R->>API: HTTP POST (stream: true)
    API-->>R: 200 OK + SSE stream
    R-->>P: io.ReadCloser

    loop SSE events (line-by-line)
        API-->>P: data: event JSON
        P->>P: Accumulate content + tool call args
        P->>AL: onChunk(StreamChunk)
    end

    P->>P: Parse accumulated tool call JSON
    P->>AL: onChunk(Done: true)
    P-->>AL: ChatResponse (final)
```

Key difference: non-streaming wraps the entire request in `RetryDo`. Streaming retries only the connection phase -- once SSE events start flowing, no retry occurs mid-stream.

---

## 4. Anthropic vs OpenAI-Compatible

| Aspect | Anthropic | OpenAI-Compatible |
|--------|-----------|-------------------|
| Implementation | Native `net/http` | Generic HTTP client |
| System messages | Separate `system` field (array of text blocks) | Inline in `messages` array with `role: "system"` |
| Tool definitions | `name` + `description` + `input_schema` | Standard OpenAI function schema |
| Tool results | `role: "user"` with `tool_result` content block + `tool_use_id` | `role: "tool"` with `tool_call_id` |
| Tool call arguments | `map[string]interface{}` (parsed JSON object) | JSON string in `function.arguments` (manual marshal) |
| Tool call streaming | `input_json_delta` events | `delta.tool_calls[].function.arguments` fragments |
| Stop reason mapping | `tool_use` mapped to `tool_calls`, `max_tokens` mapped to `length` | Direct passthrough of `finish_reason` |
| Gemini compatibility | N/A | Skip empty `content` field in assistant messages with tool_calls |
| OpenRouter compatibility | N/A | Model must contain `/` (e.g., `anthropic/claude-...`); unprefixed falls back to default |

---

## 5. Retry Logic

### RetryDo[T] Generic Function

`RetryDo` is a generic function that wraps any provider call with exponential backoff, jitter, and context cancellation support.

### Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| Attempts | 3 | Total tries (1 = no retry) |
| MinDelay | 300ms | Initial delay before first retry |
| MaxDelay | 30s | Upper cap on delay |
| Jitter | 0.1 (10%) | Random variation applied to each delay |

### Backoff Formula

```
delay = MinDelay * 2^(attempt - 1)
delay = min(delay, MaxDelay)
delay = delay +/- (delay * jitter * random)

Example:
  Attempt 1: 300ms (+/-30ms)  -> 270ms..330ms
  Attempt 2: 600ms (+/-60ms)  -> 540ms..660ms
  Attempt 3: 1200ms (+/-120ms) -> 1080ms..1320ms
```

If the response includes a `Retry-After` header (HTTP 429 or 503), the header value completely replaces the computed backoff. The header is parsed as integer seconds or RFC 1123 date format.

### Retryable vs Non-Retryable Errors

| Category | Conditions |
|----------|------------|
| Retryable | HTTP 429, 500, 502, 503, 504; network errors (`net.Error`); connection reset; broken pipe; EOF; timeout |
| Non-retryable | HTTP 400, 401, 403, 404; all other status codes |

### Retry Flow

```mermaid
flowchart TD
    CALL["fn()"] --> OK{Success?}
    OK -->|Yes| RETURN["Return result"]
    OK -->|No| RETRY{Retryable error?}
    RETRY -->|No| FAIL["Return error immediately"]
    RETRY -->|Yes| LAST{Last attempt?}
    LAST -->|Yes| FAIL
    LAST -->|No| DELAY["Compute delay<br/>(Retry-After header or backoff + jitter)"]
    DELAY --> WAIT{Context cancelled?}
    WAIT -->|Yes| CANCEL["Return context error"]
    WAIT -->|No| CALL
```

---

## 6. Schema Cleaning

Some providers reject tool schemas containing unsupported JSON Schema fields. `CleanSchemaForProvider()` recursively removes these fields from the entire schema tree, including nested `properties`, `anyOf`, `oneOf`, and `allOf`.

| Provider | Fields Removed |
|----------|---------------|
| Gemini | `$ref`, `$defs`, `additionalProperties`, `examples`, `default` |
| Anthropic | `$ref`, `$defs` |
| All others | No cleaning applied |

The Anthropic provider calls `CleanSchemaForProvider("anthropic", ...)` when converting tool definitions to the `input_schema` format. The OpenAI-compatible provider calls `CleanToolSchemas()` which applies the same logic per provider name.

---

## 7. Managed Mode -- Providers from Database

In managed mode, providers are loaded from the `llm_providers` table in addition to the config file. Database providers override config providers with the same name.

### Loading Flow

```mermaid
flowchart TD
    START["Gateway Startup"] --> CFG["Step 1: Register providers from config<br/>(Anthropic, OpenAI, etc.)"]
    CFG --> DB["Step 2: Register providers from DB<br/>SELECT * FROM llm_providers<br/>Decrypt API keys"]
    DB --> OVERRIDE["DB providers override<br/>config providers with same name"]
    OVERRIDE --> READY["Provider Registry ready"]
```

### API Key Encryption

```mermaid
flowchart LR
    subgraph "Storing a key"
        PLAIN["Plaintext API key"] --> ENC["AES-256-GCM encrypt"]
        ENC --> DB["DB column: 'aes-gcm:' + base64(nonce + ciphertext + tag)"]
    end

    subgraph "Loading a key"
        DB2["DB value"] --> CHECK{"Has 'aes-gcm:' prefix?"}
        CHECK -->|Yes| DEC["AES-256-GCM decrypt"]
        CHECK -->|No| RAW["Return as-is<br/>(backward compatibility)"]
        DEC --> USE["Plaintext key for provider"]
        RAW --> USE
    end
```

`GOCLAW_ENCRYPTION_KEY` accepts three formats:
- **Hex**: 64 characters (32 bytes decoded)
- **Base64**: 44 characters (32 bytes decoded)
- **Raw**: 32 characters (32 bytes direct)

---

## 8. Agent Evaluators (Hook System)

Agent evaluators in the quality gate / hook system (see [03-tools-system.md](./03-tools-system.md)) use the same provider resolution as normal agent runs. When a quality gate is configured with `"type": "agent"`, the hook engine delegates to the specified reviewer agent, which resolves its own provider through the standard provider registry. No separate provider configuration is needed for evaluator agents.

---

## File Reference

| File | Purpose |
|------|---------|
| `internal/providers/types.go` | Provider interface, ChatRequest, ChatResponse, Message, ToolCall, Usage types |
| `internal/providers/anthropic.go` | Anthropic provider implementation (native HTTP + SSE streaming) |
| `internal/providers/openai.go` | OpenAI-compatible provider implementation (generic HTTP) |
| `internal/providers/retry.go` | RetryDo[T] generic function, RetryConfig, IsRetryableError, backoff computation |
| `internal/providers/schema_cleaner.go` | CleanSchemaForProvider, CleanToolSchemas, recursive schema field removal |
| `cmd/gateway_providers.go` | Provider registration from config and database during gateway startup |

---

# 03 - Tools System

The tools system is the bridge between the agent loop and the external environment. When the LLM emits a tool call, the agent loop delegates execution to the tool registry, which handles rate limiting, credential scrubbing, policy enforcement, and virtual filesystem routing before returning results for the next LLM iteration.

---

## 1. Tool Execution Flow

```mermaid
sequenceDiagram
    participant AL as Agent Loop
    participant R as Registry
    participant RL as Rate Limiter
    participant T as Tool
    participant SC as Scrubber

    AL->>R: ExecuteWithContext(name, args, channel, chatID, ...)
    R->>R: Inject context values into ctx
    R->>RL: Allow(sessionKey)?
    alt Rate limited
        RL-->>R: Error: rate limit exceeded
    else Allowed
        RL-->>R: OK
        R->>T: Execute(ctx, args)
        T-->>R: Result
        R->>SC: ScrubCredentials(result.ForLLM)
        R->>SC: ScrubCredentials(result.ForUser)
        SC-->>R: Cleaned result
    end
    R-->>AL: Result
```

ExecuteWithContext performs 8 steps:

1. Lock registry, find tool by name, unlock
2. Inject `WithToolChannel(ctx, channel)`
3. Inject `WithToolChatID(ctx, chatID)`
4. Inject `WithToolPeerKind(ctx, peerKind)`
5. Inject `WithToolSandboxKey(ctx, sessionKey)`
6. Rate limit check via `rateLimiter.Allow(sessionKey)`
7. Execute `tool.Execute(ctx, args)`
8. Scrub credentials from both `ForLLM` and `ForUser` output, log duration

Context keys ensure each tool call receives the correct per-call values without mutable fields, allowing tool instances to be shared safely across concurrent goroutines.

---

## 2. Complete Tool Inventory

### Filesystem (group: `fs`)

| Tool | Description |
|------|-------------|
| `read_file` | Read file contents with optional line range |
| `write_file` | Write or create a file |
| `edit_file` | Apply targeted edits to a file |
| `list_files` | List directory contents |
| `search` | Search file contents with regex |
| `glob` | Find files matching a glob pattern |

### Runtime (group: `runtime`)

| Tool | Description |
|------|-------------|
| `exec` | Execute a shell command |
| `process` | Manage running processes |

### Web (group: `web`)

| Tool | Description |
|------|-------------|
| `web_search` | Search the web |
| `web_fetch` | Fetch and parse a URL |

### Memory (group: `memory`)

| Tool | Description |
|------|-------------|
| `memory_search` | Search memory documents |
| `memory_get` | Retrieve a specific memory document |

### Sessions (group: `sessions`)

| Tool | Description |
|------|-------------|
| `sessions_list` | List active sessions |
| `sessions_history` | View session message history |
| `sessions_send` | Send a message to a session |
| `sessions_spawn` | Spawn an async subagent task |
| `subagents` | Manage subagent tasks (list, cancel, steer) |
| `session_status` | Get current session status |

### UI (group: `ui`)

| Tool | Description |
|------|-------------|
| `browser` | Browser automation via Rod + CDP |
| `canvas` | Visual canvas operations |

### Automation (group: `automation`)

| Tool | Description |
|------|-------------|
| `cron` | Manage scheduled tasks |
| `gateway` | Gateway administration commands |

### Messaging (group: `messaging`)

| Tool | Description |
|------|-------------|
| `message` | Send a message to a channel |

### Delegation (group: `delegation`)

| Tool | Description |
|------|-------------|
| `delegate` | Delegate task to another agent (actions: delegate, cancel, list, history) |
| `delegate_search` | Hybrid FTS + semantic agent discovery for delegation targets |
| `evaluate_loop` | Generate-evaluate-revise cycle with two agents (max 5 rounds) |
| `handoff` | Transfer conversation to another agent (routing override) |

### Teams (group: `teams`)

| Tool | Description |
|------|-------------|
| `team_tasks` | Task board: list, create, claim, complete, search |
| `team_message` | Mailbox: send, broadcast, read unread messages |

### Other Tools

| Tool | Description |
|------|-------------|
| `skill_search` | Search available skills (BM25) |
| `image` | Generate images |
| `tts` | Text-to-speech synthesis (OpenAI, ElevenLabs, Edge, MiniMax) |
| `spawn` | Spawn subagent (alternative to sessions_spawn) |
| `nodes` | Node graph operations |

---

## 3. Filesystem Tools and Virtual FS Routing

In managed mode, filesystem operations are intercepted before hitting the host disk. Two interceptor layers route specific paths to the database instead.

```mermaid
flowchart TD
    CALL["read_file / write_file"] --> INT1{"ContextFile<br/>Interceptor?"}
    INT1 -->|Handled| DB1[("DB: agent_context_files<br/>/ user_context_files")]
    INT1 -->|Not handled| INT2{"Memory<br/>Interceptor?"}
    INT2 -->|Handled| DB2[("DB: memory_documents")]
    INT2 -->|Not handled| SBX{"Sandbox enabled?"}
    SBX -->|Yes| DOCKER["Docker container"]
    SBX -->|No| HOST["Host filesystem<br/>resolvePath -> os.ReadFile / WriteFile"]
```

### ContextFileInterceptor -- 7 Routed Files

| File | Description |
|------|-------------|
| `SOUL.md` | Agent personality and behavior |
| `IDENTITY.md` | Agent identity information |
| `AGENTS.md` | Sub-agent definitions |
| `TOOLS.md` | Tool usage guidance |
| `HEARTBEAT.md` | Periodic wake-up instructions |
| `USER.md` | Per-user preferences and context |
| `BOOTSTRAP.md` | First-run instructions (write empty = delete row) |

### Routing by Agent Type

```mermaid
flowchart TD
    FILE{"Path is one of<br/>7 context files?"} -->|No| PASS["Pass through to disk"]
    FILE -->|Yes| TYPE{"Agent type?"}
    TYPE -->|open| USER_CF["user_context_files<br/>fallback: agent_context_files"]
    TYPE -->|predefined| PRED{"File = USER.md?"}
    PRED -->|Yes| USER_CF2["user_context_files"]
    PRED -->|No| AGENT_CF["agent_context_files"]
```

- **Open agents**: All 7 files are per-user. If a user file does not exist, the agent-level template is returned as fallback.
- **Predefined agents**: Only `USER.md` is per-user. All other files come from the agent-level store.

### MemoryInterceptor

Routes `MEMORY.md`, `memory.md`, and `memory/*` paths. Per-user results take priority with a fallback to global scope. Writing a `.md` file automatically triggers `IndexDocument()` (chunking + embedding).

### PathDenyable Interface

Tools that access the filesystem implement the `PathDenyable` interface, allowing specific path prefixes to be denied at runtime:

```go
type PathDenyable interface {
    DenyPaths(...string)
}
```

All four filesystem tools (`read_file`, `write_file`, `list_files`, `edit_file`) implement it. `list_files` additionally filters denied directories from its output entirely -- the agent doesn't even know the directory exists. Used to prevent agents from accessing `.goclaw` directories within workspaces.

### Workspace Context Injection

Filesystem and shell tools read their workspace from `ToolWorkspaceFromCtx(ctx)`, which is injected by the agent loop based on the current user and agent. This enables per-user workspace isolation without changing any tool code. Falls back to the struct field for backward compatibility.

### Path Security

`resolvePath()` joins relative paths with the workspace root, applies `filepath.Clean()`, and verifies the result with `HasPrefix()`. This prevents path traversal attacks (e.g., `../../../etc/passwd`). The extended `resolvePathWithAllowed()` permits additional prefixes for skills directories.

---

## 4. Shell Execution

The `exec` tool allows the LLM to run shell commands, with multiple defense layers.

### Deny Patterns

| Category | Blocked Patterns |
|----------|------------------|
| Destructive file ops | `rm -rf`, `del /f`, `rmdir /s` |
| Disk destruction | `mkfs`, `dd if=`, `> /dev/sd*` |
| System control | `shutdown`, `reboot`, `poweroff` |
| Fork bombs | `:(){ ... };:` |
| Remote code exec | `curl \| sh`, `wget -O - \| sh` |
| Reverse shells | `/dev/tcp/`, `nc -e` |
| Eval injection | `eval $()`, `base64 -d \| sh` |

### Approval Workflow

```mermaid
flowchart TD
    CMD["Shell Command"] --> DENY{"Matches deny<br/>pattern?"}
    DENY -->|Yes| BLOCK["Blocked by safety policy"]
    DENY -->|No| APPROVAL{"Approval manager<br/>configured?"}
    APPROVAL -->|No| EXEC["Execute on host"]
    APPROVAL -->|Yes| CHECK{"CheckCommand()"}
    CHECK -->|deny| BLOCK2["Command denied"]
    CHECK -->|allow| EXEC
    CHECK -->|ask| REQUEST["Request approval<br/>(2-minute timeout)"]
    REQUEST -->|allow-once| EXEC
    REQUEST -->|allow-always| ADD["Add to dynamic allowlist"] --> EXEC
    REQUEST -->|deny / timeout| BLOCK3["Command denied"]
```

### Sandbox Routing

When a sandbox manager is configured and a `sandboxKey` exists in context, commands execute inside a Docker container. The host working directory maps to `/workspace` in the container. Host timeout is 60 seconds; sandbox timeout is 300 seconds. If sandbox returns `ErrSandboxDisabled`, execution falls back to the host.

---

## 5. Policy Engine

The policy engine determines which tools the LLM can use through a 7-step allow pipeline followed by deny subtraction and additive alsoAllow.

```mermaid
flowchart TD
    ALL["All registered tools"] --> S1

    S1["Step 1: Global Profile<br/>full / minimal / coding / messaging"] --> S2
    S2["Step 2: Provider Profile Override<br/>byProvider.{name}.profile"] --> S3
    S3["Step 3: Global Allow List<br/>Intersection with allow list"] --> S4
    S4["Step 4: Provider Allow Override<br/>byProvider.{name}.allow"] --> S5
    S5["Step 5: Agent Allow<br/>Per-agent allow list"] --> S6
    S6["Step 6: Agent + Provider Allow<br/>Per-agent per-provider allow"] --> S7
    S7["Step 7: Group Allow<br/>Group-level allow list"]

    S7 --> DENY["Apply Deny Lists<br/>Global deny, then Agent deny"]
    DENY --> ALSO["Apply AlsoAllow<br/>Global alsoAllow, Agent alsoAllow<br/>(additive union)"]
    ALSO --> SUB{"Subagent?"}
    SUB -->|Yes| SUBDENY["Apply subagent deny list<br/>+ leaf deny list if at max depth"]
    SUB -->|No| FINAL["Final tool list sent to LLM"]
    SUBDENY --> FINAL
```

### Profiles

| Profile | Tools Included |
|---------|---------------|
| `full` | All registered tools (no restriction) |
| `coding` | `group:fs`, `group:runtime`, `group:sessions`, `group:memory`, `image` |
| `messaging` | `group:messaging`, `sessions_list`, `sessions_history`, `sessions_send`, `session_status` |
| `minimal` | `session_status` only |

### Tool Groups

| Group | Members |
|-------|---------|
| `fs` | `read_file`, `write_file`, `list_files`, `edit_file`, `search`, `glob` |
| `runtime` | `exec`, `process` |
| `web` | `web_search`, `web_fetch` |
| `memory` | `memory_search`, `memory_get` |
| `sessions` | `sessions_list`, `sessions_history`, `sessions_send`, `sessions_spawn`, `subagents`, `session_status` |
| `ui` | `browser`, `canvas` |
| `automation` | `cron`, `gateway` |
| `messaging` | `message` |
| `delegation` | `delegate`, `delegate_search`, `evaluate_loop`, `handoff` |
| `teams` | `team_tasks`, `team_message` |
| `goclaw` | All native tools (composite group) |

Groups can be referenced in allow/deny lists with the `group:` prefix (e.g., `group:fs`). The MCP manager dynamically registers `mcp` and `mcp:{serverName}` groups at runtime.

---

## 6. Subagent System

Subagents are child agent instances spawned to handle parallel or complex tasks. They run in background goroutines with restricted tool access.

### Lifecycle

```mermaid
stateDiagram-v2
    [*] --> Spawning: spawn(task, label)
    Spawning --> Running: Limits pass<br/>(depth, concurrent, children)
    Spawning --> Rejected: Limit exceeded

    Running --> Completed: Task finished
    Running --> Failed: LLM error
    Running --> Cancelled: cancel / steer / parent abort

    Completed --> Archived: After 60 min
    Failed --> Archived: After 60 min
    Cancelled --> Archived: After 60 min
```

### Limits

| Constraint | Default | Description |
|------------|---------|-------------|
| MaxConcurrent | 8 | Total running subagents across all parents |
| MaxSpawnDepth | 1 | Maximum nesting depth |
| MaxChildrenPerAgent | 5 | Maximum children per parent agent |
| ArchiveAfterMinutes | 60 | Auto-archive completed tasks |
| Max iterations | 20 | LLM loop iterations per subagent |

### Subagent Actions

| Action | Behavior |
|--------|----------|
| `spawn` (async) | Launch in goroutine, return immediately with acceptance message |
| `run` (sync) | Block until subagent completes, return result directly |
| `list` | List all subagent tasks with status |
| `cancel` | Cancel by specific ID, `"all"`, or `"last"` |
| `steer` | Cancel + settle 500ms + respawn with new message |

### Tool Deny Lists

| List | Denied Tools |
|------|-------------|
| Always denied (all depths) | `gateway`, `agents_list`, `whatsapp_login`, `session_status`, `cron`, `memory_search`, `memory_get`, `sessions_send` |
| Leaf denied (max depth) | `sessions_list`, `sessions_history`, `sessions_spawn`, `spawn`, `subagent` |

Results are announced back to the parent agent via the message bus, optionally batched through an AnnounceQueue with debouncing.

---

## 7. Delegation System

Delegation allows named agents to delegate tasks to other fully independent agents (each with its own identity, tools, provider, model, and context files). Unlike subagents (anonymous clones), delegation crosses agent boundaries via explicit permission links.

### DelegateManager

The `DelegateManager` in `internal/tools/delegate.go` orchestrates all delegation operations:

| Action | Mode | Behavior |
|--------|------|----------|
| `delegate` | `sync` | Caller waits for result (quick lookups, fact checks) |
| `delegate` | `async` | Caller moves on; result announced later via message bus (`delegate:{id}`) |
| `cancel` | -- | Cancel a running async delegation by ID |
| `list` | -- | List active delegations |
| `history` | -- | Query past delegations from `delegation_history` table |

### Callback Pattern

The `tools` package cannot import `agent` (import cycle). A callback function bridges the gap:

```go
type AgentRunFunc func(ctx context.Context, agentKey string, req DelegateRunRequest) (*DelegateRunResult, error)
```

The `cmd` layer provides the implementation at wiring time. The `tools` package never knows `agent` exists.

### Agent Links (Permission Control)

Delegation requires an explicit link in the `agent_links` table. Links are directed edges:

- **outbound** (A→B): Only A can delegate to B
- **bidirectional** (A↔B): Both can delegate to each other

Each link has `max_concurrent` and per-user `settings` (JSONB) for deny/allow lists.

### Concurrency Control

Two layers prevent overload:

| Layer | Config | Scope |
|-------|--------|-------|
| Per-link | `agent_links.max_concurrent` | A→B specifically |
| Per-agent | `other_config.max_delegation_load` | B from all sources |

When limits hit, the error message is written for LLM reasoning: *"Agent at capacity (5/5). Try a different agent or handle it yourself."*

### DELEGATION.md Auto-Injection

During agent resolution, `DELEGATION.md` is auto-generated and injected into the system prompt:

- **≤15 targets**: Full inline list with agent keys, names, and frontmatter
- **>15 targets**: Search instruction pointing to the `delegate_search` tool (hybrid FTS + pgvector cosine)

### Context File Merging (Open Agents)

For open agents, per-user context files merge with resolver-injected base files. Per-user files override same-name base files, but base-only files like `DELEGATION.md` are preserved:

```
Base files (resolver):     DELEGATION.md
Per-user files (DB):       AGENTS.md, SOUL.md, TOOLS.md, USER.md, ...
Merged result:             AGENTS.md, SOUL.md, TOOLS.md, USER.md, ..., DELEGATION.md ✓
```

---

## 8. Agent Teams

Teams add a shared coordination layer on top of delegation: a task board for parallel work and a mailbox for peer-to-peer communication.

### Architecture

An admin creates a team via the dashboard, assigns a **lead** and **members**. When a user messages the lead:
1. The lead sees `TEAM.md` in its system prompt (teammate list + role)
2. The lead posts tasks to the board
3. Teammates are activated, claim tasks, and work in parallel
4. Teammates message each other for coordination
5. The lead synthesizes results and replies to the user

### Task Board (`team_tasks` tool)

| Action | Description |
|--------|-------------|
| `list` | List tasks (filter: active/completed/all, order: priority/newest) |
| `create` | Create task with subject, description, priority, blocked_by |
| `claim` | Atomically claim a pending task (race-safe via row-level lock) |
| `complete` | Mark task done with result; auto-unblocks dependent tasks |
| `search` | FTS search over task subject + description |

### Mailbox (`team_message` tool)

| Action | Description |
|--------|-------------|
| `send` | Send direct message to a specific teammate |
| `broadcast` | Send message to all teammates |
| `read` | Read unread messages |

### Lead-Centric Design

Only the lead gets `TEAM.md` in its system prompt. Teammates discover context on demand through tools -- no wasted tokens on idle agents. When a teammate message arrives, the message itself carries context (e.g., *"[Team message from lead]: please claim a task from the board."*).

### Message Routing

Teammate results route through the message bus with a `"teammate:"` prefix. The consumer publishes the outbound response so the lead (and ultimately the user) sees the result.

---

## 9. Evaluate-Optimize Loop

A structured revision cycle between two agents: a generator and an evaluator.

```mermaid
sequenceDiagram
    participant L as Calling Agent
    participant G as Generator
    participant V as Evaluator

    L->>G: "Write product announcement"
    G->>L: Draft v1
    L->>V: "Evaluate against criteria"
    V->>L: "REJECTED: Too long, missing pricing"
    L->>G: "Revise. Feedback: too long, missing pricing"
    G->>L: Draft v2
    L->>V: "Evaluate revised version"
    V->>L: "APPROVED"
    L->>L: Return v2 as final output
```

The `evaluate_loop` tool orchestrates this. Parameters: generator agent, evaluator agent, pass criteria, and max rounds (default 3, cap 5). Each round is a pair of sync delegations. If the evaluator responds with "APPROVED" (case-insensitive prefix match), the loop exits. If "REJECTED: feedback", the generator gets another shot.

Internal delegations use `WithSkipHooks(ctx)` to prevent quality gates from triggering recursion.

---

## 10. Agent Handoff

Handoff transfers a conversation from one agent to another. Unlike delegation (which keeps the source agent in the loop), handoff removes it entirely.

| | Delegation | Handoff |
|---|---|---|
| Who talks to the user? | Source agent (always) | Target agent (after transfer) |
| Source agent involvement | Waits for result, reformulates | Steps away completely |
| Session | Target runs in source's context | Target gets a new session |
| Duration | One task | Until cleared or handed back |

### Mechanism

When agent A calls `handoff(agent="billing", reason="billing question")`:
1. A row is written to `handoff_routes`: this channel + chat ID now routes to billing
2. A `handoff` event is broadcast (WS clients can react)
3. An initial message is published to billing via the message bus with conversation context

Subsequent messages from the user on that channel are routed to billing (consumer checks `handoff_routes` before normal routing). Billing can hand back via `handoff(action="clear")`.

---

## 11. Quality Gates (Hook System)

A general-purpose hook system for validating agent output before it reaches the user. Located in `internal/hooks/`.

### Evaluator Types

| Type | How it works | Example |
|------|-------------|---------|
| **command** | Run a shell command. Exit 0 = pass. Stderr = feedback. | `npm test`, `eslint --stdin` |
| **agent** | Delegate to a reviewer agent. Parse "APPROVED" or "REJECTED: feedback". | QA reviewer checks tone/accuracy |

### Configuration

Quality gates live in the source agent's `other_config` JSON:

```json
{
  "quality_gates": [
    {
      "event": "delegation.completed",
      "type": "agent",
      "agent": "qa-reviewer",
      "block_on_failure": true,
      "max_retries": 2
    }
  ]
}
```

When `block_on_failure` is true and retries remain, the system re-runs the target agent with the evaluator's feedback injected as a revision prompt.

### Recursion Prevention

Quality gates with agent evaluators can cause infinite recursion (gate delegates to reviewer → reviewer completes → gate fires again). The fix is a context flag: `hooks.WithSkipHooks(ctx, true)`. Three places set it:
1. **Agent evaluator** -- when delegating to the reviewer
2. **Evaluate loop** -- for all internal generator/evaluator delegations
3. **Agent eval callback in cmd layer** -- when the hook engine itself triggers delegation

`DelegateManager.Delegate()` checks `hooks.SkipHooksFromContext(ctx)` before applying gates. If set, gates are skipped.

---

## 12. MCP Bridge Tools

GoClaw integrates with Model Context Protocol (MCP) servers via `internal/mcp/`. The MCP Manager connects to external tool servers and registers their tools in the tool registry with a configurable prefix.

### Transports

| Transport | Description |
|-----------|-------------|
| `stdio` | Launch process with command + args, communicate via stdin/stdout |
| `sse` | Connect to SSE endpoint via URL |
| `streamable-http` | Connect to HTTP streaming endpoint |

### Behavior

- Health checks run every 30 seconds per server
- Reconnection uses exponential backoff (2s initial, 60s max, 10 attempts)
- Tools are registered with a prefix (e.g., `mcp_servername_toolname`)
- Dynamic tool group registration: `mcp` and `mcp:{serverName}` groups

### Access Control (Managed Mode)

In managed mode, MCP server access is controlled through per-agent and per-user grants stored in PostgreSQL.

```mermaid
flowchart TD
    REQ["LoadForAgent(agentID, userID)"] --> QUERY["ListAccessible()<br/>JOIN mcp_servers + agent_grants + user_grants"]
    QUERY --> SERVERS["Accessible servers list<br/>(with ToolAllow/ToolDeny per grant)"]
    SERVERS --> CONNECT["Connect each server<br/>(stdio/sse/streamable-http)"]
    CONNECT --> DISCOVER["ListTools() from server"]
    DISCOVER --> FILTER["filterTools()<br/>1. Remove tools in deny list<br/>2. Keep only tools in allow list (if set)<br/>3. Deny takes priority over allow"]
    FILTER --> REGISTER["Register filtered tools<br/>in tool registry"]
```

**Grant types**:

| Grant | Table | Scope | Fields |
|-------|-------|-------|--------|
| Agent grant | `mcp_agent_grants` | Per server + agent | `tool_allow`, `tool_deny` (JSONB arrays), `config_overrides`, `enabled` |
| User grant | `mcp_user_grants` | Per server + user | `tool_allow`, `tool_deny` (JSONB arrays), `enabled` |

**Access request workflow**: Users can request access to MCP servers. Admins review and approve or reject. On approval, a corresponding grant is created transactionally.

```mermaid
flowchart LR
    USER["CreateRequest()<br/>scope: agent/user<br/>status: pending"] --> ADMIN["ReviewRequest()<br/>approve or reject"]
    ADMIN -->|approved| GRANT["Create agent/user grant<br/>with requested tool_allow"]
    ADMIN -->|rejected| DONE["Request closed"]
```

---

## 13. Custom Tools (Managed Mode)

Define shell-based tools at runtime via the HTTP API -- no recompile or restart needed. Custom tools are stored in the `custom_tools` PostgreSQL table and loaded dynamically into the agent's tool registry.

### Lifecycle

```mermaid
flowchart TD
    subgraph Startup
        GLOBAL["LoadGlobal()<br/>Fetch all tools with agent_id IS NULL<br/>Register into global registry"]
    end

    subgraph "Per-Agent Resolution"
        RESOLVE["LoadForAgent(globalReg, agentID)"] --> CHECK{"Agent has<br/>custom tools?"}
        CHECK -->|No| USE_GLOBAL["Use global registry as-is"]
        CHECK -->|Yes| CLONE["Clone global registry<br/>Register per-agent tools<br/>Return cloned registry"]
    end

    subgraph "Cache Invalidation"
        EVENT["cache:custom_tools event"] --> RELOAD["ReloadGlobal()<br/>Unregister old, register new"]
        RELOAD --> INVALIDATE["AgentRouter.InvalidateAll()<br/>Force re-resolve on next request"]
    end
```

### Scope

| Scope | `agent_id` | Behavior |
|-------|-----------|----------|
| Global | `NULL` | Available to all agents |
| Per-agent | UUID | Available only to the specified agent |

### Command Execution

1. **Template rendering**: `{{.key}}` placeholders replaced with shell-escaped argument values (single-quote wrapping with embedded quote escaping)
2. **Deny pattern check**: Same deny patterns as the `exec` tool (blocks `curl|sh`, reverse shells, etc.)
3. **Execution**: `sh -c <rendered_command>` with configurable timeout (default 60s) and optional working directory
4. **Environment variables**: Stored encrypted (AES-256-GCM) in the database, decrypted at runtime and injected into the command environment

### JSON Config Example

```json
{
  "name": "dns_lookup",
  "description": "Look up DNS records for a domain",
  "parameters": {
    "type": "object",
    "properties": {
      "domain": { "type": "string", "description": "Domain name" },
      "record_type": { "type": "string", "enum": ["A", "AAAA", "MX", "CNAME", "TXT"] }
    },
    "required": ["domain"]
  },
  "command": "dig +short {{.record_type}} {{.domain}}",
  "timeout_seconds": 10,
  "enabled": true
}
```

---

## 14. Credential Scrubbing

Tool output is automatically scrubbed before being returned to the LLM. Enabled by default in the registry.

### Detected Patterns

| Type | Pattern |
|------|---------|
| OpenAI | `sk-[a-zA-Z0-9]{20,}` |
| Anthropic | `sk-ant-[a-zA-Z0-9-]{20,}` |
| GitHub PAT | `ghp_`, `gho_`, `ghu_`, `ghs_`, `ghr_` + 36 alphanumeric characters |
| AWS | `AKIA[A-Z0-9]{16}` |
| Generic | `(api_key\|token\|secret\|password\|bearer\|authorization)[:=]value` (case-insensitive) |

All matches are replaced with `[REDACTED]`.

---

## 15. Rate Limiter

The tool registry supports per-session rate limiting via `ToolRateLimiter`. When configured, each `ExecuteWithContext` call checks `rateLimiter.Allow(sessionKey)` before tool execution. Rate-limited calls receive an error result without executing the tool.

---

## File Reference

| File | Purpose |
|------|---------|
| `internal/tools/registry.go` | Registry: Register, Execute, ExecuteWithContext, ProviderDefs |
| `internal/tools/types.go` | Tool interface, ContextualTool, InterceptorAware, and other config interfaces |
| `internal/tools/policy.go` | PolicyEngine: 7-step pipeline, tool groups, profiles, subagent deny lists |
| `internal/tools/filesystem.go` | read_file, write_file, edit_file with interceptor support |
| `internal/tools/filesystem_list.go` | list_files tool |
| `internal/tools/filesystem_write.go` | Additional write operations |
| `internal/tools/shell.go` | ExecTool: deny patterns, approval workflow, sandbox routing |
| `internal/tools/scrub.go` | ScrubCredentials: credential pattern matching and redaction |
| `internal/tools/subagent.go` | SubagentManager: spawn, cancel, steer, run sync, deny lists |
| `internal/tools/delegate.go` | DelegateManager: sync, async, cancel, concurrency, per-user checks |
| `internal/tools/delegate_tool.go` | Delegate tool wrapper (action: delegate/cancel/list/history) |
| `internal/tools/delegate_search_tool.go` | Hybrid FTS + semantic agent discovery |
| `internal/tools/evaluate_loop_tool.go` | Generate-evaluate-revise loop (max 5 rounds) |
| `internal/tools/handoff_tool.go` | Conversation transfer (routing override + context carry) |
| `internal/tools/team_tool_manager.go` | Shared backend for team tools |
| `internal/tools/team_tasks_tool.go` | Task board: list, create, claim, complete, search |
| `internal/tools/team_message_tool.go` | Mailbox: send, broadcast, read |
| `internal/hooks/engine.go` | Hook engine: evaluator registry, EvaluateHooks |
| `internal/hooks/command_evaluator.go` | Shell command evaluator |
| `internal/hooks/agent_evaluator.go` | Agent delegation evaluator |
| `internal/hooks/context.go` | WithSkipHooks / SkipHooksFromContext |
| `internal/tools/context_file_interceptor.go` | ContextFileInterceptor: 7-file routing by agent type |
| `internal/tools/memory_interceptor.go` | MemoryInterceptor: MEMORY.md and memory/* routing |
| `internal/tools/skill_search.go` | Skill search tool (BM25) |
| `internal/tools/tts.go` | Text-to-speech tool (4 providers) |
| `internal/mcp/manager.go` | MCP Manager: server connections, health checks, tool registration |
| `internal/mcp/bridge_tool.go` | MCP bridge tool implementation |
| `internal/tools/dynamic_loader.go` | DynamicLoader: LoadGlobal, LoadForAgent, ReloadGlobal |
| `internal/tools/dynamic_tool.go` | DynamicTool: template rendering, shell escaping, execution |
| `internal/store/custom_tool_store.go` | CustomToolStore interface |
| `internal/store/pg/custom_tools.go` | PostgreSQL custom tools implementation |
| `internal/store/mcp_store.go` | MCPServerStore interface (grants, access requests) |
| `internal/store/pg/mcp_servers.go` | PostgreSQL MCP implementation |

---

# 04 - Gateway and Protocol

The gateway is the central component of GoClaw, serving both WebSocket RPC (Protocol v3) and HTTP REST API on a single port. It handles authentication, role-based access control, rate limiting, and method dispatch for all client interactions.

---

## 1. WebSocket Lifecycle

```mermaid
sequenceDiagram
    participant C as Client
    participant S as Server

    C->>S: HTTP GET /ws
    S-->>C: 101 Switching Protocols

    Note over S: Create Client, register,<br/>subscribe to event bus

    C->>S: req: connect {token, user_id}
    S-->>C: res: {protocol: 3, role, user_id}

    loop RPC Communication
        C->>S: req: chat.send {message, agentId, ...}
        S-->>C: event: agent {run.started}
        S-->>C: event: chat {chunk} (repeated)
        S-->>C: event: agent {tool.call}
        S-->>C: event: agent {tool.result}
        S-->>C: res: {content, usage}
    end

    Note over C,S: Ping/Pong every 30s

    C->>S: close
    Note over S: Unregister, cleanup,<br/>unsubscribe from event bus
```

### Connection Parameters

| Parameter | Value | Description |
|-----------|-------|-------------|
| Read limit | 512 KB | Auto-close connection on exceed |
| Send buffer | 256 capacity | Drop messages when full |
| Read deadline | 60s | Reset on each message or pong |
| Write deadline | 10s | Per-write timeout |
| Ping interval | 30s | Server-initiated keepalive |

---

## 2. Protocol v3 Frame Types

| Type | Direction | Purpose |
|------|-----------|---------|
| `req` | Client to Server | Invoke an RPC method |
| `res` | Server to Client | Response matching request by `id` |
| `event` | Server to Client | Push events (streaming chunks, agent status, etc.) |

The first request from a client must be `connect`. Any other method sent before authentication results in an `UNAUTHORIZED` error.

### Request Frame Structure

- `type`: always `"req"`
- `id`: unique request ID (client-generated)
- `method`: RPC method name
- `params`: method-specific parameters (JSON)

### Response Frame Structure

- `type`: always `"res"`
- `id`: matches the request ID
- `ok`: boolean success indicator
- `payload`: response data (when `ok` is true)
- `error`: error shape with `code`, `message`, `details`, `retryable`, `retryAfterMs` (when `ok` is false)

### Event Frame Structure

- `type`: always `"event"`
- `event`: event name (e.g., `chat`, `agent`, `status`, `handoff`)
- `payload`: event data
- `seq`: ordering sequence number
- `stateVersion`: version counters for optimistic state sync

---

## 3. Authentication and RBAC

### Connect Handshake

```mermaid
flowchart TD
    FIRST{"First frame = connect?"} -->|No| REJECT["UNAUTHORIZED<br/>'first request must be connect'"]
    FIRST -->|Yes| TOKEN{"Token match?"}
    TOKEN -->|"Config token matches"| ADMIN["Role: admin"]
    TOKEN -->|"No config token set"| OPER["Role: operator"]
    TOKEN -->|"Wrong or missing token"| VIEW["Role: viewer"]
```

Token comparison uses `crypto/subtle.ConstantTimeCompare` to prevent timing attacks.

In managed mode, `user_id` in the connect parameters is required for per-user session scoping and context file routing. GoClaw uses the **Identity Propagation** pattern — it trusts the upstream service to provide accurate user identity. The `user_id` is opaque (VARCHAR 255); multi-tenant deployments use the compound format `tenant.{tenantId}.user.{userId}`. See [00-architecture-overview.md Section 5](./00-architecture-overview.md) for details.

### Three Roles

```mermaid
flowchart LR
    V["viewer (level 1)<br/>Read only"] --> O["operator (level 2)<br/>Read + Write"]
    O --> A["admin (level 3)<br/>Full control"]
```

### Method Permissions

| Role | Accessible Methods |
|------|--------------------|
| viewer | `agents.list`, `config.get`, `sessions.list`, `sessions.preview`, `health`, `status`, `models.list`, `skills.list`, `skills.get`, `channels.list`, `channels.status`, `cron.list`, `cron.status`, `cron.runs`, `usage.get`, `usage.summary` |
| operator | All viewer methods plus: `chat.send`, `chat.abort`, `chat.history`, `chat.inject`, `sessions.delete`, `sessions.reset`, `sessions.patch`, `cron.create`, `cron.update`, `cron.delete`, `cron.toggle`, `cron.run`, `skills.update`, `send`, `exec.approval.list`, `exec.approval.approve`, `exec.approval.deny`, `device.pair.request`, `device.pair.list` |
| admin | All operator methods plus: `config.apply`, `config.patch`, `agents.create`, `agents.update`, `agents.delete`, `agents.files.*`, `agents.links.*`, `teams.*`, `channels.toggle`, `device.pair.approve`, `device.pair.revoke` |

---

## 4. Request Handling Pipeline

```mermaid
flowchart TD
    REQ["Client sends RequestFrame"] --> PARSE["Parse frame type"]
    PARSE --> AUTH{"Authenticated?"}
    AUTH -->|"No and method is not connect"| UNAUTH["UNAUTHORIZED"]
    AUTH -->|"Yes or method is connect"| FIND{"Handler found?"}
    FIND -->|No| INVALID["INVALID_REQUEST<br/>'unknown method'"]
    FIND -->|Yes| PERM{"Permission check<br/>(skip for connect, health)"}
    PERM -->|Insufficient role| DENIED["UNAUTHORIZED<br/>'permission denied'"]
    PERM -->|OK| EXEC["Execute handler(ctx, client, req)"]
    EXEC --> RES["Send ResponseFrame"]
```

---

## 5. RPC Methods

### System

| Method | Description |
|--------|-------------|
| `connect` | Authentication handshake (must be first request) |
| `health` | Health check |
| `status` | Gateway status (connected clients, agents, channels) |
| `models.list` | List available models from all providers |

### Chat

| Method | Description |
|--------|-------------|
| `chat.send` | Send a message to an agent, receive streaming response |
| `chat.history` | Get conversation history for a session |
| `chat.abort` | Abort a running agent loop |
| `chat.inject` | Inject a system message into a session |

### Agents

| Method | Description |
|--------|-------------|
| `agent` | Get details for a specific agent |
| `agent.wait` | Wait for an agent to become available |
| `agent.identity.get` | Get agent identity (name, description) |
| `agents.list` | List all accessible agents |
| `agents.create` | Create a new agent (managed mode) |
| `agents.update` | Update agent configuration |
| `agents.delete` | Soft-delete an agent |
| `agents.files.list` | List agent context files |
| `agents.files.get` | Read a context file |
| `agents.files.set` | Write a context file |

### Sessions

| Method | Description |
|--------|-------------|
| `sessions.list` | List all sessions |
| `sessions.preview` | Preview session content |
| `sessions.patch` | Update session metadata |
| `sessions.delete` | Delete a session |
| `sessions.reset` | Reset session history |

### Config

| Method | Description |
|--------|-------------|
| `config.get` | Get current configuration (secrets redacted) |
| `config.apply` | Replace entire configuration |
| `config.patch` | Partial configuration update |
| `config.schema` | Get configuration JSON schema |

### Skills

| Method | Description |
|--------|-------------|
| `skills.list` | List all skills |
| `skills.get` | Get skill details |
| `skills.update` | Update skill content |

### Cron

| Method | Description |
|--------|-------------|
| `cron.list` | List scheduled jobs |
| `cron.create` | Create a new cron job |
| `cron.update` | Update a cron job |
| `cron.delete` | Delete a cron job |
| `cron.toggle` | Enable/disable a cron job |
| `cron.status` | Get cron system status |
| `cron.run` | Manually trigger a cron job |
| `cron.runs` | List recent run logs |

### Channels

| Method | Description |
|--------|-------------|
| `channels.list` | List enabled channels |
| `channels.status` | Get channel running status |
| `channels.toggle` | Enable/disable a channel (admin only) |

### Pairing

| Method | Description |
|--------|-------------|
| `device.pair.request` | Request a pairing code |
| `device.pair.approve` | Approve a pairing request |
| `device.pair.list` | List paired devices |
| `device.pair.revoke` | Revoke a paired device |
| `browser.pairing.status` | Poll browser pairing approval status |

### Exec Approval

| Method | Description |
|--------|-------------|
| `exec.approval.list` | List pending exec approval requests |
| `exec.approval.approve` | Approve an exec request |
| `exec.approval.deny` | Deny an exec request |

### Usage and Send

| Method | Description |
|--------|-------------|
| `usage.get` | Get token usage for a session |
| `usage.summary` | Get aggregated usage summary |
| `send` | Send a direct message to a channel |

### Agent Links

| Method | Description |
|--------|-------------|
| `agents.links.list` | List agent links (by source agent) |
| `agents.links.create` | Create an agent link (outbound or bidirectional) |
| `agents.links.update` | Update a link (max_concurrent, settings, status) |
| `agents.links.delete` | Delete an agent link |

### Teams

| Method | Description |
|--------|-------------|
| `teams.list` | List agent teams |
| `teams.create` | Create a team (lead + members) |
| `teams.get` | Get team details with members |
| `teams.delete` | Delete a team |
| `teams.tasks.list` | List team tasks |

### Delegations

| Method | Description |
|--------|-------------|
| `delegations.list` | List delegation history (result truncated to 500 runes) |
| `delegations.get` | Get delegation detail (result truncated to 8000 runes) |

---

## 6. HTTP API

### Authentication

- `Authorization: Bearer <token>` -- timing-safe comparison via `crypto/subtle.ConstantTimeCompare`
- No token configured: all requests allowed
- `X-GoClaw-User-Id`: required in managed mode for per-user scoping
- `X-GoClaw-Agent-Id`: specify target agent for the request

### Endpoints

#### POST /v1/chat/completions (OpenAI-compatible)

```mermaid
flowchart TD
    REQ["HTTP Request"] --> AUTH["Bearer token check"]
    AUTH --> RL["Rate limit check"]
    RL --> BODY["MaxBytesReader (1 MB)"]
    BODY --> AGENT["Resolve agent<br/>(model prefix / header / default)"]
    AGENT --> RUN["agent.Run()"]
    RUN --> RESP{"Streaming?"}
    RESP -->|Yes| SSE["SSE: text/event-stream<br/>data: chunks...<br/>data: [DONE]"]
    RESP -->|No| JSON["JSON response<br/>(OpenAI format)"]
```

Agent resolution priority: `model` field with `goclaw:` or `agent:` prefix, then `X-GoClaw-Agent-Id` header, then `"default"`.

#### POST /v1/responses (OpenResponses Protocol)

Same agent resolution and execution flow, different response format (`response.started`, `response.delta`, `response.done`).

#### POST /v1/tools/invoke

Direct tool invocation without the agent loop. Supports `dryRun: true` to return tool schema only.

#### GET /health

Returns `{"status":"ok","protocol":3}`.

#### Managed Mode CRUD Endpoints

All managed endpoints require `Authorization: Bearer <token>` and `X-GoClaw-User-Id` header for per-user scoping.

**Agents** (`/v1/agents`):

| Method | Path | Description |
|--------|------|-------------|
| GET | `/v1/agents` | List accessible agents (filtered by user shares) |
| POST | `/v1/agents` | Create a new agent |
| GET | `/v1/agents/{id}` | Get agent details |
| PUT | `/v1/agents/{id}` | Update agent configuration |
| DELETE | `/v1/agents/{id}` | Soft-delete an agent |

**Custom Tools** (`/v1/tools/custom`):

| Method | Path | Description |
|--------|------|-------------|
| GET | `/v1/tools/custom` | List tools (optional `?agent_id=` filter) |
| POST | `/v1/tools/custom` | Create a custom tool |
| GET | `/v1/tools/custom/{id}` | Get tool details |
| PUT | `/v1/tools/custom/{id}` | Update a tool |
| DELETE | `/v1/tools/custom/{id}` | Delete a tool |

**MCP Servers** (`/v1/mcp`):

| Method | Path | Description |
|--------|------|-------------|
| GET | `/v1/mcp/servers` | List registered MCP servers |
| POST | `/v1/mcp/servers` | Register a new MCP server |
| GET | `/v1/mcp/servers/{id}` | Get server details |
| PUT | `/v1/mcp/servers/{id}` | Update server config |
| DELETE | `/v1/mcp/servers/{id}` | Remove MCP server |
| POST | `/v1/mcp/servers/{id}/grants/agent` | Grant access to an agent |
| DELETE | `/v1/mcp/servers/{id}/grants/agent/{agentID}` | Revoke agent access |
| GET | `/v1/mcp/grants/agent/{agentID}` | List agent's MCP grants |
| POST | `/v1/mcp/servers/{id}/grants/user` | Grant access to a user |
| DELETE | `/v1/mcp/servers/{id}/grants/user/{userID}` | Revoke user access |
| POST | `/v1/mcp/requests` | Request access (user self-service) |
| GET | `/v1/mcp/requests` | List pending access requests |
| POST | `/v1/mcp/requests/{id}/review` | Approve or reject a request |

**Agent Sharing** (`/v1/agents/{id}/sharing`):

| Method | Path | Description |
|--------|------|-------------|
| GET | `/v1/agents/{id}/sharing` | List shares for an agent |
| POST | `/v1/agents/{id}/sharing` | Share agent with a user |
| DELETE | `/v1/agents/{id}/sharing/{userID}` | Revoke user access |

**Agent Links** (`/v1/agents/{id}/links`):

| Method | Path | Description |
|--------|------|-------------|
| GET | `/v1/agents/{id}/links` | List links for an agent |
| POST | `/v1/agents/{id}/links` | Create a new link |
| PUT | `/v1/agents/{id}/links/{linkID}` | Update a link |
| DELETE | `/v1/agents/{id}/links/{linkID}` | Delete a link |

**Delegations** (`/v1/delegations`):

| Method | Path | Description |
|--------|------|-------------|
| GET | `/v1/delegations` | List delegation history (full records, paginated) |
| GET | `/v1/delegations/{id}` | Get delegation detail |

**Skills** (`/v1/skills`):

| Method | Path | Description |
|--------|------|-------------|
| GET | `/v1/skills` | List skills |
| POST | `/v1/skills/upload` | Upload skill ZIP (max 20 MB) |
| DELETE | `/v1/skills/{id}` | Delete a skill |

**Traces** (`/v1/traces`):

| Method | Path | Description |
|--------|------|-------------|
| GET | `/v1/traces` | List traces (filter by agent_id, user_id, status, date range) |
| GET | `/v1/traces/{id}` | Get trace details with all spans |

---

## 7. Rate Limiting

Token bucket rate limiting per user or IP address. Configured via `gateway.rate_limit_rpm` (0 = disabled, > 0 = enabled).

```mermaid
flowchart TD
    REQ["Request"] --> CHECK{"rate_limit_rpm > 0?"}
    CHECK -->|No| PASS["Allow all requests"]
    CHECK -->|Yes| BUCKET{"Token available<br/>for this key?"}
    BUCKET -->|Yes| ALLOW["Allow + consume token"]
    BUCKET -->|No| REJECT["WS: INVALID_REQUEST<br/>HTTP: 429 + Retry-After: 60"]
```

| Aspect | WebSocket | HTTP |
|--------|-----------|------|
| Rate key | `client.UserID()` fallback `client.ID()` | `RemoteAddr` fallback `"token:" + bearer` |
| On limit | `INVALID_REQUEST "rate limit exceeded"` | HTTP 429 |
| Burst | 5 requests | 5 requests |
| Cleanup | Every 5 min, entries inactive > 10 min | Same |

---

## 8. Error Codes

| Code | Description |
|------|-------------|
| `UNAUTHORIZED` | Authentication failed or insufficient role |
| `INVALID_REQUEST` | Missing or invalid fields in the request |
| `NOT_FOUND` | Requested resource does not exist |
| `ALREADY_EXISTS` | Resource already exists (conflict) |
| `UNAVAILABLE` | Service temporarily unavailable |
| `RESOURCE_EXHAUSTED` | Rate limit exceeded |
| `FAILED_PRECONDITION` | Operation prerequisites not met |
| `AGENT_TIMEOUT` | Agent run exceeded time limit |
| `INTERNAL` | Unexpected server error |

Error responses include `retryable` (boolean) and `retryAfterMs` (integer) fields to guide client retry behavior.

---

## File Reference

| File | Purpose |
|------|---------|
| `internal/gateway/server.go` | Server: WebSocket upgrade, HTTP mux, CORS check, client lifecycle |
| `internal/gateway/client.go` | Client: connection management, read/write pumps, send buffer |
| `internal/gateway/router.go` | MethodRouter: handler registration, permission-checked dispatch |
| `internal/gateway/ratelimit.go` | RateLimiter: token bucket per key, cleanup loop |
| `internal/gateway/methods/chat.go` | chat.send, chat.history, chat.abort, chat.inject handlers |
| `internal/gateway/methods/agents.go` | agents.list, agents.create/update/delete, agents.files.* handlers |
| `internal/gateway/methods/sessions.go` | sessions.list/preview/patch/delete/reset handlers |
| `internal/gateway/methods/config.go` | config.get/apply/patch/schema handlers |
| `internal/gateway/methods/skills.go` | skills.list/get/update handlers |
| `internal/gateway/methods/cron.go` | cron.list/create/update/delete/toggle/run/runs handlers |
| `internal/gateway/methods/agent_links.go` | agents.links.* handlers + agent router cache invalidation |
| `internal/gateway/methods/teams.go` | teams.* handlers + auto-linking teammates |
| `internal/gateway/methods/delegations.go` | delegations.list/get handlers |
| `internal/gateway/methods/channels.go` | channels.list/status handlers |
| `internal/gateway/methods/pairing.go` | device.pair.* handlers |
| `internal/gateway/methods/exec_approval.go` | exec.approval.* handlers |
| `internal/gateway/methods/usage.go` | usage.get/summary handlers |
| `internal/gateway/methods/send.go` | send handler (direct message to channel) |
| `internal/http/chat_completions.go` | POST /v1/chat/completions (OpenAI-compatible) |
| `internal/http/responses.go` | POST /v1/responses (OpenResponses protocol) |
| `internal/http/tools_invoke.go` | POST /v1/tools/invoke (direct tool execution) |
| `internal/http/agents.go` | Agent CRUD HTTP handlers (managed mode) |
| `internal/http/skills.go` | Skills HTTP handlers (managed mode) |
| `internal/http/traces.go` | Traces HTTP handlers (managed mode) |
| `internal/http/delegations.go` | Delegation history HTTP handlers |
| `internal/http/summoner.go` | LLM-powered agent setup (XML parsing, context file generation) |
| `internal/http/auth.go` | Bearer token authentication, timing-safe comparison |
| `internal/permissions/policy.go` | PolicyEngine: role hierarchy, method-to-role mapping |
| `pkg/protocol/frames.go` | Frame types: RequestFrame, ResponseFrame, EventFrame, ErrorShape |

---

# 05 - Channels and Messaging

Channels connect external messaging platforms to the GoClaw agent runtime via a shared message bus. Each channel implementation translates platform-specific events into a unified `InboundMessage`, and converts agent responses into platform-appropriate outbound messages.

---

## 1. Message Flow

```mermaid
flowchart LR
    subgraph Platforms
        TG["Telegram"]
        DC["Discord"]
        FS["Feishu/Lark"]
        ZL["Zalo"]
        WA["WhatsApp"]
    end

    subgraph "Channel Layer"
        CH["Channel.Start()<br/>Listen for events"]
        HM["HandleMessage()<br/>Build InboundMessage"]
    end

    subgraph Core
        BUS["MessageBus"]
        AGENT["Agent Loop"]
    end

    subgraph Outbound
        DISPATCH["Manager.dispatchOutbound()"]
        SEND["Channel.Send()<br/>Format + deliver"]
    end

    TG --> CH
    DC --> CH
    FS --> CH
    ZL --> CH
    WA --> CH
    CH --> HM
    HM --> BUS
    BUS --> AGENT
    AGENT -->|OutboundMessage| BUS
    BUS --> DISPATCH
    DISPATCH --> SEND
    SEND --> TG
    SEND --> DC
    SEND --> FS
    SEND --> ZL
    SEND --> WA
```

Internal channels (`cli`, `system`, `subagent`) are silently skipped by the outbound dispatcher and never forwarded to external platforms.

### Handoff Routing (Managed Mode)

Before normal agent routing, the consumer checks the `handoff_routes` table for an active routing override. If a handoff route exists for the incoming channel + chat ID, the message is redirected to the target agent instead of the original agent.

```mermaid
flowchart TD
    MSG["Inbound message"] --> CHECK{"handoff_routes<br/>has override?"}
    CHECK -->|Yes| TARGET["Route to target agent<br/>(billing, support, etc.)"]
    CHECK -->|No| NORMAL["Route to default agent"]
    TARGET --> SESSION["New session for target agent"]
    NORMAL --> SESSION2["Existing session"]
```

Handoff routes are created by the `handoff` tool (see [03-tools-system.md](./03-tools-system.md)) and can be cleared by the target agent calling `handoff(action="clear")` or by handing back to the original agent.

### Message Routing Prefixes

The consumer routes system messages based on sender ID prefixes:

| Prefix | Route | Outbound Delivery |
|--------|-------|:-:|
| `subagent:` | Parent session queue | Yes |
| `delegate:` | Delegate scheduler lane | Yes |
| `teammate:` | Lead agent session queue | Yes |
| `handoff:` | Target agent via delegate lane | Yes |

### Managed Mode Behavior

In managed mode, channels provide per-user isolation through compound sender IDs and context propagation:

- **User scoping**: Each channel constructs a compound sender ID (e.g., `telegram:123456`) which maps to a `user_id` for session key generation. The session key format `agent:{agentId}:{channel}:direct:{peerId}` ensures each user has an isolated conversation history per agent.
- **Context propagation**: `HandleMessage()` sets `store.WithAgentID(ctx)`, `store.WithUserID(ctx)`, and `store.WithAgentType(ctx)` on the context. These values flow through to the ContextFileInterceptor, MemoryInterceptor, and per-user file seeding.
- **Pairing storage**: In managed mode, pairing state (pending requests and approved pairings) is stored in the `pairing_requests` and `paired_devices` PostgreSQL tables via `PGPairingStore`. In standalone mode, pairing state is stored in JSON files.
- **Session persistence**: Chat sessions are stored in the `sessions` PostgreSQL table via `PGSessionStore` with write-behind caching.

---

## 2. Channel Interface

Every channel must implement the following methods:

| Method | Description |
|--------|-------------|
| `Name()` | Channel identifier (e.g., `"telegram"`, `"discord"`) |
| `Start(ctx)` | Begin listening for messages (non-blocking after setup) |
| `Stop(ctx)` | Graceful shutdown |
| `Send(ctx, msg)` | Deliver an outbound message to the platform |
| `IsRunning()` | Whether the channel is actively processing |
| `IsAllowed(senderID)` | Check if a sender passes the allowlist |

`BaseChannel` provides a shared implementation that all channels embed. It handles:

- Allowlist matching with compound `"123456|username"` format and `@` prefix stripping
- `HandleMessage()` which builds an `InboundMessage` and publishes it to the bus
- `CheckPolicy()` which evaluates DM/Group policies per message
- User ID extraction from compound sender IDs (strip `|username` suffix)

---

## 3. Channel Policy

### DM Policies

| Policy | Behavior |
|--------|----------|
| `pairing` | Require pairing code for new senders |
| `allowlist` | Only whitelisted senders accepted |
| `open` | Accept all DMs |
| `disabled` | Reject all DMs |

### Group Policies

| Policy | Behavior |
|--------|----------|
| `open` | Accept all group messages |
| `allowlist` | Only whitelisted groups accepted |
| `disabled` | No group messages processed |

### Policy Evaluation

```mermaid
flowchart TD
    MSG["Incoming message"] --> KIND{"PeerKind?"}
    KIND -->|direct| DMP{"DM Policy?"}
    KIND -->|group| GPP{"Group Policy?"}

    DMP -->|disabled| REJECT["Reject"]
    DMP -->|open| ACCEPT["Accept"]
    DMP -->|allowlist| AL1{"In allowlist?"}
    AL1 -->|Yes| ACCEPT
    AL1 -->|No| REJECT
    DMP -->|pairing| PAIR{"Already paired<br/>or in allowlist?"}
    PAIR -->|Yes| ACCEPT
    PAIR -->|No| PAIR_REPLY["Send pairing instructions<br/>(debounce 60s)"]

    GPP -->|disabled| REJECT
    GPP -->|open| ACCEPT
    GPP -->|allowlist| AL2{"In allowlist?"}
    AL2 -->|Yes| ACCEPT
    AL2 -->|No| REJECT
```

Policies are configured per-channel. Default is `"open"` for channels that do not specify a policy.

---

## 4. Channel Comparison

| Feature | Telegram | Discord | Feishu/Lark | Zalo | WhatsApp |
|---------|----------|---------|-------------|------|----------|
| Connection | Long polling | Gateway events | WebSocket (default) or Webhook | Long polling | External WS bridge |
| DM support | Yes | Yes | Yes | Yes (DM only) | Yes |
| Group support | Yes (mention gating) | Yes | Yes | No | Yes |
| Message limit | 4096 chars | 2000 chars | 4000 chars | 2000 chars | N/A (bridge) |
| Streaming | Typing indicator | Edit "Thinking..." message | Streaming message cards | No | No |
| Media | Photos, voice, files | Files, embeds | Images, files (30 MB) | Images (5 MB) | JSON messages |
| Rich formatting | Markdown to HTML | Markdown | Card messages | Plain text | Plain text |
| Pairing support | Yes | No | Yes | Yes | No |

---

## 5. Telegram

The Telegram channel uses long polling via the `telego` library (Telegram Bot API).

### Key Behaviors

- **Group mention gating**: By default, bot must be @mentioned in groups (`requireMention: true`). Pending group messages without a mention are stored in a history buffer (default 50 messages) and included as context when the bot is eventually mentioned.
- **Typing indicator**: A "typing" action is sent while the agent is processing.
- **Proxy support**: Optional HTTP proxy configured via the channel config.
- **Cancel commands**: `/stop` (cancel oldest running task) and `/stopall` (cancel all + drain queue). Both are intercepted before the 800ms debouncer to avoid being merged with subsequent messages. See [08-scheduling-cron-heartbeat.md](./08-scheduling-cron-heartbeat.md) for details.
- **Concurrent group support**: Group sessions support up to 3 concurrent agent runs, allowing multiple users to get responses in parallel.

### Formatting Pipeline

LLM output is transformed through a multi-step pipeline to produce valid Telegram HTML. Telegram supports only `<b>`, `<i>`, `<s>`, `<a>`, `<code>`, `<pre>`, `<blockquote>` -- no `<table>` support.

```mermaid
flowchart TD
    IN["LLM Output (Markdown)"] --> S1["Extract tables as placeholders"]
    S1 --> S2["Extract code blocks as placeholders"]
    S2 --> S3["Extract inline code as placeholders"]
    S3 --> S4["Convert Markdown to HTML<br/>(headers, bold, italic, links, lists)"]
    S4 --> S5["Restore placeholders:<br/>inline code as code tags<br/>code blocks as pre tags<br/>tables as pre (ASCII-aligned)"]
    S5 --> S6["Chunk at 4000 chars<br/>(split at paragraph > line > space)"]
    S6 --> S7["Send as HTML<br/>(fallback: plain text on error)"]
```

- **Table rendering**: Markdown tables are rendered as ASCII-aligned text inside `<pre>` tags (not `<pre><code>` to avoid "Copy" button). Cell content has inline markdown stripped (`**bold**`, `_italic_` markers removed).
- **CJK handling**: `displayWidth()` correctly counts CJK and emoji characters as 2-column width for proper table alignment.

---

## 6. Feishu/Lark

The Feishu/Lark channel connects via native HTTP with two transport modes.

### Transport Modes

```mermaid
flowchart TD
    MODE{"Connection mode?"} -->|"ws (default)"| WS["WebSocket Client<br/>Persistent connection<br/>Auto-reconnect"]
    MODE -->|"webhook"| WH["HTTP Webhook Server<br/>Listens on configured port<br/>Challenge verification"]
```

### Key Behaviors

- **Default domain**: Lark Global (`open.larksuite.com`). Configurable for Feishu China.
- **Streaming message cards**: Responses are delivered as interactive card messages with streaming updates, providing real-time output display. Updates are throttled at 100ms intervals with incrementing sequence numbers.
- **Media handling**: Supports image and file uploads/downloads with a default 30 MB limit.
- **Mention support**: Processes `@bot` mentions in group chats with mention text stripping.
- **Sender caching**: User names are cached with a 10-minute TTL to reduce API calls.
- **Deduplication**: Message IDs tracked via `sync.Map` to prevent processing duplicate events.
- **Pairing debounce**: 60-second debounce on pairing-related replies.

---

## 7. Discord

The Discord channel uses the `discordgo` library to connect via the Discord Gateway.

### Key Behaviors

- **Gateway intents**: Requests `GuildMessages`, `DirectMessages`, and `MessageContent` intents.
- **Message limit**: 2000-character limit per message, with automatic splitting for longer content.
- **Placeholder editing**: Sends an initial "Thinking..." message that gets edited with the actual response when complete.
- **Bot identity**: Fetches `@me` on startup to detect and ignore own messages.

---

## 8. WhatsApp

The WhatsApp channel communicates through an external WebSocket bridge (e.g., whatsapp-web.js based). GoClaw does not implement the WhatsApp protocol directly.

### Key Behaviors

- **Bridge connection**: Connects to a configurable `bridge_url` via WebSocket.
- **JSON format**: Messages are sent and received as JSON objects over the WebSocket connection.
- **Auto-reconnect**: If the initial connection fails, a background listen loop retries automatically.
- **DM and group support**: Both are supported through the bridge protocol.

---

## 9. Zalo

The Zalo channel connects to the Zalo OA Bot API.

### Key Behaviors

- **DM only**: No group support. Only direct messages are processed.
- **Text limit**: 2000-character maximum per message.
- **Long polling**: Uses long polling with a default 30-second timeout and 5-second backoff on errors.
- **Media**: Image support with a 5 MB default limit.
- **Default DM policy**: `"pairing"` (requires pairing code for new users).
- **Pairing debounce**: 60-second debounce to avoid flooding users with pairing instructions.

---

## 10. Pairing System

The pairing system provides a DM authentication flow for channels using the `pairing` DM policy.

### Flow

```mermaid
sequenceDiagram
    participant U as New User
    participant CH as Channel
    participant PS as Pairing Service
    participant O as Owner

    U->>CH: First DM message
    CH->>CH: Check DM policy = "pairing"
    CH->>PS: Generate 8-char pairing code
    PS-->>CH: Code (valid 60 min)
    CH-->>U: "Reply with your pairing code from the admin"

    Note over PS: Max 3 pending codes per account

    O->>PS: Approve code via device.pair.approve
    PS->>PS: Add sender to paired devices

    U->>CH: Next DM message
    CH->>PS: Check paired status
    PS-->>CH: Paired (approved)
    CH->>CH: Process message normally
```

### Code Specification

| Aspect | Value |
|--------|-------|
| Length | 8 characters |
| Alphabet | `ABCDEFGHJKLMNPQRSTUVWXYZ23456789` (excludes ambiguous: 0, O, 1, I, L) |
| TTL | 60 minutes |
| Max pending per account | 3 |
| Reply debounce | 60 seconds per sender |

---

## File Reference

| File | Purpose |
|------|---------|
| `internal/channels/channel.go` | Channel interface, BaseChannel, DMPolicy/GroupPolicy types, HandleMessage |
| `internal/channels/manager.go` | Manager: channel registration, StartAll, StopAll, outbound dispatch |
| `internal/channels/telegram/telegram.go` | Telegram channel: long polling, mention gating, typing indicators |
| `internal/channels/telegram/commands.go` | /stop, /stopall command handlers, menu registration |
| `internal/channels/telegram/format.go` | Markdown-to-Telegram-HTML pipeline, table rendering, CJK width |
| `internal/channels/telegram/format_test.go` | Tests for Telegram formatting pipeline |
| `internal/channels/feishu/feishu.go` | Feishu/Lark channel: WS/Webhook modes, card messages |
| `internal/channels/feishu/streaming.go` | Streaming message card updates |
| `internal/channels/feishu/media.go` | Media upload/download handling |
| `internal/channels/feishu/larkclient.go` | Native HTTP client for Lark API |
| `internal/channels/feishu/larkws.go` | WebSocket transport for Lark |
| `internal/channels/feishu/larkevents.go` | Event parsing and routing |
| `internal/channels/discord/discord.go` | Discord channel: gateway events, message editing |
| `internal/channels/whatsapp/whatsapp.go` | WhatsApp channel: external WS bridge |
| `internal/channels/zalo/zalo.go` | Zalo channel: OA Bot API, long polling, DM only |
| `internal/pairing/service.go` | Pairing service: code generation, approval, persistence |
| `cmd/gateway_consumer.go` | Message consumer: routing prefixes, handoff check, cancel interception |

---

# 06 - Store Layer and Data Model

The store layer abstracts all persistence behind Go interfaces, allowing the same core engine to run with file-based storage (standalone mode) or PostgreSQL (managed mode). Each store interface has independent implementations, and the system determines which backend to use based on configuration at startup.

---

## 1. Store Layer Routing

```mermaid
flowchart TD
    START["Gateway Startup"] --> CHECK{"StoreConfig.IsManaged()?<br/>(DSN + mode = managed)"}
    CHECK -->|Yes| PG["PostgreSQL Backend"]
    CHECK -->|No| FILE["File Backend"]

    PG --> PG_STORES["PGSessionStore<br/>PGAgentStore<br/>PGProviderStore<br/>PGCronStore<br/>PGPairingStore<br/>PGSkillStore<br/>PGMemoryStore<br/>PGTracingStore<br/>PGMCPServerStore<br/>PGCustomToolStore<br/>PGChannelInstanceStore<br/>PGConfigSecretsStore<br/>PGAgentLinkStore<br/>PGTeamStore"]

    FILE --> FILE_STORES["FileSessionStore<br/>FileMemoryStore (SQLite + FTS5)<br/>FileCronStore<br/>FilePairingStore<br/>FileSkillStore<br/>FileAgentStore (filesystem + SQLite)<br/>ProviderStore = nil<br/>TracingStore = nil<br/>MCPServerStore = nil<br/>CustomToolStore = nil<br/>AgentLinks = nil<br/>Teams = nil"]
```

---

## 2. Store Interface Map

The `Stores` struct is the top-level container holding all storage backends. In standalone mode, managed-only stores are `nil`.

| Interface | Standalone Implementation | Managed Implementation | Mode |
|-----------|--------------------------|------------------------|------|
| SessionStore | `FileSessionStore` via `sessions.Manager` | `PGSessionStore` | Both |
| MemoryStore | `FileMemoryStore` (SQLite + FTS5 + embeddings) | `PGMemoryStore` (tsvector + pgvector) | Both |
| CronStore | `FileCronStore` | `PGCronStore` | Both |
| PairingStore | `FilePairingStore` via `pairing.Service` | `PGPairingStore` | Both |
| SkillStore | `FileSkillStore` via `skills.Loader` | `PGSkillStore` | Both |
| AgentStore | `FileAgentStore` (filesystem + SQLite) | `PGAgentStore` | Both |
| ProviderStore | `nil` | `PGProviderStore` | Managed only |
| TracingStore | `nil` | `PGTracingStore` | Managed only |
| MCPServerStore | `nil` | `PGMCPServerStore` | Managed only |
| CustomToolStore | `nil` | `PGCustomToolStore` | Managed only |
| ChannelInstanceStore | `nil` | `PGChannelInstanceStore` | Managed only |
| ConfigSecretsStore | `nil` | `PGConfigSecretsStore` | Managed only |
| AgentLinkStore | `nil` | `PGAgentLinkStore` | Managed only |
| TeamStore | `nil` | `PGTeamStore` | Managed only |

### Standalone AgentStore (FileAgentStore)

In standalone mode, `FileAgentStore` provides per-user context files and profiles without PostgreSQL. It combines filesystem storage (agent-level files like SOUL.md) with SQLite (`~/.goclaw/data/agents.db`) for per-user data:

| Data | Storage |
|------|---------|
| Agent metadata | In-memory from `config.json` |
| Agent-level files (SOUL.md, IDENTITY.md, ...) | Filesystem at workspace root |
| Per-user files (USER.md, BOOTSTRAP.md) | SQLite `user_context_files` |
| User profiles | SQLite `user_profiles` |
| Group file writers | SQLite `group_file_writers` |

Agent UUIDs use UUID v5 (deterministic): `uuid.NewSHA1(namespace, "goclaw-standalone:{agentKey}")` -- stable across restarts without database sequences.

---

## 3. Session Caching

The session store uses an in-memory write-behind cache to minimize database I/O during the agent tool loop. All reads and writes happen in memory; data is flushed to the persistent backend only when `Save()` is called at the end of a run.

```mermaid
flowchart TD
    subgraph "In-Memory Cache (map + mutex)"
        ADD["AddMessage()"] --> CACHE["Session Cache"]
        SET["SetSummary()"] --> CACHE
        ACC["AccumulateTokens()"] --> CACHE
        CACHE --> GET["GetHistory()"]
        CACHE --> GETSM["GetSummary()"]
    end

    CACHE -->|"Save(key)"| DB[("PostgreSQL / JSON file")]
    DB -->|"Cache miss via GetOrCreate"| CACHE
```

### Lifecycle

1. **GetOrCreate(key)**: Check cache; on miss, load from DB into cache; return session data.
2. **AddMessage/SetSummary/AccumulateTokens**: Update in-memory cache only (no DB write).
3. **Save(key)**: Snapshot data under read lock, flush to DB via UPDATE.
4. **Delete(key)**: Remove from both cache and DB. `List()` always reads directly from DB.

### Session Key Format

| Type | Format | Example |
|------|--------|---------|
| DM | `agent:{agentId}:{channel}:direct:{peerId}` | `agent:default:telegram:direct:386246614` |
| Group | `agent:{agentId}:{channel}:group:{groupId}` | `agent:default:telegram:group:-100123456` |
| Subagent | `agent:{agentId}:subagent:{label}` | `agent:default:subagent:my-task` |
| Cron | `agent:{agentId}:cron:{jobId}:run:{runId}` | `agent:default:cron:reminder:run:abc123` |
| Main | `agent:{agentId}:{mainKey}` | `agent:default:main` |

### File-Based Persistence (Standalone)

- Startup: `loadAll()` reads all `.json` files into memory
- Save: temp file + rename (atomic write, prevents corruption on crash)
- Filename: session key with `:` replaced by `_`, plus `.json` extension

---

## 4. Agent Access Control

In managed mode, agent access is checked via a 4-step pipeline.

```mermaid
flowchart TD
    REQ["CanAccess(agentID, userID)"] --> S1{"Agent exists?"}
    S1 -->|No| DENY["Deny"]
    S1 -->|Yes| S2{"is_default = true?"}
    S2 -->|Yes| ALLOW["Allow<br/>(role = owner if owner,<br/>user otherwise)"]
    S2 -->|No| S3{"owner_id = userID?"}
    S3 -->|Yes| ALLOW_OWNER["Allow (role = owner)"]
    S3 -->|No| S4{"Record in agent_shares?"}
    S4 -->|Yes| ALLOW_SHARE["Allow (role from share)"]
    S4 -->|No| DENY
```

The `agent_shares` table stores `UNIQUE(agent_id, user_id)` with roles: `user`, `admin`, `operator`.

`ListAccessible(userID)` queries: `owner_id = ? OR is_default = true OR id IN (SELECT agent_id FROM agent_shares WHERE user_id = ?)`.

---

## 5. API Key Encryption

API keys in the `llm_providers` and `mcp_servers` tables are encrypted with AES-256-GCM before storage.

```mermaid
flowchart LR
    subgraph "Storing a key"
        PLAIN["Plaintext API key"] --> ENC["AES-256-GCM encrypt"]
        ENC --> DB["DB: 'aes-gcm:' + base64(nonce + ciphertext + tag)"]
    end

    subgraph "Loading a key"
        DB2["DB value"] --> CHECK{"Has 'aes-gcm:' prefix?"}
        CHECK -->|Yes| DEC["AES-256-GCM decrypt"]
        CHECK -->|No| RAW["Return as-is<br/>(backward compatibility)"]
        DEC --> USE["Plaintext key"]
        RAW --> USE
    end
```

`GOCLAW_ENCRYPTION_KEY` accepts three formats:
- **Hex**: 64 characters (decoded to 32 bytes)
- **Base64**: 44 characters (decoded to 32 bytes)
- **Raw**: 32 characters (32 bytes direct)

---

## 6. Hybrid Memory Search

Memory search combines full-text search (FTS) and vector similarity in a weighted merge.

```mermaid
flowchart TD
    QUERY["Search(query, agentID, userID)"] --> PAR

    subgraph PAR["Parallel Search"]
        FTS["FTS Search<br/>tsvector + plainto_tsquery<br/>Weight: 0.3"]
        VEC["Vector Search<br/>pgvector cosine distance<br/>Weight: 0.7"]
    end

    FTS --> MERGE["hybridMerge()"]
    VEC --> MERGE
    MERGE --> BOOST["Per-user scope: 1.2x boost<br/>Dedup: user copy wins over global"]
    BOOST --> FILTER["Min score filter<br/>+ max results limit"]
    FILTER --> RESULT["Sorted results"]
```

### Merge Rules

1. Normalize FTS scores to [0, 1] (divide by highest score)
2. Vector scores already in [0, 1] (cosine similarity)
3. Combined score: `vec_score * 0.7 + fts_score * 0.3` for chunks found by both
4. When only one channel returns results, its weight auto-adjusts to 1.0
5. Per-user results receive a 1.2x boost
6. Deduplication: if a chunk exists in both global and per-user scope, the per-user version wins

### Fallback

When FTS returns no results (e.g., cross-language queries), a `likeSearch()` fallback runs ILIKE queries using up to 5 keywords (minimum 3 characters each), scoped to the agent's index.

### Standalone vs Managed

| Aspect | Standalone | Managed |
|--------|-----------|---------|
| FTS engine | SQLite FTS5 | PostgreSQL tsvector |
| Vector | Embedding cache | pgvector extension |
| Search function | `plainto_tsquery('simple', ...)` | Same |
| Distance operator | N/A | `<=>` (cosine) |

---

## 7. Context Files Routing

Context files are stored in two tables and routed based on agent type.

### Tables

| Table | Scope | Unique Key |
|-------|-------|------------|
| `agent_context_files` | Agent-level | `(agent_id, file_name)` |
| `user_context_files` | Per-user | `(agent_id, user_id, file_name)` |

### Routing by Agent Type

| Agent Type | Agent-Level Files | Per-User Files |
|------------|-------------------|----------------|
| `open` | Template fallback only | All 7 files (SOUL, IDENTITY, AGENTS, TOOLS, HEARTBEAT, BOOTSTRAP, USER) |
| `predefined` | 6 files (SOUL, IDENTITY, AGENTS, TOOLS, HEARTBEAT, BOOTSTRAP) | Only USER.md |

The `ContextFileInterceptor` checks agent type from context and routes read/write operations accordingly. For open agents, per-user files take priority with agent-level as fallback.

---

## 8. MCP Server Store

The MCP server store manages external tool server configurations and access grants.

### Tables

| Table | Purpose |
|-------|---------|
| `mcp_servers` | Server configurations (name, transport, command/URL, encrypted API key) |
| `mcp_agent_grants` | Per-agent access grants with tool allow/deny lists |
| `mcp_user_grants` | Per-user access grants with tool allow/deny lists |
| `mcp_access_requests` | Pending/approved/rejected access requests |

### Transport Types

| Transport | Fields Used |
|-----------|-------------|
| `stdio` | `command`, `args` (JSONB), `env` (JSONB) |
| `sse` | `url`, `headers` (JSONB) |
| `streamable-http` | `url`, `headers` (JSONB) |

`ListAccessible(agentID, userID)` returns all MCP servers the given agent+user combination can access, with effective tool allow/deny lists merged from both agent and user grants.

---

## 9. Custom Tool Store

Dynamic tool definitions stored in PostgreSQL. Each tool defines a shell command template that the LLM can invoke at runtime.

### Table: `custom_tools`

| Column | Type | Description |
|--------|------|-------------|
| `id` | UUID v7 | Primary key |
| `name` | VARCHAR | Unique tool name |
| `description` | TEXT | Tool description for the LLM |
| `parameters` | JSONB | JSON Schema for tool arguments |
| `command` | TEXT | Shell command template with `{{.key}}` placeholders |
| `working_dir` | VARCHAR | Optional working directory |
| `timeout_seconds` | INT | Execution timeout (default 60) |
| `env` | BYTEA | Encrypted environment variables (AES-256-GCM) |
| `agent_id` | UUID | `NULL` = global tool, UUID = per-agent tool |
| `enabled` | BOOLEAN | Soft enable/disable |
| `created_by` | VARCHAR | Audit trail |

**Scoping**: Global tools (`agent_id IS NULL`) are loaded at startup into the global registry. Per-agent tools are loaded on-demand when the agent is resolved, using a cloned registry to avoid polluting the global one.

---

## 10. Agent Link Store

The agent link store manages inter-agent delegation permissions -- directed edges that control which agents can delegate to which others.

### Table: `agent_links`

| Column | Type | Description |
|--------|------|-------------|
| `id` | UUID v7 | Primary key |
| `source_agent_id` | UUID | Agent that can delegate (FK → agents) |
| `target_agent_id` | UUID | Agent being delegated to (FK → agents) |
| `direction` | VARCHAR(20) | `outbound` (A→B only), `bidirectional` (A↔B) |
| `team_id` | UUID | Non-nil = auto-created by team setup (FK → agent_teams, SET NULL on delete) |
| `description` | TEXT | Link description |
| `max_concurrent` | INT | Per-link concurrency cap (default 3) |
| `settings` | JSONB | Per-user deny/allow lists for fine-grained access control |
| `status` | VARCHAR(20) | `active` or `disabled` |
| `created_by` | VARCHAR | Audit trail |

**Constraints**: `UNIQUE(source_agent_id, target_agent_id)`, `CHECK (source_agent_id != target_agent_id)`

### Agent Search Columns (migration 000002)

The `agents` table gains three columns for agent discovery during delegation:

| Column | Type | Purpose |
|--------|------|---------|
| `frontmatter` | TEXT | Short expertise summary (distinct from `other_config.description` which is the summoning prompt) |
| `tsv` | TSVECTOR | Auto-generated from `display_name + frontmatter`, GIN-indexed |
| `embedding` | VECTOR(1536) | For cosine similarity search, HNSW-indexed |

### AgentLinkStore Interface (12 methods)

- **CRUD**: `CreateLink`, `DeleteLink`, `UpdateLink`, `GetLink`
- **Queries**: `ListLinksFrom(agentID)`, `ListLinksTo(agentID)`
- **Permission**: `CanDelegate(from, to)`, `GetLinkBetween(from, to)` (returns full link with Settings for per-user checks)
- **Discovery**: `DelegateTargets(agentID)` (all targets with joined agent_key + display_name for DELEGATION.md), `SearchDelegateTargets` (FTS), `SearchDelegateTargetsByEmbedding` (vector cosine)

### Table: `delegation_history`

| Column | Type | Description |
|--------|------|-------------|
| `id` | UUID v7 | Primary key |
| `source_agent_id` | UUID | Delegating agent |
| `target_agent_id` | UUID | Target agent |
| `team_id` | UUID | Team context (nullable) |
| `team_task_id` | UUID | Related team task (nullable) |
| `user_id` | VARCHAR | User who triggered the delegation |
| `task` | TEXT | Task description sent to target |
| `mode` | VARCHAR(10) | `sync` or `async` |
| `status` | VARCHAR(20) | `completed`, `failed`, `cancelled` |
| `result` | TEXT | Target agent's response |
| `error` | TEXT | Error message on failure |
| `iterations` | INT | Number of LLM iterations |
| `trace_id` | UUID | Linked trace for observability |
| `duration_ms` | INT | Wall-clock duration |
| `completed_at` | TIMESTAMPTZ | Completion timestamp |

Every sync and async delegation is persisted here automatically via `SaveDelegationHistory()`. Results are truncated for WS transport (500 runes for list, 8000 runes for detail).

---

## 11. Team Store

The team store manages collaborative multi-agent teams with a shared task board, peer-to-peer mailbox, and handoff routing.

### Tables

| Table | Purpose | Key Columns |
|-------|---------|-------------|
| `agent_teams` | Team definitions | `name`, `lead_agent_id` (FK → agents), `status`, `settings` (JSONB) |
| `agent_team_members` | Team membership | PK `(team_id, agent_id)`, `role` (lead/member) |
| `team_tasks` | Shared task board | `subject`, `status` (pending/in_progress/completed/blocked), `owner_agent_id`, `blocked_by` (UUID[]), `priority`, `result`, `tsv` (FTS) |
| `team_messages` | Peer-to-peer mailbox | `from_agent_id`, `to_agent_id` (NULL = broadcast), `content`, `message_type` (chat/broadcast), `read` |
| `handoff_routes` | Active routing overrides | UNIQUE `(channel, chat_id)`, `from_agent_key`, `to_agent_key`, `reason` |

### TeamStore Interface (22 methods)

**Team CRUD**: `CreateTeam`, `GetTeam`, `DeleteTeam`, `ListTeams`

**Members**: `AddMember`, `RemoveMember`, `ListMembers`, `GetTeamForAgent` (find team by agent)

**Tasks**: `CreateTask`, `UpdateTask`, `ListTasks` (orderBy: priority/newest, statusFilter: active/completed/all), `GetTask`, `SearchTasks` (FTS on subject+description), `ClaimTask`, `CompleteTask`

**Delegation History**: `SaveDelegationHistory`, `ListDelegationHistory` (with filter opts), `GetDelegationHistory`

**Handoff Routes**: `SetHandoffRoute`, `GetHandoffRoute`, `ClearHandoffRoute`

**Messages**: `SendMessage`, `GetUnread`, `MarkRead`

### Atomic Task Claiming

Two agents grabbing the same task is prevented at the database level:

```sql
UPDATE team_tasks
SET status = 'in_progress', owner_agent_id = $1
WHERE id = $2 AND status = 'pending' AND owner_agent_id IS NULL
```

One row updated = claimed. Zero rows = someone else got it. Row-level locking, no distributed mutex needed.

### Task Dependencies

Tasks can declare `blocked_by` (UUID array) pointing to prerequisite tasks. When a task is completed via `CompleteTask`, all dependent tasks whose blockers are now all completed are automatically unblocked (status transitions from `blocked` to `pending`).

---

## 12. Database Schema

All tables use UUID v7 (time-ordered) as primary keys via `GenNewID()`.

```mermaid
flowchart TD
    subgraph Providers
        LP["llm_providers"] --> LM["llm_models"]
    end

    subgraph Agents
        AG["agents"] --> AS["agent_shares"]
        AG --> ACF["agent_context_files"]
        AG --> UCF["user_context_files"]
        AG --> UAP["user_agent_profiles"]
    end

    subgraph "Agent Links"
        AG --> AL["agent_links"]
        AL --> DH["delegation_history"]
    end

    subgraph Teams
        AT["agent_teams"] --> ATM["agent_team_members"]
        AT --> TT["team_tasks"]
        AT --> TM["team_messages"]
    end

    subgraph Handoff
        HR["handoff_routes"]
    end

    subgraph Sessions
        SE["sessions"]
    end

    subgraph Memory
        MD["memory_documents"] --> MC["memory_chunks"]
    end

    subgraph Cron
        CJ["cron_jobs"] --> CRL["cron_run_logs"]
    end

    subgraph Pairing
        PR["pairing_requests"]
        PD["paired_devices"]
    end

    subgraph Skills
        SK["skills"] --> SAG["skill_agent_grants"]
        SK --> SUG["skill_user_grants"]
    end

    subgraph Tracing
        TR["traces"] --> SP["spans"]
    end

    subgraph MCP
        MS["mcp_servers"] --> MAG["mcp_agent_grants"]
        MS --> MUG["mcp_user_grants"]
        MS --> MAR["mcp_access_requests"]
    end

    subgraph "Custom Tools"
        CT["custom_tools"]
    end
```

### Key Tables

| Table | Purpose | Key Columns |
|-------|---------|-------------|
| `agents` | Agent definitions | `agent_key` (UNIQUE), `owner_id`, `agent_type` (open/predefined), `is_default`, `frontmatter`, `tsv`, `embedding`, soft delete via `deleted_at` |
| `agent_shares` | Agent RBAC sharing | UNIQUE(agent_id, user_id), `role` (user/admin/operator) |
| `agent_context_files` | Agent-level context | UNIQUE(agent_id, file_name) |
| `user_context_files` | Per-user context | UNIQUE(agent_id, user_id, file_name) |
| `user_agent_profiles` | User tracking | `first_seen_at`, `last_seen_at`, `workspace` |
| `agent_links` | Inter-agent delegation permissions | UNIQUE(source, target), `direction`, `max_concurrent`, `settings` (JSONB) |
| `agent_teams` | Team definitions | `name`, `lead_agent_id`, `status`, `settings` (JSONB) |
| `agent_team_members` | Team membership | PK(team_id, agent_id), `role` (lead/member) |
| `team_tasks` | Shared task board | `subject`, `status`, `owner_agent_id`, `blocked_by` (UUID[]), `tsv` (FTS) |
| `team_messages` | Peer-to-peer mailbox | `from_agent_id`, `to_agent_id`, `message_type`, `read` |
| `delegation_history` | Persisted delegation records | `source_agent_id`, `target_agent_id`, `mode`, `status`, `result`, `trace_id` |
| `handoff_routes` | Active routing overrides | UNIQUE(channel, chat_id), `from_agent_key`, `to_agent_key` |
| `sessions` | Conversation history | `session_key` (UNIQUE), `messages` (JSONB), `summary`, token counts |
| `memory_documents` | Memory docs | UNIQUE(agent_id, COALESCE(user_id, ''), path) |
| `memory_chunks` | Chunked + embedded text | `embedding` (VECTOR), `tsv` (TSVECTOR) |
| `llm_providers` | Provider configuration | `api_key` (AES-256-GCM encrypted) |
| `traces` | LLM call traces | `agent_id`, `user_id`, `status`, `parent_trace_id`, aggregated token counts |
| `spans` | Individual operations | `span_type` (llm_call, tool_call, agent, embedding), `parent_span_id` |
| `skills` | Skill definitions | Content, metadata, grants |
| `cron_jobs` | Scheduled tasks | `schedule_kind` (at/every/cron), `payload` (JSONB) |
| `mcp_servers` | MCP server configs | `transport`, `api_key` (encrypted), `tool_prefix` |
| `custom_tools` | Dynamic tool definitions | `command` (template), `agent_id` (NULL = global), `env` (encrypted) |

### Migrations

| Migration | Purpose |
|-----------|---------|
| `000001_init_schema` | Core tables (agents, sessions, providers, memory, cron, pairing, skills, traces, MCP, custom tools) |
| `000002_agent_links` | `agent_links` table + `frontmatter`, `tsv`, `embedding` on agents + `parent_trace_id` on traces |
| `000003_agent_teams` | `agent_teams`, `agent_team_members`, `team_tasks`, `team_messages` + `team_id` on agent_links |
| `000004_teams_v2` | FTS on `team_tasks` (tsv column) + `delegation_history` table |
| `000005_phase4` | `handoff_routes` table |

### Required PostgreSQL Extensions

- **pgvector**: Vector similarity search for memory embeddings
- **pgcrypto**: UUID generation functions

---

## 13. Context Propagation

Metadata flows through `context.Context` instead of mutable state, ensuring thread safety across concurrent agent runs.

```mermaid
flowchart TD
    HANDLER["HTTP/WS Handler"] -->|"store.WithUserID(ctx)<br/>store.WithAgentID(ctx)<br/>store.WithAgentType(ctx)"| LOOP["Agent Loop"]
    LOOP -->|"tools.WithToolChannel(ctx)<br/>tools.WithToolChatID(ctx)<br/>tools.WithToolPeerKind(ctx)"| TOOL["Tool Execute(ctx)"]
    TOOL -->|"store.UserIDFromContext(ctx)<br/>store.AgentIDFromContext(ctx)<br/>tools.ToolChannelFromCtx(ctx)"| LOGIC["Domain Logic"]
```

### Store Context Keys

| Key | Type | Purpose |
|-----|------|---------|
| `goclaw_user_id` | string | External user ID (e.g., Telegram user ID) |
| `goclaw_agent_id` | uuid.UUID | Agent UUID (managed mode) |
| `goclaw_agent_type` | string | Agent type: `"open"` or `"predefined"` |
| `goclaw_sender_id` | string | Original individual sender ID (in group chats, `user_id` is group-scoped but `sender_id` preserves the actual person) |

### Tool Context Keys

| Key | Purpose |
|-----|---------|
| `tool_channel` | Current channel (telegram, discord, etc.) |
| `tool_chat_id` | Chat/conversation identifier |
| `tool_peer_kind` | Peer type: `"direct"` or `"group"` |
| `tool_sandbox_key` | Docker sandbox scope key |
| `tool_async_cb` | Callback for async tool execution |
| `tool_workspace` | Per-user workspace directory (injected by agent loop, read by filesystem/shell tools) |

---

## 14. Key PostgreSQL Patterns

### Database Driver

All PG stores use `database/sql` with the `pgx/v5/stdlib` driver. No ORM is used -- all queries are raw SQL with positional parameters (`$1`, `$2`, ...).

### Nullable Columns

Nullable columns are handled via Go pointers: `*string`, `*int`, `*time.Time`, `*uuid.UUID`. Helper functions `nilStr()`, `nilInt()`, `nilUUID()`, `nilTime()` convert zero values to `nil` for clean SQL insertion.

### Dynamic Updates

`execMapUpdate()` builds UPDATE statements dynamically from a `map[string]any` of column-value pairs. This avoids writing a separate UPDATE query for every combination of updatable fields.

### Upsert Pattern

All "create or update" operations use `INSERT ... ON CONFLICT DO UPDATE`, ensuring idempotency:

| Operation | Conflict Key |
|-----------|-------------|
| `SetAgentContextFile` | `(agent_id, file_name)` |
| `SetUserContextFile` | `(agent_id, user_id, file_name)` |
| `ShareAgent` | `(agent_id, user_id)` |
| `PutDocument` (memory) | `(agent_id, COALESCE(user_id, ''), path)` |
| `GrantToAgent` (skill) | `(skill_id, agent_id)` |

### User Profile Detection

`GetOrCreateUserProfile` uses the PostgreSQL `xmax` trick:
- `xmax = 0` after RETURNING means a real INSERT occurred (new user) -- triggers context file seeding
- `xmax != 0` means an UPDATE on conflict (existing user) -- no seeding needed

### Batch Span Insert

`BatchCreateSpans` inserts spans in batches of 100. If a batch fails, it falls back to inserting each span individually to prevent data loss.

---

## File Reference

| File | Purpose |
|------|---------|
| `internal/store/stores.go` | `Stores` container struct (all 14 store interfaces) |
| `internal/store/types.go` | `BaseModel`, `StoreConfig`, `GenNewID()` |
| `internal/store/context.go` | Context propagation: `WithUserID`, `WithAgentID`, `WithAgentType`, `WithSenderID` |
| `internal/store/session_store.go` | `SessionStore` interface, `SessionData`, `SessionInfo` |
| `internal/store/memory_store.go` | `MemoryStore` interface, `MemorySearchResult`, `EmbeddingProvider` |
| `internal/store/skill_store.go` | `SkillStore` interface |
| `internal/store/agent_store.go` | `AgentStore` interface |
| `internal/store/agent_link_store.go` | `AgentLinkStore` interface, `AgentLinkData`, link constants |
| `internal/store/team_store.go` | `TeamStore` interface, `TeamData`, `TeamTaskData`, `DelegationHistoryData`, `HandoffRouteData`, `TeamMessageData` |
| `internal/store/provider_store.go` | `ProviderStore` interface |
| `internal/store/tracing_store.go` | `TracingStore` interface, `TraceData`, `SpanData` |
| `internal/store/mcp_store.go` | `MCPServerStore` interface, grant types, access request types |
| `internal/store/channel_instance_store.go` | `ChannelInstanceStore` interface |
| `internal/store/config_secrets_store.go` | `ConfigSecretsStore` interface |
| `internal/store/pairing_store.go` | `PairingStore` interface |
| `internal/store/cron_store.go` | `CronStore` interface |
| `internal/store/custom_tool_store.go` | `CustomToolStore` interface |
| `internal/store/file/agents.go` | `FileAgentStore`: filesystem + SQLite backend for standalone mode |
| `internal/store/pg/factory.go` | PG store factory: creates all PG store instances from a connection pool |
| `internal/store/pg/sessions.go` | `PGSessionStore`: session cache, Save, GetOrCreate |
| `internal/store/pg/agents.go` | `PGAgentStore`: CRUD, soft delete, access control |
| `internal/store/pg/agents_context.go` | Agent and user context file operations |
| `internal/store/pg/agent_links.go` | `PGAgentLinkStore`: link CRUD, permissions, FTS + vector search |
| `internal/store/pg/teams.go` | `PGTeamStore`: teams, tasks (atomic claim), messages, delegation history, handoff routes |
| `internal/store/pg/memory_docs.go` | `PGMemoryStore`: document CRUD, indexing, chunking |
| `internal/store/pg/memory_search.go` | Hybrid search: FTS, vector, ILIKE fallback, merge |
| `internal/store/pg/skills.go` | `PGSkillStore`: skill CRUD and grants |
| `internal/store/pg/skills_grants.go` | Skill agent and user grants |
| `internal/store/pg/mcp_servers.go` | `PGMCPServerStore`: server CRUD, grants, access requests |
| `internal/store/pg/channel_instances.go` | `PGChannelInstanceStore`: channel instance CRUD |
| `internal/store/pg/config_secrets.go` | `PGConfigSecretsStore`: encrypted config secrets |
| `internal/store/pg/custom_tools.go` | `PGCustomToolStore`: custom tool CRUD with encrypted env |
| `internal/store/pg/providers.go` | `PGProviderStore`: provider CRUD with encrypted keys |
| `internal/store/pg/tracing.go` | `PGTracingStore`: traces and spans with batch insert |
| `internal/store/pg/pool.go` | Connection pool management |
| `internal/store/pg/helpers.go` | Nullable helpers, JSON helpers, `execMapUpdate()` |
| `internal/store/validate.go` | Input validation utilities |
| `internal/tools/context_keys.go` | Tool context keys including `WithToolWorkspace` |

---

# 07 - Bootstrap, Skills & Memory

Three foundational systems that shape each agent's personality (Bootstrap), knowledge (Skills), and long-term recall (Memory).

### Responsibilities

- Bootstrap: load context files, truncate to fit context window, seed templates for new users
- Skills: 5-tier resolution hierarchy, BM25 search, hot-reload via fsnotify
- Memory: chunking, hybrid search (FTS + vector), memory flush before compaction
- System Prompt: build 15+ sections in a fixed order with two modes (full and minimal)

---

## 1. Bootstrap Files -- 7 Template Files

Markdown files loaded at agent initialization and embedded into the system prompt. MEMORY.md is NOT a bootstrap template file; it is a separate memory document loaded independently.

| # | File | Role | Full Session | Subagent/Cron |
|---|------|------|:---:|:---:|
| 1 | AGENTS.md | Operating instructions, memory rules, safety guidelines | Yes | Yes |
| 2 | SOUL.md | Persona, tone of voice, boundaries | Yes | No |
| 3 | TOOLS.md | Local tool notes (camera, SSH, TTS, etc.) | Yes | Yes |
| 4 | IDENTITY.md | Agent name, creature, vibe, emoji | Yes | No |
| 5 | USER.md | User profile (name, timezone, preferences) | Yes | No |
| 6 | HEARTBEAT.md | Periodic check task list | Yes | No |
| 7 | BOOTSTRAP.md | First-run ritual (deleted after completion) | Yes | No |

Subagent and cron sessions load only AGENTS.md + TOOLS.md (the `minimalAllowlist`).

---

## 2. Truncation Pipeline

Bootstrap content can exceed the context window budget. A 4-step pipeline truncates files to fit, matching the behavior of the TypeScript implementation.

```mermaid
flowchart TD
    IN["Ordered list of bootstrap files"] --> S1["Step 1: Skip empty or missing files"]
    S1 --> S2["Step 2: Per-file truncation<br/>If > MaxCharsPerFile (20K):<br/>Keep 70% head + 20% tail<br/>Insert [...truncated] marker"]
    S2 --> S3["Step 3: Clamp to remaining<br/>total budget (starts at 24K)"]
    S3 --> S4{"Step 4: Remaining budget < 64?"}
    S4 -->|Yes| STOP["Stop processing further files"]
    S4 -->|No| NEXT["Continue to next file"]
```

### Truncation Defaults

| Parameter | Value |
|-----------|-------|
| MaxCharsPerFile | 20,000 |
| TotalMaxChars | 24,000 |
| MinFileBudget | 64 |
| HeadRatio | 70% |
| TailRatio | 20% |

When a file is truncated, a marker is inserted between the head and tail sections:
`[...truncated, read SOUL.md for full content...]`

---

## 3. Seeding -- Template Creation

Templates are embedded in the binary via Go `embed` (directory: `internal/bootstrap/templates/`). Seeding automatically creates default files for new workspaces or new users.

```mermaid
flowchart TD
    subgraph "Standalone Mode"
        SA["EnsureWorkspaceFiles()"] --> SA1["Iterate over embedded templates"]
        SA1 --> SA2{"File already exists?<br/>(O_EXCL atomic check)"}
        SA2 -->|Yes| SKIP1["Skip"]
        SA2 -->|No| CREATE1["Create template file on disk"]
    end

    subgraph "Standalone Mode -- Per-User (FileAgentStore)"
        SU["SeedUserFiles()"] --> SU1{"Agent type?"}
        SU1 -->|open| SU_OPEN["Seed all 7 files to SQLite"]
        SU1 -->|predefined| SU_PRED["Seed USER.md + BOOTSTRAP.md to SQLite"]
        SU_OPEN --> SU_CHK{"Row already exists?"}
        SU_PRED --> SU_CHK
        SU_CHK -->|Yes| SKIP_SU["Skip"]
        SU_CHK -->|No| SU_WRITE["INSERT into user_context_files"]
    end

    subgraph "Managed Mode -- Agent Level"
        SB["SeedToStore()"] --> SB1{"Agent type = open?"}
        SB1 -->|Yes| SKIP_AGENT["Skip (open agents use per-user only)"]
        SB1 -->|No| SB2["Seed 6 files to agent_context_files<br/>(all except BOOTSTRAP.md)"]
        SB2 --> SB3{"File already has content?"}
        SB3 -->|Yes| SKIP2["Skip"]
        SB3 -->|No| WRITE2["Write embedded template"]
    end

    subgraph "Managed Mode -- Per-User"
        MC["SeedUserFiles()"] --> MC1{"Agent type?"}
        MC1 -->|open| OPEN["Seed all 7 files to user_context_files"]
        MC1 -->|predefined| PRED["Seed USER.md + BOOTSTRAP.md to user_context_files"]
        OPEN --> CHECK{"File already has content?"}
        PRED --> CHECK
        CHECK -->|Yes| SKIP3["Skip -- never overwrite"]
        CHECK -->|No| WRITE3["Write embedded template"]
    end
```

`SeedUserFiles()` is idempotent -- safe to call multiple times without overwriting personalized content.

### Standalone UUID Generation

Standalone agents are defined in `config.json` without database-generated UUIDs. `FileAgentStore` uses UUID v5 (`uuid.NewSHA1(namespace, "goclaw-standalone:{agentKey}")`) to produce deterministic IDs from agent keys. This ensures SQLite rows for per-user files survive process restarts without coordination.

### Predefined Agent Bootstrap

Both standalone and managed mode now seed `BOOTSTRAP.md` for predefined agents (per-user). On first chat, the agent runs the bootstrap ritual (learn name, preferences), then writes an empty `BOOTSTRAP.md` which triggers deletion. The empty-write deletion is ordered *before* the predefined write-block in `ContextFileInterceptor` to prevent an infinite bootstrap loop.

---

## 4. Agent Type Routing

Two agent types determine which context files live at the agent level versus the per-user level. Agent types are now available in both managed and standalone modes.

| Agent Type | Agent-Level Files | Per-User Files |
|------------|-------------------|----------------|
| `open` | None | All 7 files (AGENTS, SOUL, TOOLS, IDENTITY, USER, HEARTBEAT, BOOTSTRAP) |
| `predefined` | 6 files (shared across all users) | USER.md + BOOTSTRAP.md |

For `open` agents, each user gets their own full set of context files. When a file is read, the system checks the per-user copy first and falls back to the agent-level copy if not found. For `predefined` agents, all users share the same agent-level files except USER.md (personalized) and BOOTSTRAP.md (per-user first-run ritual, deleted after completion).

| Mode | Agent Type Source | Per-User Storage |
|------|------------------|-----------------|
| Managed | `agents` PostgreSQL table | `user_context_files` table |
| Standalone | `config.json` agent entries | SQLite via `FileAgentStore` |

---

## 5. System Prompt -- 17+ Sections

`BuildSystemPrompt()` constructs the complete system prompt from ordered sections. Two modes control which sections are included.

```mermaid
flowchart TD
    START["BuildSystemPrompt()"] --> S1["1. Identity<br/>'You are a personal assistant<br/>running inside GoClaw'"]
    S1 --> S1_5{"1.5 BOOTSTRAP.md present?"}
    S1_5 -->|Yes| BOOT["First-run Bootstrap Override<br/>(mandatory BOOTSTRAP.md instructions)"]
    S1_5 -->|No| S2
    BOOT --> S2["2. Tooling<br/>(tool list + descriptions)"]
    S2 --> S3["3. Safety<br/>(hard safety directives)"]
    S3 --> S4["4. Skills (full only)"]
    S4 --> S5["5. Memory Recall (full only)"]
    S5 --> S6["6. Workspace"]
    S6 --> S6_5{"6.5 Sandbox enabled?"}
    S6_5 -->|Yes| SBX["Sandbox instructions"]
    S6_5 -->|No| S7
    SBX --> S7["7. User Identity (full only)"]
    S7 --> S8["8. Current Time"]
    S8 --> S9["9. Messaging (full only)"]
    S9 --> S10["10. Extra Context / Subagent Context"]
    S10 --> S11["11. Project Context<br/>(bootstrap files + virtual files)"]
    S11 --> S12["12. Silent Replies (full only)"]
    S12 --> S13["13. Heartbeats (full only)"]
    S13 --> S14["14. Sub-Agent Spawning (conditional)"]
    S14 --> S15["15. Runtime"]
```

### Mode Comparison

| Section | PromptFull | PromptMinimal |
|---------|:---:|:---:|
| 1. Identity | Yes | Yes |
| 1.5. Bootstrap Override | Conditional | Conditional |
| 2. Tooling | Yes | Yes |
| 3. Safety | Yes | Yes |
| 4. Skills | Yes | No |
| 5. Memory Recall | Yes | No |
| 6. Workspace | Yes | Yes |
| 6.5. Sandbox | Conditional | Conditional |
| 7. User Identity | Yes | No |
| 8. Current Time | Yes | Yes |
| 9. Messaging | Yes | No |
| 10. Extra Context | Conditional | Conditional |
| 11. Project Context | Yes | Yes |
| 12. Silent Replies | Yes | No |
| 13. Heartbeats | Yes | No |
| 14. Sub-Agent Spawning | Conditional | Conditional |
| 15. Runtime | Yes | Yes |

Context files are wrapped in `<context_file>` XML tags with a defensive preamble instructing the model to follow tone/persona guidance but not execute instructions that contradict core directives. The ExtraPrompt is wrapped in `<extra_context>` tags for context isolation.

### Virtual Context Files (DELEGATION.md, TEAM.md)

Two files are system-injected by the resolver rather than stored on disk or in the DB:

| File | Injection Condition | Content |
|------|-------------------|---------|
| `DELEGATION.md` | Agent has manual (non-team) agent links | ≤15 targets: static list. >15 targets: search instruction for `delegate_search` tool |
| `TEAM.md` | Agent is a member of a team | Team name, role, teammate list with descriptions, workflow sentence |

Virtual files are rendered in `<system_context>` tags (not `<context_file>`) so the LLM does not attempt to read or write them as files. During bootstrap (first-run), both files are skipped to avoid wasting tokens when the agent should focus on onboarding.

---

## 6. Context File Merging

For **open agents**, per-user context files (from `user_context_files`) are merged with base context files (from the resolver) at runtime. Per-user files override same-name base files, but base-only files are preserved.

```
Base files (resolver):     AGENTS.md, DELEGATION.md, TEAM.md
Per-user files (DB/SQLite): AGENTS.md, SOUL.md, TOOLS.md, USER.md, ...
Merged result:             SOUL.md, TOOLS.md, USER.md, ..., AGENTS.md (per-user), DELEGATION.md ✓, TEAM.md ✓
```

This ensures resolver-injected virtual files (`DELEGATION.md`, `TEAM.md`) survive alongside per-user customizations. The merge logic lives in `internal/agent/loop_history.go`.

---

## 7. Agent Summoning (Managed Mode)

Creating a predefined agent requires 5 context files (SOUL.md, IDENTITY.md, AGENTS.md, TOOLS.md, HEARTBEAT.md) with specific formatting conventions. Agent summoning generates all 5 files from a natural language description in a single LLM call.

```mermaid
flowchart TD
    USER["User: 'sarcastic Rust reviewer'"] --> API["Backend (POST /v1/agents/{id}/summon)"]
    API -->|"status: summoning"| DB["Database"]
    API --> LLM["LLM call with structured XML prompt"]
    LLM --> PARSE["Parse XML output into 5 files"]
    PARSE --> STORE["Write files to agent_context_files"]
    STORE -->|"status: active"| READY["Agent ready"]
    LLM -.->|"WS events"| UI["Dashboard modal with progress"]
```

The LLM outputs structured XML with each file in a tagged block. Parsing is done server-side in `internal/http/summoner.go`. If the LLM fails (timeout, bad XML, no provider), the agent falls back to embedded template files and goes active anyway. The user can retry via "Edit with AI" later.

**Why not `write_file`?** The `ContextFileInterceptor` blocks predefined file writes from chat by design. Bypassing it would create a security hole. Instead, the summoner writes directly to the store — one call, no tool iterations.

---

## 8. Skills -- 5-Tier Hierarchy

Skills are loaded from multiple directories with a priority ordering. Higher-tier skills override lower-tier skills with the same name.

```mermaid
flowchart TD
    T1["Tier 1 (highest): Workspace skills<br/>workspace/skills/name/SKILL.md"] --> T2
    T2["Tier 2: Project agent skills<br/>workspace/.agents/skills/"] --> T3
    T3["Tier 3: Personal agent skills<br/>~/.agents/skills/"] --> T4
    T4["Tier 4: Global/managed skills<br/>~/.goclaw/skills/"] --> T5
    T5["Tier 5 (lowest): Builtin skills<br/>(bundled with binary)"]

    style T1 fill:#e1f5fe
    style T5 fill:#fff3e0
```

Each skill directory contains a `SKILL.md` file with YAML/JSON frontmatter (`name`, `description`). The `{baseDir}` placeholder in SKILL.md content is replaced with the skill's absolute directory path at load time.

---

## 9. Skills -- Inline vs Search Mode

The system dynamically decides whether to embed skill summaries directly in the prompt (inline mode) or instruct the agent to use the `skill_search` tool (search mode).

```mermaid
flowchart TD
    COUNT["Count filtered skills<br/>Estimate tokens = sum(chars of name+desc) / 4"] --> CHECK{"skills <= 20<br/>AND tokens <= 3500?"}
    CHECK -->|Yes| INLINE["INLINE MODE<br/>BuildSummary() produces XML<br/>Agent reads available_skills directly"]
    CHECK -->|No| SEARCH["SEARCH MODE<br/>Prompt instructs agent to use skill_search<br/>BM25 ranking returns top 5"]
```

This decision is re-evaluated each time the system prompt is built, so newly hot-reloaded skills are immediately reflected.

---

## 10. Skills -- BM25 Search

An in-memory BM25 index provides keyword-based skill search. The index is lazily rebuilt whenever the skill version changes.

**Tokenization**: Lowercase the text, replace non-alphanumeric characters with spaces, filter out single-character tokens.

**Scoring formula**: `IDF(t) x tf(t,d) x (k1 + 1) / (tf(t,d) + k1 x (1 - b + b x |d| / avgDL))`

| Parameter | Value |
|-----------|-------|
| k1 | 1.2 |
| b | 0.75 |
| Max results | 5 |

IDF is computed as: `log((N - df + 0.5) / (df + 0.5) + 1)`

---

## 11. Skills -- Embedding Search (Managed Mode)

In managed mode, skill search uses a hybrid approach combining BM25 and vector similarity.

```mermaid
flowchart TD
    Q["Search query"] --> BM25["BM25 search<br/>(in-memory index)"]
    Q --> EMB["Generate query embedding"]
    EMB --> VEC["Vector search<br/>pgvector cosine distance<br/>(embedding <=> operator)"]
    BM25 --> MERGE["Weighted merge"]
    VEC --> MERGE
    MERGE --> RESULT["Final ranked results"]
```

| Component | Weight |
|-----------|--------|
| BM25 score | 0.3 |
| Vector similarity | 0.7 |

**Auto-backfill**: On startup, `BackfillSkillEmbeddings()` generates embeddings synchronously for any active skills that lack them.

---

## 12. Skills Grants & Visibility (Managed Mode)

In managed mode, skill access is controlled through a 3-tier visibility model with explicit agent and user grants.

```mermaid
flowchart TD
    SKILL["Skill record"] --> VIS{"visibility?"}
    VIS -->|public| ALL["Accessible to all agents and users"]
    VIS -->|private| OWNER["Accessible only to owner<br/>(owner_id = userID)"]
    VIS -->|internal| GRANT{"Has explicit grant?"}
    GRANT -->|skill_agent_grants| AGENT["Accessible to granted agent"]
    GRANT -->|skill_user_grants| USER["Accessible to granted user"]
    GRANT -->|No grant| DENIED["Not accessible"]
```

### Visibility Levels

| Visibility | Access Rule |
|------------|------------|
| `public` | All agents and users can discover and use the skill |
| `private` | Only the owner (`skills.owner_id = userID`) can access |
| `internal` | Requires an explicit agent grant or user grant |

### Grant Tables

| Table | Key | Extra |
|-------|-----|-------|
| `skill_agent_grants` | `(skill_id, agent_id)` | `pinned_version` for version pinning per agent, `granted_by` audit |
| `skill_user_grants` | `(skill_id, user_id)` | `granted_by` audit, ON CONFLICT DO NOTHING for idempotency |

**Resolution**: `ListAccessible(agentID, userID)` performs a DISTINCT join across `skills`, `skill_agent_grants`, and `skill_user_grants` with the visibility filter, returning only active skills the caller can access.

**Managed-mode Tier 4**: In managed mode, global skills (Tier 4 in the hierarchy) are loaded from the `skills` PostgreSQL table instead of the filesystem.

---

## 13. Hot-Reload

An fsnotify-based watcher monitors all skill directories for changes to SKILL.md files.

```mermaid
flowchart TD
    S1["fsnotify detects SKILL.md change"] --> S2["Debounce 500ms"]
    S2 --> S3["BumpVersion() sets version = timestamp"]
    S3 --> S4["Next system prompt build detects<br/>version change and reloads skills"]
```

New skill directories created inside a watched root are automatically added to the watch list. The debounce window (500ms) is shorter than the memory watcher (1500ms) because skill changes are lightweight.

---

## 14. Memory -- Indexing Pipeline

Memory documents are chunked, embedded, and stored for hybrid search.

```mermaid
flowchart TD
    IN["Document changed or created"] --> READ["Read content"]
    READ --> HASH["Compute SHA256 hash (first 16 bytes)"]
    HASH --> CHECK{"Hash changed?"}
    CHECK -->|No| SKIP["Skip -- content unchanged"]
    CHECK -->|Yes| DEL["Delete old chunks for this document"]
    DEL --> CHUNK["Split into chunks<br/>(max 1000 chars, prefer paragraph breaks)"]
    CHUNK --> EMBED{"EmbeddingProvider available?"}
    EMBED -->|Yes| API["Batch embed all chunks"]
    EMBED -->|No| SAVE
    API --> SAVE["Store chunks + tsvector index<br/>+ vector embeddings + metadata"]
```

### Chunking Rules

- Prefer splitting at blank lines (paragraph breaks) when the current chunk reaches half of `maxChunkLen`
- Force flush at `maxChunkLen` (1000 characters)
- Each chunk retains `StartLine` and `EndLine` from the source document

### Memory Paths

- `MEMORY.md` or `memory.md` at the workspace root
- `memory/*.md` (recursive, excluding `.git`, `node_modules`, etc.)

---

## 15. Hybrid Search

Combines full-text search and vector search with weighted merging.

```mermaid
flowchart TD
    Q["Search(query)"] --> FTS["FTS Search<br/>Standalone: SQLite FTS5 (BM25)<br/>Managed: tsvector + plainto_tsquery"]
    Q --> VEC["Vector Search<br/>Standalone: cosine similarity<br/>Managed: pgvector (cosine distance)"]
    FTS --> MERGE["hybridMerge()"]
    VEC --> MERGE
    MERGE --> NORM["Normalize FTS scores to 0..1<br/>Vector scores already in 0..1"]
    NORM --> WEIGHT["Weighted sum<br/>textWeight = 0.3<br/>vectorWeight = 0.7"]
    WEIGHT --> BOOST["Per-user scope: 1.2x boost<br/>Dedup: user copy wins over global"]
    BOOST --> RESULT["Sorted + filtered results"]
```

### Standalone vs Managed Comparison

| Aspect | Standalone | Managed |
|--------|-----------|---------|
| Storage | SQLite + FTS5 | PostgreSQL + tsvector + pgvector |
| FTS | `porter unicode61` tokenizer | `plainto_tsquery('simple')` |
| Vector | JSON array embedding | pgvector type |
| Scope | Global (single agent) | Per-agent + per-user |
| File watcher | fsnotify (1500ms debounce) | Not needed (DB-backed) |

When both FTS and vector search return results, scores are merged using the weighted sum. When only one channel returns results, its scores are used directly (weights normalized to 1.0).

---

## 16. Memory Flush -- Pre-Compaction

Before session history is compacted (summarized + truncated), the agent is given an opportunity to write durable memories to disk.

```mermaid
flowchart TD
    CHECK{"totalTokens >= threshold?<br/>(contextWindow - reserveFloor - softThreshold)<br/>AND not flushed in this cycle?"} -->|Yes| FLUSH
    CHECK -->|No| SKIP["Continue normal operation"]

    FLUSH["Memory Flush"] --> S1["Step 1: Build flush prompt<br/>asking to save memories to memory/YYYY-MM-DD.md"]
    S1 --> S2["Step 2: Provide tools<br/>(read_file, write_file, exec)"]
    S2 --> S3["Step 3: Run LLM loop<br/>(max 5 iterations, 90s timeout)"]
    S3 --> S4["Step 4: Mark flush done<br/>for this compaction cycle"]
    S4 --> COMPACT["Proceed with compaction<br/>(summarize + truncate history)"]
```

### Flush Defaults

| Parameter | Value |
|-----------|-------|
| softThresholdTokens | 4,000 |
| reserveTokensFloor | 20,000 |
| Max LLM iterations | 5 |
| Timeout | 90 seconds |
| Default prompt | "Store durable memories now." |

The flush is idempotent per compaction cycle -- it will not run again until the next compaction threshold is reached.

---

## File Reference

| File | Description |
|------|-------------|
| `internal/bootstrap/files.go` | Bootstrap file constants, loading, session filtering |
| `internal/bootstrap/truncate.go` | Truncation pipeline (head/tail split, budget clamping) |
| `internal/bootstrap/seed.go` | Standalone mode seeding (EnsureWorkspaceFiles) |
| `internal/bootstrap/seed_store.go` | Managed mode seeding (SeedToStore, SeedUserFiles) |
| `internal/bootstrap/load_store.go` | Load context files from DB (LoadFromStore) |
| `internal/bootstrap/templates/*.md` | Embedded template files |
| `internal/agent/systemprompt.go` | System prompt builder (BuildSystemPrompt, 17+ sections) |
| `internal/agent/systemprompt_sections.go` | Section renderers, virtual file handling (DELEGATION.md, TEAM.md) |
| `internal/agent/resolver.go` | Agent resolution, DELEGATION.md + TEAM.md injection |
| `internal/agent/loop_history.go` | Context file merging (base + per-user, base-only preserved) |
| `internal/agent/memoryflush.go` | Memory flush logic (shouldRunMemoryFlush, runMemoryFlush) |
| `internal/store/file/agents.go` | FileAgentStore -- filesystem + SQLite backend for standalone |
| `internal/http/summoner.go` | Agent summoning -- LLM-powered context file generation |
| `internal/skills/loader.go` | Skill loader (5-tier hierarchy, BuildSummary, filtering) |
| `internal/skills/search.go` | BM25 search index (tokenization, IDF scoring) |
| `internal/skills/watcher.go` | fsnotify watcher (500ms debounce, version bumping) |
| `internal/store/pg/skills.go` | Managed skill store (embedding search, backfill) |
| `internal/store/pg/skills_grants.go` | Skill grants (agent/user visibility, version pinning) |
| `internal/store/pg/memory_docs.go` | Memory document store (chunking, indexing, embedding) |
| `internal/store/pg/memory_search.go` | Hybrid search (FTS + vector merge, weighted scoring) |

---

## Cross-References

| Document | Relevant Content |
|----------|-----------------|
| [00-architecture-overview.md](./00-architecture-overview.md) | Startup sequence, managed mode wiring |
| [01-agent-loop.md](./01-agent-loop.md) | Agent loop calls BuildSystemPrompt, compaction flow |
| [03-tools-system.md](./03-tools-system.md) | ContextFileInterceptor routing read_file/write_file to DB |
| [06-store-data-model.md](./06-store-data-model.md) | memory_documents, memory_chunks tables |

---

# 08 - Scheduling, Cron & Heartbeat

Concurrency control and periodic task execution. The scheduler provides lane-based isolation and per-session serialization. Cron and heartbeat extend the agent loop with time-triggered behavior.

> **Managed mode**: Cron jobs and run logs are stored in the `cron_jobs` and `cron_run_logs` PostgreSQL tables. Cache invalidation propagates via the `cache:cron` event on the message bus. In standalone mode, cron state is persisted to JSON files.

### Responsibilities

- Scheduler: lane-based concurrency control, per-session message queue serialization
- Cron: three schedule kinds (at/every/cron), run logging, retry with exponential backoff
- Heartbeat: periodic agent wake-up, HEARTBEAT_OK detection, dedup within 24h

---

## 1. Scheduler Lanes

Named worker pools (semaphore-based) with configurable concurrency limits. Each lane processes requests independently. Unknown lane names fall back to the `main` lane.

```mermaid
flowchart TD
    subgraph "Lane: main (concurrency = 2)"
        M1["User chat 1"]
        M2["User chat 2"]
    end

    subgraph "Lane: subagent (concurrency = 4)"
        S1["Subagent 1"]
        S2["Subagent 2"]
        S3["Subagent 3"]
        S4["Subagent 4"]
    end

    subgraph "Lane: delegate (concurrency = 100)"
        D1["Delegation 1"]
        D2["Delegation 2"]
    end

    subgraph "Lane: cron (concurrency = 1)"
        C1["Cron job"]
    end

    REQ["Incoming request"] --> SCHED["Scheduler.Schedule(ctx, lane, req)"]
    SCHED --> QUEUE["getOrCreateSession(sessionKey, lane)"]
    QUEUE --> SQ["SessionQueue.Enqueue()"]
    SQ --> LANE["Lane.Submit(fn)"]
```

### Lane Defaults

| Lane | Concurrency | Env Override | Purpose |
|------|:-----------:|-------------|---------|
| `main` | 2 | `GOCLAW_LANE_MAIN` | Primary user chat sessions |
| `subagent` | 4 | `GOCLAW_LANE_SUBAGENT` | Sub-agents spawned by the main agent |
| `delegate` | 100 | `GOCLAW_LANE_DELEGATE` | Agent delegation executions |
| `cron` | 1 | `GOCLAW_LANE_CRON` | Scheduled cron jobs (sequential to avoid conflicts) |

`GetOrCreate()` allows creating new lanes on demand with custom concurrency. All lane concurrency values are configurable via environment variables.

---

## 2. Session Queue

Each session key gets a dedicated queue that manages agent runs. The queue supports configurable concurrent runs per session.

### Concurrent Runs

| Context | `maxConcurrent` | Rationale |
|---------|:--------------:|-----------|
| DMs | 1 | Single-threaded per user (no interleaving) |
| Groups | 3 | Multiple users can get responses in parallel |

**Adaptive throttle**: When session history exceeds 60% of the context window, concurrency drops to 1 to prevent context window overflow.

### Queue Modes

| Mode | Behavior |
|------|----------|
| `queue` (default) | FIFO -- messages wait until a run slot is available |
| `followup` | Same as `queue` -- messages are queued as follow-ups |
| `interrupt` | Cancel the active run, drain the queue, start the new message immediately |

### Drop Policies

When the queue reaches capacity, one of two drop policies applies.

| Policy | When Queue Is Full | Error Returned |
|--------|-------------------|----------------|
| `old` (default) | Drop the oldest queued message, add the new one | `ErrQueueDropped` |
| `new` | Reject the incoming message | `ErrQueueFull` |

### Queue Config Defaults

| Parameter | Default | Description |
|-----------|---------|-------------|
| `mode` | `queue` | Queue mode (queue, followup, interrupt) |
| `cap` | 10 | Maximum messages in the queue |
| `drop` | `old` | Drop policy when full (old or new) |
| `debounce_ms` | 800 | Collapse rapid messages within this window |

---

## 3. /stop and /stopall Commands

Cancel commands for Telegram and other channels.

| Command | Behavior |
|---------|----------|
| `/stop` | Cancel the oldest running task; others keep going |
| `/stopall` | Cancel all running tasks + drain the queue |

### Implementation Details

- **Debouncer bypass**: `/stop` and `/stopall` are intercepted before the 800ms debouncer to avoid being merged with the next user message
- **Cancel mechanism**: `SessionQueue.Cancel()` exposes the `CancelFunc` from the scheduler. Context cancellation propagates to the agent loop
- **Empty outbound**: On cancel, an empty outbound message is published to trigger cleanup (stop typing indicator, clear reactions)
- **Trace finalization**: When `ctx.Err() != nil`, trace finalization falls back to `context.Background()` for the final DB write. Status is set to `"cancelled"`
- **Context survival**: Context values (traceID, collector) survive cancellation -- only the Done channel fires

---

## 4. Cron Lifecycle

Scheduled tasks that run agent turns automatically. The run loop checks every second for due jobs.

```mermaid
stateDiagram-v2
    [*] --> Created: AddJob()
    Created --> Scheduled: Compute nextRunAtMS
    Scheduled --> DueCheck: runLoop (every 1s)
    DueCheck --> Scheduled: Not yet due
    DueCheck --> Executing: nextRunAtMS <= now
    Executing --> Completed: Success
    Executing --> Failed: Failure
    Failed --> Retrying: retry < MaxRetries
    Retrying --> Executing: Backoff delay
    Failed --> ErrorLogged: Retries exhausted
    Completed --> Scheduled: Compute next nextRunAtMS (every/cron)
    Completed --> Deleted: deleteAfterRun (at jobs)
```

### Schedule Types

| Type | Parameter | Example |
|------|-----------|---------|
| `at` | `atMs` (epoch ms) | Reminder at 3PM tomorrow, auto-deleted after execution |
| `every` | `everyMs` | Every 30 minutes (1,800,000 ms) |
| `cron` | `expr` (5-field) | `"0 9 * * 1-5"` (9AM on weekdays) |

### Job States

Jobs can be `active` or `paused`. Paused jobs skip execution during the due check. Run results are logged to the `cron_run_logs` table. Cache invalidation propagates via the message bus.

### Retry -- Exponential Backoff with Jitter

| Parameter | Default |
|-----------|---------|
| MaxRetries | 3 |
| BaseDelay | 2 seconds |
| MaxDelay | 30 seconds |

**Formula**: `delay = min(base x 2^attempt, max) +/- 25% jitter`

---

## 5. Heartbeat -- 5 Steps

Periodically wakes the agent to check on events (calendar, inbox, alerts) and surfaces anything that needs attention.

```mermaid
flowchart TD
    TICK["tick() -- every interval (default 30 min)"] --> S1{"Step 1:<br/>Within Active Hours?"}
    S1 -->|Outside hours| SKIP1["Skip"]
    S1 -->|Within hours| S2{"Step 2:<br/>HEARTBEAT.md exists<br/>and has meaningful content?"}
    S2 -->|No| SKIP2["Skip"]
    S2 -->|Yes| S3["Step 3: runner()<br/>Run agent with heartbeat prompt"]
    S3 --> S4{"Step 4:<br/>Reply contains HEARTBEAT_OK?"}
    S4 -->|OK| LOG["Log debug, discard reply"]
    S4 -->|Has content| S5{"Step 5:<br/>Dedup -- same content<br/>within 24h?"}
    S5 -->|Duplicate| SKIP3["Skip"]
    S5 -->|New| DELIVER["deliver() via resolveTarget()<br/>then msgBus.PublishOutbound()"]
```

### Heartbeat Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| Interval | 30 minutes | Time between heartbeat wakes |
| ActiveHours | (none) | Time window restriction, supports wrap-around midnight |
| Target | `"last"` | `"last"` (last-used channel), `"none"`, or explicit channel name |
| AckMaxChars | 300 | Content alongside HEARTBEAT_OK up to this length is still treated as OK |

### HEARTBEAT_OK Detection

Recognizes multiple formatting variants: `HEARTBEAT_OK`, `**HEARTBEAT_OK**`, `` `HEARTBEAT_OK` ``, `<b>HEARTBEAT_OK</b>`. Content accompanying the token is treated as an acknowledgment (OK) if it does not exceed `AckMaxChars`.

---

## File Reference

| File | Description |
|------|-------------|
| `internal/scheduler/lanes.go` | Lane and LaneManager (semaphore-based worker pools) |
| `internal/scheduler/queue.go` | SessionQueue, Scheduler, drop policies, debounce |
| `internal/cron/service.go` | Cron run loop, schedule parsing, job lifecycle |
| `internal/cron/retry.go` | Retry with exponential backoff + jitter |
| `internal/heartbeat/service.go` | Heartbeat loop, HEARTBEAT_OK detection, active hours |
| `internal/store/cron_store.go` | CronStore interface (jobs + run logs) |
| `internal/store/pg/cron.go` | PostgreSQL cron implementation |

---

## Cross-References

| Document | Relevant Content |
|----------|-----------------|
| [00-architecture-overview.md](./00-architecture-overview.md) | Scheduler lanes in startup sequence |
| [01-agent-loop.md](./01-agent-loop.md) | Agent loop triggered by scheduler |
| [06-store-data-model.md](./06-store-data-model.md) | cron_jobs, cron_run_logs tables |

---

# 09 - Security

Defense-in-depth with five independent layers from transport to isolation. Each layer operates independently -- even if one layer is bypassed, the remaining layers continue to protect the system.

> **Managed mode**: Adds AES-256-GCM encryption for secrets stored in PostgreSQL (LLM provider API keys, MCP server API keys, custom tool environment variables), plus agent-level access control via the 4-step `CanAccess` pipeline (see [06-store-data-model.md](./06-store-data-model.md)).

---

## 1. Five Defense Layers

```mermaid
flowchart TD
    REQ["Request"] --> L1["Layer 1: Transport<br/>CORS, message size limits, timing-safe auth"]
    L1 --> L2["Layer 2: Input<br/>Injection detection (6 patterns), message truncation"]
    L2 --> L3["Layer 3: Tool<br/>Shell deny patterns, path traversal, SSRF, exec approval"]
    L3 --> L4["Layer 4: Output<br/>Credential scrubbing, content wrapping"]
    L4 --> L5["Layer 5: Isolation<br/>Workspace isolation, Docker sandbox, read-only FS"]
```

### Layer 1: Transport Security

| Mechanism | Detail |
|-----------|--------|
| CORS (WebSocket) | `checkOrigin()` validates against `allowed_origins` (empty = allow all for backward compatibility) |
| WS message limit | `SetReadLimit(512KB)` -- gorilla auto-closes connection on exceed |
| HTTP body limit | `MaxBytesReader(1MB)` -- error returned before JSON decode |
| Token auth | `crypto/subtle.ConstantTimeCompare` (timing-safe) |
| Rate limiting | Token bucket per user/IP, configurable via `rate_limit_rpm` |

### Layer 2: Input -- Injection Detection

The input guard scans for 6 injection patterns.

| Pattern | Detection Target |
|---------|-----------------|
| `ignore_instructions` | "ignore all previous instructions" |
| `role_override` | "you are now...", "pretend you are..." |
| `system_tags` | `<system>`, `[SYSTEM]`, `[INST]`, `<<SYS>>` |
| `instruction_injection` | "new instructions:", "override:", "system prompt:" |
| `null_bytes` | Null characters `\x00` (obfuscation attempts) |
| `delimiter_escape` | "end of system", `</instructions>`, `</prompt>` |

**Configurable action** (`gateway.injection_action`):

| Value | Behavior |
|-------|----------|
| `"log"` | Log info level, continue processing |
| `"warn"` (default) | Log warning level, continue processing |
| `"block"` | Log warning, return error, stop processing |
| `"off"` | Disable detection entirely |

**Message truncation**: Messages exceeding `max_message_chars` (default 32K) are truncated (not rejected), and the LLM is notified of the truncation.

### Layer 3: Tool Security

**Shell deny patterns** -- 77+ patterns across multiple categories of blocked commands:

| Category | Examples |
|----------|----------|
| Destructive file ops | `rm -rf`, `del /f`, `rmdir /s` |
| Destructive disk ops | `mkfs`, `dd if=`, `> /dev/sd*` |
| System commands | `shutdown`, `reboot`, `poweroff` |
| Fork bombs | `:(){ ... };:` |
| Remote code execution | `curl \| sh`, `wget -O - \| sh` |
| Reverse shells | `/dev/tcp/`, `nc -e` |
| Eval injection | `eval $()`, `base64 -d \| sh` |
| Data exfiltration | `curl ... -d @/etc/passwd`, `exfil`, piping sensitive files to remote hosts |
| Privilege escalation | `sudo`, `su -`, `chmod 4755`, `chown root`, `setuid` |
| Dangerous path operations | Writes to `/etc/`, `/boot/`, `/sys/`, `/proc/` system directories |

**SSRF protection** -- 3-step validation:

```mermaid
flowchart TD
    URL["URL to fetch"] --> S1["Step 1: Check blocked hostnames<br/>localhost, *.local, *.internal,<br/>metadata.google.internal"]
    S1 --> S2["Step 2: Check private IP ranges<br/>10.0.0.0/8, 172.16.0.0/12,<br/>192.168.0.0/16, 127.0.0.0/8,<br/>169.254.0.0/16, IPv6 loopback/link-local"]
    S2 --> S3["Step 3: DNS Pinning<br/>Resolve domain, check every resolved IP.<br/>Also applied to redirect targets."]
    S3 --> ALLOW["Allow request"]
```

**Path traversal**: `resolvePath()` applies `filepath.Clean()` then `HasPrefix()` to ensure all paths stay within the workspace. With `restrict = true`, any path outside the workspace is blocked.

**PathDenyable** -- An interface that lets filesystem tools reject specific path prefixes:

```go
type PathDenyable interface {
    DenyPaths(...string)
}
```

All four filesystem tools (`read_file`, `write_file`, `list_files`, `edit`) implement `PathDenyable`. The agent loop calls `DenyPaths(".goclaw")` at startup to prevent agents from accessing internal data directories. `list_files` additionally filters denied directories from output entirely -- the agent does not see denied paths in directory listings.

### Layer 4: Output Security

| Mechanism | Detail |
|-----------|--------|
| Credential scrubbing | Regex detection of: OpenAI (`sk-...`), Anthropic (`sk-ant-...`), GitHub (`ghp_/gho_/ghu_/ghs_/ghr_`), AWS (`AKIA...`), generic key-value patterns. All replaced with `[REDACTED]`. |
| Web content wrapping | Fetched content wrapped in `<<<EXTERNAL_UNTRUSTED_CONTENT>>>` tags with security warning |

### Layer 5: Isolation

**Per-user workspace isolation** -- Two levels prevent cross-user file access:

| Level | Scope | Directory Pattern |
|-------|-------|------------------|
| Per-agent | Each agent gets its own base directory | `~/.goclaw/{agent-key}-workspace/` |
| Per-user | Each user gets a subdirectory within the agent workspace | `{agent-workspace}/user_{sanitized_id}/` |

The workspace is injected into tools via `WithToolWorkspace(ctx)` context injection. Tools read the workspace from context at execution time (fallback to the struct field for backward compatibility). User IDs are sanitized: anything outside `[a-zA-Z0-9_-]` becomes an underscore (`group:telegram:-1001234` → `group_telegram_-1001234`).

**Docker sandbox** -- Container-based isolation for shell command execution:

| Hardening | Configuration |
|-----------|---------------|
| Read-only root FS | `--read-only` |
| Drop all capabilities | `--cap-drop ALL` |
| No new privileges | `--security-opt no-new-privileges` |
| Memory limit | 512 MB |
| CPU limit | 1.0 |
| PID limit | Enabled |
| Network disabled | `--network none` |
| Tmpfs mounts | `/tmp`, `/var/tmp`, `/run` |
| Output limit | 1 MB |
| Timeout | 300 seconds |

---

## 2. Encryption (Managed Mode)

AES-256-GCM encryption for secrets stored in PostgreSQL. Key provided via `GOCLAW_ENCRYPTION_KEY` environment variable.

| What's Encrypted | Table | Column |
|-----------------|-------|--------|
| LLM provider API keys | `llm_providers` | `api_key` |
| MCP server API keys | `mcp_servers` | `api_key` |
| Custom tool env vars | `custom_tools` | `env` |

**Format**: `"aes-gcm:" + base64(12-byte nonce + ciphertext + GCM tag)`

Backward compatible: values without the `aes-gcm:` prefix are returned as plaintext (for migration from unencrypted data).

---

## 3. Rate Limiting -- Gateway + Tool

Protection at two levels: gateway-wide (per user/IP) and tool-level (per session).

```mermaid
flowchart TD
    subgraph "Gateway Level"
        GW_REQ["Request"] --> GW_CHECK{"rate_limit_rpm > 0?"}
        GW_CHECK -->|No| GW_PASS["Allow all"]
        GW_CHECK -->|Yes| GW_BUCKET{"Token bucket<br/>has capacity?"}
        GW_BUCKET -->|Available| GW_ALLOW["Allow + consume token"]
        GW_BUCKET -->|Exhausted| GW_REJECT["WS: INVALID_REQUEST error<br/>HTTP: 429 + Retry-After header"]
    end

    subgraph "Tool Level"
        TL_REQ["Tool call"] --> TL_CHECK{"Entries in<br/>last 1 hour?"}
        TL_CHECK -->|">= maxPerHour"| TL_REJECT["Error: rate limit exceeded"]
        TL_CHECK -->|"< maxPerHour"| TL_ALLOW["Record + allow"]
    end
```

| Level | Algorithm | Key | Burst | Cleanup |
|-------|-----------|-----|:-----:|---------|
| Gateway | Token bucket | user/IP | 5 | Every 5 min (inactive > 10 min) |
| Tool | Sliding window | `agent:userID` | N/A | Manual `Cleanup()` |

Gateway rate limiting applies to both WebSocket (`chat.send`) and HTTP (`/v1/chat/completions`) chat endpoints. Config: `gateway.rate_limit_rpm` (0 = disabled, any positive value = enabled).

---

## 4. RBAC -- 3 Roles

Role-based access control for WebSocket RPC methods and HTTP API endpoints. Roles are hierarchical: higher levels include all permissions of lower levels.

```mermaid
flowchart LR
    V["Viewer (level 1)<br/>Read-only access"] --> O["Operator (level 2)<br/>Read + Write"]
    O --> A["Admin (level 3)<br/>Full control"]
```

| Role | Key Permissions |
|------|----------------|
| Viewer | agents.list, config.get, sessions.list, health, status, skills.list |
| Operator | + chat.send, chat.abort, sessions.delete/reset, cron.*, skills.update |
| Admin | + config.apply/patch, agents.create/update/delete, channels.toggle, device.pair.approve/revoke |

### Access Check Flow

```mermaid
flowchart TD
    REQ["Method call"] --> S1["Step 1: MethodRole(method)<br/>Determine minimum required role"]
    S1 --> S2{"Step 2: roleLevel(user) >= roleLevel(required)?"}
    S2 -->|Yes| ALLOW["Allow"]
    S2 -->|No| DENY["Deny"]
    S2 --> S3["Step 3 (optional):<br/>CanAccessWithScopes() for tokens<br/>with narrow scope restrictions"]
```

Token-based role assignment happens during the WebSocket `connect` handshake. Scopes include: `operator.admin`, `operator.read`, `operator.write`, `operator.approvals`, `operator.pairing`.

---

## 5. Sandbox -- Container Lifecycle

Docker-based code isolation for shell command execution.

```mermaid
flowchart TD
    REQ["Exec request"] --> CHECK{"ShouldSandbox?"}
    CHECK -->|off| HOST["Execute on host<br/>timeout: 60s"]
    CHECK -->|non-main / all| SCOPE["ResolveScopeKey()"]
    SCOPE --> GET["DockerManager.Get(scopeKey)"]
    GET --> EXISTS{"Container exists?"}
    EXISTS -->|Yes| REUSE["Reuse existing container"]
    EXISTS -->|No| CREATE["docker run -d<br/>+ security flags<br/>+ resource limits<br/>+ workspace mount"]
    REUSE --> EXEC["docker exec sh -c [cmd]<br/>timeout: 300s"]
    CREATE --> EXEC
    EXEC --> RESULT["ExecResult{ExitCode, Stdout, Stderr}"]
```

### Sandbox Modes

| Mode | Behavior |
|------|----------|
| `off` (default) | Execute directly on host |
| `non-main` | Sandbox all agents except main/default |
| `all` | Sandbox every agent |

### Container Scope

| Scope | Reuse Level | Scope Key |
|-------|-------------|-----------|
| `session` (default) | One container per session | sessionKey |
| `agent` | Shared across sessions for the same agent | `"agent:" + agentID` |
| `shared` | One container for all agents | `"shared"` |

### Workspace Access

| Mode | Mount |
|------|-------|
| `none` | No workspace access |
| `ro` | Read-only mount |
| `rw` | Read-write mount |

### Auto-Pruning

| Parameter | Default | Action |
|-----------|---------|--------|
| `idle_hours` | 24 | Remove containers idle for more than 24 hours |
| `max_age_days` | 7 | Remove containers older than 7 days |
| `prune_interval_min` | 5 | Check every 5 minutes |

### FsBridge -- File Operations in Sandbox

| Operation | Docker Command |
|-----------|---------------|
| ReadFile | `docker exec [id] cat -- [path]` |
| WriteFile | `docker exec -i [id] sh -c 'cat > [path]'` |
| ListDir | `docker exec [id] ls -la -- [path]` |
| Stat | `docker exec [id] stat -- [path]` |

---

## 6. Security Logging Convention

All security events use `slog.Warn` with a `security.*` prefix for consistent filtering and alerting.

| Event | Meaning |
|-------|---------|
| `security.injection_detected` | Prompt injection pattern detected |
| `security.injection_blocked` | Message blocked due to injection (when action = block) |
| `security.rate_limited` | Request rejected due to rate limit |
| `security.cors_rejected` | WebSocket connection rejected due to CORS policy |
| `security.message_truncated` | Message truncated because it exceeded the size limit |

Filter all security events by grepping for the `security.` prefix in log output.

---

## 7. Hook Recursion Prevention

The hook system (quality gates) can trigger infinite recursion: an agent evaluator delegates to a reviewer → delegation completes → fires quality gate → delegates to reviewer again → infinite loop.

A context flag `hooks.WithSkipHooks(ctx, true)` prevents this. Three injection points set the flag:

| Injection Point | Why |
|----------------|-----|
| Agent evaluator | Delegating to the reviewer for quality checks must not re-trigger gates |
| Evaluate-optimize loop | All internal generator/evaluator delegations skip gates |
| Agent eval callback (cmd layer) | When the hook engine itself triggers delegation |

`DelegateManager.Delegate()` checks `hooks.SkipHooksFromContext(ctx)` before applying quality gates. If the flag is set, gates are skipped entirely.

---

## 8. Delegation Security

Agent delegation uses directed permissions via the `agent_links` table.

| Control | Scope | Description |
|---------|-------|-------------|
| Directed links | A → B | A single row `(A→B, outbound)` means A can delegate to B, not the reverse |
| Per-user deny/allow | Per-link | `settings` JSONB on each link holds per-user restrictions (premium users only, blocked accounts) |
| Per-link concurrency | A → B | `agent_links.max_concurrent` limits simultaneous delegations from A to B |
| Per-agent load cap | B (all sources) | `other_config.max_delegation_load` limits total concurrent delegations targeting B |

When concurrency limits are hit, the error message is written for LLM reasoning: *"Agent at capacity (5/5). Try a different agent or handle it yourself."*

---

## File Reference

| File | Description |
|------|-------------|
| `internal/agent/input_guard.go` | Injection pattern detection (6 patterns) |
| `internal/tools/scrub.go` | Credential scrubbing (regex-based redaction) |
| `internal/tools/shell.go` | Shell deny patterns, command validation |
| `internal/tools/web_fetch.go` | Web content wrapping, SSRF protection |
| `internal/permissions/policy.go` | RBAC (3 roles, scope-based access) |
| `internal/gateway/ratelimit.go` | Gateway-level token bucket rate limiter |
| `internal/sandbox/` | Docker sandbox manager, FsBridge |
| `internal/crypto/aes.go` | AES-256-GCM encrypt/decrypt |
| `internal/tools/types.go` | PathDenyable interface definition |
| `internal/tools/filesystem.go` | Denied path checking (`checkDeniedPath` helper) |
| `internal/tools/filesystem_list.go` | Denied path support + directory filtering |
| `internal/hooks/context.go` | WithSkipHooks / SkipHooksFromContext (recursion prevention) |
| `internal/hooks/engine.go` | Hook engine, evaluator registry |

---

## Cross-References

| Document | Relevant Content |
|----------|-----------------|
| [03-tools-system.md](./03-tools-system.md) | Shell deny patterns, exec approval, PathDenyable, delegation system, quality gates |
| [04-gateway-protocol.md](./04-gateway-protocol.md) | WebSocket auth, RBAC, rate limiting |
| [06-store-data-model.md](./06-store-data-model.md) | API key encryption, agent access control pipeline, agent_links table |
| [07-bootstrap-skills-memory.md](./07-bootstrap-skills-memory.md) | Context file merging, virtual files |
| [08-scheduling-cron-heartbeat.md](./08-scheduling-cron-heartbeat.md) | Scheduler lanes, cron lifecycle |
| [10-tracing-observability.md](./10-tracing-observability.md) | Tracing and OTel export |

---

# 10 - Tracing & Observability

Records agent run activities asynchronously. Spans are buffered in memory and flushed to the TracingStore in batches, with optional export to external OpenTelemetry backends.

> **Managed mode only**: Tracing requires PostgreSQL. In standalone mode, `TracingStore` is nil and no traces are recorded. The `traces` and `spans` tables store all tracing data. Optional OTel export sends spans to external backends (Jaeger, Grafana Tempo, Datadog) in addition to PostgreSQL.

---

## 1. Collector -- Buffer-Flush Architecture

```mermaid
flowchart TD
    EMIT["EmitSpan(span)"] --> BUF["spanCh<br/>(buffered channel, cap = 1000)"]
    BUF --> FLUSH["flushLoop() -- every 5s"]
    FLUSH --> DRAIN["Drain all spans from channel"]
    DRAIN --> BATCH["BatchCreateSpans() to PostgreSQL"]
    DRAIN --> OTEL["OTelExporter.ExportSpans()<br/>to OTLP backend (if configured)"]
    DRAIN --> AGG["Update aggregates<br/>for dirty traces"]

    FULL{"Buffer full?"} -.->|"Drop + warning log"| BUF
```

### Trace Lifecycle

```mermaid
flowchart LR
    CT["CreateTrace()<br/>(synchronous, 1 per run)"] --> ES["EmitSpan()<br/>(async, buffered)"]
    ES --> FT["FinishTrace()<br/>(status, error, output preview)"]
```

### Cancel Handling

When a run is cancelled via `/stop` or `/stopall`, the run context is cancelled but trace finalization still needs to persist. `FinishTrace()` detects `ctx.Err() != nil` and switches to `context.Background()` for the final database write. The trace status is set to `"cancelled"` instead of `"error"`.

Context values (traceID, collector) survive cancellation -- only `ctx.Done()` and `ctx.Err()` change. This allows trace finalization to find everything it needs with a fresh context for the DB call.

---

## 2. Span Types & Hierarchy

| Type | Description | OTel Kind |
|------|-------------|-----------|
| `llm_call` | LLM provider call | Client |
| `tool_call` | Tool execution | Internal |
| `agent` | Root agent span (parents all child spans) | Internal |

```mermaid
flowchart TD
    AGENT["Agent Span (root)<br/>parents all child spans"] --> LLM1["LLM Call Span 1<br/>(model, tokens, finish reason)"]
    AGENT --> TOOL1["Tool Span: exec<br/>(tool_name, duration)"]
    AGENT --> LLM2["LLM Call Span 2"]
    AGENT --> TOOL2["Tool Span: read_file"]
    AGENT --> LLM3["LLM Call Span 3"]
```

### Token Aggregation

Token counts are aggregated **only from `llm_call` spans** (not `agent` spans) to avoid double-counting. The `BatchUpdateTraceAggregates()` method sums `input_tokens` and `output_tokens` from spans where `span_type = 'llm_call'` and writes the totals to the parent trace record.

---

## 3. Verbose Mode

| Mode | InputPreview | OutputPreview |
|------|:---:|:---:|
| Normal | Not recorded | 500 characters max |
| Verbose (`GOCLAW_TRACE_VERBOSE=1`) | Up to 50KB | 500 characters max |

Verbose mode is useful for debugging LLM conversations. Full input messages (including system prompt, history, and tool results) are serialized as JSON and stored in the span's `InputPreview` field, truncated at 50,000 characters.

---

## 4. OTel Export

Optional OpenTelemetry OTLP exporter that sends spans to external observability backends.

```mermaid
flowchart TD
    COLLECTOR["Collector flush cycle"] --> CHECK{"SpanExporter set?"}
    CHECK -->|No| PG_ONLY["Write to PostgreSQL only"]
    CHECK -->|Yes| BOTH["Write to PostgreSQL<br/>+ ExportSpans() to OTLP backend"]
    BOTH --> BACKEND["Jaeger / Tempo / Datadog"]
```

### OTel Configuration

| Parameter | Description |
|-----------|-------------|
| `endpoint` | OTLP endpoint (e.g., `localhost:4317` for gRPC, `localhost:4318` for HTTP) |
| `protocol` | `grpc` (default) or `http` |
| `insecure` | Skip TLS for local development |
| `service_name` | OTel service name (default: `goclaw-gateway`) |
| `headers` | Extra headers (auth tokens, etc.) |

### Batch Processing

| Parameter | Value |
|-----------|-------|
| Max batch size | 100 spans |
| Batch timeout | 5 seconds |

The exporter lives in a separate sub-package (`internal/tracing/otelexport/`) so its gRPC and protobuf dependencies are isolated. Commenting out the import and wiring removes approximately 15-20MB from the binary. The exporter is attached to the Collector via `SetExporter()`.

---

## 5. Trace HTTP API (Managed Mode)

| Method | Path | Description |
|--------|------|-------------|
| GET | `/v1/traces` | List traces with pagination and filters |
| GET | `/v1/traces/{id}` | Get trace details with all spans |

### Query Filters

| Parameter | Type | Description |
|-----------|------|-------------|
| `agent_id` | UUID | Filter by agent |
| `user_id` | string | Filter by user |
| `status` | string | Filter by status (running, success, error, cancelled) |
| `from` / `to` | timestamp | Date range filter |
| `limit` | int | Page size (default 50) |
| `offset` | int | Pagination offset |

---

## 6. Delegation History (Managed Mode)

Delegation history records are stored in the `delegation_history` table and exposed alongside traces for cross-referencing agent interactions.

| Channel | Endpoint | Details |
|---------|----------|---------|
| WebSocket RPC | `delegations.list` / `delegations.get` | Results truncated (500 runes for list, 8000 for detail) |
| HTTP API | `GET /v1/delegations` / `GET /v1/delegations/{id}` | Full records |
| Agent tool | `delegate(action="history")` | Agent self-checking past delegations |

Delegation history is automatically recorded by `DelegateManager.saveDelegationHistory()` for every delegation (sync/async). Each record includes source agent, target agent, input, result, duration, and status.

---

## File Reference

| File | Description |
|------|-------------|
| `internal/tracing/collector.go` | Collector buffer-flush, EmitSpan, FinishTrace |
| `internal/tracing/context.go` | Trace context propagation (TraceID, ParentSpanID) |
| `internal/tracing/otelexport/exporter.go` | OTel OTLP exporter (gRPC + HTTP) |
| `internal/store/tracing_store.go` | TracingStore interface |
| `internal/store/pg/tracing.go` | PostgreSQL trace/span persistence + aggregation |
| `internal/http/traces.go` | Trace HTTP API handler (GET /v1/traces) |
| `internal/agent/loop_tracing.go` | Span emission from agent loop (LLM, tool, agent spans) |
| `internal/http/delegations.go` | Delegation history HTTP API handler |
| `internal/gateway/methods/delegations.go` | Delegation history RPC handlers |

---

## Cross-References

| Document | Relevant Content |
|----------|-----------------|
| [01-agent-loop.md](./01-agent-loop.md) | Span emission during agent execution, cancel handling |
| [03-tools-system.md](./03-tools-system.md) | Delegation system, delegation history via agent tool |
| [06-store-data-model.md](./06-store-data-model.md) | traces/spans tables schema, delegation_history table |
| [08-scheduling-cron-heartbeat.md](./08-scheduling-cron-heartbeat.md) | /stop and /stopall commands |
| [09-security.md](./09-security.md) | Rate limiting, RBAC access control |

---

# Web Dashboard

The GoClaw Web Dashboard is a React 19 single-page application (SPA) built with Vite 6, TypeScript, Tailwind CSS 4, and Radix UI. It connects to the GoClaw gateway via WebSocket and provides a full management interface for agents, teams, tools, providers, and observability.

---

## 1. Core

### Chat (`/chat`)

Interactive chat interface for direct agent conversation.

![Chat](../images/dashboard/chat.png)

- **Agent selector** — dropdown to switch active agent
- **Session list** — shows message count and timestamp per session
- **New Chat** button — starts a fresh session
- Message input with send action

---

## 2. Management

### Agents (`/agents`)

Card grid of all registered AI agents.

![Agents](../images/dashboard/agent.png)

Each card shows: name, slug, provider/model, description, status badge (`active`/`inactive`), access type (`predefined` / `open`), and context window size.

Actions: **Create Agent**, search by name/slug, edit, delete.

### Agent Teams (`/teams`)

Manages multi-agent team configurations.

![Agent Teams](../images/dashboard/agent%20team.png)

Each team card shows: team name, status, lead agent. Actions: **Create Team**, search, edit, delete.

### Sessions (`/sessions`)

Lists all conversation sessions across agents and channels. Supports filtering and deletion.

![Sessions](../images/dashboard/session.png)

### Channels (`/channels`)

Configuration for external messaging channels (Telegram, Slack, etc.) connected to the gateway.

![Channels](../images/dashboard/channels.png)

### Skills (`/skills`)

Manages agent skill packages (ZIP uploads). Actions: **Upload**, **Refresh**, search by name.

![Skills](../images/dashboard/skills.png)

### Built-in Tools (`/builtin-tools`)

26 built-in tools across 13 categories. Each tool can be individually enabled or disabled.

![Built-in Tools](../images/dashboard/build%20in%20tool.png)

| Category | Tools |
|---|---|
| Filesystem | `edit`, `list_files`, `read_file`, `write_file` |
| Runtime | `exec` |
| Web | `web_fetch`, `web_search` |
| Memory | `memory_get`, `memory_search` |
| (+ 9 more categories) | — |

---

## 3. Monitoring

### Traces (`/traces`)

Table of LLM call traces.

![Traces](../images/dashboard/traces.png)

| Column | Description |
|---|---|
| Name | Trace / run label |
| Status | `completed`, `error`, etc. |
| Duration | Wall-clock time |
| Tokens | Input / output / cached token counts |
| Spans | Number of child spans |
| Time | Timestamp |

Filter by agent ID. **Refresh** button for manual reload.

### Delegations (`/delegations`)

Tracks inter-agent delegation events — which agent delegated a task to which sub-agent, with status and timing.

![Delegations](../images/dashboard/Delegations.png)

---

## 4. System

### Providers (`/providers`)

LLM provider management table.

![Providers](../images/dashboard/providers.png)

| Column | Description |
|---|---|
| Name | Provider label |
| Type | `dashscope`, `bailian`, `gemini`, `openrouter`, `openai_compat` |
| API Base URL | Endpoint |
| API Key | Masked |
| Status | `Enabled` / `Disabled` |

Actions: **Add Provider**, **Refresh**, edit, delete per row.

### Config (`/config`)

Gateway configuration editor with two modes: **UI form** and **Raw Editor**.

![Config](../images/dashboard/config.png)

Sections:

- **Gateway** — host, port, token, owner IDs, allowed origins, rate limit (RPM), max message chars, inbound debounce, injection action
- **LLM Providers** — inline provider list
- **Agent Defaults** — default model settings

> A yellow info banner reminds that environment variables take precedence over UI-set values and that secrets should be configured via env, not stored in the config file.

---

## 5. Accessing the Dashboard

The dashboard is bundled with GoClaw and automatically available when the gateway starts. No separate setup required.

- **URL**: `http://localhost:3000` (default)
- **Connection**: Connects to the gateway via WebSocket automatically
- See [Getting Started](#getting-started) for installation and startup instructions

---

# API Reference

## HTTP Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Health check |
| GET | `/ws` | WebSocket upgrade |
| POST | `/v1/chat/completions` | OpenAI-compatible chat API |
| POST | `/v1/responses` | Responses protocol |
| POST | `/v1/tools/invoke` | Tool invocation |
| GET/POST | `/v1/agents/*` | Agent management (managed mode) |
| GET/POST | `/v1/skills/*` | Skills management (managed mode) |
| GET/POST/PUT/DELETE | `/v1/tools/custom/*` | Custom tool CRUD (managed mode) |
| GET/POST/PUT/DELETE | `/v1/mcp/*` | MCP server + grants management (managed mode) |
| GET | `/v1/traces/*` | Trace viewer (managed mode) |

## Custom Tools (Managed Mode)

Define shell-based tools at runtime via HTTP API — no recompile or restart needed. The LLM can invoke custom tools identically to built-in tools.

**How it works:**
1. Admin creates a tool via `POST /v1/tools/custom` with a shell command template
2. LLM generates a tool call with the custom tool name
3. GoClaw renders the command template with shell-escaped arguments, checks deny patterns, and executes with timeout

**Capabilities:**
- **Scope** — Global (all agents) or per-agent (`agent_id` field)
- **Parameters** — JSON Schema definition for LLM arguments
- **Security** — All arguments auto shell-escaped, deny pattern filtering (blocks `curl|sh`, reverse shells, etc.), configurable timeout (default 60s)
- **Encrypted env vars** — Environment variables stored with AES-256-GCM encryption in the database
- **Cache invalidation** — Mutations broadcast events for hot-reload without restart

**API:**

| Method | Path | Description |
|---|---|---|
| GET | `/v1/tools/custom` | List tools (filter by `?agent_id=`) |
| POST | `/v1/tools/custom` | Create a custom tool |
| GET | `/v1/tools/custom/{id}` | Get tool details |
| PUT | `/v1/tools/custom/{id}` | Update a tool (JSON patch) |
| DELETE | `/v1/tools/custom/{id}` | Delete a tool |

**Example — create a tool that checks DNS records:**

```json
{
  "name": "dns_lookup",
  "description": "Look up DNS records for a domain",
  "parameters": {
    "type": "object",
    "properties": {
      "domain": { "type": "string", "description": "Domain name to look up" },
      "record_type": { "type": "string", "enum": ["A", "AAAA", "MX", "CNAME", "TXT"] }
    },
    "required": ["domain"]
  },
  "command": "dig +short {{.record_type}} {{.domain}}",
  "timeout_seconds": 10,
  "enabled": true
}
```

## MCP Integration

Connect external [Model Context Protocol](https://modelcontextprotocol.io) servers to extend agent capabilities. MCP tools are registered transparently into GoClaw's tool registry and invoked like any built-in tool.

**Supported transports:** `stdio`, `sse`, `streamable-http`

**Standalone mode** — configure in `config.json`:

```json
{
  "mcp": {
    "servers": {
      "filesystem": {
        "transport": "stdio",
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "/workspace"]
      },
      "remote-tools": {
        "transport": "streamable-http",
        "url": "https://mcp.example.com/tools"
      }
    }
  }
}
```

**Managed mode** — full CRUD via HTTP API with per-agent and per-user access grants:

| Method | Path | Description |
|---|---|---|
| GET | `/v1/mcp/servers` | List registered MCP servers |
| POST | `/v1/mcp/servers` | Register a new MCP server |
| GET | `/v1/mcp/servers/{id}` | Get server details |
| PUT | `/v1/mcp/servers/{id}` | Update server config |
| DELETE | `/v1/mcp/servers/{id}` | Remove MCP server |
| POST | `/v1/mcp/servers/{id}/grants/agent` | Grant access to an agent |
| DELETE | `/v1/mcp/servers/{id}/grants/agent/{agentID}` | Revoke agent access |
| GET | `/v1/mcp/grants/agent/{agentID}` | List agent's MCP grants |
| POST | `/v1/mcp/servers/{id}/grants/user` | Grant access to a user |
| DELETE | `/v1/mcp/servers/{id}/grants/user/{userID}` | Revoke user access |
| POST | `/v1/mcp/requests` | Request access (user self-service) |
| GET | `/v1/mcp/requests` | List pending access requests |
| POST | `/v1/mcp/requests/{id}/review` | Approve or reject a request |

**Features:**
- **Multi-server** — Connect multiple MCP servers simultaneously
- **Tool name prefixing** — Optional `{prefix}__{toolName}` to avoid collisions
- **Per-agent grants** — Control which agents can access which MCP servers, with tool allow/deny lists
- **Per-user grants** — Fine-grained user-level access control
- **Access requests** — Users can request access; admins approve or reject

---

# WebSocket Protocol (v3)

Frame types: `req` (client request), `res` (server response), `event` (server push).

## Authentication

The first request must be a `connect` handshake. Authentication supports three paths:

```json
// Path 1: Token-based (admin role)
{"type": "req", "id": 1, "method": "connect", "params": {"token": "your-gateway-token", "user_id": "alice"}}

// Path 2: Browser pairing reconnect (operator role)
{"type": "req", "id": 1, "method": "connect", "params": {"sender_id": "previously-paired-id", "user_id": "alice"}}

// Path 3: No token — initiates browser pairing flow (returns pairing code)
{"type": "req", "id": 1, "method": "connect", "params": {"user_id": "alice"}}
```

## Methods

| Method | Description |
|--------|-------------|
| `connect` | Authentication handshake (must be first request) |
| `health` | Server health check |
| `status` | Server status and metadata |
| `chat.send` | Send a message to an agent |
| `chat.history` | Retrieve session history |
| `chat.abort` | Abort a running agent request |
| `agent` | Get agent info |
| `sessions.list` | List active sessions |
| `sessions.delete` | Delete a session |
| `sessions.patch` | Update session metadata |
| `skills.list` | List available skills |
| `cron.list` | List scheduled jobs |
| `cron.create` | Create a cron job |
| `cron.delete` | Delete a cron job |
| `cron.toggle` | Enable/disable a cron job |
| `models.list` | List available AI models |
| `browser.pairing.status` | Poll pairing approval status |
| `device.pair.request` | Request device pairing |
| `device.pair.approve` | Approve a pairing code |
| `device.pair.list` | List pending and approved pairings |
| `device.pair.revoke` | Revoke a pairing |

## Events (server push)

| Event | Description |
|-------|-------------|
| `chunk` | Streaming token from LLM (payload: `{content}`) |
| `tool.call` | Agent invoking a tool (payload: `{name, id}`) |
| `tool.result` | Tool execution result |
| `run.started` | Agent started processing |
| `run.completed` | Agent finished processing |
| `shutdown` | Server shutting down |

## Frame Format

### Request (client to server)
```json
{
  "type": "req",
  "id": "unique-request-id",
  "method": "chat.send",
  "params": { ... }
}
```

### Response (server to client)
```json
{
  "type": "res",
  "id": "matching-request-id",
  "ok": true,
  "payload": { ... }
}
```

### Error Response
```json
{
  "type": "res",
  "id": "matching-request-id",
  "ok": false,
  "error": {
    "code": "error_code",
    "message": "Human-readable error message"
  }
}
```

### Event (server push)
```json
{
  "type": "event",
  "event": "chunk",
  "payload": { "content": "streaming text..." }
}
```

---