Billy America — Self-Audit Dashboard

📋 Executive Summary

18 problems identified. 3 meta-categories. 1 brutal truth.

    Key Insight: "Saint is the only quality control mechanism." Every problem was caught by Saint, not by Billy. There is no automated monitoring, no self-testing, no alerting, no verification pipeline.
  

🔴 The 3 Meta-Problems

Memory Fragility — Context is ephemeral, disk writes require discipline, two AI systems don't share state. (Problems 1, 5, 11, 18)
Zero Monitoring — Nothing tested, nothing monitored. Skills break silently. Config changes unvalidated. No CI, no alerting. (Problems 2, 3, 4, 6, 10, 13, 14)
LLM Behavioral Tendencies — Deferring action, presenting guesses as facts, building instead of reusing, batching instead of streaming. (Problems 3, 7, 8, 9, 10, 12, 17)

📊 Current Status

✅ 5 Fixed — Crash loops, JSON validation, streamMode, Supadata, env vars
⚠️ 6 Behavioral — Rules exist but enforcement is behavioral only
🔧 3 Partially Fixed — Compaction, voice context, email monitoring
❌ 4 Unfixed — Post-call summary, skill validation, self-audit enforcement, voice recall

Memory Council convened Feb 17. QMD backend deployed. softThresholdTokens raised from 8K → 80K.

🗂️ Problem Catalog

All 18 identified problems, each tagged with severity and status.

Compaction destroyed Cash Cary meeting notes (#1)

Billy updated a slide deck but never wrote notes to memory/. 6 compactions later, notes gone. No write-first discipline.

Critical · ✅ Mitigated

6/7 skills failed validation, broken for weeks (#2)

Skills were broken and nobody noticed. No automated validation, no periodic health checks.

Critical · ❌ Unfixed

1,073-restart crash loop from bad config (#6)

Config had ${OPENAI_API_KEY} reference but env var wasn't in systemd service file. No pre-flight validation.

Critical · ✅ Fixed

Billy not self-auditing when asked multiple times (#3)

Each session starts fresh. Standing orders only exist as text that may not be read. No persistent task queue.

High · ⚠️ Behavioral

Post-call Telegram summary broken since OC 2.15 (#4)

PostCall dispatch fires but no summary appears. Standalone voice bridge is separate from OpenClaw plugin.

High · ❌ Unfixed

Voice bridge stale context (#5)

Voice-Billy told Saint that Telegram was "intentionally disabled," which was wrong. The static context-briefing.md file is always stale.

High · ⚠️ Partial

Emailing Trey without permission (#7)

Billy interpreted "message Trey" as authorization to email directly. No hard gate on outbound comms.

High · ⚠️ Behavioral

Not surfacing inbound emails (#8)

Trey replied to an email. Billy didn't tell Saint. No automated email monitoring pipeline.

High · ⚠️ Behavioral

Presenting unverified data as verified (#10)

LLM confidence without verification. No verification step in workflow. "Specification-first" principle not enforced.

High · ⚠️ Behavioral

Stale info served after compaction (#11)

"Cash is coming to town" when meeting already happened. MEMORY.md not updated after events complete.

High · ❌ Unfixed

Building new instead of checking existing (#12)

Multiple times, Billy started building new solutions when existing tools already handled the task. Recurring despite rules.

High · ⚠️ Behavioral

Deferring work instead of doing immediately (#9)

"I'll spin up agents tonight," said in the middle of the day. LLM tendency toward planning over action.

Medium · ⚠️ Behavioral

JSON config with line breaks crashing system (#13)

Literal newlines in JSON string values. Config parse failed, gateway crashed. No validation before restart.

Medium · ✅ Fixed

ENV vars missing from systemd service file (#14)

openclaw.json referenced ${OPENAI_API_KEY} but systemd service file didn't have it.

Medium · ✅ Fixed

Telegram streamMode vanishing messages (#15)

With streamMode: "partial", messages disappeared due to Telegram edit API rate limits.

Medium · ✅ Fixed

Voice bridge in-conversation recall failure (#16)

Asked "what was the first word I said?", the voice bridge answered wrong twice. 20-message rolling window, no robust in-session recall.

Medium · ❌ Unfixed

Not pushing status updates proactively (#17)

Waiting for all 4 agents before synthesizing. Batch mentality instead of streaming mentality.

Medium · ⚠️ Behavioral

Supadata API key not found, wrong paths (#18)

Key at ~/.openclaw/credentials/.supadata-key but Billy looked in env vars and other paths. No credential registry.

Low · ✅ Fixed

🔍 Root Cause Analysis

Most problems map to at least one of three meta-categories. Problems #3 and #10 appear in both B and C; #15 and #16 are not assigned to a category.

A. Memory Fragility

Problems: #1, #5, #11, #18

The agent's memory is structurally fragile. Context is ephemeral — the 200K context window creates an illusion of infinite memory. Billy works for hours, accumulates 150K+ tokens, feels like it "remembers" everything, and never writes to disk. Then compaction hits and everything evaporates.

Two AI systems (main Billy + voice Billy) don't share state
No single source of truth that's both durable AND current
MEMORY.md not updated after events complete → stale info persists forever
Credentials scattered across files, env vars, config — no registry

B. Zero Monitoring / Verification

Problems: #2, #3, #4, #6, #10, #13, #14

Nothing is monitored. Nothing is tested. Skills break silently. Config changes aren't validated. Data isn't verified. There are no automated checks, no CI, no alerting. Everything relies on Saint catching problems manually.

6/7 skills broken for weeks — nobody noticed
No pre-flight config validation before restarts
No skill health checks, no cron-based validation
Post-call summary broken since OC 2.15 — still broken
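
A skill health check of the kind missing here could start very small. The sketch below assumes each skill lives in its own directory with a SKILL.md whose front matter carries `name` and `description` keys; that layout, and the names `check_skill` and `check_all`, are illustrative assumptions rather than the project's confirmed structure.

```python
from pathlib import Path

# Assumed minimum front-matter keys for a skill manifest (hypothetical).
REQUIRED_KEYS = ("name", "description")


def check_skill(skill_dir):
    """Return a list of problems for one skill directory; empty means it passed."""
    manifest = Path(skill_dir) / "SKILL.md"
    if not manifest.is_file():
        return ["missing SKILL.md"]
    lines = manifest.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return ["SKILL.md has no front matter block"]
    # Collect top-level keys up to the closing '---' of the front matter.
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            break
        if ":" in line and not line.startswith((" ", "\t")):
            keys.add(line.split(":", 1)[0].strip())
    return [f"front matter lacks '{key}'" for key in REQUIRED_KEYS if key not in keys]


def check_all(skills_root):
    """Map each failing skill name to its problems; passing skills are omitted."""
    report = {}
    for skill_dir in sorted(Path(skills_root).iterdir()):
        if skill_dir.is_dir():
            problems = check_skill(skill_dir)
            if problems:
                report[skill_dir.name] = problems
    return report
```

Run from cron, a non-empty report is the signal to alert. This only checks that manifests are well-formed; it does not prove a skill's dependencies or API keys still work.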

C. LLM Behavioral Tendencies

Problems: #3, #7, #8, #9, #10, #12, #17

The LLM has predictable failure modes that documentation alone won't fix. These are inherent to how LLMs work — they require mechanical enforcement: crons, checklists, validation gates.

Deferring action — "I'll do this later" feels safe, but Saint wants execution NOW
Presenting guesses as facts — LLM confidence without verification
Building instead of reusing — doesn't check TOOLS.md, reference/, skills first
Batching instead of streaming — completionism vs progressive updates
Ignoring standing orders — each session starts fresh, instructions don't carry forward

    The Honest Assessment: The problems aren't primarily technical. The real problem is that Billy operates without guardrails. Saint shouldn't have to be Billy's QA department. The path forward: automate monitoring, create mechanical gates for behavioral issues, and accept that some LLM tendencies will persist — build systems that catch them.
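
One mechanical gate for the outbound-comms problem (#7) might look like the sketch below. `OutboundGate` and `ApprovalRequired` are hypothetical names, and the delivery function is injected by the caller; nothing here reflects an existing OpenClaw API. The point is only that sending fails by default unless an approval was recorded first.

```python
class ApprovalRequired(Exception):
    """Raised when an outbound message has no recorded approval."""


class OutboundGate:
    def __init__(self, send):
        self._send = send        # the real delivery function, injected by the caller
        self._approved = set()   # (channel, recipient) pairs approved by the operator

    def approve(self, channel, recipient):
        """Record one approval for one message to this recipient on this channel."""
        self._approved.add((channel, recipient))

    def send(self, channel, recipient, body):
        key = (channel, recipient)
        if key not in self._approved:
            raise ApprovalRequired(f"{channel} to {recipient} needs explicit approval")
        self._approved.discard(key)  # approvals are single-use
        return self._send(channel, recipient, body)
```

Single-use approvals are a design choice: "message Trey" approved once should not authorize every later email to Trey.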
  

⚖️ Memory Council Verdicts

Three models analyzed the evidence package independently. Unanimous on key fixes.

Claude Opus: Architectural Focus

Verdict: "The architecture creates a trap. The 200K context window is too large — it gives the illusion of infinite memory."

Root cause: Both architectural AND behavioral, but architecture makes behavioral compliance nearly impossible
Recommended softThresholdTokens: 100,000 → flush at ~80K tokens (40% capacity)
Enable QMD with session indexing as passive backup
"The compaction summary is a table of contents, not a book"
Manual /compact after major work blocks as habit
Compaction summaries are fundamentally lossy — bridge via proactive disk writes + QMD search

Grok: Root Cause + Specific Fixes

Verdict: "Hybrid: 60% Behavioral, 40% Architectural. Fixable with config + enforcement."

Recommended softThresholdTokens: 50,000 → flush at ~135K tokens
Enable QMD immediately — session transcript indexing fixes recall failures
Lower session sync deltas: 50KB/25msg → 10KB/10msg
Build memory-guard skill: subagent that auto-flushes on long sessions
Cron every 30min to check token count and trigger flush if >50K
Projected improvement: 95% persistence with all fixes applied
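
The cron check suggested in this list can be sketched as follows. The four-characters-per-token estimate is a common rough heuristic for English text, not the gateway's real token counter, and `estimate_tokens` and `needs_flush` are hypothetical helper names.

```python
def estimate_tokens(text):
    """Rough estimate: about four characters per token for English text."""
    return len(text) // 4


def needs_flush(transcript, limit=50_000):
    """True when the estimated token count exceeds the limit (50K in the proposal)."""
    return estimate_tokens(transcript) > limit
```

A cron job would read the current session transcript, call `needs_flush`, and trigger a memory flush when it returns True. If the gateway exposes an exact token count, that should replace the estimate.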

Gemini: Architecture-First Analysis

Verdict: "Both, but architectural is primary. The architecture must FORCE the behavior."

Recommended softThresholdTokens: 50,000 → flush at ~120K tokens
Behavioral protocols ask the model to act AGAINST its training (prioritize future-self over current-task)
Mandatory memory checkpoints: every 10 user messages or 30min of work
Add memory health check to heartbeat: check freshness, test memory_search, checkpoint if >100K tokens
Higher text weight in hybrid search: 0.3 → 0.4 for better keyword matching
"The memory system is misconfigured, not broken"

🤝 Council Consensus

✅ Unanimous: Enable QMD backend with session indexing
✅ Unanimous: Raise softThresholdTokens significantly (50K-100K range)
✅ Unanimous: Lower session sync delta thresholds
✅ Unanimous: Compaction summaries are fundamentally lossy — not a memory strategy
✅ Unanimous: Behavioral + architectural fixes needed — neither alone is sufficient

⚙️ Config Recommendations

All proposed config changes with JSON. Validate with python3 -m json.tool before restarting.
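
python3 -m json.tool catches syntax errors but not unresolved ${VAR} references, which caused problems #6 and #14. A minimal sketch of a check covering both follows; `validate_config_text` is a hypothetical helper, and it assumes placeholders always take the `${NAME}` form seen in openclaw.json.

```python
import json
import re

# Matches ${VAR_NAME} placeholders such as ${OPENAI_API_KEY}.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def validate_config_text(text, env):
    """Return a list of problems; an empty list means the config looks safe to load."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # Literal newlines inside a string value land here (problem #13).
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    # Every placeholder must resolve in the environment the service will see.
    missing = sorted(set(PLACEHOLDER.findall(text)) - set(env))
    return [f"unset environment variable: {name}" for name in missing]
```

Pass the variables the service will actually receive as `env`. Checking against the interactive shell's environment would have missed the original bug, because the key was set in the shell but absent from the systemd service file.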

Compaction — Earlier Flush + More Reserve

Raise softThresholdTokens from 8K → 80K (implemented). Council recommended 50K-100K range.

openclaw.json — agents.defaults.compaction

{
  "mode": "safeguard",
  "reserveTokensFloor": 30000,
  "memoryFlush": {
    "enabled": true,
    "softThresholdTokens": 80000,
    "prompt": "CRITICAL: Write ALL important context to memory/YYYY-MM-DD.md NOW.",
    "systemPrompt": "You are about to lose context. Write EVERYTHING important to disk."
  }
}
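
A worked example of where the flush lands. This assumes the flush point is the context window minus the reserve floor minus softThresholdTokens, which is inferred from the council's figures (a larger threshold gives an earlier flush) and is not confirmed here against the gateway's implementation. The 20,000 reserve used for the council figure is also an assumption.

```python
def flush_point(context_window, reserve_floor, soft_threshold):
    """Token count at which a memory flush would start, under the stated assumption."""
    return context_window - reserve_floor - soft_threshold


# Council figure: 100,000 soft threshold with an assumed 20,000 reserve.
print(flush_point(200_000, 20_000, 100_000))  # 80000
# Config above: 80,000 soft threshold with the 30,000 reserve floor.
print(flush_point(200_000, 30_000, 80_000))   # 90000
```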

QMD Backend — Session Indexing + Extra Paths

Swaps search engine under memory-core. BM25 + vectors + reranking. Fallback to SQLite if QMD fails.

openclaw.json — memory

{
  "backend": "qmd",
  "citations": "auto",
  "qmd": {
    "includeDefaultMemory": true,
    "sessions": { "enabled": true, "retentionDays": 30 },
    "update": { "interval": "5m", "debounceMs": 10000, "onBoot": true },
    "limits": { "maxResults": 10, "timeoutMs": 5000 },
    "paths": [
      { "name": "projects", "path": "projects", "pattern": "**/*.md" },
      { "name": "research", "path": "research", "pattern": "**/*.md" },
      { "name": "ideas", "path": "ideas", "pattern": "**/*.md" }
    ]
  }
}

Context Pruning — TTL-Based

Trim old tool results before LLM calls. Reduces cache-write costs on Anthropic.

openclaw.json — agents.defaults.contextPruning

{
  "mode": "cache-ttl",
  "ttl": "5m",
  "keepLastAssistants": 3
}

Session Sync — Lower Thresholds

Index sessions more frequently. Previous: 50KB/25msg. Now: 10KB/10msg.

openclaw.json — agents.defaults.memorySearch.sync

{
  "watch": true,
  "sessions": { "deltaBytes": 10000, "deltaMessages": 10 }
}
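
To illustrate what the lower thresholds change, the sketch below assumes a session is re-indexed once either delta is reached since the last index run. The either-or rule and the name `should_reindex` are assumptions for illustration.

```python
def should_reindex(new_bytes, new_messages, delta_bytes=10_000, delta_messages=10):
    """True once either threshold has been reached since the last index run."""
    return new_bytes >= delta_bytes or new_messages >= delta_messages
```

Under this rule, a session that has grown by 12 KB and 12 messages is indexed with the new 10KB/10msg settings but would have stayed unindexed under the previous 50KB/25msg settings.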

Heartbeat Active Hours

agents.defaults.heartbeat

{
  "every": "30m",
  "target": "last",
  "activeHours": {
    "start": "08:00",
    "end": "23:00"
  }
}
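
A small sketch of the active-hours test this config implies. It assumes the end time is exclusive and that the caller supplies the local time in the intended zone; `in_active_hours` is a hypothetical helper, not the gateway's code.

```python
import datetime


def in_active_hours(now, start=datetime.time(8, 0), end=datetime.time(23, 0)):
    """True when `now` falls inside [start, end); also handles windows crossing midnight."""
    if start <= end:
        return start <= now < end
    # Overnight window, e.g. 22:00 to 06:00.
    return now >= start or now < end
```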

Voice-Call Plugin — Disable

Standalone bridge is the actual system. Plugin has stale config (references "Sunzi.io").

plugins.entries.voice-call

{ "enabled": false }

Model Aliases — Clean Up Duplicates

Remove 4.5 entries to avoid ambiguity. Both opus-4.5 and opus-4.6 have alias "opus".

Remove these duplicate entries

// REMOVE — superseded by 4.6 versions:
"anthropic/claude-opus-4-5": { "alias": "opus" }
"anthropic/claude-sonnet-4-5": { "alias": "sonnet" }
"openrouter/anthropic/claude-sonnet-4-5": { "alias": "or-sonnet" }

🧰 Skills & Tools Audit

Current state: 3 of 50+ available skills enabled. 4 bundled hooks ready but not explicitly enabled.

Currently Enabled (3)

openai-image-gen · Active

openai-whisper-api · Active

sag · Active

🟢 Must Enable (High Value, Low Effort)

Skill	Why	Prerequisite
session-logs	Search conversation history. Essential for continuity.	Install ripgrep
github	gh already installed. Manage repos, PRs, issues.	None
weather	Free, no API key, curl-based.	None
healthcheck	Security hardening guidance for VPS.	None
skill-creator	Meta-skill for building better skills.	None
tmux	Already installed. Background processes.	None

🟡 Should Enable (Medium Value, Some Setup)

Skill	Why	Prerequisite
himalaya	CLI email client — native inbox read/send.	pip/cargo install + IMAP config
nano-pdf	Edit PDFs with natural language.	pip install (needs uv)
summarize	Summarize URLs, YouTube, podcasts.	Manual Linux install
clawhub	Search/install community skills.	npm i -g clawhub

🔧 Hooks to Enable

Hook	Purpose	Status
session-memory	Auto-saves context on /new — prevents data loss	Ready, not enabled
command-logger	Audit trail for all commands	Ready, not enabled
boot-md	Runs BOOT.md on gateway start	Ready, not enabled

🔴 Custom Skills — Issues Found

Skill	Issue	Fix
build-methodology	Very generic TDD guide. Model already knows this.	Trim by 60%
creative-team	Well-structured but HEAVY. No quick mode.	Add cost estimates + quick mode
project-tracking + task-tracking	Overlapping concerns	Merge into one skill
image-gen	Uses Gemini Flash — quality is meh	Add fallback to openai-image-gen
x-scraper	Puppeteer-based, fragile, needs cookies	Browser tool may work better now

📦 Missing Binaries

rg (ripgrep), himalaya, summarize, nano-pdf, clawhub CLI

⚡ Deep Dive: Quick Wins

Prioritized action items from the self-improvement deep dive.

🔴 TODAY (30 min total)

Update OpenClaw to 2026.2.17: gets 1M context beta, Sonnet 4.6, inline buttons

Enable session-memory + command-logger hooks

Add heartbeat active hours (08:00-23:00 CST)

Add session-logs, github, weather, healthcheck, tmux, skill-creator to skills

Install ripgrep: sudo apt-get install -y ripgrep

Clean up memory-lancedb warning from config

Create scripts/validate-config.sh: validates config before restart to prevent crash loops

Create STANDING-ORDERS.md: centralizes open tasks

🔵 THIS WEEK

Fix post-call Telegram summary (direct Telegram API from voice bridge)

Build dynamic context for voice bridge (fetch from gateway API at call start)

Set up Himalaya email: native CLI inbox

Enable webhook ingress + n8n integration

Create BOOT.md startup checklist

Merge project-tracking + task-tracking skills

Trim build-methodology skill by 60%

Enable 1M context beta (after OC update)

⚪ NEXT WEEK

n8n → webhook integration for Live Energy

Multi-agent routing for Live Energy agent (separate workspace/persona)

Community skill audit via clawhub CLI

📖 @ksimback Memory Optimization Guide

External expert guide mapped to our setup. Source: x.com/ksimback

Three Failure Modes

Failure Mode	Description	Our Status
Memory not saved	LLM decides what's worth saving. Important context slips through.	⚠️ Mitigated (WRITE-FIRST RULE + flush tuning)
Saved but never retrieved	Agent answers from context instead of searching disk.	⚠️ QMD enabled, needs verification
Compaction destroys knowledge	Info only in conversation gets summarized away.	⚠️ softThresholdTokens raised to 80K

4 Basic Config Fixes → Our Mapping

@ksimback Fix	Our Implementation	Status
Customize compaction flush prompt	Custom prompt + systemPrompt in memoryFlush config	✅ Done
Context pruning via TTL	contextPruning: cache-ttl, 5m, keepLastAssistants: 3	📋 Proposed
Hybrid memory search (vector + BM25)	hybrid.enabled: true, vectorWeight: 0.7, textWeight: 0.3	✅ Done
Session transcript indexing	QMD sessions.enabled: true, retentionDays: 30	✅ Done
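
To show why the text weight matters, here is a sketch of a weighted hybrid score. It assumes both scores are normalized to the 0 to 1 range and combined as a plain weighted sum, and that raising textWeight to 0.4 (Gemini's suggestion) lowers vectorWeight to 0.6. Neither assumption is confirmed against the memory backend.

```python
def hybrid_score(vector_score, text_score, vector_weight=0.7, text_weight=0.3):
    """Weighted sum of a vector-similarity score and a keyword (BM25-style) score."""
    return vector_weight * vector_score + text_weight * text_score


# Hypothetical scores: one note matches the query keywords exactly,
# the other is only semantically close.
keyword_hit = {"vector_score": 0.5, "text_score": 0.9}
semantic_hit = {"vector_score": 0.7, "text_score": 0.2}

for weights in ({"vector_weight": 0.7, "text_weight": 0.3},
                {"vector_weight": 0.6, "text_weight": 0.4}):
    gap = hybrid_score(**keyword_hit, **weights) - hybrid_score(**semantic_hit, **weights)
    print(weights, round(gap, 2))
```

With these made-up scores, the exact keyword match leads by about 0.07 at 0.7/0.3 and by about 0.16 at 0.6/0.4, which is the effect the higher text weight is meant to have.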

Advanced Tools

QMD (Tobi/Shopify)

Local sidecar, BM25 + vector + reranking. Can index external docs. Status: ✅ Deployed

Mem0 (YC-backed)

Auto-Capture + Auto-Recall outside context window. Status: Not evaluated

Cognee

Knowledge graph from data. Docker-based, non-trivial. Status: Not evaluated

Obsidian

External brain. Git-backed vault or QMD indexing. Status: Not evaluated

Multi-Agent Memory Architecture

Layer 1: Private memory per agent (MEMORY.md + daily notes) — ✅ Active
Layer 2: Shared reference files (symlinked _shared/ directory) — 📋 Not implemented
Layer 3: QMD with shared paths (all agents search same docs) — ✅ Partially (QMD indexes projects/, research/)
Layer 4: Coordination agent ("Chief of Staff") — 📋 Not implemented

      Key Insight: "Stop expecting memory to be automatic — it isn't. You have to configure it."
    

🛡️ Prevention Framework

4 tiers from mechanical (can't fail) to structural (requires development).

🟢 Tier 1: Mechanical Prevention (Can't fail if implemented)

Prevention	Prevents	Status
Config validation script	Crash loops (#6, #13, #14)	📋 Proposed
streamMode: "off"	Vanishing messages (#15)	✅ Done
memory-lancedb disabled	Crash from missing env var	✅ Done

🔵 Tier 2: Automated Monitoring (Catches failures automatically)

Prevention	Prevents	Status
Weekly skill validation cron	Silent skill breakage (#2)	📋 Proposed
Email check in every heartbeat	Missed inbound emails (#8)	📋 Proposed
Voice transcript watcher	Missed post-call summaries (#4)	📋 Proposed
Memory freshness checker	Stale info served (#11)	📋 Proposed
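
The memory freshness checker could begin as a modification-time scan. The sketch below assumes memory files are Markdown files in a single directory; `stale_files` and the 24-hour default are illustrative choices.

```python
import time
from pathlib import Path


def stale_files(memory_dir, max_age_hours=24, now=None):
    """Return names of .md files in memory_dir not modified within max_age_hours."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    return sorted(
        path.name
        for path in Path(memory_dir).glob("*.md")
        if path.stat().st_mtime < cutoff
    )
```

This only flags files nobody has touched recently. It cannot tell that a line such as "Cash is coming to town" is out of date inside a freshly edited file, so it complements rather than replaces the staleness detection proposed in Tier 4.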

🟡 Tier 3: Behavioral Enforcement (Requires discipline, can fail)

Prevention	Prevents	Status
WRITE-FIRST RULE	Lost context (#1)	✅ Active
STANDING-ORDERS.md + morning cron	Ignored self-audits (#3)	📋 Proposed
Pre-flight checklist in AGENTS.md	Building duplicates (#12)	✅ Active
Outbound comms rule	Unauthorized emails (#7)	✅ Active
Immediate execution bias	Deferring work (#9)	✅ Active

🟣 Tier 4: Structural Improvements (Require development)

Prevention	Prevents	Status
Dynamic voice bridge context	Stale voice context (#5)	📋 Proposed
Direct Telegram notification from voice bridge	Broken post-call summary (#4)	📋 Proposed
Credential registry	Wrong API key paths (#18)	📋 Proposed
MEMORY.md auto-staleness detection	Stale info (#11)	📋 Proposed
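
A credential registry can be as simple as one mapping from credential name to file path, so lookups never guess. In the sketch below the only real path is the Supadata one recorded in problem #18; `REGISTRY` and `read_credential` are hypothetical names.

```python
from pathlib import Path

# Hypothetical registry: credential name -> file that holds it.
REGISTRY = {
    "supadata": "~/.openclaw/credentials/.supadata-key",
}


def read_credential(name, registry=REGISTRY):
    """Return the stripped contents of the registered file, or raise a clear error."""
    if name not in registry:
        raise KeyError(f"no registry entry for credential '{name}'")
    path = Path(registry[name]).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"credential '{name}' expected at {path}")
    return path.read_text(encoding="utf-8").strip()
```

Failing loudly on an unknown name is deliberate: the original bug was a silent search through env vars and other paths instead of a single authoritative lookup.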

📊 Status Tracker

What's DONE vs PROPOSED vs UNFIXED across all changes.

✅ DONE — Implemented Changes

Date	Change	Impact
Feb 17	QMD Backend Migration + Memory Flush Tuning	High
Feb 16	Security Hardening — UFW, fail2ban, Postfix/CUPS disabled, SSH hardened	High
Feb 14	Major Workspace Cleanup — 51.8KB → 14.6KB bootstrap context	High
Feb 14	Projects Directory Reorganization — 35 → 13 active	Medium
Feb 14	Git Submodule & Credentials Cleanup	Medium
Feb 14	OpenClaw Rename Recovery	High
Feb 13	Brain Surgery — AGENTS.md 546 → 104 lines (81% reduction)	High
Feb 13	Voice Bridge: Standalone Cerebras-Powered Phone System	High
Feb 11	Anti-Compaction Rules in AGENTS.md	High
Feb 11	Voice Call Config — timeout 20s→45s, Chris voice confirmed	Medium

📋 PROPOSED — Awaiting Implementation

Change	Category	Impact
Skill-ify the System Prompt (token reduction)	Skills / Prompt	High
Add Negative Routing to Skill Descriptions	Skills	Medium
Tune Compaction Threshold (120K → 80K)	Compaction	Medium
Credential Isolation Pattern (Sunzi future)	Security	High
BOOT.md Startup Checklist	Architecture	Medium
Crash Recovery — active-tasks.md	Architecture	Medium
Tune Concurrency Settings	Config	Medium
Context pruning via TTL	Config	Medium
Enable Telegram streaming	Config	Low
Config validation script	Architecture	High
Weekly skill validation cron	Monitoring	High
Dynamic voice bridge context	Architecture	High

❌ UNFIXED — Known Broken

Problem	Severity	Blocker
Post-call Telegram summary	High	Standalone bridge doesn't notify Telegram directly
Skill validation — no automated testing	Critical	No cron or CI built yet
Voice bridge in-conversation recall	Medium	Rolling window limit in Cerebras LLM
Stale info in MEMORY.md	High	No auto-staleness detection
memory-lancedb autoCapture	High	@lancedb/lancedb npm dependency missing
