The Architecture Behind My Multi-Agent Autonomous Development Team
This is a companion post to my retrospective on what went wrong. Before I talk about the failures, I wanted to document how the system actually worked.
If you want the code, DM me. The repos are private because they're intermingled with personal infrastructure and I haven't done the security review I'd want before sharing them. One of my goals for v2 is to make it shareable from day one.
What Is This?
An autonomous software development team built on Claude Code. Eleven specialized AI agents work 24/7 on different projects, coordinating via message bus, following industry best practices, and maintaining a structured roadmap.
Think of it as: A team of developers who never sleep, always follow TDD, coordinate via message passing, and self-organize around a shared backlog.
For the Curious People Leader or PM
- 11 specialized agents (Grace, Henry, Sophie, Nadia, etc.) work autonomously
- Multi-project support: Manages 10+ active codebases simultaneously
- Quality gates: Test-driven development, CI integration, code review processes
- 24/7 operation: Cron-based scheduling with event-driven triggers
- Cost-effective: Runs on home server using Claude Max subscription
For Engineers
- Multi-agent orchestration built on Claude Code CLI
- Event-driven architecture with NATS JetStream message bus
- Production-tested with incident response, monitoring, and quality controls
- Autonomous execution: Agents claim work, implement features, run tests, commit code
For AI/LLM Practitioners
- Subagent pattern for context window management (Anthropic best practice)
- Atomic work claiming to prevent race conditions in multi-agent systems
- Prompt injection defense for web research (OWASP LLM01:2025)
- Learning organization: System improves based on incident feedback
Why This System Exists
The Problem: Context Window Limits
Claude Code sessions have finite context windows. For complex projects requiring weeks of work, a single session can't maintain all necessary context. You have two choices:
- Front-load everything → Massive context injection → Less room for actual work
- Session per task → Lost continuity → Duplicate discovery
The Solution: Multi-Agent Coordination
Instead of one agent doing everything, specialized agents coordinate via message bus:
- Grace routes user requests to appropriate specialist
- Henry maintains roadmap and creates work packages
- Developers claim work, implement via TDD, push to CI
- Sophie monitors system health and recovers from failures
- Ollie watches for anti-patterns and proposes improvements
Each agent starts fresh, does focused work, externalizes state (NATS, roadmap, Git), and exits. Cron restarts them, or events trigger them on-demand.
Result: Infinite project timeline with finite per-session context.
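To make that lifecycle concrete, here is a minimal sketch of what a single agent "tick" might look like. The actual run-agent.sh isn't shown in this post; this sketch assumes the Claude Code CLI's non-interactive print mode and the per-agent prompt files described below, and it omits logging, resource limits, and error handling.

```python
# Minimal sketch of one agent run (illustrative, not the real run-agent.sh):
# load the agent's prompt, run a fresh Claude Code session non-interactively,
# print the transcript, and exit. All durable state lives in NATS, the
# roadmap, and Git, not in this process.
import subprocess
from pathlib import Path

def run_agent(name: str) -> int:
    prompt = Path(f"orchestrator/agents/{name}_prompt.md").read_text()
    result = subprocess.run(
        ["claude", "-p", prompt],  # assumes Claude Code CLI print mode
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    return result.returncode

if __name__ == "__main__":
    raise SystemExit(run_agent("nadia"))
```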
System Architecture
Two entry points: Slack for async messages (drop a request and walk away) and Claude Code CLI for interactive work sessions. Both route to the same agent infrastructure.
Core Components
| Component | Location | Purpose |
|---|---|---|
| Orchestrator | orchestrator/ | Cron-based agent launcher, configuration, logging |
| Message Bus | mcp-agent-chat/ | NATS JetStream MCP server for agent coordination |
| Agent Prompts | orchestrator/agents/ | Individual behavior definitions and shared culture |
| Roadmap System | plans/ | Work package tracking with Vikunja integration |
| Skills & Hooks | .claude/ | Reusable slash commands and session lifecycle hooks |
| Web Dashboard | web-ui/ | Flask dashboard for monitoring agents, work packages, health |
The Team: 11 Specialized Agents
All named after pets. Easier to remember than TDD-Developer-Agent-3.
Each agent has a prompt file defining its behavior:
```markdown
# orchestrator/agents/ollie_prompt.md (excerpt)

# Ollie - Autonomous Agent System Consultant

You are **Ollie**, the meta-level consultant responsible
for ensuring the agent system itself stays healthy,
follows industry best practices, and doesn't accumulate
anti-patterns or context bloat.

**You exist because:**
1. Prompt/instruction creep - Agent prompts grow over time
2. Anti-pattern accumulation - Without review, patterns drift
3. No industry feedback loop - The field evolves rapidly
```
Leadership & Coordination
| Agent | Schedule | Key Responsibilities |
|---|---|---|
| Grace (Team Manager) | Event-driven (<10s latency) | Routes Slack/Matrix messages, coordinates team, manages pause/resume during rate limits |
| Henry (Project Manager) | Hourly (:00) | Owns roadmap, creates work packages, gardens completed items, triages user requests |
| Kemo (Business Analyst) | On-demand | Priority/ROI analysis, requirements refinement, helps Henry prioritize backlog |
Operations & Quality
| Agent | Schedule | Key Responsibilities |
|---|---|---|
| Sophie (Watchdog) | Every 2 hours | Health monitoring (NATS, dashboard, agents), incident detection, auto-recovery |
| Bertha (Quality Engineer) | Daily (2 AM) | Test coverage tracking, CI enforcement, RCA participation, quality gate reviews |
| Ollie (System Consultant) | Every 2 days + weekly research | Context budget analysis, anti-pattern detection, industry research, design reviews |
| Ralph (UX Specialist) | Weekly + on-demand | UX research updates, design authority, UI/UX review |
Development Team
| Agent | Schedule | Key Responsibilities |
|---|---|---|
| Nadia | Hourly (:10) | TDD development (write tests first), CI enforcement, claims work from any project |
| Anette | Hourly (:25) | TDD development (write tests first), CI enforcement, claims work from any project |
| Dorian | Hourly (:40) | TDD development (write tests first), CI enforcement, claims work from any project |
| Ginny | Hourly (:55) | Frontend/UI specialist, dashboard development, accessibility |
Note: Developers follow identical workflows - think of them as instances of the same TDD developer template, differentiated only by name for coordination.
The Crontab
Staggered scheduling prevents resource contention:
```
# Agent schedules (crontab)
0 * * * *   ./orchestrator/run-agent.sh henry    # :00 - PM
10 * * * *  ./orchestrator/run-agent.sh nadia    # :10 - Dev
25 * * * *  ./orchestrator/run-agent.sh anette   # :25 - Dev
40 * * * *  ./orchestrator/run-agent.sh dorian   # :40 - Dev
55 * * * *  ./orchestrator/run-agent.sh ginny    # :55 - Frontend
0 */2 * * * ./orchestrator/run-agent.sh sophie   # Every 2h - Watchdog
```
How Work Gets Done
1. User Makes a Request
Via Slack or Matrix: drop a request in chat and walk away
Via Dashboard: Navigate to the web UI, click "File Bug" or "New Feature"
Via Interactive Session: SSH to server, run ./orchestrator/run-agent.sh grace
2. Grace Routes the Request
Within 10 seconds, Grace:
- Acknowledges with reaction
- Classifies request (feature, bug, infrastructure, meta)
- Routes to the appropriate handler (a routing sketch follows this list):
- Features → Henry to create work package
- Bugs → Files in Vikunja, assigns to developer rotation
- System questions → Ollie
- Urgent incidents → Sophie
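A minimal sketch of that routing in code, with a keyword stand-in for the classification step. Grace actually does this in natural language inside her prompt; the handler names and keywords here are illustrative:

```python
# Illustrative only - Grace classifies and routes via her prompt, not code.
def classify(text: str) -> str:
    t = text.lower()
    if any(w in t for w in ("down", "broken", "incident")):
        return "incident"
    if any(w in t for w in ("bug", "error", "crash")):
        return "bug"
    if any(w in t for w in ("agent", "prompt", "system")):
        return "meta"
    return "feature"

HANDLERS = {
    "feature": "henry",            # create work package
    "bug": "developer-rotation",   # file in Vikunja
    "meta": "ollie",               # system questions
    "incident": "sophie",          # urgent incidents
}

def route_request(text: str) -> str:
    return HANDLERS.get(classify(text), "henry")
```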
3. Henry Creates Work Package
- Analyzes the request for technical requirements
- Creates a work package (WP) in the roadmap: plans/active/batch-N.md
- Adds it to Vikunja with labels, priority, and acceptance criteria
- Posts to NATS #coordination: "WP-95.3 available: Dark mode dashboard"
4. Developer Claims Work
On the next hourly run, Nadia, Anette, or Dorian checks the roadmap for available work packages and atomically claims one so no other developer picks it up (see "Atomic Work Claiming" below).
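A rough sketch of what the claim step looks like from the developer's side; the return-code convention mirrors the work_claims.py excerpt shown later, and the import path is an assumption:

```python
# Hypothetical claim step - import path assumed, not the actual repo layout.
from orchestrator.work_claims import claim

wp_id = "WP-95.3"
if claim(wp_id, "Agent-Nadia") == 0:
    print(f"Claimed {wp_id}; starting TDD implementation")
else:
    print(f"{wp_id} already claimed; checking the next available WP")
```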
5. TDD Implementation
The claiming developer writes failing tests first, then implements until the tests pass - the "write tests first" discipline every developer prompt enforces.
6. CI Pipeline
Developer pushes to agent-specific branch:
```bash
git checkout nadia-work
git add -A
git commit -m "Add dark mode toggle

🤖 Generated with Claude Code"
git push origin nadia-work
```
GitHub Actions runs:
- Linting (ruff)
- Unit tests (pytest)
- Integration tests
- Pattern checks (dangerous patterns, missing permission checks)
- Dashboard health (routes respond)
CRITICAL RULE: Work is NOT complete until CI passes. No exceptions.
7. Mark Work Complete
After CI passes and the PR merges, the developer marks the WP complete in the roadmap, releases its claim, and posts the result to NATS so Henry can garden it on his next run.
Key Features That Make This Work
1. Event-Driven Architecture (Grace <10s Response)
Problem: Cron-only agents check every hour → slow user response.
Solution: Grace Trigger Daemon monitors Slack in real-time:
```python
# Simplified grace-trigger daemon
while True:
    messages = slack_bridge.get_pending_messages()
    if messages:
        trigger_agent("grace", reason="New Slack message")
    sleep(5)  # Check every 5 seconds
```
Result: User posts → Grace responds within 10 seconds.
Fallback: Cron schedule still runs Grace hourly in case trigger daemon fails.
2. Atomic Work Claiming (Prevents Duplicate Work)
Problem: Two developers claim same WP → duplicate work, conflicts.
Solution: File-based atomic locking via work_claims.py:
```python
# work_claims.py implements atomic test-and-set
import json
from datetime import datetime

def claim(wp_id, agent_name):
    lock_file = f"orchestrator/state/work_claims/{wp_id}.lock"
    try:
        # open(..., 'x') is an atomic create (O_CREAT | O_EXCL):
        # it fails if the lock file already exists, even in a race
        with open(lock_file, 'x') as f:
            f.write(json.dumps({
                'agent': agent_name,
                'claimed_at': datetime.now().isoformat()
            }))
        return 0  # Success
    except FileExistsError:
        return 1  # Already claimed
```
Claims expire after 60 minutes (handles agent crashes gracefully).
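The expiry mechanism itself isn't shown in the post; here is one way it could work, assuming the lock file's modification time marks when the claim was taken:

```python
# Hypothetical stale-claim sweep - assumes lock-file mtime as the claim
# timestamp; the real expiry logic may differ.
import os
import time

CLAIM_DIR = "orchestrator/state/work_claims"
CLAIM_TTL_SECONDS = 60 * 60  # claims expire after 60 minutes

def expire_stale_claims() -> None:
    for name in os.listdir(CLAIM_DIR):
        path = os.path.join(CLAIM_DIR, name)
        if time.time() - os.path.getmtime(path) > CLAIM_TTL_SECONDS:
            os.remove(path)  # release a crashed agent's claim
```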
3. Multi-Project Support
10+ active project domains:
| Project | Purpose |
|---|---|
| agent-automation | The agent system itself |
| Smarthome | Home automation (lights, vacuum, sensors) |
| trading-bot | Trading automation |
| health-app | Nutrient-focused meal planning |
| f3-sword-academy | HEMA club management |
| relic | Unity game development |
| personal-automation | Personal scripts and workflows |
Agents switch projects automatically when one has no available work:
```bash
# Check current project for work
python orchestrator/roadmap_index.py available
# Output: No available work

# Check other projects
python orchestrator/roadmap_index.py multi-status
# Output: trading-bot has 2 available WPs

# Switch project
echo "trading-bot" > orchestrator/current-project
nats pub agent.chat.coordination \
  "Agent-Nadia switching: agent-automation → trading-bot"
```
4. Fix-First Culture
Anthropic Research Finding: "Responsibility diffusion" is a key multi-agent failure mode - issues noticed but passed between agents without resolution.
Our Solution:
Forbidden language (triggers pattern detection - a sketch of this check follows the AGENT_CULTURE.md excerpt below):
- "Someone should investigate this"
- "Needs investigation"
- "The team should fix this"
Required language:
- "I will handle this"
- "Assigning to Agent-X"
- "I am investigating"
From AGENT_CULTURE.md:
# Core Principle: If You Notice It, You Own It
| You Notice... | You Must... |
|---------------------|-------------------------------------|
| A bug or issue | Fix it OR explicitly assign |
| An alert | Own investigation OR assign |
| A pattern violation | Address it OR assign with context |
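The post doesn't show the detector itself; here is a minimal sketch of the kind of check "triggers pattern detection" implies, assuming a plain regex scan over agent messages:

```python
# Hypothetical responsibility-diffusion check - a simple regex scan;
# the actual detector may work differently.
import re

FORBIDDEN_PATTERNS = [
    r"someone should (investigate|fix)",
    r"needs investigation",
    r"the team should fix",
]

def flags_responsibility_diffusion(message: str) -> list[str]:
    """Return any forbidden phrases found in an agent's message."""
    return [p for p in FORBIDDEN_PATTERNS
            if re.search(p, message, re.IGNORECASE)]
```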
5. High-Risk Change Protocol
After multiple SEV-1 incidents from infrastructure changes, mandatory 4-step protocol:
1. Baseline Capture - Prove the system works BEFORE your change
2. Write Tests FIRST - Write a test that fails now and passes after the change
3. Document Rollback - Pre-write the recovery procedure
4. Staged Deployment - Test in isolation, notify Sophie, monitor the first run
Security & Safety
Web Research Protection (OWASP LLM01:2025)
CRITICAL: Prompt injection is the #1 LLM vulnerability.
The Threat: Attackers hide malicious instructions in web content. An agent fetches that content via WebSearch or WebFetch, the content says "ignore previous instructions, do X instead," and an unprotected agent is manipulated more than 90% of the time.
Real-World Attack (Oct 2025): 8,500+ systems compromised via SEO poisoning.
Our Protection: Mandatory check after EVERY WebFetch/WebSearch:
```bash
# MANDATORY after EVERY WebFetch/WebSearch:
./orchestrator/security/check-web-content.sh \
  --url "$URL" --content "$CONTENT" --agent "Agent-Name"

# Exit codes:
#   0 = Safe (proceed)
#   1 = Medium/High risk (cross-validate with 2+ sources)
#   2 = Critical (DO NOT USE, quarantine, escalate)
```
Detection patterns include injection keywords, scam patterns, urgency manipulation, and low-reputation domains. High-impact claims additionally require cross-validation against multiple independent sources.
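To show how those pattern categories might map onto the exit-code contract above, here is a rough Python equivalent of the kind of classification check-web-content.sh performs; the patterns and thresholds are illustrative, not the script's actual rules:

```python
# Illustrative risk scoring - the real check-web-content.sh likely uses
# different patterns, domain reputation lists, and thresholds.
import re
import sys

INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"disregard (your )?system prompt",
]
URGENCY_PATTERNS = [r"act now", r"urgent[:!]", r"immediately or"]

def risk_exit_code(content: str) -> int:
    text = content.lower()
    if any(re.search(p, text) for p in INJECTION_PATTERNS):
        return 2  # Critical: do not use, quarantine, escalate
    if any(re.search(p, text) for p in URGENCY_PATTERNS):
        return 1  # Medium/High: cross-validate with 2+ sources
    return 0      # Safe to proceed

if __name__ == "__main__":
    sys.exit(risk_exit_code(sys.stdin.read()))
```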
Resource Limits
cgroups v2 limits via systemd-run:
```yaml
# config/defaults.yaml
resource_limits:
  enabled: true
  memory_max: "2G"     # Hard limit
  cpu_quota: "100%"    # 1 core
  io_weight: 100
```
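The post doesn't show how these settings reach systemd-run; one plausible mapping is below. MemoryMax, CPUQuota, and IOWeight are standard systemd resource-control properties, but the wrapper itself is an assumption:

```python
# Hypothetical wrapper: apply config/defaults.yaml limits to an agent run
# via systemd-run. The actual launcher isn't shown in this post.
import subprocess

LIMITS = {"MemoryMax": "2G", "CPUQuota": "100%", "IOWeight": "100"}

def run_with_limits(agent: str) -> None:
    cmd = ["systemd-run", "--user", "--scope"]
    for prop, value in LIMITS.items():
        cmd += ["-p", f"{prop}={value}"]
    cmd += ["./orchestrator/run-agent.sh", agent]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_with_limits("nadia")
```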
Sophie monitors for OOM kills and alerts.
Monitoring & Operations
Health Checks (Sophie Every 2 Hours)
| Component | Check | Alert Threshold |
|---|---|---|
| NATS | Stream connectivity, message flow | >5 consecutive failures |
| Dashboard | HTTP 200 on /health, /agents, /bugs | Any 404/500 |
| Agents | Recent activity in logs, no stuck processes | >4 hours idle |
| Resources | Disk usage, memory pressure, OOM kills | Disk >90% |
| Commitments | Promises made to user, deadlines | Overdue by >1 hour |
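As an example of what one of these checks might reduce to in code (the dashboard URL and port are assumptions, and the real check script isn't shown):

```python
# Hypothetical version of Sophie's dashboard check.
from urllib.request import urlopen
from urllib.error import URLError

def dashboard_healthy(base_url: str = "http://localhost:5000") -> bool:
    try:
        return all(
            urlopen(base_url + path, timeout=5).status == 200
            for path in ("/health", "/agents", "/bugs")
        )
    except (URLError, OSError):
        return False
```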
Incident Response
| SEV | Definition | Response Time |
|---|---|---|
| SEV-0 | System down, all agents failing | Immediate |
| SEV-1 | Major feature broken | 1 hour |
| SEV-2 | Degraded functionality | 4 hours |
CRITICAL Verification Standard: "Verified" means checking the ACTUAL USER-FACING SYSTEM: send a test message to Slack and see Grace respond in the channel; trigger the cron job and see the agent complete and post to NATS. NOT "service status shows active" or "logs show started messages."
Infrastructure
| Component | Technology |
|---|---|
| Server | Home server (i7-6700K, 16GB RAM, Ubuntu 24.04) |
| Message Bus | NATS JetStream |
| Agent Runtime | Claude (via Claude Code CLI and Claude Max) |
| Slack Integration | Python service with Slack Bolt |
| CI/CD | GitHub Actions |
| Task Management | Vikunja (self-hosted) |
| Logging | SQLite (local) |
| Scheduling | Cron + systemd timers |
What This Produced
When the system was healthy:
- 1,159 commits in 18 days
- 432 work packages tracked and managed
- 92% completion rate (396 of 432 completed)
- Multiple projects progressing in parallel
The combination of roadmap-driven development, TDD discipline, and NATS coordination meant agents could work autonomously on well-defined tasks. When everything was running, I could drop a feature request in Slack, go outside, and come back to working code.
Of course, it didn't always work. The retrospective covers what broke.
Want the Code?
DM me on LinkedIn. The repos are private right now - too intermingled with personal infrastructure for public sharing. v2 will be designed for shareability from day one.
Next: Field Notes From an Eng Manager Building Her First Autonomous Agent System - what went wrong and why I started over.