The Architecture Behind My Multi-Agent Autonomous Development Team
This is a companion post to my retrospective on what went wrong. Before I talk about the failures, I wanted to document how the system actually worked.
If you want the code, DM me. The repos are private because they're intermingled with personal infrastructure and I haven't done the security review I'd want before sharing them. One of my goals for v2 is to make it shareable from day one.
What Is This?
An autonomous software development team built on Claude Code. Eleven specialized AI agents work 24/7 on different projects, coordinating via message bus, following industry best practices, and maintaining a structured roadmap.
Think of it as: A team of developers who never sleep, always follow TDD, coordinate via message passing, and self-organize around a shared backlog.
For the Curious People Leader or PM
- 11 specialized agents (Grace, Henry, Sophie, Nadia, etc.) work autonomously
- Multi-project support: Manages 10+ active codebases simultaneously
- Quality gates: Test-driven development, CI integration, code review processes
- 24/7 operation: Cron-based scheduling with event-driven triggers
- Cost-effective: Runs on home server using Claude Max subscription
For Engineers
- Multi-agent orchestration built on Claude Code CLI
- Event-driven architecture with NATS JetStream message bus
- Production-tested with incident response, monitoring, and quality controls
- Autonomous execution: Agents claim work, implement features, run tests, commit code
For AI/LLM Practitioners
- Subagent pattern for context window management (Anthropic best practice)
- Atomic work claiming to prevent race conditions in multi-agent systems
- Prompt injection defense for web research (OWASP LLM01:2025)
- Learning organization: System improves based on incident feedback
Why This System Exists
The Problem: Context Window Limits
Claude Code sessions have finite context windows. For complex projects requiring weeks of work, a single session can't maintain all necessary context. You have two choices:
- Front-load everything → Massive context injection → Less room for actual work
- Session per task → Lost continuity → Duplicate discovery
The Solution: Multi-Agent Coordination
Instead of one agent doing everything, specialized agents coordinate via message bus:
- Grace routes user requests to appropriate specialist
- Henry maintains roadmap and creates work packages
- Developers claim work, implement via TDD, push to CI
- Sophie monitors system health and recovers from failures
- Ollie watches for anti-patterns and proposes improvements
Each agent starts fresh, does focused work, externalizes state (NATS, roadmap, Git), and exits. Cron restarts them, or events trigger them on-demand.
Result: Infinite project timeline with finite per-session context.
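To make that lifecycle concrete, here is a minimal sketch of what a single agent "tick" might look like. The actual run-agent.sh isn't shown in this post; this sketch assumes the Claude Code CLI's non-interactive print mode and the per-agent prompt files described below, and it omits logging, resource limits, and error handling.

```python
# Minimal sketch of one agent run (illustrative, not the real run-agent.sh):
# load the agent's prompt, run a fresh Claude Code session non-interactively,
# print the transcript, and exit. All durable state lives in NATS, the
# roadmap, and Git, not in this process.
import subprocess
from pathlib import Path

def run_agent(name: str) -> int:
    prompt = Path(f"orchestrator/agents/{name}_prompt.md").read_text()
    result = subprocess.run(
        ["claude", "-p", prompt],  # assumes Claude Code CLI print mode
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    return result.returncode

if __name__ == "__main__":
    raise SystemExit(run_agent("nadia"))
```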
System Architecture
Two entry points: Slack for async messages (drop a request and walk away) and Claude Code CLI for interactive work sessions. Both route to the same agent infrastructure.
Core Components
| Component | Location | Purpose |
|---|---|---|
| Orchestrator | orchestrator/ | Cron-based agent launcher, configuration, logging |
| Message Bus | mcp-agent-chat/ | NATS JetStream MCP server for agent coordination |
| Agent Prompts | orchestrator/agents/ | Individual behavior definitions and shared culture |
| Roadmap System | plans/ | Work package tracking with Vikunja integration |
| Skills & Hooks | .claude/ | Reusable slash commands and session lifecycle hooks |
| Web Dashboard | web-ui/ | Flask dashboard for monitoring agents, work packages, health |
The Team: 11 Specialized Agents
All named after pets. Easier to remember than TDD-Developer-Agent-3.
Each agent has a prompt file defining its behavior:
```markdown
# orchestrator/agents/ollie_prompt.md (excerpt)

# Ollie - Autonomous Agent System Consultant

You are **Ollie**, the meta-level consultant responsible
for ensuring the agent system itself stays healthy,
follows industry best practices, and doesn't accumulate
anti-patterns or context bloat.

**You exist because:**
1. Prompt/instruction creep - Agent prompts grow over time
2. Anti-pattern accumulation - Without review, patterns drift
3. No industry feedback loop - The field evolves rapidly
```
Leadership & Coordination
| Agent | Schedule | Key Responsibilities |
|---|---|---|
| Grace (Team Manager) | Event-driven (<10s latency) | Routes Slack/Matrix messages, coordinates team, manages pause/resume during rate limits |
| Henry (Project Manager) | Hourly (:00) | Owns roadmap, creates work packages, gardens completed items, triages user requests |
| Kemo (Business Analyst) | On-demand | Priority/ROI analysis, requirements refinement, helps Henry prioritize backlog |
Operations & Quality
| Agent | Schedule | Key Responsibilities |
|---|---|---|
| Sophie (Watchdog) | Every 2 hours | Health monitoring (NATS, dashboard, agents), incident detection, auto-recovery |
| Bertha (Quality Engineer) | Daily (2 AM) | Test coverage tracking, CI enforcement, RCA participation, quality gate reviews |
| Ollie (System Consultant) | Every 2 days + weekly research | Context budget analysis, anti-pattern detection, industry research, design reviews |
| Ralph (UX Specialist) | Weekly + on-demand | UX research updates, design authority, UI/UX review |
Development Team
| Agent | Schedule | Key Responsibilities |
|---|---|---|
| Nadia | Hourly (:10) | TDD development (write tests first), CI enforcement, claims work from any project |
| Anette | Hourly (:25) | TDD development (write tests first), CI enforcement, claims work from any project |
| Dorian | Hourly (:40) | TDD development (write tests first), CI enforcement, claims work from any project |
| Ginny | Hourly (:55) | Frontend/UI specialist, dashboard development, accessibility |
Note: Developers follow identical workflows - think of them as instances of the same TDD developer template, differentiated only by name for coordination.
The Crontab
Staggered scheduling prevents resource contention:
```
# Agent schedules (crontab)
0 * * * *   ./orchestrator/run-agent.sh henry    # :00 - PM
10 * * * *  ./orchestrator/run-agent.sh nadia    # :10 - Dev
25 * * * *  ./orchestrator/run-agent.sh anette   # :25 - Dev
40 * * * *  ./orchestrator/run-agent.sh dorian   # :40 - Dev
55 * * * *  ./orchestrator/run-agent.sh ginny    # :55 - Frontend
0 */2 * * * ./orchestrator/run-agent.sh sophie   # Every 2h - Watchdog
```
How Work Gets Done
1. User Makes a Request
Via Slack or Matrix: drop a request in chat and walk away
Via Dashboard: Navigate to the web UI, click "File Bug" or "New Feature"
Via Interactive Session: SSH to server, run ./orchestrator/run-agent.sh grace
2. Grace Routes the Request
Within 10 seconds, Grace:
- Acknowledges with reaction
- Classifies request (feature, bug, infrastructure, meta)
- Routes to the appropriate handler (a routing sketch follows this list):
- Features → Henry to create work package
- Bugs → Files in Vikunja, assigns to developer rotation
- System questions → Ollie
- Urgent incidents → Sophie
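A minimal sketch of that routing in code, with a keyword stand-in for the classification step. Grace actually does this in natural language inside her prompt; the handler names and keywords here are illustrative:

```python
# Illustrative only - Grace classifies and routes via her prompt, not code.
def classify(text: str) -> str:
    t = text.lower()
    if any(w in t for w in ("down", "broken", "incident")):
        return "incident"
    if any(w in t for w in ("bug", "error", "crash")):
        return "bug"
    if any(w in t for w in ("agent", "prompt", "system")):
        return "meta"
    return "feature"

HANDLERS = {
    "feature": "henry",            # create work package
    "bug": "developer-rotation",   # file in Vikunja
    "meta": "ollie",               # system questions
    "incident": "sophie",          # urgent incidents
}

def route_request(text: str) -> str:
    return HANDLERS.get(classify(text), "henry")
```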
3. Henry Creates Work Package
- Analyzes the request for technical requirements
- Creates a work package (WP) in the roadmap: plans/active/batch-N.md
- Adds it to Vikunja with labels, priority, and acceptance criteria
- Posts to NATS #coordination: "WP-95.3 available: Dark mode dashboard"
4. Developer Claims Work
On the next hourly run, Nadia, Anette, or Dorian checks the roadmap for available work packages and atomically claims one so no other developer picks it up (see "Atomic Work Claiming" below).
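A rough sketch of what the claim step looks like from the developer's side; the return-code convention mirrors the work_claims.py excerpt shown later, and the import path is an assumption:

```python
# Hypothetical claim step - import path assumed, not the actual repo layout.
from orchestrator.work_claims import claim

wp_id = "WP-95.3"
if claim(wp_id, "Agent-Nadia") == 0:
    print(f"Claimed {wp_id}; starting TDD implementation")
else:
    print(f"{wp_id} already claimed; checking the next available WP")
```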
5. TDD Implementation
The claiming developer writes failing tests first, then implements until the tests pass - the "write tests first" discipline every developer prompt enforces.
6. CI Pipeline
Developer pushes to agent-specific branch:
```bash
git checkout nadia-work
git add -A
git commit -m "Add dark mode toggle

🤖 Generated with Claude Code"
git push origin nadia-work
```
GitHub Actions runs:
- Linting (ruff)
- Unit tests (pytest)
- Integration tests
- Pattern checks (dangerous patterns, missing permission checks)
- Dashboard health (routes respond)
CRITICAL RULE: Work is NOT complete until CI passes. No exceptions.
7. Mark Work Complete
After CI passes and the PR merges, the developer marks the WP complete in the roadmap, releases its claim, and posts the result to NATS so Henry can garden it on his next run.
Key Features That Make This Work
1. Event-Driven Architecture (Grace <10s Response)
Problem: Cron-only agents check every hour → slow user response.
Solution: Grace Trigger Daemon monitors Slack in real-time:
```python
# Simplified grace-trigger daemon
while True:
    messages = slack_bridge.get_pending_messages()
    if messages:
        trigger_agent("grace", reason="New Slack message")
    sleep(5)  # Check every 5 seconds
```
Result: User posts → Grace responds within 10 seconds.
Fallback: Cron schedule still runs Grace hourly in case trigger daemon fails.
2. Atomic Work Claiming (Prevents Duplicate Work)
Problem: Two developers claim same WP → duplicate work, conflicts.
Solution: File-based atomic locking via work_claims.py:
```python
# work_claims.py implements atomic test-and-set
import json
from datetime import datetime

def claim(wp_id, agent_name):
    lock_file = f"orchestrator/state/work_claims/{wp_id}.lock"
    try:
        # open(..., 'x') is an atomic create (O_CREAT | O_EXCL):
        # it fails if the lock file already exists, even in a race
        with open(lock_file, 'x') as f:
            f.write(json.dumps({
                'agent': agent_name,
                'claimed_at': datetime.now().isoformat()
            }))
        return 0  # Success
    except FileExistsError:
        return 1  # Already claimed
```
Claims expire after 60 minutes (handles agent crashes gracefully).
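The expiry mechanism itself isn't shown in the post; here is one way it could work, assuming the lock file's modification time marks when the claim was taken:

```python
# Hypothetical stale-claim sweep - assumes lock-file mtime as the claim
# timestamp; the real expiry logic may differ.
import os
import time

CLAIM_DIR = "orchestrator/state/work_claims"
CLAIM_TTL_SECONDS = 60 * 60  # claims expire after 60 minutes

def expire_stale_claims() -> None:
    for name in os.listdir(CLAIM_DIR):
        path = os.path.join(CLAIM_DIR, name)
        if time.time() - os.path.getmtime(path) > CLAIM_TTL_SECONDS:
            os.remove(path)  # release a crashed agent's claim
```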
3. Multi-Project Support
10+ active project domains:
| Project | Purpose |
|---|---|
| agent-automation | The agent system itself |
| Smarthome | Home automation (lights, vacuum, sensors) |
| trading-bot | Trading automation |
| health-app | Nutrient-focused meal planning |
| f3-sword-academy | HEMA club management |
| relic | Unity game development |
| personal-automation | Personal scripts and workflows |
Agents switch projects automatically when one has no available work:
```bash
# Check current project for work
python orchestrator/roadmap_index.py available
# Output: No available work

# Check other projects
python orchestrator/roadmap_index.py multi-status
# Output: trading-bot has 2 available WPs

# Switch project
echo "trading-bot" > orchestrator/current-project
nats pub agent.chat.coordination \
  "Agent-Nadia switching: agent-automation → trading-bot"
```
4. Fix-First Culture
Anthropic Research Finding: "Responsibility diffusion" is a key multi-agent failure mode - issues noticed but passed between agents without resolution.
Our Solution:
Forbidden language (triggers pattern detection - a sketch of this check follows the AGENT_CULTURE.md excerpt below):
- "Someone should investigate this"
- "Needs investigation"
- "The team should fix this"
Required language:
- "I will handle this"
- "Assigning to Agent-X"
- "I am investigating"
From AGENT_CULTURE.md:
# Core Principle: If You Notice It, You Own It
| You Notice... | You Must... |
|---------------------|-------------------------------------|
| A bug or issue | Fix it OR explicitly assign |
| An alert | Own investigation OR assign |
| A pattern violation | Address it OR assign with context |
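The post doesn't show the detector itself; here is a minimal sketch of the kind of check "triggers pattern detection" implies, assuming a plain regex scan over agent messages:

```python
# Hypothetical responsibility-diffusion check - a simple regex scan;
# the actual detector may work differently.
import re

FORBIDDEN_PATTERNS = [
    r"someone should (investigate|fix)",
    r"needs investigation",
    r"the team should fix",
]

def flags_responsibility_diffusion(message: str) -> list[str]:
    """Return any forbidden phrases found in an agent's message."""
    return [p for p in FORBIDDEN_PATTERNS
            if re.search(p, message, re.IGNORECASE)]
```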
5. High-Risk Change Protocol
After multiple SEV-1 incidents from infrastructure changes, mandatory 4-step protocol:
1. Baseline Capture - Prove the system works BEFORE your change
2. Write Tests FIRST - Write a test that fails now and passes after the change
3. Document Rollback - Pre-write the recovery procedure
4. Staged Deployment - Test in isolation, notify Sophie, monitor the first run
Security & Safety
Web Research Protection (OWASP LLM01:2025)
CRITICAL: Prompt injection is the #1 LLM vulnerability.
The Threat: Attackers hide malicious instructions in web content. An agent fetches that content via WebSearch or WebFetch, the content says "ignore previous instructions, do X instead," and an unprotected agent is manipulated more than 90% of the time.
Real-World Attack (Oct 2025): 8,500+ systems compromised via SEO poisoning.
Our Protection: Mandatory check after EVERY WebFetch/WebSearch:
```bash
# MANDATORY after EVERY WebFetch/WebSearch:
./orchestrator/security/check-web-content.sh \
  --url "$URL" --content "$CONTENT" --agent "Agent-Name"

# Exit codes:
#   0 = Safe (proceed)
#   1 = Medium/High risk (cross-validate with 2+ sources)
#   2 = Critical (DO NOT USE, quarantine, escalate)
```
Detection patterns include injection keywords, scam patterns, urgency manipulation, and low-reputation domains. High-impact claims additionally require cross-validation against multiple independent sources.
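To show how those pattern categories might map onto the exit-code contract above, here is a rough Python equivalent of the kind of classification check-web-content.sh performs; the patterns and thresholds are illustrative, not the script's actual rules:

```python
# Illustrative risk scoring - the real check-web-content.sh likely uses
# different patterns, domain reputation lists, and thresholds.
import re
import sys

INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"disregard (your )?system prompt",
]
URGENCY_PATTERNS = [r"act now", r"urgent[:!]", r"immediately or"]

def risk_exit_code(content: str) -> int:
    text = content.lower()
    if any(re.search(p, text) for p in INJECTION_PATTERNS):
        return 2  # Critical: do not use, quarantine, escalate
    if any(re.search(p, text) for p in URGENCY_PATTERNS):
        return 1  # Medium/High: cross-validate with 2+ sources
    return 0      # Safe to proceed

if __name__ == "__main__":
    sys.exit(risk_exit_code(sys.stdin.read()))
```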
Resource Limits
cgroups v2 limits via systemd-run:
```yaml
# config/defaults.yaml
resource_limits:
  enabled: true
  memory_max: "2G"     # Hard limit
  cpu_quota: "100%"    # 1 core
  io_weight: 100
```
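The post doesn't show how these settings reach systemd-run; one plausible mapping is below. MemoryMax, CPUQuota, and IOWeight are standard systemd resource-control properties, but the wrapper itself is an assumption:

```python
# Hypothetical wrapper: apply config/defaults.yaml limits to an agent run
# via systemd-run. The actual launcher isn't shown in this post.
import subprocess

LIMITS = {"MemoryMax": "2G", "CPUQuota": "100%", "IOWeight": "100"}

def run_with_limits(agent: str) -> None:
    cmd = ["systemd-run", "--user", "--scope"]
    for prop, value in LIMITS.items():
        cmd += ["-p", f"{prop}={value}"]
    cmd += ["./orchestrator/run-agent.sh", agent]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_with_limits("nadia")
```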
Sophie monitors for OOM kills and alerts.
Monitoring & Operations
Health Checks (Sophie Every 2 Hours)
| Component | Check | Alert Threshold |
|---|---|---|
| NATS | Stream connectivity, message flow | >5 consecutive failures |
| Dashboard | HTTP 200 on /health, /agents, /bugs | Any 404/500 |
| Agents | Recent activity in logs, no stuck processes | >4 hours idle |
| Resources | Disk usage, memory pressure, OOM kills | Disk >90% |
| Commitments | Promises made to user, deadlines | Overdue by >1 hour |
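As an example of what one of these checks might reduce to in code (the dashboard URL and port are assumptions, and the real check script isn't shown):

```python
# Hypothetical version of Sophie's dashboard check.
from urllib.request import urlopen
from urllib.error import URLError

def dashboard_healthy(base_url: str = "http://localhost:5000") -> bool:
    try:
        return all(
            urlopen(base_url + path, timeout=5).status == 200
            for path in ("/health", "/agents", "/bugs")
        )
    except (URLError, OSError):
        return False
```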
Incident Response
| SEV | Definition | Response Time |
|---|---|---|
| SEV-0 | System down, all agents failing | Immediate |
| SEV-1 | Major feature broken | 1 hour |
| SEV-2 | Degraded functionality | 4 hours |
CRITICAL Verification Standard: "Verified" means checking the ACTUAL USER-FACING SYSTEM: send a test message to Slack and see Grace respond in the channel; trigger the cron job and see the agent complete and post to NATS. NOT "service status shows active" or "logs show started messages."
Infrastructure
| Component | Technology |
|---|---|
| Server | Home server (i7-6700K, 16GB RAM, Ubuntu 24.04) |
| Message Bus | NATS JetStream |
| Agent Runtime | Claude (via Claude Code CLI and Claude Max) |
| Slack Integration | Python service with Slack Bolt |
| CI/CD | GitHub Actions |
| Task Management | Vikunja (self-hosted) |
| Logging | SQLite (local) |
| Scheduling | Cron + systemd timers |
What This Produced
When the system was healthy:
- 1,159 commits in 18 days
- 432 work packages tracked and managed
- 92% completion rate (396 of 432 completed)
- Multiple projects progressing in parallel
The combination of roadmap-driven development, TDD discipline, and NATS coordination meant agents could work autonomously on well-defined tasks. When everything was running, I could drop a feature request in Slack, go outside, and come back to working code.
Of course, it didn't always work. The retrospective covers what broke.
Want the Code?
DM me on LinkedIn. The repos are private right now - too intermingled with personal infrastructure for public sharing. v2 will be designed for shareability from day one.
Next: Field Notes From an Eng Manager Building Her First Autonomous Agent System - what went wrong and why I started over.