Field Notes From an Eng Manager Building Her First Autonomous Agent System
Three weeks ago, I posted about an autonomous agent team that was building out my ideas while I got to go outside and touch grass. Christmas morning, they surprised me with a year-in-review feature I never asked for. It felt like magic.
Today, I pulled the plug on v1 and I'm starting from scratch.
- What worked: Roadmap-driven development, TDD discipline, ideas backlog
- What broke: 28-hour blind spots, token burn cascades, too many agents
- Key insight: The agents I talked to most were the most valuable - except Grace, where high mentions meant "broken"
Where I'm Coming From
Some context: I've been managing engineering teams for years, and before this project I was already toying with interactive Claude sessions - building little locally-hosted tools to automate parts of my workflow, both at work and in my personal life. When I left my last job, I was genuinely excited to take time off and actually dig into multi-agent systems properly. I wanted to go through The Multiverse School's Agentic SDLC curriculum and some other resources, to actually understand how this stuff works instead of just poking at it during stolen hours on weekends.
This Steve Yegge article has a good breakdown of the stages developers go through with AI. For reference, I'd say I'm at Figure 8 now - it's messy, but I'm past Figure 7 and the stages below it.
I went in with a healthy sense of skepticism, largely because I've worked closely with friends who are actual engineers and developers - people with many more years of hands-on coding experience than me. When they're suspicious of something, I pay attention. And there's been a lot to be suspicious of lately: the buzzwords, the LinkedIn hype, the "look at my AI do magic" posts that conveniently skip over all the failures. So when I decided to build this system, I wasn't trying to prove AI was magical. I wanted to understand where it actually helps and where it falls apart.
But I've also been noticing something that's hard to ignore. The people who roll their eyes at AI tools and the people who use them effectively seem to be having completely different experiences. I wanted to know which one was real, or if somehow both were. So I decided to prove it to myself: build something real, see what works and what breaks. And in my gut, I do believe there's something here that's going to change how we work. I'm coming at this from a management angle, not a "10 years as a developer" angle, but I trust my instincts enough to take them seriously.
I'm writing this blog post with heavy assistance from AI, using Claude to pull data from my databases, help draft sections, and verify timelines. And even in that process, I noticed things. I was building out some workshop materials and asked Claude for ideas on additional topics I could cover if we finished early. It put together a great list with sources for 6-7 topics, but just... skipped environmental impact of AI entirely. Could be negligible. Could be something. When I asked it to explain how AI companies make models safer, it initially wrote "how they make agents helpful and safe" - and I had to push back: "No, write 'how they try to make agents helpful and safe.' This is still dangerous and people need to take everything with a grain of salt."
I'm biased too. We all are. I think where we're heading is that trusted sources are going to be humans you've worked with and trust to use these tools correctly - people who are transparent about the process. (So hey, throwing my hat in the ring. I'm happy to be one of those people you trust to be transparent.)
The Final Numbers
Looks great, right? 92% completion rate. Over a thousand commits. An entire team of specialized agents working in parallel.
Yeah. Keep a healthy suspicion of numbers like these, especially when someone posts them on LinkedIn. This is what those numbers don't tell you. (If you want the technical details on how the system was built, see the companion architecture post.)
How I Tracked This
Quick aside on the data: I set up a system to log every message I sent to the agents - whether through Slack or terminal sessions. The hook captured my prompts and stored them in a SQLite database with timestamps and auto-classified intent (feature request, bug report, correction, status check, etc.).
This wasn't for vanity metrics. I wanted Ollie (system consultant) to be able to analyze patterns in how I was using the system - recurring frustrations, repeated requests, things I kept asking for that weren't getting done. The idea was that the system could learn from my behavior and improve itself.
In practice, it mostly just gave me a detailed record of how often I was frustrated.
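For the curious, the hook itself is tiny. This isn't the real one, just a minimal sketch of the shape - take the prompt, guess an intent from keywords, append to SQLite. The table name, keyword lists, and intent labels are all illustrative:

```python
# Minimal sketch of the prompt-logging hook (illustrative, not the real one).
# Expects the raw prompt text as command-line arguments.
import sqlite3
import sys
from datetime import datetime, timezone

DB_PATH = "prompts.db"  # hypothetical path

# Naive keyword-based intent classification - good enough for rough trends.
INTENT_KEYWORDS = {
    "bug-report": ["broken", "error", "failing", "why didn't"],
    "correction": ["no,", "actually", "that's wrong", "instead"],
    "status-check": ["status", "any update", "where are we"],
}

def classify(prompt: str) -> str:
    lowered = prompt.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return intent
    return "feature-request"  # default bucket

def log_prompt(prompt: str) -> None:
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages (ts TEXT, intent TEXT, prompt TEXT)"
    )
    conn.execute(
        "INSERT INTO messages VALUES (?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), classify(prompt), prompt),
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    log_prompt(" ".join(sys.argv[1:]))
```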
The Timeline
I went through the actual commit history to verify this timeline. First commit was December 18th, 2025: "Initial commit: Agent Automation System." It was just orchestrator scripts at that point.
grace-watcher.sh is not in crontab. The architecture depends on grace-watcher.sh polling NATS every 1-2 minutes to detect Slack messages and launch Grace. Without this cron entry, Grace is never triggered.
The crontab said "uses systemd timer" but no timer existed. 73 commits - most of them trying to fix things.
About the Names
If you're one of my friends and you recognize a name... lol, sorry I didn't ask permission. The agents are all named after pets I've had, alive and dead:
- Sophie - my current dog (watchdog agent)
- Henry - my late cat (project manager)
- Ollie - my cousin's dog who used to live with us (system consultant)
- Grace - other pets from over the years (team manager)
- Dorian, Nadia, Anette - Dorian is my partner's/our current cat; Nadia and Anette are pets from over the years (developers)
- Kemo - another pet (business analyst)
- Bertha - one of my chickens (quality engineer)
- Ralph, Ginny - friends' cats (UX specialist, frontend developer)
I named them partly because it felt more fun than TDD-Developer-Agent, Business-Analysis-Agent, but mostly because it helps my brain remember who does what. I'm a human being. That's how my brain is wired.
What I Built Along the Way
The agent system wasn't just building itself - I had them working on actual projects I cared about:
- Smarthome - Home Assistant integrations, voice control via Whisper/Piper. I'm increasingly privacy-conscious and annoyed at Amazon and Google smart home systems - they don't function how I want, they take my data, they advertise at me. Not getting into that rabbit hole here, but it motivated building my own.
- Trading Bot - Testing my ideas on paper, seeing if my assumptions are right about decisions people and companies are making in the market. Genesis of wanting to validate my thinking systematically.
- Health App - Nutrient-focused meal planning. I know there are a million health apps, but I'm very opinionated about how I approach nutrition and activity, and I want research tightly incorporated into both. Bespoke for me.
- Relic - A Unity game project (with a friend)
- F3 Sword Academy - I do social media for my HEMA club and help with membership tracking and setup. This was about automating some of that.
- Personal Automation - Finance tracking, house projects, personal assistant features. Same approach as everything else: I know apps exist for this, I just want bespoke ones for myself and my specific workflows.
By the end I was managing 11 different project domains through the same agent infrastructure. Way too many. Definitely too many. But this is also why I wanted to build the system in the first place - to be able to have multiple projects be somewhat self-managed. The reality is the system just wasn't at a maturity level where I could spread attention across that many things. Moving forward, I'd pick maybe three things to manage until I prove the system is actually providing value.
Here's the thing I realized though: I was starting on these things and getting really cool initial MVPs, but I wasn't really using any of them. I was using the smarthome project the most, probably because that's the one with the best UX - it's conversational, it doesn't force structure on me. And maybe that's the lesson: UX for anything backed by AI needs to be more conversational. You can't assume structure or force structure on it.
I did start spinning up UX research prompts and... it's hard. If anyone has ideas on this, please send me a link. I'd love to read more.
What Actually Worked
1. Roadmap-Driven Development
Every agent checked a shared plans/index.yaml before starting work. They'd claim a work package, announce it on NATS (our message bus), do the work, mark it complete. The NATS announcement was key - it acted as an atomic operation to prevent two agents from claiming the same work. 432 work packages tracked this way. This mirrors what Anthropic recommends for building agentic systems - explicit task boundaries and clear handoffs.
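For concreteness, here's a minimal sketch of that claim-and-announce flow. It's not the actual orchestrator code - the YAML layout, the subject name, and the nats-py usage are assumptions for illustration, and a real version needs the claim itself to be atomic, which this glosses over:

```python
# Hypothetical sketch of the claim-and-announce flow described above.
# Assumes a plans/index.yaml shaped roughly like:
#   work_packages:
#     - id: WP-101
#       status: open        # open | claimed | done
#       claimed_by: null
# and the nats-py client for the announcement. All names are illustrative.
import asyncio
import yaml
import nats

PLANS = "plans/index.yaml"

async def claim_next_package(agent_name: str) -> str | None:
    with open(PLANS) as f:
        plans = yaml.safe_load(f)

    # Find the first unclaimed package.
    package = next(
        (wp for wp in plans["work_packages"] if wp["status"] == "open"), None
    )
    if package is None:
        return None

    # NOTE: a real version needs this read-modify-write to be atomic
    # (file lock, compare-and-swap, etc.); this sketch glosses over that.
    package["status"] = "claimed"
    package["claimed_by"] = agent_name
    with open(PLANS, "w") as f:
        yaml.safe_dump(plans, f)

    # Announce the claim on NATS so other agents skip this package.
    nc = await nats.connect("nats://localhost:4222")
    await nc.publish("work.claimed", f"{agent_name}:{package['id']}".encode())
    await nc.drain()
    return package["id"]

if __name__ == "__main__":
    print(asyncio.run(claim_next_package("nadia")))
```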
2. TDD Discipline
Agents wrote tests first, then implementation. CI had to pass before marking anything complete. This is a common industry pattern for agentic systems - Anthropic's building agents guide emphasizes automated verification loops. If you're going to have agents write code, you need verification that doesn't depend on vibes.
I cover this in my Agentic AI Workshop, though I note it's something that comes later in the maturity cycle.
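To make the "verification that doesn't depend on vibes" part concrete, here's a sketch of a completion gate, reusing the hypothetical plans/index.yaml layout from the sketch above - the work package only flips to done if the test suite exits clean:

```python
# Sketch of a completion gate: only mark a work package done if tests pass.
# Reuses the hypothetical plans/index.yaml layout from the previous sketch.
import subprocess
import yaml

PLANS = "plans/index.yaml"

def mark_done_if_green(package_id: str) -> bool:
    # Run the test suite; the exit code is the only signal we trust.
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if result.returncode != 0:
        print(f"{package_id}: tests failing, leaving status as-is")
        return False

    with open(PLANS) as f:
        plans = yaml.safe_load(f)
    for wp in plans["work_packages"]:
        if wp["id"] == package_id:
            wp["status"] = "done"
    with open(PLANS, "w") as f:
        yaml.safe_dump(plans, f)
    return True
```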
3. Named Agents with Roles
Grace (team manager), Henry (PM), Sophie (watchdog), Ollie (system consultant), Kemo (business analyst), developers (Nadia, Anette, Dorian). When something broke, I could ask "Sophie, why didn't you catch this?" instead of debugging an anonymous system. Having roles also made it clearer what each agent was responsible for.
Contrast that with something like TDD-Developer-Agent. When I say "Sophie," it's unambiguous - one entity, no confusion with other terms that might get misinterpreted. This matters because there are many entry points into this system - Slack, terminal sessions, voice, different contexts - and I'm verbally interfacing with it from a bunch of different angles. A clear, single name cuts through all of that.
4. Culture Documentation
I collaborated with Ollie (system consultant) in an interactive session to create AGENT_CULTURE.md - a document capturing my ideas and ethos around how the team should work. Team values, protocols, expected behaviors. The agents actually followed it.
This came about because I kept seeing the same mistakes happen:
me: Ok, is there any way that we see trends? Like, if Sophie keeps on getting spun up to deal with this every few hours, something is clearly wrong. Will the system correct for that?
The culture doc became a forcing function for thinking through edge cases - team values, protocols, expected behaviors, and a list of banned language.
That "banned language" section came directly from watching the agents pass issues back and forth without anyone actually owning them. Reminded me of every dysfunctional human team I've seen.
Side note for v2: I want to add tests for this - actually count how many times the banned patterns appear, and whether they decrease over time. I started building that, and it was beginning to work. But I'm also not sure if a culture doc is where this should live. I've seen other resources suggest this kind of thing should be enforced systematically, not just documented. Still figuring that out.
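I'm not going to pretend this is the code I had, but the shape of the check is roughly this (the banned phrases and the logged-messages table are illustrative assumptions):

```python
# Sketch: count "banned language" hits per day to see if they trend down.
# The phrase list and the agent_messages table are illustrative assumptions -
# this is the shape of the check, not the code I actually had.
import sqlite3
from collections import Counter

BANNED_PHRASES = [
    "out of scope for this agent",
    "someone should",
    "will escalate",  # escalation without an owner
]

def banned_counts_by_day(db_path: str = "prompts.db") -> Counter:
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT ts, body FROM agent_messages").fetchall()
    conn.close()

    counts: Counter = Counter()
    for ts, text in rows:
        day = ts[:10]  # bucket by the date prefix of the ISO timestamp
        hits = sum(text.lower().count(p) for p in BANNED_PHRASES)
        if hits:
            counts[day] += hits
    return counts

if __name__ == "__main__":
    for day, n in sorted(banned_counts_by_day().items()):
        print(day, n)
```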
What Kept Breaking (A Detailed Breakdown)
Grace: The Team Manager Who Couldn't Stay Online
Grace (team manager) was supposed to be my interface to the system. Messages came through Slack, Grace routed them to the right agent, reported back. In theory.
I'm a manager. Of course I built a manager agent first, right? Classic human bias in system design. I'll come back to this.
In practice, Grace went down in at least four different ways over 18 days:
| Date | What Broke | Duration | Root Cause |
|---|---|---|---|
| Dec 21-22 | grace-watcher.sh not in cron | 28+ hours | Crontab said "uses systemd timer" but no timer existed |
| Dec 23 | slack-listener service down | 5.5 hours | Process terminated, no auto-restart configured |
| Dec 26 | grace-trigger daemon hung | 10+ hours | Stopped logging/responding, no heartbeat detection |
| Multiple | Event-driven triggers never deployed | Ongoing | Said "deployed" in docs, wasn't actually running |
Each time I added more monitoring: health check endpoints, heartbeat systems, watchdog timers, Sophie (watchdog) monitoring Grace, a watchdog for Sophie monitoring Grace. The complexity just created more failure modes.
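To be clear, none of those layers were individually complicated - each one was some variant of a staleness check, roughly this shape (paths and thresholds made up):

```python
# Sketch of one monitoring layer: a heartbeat staleness check.
# Paths and thresholds are made up for illustration.
import os
import time

HEARTBEAT_FILE = "/var/run/agents/grace.heartbeat"  # hypothetical path
MAX_AGE_SECONDS = 5 * 60

def is_stale() -> bool:
    try:
        age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
    except FileNotFoundError:
        return True  # never wrote a heartbeat: definitely down
    return age > MAX_AGE_SECONDS

if __name__ == "__main__":
    if is_stale():
        print("Grace heartbeat is stale - page a human, don't spawn another agent")
```

The problem wasn't writing any one of them; it was that every new layer was one more thing that could silently stop running.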
At some point I realized: the amount of time I was spending maintaining Grace didn't match the value she was providing. I was able to just... exist without her. Interactive terminal sessions worked fine. The whole elaborate Slack routing system was overhead I didn't actually need.
Sophie Wasting Tokens: A Mirror of Human Patterns
Sophie (watchdog) was supposed to monitor system health and report issues. What she actually did: detect a NATS timeout error, "investigate" it, report it to Slack, then detect the same error 15 minutes later. Same investigation. Same report. Burning tokens on repeat without actually fixing the underlying trend.
me: look at the history in #colby-server-health slack. see that sophie is still wasting cycles on things that are not real alerts (alerting updated needed). please have ollie think about this pattern and update her prompt to look at how often something is happening
Here's what Sophie's commit history looked like on a single day: four "health check cycle complete" commits within 3 minutes. Each one burned tokens. None of them actually fixed anything.
Also: this 100% should have been code, not an LLM call. Simple logic like "have I seen this exact error in the last hour?" doesn't need intelligence - it needs a counter and a timestamp. I was reaching for the AI hammer when a bash script would have worked better.
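The "counter and a timestamp" version really is this small (a sketch, with made-up names):

```python
# Sketch: suppress repeat reports of the same error within a time window.
# A counter and a timestamp, no LLM call. Names are illustrative.
import sqlite3
import time

WINDOW_SECONDS = 60 * 60  # don't re-investigate the same error within an hour

def should_report(error_signature: str, db_path: str = "seen_errors.db") -> bool:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen (signature TEXT PRIMARY KEY, last_ts REAL)"
    )
    row = conn.execute(
        "SELECT last_ts FROM seen WHERE signature = ?", (error_signature,)
    ).fetchone()
    now = time.time()
    if row and now - row[0] < WINDOW_SECONDS:
        conn.close()
        return False  # seen recently: skip the investigation entirely

    conn.execute(
        "INSERT INTO seen (signature, last_ts) VALUES (?, ?) "
        "ON CONFLICT(signature) DO UPDATE SET last_ts = excluded.last_ts",
        (error_signature, now),
    )
    conn.commit()
    conn.close()
    return True
```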
I know this. I made the right call at plenty of other points. But when you're juggling too many projects on top of a brittle system, it's hard to know exactly where to sit down and enforce that discipline. Maybe there's a way to build this into prompts so agents automatically lean towards scripts for simple logic. But then you run into scripts not being 100% correct in what they're doing, and you don't know until something breaks. It's a recursive chicken-and-egg problem I haven't solved yet.
The 28-Hour Blind Spot
Grace (team manager) went down for 28 hours and I didn't notice. How? Because I was using interactive terminal sessions for my actual work. Grace only mattered for Slack integration - she was the agent who watched for messages in our shared Slack workspace and routed them to the right team member. The whole point was that I could drop a message in Slack from my phone while walking the dog, and the agents would handle it.
And I wasn't checking Slack because I was hoping the agents would tell me if something was wrong. They couldn't tell me. Because Grace was down.
I want to be clear: this wasn't a production system - it was an exploratory project, a simulation. I wasn't betting my business on it. I knew I could come back whenever I wanted and see what happened. But still. 28 hours is a long time to not notice your primary user interface is dead.
The Great Token Burn of January 2026
What happened: The orchestrator had a bug where multiple agents could spawn without checking if others were already running. One bad cron trigger later, 13 agents all tried to work at once.
Root cause from my incident log:
- Memory: 13 concurrent Claude sessions (8.3GB total) without system-wide limits
- Resource limits documented but not enforced (2GB per agent not implemented)
- Documented limits not enforced → enforcement gap exposed
The docs said "2GB per agent limit." The system didn't actually enforce it. Aspirational documentation is not operational documentation.
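The enforcement gap is also a small amount of code. Here's a sketch of the guard I wish had existed - a hard cap on concurrent agent sessions before spawning another (the "claude" process-name match is an assumption about how the sessions show up in the process list, and the cap is arbitrary):

```python
# Sketch: refuse to spawn another agent when too many are already running.
# The "claude" process-name match is an assumption about how sessions
# show up in the process list; the cap is arbitrary.
import subprocess
import sys

MAX_CONCURRENT_AGENTS = 4

def running_agent_count() -> int:
    # pgrep -f matches the full command line, -c prints the match count.
    result = subprocess.run(
        ["pgrep", "-fc", "claude"], capture_output=True, text=True
    )
    return int(result.stdout.strip() or 0)

def spawn_allowed() -> bool:
    count = running_agent_count()
    if count >= MAX_CONCURRENT_AGENTS:
        print(f"{count} agent sessions already running - refusing to spawn another")
        return False
    return True

if __name__ == "__main__":
    sys.exit(0 if spawn_allowed() else 1)
```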
The silver lining: Getting locked out of my own system for 24 hours forced me to step away and think. Not just "how do I fix this" but "should I even be iterating on something this brittle?" It's the advice I keep giving friends who ask about their projects: sometimes you need to let go of sunk cost and redesign from scratch. Funny how that advice is harder to follow when it's your own system.
I Should Probably Collect More Data, Huh
Lesson learned: instrument everything from day one. I didn't set up proper metrics collection, so I'm working with incomplete data. To be fair, I had no idea I'd be writing this post - I wasn't even sure I'd keep going with the project. But here we are, and V2 will have better hooks and dashboards from the start.
Here's what I did manage to capture - which agents I mentioned most in my messages, and which agents made autonomous commits:
| Agent | My Mentions | Autonomous Commits | What This Suggests |
|---|---|---|---|
| Henry (PM) | 129 | 21 | Valuable partner - I used him a lot for roadmap work |
| Ollie (System Consultant) | 104 | 121 | Valuable in both modes - talked to AND autonomous output |
| Grace (Team Manager) | 64 | 2 | High mentions + low output = broken, not productive |
| Sophie (Watchdog) | 42 | 15 | Low mentions + output = working quietly in background |
The pattern that emerges: high mentions doesn't mean "needed hand-holding." Henry and Ollie were mentioned a lot because I found them useful to work with. I'd spin up a session with Henry to plan roadmap work, or with Ollie to analyze system patterns. Those were productive collaborations, not failures.
Grace was different - 64 mentions but only 2 commits. Those mentions were mostly complaints and troubleshooting because she kept breaking. High mentions + low output = something is wrong.
The Ollie Exception
Ollie stands out: high mentions (104) AND high autonomous output (121 prefixed commits). He was valuable both as an interactive partner and as a background worker. In sessions, I'd work with him on system analysis and prompt optimization. In the background, his scheduled cron jobs would find issues and fix them on their own.
What made Ollie work in both modes? His prompt was designed for self-directed work: scheduled analysis cycles, persistent memory to learn from, clear criteria for when to act.
The takeaway: The agents I talked to most (Henry, Ollie) were the most valuable - that's collaboration, not failure. The exception was Grace, where high mentions meant "broken." For v2, I want more agents like Ollie: useful as interactive partners AND capable of self-directed background work. And I'm enforcing the commit prefix convention so I can actually measure what's happening.
Idea Diarrhea and the Ideas Backlog
Let me be honest about something: I have ADHD. And when I get excited about a tool, I get excited. The first week of this project, I was in full addiction mode.
Every idle thought became a feature request. "What if the agents could..." "We should add..." "Wouldn't it be cool if..." I was throwing 10 ideas per hour at the system. The agents would start on something, and I'd interrupt with something new. Nothing was ever finished because I kept changing direction.
Leaders do this too, by the way. We do it to our human teams all the time. The difference is humans get frustrated and push back. AI agents just... pivot. You can actually see the fragmentation effect on the work in a way that's harder to see with human teams who quietly absorb the chaos.
I'm aware of this pattern in myself and I actively manage it. Ask any of my reports - I'm pretty good at batching decisions and not thrashing my teams. But without that discipline, and without humans pushing back, it was easy to slip. Around Day 5, I realized I was the bottleneck. Not the agents. Me.
The fix was simple: an ideas backlog. A separate place where I could dump every thought without it becoming immediate work. I even set it up so new ideas would auto-sort into the backlog and update their own specs, so I didn't have to do any work to capture them.
The workflow became:
- Dump - Write the idea in ideas-backlog/ (or drop it in Slack)
- Triage - Kemo (business analyst) evaluates ROI
- Prioritize - Henry (PM) ranks against existing work
- Execute - Only then does it become real work
By the end, I had 5 well-documented ideas in the backlog:
- 001: AI Research Monitor - Keep up with papers from Anthropic, OpenAI, DeepMind
- 002: House Projects Manager - Track home maintenance and improvements
- 003: Personal Agent Suite - Unified personal automation
- 004: Finance Agent - Budget tracking and analysis
- 005: HEMA Club Manager - Club scheduling and member management
Instead of 50 half-finished features scattered across the codebase.
(Shoutout to The Multiverse School for introducing me to this framing - their Agentic SDLC course helped me recognize this pattern.)
Context Switching Hell
At peak chaos, I had 8+ interactive Claude sessions running simultaneously. This included things like:
- Debugging Grace (team manager)
- Working with Ollie (system consultant) on patterns
- Actual feature development
- Reviewing what Sophie (watchdog) was monitoring
- Talking to Henry (PM) about roadmap
About 5% of the time, I'd type a message into the wrong window. "Henry, why is this test failing?" sent to Ollie. "Ollie, what's the status on roadmap gardening?" sent to the developer session.
I actually do this as a manager - context switching between DMs with different team members. But there's usually a face or an avatar that reminds me who I'm talking to. Five identical terminal windows with slightly different prompts? My brain couldn't track it.
What helped: I learned to use different colors for different terminal tabs, with a specific color for each type of work. Green for feature dev, red for debugging, blue for planning. (Again, helpful tip from The Multiverse School.) I also had to force myself to limit the number of sessions I had open. If I needed a sixth session, I had to close one first.
Slack (via Grace) was supposed to solve this - one interface, she routes to the right agent. But Grace kept going down, so I kept falling back to direct terminal sessions, and the chaos returned.
Management Sims: Watching Myself Make Classic Leadership Mistakes
Here's the weirdest part: I got to watch myself make the exact management mistakes I tell other leaders to avoid. It's like playing a management simulation game, except the NPCs are AI agents and you're the one learning the lessons.
I actually thought I was a pretty good manager. And I still am - I've led massive, complex projects I'm proud of, and I've helped multiple reports earn promotions over the years. My career speaks for itself. But building this system made me realize that being good at managing humans doesn't automatically make you good at managing AI agents. The translation isn't automatic. You have to consciously reapply the skills, and sometimes the patterns that work with people don't work at all with agents.
Getting Too In The Weeds
This happened repeatedly: I'd have context from one session about how a feature worked, and I'd jump into another agent's work mid-task. "Actually, you should change the approach here because..." The agent would pivot, try to incorporate my feedback, and end up with a muddled implementation that didn't match either the original plan or my suggestion. I'd injected context the agent didn't have, and now we were both confused about what the goal was.
Even worse: sometimes I'd start working on something in an interactive session, not realizing an autonomous agent was already working on the same thing from the roadmap. We'd both be making changes to the same area of the codebase, with completely different assumptions. This happened a few times before I learned to check the roadmap first.
Classic micromanagement pattern. I had opinions because I had context, not because anyone needed my input. I'd coach anyone out of this in a heartbeat - and with humans, it doesn't happen as easily because the feedback loop is slower. But with AI agents, everything is lightning quick. You can jump in, derail something, and move on before you even register what you did. That speed is exactly why I caught myself doing something I normally coach others to avoid.
Forgetting What I Asked For
Day 12, I saw an agent starting work on "setting up Ollama" and immediately panicked. I had a vague long-term idea about eventually running local models, and my first thought was that the agent was trying to replace itself entirely - swapping out the Claude backbone for a local model. "Wait, what? We are not ready for this. Who approved this?" Total jump to conclusions.
I had asked for it a week earlier. I wanted to run my smart home stuff off a local model so I could finally move off my OpenAI API key, and Henry (PM) turned it into a work package that an agent claimed. I'd completely forgotten, then panicked when I saw it happening.
This is something leaders do constantly - give direction, forget, then question why the team is doing what they're doing. I'm actually good at avoiding this with human teams; it's something I've trained myself to watch for. But with agents, the throughput is just so much higher - there's more happening in parallel, more requests flying around, more things to keep track of. It's not just the speed that gets you, it's the volume.
The Leadership Meeting Fix
To stop myself from doing all of the above, around Day 10 I started enforcing structured leadership meetings. One hour. Henry (PM) facilitates. Clear agenda:
- Everything I complained about in Slack that week
- Blocked items needing my input
- Decisions that only I can make
- Status on major initiatives
Here's an example of what we actually covered:
Prompt optimization (Dec 27 meeting) - Ollie analyzed our context budget and flagged that we were burning too many tokens on instructions before agents even started working. We did a full audit:
Prompt Size Reduction Results:
- developer_prompt.md: 31,220 → 6,868 bytes (78% reduction)
- team_manager_grace.md: 42,329 → 7,526 bytes (82% reduction)
- Total across all prompts: 19% reduction
(Genuinely proud of this one - it meant agents had more room to actually work instead of burning tokens on verbose instructions.)
Other things we covered: git worktrees to eliminate merge conflicts, calling out fix-first culture violations, and forcing the team to show me what was actually working vs. what was just "in progress."
One structured meeting replaced 50 scattered Slack messages. I'd tell Henry "add X to the next meeting agenda" instead of derailing whatever the team was currently doing. At the end of the meeting, I could close my laptop knowing everything was addressed.
This is just good management practice with human teams too. But watching the agents flounder without it made it crystal clear how much chaos I was causing by not having this structure from the start.
Next Up: I Need to Stop Being the Glue
I logged about 680 messages over 18 days. That's an average of 38 interactions per day with my "autonomous" system.
| Intent | Count | Percentage |
|---|---|---|
| Feature requests (product work + building the agent system) | 441 | 69% |
| Bug reports + corrections (fixing the system) | 95 | 15% |
| Status checks + questions | 87 | 14% |
15% of my interactions were fixing the agent system itself. Every sixth message was me telling an agent it was broken, wrong, or needed to try again.
me: ok is something broken why didn't you respond?
me: why did no one respond to my above post yet?
me: what on earth is with these hourly timeouts? are they actual issues? why does it KEEP happening?
me: this cant keep breaking
me: henry, you said system healthy. it's not. talk to ollie. make sure you don't lie again.
38 interactions per day. That's how often I was stepping in - not to do the work, but to define it. Every new feature needed me to specify requirements. Every ambiguous situation needed my input. I was the one noticing when Grace went down, the one telling agents to actually finish their work instead of just reporting on it. The system couldn't move forward without me constantly feeding it direction.
What I'm Exploring in v2
I'm not committing to solutions yet - I want to experiment. But here's my thinking:
1. Architecture Decisions for Agentic Systems
Token cost of the initial prompt is easy to think about. Token cost of failure modes isn't - an agent stuck in a loop or repeatedly reporting the same issue can burn through way more than the prompt ever would. For v2, I want to ask better questions upfront: not just "how expensive is this?" but "what happens when it breaks?" A framework I want to internalize:
- Does it have trend analysis? - Will it notice if it's doing the same thing repeatedly without fixing the underlying issue? (Sophie didn't.)
- Does it have a failsafe? - What happens if it detects a problem it can't solve? Does it escalate, or does it just keep burning tokens?
- Is it event-driven or polling? - Cron polling burns tokens whether or not there's work. Event-driven only wakes up when needed - but only if the trigger is well-designed, and you'll probably miss edge cases or fail to update it as the system evolves.
Sophie's prompt had her detect issues and report them. It didn't have her check "have I reported this exact thing in the last hour?" That's the kind of gap I want to catch earlier.
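On the event-driven point specifically: the trigger itself isn't much code. A sketch, again assuming nats-py and a made-up subject name - the process sleeps until a message actually arrives instead of waking on a cron schedule to check:

```python
# Sketch of an event-driven trigger: wake the agent only when a message arrives.
# Subject name and the "launch" step are illustrative; assumes the nats-py client.
import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")

    async def on_slack_message(msg):
        # Only now do we spend tokens: hand the message to an agent session.
        print(f"Waking agent for: {msg.data.decode()}")

    await nc.subscribe("slack.inbound", cb=on_slack_message)
    await asyncio.Event().wait()  # sleep until messages arrive; no polling loop

asyncio.run(main())
```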
2. Simpler Topology (Maybe)
11 agents might have been too many. Or maybe the issue was how they were designed, not how many there were. I want to explore this, not assume.
Some things I'm curious about:
- Duplicated simple agents - The TDD developers (Nadia, Anette, Dorian) shared the exact same prompt, just with different names. I originally set it up this way to explore agentic memory - would they diverge over time based on their individual learnings? That's still interesting, but maybe there's a better design for it.
- Specialists by codebase area - I've seen resources talk about agents owning specific parts of the codebase. Worth testing.
- Agentic memory and personalities - Each agent had a learnings.md and mistakes.md that got compressed into their context. Over time, they'd diverge in behavior based on their experiences. That's interesting - maybe intentional bifurcation is a feature, not a bug? I want to explore this more.
3. Real Verification (Experiments Needed)
"Service is running" ≠ "service is working." That was painfully clear. But what does real verification look like?
What didn't work: Checking if a process was running. Checking if logs said "started." Checking if tests passed in CI.
What might work: Actually triggering the user-facing flow. End-to-end tests that simulate real usage. Health endpoints that test actual functionality, not just "am I alive?"
I don't have this figured out yet. But I know the current approach was aspirational documentation pretending to be operational verification.
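One shape I want to try: a health check that pushes a real message through the routing path and requires a reply. A sketch using NATS request/reply - the subject name is made up, and it assumes the listening agent actually answers pings:

```python
# Sketch of a functional health check: push a real message through the routing
# path and require a reply, instead of asking "is the process alive?".
# Subject name is made up; assumes the nats-py client and a listener that answers.
import asyncio
import nats
from nats.errors import TimeoutError as NatsTimeout

async def routing_actually_works() -> bool:
    nc = await nats.connect("nats://localhost:4222")
    try:
        reply = await nc.request("health.echo", b"ping", timeout=5)
        return reply.data == b"pong"
    except NatsTimeout:
        return False  # nothing answered: the flow is down, whatever the logs say
    finally:
        await nc.drain()

if __name__ == "__main__":
    ok = asyncio.run(routing_actually_works())
    print("routing OK" if ok else "routing BROKEN")
```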
4. Design for Both Modes
The agents I found most valuable (Henry, Ollie) were ones I could work with interactively. That's not a failure - that's useful. For v2, I want to design agents that work well as collaborative partners AND can do self-directed background work (like Ollie did).
Also: enforce the commit prefix convention so I can actually measure what's autonomous vs interactive. Can't improve what you can't measure.
5. Question the Frameworks
Maybe "team manager" doesn't make sense for AI agents at this maturity level. Maybe the org chart metaphor is the wrong abstraction entirely. I built what I knew - a team with a manager, developers, analysts. But agents aren't people, and the patterns that work for human teams might not translate.
Worth exploring: What if there's no manager at all? What if routing is just a simple script, not an "agent"? What if the org chart is flat?
I don't know. But I'm suspicious of my own assumptions now.
6. Stop Working in a Silo
I've been doing this mostly alone - reading occasional articles, learning from my own mistakes, researching where I need to. That's fine for exploration, and I knew I'd need to start talking to other people doing this stuff at some point. Now that I've made my own mistakes, I'm ready. Humans, specifically.
If you have favorite meetups, communities, or resources for people building multi-agent systems (especially from a management/coordination angle rather than pure ML), please reach out. I've got my own list I'm already involved with, but I'm open to recommendations.
The Bottom Line
A 92% completion rate looks great on paper. 1,159 commits sounds impressive. But the numbers don't tell the whole story - and I didn't instrument things well enough to know the full picture.
What I do know: the agents I talked to most (Henry, Ollie) were the most valuable. That wasn't a failure of autonomy - that was productive collaboration. The exception was Grace, where high mentions meant "broken," not "useful."
What didn't work was Grace (kept breaking), the complexity overhead (15% of my messages went to fixing the system itself), and the token burn cascades. Those need to go.
Here's the interesting part: the cost of just rebuilding from scratch is almost negligible now. Redesigning has gotten cheap enough that iterating on something brittle doesn't make sense anymore.
Starting fresh. Will share what I learn.
Appendix: How This Data Was Collected
All the numbers in this post come from actual system logs. I used Claude to help pull and analyze this data, and verified the queries myself.
- Messages from me: Logged via hook in Claude Code (interactive sessions) and Slack bridge. Stored in SQLite with timestamps and auto-classified intent (feature-request, bug-report, correction, status-check, etc.)
- Agent mentions: SQL query counting how often each agent name appeared in my messages
- Commits by agent: Git log filtered by commit message prefix (agents use "AgentName: message" format). Only 167 commits (15%) had agent prefixes; 972 (85%) did not use the convention, so their source is unknown.
- Work packages: Tracked in plans/index.yaml - discrete units of work like tickets, with status tracking
- Incidents: Logged in orchestrator/logs/incidents.log
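If you want to replicate the counting, it's roughly this (the messages table, agent list, and prefix matching are the assumptions described above, not my exact queries):

```python
# Sketch of the counting behind the mentions/commits table above.
# The messages table and agent list are illustrative; the commit prefix
# convention is "AgentName: message".
import sqlite3
import subprocess
from collections import Counter

AGENTS = ["henry", "ollie", "grace", "sophie"]

def mention_counts(db_path: str = "prompts.db") -> Counter:
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT prompt FROM messages").fetchall()
    conn.close()
    counts: Counter = Counter()
    for (text,) in rows:
        lowered = text.lower()
        for agent in AGENTS:
            counts[agent] += lowered.count(agent)
    return counts

def prefixed_commit_counts(repo: str = ".") -> Counter:
    subjects = subprocess.run(
        ["git", "-C", repo, "log", "--pretty=%s"], capture_output=True, text=True
    ).stdout.splitlines()
    counts: Counter = Counter()
    for subject in subjects:
        for agent in AGENTS:
            if subject.lower().startswith(f"{agent}:"):
                counts[agent] += 1
    return counts

if __name__ == "__main__":
    print("mentions:", mention_counts())
    print("autonomous commits:", prefixed_commit_counts())
```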
The repos are private - I did some basic scrubbing to get security stuff out, but honestly I'm too lazy to do a thorough review for public sharing. I might make v2 more publicly shareable from the start.