Field Notes From an Eng Manager Building Her First Autonomous Agent System
Three weeks ago, I posted about an autonomous agent team that was building out my ideas while I got to go outside and touch grass. Christmas morning, they surprised me with a year-in-review feature I never asked for. It felt like magic.
Today, I pulled the plug on v1 and I'm starting from scratch.
- What worked: Roadmap-driven development, TDD discipline, ideas backlog
- What broke: 28-hour blind spots, token burn cascades, too many agents
- Key insight: The agents I talked to most were the most valuable - except Grace, where high mentions meant "broken"
Where I'm Coming From
Some context: I've been managing engineering teams for years, and before this project I was already toying with interactive Claude sessions - building little locally-hosted tools to automate parts of my workflow, both at work and in my personal life. When I left my last job, I was genuinely excited to take time off and actually dig into multi-agent systems properly. I wanted to go through The Multiverse School's Agentic SDLC curriculum and some other resources, to actually understand how this stuff works instead of just poking at it during stolen hours on weekends.
This Steve Yegge article has a good breakdown of the stages developers go through with AI. For reference, I'd say I'm at Figure 8 now - it's messy, but I'm past Figure 7 and the stages below it.
I went in with a healthy sense of skepticism, largely because I've worked closely with friends who are actual engineers and developers - people with many more years of hands-on coding experience than me. When they're suspicious of something, I pay attention. And there's been a lot to be suspicious of lately: the buzzwords, the LinkedIn hype, the "look at my AI do magic" posts that conveniently skip over all the failures. So when I decided to build this system, I wasn't trying to prove AI was magical. I wanted to understand where it actually helps and where it falls apart.
But I've also been noticing something that's hard to ignore. The people who roll their eyes at AI tools and the people who use them effectively seem to be having completely different experiences. I wanted to know which one was real, or if somehow both were. So I decided to prove it to myself: build something real, see what works and what breaks. And in my gut, I do believe there's something here that's going to change how we work. I'm coming at this from a management angle, not a "10 years as a developer" angle, but I trust my instincts enough to take them seriously.
I'm writing this blog post with heavy assistance from AI, using Claude to pull data from my databases, help draft sections, and verify timelines. And even in that process, I noticed things. I was building out some workshop materials and asked Claude for ideas on additional topics I could cover if we finished early. It put together a great list with sources for 6-7 topics, but just... skipped environmental impact of AI entirely. Could be negligible. Could be something. When I asked it to explain how AI companies make models safer, it initially wrote "how they make agents helpful and safe" - and I had to push back: "No, write 'how they try to make agents helpful and safe.' This is still dangerous and people need to take everything with a grain of salt."
I'm biased too. We all are. I think where we're heading is that trusted sources are going to be humans you've worked with and trust to use these tools correctly - people who are transparent about the process. (So hey, throwing my hat in the ring. I'm happy to be one of those people you trust to be transparent.)
The Final Numbers
Looks great, right? 92% completion rate. Over a thousand commits. An entire team of specialized agents working in parallel.
Yeah. Keep a healthy suspicion of numbers like these, especially when someone posts them on LinkedIn. This is what those numbers don't tell you. (If you want the technical details on how the system was built, see the companion architecture post.)
How I Tracked This
Quick aside on the data: I set up a system to log every message I sent to the agents - whether through Slack or terminal sessions. The hook captured my prompts and stored them in a SQLite database with timestamps and auto-classified intent (feature request, bug report, correction, status check, etc.).
This wasn't for vanity metrics. I wanted Ollie (system consultant) to be able to analyze patterns in how I was using the system - recurring frustrations, repeated requests, things I kept asking for that weren't getting done. The idea was that the system could learn from my behavior and improve itself.
In practice, it mostly just gave me a detailed record of how often I was frustrated.
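For the curious, the hook itself is tiny. This isn't the real one, just a minimal sketch of the shape - take the prompt, guess an intent from keywords, append to SQLite. The table name, keyword lists, and intent labels are all illustrative:

```python
# Minimal sketch of the prompt-logging hook (illustrative, not the real one).
# Expects the raw prompt text as command-line arguments.
import sqlite3
import sys
from datetime import datetime, timezone

DB_PATH = "prompts.db"  # hypothetical path

# Naive keyword-based intent classification - good enough for rough trends.
INTENT_KEYWORDS = {
    "bug-report": ["broken", "error", "failing", "why didn't"],
    "correction": ["no,", "actually", "that's wrong", "instead"],
    "status-check": ["status", "any update", "where are we"],
}

def classify(prompt: str) -> str:
    lowered = prompt.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return intent
    return "feature-request"  # default bucket

def log_prompt(prompt: str) -> None:
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages (ts TEXT, intent TEXT, prompt TEXT)"
    )
    conn.execute(
        "INSERT INTO messages VALUES (?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), classify(prompt), prompt),
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    log_prompt(" ".join(sys.argv[1:]))
```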
The Timeline
I went through the actual commit history to verify this timeline. First commit was December 18th, 2025: "Initial commit: Agent Automation System." It was just orchestrator scripts at that point.
grace-watcher.sh is not in crontab. The architecture depends on grace-watcher.sh polling NATS every 1-2 minutes to detect Slack messages and launch Grace. Without this cron entry, Grace is never triggered.
The crontab said "uses systemd timer" but no timer existed. 73 commits - most of them trying to fix things.
About the Names
If you're one of my friends and you recognize a name... lol, sorry I didn't ask permission. The agents are all named after pets I've had, alive and dead:
- Sophie - my current dog (watchdog agent)
- Henry - my late cat (project manager)
- Ollie - my cousin's dog who used to live with us (system consultant)
- Grace - other pets from over the years (team manager)
- Dorian, Nadia, Anette - Dorian is my partner's/our current cat; Nadia and Anette are pets from over the years (developers)
- Kemo - another pet (business analyst)
- Bertha - one of my chickens (quality engineer)
- Ralph, Ginny - friends' cats (UX specialist, frontend developer)
I named them partly because it felt more fun than TDD-Developer-Agent, Business-Analysis-Agent, but mostly because it helps my brain remember who does what. I'm a human being. That's how my brain is wired.
What I Built Along the Way
The agent system wasn't just building itself - I had them working on actual projects I cared about:
- Smarthome - Home Assistant integrations, voice control via Whisper/Piper. I'm increasingly privacy-conscious and annoyed at Amazon and Google smart home systems - they don't function how I want, they take my data, they advertise at me. Not getting into that rabbit hole here, but it motivated building my own.
- Trading Bot - Testing my ideas on paper, seeing if my assumptions are right about decisions people and companies are making in the market. Genesis of wanting to validate my thinking systematically.
- Health App - Nutrient-focused meal planning. I know there are a million health apps, but I'm very opinionated about how I approach nutrition and activity, and I want research tightly incorporated into both. Bespoke for me.
- Relic - A Unity game project (with a friend)
- F3 Sword Academy - I do social media for my HEMA club and help with membership tracking and setup. This was about automating some of that.
- Personal Automation - Finance tracking, house projects, personal assistant features. Same approach as everything else: I know apps exist for this, I just want bespoke ones for myself and my specific workflows.
By the end I was managing 11 different project domains through the same agent infrastructure. Way too many. Definitely too many. But this is also why I wanted to build the system in the first place - to be able to have multiple projects be somewhat self-managed. The reality is the system just wasn't at a maturity level where I could spread attention across that many things. Moving forward, I'd pick maybe three things to manage until I prove the system is actually providing value.
Here's the thing I realized though: I was starting on these things and getting really cool initial MVPs, but I wasn't really using any of them. I was using the smarthome project the most, probably because that's the one with the best UX - it's conversational, it doesn't force structure on me. And maybe that's the lesson: UX for anything backed by AI needs to be more conversational. You can't assume structure or force structure on it.
I did start spinning up UX research prompts and... it's hard. If anyone has ideas on this, please send me a link. I'd love to read more.
What Actually Worked
1. Roadmap-Driven Development
Every agent checked a shared plans/index.yaml before starting work. They'd claim a work package, announce it on NATS (our message bus), do the work, mark it complete. The NATS announcement was key - it acted as an atomic operation to prevent two agents from claiming the same work. 432 work packages tracked this way. This mirrors what Anthropic recommends for building agentic systems - explicit task boundaries and clear handoffs.
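For concreteness, here's a minimal sketch of that claim-and-announce flow. It's not the actual orchestrator code - the YAML layout, the subject name, and the nats-py usage are assumptions for illustration, and a real version needs the claim itself to be atomic, which this glosses over:

```python
# Hypothetical sketch of the claim-and-announce flow described above.
# Assumes a plans/index.yaml shaped roughly like:
#   work_packages:
#     - id: WP-101
#       status: open        # open | claimed | done
#       claimed_by: null
# and the nats-py client for the announcement. All names are illustrative.
import asyncio
import yaml
import nats

PLANS = "plans/index.yaml"

async def claim_next_package(agent_name: str) -> str | None:
    with open(PLANS) as f:
        plans = yaml.safe_load(f)

    # Find the first unclaimed package.
    package = next(
        (wp for wp in plans["work_packages"] if wp["status"] == "open"), None
    )
    if package is None:
        return None

    # NOTE: a real version needs this read-modify-write to be atomic
    # (file lock, compare-and-swap, etc.); this sketch glosses over that.
    package["status"] = "claimed"
    package["claimed_by"] = agent_name
    with open(PLANS, "w") as f:
        yaml.safe_dump(plans, f)

    # Announce the claim on NATS so other agents skip this package.
    nc = await nats.connect("nats://localhost:4222")
    await nc.publish("work.claimed", f"{agent_name}:{package['id']}".encode())
    await nc.drain()
    return package["id"]

if __name__ == "__main__":
    print(asyncio.run(claim_next_package("nadia")))
```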
2. TDD Discipline
Agents wrote tests first, then implementation. CI had to pass before marking anything complete. This is a common industry pattern for agentic systems - Anthropic's building agents guide emphasizes automated verification loops. If you're going to have agents write code, you need verification that doesn't depend on vibes.
I cover this in my Agentic AI Workshop, though I note it's something that comes later in the maturity cycle.
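To make the "verification that doesn't depend on vibes" part concrete, here's a sketch of a completion gate, reusing the hypothetical plans/index.yaml layout from the sketch above - the work package only flips to done if the test suite exits clean:

```python
# Sketch of a completion gate: only mark a work package done if tests pass.
# Reuses the hypothetical plans/index.yaml layout from the previous sketch.
import subprocess
import yaml

PLANS = "plans/index.yaml"

def mark_done_if_green(package_id: str) -> bool:
    # Run the test suite; the exit code is the only signal we trust.
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if result.returncode != 0:
        print(f"{package_id}: tests failing, leaving status as-is")
        return False

    with open(PLANS) as f:
        plans = yaml.safe_load(f)
    for wp in plans["work_packages"]:
        if wp["id"] == package_id:
            wp["status"] = "done"
    with open(PLANS, "w") as f:
        yaml.safe_dump(plans, f)
    return True
```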
3. Named Agents with Roles
Grace (team manager), Henry (PM), Sophie (watchdog), Ollie (system consultant), Kemo (business analyst), developers (Nadia, Anette, Dorian). When something broke, I could ask "Sophie, why didn't you catch this?" instead of debugging an anonymous system. Having roles also made it clearer what each agent was responsible for.
Contrast that with something like TDD-Developer-Agent. When I say "Sophie," it's unambiguous - one entity, no confusion with other terms that might get misinterpreted. This matters because there are many entry points into this system - Slack, terminal sessions, voice, different contexts - and I'm verbally interfacing with it from a bunch of different angles. A clear, single name cuts through all of that.
4. Culture Documentation
I collaborated with Ollie (system consultant) in an interactive session to create AGENT_CULTURE.md - a document capturing my ideas and ethos around how the team should work. Team values, protocols, expected behaviors. The agents actually followed it.
This came about because I kept seeing the same mistakes happen:
me: Ok, is there any way that we see trends? Like, if Sophie keeps on getting spun up to deal with this every few hours, something is clearly wrong. Will the system correct for that?
The culture doc became a forcing function for thinking through edge cases - team values, protocols, expected behaviors, and a list of banned language.
That "banned language" section came directly from watching the agents pass issues back and forth without anyone actually owning them. Reminded me of every dysfunctional human team I've seen.
Side note for v2: I want to add tests for this - actually count how many times the banned patterns appear, and whether they decrease over time. I started building that, and it was beginning to work. But I'm also not sure if a culture doc is where this should live. I've seen other resources suggest this kind of thing should be enforced systematically, not just documented. Still figuring that out.
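I'm not going to pretend this is the code I had, but the shape of the check is roughly this (the banned phrases and the logged-messages table are illustrative assumptions):

```python
# Sketch: count "banned language" hits per day to see if they trend down.
# The phrase list and the agent_messages table are illustrative assumptions -
# this is the shape of the check, not the code I actually had.
import sqlite3
from collections import Counter

BANNED_PHRASES = [
    "out of scope for this agent",
    "someone should",
    "will escalate",  # escalation without an owner
]

def banned_counts_by_day(db_path: str = "prompts.db") -> Counter:
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT ts, body FROM agent_messages").fetchall()
    conn.close()

    counts: Counter = Counter()
    for ts, text in rows:
        day = ts[:10]  # bucket by the date prefix of the ISO timestamp
        hits = sum(text.lower().count(p) for p in BANNED_PHRASES)
        if hits:
            counts[day] += hits
    return counts

if __name__ == "__main__":
    for day, n in sorted(banned_counts_by_day().items()):
        print(day, n)
```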
What Kept Breaking (A Detailed Breakdown)
Grace: The Team Manager Who Couldn't Stay Online
Grace (team manager) was supposed to be my interface to the system. Messages came through Slack, Grace routed them to the right agent, reported back. In theory.
I'm a manager. Of course I built a manager agent first, right? Classic human bias in system design. I'll come back to this.
In practice, Grace went down in at least four different ways over 18 days:
| Date | What Broke | Duration | Root Cause |
|---|---|---|---|
| Dec 21-22 | grace-watcher.sh not in cron | 28+ hours | Crontab said "uses systemd timer" but no timer existed |
| Dec 23 | slack-listener service down | 5.5 hours | Process terminated, no auto-restart configured |
| Dec 26 | grace-trigger daemon hung | 10+ hours | Stopped logging/responding, no heartbeat detection |
| Multiple | Event-driven triggers never deployed | Ongoing | Said "deployed" in docs, wasn't actually running |
Each time I added more monitoring: health check endpoints, heartbeat systems, watchdog timers, Sophie (watchdog) monitoring Grace, a watchdog for Sophie monitoring Grace. The complexity just created more failure modes.
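To be clear, none of those layers were individually complicated - each one was some variant of a staleness check, roughly this shape (paths and thresholds made up):

```python
# Sketch of one monitoring layer: a heartbeat staleness check.
# Paths and thresholds are made up for illustration.
import os
import time

HEARTBEAT_FILE = "/var/run/agents/grace.heartbeat"  # hypothetical path
MAX_AGE_SECONDS = 5 * 60

def is_stale() -> bool:
    try:
        age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
    except FileNotFoundError:
        return True  # never wrote a heartbeat: definitely down
    return age > MAX_AGE_SECONDS

if __name__ == "__main__":
    if is_stale():
        print("Grace heartbeat is stale - page a human, don't spawn another agent")
```

The problem wasn't writing any one of them; it was that every new layer was one more thing that could silently stop running.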
At some point I realized: the amount of time I was spending maintaining Grace didn't match the value she was providing. I was able to just... exist without her. Interactive terminal sessions worked fine. The whole elaborate Slack routing system was overhead I didn't actually need.
Sophie Wasting Tokens: A Mirror of Human Patterns
Sophie (watchdog) was supposed to monitor system health and report issues. What she actually did: detect a NATS timeout error, "investigate" it, report it to Slack, then detect the same error 15 minutes later. Same investigation. Same report. Burning tokens on repeat without actually fixing the underlying trend.
me: look at the history in #colby-server-health slack. see that sophie is still wasting cycles on things that are not real alerts (alerting updated needed). please have ollie think about this pattern and update her prompt to look at how often something is happening
Here's what Sophie's commit history looked like on a single day: four "health check cycle complete" commits within 3 minutes. Each one burned tokens. None of them actually fixed anything.
Also: this 100% should have been code, not an LLM call. Simple logic like "have I seen this exact error in the last hour?" doesn't need intelligence - it needs a counter and a timestamp. I was reaching for the AI hammer when a bash script would have worked better.
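The "counter and a timestamp" version really is this small (a sketch, with made-up names):

```python
# Sketch: suppress repeat reports of the same error within a time window.
# A counter and a timestamp, no LLM call. Names are illustrative.
import sqlite3
import time

WINDOW_SECONDS = 60 * 60  # don't re-investigate the same error within an hour

def should_report(error_signature: str, db_path: str = "seen_errors.db") -> bool:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen (signature TEXT PRIMARY KEY, last_ts REAL)"
    )
    row = conn.execute(
        "SELECT last_ts FROM seen WHERE signature = ?", (error_signature,)
    ).fetchone()
    now = time.time()
    if row and now - row[0] < WINDOW_SECONDS:
        conn.close()
        return False  # seen recently: skip the investigation entirely

    conn.execute(
        "INSERT INTO seen (signature, last_ts) VALUES (?, ?) "
        "ON CONFLICT(signature) DO UPDATE SET last_ts = excluded.last_ts",
        (error_signature, now),
    )
    conn.commit()
    conn.close()
    return True
```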
I know this. I made the right call at plenty of other points. But when you're juggling too many projects on top of a brittle system, it's hard to know exactly where to sit down and enforce that discipline. Maybe there's a way to build this into prompts so agents automatically lean towards scripts for simple logic. But then you run into scripts not being 100% correct in what they're doing, and you don't know until something breaks. It's a recursive chicken-and-egg problem I haven't solved yet.
The 28-Hour Blind Spot
Grace (team manager) went down for 28 hours and I didn't notice. How? Because I was using interactive terminal sessions for my actual work. Grace only mattered for Slack integration - she was the agent who watched for messages in our shared Slack workspace and routed them to the right team member. The whole point was that I could drop a message in Slack from my phone while walking the dog, and the agents would handle it.
And I wasn't checking Slack because I was hoping the agents would tell me if something was wrong. They couldn't tell me. Because Grace was down.
I want to be clear: this wasn't a production system - it was an exploratory project, a simulation. I wasn't betting my business on it. I knew I could come back whenever I wanted and see what happened. But still. 28 hours is a long time to not notice your primary user interface is dead.
The Great Token Burn of January 2026
What happened: The orchestrator had a bug where multiple agents could spawn without checking if others were already running. One bad cron trigger later, 13 agents all tried to work at once.
Root cause from my incident log:
- Memory: 13 concurrent Claude sessions (8.3GB total) without system-wide limits
- Resource limits documented but not enforced (2GB per agent not implemented)
- Documented limits not enforced → enforcement gap exposed
The docs said "2GB per agent limit." The system didn't actually enforce it. Aspirational documentation is not operational documentation.
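The enforcement gap is also a small amount of code. Here's a sketch of the guard I wish had existed - a hard cap on concurrent agent sessions before spawning another (the "claude" process-name match is an assumption about how the sessions show up in the process list, and the cap is arbitrary):

```python
# Sketch: refuse to spawn another agent when too many are already running.
# The "claude" process-name match is an assumption about how sessions
# show up in the process list; the cap is arbitrary.
import subprocess
import sys

MAX_CONCURRENT_AGENTS = 4

def running_agent_count() -> int:
    # pgrep -f matches the full command line, -c prints the match count.
    result = subprocess.run(
        ["pgrep", "-fc", "claude"], capture_output=True, text=True
    )
    return int(result.stdout.strip() or 0)

def spawn_allowed() -> bool:
    count = running_agent_count()
    if count >= MAX_CONCURRENT_AGENTS:
        print(f"{count} agent sessions already running - refusing to spawn another")
        return False
    return True

if __name__ == "__main__":
    sys.exit(0 if spawn_allowed() else 1)
```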
The silver lining: Getting locked out of my own system for 24 hours forced me to step away and think. Not just "how do I fix this" but "should I even be iterating on something this brittle?" It's the advice I keep giving friends who ask about their projects: sometimes you need to let go of sunk cost and redesign from scratch. Funny how that advice is harder to follow when it's your own system.
I Should Probably Collect More Data, Huh
Lesson learned: instrument everything from day one. I didn't set up proper metrics collection, so I'm working with incomplete data. To be fair, I had no idea I'd be writing this post - I wasn't even sure I'd keep going with the project. But here we are, and V2 will have better hooks and dashboards from the start.
Here's what I did manage to capture - which agents I mentioned most in my messages, and which agents made autonomous commits:
| Agent | My Mentions | Autonomous Commits | What This Suggests |
|---|---|---|---|
| Henry (PM) | 129 | 21 | Valuable partner - I used him a lot for roadmap work |
| Ollie (System Consultant) | 104 | 121 | Valuable in both modes - talked to AND autonomous output |
| Grace (Team Manager) | 64 | 2 | High mentions + low output = broken, not productive |
| Sophie (Watchdog) | 42 | 15 | Low mentions + output = working quietly in background |
The pattern that emerges: high mentions doesn't mean "needed hand-holding." Henry and Ollie were mentioned a lot because I found them useful to work with. I'd spin up a session with Henry to plan roadmap work, or with Ollie to analyze system patterns. Those were productive collaborations, not failures.
Grace was different - 64 mentions but only 2 commits. Those mentions were mostly complaints and troubleshooting because she kept breaking. High mentions + low output = something is wrong.
The Ollie Exception
Ollie stands out: high mentions (104) AND high autonomous output (121 prefixed commits). He was valuable both as an interactive partner and as a background worker. In sessions, I'd work with him on system analysis and prompt optimization. In the background, his scheduled cron jobs would find issues and fix them on their own.
What made Ollie work in both modes? His prompt was designed for self-directed work: scheduled analysis cycles, persistent memory to learn from, clear criteria for when to act.
The takeaway: The agents I talked to most (Henry, Ollie) were the most valuable - that's collaboration, not failure. The exception was Grace, where high mentions meant "broken." For v2, I want more agents like Ollie: useful as interactive partners AND capable of self-directed background work. And I'm enforcing the commit prefix convention so I can actually measure what's happening.
Idea Diarrhea and the Ideas Backlog
Let me be honest about something: I have ADHD. And when I get excited about a tool, I get excited. The first week of this project, I was in full addiction mode.
Every idle thought became a feature request. "What if the agents could..." "We should add..." "Wouldn't it be cool if..." I was throwing 10 ideas per hour at the system. The agents would start on something, and I'd interrupt with something new. Nothing was ever finished because I kept changing direction.
Leaders do this too, by the way. We do it to our human teams all the time. The difference is humans get frustrated and push back. AI agents just... pivot. You can actually see the fragmentation effect on the work in a way that's harder to see with human teams who quietly absorb the chaos.
I'm aware of this pattern in myself and I actively manage it. Ask any of my reports - I'm pretty good at batching decisions and not thrashing my teams. But without that discipline, and without humans pushing back, it was easy to slip. Around Day 5, I realized I was the bottleneck. Not the agents. Me.
The fix was simple: an ideas backlog. A separate place where I could dump every thought without it becoming immediate work. I even set it up so new ideas would auto-sort into the backlog and update their own specs, so I didn't have to do any work to capture them.
The workflow became:
- Dump - Write the idea in ideas-backlog/ (or drop it in Slack)
- Triage - Kemo (business analyst) evaluates ROI
- Prioritize - Henry (PM) ranks against existing work
- Execute - Only then does it become real work
By the end, I had 5 well-documented ideas in the backlog:
- 001: AI Research Monitor - Keep up with papers from Anthropic, OpenAI, DeepMind
- 002: House Projects Manager - Track home maintenance and improvements
- 003: Personal Agent Suite - Unified personal automation
- 004: Finance Agent - Budget tracking and analysis
- 005: HEMA Club Manager - Club scheduling and member management
Instead of 50 half-finished features scattered across the codebase.
(Shoutout to The Multiverse School for introducing me to this framing - their Agentic SDLC course helped me recognize this pattern.)
Context Switching Hell
At peak chaos, I had 8+ interactive Claude sessions running simultaneously. This included things like:
- Debugging Grace (team manager)
- Working with Ollie (system consultant) on patterns
- Actual feature development
- Reviewing what Sophie (watchdog) was monitoring
- Talking to Henry (PM) about roadmap
About 5% of the time, I'd type a message into the wrong window. "Henry, why is this test failing?" sent to Ollie. "Ollie, what's the status on roadmap gardening?" sent to the developer session.
I actually do this as a manager - context switching between DMs with different team members. But there's usually a face or an avatar that reminds me who I'm talking to. Five identical terminal windows with slightly different prompts? My brain couldn't track it.
What helped: I learned to use different colors for different terminal tabs, with a specific color for each type of work. Green for feature dev, red for debugging, blue for planning. (Again, helpful tip from The Multiverse School.) I also had to force myself to limit the number of sessions I had open. If I needed a sixth session, I had to close one first.
Slack (via Grace) was supposed to solve this - one interface, she routes to the right agent. But Grace kept going down, so I kept falling back to direct terminal sessions, and the chaos returned.
Management Sims: Watching Myself Make Classic Leadership Mistakes
Here's the weirdest part: I got to watch myself make the exact management mistakes I tell other leaders to avoid. It's like playing a management simulation game, except the NPCs are AI agents and you're the one learning the lessons.
I actually thought I was a pretty good manager. And I still am - I've led massive, complex projects I'm proud of, and I've helped multiple reports earn promotions over the years. My career speaks for itself. But building this system made me realize that being good at managing humans doesn't automatically make you good at managing AI agents. The translation isn't automatic. You have to consciously reapply the skills, and sometimes the patterns that work with people don't work at all with agents.
Getting Too In The Weeds
This happened repeatedly: I'd have context from one session about how a feature worked, and I'd jump into another agent's work mid-task. "Actually, you should change the approach here because..." The agent would pivot, try to incorporate my feedback, and end up with a muddled implementation that didn't match either the original plan or my suggestion. I'd injected context the agent didn't have, and now we were both confused about what the goal was.
Even worse: sometimes I'd start working on something in an interactive session, not realizing an autonomous agent was already working on the same thing from the roadmap. We'd both be making changes to the same area of the codebase, with completely different assumptions. This happened a few times before I learned to check the roadmap first.
Classic micromanagement pattern. I had opinions because I had context, not because anyone needed my input. I'd coach anyone out of this in a heartbeat - and with humans, it doesn't happen as easily because the feedback loop is slower. But with AI agents, everything is lightning quick. You can jump in, derail something, and move on before you even register what you did. That speed is exactly why I caught myself doing something I normally coach others to avoid.
Forgetting What I Asked For
Day 12, I saw an agent starting work on "setting up Ollama" and immediately panicked. I had a vague long-term idea about eventually running local models, and my first thought was that the agent was trying to replace itself entirely - swapping out the Claude backbone for a local model. "Wait, what? We are not ready for this. Who approved this?" Total jump to conclusions.
I had asked for it a week earlier. I wanted to run my smart home stuff off a local model so I could finally move off my OpenAI API key, and Henry (PM) turned it into a work package that an agent claimed. I'd completely forgotten, then panicked when I saw it happening.
This is something leaders do constantly - give direction, forget, then question why the team is doing what they're doing. I'm actually good at avoiding this with human teams; it's something I've trained myself to watch for. But with agents, the throughput is just so much higher - there's more happening in parallel, more requests flying around, more things to keep track of. It's not just the speed that gets you, it's the volume.
The Leadership Meeting Fix
To stop myself from doing all of the above, around Day 10 I started enforcing structured leadership meetings. One hour. Henry (PM) facilitates. Clear agenda:
- Everything I complained about in Slack that week
- Blocked items needing my input
- Decisions that only I can make
- Status on major initiatives
Here's an example of what we actually covered:
Prompt optimization (Dec 27 meeting) - Ollie analyzed our context budget and flagged that we were burning too many tokens on instructions before agents even started working. We did a full audit:
Prompt Size Reduction Results:
- developer_prompt.md: 31,220 → 6,868 bytes (78% reduction)
- team_manager_grace.md: 42,329 → 7,526 bytes (82% reduction)
- Total across all prompts: 19% reduction
(Genuinely proud of this one - it meant agents had more room to actually work instead of burning tokens on verbose instructions.)
Other things we covered: git worktrees to eliminate merge conflicts, calling out fix-first culture violations, and forcing the team to show me what was actually working vs. what was just "in progress."
One structured meeting replaced 50 scattered Slack messages. I'd tell Henry "add X to the next meeting agenda" instead of derailing whatever the team was currently doing. At the end of the meeting, I could close my laptop knowing everything was addressed.
This is just good management practice with human teams too. But watching the agents flounder without it made it crystal clear how much chaos I was causing by not having this structure from the start.
Next Up: I Need to Stop Being the Glue
I logged about 680 messages over 18 days. That's an average of 38 interactions per day with my "autonomous" system.
| Intent | Count | Percentage |
|---|---|---|
| Feature requests (product work + building the agent system) | 441 | 69% |
| Bug reports + corrections (fixing the system) | 95 | 15% |
| Status checks + questions | 87 | 14% |
15% of my interactions were fixing the agent system itself. Every sixth message was me telling an agent it was broken, wrong, or needed to try again.
me: ok is something broken why didn't you respond?
me: why did no one respond to my above post yet?
me: what on earth is with these hourly timeouts? are they actual issues? why does it KEEP happening?
me: this cant keep breaking
me: henry, you said system healthy. it's not. talk to ollie. make sure you don't lie again.
38 interactions per day. That's how often I was stepping in - not to do the work, but to define it. Every new feature needed me to specify requirements. Every ambiguous situation needed my input. I was the one noticing when Grace went down, the one telling agents to actually finish their work instead of just reporting on it. The system couldn't move forward without me constantly feeding it direction.
What I'm Exploring in v2
I'm not committing to solutions yet - I want to experiment. But here's my thinking:
1. Architecture Decisions for Agentic Systems
Token cost of the initial prompt is easy to think about. Token cost of failure modes isn't - an agent stuck in a loop or repeatedly reporting the same issue can burn through way more than the prompt ever would. For v2, I want to ask better questions upfront: not just "how expensive is this?" but "what happens when it breaks?" A framework I want to internalize:
- Does it have trend analysis? - Will it notice if it's doing the same thing repeatedly without fixing the underlying issue? (Sophie didn't.)
- Does it have a failsafe? - What happens if it detects a problem it can't solve? Does it escalate, or does it just keep burning tokens?
- Is it event-driven or polling? - Cron polling burns tokens whether or not there's work. Event-driven only wakes up when needed - but only if the trigger is well-designed, and you'll probably miss edge cases or fail to update it as the system evolves.
Sophie's prompt had her detect issues and report them. It didn't have her check "have I reported this exact thing in the last hour?" That's the kind of gap I want to catch earlier.
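On the event-driven point specifically: the trigger itself isn't much code. A sketch, again assuming nats-py and a made-up subject name - the process sleeps until a message actually arrives instead of waking on a cron schedule to check:

```python
# Sketch of an event-driven trigger: wake the agent only when a message arrives.
# Subject name and the "launch" step are illustrative; assumes the nats-py client.
import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")

    async def on_slack_message(msg):
        # Only now do we spend tokens: hand the message to an agent session.
        print(f"Waking agent for: {msg.data.decode()}")

    await nc.subscribe("slack.inbound", cb=on_slack_message)
    await asyncio.Event().wait()  # sleep until messages arrive; no polling loop

asyncio.run(main())
```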
2. Simpler Topology (Maybe)
11 agents might have been too many. Or maybe the issue was how they were designed, not how many there were. I want to explore this, not assume.
Some things I'm curious about:
- Duplicated simple agents - The TDD developers (Nadia, Anette, Dorian) shared the exact same prompt, just with different names. I originally set it up this way to explore agentic memory - would they diverge over time based on their individual learnings? That's still interesting, but maybe there's a better design for it.
- Specialists by codebase area - I've seen resources talk about agents owning specific parts of the codebase. Worth testing.
- Agentic memory and personalities - Each agent had a learnings.md and mistakes.md that got compressed into their context. Over time, they'd diverge in behavior based on their experiences. That's interesting - maybe intentional bifurcation is a feature, not a bug? I want to explore this more.
3. Real Verification (Experiments Needed)
"Service is running" ≠ "service is working." That was painfully clear. But what does real verification look like?
What didn't work: Checking if a process was running. Checking if logs said "started." Checking if tests passed in CI.
What might work: Actually triggering the user-facing flow. End-to-end tests that simulate real usage. Health endpoints that test actual functionality, not just "am I alive?"
I don't have this figured out yet. But I know the current approach was aspirational documentation pretending to be operational verification.
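One shape I want to try: a health check that pushes a real message through the routing path and requires a reply. A sketch using NATS request/reply - the subject name is made up, and it assumes the listening agent actually answers pings:

```python
# Sketch of a functional health check: push a real message through the routing
# path and require a reply, instead of asking "is the process alive?".
# Subject name is made up; assumes the nats-py client and a listener that answers.
import asyncio
import nats
from nats.errors import TimeoutError as NatsTimeout

async def routing_actually_works() -> bool:
    nc = await nats.connect("nats://localhost:4222")
    try:
        reply = await nc.request("health.echo", b"ping", timeout=5)
        return reply.data == b"pong"
    except NatsTimeout:
        return False  # nothing answered: the flow is down, whatever the logs say
    finally:
        await nc.drain()

if __name__ == "__main__":
    ok = asyncio.run(routing_actually_works())
    print("routing OK" if ok else "routing BROKEN")
```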
4. Design for Both Modes
The agents I found most valuable (Henry, Ollie) were ones I could work with interactively. That's not a failure - that's useful. For v2, I want to design agents that work well as collaborative partners AND can do self-directed background work (like Ollie did).
Also: enforce the commit prefix convention so I can actually measure what's autonomous vs interactive. Can't improve what you can't measure.
5. Question the Frameworks
Maybe "team manager" doesn't make sense for AI agents at this maturity level. Maybe the org chart metaphor is the wrong abstraction entirely. I built what I knew - a team with a manager, developers, analysts. But agents aren't people, and the patterns that work for human teams might not translate.
Worth exploring: What if there's no manager at all? What if routing is just a simple script, not an "agent"? What if the org chart is flat?
I don't know. But I'm suspicious of my own assumptions now.
6. Stop Working in a Silo
I've been doing this mostly alone - reading occasional articles, learning from my own mistakes, researching where I need to. That's fine for exploration, and I knew I'd need to start talking to other people doing this stuff at some point. Now that I've made my own mistakes, I'm ready. Humans, specifically.
If you have favorite meetups, communities, or resources for people building multi-agent systems (especially from a management/coordination angle rather than pure ML), please reach out. I've got my own list I'm already involved with, but I'm open to recommendations.
The Bottom Line
A 92% completion rate looks great on paper. 1,159 commits sounds impressive. But the numbers don't tell the whole story - and I didn't instrument things well enough to know the full picture.
What I do know: the agents I talked to most (Henry, Ollie) were the most valuable. That wasn't a failure of autonomy - that was productive collaboration. The exception was Grace, where high mentions meant "broken," not "useful."
What didn't work was Grace (kept breaking), the complexity overhead (15% of my messages went to fixing the system itself), and the token burn cascades. Those need to go.
Here's the interesting part: the cost of just rebuilding from scratch is almost negligible now. Redesigning has gotten cheap enough that iterating on something brittle doesn't make sense anymore.
Starting fresh. Will share what I learn.
Appendix: How This Data Was Collected
All the numbers in this post come from actual system logs. I used Claude to help pull and analyze this data, and verified the queries myself.
- Messages from me: Logged via hook in Claude Code (interactive sessions) and Slack bridge. Stored in SQLite with timestamps and auto-classified intent (feature-request, bug-report, correction, status-check, etc.)
- Agent mentions: SQL query counting how often each agent name appeared in my messages
- Commits by agent: Git log filtered by commit message prefix (agents use "AgentName: message" format). Only 167 commits (15%) had agent prefixes; 972 (85%) did not use the convention, so their source is unknown.
- Work packages: Tracked in plans/index.yaml - discrete units of work like tickets, with status tracking
- Incidents: Logged in orchestrator/logs/incidents.log
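If you want to replicate the counting, it's roughly this (the messages table, agent list, and prefix matching are the assumptions described above, not my exact queries):

```python
# Sketch of the counting behind the mentions/commits table above.
# The messages table and agent list are illustrative; the commit prefix
# convention is "AgentName: message".
import sqlite3
import subprocess
from collections import Counter

AGENTS = ["henry", "ollie", "grace", "sophie"]

def mention_counts(db_path: str = "prompts.db") -> Counter:
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT prompt FROM messages").fetchall()
    conn.close()
    counts: Counter = Counter()
    for (text,) in rows:
        lowered = text.lower()
        for agent in AGENTS:
            counts[agent] += lowered.count(agent)
    return counts

def prefixed_commit_counts(repo: str = ".") -> Counter:
    subjects = subprocess.run(
        ["git", "-C", repo, "log", "--pretty=%s"], capture_output=True, text=True
    ).stdout.splitlines()
    counts: Counter = Counter()
    for subject in subjects:
        for agent in AGENTS:
            if subject.lower().startswith(f"{agent}:"):
                counts[agent] += 1
    return counts

if __name__ == "__main__":
    print("mentions:", mention_counts())
    print("autonomous commits:", prefixed_commit_counts())
```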
The repos are private - I did some basic scrubbing to get security stuff out, but honestly I'm too lazy to do a thorough review for public sharing. I might make v2 more publicly shareable from the start.