What Agents Actually Need to Succeed
Everyone building agent systems eventually hits the same wall. Your agent is smart. It can write code, do research, draft content. But it fails in production anyway — not because of intelligence, but because of everything around it.
After months of running ChiefOfStaff — an autonomous agent system that dispatches work across multiple projects while I sleep — I can tell you exactly what agents need. None of it is about making the model smarter.
Context at dispatch time
An agent waking up to work on a task has no memory of yesterday. No awareness of what other agents shipped. No understanding of why this task exists or what the human actually cares about.
Most systems solve this by dumping the entire project state into the prompt. That's expensive and noisy. The agent drowns in irrelevant context and misses the three things that actually matter.
In ChiefOfStaff, every task gets a composed prompt at dispatch time. The wave resolver — the component that decides what gets dispatched and when — calls buildDispatchPrompt() with exactly the context that task needs:
Goal context. The task knows what objective it serves. Not the entire project roadmap — just its goal, its initiative, and a one-line description of what success looks like. This is the difference between an agent that writes code and an agent that writes the right code.
Predecessor outcomes. If task B depends on task A, the agent working on B receives the handoff notes from A. Not a summary — the actual structured output that the previous agent wrote specifically for downstream consumers:
Completed prerequisite work:
- "Blog engine: MDX pipeline" [builder]: MDX blog engine fully implemented. gray-matter + next-mdx-remote/rsc. Key files: lib/blog.ts, app/blog/page.tsx...
This is pulled from a getHandoffNotes() query that joins task dependencies to their structured results. The predecessor agent writes a "Handoff" section in its result specifically because it knows another agent will read it. Context flows forward through the dependency graph, not through shared memory.
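The dependency-edge flow can be sketched in a few lines. This is an in-memory stand-in for the real getHandoffNotes() database join; the Task and TaskResult shapes are illustrative assumptions, not the actual schema:

```typescript
// Sketch: resolve handoff notes for a task by walking its blocked_by edges.
// In-memory stand-in for the real getHandoffNotes() query; the Task and
// TaskResult shapes below are assumptions, not the real schema.
interface Task {
  id: string;
  title: string;
  agent: string;
  blocked_by: string[];
}

interface TaskResult {
  taskId: string;
  handoff: string; // the "Handoff" section written for downstream consumers
}

function getHandoffNotes(task: Task, results: TaskResult[], tasks: Task[]): string {
  const lines = task.blocked_by.flatMap((depId) => {
    const dep = tasks.find((t) => t.id === depId);
    const result = results.find((r) => r.taskId === depId);
    if (!dep || !result) return []; // predecessor not finished yet
    return [`- "${dep.title}" [${dep.agent}]: ${result.handoff}`];
  });
  return lines.length ? `Completed prerequisite work:\n${lines.join('\n')}` : '';
}
```

The important property: context flows only along dependency edges. A task with no blockers gets an empty section, and nothing outside its subgraph leaks in.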
Sibling learnings. Before dispatching, the system queries completed tasks under the same initiative for their "What Worked" and "Reusable Pattern" sections. If a sibling task discovered that Next.js 16 needs async params, every subsequent task in that initiative knows it too. The knowledge compounds without anyone maintaining a wiki.
Feedback from the last attempt. If a human sent this task back with notes, those notes get injected prominently — along with the agent's previous result, so it can see exactly what it produced and what needs to change. This is how the system learns from review without any fine-tuning.
What's not in the prompt: the full project plan, other agents' task lists, the complete git history, or anything the agent doesn't need to make a decision about its specific work.
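Composed together, those context sections might look like this. A simplified stand-in for buildDispatchPrompt(); the field names and section wording are assumptions based on the description above:

```typescript
// Sketch of dispatch-time prompt composition: only the sections a task
// actually needs are included. Simplified stand-in for buildDispatchPrompt();
// field names are illustrative, not the real schema.
interface DispatchContext {
  goal: string;                 // the objective this task serves
  initiative: string;
  successCriteria: string;      // one line: what success looks like
  handoffNotes?: string;        // predecessor outcomes, if any
  siblingLearnings?: string[];  // "What Worked" / "Reusable Pattern" entries
  feedback?: string;            // human notes from a send-back, if any
  previousResult?: string;      // what the agent produced last attempt
}

function buildDispatchPrompt(taskBrief: string, ctx: DispatchContext): string {
  const sections = [
    `Task: ${taskBrief}`,
    `Goal: ${ctx.goal} (initiative: ${ctx.initiative})`,
    `Success looks like: ${ctx.successCriteria}`,
  ];
  if (ctx.handoffNotes) sections.push(ctx.handoffNotes);
  if (ctx.siblingLearnings?.length) {
    sections.push(
      `Learnings from sibling tasks:\n${ctx.siblingLearnings.map((l) => `- ${l}`).join('\n')}`,
    );
  }
  if (ctx.feedback) {
    // Feedback is injected prominently, paired with the prior result for contrast.
    sections.push(`REVIEWER FEEDBACK (address this first): ${ctx.feedback}`);
    if (ctx.previousResult) sections.push(`Your previous result:\n${ctx.previousResult}`);
  }
  return sections.join('\n\n');
}
```

Everything is opt-in per task: no feedback, no feedback section; no predecessors, no handoff block. The prompt stays as small as the task allows.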
Review gates and earned autonomy
Agents don't start trusted. They earn it.
Every task in ChiefOfStaff has a review tier — auto, quick, or gate. The defaults are conservative: code tasks start at gate (human reviews every diff), research starts at auto (low risk, auto-approved after 18 hours), content sits in the middle at quick (auto-approved after 32 hours with a spot-check flag).
The interesting part is what happens over time. The system tracks a rolling trust score per agent per task type — the last 20 verdicts (approved, sent back, discarded). When an agent hits 85% approval over 15+ tasks, its review tier gets promoted one level. Gate becomes quick. Quick becomes auto. The agent literally earns more autonomy through consistent good work.
The reverse is also true. If more than 30% of an agent's last 10 outcomes were sent back, it gets demoted. Quick becomes gate. Auto becomes quick. Trust is hard to build and easy to lose, which is exactly right for production systems.
Here's the actual check from the wave resolver:
const trust = getAgentTrustScore(task.assigned_agent, task.type);
if (trust.count >= 15 && trust.approvalRate > 0.85) {
  // Promote one level: gate → quick, quick → auto
} else if (trust.count >= 10 && trust.sendBackRate > 0.30) {
  // Demote one level: auto → quick, quick → gate
}
Every tier adjustment gets logged as a task activity note with the exact numbers: Trust-based tier gate → quick (approval: 93%, sendBack: 0%, n=15). Full transparency. If I disagree with a promotion, I can override it.
This is the review loop as the learning loop. No separate training pipeline. No fine-tuning. The system gets better because humans review work, those reviews become verdicts, verdicts adjust trust, and trust adjusts how much autonomy the agent gets next time.
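The rolling-window math behind that loop can be sketched end to end. This is a hedged reconstruction from the description above, not the actual implementation; the real getAgentTrustScore() may differ:

```typescript
// Sketch of trust-driven tier adjustment: a rolling window of recent verdicts
// per (agent, task type) drives promotion and demotion. Reconstructed from
// the prose description, not the real code.
type Verdict = 'approved' | 'sent_back' | 'discarded';
type Tier = 'gate' | 'quick' | 'auto';

function trustScore(verdicts: Verdict[], windowSize = 20) {
  const window = verdicts.slice(-windowSize); // only the most recent verdicts count
  const count = window.length;
  const approved = window.filter((v) => v === 'approved').length;
  const sentBack = window.filter((v) => v === 'sent_back').length;
  return {
    count,
    approvalRate: approved / Math.max(count, 1),
    sendBackRate: sentBack / Math.max(count, 1),
  };
}

function adjustTier(tier: Tier, verdicts: Verdict[]): Tier {
  const up: Record<Tier, Tier> = { gate: 'quick', quick: 'auto', auto: 'auto' };
  const down: Record<Tier, Tier> = { auto: 'quick', quick: 'gate', gate: 'gate' };
  const t = trustScore(verdicts);
  if (t.count >= 15 && t.approvalRate > 0.85) return up[tier]; // earned autonomy
  const recent = trustScore(verdicts, 10); // demotion looks at the last 10 only
  if (recent.count >= 10 && recent.sendBackRate > 0.3) return down[tier];
  return tier;
}
```

Note the asymmetry: promotion needs 15+ verdicts at 85% approval, demotion triggers off just the last 10. Trust is deliberately slower to build than to lose.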
Dependency graphs
Agents can't sequence their own work across sessions. They finish a task, report a result, and go away. Something else needs to decide what happens next.
In ChiefOfStaff, that something is the wave resolver. It runs every 60 seconds, scanning for tasks whose blockers are all resolved. A task with blocked_by: [task_A, task_B] won't get dispatched until both A and B are done. When they complete, the wave resolver picks up the dependent task on its next cycle — sometimes within a minute.
This sounds simple. It changes everything.
It means I can define a pipeline like "scaffold the site, then build the design system, then add the blog engine, then write the first post" as four tasks with dependencies. I don't manage the sequencing. I don't wake up to check if task 2 is done so I can start task 3. The system handles it.
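The readiness scan itself is a pure filter over the task graph. A minimal sketch with illustrative shapes, not the real resolver:

```typescript
// Sketch: a task is dispatchable once every blocker is done. The wave
// resolver runs this kind of scan on a timer; shapes are illustrative.
type Status = 'pending' | 'in_progress' | 'done';

interface GraphTask {
  id: string;
  status: Status;
  blocked_by: string[];
}

function readyTasks(tasks: GraphTask[]): GraphTask[] {
  const done = new Set(tasks.filter((t) => t.status === 'done').map((t) => t.id));
  return tasks.filter(
    (t) => t.status === 'pending' && t.blocked_by.every((id) => done.has(id)),
  );
}
```

Run every 60 seconds, this is what turns a four-task pipeline into something nobody babysits: as each task flips to done, its dependents become eligible on the next cycle.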
The wave resolver also prevents collisions. Before dispatching, it checks:
- Is the agent busy? Each agent can only run one CLI task at a time. No concurrent work on the same agent.
- Module area conflicts. Tasks can declare a module_area (like "blog" or "auth"). If two tasks share a module area, only one runs at a time. This prevents two agents from editing the same files in parallel.
- Sibling dedup. If an agent just completed a task under the same goal, there's a cooldown before dispatching another one. In sprint mode it's 10 minutes. In idle mode it's 60 minutes. This prevents the system from burning through a queue without giving me time to review intermediate results.
- CLI capacity. There's a global cap on concurrent CLI workers. The system won't dispatch more work than the infrastructure can handle.
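Those four checks might look like this in code. All names, shapes, and limits here are illustrative assumptions, not the real wave resolver:

```typescript
// Sketch of the pre-dispatch collision checks. Names, shapes, and limits
// are assumptions based on the description, not the actual implementation.
interface Candidate {
  agent: string;
  module_area?: string;
  goalId: string;
}

interface DispatchState {
  busyAgents: Set<string>;                   // agents with a CLI task running now
  activeModuleAreas: Set<string>;            // module areas with a task in flight
  lastGoalCompletionMs: Map<string, number>; // goalId -> last completion timestamp
  runningCliTasks: number;
  cliCapacity: number;                       // global cap on concurrent CLI workers
  cooldownMs: number;                        // 10 min in sprint mode, 60 min in idle
}

function canDispatch(c: Candidate, s: DispatchState, nowMs: number): boolean {
  if (s.busyAgents.has(c.agent)) return false;                              // agent busy
  if (c.module_area && s.activeModuleAreas.has(c.module_area)) return false; // file-collision guard
  const last = s.lastGoalCompletionMs.get(c.goalId);
  if (last !== undefined && nowMs - last < s.cooldownMs) return false;      // sibling cooldown
  if (s.runningCliTasks >= s.cliCapacity) return false;                     // capacity cap
  return true;
}
```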
All of this happens in a single function that runs every minute. The agent never sees any of it. From the agent's perspective, it wakes up, receives a task with full context, does the work, and reports a result. The orchestration is invisible.
Worktree isolation
Here's a problem most agent demos don't show: what happens when an agent's code changes are bad?
If the agent works directly on your main branch and produces a broken commit, you're now debugging someone else's code in your production workspace. Multiply that by three agents running overnight and you have a mess.
ChiefOfStaff uses git worktrees to give each code task its own isolated workspace. When a code task dispatches, the system creates a fresh worktree at /tmp/cos-worktrees/{task-id}, branched from main. The agent works there. When it's done, the diff is available for review — merge or discard, one click.
// Every code task defaults to worktree isolation
const isolation = task.isolation
  ?? (task.type === 'code' ? 'worktree' : null)
  ?? agentDef?.defaultIsolation
  ?? 'direct';
This is what made the overnight runs possible. 12 agents producing 12 independent diffs, each in its own worktree, each reviewable and mergeable independently. I discarded one (it reverted Telegram changes I'd made manually). I merged the other 11. No conflicts except two import collisions in files that two tasks both touched — which I resolved in under a minute.
Without worktree isolation, concurrent agent work is a liability. With it, it's just a review queue.
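The lifecycle reduces to a few git commands per task. A sketch that builds the command strings (the path layout matches the description; the branch naming and merge/discard flow are my assumptions; run them with your process runner of choice):

```typescript
// Sketch: one isolated worktree per code task, branched from main.
// Branch naming and the merge/discard commands are assumptions, not
// the actual ChiefOfStaff implementation.
function worktreeCommands(taskId: string) {
  const path = `/tmp/cos-worktrees/${taskId}`;
  return {
    create: `git worktree add ${path} -b task/${taskId} main`, // fresh branch off main
    merge: `git merge task/${taskId}`,                         // run from the main checkout
    discard: `git worktree remove --force ${path}`,            // throw the diff away
  };
}
```

Because each task gets its own branch and directory, "merge or discard" really is one decision per diff, and a discarded diff leaves no trace in the main workspace.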
Skills and tools: what agents can do vs. what they should decide
Agents in ChiefOfStaff have two kinds of capabilities: MCP tools (structured actions like create_task, update_content_item, report_signal) and skills (procedural knowledge loaded into the system prompt, like how to post to LinkedIn or how to capture engagement metrics).
The distinction matters because it draws the line between execution and judgment.
An agent executes by calling tools: create a content draft, log a task activity, search project memory. An agent follows procedures through skills: read the SKILL.md for linkedin-post, follow the steps, post the content.
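One way to draw that line in code, with illustrative types rather than the actual MCP wiring:

```typescript
// Sketch: tools are typed, callable actions executed by the runtime; skills
// are procedural text loaded into the system prompt. Illustrative types,
// not the real MCP integration.
interface Tool {
  name: string;                               // e.g. 'create_task'
  run(args: Record<string, unknown>): string; // structured action
}

interface Skill {
  name: string;         // e.g. 'linkedin-post'
  instructions: string; // contents of the skill's SKILL.md
}

function buildSystemPrompt(base: string, skills: Skill[]): string {
  // Skills shape how the agent follows procedures; they never grant new actions.
  const sections = skills.map((s) => `## Skill: ${s.name}\n${s.instructions}`);
  return [base, ...sections].join('\n\n');
}
```

The asymmetry is the point: adding a skill changes what the agent knows how to do; adding a tool changes what it can touch. Neither one hands it decisions about priority or sequencing.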
What agents should almost never do: decide priority, choose what to work on, sequence tasks across projects, or judge whether their own work is good enough. Those are orchestration decisions — they belong to the wave resolver and the human review loop.
This is the counterintuitive part. The smarter the model gets, the more tempting it is to give agents more autonomy over what to do. But the failure mode isn't intelligence — it's alignment. An agent that decides it should refactor the database schema when it was asked to fix a CSS bug is using intelligence in the wrong direction. Constrain the decisions. Expand the capabilities.
What this actually looks like
On a typical night, I queue 8-10 tasks before bed. The wave resolver picks them up based on dependencies and agent availability. Code tasks get worktree isolation. Content tasks get dispatched with performance data from the last 30 days of engagement signals. Research tasks get auto-review tiers.
By morning, the review queue has 15-20 items. I spend 30 minutes going through them — approving diffs, sending back content with notes, discarding the occasional bad output. Every verdict feeds back into the trust scores. The agents that did well get more autonomy tomorrow. The ones that missed get tighter review.
The system doesn't get dramatically better each day. But it gets slightly better every cycle. Trust scores inch up. Context injection gets more relevant as more sibling learnings accumulate. The dependency graph gets smoother as I learn how to decompose tasks that agents can actually execute.
That's the whole game with production agent systems. Not a breakthrough — a ratchet.