Building an Agent OS: What Actually Works
I build automation pipelines for a living. Research agents that crawl sources, extract what matters, and surface it. Posting systems that schedule and publish across platforms. Data pipelines that rate, filter, and act on signals. I was doing this long before anyone called it "agentic AI." It's just software that takes action.
So when the agent hype hit, I wasn't a skeptic being converted. I was a practitioner wondering whether the new tooling would actually improve on what I was already doing — or just add overhead.
The answer turned out to be both.
The Go-To-Market Problem
I'm the technical cofounder of a SaaS product. My cofounder handles the product photography side. I handle everything else — including the go-to-market, which is not my natural strength. We needed content. Strategy. Research into our market. Posting across platforms. Community engagement. The kind of sustained, daily marketing work that a solo technical founder never has enough hours for.
So I started using agents to help. And at first, it was genuinely great. They helped me write blogs, do competitive research, refine positioning, draft social posts, monitor communities. More output in a week than I'd have managed in a month doing it manually.
But I couldn't let them run unsupervised. The first drafts were consistently bad in ways that were hard to fix with a single prompt. The tone was off, the framing was generic, the strategic context was missing. So I started writing skills for them — detailed instructions on voice, audience, positioning. I curated their context — what they should know about our market, our competitors, our users. I gave them examples of good output and bad output.
And that's when the overhead started. Managing agent context became its own job. Keeping skills updated. Making sure the right agent had the right information for the right task. It was like I'd hired a team that needed constant onboarding.
Meanwhile, my cron jobs and automation pipelines kept running perfectly. They were dumber but more reliable. A pipeline doesn't need personality files or memory management. It does the same thing every time with the right context because you iterated on it until it worked. It doesn't evolve, but it doesn't regress either.
What Agents Add vs What Pipelines Already Do
This was the tension I kept coming back to. Agents felt like management overhead where cron jobs and automation would suffice. If I need to post to Reddit every morning, I don't need an agent with a personality and a memory system. I need a cron job that runs a well-tested prompt with the right context.
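The cron-job version of that fits in a few lines. Here's a minimal Python sketch, assuming hypothetical `call_model` and `post_to_reddit` helpers and an illustrative prompt; the actual pipeline would wire in a real LLM client and posting API:

```python
from datetime import date
from pathlib import Path

# Illustrative prompt; in practice this is the part you iterate on until it works.
PROMPT_TEMPLATE = (
    "You write our daily Reddit post.\n"
    "Voice and audience notes:\n{context}\n\n"
    "Write one post for {today}. Under 150 words."
)

def build_prompt(context: str, today: date) -> str:
    # The prompt is fixed and well-tested; only the curated context varies.
    return PROMPT_TEMPLATE.format(context=context.strip(), today=today.isoformat())

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")  # hypothetical helper

def post_to_reddit(text: str) -> None:
    raise NotImplementedError("wire up your posting helper here")  # hypothetical helper

def run_daily_post(context_file: Path) -> str:
    # Entry point for a cron line like: 0 9 * * * python daily_post.py
    prompt = build_prompt(context_file.read_text(), date.today())
    draft = call_model(prompt)
    post_to_reddit(draft)
    return draft
```

No personality files, no memory system: the context file and the tested prompt are the whole "mind" of the job.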
But there's a category of work that pipelines can't handle — work that requires judgment, adaptation, and context that changes. Writing an article about what we built this week. Responding to a community thread in a way that's relevant to the conversation. Deciding which research findings actually matter for our strategy. Adjusting the content plan based on what's getting engagement.
That's where agents earn their keep. Not on the repeatable stuff — on the stuff that requires thinking. The problem was that without structure around them, the thinking was unfocused. The agents were capable but directionless.
Why I Started Building the Harness
The people using OpenClaw were hitting the same wall. They'd set up agents, get excited for a few weeks, then realize the system was drifting. Nobody could remember what any agent was supposed to be doing. Tokens were getting burned on unfocused work. The output quality was inconsistent because there was no feedback mechanism.
I was watching this play out while building on Anthropic's Agent SDK. I pay for the Max plan, Opus is the strongest current model, and the company's focus on safety aligns with how I think about autonomous systems. I also had the benefit of learning from OpenClaw's architecture and community without being locked into their framework. It's a real testament to how intelligent today's models are: give one a file directory for memory and data and it can work for you. But what grounds agents is real data. Conversation history, task lists, issues, code repos. Not abstract personality files, but concrete context about what's actually happening.
So I started thinking about what kind of harness would help me accomplish goals over a long horizon. Where there was more structure to the work than "here's an agent, go do stuff." Where the system tracked not just what agents did, but whether it was working, and what to do differently next time.
The Architecture That Emerged
ChiefOfStaff is project-first. Each project has an objective, measurable key results, and initiatives — strategic thrusts that group related work. Under those, tasks. The agents serve the projects, not the other way around.
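That hierarchy can be sketched as a few dataclasses. The field names here are my shorthand for the post's description, not ChiefOfStaff's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    title: str
    agent: str                 # which agent perspective serves this task
    status: str = "ready"      # ready -> dispatched -> in_review -> done (assumed states)

@dataclass
class Initiative:
    name: str                  # a strategic thrust grouping related work
    tasks: list[Task] = field(default_factory=list)

@dataclass
class Project:
    objective: str
    key_results: list[str]     # measurable outcomes the work is judged against
    initiatives: list[Initiative] = field(default_factory=list)

    def ready_tasks(self) -> list[Task]:
        # Everything flows downward: agents are assigned to tasks, tasks to
        # initiatives, initiatives to the project's objective.
        return [t for i in self.initiatives for t in i.tasks if t.status == "ready"]
```

The point of the shape is the direction of ownership: tasks belong to projects, and agents are just the workers assigned to them.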
An agent is a container for perspective and skills. My writer has specific instructions about voice and structure. My researcher knows how to score signals and extract what matters. My CMO thinks about go-to-market strategy and audience psychology. But those perspectives only matter when they're applied to concrete work inside a project with clear goals.
Tasks flow through a wave deployer that wakes up every sixty seconds, finds what's ready, and dispatches it to the right agent. The agent enters the project's working directory with full context. It can read files, write code, write to databases, adjust website copy. When it finishes, the work enters a review queue.
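The deployer's loop is simple to sketch. `find_ready`, `dispatch`, and `enqueue_review` are placeholders for however the real system stores tasks and runs agents:

```python
import time
from typing import Callable, Iterable

def run_wave(find_ready: Callable[[], Iterable],
             dispatch: Callable,
             enqueue_review: Callable) -> int:
    """One wave: dispatch everything that's ready, queue the results for review."""
    dispatched = 0
    for task in find_ready():
        result = dispatch(task)          # agent runs inside the project's working dir
        enqueue_review(task, result)     # finished work always lands in the review queue
        dispatched += 1
    return dispatched

def deploy_forever(find_ready, dispatch, enqueue_review, interval: int = 60) -> None:
    while True:                          # wake up every sixty seconds
        run_wave(find_ready, dispatch, enqueue_review)
        time.sleep(interval)
```

Splitting the single wave out of the infinite loop keeps the dispatch logic testable on its own.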
The review queue is where the system actually learns. Every review decision — approve, revise, reject — generates a structured record. What was tried, what worked, what didn't, what the reusable pattern is. Those learnings accumulate. The agents can check them before starting similar work. Month three output is meaningfully better than month one, not because the model improved, but because the system has absorbed dozens of specific lessons about what good output looks like for each project.
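One plausible shape for those structured records, with illustrative field names and a flat JSONL store standing in for whatever ChiefOfStaff actually uses:

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class Learning:
    task: str
    decision: str        # "approve" | "revise" | "reject"
    tried: str
    worked: str
    failed: str
    pattern: str         # the reusable lesson an agent can check later

def record_learning(learning: Learning, store: Path) -> None:
    # Every review decision appends one structured record.
    with store.open("a") as f:
        f.write(json.dumps(asdict(learning)) + "\n")

def learnings_matching(keyword: str, store: Path) -> list[Learning]:
    # Agents call something like this before starting similar work.
    out = []
    for line in store.read_text().splitlines():
        d = json.loads(line)
        if keyword.lower() in d["pattern"].lower():
            out.append(Learning(**d))
    return out
```

The retrieval step is what makes month three better than month one: the lessons are queried before the work starts, not just archived after it ends.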
The Overnight Run
A few nights ago, I queued fifteen tasks across my projects. Research, coding, content drafts, bug fixes. The system ran overnight.
I woke up to eight GitHub commits. Five items in my review queue. Research summaries ready to read. Build notifications in Telegram.
That was the moment the harness justified itself. Not because any individual agent did something impressive — but because the system maintained coherence across fifteen tasks, four projects, and multiple agents without me being there. It knew what each project needed, dispatched the right work to the right agents, and queued everything that needed my judgment.
Sequential Quality
The latest evolution came from a simple observation: one-shot agent work isn't good enough for anything you care about. A single pass produces a first draft. First drafts need revision.
So we built task pipelines — sequential queries where the agent works in phases. For code: implement, then review your own work for edge cases and fix the obvious ones, then commit and clean up. For content: research and draft, then self-critique for tone and accuracy, then revise.
The agent preserves full context between phases because it's the same session. It knows every tradeoff it made, every corner it cut. A debrief turn while the agent still has that context is the cheapest quality lever available — cheaper than a separate review agent that has to reconstruct context, cheaper than human review hours later.
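A sketch of that phase structure. `session.send` is an assumption standing in for whatever method the agent SDK exposes for continuing one conversation, and the phase prompts are illustrative:

```python
# Illustrative phase prompts for a coding task, including the debrief turn.
CODE_PHASES = [
    "Implement the task described above.",
    "Review your own work for edge cases and fix the obvious ones.",
    "Commit the change and clean up.",
    "Debrief: note every tradeoff you made and what the next agent needs to know.",
]

def run_pipeline(session, phases: list[str]) -> list[str]:
    # One session for all phases, so the agent keeps full context between them:
    # every tradeoff made in phase one is still in scope during the debrief.
    return [session.send(phase) for phase in phases]
```

Because the session never resets, the review and debrief phases cost a few extra turns rather than a full context reconstruction.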
On top of the pipelines, hooks fire as compliance checks. Did the agent actually commit its changes? Did it capture what the next agent needs to know? Hooks are cheap, reliable, and attach to agents rather than tasks. The pipeline defines what work gets done. The hooks verify it actually happened.
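Hooks of that kind can be as simple as named predicates over a task's result. The check names and result dict here are illustrative, not the real hook API:

```python
from typing import Callable

Hook = tuple[str, Callable[[dict], bool]]

# Cheap, reliable compliance checks attached to the agent, not the task.
HOOKS: list[Hook] = [
    ("committed", lambda r: r.get("committed", False)),               # did it commit?
    ("handoff", lambda r: bool(r.get("handoff_notes", "").strip())),  # notes for the next agent?
]

def failed_hooks(result: dict, hooks: list[Hook] = HOOKS) -> list[str]:
    # The pipeline defines what work gets done; hooks verify it actually happened.
    return [name for name, check in hooks if not check(result)]
```

A nonempty return flags the task for revision before it ever reaches the review queue.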
What Grounds All of This
The models will keep getting better. Context windows will grow. Capabilities will expand. But the thing that makes agents useful isn't the model — it's the data surrounding it. Real project goals with measurable outcomes. Real task history showing what was tried and what worked. Real learnings from real review cycles. Real context about what the project is, what the audience cares about, what the brand sounds like.
Give a powerful model vague instructions and you get confident, polished garbage. Give it structured context and specific feedback loops and you get work that actually moves your projects forward.
That's the bet with ChiefOfStaff. Not that I can build a better model. That I can build a better harness — one that gives agents the structure, context, and feedback they need to do work I'm accountable for. And that when a smarter model drops into a system with months of accumulated learnings and carefully curated context, the output quality jumps in ways that justify every hour I spent building the infrastructure.
What I'd Tell Someone Starting
Don't build an agent framework. Build a system that solves your specific problem and happens to use agents.
Start with what you already have. If your cron jobs and pipelines work, keep them. Add agents where you need judgment and adaptation — not where you need reliability and repetition.
One project. One goal. One agent doing one type of work. Get the loop working: agent does work, you review it, the system captures what you learned. Once that loop is tight, expand.
The infrastructure will tempt you. You'll want the orchestrator before the basic loop works. You'll want five agents before the first one produces reliable output. You'll want dashboards before you have data worth displaying. Resist it. Fix the loop. Use the loop. Then build.
I'm still building. The review queue still has rough edges. The context management is partially manual. The orchestrator isn't fully autonomous. But fifteen tasks ran overnight and produced real work across four projects. The merge-check loop is reviewing and integrating code changes while I sleep. The system is learning from every review cycle.
That's not a demo. That's a system.