The Breakthrough Wasn't the Model. It Was the Workflow
For the first time in a while, using AI agents stopped feeling like a demo and started feeling like a way of working.
The last three days have been some of the most productive I’ve had in ages.
Not because everything was smooth. It wasn’t.
Sessions died mid-task. Tools failed in annoying ways. The “working” version of the SEO audit CLI turned out to be wrong in exactly the way that matters most: it was publishing the wrong artifact. At one point I was testing a URL with a typo in the hostname and briefly convinced myself the whole thing was down.
And yet, despite all that, the pace was ridiculous.
That’s the interesting part.
The breakthrough wasn’t that the AI got smarter. It was that the workflow got better.
What we were trying to do
The immediate job was pretty simple on paper: fix the SEO audit CLI so it actually works in real life.
In practice, that meant a lot more than getting a command to return without exploding.
It needed to give visible progress while it was running, produce real outputs instead of mocked happy paths, publish to the new consolidated host, survive session crashes and handoffs, and end with something I could actually use and verify myself.
Then the requirement shifted again.
Instead of one output, each run now needed to generate three:
- a Markdown report
- a summary HTML page
- a slide deck
That’s when it got fun.
The thing that actually changed
The biggest lesson from the last few days is that agent productivity is mostly a systems problem.
If the workflow is sloppy, the model quality barely matters.
If the workflow is tight, even imperfect agents become genuinely useful.
A few things made the difference.
Durable state beat context
The single most important change was treating progress as something that had to survive failure.
Instead of assuming the active session would stay alive, we started writing everything important into a live Notion tracker as we went: todo lists, current status, known blockers, what had been tested, and what was still unverified.
That changed the dynamic immediately.
When a session died, progress didn’t die with it.
That sounds obvious, but it’s the line between “clever chatbot” and “useful operator”.
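Here is a minimal sketch of the idea, assuming a local JSON file stands in for the Notion tracker. The real setup wrote to Notion; the file layout and the `save_state`, `load_state`, and `record` helpers below are hypothetical and only illustrate the pattern of persisting progress after every step.

```python
import json
import os
import tempfile
from pathlib import Path

# Hypothetical tracker fields, mirroring what went into the live tracker.
EMPTY_STATE = {"todo": [], "blockers": [], "tested": [], "unverified": []}


def save_state(path: Path, state: dict) -> None:
    """Write the tracker atomically so a crash mid-write cannot corrupt it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(state, tmp, indent=2)
    # os.replace swaps the file in one step on the same filesystem.
    os.replace(tmp_name, path)


def load_state(path: Path) -> dict:
    """Return the last saved tracker, or an empty one if nothing was saved yet."""
    if not path.exists():
        return {key: list(value) for key, value in EMPTY_STATE.items()}
    return json.loads(path.read_text(encoding="utf-8"))


def record(path: Path, key: str, item: str) -> dict:
    """Append one item to a list field and persist immediately."""
    state = load_state(path)
    state[key].append(item)
    save_state(path, state)
    return state
```

A fresh session can call `load_state` first and pick up from whatever the dead session last recorded, instead of reconstructing it from memory.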
Thin orchestration, heavy delegation
Rather than keeping one bloated session trying to do everything, I split the work into smaller isolated workers.
One worker reconciled the repo state. Another verified live behavior. Another handled the triple-output implementation.
The main thread stayed focused on orchestration, review, and keeping the handoff document accurate.
That ended up being much more reliable than trying to cram everything into one heroic context window.
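The shape of that split can be sketched roughly as follows. This is an illustration under assumptions, not the actual setup: real workers were separate agent sessions, while here each one is a plain Python callable run in a thread pool, and the `orchestrate` function is a hypothetical name. The point it shows is that one worker failing gets recorded as a result rather than taking the whole run down.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def orchestrate(workers: dict[str, Callable[[], str]]) -> dict[str, dict]:
    """Run each worker in isolation and collect a summary per worker.

    A worker that raises is reported as failed; the others still finish.
    """
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(workers))) as pool:
        futures = {pool.submit(fn): name for name, fn in workers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = {"ok": True, "summary": future.result()}
            except Exception as exc:
                results[name] = {
                    "ok": False,
                    "summary": f"{type(exc).__name__}: {exc}",
                }
    return results
```

The orchestrator only ever sees short summaries, which is what keeps the main thread thin enough to focus on review and the handoff document.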
Real tests exposed the truth
A mocked green test suite is comforting. It’s also a liar.
Real runs forced the actual problems into the open.
One failure looked like an auth problem. The real root cause turned out to be narrower and more interesting: Claude returned a severity label outside the expected enum, and the runner’s strict validation rejected it.
That’s exactly the kind of issue you don’t catch if you stop at “tests pass”.
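To make that failure mode concrete, here is a small sketch of strict enum parsing with an explicit alias table. The severity values and the aliases are assumptions for illustration; the real enum in the CLI may differ, and whether to map near-misses or reject them outright is a design choice rather than a given.

```python
from enum import Enum


class Severity(Enum):
    # Assumed values; the real CLI's enum is not shown in this post.
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Hypothetical near-miss labels a model might return instead of enum values.
ALIASES = {
    "severe": Severity.CRITICAL,
    "major": Severity.HIGH,
    "moderate": Severity.MEDIUM,
    "minor": Severity.LOW,
}


def parse_severity(raw: str) -> tuple[Severity, bool]:
    """Return (severity, was_exact).

    Exact enum values pass through, known aliases are mapped and flagged,
    and anything else raises with a message that names the real problem.
    """
    label = raw.strip().lower()
    try:
        return Severity(label), True
    except ValueError:
        pass
    if label in ALIASES:
        return ALIASES[label], False
    raise ValueError(f"unknown severity label: {raw!r}")
```

Either way, an error message that names the offending label is what stops an enum mismatch from masquerading as an auth problem.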
The live runs also surfaced a more embarrassing mismatch: the pipeline was technically working, but it was generating a one-page report when what I actually wanted was a slide deck.
That stung a bit, but it was useful.
It’s much better to discover “working, but wrong” quickly than to ship the wrong artifact with confidence.
What we ended up with
By the end of the push, the SEO audit CLI had gone from “critical and flaky” to something much closer to a real production workflow.
A run can now produce a Markdown report in `~/Documents`, a summary page on the consolidated audit host, and a linked HTML slide deck.
More importantly, the whole thing is now grounded in real validation rather than hand-wavy optimism.
That matters.
The difference between “the agent says it works” and “I ran it on real domains and checked the outputs” is the difference between theatre and engineering.
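A check along these lines is cheap to write. The sketch below is hypothetical: it assumes the Markdown report is a local file and that the summary page's HTML has already been fetched as a string, and it only verifies that the report is non-empty and that the summary actually links to the deck. The `verify_run` name and its parameters are made up for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def verify_run(report_path: Path, summary_html: str, deck_href: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems: list[str] = []
    if not report_path.is_file() or report_path.stat().st_size == 0:
        problems.append(f"missing or empty report: {report_path}")
    collector = LinkCollector()
    collector.feed(summary_html)
    if deck_href not in collector.hrefs:
        problems.append(f"summary page does not link to deck: {deck_href}")
    return problems
```

It does not prove the content is good, but it does catch the "working, but wrong artifact" case before a human has to.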
The emotional bit
What’s got me so stoked isn’t just the feature work.
It’s the feeling that the whole setup crossed a threshold.
For a long time, working with AI agents has felt like constantly babysitting brilliant interns who forget everything, overstate their progress, and occasionally wander off into a wall.
This week felt different.
Still imperfect, still chaotic, still full of weird edge cases, but productive in a way that felt compounding.
Like I’m not just prompting tools anymore. Like I’m starting to build an actual operating system for work.
That’s a much bigger deal than any one CLI.
What I believe now
A lot of the public conversation about AI agents is still stuck in the wrong frame.
People argue about which model is best, which benchmark matters, which agent framework is winning.
That stuff matters a bit, but not nearly as much as people think.
The real leverage is in the operating model:
- how work gets broken down
- how state gets preserved
- how failure is handled
- how outputs are verified
- how humans stay in the loop without becoming the bottleneck
Get that right and agents become shockingly useful.
Get that wrong and even the fanciest model turns into a hallucinating slot machine with a terminal.
Where this is heading
I don’t think the exciting future is “AI replaces software teams”.
I think it’s smaller teams with better systems shipping faster than should be reasonable.
One person with a tight feedback loop, durable memory, isolated workers, and real verification can suddenly do a surprising amount.
That’s what the last three days have felt like.
Not magic. Not AGI. Not the end of human work.
Just a glimpse of what happens when the workflow finally catches up to the model.
And honestly, that’s more exciting than the hype.