A few weeks ago I ran an experiment I haven’t been able to stop thinking about. Not because of what either approach got right, but because of what the gap between them reveals.
I was building an open-source AI research harness: something that could take a complex research query and produce a genuinely useful, intellectually honest report. Something someone could reliably trust, not a hallucination dressed up in citations. And I wanted to get there for under a dollar a run.
I built it two ways at once.
Approach One: Outsource Everything
The first was deliberately hands-off. I took a high-quality research report produced by OpenAI’s deep research, handed it to OpenClaw using Codex, supplied the project with an OpenRouter API key, and said: build something that produces this. Use whatever models you think are appropriate. Iterate until it’s equivalent in quality.
That was the entire brief. No architecture discussions. No requirements interviews. No decision checkpoints. I was outsourcing not just the implementation, but the judgment.
I’d check in occasionally, glance at the dashboard to make sure we weren’t burning money, and if it happened to be asking me something I might weigh in. Mostly I left it alone.
Approach Two: Human in the Loop
The second used a methodology I’ve been developing over the past six to eight months: AI-augmented coding and design.
Short version: I start with a brief roughly similar in scope to the one I handed the first set of models. But instead of handing over the wheel, I go through a structured requirements discovery process. The AI interviews me. We work through architectural decisions together. It asks questions I hadn’t thought to ask myself. I’m not writing code, but I’m absolutely shaping what gets built and why. For developers who have used planning mode in their favorite CLI or IDE: it’s like that, but intentionally far more rigorous. Think of it as a pedantic consultant who works through every edge case before starting implementation. Annoying sometimes, but the results are more than worth it.
Once the initial harness was created, I reviewed each research output personally. Read the logs, watched for agent behaviors that seemed counterproductive, developed theories about why, and translated those into design changes. The same kind of work the AI attempted autonomously in approach one, but with forty years of pattern recognition behind it and a much tighter feedback loop.
I wrote almost no code. But I was very much present.
A Weekend Later
After a full weekend, the first approach had plateaued. Genuinely impressive for a fully autonomous build, but stuck, and clearly a proof of concept. Something you could run from the command line. No real architecture, no path forward.
The second approach, started a day later with about 11 or 12 hours of active engagement on my part, had produced something considerably different:
- A fully functioning API
- Command-line execution
- A query decomposition engine that breaks a research request into component questions
- Dependency analysis that identifies which questions block others, enabling intelligent sequencing
- Targeted search routing per question
- A scoring rubric to validate answers before progressing to dependent questions
- Multi-model support, with models assigned by specialty
- Built-in context management
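To make the decomposition and dependency pieces concrete, here is a minimal sketch of how sub-questions can be sequenced so that blocking questions are answered first, with a rubric gate before dependents proceed. The class names, threshold, and example questions are my illustration, not the harness’s actual API:

```python
# Hypothetical sketch of query decomposition + dependency sequencing.
# Names and the 0.8 threshold are illustrative, not the harness's real API.
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # stdlib topological ordering (3.9+)

@dataclass
class SubQuestion:
    qid: str
    text: str
    depends_on: list[str] = field(default_factory=list)

def sequence(questions: list[SubQuestion]) -> list[str]:
    """Order sub-questions so every dependency is answered first."""
    graph = {q.qid: set(q.depends_on) for q in questions}
    return list(TopologicalSorter(graph).static_order())

def answer_passes(score: float, threshold: float = 0.8) -> bool:
    """Rubric gate: only progress to dependent questions once an
    answer scores above the quality threshold."""
    return score >= threshold

qs = [
    SubQuestion("q3", "Synthesize the market outlook", depends_on=["q1", "q2"]),
    SubQuestion("q1", "What is the current market size?"),
    SubQuestion("q2", "Who are the major players?", depends_on=["q1"]),
]
print(sequence(qs))  # q1 comes before q2, and both before q3
```

The same ordering also exposes which questions have no unmet dependencies at any point, which is what makes targeted, per-question search routing possible.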
Output quality wasn’t incrementally better. On simpler queries, it was drawing conclusions that weren’t stated anywhere in the source material but were clearly the right ones. On complex queries, it was upfront about what it couldn’t know, and it didn’t hallucinate. Note that I didn’t say it hallucinated less. Zero. I double-checked every source. (Will it always behave like that? Hard to know; it is by its nature a non-deterministic system. But the fact that it was so willing to admit what it didn’t know is highly encouraging.)
One more thing worth noting: the first approach had developed some solid PDF and Word document parsing routines I hadn’t built. I took them. More on that soon.
So What’s the Point?
There’s an obvious takeaway: human guidance produced better software than fully autonomous generation. Some people will stop there and conclude nothing fundamental has changed.
That’s the wrong read.
I wrote almost no code for the system I “helped build.” Zero for the one the AI built on its own. Both of those statements would have been science fiction five years ago. Writing code is no longer the bottleneck in software development. That shift has already happened, whether or not the industry has processed it.
What this experiment shows is where things actually stand right now: execution is largely automated, but judgment still benefits from a human in the loop. The first approach shows how far autonomous AI gets without steering. The second shows how much further you go when someone with domain knowledge and real architectural instinct is shaping the decisions.
And the fact that I borrowed good ideas from the autonomous build and folded them in? That’s not a footnote. That’s the actual dynamic. It isn’t about competition or replacement. More like two people working the same problem from different angles.
The interesting question isn’t “who is better.”
It’s:
“What does human judgment look like when execution is essentially free?”
Where Things Stand
The stakes are ludicrously high. The financial impact is already serious, and it will only grow. The US tech sector is nearly 9% of GDP, and the US alone accounts for more than half of all global software investment. And this is just the beginning. [https://www.statista.com/statistics/1239480/united-states-leading-states-by-tech-contribution-to-gross-product/] [https://www.wipo.int/en/web/global-innovation-index/w/blogs/2025/global-software-spending] I don’t say this to be alarmist. I say it only to stress the importance of not waiting to act.
Software is the first industry to be largely rewritten by AI. It won’t be the last. Knowledge work broadly is in the same position, and for a lot of people this isn’t a future concern anymore.
I’ve spent forty years watching industries change and trying to be on the right side of those changes. The pattern is consistent: the people who do well are the ones bringing the change. The ones who survive are the ones who embrace the change. The ones who perish are the ones who stick their heads in the sand.
Pandora’s box is open. The change is not coming. It is here.
And I’ll be fully transparent here. I am accustomed to being the one bringing the change; in this case, though, the change is happening so rapidly that I’m teetering between bringing it and embracing it. If this change is so dramatic and so fast that someone like me, who has lived and encouraged and sought change his entire life, is struggling, then maybe I’m just getting older. But I am also very concerned for those who don’t have my history.
The open research harness is on GitHub. If it’s useful, use it. It’s my small contribution to the idea that strategic intelligence shouldn’t require an enterprise budget. If you improve it, you’re improving access to affordable, high-quality research for the world. Pretty cool mission, I think. https://github.com/matt-o-matic/openresearch