Why I'm Betting on Argument, Not Answers

Software delivery

Anyone who has been around me long enough knows I love to argue. Not to prove someone wrong (most of the time) but to find truth. And most of those same people have heard me say the phrase: “Hey, wanna see something cool?” Invariably it has something to do with some weird thing I just built… and more often than not (and yes, I am obviously biased here)… it IS cool. Generally if you’ve said yes to that question once, you’ll say yes to it again. So… in the spirit of both those things… I built a thing. And I want to tell you why.

I’ve been watching how heavy AI users actually work, and there’s this thing they all do. They’ve stopped asking one model one question and waiting for an answer.

Instead they run the same problem through Claude Opus 4.6, then GPT-5.4, then maybe Gemini. They read the differences. They take the tension between the answers and feed it back in. They iterate. They’re not looking for an answer. They’re looking for whatever survives the pressure.

I wanted to build that into an actual system. So I did. And I’m open-sourcing it.

The Problem With Single-Model Answers

Single models are optimizers. They’re trained to produce the most probable next token, which usually means the most acceptable, least objectionable response. Fine for drafting an email. A real problem for anything where the right answer isn’t obvious.

When you use models from different labs, you get genuine disagreement. GPT-5.4 and Opus 4.6 will sometimes land in completely different places on a nuanced question. Not because one is wrong, but because they’ve genuinely weighted things differently. That tension is (I think??) where the actual problem lives.

How This Actually Works

Okay so here’s the setup. You feed in a prompt. Maybe it’s a technical architecture question, maybe it’s a blog post about the very tool you’re using to write it. That prompt goes to several models. Each produces an independent response. No peeking… at first.

Then they see each other’s work. This is important: this is not “give the same prompt to two models and have a third one pick the winner.” That’s not what happens. Not even close.

What actually happens is that the models get curious about each other’s answers. GPT-5.4 will encounter a framing from Opus 4.6 (maybe it prioritized consistency over latency in a way GPT-5.4 wouldn’t have) and instead of dismissing it, it asks: “Wait, why weight consistency there when the failure mode is actually latency-bound?” It’s a question GPT-5.4 wouldn’t have asked itself, because it wouldn’t have thought to weight them differently. Then Opus 4.6 might incorporate that, or push back, or find something in GPT-5.4’s response it had missed. They revise. They evolve. By round three the output looks different from round one, and usually better.

The referee sits outside this loop. It watches. Occasionally it pokes. It surfaces the disagreements that actually matter - not the stylistic noise, the substantive stuff. It asks the questions that unblock progress when the original prompt was underspecified. And most importantly, it knows when to stop. The referee tracks whether the collaboration is converging on something useful or just spinning in circles. When outputs stop improving, it calls it.

The Special Sauce Is in the Debate, Not a Static Comparison

People already run the same prompt through multiple models and compare outputs. That’s useful, but it’s not this.

The difference is what happens when models actually engage with each other’s reasoning over multiple rounds. When GPT-5.4 knows Opus 4.6’s answer exists and gets to poke holes in it, you get something qualitatively different from parallel responses. Vague confidence doesn’t survive that process. You have to actually defend the reasoning.

But here’s the thing: they don’t just argue. They explore. They’re interested in each other’s reasoning in a way that surprised me when I first saw it. That curiosity is doing a lot of the work here.

You could try simulating this with one model by prompting it to “argue both sides.” People do this. It kind of works. But a model arguing against itself still knows what it’s about to say. The surprise is gone, and surprise is where the useful friction lives.

What It’s Good For / Caveats

The obvious use cases are writing and coding, tasks where judgment matters as much as correctness.

Beyond writing and code, the interesting territory is decisions with real stakes: hiring rubrics, pricing strategy, product prioritization. Anything where you’d want a second opinion but don’t always have one available. The system doesn’t replace human judgment. It gives you better raw material to exercise that judgment on.

This costs more than asking one model one question. More API calls, more tokens, more latency. For a quick lookup, it’s overkill.

It also requires judgment about when the debate is actually adding value. The referee helps, but you still need a human who can tell the difference between real insight and hallucinated controversy. Models can generate plausible-sounding disagreement for its own sake, and that’s not useful.

Why Open Source

The people who would benefit most from this are already doing it manually and would love to stop copy-pasting between browser tabs. Also, I don’t want to deal with people paying me for another thing I have to keep running. :-)

Advanced users have been hacking this workflow for a while. They’ll run a prompt through Opus 4.6, paste the output into GPT-5.4 with “critique this,” take that back to Opus 4.6, and iterate by hand. This system formalizes that workflow and removes the friction. The architecture is simple enough to deploy yourself: a lightweight orchestrator handling message routing, model APIs on the backend, and a referee prompt you can tune for your domain.

Want the referee more aggressive about ending debates early? Tweak the prompt.

Want to add a model with a specific specialty? Plug it in.
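To make that tunability concrete, here’s what those two knobs might look like as plain data. This is an illustrative sketch, not the project’s actual config schema; every key, model name, and prompt string here is a placeholder I’m assuming for the example.

```python
# Illustrative config sketch -- not the project's real schema.
CONFIG = {
    "models": [
        "anthropic/claude-opus",    # generalist
        "openai/gpt",               # generalist
        "deepseek/deepseek-coder",  # add a specialist by listing it here
    ],
    "referee": {
        # Lower this to force debates to end earlier.
        "max_rounds": 3,
        # Tune this prompt for your domain or a more aggressive referee.
        "prompt": (
            "Watch the debate. Surface substantive disagreements and "
            "ignore stylistic ones. End the debate as soon as revisions "
            "stop improving materially."
        ),
    },
}
```

The point of keeping both as data rather than code is exactly what the post describes: swapping a model or hardening the referee is an edit, not a rewrite.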

I want to see what people build with it. What referee strategies work best, which model combinations produce the most useful tension, what problem types benefit most from the structure.

An Interesting Aside

I used this process to write this exact article. I edited the content after, but it produced one thing that gave me a genuine chuckle:

This post is being drafted through exactly this system right now, which means I’m writing about a process while that process is actively being used to write about it. I’m losing track of which direction the recursion goes. I find that genuinely interesting and a little strange, and I’m not sure I’ve fully worked out what it means.

I think I might have accidentally created Deadpool on OpenRouter’s GPUs… but back to…

The Bigger Idea

We’ve spent years treating AI like a very fast, very confident search engine. Ask something, get an answer, move on.

But the most interesting use of these systems isn’t retrieval. It’s deliberation. It’s building a process where answers have to earn their place by surviving scrutiny - from other models, from a referee, from the human who has to live with the result.

I’m still figuring out where the limits are. The system doesn’t always converge on something useful; sometimes it just generates more sophisticated confusion. The bill is definitely higher. But when it works, the answers feel earned in a way that single-model outputs rarely do. That’s enough to keep me iterating.

So… wanna see something cool?

https://github.com/matt-o-matic/multi-agent-consult