About Stet
I'm Ben Redmond. I use AI coding tools every day, both in my day job as a SWE at MongoDB and for personal projects. I've been frustrated by how little rigor we have, as consumers, for evaluating models, agents, and context on our own code.
When Opus 4.7 comes out, is it actually better than Opus 4.6 for my work? Most of the available evidence comes from the model provider itself or from community posts whose setup, repo, prompting, and standards may be totally different from mine.
The same thing happens with context. Is my CLAUDE.md helping? Can I make it better? Right now the usual answer is to try a change, use it for a while, and hope the agent feels better. Too much of this space still runs on vibes, with users expected to adopt the latest tooling without much evidence for how the behavior actually differs.
Stet is my attempt to give some of that power back to the person using the tool. It turns real repo history into replayable tasks, runs agents against those tasks, and scores the resulting patches beyond simple test pass rate. The goal is to make it possible to improve agent performance systematically by changing your context, harness, and model choices instead of waiting for the next model release.
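To make that pipeline concrete, here is a minimal sketch of the shapes involved. Everything in it is illustrative: RepoTask, Score, Agent, and evaluate are hypothetical names, not Stet's actual API. The score dimensions simply mirror the ones the methodology page covers (tests, equivalence, review quality, footprint, and cost).

```python
from dataclasses import dataclass
from typing import Callable

# A rough sketch of the pipeline's shape. Every name here is
# illustrative, not Stet's real API.

@dataclass
class RepoTask:
    """A replayable task mined from real repo history."""
    repo_url: str
    base_commit: str     # repo state just before the original human change
    prompt: str          # task description derived from that change
    test_command: str    # how to verify a candidate patch

@dataclass
class Score:
    """Scoring a patch on more than test pass rate."""
    tests_passed: bool
    equivalence: float    # closeness to the reference human change
    review_quality: float # would the patch survive code review?
    footprint: float      # did the agent touch only what it needed to?
    cost_usd: float

# An agent takes a task and returns a candidate patch (a unified diff);
# a scorer turns that patch into a multi-dimensional Score.
Agent = Callable[[RepoTask], str]
Scorer = Callable[[str, RepoTask], Score]

def evaluate(task: RepoTask, agent: Agent, scorer: Scorer) -> Score:
    """Replay one task end to end: let the agent attempt it at the
    pre-change commit, then score the resulting patch."""
    patch = agent(task)
    return scorer(patch, task)
```

Because the agent and scorer are swappable, the same set of tasks can compare models, harnesses, or context changes against each other.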
The benchmarks here are not magic. Harnesses matter, judge models can be biased, public repos may have training exposure, and a single published run is still just one run. But I would rather have concrete evidence with caveats than another round of screenshots and vibes.
Stet is built to answer practical questions: Is this agent effective in my codebase? Is there a cheaper alternative? Did my context change actually help? If you're interested in learning more, try the tool or reach out to me.
Methodology
How Stet scores tests, equivalence, review quality, footprint, and cost.
Model comparisons
Pairwise coding-agent comparisons across real repo tasks.
Leaderboard
The current model table with pass rate and quality dimensions.