Stet lets your AI coding agent
improve itself.
Tell your agent what to improve, and Stet tests candidates on real work from your codebase to find the winner.
What Stet is
Stet is an eval system for AI coding agents. It replays real work from your repository, runs candidate agent configurations, and reports whether each change improves tests, review quality, equivalence, footprint risk, cost, and runtime.
Use it to compare models, reasoning settings, AGENTS.md changes, SKILL.md changes, and harness changes before they become the default path for your team.
Why test pass rate is not enough
You swapped to a new model.
Tests pass.
Did code quality go up or down?
You rewrote your AGENTS.md.
Tests pass.
Did partial implementations actually decrease?
You enabled plan-before-code.
Tests pass.
Is it costing twice as much per task?
Tests pass. That’s all you know.
How Stet evaluates coding-agent changes
Your merged PRs become eval tasks.
12 eval tasks from 47 merged PRs on platform-api
Your test suite scores every attempt.
Your rubrics measure what tests can't.
Your agent runs the experiment.
⚠Partial implementations on 3 tasks
⚠Validation still weak on 2 tasks
⚠2 tasks still fail code review for style
⚠Consider follow-up for review quality
Recent Stet experiments
Your code, tests, and baseline
Every task is a real PR your team already merged. Your test suite is the judge. The baseline is the code your team already wrote.
No synthetic benchmarks.
No curated challenges.
Your work, replayed.
Current cleaned leaderboard
AI coding agents scored on real open-source codebases.
Your codebase. Your tests. Your standards.