Let your coding agent improve itself.
Tell your agent what to improve. Stet tests candidates on real work from your codebase and hands back the winner.
You swapped to a new model.
Tests pass.
Did code quality go up or down?
You rewrote your AGENTS.md.
Tests pass.
Did partial implementations actually decrease?
You enabled plan-before-code.
Tests pass.
Is it costing twice as much per task?
Tests pass. That’s all you know.
Your merged PRs become eval tasks. (Example: 12 eval tasks from 47 merged PRs on platform-api.)
Your test suite scores every attempt.
Your rubrics measure what tests can’t.
Your agent runs the experiment.
Sample report:
⚠ Partial implementations on 3 tasks
⚠ Validation still weak on 2 tasks
⚠ 2 tasks still fail code review for style
⚠ Consider follow-up for review quality
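Mechanically, one run of that loop looks something like the sketch below: check out the repo as it stood before a merged PR, let the candidate agent attempt the same task, then score the attempt with your test suite and rubrics against the merged baseline. This is a minimal sketch, assuming a pytest-based suite; every name in it (EvalTask, score_rubrics, evaluate) is an illustrative assumption, not Stet's actual API.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class EvalTask:
    base_commit: str   # repo state just before the PR merged
    prompt: str        # the task, recovered from the PR description
    merged_diff: str   # your team's merged change: the baseline

def run_tests() -> bool:
    # The repo's own suite is the judge; pytest stands in here.
    return subprocess.run(["pytest", "-q"]).returncode == 0

def score_rubrics(attempt_diff: str, baseline_diff: str) -> dict:
    # Placeholder: rubric scoring (completeness, style) would compare
    # the attempt against the merged baseline, e.g. via an LLM judge.
    return {"completeness": 0.0, "style": 0.0}

def evaluate(task: EvalTask, agent_cmd: list[str]) -> dict:
    # 1. Rewind the codebase to the moment before the PR landed.
    subprocess.run(["git", "checkout", task.base_commit], check=True)
    # 2. Let the candidate agent attempt the same work.
    subprocess.run(agent_cmd + [task.prompt], check=True)
    # 3. Capture what the agent changed, then score it.
    attempt = subprocess.run(
        ["git", "diff"], capture_output=True, text=True
    ).stdout
    return {
        "tests_passed": run_tests(),
        "rubrics": score_rubrics(attempt, task.merged_diff),
    }
```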
Every task is a real PR your team already merged. Your test suite is the judge. The baseline is the code your team shipped.
No synthetic benchmarks.
No curated challenges.
Your work, replayed.
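Because the baseline is your own merged code, comparing candidates reduces to summing scores over the same replayed tasks. A hypothetical winner-picking step might look like the sketch below; the result fields carry over from the sketch above and are likewise assumptions, not Stet's actual scoring.

```python
def total_score(results: list[dict]) -> float:
    # Tests are pass/fail; rubric scores cover what tests can't.
    return sum(
        float(r["tests_passed"]) + sum(r["rubrics"].values())
        for r in results
    )

def pick_winner(candidate_a: list[dict], candidate_b: list[dict]) -> str:
    # E.g. candidate A is the old model, candidate B the new one,
    # each evaluated on the same replayed tasks.
    return "A" if total_score(candidate_a) >= total_score(candidate_b) else "B"
```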
AI coding agents scored on real open-source codebases.
Your codebase. Your tests. Your standards.