STET

Let your coding agent improve itself.

Tell your agent what to improve. Stet tests candidates on real work from your codebase and hands back the winner.

You swapped to a new model.

Tests pass.

Did code quality go up or down?

You rewrote your AGENTS.md.

Tests pass.

Did partial implementations actually decrease?

You enabled plan-before-code.

Tests pass.

Is it costing twice as much per task?

Tests pass. That’s all you know.

Your merged PRs become eval tasks.

your tasks

12 eval tasks from 47 merged PRs on platform-api

a3f2c1d · feat: batch endpoint
e8b4a09 · fix: rate-limit headers
7d1f3e2 · refactor: auth middleware
c4d8f12 · fix: connection pooling
+8 more
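
The extraction itself is simple enough to sketch. Here is a minimal Python sketch, assuming a local git checkout with real merge commits; `EvalTask` and `tasks_from_merged_prs` are illustrative names, not Stet's API:

```python
# Illustrative sketch only: turn merged PRs into eval tasks by walking
# merge commits. The commit before each merge is where the agent starts;
# the merged result is the reference answer. Assumes true merge commits
# (a squash-merge history would need the GitHub API instead).
import subprocess
from dataclasses import dataclass

@dataclass
class EvalTask:
    base_sha: str    # where the agent starts: the commit before the merge
    merge_sha: str   # the reference answer: what your team actually shipped
    title: str       # e.g. "feat: batch endpoint"

def tasks_from_merged_prs(repo_dir: str, limit: int = 50) -> list[EvalTask]:
    log = subprocess.run(
        ["git", "-C", repo_dir, "log", "--merges",
         f"--max-count={limit}", "--format=%H %P %s"],
        capture_output=True, text=True, check=True,
    ).stdout
    tasks = []
    for line in log.splitlines():
        if not line.strip():
            continue
        # format is: <merge sha> <parent 1> <parent 2> <subject...>
        merge_sha, first_parent, *rest = line.split()
        tasks.append(EvalTask(base_sha=first_parent,
                              merge_sha=merge_sha,
                              title=" ".join(rest[1:])))
    return tasks
```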

Your test suite scores every attempt.

your tests → signal

15/15  12/12  12/15  8/8  22/22  6/6  +6 more
11/12 tasks pass
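
The signal is nothing exotic: run the suite, count what passes. A minimal sketch, assuming a pytest suite and its JUnit XML report; `test_signal` is an illustrative name, not Stet's harness:

```python
# Run the suite once against an attempt and report (passed, total).
# Assumes pytest is installed and the repo's tests live in-tree.
import subprocess
import xml.etree.ElementTree as ET

def test_signal(repo_dir: str) -> tuple[int, int]:
    subprocess.run(
        ["python", "-m", "pytest", "-q", "--junitxml=report.xml"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    root = ET.parse(f"{repo_dir}/report.xml").getroot()
    # pytest may wrap results in <testsuites>; unwrap if so
    suite = root if root.tag == "testsuite" else root.find("testsuite")
    total = int(suite.get("tests", "0"))
    bad = int(suite.get("failures", "0")) + int(suite.get("errors", "0"))
    return total - bad, total  # e.g. (12, 15) renders as 12/15
```

A task only counts as passing when every test does, which is how a single 12/15 run turns the summary into 11/12 tasks.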

Your rubrics measure what tests can't.

your graders → signal

correctness · avg 3.2/4
style · avg 2.8/4
coverage · avg 3.5/4
+ custom: API consistency, error handling
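
A rubric is just named criteria with a judge behind them. A sketch of the shape, with the judge left abstract; `Criterion`, `grade`, and the judge callable are illustrative, not Stet's API:

```python
# Sketch of a rubric as data. The criterion names mirror the panel above;
# the judge is a stand-in for whatever model call does the grading.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    prompt: str        # what the judge is asked to look for
    max_score: int = 4

RUBRIC = [
    Criterion("correctness", "Does the change do what the PR set out to do?"),
    Criterion("style", "Does it match the conventions of surrounding code?"),
    Criterion("coverage", "Are the new code paths exercised by tests?"),
    # custom criteria slot in the same way:
    Criterion("API consistency", "Do new endpoints follow existing patterns?"),
]

def grade(diff: str, judge: Callable[[str], int]) -> dict[str, int]:
    """Score one attempt's diff on every criterion: name -> score out of 4."""
    return {
        c.name: min(judge(f"{c.prompt}\n---\n{diff}"), c.max_score)
        for c in RUBRIC
    }
```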

Your agent runs the experiment.

Claude Code

RUN 1 · agents-md-v1 · HOLD · pass 0.50 · cost $3.56
Partial implementations on 3 tasks

RUN 2 · agents-md-v2 · HOLD · pass 0.50 · cost $3.56
Validation still weak on 2 tasks

RUN 3 · agents-md-v3 vs baseline · platform-api
pass 0.50 · cost $3.56 · time 780s
DECISION: PROMOTE

2 tasks still fail code review for style
Consider follow-up for review quality
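
The HOLD/PROMOTE call comes down to a comparison against your baseline. An illustrative decision rule under assumed thresholds, not Stet's actual logic; `RunResult`, `decide`, and `max_cost_ratio` are inventions for this sketch:

```python
# Promote when the candidate beats the baseline without blowing the budget.
# The 1.5x cost cap and the tie-break on rubric scores are assumptions.
from dataclasses import dataclass

@dataclass
class RunResult:
    pass_rate: float       # fraction of tasks whose tests pass
    rubric_avg: float      # mean grader score across criteria, 0-4
    cost_per_task: float   # dollars

def decide(candidate: RunResult, baseline: RunResult,
           max_cost_ratio: float = 1.5) -> str:
    if candidate.cost_per_task > baseline.cost_per_task * max_cost_ratio:
        return "HOLD"  # a win that doubles your bill is not a win
    better = (candidate.pass_rate > baseline.pass_rate
              or (candidate.pass_rate == baseline.pass_rate
                  and candidate.rubric_avg > baseline.rubric_avg))
    return "PROMOTE" if better else "HOLD"
```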

Recent experiments

harness · opus 4.6 on claude code vs cursor · PROMOTE · 15 tasks · Apr 3
skill · test-augmentation v2 vs v1 · HOLD · 10 tasks · Mar 28
model · swap opus 4.6 for sonnet 4.6 for small bugfixes · PROMOTE · 8 tasks · Mar 21
config · tell agent to plan before starting · HOLD · 12 tasks · Mar 14

Every task is a real PR your team already merged. Your test suite is the judge. The baseline is the code your team already wrote.

No synthetic benchmarks.

No curated challenges.

Your work, replayed.

Public leaderboard

AI coding agents scored on real open-source codebases.

GPT-5.4 (codex cli) · Gate 78.6% · Equivalence 45.5% · Code Review 31.8% · Cost/Task $1.99
GPT-5.3 Codex (codex cli) · Gate 75.0% · Equivalence 42.9% · Code Review 19.0% · Cost/Task $16.90
GPT-5.1 Codex Mini (codex cli) · Gate 75.0% · Equivalence 19.0% · Code Review 9.5% · Cost/Task $5.55
Claude Opus 4.6 (claude code) · Gate 42.9% · Equivalence 66.7% · Code Review 58.3% · Cost/Task $26.80
GPT-5.4 Mini (codex cli) · Gate 32.1% · Equivalence 11.1% · Code Review 16.7% · Cost/Task $3.34

87 tasks / 3 repos

Your codebase. Your tests. Your standards.