STET

Stet lets your AI coding agent improve itself.

Tell your agent what to improve, and Stet tests candidates on real work from your codebase to find the winner.

What Stet is

Stet is an eval system for AI coding agents. It replays real work from your repository, runs candidate agent configurations, and reports how each change affects test results, review quality, equivalence, footprint risk, cost, and runtime.

Use it to compare models, reasoning settings, AGENTS.md changes, SKILL.md changes, and harness changes before they become the default path for your team.

Why test pass rate is not enough

You swapped to a new model.

Tests pass.

Did code quality go up or down?

You rewrote your AGENTS.md.

Tests pass.

Did partial implementations actually decrease?

You enabled plan-before-code.

Tests pass.

Is it costing twice as much per task?

Tests pass. That’s all you know.

How Stet evaluates coding-agent changes

Your merged PRs become eval tasks.

your tasks

12 eval tasks from 47 merged PRs on platform-api

a3f2c1d · feat: batch endpoint
e8b4a09 · fix: rate-limit headers
7d1f3e2 · refactor: auth middleware
c4d8f12 · fix: connection pooling
+8 more
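As a rough illustration of how a merged PR could be represented as an eval task, the sketch below parses entries of the form `<hash> <type>: <title>` into records. The `EvalTask` class and `parse_task` function are hypothetical names made up for this example; they are not part of Stet, and Stet's actual task format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalTask:
    """Hypothetical record for one eval task (not Stet's real schema)."""
    commit: str  # abbreviated commit hash of the merged PR
    kind: str    # change type prefix, e.g. "feat", "fix", "refactor"
    title: str   # short description of the change


def parse_task(line: str) -> EvalTask:
    """Split '<hash> <type>: <title>' into its three parts."""
    commit, rest = line.split(" ", 1)
    kind, title = rest.split(": ", 1)
    return EvalTask(commit, kind, title)


# The four task entries shown in the list above.
lines = [
    "a3f2c1d feat: batch endpoint",
    "e8b4a09 fix: rate-limit headers",
    "7d1f3e2 refactor: auth middleware",
    "c4d8f12 fix: connection pooling",
]

tasks = [parse_task(line) for line in lines]
fixes = [task.title for task in tasks if task.kind == "fix"]
print(fixes)  # ['rate-limit headers', 'connection pooling']
```

Grouping tasks by change type like this is one way to scope an experiment to, say, small bug fixes only.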

Your test suite scores every attempt.

your tests · signal
15/15 · 12/12 · 12/15 · 8/8 · 22/22 · 6/6 · +6 more
11/12 tasks pass
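To make the pass count concrete, here is a minimal sketch that scores the six per-task results visible above. It assumes a task passes only when every one of its tests passes; that rule, and the `parse_result` and `task_passes` helpers, are assumptions for illustration rather than Stet's documented behavior.

```python
def parse_result(cell: str) -> tuple[int, int]:
    """Turn a 'passed/total' cell such as '12/15' into (12, 15)."""
    passed, total = cell.split("/")
    return int(passed), int(total)


def task_passes(passed: int, total: int) -> bool:
    """Assumed rule: a task passes only if all of its tests pass."""
    return total > 0 and passed == total


# Only the six results shown above; the remaining six are not listed.
visible = "15/15 12/12 12/15 8/8 22/22 6/6".split()
results = [parse_result(cell) for cell in visible]

passing = sum(task_passes(passed, total) for passed, total in results)
print(f"{passing}/{len(results)} visible tasks pass")  # 5/6 visible tasks pass
```

Under this rule the single failing task is the one at 12/15, which is consistent with the 11/12 summary if the six unlisted tasks all pass.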

Your rubrics measure what tests can't.

your graders · signal
correctness · avg 3.2/4
style · avg 2.8/4
coverage · avg 3.5/4
+ custom: API consistency, error handling
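The rubric averages above can be read as per-task grades on a 0 to 4 scale, averaged per rubric. The sketch below shows that arithmetic under those assumptions. The scores are invented placeholders, not Stet output, and `average_rubrics` is a hypothetical helper.

```python
from statistics import mean

MAX_SCORE = 4  # rubric scale taken from the "/4" shown above


def average_rubrics(scores: dict[str, list[int]]) -> dict[str, float]:
    """Average each rubric's per-task scores, rounded to two decimals."""
    averages = {}
    for rubric, values in scores.items():
        if any(not 0 <= value <= MAX_SCORE for value in values):
            raise ValueError(f"{rubric}: scores must be between 0 and {MAX_SCORE}")
        averages[rubric] = round(mean(values), 2)
    return averages


# Placeholder grades for three tasks (made up for illustration only).
scores = {
    "correctness": [4, 3, 3],
    "style": [3, 2, 3],
}

averages = average_rubrics(scores)
for rubric, avg in averages.items():
    print(f"{rubric}: avg {avg}/{MAX_SCORE}")
```

A custom rubric such as API consistency would simply be another key in the same mapping.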

Your agent runs the experiment.

Claude Code
>
RUN 1 · agents-md-v1 · HOLD · pass 0.50 · cost $3.56 → $

Partial implementations on 3 tasks

RUN 2 · agents-md-v2 · HOLD · pass 0.50 · cost $3.56 → $

Validation still weak on 2 tasks

RUN 3 · agents-md-v3 vs baseline · platform-api
pass 0.50
cost $3.56
time 780s
DECISION: PROMOTE

2 tasks still fail code review for style

Consider follow-up for review quality

Recent Stet experiments

model · opus 4.7 vs old/new opus 4.6 on zod · STUDY · 28 tasks · Apr 16
harness · opus 4.6 on claude code vs cursor · PROMOTE · 15 tasks · Apr 3
skill · test-augmentation v2 vs v1 · HOLD · 10 tasks · Mar 28
model · swap opus 4.6 for sonnet 4.6 for small bugfixes · PROMOTE · 8 tasks · Mar 21
config · tell agent to plan before starting · HOLD · 12 tasks · Mar 14

Your code, tests, and baseline

Every task is a real PR your team already merged. Your test suite is the judge. The baseline is the code your team already wrote.

No synthetic benchmarks.

No curated challenges.

Your work, replayed.

Current cleaned leaderboard

AI coding agents scored on real open-source codebases.

Metric · GPT-5.5 (codex cli) · GPT-5.4 (codex cli) · Claude Opus 4.7 (claude code)
Gate · 44.4% · 33.3% · 44.4%
Equivalence · 66.7% · 66.7% · 40.7%
Code Review · 51.9% · 37.0% · 22.2%
56 tasks / 2 repos

Your codebase. Your tests. Your standards.