Stet lets your AI coding agent
improve itself.

Tell your agent what to improve, and Stet tests candidates on real work from your codebase to find the winner.

Stet is an eval system for AI coding agents. It replays real work from your repository, runs candidate agent configurations, and reports whether each change improves tests, review quality, equivalence, footprint risk, cost, and runtime.

Use it to compare models, reasoning settings, AGENTS.md changes, SKILL.md changes, and harness changes before they become the default path for your team.

You swapped to a new model.

Tests pass.

Did code quality go up or down?

You rewrote your AGENTS.md.

Tests pass.

Did partial implementations actually decrease?

You enabled plan-before-code.

Tests pass.

Is it costing twice as much per task?

Tests pass. That’s all you know.

Your merged PRs become eval tasks.

your tasks

12 eval tasks from 47 merged PRs on platform-api

a3f2c1dfeat: batch endpoint

e8b4a09fix: rate-limit headers

7d1f3e2refactor: auth middleware

c4d8f12fix: connection pooling

+8 more

Your test suite scores every attempt.

your testssignal

✓ 15/15✓ 12/12✗ 12/15✓ 8/8✓ 22/22✓ 6/6+6 more

11/12 tasks pass

Your rubrics measure what tests can't.

your graderssignal

correctnessavg 3.2/4

styleavg 2.8/4

coverageavg 3.5/4

+ custom: API consistency, error handling

Your agent runs the experiment.

Claude Code

RUN 1agents-md-v1HOLDpass 0.50 → —cost $3.56 → $—

⚠Partial implementations on 3 tasks

RUN 2agents-md-v2HOLDpass 0.50 → —cost $3.56 → $—

⚠Validation still weak on 2 tasks

RUN 3·agents-md-v3 vs baseline·platform-api

pass 0.50→—

cost $3.56→—

time 780s→—

DECISION: PROMOTE■

⚠2 tasks still fail code review for style

⚠Consider follow-up for review quality

Recent experiments

modelopus 4.7 vs old/new opus 4.6 on zodSTUDY28 tasksApr 16

harnessopus 4.6 on claude code vs cursorPROMOTE15 tasksApr 3

skilltest-augmentation v2 vs v1HOLD10 tasksMar 28

modelswap opus 4.6 for sonnet 4.6 for small bugfixesPROMOTE8 tasksMar 21

configtell agent to plan before startingHOLD12 tasksMar 14

Every task is a real PR your team already merged. Your test suite is the judge. The baseline is the code your team already wrote.

No synthetic benchmarks.

No curated challenges.

Your work, replayed.

Current cleaned leaderboard

AI coding agents scored on real open-source codebases.

GPT-5.5

codex cli

GPT-5.4

codex cli

Claude Opus 4.7

claude code

Gate

44.4%

33.3%

44.4%

Equivalence

66.7%

40.7%

Code Review

51.9%

37.0%

22.2%

56 tasks/2 repos

Your codebase. Your tests. Your standards.

Stet lets your AI coding agentimprove itself.

What Stet is

Why test pass rate is not enough

How Stet evaluates coding-agent changes

Your agent runs the experiment.

Recent Stet experiments

Your code, tests, and baseline

Current cleaned leaderboard

Stet lets your AI coding agent
improve itself.