STET

How We Measure

Stet evaluates AI coding models on real tasks from real open-source repositories. Each week, we run the same set of tasks across all models and score them on correctness and quality.

1. Select merged PRs from tracked repositories
2. Extract the task (issue + tests that changed)
3. Reset repo to pre-merge state
4. Ask each model to solve the task
5. Run original tests — the correctness gate
6. Score passing solutions on equivalence, code review, footprint, and cost
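The six steps above can be sketched as a minimal Python harness. This is illustrative only, not the real Stet pipeline; `run_task`, `solve`, `run_tests`, and `score` are hypothetical names standing in for the model call, the test runner, and the quality scorers:

```python
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    task_id: str
    passed: bool                                # step 5: correctness gate
    scores: dict = field(default_factory=dict)  # step 6: quality, only if passed

def run_task(task_id, solve, run_tests, score) -> TaskResult:
    """One task through the gate: solve, run the original tests, score on pass."""
    patch = solve(task_id)                     # step 4: ask the model to solve
    passed = run_tests(patch)                  # step 5: the original tests decide
    scores = score(patch) if passed else {}    # step 6: quality above the gate
    return TaskResult(task_id, passed, scores)

# Toy stand-ins for the real model, test harness, and scorers:
result = run_task(
    "demo-1",
    solve=lambda t: "patch",
    run_tests=lambda p: True,
    score=lambda p: {"equivalence": 1.0, "footprint_risk": 0.2},
)
```

The key structural point is that quality scoring never runs on a failing solution: the gate comes first.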

Tests are the gate. A model's solution must pass the same tests that the human-written solution passed. But passing tests is necessary, not sufficient — models that pass the same tests can produce wildly different solutions. Quality dimensions above the gate are where differentiation lives.

Above the Gate

Pass rate alone cannot differentiate models. PR-derived tasks are bimodal: easy tasks are solved by every model, hard tasks by none. Two models with identical pass rates can diverge 2–5x on quality.

For every task that passes tests, Stet scores additional dimensions:

  • Equivalence — is the patch functionally equivalent to the original PR's solution, or merely a different approach that happens to pass?
  • Code review — structured rubric scoring correctness, bug risk, edge cases, and maintainability
  • Footprint risk — how invasive is the patch? Over-editing and touching unrelated files increase merge risk
  • Cost — API spend per task, for ROI comparison across model tiers
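Among tasks that clear the gate, these dimensions can be averaged into a per-model quality profile. A minimal sketch, assuming per-task records with a `passed` flag and a `scores` dict; the field names are illustrative, not Stet's actual schema:

```python
def quality_profile(results):
    """Average each quality dimension over tasks that passed the gate."""
    passed = [r for r in results if r["passed"]]
    if not passed:
        return {}
    dims = passed[0]["scores"].keys()
    return {d: sum(r["scores"][d] for r in passed) / len(passed) for d in dims}

# Two models with an identical pass rate (100%) but diverging quality:
model_a = [{"passed": True, "scores": {"equivalence": 1.0, "footprint_risk": 0.1}}]
model_b = [{"passed": True, "scores": {"equivalence": 0.4, "footprint_risk": 0.5}}]
```

Here both models would tie on pass rate, while the equivalence and footprint averages separate them cleanly.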

Why This Matters

Public benchmarks are increasingly unreliable. Models trained on open-source code memorize benchmark solutions — inflating scores without improving real-world performance. And test suites designed for one implementation often reject valid alternatives. Stet measures models on specific, recognizable projects with their real test suites — and scores what matters after the tests pass.

  • Real repositories — not synthetic problems
  • Human-written tests — not generated test cases
  • Quality above the gate — not just "does it pass" but "does it solve it right"
  • Tracked over time — detect regressions week-over-week

Statistical Approach

Each repository dataset contains 50–200 tasks. A task counts as passing when its validation result is `pass` or `pass_with_warn`. Effective sample size (n) excludes missing validations.
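The pass-rate and effective-n rules can be expressed directly. A small sketch, assuming missing validations are represented as `None`; the function name is illustrative:

```python
PASS_STATUSES = {"pass", "pass_with_warn"}

def pass_rate(statuses):
    """Pass rate over the effective sample size; missing validations excluded."""
    valid = [s for s in statuses if s is not None]   # drop missing validations
    n = len(valid)                                   # effective n
    passes = sum(s in PASS_STATUSES for s in valid)
    return (passes / n if n else 0.0), n
```

Note that `pass_with_warn` counts toward the numerator, while a missing validation shrinks the denominator rather than counting as a failure.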

Pass rate establishes the correctness gate. Among tasks that pass, equivalence rate, code review score, and footprint risk are the ranking signals — they surface the quality differences that pass rate alone cannot.

For each model, we compute pass rate and a deterministic 95% bootstrap confidence interval from task-level outcomes. We use fixed bootstrap settings so repeated runs with identical inputs produce the same interval outputs.
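A deterministic percentile bootstrap with fixed settings can be sketched as follows. This is one standard construction, not necessarily Stet's exact implementation; the fixed seed and fixed resample count are what make repeated runs reproduce the same interval:

```python
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=42):
    """Deterministic 95% percentile-bootstrap CI for the pass rate.

    outcomes: task-level results as 1 (pass) or 0 (fail).
    Fixed seed + fixed n_boot => identical inputs yield identical intervals.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

With 50 tasks at a 70% pass rate, the interval is wide enough that nearby models overlap, which is exactly why tiering (below the gate comparison) is conservative.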

Tiering is conservative. Model A is considered strictly superior to model B only when `passRate(A) > ciHigh(B)`. Models within the same tier are differentiated by quality dimensions above the gate.
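One plausible way to turn that pairwise rule into tiers is a greedy grouping over models ranked by pass rate. This is a sketch of the idea, not necessarily Stet's exact algorithm, and the dict keys are illustrative:

```python
def assign_tiers(models):
    """Greedy tiering over models sorted by pass rate (descending).

    A model starts a new tier only when the current tier's leader is
    strictly superior to it: leader pass rate > model's CI upper bound.
    Conservative by construction: overlapping intervals share a tier.
    """
    ranked = sorted(models, key=lambda m: m["pass_rate"], reverse=True)
    tiers, leader = [], None
    for m in ranked:
        if leader is None or leader["pass_rate"] > m["ci_high"]:
            tiers.append([m])        # strictly separated: open a new tier
            leader = m
        else:
            tiers[-1].append(m)      # not separable from the leader: same tier
    return tiers

models = [
    {"name": "A", "pass_rate": 0.80, "ci_high": 0.88},
    {"name": "B", "pass_rate": 0.76, "ci_high": 0.84},  # overlaps A: same tier
    {"name": "C", "pass_rate": 0.50, "ci_high": 0.60},  # clearly below: new tier
]
```

In this example A and B land in one tier and C in another, so A and B would then be separated by equivalence, code review, and footprint rather than pass rate.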

FAQ

Why not just rank by pass rate?

Tasks mined from real PRs are bimodal: easy ones are solved by every model, hard ones by none. We consistently see 94–100% task-level agreement between model pairs even when their solution quality diverges 2–5x. Pass rate is a valid correctness gate, but quality dimensions above the gate are where models actually separate.

Why these repositories?

We chose well-known projects with good test coverage across multiple languages (Go, Python, TypeScript). Recognizable names help you calibrate — if you know Stripe CLI or FastAPI, you have intuition for what "hard" looks like in those codebases.

How do you prevent prompt leakage?

Models receive only the issue description and the file tree. They don't see the expected solution, the PR diff, or which tests will run. This mirrors real-world usage.

Aren't public repo evals contaminated by training data?

They can be. Frontier models have been shown to memorize solutions from open-source repositories in their training data. Stet's public leaderboard uses real tests as the correctness gate — memorizing the original patch doesn't help if the model's solution must independently pass the test suite. For private evals, contamination is structurally impossible — your proprietary codebase isn't in any model's training data.

Why weekly runs?

Weekly cadence catches model regressions early while smoothing out transient API issues. It also creates a consistent content rhythm for the community.

Can I reproduce these results?

Yes. The Stet CLI and task definitions are open source. Each task includes the exact commit hash, test command, and timeout. Run `stet validate` on any task to reproduce locally.

Questions or feedback? Start by requesting private eval access.