STET

How We Measure

Stet evaluates AI coding models on real tasks from real open-source repositories. Each week, we run the same set of tasks across all models and score them on correctness and quality.

1. Select merged PRs from tracked repositories
2. Extract the task (issue + tests that changed)
3. Reset repo to pre-merge state
4. Ask each model to solve the task
5. Run original tests — the correctness gate
6. Score passing solutions on equivalence, code review, footprint, and cost
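The six steps above can be sketched as a minimal Python harness. This is illustrative only, not the real Stet pipeline; `run_task`, `solve`, `run_tests`, and `score` are hypothetical names standing in for the model call, the test runner, and the quality scorers:

```python
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    task_id: str
    passed: bool                                # step 5: correctness gate
    scores: dict = field(default_factory=dict)  # step 6: quality, only if passed

def run_task(task_id, solve, run_tests, score) -> TaskResult:
    """One task through the gate: solve, run the original tests, score on pass."""
    patch = solve(task_id)                     # step 4: ask the model to solve
    passed = run_tests(patch)                  # step 5: the original tests decide
    scores = score(patch) if passed else {}    # step 6: quality above the gate
    return TaskResult(task_id, passed, scores)

# Toy stand-ins for the real model, test harness, and scorers:
result = run_task(
    "demo-1",
    solve=lambda t: "patch",
    run_tests=lambda p: True,
    score=lambda p: {"equivalence": 1.0, "footprint_risk": 0.2},
)
```

The key structural point is that quality scoring never runs on a failing solution: the gate comes first.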

Tests are the gate. A model's solution must pass the same tests that the human-written solution passed. But passing tests is necessary, not sufficient — models that pass the same tests can produce wildly different solutions. Quality dimensions above the gate are where differentiation lives.

Above the Gate

Pass rate alone cannot differentiate models. PR-derived tasks are bimodal: easy tasks are solved by every model, hard tasks by none. Two models with identical pass rates can diverge 2–5x on quality.

For every task that passes tests, Stet scores additional dimensions:

  • Equivalence — is the patch functionally equivalent to the original PR's solution, or merely a different approach that happens to pass?
  • Code review — structured rubric scoring correctness, bug risk, edge cases, and maintainability
  • Footprint risk — how invasive is the patch? Over-editing and touching unrelated files increase merge risk
  • Cost — API spend per task, for ROI comparison across model tiers
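Among tasks that clear the gate, these dimensions can be averaged into a per-model quality profile. A minimal sketch, assuming per-task records with a `passed` flag and a `scores` dict; the field names are illustrative, not Stet's actual schema:

```python
def quality_profile(results):
    """Average each quality dimension over tasks that passed the gate."""
    passed = [r for r in results if r["passed"]]
    if not passed:
        return {}
    dims = passed[0]["scores"].keys()
    return {d: sum(r["scores"][d] for r in passed) / len(passed) for d in dims}

# Two models with an identical pass rate (100%) but diverging quality:
model_a = [{"passed": True, "scores": {"equivalence": 1.0, "footprint_risk": 0.1}}]
model_b = [{"passed": True, "scores": {"equivalence": 0.4, "footprint_risk": 0.5}}]
```

Here both models would tie on pass rate, while the equivalence and footprint averages separate them cleanly.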

Why This Matters

Public benchmarks are increasingly unreliable. Models trained on open-source code memorize benchmark solutions — inflating scores without improving real-world performance. And test suites designed for one implementation often reject valid alternatives. Stet measures models on specific, recognizable projects with their real test suites — and scores what matters after the tests pass.

  • Real repositories — not synthetic problems
  • Human-written tests — not generated test cases
  • Quality above the gate — not just "does it pass" but "does it solve it right"
  • Tracked over time — detect regressions week-over-week

Statistical Approach

Each repository dataset contains 50–200 tasks. A task counts as passing when its validation result is `pass` or `pass_with_warn`. Effective sample size (n) excludes missing validations.
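The pass-rate and effective-n rules can be expressed directly. A small sketch, assuming missing validations are represented as `None`; the function name is illustrative:

```python
PASS_STATUSES = {"pass", "pass_with_warn"}

def pass_rate(statuses):
    """Pass rate over the effective sample size; missing validations excluded."""
    valid = [s for s in statuses if s is not None]   # drop missing validations
    n = len(valid)                                   # effective n
    passes = sum(s in PASS_STATUSES for s in valid)
    return (passes / n if n else 0.0), n
```

Note that `pass_with_warn` counts toward the numerator, while a missing validation shrinks the denominator rather than counting as a failure.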

Pass rate establishes the correctness gate. Among tasks that pass, equivalence rate, code review score, and footprint risk are the ranking signals — they surface the quality differences that pass rate alone cannot.

For each model, we compute pass rate and a deterministic 95% bootstrap confidence interval from task-level outcomes. We use fixed bootstrap settings so repeated runs with identical inputs produce the same interval outputs.
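A deterministic percentile bootstrap with fixed settings can be sketched as follows. This is one standard construction, not necessarily Stet's exact implementation; the fixed seed and fixed resample count are what make repeated runs reproduce the same interval:

```python
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=42):
    """Deterministic 95% percentile-bootstrap CI for the pass rate.

    outcomes: task-level results as 1 (pass) or 0 (fail).
    Fixed seed + fixed n_boot => identical inputs yield identical intervals.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

With 50 tasks at a 70% pass rate, the interval is wide enough that nearby models overlap, which is exactly why tiering (below the gate comparison) is conservative.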

Tiering is conservative. Model A is considered strictly superior to model B only when `passRate(A) > ciHigh(B)`. Models within the same tier are differentiated by quality dimensions above the gate.
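One plausible way to turn that pairwise rule into tiers is a greedy grouping over models ranked by pass rate. This is a sketch of the idea, not necessarily Stet's exact algorithm, and the dict keys are illustrative:

```python
def assign_tiers(models):
    """Greedy tiering over models sorted by pass rate (descending).

    A model starts a new tier only when the current tier's leader is
    strictly superior to it: leader pass rate > model's CI upper bound.
    Conservative by construction: overlapping intervals share a tier.
    """
    ranked = sorted(models, key=lambda m: m["pass_rate"], reverse=True)
    tiers, leader = [], None
    for m in ranked:
        if leader is None or leader["pass_rate"] > m["ci_high"]:
            tiers.append([m])        # strictly separated: open a new tier
            leader = m
        else:
            tiers[-1].append(m)      # not separable from the leader: same tier
    return tiers

models = [
    {"name": "A", "pass_rate": 0.80, "ci_high": 0.88},
    {"name": "B", "pass_rate": 0.76, "ci_high": 0.84},  # overlaps A: same tier
    {"name": "C", "pass_rate": 0.50, "ci_high": 0.60},  # clearly below: new tier
]
```

In this example A and B land in one tier and C in another, so A and B would then be separated by equivalence, code review, and footprint rather than pass rate.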

FAQ

Why not just rank by pass rate?

Tasks mined from real PRs are bimodal: easy ones are solved by every model, hard ones by none. We consistently see 94–100% task-level agreement between model pairs even when their solution quality diverges 2–5x. Pass rate is a valid correctness gate, but quality dimensions above the gate are where models actually separate.

Why these repositories?

We chose well-known projects with good test coverage across multiple languages (Go, Python, TypeScript). Recognizable names help you calibrate — if you know Stripe CLI or FastAPI, you have intuition for what "hard" looks like in those codebases.

How do you prevent prompt leakage?

Models receive only the issue description and the file tree. They don't see the expected solution, the PR diff, or which tests will run. This mirrors real-world usage.

Aren't public repo evals contaminated by training data?

They can be. Frontier models have been shown to memorize solutions from open-source repositories in their training data. Stet's public leaderboard uses real tests as the correctness gate — memorizing the original patch doesn't help if the model's solution must independently pass the test suite. For private evals, contamination is structurally impossible — your proprietary codebase isn't in any model's training data.

Why weekly runs?

Weekly cadence catches model regressions early while smoothing out transient API issues. It also creates a consistent content rhythm for the community.

Can I reproduce these results?

Yes. The Stet CLI and task definitions are open source. Each task includes the exact commit hash, test command, and timeout. Run `stet validate` on any task to reproduce locally.

Questions or feedback? Start by requesting private eval access.