STET
Continuous AI coding evals for real repos

GPT-5.4 just dropped. Same pass rate as 5.3 Codex. 4× cheaper. Higher quality scores. Your tests can’t tell you that.

Real PRs replayed against AI coding agents across 3 repos, weekly. Tests are the gate. Code quality is the signal.

GPT-5.3 Codex · GPT-5.4
100.0% pass rate (tied) · 4.0× cost gap
The measurement gap

Engineering teams spend $65–150/seat/month on AI coding tools. Public benchmarks are contaminated by training data. One-off evaluations go stale in weeks. Nobody has a continuous signal for whether these tools actually work on their codebase.

Stet runs real PRs against AI coding agents on real repos, scores correctness and quality, and tracks changes week over week.
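The loop above can be sketched in a few lines. This is purely illustrative: Stet's actual pipeline, task format, and scoring are not shown here, so the agent call, test run, and quality scorer are passed in as stubs, and every name (`ReplayTask`, `run_agent`, `score_quality`, the example repo/commit) is hypothetical.

```python
"""Minimal sketch, under stated assumptions, of a replay-and-score loop:
replay a historical PR against a model, gate on the repo's own tests,
and score quality only when the gate passes."""
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ReplayTask:
    repo: str          # e.g. "colinhacks/zod" (illustrative)
    base_commit: str   # commit the original PR branched from
    prompt: str        # PR title + description handed to the agent
    test_command: str  # the repo's own test suite: the pass/fail gate


@dataclass
class Result:
    model: str
    passed: bool               # tests are the gate
    quality: Optional[float]   # quality is the signal; scored only on pass
    cost_usd: float


def replay(
    task: ReplayTask,
    model: str,
    run_agent: Callable[[str, ReplayTask], Tuple[str, float]],
    run_tests: Callable[[ReplayTask, str], bool],
    score_quality: Callable[[ReplayTask, str], float],
) -> Result:
    """Replay one historical PR against one model (stubs injected)."""
    diff, cost = run_agent(model, task)   # agent produces a patch + its cost
    passed = run_tests(task, diff)        # gate: does the suite still pass?
    # Two models can tie on pass rate yet differ on quality and cost,
    # which is the gap a week-over-week leaderboard is meant to surface.
    quality = score_quality(task, diff) if passed else None
    return Result(model, passed, quality, cost)
```

With stubbed callables you can compare two models on the same task: one that passes the gate gets a quality score, one that fails does not, and both report cost.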

Updated Mar 07, 2026 from 87 tasks across 3 repos.
Zod (TypeScript) · graphql-go-tools (Go) · sqlparser-rs (Rust)

Private eval is in early access. Join the waitlist to get Stet running on your repo with the same weekly replay and quality scoring used on the public leaderboard.

Already approved? Sign in →