STET

Private Eval

The leaderboard measures models on public repos.
Private eval measures them on yours.

Engineering leaders are spending $65–150/seat/month on AI coding tools and can't prove they work. Usage metrics show adoption, not effectiveness. One-off evaluations go stale within weeks. There's no continuous signal.

Private eval uses the same methodology as the leaderboard: your codebase, your tests, your coding standards. The quality dimensions that differentiate models on public data become actionable when they're measured against code your team actually writes.

What Changes on Your Repo

Equivalence becomes intent alignment

On public repos, equivalence measures whether a patch matches the original PR. On your repo, it measures whether the AI solves problems the way your team would — your patterns, your abstractions, your idioms.

Code review reflects your standards

The review rubric scores maintainability, bug risk, and edge case handling. On your codebase, those scores map directly to what your reviewers would flag in a real PR.

Regressions hit your velocity

When a model update degrades quality on your repo, you see it in the next weekly run — not weeks later when your team notices AI suggestions getting worse.

What You Get

Private Eval Report — acme/backend
Gate (pass rate): 73.3%. Three models tied; the quality dimensions separate them.

              Claude 4   Codex    Gemini
Equivalence   52.0%      31.4%    28.7%
Review        41.2%      22.7%    18.9%
Footprint     12.0%      38.5%    44.1%
Cost/task     $4.12      $0.93    $2.88

Codex equivalence dropped 18pp since Jan 14, correlated with the codex-2026-01-09 model release.

Sample data. Your report runs against your repo's merged PRs and test suite.
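To make the report's headline numbers concrete, here is a minimal sketch of how a gate pass rate and per-model dimension averages could be aggregated from per-task results. The record layout, model names, and scores are hypothetical illustrations, not STET's actual pipeline or data:

```python
from collections import defaultdict

# Hypothetical per-task results: (model, passed_gate, equivalence, review).
# passed_gate is whether the patch passed the repo's test suite;
# the two scores are the quality dimensions from the report above.
results = [
    ("claude-4", True,  0.61, 0.48),
    ("claude-4", True,  0.43, 0.34),
    ("codex",    True,  0.35, 0.25),
    ("codex",    False, 0.28, 0.20),
    ("gemini",   True,  0.30, 0.19),
    ("gemini",   False, 0.27, 0.18),
]

def gate_pass_rate(rows):
    """Fraction of tasks whose patch passed the test-suite gate."""
    return sum(passed for _, passed, _, _ in rows) / len(rows)

def dimension_means(rows):
    """Average each quality dimension per model."""
    acc = defaultdict(list)
    for model, _, equiv, review in rows:
        acc[model].append((equiv, review))
    return {model: tuple(sum(col) / len(col) for col in zip(*scores))
            for model, scores in acc.items()}

print(f"gate pass rate: {gate_pass_rate(results):.1%}")
for model, (equiv, review) in dimension_means(results).items():
    print(f"{model}: equivalence {equiv:.1%}, review {review:.1%}")
```

The gate is pass/fail per task; the dimensions are averaged only over scored tasks, which is why models can tie on the gate while separating clearly on equivalence and review.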

Early Access

Private eval is in early access. Join the waitlist and we’ll reach out when we’re ready to onboard your repo.