STET

Evals are broken.

Benchmarks are contaminated.
Your team has no way to tell if AI is helping — or hurting.

The Redacted Report

A timeline of model and harness releases whose performance impact went unmeasured (bars redacted: no data collected):

Oct 2025: Opus 4.5
Nov 2025: CC 2.0.76
Dec 2025: Codex CLI 0.98
Jan 2026: Codex 5.2
Feb 2026: CC 2.1.37
Feb 2026: Opus 4.6, Codex 5.3

In place of data, the annotations teams actually have: “is Claude broken today?” · “vibes are off” · “SWE-bench: 72.1%” · “works for me” · “rolled back to 2.0.62” · “we lost 2 days”

OpenAI declared SWE-bench Verified dead — contamination across all frontier models. The primary benchmark is broken. — OpenAI, Feb 2026

“Outside of vanity metrics, I have nothing of value to show.” — Principal engineer, 900-person company (source)

The Betrayal Timeline

Dashboard metrics: CI pass rate: 94% · PRs merged/week: stable · Sprint velocity: flat
Everything looks fine. Nothing is fine.

Quality over time: Oct 78 · Nov 68 · Dec 58 · Jan 50 · Feb 45 · Mar 36

Model and harness updates change AI behavior. Config changes compound. Quality drifts — and the only way people notice is when the “vibes are off”.
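One way to read the timeline above: no single month looks catastrophic, but the loss compounds. A quick sketch, using the chart's values (treating them as scores on a 0–100 scale is an assumption):

```typescript
// Quality series from the timeline above (Oct through Mar).
const series = [78, 68, 58, 50, 45, 36];

// Month-over-month drops: each one small enough to shrug off...
const monthlyDrops = series.slice(1).map((value, i) => series[i] - value);

// ...but the compounded loss over six months is large.
const totalLossPct = Math.round((1 - series[series.length - 1] / series[0]) * 100);

console.log(monthlyDrops); // [10, 10, 8, 5, 9]
console.log(totalLossPct); // 54
```

A 54% decline, delivered in steps too small for any dashboard alarm to fire.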

Codex 5.2-high beats Codex 5.2-xhigh 67% of the time on real coding tasks. More thinking tokens isn’t always better. — Voratiq, 175 runs

The Untested Stack
Workflow Mode: human-in-the-loop vs background agent
Reasoning Settings: low / medium / high
Tool Settings: MCP servers, context
Skills: SKILLS.md
Custom Instructions: .cursorrules, AGENTS.md
System Prompt / Rules: directory overrides
Harness Version: codex-cli 0.98 → 0.104
Model Selection: opus-4.5 → opus-4.6
Base Model Behavior: changes without notice

3 models × 4 instruction sets × 2 tool configs × 2 workflow modes × 3 reasoning levels × 2 harness versions
288 configurations. You’re testing: 1.

It’s not just the model. It’s the harness, skills, the rules, the tools, the workflow. Every layer is a variable. None of them are tested together.
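The multiplication is easy to verify: the untested space is the Cartesian product of every layer. A sketch, with option values chosen to mirror the stack (the lists are illustrative, not an exhaustive inventory):

```typescript
// Illustrative config axes; the option values are examples, not a complete catalog.
const axes: string[][] = [
  ["opus-4.5", "opus-4.6", "codex-5.2"],              // 3 models
  ["none", "AGENTS.md", ".cursorrules", "SKILLS.md"], // 4 instruction sets
  ["default", "mcp"],                                 // 2 tool configs
  ["human-in-the-loop", "background"],                // 2 workflow modes
  ["low", "medium", "high"],                          // 3 reasoning levels
  ["codex-cli 0.98", "codex-cli 0.104"],              // 2 harness versions
];

// Every layer multiplies the space.
const configurations = axes.reduce((n, options) => n * options.length, 1);

console.log(configurations); // 288
```

Add one more axis — say, a second custom-instruction variant — and the space doubles again.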

Median 3 tools per engineer; 14.7% use 5+. 49.1% use different tools for different tasks. — Pragmatic Engineer Survey, 2026

What passes for measurement

Green checks, red quality.

Tests (the gate) vs. code quality, eight AI-generated PRs:

PR #847: PASS · quality 35
PR #848: PASS · quality 92
PR #849: PASS · quality 55
PR #850: PASS · quality 28
PR #851: PASS · quality 88
PR #852: PASS · quality 42
PR #853: PASS · quality 15
PR #854: PASS · quality 72

AI generates more PRs. Each one still needs human review. Tests / CI are the gate, not the source of truth.
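The quality scores above make the gap concrete. A sketch using those numbers (the 60-point “shippable” cutoff is an illustrative assumption, not a Stet constant):

```typescript
// The eight test-passing PRs and their quality scores, from the list above.
const prs = [
  { pr: 847, quality: 35 },
  { pr: 848, quality: 92 },
  { pr: 849, quality: 55 },
  { pr: 850, quality: 28 },
  { pr: 851, quality: 88 },
  { pr: 852, quality: 42 },
  { pr: 853, quality: 15 },
  { pr: 854, quality: 72 },
];

// The gate sees a perfect run: every PR passed CI.
const passRate = 8 / 8;

// Above the gate, an illustrative ship threshold tells a different story.
const SHIP_THRESHOLD = 60;
const shippable = prs.filter(p => p.quality >= SHIP_THRESHOLD).length / prs.length;

console.log(passRate);  // 1
console.log(shippable); // 0.375
```

A 100% pass rate, and barely a third of the PRs clear even a generous quality bar.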

“One of our main challenges has been code reviews, as the quantity of code produced goes up, and quality used to go down, pre-Opus 4.5.” — Staff engineer, 30-person company (source)

~50% of test-passing SWE-bench PRs would not be merged by repo maintainers. — METR, March 2026

Same task. Both pass.

zod #4843 — Fix branded-primitive typing in error tree

GPT-5.1 Codex Mini · codex cli · ✓ Tests pass
zod/src/v4/core/errors.ts
import { $brand, $constructor } from "./core.js";

type $RemoveBrand<T> = T extends infer U
  ? U extends { [k in typeof $brand]: unknown }
    ? Omit<U, typeof $brand>
    : U
  : never;

type $ZodErrorTreeInternal<T, U = string> = T extends [any, ...any[]]
  ? { errors: U[]; items?: { [K in keyof T]?: $ZodErrorTreeInternal<$RemoveBrand<T[K]>, U> } }
  : T extends any[]
    ? { errors: U[]; items?: Array<$ZodErrorTreeInternal<$RemoveBrand<T[number]>, U>> }
    : T extends object
      ? { errors: U[]; properties?: { [K in keyof T]?: $ZodErrorTreeInternal<$RemoveBrand<T[K]>, U> } }
      : { errors: U[] };

export type $ZodErrorTree<T, U = string> = $ZodErrorTreeInternal<$RemoveBrand<T>, U>;
Correctness: 1/4 · Bug risk: 1/4 · Edge cases: 1/4 · Maintainability: 2/4
GPT-5.4 · codex cli · ✓ Tests pass
zod/src/v4/core/errors.ts
import type { $brand } from "./core.js";

type StripPrimitiveBrand<T> = T extends infer Primitive & $brand<any>
  ? Primitive extends util.Primitive
    ? StripPrimitiveBrand<Primitive>
    : T
  : T;

type _ZodErrorTree<T, U = string> = T extends [any, ...any[]]
  ? { errors: U[]; items?: { [K in keyof T]?: $ZodErrorTree<T[K], U> } }
  : T extends any[]
    ? { errors: U[]; items?: Array<$ZodErrorTree<T[number], U>> }
    : T extends object
      ? { errors: U[]; properties?: { [K in keyof T]?: $ZodErrorTree<T[K], U> } }
      : { errors: U[] };

export type $ZodErrorTree<T, U = string> = _ZodErrorTree<StripPrimitiveBrand<T>, U>;
Correctness: 3/4 · Bug risk: 3/4 · Edge cases: 3/4 · Maintainability: 2/4

Both pass your CI.

One produces code you’d ship. One produces code you’d rewrite.

Real patches from Stet evaluation runs · zod dataset

You don’t ship untested code changes. Stop shipping unmeasured agent changes.

The teams that get this right tell their agent to test itself. Every model swap, skill change, or config update gets measured before rollout — on their own code, their own tests, their own standards.

We ran two models on 60 tasks from a real open-source repo. Same tasks. Same tests. Here’s what the data showed.

GPT-5.3 Codex → GPT-5.4 · codex cli (Mar 2026)
Pass rate: 75% → 79% (+4%)
Review quality: 19% → 32%
  · correctness: 1.6 → 1.9
  · edge case handling: 1.5 → 1.9
  · maintainability: 1.8 → 2.2
Cost/task: $3.06 → $0.67 (↓ 78%)
Verdict: PROMOTE
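A promote/hold/rollback decision like the one above can be sketched as a simple gate. The thresholds and field names here are assumptions for illustration, not Stet’s API:

```typescript
interface EvalResult {
  passRate: number;      // fraction of tasks whose tests pass
  reviewQuality: number; // fraction of patches clearing a review-quality bar
}

type Verdict = "PROMOTE" | "HOLD" | "ROLLBACK";

// Illustrative policy: roll back on a clear regression on either axis,
// promote on a clean win, otherwise hold and gather more runs.
function verdict(baseline: EvalResult, candidate: EvalResult): Verdict {
  const passDelta = candidate.passRate - baseline.passRate;
  const qualityDelta = candidate.reviewQuality - baseline.reviewQuality;
  if (passDelta < -0.05 || qualityDelta < -0.05) return "ROLLBACK";
  if (passDelta >= 0 && qualityDelta > 0) return "PROMOTE";
  return "HOLD";
}

// The GPT-5.3 → GPT-5.4 numbers from the report above:
const result = verdict(
  { passRate: 0.75, reviewQuality: 0.19 },
  { passRate: 0.79, reviewQuality: 0.32 },
);
console.log(result); // PROMOTE
```

The point isn’t this particular policy; it’s that the decision becomes a function of measured numbers instead of vibes.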
Drift

Quality degrades by default. Every update is a risk.

Hypothesis
288 configurations · 1 tested

Every change is an experiment. Test it on your code.

Above the Gate
pass rate vs. quality

Pass rate ≠ quality. The gap is where models diverge.

Run it on your codebase

Stet replays your merged PRs, scores quality above pass/fail, and delivers comparison reports through your agent. Recurring runs and release gates follow.

We’re publishing our evaluation results openly.

See the data →