STET

Evals are broken.

Benchmarks are contaminated.
Your team has no way to tell if AI is helping — or hurting.

The Redacted Report

A timeline of model and harness releases whose performance impact went unmeasured (bars redacted: no data collected):

Oct 2025: Opus 4.5
Nov 2025: CC 2.0.76
Dec 2025: Codex CLI 0.98
Jan 2026: Codex 5.2
Feb 2026: CC 2.1.37
Feb 2026: Opus 4.6, Codex 5.3

In place of data, the annotations teams actually have: “is Claude broken today?” · “vibes are off” · “SWE-bench: 72.1%” · “works for me” · “rolled back to 2.0.62” · “we lost 2 days”

OpenAI declared SWE-bench Verified dead — contamination across all frontier models. The primary benchmark is broken. — OpenAI, Feb 2026

“Outside of vanity metrics, I have nothing of value to show.” — Principal engineer, 900-person company (source)

The Betrayal Timeline

Dashboard metrics: CI pass rate: 94% · PRs merged/week: stable · Sprint velocity: flat
Everything looks fine. Nothing is fine.

Quality over time: Oct 78 · Nov 68 · Dec 58 · Jan 50 · Feb 45 · Mar 36

Model and harness updates change AI behavior. Config changes compound. Quality drifts — and the only way people notice is when the “vibes are off”.
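One way to read the timeline above: no single month looks catastrophic, but the loss compounds. A quick sketch, using the chart's values (treating them as scores on a 0–100 scale is an assumption):

```typescript
// Quality series from the timeline above (Oct through Mar).
const series = [78, 68, 58, 50, 45, 36];

// Month-over-month drops: each one small enough to shrug off...
const monthlyDrops = series.slice(1).map((value, i) => series[i] - value);

// ...but the compounded loss over six months is large.
const totalLossPct = Math.round((1 - series[series.length - 1] / series[0]) * 100);

console.log(monthlyDrops); // [10, 10, 8, 5, 9]
console.log(totalLossPct); // 54
```

A 54% decline, delivered in steps too small for any dashboard alarm to fire.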

Codex 5.2-high beats Codex 5.2-xhigh 67% of the time on real coding tasks. More thinking tokens isn’t always better. — Voratiq, 175 runs

The Untested Stack
Workflow Mode: human-in-the-loop vs background agent
Reasoning Settings: low / medium / high
Tool Settings: MCP servers, context
Skills: SKILLS.md
Custom Instructions: .cursorrules, AGENTS.md
System Prompt / Rules: directory overrides
Harness Version: codex-cli 0.98 → 0.104
Model Selection: opus-4.5 → opus-4.6
Base Model Behavior: changes without notice

3 models × 4 instruction sets × 2 tool configs × 2 workflow modes × 3 reasoning levels × 2 harness versions
288 configurations. You’re testing: 1.

It’s not just the model. It’s the harness, skills, the rules, the tools, the workflow. Every layer is a variable. None of them are tested together.
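The multiplication is easy to verify: the untested space is the Cartesian product of every layer. A sketch, with option values chosen to mirror the stack (the lists are illustrative, not an exhaustive inventory):

```typescript
// Illustrative config axes; the option values are examples, not a complete catalog.
const axes: string[][] = [
  ["opus-4.5", "opus-4.6", "codex-5.2"],              // 3 models
  ["none", "AGENTS.md", ".cursorrules", "SKILLS.md"], // 4 instruction sets
  ["default", "mcp"],                                 // 2 tool configs
  ["human-in-the-loop", "background"],                // 2 workflow modes
  ["low", "medium", "high"],                          // 3 reasoning levels
  ["codex-cli 0.98", "codex-cli 0.104"],              // 2 harness versions
];

// Every layer multiplies the space.
const configurations = axes.reduce((n, options) => n * options.length, 1);

console.log(configurations); // 288
```

Add one more axis — say, a second custom-instruction variant — and the space doubles again.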

Median 3 tools per engineer; 14.7% use 5+. 49.1% use different tools for different tasks. — Pragmatic Engineer Survey, 2026

What passes for measurement

Green checks, red quality.

Tests (the gate) vs. code quality, eight AI-generated PRs:

PR #847: PASS · quality 35
PR #848: PASS · quality 92
PR #849: PASS · quality 55
PR #850: PASS · quality 28
PR #851: PASS · quality 88
PR #852: PASS · quality 42
PR #853: PASS · quality 15
PR #854: PASS · quality 72

AI generates more PRs. Each one still needs human review. Tests / CI are the gate, not the source of truth.
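The quality scores above make the gap concrete. A sketch using those numbers (the 60-point “shippable” cutoff is an illustrative assumption, not a Stet constant):

```typescript
// The eight test-passing PRs and their quality scores, from the list above.
const prs = [
  { pr: 847, quality: 35 },
  { pr: 848, quality: 92 },
  { pr: 849, quality: 55 },
  { pr: 850, quality: 28 },
  { pr: 851, quality: 88 },
  { pr: 852, quality: 42 },
  { pr: 853, quality: 15 },
  { pr: 854, quality: 72 },
];

// The gate sees a perfect run: every PR passed CI.
const passRate = 8 / 8;

// Above the gate, an illustrative ship threshold tells a different story.
const SHIP_THRESHOLD = 60;
const shippable = prs.filter(p => p.quality >= SHIP_THRESHOLD).length / prs.length;

console.log(passRate);  // 1
console.log(shippable); // 0.375
```

A 100% pass rate, and barely a third of the PRs clear even a generous quality bar.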

“One of our main challenges has been code reviews, as the quantity of code produced goes up, and quality used to go down, pre-Opus 4.5.” — Staff engineer, 30-person company (source)

~50% of test-passing SWE-bench PRs would not be merged by repo maintainers. — METR, March 2026

Same task. Both pass.

zod #4843 — Fix branded-primitive typing in error tree

GPT-5.1 Codex Mini · codex cli · ✓ Tests pass
zod/src/v4/core/errors.ts
import { $brand, $constructor } from "./core.js";

type $RemoveBrand<T> = T extends infer U
  ? U extends { [k in typeof $brand]: unknown }
    ? Omit<U, typeof $brand>
    : U
  : never;

type $ZodErrorTreeInternal<T, U = string> = T extends [any, ...any[]]
  ? { errors: U[]; items?: { [K in keyof T]?: $ZodErrorTreeInternal<$RemoveBrand<T[K]>, U> } }
  : T extends any[]
    ? { errors: U[]; items?: Array<$ZodErrorTreeInternal<$RemoveBrand<T[number]>, U>> }
    : T extends object
      ? { errors: U[]; properties?: { [K in keyof T]?: $ZodErrorTreeInternal<$RemoveBrand<T[K]>, U> } }
      : { errors: U[] };

export type $ZodErrorTree<T, U = string> = $ZodErrorTreeInternal<$RemoveBrand<T>, U>;
Correctness: 1/4 · Bug risk: 1/4 · Edge cases: 1/4 · Maintainability: 2/4
GPT-5.4 · codex cli · ✓ Tests pass
zod/src/v4/core/errors.ts
import type { $brand } from "./core.js";

type StripPrimitiveBrand<T> = T extends infer Primitive & $brand<any>
  ? Primitive extends util.Primitive
    ? StripPrimitiveBrand<Primitive>
    : T
  : T;

type _ZodErrorTree<T, U = string> = T extends [any, ...any[]]
  ? { errors: U[]; items?: { [K in keyof T]?: $ZodErrorTree<T[K], U> } }
  : T extends any[]
    ? { errors: U[]; items?: Array<$ZodErrorTree<T[number], U>> }
    : T extends object
      ? { errors: U[]; properties?: { [K in keyof T]?: $ZodErrorTree<T[K], U> } }
      : { errors: U[] };

export type $ZodErrorTree<T, U = string> = _ZodErrorTree<StripPrimitiveBrand<T>, U>;
Correctness: 3/4 · Bug risk: 3/4 · Edge cases: 3/4 · Maintainability: 2/4

Both pass your CI.

One produces code you’d ship. One produces code you’d rewrite.

Real patches from Stet evaluation runs · zod dataset

You don’t ship untested code changes. Stop shipping unmeasured agent changes.

The teams that get this right tell their agent to test itself. Every model swap, skill change, or config update gets measured before rollout — on their own code, their own tests, their own standards.

We ran two models on 60 tasks from a real open-source repo. Same tasks. Same tests. Here’s what the data showed.

GPT-5.3 Codex → GPT-5.4 · codex cli (Mar 2026)
Pass rate: 75% → 79% (+4%)
Review quality: 19% → 32%
  · correctness: 1.6 → 1.9
  · edge case handling: 1.5 → 1.9
  · maintainability: 1.8 → 2.2
Cost/task: $3.06 → $0.67 (↓ 78%)
Verdict: PROMOTE
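A promote/hold/rollback decision like the one above can be sketched as a simple gate. The thresholds and field names here are assumptions for illustration, not Stet’s API:

```typescript
interface EvalResult {
  passRate: number;      // fraction of tasks whose tests pass
  reviewQuality: number; // fraction of patches clearing a review-quality bar
}

type Verdict = "PROMOTE" | "HOLD" | "ROLLBACK";

// Illustrative policy: roll back on a clear regression on either axis,
// promote on a clean win, otherwise hold and gather more runs.
function verdict(baseline: EvalResult, candidate: EvalResult): Verdict {
  const passDelta = candidate.passRate - baseline.passRate;
  const qualityDelta = candidate.reviewQuality - baseline.reviewQuality;
  if (passDelta < -0.05 || qualityDelta < -0.05) return "ROLLBACK";
  if (passDelta >= 0 && qualityDelta > 0) return "PROMOTE";
  return "HOLD";
}

// The GPT-5.3 → GPT-5.4 numbers from the report above:
const result = verdict(
  { passRate: 0.75, reviewQuality: 0.19 },
  { passRate: 0.79, reviewQuality: 0.32 },
);
console.log(result); // PROMOTE
```

The point isn’t this particular policy; it’s that the decision becomes a function of measured numbers instead of vibes.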
Drift

Quality degrades by default. Every update is a risk.

Hypothesis
288 configurations · 1 tested

Every change is an experiment. Test it on your code.

Above the Gate
pass rate vs. quality

Pass rate ≠ quality. The gap is where models diverge.

Run it on your codebase

Stet replays your merged PRs, scores quality above pass/fail, and delivers comparison reports through your agent. Recurring runs and release gates follow.

We’re publishing our evaluation results openly.

See the data →