STET

Anthropic shipped Opus 4.6 on February 18th.
Your team’s AI-generated code changed that day.

Your tests still pass. Your velocity looks fine.
You have no idea if your code got worse.

“Outside of vanity metrics, I have nothing of value to show.” — Principal engineer, 900-person company (source)

The Betrayal Timeline
Opus 4.5 (Oct 2025) → CC 2.0.76 (Dec 2025) → Cursor 0.48 (Jan 2026) → GPT 5.2 (Jan 2026) → Opus 4.6 (Feb 2026)

WHAT YOUR DASHBOARD SHOWS
CI pass rate: 94%
PRs merged/week: stable
Sprint velocity: flat

WHAT'S ACTUALLY HAPPENING

Model updates change AI behavior. Config changes compound. Quality drifts — and nobody notices until velocity stalls.

The signals your team trusts are lagging indicators. By the time you feel the problem, it’s been compounding for weeks.

GPT-5.2-high beats GPT-5.2-xhigh 67% of the time on real coding tasks. More thinking tokens isn’t always better. — Voratiq, 175 runs

Users rolling back Claude Code from 2.0.76 to 2.0.62 due to perceived regression — GitHub #16157

The Untested Stack
Layer | What varies | How often it changes
Workflow Template | agentic vs single-shot | per task
Tool Settings | MCP servers, context | on update
Skills | agent skills | weekly
Custom Instructions | .cursorrules, AGENTS.md | ad hoc
System Prompt / Rules | directory overrides | weekly
Model Selection | opus-4.5 → opus-4.6 | on switch
Base Model Behavior | changes without notice | monthly
3 models × 4 instruction sets × 2 tool configs × 3 workflow modes
= 72 unique configurations
You’re testing: 1
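The arithmetic can be checked by enumerating the stack directly; a minimal sketch, where the per-layer values are illustrative labels rather than real config names:

```python
from itertools import product

# Each layer of the stack is a variable. The value names here are
# placeholders standing in for whatever your team actually runs.
models = ["opus-4.5", "opus-4.6", "gpt-5.2"]                     # 3 models
instruction_sets = ["none", "cursorrules", "agents-md", "team"]  # 4 instruction sets
tool_configs = ["default", "mcp-enabled"]                        # 2 tool configs
workflows = ["single-shot", "agentic", "plan-then-act"]          # 3 workflow modes

configs = list(product(models, instruction_sets, tool_configs, workflows))
print(len(configs))  # 72 unique configurations; most teams test 1
```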

It’s not just the model. It’s the instructions, the rules, the tool settings, the workflow. Every layer is a variable. None of them are tested together.

Median 3 tools per engineer; 14.7% use 5+. 49.1% use different tools for different tasks. — Pragmatic Engineer Survey, 2026

Green Checks, Red Quality
PR | Tests (the gate) | Code Quality (above the gate)
#847 | PASS | 35
#848 | PASS | 92
#849 | PASS | 55
#850 | PASS | 28
#851 | PASS | 88
#852 | PASS | 42
#853 | PASS | 15
#854 | PASS | 72
#855 | PASS | 31
#856 | PASS | 85
#857 | PASS | 48
#858 | PASS | 22

AI generates more PRs. Each one still needs human review. Tests are the gate, not the source of truth.

Review quality drops. Nobody catches it because the checks are green.

In our testing: models with identical pass rates showed 5x differences in review quality — measured across correctness, style adherence, unnecessary complexity, and diff bloat.
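One way a score above the gate could be composed from those dimensions; the dimension names come from the text, but the 0–10 scales and equal weighting below are assumptions for illustration, not STET's actual rubric:

```python
# Hypothetical rubric: score a PR's review quality above the test gate.
# Positive dimensions add, penalty dimensions subtract; all inputs 0-10.
def review_quality(correctness, style_adherence, complexity_penalty, diff_bloat):
    raw = (correctness + style_adherence) - (complexity_penalty + diff_bloat)
    # raw ranges from -20 to +20; normalize to a 0-100 score.
    return round((raw + 20) * 100 / 40)

# Two PRs with identical green checks can land far apart:
print(review_quality(9, 9, 1, 1))  # → 90
print(review_quality(4, 3, 7, 8))  # → 30
```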

“The review process became a nightmare. Less experienced engineers overuse it… I see the same mistakes all over again.” — Senior SWE, 10K+ company (source)

You don’t ship untested code changes.
Stop shipping unmeasured agent changes.

OpenAI declared SWE-bench Verified dead — contamination across all frontier models. The primary benchmark is broken. — OpenAI, Feb 2026

What competent teams do instead
  1. Mine tasks from your repo’s merged PRs — real work, not synthetic benchmarks
  2. Replay them against two AI configurations in isolated environments
  3. Score on tests + quality above the gate (review, equivalence, footprint, cost)
  4. Output a decision: promote, hold, or rollback
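Step 4 can be sketched as a simple decision rule; the `Score` fields and the tolerance thresholds below are hypothetical, not the product's API:

```python
from dataclasses import dataclass

@dataclass
class Score:
    pass_rate: float       # fraction of replayed tasks passing tests (the gate)
    review_quality: float  # quality above the gate, 0-100

def decide(baseline: Score, candidate: Score,
           pass_tol: float = 0.02, quality_tol: float = 5.0) -> str:
    """Turn two scorecards into promote / hold / rollback."""
    if candidate.pass_rate < baseline.pass_rate - pass_tol:
        return "rollback"  # the gate itself regressed
    if candidate.review_quality < baseline.review_quality - quality_tol:
        return "hold"      # green checks, but quality above the gate dropped
    return "promote"

# With the sample report's numbers (pass 73% → 73%, quality 80 → 62):
print(decide(Score(0.73, 80.0), Score(0.73, 62.0)))  # prints "hold"
```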
The Control Loop
Opus 4.5 → 4.6 (Feb 18, 2026)
Pass rate: 73% → 73% (=)
Review quality: 80% → 62% (↓)
  correctness: 3.2 → 2.1
  unnecessary complexity: 0.8 → 2.4
  diff bloat: 1.2x → 2.1x
Cost/task: $0.42 → $0.38 (↓)
Verdict: HOLD
Regression Detection

Know when a model update, config change, or repo drift breaks your AI coding quality — before you roll it out.
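Detection over a scorecard like the sample report's can be sketched as a direction-aware diff; the metric values are taken from that report, and the higher-is-better flags are assumptions:

```python
# Which direction counts as "better" for each metric (assumed flags).
HIGHER_IS_BETTER = {
    "pass_rate": True, "review_quality": True, "correctness": True,
    "unnecessary_complexity": False, "diff_bloat": False, "cost_per_task": False,
}

# Values from the sample Opus 4.5 → 4.6 report.
baseline = {"pass_rate": 0.73, "review_quality": 80, "correctness": 3.2,
            "unnecessary_complexity": 0.8, "diff_bloat": 1.2, "cost_per_task": 0.42}
candidate = {"pass_rate": 0.73, "review_quality": 62, "correctness": 2.1,
             "unnecessary_complexity": 2.4, "diff_bloat": 2.1, "cost_per_task": 0.38}

# A metric regressed if it moved, and moved in the "worse" direction.
regressed = [m for m in baseline
             if candidate[m] != baseline[m]
             and (candidate[m] < baseline[m]) == HIGHER_IS_BETTER[m]]
print(regressed)
```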

A/B Testing

Compare any two configurations — model, instructions, tool settings — on your actual codebase. Not a benchmark. Your code, your tests, your standards.

Quality Above the Gate

Measure what tests can’t catch: correctness, unnecessary complexity, diff footprint, cost efficiency. Tests are table stakes. Score what matters above them.

See a sample report on a real open-source repo.