STET

Your AGENTS.md is the highest-leverage code you're not testing

April 8, 2026

I shipped a bad AGENTS.md change.

The agent became less aligned and started quietly over-engineering changes. Instead of a clean 20-line function, it would produce a bloated 40-line one.

That one bad change did not just affect my own workflow. It degraded output quality for every engineer using coding agents in our 100-developer monorepo.

Why did we not stop it?

The measurement gap

We did not stop it because we did not know anything was broken.

There is a pattern here, and I am guilty of it too: treat AGENTS.md like AI-generated code. Fire and forget. Hope it does not break anything. Tweak it every time the model changes.

The problem is that AGENTS.md is insanely important. It is the highest-leverage point in the harness, pulled into every turn for every engineer in the codebase.

Every line in AGENTS.md should be considered carefully. If agents are writing 80 to 100 percent of your code, then something this fundamental to the harness deserves far more craft than we currently give it.

We should be treating AGENTS.md changes like important infrastructure.

Why this is hard

You cannot diff the output of an AGENTS.md change. Agents are non-deterministic by necessity.

One line in AGENTS.md might not produce a reliable, consistent behavior change. The same instruction edit can produce different results across tasks, repos, models, harnesses, and time.

The feedback loop is invisible and hard to close. With current tooling, how does an organization know its agents got 5 percent worse at scale? You just get slightly worse PRs. No pages. No failing tests. No obvious signal.

Our tools have not caught up to the requirements of agentic engineering. We need a way to close the feedback loop between making an AGENTS.md change and seeing what impact that change has.

Measurement

AGENTS.md changes should be A/B tested. Without A/B testing, you do not understand the impact of the change.

Here is what to measure:

  • Test pass rate
  • Whether the agent's diff is functionally equivalent to a human-written diff
  • How much extra code the agent writes relative to the human diff
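The last metric is cheap to compute directly from diffs. Here is a minimal sketch; `added_lines` and `code_growth_ratio` are hypothetical helper names, and this version simply counts added lines in unified diffs rather than doing any semantic comparison:

```python
def added_lines(diff: str) -> int:
    """Count lines added by a unified diff, skipping '+++' file headers."""
    return sum(
        1
        for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    )


def code_growth_ratio(agent_diff: str, human_diff: str) -> float:
    """How much code the agent added relative to the human baseline.

    A ratio of 2.0 means the agent wrote twice as many lines as the
    human did for the same task.
    """
    human = added_lines(human_diff)
    agent = added_lines(agent_diff)
    return agent / human if human else float("inf")
```

Tracked per task and averaged across a release, this one number would have caught the 20-line-to-40-line regression described above.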

The key point is to measure how well the agent aligns with your codebase and how closely its output matches what you would have shipped yourself.
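The A/B split itself can be as simple as deterministic bucketing: hash each task (or engineer) into a control or candidate AGENTS.md variant, so assignments are stable across retries. A minimal sketch, with `assign_variant` and the variant names as illustrative assumptions:

```python
import hashlib


def assign_variant(
    task_id: str,
    variants: tuple[str, ...] = ("control", "candidate"),
) -> str:
    """Deterministically bucket a task into an AGENTS.md variant.

    Hashing the task id (rather than random assignment) keeps the
    bucket stable if the same task is retried or re-measured.
    """
    digest = hashlib.sha256(task_id.encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Each bucket then runs against its own copy of AGENTS.md, and the metrics above are compared between buckets before the change is rolled out to everyone.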

Monitoring

Agents change over time. Monday, everything is working. Wednesday, Anthropic or OpenAI ships a silent update that shifts behavior underneath you.

Without monitoring, there is no way to know what happened. Your AGENTS.md did not break. The sand shifted under it.

That is why organizations need continuous monitoring, not just pre-release testing. It is the same reason we monitor production after deploy instead of trusting green CI forever.
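In practice this can start as a simple drift check on a daily metric like test pass rate: compare a recent window against the long-run baseline and alert when the drop exceeds a threshold. A minimal sketch; `drifted` and its parameters are illustrative assumptions, not a real monitoring API:

```python
from statistics import mean


def drifted(
    history: list[float],
    window: int = 7,
    threshold: float = 0.05,
) -> bool:
    """Flag when the recent window's mean pass rate drops more than
    `threshold` below the mean of everything before it."""
    if len(history) < 2 * window:
        return False  # not enough data to compare
    baseline = mean(history[:-window])
    recent = mean(history[-window:])
    return baseline - recent > threshold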

Reviews

We should review AGENTS.md with the same rigor we apply to critical production code.

Each change should be read carefully and approved by a human.

Slop in, slop out. Scale that across every engineer in the org and it becomes obvious why AGENTS.md matters so much.

Conclusion

When we measured agent quality beyond test pass rate, we found a 2x quality gap hiding in plain sight: Your AI coding benchmark is hiding a 2x quality gap.

If you are not measuring and releasing AGENTS.md changes deliberately, you are guessing at scale.

How are you measuring and releasing AGENTS.md changes today? If the answer is "we're not," you do not have a release process. You have a hope-based one.
