Instruction Evals
Test AGENTS.md changes before they reach every agent
Stet tests coding-agent instruction changes on real repo work before they become shared AI infrastructure.
Evals give teams confidence to ship changes to shared AI infrastructure. When you change it, two things matter: don't make things worse, and make things better. Stet measures both.
Why Instructions Need Evals
Agent-native organizations write up to 100% of their code with AI, yet still ship AGENTS.md changes based on vibes. A single bad line in AGENTS.md degrades every agent session for every developer in the repo.
We need a way to move from vibes to rigor.
What Goes Wrong
A concrete failure mode is adding too much implementation detail to AGENTS.md instead of using progressive disclosure. The agent gets confused on tasks unrelated to that detail and starts ignoring other instructions in AGENTS.md.
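For illustration only (the heading, commands, and file path below are invented, not from a real repo): progressive disclosure keeps AGENTS.md down to a short pointer and moves the detail into a document the agent reads only when the task calls for it.

```markdown
<!-- Hypothetical "before": full runbook inlined, loaded in every session -->
## Database migrations
Step 1: snapshot the schema with pg_dump --schema-only ...
Step 2: write the up/down scripts by hand ...
(forty more lines that every agent reads on every task, migration-related or not)

<!-- Hypothetical "after": progressive disclosure -->
## Database migrations
Before touching anything under db/, read docs/migrations.md.
```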
What Stet Compares
AGENTS.md changes
Model comparison, such as Opus 4.7 vs 4.6
Model reasoning effort
Harness changes, such as Claude in Cursor vs Claude in Claude Code
SKILL.md changes
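A minimal sketch of what a two-arm comparison spans, assuming an illustrative `Arm` record (the names and fields here are assumptions for this page, not Stet's actual API):

```python
from dataclasses import dataclass

@dataclass
class Arm:
    """Illustrative only: field names are assumptions, not Stet's API."""
    name: str
    agents_md: str         # which AGENTS.md variant the arm uses
    model: str             # e.g. "opus-4.6" vs "opus-4.7"
    reasoning_effort: str  # e.g. "medium" vs "high"
    harness: str           # e.g. "claude-code" vs "cursor"

baseline  = Arm("baseline",  "AGENTS.md",           "opus-4.6", "medium", "claude-code")
candidate = Arm("candidate", "AGENTS.candidate.md", "opus-4.6", "medium", "claude-code")
```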
What The Eval Measures
Correctness
Run the same tests and validation commands that made the original work in the repo trustworthy.
Equivalence
Check whether the patch solves the same problem, not merely whether it changed nearby files.
Review quality
Score maintainability, bug risk, edge-case handling, coherence, and instruction adherence on top of the test gate.
Footprint risk
Measure whether the candidate made the patch larger, touched unrelated areas, or created merge risk.
Cost and traces
Keep the token spend, attempts, logs, and evidence attached to the result so the rollout decision is inspectable.
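Pictured as data, a per-task result might carry fields like these (a hypothetical shape; every name is an assumption, not Stet's real output format):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Hypothetical result shape; all field names are assumptions."""
    task_id: str
    tests_passed: bool         # correctness: the repo's own test gate
    solves_same_problem: bool  # equivalence with the original change
    review_score: float        # maintainability, bug risk, edge cases, ...
    footprint_delta: int       # lines/files touched beyond the original patch
    tokens_spent: int
    attempts: int
    trace_path: str            # logs and evidence kept for inspection
```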
How It Works
Build the candidate
Point Stet at the current instruction setup and the candidate: AGENTS.md, SKILL.md, model, reasoning effort, or harness.
Replay real work
Stet selects tasks from repo history, resets to the pre-change state, and runs each arm through the same agent workflow (sketched below).
Read the rollout signal
The result shows what improved, what regressed, how cost changed, and whether the evidence is strong enough to ship.
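Conceptually, the replay step is a loop like the minimal sketch below, assuming a task list mined from repo history and `run_agent`/`score` callables; all of it is illustrative, not Stet internals:

```python
import subprocess

def replay(tasks, arms, repo, run_agent, score):
    """Minimal sketch: run every arm on every historical task and score it."""
    results = []
    for task in tasks:
        for arm in arms:
            # Reset the working tree to the commit before the original change.
            subprocess.run(
                ["git", "-C", repo, "checkout", "--force", task.parent_sha],
                check=True,
            )
            patch = run_agent(arm, task.prompt, repo)  # same workflow per arm
            results.append(score(task, arm, patch))    # tests, equivalence, ...
    return results
```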
The same methodology powers Stet's model comparisons and private-repo evals. Use it when an instruction, model, reasoning-effort, skill, or harness change is about to become the default path for every agent session.