Instruction Evals
Test AGENTS.md changes before they reach every agent
Stet tests coding-agent instruction changes on real repo work before they become shared AI infrastructure.
Evals give teams confidence to ship changes to shared AI infrastructure. When you change it, two things matter: don't make things worse, and make things better. Stet measures both.
Why Instructions Need Evals
Agent-native organizations write up to 100% of their code with AI, yet still ship AGENTS.md changes based on vibes. A single bad line in AGENTS.md degrades every agent session for every developer in the repo.
We need a way to move from vibes to rigor.
What Goes Wrong
A concrete failure mode is adding too much implementation detail to AGENTS.md instead of using progressive disclosure. The agent gets confused on tasks unrelated to that detail and starts ignoring other instructions in AGENTS.md.
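For illustration only (the heading, commands, and file path below are invented, not from a real repo): progressive disclosure keeps AGENTS.md down to a short pointer and moves the detail into a document the agent reads only when the task calls for it.

```markdown
<!-- Hypothetical "before": full runbook inlined, loaded in every session -->
## Database migrations
Step 1: snapshot the schema with pg_dump --schema-only ...
Step 2: write the up/down scripts by hand ...
(forty more lines that every agent reads on every task, migration-related or not)

<!-- Hypothetical "after": progressive disclosure -->
## Database migrations
Before touching anything under db/, read docs/migrations.md.
```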
What Stet Compares
AGENTS.md changes
Model comparison, such as Opus 4.7 vs 4.6
Model reasoning effort
Harness changes, such as Claude in Cursor vs Claude in Claude Code
SKILL.md changes
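A minimal sketch of what a two-arm comparison spans, assuming an illustrative `Arm` record (the names and fields here are assumptions for this page, not Stet's actual API):

```python
from dataclasses import dataclass

@dataclass
class Arm:
    """Illustrative only: field names are assumptions, not Stet's API."""
    name: str
    agents_md: str         # which AGENTS.md variant the arm uses
    model: str             # e.g. "opus-4.6" vs "opus-4.7"
    reasoning_effort: str  # e.g. "medium" vs "high"
    harness: str           # e.g. "claude-code" vs "cursor"

baseline  = Arm("baseline",  "AGENTS.md",           "opus-4.6", "medium", "claude-code")
candidate = Arm("candidate", "AGENTS.candidate.md", "opus-4.6", "medium", "claude-code")
```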
What The Eval Measures
Correctness
Run the same tests and validation commands that made the original work in the repo trustworthy.
Equivalence
Check whether the patch solves the same problem, not merely whether it changed nearby files.
Review quality
Score maintainability, bug risk, edge-case handling, coherence, and instruction adherence on top of the test gate.
Footprint risk
Measure whether the candidate made the patch larger, touched unrelated areas, or created merge risk.
Cost and traces
Keep the token spend, attempts, logs, and evidence attached to the result so the rollout decision is inspectable.
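Pictured as data, a per-task result might carry fields like these (a hypothetical shape; every name is an assumption, not Stet's real output format):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Hypothetical result shape; all field names are assumptions."""
    task_id: str
    tests_passed: bool         # correctness: the repo's own test gate
    solves_same_problem: bool  # equivalence with the original change
    review_score: float        # maintainability, bug risk, edge cases, ...
    footprint_delta: int       # lines/files touched beyond the original patch
    tokens_spent: int
    attempts: int
    trace_path: str            # logs and evidence kept for inspection
```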
How It Works
Build the candidate
Point Stet at the current instruction setup and the candidate: AGENTS.md, SKILL.md, model, reasoning effort, or harness.
Replay real work
Stet selects tasks from repo history, resets to the pre-change state, and runs each arm through the same agent workflow (sketched below).
Read the rollout signal
The result shows what improved, what regressed, how cost changed, and whether the evidence is strong enough to ship.
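Conceptually, the replay step is a loop like the minimal sketch below, assuming a task list mined from repo history and `run_agent`/`score` callables; all of it is illustrative, not Stet internals:

```python
import subprocess

def replay(tasks, arms, repo, run_agent, score):
    """Minimal sketch: run every arm on every historical task and score it."""
    results = []
    for task in tasks:
        for arm in arms:
            # Reset the working tree to the commit before the original change.
            subprocess.run(
                ["git", "-C", repo, "checkout", "--force", task.parent_sha],
                check=True,
            )
            patch = run_agent(arm, task.prompt, repo)  # same workflow per arm
            results.append(score(task, arm, patch))    # tests, equivalence, ...
    return results
```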
The same methodology powers Stet's model comparisons and private-repo evals. Use it when an instruction, model, reasoning-effort, skill, or harness change is about to become the default path for every agent session.