STET

Run it on your repo.

curl -fsSL https://raw.githubusercontent.com/benredmond/stet-cli/main/install.sh | sh
stet auth login

First scored result within the hour, on the Claude or Codex subscription you already have.

Your First Hour

  1. Install (1 min)
  2. stet auth login (1 min; your account is created here)
  3. Stet builds a task suite from your repo's merged PRs
  4. First scored result (about an hour)

Runs on your existing subscription limits — no new API key, no new bill.

Your code, tests, and credentials stay on your machine.

What You Get Back

decision receipt · baseline → agents-md-v3 · platform-api

        baseline   agents-md-v3
pass    0.50       1.00
cost    $3.56      $3.47
time    780s       600s

DECISION: PROMOTE
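
To make the receipt concrete, here is a minimal sketch of how a promote decision could follow from those three numbers. The rule is an assumption for illustration, not Stet's actual decision logic: `RunSummary`, `decide`, and the `HOLD` label are hypothetical names that do not appear in Stet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Aggregate metrics for one configuration (hypothetical shape)."""
    pass_rate: float  # fraction of tasks passed, 0.0 to 1.0
    cost_usd: float   # total spend for the run
    time_s: float     # total wall-clock seconds


def decide(baseline: RunSummary, candidate: RunSummary) -> str:
    """Illustrative rule only: promote when the pass rate improves and
    neither cost nor time gets worse; otherwise hold."""
    improves = candidate.pass_rate > baseline.pass_rate
    no_regression = (
        candidate.cost_usd <= baseline.cost_usd
        and candidate.time_s <= baseline.time_s
    )
    return "PROMOTE" if improves and no_regression else "HOLD"


# Numbers taken from the receipt above.
baseline = RunSummary(pass_rate=0.50, cost_usd=3.56, time_s=780)
candidate = RunSummary(pass_rate=1.00, cost_usd=3.47, time_s=600)
print(decide(baseline, candidate))  # PROMOTE
```

A real gate would likely weigh more signals, such as review quality and diff footprint, and allow small cost or time regressions in exchange for a higher pass rate.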

Common First Evals

FAQ

What is a private coding-agent eval?

A private coding-agent eval measures AI coding behavior on your own repository instead of a public benchmark. Stet turns real repo work into replayable tasks, runs candidate agent configurations, and reports whether the resulting patches pass tests and meet your review standards.
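
As a rough sketch of what "replayable tasks" and a pass score could look like in code: the `Task` fields, the `pass_rate` helper, and the sample commit IDs and test command below are all hypothetical, chosen for illustration rather than taken from Stet's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One replayable task derived from a merged PR (hypothetical shape)."""
    task_id: str
    base_commit: str   # repo state the agent starts from
    test_command: str  # command whose exit status decides pass/fail


def pass_rate(suite: list[Task], results: dict[str, bool]) -> float:
    """Fraction of tasks in the suite whose tests passed.

    Tasks with no recorded result count as failures; an empty suite
    scores 0.0.
    """
    if not suite:
        return 0.0
    passed = sum(1 for task in suite if results.get(task.task_id, False))
    return passed / len(suite)


# Placeholder values, not real commits.
suite = [
    Task("task-1", "abc123", "pytest -q"),
    Task("task-2", "def456", "pytest -q"),
]
results = {"task-1": True, "task-2": False}
print(pass_rate(suite, results))  # 0.5
```

Scoring the same suite once per candidate configuration is what makes the runs comparable: every configuration starts from the same commits and is judged by the same tests.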

Can Stet compare Claude Code vs Codex on my repo?

Yes. Stet can compare Claude Code, Codex, Cursor, model settings, reasoning levels, and harness changes on the same repo tasks. The result shows where each setup passes tests, matches the intended change, survives review, keeps the diff focused, and controls cost.

Can Stet test AGENTS.md changes?

Yes. Stet can run your current AGENTS.md against a candidate AGENTS.md change on real repo tasks before the change reaches every agent session. That lets you catch instruction regressions in tests, equivalence, review quality, footprint risk, cost, and traces.

How is this different from public benchmarks?

Public benchmarks show how models behave on shared tasks. A private Stet eval shows how agent configurations behave on your codebase, with your tests, your patterns, your review standards, and your rollout constraints.

Want Stet on CI, paid rollout gates, or a team pilot? Book a paid pilot.