STET

Install Stet and run it on your repo

Try the CLI without scheduling a call.

Stet replays real work from your repository, runs your agent stack locally, and shows whether patches meet your team’s standards. Bring your own model-provider credentials; Stet does not collect OpenAI, Anthropic, Claude, Codex, or Cursor keys.

The first run is the signal. Once you have seen a scored result on your repo, we can talk about paid pilots, CI gates, and recurring evals.

Self-serve try flow

Sign up, install the CLI, and run Stet locally with your own Claude, Codex, Cursor, OpenAI, or Anthropic credentials.

Already signed up?

curl -fsSL https://raw.githubusercontent.com/benredmond/stet-cli/main/install.sh | sh
stet auth login
stet auth status

Want Stet on CI, paid rollout gates, or a team pilot? Book a paid pilot.

Common First Evals

Compare Claude Code vs Codex or Cursor on your repo, benchmark a new coding model, test AGENTS.md and SKILL.md changes, or choose a reasoning level. If you want to test an AGENTS.md change first, Stet runs the change against real repo tasks and reports tests, equivalence, review quality, footprint risk, cost, and traces.

What We Measure

This is the public leaderboard. Your private eval uses the same methodology on your repo.

                GPT-5.5      GPT-5.4      Claude Opus 4.7
                codex cli    codex cli    claude code
Gate            44.4%        33.3%        44.4%
Equivalence     66.7%        66.7%        40.7%
Code Review     51.9%        37.0%        22.2%

56 tasks / 2 repos / updated May 01, 2026
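
One way to read the leaderboard is metric by metric rather than as a single ranking. The sketch below is illustrative only: the scores are copied from the leaderboard above, while the `LEADERBOARD` dictionary and the `leaders` helper are hypothetical names, not part of Stet.

```python
# Scores copied from the public leaderboard above (percentages).
# The data layout and helper name are assumptions for illustration.
LEADERBOARD = {
    "GPT-5.5 (codex cli)": {"Gate": 44.4, "Equivalence": 66.7, "Code Review": 51.9},
    "GPT-5.4 (codex cli)": {"Gate": 33.3, "Equivalence": 66.7, "Code Review": 37.0},
    "Claude Opus 4.7 (claude code)": {"Gate": 44.4, "Equivalence": 40.7, "Code Review": 22.2},
}


def leaders(board, metric):
    """Return every configuration tied for the top score on one metric."""
    best = max(scores[metric] for scores in board.values())
    return [name for name, scores in board.items() if scores[metric] == best]


for metric in ("Gate", "Equivalence", "Code Review"):
    print(metric, "->", ", ".join(leaders(LEADERBOARD, metric)))
```

Two configurations tie on Gate and two tie on Equivalence, so Code Review is the only metric here with a single leader. That is the reason to look at all three rows before choosing a setup.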

What Changes on Your Repo

Equivalence becomes intent alignment

On public repos, equivalence measures whether a patch matches the original PR. On your repo, it measures whether the AI solves problems the way your team would — your patterns, your abstractions, your idioms.

Code review reflects your standards

The review rubric scores maintainability, bug risk, and edge-case handling. On your codebase, those scores map directly to what your reviewers would flag in a real PR.

Regressions hit your velocity

When a model update degrades quality on your repo, you see it in the next weekly run — not weeks later when your team notices AI suggestions getting worse.
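
A week-over-week regression check could be sketched roughly as follows. This is not Stet's implementation: the function name, the 5-point tolerance, and the scores are all made up for illustration.

```python
def find_regressions(previous, current, tolerance=5.0):
    """Return metrics whose score dropped by more than `tolerance` points.

    `previous` and `current` map metric names to percentage scores.
    The 5-point default is an arbitrary, hypothetical threshold.
    """
    drops = {}
    for metric, old in previous.items():
        new = current.get(metric)
        if new is not None and old - new > tolerance:
            drops[metric] = round(old - new, 1)
    return drops


# Hypothetical scores from two consecutive weekly runs.
last_week = {"Gate": 50.0, "Equivalence": 60.0, "Code Review": 40.0}
this_week = {"Gate": 40.0, "Equivalence": 58.0, "Code Review": 41.0}

print(find_regressions(last_week, this_week))  # {'Gate': 10.0}
```

A tolerance keeps small run-to-run variation from raising alarms; the right value depends on how many tasks are in the eval.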

FAQ

What is a private coding-agent eval?

A private coding-agent eval measures AI coding behavior on your own repository instead of a public benchmark. Stet turns real repo work into replayable tasks, runs candidate agent configurations, and reports whether the resulting patches pass tests and meet your review standards.

Can Stet compare Claude Code vs Codex on my repo?

Yes. Stet can compare Claude Code, Codex, Cursor, model settings, reasoning levels, and harness changes on the same repo tasks. The result shows where each setup passes tests, matches the intended change, survives review, keeps the diff focused, and controls cost.

Can Stet test AGENTS.md changes?

Yes. Stet can run your current AGENTS.md against a candidate AGENTS.md change on real repo tasks before the change reaches every agent session. That lets you catch instruction regressions in tests, equivalence, review quality, footprint risk, cost, and traces.
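
Because both configurations run on the same tasks, the results can be compared task by task instead of only in aggregate. The sketch below assumes each run reduces to a pass/fail flag per task; the task IDs, the result format, and the `paired_diff` helper are hypothetical and do not reflect Stet's actual output.

```python
def paired_diff(baseline, candidate):
    """Compare per-task pass/fail results for two runs over the same tasks."""
    shared = sorted(baseline.keys() & candidate.keys())
    # Tasks the candidate passes that the baseline failed.
    fixed = [t for t in shared if not baseline[t] and candidate[t]]
    # Tasks the baseline passed that the candidate now fails.
    broken = [t for t in shared if baseline[t] and not candidate[t]]
    return {"fixed": fixed, "broken": broken, "net": len(fixed) - len(broken)}


# Hypothetical results: current AGENTS.md vs. a candidate change.
baseline = {"task-1": True, "task-2": False, "task-3": True, "task-4": True}
candidate = {"task-1": True, "task-2": True, "task-3": False, "task-4": True}

print(paired_diff(baseline, candidate))
# {'fixed': ['task-2'], 'broken': ['task-3'], 'net': 0}
```

In this example the overall pass rate is unchanged, yet one task regressed. A paired view surfaces that kind of instruction regression where a single aggregate score would hide it.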

How is this different from public benchmarks?

Public benchmarks show how models behave on shared tasks. A private Stet eval shows how agent configurations behave on your codebase, with your tests, your patterns, your review standards, and your rollout constraints.