Install Stet and run it on your repo
Try the CLI without scheduling a call.
Stet replays real work from your repository, runs your agent stack locally, and shows whether patches meet your team’s standards. Bring your own model-provider credentials; Stet does not collect OpenAI, Anthropic, Claude, Codex, or Cursor keys.
The first run is the signal. Once it succeeds and you have seen a scored result on your repo, we can talk about paid pilots, CI gates, and recurring evals.
Self-serve try flow
Sign up, install the CLI, and run Stet locally with your own Claude, Codex, Cursor, OpenAI, or Anthropic credentials.
Already signed up?
curl -fsSL https://raw.githubusercontent.com/benredmond/stet-cli/main/install.sh | sh
stet auth login
stet auth status
Want Stet on CI, paid rollout gates, or a team pilot? Book a paid pilot.
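The install command pipes a remote script straight into sh. If you would rather read the script before running it, a minimal sketch of the same flow is below. The URL is the one from the install command; the variable name STET_INSTALL_URL and the helper fetch_stet_installer are hypothetical names used only for illustration and are not part of the Stet CLI.

```shell
# URL copied from the install command on this page.
STET_INSTALL_URL="https://raw.githubusercontent.com/benredmond/stet-cli/main/install.sh"

# Hypothetical helper: save the installer to a file ($1) instead of
# piping it into sh, so it can be inspected first.
fetch_stet_installer() {
  curl -fsSL "$STET_INSTALL_URL" -o "$1"
}

# Typical use (left commented out so sourcing this file has no side effects):
#   installer="$(mktemp)"
#   fetch_stet_installer "$installer"
#   less "$installer"     # read the script before running it
#   sh "$installer"
#   stet auth login
#   stet auth status
```

The end state should match the one-line install; the only difference is a pause to review the script before it runs.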
Common First Evals
Compare Claude Code vs Codex or Cursor on your repo, benchmark a new coding model, test AGENTS.md and SKILL.md changes, or choose a reasoning level. If you want to test an AGENTS.md change first, Stet runs the change against real repo tasks and reports tests, equivalence, review quality, footprint risk, cost, and traces.
What We Measure
The public leaderboard reports these measurements on public repos. Your private eval uses the same methodology on your repo.
What Changes on Your Repo
On public repos, equivalence measures whether a patch matches the original PR. On your repo, it measures whether the AI solves problems the way your team would — your patterns, your abstractions, your idioms.
The review rubric scores maintainability, bug risk, and edge-case handling. On your codebase, those scores map directly to what your reviewers would flag in a real PR.
When a model update degrades quality on your repo, you see it in the next weekly run — not weeks later when your team notices AI suggestions getting worse.
FAQ
What is a private coding-agent eval?
A private coding-agent eval measures AI coding behavior on your own repository instead of a public benchmark. Stet turns real repo work into replayable tasks, runs candidate agent configurations, and reports whether the resulting patches pass tests and meet your review standards.
Can Stet compare Claude Code vs Codex on my repo?
Yes. Stet can compare Claude Code, Codex, Cursor, model settings, reasoning levels, and harness changes on the same repo tasks. The result shows where each setup passes tests, matches the intended change, survives review, keeps the diff focused, and controls cost.
Can Stet test AGENTS.md changes?
Yes. Stet can run your current AGENTS.md against a candidate AGENTS.md change on real repo tasks before the change reaches every agent session. That lets you catch instruction regressions in tests, equivalence, review quality, footprint risk, cost, and traces.
How is this different from public benchmarks?
Public benchmarks show how models behave on shared tasks. A private Stet eval shows how agent configurations behave on your codebase, with your tests, your patterns, your review standards, and your rollout constraints.