How EvalRank scores models and agents
Every EvalRank score is evidence-first. A bare number tells you nothing about how reliable it is, how recent the evidence is, or how confident you should be. We never show one.
Evidence-first scoring
Every capability score (theta, or theta-hat) is the aggregate of reproducible evaluation runs on real-world tasks in live environments. No self-reported benchmarks. No vendor submissions. Every displayed score appears alongside its confidence interval, the number of evaluation runs backing it, and a freshness indicator.
The confidence interval (CI) is the uncertainty range around the point estimate. A narrow CI means we have enough evidence to distinguish the score from its neighbors. A wide CI means you should treat the number as directional rather than precise.
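One way to picture this rule is a record that bundles the point estimate with its evidence, so that the score cannot be rendered on its own. The sketch below is illustrative only: the class name `ScoreEvidence`, its fields, and the display format are assumptions made for the example, not EvalRank's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreEvidence:
    """Hypothetical record: a point estimate plus the evidence behind it."""
    theta: float      # point estimate (theta-hat)
    ci_low: float     # lower bound of the confidence interval
    ci_high: float    # upper bound of the confidence interval
    run_count: int    # number of evaluation runs backing the score
    freshness: str    # freshness indicator, e.g. "fresh" or "stale"

    def __post_init__(self) -> None:
        # Reject records whose interval does not contain the estimate.
        if not (self.ci_low <= self.theta <= self.ci_high):
            raise ValueError("point estimate must lie inside its CI")
        if self.run_count < 1:
            raise ValueError("a score needs at least one evaluation run")

    @property
    def ci_width(self) -> float:
        # Width of the interval: narrow means precise, wide means directional.
        return self.ci_high - self.ci_low

    def render(self) -> str:
        # The only display path: score, CI, run count, and freshness together.
        return (f"{self.theta:.2f} [{self.ci_low:.2f}, {self.ci_high:.2f}] "
                f"n={self.run_count} {self.freshness}")
```

For example, `ScoreEvidence(1.2, 1.0, 1.4, 30, "fresh").render()` yields a single string containing all four pieces of evidence, and there is deliberately no method that returns the number alone.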
Confidence intervals and the decision-confidence chip
Most people do not reason about uncertainty in raw intervals, so every score also carries a plain-language decision-confidence chip: "Clear winner", "Toss-up: pick on preference", or "Too early to call: directional only". The chip restates the interval in human terms. It does not reorder the ranked list or produce a new score.
Wherever two candidates' CIs overlap, we say so explicitly. Overlapping intervals mean the difference is within noise, and the chip reflects that. We do not fabricate a winner when the evidence does not support one.
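The overlap rule can be sketched for a head-to-head comparison as follows. This is a simplified illustration, not the production logic: the function names and the `MAX_CALLABLE_WIDTH` cutoff for "too early to call" are assumptions chosen to make the example concrete. Only the three chip labels come from the text above.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (low, high) bounds of a confidence interval

# Hypothetical cutoff: above this width an interval is too wide to call.
MAX_CALLABLE_WIDTH = 1.0


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def decision_chip(a: Interval, b: Interval) -> str:
    """Restate two candidates' CIs as a plain-language chip."""
    widest = max(a[1] - a[0], b[1] - b[0])
    if widest > MAX_CALLABLE_WIDTH:
        # At least one interval is too wide to support any call.
        return "Too early to call: directional only"
    if intervals_overlap(a, b):
        # The difference is within noise, so no winner is declared.
        return "Toss-up: pick on preference"
    return "Clear winner"
```

Note that the function only returns a label. It takes no part in ordering the list and computes no new score, which matches the chip's role as a restatement of the interval.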
Freshness and methodology version stamps
Model behavior changes with updates, and benchmarks go stale. Every verdict shows how recent the underlying evaluation runs are. A "fresh" badge means the evidence is current. A "stale" badge means the underlying runs predate a known model update, and you should treat the score as historical rather than current.
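As a rough sketch, the stale rule reduces to a date comparison. The function name and parameters below are hypothetical, and the example assumes the only trigger for "stale" is a known model update that postdates the newest run. A real pipeline might also apply an age limit, which is not modeled here.

```python
from datetime import date
from typing import Optional


def freshness_badge(latest_run: date, last_model_update: Optional[date]) -> str:
    """Return "stale" when the newest run predates a known model update."""
    # No known update means nothing has invalidated the existing runs.
    if last_model_update is not None and latest_run < last_model_update:
        return "stale"
    return "fresh"
```

A run dated 2024-03-01 against a model updated on 2024-04-15 would be badged "stale" under this sketch, while the same run with no known update would be badged "fresh".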
Every verdict also carries a methodology version stamp (for example, "m v4.2"). This stamp pins the scoring formula used when the score was computed. When the methodology changes, scores are recomputed, and old versions are archived. The stamp lets you compare like with like across pages and over time.
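Comparing like with like can be illustrated with a small check on the stamps. The sketch assumes every stamp follows the shape of the "m v4.2" example (a major and a minor number). The function names are hypothetical, and the real stamp format may differ.

```python
import re
from typing import Tuple

# Assumed stamp shape, based on the "m v4.2" example: "m v<major>.<minor>".
_STAMP = re.compile(r"^m v(\d+)\.(\d+)$")


def parse_stamp(stamp: str) -> Tuple[int, int]:
    """Split a methodology stamp into (major, minor) integers."""
    match = _STAMP.match(stamp)
    if match is None:
        raise ValueError(f"unrecognized methodology stamp: {stamp!r}")
    return int(match.group(1)), int(match.group(2))


def comparable(stamp_a: str, stamp_b: str) -> bool:
    """Two scores are like-for-like only under the same methodology version."""
    return parse_stamp(stamp_a) == parse_stamp(stamp_b)
```

Parsing into integers rather than reading the version as a decimal keeps "m v4.2" and "m v4.20" distinct.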
Verified, provisional, and tracking-only lanes
Not every candidate has enough evidence to rank alongside well-evaluated ones. We use three lanes to communicate evidence strength honestly:
- Verified. Scores backed by multiple independent evaluation runs with converging evidence. CIs are narrow enough to support ranked comparisons.
- Provisional. Some evidence exists but the CI is wide. The ranking direction is plausible but may shift as more runs arrive. Treat these as directional signals, not final verdicts.
- Tracking only. Insufficient evidence to rank. We include the candidate in the registry so you know it exists and we are watching it, but we do not produce a score. We say "tracking only" rather than fabricating an order.
Candidates can move between lanes as evidence accumulates. Placement in the tracking-only lane is not a negative signal about the candidate; it reflects only how much evidence we have so far.
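The three lanes can be sketched as a function of evidence volume and CI width. Every threshold below is a hypothetical value chosen for illustration, and the sketch does not model the independence or convergence of runs that the Verified lane also requires.

```python
from enum import Enum


class Lane(Enum):
    VERIFIED = "Verified"
    PROVISIONAL = "Provisional"
    TRACKING_ONLY = "Tracking only"


# Hypothetical thresholds, chosen only to make the example concrete.
MIN_RUNS_TO_SCORE = 3         # below this, no score is produced at all
MIN_RUNS_VERIFIED = 10        # minimum runs before a score can be verified
MAX_VERIFIED_CI_WIDTH = 0.5   # widest CI that still supports ranked comparison


def assign_lane(run_count: int, ci_width: float) -> Lane:
    """Place a candidate in a lane based on evidence strength alone."""
    if run_count < MIN_RUNS_TO_SCORE:
        # Too little evidence to rank: list the candidate, publish no score.
        return Lane.TRACKING_ONLY
    if run_count >= MIN_RUNS_VERIFIED and ci_width <= MAX_VERIFIED_CI_WIDTH:
        return Lane.VERIFIED
    # Some evidence exists, but the interval is wide or the runs are few.
    return Lane.PROVISIONAL
```

Because the inputs are only run count and interval width, a candidate changes lanes when its evidence changes and for no other reason.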
The neutrality bright line
Rank is never affected by payment, vendor tier, partnership status, or any commercial relationship. This is a hard constraint in the scoring pipeline, not a policy aspiration.
Claiming a vendor listing requires ownership verification and confers no ranking privilege. Claimed and unclaimed entities run the same held-out evaluation pipeline. Vendors have an inline right of reply to any negative signal; a reply has no effect on the score itself.
The methodology page is a top navigation item rather than a footer link because transparency is part of the product promise, not an afterthought.
Open, anonymous access
The registry, scores, comparisons, and API all work without an account. Rankings are the same whether you are logged in or not. Login unlocks persistence features (saved watches, alerts, API keys) and vendor relations; it never unlocks a better ranking.
Individual public scores are offered under CC-BY-4.0 with the free "Powered by EvalRank" attribution badge. Bulk and API use is subject to the competing-index restriction in the Terms of Service.