An open scoring rubric. An invitation to replicate.

The rubric is the same one used to score the certification track. It's public, replicable, and frozen across each report's reference quarter.

Dimensions

Three dimensions.

Every challenge, and every report, is scored along these three dimensions. No composite. No black box.

Dimension

Time-to-correct-fix

Time from session start to a fix that passes the challenge's hidden acceptance tests. Wall-clock, with idle gating to discount distraction.
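
As a rough illustration of idle gating, the sketch below sums the gaps between consecutive activity events and drops any gap longer than a threshold. The exact gating rule is not specified here, so the function name `gated_seconds`, the 120-second threshold, and the "drop the whole gap" policy are all assumptions made for the example.

```python
from typing import Sequence


def gated_seconds(event_times: Sequence[float], idle_threshold: float = 120.0) -> float:
    """Return active time in seconds between the first and last event.

    event_times: timestamps (seconds since session start) of activity events.
    idle_threshold: hypothetical cutoff; any gap longer than this is treated
    as idle and excluded entirely (an assumed policy, not the rubric's rule).
    """
    ordered = sorted(event_times)
    total = 0.0
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur - prev
        if gap <= idle_threshold:
            total += gap
    return total


# Events at 0s, 60s, 100s, then a 10-minute pause, then 700s and 730s.
active = gated_seconds([0, 60, 100, 700, 730])
```

Here the 600-second pause is discarded, so only the 60, 40, and 30 second gaps count toward the total.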

Dimension

30-day regression rate

Fraction of submissions whose fix is reverted, patched, or rolled back inside a 30-day post-merge window. Measured against the canonical regression test set.
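
One way to read this definition is as a simple ratio: submissions with a regression event inside the window, divided by all merged submissions. The sketch below assumes a hypothetical `Submission` record with a merge timestamp and an optional timestamp for the first revert, patch, or rollback; treating the window boundary as inclusive is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass
class Submission:
    merged_at: datetime
    # First revert, patch, or rollback of the fix, if any (hypothetical field).
    regressed_at: Optional[datetime] = None


def regression_rate(submissions: Iterable[Submission], window_days: int = 30) -> float:
    """Fraction of submissions that regressed within window_days of merge."""
    subs = list(submissions)
    if not subs:
        raise ValueError("at least one submission is required")
    window = timedelta(days=window_days)
    regressed = sum(
        1
        for s in subs
        if s.regressed_at is not None
        and s.merged_at <= s.regressed_at <= s.merged_at + window
    )
    return regressed / len(subs)


merge = datetime(2024, 1, 1)
sample = [
    Submission(merge, merge + timedelta(days=10)),  # inside the window
    Submission(merge, merge + timedelta(days=45)),  # outside the window
    Submission(merge),
    Submission(merge),
]
rate = regression_rate(sample)
```

Only the first submission regresses inside the window, so the rate for this sample is one in four.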

Dimension

Defensibility score

Blind panel score (0–5) on how well the candidate's replay walks a reviewer through the decision points. Inter-rater agreement is reported per round.
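
The agreement statistic is not named here. As one common option for two raters, the sketch below computes Cohen's kappa over a pair of 0–5 score lists; the function name and the sample scores are made up for illustration, and a panel with more than two raters would need a different statistic.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single score")
    return (observed - expected) / (1.0 - expected)


# Illustrative scores only: two reviewers, five replays.
kappa = cohens_kappa([3, 4, 4, 5, 2], [3, 4, 5, 5, 2])
```

In this sample the raters agree on four of five replays (observed agreement 0.8) against a chance agreement of 0.24, which gives a kappa of roughly 0.74.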

Task categories

Twelve categories.

Each task is tagged by category. Reports break out findings by category and experience band.

  • Debugging — single-service
  • Debugging — distributed
  • Security — auth / authz
  • Security — injection / SSRF
  • Refactoring — within-module
  • Refactoring — cross-module
  • Feature build — bounded
  • Feature build — cross-cutting
  • System design — write-path
  • System design — read-path
  • Code review — accept / reject
  • Incident response — diagnosis
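
A breakdown by category and experience band can be sketched as a group-by over scored tasks. The record layout, the category slugs, the band labels, and the choice of median as the summary statistic below are all hypothetical; the reports may aggregate differently.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple

# (category slug, experience band, time-to-correct-fix in seconds)
Record = Tuple[str, str, float]


def breakdown(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Median time-to-correct-fix per (category, experience band) pair."""
    groups = defaultdict(list)
    for category, band, seconds in records:
        groups[(category, band)].append(seconds)
    return {key: median(values) for key, values in groups.items()}


# Made-up sample rows for illustration.
rows = [
    ("debugging-single-service", "band-a", 100),
    ("debugging-single-service", "band-a", 300),
    ("security-auth-authz", "band-b", 200),
]
summary = breakdown(rows)
```

Each key in `summary` is one cell of the category-by-band table, so the same function could be reused for the other two dimensions by swapping the value column.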