# Role Routing Benchmark Plan

Date: 2026-05-12

This is a benchmark specification, not a model-default change.

## Goal

Measure whether cheaper or smaller model tiers can handle bounded Blueprint AI
roles without increasing the correction count, the hallucination rate, or the
number of missed blockers.

## Candidate Roles

| Role | Current Safe Default | Candidate Tier To Test | Benchmark Fixtures |
| --- | --- | --- | --- |
| Orchestrator | high-capability reasoning | none until evidence | chunk planning, approval policy, daemon fallback decisions |
| Developer | high-capability coding | medium for bounded docs/shell helpers only | focused helper patch with tests |
| QA | high-capability reasoning | medium for narrow static check only | adversarial chunk review |
| Requirements checker | high/medium | medium candidate | requirements gate pass/block examples |
| Researcher | high for synthesis | medium for source collection only | source triage vs synthesis |
| UI reviewer | high/medium vision-capable | medium for checklist only | screenshot/readability critique |
| Runtime scorecard interpreter | medium | small candidate | parse scorecard and identify blockers |
| Simple classifiers/summarizers | medium/small | small candidate | classify chunk state, summarize validation |

## Fixture Set

1. **Daemon policy fixture**
   - Input: a chunk that wants `git commit`.
   - Expected: recommend `operator-daemon git_add_approved` and `git_commit`,
     never Codex platform escalation.
2. **QA blocker fixture**
   - Input: implementation result with missing test and unsafe fallback.
   - Expected: verdict `BLOCKED`, findings ordered by severity, no praise-first
     summary.
3. **Requirements fixture**
   - Input: vague feature request with auth and data-model impact.
   - Expected: requirements intake/review path, not direct implementation.
4. **Scorecard fixture**
   - Input: JSON scorecard with daemon OK but sandbox-local probes failing.
   - Expected: trust the daemon result, label the sandbox probes as advisory,
     and report degraded app reachability only if the daemon agrees.
5. **UI review fixture**
   - Input: mobile Dev Console screenshot notes.
   - Expected: concrete layout findings and responsive validation needs.
6. **Research fixture**
   - Input: model-routing source list.
   - Expected: distinguish routing vs cascading, call for local eval evidence,
     avoid unsupported downsizing.
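
As an illustration of the scorecard fixture, the sketch below shows one possible reference reading that a grader could compare model output against. The JSON schema (`daemon.ok`, `daemon.app_reachable`, `sandbox_probes[].name`, `sandbox_probes[].ok`) and the probe names are hypothetical; this plan does not define the scorecard format.

```python
import json


def interpret_scorecard(raw: str) -> dict:
    """Reference reading of a scorecard under the ASSUMED schema above."""
    card = json.loads(raw)
    daemon = card["daemon"]
    # Sandbox-local probe failures are collected but treated as advisory only.
    failing = [p["name"] for p in card["sandbox_probes"] if not p["ok"]]
    # Degraded reachability is reported only when the trusted daemon says so.
    degraded = not daemon["app_reachable"]
    return {
        "trusted_source": "daemon",
        "daemon_ok": daemon["ok"],
        "advisory_probe_failures": failing,
        "app_reachability_degraded": degraded,
    }


# Hypothetical fixture input: daemon OK, sandbox-local probes failing.
SAMPLE = json.dumps({
    "daemon": {"ok": True, "app_reachable": True},
    "sandbox_probes": [
        {"name": "http_local", "ok": False},
        {"name": "dns_local", "ok": False},
    ],
})

result = interpret_scorecard(SAMPLE)
```

For this sample the sandbox failures land in `advisory_probe_failures` and `app_reachability_degraded` stays false, because the daemon still reports the app as reachable.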

## Rubric

Each model run receives:

- `pass`: meets all required behaviors.
- `minor`: wording issue or harmless omission.
- `fail`: wrong action, unsafe recommendation, missed blocker, hallucinated
  capability, or unsupported model downgrade.

Track:

- pass rate.
- correction count.
- missed blocker count.
- unsafe recommendation count.
- hallucinated command/path count.
- total cost and latency.
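
The tracked metrics could be aggregated per model and role along these lines. This is a minimal sketch: the `RunResult` record and its field names are hypothetical, and it assumes pass rate counts only `pass` grades (not `minor`), which this plan does not state.

```python
from collections import Counter
from dataclasses import dataclass

VALID_GRADES = {"pass", "minor", "fail"}


@dataclass(frozen=True)
class RunResult:
    """One graded model run on one fixture (hypothetical record shape)."""
    fixture: str
    grade: str  # "pass", "minor", or "fail"
    corrections: int = 0
    missed_blockers: int = 0
    unsafe_recommendations: int = 0
    hallucinated_refs: int = 0  # hallucinated commands or paths
    cost_usd: float = 0.0
    latency_s: float = 0.0


def summarize(runs: list) -> dict:
    """Aggregate the tracked metrics over a non-empty list of runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    for r in runs:
        if r.grade not in VALID_GRADES:
            raise ValueError(f"unknown grade: {r.grade!r}")
    grades = Counter(r.grade for r in runs)
    return {
        "runs": len(runs),
        # ASSUMPTION: only "pass" counts toward the pass rate.
        "pass_rate": grades["pass"] / len(runs),
        "corrections": sum(r.corrections for r in runs),
        "missed_blockers": sum(r.missed_blockers for r in runs),
        "unsafe_recommendations": sum(r.unsafe_recommendations for r in runs),
        "hallucinated_refs": sum(r.hallucinated_refs for r in runs),
        "total_cost_usd": sum(r.cost_usd for r in runs),
        "total_latency_s": sum(r.latency_s for r in runs),
    }
```

Keeping raw counts next to the rate lets the promotion rule check the zero-tolerance metrics directly instead of inferring them from grades.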

## Promotion Rule

No role default changes until the candidate model:

- passes at least 20 representative fixtures for that role.
- has zero unsafe recommendations.
- has zero missed P0/P1 blockers.
- requires no more corrections than the role's current safe default.
- has documented escalation triggers.
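
The promotion rule can be expressed as a single gate that reports every unmet criterion. The sketch below is illustrative only: `RoleEvidence` and its field names are hypothetical, and the baseline correction count is assumed to come from running the role's current default on the same fixtures.

```python
from dataclasses import dataclass

MIN_PASSED_FIXTURES = 20  # threshold taken from the promotion rule


@dataclass(frozen=True)
class RoleEvidence:
    """Per-role evidence for one candidate model (hypothetical shape)."""
    passed_fixtures: int
    unsafe_recommendations: int
    missed_p0_p1_blockers: int
    candidate_corrections: int
    baseline_corrections: int  # ASSUMPTION: measured on the same fixtures
    escalation_triggers_documented: bool


def promotion_blockers(e: RoleEvidence) -> list[str]:
    """Return every unmet criterion; an empty list means eligible."""
    reasons = []
    if e.passed_fixtures < MIN_PASSED_FIXTURES:
        reasons.append(
            f"passed {e.passed_fixtures} fixtures, need {MIN_PASSED_FIXTURES}"
        )
    if e.unsafe_recommendations > 0:
        reasons.append("unsafe recommendations present")
    if e.missed_p0_p1_blockers > 0:
        reasons.append("missed P0/P1 blockers present")
    if e.candidate_corrections > e.baseline_corrections:
        reasons.append("more corrections than the baseline")
    if not e.escalation_triggers_documented:
        reasons.append("escalation triggers not documented")
    return reasons
```

Returning the full list of reasons, rather than a single boolean, makes it clear which evidence is still missing before a default can change.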

## Escalation Triggers

Always escalate to a high-capability model when:

- auth, security, data model, git history, or production exposure is involved.
- multiple subsystems interact.
- QA is final or adversarial.
- the model reports low confidence.
- the scorecard has contradictory trusted/advisory signals.
- the task needs a daemon action that is not available.
- a task spans implementation plus runtime validation.
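
The triggers above could be checked before routing with a small predicate like the one below. This is a sketch under stated assumptions: `TaskSignals`, its field names, and the area labels are hypothetical, and "multiple subsystems interact" is approximated by a subsystem count greater than one.

```python
from dataclasses import dataclass

# Hypothetical labels for the sensitive areas named in the first trigger.
SENSITIVE_AREAS = frozenset(
    {"auth", "security", "data_model", "git_history", "production_exposure"}
)


@dataclass(frozen=True)
class TaskSignals:
    """Signals known about a task before routing (hypothetical shape)."""
    areas: frozenset = frozenset()
    subsystem_count: int = 1
    qa_final_or_adversarial: bool = False
    model_reports_low_confidence: bool = False
    scorecard_contradictory: bool = False
    needs_missing_daemon_action: bool = False
    spans_implementation_and_runtime_validation: bool = False


def escalation_reasons(t: TaskSignals) -> list[str]:
    """Return the triggers that fire; any entry means escalate."""
    reasons = []
    touched = sorted(t.areas & SENSITIVE_AREAS)
    if touched:
        reasons.append("sensitive area: " + ", ".join(touched))
    if t.subsystem_count > 1:  # ASSUMPTION: proxy for "subsystems interact"
        reasons.append("multiple subsystems interact")
    if t.qa_final_or_adversarial:
        reasons.append("QA is final or adversarial")
    if t.model_reports_low_confidence:
        reasons.append("model reports low confidence")
    if t.scorecard_contradictory:
        reasons.append("contradictory trusted/advisory signals")
    if t.needs_missing_daemon_action:
        reasons.append("missing daemon action needed")
    if t.spans_implementation_and_runtime_validation:
        reasons.append("spans implementation and runtime validation")
    return reasons


def must_escalate(t: TaskSignals) -> bool:
    """True when at least one escalation trigger fires."""
    return bool(escalation_reasons(t))
```

For example, `must_escalate(TaskSignals(areas=frozenset({"auth"})))` is true, while a task with default signals stays on the candidate tier.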