# Blueprint AI Runtime Hardening Review

Date: 2026-05-12

Primary source:
`ai/reports/blueprint-boilerplate-gap-analysis.md`

## Executive Summary

Blueprint's strongest differentiator is the local/dev AI engineering runtime:
Codex orchestration, deterministic chunk lifecycle, Telegram/operator Q&A,
trusted daemon execution, tmux-managed services, Dev Console visibility, and
closed-loop validation. The runtime is already more advanced than a normal app
boilerplate, but it still has weak points that can force manual correction:
missing end-to-end tests, duplicated approval concepts, sparse runtime
scorecards, partial prompt/resume validation, and UI areas that still feel more
like a demo than an operational control surface.

The next AI-runtime work should focus on reliability and observability:

1. Make `node ai/runtime/dist/cli.js doctor --json` the standard first diagnostic.
2. Add machine-readable runtime scorecards.
3. Finish closed-loop E2E tests for Telegram/Q&A/daemon/Codex I/O.
4. Add a missing-action registry so Codex stops cleanly instead of improvising.
5. Add a lightweight runtime status model for the app UI.
6. Trim non-operational UI content and make Dev Console the primary surface.

## Current Runtime Strengths

- Canonical tmux sessions are documented in
  `ai/standards/local-dev-runtime.md`.
- Local-dev startup exists through `ai/tools/local-dev/start-stack.sh`.
- The trusted operator daemon owns registered privileged actions and avoids
  Codex platform escalation for known actions.
- Operator questions define a single answer model for local and Telegram
  answers.
- Telegram is now positioned as a transport/answer surface, not a shell.
- Codex I/O bridge exists and can inject accepted answers into the canonical
  Codex tmux pane.
- Dev-server helpers own deterministic frontend/backend sessions.
- Screenshot guidance uses `npx playwright` and `/tmp` artifacts.
- Dev Console provides real tmux output and input into the configured target.
- Chunk workflow, QA gates, prompt synthesis, and workflow-state helpers are
  documented and testable.

## Main Weak Points

| Area | Finding | Risk | Priority | Recommendation |
| --- | --- | --- | --- | --- |
| Runtime diagnosis | No single Runtime CLI doctor entry point existed | Operators and Codex may run partial checks or trust sandbox-local probes | P0 | Use `node ai/runtime/dist/cli.js doctor --json` as the first status command |
| Closed-loop validation | Coverage exists, but not every loop is exercised end-to-end | Regressions appear only during live remote operation | P0 | Add a runtime E2E suite that drives Q&A, daemon, Codex I/O, and Telegram fixtures |
| Missing actions | Standards say to stop, but no first-class registry/report exists | Codex can still improvise or ask for manual work | P0 | Add a missing-action registry and summary output |
| Runtime scorecard | Human-readable status exists; machine-readable scorecard does not | Codex has to parse prose and may misread state | P0 | Add JSON/ENV scorecard output for stack, chunks, daemon, bridges, and servers |
| Duplicate approvals | Improved, but still partly policy-based | Stale Telegram approvals can be mistaken for fresh ones again | P0 | Encode "fresh human approval only" in operator question metadata/tests |
| Codex I/O bridge | Bridge exists, but prompt detection/injection is still the most fragile path | Remote autonomy can break when prompt formats change | P1 | Add fixture prompts and live smoke cases for common Codex prompts |
| UI runtime visibility | Backend/frontend/daemon/Telegram status is not surfaced in one app-level model | Operator has to infer state from terminal output | P1 | Add lightweight runtime status endpoint/query and frontend status strip |
| UI focus | Dev Console is strong; other pages still contain placeholder/admin-template feel | Blueprint feels less intentional than the runtime deserves | P1 | Remove non-operational filler and make control surfaces purposeful |
| Restart/recovery | Startup exists, but recovery playbooks are spread across docs | When bridge/daemon state goes stale, the operator may guess at recovery steps | P1 | Add a recovery section to doctor output and local-dev docs |
| Screenshot/browser loop | Works, but depends on managed server correctness | Future UI chunks may regress into stale "browser unavailable" claims | P1 | Add doctor check for Playwright + managed URL reachability |
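
The "fresh human approval only" recommendation could be expressed as a pure check that tests can exercise directly. The following is a minimal sketch, not the existing operator-question schema: every type and field name (`PendingQuestion`, `ApprovalAnswer`, `askedAtMs`, `maxAgeMs`, `consumed`) and the verdict labels are assumptions chosen for illustration.

```typescript
// Hypothetical shapes: every field name here is illustrative, not Blueprint's real schema.
interface PendingQuestion {
  questionId: string;
  askedAtMs: number; // epoch milliseconds when the question was issued
  maxAgeMs: number;  // how long an answer may wait before it counts as stale
}

interface ApprovalAnswer {
  questionId: string;   // the question this answer was given for
  answeredAtMs: number; // epoch milliseconds when the human answered
  consumed: boolean;    // true once the answer has already authorized an action
}

type ApprovalVerdict =
  | "fresh"
  | "wrong-question"
  | "already-used"
  | "predates-question"
  | "expired";

// Returns "fresh" only when the answer belongs to this question, has not been
// used before, was given after the question was asked, and is still recent.
function checkApproval(
  question: PendingQuestion,
  answer: ApprovalAnswer,
  nowMs: number,
): ApprovalVerdict {
  if (answer.questionId !== question.questionId) return "wrong-question";
  if (answer.consumed) return "already-used";
  if (answer.answeredAtMs < question.askedAtMs) return "predates-question";
  if (nowMs - answer.answeredAtMs > question.maxAgeMs) return "expired";
  return "fresh";
}
```

Returning a named verdict rather than a boolean would let a Telegram reply explain why an old approval was rejected, which is the confusion the finding describes.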

## Closed-Loop E2E Audit

| Workflow | Classification | Evidence | Gap |
| --- | --- | --- | --- |
| Chunk lifecycle | Partially validated | `workflow-scenarios-test.sh`, `workflow-state.sh`, completed chunks | Needs one full active -> QA -> complete -> daemon commit fixture |
| Orchestrator flow | Partially validated | Role/standards and scenario tests | Needs scorecard-driven continuation tests |
| Operator questions | Mostly closed-loop | `operator-questions/test/operator-questions-test.sh` | Needs more live stale/duplicate approval cases |
| Telegram Q&A | Partially validated | Telegram bridge tests and live manual tests | Needs reliable send-status/resend assertions and compact UX regression tests |
| Trusted daemon actions | Mostly closed-loop | Operator-daemon tests and fixture git flows | Needs daemon long-running loop health/recovery tests |
| Codex I/O bridge | Partially validated | Bridge fixture tests and live manual testing | Needs prompt-pattern regression suite |
| Dev-server lifecycle | Partially validated | Managed helper tests/status and live startup | Needs daemon action E2E for restart/status/screenshot in one suite |
| Screenshot validation | Partially validated | Known-good `npx playwright screenshot` flow | Needs doctor/check command to prevent stale browser diagnoses |
| Backend API/GraphQL | Partially validated | Backend tests, generated schema, health query | Needs app-level smoke tying frontend auth route to backend health |
| Auth/session persistence | Partially validated | Frontend/backend tests and manual mobile observations | Needs browser reload smoke in managed runtime |
| Dev Console tmux I/O | Partially validated | Backend service tests and live manual testing | Needs fixture plus live codex-target smoke guidance |
| Mobile/PWA flows | Partially validated | Manifest/icons and mobile manual checks | Needs Lighthouse/installability check or documented manual pass |
| Runtime connection state | Missing | Health service only exposes backend health string | Needs lightweight runtime status model |
| UI operational focus | Missing as E2E | Manual review only | Needs UI-review checklist and screenshots against operational goals |
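
The chunk-lifecycle gap (one full active -> QA -> complete -> daemon commit fixture) could start from an explicit transition table. This is a sketch under assumptions: the state names are derived from the wording in the table above, not from `workflow-state.sh`, and rework paths such as a failed QA gate returning to active are deliberately not modeled.

```typescript
// Hypothetical state names taken from the "active -> QA -> complete -> daemon commit" wording.
type ChunkState = "active" | "qa" | "complete" | "committed";

// Each state has at most one legal successor in this simplified model.
const NEXT_STATE: Record<ChunkState, ChunkState | null> = {
  active: "qa",
  qa: "complete",
  complete: "committed",
  committed: null,
};

// Move a chunk forward one step, rejecting any skipped or backward transition.
function advanceChunk(current: ChunkState, requested: ChunkState): ChunkState {
  if (NEXT_STATE[current] !== requested) {
    throw new Error(`illegal chunk transition: ${current} -> ${requested}`);
  }
  return requested;
}

// Replay a list of requested states starting from "active" and return the final state.
function runLifecycle(steps: ChunkState[]): ChunkState {
  let state: ChunkState = "active";
  for (const step of steps) {
    state = advanceChunk(state, step);
  }
  return state;
}
```

A fixture test would then assert that `runLifecycle(["qa", "complete", "committed"])` succeeds and that any sequence skipping QA throws.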

## UI Cleanup Findings

The Dev Console has become the strongest screen because it was iterated against
real operator workflows. Other UI areas should be simplified to support the
same operational intent.

Recommended cleanup:

- Keep Dev Console as the default admin landing surface.
- Make top navigation compact and task-oriented.
- Remove placeholder/demo copy that does not help runtime operation.
- Prefer status strips and action bars over large dashboard cards.
- Keep admin user management available but secondary.
- Add a small runtime status summary instead of multiple disconnected status
  labels.
- Avoid adding a generic admin-template dashboard until there are real product
  metrics.
- Treat mobile as an operational console: tight header, usable terminal,
  minimal decoration.

Do not redesign the whole UI yet. The next visible UI chunk should be a
targeted operational cleanup with screenshots.

## Runtime Connection Visibility

Current state:

- Frontend has `HealthService`, backed by GraphQL `health`.
- Dev Console can show tmux target availability through its own API.
- Local-dev and daemon status are available through shell helpers, not a unified
  app-level model.

Recommended minimal architecture:

- Backend exposes a local/dev/admin-only `runtimeStatus` GraphQL query.
- The query returns coarse states, not secrets:
  - backend: `connected`
  - graphql: `connected`
  - tmuxTarget: `available|unavailable|unknown`
  - daemon: `available|unavailable|unknown`
  - telegramBridge: `available|unavailable|unknown`
  - codexIoBridge: `available|unavailable|unknown`
  - frontendManagedServer: `available|unknown`
  - backendManagedServer: `available|unknown`
- Frontend exposes one compact status strip in the Dev Console header/footer.
- Status refresh can use polling first; no websocket is required yet.
- Websocket/subscription infrastructure should wait until there is a specific
  product need for live bidirectional state beyond the existing terminal
  polling.

Decision: do not add websocket infrastructure now. Polling plus daemon-backed
doctor/scorecard is enough for the next step.
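
The coarse states listed above map naturally onto a small typed model. The sketch below takes its field names and state values from that list. The `StripLevel` type, the `summarizeForStrip` helper, and the rule that `unavailable` outranks `unknown` are assumptions about how a status strip might collapse the fields into one indicator.

```typescript
// Field names and state values follow the list above; the StripLevel summary
// and its precedence rules are assumptions made for this sketch.
type Availability = "available" | "unavailable" | "unknown";

interface RuntimeStatusSnapshot {
  backend: "connected";
  graphql: "connected";
  tmuxTarget: Availability;
  daemon: Availability;
  telegramBridge: Availability;
  codexIoBridge: Availability;
  frontendManagedServer: "available" | "unknown";
  backendManagedServer: "available" | "unknown";
}

type StripLevel = "ok" | "degraded" | "unknown";

// Collapse the per-component states into one level for a compact status strip.
function summarizeForStrip(s: RuntimeStatusSnapshot): StripLevel {
  const states: string[] = [
    s.tmuxTarget,
    s.daemon,
    s.telegramBridge,
    s.codexIoBridge,
    s.frontendManagedServer,
    s.backendManagedServer,
  ];
  if (states.some((v) => v === "unavailable")) return "degraded";
  if (states.some((v) => v === "unknown")) return "unknown";
  return "ok";
}
```

Because the snapshot is a plain value, the frontend could refresh it with a polling query and keep the summary logic free of transport concerns.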

## Runtime CLI Doctor Baseline

`node ai/runtime/dist/cli.js doctor --json` is now the recommended first command
when runtime state is unclear. It:

- Prints repo and git state.
- Checks the trusted operator daemon status.
- Requests read-only daemon actions:
  - `local_dev_status`
  - `dev_server_status --target all`
  - `telegram_bridge_status`
- Labels direct local/sandbox probes as advisory.
- Checks local frontend/backend HTTP reachability when possible.
- Checks local Playwright availability through `yarn exec playwright --version`
  with an `npx --no-install` fallback.

This command does not replace daemon actions. It is a diagnostic entry point
that prefers trusted-runtime answers and makes sandbox-local uncertainty
explicit.
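
The "prefer trusted-runtime answers, label sandbox probes as advisory" rule can be illustrated with a small merge step. This sketch does not describe the actual doctor implementation or its JSON schema: `ProbeResult`, `DoctorFinding`, `mergeProbes`, and the example check names used with them are hypothetical.

```typescript
// Hypothetical types: the real doctor output format is not specified in this review.
type ProbeSource = "daemon" | "local"; // daemon = trusted runtime, local = sandbox probe

interface ProbeResult {
  check: string; // name of the thing being checked
  source: ProbeSource;
  ok: boolean;
}

interface DoctorFinding {
  check: string;
  ok: boolean;
  trust: "trusted" | "advisory";
}

// Label a single probe result according to where the answer came from.
function toFinding(r: ProbeResult): DoctorFinding {
  return {
    check: r.check,
    ok: r.ok,
    trust: r.source === "daemon" ? "trusted" : "advisory",
  };
}

// Keep every daemon answer; keep a local probe only when the daemon gave no
// answer for the same check. Trusted findings are listed first.
function mergeProbes(results: ProbeResult[]): DoctorFinding[] {
  const trusted = results.filter((r) => r.source === "daemon");
  const advisoryOnly = results.filter(
    (r) => r.source === "local" && !trusted.some((t) => t.check === r.check),
  );
  return trusted.concat(advisoryOnly).map(toFinding);
}
```

With this shape, a failed sandbox probe cannot override a daemon answer for the same check, and anything the daemon did not cover stays visibly marked as advisory.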

## P0 Follow-Up Chunks

1. **Runtime scorecard JSON hardening**
   - Continue replacing prose parsing with structured helper output. The first
     pass added a canonical Playwright probe and `--kv` outputs for daemon,
     dev-server, Telegram, and Codex I/O bridge status.
2. **Closed-loop runtime E2E suite**
   - Exercise operator questions, Telegram-style answers, daemon actions,
     Codex I/O fixture injection, managed dev-server status, and screenshot
     capture.
3. **Missing-action registry**
   - Add a file/report workflow for unregistered recurring actions and make it
     visible in handoffs.
4. **Runtime status query and UI strip**
   - Add a small local/dev/admin-only status model in backend/frontend.
5. **Operational UI cleanup**
   - Remove placeholder noise, tighten navigation, and make Dev Console/runtime
     status the center of the admin experience.
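
For the missing-action registry, one possible shape is a lookup that either dispatches a registered action or records the gap and tells the caller to stop. This is a sketch under assumptions: the three registered names are the read-only actions mentioned in the doctor section, while `MissingActionEntry`, `requestAction`, `summarizeMissing`, and the summary format are hypothetical.

```typescript
// The three action names come from the doctor section of this review; the
// registry shape, field names, and summary format are assumptions.
const REGISTERED_ACTIONS: string[] = [
  "local_dev_status",
  "dev_server_status",
  "telegram_bridge_status",
];

interface MissingActionEntry {
  action: string;
  reason: string; // why the caller needed it, kept from the first request
  count: number;  // how many times the action has been requested
}

type ActionDecision = "dispatch" | "stop";

// Dispatch registered actions; record anything else and stop instead of improvising.
function requestAction(
  action: string,
  reason: string,
  missing: MissingActionEntry[],
): ActionDecision {
  if (REGISTERED_ACTIONS.indexOf(action) !== -1) return "dispatch";
  for (const entry of missing) {
    if (entry.action === action) {
      entry.count += 1;
      return "stop";
    }
  }
  missing.push({ action: action, reason: reason, count: 1 });
  return "stop";
}

// One line per missing action, suitable for pasting into a handoff.
function summarizeMissing(missing: MissingActionEntry[]): string {
  return missing
    .map((e) => `${e.action} (requested ${e.count}x): ${e.reason}`)
    .join("\n");
}
```

Counting repeat requests gives a rough signal for which unregistered actions recur often enough to deserve a registered daemon action.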

## What Not To Build Yet

- Do not add a large websocket/event bus just for status lights.
- Do not add arbitrary shell execution to solve missing daemon actions.
- Do not build a generic admin dashboard before the runtime controls are clean.
- Do not add a broad UI component system beyond current needs.
- Do not convert Telegram into a command shell.
- Do not expand product boilerplate until P0 runtime loop tests exist.

## Open Questions

- Should Runtime CLI doctor eventually exit non-zero on a degraded runtime, or
  remain advisory by default?
- Should runtime scorecards be JSON, shell `key=value`, or both?
- Which runtime status belongs in the app UI versus terminal-only diagnostics?
- Should live Telegram tests be optional/manual or part of a gated local-dev
  smoke suite?
- How much UI cleanup should happen before the next product-boilerplate chunk?
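
On the JSON versus `key=value` question, one option is to render both formats from a single model so they cannot drift apart. The sketch below is illustrative only: the five fields mirror the areas named in the scorecard recommendation (stack, chunks, daemon, bridges, servers), while the `RuntimeScorecard` type, the function names, and the free-form string values are assumptions.

```typescript
// The five areas come from the scorecard recommendation in this review; the
// type name, function names, and value strings are assumptions.
interface RuntimeScorecard {
  stack: string;
  chunks: string;
  daemon: string;
  bridges: string;
  servers: string;
}

// JSON form for tools that can parse structured output.
function scorecardToJson(card: RuntimeScorecard): string {
  return JSON.stringify(card);
}

// Wrap a value in single quotes so a POSIX shell reads it back literally.
function shellQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// key=value form, one pair per line, in a fixed field order.
function scorecardToKv(card: RuntimeScorecard): string {
  const pairs: Array<[string, string]> = [
    ["stack", card.stack],
    ["chunks", card.chunks],
    ["daemon", card.daemon],
    ["bridges", card.bridges],
    ["servers", card.servers],
  ];
  return pairs.map((p) => `${p[0]}=${shellQuote(p[1])}`).join("\n");
}
```

Quoting every value keeps the `key=value` output safe to evaluate in a shell even when a value contains spaces or quotes.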
