---
Status: Active
Owner Role: Developer
Created: 2026-05-22T07:40:00.000Z
Completed:
Depends On: 000299
Validation: node ai/runtime/dist/cli.js validate --tier chunk --chunk 000300 --json
---

# Runtime API Watchdog Safe Resurrection Design for cfd-0024

## Autopilot

- Repository/runtime state is authoritative; chat memory is not authoritative.
- Continue autonomously unless a stop condition is reached.
- Do not kill the Runtime API unless safe proof criteria are fully satisfied.

## Goal

Design and, only if safe, implement the Runtime-owned watchdog prerequisite for future Runtime API killed-process resurrection proof.

## Scope

- Inspect typed supervisor.
- Inspect Runtime API daemon lifecycle.
- Inspect service fail log.
- Inspect Runtime Health service projection.
- Inspect existing planned restart and dev-server resurrection proofs.
- Define Runtime API watchdog policy:
  - detection source
  - ownership proof
  - recovery allowed/denied decision
  - cooldown/backoff/circuit breaker
  - max attempts
  - manual action required state
  - service fail log evidence
  - Runtime State Journal events
  - Action Panel visibility
- Implement only safe non-destructive pieces if feasible.
- If real killed-process proof is still unsafe, record exact prerequisites.

## Acceptance Criteria

- Runtime API resurrection policy is explicit.
- Watchdog design is typed Runtime-owned, not shell-owned.
- Policy distinguishes planned restart, unsolicited death, recovery started, recovery completed, recovery failed, and manual action required.
- Circuit breaker/backoff semantics exist before autonomous recovery expansion.
- If safe proof is executed, it proves recovery end-to-end.
- If safe proof is not executed, `cfd-0024` remains narrowed with exact missing prerequisite.
- Existing validation passes.

## Validation Requirements

- Supervisor/watchdog tests.
- Service fail-log tests.
- Runtime Health smoke.
- Browser proof if user-visible behavior changes.
- Summary validation.
- Canonical validation.

## Stop Conditions

- Stop if ownership of the process cannot be proven.
- Stop if killing Runtime API risks data loss or authority-state corruption.
- Stop if recovery authority is ambiguous.
- Stop if proof would require unsafe manual assumptions.

## Lifecycle Transition Evidence
- Tool: Blueprint Runtime Executor
- Transition: Backlog -> Active
- Recommendation: activate
- Approval side effect: not_required
- Recorded at: 2026-05-22T09:17:42.991Z

## Details

Chunk 000300 adds a non-destructive Runtime API watchdog policy prerequisite for `cfd-0024`. The new Runtime-owned policy lives in `ai/runtime/src/modules/supervisor/runtime-api-watchdog.ts` and inspects Runtime API pid/heartbeat evidence, planned restart suppression, manual suppression, cooldown, max-attempt, and circuit-breaker state.

The policy is exposed through Runtime API inspection (`inspectRuntimeApiWatchdogPolicy`), the Runtime CLI (`supervisor runtime-api-watchdog-policy --json`), and Runtime Health service projection fields on the `runtime-api` service row. It records typed watchdog fail-log events without killing or restarting the Runtime API.

## Good

- Runtime API recovery policy is now explicit and typed.
- The policy distinguishes `healthy`, `planned_restart`, `observed_failure`, `recovery_started`, `recovered`, `recovery_failed`, and `manual_action_required`.
- Runtime Health can surface watchdog status, recommendation, reason, circuit-breaker state, and attempts-in-window.
- Supervisor tests cover observed failure, recovery-started fail-log evidence, circuit-breaker blocking, and manual suppression.
- No unsafe killed-process proof was attempted.

## Bad

- Autonomous Runtime API killed-process resurrection is still not proven.
- The new policy is an inspection/projection prerequisite; the supervisor loop does not yet watch unsolicited Runtime API death and enqueue recovery.

## Ugly

- The existing supervisor test had stale assumptions about failed supervisor work remaining active and using the test fixture artifact id. The current work-session semantics correctly record failed supervisor action work as terminal failed work under the supervisor request artifact.

## Root Cause

Prior chunks proved typed planned Runtime API restart and frontend dev-server resurrection, but the Runtime API itself did not yet have a Runtime-owned watchdog policy that could decide whether an unsolicited daemon death should be recovered automatically.

## Drift Mechanism

Without a watchdog policy and loop, planned restart proof can drift away from actual unsolicited-failure recovery. Runtime Health could show service state, but not the recovery decision inputs needed for safe autonomous resurrection.

## Policy Enforcement

- `runtimeApiWatchdogPolicy()` is read-only and sets `mutates_authority_state: false`.
- Recovery is allowed only for observed failure when enabled, not planned-suppressed, not manually suppressed, and not circuit-breaker blocked.
- `recordRuntimeApiWatchdogEvent()` writes typed fail-log evidence for future loop/proof integration.
- Runtime Health remains projection truth; no frontend inference was added.

## Validation

- `node ai/runtime/dist/cli.js supervisor runtime-api-watchdog-policy --json` passed through `runtime command run`.
- `node ai/runtime/test/runtime-supervisor-test.mjs` passed through `runtime command run`.
- `yarn --cwd ai/runtime typecheck` passed through `runtime command run`.
- `yarn --cwd ai/runtime build` passed through `runtime command run`.

## Next

- Implement the autonomous supervisor watchdog loop that observes unsolicited Runtime API daemon death, records recovery lifecycle evidence, and enqueues typed `runtime_api_restart` only when the policy permits it.
- Prove the killed-process path in browser only after the loop is safe and ownership/recovery authority is unambiguous.

## Developer Evidence

- Report: `ai/reports/report-000300-runtime-api-watchdog-safe-resurrection-design.yaml`
- Runtime API watchdog module: `ai/runtime/src/modules/supervisor/runtime-api-watchdog.ts`
- Runtime Health projection: `ai/runtime/src/modules/health/scorecard.ts`
- CLI/API exposure: `ai/runtime/src/cli.ts`, `ai/runtime/src/api/runtime.ts`
- Test coverage: `ai/runtime/test/runtime-supervisor-test.mjs`

## Acceptance Criteria Verification

- Runtime API resurrection policy is explicit: PASS.
- Watchdog design is typed Runtime-owned, not shell-owned: PASS.
- Policy distinguishes planned restart, unsolicited death, recovery started/completed/failed, and manual action required: PASS.
- Circuit breaker/backoff semantics exist before autonomous recovery expansion: PASS.
- Safe killed-process proof executed only if criteria are satisfied: PASS, not executed because the autonomous loop is not implemented.
- `cfd-0024` remains narrowed with exact missing prerequisite: PASS.

## Carry Forward

- `cfd-0024` remains open and narrowed to the autonomous watchdog loop plus browser-visible killed-process proof.
- `cfd-0033` remains relevant for broader service recovery state-machine/backoff expansion beyond the Runtime API-specific policy added here.

## QA Review

Pending final summary validation, QA evidence check, state journal validation, Runtime Health refresh, and canonical validation before close.

## Handoff

After close, remaining package chunks are `000301` Runtime Health minimum-downtime/heavy refresh performance for `cfd-0035`, `000302` command wrapper/hosted tool enforcement for `cfd-0036`, and `000303` final follow-up package summary/CFD reconciliation.
