# Adversarial Workflow Audit

## Summary

The workflow is much stronger than it was, but it is not yet trustworthy enough
to run real product implementation with low human supervision. The biggest risk
is not missing documentation. The biggest risk is false confidence: a chunk can
look complete because it has `PASS`, `Verified`, `Test Impact`, and clean helper
output while the evidence underneath is still prose-based, sampled, or manually
asserted.

This audit intentionally looks for ways the workflow can appear correct without
actually proving correctness.

## High-Risk Workflow Weaknesses

### 1. QA Can Still Rubber-Stamp Prose Evidence

Recent QA reviews often say acceptance criteria are verified, but the evidence is
frequently a summary rather than deterministic assertions. This is visible in
recent workflow chunks where QA records `PASS` after checking representative
output and reading docs, while the system itself cannot prove that every claim in
the report is true.

Risk:

- A report can claim traceability, safety, or validation coverage without a
  machine check proving it.
- QA may validate that text exists, not that the workflow behavior is correct.

Needed fix:

- Add audit/check helpers for report and chunk structure where feasible.
- Require QA to identify which claims are machine-verified, manually inspected,
  or accepted as prose-only.

### 2. Acceptance Criteria Verification Is Mostly Self-Reported

`workflow-state.sh` checks that `## Acceptance Criteria Verification` exists and
that bullet items include `Verified`, `Blocked`, or `Not Applicable`. It does not
check whether every original acceptance criterion is represented, whether the
verification is truthful, or whether a criterion was weakened by wording.

Risk:

- Developer can mark every item `Verified` and pass readiness.
- QA can repeat the summary without catching missing or altered criteria.

Needed fix:

- Add a deterministic acceptance-criteria comparer that lists original
  acceptance criteria next to verification bullets and flags count/wording drift.
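
A minimal sketch of the count-drift half of such a comparer follows. It assumes each criterion is a top-level `- ` bullet under an exact `## Acceptance Criteria` heading; the function names `section_bullets` and `compare_acceptance_criteria` are hypothetical and are not part of `workflow-state.sh`. Wording drift would need a per-item comparison on top of this.

```bash
# Hypothetical helpers; the bullet and heading shapes are assumptions.

# Print the top-level "- " bullets under an exact "## <heading>" line,
# stopping at the next "## " heading.
section_bullets() {
  local heading="$1" file="$2"
  awk -v h="## $heading" '
    $0 == h             { in_section = 1; next }
    /^## /              { in_section = 0 }
    in_section && /^- / { print }
  ' "$file"
}

# Return 0 when both sections hold the same non-zero number of bullets.
compare_acceptance_criteria() {
  local file="$1" n_criteria n_verification
  n_criteria=$(section_bullets "Acceptance Criteria" "$file" | grep -c '^- ')
  n_verification=$(section_bullets "Acceptance Criteria Verification" "$file" | grep -c '^- ')
  echo "criteria=$n_criteria verification=$n_verification"
  if [ "$n_criteria" -eq 0 ]; then
    # An empty result is treated as drift so a renamed heading cannot pass.
    echo "DRIFT: no acceptance criteria found"
    return 1
  fi
  if [ "$n_criteria" -ne "$n_verification" ]; then
    echo "DRIFT: count mismatch"
    return 1
  fi
  echo "OK: counts match"
}
```

Treating zero criteria as a failure is deliberate: otherwise heading drift would turn the checker itself into a false-PASS path.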

### 3. Test Impact Can Be Complete In Form But Weak In Substance

`workflow-state.sh` checks for Test Impact fields, but it cannot judge whether
the testing choice is adequate. A chunk can say runtime smoke is not applicable,
or a test gap is future work, and still pass if the prose is plausible.

Risk:

- Behavior-changing chunks may defer meaningful coverage too easily.
- Documentation chunks can make future claims stronger than current executable
  coverage.

Needed fix:

- Add risk-tier guidance and a QA checklist that maps changed files/categories to
  expected test layers.
- Require explicit `Machine-verified`, `Manual-review`, or `Deferred follow-up`
  labels for each Test Impact line.
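
The label requirement could be enforced with a check along these lines. This is a sketch only: it assumes the section heading is exactly `## Test Impact` and that each Test Impact line is a top-level `- ` bullet, and `check_test_impact_labels` is a hypothetical name. It confirms that a label is present, not that the chosen label is justified.

```bash
# Hypothetical helper; heading and bullet shape are assumptions.
# Prints every Test Impact bullet that carries none of the three labels
# and returns 1 if any such bullet exists.
check_test_impact_labels() {
  local file="$1" unlabeled
  unlabeled=$(awk '
    $0 == "## Test Impact" { in_section = 1; next }
    /^## /                 { in_section = 0 }
    in_section && /^- / && !/Machine-verified|Manual-review|Deferred follow-up/ { print }
  ' "$file")
  if [ -n "$unlabeled" ]; then
    printf 'Unlabeled Test Impact lines:\n%s\n' "$unlabeled"
    return 1
  fi
  return 0
}
```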

### 4. Requirements Workflow Is Still Mostly Prose-Simulated

The auth/admin simulation became fixture-driven, but it still does not execute a
requirements lifecycle harness. Requirements intake, clarification, requirements
review, and chunk planning are manually described in a report.

Risk:

- The system can claim end-to-end requirements flow works without actually
  running a deterministic requirements-state scenario.
- Invented assumptions may return when the next domain is less familiar.

Needed fix:

- Build a requirements workflow scenario harness using fixture rough ideas and
  clarification answers.
- Assert pre-clarification BLOCKED, post-clarification planning readiness, and
  chunk-plan structure.
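
The shape of such a harness might look like the sketch below. Everything in it is illustrative: `review_requirements` is a stub standing in for the real review step, the `OPEN QUESTION:` marker is an assumed convention, and the fixture text is invented for the example. Chunk-plan structure assertions are left out.

```bash
# Stub standing in for the real requirements review step (hypothetical).
# Reports BLOCKED while any "OPEN QUESTION:" line remains, else READY.
review_requirements() {
  if grep -q '^OPEN QUESTION:' "$1"; then echo "BLOCKED"; else echo "READY"; fi
}

# Compare an expected state with the actual one and report the result.
assert_state() {
  local expected="$1" actual="$2" label="$3"
  if [ "$expected" = "$actual" ]; then
    echo "ok: $label"
  else
    echo "FAIL: $label (expected $expected, got $actual)"
    return 1
  fi
}

# Run one scenario: a rough idea with an open question, then the same
# idea after a clarification answer has been recorded.
run_requirements_scenario() {
  local dir rough clarified status=0
  dir=$(mktemp -d)
  rough="$dir/rough-idea.md"
  clarified="$dir/clarified.md"
  # Illustrative fixture content, not taken from the real fixtures.
  printf 'Admins can manage users.\nOPEN QUESTION: who creates the first admin?\n' > "$rough"
  printf 'Admins can manage users.\nDecision: recorded in the clarification answers.\n' > "$clarified"
  assert_state BLOCKED "$(review_requirements "$rough")" "pre-clarification is BLOCKED" || status=1
  assert_state READY "$(review_requirements "$clarified")" "post-clarification is ready for planning" || status=1
  rm -rf "$dir"
  return "$status"
}
```

The point of the structure is that both states are asserted from fixtures on every run, rather than described once in a report.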

### 5. Git Diff Stat Can Hide New Files

`git diff --stat` does not include untracked files. Several recent chunk summaries
show `(no diff)` while all work is in untracked files. `git status` does list
those files, but the diff stat alone is misleading.

Risk:

- Reviewers may underestimate the change size.
- Summary packets can appear empty despite new reports/chunks/fixtures.

Needed fix:

- Update `workflow-summary.sh` to add an untracked-file summary (count and
  paths), or to produce a combined diff-stat style report that also covers
  untracked files.
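
One possible shape for that summary is sketched below. It is not the current `workflow-summary.sh` behavior: `summarize_changes` is a hypothetical function, and the output format is an assumption. It relies only on `git diff --stat` and `git ls-files --others --exclude-standard`, and it assumes untracked paths contain no characters that git would quote.

```bash
# Hypothetical helper; run from inside the working tree being summarized.
# Prints the tracked diff stat, then the count, path, and line count of
# every untracked file that is not ignored.
summarize_changes() {
  local tracked untracked count
  tracked=$(git diff --stat)
  untracked=$(git ls-files --others --exclude-standard)

  echo "Tracked changes:"
  if [ -n "$tracked" ]; then
    printf '%s\n' "$tracked"
  else
    echo "  (no diff)"
  fi

  if [ -n "$untracked" ]; then
    count=$(printf '%s\n' "$untracked" | wc -l | tr -d ' ')
    echo "Untracked files ($count):"
    printf '%s\n' "$untracked" | while IFS= read -r path; do
      printf '  %s (%s lines)\n' "$path" "$(wc -l < "$path" | tr -d ' ')"
    done
  else
    echo "Untracked files (0)"
  fi
}
```

With this layout, a chunk made entirely of new files still shows `(no diff)` for tracked changes, but the untracked count and per-file line counts sit directly beneath it.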

## Medium-Risk Weaknesses

### 1. Scenario Harness Covers Chunk State Better Than Product Workflow

`workflow-scenarios-test.sh` is useful for chunk states, prompt mode selection,
and summary command placement. It does not yet exercise:

- requirements intake/review/approval.
- chunk planning from requirements.
- Telegram wrappers after every shared-helper change.
- product domain scenarios with fixtures.

Confidence type:

- Chunk workflow: simulation-based confidence.
- Requirements workflow: mostly reasoning-based confidence.
- Runtime product behavior: runtime confidence only when app tests or runtime
  smoke are actually run.

### 2. Prompt Synthesis Still Depends On Markdown Shape

Prompt synthesis reads active chunk sections and pass history from markdown.
Recent fixes improved relevant Developer pass context, but markdown parsing is
still fragile.

Risk:

- Slight heading drift can silently remove context.
- Prompt review may wrap a flawed deterministic prompt without catching missing
  source context.

Needed fix:

- Add fixture tests for prompt synthesis inputs with malformed or stale sections.
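
A fixture test for heading drift could take roughly this form. The extractor here is a hypothetical stand-in for whatever prompt synthesis actually uses, and `Developer Pass History` is an assumed heading name. The property under test is that a missing heading produces a non-zero exit instead of silently empty context.

```bash
# Hypothetical extractor: prints the body of an exact "## <heading>"
# section and exits non-zero when the heading is absent.
extract_section() {
  local heading="$1" file="$2"
  awk -v h="## $heading" '
    $0 == h    { found = 1; in_section = 1; next }
    /^## /     { in_section = 0 }
    in_section { print }
    END        { exit (found ? 0 : 1) }
  ' "$file"
}

# Fixture test: a heading with a doubled space must be rejected.
run_heading_drift_fixture() {
  local fixture status=0
  fixture=$(mktemp)
  printf '## Developer Pass  History\n- pass 1\n' > "$fixture"
  if extract_section "Developer Pass History" "$fixture" > /dev/null; then
    echo "FAIL: drifted heading was accepted"
    status=1
  else
    echo "ok: drifted heading rejected"
  fi
  rm -f "$fixture"
  return "$status"
}
```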

### 3. Handoff Correctness Has Regressed Before

The system previously confused readiness gates with exact next actions. Scenario
assertions were added later, but similar confusion can recur for new states or
requirements flows.

Needed fix:

- Expand output-quality assertions for every canonical state and requirements
  state, not just known regressions.

### 4. QA PASS Often Happens After Developer-Provided Evidence

QA frequently reruns commands, which is good. But for reports and docs, QA often
relies on Developer summaries plus spot checks.

Risk:

- QA misses internal contradictions or untested claims.
- QA verifies formatting rather than adversarially challenging scope.

Needed fix:

- Add a QA adversarial checklist requiring at least one attempt to falsify the
  chunk's strongest claim.

## Low-Risk Weaknesses

- Some ordered lists in docs can become visually awkward after insertions, even
  though Markdown renders them correctly.
- `workflow-summary.sh` trims long sections, which is useful for mobile but can
  hide details relevant to audit.
- Prompt review modes are documented but still manual; they do not enforce vetoes
  automatically.
- Completed chunk history is useful but verbose; reviewers may skip older
  details where important stale-state lessons live.

## Likely False-Positive PASS Areas

- Report-only chunks that claim a workflow is coherent without executable
  scenario assertions.
- Test Impact sections that explain why tests are not applicable.
- Operator Sanity checks based on representative output rather than full state
  matrix coverage.
- Requirements Review PASS in simulations where no real requirements file was
  approved.
- Prompt Synthesizer review prompts that are generated but not actually reviewed
  by a separate role or human.

## Areas Relying Too Much On Prose Review

- Requirements intake and clarification quality.
- Chunk plan derivation from requirements.
- Test adequacy judgment.
- Acceptance criteria truthfulness.
- Report traceability.
- Runtime smoke applicability decisions.
- Follow-up chunk prioritization.

## Places Where Simulation Should Replace Reasoning

1. Requirements lifecycle from rough fixture to reviewed requirement.
2. Acceptance criteria to verification matching.
3. Test Impact adequacy for file categories.
4. Prompt synthesis for malformed/stale chunk sections.
5. Workflow summary output for untracked-only changes.
6. Telegram wrapper consistency for shared helper output.
7. Completion readiness after multiple Developer/QA iterations.

## Missing Deterministic Assertions

- Every acceptance criterion appears in verification.
- Every Test Impact field is specific, not just present.
- Every QA Review includes exact output checked when Operator Sanity applies.
- `git diff --stat` is not the only size signal when files are untracked.
- A Requirements Review PASS recorded in a simulation is clearly labeled as a
  non-approval.
- Prompt review output has actually been consumed before execution.

## QA Weaknesses

- QA can pass chunks without proving that acceptance criteria map one-to-one.
- QA reports can use broad summaries such as "all criteria verified" rather than
  listing sampled evidence.
- QA may accept "not applicable" runtime smoke decisions without adversarially
  checking whether behavior actually changed.
- QA does not consistently state confidence type: reasoning-based,
  simulation-based, or runtime-verified.
- QA does not always identify the strongest possible false PASS path.

## Orchestration Weaknesses

- Orchestrator guidance is clearer, but automatic enforcement is limited.
- Requirements and chunk workflows are not yet simulated together by an
  executable harness.
- Manual intervention gates exist, but ambiguous product decisions can still be
  hidden as follow-up prose.
- The system can move quickly through chunk completion even when the next
  required artifact is a real requirements approval.

## Summary And Handoff Weaknesses

- Summary output can show `(no diff)` when new untracked files contain all work.
- Mobile-friendly trimming can hide critical pass-history details.
- Handoff blocks are only as correct as canonical-state parsing and scenario
  coverage.
- Advisory commands are useful, but they can create confidence that the workflow
  is more automated than it is.

## How This Workflow Could Still Fail In Real Product Implementation

1. A rough product idea enters requirements intake.
2. Intake produces plausible requirements but invents a missing security decision.
3. Requirements Review marks PASS because the document is complete-looking.
4. Chunk Planner creates reasonable chunks from the invented assumption.
5. Developer prompt includes that assumption as fact.
6. Developer implements behavior and writes tests for the wrong policy.
7. QA validates tests and output quality but does not challenge the original
   assumption.
8. Workflow summary shows clean state and exact next commands.
9. The product ships behavior that is coherent, tested, and wrong.

The current system reduces this risk but does not eliminate it. The missing layer
is deterministic traceability and adversarial review of product assumptions.

## What Would Make The System Trustworthy Enough For Real Auth/Admin Implementation

- Requirements lifecycle scenario harness using auth/admin fixtures.
- Human approval gate for production bootstrap policy before implementation.
- Acceptance criteria verification comparer.
- Test Impact adequacy checker for backend/API/frontend file categories.
- Backend auth/admin scenario harness with local-only fixtures and cleanup.
- Frontend browser smoke setup for role-visible UI.
- Summary output that includes untracked-file size/count signals.
- QA adversarial checklist requiring false PASS analysis in every product chunk.

## Recommended Fixes

1. Add a requirements lifecycle scenario harness.
2. Add an acceptance-criteria verification checker.
3. Add untracked-file visibility to workflow summary.
4. Add Test Impact adequacy heuristics by change category.
5. Add prompt synthesis fixture tests for malformed/stale markdown.
6. Add a QA adversarial review section to QA template and role docs.
7. Add backend auth/admin bootstrap scenario harness before product
   implementation.

## Recommended Future Chunks

### Priority 1: Requirements Lifecycle Scenario Harness

Use rough idea and clarification fixtures to assert requirements intake,
pre-clarification BLOCKED review, post-clarification planning readiness, and
chunk plan structure.

### Priority 2: Acceptance Criteria Verification Checker

Compare `## Acceptance Criteria` against `## Acceptance Criteria Verification`
and fail readiness on missing, extra, or unmarked criteria.

### Priority 3: Workflow Summary Untracked Diff Visibility

Report untracked file count and paths in summary/diff stat sections so new-file
chunks cannot appear as `(no diff)`.

### Priority 4: QA Adversarial Gate

Add a required QA section:

- strongest false PASS risk.
- evidence type.
- attempted falsification.
- remaining unproven claims.
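
A presence check for those four fields might look like the following sketch. It assumes each field is written as a `- <field>: <value>` bullet somewhere in the QA review file, and `check_qa_adversarial_section` is a hypothetical name. It verifies only that each field exists and is non-empty, not that the falsification attempt was meaningful.

```bash
# Hypothetical helper; the "- <field>: <value>" bullet shape is an assumption.
# Reports each required field that is absent or has no value after the colon.
check_qa_adversarial_section() {
  local file="$1" field missing=0
  for field in "strongest false PASS risk" "evidence type" \
               "attempted falsification" "remaining unproven claims"; do
    # Case-insensitive match; requires a non-space character after the colon.
    if ! grep -qiE "^- ${field}: *[^ ]" "$file"; then
      echo "missing or empty: $field"
      missing=1
    fi
  done
  return "$missing"
}
```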

### Priority 5: Auth/Admin Backend Scenario Harness

Add local-only deterministic backend scenario coverage for admin bootstrap,
role-aware login/currentUser, generated user cleanup, and production-safety
boundaries.

## Confidence Assessment

- Reasoning-based confidence: role documentation, report quality, many QA
  judgments.
- Simulation-based confidence: workflow-state, prompt mode selection, summary
  command placement, chunk pass mechanics.
- Runtime confidence: backend/frontend tests and runtime smoke only when those
  commands are actually run.

The workflow is ready for continued hardening. It is not yet ready for
low-supervision implementation of sensitive auth/admin behavior.
