16. Testing Strategy

1. Why This Exists

The chat pipeline is the most consequential code in Sasha because it touches every user interaction, yet it has historically been the least tested. The test strategy is anchored on five product-quality goals; each test layer below maps to one or more of them.

#	Goal	What "good" looks like	Where coverage lives
1	Stable chat session	No dropped or duplicated messages on reconnect, restart, or compaction	Y (server, shipped Phase 0); B.2 (client, pending); WebSocket protocol E2E (Layer 3 in section 8).
2	Don't lose messages we ought to surface	Anything Anthropic's CLI persists to JSONL is replayable to the client	Y mapper + tail service unit tests; A2 replay harness (planned).
3	Reduce noisy messages	`[Message format not recognized]` never reaches a user; status events don't pollute the transcript	Tactical: 10 unit tests on `messageNormalizer.isSystemJson` cover the `rate_limit_info` envelope (Phase C0, 2026-05-09). Strategic coverage depends on B.3 (signal classification), not yet started.
4	Stay consistent with stdout/stderr	Live-stream and replay produce semantically equivalent client state	Mapper unit tests assert wire-shape parity with `messageStreamHandler.js` broadcast payload.
5	Debuggable	A developer can grep one log prefix and see what every chat-subscribe did	A1 debug pane (shipped, 30 tests); `[chat-subscribe]` server log (shipped, no automated assertion yet).

The biggest unaddressed gap is goal 3 (noise) — there is no work in progress on signal classification. The biggest in-flight gap is goal 1 — Y's server work is shipped behind a feature flag, but the client (B.2) still uses the legacy path so the feature is dormant in production.

For program context, see docs/superpowers/specs/2026-05-06-replay-via-jsonl-tail-design.md (Y) and docs/superpowers/plans/2026-05-06-client-cutover-b2-redesign.md (B.2).

2. Test Framework

Tool	Purpose
Vitest	Test runner (unit + integration). Configured in `claudecodeui/vitest.config.ts`.
`@testing-library/react`	Component + hook tests in jsdom
`@vitest/coverage-v8`	Coverage measurement (V8-native, fast)
jsdom	Browser-environment polyfill for component/hook tests
better-sqlite3	In-memory database for test isolation

3. Running Tests

cd claudecodeui

npm test                  # Run once
npm run test:watch        # Watch mode
npm run test:coverage     # Run + emit reports under coverage/
npm run test:integration  # Server integration tests (uses --env-file=.env.test)
npm run lint              # ESLint (server/ only)

4. Current Test Inventory

~31 test files in claudecodeui/. The chat-pipeline coverage breakdown below is the load-bearing part — most of the rest is incidental.

By area

Area	Test files	Test count	Status
JSONL-tail replay backend (Y — 2026-05-08)	5	58	All passing — Y Phase 0 (Tasks 1-7)
Chat event ring (B.1 — 2026-05-05)	7	47	All passing — slated for deletion in Y Phase 3
Live debug pane (A1 — 2026-05-04)	6	30	All passing
Workflow parsing	2	substantial	All passing
Scheduler	4	substantial	1 file failing (mock issue, pre-existing)
Conversation compaction	1	—	All passing
MCP integrations (docsidecar)	7	—	All passing
Project state reducer	1	1	All passing (post-orphan-cleanup)
Server integration (live infra needed)	4	4 failing	Need live DB/server
Other server services	2	mixed	promptMaterializer fs-dependent

The pre-existing 16-failure count is unchanged from before Y. The failures are either live-infra-gated or caused by the scheduler mock issue noted in the table, and none are related to recent code.

Chat-pipeline-related coverage detail

Y — JSONL tail replay backend (2026-05-08, Phase 0)

Suite	Tests	What it covers
`server/services/__tests__/jsonlTailCursor.test.js`	6	Opaque base64url cursor codec; structure/type/version validation; `jsonlPath` excluded (security)
`server/services/__tests__/jsonlTailService.test.js`	15	`fingerprint` (dev:ino:mtime:size:firstHash:lastHash); `tailFile` partial-line safety; `freshCursor`; `resolveTranscriptPath` candidate-iteration
`server/services/__tests__/jsonlReplayMapper.test.js`	11	JSONL envelope → wire-format `message_streamed` event; skip-list (queue-operation, last-prompt, permission-mode); preserves `tool_use_result` (NOT stripped)
`server/websocket/__tests__/eventRingHandler.test.js`	21	Both backends behind `REPLAY_BACKEND` flag — 12 ring tests + new JSONL coverage (fresh subscribe, cursor-at-EOF, malformed cursor, fingerprint mismatch, mid-read rewrite, missing project)
`server/__tests__/integration/jsonl-tail-replay-e2e.test.js`	4	End-to-end against real on-disk JSONL: write → tail → assert; append → tail → only new; compaction-style rewrite changes fingerprint; cursor round-trip
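The cursor codec row above can be illustrated with a minimal sketch. It assumes a cursor that carries a version, session id, byte offset, and fingerprint; the field names (`v`, `sessionId`, `byteOffset`, `fingerprint`) and both function names are illustrative and are not taken from `jsonlTailCursor.js`. What it tries to show is the allow-list rebuild on decode, which is one way to keep a client-supplied `jsonlPath` out of the trusted cursor.

```javascript
// Hypothetical cursor schema version; the real codec may version differently.
const CURSOR_VERSION = 1;

// Serialises the cursor as JSON and wraps it in base64url so it is opaque to clients.
function encodeCursor({ sessionId, byteOffset, fingerprint }) {
  const payload = { v: CURSOR_VERSION, sessionId, byteOffset, fingerprint };
  return Buffer.from(JSON.stringify(payload), "utf8").toString("base64url");
}

// Returns a validated cursor object, or null for anything malformed or mismatched.
function decodeCursor(token) {
  if (typeof token !== "string" || token.length === 0) return null;
  let parsed;
  try {
    parsed = JSON.parse(Buffer.from(token, "base64url").toString("utf8"));
  } catch {
    return null; // decoded bytes were not valid JSON
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) return null;
  if (parsed.v !== CURSOR_VERSION) return null;
  if (typeof parsed.sessionId !== "string") return null;
  if (!Number.isSafeInteger(parsed.byteOffset) || parsed.byteOffset < 0) return null;
  if (typeof parsed.fingerprint !== "string") return null;
  // Rebuild from an allow-list: extra fields such as a forged jsonlPath are dropped.
  return {
    sessionId: parsed.sessionId,
    byteOffset: parsed.byteOffset,
    fingerprint: parsed.fingerprint,
  };
}
```

Under this sketch the server resolves the transcript path itself from the session id, so a forged cursor can at worst be rejected or point at an offset in a file the server already chose.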

B.1 — Event ring (slated for deletion in Y Phase 3)

Suite	Tests	What it covers
`server/services/__tests__/event-ring.test.js`	17	Ring buffer: push, replay, eviction, dedup, markCompleted
`server/services/__tests__/event-ring-cursor.test.js`	6	Opaque base64url cursor codec, validation
`server/services/__tests__/event-ring-config.test.js`	3	Env-var parsing with fallback
`server/services/__tests__/event-ring-sweeper.test.js`	5	TTL eviction, max-sessions cap (time-mocked)
`server/services/__tests__/event-ring-sweeper-prep.test.js`	1	markCompleted smoke test
`server/__tests__/integration/event-ring-e2e.test.js`	3	Full pipeline: subscribe → push → reconnect → replay

A1 — Live debug pane

Suite	Tests	What it covers
`src/dev/__tests__/wsDebugStore.test.js`	8	Debug-pane ring buffer
`src/dev/__tests__/useWsDebugStore.test.jsx`	1	React hook over the store
`src/dev/__tests__/useDebugToggle.test.jsx`	6	Cmd+Shift+Y toggle + localStorage persistence
`src/dev/__tests__/debugToggleStore.test.js`	5	Toggle singleton store
`src/dev/__tests__/downloadFrames.test.js`	3	JSONL serialise + browser download
`src/dev/__tests__/ChatStreamDebugPane.test.jsx`	12	The component (rows, expand, pause, clear, download, filter, source colours)

Manual validation done outside the test suite

Y backend end-to-end via Playwright + browser console (2026-05-09): ran REPLAY_BACKEND=jsonl npm run dev, logged in, opened a real session, and sent a synthetic chat-subscribe over the live WebSocket. Two results were verified. A fresh subscribe returned a cursor at EOF=194524 with the correct path resolved. Rewinding the cursor to byteOffset=0 returned 27 events with monotonic seq 1-27, all type=message_streamed operation=replay, and the mapper output matched the live wire shape. This run is the proof of life that motivated the WebSocket protocol E2E suite (Layer 3 in section 8).
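The properties checked by hand in that run (contiguous seq starting at 1, a single event type, a single operation) are easy to express as a reusable check. The sketch below assumes each replayed frame exposes `seq`, `type`, and `operation` as top-level properties; that layout and the function name `checkReplayFrames` are assumptions made for illustration.

```javascript
// Returns a list of human-readable violations; an empty list means the replay looks sane.
function checkReplayFrames(frames) {
  const problems = [];
  frames.forEach((frame, index) => {
    const expectedSeq = index + 1; // seq is expected to start at 1 and step by 1
    if (frame.seq !== expectedSeq) {
      problems.push(`frame ${index}: seq ${frame.seq}, expected ${expectedSeq}`);
    }
    if (frame.type !== "message_streamed") {
      problems.push(`frame ${index}: unexpected type ${frame.type}`);
    }
    if (frame.operation !== "replay") {
      problems.push(`frame ${index}: unexpected operation ${frame.operation}`);
    }
  });
  return problems;
}

// Example: three well-formed frames followed by one that skips a seq value.
const frames = [1, 2, 3, 5].map((seq) => ({ seq, type: "message_streamed", operation: "replay" }));
console.log(checkReplayFrames(frames));
```

Returning a list rather than throwing on the first violation keeps the output useful when pasted into a browser console during manual validation.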

5. Coverage Measurement

Current baseline (2026-05-06)

npm run test:coverage produces a CLI summary plus coverage/index.html and coverage/coverage-summary.json. Reports are gitignored.

Unit-test scope (integration excluded):

Metric	Coverage
Lines	1.74% (846 / 48,372)
Statements	1.69% (885 / 52,332)
Functions	1.67% (126 / 7,508)
Branches	1.10% (477 / 43,180)

The aggregate is low because most of the codebase has no tests. Coverage is concentrated in a few well-tested modules.
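To see that concentration without opening the HTML report, the JSON summary can be bucketed per file. The sketch below assumes the usual Istanbul json-summary layout (a `total` entry plus one entry per file, each with a `lines.pct` value); `summarizeCoverage` and the sample object are illustrative, and real reports typically key files by absolute path.

```javascript
// Splits a json-summary report into the aggregate line percentage and per-file buckets.
function summarizeCoverage(summary, strongPct = 70) {
  const { total, ...files } = summary;
  const rows = Object.entries(files).map(([file, metrics]) => ({ file, pct: metrics.lines.pct }));
  return {
    totalLinesPct: total.lines.pct,
    strong: rows.filter((row) => row.pct > strongPct).sort((a, b) => b.pct - a.pct),
    uncovered: rows.filter((row) => row.pct === 0),
  };
}

// Hand-written sample in the report's shape; only the pct values echo this section's tables.
// A real run would load it with JSON.parse(fs.readFileSync("coverage/coverage-summary.json", "utf8")).
const sample = {
  total: { lines: { total: 48372, covered: 846, skipped: 0, pct: 1.74 } },
  "server/services/schedulerLogger.js": { lines: { pct: 100 } },
  "src/utils/workflowParser.js": { lines: { pct: 98 } },
  "src/components/ChatInterface.jsx": { lines: { pct: 0 } },
};

const report = summarizeCoverage(sample);
console.log(report.totalLinesPct, report.strong.map((r) => r.file), report.uncovered.map((r) => r.file));
```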

Where coverage is strong (>70%)

File	Coverage source
`server/services/jsonlTailCursor.js`	Y Task 1 — round-trip + 5 validation paths
`server/services/jsonlTailService.js`	Y Task 2+4 — fingerprint, tailFile, freshCursor, resolveTranscriptPath
`server/services/jsonlReplayMapper.js`	Y Task 3 — every envelope type + skip-list + fixture
`server/websocket/eventRingHandler.js`	Both backends covered behind feature-flag tests (21)
`server/services/event-ring-cursor.js`	B.1 — 100% (will be deleted in Y Phase 3)
`server/services/event-ring.js`	B.1 — 91% (will be deleted in Y Phase 3)
`server/services/event-ring-config.js`	B.1 — 89% (will be deleted in Y Phase 3)
`server/services/event-ring-sweeper.js`	B.1 — 71% (will be deleted in Y Phase 3)
`server/services/scheduler.js`	90%
`server/services/schedulerLogger.js`	100%
`server/conversations/compaction.js`	79%
`src/utils/workflowParser.js`	98%
`src/utils/workflowSerializer.js`	90%

(The file-level percentages predate Y and need a fresh npm run test:coverage run. Treat the table above as qualitative until the next refresh.)

Where coverage is absent (the chat-pipeline gap)

The most consequential code in the system is essentially uncovered:

File	Lines	Coverage
`src/components/ChatInterface.jsx`	2,842	0%
`server/claude-cli.js`	1,846	0%
`src/hooks/useProjectWebSocketV2.js`	1,032	0%
`src/reducers/projectReducer.js`	446	7%
`src/utils/websocket.js`	301	0%
`src/utils/messageNormalizer.js`	123	0%

Together these six files total roughly 6,600 lines at essentially zero coverage. This is the gap the planned A2 replay-test harness is designed to close; see docs/superpowers/specs/2026-05-02-chat-replay-testing-design.md. Y has improved server-side coverage, but the client renderer remains untested.

Coverage thresholds — not yet set

A 1.74% baseline can't carry a meaningful CI gate. vitest.config.ts has the coverage block configured but no thresholds. The right time to add per-area thresholds is after A2 lands and the chat pipeline gets real coverage.
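For reference, per-area thresholds could eventually take a shape like the fragment below. This is a sketch only: it assumes a Vitest version that accepts glob keys under `coverage.thresholds` (understood to be available from Vitest 1.0), the globs are guesses at sensible areas, and every number is a placeholder rather than a measured target.

```javascript
// Sketch of a future coverage block; the real config lives in claudecodeui/vitest.config.ts.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    coverage: {
      provider: "v8",
      thresholds: {
        // Placeholder floors per area. No global floor is set, because the
        // aggregate baseline is too low to gate on.
        "server/services/jsonlTail*.js": { lines: 70, functions: 70 },
        "server/services/jsonlReplayMapper.js": { lines: 70 },
      },
    },
  },
});
```

Scoping the floors to already-tested modules would let regressions in those modules block a PR without the untested bulk of the codebase failing the gate.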

6. Build Validation

Check	Command	Blocking?
ESLint (server only)	`npm run lint`	Yes (CI blocks)
Frontend build	`npx vite build`	Yes
Coverage report	`npm run test:coverage`	No (informational)
TypeScript	N/A (no TypeScript)	N/A

ESLint scope is server/ only; frontend is validated via successful Vite build.

7. Critical Flows — Coverage Status

P0 — Must Not Break

Flow	Coverage
User registration (first user = admin)	Not tested
User login / JWT generation	Not tested
Chat message → Claude CLI → streaming response	Server event-ring layer only (B.1, ~95%); rendering path 0%
File upload and download	Not tested
Scheduled prompt execution	Integration tests exist
Reconnect → cursor-based replay (ring)	B.1 covers (47 tests)
Reconnect → cursor-based replay (JSONL)	Y covers server-side (58 tests + manual E2E proof). Client cutover (B.2) not shipped — flag `REPLAY_BACKEND=jsonl` is dormant in production until B.2 lands.

P1 — Should Be Tested

Flow	Coverage
Password reset	Not tested
Skill CRUD and execution	Not tested
Admin user management	Not tested
AI provider configuration	Not tested
Onboarding	Not tested
Cloud drive OAuth + mount	Not tested
Git operations	Not tested
Workflow editor parsing	Unit tests exist (98%)

P2 — Nice to Have

Flow	Coverage
Meeting transcription	Integration tests exist
MCP service registration	MCP tests exist
Conversation compaction	Unit tests exist (79%)
Analytics report generation	Not tested
Output style management	Not tested
Live debug pane (dev tool)	A1 covers (30 tests, ~90%)

8. Strategic Direction

The testing approach is layered, and the layers are ordered so that the tightest, cheapest, and fastest come first.

Layer 1 — Unit tests (existing, well-suited for pure logic)

Reducers, parsers, ring buffers, codecs, mappers, scheduler logic. Covered today where the modules exist and are isolated. Y's three new modules (cursor codec, tail service, replay mapper) are exemplary — pure functions, fixture-driven, fast.
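As an illustration of why such modules are cheap to test, here is a minimal mapper sketch in the same spirit: a pure function from JSONL text to wire events. It assumes the skip-list matches on the envelope's `type` field and that the wire event wraps the envelope under a `data` key; both are assumptions, as are the function names. Only the skip-list entries, the `message_streamed` type, and the rule that `tool_use_result` is preserved come from the test descriptions above.

```javascript
// Envelope types the replay path is described as skipping.
const SKIP_TYPES = new Set(["queue-operation", "last-prompt", "permission-mode"]);

// Maps one JSONL line to a wire event, or null when the line should not be replayed.
function mapJsonlLine(line, seq) {
  let envelope;
  try {
    envelope = JSON.parse(line);
  } catch {
    return null; // malformed line: skip rather than throw
  }
  if (envelope === null || typeof envelope !== "object") return null;
  if (SKIP_TYPES.has(envelope.type)) return null;
  // The envelope is passed through untouched, so fields like tool_use_result survive.
  return { type: "message_streamed", operation: "replay", seq, data: envelope };
}

// Maps a whole JSONL document; seq counts emitted events only, starting at 1.
function mapJsonl(text) {
  const events = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    const event = mapJsonlLine(line, events.length + 1);
    if (event !== null) events.push(event);
  }
  return events;
}
```

Because the function has no I/O, a fixture file of real envelopes can be fed straight in and the output compared against the live broadcast payload.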

Layer 2 — Integration tests (existing, gated on infrastructure)

Scheduler API, tasks API, meeting API, Y's jsonl-tail-replay-e2e.test.js (real on-disk JSONL via tmp dirs). The Y integration test demonstrates the right pattern: spin up a tmp file, exercise the real modules, no mocks. Older integration tests need .env.test (npm run test:integration); the Y one runs in plain npm test.
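The tmp-file pattern can be sketched in a few lines of plain Node. The reader below is a simplified stand-in for the real tail service (hence the different name, `tailCompleteLines`): it rereads the whole file for brevity, which a production reader would avoid, and its return shape is an assumption. It does show the partial-line rule: a trailing fragment with no newline is held back until it is complete.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Returns complete lines found at or after byteOffset, plus the offset to resume from.
// A trailing fragment without "\n" is left for the next call.
function tailCompleteLines(filePath, byteOffset) {
  const buf = fs.readFileSync(filePath).subarray(byteOffset);
  const lastNewline = buf.lastIndexOf(0x0a);
  if (lastNewline === -1) return { lines: [], nextOffset: byteOffset };
  const complete = buf.subarray(0, lastNewline).toString("utf8");
  return { lines: complete.split("\n"), nextOffset: byteOffset + lastNewline + 1 };
}

// Write, tail, append, tail again, all against a real file in a temp directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "tail-sketch-"));
const file = path.join(dir, "session.jsonl");
fs.writeFileSync(file, '{"n":1}\n{"n":2}\n{"n":'); // third line is still being written
const first = tailCompleteLines(file, 0);
fs.appendFileSync(file, "3}\n"); // the writer finishes the third line
const second = tailCompleteLines(file, first.nextOffset);
fs.rmSync(dir, { recursive: true, force: true });

console.log(first.lines.length, second.lines);
```

No mocks are involved, so the same code path that reads production transcripts is the one under test.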

Layer 3 — WebSocket protocol E2E (shipped 2026-05-09)

Status: shipped. claudecodeui/server/__tests__/e2e/chat-subscribe-jsonl.test.js — 8 scenarios, ~250 lines, ~460ms.

Implementation choice: in-process minimal Express+WS harness (not a child process spawn of server/index.js). The full server has heavy import-time side effects (DB migrations, scheduler init, ~7700 lines) that are not relevant to the protocol contract under test, so the harness imports handleChatSubscribe directly and dispatches messages to it from a WebSocketServer bound to an ephemeral port. JWT auth is intentionally skipped — that layer is enforced by verifyClient on the production WSS and is independently covered.

Coverage:

Fresh subscribe → empty replay + EOF cursor
Cursor at byteOffset=0 with current fingerprint → all events with monotonic seq starting at 1
Append a line, resume from prior cursor → exactly 1 new event
Compaction (delete + recreate, changes inode) → cursor-expired with reason fingerprint-mismatch
Malformed cursor → cursor-expired with reason malformed-or-mismatched-cursor
Missing project context → subscribe-error
Unknown sessionId on fresh subscribe → empty replay (graceful)
Unknown sessionId with cursor → subscribe-error (path unresolvable)

This is the natural place to catch fingerprint-comparison regressions (e.g., the Task 5 sameFileIdentity blind spot — currently dev:ino only) automatically. A2 Docker validation will additionally watch the was=…/now=… log line for in-place compaction cases that this test cannot reproduce on a local filesystem.
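One reading of that blind spot can be shown with fingerprint strings alone. The sketch below uses the `dev:ino:mtime:size:firstHash:lastHash` layout from the unit-test description; the sample values are made up, and `sameFileAndHead` is a hypothetical stricter check offered for contrast, not a description of the planned fix.

```javascript
// Splits a fingerprint of the form dev:ino:mtime:size:firstHash:lastHash.
function parseFingerprint(fingerprint) {
  const [dev, ino, mtime, size, firstHash, lastHash] = fingerprint.split(":");
  return { dev, ino, mtime, size, firstHash, lastHash };
}

// Identity-only comparison: same device and inode.
function sameFileIdentity(a, b) {
  const x = parseFingerprint(a);
  const y = parseFingerprint(b);
  return x.dev === y.dev && x.ino === y.ino;
}

// Hypothetical stricter check: the first line's hash must match as well.
// A plain append leaves it unchanged; rewriting the file would normally change it.
function sameFileAndHead(a, b) {
  return sameFileIdentity(a, b) && parseFingerprint(a).firstHash === parseFingerprint(b).firstHash;
}

// Made-up fingerprints for three situations relative to `before`.
const before = "2049:1337:1715240000:194524:aaa:bbb";
const appended = "2049:1337:1715240060:195000:aaa:ccc"; // same inode, new tail
const rewrittenInPlace = "2049:1337:1715240120:80000:ddd:eee"; // same inode, new content
const recreated = "2049:4242:1715240120:80000:ddd:eee"; // delete + recreate, new inode

console.log(sameFileIdentity(before, rewrittenInPlace), sameFileAndHead(before, rewrittenInPlace));
```

Delete-and-recreate changes the inode, so the identity check catches it; an in-place rewrite keeps the inode, which is why it needs either a content-aware comparison or the Docker log check.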

Layer 4 — Replay test harness for renderer (planned, A2)

The chat pipeline's data shape is too variable for hand-written fixtures. A2 (docs/superpowers/specs/2026-05-02-chat-replay-testing-design.md) builds a harness that:

Captures real ~/.claude/projects/**/*.jsonl session files and runs them through the server in replay mode
Records resulting WebSocket frames
Replays them through the React component tree in jsdom
Asserts no events drop (D1/D2/E completeness invariants) and no out-of-order rendering (F invariant)

This is the right tool for testing ChatInterface.jsx and messageNormalizer.js because hand-written tests cannot keep up with envelope-shape variance. Most useful after B.2 ships, because before then the client doesn't actually subscribe.
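The completeness and ordering checks can be pictured as a comparison between the ids recorded from the WebSocket and the ids that ended up rendered. The sketch below is a simplified stand-in: it does not implement the D1/D2/E/F invariants as the A2 spec defines them, and it assumes every event carries a stable id, which is itself an assumption about the wire format.

```javascript
// Compares recorded event ids against rendered ids and reports three kinds of violation.
function checkRenderInvariants(recordedIds, renderedIds) {
  const renderedCount = new Map();
  for (const id of renderedIds) {
    renderedCount.set(id, (renderedCount.get(id) || 0) + 1);
  }
  // Completeness: every recorded id must be rendered, and rendered only once.
  const dropped = recordedIds.filter((id) => !renderedCount.has(id));
  const duplicated = [...renderedCount].filter(([, count]) => count > 1).map(([id]) => id);

  // Ordering: rendered ids must appear in the same relative order as recorded.
  const position = new Map(recordedIds.map((id, index) => [id, index]));
  const outOfOrder = [];
  let furthest = -1;
  for (const id of renderedIds) {
    const pos = position.get(id);
    if (pos === undefined) continue; // rendered but never recorded: ignored here
    if (pos < furthest) outOfOrder.push(id);
    else furthest = pos;
  }
  return { dropped, duplicated, outOfOrder };
}

// Example: "d" never renders and "b" renders after "c".
console.log(checkRenderInvariants(["a", "b", "c", "d"], ["a", "c", "b"]));
```

In a harness, `renderedIds` would be read from the jsdom tree after replay; how ids are attached to rendered rows is left open here.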

Layer 5 — Browser E2E (Playwright, installed but not yet built)

Playwright is in devDependencies. A focused suite under claudecodeui/e2e/ would cover golden flows:

Login → project list → open existing chat → assert no [object Object] placeholders, no duplicates
Send message → assert it streams → reload page → assert history is intact (the "no drops on reconnect" property)
Toggle the A1 debug pane and assert specific event types appear / don't (catches noise-class regressions once B.3 ships signal classification)

Some of this is meaningful before B.2 (login + history rendering); the chat-subscribe assertions become meaningful only after B.2.

Recommended build order

~~Layer 3 first~~ — Shipped 2026-05-09. Going forward, extend this suite as new chat-subscribe edge cases surface.
A2 Docker validation — exercise the in-place compaction case the local-filesystem Layer 3 test cannot reproduce; watch [chat-subscribe] reason=fingerprint-mismatch was=…/now=… for the sameFileIdentity blind spot.
Layer 5 scaffolding — set up Playwright config, write the login + project list test as the first stable scenario. Cheap.
A2 replay harness (Layer 4) after B.2 ships — closes the renderer gap.

9. CI Integration

GitHub Actions Pipeline

PR opened → ci.yml:
  1. npm install
  2. npm run lint (server)
  3. npx vite build (frontend)
  4. npm run test:integration (optional)
  5. CodeQL analysis (security)
  6. Semgrep scan (SAST)
  7. OSV scan (dependencies)

npm run test:coverage is not yet wired into CI. It will be added with thresholds once A2 raises coverage above the trivial baseline.

Security Scanning in CI

Scanner	Focus	Blocking?
CodeQL	Code patterns, injection, auth issues	Advisory
Semgrep	SAST rules, OWASP patterns	Advisory
OSV	Known dependency vulnerabilities	Advisory
Container scan	Docker image vulnerabilities	Advisory

10. Test Accounts

Local Development

Username: lindsay / Password: password
Login confirmed working on local dev as of 2026-05-09; earlier notes that local login was broken are stale.
Note: npm run dev only loads .env, not .env.local, so the backend binds to whatever .env says (3005) — visit localhost:3005 directly. Vite at 5173 has a proxy mismatch with this configuration.

Docker Development

First user registered becomes admin
No pre-seeded test accounts

11. Known Gaps

Major

Chat-pipeline rendering: 0% covered — ChatInterface.jsx, useProjectWebSocketV2.js, messageNormalizer.js, websocket.js, claude-cli.js. Tracked by A2 spec.
B.2 client cutover not shipped — server-side Y is ready and working (verified end-to-end manually), but no client code sends chat-subscribe. The REPLAY_BACKEND=jsonl backend therefore stays dormant in production until B.2 lands.
Auth: untested — Registration, login, JWT validation, admin gating.
File operations: untested — Upload, download, safePath traversal prevention.
No WS-protocol E2E in CI — ~~Layer 3 above (proposed) closes this.~~ Closed 2026-05-09 by chat-subscribe-jsonl.test.js (8 scenarios).
No browser E2E — Playwright installed but unused.
Signal classification (B.3) not started — goal 3 (noise reduction) has only the C0 tactical fix in place (isSystemJson recognises rate_limit_info envelopes since 2026-05-09). The strategic typed-envelope normaliser has no coverage and no work in flight. The [Message format not recognized] fallback at messageNormalizer.js:170 is no longer triggered by rate_limit_info, but other unknown envelopes survive Y entirely.
No accessibility tests — No automated a11y checks.
No performance baselines — No load testing, no streaming latency benchmarks.

Recommended Priority

~~Layer 3 WS-protocol E2E~~ — Shipped 2026-05-09. server/__tests__/e2e/chat-subscribe-jsonl.test.js. Gives B.2 a stable contract to land against.
Y Phase 1 Docker validation — exercise the JSONL backend in production-like Docker before Phase 3 ring deletion.
Auth integration tests — registration, login, protected-route gating.
A2 replay-test harness — closes the rendering coverage gap; most valuable after B.2 ships.
File operations integration tests — upload, safePath validation, path-traversal prevention.
Coverage thresholds — after A2 raises the baseline, set per-area floors so regressions block PRs.
Playwright golden flows — login + history-render today (cheap); chat-subscribe assertions after B.2.

12. Companion Documents

Y — JSONL tail replay (current Phase 0 work)

docs/superpowers/specs/2026-05-06-replay-via-jsonl-tail-design.md — design spec post-Codex review
docs/superpowers/plans/2026-05-06-jsonl-tail-replay-y.md — implementation plan, Tasks 1-9
docs/superpowers/reviews/2026-05-06-codex-second-opinion-jsonl-tail.md — independent review

A1/A2/A3 — Diagnostic + replay harness

docs/superpowers/specs/2026-05-02-chat-replay-testing-design.md — A1 (shipped), A2 (planned), A3 (complete)
docs/superpowers/plans/2026-05-04-chat-debug-pane-a1.md — A1 implementation plan

B — Architectural fix family

docs/superpowers/specs/2026-05-05-chat-event-replay-ring-design.md — B (B.1 shipped, B.2/B.3 pending)
docs/superpowers/plans/2026-05-05-event-replay-ring-b1.md — B.1 plan (superseded by Y)
docs/superpowers/plans/2026-05-06-client-cutover-b2-redesign.md — B.2 redesign plan
docs/superpowers/reviews/2026-05-05-codex-second-opinion-chat-pipeline.md — first Codex review