Last updated: May 12, 2026, 08:49 PM UTC

16. Testing Strategy

1. Why This Exists

The chat pipeline is the most consequential code in Sasha — it touches every user interaction — and historically has been the least tested. The test strategy is anchored on five product-quality goals; each test layer below maps to one or more of them.

| # | Goal | What "good" looks like | Where coverage lives |
|---|------|------------------------|----------------------|
| 1 | Stable chat session | No dropped or duplicated messages on reconnect, restart, or compaction | Y (server, shipped Phase 0); B.2 (client, pending). E2E Layer 3 below. |
| 2 | Don't lose messages we ought to surface | Anything Anthropic's CLI persists to JSONL is replayable to the client | Y mapper + tail service unit tests; A2 replay harness (planned). |
| 3 | Reduce noisy messages | [Message format not recognized] never reaches a user; status events don't pollute the transcript | Tactical: 10 unit tests on messageNormalizer.isSystemJson cover the rate_limit_info envelope (Phase C0, 2026-05-09). Strategic coverage depends on B.3 (signal classification), not yet started. |
| 4 | Stay consistent with stdout/stderr | Live-stream and replay produce semantically equivalent client state | Mapper unit tests assert wire-shape parity with messageStreamHandler.js broadcast payload. |
| 5 | Debuggable | A developer can grep one log prefix and see what every chat-subscribe did | A1 debug pane (shipped, 30 tests); [chat-subscribe] server log (shipped, no automated assertion yet). |

The biggest unaddressed gap is goal 3 (noise) — there is no work in progress on signal classification. The biggest in-flight gap is goal 1 — Y's server work is shipped behind a feature flag, but the client (B.2) still uses the legacy path so the feature is dormant in production.

For program context, see docs/superpowers/specs/2026-05-06-replay-via-jsonl-tail-design.md (Y) and docs/superpowers/plans/2026-05-06-client-cutover-b2-redesign.md (B.2).


2. Test Framework

| Tool | Purpose |
|------|---------|
| Vitest | Test runner (unit + integration). Configured in claudecodeui/vitest.config.ts. |
| @testing-library/react | Component + hook tests in jsdom |
| @vitest/coverage-v8 | Coverage measurement (V8-native, fast) |
| jsdom | Browser-environment polyfill for component/hook tests |
| better-sqlite3 | In-memory database for test isolation |

3. Running Tests

cd claudecodeui

npm test                  # Run once
npm run test:watch        # Watch mode
npm run test:coverage     # Run + emit reports under coverage/
npm run test:integration  # Server integration tests (uses --env-file=.env.test)
npm run lint              # ESLint (server/ only)

4. Current Test Inventory

~31 test files in claudecodeui/. The chat-pipeline coverage breakdown below is the load-bearing part — most of the rest is incidental.

By area

| Area | Test files | Test count | Status |
|------|------------|------------|--------|
| JSONL-tail replay backend (Y, 2026-05-08) | 5 | 58 | All passing; Y Phase 0 (Tasks 1-7) |
| Chat event ring (B.1, 2026-05-05) | 7 | 47 | All passing; slated for deletion in Y Phase 3 |
| Live debug pane (A1, 2026-05-04) | 6 | 30 | All passing |
| Workflow parsing | 2 | substantial | All passing |
| Scheduler | 4 | substantial | 1 file failing (mock issue, pre-existing) |
| Conversation compaction | 1 | | All passing |
| MCP integrations (docsidecar) | 7 | | All passing |
| Project state reducer | 1 | 1 | All passing (post-orphan-cleanup) |
| Server integration (live infra needed) | 4 | 4 failing | Need live DB/server |
| Other server services | 2 | mixed | promptMaterializer fs-dependent |

The pre-existing 16-failure count is unchanged from before Y. The failures are either live-infra-gated or the pre-existing scheduler mock issue, and none relate to recent code.

Chat-pipeline-related coverage detail

Y — JSONL tail replay backend (2026-05-08, Phase 0)

| Suite | Tests | What it covers |
|-------|-------|----------------|
| server/services/__tests__/jsonlTailCursor.test.js | 6 | Opaque base64url cursor codec; structure/type/version validation; jsonlPath excluded (security) |
| server/services/__tests__/jsonlTailService.test.js | 15 | fingerprint (dev:ino:mtime:size:firstHash:lastHash); tailFile partial-line safety; freshCursor; resolveTranscriptPath candidate-iteration |
| server/services/__tests__/jsonlReplayMapper.test.js | 11 | JSONL envelope → wire-format message_streamed event; skip-list (queue-operation, last-prompt, permission-mode); preserves tool_use_result (NOT stripped) |
| server/websocket/__tests__/eventRingHandler.test.js | 21 | Both backends behind the REPLAY_BACKEND flag: 12 ring tests plus new JSONL coverage (fresh subscribe, cursor-at-EOF, malformed cursor, fingerprint mismatch, mid-read rewrite, missing project) |
| server/__tests__/integration/jsonl-tail-replay-e2e.test.js | 4 | End-to-end against real on-disk JSONL: write → tail → assert; append → tail → only new; compaction-style rewrite changes fingerprint; cursor round-trip |
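
As an illustration of what the cursor codec row describes, here is a minimal sketch of an opaque base64url cursor with structure, type, and version validation. The field names (v, type, byteOffset, fingerprint) and the constants are assumptions for illustration, not the actual shape used in jsonlTailCursor.js. The point it shows is that the decoder copies only known fields, so a client-supplied path never survives decoding.

```javascript
// Hypothetical cursor shape: { v, type, byteOffset, fingerprint }.
// Deliberately no jsonlPath: the server re-resolves the path itself.
const CURSOR_VERSION = 1;          // assumed value
const CURSOR_TYPE = 'jsonl-tail';  // assumed value

function encodeCursor({ byteOffset, fingerprint }) {
  const payload = { v: CURSOR_VERSION, type: CURSOR_TYPE, byteOffset, fingerprint };
  return Buffer.from(JSON.stringify(payload), 'utf8').toString('base64url');
}

// Returns the decoded cursor, or null for anything malformed.
function decodeCursor(token) {
  if (typeof token !== 'string' || token.length === 0) return null;
  let parsed;
  try {
    parsed = JSON.parse(Buffer.from(token, 'base64url').toString('utf8'));
  } catch {
    return null;
  }
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) return null;
  if (parsed.v !== CURSOR_VERSION || parsed.type !== CURSOR_TYPE) return null;
  if (!Number.isSafeInteger(parsed.byteOffset) || parsed.byteOffset < 0) return null;
  if (typeof parsed.fingerprint !== 'string') return null;
  // Copy only known fields so extra keys (for example a client-supplied path) are dropped.
  return { byteOffset: parsed.byteOffset, fingerprint: parsed.fingerprint };
}
```

Returning null for every failure mode keeps the caller's branch simple: any non-decodable cursor can be answered with a single cursor-expired response.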

B.1 — Event ring (slated for deletion in Y Phase 3)

| Suite | Tests | What it covers |
|-------|-------|----------------|
| server/services/__tests__/event-ring.test.js | 17 | Ring buffer: push, replay, eviction, dedup, markCompleted |
| server/services/__tests__/event-ring-cursor.test.js | 6 | Opaque base64url cursor codec, validation |
| server/services/__tests__/event-ring-config.test.js | 3 | Env-var parsing with fallback |
| server/services/__tests__/event-ring-sweeper.test.js | 5 | TTL eviction, max-sessions cap (time-mocked) |
| server/services/__tests__/event-ring-sweeper-prep.test.js | 1 | markCompleted smoke test |
| server/__tests__/integration/event-ring-e2e.test.js | 3 | Full pipeline: subscribe → push → reconnect → replay |
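
For readers unfamiliar with the behaviours named in the first row (push, replay, eviction, dedup, markCompleted), the following is a minimal sketch of a bounded event ring. The class name, method signatures, and the choice to forget an event id once it is evicted are all assumptions for illustration; event-ring.js may differ on every point.

```javascript
// Hypothetical bounded ring: keeps the newest `capacity` events per session.
class EventRing {
  constructor(capacity) {
    this.capacity = capacity;
    this.events = [];          // oldest first
    this.nextSeq = 1;
    this.seenIds = new Set();  // ids currently retained, used for dedup
    this.completed = false;
  }

  // Returns the assigned seq, or null when the id is a duplicate.
  push(id, payload) {
    if (this.seenIds.has(id)) return null;
    const event = { seq: this.nextSeq++, id, payload };
    this.events.push(event);
    this.seenIds.add(id);
    if (this.events.length > this.capacity) {
      const evicted = this.events.shift();
      this.seenIds.delete(evicted.id);
    }
    return event.seq;
  }

  // Events newer than afterSeq, or null when some of them were already evicted.
  replay(afterSeq) {
    const oldest = this.events.length > 0 ? this.events[0].seq : this.nextSeq;
    if (afterSeq < oldest - 1) return null;
    return this.events.filter((event) => event.seq > afterSeq);
  }

  markCompleted() {
    this.completed = true;
  }
}
```

The null return from replay is the case a reconnecting client cannot recover from incrementally, which is the kind of state loss the JSONL-tail backend avoids by replaying from the file on disk.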

A1 — Live debug pane

| Suite | Tests | What it covers |
|-------|-------|----------------|
| src/dev/__tests__/wsDebugStore.test.js | 8 | Debug-pane ring buffer |
| src/dev/__tests__/useWsDebugStore.test.jsx | 1 | React hook over the store |
| src/dev/__tests__/useDebugToggle.test.jsx | 6 | Cmd+Shift+Y toggle + localStorage persistence |
| src/dev/__tests__/debugToggleStore.test.js | 5 | Toggle singleton store |
| src/dev/__tests__/downloadFrames.test.js | 3 | JSONL serialise + browser download |
| src/dev/__tests__/ChatStreamDebugPane.test.jsx | 12 | The component (rows, expand, pause, clear, download, filter, source colours) |
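
The JSONL serialise step named in the downloadFrames row is small enough to sketch. The function names below are hypothetical and the browser download half (Blob plus a temporary link) is left out. The sketch only shows why one JSON document per line round-trips safely: JSON.stringify escapes embedded newlines.

```javascript
// Hypothetical helpers: one JSON document per line, newline-terminated.
function toJsonl(frames) {
  if (frames.length === 0) return '';
  return frames.map((frame) => JSON.stringify(frame)).join('\n') + '\n';
}

function fromJsonl(text) {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}
```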

Manual validation done outside the test suite

  • Y backend end-to-end via Playwright + browser console (2026-05-09): ran REPLAY_BACKEND=jsonl npm run dev, logged in, opened a real session, and sent a synthetic chat-subscribe over the live WebSocket. A fresh subscribe returned a cursor at EOF=194524 with the correct path resolved. Rewinding the cursor to byteOffset=0 returned 27 events with monotonic seq 1-27, all type=message_streamed operation=replay, and the mapper output matched the live wire shape. This is the proof of life that motivated building Layer 3 below.

5. Coverage Measurement

Current baseline (2026-05-06)

npm run test:coverage produces a CLI summary plus coverage/index.html and coverage/coverage-summary.json. Reports are gitignored.

Unit-test scope (integration excluded):

| Metric | Coverage |
|--------|----------|
| Lines | 1.74% (846 / 48,372) |
| Statements | 1.69% (885 / 52,332) |
| Functions | 1.67% (126 / 7,508) |
| Branches | 1.10% (477 / 43,180) |

The aggregate is low because most of the codebase has no tests. Coverage is concentrated in a few well-tested modules.

Where coverage is strong (>70%)

| File | Coverage source |
|------|-----------------|
| server/services/jsonlTailCursor.js | Y Task 1: round-trip + 5 validation paths |
| server/services/jsonlTailService.js | Y Task 2+4: fingerprint, tailFile, freshCursor, resolveTranscriptPath |
| server/services/jsonlReplayMapper.js | Y Task 3: every envelope type + skip-list + fixture |
| server/websocket/eventRingHandler.js | Both backends covered behind feature-flag tests (21) |
| server/services/event-ring-cursor.js | B.1: 100% (will be deleted in Y Phase 3) |
| server/services/event-ring.js | B.1: 91% (will be deleted in Y Phase 3) |
| server/services/event-ring-config.js | B.1: 89% (will be deleted in Y Phase 3) |
| server/services/event-ring-sweeper.js | B.1: 71% (will be deleted in Y Phase 3) |
| server/services/scheduler.js | 90% |
| server/services/schedulerLogger.js | 100% |
| server/conversations/compaction.js | 79% |
| src/utils/workflowParser.js | 98% |
| src/utils/workflowSerializer.js | 90% |

(File-level percentages need a fresh npm run test:coverage run after Y. Until that refresh, treat the table above as qualitative.)

Where coverage is absent (the chat-pipeline gap)

The most consequential code in the system is essentially uncovered:

| File | Lines | Coverage |
|------|-------|----------|
| src/components/ChatInterface.jsx | 2,842 | 0% |
| server/claude-cli.js | 1,846 | 0% |
| src/hooks/useProjectWebSocketV2.js | 1,032 | 0% |
| src/reducers/projectReducer.js | 446 | 7% |
| src/utils/websocket.js | 301 | 0% |
| src/utils/messageNormalizer.js | 123 | 0% |

~6,500 lines of the most consequential code at essentially zero coverage. This is the gap the planned A2 replay-test harness is designed to close — see docs/superpowers/specs/2026-05-02-chat-replay-testing-design.md. Y has improved server-side coverage but the client renderer remains untested.

Coverage thresholds — not yet set

A 1.74% baseline can't carry a meaningful CI gate. vitest.config.ts has the coverage block configured but no thresholds. The right time to add per-area thresholds is after A2 lands and the chat pipeline gets real coverage.
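
For when that time comes, this is a sketch of what per-area thresholds could look like in the coverage block. Vitest accepts glob patterns as keys under coverage.thresholds. The globs and every number below are placeholders rather than measured or agreed values, and the reporter list is only inferred from the report files named above.

```javascript
// vitest.config.ts (fragment). Plain JS syntax, so it is valid in the .ts file as written.
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',
      // Assumed reporters: CLI summary, coverage/index.html, coverage/coverage-summary.json.
      reporter: ['text', 'html', 'json-summary'],
      thresholds: {
        // Hypothetical per-area floors; placeholder numbers only.
        'server/services/jsonlTail*.js': { lines: 70, functions: 70 },
        'src/utils/messageNormalizer.js': { lines: 50 },
      },
    },
  },
});
```

Per-glob floors avoid the problem described above: a global threshold would be meaningless at a 1.74% aggregate, while a floor scoped to already-covered modules can block regressions without waiting for the rest of the codebase.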

6. Build Validation

| Check | Command | Blocking? |
|-------|---------|-----------|
| ESLint (server only) | npm run lint | Yes (CI blocks) |
| Frontend build | npx vite build | Yes |
| Coverage report | npm run test:coverage | No (informational) |
| TypeScript | N/A (no TypeScript) | N/A |

ESLint scope is server/ only; frontend is validated via successful Vite build.

7. Critical Flows — Coverage Status

P0 — Must Not Break

| Flow | Coverage |
|------|----------|
| User registration (first user = admin) | Not tested |
| User login / JWT generation | Not tested |
| Chat message → Claude CLI → streaming response | Server event-ring layer only (B.1, ~95%); rendering path 0% |
| File upload and download | Not tested |
| Scheduled prompt execution | Integration tests exist |
| Reconnect → cursor-based replay (ring) | B.1 covers (47 tests) |
| Reconnect → cursor-based replay (JSONL) | Y covers server-side (58 tests + manual E2E proof). Client cutover (B.2) not shipped; flag REPLAY_BACKEND=jsonl is dormant in production until B.2 lands. |

P1 — Should Be Tested

| Flow | Coverage |
|------|----------|
| Password reset | Not tested |
| Skill CRUD and execution | Not tested |
| Admin user management | Not tested |
| AI provider configuration | Not tested |
| Onboarding | Not tested |
| Cloud drive OAuth + mount | Not tested |
| Git operations | Not tested |
| Workflow editor parsing | Unit tests exist (98%) |

P2 — Nice to Have

| Flow | Coverage |
|------|----------|
| Meeting transcription | Integration tests exist |
| MCP service registration | MCP tests exist |
| Conversation compaction | Unit tests exist (79%) |
| Analytics report generation | Not tested |
| Output style management | Not tested |
| Live debug pane (dev tool) | A1 covers (30 tests, ~90%) |

8. Strategic Direction

The testing approach is layered, ordered tightest, cheapest, and fastest first.

Layer 1 — Unit tests (existing, well-suited for pure logic)

Reducers, parsers, ring buffers, codecs, mappers, scheduler logic. Covered today where the modules exist and are isolated. Y's three new modules (cursor codec, tail service, replay mapper) are exemplary — pure functions, fixture-driven, fast.
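
To make "pure functions, fixture-driven" concrete, here is a minimal sketch in the spirit of the replay mapper. The skip-list entries and the type=message_streamed / operation=replay / seq fields come from this document. The function names and the data field that carries the envelope are assumptions, not the real wire shape.

```javascript
// Envelope types that are never replayed to the client (skip-list from the inventory above).
const SKIP_TYPES = new Set(['queue-operation', 'last-prompt', 'permission-mode']);

// Maps one raw JSONL line to a replay event, or null when the line is skipped.
function mapJsonlLine(line, seq) {
  let envelope;
  try {
    envelope = JSON.parse(line);
  } catch {
    return null; // unparseable line: nothing to replay
  }
  if (envelope === null || typeof envelope !== 'object') return null;
  if (SKIP_TYPES.has(envelope.type)) return null;
  // The envelope is passed through untouched, so fields such as tool_use_result survive.
  return { type: 'message_streamed', operation: 'replay', seq, data: envelope };
}

// seq is assigned only to kept events, so it is contiguous starting at 1.
function mapJsonlLines(lines) {
  const events = [];
  for (const line of lines) {
    const event = mapJsonlLine(line, events.length + 1);
    if (event) events.push(event);
  }
  return events;
}
```

Because the mapper takes strings and returns plain objects, a unit test needs only a fixture array of lines and no filesystem, socket, or mock.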

Layer 2 — Integration tests (existing, gated on infrastructure)

Scheduler API, tasks API, meeting API, Y's jsonl-tail-replay-e2e.test.js (real on-disk JSONL via tmp dirs). The Y integration test demonstrates the right pattern: spin up a tmp file, exercise the real modules, no mocks. Older integration tests need .env.test (npm run test:integration); the Y one runs in plain npm test.
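
The tmp-file pattern can be sketched with a stand-in for tailFile. The signature and return shape below are assumptions and the real jsonlTailService.js may differ. What the sketch demonstrates is partial-line safety: bytes after the last newline are not consumed, so a line that is still being written is picked up whole on the next call.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical stand-in: returns complete lines from byteOffset onward.
// A trailing fragment with no newline is left in place for the next call.
function tailFile(filePath, byteOffset) {
  const size = fs.statSync(filePath).size;
  if (byteOffset >= size) return { lines: [], nextOffset: byteOffset };
  const fd = fs.openSync(filePath, 'r');
  try {
    const buf = Buffer.alloc(size - byteOffset);
    const bytesRead = fs.readSync(fd, buf, 0, buf.length, byteOffset);
    const chunk = buf.subarray(0, bytesRead);
    const lastNewline = chunk.lastIndexOf(0x0a);
    if (lastNewline === -1) return { lines: [], nextOffset: byteOffset };
    const text = chunk.subarray(0, lastNewline).toString('utf8');
    const lines = text.split('\n').filter((line) => line.length > 0);
    return { lines, nextOffset: byteOffset + lastNewline + 1 };
  } finally {
    fs.closeSync(fd);
  }
}

// Exercise it against a real file in a throwaway directory, no mocks.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'tail-'));
const file = path.join(dir, 'session.jsonl');
fs.writeFileSync(file, '{"n":1}\n{"n":2}\n{"n":'); // third line is mid-write
const first = tailFile(file, 0);                    // sees only the two complete lines
fs.appendFileSync(file, '3}\n');                    // writer finishes the third line
const second = tailFile(file, first.nextOffset);    // now sees the third line whole
fs.rmSync(dir, { recursive: true, force: true });
```

Working in bytes rather than characters matters here because the cursor stores a byte offset, and slicing a decoded string could split a multi-byte character.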

Layer 3 — WebSocket protocol E2E (shipped 2026-05-09)

Status: shipped. claudecodeui/server/__tests__/e2e/chat-subscribe-jsonl.test.js — 8 scenarios, ~250 lines, ~460ms.

Implementation choice: in-process minimal Express+WS harness (not a child process spawn of server/index.js). The full server has heavy import-time side effects (DB migrations, scheduler init, ~7700 lines) that are not relevant to the protocol contract under test, so the harness imports handleChatSubscribe directly and dispatches messages to it from a WebSocketServer bound to an ephemeral port. JWT auth is intentionally skipped — that layer is enforced by verifyClient on the production WSS and is independently covered.

Coverage:

  1. Fresh subscribe → empty replay + EOF cursor
  2. Cursor at byteOffset=0 with current fingerprint → all events with monotonic seq starting at 1
  3. Append a line, resume from prior cursor → exactly 1 new event
  4. Compaction (delete + recreate, changes inode) → cursor-expired with reason fingerprint-mismatch
  5. Malformed cursor → cursor-expired with reason malformed-or-mismatched-cursor
  6. Missing project context → subscribe-error
  7. Unknown sessionId on fresh subscribe → empty replay (graceful)
  8. Unknown sessionId with cursor → subscribe-error (path unresolvable)
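
The eight scenarios reduce to a small decision table, sketched below as a pure function over facts gathered beforehand. The function name, argument shapes, and the detail strings on subscribe-error are hypothetical; only the two cursor-expired reason strings come from the list above. The order of checks is also an assumption, for example whether a malformed cursor is rejected before or after path resolution.

```javascript
// request.cursor: undefined = fresh subscribe, null = failed to decode,
//                 otherwise { byteOffset, identity }.
// file: null when the transcript path cannot be resolved, else { identity, size }.
function decideSubscribe({ hasProjectContext, cursor }, file) {
  if (!hasProjectContext) {
    return { outcome: 'subscribe-error', detail: 'missing project context' };       // scenario 6
  }
  if (cursor === null) {
    return { outcome: 'cursor-expired', reason: 'malformed-or-mismatched-cursor' }; // scenario 5
  }
  const fresh = cursor === undefined;
  if (file === null) {
    return fresh
      ? { outcome: 'empty-replay' }                                                 // scenario 7
      : { outcome: 'subscribe-error', detail: 'transcript path unresolvable' };     // scenario 8
  }
  if (fresh) {
    return { outcome: 'empty-replay', nextOffset: file.size };                      // scenario 1
  }
  if (cursor.identity !== file.identity) {
    return { outcome: 'cursor-expired', reason: 'fingerprint-mismatch' };           // scenario 4
  }
  return { outcome: 'replay', fromOffset: cursor.byteOffset };                      // scenarios 2 and 3
}
```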

This is the natural place to catch fingerprint-comparison regressions (e.g., the Task 5 sameFileIdentity blind spot — currently dev:ino only) automatically. A2 Docker validation will additionally watch the was=…/now=… log line for in-place compaction cases that this test cannot reproduce on a local filesystem.
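
The blind spot can be illustrated with a sketch that builds the six-field fingerprint and compares it with the dev:ino identity. The hashing details (SHA-256 over up to 256 bytes at each end, truncated to 16 hex characters) are assumptions, since this document does not say how firstHash and lastHash are computed. A truncating overwrite stands in for compaction here. Whether the CLI's real compaction behaves this way is what the Docker validation is meant to establish.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

const HASH_WINDOW = 256; // bytes hashed at each end of the file (assumed value)

function hashRange(fd, start, length) {
  const buf = Buffer.alloc(length);
  const n = length > 0 ? fs.readSync(fd, buf, 0, length, start) : 0;
  return crypto.createHash('sha256').update(buf.subarray(0, n)).digest('hex').slice(0, 16);
}

// dev:ino:mtime:size:firstHash:lastHash, as named in the test inventory.
function fingerprint(filePath) {
  const fd = fs.openSync(filePath, 'r');
  try {
    const st = fs.fstatSync(fd);
    const window = Math.min(HASH_WINDOW, st.size);
    const firstHash = hashRange(fd, 0, window);
    const lastHash = hashRange(fd, st.size - window, window);
    return [st.dev, st.ino, Math.floor(st.mtimeMs), st.size, firstHash, lastHash].join(':');
  } finally {
    fs.closeSync(fd);
  }
}

// Identity check limited to the dev:ino prefix, as described above.
function fileIdentity(fp) {
  return fp.split(':').slice(0, 2).join(':');
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'fp-'));
const file = path.join(dir, 'session.jsonl');
fs.writeFileSync(file, '{"n":1}\n{"n":2}\n');
const before = fingerprint(file);
fs.writeFileSync(file, '{"summary":"compacted"}\n'); // overwrite in place, same inode
const after = fingerprint(file);
fs.rmSync(dir, { recursive: true, force: true });
```

After the overwrite the full fingerprints differ while the dev:ino identities match, so a comparison limited to dev:ino would treat the rewritten file as unchanged.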

Layer 4 — Replay test harness for renderer (planned, A2)

The chat pipeline's data shape is too variable for hand-written fixtures. A2 (docs/superpowers/specs/2026-05-02-chat-replay-testing-design.md) builds a harness that:

  • Captures real ~/.claude/projects/**/*.jsonl session files and runs them through the server in replay mode
  • Records resulting WebSocket frames
  • Replays them through the React component tree in jsdom
  • Asserts no events drop (D1/D2/E completeness invariants) and no out-of-order rendering (F invariant)

This is the right tool for testing ChatInterface.jsx and messageNormalizer.js because hand-written tests cannot keep up with envelope-shape variance. Most useful after B.2 ships, because before then the client doesn't actually subscribe.
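
As a rough illustration of the kind of assertion such a harness would make, the sketch below checks a recorded frame sequence for gaps, duplicates, and reordering by seq. It is not the D1/D2/E/F invariant definitions from the A2 spec, which are not reproduced here. The function name and the assumption that seq starts at 1 and increases by 1 are for illustration only.

```javascript
// frames: array of { seq, ... } in arrival order.
// Returns a list of human-readable violations; an empty list means the check passed.
function checkReplayInvariants(frames, expectedCount) {
  const violations = [];
  const seen = new Set();
  let highest = 0;
  for (const frame of frames) {
    if (seen.has(frame.seq)) {
      violations.push(`duplicate seq ${frame.seq}`);
    } else if (frame.seq < highest) {
      violations.push(`out-of-order seq ${frame.seq} after ${highest}`);
    } else if (frame.seq > highest + 1) {
      violations.push(`gap: expected ${highest + 1}, got ${frame.seq}`);
    }
    seen.add(frame.seq);
    highest = Math.max(highest, frame.seq);
  }
  if (seen.size !== expectedCount) {
    violations.push(`expected ${expectedCount} events, saw ${seen.size}`);
  }
  return violations;
}
```

Returning the full list instead of failing on the first violation makes a failing replay easier to diagnose, since drops and reordering often occur together.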

Layer 5 — Browser E2E (Playwright, scaffolded but not built)

Playwright is in devDependencies but is not yet used by any automated suite. A focused suite under claudecodeui/e2e/ would cover golden flows:

  • Login → project list → open existing chat → assert no [object Object] placeholders, no duplicates
  • Send message → assert it streams → reload page → assert history is intact (the "no drops on reconnect" property)
  • Toggle the A1 debug pane and assert specific event types appear / don't (catches noise-class regressions once B.3 ships signal classification)

Some of this is meaningful before B.2 (login + history rendering); the chat-subscribe assertions become meaningful only after B.2.

Recommended build order

  1. Layer 3 first — Shipped 2026-05-09. Going forward, extend this suite as new chat-subscribe edge cases surface.
  2. A2 Docker validation — exercise the in-place compaction case the local-filesystem Layer 3 test cannot reproduce; watch [chat-subscribe] reason=fingerprint-mismatch was=…/now=… for the sameFileIdentity blind spot.
  3. Layer 5 scaffolding — set up Playwright config, write the login + project list test as the first stable scenario. Cheap.
  4. A2 replay harness (Layer 4) after B.2 ships — closes the renderer gap.

9. CI Integration

GitHub Actions Pipeline

PR opened → ci.yml:
  1. npm install
  2. npm run lint (server)
  3. npx vite build (frontend)
  4. npm run test:integration (optional)
  5. CodeQL analysis (security)
  6. Semgrep scan (SAST)
  7. OSV scan (dependencies)

npm run test:coverage is not yet wired into CI. It will be added with thresholds once A2 raises coverage above the trivial baseline.

Security Scanning in CI

| Scanner | Focus | Blocking? |
|---------|-------|-----------|
| CodeQL | Code patterns, injection, auth issues | Advisory |
| Semgrep | SAST rules, OWASP patterns | Advisory |
| OSV | Known dependency vulnerabilities | Advisory |
| Container scan | Docker image vulnerabilities | Advisory |

10. Test Accounts

Local Development

  • Username: lindsay / Password: password
  • Login confirmed working on local dev as of 2026-05-09 (the earlier note that local login was broken is stale)
  • Note: npm run dev only loads .env, not .env.local, so the backend binds to whatever .env says (3005) — visit localhost:3005 directly. Vite at 5173 has a proxy mismatch with this configuration.

Docker Development

  • First user registered becomes admin
  • No pre-seeded test accounts

11. Known Gaps

Major

  1. Chat-pipeline rendering: 0% covered. Affects ChatInterface.jsx, useProjectWebSocketV2.js, messageNormalizer.js, websocket.js, claude-cli.js. Tracked by A2 spec.
  2. B.2 client cutover not shipped: server-side Y is ready and working (verified end-to-end manually), but no client code sends chat-subscribe, so REPLAY_BACKEND=jsonl stays dormant in production until B.2 lands.
  3. Auth: untested — Registration, login, JWT validation, admin gating.
  4. File operations: untested — Upload, download, safePath traversal prevention.
  5. No WS-protocol E2E in CI: closed 2026-05-09 by chat-subscribe-jsonl.test.js (Layer 3 above, 8 scenarios).
  6. No browser E2E — Playwright installed but unused.
  7. Signal classification (B.3) not started — goal 3 (noise reduction) has only the C0 tactical fix in place (isSystemJson recognises rate_limit_info envelopes since 2026-05-09). The strategic typed-envelope normaliser has no coverage and no work in flight. The [Message format not recognized] fallback at messageNormalizer.js:170 is no longer triggered by rate_limit_info, but other unknown envelopes survive Y entirely.
  8. No accessibility tests — No automated a11y checks.
  9. No performance baselines — No load testing, no streaming latency benchmarks.

Recommended Priority

  1. Layer 3 WS-protocol E2E — Shipped 2026-05-09. server/__tests__/e2e/chat-subscribe-jsonl.test.js. Gives B.2 a stable contract to land against.
  2. Y Phase 1 Docker validation — exercise the JSONL backend in production-like Docker before Phase 3 ring deletion.
  3. Auth integration tests — registration, login, protected-route gating.
  4. A2 replay-test harness — closes the rendering coverage gap; most valuable after B.2 ships.
  5. File operations integration tests — upload, safePath validation, path-traversal prevention.
  6. Coverage thresholds — after A2 raises the baseline, set per-area floors so regressions block PRs.
  7. Playwright golden flows — login + history-render today (cheap); chat-subscribe assertions after B.2.

12. Companion Documents

Y — JSONL tail replay (current Phase 0 work)

  • docs/superpowers/specs/2026-05-06-replay-via-jsonl-tail-design.md — design spec post-Codex review
  • docs/superpowers/plans/2026-05-06-jsonl-tail-replay-y.md — implementation plan, Tasks 1-9
  • docs/superpowers/reviews/2026-05-06-codex-second-opinion-jsonl-tail.md — independent review

A1/A2/A3 — Diagnostic + replay harness

  • docs/superpowers/specs/2026-05-02-chat-replay-testing-design.md — A1 (shipped), A2 (planned), A3 (complete)
  • docs/superpowers/plans/2026-05-04-chat-debug-pane-a1.md — A1 implementation plan

B — Architectural fix family

  • docs/superpowers/specs/2026-05-05-chat-event-replay-ring-design.md — B (B.1 shipped, B.2/B.3 pending)
  • docs/superpowers/plans/2026-05-05-event-replay-ring-b1.md — B.1 plan (superseded by Y)
  • docs/superpowers/plans/2026-05-06-client-cutover-b2-redesign.md — B.2 redesign plan
  • docs/superpowers/reviews/2026-05-05-codex-second-opinion-chat-pipeline.md — first Codex review