16. Testing Strategy
1. Why This Exists
The chat pipeline is the most consequential code in Sasha — it touches every user interaction — and historically has been the least tested. The test strategy is anchored on five product-quality goals; each test layer below maps to one or more of them.
| # | Goal | What "good" looks like | Where coverage lives |
|---|---|---|---|
| 1 | Stable chat session | No dropped or duplicated messages on reconnect, restart, or compaction | Y (server, shipped Phase 0); B.2 (client, pending); WebSocket protocol E2E (Layer 3 below). |
| 2 | Don't lose messages we ought to surface | Anything Anthropic's CLI persists to JSONL is replayable to the client | Y mapper + tail service unit tests; A2 replay harness (planned). |
| 3 | Reduce noisy messages | [Message format not recognized] never reaches a user; status events don't pollute the transcript | Tactical: 10 unit tests on messageNormalizer.isSystemJson cover the rate_limit_info envelope (Phase C0, 2026-05-09). Strategic coverage depends on B.3 (signal classification), not yet started. |
| 4 | Stay consistent with stdout/stderr | Live-stream and replay produce semantically equivalent client state | Mapper unit tests assert wire-shape parity with messageStreamHandler.js broadcast payload. |
| 5 | Debuggable | A developer can grep one log prefix and see what every chat-subscribe did | A1 debug pane (shipped, 30 tests); [chat-subscribe] server log (shipped, no automated assertion yet). |
The biggest unaddressed gap is goal 3 (noise) — there is no work in progress on signal classification. The biggest in-flight gap is goal 1 — Y's server work is shipped behind a feature flag, but the client (B.2) still uses the legacy path so the feature is dormant in production.
For program context, see docs/superpowers/specs/2026-05-06-replay-via-jsonl-tail-design.md (Y) and docs/superpowers/plans/2026-05-06-client-cutover-b2-redesign.md (B.2).
2. Test Framework
| Tool | Purpose |
|---|---|
| Vitest | Test runner (unit + integration). Configured in claudecodeui/vitest.config.ts. |
| `@testing-library/react` | Component + hook tests in jsdom |
| `@vitest/coverage-v8` | Coverage measurement (V8-native, fast) |
| jsdom | Browser-environment polyfill for component/hook tests |
| better-sqlite3 | In-memory database for test isolation |
3. Running Tests
cd claudecodeui
npm test # Run once
npm run test:watch # Watch mode
npm run test:coverage # Run + emit reports under coverage/
npm run test:integration # Server integration tests (uses --env-file=.env.test)
npm run lint # ESLint (server/ only)
4. Current Test Inventory
~31 test files in claudecodeui/. The chat-pipeline coverage breakdown below is the load-bearing part — most of the rest is incidental.
By area
| Area | Test files | Test count | Status |
|---|---|---|---|
| JSONL-tail replay backend (Y — 2026-05-08) | 5 | 58 | All passing — Y Phase 0 (Tasks 1-7) |
| Chat event ring (B.1 — 2026-05-05) | 7 | 47 | All passing — slated for deletion in Y Phase 3 |
| Live debug pane (A1 — 2026-05-04) | 6 | 30 | All passing |
| Workflow parsing | 2 | substantial | All passing |
| Scheduler | 4 | substantial | 1 file failing (mock issue, pre-existing) |
| Conversation compaction | 1 | — | All passing |
| MCP integrations (docsidecar) | 7 | — | All passing |
| Project state reducer | 1 | 1 | All passing (post-orphan-cleanup) |
| Server integration (live infra needed) | 4 | 4 failing | Need live DB/server |
| Other server services | 2 | mixed | promptMaterializer fs-dependent |
The pre-existing 16-failure count is unchanged from before Y; the failures are either live-infra-gated or the pre-existing scheduler mock issue, and none is related to recent code.
Chat-pipeline-related coverage detail
Y — JSONL tail replay backend (2026-05-08, Phase 0)
| Suite | Tests | What it covers |
|---|---|---|
| `server/services/__tests__/jsonlTailCursor.test.js` | 6 | Opaque base64url cursor codec; structure/type/version validation; jsonlPath excluded (security) |
| `server/services/__tests__/jsonlTailService.test.js` | 15 | fingerprint (dev:ino:mtime:size:firstHash:lastHash); tailFile partial-line safety; freshCursor; resolveTranscriptPath candidate-iteration |
| `server/services/__tests__/jsonlReplayMapper.test.js` | 11 | JSONL envelope → wire-format message_streamed event; skip-list (queue-operation, last-prompt, permission-mode); preserves tool_use_result (NOT stripped) |
| `server/websocket/__tests__/eventRingHandler.test.js` | 21 | Both backends behind REPLAY_BACKEND flag: 12 ring tests + new JSONL coverage (fresh subscribe, cursor-at-EOF, malformed cursor, fingerprint mismatch, mid-read rewrite, missing project) |
| `server/__tests__/integration/jsonl-tail-replay-e2e.test.js` | 4 | End-to-end against real on-disk JSONL: write → tail → assert; append → tail → only new; compaction-style rewrite changes fingerprint; cursor round-trip |
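To make the cursor codec row concrete, here is a minimal sketch of an opaque base64url cursor with structure, type, and version validation. It is not the code in `jsonlTailCursor.js`: the function names `encodeCursor` and `decodeCursor`, the field names, and the version number are assumptions chosen for illustration. The one property taken from the table is that the JSONL path is never placed in the cursor.

```javascript
// Hypothetical sketch: names, fields, and version are illustrative, not the shipped module.
const CURSOR_VERSION = 1;

// Encode only what is needed to resume; the file path is deliberately left out.
function encodeCursor({ fingerprint, byteOffset }) {
  const payload = { v: CURSOR_VERSION, fingerprint, byteOffset };
  return Buffer.from(JSON.stringify(payload), 'utf8').toString('base64url');
}

// Returns the decoded cursor, or null for anything malformed or from another version.
function decodeCursor(token) {
  if (typeof token !== 'string' || token.length === 0) return null;
  let payload;
  try {
    payload = JSON.parse(Buffer.from(token, 'base64url').toString('utf8'));
  } catch {
    return null; // not valid JSON after decoding
  }
  if (payload === null || typeof payload !== 'object' || Array.isArray(payload)) return null;
  if (payload.v !== CURSOR_VERSION) return null;
  if (typeof payload.fingerprint !== 'string') return null;
  if (!Number.isSafeInteger(payload.byteOffset) || payload.byteOffset < 0) return null;
  return { fingerprint: payload.fingerprint, byteOffset: payload.byteOffset };
}

const token = encodeCursor({ fingerprint: 'example-fingerprint', byteOffset: 194524 });
console.log(decodeCursor(token));          // { fingerprint: 'example-fingerprint', byteOffset: 194524 }
console.log(decodeCursor('not-a-cursor')); // null
```

Returning `null` for every failure mode keeps the caller's decision simple: any cursor that does not decode cleanly is treated the same way, whatever the reason.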
B.1 — Event ring (slated for deletion in Y Phase 3)
| Suite | Tests | What it covers |
|---|---|---|
| `server/services/__tests__/event-ring.test.js` | 17 | Ring buffer: push, replay, eviction, dedup, markCompleted |
| `server/services/__tests__/event-ring-cursor.test.js` | 6 | Opaque base64url cursor codec, validation |
| `server/services/__tests__/event-ring-config.test.js` | 3 | Env-var parsing with fallback |
| `server/services/__tests__/event-ring-sweeper.test.js` | 5 | TTL eviction, max-sessions cap (time-mocked) |
| `server/services/__tests__/event-ring-sweeper-prep.test.js` | 1 | markCompleted smoke test |
| `server/__tests__/integration/event-ring-e2e.test.js` | 3 | Full pipeline: subscribe → push → reconnect → replay |
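For readers unfamiliar with the behaviours the ring tests name (push, replay, eviction, dedup, markCompleted), the sketch below shows one way such a buffer can work. The class `EventRing`, its method signatures, and the "return `null` when the requested range was evicted" rule are assumptions for illustration; the real `event-ring.js` may differ in all of them.

```javascript
// Hypothetical minimal event ring: bounded, ordered by seq, duplicate-safe.
class EventRing {
  constructor(capacity) {
    this.capacity = capacity;
    this.events = [];       // oldest first
    this.seen = new Set();  // ids currently buffered, used for dedup
    this.nextSeq = 1;
    this.completed = false;
  }

  // Returns the assigned seq, or null when the id is already buffered.
  push(id, payload) {
    if (this.seen.has(id)) return null;
    const event = { seq: this.nextSeq++, id, payload };
    this.events.push(event);
    this.seen.add(id);
    if (this.events.length > this.capacity) {
      const evicted = this.events.shift(); // drop the oldest event
      this.seen.delete(evicted.id);
    }
    return event.seq;
  }

  // Events strictly after afterSeq, or null if the next expected event was evicted.
  replay(afterSeq) {
    const oldest = this.events.length ? this.events[0].seq : this.nextSeq;
    if (afterSeq + 1 < oldest) return null;
    return this.events.filter((e) => e.seq > afterSeq);
  }

  markCompleted() {
    this.completed = true;
  }
}

const ring = new EventRing(2);
ring.push('a', {});
ring.push('b', {});
console.log(ring.push('a', {}));              // null (duplicate id)
ring.push('c', {});                           // evicts 'a'
console.log(ring.replay(1).map((e) => e.id)); // [ 'b', 'c' ]
console.log(ring.replay(0));                  // null (seq 1 was evicted)
```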
A1 — Live debug pane
| Suite | Tests | What it covers |
|---|---|---|
| `src/dev/__tests__/wsDebugStore.test.js` | 8 | Debug-pane ring buffer |
| `src/dev/__tests__/useWsDebugStore.test.jsx` | 1 | React hook over the store |
| `src/dev/__tests__/useDebugToggle.test.jsx` | 6 | Cmd+Shift+Y toggle + localStorage persistence |
| `src/dev/__tests__/debugToggleStore.test.js` | 5 | Toggle singleton store |
| `src/dev/__tests__/downloadFrames.test.js` | 3 | JSONL serialise + browser download |
| `src/dev/__tests__/ChatStreamDebugPane.test.jsx` | 12 | The component (rows, expand, pause, clear, download, filter, source colours) |
Manual validation done outside the test suite
- Y backend end-to-end via Playwright + browser console (2026-05-09): ran `REPLAY_BACKEND=jsonl npm run dev`, logged in, opened a real session, sent a synthetic `chat-subscribe` over the live WebSocket, and verified: fresh subscribe returned a cursor at EOF=194524 with the correct path resolved; rewinding the cursor to `byteOffset=0` returned 27 events with monotonic `seq 1-27`, all `type=message_streamed operation=replay`; mapper output matched the live wire shape. This is the proof of life that motivated building Layer 3 below into CI.
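The replay behaviour verified above (skip-listed envelope types dropped, every surviving line emitted as `type=message_streamed operation=replay` with a monotonic `seq`) can be sketched as below. This is an illustration under assumptions, not the shipped `jsonlReplayMapper.js`: the function name, the `message` wrapper field, the sample envelope fields, and the policy of ignoring unparseable lines are all hypothetical. The skip-list values come from the test inventory.

```javascript
// Skip-list taken from the mapper test description; everything else is illustrative.
const SKIPPED_TYPES = new Set(['queue-operation', 'last-prompt', 'permission-mode']);

// Map JSONL text to replay events, numbering only the envelopes that are emitted.
function mapJsonlToReplayEvents(jsonlText) {
  const events = [];
  let seq = 0;
  for (const line of jsonlText.split('\n')) {
    if (line.trim() === '') continue;
    let envelope;
    try {
      envelope = JSON.parse(line);
    } catch {
      continue; // assumed policy: ignore lines that are not valid JSON
    }
    if (envelope === null || typeof envelope !== 'object') continue;
    if (SKIPPED_TYPES.has(envelope.type)) continue;
    seq += 1;
    // The envelope is passed through whole, so fields such as tool_use_result survive.
    events.push({ type: 'message_streamed', operation: 'replay', seq, message: envelope });
  }
  return events;
}

// Made-up sample lines; real envelopes carry more fields.
const sample = [
  '{"type":"user","text":"hi"}',
  '{"type":"queue-operation"}',
  '{"type":"assistant","tool_use_result":{"ok":true}}',
  '',
].join('\n');
const events = mapJsonlToReplayEvents(sample);
console.log(events.map((e) => e.seq)); // [ 1, 2 ]
```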
5. Coverage Measurement
Current baseline (2026-05-06)
npm run test:coverage produces a CLI summary plus coverage/index.html and coverage/coverage-summary.json. Reports are gitignored.
Unit-test scope (integration excluded):
| Metric | Coverage |
|---|---|
| Lines | 1.74% (846 / 48,372) |
| Statements | 1.69% (885 / 52,332) |
| Functions | 1.67% (126 / 7,508) |
| Branches | 1.10% (477 / 43,180) |
The aggregate is low because most of the codebase has no tests. Coverage is concentrated in a few well-tested modules.
Where coverage is strong (>70%)
| File | Coverage source |
|---|---|
| `server/services/jsonlTailCursor.js` | Y Task 1: round-trip + 5 validation paths |
| `server/services/jsonlTailService.js` | Y Task 2+4: fingerprint, tailFile, freshCursor, resolveTranscriptPath |
| `server/services/jsonlReplayMapper.js` | Y Task 3: every envelope type + skip-list + fixture |
| `server/websocket/eventRingHandler.js` | Both backends covered behind feature-flag tests (21) |
| `server/services/event-ring-cursor.js` | B.1: 100% (will be deleted in Y Phase 3) |
| `server/services/event-ring.js` | B.1: 91% (will be deleted in Y Phase 3) |
| `server/services/event-ring-config.js` | B.1: 89% (will be deleted in Y Phase 3) |
| `server/services/event-ring-sweeper.js` | B.1: 71% (will be deleted in Y Phase 3) |
| `server/services/scheduler.js` | 90% |
| `server/services/schedulerLogger.js` | 100% |
| `server/conversations/compaction.js` | 79% |
| `src/utils/workflowParser.js` | 98% |
| `src/utils/workflowSerializer.js` | 90% |
(File-level percentages need a fresh npm run test:coverage run after Y. Treat the table above as qualitative until the next refresh.)
Where coverage is absent (the chat-pipeline gap)
The most consequential code in the system is essentially uncovered:
| File | Lines | Coverage |
|---|---|---|
| `src/components/ChatInterface.jsx` | 2,842 | 0% |
| `server/claude-cli.js` | 1,846 | 0% |
| `src/hooks/useProjectWebSocketV2.js` | 1,032 | 0% |
| `src/reducers/projectReducer.js` | 446 | 7% |
| `src/utils/websocket.js` | 301 | 0% |
| `src/utils/messageNormalizer.js` | 123 | 0% |
Together these six files hold about 6,600 lines of the most consequential code at essentially zero coverage. This is the gap the planned A2 replay-test harness is designed to close; see docs/superpowers/specs/2026-05-02-chat-replay-testing-design.md. Y has improved server-side coverage, but the client renderer remains untested.
Coverage thresholds — not yet set
A 1.74% baseline can't carry a meaningful CI gate. vitest.config.ts has the coverage block configured but no thresholds. The right time to add per-area thresholds is after A2 lands and the chat pipeline gets real coverage.
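As an illustration of what per-area floors could look like once the baseline is meaningful, the sketch below aggregates line coverage per path prefix from a parsed `coverage/coverage-summary.json` and reports areas below their floor. It assumes the report uses the usual json-summary layout (a `total` key plus one entry per file, each with `lines.total` and `lines.covered`). The function name `checkAreaFloors`, the floor values, and the sample data are hypothetical; Vitest's own threshold options in the coverage block are the other route and are not shown here.

```javascript
// summary: parsed coverage-summary.json; floors: hypothetical [{ prefix, lines }] entries.
function checkAreaFloors(summary, floors) {
  const failures = [];
  for (const { prefix, lines: floor } of floors) {
    let covered = 0;
    let total = 0;
    for (const [file, metrics] of Object.entries(summary)) {
      if (file === 'total') continue; // aggregate row, not a file
      if (!file.replace(/\\/g, '/').includes(prefix)) continue;
      covered += metrics.lines.covered;
      total += metrics.lines.total;
    }
    // An area with no matching files counts as 0% so a typo in a prefix fails loudly.
    const pct = total === 0 ? 0 : (covered / total) * 100;
    if (pct < floor) failures.push({ prefix, pct: Number(pct.toFixed(2)), floor });
  }
  return failures;
}

// Made-up numbers, for illustration only.
const sampleSummary = {
  total: { lines: { total: 300, covered: 150, skipped: 0, pct: 50 } },
  '/repo/server/services/a.js': { lines: { total: 100, covered: 90, skipped: 0, pct: 90 } },
  '/repo/server/services/b.js': { lines: { total: 100, covered: 60, skipped: 0, pct: 60 } },
  '/repo/src/components/c.jsx': { lines: { total: 100, covered: 0, skipped: 0, pct: 0 } },
};
const failures = checkAreaFloors(sampleSummary, [
  { prefix: 'server/services/', lines: 70 },
  { prefix: 'src/components/', lines: 10 },
]);
console.log(failures); // [ { prefix: 'src/components/', pct: 0, floor: 10 } ]
```

Aggregating per area rather than per file lets a well-tested directory absorb one thin file, which is usually what a regression gate wants.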
6. Build Validation
| Check | Command | Blocking? |
|---|---|---|
| ESLint (server only) | `npm run lint` | Yes (CI blocks) |
| Frontend build | `npx vite build` | Yes |
| Coverage report | `npm run test:coverage` | No (informational) |
| TypeScript | N/A (no TypeScript) | N/A |
ESLint scope is server/ only; frontend is validated via successful Vite build.
7. Critical Flows — Coverage Status
P0 — Must Not Break
| Flow | Coverage |
|---|---|
| User registration (first user = admin) | Not tested |
| User login / JWT generation | Not tested |
| Chat message → Claude CLI → streaming response | Server event-ring layer only (B.1, ~95%); rendering path 0% |
| File upload and download | Not tested |
| Scheduled prompt execution | Integration tests exist |
| Reconnect → cursor-based replay (ring) | B.1 covers (47 tests) |
| Reconnect → cursor-based replay (JSONL) | Y covers server-side (58 tests + manual E2E proof). Client cutover (B.2) not shipped — flag REPLAY_BACKEND=jsonl is dormant in production until B.2 lands. |
P1 — Should Be Tested
| Flow | Coverage |
|---|---|
| Password reset | Not tested |
| Skill CRUD and execution | Not tested |
| Admin user management | Not tested |
| AI provider configuration | Not tested |
| Onboarding | Not tested |
| Cloud drive OAuth + mount | Not tested |
| Git operations | Not tested |
| Workflow editor parsing | Unit tests exist (98%) |
P2 — Nice to Have
| Flow | Coverage |
|---|---|
| Meeting transcription | Integration tests exist |
| MCP service registration | MCP tests exist |
| Conversation compaction | Unit tests exist (79%) |
| Analytics report generation | Not tested |
| Output style management | Not tested |
| Live debug pane (dev tool) | A1 covers (30 tests, ~90%) |
8. Strategic Direction
The testing approach is layered, ordered "tightest, cheapest, fastest" first.
Layer 1 — Unit tests (existing, well-suited for pure logic)
Reducers, parsers, ring buffers, codecs, mappers, scheduler logic. Covered today where the modules exist and are isolated. Y's three new modules (cursor codec, tail service, replay mapper) are exemplary — pure functions, fixture-driven, fast.
Layer 2 — Integration tests (existing, gated on infrastructure)
Scheduler API, tasks API, meeting API, Y's jsonl-tail-replay-e2e.test.js (real on-disk JSONL via tmp dirs). The Y integration test demonstrates the right pattern: spin up a tmp file, exercise the real modules, no mocks. Older integration tests need .env.test (npm run test:integration); the Y one runs in plain npm test.
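The tmp-file pattern described above (real file, real reads, no mocks) can be sketched with Node's standard library alone. The `tailFile` below is a simplified stand-in, not the function in `jsonlTailService.js`: its signature and return shape are assumptions, and it reads the whole file for brevity. It does show the partial-line safety idea: a trailing line without a newline is held back until it is complete.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical stand-in: return complete lines after byteOffset and where to resume.
function tailFile(filePath, byteOffset) {
  const buf = fs.readFileSync(filePath).subarray(byteOffset);
  const lastNewline = buf.lastIndexOf(0x0a); // last '\n' byte in the unread region
  if (lastNewline === -1) return { lines: [], nextOffset: byteOffset };
  const complete = buf.subarray(0, lastNewline + 1).toString('utf8');
  const lines = complete.split('\n').filter((l) => l !== '');
  return { lines, nextOffset: byteOffset + lastNewline + 1 };
}

// Write to a throwaway directory, tail, append, tail again from the saved offset.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'jsonl-tail-'));
const file = path.join(dir, 'session.jsonl');
fs.writeFileSync(file, '{"n":1}\n{"n":2}\n{"n":'); // third line is still being written
const first = tailFile(file, 0);
fs.appendFileSync(file, '3}\n');                    // the writer finishes the line
const second = tailFile(file, first.nextOffset);
fs.rmSync(dir, { recursive: true, force: true });

console.log(first.lines.length, second.lines); // 2 [ '{"n":3}' ]
```

Because the offset only advances past the last newline, the half-written third line is never surfaced as a truncated event and is picked up whole on the next read.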
Layer 3 — WebSocket protocol E2E (shipped 2026-05-09)
Status: shipped as `claudecodeui/server/__tests__/e2e/chat-subscribe-jsonl.test.js` (8 scenarios, ~250 lines, ~460ms).
Implementation choice: in-process minimal Express+WS harness (not a child process spawn of server/index.js). The full server has heavy import-time side effects (DB migrations, scheduler init, ~7700 lines) that are not relevant to the protocol contract under test, so the harness imports handleChatSubscribe directly and dispatches messages to it from a WebSocketServer bound to an ephemeral port. JWT auth is intentionally skipped — that layer is enforced by verifyClient on the production WSS and is independently covered.
Coverage:
- Fresh subscribe → empty replay + EOF cursor
- Cursor at byteOffset=0 with current fingerprint → all events with monotonic seq starting at 1
- Append a line, resume from prior cursor → exactly 1 new event
- Compaction (delete + recreate, changes inode) → `cursor-expired` with reason `fingerprint-mismatch`
- Malformed cursor → `cursor-expired` with reason `malformed-or-mismatched-cursor`
- Missing project context → `subscribe-error`
- Unknown sessionId on fresh subscribe → empty replay (graceful)
- Unknown sessionId with cursor → `subscribe-error` (path unresolvable)
This is the natural place to catch fingerprint-comparison regressions (e.g., the Task 5 sameFileIdentity blind spot — currently dev:ino only) automatically. A2 Docker validation will additionally watch the was=…/now=… log line for in-place compaction cases that this test cannot reproduce on a local filesystem.
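As a generic illustration of why an identity check limited to `dev:ino` can miss a rewrite, the sketch below compares a device-and-inode identity with a content-sensitive fingerprint across a rewrite that keeps the same path. It does not reproduce the CLI's compaction behaviour or the Docker case, and it is not the project's `sameFileIdentity`: `fileIdentity` and `contentFingerprint` are hypothetical helpers, and the fingerprint is a reduced form of the `dev:ino:mtime:size:firstHash:lastHash` scheme (mtime omitted to keep the example deterministic).

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Identity only: device and inode say which file this is, not what it contains.
function fileIdentity(filePath) {
  const st = fs.statSync(filePath);
  return `${st.dev}:${st.ino}`;
}

// Identity plus content-sensitive parts: size and hashes of the first and last lines.
function contentFingerprint(filePath) {
  const st = fs.statSync(filePath);
  const lines = fs.readFileSync(filePath, 'utf8').split('\n').filter((l) => l !== '');
  const hash = (s) => crypto.createHash('sha256').update(s).digest('hex').slice(0, 12);
  const firstHash = hash(lines[0] ?? '');
  const lastHash = hash(lines[lines.length - 1] ?? '');
  return [st.dev, st.ino, st.size, firstHash, lastHash].join(':');
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'fingerprint-'));
const file = path.join(dir, 'session.jsonl');
fs.writeFileSync(file, '{"n":1}\n{"n":2}\n{"n":3}\n');
const before = { identity: fileIdentity(file), fingerprint: contentFingerprint(file) };

// Rewrite through the same path: the existing file is truncated and refilled.
fs.writeFileSync(file, '{"summary":"compacted"}\n');
const after = { identity: fileIdentity(file), fingerprint: contentFingerprint(file) };
fs.rmSync(dir, { recursive: true, force: true });

const identityUnchanged = before.identity === after.identity;
const fingerprintChanged = before.fingerprint !== after.fingerprint;
console.log({ identityUnchanged, fingerprintChanged });
```

On typical POSIX filesystems the truncating write keeps the inode, so the identity comparison reports "same file" while the line hashes reveal that the content was replaced.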
Layer 4 — Replay test harness for renderer (planned, A2)
The chat pipeline's data shape is too variable for hand-written fixtures. A2 (docs/superpowers/specs/2026-05-02-chat-replay-testing-design.md) builds a harness that:
- Captures real `~/.claude/projects/**/*.jsonl` session files and runs them through the server in replay mode
- Records resulting WebSocket frames
- Replays them through the React component tree in jsdom
- Asserts no events drop (D1/D2/E completeness invariants) and no out-of-order rendering (F invariant)
This is the right tool for testing ChatInterface.jsx and messageNormalizer.js because hand-written tests cannot keep up with envelope-shape variance. Most useful after B.2 ships, because before then the client doesn't actually subscribe.
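The kind of assertion such a harness makes can be sketched without React. The function below takes the ids of recorded frames and the ids found in a rendered transcript and reports drops, reorderings, and duplicates. It is only a sketch under assumptions: the D1/D2/E/F invariants are defined in the A2 spec and are not reproduced here, `checkReplayInvariants` is a hypothetical name, and frames are assumed to carry unique ids.

```javascript
// recorded: frame ids in arrival order; rendered: message ids top to bottom in the transcript.
function checkReplayInvariants(recorded, rendered) {
  const renderedSet = new Set(rendered);
  // Completeness: every recorded frame should appear in the transcript.
  const dropped = recorded.filter((id) => !renderedSet.has(id));

  // Ordering: rendered ids should follow the recorded arrival order.
  const position = new Map(recorded.map((id, i) => [id, i]));
  const outOfOrder = [];
  let last = -1;
  for (const id of rendered) {
    const pos = position.get(id);
    if (pos === undefined) continue; // rendered-only items are out of scope for this check
    if (pos < last) outOfOrder.push(id);
    else last = pos;
  }

  // Duplicates: an id rendered more than once (reported once per extra occurrence).
  const duplicated = rendered.filter((id, i) => rendered.indexOf(id) !== i);
  return { dropped, outOfOrder, duplicated };
}

const report = checkReplayInvariants(['a', 'b', 'c', 'd'], ['a', 'c', 'b']);
console.log(report); // { dropped: [ 'd' ], outOfOrder: [ 'b' ], duplicated: [] }
```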
Layer 5 — Browser E2E (Playwright, scaffolded but not built)
Playwright is in devDependencies. A focused suite under claudecodeui/e2e/ would cover golden flows:
- Login → project list → open existing chat → assert no `[object Object]` placeholders, no duplicates
- Send message → assert it streams → reload page → assert history is intact (the "no drops on reconnect" property)
- Toggle the A1 debug pane and assert specific event types appear / don't (catches noise-class regressions once B.3 ships signal classification)
Some of this is meaningful before B.2 (login + history rendering); the chat-subscribe assertions become meaningful only after B.2.
Recommended build order
- ~~Layer 3 first~~: shipped 2026-05-09. Going forward, extend this suite as new chat-subscribe edge cases surface.
- A2 Docker validation: exercise the in-place compaction case the local-filesystem Layer 3 test cannot reproduce; watch `[chat-subscribe] reason=fingerprint-mismatch was=…/now=…` for the `sameFileIdentity` blind spot.
- Layer 5 scaffolding: set up Playwright config, write the login + project list test as the first stable scenario. Cheap.
- A2 replay harness (Layer 4) after B.2 ships: closes the renderer gap.
9. CI Integration
GitHub Actions Pipeline
PR opened → ci.yml:
1. npm install
2. npm run lint (server)
3. npx vite build (frontend)
4. npm run test:integration (optional)
5. CodeQL analysis (security)
6. Semgrep scan (SAST)
7. OSV scan (dependencies)
npm run test:coverage is not yet wired into CI. It will be added with thresholds once A2 raises coverage above the trivial baseline.
Security Scanning in CI
| Scanner | Focus | Blocking? |
|---|---|---|
| CodeQL | Code patterns, injection, auth issues | Advisory |
| Semgrep | SAST rules, OWASP patterns | Advisory |
| OSV | Known dependency vulnerabilities | Advisory |
| Container scan | Docker image vulnerabilities | Advisory |
10. Test Accounts
Local Development
- Username: `lindsay` / Password: `password`
- Login confirmed working on local dev as of 2026-05-09 (memory note about "local login broken" is stale)
- Note: `npm run dev` only loads `.env`, not `.env.local`, so the backend binds to whatever `.env` says (3005); visit `localhost:3005` directly. Vite at 5173 has a proxy mismatch with this configuration.
Docker Development
- First user registered becomes admin
- No pre-seeded test accounts
11. Known Gaps
Major
- Chat-pipeline rendering: 0% covered (`ChatInterface.jsx`, `useProjectWebSocketV2.js`, `messageNormalizer.js`, `websocket.js`, `claude-cli.js`). Tracked by A2 spec.
- B.2 client cutover not shipped: server-side Y is ready and working (verified end-to-end manually), but no client code sends `chat-subscribe`. `REPLAY_BACKEND=jsonl` stays dormant in production until B.2 lands.
- Auth: untested (registration, login, JWT validation, admin gating).
- File operations: untested (upload, download, `safePath` traversal prevention).
- No WS-protocol E2E in CI: ~~Layer 3 above (proposed) closes this.~~ Closed 2026-05-09 by `chat-subscribe-jsonl.test.js` (8 scenarios).
- No browser E2E: Playwright installed but unused.
- Signal classification (B.3) not started: goal 3 (noise reduction) has only the C0 tactical fix in place (`isSystemJson` recognises `rate_limit_info` envelopes since 2026-05-09). The strategic typed-envelope normaliser has no coverage and no work in flight. The `[Message format not recognized]` fallback at `messageNormalizer.js:170` is no longer triggered by `rate_limit_info`, but other unknown envelopes survive Y entirely.
- No accessibility tests: no automated a11y checks.
- No performance baselines: no load testing, no streaming latency benchmarks.
Recommended Priority
- ~~Layer 3 WS-protocol E2E~~: shipped 2026-05-09 as `server/__tests__/e2e/chat-subscribe-jsonl.test.js`. Gives B.2 a stable contract to land against.
- Y Phase 1 Docker validation: exercise the JSONL backend in production-like Docker before Phase 3 ring deletion.
- Auth integration tests: registration, login, protected-route gating.
- A2 replay-test harness: closes the rendering coverage gap; most valuable after B.2 ships.
- File operations integration tests: upload, safePath validation, path-traversal prevention.
- Coverage thresholds: after A2 raises the baseline, set per-area floors so regressions block PRs.
- Playwright golden flows: login + history-render today (cheap); chat-subscribe assertions after B.2.
12. Companion Documents
Y — JSONL tail replay (current Phase 0 work)
- `docs/superpowers/specs/2026-05-06-replay-via-jsonl-tail-design.md`: design spec post-Codex review
- `docs/superpowers/plans/2026-05-06-jsonl-tail-replay-y.md`: implementation plan, Tasks 1-9
- `docs/superpowers/reviews/2026-05-06-codex-second-opinion-jsonl-tail.md`: independent review
A1/A2/A3 — Diagnostic + replay harness
- `docs/superpowers/specs/2026-05-02-chat-replay-testing-design.md`: A1 (shipped), A2 (planned), A3 (complete)
- `docs/superpowers/plans/2026-05-04-chat-debug-pane-a1.md`: A1 implementation plan
B — Architectural fix family
- `docs/superpowers/specs/2026-05-05-chat-event-replay-ring-design.md`: B (B.1 shipped, B.2/B.3 pending)
- `docs/superpowers/plans/2026-05-05-event-replay-ring-b1.md`: B.1 plan (superseded by Y)
- `docs/superpowers/plans/2026-05-06-client-cutover-b2-redesign.md`: B.2 redesign plan
- `docs/superpowers/reviews/2026-05-05-codex-second-opinion-chat-pipeline.md`: first Codex review