Healing and Failure Intelligence

Key	Value
Status	Active
Owner	QA Automation
Updated	2026-03-26
Scope	Self-healing, incident store, fix database, recovery workflows, and operator decision guide

The healing system is how the team avoids rediscovering the same test breakages over and over. It combines automated selector repair, a durable incident store, and a recovery lifecycle that posts back to Slack when a failure stops repeating.

When To Use Healing vs Manual Fix

Situation	Recommended Action
selector stopped finding its element	`npm run heal:claude`
timeout is too short for current page speed	`npm run heal:claude` or manual timeout increase
test logic is wrong (assertion on changed behavior)	manual fix
site has a real bug (product issue)	escalate, do not fix the test
consent dialog blocking the test	`npm run heal:claude`
HTTP 5xx from the site	investigate infra, do not fix the test
HTTP 4xx (page removed)	decide whether to remove or update the test
flaky timing issue	manual fix with explicit wait strategy

Decision Tree

Failure Category	First Action	If That Does Not Work
`SELECTOR_NOT_IN_DOM`	`npm run heal:claude`	inspect the live DOM with MCP Playwright
`SELECTOR_STALE`	`npm run heal:apply` with fresh selector	run `npm run discover:all`
`TIMEOUT_ELEMENT`	`npm run heal:claude`	increase the timeout manually and add `waitForLoadState`
`CONSENT_BLOCKING`	`npm run heal:claude`	check consent handler config
`CONSENT_NOT_FOUND`	update consent config	check site for dialog changes
`HTTP_5XX`	check site status, wait	escalate to infra
`HTTP_4XX`	check if page was removed	update or remove the test
`CONTENT_MISMATCH`	check if content changed by design	update test expectation
`TEST_FLAKY`	add explicit wait, rerun	investigate timing
`PRODUCT_BUG`	file bug report	do not fix the test
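
The decision tree can also be read as a plain lookup table. The sketch below is a minimal illustration, not part of the healing tooling: `TRIAGE_TABLE`, `TriagePlan`, and `triageFailure` are hypothetical names, and the action strings are summarized from the table above.

```typescript
// Hypothetical lookup built from the decision tree table; not the healer's own code.
const TRIAGE_TABLE: Record<string, { firstAction: string; fallback: string }> = {
  SELECTOR_NOT_IN_DOM: { firstAction: "npm run heal:claude", fallback: "inspect the live DOM with MCP Playwright" },
  SELECTOR_STALE: { firstAction: "npm run heal:apply with fresh selector", fallback: "run npm run discover:all" },
  TIMEOUT_ELEMENT: { firstAction: "npm run heal:claude", fallback: "increase timeout manually and add waitForLoadState" },
  CONSENT_BLOCKING: { firstAction: "npm run heal:claude", fallback: "check consent handler config" },
  CONSENT_NOT_FOUND: { firstAction: "update consent config", fallback: "check site for dialog changes" },
  HTTP_5XX: { firstAction: "check site status, wait", fallback: "escalate to infra" },
  HTTP_4XX: { firstAction: "check if page was removed", fallback: "update or remove the test" },
  CONTENT_MISMATCH: { firstAction: "check if content changed by design", fallback: "update test expectation" },
  TEST_FLAKY: { firstAction: "add explicit wait, rerun", fallback: "investigate timing" },
  PRODUCT_BUG: { firstAction: "file bug report", fallback: "do not fix the test" },
};

interface TriagePlan {
  category: string;
  firstAction: string;
  fallback: string;
  // True when the first action is one of the `npm run heal*` commands.
  healerFirst: boolean;
}

// Returns the plan for a known category, or undefined for anything not in the table.
function triageFailure(category: string): TriagePlan | undefined {
  if (!Object.prototype.hasOwnProperty.call(TRIAGE_TABLE, category)) {
    return undefined;
  }
  const row = TRIAGE_TABLE[category];
  if (!row) {
    return undefined;
  }
  return {
    category,
    firstAction: row.firstAction,
    fallback: row.fallback,
    healerFirst: row.firstAction.indexOf("npm run heal") === 0,
  };
}

// Example: a timeout goes to the healer first, a product bug does not.
const timeoutPlan = triageFailure("TIMEOUT_ELEMENT"); // healerFirst is true
const bugPlan = triageFailure("PRODUCT_BUG");         // healerFirst is false
```

An unknown category returns `undefined` rather than a default plan, so a new or misspelled category surfaces as an explicit gap instead of being routed to the healer by accident.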

Commands

Healing

Command	Purpose
`npm run heal`	general failure analysis
`npm run heal:claude`	interactive Claude-assisted healing (recommended)
`npm run heal:ai`	AI analysis path
`npm run heal:apply`	apply a proposed fix locally
`npm run heal:dry`	preview changes without applying
`npm run heal:mr`	package a fix as a GitLab MR

Fix Database

Command	Purpose
`npm run fix:stats`	show fix database statistics
`npm run fix:recent`	show recent fixes
`npm run fix:recent blesk`	recent fixes for a specific site

Incident Store

Command	Purpose
`npm run incidents:query`	show failures needing classification (last 7 days)
`npm run incidents:query -- --days=30`	last 30 days
`npm run incidents:template`	generate incident JSON template for a new incident
`npm run incidents:record-recovery`	append a confirmed recovery to history

Recovery Replies

Command	Purpose
`npm run slack:reply-resolved`	dry run: show what recovery replies would be posted
`npm run slack:reply-resolved:send`	post recovery replies to the Slack threads of fixed failures
`npm run slack:reply-thread`	post to a specific thread (manual override)

Fix Database

The fix database lives at `data/fixes.json`. It stores every fix the system has applied or recorded.

Field	Meaning
fix type	`SELECTOR`, `TIMEOUT`, `CONSENT`, `WAIT`, `ASSERTION`
source	`AUTO` (healer), `MANUAL` (operator), `AI` (AI-assisted)
selector before/after	what changed
verified	whether the fix was confirmed by a passing test run
site	which site the fix applies to

Querying the fix database before making a manual change often reveals that the same selector has broken before and what the replacement was.
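
As a rough illustration of that lookup, the sketch below filters a list of fix records for earlier selector fixes on one site. The key names (`fixType`, `selectorBefore`, `selectorAfter`, and so on) are assumptions derived from the field meanings in the table, not the actual schema of `data/fixes.json`, and the selectors in the sample data are invented.

```typescript
// Hypothetical record shape; the real JSON keys in data/fixes.json may differ.
interface FixRecord {
  fixType: "SELECTOR" | "TIMEOUT" | "CONSENT" | "WAIT" | "ASSERTION";
  source: "AUTO" | "MANUAL" | "AI";
  selectorBefore?: string;
  selectorAfter?: string;
  verified: boolean;
  site: string;
}

// Returns earlier SELECTOR fixes for the same site and broken selector,
// with verified fixes listed before unverified ones.
function findPriorSelectorFixes(
  fixes: FixRecord[],
  site: string,
  brokenSelector: string
): FixRecord[] {
  return fixes
    .filter(
      (fix) =>
        fix.fixType === "SELECTOR" &&
        fix.site === site &&
        fix.selectorBefore === brokenSelector
    )
    .sort((a, b) => Number(b.verified) - Number(a.verified));
}

// Invented sample data for illustration only.
const sampleFixes: FixRecord[] = [
  { fixType: "SELECTOR", source: "AI", selectorBefore: ".article-title", selectorAfter: "h1.title", verified: false, site: "blesk" },
  { fixType: "TIMEOUT", source: "MANUAL", verified: true, site: "blesk" },
  { fixType: "SELECTOR", source: "AUTO", selectorBefore: ".article-title", selectorAfter: "[data-testid='headline']", verified: true, site: "blesk" },
];

const priorFixes = findPriorSelectorFixes(sampleFixes, "blesk", ".article-title");
// priorFixes has two entries; the verified AUTO fix comes first.
```

For quick checks, `npm run fix:recent` with a site name covers the same question from the command line.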

Incident Store

The incident store lives at `data/failure-incidents.json`. It is the system's memory of known recurring failure patterns.

Field	Meaning
id	unique incident identifier
title	human-readable description
root cause	one of the standard root-cause domains
status	`open`, `resolved`, or `monitoring`
sites affected	which sites show this failure
fingerprint patterns	how the system matches new failures to this incident
confirmed recoveries	timestamps of confirmed resolution events

When a new failure arrives, the incident matcher checks whether its fingerprint matches a known incident. If it does, the Slack alert includes the known context instead of presenting the failure as a fresh unknown.
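
The matching step might look roughly like the sketch below. It is an illustration under stated assumptions, not the matcher's actual logic: the key names are guesses based on the field table, the fingerprint patterns are assumed to be regular-expression sources, and the fingerprint string format in the sample is invented.

```typescript
// Hypothetical incident shape; real keys in data/failure-incidents.json may differ.
interface Incident {
  id: string;
  title: string;
  rootCause: string;
  status: "open" | "resolved" | "monitoring";
  sitesAffected: string[];
  fingerprintPatterns: string[]; // assumed to be regular-expression sources
  confirmedRecoveries: string[]; // assumed to be ISO 8601 timestamps
}

// Returns the first incident whose site list and fingerprint patterns
// both match the new failure, or undefined when nothing matches.
function matchIncident(
  incidents: Incident[],
  fingerprint: string,
  site: string
): Incident | undefined {
  for (const incident of incidents) {
    const siteMatches = incident.sitesAffected.indexOf(site) !== -1;
    const patternMatches = incident.fingerprintPatterns.some((pattern) =>
      new RegExp(pattern).test(fingerprint)
    );
    if (siteMatches && patternMatches) {
      return incident;
    }
  }
  return undefined;
}

// Invented sample incident and fingerprint, for illustration only.
const knownIncidents: Incident[] = [
  {
    id: "INC-001",
    title: "Consent dialog markup changed",
    rootCause: "consent",
    status: "monitoring",
    sitesAffected: ["blesk"],
    fingerprintPatterns: ["^CONSENT_BLOCKING:"],
    confirmedRecoveries: [],
  },
];

const matched = matchIncident(knownIncidents, "CONSENT_BLOCKING:#accept-all", "blesk");
// matched is the INC-001 incident; a different site or category yields undefined.
```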

Step-By-Step: I Have A Failing Test

1. Check the Slack alert. The alert headline and Initial Read section give you the first signal.
2. Look at the failure screenshot in `test-results/artifacts/` or in Grafana > Investigate.
3. Read the JSONL log for the failure in `test-results/logs/`. Look at the `error_category`, `selector`, and `action` fields.
4. Check the incident store: `npm run incidents:query`. If there is a match, you already know the root cause.
5. Decide whether this is auto-healable (see the decision tree above).
6. If auto-healable: run `npm run heal:claude` and follow the interactive prompts.
7. If manual: make the fix, re-run the test, and confirm it passes.
8. Record the fix. After a healer-applied fix, it is saved automatically when the test passes.
9. If you fixed a known incident: run `npm run incidents:record-recovery` to append it to history.
10. Post the recovery: `npm run slack:reply-resolved:send` posts a recovery reply to the original Slack thread.
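
For the log-reading step, a small parser can pull out the three fields worth checking. This is a minimal sketch: only the `error_category`, `selector`, and `action` field names come from this page, while the function names and the sample log lines are hypothetical.

```typescript
// The three fields named in the step list; other fields in the log are ignored.
interface FailureLogEntry {
  error_category: string;
  selector?: string;
  action?: string;
}

// Parses JSONL text and keeps only lines that carry an error_category.
// Blank and malformed lines are skipped rather than treated as fatal.
function parseFailureLog(jsonl: string): FailureLogEntry[] {
  const entries: FailureLogEntry[] = [];
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") {
      continue;
    }
    try {
      const parsed = JSON.parse(trimmed);
      if (parsed && typeof parsed === "object" && typeof parsed.error_category === "string") {
        entries.push({
          error_category: parsed.error_category,
          selector: typeof parsed.selector === "string" ? parsed.selector : undefined,
          action: typeof parsed.action === "string" ? parsed.action : undefined,
        });
      }
    } catch (err) {
      continue; // not valid JSON, skip the line
    }
  }
  return entries;
}

// Counts failures per category so the dominant failure type stands out.
function countByCategory(entries: FailureLogEntry[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const entry of entries) {
    counts[entry.error_category] = (counts[entry.error_category] || 0) + 1;
  }
  return counts;
}

// Invented log lines for illustration only.
const sampleLog = [
  '{"error_category":"SELECTOR_NOT_IN_DOM","selector":".article-title","action":"click"}',
  '{"level":"info","message":"test started"}',
  "not json at all",
  '{"error_category":"SELECTOR_NOT_IN_DOM","selector":".byline","action":"expect"}',
  '{"error_category":"TIMEOUT_ELEMENT","selector":"#video","action":"waitFor"}',
].join("\n");

const failureCounts = countByCategory(parseFailureLog(sampleLog));
// failureCounts is { SELECTOR_NOT_IN_DOM: 2, TIMEOUT_ELEMENT: 1 }
```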

Recovery Workflow

When a failure stops repeating after a fix, the recovery workflow closes the loop in Slack.

Step	Command	Purpose
check what would be posted	`npm run slack:reply-resolved`	dry run preview
post recovery replies	`npm run slack:reply-resolved:send`	replies in original failure thread
record in history	`npm run incidents:record-recovery`	improves future confidence scoring

Recovery replies are valuable because they turn silent dead ends into explicitly resolved threads. Anyone following the original failure thread knows the issue was fixed and why.

How To Read Heal Output

When `npm run heal:claude` runs interactively, it presents:

- a list of recent failures with their categories
- for each SELECTOR failure: the old selector and a proposed replacement found by scanning the live DOM
- a prompt to confirm, skip, or adjust before applying

The `--dry` flag shows the same output without writing any changes. Use dry mode first when you are unsure what the healer will touch.

Related Pages

Need	Go To
failure categories and labels	Failure Categories
AI intelligence detail	AI Processing
Slack alerts and reports	Reporting
full command list	CLI Reference
