Healing and Failure Intelligence

| Key | Value |
| --- | --- |
| Status | Active |
| Owner | QA Automation |
| Updated | 2026-03-26 |
| Scope | Self-healing, incident store, fix database, recovery workflows, and operator decision guide |

The healing system is how the team avoids rediscovering the same test breakages over and over. It combines automated selector repair, a durable incident store, and a recovery lifecycle that posts back to Slack when a failure stops repeating.

When To Use Healing vs Manual Fix

| Situation | Recommended Action |
| --- | --- |
| selector stopped finding its element | npm run heal:claude |
| timeout is too short for current page speed | npm run heal:claude or manual timeout increase |
| test logic is wrong (assertion on changed behavior) | manual fix |
| site has a real bug (product issue) | escalate, do not fix the test |
| consent dialog blocking the test | npm run heal:claude |
| HTTP 5xx from the site | investigate infra, do not fix the test |
| HTTP 4xx (page removed) | decide whether to remove or update the test |
| flaky timing issue | manual fix with explicit wait strategy |

Decision Tree

| Failure Category | First Action | If That Does Not Work |
| --- | --- | --- |
| SELECTOR_NOT_IN_DOM | npm run heal:claude | check MCP Playwright DOM live |
| SELECTOR_STALE | npm run heal:apply with fresh selector | run npm run discover:all |
| TIMEOUT_ELEMENT | npm run heal:claude | increase timeout manually + add waitForLoadState |
| CONSENT_BLOCKING | npm run heal:claude | check consent handler config |
| CONSENT_NOT_FOUND | update consent config | check site for dialog changes |
| HTTP_5XX | check site status, wait | escalate to infra |
| HTTP_4XX | check if page was removed | update or remove the test |
| CONTENT_MISMATCH | check if content changed by design | update test expectation |
| TEST_FLAKY | add explicit wait, rerun | investigate timing |
| PRODUCT_BUG | file bug report | do not fix the test |
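
The decision tree can also be read as a lookup table. The following is a minimal sketch, not part of the tooling: the categories and action strings are copied from the table, while the `PLAYBOOK` constant and the `nextStep` helper are hypothetical names introduced only for illustration.

```typescript
// Failure categories exactly as they appear in the decision tree.
type FailureCategory =
  | "SELECTOR_NOT_IN_DOM"
  | "SELECTOR_STALE"
  | "TIMEOUT_ELEMENT"
  | "CONSENT_BLOCKING"
  | "CONSENT_NOT_FOUND"
  | "HTTP_5XX"
  | "HTTP_4XX"
  | "CONTENT_MISMATCH"
  | "TEST_FLAKY"
  | "PRODUCT_BUG";

interface PlaybookEntry {
  firstAction: string;
  fallback: string;
}

// Hypothetical lookup table; the strings mirror the decision tree rows.
const PLAYBOOK: Record<FailureCategory, PlaybookEntry> = {
  SELECTOR_NOT_IN_DOM: { firstAction: "npm run heal:claude", fallback: "check MCP Playwright DOM live" },
  SELECTOR_STALE: { firstAction: "npm run heal:apply with fresh selector", fallback: "run npm run discover:all" },
  TIMEOUT_ELEMENT: { firstAction: "npm run heal:claude", fallback: "increase timeout manually + add waitForLoadState" },
  CONSENT_BLOCKING: { firstAction: "npm run heal:claude", fallback: "check consent handler config" },
  CONSENT_NOT_FOUND: { firstAction: "update consent config", fallback: "check site for dialog changes" },
  HTTP_5XX: { firstAction: "check site status, wait", fallback: "escalate to infra" },
  HTTP_4XX: { firstAction: "check if page was removed", fallback: "update or remove the test" },
  CONTENT_MISMATCH: { firstAction: "check if content changed by design", fallback: "update test expectation" },
  TEST_FLAKY: { firstAction: "add explicit wait, rerun", fallback: "investigate timing" },
  PRODUCT_BUG: { firstAction: "file bug report", fallback: "do not fix the test" },
};

// Hypothetical helper: returns the first action, or the fallback once the first action has failed.
function nextStep(category: FailureCategory, firstActionFailed: boolean = false): string {
  const entry = PLAYBOOK[category];
  return firstActionFailed ? entry.fallback : entry.firstAction;
}

console.log(nextStep("SELECTOR_STALE"));
console.log(nextStep("HTTP_5XX", true));
```

Typing the table as a `Record` over the category union means the compiler flags any category that is added to the union without a matching row.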

Commands

Healing

| Command | Purpose |
| --- | --- |
| npm run heal | general failure analysis |
| npm run heal:claude | interactive Claude-assisted healing (recommended) |
| npm run heal:ai | AI analysis path |
| npm run heal:apply | apply a proposed fix locally |
| npm run heal:dry | preview changes without applying |
| npm run heal:mr | package a fix as a GitLab MR |

Fix Database

| Command | Purpose |
| --- | --- |
| npm run fix:stats | show fix database statistics |
| npm run fix:recent | show recent fixes |
| npm run fix:recent blesk | recent fixes for a specific site |

Incident Store

| Command | Purpose |
| --- | --- |
| npm run incidents:query | show failures needing classification (last 7 days) |
| npm run incidents:query -- --days=30 | last 30 days |
| npm run incidents:template | generate incident JSON template for a new incident |
| npm run incidents:record-recovery | append a confirmed recovery to history |

Recovery Replies

| Command | Purpose |
| --- | --- |
| npm run slack:reply-resolved | dry run: show what recovery replies would be posted |
| npm run slack:reply-resolved:send | post recovery replies to fixed failure Slack threads |
| npm run slack:reply-thread | post to a specific thread (manual override) |

Fix Database

The fix database lives at data/fixes.json. It stores every fix the system has applied or recorded.

| Field | Meaning |
| --- | --- |
| fix type | SELECTOR, TIMEOUT, CONSENT, WAIT, ASSERTION |
| source | AUTO (healer), MANUAL (operator), AI (AI-assisted) |
| selector before/after | what changed |
| verified | whether the fix was confirmed by a passing test run |
| site | which site the fix applies to |

Querying the fix database before making a manual change often reveals that the same selector has broken before and what the replacement was.
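
As an illustration of that lookup, here is a minimal sketch of how fix records could be filtered for an earlier fix of the same selector. The record shape is an assumption derived from the field table: the actual key names in data/fixes.json are not documented here, so `fixType`, `selectorBefore`, `selectorAfter`, and the `priorFixes` helper are hypothetical, and the sample selectors are made up.

```typescript
// Assumed record shape; the real keys in data/fixes.json may differ.
type FixType = "SELECTOR" | "TIMEOUT" | "CONSENT" | "WAIT" | "ASSERTION";
type FixSource = "AUTO" | "MANUAL" | "AI";

interface FixRecord {
  fixType: FixType;
  source: FixSource;
  selectorBefore?: string;
  selectorAfter?: string;
  verified: boolean;
  site: string;
}

// Hypothetical helper: verified earlier fixes for the same broken selector on the same site.
function priorFixes(fixes: FixRecord[], site: string, brokenSelector: string): FixRecord[] {
  return fixes.filter(
    (f) => f.site === site && f.selectorBefore === brokenSelector && f.verified
  );
}

// Made-up sample data, for illustration only.
const sampleFixes: FixRecord[] = [
  { fixType: "SELECTOR", source: "AUTO", selectorBefore: "#old-login", selectorAfter: "[data-test=login]", verified: true, site: "blesk" },
  { fixType: "SELECTOR", source: "AI", selectorBefore: "#old-login", selectorAfter: ".login-btn", verified: false, site: "blesk" },
  { fixType: "TIMEOUT", source: "MANUAL", verified: true, site: "blesk" },
];

for (const fix of priorFixes(sampleFixes, "blesk", "#old-login")) {
  console.log(`${fix.selectorBefore} was previously replaced by ${fix.selectorAfter} (${fix.source})`);
}
```

Filtering on `verified` keeps unconfirmed proposals out of the result, so only replacements that were followed by a passing run are suggested.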

Incident Store

The incident store lives at data/failure-incidents.json. It is the system's memory of known recurring failure patterns.

| Field | Meaning |
| --- | --- |
| id | unique incident identifier |
| title | human-readable description |
| root cause | one of the standard root-cause domains |
| status | open, resolved, or monitoring |
| sites affected | which sites show this failure |
| fingerprint patterns | how the system matches new failures to this incident |
| confirmed recoveries | timestamps of confirmed resolution events |

When a new failure arrives, the incident matcher checks whether its fingerprint matches a known incident. If it does, the Slack alert includes the known context instead of treating it as a fresh unknown.
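
The matching step can be sketched as follows. This is an illustration under stated assumptions, not the actual matcher: the key names, the idea that fingerprint patterns are regular expressions, the fingerprint string format, and the `matchIncident` helper are all hypothetical, and the sample incident is made up.

```typescript
// Assumed incident shape based on the field table; real key names may differ.
interface Incident {
  id: string;
  title: string;
  rootCause: string;
  status: "open" | "resolved" | "monitoring";
  sitesAffected: string[];
  fingerprintPatterns: string[]; // assumption: stored as regular expression sources
  confirmedRecoveries: string[]; // timestamps of confirmed resolution events
}

// Hypothetical matcher: first incident whose site list and any pattern match the failure.
function matchIncident(incidents: Incident[], site: string, fingerprint: string): Incident | undefined {
  for (const incident of incidents) {
    const siteMatches = incident.sitesAffected.some((s) => s === site);
    const patternMatches = incident.fingerprintPatterns.some((p) => new RegExp(p).test(fingerprint));
    if (siteMatches && patternMatches) {
      return incident;
    }
  }
  return undefined;
}

// Made-up sample incident, for illustration only.
const knownIncidents: Incident[] = [
  {
    id: "INC-001",
    title: "Consent accept button renamed",
    rootCause: "consent",
    status: "monitoring",
    sitesAffected: ["blesk"],
    fingerprintPatterns: ["^SELECTOR_NOT_IN_DOM:.*consent"],
    confirmedRecoveries: [],
  },
];

const hit = matchIncident(knownIncidents, "blesk", "SELECTOR_NOT_IN_DOM:#consent-accept");
console.log(hit ? `Known incident ${hit.id}: ${hit.title}` : "No known incident, treat as new");
```

A match lets the alert carry the incident's title and root cause; no match means the failure is reported as a fresh unknown.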

Step-By-Step: I Have A Failing Test

  1. Check the Slack alert. The alert headline and Initial Read section give you the first signal.
  2. Look at the failure screenshot in test-results/artifacts/ or in Grafana > Investigate.
  3. Read the JSONL log for the failure in test-results/logs/. Look at error_category, selector, and action fields.
  4. Check the incident store: npm run incidents:query. If there is a match, you already know the root cause.
  5. Decide whether this is auto-healable (see decision tree above).
  6. If auto-healable: run npm run heal:claude and follow the interactive prompts.
  7. If manual: make the fix, re-run the test, confirm it passes.
  8. Record the fix. A healer-applied fix is saved automatically once the test passes.
  9. If you fixed a known incident: run npm run incidents:record-recovery to append it to history.
  10. Post recovery: npm run slack:reply-resolved:send posts a recovery reply to the original Slack thread.
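
For step 3, a small script can pull the relevant fields out of a JSONL log. The sketch below assumes one JSON object per line and uses the `error_category`, `selector`, and `action` field names mentioned above; the `summarizeFailures` helper and the sample log lines are hypothetical.

```typescript
// Only the three fields named in step 3 are modeled; real log entries likely carry more.
interface FailureLogEntry {
  error_category?: string;
  selector?: string;
  action?: string;
}

// Parses JSONL text, skipping blank and malformed lines.
function parseJsonl(text: string): FailureLogEntry[] {
  const entries: FailureLogEntry[] = [];
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      entries.push(JSON.parse(trimmed) as FailureLogEntry);
    } catch {
      // Not valid JSON: ignore the line rather than abort the whole read.
    }
  }
  return entries;
}

// Hypothetical helper: one summary line per entry that has an error_category.
function summarizeFailures(text: string): string[] {
  return parseJsonl(text)
    .filter((e) => typeof e.error_category === "string")
    .map((e) => `${e.error_category} | ${e.action ?? "?"} | ${e.selector ?? "?"}`);
}

// Made-up log content, for illustration only.
const sampleLog = [
  '{"error_category":"SELECTOR_NOT_IN_DOM","selector":"#submit","action":"click"}',
  "not json",
  '{"level":"info","message":"page loaded"}',
].join("\n");

for (const line of summarizeFailures(sampleLog)) {
  console.log(line);
}
```

In practice the text would come from a file under test-results/logs/; reading it is left out so the sketch stays free of file-system dependencies.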

Recovery Workflow

When a failure stops repeating after a fix, the recovery workflow closes the loop in Slack.

| Step | Command | Purpose |
| --- | --- | --- |
| check what would be posted | npm run slack:reply-resolved | dry run preview |
| post recovery replies | npm run slack:reply-resolved:send | replies in original failure thread |
| record in history | npm run incidents:record-recovery | improves future confidence scoring |

Recovery replies are valuable because they turn silent dead ends into explicit resolved threads. Anyone following the original failure thread knows the issue was fixed and why.

How To Read Heal Output

When npm run heal:claude runs interactively, it presents:

  1. a list of recent failures with their categories
  2. for each SELECTOR failure: the old selector and a proposed replacement found by scanning the live DOM
  3. a prompt to confirm, skip, or adjust before applying

The --dry flag shows the same output without writing any changes; npm run heal:dry likewise previews changes without applying them. Use dry mode first when you are unsure what the healer will touch.
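
The confirm, skip, or adjust choice and the effect of dry mode can be modeled in a few lines. This is a conceptual sketch only: the healer's real data structures are not documented here, so `SelectorProposal`, `Decision`, `PlannedChange`, and `planChange` are hypothetical names, and the file path and selectors are made up.

```typescript
// Hypothetical shape of one proposal shown for a SELECTOR failure.
interface SelectorProposal {
  file: string;
  oldSelector: string;
  proposedSelector: string;
}

// The three operator choices described above.
type Decision =
  | { kind: "confirm" }
  | { kind: "skip" }
  | { kind: "adjust"; selector: string };

interface PlannedChange {
  file: string;
  from: string;
  to: string;
  written: boolean; // false in dry mode: the change is reported but not applied
}

// Returns the change that would result from a decision, or null when the proposal is skipped.
function planChange(proposal: SelectorProposal, decision: Decision, dry: boolean): PlannedChange | null {
  if (decision.kind === "skip") {
    return null;
  }
  const to = decision.kind === "adjust" ? decision.selector : proposal.proposedSelector;
  return { file: proposal.file, from: proposal.oldSelector, to, written: !dry };
}

// Made-up proposal, for illustration only.
const proposal: SelectorProposal = {
  file: "tests/login.spec.ts",
  oldSelector: "#old-login",
  proposedSelector: "[data-test=login]",
};

console.log(planChange(proposal, { kind: "confirm" }, true));
console.log(planChange(proposal, { kind: "adjust", selector: ".login-btn" }, false));
```

Because dry mode returns the same planned change with `written` set to false, an operator sees exactly what a real run would touch before anything is written.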

| Need | Go To |
| --- | --- |
| failure categories and labels | Failure Categories |
| AI intelligence detail | AI Processing |
| Slack alerts and reports | Reporting |
| full command list | CLI Reference |