Healing and Failure Intelligence
| Key | Value |
|---|---|
| Status | Active |
| Owner | QA Automation |
| Updated | 2026-03-26 |
| Scope | Self-healing, incident store, fix database, recovery workflows, and operator decision guide |
The healing system is how the team avoids rediscovering the same test breakages over and over. It combines automated selector repair, a durable incident store, and a recovery lifecycle that posts back to Slack when a failure stops repeating.
When To Use Healing vs Manual Fix
| Situation | Recommended Action |
|---|---|
| selector stopped finding its element | npm run heal:claude |
| timeout is too short for current page speed | npm run heal:claude or manual timeout increase |
| test logic is wrong (assertion on changed behavior) | manual fix |
| site has a real bug (product issue) | escalate, do not fix the test |
| consent dialog blocking the test | npm run heal:claude |
| HTTP 5xx from the site | investigate infra, do not fix the test |
| HTTP 4xx (page removed) | decide whether to remove or update the test |
| flaky timing issue | manual fix with explicit wait strategy |
Decision Tree
| Failure Category | First Action | If That Does Not Work |
|---|---|---|
| SELECTOR_NOT_IN_DOM | npm run heal:claude | check MCP Playwright DOM live |
| SELECTOR_STALE | npm run heal:apply with fresh selector | run npm run discover:all |
| TIMEOUT_ELEMENT | npm run heal:claude | increase timeout manually + add waitForLoadState |
| CONSENT_BLOCKING | npm run heal:claude | check consent handler config |
| CONSENT_NOT_FOUND | update consent config | check site for dialog changes |
| HTTP_5XX | check site status, wait | escalate to infra |
| HTTP_4XX | check if page was removed | update or remove the test |
| CONTENT_MISMATCH | check if content changed by design | update test expectation |
| TEST_FLAKY | add explicit wait, rerun | investigate timing |
| PRODUCT_BUG | file bug report | do not fix the test |
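The decision tree above can be sketched as a plain lookup table. This is a minimal illustration, not part of the healing tooling: the category names and actions are copied from the table, while `Playbook`, `DECISION_TREE`, and `nextAction` are hypothetical names introduced here.

```typescript
// Failure categories copied from the decision tree table.
type FailureCategory =
  | "SELECTOR_NOT_IN_DOM"
  | "SELECTOR_STALE"
  | "TIMEOUT_ELEMENT"
  | "CONSENT_BLOCKING"
  | "CONSENT_NOT_FOUND"
  | "HTTP_5XX"
  | "HTTP_4XX"
  | "CONTENT_MISMATCH"
  | "TEST_FLAKY"
  | "PRODUCT_BUG";

// Hypothetical shape: the first action and what to try if it does not work.
interface Playbook {
  first: string;
  fallback: string;
}

const DECISION_TREE: Record<FailureCategory, Playbook> = {
  SELECTOR_NOT_IN_DOM: { first: "npm run heal:claude", fallback: "check MCP Playwright DOM live" },
  SELECTOR_STALE: { first: "npm run heal:apply with fresh selector", fallback: "run npm run discover:all" },
  TIMEOUT_ELEMENT: { first: "npm run heal:claude", fallback: "increase timeout manually + add waitForLoadState" },
  CONSENT_BLOCKING: { first: "npm run heal:claude", fallback: "check consent handler config" },
  CONSENT_NOT_FOUND: { first: "update consent config", fallback: "check site for dialog changes" },
  HTTP_5XX: { first: "check site status, wait", fallback: "escalate to infra" },
  HTTP_4XX: { first: "check if page was removed", fallback: "update or remove the test" },
  CONTENT_MISMATCH: { first: "check if content changed by design", fallback: "update test expectation" },
  TEST_FLAKY: { first: "add explicit wait, rerun", fallback: "investigate timing" },
  PRODUCT_BUG: { first: "file bug report", fallback: "do not fix the test" },
};

// Hypothetical helper: returns the first action, or the fallback once the first one has failed.
function nextAction(category: FailureCategory, firstActionFailed = false): string {
  const entry = DECISION_TREE[category];
  return firstActionFailed ? entry.fallback : entry.first;
}

console.log(nextAction("SELECTOR_STALE"));
console.log(nextAction("TIMEOUT_ELEMENT", true));
```

Keeping the table as data rather than branching logic means a new failure category only needs one new entry.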
Commands
Healing
| Command | Purpose |
|---|---|
| npm run heal | general failure analysis |
| npm run heal:claude | interactive Claude-assisted healing (recommended) |
| npm run heal:ai | AI analysis path |
| npm run heal:apply | apply a proposed fix locally |
| npm run heal:dry | preview changes without applying |
| npm run heal:mr | package a fix as a GitLab MR |
Fix Database
| Command | Purpose |
|---|---|
| npm run fix:stats | show fix database statistics |
| npm run fix:recent | show recent fixes |
| npm run fix:recent blesk | recent fixes for a specific site |
Incident Store
| Command | Purpose |
|---|---|
| npm run incidents:query | show failures needing classification (last 7 days) |
| npm run incidents:query -- --days=30 | last 30 days |
| npm run incidents:template | generate incident JSON template for a new incident |
| npm run incidents:record-recovery | append a confirmed recovery to history |
Recovery Replies
| Command | Purpose |
|---|---|
| npm run slack:reply-resolved | dry run: show what recovery replies would be posted |
| npm run slack:reply-resolved:send | post recovery replies to fixed failure Slack threads |
| npm run slack:reply-thread | post to a specific thread (manual override) |
Fix Database
The fix database lives at data/fixes.json. It stores every fix the system has applied or recorded.
| Field | Meaning |
|---|---|
| fix type | SELECTOR, TIMEOUT, CONSENT, WAIT, ASSERTION |
| source | AUTO (healer), MANUAL (operator), AI (AI-assisted) |
| selector before/after | what changed |
| verified | whether the fix was confirmed by a passing test run |
| site | which site the fix applies to |
Querying the fix database before making a manual change often reveals that the same selector has broken before and what the replacement was.
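As a rough illustration of that lookup, the sketch below filters fix records for earlier repairs of the same selector on the same site. The key names (`fixType`, `selectorBefore`, and so on) are assumptions derived from the field table, not the actual schema of data/fixes.json, and `findPriorFixes` and the sample selectors are hypothetical.

```typescript
// Hypothetical record shape: key names are assumptions based on the field table,
// not the real schema of data/fixes.json.
type FixType = "SELECTOR" | "TIMEOUT" | "CONSENT" | "WAIT" | "ASSERTION";
type FixSource = "AUTO" | "MANUAL" | "AI";

interface FixRecord {
  fixType: FixType;
  source: FixSource;
  selectorBefore: string | undefined;
  selectorAfter: string | undefined;
  verified: boolean;
  site: string;
}

// Return earlier SELECTOR fixes for this site whose "before" selector is the one that just broke.
function findPriorFixes(fixes: FixRecord[], site: string, brokenSelector: string): FixRecord[] {
  return fixes.filter(
    (fix) =>
      fix.fixType === "SELECTOR" &&
      fix.site === site &&
      fix.selectorBefore === brokenSelector,
  );
}

// Illustrative in-memory data; a real script would parse data/fixes.json instead.
const sampleFixes: FixRecord[] = [
  { fixType: "SELECTOR", source: "AUTO", selectorBefore: "#old-login", selectorAfter: "[data-test=login]", verified: true, site: "blesk" },
  { fixType: "TIMEOUT", source: "MANUAL", selectorBefore: undefined, selectorAfter: undefined, verified: true, site: "blesk" },
  { fixType: "SELECTOR", source: "AI", selectorBefore: "#old-login", selectorAfter: "#login", verified: false, site: "other-site" },
];

const priorFixes = findPriorFixes(sampleFixes, "blesk", "#old-login");
console.log(priorFixes.map((fix) => fix.selectorAfter));
```

A variant that prefers records with `verified` set to true would surface replacements already confirmed by a passing run.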
Incident Store
The incident store lives at data/failure-incidents.json. It is the system's memory of known recurring failure patterns.
| Field | Meaning |
|---|---|
| id | unique incident identifier |
| title | human-readable description |
| root cause | one of the standard root-cause domains |
| status | open, resolved, or monitoring |
| sites affected | which sites show this failure |
| fingerprint patterns | how the system matches new failures to this incident |
| confirmed recoveries | timestamps of confirmed resolution events |
When a new failure arrives, the incident matcher checks whether its fingerprint matches a known incident. If it does, the Slack alert includes the known context instead of treating it as a fresh unknown.
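The matching step could look roughly like the sketch below. How fingerprints are actually built and compared is not described here, so this assumes fingerprints are strings and that each incident's fingerprint patterns are regular expression sources. The key names, the `matchIncident` helper, and the sample incident are all hypothetical.

```typescript
// Hypothetical incident shape: key names are assumptions based on the field table.
interface Incident {
  id: string;
  title: string;
  rootCause: string;
  status: "open" | "resolved" | "monitoring";
  sitesAffected: string[];
  // Assumption: each pattern is a regular expression source string.
  fingerprintPatterns: string[];
}

// Return the first incident that lists the site and has a pattern matching the fingerprint.
function matchIncident(incidents: Incident[], site: string, fingerprint: string): Incident | undefined {
  const matches = incidents.filter(
    (incident) =>
      incident.sitesAffected.indexOf(site) !== -1 &&
      incident.fingerprintPatterns.some((pattern) => new RegExp(pattern).test(fingerprint)),
  );
  return matches[0];
}

// Illustrative data only; real incidents live in data/failure-incidents.json.
const sampleIncidents: Incident[] = [
  {
    id: "INC-EXAMPLE-1",
    title: "Consent dialog blocks article tests",
    rootCause: "consent",
    status: "open",
    sitesAffected: ["blesk"],
    fingerprintPatterns: ["^CONSENT_BLOCKING:"],
  },
];

const known = matchIncident(sampleIncidents, "blesk", "CONSENT_BLOCKING:#consent-accept");
console.log(known ? "known incident: " + known.id : "fresh unknown failure");
```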
Step-By-Step: I Have A Failing Test
- Check the Slack alert. The alert headline and Initial Read section give you the first signal.
- Look at the failure screenshot in test-results/artifacts/ or in Grafana > Investigate.
- Read the JSONL log for the failure in test-results/logs/. Look at the error_category, selector, and action fields.
- Check the incident store: npm run incidents:query. If there is a match, you already know the root cause.
- Decide whether this is auto-healable (see decision tree above).
- If auto-healable: run npm run heal:claude and follow the interactive prompts.
- If manual: make the fix, re-run the test, confirm it passes.
- Record the fix: it will be saved automatically when the test passes after a healer-applied fix.
- If you fixed a known incident: run npm run incidents:record-recovery to append it to history.
- Post recovery: npm run slack:reply-resolved:send posts a recovery reply to the original Slack thread.
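For the log-reading step, a small helper can pull out just the three fields named above. This is a sketch under assumptions: it expects one JSON object per line, treats any line without an error_category as a non-failure entry, and uses made-up sample data. `summarizeFailures` and `asString` are hypothetical helpers, not part of the project.

```typescript
// Only error_category, selector, and action are named in this guide; other keys are ignored.
interface FailureSummary {
  error_category: string;
  selector: string | undefined;
  action: string | undefined;
}

// Narrow an unknown JSON value to a string, or undefined when it is anything else.
function asString(value: unknown): string | undefined {
  return typeof value === "string" ? value : undefined;
}

// Parse JSONL text (assumed: one JSON object per line) and keep entries that carry an error_category.
function summarizeFailures(jsonl: string): FailureSummary[] {
  const summaries: FailureSummary[] = [];
  jsonl.split("\n").forEach((rawLine) => {
    const line = rawLine.trim();
    if (line.length === 0) return; // skip blank lines
    const entry = JSON.parse(line) as Record<string, unknown>;
    const category = asString(entry["error_category"]);
    if (category === undefined) return; // not a failure entry
    summaries.push({
      error_category: category,
      selector: asString(entry["selector"]),
      action: asString(entry["action"]),
    });
  });
  return summaries;
}

// Illustrative log content; real logs live under test-results/logs/.
const sampleLog = [
  '{"level":"info","message":"test started"}',
  '{"error_category":"SELECTOR_NOT_IN_DOM","selector":"#login","action":"click"}',
  "",
].join("\n");

const failures = summarizeFailures(sampleLog);
console.log(failures);
```

The error_category value can then be looked up in the decision tree to choose between healing and a manual fix.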
Recovery Workflow
When a failure stops repeating after a fix, the recovery workflow closes the loop in Slack.
| Step | Command | Purpose |
|---|---|---|
| check what would be posted | npm run slack:reply-resolved | dry run preview |
| post recovery replies | npm run slack:reply-resolved:send | replies in original failure thread |
| record in history | npm run incidents:record-recovery | improves future confidence scoring |
Recovery replies are valuable because they turn silent dead ends into explicit resolved threads. Anyone following the original failure thread knows the issue was fixed and why.
How To Read Heal Output
When npm run heal:claude runs interactively, it presents:
- a list of recent failures with their categories
- for each SELECTOR failure: the old selector and a proposed replacement found by scanning the live DOM
- a prompt to confirm, skip, or adjust before applying
Dry mode (the --dry flag, or npm run heal:dry from the Commands table) shows the same output without writing any changes. Use dry mode first when you are unsure what the healer will touch.