Troubleshooting

Key | Value
Status | Active
Owner | QA Automation
Updated | 2026-03-26
Scope | Practical triage paths for common local, CI, suite, and integration problems

This page is written for moments when something is already broken and you want the shortest path to a useful next step.

Start Here

Symptom | Best First Check
one test failed locally | artifacts and local logs
scheduled run failed in Slack | investigation thread plus history context
many tests failed at once | look for a shared environment cause before fixing tests
visual suite exploded | check baseline availability before reading diffs
dashboards look empty | run observability or Grafana validation
Slack did not post | verify token, webhook, and channel config

Local Setup Problems

Problem | Most Likely Cause | Good Next Step
Playwright will not start | browser install missing | install browsers again
commands fail immediately after clone | dependencies not installed | run npm install
local run behaves oddly across sites | wrong or stale .env assumptions | check SITE and local overrides

CI Versus Local Mismatch

Pattern | Likely Cause
passes locally, fails in CI | timing, network, artifact, or baseline difference
only fails on one runner | environment instability
visual suite only fails in CI | Linux baseline mismatch or missing snapshots
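One way to check the Linux baseline case is to compare snapshot file names per platform. The sketch below assumes the project keeps Playwright's default snapshot naming, where files end in the platform name (for example `-chromium-linux.png`); the function name and the sample file names are hypothetical.

```typescript
// Sketch: list snapshot stems that have a baseline for some platform
// but none for Linux. Assumes default Playwright naming, where the
// file name ends with "-<platform>.png". Names here are illustrative.
function missingLinuxBaselines(files: string[]): string[] {
  const platformSuffix = /-(darwin|linux|win32)\.png$/;
  const platformsByStem = new Map<string, Set<string>>();

  for (const file of files) {
    const match = platformSuffix.exec(file);
    if (!match) continue; // not a platform-suffixed snapshot, ignore it
    const stem = file.slice(0, match.index);
    const platforms = platformsByStem.get(stem) ?? new Set<string>();
    platforms.add(match[1]);
    platformsByStem.set(stem, platforms);
  }

  const missing: string[] = [];
  platformsByStem.forEach((platforms, stem) => {
    if (!platforms.has("linux")) missing.push(stem);
  });
  return missing;
}
```

Feeding it the file list of the snapshot directory gives a quick answer to whether CI is comparing against baselines that were only ever generated on a developer machine.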

Suite-Specific Quick Advice

Smoke, Shadow, PDT

If You See | Think About
one selector timeout on one site | recent redesign or selector drift
cross-domain oddness on Blesk family sites | host guard logic and site assumptions
sudden broad failure cluster | shared environment or site-wide issue

E2E And User-Flows

If You See | Think About
gallery or video failures | content rotation, fallback logic, lazy loading
login or premium failures | auth state, consent overlays, modal interference
conditional live-surface failures | ephemeral content, not always regression

Mobile

If You See | Think About
mobile-only flake | responsive overlays, touch target timing, seed URL choice
Safari-specific drift | browser-specific rendering or interaction differences
deep-tier failures | performance budget and journey timing before selectors

Content

If You See | Think About
many missing entries in one context | seed URL or context grouping issue
one handler failing repeatedly | handler logic before registry data
regressions after CMS change | registry and selector freshness

Visual

If You See | Think About
"A snapshot doesn't exist" | missing baseline, not visual regression
many failures across all sites at once | baseline lifecycle or compare-mode problem
one true diff on one site | structural change or intended redesign
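The first row can be applied mechanically before anyone opens a diff image. The following is a minimal sketch that sorts failure messages by whether they contain the "A snapshot doesn't exist" text; the type and function names are hypothetical, and anything without that text is simply left for manual diff review.

```typescript
// Sketch: separate missing-baseline failures from failures that need
// a real diff review. Only the message text from the table is matched;
// every other message falls into "review-diff".
type VisualTriage = "missing-baseline" | "review-diff";

function triageVisualFailure(errorMessage: string): VisualTriage {
  return errorMessage.includes("A snapshot doesn't exist")
    ? "missing-baseline"
    : "review-diff";
}

function countByTriage(messages: string[]): Record<VisualTriage, number> {
  const counts: Record<VisualTriage, number> = {
    "missing-baseline": 0,
    "review-diff": 0,
  };
  for (const message of messages) {
    counts[triageVisualFailure(message)] += 1;
  }
  return counts;
}
```

If nearly every failure lands in `missing-baseline`, the baseline lifecycle is the more likely culprit than the pages themselves.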

Integration Problems

Slack

Symptom | Usual Cause
no message posted | missing webhook or bot token
wrong channel | env var mismatch
thread replies missing | bot-token path unavailable or thread tracking issue
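The three rows above map onto a small configuration check. This is a sketch only: the variable names `SLACK_WEBHOOK_URL`, `SLACK_BOT_TOKEN`, and `SLACK_CHANNEL` are assumptions for illustration and may not match what the project actually reads.

```typescript
// Sketch: report Slack configuration gaps that match the symptoms in
// the table. The environment variable names are hypothetical.
type Env = Record<string, string | undefined>;

function checkSlackConfig(env: Env, expectedChannel?: string): string[] {
  const findings: string[] = [];
  const webhook = env["SLACK_WEBHOOK_URL"];
  const botToken = env["SLACK_BOT_TOKEN"];
  const channel = env["SLACK_CHANNEL"];

  if (!webhook && !botToken) {
    findings.push("no webhook or bot token: nothing can be posted");
  } else if (!botToken) {
    findings.push("webhook only: thread replies need the bot-token path");
  }

  // A mismatch can only be detected when the caller knows the intended channel.
  if (expectedChannel !== undefined && channel !== expectedChannel) {
    findings.push(`channel is ${channel ?? "unset"}, expected ${expectedChannel}`);
  }
  return findings;
}
```

In a Node script this could be called as `checkSlackConfig(process.env, "#qa-alerts")`, where the channel name is again only an example.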

Grafana And OpenSearch

Symptom | Usual Cause
dashboard panels empty | datasource or query drift
query validation errors | field-name or index mismatch
misleading aggregates | legacy and current data mixed in shared indices
OpenSearch write issues | wrong endpoint or missing permissions
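A field-name mismatch can be confirmed by comparing the fields a dashboard query expects against the index mapping. The sketch below assumes the `properties` object of an OpenSearch mapping has already been fetched and parsed; the function names and the sample field names are hypothetical.

```typescript
// Sketch: find fields a query expects that the index mapping lacks.
// MappingNode mirrors the nested "properties" layout of an OpenSearch
// mapping; only the parts needed here are modelled.
interface MappingNode {
  type?: string;
  properties?: Record<string, MappingNode>;
}

// Flatten nested object fields into dotted paths such as "test.name".
function flattenFields(props: Record<string, MappingNode>, prefix = ""): string[] {
  const names: string[] = [];
  for (const key of Object.keys(props)) {
    const node = props[key];
    const path = prefix ? `${prefix}.${key}` : key;
    if (node.properties) {
      names.push(...flattenFields(node.properties, path));
    } else {
      names.push(path);
    }
  }
  return names;
}

function missingFields(expected: string[], props: Record<string, MappingNode>): string[] {
  const known = new Set(flattenFields(props));
  return expected.filter((field) => !known.has(field));
}
```

An empty result suggests the problem lies elsewhere, for example in the datasource or in mixed legacy and current data.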

A Good Investigation Order

  1. confirm whether the failure is isolated or clustered
  2. check artifacts and logs
  3. check whether history or incident memory already explains it
  4. decide if the problem is product, test, infra, or configuration
  5. only then decide whether to fix code, rerun, or just watch
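
Step 1 can be approximated in code when failure data is available as a list. This is a rough sketch: the `Failure` shape, the function name, and the threshold of five are all assumptions chosen for illustration, not values taken from the project.

```typescript
// Sketch: decide whether a set of failures looks isolated or clustered.
// Field names and the default threshold are hypothetical.
interface Failure {
  test: string;
  suite: string;
  site: string;
}

type Spread = "isolated" | "clustered";

// Clustered means: more than one suite is affected, or the number of
// failures reaches the threshold within a single suite.
function classifySpread(failures: Failure[], threshold = 5): Spread {
  const suites = new Set(failures.map((failure) => failure.suite));
  const clustered = suites.size > 1 || failures.length >= threshold;
  return clustered ? "clustered" : "isolated";
}
```

A clustered result points toward a shared environment or site-wide cause, so it is usually worth checking that before touching individual tests.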

Need | Command
recent failures | npm run os:failed
fix history | npm run fix:recent
health verification | npm run observability:verify:health
dashboard query validation | npm run grafana:validate-data
local debug run | npm run test:debug
UI investigation | npm run test:ui

When To Escalate

Escalate sooner when:

  • several suites fail at once
  • a deploy-critical PDT failure repeats
  • dashboards or observability are blind
  • robots.txt, sitemap, or URL monitors show broad site damage
  • the same incident repeats after a supposed fix

Need | Go To
failure interpretation | Failure Categories
alerting and reports | Reporting
commands | CLI Reference