Troubleshooting

Key	Value
Status	Active
Owner	QA Automation
Updated	2026-03-26
Scope	Practical triage paths for common local, CI, suite, and integration problems

This page is written for moments when something is already broken and you want the shortest path to a useful next step.

Start Here

Symptom	Best First Check
one test failed locally	artifacts and local logs
scheduled run reported as failed in Slack	investigation thread plus history context
many tests failed at once	look for a shared environment cause before fixing tests
visual suite exploded	check baseline availability before reading diffs
dashboards look empty	run `npm run observability:verify:health` or `npm run grafana:validate-data`
Slack did not post	verify token, webhook, and channel config

Local Setup Problems

Problem	Most Likely Cause	Good Next Step
Playwright will not start	browser install missing	reinstall browsers with `npx playwright install`
commands fail immediately after clone	dependencies not installed	run `npm install`
local run behaves oddly across sites	wrong or stale `.env` assumptions	check `SITE` and local overrides
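
A stale or mistyped `SITE` value is one of the quieter causes in the table above. As a minimal sketch, the check below compares `SITE` against a list of expected site keys before a run starts. The `KNOWN_SITES` values and the `checkSiteEnv` helper are hypothetical and not taken from this project; substitute the keys your configuration defines.

```typescript
// Hypothetical site keys; replace with the ones your config defines.
const KNOWN_SITES = ["site-a", "site-b"];

type EnvMap = Record<string, string | undefined>;

// Returns a human-readable problem, or null when SITE looks sane.
function checkSiteEnv(env: EnvMap, knownSites: string[] = KNOWN_SITES): string | null {
  const site = env["SITE"]?.trim();
  if (!site) {
    return "SITE is not set";
  }
  if (knownSites.indexOf(site) === -1) {
    return `SITE="${site}" is not one of: ${knownSites.join(", ")}`;
  }
  return null;
}
```

In a Node script this could be called as `checkSiteEnv(process.env)`, printing the result before the suite starts.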

CI Versus Local Mismatch

Pattern	Likely Cause
passes locally, fails in CI	timing, network, artifact, or baseline difference
only fails on one runner	environment instability
visual suite only fails in CI	Linux baseline mismatch or missing snapshots
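
By default, Playwright snapshot file names end in a platform suffix such as `-darwin` or `-linux`, so baselines recorded on a Mac do not satisfy a Linux CI runner. The sketch below lists snapshots that have a macOS baseline but no Linux counterpart. It assumes the default naming scheme; if the project sets a custom `snapshotPathTemplate`, the suffix logic will differ. `findMissingLinuxBaselines` is a hypothetical helper.

```typescript
// Given snapshot file names, return the Linux baselines that appear to be missing.
// Assumes Playwright's default "<name>-<project>-<platform>.png" naming.
function findMissingLinuxBaselines(files: string[]): string[] {
  const macSuffix = "-darwin.png";
  const missing: string[] = [];
  for (const file of files) {
    // Only macOS baselines are interesting here.
    if (file.slice(-macSuffix.length) !== macSuffix) continue;
    const linuxName = file.slice(0, -macSuffix.length) + "-linux.png";
    if (files.indexOf(linuxName) === -1) missing.push(linuxName);
  }
  return missing;
}
```

Feeding it the contents of a snapshot directory gives a quick list of baselines to generate on Linux before reading any CI diffs.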

Suite-Specific Quick Advice

Smoke, Shadow, PDT

If You See	Think About
one selector timeout on one site	recent redesign or selector drift
odd cross-domain behavior on Blesk family sites	host guard logic and site assumptions
sudden broad failure cluster	shared environment or site-wide issue

E2E And User-Flows

If You See	Think About
gallery or video failures	content rotation, fallback logic, lazy loading
login or premium failures	auth state, consent overlays, modal interference
conditional live-surface failures	ephemeral content; not always a regression

Mobile

If You See	Think About
mobile-only flake	responsive overlays, touch target timing, seed URL choice
Safari-specific drift	browser-specific rendering or interaction differences
deep-tier failures	performance budget and journey timing before selectors

Content

If You See	Think About
many missing entries in one context	seed URL or context grouping issue
one handler failing repeatedly	handler logic before registry data
regressions after CMS change	registry and selector freshness

Visual

If You See	Think About
`A snapshot doesn't exist`	missing baseline, not a visual regression
many failures across all sites at once	baseline lifecycle or compare-mode problem
one true diff on one site	structural change or intended redesign
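
The three rows above amount to a small decision rule, sketched here for illustration. The `VisualFailure` shape, the `triageVisualFailures` helper, and the hint names are hypothetical; only the `A snapshot doesn't exist` text comes from the table. Treating "every site failed" as a baseline or compare-mode signal is a heuristic, not a guarantee.

```typescript
type VisualFailure = { site: string; message: string };
type VisualHint = "missing-baseline" | "baseline-lifecycle" | "inspect-diff";

// Error text from the table above; it indicates that no baseline file was found.
const MISSING_BASELINE = "A snapshot doesn't exist";

function triageVisualFailures(failures: VisualFailure[], totalSites: number): VisualHint | null {
  if (failures.length === 0) return null;

  // Any missing baseline should be resolved before reading diffs.
  if (failures.some((f) => f.message.indexOf(MISSING_BASELINE) !== -1)) {
    return "missing-baseline";
  }

  // Failures on every site at once suggest the baseline set or compare mode,
  // not the pages themselves.
  const sites = failures.map((f) => f.site).filter((s, i, all) => all.indexOf(s) === i);
  if (totalSites > 1 && sites.length === totalSites) return "baseline-lifecycle";

  // Otherwise the diff is worth reading: structural change or intended redesign.
  return "inspect-diff";
}
```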

Integration Problems

Slack

Symptom	Usual Cause
no message posted	missing webhook or bot token
wrong channel	env var mismatch
thread replies missing	bot-token path unavailable or thread tracking issue
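
A quick configuration check can rule out the causes in this table before anyone digs into the posting code. The sketch below is a minimal example under stated assumptions: the variable names `SLACK_WEBHOOK_URL`, `SLACK_BOT_TOKEN`, and `SLACK_CHANNEL` are hypothetical stand-ins for whatever the integration reads, and `diagnoseSlackConfig` is an illustrative helper.

```typescript
// Hypothetical variable names; use the ones your Slack integration actually reads.
const WEBHOOK_VAR = "SLACK_WEBHOOK_URL";
const BOT_TOKEN_VAR = "SLACK_BOT_TOKEN";
const CHANNEL_VAR = "SLACK_CHANNEL";

type SlackEnv = Record<string, string | undefined>;

// Returns one finding per likely configuration problem; an empty list means
// the configuration at least looks complete.
function diagnoseSlackConfig(env: SlackEnv): string[] {
  const has = (name: string): boolean => (env[name] ?? "").trim() !== "";
  const findings: string[] = [];

  if (!has(WEBHOOK_VAR) && !has(BOT_TOKEN_VAR)) {
    findings.push("no webhook or bot token: nothing can be posted");
  }
  if (!has(BOT_TOKEN_VAR)) {
    findings.push("no bot token: thread replies are unavailable");
  }
  if (!has(CHANNEL_VAR)) {
    findings.push("no channel set: check where messages are routed");
  }
  return findings;
}
```

In Node, `diagnoseSlackConfig(process.env)` returns an empty array when all three values are present, which shifts suspicion to the values themselves or to thread tracking.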

Grafana And OpenSearch

Symptom	Usual Cause
dashboard panels empty	datasource or query drift
query validation errors	field-name or index mismatch
misleading aggregates	legacy and current data mixed in shared indices
OpenSearch write issues	wrong endpoint or missing permissions

A Good Investigation Order

1. confirm whether the failure is isolated or clustered
2. check artifacts and logs
3. check whether history or incident memory already explains it
4. decide if the problem is product, test, infra, or configuration
5. only then decide whether to fix code, rerun, or just watch
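
The first step, deciding whether a failure is isolated or clustered, can be approximated mechanically. The sketch below is one possible heuristic, not the project's own logic: the `RunFailure` shape, the `describeSpread` helper, and the default threshold of 3 are all assumptions to tune against real suite sizes.

```typescript
type RunFailure = { suite: string; site: string; test: string };
type Spread = { kind: "none" | "isolated" | "clustered"; suites: number; sites: number };

// Count distinct values without relying on newer collection types.
const distinct = (values: string[]): number =>
  values.filter((v, i, all) => all.indexOf(v) === i).length;

// clusterThreshold is an assumed cut-off for "many failures at once".
function describeSpread(failures: RunFailure[], clusterThreshold = 3): Spread {
  const suites = distinct(failures.map((f) => f.suite));
  const sites = distinct(failures.map((f) => f.site));
  if (failures.length === 0) return { kind: "none", suites, sites };

  // Failures spanning suites or sites, or simply many of them, suggest a
  // shared cause worth checking before touching individual tests.
  const clustered = suites > 1 || sites > 1 || failures.length >= clusterThreshold;
  return { kind: clustered ? "clustered" : "isolated", suites, sites };
}
```

A "clustered" result argues for looking at the environment first; an "isolated" one argues for going straight to artifacts and logs for that test.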

Need	Command
recent failures	`npm run os:failed`
fix history	`npm run fix:recent`
health verification	`npm run observability:verify:health`
dashboard query validation	`npm run grafana:validate-data`
local debug run	`npm run test:debug`
UI investigation	`npm run test:ui`

When To Escalate

Escalate sooner when:

several suites fail at once
a deploy-critical PDT failure repeats
dashboards or observability are blind
robots.txt, sitemap, or URL monitors show broad site damage
the same incident repeats after a supposed fix

Need	Go To
failure interpretation	Failure Categories
alerting and reports	Reporting
commands	CLI Reference