Architecture Overview
| Key | Value |
|---|---|
| Status | Active |
| Owner | QA Automation |
| Updated | 2026-03-26 |
| Scope | System design, data flow, and operational building blocks |
PW-Tests is built around one idea: tests should produce reusable operational evidence, not just a red or green line in CI. That is why the architecture is wider than Playwright itself: a run starts in Playwright, then continues through logging, classification, Slack reporting, historical matching, dashboards, and long-term documentation.
Core Design Principles
| Principle | What It Means In Practice |
|---|---|
| Tests stay focused | Tests verify behavior. They do not decide policy, reporting, or recovery rules on their own. |
| Evidence is reusable | The same run data feeds Slack alerts, OpenSearch, Grafana, reports, and healing workflows. |
| Site-specific behavior lives in configuration | Shared logic stays in framework code; per-site differences stay in config and handlers. |
| Operational clarity matters | The system tries to distinguish regression, flake, infra noise, and already-known incidents. |
| Reliability beats cleverness | Fast health checks run often; deeper suites run when the signal is worth the time. |
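The "site-specific behavior lives in configuration" principle can be sketched as follows. The `SiteConfig` shape, site names, and field names below are illustrative assumptions, not the repo's real schema in src/config/:

```typescript
// Hypothetical sketch: per-site differences live in a config object,
// while shared framework code consumes it generically.
interface SiteConfig {
  baseUrl: string;
  consentSelector: string; // site-specific cookie-consent control
  smokeTimeoutMs: number;  // fast health checks stay fast
}

const sites: Record<string, SiteConfig> = {
  "example-news": {
    baseUrl: "https://news.example.com",
    consentSelector: "#accept-cookies",
    smokeTimeoutMs: 10_000,
  },
  "example-shop": {
    baseUrl: "https://shop.example.com",
    consentSelector: "button[data-consent='accept']",
    smokeTimeoutMs: 15_000,
  },
};

// Shared logic never branches on site names directly; it only reads
// the config for whichever site a run targets.
function resolveSite(name: string): SiteConfig {
  const cfg = sites[name];
  if (!cfg) throw new Error(`Unknown site: ${name}`);
  return cfg;
}
```

The point of the split is that adding a site should mean adding a config entry, not editing framework code.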
High-Level System View
```mermaid
%%{init: {'theme':'base'}}%%
flowchart LR
CI["GitLab schedules and manual runs"] --> PW["Playwright suites"]
PW --> EL["EventLogger and reporter output"]
EL --> ART["Artifacts and local run files"]
EL --> OS["OpenSearch"]
EL --> SL["Slack notifications"]
OS --> GR["Grafana dashboards"]
ART --> AI["Healing, incident matching, reports"]
AI --> SL
AI --> HIST["Failure history and incident stores"]
HIST --> SL
```
Main Subsystems
| Subsystem | What It Owns |
|---|---|
| tests/ | Playwright suites for smoke, PDT, E2E, mobile, content, visual, and more |
| src/core/ | EventLogger, shared test hooks, base helpers, log schema |
| src/config/ | Site configuration, selectors, content registry |
| src/services/ | Incident matching, failure priors, cause assessment, intelligence |
| src/integrations/ | Slack and service integration wrappers |
| src/reporting/ | Reporter pipeline and report data preparation |
| scripts/ci/ | Slack notifiers, reports, recovery replies, merge and CI utilities |
| scripts/monitoring/ | Consent, selector, URL, robots.txt, sitemap, Grafana visual checks |
| observability/ | Grafana dashboard-as-code, OpenSearch setup, Prometheus integration |
| .confluence/ | Confluence publishing flow and formatting rules |
Suite Layout
The system is broader than the original smoke-plus-E2E model. The current platform includes:
| Area | Purpose |
|---|---|
| Smoke | Fast health signal |
| Shadow | Continuous degradation monitoring |
| PDT | Post-deploy confidence checks |
| E2E | Full user journeys |
| User-Flows | Auth, premium, and session-sensitive flows |
| Mobile | Responsive, touch, and mobile-performance tiers |
| Content | Registry-driven content module validation |
| Visual | Structural screenshot comparisons |
| Performance | Core Web Vitals and baseline comparison |
| Ads | Ad placement and rendering checks |
| Events | Analytics event verification |
| Monitors | Consent, selectors, URLs, robots.txt, sitemap health, Grafana visual quality |
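One way to picture how these areas map onto Playwright execution is as a list of project definitions. This is an illustrative sketch only; the names, paths, and parallelism flags are assumptions, not the repo's actual playwright.config.ts:

```typescript
// Hypothetical mapping of suite areas to Playwright-style "projects".
// Paths and flags are invented for illustration.
const projects = [
  { name: "smoke",  testDir: "tests/smoke",  fullyParallel: true },
  { name: "pdt",    testDir: "tests/pdt",    fullyParallel: true },
  { name: "e2e",    testDir: "tests/e2e",    fullyParallel: false }, // sequential for stability
  { name: "mobile", testDir: "tests/mobile", fullyParallel: true },
  { name: "visual", testDir: "tests/visual", fullyParallel: false },
];

// A CI trigger would select one project per run.
function pickProject(name: string) {
  return projects.find((p) => p.name === name);
}
```

The design choice this reflects is the one stated above: fast suites stay parallel and narrow, while stability-sensitive suites run sequentially on purpose.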
How A Run Moves Through The System
1. Scheduling
Runs start from GitLab schedules, manual web triggers, post-deploy workflows, or local development.
2. Test Execution
Playwright executes the relevant suite or project. Some suites are intentionally sequential for stability. Others are narrow and fast by design.
3. Logging
The EventLogger writes structured records locally and, when configured, to OpenSearch. The reporter adds run summaries, screenshot records, and step records used by Grafana and investigation workflows.
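A minimal sketch of the structured-logging idea, assuming one JSONL file per run. The field names (`runId`, `event`, `ts`) and the class shape are illustrative; the real EventLogger and its schema live in src/core/:

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Illustrative record shape; the real log schema lives in src/core/.
interface RunEvent {
  runId: string;
  suite: string;
  event: string;  // e.g. "test.start", "test.fail"
  ts: string;     // ISO timestamp
  detail?: Record<string, unknown>;
}

class EventLoggerSketch {
  constructor(private file: string) {}

  log(e: RunEvent): void {
    // One JSON object per line: easy to tail locally, easy to bulk-index later.
    fs.appendFileSync(this.file, JSON.stringify(e) + "\n");
  }

  read(): RunEvent[] {
    return fs
      .readFileSync(this.file, "utf8")
      .split("\n")
      .filter(Boolean)
      .map((line) => JSON.parse(line));
  }
}

const logFile = path.join(os.tmpdir(), "pw-tests-events.jsonl");
fs.writeFileSync(logFile, ""); // fresh file per run
const logger = new EventLoggerSketch(logFile);
logger.log({ runId: "r1", suite: "smoke", event: "test.fail", ts: new Date().toISOString() });
```

The same records can then serve double duty: read locally for debugging, or shipped to OpenSearch for dashboards.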
4. Classification And Post-Processing
After execution, post-processing scripts can:
- send Slack alerts
- match a failure against known incidents
- look at historical recurrence
- add root-cause confidence
- post recovery replies when an issue stops repeating
- generate weekly or monthly summaries
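The incident-matching and recurrence steps above can be sketched roughly as follows. The record shapes and the matching rule (same test id within a time window) are assumptions for illustration; the real logic lives in src/services/:

```typescript
// Hedged sketch of post-run classification: is this failure already a
// known incident, a recurring flake, or something new worth alerting on?
interface Failure { testId: string; ranAt: number }   // epoch ms
interface Incident { testId: string; open: boolean }

function classify(
  failure: Failure,
  history: Failure[],
  incidents: Incident[],
  windowMs = 7 * 24 * 3600 * 1000, // assumed 7-day recurrence window
): "known-incident" | "recurring" | "new" {
  // An open incident for the same test suppresses a fresh alert.
  if (incidents.some((i) => i.open && i.testId === failure.testId)) {
    return "known-incident";
  }
  const recent = history.filter(
    (h) => h.testId === failure.testId && failure.ranAt - h.ranAt < windowMs,
  );
  return recent.length >= 2 ? "recurring" : "new";
}
```

Each outcome would then drive a different downstream action: a thread reply for known incidents, a flake annotation for recurring ones, a full alert for new ones.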
5. Human Consumption
The same data is then visible in:
- Slack for immediate action
- Grafana for trends and investigation
- Confluence and markdown docs for long-term system memory
Event And Reporting Architecture
| Layer | What It Produces | Who Uses It |
|---|---|---|
| EventLogger | detailed event stream | debugging, AI workflows, telemetry |
| Reporter | test summaries, screenshots, step records | Grafana, Slack, reports |
| Failure history | recurrence over time | humanized notifications, priors |
| Incident store | known resolved or open incidents | root-cause tagging and matching |
| Cause assessor | confidence-weighted verdicts | Slack investigations and future predictive workflows |
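A confidence-weighted verdict of the kind the cause assessor produces might look like this. The signal names, thresholds, and confidence values are invented for illustration; the real assessor lives in src/services/:

```typescript
// Illustrative sketch: combine priors and run signals into a verdict.
interface Signals {
  flakyPrior: number;        // 0..1, from failure history
  infraErrors: boolean;      // e.g. timeouts or DNS failures in the run
  matchedIncident: boolean;  // failure matched a known incident
}

function assessCause(s: Signals): { verdict: string; confidence: number } {
  if (s.matchedIncident) return { verdict: "known-incident", confidence: 0.9 };
  if (s.infraErrors) return { verdict: "infra", confidence: 0.7 };
  if (s.flakyPrior > 0.6) return { verdict: "likely-flaky", confidence: s.flakyPrior };
  return { verdict: "likely-regression", confidence: 1 - s.flakyPrior };
}
```

The key property is that the verdict carries a confidence, so Slack messages can say "likely flaky" instead of asserting certainty.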
Why The Logging Model Changed
Older versions of the repo mixed multiple logging styles and left operators stitching together context by hand. The current model is simpler:
- one event pipeline instead of several overlapping ones
- shared identifiers across reporter and logger records
- local artifacts for direct debugging
- OpenSearch records for dashboards and long-term analysis
This is what makes things like recurrence detection, incident clustering, and recovery replies possible.
Current Operational Features Worth Knowing
| Feature | Why It Matters |
|---|---|
| Humanized Slack failure alerts | Reduces panic and makes triage faster |
| Slack recovery replies | Closes the loop when a failure disappears after a fix |
| Incident store | Prevents the team from rediscovering the same root cause every week |
| Failure priors and cause assessor | Makes “likely flaky” versus “likely regression” more evidence-based |
| Grafana dashboard-as-code | Keeps dashboards reviewable and deployable from git |
| Observability verification scripts | Lets operators check health and data quality without a manual audit |
| Visual fact-checker | Adds AI review on top of screenshot artifacts for selected flows |
Important Data Stores
| Path Or Store | Role |
|---|---|
| test-results/logs/ | Local structured logs |
| test-results/history/ | Run history for recurrence and trend logic |
| data/fixes.json | Fix database |
| data/failure-incidents.json | Known incident registry |
| data/failure-history.json | Confirmed recovery history and prior signals |
| data/slack-threads.json | Slack thread tracking for recovery replies |
| OpenSearch cncqa_tests-* | Human-facing test records for dashboards |
| OpenSearch cncqa_events-* | Machine-facing event records |
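The split between the two OpenSearch index families could be routed with a helper like this. The daily date-suffix format is an assumption; only the `cncqa_tests-*` / `cncqa_events-*` family names come from the table above:

```typescript
// Sketch: route a record to the human-facing or machine-facing index
// family. The YYYY-MM-DD suffix is an assumed rollover convention.
function indexFor(recordType: "test" | "event", date: Date): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD
  const family = recordType === "test" ? "cncqa_tests" : "cncqa_events";
  return `${family}-${day}`;
}
```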
Directory Map
| Path | Why It Exists |
|---|---|
| src/ | Shared runtime code |
| tests/ | Playwright suites and helpers |
| scripts/ | Operators’ command-line tools and CI helpers |
| observability/ | Dashboards, mappings, retention helpers |
| docs/wiki/ | Main Confluence wiki source |
| docs/confluence/ | Standalone Confluence pages outside the main wiki tree |
| .confluence/ | Publishing engine and Confluence-specific rules |
If you are new to the repo, this order usually works best:
- Read Test Types to understand the suite landscape.
- Read Logging System to understand what evidence each run leaves behind.
- Read Integrations to understand Slack, Grafana, and OpenSearch.
- Read AI Processing only after the basics make sense.
Practical Takeaway
If you remember one thing, make it this: PW-Tests is not just a collection of Playwright specs. It is a small operational platform. Tests create the evidence, but the value comes from how that evidence is enriched, routed, explained, and reused.