Observability

| Key | Value |
| --- | --- |
| Status | Active |
| Owner | QA Automation |
| Updated | 2026-03-26 |
| Scope | OpenSearch, Grafana, Prometheus, and Alertmanager (the telemetry and dashboarding stack) |

Observability is what turns raw test results into something you can investigate, trend, and act on. The stack here combines structured log storage, metrics collection, dashboards, and alert routing so that operators do not have to piece together context by hand.

Stack Components

| Component | Port | Purpose |
| --- | --- | --- |
| OpenSearch | 9200 | structured log storage, full-text search, aggregation queries |
| OpenSearch Dashboards | 5601 | log browser and ad-hoc query UI |
| Grafana | 3000 | primary dashboards, trends, investigation panels |
| Prometheus | 9090 | time-series metrics from the metrics exporter |
| Alertmanager | 9093 | alert routing rules, Slack delivery |
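As a rough illustration of how the ports above could be probed, the sketch below pairs each component with its standard upstream health endpoint. It assumes a local stack reachable over plain HTTP without authentication; the `COMPONENTS` list, `healthUrl`, and `checkStack` are hypothetical names and not the implementation behind `npm run observability:status`.

```typescript
// Ports come from the component table. The health paths are each tool's
// standard upstream endpoint, not something this repo defines.
interface Component {
  name: string;
  port: number;
  healthPath: string;
}

const COMPONENTS: Component[] = [
  { name: "OpenSearch", port: 9200, healthPath: "/_cluster/health" },
  { name: "OpenSearch Dashboards", port: 5601, healthPath: "/api/status" },
  { name: "Grafana", port: 3000, healthPath: "/api/health" },
  { name: "Prometheus", port: 9090, healthPath: "/-/healthy" },
  { name: "Alertmanager", port: 9093, healthPath: "/-/healthy" },
];

// Build the URL to probe for one component (assumes plain HTTP).
function healthUrl(c: Component, host = "localhost"): string {
  return `http://${host}:${c.port}${c.healthPath}`;
}

// Probe every component; a component counts as up on any 2xx response.
async function checkStack(host = "localhost"): Promise<Record<string, boolean>> {
  const status: Record<string, boolean> = {};
  for (const c of COMPONENTS) {
    try {
      const res = await fetch(healthUrl(c, host));
      status[c.name] = res.ok;
    } catch {
      status[c.name] = false; // connection refused, DNS failure, and similar
    }
  }
  return status;
}
```

Calling `checkStack()` resolves to a map of component name to boolean, which is enough to spot a port conflict or a container that failed to start.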

Data Flow

%%{init: {'theme':'base', 'themeVariables': {'primaryColor': '#4a90d9', 'primaryTextColor': '#fff', 'primaryBorderColor': '#2c6fad', 'lineColor': '#555', 'fontFamily': 'sans-serif'}}}%%
flowchart LR
    TESTS["Playwright tests"] --> EL["EventLogger\nsrc/core/event-logger.ts"]
    EL --> JSONL["Local JSONL logs\ntest-results/logs/"]
    EL --> OS["OpenSearch\ncncqa_tests-* / cncqa_events-*"]
    OS --> GR["Grafana dashboards"]
    TESTS --> PROM["Prometheus metrics\nmetrics-exporter"]
    PROM --> GR
    GR --> AM["Alertmanager\nalert routing"]
    AM --> SLACK["Slack\nalert channels"]

Tests emit structured records through the EventLogger, which writes them to local JSONL files and to OpenSearch for long-term querying. In parallel, Prometheus collects numeric metrics from the metrics exporter. Grafana reads from both OpenSearch and Prometheus to build its dashboards, and Alertmanager routes threshold-based alerts to Slack.
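The dual write path (local JSONL plus OpenSearch) can be sketched as two serializers over the same record. This is a minimal illustration, not the code in src/core/event-logger.ts: the `TestRecord` shape, the date-suffixed index naming, and the helper names are all assumptions. The only fixed part is the newline-delimited format that the OpenSearch `_bulk` API expects.

```typescript
// Hypothetical record shape; the real schema is defined by the EventLogger.
interface TestRecord {
  "@timestamp": string; // ISO 8601
  test: string;
  status: "passed" | "failed" | "skipped";
}

// One JSON document per line, as appended to a file under test-results/logs/.
function toJsonl(record: TestRecord): string {
  return JSON.stringify(record) + "\n";
}

// Assumed naming: prefix plus the date portion of the timestamp,
// for example cncqa_tests-2026.03.26.
function indexFor(prefix: string, record: TestRecord): string {
  const day = record["@timestamp"].slice(0, 10).replace(/-/g, ".");
  return `${prefix}-${day}`;
}

// Body for the OpenSearch _bulk API: an action line followed by the
// document line, each terminated by a newline.
function toBulkBody(prefix: string, records: TestRecord[]): string {
  return records
    .map((r) => JSON.stringify({ index: { _index: indexFor(prefix, r) } }) + "\n" + toJsonl(r))
    .join("");
}
```

Sharing `toJsonl` between the two paths keeps the local log line and the indexed document byte-for-byte identical, which makes it easier to compare them when the OpenSearch write path is in doubt.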

OpenSearch Indices And Users

Indices

| Index Pattern | Contents | Audience |
| --- | --- | --- |
| cncqa_tests-* | test summaries, pass/fail records, screenshots | Grafana dashboards, reports |
| cncqa_events-* | detailed event stream (per-action records) | AI workflows, deep debugging |
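To show the kind of aggregation the cncqa_tests-* indices might serve, here is a sketch of a per-site pass-rate query. The field names (`@timestamp`, `site`, `status`) and the `passed` status value are assumptions, not the documented schema; the query structure itself is standard OpenSearch query DSL.

```typescript
// Search body for POST cncqa_tests-*/_search. Field names are assumed.
function passRateBySiteQuery(days: number) {
  return {
    size: 0, // aggregations only, no individual hits
    query: { range: { "@timestamp": { gte: `now-${days}d/d` } } },
    aggs: {
      by_site: {
        terms: { field: "site.keyword", size: 50 },
        aggs: {
          passed: { filter: { term: { "status.keyword": "passed" } } },
        },
      },
    },
  };
}

// Shape of one bucket in the by_site aggregation response.
interface SiteBucket {
  key: string;
  doc_count: number;
  passed: { doc_count: number };
}

// Turn response buckets into a site -> pass rate map (0..1).
function passRates(buckets: SiteBucket[]): Record<string, number> {
  const rates: Record<string, number> = {};
  for (const b of buckets) {
    rates[b.key] = b.doc_count === 0 ? 0 : b.passed.doc_count / b.doc_count;
  }
  return rates;
}
```

Passing `14` for `days` would cover the same window as a 14-day trend panel.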

Access Users

| User | Access Level | Use Case |
| --- | --- | --- |
| cnc_writer | write + search via Grafana proxy UID | reporter write path, shared across CNC projects |
| v1admin | cluster admin via Grafana proxy ID 38 | ISM policies, index templates, retention setup, index deletion |

Requests for both users go through the Grafana proxy and authenticate with GRAFANA_SERVICE_ACCOUNT_TOKEN. Direct access to port 9200 is not the primary path in production.
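A request through the Grafana proxy could be assembled roughly as follows. The sketch assumes Grafana's standard data source proxy routes (`/api/datasources/proxy/uid/<uid>/...` and the older numeric `/api/datasources/proxy/<id>/...`) and bearer-token authentication. `ProxyTarget` and `proxyRequest` are hypothetical names, and the data source UID must be supplied by the caller.

```typescript
interface ProxyTarget {
  grafanaUrl: string; // value of GRAFANA_URL
  token: string;      // value of GRAFANA_SERVICE_ACCOUNT_TOKEN
  datasource: { uid: string } | { id: number };
}

// Build the URL and headers for an OpenSearch call routed through Grafana.
function proxyRequest(
  target: ProxyTarget,
  path: string,
): { url: string; headers: Record<string, string> } {
  const base = target.grafanaUrl.replace(/\/+$/, ""); // drop trailing slashes
  const ds =
    "uid" in target.datasource
      ? `uid/${encodeURIComponent(target.datasource.uid)}`
      : String(target.datasource.id);
  const cleanPath = path.replace(/^\/+/, ""); // drop leading slashes
  return {
    url: `${base}/api/datasources/proxy/${ds}/${cleanPath}`,
    headers: {
      Authorization: `Bearer ${target.token}`,
      "Content-Type": "application/json",
    },
  };
}
```

With `datasource: { id: 38 }` and a path of `_cat/indices`, the resulting URL ends in `/api/datasources/proxy/38/_cat/indices`, so the client never talks to port 9200 directly.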

Starting And Stopping The Stack

| Command | What It Does |
| --- | --- |
| npm run observability:start | start OpenSearch, Grafana, Prometheus, Alertmanager |
| npm run observability:stop | stop the stack |
| npm run observability:status | check whether each component is running |
| npm run observability:logs | tail combined stack logs |

The stack runs via Docker Compose defined in observability/. Configuration files live under observability/prometheus/, observability/grafana/provisioning/, and observability/alertmanager/.

Grafana

Access

Grafana is available at https://grafana.measure.aws.cnci.tech in production (CI and scheduled runs) and at localhost:3000 when the local stack is running.

Dashboard Areas

| Dashboard Area | What It Shows |
| --- | --- |
| Status | current health summary, recent run outcomes, per-site and per-suite pass rates |
| Investigate | failure drill-down, error categories, selector failures, site-specific breakdowns |
| Trends | 14-day pass rate history, recurrence patterns, flaky test candidates |

Dashboard-As-Code

Grafana dashboards are not hand-authored JSON. They are generated from TypeScript using the Foundation SDK.

| Path | Purpose |
| --- | --- |
| observability/grafana/suite/ | TypeScript dashboard definitions |
| observability/grafana/generated/ | generated JSON dashboards (output) |
| observability/grafana/deploy.ts | deploy script |
| observability/grafana/generate.ts | generation pipeline |
| observability/grafana/validate.ts | query validation |

Dashboard Deployment Commands

| Command | What It Does |
| --- | --- |
| npm run grafana:deploy | generate and deploy dashboards to Grafana |
| npm run grafana:validate-data | validate that dashboard queries return data |
| npm run monitor:grafana | browser-based visual panel health check |

The monitor:grafana command opens Grafana in a real browser and checks each dashboard panel for empty states or error conditions. This catches query regressions that static validation misses.
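The three outcomes a panel check cares about (data, empty, error) can be expressed as a small classifier over a query response. The sketch below assumes the response shape of Grafana's `/api/ds/query` endpoint, where each ref ID maps to a result with optional `error` and `frames` fields. It is an illustration only and does not reflect how validate.ts or the browser monitor is implemented.

```typescript
// Subset of the /api/ds/query response that matters for a health check.
interface Frame {
  data?: { values?: unknown[][] }; // one array per column
}
interface QueryResult {
  error?: string;
  frames?: Frame[];
}
type PanelHealth = "ok" | "empty" | "error";

// Classify a panel from the results of all its queries (keyed by ref ID).
function classify(results: Record<string, QueryResult>): PanelHealth {
  const all = Object.values(results);
  if (all.some((r) => r.error)) return "error";
  const hasRows = all.some((r) =>
    (r.frames ?? []).some((f) =>
      (f.data?.values ?? []).some((column) => column.length > 0),
    ),
  );
  return hasRows ? "ok" : "empty";
}
```

A check like this catches a query that returns nothing, but not a panel that renders incorrectly in the browser, which is why the visual check remains useful alongside it.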

Retention And Index Management

| Command | What It Does |
| --- | --- |
| npm run os:setup-retention | create ISM policies and index templates (needs OPENSEARCH_URL + GRAFANA_TOKEN) |
| npm run os:stats | show OpenSearch index statistics |
| npm run os:failed | query recent failures from OpenSearch |

ISM (Index State Management) policies handle automatic rollover and deletion of old indices. Creating or changing them requires v1admin access, so run npm run os:setup-retention once during initial setup and again whenever retention rules change.
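For orientation, an ISM policy body for the deletion half of this could look like the sketch below. It follows the OpenSearch ISM policy format (sent with `PUT _plugins/_ism/policies/<policy_id>`), but the retention period, priority, and the `retentionPolicy` helper are placeholders rather than the values used by os:setup-retention. Rollover is left out because it additionally requires a rollover alias to be configured on the index.

```typescript
// Build an ISM policy that deletes matching indices after a fixed age.
// retentionDays and the priority of 100 are illustrative assumptions.
function retentionPolicy(indexPattern: string, retentionDays: number) {
  return {
    policy: {
      description: `Delete ${indexPattern} indices after ${retentionDays} days`,
      default_state: "hot",
      states: [
        {
          name: "hot",
          actions: [],
          transitions: [
            { state_name: "delete", conditions: { min_index_age: `${retentionDays}d` } },
          ],
        },
        { name: "delete", actions: [{ delete: {} }], transitions: [] },
      ],
      // Attach the policy automatically to newly created matching indices.
      ism_template: [{ index_patterns: [indexPattern], priority: 100 }],
    },
  };
}
```

`JSON.stringify(retentionPolicy("cncqa_events-*", 30))` would then be the request body; indices that existed before the policy was created are not covered by `ism_template` and would need the policy attached separately.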

Key Configuration

| Variable | Purpose |
| --- | --- |
| OPENSEARCH_URL | write endpoint; set to Grafana proxy URL in CI |
| GRAFANA_URL | Grafana base URL for links and queries |
| GRAFANA_SERVICE_ACCOUNT_TOKEN | authenticates both OpenSearch users via Grafana proxy |
| PROMETHEUS_PUSHGATEWAY_URL | metrics push target for CI runs |
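Failing fast on a missing variable tends to be easier to debug than an empty dashboard later. The sketch below validates the variables from the table; which ones are treated as required, and the `loadConfig` name itself, are assumptions made for illustration.

```typescript
interface ObservabilityConfig {
  OPENSEARCH_URL: string;
  GRAFANA_URL: string;
  GRAFANA_SERVICE_ACCOUNT_TOKEN: string;
  PROMETHEUS_PUSHGATEWAY_URL?: string; // assumed optional outside CI
}

// Assumption: these three are needed everywhere the reporter runs.
const REQUIRED = [
  "OPENSEARCH_URL",
  "GRAFANA_URL",
  "GRAFANA_SERVICE_ACCOUNT_TOKEN",
] as const;

// Read and validate configuration from an environment-like record.
function loadConfig(env: Record<string, string | undefined>): ObservabilityConfig {
  const missing = REQUIRED.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    OPENSEARCH_URL: env.OPENSEARCH_URL as string,
    GRAFANA_URL: env.GRAFANA_URL as string,
    GRAFANA_SERVICE_ACCOUNT_TOKEN: env.GRAFANA_SERVICE_ACCOUNT_TOKEN as string,
    PROMETHEUS_PUSHGATEWAY_URL: env.PROMETHEUS_PUSHGATEWAY_URL,
  };
}
```

In a Node.js runner this would be called as `loadConfig(process.env)`; taking the environment as a parameter keeps the function easy to test.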

Common Issues

| Issue | What To Check |
| --- | --- |
| OpenSearch not starting locally | port 9200 conflict; check npm run observability:status |
| Grafana showing no data | verify OPENSEARCH_URL points to the correct proxy; check that the index pattern matches cncqa_tests-* |
| field type mismatch in queries | old indices (pre-2025) may lack .keyword sub-fields; aggregations on those indices will fail |
| V1 vs beta data mixing | V1 records use blesk.cz site names; beta records use blesk; queries may need to handle both |
| events index empty | cncqa_events-* exists, but the EventLogger write path may not reach OpenSearch in every environment; check OPENSEARCH_URL in the runner environment |

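One way to handle the V1 vs beta naming split is to expand a site name into both variants before filtering. The sketch assumes the only difference is a trailing `.cz`, as in the blesk.cz / blesk example, and that the site is stored in a `site` field with a `.keyword` sub-field; both are assumptions, as are the helper names.

```typescript
// Return both spellings of a site name: beta ("blesk") and V1 ("blesk.cz").
// Assumes the V1 name is the beta name plus a ".cz" suffix.
function siteVariants(site: string): string[] {
  const beta = site.replace(/\.cz$/, "");
  return [beta, `${beta}.cz`];
}

// Query DSL filter that matches records written under either spelling.
function siteFilter(site: string) {
  return { terms: { "site.keyword": siteVariants(site) } };
}
```

Either `siteFilter("blesk")` or `siteFilter("blesk.cz")` yields the same `terms` filter, so dashboards and scripts do not need to know which generation wrote a record.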

| Need | Go To |
| --- | --- |
| logging model and event schema | Logging System |
| Slack, GitLab, and OpenSearch credentials | Integrations |
| configuration and env vars | Configuration Guide |
| dashboard queries and patterns | .claude/docs/grafana-patterns.md in the repo |