Observability

Key	Value
Status	Active
Owner	QA Automation
Updated	2026-03-26
Scope	OpenSearch, Grafana, Prometheus, and Alertmanager — the telemetry and dashboarding stack

Observability is what turns raw test results into something you can investigate, trend, and act on. The stack here combines structured log storage, metrics collection, dashboards, and alert routing so that operators do not have to piece together context by hand.

Stack Components

Component	Port	Purpose
OpenSearch	9200	structured log storage, full-text search, aggregation queries
OpenSearch Dashboards	5601	log browser and ad-hoc query UI
Grafana	3000	primary dashboards, trends, investigation panels
Prometheus	9090	time-series metrics from the metrics exporter
Alertmanager	9093	alert routing rules, Slack delivery

Data Flow

%%{init: {'theme':'base', 'themeVariables': {'primaryColor': '#4a90d9', 'primaryTextColor': '#fff', 'primaryBorderColor': '#2c6fad', 'lineColor': '#555', 'fontFamily': 'sans-serif'}}}%%
flowchart LR
    TESTS["Playwright tests"] --> EL["EventLogger\nsrc/core/event-logger.ts"]
    EL --> JSONL["Local JSONL logs\ntest-results/logs/"]
    EL --> OS["OpenSearch\ncncqa_tests-* / cncqa_events-*"]
    OS --> GR["Grafana dashboards"]
    TESTS --> PROM["Prometheus metrics\nmetrics-exporter"]
    PROM --> GR
    GR --> AM["Alertmanager\nalert routing"]
    AM --> SLACK["Slack\nalert channels"]

Tests emit structured records through the EventLogger, which writes them to local JSONL files and to OpenSearch for long-term querying. Grafana reads from OpenSearch to render dashboards. Prometheus collects numeric metrics from the metrics exporter in parallel and feeds the same dashboards. Alertmanager routes threshold-based alerts to Slack.
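As a rough illustration of the two write paths, the sketch below serializes the same records once as JSONL and once as a body for the OpenSearch `_bulk` API. The `TestRecord` fields, the helper names, and the daily index suffix are assumptions made for the example; the real EventLogger schema and index naming may differ.

```typescript
// Hypothetical record shape; the real EventLogger schema is not shown in this page.
interface TestRecord {
  timestamp: string; // ISO 8601, e.g. "2026-03-26T08:00:00Z"
  site: string;
  suite: string;
  status: "passed" | "failed" | "skipped";
  durationMs: number;
}

// Local log format: one JSON object per line.
function toJsonl(records: TestRecord[]): string {
  return records.map((r) => JSON.stringify(r)).join("\n") + "\n";
}

// Assumed naming scheme: daily indices such as cncqa_tests-2026.03.26.
function indexFor(prefix: string, timestamp: string): string {
  return `${prefix}-${timestamp.slice(0, 10).replace(/-/g, ".")}`;
}

// _bulk body: an action line followed by the document line for each record,
// newline-delimited, ending with a trailing newline.
function toBulkBody(prefix: string, records: TestRecord[]): string {
  return records
    .map((r) => {
      const action = JSON.stringify({ index: { _index: indexFor(prefix, r.timestamp) } });
      return `${action}\n${JSON.stringify(r)}\n`;
    })
    .join("");
}
```

The bulk body would be sent with `Content-Type: application/x-ndjson`; the JSONL output could be appended to a file under test-results/logs/.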

OpenSearch Indices And Users

Indices

Index Pattern	Contents	Audience
`cncqa_tests-*`	test summaries, pass/fail records, screenshots	Grafana dashboards, reports
`cncqa_events-*`	detailed event stream (per-action records)	AI workflows, deep debugging

Access Users

User	Access Level	Use Case
`cnc_writer`	write + search via Grafana proxy UID	reporter write path, shared across CNC projects
`v1admin`	cluster admin via Grafana proxy ID 38	ISM policies, index templates, retention setup, index deletion

Both users are reached through the Grafana proxy and authenticated with GRAFANA_SERVICE_ACCOUNT_TOKEN. Direct access to port 9200 is not the primary path in production.
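A request through the proxy might be assembled as in the sketch below. It relies on Grafana's data source proxy routes (`/api/datasources/proxy/<id>/...` and `/api/datasources/proxy/uid/<uid>/...`) and bearer-token authentication. The function name, types, and example host are hypothetical, and whether the project's own client builds URLs this way is an assumption.

```typescript
// A data source can be addressed by numeric ID or by UID.
type ProxyTarget = { kind: "id"; id: number } | { kind: "uid"; uid: string };

interface ProxyRequest {
  url: string;
  headers: Record<string, string>;
}

// Hypothetical helper: builds the URL and headers for an OpenSearch call
// routed through the Grafana data source proxy.
function buildProxyRequest(
  grafanaUrl: string,
  target: ProxyTarget,
  path: string,
  token: string
): ProxyRequest {
  const base = grafanaUrl.replace(/\/+$/, ""); // drop trailing slashes
  const ds = target.kind === "id" ? String(target.id) : `uid/${encodeURIComponent(target.uid)}`;
  const suffix = path.replace(/^\/+/, ""); // drop leading slashes
  return {
    url: `${base}/api/datasources/proxy/${ds}/${suffix}`,
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
  };
}
```

For the admin path, `target` would be `{ kind: "id", id: 38 }`; for the writer path, it would be the UID form with the data source UID supplied by configuration.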

Starting And Stopping The Stack

Command	What It Does
`npm run observability:start`	start OpenSearch, Grafana, Prometheus, Alertmanager
`npm run observability:stop`	stop the stack
`npm run observability:status`	check whether each component is running
`npm run observability:logs`	tail combined stack logs

The stack runs under Docker Compose, with the Compose definition in observability/. Configuration files live under observability/prometheus/, observability/grafana/provisioning/, and observability/alertmanager/.
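When a component appears to be down, it can help to probe its health endpoint directly. The sketch below maps the ports from the component table to each tool's standard health route. It is not what observability:status does internally; the helper name is hypothetical, and plain HTTP on localhost is an assumption about the local setup.

```typescript
interface Component {
  name: string;
  port: number;
  healthPath: string;
}

// Ports come from the component table; paths are each tool's standard health route.
const COMPONENTS: Component[] = [
  { name: "OpenSearch", port: 9200, healthPath: "/_cluster/health" },
  { name: "OpenSearch Dashboards", port: 5601, healthPath: "/api/status" },
  { name: "Grafana", port: 3000, healthPath: "/api/health" },
  { name: "Prometheus", port: 9090, healthPath: "/-/healthy" },
  { name: "Alertmanager", port: 9093, healthPath: "/-/healthy" },
];

// Returns a component-name -> URL map that can be fed to curl or fetch.
function healthUrls(host: string = "localhost"): Record<string, string> {
  const urls: Record<string, string> = {};
  for (const c of COMPONENTS) {
    urls[c.name] = `http://${host}:${c.port}${c.healthPath}`;
  }
  return urls;
}
```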

Grafana

Access

Grafana is available at https://grafana.measure.aws.cnci.tech in production (CI and scheduled runs) and at localhost:3000 when the local stack is running.

Dashboard Areas

Dashboard Area	What It Shows
Status	current health summary, recent run outcomes, per-site and per-suite pass rates
Investigate	failure drill-down, error categories, selector failures, site-specific breakdowns
Trends	14-day pass rate history, recurrence patterns, flaky test candidates

Dashboard-As-Code

Grafana dashboards are not hand-authored JSON. They are generated from TypeScript definitions using the Grafana Foundation SDK.

Path	Purpose
`observability/grafana/suite/`	TypeScript dashboard definitions
`observability/grafana/generated/`	generated JSON dashboards (output)
`observability/grafana/deploy.ts`	deploy script
`observability/grafana/generate.ts`	generation pipeline
`observability/grafana/validate.ts`	query validation
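The relationship between a typed definition and the generated JSON can be sketched without the SDK. The example below is a simplified stand-in that uses plain TypeScript objects rather than Foundation SDK builders; `DashboardSpec`, `PanelSpec`, and `renderDashboard` are hypothetical names, and the emitted JSON is only a small subset of a real dashboard model.

```typescript
// Hypothetical, simplified types: a stand-in for SDK builders, not the SDK's API.
interface PanelSpec {
  title: string;
  query: string; // Lucene-style query string for the OpenSearch data source
}

interface DashboardSpec {
  uid: string;
  title: string;
  panels: PanelSpec[];
}

// Turns a typed definition into dashboard JSON text.
// Stable panel ids and 2-space indentation keep diffs of generated files readable.
function renderDashboard(spec: DashboardSpec): string {
  const dashboard = {
    uid: spec.uid,
    title: spec.title,
    panels: spec.panels.map((p, i) => ({
      id: i + 1,
      title: p.title,
      type: "timeseries",
      targets: [{ refId: "A", query: p.query }],
    })),
  };
  return JSON.stringify(dashboard, null, 2) + "\n";
}
```

In a pipeline of this shape, the returned string would be written to a file under observability/grafana/generated/ and the deploy step would upload it.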

Dashboard Deployment Commands

Command	What It Does
`npm run grafana:deploy`	generate and deploy dashboards to Grafana
`npm run grafana:validate-data`	validate that dashboard queries return data
`npm run monitor:grafana`	browser-based visual panel health check

The monitor:grafana command opens Grafana in a real browser and checks each dashboard panel for empty states or error conditions. This catches query regressions that static validation misses.

Retention And Index Management

Command	What It Does
`npm run os:setup-retention`	create ISM policies and index templates (needs `OPENSEARCH_URL` + `GRAFANA_TOKEN`)
`npm run os:stats`	show OpenSearch index statistics
`npm run os:failed`	query recent failures from OpenSearch

ISM (Index State Management) policies handle automatic rollover and deletion of old indices. Creating them requires v1admin access; run os:setup-retention once during initial setup and again whenever retention rules change.
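For orientation, an ISM policy body of the kind described above might look like the following sketch. The structure follows the OpenSearch ISM policy format, but the state names, the ages, the priority, and the builder function itself are assumptions; the policies that os:setup-retention actually creates are not shown in this page.

```typescript
interface IsmState {
  name: string;
  actions: Array<Record<string, unknown>>;
  transitions: Array<{ state_name: string; conditions?: { min_index_age: string } }>;
}

interface IsmPolicyBody {
  policy: {
    description: string;
    default_state: string;
    states: IsmState[];
    ism_template: Array<{ index_patterns: string[]; priority: number }>;
  };
}

// Hypothetical builder: indices start in "hot", roll over after rolloverAge,
// and move to "delete" once they are older than deleteAfter.
// Note: the rollover action also needs a rollover alias configured on the index.
function buildRetentionPolicy(
  indexPatterns: string[],
  rolloverAge: string,
  deleteAfter: string
): IsmPolicyBody {
  return {
    policy: {
      description: `Roll over after ${rolloverAge}, delete after ${deleteAfter}`,
      default_state: "hot",
      states: [
        {
          name: "hot",
          actions: [{ rollover: { min_index_age: rolloverAge } }],
          transitions: [{ state_name: "delete", conditions: { min_index_age: deleteAfter } }],
        },
        { name: "delete", actions: [{ delete: {} }], transitions: [] },
      ],
      // Attaches the policy automatically to new indices matching the patterns.
      ism_template: [{ index_patterns: indexPatterns, priority: 100 }],
    },
  };
}
```

A body like `buildRetentionPolicy(["cncqa_tests-*"], "1d", "30d")` would be sent with `PUT _plugins/_ism/policies/<policy_id>`; the ages here are placeholders, not the project's retention rules.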

Key Configuration

Variable	Purpose
`OPENSEARCH_URL`	write endpoint; set to Grafana proxy URL in CI
`GRAFANA_URL`	Grafana base URL for links and queries
`GRAFANA_SERVICE_ACCOUNT_TOKEN`	authenticates both OpenSearch users via Grafana proxy
`PROMETHEUS_PUSHGATEWAY_URL`	metrics push target for CI runs
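A runner can fail fast when one of these variables is absent. The sketch below is a hypothetical pre-flight check; treating all four variables as required in every environment is an assumption, since `PROMETHEUS_PUSHGATEWAY_URL` is described only as the push target for CI runs.

```typescript
// Variable names come from the table above.
const REQUIRED_VARS = [
  "OPENSEARCH_URL",
  "GRAFANA_URL",
  "GRAFANA_SERVICE_ACCOUNT_TOKEN",
  "PROMETHEUS_PUSHGATEWAY_URL",
] as const;

// Returns the names of variables that are unset or blank.
// The environment is passed in so the check is easy to test.
function missingVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_VARS.filter((name) => (env[name] ?? "").trim() === "");
}
```

In a Node.js script this could be called as `missingVars(process.env)`, exiting with a non-zero status when the result is not empty.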

Common Issues

Issue	What To Check
OpenSearch not starting locally	port 9200 conflict; check `npm run observability:status`
Grafana showing no data	verify `OPENSEARCH_URL` points to correct proxy; check index pattern matches `cncqa_tests-*`
field type mismatch in queries	old indices (pre-2025) may lack `.keyword` sub-fields; aggregations on those fields will fail against the older indices
V1 vs beta data mixing	V1 records use `blesk.cz` site names; beta records use `blesk`; queries may need to handle both
events index empty	`cncqa_events-*` exists, but EventLogger writes to OpenSearch may not reach it in every environment; check `OPENSEARCH_URL` in the runner environment
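For the V1 and beta naming split, a query can match both spellings of a site with a `terms` filter. The sketch below is one possible approach, not the project's query code: the field name `site.keyword` is hypothetical, and deriving the V1 name by appending `.cz` is an assumption that holds for the `blesk` example only.

```typescript
// Assumption: the V1 name is the beta name plus ".cz" (true for blesk / blesk.cz).
function siteVariants(site: string): string[] {
  const beta = site.replace(/\.cz$/, "");
  return [beta, `${beta}.cz`];
}

// Builds an OpenSearch terms filter that matches either naming scheme.
// The default field name is hypothetical; pass the real one for the index.
function siteFilter(site: string, field: string = "site.keyword") {
  return { terms: { [field]: siteVariants(site) } };
}
```

The resulting object can be placed in the `filter` array of a `bool` query. On older indices without `.keyword` sub-fields, a different field name would have to be passed.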

Related Documentation

Need	Go To
logging model and event schema	Logging System
Slack, GitLab, and OpenSearch credentials	Integrations
configuration and env vars	Configuration Guide
dashboard queries and patterns	`.claude/docs/grafana-patterns.md` in the repo