Observability
| Key | Value |
|---|---|
| Status | Active |
| Owner | QA Automation |
| Updated | 2026-03-26 |
| Scope | OpenSearch, Grafana, Prometheus, and Alertmanager — the telemetry and dashboarding stack |
Observability is what turns raw test results into something you can investigate, trend, and act on. The stack here combines structured log storage, metrics collection, dashboards, and alert routing so that operators do not have to piece together context by hand.
Stack Components
| Component | Port | Purpose |
|---|---|---|
| OpenSearch | 9200 | structured log storage, full-text search, aggregation queries |
| OpenSearch Dashboards | 5601 | log browser and ad-hoc query UI |
| Grafana | 3000 | primary dashboards, trends, investigation panels |
| Prometheus | 9090 | time-series metrics from the metrics exporter |
| Alertmanager | 9093 | alert routing rules, Slack delivery |
Data Flow
%%{init: {'theme':'base', 'themeVariables': {'primaryColor': '#4a90d9', 'primaryTextColor': '#fff', 'primaryBorderColor': '#2c6fad', 'lineColor': '#555', 'fontFamily': 'sans-serif'}}}%%
flowchart LR
TESTS["Playwright tests"] --> EL["EventLogger\nsrc/core/event-logger.ts"]
EL --> JSONL["Local JSONL logs\ntest-results/logs/"]
EL --> OS["OpenSearch\ncncqa_tests-* / cncqa_events-*"]
OS --> GR["Grafana dashboards"]
TESTS --> PROM["Prometheus metrics\nmetrics-exporter"]
PROM --> GR
GR --> AM["Alertmanager\nalert routing"]
AM --> SLACK["Slack\nalert channels"]
Tests produce structured records through the EventLogger, which writes them to local JSONL logs and to OpenSearch for long-term querying. Prometheus collects numeric metrics in parallel. Grafana reads from both OpenSearch and Prometheus to build dashboards, and threshold-based alerts are routed through Alertmanager to Slack.
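The two write targets in the flow above can be sketched as follows. This is a minimal illustration, not the actual EventLogger code: the `TestRecord` shape, the helper names, and the date-suffixed index naming are assumptions. The only fixed parts are the JSONL convention (one JSON object per line) and the OpenSearch `_bulk` body format (an action line followed by the document, each newline-terminated).

```typescript
// Hypothetical record shape; the real EventLogger schema may differ.
interface TestRecord {
  timestamp: string; // ISO 8601, e.g. "2026-03-26T10:00:00Z"
  site: string;
  suite: string;
  status: "passed" | "failed" | "skipped";
}

// Assumption: indices carry a date suffix such as cncqa_tests-2026.03.26.
function dailyIndex(prefix: string, isoTimestamp: string): string {
  const day = isoTimestamp.slice(0, 10).replace(/-/g, ".");
  return `${prefix}-${day}`;
}

// Local log format: one JSON object per line.
function toJsonl(records: TestRecord[]): string {
  return records.map((r) => JSON.stringify(r) + "\n").join("");
}

// OpenSearch _bulk body: an action line, then the document, each ending in "\n".
function toBulkBody(prefix: string, records: TestRecord[]): string {
  return records
    .map((r) => {
      const action = { index: { _index: dailyIndex(prefix, r.timestamp) } };
      return JSON.stringify(action) + "\n" + JSON.stringify(r) + "\n";
    })
    .join("");
}
```

The same record array feeds both functions, so the local JSONL file and the OpenSearch index stay in step even if one write path fails.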
OpenSearch Indices And Users
Indices
| Index Pattern | Contents | Audience |
|---|---|---|
| cncqa_tests-* | test summaries, pass/fail records, screenshots | Grafana dashboards, reports |
| cncqa_events-* | detailed event stream (per-action records) | AI workflows, deep debugging |
Access Users
| User | Access Level | Use Case |
|---|---|---|
| cnc_writer | write + search via Grafana proxy UID | reporter write path, shared across CNC projects |
| v1admin | cluster admin via Grafana proxy ID 38 | ISM policies, index templates, retention setup, index deletion |
Both users are accessed via the Grafana proxy using GRAFANA_SERVICE_ACCOUNT_TOKEN. Direct access to port 9200 is not the primary path in production.
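As a rough sketch of what a proxied request looks like, the helper below builds the URL and headers for a call through Grafana. It assumes the proxy in question is Grafana's data source proxy API (`/api/datasources/proxy/uid/<uid>/...` or `/api/datasources/proxy/<id>/...`) and that the token is sent as a bearer token. The function name and the example host are hypothetical.

```typescript
interface ProxyRequest {
  url: string;
  headers: Record<string, string>;
}

// Assumption: the Grafana data source proxy API is the route in use.
// A data source is addressed either by UID or by numeric ID.
function buildProxyRequest(
  grafanaUrl: string,
  datasource: { uid: string } | { id: number },
  path: string,
  token: string
): ProxyRequest {
  const base = grafanaUrl.replace(/\/+$/, ""); // drop trailing slashes
  const target =
    "uid" in datasource
      ? `uid/${encodeURIComponent(datasource.uid)}`
      : String(datasource.id);
  const cleanPath = path.replace(/^\/+/, ""); // drop leading slashes
  return {
    url: `${base}/api/datasources/proxy/${target}/${cleanPath}`,
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
  };
}
```

For example, `buildProxyRequest(grafanaUrl, { id: 38 }, "_cluster/health", token)` would address the admin data source listed in the table, while the writer path would pass its UID instead.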
Starting And Stopping The Stack
| Command | What It Does |
|---|---|
| npm run observability:start | start OpenSearch, Grafana, Prometheus, Alertmanager |
| npm run observability:stop | stop the stack |
| npm run observability:status | check whether each component is running |
| npm run observability:logs | tail combined stack logs |
The stack runs via Docker Compose defined in observability/. Configuration files live under observability/prometheus/, observability/grafana/provisioning/, and observability/alertmanager/.
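To illustrate what a per-component status check involves, here is a small sketch. It is not the implementation behind `npm run observability:status`. The ports come from the Stack Components table; the health paths are the standard endpoints each tool exposes, and plain HTTP on localhost is an assumption about the local setup.

```typescript
interface Component {
  name: string;
  port: number;
  healthPath: string;
}

// Ports are from the Stack Components table; health paths are the tools'
// standard endpoints, not paths confirmed by this repository.
const COMPONENTS: Component[] = [
  { name: "OpenSearch", port: 9200, healthPath: "/_cluster/health" },
  { name: "OpenSearch Dashboards", port: 5601, healthPath: "/api/status" },
  { name: "Grafana", port: 3000, healthPath: "/api/health" },
  { name: "Prometheus", port: 9090, healthPath: "/-/healthy" },
  { name: "Alertmanager", port: 9093, healthPath: "/-/healthy" },
];

// Build the URL to probe for each component (plain HTTP is an assumption).
function healthUrls(host: string = "localhost"): string[] {
  return COMPONENTS.map((c) => `http://${host}:${c.port}${c.healthPath}`);
}

// Given probe results keyed by component name, list the components that are
// down. A component with no result is treated as down.
function downComponents(results: Record<string, boolean>): string[] {
  return COMPONENTS.filter((c) => results[c.name] !== true).map((c) => c.name);
}
```

A caller would probe each URL with whatever HTTP client is available, record `true` for a 2xx response, and pass the results to `downComponents`.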
Grafana
Access
Grafana is available at https://grafana.measure.aws.cnci.tech in production (CI and scheduled runs) and at http://localhost:3000 when the local stack is running.
Dashboard Areas
| Dashboard Area | What It Shows |
|---|---|
| Status | current health summary, recent run outcomes, per-site and per-suite pass rates |
| Investigate | failure drill-down, error categories, selector failures, site-specific breakdowns |
| Trends | 14-day pass rate history, recurrence patterns, flaky test candidates |
Dashboard-As-Code
Grafana dashboards are not hand-authored JSON. They are generated from TypeScript definitions using the Grafana Foundation SDK, and the generated JSON is treated as build output.
| Path | Purpose |
|---|---|
| observability/grafana/suite/ | TypeScript dashboard definitions |
| observability/grafana/generated/ | generated JSON dashboards (output) |
| observability/grafana/deploy.ts | deploy script |
| observability/grafana/generate.ts | generation pipeline |
| observability/grafana/validate.ts | query validation |
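The idea behind the generation step can be shown with a deliberately simplified sketch. It does not use the Foundation SDK, and it is not the code in generate.ts: it builds a reduced dashboard JSON model from plain objects, and `PanelSpec`, `layoutPanels`, and `generateDashboard` are hypothetical names. The one Grafana-specific detail it relies on is the 24-column panel grid.

```typescript
// Hypothetical, simplified panel definition.
interface PanelSpec {
  title: string;
  type: "stat" | "timeseries" | "table";
  query: string; // query string passed to the data source
  width: number; // in grid columns (Grafana's grid is 24 columns wide)
  height: number; // in grid rows
}

// Place panels left to right, wrapping to a new row when the grid is full.
function layoutPanels(specs: PanelSpec[]) {
  let x = 0;
  let y = 0;
  let rowHeight = 0;
  return specs.map((p, i) => {
    if (x + p.width > 24) {
      x = 0;
      y += rowHeight;
      rowHeight = 0;
    }
    const gridPos = { x: x, y: y, w: p.width, h: p.height };
    x += p.width;
    rowHeight = Math.max(rowHeight, p.height);
    return {
      id: i + 1,
      title: p.title,
      type: p.type,
      gridPos: gridPos,
      targets: [{ refId: "A", query: p.query }],
    };
  });
}

// Produce a reduced dashboard model; a real dashboard carries more fields.
function generateDashboard(uid: string, title: string, specs: PanelSpec[]) {
  return { uid: uid, title: title, panels: layoutPanels(specs) };
}
```

Serialising the result with `JSON.stringify(dashboard, null, 2)` gives the kind of file that would land in the generated output directory. Keeping layout in code means panel positions never have to be edited by hand.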
Dashboard Deployment Commands
| Command | What It Does |
|---|---|
| npm run grafana:deploy | generate and deploy dashboards to Grafana |
| npm run grafana:validate-data | validate that dashboard queries return data |
| npm run monitor:grafana | browser-based visual panel health check |
The monitor:grafana command opens Grafana in a real browser and checks each dashboard panel for empty states or error conditions. This catches query regressions that static validation misses.
Retention And Index Management
| Command | What It Does |
|---|---|
| npm run os:setup-retention | create ISM policies and index templates (needs OPENSEARCH_URL + GRAFANA_TOKEN) |
| npm run os:stats | show OpenSearch index statistics |
| npm run os:failed | query recent failures from OpenSearch |
ISM (Index State Management) policies handle automatic rollover and deletion of old indices. Creating them requires v1admin access. Run npm run os:setup-retention once during initial setup and again whenever retention rules change.
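For orientation, an ISM policy body has roughly the shape built below. This is a sketch, not the policy that the setup script creates: the state names, the age thresholds passed in, and the `buildRetentionPolicy` helper are assumptions. The structure itself (`default_state`, `states` with `actions` and `transitions`, `ism_template`) follows the OpenSearch ISM policy format, and such a body is sent with `PUT _plugins/_ism/policies/<policy_id>`. Note that the `rollover` action only works on indices that have a rollover alias configured.

```typescript
interface IsmTransition {
  state_name: string;
  conditions?: Record<string, string>;
}

interface IsmState {
  name: string;
  actions: Record<string, unknown>[];
  transitions: IsmTransition[];
}

interface IsmPolicy {
  policy: {
    description: string;
    default_state: string;
    states: IsmState[];
    ism_template: { index_patterns: string[]; priority: number };
  };
}

// Ages use OpenSearch time units, e.g. "1d" or "30d". The values are
// supplied by the caller; no retention period is implied by this sketch.
function buildRetentionPolicy(
  indexPattern: string,
  rolloverAfter: string,
  deleteAfter: string
): IsmPolicy {
  return {
    policy: {
      description: `Retention for ${indexPattern}`,
      default_state: "hot",
      states: [
        {
          name: "hot",
          actions: [{ rollover: { min_index_age: rolloverAfter } }],
          transitions: [
            { state_name: "delete", conditions: { min_index_age: deleteAfter } },
          ],
        },
        { name: "delete", actions: [{ delete: {} }], transitions: [] },
      ],
      // Attach the policy automatically to new indices matching the pattern.
      ism_template: { index_patterns: [indexPattern], priority: 100 },
    },
  };
}
```

Because `ism_template` only applies to indices created after the policy exists, indices that predate it need the policy attached separately.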
Key Configuration
| Variable | Purpose |
|---|---|
| OPENSEARCH_URL | write endpoint; set to Grafana proxy URL in CI |
| GRAFANA_URL | Grafana base URL for links and queries |
| GRAFANA_SERVICE_ACCOUNT_TOKEN | authenticates both OpenSearch users via Grafana proxy |
| PROMETHEUS_PUSHGATEWAY_URL | metrics push target for CI runs |
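A runner can check these variables up front rather than failing midway through a run. The sketch below is one way to do that. Treating the first three variables as required and the Pushgateway URL as optional is an assumption, as are the `checkObservabilityEnv` name and the report shape.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: these three are mandatory; the Pushgateway URL is optional.
const REQUIRED_VARS = ["OPENSEARCH_URL", "GRAFANA_URL", "GRAFANA_SERVICE_ACCOUNT_TOKEN"];
const OPTIONAL_VARS = ["PROMETHEUS_PUSHGATEWAY_URL"];

interface EnvReport {
  missing: string[]; // required variables that are unset or blank
  unsetOptional: string[]; // optional variables that are unset or blank
  invalidUrls: string[]; // *_URL variables that are set but not http(s) URLs
}

function checkObservabilityEnv(env: Env): EnvReport {
  const value = (name: string) => (env[name] || "").trim();
  const isSet = (name: string) => value(name) !== "";
  const all = REQUIRED_VARS.concat(OPTIONAL_VARS);
  return {
    missing: REQUIRED_VARS.filter((n) => !isSet(n)),
    unsetOptional: OPTIONAL_VARS.filter((n) => !isSet(n)),
    invalidUrls: all.filter(
      (n) => /_URL$/.test(n) && isSet(n) && !/^https?:\/\//.test(value(n))
    ),
  };
}
```

In a Node.js runner this would be called as `checkObservabilityEnv(process.env)`, aborting the run when `missing` or `invalidUrls` is non-empty.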
Common Issues
| Issue | What To Check |
|---|---|
| OpenSearch not starting locally | port 9200 conflict; check npm run observability:status |
| Grafana showing no data | verify OPENSEARCH_URL points to correct proxy; check index pattern matches cncqa_tests-* |
| field type mismatch in queries | old indices (pre-2025) may lack .keyword sub-fields; aggregations on those fields will fail against the older indices |
| V1 vs beta data mixing | V1 records use blesk.cz site names; beta records use blesk; queries may need to handle both |
| events index empty | cncqa_events-* is created but the EventLogger write path to OpenSearch may not be reaching it in all environments; check OPENSEARCH_URL in the runner environment |
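For the V1 versus beta naming issue, one option is to expand a site name into all of its known variants before querying. The sketch below does this with a `terms` filter. The alias table contains only the blesk example from the row above, and the `site.keyword` field name and the helper names are assumptions about the schema.

```typescript
// Known naming variants per site. Only the blesk pair is documented;
// other sites would need to be added explicitly.
const SITE_ALIAS_GROUPS: string[][] = [["blesk", "blesk.cz"]];

// Return every known spelling of a site name, or the name itself if unknown.
function siteVariants(site: string): string[] {
  for (const group of SITE_ALIAS_GROUPS) {
    if (group.indexOf(site) !== -1) {
      return group.slice();
    }
  }
  return [site];
}

// Build a query body that matches records from either naming scheme.
// Assumption: the site name is stored in a field with a .keyword sub-field.
function siteQuery(site: string, field: string = "site.keyword") {
  return {
    query: {
      bool: {
        filter: [{ terms: { [field]: siteVariants(site) } }],
      },
    },
  };
}
```

On older indices that lack the `.keyword` sub-field, the `field` argument would have to point at whatever field those indices actually map.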
Related Pages
| Need | Go To |
|---|---|
| logging model and event schema | Logging System |
| Slack, GitLab, and OpenSearch credentials | Integrations |
| configuration and env vars | Configuration Guide |
| dashboard queries and patterns | .claude/docs/grafana-patterns.md in the repo |