Title: Grafana Observability SME
Duration: 6 Months
Location: Poughkeepsie, NY
Role Summary
Own the end-to-end technical design, build, and operationalization of the Grafana Cloud observability platform for a 50-application estate spanning Java, .NET, Go, Python, and Node.js workloads hosted across on-premises data centres, AWS, and Azure. The SME serves as the senior technical authority across all eight in-scope Grafana Cloud modules and is accountable for instrumentation strategy, alerting design, dashboarding standards, and integration into ServiceNow ITOM via native Event Management. Scope is application-level observability only: server and network health remain on SolarWinds, and URL/synthetic monitoring remains on Uptrends.
Key Responsibilities
- Platform architecture and configuration across all eight in-scope Grafana Cloud modules: Grafana 12 (visualization), Mimir (metrics, 13-month retention), Loki (logs), Tempo (distributed tracing via OTLP), Alloy (telemetry collection agent), Beyla (eBPF zero-code auto-instrumentation), Application Observability (OTel-native APM), and Unified Alerting.
- Tenancy and access design — organizations, folders, teams, role-based access control, dashboard variables, template links, and annotations.
- Application instrumentation strategy by technology stack: Beyla eBPF as the default zero-code path for Simple and Medium apps; OpenTelemetry SDKs/agents (Java, .NET, Go, Python, Node.js) for Complex apps requiring deeper traces and custom metrics; JMX Exporter, prometheus_client, and runtime-specific exporters where stack-appropriate.
- Log pipeline engineering via Alloy — structured JSON, Log4j/Logback, Serilog, NLog, Windows Event Log, Winston, Pino, loguru — with parsing rules tuned per stack and LogQL-based dashboards and alerts.
- Alerting design — PromQL/LogQL/TraceQL rules, severity taxonomy, grouping, routing, and notification policies. Build a low-noise, actionable alert feed; tune thresholds iteratively with application owners.
- Single Pane of Glass — design and deliver a tiered SPoG that surfaces Grafana application telemetry alongside contextual links to SolarWinds and Uptrends.
- Business Dashboards and Reporting: partner with the Dashboard Lead to define the KPI taxonomy and enforce dashboard-as-code patterns and version control.
- ServiceNow ITOM integration: co-own the design and review of the Grafana-to-ServiceNow Event Management flow (native inbound integration), covering event allow-list governance ("deny by default"), enrichment, deduplication, AIOps correlation, automated incident creation with severity mapping and assignment group rules, CMDB CI attachment, and ServiceNow-as-master incident state.
- Quality assurance authority across all technical deliverables — solution architecture document, instrumentation runbooks, dashboard and alert library, integration test results.
- Phased delivery execution: Mobilise & Discover Application Foundation (ML1); Onboarding of 40 Simple apps (ML2); Medium/Complex apps + ITOM Integration (ML2-3); SPoG, Dashboards & Reporting (ML3-4); Stabilisation, KT, and post-deployment support (ML4).
- Knowledge transfer — produce platform operating procedures and conduct structured handover to the client's run team.
Top Skills:
1. Production expertise across the full Grafana stack: Mimir, Loki, Tempo, Alloy, Beyla, Grafana Application Observability, Unified Alerting.
2. Strong PromQL, LogQL, and TraceQL authoring skills; able to write recording rules and SLO queries from scratch.
3. OpenTelemetry practitioner: OTLP, collectors, SDK/agent instrumentation for at least three of Java, .NET, Go, Python, Node.js.
4. eBPF-based auto-instrumentation experience with Beyla (or an equivalent such as Pixie or Cilium Tetragon) in a production context.
5. Experience integrating Grafana alerts into ServiceNow Event Management (native inbound integration, not webhook-only patterns); familiarity with ServiceNow ITOM, AIOps event correlation, and CMDB CI attachment.
6. Multi-environment hosting fluency (on-prem, AWS, Azure) and Linux/Windows host agent deployment at scale.
7. Dashboard-as-code and GitOps patterns (Grafana provisioning, Terraform provider, or Grizzly).
8. Excellent written communication: solution architecture documents, runbooks, and stakeholder-facing status reporting.
Required Skills & Experience
- 7+ years in observability/monitoring engineering with deep, recent hands-on Grafana Cloud experience (not just OSS Grafana).
- Production expertise across the full Grafana stack: Mimir, Loki, Tempo, Alloy, Beyla, Grafana Application Observability, Unified Alerting.
- Strong PromQL, LogQL, and TraceQL authoring skills; able to write recording rules and SLO queries from scratch.
- OpenTelemetry practitioner — OTLP, collectors, SDK/agent instrumentation for at least three of Java, .NET, Go, Python, Node.js.
- eBPF-based auto-instrumentation experience with Beyla (or equivalent — Pixie, Cilium Tetragon) in a production context.
- Experience in