Grafana Observability SME

Job

Spectraforce

Poughkeepsie, NY (In Person)

Full-Time

Posted 2 weeks ago (Updated 20 hours ago) • Actively hiring

Expires 7/24/2026

Job Description

Title: Grafana Observability SME
Duration: 6 Months
Location: Poughkeepsie, NY

Role Summary

Own the end-to-end technical design, build, and operationalization of the Grafana Cloud observability platform for a 50-application estate spanning Java, .NET, Go, Python, and Node.js workloads hosted across on-premises data centres, AWS, and Azure. The SME serves as the senior technical authority across all eight in-scope Grafana Cloud modules and is accountable for instrumentation strategy, alerting design, dashboarding standards, and integration into ServiceNow ITOM via native Event Management. Scope is application-level observability only: server and network health remain on SolarWinds, and URL/synthetic monitoring remains on Uptrends.

Key Responsibilities
  • Platform architecture and configuration across all eight in-scope Grafana Cloud modules: Grafana 12 (visualization), Mimir (metrics, 13-month retention), Loki (logs), Tempo (distributed tracing via OTLP), Alloy (telemetry collection agent), Beyla (eBPF zero-code auto-instrumentation), Application Observability (OTel-native APM), and Unified Alerting.
  • Tenancy and access design — organizations, folders, teams, role-based access control, dashboard variables, template links, and annotations.
  • Application instrumentation strategy by technology stack: Beyla eBPF as the default zero-code path for Simple and Medium apps; OpenTelemetry SDKs/agents (Java, .NET, Go, Python, Node.js) for Complex apps requiring deeper traces and custom metrics; JMX Exporter, prometheus_client, and runtime-specific exporters where stack-appropriate.
  • Log pipeline engineering via Alloy — structured JSON, Log4j/Logback, Serilog, NLog, Windows Event Log, Winston, Pino, loguru — with parsing rules tuned per stack and LogQL-based dashboards and alerts.
  • Alerting design — PromQL/LogQL/TraceQL rules, severity taxonomy, grouping, routing, and notification policies. Build a low-noise, actionable alert feed; tune thresholds iteratively with application owners.
  • Single Pane of Glass — design and deliver a tiered SPoG that surfaces Grafana application telemetry alongside contextual links to SolarWinds and Uptrends.
  • Business Dashboards and Reporting — partner with the Dashboard Lead to define KPI taxonomy and ensure dashboard-as-code patterns and version control.
  • ServiceNow ITOM integration — co-own the design and review of Grafana ServiceNow Event Management (native inbound integration) flow: event allow-list governance ("deny by default"), enrichment, deduplication, AIOps correlation, automated incident creation with severity mapping and assignment group rules, CMDB CI attachment, and ServiceNow-as-master incident state.
  • Quality assurance authority across all technical deliverables — solution architecture document, instrumentation runbooks, dashboard and alert library, integration test results.
  • Phased delivery execution: Mobilise & Discover Application Foundation (ML1); Onboarding of 40 Simple apps (ML2); Medium/Complex apps + ITOM Integration (ML2-3); SPoG, Dashboards & Reporting (ML3-4); Stabilisation, KT, and post-deployment support (ML4).
  • Knowledge transfer — produce platform operating procedures and conduct structured handover to the client's run team.
Top Skills:
1. Production expertise across the full Grafana stack: Mimir, Loki, Tempo, Alloy, Beyla, Grafana Application Observability, Unified Alerting.
2. Strong PromQL, LogQL, and TraceQL authoring skills; able to write recording rules and SLO queries from scratch.
3. OpenTelemetry practitioner: OTLP, collectors, SDK/agent instrumentation for at least three of Java, .NET, Go, Python, Node.js.
4. eBPF-based auto-instrumentation experience with Beyla (or an equivalent such as Pixie or Cilium Tetragon) in a production context.
5. Experience integrating Grafana alerts into ServiceNow Event Management (native inbound integration, not webhook-only patterns); familiarity with ServiceNow ITOM, AIOps event correlation, and CMDB CI attachment.
6. Multi-environment hosting fluency (on-prem, AWS, Azure) and Linux/Windows host agent deployment at scale.
7. Dashboard-as-code and GitOps patterns (Grafana provisioning, Terraform provider, or Grizzly).
8. Excellent written communication: solution architecture documents, runbooks, and stakeholder-facing status reporting.

Required Skills & Experience
  • 7+ years in observability/monitoring engineering with deep, recent hands-on Grafana Cloud experience (not just OSS Grafana).
  • Production expertise across the full Grafana stack: Mimir, Loki, Tempo, Alloy, Beyla, Grafana Application Observability, Unified Alerting.
  • Strong PromQL, LogQL, and TraceQL authoring skills; able to write recording rules and SLO queries from scratch.
  • OpenTelemetry practitioner — OTLP, collectors, SDK/agent instrumentation for at least three of Java, .NET, Go, Python, Node.js.
  • eBPF-based auto-instrumentation experience with Beyla (or equivalent — Pixie, Cilium Tetragon) in a production context.
  • Experience in