
Senior Site Reliability Engineer

Job

Insight Global

Downers Grove, IL (In Person)

Full-Time

Posted 3 days ago (Updated 16 hours ago) • Actively hiring

Expires 6/13/2026


Job Description

Lead the SRE project plan and implementation for distributed applications across GCP and Azure, covering APIs, data pipelines, messaging/event-driven systems, and external data platforms.

- Design and implement comprehensive SRE monitoring for distributed applications
- Implement distributed tracing and logging using W3C Trace Context headers and OpenTelemetry standards across all applications
- Create drill-down Grafana dashboards with correlation between metrics, logs, and traces
- Integrate GCP and Azure Monitoring, Logging, and Trace with the existing OpenTelemetry standards defined by enterprise teams
- Implement zero-code instrumentation for monitoring and traceability
- Define and work with core SRE models such as SLIs, SLOs, and error budgets
- Design dashboards for reliability-focused metrics (latency, request rate, errors, duration, availability)
- Build service health dashboards with drill-down capabilities and error message analysis
- Develop and maintain SRE automation/scripts within GKE namespaces for monitoring, deployment, and troubleshooting
- Configure Apigee monitoring and API performance tracking for applications, working with enterprise teams

We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com.

To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy:
https://insightglobal.com/workforce-privacy-policy/

Skills and Requirements

7+ years in SRE with proven experience in Azure and GCP observability, the Grafana stack, GKE, AKS, OpenTelemetry, and instrumentation implementation.
Technical:
Prometheus, Grafana, Kubernetes, Loki, Tempo, GCP or Azure logging
Logging & Tracing:
Distributed tracing, W3C Trace Context headers implementation, log aggregation standards, correlation IDs across systems/applications
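For candidates unfamiliar with the header format, the following is a minimal Python sketch of how a W3C Trace Context `traceparent` header can be generated, validated, and propagated to a downstream call. It is an illustration only, not the employer's implementation: the function names are hypothetical, and the sketch handles only version `00` of the header (`version-traceid-parentid-flags`).

```python
import re
import secrets

# Version-00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace: random trace-id and span (parent) id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the header is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # This sketch only accepts version 00; all-zero ids are invalid per the spec.
    if fields["version"] != "00":
        return None
    if fields["trace_id"] == "0" * 32 or fields["parent_id"] == "0" * 16:
        return None
    # The lowest bit of trace-flags is the "sampled" flag.
    fields["sampled"] = bool(int(fields["flags"], 16) & 0x01)
    return fields


def child_traceparent(header):
    """Header for an outgoing call: same trace-id, fresh span id."""
    parent = parse_traceparent(header)
    if parent is None:
        # Invalid or missing context: start a new trace instead.
        return new_traceparent()
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{parent['flags']}"
```

The key property is that the trace-id stays constant across every hop while each service contributes its own span id, which is what lets logs and traces from different systems be correlated.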
Structured Logging:
JSON format with specific fields (trace_id, service.name, log.level, customer.id, request.id)
Experience monitoring batch/data pipelines (Cloud Composer, Dataproc, ETL workflows), including job failures and scheduling issues
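As an illustration of the structured-logging fields named above (trace_id, service.name, log.level, customer.id, request.id), here is a minimal sketch using only Python's standard `logging` module. The class and function names, the extra `timestamp` and `message` keys, and the sample values are assumptions made for this example; the posting does not prescribe a specific library or schema beyond the field names.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "log.level": record.levelname,
            "service.name": self.service_name,
            "message": record.getMessage(),
            # Correlation fields arrive via the `extra=` argument of the
            # logging call; they default to None when not supplied.
            "trace_id": getattr(record, "trace_id", None),
            "request.id": getattr(record, "request_id", None),
            "customer.id": getattr(record, "customer_id", None),
        }
        return json.dumps(entry)


def build_logger(service_name, stream=None):
    """Hypothetical helper: a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    handler.setFormatter(JsonFormatter(service_name))
    logger.addHandler(handler)
    return logger
```

A call such as `build_logger("checkout-api").info("payment accepted", extra={"trace_id": "...", "request_id": "req-1", "customer_id": "cust-9"})` would then emit a single JSON line that a log aggregator such as Loki can index and join to traces by `trace_id`.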
Infrastructure:
CI/CD pipelines, AI tools such as GitHub Copilot
Observability Tools & Query Languages:
PromQL for querying metrics (Grafana)
Strong experience with Kubernetes (GKE, AKS), including namespace management, RBAC, and deploying/maintaining SRE tools via code (Java/Python, Bash, YAML, Helm)
OpenTelemetry (OTEL): instrumentation, collectors, data collection from GCP services
Alerting and Incident Management:
Implementing structured processes for handling failures, and conducting reviews that focus on fixing system issues
Experience monitoring external managed services such as MongoDB, Kafka, and Cloud SQL; Azure-based monitoring
On-call systems: designing and writing on-call rotation policies and rules (xMatters, PagerDuty, Opsgenie, etc.)
AI experience
