Senior Site Reliability Engineer
Insight Global
Downers Grove, IL (In Person)
Full-Time
Job Description
Lead the SRE project plan and implementation for distributed applications across GCP and Azure, covering APIs, data pipelines, messaging/event-driven systems, and external data platforms.
- Design and implement comprehensive SRE monitoring for distributed applications.
- Implement distributed tracing and logging using W3C Trace Context headers and OpenTelemetry standards across all applications.
- Create drill-down Grafana dashboards with correlation between metrics, logs, and traces.
- Integrate GCP and Azure Monitoring, Logging, and Trace with the existing OpenTelemetry standards set by enterprise teams.
- Implement zero-code instrumentation for monitoring and traceability.
- Experience in defining and working with core SRE models such as SLIs, SLOs, and error budgets.
- Design reliability-focused metrics dashboards (latency, request rate, errors, duration, availability).
- Build service health dashboards with drill-down capabilities and error message analysis.
- Develop and maintain SRE automation/scripts within GKE namespaces for monitoring, deployment, and troubleshooting.
- Configure Apigee monitoring and API performance tracking for applications, working with enterprise teams.

We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com.

To learn more about how we collect, keep, and process your private information, please review
Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.

Skills and Requirements
7+ years in SRE with proven Azure and GCP observability, Grafana stack, GKE, AKS, OpenTelemetry, and instrumentation implementation experience.
Technical:
Prometheus, Grafana, Kubernetes, Loki, Tempo, GCP or Azure logging
Logging & Tracing: Distributed tracing, W3C Trace Context headers implementation, log aggregation standards, correlation IDs across systems/applications
Structured Logging: JSON format with specific fields (trace_id, service.name, log.level, customer.id, request.id)
Experience monitoring batch/data pipelines (Cloud Composer, Dataproc, ETL workflows), including job failures and scheduling issues
Infrastructure: CI/CD pipelines, AI tools such as GitHub Copilot
Observability Tools & Query Languages: PromQL for querying metrics (Grafana)
Strong experience with Kubernetes (GKE, AKS), including namespace management, RBAC, and deploying/maintaining SRE tools via code (Java/Python, Bash, YAML, Helm)
OpenTelemetry (OTEL): Instrumentation, collectors, data collection from GCP services
Alerting and Incident Management: Implementing structured processes for handling failures and conducting reviews that focus on fixing system issues
Experience monitoring external managed services such as MongoDB, Kafka, and Cloud SQL, plus Azure-based monitoring
On-call systems: designing and writing on-call rotation policies and rules (xMatters, PagerDuty, Opsgenie, etc.)
AI experience
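
For illustration only, the structured logging and W3C Trace Context requirements above could look roughly like the following Python sketch. It reads the trace ID out of a `traceparent` header and emits one JSON log line carrying the fields named in the posting (trace_id, service.name, log.level, customer.id, request.id). The function names, the `message` field, and the sample values are assumptions made for this example and are not part of the posting. The parser only handles the four-field layout used by `traceparent` version 00.

```python
import json
import re
from typing import Optional

# W3C Trace Context "traceparent" (version 00 layout):
# version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def trace_id_from_traceparent(header: str) -> Optional[str]:
    """Return the trace ID from a traceparent header, or None if malformed."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, _flags = match.groups()
    # The spec treats version "ff" and all-zero IDs as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id


def build_log_line(message: str, level: str, traceparent: str,
                   service_name: str, customer_id: str, request_id: str) -> str:
    """Build one JSON log line with the fields listed in the posting.

    The "message" key is a hypothetical addition for this sketch.
    """
    record = {
        "message": message,
        "log.level": level,
        "service.name": service_name,
        "trace_id": trace_id_from_traceparent(traceparent),
        "customer.id": customer_id,
        "request.id": request_id,
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # Sample values are invented for illustration.
    print(build_log_line(
        message="order lookup failed",
        level="error",
        traceparent="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
        service_name="orders-api",
        customer_id="c-123",
        request_id="r-456",
    ))
```

Keeping the same trace_id in every log line is what allows a Grafana dashboard to link a log entry in Loki to the matching trace in Tempo.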