Senior SRE

HTC Global Services

Orlando, FL (In Person)

Full-Time

Posted 5 days ago (Updated 2 days ago) • Actively hiring

Expires 7/22/2026

Job Description

Job Title: Senior Site Reliability Engineer (SRE)

Overview / Summary

We are seeking a Site Reliability Engineer (SRE) with 8-10 years of experience to drive reliability, observability, and resilience improvements across critical systems. This is a high-impact, front-line operations role focused on real-time incident response, proactive prevention, continuous automation, and reliability engineering for Tier-1 business-critical applications.

Key Responsibilities

- Drive automation initiatives to improve system performance and operational efficiency.
- Improve application reliability and availability by proactively identifying and mitigating risks.
- Analyze production incidents and root cause analyses (RCAs) to eliminate recurring issues and reduce outages.
- Define and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets using Nobl9.
- Conduct reliability assessments across applications, infrastructure, Kubernetes, databases, networks, caching platforms, and cloud environments.
- Drive observability improvements using OpenTelemetry, Grafana Cloud, AppDynamics, Splunk, and monitoring best practices.
- Perform performance and scalability reviews to support current and future demand.
- Lead chaos engineering exercises using Gremlin or Harness Chaos Engineering.
- Review cloud architectures against AWS Well-Architected Framework standards and drive remediation of reliability gaps.
- Automate operational tasks and implement self-healing solutions.
- Identify and eliminate single points of failure (SPOFs) and strengthen disaster recovery and failover capabilities.
- Collaborate with Development, Infrastructure, Performance Engineering, and Operations teams to improve system resilience.
- Establish reliability governance, dashboards, runbooks, and continuous improvement processes.

Reliability Assessment & Engineering

- Conduct application reliability assessments using established reliability frameworks.
- Review historical incidents, Sev-1/Sev-2 RCAs, and recurring failure patterns.
- Identify reliability debt and drive remediation initiatives.
- Evaluate application readiness for SRE engagement.
- Perform end-to-end reliability reviews across application, infrastructure, network, and platform layers.
- Define reliability roadmaps and track improvement initiatives.

Incident Management & RCA

- Analyze incident trends using CSI or equivalent incident management platforms.
- Participate in Major Incident Management and Problem Management processes.
- Drive RCA reviews and corrective actions.
- Track reliability improvement initiatives resulting from postmortems.
- Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).

Service Level Management

- Define and implement SLIs.
- Establish SLOs and Error Budgets using Nobl9.
- Partner with Product and Engineering teams to define business-focused reliability targets.
- Build SLO dashboards and reliability scorecards.
- Monitor error budget consumption and enforce governance policies.
- Conduct reliability reviews based on SLO compliance.

Cloud & Platform Reliability

- Review cloud architectures against AWS Well-Architected Framework principles.
- Conduct reliability, performance, cost optimization, security, and operational excellence assessments.
- Identify High Risk Issues (HRIs) and drive remediation.
- Validate high availability, disaster recovery, backup, and failover capabilities.
- Ensure multi-AZ and multi-region deployment strategies are implemented where required.

Kubernetes & Infrastructure Reliability

- Review Kubernetes cluster health and workload configurations.
- Validate resource requests, limits, autoscaling, and resiliency patterns.
- Assess readiness, liveness, and startup probes.
- Review service mesh configurations, network policies, and traffic routing.
- Validate database high availability, caching strategies, and scaling configurations.
- Identify and eliminate single points of failure.

Observability & Monitoring

- Design and improve enterprise observability strategies.
- Implement OpenTelemetry-based telemetry collection.
- Manage metrics, events, logs, and traces (MELT).
- Integrate telemetry into Grafana Cloud, Splunk Observability, or equivalent platforms.
- Utilize AI-driven observability capabilities for anomaly detection and root cause analysis.
- Improve alert quality, reduce alert fatigue, and increase actionable monitoring coverage.
- Ensure every alert has an owner, runbook, and customer impact justification.

Application Performance Engineering

- Conduct dependency mapping and architecture reviews.
- Analyze latency, throughput, and scalability bottlenecks.
- Review timeout, retry, circuit breaker, and resilience patterns.
- Collaborate with Performance Engineering teams on load and stress testing.
- Validate system capacity against current and future traffic demands.
- Review Akamai CDN configurations, traffic routing, caching, and failover strategies.
- Ensure applications can sustain significant traffic spikes and peak loads.

Chaos Engineering & Resilience Testing

- Design and execute chaos engineering experiments using Gremlin or Harness Chaos Engineering.
- Simulate infrastructure, network, application, and dependency failures.
- Validate system behavior during failure scenarios.
- Establish reliability score baselines and improvement goals.
- Measure resilience against real-world production conditions.
- Document findings and implement corrective improvements.

Automation & Self-Healing

- Identify repetitive operational tasks suitable for automation.
- Develop self-healing workflows for common infrastructure and application failures.
- Automate alert remediation, scaling, recovery, and operational activities.
- Reduce manual intervention and operational toil.
- Improve platform efficiency through engineering-driven automation.

Required Qualifications

- 8-10 years of experience in Site Reliability Engineering.
- Experience with CSI for incident and RCA tracking.
- Experience with Nobl9 for SLO management.
- Experience with AppDynamics for application performance monitoring.
- Experience with OpenTelemetry and Grafana Cloud for telemetry and observability.
- Experience with Gremlin or Harness Chaos Engineering.
- Experience with Akamai CDN.
- Knowledge of AWS Well-Architected Framework.
- Experience with Kubernetes reliability, observability, incident management, automation, and resilience engineering.

What Makes HTC A Great Place To Build Your Future

HTC Global Services wants you to join our team. Come build new things with us and advance your career. At HTC Global, you'll collaborate with experts, work alongside clients, and be part of high-performing teams driving success together. You'll have long-term opportunities to grow your career and develop skills in the latest emerging technologies.

At HTC Global Services, our employees have access to a comprehensive benefits package. Benefits can include Group Health (Medical, Dental, and Vision), Paid Time Off, Paid Holidays, 401(k) matching, Group Life and Disability insurance, Professional Development opportunities, Wellness programs, and a variety of other perks.

Our success as a company is built on inclusion and diversity. HTC Global Services is committed to providing a workplace free from discrimination and harassment, where every employee is treated with dignity and respect. We celebrate differences and believe that diverse cultures, perspectives, and skills drive innovation and success. HTC is an Equal Opportunity Employer and a proud National Minority Supplier. We seek to empower each individual, fostering an environment where everyone feels valued, included, and respected.