Site Reliability Engineer, AI & Agentic Systems

ServiceLink

Remote

Full-Time

Posted 5 days ago (Updated 3 days ago) • Actively hiring

Expires 7/21/2026

Job Description

Overview

As our SRE charter continues to evolve, this role demands strong hands-on ownership of production reliability and troubleshooting, coupled with advanced capabilities in AI- and agentic-driven automation and performance engineering. The Site Reliability Engineer will play a critical role in ensuring reliability, scalability, performance, and operational excellence of our platforms. The ideal candidate will leverage Azure-native AI services and agentic systems to reduce toil, improve incident response, and enable intelligent operations, while also driving performance testing practices to validate system resilience under load.

This is a hybrid role, located at our Plano, TX office. Candidates must be willing and able to work in-office 3 days per week in Plano, TX. Applicants must be currently authorized to work in the United States on a full-time basis and must not require sponsorship for employment visa status now or in the future.

A DAY IN THE LIFE

In this role, you will…

• Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
• Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
• Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
• Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
• Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
• Integrate AI agents with Azure monitoring stack, CI/CD tooling, and incident management platforms
• Contribute to evolving SRE standards, tooling, operational processes, and knowledge base

Responsibilities

Reliability Engineering & Production Ownership

• Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
• Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
• Define, measure, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets aligned with business outcomes
• Drive proactive reliability improvements based on operational insights, failure mode analysis, and capacity planning
• Participate in on-call rotations and take real-time ownership during production incidents

Platform & Automation Engineering

• Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
• Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
• Create automation to eliminate manual operational tasks and reduce Mean Time to Recovery (MTTR)
• Implement self-healing mechanisms, automated remediation workflows, and runbook automation
• Manage and optimize API lifecycle and traffic management using Gravitee API Gateway
• Design and implement durable, fault-tolerant workflows and microservice orchestration patterns using Temporal
• Administer and tune PostgreSQL databases for reliability, performance, and high availability
• Partner with application and platform teams to improve service operability, deployment safety, and change management

Performance Testing & Load Engineering

• Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
• Build and maintain performance test scripts and virtual user scenarios using Micro Focus LoadRunner and VuGen (Virtual User Generator)
• Analyze performance test results to identify bottlenecks, regressions, and scalability limits; produce clear reports with actionable recommendations
• Integrate performance testing into CI/CD pipelines to enable continuous performance validation and shift-left testing practices
• Establish and monitor performance baselines, benchmarks, and SLAs across critical service endpoints and user journeys
• Collaborate with development and architecture teams to resolve performance issues and optimize system throughput, latency, and resource utilization

AI / Agentic Engineering (Azure Focus)

• Design and implement AI-driven and agentic systems to enhance operational workflows and intelligent decision-making
• Build intelligent automation for operational use cases, including:
    • Incident triage, enrichment, and automated escalation
    • Alert correlation, deduplication, and noise reduction
    • Automated diagnosis and remediation of recurring failures
• Leverage Azure AI services (Azure OpenAI, Cognitive Services, Azure ML) for operational intelligence and predictive insights
• Integrate AI agents with Azure monitoring stack, CI/CD tooling, and incident management platforms
• Ensure safe, reliable, and observable operation of AI-powered systems in production, including guardrails, fallback mechanisms, and audit trails

Collaboration & Technical Leadership

• Act as a reliability, performance, and automation champion across engineering teams
• Mentor junior SREs and influence adoption of best practices in reliability, observability, and performance engineering
• Contribute to evolving SRE standards, tooling, operational processes, and knowledge base
• Participate in architecture reviews and provide guidance on non-functional requirements (reliability, scalability, performance)

Qualifications

Core SRE Skills

• 5+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles
• Strong hands-on experience in production troubleshooting of distributed systems at scale
• Solid understanding of Linux internals, networking (TCP/IP, DNS, HTTP, TLS), and system performance tuning
• Deep hands-on experience with Microsoft Azure (compute, networking, storage, managed services, AKS)
• Strong knowledge of Kubernetes, container orchestration, Helm charts, and microservices architectures
• Proficiency in one or more programming languages: Python, Go, Java, or equivalent
• Experience with CI/CD pipelines (Azure DevOps, GitHub Actions) and Infrastructure as Code (Terraform, ARM Templates, Bicep)

Observability & Monitoring

• Hands-on experience building and operating observability stacks using Prometheus, Grafana, Loki, and Tempo
• Experience with alerting strategies, SLI/SLO-based monitoring, and on-call incident management

Performance Testing & Load Engineering

• Proven experience designing and executing performance and load testing for large-scale distributed applications
• Hands-on proficiency with Micro Focus LoadRunner and VuGen for scripting virtual user scenarios, parameterization, correlation, and result analysis
• Strong understanding of performance testing methodologies: load testing, stress testing, endurance/soak testing, spike testing, and capacity planning
• Ability to analyze performance metrics (throughput, response time, error rate, resource utilization) and translate findings into engineering actions
• Experience integrating performance tests into automated CI/CD pipelines

Platform & Middleware

• Experience with Gravitee or equivalent API gateway platforms for traffic management, rate limiting, and API lifecycle governance
• Hands-on experience with Temporal for workflow orchestration, durable execution, and distributed task management
• Strong PostgreSQL administration skills, including query optimization, replication, backup/recovery, and performance tuning

AI / Agentic Systems

• Hands-on experience building or integrating AI-powered automation in production environments
• Experience with agent-based systems, LLM-powered workflows, Retrieval-Augmented Generation (RAG), or intelligent assistants
• Familiarity with Azure-based AI and ML services (Azure OpenAI, Cognitive Services, Azure ML)
• Understanding of reliability, safety, observability, and operational challenges of AI systems in production