Platform Site Reliability Engineer
MW Partners LLC
Eagle Mountain, UT (In Person)
Full-Time
Job Description
Responsibilities and duties:
Ensure the reliability, availability, and operational health of a portfolio of SaaS-based solutions, including vendor-managed services and in-house (customer-zero) platforms.
Participate in on-call rotations and incident response, leading investigation, mitigation, coordination, and post-incident follow-up.
Establish and maintain effective observability for systems that are not fully owned, identifying practical ways to obtain actionable metrics, logs, and signals from vendor and partner solutions.
Use operational data and incident learnings to identify reliability risks and drive targeted improvements that reduce customer impact.
Apply appropriate change controls at owned or influenced layers of the stack, balancing reliability, velocity, and business needs.
Partner with internal teams and external vendors to communicate expectations, coordinate response and remediation, and influence reliability outcomes.
Produce clear incident communications and post-incident analyses that inform stakeholders and drive lasting improvements.
Leverage automation and AI-assisted tooling to improve detection, triage, and operational efficiency.

Requirements:
Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field, or equivalent practical experience.
3-5 years of experience in Site Reliability Engineering, production operations, or a closely related role.
Strong foundation in Site Reliability Engineering practices, including observability, incident response, and reliability measurement.
Hands-on experience operating SaaS or third-party systems where full-stack ownership is limited.
Deep understanding of monitoring, logging, and alerting, with the ability to design signals that are actionable rather than noisy.
Proven incident response experience, including on-call participation and cross-team coordination during high-impact events.
Ability to think creatively and pragmatically when instrumenting and improving systems with constrained control.
Excellent written and verbal communication skills, especially in high-pressure incident and vendor-coordination scenarios.
Experience working across organizational and vendor boundaries to resolve complex operational issues.
Sound engineering judgment when assessing risk, prioritizing work, and making reliability tradeoffs in production environments.
Experience supporting production systems with on-call responsibilities and incident response expectations.
Strong experience working with observability data (metrics, logs, alerts) to diagnose issues and drive improvements.
Comfort using automation and AI-assisted tools as part of everyday operational workflows.

Preferred Skills:
Experience supporting enterprise-scale SaaS platforms or shared services.
Prior experience working directly with vendors to resolve reliability or operational issues.
Familiarity with cloud-based and distributed system architectures.