Sr Site Reliability Engineer

Commence

Remote

$160,000 Salary, Full-Time

Posted 4 weeks ago (Updated 3 days ago) • Actively hiring

Expires 7/1/2026

Job Description

Description:

At Commence, we're the start of a new age of data-centric transformation, elevating health outcomes and powering better, more efficient processes for program and patient health. We combine quality data-driven solutions that fuel answers, technology that advances performance, and clinical expertise that builds trust to create a more efficient path to quality care. With human-centered, healthcare-relevant, and value-based solutions, we create new possibilities with data. We provide proof beyond the concept and performance beyond the scope with a focus on efficiencies that transform the lives of those we serve. With a culture driven by purpose, straightforward communication, and clinical domain expertise, Commence cuts straight to better care.

Requirements:

As a Senior Site Reliability Engineer at Commence, you will own the reliability, scalability, and operational health of our mission-critical healthcare data platform. You will bridge the gap between engineering and operations, embedding reliability as a first-class concern from architecture through deployment. This role is built for someone who thrives when systems are under pressure and who treats an outage as a problem to be engineered away permanently, not just survived.

Design, implement, and own observability infrastructure including metrics, logging, tracing, and alerting across distributed systems.
Define and enforce SLOs, SLIs, and error budgets in partnership with product and engineering teams.
Lead incident response: triage, coordinate remediation, conduct blameless post-mortems, and drive systemic fixes.
Build and maintain CI/CD pipelines that support rapid, safe delivery of changes to production.
Collaborate with engineering teams on infrastructure changes; able to read, modify, and contribute to existing infrastructure-as-code (Terraform or CloudFormation).
Design and operate highly available, fault-tolerant systems, including auto-scaling, failover, and disaster recovery strategies.
Reduce operational toil through automation; eliminate manual processes before they become habits.
Collaborate with software engineers to establish reliability-first design patterns and review architectures for operational risk.
Manage Kubernetes or container orchestration environments at scale.
Ensure systems meet compliance and security requirements, particularly those applicable to healthcare data (HIPAA, SOC 2).
Provide technical mentorship and guidance to engineers across the organization on reliability practices.
Participate in on-call rotation with a commitment to continuously reducing the need for it.

Qualifications

7+ years of experience in SRE, platform engineering, or DevOps roles.
Exceptional problem-solving under pressure, with a demonstrated track record of diagnosing complex, high-stakes system failures and building durable solutions.
Deep hands-on experience with AWS services including EC2, EKS/ECS, Lambda, RDS, S3, CloudWatch, and related tooling.
Familiarity with infrastructure-as-code (Terraform or CloudFormation); able to contribute to existing configurations.
Experience designing and operating distributed systems with strict availability and latency requirements.
Proficiency in at least one scripting or systems language (Python, Go, Bash, or similar) for automation and tooling.
Experience with container orchestration (Kubernetes, ECS) in production environments.
Expertise in observability tooling (OpenSearch, Prometheus/Grafana, or equivalent).
Hands-on experience with CI/CD platforms (GitHub Actions, Jenkins, CircleCI, or similar).
Proven ability to define and operationalize SLOs and error budgets.
Experience with relational and NoSQL databases: performance tuning, replication, and backup strategies.
Strong working knowledge of networking fundamentals: DNS, load balancing, VPCs, TLS.
Excellent communication skills; able to translate technical risk into business impact for non-engineering stakeholders.

Additional Requirements

AWS Certifications (Solutions Architect, DevOps Engineer, or SysOps Administrator).
Experience in healthcare technology or other regulated industries (HIPAA, SOC 2, FedRAMP).
Familiarity with chaos engineering practices and tooling.
Experience with data pipeline reliability (ETL/ELT workflows, streaming systems).
Exposure to AI/ML infrastructure and the reliability challenges unique to model serving.
Familiarity with additional cloud platforms (Azure, Google Cloud).
Contributions to open-source reliability or infrastructure tooling.

Work Environment/Physical Demands

The work environment and physical demands described here are representative of those that must be met by an employee to successfully perform the essential functions of this job. Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions. This is a remote position. While performing the duties of this job, the employee regularly works in a climate-controlled environment. Candidates must be able to sit, read, work on a computer, and watch a computer screen for extended periods of time. Occasionally required to stand, walk, use hands and fingers, kneel or crouch.

Commence is an equal employment opportunity employer. All personnel processes are merit-based and applied without discrimination on the basis of race, color, religion, sex, sexual orientation, gender identity, marital status, age, disability, national or ethnic origin, military and veteran status or any other characteristic protected by applicable law.

Commence.AI is committed to providing equal employment opportunities to all applicants, including individuals with disabilities. If you require a reasonable accommodation to participate in the application process due to a disability, please contact Human Resources at (757) 306-4920 or hr@commence.ai. Please note that unless you are requesting an accommodation, all applications must be submitted through our online application system.

Sr Site Reliability Engineer 15941 Frost Leaf Ln, Leesburg, VA 20176 $145,000
