Principal Site Reliability Engineer (Intelligent Automation)
Genentech
South San Francisco, CA (In Person)
$232,300 Salary, Full-Time
Job Description
The Opportunity:
Infrastructure as Code (IaC) Design and Implementation
Architect and implement IaC solutions using tools like Terraform, Spacelift, or CloudFormation to provision and manage cloud infrastructure for ML and HPC workloads.
Automate the deployment of scalable ML pipelines, HPC clusters, and supporting services across global regions.

Global Availability and Resiliency
Architect resilient and highly available solutions for ML and HPC workloads using cloud-native practices such as auto-scaling, load balancing, and failover mechanisms.
Implement disaster recovery (DR) and business continuity plans for critical systems to ensure global operational integrity.
Conduct chaos engineering experiments to validate system reliability and identify potential weaknesses.

Automation and Observability
Develop automation scripts and workflows to streamline infrastructure management, deployment, and scaling for ML and HPC use cases.
Implement robust monitoring, logging, and alerting frameworks using tools like Prometheus, Grafana, Datadog, or ELK Stack to provide deep insights into system health and performance.
Knowledge of AIOps incident management, processes, and tooling.

Collaboration and Leadership
Provide technical leadership to a team of engineers, fostering a culture of collaboration, innovation, and continuous improvement.
Partner with cross-functional teams to align infrastructure solutions with business objectives and ML/HPC workload requirements.
Mentor and train junior engineers in IaC practices, ML, and HPC infrastructure design.

Cost Optimization and Governance
Monitor and optimize cloud infrastructure usage and costs for ML and HPC workloads.
Ensure compliance with organizational security, governance, and regulatory policies in all IaC and cloud implementations.

Who You Are:
Bachelor's or Master's degree in Computer Science or a similar technical field, or equivalent experience, and 7+ years of experience in software engineering or Site Reliability Engineering (SRE).
Proven expertise in supporting and deploying IaC solutions in cloud environments (AWS, Azure, or GCP) for ML and HPC workloads.
Background in MLOps pipelines, including model versioning, CI/CD for ML, and feature store integration, as well as experience with managed ML services (e.g., AWS SageMaker, Google AI Platform, or Azure ML).
Deep understanding of cloud-native architectures, including autoscaling, serverless, and multi-region deployments.

Technical Skills:
Advanced proficiency with IaC tools: Terraform, Pulumi, or CloudFormation.
Expert in scripting and automation: Python, Bash, or Go.
Strong understanding of GPU-accelerated computing (e.g., NVIDIA CUDA, TensorFlow) and HPC workload scaling.
Knowledge of distributed systems, storage solutions, and data pipelines.
Familiar with monitoring and observability tools: Prometheus, Grafana, Datadog, or similar.

Soft Skills:
Strong problem-solving skills, with a methodical approach to troubleshooting.
Excellent communication, leadership, and mentoring abilities.
Ability to work collaboratively across teams in a fast-paced, dynamic environment.

Preferred Qualifications:
Certifications in cloud platforms (e.g., AWS Certified Solutions Architect, GCP Professional Cloud Architect, or Azure Solutions Architect).
Experience with distributed ML frameworks and data engineering pipelines (e.g., Horovod, TensorFlow Distributed, Apache Airflow, Apache Spark).
Experience with compliance frameworks (e.g., GDPR, SOC 2, ISO 27001).

Onsite presence on our South San Francisco campus is expected at least 3 days a week.
Relocation benefits are not available for this job posting.

The expected salary range for this position, based on the primary location of California, is $162,600 - $302,000. Actual pay will be determined based on experience, qualifications, geographic location, and other job-related factors permitted by law. A discretionary annual bonus may be available based on individual and Company performance. This position also qualifies for the benefits detailed at the link provided below.
Benefits

Genentech is an equal opportunity employer. It is our policy and practice to employ, promote, and otherwise treat any and all employees and applicants on the basis of merit, qualifications, and competence. The company's policy prohibits unlawful discrimination, including but not limited to discrimination on the basis of protected veteran status or disability status, consistent with all federal, state, and local laws.
If you have a disability and need an accommodation in relation to the online application process, please contact us by completing this form: Accommodations for Applicants.