Job Description
Site Reliability Engineer II
Alibaba Cloud US LLC - Bellevue, WA
Posted: 5/28/2026 - Expires: 7/2/2026
Job ID: 293447539
Job Description
Platform Stability & High Availability:
Conduct health checks, risk assessments, and preventive maintenance for database platform components. Design and implement HA solutions (e.g., automated fault recovery, adaptive disaster resilience) and cloud-native technologies. Optimize network architecture and Kubernetes (K8s) cluster operations for database services.
Operational Tooling & Automation:
Develop platforms/tools for large-scale distributed systems management, including automated deployment, monitoring, and diagnostics. Enhance observability through metrics, logging, tracing, and alerting systems (e.g., Prometheus, Grafana, OpenTelemetry).
Incident Management & Optimization:
Resolve live-site issues, including performance bottlenecks, capacity scaling, and security threats. Collaborate with product teams to refine architectures, reduce latency, and improve availability.
Cross-Functional Collaboration:
Drive standardization of control-plane components (e.g., microservice frameworks, metadata services) across database engines.
1. Research and Development of Database Platform Infrastructure
Systems & Products:
The employee will design and support Database-as-a-Service (DBaaS) platforms. This includes cloud-native database engines (such as PolarDB, RDS, or similar distributed SQL/NoSQL databases) and their control-plane orchestration systems.
Research Areas:
Conduct research on Distributed Consensus Protocols (e.g., Paxos, Raft) to ensure data consistency and high availability. Research Adaptive Disaster Resilience algorithms to automate failover across multi-region cloud architectures.
Process:
Lead the end-to-end lifecycle of high-availability solutions, from architectural design and prototyping through automated stress testing and chaos engineering that validate system robustness under extreme failure modes.
2. Large-Scale Distributed Systems Management & Tooling
Equipment & Systems:
Work extensively with Kubernetes (K8s) orchestration, focusing on Custom Resource Definitions (CRDs) and Operators to manage stateful database workloads.
Tools & Technologies:
Develop and maintain internal automation platforms using languages such as Go (Golang), Java, or Python. Utilize Prometheus, Grafana, and OpenTelemetry to build advanced observability frameworks that provide real-time telemetry and predictive diagnostics for thousands of database nodes.
Specific Projects:
Develop an automated Database Fleet Management System that handles patching, scaling, and migration of large-scale distributed clusters without service interruption.
3. Network Architecture and Cloud-Native Optimization
Technical Focus:
Optimize the networking stack within virtualized environments (e.g., Service Mesh, VPC configurations, Load Balancers) to minimize tail latency and maximize throughput for database traffic.
Industry Application:
These duties fall within the Cloud Computing and Information Technology Services industry, specifically Infrastructure-as-Software and Large-Scale Data Management.
4. Incident Management and Security Performance Process:
Implement a systematic approach to Root Cause Analysis (RCA) for complex live-site incidents involving performance bottlenecks, such as CPU saturation, I/O wait times, or memory leaks in distributed environments.
Security:
Design and implement automated security auditing tools to ensure database components comply with industry standards (e.g., encryption at rest/in transit, identity and access management).
Telecommuting may be permitted. When not telecommuting, must report to worksite.
Requirements:
Bachelor's degree or foreign degree equivalent in Computer Science, Information Science, or related field.
2 years of experience as a Site Reliability Engineer II or in any other related occupation, job title, or position.
Worksite Address:
205 108th Ave NE, Suite 400, Bellevue, WA 98004
Job Summary
Company Details
Company: Alibaba Cloud US LLC
Industry: All Other Professional, Scientific, and Technical Services
Contact method: Contact Info
Job Information
Location: Bellevue, WA
Job Type: Full Time Employee
Education Level: Bachelor's degree
Job Position: 1 Position(s) Open
Salary/Wage: $144,000.00 - $172,800.00/year
Duration: Over 150 Days
Additional Information
Reference Code: 9849968
Federal Contractor: No
Affirmative Action Plan: No