HPC Systems Administrator (Hardware & Infrastructure Operations)
Stanford University
Stanford, CA (In Person)
$160,981 Salary, Full-Time
Job Description
Business Affairs: University IT (UIT), Stanford, California, United States
Thank you for your interest in Stanford University. While we have instituted a hiring pause for non-critical staff positions, we are actively recruiting for the positions currently listed on our careers page.
Please note: The application portal will be offline for maintenance from May 7 at 7 PM to May 8 at 3 AM (PT).
Job Summary
DATE POSTED: Apr 17, 2026
SCHEDULE: Full-time
JOB CODE: 4833
EMPLOYEE STATUS: Regular
GRADE: K
REQUISITION ID: 108777
WORK ARRANGEMENT: On Site
Please note: Visa Sponsorship is not provided for this position.
The Sherlock HPC cluster is the flagship of Stanford's research computing environment, supporting thousands of users and a massive variety of scientific workloads. We are looking for an HPC Systems Administrator who thrives at the intersection of high-density hardware and Linux systems engineering. In this role, you will be the primary steward of the physical infrastructure on Sherlock and other platforms. You will ensure that our 1,500+ compute nodes, high-density GPU racks, and petabyte-scale storage arrays are meticulously maintained, expertly tuned, and highly available.
Why Stanford?
You won't just be swapping parts; you will be managing the physical backbone of a world-class research environment. From debugging errors on NVIDIA H200s to optimizing InfiniBand cabling for our Lustre scratch tiers, your work is the foundation upon which Nobel-caliber research is built.
Primary Responsibilities
Hardware Lifecycle & Deployment:
Lead the physical deployment, burn-in, troubleshooting, and decommissioning of compute nodes, GPU servers, and high-density storage systems.
Diagnostics & Root Cause Analysis:
Perform troubleshooting on hardware issues, such as memory errors, GPU thermal throttling, and network failures, and coordinate with vendors for support and replacements.
Data Center Operations:
Collaborate with the data centers team to plan and manage hardware deployments.
Provisioning & Automation:
Work with lead platform administrators on testing and provisioning to ensure rapid, consistent deployment of cluster images across the fleet.
Health & Telemetry:
Refine hardware-level monitoring to proactively identify failing components before they impact active research jobs.
Required Qualifications:
Education:
Bachelor's degree and eight years of relevant experience, or a combination of education and relevant experience.
Experience:
3-5+ years of experience in Linux Systems Administration, with a strong preference for candidates from HPC, large-scale data center, or research environments.
Hardware Proficiency:
Solid understanding of x86 server architecture, GPU systems, Ethernet, and high-performance interconnects.
Scripting:
Proficiency in scripting languages for automating hardware health checks, log parsing, and routine maintenance tasks.
Infrastructure Management:
Experience using configuration management tools to manage hardware settings and firmware versions at scale. Experience working with data center teams to populate and maintain DCIM solutions preferred.
Physical Requirements:
Ability to lift up to 50 lbs and work comfortably in a data center environment, including racking equipment and managing complex cable topologies.
Communication:
Strong written and verbal communication skills.
Preferred Skills:
- Direct experience maintaining hardware for HPC systems and large-scale storage systems.
- Familiarity with the Slurm workload manager and how hardware health impacts job scheduling.
- Exposure to liquid cooling solutions or high-density rack power management.
Physical Requirements:
- Constantly perform desk-based computer tasks.
- Frequently sit, grasp lightly/fine manipulation.
- Occasionally stand/walk, writing by hand.
- Rarely use a telephone, lift/carry/push/pull objects that weigh up to 10 pounds.
Working Conditions:
- May work extended hours, evenings, and weekends.
Work Standards:
Interpersonal Skills:
Demonstrates the ability to work well with Stanford colleagues and clients and with external organizations.
Promote Culture of Safety:
Demonstrates commitment to personal responsibility and value for safety; communicates safety concerns; uses and promotes safe behaviors based on training and lessons learned.
- Subject to and expected to comply with all applicable University policies and procedures, including but not limited to the personnel policies and other policies found in Stanford's Administrative Guide, adminguide.