Job Description
Operations-focused Systems Administrator / Systems Engineer (more of a Data Center Technician, actually) supporting a large-scale bare-metal server environment (~17,000 servers) with a heavy emphasis on CPU and GPU compute availability. This role is centered on reliability, automation, and operational excellence: digging into systems and pipelines when things break and improving them so they break less often. This is not hands-on data center work.
What you'll be doing:
- Administer and support large-scale bare-metal server infrastructure, primarily HPE and Dell platforms
- Perform server break/fix troubleshooting, including hardware faults, firmware/BIOS/BMC issues, POST failures, degraded components, and system instability
- Manage server lifecycle operations: onboarding, provisioning, firmware updates, BIOS/BMC configuration, and hardware refresh kits
- Own incident response and break/fix workflows while maintaining 98%+ compute availability SLAs
- Work cross-functionally with Data Center and Networking teams during hardware incidents, including ticket creation, repair coordination, and log collection
- Interface directly with HPE and Dell vendors: gathering diagnostics, sending logs, driving RMAs, and tracking issues through resolution
- Support and troubleshoot CI/CD and automation pipelines used for server provisioning, configuration, and lifecycle management
- Dig into automation code and workflows (Ansible, scripts, pipelines) when jobs fail, to understand root cause and unblock deployments
- Identify recurring operational issues and contribute to process improvements, runbooks, and reliability enhancements
- Help manage and reduce the operations backlog, prioritizing fixes, cleanup, and automation improvements
Must Have:
- Hands-on experience supporting HPE and Dell servers in production, including break/fix and hardware incident troubleshooting
- Experience with HPE iLO, Dell iDRAC, and related BMC environments
- Strong understanding of server hardware components (CPU, GPU, memory, disks, NICs, power) and common failure modes
- Experience troubleshooting automation and CI/CD pipelines that manage infrastructure (not just running them, but fixing them when they fail)
- Operational mindset with experience owning incidents, SLAs, backlog items, and process improvements
- Automation experience with Ansible, Bash, Jenkins, or similar tooling
- Exposure to GPU-dense, HPC, or high-performance compute environments
- Experience improving runbooks, reducing toil, and scaling operations through automation