HPC Kubernetes/Slurm Cluster Engineer

Fluid Numerics

Remote

$140,000 Salary, Full-Time

Posted 3 weeks ago (Updated 2 weeks ago) • Actively hiring

Expires 6/17/2026

Job Description

Overview:

We are seeking a highly skilled HPC/AI/ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Kubernetes and Slurm (Slinky). This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. In this role, you will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI/ML workloads. You will work alongside our team to support in-house, partner, and customer infrastructure.

Responsibilities:

Cluster Engineering & Deployment:
Participate in the design and bring-up of bare-metal HPC/AI/ML environments.
Integrate heterogeneous hardware platforms into cohesive scheduling environments.
Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, Warewulf, CI/CD pipelines) for reproducible cluster build-out.
Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management:
Configure and operate the Slurm Workload Manager.
Build custom Slurm plugins and scripts (epilog/prolog, pam_slurm_adopt) to extend functionality and integrate with authentication, health checking, and monitoring.
Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring:
Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
Manage security and access control (LDAP/SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support:
Assist cluster users with developing workflows that make efficient use of compute resources.
Containerize HPC applications with Docker/Podman/Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
Automate cost accounting and cluster usage reporting.

Qualifications:

Previous experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
Expert in Slurm configuration, partition design, QoS/preemption policies, and GRES GPU scheduling.
Strong background in Linux system administration, networking, and performance tuning for HPC environments.
Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100/200 GbE), and monitoring stacks.
Proficient with automation tools (Ansible, Terraform, CI/CD pipelines) and version control.
Demonstrated ability to operate GPU-accelerated clusters at scale.
Previous experience managing Kubernetes deployments.
Exceptional candidates have familiarity with common AI/ML software package dependencies and researcher workflows.

Job Type:

Full-time

Pay:

From $140,000.00 per year

Benefits:

401(k)
401(k) matching
Health insurance
Relocation assistance

Experience:

Linux and HPC cluster system administration: 1 year (Required)

Language:

English (Required)

Work Location:

Hybrid remote in Hickory, NC 28602