
NetOps Engineer GPU Cluster Upgrades


E-Solutions Inc.

Full-Time

Posted 8 weeks ago (Updated 7 weeks ago) • Actively hiring

Expires 5/27/2026




Job Description

NetOps Engineer - GPU Cluster Upgrades Remote Job | 2026-02-24 16:29:45

Job Title: NetOps Engineer - GPU Cluster Upgrades
Location: Mountain View, CA / Hybrid
Duration: 6 Months Contract (Likely Extension)
Domain: Network Operations / GPU Infrastructure

Job Description:
We are seeking a highly skilled NetOps Engineer with experience in GPU cluster infrastructure and automation to support upgrades and operational setup for a cluster of approximately 100 NVIDIA A100 GPUs. The engagement involves deploying industry-standard automation using MAAS and Ansible to automate lifecycle management of the GPU compute nodes and to implement monitoring, scheduling, and configuration management across the cluster environment.

Key Responsibilities

1. Bare Metal Provisioning
- Install and configure Canonical MAAS (Metal-as-a-Service) on the designated head node.
- Configure automated OS provisioning for Ubuntu nodes via IPMI/BMC.
- Enable node discovery, power management, disk wiping, and automated provisioning.

2. Configuration Management
- Develop modular Ansible playbooks for automated configuration management.
- Automate installation and version locking for:
  - NVIDIA drivers (headless)
  - CUDA Toolkit v12.x
  - InfiniBand / RDMA networking stack
  - Security agents, including CrowdStrike and fleet security monitoring agents

3. Workload Scheduling
- Install and configure the Slurm Workload Manager using Ansible.
- Configure fair-share scheduling and preemption rules to allocate GPU resources across multiple teams.
- Integrate user authentication via local accounts or LDAP.

4. Observability & Monitoring
- Deploy Prometheus for metrics collection.
- Implement visualization dashboards using Grafana.
- Deploy the NVIDIA DCGM Exporter on all compute nodes to collect GPU telemetry.
- Configure dashboards for monitoring GPU temperature, power draw, and per-user GPU utilization.

Expected Deliverables
- Functional MAAS web UI for automated node lifecycle management.
- Master Ansible playbook capable of configuring a fresh Ubuntu installation automatically.
- Fully operational Slurm controller (slurmctld) and compute node agents (slurmd).
- Pre-built NVIDIA monitoring dashboards in Grafana displaying real-time GPU metrics.

Required Skills
- Infrastructure & OS: Expertise with Ubuntu 22.04 LTS provisioning; experience with IPMI/BMC-based bare-metal management.
- Automation: Strong experience with Ansible playbook development.
- GPU & NVIDIA Stack: NVIDIA GPU infrastructure, CUDA Toolkit, DCGM Exporter; ability to troubleshoot NVIDIA XID errors and PCIe GPU issues.
- Networking: Experience with InfiniBand / RDMA networking.
- Cluster Scheduling: Deployment and configuration of the Slurm Workload Manager.
- Monitoring: Experience with Prometheus + Grafana observability stacks.

Rakesh Kumar Patel
Sr.
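To illustrate the version-locking work described under Configuration Management, an Ansible task fragment might look roughly like the following sketch; the package names and versions (nvidia-headless-535, cuda-toolkit-12-4) are assumptions for illustration, not taken from the posting.

```yaml
# Hypothetical playbook fragment: install and pin a headless NVIDIA driver
# and CUDA toolkit on Ubuntu GPU nodes. Package names/versions are assumed.
- name: Configure GPU compute node
  hosts: gpu_nodes
  become: true
  tasks:
    - name: Install headless NVIDIA driver at a pinned version
      ansible.builtin.apt:
        name: nvidia-headless-535
        state: present
        update_cache: true

    - name: Hold the driver package so upgrades cannot move the version
      ansible.builtin.dpkg_selections:
        name: nvidia-headless-535
        selection: hold

    - name: Install CUDA Toolkit 12.x
      ansible.builtin.apt:
        name: cuda-toolkit-12-4
        state: present
```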
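For the fair-share and preemption requirements under Workload Scheduling, a slurm.conf fragment could resemble the sketch below; node names, GPU counts, memory sizes, and weights are hypothetical values, not from the posting.

```
# Hypothetical slurm.conf fragment: multifactor priority with fair-share
# weighting, partition-priority preemption, and A100 GPUs as a GRES.
PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
GresTypes=gpu
# Example node/partition definitions (names and counts assumed)
NodeName=gpu[01-12] Gres=gpu:a100:8 CPUs=128 RealMemory=1024000 State=UNKNOWN
PartitionName=batch Nodes=gpu[01-12] Default=YES MaxTime=72:00:00 State=UP
```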
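For the Observability & Monitoring scope, a minimal Prometheus scrape configuration for the DCGM Exporter (which serves GPU metrics on port 9400 by default) might look like this; the target host names are placeholders.

```yaml
# Hypothetical prometheus.yml fragment: scrape DCGM Exporter GPU telemetry
# from each compute node. Host names are illustrative placeholders.
scrape_configs:
  - job_name: dcgm
    scrape_interval: 15s
    static_configs:
      - targets:
          - gpu01:9400
          - gpu02:9400
```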
Technical Recruiter
m: (408) 459 7218
w: www.e-solutionsinc.com
e: rakesh.p@e-solutionsinc.com

Keywords: NetOps, GPU, Nvidia, MAAS | United States
