
NetOps Engineer GPU Cluster Upgrades


E-Solutions Inc.

Full-Time

Posted 8 weeks ago (Updated 7 weeks ago) • Actively hiring

Expires 5/27/2026




Job Description

NetOps Engineer - GPU Cluster Upgrades Remote Job | 2026-02-24 16:29:45

Job Title: NetOps Engineer - GPU Cluster Upgrades
Location: Mountain View, CA / Hybrid
Duration: 6 Months Contract (Likely Extension)
Domain: Network Operations / GPU Infrastructure

Job Description:
We are seeking a highly skilled NetOps Engineer with experience in GPU cluster infrastructure and automation to support upgrades and operational setup for a cluster of approximately 100 NVIDIA A100 GPUs. The engagement involves deploying industry-standard automation using MAAS and Ansible to automate lifecycle management of the GPU compute nodes and to implement monitoring, scheduling, and configuration management across the cluster environment.

Key Responsibilities

1. Bare Metal Provisioning
- Install and configure Canonical MAAS (Metal-as-a-Service) on the designated head node.
- Configure automated OS provisioning for Ubuntu nodes via IPMI/BMC.
- Enable node discovery, power management, disk wiping, and automated provisioning.

2. Configuration Management
- Develop modular Ansible playbooks for automated configuration management.
- Automate installation and version locking for:
  - NVIDIA drivers (headless)
  - CUDA Toolkit v12.x
  - InfiniBand / RDMA networking stack
  - Security agents, including CrowdStrike and fleet security monitoring agents

3. Workload Scheduling
- Install and configure the Slurm Workload Manager using Ansible.
- Configure fair-share scheduling and preemption rules to allocate GPU resources across multiple teams.
- Integrate user authentication via local accounts or LDAP.

4. Observability & Monitoring
- Deploy Prometheus for metrics collection.
- Implement visualization dashboards using Grafana.
- Deploy the NVIDIA DCGM Exporter on all compute nodes to collect GPU telemetry.
- Configure dashboards for monitoring GPU temperature, power draw, and per-user GPU utilization.

Expected Deliverables
- Functional MAAS web UI for automated node lifecycle management.
- Master Ansible playbook capable of configuring a fresh Ubuntu installation automatically.
- Fully operational Slurm controller (slurmctld) and compute node agents (slurmd).
- Pre-built NVIDIA monitoring dashboards in Grafana displaying real-time GPU metrics.

Required Skills
- Infrastructure & OS: Expertise with Ubuntu 22.04 LTS provisioning; experience with IPMI/BMC-based bare-metal management.
- Automation: Strong experience with Ansible playbook development.
- GPU & NVIDIA Stack: NVIDIA GPU infrastructure, CUDA Toolkit, DCGM Exporter; ability to troubleshoot NVIDIA XID errors and PCIe GPU issues.
- Networking: Experience with InfiniBand / RDMA networking.
- Cluster Scheduling: Deployment and configuration of the Slurm Workload Manager.
- Monitoring: Experience with Prometheus + Grafana observability stacks.

Rakesh Kumar Patel
Sr.
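To illustrate the version-locking work described under Configuration Management, an Ansible task fragment might look roughly like the following sketch; the package names and versions (nvidia-headless-535, cuda-toolkit-12-4) are assumptions for illustration, not taken from the posting.

```yaml
# Hypothetical playbook fragment: install and pin a headless NVIDIA driver
# and CUDA toolkit on Ubuntu GPU nodes. Package names/versions are assumed.
- name: Configure GPU compute node
  hosts: gpu_nodes
  become: true
  tasks:
    - name: Install headless NVIDIA driver at a pinned version
      ansible.builtin.apt:
        name: nvidia-headless-535
        state: present
        update_cache: true

    - name: Hold the driver package so upgrades cannot move the version
      ansible.builtin.dpkg_selections:
        name: nvidia-headless-535
        selection: hold

    - name: Install CUDA Toolkit 12.x
      ansible.builtin.apt:
        name: cuda-toolkit-12-4
        state: present
```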
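For the fair-share and preemption requirements under Workload Scheduling, a slurm.conf fragment could resemble the sketch below; node names, GPU counts, memory sizes, and weights are hypothetical values, not from the posting.

```
# Hypothetical slurm.conf fragment: multifactor priority with fair-share
# weighting, partition-priority preemption, and A100 GPUs as a GRES.
PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
GresTypes=gpu
# Example node/partition definitions (names and counts assumed)
NodeName=gpu[01-12] Gres=gpu:a100:8 CPUs=128 RealMemory=1024000 State=UNKNOWN
PartitionName=batch Nodes=gpu[01-12] Default=YES MaxTime=72:00:00 State=UP
```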
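For the Observability & Monitoring scope, a minimal Prometheus scrape configuration for the DCGM Exporter (which serves GPU metrics on port 9400 by default) might look like this; the target host names are placeholders.

```yaml
# Hypothetical prometheus.yml fragment: scrape DCGM Exporter GPU telemetry
# from each compute node. Host names are illustrative placeholders.
scrape_configs:
  - job_name: dcgm
    scrape_interval: 15s
    static_configs:
      - targets:
          - gpu01:9400
          - gpu02:9400
```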
Technical Recruiter
m: (408) 459 7218
w: www.e-solutionsinc.com
e: rakesh.p@e-solutionsinc.com

Keywords: NetOps, GPU, Nvidia, MAAS | United States
