

AI Operations & Infrastructure Engineer

Job

Invictus

Fort Meade, MD (In Person)

Full-Time

Posted 2 days ago (Updated 18 hours ago) • Actively hiring

Expires 7/24/2026


Job Description

Job Requirements
Fort Meade, MD
Top Secret/SCI
CI Polygraph
Mid Level Career (5+ yrs experience)
Salary not specified

Title: AI Operations & Infrastructure Engineer
Location: Fort Meade, MD
Clearance: TS/SCI with a CI Polygraph

Job Details:
Manage and maintain AI computing platforms, including GPUs and other specialized hardware
Install and configure GPU drivers and software
Oversee the AI software stack and tools
Implement and manage containerization technologies like Docker and Kubernetes
Configure and optimize networking infrastructure for AI workloads, including InfiniBand and Ethernet
Manage storage solutions for AI data, considering performance and capacity requirements
Deploy and manage data processing units (DPUs) to accelerate data center workloads
Monitor and manage AI cluster health and resource utilization
Implement workload management and scheduling tools like Slurm and Kubernetes
Ensure efficient power and cooling for AI infrastructure to maintain optimal operating conditions
Configure high-performance networking solutions for AI and machine learning workloads
Optimize network performance to ensure maximum throughput and minimal latency for AI computations
Implement and fine-tune network protocols to enhance data transfer speeds and efficiency
Integrate NVIDIA networking products with existing AI infrastructure, including servers, GPUs, and storage systems
Deploy networking solutions in data centers to ensure seamless connectivity between AI components
Diagnose and resolve networking issues impacting AI workloads to maintain optimal system performance
Provide technical support and guidance to teams managing AI infrastructure
Collaborate with data scientists, researchers, and IT professionals to understand networking requirements and challenges
Lead deployment and validation of servers and systems for AI enabled platforms
Configure and manage network topologies, BMC, OOB, TPM, power, and cooling
Install, upgrade, and validate GPU-based servers, BlueField DPUs, cables, and transceivers
Perform firmware upgrades, hardware validation, and storage setup
Configure and administer physical and logical resources, including MIG partitioning and BlueField platforms
Install and configure operating systems, cluster software, drivers, containers (Docker), and NGC CLI
Manage and orchestrate clusters using NVIDIA Base Command Manager, Slurm, Pyxis, Enroot, and Run:ai
Perform stress, benchmarking, and burn-in tests using HPL, NCCL, NVIDIA NeMo, and ClusterKit
Verify cabling, firmware/software versions, and network signal quality
Troubleshoot and resolve hardware, software, storage, and performance faults
Replace faulty components and optimize systems for AMD/Intel platforms
Monitor, document, and report on cluster health, resource usage, and job performance
Ensure secure, efficient, and scalable operation of NVIDIA AI infrastructure, including user access and workload management

Requirements:
Qualified candidates must hold an active NVIDIA Professional Certification in either AI Networking, AI Infrastructure, or AI Operations
Prior direct, hands-on professional experience administering NVIDIA GPU and data processing unit (DPU) technologies, AI software stacks, and data center environments for high-performance AI workloads
Comprehensive expertise in deploying and maintaining AI compute platforms, requiring proficiency in containerization and workload orchestration using Docker, Kubernetes, Slurm, NVIDIA Base Command Manager, and Run:ai
Must be capable of configuring physical and logical resources, including Multi-Instance GPU (MIG) partitioning and BlueField platforms, while overseeing critical facility elements such as power, cooling, and storage solutions
The ability to demonstrate advanced skills in AI networking, specifically configuring and optimizing high-performance InfiniBand and Ethernet fabrics to ensure maximum throughput and minimal latency
Current active TS/SCI clearance with a CI Polygraph

Equal Opportunity Employer/Veterans/Disabled
group id: 90789821