Job Description
Senior Infrastructure Lead
Acasia Operations, Inc.
Phoenix, AZ

Job Details
Full-time
$140,000 - $170,000 a year

Benefits
Health insurance, Dental insurance, 401(k), Vision insurance, 401(k) matching

Qualifications
Managing maintenance teams, Communication with suppliers, Managing technical support teams, Data center experience, Linux support, Procedural guides, Technical documentation, Hardware maintenance, Incident Escalation, Network cabling, Firewall, Managing engineering teams, Attention to detail, Computer hardware, Systems engineering, Vendor relationship management, Leading team collaboration initiatives, Hardware management, Network routing, Linux, Supplier relationship management (SRM), Escalation handling, Standard operating procedures (SOPs), Project stakeholder communication, Technical writing for network engineers, Cross-functional communication, Senior leadership, Managing tech infrastructure teams

Full Job Description

Senior Infrastructure Lead

Compensation:
Base: $140k-$170k
Bonus: 25%
Benefits: Health, Dental, Vision, 401k
Location: Onsite. Phoenix, Arizona

About Acasia
Acasia builds and operates GPU infrastructure for enterprise AI workloads. We help customers access high-performance compute by deploying, managing, and supporting GPU clusters in data center environments. As Acasia scales, infrastructure execution becomes a core company capability. The Senior Infrastructure Lead will be responsible for building the operating system that allows Acasia to deploy GPU infrastructure reliably, maintain customer SLAs, and scale technical operations across data center environments.

Role Summary
The Senior Infrastructure Lead will own the deployment, maintenance, operational readiness, and technical support of Acasia's GPU infrastructure across data center sites. This person will lead a small team of onsite technical specialists, create deployment and maintenance standards, manage escalations, coordinate with engineering, vendors, data centers, and customers, and ensure Acasia's infrastructure is delivered and operated with discipline. This is a senior hands-on leadership role. The right person can work at both levels: they can build the process, lead the team, and personally jump into technical issues when needed.

Key Responsibilities

Infrastructure Deployment Leadership
Lead the deployment of GPU servers, networking equipment, storage, racks, PDUs, cabling, and supporting infrastructure.
Own site readiness, deployment planning, installation sequencing, equipment validation, and customer handoff.
Build and maintain deployment runbooks, checklists, test plans, and acceptance criteria.
Coordinate with data centers, vendors, procurement, logistics, engineering, finance, and customer-facing teams to ensure deployments are executed on schedule.
Ensure all equipment is installed, configured, labeled, documented, and validated according to Acasia standards.

Team Leadership
Manage, train, and develop a team of Infrastructure Technical Specialists.
Establish expectations for execution quality, documentation, safety, escalation, and customer-impacting work.
Create staffing plans for deployment, maintenance, travel, and on-call coverage.
Build repeatable operating rhythms for daily work, incident reviews, maintenance windows, and deployment readiness.
Hold the team accountable to timelines, quality standards, and SLA-related responsibilities.

Operations, Maintenance & SLA Ownership
Own infrastructure uptime, maintenance execution, incident response, and operational readiness.
Lead response to customer-impacting infrastructure issues.
Establish escalation paths across technical specialists, engineering, data center vendors, hardware suppliers, and customer-facing teams.
Ensure maintenance, repairs, replacements, and upgrades are executed safely and efficiently.
Track recurring infrastructure issues and drive root cause analysis.
Develop preventative maintenance programs to reduce avoidable downtime.

Technical Troubleshooting & Escalation
Serve as the senior technical escalation point for hardware, GPU, networking, power, cooling, connectivity, and configuration issues.
Diagnose complex issues across server hardware, GPUs, firmware, operating systems, networking, and data center environments.
Partner with engineering on cluster bring-up, monitoring, performance validation, and reliability improvements.
Support customer escalations with clear technical diagnosis, action plans, and timelines.
Ensure incidents are documented with root cause, corrective action, and prevention steps.

Standards, Process & Documentation
Build Acasia's infrastructure operations playbook.
Create and maintain SOPs for deployments, maintenance, incident response, change management, inventory control, hardware swaps, site access, and customer handoff.
Own infrastructure documentation, including rack elevations, asset records, serial numbers, cabling maps, network diagrams, configuration records, and maintenance history.
Implement quality control standards before infrastructure is placed into production.
Improve process maturity as Acasia scales from individual deployments to a repeatable infrastructure platform.

Vendor & Data Center Coordination
Manage operational relationships with data center providers, hardware vendors, logistics providers, and field service partners.
Coordinate site access, maintenance windows, delivery schedules, remote hands, hardware replacements, and escalation procedures.
Help evaluate data center operational readiness for GPU deployments.
Ensure vendor accountability against timelines, quality standards, and service obligations.

Required Qualifications
7+ years of experience in data center operations, infrastructure operations, systems engineering, network operations, technical field operations, or high-performance compute environments.
2+ years leading technical teams, field teams, infrastructure technicians, or operational support functions.
Strong hands-on experience with server hardware, racks, cabling, power, cooling, networking, and data center procedures.
Strong Linux systems knowledge.
Strong understanding of networking fundamentals, including switching, routing, VLANs, DNS, DHCP, firewalls, IP addressing, and network troubleshooting.
Experience managing technical incidents, customer-impacting outages, maintenance windows, and escalation workflows.
Ability to create operational processes, runbooks, checklists, and documentation from scratch.
Strong vendor management and cross-functional coordination skills.
Ability to travel to data center sites as needed.
Strong communication skills with both technical and non-technical stakeholders.

Preferred Qualifications
Experience with GPU infrastructure, AI infrastructure, HPC, or high-density compute environments.
Experience with NVIDIA GPU systems, CUDA, NVIDIA drivers, InfiniBand, RoCE, Mellanox/NVIDIA networking, or high-performance networking fabrics.
Experience with Kubernetes, Slurm, Docker, virtualization, storage systems, or cloud infrastructure.
Experience with enterprise server platforms such as Dell, HPE, Supermicro, Lenovo, ASUS, Gigabyte, or similar.
Familiarity with remote management tools such as IPMI/BMC, iDRAC, iLO, Redfish, and DCIM systems.
Experience building operational functions in a startup or high-growth technical environment.

Pay: $140,000.00 - $170,000.00 per year
Benefits: 401(k), 401(k) matching, Dental insurance, Health insurance, Vision insurance
Work Location: In person