IOC Systems Specialist

Optomi

Fort Worth, TX (In Person)

Full-Time

Posted 6 days ago (Updated 8 hours ago) • Actively hiring

Expires 7/4/2026

Job Description

IOC Systems Specialist at Optomi in Fort Worth, Texas. Posted 4 days ago.

Type: full-time

IOC Systems Specialist

Schedule: Onsite | 12-hour rotating shifts (4-on/3-off alternating with 3-on/4-off)

Optomi, in partnership with a leading AI cloud infrastructure organization, is seeking an IOC Systems Specialist to join their growing operations team in Fort Worth, TX. This role will provide Tier 2 operational support for high-performance computing (HPC) cloud environments focused on large-scale AI training and inference workloads. The ideal candidate will have hands-on experience supporting HPC infrastructure, Kubernetes environments, Slurm workload management, and enterprise storage platforms such as WEKA and VAST. This individual will play a key role in maintaining system stability, troubleshooting complex incidents, and supporting mission-critical infrastructure within a 24x7 IOC/NOC environment.

What the Right Candidate Will Enjoy:

- Working with cutting-edge AI and HPC infrastructure technologies!
- Supporting large-scale GPU cluster environments!
- Exposure to advanced Kubernetes, cloud, and storage technologies!
- Opportunities to contribute to operational improvements and automation initiatives!
- Joining a fast-growing organization focused on sustainable, renewable-powered AI infrastructure!
- Collaborative environment with strong technical leadership and growth opportunities!

What Type of Experience the Right Candidate Has:

- 2-5 years of experience supporting or operating HPC clusters in production environments
- Strong operational experience with WEKA and VAST storage platforms
- Hands-on experience with Kubernetes administration and troubleshooting
- Experience supporting Slurm workload manager environments
- Familiarity with HPC monitoring, observability, and alerting platforms
- Experience performing incident response and root cause analysis in complex systems
- Understanding of cloud platforms such as AWS, Azure, or GCP
- Knowledge of HPC networking and storage technologies, including InfiniBand and high-throughput interconnects

Responsibilities of the Right Candidate:

- Provide Tier 2 operational support for HPC cloud infrastructure environments
- Monitor, troubleshoot, and resolve incidents involving Kubernetes, Slurm, storage, networking, and cloud systems
- Serve as an escalation point for Tier 1 support teams
- Perform root cause analysis and coordinate with engineering teams on permanent resolutions
- Execute operational changes, upgrades, patching, and maintenance activities
- Maintain and improve operational documentation, runbooks, and knowledge base articles
- Support monitoring and observability tooling to proactively identify system issues
- Assist with operational readiness and production support for new HPC capabilities
- Mentor junior operations staff and support continuous service improvement initiatives
- Participate in on-call rotations and major incident response activities

Job Must Haves:

- Must have hands-on experience with WEKA and VAST storage environments
- 2-5 years supporting HPC clusters in production or IOC/NOC environments
- Working knowledge of Kubernetes
- Operational experience with Slurm workload manager
- Familiarity with HPC monitoring and observability tooling
- Experience with incident response and root cause analysis
- Understanding of AWS, Azure, or GCP cloud platforms
- Knowledge of HPC networking and storage infrastructure
- Ability to work onsite in Fort Worth on a rotating 12-hour shift schedule

Nice to Have Skills:

- Bare-metal Kubernetes experience
- Relevant certifications such as CKA/CKAD, RHCSA, Linux+, ITIL, or Server+
- Experience with GPU or HPC vendor technologies
- Experience supporting AI or large-scale compute environments
- Automation or scripting experience