Reliability Analyst

Job

Optomi

Fort Worth, TX (In Person)

Full-Time

Posted 6 days ago (Updated 15 hours ago) • Actively hiring

Expires 7/3/2026

Job Description

Reliability Analyst at Optomi in Fort Worth, Texas. Posted 4 days ago.

Reliability & Observability Analyst I

Job Type: Full-Time

Location: On-site (Fort Worth, TX)

Optomi, in partnership with one of our premier clients, is seeking an IOC Reliability & Observability Analyst I to support 24/7 HPC data center operations. This role is ideal for early-career professionals looking to grow into Site Reliability Engineering, Infrastructure Operations, or Platform Engineering paths while gaining hands-on experience in observability, incident analysis, operational automation, and AIOps-enabled environments. The ideal candidate will bring a strong reliability mindset, foundational Linux and networking knowledge, and experience working within production infrastructure or operations environments.

What the Right Candidate Will Enjoy:

Working within a cutting-edge HPC and data center operations environment powered by renewable energy!
Gaining hands-on exposure to observability, AIOps, reliability engineering, and operational automation!
Collaborating closely with IOC, infrastructure, and engineering teams in a highly technical environment!
Clear growth path into SRE, Platform Engineering, or Infrastructure Operations roles!
Exposure to enterprise observability tooling, incident analysis, and reliability initiatives!

Experience of the Right Candidate:

1-3 years of experience in IOC, NOC, technical operations, systems analysis, or SRE-adjacent environments.
Exposure to 24/7 production infrastructure, cloud, or data center operations environments.
Foundational understanding of SRE concepts including MTTR, MTTD, service health, and incident management lifecycle.
Working knowledge of Linux systems, networking fundamentals, and infrastructure dependencies.
Experience working with logs, metrics, dashboards, and alerting systems.
Familiarity with observability platforms such as Splunk, Datadog, Prometheus, or similar tools.
Understanding of alert quality analysis, event correlation, anomaly detection, and monitoring gap identification.
Ability to review automation artifacts such as Python, Bash, or configuration-based workflows.
Strong analytical, troubleshooting, and communication skills with attention to operational detail.

Responsibilities of the Right Candidate:

Analyze incident data, operational signals, and system behaviors across infrastructure and data center environments.
Identify alerting gaps, false positives, delayed detections, and monitoring improvement opportunities.
Support continuous improvement initiatives for observability, reliability, and operational reporting.
Validate incident, ticketing, and operational data for accuracy and reporting integrity.
Review outputs from AIOps and automation platforms including anomaly detection and event correlation systems.
Assist with alert routing, enrichment, suppression testing, and observability automation efforts.
Produce SLA/KPI dashboards, reliability reporting, and operational insights for engineering and leadership teams.
Contribute to operational documentation, runbooks, and reliability-focused process improvements.
Partner cross-functionally with IOC, operations, and engineering teams to support platform stability and incident response.
Operate within established IOC processes while progressively developing deeper SRE and infrastructure operations expertise.