Reliability & Observability Analyst II

Job

IREN

Fort Worth, TX (In Person)

Full-Time

Posted 3 days ago (Updated 16 hours ago) • Actively hiring

Expires 6/9/2026

Job Description

Reliability & Observability Analyst II
IREN, Dallas-Fort Worth, TX

Job Details: Full-time • 7 hours ago

Benefits: Health savings account, Paid holidays, Disability insurance, Health insurance, Dental insurance, Financial planning services, Paid time off, Employee assistance program, Vision insurance, 401(k) matching, Professional development assistance, Life insurance

Qualifications: Jira, Performance dashboard reports, Data Center Operations, Computer science, Data center experience, Bachelor's degree in information technology, Automation, IT system monitoring, IT service management, Data reporting, AIOps, Mid-level, 3 years, Data quality management, Incident response, Outlier detection, Bachelor's degree, SRE, Distributed systems, Splunk, Mentoring, Scripting, Computer networking, ServiceNow, IT, Linux, Data validation, Training & development, Root cause analysis
Full Job Description

Job Type: Full-Time
Location: Dallas / Fort Worth, TX
Department: Operations
Reporting to: Data Center Manager
Work Location Type: Onsite

IREN is a leading next-generation data center business powering the future with 100% renewable energy. We build, own and operate our data centers and take pride in being at the forefront of sustainable solutions for the ever-evolving applications of high-performance compute. We believe that human progress is invaluable, but it should be done in the right way - responsibly, sustainably and having a positive impact on the communities we operate in.

We are seeking an IOC Reliability & Observability Analyst II to support our 24/7 HPC Data Center Operations by performing advanced incident triage, improving alert quality and routing, and maintaining high-quality operational telemetry and reporting. This role partners with engineering and operations teams to identify detection gaps, tune monitoring and dashboards, and implement small automations and enrichment to reduce operational toil and improve time-to-action.

This is not a reporting-only role. You will partner closely with IOC, engineering, and operations teams to validate operational signals, tune alerts and dashboards, leverage AIOps outputs during incident response, and ensure telemetry is actionable for real-time triage and escalation.

Qualifications
- 3-5 years of experience in IOC/NOC/SRE-adjacent operations, reliability engineering, observability, or production support roles within 24/7 production environments
- Bachelor's degree in Computer Science, Data Science, IT, or equivalent hands-on professional experience
- Demonstrated ability to apply reliability engineering principles (e.g., incident lifecycle, MTTD/MTTR, operational risk) to improve detection, response effectiveness, and overall service stability
- Strong working knowledge of Linux systems, basic networking, and infrastructure dependencies across compute, network, and facility domains
- Practical experience supporting GPU-based compute environments or high-density clusters, including analysis of GPU health, performance degradation, and failure patterns to reduce customer impact and improve reliability
- Proven experience owning and improving alert quality, including reduction of false positives, missed detections, poor routing, and alert fatigue across complex environments
- Hands-on experience maintaining service health dashboards and operational reliability metrics, including supporting SLI/SLO reporting where defined by engineering or service owners
- Ability to correlate logs, metrics, and alerts across distributed systems (including GPU, network, and facility telemetry) to accelerate triage and diagnose complex incidents
- Experience working with AIOps-enabled outputs (e.g., anomaly detection, event correlation, automated enrichment), validating accuracy during incident triage and escalating when automated signals do not align with operational conditions
- Ability to write or modify small automation artifacts (e.g., scripts, templates, configuration-driven workflows) to standardize triage, enrich alerts or tickets, and reduce manual operational toil
- Experience ensuring operational data integrity across ticketing systems, incident records, and dashboards to support trend analysis, high-quality RCAs, and executive reporting
- Strong communication skills with the ability to work cross-functionally with IOC leadership, engineering, and operations teams, including mentoring less-experienced analysts
- Strong working experience with IOC/NOC tooling, including ITSM/ticketing systems (e.g., ServiceNow, Jira) and monitoring platforms (e.g., Splunk, Datadog)
- Experience producing operational reports, incident summaries, and shift handoff documentation for IOC leadership and stakeholders
- Familiarity with RCA workflows, including ensuring incident records, timelines, and artifacts are complete and accurate

Other important requirements
- This role operates in a 24×7 IOC/NOC environment and works 12-hour rotating shifts on a 4-days-on / 3-days-off schedule, alternating with 3-days-on / 4-days-off
- Pre-employment screening, including background check and substance testing, may be required according to company policies

Responsibilities
- Perform advanced Level 2 incident analysis by reviewing incident data, system behavior, and operational signals across GPU clusters, networks, and facilities to identify recurring issues, improve triage accuracy, and support faster and more effective escalation
- Maintain IOC service health dashboards and operational metrics that reflect alert effectiveness, incident response performance (e.g., MTTD/MTTR), and customer impact for use in day-to-day operations and leadership reporting
- Identify alerting and monitoring gaps, under-monitored systems, and noisy or ineffective alerts; perform day-to-day tuning of thresholds, routing, suppression, and enrichment within IOC tooling, and partner with engineering teams when instrumentation changes are required
- Own operational alert quality outcomes by ensuring sustained reductions in false positives, missed detections, poor routing, and alert fatigue through IOC-approved standards, validation, and continuous review of alert performance
- Analyze GPU health and performance signals (errors, degradation, failure indicators) during incidents to support faster triage, improve escalation quality, and reduce customer impact in GPU-based environments
- Validate and oversee automated detection and correlation outputs, ensuring alerts, anomalies, and insights are accurate, actionable, and aligned with operational reality
- Implement and maintain IOC-level automation (e.g., alert routing rules, enrichment fields, ticket templates, runbook scripts) to standardize response and reduce manual toil during incidents
- Ensure ITSM incident and ticket records meet IOC quality standards by validating timelines, categorizations, ownership, and resolution notes; support RCA workflows by providing complete operational inputs and tracking monitoring follow-ups
- Provide peer coaching and onboarding support to Analyst I team members on triage patterns, alert interpretation, dashboard usage, and IOC runbooks; contribute to and maintain operational documentation
- Support IOC shift operations through detailed incident handoffs, queue hygiene, and coordination with on-call engineering and facilities teams during escalations

At IREN, we offer a comprehensive, market-competitive total rewards package designed to support employees' well-being, career advancement, and financial wealth. Our offerings reflect our commitment to Proceed with Purpose while rewarding high performance and long-term growth.

Compensation
- Actual compensation will be determined based on factors such as experience and qualifications.
- Overtime compensation for non-exempt workers for hours worked over 40 per week

Health & Wellness
- 100% company-paid health insurance premiums (medical, dental, and vision) for employees, 75% company-paid coverage for dependents
- Company-paid short-term and long-term disability insurance
- Voluntary life, critical illness, and accident coverage available
- Health Savings Accounts (HSA) - when combined with the High-Deductible Health Plan
- Employee Assistance Program and wellness resources

Retirement & Financial Wealth
- 401(k) retirement plan with company match
- Paid professional development and access to financial planning and legal services

Time Off & Leave Programs
- Paid Time Off (PTO) and paid holidays

Growth & Development
- Professional development to support certifications, continuing education, or role-related training

Community & Culture
- Company events and team-building activities

We value diverse perspectives and believe that skills can be developed. If you're passionate about this role, we want to hear from you, whether you meet every criterion or not. Your unique experiences might be exactly what we need!

IE US Operations Inc., the employing entity and proud member of the IREN group, is an equal opportunity employer that is committed to creating an inclusive workplace. We are committed to evaluating qualified applicants and do not discriminate against protected characteristics under applicable legislation.

We participate in E-Verify and will provide the federal government with your Form I-9 information to confirm that you are authorized to work in the U.S. (see the E-Verify Participation Notice).

By applying for this position and submitting your resume and application materials, you consent to the processing of your personal information in accordance with our Job Applicant Privacy Statement available on our website at www.iren.com.
