Senior Engineer, Site Reliability

Job

Royal Caribbean Group

Miramar, FL (In Person)

Full-Time

Posted 1 week ago (Updated 2 days ago) • Actively hiring

Expires 7/22/2026

Job Description

Benefits:
Career development plan
Qualifications:
Containerization systems, Machine learning/AI-based analysis, Cloud Logging, Telecommunications network performance monitoring, Tooling, Infrastructure as Code (IaC), IT system monitoring, Incident management software, Application deployment, Metrics, Reporting, AI platforms (beyond public GPTs), Bash, Key Performance Indicators, SRE, Incident management operations support, Splunk, SDKs, BGP, WAN, Cloud Native Design, SOAR platforms, ServiceNow, Azure Kubernetes Service (AKS), System deployment, Cloud automation, DevOps automation, AI-driven automation, Cloud monitoring, Python, Amazon EKS
Full Job Description:
Journey with us! Combine your career goals and sense of adventure by joining our exciting team of employees. Royal Caribbean Group is pleased to offer a competitive compensation and benefits package and excellent career development opportunities, each offering unique ways to explore the world.

We are proud to be the vacation-industry leader with global brands (including Royal Caribbean International, Celebrity Cruises, and Silversea Cruises), the most innovative fleet and private destinations, and the best people. Together we are dedicated to turning the vacation of a lifetime into a lifetime of vacations for our guests.

The Royal Caribbean Group's Site Reliability Team has an exciting career opportunity for a full-time Senior Engineer, Site Reliability, reporting to the Senior Manager, Site Reliability. This position is onsite and based in Miramar, Florida. This position is not eligible for work authorization sponsorship.
Position Summary:
We are seeking a highly skilled Senior Site Reliability Engineer to own, operate, and continuously mature our enterprise observability platform across one of the most complex hospitality and maritime technology environments in the world. This role is the engineering backbone of RCG's observability practice, responsible for ensuring deep, reliable system visibility across 950+ applications serving 100,000+ users across Royal Caribbean International, Celebrity Cruises, and Silversea. You will operate at the intersection of infrastructure, application performance, network intelligence, and AIOps, driving measurable improvements in mean-time-to-detect (MTTD), mean-time-to-resolve (MTTR), and overall service reliability. This is a platform engineering and standards leadership role, not a tool administration position.
Key Responsibilities:
Platform Ownership & Architecture:
Own and evolve the enterprise observability platform spanning Cisco AppDynamics, Splunk, ThousandEyes, and PagerDuty AIOps across AWS and Azure environments.
Architect and enforce a unified telemetry strategy (metrics, logs, traces, and events) standardized via OpenTelemetry across all application tiers.
Design and govern telemetry data pipelines, including ingestion, filtering, routing, and retention, to optimize signal quality and platform cost at enterprise scale.
Drive full-stack observability coverage across ship and shore environments, including maritime network paths, contact center platforms, and revenue-critical booking systems.

SLIs, SLOs & Reliability Engineering:
Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for all critical services across RCG's three brands.
Build alerting frameworks that minimize noise, surface actionable signals, and integrate cleanly with PagerDuty AIOps on-call workflows.
Partner with SRE teams to drive MTTR reduction, post-incident observability improvements, and proactive reliability practices.
Instrument and publish DORA metrics (Deployment Frequency, Lead Time, Change Failure Rate, MTTR) to support engineering productivity and release confidence.

AIOps & Intelligent Detection:
Drive AI-assisted incident detection, anomaly correlation, and root cause analysis using PagerDuty AIOps and Splunk IT Service Intelligence (ITSI).
Tune and mature ML-based alert grouping and noise suppression models to reduce alert fatigue and accelerate triage.
Integrate observability signals with ServiceNow ITSM for automated incident creation, enrichment, and closed-loop resolution workflows.

Kubernetes & Cloud-Native Observability:
Enable and govern Kubernetes observability for EKS and AKS workloads: container health, resource utilization, pod-level tracing, and cluster performance.
Integrate observability instrumentation into CI/CD pipelines (GitHub Actions) to enable deployment-correlated performance analysis.
Maintain and extend AWS CloudWatch and Azure Monitor integrations to ensure cloud infrastructure is fully represented in the observability estate.

Standards, Enablement & Technical Leadership:
Define observability standards, instrumentation best practices, and onboarding frameworks for product and platform engineering teams.
Mentor junior engineers and serve as the technical authority for observability discipline across SRE and Platform Engineering.
Lead post-incident reviews (PIRs) and translate findings into observability platform improvements.
Govern observability cost optimization: telemetry volume management, retention tiering, and platform licensing efficiency.

Required Qualifications:
6-9+ years in Observability, SRE, or Platform Engineering in enterprise-scale environments.
Deep hands-on expertise with Cisco AppDynamics: APM configuration, business transaction mapping, code-level diagnostics, and baseline management.
Strong proficiency with Splunk: SPL query development, ITSI service health trees, KPI configuration, alert policy management, and log pipeline design.
Experience with Cisco ThousandEyes for network path monitoring, ISP/WAN intelligence, and BGP-level visibility.
Proficiency with PagerDuty AIOps: intelligent alert grouping, noise suppression, event orchestration, and on-call workflow design.
Strong command of OpenTelemetry: collector configuration, SDK instrumentation, semantic conventions, and multi-backend exporting.
Hands-on Kubernetes experience (EKS/AKS): container observability, resource metrics, and pod-level distributed tracing.
Experience with AWS CloudWatch and/or Azure Monitor for cloud infrastructure observability.
Scripting and automation proficiency: Python, Bash, Terraform, and/or Ansible for observability tooling deployment and configuration.
Experience defining SLIs/SLOs, error budgets, and actionable alerting strategies tied to business service reliability.
ServiceNow ITSM integration experience: event management, incident auto-creation, and CMDB-enriched alerting.
Experience with CI/CD observability integration (GitHub Actions or equivalent).

Preferred Qualifications:
Experience with Prometheus, Grafana, Loki, or Tempo for supplemental or hybrid observability architectures.
Familiarity with eBPF-based observability tooling (e.g., Pixie, Cilium) for deep kernel-level and network-layer visibility.
Experience with synthetic monitoring and real user monitoring (RUM) to capture end-user experience across digital channels.
Familiarity with Cribl or equivalent telemetry pipeline tooling for data routing, enrichment, and cost governance.
Exposure to DORA metrics instrumentation and developer experience observability frameworks.
Experience in large-scale hospitality, travel, maritime, or consumer digital platforms.
Certifications:
Cisco AppDynamics Certified Associate, Splunk Core Certified Power User, AWS Solutions Architect, Kubernetes (CKA/CKAD), or OpenTelemetry Certified Associate (OTCA/CNCF)
Agency and Third-Party Submissions:
Please note this is a direct search by the Company, and applications through agencies and other third parties will not be accepted, nor will fees be paid for unsolicited resumes. Any unsolicited resumes will be considered the Company's property.

We know there's a lot to consider. As you go through the application process, our recruiters will be glad to provide guidance and more relevant details to answer any additional questions. Thank you again for your interest in Royal Caribbean Group. We'll hope to see you onboard soon!

It is the policy of the Company to ensure equal employment and promotion opportunity to qualified candidates without discrimination or harassment on the basis of race, color, religion, sex, age, national origin, disability, sexual orientation, sexuality, gender identity or expression, marital status, or any other characteristic protected by law. Royal Caribbean Group and each of its subsidiaries prohibit and will not tolerate discrimination or harassment.
Nearest Major Market:
Miami