SITE RELIABILITY ENGINEER

Job

United States Cold Storage Inc

Remote

Full-Time

Posted 6 weeks ago (Updated 3 weeks ago) • Actively hiring

Expires 6/11/2026

Job Description

Site Reliability Engineer (SRE)

Engineer Reliability into the Systems That Move the Nation's Food Supply

Who We Are

US Cold owns and operates one of the most complex temperature-controlled logistics networks in North America. Every day, our systems coordinate the storage and movement of food at national scale across a network of state-of-the-art distribution centers, including multiple highly automated warehouse facilities. We continue to advance our core warehouse and logistics platforms. Our current focus is on modular, event-driven, API-first and cloud architectures. We continue to enhance reliability and accelerate engineering productivity by strengthening our SRE and AI practices. This is a large investment in innovation to continue to drive operational excellence at our facilities. If you want to build durable systems that operate in the physical world at scale, this is that opportunity.

The Role

The Site Reliability Engineer is a founding member of US Cold's SRE practice. This role exists to move the organization from reactive operations to engineered reliability. You will study how our most critical systems fail, particularly our Phenix WMS and facility automation interfaces, and design controls, automation, and observability that reduce incidents over time. Success in this role means fewer false alerts, faster recovery, less manual intervention, and systems that heal themselves when possible. You will work closely with application, infrastructure, and operations teams and participate directly in on-call and incident response.

What You Will Own

Reliability of the Phenix WMS and its integration with facility automation systems (robotics, conveyors, and control interfaces)
Definition and implementation of SLIs and SLOs that measure meaningful system health, not just availability
Observability across the full stack, correlating cloud services, APIs, and on-premise facility operations
Automation to eliminate operational toil, including patching, data corrections, restarts, and recovery tasks
Development of self-healing behaviors for common failure modes
Participation in on-call rotations and leadership of blameless post-incident reviews
Design and execution of disaster recovery tests across SaaS, cloud, and on-premise environments

This is hands-on reliability engineering. The systems you improve will directly impact daily warehouse operations.

Technical Environment

Hybrid environments spanning cloud and on-premise infrastructure
Azure cloud services
Warehouse Management Systems (Phenix WMS) and facility automation interfaces
Observability tooling across logs, metrics, and alerting
Automation using Python, PowerShell, Bash, or Ansible
CI/CD tools and modern deployment practices
Exposure to containerized and distributed systems environments

What We're Looking For

3+ years of experience in SRE, DevOps, Systems Engineering, or related roles
Strong Linux and Windows systems administration and troubleshooting skills
Hands-on experience with automation and scripting
Experience designing and operating monitoring, alerting, and observability solutions
Practical experience working in Azure environments
Strong analytical skills and a bias toward eliminating root causes, not symptoms
Ability to collaborate across application, infrastructure, and operations teams
Experience supporting warehouse management systems or industrial automation platforms
Exposure to Kubernetes, microservices, or container orchestration
Familiarity with infrastructure-as-code tools such as Terraform or Ansible
Understanding of distributed systems and high-availability design
Experience with SRE practices such as SLO-based operations, runbook automation, or chaos testing

Why This Role Is Different

This is not an inherited SRE function. There is no mature framework to maintain.

You will:

Help define what reliability means at US Cold
Work on systems that operate in the physical world
Engineer solutions that reduce toil and operational load
See the direct impact of your work on warehouse uptime and performance
Build practices that scale as the platform modernizes

This is an opportunity to grow as an SRE while helping establish the reliability foundation of a mission-critical platform.

Operational Context

Systems operate continuously across warehouse facilities
Reliability failures have physical and operational consequences
On-call participation is part of the role
Work occurs across cloud, SaaS, and on-premise environments

United States Cold Storage Inc, Camden, NJ 08103 (hybrid work). Rated 3.3 out of 5 stars across 554 reviews.