Site Reliability Engineer (SRE) Position Available In Orange, Florida
Tallo's Job Summary: We are seeking a Site Reliability Engineer (SRE) to join our team, responsible for building scalable, reliable systems and driving observability and service performance improvements. Key responsibilities include designing and implementing highly available systems, monitoring DORA metrics, and collaborating cross-functionally. Basic qualifications include 5+ years of software engineering experience and proficiency in Python and AWS core services. Preferred qualifications include hands-on experience with SLAs/SLOs/SLIs and exposure to real-time data infrastructure.
Job Description
Site Reliability Engineer (SRE)
About the Role
We’re looking for a Site Reliability Engineer (SRE) to join our engineering team and help build scalable, reliable systems while driving observability and service performance improvements. You’ll collaborate closely with software engineers, data scientists, and DevOps practitioners to enhance service reliability and efficiency across the platform. Your work will play a key role in improving SLAs, tracking SLOs/SLIs, and driving DORA metric outcomes.
Key Responsibilities
Design and implement highly available, low-latency, and observable systems and infrastructure components.
Build tools and dashboards for visualization, tracing, and optimization to enhance system reliability and performance.
Monitor and drive improvements across DORA metrics (deployment frequency, lead time for changes, mean time to recovery, and change failure rate).
Establish and maintain SLA, SLO, and SLI definitions and processes in collaboration with service owners.
Participate in on-call rotations and lead incident response processes with a focus on continuous improvement and postmortems.
Collaborate cross-functionally to identify system bottlenecks and propose architecture or code-level changes.
Champion DevOps practices such as CI/CD, automated testing, and infrastructure-as-code.
Review pull requests and offer guidance to uphold high standards in code quality and reliability.
Basic Qualifications
5+ years of software engineering experience, ideally in reliability-focused or DevOps-heavy environments.
Strong coding skills in Python and at least one strongly typed language (e.g., TypeScript, Java).
Proficiency with AWS core services (e.g., IAM, S3, Lambda, Kinesis, SNS).
Experience with observability tools (e.g., OpenTelemetry, Datadog, Prometheus, Grafana, Honeycomb).
Practical knowledge of CI/CD pipelines, Docker, and system automation.
Familiarity with infrastructure-as-code tools like Terraform, AWS CDK, or CloudFormation.
Working knowledge of distributed systems and trade-offs across SQL/NoSQL storage solutions.
Preferred Qualifications
Hands-on experience implementing and tracking SLAs/SLOs/SLIs.
Familiarity with performance profiling, distributed tracing, and root cause analysis.
Experience implementing practices that improve DORA metrics.
Exposure to real-time data infrastructure or event-driven architecture.
Prior participation in an on-call rotation or incident management lifecycle.
Who You Are
You’re passionate about building stable, efficient, and observable systems.
You’re proactive in identifying reliability risks and driving solutions.
You balance engineering excellence with pragmatic, operational solutions.
Dice Id: ceiam
Position Id: 28269