Cloud Engineer - Observability & SRE

GDH

Remote

$137,280 Salary, Full-Time

Posted 3 days ago (Updated 19 hours ago) • Actively hiring

Expires 6/12/2026

Job Description

Role Summary
A senior Cloud Engineer with expertise in building and managing scalable observability and infrastructure platforms for enterprise-level cloud microservices environments. This hybrid role demands hands-on experience with container orchestration, cloud infrastructure automation, and high-volume monitoring systems. The engineer will own end-to-end components, support production operations, and leverage AI tools for system troubleshooting and code generation.

Responsibilities
Design, develop, and operate observability platforms enabling logging, metrics collection, and tracing for cloud-based microservices applications.
Manage and optimize large-scale Kubernetes clusters across multiple regions, including Helm chart management, pod scheduling, and resource tuning.
Own and maintain CI/CD pipelines using tools such as Argo CD, Helm, and GitOps methodologies to ensure reliable deployment workflows.
Implement Infrastructure as Code (IaC) solutions utilizing Terraform on AWS to provision and manage cloud infrastructure at scale.
Operate and maintain monitoring ecosystems including OpenSearch/Elasticsearch, Prometheus, Grafana, Splunk, and Kafka, ensuring high availability and performance.
Develop automation solutions to detect, respond to, and remediate production issues proactively.
Ensure security and compliance by managing vulnerability patching and automating security best practices in container environments.
Collaborate with cross-functional teams to improve system reliability, scalability, and performance, contributing to distributed system design.
Participate in on-call rotations, incident response, and post-incident analysis to uphold SLA commitments.
Utilize AI-assisted coding and troubleshooting tools to accelerate system development, automation, and incident resolution.

Qualifications
Bachelor's degree in Computer Science, Information Technology, or related field.
Minimum of 8 years of experience in DevOps, SRE, or platform engineering roles supporting production cloud environments.
Proven incident response experience, including alert triage, root cause analysis, and SLA management in 24/7 operations.
Expertise in Infrastructure as Code principles with proficiency in Terraform, Ansible, or similar automation tools for cloud provisioning.
Strong scripting skills in Python, Golang, or Bash for automation, tooling, and CI/CD pipeline integration.
Extensive experience operating and troubleshooting large-scale Kubernetes workloads, including Helm chart management and multi-cluster orchestration.
Hands-on knowledge of observability stacks such as OpenSearch, Prometheus, Grafana, Loki, and Splunk, including query optimization and capacity planning.
Familiarity with Kafka and AWS MSK, including cluster operation, topic configuration, and schema management.
Experience deploying, managing, and migrating Splunk Enterprise environments with Kubernetes-based log shipping architectures.
Working knowledge of OpenTelemetry, distributed tracing, and application performance monitoring in cloud environments.
Understanding of security frameworks, container hardening practices, and vulnerability remediation at scale, including standards such as FedRAMP, STIG, IL5, ISO 27001, and SOC 2.
Experience using AI tools like LLMs, GitHub Copilot, or custom AI agents to enhance operational workflows and incident management.
Effective communication skills and the ability to work independently in a hybrid work setting.
Publishing Pay Range: $65.00 - $67.00 hourly

This position offers a hybrid schedule, with time split between the office and remote work.
