Senior Site Reliability Engineer, Crew IT Position Available In DeKalb, Georgia
Tallo's Job Summary: The Senior Site Reliability Engineer at Crew IT will optimize Delta Software Solutions by implementing SRE tools and processes. Responsibilities include automating tasks, troubleshooting incidents, and mentoring team members. The role requires 5+ years of related experience, proficiency in scripting languages, and knowledge of networking protocols. Delta Air Lines offers competitive salary, industry-leading profit sharing, and comprehensive benefits.
Job Description
How you’ll help us Keep Climbing (overview & key responsibilities)

At Delta Air Lines, connection is at the heart of everything we do and guides our every action. We strive to welcome and care for all of our customers during their travels with us and aim to deliver an elevated experience. Delta is focused on sustaining a strong IT operation, growing our capabilities, and maximizing optimization across each of our tech hubs to elevate the travel experience for our customers and empower our 90,000 Delta people. We’re committed to fostering innovation, and we’re excited to invite you to be part of our journey as we shape the future of technology at the world’s best airline!

The Senior Site Reliability Engineer works to improve the reliability and resiliency of Delta Software Solutions to meet business requirements by implementing SRE tools, processes, and standard methodologies. SRE is what happens when you ask a software engineer to design an operations function. The Senior Site Reliability Engineer designs, develops, tests, debugs, and automates tasks for applications. They troubleshoot incidents to address failure patterns, automate remediation through runbooks, and document application optimization.

Responsibilities
- Supporting a reliable application suite for the environment in order to meet the development and maintenance requirements of systems/platforms.
- Working as part of the development team to evaluate the health, stability, and reliability of applications.
- Utilizing monitoring, alerts, dashboards, and management tools to ensure the availability, reliability, and performance of applications and services.
- Constantly working to improve and implement automation of application tasks.
- Providing technical support for systems/platforms according to application SLAs.
- Developing resiliency in the application code, troubleshooting incidents, engaging with squads to address failure patterns, and participating in incident management.
- Leading and mentoring junior team members and software engineers to enhance our SRE practice.

Benefits and Perks to Help You Keep Climbing

Our culture is rooted in a shared dedication to living our values – Care, Integrity, Resilience and Servant Leadership – every day, in everything we do. At Delta, our people are our success. At the heart of what we offer is our focus on Sharing Success with Delta employees. Exploring a career at Delta gives you a chance to see the world while earning great compensation and benefits to help you keep climbing along the way:
- Competitive salary, industry-leading profit sharing program, and performance incentives
- 401(k) with generous company contributions up to 9%
- New hires are eligible for up to 2 weeks of vacation. This is earned for use in the following vacation year (April 1 – March 31)
- In addition to vacation, new hires are eligible for up to 56 hours of paid personal time within a 12-month period
- 10 paid holidays per calendar year
- Birthing parents are eligible for 12 weeks of paid maternity/parental leave
- Non-birthing parents are eligible for 2 weeks of paid parental leave
- Comprehensive health benefits including medical, dental, vision, short/long term disability and life insurance benefits
- Family care assistance through fertility support, surrogacy and adoption assistance, lactation support, subsidized back-up care, and programs that help with loved ones in all stages
- Holistic Wellbeing programs to support physical, emotional, social, and financial health, including access to an employee assistance program offering support for you and anyone in your household, free financial coaching, and extensive resources supporting mental health
- Domestic and International space-available flight privileges for employees and eligible family members
- Career development programs to achieve your long-term career goals
- World-wide partnerships to engage in community service and innovative goals created to focus on sustainability and reducing our carbon footprint
- Business Resource Groups created to connect employees with common interests to promote inclusion, provide perspective and help implement strategies
- Recognition rewards and awards through the platform Unstoppable Together
- Access to over 500 discounts, specialty savings and voluntary benefits through Deltaperks such as car and hotel rentals and auto, home, and pet insurance, legal services, and childcare

What you need to succeed (minimum qualifications)
- 5 or more years of hands-on experience as a Site Reliability Engineer or in a related technical engineering capacity.
- Experience handling large numbers of diverse systems with configuration management systems like Puppet, Chef, or Ansible.
- Experience with developing and maintaining tools, dashboards and scripts to monitor application functions across a wide array of systems to detect and resolve issues with an aim towards maintaining optimal conditions for system applications.
- Knowledge of software engineering; ability to deliver new or enhanced fee-based software products.
- Proficient in one or more of the following scripting languages: JavaScript, Node.js, Python, Ansible, Bash, etc.
- Strong documentation skills, with the ability to create and maintain clear, concise, and actionable technical documentation, including runbooks, incident reports, architectural diagrams, and operational procedures. Commitment to documentation as a first-class engineering practice.
- Strong experience with monitoring and alerting systems like Prometheus, Grafana, Datadog and PagerDuty.
- Knowledge of agile methodologies and the agile development lifecycle; ability to use formal agile methodologies, disciplines, practices and techniques for the delivery of new and enhanced applications.
- Knowledge of concepts, values and tools applied in building Continuous Integration (CI), Continuous Delivery and Continuous Deployment (CD) pipelines; ability to design, build, implement and maintain CI/CD pipelines to automate the software delivery process.
- Experience engineering software within an Amazon Web Services (AWS) cloud infrastructure or other prominent enterprise cloud provider.
- Experience in containerized workloads and management platforms such as Docker or Kubernetes.
- Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI Model, Subnetting and Load Balancing strategies.
- Knowledge of the theories and methodologies of reliability engineering; ability to design, develop and support various tools, services and applications to maintain a reliable site environment.
- Embraces a diverse set of people, thinking and styles.
- Consistently makes safety and security, of self and others, the priority.
- High School diploma, GED or High School Equivalency.

What will give you a competitive edge (preferred qualifications)
- Experience using tools and services such as AWS CloudWatch and Dynatrace.
- Experience with Change, Incident, Problem and Configuration Management tools such as PagerDuty and ServiceNow.
- Experience with IaC tools such as CFT, CDK or Terraform.
- Experience with reliability engineering practices, including incident response practices, capacity planning and SLA tracking.
- Bachelor’s degree in Computer Science, Information Systems or related technical field.
- Experience working in an airline technology environment.