
Site Reliability Engineer - SRE

Open Practice Solutions

Hudson, OH (In Person)

$85,000 Salary, Full-Time

Posted 6 weeks ago (Updated 3 weeks ago) • Actively hiring

Expires 7/3/2026

Job Description

Site Reliability Engineer (SRE)

Open Practice Solutions (4.0 out of 5 stars, 5 reviews)
Hudson, OH 44236
From $85,000 a year, Full-time

Location: On-site / Hybrid
Employment Type: Full-time
Experience Level: Mid-Senior

About the Role

We're looking for a Site Reliability Engineer to help operate and scale a multi-tenant, web-based application running on AWS. This is a hands-on role for someone who's comfortable jumping into an already-established architecture, making incremental improvements, and solving real production problems.

You'll work closely with engineering and product teams to keep our platform reliable, performant, and scalable as customer usage grows. This is not a "greenfield rewrite" role: we need someone scrappy, practical, and effective inside real-world constraints.

What You'll Do

- Ensure the reliability, availability, and performance of a multi-tenant production system
- Scale and operate AWS-based infrastructure supporting a Java web application
- Monitor and troubleshoot issues across application, database, cache, and data warehouse layers
- Improve observability through metrics, logging, and alerting
- Participate in on-call rotations and lead incident response and root cause analysis
- Identify performance bottlenecks and scaling limits in a shared-tenant environment
- Automate operational tasks and reduce toil where it matters most
- Work within existing frameworks and tooling to make systems safer and more scalable
- Partner with developers to improve deployments, capacity planning, and failure handling
- Implement automated load and fuzz testing
- Define key service level objectives (SLOs)

Technologies You'll Work With

- AWS (EC2, ECS, RDS, ElastiCache, Redshift, and related services)
- Java-based web applications
- MySQL (performance tuning, scaling, reliability)
- Amazon ElastiCache (Redis/Memcached)
- Amazon Redshift
- Monitoring and alerting tools (Graphite, Grafana, CloudWatch)

What You Bring

- 3+ years of experience in SRE, DevOps, or production operations roles
- Strong understanding of AWS infrastructure and cloud-native scaling patterns
- Experience supporting Java applications in production
- Solid knowledge of MySQL performance, replication, and scaling strategies
- Experience operating cache layers and data stores at scale
- Understanding of multi-tenant architectures, including isolation, noisy-neighbor issues, and capacity planning
- Strong Linux fundamentals and troubleshooting skills
- Ability to stay calm, think clearly, and prioritize during incidents
- A "get-things-done" mindset: pragmatic, resourceful, and comfortable with imperfect systems

Nice to Have

- Experience scaling multi-tenant SaaS platforms
- Familiarity with Redshift performance tuning and data workflows
- Infrastructure-as-code experience (Terraform)
- CI/CD and GitLab pipeline experience
- Prior ownership of on-call rotations and incident processes
- Experience improving reliability without large architectural rewrites

What We Value

- Engineers who work within reality, not just ideal architectures
- Incremental improvements that reduce risk and improve uptime
- Clear communication during incidents
- Ownership, accountability, and practical problem-solving

Pay: From $85,000.00 per year

Application Question(s):

- Describe a production outage you personally worked on, including how it was detected, how you mitigated it, and what permanent changes you implemented to prevent recurrence.
- Explain how you would design a highly available architecture in Amazon Web Services for a Java-based, multi-tenant SaaS application using MySQL and Redis.
- How would you prevent a single tenant from exhausting shared database or cache resources in a multi-tenant SaaS platform?
- Tell us about a time you inherited a fragile production system and what you prioritized first to improve reliability.
- Your API latency increases from 80ms to 900ms across all tenants while CPU utilization remains normal. What are the first steps you take to investigate?
- We are only considering candidates currently based in Northeast Ohio for this role (relocation is not being considered at this time). Are you currently located in Northeast Ohio?

Work Location: In person