Platform Engineer - Reliability

Squarepoint

Houston, TX (In Person)

Full-Time

Posted 4 days ago (Updated 1 day ago) • Actively hiring

Expires 6/25/2026

Job Description

Platform Engineer - Reliability at Squarepoint in Houston, Texas. Posted 1 day ago.

Type: full-time

Role Overview:

As a Platform Reliability Specialist at Squarepoint, you will play a critical role in ensuring the stability, performance, and day-to-day reliability of the shared platform services. You will work with a diverse group of stakeholders, including developers, researchers, and infrastructure teams, to maintain highly reliable systems and drive proactive improvements. You will be responsible for reducing operational toil, improving the response to production issues and the learning from them, and evolving our reliability practices. This role blends software engineering, platform ownership, operational ownership, and long-term architectural thinking to enhance our production systems. While you may have deep expertise in one or more areas, you will contribute across the platform.

Key areas include:

Operations & Toil Reduction:

Own and improve day-to-day platform operations by streamlining workflows and enhancing on-call ergonomics through better automation and runbooks.

Reliability Engineering & Hardening:

Work with service owners to apply engineering principles that improve resilience and performance, and harden critical services against degradation and outages.

Tooling & Automation:

Build and maintain platform tools, automation, and GitOps workflows that make it easy for teams to deploy, operate, and observe their services with minimal friction and operational overhead.

Knowledge & Standards:

Capture and share reliability knowledge through documentation, runbooks, and post-incident reviews. Help define and evolve reliability standards and best practices across the platform.

Required qualifications:

4+ years in SRE, Production Engineering, or Reliability Engineering roles with direct ownership of production systems.

Experience with system administration and troubleshooting (Linux, Bash, containers).

Software development experience with Python, version control (Git), and CI/CD systems.

Hands-on experience with observability systems including metrics, tracing, log pipelines, and alert design.

Demonstrated experience running systems at scale, including performance tuning, HA/DR architectures, and resilience engineering.

Nice to have:

Expertise in a modern observability stack (e.g., Prometheus, Grafana, ELK, VictoriaMetrics).

Experience operating enterprise platform software such as Kubernetes clusters, GitLab at scale, or Slurm environments.

Familiarity with messaging systems (Kafka/RabbitMQ), service discovery (Consul), and databases (PostgreSQL, ClickHouse, Redis).

Experience authoring runbooks, running failure/chaos experiments, and participating in DR exercises.

Infrastructure automation and configuration management experience (e.g., Ansible, Terraform, Puppet).