Site Reliability Engineer

Hammerhead AI

Redwood City, CA (In Person)

Full-Time

Posted 4 weeks ago (Updated 4 weeks ago) • Actively hiring

Expires 5/28/2026

Job Description

About Hammerhead

We're unleashing AI with intelligent orchestration while addressing one of the most pressing bottlenecks for AI: access to power. Our cutting-edge platform optimizes data center power infrastructure to maximize AI workload throughput within existing electrical limits, without requiring new power plants or grid expansions. It uses reinforcement learning to intelligently orchestrate power, cooling, and compute in real time, enabling data centers to run significantly more AI workloads within their existing electrical and thermal limits. Our team previously optimized over 8 gigawatts of mission-critical power globally at AutoGrid. At Hammerhead, we're addressing a $64 billion-per-year market opportunity while dramatically reducing the environmental footprint of AI infrastructure.

At Hammerhead, you will:

Work at the intersection of AI, energy, and compute to help build the next generation of AI infrastructure
Collaborate with colleagues who are experts in modern RL and AI, IoT and IIoT software, and infrastructure technologies
Contribute to building a more efficient and sustainable future for AI compute
Join a company at the cutting edge of modern data center design and operation
Receive competitive compensation, equity, and benefits in a high-growth, mission-driven environment
Learn from an experienced team that has built and sold startups before

About the Role

We are seeking a Site Reliability Engineer to own the reliability, scalability, and operational excellence of Hammerhead's AI-driven power orchestration platform. Our software runs in production data centers around the world, where real-time decisions directly affect gigawatts of compute infrastructure. Availability, latency, and correctness are not negotiable.

You will work at the boundary between software and infrastructure, building the systems that deploy, monitor, and protect Hammerhead's platform in production. You will partner with engineering teams to establish SLOs, automate toil, accelerate releases, and ensure that when things go wrong, we know fast and recover faster. This is a foundational SRE role.
You will be the first dedicated hire in this function. You will set the standard for how Hammerhead runs software in production. You will report to the Head of Engineering.

Key Responsibilities

Own production reliability for Hammerhead's platform: define and enforce SLOs, SLAs, and error budgets across services, and drive resolution when they are breached.
Build and maintain the observability stack: metrics, logging, distributed tracing, and alerting across cloud and on-prem deployment environments.
Architect and manage CI/CD pipelines that enable fast, safe, and repeatable deployments to production data center environments.
Automate operational toil: provisioning, configuration management, scaling, failover, and incident response workflows.
Lead incident response: act as the primary on-call escalation, run blameless post-mortems, and drive systemic fixes that prevent recurrence.
Partner with software and RL engineers to bake reliability into the development lifecycle: code reviews, deployment checklists, chaos testing, and load testing.
Manage and evolve Hammerhead's cloud infrastructure (primarily AWS) and edge deployment infrastructure at customer data center sites.
Establish security and compliance practices for production environments: secrets management, access controls, audit logging, and vulnerability remediation.
Evaluate and introduce tooling that improves platform velocity and reliability, from container orchestration to infrastructure-as-code to incident management platforms.

Qualifications

Required

4+ years of experience in site reliability engineering, DevOps, or platform/infrastructure engineering in production environments.
Deep proficiency with Kubernetes and container orchestration in production, including cluster management, resource limits, autoscaling, and network policies.
Strong infrastructure-as-code experience with Terraform, Pulumi, or equivalent. You manage cloud resources in code, not consoles.
Hands-on experience with observability tools (Prometheus, Grafana, Datadog, OpenTelemetry, or equivalent) and building alerting that is actionable, not noisy.
Experience using Claude Code (or similar) to develop and maintain secure, compliant, and automated infrastructure-as-code (IaC) workflows.
Expertise in Python for writing automation scripts, internal tooling, and operational runbooks.
Experience managing CI/CD pipelines (GitHub Actions, ArgoCD, or equivalent) and deployment strategies (blue/green, canary, rollback).
Strong incident response instincts: you stay calm under pressure, communicate clearly during outages, and follow through on fixes after the fact.
Comfortable working in environments with strict operational requirements: uptime SLAs, customer-facing commitments, and regulated or critical infrastructure.

Preferred

Experience deploying or operating software in industrial, energy, or data center environments, especially hybrid cloud/on-prem topologies.
Familiarity with ML/AI system operations: managing model serving infrastructure, GPU workload scheduling, or real-time inference pipelines.
Experience with advanced Kubernetes networking and security primitives.
Background in chaos engineering, game day exercises, or formal reliability testing frameworks.
Prior experience as the first or founding SRE at an early-stage company.

What We Offer

Competitive base salary + meaningful equity in a high-growth, well-funded company
Comprehensive health, dental, and vision insurance
The opportunity to build Hammerhead's reliability function from the ground up and own it
A collaborative, low-ego team of world-class engineers and researchers solving genuinely hard problems
Work that matters: our platform reduces the energy footprint of AI at scale

Hammerhead is an equal opportunity employer. We are committed to creating an inclusive environment for all employees and encourage applications from candidates of all backgrounds, experiences, and perspectives. We provide reasonable accommodations for individuals with disabilities throughout the hiring process.

