INFERENCE ENGINEER

MakerMaker

San Francisco, CA (In Person)

Full-Time

Posted 5 weeks ago (Updated 1 week ago) • Actively hiring

Expires 7/16/2026

Job Description

Qualifications: Software coding, Machine learning, Distributed systems, Systems engineering, Distributed computing, Python, System performance monitoring

ABOUT THE COMPANY

We're building autonomous research agents for recursive self-improvement: multi-agent systems that propose, run, and analyze machine learning experiments. We're a small team based in San Francisco, working on-site.

ABOUT THE ROLE

You build and operate the inference systems that serve our models in production. The work spans serving infrastructure, runtime optimization, and the long tail of production infrastructure work that comes with running real workloads. This is an engineering role, not a research role. You'll measure, profile, debug, and ship. You'll work alongside researchers, but your job is to make their work fast and reliable in production. Real ownership, real autonomy.

WHAT YOU'LL DO

- Build, operate, and harden production inference systems serving large models at high throughput
- Own the performance characteristics of those systems end-to-end: throughput, latency, cost-per-token, reliability under load
- Profile real workloads to identify bottlenecks; ship fixes that move the metric you set out to improve
- Implement and integrate inference optimizations from the research team (quantization, custom kernels, scheduling improvements, memory management) into production
- Design observability into the inference layer: metrics, tracing, alerting that surface regressions before users notice them
- Run capacity planning, autoscaling, and load testing for varied workload shapes (batch, online, mixed, agentic)
- Diagnose and resolve production incidents; write postmortems that turn bugs into systemic fixes

WHAT WE'RE LOOKING FOR

- Senior ML systems engineer with 3+ years building production-grade, large-scale serving infrastructure
- Strong distributed systems experience; you've been on-call for systems that matter
- Performance profiling and optimization fluency: you read flame graphs, and you analyze and measure before you change anything
- Experience with GPU-accelerated inference at scale (multi-GPU, multi-node, batched and streaming workloads), preferably including AMD GPUs
- Fluent Python; comfortable reading and writing systems-level code in at least one of the following: C++, CUDA, ROCm, or Triton
- Track record of shipping production infrastructure, preferably surfaces serving millions of requests across diverse workloads
- Good written communication; you can write a runbook that someone else can follow at 3am

NICE TO HAVE

- Open-source contributions to inference / serving frameworks
- Experience with mixed cloud and on-premises deployments
- Familiarity with hardware-aware optimization (memory hierarchy, NCCL/RDMA, NUMA)
- Background in compilers, runtimes, or accelerator software stacks

THIS ROLE IS PROBABLY NOT FOR YOU IF

- You're primarily a researcher; the work here is building, not exploring
- You want to focus narrowly on one component; this role spans the stack
- Production responsibility (incidents, on-call, ownership of running systems) isn't appealing