RESEARCHER, EFFICIENT INFERENCE

Job

MakerMaker

San Francisco, CA (In Person)

Full-Time

Posted 5 days ago (Updated 2 days ago) • Actively hiring

Expires 6/21/2026

Job Description

RESEARCHER, EFFICIENT INFERENCE
MakerMaker, San Francisco, CA
Job Details: Full-time, 11 hours ago
Qualifications: Scientific publications, Statistical analysis, Machine learning libraries, Research & development, Machine learning frameworks, Prototypes
Full Job Description
ABOUT THE COMPANY
We're building autonomous research agents for recursive self-improvement: multi-agent systems that propose, run, and analyze machine learning experiments. We're a small team based in San Francisco, working on-site.
ABOUT THE ROLE
You'll research how to make models efficient: quantization, speculative decoding, sparse and structured attention, distillation, mixture-of-experts inference, and the training-time techniques that make those methods possible. The work spans algorithm design, careful evaluation, and pushing methods to where they actually run. This is a senior research role with a clear engineering edge. You'll spend time at the intersection of model architecture and inference performance, designing methods that move accuracy/latency/cost trade-offs in our favor, then partnering with engineers to make those wins real in production.
WHAT YOU'LL DO
Research and develop quantization methods: post-training quantization, quantization-aware training, mixed-precision regimes, low-bit-width arithmetic
Design and evaluate speculative decoding approaches: draft models, tree attention, parallel speculation, lookahead decoding
Investigate training-time efficiency methods that compose well with inference: distillation, sparse attention, mixture-of-experts, low-rank adaptation, pruning
Run controlled experiments at production scale; characterize what works on real workloads, not just toy benchmarks
Co-design methods with the inference engineering team: push results to where they actually run rather than stopping at the paper
Read deeply across the efficient ML / efficient inference literature; translate the most useful ideas into our stack
Publish when the work warrants it; share findings internally
Partner with model and training researchers so efficiency choices align with model architecture and post-training decisions
WHAT WE'RE LOOKING FOR
Strong track record of ML research on efficiency methods: quantization, speculative decoding, distillation, MoE, sparse attention, or adjacent areas
5+ years of hands-on research experience
Deep familiarity with both training and inference performance characteristics
Fluent in PyTorch, JAX, or equivalent; comfortable working at the kernel and serving-framework level when methods require it
Track record of moving efficiency research from prototype to production
Strong statistical expertise: you'd notice a flawed comparison before someone else points it out
Strong written communication
Published research at NeurIPS, ICML, ICLR, MLSys, or comparable venues
NICE TO HAVE
PhD in ML, systems, or a related field
Open-source contributions to quantization, speculative-decoding, or efficient-inference libraries
Experience with hardware-aware optimization and accelerator-specific tooling
Background in numerical methods, low-precision arithmetic, or approximate computation
THIS ROLE IS PROBABLY NOT FOR YOU IF
You want to focus on pretraining large models from scratch (that's a different role)
You prefer abstract algorithmic research without hands-on implementation
You want a fixed benchmark with stable targets (our targets shift with what our models actually need to do)
