
Member of Technical Staff, ML Infrastructure & Inference

Job

Acceler8 Talent

San Lorenzo, CA (In Person)

Full-Time

Posted 2 days ago (Updated 2 hours ago) • Actively hiring

Expires 6/22/2026



Job Description

Member of Technical Staff, ML Infrastructure & Inference at Acceler8 Talent in San Lorenzo, California. Posted about 2 hours ago.
Type: full-time
Overview

We are a cutting-edge AI infrastructure company ($80M Series A) building a scalable cloud platform designed for next-generation machine learning workloads. As AI systems continue to grow in complexity, traditional infrastructure models are facing limitations in efficiency, scalability, and cost. The platform addresses these challenges through a hardware-agnostic architecture that dynamically maps workloads across diverse accelerator environments, enabling higher utilization and better performance across multi-vendor systems.

The company is also developing production-grade infrastructure for agentic AI applications, allowing customers to deploy and manage workloads through simple APIs without handling low-level optimization or hardware orchestration.

The Role

The team is seeking a Member of Technical Staff focused on ML systems and inference infrastructure. In this role, you will build and optimize large-scale inference systems that serve modern AI models efficiently in production environments. You'll work across runtime behavior, scheduling, memory management, and system optimization to improve latency, throughput, and scalability. This opportunity is well suited to engineers who understand how modern models execute at scale and enjoy solving deep performance challenges across the inference stack.

Responsibilities

Design and optimize end-to-end inference pipelines from request intake through response generation
Build scalable inference runtimes optimized for latency, throughput, and concurrency
Improve batching, scheduling, and queueing strategies under real-world production workloads
Develop efficient KV cache allocation, reuse, and eviction strategies
Optimize prefill and decode execution paths, including attention and memory performance
Debug and profile bottlenecks across models, runtimes, and distributed systems
Partner with compiler, kernel, networking, and infrastructure teams to improve system-wide performance

Required Qualifications

Strong software engineering and systems fundamentals
Experience building or operating ML inference or model serving systems
Understanding of runtime performance, memory usage, and system behavior under load

Preferred Qualifications

Experience with inference frameworks such as TensorRT-LLM, vLLM, or custom serving infrastructure
Strong understanding of transformer architectures and attention mechanisms
Experience with batching, scheduling, and concurrency optimization in inference systems
Familiarity with KV cache management and memory placement strategies
Experience tuning latency- and throughput-sensitive systems
Strong programming skills in Python and C++

Based onsite in SF.

Keywords:
ML Systems, Inference Infrastructure, LLM Inference, Model Serving, Distributed Systems, GPU Infrastructure, AI Infrastructure, Inference Runtime, TensorRT-LLM, vLLM, Transformer Architecture, Attention Mechanisms, KV Cache, Memory Optimization, Latency Optimization, Throughput Optimization, Concurrency Control, Batching, Scheduling Systems, Runtime Optimization, Performance Profiling, Scalable Inference, Distributed Inference, CUDA, PyTorch
