
Inference Engineer

Job

Acceler8 Talent

Los Altos, CA (In Person)

Full-Time

Posted 2 days ago (Updated 6 hours ago) • Actively hiring

Expires 6/25/2026



Job Description

Inference Engineer at Acceler8 Talent in Los Altos, California. Posted 10 minutes ago.
Type: full-time
Inference Engineer

We're partnered with an AI infrastructure company building next-generation systems for large-scale AI workloads. Their platform is rethinking how inference runs at scale - intelligently orchestrating workloads across heterogeneous hardware to unlock major gains in performance, efficiency, and cost.

The team is solving some of the hardest problems in modern AI infrastructure: inference scheduling, KV cache management, runtime optimization, memory efficiency, and low-latency serving across distributed systems.

They're looking for engineers who care deeply about how models execute in production - not just training models, but making them fast, scalable, and reliable under real-world load.

What You'll Work On:
Designing and optimizing large-scale inference pipelines
Improving latency, throughput, and concurrency under production workloads
Building inference runtimes and serving infrastructure
Optimizing batching, scheduling, and request orchestration
Managing KV cache allocation, reuse, placement, and eviction strategies
Improving prefill/decode performance and memory efficiency
Profiling bottlenecks across model, runtime, and distributed system layers
Collaborating closely with compiler, kernel, and systems engineers

What They're Looking For:
Strong systems engineering fundamentals
Experience building or scaling ML inference / model serving systems
Deep understanding of performance optimization and memory behavior
Experience with runtimes such as vLLM, TensorRT-LLM, or custom serving infrastructure
Strong understanding of transformer architectures and attention mechanisms
Familiarity with batching, scheduling, concurrency, and cache management
Strong Python and/or C++ engineering skills

Why Join:
Work on cutting-edge inference infrastructure and AI systems problems
Build systems designed for next-generation AI scale
Small, highly technical engineering team
Significant ownership and technical impact
Opportunity to shape foundational infrastructure for future AI workloads
