Senior ML Infrastructure Engineer (PyTorch, Kubernetes, GPU Training)

Finoit Inc.

Redwood City, CA (In Person)

Full-Time

Posted 1 week ago (Updated 23 hours ago) • Actively hiring

Expires 7/23/2026

Job Description

We are seeking a Senior ML Infrastructure Engineer to design and scale the infrastructure powering large-scale machine learning training workloads. In this role, you'll build high-performance GPU training platforms, optimize distributed training pipelines, and improve the developer experience for ML researchers.

Responsibilities:

Design and scale distributed ML training infrastructure for large GPU clusters.

Build and optimize training pipelines using PyTorch, DeepSpeed, and distributed training frameworks.

Develop and maintain job scheduling systems using Kubernetes and/or SLURM.

Create high-throughput data pipelines for large-scale multimodal datasets.

Optimize GPU utilization, memory efficiency, and overall system performance.

Build low-latency inference pipelines for production ML deployments.

Required Skills:

7+ years of experience in ML Infrastructure, HPC, or Distributed Systems.

Strong experience with PyTorch, DeepSpeed, FSDP, ZeRO, or similar distributed training frameworks.

Hands-on experience with Kubernetes, cloud platforms (AWS/Google Cloud Platform), and containerized environments.

Strong understanding of distributed systems, GPU optimization, NCCL, memory management, and performance tuning.

Experience building scalable ML infrastructure from development through production.

Location:

Redwood City, CA (On-site)

Employment Type:

Full-Time

Nice to Have:

Experience with multimodal AI, robotics data pipelines, Triton, TensorRT, custom ML kernels, or ML compiler/runtime optimization.