
Principal Product Manager

Job

2100 NVIDIA USA

Santa Clara, CA (In Person)

$309,750 Salary, Full-Time

Posted 1 week ago (Updated 17 hours ago) • Actively hiring

Expires 6/21/2026




Job Description

NVIDIA is driving a vision for AI factories that convert tokens to intelligence at scale to power the AI demands of tomorrow. Maintaining AI infrastructure at scale takes more than human involvement; it demands smart automation. The orchestration engine for AI factory break-fix runs live in production at DGX Cloud. As the Product Manager leading all aspects of resilient automation for AI factories, you will own break-fix automation. You will develop the product strategy, improve operator experience, and guide the roadmap for professionals. You will build a scalable, reliable product from a strong engineering foundation that NVIDIA Cloud Partners depend on to uphold their SLAs. This is your chance to compose how AI factories self-heal!
What You'll Be Doing:
Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.
Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.
Build the operator UX for repair queues, workflow transparency, and audit trails, ensuring on-call engineers have the context they need to act quickly and confidently.
Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.
Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.
Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale.

What We Need to See:
15+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background.
BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience.
Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation.
Track record owning products with real-world operational consequences: you understand blast radius and build accordingly.
Strong operator UX instincts, with a proven ability to translate complex system state into workflows that on-call engineers can act on under pressure.
Ability to build alignment across engineering, SRE, and external vendor partner teams.

Ways to Stand Out from the Crowd:
Hands-on experience with GPU infrastructure, datacenter operations, or AI factory environments.
Experience with RMA logistics, vendor SLA oversight, and hardware repair processes at large scale.
Background in reliability engineering, SLO design, or chaos/fault-injection testing.
Prior experience at a cloud service provider or on a hyperscaler infrastructure team.
Experience building agentic AI workflow software.

Widely considered to be one of the technology world's most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package. As you plan your future, see what we can offer you and your family at www.nvidiabenefits.com/

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 240,000 USD - 379,500 USD. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until May 18, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA pioneered accelerated computing. Today, our AI infrastructure powers global intelligence, transforming every industry. Learn more about NVIDIA.
