Lead AI SRE / AI Ops Engineer

GTSS Inc

Fremont, CA (In Person)

Full-Time

Posted 1 week ago (Updated 4 days ago) • Actively hiring

Expires 6/5/2026

Job Description

Role: Lead AI SRE / AI Ops Engineer
Location: Fremont, CA (Hybrid Onsite)
Duration: 12+ Months

Role Summary

We are looking for a strong, hands-on Lead AI-Assisted SRE / AIOps Engineer to help operationalize and scale an SRE agent-driven operations model. This role will lead the onboarding of existing scripts, SOPs, and operational workflows into the SRE agent while also supporting production releases, validation, incident response, and operational governance. This is not a pure support role. The ideal candidate must be technically strong, practical, and capable of using independent judgment rather than relying blindly on AI outputs.

Experience

13+ years of total experience required, with around 5+ years of hands-on experience in IT operations, cloud operations, SRE, platform support, or production engineering
Proven experience in production support, incident handling, automation, and operational troubleshooting
Experience working with monitoring, observability, scripting, and release validation
Exposure to AIOps, AI-assisted operations, or automation-led support models is strongly preferred

Key Responsibilities

Lead the adoption and operationalization of the SRE agent across support and reliability workflows
Translate existing scripts, runbooks, SOPs, and operational knowledge into agent-compatible workflows
Work with teams to identify which use cases should be automated, semi-automated, or remain human-driven
Validate agent outputs, recommendations, and remediation steps before operational use
Support production releases, release validation, smoke testing, and post-release health checks
Drive troubleshooting during incidents and ensure proper root cause analysis and follow-through
Improve alert handling, event correlation, and operational response patterns
Coordinate with engineering, operations, and platform teams on onboarding and process changes
Mentor junior engineers and guide them on workflow design, validation, and operational execution
Maintain high-quality documentation, runbooks, and operational standards

Required Technical Skills

Strong hands-on scripting experience in PowerShell, Python, and Shell/Bash
Experience with monitoring, alerting, logs, dashboards, and incident workflows
Good understanding of production support processes, release support, and validation practices
Experience with cloud platforms, preferably Azure
Familiarity with ITSM/ticketing tools such as ServiceNow, Jira, or similar
Ability to understand existing operational scripts and modernize them into scalable workflows
Experience with APIs, integrations, or automation pipelines is preferred
Exposure to Kubernetes/AKS and AI tools (ChatGPT, Copilot) is a plus
