Senior Android Operations / Reliability Engineer

TALENT Software Services

San Anselmo, CA (In Person)

Full-Time

Posted 2 days ago (Updated 10 hours ago) • Actively hiring

Expires 6/22/2026

Job Description

Senior Android Operations / Reliability Engineer at TALENT Software Services in San Anselmo, California. Posted about 3 hours ago.

Type: full-time

Job Description:

Are you an experienced Senior Android Operations / Reliability Engineer with a desire to excel? If so, then Talent Software Services may have the job for you! Our client is seeking an experienced Senior Android Operations / Reliability Engineer to work at their company in Redmond, WA.

Typical Day in the Role

Purpose of the Team: The purpose of this team is to support a confidential company's device project by ensuring software quality and stability during the internal self-host program.

Key projects: This role will contribute to monitoring device health through telemetry dashboards, investigating issues, assigning bugs, and gathering logs (including hands-on device support) to ensure stability in production.

Candidate Requirements

Best vs. Average: The ideal resume would show experience with Android (mobile OS). A minimum of 5-7 years is required for the role, and more experience would be a bonus. Candidates should be able to work independently.

# Senior Operations / Reliability Engineer

## Summary

We are seeking a Senior Operations / Reliability Engineer to support live operations, service reliability, release stability, and prototype device monitoring for a new hardware and software product.

This role will focus on monitoring telemetry, diagnosing live issues, validating software releases, supporting incident response, and helping improve operational readiness across services, applications, and prototype device environments.

This is an engineering-oriented operations role. The ideal candidate will be comfortable working with logs, dashboards, alerts, deployment signals, and live system behavior, while partnering closely with software engineers, QA, infrastructure teams, PMs, and product leadership.

The role will be strongly supported by experienced engineers on the team, who will provide technical guidance on service architecture, prototype device workflows, telemetry interpretation, release processes, and complex debugging. The engineer will collaborate closely with these senior team members while taking ownership of day-to-day monitoring, release validation, live issue triage, documentation, and operational reporting.

# Scope of Work & Responsibilities

## Live Monitoring & Telemetry

Monitor telemetry from services, applications, and prototype devices to assess operational health.
Observe dashboards, alerts, logs, and metrics to identify anomalies, failures, performance degradation, or emerging reliability risks.
Analyze real-time metrics and logs to support troubleshooting across cloud, on-premises, and prototype device environments.
Triage operational issues and communicate findings clearly to engineering, QA, PM, and product teams.
Provide actionable insights based on telemetry trends, system behavior, and recurring failure patterns.
Help improve monitoring coverage, alert quality, dashboard usefulness, and operational visibility.

## Release & Service Operations
Support software releases by validating deployments, monitoring live systems, and assessing post-deployment stability.
Track service health during rollouts, ring deployments, updates, and release validation windows.
Identify, debug, and help resolve live issues affecting services, devices, internal users, or product readiness.
Partner with engineering teams to support mitigations, fixes, rollbacks, or follow-up validation.
Assist with post-release verification and stabilization reporting.
Document release observations, risks, incidents, and readiness concerns.

## Incident Response & Reliability Support
Support incident response by gathering data, summarizing impact, identifying suspected causes, and tracking mitigation progress.
Participate in post-incident reviews and help document lessons learned.
Recommend improvements to monitoring, alerting, operational procedures, and service reliability practices.
Maintain clear records of incidents, recurring issues, known risks, and follow-up actions.
Help reduce operational toil by identifying repeatable troubleshooting steps, documentation gaps, and automation opportunities.

## On-Site Hardware & Environment Support
Perform in-person troubleshooting for self-hosted systems, prototype devices, or test environments when telemetry or dashboards indicate issues.
Assist with device configuration, deployment, validation, and live verification.
Run smoke checks or readiness checks to confirm device, service, and environment health.
Maintain documentation of hardware configurations, operational procedures, environment setup, and observed issues.
Coordinate with engineering and infrastructure teams to resolve environment or device-level reliability problems.

## Collaboration & Communication
Work closely with software, QA, infrastructure, PM, and product teams to support operational readiness and release reliability.
Communicate operational status, risks, and technical findings clearly and promptly.
Provide concise summaries of system health, release readiness, incident status, and recommended next steps.
Operate independently on assigned areas while escalating appropriately when issues require deeper engineering involvement.

# Deliverables
Real-time telemetry dashboards, monitoring views, and actionable alerting improvements.
Release verification and stabilization reports.
Incident reports, issue summaries, and operational analysis for live events.
Documentation of hardware configurations, device workflows, operational procedures, and troubleshooting steps.
Service health summaries, risk assessments, and recommendations for reliability improvements.
Clear communication of live issues, suspected causes, mitigation status, and follow-up actions.
Recommendations for improving monitoring, alerting, release validation, and operational readiness.

# Qualifications
Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, or a related technical field, or equivalent practical experience.
5-7 years of relevant experience in software engineering, DevOps, SRE, production operations, infrastructure, service reliability, or related technical operations roles.
Experience monitoring live services, applications, infrastructure, or device environments.
Experience using dashboards, alerts, logs, metrics, and telemetry to diagnose system health and troubleshoot issues.
Experience supporting software releases, deployments, production validation, or service rollouts.
Ability to investigate technical issues, summarize findings, and communicate risks clearly to engineering and product teams.
Experience documenting incidents, operational procedures, known issues, and troubleshooting steps.
Familiarity with CI/CD workflows, cloud or hybrid infrastructure, release validation, and incident response practices.
Strong problem-solving skills, communication skills, and ability to work independently in a fast-moving engineering environment.

Explain a typical day in the role: A typical day may include reviewing dashboards and alerts, checking telemetry from recent builds or deployments, investigating anomalies, and summarizing operational health for the team. The engineer may help validate a software rollout, monitor ring deployments, troubleshoot prototype device issues on-site, or gather logs and metrics for an active investigation. They will work closely with engineering, QA, PM, and infrastructure teams to communicate issues, document findings, verify fixes, and identify improvements to monitoring, alerting, and reliability practices.

What is the ideal background of a candidate for this role? The ideal candidate will have a software engineering, DevOps, SRE, production engineering, service operations, or infrastructure background. They should be comfortable diagnosing live system issues, interpreting logs and telemetry, validating deployments, and working closely with engineering teams to resolve reliability problems. A strong candidate will have experience with operational monitoring, alerting systems, cloud or hybrid environments, CI/CD or release workflows, incident response, and technical troubleshooting. Experience with prototype devices, hardware/software integration, or on-site lab environments is a strong plus.

What are unique selling points that would get candidates interested in your role over another? This role offers the opportunity to work on an interesting new hardware product and the software and services that support it. Candidates will gain hands-on exposure to prototype devices, live telemetry, release operations, and real-world reliability challenges. The role is supported by a strong engineering team with experienced technical leaders, a collaborative culture, and meaningful opportunities to improve operational practices for a developing product area.

How will contractor performance be measured? Contractor performance will be measured by the quality and timeliness of operational monitoring, issue triage, release validation, incident documentation, and reliability recommendations. Success will also be evaluated based on the contractor's ability to identify meaningful risks, communicate findings clearly, support live issue resolution, maintain accurate operational documentation, and collaborate effectively with engineering, QA, PM, infrastructure, and product teams.

Top 3 Must-Have HARD Skills & years of experience for each:

1. Software engineering, DevOps, SRE, or production operations experience: 5+ years
2. Monitoring, telemetry analysis, logging, and live issue troubleshooting: 3+ years
3. Ability to independently drive technical work and deliver operational value: 3+ years
