Senior Android Operations / Reliability Engineer
TALENT Software Services
San Anselmo, CA (In Person)
Full-Time
Job Description
Senior Android Operations / Reliability Engineer at TALENT Software Services in San Anselmo, California. Posted about 3 hours ago.
Type: Full-time
Are you an experienced Senior Android Operations / Reliability Engineer with a desire to excel? If so, then Talent Software Services may have the job for you! Our client is seeking an experienced Senior Android Operations / Reliability Engineer to work at their company in Redmond, WA.
Typical Day in the Role
- Purpose of the Team: The purpose of this team is to support a confidential company's device project by ensuring software quality and stability during the internal self-host program.
- Key projects: This role will contribute to monitoring device health through telemetry dashboards, investigating issues, assigning bugs, and gathering logs (including hands-on device support) to ensure stability in production.
Candidate Requirements
- Best vs. Average: The ideal resume would contain experience with Android (mobile OS). A minimum of 5-7 years is required for the role, but more experience is a bonus. The candidate should be able to work independently.
# Senior Operations / Reliability Engineer
## Summary
We are seeking a Senior Operations / Reliability Engineer to support live operations, service reliability, release stability, and prototype device monitoring for a new hardware and software product.
- Monitor telemetry from services, applications, and prototype devices to assess operational health.
- Observe dashboards, alerts, logs, and metrics to identify anomalies, failures, performance degradation, or emerging reliability risks.
- Analyze real-time metrics and logs to support troubleshooting across cloud, on-premises, and prototype device environments.
- Triage operational issues and communicate findings clearly to engineering, QA, PM, and product teams.
- Provide actionable insights based on telemetry trends, system behavior, and recurring failure patterns.
- Help improve monitoring coverage, alert quality, dashboard usefulness, and operational visibility.
## Release & Service Operations
- Support software releases by validating deployments, monitoring live systems, and assessing post-deployment stability.
- Track service health during rollouts, ring deployments, updates, and release validation windows.
- Identify, debug, and help resolve live issues affecting services, devices, internal users, or product readiness.
- Partner with engineering teams to support mitigations, fixes, rollbacks, or follow-up validation.
- Assist with post-release verification and stabilization reporting.
- Document release observations, risks, incidents, and readiness concerns.
## Incident Response & Reliability Support
- Support incident response by gathering data, summarizing impact, identifying suspected causes, and tracking mitigation progress.
- Participate in post-incident reviews and help document lessons learned.
- Recommend improvements to monitoring, alerting, operational procedures, and service reliability practices.
- Maintain clear records of incidents, recurring issues, known risks, and follow-up actions.
- Help reduce operational toil by identifying repeatable troubleshooting steps, documentation gaps, and automation opportunities.
## On-Site Hardware & Environment Support
- Perform in-person troubleshooting for self-hosted systems, prototype devices, or test environments when telemetry or dashboards indicate issues.
- Assist with device configuration, deployment, validation, and live verification.
- Run smoke checks or readiness checks to confirm device, service, and environment health.
- Maintain documentation of hardware configurations, operational procedures, environment setup, and observed issues.
- Coordinate with engineering and infrastructure teams to resolve environment or device-level reliability problems.
## Collaboration & Communication
- Work closely with software, QA, infrastructure, PM, and product teams to support operational readiness and release reliability.
- Communicate operational status, risks, and technical findings clearly and promptly.
- Provide concise summaries of system health, release readiness, incident status, and recommended next steps.
- Operate independently on assigned areas while escalating appropriately when issues require deeper engineering involvement.
# Deliverables
- Real-time telemetry dashboards, monitoring views, and actionable alerting improvements.
- Release verification and stabilization reports.
- Incident reports, issue summaries, and operational analysis for live events.
- Documentation of hardware configurations, device workflows, operational procedures, and troubleshooting steps.
- Service health summaries, risk assessments, and recommendations for reliability improvements.
- Clear communication of live issues, suspected causes, mitigation status, and follow-up actions.
- Recommendations for improving monitoring, alerting, release validation, and operational readiness.
# Qualifications
- Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, or a related technical field, or equivalent practical experience.
- 5-7 years of relevant experience in software engineering, DevOps, SRE, production operations, infrastructure, service reliability, or related technical operations roles.
- Experience monitoring live services, applications, infrastructure, or device environments.
- Experience using dashboards, alerts, logs, metrics, and telemetry to diagnose system health and troubleshoot issues.
- Experience supporting software releases, deployments, production validation, or service rollouts.
- Ability to investigate technical issues, summarize findings, and communicate risks clearly to engineering and product teams.
- Experience documenting incidents, operational procedures, known issues, and troubleshooting steps.
- Familiarity with CI/CD workflows, cloud or hybrid infrastructure, release validation, and incident response practices.
- Strong problem-solving skills, communication skills, and ability to work independently in a fast-moving engineering environment.
1. Software engineering, DevOps, SRE, or production operations experience: 5+ years
2. Monitoring, telemetry analysis, logging, and live issue troubleshooting: 3+ years
3. Ability to independently drive technical work and deliver operational value: 3+ years