AI / ML Engineer (With Observability)
Mindlance
Remote
Full-Time
Job Description
AI / ML Engineer (with Observability) #26-13810
Coppell, TX
30% Remote
Hybrid onsite at Dallas, TX, 75019 / Tampa, FL, 33647 CTH
2 rounds of interviews
Overview
We are seeking a passionate and hands-on AI/ML Engineer to accelerate our Enterprise Observability strategy. This role will design, build, and operationalize AI/ML capabilities that enhance end-to-end telemetry pipelines, anomaly detection, intelligent alerting, and proactive system resiliency.
You will work at the intersection of AI/ML engineering, Observability platforms, and automation, developing solutions that improve detection, diagnosis, and prevention of operational issues across distributed systems.
________________________________________
Key Responsibilities
- Design and deploy AI/ML models supporting anomaly detection, baselining, event correlation, and predictive operational analytics.
- Build and integrate AI‐enabled capabilities into enterprise Observability platforms, including Grafana, APM/RUM tools, network telemetry systems, and data observability tools.
- Develop AI Agents that can autonomously triage issues, recommend corrective actions, and initiate automated remediation workflows to reduce recovery time and improve system resilience.
- Implement self‐healing automation using AI‐driven decisioning, integrating with orchestration frameworks, service APIs, and infrastructure automation pipelines.
- Engineer and maintain real‐time and batch data pipelines using Snowflake ML Jobs, Snowflake Cortex, streams, tasks, and UDFs.
- Implement and manage OpenTelemetry‐based telemetry ingestion for logs, metrics, traces, and spans across distributed systems.
- Build asynchronous Python APIs and services for model inferencing and operational integration.
- Enhance observability intelligence with AI-powered capabilities such as root‐cause acceleration, chatbot/search enablement, and automated insights.
- Contribute to SLO/SLI modeling, Golden Signals instrumentation, and Observability NFR adoption.
- Collaborate across engineering, SRE, platform, and business teams to embed proactive intelligence and Observability standards throughout the ecosystem.
Required Skills & Qualifications
Core Technical Skills
- Strong proficiency in Python and data science/ML libraries: NumPy, Pandas, scikit-learn, TensorFlow, PyTorch, Matplotlib, Seaborn.
- Experience with Generative AI, LLM fine-tuning, prompt engineering, RAG pipelines, and LLM evaluation frameworks.
- Expertise in developing and deploying ML models in production (batch & streaming).
- Strong understanding of statistics, time series modeling, and anomaly detection.
Observability & Telemetry
- Experience with OpenTelemetry for logs, metrics, traces, spans.
- Familiarity with Observability concepts: Golden Signals, SLO/SLI design, APM, RUM, Synthetics, event correlation, baselining.
- Experience with Observability tools such as: Grafana (Alloy agents, dashboards, ML capabilities), Dynatrace, Monte Carlo (Data Observability), Netscout, ThousandEyes, SolarWinds, NetBrain.
Cloud, Data & Platform
- Hands-on experience with AWS (SageMaker, Bedrock), Snowflake ML, Snowflake/Openflow, Snowflake AI Observability tooling.
- Experience building Snowflake data pipelines (streams, tasks, UDFs); experience with Cortex features is a plus.
- Strong understanding of distributed systems and microservices telemetry requirements.
Automation & Engineering Quality
- Experience with automation pipelines, CI/CD, and infrastructure-as-code patterns supporting Observability adoption.
- Ability to build asynchronous Python APIs or services for model inference and operational integration.
________________________________________
Preferred Qualifications
- Experience developing agentic AI systems that analyze telemetry, generate action recommendations, or execute automated operational responses.
- Experience building self‐healing patterns, including automated rollback, service restarts, configuration corrections, and predictive maintenance.
- Experience in Snowflake ML workflows, Snowflake Cortex Agents, and data pipeline automation.
- Exposure to AI-enabled alerting, RCA automation, and operational self‐healing concepts.
- Experience with large-scale operational telemetry and multi-cloud ecosystems.
Soft Skills
- Strong analytical thinking and problem solving.
- Excellent communication skills for cross-functional collaboration with infrastructure, SRE, engineering, business, and leadership teams.
- Curiosity, continuous learning mindset, and passion for applied AI and Observability.
EEO:
"Mindlance is an Equal Opportunity Employer and does not discriminate in employment on the basis of - Minority/Gender/Disability/Religion/LGBTQI/Age/Veterans."