Hybrid onsite at Dallas, TX, 75019 / Tampa, FL, 33647
Type: Contract to Hire

Job Description:
Overview
We are seeking a passionate and hands-on AI/ML Engineer to accelerate our Enterprise Observability strategy. This role will design, build, and operationalize AI/ML capabilities that enhance end-to-end telemetry pipelines, anomaly detection, intelligent alerting, and proactive system resiliency. You will work at the intersection of AI/ML engineering, Observability platforms, and automation, developing solutions that improve detection, diagnosis, and prevention of operational issues across distributed systems.
________________________________________
Key Responsibilities
Design and deploy AI/ML models supporting anomaly detection, baselining, event correlation, and predictive operational analytics.
Build and integrate AI-enabled capabilities into enterprise Observability platforms, including Grafana, APM/RUM tools, network telemetry systems, and data observability tools.
Develop AI Agents that can autonomously triage issues, recommend corrective actions, and initiate automated remediation workflows to reduce recovery time and improve system resilience.
Implement self-healing automation using AI-driven decisioning, integrating with orchestration frameworks, service APIs, and infrastructure automation pipelines.
Engineer and maintain real-time and batch data pipelines using Snowflake ML Jobs, Snowflake Cortex, streams, tasks, and UDFs.
Implement and manage OpenTelemetry-based telemetry ingestion for logs, metrics, traces, and spans across distributed systems.
Build asynchronous Python APIs and services for model inferencing and operational integration.
Enhance observability intelligence with AI-powered capabilities such as root-cause acceleration, chatbot/search enablement, and automated insights.
Contribute to SLO/SLI modeling, Golden Signals instrumentation, and Observability NFR adoption.
Collaborate across engineering, SRE, platform, and business teams to embed proactive intelligence and Observability standards throughout the ecosystem.
Required Skills & Qualifications
Core Technical Skills
Strong proficiency in Python and data science/ML libraries: NumPy, Pandas, scikit-learn, TensorFlow, PyTorch, Matplotlib, Seaborn.
Experience with Generative AI, LLM fine-tuning, prompt engineering, RAG pipelines, and LLM evaluation frameworks.
Expertise in developing and deploying ML models in production (batch & streaming).
Strong understanding of statistics, time series modeling, and anomaly detection.
Observability & Telemetry
Experience with OpenTelemetry for logs, metrics, traces, and spans.
Familiarity with Observability concepts: Golden Signals, SLO/SLI design, APM, RUM, Synthetics, event correlation, baselining.
Experience with Observability tools such as: Grafana (Alloy agents, dashboards, ML capabilities), Dynatrace, Monte Carlo (Data Observability), Netscout, ThousandEyes, SolarWinds, NetBrain.