Senior ML Serving Engineer

ECS Federal, LLC

Falls Church, VA (In Person)

Full-Time

Posted 3 weeks ago (Updated 1 day ago) • Actively hiring

Expires 7/23/2026

Job Description

Everforth ECS is seeking a Senior ML Serving Engineer to work in the National Capital Region covering the Pentagon, Falls Church, and Fairfax.

Please Note:

This position is contingent upon contract award.

The War Data Platform (WDP) is a key initiative within the U.S. Department of War's (DoW) AI-First strategy introduced in early 2026. The WDP focuses on operational warfighting data and aims to accelerate the deployment of artificial intelligence (AI) on the battlefield. The WDP extends to Unclassified, Secret, and Top Secret environments, and supports collaboration between Combatant Commands, Joint Staff directorates, Senior Executive Service leaders, and operational analysts.

This role implements the model-runtime deployment pattern used across WDP Core Integration AI and machine learning serving environments, ensuring consistent, secure, and high-performance model delivery to DoW missions and senior leaders.

Responsibilities:

Implements the model-runtime deployment pattern used across WDP Core Integration artificial intelligence and machine learning serving environments supporting DoW missions, Joint Staff analysts, Combatant Command elements, and Senior Executive Service leadership.
Develops service templates, runtime configurations, scaling behaviors, and deployment specifications consumed by enterprise pipelines and API access patterns.
Applies Kubernetes, Helm, Docker, GitLab Continuous Integration, VMware environments, Prometheus, Grafana, Elastic Stack, and hardened deployment workflows to establish consistent runtime behavior for production-ready model artifacts.
Conducts performance tuning, latency optimization, and resource-allocation refinement to maintain operational stability across serving surfaces.
Validates runtime patterns and operational readiness across higher-domain enclaves, including SIPR and JWICS, by resolving enclave-specific runtime constraints, adapting deployment templates, and aligning runtime behavior with cross-domain security architectures.
Supports automated scanning workflows, cross-domain transfer validation, and API endpoint configuration activities to maintain readiness for model serving operations.
Produces mission-critical deliverables including runtime configuration packages, deployment templates, performance reports, operational readiness assessments, and enclave-specific runtime documentation.
Collaborates with Platform One, Cloud One, multi-national engineering teams, and cross-service mission partners to strengthen operational readiness, reinforce deployment consistency, and advance program value commitments across all enclaves.
Participates in Tier-4 incident response actions to maintain service-level agreements, operational continuity, and mission performance for enterprise AI model serving capabilities.
Performs other duties as assigned.

Qualifications:

Current Secret security clearance with the ability to obtain and maintain a Top Secret (TS) security clearance with Sensitive Compartmented Information (SCI).
10-12 years of experience implementing and managing model serving and runtime environments in secure DoW or equivalent settings.
CNCF‑Certified Kubernetes Administrator (CKA) or equivalent Kubernetes certification.
Proven proficiency with Kubernetes, Helm, Docker, GitLab CI, VMware, Prometheus, Grafana, and Elastic Stack for hardened deployment workflows.
Successful track record of performance tuning, latency optimization, and resource‑allocation refinement for production‑grade AI/ML models.
Strong problem‑solving and decision‑making capabilities, with a proven ability to weigh the relative costs and benefits of potential actions and identify the most appropriate solution.
Highly developed interpersonal and oral/written communication skills, with the ability to effectively and professionally interact with a diverse set of stakeholders (from peers to end‑users to executive management).