Software Engineering, CoreAI
Microsoft
Redmond, WA (In Person)
Full-Time
Job Description
Architect, design, and develop core AI Infrastructure services written in Go, Rust, Python, C++, and C# and deployed on large-scale Kubernetes clusters to support pre-training and post-training of state-of-the-art LLMs, SLMs, multimodal models, and code-specific models.
Collaborate closely with engineers, researchers, and external partners to debug, diagnose, and improve the stability of large-scale training runs.
Enhance systems and applications to deliver high stability, low latency, strong security, and maintainability in large-scale, complex training environments in Azure and in partner clouds.
Provide operational support, technical leadership, and vision while contributing to the deployment, monitoring, and continuous improvement of engineering systems and practices.
Bachelor's Degree in Computer Science or related technical field and 2+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, or equivalent experience.
These requirements include, but are not limited to, specialized security screenings.
2+ years of experience designing, developing, and shipping high-quality software.
2+ years of experience with distributed systems and cloud-based infrastructure.
1+ year of experience with DevOps practices (CI/CD, automated testing, deployment, etc.).
2+ years of software development experience in C#, C++, Python, or similar languages.
2+ years of experience with containerization tools (e.g., Docker, Kubernetes).
Knowledge of and hands-on experience with production ML systems, large-scale training infrastructure, NCCL, and CUDA libraries and tools.