Req ID: JR98097
Principal Machine Learning Engineer
Our vision is to transform how the world uses information to enrich life for all. Micron Technology is a world leader in innovating memory and storage solutions that accelerate the transformation of information into intelligence, inspiring the world to learn, communicate and advance faster than ever.
The Smart Manufacturing and AI team at Micron Technology is looking for an ambitious Machine Learning Engineer. Our mission is to provide leading machine learning, custom GenAI, and Agentic AI solutions that support Micron's leadership in the competitive memory solutions market. Qualified applicants will have experience with various data and cloud technologies and strong skills in modeling data, querying, and deploying scalable data pipelines that support machine learning models and AI agents. You will work closely with Data Scientists, Data Engineers, and expert users to build and launch scalable AI/ML solutions that generate value and insight from Micron's manufacturing processes and systems.
Responsibilities:
- Architect and run large-scale custom model training and fine-tuning jobs (SFT, RLHF) on multi-node, multi-GPU clusters.
- Optimize training throughput and memory efficiency using distributed training strategies (FSDP, DeepSpeed, Megatron-LM) and mixed-precision techniques (FP16/BF16).
- Build and develop autonomous AI Agents capable of multi-step reasoning, planning, and tool execution to automate complex manufacturing workflows.
- Implement Agentic frameworks (e.g., LangChain, LangGraph, CrewAI) to orchestrate LLM interactions with internal APIs, databases, and software tools.
- Profile and debug GPU performance bottlenecks using tools like Nsight Systems or PyTorch Profiler to improve hardware utilization.
- Develop and sustain data/solution pipelines that support machine learning models and GenAI applications.
- Build and optimize data structures in data management systems (Snowflake and Google Cloud platforms) to enable AI/ML and Agentic solutions.
- Build and maintain CI/CD pipelines for machine learning and AI Agent solutions in the cloud.
Minimum Qualifications:
- 10+ years of experience with deep expertise in GPU architecture (memory hierarchy, tensor cores, NVLink) and GPU resource management across cloud and on‑prem environments.
- 5+ years in performance optimization, parallel computing, and low-level systems. Strong C++ skills and experience with GPGPU frameworks. CUDA is preferred, but HIP, OpenCL, or Metal are acceptable.
- Hands-on experience building end-to-end ML systems, including distributed training techniques (DDP, FSDP, model parallelism) and automated pipelines for training, testing, and deployment.
- Strong proficiency in LLMs, including prompt engineering, fine-tuning (LoRA/QLoRA), inference optimization (vLLM, TensorRT-LLM), and development of GenAI applications/agents using LangChain, LlamaIndex, AutoGen, and PyTorch.
- Proficiency in Python (preferred) or Java is required, along with experience in CI/CD and cloud-native tools such as Git, Jenkins, Docker, and Kubernetes.
Candidates should have strong communication abilities and perform well in dynamic settings. A Bachelor's or Master's degree or equivalent experience in Computer Science, Statistics, or a related field is expected.
Preferred Qualifications:
- A Ph.D. in Computer Science or Statistics, or equivalent experience.
- Experience with HPC job schedulers (e.g., Slurm) and managing large-scale GPU workloads on Kubernetes using tools like Ray and Kubeflow.
- Knowledge of CUDA programming, Triton kernels, and building custom C++ extensions for PyTorch to accelerate workloads.
- Experience designing and orchestrating collaboration between specialized agents in multi-agent architectures.
- Deep knowledge of mathematics, probability, statistics, and algorithms.
Proven track record of evolving data science prototypes into production systems, with knowledge of computer vision and/or signal processing techniques for classification and feature extraction.
Job Profile(s): Machine Learning Engineer 5 - Machine Learning Engineering MTS
Relocation Level: TBD
Before Getting Started
Please review Micron's Internal Job Application Policy on your regional PeopleNow Career Opportunities page before searching and applying for jobs.
Note in particular that:
- Hiring managers may view your performance appraisals, original resume, transcripts or other performance-related documentation in your personal file. This information will be held in confidence.
- If you are selected to interview for a position, you must notify your direct supervisor before participating in the interview process.
As a world leader in the semiconductor industry, Micron is dedicated to your personal wellbeing and professional growth. Micron benefits are designed to help you stay well, provide peace of mind and help you prepare for the future. We offer a choice of medical, dental and vision plans in all locations enabling team members to select the plans that best meet their family healthcare needs and budget. Micron also provides benefit programs that help protect your income if you are unable to work due to illness or injury, and paid family leave. Additionally, Micron benefits include a robust paid time-off program and paid holidays. For additional information regarding the Benefit programs available, please see the Benefits Guide posted on Benefits | Micron Technology, Inc.
Micron is proud to be an equal opportunity workplace and is an affirmative action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, age, national origin, citizenship status, disability, protected veteran status, gender identity or any other factor protected by applicable federal, state, or local laws.
To learn about your right to work click here.
For US Sites Only:
To request assistance with the application process and/or for reasonable accommodations, please contact Micron's People Organization at hrsupport_na@micron.com or 1-800-336-8918 (select option #3).