
Pre-Training & Continued Pretraining - Algorithm Expert / Architect [33247]

Job

Stealth Startup

Stanford, CA (In Person)

Full-Time

Posted 5 days ago (Updated 19 hours ago) • Actively hiring

Expires 7/24/2026

Job Description

Pre-Training & Continued Pretraining - Algorithm Expert / Architect [33247] at Stealth Startup in Stanford, California. Posted 2 days ago.
Type: full-time
Responsibilities

Core base-model R&D. Take full ownership of pre-training trillion-parameter code models from scratch; lead the design and selection of next-generation model architectures; continuously raise the capability ceiling in code generation, code understanding, and complex logical reasoning.

Massive data engineering. Lead the cleaning, filtering, and mixing strategy for TB- to PB-scale high-quality code data; explore data optimization methods for test-time scaling; build a high-quality corpus and data-synthesis flywheel spanning multilingual code, open-source projects, and technical documentation.

Training strategy tuning. Address the pain points of pre-training at extreme scale: design and optimize learning-rate scheduling, loss functions, curriculum learning, and anti-forgetting mechanisms; resolve difficult issues such as loss spikes and gradient anomalies; ensure stable cluster training at the 100B- to trillion-parameter scale.

Efficient training engineering. Participate deeply in the development and tuning of the underlying distributed training framework; master parallelism techniques (DP/TP/PP/EP) and memory optimization schemes (ZeRO/FSDP); remove communication bottlenecks at 10k-GPU scale and maximize Model FLOPs Utilization (MFU).

Frontier insight & delivery. Maintain acute awareness of leading AI coding tools (DeepSeek V4, Cursor, Composer-2, Claude Code, etc.) and deconstruct their underlying paradigms; track and reproduce the latest top-conference LLM papers, turning frontier research into production capability.

Qualifications

Education & background. Bachelor's degree or above in CS, AI, Mathematics, or a related field, with a solid foundation in machine learning, deep learning, and compiler theory.

Hands-on experience. Complete end-to-end pre-training experience with models from 10B to trillion parameters (especially code or long-context models); strong command of autoregressive language modeling objectives; understanding of SFT and RLHF/DPO alignment is a plus.

Frameworks & systems. Proficient in PyTorch; skilled with large-model pre-training frameworks such as Megatron-LM and DeepSpeed; able to read and modify mainstream open-source LLM source code; comfortable with Linux and with GPU-cluster operations and debugging.

Data processing. Deep grasp of data-centric AI; hands-on experience with massive unlabeled text/code data; mastery of high-quality corpus deduplication, quality assessment, tokenization design, and data mixing.

Research + engineering. Publications at top venues (NeurIPS, ICML, ICLR, ACL) on LLM pre-training, alignment, or efficient training, along with a strong ability to turn theory into high-performance production code.

Nice-to-haves

Heavy hands-on use of mainstream AI coding tools (Cursor, Composer-2, Claude Code, Copilot) with deep insight into coding-agent workflows.

Familiarity with MoE architectures and their load-balancing mechanisms, or hands-on experience with novel architectures such as MLA (Multi-head Latent Attention) or linear attention.

Core-contributor experience on code or pre-training for well-known open-source LLM projects (Llama, Qwen, DeepSeek, etc.).

Strong placements in ACM-ICPC, Kaggle, or other authoritative AI competitions, or a high-star personal open-source project.