Job Description
Autonomous Driving Multimodal Model Algorithm Engineer
Black Sesame Technologies Inc
San Jose, CA
Full-time
$130,000 - $400,000 a year

Company Overview
Black Sesame Technologies is building high-performance AI algorithms and self-developed chips for intelligent driving and beyond.
As an Autonomous Driving Multimodal Model Algorithm Engineer, you will work on next-generation multimodal AI models for autonomous driving, including Vision-Language Models, Vision-Language-Action Models, and World Models. You will collaborate with perception, prediction, planning, data, simulation, and deployment teams to integrate multimodal models with existing BEV perception, two-stage E2E, and one-stage E2E autonomous driving systems.
We are looking for candidates with hands-on experience in one or more of the following areas: Vision-Language Models, Vision-Language-Action Models, World Models.

Responsibilities

Multimodal Model Development for Autonomous Driving
Work on one or more multimodal modeling directions for autonomous driving, including VLM-based scene understanding, VLA-style planning-oriented modeling, and World Model-based future prediction.
Develop and optimize models that reason over multi-camera images, BEV features, map elements, object/lane instances, occupancy, trajectories, ego-motion, and driving context.
Explore model architectures that connect perception, prediction, planning, and decision-making in two-stage and one-stage E2E autonomous driving systems.
Collaborate with BEV perception and planning teams to improve representation quality, temporal consistency, long-tail robustness, and planning relevance.

Vision-Language and Vision-Language-Action Modeling
Develop VLM-based methods for driving scene understanding, open-vocabulary perception, risk reasoning, corner-case analysis, and interpretable autonomy.
Adapt and extend open-source multimodal architectures such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, or similar models for autonomous driving scenarios.
Research VLA-style models that map multimodal driving context, navigation intent, and high-level instructions to trajectories, actions, or planning representations.
Align visual, BEV, map, object, lane, occupancy, trajectory, and language representations for driving-specific tasks.
Build supervised fine-tuning, instruction-tuning, and efficient adaptation pipelines for driving-relevant multimodal tasks.

World Model and Future Prediction
Build world-model-based approaches for future BEV, occupancy, object motion, lane evolution, traffic interaction, and ego-conditioned scene rollout.
Explore generative and predictive modeling methods such as diffusion models, autoregressive transformers, latent dynamics models, video prediction, and BEV prediction.
Use learned world models for scenario generation, counterfactual reasoning, long-tail case mining, planning evaluation, and closed-loop analysis.
Work with simulation and data teams to improve safety-critical scenario discovery and model-based evaluation.

Efficient Adaptation and Deployment
Apply efficient fine-tuning and adaptation methods such as LoRA, QLoRA, Adapter, Prompt Tuning, Prefix Tuning, or other PEFT techniques.
Develop multimodal feature alignment modules, including projection heads, query adapters, cross-attention modules, tokenization strategies, and representation converters.
Optimize model architecture, latency, memory footprint, and compute cost for automotive deployment.
Apply distillation, quantization, pruning, sparse computation, and efficient attention methods where appropriate.
Collaborate with chip, compiler, runtime, and deployment teams to adapt multimodal models to in-house automotive AI hardware.

Research, Evaluation, and Iteration
Track the latest research in VLM, VLA, World Models, BEV perception, E2E driving, robotics foundation models, generative simulation, and multimodal learning.
Design evaluation metrics for reasoning quality, grounding accuracy, temporal consistency, prediction quality, planning relevance, and safety-critical scenarios.
Perform systematic failure analysis and drive data/model iteration based on real-world autonomous driving cases.
Contribute to patents, technical reports, internal research platforms, and conference or journal publications.

Qualifications
MS or PhD in Computer Science, Electrical Engineering, Robotics, Artificial Intelligence, or a related field.
Strong background in deep learning, computer vision, multimodal learning, robotics, or autonomous driving.
Hands-on experience in one or more of the following areas:
    Vision-Language Models, multimodal large models, or open-source VLM adaptation
    Vision-Language-Action models, robotics foundation models, or action-conditioned modeling
    World models, generative prediction, latent dynamics modeling, or future scene simulation
    BEV perception, multi-view 3D perception, or end-to-end autonomous driving
    Motion prediction, planning, trajectory generation, or closed-loop evaluation
Practical experience with open-source multimodal architectures such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, BLIP-style models, Flamingo-style models, or similar systems.
Solid understanding of multimodal feature alignment, including vision-language alignment, cross-modal attention, visual tokenization, projection layers, query-based fusion, or embedding-space alignment.
Experience with efficient fine-tuning or adaptation methods, such as LoRA, QLoRA, Adapter, Prompt Tuning, Prefix Tuning, supervised fine-tuning, or instruction tuning.
Proficient in PyTorch and capable of modifying, training, debugging, and evaluating deep learning models.
Familiar with transformer architectures, attention mechanisms, temporal modeling, and large-scale training.
Experience with multimodal data, such as camera, radar, LiDAR, IMU, map, trajectory, language, or structured driving data.
Strong engineering ability in Python; C++/CUDA/TensorRT experience is a plus.
Comfortable with Git, Docker, Linux, distributed training, and collaborative development workflows.
Strong communication skills and ability to work across perception, planning, data, simulation, and deployment teams.

Preferred Qualifications
Experience adapting or fine-tuning VLM/VLA models such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, or similar architectures.
Experience with Hugging Face Transformers, PEFT, DeepSpeed, FSDP, vLLM, SGLang, TensorRT-LLM, or similar training/inference frameworks.
Experience building multimodal instruction datasets, driving-scene QA datasets, grounding datasets, scene-reasoning datasets, or planner-oriented supervision signals.
Experience aligning multimodal model representations with BEV features, object queries, lane instances, occupancy grids, map vectors, trajectories, or planner inputs.
Experience with autonomous driving architectures such as BEVFormer, DETR/DINO, MapTR/MapQR, occupancy networks, diffusion planners, trajectory transformers, or similar models.
Experience with world models, generative models, video prediction, future BEV prediction, occupancy forecasting, learned simulation, or closed-loop evaluation.
Experience with efficient adaptation of large models, including LoRA/QLoRA, distillation, quantization, pruning, sparse attention, or lightweight adapter design.
Experience deploying deep learning models on automotive SoCs, ASICs, GPUs, or edge AI accelerators.
Publications or strong project experience in CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, CoRL, ICRA, IROS, RSS, or related autonomous driving and robotics venues.
Strong ability to convert research ideas into robust production systems.
Experience with AI agent tools and basic harness engineering, including building evaluation scripts, task runners, automated workflows, tool-use pipelines, and reproducible testing environments for model or agent development.

Pay:
$130,000.00 - $400,000.00 per year

Benefits:
401(k)
Dental insurance
Employee assistance program
Employee discount
Flexible spending account
Health insurance
Health savings account
Life insurance
Paid time off
Referral program
Retirement plan
Vision insurance

Work Location:
In person