
NPU Kernel/Operator Engineer

Job

Black Sesame Technologies Inc

Campbell, CA (In Person)

Full-Time

Posted 3 days ago (Updated 1 day ago) • Actively hiring

Expires 7/1/2026

Job Description

NPU Kernel/Operator Engineer at Black Sesame Technologies Inc in Campbell, California. Posted about 22 hours ago.
Type: full-time

Role

We are looking for a Senior NPU Kernel/Operator Engineer to lead the design and optimization of high-performance kernels for a custom AI accelerator / NPU. This role focuses on general-purpose deep learning operators, fused kernels, and hardware-aware performance optimization across CNNs, transformers, and other neural network workloads. The ideal candidate has strong experience in performance engineering on GPU, NPU, DSP, CPU SIMD, compiler backend, embedded accelerator, or HPC systems.

Responsibilities

Design and optimize high-performance NPU kernels for a broad range of neural network workloads.
Own critical operators such as attention-style kernels, normalization, reduction, layout conversion, gather/scatter, quant/dequant, and fused operators.
Develop tiling, blocking, vectorization, and memory scheduling strategies.
Optimize data movement across matrix engine, vector engine, SRAM, DMA, NoC, cache, and DRAM.
Analyze bottlenecks in compute utilization, memory bandwidth, synchronization, DMA overlap, bank conflicts, and instruction overhead.
Build first-principles performance models for key operators.
Drive kernels toward hardware roofline limits.
Collaborate with hardware, compiler, runtime, and model teams on ISA features, tensor layouts, memory access patterns, and operator APIs.
Debug complex correctness, precision, and performance issues on simulator or silicon.
Mentor junior engineers and establish kernel optimization best practices.

Requirements

BS/MS/PhD in CS, EE, Computer Engineering, or related field.
5+ years of experience in performance optimization, accelerator programming, GPU/NPU/DSP development, compiler backend, embedded systems, or HPC.
Strong C/C++ programming skills.
Deep understanding of memory hierarchy, tiling, parallelism, vectorization, synchronization, and bandwidth analysis.
Experience optimizing performance-critical kernels or numerical computation.
Ability to reason from algorithm requirements to hardware execution and performance bottlenecks.

Preferred

Experience with CUDA, Triton, CUTLASS, OpenCL, TVM, MLIR, Halide, SIMD intrinsics, DSP SDKs, or custom accelerator SDKs.
Experience optimizing operators such as convolution, GEMM, attention, softmax, normalization, reduction, image processing, or fused compute/memory kernels.
Familiarity with custom AI accelerator architecture, matrix engines, vector engines, systolic arrays, DMA, SRAM, NoC, or DRAM systems.
Experience with mixed precision and quantization: FP32, FP16, BF16, FP8, INT8, INT4.
Experience with simulator/emulator/FPGA/silicon bring-up is a plus.