Job Description
NPU Kernel/Operator Engineer at Black Sesame Technologies Inc in Campbell, California
Posted about 22 hours ago.
Type: full-time
Role
We are looking for a Senior NPU Kernel/Operator Engineer to lead the design and optimization of high-performance kernels for a custom AI accelerator / NPU. This role focuses on general-purpose deep learning operators, fused kernels, and hardware-aware performance optimization across CNNs, transformers, and other neural network workloads. The ideal candidate has strong experience in performance engineering on GPU, NPU, DSP, CPU SIMD, compiler backend, embedded accelerator, or HPC systems.

Responsibilities
- Design and optimize high-performance NPU kernels for a broad range of neural network workloads.
- Own critical operators such as attention-style kernels, normalization, reduction, layout conversion, gather/scatter, quant/dequant, and fused operators.
- Develop tiling, blocking, vectorization, and memory scheduling strategies.
- Optimize data movement across matrix engine, vector engine, SRAM, DMA, NoC, cache, and DRAM.
- Analyze bottlenecks in compute utilization, memory bandwidth, synchronization, DMA overlap, bank conflicts, and instruction overhead.
- Build first-principles performance models for key operators.
- Drive kernels toward hardware roofline limits.
- Collaborate with hardware, compiler, runtime, and model teams on ISA features, tensor layouts, memory access patterns, and operator APIs.
- Debug complex correctness, precision, and performance issues on simulator or silicon.
- Mentor junior engineers and establish kernel optimization best practices.

Requirements
- BS/MS/PhD in CS, EE, Computer Engineering, or related field.
- 5+ years of experience in performance optimization, accelerator programming, GPU/NPU/DSP development, compiler backend, embedded systems, or HPC.
- Strong C/C++ programming skills.
- Deep understanding of memory hierarchy, tiling, parallelism, vectorization, synchronization, and bandwidth analysis.
- Experience optimizing performance-critical kernels or numerical computation.
- Ability to reason from algorithm requirements to hardware execution and performance bottlenecks.

Preferred
- Experience with CUDA, Triton, CUTLASS, OpenCL, TVM, MLIR, Halide, SIMD intrinsics, DSP SDKs, or custom accelerator SDKs.
- Experience optimizing operators such as convolution, GEMM, attention, softmax, normalization, reduction, image processing, or fused compute/memory kernels.
- Familiarity with custom AI accelerator architecture, matrix engines, vector engines, systolic arrays, DMA, SRAM, NoC, or DRAM systems.
- Experience with mixed precision and quantization: FP32, FP16, BF16, FP8, INT8, INT4.
- Experience with simulator/emulator/FPGA/silicon bring-up is a plus.