Senior/Staff Machine Learning Engineer, Training Runtime Performance
- Collaborate with ML practitioners and other infrastructure teams to understand their needs and integrate optimized input pipelines seamlessly into their workflows.
- Detect, diagnose, and resolve performance bottlenecks across training, eval, and model distillation workflows.
- Optimize training performance and resource utilization, and ensure consistent, reproducible model training outcomes.
- Optimize input data pipelines to increase runtime goodput, ensuring accelerators maximize their "time on task" and minimize idle cycles.
- Champion best practices for robust, reproducible, and debuggable ML experimentation.
- B.S./M.S./Ph.D. in Computer Science, Electrical Engineering, or related technical field (or equivalent experience).
- 4+ years of professional experience in ML infrastructure, distributed training, or ML systems engineering, scaling models on multi-node, multi-accelerator clusters.
- Understanding of training, evaluation, and distillation workflows for billion-parameter models.
- Expert-level knowledge in distributed systems and (remote) Python.
- Strong skills in profiling, debugging, and optimizing quantized workloads.
- Experience with ML compilers and strategies to reduce startup overhead.
- Familiarity with model distillation and efficient inference workflows.
- Previous contributions to open source ML infra projects or research publications in ML systems.
- Hands-on experience with foundation model infrastructure.
- Highly proficient in C++, distributed systems, and ML framework internals (e.g., NCCL, Horovod, DeepSpeed, Ray).