Machine Learning Infrastructure Engineer

Institute of Foundation Models
Sunnyvale, CA

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

We’re looking for a distributed ML infrastructure engineer to help extend and scale our training systems. You’ll work side by side with world-class researchers and engineers to:

• Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)

• Implement distributed optimizers from mathematical specs

• Build robust config + launch systems across multi-node, multi-GPU clusters

• Own experiment tracking, metrics logging, and job monitoring for external visibility

• Improve training system reliability, maintainability, and performance

While much of the work will support large-scale pre-training, pre-training experience is not required. Strong infrastructure and systems experience is what we value most.

Key Responsibilities

• Distributed Framework Ownership – Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures.

• Optimizer Implementation – Translate mathematical optimizer specs into distributed implementations.

• Launch Config & Debugging – Create and debug multi-node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets.

• Metrics & Monitoring – Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers.

• Infra Engineering – Write production-quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale.

Qualifications

Must-Haves:

• 5+ years of experience in ML systems, infra, or distributed training

• Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)

• Strong software engineering fundamentals (Python, systems design, testing)

• Proven multi-node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/GLOO)

• Ability to implement algorithms across GPUs/nodes based on mathematical specs

• Experience working on an ML platform, infrastructure, or distributed inference optimization team

• Experience with large-scale machine learning workloads (strong ML fundamentals)

Nice-to-Haves:

• Exposure to mixed-precision training (e.g., bf16, fp8) with accuracy validation

• Familiarity with performance profiling, kernel fusion, or memory optimization

• Open-source contributions or published research (MLSys, ICML, NeurIPS)

• CUDA or Triton kernel experience

• Experience with large-scale pre-training

• Experience building custom training pipelines at scale and modifying them for custom needs

• Deep familiarity with training infrastructure and performance tuning

Total compensation target: $300,000–$600,000 (inclusive of base salary and target bonus of up to 30%), commensurate with experience.

• Comprehensive medical, dental, and vision

• 401(k) program

• Generous PTO, sick leave, and holidays

• Paid parental leave and family-friendly benefits

• On-site amenities and perks: complimentary lunch, gym access, and a short walk to the Sunnyvale Caltrain station

Posted 2025-09-22
