Machine Learning Infrastructure Engineer

Institute of Foundation Models
Sunnyvale, CA

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll work on the core of cutting-edge foundation model training alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will help develop groundbreaking AI solutions that have the potential to reshape entire industries. Your strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

We’re looking for a distributed ML infrastructure engineer to help extend and scale our training systems. You’ll work side by side with world-class researchers and engineers to:

• Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)

• Implement distributed optimizers from mathematical specs

• Build robust config + launch systems across multi-node, multi-GPU clusters

• Own experiment tracking, metrics logging, and job monitoring for external visibility

• Improve training system reliability, maintainability, and performance

While much of the work will support large-scale pre-training, pre-training experience is not required. Strong infrastructure and systems experience is what we value most.

Key Responsibilities

• Distributed Framework Ownership – Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures.

• Optimizer Implementation – Translate mathematical optimizer specs into distributed implementations.

• Launch Config & Debugging – Create and debug multi-node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets.

• Metrics & Monitoring – Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers.

• Infra Engineering – Write production-quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale.

Qualifications

Must-Haves:

• 5+ years of experience in ML systems, infra, or distributed training

• Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)

• Strong software engineering fundamentals (Python, systems design, testing)

• Proven multi-node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/Gloo)

• Ability to implement algorithms across GPUs/nodes based on mathematical specs

• Experience working on an ML platform/infrastructure and/or distributed inference optimization team

• Experience with large-scale machine learning workloads (strong ML fundamentals)

Nice-to-Haves:

• Exposure to mixed-precision training (e.g., bf16, fp8) with accuracy validation

• Familiarity with performance profiling, kernel fusion, or memory optimization

• Open-source contributions or published research (MLSys, ICML, NeurIPS)

• CUDA or Triton kernel experience

• Experience with large-scale pre-training

• Experience building custom training pipelines at scale and modifying them for custom needs

• Deep familiarity with training infrastructure and performance tuning

Total compensation target: $300,000–$600,000 (inclusive of base salary and target bonus of up to 30%), commensurate with experience.

• Comprehensive medical, dental, and vision

• 401(k) program

• Generous PTO, sick leave, and holidays

• Paid parental leave and family-friendly benefits

• On-site amenities and perks: Complimentary lunch, gym access, and a short walk to the Sunnyvale Caltrain station

Posted 2025-12-22
