Machine Learning Infrastructure Engineer

Institute of Foundation Models
Sunnyvale, CA

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll work on the core of cutting-edge foundation model training alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Your strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

We’re looking for a distributed ML infrastructure engineer to help extend and scale our training systems. You’ll work side by side with world-class researchers and engineers to:

• Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)

• Implement distributed optimizers from mathematical specs

• Build robust config + launch systems across multi-node, multi-GPU clusters

• Own experiment tracking, metrics logging, and job monitoring for external visibility

• Improve training system reliability, maintainability, and performance

While much of the work will support large-scale pre-training, pre-training experience is not required. Strong infrastructure and systems experience is what we value most.

Key Responsibilities

• Distributed Framework Ownership – Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures.

• Optimizer Implementation – Translate mathematical optimizer specs into distributed implementations.

• Launch Config & Debugging – Create and debug multi-node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets.

• Metrics & Monitoring – Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers.

• Infra Engineering – Write production-quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale.

Qualifications

Must-Haves:

• 5+ years of experience in ML systems, infra, or distributed training

• Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)

• Strong software engineering fundamentals (Python, systems design, testing)

• Proven multi-node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/Gloo)

• Ability to implement algorithms across GPUs/nodes based on mathematical specs

• Experience working on an ML platform, infrastructure, or distributed inference optimization team

• Experience with large-scale machine learning workloads (strong ML fundamentals)

Nice-to-Haves:

• Exposure to mixed-precision training (e.g., bf16, fp8) with accuracy validation

• Familiarity with performance profiling, kernel fusion, or memory optimization

• Open-source contributions or published research (MLSys, ICML, NeurIPS)

• CUDA or Triton kernel experience

• Experience with large-scale pre-training

• Experience building custom training pipelines at scale and modifying them for custom needs

• Deep familiarity with training infrastructure and performance tuning

$300,000 - $600,000 a year

Total compensation target: $300,000–$600,000 (inclusive of base salary and target bonus of up to 30%), commensurate with experience.

• Comprehensive medical, dental, and vision

• 401(k) program

• Generous PTO, sick leave, and holidays

• Paid parental leave and family-friendly benefits

• On-site amenities and perks: Complimentary lunch, gym access, and a short walk to the Sunnyvale Caltrain station

Posted 2025-09-22
