Senior Infrastructure Engineer - Supercomputing

Institute of Foundation Models
Sunnyvale, CA

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

We operate some of the world’s largest GPU supercomputing clusters to support cutting-edge AI research and large-scale model deployment. We’re looking for an Infrastructure Engineer to join our core platform team and help build, operate, and scale our hybrid infrastructure across on-prem and cloud environments.

This role is ideal for engineers who thrive at the intersection of distributed systems, cloud automation, and high-performance computing.

Key Responsibilities

  • Operate and scale high-performance GPU clusters used for AI training and production inference.
  • Manage infrastructure across on-premise (Slurm-based) HPC environments and cloud providers like AWS and Azure.
  • Implement and maintain Infrastructure as Code using Pulumi, Terraform, or Ansible.
  • Enhance and secure deployment pipelines using Kubernetes, Flux, and ArgoCD.
  • Help define and enforce security best practices for internal researchers and production services.
  • Continuously improve observability, resiliency, and operational tooling across environments.

Tech Stack

  • Kubernetes, Slurm
  • Pulumi, Terraform, Ansible
  • Rust and Go
  • Flux, ArgoCD
  • AWS, Azure

Professional Experience

  • Strong experience managing compute infrastructure in hybrid environments (on-prem and cloud).
  • Hands-on experience operating Slurm clusters at scale.
  • Proficiency in deploying and managing containerized applications, ideally written in Rust or Go.
  • Solid background in IaC and CI/CD best practices.
  • Experience working with GPU workloads or HPC infrastructure is a strong plus.
  • Familiarity with securing and monitoring multi-tenant compute environments.

$200,000 - $400,000 a year

Salary depends on level.

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401(k) plan
  • Generous paid time off, sick leave, and holidays
  • Paid parental leave
  • Employee assistance program
  • Life insurance and disability

Posted 2025-09-22
