Senior Infrastructure Engineer - Supercomputing
About the Institute of Foundation Models
We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.
As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.
The Role
We operate some of the world’s largest GPU supercomputing clusters to support cutting-edge AI research and large-scale model deployment. We’re looking for an Infrastructure Engineer to join our core platform team and help build, operate, and scale our hybrid infrastructure across on-prem and cloud environments.
This role is ideal for engineers who thrive at the intersection of distributed systems, cloud automation, and high-performance computing.
Key Responsibilities
- Operate and scale high-performance GPU clusters used for AI training and production inference.
- Manage infrastructure across on-premise (Slurm-based) HPC environments and cloud providers like AWS and Azure.
- Implement and maintain Infrastructure as Code using Pulumi, Terraform, or Ansible.
- Enhance and secure deployment pipelines using Kubernetes, Flux, and ArgoCD.
- Help define and enforce security best practices for internal researchers and production services.
- Continuously improve observability, resiliency, and operational tooling across environments.
Tech Stack
- Kubernetes, Slurm
- Pulumi, Terraform, Ansible
- Rust and Go
- Flux, ArgoCD
- AWS, Azure
Professional Experience
- Strong experience managing compute infrastructure in hybrid environments (on-prem and cloud).
- Hands-on experience operating Slurm clusters at scale.
- Proficiency in deploying and managing containerized applications, ideally written in Rust or Go.
- Solid background in IaC and CI/CD best practices.
- Experience working with GPU workloads or HPC infrastructure is a strong plus.
- Familiarity with securing and monitoring multi-tenant compute environments.
$200,000 - $400,000 a year
Salary depends on level.
Visa Sponsorship
This position is eligible for visa sponsorship.
Benefits Include
- Comprehensive medical, dental, and vision benefits
- Bonus
- 401K Plan
- Generous paid time off, sick leave and holidays
- Paid Parental Leave
- Employee Assistance Program
- Life insurance and disability