Site Reliability Engineer (SRE) - AI Infrastructure
Are you looking for an exciting new opportunity?
Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.
This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.
Responsibilities:
- Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
- Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
- Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
- Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
- Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
- Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.
Skills / Must Have:
- 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
- Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
- Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
- Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
- Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
- Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
- Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.
Benefits:
- Equity
Salary:
- $300,000 gross per year