Site Reliability Engineer (SRE) - AI Infrastructure

San Francisco, CA

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. The platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities:

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits:

  • Equity

Salary:

  • $300,000 gross per year

Posted 2025-12-18
