Site Reliability Engineer (SRE) - AI Infrastructure

San Francisco, CA

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities:

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits:

  • Equity

Salary:

  • $300,000 gross per year

Posted 2025-12-18