Senior Site Reliability Engineer (SRE) - Data Center

San Francisco, CA

Join a stealth-mode hyperscale data center startup building an AI and cloud platform powered by thousands of H100s, H200s, and B200s and ready for experimentation, full-scale model training, or inference. As a Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment.

If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

If you are interested in this opportunity, get in touch! You don't want to miss out!

Responsibilities:

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits:

  • Equity

Salary:

  • $300,000 gross per year

Posted 2025-11-21
