Senior Site Reliability Engineer (SRE) - Data Center
Join a stealth-mode hyperscale data center startup building an AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready to go for experimentation, full-scale model training, or inference. As a Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access.
This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment.
If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.
If you are interested in this opportunity, get in touch! You don't want to miss out!
Responsibilities:
- Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
- Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
- Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
- Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
- Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
- Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.
Skills / Must Have:
- 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
- Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
- Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
- Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
- Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
- Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
- Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.
Benefits:
- Equity
Salary:
- $300,000 gross per year