Site Reliability Engineer (SRE) - AI Infrastructure

San Francisco, CA

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities:

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits:

  • Equity

Salary:

  • $300,000 gross per year

Posted 2025-12-18
