Site Reliability Engineer (SRE) - AI Infrastructure

San Francisco, CA

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities:

Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:

7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits:

Equity

Salary:

$300,000 gross per year

Posted 2025-12-18
