Site Reliability Engineer (SRE) - AI Infrastructure

San Francisco, CA

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities:

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits:

  • Equity

Salary:

  • $300,000 gross per year

Posted 2025-12-18
