Site Reliability Engineer (SRE) - AI Infrastructure

San Francisco, CA

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. The platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities:

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits:

  • Equity

Salary:

  • $300,000 gross per year

Posted 2025-12-18
