Senior Site Reliability Engineer, GPU Infrastructure

Genmo
San Francisco, CA

We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

What You’ll Do

  • Own the design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.

  • Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi‑cluster federation.

  • Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.

  • Build CI/CD pipelines, automated testing, and rollout strategies for infra changes.

  • Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.

  • Optimize high‑performance networking (InfiniBand/RDMA) and debug perf bottlenecks.

  • Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.

  • Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.

Minimum Qualifications

  • BS/MS/PhD in CS, EE, or related field.

  • 3+ yrs SRE/DevOps in production; 2+ yrs managing large Kubernetes fleets.

  • Expert‑level Kubernetes experience.

  • Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible).

  • Track record of shipping and operating large‑scale infrastructure with high reliability and clear communication.

Nice to Have

  • Multi‑cluster / multi‑cloud (AWS, GCP, Azure, bare‑metal) production experience.

  • Hands‑on experience with containerized GPU stacks (nvidia‑container‑toolkit, GPU Operator).

  • GPU schedulers such as Slurm or Kueue.

  • Familiarity with CI/CD tooling (GitHub Actions, BuildKit).

  • Prior work with distributed training, model‑serving patterns, or other ML/GPU workloads.

Machine‑learning depth is a plus, not a prerequisite. We’ll help you level up if needed.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company, and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.

Posted 2025-09-22
