Senior Site Reliability Engineer, GPU Infrastructure

Genmo
San Francisco, CA

We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

What You’ll Do

  • Own the design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.

  • Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi‑cluster federation.

  • Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.

  • Build CI/CD pipelines, automated testing, and rollout strategies for infra changes.

  • Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.

  • Optimize high‑performance networking (InfiniBand/RDMA) and debug perf bottlenecks.

  • Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.

  • Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.

Minimum Qualifications

  • BS/MS/PhD in CS, EE, or related field.

  • 3+ yrs SRE/DevOps in production; 2+ yrs managing large Kubernetes fleets.

  • Expert‑level Kubernetes experience.

  • Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible).

  • Track record of shipping and operating large‑scale infrastructure with high reliability and clear communication.

Nice to Have

  • Multi‑cluster / multi‑cloud (AWS, GCP, Azure, bare‑metal) production experience.

  • Hands‑on with containerized GPU stacks (nvidia‑container‑toolkit, GPU Operator).

  • GPU schedulers such as Slurm or Kueue.

  • Familiarity with CI/CD tooling (GitHub Actions, BuildKit).

  • Prior work with distributed training, model‑serving patterns, or other ML/GPU workloads.

Machine‑learning depth is a plus—not a prerequisite. We’ll help you level up if needed.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company, and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.

Posted 2025-11-25
