Product Infrastructure Engineer - Site Reliability

Zyphra
Palo Alto, CA

Zyphra is an artificial intelligence company based in Palo Alto, California.

The Role:

As an Infrastructure Engineer - Site Reliability, you’ll be responsible for designing and maintaining the systems that keep Zyphra’s infrastructure robust, observable, secure, and scalable. Your work will be essential to ensuring the reliability and reproducibility of ML workloads, the safety and control of deployments, and the long-term maintainability of our compute environments.

You’ll work across:

  • Building and improving observability systems (monitoring, logging, alerting)

  • Designing resilient build and deployment systems across research and production environments

  • Implementing secure release processes with strong auditability and rollback support

  • Collaborating closely with ML engineers, DevOps, and infra teams to improve system reliability and performance

  • Leading incident response, root-cause analysis, and postmortems with a focus on learning and prevention

This role is ideal for someone who loves building systems that make other teams faster, safer, and more productive.

Requirements:

  • Experience in high-performance compute environments, such as ML clusters or GPU farms

  • Background in infrastructure as code (e.g., Ansible, Terraform)

  • Familiarity with software release engineering for ML/AI systems is a plus

  • Experience designing reliable environments for experimental workloads and reproducible runs

  • Knowledge of compliance and audit standards in deployment and system security

  • Experience with load testing, fault injection, and chaos engineering to harden systems under stress

  • Passion for building tooling that makes infrastructure invisible and reliable for end users

Bonus Qualifications:

  • Experience with infrastructure as code (e.g., Ansible, Terraform)

  • Prior work supporting ML/AI infrastructure, including GPU management and workload optimization

  • Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)

  • Experience working with cloud platforms such as AWS, Azure, or GCP

  • Familiarity with containers (Docker, Apptainer) and their integration with scheduling systems (Slurm, Kubernetes)

Why Work at Zyphra:

  • Our research methodology is to take grounded, methodical steps toward ambitious goals. Deep research and engineering excellence are equally valued

  • We strongly value new and crazy ideas and are very willing to bet big on them

  • We move as quickly as we can; we aim to keep the bar to impact as low as possible

  • We all enjoy what we do and love discussing AI

Benefits and Perks:

  • Comprehensive medical, dental, vision, and FSA plans

  • Competitive compensation and 401(k)

  • Relocation and immigration support on a case-by-case basis

  • On-site meals prepared by a dedicated culinary team; Thursday Happy Hours

  • In-person team in Palo Alto, CA, with a collaborative, high-energy environment

If you are excited to bring reliability best practices to the frontier of AI infrastructure, this job is for you. Apply today!

Posted 2025-11-25
