Product Infrastructure Engineer - Site Reliability
Zyphra is an artificial intelligence company based in Palo Alto, California.
The Role:
As an Infrastructure Engineer - Site Reliability, you’ll be responsible for designing and maintaining the systems that keep Zyphra’s infrastructure robust, observable, secure, and scalable. Your work will be essential to ensuring the reliability and reproducibility of ML workloads, the safety and control of deployments, and the long-term maintainability of our compute environments.
You’ll work across:
Building and improving observability systems (monitoring, logging, alerting)
Designing resilient build and deployment systems across research and production environments
Implementing secure release processes with strong auditability and rollback support
Collaborating closely with ML engineers, DevOps, and infra teams to improve system reliability and performance
Leading incident response, root-cause analysis, and postmortems with a focus on learning and prevention
This role is ideal for someone who loves building systems that make other teams faster, safer, and more productive.
Requirements:
Experience in high-performance compute environments, such as ML clusters or GPU farms
Background in infrastructure as code (e.g., Ansible, Terraform)
Familiarity with software release engineering for ML/AI systems is a plus
Experience designing reliable environments for experimental workloads and reproducible runs
Knowledge of compliance and audit standards in deployment and system security
Experience with load testing, fault injection, and chaos engineering to harden systems under stress
Passion for building tooling that makes infrastructure invisible and reliable for end users
Bonus Qualifications:
Experience with infrastructure as code (e.g., Ansible, Terraform)
Prior work supporting ML/AI infrastructure, including GPU management and workload optimization
Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)
Experience working with cloud platforms such as AWS, Azure, or GCP
Familiarity with containers (Docker, Apptainer) and their integration with scheduling systems (Slurm, Kubernetes)
Why Work at Zyphra:
Our research methodology is to make grounded, methodical steps toward ambitious goals. Deep research and engineering excellence are equally valued.
We strongly value new and crazy ideas and are very willing to bet big on them.
We move as quickly as we can; we aim to keep the bar to impact as low as possible.
We all enjoy what we do and love discussing AI.
Benefits and Perks:
Comprehensive medical, dental, vision, and FSA plans
Competitive compensation and 401(k)
Relocation and immigration support on a case-by-case basis
On-site meals prepared by a dedicated culinary team; Thursday Happy Hours
In-person team in Palo Alto, CA, with a collaborative, high-energy environment
If you are excited to bring reliability best practices to the frontier of AI infrastructure, this job is for you. Apply today!