Cluster Infrastructure Engineer

Cartesia
San Francisco, CA

About Cartesia

Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text—1B text tokens, 10B audio tokens and 1T video tokens—let alone do this on-device.

We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models or SSMs, a new primitive for training efficient, large-scale foundation models. Our team combines deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.

We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors and 90+ angels across many industries, including the world's foremost experts in AI.

About the Role

We’re looking for a Cluster Infrastructure Engineer to help build and scale the compute backbone that powers Cartesia’s research on real-time, multimodal intelligence. In this role, you’ll work at the intersection of distributed systems and infrastructure engineering, designing and operating the large-scale GPU clusters that train and serve Cartesia’s foundation models. You’ll own systems that need to be fast, reliable, and highly automated — ensuring our researchers and product teams can move at the speed of innovation. You’ll build the tooling, automation, and monitoring needed to keep clusters resilient under load, quickly diagnose and resolve issues, and continuously push the boundaries of scalability and efficiency.

Your Impact

  • Design and build large-scale GPU clusters for model training and low-latency inference

  • Develop automation for provisioning, scaling, and monitoring to ensure clusters are fast, resilient, and self-healing

  • Collaborate closely with research and product teams to enable distributed training at scale, optimizing for speed, reliability, and utilization

  • Implement robust observability and alerting systems to monitor GPU health, node stability, and job performance

  • Diagnose and triage hardware, networking, and distributed training issues across environments, coordinating with provider support as needed

  • Continuously improve cluster reliability, developer ergonomics, and overall system efficiency across Cartesia’s research and production workloads

What You Bring

  • Strong engineering fundamentals and experience building and operating large-scale distributed systems

  • Deep familiarity with HPC & GPU cluster management using Kubernetes and Slurm

  • A blend of developer empathy and raw performance engineering, designing systems and tools that are intuitive to use and fast

  • Ability to balance principled engineering with the urgency of keeping mission-critical systems alive

  • Proficiency with Infrastructure-as-Code tools (Terraform, Ansible, etc.) and observability tools (Prometheus, Grafana, etc.)

  • Strong debugging skills: comfortable diagnosing NCCL issues, CUDA errors, and network or driver-level faults

What Sets You Apart

  • Experience optimizing large-scale distributed training frameworks such as DeepSpeed, Megatron-LM, or similar

  • Familiarity with advanced parallelization techniques such as FSDP, context parallelism, or tensor parallelism

Our Culture

🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together and learning from each other every day.

🚢 We ship fast. All of our work is novel and cutting edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality and design along the way.

🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.

Posted 2025-11-28
