Cluster Infrastructure Engineer
About Cartesia
Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text—1B text tokens, 10B audio tokens and 1T video tokens—let alone do this on-device.
We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team combines deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.
We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors, and 90+ angels across many industries, including the world's foremost experts in AI.
About the Role
We’re looking for a Cluster Infrastructure Engineer to help build and scale the compute backbone that powers Cartesia’s research on real-time, multimodal intelligence. In this role, you’ll work at the intersection of distributed systems and infrastructure engineering, designing and operating the large-scale GPU clusters that train and serve Cartesia’s foundation models. You’ll own systems that need to be fast, reliable, and highly automated — ensuring our researchers and product teams can move at the speed of innovation. You’ll build the tooling, automation, and monitoring needed to keep clusters resilient under load, quickly diagnose and resolve issues, and continuously push the boundaries of scalability and efficiency.
Your Impact
Design and build large-scale GPU clusters for model training and low-latency inference
Develop automation for provisioning, scaling, and monitoring to ensure clusters are fast, resilient, and self-healing
Collaborate closely with research and product teams to enable distributed training at scale, optimizing for speed, reliability, and utilization
Implement robust observability and alerting systems to monitor GPU health, node stability, and job performance
Diagnose and triage hardware, networking, and distributed training issues across environments, coordinating with provider support as needed
Continuously improve cluster reliability, developer ergonomics, and overall system efficiency across Cartesia’s research and production workloads
What You Bring
Strong engineering fundamentals and experience building and operating large-scale distributed systems
Deep familiarity with HPC & GPU cluster management using Kubernetes and Slurm
A blend of developer empathy and raw performance engineering, designing systems and tools that are intuitive to use and fast
Ability to balance principled engineering with the urgency of keeping mission-critical systems alive
Proficiency with Infrastructure-as-Code tools (Terraform, Ansible, etc.) and observability tools (Prometheus, Grafana, etc.)
Strong debugging skills: comfortable diagnosing NCCL issues, CUDA errors, and network or driver-level faults
What Sets You Apart
Experience optimizing large-scale distributed training frameworks such as DeepSpeed, Megatron-LM, or similar
Familiarity with advanced parallelization techniques such as FSDP, context parallelism, or tensor parallelism
Our Culture
🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together and learning from each other every day.
🚢 We ship fast. All of our work is novel and cutting edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality and design along the way.
🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.