Senior Site Reliability Engineer, Managed AI
Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability.
Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.
About the Role:
At Crusoe, our Site Reliability Engineering team ensures the reliability and scalability of Crusoe’s AI-optimized cloud platform. We’re looking for a Senior Site Reliability Engineer with a strong background in distributed systems and hands-on experience with large language models to help us build and operate managed AI services at scale. This role is central to delivering highly available, performant, and cost-efficient AI infrastructure that powers compute-intensive, latency-sensitive workloads for our customers.
What You’ll Work On:
Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
Build automation and reliability tooling to support distributed AI pipelines and inference services
Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
Build telemetry, automated observability, and performance-tuning strategies for latency-sensitive AI services
Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments
What You’ll Bring:
Strong software engineering background — experience building production-grade systems beyond scripting or Bash
Demonstrated experience in distributed systems design and implementation
Hands-on work with large language models (LLMs) or AI/ML infrastructure
SRE mindset and experience (whether or not under the SRE title), including:
Defining and measuring SLIs/SLOs
Building monitoring and observability systems
Driving performance and reliability improvements
Designing fault-tolerant systems and automated testing strategies
Proficiency in at least one modern programming language (Python, Go, Java, C++)
Familiarity with Kubernetes or container orchestration platforms
Strong collaboration and communication skills
Ability to thrive in a fast-paced, mission-driven environment
Bonus Points:
Experience scaling inference or training workloads for LLMs
Benefits:
Industry-competitive pay
Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSAs
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company-paid commuter benefit of $300 per month
Compensation:
Compensation will be paid in the range of $172,000 - $209,000 + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.