Site Reliability Engineer

xAI
Palo Alto, CA

About xAI


xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role


We are seeking a highly skilled Senior Site Reliability Storage Engineer to join our mission-driven team, focusing on designing, building, and optimizing Kubernetes clusters across multiple regions. In this role, you will leverage your expertise in Kubernetes orchestration and distributed systems to enhance the reliability, performance, and cost-effectiveness of xAI’s infrastructure. You will collaborate closely with engineering teams to deliver robust, scalable solutions that support large-scale AI workloads. The ideal candidate is passionate about automation, observability, and ensuring the integrity of critical systems in a fast-paced, innovative environment.

Responsibilities



  • Develop and optimize software to provision and manage Kubernetes clusters on-premises, enabling xAI to scale efficiently.

  • Enhance the reliability, performance, and cost-effectiveness of Kubernetes infrastructure to support large-scale AI and application workloads.

  • Collaborate with xAI engineers to understand workload requirements and design tailored Kubernetes solutions to meet their needs.

  • Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems.

  • Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible.

  • Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs.

  • Contribute to the Kubernetes stack, applying expertise in CNI, CRI, CSI, and related components.

  • This is an in-person role based in Palo Alto, CA, with up to 25% travel required.

Required Qualifications



  • 5+ years of experience as a Site Reliability Engineer or in a similar role, with a focus on building and maintaining reliable, scalable systems.

  • Proven expertise in managing Kubernetes infrastructure using tools like Cluster API (CAPI) and kubeadm.

  • Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible.

  • Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components.

  • Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs.

Preferred Qualifications



  • Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments.

  • Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience.

  • Proficiency with tools such as Kyverno or Argo CD, or with Go programming for infrastructure automation.

  • Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges.

  • Passion for problem-solving and a proactive drive to deliver impactful results.

  • A sense of adventure and humor to navigate challenges with a positive mindset.

Annual Salary Range


$180,000 - $440,000 USD

Benefits


Base salary is just one part of our total rewards package at xAI, which also includes equity; comprehensive medical, vision, and dental coverage; access to a 401(k) retirement plan; short- and long-term disability insurance; life insurance; and various other discounts and perks.

xAI is an equal opportunity employer.

Posted 2025-10-13
