Site Reliability Engineer - US Government

xAI
Palo Alto, CA

About xAI


xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role


We are seeking a highly skilled Senior Infrastructure Engineer to join our US Government Team, focused on designing, building, and operating secure, scalable infrastructure for critical government projects. In this role, you will develop and manage training and inference clusters, as well as highly reliable applications, across bare metal, classified cloud, and hybrid cloud architectures. You will leverage your expertise in Kubernetes and GPU hardware to deliver robust, secure systems that support large-scale AI workloads while meeting stringent federal compliance requirements. This role demands a passion for automation, observability, and ensuring system integrity in a fast-paced, high-security environment.

Responsibilities



  • Develop and optimize software to provision and manage xAI’s infrastructure across on-premises, virtual machine, and classified cloud environments, enabling efficient scaling for US government initiatives.

  • Enhance the reliability, performance, and cost-effectiveness of infrastructure to support large-scale AI and application workloads in secure, classified settings.

  • Collaborate with xAI engineers to understand workload requirements and design tailored solutions that meet government-specific needs and compliance standards.

  • Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems, adhering to federal protocols.

  • Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible, with a focus on secure data handling.

  • Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs, while maintaining security and compliance.

  • This is an in-person role based in Palo Alto, CA or Washington, DC, with up to 50% travel required.

Required Qualifications



  • Active Top Secret (TS) security clearance.

  • 5+ years of experience as an Infrastructure Engineer, Site Reliability Engineer, or similar role, with a focus on building and maintaining reliable, scalable systems, preferably in secure or government environments.

  • Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible.

  • Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components.

  • Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs.

  • Excellent communication and documentation skills, with the ability to handle sensitive information concisely and accurately.

Preferred Qualifications



  • Deep familiarity with installing and using GPU hardware, including setting up drivers, debugging issues, and ensuring reliability.

  • Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments in classified or federal settings.

  • Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience in government projects.

  • Proficiency with tools such as Kyverno or ArgoCD, or with Go programming for infrastructure automation.

  • Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges in secure environments.

  • Passion for problem-solving and a proactive drive to deliver impactful results while adhering to security protocols.

  • Certifications in security-related fields (e.g., CISSP) or experience in secure federal environments.

Interview Process


After submitting your application, our team will review your CV and statement of exceptional work. If your application advances, you will be invited to a 15-minute phone interview to discuss basic qualifications. Successful candidates will proceed to the main process, which includes:


  1. A technical deep-dive into your infrastructure and secure systems experience.

  2. A hands-on challenge focused on designing or troubleshooting infrastructure for secure environments.

  3. A meet-and-greet with the wider team.

Our goal is to complete the main interview process within one week.

Annual Salary Range


$180,000 - $440,000 USD

Benefits


Base salary is just one part of our total rewards package at xAI, which also includes equity; comprehensive medical, vision, and dental coverage; access to a 401(k) retirement plan; short- and long-term disability insurance; life insurance; and various other discounts and perks.

xAI is an equal opportunity employer.

Posted 2025-10-19
