Software Engineer, ML & Data Infra

xAI
Palo Alto, CA

About xAI


xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role


The ML and Data Infrastructure team is responsible for building the foundational infrastructure that powers frontier AI models and truth-seeking agents—from petabyte-scale data acquisition and multimodal crawling, to web-scale search/retrieval systems, reliable high-throughput inference serving, low-level GPU/kernel optimizations, compiler/runtime innovations, and high-speed interconnect fabrics for massive clusters. In this role, you will collaborate across pre-training, multimodal, reasoning, and product teams in a fast-paced, meritocratic environment where you will tackle ambiguous, high-stakes problems with first-principles thinking and rigorous execution.

Responsibilities



  • Design, build, and operate petabyte-to-exabyte scale distributed systems for data acquisition, web crawling, preprocessing, filtering/classification, and multimodal pipelines (CPU/GPU workloads).

  • Architect high-performance search/retrieval engines (vector/hybrid/semantic) at trillion-document scale, integrating with LLMs/agents for truth-seeking, low-hallucination reasoning, and real-time knowledge access.

  • Develop reliable inference serving infrastructure: load balancing, autoscaling, KV cache, batching, fault-tolerance, monitoring (Prometheus/Grafana), CI/CD (Buildkite/ArgoCD), and benchmarking for 100% uptime and optimal tail latency.

  • Optimize low-level performance: CUDA kernels (GEMM, attention), Triton/CUTLASS extensions, quantization/distillation/speculative decoding, GPU memory hierarchy, and model-hardware co-design for next-gen architectures.

  • Innovate on compilers/runtimes (JAX/XLA/MLIR, custom features for Hopper/Blackwell), distributed profiling/debugging tools, and interconnect fabrics (copper/optical, 1.6T+, SerDes/photonics, topology simulation, vendor roadmaps).

  • Manage complex workloads across clouds/clusters: orchestration (Kubernetes), data bookkeeping/verifiability, high-speed interconnect validation, failure analysis, and telemetry/automation for production reliability.

Required Qualifications



  • Strong systems engineering skills with proven impact on large-scale distributed infrastructure (data processing, search, inference, or cluster networking).

  • Proficiency in Python and at least one compiled language (Rust, C++, Go, Java); experience building bespoke libraries, optimizing performance, and debugging complex systems.

  • Hands-on experience with at least one key area: petabyte-scale data pipelines/crawling (Spark/Ray/Kubernetes), web-scale search/retrieval (vector DBs, ranking, RAG), inference optimization (SGLang, kernels, batching), compiler features (JAX/XLA), or high-speed interconnects (optical/copper, SerDes, signal integrity).

  • Deep understanding of distributed systems challenges: high-throughput ops/sec, latency/throughput tradeoffs, fault-tolerance, monitoring, and scaling production systems to billions of users or 100k+ GPUs.

  • Passion for AI infrastructure: keeping up with SOTA techniques, first-principles problem-solving, meticulous organization/bookkeeping, and delivering rigorous, high-quality results.

Preferred Qualifications



  • Experience with multimodal data (images/video/audio), epistemics/truth-seeking in retrieval, or agentic systems (long-horizon reasoning, feedback loops).

  • Low-level optimizations: CUDA kernel development (Tensor cores, attention), GPU profiling (Nsight), low-precision numerics, or interconnect pathfinding (LPO/LRO/CPO, photonics).

  • Production expertise in inference reliability (0% error target), CI/CD for ML, or cluster networking (topology, vendor collaboration, failure root-cause).

  • Track record of owning end-to-end projects in hyperscale environments, with strong debugging, vendor management, or open-source contributions (e.g., SGLang).

Annual Salary Range


$180,000 - $440,000 USD

Benefits


Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

xAI is an equal opportunity employer. For details on data processing, view our

Posted 2026-03-10
