Senior HPC & GPU Infrastructure Engineer

Sciforium

San Francisco, CA

Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. Backed by multi-million-dollar funding and direct sponsorship from AMD, with hands-on support from AMD engineers, the team is scaling rapidly to build the full stack powering frontier AI models and real-time applications.

About the role

We are seeking a Senior HPC & GPU Infrastructure Engineer to take full ownership of the health, reliability, and performance of our GPU compute cluster. You will be the primary custodian of our high-density accelerator environment and the linchpin between hardware operations, distributed systems, and machine learning workflows. This role spans everything from hands-on Linux systems engineering and GPU driver bring-up to maintaining the ML software stack (CUDA/ROCm, PyTorch, JAX, vLLM). If you love squeezing every bit of performance out of hardware, enjoy debugging GPUs at scale, and want to build world-class AI infrastructure, this role is for you.

What you'll do

1. System Health & Reliability (SRE)

On-Call Response: Act as the primary responder for system outages, GPU failures, node crashes, and cluster-wide incidents. Minimize downtime by resolving issues rapidly.
Cluster Monitoring: Implement and maintain monitoring for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and overall system load.
Vendor Liaison: Coordinate with data center staff, hardware vendors, and on-site technicians for repairs, RMA processing, and physical maintenance of the cluster.

2. Linux & Network Administration

OS Management: Install, patch, and maintain Linux distributions (Ubuntu / CentOS / RHEL). Ensure consistent configuration, kernel tuning, and automation for large node fleets.
Security & Access Controls: Configure VPNs, iptables/firewalls, SSH hardening, and network routing to secure our compute infrastructure.
Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity, and administer distributed file systems such as NFS, GPFS, or Lustre.

3. GPU & ML Stack Engineering

Deployment & Bring-Up: Lead deployment of new GPU nodes, including BIOS configuration, NUMA tuning, GPU topology validation, and cluster integration.
Driver & Kernel Management: Build and optimize kernel modules, and maintain GPU drivers and runtime stacks for both NVIDIA (CUDA) and AMD (ROCm).
Software Stack Maintenance: Maintain and optimize ML frameworks and libraries, including PyTorch, JAX, the CUDA toolkit, cuDNN, ROCm, NCCL, and supporting runtime systems.
Advanced Debugging: Troubleshoot complex interactions involving GPUs, compilers, ML frameworks, and distributed training runtimes (e.g., vLLM compilation failures, CUDA memory leaks, ROCm kernel crashes).

Ideal candidate profile

5+ years of experience in HPC, GPU cluster operations, Linux systems engineering, or similar roles.
Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field.
Strong expertise with NVIDIA (H100/B200) or AMD (MI325X/MI355X) GPUs, including driver and kernel-level debugging.
Deep understanding of Linux internals, kernel modules, hardware bring-up, and systems performance tuning.
Experience with network security, including VPNs, iptables/firewalld, SSH, and identity management (LDAP/FreeIPA/AD).
Proficiency in Bash and Python for scripting, automation, and workflow tooling.
Familiarity with ML software stacks: CUDA toolkit, cuDNN, NCCL, ROCm, and JAX/PyTorch runtime behavior.
Deep debugging experience with NVLink/NVSwitch fabrics and RDMA networking.

Nice-to-have

Experience with job schedulers such as Slurm, Kubernetes, or Run:AI.
Exposure to vLLM, model serving optimizations, or inference systems.
Hands-on experience with configuration management tools (Ansible, SaltStack, Terraform).
Previous experience supporting ML research teams in a startup or research-heavy environment.

Benefits include

Medical, dental, and vision insurance
401(k) plan
Daily lunch, snacks, and beverages
Flexible time off
Competitive salary and equity

Equal opportunity

Sciforium is an equal opportunity employer. All applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, or disability status.

Posted 2026-01-13
