Software Engineer, Inference Platform

Fluidstack
San Francisco, CA

About Fluidstack

At Fluidstack, we’re building the infrastructure for abundant intelligence. We partner with top AI labs, governments, and enterprises, including Mistral, Poolside, Black Forest Labs, and Meta, to unlock compute at the speed of light.

We’re working with urgency to make AGI a reality. As such, our team is highly motivated and committed to delivering world-class infrastructure. We treat our customers’ outcomes as our own, taking pride in the systems we build and the trust we earn. If you’re motivated by purpose, obsessed with excellence, and ready to work very hard to accelerate the future of intelligence, join us in building what's next.

About the Role

Inference is now the defining cost and latency bottleneck for frontier AI. Fluidstack’s Inference Platform team owns the serving layer that sits between our global accelerator supply and the production workloads our customers run on it: LLM serving frameworks, KV cache infrastructure, disaggregated prefill/decode pipelines, and Kubernetes-based orchestration across multi-datacenter footprints.

This is a hands-on IC role at the intersection of distributed systems, model optimization, and serving infrastructure. You’ll own end-to-end inference deployments for frontier AI labs and our inference product, drive measurable improvements in throughput, cost-per-token, and time-to-first-token, and contribute to the platform architecture choices that determine how Fluidstack deploys across tens of thousands of accelerators.


You will:

  • Own inference deployments end-to-end: from initial configuration and performance tuning to production SLA maintenance and incident response.

  • Drive measurable improvements in throughput, TTFT, and cost-per-token across diverse model families (dense transformers, mixture-of-experts, multi-modal) and customer workload patterns.

  • Build and operate KV cache and scheduling infrastructure to maximize utilization across concurrent requests.

  • Implement and validate disaggregated prefill/decode pipelines and the Kubernetes orchestration that supports them at scale.

  • Profile and resolve bottlenecks at the compute, memory, and communication layers; instrument deployments for end-to-end observability.

  • Partner with customers to translate their model architectures, access patterns, and latency requirements into deployment configurations and upstream platform improvements.

  • Contribute to inference platform architecture and roadmap, with a focus on reducing deployment complexity, improving hardware utilization, and expanding support for new model classes and accelerators.

  • Participate in an on-call rotation (up to one week per month) to maintain the reliability and SLA commitments of production deployments.


Basic Qualifications

  • 5+ years of professional software engineering experience with a track record of shipping production-quality systems.

  • Strong programming skills in Python and/or Go.

  • Hands-on production experience with at least one LLM serving framework (vLLM, SGLang, TensorRT-LLM, TGI, or equivalent).

  • Working knowledge of PyTorch or JAX and an understanding of how model architecture choices affect inference characteristics.

  • Experience deploying and operating GPU workloads on Kubernetes at production scale, including autoscaling and resource scheduling.

  • Solid understanding of GPU memory hierarchies, compute parallelism, and the tradeoffs across tensor, pipeline, and expert parallelism strategies.

  • Ability to create structure from ambiguity and communicate technical tradeoffs clearly to both engineering peers and customers.

  • Great written and verbal communication skills in English.


Preferred Qualifications

  • Production experience with disaggregated prefill/decode architectures (NVIDIA Dynamo, llm-d, or equivalent), including scheduling policies and network fabric configuration.

  • Deep familiarity with KV cache strategies: RadixAttention, slab-based memory allocators, cross-request prefix sharing, and cache-aware scheduling.

  • Experience with multi-node GPU inference across InfiniBand or RoCE fabrics, including NCCL collective communication tuning.

  • Custom kernel or operator development experience (e.g., CUDA, Triton, torch.compile, Pallas, or equivalent).

  • Contributions to open-source inference engines (vLLM, SGLang, TGI, TensorRT-LLM, or similar).

  • Hands-on experience with quantization tooling: GPTQ, AWQ, FP8 via llm-compressor, or AutoGPTQ.

  • Knowledge of speculative decoding implementations (Medusa, EAGLE-3, draft-model approaches) and their performance/quality tradeoffs.

  • Experience optimizing and adapting model implementations for non-NVIDIA accelerators and their ecosystems: AMD, TPU, Trainium/Inferentia, Cerebras, Groq, and other custom ASICs.


Salary & Benefits

  • Competitive total compensation package (salary + equity).

  • Retirement or pension plan, in line with local norms.

  • Health, dental, and vision insurance.

  • Generous PTO policy, in line with local norms.

The base salary range for this position is $165,000 – $500,000 per year, depending on experience, skills, qualifications, and location. This range represents our good faith estimate of the compensation for this role at the time of posting. Total compensation may also include equity in the form of stock options.

We are committed to pay equity and transparency.

Fluidstack is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, protected veteran status, or any other characteristic protected by law. Fluidstack will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.

You will receive a confirmation email once your application has been successfully submitted. If there is an error with your submission and you did not receive a confirmation email, please email [email protected] with your resume/CV, the role you've applied for, and the date you submitted your application; someone from our recruiting team will be in touch.

Posted 2026-03-10
