Senior/Staff Site Reliability Engineer
You are a seasoned SRE who keeps production infrastructure running at scale. You own the reliability and availability of customer-facing systems — from Kubernetes clusters to deployment pipelines to the networking layer that connects it all. You think in SLOs, automate ruthlessly, and treat every incident as a chance to make the system better.
Key Responsibilities
- Own and operate our Kubernetes infrastructure: cluster lifecycle, upgrades, networking, and multi-tenant isolation for customer workloads
- Build and maintain CI/CD pipelines and deployment infrastructure
- Use AI extensively to automate the analysis and resolution of production issues, and to improve software development speed, reliability, and maintainability
- Build dashboards, alerting, and anomaly detection across our systems
- Define and enforce SLOs and build out incident response processes
- Manage and improve our networking, load balancing, and service mesh configurations
- Drive reliability improvements across the stack through automation, runbooks, and chaos engineering
Requirements
- 5+ years of experience managing critical production systems and software development workflows
- Strong production experience setting up and operating Kubernetes at scale, using infrastructure-as-code (Terraform, Ansible)
- Deep knowledge of Linux networking, container networking (CNI plugins, VXLAN, BGP), and DNS
- Experience building CI/CD systems and GitOps workflows (FluxCD, ArgoCD)
- Proficiency in Python and either Go or Bash for tooling and automation
- Strong experience with logging, monitoring and alerting (Prometheus, Grafana, Loki, Thanos, VictoriaMetrics, Datadog)
- Excellent communication and ability to drive technical decisions across teams
- Self-starter who executes quickly, takes ownership, and constantly seeks improvement
Nice to have
- Experience managing GPU and AI/ML workloads
- Experience with kernel-based monitoring and routing (eBPF, XDP)
- Experience with security tooling (Falco, Coroot, SIEM)
- Experience with bare metal Kubernetes networking (Calico, Cilium, MetalLB)
- Experience with distributed storage systems (Ceph, Longhorn, etc.)
Compensation
- $180,000-$250,000 plus equity and benefits
Location
San Francisco, CA
What we offer at fal
- Interesting and challenging work
- A lot of learning and growth opportunities
- We are currently hiring in downtown San Francisco.
- We offer visa sponsorship and will help you relocate to San Francisco.
- Health, dental, and vision insurance (US)
- Regular team events and offsites