Senior Site Reliability Engineer (SRE) - (Dublin, CA)
- Architect and maintain scalable, highly available infrastructure for our GenAI platform.
- Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
- Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.
- Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.
- Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.
- Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.
- Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.
- Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.
- Implement and enforce security best practices across all systems and environments.
- Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.
- Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
- 8+ years of experience in DevOps, SRE, or similar roles
- Strong experience with cloud platforms (AWS, GCP, or Azure)
- Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
- Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)
- Solid background in containerization technologies (Docker, Kubernetes)
- Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)
- Strong understanding of CI/CD pipelines and automation
- Exceptional troubleshooting and problem-solving skills, including the ability to diagnose issues in complex systems
- Experience supporting AI/ML systems in production
- Knowledge of GPU infrastructure management and optimization
- Familiarity with distributed systems and high-performance computing
- Experience with database systems (SQL and NoSQL)
- Certifications in cloud platforms (AWS, GCP, Azure)
- Experience with chaos engineering and resilience testing
- Knowledge of security best practices and compliance requirements