Site Reliability Engineer
Position overview:
Our center provides essential HPC and data systems to more than 10,000 researchers working in areas such as alternative energy, climate science, energy efficiency, environmental science, and other missions.
As a Site Reliability Engineer, you will be part of a 24/7 operations team that ensures our systems are accessible, reliable, secure, and available to the scientific community. You will work with a state-of-the-art data collection and monitoring system to maintain and optimize performance across complex HPC and data environments.
What You Will Do at Level 2
- Work five shifts per week monitoring a large HPC facility, including 2–3 overnight shifts (midnight–8 a.m.) per week.
- Split time between on-site and off-site shifts depending on staffing needs.
- Review and respond to alerts from computing systems, storage, networks, and other data center/facility systems by triaging or escalating to on-call staff.
- Develop solutions to improve processes, prevent recurrence of issues, and automate responses to routine service conditions.
- Identify areas for improved monitoring and automation; propose and implement solutions.
- Respond to monitoring alerts to ensure continuous 24/7 data collection for real-time diagnoses.
- Develop and maintain tools within the monitoring pipeline in collaboration with the Operations Team.
- Write software that feeds alerts and notifications from HPC system APIs into the monitoring pipeline.
- Configure software and solve technical issues to ensure programs scale reliably with increasing data volume.
- Collaborate with other groups to ensure workflows are understood and maintained.
- Assign technical tasks to other monitoring team members as needed.
- Coordinate system maintenance activities and manage diagnostic and notification software during outages.
- Provide accurate documentation in ticketing systems for outages, updates, and incidents.
- Work on and resolve problems of diverse scope where analysis requires evaluation of identifiable factors.
- Provide leadership in developing monitoring and alerting pipelines, documentation, and software.
- Contribute to the design and deployment of the monitoring cluster.
- Partner with other technical groups to improve monitoring experiences.
- Tackle complex problems requiring in-depth evaluation of variable factors.
- Determine methods and procedures for new assignments; may coordinate the activities of other team members.
- Typically requires 5+ years of related experience with a Bachelor’s degree, or 3+ years with a Master’s degree, or equivalent work experience.
- Strong hands-on knowledge of Linux shell and command-line environments.
- Experience developing tools using languages such as C, C++, Perl, Java, or Python.
- Knowledge of IT infrastructure and large data communication networks supporting highly available systems.
- Ability to learn and work with data center management technologies (e.g., Kubernetes, Prometheus, alerting/monitoring tools, building management software, cooling/power systems).
- Strong communication skills and ability to collaborate across multiple technical teams.
- Experience working in a 24/7 operations team managing large data centers or installations.
- Knowledge of network security, ACLs, firewalls, and protocols.
- Relevant certifications in system administration or related areas.
- Typically requires 8+ years of related experience with a Bachelor’s degree, or 6+ years with a Master’s degree, or equivalent.
- Advanced expertise in one or more programming languages such as C, C++, Perl, Java, or Python.
- Demonstrated excellence with monitoring and automation tools.
- Experience leading technical projects.
- Strong ability to respond proactively to complex issues.
- Shift: Includes overnight “Owl” shifts (12 a.m. – 8 a.m.), primarily on-site.
- This is a full-time, exempt position (paid monthly).
- A background check is required. Convictions are reviewed in relation to job responsibilities and do not automatically disqualify applicants.
- This position requires substantial on-site presence, but hybrid schedules may be available depending on business needs. Candidates must reside within 150 miles of the work site.