Senior Site Reliability Engineer
Role:
Northwood is looking for a Senior Site Reliability Engineer to architect and lead the monitoring and reliability systems that keep satellites connected to Earth. As we rapidly scale our ground station network across multiple continents, you'll design and build the observability infrastructure that ensures our space communications systems operate 24/7 for customers ranging from commercial satellite operators to national security missions.
This is a high-impact leadership role where you'll architect global-scale reliability platforms while mentoring junior engineers and establishing SRE practices across the organization. You'll work directly with our founding engineering team and department heads to define the monitoring, alerting, and deployment strategies that will scale with us from startup to enterprise. If you're excited about space technology and want to architect infrastructure that directly supports mission-critical satellite operations while building and leading technical teams, this role offers that opportunity.
Responsibilities:
Architect and maintain the enterprise observability stack (Grafana, Prometheus, Loki, Vector, VictoriaMetrics) that monitors ground stations, satellite communications, and multi-region AWS infrastructure
Design SRE practices, error budgets, and SLO/SLI frameworks for mission-critical satellite systems with 99.9%+ uptime requirements
Build advanced AWS infrastructure with Terraform, implementing multi-region reliability, automated scaling, and disaster recovery for ground station operations
Lead CI/CD pipeline architecture using GitLab and ArgoCD with advanced deployment strategies for mission-critical software releases
Mentor junior engineers and establish reliability standards across the growing engineering organization
Design comprehensive Kubernetes deployments with Helm, focusing on high availability and zero-downtime operations
Lead incident response, conduct post-mortems, and drive systematic reliability improvements
Basic Qualifications:
5-8 years of production infrastructure and SRE experience with demonstrated leadership in reliability improvements and team mentorship
Expert-level experience with Kubernetes, Docker, and container orchestration in large-scale production environments
Strong background in infrastructure as code (Terraform) and advanced CI/CD practices with experience mentoring others on these technologies
Advanced AWS experience including multi-region architectures, networking, security, and cost optimization, with demonstrated ability to architect complex cloud solutions
Proven track record of leading technical projects from conception to production in fast-moving, high-growth environments
Deep understanding of SRE principles, error budgets, SLOs/SLIs, and experience implementing reliability frameworks across engineering organizations
Preferred Qualifications:
Production experience architecting and scaling observability tools (Vector, Loki, Grafana, Prometheus, VictoriaMetrics) in high-throughput environments
Advanced experience with HashiCorp Vault, Okta, and enterprise identity/secrets management systems including policy design and implementation
Previous experience scaling infrastructure and leading technical teams at high-growth companies (startup to 500+ employees)
AWS Professional certification or equivalent demonstrated expertise with advanced cloud networking, security, and compliance frameworks
Strong Linux system administration and networking expertise with experience troubleshooting complex distributed systems
Background in aerospace, telecommunications, defense contracting, or other mission-critical, highly regulated industries
Experience with ITAR, NIST 800-171, or other defense/aerospace compliance requirements