Staff Site Reliability Engineer

SHEIN

San Diego, CA

About SHEIN

SHEIN is a global online fashion and lifestyle retailer, offering SHEIN branded apparel and products from a global network of vendors, all at affordable prices. Headquartered in Singapore, with more than 15,000 employees operating from offices around the world, SHEIN is committed to making the beauty of fashion accessible to all, promoting its industry-leading, on-demand production methodology, for a smarter, future-ready industry.

Position Summary

We are seeking a Staff Site Reliability Engineer (Official Title: Staff Site Reliability Engineer I) with deep experience operating and evolving large-scale, mission-critical systems where availability and reliability are non-negotiable. At SHEIN, Site Reliability Engineers are hybrid software and systems engineers responsible for keeping production services always on while enabling the platform to scale rapidly and safely. In this role, you will own and support complex services and infrastructure, ensuring they consistently meet reliability and performance expectations. At the Staff level, you will also provide technical leadership, influencing platform architecture, reliability strategy, and operational standards across the organization.

The SRE team owns and maintains critical open-source and in-house technologies that underpin the platform and serves as a core contributor to major engineering initiatives. We are accountable for driving platform operability forward by reducing incident frequency, minimizing MTTR, and improving system resilience, efficiency, and resource utilization.

You will work closely with global, cross-functional teams to design, build, and evolve observability and operational tooling (metrics, logs, traces, alerting, and automation) that provides deep visibility into system behavior. Through hands-on engineering and operational excellence, you will proactively identify risks and failure modes, help prevent incidents before they occur, and lead fast, effective responses when they do.

To succeed in this role, you will combine strong software engineering skills, solid expertise in Linux, networking, and distributed systems, and a passion for solving problems of scale, complexity, and reliability. Your work will directly contribute to delivering a stable, scalable, and high-performing experience for customers worldwide.

Job Responsibilities

Keep SHEIN’s mission-critical production systems running 24/7/365, participating in on-call rotations and acting decisively during incidents.

Triage and resolve production incidents, driving root cause analysis and contributing to continuous improvements that reduce MTTR and prevent recurrence.

Monitor and manage capacity planning and resource utilization, partnering with cross-functional teams to ensure systems scale safely while remaining cost-effective.

Own and operate core open-source infrastructure such as APISIX, Nginx, Kubernetes, Kafka, Elasticsearch, Redis, Consul, etcd, ZooKeeper, and other large-scale distributed systems.

Design, build, and maintain observability solutions (metrics, logs, traces, alerting) to improve system visibility, reliability, and resiliency.

Automate operational workflows and eliminate manual toil through scripting, tooling, and process improvements.

Develop and maintain technical documentation, including runbooks, architecture diagrams, operational procedures, and on-call playbooks.

Work closely with global engineering teams to improve infrastructure reliability and performance through better system design and operational discipline.

Mentor senior and mid-level SREs, raising the overall technical bar and operational maturity of the team.

Lead efforts to modernize the platform in alignment with industry best practices and evolving technology standards.

Job Requirements

Bachelor’s degree in Computer Science, Information Systems, or a related technical discipline, or equivalent practical experience.

6+ years of experience owning and operating large-scale, high-traffic, 24/7 production systems, ideally in cloud or cloud-native environments.

Solid foundations in Linux, networking, and distributed systems, with the ability to debug complex production issues end to end.

Hands-on experience with incident response, troubleshooting, and performance optimization in distributed systems.

Strong software engineering skills with experience building automation, tooling, or platforms in languages such as Python or Go.

Experience operating or supporting open-source infrastructure components such as APISIX, Nginx, Kubernetes, Kafka, Elasticsearch, Redis, Consul, etcd, and ZooKeeper.

Experience with observability and monitoring systems (Prometheus, Grafana, Zabbix, etc.) and performance analysis.

Familiarity with Git, CI/CD pipelines, and configuration management tools (e.g., Ansible).

A strong sense of ownership, a systematic approach to problem-solving, and a passion for making systems more reliable.

Strong communication skills and the ability to collaborate effectively with geographically distributed teams.

Nice to Have

Bilingual fluency in Mandarin and English.

Kubernetes Administrator certification or equivalent real-world experience.

Experience operating big data platforms (Hadoop, YARN, HBase, Hive, Spark).

Experience applying AI/LLM-powered tools to reliability engineering, including designing and building automation or internal tools using AI-assisted development platforms (e.g., Claude Code).

Benefits and Perks

Bonus and RSU eligible

Healthcare (medical, dental, vision, prescription drugs)

Health Savings Account with Employer Funding

Flexible Spending Accounts (Healthcare and Dependent care)

Company-Paid Basic Life/AD&D insurance

Company-Paid Short-Term and Long-Term Disability

Voluntary Benefit Offerings (Voluntary Life/AD&D, Hospital Indemnity, Critical Illness, and Accident)

Employee Assistance Program

Business Travel Accident Insurance

401(k) Savings Plan with discretionary company match and access to a financial advisor

Vacation, paid holidays, floating holiday, and sick days

Employee discounts

Free weekly catered lunch

Dog-friendly office (available at select locations)

Free gym access (available at select locations)

Free swag giveaways

Annual Holiday Party

Invitations to pop-ups and other company events

Complimentary daily office snacks and beverages

Pay Range

$108,000 - $180,000 USD

Posted 2026-02-25
