Senior Product Manager - Observability and Resilience (Santa Clara)
NVIDIA has become the platform upon which every new AI-powered application is built. From healthcare research applications to autonomous vehicles and voice-recognition systems, there is a need to simplify and deliver predictability for AI applications and workflows, and NVIDIA is right at the center of this revolution. Resiliency and observability are key to delivering customer value and an exhilarating customer experience. This product manager will lead the development of foundational tools dedicated to ensuring the resiliency and observability of large-scale accelerated computing platforms. By creating essential tools for system diagnostics, performance monitoring, and automated recovery, they will empower customers to confidently operate both complex AI training and demanding inference workloads with maximum uptime and efficiency.
What you will be doing:
- Be a subject-matter expert on resiliency and observability. Deeply understand failure modes across the GPU hardware, network, and software stack, the telemetry signals that reveal them, and how they correlate to workload health and SLOs. Master modern reliability architectures. Keep up to date with industry trends.
- Build for everyone who wants to use these tools. Drive joint project planning. Define concrete milestones, tasks, and work items for resiliency and observability initiatives with external partners.
- Fuel innovation in reliability tooling. Lead ideation sessions to propose novel approaches and shape new proof-of-concepts.
- Bridge development, SRE, and partner teams. Facilitate clear communication, triage emergent issues rapidly, and ensure feedback loops between engineering and customer operations remain tight.
- Coordinate execution across different functions. Work with engineering, design, operations, sales, and marketing to embed resiliency and observability requirements into every product launch, capacity expansion, and lifecycle transition.
What we need to see:
- BS or MS in Computer Science, Computer Engineering, or a related field (or equivalent experience) and 12+ years of product-management experience in enterprise technology.
- Experience with GPU observability (DCGM, NVML, etc.) and integration into large-scale telemetry systems.
- Deep knowledge of AI/ML infrastructure, high-performance computing (HPC), networking, and cloud technologies (IaaS, PaaS) including containerization, Kubernetes, and automation tools.
- Familiarity with modern observability stacks: metrics, logs, traces, OpenTelemetry, Prometheus/Grafana, ELK/OpenSearch.
- Experience building, and preferably a deep understanding of, secure, compliance-focused telemetry pipelines (SOC 2, FedRAMP).
- Ability to articulate trade-offs among latency, throughput, cost, and reliability to both engineering and executive audiences.
- Data-driven approach: defines SLIs/SLOs, manages error budgets, and develops value models.
- Strong cross-functional execution: writes clear specs and PRDs, produces GTM collateral, and leads agile processes.
Ways to stand out from the crowd:
- Master's degree or PhD, or expertise in distributed systems, performance modeling, or fault-tolerant computing.
- Experience with MLOps and LLMOps ecosystems and integrating with enterprise platforms; deployments at modern data-center scale; delivered ML/AI observability solutions for LLMOps, predictive incident detection, or anomaly classification.
- Startup or 0 -> 1 experience building cloud-native observability or resilience tools; proven success bringing open-source observability products to market and shaping GTM strategy.
- Familiarity with MLOps toolchains and integrations with monitoring platforms such as Splunk, Datadog, and Grafana Cloud.
- Expertise with containerization technologies like Docker and Kubernetes, plus virtualization. Proficiency in network architecture and high-performance interconnects (InfiniBand, Ethernet, RoCE).