Software Engineer - Distributed Training Infrastructure

Clockwork.io
Palo Alto, CA

Clockwork.io is a Silicon Valley startup that delivers state-of-the-art AI compute acceleration.

We were founded by Stanford researchers and veteran systems engineers with a shared belief: the distributed systems powering modern AI require a new approach to managing time, reliability, and performance. Unlike traditional solutions that rely on specialized hardware or embedded telemetry in switches, Clockwork’s system brings deep visibility, resilience, acceleration, and efficiency to the network layer entirely through software. As AI workloads continue to scale in size, urgency, and impact, networks must evolve to keep up. Clockwork exists to make that evolution possible.

About Us

Clockwork.io – A Software-Driven Revolution in AI Networking

Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI networking, delivering deterministic time, ultra-low latency, and seamless scalability for modern distributed systems.

To learn more, visit .

About the Role

We are looking for an experienced software engineer to help build, optimize, and maintain large-scale distributed training infrastructure based on the PyTorch ecosystem. This role focuses on production-grade training workflows involving multi-GPU and multi-node orchestration, high-performance communication layers, and advanced parallelism strategies.

You’ll work alongside infrastructure and machine learning teams to ensure training jobs are efficient, scalable, and resilient.

What You Will Do



  • Develop and support distributed PyTorch training jobs using torch.distributed / c10d

  • Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks

  • Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)

  • Optimize performance across communication, I/O, and memory bottlenecks

  • Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs

  • Write tooling and scripts to streamline training workflows and experiment management

  • Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)

What We’re Looking For



  • Deep experience with PyTorch and torch.distributed (c10d)

  • Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale

  • Proficiency in Python and Linux shell scripting

  • Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar

  • Strong understanding of NCCL, collective communication, and GPU topology

  • Familiarity with debugging tools and techniques for distributed systems

Preferred Skills


  • Experience scaling LLM training across 8+ GPUs and multiple nodes

  • Knowledge of tensor, pipeline, and data parallelism

  • Familiarity with containerized training environments (Docker, Singularity)

  • Exposure to HPC environments or cloud GPU infrastructure

  • Experience with training workload orchestration tools or custom job launchers

  • Comfort with large-scale checkpointing, resume/restart logic, and model I/O

Bonus Skills


  • Profiling tools: PyTorch Profiler, Nsight, nvprof, or equivalent

  • Experience with performance tuning in distributed training environments

  • Contributions to ML infrastructure open-source projects

  • Familiarity with storage, networking, or RDMA/GPUDirect technologies

  • Understanding of observability in ML pipelines (metrics, logs, dashboards)

Enjoy


  • Challenging projects

  • A friendly and inclusive workplace culture

  • Competitive compensation

  • A great benefits package

  • Catered lunch

Clockwork is assembling world-class teams to build cutting-edge software. We look for bright people from all walks of life, and we grow together. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity, national origin, or protected veteran status and will not be discriminated against on the basis of disability.

Posted 2025-12-19
