Software Engineer - Distributed Training Infrastructure

Clockwork.io
Palo Alto, CA

Clockwork.io is a Silicon Valley startup that delivers state-of-the-art AI compute acceleration.

We were founded by Stanford researchers and veteran systems engineers with a shared belief: distributed systems powering modern AI require a new approach to managing time, reliability, and performance. Unlike traditional solutions that rely on specialized hardware or embedded telemetry in switches, Clockwork’s system brings deep visibility, resilience, acceleration, and efficiency to the network layer entirely through software. As AI workloads continue to scale in size, urgency, and impact, networks must evolve to keep up. Clockwork exists to make that evolution possible.

About Us

Clockwork.io – A Software-Driven Revolution in AI Networking

Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI networking, delivering deterministic time, ultra-low latency, and seamless scalability for modern distributed systems.

About the Role

We are looking for an experienced software engineer to help build, optimize, and maintain large-scale distributed training infrastructure based on the PyTorch ecosystem. This role focuses on production-grade training workflows involving multi-GPU and multi-node orchestration, high-performance communication layers, and advanced parallelism strategies.

You’ll work alongside infrastructure and machine learning teams to ensure training jobs are efficient, scalable, and resilient.

What You Will Do



  • Develop and support distributed PyTorch training jobs using torch.distributed / c10d

  • Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks

  • Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)

  • Optimize performance across communication, I/O, and memory bottlenecks

  • Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs

  • Write tooling and scripts to streamline training workflows and experiment management

  • Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)

What We’re Looking For



  • Deep experience with PyTorch and torch.distributed (c10d)

  • Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale

  • Proficiency in Python and Linux shell scripting

  • Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar

  • Strong understanding of NCCL, collective communication, and GPU topology

  • Familiarity with debugging tools and techniques for distributed systems

Preferred Skills


  • Experience scaling LLM training across 8+ GPUs and multiple nodes

  • Knowledge of tensor, pipeline, and data parallelism

  • Familiarity with containerized training environments (Docker, Singularity)

  • Exposure to HPC environments or cloud GPU infrastructure

  • Experience with training workload orchestration tools or custom job launchers

  • Comfort with large-scale checkpointing, resume/restart logic, and model I/O

Bonus Skills


  • Profiling tools: PyTorch Profiler, Nsight, nvprof, or equivalent

  • Experience with performance tuning in distributed training environments

  • Contributions to ML infrastructure open-source projects

  • Familiarity with storage, networking, or RDMA/GPUDirect technologies

  • Understanding of observability in ML pipelines (metrics, logs, dashboards)

Enjoy


  • Challenging projects.

  • A friendly and inclusive workplace culture.

  • Competitive compensation.

  • A great benefits package.

  • Catered lunch.

Clockwork is assembling world-class teams to build cutting-edge software. We look for bright people from all walks of life, and we grow together. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity, national origin, or protected veteran status and will not be discriminated against on the basis of disability.

Posted 2026-01-07