Software Engineer - Distributed Training Infrastructure
Clockwork.io is a Silicon Valley startup that delivers state-of-the-art AI compute acceleration.
We were founded by Stanford researchers and veteran systems engineers with a shared belief: distributed systems powering modern AI require a new approach to managing time, reliability, and performance. Unlike traditional solutions that rely on specialized hardware or embedded telemetry in switches, Clockwork’s system brings deep visibility, resilience, acceleration, and efficiency to the network layer entirely through software. As AI workloads continue to scale in size, urgency, and impact, networks must evolve to keep up. Clockwork exists to make that evolution possible.
About Us
Clockwork.io – A Software-Driven Revolution in AI Networking
Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI networking, delivering deterministic time, ultra-low latency, and seamless scalability for modern distributed systems.
To learn more, visit .
About the Role
We are looking for an experienced software engineer to help build, optimize, and maintain large-scale distributed training infrastructure based on the PyTorch ecosystem. This role focuses on production-grade training workflows involving multi-GPU and multi-node orchestration, high-performance communication layers, and advanced parallelism strategies.
You’ll work alongside infrastructure and machine learning teams to ensure training jobs are efficient, scalable, and resilient.
What You Will Do
- Develop and support distributed PyTorch training jobs using torch.distributed / c10d
- Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks
- Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)
- Optimize performance across communication, I/O, and memory bottlenecks
- Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs
- Write tooling and scripts to streamline training workflows and experiment management
- Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)
What We’re Looking For
- Deep experience with PyTorch and torch.distributed (c10d)
- Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale
- Proficiency in Python and Linux shell scripting
- Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar
- Strong understanding of NCCL, collective communication, and GPU topology
- Familiarity with debugging tools and techniques for distributed systems
Preferred Skills
- Experience scaling LLM training across 8+ GPUs and multiple nodes
- Knowledge of tensor, pipeline, and data parallelism
- Familiarity with containerized training environments (Docker, Singularity)
- Exposure to HPC environments or cloud GPU infrastructure
- Experience with training workload orchestration tools or custom job launchers
- Comfort with large-scale checkpointing, resume/restart logic, and model I/O
Bonus Skills
- Profiling tools: PyTorch Profiler, Nsight, nvprof, or equivalent
- Experience with performance tuning in distributed training environments
- Contributions to ML infrastructure open-source projects
- Familiarity with storage, networking, or RDMA/GPUDirect technologies
- Understanding of observability in ML pipelines (metrics, logs, dashboards)
Enjoy
- Challenging projects.
- A friendly and inclusive workplace culture.
- Competitive compensation.
- A great benefits package.
- Catered lunch.
Clockwork is assembling world-class teams to build cutting-edge software. We look for bright people from all walks of life, and we grow together. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity, national origin, or protected veteran status and will not be discriminated against on the basis of disability.