Software Engineer - Distributed Training Infrastructure

Clockwork.io
Palo Alto, CA

Clockwork.io is a Silicon Valley startup that delivers state-of-the-art AI compute acceleration.

We were founded by Stanford researchers and veteran systems engineers with a shared belief: the distributed systems powering modern AI require a new approach to managing time, reliability, and performance. Unlike traditional solutions that rely on specialized hardware or embedded telemetry in switches, Clockwork’s system brings deep visibility, resilience, acceleration, and efficiency to the network layer entirely through software. As AI workloads continue to scale in size, urgency, and impact, networks must evolve to keep up. Clockwork exists to make that evolution possible.

About Us

Clockwork.io – A Software-Driven Revolution in AI Networking

Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI networking, delivering deterministic time, ultra-low latency, and seamless scalability for modern distributed systems.

To learn more, visit .

About the Role

We are looking for an experienced software engineer to help build, optimize, and maintain large-scale distributed training infrastructure based on the PyTorch ecosystem. This role focuses on production-grade training workflows involving multi-GPU and multi-node orchestration, high-performance communication layers, and advanced parallelism strategies.

You’ll work alongside infrastructure and machine learning teams to ensure training jobs are efficient, scalable, and resilient.

What You’ll Do



  • Develop and support distributed PyTorch training jobs using torch.distributed / c10d

  • Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks

  • Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)

  • Optimize performance across communication, I/O, and memory bottlenecks

  • Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs

  • Write tooling and scripts to streamline training workflows and experiment management

  • Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)

What We’re Looking For



  • Deep experience with PyTorch and torch.distributed (c10d)

  • Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale

  • Proficiency in Python and Linux shell scripting

  • Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar

  • Strong understanding of NCCL, collective communication, and GPU topology

  • Familiarity with debugging tools and techniques for distributed systems

Preferred Skills


  • Experience scaling LLM training across 8+ GPUs and multiple nodes

  • Knowledge of tensor, pipeline, and data parallelism

  • Familiarity with containerized training environments (Docker, Singularity)

  • Exposure to HPC environments or cloud GPU infrastructure

  • Experience with training workload orchestration tools or custom job launchers

  • Comfort with large-scale checkpointing, resume/restart logic, and model I/O

Bonus Skills


  • Profiling tools: PyTorch Profiler, Nsight, nvprof, or equivalent

  • Experience with performance tuning in distributed training environments

  • Contributions to ML infrastructure open-source projects

  • Familiarity with storage, networking, or RDMA/GPUDirect technologies

  • Understanding of observability in ML pipelines (metrics, logs, dashboards)

Enjoy


  • Challenging projects.

  • A friendly and inclusive workplace culture.

  • Competitive compensation.

  • A great benefits package.

  • Catered lunch.

Clockwork is assembling world-class teams to build cutting-edge software. We look for bright people from all walks of life, and we grow together. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity, national origin, or protected veteran status and will not be discriminated against on the basis of disability.

Posted 2026-02-25
