Software Engineer, Infrastructure Generalist
Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals.
We are a small team of scientists, engineers, and builders who've created some of the most widely used AI products, like ChatGPT, Character.ai, Mistral, PyTorch, OpenAI Gym, Fairseq, and Segment Anything.
About This Role
We're looking for a Staff Software Engineer, a generalist across the backend, to help build the systems that power our foundation models.
You'll join a small, high-impact team responsible for architecting and scaling the core infrastructure behind everything we do. You’ll work across the full technical stack, solving complex distributed systems problems and building robust, scalable platforms.
Infrastructure is critical to us: it's the bedrock that enables every breakthrough. You'll work directly with researchers to accelerate experiments, improve infrastructure efficiency, and enable key insights across our models, products, and data assets.
What You’ll Do
- Design, build, and operate scalable, fault-tolerant infrastructure for LLM research: distributed compute, data orchestration, and storage across modalities.
- Develop high-throughput systems for data ingestion, processing, and transformation, including training data catalogs, deduplication, quality checks, and search.
- Build systems for traceability, reproducibility, and robust quality control at every stage of the data lifecycle.
- Implement and maintain monitoring and alerting to support platform reliability and performance.
- Collaborate with research teams to unlock new features, improve system efficiency, and accelerate training cycles.
Required Qualifications
- Technical expertise:
  - 5+ years of experience building distributed systems, ideally supporting high-scale applications or research platforms.
  - Fluent in containerization, orchestration, and distributed compute frameworks.
  - Hands-on experience with Kubernetes, Terraform, service discovery, and workflow orchestration tools.
  - Experience with network programming, load balancing, or distributed consensus systems.
  - Extensive experience with performance optimization, caching strategies, and system scalability patterns.
  - Deeply familiar with cloud infrastructure, microservices architectures, and both synchronous and asynchronous processing.
  - Strong knowledge of databases, storage systems, and how architecture choices impact performance at scale.
  - Proactive about automation, testing, and building tools that empower engineering teams.
- System design & performance:
  - Strong proficiency in systems programming languages (Rust) and scripting languages (Python).
  - Familiarity with performance profiling and optimization in high-throughput distributed environments.
  - Track record of architecting resilient systems and debugging complex production issues.
- Excellent communication and collaboration skills.
Strong Candidates May Also Have
- Experience supporting machine learning training infrastructure or GPU clusters
- Background at AI research labs, high-performance computing centers, or ML-focused companies
- Published work on distributed systems, infrastructure, or performance optimization
- Open-source contributions to infrastructure projects, orchestration tools, or distributed computing frameworks
- Experience with specialized hardware (GPUs, TPUs) and their integrations into distributed training systems
Logistics
- Location: This role is based in San Francisco, California.
- Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.
- Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.
- Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $300,000-$350,000 USD.
- We encourage you to apply even if you do not believe you meet every single qualification.
- As set forth in Thinking Machines' Equal Employment Opportunity policy, we do not discriminate on the basis of any protected group status under any applicable law.