Applied Data Scientist, Evaluation & Model Behavior

AGI, Inc.

San Francisco, CA

Think Different. Build the Future.

Our Mission

Build everyday AGI. Trustworthy, consumer-grade agents that redefine human–AI collaboration for millions. Software shouldn’t wait for commands; it should partner with you, amplifying what you can do every single day.

Why AGI, Inc.

We’re a stealth team of elite founders and AI researchers, with backgrounds spanning Stanford, OpenAI, and DeepMind. We’re industry leaders in mobile and computer-use agents, bringing these capabilities to consumer scale.

Grounded in years of agent research, our AI is designed with trustworthiness and reliability as core pillars, not afterthoughts.

We are supported by tier-1 investors who funded the first generation of AI giants; now they’re backing us to build the next: everyday AGI.

If you see possibility where others see limits, read on.

About the Role

As an Applied Data Scientist focused on Evaluation & Model Behavior, you will design and implement the systems used to measure and improve the performance of Computer Use Agents.

This is not a support role. You will be responsible for the technical definition of model quality, including the design of evaluation metrics, the curation of training datasets, and the engineering of system prompts. You'll work directly with the engineering team to translate product requirements into technical specifications and quantifiable benchmarks.

You'll focus on rigor, clarity, and impact, ensuring every metric, dataset, and prompt moves us toward more reliable, trustworthy agents.

What You'll Do

Model Behavior Design: Translate product requirements into technical specifications for model behavior. Engineer system prompts and few-shot examples to address specific capability gaps and behavioral failures.

Evaluation Design: Define metrics for reasoning, tool usage, and safety, and validate these metrics against human judgment to ensure statistical rigor.

Data Strategy: Design algorithms to filter, score, and select training data. Write Python scripts to sanitize inputs and manage the training data lifecycle from raw logs to high-quality datasets.

Failure Analysis: Investigate regressions in model benchmarks. Diagnose root causes, distinguishing among data quality issues, prompt instruction failures, and underlying model capability gaps, then implement fixes.

Ground Truth Management: Define rubrics and guidelines for human annotation. Maintain reference datasets ("Golden Sets") to establish a consistent baseline for model performance evaluation.

Minimum Qualifications

Master's degree or PhD in Computer Science, Data Science, Statistics, or a related technical field, or equivalent practical experience
3+ years of experience in Data Science, Machine Learning, or Applied Science
Proficiency in Python, with experience writing production-quality code for data pipelines or evaluation harnesses
Experience with experimental design, A/B testing, or statistical analysis

Preferred Qualifications

Experience with Large Language Models (LLMs), including prompt engineering, fine-tuning, or RLHF workflows
Experience building automated evaluation systems or implementing model-based evaluation frameworks
Ability to translate product requirements into measurable technical metrics
Experience managing human-in-the-loop data pipelines or annotation quality control

Why This Role Matters

You can't improve what you can't measure. You can't ship what you can't trust.

You will set the technical bar for quality in our agents: the metrics that predict real-world success, the datasets that encode user intent, and the prompts that shape model behavior. Your work will directly determine how quickly we can iterate and how confidently we can ship.

Our Culture

All in, in person — work moves faster face-to-face
Ship by default — speed and polish can coexist
One band, one sound — radical candor, zero politics

Perks

Competitive company-sponsored medical, dental, and vision insurance
Top-tier relocation and immigration support

How to Apply

Send us:

A link — or 60-second video — of something you built and why it matters
Your resume or LinkedIn
Two sentences on the hardest problem you've cracked

Every exceptional candidate hears back within 48 hours.

If you see possibility where others see limits, we'd love to meet you.

Posted 2026-04-04
