Data Engineer

Institute of Foundation Models
Sunnyvale, CA

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll work on the core of cutting-edge foundation model training alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Your strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

As a Data Engineer specializing in Natural Language Processing (NLP) and large-scale data processing, you will quickly and effectively gather, curate, and prepare high-quality datasets to support cutting-edge NLP research. Your role will be instrumental in enabling researchers by delivering essential data through efficient and scalable engineering practices, including web crawling, LLM-generated content refinement, and robust data pipelines, primarily leveraging Python and related technologies.

Key Responsibilities

  • Rapidly collect, curate, and preprocess datasets based on detailed specifications provided by NLP researchers, delivering data within tight timelines (typically within 1-2 days).
  • Develop and maintain efficient web crawling solutions, APIs, and automated workflows to continuously improve data collection processes.
  • Refine and evaluate outputs from Large Language Models (LLMs) to generate structured datasets suitable for model training and benchmarking.
  • Implement scalable data pipelines, ensuring efficient data processing, storage, retrieval, and distribution to research teams.
  • Collaborate closely with researchers and engineers to ensure collected data meets specified quality and relevance criteria.
  • Document data collection methodologies, dataset characteristics, and pipeline architecture clearly and effectively.
  • Engage with peer teams and participate in technical reviews to uphold best practices and data quality standards.
  • Represent MBZUAI at industry and research forums, showcasing technical capabilities in large-scale data processing and AI data infrastructure.
  • Perform all other duties as reasonably directed by the line manager commensurate with these functional objectives.

Academic Qualifications

  • Bachelor’s degree in Computer Science, Data Science, Engineering, or a related technical field required.
  • Master’s degree or equivalent experience in Computer Science, Data Engineering, or related technical fields preferred.

Professional Experience - Required

  • Extensive experience in data engineering, data processing, and automation using Python.
  • Demonstrated proficiency in designing and deploying web crawling solutions, automated data extraction, and processing pipelines.
  • Strong understanding of data structures, algorithms, databases, SQL, and performance optimization.
  • Experience working with cloud infrastructure and distributed data processing frameworks (e.g., AWS, Spark, Kafka, Kubernetes).
  • Excellent problem-solving abilities, attention to detail, and the capability to rapidly address technical challenges.
  • Strong communication and collaboration skills with cross-functional teams.

Professional Experience - Preferred

  • Proven track record of supporting NLP or AI research teams with rapid and reliable data delivery.
  • Experience with refining outputs from large-scale AI models, such as LLM-generated data.
  • Contributions to open-source projects, coding competitions, or high visibility in coding communities (e.g., GitHub, Stack Overflow).
  • Familiarity with the latest advancements in NLP data processing and large language model technologies.

Salary: $100,000 - $500,000 a year

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401(k) plan
  • Generous paid time off, sick leave, and holidays
  • Paid parental leave
  • Employee Assistance Program
  • Life insurance and disability

Posted 2025-09-22
