CAREER: TRANSFER LEARNING FOR ECONOMIC PREDICTION OF LABOR DATA

Abstract

Labor economists regularly analyze employment data by fitting predictive models to small, carefully constructed longitudinal survey datasets. Although modern machine learning methods offer promise for such problems, these survey datasets are too small to take advantage of them. In recent years large datasets of online resumes have also become available, providing data about the career trajectories of millions of individuals. However, standard econometric models cannot take advantage of their scale or incorporate them into the analysis of survey data. To this end we develop CAREER, a transformer-based model that uses transfer learning to learn representations of job sequences. CAREER is first fit to large, passively-collected resume data and then fine-tuned to smaller, better-curated datasets for economic inferences. We fit CAREER to a dataset of 24 million job sequences from resumes, and fine-tune its representations on longitudinal survey datasets. We find that CAREER forms accurate predictions of job sequences, achieving state-of-the-art predictive performance on three widely-used economics datasets. We further find that CAREER can be used to form good predictions of other downstream variables; incorporating CAREER into a wage model provides better predictions than the econometric models currently in use.

1. INTRODUCTION

A variety of economic analyses rely on models for predicting an individual's future occupations. These models are crucial for estimating important economic quantities, such as gender or racial differences in unemployment (Hall, 1972; Fairlie & Sundstrom, 1999); they underpin causal analyses and decompositions that rely on simulating counterfactual occupations for individuals (Brown et al., 1980; Schubert et al., 2021); and they inform policy, by forecasting occupations with rising or declining market shares.
These analyses typically involve fitting predictive models to longitudinal surveys that follow a cohort of individuals during their working career (Panel Study of Income Dynamics, 2021; Bureau of Labor Statistics, 2019a). Such surveys have been carefully collected to represent national demographics, ensuring that the economic analyses can generalize to larger populations. But these datasets are also small, usually containing only thousands of workers, because maintaining them requires regularly interviewing each individual. Consequently, economists use simple sequential models, where a worker's next occupation depends on their history only through the most recent occupation (Hall, 1972) or a few summary statistics about the past (Blau & Riphahn, 1999).
In recent years, however, much larger datasets of online resumes have also become available. In contrast to longitudinal surveys, these passively-collected datasets are not typically used directly for economic inferences because they contain noisy observations and they are missing important economic variables such as demographics and wage. However, they provide occupation sequences of millions of individuals, potentially expanding the scope of insights that can be obtained from analyses on downstream survey datasets. The simple econometric models currently in use cannot incorporate the complex patterns embedded in these larger datasets into the analysis of survey data.
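
To make concrete the simplest kind of sequential model mentioned above, in which the next occupation depends only on the most recent one, the following is a minimal sketch of a first-order Markov model estimated by counting transitions. It is an illustration under stated assumptions, not the estimator used in any of the cited works: the function name `fit_first_order_markov` and the toy occupation labels are hypothetical.

```python
from collections import Counter, defaultdict


def fit_first_order_markov(sequences):
    """Estimate P(next occupation | current occupation) from transition counts.

    `sequences` is a list of occupation sequences, one per worker, ordered in time.
    Returns a nested dict: transitions[current][next] = empirical probability.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair each occupation with the one that immediately follows it.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        current: {
            nxt: c / sum(next_counts.values())
            for nxt, c in next_counts.items()
        }
        for current, next_counts in counts.items()
    }


# Hypothetical toy data (not drawn from any survey or resume dataset).
toy_sequences = [
    ["cashier", "cashier", "sales_manager"],
    ["cashier", "student", "engineer"],
    ["engineer", "engineer", "engineer"],
]

transitions = fit_first_order_markov(toy_sequences)
print(transitions["cashier"])
```

Because such a model conditions only on the current occupation, two workers with very different histories but the same current job receive identical predictions. Its parameter count is also small enough to estimate from a few thousand workers, which is consistent with the reason given above for its use on survey data.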
To this end, we develop CAREER, a neural sequence model of occupation trajectories. CAREER is designed to be pretrained on large-scale resume data and then fine-tuned to small and better-curated survey data for economic prediction. Its architecture is based on the transformer language model (Vaswani et al., 2017), for which pretraining and fine-tuning has proven to be an effective

