OPTIMAL DATA SAMPLING FOR TRAINING NEURAL SURROGATES OF PROGRAMS

Anonymous

Abstract

Programmers and researchers are increasingly developing surrogates of programs, models of a subset of the observable behavior of a given program, to solve a variety of software development challenges. Programmers train surrogates on measurements of a program's behavior over a dataset of input examples. We present a methodology for optimally sampling datasets to train neural network-based surrogates of programs. We first characterize the optimal proportion of data to sample from each path in a program, based on the complexity of learning that path. We next present a program analysis that determines the complexity of the different paths in a program. We evaluate these results on a large-scale graphics program, demonstrating that theoretically optimal sampling yields empirical improvements in accuracy.

1. INTRODUCTION

Programmers and researchers are increasingly developing surrogates of programs, models of a subset of the observable behavior of a given program, to solve a variety of software development challenges (Renda et al., 2021). For example, Esmaeilzadeh et al. (2012) train small neural networks to mimic existing programs, then deploy the neural networks in place of the programs to speed up computation. Generally, surrogates are used to accelerate programs (Esmaeilzadeh et al., 2012; Mendis et al., 2019; Munk et al., 2019), apply transfer learning to programs (Tercan et al., 2018; Kustowski et al., 2020; Kwon & Carloni, 2020), and approximate the gradient of programs to optimize their inputs (Renda et al., 2020; She et al., 2019; Tseng et al., 2019).

Dataset generation.

Training a surrogate of a program requires measurements of the behavior of the program on a dataset of input examples. There are three common approaches to collecting this dataset. The first is to use data that is uniformly sampled (or sampled using another manually defined distribution) from the input space of the program (Tseng et al., 2019; Kustowski et al., 2020). The second is to use data instrumented from running the original program on a workload of interest (Renda et al., 2020; Esmaeilzadeh et al., 2012). The third is to use active learning (Settles, 2009), a class of online methods that iteratively query labels for data points based on the expected improvement in accuracy resulting from additional samples (İpek et al., 2006; She et al., 2019; Pestourie et al., 2020). These approaches show promise, but they face challenges with programs with control flow. Programs with control flow (e.g., branches and loops) are piecewise functions: each control flow path induces a different trace of operations that are applied to the input. The sampling techniques above do not optimally allocate samples between different paths, resulting in surrogates that do not adequately learn the behavior of the program along all paths. For example, Renda et al. (2020, Section IV.A) identify a scenario in which an instrumented dataset does not exercise a set of paths in the program enough times for the surrogate to learn the behavior along those paths.

Our approach.

Our approach uses the source code and semantics of the program under study to guide dataset generation for training a surrogate of the program. The core concept is to analyze the complexity of each path in a program and to allocate more samples to paths that are more complex to learn.

Stratified functions.

Our approach represents the program as a stratified function, a function with different behavior in different regions (strata) of the input space (i.e., a piecewise function).
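As a concrete illustration (a hypothetical toy program, not one from this paper), a branch partitions the input space into strata, and each control-flow path applies a different trace of operations:

```python
# Toy program with control flow (hypothetical example for illustration).
# Each path through the branch applies a different trace of operations to x,
# so the program computes a piecewise (stratified) function of its input.
def program(x: float) -> float:
    if x < 0.0:                  # path 1: stratum x < 0
        return -2.0 * x + 1.0    # trace: negate-scale, then shift
    else:                        # path 2: stratum x >= 0
        return x * x             # trace: square

# Written explicitly as a stratified function:
#   f(x) = -2x + 1  if x < 0
#   f(x) = x^2      if x >= 0
```

A surrogate trained mostly on inputs from one stratum can fit that path well while remaining inaccurate on the other, which is the failure mode the path-aware sampling above is meant to avoid.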
We use

1. We choose the term stratified by analogy with the technique of stratified sampling.
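The allocation idea can be sketched as follows. Note that the per-path complexity scores and the proportional-to-complexity rule here are illustrative assumptions; this section states only that more complex paths receive more samples, not the paper's actual optimality criterion.

```python
# Sketch of complexity-weighted sample allocation across program paths.
# The complexity scores and the proportional rule are illustrative
# assumptions, not the paper's derived optimal proportions.
def allocate_samples(path_complexity: dict[str, float], budget: int) -> dict[str, int]:
    total = sum(path_complexity.values())
    # Allocate samples to each path in proportion to its complexity.
    alloc = {p: int(budget * c / total) for p, c in path_complexity.items()}
    # Hand any rounding leftovers to the most complex paths first.
    leftover = budget - sum(alloc.values())
    for p in sorted(path_complexity, key=path_complexity.get, reverse=True)[:leftover]:
        alloc[p] += 1
    return alloc

# Example: a path that is harder to learn receives more of the budget.
print(allocate_samples({"branch_true": 1.0, "branch_false": 3.0}, budget=100))
```

Compare this with uniform sampling over the input space, which allocates samples in proportion to each stratum's volume rather than its learning difficulty.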

