OPTIMAL DATA SAMPLING FOR TRAINING NEURAL SURROGATES OF PROGRAMS Anonymous

Abstract

Programmers and researchers are increasingly developing surrogates of programs, models of a subset of the observable behavior of a given program, to solve a variety of software development challenges. Programmers train surrogates from measurements of the behavior of a program on a dataset of input examples. We present a methodology for optimally sampling datasets to train neural network based surrogates of programs. We first characterize the optimal proportion of data to sample from each path in a program based on the complexity of learning the path. We next provide a program analysis to determine the complexity of different paths in a program. We evaluate these results on a large-scale graphics program, demonstrating that theoretically optimal sampling results in empirical improvements in accuracy.

1. INTRODUCTION

Programmers and researchers are increasingly developing surrogates of programs, models of a subset of the observable behavior of a given program, to solve a variety of software development challenges (Renda et al., 2021) . For example, Esmaeilzadeh et al. (2012) train small neural networks to mimic existing programs, then deploy the neural networks in place of the programs to speed up computation. Generally, surrogates are used to accelerate programs (Esmaeilzadeh et al., 2012; Mendis et al., 2019; Munk et al., 2019) , apply transfer learning to programs (Tercan et al., 2018; Kustowski et al., 2020; Kwon & Carloni, 2020) , and approximate the gradient of programs to optimize their inputs (Renda et al., 2020; She et al., 2019; Tseng et al., 2019) .

Dataset generation.

Training a surrogate of a program requires measurements of the behavior of the program on a dataset of input examples. There are three common approaches to collecting this dataset. The first is to use data that is uniformly sampled (or sampled using another manually defined distribution) from the input space of the program (Tseng et al., 2019; Kustowski et al., 2020) . The second is to use data instrumented from running the original program on a workload of interest (Renda et al., 2020; Esmaeilzadeh et al., 2012) . The third is to use active learning (Settles, 2009) , a class of online methods that iteratively query labels for data points based on the expected improvement in accuracy resulting from additional samples ( İpek et al., 2006; She et al., 2019; Pestourie et al., 2020) . These approaches show promise, but they face challenges with programs with control flow. Programs with control flow (e.g. branches and loops) are piecewise functions: each control flow path induces a different trace of operations that are applied to the input. The sampling techniques above do not optimally allocate samples between different paths, resulting in surrogates which do not adequately learn the behavior of the program along all paths. For example, Renda et al. (2020, Section IV .A) identify a scenario in which an instrumented dataset does not exercise a set of paths in the program enough times for the surrogate to learn the behavior along those paths. Our approach. Our approach uses the source code and semantics of the program under study to guide dataset generation for training a surrogate of the program. The core concept is to analyze the complexity of each path in a program and to allocate more samples to paths that are more complex to learn. Stratified functions. Our approach represents the program as a stratified function, a function with different behavior in different regions (strata) of the input space (i.e., a piecewise function). 1 We use stratified surrogates to model such functions. To construct a stratified surrogate, we train independent surrogates of each component of the stratified function. At evaluation time, a stratified surrogate checks which stratum an input is in (using the original program) then applies the corresponding surrogate. This evaluation-time stratum check must not preclude the use of the surrogate for its downstream task. We therefore adopt a standard modeling assumption in the approximate computing literature: that precisely determining paths is an acceptable cost during approximate program execution (Sampson et al., 2011; Carbin et al., 2013) . 2,3 Optimal sampling. With this stratified modeling assumption, we then determine how many samples to allocate to train each surrogate. Using neural network sample complexity bounds for learning analytic functions (Arora et al., 2019; Agarwala et al., 2021) we calculate a complexity for each component function which gives an upper bound on how many samples are required to learn the behavior of that component to a given error. Given a data distribution describing the frequency of each component and given each component function's complexity, we then derive the optimal number of samples to allocate to training each surrogate of each component, minimizing the upper bound on the stratified surrogate's error. Complexity analysis. We present a programming language, TURACO, in which programs denote stratified functions with well-defined complexity measures. We provide a program analysis for TURACO programs that automatically determines the strata of the function and calculates an upper bound on the complexity of each component of the stratified function that the program denotes. Renderer demonstration. To demonstrate that optimal sampling using our complexity analysis improves surrogate accuracy on downstream tasks, we present a case study of learning a surrogate of a renderer in a video game engine. We show that our optimal sampling approach results in between 15% and 47% lower error than training using distributions that do not take into account path complexity. These accuracy improvements correlate with perceptual improvements in the generated renders.

Contributions.

In sum, we present the following contributions: • An optimal approach to allocating samples among strata to train stratified neural network surrogates of stratified analytic functions that minimizes the upper bound on the surrogate's error. • A programming language, TURACO, in which all programs are learnable stratified functions, and a program analysis to determine the complexity of learning surrogates of those programs. • An evaluation of these results on a graphics program, demonstrating that theoretically optimal sampling using TURACO's complexity analysis results in empirical improvements in accuracy. We lay the groundwork for analyzing optimal sampling approaches for training surrogates of programs. Our results hold out the promise of surrogate training approaches that intelligently use the program's semantics to guide the design and training of surrogates of programs.

2. EXAMPLE

Figure 1a presents an example distilled from our evaluation (Section 5) that we use to demonstrate how optimal path sampling, sampling from paths according both to their frequency in a data distribution and to their complexity, results in a more accurate surrogate than frequency-based path sampling, sampling according to the frequency of paths alone. Program under study. We study a graphics program that calculates the luminance (i.e., brightness) at a point in a scene as a function of sunPosition, the height of the sun in the sky (i.e., the time of day) which ranges from -1 to 1, and emission, a property of the material at that point which ranges from -1 to 1. The program first checks whether it is daytime (Line 2), and sets the ambient lighting variable accordingly. The program next checks whether the sun position is above a threshold (Line 7) and sets the emission variable accordingly. The output is then the sum of the ambient light and the light emitted by the material. Figure 1b presents the output of this program on inputs between -1 and 1. The path conditions (Lines 2 and 7) partition the program into three traces: nighttime, when sunPosition is less than 0 (Figure 1c ); twilight, when sunPosition is between 0 and 0.1 (Figure 1d ); and daytime, when sunPosition is greater than 0.1 (Figure 1e ). These paths are separated by dashed black lines in Figure 1b . Training a surrogate of this program poses a particular challenge because of the different behavior of these traces. Furthermore, these traces have different relative complexities: when sunPosition is less than 0.1 the function is linear, but when sunPosition is above 0.1 the function is quadratic. We must ensure that the data distribution that we use for training surrogates reflects not only the different paths of the program, but also the relative complexities of each path of the program. Optimal path sampling. We present an approach to determining the optimal amount of training data to sample from each path to train a stratified surrogate of this program. Specifically, given a data distribution and a data budget we want to find the optimal number of data points to sample from each path to minimize the expected error of a surrogate of the program over the data distribution. Intuitively, our approach is to prioritize sampling paths that are frequent and paths that are complex (and thus require more samples to learn). First we determine the frequency of each path in the underlying data distribution. We assume that the data has a uniform distribution over inputs between -1 and 1. This results in path frequencies for the nighttime path (sunPosition < 0) of 50%, the twilight path (0 < sunPosition < 0.1) of 10%, and the daytime path (0.1 < sunPosition) of 40%. Next we determine the sample complexity of each path, the number of samples required to learn the function along that path to a given error. We use the sample complexity results of Agarwala et al. (2021) , who give an upper bound on the number of samples required to learn a neural network approximation of a given function. Using this bound (as implemented by our TURACO analysis described in Section 4.2), we determine that the twilight path takes 1.5× as many samples to train a surrogate to a given error as the nighttime path, and the daytime path requires 5× as many samples. Finally, we combine these metrics to determine the optimal sampling rates. Using Equation (3) in Section 3.2, we find that the optimal sampling rate is to sample 39.3% of the data from the nighttime path, 14.3% of the data from the twilight path, and 46.4% of the data from the daytime path. Stratified surrogates. The class of surrogate model for which the above approach is optimal is that of a stratified neural surrogate -a set of disjoint neural networks which are applied based on which path the inputs induce in the program. Concretely, this means that we train one surrogate per path, and pick which to apply for each input at evaluation time. For this example program, picking which surrogate to apply just requires comparing sunPosition against constant threshold values. Training methodology. For each surrogate, we train a 1-hidden-layer MLP with 512 hidden units with a ReLU activation, using 10,000 steps of Adam with learning rate 0.0005 and batch size 128. Results. Figure 2 presents the error of surrogates trained to mimic this program, as a function of the training dataset size. On the left, Figure 2a presents the error of surrogates of each path. On the right, Figure 2b presents the error of stratified surrogates of the entire program for optimal path sampling and a baseline of sampling according to path frequency alone. The x axis of each plot is the dataset size used to train the surrogate (on the left, the dataset size per path; on the right, the total dataset size used for all paths). The y axis of each plot is the error of the resulting surrogate (lower is better). Figure 2a shows that though the complexity measure is not exactly proportional to the empirical sample complexity, it does correlate with it: at a given data budget, the daytime path with the highest complexity has the highest error, followed by twilight then nighttime. Figure 2b shows that the optimal path sampling approach results in lower error than sampling according to path frequency alone. For datasets of total size below 70 samples, the surrogate trained with optimal path sampling has a geometric mean decrease in error of 27.5%. For datasets of total size above 70 samples, the surrogate trained with optimal path sampling has a geometric mean decrease in error of 5.5%. Across the entire range of dataset sizes evaluated in this plot, the surrogate trained with optimal path sampling has a geometric mean decrease in error of 15%.

3. OPTIMAL SAMPLING

In this section, we formally define stratified functions and stratified surrogates, and derive the optimal sampling distribution to use when training a stratified surrogate of stratified function.

3.1. SETUP

We define a learning algorithm, a function that trains a surrogate of a given input function, as a random function tr Equation 1 defines a given function f as probably approximately correctly learnable (abbreviated as learnable for the remainder of the paper) for a given learning algorithm tr and loss function if for all distributions D, with high probability 1δ the learning algorithm returns a surrogate f that approximately matches the original function f over the distribution D: ∀D, ∈ (0,1),δ ∈ (0,1).∃n. P : (X → Y)×D ×N×L → (X → Y) that takes a function f : X → Y from inputs x ∈ X to outputs y ∈ Y, a distribution D ∈ D over inputs x, f ∼tr (f ,D,n, ) E x∼D f (x),f (x) ≤ ≥ 1-δ (1) Following Arora et al. (2019) and Agarwala et al. (2021) we study functions and learning algorithms for which there is a measure of the complexity ζ(f ) of function f for the learning algorithm tr such that the relationship between n, ζ(f ), , and δ in Equation ( 1) is: ∃C.n ≤ C ζ(f )+log δ -1 2 (2) We instantiate the complexity measure ζ(f ) for neural networks in Section 3.3. We define a stratified function f as follows: f (x)      f 1 (x) if x ∈ s 1 . . . f c (x) if x ∈ s c where c is the number of strata, {s i } c i=1 are strata, ∀i = j. s i ∩s j = ∅, and ∪ i s i = X . We define a stratified surrogate f as a stratified function with components f i . For a data distribution D, let D(x) be the probability that x is sampled from D, and D(s i ) be the total probability mass of all data points within s i over D (i.e., D(s i ) = x∈si D(x)). Let D(x|s i ) be the probability of a data point x sampled from D if x ∈ s i .

3.2. OPTIMAL SAMPLING

Our goal is to learn a stratified surrogate f of a stratified function f , where each component function f i is learnable. We are given a data distribution D, a maximum sample budget n, a learning algorithm tr , a loss function , and a failure probability δ. Our task is therefore to find the number of samples to allocate to each stratum to train a surrogate of that function component. We assume that each surrogate component's failure probability δ i = δ c , which satisfies the overall failure probability by union bound. Note that Equation ( 2) is an upper bound and not an exact equality. We are therefore minimizing the upper bound of the error of the resulting surrogate, rather than directly minimizing the error. Formally, we solve the following optimization problem: argmin {ni} c i=1 ∈N + E x∼D f (x),f (x) s.t. c i=1 n i ≤ n and f i ∼ tr (f i ,D(x|s i ),n i , ) Theorem 3.1. With δ i = δ c , the upper bound of Equation ( 3) is minimized at: n i = n D(s i ) ζ(f i )+log(cδ -1 ) 2 3 c i=1 D(s i ) ζ(f i )+log(cδ -1 ) 2 3 , The proof of this theorem is presented in Appendix C.

3.3. NEURAL NETWORK LEARNABILITY

Equation ( 2) defines the required sample complexity for learning f as a function of ζ(f ), the complexity of f . This section defines ζ(f ) for training neural network surrogates of analytic functions. This section is an abridged summary of assumptions and results presented by Agarwala et al. (2021) and Arora et al. (2019) ; refer to Agarwala et al. (2021) for the full set of assumptions. Agarwala et al. (2021) provide a calculus for learning surrogates of analytic functions f (around 0) based on the tilde f of the function: f (x) = ∞ n=0 a n x n f (x) ∞ n=0 |a n |x n Note the following properties for x ≥ 0: h(x) ≤    f (x)+g(x) if h(x) = f (x)+g(x) f (x)•g(x) if h(x) = f (x)•g(x) f (g(x)) if h(x) = f (g(x) ) and f converges for g(x) (4) For a 2-layer neural network trained with stochastic gradient descent, if f is analytic, x is on the d-dimensional unit sphere ( x ∈ S d ), β ∈ R d (a parameter set to control the scale of the inputs), and is 1-Lipschitz, then f (β • x) is learnable with: ζ(f ) = β 2 f ( β 2 )+ f (0) 2 (5) 4 TURACO: PROGRAMS AS STRATIFIED FUNCTIONS In this section we present TURACO, a programming language in which all programs denote learnable stratified functions. We provide a program analysis for TURACO programs which calculates an upper bound on the complexity of each component of the stratified functions that the program denotes. Figure 3 presents the syntax of TURACO, a loop-free IMP-like language (Winskel, 1993) . A TURACO program p takes a list of inputs x, executes a top-level statement s, and returns a single variable x. Statements s are skips, sequences, assignments, or if statements. Expressions e are variables x, values v, binary operations b, or unary operations u. TURACO supports analytic operations (e.g., NEG, SIN, EXP), including those which can be represented by a power series within a given domain: LOG1P computes log(1+x) for x ∈ (-1,1]. We restrict the supported operations to those required to implement the case study in Section 5. Appendix D.1 presents the full set of semantics for TURACO.

4.2. COMPLEXITY ANALYSIS

We now present a program analysis that gives an upper bound on the complexity of traces of TURACO programs. This section presents a core set of rules; Appendix D.2 presents the full analysis. The analysis uses three core concepts: a complexity interpretation of expressions to calculate an upper bound on the tilde of expressions based on the calculus in Equation ( 4), a dual-number execution (Wengert, 1964; Griewank & Walther, 2008) to calculate the derivative of the upper bound on the tilde (to compute the first term in Equation ( 5)), and a path analysis which splits the program by paths to compute the complexity of each trace. Figure 4 presents the big-step relations used to calculate an upper bound on the complexity for a subset of expressions. The relation σ,[e] ⇓ (ṽ,ṽ ) says that under the variable complexity mapping σ (mapping variables to tuples with their respective tilde and tilde derivative), the expression e has ẽ ≤ ṽ and ẽ ≤ ṽ . Note that the rule for MUL uses the upper bound for multiplication in Equation (4). Figure 5 presents the complexity relation for if statements. For statements, σ,s ⇓ Σ says that under the variable complexity mapping σ, the statement s results in a set of paths with complexity mappings Σ(p) for path p. We use a period to denote string concatenation (e.g., l.p to prepend p with l). σ,v ⇓ (|v|,0) σ,x ⇓ σ(x) σ,e 1 ⇓ (ṽ 1 ,ṽ 1 ) σ,e 2 ⇓ (ṽ 2 ,ṽ 2 ) σ,MUL(e 1 ,e 2 ) ⇓ (ṽ 1 •ṽ 2 ,ṽ 1 •ṽ 2 +ṽ 2 •ṽ 1 ) Figure 4 : Complexity relation for expressions in TURACO. To calculate the complexity ζ as defined in Section 3.3 (for a given input β parameter, which represents the scale of the input data) of each path of a program, we use the statement relation to calculate f (β) and f (0). Appendix D.3 presents this rule, along with the theorem that the complexity calculated by this analysis is an upper bound on the complexity as defined in Section 3.3. σ,s 1 ⇓ Σl σ,s 2 ⇓ Σr σ,if (e > 0) { s 1 } else { s 2 } ⇓ l.p → Σl (p) p ∈ Σl ∪ r.p → Σr (p) p ∈ Σr

5. RENDERER DEMONSTRATION

In this section we present a case study of our optimal sampling results and complexity analysis. The program under study is a demonstration 3D renderer (Lettier, 2019) , such as forms the core of a graphics rendering pipeline for a movie or 3D game engine (Christensen et al., 2018; Tatarchuk, 2006) . Figures 6a and 6b show scenes that the renderer generates. We demonstrate that the sampling and analysis techniques in Sections 3 and 4 consistently result in more accurate surrogates than those trained using baseline distributions (the frequency distribution of paths and the uniform distribution). Compared to training surrogates on the frequency distribution of paths, optimal path sampling decreases error by 15%. Compared to training on the uniform distribution of paths, optimal path sampling decreases error by 47%. These improvements in error correspond to perceptual improvements in the generated images, as shown in Figures 6c to 6e.

5.1. PROGRAM UNDER STUDY

The full renderer program is a 2750 lines-of-code C++ program, which invokes 38 different GLSL shader programs totaling 2446 lines of code. We learn a surrogate of a section of one core shader, totaling 60 lines of code.foot_5 Figure 18 in Appendix E presents the code under study. Input-output specification. This program is a shader which assigns colors to pixels in the image based on the scene geometry, materials, lights, and other properties. The program is called for each pixel that is rendered in the image. Each invocation the of program takes as input a set of 11 fixed-size vectors, totaling 35 inputs. The program returns as output a set of 2 fixed-size vectors, totaling 8 outputs. 6 These outputs are two RGBA colors, the first representing the base color of the pixel, and the second representing the color and intensity of a specular map at that pixel. Scenes and datasets. We evaluate the renderer on four different scenes, which we combine into nine different datasets. Figures 6a and 6b present two of the four different scenes under consideration; the four scenes are all combinations of views from the front and top, during the day and night. We combine these scenes into nine datasets: a dataset with each scene, a dataset combining each scene from each angle (front day and front night, top day and top night), a dataset combining each scene from each time of day, and a dataset combining all scenes. Figure 14 in Appendix E presents the full set of scenes under study. Table 1 : Top: the identifier, lines of code, complexity, and description of each path present in our datasets. Bottom: the distribution (abbreviated distr.) of each path across a subset of datasets: the frequency (freq.) of each observed path, and the optimal sampling rate (opt.) of that path. Frequency -4% +6% +2% -49% +3% -33% -2% -13% -29% -15% Uniform -46% -32% -36% -45% -47% -64% -35% -58% -53% -47% Paths. The program is a conjunction of 48 different paths, 9 of which are exercised by the renderer. The top part of Table 1 presents statistics about the paths under study, showing the identifier (a trace of l and r characters denoting which branch of each if statement the path takes), the lines of code in the corresponding trace, and the complexity of the corresponding trace according to the analysis in Section 4.2. The paths are broken up into a path for rendering smoke particles from the chimney, water particles in the river, and the solids of the ground and house. Each set of paths is duplicated for twilight, nighttime, and daytime. Within each time of day, the smoke paths are the least complex, followed by water then solids. Across time, twilight paths the least complex, followed by daytime then nighttime. Table 1 also presents the observed and optimal distributions of paths for each dataset. In general, the twilight paths are rarer than the nighttime paths, which are rarer than the daytime paths: this is because data collection for the nighttime scenes extends through twilight and into the morning. For all datasets, the smoke paths are rarer than the water paths, which are in turn rarer than the solids paths; this is purely due to the number of points observed for each scene. Appendix E.1 presents code, statistics, and visualizations of all paths in the scene.

5.2. SURROGATE TRAINING AND DEPLOYMENT METHODOLOGY

To create and deploy a surrogate of the renderer, we train a surrogate of each path, then create a stratified surrogate which branches on the set of path conditions and applies the corresponding surrogate. Our goal is to compare the theoretical and empirical errors achieved by training on the optimal sampling distribution against those of baseline distributions. We compare the approaches across different training datasets, different total numbers of training data points, and evaluating across different evaluation sets, all with multiple trials. Full methodological details are presented in Appendix E.2.

5.3. SURROGATE ERRORS

Table 2 presents the geometric mean change in error of using optimal sampling compared to each baseline, on each dataset. Across most datasets, optimal path sampling results in lower error than both frequency-based path sampling and uniform path sampling. On datasets with few paths (frontday) and in which all paths are well represented (minimum 5% frequency), the gap is minimal, and frequency-based path sampling matches or outperforms optimal path sampling. On datasets with more and rarer paths (top-night), the gap widens and optimal path sampling outperforms frequency-based path sampling. On all datasets, optimal path sampling outperforms uniform path sampling.

5.4. VISUALIZATION

Figure 6 presents the renderings generated by the surrogates for the front-day and the top-night scene. These budgets correspond to the smallest budget that lead to a validation error less than 2%, which was qualitatively chosen as a threshold around which surrogate renders converge on the ground truth. The top row shows the front day scene using surrogates trained on the front day dataset. In this scene, the optimal and frequency-based surrogates result in similarly accurate renders, with the primary difference being that the frequency-based surrogate is overall more red, which is most notable in the windows and the front face of the house. This similarity is expected given the similar errors observed in Table 2 . Uniform sampling results in an inaccurate render, as expected given its high error. The bottom row show the top night scene using surrogates trained on the dataset combining all scenes. In this scene the optimal surrogate has the most accurate render, as expected given the errors observed in Table 2 . The frequency-trained surrogate colors everything (especially the roof and water) slightly more pink. The uniform-trained surrogate colors everything significantly more green. In sum, the error improvements in Table 2 result in visual improvements in the generated renders.

6. CONCLUSION

We present an optimal approach to allocating samples among strata to train stratified neural network surrogates of stratified functions. We also present a programming language, TURACO, in which all programs are learnable stratified functions and a program analysis to determine the complexity of learning surrogates of those programs. Our results take a step towards a cohesive, end-to-end methodology for programming using surrogates of programs.

A RELATED WORK

In this appendix we survey related work for each contribution. A.1 OPTIMAL STRATIFIED SAMPLING Optimal stratified sampling is a classic area in statistics (Thompson, 2012) . Most work in this domain focuses on optimal parameter estimation, and uses stratified sampling to reduce the variance of estimates by ensuring sufficient independent samples are taken from each subpopulation. Our approach is novel in the assumptions we make for training stratified surrogates of programs, and in the specific sample complexity bounds we base our results on. 2019) present an active learning approach for learning in the regime where the input space is partitioned into separate regions (strata, using our terminology) and a separate hypothesis (surrogate) is trained of each, and derive a similar optimal of allocation of data points. This approach has two primary differences from our approach. First, it is an active learning approach that determines whether or not to query a label of a given data point for an input stream of data points, whereas our approach operates offline. Second, it assumes a different form for sample complexity and derives correspondingly different optimal sampling bounds than ours. The definition of complexity (ζ in our formalism) that Cortes et al. use is a function of the number of hypotheses in the hypothesis class, the total number of data points used, and the number of data points for a given stratum that have been queried thus far. It is not a function of any complexity metric of the function being learned. Cortes et al.'s approach is thus a better fit when learning stratified functions of unknown complexity (i.e., non-analytic functions) using a finitely sized hypothesis class (not a neural network), and is targeted at the online setting when given a sampler of the overall data distribution but not one for each stratum. Our approach is a better fit when learning stratified analytic functions with neural networks, and is targeted at the offline setting when given a sampler for each stratum.

A.2 SAMPLE COMPLEXITY PROGRAM ANALYSIS

Program analysis is a broad set of techniques to determine properties of programs (Nielson et al., 1999; Cousot & Cousot, 1977) . Our analysis in Section 4.2 is a novel nonstandard interpretation calculating the tilde, combined with a standard implementation of forward-mode automatic differentiation (Wengert, 1964; Griewank & Walther, 2008 ) and a standard symbolic execution which executes all paths in the program (King, 1976; Cadar et al., 2008) . Bao et al. (2012) present a program analysis that decomposes programs into continuous regions, with the goal of characterizing the sensitivity of each continuous region to input noise. This analysis computes a different notion of complexity than ours, and does not represent the sample complexity of learning a surrogate of each region. Hoffmann & Hofmann (2010) present a program analysis that calculates the algorithmic complexity of a program. This complexity again does not lead to bounds on the sample complexity of learning a surrogate of the program.

B ASSUMPTIONS

Our contributions in Sections 3 and 4 make assumptions about the programs under study, the functions that those programs denote, and the surrogate training algorithms. Here we document these assumptions and note possible failure modes for our techniques. Assumptions imported from prior work. Our sample complexity results are subject to all assumptions from the prior work that gives the sample complexity bounds for neural networks that we use (Agarwala et al., 2021) . These sample complexity bounds only apply to analytic functions. They further assume that inputs come from the unit sphere; this does not match many practical applications, including those in Section 5. Finally, these sample complexity bounds assume that the neural network under study is a 2-layer, infinitely wide neural network trained with SGD with an infinitely small step size, using a 1-Lipschitz loss function. Despite these assumptions, Agarwala et al. (2021, Appendix B. 2) empirically verify that the sample complexity bounds hold. We also show in Section 2 that the theoretical sample complexity bounds correlate with empirical sample complexity results on three functions. Optimal sampling. The first assumption is that we know the distribution of inputs ahead of time D(x), both in terms of the distribution of strata D(s i ) and the distribution of inputs within a given stratum D(x|s i ). The second assumption is that optimizing the upper bound of the per-stratum loss results in a reasonable optimum for the combined surrogate. The third is the technical assumption we make that δ i = δ c ; this is not guaranteed to be optimal. Program analysis. The main assumption here is that Agarwala et al. ( 2021)'s algebra on tilde functions results in a sufficiently precise upper bound on the tilde. This is not always the case -for instance, this assumption says that ADD(x, NEG(x)) has a positive tilde (since the tilde of NEG(x) is equal to the tilde of x), whereas it should have a tilde of 0. Stratified surrogates. We provide sample complexity bounds for constructing stratified surrogates, assuming that for a given program every path is a different function. This assumes both that it is tractable to compute which stratum a given input resides in before applying the surrogate. This also assumes that there are a tractable number of paths, which excludes programs with a large number of if statements or loops.

C PROOF OF OPTIMAL SAMPLING

This appendix provides the proof for the optimal sampling result in Section 3. For convenience we duplicate the definitions of learnability. A learnable function is one such that: ∀D,n, ,δ.∃n. P f ∼tr (f ,D,n, ) E x∼D f (x),f (x) ≤ ≥ 1-δ Following Arora et al. (2019) and Agarwala et al. (2021) we study functions and learning algorithms for which the relationship between n, , and δ in Equation ( 1) is: ∃C.n ≤ C ζ(f )+log δ -1 2 (2) The optimization problem of our optimal sampling approach is: argmin {ni} c i=1 ∈N + E x∼D f (x),f (x) s.t. i n i ≤ n and f i ∼ tr (f i ,D(x|s i ),n i , ) Theorem 3.1. With δ i = δ c , the upper bound of Equation (3) is minimized at: n i = n D(s i ) ζ(f i )+log(cδ -1 ) 2 3 c i=1 D(s i ) ζ(f i )+log(cδ -1 ) 2 3 , Proof. The task is to find n i for each surrogate f i that minimize the expectation of error and such that i n i ≤ n. We first separate out the expectation by stratum, and upper bound the error using the sample Under review as a conference paper at ICLR 2023 complexity assuming for the surrogates: E x∼D f (x),f (x) = E si∼{D(si)} E x∼D(x|si) f (x),f (x) ≤ E si∼{D(si)} C n ζ(f i )+log δ -1 i = i D(s i ) C n ζ(f i )+log δ -1 i Plugging in δ i and writing out the Lagrangian, we have: L(n i ,λ) = c i=1 D(s i )C 1 2 ζ(f i )+log c δ n -1 2 i +λ n- n i ∂ ∂n i L = - 1 2 D(s i )C 1 2 ζ(f i )+log c δ n -3 2 i -λ = 0 n i = 1 2|λ| D(s i )C 1 2 ζ(f i )+log c δ 2 3 ∂ ∂λ L = n- i n i = 0 n = i n i n = i 1 2|λ| D(s i )C 1 2 ζ(f i )+log c δ 2 3 |λ| = 1 n i 1 2 D(s i )C 1 2 ζ(f i )+log c δ 2 3 3 2 n i = 1 n i 1 2 D(s i )C 1 2 ζ(f i )+log c δ 2 3 -1 1 2 D(s i )C 1 2 ζ(f i )+log c δ 2 3 n i = n D(s i ) ζ(f i )+log c δ 2 3 i D(s i ) ζ(f i )+log c δ 2 3 Because the objective is convex in n i , this is the optimal solution.

D FULL TURACO SEMANTICS

This appendix presents the full semantics and analysis for TURACO. Figure 7 The relation σ,[e] ⇓ (ṽ,ṽ ) says that under the variable complexity mapping σ (mapping variables to tuples with their respective tilde and tilde derivative), the expression e has ẽ ≤ ṽ and ẽ ≤ ṽ . v) σ,e ⇓ v -1 < v ≤ 1 σ,LOG1P(e) ⇓ log(1+v) σ,e 1 ⇓ v 1 σ,e 2 ⇓ v 2 σ,ADD(e 1 ,e 2 ) ⇓ v 1 +v 2 σ,e 1 ⇓ v 1 σ,e 2 ⇓ v 2 σ,MUL(e 1 ,e 2 ) ⇓ v 1 •v 2 σ,s 1 ⇓ σ σ ,s 2 ⇓ σ σ,s 1 ;s 2 ⇓ σ σ e ⇓ v σ,x := e ⇓ σ[x → v] σ,e ⇓ v v > 0 σ,s 1 ⇓ σ l σ,if (e > 0) { s 1 } else { s 2 } ⇓ σ l σ,e ⇓ v v ≤ 0 σ,s 2 ⇓ σ r σ,if (e > 0) { s 1 } else { s 2 } ⇓ σ r Figure 12 presents the complexity relation for statements. For statements, the relation σ,s ⇓ Σ says that under the variable complexity mapping σ, the statement s results in a set of paths with variable complexity mappings Σ(p) for path p. Proof. The proof is by induction using Fact 1 (p. 17) and Lemma 7 (p. 22) in Agarwala et al. (2021) . σ,v ⇓ (|v|,0) σ,x ⇓ σ(x) σ,e ⇓ (ṽ,ṽ ) σ,NEG(e) ⇓ (ṽ,ṽ ) σ,e ⇓ (ṽ,ṽ ) σ,SIN(e) ⇓ (sinh(ṽ),ṽ cosh(ṽ)) σ,e ⇓ (ṽ,ṽ ) σ,EXP(e) ⇓ (exp(ṽ),ṽ exp(ṽ )) σ,e ⇓ (ṽ,ṽ ) 0 ⇓ ṽ < 2 σ,LOG1P(e) ⇓ log(2-ṽ), ṽ 2-ṽ σ,e 1 ⇓ (ṽ,ṽ ) σ,e 2 ⇓ (ṽ 2 ,ṽ 2 ) σ,ADD(e 1 ,e 2 ) ⇓ (ṽ+ṽ 2 ,ṽ +ṽ 2 ) σ,e 1 ⇓ (ṽ,ṽ ) σ,e 2 ⇓ (ṽ 2 ,ṽ 2 ) σ,MUL(e 1 ,e 2 ) ⇓ (ṽ•ṽ 2 ,ṽ•ṽ 2 +ṽ 2 •ṽ ) Figure 11 : Complexity relation for expressions in TURACO.  σ,s 1 ⇓ Σ σ ,s 2 ⇓ Σ σ,s 1 ;s 2 ⇓ Σ σ ζ[e] ≤ (ṽ,ṽ ) σ,x := e ⇓ {• → σ[x → (ṽ,ṽ )]} σ,s 1 ⇓ Σl σ,s 2 ⇓ Σr σ,if (e > 0) { s 1 } else { s 2 } ⇓ l.p → Σl (p) p ∈ Σl ∪ r.p → Σr (p) p ∈ Σr [x i → (0,1)],s ⇓ Σ0 [x i → (β,1)],s ⇓ Σβ ∀p. Σ0 (p)(x) = (f p,0 ,f p,0 ) ∀p. Σβ (p)(x) = (ṽ p,β ,ṽ p,β ) ζ β [fun ({{x i }) { s } return x] ≤ p → ṽp,0 +ṽ p,β 2 p ∈ Σ0 

E EXTENDED RENDERER DEMONSTRATION

This appendix provides further details on the evaluation on the renderer program in Section 5. Figure 18 presents the full code for the renderer program under study.foot_7 Figure 14 presents the full set of scenes used in our evaluation. This program is a good candidate to train a surrogate of for several reasons. First, it is an approximable program: as long as the outputs of a surrogate of the program are sufficiently close to the ground-truth outputs, the generated image will be perceptually indistinguishable. Second, its paths are all determined by uniform input variables, variables in GLSL that are constant across each invocation of the shader. This means that relative to the cost of executing the program, it is cheap to determine which path a given input induces in the program (and thus which surrogate to apply). Third, its execution environment is well suited to be replaced with a neural network, since the original program itself performs batch processing on a GPU.

E.1 PATHS

Table 3 presents statistics about the entire set of paths paths under study. Figure 16 presents a side-by-side comparison of traces of two paths: lrrlrl (twilight smoke), the least complex path in the dataset, and lrrrrr (nighttime solids), the most complex path in the dataset. Two factors lead to the twilight smoke path being less complex than the nighttime solids path. First, the twilight smoke path computes one lighting factor from a constant (Figure 16a , line 14) rather than from a function of the sun's position (Figure 16b , lines 13-14). Second, the twilight smoke path does not have any specular output (Figure 16a , line 18), unlike the nighttime solid output (Figure 16b , line 18). Figure 15 shows side-by-side comparisons of the three classes of paths: water, smoke, and solids. In each of these images, one path returns red for all pixels, while the other paths return black for all pixels. The base scene is the front daytime scene in Figure 19a . Figure 16 : Side-by-side comparison of the trace of the least complex path (lrrlrl: twilight smoke) and that of the most complex path (lrrrrr: nighttime solids), with differences highlighted.

E.2 ARCHITECTURE AND HYPERPARAMETER DETAILS

For the design of each surrogate, we use a simple MLP architecture with a single hidden layer of 512 units and a ReLU activations. This architecture matches that of Agarwala et al. (2021) , except using 512 rather than 1000 hidden units (we found that the accuracy of each path surrogate plateaued by 512 hidden units). We train the surrogate using the Adam optimizer with a learning rate of 0.0001 and a batch size of 256 for 50,000 steps. We run 5 trials of all experiments, and report the mean error.

E.3 FULL RESULTS

Figure 17 presents the error of surrogates on each dataset. Each plot shows the error for a different dataset. Each plot has three different lines, respectively showing the error of each surrogate training distribution (optimal, frequency-based, and uniform). Each x axis is the total training data budget. Each y axis is the error of the resulting stratified surrogate. 



We choose the term stratified by analogy with the technique of stratified sampling. "EnerJ ... prohibit[s] approximate values in conditions that affect control flow."(Sampson et al., 2011). "Rely assumes that ... control flow branch targets are computed reliably."(Carbin et al., 2013). https://github.com/iclr-2022-UcKEodTPtfI/turaco contains our implementation of TURACO and of the experiments in Sections 2 and Lines 278 through 337 of https://github.com/lettier/3d-game-shaders-for-begin ners/blob/2970085/demonstration/shaders/fragment/base.frag. Our implementation of TURACO extends the interpreter and analysis to vectors, with the resulting complexity being the sum of the complexity of each component of the output vector. The original program is written in GLSL. We present a semantically equivalent translation (preserving all paths) of the program to TURACO for simplicity of presentation.



f u n ( s u n P o s i t i o n , e m i s s i o n ) { i f ( s u n P o s i t i o n < 0 ) { a m b i e n t = 0 } e l s e { a m b i e n t = s u n P o s i t i o n } e m i s s i o n * = max ( 0 . 1 , s u n P o s i t i o n ) } r e t u r n a m b i e n t + e m i s s i o n (a) Graphics program calculating the luminance of a pixel as a function of ambient light and material properties.

Output of the program on inputs in [-1,1], with dashes separating the three paths. / / assume : s u n P o s i t i o n < 0 f u n ( s u n P o s i t i o n , e m i s s i o n ) { a m b i e n t = 0 ; e m i s s i o n * = 0 . 1 ; } r e t u r n a m b i e n t + e m i s s i o n ; (c) Nighttime (ll) path. / / assume 0 < s u n P o s i t i o n < 0 . 1 f u n ( s u n P o s i t i o n , e m i s s i o n ) { a m b i e n t = s u n P o s i t i o n ; e m i s s i o n * = 0 . 1 ; } r e t u r n a m b i e n t + e m i s s i o n ; (d) Twilight (rl) path. / / assume : s u n P o s i t i o n > 0 . 1 f u n ( s u n P o s i t i o n , e m i s s i o n ) { a m b i e n t = s u n P o s i t i o n ; e m i s s i o n * = s u n P o s i t i o n ; } r e t u r n a m b i e n t + e m i s s i o n ;(e) Daytime (rr) path.

Figure 1: Example program, outputs, and traces.

Per-path surrogate errors (log-log plot).

Stratified surrogate errors (log-log plot).Optimal sampling decreases the error by 15%.

Figure 2: Per-path surrogate errors (left) and combined errors (right) for the example.

a number of training examples n ∈ N, and a loss function ∈ L : Y ×Y → R ≥0 which measures the cost of an incorrect prediction, and returns a function (representing the output surrogate) f : X → Y.

Figure 3: Syntax of TURACO.

Figure 5: Complexity relation for if statements, using a period to denote string concatenation.

(a) Ground-truth front-day scene. (b) Ground-truth top-night scene. (c) Optimal surrogate. (d) Frequency-based surrogate. (e) Uniform surrogate.

Figure 6: Ground-truth (top) and surrogate renderings (bottom) of scenes generated by the renderer.

Figure8presents the big-step evaluation relation for expressions in TURACO. The expression relation σ,e ⇓ v says that under variable store σ (assigning values to all variables in e), the expression e evaluates to value v.

Figure 8: Big-step evaluation relation for expressions in TURACO.

Figure 9: Big-step evaluation relation for statements.

Figure9presents the big-step evaluation relation for statements in TURACO. The statement relation σ,s ⇓ σ says that under variable store σ, the statement s evaluates to a new variable store σ .

Figure10presents the big-step evaluation relation for TURACO programs. The program relation σ,fun (x 0 ,x 1 ...,x n ) { s } return x ⇓ v says that under variable store σ (assigning values to all x i ) the program evaluates to value v.

Figure 13 presents the complexity ζ (for a given input β parameter) of each path of a program.

Figure 12: Complexity relation for statements.

Figure 13: Complexity relation for full programs.

Figure 14: Ground-truth scenes generated by the renderer.

Figure 15: Daytime scene with each different path highlighted red, and all others black.

t 0 r g b : = add ( . . . , e m i s s i o n ) 17 o u t 0 a : = d i f f u s e C o l o r [ 3 ] 18 o u t 1 a : = d i f f u s e C o l o r [ 3 ] 19 out1rgb := [0.0, 0.0, 0.0] 20 } r e t u r n ( o u t 0 r g b , o u t 0 a , o u t 1 r g b , o u t 1 a ) (a) Twilight smoke (lrrlrl) trace. 1 f u n ( . . . ) { 2 s u n P o s i t i o n : = s i n ( mul ( s u n P o s i t i o n [ 0 ] , → mul ( p i , 0 . 0 0 5 5 5 5 6 ) ) ) . . . 14 sunPositionPow := exp(mul(log(add(sunPosition, -1.0)), 0.4)) 15 emission := mul(emission, sunPositionPow) 16 o u t 0 r g b : = add ( . . . , e m i s s i o n ) 17 o u t 0 a : = d i f f u s e C o l o r [ 3 ] 18 o u t 1 a : = d i f f u s e C o l o r [ 3 ] 19 out1rgb := [specular[0],specular[1],specular[2]] 20 } r e t u r n ( o u t 0 r g b , o u t 0 a , o u t 1 r g b , o u t 1 a ) (b) Nighttime solids (lrrrrr) trace.

Figure 18: Full code for the renderer case study.

Average change in error across all budgets from using optimal sampling compared to baselines on each dataset (negative means optimal sampling has lower error).

Santner et al. (2018) survey sampling techniques for computer experiments. Chapter 5.2.3 discusses stratified random sampling in particular, showing optimality criteria for sampling for unbiased estimators. These approaches are generic for minimizing the variance of estimators, and do not consider specifically training a neural network. These approaches also do not consider the different complexity of different strata.

Top: the identifier, lines of code, complexity, and description of each path present in our datasets. Bottom: the distribution (abbreviated distr.) of each path across each dataset: the frequency (freq.) of each observed path, and the optimal sampling rate (opt.) of that path.

