ROBUST SCHEDULING WITH GFLOWNETS

Abstract

Finding the best way to schedule operations in a computation graph is a classical NP-hard problem which is central to compiler optimization. However, evaluating the goodness of a schedule on the target hardware can be very time-consuming. Traditional approaches, as well as previous machine learning ones, typically optimize proxy metrics, which are fast to evaluate but can lead to bad schedules when tested on the target hardware. In this work, we propose a new approach to scheduling by sampling proportionally to the proxy metric using a novel GFlowNet method. We introduce a technique to control the trade-off between diversity and goodness of the proposed schedules at inference time and demonstrate empirically that pure optimization baselines can underperform our approach when tested on a target model. Furthermore, we show that conditioning the GFlowNet on the computation graph enables generalization to unseen scheduling problems for both synthetic and real-world compiler datasets.

1. INTRODUCTION

Efficient execution of computation graphs is paramount to many scientific and industrial applications, with deep learning being a prominent example (Amodei & Hernandez, 2018). Scheduling is the action of assigning operations to the available compute resources, such as threads, cores, or nodes in a cluster (Kwok & Ahmad, 1999; Hennessy & Patterson, 2011; Pinedo, 2012). Unfortunately, finding the schedule with the shortest possible makespan (start-to-end runtime) is in general NP-hard (Papadimitriou & Steiglitz, 1998). As a result, domain experts have devised heuristics tailored to specific problem instances (Ibarra & Kim, 1977). Machine learning approaches promise to automate this process, allowing fast adaptation to new graph distributions (Wang & O'Boyle, 2018; Bengio et al., 2021c). In this work, we consider the problem of scheduling a set of operations with precedence constraints on a fixed number of homogeneous devices, i.e., any operation can run on any device and its runtime is the same on all devices.

Evaluating the makespan of a schedule involves running all operations in the computation graph on some target hardware. This can be very resource intensive, especially when the computation graph includes lengthy operations, the evaluated schedule is inefficient, or the intended target hardware is a cluster with many nodes. Heuristic optimizers, like genetic algorithms (Hou et al., 1994), or machine learning approaches (Mao et al., 2019) further exacerbate this problem because they require many evaluations to converge (Chen et al., 2018). Proxies are a much faster alternative that estimate the makespan using a simplified model of the hardware. However, this speed comes at the cost of discrepancies between the proxy makespan and the one observed on the hardware; as a result, solutions that perform well on the proxy might ultimately be unsatisfactory once tested on the target.
Nonetheless, proxies remain a good indicator for most schedules and are essential due to their efficiency. We aim to learn a scheduler that can be trained using the proxy, whilst being robust to its inaccuracies. The common approach to scheduling problems (and combinatorial optimization problems in general) is to look for the single best schedule that minimizes a makespan measure, which can be an analytical proxy (Paliwal et al., 2020), the output of a simulator (Zhou et al., 2020), or even the real makespan on hardware (Khadka et al., 2021). We propose a different philosophy: generate a set of candidate schedules that have a low makespan according to the proxy and are diverse. By having multiple good schedules that are significantly different, we can reduce the impact of systematic errors in the proxy, and hope for robust performance on the target. Our goal is to learn a generative model that assigns higher probability to low-makespan schedules, and importantly can also discover the different modes associated with local optima of the makespan cost.

Figure 1: Full pipeline of our generative scheduling approach. Conditioned on the computation graph, we generate multiple candidate schedules using a GFlowNet, filter for the best k with the proxy, and pick the best performing one out of the k that we check on the target. Here we illustrate the pipeline for k = 2 and two devices, d_1 and d_2.

Generative Flow Networks (GFlowNets) have recently been introduced as a method for learning a stochastic policy that can construct discrete and composite objects piece by piece, proportional to a given reward (Bengio et al., 2021b). By computing the reward from the proxy makespan, we can use GFlowNets to sample a diverse set of candidate schedules. Our main contributions are:

1. We introduce an alternative to the pure proxy optimization viewpoint of scheduling that achieves better robustness to proxy errors, by generating multiple candidate schedules to evaluate directly on the target hardware.
2. We extend GFlowNets to generate schedules conditioned on a computation graph. Additionally, we introduce a method to control diversity and goodness at inference time, without the need for retraining. These contributions may be of general interest, beyond the scheduling problem.
3. We empirically demonstrate the robustness of our method to proxy errors and verify the generalization ability on a diverse set of synthetic and real-world computation graphs.
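The generate-filter-test pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `sample_schedules`, `proxy_makespan`, and `target_makespan` are hypothetical stand-ins (here, dummies) for a trained GFlowNet sampler, the cheap analytical proxy, and the expensive hardware measurement, respectively.

```python
import random

def sample_schedules(graph, n):
    # A GFlowNet would sample schedules with probability increasing in
    # reward (decreasing in proxy makespan); here we simply draw n random
    # device assignments over two devices as a placeholder.
    return [[random.randrange(2) for _ in graph] for _ in range(n)]

def proxy_makespan(schedule):
    # Cheap, possibly biased estimate: with unit-cost operations and no
    # precedence constraints, the makespan is the most-loaded device.
    return max(schedule.count(d) for d in set(schedule))

def target_makespan(schedule):
    # Expensive ground-truth measurement on hardware (dummy here).
    return proxy_makespan(schedule)

def robust_schedule(graph, n_samples=16, k=2):
    """Sample many candidates, keep the k best under the proxy, then pick
    the winner by measuring only those k on the target."""
    candidates = sample_schedules(graph, n_samples)
    top_k = sorted(candidates, key=proxy_makespan)[:k]
    return min(top_k, key=target_makespan)
```

The key point is the cost structure: the proxy is queried `n_samples` times, while the target is only queried `k` times (k = 2 mirrors Figure 1), so diversity among the top-k candidates is what buys robustness to proxy errors.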

2. ROBUST SCHEDULING

In this section, we first provide a definition of the scheduling problem we consider in this work. Then, we discuss how a proxy simulates the schedule execution as well as the difficulties of specifying a reliable proxy. Finally, we describe our proposed generative scheduling framework.

2.1. PROBLEM DEFINITION

In scheduling, we are given a computation graph G_C = (O, P) that is a directed acyclic graph (DAG) consisting of operations (nodes) o ∈ O and precedence constraints (edges) p ∈ P that encode a partial order in which the operations need to be executed. In particular, the edge p_ij encodes that operation o_i needs to finish before o_j can start, for example because o_j requires the output of o_i as input. Our task is to run all operations on a set of devices D = {d_1, ..., d_m} without violating the precedence constraints. In addition to the precedence constraints, each device can only run one operation at a time. We can then view scheduling as performing two distinct tasks: assigning a device to each operation, and determining a (complete) order among all operations on the same device that is compatible with the precedence constraints encoded in G_C. We can model the schedule as a chain of operations for each device, where the chain denotes the order in which the operations run on that device. See Figure 1 for a visual example of the chain graphs. Our aim is to find the schedule with the lowest makespan for some target hardware.
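To make the chain-graph view concrete, the following sketch computes the makespan of a schedule under an idealized proxy model (homogeneous devices, no communication cost). The function name `makespan` and the input encoding are our own illustrative choices, not the paper's API: the schedule is exactly the per-device chains described above, and each chain contributes extra precedence edges between consecutive operations on the same device.

```python
from collections import defaultdict

def makespan(durations, edges, chains):
    """Earliest-finish-time simulation of a schedule.

    durations: {op: runtime}
    edges:     precedence pairs (i, j) meaning i must finish before j starts
    chains:    {device: [ops in execution order]} -- the schedule itself
    """
    # The chain order induces additional precedence edges between
    # consecutive operations assigned to the same device.
    all_edges = list(edges)
    for ops in chains.values():
        all_edges += list(zip(ops, ops[1:]))

    preds = defaultdict(list)
    for i, j in all_edges:
        preds[j].append(i)

    finish = {}
    def finish_time(op):
        # Memoized recursion: an op starts once all of its predecessors
        # (precedence and same-device) have finished.
        if op not in finish:
            start = max((finish_time(p) for p in preds[op]), default=0)
            finish[op] = start + durations[op]
        return finish[op]

    return max(finish_time(op) for op in durations)
```

For instance, with a diamond graph a → {b, c} → d of unit-duration operations, placing a, b, d on one device and c on another yields a makespan of 3, whereas serializing everything on one device yields 4. Note that this assumes the chains are consistent with the precedence constraints; an invalid schedule would introduce a cycle and the recursion would not terminate.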

2.2. TARGET MODEL VS. PROXIES

The makespan of any schedule can be evaluated on the target hardware by running all the operations in the specified order and on the specified devices. However, this can take up significant time

