NEURAL DAG SCHEDULING VIA ONE-SHOT PRIORITY SAMPLING

Abstract

We consider the problem of scheduling operations whose dependencies are characterized by a Directed Acyclic Graph (DAG). Due to its NP-hard nature, heuristic algorithms have traditionally been used to acquire reasonably good solutions, and more recent works have proposed Machine Learning (ML) heuristics that can generalize to unseen graphs and outperform the non-ML heuristics. However, generating solutions with existing ML schedulers is computationally costly, since they adopt an episodic reinforcement learning framework that necessitates multi-round neural network processing. We propose a novel ML scheduler that uses a one-shot neural network encoder to sample node priorities, which are converted by list scheduling into the final schedules. Since the one-shot encoder can efficiently sample the priorities in parallel, our algorithm runs significantly faster than existing ML baselines and has run time comparable with the fast traditional heuristics. We empirically show that our algorithm generates better schedules than both non-neural and neural baselines across various real-world and synthetic scheduling tasks.

1. INTRODUCTION

The problem of scheduling operations arises across many domains: data centers, where incoming jobs have to be scheduled on a distributed server (Mao et al., 2019); manufacturing pipelines, in the form of job shop scheduling problems (JSSP) (Manne, 1960); and ML compilers, where the operations of a computation graph need to be scheduled on the available hardware devices (Paliwal et al., 2020; Zhou et al., 2020). In all these cases, the problem can be abstracted as a directed acyclic graph (DAG) whose nodes represent the operations and whose edges represent the dependency constraints between them; hence the problem is also referred to as DAG scheduling. The objective is to minimize the finish time (or makespan) of the DAG subject to resource and dependency constraints. This problem is well known to be NP-hard (Kan, 2012), and practitioners have traditionally relied on heuristic methods to obtain good solutions. One of the most celebrated scheduling approaches is list scheduling (Graham, 1969), whose idea is to schedule nodes as early as possible and to break ties using priorities. The priorities can be obtained from computationally inexpensive node metrics such as critical-path length, shortest processing time, or most operations remaining (Haupt, 1989).

More recently, researchers have proposed deep reinforcement learning based methods to solve scheduling problems (Zhang et al., 2020; Zhou et al., 2020; Wang et al., 2021; Mao et al., 2019). The scheduling policies in all these references utilize Graph Neural Networks (GNNs) as encoders to derive node embeddings. Zhang et al. (2020) proposed an auto-regressive GNN-based policy for the JSSP that predicts the next node to schedule given the nodes scheduled so far. Wang et al. (2021) proposed a bi-level optimization approach that modifies the input DAG by adding multiple edges via a learned policy and then applies the critical-path heuristic to the modified DAG. One major drawback of the existing ML-based schedulers is their computational cost, as they require multi-round neural network processing (encoding steps), reflected in the auto-regressive architecture of Zhang et al. (2020) and the bi-level optimization design of Wang et al. (2021). This drawback limits their scalability to large graphs and their applicability to domains where solutions need to be obtained in a timely manner (e.g., scheduling computation graphs in compilers).

In this paper, we propose a novel ML scheduler that uses a one-shot neural network encoder to sample node priorities, which are converted by list scheduling into the final schedules. Since our encoder generates node priorities with a single forward pass of a neural network and efficiently samples priorities in parallel, our algorithm runs significantly faster than existing ML baselines and has run time comparable with the fast traditional heuristics. The contributions of this paper are summarized below:

• We propose a novel end-to-end approach to learning scheduling priorities for list scheduling on DAGs.

• Our approach uses a one-shot encoder that generates the node priorities by running the Topoformer encoder once. This is in contrast to existing neural baselines (Wang et al., 2021; Zhang et al., 2020), all of which involve multi-round neural network processing. Due to the one-shot nature of our model, our method runs significantly faster than the neural baselines, while achieving run times slightly worse than, yet comparable with, those of computationally efficient and simple non-ML heuristics.

• We show that our approach can be generally applied to a variety of scheduling tasks, including the JSSP, the TPC-H benchmark, and scheduling for synthetic and real-world computation graphs. For all benchmarks, our model outperforms both neural (Wang et al., 2021; Zhang et al., 2020) and non-neural baselines w.r.t. the makespan metric.

2. PRELIMINARIES

2.1. SCHEDULING PROBLEM

In scheduling problems, we define a DAG as a tuple G := (V, E, δ, ρ, µ) with a set V of nodes (or vertices) and a set E of directed edges (or arcs). Each node v ∈ V represents an operation, with δ(v) ≥ 0 denoting its operational duration and ρ(v) ≥ 0 the resources required to execute v. For a set M of machine types, each node v ∈ V has to be assigned to its own machine type µ(v) ∈ M (|M| = 1 and |M| > 1 correspond to scheduling with homogeneous and heterogeneous machines, respectively). The set E of edges represents the computational dependencies among nodes: writing τ(v) ≥ 0 for the scheduled start time of each node v ∈ V, a directed edge (v₁, v₂) ∈ E, v₁, v₂ ∈ V, means τ(v₁) + δ(v₁) ≤ τ(v₂), i.e., any node may be scheduled only on or after the time when all its predecessor nodes have finished. We assume that each machine type m ∈ M has its own maximum resource limit λ(m) ≥ 0, i.e., at any point in time the total amount of occupied resources on machines of type m cannot exceed λ(m).
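As an illustration of the precedence constraint, the following sketch checks a candidate schedule against the edge inequalities above (the diamond-shaped DAG and all variable names are hypothetical, and resources are ignored here):

```python
def respects_precedence(tau, delta, edges):
    """True iff tau(v1) + delta(v1) <= tau(v2) holds for every edge (v1, v2)."""
    return all(tau[v1] + delta[v1] <= tau[v2] for v1, v2 in edges)

# Hypothetical diamond DAG: 0 -> {1, 2} -> 3
tau = {0: 0, 1: 2, 2: 2, 3: 5}    # scheduled start times
delta = {0: 2, 1: 3, 2: 1, 3: 1}  # operational durations
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
print(respects_precedence(tau, delta, edges))  # True: every edge inequality holds
```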

Let us introduce, with a slight abuse of notation, the vectorized start times τ = [τ(v)]_{v∈V} ∈ R^{|V|}_{≥0}. We define a valid schedule as a vector τ ∈ T, where T is the set of all valid schedules (satisfying both the precedence and the resource constraints of the given DAG G). The objective of the scheduling problem is to find τ* := argmin_{τ∈T} C(τ; G), where C(τ; G) := max_{v∈V} {τ(v) + δ(v)}, the duration required to complete all operations, is the makespan of schedule τ.
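Under the definitions above, the makespan C(τ; G) is simply the largest completion time over all nodes. A minimal sketch (the start times and durations are made up for illustration):

```python
def makespan(tau, delta):
    """C(tau; G) = max over v of tau(v) + delta(v)."""
    return max(tau[v] + delta[v] for v in tau)

tau = {0: 0, 1: 2, 2: 2, 3: 5}    # hypothetical start times
delta = {0: 2, 1: 3, 2: 1, 3: 1}  # hypothetical durations
print(makespan(tau, delta))  # 6: node 3 finishes last, at 5 + 1
```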

2.2. LIST SCHEDULING

List scheduling (Graham, 1969) is a class of priority-based scheduling algorithms that are widely adopted in practice due to their simplicity. List scheduling works as follows: (Step 1) Take as input a list of node priorities and set the current decision time to zero. (Step 2) Find the ready nodes that can be scheduled at the current decision time, i.e., the unscheduled nodes whose predecessors have all finished. (Step 3) Schedule the ready nodes in decreasing order of priority for as long as the resource constraints permit. (Step 4) Advance the decision time to the earliest finish time among the running nodes, and repeat Steps 2-4 until every node is scheduled.
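These steps can be sketched as follows for the homogeneous case (|M| = 1) with a single resource limit. This is an illustrative implementation rather than the exact scheduler used in the experiments, and it assumes every node's resource demand fits within the limit:

```python
import heapq

def list_schedule(durations, edges, priorities, resource, limit):
    """Greedy list scheduling for a single machine type with resource limit
    `limit`; higher priority = scheduled first among the ready nodes.
    Assumes resource[v] <= limit for every node v."""
    n = len(durations)
    preds = [set() for _ in range(n)]
    succs = [[] for _ in range(n)]
    for u, v in edges:
        preds[v].add(u)
        succs[u].append(v)
    start = [None] * n
    t, used = 0, 0
    running = []  # min-heap of (finish_time, node)
    unfinished = set(range(n))
    while unfinished:
        # Step 2: ready = unscheduled nodes whose predecessors all finished
        ready = [v for v in unfinished if start[v] is None and not preds[v]]
        # Step 3: schedule ready nodes in priority order while resources last
        for v in sorted(ready, key=lambda v: -priorities[v]):
            if used + resource[v] <= limit:
                start[v] = t
                used += resource[v]
                heapq.heappush(running, (t + durations[v], v))
        # Step 4: advance the decision time to the next finish event,
        # releasing every node that finishes at that time
        t, v = heapq.heappop(running)
        while True:
            used -= resource[v]
            unfinished.discard(v)
            for w in succs[v]:
                preds[w].discard(v)
            if running and running[0][0] == t:
                _, v = heapq.heappop(running)
            else:
                break
    return start

# Hypothetical diamond DAG 0 -> {1, 2} -> 3 with two units of resource
print(list_schedule([2, 3, 1, 1], [(0, 1), (0, 2), (1, 3), (2, 3)],
                    priorities=[0, 2, 1, 0], resource=[1, 1, 1, 1], limit=2))
# [0, 2, 2, 5]: nodes 1 and 2 run in parallel once node 0 finishes
```

With limit=1 the same priorities serialize nodes 1 and 2, illustrating how the resource constraint and the priority list jointly determine the makespan.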



Our model adopts the recently proposed Topoformer architecture (Gagrani et al., 2022) as a DAG encoder and the Gumbel-Top-k trick (Kool et al., 2019b) to sample node priorities, which are acquired by perturbing the encoder's output and converted into valid schedules via list scheduling. While optimizing our model with REINFORCE (Williams, 1992), we introduce logit norm regularization and cost standardization, which significantly improve our model's representation power and performance compared to the model used in Gagrani et al. (2022).
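The priority sampling step can be illustrated with a small sketch of the Gumbel-Top-k trick: adding i.i.d. Gumbel(0, 1) noise to the encoder's logits and argsorting each perturbed copy yields independent samples of full node orderings, all from a single forward pass. Here random logits stand in for the Topoformer output:

```python
import numpy as np

def sample_priority_orders(logits, num_samples, rng):
    """Perturb the logits with Gumbel noise and argsort in descending order;
    each row of the result is one sampled priority ordering over the nodes."""
    gumbel = rng.gumbel(size=(num_samples, logits.shape[0]))
    return np.argsort(-(logits[None, :] + gumbel), axis=1)

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 1.0])  # stand-in for the encoder's output
orders = sample_priority_orders(logits, 4, rng)
print(orders.shape)  # (4, 3): four sampled orderings of three nodes
```

Because the samples share one set of logits, drawing many candidate priority lists (and keeping the one with the best makespan after list scheduling) costs only noise addition and a sort per sample.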

