DEEP REINFORCEMENT LEARNING FOR WIRELESS SCHEDULING WITH MULTICLASS SERVICES

Abstract

In this paper, we investigate the problem of scheduling and resource allocation over a time-varying set of clients with heterogeneous demands. This problem appears when service providers need to serve traffic generated by users with different classes of requirements. We thus have to allocate bandwidth resources over time to efficiently satisfy these demands within a limited time horizon. This is a highly intricate problem, and solutions may involve tools stemming from diverse fields such as combinatorics and optimization. Recent work has successfully proposed Deep Reinforcement Learning (DRL) solutions, although not yet for heterogeneous user traffic. We propose a deep deterministic policy gradient algorithm that combines state-of-the-art techniques, namely Distributional RL and Deep Sets, to train a model for heterogeneous traffic scheduling. We test it on diverse scenarios with different time-dependence dynamics, users' requirements, and available resources, demonstrating consistent results. We evaluate the algorithm in a wireless communication setting and show significant gains against state-of-the-art conventional algorithms from combinatorics and optimization.

1. INTRODUCTION

User scheduling (i.e., which user should be served when) and the associated resource allocation (i.e., which and how many resources should be assigned to scheduled users) are two long-standing fundamental problems in communications, which have recently attracted vivid attention in the context of next-generation communication systems (5G and beyond). The main reason is the heterogeneity of users' traffic and the diverse Quality of Service (QoS) requirements imposed by the users. The goal of this paper is to design a scheduler and resource assigner that takes as input the specific constraints of the traffic/service class each user belongs to, in order to maximize the number of satisfied users. This problem is hard to solve since we face at least two main technical challenges: (i) except for some special cases, there is no simple closed-form expression for the problem, and a fortiori for its solution; (ii) the problem-solving algorithm has to be scalable with the number of users. Current solutions rely on combinatorial approaches or suboptimal heuristics, which work satisfactorily in specific scenarios but fail to perform well when the number of active users is large. This motivates the quest for alternative solutions; we propose to resort to Deep Reinforcement Learning (DRL) to tackle this problem.

In the context of DRL, we propose to combine several ingredients in order to solve the aforementioned challenging problem. In particular, we leverage the theory of Deep Sets to design permutation-equivariant and permutation-invariant models, which solves the scalability issue, i.e., the number of users can be increased without having to increase the number of parameters. We also stabilize the learning process by incorporating, in a novel way, the distributional dimension and marrying it with Dueling Networks to "center the losses". Finally, we compare the proposed DRL-based algorithm with conventional solutions based on combinatorial or suboptimal optimization approaches. Our experiments and simulation results clearly show that our DRL method significantly outperforms conventional state-of-the-art algorithms.
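To make the scalability argument concrete, the following is a minimal sketch of the Deep Sets construction referenced above: each user's features are encoded independently by a shared network phi, pooled with an order-independent sum, and decoded by rho. The layer sizes and the random weights are illustrative stand-ins (not the paper's trained model), assuming numpy is available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes; the weights below are random stand-ins
# for trained parameters, used only to demonstrate the property.
D_IN, D_HID, D_OUT = 4, 16, 8
W_phi = rng.normal(size=(D_IN, D_HID))
W_rho = rng.normal(size=(D_HID, D_OUT))

def deep_set_embedding(users):
    """Permutation-invariant embedding rho(sum_i phi(x_i)).

    `users` is an (N, D_IN) array of per-user features. The output does
    not depend on the row order, and the parameter count is independent
    of N, so the same model handles any number of users.
    """
    phi = np.tanh(users @ W_phi)   # encode each user independently
    pooled = phi.sum(axis=0)       # order-independent pooling over users
    return np.tanh(pooled @ W_rho)  # decode the pooled summary

users = rng.normal(size=(5, D_IN))
out_a = deep_set_embedding(users)
out_b = deep_set_embedding(users[::-1])  # same users, reversed order
assert np.allclose(out_a, out_b)         # invariant to user permutation
```

Sum pooling gives invariance (for a set-level output such as a value estimate); keeping the per-user encodings before pooling yields the equivariant variant used for per-user decisions.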

2. RELATED WORK

The scheduling problem is a well-known problem appearing in various fields, and as technologies progress and more people take advantage of new services, scheduling them efficiently becomes more intricate. This is exactly the case in wireless communication systems. Researchers are resorting to new methods, such as deep reinforcement learning, which have shown impressive results (Mnih et al., 2015; Silver et al., 2016). For example, Chinchali et al. (2018) perform scheduling at a cellular level using Deep Reinforcement Learning (DRL). Ideas using DRL in a distributed way to perform dynamic power allocation have also appeared in (Naparstek & Cohen, 2018; Nasir & Guo, 2019). Nevertheless, to the best of our knowledge, the problem of scheduling traffic of users with heterogeneous performance requirements has not been appropriately addressed. To solve this hard problem, one can resort to distributional Reinforcement Learning, introduced in Jaquette (1973) and further developed in (Dabney et al., 2018a;b), in order to obtain richer representations of the environment and better solutions. Techniques such as noisy networks for better exploration (Fortunato et al., 2018) and architectures such as dueling networks (Wang et al., 2016) have also greatly improved the stability of trained models. Finally, the ideas of Zaheer et al. (2017) managed to simplify and improve neural network models when permutation invariance properties apply. We combine these ideas with a deep deterministic policy gradient method (Lillicrap et al., 2016) to reach a very efficient scheduling algorithm.

3.1. THE PROBLEM

The problem we consider here involves a set of randomly arriving users that communicate wirelessly with a base station (service provider); users require that their traffic be served according to the quality of service (QoS) requirements imposed by the service class they belong to. We consider the case where users belong to different service classes with heterogeneous requirements. Each class specifies the amount of data to be delivered, the maximum tolerable latency, and the "importance/priority" of the user. A centralized scheduler (at the base station) at each time step takes as input this time-varying set of users belonging to different service classes and has to decide how to allocate its limited resources per time step in order to maximize the long-term "importance"-weighted sum of satisfied users. A user is considered satisfied whenever it successfully receives its data within the maximum tolerable latency specified by its service class. The hard problem of scheduling and resource allocation -- which is combinatorial by nature -- is exacerbated by the wireless medium, which brings additional uncertainty due to time-varying random connection quality. The scheduler that assigns resources cannot exclude the possibility of a bad connection (low channel quality), which renders data transmission unsuccessful. In order to mitigate that effect, some protocols make use of channel state information (CSI) at the transmitter, i.e., the base station/scheduler knows in advance the channel quality and adapts the allocated resources to the instantaneous channel conditions. We consider here two extreme cases of channel knowledge: (i) full-CSI, in which perfect (instantaneous, error-free) CSI is provided to the scheduler, enabling accurate estimation of the exact resources each user needs; and (ii) no-CSI, in which the scheduler is agnostic to the channel quality.
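A minimal sketch of the objective described above may help fix ideas: each user carries a remaining data demand, a latency budget, and a class importance, and the scheduler's per-step reward is the importance-weighted count of users it satisfies. The field and function names here are illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class User:
    """Hypothetical per-user state; field names are illustrative."""
    data_left: float   # data still to be delivered
    deadline: int      # time steps remaining in the latency budget
    importance: float  # class-dependent weight in the objective

def step_reward(users, delivered):
    """Importance-weighted sum of users satisfied at this time step."""
    reward = 0.0
    for u, d in zip(users, delivered):
        u.data_left -= d
        u.deadline -= 1
        if u.data_left <= 0:
            reward += u.importance  # satisfied within its latency budget
    return reward

users = [User(data_left=3.0, deadline=2, importance=2.0),
         User(data_left=5.0, deadline=1, importance=1.0)]
r = step_reward(users, delivered=[3.0, 4.0])
# first user satisfied (weight 2.0); second user is still short of data
assert r == 2.0
```

The long-term objective is then the (discounted) sum of such per-step rewards, with users whose deadline expires before completion contributing nothing.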
In case of unsuccessful/erroneous data reception, we employ a simple retransmission protocol (HARQ-type I). A widely used way to model the channel dynamics is to make the wireless channel quality depend on the distance of the user from the base station and evolve in a Markovian way from the channel realization of the previous time step. The mathematical description of the traffic generator model and the channel dynamics is provided in Appendix A. To better understand the problem, we draw the following analogy. Imagine a server with a water pitcher that is refilled at every time step and has to be distributed across a set of people. Every person has a glass and leaves satisfied only if their glass is filled (or overfilled) at some time instant prior to a certain maximum waiting time. As mentioned before, in this work we consider a retransmission protocol (HARQ-type I), which in our analogy means that the server cannot fill a glass over multiple trials; if a glass is not filled completely at a time step, it is emptied and the server has to retry. The wireless communication setting brings the additional complication that the sizes of the glasses are not actually fixed but fluctuate (due to the randomness of the connection quality of each user).
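The pitcher analogy can be sketched as a toy simulator: one step pours from a fixed budget into glasses whose effective sizes fluctuate, and a glass counts only if filled entirely within that step (partial pours are lost, mirroring HARQ-type I). The uniform fluctuation factor is a deliberately simple stand-in for the paper's Markovian, distance-dependent channel model.

```python
import random

random.seed(1)

def serve_round(demands, budget):
    """One time step of the pitcher analogy under HARQ-type I.

    `demands` are nominal glass sizes; the amount actually needed
    fluctuates with channel quality, modelled here by a simple random
    factor (an illustrative stand-in, not the paper's channel model).
    A glass counts only if fully filled within this single step.
    """
    satisfied = []
    for i, nominal in enumerate(demands):
        needed = nominal * random.uniform(0.8, 1.5)  # channel randomness
        if needed <= budget:
            budget -= needed          # pour a complete glass
            satisfied.append(i)
        # else: a partial pour would be wasted, so skip and retry later
    return satisfied

served = serve_round(demands=[2.0, 3.0, 4.0], budget=6.0)
print(served)  # indices of users satisfied in this round
```

The greedy in-order pour above is only for illustration; choosing *which* glasses to pour, given fluctuating sizes and per-user deadlines, is precisely the combinatorial decision the proposed DRL agent learns.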

