EFFICIENT REINFORCEMENT LEARNING IN RESOURCE ALLOCATION PROBLEMS THROUGH PERMUTATION INVARIANT MULTI-TASK LEARNING

Abstract

One of the main challenges in real-world reinforcement learning is to learn successfully from limited training samples. We show that in certain settings, the available data can be dramatically increased through a form of multi-task learning, by exploiting an invariance property in the tasks. We provide a theoretical performance bound for the gain in sample efficiency under this setting. This motivates a new approach to multi-task learning, which involves the design of an appropriate neural network architecture and a prioritized task-sampling strategy. We demonstrate empirically the effectiveness of the proposed approach on two real-world sequential resource allocation tasks where this invariance property occurs: financial portfolio optimization and meta federated learning.



1. INTRODUCTION

One means of increasing the samples available to a reinforcement learner is to incorporate models into the training process: the models encapsulate knowledge explicitly, complementing the experiences that are gained by sampling from the RL environment. Another means is to tilt the training towards one that will better transfer to related tasks: if the training process is sufficiently well adapted to more than one task, then the training of a particular task should be able to benefit from samples from the other related tasks. This idea was explored a decade ago in Lazaric & Ghavamzadeh (2010) and has been gaining traction ever since, as researchers try to extend the reach of deep reinforcement learning from its comfortable footing in solving games outrageously well to solving other important problems. Yu (2018) discusses a number of methods for increasing sample efficiency in RL and includes experience transfer as one important avenue, covering the transfer of samples, as we do here, the transfer of representations or skills, and jumpstarting models which are then ready to be quickly, i.e. with few samples, updated to different tasks. D'Eramo et al. (2020) address the same idea, noting that multi-task learning can improve the learning of each individual task, motivated by robotics-type tasks with underlying commonality, such as balancing a single vs. a double pendulum, or hopping vs. walking. We are interested in exploiting the ability of multi-task learning to solve the sample efficiency problem of RL. Our setting does not apply to all problem classes, nor does it seek to exploit the kind of physical similarities found in the robotics tasks that motivate Lazaric & Ghavamzadeh (2010); D'Eramo et al. (2020). Rather, we show that there are a number of reinforcement learning tasks with a particular fundamental property that makes them ideal candidates for multi-task learning with the goal of increasing the availability of samples for their training. We refer to this property as permutation invariance.
It is present in very diverse tasks: we illustrate it on a financial portfolio optimization problem, whereby trades are executed sequentially over a given time horizon, and on the problem of meta-learning in a federated supervised learning setting. Permutation invariance in the financial portfolio problem exhibits itself as follows: consider the task of allocating a portion of wealth to each of a number of financial instruments using a trading policy. If the trading policy is permutation invariant, one can change the order of the instruments without changing the policy. This allows one to generate multiple portfolio optimization tasks from a given set of financial instruments. A commonality between applications that have this property is that they concern sequential resource allocation: at each time step, the allocator scores the quality of each available candidate entity (a financial instrument, in the above example), then, based on those scores, apportions the resource (the total wealth to invest, in the above example) among the entities at that time step, so that the reward over the horizon of interest is maximized. Sequential resource allocation problems include applications such as the sequential allocation of budget, of space (e.g. in IT systems, hotels, delivery vehicles), of people to work slots or appointments, etc. Many such applications possess permutation invariance in that the ordering of the entities, i.e. of where the resources are allocated, can change without changing the resulting optimal allocation.

We show that under this form of permutation invariance, it is possible to derive a bound on the performance of the policy. The bound extends that of Lazaric & Ghavamzadeh (2010) and, while similar to the bound of D'Eramo et al. (2020), provides additional information beyond it.
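The mechanism above can be sketched in a few lines. The following is a minimal, hypothetical illustration (the scorer, feature shapes, and names are our own stand-ins, not the paper's architecture): a shared scoring function rates each entity from its own features, and a softmax over the scores apportions the resource. Because the scorer is shared across entities, reordering the entities merely reorders the allocation, so the policy itself is unchanged and permuted instrument orderings yield equivalent training tasks.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=4)  # shared per-entity scorer (stand-in for a network)

def policy(features):
    """features: (n_entities, n_features) -> allocation weights summing to 1."""
    scores = features @ W              # the same scorer is applied to every entity
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()

features = rng.normal(size=(5, 4))  # 5 instruments, 4 features each
perm = rng.permutation(5)

w = policy(features)
w_perm = policy(features[perm])     # same portfolio, instruments reordered
```

Checking `np.allclose(w[perm], w_perm)` confirms that permuting the instruments permutes the allocation weights identically, which is what allows one set of instruments to generate many equivalent tasks.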
We use the bound to motivate an algorithm that achieves substantially improved results as compared with solving each task on its own. The bound and the algorithm are first analyzed on a synthetic problem that validates the bound in our theorem and confirms the multi-task gain that the theory predicts. Hessel et al. (2018); Bram et al. (2019) have cautioned that performance on each task can degrade when some tasks bias the updates to the detriment of others in multi-task learning. They observe that some tasks have a greater density or magnitude of in-task rewards and hence a disproportionate impact on the learning process. In our setting, deleterious effects of some tasks on others could also arise. The algorithm we propose handles this through a form of prioritized sampling, where priorities are put on the tasks themselves, acting like a prioritized experience replay buffer applied to a multi-task learning problem. We show empirically that the priorities thus defined protect the overall learning problem from the deleterious effects that unrelated or unhelpful tasks could otherwise have on the policy.

The contributions of this work are as follows: (1) we identify the permutation invariance property of the class of reinforcement learning problems involving sequential resource allocation; (2) we define a method to increase sample efficiency in these reinforcement learning problems by leveraging this property; (3) we provide a theoretical performance bound for this class of problems; (4) we validate experimentally, on a synthetic problem, the benefit of permutation invariance for sample efficiency as well as the validity of the bound; and (5) we illustrate two real-world RL resource allocation tasks for which this property holds and demonstrate the benefits of the proposed method on sample efficiency and thus on the overall performance of the models.
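The prioritized task sampling described above can be sketched as follows. This is an illustrative sketch under our own assumptions (the class name, the power-law priority rule, and the use of a measured training gain as the priority signal are stand-ins, not the paper's exact algorithm): each task carries a priority, tasks are drawn in proportion to a smoothed power of their priority, and tasks that contribute little to the shared policy are sampled less and less often.

```python
import numpy as np

rng = np.random.default_rng(1)

class TaskSampler:
    def __init__(self, n_tasks, alpha=0.6, eps=1e-3):
        self.p = np.ones(n_tasks)  # initial priorities: uniform
        self.alpha = alpha         # how strongly priorities skew sampling
        self.eps = eps             # floor that keeps every task reachable

    def probs(self):
        scaled = (self.p + self.eps) ** self.alpha
        return scaled / scaled.sum()

    def sample(self):
        return rng.choice(len(self.p), p=self.probs())

    def update(self, task, gain):
        # gain: e.g. measured improvement of the shared policy after
        # training on this task; unhelpful tasks receive low priority.
        self.p[task] = abs(gain)

sampler = TaskSampler(n_tasks=4)
gain = [1.0, 0.9, 1.1, 0.05]   # task 3 is unrelated: almost no gain
for t in range(4):              # evaluate every task once to seed priorities
    sampler.update(t, gain[t])
for _ in range(100):            # prioritized multi-task training loop
    t = sampler.sample()
    sampler.update(t, gain[t])
```

The unhelpful task ends up with the smallest sampling probability, shielding the shared policy from its updates, which mirrors the protective effect we report empirically.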

2. RELATED WORK

A notable first stream of work on leveraging multi-task learning to enhance RL performance on single tasks can be found in Wilson et al. (2007); Lazaric & Ghavamzadeh (2010), which consider, as we do, that there is an underlying MDP from which the multiple tasks can be thought to derive. They use, however, a Bayesian approach and propose a different algorithmic method than ours. Teh et al. (2017) define a policy centroid, that is, a shared distilled policy, that captures the commonalities across the behaviors in the tasks. In these distillation-type methods, the tasks considered are simple or complex games. Teh et al. (2017) note that their policy-centroid method, Distral, is likely to be affected by task interference, in that differences across tasks may degrade the performance of the resulting policy on any of the constituent tasks.



Sample efficiency in reinforcement learning (RL) is an elusive goal. Recent attempts at increasing the sample efficiency of RL implementations have focused to a large extent on incorporating models into the training process: Xu et al. (2019); Clavera et al. (2018); Zhang et al. (2018); Berkenkamp et al. (2017); Ke et al. (2019); Yarats et al. (2019); Huang et al. (2019); Chua et al. (2018); Serban et al.

Our results extend the performance bounds of Lazaric et al. (2012) on single-task RL. As noted by Yu (2018), jumpstarting, or distilling the experiences and representations of relevant policies, is another means of increasing sample efficiency when solving a new but related problem. Rusu et al. (2016) use this idea in so-called progressive neural networks, and Parisotto et al. (2015) leverage multiple experts to guide the derivation of a general policy.

This topic was studied by Hessel et al. (2018); Bram et al. (2019). Hessel et al. (2018) propose a solution that extends the so-called PopArt normalization (van Hasselt et al., 2016) to re-scale the updates of each task so that the differing characteristics of the task-specific rewards do not skew the learning process. Bram et al. (2019) use a different approach that learns attention weights over the sub-networks of each task and discards those that are not relevant or helpful. Vuong et al. (2019); D'Eramo et al. (2020) are, like our work, concerned with the sharing of experiences to facilitate a more sample-efficient learning process. Vuong et al. (2019) suggest
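To make the re-scaling idea concrete, the following is a minimal sketch in the spirit of PopArt-style normalization, under our own simplifying assumptions; it is not the exact method of Hessel et al. (2018) or van Hasselt et al. (2016), which additionally adjust the value network's output layer so that its predictions are preserved when the statistics change. Each task tracks running statistics of its returns and learns from normalized targets, so tasks with large or dense rewards do not dominate the shared updates.

```python
import math

class RunningNorm:
    """Exponential running mean/variance used to normalize one task's returns."""
    def __init__(self, beta=0.1):
        self.mean, self.sq, self.beta = 0.0, 0.0, beta

    def update(self, g):
        self.mean = (1 - self.beta) * self.mean + self.beta * g
        self.sq = (1 - self.beta) * self.sq + self.beta * g * g

    def normalize(self, g):
        std = math.sqrt(max(self.sq - self.mean ** 2, 1e-8))
        return (g - self.mean) / std

# Two tasks whose rewards differ by three orders of magnitude produce
# comparable normalized learning targets.
small, large = RunningNorm(), RunningNorm()
for g in [0.1, 0.2, 0.15, 0.12]:
    small.update(g)
for g in [100.0, 200.0, 150.0, 120.0]:
    large.update(g)

a = small.normalize(0.18)
b = large.normalize(180.0)
```

Because the second return stream is an exact 1000x rescaling of the first, the two normalized targets `a` and `b` coincide, illustrating why the re-scaled updates no longer favor the large-reward task.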

