EFFICIENT REINFORCEMENT LEARNING IN RESOURCE ALLOCATION PROBLEMS THROUGH PERMUTATION INVARIANT MULTI-TASK LEARNING

Abstract

One of the main challenges in real-world reinforcement learning is to learn successfully from limited training samples. We show that in certain settings, the available data can be dramatically increased through a form of multi-task learning, by exploiting an invariance property in the tasks. We provide a theoretical performance bound for the gain in sample efficiency under this setting. This motivates a new approach to multi-task learning, which involves the design of an appropriate neural network architecture and a prioritized task-sampling strategy. We demonstrate empirically the effectiveness of the proposed approach on two real-world sequential resource allocation tasks where this invariance property occurs: financial portfolio optimization and meta federated learning.

1. INTRODUCTION

Sample efficiency in reinforcement learning (RL) is an elusive goal. Recent attempts at increasing the sample efficiency of RL implementations have focused to a large extent on incorporating models into the training process: Xu et al. (2019) ; Clavera et al. (2018) ; Zhang et al. (2018) ; Berkenkamp et al. (2017) ; Ke et al. (2019) ; Yarats et al. (2019) ; Huang et al. (2019) ; Chua et al. (2018) ; Serban et al. (2018) . The models encapsulate knowledge explicitly, complementing the experiences that are gained by sampling from the RL environment. Another means towards increasing the availability of samples for a reinforcement learner is by tilting the training towards one that will better transfer to related tasks: if the training process is sufficiently well adapted to more than one task, then the training of a particular task should be able to benefit from samples from the other related tasks. This idea was explored a decade ago in Lazaric & Ghavamzadeh (2010) and has been gaining traction ever since, as researchers try to increase the reach of deep reinforcement learning from its comfortable footing in solving games outrageously well to solving other important problems. Yu (2018) discusses a number of methods for increasing sample efficiency in RL and includes experience transfer as one important avenue, covering the transfer of samples, as we do here, transfer of representation or skills, and jumpstarting models which are then ready to be quickly, i.e. with few samples, updated to different tasks. D 'Eramo et al. (2020) address the same idea, noting that multi-task learning can improve the learning of each individual task, motivated by robotics-type tasks with underlying commonality, such as balancing a single vs. a double pendulum, or hopping vs. walking. We are interested in exploiting the ability of multi-task learning to solve the sample efficiency problem of RL. 
Our setting does not apply to all problem classes nor does it seek to exploit the kind of physical similarities found in robotics tasks that form the motivation of Lazaric & Ghavamzadeh (2010) ; D 'Eramo et al. (2020) . Rather, we show that there are a number of reinforcement learning tasks with a particular fundamental property that makes them ideal candidates for multi-task learning with the goal of increasing the availability of samples for their training. We refer to this property as permutation invariance. It is present in very diverse tasks: we illustrate it on a financial portfolio optimization problem, whereby trades are executed sequentially over a given time horizon, and on the problem of meta-learning in a federated supervised learning setting. Permutation invariance in the financial portfolio problem exhibits itself as follows: consider the task of allocating a portion of wealth to each of a number of financial instruments using a trading policy. If the trading policy is permutation invariant, one can change the order of the instruments without changing the policy. This allows one to generate multiple portfolio optimization tasks from a given set of financial instruments. A commonality between applications that have this property is that they concern sequential resource allocation: at each time step, the resource allocation scores the quality of each available candidate entity (for example a financial instrument in the above example), then based on those scores, apportions out the resource (the total wealth to invest, in the above example) among the entities at that time step, so that over the horizon of interest, the reward is maximized. Sequential resource allocation problems include applications such as sequential allocation of budget, sequential allocation of space, e.g. in IT systems, hotels, delivery vehicles, sequential allocation of people to work slots or appointments, etc. 
Many such applications possess permutation invariance in that the ordering of the entities, i.e. where the resources are allocated, can change without changing the resulting optimal allocation. We show that under this form of permutation invariance, it is possible to derive a bound on the performance of the policy. The bound extends that of Lazaric & Ghavamzadeh (2010) and, while similar to the bound of D'Eramo et al. (2020), provides additional information beyond it. We use the bound to motivate an algorithm that yields substantially improved results compared with solving each task on its own. The bound and the algorithm are first analyzed on a synthetic problem that validates the bound in our theorem and confirms the multi-task gain that the theory predicts. Hessel et al. (2018) and Bram et al. (2019) have cautioned that, in multi-task learning, performance on each task can degrade when some tasks bias the updates to the detriment of others. They claim that some tasks have a greater density or magnitude of in-task rewards and hence a disproportionate impact on the learning process. In our setting, deleterious effects of some tasks on others could also arise. The algorithm we propose handles this through a form of prioritized sampling in which priorities are placed on the tasks themselves; it acts like a prioritized experience replay buffer applied to a multi-task learning problem. We show empirically that the priorities thus defined protect the overall learning problem from the deleterious effects that unrelated or unhelpful tasks could otherwise have on the policy.
The contributions of this work are as follows: (1) we identify the permutation invariance property of the class of reinforcement learning problems involving sequential resource allocation; (2) we define a method to increase sample efficiency in these reinforcement learning problems by leveraging this property of permutation invariance; (3) we provide a theoretical performance bound for the class of problems; (4) we validate experimentally the utility of permutation invariance for sample efficiency, as well as the validity of the bound, on a synthetic problem; and (5) we illustrate two real-world RL resource allocation tasks for which this property holds and demonstrate the benefits of the proposed method on sample efficiency and thus also on the overall performance of the models.

2. RELATED WORK

A notable first stream of work on leveraging multi-task learning for enhancing RL performance on single tasks can be found in Wilson et al. (2007) and Lazaric & Ghavamzadeh (2010), which consider, as we do, that there is an underlying MDP from which the multiple tasks can be thought to derive. They use, however, a Bayesian approach and propose a different algorithmic method from ours. Our results extend performance bounds by Lazaric et al. (2012) on single-task RL. As noted by Yu (2018), jumpstarting, or distilling experiences and representations of relevant policies, is another means of increasing sample efficiency in solving a new but related problem. Rusu et al. (2016) use this idea in so-called progressive neural networks, and Parisotto et al. (2015) leverage multiple experts to guide the derivation of a general policy. With a similar objective, Teh et al. (2017) define a policy centroid, that is, a shared distilled policy that captures the commonalities across the behaviors in the tasks. In all of these distillation-type methods, the tasks considered are simple or complex games. Teh et al. (2017) note that their policy centroid method, Distral, is likely to be affected by task interference, in that differences across tasks may degrade the performance of the resulting policy on any of the constituent tasks. This topic was studied by Hessel et al. (2018) and Bram et al. (2019). Hessel et al. (2018) proposed a solution that extends the so-called PopArt normalization (van Hasselt et al., 2016) to re-scale the updates of each task so that the different characteristics of the task-specific rewards do not skew the learning process. Bram et al. (2019) use a different approach that learns attention weights over the sub-networks of each task and discards those that are not relevant or helpful. Vuong et al. (2019) and D'Eramo et al. (2020) are, like our work, concerned with sharing experiences to facilitate a more sample-efficient learning process. Vuong et al. (2019) suggest identifying the shared portions of tasks to allow sharing of samples in those portions. The work of D'Eramo et al. (2020) is in some ways quite similar to ours: the authors' goal is the same and they derive, as we do, a bound on the performance in this setting. However, their setting is different in that their tasks have both shared and task-specific components, and their bound becomes tighter only as the number of tasks increases. In our setting, we do not require a task-specific component, and we are able to show how the distance between the MDPs of each task, in addition to the number of tasks, affects the strength of the bound. Recently, permutation invariance has been exploited in deep multi-agent reinforcement learning (Liu et al., 2019), where the invariance properties arise naturally in a homogeneous multi-agent setting. Their work employs permutation invariance in learning the critic, whereas in our case the entire learned policy employs permutation invariance.

3. PRELIMINARIES

We begin by defining notation. For a measurable space with domain $\mathcal{X}$, let $\mathcal{S}(\mathcal{X})$ denote the set of probability measures over $\mathcal{X}$, and $\mathcal{B}(\mathcal{X}; L)$ the space of bounded measurable functions with domain $\mathcal{X}$ and bound $0 < L < \infty$. For a measure $\rho \in \mathcal{S}(\mathcal{X})$ and a measurable function $f : \mathcal{X} \to \mathbb{R}$, the $\ell_2(\rho)$-norm of $f$ is $\|f\|_\rho$, and for a set of $n$ points $X_1, \dots, X_n \in \mathcal{X}$, the empirical norm is $\|f\|_n$, where

$$\|f\|_\rho^2 = \int f(x)^2\,\rho(dx) \quad\text{and}\quad \|f\|_n^2 = \frac{1}{n}\sum_{t=1}^{n} f(X_t)^2 .$$

Let $\|f\|_\infty = \sup_{x\in\mathcal{X}} |f(x)|$ be the supremum norm of $f$.

Consider a set of MDPs indexed by $t$. Each MDP is denoted by a tuple $\mathcal{M}_t = \langle \mathcal{X}, \mathcal{A}, R_t, P_t, \gamma \rangle$, where $\mathcal{X}$, a bounded closed subset of the $s$-dimensional Euclidean space, is a common state space; $\mathcal{A}$ is a common action space; $R_t : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is a task-specific reward function uniformly bounded by $R_{\max}$; $P_t$ is a task-specific transition kernel such that $P_t(\cdot|x, a)$ is a distribution over $\mathcal{X}$ for all $x \in \mathcal{X}$ and $a \in \mathcal{A}$; and $\gamma \in (0, 1)$ is a common discount factor. Deterministic policies are denoted by $\pi : \mathcal{X} \to \mathcal{A}$. For a given policy $\pi$, the MDP $\mathcal{M}_t$ is reduced to a Markov chain $\mathcal{M}_t^\pi = \langle \mathcal{X}, R_t^\pi, P_t^\pi, \gamma \rangle$ with reward function $R_t^\pi(x) = R_t(x, \pi(x))$, transition kernel $P_t^\pi(\cdot|x) = P_t(\cdot|x, \pi(x))$, and stationary distribution $\rho_t^\pi$. The value function $V_t^\pi$ for MDP $t$ is defined as the unique fixed point of the Bellman operator $T_t^\pi : \mathcal{B}(\mathcal{X}; V_{\max} = R_{\max}/(1-\gamma)) \to \mathcal{B}(\mathcal{X}; V_{\max})$, given by

$$(T_t^\pi V)(x) = R_t^\pi(x) + \gamma \int_{\mathcal{X}} P_t^\pi(dy|x)\,V(y).$$

Let $\pi_t^*$ denote the optimal policy for $\mathcal{M}_t$. The optimal value function $V_t^{\pi_t^*}$ for $\mathcal{M}_t$ is defined as the unique fixed point of its optimal Bellman operator $T_t^{\pi_t^*}$, which is defined by

$$(T_t^{\pi_t^*} V)(x) = \max_{a\in\mathcal{A}} \Big[ R_t(x, a) + \gamma \int_{\mathcal{X}} P_t(dy|x, a)\,V(y) \Big].$$

To approximate the value function $V$, we use a linear approximation architecture with parameters $\alpha \in \mathbb{R}^d$ and basis functions $\phi_i \in \mathcal{B}(\mathcal{X}; L)$ for $i = 1, \dots, d$. Let $\phi(\cdot) = (\phi_1(\cdot), \dots, \phi_d(\cdot))^\top \in \mathbb{R}^d$ be the feature vector and $\mathcal{F}$ the linear function space spanned by the basis functions $\phi_i$.
Thus, $\mathcal{F} = \{ f_\alpha \mid \alpha \in \mathbb{R}^d \text{ and } f_\alpha(\cdot) = \phi(\cdot)^\top \alpha \}$.

Consider a learning task to dynamically allocate a common resource across entities $U_t \subseteq U$. Each $t$ corresponds to a task, but for now take $t$ to be an arbitrary fixed index. At each time step $n$, the decision maker observes the states $x_n = (x_{i,n})_{i\in U_t}$ of the entities, where $x_{i,n}$ is the state of entity $i$, and takes action $a_n = (a_{i,n})_{i\in U_t}$, where $a_{i,n}$ is the share of the resource allocated to entity $i$. The total resource capacity is normalized to 1 for convenience. Therefore, allocations satisfy $0 \le a_{i,n} \le 1$ and $\sum_{i\in U_t} a_{i,n} = 1$. We consider a policy $\pi_\theta(x_n)$ parameterized by $\theta$. Assume that we have access to the reward function $R_t$ as well as a simulator that generates a trajectory of length $N$ given any arbitrary policy $\pi_\theta$. The objective of the learning task is to maximize

$$J_t(\theta) = \mathbb{E}\Big[ \sum_{n=1}^{N} \gamma^{n-1} R_t(x_n, a_n) \;\Big|\; a_n = \pi_\theta(x_n),\; x_{n+1} \sim P_t(\cdot|x_n, a_n),\; x_1 \sim P_t(\cdot) \Big].$$

In many settings, $N$ is small and simulators are inaccurate; therefore, trajectories generated by the simulator are poor representations of the actual transition dynamics. This occurs in batch RL, where trajectories are rollouts from a dataset. In these cases, policies overfit and generalize poorly.
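The allocation constraints and the discounted objective above can be made concrete with a small sketch. It is illustrative only: the softmax mapping from scores to shares and the function names `softmax_allocation`, `discounted_return` and `estimate_objective` are assumptions introduced here, not part of the paper. The objective is estimated by averaging discounted returns over simulated trajectories.

```python
import math

def softmax_allocation(scores):
    """Map real-valued entity scores to shares a_i with 0 <= a_i <= 1 and sum(a) = 1.

    Softmax is an assumed choice; any map onto the probability simplex would do.
    """
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def discounted_return(rewards, gamma):
    """Compute sum_{n=1}^{N} gamma^(n-1) * R_n for one trajectory of rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def estimate_objective(reward_trajectories, gamma):
    """Monte Carlo estimate of J_t(theta): average discounted return over trajectories."""
    returns = [discounted_return(tr, gamma) for tr in reward_trajectories]
    return sum(returns) / len(returns)
```

With short trajectories (small $N$) and few of them, this estimate is noisy, which is the overfitting risk the paragraph above describes.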

4. THEORETICAL RESULTS

We first introduce a property that we term permutation invariance for the policy network, which can be shown to help significantly reduce overfitting.

Definition 1 (Permutation Invariant Policy Network) A policy network $\pi_\theta$ is permutation invariant if it satisfies $\pi_\theta(\sigma(x)) = \sigma(\pi_\theta(x))$ for any permutation $\sigma$.

Permutation invariant policy networks have significant advantages over completely integrated policy networks. While the latter are likely to fit correlations between different entities, this is not possible with permutation invariant policy networks, as they are agnostic to the identities of entities. Therefore, permutation invariant policy networks are better able to leverage experience across time and entities, leading to greater efficiency in data usage. Moreover, observe that if the transition kernels can be factored into independent and identical transition kernels across entities, then the optimal policy is indeed permutation invariant.

Our main theoretical contributions start with an extension of results from Lazaric et al. (2012), where a finite-sample error bound was derived for the least squares policy iteration (LSPI) algorithm on a single task. Lazaric et al. (2012) provided a high-probability bound on the performance difference between the final learned policy and the optimal policy, of the form $c_1 + c_2/\sqrt{N}$, where $c_1$ and $c_2$ are constants that depend on the task and the chosen feature space, and $N$ is the number of training examples. We extend their result by showing that, as long as tasks are $\epsilon$-close to each other (with respect to a similarity measure we define later), the error bound of solving each task using our multi-task approach has the form $c_1 + c_2/\sqrt{NT} + c_3\epsilon$, where $T$ is the number of tasks and $c_3$ is a task-dependent constant. Specifically, our theorem provides a general result and performance guarantee with respect to using data from a different but similar MDP. Definition 1 provides a basis for generating many such MDPs.
Finally, the benefit of doing so shall be provided by Corollary 2. Thus, provided $\epsilon$ is small, a given task can benefit from a much larger set of $NT$ training examples. In addition to the assumptions of Lazaric et al. (2012), we extend the definition of second-order discounted-average concentrability, proposed in Antos et al. (2008), and define the notion of first-order discounted-average concentrability. The latter will be used in our main result, Theorem 1.

Assumption 1 There exists a distribution $\mu \in \mathcal{S}(\mathcal{X})$ such that for any policy $\pi$ that is greedy with respect to a function in the truncated space $\tilde{\mathcal{F}}$, $\mu \le C\rho_t^\pi$ for all $t$, where $C < \infty$ is a constant. Given the target distribution $\sigma \in \mathcal{S}(\mathcal{X})$ and an arbitrary sequence of policies $\{\pi_m\}_{m\ge 1}$, let

$$c_{\sigma,\mu}(m) = \sup_{\pi_1,\dots,\pi_m} \left\| \frac{d(\mu P^{\pi_1} \cdots P^{\pi_m})}{d\sigma} \right\| .$$

We assume that $C'_{\sigma,\mu}, C''_{\sigma,\mu} < \infty$, and define first and second order discounted-average concentrability of future-state distributions as follows:

$$C'_{\sigma,\mu} = (1-\gamma) \sum_{m\ge 0} \gamma^m c_{\sigma,\mu}(m), \qquad C''_{\sigma,\mu} = (1-\gamma)^2 \sum_{m\ge 1} m\gamma^{m-1} c_{\sigma,\mu}(m).$$

Theorem 1 (Multi-Task Finite-Sample Error Bound) Let $\mathcal{M} = \langle \mathcal{X}, \mathcal{A}, R, P, \gamma \rangle$ be an MDP with reward function $R$ and transition kernel $P$. Assume $\mathcal{A}$ is finite. Denote its Bellman operator by $(T^\pi V)(x) = R^\pi(x) + \gamma \int_{\mathcal{X}} P^\pi(dy|x)V(y)$. Given a policy $\pi$, define the Bellman difference operator between $\mathcal{M}_t$ and $\mathcal{M}$ to be $D_t^\pi V = T_t^\pi V - T^\pi V$. Apply the LSPI algorithm to $\mathcal{M}$ by generating, at each iteration $k$, a path from $\mathcal{M}$ of size $N$, where $N$ satisfies Lemma 4 in Lazaric et al. (2012). Let $V_{-1} \in \tilde{\mathcal{F}}$ be an arbitrary initial value function, $V_0, \dots, V_{K-1}$ ($\tilde V_0, \dots, \tilde V_{K-1}$) be the sequence of value functions (truncated value functions) generated by LSPI after $K$ iterations, and $\pi_k$ be the greedy policy w.r.t. the truncated value function $\tilde V_{k-1}$. Suppose also that $\|D_t^\pi V^\pi\|_\mu \le \epsilon$ for all $\pi$, and $\|D_t^{\pi_k} \tilde V_{k-1}\|_\mu \le \epsilon$ for all $k$.
Then, for constants c 1 , c 2 , c 3 , c 4 that are dependent on M, with probability 1 -δ (with respect to the random samples): V π * t t -V π K t σ ≤ c 1 1 √ N + c 2 C σ,µ + c 3 C σ,µ + c 4 . The proof is deferred to the Appendix. Theorem 1 formalizes the trade off between drawing fewer samples from the exact MDP M t , versus drawing more samples from a different MDP M. Importantly, it shows how to benefit from solving a different MDP, M, when: (a) additional samples can be obtained from M, and (b) M is not too different from M t . In particular, the distance measure is simply the distance between the Bellman operators of the MDPs, which can be bounded if the difference in both the transition and reward functions are bounded. In recent work, a performance bound for multi-task learning was given in Theorem 2 and 3 of D 'Eramo et al. (2020) . However, the authors used a different setup containing both shared and task-specific representations, and their focus was on showing that the cost of learning the shared representation decreases with more tasks. They did not show how the similarity or difference across tasks affects performance. In contrast, our setup does not contain task-specific representations, and our focus is on how differences across MDPs impact the benefit of having more tasks (and consequently more samples). We show this in Corollary 1 and Corollary 2. Remark 1 While our theoretical results are based on LSTD and LSPI and assume finite action space, our approach is applicable to a wide range of reinforcement learning algorithms, including policy gradient methods and to MDPs with continuous action spaces. Deriving similar results for a larger family of models and algorithms remains an interesting, albeit challenging, future work. Permutation invariant policy networks allow using data from the global set of entities U. 
Since the policy network is agnostic to the identities of the entities, one can learn a single policy for all tasks, where each task $t \in [T]$ is a resource allocation problem over a subset of entities $U_t$. For notational simplicity, assume that all tasks have the same number of entities, and all trajectories are of equal length $N$. Our approach can, however, be readily extended to tasks with different numbers of entities and different trajectory lengths. Permutation invariance allows a large set of MDPs to leverage the result of Theorem 1. In the next section we provide an algorithm, motivated by the following corollaries, and a prioritized sampling strategy for this setting that drives significantly greater sample efficiency for the original task. The sampling strategy also helps to stabilize the learning process, reducing the risk of deleterious effects of the multi-task setting, as discussed by Teh et al. (2017) and addressed in works such as Hessel et al. (2018) and Bram et al. (2019).

Corollary 1 Let $[T]$ be a set of similar tasks such that their distance from the average MDP, given by

$$(T^\pi V)(x) = \frac{1}{T}\sum_{t=1}^{T} R_t^\pi(x) + \gamma \int_{\mathcal{X}} \frac{1}{T}\sum_{t=1}^{T} P_t^\pi(dy|x)\,V(y),$$

is bounded by $\epsilon$ as defined in Theorem 1. Let $N$ be the number of samples available in each task. Let $\pi_K$ be the policy obtained at the $K$th iteration when applying LSPI to the average MDP. Then, the suboptimality of the policy on each task is $O(1/\sqrt{NT}) + O(\epsilon) + c$ for some constant $c$ (where suboptimality is defined according to Theorem 1).

Recall that each task is formed by selecting a subset $U_t$ of entities from the global set $U$. We thus have the following sample gain that can be attributed to the permutation invariance of the policy network.
Disregarding correlation between samples from tasks with overlapping entities, Corollary 1 and Corollary 2 together suggest that the (up to) exponential increase in the number of available tasks can significantly improve sample efficiency compared with learning each task separately.
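As a minimal sketch of Definition 1, the code below builds a policy that applies the same scoring function to every entity and then normalizes, and checks numerically that $\pi_\theta(\sigma(x)) = \sigma(\pi_\theta(x))$ for every permutation of a small input. The affine scorer, the softmax normalization, and all function names are assumptions made for illustration; the paper's network is not specified at this level of detail. The last helper simply counts how many distinct subsets of a given size can be drawn from a universe of entities, which is the source of the growth in the number of tasks.

```python
import math
from itertools import permutations

def shared_score(x_i, theta):
    """Score one entity with parameters shared by all entities (assumed affine form)."""
    weight, bias = theta
    return weight * x_i + bias

def policy(x, theta):
    """Allocate the resource by a softmax over per-entity shared scores."""
    scores = [shared_score(x_i, theta) for x_i in x]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def permute(values, sigma):
    """Reorder values so that position j holds values[sigma[j]]."""
    return [values[j] for j in sigma]

def is_permutation_invariant(x, theta, tol=1e-12):
    """Check pi(sigma(x)) == sigma(pi(x)) for every permutation sigma of the entities."""
    base = policy(x, theta)
    for sigma in permutations(range(len(x))):
        lhs = policy(permute(x, sigma), theta)
        rhs = permute(base, sigma)
        if any(abs(l - r) > tol for l, r in zip(lhs, rhs)):
            return False
    return True

def num_subset_tasks(universe_size, task_size):
    """Number of distinct entity subsets of a fixed size, i.e. candidate tasks."""
    return math.comb(universe_size, task_size)
```

A network that concatenates all entity states and feeds them through a dense layer would generally fail this check, since its output depends on the position of each entity in the input.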

5. EXPLOITING PERMUTATION INVARIANCE THROUGH MULTI-TASK REINFORCEMENT LEARNING

Our approach to exploiting permutation invariance is via multi-task reinforcement learning, where each "task" corresponds to a particular choice of subset $U_t \subset U$. Furthermore, for each task, we enforce permutation invariance among the entities $i$ by forcing the neural network, through parameter sharing, to apply the same sequence of operations to the state input $x_i$ of each entity. The proposed method, shown in Algorithm 1, learns a single policy by sampling subsequences of trajectories from the different MDPs. At each step, we sample a task $t$ according to a distribution defined by the task selection policy $p$. Then, a minibatch $B_t$ is drawn from the replay buffer for task $t$, and a gradient step is performed using the sampled transitions $B_t$ (alternatively, samples can be generated using policy rollouts for the specific task). A separate replay buffer is maintained for each task and is updated only when the corresponding task is being used. In contrast with other active sampling approaches in multi-task learning, our approach maintains an estimate of the difficulty of each task $t$ as a score, $s_t$. After each training step, we update the score for only the sampled task based on minibatch $B_t$, avoiding evaluation over all the tasks. The scoring functions depend on the sampled minibatch; to reduce fluctuations in the scores for each task, exponential smoothing with factor $\gamma$ is applied: $s_t \leftarrow \gamma s_t + (1-\gamma)\cdot\mathrm{scorer}(B_t)$. We propose a stochastic prioritization method that interpolates between pure greedy prioritization and uniform random sampling. Our approach is similar to prioritized experience replay (PER) by Schaul et al. (2016), but while classical PER prioritizes samples, we prioritize tasks. The probability of sampling task $t$ is $p_t = s_t^\alpha / \sum_{t'} s_{t'}^\alpha$, where the exponent $\alpha$ determines the degree of prioritization, with $\alpha = 0$ corresponding to the uniform case.
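The score update and the task-sampling rule described above can be sketched as follows. This is a hedged illustration rather than the authors' Algorithm 1: the function names, the smoothing value, and the random stand-in for the scorer output in `run_demo` are assumptions introduced here, and scores are assumed to be strictly positive.

```python
import random

def smooth_score(old_score, new_score, smoothing):
    """Exponential smoothing: s_t <- smoothing * s_t + (1 - smoothing) * scorer(B_t)."""
    return smoothing * old_score + (1.0 - smoothing) * new_score

def task_probabilities(scores, alpha):
    """p_t = s_t^alpha / sum_t' s_t'^alpha; alpha = 0 gives uniform sampling."""
    powered = [s ** alpha for s in scores]
    total = sum(powered)
    return [p / total for p in powered]

def sample_task(probs, rng):
    """Draw one task index according to the task-selection distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def run_demo(num_tasks=3, steps=100, alpha=0.6, smoothing=0.9, seed=0):
    """Toy loop: only the sampled task's score is refreshed at each step."""
    rng = random.Random(seed)
    scores = [1.0] * num_tasks
    for _ in range(steps):
        t = sample_task(task_probabilities(scores, alpha), rng)
        minibatch_score = rng.random()  # hypothetical stand-in for scorer(B_t)
        scores[t] = smooth_score(scores[t], minibatch_score, smoothing)
    return scores
```

Because only the sampled task is rescored, the cost per training step does not grow with the number of tasks, which is the computational point made in the text.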
We correct for bias with importance-sampling (IS) weights $w_t = \big(1/(T p_t)\big)^\beta$, which fully compensate for non-uniform probabilities if $\beta = 1$. We normalize weights by $1/\max_t w_t$. Tasks on which the reward variance is high can be interpreted as having more challenging samples, hence reward variance can be used as a scoring function.

With the aim of validating the theory presented in Section 4, we define a synthetic example to explore the efficiency gain afforded by permutation invariance. To do so, we control the deviation $\epsilon$ between any two tasks, thereby empirically validating the main theoretical results. Consider a resource allocation problem where the observed state $x_i$ for each entity $i \in \{1, \dots, m\}$ is a single scalar $x_i \in [0, 1]$. The action space is the probability simplex, where each action $a = (a_1, \dots, a_m)$ satisfies $a_i \ge 0$ and $\sum_i a_i = 1$. The reward function is $R(x, a) = \sum_i x_i a_i - \sum_i \beta_i a_i \log a_i$, where $\beta_i$ is a weight parameter for each entity. Note that when $\beta_i = \beta$ for all $i$, the reward function becomes $R(x, a) = \big(\sum_i x_i a_i\big) + \beta H(a)$, where $H$ is the Shannon entropy. This implies that maximizing the reward involves a tradeoff between focusing resources on high $x_i$ and distributing them uniformly across all $i$. Note that the reward function is then permutation invariant, but that when we allow $\beta_i$ to vary over the entities, the function deviates from being perfectly permutation invariant. We use the range $\max_i \beta_i - \min_i \beta_i$ as a stand-in for $\epsilon$. Let $m = 10$. For each $\epsilon$, we run two experiments. The first examines the performance of policies trained by LSPI using $N$ real examples drawn i.i.d. from the state-action space, for $N = 20, \dots, 2000$. A small Gaussian noise is added to each reward to make learning harder. The second experiment uses only 20 real examples, but augments the training set (up to $N$) through random permutation of the real examples. The first two plots in Fig. 1 show the results for $\epsilon = 0.8$ and $\epsilon = 0$, respectively. Performance improves with $N$, as predicted by the $1/\sqrt{N}$ term in our error bound.
Note that in the experiment using only 20 real examples, a performance gain is achieved by using permuted examples; this corresponds precisely to the multi-task gain predicted by the $1/\sqrt{NT}$ term. When $\epsilon$ is large, there is a significant gap between the results of the two experiments, as predicted by the $\epsilon$-term in the error bound. The last plot in Fig. 1 shows this gap at $N = 2000$ when $\epsilon$ varies from 0 to 0.8.
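The synthetic reward and the permutation-based augmentation can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the convention $0 \log 0 = 0$ is assumed, and the way a permuted example is constructed (jointly reordering the state and the action) is one plausible reading of the augmentation step, not a confirmed detail of the experiments.

```python
import math
import random

def reward(x, a, beta):
    """R(x, a) = sum_i x_i a_i - sum_i beta_i a_i log a_i, taking 0 log 0 = 0."""
    linear = sum(x_i * a_i for x_i, a_i in zip(x, a))
    entropy_like = sum(b_i * a_i * math.log(a_i)
                       for b_i, a_i in zip(beta, a) if a_i > 0.0)
    return linear - entropy_like

def epsilon_proxy(beta):
    """Range of the entity weights, used as a stand-in for the task distance epsilon."""
    return max(beta) - min(beta)

def permuted_copy(x, a, rng):
    """Apply one random permutation jointly to the state and the action."""
    sigma = list(range(len(x)))
    rng.shuffle(sigma)
    return [x[j] for j in sigma], [a[j] for j in sigma]
```

When all $\beta_i$ are equal, `reward` returns the same value for an example and any jointly permuted copy, so the copies are exact samples of the same task. When the $\beta_i$ differ, a permuted copy generally has a different true reward, which is the mismatch that the $\epsilon$-term accounts for.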

6.2. REAL-WORLD DATA

We consider two real-world resource allocation settings: financial portfolio optimization and meta federated learning. Financial portfolio optimization is discussed below, while meta federated learning is in the Appendix. Given historical prices for a universe of financial assets, $U$, the goal of task $t$ is to allocate investments across a subset of assets $U_t \subseteq U$. The multiple tasks $t$ thus correspond to multiple portfolios of instruments. Permutation invariance is of use in this setting since, from a given universe of instruments (e.g. the 500 instruments in the S&P 500), an exponential number of tasks can be generated, each with its own portfolio. Consider now one such task. At the beginning of time period $n$, the action $a_{i,n}$ represents the fraction of wealth the decision maker allocates to asset $i$. The allocations evolve over the time period due to changes in asset prices. Let $w_{i,n}$ denote the allocation of asset $i$ at the end of time period $n$. We model the state of an asset using its current allocation and a window of its $H$ most recent prices. In particular, let $v_{i,n}$ denote the close price of asset $i$ over time period $n$, and let $y_{i,n} = v_{i,n}/v_{i,n-1}$ denote the ratio of close prices between adjacent time periods (daily high and low prices are also used in the state but are omitted here for brevity). Then, the allocation in asset $i$ at the end of time period $n$ is given by

$$w_{i,n} = \frac{a_{i,n}\, y_{i,n}}{\sum_{j\in U_t} a_{j,n}\, y_{j,n}},$$

and the state of asset $i$ at the beginning of time period $n$ is given by

$$x_{i,n} = \big(w_{i,n-1},\; v_{i,n-H}/v_{i,n-1},\; \dots,\; v_{i,n-2}/v_{i,n-1}\big).$$

The change in portfolio value over period $n$ depends on the asset prices and on the transaction costs incurred in rebalancing the portfolio from $(w_{i,n-1})_{i\in U_t}$ to $(a_{i,n})_{i\in U_t}$. The reward over period $n$ is defined as the log rate of return:

$$R_t(x_n, a_n) = \ln\Big[ \beta\big((w_{i,n-1})_{i\in U_t}, (a_{i,n})_{i\in U_t}\big) \sum_{i\in U_t} a_{i,n}\, y_{i,n} \Big],$$

where $\beta$ can be evaluated using an iterative procedure (see Jiang et al. (2017)).
Defining the reward this way is appealing because maximizing the average total reward over consecutive periods is equivalent to maximizing the total rate of return over the periods. To leverage this, we approximate $\beta\big((w_{i,n-1})_{i\in U_t}, (a_{i,n})_{i\in U_t}\big) \approx c \sum_{i\in U_t} |w_{i,n-1} - a_{i,n}|$, where $c$ is a commission rate, to obtain a closed-form expression for $R_t(x_n, a_n)$ (see Jiang et al. (2017)). We optimize using direct policy gradient on minibatches of consecutive samples,

$$\theta \leftarrow \theta + \eta \nabla_\theta \frac{1}{B} \sum_{n=n_b}^{n_b+B-1} w_t\, R_t\big(x_n, \pi_\theta(x_n)\big),$$

where $n_b$ is the first time index in the minibatch, $B$ the size of a minibatch, and $w_t$ the IS weight for task $t$. As in Jiang et al. (2017), we sample $n_b$ from a geometric distribution that prioritises recent samples, and we implement replay buffers for each task. A benchmark trading strategy is the equal constantly-rebalanced portfolio (Equal CRP), which rebalances to maintain equal weights. As we noted earlier, ideally one would prefer the scoring function to depend only on the minibatch $B_t$. A deviation from Equal CRP can be viewed as learning to exploit price movements, and we therefore use it as the goal of the policy. Prioritised MTL thus prioritises tasks which deviate from Equal CRP. Note that the policy deviates from CRP only when profitable. Let

$$\mathrm{scorer}(B_t) = \max_{n\in\{n_b,\dots,n_b+B-1\}} \Big\| \pi_\theta(x_n) - \frac{1}{|U_t|} \Big\|_\infty$$

be the scoring of tasks in Prioritised MTL, that is, the maximum absolute deviation of the minibatch allocation from Equal CRP. Figure 2 (left) shows a scatter plot of the maximum score seen every 50 steps and the change in episode rewards in a single-task learning experiment, and (right) of the minibatch score and the maximum gradient norm for the minibatch. Higher scores imply higher variance in the episode rewards and hence more challenging and useful samples.
The correlation between scores and gradient norms shows that our approach is performing gradient-based prioritisation (see Katharopoulos & Fleuret (2018); Loshchilov & Hutter (2015); Alain et al. (2015)), but in a computationally efficient manner. The details of the dataset and parameter settings can be found in the Appendix. Figure 3 shows the performance of the learned policies tested on 10 tasks drawn from out-of-sample instruments. The policy network with weights initialized close to zero behaves like an Equal CRP policy. As noted, any profitable deviation from Equal CRP implies learning useful trading strategies. The plots show that the MTL policies perform well on instruments never seen during training, offering a remarkable benefit for using RL in the design of trading policies. Fig. 4 shows the performance of prioritised multi-task learning (MTL) versus single-task learning (STL) (i.e. learning a policy for each task independently on the instruments in the task). We also show results for MTL without prioritised sampling, i.e., with $\alpha = 0$. We consider 5 tasks and 30 tasks. The plots show that prioritised MTL performs significantly better than STL in both convergence time and final achieved performance. The performance with 30 tasks is significantly better than the performance with 5 tasks, showing that our approach leverages the samples of the additional tasks. Fig. 5 illustrates the typical behavior of a multi-task learning (MTL) and a single-task learning (STL) policy on the test period for tasks where the multi-task policy performed significantly better. The single-task policy kept constant equal allocations, while the multi-task policy was able to learn more complex allocations. In financial data, strongly trending prices do not occur often and are inherently noisy. Multi-task learning with permutation invariance helps with both challenges, allowing the algorithm to learn more complex patterns in a given training period.

[Figure caption fragment] A curve further to the right shows higher gain over STL. From 30% on the y-axis, the P-MTL gain is higher (more towards the right) than the MTL gain. As expected, when few tasks are used, prioritizing tasks doesn't help much (y-axis from 0 to 0.2).

Figure 5: Comparison of a multi-task policy vs. a single-task policy on the testing period for a specific task. The leftmost plot shows the percentage gain in portfolio value over time for both policies against that from the baseline Equal CRP policy. The right two plots show the asset allocations.
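The portfolio quantities used in this section can be sketched in plain Python. The sketch makes several assumptions that are not in the paper: all function names are hypothetical, the transaction factor $\beta$ is passed in as a number rather than computed (its iterative evaluation is not reproduced here), and the scorer follows the max and sup-norm formula as written.

```python
import math

def drifted_weights(a, y):
    """End-of-period allocation w_i = a_i y_i / sum_j a_j y_j after price moves."""
    total = sum(a_i * y_i for a_i, y_i in zip(a, y))
    return [a_i * y_i / total for a_i, y_i in zip(a, y)]

def log_return_reward(beta_value, a, y):
    """Log rate of return ln(beta * sum_i a_i y_i) for one period.

    beta_value is the transaction factor, supplied by the caller (assumption).
    """
    growth = sum(a_i * y_i for a_i, y_i in zip(a, y))
    return math.log(beta_value * growth)

def crp_score(minibatch_allocations):
    """Largest absolute deviation of any minibatch allocation from equal weights."""
    return max(
        max(abs(a_i - 1.0 / len(a)) for a_i in a)
        for a in minibatch_allocations
    )

def is_weights(task_probs, beta):
    """Importance weights (1 / (T p_t))^beta, normalized by the largest weight."""
    num_tasks = len(task_probs)
    raw = [(1.0 / (num_tasks * p)) ** beta for p in task_probs]
    top = max(raw)
    return [w / top for w in raw]
```

A policy that stays at Equal CRP gets a score of zero from `crp_score`, so under prioritised sampling such a task is drawn less often than tasks on which the policy has started to deviate from equal weights.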

7. CONCLUSIONS

We introduce an approach for increasing the sample efficiency of reinforcement learning in a setting with widespread applicability within the class of sequential resource allocation problems. The approach relies on a property we call permutation invariance: resources are allocated to entities according to a score, and the order of the entities can change without modifying the optimal allocation. Under this property, we show that a bound exists on the policy performance. This bound motivates a highly effective algorithm for improving the policy through a multi-task approach. Using prioritized task-sampling, the method not only improves the reward of the final policy but also renders it more robust. We illustrate the property and the method on two important problems: sequential financial portfolio optimization and meta federated learning, where the latter is provided in the Appendix.

A APPENDIX

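Theorem 1. Let $M = \langle \mathcal{X}, \mathcal{A}, R, P, \gamma \rangle$ be an MDP with reward function $R$ and transition kernel $P$. Denote its Bellman operator by

$$(T^{\pi} V)(x) = R^{\pi}(x) + \gamma \int_{\mathcal{X}} P^{\pi}(dy \mid x)\, V(y).$$

Given a policy $\pi$, define the Bellman difference operator between $M_t$ and $M$ to be $D_t^{\pi} V = T_t^{\pi} V - T^{\pi} V$. Apply the LSPI algorithm to $M$ by generating, at each iteration $k$, a path from $M$ of size $N$, where $N$ satisfies Lemma 4 in Antos et al. (2008). Let $V_{-1} \in \mathcal{F}$ be an arbitrary initial value function, let $V_0, \dots, V_{K-1}$ ($\tilde{V}_0, \dots, \tilde{V}_{K-1}$) be the sequence of value functions (truncated value functions) generated by LSPI after $K$ iterations, and let $\pi_k$ be the greedy policy w.r.t. the truncated value function $\tilde{V}_{k-1}$. Suppose also that $\| D_t^{\pi} V^{\pi} \|_{\mu} \le \epsilon$ for all $\pi$, and $\| D_t^{\pi_k} \tilde{V}_{k-1} \|_{\mu} \le \epsilon$ for all $k$. Then, with probability $1 - \delta$ (with respect to the random samples), we have

$$\| V_t^{\pi_t^*} - V_t^{\pi_K} \|_{\sigma} \le \frac{2}{(1-\gamma)^2} \Bigg\{ (1+\gamma) \sqrt{C C_{\sigma,\mu}} \Bigg[ \frac{2}{\sqrt{1-\gamma^2}} \Big( 2\sqrt{2}\, E_0(\mathcal{F}) + E_2 \Big) + \frac{2}{1-\gamma} \Bigg( \gamma V_{\max} L \sqrt{\frac{d}{\nu_{\mu}}} \Bigg( \sqrt{\frac{8 \log(8dK/\delta)}{N}} + \frac{1}{N} \Bigg) \Bigg) + E_1 \Bigg] + \gamma^{\frac{K-1}{2}} R_{\max} + 3 \sqrt{2 C_{\sigma,\mu}}\, \epsilon \Bigg\}.$$

Proof: For convenience, we will simply remove the task subscript whenever we refer to variables associated with $M$. Define

$$d_t^{\pi} = D_t^{\pi} V^{\pi}, \qquad \tilde{d}_{t,k} = D_t^{\pi_k} \tilde{V}_{k-1}, \qquad e_k = \tilde{V}_k - T^{\pi_k} \tilde{V}_k,$$

$$E_k = P^{\pi_{k+1}} (I - \gamma P^{\pi_{k+1}})^{-1} - P^{\pi^*} (I - \gamma P^{\pi_k})^{-1}, \qquad F_k = P^{\pi_{k+1}} (I - \gamma P^{\pi_{k+1}})^{-1} + P^{\pi^*} (I - \gamma P^{\pi_k})^{-1}.$$

From the proof of Lemma 12 in Antos et al. (2008), we get

$$V^{\pi^*} - V^{\pi_K} \le \gamma \sum_{k=0}^{K-1} (\gamma P^{\pi^*})^{K-k-1} E_k e_k + (\gamma P^{\pi^*})^{K} (V^{\pi^*} - V^{\pi_0}).$$

By applying the above inequality, and taking the absolute value on both sides point-wise, we get

$$\begin{aligned}
|V_t^{\pi_t^*} - V_t^{\pi_K}| &\le |V_t^{\pi_t^*} - V^{\pi^*}| + |V^{\pi^*} - V^{\pi_K}| + |V^{\pi_K} - V_t^{\pi_K}| \\
&\le \gamma \sum_{k=0}^{K-1} (\gamma P^{\pi^*})^{K-k-1} F_k |e_k| + (\gamma P^{\pi^*})^{K} |V^{\pi^*} - V^{\pi_0}| + |V_t^{\pi_t^*} - V^{\pi^*}| + |V^{\pi_K} - V_t^{\pi_K}| \\
&\le \gamma \sum_{k=0}^{K-1} (\gamma P^{\pi^*})^{K-k-1} F_k |e_k| + \frac{2 R_{\max}}{1-\gamma} \gamma^{K} + |V_t^{\pi_t^*} - V^{\pi^*}| + |V^{\pi_K} - V_t^{\pi_K}|,
\end{aligned}$$

where we used the fact that $|V^{\pi^*} - V^{\pi_0}| \le (2 R_{\max}/(1-\gamma)) \mathbf{1}$. Next, we derive upper bounds for $|V_t^{\pi_t^*} - V^{\pi^*}|$ and $|V^{\pi_K} - V_t^{\pi_K}|$.

(a) Observe that

$$\begin{aligned}
V_t^{\pi_t^*} - V^{\pi^*} &= T_t^{\pi_t^*} V_t^{\pi_t^*} - T^{\pi^*} V^{\pi^*} \\
&\le T_t^{\pi_t^*} V_t^{\pi_t^*} - T^{\pi_t^*} V^{\pi^*} \\
&= T_t^{\pi_t^*} V_t^{\pi_t^*} - T^{\pi_t^*} V_t^{\pi_t^*} + T^{\pi_t^*} (V_t^{\pi_t^*} - V^{\pi^*}) \\
&\le (I - \gamma P_t^{\pi_t^*})^{-1} d_t^{\pi_t^*}.
\end{aligned}$$

The first inequality follows from the fact that $\pi^*$ is optimal with respect to $V^{\pi^*}$. The second inequality follows from the Taylor expansion of the inverse term. By closely following the same steps, we also get

$$\begin{aligned}
V_t^{\pi_t^*} - V^{\pi^*} &= T_t^{\pi_t^*} V_t^{\pi_t^*} - T^{\pi^*} V^{\pi^*} \\
&\ge T_t^{\pi^*} V_t^{\pi_t^*} - T^{\pi^*} V^{\pi^*} \\
&= T_t^{\pi^*} V^{\pi^*} - T^{\pi^*} V^{\pi^*} + T_t^{\pi^*} (V_t^{\pi_t^*} - V^{\pi^*}) \\
&\ge (I - \gamma P_t^{\pi^*})^{-1} d_t^{\pi^*}.
\end{aligned}$$

By splitting into positive and negative components and applying the above bounds, we get

$$\begin{aligned}
|V_t^{\pi_t^*} - V^{\pi^*}| &= |(V_t^{\pi_t^*} - V^{\pi^*})^{+} - (V_t^{\pi_t^*} - V^{\pi^*})^{-}| \\
&\le |(V_t^{\pi_t^*} - V^{\pi^*})^{+}| + |(V_t^{\pi_t^*} - V^{\pi^*})^{-}| \\
&\le |(I - \gamma P_t^{\pi_t^*})^{-1} d_t^{\pi_t^*}| + |(I - \gamma P_t^{\pi^*})^{-1} d_t^{\pi^*}| \\
&\le (I - \gamma P_t^{\pi_t^*})^{-1} |d_t^{\pi_t^*}| + (I - \gamma P_t^{\pi^*})^{-1} |d_t^{\pi^*}|.
\end{aligned}$$

(b) Observe that

$$\begin{aligned}
V^{\pi_K} - V_t^{\pi_K} &\le T^{\pi_K} V^{\pi_K} + T^{\pi_K} \tilde{V}_{K-1} - T_t^{\pi_K} \tilde{V}_{K-1} - T_t^{\pi_K} V_t^{\pi_K} \\
&= T^{\pi_K} V^{\pi_K} + T^{\pi_K} \tilde{V}_{K-1} - T_t^{\pi_K} \tilde{V}_{K-1} - T_t^{\pi_K} V^{\pi_K} + T_t^{\pi_K} (V^{\pi_K} - V_t^{\pi_K}) \\
&\le (I - \gamma P_t^{\pi_K})^{-1} (-d_t^{\pi_K} - \tilde{d}_{t,K}).
\end{aligned}$$

The first inequality follows from the fact that $\pi_K$ is optimal with respect to $\tilde{V}_{K-1}$. The second inequality follows from the Taylor expansion of the inverse term. By closely following the same steps, we also get

$$\begin{aligned}
V^{\pi_K} - V_t^{\pi_K} &\ge T^{\pi_K} V^{\pi_K} - T^{\pi_K} \tilde{V}_{K-1} + T_t^{\pi_K} \tilde{V}_{K-1} - T_t^{\pi_K} V_t^{\pi_K} \\
&= T^{\pi_K} V^{\pi_K} - T^{\pi_K} \tilde{V}_{K-1} + T_t^{\pi_K} \tilde{V}_{K-1} - T_t^{\pi_K} V^{\pi_K} + T_t^{\pi_K} (V^{\pi_K} - V_t^{\pi_K}) \\
&\ge (I - \gamma P_t^{\pi_K})^{-1} (-d_t^{\pi_K} + \tilde{d}_{t,K}).
\end{aligned}$$

By splitting into positive and negative components and applying the above bounds, we get

$$\begin{aligned}
|V^{\pi_K} - V_t^{\pi_K}| &= |(V^{\pi_K} - V_t^{\pi_K})^{+} - (V^{\pi_K} - V_t^{\pi_K})^{-}| \\
&\le |(V^{\pi_K} - V_t^{\pi_K})^{+}| + |(V^{\pi_K} - V_t^{\pi_K})^{-}| \\
&\le |(I - \gamma P_t^{\pi_K})^{-1} (-d_t^{\pi_K} - \tilde{d}_{t,K})| + |(I - \gamma P_t^{\pi_K})^{-1} (-d_t^{\pi_K} + \tilde{d}_{t,K})| \\
&\le (I - \gamma P_t^{\pi_K})^{-1} |-d_t^{\pi_K} - \tilde{d}_{t,K}| + (I - \gamma P_t^{\pi_K})^{-1} |-d_t^{\pi_K} + \tilde{d}_{t,K}| \\
&\le 2 (I - \gamma P_t^{\pi_K})^{-1} \big( |d_t^{\pi_K}| + |\tilde{d}_{t,K}| \big).
\end{aligned}$$

By applying the upper bounds from (a) and (b), we get

$$|V_t^{\pi_t^*} - V_t^{\pi_K}| \le \frac{2(1-\gamma^{K+2})}{(1-\gamma)^2} \Bigg[ \sum_{k=0}^{K-1} \alpha_k A_k |e_k| + \alpha (R_{\max}/\gamma) + (\beta/6) B^{\pi_t^*} \cdot 6 |d_t^{\pi_t^*}| + (\beta/6) B^{\pi^*} \cdot 6 |d_t^{\pi^*}| + (\beta/3) B^{\pi_K} \cdot 6 |d_t^{\pi_K}| + (\beta/3) B^{\pi_K} \cdot 6 |\tilde{d}_{t,K}| \Bigg],$$

where we introduced the positive coefficients

$$\alpha_k = \frac{(1-\gamma)}{1-\gamma^{K+2}} \gamma^{K-k} \ \text{ for } 0 \le k < K, \qquad \alpha = \frac{(1-\gamma)}{1-\gamma^{K+2}} \gamma^{K+1}, \qquad \beta = \frac{(1-\gamma)}{2(1-\gamma^{K+2})},$$

and the operators

$$A_k = \frac{1-\gamma}{2} (P^{\pi^*})^{K-k-1} F_k \ \text{ for } 0 \le k < K, \qquad B^{\pi} = (1-\gamma)(I - \gamma P_t^{\pi})^{-1}.$$

Let $\lambda_K = \left[ \frac{2(1-\gamma^{K+2})}{(1-\gamma)^2} \right]^{p}$. Note that the coefficients $\alpha_k$, $\alpha$, and $\beta$ sum to 1, and the operators are positive linear operators that satisfy $A_k \mathbf{1} = \mathbf{1}$ and $B^{\pi} \mathbf{1} = \mathbf{1}$. Therefore, by taking the $p$th power on both sides, applying Jensen's inequality twice, and then integrating both sides with respect to $\sigma(x)$, we get

$$\begin{aligned}
\| V_t^{\pi_t^*} - V_t^{\pi_K} \|_{p,\sigma}^{p} &= \int \sigma(dx)\, |V_t^{\pi_t^*} - V_t^{\pi_K}|^{p} \\
&\le \lambda_K\, \sigma \Bigg[ \sum_{k=0}^{K-1} \alpha_k A_k |e_k|^{p} + \alpha (R_{\max}/\gamma)^{p} + (\beta/6) B^{\pi_t^*} (6 |d_t^{\pi_t^*}|)^{p} + (\beta/6) B^{\pi^*} (6 |d_t^{\pi^*}|)^{p} \\
&\qquad\quad + (\beta/3) B^{\pi_K} (6 |d_t^{\pi_K}|)^{p} + (\beta/3) B^{\pi_K} (6 |\tilde{d}_{t,K}|)^{p} \Bigg].
\end{aligned}$$

From the definition of the coefficients $c_{\sigma,\mu}(m)$, we get

$$\sigma A_k \le (1-\gamma) \sum_{m \ge 0} \gamma^{m} c_{\sigma,\mu}(m + K - k)\, \mu, \qquad \sigma B^{\pi} \le (1-\gamma) \sum_{m \ge 0} \gamma^{m} c_{\sigma,\mu}(m)\, \mu.$$

Therefore, it follows that

$$\begin{aligned}
\sigma \Bigg[ \sum_{k=0}^{K-1} \alpha_k A_k |e_k|^{p} \Bigg] &\le (1-\gamma) \sum_{k=0}^{K-1} \alpha_k \sum_{m \ge 0} \gamma^{m} c_{\sigma,\mu}(m + K - k)\, \mu |e_k|^{p} \\
&= \frac{\gamma (1-\gamma)^2}{1-\gamma^{K+2}} \sum_{k=0}^{K-1} \sum_{m \ge 0} \gamma^{m+K-k-1} c_{\sigma,\mu}(m + K - k)\, \| e_k \|_{p,\mu}^{p} \\
&\le \frac{\gamma}{1-\gamma^{K+2}} C_{\sigma,\mu}\, e^{p},
\end{aligned}$$

where $e = \max_{0 \le k < K} \| e_k \|_{p,\mu}$. The terms involving $B^{\pi}$ satisfy

$$\sigma \big[ B^{\pi} (6 |d_t^{\pi}|)^{p} \big] \le 6^{p} (1-\gamma) \sum_{m \ge 0} \gamma^{m} c_{\sigma,\mu}(m)\, \mu |d_t^{\pi}|^{p} \le 6^{p} C_{\sigma,\mu} \| d_t^{\pi} \|_{p,\mu}^{p}.$$

Putting all these together, and choosing $p = 2$, we get

$$\begin{aligned}
\| V_t^{\pi_t^*} - V_t^{\pi_K} \|_{\sigma} &\le \lambda_K^{\frac{1}{2}} \Bigg[ \frac{\gamma}{1-\gamma^{K+2}} C_{\sigma,\mu} e^{2} + \frac{(1-\gamma)\gamma^{K+1}}{1-\gamma^{K+2}} (R_{\max}/\gamma)^{2} + \frac{36(1-\gamma)}{2(1-\gamma^{K+2})} C_{\sigma,\mu} \epsilon^{2} \Bigg]^{\frac{1}{2}} \\
&\le \frac{2}{(1-\gamma)^2} \Bigg[ \gamma C_{\sigma,\mu} e^{2} + (1-\gamma)\gamma^{K+1} (R_{\max}/\gamma)^{2} + \frac{36(1-\gamma)}{2} C_{\sigma,\mu} \epsilon^{2} \Bigg]^{\frac{1}{2}} \\
&\le \frac{2}{(1-\gamma)^2} \Big[ C_{\sigma,\mu} e^{2} + \gamma^{K+1} (R_{\max}/\gamma)^{2} + 18 C_{\sigma,\mu} \epsilon^{2} \Big]^{\frac{1}{2}} \\
&\le \frac{2}{(1-\gamma)^2} \Big[ \sqrt{C_{\sigma,\mu}}\, e + \gamma^{\frac{K-1}{2}} R_{\max} + 3 \sqrt{2 C_{\sigma,\mu}}\, \epsilon \Big].
\end{aligned}$$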
The desired result can then be obtained by applying the same steps as in the proof of Theorem 8 in Lazaric et al. (2012).

For the financial portfolio optimization experiments, we construct tasks by randomly choosing a portfolio of $|U_t| = 10$ instruments for each task. We create a permutation invariant policy network by applying the same sequence of operations to every instrument state. That is, for each instrument, the flattened input prices are passed through a common RNN with 25 hidden units and tanh activation; this output is concatenated with the latest allocation fraction of the instrument and passed through a common dense layer to produce a score. Instrument scores are passed to a softmax function to produce allocations that sum to one. We set the smoothing parameter for the scores to γ = 0.2, the task prioritisation parameter to α = 0.5, and β = 1.0 to fully compensate for the prioritized sampling bias.

A.2 META FEDERATED LEARNING

Suppose we have a universe of federated learning clients $U$. The goal of task $t$ is to aggregate models in a federated learning experiment over a subset of clients $U_t \subseteq U$. At each step $n$, the action $a_{i,n}$ represents the weight assigned to the supervised learning model of client $i$ in the averaging procedure. Let $v_{i,n}$ denote the model of the client (i.e., the tensor of model parameters). We model the state of the client as some function of its $H$ most recent models, $x_{i,n} = f(v_{i,n-H+1}, \dots, v_{i,n})$. Assume that the aggregator has access to a small evaluation dataset that it can use to approximately assess the quality of models. We define the reward at each step to be the accuracy of the aggregate model,

$$R_t(x_n, a_n) = L\Big( \sum_{i \in U_t} a_{i,n} v_{i,n} \Big),$$

where $L(v)$ is a function that provides the accuracy of a model $v$ on the evaluation dataset. Therefore, by maximizing the total return over all time periods, we seek to improve both the accuracy at the final time step and the speed of convergence. We optimize the policy using Proximal Policy Optimization (PPO).
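The per-step reward defined above for the meta federated learning task can be illustrated with a minimal sketch. It assumes that client models are stored as rows of a NumPy array and that the policy's scores are turned into weights with a softmax; the names `softmax`, `aggregate`, `step_reward`, and the toy `toy_accuracy` function are hypothetical and stand in for the accuracy function L on the evaluation dataset.

```python
import numpy as np


def softmax(scores):
    """Map client scores to averaging weights that sum to one."""
    z = np.asarray(scores, dtype=float)
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def aggregate(models, weights):
    """Weighted average of client models: sum_i a_i * v_i.

    models has shape (num_clients, ...), weights has shape (num_clients,).
    """
    return np.tensordot(weights, models, axes=1)


def step_reward(models, weights, accuracy_fn):
    """Reward R_t(x_n, a_n) = L(sum_i a_i v_i) for one aggregation step."""
    return accuracy_fn(aggregate(models, weights))


# Toy stand-in for L (an assumption, not the paper's evaluation):
# models closer to a target vector receive a higher score in (0, 1].
target = np.array([1.0, 1.0])


def toy_accuracy(v):
    return 1.0 / (1.0 + np.linalg.norm(v - target))


client_models = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
uniform = softmax(np.zeros(3))  # uniform averaging, as in FedAvg
print(step_reward(client_models, uniform, toy_accuracy))
```

With uniform weights this reduces to plain averaging of the client models; a learned policy would instead produce non-uniform weights from the client states.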
We use the MNIST digit recognition problem. Each client observes 600 samples from the train dataset and trains a classifier composed of one 5×5 convolutional layer (with 32 channels and ReLU activation) and a softmax output layer. We use the same permutation invariant policy network architecture as before, with 10 hidden units in the RNN. We randomly select $|U_t| = 10$ clients for each task. We learn using an evaluation dataset consisting of 1000 random samples from the test dataset and test using all 10000 samples in the test dataset. We fix the number of federated learning iterations to 50.

We explore the benefit of MTL in identifying useful clients in scenarios with skewed data distribution. We partition the dataset such that 8 of the clients in each task observe random digits between 0 and 5 and the remaining 2 clients observe random digits between 6 and 9. Therefore, for each task, 20% of the clients possess 40% of the unique labels. The state of each client consists of the accuracies of its H most recent models on the evaluation dataset.

Figure 6 shows the potential benefits of multi-task learning when simulators are inaccurate. In particular, we obtain two aggregation policies, one trained using single-task learning (STL) and another trained using multi-task learning (MTL), both trained using the same number of steps, and we observe their behavior during testing. The plots show that multi-task learning is able to learn non-uniform averaging policies that improve the convergence and performance of federated learning runs. More importantly, it can perform better than single-task learning even with the same number of samples. This may be attributed to the wider variety of client configurations (and consequently experiences) in the multi-task approach.
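The permutation invariant policy network described in this appendix (a shared RNN per entity, concatenation with the latest allocation, a shared dense layer, then a softmax over scores) can be sketched in plain NumPy. This is an illustrative forward pass only, under assumptions not stated in the text: the class name `SharedScorePolicy`, the input layout `(entities, time steps, features)`, the simple Elman-style recurrence, and the initialization scale are all hypothetical.

```python
import numpy as np


class SharedScorePolicy:
    """Forward pass of a score-based policy with weights shared across entities."""

    def __init__(self, n_features, hidden=25, init_scale=0.01, seed=0):
        rng = np.random.default_rng(seed)
        # One set of weights, applied identically to every entity.
        self.W_in = rng.normal(0.0, init_scale, (n_features, hidden))
        self.W_h = rng.normal(0.0, init_scale, (hidden, hidden))
        self.b_h = np.zeros(hidden)
        self.w_out = rng.normal(0.0, init_scale, hidden + 1)
        self.b_out = 0.0

    def scores(self, inputs, prev_alloc):
        """inputs: (m, T, n_features); prev_alloc: (m,). Returns m scores."""
        m, steps, _ = inputs.shape
        h = np.zeros((m, self.W_h.shape[0]))
        for t in range(steps):
            # Shared tanh recurrence, run independently for each entity (row).
            h = np.tanh(inputs[:, t, :] @ self.W_in + h @ self.W_h + self.b_h)
        # Concatenate the final hidden state with the latest allocation.
        z = np.concatenate([h, prev_alloc[:, None]], axis=1)
        return z @ self.w_out + self.b_out

    def allocations(self, inputs, prev_alloc):
        """Softmax over entity scores, so allocations sum to one."""
        s = self.scores(inputs, prev_alloc)
        e = np.exp(s - s.max())
        return e / e.sum()


rng = np.random.default_rng(1)
prices = rng.normal(size=(10, 5, 3))   # 10 entities, 5 steps, 3 features
prev = np.full(10, 0.1)                # equal starting allocation
policy = SharedScorePolicy(n_features=3, hidden=25)
alloc = policy.allocations(prices, prev)

# Reordering the entities reorders the allocations in the same way.
perm = rng.permutation(10)
alloc_perm = policy.allocations(prices[perm], prev[perm])
print(np.allclose(alloc_perm, alloc[perm]))
```

Because every entity passes through the same weights and the softmax only depends on the set of scores, the same network can be applied to any subset or ordering of entities, which is what allows samples from different tasks to be pooled. With weights initialized near zero, the scores are nearly equal and the allocations are close to uniform.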



(Gain in Sample Efficiency from Permutation Invariance) Let $M = |U|$ and $m = |U_t|$. Given fixed $M$ and $m$, there are $T = \binom{M}{m} \ge \left( \frac{M}{m} \right)^{m}$ different tasks. Then, by Cor. 1, assuming all pairs of tasks are weakly correlated, the potential gain in sample efficiency is exponential in $m$.
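As a quick numerical illustration of the task count above, the following sketch evaluates the binomial coefficient and its lower bound for the sizes used in the portfolio experiments (a universe of 50 training instruments and 10 instruments per task). The function names are hypothetical.

```python
from math import comb


def task_count(universe_size, task_size):
    """Number of distinct tasks T = C(M, m)."""
    return comb(universe_size, task_size)


def task_count_lower_bound(universe_size, task_size):
    """Lower bound (M / m) ** m on the number of tasks."""
    return (universe_size / task_size) ** task_size


M, m = 50, 10
print(task_count(M, m))              # 10272278170
print(task_count_lower_bound(M, m))  # 9765625.0
```

Even this modest universe yields roughly ten billion possible tasks, far more than could ever be enumerated during training, which is why a task-sampling strategy is needed.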

Figure 1: Performance for ε = 0.8 (left), ε = 0 (middle), and at N = 2000 with varying ε (right).

Figure 2: Scatter plots of the maximum absolute deviation from Equal CRP vs. the change in rewards every 50 steps (left) and the max. norm of the gradient for the minibatch (right).

Figure 3: Mean performance gain over Equal CRP of the learned policies when tested on 10 tasks using out-of-sample instruments. Error bars denote the standard deviation over 10 experiments.

Figure 4: The two left plots show mean annualized return in the testing period over 10 experiments (different instruments) each with 5 and 30 tasks. X-axes are scaled to make the curves comparable: each epoch has 1500 (5-tasks) and 9000 steps (30-tasks) and an evaluation. Shaded regions denote the interquartile range. The rightmost figure shows, for each fraction of tasks, the gain over Single Task Learning (STL). A curve further to the right shows higher gain over STL. From 30% on the y-axis, the P-MTL gain is higher (more towards the right) than the MTL gain. As expected, when few tasks are used, prioritizing tasks doesn't help much (y-axis from 0 to 0.2).

Figure 6: These plots compare the behavior of a multi-task policy and a single-task policy during testing. FedAvg denotes the accuracy of federated learning with uniform averaging. The left plot shows the accuracy of the aggregate model during federated learning. The right two plots show the weights produced by the policy for different clients. Note that clients 8 and 9 possess 40% of the unique labels.

Algorithm 1 Prioritized Multi-Task Reinforcement Learning for Increasing Sample Efficiency
Initialize policy network π
For task t, select action a_n according to current policy and exploration noise
Execute action a_n, and observe reward r_n and new state x_{n+1}
Store transition (x_n, a_n, r_n, x_{n+1}) in R_t
end for
If n < N, update n_t ← n + 1, otherwise, update n_t ← 1
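The task-sampling step of a prioritized scheme of this kind can be sketched as follows. The exact formulas are not given in the text, so this sketch assumes the form used in prioritized experience replay: tasks are sampled with probability proportional to score^α (so α = 0 recovers uniform sampling), importance weights (1 / (T p_t))^β correct the sampling bias (fully at β = 1), and scores are smoothed with an exponential moving average with parameter γ. All function names and the smoothing rule are assumptions for illustration.

```python
import numpy as np


def sampling_probs(scores, alpha):
    """Probability of sampling each task, proportional to score ** alpha.

    Scores are assumed to be strictly positive.
    """
    p = np.power(np.asarray(scores, dtype=float), alpha)
    return p / p.sum()


def importance_weights(probs, beta):
    """Weights (1 / (T * p_t)) ** beta, normalized so the largest is 1."""
    num_tasks = len(probs)
    w = (1.0 / (num_tasks * probs)) ** beta
    return w / w.max()


def smooth_score(old_score, new_score, gamma):
    """Exponential moving average of a task score (assumed smoothing rule)."""
    return (1.0 - gamma) * old_score + gamma * new_score


rng = np.random.default_rng(0)
scores = np.array([1.0, 4.0, 9.0])
probs = sampling_probs(scores, alpha=0.5)
weights = importance_weights(probs, beta=1.0)
task = rng.choice(len(scores), p=probs)  # index of the task to train on next
print(task, probs, weights)
```

In this sketch, the loss of the sampled task would be multiplied by its importance weight before the gradient step, so that frequently sampled tasks do not dominate the update.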

A.1 FINANCIAL PORTFOLIO OPTIMIZATION: ADDITIONAL DETAILS

The dataset consists of daily prices for 68 instruments in the technology and communication sectors from 2009 to 2019. We use 2009-2018 for training and 2019 for testing. To validate that our approach learns common features across instruments, and thus can transfer, we reserve 18 instruments not seen during training for further testing. The global asset universe U used for training contains 50 instruments.

